F. A. Davis Company
1915 Arch Street
Philadelphia, PA 19103
www.fadavis.com
Copyright © 2020 by F. A. Davis Company. All rights reserved. This product is protected by copyright.
No part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from
the publisher.
As new scientific information becomes available through basic and clinical research, recommended
treatments and drug therapies undergo changes. The author(s) and publisher have done everything
possible to make this book accurate, up to date, and in accord with accepted standards at the time of
publication. The author(s), editors, and publisher are not responsible for errors or omissions or for
consequences from application of the book and make no warranty, expressed or implied, in regard to the
contents of the book. Any practice described in this book should be applied by the reader in accordance
with professional standards of care used in regard to the unique circumstances that may apply in each situ-
ation. The reader is advised always to check product information (package inserts) for changes and new
information regarding dosage and contraindications before administering any drug. Caution is especially
urged when using new or infrequently ordered drugs.
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific
clients, is granted by F. A. Davis Company for users registered with the Copyright Clearance Center
(CCC) Transactional Reporting Service, provided that the fee of $.25 per copy is paid directly to CCC,
222 Rosewood Drive, Danvers, MA 01923. For those organizations that have been granted a photocopy
license by CCC, a separate system of payment has been arranged. The fee code for users of the Transac-
tional Reporting Service is: 978-0-8036-6113-4/20 + $.25.
6113_FM_i-xxiv 05/12/19 1:31 pm Page iii
To Hazel
. . . and joyful running hugs
Preface
It is now more than 25 years since this text was first published in 1993. It has become a classic reference, cited extensively in literature from researchers around the globe. It is gratifying to see its sustained and widespread use by clinicians, faculty, students, and researchers in a variety of fields including rehabilitation, medicine, public health, nursing, public policy, epidemiology, exercise physiology, and other disciplines concerned with healthcare. This edition will continue to serve this wide audience.

As I have approached each revision, I have been struck by how much the framework for research continues to evolve over time, and this edition is no different. It incorporates several important contemporary perspectives.

First, regardless of the role we play in healthcare, be it clinician, administrator, policy maker, or researcher, we all need to be critical consumers of evidence to inform decisions. This requires at least a basic understanding of principles of measurement, design, and analysis to enable reasonable and critical evaluation.

Second, the concept of evidence-based practice (EBP), although ubiquitous in healthcare discussion, must be understood in terms of its full purpose for clinical decision-making, relying not only on literature but also on clinical expertise, patient values, and environmental conditions. These considerations have influenced translational research and the need to address meaningful outcomes.

Third, today’s healthcare enterprise mandates attention to interprofessional collaboration. I have included examples throughout the book to illustrate various concepts. These are drawn from many different fields, with the intent that all examples have wide application. I believe this is a distinct advantage of this book, to underscore the importance of appreciating research within and beyond one’s own field.

And finally, the content will work for those who want to plan or implement research. Detailed examples are included for those who want more depth, both in the text as well as in online supplements.

These directions and priorities continue to define research philosophies that guide the conduct of clinical inquiry, the types of questions that are deemed important, and how they influence advances in measurement and analysis techniques.

The popular features of the book have not changed, such as its accessible tone, use of extensive clinical examples to illustrate applications of design and statistics, and careful logic in explanation of complex concepts. It remains a comprehensive text, meant to fulfill different needs at different times, serving as both an introduction and a reference as readers grow in their knowledge and experience. However, this fourth edition does represent a significant update in format, focus, and organization.

■ New and Updated

Perhaps most notable is the new design, intended to make the text more reader-friendly and engaging with full color. Tables and figures have a new look to facilitate interpretation. Starting with the book’s title, the emphasis on EBP is clear, including the Focus on Evidence and Case in Point features in each chapter that illustrate the direct application of concepts in the literature. Additional information is included in boxes to clarify content, emphasize key points, offer interesting background and historical context—and a few fun facts along the way. Commentaries build on important concepts. Online resources have also been expanded, with chapter supplements, access to data files, and materials for instructors and students.

Five new chapters have been included in this edition, focusing on translational research (Chapter 2), evidence-based practice (Chapter 5), health measurement scales (Chapter 12), clinical trials (Chapter 14), and qualitative research (Chapter 21). All of these address content that was covered in previous editions but with greater depth and application.

Many chapters also include new or updated content to address current clinical research issues. For example,
• Examples provided throughout the book have been updated with more recent literature.
• The stronger emphasis on EBP includes use of the PICO strategy to answer clinical questions as well as to generate research questions. The EBP emphasis reflects a comprehensive model that includes research evidence, clinical expertise, patient values, and clinical circumstances.
meta-analyses, and scoping reviews. Finally, Chapter 38 presents a format for preparing and disseminating research through publications and presentations.

Appendix A includes reference tables for major statistical tests, and additional tables are provided in Appendix A Online. Appendix B includes an algorithm for choosing a statistical procedure based on design and measurement considerations. Appendix C covers content related to management of quantitative data, including data transformation. The glossary, also available online, provides a handy reference to define terms, symbols, and common abbreviations.

No matter how many years I have engaged in research, I continue to learn about advances in design and analysis that set new directions and priorities for clinical inquiry. This edition is intended to strengthen our collective commitment to new knowledge, discovery and innovation, and implementation of evidence, as we strive to provide the most effective care. The process remains a challenge to all of us engaged in healthcare, and this text will always be a work in progress. I look forward to continuing to hear from colleagues across disciplines regarding how to make this volume most useful.
Contributors
Jessica Bell, MS
Director of Library and Instructional Design
MGH Institute of Health Professions
Boston, Massachusetts

Marianne Beninato, DPT, PhD
Professor Emerita
MGH Institute of Health Professions
Boston, Massachusetts

Peter S. Cahn, PhD
Associate Provost for Academic Affairs
MGH Institute of Health Professions
Boston, Massachusetts

K. Douglas Gross, MPT, DPT, ScD, FAAOMPT, CPed
Associate Professor, Department of Physical Therapy
MGH Institute of Health Professions
School of Health and Rehabilitation Sciences
Boston, Massachusetts

Catherine Lysack, PhD, OT(C)
Associate Professor, Occupational Therapy and Gerontology
Eugene Applebaum College of Pharmacy and Health Sciences
Wayne State University
Detroit, Michigan
Reviewers
Jason Browning, OTR/L
Assistant Professor
Occupational Therapy
Jefferson College of Health Sciences
Roanoke, Virginia

Stephen Chao, PT, DPT, CSCS
Clinical Assistant Professor
Physical Therapy Department
School of Health Technology and Management
State University of New York at Stony Brook
Southampton, New York

Jennifer B. Christy, PT, PhD
Associate Professor
Physical Therapy
The University of Alabama at Birmingham
Birmingham, Alabama

Peter C. Douris, PT, EdD, DPT, OCS
Professor
Physical Therapy
New York Institute of Technology
Old Westbury, New York

Winnie Dunn, PhD, OTR, FAOTA
Distinguished Professor
Occupational Therapy
University of Missouri
Columbia, Missouri

Ashraf Elazzazi, PT, PhD
Associate Professor
Chair, Department of Physical Therapy
Utica College
Utica, New York

Simon French, BAppSc, MPH, PhD
Associate Professor
School of Rehabilitation Therapy
Queen’s University
Kingston, Ontario, Canada

Joanne Gallagher Worthley, EdD, OTR/L, CAPS
Professor
Occupational Therapy
Worcester State University
Worcester, Massachusetts

Sean F. Griech, PT, DPT, OCS, COMT
Assistant Professor
Doctor of Physical Therapy Program
DeSales University
Center Valley, Pennsylvania

Emmanuel B. John, PT, DPT, PhD, MBA
Associate Professor & Chair
Department of Physical Therapy
Chapman University
Irvine, California

Steven G. Lesh, PhD, PT, SCS, ATC
Chair and Professor of Physical Therapy
Southwest Baptist University
Bolivar, Missouri

Jean MacLachlan, PhD, OTR/L
Associate Professor
Occupational Therapy
Salem State University
Salem, Massachusetts

Dennis McCarthy, PhD, OTR/L, FAOTA
Associate Professor
Occupational Therapy
Nova Southeastern University
Tampa, Florida

Raymond F. McKenna, PT, PhD, CSCS
Clinical Associate Professor
Physical Therapy
Stony Brook University
Stony Brook, New York

Tamara Mills, PhD, OTR/L, ATP
Assistant Professor
School of Occupational Therapy
Brenau University
Norcross, Georgia

Mary P. Shotwell, PhD, OT/L, FAOTA
Professor
School of Occupational Therapy
Brenau University
Gainesville, Georgia
Acknowledgments
Taking on this edition as a solo effort has been an interesting challenge. I am so appreciative for the support of many friends and colleagues who have helped me through it. I am indebted to those who contributed to writing chapters—Jessica Bell, Marianne Beninato, Peter Cahn, Heather Fritz, Doug Gross, Cathy Lysack, and David Scalzitti. I am thankful to John Wong for many hours of invaluable guidance on statistics. I am also grateful to those who took the time to review sections of the book and provide valuable feedback—Marjorie Nicholas, Winnie Dunn, Lisa Connor and Saurabh Mehta among others. I also want to acknowledge the long-time support of colleagues at the MGH Institute of Health Professions, especially Alex Johnson, Mary Ellen Ferolito, Denis Stratford and the entire IT team, as well as the graduate assistants Jasmin Torres and Baothy Huynh, who will both be awesome Doctors of Occupational Therapy. I reserve a special note of deep appreciation for David Scalzitti, who has reviewed every page of this book and who freely offered his advice along the way, in every instance providing important comments to make things more accurate, complete, and useful. Many of his “pearls” are included in the text. Additionally, I’d like to thank David for creating the online Test Bank questions and PowerPoint slides. Always a wonderful colleague, he has become a good friend.

The F. A. Davis team was remarkably patient with me over many years, providing incredible support. Special thanks to Jennifer Pine, Senior Sponsoring Editor, who held my hand and tolerated my missing every deadline. Megan Suermann, Content Project Manager, lived through every word, figure, and table, and all their changes. Thank you to everyone who helped with every aspect of the project—to Cassie Carey, Senior Production Editor at Graphic World Publishing Services; Kate Margeson, Illustration Coordinator; Nichole Liccio, Melissa Duffield, and especially to Margaret Biblis, Editor-in-Chief, who said “yes” so quickly.

I’d also like to acknowledge my dear friend and writing partner Mary Watkins, who retired several years ago and is happily ensconced in New Hampshire with her family. Our collaboration was a special part of our lives for more than 20 years, and I sorely missed her this time around.

Over the past 50 years of my professional life, of which 45 were spent in academia, I have had the good fortune to work with thousands of students, faculty colleagues, and teachers who have inspired and challenged me. Too numerous to mention by name, I thank them all for their friendship, mentorship, and dedication. I am also grateful to many people, some I know and many I have never met, from all over the world, who over the years have taken the time to stop me at conferences, or to call or send emails, telling me how much they appreciated the book—and pointing out the mistakes and where information needed updating. Those comments have been immensely rewarding and incredibly helpful. I hope these colleagues are pleased with the revisions they have inspired.

And of course, there is no way a project like this gets done without support and sacrifice from those closest to me. My parents, who were so proud of this work, have both passed away since the last edition was published, but they continue to inspire me every day. My husband Skip has given up so much of our together time, particularly once we were both supposed to be retired. He gave me the space and time I needed, and now I have a lot of dishes to wash to make up my share! My children, Devon and Jay and Lindsay and Dan, are always there to prop me up with love and understanding, and Lindsay was even able to help me through some of the more complex statistics—very cool. And finally, little Hazel, the ultimate joy that only a grandparent can understand. Her smile and hugs have lifted me up through many tough days. She confirms for me every day that the future is bright. What they say is so true—if I’d known grandchildren were so much fun, I would have had them first! With no pun intended, the next chapter awaits!
Dr. Leslie G. Portney is Professor and Dean Emerita, MGH Institute of Health Professions, having served as the inaugural Dean of the School of Health and Rehabilitation Sciences and Chair of the Department of Physical Therapy. She holds a PhD in Gerontology from the University Professors Program at Boston University, a Master of Science from Virginia Commonwealth University, a DPT from the MGH Institute of Health Professions, a certificate in Physical Therapy from the University of Pennsylvania, and a Bachelor of Arts from Queens College. Leslie has been an active researcher and educator for more than 50 years, with decades of teaching critical inquiry to graduate students from a variety of disciplines, which contributed to her being a staunch advocate of interprofessional education. She remains an active member of the American Physical Therapy Association, having served as chair of the Commission on Accreditation in Physical Therapy Education and inaugural president of the American Council of Academic Physical Therapy. She is the recipient of several awards, including receiving the 2014 Lifetime Achievement Award from the Department of Physical Therapy at Virginia Commonwealth University, being named the 2014 Pauline Cerasoli Lecturer for the Academy of Physical Therapy Education of the APTA, and being elected as a Catherine Worthingham Fellow of the American Physical Therapy Association in 2002. Leslie retired in 2018 but continues to contribute to professional projects and to consult with educational programs. She lives on Cape Cod with her husband, near to her children and granddaughter.
tables are also available in relevant chapter supplements as well as in Appendix A Online.
• The Data Download icon will indicate that the original data and full output for an example are available online in SPSS and CSV files. Full output will also be available as SPSS files or as PDF files for those who do not have access to SPSS. Data and output files will be named according to the number of the table that contains the relevant material. Data files will be available online at www.fadavis.com.
• Hypothetical data are not intended to reflect any true physiological or theoretical relationships. They have been manipulated to illustrate certain statistical concepts—so please don’t try to subject the results to a clinical rationale!
• Examples of how results can be presented in a research report are provided for each statistical procedure.
• This text is not a manual for using SPSS. Appendix C will introduce the basics of data management with SPSS, but please consult experts if you are unfamiliar with running statistical programs. Several texts are devoted to detailed instructions for applying SPSS.1,2

■ Supplements

Each chapter has supplemental material available online.
• Chapter Overviews include objectives, key terms, and a chapter summary.
• Supplemental Materials are included for many chapters, with references highlighted in the text. Chapter material can be understood without referring to supplements, but the information in them may be important for teaching or for understanding the application of certain procedures. Where appropriate, supplements also provide links to relevant references or other resources related to the chapter’s content.
• Review Questions are included for each chapter as a self-assessment to reinforce content for readers. Answers to these questions are also provided.

■ For Instructors

The order of chapters is based on the research process model introduced in Chapter 1. However, instructors may approach this order differently. The chapters will work in any order that fits your needs. Material is cross-referenced throughout the book.
• It is not necessary to use all the chapters or all of the material in a given chapter, depending on the depth of interest in a particular topic. This book has always served as a teaching resource as well as a comprehensive reference for clinicians and researchers who run across unfamiliar content as they read through the literature. It will continue to be a useful reference for students following graduation.
• Study Suggestions are given for each chapter, providing activities and exercises that can be incorporated into teaching, assignments, or exams. Many include discussion topics that can be used for individual or small group work. Where relevant, answers are provided, or ideas are given for focusing discussions.
• A Test Item Bank includes multiple choice items for each chapter.
• PowerPoint slides outline chapter content, and can be incorporated into teaching or other presentations.
• An Image Bank offers access to all images within the text.

■ For Students

This book is intended to cover the concepts that will allow you to become a critical consumer of literature and to apply that knowledge within the context of evidence-based practice. It will also serve as a guide for planning and implementing research projects. Use the online resources to complement your readings.

It is unlikely that you will use the entire textbook within your courses, but consider its usefulness as you move into your future role as an evidence-based practitioner. You may also find you are interested in doing some research once you have identified clinical questions. Don’t be overwhelmed with the comprehensive content—the book will be a resource for you going forward, so keep it handy on the bookshelf!

■ Terminology

This text has a distinct and purposeful focus on interprofessional practice. However, many professions have particular jargon that may differ from terms used here. I have tried to define terms and abbreviations so that these words can be applied across disciplines.
• When statistical procedures and designs can be described using different terms, I have included these alternatives. Unfortunately, there are many such instances.
• The International Classification of Functioning, Disability and Health (ICF) is introduced in Chapter 1, defining the terms impairment, activity, and participation. These are used to reflect different types of outcome measures throughout the text.
• The pronouns “he” or “she” are used casually throughout the book to avoid using passive voice. There is no underlying intent with the choice of pronoun unless it is applied to a particular study.
• I have chosen to use the terms “subject,” “participant,” “patient,” “person,” “individual,” or “case” interchangeably for two reasons. First, I recognize that those we study may represent different roles in different types of research studies, and second, it is just an effort to vary language. Although the term “participant” has gained increasing favor, there is no consensus on which terms are preferable, and these terms continue to be used in federal regulations and research reports.

REFERENCES
1. Field A. Discovering Statistics Using IBM SPSS Statistics. 5th ed. Thousand Oaks, CA: Sage; 2018.
2. Green SB, Salkind NJ. Using SPSS for Windows and Macintosh: Analyzing and Understanding Data. 8th ed. New York: Pearson; 2017.
Contents
PART 1 Foundations of Research and Evidence

1 Frameworks for Generating and Applying Evidence, 2
  The Research Imperative, 2
  The Research Process, 4
  Frameworks for Clinical Research, 5
  Types of Research, 11
2 On the Road to Translational Research, 17
  The Translation Gap, 17
  The Translation Continuum, 18
  Effectiveness Research, 21
  Outcomes Research, 23
  Implementation Studies, 25
3 Defining the Research Question, 29
  Selecting a Topic, 29
  The Research Problem, 29
  The Research Rationale, 32
  Types of Research, 34
  Framing the Research Question, 34
  Independent and Dependent Variables, 34
  Research Objectives and Hypotheses, 38
  What Makes a Good Research Question? 39
4 The Role of Theory in Research and Practice, 42
  Defining Theory, 42
  Purposes of Theories, 42
  Components of Theories, 43
  Models, 44
  Theory Development and Testing, 45
  Characteristics of Theories, 47
  Theory, Research, and Practice, 48
  Scope of Theories, 49
5 Understanding Evidence-Based Practice, 53
  Why Is Evidence-Based Practice Important? 53
  How Do We Know Things? 54
  What Is Evidence-Based Practice? 56
  The Process of Evidence-Based Practice, 57
  Levels of Evidence, 62
  Implementing Evidence-Based Practice, 65
6 Searching the Literature, 70
  with Jessica Bell
  Where We Find Evidence, 70
  Databases, 71
  The Search Process, 74
  Keyword Searching, 74
  Getting Results, 76
  Medical Subject Headings, 77
  Refining the Search, 80
  Expanding Search Strategies, 83
  Choosing What to Read, 83
  Accessing Full Text, 84
  Staying Current and Organized, 85
  When Is the Search Done? 85
7 Ethical Issues in Clinical Research, 88
  The Protection of Human Rights, 88
  The Institutional Review Board, 91
  Informed Consent, 92
  Research Integrity, 98

PART 2 Concepts of Measurement

8 Principles of Measurement, 106
  Why We Take Measurements, 106
  Quantification and Measurement, 106
  The Indirect Nature of Measurement, 107
  Rules of Measurement, 108
  Levels of Measurement, 109
9 Concepts of Measurement Reliability, 115
  Concepts of Reliability, 115
  Measuring Reliability, 117
  Understanding Reliability, 118
  Types of Reliability, 119
  Reliability and Change, 122
  Methodological Studies: Reliability, 124
Online Supplements
1 Frameworks for Generating and Applying Evidence
  Evolving Models of Healthcare
2 On the Road to Translational Research
  Ernest Codman: The “End Result Idea”
3 Defining the Research Question
  Links to Study Examples
4 The Role of Theory in Research and Practice
  No supplement
5 Understanding Evidence-Based Practice
  Barriers and Facilitators of Evidence-Based Practice
6 Searching the Literature
  Databases and Search Engines for Health Sciences
7 Ethical Issues in Clinical Research
  2019 Changes to the Common Rule
8 Principles of Measurement
  No supplement
12 Understanding Health Measurement Scales
  1. Likert Scales: The Multidimensional Health Locus of Control Scale
  2. Further Understanding of Rasch Analysis
13 Choosing a Sample
  1. Generating a Random Sample
  2. Weightings for Disproportional Sampling
14 Principles of Clinical Trials
  No supplement
15 Design Validity
  No supplement
16 Experimental Designs
  Sequential Clinical Trials
17 Quasi-Experimental Designs
  No supplement
18 Single-Subject Designs
  1. Variations of Single-Subject Designs
  2. Drawing the Split Middle Line
  3. Table of Probabilities Associated With the Binomial Test
  4. Statistical Process Control
  5. Calculating the C Statistics
31 Multivariate Statistics
  1. Principal Components Analysis and Exploratory Factor Analysis
  2. Direct, Indirect, and Total Effects in Structural Equation Modeling
6113_Ch01_001-016 02/12/19 4:56 pm Page 1
PART 1
Foundations of Research and Evidence
CHAPTER 1 Frameworks for Generating and Applying Evidence 2
CHAPTER 2 On the Road to Translational Research 17
CHAPTER 3 Defining the Research Question 29
CHAPTER 4 The Role of Theory in Research and Practice 42
CHAPTER 5 Understanding Evidence-Based Practice 53
CHAPTER 6 Searching the Literature 70
CHAPTER 7 Ethical Issues in Clinical Research 88
[Figure: The Research Process, a cycle within evidence-based practice: 1. Identify the research question; 2. Design the study; 3. Implement the study; 4. Analyze the data; 5. Disseminate findings]
CHAPTER 1
Frameworks for Generating and Applying Evidence
The ultimate goal of clinical research is to maximize the effectiveness of practice. To that end, health professionals have recognized the necessity for generating and applying evidence to clinical practice through rigorous and objective analysis and inquiry.

The purpose of this text is to provide a frame of reference that will bring together the comprehensive skills needed to promote critical inquiry as part of the clinical decision-making process for the varied and interdependent members of the healthcare team.

This chapter develops a concept of research that can be applied to clinical practice, as a method of generating new knowledge and providing evidence to inform healthcare decisions. We will also explore historic and contemporary healthcare frameworks and how they influence the different types of research that can be applied to translate research to practice.

■ The Research Imperative

The concept of research in health professions has evolved along with the development of techniques of practice and changes in the healthcare system. All stakeholders (including care providers and patients, policy analysts, administrators, and researchers) have an investment in knowing that intervention and healthcare services are effective, efficient, and safe.

Clinical research is essential to inform clinical judgments, as well as the organization and economics of practice. We must all exercise a commitment to scientific discovery that will lead to improvement in standards of care and patient outcomes. The task of addressing this need is one that falls on the shoulders of all those engaged in healthcare, whether we function as scientific investigators who collect meaningful data and analyze outcomes, or as consumers of professional literature who critically apply research findings to promote optimal care (see Box 1-1).

The importance of engaging in collaborative and interprofessional efforts cannot be overemphasized, as researchers and clinicians share the responsibility to explore complex theories and new approaches, as well as to contribute to balanced scientific thought and discovery.

Defining Clinical Research

Clinical research is a structured process of investigating facts and theories and of exploring connections, with the purpose of improving individual and public health. It proceeds in a systematic way to examine clinical or social conditions and outcomes, and to generate evidence for decision-making.

Although there is no one universal description of clinical research, the National Institutes of Health (NIH) has proposed the following three-part definition1:

1. Patient-oriented research: Studies conducted with human subjects to improve our understanding of the mechanisms of diseases and disorders, and of which therapeutic interventions will be most effective in treating them.
2. Epidemiologic and behavioral studies: Observational studies focused on describing patterns of disease and disability, as well as on identifying preventive and risk factors.
3. Outcomes research and health services research: Studies to determine the impact of research on population health and utilization of evidence-based therapeutic interventions.

All of these approaches are essential to establishing an evidence base for clinical practice and the provision of quality health services.

Box 1-1 Addressing the Triple Aim
Issues related to evidence-based practice (EBP) and population health are especially important in light of the economic, quality, and access challenges that continue to confront healthcare. In 2008, the Institute for Healthcare Improvement (IHI) proposed the Triple Aim framework to highlight the need to improve the patient experience and quality of care, to advance the health of populations, and to reduce the per capita cost of healthcare.2 This goal will require extensive research efforts to collect data over time, understand population characteristics and barriers to access, and identify the dimensions of health and outcomes that are meaningful to providers and patients.
[Figure: The IHI Triple Aim, linking population health, evidence of care, and per capita cost]
Several organizations have proposed a Quadruple Aim, adding the clinician experience (avoiding burnout),3 as well as other priorities, such as health equity.4

Qualitative and Quantitative Methods

The objective nature of research is a dynamic and creative activity, performed in many different settings, using a variety of measurement tools, and focusing on the application of clinical theory and interventions. In categorizing clinical inquiry, researchers often describe studies by distinguishing between qualitative and quantitative methods. Quantitative research is sometimes viewed as more accurate than qualitative study because it lends itself to objective statistical analysis, but this is an unfortunate and mistaken assumption because approaches to inquiry must best match the context of the question being asked. Each approach has inherent value and each can inform the other.

Qualitative Approach

Qualitative research strives to capture naturally occurring phenomena, following a tradition of social constructivism. This philosophy is focused on the belief that all reality is fundamentally social, and therefore the only way to understand it is through an individual’s experience. Researchers often immerse themselves within the participants’ social environment to better understand how phenomena are manifested under natural circumstances.

In qualitative methodology, “measurement” is based on subjective, narrative information, which can be obtained using focus groups, interviews, or observation. Analysis of narrative data is based on “thick description” to identify themes. The purpose of the research may be to simply describe the state of conditions, or it may be to explore associations, formulate theory, or generate hypotheses.

Quantitative Approach

In contrast, quantitative research is based on a philosophy of logical positivism, in which human experience is assumed to be based on logical and controlled relationships among defined variables. It involves measurement of outcomes using numerical data under standardized conditions.

The advantage of the quantitative approach is the ability to summarize scales and to subject data to statistical analysis. Quantitative information may be obtained using formal instruments that address physical, behavioral, or physiological parameters, or by putting subjective information into an objective numerical scale.

The Scientific Method

Within the quantitative model, researchers attempt to reduce bias by using principles of the scientific method, which is based on the positivist philosophy that scientific truths exist and can be studied.5 This approach is founded on two assumptions related to the nature of reality.

First, we assume that nature is orderly and regular and that events are, to some extent, consistent and predictable. Second, we assume that events or conditions have cause-and-effect relationships so that we can develop rational solutions to clinical problems.

The scientific method has been defined as:
s
ing
2.
De
ind
A systematic, empirical, and controlled critical
sig
te f
examination of hypothetical propositions about the
n the
5. Dissemina
associations among natural phenomena.6
study
The
The systematic nature of research implies a logical Research
sequence that leads from identification of a problem, Process
through the organized collection and objective analysis
of data, to the interpretation of findings.
dy
The empirical component of scientific research refers
stu
4.
to the necessity for documenting objective data through na
e
A
h
ly z n tt
direct observation, thereby minimizing bias. ed me
ata ple
The element of control is perhaps the most important 3. I m
characteristic that allows the researcher to understand
how one phenomenon relates to another, controlling
factors that are not directly related to the variables in Figure 1–1 A model of the research process.
question.
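The circular model in Figure 1-1 can be sketched as a minimal loop. The five step names come from the figure; the function and its cycling logic are illustrative only, not part of the text:

```python
# Illustrative sketch of the circular research process in Figure 1-1.
# Step names follow the figure; the cycling function is a hypothetical aid.

STEPS = [
    "Identify the research question",
    "Design the study",
    "Implement the study",
    "Analyze the data",
    "Disseminate findings",
]

def next_step(current: str) -> str:
    """Return the step that follows `current`, wrapping around so that
    dissemination leads back to new research questions."""
    i = STEPS.index(current)
    return STEPS[(i + 1) % len(STEPS)]
```

Wrapping from step 5 back to step 1 mirrors the chapter's point that no study is a dead end: dissemination generates new questions.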
A commitment to critical examination means that the researcher must subject findings to empirical testing and to the scrutiny of others. Scientific investigation is thereby characterized by a capacity for self-correction based on objective validation of data from primary sources of information.

Although the scientific method is considered the most rigorous form of acquiring knowledge, the complexity and variability within nature and the unique psychosocial and physiological capacities of individuals will always introduce some uncertainty into the interpretation and generalization of data. This means that researchers and clinicians must be acutely aware of extraneous influences to interpret findings in a meaningful way. Qualitative research can serve an important function in helping to understand these types of variables that can later be studied using quantitative methods.

■ The Research Process

The process of clinical research involves sequential steps that guide thinking, planning, and analysis. Whether one is collecting quantitative or qualitative data, the research process assures that there is a reasonable and logical framework for a study's design and conclusions. This is illustrated in Figure 1-1, recognizing that the order of steps may vary or overlap in different research models. Each of these steps is further described in succeeding chapters.

Step 1: Identify the Research Question

The first step of the research process involves delimiting the area of research and formulating a specific question that provides an opportunity for study (see Chapter 3). This requires a thorough review of scientific literature to provide a rationale for the study (see Chapters 6 and 37), justification of the need to investigate the problem, and a theoretical framework for interpreting results (see Chapter 4).

During this stage, the researcher must define the type of individual to whom the results will be generalized and the specific variables that will be studied. Research hypotheses are proposed to predict how these variables will be related and what clinically relevant outcomes can be expected. In descriptive or qualitative studies, guiding questions may be proposed that form the framework for the study.

Step 2: Design the Study

In step 2, the researcher designs the study and plans methods of implementation (see Chapters 14–21). The choice of research method reflects how the researcher conceptualizes the research question.

The first consideration is who will be studied and how subjects will be chosen (see Chapter 13). The researcher must carefully define all measurements and interventions so that outcomes will be reliable and valid (see Chapters 8–12), and the methods for data analysis are clear (see Chapters 22–34).

The completion of steps 1 and 2 leads to the development of a research proposal (see Chapter 35) that
explains the research aims, sampling methods, and ethical considerations, including potential benefits and risks to participants. The proposal defines methods to show that the study can be carried out appropriately and that it should be able to deliver outcomes of some impact. It is submitted to an Institutional Review Board (IRB) for review and approval, to assure that ethical concerns are addressed (see Chapter 7).

Step 3: Implement the Study

In the third step, the researcher implements the plans designed in steps 1 and 2. Data collection is typically the most time-consuming part of the research process. Researchers may conduct pilot studies before beginning the full study to confirm that measurement methods and procedures work as expected, typically on a small sample.

Step 4: Analyze the Data

After data are collected, the researcher must reduce and collate the information into a useful form for analysis, often using tables or spreadsheets to compile "raw data." The fourth step of the research process involves analyzing, interpreting, and drawing valid conclusions about the obtained data.

Statistical procedures are applied to summarize and explore quantitative information in a meaningful way to address research hypotheses (see Chapters 22–34). In qualitative studies, the researcher organizes and categorizes data to identify themes and patterns that are guided by the research question and its underlying theoretical perspective (see Chapter 21).

Step 5: Disseminate Findings

In the fifth step of the research process, researchers have a responsibility to share their findings with the appropriate audience so that others can apply the information either to clinical practice or to further research. This step is the pulling together of all the materials relevant to the study, to apply them to a generalized or theoretical framework. Through the analysis of results, the researcher will interpret the impact on practice and where further study is needed.

Research reports can take many forms including journal articles, abstracts, oral presentations, poster presentations, and conference proceedings (see Chapter 38). Students may be required to report their work in the lengthier form of a thesis or dissertation.

Closing the Loop

Note that the research process is circular, as no study is a dead end. Results of one study invariably lead to new questions. Researchers contribute to the advancement of their own work by offering suggestions for further study, sometimes to address limitations in a study's design. Replication may be needed to confirm results with different samples in different settings.

The loop also includes application of evidence based on dissemination of research findings (see Chapter 5). This process will often lead to new questions, as we continue to deal with the uncertainties of practice.

■ Frameworks for Clinical Research

The context of clinical research is often seen within prevailing paradigms. A paradigm is a set of assumptions, concepts, or values that constitute a framework for viewing reality within an intellectual community.

Scientific paradigms have been described as ways of looking at the world that define what kinds of questions are important, which predictions and theories define a discipline, how the results of scientific studies should be interpreted, and the range of legitimate evidence that contributes to solutions. Quantitative and qualitative research approaches are based on different paradigms, for example, as they are founded on distinct assumptions and philosophies of science.

American physicist Thomas Kuhn7 also defined paradigm shifts as fundamental transitions in the way disciplines think about priorities and relationships, stimulating change in perspectives, and fostering preferences for varied approaches to research.

We can appreciate changes in research standards and priorities in terms of four paradigm shifts that have emerged in healthcare, rehabilitation, and medicine in the United States as we have moved into the 21st century: evidence-based practice, a focus on translational research, the conceptualization of health and disability, and the importance of interprofessional collaboration.

Evidence-Based Practice

The concept of evidence-based practice (EBP) represents the fundamental principle that the provision of quality care will depend on our ability to make choices that are based on the best evidence currently available. As shown in Figure 1-1, the research process incorporates EBP as an important application of disseminated research.

When we look at the foundations of clinical practice, however, we are faced with the reality that often compels practitioners to make intelligent, logical, best-guess decisions when scientific evidence is either incomplete or unavailable. The Institute of Medicine (IOM) and other agencies have set a goal that, by 2020, "90% of clinical decisions will be supported by accurate, timely, and up-to-date clinical information, and will reflect the best
available evidence."8 As this date approaches, however, this goal continues to be elusive in light of barriers in education and healthcare resources.

Everyone engaged in healthcare must be able to utilize research as part of their practice. This means having a reasonable familiarity with research principles and being able to read literature critically. The concept of EBP is more than just reliance on published research, however, and must be put in perspective. Informed clinicians will make decisions based on available literature, as well as their own judgment and experience—all in the context of a clinical setting and patient encounter. Perhaps more aptly called evidence-based decision-making, this process requires considering all relevant information and then making choices that provide the best opportunity for a successful outcome given the patient-care environment and available resources.

The process of EBP will be covered in more detail in Chapter 5, and these concepts will be included throughout this text as we discuss research designs, analysis strategies, and the use of published research for clinical decision-making. Look for Focus on Evidence examples in each chapter.

Translational Research

The realization of EBP requires two major elements. The first is the availability of published research that has applicability to clinical care. Researchers must produce relevant data that can be used to support well-informed decisions about alternative strategies for treatment and diagnosis.9 The second is the actual adoption of procedures that have been shown to be effective, so that the highest possible quality of care is provided. This is an implementation dilemma to be sure that research findings are in fact translated into practice and policy. Unfortunately, both of these elements can present a major challenge. The NIH has defined translational research as the application of basic scientific findings to clinically relevant issues and, simultaneously, the generation of scientific questions based on clinical dilemmas.10

Efficacy and Effectiveness

Researchers will often distinguish between efficacy and effectiveness in clinical studies. Efficacy is the benefit of an intervention as compared to a control, placebo, or standard program, tested in a carefully controlled environment, with the intent of establishing cause-and-effect relationships. Effectiveness refers to the benefits and use of procedures under "real-world" conditions in which circumstances cannot be controlled within an experimental setting.

Translation spans this continuum, often described as taking knowledge from "bench to bedside," or more practically from "bedside to bench and back to bedside."11 The goal of translational research is to assure that basic discoveries are realized in practice, ultimately improving the health of populations.

The full scope of translational research will be covered in Chapter 2.

Models of Health and Disability

Cultural and professional views of health influence how illness and disability are perceived, how decisions are made in the delivery of healthcare, and the ensuing directions of research. An important concept in understanding the evolution of medical and rehabilitation research is related to the overriding framework for the delivery of healthcare, which continues to change over time.

Understanding Health

Historical measures of health were based on mortality rates, which were readily available. As medical care evolved, so did the view of health, moving from survival to freedom from disease, and more recently to concern for a person's ability to engage in activities of daily living and to achieve a reasonable quality of life.12

These changes led to important shifts in understanding disability over the past 50 years. Following World War II, disability research was largely focused on medical and vocational rehabilitation, based primarily on an economic model in which disability was associated with incapacity to work and the need for public benefits.13 Today social and government policies recognize the impact of the individual and environment on one's ability to function. Subsequently, we have seen an important evolution in the way health and disability have been conceptualized and studied, from the medical model of disease to more recent models that incorporate social, psychological, and medical health as well as disability.14–17

The International Classification of Functioning, Disability and Health

In 2001, the World Health Organization (WHO) published the International Classification of Functioning, Disability and Health (ICF).18 The ICF portrays health and function in a complex and multidimensional model, consisting of six components in two domains, illustrated in Figure 1-2. The relationships among these components are interactive, but importantly do not necessarily reflect causal links.

In practice, research, and policy arenas, the ICF can be used to reflect potentially needed health services, environmental adaptations, or guidelines that can maximize function for individuals or create advantageous community services. Rather than focusing on the negative side of disability or simply mortality or disease, the intent of the ICF is to describe how people live with their health condition, shifting the focus to quality of life. This context makes the model useful for health promotion, prevention of illness and disability, and for the development of relevant research directions.

[Figure 1–2 The ICF model: a health condition (disorder or disease) interacts with body functions/structures, activities, and participation (the Functioning and Disability domain), within environmental and personal contextual factors.]

Functioning and Disability. The first ICF domain, in the upper portion of the model, is Functioning and Disability, which reflects the relationship among health conditions, body functions and structures, activities, and participation. Each of these components can be described in positive or negative terms, as facilitators or barriers.

• Body structures refer to anatomical parts of the body, such as limbs and organs, such as the brain or heart. Body functions are physiological, such as movement or breathing, or psychological, such as cognition. Body systems or anatomical structures will be intact or impaired.
• Activities refer to the carrying out of tasks or actions, such as using a knife and fork, rising from a chair, or making complex decisions. Individuals will be able to perform specific activities or will demonstrate difficulties in executing tasks, described as activity limitations.
• Participation is involvement in life roles at home, work, or recreation, such as managing household finances, going to school, or taking care of a child. Individuals will either be able to participate in life roles or they may experience participation restrictions.

Contextual Factors. The lower section of the model includes Contextual Factors, which consider the individual's health and function based on interaction with the environment and personal characteristics. Both of these contextual elements will either facilitate function or create barriers.

• Environmental factors include physical, social, and attitudinal aspects of the environment within which the individual functions. This may include architectural features such as stairs, furniture design, the availability of family or friends, or social policies that impact accessibility.
• Personal factors reflect the individual's own personality, abilities, preferences, and values. These may relate to characteristics such as age, gender, pain tolerance, education, family relationships, or physical strength and mobility. An individual will react to impairments or functional loss differently, depending on their previous experiences and attitudes.

The inclusion of contextual factors is a significant contribution of the ICF, as it allows us to put meaning to the term "disability." Think of our typical understanding of stroke, for example. The impairment of "brain infarct" is not the disabling component. The person's motor abilities may present some limitations, but the person's participation will also be influenced by financial support, home environment, availability of assistive devices or communication aids, as well as the complex interaction of factors such as societal attitudes, social policy, access to healthcare, education, community, and work.19
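The six ICF components just described can be collected into a small data model. This is a hypothetical sketch for illustration; the class and field names are invented and are not part of the ICF itself:

```python
# Hypothetical sketch of an ICF profile: six components in two domains.
from dataclasses import dataclass, field

@dataclass
class ICFProfile:
    health_condition: str                                          # disorder or disease
    # Domain 1: Functioning and Disability
    body_functions_structures: list = field(default_factory=list)  # intact or impaired
    activities: list = field(default_factory=list)                 # may show activity limitations
    participation: list = field(default_factory=list)              # may show participation restrictions
    # Domain 2: Contextual Factors (facilitators or barriers)
    environmental_factors: list = field(default_factory=list)
    personal_factors: list = field(default_factory=list)

# The chapter's stroke example, expressed in this structure:
profile = ICFProfile(
    health_condition="stroke",
    body_functions_structures=["brain infarct (impairment)"],
    activities=["limited motor abilities"],
    participation=["work and community roles shaped by context"],
    environmental_factors=["financial support", "home environment", "assistive devices"],
    personal_factors=["previous experiences", "attitudes"],
)
```

Modeling contextual factors as first-class fields reflects the chapter's point that the impairment alone (the infarct) is not the disabling component.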
Capacity and Performance. The activity and participation components of the ICF can be further conceptualized in terms of an individual's capacity to act versus actual performance; that is, what one is able to do versus what one actually does. These elements are defined with reference to the environment in which assessment is taking place.

The gap between capacity and performance reflects the impact of the environment on an individual's ability to perform.18 In the activity/participation construct, there is a distinction made between a person's ability to perform a skill in the clinic or standardized setting and their ability to perform the same skill in their natural environment.

Diwan et al20 studied mobility patterns of children with cerebral palsy in India in three contexts—home, school, and community settings. Using a measure of motor capacity, they observed that 70% of the children were able to walk with or without support in the clinical setting, and 16% were able to crawl. In the community, however, 52% were lifted by parents and only 6% used wheelchairs to get around; 22% of children who were able to walk with or without support were still lifted by parents in school and community settings. These results emphasize the importance of understanding environmental context, including culture, when evaluating a patient's performance and choosing measurements to investigate outcomes.

The ICF is the result of a multidisciplinary international effort to provide a common language for the classification and consequences of health conditions. In contrast to other models, the ICF recognizes functioning for its association with, but not necessarily as a consequence of, a health condition. From a research and evidence standpoint, therefore, one cannot assume that measures of impairments or function always translate to understanding participation limitations. Much of the research on the ICF is focused on how these elements do or do not interact with specific types of health conditions (see Focus on Evidence 1-1).

The ICF has been evaluated in a variety of patient populations, cultures, age groups, and diagnoses,21–28 including a recent children's and youth version.29 The model has been applied primarily within rehabilitation fields,30–32 although medicine, nursing, and other professions have recognized its relevance.19,33–35

An expanded discussion of models of health is included in the Chapter 1 Supplement with examples of studies that are based on the ICF. The evolution of these models has influenced priorities and methods in healthcare research.

Interprofessional and Interdisciplinary Research

A major challenge to translating new evidence into application is moving knowledge through the laboratory, academic, clinical, and public/private sectors—often a problem created by the existence of practical "silos."
Focus on Evidence 1–1
Health, Function, and Participation—Not Always a Straight Line

Three main consequences of chronic low back pain (LBP) are pain, functional limitations, and reduced productive capacity at work. A variety of outcome measures have been used to study the impact of LBP, mostly focusing on pain and function, with work status implied from the severity of functional limitation. Sivan et al36 designed a cross-sectional study to analyze the correlation between work status and three standard functional outcome scales for LBP, to examine the extent to which they measure similar concepts.

The researchers looked at function and work status in 375 patients with chronic LBP who were receiving outpatient care. They considered two categories of outcomes:

1. Functional deficits related to pain, using the Oswestry Disability Index,37 the Roland Morris Disability Questionnaire,38 and the Orebro Musculoskeletal Pain Questionnaire39; and
2. Work status defined by the degree to which LBP interfered with participation in work activities, based on whether hours or duties had to be curtailed or changed.

Results showed only modest correlations between work status and the three LBP indices, even though some of the questionnaires include work-related questions. The authors concluded that vocational liability should not be implied from the severity of LBP because work status is influenced by other complex physical and psychosocial factors not necessarily related to LBP.

This study illustrates two central principles in applying the ICF—the importance of personal and environmental factors to understand participation, and the potential nonlinear relationship across impairments, function, and participation. The interaction of these factors, therefore, must be considered when setting treatment goals. This framework can also be used to direct research efforts, with a need to evaluate the ability of measurement tools to reflect the ICF components of body functions, activities, participation, environmental factors, and personal factors.
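The "modest correlation" reported in this example can be illustrated with a Spearman rank correlation, a common choice when work status is an ordinal category. The data below are invented for illustration; they are not from Sivan et al:

```python
# Illustrative Spearman rank correlation between an ordinal work-status
# score and a functional disability index (pure standard library).
# All data here are hypothetical, not from the study.

def midranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on midranks."""
    rx, ry = midranks(x), midranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# 0 = full duties, 1 = duties curtailed, 2 = unable to work (hypothetical)
work_status = [0, 1, 1, 2, 0, 2]
disability_index = [12, 30, 45, 20, 28, 40]  # hypothetical 0-100 scores

rho = spearman(work_status, disability_index)  # modest positive correlation
```

A modest rho is exactly the pattern the authors describe: functional severity and work status are related, but one cannot be inferred from the other.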
Translation can be slowed by limited communication across disciplines and settings, as well as substantial differences in scientific and social perspectives.

Just like the idea of interprofessional practice, the concept of collaboration across disciplines to address important research questions is not new, as many interdisciplinary fields have emerged over the decades to make significant contributions to health science, such as physical chemistry, bioinformatics, or astrophysics.

HISTORICAL NOTE
Interprofessional collaborations have led to many vital connections needed in the process of discovery.
• A fundamental observation by physicists in 1946 that nuclei can be oriented in a magnetic field eventually led to magnetic resonance imaging in collaboration with physical chemists and biologists.40
• The development of the cochlear implant required the integration of electrical engineering, physiology, and acoustics.41
• The sequencing of the human genome required techniques from physics, chemistry, biology, and computer science.42

Terminology

Discussions of interprofessional research are often hampered by a lack of consistency in terminology, starting with the difference between a discipline and a profession. Disciplines represent fields of study or individual sciences, such as biology, chemistry, psychology, economics, or informatics. A profession has been defined as a disciplined group of individuals who are educated to apply a specialized knowledge base in the interest of society, and who adhere to ethical standards and hold themselves accountable for high standards of behavior.43

Therefore, occupational therapy, physical therapy, speech-language pathology, medicine, nursing, and pharmacy are examples of professions. Within a profession, different disciplines may be represented, such as neurology and orthopedics. Even within one profession, however, disciplinary norms may vary.

We can consider different levels of integration across disciplines or professions. The use of "discipline" or "profession" in these terms depends on the nature of individuals included and will often involve both. For the sake of simplicity, this discussion will continue to use the term "interprofessional," but "interdisciplinary" can be substituted appropriately.

Here are some useful distinctions related to how teams function to solve problems (see Fig. 1-3).44 These definitions can be applied to interprofessional education, practice, or research.

• Intraprofessional: Members of one profession work together, sharing information through the lens of their own profession.
• Multiprofessional: An additive process whereby members of multiple professions work in parallel alongside each other to provide input, making complementary contributions, each staying within their own professional boundaries, working separately on distinct aspects of the problem.
• Interprofessional: Members of multiple professions work together, contributing their various skills in an integrative fashion, sharing perspectives to inform decision-making.
• Transprofessional: A high level of cooperation of professionals from multiple fields who understand each other's roles and perspectives sufficiently to blur boundaries, share skills and expertise, and use them to develop new ways of viewing a problem.

One comprehensive definition suggests that interprofessional research is:

. . . any study or group of studies undertaken by scholars from two or more distinct scientific disciplines. The research is based upon a conceptual model that links or integrates theoretical frameworks from those disciplines, uses study design and methodology that is not limited to any one field, and requires the use of perspectives and skills of the involved disciplines throughout multiple phases of the research process.45

This definition advocates that such scholars must develop skills and knowledge beyond their own core discipline. Interprofessional researchers should be open to perspectives of other professions, read journals outside
of their own discipline, and be willing to consider theories that stretch their understanding of the problems at hand.46 Just as professional education must give students collaborative tools that they can bring into practice, clinicians and investigators must be trained in the team science of interprofessional research.47

Supporting Translational Research

The entire concept of translational research, from basic science to effectiveness studies, makes the argument for interprofessional collaboration even stronger. Implementation of research requires strong teams to develop questions and study methodologies that go beyond any one discipline. This may include members who are not routinely part of most clinical trials, such as economists, sociologists and anthropologists, organizational scientists, quality-improvement (QI) experts and administrators, and patients and clinicians from various professions.48

Contributing to translational research demands diversity of thinking that is best generated by collaborative teams.49–51 Most importantly, we cannot ignore the importance of collaboration for evidence-based practice. Just as we need to consider interprofessional communication as an essential part of patient-centered care, we must look to research from many professions to inform our clinical decisions. Many funding agencies have set priorities for studies involving multiple disciplines.52–54

Barriers and Promoters in Interprofessional Research

Many barriers have been identified that limit the degree to which interprofessional research is carried out.55 These include philosophical clashes, such as differences in professional cultures, perceived professional hierarchies, language and jargon disparities, preferences for different research methodologies, and conflicting theoretical or conceptual frameworks for prioritizing research questions.

Some barriers are more practical, such as insufficient funding, institutional constraints on time and space, difficulty identifying the right collaborators with expertise across disciplines, academic tenure requirements, concerns about authorship, and disagreements about where to publish the results. Many interdisciplinary and interprofessional journals have been developed, but these may not carry the same weight as publication within a discipline's flagship publication.

Successful collaborations have been described as those with strong cooperative teams and team leaders, flexibility and commitment of team members, physical proximity, institutional support, clarity of roles, and a common vision and planning process involving the team members from the start.55

Making an Impact

The importance of collaboration in research cannot be overemphasized, as healthcare confronts the increasing complexity of health issues, technology, and social systems. Real-world problems seldom come in neat disciplinary packages. The progress of science and healthcare will depend on our ability to generate new ways of viewing those problems. In health education, service delivery, and policy implementation, the progress of healthcare will depend on contributions with different viewpoints as a way to offer innovative solutions when trying to solve multifaceted issues (see Focus on Evidence 1-2).56

The foundational healthcare frameworks that we have discussed in this chapter all demand interprofessional attention. In the Triple Aim, for example, the three concepts of population health, cost, and quality care, as well as the outcomes that would be the appropriate indicators of change, are almost by definition interprofessional and interdisciplinary.57
Focus on Evidence 1–2
Coming at a Question From Two Sides

Physical therapists identified a challenge in working with people with aphasia when administering the Berg Balance Scale, a commonly used tool to quantify dynamic sitting and standing balance to predict fall risk.58 Although considered a valid test for older adults, the clinicians were concerned about the test's validity for those with aphasia because of its high demand for auditory comprehension and memory. Collaborative work with speech-language pathologists resulted in a modified version of the scale, which included simplified verbal directions, visual and written cues, and modeling and repetition of instructions. The researchers compared performance on the standard and modified versions, finding that patients performed better with the modified test, which indicated that some portion of their performance difficulty was likely due to their communication deficit.

In this example, the clinicians from both professions brought important and different perspectives to the question and its research implementation that neither group could have done alone, adding to the understanding of the test's validity.
6113_Ch01_001-016 02/12/19 4:56 pm Page 11
Figure 1–4 Types of research along the continuum from descriptive to exploratory to explanatory research. (The figure lists types of research: quasi-experiments, single-subject designs (N-of-1), cohort studies, case-control studies, methodological research (reliability/validity), developmental research, normative research, descriptive surveys, case reports, historical research, and qualitative research; and synthesis of research: systematic reviews, meta-analyses, and scoping reviews.)
COMMENTARY
Research is to see what everyone else has seen, and to think what nobody else has
thought.
—Albert Szent-Györgyi (1893–1986)
1937 Nobel Prize in Medicine
The focus of a text such as this one is naturally on the methods and procedures of conducting and interpreting research. By understanding definitions and analytic procedures, the clinician has the building blocks to structure an investigation or interpret the work of others.

Methodology is only part of the story, however. Research designs and statistical techniques cannot lead us to a research question, nor can they specify the technical procedures needed for studying that question. Designs cannot assign meaning to the way clinical phenomena behave. Our responsibility includes building the knowledge that will allow us to make sense of what is studied. This means that we also recognize the philosophy of a discipline, how subject matter is conceptualized, and which scientific approaches will contribute to an understanding of practice. As one scientist put it:

Questions are not settled; rather, they are given provisional answers for which it is contingent upon the imagination of followers to find more illuminating solutions.60

How one conceives of a discipline's objectives and the scope of practice will influence the kinds of questions one will ask. We must appreciate the influence of professional values on these applications, reflecting what treatment alternatives will be considered viable. For instance, we might look at the slow adoption of qualitative research approaches in medicine, emphasizing the need to open ourselves to evidence that may hold important promise for our patients.61 Some disciplines have embraced single-subject designs, whereas others have focused on randomized trials.62 Research choices are often a matter of philosophical fit that can override issues of evidence, hindering knowledge translation.63 This is especially relevant for interdisciplinary clinical associations and the shared research agenda that can emerge.

There is no right or wrong in these contrasts. As we explore the variety of research approaches and the context of EBP, clinicians and researchers will continually apply their own framework for applying these methods. The emphasis on clinical examples throughout the book is an attempt to demonstrate these connections.
Research Training Programs. Washington (DC): National Academies Press, 2005. 8, Emerging Fields and Interdisciplinary Studies. Available at: http://www.ncbi.nlm.nih.gov/books/NBK22616/. Accessed December 26, 2015.
41. Luxford WM, Brackman DE. The history of cochlear implants. In: Gray R, ed. Cochlear Implants. San Diego, CA: College-Hill Press, 1985:1–26.
42. US Department of Energy OoSaTI: History of the DOE human genome program. Available at https://www.osti.gov/accomplishments/genomehistory.html. Accessed December 18, 2015.
43. Professions Australia: What is a profession? Available at http://www.professions.com.au/about-us/what-is-a-professional. Accessed January 23, 2017.
44. Choi BC, Pak AW. Multidisciplinarity, interdisciplinarity and transdisciplinarity in health research, services, education and policy: 1. Definitions, objectives, and evidence of effectiveness. Clin Invest Med 2006;29(6):351–364.
45. Aboelela SW, Larson E, Bakken S, Carrasquillo O, Formicola A, Glied SA, Haas J, Gebbie KM. Defining interdisciplinary research: conclusions from a critical review of the literature. Health Serv Res 2007;42(1 Pt 1):329–346.
46. Gebbie KM, Meier BM, Bakken S, Carrasquillo O, Formicola A, Aboelela SW, Glied S, Larson E. Training for interdisciplinary health research: defining the required competencies. J Allied Health 2008;37(2):65–70.
47. Little MM, St Hill CA, Ware KB, Swanoski MT, Chapman SA, Lutfiyya MN, Cerra FB. Team science as interprofessional collaborative research practice: a systematic review of the science of team science literature. J Investig Med 2017;65(1):15–22.
48. Bauer MS, Damschroder L, Hagedorn H, Smith J, Kilbourne AM. An introduction to implementation science for the non-specialist. BMC Psychology 2015;3(1):32.
49. Horn SD, DeJong G, Deutscher D. Practice-based evidence research in rehabilitation: an alternative to randomized controlled trials and traditional observational studies. Arch Phys Med Rehabil 2012;93(8 Suppl):S127–S137.
50. Green LW. Making research relevant: if it is an evidence-based practice, where's the practice-based evidence? Fam Pract 2008;25 Suppl 1:i20–i24.
51. Westfall JM, Mold J, Fagnan L. Practice-based research—"Blue Highways" on the NIH roadmap. JAMA 2007;297(4):403–406.
52. Agency for Healthcare Research and Quality: US Department of Health and Human Services. Available at https://www.ahrq.gov. Accessed January 5, 2017.
53. Zerhouni E. Medicine. The NIH Roadmap. Science 2003;302(5642):63–72.
54. Patient-Centered Outcomes Research Institute. Available at http://www.pcori.org/. Accessed January 5, 2017.
55. Choi BC, Pak AW. Multidisciplinarity, interdisciplinarity, and transdisciplinarity in health research, services, education and policy: 2. Promotors, barriers, and strategies of enhancement. Clin Invest Med 2007;30(6):E224–E232.
56. Committee on Facilitating Interdisciplinary Research: Facilitating Interdisciplinary Research. National Academy of Sciences, National Academy of Engineering, and the Institute of Medicine, 2004. Available at https://www.nap.edu/catalog/11153.html. Accessed January 24, 2017.
57. Berwick DM, Nolan TW, Whittington J. The triple aim: care, health, and cost. Health Aff (Millwood) 2008;27(3):759–769.
58. Carter A, Nicholas M, Hunsaker E, Jacobson AM. Modified Berg Balance Scale: making assessment appropriate for people with aphasia. Top Stroke Rehabil 2015;22(2):83–93.
59. Arksey H, O'Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol 2005;8:19–32.
60. Baltimore D. Philosophical differences. The New Yorker January 27, 1997, p. 8.
61. Poses RM, Isen AM. Qualitative research in medicine and health care: questions and controversy. J Gen Intern Med 1998;13(1):32–38.
62. Welch CD, Polatajko HJ. Applied behavior analysis, autism, and occupational therapy: a search for understanding. Am J Occup Ther 2016;70(4):1–5.
63. Polatajko HJ, Welch C. Knowledge uptake and translation: a matter of evidence or of philosophy. Can J Occup Ther 2015;82(5):268–270.
CHAPTER 2
On the Road to Translational Research
The imperative of evidence-based practice (EBP) must be supported by research that can be applied to clinical decisions. Unfortunately, many research efforts do not have clear relevance to everyday practice, and many clinical questions remain unanswered. Our EBP efforts can be stymied by a lack of meaningful and useful research to support our clinical efforts.

The term translational research refers to the direct application of scientific discoveries into clinical practice. Often described as taking knowledge from "bench to bedside," this process is actually much broader in scope. Although certainly not a new concept, the healthcare community has experienced a renewed emphasis on the application of findings from laboratory and controlled studies to clinically important problems in real-world settings.

The purpose of this chapter is to describe the scope of translational research, including methods of efficacy and effectiveness research, and the importance of implementation research to make EBP a reality.

■ The Translation Gap

All too often, the successes of scientific breakthroughs in the laboratory or in animal models have not translated into major changes in medical or rehabilitative care for humans in a timely way. Of course, there are many notable success stories of research that has resulted in exceptionally important medical treatments, such as the discovery of penicillin in 19281 and the subsequent development of techniques to extract the antibiotic. Similarly, the discovery of insulin and its use in treating diabetes is a landmark achievement that has resulted in significant changes in medical care.2

There are many more examples, however, of lag times from 10 to 25 years or more for a discovery to reach publication, with several estimates of an average of 17 years for only 14% of new scientific discoveries to eventually reach clinical practice.3 Given that these estimates are primarily based on publication records, it is likely that they are underestimates of true translation to health outcomes and impacts (see Focus on Evidence 2-1).4 The NIH Roadmap, proposed in 2002, has called for a new paradigm of research to assure that "basic research discoveries are quickly transformed into drugs, treatments, or methods for prevention."5

Lag times are often due to sociocultural or scientific barriers, as well as systemic hurdles such as grant awards and institutional approvals, marketing and competition, policy and regulatory limitations, need for long-term follow-up in clinical trials, limited funding resources, guideline development, and dissemination. In the meantime, evidence-based clinicians who strive to adopt the
Focus on Evidence 2–1
A Translation Journey: The Lipid Hypothesis

It was more than 100 years ago, in 1913, that a young Russian experimental pathologist named Nikolai Anitschkow reported that feeding rabbits a high-cholesterol diet produced arterial plaques that resembled those in human atherosclerosis.6 The disease had been well described in the medical literature by then, but was considered to be a predictable and nontreatable consequence of aging. At the time, one of the leading hypotheses as to its pathogenesis was that it was the result of excessive intake of animal proteins, and several scientists were attempting to provide evidence by feeding rabbits diets rich in milk, eggs, and meat, which did lead to vascular abnormalities. And although they found that egg yolks or whole eggs could also replicate the process, egg whites alone did not! Anitschkow showed that, by extracting cholesterol from egg yolks, he could duplicate the results without any added proteins.

Although some laboratories were able to confirm Anitschkow's findings using rabbits, most investigators were using other, more typical test animals, such as rats and dogs, and were not getting the same results. Reasons for these differences, understood today but not then, included different metabolic function in dogs and rabbits. So even though Anitschkow's work had been widely disseminated in German in 1913 and republished in English in 1933, there was no follow-up. Perhaps, too, the scientific community was indifferent to the findings because they ran contrary to the prevailing aging theory, and it was considered implausible to mimic that outcome by feeding rabbits cholesterol!7 Steinberg,8 a prominent researcher in this field, observed:

Anitschkow's work should have galvanized the scientific community and encouraged innovative approaches to this major human disease problem. But nothing happened. Here was a classic example of how rigid, preconceived ideas sometimes stand in the way of scientific progress. Why did no one ask the (now) obvious questions?

Although Anitschkow's work did convincingly show that hypercholesterolemia in rabbits was sufficient to cause atherosclerosis, that did not necessarily mean that cholesterol in the human diet or blood had to be a factor in the disease. But basic questions about the pathogenesis of the disease were not being asked—how the cholesterol traveled in the rabbit's blood, how it became embedded in the arterial wall, or how a human diet might contribute to increased blood cholesterol. It wasn't until the 1950s that metabolic studies10 and epidemiologic studies, such as the Framingham Heart Study,9 sparked further investigation. Then, it took more than 30 years for the first clinical trial to be completed in 1984 to understand the relationship,11 and the first statin was not approved until 1987. A century later, we have the benefit of basic laboratory studies, the development of medications, understanding genetic markers, and extensive clinical trials that have supported the century-old "lipid hypothesis."12
most effective treatment are at the mercy of the research enterprise—sometimes information on newer successful therapies is unavailable, or a study may support a new therapy, only to be disputed later with new information. This means that some patients may not get the best care, and others may unwittingly get unintentional harmful care.

[…] decision-making and treatment effectiveness is paramount. One working definition is:

Translational research fosters the multidirectional integration of basic research, patient-oriented research, and population-based research, with the long-term aim of improving the health of the public.14
…purpose, such as physiological responses, laboratory tests, or assessment of impairments that demonstrate the treatment's safety and biophysiological effect. These are necessary elements of the design process to assure that researchers can have confidence in observed outcomes.

Effectiveness

The major limitation of the efficacy approach is that it often eliminates the very patients who would likely be treated with the new therapy; that is, most of our patients come with comorbidities and other personal traits that cannot be controlled in practice. The testing environment will often be artificial, limiting the applicability of findings because they are not representative of the variable practice environment that cannot be controlled in the clinical community.

Therefore, effectiveness trials are needed to look at interventions under natural practice conditions, to make sure that the findings from RCTs translate to "real-world" clinical care. These trials are called pragmatic (practical) clinical trials (PCTs). PCTs will often incorporate measures of function or quality of life that are considered more relevant for patient satisfaction to understand if treatments have a meaningful effect on patient outcomes. Qualitative research, which highlights the human dimensions of health care, is especially useful in health services research to understand how patients' and providers' perceptions influence care delivery.

Efficacy trials are sometimes referred to as explanatory, whereas effectiveness trials are called pragmatic. This distinction does not really clarify their differences, as both are seeking to discern the success of interventions. It is more appropriate to think of efficacy and effectiveness as ends of a continuum, with most trials somewhere in between.15

Chapters 14, 15, and 16 will cover the design and validity issues inherent in RCTs and PCTs. Chapter 21 will focus on the contributions of qualitative research.

Translation Blocks

Between 2000 and 2005, the Institute of Medicine (IOM) Clinical Research Roundtable identified two "translation blocks" that represented obstacles to realizing true health benefits from original research.16 The first translation block, T1, reflected the challenge of transferring laboratory findings related to disease mechanisms into the development of new methods for diagnosis, treatment, and prevention. The second block, T2, reflected the need to translate results of clinical studies into everyday clinical practice and decision-making.

These blocks have since been expanded to reflect four stages in the translation of scientific knowledge to practice and health-care services. Although several models have been proposed,4,17–21 they all have a common trajectory for the translation process (see Fig. 2-1).

➤ CASE IN POINT #1
The botulinum toxin, known to be a poisonous substance, is also now known to have important properties for relieving neurogenic disorders, including spasticity. Discoveries from basic mechanisms to clinical use illustrate the stages of translational research.

T0—Basic Research
The start of the translational continuum is characterized as basic research, representing investigations that focus on theory or biological inquiry, often utilizing animal models. These studies are typically necessary for the development of new drugs. The purpose is to define physiological mechanisms that will allow researchers to generate potential treatment options that can be tested.

➤ T0: The Clostridium botulinum bacterium was first identified in 1895 following the death of three people from food poisoning in a small Belgian town. Over time, laboratory studies were able to isolate the botulinum-A toxin (BTX) and researchers discovered that it blocked neuromuscular transmission by preventing the release of acetylcholine.22

T1—Translation to Humans: DOES it work?
In this first translation stage, researchers must assess the potential for clinical application of basic science findings, testing new methods of diagnosis, treatment, and prevention. These studies are considered Phase I trials and case studies. They are usually small, focused studies, specifically looking at safety, accuracy, and dosage.

➤ T1: In 1977, the first human injections of BTX were given to relieve strabismus, with no systemic complications.23 In 1985, injections were successful for 12 patients with torticollis.24 Following these studies, in 1989 researchers demonstrated its effect on upper extremity spasticity in six patients post-stroke.25 These studies also demonstrated the safety of this application.

T2—Translation to Patients: CAN it work under ideal conditions?
Once clinical applications have been demonstrated, researchers must subject interventions and diagnostic procedures to rigorous testing using Phase II and Phase III trials, which are typically RCTs. These trials are larger efficacy studies, with specific controls and defined patient samples, comparing the intervention of interest with a placebo or usual care.
(Figure 2–1 depicts the stages of translational research as a cycle: T0 Basic Research, with preclinical studies defining physiological mechanisms; T1 Translation to Humans, with Phase I clinical trials and new methods of safety, diagnosis, and treatment ("Will it work?"); T2 Translation to Patients; T3 Translation to Practice; and T4 Translation to Population, with health services and policy research and population outcomes ("Is it worth it?").)
➤ T2: Randomized, double-blind, placebo-controlled clinical trials over 20 years demonstrated that BTX could safely reduce upper extremity spasticity over the long term.26,27

This phase of the translation process will often result in approval of a new drug or device by the Food and Drug Administration (FDA).

T3—Translation to Practice: WILL it work in real-world conditions?
This is the practice-oriented stage, focusing on dissemination, to find out if treatments are as effective when applied in everyday practice as they were in controlled studies.

Phase IV clinical trials are typically implemented when devices or treatments have been approved and researchers continue to study them to learn about risk factors, benefits, and optimal use patterns. Outcomes are focused on issues that are relevant to patients, such as satisfaction and function, and studies are embedded into practice settings.

➤ T3: Effectiveness trials continue to demonstrate multiple uses of BTX. Spasticity studies have shown minimal adverse effects and distinct changes in tone. Treatment aimed at prevention of contractures has had some success, although there is limited evidence regarding positive upper extremity functional outcomes.28,29 Studies continue to look at dosage, dilutions, site of injections, and combination therapies.

T4—Translation to Populations: Is it worth it?
This final translation stage is focused on health services and policy research to evaluate population outcomes and impact on population health, which can include attention to global health. It addresses how people get access to health-care services, the costs of care, and the system structures that control access. These are also considered Phase IV trials. They seek to inform stakeholders if the difference in treatment outcomes warrants the difference in cost for more expensive interventions, or if the "best" treatment gives the best "bang for the buck."30

The ultimate goal is to impact personal behaviors and quality of life for individuals and populations, and to provide information that will be used for clinical decision-making by patients, care providers, and policy makers.

Although many therapies make their way through T1 to
T3 phases, they must still reach the communities that need them in cost-effective ways.

➤ T4: Researchers have been able to show improved quality of life in physical dimensions of function for patients with upper extremity spasticity,31 and cost-effectiveness of using BTX based on a greater improvement in pretreatment functional targets that warrant continuation of therapy.32 Guidelines for using BTX have also been established.33

Chapter 14 provides a fuller discussion of the four phases of clinical trials.

The Multidirectional Cycle

Although the translational continuum is presented as sequential stages, it is important to recognize that these stages truly represent an iterative and interactive pathway in multiple directions, not always starting with basic research questions. The bidirectional arrows reflect how new knowledge and hypotheses can be generated at each phase.

For example, many conditions, such as AIDS and fetal alcohol syndrome, were first described based on clinical observations presented in historical reports and case studies.34,35 These were followed by decades of epidemiologic studies describing the syndromes, with later bench studies confirming physiological and developmental mechanisms through animal models and genetic investigation. Subsequent clinical trials focused on diagnostic tests, medications, and guidelines for treatment, often leading back to laboratory studies.

■ Effectiveness Research

Because the explicit goal of translational research is to bridge the gap between the controlled scientific environment and real-world practice settings, effectiveness trials specifically address generalizability and access. These varied study approaches have different names, with some overlap in their purposes, but they represent an important direction for clinical research to address T3 and T4 translation.

Federal and nonprofit agencies have created programs with significant funding to support effectiveness studies, most notably the Agency for Healthcare Research and Quality (AHRQ),36 the Patient-Centered Outcomes Research Institute (PCORI),37 and the National Institutes of Health (NIH) Clinical and Translational Science Awards (CTSA).38

Comparative Effectiveness Research

Comparative effectiveness research (CER) is designed to study the real-world application of research evidence to assist patients, providers, payers, and policy makers in making informed decisions about treatment choices. The IOM provides the following definition:

Comparative effectiveness research is the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat and monitor or improve the delivery of care.39

This purpose includes establishing a core of useful evidence that will address conditions that have a high impact on health care.
• For patients, this strategy means providing information about which approaches might work best, given their particular concerns, circumstances, and preferences.
• For clinicians, it means providing evidence-based information about questions they face daily in practice.
• For insurers and policy makers, it means producing evidence to help with decisions on how to improve health outcomes, costs, and access to services.37

CER may include studies on drugs, medical devices, surgeries, rehabilitation strategies, or methods of delivering care. These trials include studies done on patients who are typical of those seen in everyday practice, with few exclusion criteria, and are typically embedded in routine care to improve generalizability of findings and to allow translation of findings into care for individuals and populations.

Study designs may include randomized trials as well as observational research, and CER may include original research or systematic reviews and meta-analyses that synthesize existing evidence.

An important distinction of CER is a focus on comparing alternative treatments, not comparing new treatments to a placebo or control. This "head-to-head" comparison is intended to compare at least two interventions that are effective enough to be the standard of care, to determine which one is "best." Because different patients may respond differently to treatments, CER will often focus on studying subgroups of populations.

Pragmatic Clinical Trials

Pragmatic trials have several important features that differ from RCTs (see Fig. 2-2).40,41 In PCTs, the hypothesis and study design are formulated based on information needed to make a clinical decision.

Flexible Protocols

Pragmatic trials allow for a different level of flexibility in research design than an RCT. One such difference is in the specificity of the intervention.
Figure 2–2 The continuum from explanatory to pragmatic trials. (Explanatory trials: controlled design, smaller samples, controlled environment, focused outcomes. Pragmatic trials: flexible design, large samples, diverse practice settings, meaningful outcomes.)

➤ CASE IN POINT #2
Physical inactivity in elders can lead to functional decline and many physical and physiological disorders. Researchers designed a pragmatic trial to evaluate programs to promote physical activity and better functional outcomes among older people.42 The study compared three treatments:
• A group-based Falls Management Exercise (FaME) program held weekly for supervised exercise and unsupervised home exercises twice weekly.
• A home-based exercise program involving unsupervised exercises to be done three times per week.
• A usual care group that was not offered either program.

Although a precise definition of the intervention is important in both types of studies, PCTs involve implementation as it would typically occur in practice. For example, in the FaME study, the way in which patients did home exercises would likely have variability that could not be controlled.

Patients may also receive modifications of treatment in a PCT to meet their specific needs, as long as they are based on a common management protocol or decision-making algorithm.43 This is especially relevant to rehabilitation interventions, which may be based on the same approach, but have to be adapted to individual patient goals and conditions.

Diverse Settings and Samples

RCTs typically limit where the research is being done, whereas PCTs will try to embed the study in routine practice, such as primary care settings or communities.44 The challenge with PCTs is balancing the need for sufficient control in the design to demonstrate cause and effect while being able to generalize findings to real-world practice.

One of the most important distinctions of a PCT is in the selection of participants and settings, both of which impact the generalization of findings. The PCT sample should reflect the variety of patients seen in practice for whom the interventions being studied would be […] will often see a substantial list of exclusion criteria that restricts age or gender, and eliminates subjects with comorbidities, those at high risk, or those who take other medications. The purpose of these exclusions is to make the sample as homogeneous as possible to improve statistical estimates.

In clinical practice, however, we do not have the luxury of selecting patients with specific characteristics. Therefore, the PCT tries to include all those who fit the treatment population to better reflect the practical circumstances of patient care. A full description of demographics within a sample is important to this understanding.

➤ The FaME exercise study involved 43 general practices in three counties in the United Kingdom. All patients over 65 years of age were recruited. They had to be independently mobile so that they could participate in the exercise activities. The only exclusions were a history of at least three falls, not being able to follow instructions, or unstable clinical conditions precluding exercise.

Practice-Based Evidence

The term practice-based evidence (PBE) has also been used to describe an approach to research that is of high quality but developed and implemented in real-world settings. Developing PBE requires deriving research questions from real-world clinical problems. Applying evidence to practice is often difficult when studies do not fit well with practice needs. In a practice-based research model, the clinical setting becomes the "laboratory," involving front-line clinicians and "real" treatments and patients.45

The ideas of EBP and PBE are often confused. EBP focuses on how the clinician uses research to inform clinical practice decisions. PBE refers to the type of evidence that is derived from real patient care problems, identifying gaps between recommended and actual practice.46 EBP will be more meaningful when the available evidence has direct application to clinical situations—evidence that is practice based!

PBE studies often use observational cohorts, surveys, secondary analysis, or qualitative studies, with diverse samples from several clinical sites, with comprehensive data collection methods that include details on patient characteristics and their care (see Focus on Evidence 2-2). The studies are typically designed to create databases with many outcome variables and documentation standards.
Focus on Evidence 2–2
Patients and Providers: Understanding Perceptions

Nonadherence to drug regimens can lead to negative consequences, and patients often stop taking medications for various reasons. Having identified this problem in practice, researchers examined the differences between patient and physician perspectives related to drug adherence and drug importance in patients who were taking at least one long-term drug treatment.47 The study recruited 128 patients, from both hospital and ambulatory settings, who were taking 498 drugs.

For each drug taken, they asked physicians to rate their knowledge of the patient's adherence and the importance of the drug to the patient's health. Patients were asked for the same two ratings, in addition to an open-ended question on why they were not adherent. Researchers found a weak agreement of both adherence and importance between patients and physicians. They found that the physicians considered 13% of the drugs being taken as unimportant, and that nearly 20% of the drugs that the physicians considered important were not being taken correctly by patients.

These findings have direct clinical implications for shared decision-making, including involving other professionals such as pharmacists. It points to the different perspectives of patients, who may see drugs primarily for symptom relief, whereas physicians may be focused on long-term prevention or changes in laboratory test values.
length of stay, and so on.52 This is in contrast to disease-oriented evidence (DOE), which focuses more on changes in physiology and laboratory tests. DOE is important to understanding disease and diagnostic processes but may not be as informative in making choices for clinical management as POEM.

Patient-Reported Outcome Measures

The kinds of outcomes that are most important to patients are often assessed by using a patient-reported outcome measure (PROM), which is:

. . . any report of the status of a patient’s health condition that comes directly from the patient, without interpretation of the patient’s response by a clinician or anyone else.53,54

These measures have been a mainstay in the design of pragmatic trials, but are becoming more common in RCTs in addition to more traditional outcomes. For example, research has shown that PROMs have a strong correlation to objective measures of movement and function.55

PROMs can be generic or condition specific. For measuring quality of life, for example, the Short Form 36 Health Survey provides a generic evaluation in domains related to physical, mental, emotional, and social functioning.56 This tool has been tested across multiple populations. Similar tests are condition specific, such as the Minnesota Living with Heart Failure Questionnaire, which asks for patient perceptions relative to their heart condition.57 Other condition-specific measures include the Shoulder Pain and Disability Index (SPADI)58 and the Stroke Impact Scale (SIS).59 Clinicians need to understand a measure’s theoretical foundation and the validity of these tools so they can select the one that will be appropriate to the condition and the information that will be relevant to decision-making (see Chapter 10).60

You may also see the term patient-reported outcome (PRO) used to refer to the outcome itself, such as satisfaction or health-related quality of life, as opposed to the PROM. For example, health-related quality of life is a PRO that can be measured using the Short Form 36 Health Survey, which is a PROM.

Patient-Centered Outcomes Research

Because patients and clinicians face a wide range of complex and often confusing choices in addressing health-care needs, they need reliable answers to their questions. Those answers may not exist, or they may not be available in ways that patients can understand.

As a form of CER, patient-centered outcomes research (PCOR) has the distinct goal of engaging patients and other stakeholders in the development of questions and outcome measures, encouraging them to become integral members of the research process (see Focus on Evidence 2-3). Funding for these studies is the primary mission of the Patient-Centered Outcomes Research Institute (PCORI).37
Focus on Evidence 2–3
Focusing on Patient Preferences

For research to be patient centered, assessments must be able to reflect what matters to patients, how they would judge the success of their treatment. In a study funded by PCORI, Allen et al61 used a tool they developed called the Movement Abilities Measure (MAM), which evaluates movement in six dimensions: flexibility, strength, accuracy, speed, adaptability, and endurance. They wanted to know if there was a gap between what patients thought they could do and what movement ability they wanted to have. Patients completed questions at admission and discharge. Clinicians’ notes on patient performance were analyzed for the same six dimensions, to see if there was agreement—did the physical therapists have the same perceptions as the patients in terms of their values for outcomes?

Responses were transformed to a six-point scale. For example, on the dimension of flexibility, patients were asked to rate their movement from 6 “I move so easily that I can stretch or reach . . .” to 0 “Stiffness or tightness keeps me from doing most of my daily care.” For each dimension, patients were asked what their current state was and what they would like it to be. The largest gap at admission was between the patients’ current and preferred level of flexibility.

The gaps decreased following the episode of care, indicating that most patients perceived improvement, getting closer to where they wanted their movement to be. However, the clinicians did not rate the outcomes so highly and reported that 62% of patients did not meet all therapeutic goals. Interestingly, only 29% of the patients had their largest current–preferred gap in the same dimension as their therapist.

The important message in these results, which leads to many research opportunities, is that there appears to be considerable disagreement between clinicians and patients on the dimensions of movement that should be the goals of their care and how improvements are perceived. Such information would help clinicians understand what patients value most and how treatment can be focused on those preferences.
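The gap analysis used in this study lends itself to a simple illustration. Here is a minimal sketch, using entirely hypothetical ratings (the actual MAM items and scoring rules are not reproduced here), of computing current–preferred gaps and identifying the largest-gap dimension:

```python
# Hypothetical ratings on a 0-6 scale (6 = ideal movement ability)
current = {"flexibility": 2, "strength": 4, "accuracy": 5,
           "speed": 3, "adaptability": 4, "endurance": 3}
preferred = {"flexibility": 6, "strength": 5, "accuracy": 6,
             "speed": 5, "adaptability": 5, "endurance": 5}

# Gap = preferred rating minus current rating for each dimension
gaps = {dim: preferred[dim] - current[dim] for dim in current}

# The dimension with the largest current-preferred gap is where
# this (hypothetical) patient most wants to improve
largest = max(gaps, key=gaps.get)
print(largest, gaps[largest])  # flexibility 4
```

A study would compute the same largest-gap dimension from the clinician’s ratings and check whether the two agree, which is the comparison reported above.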
Primary and Secondary Outcomes

Trials will often study multiple endpoints, which will reflect outcomes at different levels. Researchers will typically designate one measure as a primary outcome, the one that will be used to arrive at a decision on the overall result of the study and that represents the greatest therapeutic benefit.62 Secondary outcomes are other endpoint measures that may also be used to assess the effectiveness of the intervention, as well as side effects, costs, or other outcomes of interest.

➤ The primary outcome in the FaME exercise study was the proportion of patients reaching the recommended target of 150 minutes of moderate to vigorous physical activity per week at 12 months postintervention, based on a comprehensive self-report questionnaire. Secondary outcomes included number of falls, quality of life, balance and falls efficacy, and cost. A significantly higher proportion of the FaME participants achieved the physical activity target and showed better balance and decreased fall rate compared to the other two groups, but there were no differences in quality of life. The researchers also established that the FaME program was more expensive, averaging £1,740 (about $2,000 to us on the other side of the pond in 2017).

The decision regarding the primary outcome of a study is important for planning, as it will direct the estimate of the needed sample size, and thereby the power of a study (see Chapter 23). Because this estimate is determined by the expected size of the outcome effect, with multiple outcomes researchers must base it on one value, usually the primary outcome.63 Alternatively, the one outcome variable that has the smallest effect, and thereby would require a larger sample, can be used.

■ Implementation Studies

As evidence mounts over time with efficacy and effectiveness studies, at some point “sufficient” evidence will be accumulated to warrant widespread implementation of a new practice, replacing standard care. This is the crux of the “translation gap.”

For example, despite extraordinary global progress in immunizing children, in 2007 almost 20% of the children born each year (24 million children) did not get the complete routine immunizations scheduled for their first year of life. In most cases, this occurred in poorly served remote rural areas, deprived urban settings, and fragile and strife-torn regions.64 This is not a problem of finding an effective intervention—it is a problem of implementation.

Implementation science is the next step beyond effectiveness research. It focuses on understanding the influence of environment and resources on whether research findings are actually translated to practice. Once trials have established that an intervention is effective, the question then turns to “how to make it happen.”

Implementation science is the study of methods to promote the integration of research findings and evidence into health-care policy and practice. It seeks to understand the behavior of health-care professionals and other stakeholders as a key variable in the sustainable uptake, adoption, and implementation of evidence-based interventions.65

This approach incorporates a broader scope than traditional clinical research, looking not only at the patient, but more so at the provider, organizational systems, and policy levels.

Implementation Science and Evidence-Based Practice

There is a key distinction between implementation studies and the evidence-based practices they seek to implement. An implementation study is focused on ways to change behavior, often incorporating education and training, team-based efforts, community engagement, or systemic infrastructure redesign. It is not intended to demonstrate the health benefits of a clinical intervention, but rather the rate and quality of the use of that intervention66 and which adaptations are needed to achieve the relevant outcomes.67

➤ CASE IN POINT #3
Quanbeck et al68 reported on the feasibility and effectiveness of an implementation program to improve adherence to clinical guidelines for opioid prescribing in primary care. They used a combination of quantitative and qualitative methods to determine the success of a 6-month intervention comparing four intervention clinics and four control clinics.

Types of Implementation Studies

Understanding implementation issues requires going beyond efficacy and effectiveness to recognize the geographic, cultural, socioeconomic, and regulatory contexts that influence utilization patterns.69 The acceptance and adoption of interventions, no matter how significant clinical studies are, must be based on what works, for whom, under which circumstances, and whether the intervention can be scaled up sufficiently to enhance outcomes. There is an element of ethical urgency in this process, to find the mechanisms to assure that individuals and communities are getting the best care available.70
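The sample-size reasoning discussed under Primary and Secondary Outcomes can be made concrete. The following is a minimal sketch, assuming a generic two-group comparison of means with a normal approximation (a textbook formula, not a method taken from any of the studies described here); it shows why the outcome with the smallest expected effect demands the largest sample:

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Sample size per group for a two-sample comparison of means,
    using the normal approximation, where d is the standardized
    effect size (Cohen's d) expected for the chosen outcome."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value, two-sided test
    z_beta = norm.ppf(power)           # value corresponding to desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Smaller expected effects require larger samples, which is why the
# estimate must be anchored to one designated outcome:
for d in (0.5, 0.3, 0.2):
    print(f"d = {d}: n = {n_per_group(d)} per group")
```

With these conventional settings the sketch yields 63, 175, and 393 participants per group, respectively, so basing the estimate on the outcome with the smallest expected effect yields the most conservative (largest) sample.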
Implementation studies can address many types of intervention, including looking at clinical performance audits, the use of patient or provider alerts to remind them of guidelines, the influence of local leaders to adopt best practices, the investigation of consensus as a foundation for problem management, or patient education interventions.71

➤ The intervention in the opioid study consisted of a combination of monthly site visits and teleconferences, including consultations and facilitated discussions. Teams of six to eight staff members at each clinic included a combination of physician, nurse, medical assistant, and administrative personnel. Teams reviewed opioid prescribing guidelines and how to implement mental health screening and urine drug testing.

Implementation studies are closely related to quality improvement (QI) techniques. The difference lies in the focus of QI on a practical and specific system issue leading to local and immediate changes, usually with administrative or logistic concerns.72 Implementation studies address utilization with the intent of generalizing beyond the individual system under study to populations, applying research principles, typically across multiple sites.

Implementation Outcomes

Implementation outcomes involve variables that describe the intentional activities to deliver services.72 They typically focus on feasibility, acceptance, adoption, cost, and sustainability—all serving as indicators of success. These variables also provide insight into how the program contributes to health outcomes.

➤ The opioid study focused on attendance of clinic staff at monthly sessions and staff satisfaction with the meetings. At 6 and 12 months, they found significant improvement in the intervention clinics compared to controls. There was a reduction in prescribing rates, average daily dose for patients on long-term therapy, and the number of patients receiving mental health screens and urine tests. Costs were assessed and adjustments made as needed.

Data for implementation studies may be quantitative, based on structured surveys about attitudes or behaviors, or administrative data such as utilization rates. Qualitative data are also often collected to better understand the context of implementation, including semi-structured interviews, focus groups, and direct observation.

➤ The opioid study used qualitative data from focus groups, interviews, and ethnographic techniques to determine which types of adaptations were needed. They found that a personal touch was helpful to promote sustained engagement, with frequent communication at each stage being key. They also realized the need for clear expectations for team roles and responsibilities, as well as for linking the implementation process with the clinic’s organizational workflows, policies, and values.
COMMENTARY
Knowing is not enough; we must apply. Willing is not enough; we must do.
—Johann Wolfgang von Goethe (1749–1832)
German writer and statesman
So much to think about. So much that still needs to be done. Which studies will tell us what we need to know—RCTs or PCTs? The answer may be less than satisfying—we need both.15 They give us different information and we need to put each in perspective, use each for the right purpose.

Efficacy studies are often referred to as “proof of concept,” to demonstrate if an intervention can work. If the answer is yes, the next step is to test it under “real-world” conditions. But real-world conditions are quite a conundrum in “research world,” which tries to be orderly, logical, and consistent. Efficacy studies try to avoid confounding issues such as diverse patients or noncompliance that can interfere with therapeutic benefits. And effectiveness studies struggle with the lack of control that defines them, recognizing the potential for bias and the difficulty in demonstrating significant effects because of so much variability.

So we are left with the need to consider evidence across the continuum, with the onus on both practitioners and researchers to exchange ideas, understand priorities, and contribute to setting research agendas. It also means that both are responsible for creating information in a relevant format and using that information to make decisions. Collaboration is an essential approach to making translation a reality.
43. Roland M, Torgerson DJ. What are pragmatic trials? BMJ 1998;316(7127):285.
44. Godwin M, Ruhland L, Casson I, MacDonald S, Delva D, Birtwhistle R, Lam M, Seguin R. Pragmatic controlled clinical trials in primary care: the struggle between external and internal validity. BMC Medical Research Methodology 2003;3:28.
45. Horn SD, Gassaway J. Practice-based evidence study design for comparative effectiveness research. Med Care 2007;45(10 Suppl 2):S50–S57.
46. Horn SD, DeJong G, Deutscher D. Practice-based evidence research in rehabilitation: an alternative to randomized controlled trials and traditional observational studies. Arch Phys Med Rehabil 2012;93(8 Suppl):S127–S137.
47. Sidorkiewicz S, Tran V-T, Cousyn C, Perrodeau E, Ravaud P. Discordance between drug adherence as reported by patients and drug importance as assessed by physicians. Ann Fam Med 2016;14(5):415–421.
48. Agency for Healthcare Research and Quality: Outcomes Research. Fact Sheet. Available at https://archive.ahrq.gov/research/findings/factsheets/outcomes/outfact/outcomes-and-research.html. Accessed January 12, 2017.
49. Semmelweis I. The Etiology, Concept, and Prophylaxis of Childbed Fever. Madison, WI: University of Wisconsin Press, 1983.
50. Nightingale F. Introductory Notes on Lying-In Institutions, Together with a Proposal for Organising an Institution for Training Midwives and Midwifery Nurses. London: Longmans Green & Co, 1871.
51. Mallon BE: Amory Codman: pioneer New England shoulder surgeon. New England Shoulder & Elbow Society. Available at http://neses.com/e-amory-codman-pioneer-new-england-shoulder-surgeon/. Accessed January 11, 2017.
52. Geyman JP. POEMs as a paradigm shift in teaching, learning, and clinical practice. Patient-oriented evidence that matters. J Fam Pract 1999;48(5):343–344.
53. Center for Evidence-based Physiotherapy: Physiotherapy Evidence Database. Available at http://www.pedro.fhs.usyd.edu.au/index.html. Accessed October 17, 2004.
54. Food and Drug Administration: Guidance for industry: patient-reported outcome measures: use in medical product development to support labeling claims; 2009. Available at http://www.fda.gov/downloads/drugs/guidances/ucm193282.pdf. Accessed January 18, 2017.
55. Alcock L, O’Brien TD, Vanicek N. Age-related changes in physical functioning: correlates between objective and self-reported outcomes. Physiother 2015;101(2):204–213.
56. Hays RD, Morales LS. The RAND-36 measure of health-related quality of life. Ann Med 2001;33(5):350–357.
57. Bilbao A, Escobar A, García-Perez L, Navarro G, Quirós R. The Minnesota Living with Heart Failure Questionnaire: comparison of different factor structures. Health Qual Life Outcomes 2016;14(1):23.
58. Breckenridge JD, McAuley JH. Shoulder Pain and Disability Index (SPADI). J Physiother 2011;57(3):197.
59. Duncan PW, Bode RK, Min Lai S, Perera S. Rasch analysis of a new stroke-specific outcome scale: the Stroke Impact Scale. Arch Phys Med Rehabil 2003;84(7):950–963.
60. Perret-Guillaume C, Briancon S, Guillemin F, Wahl D, Empereur F, Nguyen Thi PL. Which generic health related Quality of Life questionnaire should be used in older inpatients: comparison of the Duke Health Profile and the MOS Short-Form SF-36? J Nutr Health Aging 2010;14(4):325–331.
61. Allen D, Talavera C, Baxter S, Topp K. Gaps between patients’ reported current and preferred abilities versus clinicians’ emphases during an episode of care: Any agreement? Qual Life Res 2015;24(5):1137–1143.
62. Sedgwick P. Primary and secondary outcome measures. BMJ 2010;340.
63. Choudhary D, Garg PK. Primary outcome in a randomized controlled trial: a critical issue. Saudi J Gastroenterol 2011;17(5):369.
64. World Health Organization, UNICEF, World Bank. State of the World’s Vaccines and Immunization. 3rd ed. Geneva: World Health Organization, 2009.
65. NIH Fogarty International Center: Implementation Science Information and Resources. Available at https://www.fic.nih.gov/researchtopics/pages/implementationscience.aspx. Accessed January 10, 2017.
66. Bauer MS, Damschroder L, Hagedorn H, Smith J, Kilbourne AM. An introduction to implementation science for the non-specialist. BMC Psychology 2015;3(1):32.
67. Chambers DA, Glasgow RE, Stange KC. The dynamic sustainability framework: addressing the paradox of sustainment amid ongoing change. Implement Sci 2013;8:117.
68. Quanbeck A, Brown RT, Zgierska AE, Jacobson N, Robinson JM, Johnson RA, Deyo BM, Madden L, Tuan W, Alagoz E. A randomized matched-pairs study of feasibility, acceptability, and effectiveness of systems consultation: a novel implementation strategy for adopting clinical guidelines for opioid prescribing in primary care. Implement Sci 2018;13(1):21.
69. Edwards N, Barker PM. The importance of context in implementation research. J Acquir Immune Defic Syndr 2014;67 Suppl 2:S157–S162.
70. Solomon MZ. The ethical urgency of advancing implementation science. Am J Bioeth 2010;10(8):31–32.
71. Jette AM. Moving research from the bedside into practice. Phys Ther 2016;96(5):594–596.
72. Peters DH, Adam T, Alonge O, et al. Implementation research: what it is and how to do it. BMJ 2013;347:f6753.
CHAPTER 3
Defining the Research Question
The start of any research effort is identification of the specific question that will be investigated. This is the most important and often most difficult part of the research process because it controls the direction of all subsequent planning and analysis.

The purpose of this chapter is to clarify the process for developing and refining a feasible research question, define the different types of variables that form the basis for a question, describe how research objectives guide a study, and discuss how the review of literature contributes to this process. Whether designing a study or appraising the work of others, we need to appreciate where research questions come from and how they direct study methods, analysis, and interpretation of outcomes.

■ Selecting a Topic

The research process begins when a researcher identifies a broad topic of interest. Questions grow out of a need to know something that is not already known, resolve a conflict, confirm previous findings, or clarify some piece of information that is not sufficiently documented. Researchers are usually able to identify interests in certain patient populations, specific types of interventions, clinical theory, or fundamental policy issues.

➤ CASE IN POINT #1
To illustrate the process of identifying a research problem, let us start with an example. Assume a researcher is interested in exploring issues related to pressure ulcers, a common and serious health problem. Let us see how this interest can be developed.

■ The Research Problem

Once a topic is identified, the process of clarifying a research problem begins by sorting through ideas based on clinical experience and observation, theories, and professional literature to determine what we know and what we need to find out. The application of this information will lead to identification of a research problem that will eventually provide the foundation for delineating a specific question that can be answered in a single study.

Clarify the Problem

Typically, research problems start out broad, concerned with general clinical problems or theoretical issues. This is illustrated in the top portion of Figure 3-1, showing a process that will require many iterations of possible ideas that may go back and forth. These ideas must be manipulated and modified several times before they become narrowed sufficiently to propose a specific question. Beginning researchers are often unprepared for the amount of time and thought required to formulate and hone a precise question that is testable. It is sometimes frustrating to accept that only one small facet of a problem can be addressed in a single study. First, the topic has to be made more explicit, such as:

How can we effectively manage pressure ulcers?
[Figure 3–1 Five-step process for developing a research question. Portions of these steps may vary, depending on the type of research question being proposed. Two of the steps: O—What are the outcomes that you will study? These are your dependent variables. You may specify primary and secondary outcomes. These may be intervention effects or predictive associations. C—Will you study any comparison variables? These will define levels of the independent variable, including comparing interventions, relative validity of diagnostic tests, or relative differences between risk factors.]
This problem can be addressed in many ways, leading to several different types of studies. For example, one researcher might be interested in different interventions and their effect on healing time. Another might be concerned with identifying risk factors that can help to develop preventive strategies. Another might address the impact of pressure ulcers on patients’ experience of care. Many other approaches could be taken to address the same general problem, and each one can contribute to the overall knowledge base in a different way.

The development of a question as the basis for a research study must be distinguished from the process for development of a clinical question for evidence-based practice (EBP). In the latter instance, the clinician formulates a question to guide a literature search that will address a particular decision regarding a patient’s intervention, diagnosis, or prognosis (see Chapter 5). In both instances, however, variables and outcome measures must be understood to support interpretation of findings.

Clinical Experience

Research problems often emerge from some aspect of practice that presents a dilemma. A clinician’s knowledge, experience, and curiosity will influence the types of questions that are of interest. Discussing issues with other researchers or colleagues can often generate interesting ideas.

We may ask why a particular intervention is successful for some patients but not for others. Are new treatments more effective than established ones? Does one clinical problem consistently accompany others? What is the natural progression of a specific injury or disease? Often,
➤ In a systematic review of behavioral and educational interventions to prevent pressure ulcers in adults with spinal cord injury, Cogan et al7 found little evidence to support the success of these strategies, noting considerable methodological problems across studies with recruitment, intervention fidelity, participant adherence, and measurement of skin breakdown. These limitations make comparison of findings and clinical applications difficult.

Authors should address these limitations in discussion sections of publications and offer suggestions for further study.

Descriptive Research

Research questions arise out of data from descriptive studies, which document trends, patterns, or characteristics that can subsequently be examined more thoroughly using alternative research approaches.

➤ Spilsbury et al8 provided qualitative data to characterize the impact of pressure ulcers on a patient’s quality of life. Their findings emphasized the need to provide information to patients, understand the importance of comfort in positioning, and management of dressings, all of which contribute to the effectiveness of care.

Analysis of these elements can become the building blocks of other qualitative and quantitative research questions.

Replication

In many instances, replication of a study is a useful strategy to confirm previous findings, correct for design limitations, study different combinations of treatments, or examine outcomes with different populations or in different settings. A study may be repeated using the same variables and methods or slight variations of them.

➤ Researchers studied the effect of high-voltage electrical stimulation on healing in stage II and III pressure ulcers, and demonstrated that it improved the healing rate.9 They also suggested, however, that their study needed replication to compare the effectiveness of using cathodal and anodal stimulation combined or alone and to determine the optimal duration of these two types of electrical stimulation.

Replication is an extremely important process in research because one study is never sufficient to confirm a theory or to verify the success or failure of a treatment. We are often unable to generalize findings of one study to a larger population because of the limitations of small sample size in clinical studies. Therefore, the more studies we find that support a particular outcome, the more confidence we can have in the validity of those findings (see Focus on Evidence 3-1).

■ The Research Rationale

Once the research problem has been identified, the next step is to dig deeper to develop a rationale for the research question to understand which variables are relevant, how they may be related, and which outcomes to expect. The rationale will guide decisions about the study design and, most importantly, provide the basis for interpreting results. It presents a logical argument that shows how and why the question makes sense, as well as provides a theoretical framework that offers a justification for the study (Fig. 3-1). A full review of the literature is essential to understand this foundation. For this example, we will consider a question
Focus on Evidence 3–1
The Sincerest Form of Flattery: But When Is It Time to Move On?

Although replication is important to confirm clinical findings, it is also important to consider its limits as a useful strategy. There will come a point when the question must move on to further knowledge, not just repeat it. Fergusson et al10 provide an illustration of this point in their discussion of the use of the drug aprotinin to reduce perioperative blood loss. Through meta-analysis, they showed that 64 trials on the drug’s effectiveness were published between 1987 and 2002, although the effectiveness had been thoroughly documented by 1992 after 12 trials. They suggested that the following 52 trials were unnecessary, wasteful, and even unethical, and that authors “were not adequately citing previous research, resulting in a large number of RCTs being conducted to address efficacy questions that prior trials had already definitively answered.” Replication is a reasonable approach, but only when prior research has not yet arrived at that threshold.

Many journals now require that authors have conducted a comprehensive review of literature, with justification for their question based on prior research. Authors need to provide a clear summary of previous research findings, preferably through use of a systematic review, and explain how their trial’s findings contribute to the state of knowledge. When such reviews do not exist, authors are encouraged to do their own or describe in a structured way the qualitative association between their research and previous findings. Such practices will limit unnecessary clinical research, protect volunteers and patients, and guard against the abuse of resources.11
related to different types of dressings and their effect on the healing of pressure ulcers. Now let us see how we can refine this question based on a review of literature.

Review of Literature

Every research question can be considered an extension of all the thinking and investigation that has gone before it. The results of each study contribute to that accumulated knowledge and thereby stimulate further research. For this process to work, researchers must be able to identify prior relevant research and theory through a review of literature.

Start With a Systematic Review

The first strategy in this process should be reading recent systematic reviews or meta-analyses that synthesize the available literature on a particular topic, provide a critical appraisal of prior research, and help to summarize the state of knowledge on the topic (see Chapter 35). Importantly, systematic reviews can also establish the state of knowledge in an area of inquiry so that newer studies will build on that knowledge rather than repeat it (see Focus on Evidence 3-1).

➤ Reddy et al12 did a systematic review of studies that looked at various therapies for pressure ulcers. Out of 103 RCTs, 54 directly evaluated different types of dressings. The studies were published from 1978 to 2008. They focused on patients of varied ages, in acute and long-term care, rehabilitation, and home settings, used varied outcome measures, and studied outcomes between 1 and 24 weeks. The authors concluded that the evidence did not show any one single dressing technique to be superior to others. However, most studies had low methodological quality (see Chapter 37) and they recognized the need for further research to better understand comparative effectiveness.

Systematic reviews are an important first step but may not be sufficient, as they often restrict the kind of studies that are included; therefore, a more in-depth review will likely be necessary to include all relevant information.

Scope of the Review of Literature

Clinicians and students are often faced with a dilemma in starting a review of literature in terms of how extensive a review is necessary. How does a researcher know when a sufficient amount of material has been read? There is no magic formula to determine that 20, 50, or 100 articles will provide the necessary background for a project. The number of references needed depends first on the researcher's familiarity with the topic. A beginning researcher may have limited knowledge and experience, and might have to cover a wider range of materials to feel comfortable with the information.

In addition, the scope of the review will depend on how much research has been done in the area and how many relevant references are available. Obviously, when a topic is new and has been studied only minimally, fewer materials will exist. In that situation, it is necessary to look at studies that support the theoretical framework for a question or related research questions.

Reviewing the literature occurs in two phases as part of the research process. The preliminary review is conducted as part of the identification of the research problem, intended to achieve a general understanding of the state of knowledge. Once the problem has been clearly formulated, however, the researcher begins a more comprehensive review to provide a complete understanding of the relevant background for the study.

When a topic has been researched extensively, the researcher need only choose a representative sample of articles to provide sufficient background. After some review, it is common to reach a saturation point, when you will see the same articles appearing over and over in reference lists—and you will know you have hit the most relevant sources. The important consideration is the relevancy of the literature, not the quantity. Researchers will always read more than they will finally report in the written review of literature. The volume of literature reviewed may also depend on its purpose. For instance, a thesis or dissertation may require a fully comprehensive review of literature.

Strategies for carrying out a literature search are described in Chapter 6.

Establishing the Theoretical Framework

The review of literature should focus on several aspects of the study. As we have already discussed, the researcher tries to establish a theoretical rationale for the study based on generalizations from other studies (see Fig. 3-2). It is often necessary to review material on the patient population to understand the underlying pathology that is being studied. Researchers should also look for information on methods, including equipment used and operational definitions of variables. Often it is helpful to replicate procedures from previous studies so that results have a basis for comparison. It may be helpful to see which statistical techniques have been used by others for the same reason.
Figure 3–2 Development of a research question for an explanatory study of intervention for pressure ulcers. Based on Stephen S, Agnihotri M, Kaur S. A randomized controlled trial to assess the effect of topical insulin versus normal saline in pressure ulcer healing. Ostomy Wound Manage 2016;62(6):16–23. The figure traces these steps:

The Problem: Pressure ulcers are a common and costly complication for patients in hospitals and long-term care facilities. Treatments include varied types of dressings.

The Rationale: Treatment recommendations for pressure ulcers vary, with few comparative studies to document best approaches. Normal saline is a commonly used irrigation and dressing solution. Insulin has been successful in treating chronic wounds, with studies suggesting that it has the ability to enhance proliferation of endothelial cells.

Type of Research: Randomized controlled comparative effectiveness trial.

P (Population): Hospitalized patients with grade 2 or 3 pressure ulcers in intensive care, trauma, and surgical wards. I (Intervention): Insulin dressing.

Hypothesis: Insulin will be a safe and effective dressing agent for pressure ulcer healing when compared to normal saline in hospitalized patients with grade 2 or 3 ulcers.
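For readers who think in code, the PICO elements that this chapter keeps returning to can be represented as a simple data structure. This is an illustrative sketch only: the class and field names are mine, not part of the text, and the comparison element is taken from the hypothesis in Figure 3-2.

```python
from dataclasses import dataclass

@dataclass
class PICOQuestion:
    """Illustrative container for the PICO elements of a research question."""
    population: str    # P: who is being studied
    intervention: str  # I: the condition being tested
    comparison: str    # C: the alternative condition
    outcome: str       # O: the response that is measured

# The explanatory question from Figure 3-2, expressed in PICO terms:
question = PICOQuestion(
    population="Hospitalized patients with grade 2 or 3 pressure ulcers",
    intervention="Topical insulin dressing",
    comparison="Normal saline dressing",
    outcome="Pressure ulcer healing",
)
print(f"P: {question.population}")
print(f"I: {question.intervention} vs. C: {question.comparison}")
print(f"O: {question.outcome}")
```

Writing the question out this way makes it obvious when an element is missing, which is exactly the check the PICO framework is meant to enforce.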
given outcome. This is the I and C in PICO. The outcome variable is called the dependent variable, which is a response or effect that is presumed to vary depending on the independent variable. This is the O in PICO.

Explanatory Studies

Explanatory studies involve comparison of different conditions to investigate causal relationships, where the independent variable is controlled and the dependent variable is measured.

➤ Djavid et al13 evaluated the effect of low-level laser therapy on chronic LBP. They studied 63 patients, randomly assigned to receive laser therapy and exercise, laser therapy alone, or placebo laser therapy plus exercise, with treatment performed twice a week for 6 weeks. The outcome measures were pain using a visual analog scale (VAS), lumbar range of motion (ROM), and disability.

The independent variable in this study is the treatment (intervention) and the dependent variables are pain level, ROM, and disability. A change in the dependent variables is presumed to be caused by the independent variable.

Comparative studies can be designed with more than one independent variable. We could look at the patients' sex in addition to intervention, for instance, to determine whether effectiveness of treatment is different for males and females. We would then have two independent variables: type of intervention and sex. A study can also have more than one dependent variable. In this example, the researchers included three dependent variables.

The Intervention and Comparison components of PICO are the levels of the independent variable in an explanatory study.
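The distinction between an independent variable and its levels can be made concrete with a small sketch of the laser therapy design. The structure below is illustrative only; the names are mine, not the study protocol's.

```python
# One independent variable (the intervention) with three levels,
# and three dependent variables (the outcome measures).

independent_variables = {
    "intervention": [                # one variable ...
        "laser therapy + exercise",  # ... with three levels
        "laser therapy alone",
        "placebo laser + exercise",
    ],
}
dependent_variables = ["pain (VAS)", "lumbar ROM", "disability"]

# One independent variable with three levels -- not three variables:
assert len(independent_variables) == 1
assert len(independent_variables["intervention"]) == 3
# Dependent variables are simply measured; they are not described as having levels:
assert len(dependent_variables) == 3
```

Keeping the levels nested under the single variable, rather than listing them side by side, mirrors the conceptual point: the three conditions are values of one variable, to be compared with one another.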
Levels of the Independent Variable

In comparative studies, independent variables are given "values" called levels. The levels represent groups or conditions that will be compared. Every independent variable will have at least two levels. Dependent variables are not described as having levels.

In the example of the study of laser therapy and LBP, the independent variable has three levels: laser therapy + exercise, laser therapy alone, and placebo + exercise (see Fig. 3-3).

Figure 3–3 The independent variable (intervention) and its levels, such as laser therapy + exercise, and the dependent variables (outcomes), such as change in back pain (VAS).

It is important to distinguish levels from variables. This study of laser therapy has one independent variable with three levels, not three independent variables. Dependent variables are not described as having levels.

Exploratory Studies

In exploratory studies, independent and dependent variables are usually measured together to determine whether they have a predictive relationship.

➤ Researchers were interested in predictors of chronic LBP in an office environment.14 They studied 669 otherwise healthy office workers to assess individual, physical, and psychological variables such as body mass index, exercise and activity patterns, work habits, and job stress. They looked at the relationship among these variables and onset of LBP over 1 year (see Fig. 3-4).

The dependent outcome variable in this study was the onset of back pain (the O in PICO) and the independent predictor variables were the individual, physical, and psychological variables (the I in PICO). These types of studies often involve several independent variables, as the researcher tries to establish how different factors interrelate to explain the outcome variable. These studies are not concerned with cause and effect, and the comparison component of PICO would be ignored.

Qualitative research studies do not incorporate comparisons or predictive relationships. They use a different process for development of a research question, which will be described in Chapter 21.

Operational Definitions

Once the variables of interest have been identified, the researcher still faces some major decisions. Exactly which procedures will be used? How will we measure a change in back pain? As the research question continues to be refined, we continually refer to the literature and our clinical experience to make these judgments. How often will subjects be treated and how often will they be tested? These questions must be answered before adequate definitions of variables are developed.

Variables must be defined in terms that explain how they will be used in the study. For research purposes, we distinguish between conceptual and operational definitions. A conceptual definition is the dictionary definition, one that describes the variable in general terms, without specific reference to its methodological use in a study. For instance, "back pain" can be defined as the degree of discomfort in the back. This definition is useless, however, for research purposes because it does not tell us what measure of discomfort is used or how we could interpret discomfort.

In contrast, an operational definition defines a variable according to its unique meaning within a study. The operational definition should be sufficiently detailed that another researcher could replicate the procedure or measurement. Independent variables are operationalized according to how they are manipulated by the investigator. Dependent variables are operationally defined by describing the method of measurement, including delineation of tools and procedures used to obtain measurements.
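As a concrete illustration, back pain operationalized as a score on a 100-mm visual analog scale might be sketched as follows. The 0- and 100-mm anchors are conventional for a VAS, but the function name and the scores below are hypothetical.

```python
# Conceptual definition: "back pain" is the degree of discomfort in the back.
# Operational definition (sketch): back pain is the position of the patient's
# mark on a 100-mm visual analog scale (VAS), where 0 mm = no pain and
# 100 mm = worst imaginable pain, recorded at baseline and after treatment.

def back_pain_vas(mark_position_mm: float) -> float:
    """Return back pain operationalized as a mark on a 100-mm VAS line."""
    if not 0.0 <= mark_position_mm <= 100.0:
        raise ValueError("A VAS mark must lie between 0 and 100 mm")
    return mark_position_mm

baseline = back_pain_vas(62.0)         # hypothetical baseline score
after_treatment = back_pain_vas(31.0)  # hypothetical follow-up score
change = baseline - after_treatment    # the dependent variable: change in VAS
print(f"Change in back pain: {change:.0f} mm on a 100-mm VAS")
```

Note how the sketch pins down exactly what the vague phrase "degree of discomfort" could not: the instrument, the units, the valid range, and when the measurement is taken.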
Figure 3–4 Process of developing a research question for an observational study to predict onset of LBP in an office environment. Based on Sihawong R, Sitthipornvorakul E, Paksaichol A, Janwantanakul P. Predictors for chronic neck and low back pain in office workers: a 1-year prospective cohort study. J Occup Health 2016;58(1):16–24. The figure traces these steps:

The Problem: Office work may require sitting long periods using a computer, working in awkward positions, or performing repetitive manual tasks. More than 50% of office workers complain of LBP, which can recur throughout their lifetime.

The Rationale: Evidence suggests that risk factors for the onset of neck and low back pain in office workers are different from those identified in a general population. Risks may differ across work settings, and there is a need to identify predictors in an office environment.

Type of Research: Prospective observational cohort study.

Hypothesis: Individual, physical, and psychological factors will be predictive of onset of chronic low back pain in a cohort of office workers.
Everything Needs to Be Defined!

There are a few variables that do not require substantial operational definitions, such as height and weight, which can be defined simply by specifying units of measurement and the type of measuring instrument. But most variables, even those whose definitions appear self-evident, still present sufficient possibilities for variation that they require explanation. Think about the debate over the interpretation of a "hanging chad" in the outcome of the 2000 presidential election! In a more relevant example, consider the concept of gender, which can encompass varied identities, or even strength, which can be measured in a variety of ways.

There are many examples of concepts for which multiple types of measurements may be acceptable. Researchers often find themselves faced with having to choose which method best represents the variable for purposes of a single study. For example, disability due to back pain can be measured using the Oswestry Disability Rating Scale or the Roland Morris Disability Questionnaire, among many others. Similarly, there are many different scales used to measure quality of life, depression, cognitive status, and so on. The results of studies using these different tools would be analyzed and interpreted quite differently because of the diverse information provided by each tool.15 The measurement properties, feasibility of use, and sensitivity of an instrument should be considered in choosing the most appropriate dependent variable. A comparison of measurement methods can also become the basis for a research question.
Hypothesis 5 is considered directional because the authors predict the presence of a relationship between two variables and the direction of that relationship. We can expect that patients with a shorter length of stay will tend to have poorer function. Hypothesis 6 does not tell us the expected direction of the proposed relationship (see Fig. 3-4).

Simple or Complex Hypotheses

Research hypotheses can be phrased in simple or complex forms. A simple hypothesis includes one independent variable and one dependent variable. For example, Hypotheses 1, 4, 5, and 6 are simple hypotheses. A complex hypothesis contains more than one independent or dependent variable. Hypotheses 2 and 3 contain several dependent variables. Complex hypotheses are often nondirectional because of the potential difficulty in clarifying multiple relationships. Complex hypotheses are efficient for expressing expected research outcomes in a research report, but they cannot be tested. Therefore, for analysis purposes, such statements must be broken down into several simple hypotheses. Several hypotheses can be addressed within a single study.

Guiding Questions

The purpose of descriptive studies will usually be based on guiding questions that describe the study's purpose. For instance, researchers may specify that they want to describe attitudes about a particular issue, the demographic profile of a given patient group, or the natural progression of a disease. Survey researchers will design a set of questions to organize a questionnaire (see Chapter 11). Because descriptive studies have no fixed design, it is important to put structure into place. Setting guiding questions allows the researcher to organize data and discuss findings in a meaningful way.

Specific Aims or Objectives

The specific aims of a study describe the purpose of a project, focusing on what the study seeks to accomplish. The statements reflect the relationships or comparisons that are intended and may reflect quantitative or qualitative goals. Aims may also expand on the study's hypothesis based on various analysis goals.

For example, consider again Hypothesis 6:

Onset of LBP is related to individual, physical, and psychological factors in a cohort of office workers.14

This was the overall goal of the study, but several approaches were incorporated as part of the study. The aims of this study were:

1. To document test-retest reliability of measures from questionnaire and physical examination outcomes.
2. To describe the demographic characteristics of workers who experience chronic neck and back pain.
3. To determine risk factors for chronic neck and back pain based on physical characteristics, work patterns, and psychological work demands.

A single study may have several aims, typically breaking down the hypothesis into analysis chunks. These define the scope of the study. Funding agencies often require statements of specific aims in the introduction to a grant proposal. The number of aims should be limited so that the scope of the study is clear and feasible.

■ What Makes a Good Research Question?

Throughout the process of identifying and refining a research question, four general criteria should be considered to determine that it is worth pursuing: the question should be important, answerable, ethical, and feasible for study.

Importance

Clinical research should have potential impact on treatment, understanding risk factors or theoretical foundations, or on policies related to practice. The time, effort, cost, and potential human risk associated with research must be "worth it." Researchers should look at trends in practice and health care to identify those problems that have clinical or professional significance for contributing to EBP. The research problem and rationale will generally support the importance of the study (see Figs. 3-2 and 3-3).

Statements that support the importance of the research question are often considered the justification for a study—every study should be able to pass the "so what?" test; that is, the results should be relevant, meaningful, and useful.

It may be relevant to ask how often this clinical problem occurs in practice. Will the findings provide useful information for clinical decision-making or be generalizable to other clinical situations? Will others be interested in the results? For example, in the earlier example of pressure ulcers, the prevalence and incidence of this health problem was cited to support the importance of the research.

New Information

Research studies should provide new information (see Focus on Evidence 3-1). Sometimes, the project addresses
an innovation or novel idea, but more often the research will build on something that has already been studied in some way. Confirmation of previous findings may be important, as will variations in subjects or methods or clarification of inconsistencies.

Ethical Standards

A good research question must conform to ethical standards in terms of protection of human rights and researcher integrity (see Chapter 7). As part of the planning process, the researcher must be able to justify the demands that will be placed on the subjects during data collection in terms of the potential risks and benefits. The data must adhere to confidentiality standards. If the variables of interest pose an unacceptable physical, emotional, or psychological risk, the question should be reformulated. Questions about the ethical constraints in human studies research should be directed to an Institutional Review Board to assure compliance in the initial stages of study development.

Feasibility

Many factors influence the feasibility of a research study. Because of the commitment of time and resources required for research, a researcher should recognize the practicalities of carrying out a project and plan sufficiently to make the efforts successful.

Sample Size

Given the variables under study and the size of the projected outcomes, can a large enough sample be recruited to make the study worthwhile? This relates to the power of a study, or the likelihood that true differences can be demonstrated statistically (see Chapter 23). Preliminary estimates can be determined and researchers are often surprised by the number of subjects they will need. Recruitment strategies need to be considered, as well as the available pool of potential subjects (see Chapter 13). The choice of outcome variable can also make a difference and the researcher may look for measurements that will demonstrate larger effects.

Available Resources

Time, funding, space, personnel, equipment, and administrative support are all examples of resources that can make or break a study. Can a realistic timetable be developed? Pilot studies are helpful for estimating the time requirements of a study. What type of space or specialized equipment is needed to carry out the project, and will they be accessible? Will the necessary people be available to work on the project and can a realistic budget be developed? Resource needs should be assessed at the outset and modifications made as needed.

Scope of the Project

One of the most difficult parts of developing a good research question is the need to continue to hone the question into a reasonable chunk—not trying to measure too many variables or trying to answer too many questions at once. The enthusiasm generated by a project can make it difficult to sacrifice a side question, but judgment as to reasonable scope will make a successful outcome much more realistic.

Expertise

Those who are responsible for developing the question should have the relevant clinical, technical, and research expertise to understand the nuances of the variables being studied, as well as the ability to administer the study protocol. Consultants may need to be brought in for specific components of a study. This is often the case with statistical assistance, for example. Most importantly, anyone whose expertise is needed to carry out the project should be brought onto the team in the planning phases so that the study is designed to reach the desired goals.
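The dependence of sample size on the projected size of outcomes, noted under Sample Size, can be sketched with the common normal-approximation formula for comparing two group means. This is a rough planning sketch, not a substitute for a formal power analysis (see Chapter 23); the function name is mine.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects per group for a two-sample comparison of means.

    effect_size is a standardized mean difference (Cohen's d). Uses the
    usual normal approximation: n = 2 * (z_[1-alpha/2] + z_[power])^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed significance criterion
    z_beta = z.inv_cdf(power)           # desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Smaller projected effects demand far larger samples:
for d in (0.8, 0.5, 0.2):
    print(f"effect size {d}: about {n_per_group(d)} subjects per group")
```

Running the sketch shows why researchers are "often surprised": halving the expected effect size roughly quadruples the required sample.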
COMMENTARY
You’ve got to be very careful if you don’t know where you are going, because you
might not get there.
—Yogi Berra (1925–2015)
Baseball catcher
Research is about answering questions. But before we can get to the answers, we must be able to ask the right question. Good questions do not necessarily lead to good outcomes, but a poorly conceived question will likely create problems that influence all further stages of a study.21 It is no exaggeration to speak of the extreme importance of the development of the right question. Albert Einstein22 once wrote:

The formation of a problem is far more often essential than its solution . . . to raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science.

A clear research question also allows those who read about the study to understand its intent. It is easy to report data on any response, but those data will be totally irrelevant if they do not form the context of a specific question. Those who read the literature will find themselves lost in a sea of results if a question has not been clearly articulated at the outset. The presentation of results and discussion in an article or presentation should focus on those data that address the question, thereby delimiting the theoretical framework within which the results of the study will be interpreted.

In the process of doing research, novice researchers will often jump to a methodology and design, eager to collect data and analyze it. It is an unfortunate situation that has occurred all too often, when a researcher has invested hours of work and has obtained reams of data but cannot figure out what to do with the information. There is nothing more frustrating than a statistical consultant trying to figure out what analysis to perform and asking the researcher, "What is the question you are trying to answer?" The frustrating part is when the researcher realizes that his question cannot be answered with the data that were collected. Beginning researchers often "spin their wheels" as they search for what to measure, rather than starting their search with the delineation of a specific and relevant question. It does not matter how complex or simple the design—when you are starting on the research road, it is not as important to know how to get the answer as it is to know how to ask the question!23
6113_Ch04_042-052 02/12/19 4:55 pm Page 43
• Theories allow us to predict what should occur—even when they cannot be empirically verified. For instance, through deductions from mathematical theories, Newton was able to predict the motion of planets around the sun long before technology was available to confirm their orbits. The germ theory of disease allows us to predict how changes in the environment will affect the incidence of disease, which in turn suggests mechanisms for control, such as the use of drugs, vaccines, or attention to hygiene.
• Theories stimulate development of new knowledge—by providing motivation and guidance for asking significant clinical questions. On the basis of a theoretical premise, a clinician can formulate a hypothesis that can then be tested, providing evidence to support, reject, or modify the theory. For instance, based on the theory that reinforcement will facilitate learning, a clinician might propose that verbal encouragement will decrease the time required for a patient to learn a home program. This hypothesis can be tested by comparing patients who do and do not receive reinforcement. The results of hypothesis testing will provide additional affirmation of the theory or demonstrate specific situations where the theory is not substantiated.
• Theories provide the basis for asking a question in applied research. As discussed in Chapter 3, research questions are based on a theoretical rationale that provides the foundation for the design and interpretation of the study. Sometimes, there will be sufficient background in the literature to build this framework; other times, the researcher must build an argument based on what is known from basic science. In qualitative or exploratory research, the study's findings may contribute to the development of theory. This theoretical framework is usually discussed within the introduction and discussion sections of a paper.

■ Components of Theories

The role of theory in clinical practice and research is best described by examining the structure of a theory. Figure 4-1 shows the basic organization of scientific thought, building from observation of facts to laws of nature.

The essential building blocks of a theory are concepts. Concepts are abstractions that allow us to classify natural phenomena and empirical observations. From birth we begin to structure empirical impressions of the world around us in the form of concepts, such as "tall" or "far," each of which implies a complex set of recognitions and expectations. We supply labels to sets of behaviors, objects, or processes that allow us to identify them and discuss them. Therefore, concepts must be defined operationally. When concepts can be assigned values, they can be manipulated as variables, so that relationships can be examined. In this context, variables become the concepts used for building theories and planning research.

Constructs

Some physical concepts, such as height and distance, are observable and varying amounts are easily distinguishable. But other concepts are less tangible and can be defined only by inference. Concepts that represent nonobservable behaviors or events are called constructs. Constructs are invented names for abstract variables that cannot be seen directly, also called latent variables. Measurement of a construct can only be inferred by
Figure 4–1 A model of scientific thought, showing the circular relationship between empirical observations and theory, and the integration of inductive and deductive reasoning. (Diagram labels: Empirical Observations; Explanation of Relationships; Theory; Laws; Induction/Theory Development; Deduction/Theory Testing.)
within a system. Where a theory is an explanation of phenomena, a model is a structural representation of the concepts that comprise that theory.

Types of Models

Physical models are used to demonstrate how the real behavior might occur. For example, engineers study models of bridges to examine how stresses on cables and different loading conditions affect the structure. The benefit of such models is that they obey the same laws as the original but can be controlled and manipulated to examine the effects of various conditions in ways that would not otherwise be possible and without risk. Computer simulations are the most recent contributions to the development of physical models.

A process model provides a guide for progressing through a course of action or thought process. The structural image of the transtheoretical model in Figure 4-2 is an example of this approach.

A quantitative model is used to describe the relationship among variables by using numerical representations. Such models usually contain some degree of error resulting from the variability of human behavior and physical characteristics. For instance, a clinician might want to set a goal for increasing the level of strength a patient could be expected to achieve following a period of training. A model that demonstrates the influence of a person's height, weight, and age on muscle strength would be useful in making this determination.5–7 This type of quantitative model can serve as a guide for setting long-term goals and for predicting functional outcomes. Research studies provide the basis for testing these models and estimating their degree of accuracy for making such predictions.

■ Theory Development and Testing

As previous examples illustrate, theories are not discovered, they are created. A set of observable facts may exist, but they do not become a theory unless someone has the insight to understand the relevance of the observed information and pulls the facts together to make sense of them. Surely, many people observed apples falling from trees before Newton was stimulated to consider the force of gravity! Theories can be developed using inductive or deductive processes (see Fig. 4-3).

Figure 4–3 The relationship between deductive and inductive reasoning. (Diagram labels: Specific Observation; Specific Conclusion; General Conclusion; Deductive Reasoning; Inductive Reasoning.)

Deductive Reasoning

Deductive reasoning is characterized by the acceptance of a general proposition, or premise, and the subsequent inferences that can be drawn in specific cases. The ancient Greek philosophers introduced this systematic method for drawing conclusions by using a series of three interrelated statements called a syllogism, containing (1) a major premise, (2) a minor premise, and (3) a conclusion. A classic syllogism will serve as an example:

1. All living things must die. [major premise]
2. Man is a living thing. [minor premise]
3. Therefore, all men must die. [conclusion]

In deductive reasoning, if the premises are true, then it follows that the conclusion must be true. Scientists use deductive logic to ask a research question by beginning with known principles or generalizations and deducing explicit assertions that are relevant to a specific question. The subsequent observations will confirm, reject, or modify the conclusion. Of course, the usefulness of this method is totally dependent on the truth of its premises (see Focus on Evidence 4-1).

HISTORICAL NOTE
The following statement, attributed to the ancient Greek physician Galen of Pergamon (c. 130–210 A.D.), illustrates the potential abuse of logic and evidence:
All who drink of this remedy recover in a short time, except those whom it does not help, who all die. Therefore, it is obvious that it fails only in incurable cases.8

Deductive Theories

The deductive approach to theory building is the intuitive approach, whereby a theory is developed on the basis of great insight and understanding of an event and the variables most likely to impact that event. This type of theory, called a hypothetical-deductive theory, is
Focus on Evidence 4–1
The Deductive Road to a Research Question

Deductive reasoning is often the basis for development of a research question. It is also applied as part of the clinical decision-making process to support treatment options. For example, we might use previous research to reason that exercise will be an effective intervention to prevent falls in the elderly in the following way:
1. Impaired postural stability results in falls.
2. Exercise improves postural stability.
3. Therefore, exercise will decrease the risk of falls.
This system of deductive reasoning produces a testable hypothesis: If we develop an exercise program to improve impaired stability, we should see a decrease in the number of falls. Li et al10 used this logic as the theoretical premise for their study comparing tai chi exercise, stretching, and resistance exercise to improve postural stability in patients with Parkinson’s disease. Carter et al11 designed an exercise program aimed at modifying risk factors for falls in elderly women with osteoporosis. Similarly, Mansfield et al12 demonstrated the effect of a perturbation-based training program on balance reactions in older adults with a recent history of falls. All three studies found that stability exercise programs reduced the incidence of falls or improved balance behaviors, supporting the premise from which the treatment was deduced.
developed with few or no prior observations, and often requires the generation of new concepts to provide adequate explanation. Freud’s theory of personality fits this definition.9 It required that he create concepts such as “id,” “ego,” and “superego” to explain psychological interactions and motivations.

Because they are not developed from existing facts, hypothetical-deductive theories must be continually tested in the “real world” to develop a database that will support them. Einstein’s theory of relativity is an excellent example of this type of theory; it was first advanced in 1905 and is still being tested and refined through research today.

Inductive Reasoning
Inductive reasoning reflects the reverse type of logic, developing generalizations from specific observations. It begins with experience and results in conclusions or generalizations that are probably true. Facts gathered on a sample of events could lead to inferences about the whole. This reasoning gave birth to the scientific method and often acts as the basis for common sense.

Inductive Theories
Inductive theories are data based and evolve through a process of inductive reasoning, beginning with empirically verifiable observations. Through multiple investigations and observations, researchers determine those variables that are related to a specific phenomenon and those that are not. The patterns that emerge from these studies are developed into a systematic conceptual framework, which forms the basis for generalizations. This process involves a degree of abstraction and imagination, as ideas are manipulated and concepts reorganized, until some structural pattern is evident in their relationship.

Glaser and Strauss used the term “grounded theory” to describe the development of theory by reflecting on individual experiences in qualitative research.13 As an example, Resnik and Jensen used this method to describe characteristics of therapists who were classified as “expert” or “average” based on patient outcomes.14 Building on their observations, they theorized that the meaning of “expert” was not based on years of experience, but on academic and work experience, utilization of colleagues, use of reflection, a patient-centered approach to care, and collaborative clinical reasoning.

Most theories are formulated using a combination of both inductive and hypothetical-deductive processes. Observations initiate the theoretical premise and then hypotheses derived from the theory are tested. As researchers go back and forth in the process of building and testing the theory, concepts are redefined and restructured.

Both deductive and inductive reasoning are used to design research studies and interpret research data. Introductory statements in an article often illustrate deductive logic, as the author explains how a research hypothesis was developed. Inductive reasoning is used in the discussion section, where generalizations or conclusions are proposed from the observed data. It is the clinical scientist’s responsibility to evaluate the validity of the information and draw reasonable conclusions. This process occurs along a circular continuum between observation and theory, whereby a theory can be built on observations which must then be tested to confirm them (Fig. 4-1).

Theory Testing
Ultimately, a theory can never be “proven” or even completely confirmed or refuted. Hypotheses are tested to demonstrate whether the premises of the theory hold true in certain circumstances, always with some probability
that the outcomes could be spurious. A replication study may show a different outcome or other logical explanations can be offered. EBP requires that we consider the latest research in the context of our own clinical judgment and experience. We cannot ignore the possibility that any theory will need to be modified or discarded at some point in the future.

When we speak of testing a theory, we should realize that a theory itself is not testable. The validity of a theory is derived through the empirical testing of hypotheses that are deduced from it and from observation of the phenomenon the theory describes. The hypotheses predict the relationships of variables included in the theory. The results of research will demonstrate certain facts, which will either support or not support the hypothesis. If the hypothesis is supported, then the theory from which it was deduced is also supported.

■ Characteristics of Theories

As we explore the many uses of theories in clinical research, we should also consider criteria that can be used to evaluate the utility of a theory.
• A theory should be rational. First and foremost, a theory should provide a reasonable explanation of observed facts that makes sense given current knowledge and beliefs (see Focus on Evidence 4-2).
• A theory should be testable. It should provide a basis for classifying relevant variables and predicting their relationships, thereby providing a means for its own verification; that is, it should be sufficiently developed and clear enough to permit deductions that form testable hypotheses.
• A theory should be economical. It should be an efficient explanation of the phenomenon, using only those concepts that are truly relevant and necessary to the explanation offered by the theory.
• A theory should be relevant. It should reflect that which is judged significant by those who will use it. In this sense, theories become the mirror of a profession’s values and identity.
• A theory should be adaptable. Theories must be consistent with observed facts and the already established body of knowledge but must also be able to adapt to changes in that knowledge as technology and scientific evidence improve. Many theories that are accepted today will be discarded tomorrow (see Box 4-1). Some will be “disproved” by new evidence and others may be superseded by new theories that integrate the older ones.
Focus on Evidence 4–2
The Tomato Effect

Did you know that up until the 19th century, tomatoes were not cultivated or eaten in North America? Although they were a staple in continental European diets, the tomato was shunned in America because it was “known” to be poisonous and so it would have been foolish to eat one. It seems that many European aristocrats would eat them, become ill, and die. What was actually happening was that they were eating off pewter plates, which interacted with the acid in the tomato, and they were dying from lead poisoning.15 But at the time, the fact that tomatoes were poisonous was just that—a known fact.

It was not until June 20, 1820, when an American, Robert Gibbon, stood on the courthouse steps in Salem, New Jersey, and ate a juicy tomato without incident that the myth was finally dispelled and Americans began consuming and cultivating tomatoes.16

This little-known agricultural history provides the derivation of the “tomato effect,” which occurs when an effective treatment is ignored or rejected because it does not “make sense” in light of contemporary theories of disease.17

One relevant example is the historic use of acetylsalicylic acid, marketed under the name “aspirin” by Bayer in 1899, a compound known to relieve pain, fever, and inflammation.18 Among other disorders, this therapy was used successfully in the late 1800s to treat acute rheumatic fever, with several research reports also demonstrating its efficacy for rheumatoid arthritis (RA).

But from 1900 to 1950, medical textbooks barely mention this treatment for RA, coinciding with the acceptance of the “infectious theory of disease” in medicine, which saw bacteria as the prime culprit in disease.19 Because aspirin was a medication for fever and pain, it did not “make sense” that it would have an effect on a chronic infectious process—and so the treatment approach was ignored.

By 1950, the infectious theory was discarded and RA was recognized as a chronic inflammatory disease that would respond to the anti-inflammatory properties of aspirin. Although many newer treatments for RA have since been adopted, this history is an important lesson in how theory can influence research and practice.

Photo courtesy of Rob Bertholf. Found at https://commons.wikimedia.org/
Box 4-1 Finding Meaning in Aging
The disengagement theory of aging was originally proposed to account for observations of age-related decreases in social interaction, explaining that older individuals withdrew from social involvements in anticipation of death.21 The activity theory later countered this explanation by suggesting that remaining active and engaged with society is pivotal to satisfaction in old age, necessitating new interests and relationships to replace those lost in later life.22 Continuity theory then built upon activity theory, suggesting that personality and values are consistent through the life span.23 These theories continue to change and be challenged, as explanations of observed psychosocial behavior evolve and as our understanding and perceptions of aging and social interaction have grown.24

HISTORICAL NOTE
Karl Popper (1902–1994) was one of the most influential philosophers of the 20th century. He proffered a falsificationist methodology, suggesting that all theories are characterized by predictions that future observations may reveal to be false.20 This process causes scientists to revise theory, reject it in favor of a different theory, or change the hypothesis that originally supported it.

■ Theory, Research, and Practice

Every theory serves, in part, as a research directive. The empirical outcomes of research can be organized and ordered to build theories using inductive reasoning. Conversely, theories must be tested by subjecting deductive hypotheses to scientific scrutiny. The processes of theory development and theory testing are represented in the model shown in Figure 4-1. The figure integrates the concepts of inductive and deductive reasoning as they relate to the elements of theory design.

Practical Applications of Theory
Clinicians are actually engaged in theory testing on a regular basis in practice. Theories guide us in making clinical decisions every day, although we often miss the connection because we are focused on the pragmatics of the situation. Specific therapeutic modalities are chosen for treatment because of expected outcomes that are based on theoretical assumptions about physiological effects. Treatments are modified according to the presence of risk factors, again based on theoretical relationships. Therefore, theories are tested each time the clinician evaluates treatment outcomes.

These theoretical frameworks, used as rationale for intervention, may not rise to the level of a “named” theory, but that does not diminish their importance in setting the context for interpreting outcomes. Sometimes we anticipate a certain outcome based on this framework, and it does not work out as expected, indicating that something in our thought process needs to be re-examined. Are we using a theoretical rationale that is not valid? Did we not measure outcomes in a meaningful way? Are there intervening factors that could have influenced the outcome? These are questions that take us back to the theoretical frameworks that supported our initial decisions and our examination of the evidence that led us there (see Focus on Evidence 4-3).

HISTORICAL NOTE
For centuries, physicians believed that illnesses were caused by “humoral imbalances.” This theory was the basis for diagnosis and treatment, and bloodletting was one of the most popular interventions to end the imbalances. Although patients suffered and died from this practice, it persisted into the end of the 19th century. In 1892, even William Osler, one of the most influential medical authorities, stated that “we have certainly bled too little.”25 Because research finally showed the harm it caused, bloodletting has been abandoned for more than a century—except that it is back for some modern applications. Leeches are used today to restore venous circulation and reduce engorgement in limb reattachments.26
Same practice, different theory.

Understanding of a clinical phenomenon cannot be achieved in a single study. It requires a process involving discussion, criticism, and intellectual exchange among a community of researchers and clinicians, to analyze the connection between new and previous findings and explanations. This type of exchange allows inconsistencies to surface, to identify findings that cannot be explained by current theories.

Theory and Research Questions
The importance of theory for understanding research findings is often misunderstood. Whenever a research question is formulated, there is an implicit theoretical rationale that suggests how the variables of interest should be related (see Chapter 3). Unfortunately, many authors do not make this foundation explicit. Empirical results are often described with only a general explanation or an admission that the author can
Focus on Evidence 4–3
Applying Theory to Guide Practice

Researchers have long lamented the difficulties in running clinical trials to study treatments that involve clinical judgment, patient interaction, and combinations of interventions for individualized care because of the challenge in providing consistent operational definitions.27 For instance, in designing rehabilitation programs, therapists will often incorporate many activities which can vary greatly from one patient to another, even if they have the same diagnosis. One way to address this dilemma is to rely on theory to provide guidelines and rationale about mechanisms of action. Even if the specific intervention activities are not identical from patient to patient, the treatment decisions can all be directed by the same theoretical premise.

Whyte28 provides a useful perspective by distinguishing theories related to treatment and enablement. Treatment theory specifies the mechanism of action for an intervention, or the “active ingredient” that is intended to produce change in an aspect of function or health.29 For example, Mueller and Maluf30 have described physical stress theory, which states that changes in stress to biological tissues (such as muscle) will cause a predictable adaptive response. Therefore, we can predict that, when muscle is challenged at high stresses (as through exercise), we should see increases in contractile strength. With a lack of stress (inactivity) we should see declines in strength. The treatment target is typically an impairment that can be measured as a direct result of the intervention,28 in this case strength or force of movement. In contrast, enablement theory provides the link from theory to outcomes, reflecting the relative importance of clinical changes to activity and participation. In this example, it would be a link from increased strength to improved mobility and function.

To illustrate this application, Dibble et al31 conducted a randomized controlled trial (RCT) to compare active exercise with resistance exercises for patients with Parkinson’s disease (PD) in order to evaluate changes in muscle force production and mobility. They theorized that the more challenging resistive exercises would be more effective in improving muscular force (measured with a dynamometer), gait (measured by the Functional Gait Assessment),32 and quality of life (measured using the Parkinson’s Disease Questionnaire [PDQ-39]).33 They found, however, that both groups showed similar improvement in muscle force and gait—suggesting that it was not the type of exercise that was relevant, but the presence of an exercise component. They also found that neither group improved in the quality of life measure. Their interpretation suggested that exercise may not be sufficient to impact health status in the PD population. The impairment and activity level treatment targets (strength and mobility) showed improvement, but this did not link to participation activities, at least as it was measured in this study.

The PDQ-39 focuses on difficulty of movement tasks as well as emotional consequences of PD. Enablement theory leads us to hypothesize that intervention may need to be designed to more directly influence these elements of quality of life and, therefore, this measure might not have been a relevant outcome based on the theoretical premise of the intervention, which focused on the neurological and musculoskeletal influence of exercise.

From an EBP perspective, this example illustrates the need to be wary of simple conclusions (resistive exercise does not improve function in PD). The theoretical basis for decision-making will affect how outcomes will be interpreted and applied to practice.

Concepts of impairment, activity, and participation are discussed within the framework of the International Classification of Functioning, Disability and Health in Chapter 1.
COMMENTARY

Theories are approximations of reality, plausible explanations of observable events. This leaves us with much uncertainty in the continued search for understanding clinical behavior. Stephen Hawking offered this caution:

    Any physical theory is always provisional, in the sense that it is only a hypothesis; you can never prove it. No matter how many times the results of experiments agree with some theory, you can never be sure that the next time the result will not contradict the theory.43

In essence, the more that research does not disconfirm a theory, the more the theory is supported. This may sound backward, but in actuality we can only demonstrate that a theoretical premise does not hold true in a specific situation. Hawking also offered this proviso:

    On the other hand, you can disprove a theory by finding even a single observation that disagrees with the predictions of the theory.43

Hawking’s admonition is probably useful in physical sciences but has to be considered with less conviction in behavioral and health sciences. We do not always know which variables are relevant to a theory. If we choose one way to test it and the outcome does not support the theory, the theory may still have validity. It may mean we need to test it differently, choosing different variables or measuring instruments in different circumstances, or choosing subjects with different characteristics. Or the theory may need to be clarified as to its scope. This is where “making sense” is an important criterion and why operationally defining constructs is so important.

Conversely, we may find that certain treatments are continually successful, even though we cannot explain the theoretical underpinnings. This is often the case, for example, when working with neurological conditions. We may find that theoretical foundations change over time, but this does not mean that the treatment does not work. It simply means we do not yet fully understand the mechanisms at work. Eventually, we have to reformulate the theories to reflect more precisely what we actually observe in practice.

This is the constant challenge—finding how the pieces of the puzzle fit. Most importantly, researchers and evidence-based practitioners must consider how theory helps to balance uncertainty to allow for reasoned decisions and interpretations of outcomes, remembering that today’s theoretical beliefs may be tomorrow’s tomatoes!
REFERENCES

1. Kerlinger FN. Foundations of Behavioral Research. New York: Holt, Rinehart & Winston, 1973.
2. Prochaska JO, Velicer WF. The transtheoretical model of health behavior change. Am J Health Promot 1997;12(1):38–48.
3. Prochaska JO, Wright JA, Velicer WF. Evaluating theories of health behavior change: a hierarchy of criteria applied to the transtheoretical model. Appl Psychol: An Internat Rev 2008;57(4):561–588.
4. Aveyard P, Massey L, Parsons A, Manaseki S, Griffin C. The effect of transtheoretical model based interventions on smoking cessation. Soc Sci Med 2009;68(3):397–403.
5. Hamzat TK. Physical characteristics as predictors of quadriceps muscle isometric strength: a pilot study. Afr J Med Med Sci 2001;30(3):179–181.
6. Neder JA, Nery LE, Shinzato GT, Andrade MS, Peres C, Silva AC. Reference values for concentric knee isokinetic strength and power in nonathletic men and women from 20 to 80 years old. J Orthop Sports Phys Ther 1999;29(2):116–126.
7. Hanten WP, Chen WY, Austin AA, Brooks RE, Carter HC, Law CA, Morgan MK, Sanders DJ, Swan CA, Vanderslice AL. Maximum grip strength in normal subjects from 20 to 64 years of age. J Hand Ther 1999;12(3):193–200.
8. Silverman WA. Human Experimentation: A Guided Step into the Unknown. New York: Oxford University Press, 1985.
9. Keutz Pv. The character concept of Sigmund Freud. Schweiz Arch Neurol Neurochir Psychiatr 1971;109(2):343–365.
10. Li F, Harmer P, Fitzgerald K, Eckstrom E, Stock R, Galver J, Maddalozzo G, Batya SS. Tai chi and postural stability in patients with Parkinson’s disease. N Engl J Med 2012;366(6):511–519.
11. Carter ND, Khan KM, McKay HA, Petit MA, Waterman C, Heinonen A, Janssen PA, Donaldson MG, Mallinson A, Riddell L, et al. Community-based exercise program reduces risk factors for falls in 65- to 75-year-old women with osteoporosis: randomized controlled trial. CMAJ 2002;167(9):997–1004.
12. Mansfield A, Peters AL, Liu BA, Maki BE. Effect of a perturbation-based balance training program on compensatory stepping and grasping reactions in older adults: a randomized controlled trial. Phys Ther 2010;90(4):476–491.
13. Glaser BG, Strauss AL. The Discovery of Grounded Theory: Strategies for Qualitative Research. New York: Aldine, 1967.
14. Resnik L, Jensen GM. Using clinical outcomes to explore the theory of expert practice in physical therapy. Phys Ther 2003;83(12):1090–1106.
15. Smith KA. Why the tomato was feared in Europe for more than 200 years. Smithsonian.com. Available at http://www.smithsonianmag.com/arts-culture/why-the-tomato-was-feared-in-europe-for-more-than-200-years-863735/. Accessed February 10, 2017.
16. Belief that the tomato is poisonous is disproven. History Channel. Available at http://www.historychannel.com.au/this-day-in-history/belief-that-the-tomato-is-poisonous-is-disproven/. Accessed February 9, 2017.
17. Goodwin JS, Goodwin JM. The tomato effect. Rejection of highly efficacious therapies. JAMA 1984;251(18):2387–2390.
18. Ugurlucan M, Caglar IM, Caglar FN, Ziyade S, Karatepe O, Yildiz Y, Zencirci E, Ugurlucan FG, Arslan AH, Korkmaz S, et al. Aspirin: from a historical perspective. Recent Pat Cardiovasc Drug Discov 2012;7(1):71–76.
19. Goodwin JS, Goodwin JM. Failure to recognize efficacious treatments: a history of salicylate therapy in rheumatoid arthritis. Perspect Biol Med 1981;25(1):78–92.
20. Popper KR. Conjectures and Refutations: The Growth of Scientific Knowledge. 2 ed. London: Routledge Classics, 2002.
21. Cummings E, Henry WE. Growing Old: The Process of Disengagement. New York: Basic Books, 1961.
22. Havighurst RJ. Successful aging. In: Williams RH, Tibbits C, Donohue W, eds. Process of Aging: Social and Psychological Perspectives, Volume 1. New Brunswick, NJ: Transaction Publishers, 1963:299–320.
23. Atchley RC. A continuity theory of normal aging. Gerontologist 1989;29(2):183–190.
24. Jewell AJ. Tornstam’s notion of gerotranscendence: re-examining and questioning the theory. J Aging Stud 2014;30:112–120.
25. Osler W, McCrae T. The Principles and Practice of Medicine. New York: D. Appleton and Company, 1921.
26. Smoot EC, 3rd, Debs N, Banducci D, Poole M, Roth A. Leech therapy and bleeding wound techniques to relieve venous congestion. J Reconstr Microsurg 1990;6(3):245–250.
27. Johnston MV. Desiderata for clinical trials in medical rehabilitation. Am J Phys Med Rehabil 2003;82(10 Suppl):S3–S7.
28. Whyte J. Contributions of treatment theory and enablement theory to rehabilitation research and practice. Arch Phys Med Rehabil 2014;95(1 Suppl):S17–23.e2.
29. Whyte J, Barrett AM. Advancing the evidence base of rehabilitation treatments: a developmental approach. Arch Phys Med Rehabil 2012;93(8 Suppl):S101–S110.
30. Mueller MJ, Maluf KS. Tissue adaptation to physical stress: a proposed “Physical Stress Theory” to guide physical therapist practice, education, and research. Phys Ther 2002;82(4):383–403.
31. Dibble LE, Foreman KB, Addison O, Marcus RL, LaStayo PC. Exercise and medication effects on persons with Parkinson disease across the domains of disability: a randomized clinical trial. J Neurol Phys Ther 2015;39(2):85–92.
32. Wrisley DM, Marchetti GF, Kuharsky DK, Whitney SL. Reliability, internal consistency, and validity of data obtained with the functional gait assessment. Phys Ther 2004;84(10):906–918.
33. Fitzpatrick R, Jenkinson C, Peto V, Hyman N, Greenhall R. Desirable properties for instruments assessing quality of life: evidence from the PDQ-39. J Neurol Neurosurg Psychiatry 1997;62(1):104.
34. Merton RK. Social Theory and Social Structure. New York: The Free Press, 1968.
35. van Whye J. Darwin: The Story of the Man and His Theories of Evolution. London: Andre Deutsch Ltd, 2008.
36. Einstein A. Relativity: The Special and General Theory. New York: Holt and Company, 1916.
37. Wadsworth BJ. Piaget’s Theory of Cognitive and Affective Development: Foundations of Constructivism. 5 ed. New York: Longman Publishing, 1996.
38. Erikson EH, Erikson JM. The Life Cycle Completed-Extended Version. New York: W.W. Norton, 1997.
39. Abrams D, Hogg MA. Metatheory: lessons from social identity research. Pers Soc Psychol Rev 2004;8(2):98–106.
40. Eastwood JG, Kemp LA, Jalaludin BB. Realist theory construction for a mixed method multilevel study of neighbourhood context and postnatal depression. Springerplus 2016;5(1):1081.
41. Paterson BL. The shifting perspectives model of chronic illness. J Nurs Scholarsh 2001;33(1):21–26.
42. Gottlieb G. The relevance of developmental-psychobiological metatheory to developmental neuropsychology. Dev Neuropsychol 2001;19(1):1–9.
43. Hawking S. A Brief History of Time From the Big Bang to Black Holes. New York: Bantam Dell Publishing Group, 1988.
6113_Ch05_053-069 02/12/19 4:55 pm Page 53
CHAPTER 5
Understanding Evidence-Based Practice
Evidence-based practice (EBP) is about clinical decision-making—how we find and use information and how we integrate knowledge, experience, and judgment to address clinical problems. EBP requires a mindset that values evidence as an important component of quality care and a skill set in searching the literature, critical appraisal, synthesis, and reasoning to determine the applicability of evidence to current issues. The ultimate goal is to create a culture of inquiry and rigorous evaluation that provides a basis for balancing quality with the uncertainty that is found in practice—all in the effort to improve patient care.

The purpose of this chapter is to clarify how evidence contributes to clinical decision-making, describe the process of EBP and the types of evidence that are meaningful, and discuss the barriers that often limit successful implementation. This discussion will continue throughout the text in relation to specific elements of research design and analysis.

■ Why Is Evidence-Based Practice Important?

In a landmark 2001 report, Crossing the Quality Chasm,1 the Institute of Medicine (IOM) documented a significant gap between what we know and what we do, between the care people should receive and the care they actually receive, between published evidence and healthcare practice. The IOM estimates that one-third of healthcare spending is for therapies that do not improve health.2 Another review suggests that 50% of healthcare practices are of unknown effectiveness and 15% are potentially harmful or unlikely to be beneficial.3,4 Consider, for example, the change in recommendations for infant sleep position, for many years preferred on the stomach, to reduce the possibility of spitting up and choking. In 1992, this recommendation was changed to sleeping on the back, which improved breathing and drastically changed the incidence of sudden infant death syndrome (SIDS).5

Clinicians tackle questions every day in practice, constantly faced with the task of interpreting results of diagnostic tests, determining the efficacy of therapeutic and preventive regimens, the potential harm associated with different treatments, the course and prognosis of specific disorders, costs of tests or interventions, and whether guidelines are sound. Despite having these questions, however, and even with an emphasis on EBP across healthcare, practitioners in most professional disciplines do not seek answers or they express a lack of confidence in evidence, with a higher value placed on experience, collegial advice, or anecdotal evidence than on research to support clinical decisions.6,7

From an evidence-based standpoint, research has continued to document escalating healthcare costs, disparities in access to healthcare, and unwarranted variations in accepted practice—with geography, ethnicity, socioeconomic status, and clinical setting often cited as major determinants.8–11 Addressing these issues requires understanding how evidence informs our choices to support quality care.
HISTORICAL NOTE
Galen of Pergamon (130–200 A.D.), the ancient Greek physician, was the authority on medical practices for centuries. His authority held such weight that centuries later, when scientists performed autopsies and could not corroborate his teachings with their physical findings, his followers commented that, if the new findings did not agree with his teachings, “the discrepancy should be attributed to the fact that nature had changed.”29
Illustration courtesy of Sidney Harris. Used with permission from ScienceCartoonsPlus.com.

Where Are the Gaps?
As discussed in Chapter 2, the focus on translational research has highlighted the need for research to provide practical solutions to clinical problems in a timely way, further fueling the urgency for a more evidence-based framework for healthcare.

Generalizing results from randomized controlled trials (RCTs) is often difficult because patients and circumstances in the real world do not match experimental conditions. Clinicians often contend with reimbursement policies that can interfere with decisions regarding effective treatment choices (see Focus on Evidence 5-1). They may also be faced with changes in recommendations, as research uncovers inconsistencies or errors in previous studies and recommendations are reversed.12

Three primary issues drive this discussion, all related to the quality of care: overuse of procedures without justification, underuse of established procedures, and misuse of available procedures.13 Table 5-1 provides examples of each of these quality concerns. These issues are exacerbated by a lack of attention to research findings as well as difficulties in applying results to practice.

• Experience
“It’s worked for me before.”
Sometimes a product of trial and error, experience is a powerful teacher. Occasionally, it will be the experience of a colleague that is shared. The more experienced the clinician, the stronger the belief in that experience!

FUN FACT
Although somewhat tongue-in-cheek, unyielding trust in authority has also been called “eminence-based medicine,” the persistent reliance on the experience of those considered the experts.30 Such faith has been defined as “making the same mistakes with increasing confidence over an impressive number of years.”31
Focus on Evidence 5–1
The Direct Line from Evidence to Practice

Treatment to improve healing of chronic wounds remains a serious healthcare challenge.14 Standard care includes optimization of nutritional status, débridement, dressings to maintain granulation tissue, and treatment to resolve infections.15 Electrical stimulation (ES) has been used as an alternative therapy since the 1960s, especially for chronic wounds, with numerous clinical reports showing accelerated healing.16,17
Despite several decades of successful use of the modality and Medicare coverage for such treatment since 1980, in May 1997, the Health Care Financing Administration (HCFA, now the Centers for Medicare and Medicaid Services, CMS) announced that it would no longer reimburse for the use of ES for wound healing. The agency claimed that a review of literature determined there was insufficient evidence to support such coverage and that ES did not appear to be superior to conventional therapies.18
In July 1997, the American Physical Therapy Association (APTA) filed a lawsuit along with six Medicare beneficiaries seeking a temporary injunction against HCFA from enforcing its decision.15 The APTA included several supporting documents, including a CPG for pressure ulcers issued by the Agency for Health Care Policy and Research (now the Agency for Healthcare Research and Quality, AHRQ). Although several clinical trials were cited, supportive evidence was not considered strong. Therapists who opposed HCFA’s decision cited their experience and informal data, insisting that the treatment made an important difference for many patients.18
This prompted the gathering of more formal evidence and further analysis to show that ES was indeed effective for reducing healing time for chronic wounds.15 Based on this information, in 2002 CMS reversed the decision and coverage was approved, but only for use with chronic ulcers, defined as ulcers that have not healed within 30 days of occurrence.
One of the continuing issues with research in this area is the lack of consistency across trials in parameters of ES application, including dosage, waveforms, duration of treatment, and the delivery system. Interpretations are further complicated by varied protocols, types of wounds studied, generally small samples, and different types of comparisons within the trials.14,19 This type of variability in design still threatens the weight of evidence, in this and many other areas of clinical inquiry. This is not an isolated experience.
This story emphasizes why the use of evidence in practice is not optional. “Insisting” that a treatment works is not sufficient. Our ability to provide appropriate care, change policy, and be reimbursed for services rests on how we have justified interventions and diagnostic procedures through valid research—not just for our own decision-making, but for those who influence how our care is provided and paid for and, of course, ultimately for the benefit of our patients.
Misconceptions
When the EBP concept was first introduced, there were many misconceptions about what it meant.32 Proponents argued that this was an essential and responsible way to

Figure 5–2 The components of EBP as a framework for clinical decision-making.
There are estimates that more than 2,000 citations are being added to databases every day! In 2018, PubMed included over 28 million citations.

The concept of “best” evidence implies that it might not be a perfect fit to answer our questions. Sometimes we have to refer to basic science research to establish a theoretical premise, when no other direct evidence is available. It is also important to know that not everything that is published is of sufficient quality or relevance to warrant application. This means we have to be able to apply skills of critical appraisal to make those judgments.

. . . with clinical expertise
EBP makes no attempt to stifle the essential elements of judgment, experience, skill, and expertise in the identification of patient problems and the individual risks and benefits of particular therapies. Clinical skill and exposure to a variety of patients take time to develop and play an important role in building expertise and the application of published evidence to practice. As much as we know the scientific method is not perfect, clinical decisions are constantly being made under conditions of uncertainty and variability—that is the “art” of clinical practice. But we cannot dissociate the art from the science that supports it. Sackett32 implores us to recognize that:

. . . without clinical expertise, practice risks being tyrannized by evidence, for even excellent external advice may be inapplicable to or inappropriate for an individual patient. Without current best evidence, practice risks becoming rapidly out of date, to the detriment of patients.

. . . the patient’s unique values
A key element of the EBP definition is the inclusion of the patient in decision-making. It is not just about finding evidence but finding evidence that matters to patients. Patients have personal values, cultural traits, belief systems, family norms and expectations, and preferences that influence choices; these must all be weighed against the evidence. Patients may opt for one treatment over another because of comfort, cost, convenience, or other beliefs. Patients understand the importance of evidence, but also see value in personalized choices and clinical judgment.35 This is especially important when evidence is not conclusive and multiple options exist. This implies, however, that the clinician discusses the evidence with the patient in an understandable way so that decisions are considered collaboratively. Evidence supports better outcomes with patient participation in decision-making.36–38

. . . and circumstances
This last element of the EBP definition is a critical one. The final decisions about care must also take into account the organizational context within which care is delivered, including available resources in the community, costs, clinical culture and constraints, accessibility of the environment, and the nature of the healthcare system. All of these can have a direct influence on what is possible or preferable. For instance, evidence may clearly support a particular treatment approach, but your clinic may not be able to afford the necessary equipment, you may not have the skills to perform techniques adequately, an institutional policy may preclude a certain approach, or the patient’s insurance coverage may be inadequate. Choices must always be made within pragmatic reality.

It is essential to appreciate all the components of EBP and the fact that evidence does not make a decision. The clinician and patient will do that together—with all the relevant and available evidence to inform them for optimal shared decision-making.

■ The Process of Evidence-Based Practice

The process of EBP is generally described in five phases—the 5 A’s—all revolving around the patient’s condition and care needs (see Fig. 5-3). This can best be illustrated with an example.

➤ CASE IN POINT
Mrs. H. is 67 years old with a 10-year history of type 2 diabetes, which she considers essentially asymptomatic. She has mild peripheral neuropathies in both feet and poor balance, which has resulted in two noninjurious falls. She is relatively sedentary and overweight, and although she knows she should be exercising more, she admits she “hates to exercise.”
Her HbA1c has remained around 7.5 for the past 5 years, slightly above the recommended level of less than 7.0. She is on several oral medications and just began using a noninsulin injectable once daily. She claims to be “relatively compliant” with her medications and complains of periodic gastrointestinal problems, which she attributes to side effects.
Three months ago, Mrs. H. started complaining of pain and limited range of motion (ROM) in her right shoulder. She had an MRI and was diagnosed with adhesive capsulitis (frozen shoulder), which has continued to get worse. She considers this her primary health concern right now and is asking about treatment options.
Box 5-1 We Are in This Together
The mandates for interprofessional practice have become a central focus of healthcare today, touting the need to move out of our professional silos to support quality and safety in the provision of patient-centered care.1,54 And yet, when it comes to EBP, the silos may be perpetuated with resources being developed for “evidence-based medicine,”33 “evidence-based occupational therapy,”55 “evidence-based physical therapy,”56 “evidence-based nursing,”57 and so on.
By thinking about “disciplinary evidence,” we risk becoming short-sighted about how we ask questions, look for evidence, or use information to make clinical decisions. Although each discipline surely has its unique knowledge base (albeit with overlaps), we must be sure to think interprofessionally to phrase questions and gather evidence broadly enough to reflect the best care for a particular patient, including understanding how such evidence can inform our role as a member of the team responsible for the patient’s total management.

Finding relevant literature requires knowledge of search parameters and access to appropriate databases. Sometimes searches are not successful in locating relevant resources to answer a particular question. This may be due to a lack of sophistication in searching or it may be because studies addressing a particular question have not been done!

Comprehensive information about searching the literature to answer this question will be presented in Chapter 6.

Evidence Reviews
One of the most useful contributions to EBP has been the development of critical reviews of literature that summarize and appraise current literature on a particular topic. For clinicians who do not have the time or skill to read and assess large numbers of studies, these reviews provide a comprehensive summary of available evidence. Reviews are not necessarily available for every clinical question but, whenever possible, such reviews should be sought as a first source of information. Many excellent resources provide evidence-based reviews on particular topics (see Chapter 37).
Systematic reviews are studies in which the authors carry out an extensive and focused search for research on a clinical topic, followed by appraisal and summaries of findings, usually to answer a specific question.

➤ Blanchard et al53 compiled a systematic review to look at the effectiveness of corticosteroid injections compared to physical therapy intervention for adhesive capsulitis. They evaluated six randomized trials, with varied quality and sample sizes. They found that the corticosteroid injections had a
greater effect on mobility in the short term (6–7 weeks) compared to physical therapy, but with no difference in pain. At 1 year, there was no difference in shoulder disability between the two groups, with a small benefit in terms of pain for those who got the injection.

Meta-analyses are systematic reviews that use quantitative methods to summarize the results from multiple studies. Summary reviews are also used to establish clinical practice guidelines (CPGs) for specific conditions, translating critical analysis of the literature into a set of recommendations.
Another type of knowledge synthesis is called a scoping review, which also includes a review of literature but with a broader focus on a topic of interest to provide direction for future research and policy considerations.58

STEP 3: Appraise the Literature

Once pertinent articles are found, they need to be critically appraised to determine whether they meet quality standards and whether the findings are important and relevant to the clinical question. Although different types of studies will have different criteria for determining validity, three major categories should be addressed for each article, using criteria that are relevant to the design (see Table 5-3). Clinicians must be able to determine to what extent the results of a study inform their practice, especially when some studies will have “significant” findings and others will not (see Chapter 36). Having a working knowledge of research design and analysis approaches is essential for making these judgments.
Appraisal is also dependent on the extent to which authors present clear and complete descriptions of their research question, study design, analysis procedures, results, and conclusions. Checklists have been developed for various types of studies that list specific content that should be included in a written report to assure transparency (see Chapter 38).

Table 5-3 Outline of General Criteria for Critical Appraisal
1. Is the study valid?
This is the determination of the quality of the design and analysis, as well as the extent to which you can be confident in the study’s findings. Various scales can be used to assess the methodological quality of a study.
• Is the research question important, and is it based on a theoretical rationale? Is the study design appropriate for the research question?
• Was the study sample selected appropriately? Was the sample size large enough to demonstrate meaningful effects? Were subjects followed for a long enough time to document outcomes?
• Was bias sufficiently controlled? For an intervention trial, were randomization and blinding used? For observational studies, was there control of potential confounders?
• Was the sample size large enough to demonstrate treatment effects?
• What were the outcome measures? Were they valid and reliable? Were they operationally defined? Were they appropriate to the research question?
• Were data analysis procedures applied and interpreted appropriately? Do data support conclusions?
2. Are the results important?
• What was the size of the treatment effect, the accuracy of the diagnostic test, or the degree of risk associated with prognostic factors?
• Is the effect large enough to be clinically meaningful?
3. Are the results relevant to my patient?
Finally, you must determine whether the findings will be applicable to your patient and clinical decisions.
• Were the subjects in the study sufficiently similar to my patient?
• Can I apply these results to my patient’s problem? What are the potential benefits or harms?
• Is the approach feasible in my setting and will it be acceptable to my patient?
• Is this approach worth the effort to incorporate it into my treatment plan?
And then there is the unhelpful situation where you could not find the evidence—because your search strategy was not sufficient, no one has studied the problem, or studies have shown no positive effects (see Box 5-2). This is where the decision-making process meets uncertainty and where judgment is needed to determine what constitutes best evidence for a specific clinical question. For Mrs. H, we can go through the thought process shown in Figure 5-5, resulting in a decision to explore corticosteroid injection as a supplement to physical therapy.

STEP 5: Assess the Effectiveness of the Evidence

The final step in the EBP process is to determine whether the application of the evidence resulted in a positive outcome. Did Mrs. H improve as expected? If she did not, the clinician must then reflect on why the evidence may not have been valid for the patient’s situation, whether additional evidence is needed to answer the clinical question, or whether other factors need to be considered to find the best treatment approach for her.
This step is often ignored but in the end may be the most important, as we must continue to learn from the evidence—what works and what does not. The process shown in Figure 5-3 is circular because, through this assessment, we may generate further questions in the search for the “best evidence.” These questions may facilitate a new search process or they may be the basis for a research study.

■ Levels of Evidence

As we search for evidence, we want to find the most accurate and valid information possible. Depending on the kind of question being asked, different types of studies will provide the strongest evidence. To assist in this effort, studies can be assigned to levels of evidence, which essentially reflect a rating system. These levels represent the expected rigor of the design and control of bias, thereby indicating the level of confidence that may be placed in the findings.
Levels of evidence are viewed as a hierarchy, with studies at the top representing stronger evidence and those at the bottom being weaker. This hierarchy can be used by clinicians, researchers, and patients as a guide to find the likely best evidence.61 These levels are also used as a basis for selecting studies for incorporation into guidelines and systematic reviews.

Classification Schemes
Several classifications of evidence have been developed, with varying terminology and grading systems.62,63 The most commonly cited system has been developed by the Oxford Centre for Evidence-Based Medicine (OCEBM), shown in Table 5-4. This system includes five study types:61
• Diagnosis: Is this diagnostic or monitoring test accurate? Questions about diagnostic tests or measurement reliability and validity are most effectively studied using cross-sectional studies applying a consistent reference standard and blinding.
• Prognosis: What will happen if we do not add a therapy? Studies of prognosis look for relationships among predictor and outcome variables using observational designs.
• Intervention: Does this intervention help? This question is most effectively studied using an RCT, although strong observational studies may also be used. The latest version of this classification includes the N-of-1 trial as an acceptable alternative to the standard RCT.
• Harm: What are the common or rare harms? RCTs and observational studies can identify adverse effects. Outcomes from N-of-1 trials can provide evidence specific to a particular patient.
• Screening: Is this early detection test worthwhile? RCTs are also the strongest design for this purpose, determining whether screening tests are effective by comparing outcomes for those who do and do not get the test.

The Hierarchy of Evidence
Hierarchies are delineated for each type of research question to distinguish the designs that provide the strongest form of evidence for that purpose. The OCEBM structure is based on five levels of evidence:
• Level 1 evidence comes from systematic reviews or meta-analyses that critically summarize several studies. These reviews are the best first pass when searching for information on a specific question (see Chapter 37).
• Level 2 includes individual RCTs or observational studies with strong design and outcomes (see Chapters 14 and 19).
The studies included in levels 1 and 2 are intended to reflect the strongest designs, including RCTs for assessment of interventions, harms, and screening procedures, as well as prospective observational studies for diagnosis and prognosis.
• Level 3 includes studies that do not have strong controls of bias, such as nonrandomized studies, retrospective cohorts, or diagnostic studies that do not have a consistent reference standard (see Chapters 17, 19).
Clinical Question
In a patient with type 2 diabetes who has adhesive capsulitis, is a
corticosteroid injection more effective than physical therapy for reducing
shoulder disability and pain, and increasing ROM?
• Level 4 is a relatively low level of evidence, primarily including descriptive studies such as case series or studies that use historical controls (see Chapter 20).
• Level 5 is based on mechanistic reasoning, whereby evidence-based decisions are founded on logical connections, including pathophysiological rationale.67 This may be based on biological plausibility or basic research. Case reports that focus on describing such mechanisms in one or more individuals can be informative at this level of evidence.

As the nature of EBP has matured, hierarchies of evidence have been modified over time. In previous versions of levels of evidence, the “bottom” level included case reports, expert opinion, anecdotal evidence, and basic research. With the most recent OCEBM modifications in 2011, these sources have been replaced by mechanistic reasoning, suggesting that such “opinions” must still be based on logical rationale. Also omitted are questions about economics, for which there is less consensus on what counts as good evidence.61

Qualitative and Descriptive Research Evidence
The framework for levels of evidence has been based on designs used in quantitative studies. Studies that produce qualitative or descriptive data have been excluded from the hierarchy and are sometimes not seen as true forms of evidence. These research approaches do, however, make significant contributions to our knowledge
base and identification of mechanisms, and will often provide an important perspective for appreciating patients’ concerns. Quantitative studies can tell us which treatment is better or the degree of correlation between variables, but qualitative inquiry can help us understand the context of care and how the patient experience impacts outcomes—factors that can have substantial influence on clinical decisions (see Chapter 21). Therefore, rather than thinking of qualitative study as a low level of evidence in a quantitative paradigm, a different set of criteria needs to be applied that focuses on the various approaches and kinds of questions relevant to qualitative research.64,65
Qualitative research stands apart from other forms of inquiry because of its deep-rooted intent to understand the personal and social experience of patients, families, and providers. Daly et al66 have proposed a hierarchy with four levels of evidence, emphasizing the capacity for qualitative research to inform practice and policy.

• Level 1, at the top of the hierarchy, represents generalizable studies. These are qualitative studies that are guided by theory, and include sample selection that extends beyond one specific population to capture diversity of experiences. Findings are put into the context of previous research, demonstrating broad application. Data collection and analysis procedures are comprehensive and clear.
• Level 2 includes conceptual studies that analyze data according to conceptual themes but have narrower application based on limited diversity in the sample. Further research is needed to confirm concepts.
• Level 3 includes descriptive studies that are focused on describing participant views or experiences with narrative summaries and quotations, but with no detailed analysis.
• Level 4 includes case studies, exploring insights that have not been addressed before, often leading to recognition of unusual phenomena.

Quality of Evidence
Levels of evidence are intended to serve as a guide to accessing and prioritizing the strongest possible research on a given topic. However, levels are based on design characteristics, not quality. An RCT may have serious flaws because of bias, restricted samples, inappropriate data analysis, or inappropriate interpretation of results. At the same time, level 4 studies may provide important relevant information that may influence decision-making for a particular patient.

Grading Evidence Up or Down
Hierarchies used to classify quality of published research should serve as a guide for searching and identifying higher-quality studies, but they are not absolute. By evaluating the quality of the evidence presented, we may find that a higher-quality study at a lower level of evidence may be more helpful than one at a higher level that has design or analysis flaws (see Chapter 37). For instance, a well-designed cohort study may be more applicable than a poorly designed RCT. The level of evidence for a strong observational study may be graded up if there is a substantial treatment effect. An RCT can be graded down based on poor study quality, lack of agreement between studied variables and the PICO question, inconsistency between studies, or because the absolute effect size is very small.

Levels of evidence are based on the assumption that systematic reviews and meta-analyses provide stronger evidence than individual trials in all categories. The content of reviews, however, is not always sufficiently detailed to provide applicable information for clinical decision-making. For example, reviews often lack specifics on patient characteristics, operational definitions about interventions, or adverse effects, which can vary across studies. Individual references may still need to be consulted to inform clinical decisions.

■ Implementing Evidence-Based Practice

Although the concepts and elements of EBP were introduced more than 20 years ago, the integration of EBP into everyday practice has continued to be a challenge in all healthcare disciplines. As we explore evidence, it is important to keep a perspective regarding its translation to practice. The ultimate purpose of EBP is to improve patient care, but this cannot happen by only using published research. The clinician must be able to make the connections between study results, sound judgment, patient needs, and clinical resources.
The literature is replete with studies that have examined the degree to which clinicians and administrators have been able to implement EBP as part of their practice. It is a concern that appears to affect all health professions and practice settings, as well as healthcare cultures across many countries. For the most part, research has shown that issues do not stem from devaluing evidence-based practice. Although healthcare providers appreciate their professional responsibility to work in an evidence-based way, they also recognize the difficulties in achieving that goal from personal and organizational perspectives, particularly related to availability of time and resources, as well as establishing an evidence-based clinical culture.67
Several strategies can be incorporated to build an EBP clinical environment. See the Chapter 5 Supplement for further discussion.

Knowledge Translation
As we come to better appreciate the importance of EBP, we must also appreciate the journey we face in getting from evidence to practice. Implementation studies can help us see how evidence can be integrated into practice. But then there is still one more hurdle—securing the knowledge that facilitates actual utilization of evidence.
Knowledge translation (KT) is a relatively new term that has been used to describe the longstanding problem of underutilization of evidence in healthcare.68 It describes the process of accelerating the application of knowledge to improve outcomes and change behavior for all those involved in providing care. It is a process that needs interprofessional support, including disciplines such as informatics, public policy, organizational theory, and educational and social psychology, to close the gap between what we know and what we do (see Chapter 2).69
Although the term KT has been widely used, its definition has been ambiguous, often used interchangeably with other terms such as knowledge transfer, knowledge exchange, research utilization, diffusion, implementation, dissemination, and evidence implementation.70,71 The Canadian Institutes of Health Research (CIHR) have offered the following definition:

Knowledge translation is a dynamic and iterative process that includes synthesis, dissemination, exchange and ethically-sound application of knowledge to improve health, . . . provide more effective health services, and products and strengthen the health care system.72

Knowledge-to-Action Framework
This definition characterizes KT as a multidimensional concept involving the adaptation of quality research into relevant priorities, recognizing that KT goes beyond dissemination and continuing education.70 KT includes both creation and application of knowledge—what has been called the knowledge-to-action framework.71
Knowledge creation addresses specific needs and priorities, with three phases:
1. Inquiry—including completion of primary research.
2. Synthesis—bringing disparate findings together, typically in systematic reviews, to reflect the totality of evidence on a topic.
3. Development of tools—further distilling knowledge into creation of decision-making tools such as clinical guidelines.
The action cycle begins with identification of a problem, reflecting the current state of knowledge. It then leads to application and evaluation of the process for achieving stable outcomes. Integral to the framework is the input of stakeholders, including patients, clinicians, managers, and policy makers (see Focus on Evidence 5-2).71

Knowledge, Implementation, and Evidence
There is an obvious overlap of knowledge translation, implementation research, and EBP, all representing a process of moving research knowledge to application. Although the definitions appear to coincide, KT has a broader meaning, encompassing the full scope of knowledge development, starting with asking the right questions, conducting relevant research, disseminating findings that relate to a clinical context, applying that knowledge, and studying its impact.
KT is seen as a collaborative engagement between researchers and practitioners, reflecting together on practice needs and consequences of research choices, with mutual accountability.75 It is an interactive, cyclical, and dynamic process that integrates an interprofessional approach to change. In contrast, implementation science focuses on testing how interventions work in real settings and what solutions need to be introduced into a health system to promote sustainability.76 KT is a complementary process with a stronger emphasis on the development, exchange, and synthesis of knowledge.
In further contrast, EBP is used more narrowly to refer to the specific part of that process involving the application of information to clinical decision-making for a particular patient, including the influence of the patient’s values and the practitioner’s expertise. In a broader framework, KT formalizes the relationship between the researcher and the practitioner, including mechanisms to remove barriers to EBP.
Focus on Evidence 5–2
From Knowledge to Action

Although there is substantial evidence for the importance of physical activity for children, school and community programs often do not achieve their full potential, either because of poor quality of programming or lack of engaging activities. Based on a review of relevant data and several theories related to activity and goal achievement, a set of principles was designed to address documented limitations: Supportive, Active, Autonomous, Fair, Enjoyable (SAAFE).75 The purpose of SAAFE was to guide planning, delivery, and evaluation of organized physical activity. The process involved teachers, students, community leaders, and after-school staff, and was tested in several different environments to document changes in behavior and programming, including adoption of teacher training strategies, development of research protocols for evaluation, and dissemination of findings across settings.
This is an example of the knowledge-to-action cycle, requiring multiple inputs, recognition of barriers, and iterations of contributions to understand how evidence can be implemented in real-world environments.
COMMENTARY
Don’t accept your dog’s admiration as conclusive evidence that you are wonderful.
— Ann Landers (1918–2002)
Advice columnist
The central message of EBP is the need to establish logical connections that will strengthen our ability to make sound decisions regarding patient care—but because we base research results on samples, we make lots of assumptions about how findings represent larger populations or individual patients. We also know that all patients do not respond in the same way and, therefore, the causal nature of treatments is not absolute. Several studies looking at the same questions can present different findings. Unique patient characteristics and circumstances can influence outcomes. Studies often try to examine subgroups to determine the basis for these differences, but even those data are subject to variance that cannot be explained or anticipated. This is why clinical decisions should be informed by multiple sources of knowledge, including theory and experience.

In his classic paper, Things I Have Learned (So Far), Cohen77 cautions us as researchers and practitioners:

Remember that throughout the process, . . . it is on your informed judgment as a scientist that you must rely, and this holds as much for the statistical aspects of the work as it does for all the others. . . . and that informed judgment also governs the conclusions you will draw.

So although the need for EBP seems uncontroversial, it has not yet achieved its full promise. As we shall see in upcoming chapters, uncertainty is ever-present in practice and clinical research. Even when evidence appears strong and convincing, clinical decisions can vary based on practitioner and patient values.78 We need to apply good principles of design and statistics with sound clinical judgment to strengthen the likelihood that our conclusions, and therefore our actions, have merit.
between community- and hospital-based primary care clinics. Int J Clin Pract 2005;59(8):975–980.
11. Fiscella K, Sanders MR. Racial and ethnic disparities in the quality of health care. Annu Rev Public Health 2016;37:375–394.
12. Prasad V, Vandross A, Toomey C, Cheung M, Rho J, Quinn S, Chacko SJ, Borkar D, Gall V, Selvaraj S, et al. A decade of reversal: an analysis of 146 contradicted medical practices. Mayo Clinic Proceedings 2013;88(8):790–798.
13. National Partnership for Women and Families: Overuse, underuse and misuse of medical care. Fact Sheet. Available at http://go.nationalpartnership.org/site/DocServer/Three_Categories_of_Quality.pdf. Accessed February 17, 2017.
14. Isseroff RR, Dahle SE. Electrical stimulation therapy and wound healing: where are we now? Advances in Wound Care 2012;1(6):238–243.
15. Centers for Medicare and Medicaid Services: Decision Memo for Electrostimulation for Wounds (CAG-00068N). Available at https://www.cms.gov/medicare-coverage-database/details/nca-decision-memo.aspx?NCAId=27&fromdb=true. Accessed May 20, 2017.
16. Kloth LC. Electrical stimulation for wound healing: a review of evidence from in vitro studies, animal experiments, and clinical trials. Int J Low Extrem Wounds 2005;4(1):23–44.
17. Ennis WJ, Lee C, Gellada K, Corbiere TF, Koh TJ. Advanced technologies to improve wound healing: electrical stimulation, vibration therapy, and ultrasound—what is the evidence? Plast Reconstr Surg 2016;138(3 Suppl):94s–104s.
18. AHC Media: Association files suit over reimbursement change. Electrical stimulation funding at issue. Available at https://www.ahcmedia.com/articles/48484-association-files-suit-over-reimbursement-change. Accessed May 20, 2017.
19. Ashrafi M, Alonso-Rasgado T, Baguneid M, Bayat A. The efficacy of electrical stimulation in lower extremity cutaneous wound healing: a systematic review. Exp Dermatol 2017;26(2):171–178.
20. Korenstein D, Falk R, Howell EA, Bishop T, Keyhani S. Overuse of health care services in the United States: an understudied problem. Arch Intern Med 2012;172(2):171–178.
21. Takata GS, Chan LS, Shekelle P, Morton SC, Mason W, Marcy SM. Evidence assessment of management of acute otitis media: I. The role of antibiotics in treatment of uncomplicated acute otitis media. Pediatrics 2001;108(2):239–247.
22. Mangione-Smith R, DeCristofaro AH, Setodji CM, Keesey J, Klein DJ, Adams JL, Schuster MA, McGlynn EA. The quality of ambulatory care delivered to children in the United States. N Engl J Med 2007;357(15):1515–1523.
23. McGlynn EA, Asch SM, Adams J, Keesey J, Hicks J, DeCristofaro A, Kerr EA. The quality of health care delivered to adults in the United States. N Engl J Med 2003;348(26):2635–2645.
24. National Foundation for Infectious Diseases: Pneumococcal disease fact sheet for the media. Available at http://www.nfid.org/idinfo/pneumococcal/media-factsheet.html. Accessed February 17, 2017.
25. James JT. A new, evidence-based estimate of patient harms associated with hospital care. J Patient Saf 2013;9(3):122–128.
26. Pronovost PJ, Bo-Linn GW. Preventing patient harms through systems of care. JAMA 2012;308(8):769–770.
27. Desmeules F, Boudreault J, Roy JS, Dionne C, Fremont P, MacDermid JC. The efficacy of therapeutic ultrasound for rotator cuff tendinopathy: a systematic review and meta-analysis. Phys Ther Sport 2015;16(3):276–284.
28. Ebadi S, Henschke N, Nakhostin Ansari N, Fallah E, van Tulder MW. Therapeutic ultrasound for chronic low-back pain. Cochrane Database Syst Rev 2014(3):Cd009169.
29. Silverman WA. Human Experimentation: A Guided Step into the Unknown. New York: Oxford University Press, 1985.
30. Isaacs D, Fitzgerald D. Seven alternatives to evidence based medicine. BMJ 1999;319:1618.
31. O'Donnell M. A Sceptic's Medical Dictionary. London: BMJ Books, 1997.
32. Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn't. BMJ 1996;312:71–72.
33. Straus SE, Glasziou P, Richardson WS, Haynes RB, Pattani R, Veroniki AA. Evidence-Based Medicine: How to Practice and Teach It. London: Elsevier, 2019.
34. Guyatt GH. Evidence-based medicine. ACP J Club 1991;114(2):A16.
35. Carman KL, Maurer M, Mangrum R, Yang M, Ginsburg M, Sofaer S, Gold MR, Pathak-Sen E, Gilmore D, Richmond J, et al. Understanding an informed public's views on the role of evidence in making health care decisions. Health Affairs 2016;35(4):566–574.
36. Epstein RM, Alper BS, Quill TE. Communicating evidence for participatory decision making. JAMA 2004;291(19):2359–2366.
37. Stewart M, Meredith L, Brown JB, Galajda J. The influence of older patient-physician communication on health and health-related outcomes. Clin Geriatr Med 2000;16(1):25–36, vii–viii.
38. Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med 2010;7(9):e1000326.
39. Huang YP, Fann CY, Chiu YH, Yen MF, Chen LS, Chen HH, Pan SL. Association of diabetes mellitus with the risk of developing adhesive capsulitis of the shoulder: a longitudinal population-based followup study. Arthritis Care Res 2013;65(7):1197–1202.
40. Zreik NH, Malik RA, Charalambous CP. Adhesive capsulitis of the shoulder and diabetes: a meta-analysis of prevalence. Muscles Ligaments Tendons J 2016;6(1):26–34.
41. Haynes RB, Sackett DL, Guyatt GH, Tugwell PS. Clinical Epidemiology: How To Do Clinical Practice Research. 3 ed. Philadelphia: Lippincott Williams & Wilkins, 2006.
42. Samson D, Schoelles KM. Chapter 2: medical tests guidance (2) developing the topic and structuring systematic reviews of medical tests: utility of PICOTS, analytic frameworks, decision trees, and other frameworks. J Gen Intern Med 2012;27 Suppl 1:S11–S119.
43. Centre for Reviews and Dissemination. Systematic Reviews: CRD's Guidance for Undertaking Reviews in Health Care. York: University of York, 2006. Available at https://www.york.ac.uk/media/crd/Systematic_Reviews.pdf. Accessed February 3, 2017.
44. Schlosser RW, Koul R, Costello J. Asking well-built questions for evidence-based practice in augmentative and alternative communication. J Commun Disord 2007;40(3):225–238.
45. Cooke A, Smith D, Booth A. Beyond PICO: the SPIDER tool for qualitative evidence synthesis. Qual Health Res 2012;22(10):1435–1443.
46. Perkins BA, Olaleye D, Zinman B, Bril V. Simple screening tests for peripheral neuropathy in the diabetes clinic. Diabetes Care 2001;24(2):250–256.
47. Donaghy A, DeMott T, Allet L, Kim H, Ashton-Miller J, Richardson JK. Accuracy of clinical techniques for evaluating lower limb sensorimotor functions associated with increased fall risk. PM R 2016;8(4):331–339.
48. Jernigan SD, Pohl PS, Mahnken JD, Kluding PM. Diagnostic accuracy of fall risk assessment tools in people with diabetic peripheral neuropathy. Phys Ther 2012;92(11):1461–1470.
49. Montori VM, Fernandez-Balsells M. Glycemic control in type 2 diabetes: time for an evidence-based about-face? Ann Intern Med 2009;150(11):803–808.
50. Roh YH, Yi SR, Noh JH, Lee SY, Oh JH, Gong HS, Baek GH. Intra-articular corticosteroid injection in diabetic patients with adhesive capsulitis: a randomized controlled trial. Knee Surg Sports Traumatol Arthrosc 2012;20(10):1947–1952.
51. Pai LW, Li TC, Hwu YJ, Chang SC, Chen LL, Chang PY. The effectiveness of regular leisure-time physical activities on long-term glycemic control in people with type 2 diabetes: a systematic review and meta-analysis. Diabetes Res Clin Pract 2016;113:77–85.
52. Berenguera A, Mollo-Inesta A, Mata-Cases M, Franch-Nadal J, Bolibar B, Rubinat E, Mauricio D. Understanding the physical, social, and emotional experiences of people with uncontrolled type 2 diabetes: a qualitative study. Patient Prefer Adherence 2016;10:2323–2332.
53. Blanchard V, Barr S, Cerisola FL. The effectiveness of corticosteroid injections compared with physiotherapeutic interventions for adhesive capsulitis: a systematic review. Physiother 2010;96(2):95–107.
54. Interprofessional Education Collaborative. Core competencies for interprofessional collaborative practice: 2016 update. Washington, DC: Interprofessional Education Collaborative, 2016.
55. Rappolt S. The role of professional expertise in evidence-based occupational therapy. Am J Occup Ther 2003;57(5):589–593.
56. Fetters L, Tilson J. Evidence Based Physical Therapy. Philadelphia: FA Davis, 2012.
57. Brown SJ. Evidence-Based Nursing: The Research-Practice Connection. Norwich, VT: Jones & Bartlett, 2018.
58. Arksey H, O'Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol 2005;8:19–32.
59. Burton M. An invisible unicorn has been grazing in my office for a month. . . Prove me wrong. Evidently Cochrane, October 7, 2016. Available at http://www.evidentlycochrane.net/invisible-unicorn-grazing-office/. Accessed June 16, 2017.
60. Alderson P. Absence of evidence is not evidence of absence. BMJ 2004;328(7438):476–477.
61. Centre for Evidence-Based Medicine: OCEBM Levels of Evidence. Available at http://www.cebm.net/ocebm-levels-of-evidence/. Accessed April 28, 2017.
62. Wright JG. A practical guide to assigning levels of evidence. J Bone Joint Surg Am 2007;89(5):1128–1130.
63. Guyatt G, Gutterman D, Baumann MH, Addrizzo-Harris D, Hylek EM, Phillips B, Raskob G, Lewis SZ, Schunemann H. Grading strength of recommendations and quality of evidence in clinical guidelines: report from an American College of Chest Physicians task force. Chest 2006;129(1):174–181.
64. Levin RF. Qualitative and quantitative evidence hierarchies: mixing oranges and apples. Res Theory Nurs Pract 2014;28(2):110–112.
65. Giacomini MK. The rocky road: qualitative research as evidence. ACP J Club 2001;134(1):A11–13.
66. Daly J, Willis K, Small R, Green J, Welch N, Kealy M, Hughes E. A hierarchy of evidence for assessing qualitative health research. J Clin Epidemiol 2007;60(1):43–49.
67. Snöljung A, Mattsson K, Gustafsson LK. The diverging perception among physiotherapists of how to work with the concept of evidence: a phenomenographic analysis. J Eval Clin Pract 2014;20(6):759–766.
68. Grol R, Grimshaw J. From best evidence to best practice: effective implementation of change in patients' care. Lancet 2003;362(9391):1225–1230.
69. Davis D, Evans M, Jadad A, Perrier L, Rath D, Ryan D, Sibbald G, Straus S, Rappolt S, Wowk M, et al. The case for knowledge translation: shortening the journey from evidence to effect. BMJ 2003;327(7405):33–35.
70. Graham ID, Logan J, Harrison MB, Straus SE, Tetroe J, Caswell W, Robinson N. Lost in knowledge translation: time for a map? J Contin Educ Health Prof 2006;26(1):13–24.
71. Straus SE, Tetroe J, Graham I. Defining knowledge translation. CMAJ 2009;181(3–4):165–168.
72. Canadian Institutes of Health Research: Knowledge translation. Available at http://www.cihr-irsc.gc.ca/e/29418.html#2. Accessed July 14, 2017.
73. National Center for Dissemination of Disability Research: What is knowledge translation? Focus: A Technical Brief from the National Center for the Dissemination of Disability Research. Number 10. Available at http://ktdrr.org/ktlibrary/articles_pubs/ncddrwork/focus/focus10/Focus10.pdf. Accessed April 17, 2017.
74. Lubans DR, Lonsdale C, Cohen K, Eather N, Beauchamp MR, Morgan PJ, Sylvester BD, Smith JJ. Framework for the design and delivery of organized physical activity sessions for children and adolescents: rationale and description of the 'SAAFE' teaching principles. Int J Behav Nutr Phys Act 2017;14(1):24.
75. Oborn E, Barrett M, Racko G. Knowledge Translation in Health Care: A Review of the Literature. Cambridge, UK: Cambridge Judge Business School, 2010.
76. Khalil H. Knowledge translation and implementation science: what is the difference? Int J Evid Based Healthc 2016;14(2):39–40.
77. Cohen J. Things I have learned (so far). Am Psychologist 1990;45(12):1304–1312.
78. Rubenfeld GD. Understanding why we agree on the evidence but disagree on the medicine. Respir Care 2001;46(12):1442–1449.
CHAPTER 6
Searching the Literature
—with Jessica Bell

Although most of us have had some experience obtaining references for term papers or assignments, technology has forever changed how we locate information, how quickly we want it, and the volume of data available. Clinicians often go to the literature to stay on top of scientific advances as part of professional development, or they may gather evidence specific to a patient for clinical decision-making. Researchers will review the literature to build the rationale for a study and to help interpret findings. This process, of course, assumes that we can locate and retrieve the relevant research literature.

The purpose of this chapter is to describe strategies for successful literature searches that will serve the full range of information-seeking needs to support evidence-based practice (EBP).

■ Where We Find Evidence

We can think of "literature" broadly to include all scholarly products, including original research articles, editorials and position papers, reviews and meta-analyses, books and dissertations, conference proceedings and abstracts, and website materials.

Scientific Journals
When we think of searching for information in the health sciences, we typically will turn first to published journal articles. Thousands of journals are published by professional associations, societies or academies of practice, and other special interest groups, each with a specific disciplinary or content focus.

Most of these journals are peer reviewed, which means that manuscripts are scrutinized by experts before they are accepted for publication to assure a level of quality. It is important to appreciate this process, as it may have implications for the quality and validity of papers that we read.

Magazines
Scientific magazines are periodicals that include opinion pieces, news, information on policy, and reports of popular interest. These publications are often read by the public or by professionals from a particular field. Magazine pieces are not peer reviewed, do not generally contain primary research reports, and may present biased views. Most professional associations publish magazines that provide useful summaries of published studies or industry information that can give direction to further searching.

Government and Professional Websites
Government agencies, nongovernmental organizations, hospitals, advocacy organizations, and professional associations are eager to share the information and knowledge they have acquired, and frequently do so on their websites. Those sites are easily discoverable with a few well-worded searches in a Web search engine. A close examination of the source is always recommended to ensure its authority and the accuracy of the information provided.

Grey Literature
Many useful sources of information can be found in what is called the grey literature, which is anything not produced by a commercial publisher. Government documents, reports, fact sheets, practice guidelines,
conference proceedings, and theses or dissertations are some of the many nonjournal types of literature that fall under this category.1 The grey literature can provide information such as demographic statistics, preliminary data on new interventions and professional recommendations, and results of experimental, observational, and qualitative inquiries that never made it to formal publication. Studies with null findings are particularly vulnerable to rejection by publishers. Regardless, these unpublished studies can be an important source of evidence.2

Research reports have shown that journals are less likely to publish studies with negative findings or findings of no significant difference.3 This publication bias must be considered when searching for relevant literature, clearly impacting what we can learn from published work (see Chapter 37). For instance, studies have shown that, by excluding grey literature, systematic reviews may overestimate intervention effects.4,5

Searching for grey literature does present a challenge. There are several databases, developed in the United States, Canada, and Europe (such as Open Grey,6 GreySource,7 and GreyNet8), that focus on grey literature. Unfortunately, many of these databases are not comprehensive, and full references can be difficult to obtain. Internet strategies can be useful for locating grey literature. Google can locate websites related to your topic, and once identified, browsing through their pages can uncover many items of relevance.9,10

Any of the thousands of governmental and advocacy organizations, think tanks, and conference websites may have hidden gems that Google does not unearth. For instance, ClinicalTrials.gov, developed by the National Library of Medicine (NLM), is a database of ongoing and completed privately and publicly funded clinical studies conducted around the world.10 You may also find that contacting researchers in a particular area of study will provide insight into ongoing studies.

Grey Matters is a comprehensive resource that provides guides to a variety of options for searching grey literature websites, including library websites that have created convenient databases.11

Primary and Secondary Sources
In the search for relevant literature, clinicians should be seeking primary data sources as much as possible. A primary source is a report provided directly by the investigator. Most research articles in professional journals are primary sources, as are dissertations or presentations of research results at conferences.

A secondary source is a description or review of one or more studies presented by someone other than the original authors. Review articles and textbooks are examples of secondary sources, as are reports of studies that often appear in newspapers, professional magazines, websites, or television news. Patients often receive their information about research through such methods. Secondary sources may be convenient, but they can be incomplete, inaccurate, biased, or out-of-date, or they can present misleading information. Clinicians are frequently amazed, when they go back to the original reference, to discover how different their own interpretation of results may be from descriptions provided by others.

Systematic Reviews
Systematic reviews, meta-analyses, and clinical guidelines that include a critical analysis of published works may technically be secondary references, but they become primary sources as a form of research in generating new knowledge through rigorous selection criteria, evaluation of methodological quality, and conclusions drawn from the synthesis of previous research. Even then, however, when study details are relevant, evidence-based decisions will be better served by going back to the original reference. Reviews do not typically provide sufficient information on operational definitions, for example, for decisions to be based on a full understanding of an intervention or a measurement.

■ Databases

Regardless of the type of literature you seek, the most commonly used resources are literature databases. A database is an index of citations that is searchable by keywords, author, title, or journal. Some databases have been created by public agencies, others by private companies or publishers, with varying levels of access (see Table 6-1). Many journals provide free full-text options as part of their listings within a database. The most commonly used databases in healthcare are MEDLINE, CINAHL, and the Cochrane Database of Systematic Reviews. Every health science clinician and researcher should be familiar with these databases at a minimum.

MEDLINE
MEDLINE is by far the most frequently used database in health-related sciences. Offered through the NLM, it indexes more than 5,200 journal titles and over 24 million
Table 6-1 Major Databases and Search Engines in Health and Social Sciences

ASHAWire
American Speech-Language-Hearing Association (ASHA)
pubs.asha.org
Catalogs articles from any ASHA publication. Searching is free, but full text is available by subscription or membership.

CINAHL
Cumulative Index to Nursing and Allied Health Literature
www.cinahl.com
Indexes published articles and other references relevant to nursing and the health professions. Coverage from 1937. Requires subscription.

ClinicalTrials.gov
National Institutes of Health
www.clinicaltrials.gov
The U.S. registry for clinical trials, which includes new, ongoing, and completed studies. Free access.

Cochrane Library of Systematic Reviews
The Cochrane Collaboration
www.cochranelibrary.com
Full-text systematic reviews of primary research in healthcare and health policy. Can be accessed through PubMed. Free access, some full text.

MEDLINE
National Library of Medicine
www.ncbi.nlm.nih.gov/pubmed/
www.pubmed.gov
Citations from life sciences and biomedical literature. Primary access through PubMed. Free access, some full text.

OT Search
American Occupational Therapy Association
www1.aota.org/otsearch/
Citations and abstracts from the occupational therapy literature and its related subject areas. Fee-based subscription.

PEDro
Center for Evidence-based Physiotherapy, University of Sydney, Australia
www.pedro.org.au/
Physiotherapy Evidence Database that includes abstracts of randomized clinical trials, systematic reviews, and practice guidelines. All trials include quality ratings using the PEDro scale. Free access.

PTNow Article Search
American Physical Therapy Association
www.ptnow.org
Full-text articles, clinical summaries synthesizing evidence for management of specific conditions, and resources on tests and measures. Provides access to other databases. Requires APTA membership.

SPORTDiscus
EBSCO
www.ebsco.com
Citations and full text for sports-related references, including nutrition, physical therapy, occupational health, exercise physiology, and kinesiology. Requires subscription.

UpToDate
Wolters Kluwer
www.uptodate.com
Point-of-care, clinical reference tool for EBP. Contains original entries written specifically for UpToDate that synthesize available evidence about clinically important topics. Requires subscription.
references in 40 languages.12 A growing number of references include free full-text articles. Full references are available back to 1966. Earlier citations from 1946–1965 can be obtained through OLD MEDLINE, a limited data set available through PubMed and other platforms.13

PubMed
MEDLINE can be accessed through various platforms, but its primary access is obtained through PubMed, a platform developed and maintained by the National Center for Biotechnology Information (NCBI). It is undoubtedly the most-used health science application because it is free, readily available through the internet, and comprehensive. PubMed includes the entirety of the MEDLINE database as well as supplemental content, including publications that are "ahead of print." It also includes information on studies that are funded through the National Institutes of Health (NIH), which are required to be freely accessible through the PubMed Central (PMC) repository.

CINAHL
The Cumulative Index to Nursing and Allied Health Literature (CINAHL) indexes references from nursing, biomedicine, health sciences, alternative/complementary
medicine, consumer health, and 17 other health disciplines.14 It is also a major point of access for citations from health sciences publications that do not appear in MEDLINE. Although there is substantial overlap with MEDLINE, its unique content is an important supplement. It is more likely to contain citations from smaller, specialty journals in the nursing, clinical, and rehabilitation fields.

CINAHL indexes more than 3.8 million records from more than 3,100 journals, with 70 full-text journals. It contains a mix of scholarly and other types of publications, including healthcare books, dissertations, conference proceedings, standards of practice, audiovisuals and book chapters, and trade publications that contain opinion and news items. Therefore, searchers need to be discerning when reviewing articles to be sure they know the validity of sources.

To be included in MEDLINE, journal editors must submit an application.12 They must be able to demonstrate that the journal records research progress in an area of life sciences and influences development and implementation of related policy. Many smaller or newer journals that do not yet qualify for MEDLINE will be accessible through CINAHL.

Cochrane Database of Systematic Reviews
As its name implies, the Cochrane Database of Systematic Reviews contains citations of systematic reviews (see Chapter 37). It is managed by a global network of researchers and people interested in healthcare, involving more than 37,000 contributors from 130 countries.15 Reviews are prepared and regularly updated by 53 Review Groups made up of individuals interested in particular topic areas. Cochrane systematic reviews are considered high quality with rigorous standards. References can be accessed directly through the Cochrane group or through other databases such as MEDLINE or Embase.

Evidence-Based Resources
Separate from literature reviews that are published as articles, there are databases devoted to helping clinicians with decision-making. These EBP databases come in a variety of types. Some provide single-article analyses, such as the Physiotherapy Evidence Database (PEDro), which includes quality assessments of randomized trials, listings of systematic reviews, and clinical practice guidelines. The Cochrane Library focuses on high-quality systematic reviews. Others, such as UpToDate, are "point of care" sources to provide clinicians with a quick overview of the research pertaining to all aspects of a diagnosis, treatment, or prognosis. Many journals include EBP reviews in specialized content.16–18

Search Engines
Search engines are platforms that access one or more databases. Some databases, such as MEDLINE, can be accessed from different search engines. In addition to PubMed, which is freely available, libraries will often provide access through OVID (an integrated system covering more than 100 databases19) or EBSCO,14 both of which also offer access to e-books and other scholarly products. Translating Research into Practice (Trip) is a search engine geared primarily toward EBP and clinicians, searching multiple databases and providing a search algorithm based on PICO.20 Google Scholar provides free searching for a wide spectrum of references, including grey literature (see Focus on Evidence 6-1).21

Focus on Evidence 6–1
"'Google' is not a synonym for 'research.'"
— Dan Brown, The Lost Symbol

Everyone is familiar with Google and the ease of searching for just about anything. Google is not particularly useful in the search for research evidence, however. Although it may locate some sources, it does not focus on scientific literature and does not have the ability to designate search parameters to get relevant citations.

Google Scholar is a search engine that locates articles, books, and theses on a wide variety of topics and disciplines, generally from traditional academic publishers.21 Although searches can be limited to specific authors or a certain time frame, it does not have search limits for research designs or languages, study subject characteristics, or many of the other features most databases have. Google Scholar can match PubMed in locating primary sources but will also find many more irrelevant citations.22 It is best used in combination with other databases.

Google Scholar can be helpful at the beginning of your literature searching when you are struggling to find the right keywords, and will often provide full-text files for articles and reports that are not available from other sites.23 It can also be helpful when searching for grey literature, but it requires thorough searching since these references will often be buried after 100 pages.24

Like its parent search engine, Google Scholar ranks results by popularity and filters results based on browser version, geographic location, and previously entered search strings.25,26 Therefore, it is not recommended as a primary resource for systematic reviews or other searches because it cannot be replicated.
Many databases offer mobile apps as well as their full Web versions. PubMed and UpToDate are two that make it easy to keep the information you need right at your fingertips. PubMed's app even helps you set up a PICO search.

Databases tend to have common approaches to the search process, although there will be some unique features and different interfaces. Given its widespread use, we will use PubMed to illustrate the major steps in searching. All databases have tutorials that will explain their specific use.

[Figure: annotated screenshot of the PubMed homepage, with numbered callouts]
1 This dropdown menu allows access to other databases provided by the National Library of Medicine, including PubMed Central (PMC), and MeSH headings.
2 We start the search by entering keywords or phrases into this search window.
3 Several advanced features can be accessed through this link, including combining and building complex searches (see Figure 6.8).
4 Tutorials and FAQs are useful for new and experienced users.
5 Clinical Queries allows searching specifically for references on therapy (intervention), diagnosis, prognosis, etiology, or clinical guidelines (see Figure 6.9).
6 MeSH terms can be accessed directly through this link.
7 My NCBI allows users to set up a free account, to personalize settings and save searches.
Figure 6–3 The search starts by entering keywords on the PubMed homepage. This page also lists several useful resources. Accessed from The National Center for Biotechnology Information.
is used to connect keywords representing different topics and will narrow the search only to references that include both. By adding a third term, “mobilization,” we have further restricted our results to only those studies that use all three terms (see Fig. 6-4B).

OR broadens the search by combining two keywords, resulting in all citations that mention either word. Typically, OR is used to connect keywords that describe similar ideas that may be synonyms. Figure 6-4C shows the combination of “physical therapy” OR “mobilization,” since both terms may be used for the intervention.

NOT excludes words from the search. For example, in this search we do not want to include studies that use “manipulation” for this condition. In Figure 6-4D, all references that include that term are omitted from the results. The NOT operator should be used judiciously, as it will often eliminate citations you might have found relevant. Other ways of narrowing the search should be tried first, such as strategies for refining the search, to be described shortly.

Nesting

Sometimes the focus of a search requires various combinations of keywords. Note, for example, the combination of AND and OR in Figure 6-4E. We have expanded the search using OR to include articles that use either “physical therapy” or “mobilization,” but we have put them in parentheses to connect them with “adhesive capsulitis” using AND. This is called nesting and is necessary for the database to interpret your search correctly. For example, if we had entered:

“adhesive capsulitis” AND “physical therapy” OR mobilization

without the parentheses, “OR mobilization” will be seen as an independent term, and all references to mobilization will be included, regardless of their connection to the first two terms.

Figure 6–4 Venn diagrams showing keyword combinations: (A) adhesive capsulitis AND physical therapy; (B) adhesive capsulitis AND physical therapy AND mobilization; (C) physical therapy OR mobilization; (D) NOT manipulation; (E) the nested combination.

■ Getting Results

Your search will result in a list of citations that match the keywords (see Fig. 6-5). As in this example, which shows 1,035 citations, initial searches often provide an overly long list that must be further refined. Looking through the titles can give you an idea of how closely the results match your intended search.

Clicking on any title will get you to the study abstract (see Fig. 6-6). All research papers, including systematic reviews, are required to have an abstract, although the content can vary greatly depending on the journal. Most journals have a required format for abstracts, with specific categories of content, such as background, the study question or objective, population, study design, setting, methods, analysis, results, and conclusions. The abstract should give you a good idea of whether reading the full article will be worth your time.

■ Medical Subject Headings

The more traditional databases have added subject heading descriptors to help searchers find the most relevant content. These words are assigned to each article or book in the database and provide a consistent vocabulary that can alleviate the need to brainstorm synonyms and word variations that arise when authors use different terminology for the same concepts. In the health sciences, the most well-known subject heading system has been developed by the NLM, called Medical Subject Headings (MeSH). Although specifically developed for use with MEDLINE, many other databases have adopted the system.

MeSH consists of sets of terms in a hierarchical structure or “tree” that permits searching at various levels of specificity. The tree starts with broad categories and gradually narrows to more specific terms. In general, using both keywords and subject headings helps to ensure a more comprehensive list of results from your search, especially for systematic reviews. That said, there is not always a relevant subject heading for your needs.
1 A total of 1,035 citations were returned. Only the first 20 are displayed.
2 The citations are sorted by “Best Match,” based on an algorithm of the frequency of use of keywords. This dropdown menu includes
other options such as “Most Recent,” which lists citations in order of publication date, or alphabetical by author, journal, or title.
3 The left menu bar lists several limits that can be set for the search, including type of article, full-text options, and publication dates.
The list of “Article Types” can be expanded using “Customize…” to specify RCTs, systematic reviews, or other formats. “Additional
filters” include language, sex, age, and other descriptors. Clicking on these links will result in a revised display.
4 This dropdown menu allows you to set the number of citations viewed on the screen between 5 and 200.
5 The “Send to” dropdown menu allows you to save selected citations in several formats (see Figure 6.10).
6 Clicking on “Similar Articles” will generate a new list of citations with content similar to that article. Those citations will be listed in order
of relevance to the original article.
7 MeSH terms are provided for the entered keywords. Note that “bursitis” is the heading under which “adhesive capsulitis” will be listed
(see Figure 6.7).
Figure 6–5 Results of the initial search using “adhesive capsulitis” AND (“physical therapy” OR “mobilization”) as keywords.
This page offers several options for the display of results. This search was run on April 5, 2018. Accessed from The National Center for Biotechnology
Information.
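The behavior of AND, OR, NOT, and nesting in a search like this one can be mimicked with set operations. The following is an illustrative sketch over a made-up four-article corpus, not PubMed’s actual implementation:

```python
# Hypothetical miniature corpus: article IDs mapped to their keywords.
articles = {
    1: {"adhesive capsulitis", "physical therapy", "manipulation"},
    2: {"adhesive capsulitis", "mobilization"},
    3: {"adhesive capsulitis", "manipulation"},
    4: {"physical therapy", "mobilization"},
}

def hits(keyword):
    """IDs of all articles tagged with the given keyword."""
    return {pmid for pmid, kws in articles.items() if keyword in kws}

# Nested: "adhesive capsulitis" AND ("physical therapy" OR "mobilization")
nested = hits("adhesive capsulitis") & (hits("physical therapy") | hits("mobilization"))

# Unnested, evaluated left to right:
# ("adhesive capsulitis" AND "physical therapy") OR "mobilization"
unnested = (hits("adhesive capsulitis") & hits("physical therapy")) | hits("mobilization")

# NOT excludes a term from the nested result.
without_manipulation = nested - hits("manipulation")

print(sorted(nested))                # [1, 2]
print(sorted(unnested))              # [1, 2, 4]
print(sorted(without_manipulation))  # [2]
```

Note how article 4 leaks into the unnested result even though it never mentions adhesive capsulitis, which is exactly the problem that parentheses prevent.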
1 The title and source information are provided, including the journal title, date, and pages. “Epub” indicates that this article was
originally published electronically ahead of print.
2 The authors’ names are listed and clicking on an author’s name will bring up a new list of citations by that author. “Author Information”
can be expanded to reveal contact information.
3 The full abstract is provided. Formats will differ among journals.
4 This icon tells you that this article is available in full text. This may or may not be free. Click on the icon to get information on availability.
5 Systematic reviews that have referenced this study are listed here.
6 Articles with similar content are listed here. Clicking on these links will bring you to the abstract for that article.
7 “PMID” is the PubMed identifier. “DOI” is the digital object identifier, a unique number assigned to scholarly articles. These numbers are often included in reference lists and can be used to locate particular articles in a search.
8 Clicking on this link will provide a list of MeSH terms for the search keywords (see Figure 6.7).
Figure 6–6 The abstract display in PubMed, obtained by clicking on a citation link from the results page. Accessed from The National
Center for Biotechnology Information.
Mellitus,” the explode function allows that search to also include articles tagged with the narrower MeSH terms “Diabetes Mellitus, Type 1” and “Diabetes Mellitus, Type 2” along with several other subtypes of the disease. Some databases and search engines have the explode feature turned on automatically. This is true in PubMed.

Focusing the search means that it will be limited to documents for which the subject heading has been identified as a major point of the article. In Figure 6-7, you can tell which terms have been designated as major points because they are followed by asterisks. In PubMed, you can find the focus feature on the page for the MeSH term as a check box labeled “Restrict to MeSH Major Topic.”

Subheadings

MEDLINE also uses a system of subheadings to further detail each MeSH term. In Figure 6-7, you can see the MeSH terms for one of the articles found in our PubMed search. You will notice that some of the terms are followed by a slash and a subheading, such as “Shoulder Joint/physiopathology.” If you only want articles about the physiopathology of shoulder joints, this is one way to narrow your search.
1 The search builder allows you to enter successive terms to create a comprehensive search.
2 Each successive search is numbered, with the most recent at the top.
3 The detailed terms for each search are provided. Searches can be combined by using keywords or using search numbers for prior
searches.
4 The number of citations for each search is provided. By clicking on the number of items, the results of that search will be displayed.
5 If filters are applied (in this case for language), they will be indicated. There were six references from search #5 that were not
published in English.
Figure 6–8 The Advanced Search page, obtained by clicking the link “Advanced” from the results or abstract page (see
Figure 6-3 3 ). This page shows the details of a search history, allowing you to combine search components and to see the
results of various keyword combinations. Accessed from The National Center for Biotechnology Information.
Clinical Queries

PubMed has developed predesigned strategies, called Clinical Queries, to target a search for studies on therapy (interventions), diagnosis, etiology, prognosis, or clinical prediction rules (see Fig. 6-9). It will generate references for clinical trials and systematic reviews.

Sensitivity and Specificity

Sensitivity is the proportion of relevant articles identified by a search out of all relevant articles that exist on that topic. It is a measure of the ability of the search strategy to identify all pertinent citations. A sensitive search casts a wide net to make sure that the search is comprehensive and, therefore, may also retrieve many irrelevant references. In the PubMed Clinical Queries search, you can make the scope of your search more sensitive by using the drop-down menu to select “Broad” (see Fig. 6-9 3 ).

Specificity is a measure of the ability of the search strategy to exclude irrelevant articles, reflected in the proportion of retrieved citations that are actually relevant. A specific search has more precision, trying to find only relevant documents. Therefore, while it is more likely to include only relevant studies, it might also exclude some relevant ones. In the Clinical Queries search, you can make the scope of your search more specific by using the drop-down menu to select “Narrow” (see Fig. 6-9 3 ).
1 Keywords are entered in the search box. Phrases are in quotes and terms are nested.
2 Results are given for clinical study categories, with a dropdown menu that allows you to choose studies based on the type of clinical
question: etiology, diagnosis, therapy (intervention), prognosis, or clinical prediction guides.
3 Dropdown menu for scope allows you to choose a narrow or broad focus, making the search more selective or more comprehensive.
With a broad scope, we obtained 54 citations. If we had applied a narrow scope, the result would have only been 12 citations.
4 A separate category for systematic reviews is included.
5 The number of citations displayed can be expanded.
Figure 6–9 Screenshot of results of a search in “Clinical Queries,” showing results for clinical trials and systematic reviews. A
“Medical Genetics” search is a regular part of this feature, but does not result in any citations for this topic. This feature is
accessed through a link on the PubMed homepage (see Figure 6-3 5 ). Accessed from The National Center for Biotechnology Information.
To be sure that a systematic review encompasses the most recent findings, it is important to note the dates of the review, and if updates have been written.

■ Expanding Search Strategies

In many cases, database searching will not be sufficient to create a comprehensive literature review. Databases are not necessarily all-inclusive repositories and even expertly crafted searches will miss relevant materials. Therefore, researchers should employ techniques beyond keyword and subject heading searches.

Related References

Several databases offer an option to search “related articles” or “similar articles” once a relevant citation has been found (see Fig. 6-5 6 and 6-6 5 ). By using this function, one relevant article becomes the key to a list of useful references. These citations will usually be displayed in order of their relevance to the original reference.

Searching for citing articles is similar to looking at an article’s reference list but, instead of finding older articles, it finds articles published since the one in question. Look for links next to identified articles that say “times cited in this database” or “find citing articles.”

Web of Science is an entire database dedicated to helping track literature forward and backward. It can be an important component of any comprehensive search.

Journal and Author Searching

If an author is an expert on a subject, there is a good chance he or she has published more than one article on that topic. Most online databases make it easy to track down an author’s other articles by making the name a link (see Fig. 6-6 2 ). Similarly, if you notice that a particular journal frequently publishes articles on your topic, browsing its tables of contents or doing general keyword searches limited to that journal may be advisable.
research. Do not let your literature search be a “forever” task!

Looking for Evidence

If you are looking for evidence to support a clinical decision, you may find yourself sufficiently informed by finding a few recent clinical trials or systematic reviews that provide a broad enough perspective for you to move forward. Meta-analyses provide additional information on the size of effects or the statistical significance of outcomes. You will want to see some consistency across these studies to help you feel confident in results. In addition, you will want to critically appraise the quality of the studies to determine whether they provide adequate validity for their findings (see Chapter 36).

Depending on the amount of literature that exists on your topic, you may need to think broadly about the pathophysiology and theoretical issues that pertain to your patient’s problem. For example, you may not find references on adhesive capsulitis and corticosteroid injections specific to 67-year-old women and, as we have seen, perhaps not even specific to type 2 diabetes. You will need to connect the dots and may have to ask background questions to substantiate your decisions (see Chapter 5). Allow yourself to be satisfied once you have obtained relevant references that provide consistent, logical rationale.

Review of Literature

If your purpose is to establish a foundation for a research question, your approach will need to be more comprehensive. There are likely to be many different topics that need to be addressed, including theoretical underpinnings of your question, how variables have been defined, which design and analysis strategies have been used, and what results have been obtained in prior studies. It is essential to establish the state of knowledge on the topic of interest; therefore, a thorough review is necessary.

Starting with a recent systematic review is the best strategy, assuming one is available, but you will need to get full-text references to determine the actual study methods. As you review recent studies, you will find that reference lists contain the same citations and eventually you will be able to determine that you have obtained the most relevant information. You will also likely continue to search the literature as your study goes on, especially as you interpret your results, needing to compare your findings with other study outcomes. Setting e-mail alerts for related topics can be helpful as your study progresses.

Next Steps

Searching and retrieval of references is one early step in the process of reviewing literature. Whatever your purpose, you will now need to determine how useful the references are. Most importantly, you will need to assess their quality—how well the studies have been executed. You need to determine whether the results are valid and whether they provide sufficient information for you to incorporate into your clinical decisions or research plan. This requires an understanding of design and analysis concepts.
COMMENTARY
If at first you don’t succeed, search, search again. That’s why they call it research.
—Unknown
Information is essential to the health professions. Whatever the reason for your literature search, be it for research or clinical purposes, it can be a frustrating process if you are not knowledgeable and experienced in the use of resources available to you. Although we have reviewed some of the basic strategies for identifying relevant databases and putting searches together, we cannot begin to cover all the possible search strategies or variations a particular search may require. Take advantage of the tutorials and tips offered by most search engines and databases, or classes sponsored by many libraries.

Because of our unlimited access to the internet, many of us, at least the digital natives, feel very self-sufficient in searching for information. But searching for the right literature to inform practice or support research rationale is not always so simple. The lament we have heard all too often is, “I’ve spent 5 hours searching and haven’t found anything!” This is not a good use of anyone’s time! Whether you are just beginning to conceptualize your question, have hit a dead end in your search, or need advice about evaluating and organizing the information you have found, reference librarians can help. Do not hesitate to take advantage of these experts who understand how information is created, organized, stored, and retrieved, and who are experienced in working with researchers of all types. They will also be able to keep you apprised of changes in databases or search options, as these are often updated.
CHAPTER 7
Ethical Issues in Clinical Research
Ethical issues are of concern in all health professions, with principles delineated in codes of ethics that address all aspects of practice and professional behavior. Clinical research requires special considerations because it involves the participation of people for the purpose of gaining new knowledge that may or may not have a direct impact on their lives or health. We ask them to contribute to our efforts because we seek answers to questions that we believe will eventually be of benefit to society. Because their participation may put them at risk, or at least some inconvenience, it is our responsibility to ensure that they are treated with respect and that we look out for their safety and well-being. We also know that published findings may influence clinical decisions, potentially affecting the lives of patients or the operation of health systems. Therefore, we must be vigilant to ensure the integrity of data.

The purpose of this chapter is to present general principles and practices that have become standards in the planning, implementation, and reporting of research involving human subjects or scientific theory. These principles apply to quantitative and qualitative studies, and they elucidate the ethical obligation of researchers to ensure protection of the participants, to safeguard scientific truth, and to engage in meaningful research.

■ The Protection of Human Rights

In our society, we recognize a responsibility to research subjects for assuring their rights as individuals. It is unfortunate that our history is replete with examples of unconscionable research practices that have violated ethical and moral tenets.1 One of the more well-known examples is the Tuskegee syphilis study, begun in the 1930s, in which treatment for syphilis was withheld from rural, black men to observe the natural course of the disease.2 This study continued until the early 1970s, long after penicillin had been identified as an effective cure.

One would like to think that most examples of unethical healthcare research have occurred because of misguided intentions and not because of purposeful malice. Nonetheless, the number of instances in which such behavior has been perpetrated in both human and animal research mandates that we pay attention to both direct and indirect consequences of such choices (see Focus on Evidence 7-1).

Regulations for Conducting Human Subjects Research

Since the middle of the 20th century, the rules of conducting research have been and continue to be discussed, legislated, and codified. Establishment of formal guidelines delineating rights of research subjects and obligations of professional investigators became a societal necessity as clear abuses of experimentation came to light.

Nuremberg Code

The first formal document to address ethical human research practices was the 1949 Nuremberg Code.3 These
In 1965, the Willowbrook State School in Staten Island, NY, housed 6,000 children with severe intellectual disabilities, with major overcrowding and understaffing. With a 90% rate of infection with hepatitis in the school, Dr. Saul Krugman was brought in to study the natural history of the disease. Through the 1960s, children were deliberately exposed to live hepatitis virus and observed for changes in skin, eyes, and eating habits.9 Krugman10 reasoned that the process was ethical because most of the children would get hepatitis anyway, they would be isolated to avoid exposure to other diseases, and their infections would eventually lead to immunity. Although they obtained informed consent from parents following group information sessions, and approvals through New York University (prior to IRBs),11 the ethical dimensions of these studies raise serious concerns by today’s standards. The abuses and deplorable conditions at Willowbrook were uncovered in a shattering television exposé by Geraldo Rivera in 1972,12 and the school was eventually closed in 1987.

Henrietta Lacks was a patient at Johns Hopkins Hospital in 1951, where she was assessed for a lump in her cervix.13 A small tissue sample was obtained and she was subsequently diagnosed with cancer. In the process of testing the sample, however, scientists discovered something unusual. One of the difficulties in cancer research was that human cells would not stay alive long enough for long-range study—but Henrietta’s cancer cells not only survived, they continued to grow. The line of cells obtained from her, named HeLa, has now been used in labs around the world for decades, and has fostered large-scale commercial research initiatives.

Neither Henrietta nor her family ever gave permission for the cells to be used, nor were they even aware of their use until 1973, long after her death in 1951. And of course, they accrued no benefit from the financial gains that were based on her genetic material. The family sued, but was denied rights to the cell line, as it was considered “discarded material.”14 The DNA of the cell line has also been published, without permission of the family. There have since been negotiations, assuring that the family would have some involvement in further applications of the cell line. The Henrietta Lacks Foundation was established in 2010 to “help individuals who have made important contributions to scientific research without personally benefiting from those contributions, particularly those used in research without their knowledge or consent.”15 This compelling story has been chronicled in The Immortal Life of Henrietta Lacks,13 which has been turned into a movie.
guidelines were developed in concert with the Nuremberg trials of Nazi physicians who conducted criminal experimentation on captive victims during World War II. The most common defense in that trial was that there were no laws defining the legality of such actions.4

The Nuremberg Code clearly emphasized that every individual should voluntarily consent to participate as a research subject. Consent should be given only after the subject has sufficient knowledge of the purposes, procedures, inconveniences, and potential hazards of the experiment. This principle underlies the current practice of obtaining informed consent prior to initiation of clinical research or therapeutic intervention. The Nuremberg Code also addresses the competence of the investigator, stating that research “should be conducted only by scientifically qualified persons.”

Declaration of Helsinki

The World Medical Association adopted the Declaration of Helsinki in 1964.5 It has been amended and reaffirmed several times, most recently in 2013. This document addresses the concept of independent review of research
protocols by a committee of individuals who are not associated with the proposed project. It also declares that reports of research that has not been conducted according to stated principles should not be accepted for publication. This led to an editorial challenge to professional journals to obtain assurance that submitted reports of human studies do indeed reflect proper attention to ethical conduct. Most journals now require acknowledgment that informed consent was obtained from subjects and that studies received approval from a review board prior to initiating the research.

National Research Act

In 1974, the National Research Act was signed into law by Richard Nixon, largely in response to public recognition of misconduct in the Tuskegee syphilis study.6 This law requires the development of a full research proposal that identifies the importance of the study, obtaining informed consent, and review by an Institutional Review Board (IRB). These essential principles have been incorporated into federal regulations for the protection of human subjects.7

Belmont Report

The National Research Act established the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. The deliberations and recommendations of this Commission resulted in the Belmont Report in 1979, which delineates the guiding ethical principles for human research studies.8 The Belmont Report also established the rules and regulations that govern research efforts of the Department of Health and Human Services (DHHS) and the Food and Drug Administration (FDA). The Office for Human Research Protections (OHRP) is the administrative arm of the DHHS that is responsible for implementing the regulations and providing guidance to those who conduct human studies.

The Belmont Report established boundaries between research and practice.8 Although both often take place in similar settings and may be implemented by the same people, practice is considered routine intervention for the benefit of an individual that has reasonable expectation of success. In contrast, research is an activity designed to test a hypothesis, draw conclusions, and contribute to generalizable knowledge, which may or may not benefit the individual being studied. Research is also typically described in a formal protocol that sets objectives and procedures.

The Common Rule

Based on tenets in the Belmont Report, the Federal Policy for the Protection of Human Subjects, known as the Common Rule, outlines the basic provisions for ethical research. Originally published in 1991, revised in 2011 and again in 2019, this rule codifies regulations adopted by many federal departments and agencies, with a focus on institutional guidelines, informed consent, and review of proposals for the protection of participants. Researchers should become familiar with the substantial changes in the 2019 revision, as these will also require changes to institutional guidelines.16 Technically, studies that are not federally funded are not required to follow these regulations, although most institutions use the Common Rule as a guide nonetheless.

See the Chapter 7 Supplement for a description of the major 2019 revisions to the Common Rule.

The Health Insurance Portability and Accountability Act (HIPAA) of 1996 and the resultant regulations (the Privacy Rule) issued by the DHHS have added a new dimension to the process of protecting research subjects related specifically to the protection of private health information.17

Guiding Ethical Principles

The Belmont Report established three basic ethical principles that have become the foundation for research efforts, serving as justification for human actions: respect for persons, beneficence, and justice.8

Respect for Persons

Respecting human dignity is essential to ethical behavior. This principle incorporates two considerations. The first is personal autonomy, referring to self-determination and the capacity of individuals to make decisions affecting their lives and to act on those decisions. This principle also recognizes that those with diminished autonomy must be protected. Some individuals who may be appropriate subjects for research, such as children or patients with cognitive problems, may be unable to deliberate about personal goals. In these cases, the researcher is obliged to ensure that an authorized surrogate decision-maker is available, has the ability to make a reasoned decision, and is committed to the well-being of the compromised individual.

Beneficence

Beneficence refers to the obligation to attend to the well-being of individuals. This extends to research projects and the entire research enterprise to ensure that all who engage in clinical research are bound to maximize possible benefits and minimize possible harm. Risks may be physical, psychological, social, or economic harm that goes beyond expected experiences in daily life. Research benefits may include new knowledge that can be applied
Flowchart: an administrative decision on review status leads to review and approval by an administrator, approval by the IRB chair, review by the full IRB, or deferred approval.
Institutional Guidelines

Each institution establishes its own guidelines for review in accordance with federal and state regulations. Researchers should, therefore, become familiar with the requirements in their own institutions, especially as the new Common Rule is being implemented. For instance, one useful change is a new provision that studies conducted at multiple sites only need to obtain approval from one IRB that will oversee the study. The approval process can take several weeks, depending on the IRB’s schedule and whether the proposal needs revision. This review process should be included in the timetable for any research project. Research on human subjects should not be started without prior review and approval of a designated review committee.

IRBs require regular reporting of ongoing studies, typically on an annual basis. Researchers must be able to document the current status of the project. Any changes to the protocol must also be filed for approval before they take effect. Researchers are also responsible for notifying the IRB promptly of any critical incidents associated with the project. A final notification must

■ Informed Consent

Perhaps the most important ethical tenet in human studies is the individual’s ability to agree to participate with full understanding of what will happen to him. The informed consent process and all its elements address the basic principles of respect, beneficence, and justice.

➤ CASE IN POINT

A trial was conducted to evaluate the effects of accelerated or delayed rehabilitation protocols on recovery of range of motion (ROM) following arthroscopic rotator cuff repair.20 Patients were randomly assigned to begin exercises 2 weeks or 4 weeks following surgery. Both groups received the same protocol. Active ROM was assessed at postoperative weeks 3, 5, 8, 12, and 24. Written informed consent was obtained from all patients.

The components of the informed consent process include two important elements: information and consent (see Table 7-1).8 Each of these elements must be incorporated into an informed consent form. Many institutions have created templates that require consistent
6113_Ch07_088-104 02/12/19 4:54 pm Page 93
inclusion of content and some boilerplate language. The Subjects Must Be Fully Informed
sample in Figure 7-2 offers a generic format for informed The informed consent process begins with an invitation
consent to demonstrate how various elements can be to participate. A statement of the purpose of the study
included. permits potential subjects to decide whether they believe
in or agree with the worth and importance of the re-
Information Elements search. The process then requires that the researcher pro-
Information elements include disclosure of information vide, in writing, a fair explanation of the procedures to be
and the subject’s comprehension of that information. used and how they will be applied. This explanation must
Title: The Effect of Accelerated or Delayed Rehabilitation Following Rotator Cuff Repair
Principal Investigator: John Smith, PT, DPT, PhD
Additional Investigators: Alice Jones, MD; Jennifer Ames, OTD
1 Brief Summary
You are being asked to participate in a research study that will evaluate the effect of starting exercises 2 or 4 weeks following
rotator cuff repair. This informed consent document provides important information about the study, makes clear that your participation is voluntary, explains potential risks and benefits, and allows you to make an informed decision about whether you want to participate. This summary will be followed by more detailed information.
Your participation in this study will span 14 weeks, during which you will take part in exercise sessions three times per week. These
exercises will be supervised by a physical therapist, who will also take measurements of your progress in shoulder movement.
You may experience pain in your shoulder, and we will limit the activity in your shoulder to minimize pain. The range of motion
of your shoulder may improve as you continue this exercise program, which may be accompanied by less pain. If you decide
not to participate in the study, you may choose to work with another therapist who will be able to provide standard care for
rotator cuff repair. You may also choose to contact your surgeon, primary care provider, or other health provider to discuss the
best approach to regain movement in your shoulder following surgery.
Before your program begins, at your initial visit, you will be asked to provide descriptive information about yourself, including your age, height, weight, and other health information regarding conditions you might have. You are free to decline to provide
any information. This information will be used to describe the group of people who are part of this study.
A physical therapist will measure several movements in your shoulder by asking you to raise your arm in different directions.
This measurement will happen 4 times during the study, at the 3rd, 5th, 8th, and 12th week. You will also be asked to return at the
24th week following your surgery for a final measurement. Measurements will be taken by a different therapist than the one
who works with you at each exercise session. These tests should take approximately 20 minutes.
Therapists who will gather your personal information and who will perform your testing will not know which group you are part
of. We ask that you do not tell them.
Figure 7–2 Sample informed consent form. Numbers in the left margin correspond to required elements listed in Table 7-1.
Content in the form is based on the protocol described by Düzgün I, Baltaci G, Turgut E, Atay OA. Effects of slow and
accelerated rehabilitation protocols on range of motion after arthroscopic rotator cuff repair. Acta Orthop Traumatol Turc
2014;48(6):642–648.
Should you be injured during your participation, treatment costs will be charged to your insurance carrier. If for any reason
these costs are not covered, they will be your responsibility. You will also be responsible for any deductible or co-pay costs.
If you have questions about your rights as a research participant, or you wish to ask questions about this study of someone
other than the researchers, please contact the State University Institutional Review Board at 617-555-4343, Email: irb@su.edu.
12 Your consent
By signing this document, you are agreeing to participate in this study. Please make sure you understand what the study is
about and what will be expected of you before you sign. You will be given a copy of this document for your records. We will keep
the signed copy with our study records. If you have questions after you sign the document, you can contact the study team
using the information provided above.
I understand what the study is about and my questions so far have been answered. I agree to take part in this study.
Figure 7–2—cont’d
be complete, with no deception by virtue of commission or omission. Subjects should know what will be done to them, how long it will take, what they will feel, which side effects can be expected, and what types of questions they may be asked. An important revision to the Common Rule includes the requirement that informed consent forms must begin with a concise presentation of the “key information” that a reasonable person would want to have in order to make an informed decision about whether to participate.16 This means that participants do not have to wade through a long, detailed form to find crucial information.

➤ For the rotator cuff study, subjects should be told what kinds of exercises they will be doing. They should be informed about the schedule of assessments, how often they have to visit the clinic for measurement, and how long each visit will last.
If subjects cannot read the informed consent document, it should be read to them. Children should be informed to whatever extent is reasonable for their age. Subjects should also know why they have been selected to participate in terms of inclusion criteria for the study, such as clinical condition or age. If the subjects are patients, they should understand the distinction between procedures that are experimental and procedures, if any, that are proposed to serve their personal needs.

New Common Rule regulations also require statements in informed consent forms regarding use of identifiable information or biospecimens, commercialization, and results of whole genome sequencing.16 These changes would have addressed the concerns in the case of Henrietta Lacks.

Control Groups
The researcher is obliged to inform the potential participants when a study is planned that includes a control group. Experimental subjects receive the studied intervention, whereas control subjects may receive an alternative treatment, usual care, a placebo, or no treatment at all. When the study involves random assignment, subjects should know that there is a chance they will be assigned to either group (see Box 7-1).

Box 7-1 Uncertainty and Placebos
There is an ethical concept in the design of clinical trials called equipoise, which means that we should only conduct trials under conditions of uncertainty, when we do not already have sufficient evidence of the benefit of a treatment.21 The Declaration of Helsinki addresses this issue, stating that: “The benefits, risks, burdens and effectiveness of a new method should be tested against those of the best current prophylactic, diagnostic, and therapeutic methods.”5
This concept also means that placebos must be used judiciously. There may be some clinical conditions for which no treatments have been effective. In that case, it is necessary to compare a new treatment with a no-treatment condition. It is also necessary to make such a comparison when the purpose of the research is to determine whether a particular treatment approach is not effective. In situations where the efficacy of a treatment is being questioned because current knowledge is inadequate, it may actually be more ethical to take the time to make appropriate controlled comparisons than to continue clinical practice using potentially ineffective techniques. The Declaration of Helsinki specifically states that placebos are permissible “when no proven intervention exists” or when a compelling scientific rationale for a placebo can be provided, with the assurance that control subjects will not be at risk.22 Based on this principle, trials should include a comparison with standard care or other recognized interventions as one of the treatment arms, as long as such treatment is available.23 It is also considered unethical to include an alternative therapy that is known to be inferior to another treatment.21
➤ In the rotator cuff study, subjects were randomly assigned to receive intervention on different timelines. They should be told that there are varied time periods and how they differ.

According to the Declaration of Helsinki, as a form of compensation, the experimental treatment may be offered to the control participants after data collection is complete, if results indicate that treatment is beneficial.5 As clinician-researchers, we are obliged to discuss alternative treatments that would be appropriate for each patient/subject if such alternatives are employed in clinical practice.

See Chapter 14 for a description of various strategies for random assignment of subjects, including wait-list controls.

Disclosure
An ethical dilemma occurs when complete disclosure of procedures might hinder the outcomes of a study by biasing the subjects so that they do not respond in a typical way. When the risks are not great, review boards may allow researchers to pursue a deceptive course, but subjects must be told that information is being withheld and that they will be informed of all procedures after completion of data collection. For example, when the research design includes a control or placebo group, subjects may not know which treatment they are receiving to preserve blinding. They should know that they will be told their group assignment at the completion of the study.

Potential Risks
An important aspect of informed consent is the description of all reasonably foreseeable risks or discomforts to which the patient will be subjected, directly or indirectly, as part of the study. The researcher should detail the steps that will be taken to protect against these risks and the treatments that are available for potential side effects. The informed consent form is not binding on subjects, and subjects never waive their rights to redress if their participation should cause them harm. The form should not contain language that appears to release the researcher from liability.

➤ The subjects in the rotator cuff study needed to be informed about the potential for pain or fatigue from exercises. ROM measurement could also result in some pain as the joint is stretched. The subjects should know what kind of pain is potentially expected and how long it should last.
undue influence, such as children, prisoners, those with impaired decision-making capacity, or those who are educationally or economically disadvantaged. More subtle circumstances exist with the involvement of hospitalized patients, nursing home residents, or students. In these cases, the sense of pleasing those in authority may affect the subject's decisions. Importantly, consent implies that there is no intimidation. Researchers must be able to demonstrate how safeguards are included in the protocol to protect the rights and welfare of such subjects. If a subject is not competent, consent must be obtained from a legally authorized representative.7

The regulations regarding children as research subjects require that parents or guardians give permission for participation. Furthermore, if a child is considered competent to understand, regardless of age, his or her assent should be expressed as affirmative agreement to participate.

Subjects Must Be Free to Withdraw Consent at Any Time
The informed consent document must indicate that the subject is free to discontinue participation for any reason at any time without prejudice; that is, the subject should be assured that no steps will be taken against them, and, if they are a patient, that the quality of their care will not be affected. This can occur before or during an experiment, or even after data collection, when a subject might request that their data be discarded. The informed consent form should include a statement that subjects will be informed about important new findings arising during the course of the study that may be relevant to their continued participation. It should also provide information about anticipated circumstances under which the subject's participation may be terminated by the investigator for the subject's safety or comfort.

Format of the Informed Consent Form
All subjects must give informed consent prior to participating in a project. The written form should be signed and dated by the subject and researcher. Subjects should receive a copy of this form, and the researcher must retain the signed copy.
Required formats for consent forms may vary at different institutions. The format of the sample provided in Figure 7-2 follows recommendations from the NIH and is similar to those used in many institutions. It specifically identifies and acknowledges all the elements of informed consent.26 Of special note, signatures should not be on a separate page, but should appear with some of the text to show that the signatures are applied in the context of the larger document. Although written consent is preferable, some agencies allow oral consent in selected circumstances. In this case, a written “short form” can be used that describes the information presented orally to the subject or their legally authorized representative.7 This short form is submitted to the IRB for approval.

The example that is provided in Figure 7-2 is just one possible version of an informed consent form in a concise format. It is only presented as a sample, and may require expanded explanations, depending on the nature of the research. It does not include sample language, for instance, for the use of biospecimens, genome sequencing, educational or psychological research, use of secondary data, or data that contain private information and identifiers. Institutions often have different policies, templates, or boilerplate language that must be incorporated into these forms. Guidelines are usually available through IRBs as well as through the Department of Health and Human Services.27 Given the changes to the Common Rule, it is important for researchers to review relevant information. The new rule also requires that consent forms for federally funded studies be posted on a public website.27

Informed Consent and Usual Care
Clinical research studies are often designed to test specific treatment protocols that are accepted as standard care, and subjects are recruited from those who would receive such treatments. Clinicians often ask whether informed consent is necessary for a research study when the procedures would have been used anyway. The answer is yes! Even where treatment is viewed as usual care, patients are entitled to understand alternatives that are available to them. Patients must always be informed of the use of the data that are collected during their treatments, and they should have sufficient information to decide to participate or not, regardless of whether treatment is viewed as experimental or accepted clinical practice.

■ Research Integrity
Although truth and objectivity are hallmarks of science, there are unfortunate instances in which researchers have misrepresented data. As we strive to build our evidence base for practice, all researchers and clinicians must be vigilant in their evaluation of the integrity of clinical inquiry.

Research Misconduct
According to the U.S. Office of Research Integrity:
Research misconduct means fabrication, falsification, or plagiarism in proposing, performing, or reviewing research, or in reporting research results.28
Focus on Evidence 7–2
Fraudulent Science: A Detriment to Public Health

In 1998, Dr. Andrew Wakefield and 12 colleagues46 published a study in The Lancet that implied a link between the measles, mumps, and rubella (MMR) vaccine and a “new syndrome” of autism and gastrointestinal disease. The paper was based on a case series of 12 children whose parents described the children's history of normal development followed by loss of acquired skills and an onset of behavioral symptoms following MMR vaccination. Although the authors did state that their study could not “prove” this connection, they asserted that the MMR vaccination was the “apparent precipitating event.”
The report received significant media attention and triggered a notable decline in immunizations and a rise in measles outbreaks.47 The Lancet received a barrage of letters objecting to the study's implications and prospective impact. Critics pointed out several limitations of the study, including no controls, children with multiple disorders, and data based on parental recall and beliefs.48 Most importantly, however, over the following two decades, dozens of epidemiologic studies consistently demonstrated no evidence of this link.49–52
In 2004, The Lancet editors were made aware that Wakefield failed to disclose that he had received significant compensation prior to the study as a consultant to attorneys representing parents of children allegedly harmed by the MMR vaccine.53 Soon after, 11 of the contributing authors to the original paper published a retraction of the interpretation that there was a causal link between MMR vaccine and autism.54
These events triggered a formal investigation by the British General Medical Council in 2010, which found that the study had not been approved by a bioethics committee, that the researchers had exposed children to invasive colonoscopies and lumbar punctures that were not germane to the study, that the children were not drawn at random but were from families already reporting an association between autism and MMR vaccine, and that the coauthors of the paper all claimed to be unaware of these lapses despite their authorship.55 Further investigation revealed that facts about the children's histories had been altered and that none of their medical records could be reconciled with the descriptions in the published paper.56 Based on these findings, Wakefield's license to practice medicine in England was revoked.
The Lancet retracted the paper,57 and in 2011 the British Medical Journal declared it a fraud.56 However, despite the fact that the study has been widely discredited for both scientific and ethical reasons, and even with the overwhelming scientific evidence to the contrary, popular opinion and the internet seem to perpetuate the vaccine scare.58–61 The tremendous harm caused by this fraud is evident in the increasing numbers of outbreaks of measles, a disease that had been declared eliminated in the United States in 2000, with the majority of those infected being unvaccinated.62,63
Romanian Journal of Morphology and Embryology under a different title with different authors.33 The editor of the Romanian journal withdrew the article, banned the authors from further publication, and notified the dean of medicine and the ethics review board at their institution. Both articles are still listed in PubMed.

Research misconduct does not include honest error or differences of opinion. Mistakes in data entry or forgetting to report a value, for example, are considered honest errors.

As these examples show, many instances of misconduct have been documented over the past 50 years, some quite notorious. Such violations can have serious external effects, including damaging careers, delaying scientific progress, damaging public perception of clinical trials, and, most egregiously, providing misinformation about treatment efficacy that can influence healthcare decisions to the detriment of patients.35
It is important to note that because many of these studies attract public attention, the frequency of such violations may be incorrectly perceived. The actual incidence of such fraud is assumed to be quite low, although hard to estimate.36 For those who would like to know more about interesting and significant examples of research misconduct, several informative reviews have been written.35–38

HISTORICAL NOTE
Throughout history, there have been many claims that scientists have fudged data in ways that would be considered misconduct today. Ptolemy is purported to have obtained his astronomical results by plagiarizing 200-year-old data or making up numbers to fit his theory of the Earth being the center of the universe.39 Sir Isaac Newton may have adjusted calculations to fit his observations,40 and Gregor Mendel is suspected of selectively reporting data on his pea plants to show patterns that are cleaner than might be expected experimentally.41
Although none of these can be proven, the suspicions arise from statistical evidence that observed results are just a little too close to expected values to be compatible with the chance that affects experimental data.36

Conflicts of Interest
A conflict of interest (COI) occurs when an individual's connections present a risk that professional judgment will be influenced. It can be defined as:
. . . a set of conditions in which professional judgment concerning a primary interest (such as patients' welfare or the validity of research) tends to be unduly influenced by a secondary interest (such as financial gain).42
Such conflicts can undermine the objectivity and integrity of research endeavors, whether real or perceived. Conflicts can occur if an investigator has a relationship with a pharmaceutical company or manufacturer, for example, in which the outcome of a clinical trial may result in commercial benefits and there may be pressure to draw favorable conclusions.43 Researchers may sit on agency boards or be involved with government, industry, or academic sponsors that can compromise judgment. Conflicts of interest can also apply to reputational benefits or simply conflicting commitments of time.44 Patients and research participants present a particular source of concern, especially when they are solicited as subjects. Their welfare can be compromised if researchers are provided incentives, if they have intellectual property interests, or if subjects are students or employees, especially if they are led to believe that the study will be of specific benefit to them.45

Most professional codes of ethics include language warning against COIs, and universities or research laboratories usually require that investigators declare potential conflicts as part of ethics committee approval. Many journals and professional organizations now require formal declarations of COI for research publications or presentations.

Peer Review
One hallmark of a reputable scientific journal is its process for vetting articles, known as peer review. The purpose of peer review is to provide feedback to an editor who will make decisions about whether a manuscript should be accepted, revised, or rejected. Peer review requires the expertise of colleagues who offer their time to read and critique the paper, and provide an assessment of the study's validity, the data reporting, and the author's conclusions. Journals and researchers recognize the necessity of this process to support the quality of scholarship.
In recent years, many journals have asked authors to recommend or solicit reviewers to facilitate finding people who are willing to give the necessary time. Because reviews are now done electronically, contacts are typically made through e-mail. This has allowed some deceitful practices, where authors have made up identities and have reviewed their own articles, or groups have reviewed each other's papers.64–66 Hundreds of articles have been retracted over the past few years because of this practice, and many journals are now taking steps to properly identify reviewers to avoid such situations.
Citations for retracted articles are maintained in literature databases, but notification of the retraction is prominently featured in the citation, along with a link to a retraction notice in the relevant journal. The publication type is listed as “retracted publication” in PubMed.

Focus on Evidence 3-1 (Chapter 3) highlights the opposite argument: that researchers must realize when replication goes too far, when further corroboration is no longer needed to support an intervention, creating a potentially unethical use of human and financial resources.
COMMENTARY
The last century has seen remarkable changes in technology that have influenced the landscape of research and clinical practice. Guidelines continue to be evaluated and revised to address regulatory gaps that continue to widen as legal and ethical standards try to keep up with advances. The development of copyright laws, for example, originally followed the invention of the printing press in the 1400s, an advance that severely challenged the political and religious norms of the day and changed forever how information could be disseminated.75 Our recent history has had to deal with changes in electronic distribution of intellectual property in ways that could never have been anticipated even 20 years ago.
Many scientific and technological advances bring with them ethical and legal questions that also challenge research endeavors. The use of social media creates many concerns regarding accessibility of information, spreading of misinformation, and processes for recruiting research participants. Personal profiles and digital footprints may allow personal information to become available, violating privacy rules. The human genome project and wide availability of genetic information bring concerns about confidentiality while providing important opportunities for personalized healthcare. Electronic medical records must adhere to HIPAA provisions, but also provide access to patient information for research purposes. And how do we deal with three-dimensional printing to create models for human organs,76 developing surgical and mechanical human enhancements to substitute for human function,77 or cloning of primates?78
It was only after blatant examples of unethical conduct were uncovered in the last century that regulations began to address the necessary precautions. Surely today's advances will uncover new concerns that will require continued vigilance regarding their implications for practice and research. But without individuals who volunteer to participate in studies, the research enterprise would be stymied. All the while, we must maintain the cornerstones of ethical standards: adhering to ethical and professional principles in reporting, providing adequate information to participants, maintaining privacy and confidentiality, and, above all, attending to the well-being of those we study and those to whom our research findings will apply.
33. Eid K, Thornhill TS, Glowacki J. Chondrocyte gene expression in osteoarthritis: correlation with disease severity. J Orthop Res 2006;24(5):1062–1068.
34. Jalbă BA, Jalbă CS, Vlădoi AD, Gherghina F, Stefan E, Cruce M. Alterations in expression of cartilage-specific genes for aggrecan and collagen type II in osteoarthritis. Rom J Morphol Embryol 2011;52(2):587–591.
35. Stroebe W, Postmes T, Spears R. Scientific misconduct and the myth of self-correction in science. Perspect Psychol Sci 2012;7(6):670–688.
36. George SL, Buyse M. Data fraud in clinical trials. Clin Investig (Lond) 2015;5(2):161–173.
37. Buckwalter JA, Tolo VT, O'Keefe RJ. How do you know it is true? Integrity in research and publications: AOA critical issues. J Bone Joint Surg Am 2015;97(1):e2.
38. Sovacool BK. Exploring scientific misconduct: isolated individuals, impure institutions, or an inevitable idiom of modern science? J Bioeth Inq 2008;5:271–282.
39. Newton RR. The Crime of Claudius Ptolemy. Baltimore: Johns Hopkins University Press, 1977.
40. Judson HF. The Great Betrayal: Fraud in Science. Orlando, FL: Harcourt Books, 2004.
41. Galton DJ. Did Mendel falsify his data? QJM 2012;105(2):215–216.
42. Thompson DF. Understanding financial conflicts of interest. N Engl J Med 1993;329(8):573–576.
43. Bekelman JE, Li Y, Gross CP. Scope and impact of financial conflicts of interest in biomedical research: a systematic review. JAMA 2003;289(4):454–465.
44. Horner J, Minifie FD. Research ethics III: publication practices and authorship, conflicts of interest, and research misconduct. J Speech Lang Hear Res 2011;54(1):S346–S362.
45. Williams JR. The physician's role in the protection of human research subjects. Sci Eng Ethics 2006;12(1):5–12.
46. Wakefield AJ, Murch SH, Anthony A, Linnell J, Casson DM, Malik M, Berelowitz M, Dhillon AP, Thomson MA, Harvey P, et al. Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. Lancet 1998;351(9103):637–641.
47. Flaherty DK. The vaccine-autism connection: a public health crisis caused by unethical medical practices and fraudulent science. Ann Pharmacother 2011;45(10):1302–1304.
48. Payne C, Mason B. Autism, inflammatory bowel disease, and MMR vaccine. Lancet 1998;351(9106):907; author reply 908–909.
49. Jain A, Marshall J, Buikema A, Bancroft T, Kelly JP, Newschaffer CJ. Autism occurrence by MMR vaccine status among US children with older siblings with and without autism. JAMA 2015;313(15):1534–1540.
50. Uno Y, Uchiyama T, Kurosawa M, Aleksic B, Ozaki N. Early exposure to the combined measles-mumps-rubella vaccine and thimerosal-containing vaccines and risk of autism spectrum disorder. Vaccine 2015;33(21):2511–2516.
51. Taylor LE, Swerdfeger AL, Eslick GD. Vaccines are not associated with autism: an evidence-based meta-analysis of case-control and cohort studies. Vaccine 2014;32(29):3623–3629.
52. DeStefano F. Vaccines and autism: evidence does not support a causal association. Clin Pharmacol Ther 2007;82(6):756–759.
58. Basch CH, Zybert P, Reeves R, Basch CE. What do popular YouTube™ videos say about vaccines? Child Care Health Dev 2017;43(4):499–503.
59. Fischbach RL, Harris MJ, Ballan MS, Fischbach GD, Link BG. Is there concordance in attitudes and beliefs between parents and scientists about autism spectrum disorder? Autism 2016;20(3):353–363.
60. Venkatraman A, Garg N, Kumar N. Greater freedom of speech on Web 2.0 correlates with dominance of views linking vaccines to autism. Vaccine 2015;33(12):1422–1425.
61. Bazzano A, Zeldin A, Schuster E, Barrett C, Lehrer D. Vaccine-related beliefs and practices of parents of children with autism spectrum disorders. Am J Intellect Dev Disabil 2012;117(3):233–242.
62. Centers for Disease Control and Prevention (CDC): Measles (Rubeola). Available at https://www.cdc.gov/measles/index.html. Accessed April 29, 2019.
63. Vera A. Washington is under a state of emergency as measles cases rise. CNN, January 16, 2019. Available at https://www.cnn.com/2019/01/26/health/washington-state-measles-state-of-emergency/index.html. Accessed February 25, 2019.
64. Ferguson C. The peer-review scam. Nature 2014;515:480–482.
65. Kaplan S. Major publisher retracts 64 scientific papers in fake peer review outbreak. Washington Post, August 18, 2015. Available at http://www.washingtonpost.com/news/morning-mix/wp/2015/08/18/outbreak-of-fake-peer-reviews-widens-as-major-publisher-retracts-64-scientific-papers/. Accessed August 20, 2015.
66. Stigbrand T. Retraction note to multiple articles in Tumor Biology. Tumor Biology 2017:1–6.
67. Fang FC, Steen RG, Casadevall A. Misconduct accounts for the majority of retracted scientific publications. Proc Natl Acad Sci U S A 2012;109(42):17028–17033.
68. Kakuk P. The legacy of the Hwang case: research misconduct in biosciences. Sci Eng Ethics 2009;15(4):545–562.
69. Bartlett T. UConn investigation finds that health researcher fabricated data. The Chronicle of Higher Education. January 11, 2012. Available at http://www.chronicle.com/blogs/percolator/uconn-investigation-finds-that-health-researcher-fabricated-data/28291. Accessed March 12, 2017.
70. Steen RG, Casadevall A, Fang FC. Why has the number of scientific retractions increased? PLoS One 2013;8(7):e68397.
71. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 2015;349(6251):aac4716.
72. Halpern SD, Karlawish JH, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA 2002;288(3):358–362.
73. Bailar JC III. Science, statistics, and deception. Ann Intern Med 1986;104(2):259–260.
74. Altman DG. Statistics and ethics in medical research. Misuse of statistics is unethical. Br Med J 1980;281(6249):1182–1184.
75. Wadhwa V. Laws and ethics can't keep pace with technology. MIT Technology Review, April 15, 2014. Available at https://www.technologyreview.com/s/526401/laws-and-ethics-cant-keep-pace-with-technology/. Accessed January 20, 2018.
76. Griggs B. The next frontier in 3-D printing: human organs.
53. Deer B. How the case against the MMR vaccine was fixed. CNN, April 5, 2014. Available at https://www.cnn.com/
BMJ 2011;342:c5347. 2014/04/03/tech/innovation/3-d-printing-human-organs/
54. Murch SH, Anthony A, Casson DH, Malik M, Berelowitz M, index.html. Accessed January 28, 2018.
Dhillon AP, Thomson MA, Valentine A, Davies SE, 77. Human enhancements. John J. Reilly Center, University of
Walker-Smith JA. Retraction of an interpretation. Lancet Notre Dame. Available at https://reilly.nd.edu/outreach/
2004;363(9411):750. emerging-ethical-dilemmas-and-policy-issues-in-science-
55. General Medical Council: Fitness to practise panel hearing, and-technology/human-enhancements/. Accessed January 28,
28 January 2010. Available at http://www.casewatch.org/foreign/ 2018.
wakefield/gmc_findings.pdf. Accessed March 10, 2017. 78. Katz B. Scientists successfully clone monkeys, breaking new
56. Godlee F, Smith J, Marcovitch H. Wakefield’s article linking ground in a controversial field. Smithsonian.com, January 25,
MMR vaccine and autism was fraudulent. BMJ 2011;342:c7452. 2018. Available at https://www.smithsonianmag.com/
57. Retraction—Ileal-lymphoid-nodular hyperplasia, non-specific smart-news/two-cloned-monkeys-scientists-break-new-
colitis, and pervasive developmental disorder in children. Lancet ground-controversial-field-180967950/. Accessed January 28,
2010;375(9713):445. 2018.
6113_Ch08_105-114 02/12/19 4:52 pm Page 105
PART 2
Concepts of Measurement

CHAPTER 8 Principles of Measurement 106
CHAPTER 9 Concepts of Measurement Reliability 115
CHAPTER 10 Concepts of Measurement Validity 127
CHAPTER 11 Designing Surveys and Questionnaires 141
CHAPTER 12 Understanding Health Measurement Scales 159
[Figure: The Research Process, in the context of evidence-based practice — 1. Identify the research question; 2. Design the study; 3. Implement the study; 4. Analyze data; 5. Disseminate findings.]
CHAPTER 8
Principles of Measurement
Scientists and clinicians use measurement as a way of understanding, evaluating, and differentiating characteristics of people, objects, and systems. Measurement provides a mechanism for achieving a degree of precision in this understanding so that we can describe physical or behavioral characteristics according to their quantity, degree, capacity, or quality. This allows us to communicate in objective terms, giving us a common sense of "how much" or "how little" without ambiguous interpretation. There are virtually no clinical decisions or actions that are independent of some type of measurement.

The purpose of this chapter is to explore principles of measurement as they are applied to clinical research and evidence-based practice (EBP). Discussion will focus on several aspects of measurement theory and how these relate to quantitative assessment, analysis, and interpretation of clinical variables.

• Making decisions based on a criterion or standard of performance.
• Drawing comparisons to choose between courses of action.
• Evaluating responses to assess a patient's condition by documenting change or progress.
• Discriminating among individuals who present different profiles related to their conditions or characteristics.
• Predicting outcomes to draw conclusions about relationships and consider expected responses.

"Measurement" has been defined as:

    The process of assigning numerals to variables to represent quantities of characteristics according to certain rules.1

Let us explore the three essential elements of this definition.

■ Quantification and Measurement

The first part of the definition of measurement emphasizes the process of assigning numerals to variables. A variable is a property that can differentiate individuals or objects. It represents an attribute that can have more than one value. Value can denote quantity (such as age or blood pressure) or an attribute (such as sex or geographic region).
A number represents a known quantity of a variable. Accordingly, quantitative variables can be operationalized and assigned values, independent of historical, cultural, or social contexts within which performance is observed. Numbers can take on different kinds of values.

Continuous Variables
A continuous variable can theoretically take on any value along a continuum within a defined range. Between any two integer values, an indefinitely large number of fractional values can occur. In reality, continuous values can never be measured exactly but are limited by the precision of the measuring instrument.
For instance, weight could be measured as 50 pounds, 50.4 pounds, or even 50.4352 pounds, depending on the calibration of the scale. Strength, distance, weight, and chronological time are other examples of continuous variables.

Discrete Variables
Some variables can be described only in whole integer units and are considered discrete variables. Heart rate, for example, is measured in beats per minute, not in fractions of a beat. Variables such as the number of trials needed to learn a motor task or the number of children in a family are also examples of discrete variables.

Precision
Precision refers to the exactness of a measure. For statistical purposes, this term is usually used to indicate the number of decimal places to which a number is taken. Therefore, 1.47382 is a number of greater precision than 1.47. The degree of precision is a function of the sensitivity of the measuring instrument and data analysis system as well as the variable itself (see Focus on Evidence 8-1).
It is not useful, for example, to record blood pressure in anything less than integer units (whole numbers with no decimal places). It may be important, however, to record strength to a tenth or hundredth of a kilogram. Computer programs will often record values with three or more decimal places by default. It is generally not informative, however, to report results to so many places. How important is it to know that a mean age is 84.5 years as opposed to 84.528 years? Such precision has been characterized as statistical "clutter" and is generally not meaningful.2 In clinical research, greater precision may be used in analysis, but numbers will usually be reported to one or two decimal places.

■ The Indirect Nature of Measurement

The definition of measurement also indicates that measured values represent quantities of characteristics. Most measurement is a form of abstraction or conceptualization; that is, very few variables are measured directly. Range of motion in degrees and length in centimeters are among the few examples of measures that involve direct observation of a physical property. We can actually see how far a limb rotates or measure the size of a wound, and we can compare angles and size between people.
Most characteristics are not directly observable, however, and we can measure only a proxy of the actual property. For example, we do not observe temperature, but only the height of a column of mercury in a thermometer. We are not capable of visualizing the electrical activity of a heartbeat or muscle contraction, although
Focus on Evidence 8–1
Units of Measurement and Medication Dosing Errors

Parents often make dosing errors when administering oral liquid medications to children, with potentially serious consequences. Confusion is common because there is no standard unit, with measurements prescribed in milliliters, teaspoons, tablespoons, dropperfuls, and cubic centimeters, among the many variations. The actual instrument used to administer the liquid can also vary, including the use of household teaspoons and tablespoons, which can vary greatly in size and shape. Parents with low health literacy or language fluency challenges may be particularly vulnerable.
Yin et al4 performed a cross-sectional analysis of data from a larger study of provider communication and medication errors, looking at 287 parents whose children were prescribed liquid medications. Errors were defined in terms of knowledge of prescribed dose and actual observed dose. After adjusting for factors such as parent age, language, socioeconomic status, and education, they found that 40% of parents made an error in measurement of the intended dose and 17% used a nonstandard instrument. Compared to those who used milliliters only, those who used teaspoons or tablespoons were twice as likely to make an error. The authors recommended that a move to a milliliter-only standard may promote the safe use of pediatric liquid medications among groups at particular risk for misunderstanding medication instructions.
The importance of measurement precision will depend on how the measurement is applied and the seriousness of small errors. In this case, for example, confusion related to units of measurement contributes to more than 10,000 calls to poison control centers every year.5
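The unit-confusion problem lends itself to a short illustration (a sketch: the 5-mL teaspoon and 15-mL tablespoon equivalences are the standard definitions, while the 7-mL household-spoon capacity is a hypothetical value chosen for the example):

```python
# Sketch: why a milliliter-only standard reduces dosing ambiguity.
# Standard definitions: 1 teaspoon = 5 mL, 1 tablespoon = 15 mL.
# Household spoons, however, vary widely in actual capacity.

ML_PER_UNIT = {"mL": 1.0, "tsp": 5.0, "tbsp": 15.0}

def dose_in_ml(amount: float, unit: str) -> float:
    """Convert a prescribed dose to milliliters."""
    return amount * ML_PER_UNIT[unit]

def dosing_error_pct(prescribed_ml: float, delivered_ml: float) -> float:
    """Percent deviation of the delivered dose from the prescribed dose."""
    return 100.0 * (delivered_ml - prescribed_ml) / prescribed_ml

# A "1 tsp" prescription measured with a household spoon that actually
# holds 7 mL (a hypothetical capacity):
prescribed = dose_in_ml(1, "tsp")               # 5.0 mL
delivered = 7.0
print(dosing_error_pct(prescribed, delivered))  # 40.0 (a 40% overdose)
```

Prescribing in milliliters and dispensing a marked oral syringe removes the ambiguous spoon conversion entirely.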
the numbers represent football jerseys, manual muscle test (MMT) grades, or inches on a ruler—but what will the answer mean? The numbers may not know, but we must understand their origin to make reasonable interpretations.

■ Levels of Measurement

In 1946, Stanley Stevens,7 a noted experimental psychologist, wondered about how different types of variables were measured and classified. He proposed a scheme to classify variables on their characteristics, rather than the number assigned to them. To clarify this process, Stevens identified four levels of measurement—nominal, ordinal, interval, and ratio—each with a special set of rules for manipulating and interpreting numerical data.8 The characteristics of these four scales are summarized in Figure 8-1.

➤ CASE IN POINT
Researchers studied the effects of participation in an early intervention (EI) program for low-birth-weight premature infants.10 They collected data on several baseline measures, including sex of the child, mother's ethnicity, mother's level of education, child's birth order, the neonatal health index, the mother's age, and the child's birth weight.

These variables can be classified by different levels of measurement.

Nominal Scale
The lowest level of measurement is the nominal scale, also referred to as the classificatory scale. Objects or people are assigned to categories according to some criterion. Categories may be coded by name, number, letter, or symbol, although these are purely labels and have no quantitative value or relative order. Blood type and handedness are examples of nominal variables. A researcher might decide, for example, to assign the value of 0 to identify right-handed individuals, 1 to identify left-handed individuals, and 2 to identify ambidextrous individuals.

➤ In the EI study, variables of sex and ethnicity are nominal measurements. In this case, sex is male/female (a dichotomous variable), and ethnicity was defined as Black, Hispanic, or White/Other.

Based on the assumption that relationships are consistent within a measurement system, nominal categories are mutually exclusive so that no object or person can logically be assigned to more than one. This means that the members within a category must be equivalent on the property being scaled, but different from those in other
[Figure 8–1 Four levels of measurement. NOMINAL (e.g., gender, blood type, diagnosis): numerals are category labels. ORDINAL (e.g., manual muscle test, function, pain): numbers indicate rank order.]
has typically been used to grade strength according to the patient's ability to move against gravity and hold against an examiner's resistance. Using a 12-point ordinal scale (from 0 = "no contraction" to 12 = "normal strength"), the researchers questioned the sensitivity of MMT grades to change.
They collected strength data for 20 subjects over multiple trials using MMT grades and percent of maximal voluntary isometric contraction (MVIC) using a strain gauge (ratio scale). The graph presents results for shoulder flexion, showing that there was a substantial range of strength within a single MMT grade. For example, MVIC ranged from 10% to 52% for an MMT score of 7. An MVIC of 30% could be graded between 4 and 9 on the MMT scale. These methods provide varying sensitivity, which will affect determination of progress and change over time. The choice of tool, therefore, could make a distinct difference in evaluative decisions.

[Graph: MVIC (%), 0 to 60, plotted against MMT grades 0 to 12 for shoulder flexion.]

Adapted from Andres PL, Skerry LM, Thornell B, Portney LG, Finison LJ, Munsat TL. A comparison of three measures of disease progression in ALS. J Neurol Sci 1996;139 Suppl:64–70, Figure 2, p. 67. Used with permission.
education as one with a score of 2. Therefore, ordinal scores cannot meaningfully be added, subtracted, multiplied, or divided, which has implications for statistical analysis, such as taking means and assessing variance. Ordinal scores, therefore, are often described according to counts or percentages. In the EI study, 37% had some high school, 27% completed high school, 22% had some college, and 13% completed college. The median score within a distribution can also be useful (see Chapter 22).
Ordinal scales can be distinguished based on whether or not they contain a true zero point. In some cases, an ordinal scale can incorporate an artificial origin within the series of categories so that ranked scores can occur in either direction away from the origin (+ and –). This type of scale is often constructed to assess attitude or opinion, such as agree–neutral–disagree. For some construct variables, it may be reasonable to define a true zero, such as having no pain. For others it may be impossible to locate a true zero. For example, what is zero function? A category labeled "zero" may simply refer to performance below a certain criterion rather than the total absence of function.

Interval Scale
An interval scale possesses the rank-order characteristics of an ordinal scale, but also demonstrates known and equal intervals between consecutive values. Therefore, relative difference and equivalence within a scale can be determined. What is not supplied by an interval scale is the absolute magnitude of an attribute because interval measures are not anchored to a true zero point (similar to an ordinal scale without a natural origin). Since the value of zero on an interval scale is not indicative of the complete absence of the measured trait, negative values are also possible and may simply represent lesser amounts of the attribute rather than a substantive deficit.

➤ In the EI study, the neonatal health index was defined as the length of stay in the neonatal nursery adjusted for birth weight and standardized to a mean of 100. A higher score is indicative of better health. This is an interval scale because it has no true zero and the scores are standardized based on an underlying continuum rather than representing a true quantity.

The standard numbering of years on the Gregorian calendar (B.C. and A.D.) is an interval scale. The year 1 was an arbitrary historical designation, not the beginning of time, which is why we have different years, for example, for the Chinese and Hebrew calendars. Measures of temperature using Fahrenheit and Celsius scales are also at the interval level. Both have artificial zero points that do not represent a total absence of heat and
Table 8-1 Hypothetical Data for Step Length Measured on Different Scales

SUBJECT   RATIO (cm)   INTERVAL   ORDINAL   NOMINAL
A         23           4          2         Long
B         24           5          3         Long
C         19           0          1         Short
D         28           9          4         Long

We could convert these ratio measures to an interval scale by arbitrarily assigning a score of zero to the lowest value and adjusting the intervals accordingly. We would still know which children took longer steps, and we would have a relative idea of how much longer they were, but we would no longer know what the actual step length was. We would also no longer be able to determine that Subject D takes a step 1.5 times as great as Subject C. In fact, using interval data, it erroneously appears as if Subject D takes a step nine times the length of Subject C.
An ordinal measure can be achieved by simply ranking the children's step lengths. With this scale we no longer have any indication of the magnitude of the differences. Based on ordinal data, we could not establish that Subjects A and B were more alike than any others.
We can eventually reduce our measurement to a nominal scale by setting a criterion for "long" versus "short" steps and classifying each child accordingly. With this measurement, we have no way of distinguishing any differences in performance between Subjects A, B, and D.
Clearly, we have lost significant amounts of information with each successive reduction in scale. It will always be an advantage, therefore, to achieve the highest possible level of measurement. Data can always be manipulated to use a lower scale, but not vice versa. In reality, clinical researchers usually have access to a limited variety of measurement tools, and the choice is often dictated by the instrumentation available and the clinician's preference or skill. Even though step length was assessed using four different scales, the true nature of the variable remains unchanged. Therefore, we must distinguish between the underlying nature of a variable and the scale used to measure it.

Statistics and Levels of Measurement
An understanding of the levels of measurement is more than an academic exercise. The importance of determining the measurement scale for a variable lies in the determination of which mathematical operations are appropriate and which interpretations are meaningful. All statistical tests have been designed around certain assumptions about data, including the level of measurement. Parametric tests apply arithmetic manipulations, requiring interval or ratio data. Nonparametric tests do not make the same assumptions and are designed to be used with ordinal and nominal data (see Chapter 27).
In many cases, using parametric or nonparametric procedures can make a difference as to how data will be interpreted (see Chapter 23). Therefore, distinctions about levels of measurement are not trivial. However, identifying the level of measurement may not be simple. The underlying properties of many behavioral variables do not fit neatly into one scale or another. For example, behavioral scientists continue to question the classification of intelligence quotient (IQ) scores as either ordinal or interval measures.15 Can we assume that intervals are equal so that the difference between scores of 50 and 100 is the same as the difference between 100 and 150, or are these scores only reflective of relative order? Can we say that someone with an IQ of 200 is twice as intelligent as someone with a score of 100? Is everyone with a score of 100 equally intelligent? Is there a "unit" of intelligence? Most likely the true scale is somewhere between ordinal and interval. The importance of this issue will be relevant in how data are analyzed and interpreted.
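The information loss in Table 8-1 can be reproduced in a few lines (a sketch: the interval scores are formed by shifting the arbitrary zero to the smallest observed value, and the 20-cm cutoff for a "long" step is an assumed criterion):

```python
# Reducing a ratio measure (step length in cm) to successively lower
# scales, reproducing the pattern shown in Table 8-1.
step_cm = {"A": 23, "B": 24, "C": 19, "D": 28}          # ratio scale

# Interval: arbitrary zero placed at the smallest observed value.
interval = {s: v - min(step_cm.values()) for s, v in step_cm.items()}

# Ordinal: rank order only (1 = shortest step).
ranked = sorted(step_cm, key=step_cm.get)
ordinal = {s: ranked.index(s) + 1 for s in step_cm}

# Nominal: classify by an assumed 20-cm criterion for a "long" step.
nominal = {s: "Long" if v > 20 else "Short" for s, v in step_cm.items()}

print(interval)  # {'A': 4, 'B': 5, 'C': 0, 'D': 9}
print(ordinal)   # {'A': 2, 'B': 3, 'C': 1, 'D': 4}
print(nominal)   # {'A': 'Long', 'B': 'Long', 'C': 'Short', 'D': 'Long'}

# Ratios survive only on the ratio scale: D's step is 28/19, about
# 1.5 times C's, while the interval scores (9 vs 0) misleadingly
# suggest a ninefold difference.
```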
COMMENTARY

Although there are clear differentiations among statistical tests regarding data assumptions, we find innumerable instances throughout the clinical and behavioral science literature in which parametric statistical operations are used with ordinal data. The question is, how serious are the consequences of misassumptions about scale properties to the interpretation of statistical research results? Some say quite serious,16,17 whereas others indicate that the answer is "not very."18,19 Many researchers are comfortable constructing ordinal scales using categories that are assumed to logically represent equal intervals of the test variable and treating the scores as interval data,18,20 especially when the scale incorporates some type of natural origin. Questions using a five-point Likert scale ("Strongly Agree" to "Strongly Disagree") would be considered ordinal, but if several item responses are summed, the total is typically treated as an interval measurement.21
Because ordinal measures occur frequently in the behavioral and social sciences, this issue is of significant import to the reasonable interpretation of clinical data. Kerlinger22 suggests that most psychological and educational scales approximate equal intervals fairly well, and that the results of statistical analyses using these measures provide satisfactory and useful information. McDowell23 suggests that the classification of a measurement as ordinal or interval in a health scale is less about inherent numerical properties than it is about how scores will be interpreted. Therefore, this issue will take on varied importance depending on the nature of the variables being measured and the precision needed to use them meaningfully. For the most part, it would seem appropriate to continue treating ordinal measurements as ranked rather than interval data; however, if the interval approach is defensible, the degree of error associated with this practice may be quite tolerable in the long run.15 Even Stevens,7 in his landmark paper, makes a practical concession:

    In the strictest propriety the ordinary statistics involving means and standard deviations ought not to be used with these [ordinal] scales, for these statistics imply a knowledge of something more than the relative rank-order of data. On the other hand, for this "illegal" statisticizing there can be invoked a kind of pragmatic sanction: In numerous instances it leads to fruitful results.

As researchers and evidence-based practitioners, we must scrutinize the underlying theoretical construct that defines a scale and remain responsible for justifying the application of statistical procedures and the subsequent interpretations of the data.
REFERENCES

1. Nunnally J, Bernstein IH. Psychometric Theory. 3 ed. New York: McGraw-Hill, 1994.
2. Cohen J. Things I have learned (so far). Am Psychologist 1990;45(12):1304–1312.
3. Wright BD, Linacre JM. Observations are always ordinal; measurements, however, must be interval. Arch Phys Med Rehabil 1989;70(12):857–860.
4. Yin HS, Dreyer BP, Ugboaja DC, Sanchez DC, Paul IM, Moreira HA, Rodriguez L, Mendelsohn AL. Unit of measurement used and parent medication dosing errors. Pediatrics 2014;134(2):e354–e361.
5. Bronstein AC, Spyker DA, Cantilena LR, Jr., Rumack BH, Dart RC. 2011 Annual report of the American Association of Poison Control Centers' National Poison Data System (NPDS): 29th annual report. Clin Toxicol 2012;50(10):911–1164.
6. Lord FM. On the statistical treatment of football numbers. Am Psychologist 1953;8(12):750–751.
7. Stevens SS. On the theory of scales of measurement. Science 1946;103(2684):677–680.
8. Stevens SS. Mathematics, measurement and psychophysics. In: Stevens SS, ed. Handbook of Experimental Psychology. New York: Wiley, 1951.
9. Metric mishap caused loss of NASA orbiter. CNN.com. September 30, 1999. Available at https://edition.cnn.com/TECH/space/9909/30/mars.metric.02/. Accessed January 25, 2018.
10. Hill JL, Brooks-Gunn J, Waldfogel J. Sustained effects of high participation in an early intervention for low-birth-weight premature infants. Dev Psychol 2003;39(4):730–744.
11. Andres PL, Skerry LM, Thornell B, Portney LG, Finison LJ, Munsat TL. A comparison of three measures of disease progression in ALS. J Neurol Sci 1996;139 Suppl:64–70.
12. White LJ, Velozo CA. The use of Rasch measurement to improve the Oswestry classification scheme. Arch Phys Med Rehabil 2002;83(6):822–831.
13. Decruynaere C, Thonnard JL, Plaghki L. Measure of experimental pain using Rasch analysis. Eur J Pain 2007;11(4):469–474.
14. Jacobusse G, van Buuren S, Verkerk PH. An interval scale for development of children aged 0–2 years. Stat Med 2006;25(13):2272–2283.
15. Knapp TR. Treating ordinal scales as interval scales: an attempt to resolve the controversy. Nurs Res 1990;39(2):121–123.
16. Jakobsson U. Statistical presentation and analysis of ordinal data in nursing research. Scand J Caring Sci 2004;18(4):437–440.
17. Merbitz C, Morris J, Grip JC. Ordinal scales and foundations of misinference. Arch Phys Med Rehabil 1989;70:308–312.
18. Wang ST, Yu ML, Wang CJ, Huang CC. Bridging the gap between the pros and cons in treating ordinal scales as interval scales from an analysis point of view. Nurs Res 1999;48(4):226–229.
19. Cohen ME. Analysis of ordinal dental data: evaluation of conflicting recommendations. J Dent Res 2001;80(1):309–313.
20. Gaito J. Measurement scales and statistics: resurgence of an old misconception. Psychol Bull 1980;87:564–567.
21. Wigley CJ. Dispelling three myths about Likert scales in communication trait research. Commun Res Reports 2013;30(4):366–372.
22. Kerlinger FN. Foundations of Behavioral Research. 3 ed. New York: Holt, Rinehart & Winston, 1985.
23. McDowell I. Measuring Health: A Guide to Rating Scales and Questionnaires. 3 ed. New York: Oxford University Press, 2006.
CHAPTER 9
Concepts of Measurement Reliability
The usefulness of measurement in clinical decision-making depends on the extent to which clinicians can rely on data as accurate and meaningful indicators of a behavior or attribute. The first prerequisite, at the heart of accurate measurement, is reliability, or the extent to which a measured value can be obtained consistently during repeated assessment of unchanging behavior. Without sufficient reliability, we cannot have confidence in the data we collect nor can we draw rational conclusions about stable or changing performance.

The purpose of this chapter is to present the conceptual basis of reliability and describe different approaches to testing the reliability of clinical measurements. Further discussion of reliability statistics is found in Chapter 32.

■ Concepts of Reliability

Reliability is fundamental to all aspects of measurement. If a patient's behavior is reliable, we can expect consistent responses under given conditions. A reliable examiner is one who assigns consistent scores to a patient's unchanging behavior. A reliable instrument is one that will perform with predictable accuracy under steady conditions. The nature of reality is such that measurements are rarely perfectly reliable. All instruments are fallible to some extent and all humans respond with some level of inconsistency.

Measurement Error
Consider the simple process of measuring an individual's height with a tape measure. First, assume that the individual's height is stable—it is not changing, and therefore, there is a theoretical true value for the person's actual height. Now let us take measurements, recorded to the nearest 1/16 inch, on three separate occasions, either by one examiner or by three different examiners. Given the fallibility of our methods, we can expect some differences in the values recorded from each trial.
Assuming that all measurements were acquired using similar procedures and with equal concern for accuracy, it may not be possible to determine which of the three measured values was the truer representation of the subject's actual height. Indeed, we would have no way of knowing exactly how much error was included in each of the measured values.
According to classical measurement theory, any observed score (XO) consists of two components: a true score (XT) that is a fixed value, and an unknown error component (E) that may be large or small depending on the accuracy and precision of our measurement procedures. This relationship is summarized by the equation:

XO = XT ± E

Any difference between the true value and the observed value is regarded as measurement error, or "noise," that gets in the way of our finding the true score. Although it is theoretically possible that a single measured value is right on the money (with no error), it is far more likely that an observed score will either overestimate or underestimate the true value to some degree. Reliability can be conceptually defined as an estimate of the extent to which an observed test score is free from error; that is, how well observed scores reflect true scores.
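The relationship XO = XT ± E can be made concrete with a small simulation (a sketch: the true height, the error magnitude, and the 1/16-inch recording precision are illustrative assumptions):

```python
import random

random.seed(1)

TRUE_HEIGHT = 66.0  # X_T: the assumed (fixed) true height in inches

def observe(sd: float = 0.1) -> float:
    """One observed score X_O = X_T + E, where the error E is drawn from
    a normal distribution and the reading is recorded to the nearest
    1/16 inch, as in the height example above."""
    x_o = TRUE_HEIGHT + random.gauss(0.0, sd)
    return round(x_o * 16) / 16

# Three separate occasions measuring the same unchanging height:
trials = [observe() for _ in range(3)]
errors = [x - TRUE_HEIGHT for x in trials]
print(trials)   # three slightly different observed scores
print(errors)   # the error component E in each trial, unknowable
                # in practice because X_T itself is never observed
```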
Whereas reliability is all about the consistency or replicability of measured values, validity describes the extent to which a measurement serves its intended purpose by quantifying the target construct (see Chapter 10). Both reliability and validity are essential considerations for how measurement accuracy is evaluated in research, as well as for how measured data are used as evidence to inform clinical practice.

Types of Measurement Error

To understand reliability, we must distinguish between two basic types of measurement errors: systematic and random. Figure 9-1 illustrates this relationship, using a target analogy in which the center represents the true score.

Figure 9–1 Depiction of systematic and random errors.

Systematic Error

Systematic errors are predictable errors of measurement. They occur in one direction, consistently overestimating or underestimating the true score. If a systematic error is detected, it is usually a simple matter to correct it either by recalibrating the instrument or by adding or subtracting the appropriate constant from each observed value. For example, if the end of a tape measure is incorrectly marked, such that markings actually begin 0.25 inch from the end of the tape, measurements of height will consistently record values that are too short by 0.25 inch. We can correct this error by cutting off the extra length at the end of the tape (thereby recalibrating the instrument) or by adding the appropriate constant (0.25 inch) to each observed value.

By definition, systematic errors are consistent. Consequently, systematic errors are not a threat to reliability. Instead, systematic errors only threaten the validity of a measure because, despite highly replicable values over repeated assessments, none of the observed values will succeed in accurately quantifying the target construct.

Random Error

Presuming that an instrument is well calibrated and there is no other source of systematic bias, measurement errors can be considered “random” since their effect on the observed score is not predictable. Random errors are just as likely to result in an overestimate of the true score as an underestimate. These errors are a matter of chance, possibly arising from factors such as examiner or subject inattention, instrument imprecision, or unanticipated environmental fluctuation.

For example, if a patient is fidgeting while their height is being measured, random errors may cause the observed scores to vary in unpredictable ways from trial to trial. The examiner may be inconsistently positioning herself when observing the tape, or the tape may be stretched out to a greater or lesser extent during some trials, resulting in unpredictable errors. We can think of any single measurement as one of an infinite number of possible measures that could have been taken at that point in time.

If we assume random errors are a matter of chance, with erroneous overestimates and underestimates occurring with equal frequency, recording the average across several trials should allow the positive and negative random errors to cancel each other out, bringing us closer to the true score. For example, if we were to take the average of scores shown as random error in Figure 9-1, we would end up with a score closer to the true score at the center of the target. One possible strategy for reducing random error and improving reliability is to record the average across several repetitions of a measurement rather than just the results of a single trial.

Sources of Measurement Error

In the development or testing of a measuring instrument, a protocol should be specified to maximize reliability, detailing procedures and instructions for using the instrument in the same way each time. When designing the protocol, researchers should anticipate sources of error which, once identified, can often be controlled.

➤ CASE IN POINT #1
The Timed “Up & Go” (TUG) test is a commonly used functional assessment to measure balance and gait speed, as well as to estimate risk for falls.1 Using a stopwatch, a clinician measures in seconds the time it takes for a subject to rise from a chair, walk a distance of 3 meters, turn, walk back to the chair, and sit. Faster times indicate better balance and less fall risk. The reliability and validity of the TUG have been studied under a variety of circumstances and in several populations.2
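The error-cancelling effect of averaging repeated trials can be illustrated with a short simulation. This is a hypothetical sketch (the true height, the noise level, and the number of trials are assumptions chosen for illustration, not values from the text):

```python
import random
import statistics

# Hypothetical example: the patient's true height is 170.0 cm, but every
# trial is contaminated by random error, simulated here as Gaussian noise.
TRUE_HEIGHT = 170.0
random.seed(1)  # fixed seed so the demonstration is repeatable

def measure() -> float:
    """One observed score = true score + random error (SD of 2 cm assumed)."""
    return TRUE_HEIGHT + random.gauss(0, 2.0)

single_trial = measure()
mean_of_trials = statistics.mean(measure() for _ in range(100))

# With purely random error, overestimates and underestimates tend to cancel,
# so the mean of repeated trials lies closer to the true score than a
# typical single trial does.
print(f"single trial: {single_trial:.2f} cm")
print(f"mean of 100:  {mean_of_trials:.2f} cm")
```

Averaging does not help with systematic error: a constant 0.25-inch offset would survive in the mean unchanged.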
The simplicity and efficiency of the TUG make it a useful example to illustrate reliability concepts.

Measurement errors can generally be attributed to three sources:

1. The individual taking the measurements (the rater): Errors can result from a lack of skill, not following the protocol, distraction, inaccurately recording or transcribing values, or an unconscious bias.

➤ The rater for the TUG is the clinician who observes the test and times it with a stopwatch. The rater must be able to follow the protocol to be consistent from trial to trial. For example, the rater must determine criteria for the exact moment the patient rises from the chair and sits back down. Instructions to the subject should be the same each time.

2. The measuring instrument itself: Imprecise instruments or environmental changes affecting instrument performance can also contribute to random error. Many instruments, especially mechanical ones, are subject to some level of background noise or fluctuating performance.

➤ The stopwatch must be working properly to record TUG time accurately. For instance, the battery could be weak or the stem of the stopwatch could occasionally stick. If the stopwatch were calibrated to measure time to the nearest second rather than fractions of a second, random measurement errors could result.

3. Variability of the characteristic being measured: Errors often occur from variations in physiological response or changes in subject performance. Blood pressure is an example of a response that can vary from moment to moment. Responses can be influenced by personal characteristics such as changes in motivation, cooperation, or fatigue, or by environmental factors such as transient noise or ambient temperature fluctuations. If the targeted response variable is inherently unstable, it may be difficult or impossible to quantify reliably (see Focus on Evidence 9-1).

➤ A patient’s TUG performance in rising, walking, and turning quickly can vary from moment to moment depending on their physical or emotional status. For example, multiple trials can cause some patients to fatigue. A patient with Parkinson’s disease (PD) may experience physiological changes in response to recent medication. If a patient is feeling anxious, they may walk cautiously, but move more quickly during periods of lessened anxiety.

Where errors result solely from the inherent instability of a response variable, they may be difficult to avoid. In other instances, however, measurement error can be substantially reduced through improved training of the rater or careful standardization of the protocol. It is important for clinicians and researchers to understand the theoretical and practical nature of the response variable so that sources of error affecting reliability can be anticipated and properly interpreted or controlled.

■ Measuring Reliability

As it is not possible to know a true score with certainty, the actual reliability of a test can never be directly calculated. Instead, we estimate the reliability of a test based on the statistical concept of variance, which is a measure of the variability among scores in a sample (see Chapter 22). The greater the dispersion of scores, the larger the variance; the more homogeneous the scores, the smaller the variance. Within any sample, the total variance in scores will be due to two primary sources.

To illustrate, suppose we ask two clinicians to score five patients performing the TUG. Let us also assume that the true scores remain constant. Across the 10 scores that are recorded, we would undoubtedly find that scores differed from one another. One source of variance comes
Focus on Evidence 9–1
Real Differences or Random Error?

Kluge et al3 wanted to establish the reliability of a mobile gait analysis system to measure several parameters, including timing of stride, stance and swing phases, and walking velocity. They studied 15 patients, 11 healthy and 4 with Parkinson’s disease. The participants were asked to walk at three self-selected speeds: fast, normal, and slow.

Researchers found excellent reliability for all parameters—except for gait velocity during slow walking. Although the lower reliability could have resulted from other unidentified sources of measurement error that are unique to slow-paced walking, the researchers suggested that a more likely explanation was that slow walking velocities were inherently less stable than parameters of fast or normal speed gait. They speculated that people do not use a consistent pattern for walking at a slow pace, making it difficult to measure with high reliability.
Because they are unitless, these relative values allow us to compare the reliability of different tests. For instance, if we studied reliability for the TUG, recorded in seconds, and the Berg Balance Test, a 14-item summative scale, we could determine which was more reliable using a relative index. The most reliable test would be the one whose relative reliability coefficient was closest to 1.00.

The most commonly used relative reliability indices are the intraclass correlation coefficient (ICC) for quantitative data and the kappa (κ) coefficient for categorical data.

➤ Studies have shown that the TUG has strong rater reliability in patients with lumbar disc disease (ICC = 0.97),4 and stroke (ICC >0.95).5

Reliability Exists in a Context

Reliability is not an inherent or immutable trait of a measurement tool. It can only be understood within the context of a tool’s application, including the characteristics of the subjects, the training and skill of the examiners, the setting, and the number and timing of test trials.7,8 Studies have also evaluated the unique reliability of various questionnaire tools when translated into different languages or administered to people of different cultures.9–11 Without consideration of contextual factors, the reliability of an instrument cannot be fairly judged (see Box 9-1).

Reliability Is Not All-or-None

Reliability is not an all-or-none trait but exists to some extent in any instrument. References made to an instrument as either “reliable” or “unreliable” should be
avoided since these labels fail to recognize this distinction. Reliability coefficients provide values that can help us estimate the degree of reliability present in a measurement system, but we must always exercise judgment in deciding whether a measurement is sufficiently reliable for its intended application. Application of reliability measures from published studies must be based on the generalizability of the study to clinical circumstances. A test’s reliability is not necessarily the same in all situations. This important issue of magnitude of reliability will be addressed again in Chapter 32.

■ Types of Reliability

Estimates of relative reliability vary depending on the type of reliability being analyzed. This section will introduce design considerations for four general approaches to reliability testing: test–retest reliability, rater reliability, alternate forms reliability, and internal consistency.
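As a rough sketch of how a relative reliability index such as the ICC is computed, the following illustrates the ICC(2,1) form (two-way random effects, absolute agreement, single measurement) for a small two-rater data set. The formula follows the standard variance-components layout, and the TUG-style scores are invented for illustration; statistical software provides tested implementations for real analyses.

```python
from statistics import mean

def icc_2_1(scores: list[list[float]]) -> float:
    """ICC(2,1) from a matrix with one row per subject, one column per rater.

    Computed from two-way ANOVA mean squares: between-subjects (msr),
    between-raters (msc), and residual error (mse). Illustrative sketch only.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]          # one mean per subject
    col_means = [mean(col) for col in zip(*scores)]    # one mean per rater

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two clinicians scoring five patients on the TUG (hypothetical seconds):
tug = [[11.2, 11.5], [14.1, 13.8], [9.8, 10.1], [16.0, 16.4], [12.3, 12.2]]
print(f"ICC(2,1) = {icc_2_1(tug):.3f}")
```

Because the raters differ only slightly while the patients differ substantially, most of the variance is between subjects and the coefficient lands close to 1.00.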
Test–Retest Reliability

To perform a test–retest study, a sample of individuals is subjected to the identical test on two separate occasions, keeping all testing conditions as constant as possible. The coefficient derived from this type of analysis is called a test–retest reliability coefficient. It estimates reliability in situations in which raters are minimally involved, such as patient self-report questionnaires or instrumented physiological tests that provide automated digital readouts.

➤ The FES-I is a self-report inventory completed by an individual using pencil and paper. It has been evaluated for test–retest reliability by asking subjects to repeat the inventory on two separate occasions one week apart.15

Because the FES-I is a questionnaire completed by the patient, error cannot be attributed to a rater, and therefore test–retest reliability is a measure of the instrument’s consistency. With a performance-based test such as the TUG, however, where the rater is a central part of the measurement system, consistent scoring will depend on adequate rater reliability and test–retest reliability cannot be logically measured.

Test–Retest Intervals

Because the stability of a response variable is such a significant factor, the time interval between tests must be considered. Intervals should be far enough apart to avoid fatigue, learning, or memory effects, but close enough to avoid genuine changes in the targeted variable.

➤ In the study of the FES-I, a 1-week interval was considered reasonable to ensure that significant changes in balance or fear of falling were unlikely but sufficient for subjects to no longer be influenced by recollection of their initial answers.15

Other measures may require different intervals. For instance, measurements of blood pressure need to be repeated in rapid succession to avoid changes in this labile response variable. Tests of infant development might need to be taken within several days to avoid the natural developmental changes that occur at early ages. If, however, we are interested in establishing the ability of an intelligence quotient (IQ) test to provide a stable assessment of intelligence over time, it might be more meaningful to test a child using intervals of a year or more. A researcher must be able to anticipate the inherent stability of the targeted response variable prior to making test–retest assessments.

Carryover

When a test is administered on two or more occasions, reliability can be influenced by carryover effects, wherein practice or learning during the initial trial alters performance on subsequent trials. A test of dexterity may improve due to motor learning. Strength measurements can improve following warm-up trials. Sometimes subjects are given a series of pretest trials to neutralize this effect and data are collected only after performance has stabilized. Because a retest score can be influenced by a subject’s effort to improve, some researchers will conceal scores from the subjects between trials.

Testing Effects

When the test itself is responsible for observed changes in a measured variable, the change is considered a testing effect. For example, range of motion testing can stretch soft tissue structures around a joint, thereby increasing the arc of motion on subsequent testing. The initial trial of a strength test could provoke joint pain, thereby weakening responses on the second trial.

Oftentimes, testing and carryover effects will manifest as systematic error, creating unidirectional changes across all subjects. Such an effect will not necessarily affect reliability coefficients, for reasons we have already discussed.

Coefficients for Test–Retest Reliability

Test–retest reliability of quantitative measurements is most often assessed using the ICC. With categorical data, percent agreement can be calculated and the kappa statistic applied. In clinical situations, the SEM, being an absolute reliability measure, provides a more useful estimate for interpreting how much error is likely to be present in a single measurement.

Rater Reliability

Many clinical measurements require that a human observer, or rater, is part of the measurement system. In some cases, the rater constitutes the only measuring instrument, such as in observation of gait, or palpation of lymph nodes or joint mobility. Other times, a test involves physical manipulation of a tool by the rater, as in the taking of blood pressure using a sphygmomanometer.

Rater reliability is important whether one or several testers are involved. Training and standardization may be necessary, especially when measuring devices are new or unfamiliar, or when subjective observations are used. Even when raters are experienced, however, rater reliability should be documented.

To establish rater reliability, the instrument and the response variable are assumed to be stable so that any differences between scores on repeated tests can be attributed solely to rater error. In many situations, this may be a large assumption and varied contributions to reliability should be considered (see Box 9-1).
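The percent agreement and kappa statistics mentioned under Coefficients for Test–Retest Reliability reduce to short calculations. The sketch below uses hypothetical fall-risk classifications from two raters; the chance-correction in kappa follows the usual Cohen formulation, shown here for illustration only.

```python
from collections import Counter

def percent_agreement(r1: list, r2: list) -> float:
    """Proportion of cases on which two raters assign the same category."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1: list, r2: list) -> float:
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    n = len(r1)
    p_obs = percent_agreement(r1, r2)
    m1, m2 = Counter(r1), Counter(r2)
    # Chance agreement: sum over categories of the product of marginal proportions.
    p_exp = sum((m1[c] / n) * (m2[c] / n) for c in set(r1) | set(r2))
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical: two raters classify 10 patients as fall risk (1) or not (0).
rater1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
rater2 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
print(f"agreement = {percent_agreement(rater1, rater2):.2f}")
print(f"kappa     = {cohens_kappa(rater1, rater2):.2f}")
```

Note that two raters who agree no more often than chance would expect get a kappa near zero even when raw percent agreement looks respectable, which is why kappa is preferred for categorical data.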
intervention for the group with PD, we would need to see a change of at least 3.5 seconds in the TUG to conclude that some true change had occurred. In a population with knee OA, we would only need to see a difference of at least 1.1 seconds to conclude that some of the change was real.

Procedures for calculating and interpreting the MDC will be presented in Chapter 32.

■ Methodological Studies: Reliability

Methodological research involves the development and testing of measuring instruments for use in research or clinical practice. These studies emphasize the use of outcome instruments as a way of indicating quality of care and effectiveness of intervention. Establishing reliability often requires multiple approaches, looking at different components with several indices to fully understand a tool’s accuracy. The implications of the results of these studies are far-reaching, as we decide which tools will serve us best to demonstrate the efficacy of our services. It is essential, therefore, that we understand the designs and analyses of methodological study so we can choose appropriate instruments and defend the decisions that we make.

Obviously, we want to develop tools that have the best chance of assessing patients accurately. There are many strategies that can be incorporated into studies and clinical practice to maximize reliability.

• Standardize measurement protocols. By creating clear operational definitions and instructions for performing a test, observers, raters, and patients will have a better chance of being consistent. This includes creating a standard environment and timeline to ensure that everyone is performing in the same way.
• Train raters. Depending on the complexity of the task, training raters is important to ensure that all measurements are taken in the same way. Reliability analyses can be done during training to determine its effectiveness. For some scales, formal training procedures are required for certification.
• Calibrate and improve the instrument. For new instruments, continued efforts to refine and improve a tool are warranted for both reliability and validity. For mechanical instruments, a defined calibration procedure should be documented.
• Take multiple measurements. Using the mean of two or more trials may be effective in reducing overall error, especially for less stable measurements (see Commentary).
• Choose a sample with a range of scores. Testing for reliability requires that the sample has some variance; that is, subjects must have a range of scores. Even if scores are consistent from trial 1 to trial 2, if they are all similar to each other, it is not possible to establish whether they contain error. Samples for reliability studies must include subjects with scores across the continuum to show reliability.
• Pilot testing. Even when an instrument has been tested for reliability before, researchers should document reliability for their own studies. Pilot studies are “try outs” to make sure procedures work and to establish reliability prior to the start of actual data collection.
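MDC values like those quoted above are commonly derived from the standard error of measurement (SEM), which in turn combines a sample's variability with its reliability coefficient. The sketch below uses one common formulation (MDC95 = 1.96 × SEM × √2); the SD and ICC values are hypothetical, and Chapter 32 presents the formal procedures.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

def mdc95(sem_value: float) -> float:
    """Minimal detectable change at the 95% confidence level.

    The sqrt(2) term reflects that error is present in both the baseline
    and the follow-up measurement. One common formulation; illustrative only.
    """
    return 1.96 * sem_value * math.sqrt(2)

# Hypothetical sample: TUG scores with SD = 4.0 seconds and ICC = 0.90.
example_sem = sem(4.0, 0.90)
print(f"SEM   = {example_sem:.2f} s")
print(f"MDC95 = {mdc95(example_sem):.2f} s")
```

Any observed change smaller than the MDC95 cannot confidently be distinguished from measurement error.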
COMMENTARY
It’s hard to imagine any measurement system that is free from error. What strategies can we apply in data collection procedures to ensure the most reliable outcome? One approach is to take more than one measurement of a behavior or characteristic whenever possible. You have probably experienced this in your own healthcare, as clinicians often take more than one blood pressure measurement, for example. But then we must ask, out of several trials, which value best represents the individual’s true score?

Investigators may not want to use the first score alone as the test value. The initial trial is often confounded by a warm-up or learning effect that will be evident as performance improves on subsequent trials. Therefore, warm-up or practice trials are often incorporated into a design.

Some researchers use the final score in a series. They rationalize that the last repetition in a set of trials will be stabilized following warm-up or practice. However, depending on the number of trials, the final score may
also be influenced by fatigue or a practice effect, and therefore will not necessarily represent the subject’s true effort.

Other researchers may use the subject’s “best” effort of several trials as a reflection of what the subject is maximally capable of doing. If we consider reliability theory, however, this is not necessarily the most accurate representation because random error can contribute both positive and negative error to an observed score.

Theoretically, then, the most representative score is obtained by acquiring several measurements and calculating the mean or average. In the “long run,” calculating the average will cancel out the error components of each overestimate and underestimate. Of course, there is room for argument in this rationale because this theory is based on a large number of trials. With only a few trials, this may be an unrealistic assumption. However, studies have shown that taking a mean value provides a more reliable measurement than any single value in a series of trials.29,30

So, bottom line? There is no simple answer to which score should be used. Generally, reliability of measurements that are less stable can be improved if averages are used. This is an important issue for statistical inference, as greater error (low reliability) will reduce the chances of finding statistically significant differences when comparing groups or looking for change. Clinicians should understand how reliability was determined to generalize findings to clinical situations.
26. Vaz S, Falkmer T, Passmore AE, Parsons R, Andreou P. The case for using the repeatability coefficient when calculating test-retest reliability. PLoS One 2013;8(9):e73990.
27. Huang SL, Hsieh CL, Wu RM, Tai CH, Lin CH, Lu WS. Minimal detectable change of the timed “up & go” test and the dynamic gait index in people with Parkinson disease. Phys Ther 2011;91(1):114–121.
28. Alghadir A, Anwer S, Brismée JM. The reliability and minimal detectable change of Timed Up and Go test in individuals with grade 1–3 knee osteoarthritis. BMC Musculoskelet Disord 2015;16:174.
29. Beattie P, Isaacson K, Riddle DL, Rothstein JM. Validity of derived measurements of leg-length differences obtained by use of a tape measure. Phys Ther 1990;70(3):150–157.
30. Mathiowetz V, Weber K, Volland G, Kashman N. Reliability and validity of grip and pinch strength evaluations. J Hand Surg [Am] 1984;9(2):222–226.
6113_Ch10_127-140 02/12/19 4:51 pm Page 127
CHAPTER 10

Concepts of Measurement Validity
—with K. Douglas Gross
Clinical decision-making is dependent on the accuracy and appropriate application of measurements. Whether for diagnosis, prognosis, or treatment purposes, without meaningful assessment of a patient’s condition, we have no basis for choosing the right interventions or making reasoned judgments. Validity relates to the confidence we have that our measurement tools are giving us accurate information about a relevant construct so that we can apply results in a meaningful way.

The purpose of this chapter is to describe the types of evidence needed to support validity and the application of this evidence to the interpretation of clinical measurements. Statistical procedures related to validity will be identified, with full coverage in Chapter 32.

■ Defining Validity

Validity concerns the meaning or interpretation that we give to a measurement.1 A longstanding definition characterizes validity as the extent to which a test measures what it is intended to measure. Therefore, a test for depression should be able to identify someone who is depressed and not merely tired. Similarly, an assessment of learning disabilities should be able to identify children with difficulties processing information rather than difficulties with physical performance.

Validity also addresses the interpretation and application of measured values.2 Beyond the tool itself, validity reflects the meaning ascribed to a score, recognizing the relevance of measurements to clinical decisions, as well as to the person being assessed.3,4

We draw inferences from measurements that go beyond just the obtained scores. For example, in clinical applications, when we measure variables such as muscle strength or blood pressure, it is not the muscle grade or blood pressure value that is of interest for its own sake, but what those values mean in terms of the integrity of the patient’s physical and physiological health.

Validity addresses three types of questions:

• Is a test capable of discriminating among individuals with and without certain traits, diagnoses, or conditions?
• Can the test evaluate the magnitude or quality of a variable or the degree of change from one time to another?
• Can we make useful and accurate predictions about a patient’s future status based on the outcome of a test?

We use those values to infer something about the cause of symptoms or presence of disease, to set goals, and to determine which interventions are appropriate. They may indicate how much assistance will be required and if there has been improvement following treatment. Such scores may also have policy implications, influencing the continuation of therapy, discharge disposition, or reimbursement for services. Measurements, therefore, have consequences for actions and decisions.5 A valid measure is one that offers sound footing to support the inferences and decisions that are made.

Validity and Reliability

A ruler is considered a valid instrument for calibrating length because we can judge how long an object is by measuring linear inches or centimeters. We would, however, question any attempt to assess the severity of
low back pain using measurements of leg length because we cannot make reasonable inferences about back pain based on that measurement.

Although validity and reliability are both needed to buttress confidence in our ability to accurately capture a target construct, it is important to distinguish between these two measurement properties. Reliability describes the extent to which a test is free of random error. With high reliability, a test will deliver consistent results over repeated trials. A reliable test may still suffer from systematic error, however. Such a test would deliver results that are consistently “off the mark.” For example, although a precise ruler may deliver highly reproducible measurements of leg length, these values would continue to “miss the mark” as indicators of back pain severity.

Hitting the Target

Validity concerns the extent to which measurements align with the targeted construct. Validity can be likened to hitting the center of a target, as shown in Figure 10-1. In (A), the hits are consistent, but missing the center each time. This is a reliable measure, but one that is systematically measuring the wrong thing and is therefore invalid. In (B), the hits are randomly distributed around the center with an equal number of overestimates and underestimates, reflecting both the poor reliability of the measure and the absence of any systematic error. However, by calculating an average score for the group, the overestimates and underestimates should cancel each other out, providing us with a valid score that is close to the center of the target.

In (C), the error reflects both poor reliability and poor validity. There is an inconsistent scatter, but there is also a pattern of being systematically off the mark in one direction. Even a calculated average score would still miss the intended target. Finally, in (D), the hits all cluster tightly around the center of the target and are therefore both highly reliable and highly valid.

Figure 10–1 Illustration of the distinction between validity and reliability. The center of the target represents the true score. A) Scores are reliable, but not valid, showing systematic error. B) Scores demonstrate random error but will provide a valid score on average. C) Scores are neither reliable nor valid. D) Scores hit the center of the target, indicating high reliability and validity.

What Validity Is and Is Not

Because inferences are difficult to verify, establishing validity is not as straightforward as establishing reliability. For many variables, there are no obvious rules for judging whether a test is measuring the critical property of interest. As with reliability, we do not think of validity in an all-or-none sense, but rather as a characteristic that may be present to a greater or lesser degree.

Also as with reliability, validity is not an immutable characteristic of an instrument itself. Instead, validity can be evaluated only within the context of an instrument’s intended use, including the population and setting to which it will be applied. There are no valid or invalid instruments per se. The extent of validity is specific to the instrument’s intended application, including its ability to assess change or quantify outcomes.6 Rather than asking, “Is an instrument valid or invalid?,” it is more appropriate to ask, “How valid are the results of a test for a given purpose within this setting?”

For some physical variables, such as distance or speed, measurements can be acquired by direct observation. However, for variables that represent abstract constructs (such as anxiety, depression, intelligence, or pain), direct observation may not be possible and we are required to take measurements of a correlate or proxy of the actual property under consideration. Therefore, we make inferences about the magnitude of a latent trait (such as anxiety) based on observations of related discernible behaviors (such as restlessness and agitation). In this way, abstract constructs are in large part defined by the methods or instruments used to measure them.7

■ Types of Evidence for Validity

Validity is established by constructing an “evidentiary chain” that links the actual methods of measurement to the intended application.8 By way of theory, hypothesis,
logic, and data, evidence-based practitioners seek to establish that an instrument’s output is related and proportional to the target construct and that it can provide a sound basis upon which to make clinical decisions. A measurement may only represent a highly abstruse construct to a degree, but perhaps not perfectly.9 As we have emphasized, validity is never an “all-or-none” characteristic, and choices are often necessary when determining whether the validity of a measure is sufficient for the current task.

Several types of evidence can be used to support a tool’s use, depending on specific conditions. Often referred to as the “3 Cs,” the most common forms of evidence reflect content validity, criterion-related validity, and construct validity (see Fig. 10-2 and Table 10-1).

Table 10-1 Evidence to Support Measurement Validity: The 3 Cs

TYPE OF VALIDITY | PURPOSE
Content Validity | Establishes that the multiple items that make up a questionnaire, inventory, or scale adequately sample the universe of content that defines the construct being measured.
Criterion-Related Validity | Establishes the correspondence between a target test and a reference or gold standard measure of the same construct.
  Concurrent Validity | The extent to which the target test correlates with a reference standard taken at relatively the same time.
  Predictive Validity | The extent to which the target test can predict a future reference standard.
Construct Validity | Establishes the ability of an instrument to measure the dimensions and theoretical foundation of an abstract construct.
  Convergent Validity | The extent to which a test correlates with other tests of closely related constructs.
  Discriminant Validity | The extent to which a test is uncorrelated with tests of distinct or contrasting constructs.

➤ CASE IN POINT #1
The Harris Infant Neuromotor Test (HINT) is a screening tool that can be used for early identification of neuromotor, cognitive, and behavioral disorders during the first year of life.10 The test is intended to identify difficulties that are often missed until they become more obvious in preschool years.11 It consists of three parts, including background information, a parent questionnaire, and an infant examination. A lower score indicates greater maturity and a higher score indicates greater risk for developmental delay. The test has demonstrated strong rater and test–retest reliability (ICC = 0.99).12 Instructions for administration have been formalized in a training manual.13

Content Validity

Many behavioral variables have a wide theoretical domain or universe of content that consists of all the traits
is already established and assumed to be valid. If results from the two tests are correlated or in agreement, the target test is considered a valid indicator of the criterion score.

Finding the Gold Standard
A crucial element of criterion validation is identification of an acceptable criterion measure. The relationship between the target test and the criterion measure should make theoretical sense in terms of the construct that both tests represent.3 We must feel confident in assuming that the criterion itself is a valid measure of that construct. In many areas of physical and physiological science, standard criteria are readily available for validating clinical tools. For instance, heart rate (the target test) has been established as a valid indicator of energy cost during exercise by correlating it with standardized oxygen consumption values (the gold standard).15 Unfortunately, the choice of a gold standard for more abstract constructs is not always as obvious. For instance, if we want to establish the validity of a new quality of life questionnaire, with which referent should it be compared? Sometimes, the best one can do is use another instrument that has already achieved some degree of acceptance. In these instances, criterion-related validation may be based on comparison to a reference standard that, although not considered a true gold standard, is still regarded as an acceptable criterion.

Box 10-1 Diagnostic and Screening Accuracy
Criterion measures are often applied to validate diagnostic or screening tools. They may require concurrent or predictive approaches depending on when the condition is manifested. A comparison of findings on the criterion and target tests generates data to indicate sensitivity, or the extent to which the target test can accurately identify those with the condition (true-positive findings), and specificity, which is the ability of the target test to identify those whom the criterion has determined are without the condition (true-negative findings). These are presented as percentages, with values closer to 100% indicating better accuracy.
These analyses also provide information about false positive and false negative findings, which can impact the appropriate use of a test. For example, the HINT parent questionnaire was shown to have high sensitivity (80%) and specificity (90.9%) for identifying infants with developmental concerns, using the Bayley Motor Scale of Infant Development as the criterion.16 The high specificity indicates that there is a low risk of a false positive; that is, a low risk of identifying a normally developing child as having developmental delay. The 80% sensitivity also suggests that there is minimal risk of a false negative result, wherein a child with developmental delay remains undetected.
See Chapter 33 for a full description of procedures for diagnostic accuracy.
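The sensitivity and specificity arithmetic described in Box 10-1 can be sketched in a few lines. This is an illustrative sketch only; the counts below are hypothetical, not data from the HINT studies.

```python
# Sensitivity and specificity from a 2x2 table comparing a target test
# with a criterion (reference standard). Counts are hypothetical,
# chosen only to illustrate the arithmetic in Box 10-1.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of criterion-positive cases the target test detects."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of criterion-negative cases the target test clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical screening results: 40 of 50 affected infants flagged
# (10 false negatives); 90 of 100 unaffected infants correctly cleared.
sens = sensitivity(true_pos=40, false_neg=10)   # 0.80
spec = specificity(true_neg=90, false_pos=10)   # 0.90
print(f"Sensitivity: {sens:.1%}, Specificity: {spec:.1%}")
```

Values closer to 100% indicate better accuracy; the complements (false negatives for sensitivity, false positives for specificity) drive the clinical interpretation discussed in the box.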
Criterion-related validity can be separated into two basic approaches: concurrent validation and predictive validation, differentiated by the time frame within which measurements are acquired.

Concurrent Validity
Concurrent validity is studied when target and criterion test scores are obtained at approximately the same time so that they both reflect the same incident of behavior. This approach is often used to establish the validity of diagnostic or screening tests to determine the presence or absence of diseases or other health conditions (see Box 10-1).

➤ Concurrent validity for the HINT was established by correlating HINT scores to contemporaneous scores on the Ages and Stages Questionnaire (r = .84),17 the Bayley Scale of Infant Development Mental Scale (r = –.73) and Motor Scale (r = –.89),12 and the Alberta Infant Motor Scale (r = –.93 for at-risk infants at 10 to 12 months).18

The reference standards used to evaluate the concurrent validity of the HINT represent similar constructs, although none could be considered a true gold standard for the construct of “developmental delay.” We risk becoming mired in circular reasoning if we are not convinced that the reference standard itself has sufficient validity. Concurrent validation is useful in situations in which a new or untested tool is potentially more efficient, easier to administer, more practical, or safer than the current method, and when it is being proposed as an alternative.

Predictive Validity
Predictive validity assessments determine whether a measure will be a valid predictor of some future criterion score or behavior. To assess predictive validity, a target test is administered at a baseline session, followed by a period of time before the criterion score is obtained. The interval between these two tests is dependent on the time needed to achieve the criterion and may be as long as several years.

➤ Researchers evaluated the predictive validity of the HINT by administering the test to a sample of children at 3 to 12 months of age and then performing the Bayley Scale of Infant Development-II at 17 to 22 months. They found that there was a weak correlation for the HINT and Bayley-II Mental Scale (r = –.11), but a moderate correlation with the Motor Scale (r = –.49).12

Predictive validation can be used for several purposes, such as validating screening procedures used to identify risk factors for future disease or to determine the validity of prognostic indicators used to predict treatment outcomes.

6113_Ch10_127-140 02/12/19 4:51 pm Page 132

Analysis of Criterion Validity
Criterion validity, both concurrent and predictive, is typically assessed using a correlation coefficient or regression procedure (see Chapters 29 and 30). With two tools using the same units, the ICC can be used. Limits of agreement can also be applied to show how scores from two tools agree or disagree (see Chapter 32).

■ Construct Validity

Construct validity reflects the ability of an instrument to measure the theoretical dimensions of a construct. Because abstract constructs do not directly manifest as physical events, we often make inferences through observable behaviors, measurable performance, or patient self-report.

Methods of Construct Validation
Because of the abstract and complex nature of constructs, construct validation is never fully realized. Each attempt to validate an instrument provides evidence to support or refute the theoretical framework behind the construct. This evidence can be gathered by a variety of methods (see Box 10-2). The more commonly used procedures include the known groups method, methods of convergence or discrimination, and factor analysis.

Known Groups Method
The most general type of evidence in support of construct validity is provided when a test can discriminate between individuals who are known to have the trait and those who do not. Using the known groups method, a criterion is chosen to identify the presence or absence of a particular characteristic and hypotheses are constructed to predict how these different groups are expected to behave. The validity of a target test is supported if the test’s results document these anticipated differences.
Focus on Evidence 10–1
Constructing Pain

Pain is a difficult construct to define because of its subjective and unobservable nature. Perceptions of pain and pain relief may be subject to cognitive and behavioral influences that are specific to individuals and to cultural, sociological, or religious beliefs.21 Many different assessment tools have been developed to measure acute and chronic pain, some with a disease-specific focus, others based on patient-specific function. In the development of scales or the choice of a clinical tool, the theoretical context of the measurement must be determined.
For instance, it is possible that acute pain is a more unidimensional construct, and its presence and intensity during a particular time period can be defined using a numerical rating scale or VAS. Grilo et al22 used this method to demonstrate changes in patients with acute pain from rheumatic conditions, showing a correlated linear decline in score with a VAS and perceived level of pain.
Chronic pain, however, will have a very different context, with multiple dimensions related to physical, emotional, cognitive, and social functions, including quality of life.23 For example, in a systematic review of measures for cancer pain, Abahussin et al24 reviewed eight tools, each with different scoring rubrics and subscales, reflecting traits such as severity, impact on function, stoicism, quality of life, sensory, and affective components. They concluded that most studies of these tools had poor methodological quality and that further study was needed to determine validity characteristics for use with that population.
How one chooses to measure pain will greatly affect how the outcome will be interpreted. Different elements may be important, depending on the nature of the underlying cause of or intervention for the pain.
Convergence and Discrimination
Construct validity of a test can be evaluated in terms of how its measures compare to other tests of closely related or contrasting constructs.25 In other words, it is important to demonstrate what a test does measure as well as what it does not measure. This determination is based on the concepts of convergence and discrimination.

➤ CASE IN POINT #2
Children diagnosed with autism spectrum disorder (ASD) often experience anxiety. It is important to determine, however, whether measures of ASD can distinguish between anxiety and ASD symptoms as separate constructs, or whether these two traits are part of a common neurological process. This information is essential to determine appropriate diagnostic and intervention strategies.

Convergent validity indicates that two measures believed to reflect similar underlying phenomena will yield comparable results or scores that are correlated. Convergence also implies that the theoretical context behind the construct will be supported by these related tests. Discriminant validity, also called divergent validity, indicates that different results, or low correlations, are expected from measures that assess distinct or contrasting characteristics.
To document these types of relationships, Campbell and Fiske25 suggested that validity should be evaluated in terms of both the characteristic being measured and the method used to measure it, using a technique called the multitrait–multimethod matrix (MTMM). In this approach, two or more traits are measured by two or more methods. The correlations among variables within and between methods should show that different tests measuring overlapping traits produce high correlations, demonstrating convergent validity, whereas those that measure distinct traits produce low correlations, demonstrating discriminant validity.
■ Norm and Criterion Referencing

The validity of a test depends on evidence supporting its use for a specified purpose. Tests can be designed to document performance in relation to an external standard, indicating an individual’s actual abilities, or to document ability in relation to a representative group.

➤ CASE IN POINT #4
Rikli and Jones32 have developed standards for fitness testing in older adults, predicting the level of capacity needed for maintaining physical independence in later life. The Senior Fitness Test (SFT) consists of a battery of tasks. One task is the chair stand test, which is a test of lower body strength, recording the number of times an individual can fully stand from sitting over 30 seconds with arms folded across the chest.33

Criterion Referencing
A criterion-referenced test is interpreted according to a fixed standard that represents an acceptable level of performance. In educational programs, for example, a minimum grade on a test might be used to specify competence for the material being tested. Criterion scores have been established for laboratory tests, based on medically established standards. These standards indicate

➤ For women, the standard for the chair stand test is:
60–64 years: 15 reps    80–84 years: 12 reps
65–69 years: 15 reps    85–89 years: 11 reps
70–74 years: 14 reps    90–94 years: 9 reps
75–79 years: 13 reps

These standards can be used to determine which patients are likely or unlikely to maintain physical independence and to set goals for intervention. Criterion-based scores are useful for establishing treatment goals, planning intervention, and measuring change, as they are based on an analysis of scores that are required for successful performance.

Norm Referencing
Norm-referenced tests are standardized assessments designed to compare and rank individuals within a defined population. Such a measure indicates how individuals perform relative to one another. Often scaled from 0 to 100, the norm-referenced score is obtained by taking raw scores and converting them to a standardized score based on the distribution of test-takers. This means that a judgment is not made based on how large or small a measurement is, but rather on how the measured value relates to a representative distribution of the population.
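As a sketch of how a criterion-referenced score is interpreted, the women’s chair-stand standards quoted above can be applied as a simple lookup against a fixed standard. The function and its name are our own illustration, not part of the SFT.

```python
# Criterion-referenced interpretation: compare an individual's
# chair-stand count against the fixed standard for her age group.
# Standards are the SFT women's values quoted in the text.

CHAIR_STAND_STANDARD_WOMEN = {
    (60, 64): 15, (65, 69): 15, (70, 74): 14, (75, 79): 13,
    (80, 84): 12, (85, 89): 11, (90, 94): 9,
}

def meets_standard(age: int, reps: int) -> bool:
    """True if the observed repetitions meet the age-group standard."""
    for (lo, hi), required in CHAIR_STAND_STANDARD_WOMEN.items():
        if lo <= age <= hi:
            return reps >= required
    raise ValueError(f"No standard defined for age {age}")

print(meets_standard(72, 16))  # True: 16 reps meets the 14-rep standard
print(meets_standard(83, 10))  # False: below the 12-rep standard
```

Note that the judgment depends only on the fixed standard, not on how other individuals performed; that distinction is what separates criterion referencing from the norm referencing described next.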
For example, you are probably familiar with standardized tests such as the SAT or GRE. If you recall, your scores were provided as percentiles, indicating how you did relative to others who took the test at the same time. On a different day, your friend might get the same score as you but be in a different percentile. If a professor decides to “curve” test scores so that the grades of the majority of the class are considered a “C,” regardless of a student’s absolute score, then the test would be norm-referenced.
Norm-referenced scores are often based on a distribution of “normal” values obtained through testing a large group that meets a certain profile (see Chapter 20). The mean of the distribution of scores for the reference group, along with standard deviations, is used as the standard to determine how an individual performs. For example, we can look at charts to determine whether we fit the profile for average weight for our height. These “norms” can be converted to norm-referenced values. Norm-referenced scores are typically reported as ranks or percentiles, indicating an individual’s position within a set of scores (see Chapter 22).

➤ Normative values for the SFT were established on data from over 7,000 individuals aged 60 to 94 from 20 states.33 These norms were converted to percentiles as a way of judging how individuals performed relative to their age group. For example, among men, percentile scores for the chair stand for those aged 60 to 64 years were:
25th: 14 stands    75th: 19 stands
50th: 16 stands    90th: 22 stands

These values allow us to interpret an individual’s score by determining whether they scored better or worse than their peers. These values do not, however, provide a basis for determining whether a person’s demonstrated fitness level is sufficient to maintain functional independence in later years, nor will they help to design intervention or allow measurement of actual change. Such a determination would require a criterion-referenced comparison. Tests may provide both norm- and criterion-referenced methods (see Focus on Evidence 10-2).

useful, an instrument should be able to detect small but meaningful change in a patient’s performance. This aspect of validity is the responsiveness of an instrument.
This section offers a brief discussion of concepts of change. Further discussion of MDC and MCID, including methods for calculating and interpreting them, is provided in Chapter 32.

Minimal Clinically Important Difference
In Chapter 9, we introduced the concept of minimal detectable change (MDC), which is the minimum amount of measured change needed before we can rule out measurement error as being solely responsible. When change scores exceed the MDC, we can be confident that some real change in status has occurred. This is a matter of reliability. Beyond this, however, the MDC is not helpful in determining whether the changes that have been measured are sufficient to impact the patient or their care in meaningful ways.
The minimal clinically important difference (MCID) is the smallest difference in a measured variable that signifies an important difference in the patient’s condition.39 It acts as an indicator of the instrument’s ability to reflect impactful changes when they have occurred. The MCID is a reflection of the test’s validity and can be helpful when choosing an instrument, setting goals for treatment, and determining whether treatment is successful.
The MCID is based on a criterion that establishes when sufficient change has taken place to be considered “important.”40 This criterion is a subjective determination, typically by the patient, who is asked to indicate whether they feel “better” or “worse” to varying levels. The average score that matches those who have indicated at least “somewhat better” or “somewhat worse” is usually used to demarcate the MCID.

The MCID may also be called the minimal clinically important change (MCIC),41 minimal important difference (MID),42 or minimal important change (MIC).43,44 The term minimal clinically important improvement (MCII) has been used when the change is only measured as improvement.45,46
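One way to sketch the anchor-based logic described above is to average change scores over patients who report feeling at least “somewhat better.” This is a simplified illustration with invented data; published MCID studies use more elaborate methods (see Chapter 32).

```python
# Anchor-based MCID sketch: the mean change score among patients who
# rated themselves at least "somewhat better." Ratings and change
# scores are invented for illustration only.

from statistics import mean

patients = [
    {"change": 2,  "rating": "no change"},
    {"change": 5,  "rating": "somewhat better"},
    {"change": 7,  "rating": "somewhat better"},
    {"change": 12, "rating": "much better"},
    {"change": 1,  "rating": "no change"},
    {"change": 8,  "rating": "much better"},
]

IMPROVED = {"somewhat better", "much better"}

def anchor_based_mcid(records):
    """Mean change among patients reporting at least 'somewhat better'."""
    return mean(r["change"] for r in records if r["rating"] in IMPROVED)

print(anchor_based_mcid(patients))  # mean of 5, 7, 12, 8 -> 8
```

The same logic applied to patients reporting “somewhat worse” would demarcate an MCID for deterioration, which, as noted below, need not equal the value for improvement.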
Focus on Evidence 10–2
Defining Normal

The Pediatric Evaluation of Disability Inventory (PEDI) was developed to evaluate age-related capabilities of children with disabilities, aged 0 to 8 years. The test is organized in three domains, including self-care, mobility, and social function.35 It is scored by summing raw scores within each domain based on ability to perform tasks and the amount of assistance needed. The raw scores are then converted to a normed value ranging from 0 to 100. Higher scores indicate a lesser degree of disability. The normal values were standardized based on a sample of 412 children within the appropriate age groups who did not exhibit any functional disabilities.
Haley et al36 generated reference curves that provide raw scores and percentiles for different age groups. For example, data for the motor subscale are shown here. Scale scores can serve as a criterion reference for evaluative purposes, such as determining the actual amount of change that occurred following intervention. The norm-referenced percentile scores can be used to distinguish children with high or low function as a basis for determining eligibility for disability-related services.37 These two approaches can provide useful normative and clinical profiles of patients.38

From Haley SM, Fragala-Pinkham MA, Ni PS, et al. Pediatric physical functioning reference curves. Pediatr Neurol 2004;31(5):333–41, Figure 1, p. 337. Used with permission.
[Figure: Pediatric physical functioning reference curves for the mobility subscale — Mobility Score (0–100) plotted against Age (0–15 years), with curves for the 3rd, 10th, 25th, 50th, 75th, 90th, and 97th percentiles.]
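The percentile-rank interpretation used for norm-referenced scores can be sketched as follows. The reference sample here is invented, and this uses one simple definition of percentile rank (the percentage of reference scores at or below the target score); other definitions exist.

```python
# Norm-referenced interpretation: convert a raw score to a percentile
# rank against a reference distribution. The reference sample below is
# invented, not SFT or PEDI normative data.

def percentile_rank(score, reference):
    """Percentage of the reference sample scoring at or below `score`."""
    at_or_below = sum(1 for s in reference if s <= score)
    return 100 * at_or_below / len(reference)

reference_sample = [9, 11, 12, 13, 14, 14, 15, 16, 17, 19, 20, 22]

print(percentile_rank(16, reference_sample))  # 8 of 12 scores are <= 16
```

As the text emphasizes, such a rank says only how the individual stands relative to the reference group, not whether the underlying performance meets any external standard.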
the outcome measures used, the population and setting, and the design of the study. It is not a fixed value for a given test. In this case, we can see that the results are not the same for patients with different surgical interventions and that they depend on whether they saw improvement or got worse. Those who had TKA needed to show greater improvement to be clinically meaningful. Many studies do not make this important distinction regarding generalization of the MCID, which may lead to misleading conclusions, depending on the direction of change.46,48

Ceiling and Floor Effects
One concern in the interpretation of improvement or decline is the ability of an instrument to reflect changes at extremes of a scale. Some instruments exhibit a floor effect, an inability to show small differences when patients are near the bottom of the scale. For instance, a developmental assessment with a floor effect might not be able to distinguish small improvements in children who are low functioning. Conversely, a patient who is functioning at a high level may show no improvement if a scale is not able to distinguish small changes at the top, called a ceiling effect.

➤ CASE IN POINT #6
Elbers et al49 studied the measurement properties of the Multidimensional Fatigue Inventory (MFI) to assess elements of fatigue in patients with Parkinson’s disease. The MFI is a self-report questionnaire with items in five subscales. Each subscale contains four items, scored 1 to 5 on a Likert scale. Scores are determined for the total test (maximum 100) as well as for each subscale (range 4 to 20).

The determination of ceiling and floor effects has been based on finding more than 15% of scores clustering at the lower or higher end of a scale.50,51

➤ Researchers in the fatigue study found no floor or ceiling effects when analyzing the total score, but did see a floor effect with the mental fatigue subscale (18.3% of patients at lowest end) and a ceiling effect for the physical fatigue subscale (16.3% at high end).

These findings suggest that these subscales should be examined to determine whether they can adequately convey the degree of improvement or decline in these dimensions of fatigue in this population. For instance, floor effects would attenuate the ability to capture improvement in patients who were “worse off” at baseline, potentially underestimating the effect of an intervention.52

■ Methodological Studies: Validity

Methodological research incorporates testing of both reliability and validity to determine their application and interpretation in a variety of clinical situations. Several strategies are important as one approaches validity studies.

• Fully understand the construct. Validity studies require substantial background to understand the theoretical dimensions of the construct of interest. Through a thorough review of literature, it is important to know which other tools have been used for the same or similar constructs and how they have been analyzed.
• Consider the clinical context. Tools that have already gone through validity testing may not be applicable to all clinical circumstances or patient populations and may need further testing if applied for different purposes than used in original validity studies.
• Consider several approaches. Validity testing requires looking at all aspects of validity and will benefit from multiple analysis approaches. The broader the testing methods, the more the test’s validity can be generalized.
• Consider validity issues if adapting existing tools. Many clinicians find that existing tools do not fit their applications exactly, and thus will adapt sections or definitions to fit their needs. Although it is useful to start with existing tools, changing a tool’s properties can alter validity, and testing should be done to ensure that validity is maintained.
• Cross-validate outcomes. Validity properties that are documented on one sample may not be the same if tested on a different sample. This issue is especially important when cutoff scores or predictive values are being generated for application to a larger population. Cross-validation allows us to demonstrate the consistency of outcomes on a second sample to determine whether the test criteria can be generalized across samples.
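The 15% clustering criterion for floor and ceiling effects described above can be checked directly from a set of scores. The scores here are invented; only the subscale range (4 to 20) is taken from the MFI case description.

```python
# Flag a floor/ceiling effect when more than 15% of respondents score
# at the scale minimum/maximum (the criterion cited in the text).
# The subscale scores below are invented for illustration.

def floor_ceiling(scores, scale_min, scale_max, threshold=0.15):
    """Return whether floor/ceiling clustering exceeds the threshold."""
    n = len(scores)
    floor = sum(1 for s in scores if s == scale_min) / n
    ceiling = sum(1 for s in scores if s == scale_max) / n
    return {"floor_effect": floor > threshold, "ceiling_effect": ceiling > threshold}

# 20 invented subscale scores (possible range 4-20); 4 of 20 (20%)
# sit at the floor, only 1 of 20 (5%) at the ceiling.
subscale = [4, 4, 4, 4, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]

print(floor_ceiling(subscale, scale_min=4, scale_max=20))
```

With these data only the floor effect is flagged (20% > 15%), mirroring the kind of subscale-level finding reported in the fatigue study.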
COMMENTARY
In our lust for measurement, we frequently measure that which we can rather
than that which we wish to measure . . . and forget that there is a difference.
—George Yule (1871-1951)
British statistician
The importance of validity for evidence-based practice cannot be overemphasized. All measurements have a purpose, which must be understood if we are to apply them appropriately and obtain meaningful data. For example, fitness standards for elderly individuals may influence whether a patient is referred for therapy and whether insurance coverage will be recommended. The application of MCID scores can impact whether a patient is considered improved and whether payers will reimburse for continued care. Diagnostic criteria must be accurate if we are to determine whether an intervention is warranted, especially if that intervention has associated risks. From an evidence perspective, we must be able to determine whether outcomes of clinical trials are applicable to practice, understanding the consequences of using measurements for clinical decisions. Because validity is context-driven, we must have confidence in the choice of instrument and the validity of the data that are generated.
REFERENCES

1. McDowell I. Measuring Health: A Guide to Rating Scales and Questionnaires. 3 ed. New York: Oxford University Press, 2006.
2. American Educational Research Association, American Psychological Association, National Council on Measurement in Education, et al. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association, 2014.
3. Messick S. Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. Am Psychol 1995;50(9):741–749.
4. Streiner DL. Statistics commentary series: commentary no. 17-validity. J Clin Psychopharmacol 2016;36(6):542–544.
5. Brown T. Construct validity: a unitary concept for occupational therapy assessment and measurement. Hong Kong J Occup Ther 2010;20(1):30–42.
6. Patrick DL, Chiang YP. Measurement of health outcomes in treatment effectiveness evaluations: conceptual and methodological challenges. Med Care 2000;38(9 Suppl):II14–25.
7. Whelton PK, Carey RM, Aronow WS, Casey DE, Jr., Collins KJ, Dennison Himmelfarb C, DePalma SM, Gidding S, Jamerson KA, Jones DW, et al. 2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA guideline for the prevention, detection, evaluation, and management of high blood pressure in adults: a report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Hypertension 2018;71(6):e13–e115.
8. Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ 2003;37(9):830–837.
9. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med 2006;119(2):166.e7–e16.
10. Harris SR, Megens AM, Backman CL, Hayes V. Development and standardization of the Harris Infant Neuromotor Test. Infants & Young Children 2003;16(2):143–151.
11. Harris SR. Early identification of motor delay: family-centered screening tool. Can Fam Physician 2016;62(8):629–632.
12. Harris SR, Daniels LE. Reliability and validity of the Harris Infant Neuromotor Test. J Pediatr 2001;139(2):249–253.
13. Harris SR, Megens AM, Daniels LE. Harris Infant Neuromotor Test (HINT). Test User’s Manual Version 1.0. Clinical Edition. Chicago, IL: Infant Motor Performance Scales, 2010. Available at http://thetimp.com/store/large/382h6/TIMP_Products/HINT_Test_Manual.html. Accessed June 14, 2017.
14. Harris SR, Daniels LE. Content validity of the Harris Infant Neuromotor Test. Phys Ther 1996;76(7):727–737.
15. Buckley JP, Sim J, Eston RG, Hession R, Fox R. Reliability and validity of measures taken during the Chester step test to predict aerobic power and to prescribe aerobic exercise. Br J Sports Med 2004;38(2):197–205.
16. Harris SR. Parents’ and caregivers’ perceptions of their children’s development. Dev Med Child Neurol 1994;36(10):918–923.
17. Westcott McCoy S, Bowman A, Smith-Blockley J, Sanders K, Megens AM, Harris SR. Harris Infant Neuromotor Test: comparison of US and Canadian normative data and examination of concurrent validity with the Ages and Stages Questionnaire. Phys Ther 2009;89(2):173–180.
18. Tse L, Mayson TA, Leo S, Lee LL, Harris SR, Hayes VE, Backman CL, Cameron D, Tardif M. Concurrent validity of the Harris Infant Neuromotor Test and the Alberta Infant Motor Scale. J Pediatr Nurs 2008;23(1):28–36.
19. Megens AM, Harris SR, Backman CL, Hayes VE. Known-groups analysis of the Harris Infant Neuromotor Test. Phys Ther 2007;87(2):164–169.
20. Waltz CF, Strickland O, Lenz ER. Measurement in Nursing and Health Research. 4 ed. New York: Springer Publishing, 2010.
21. Melzack R, Katz J. The McGill Pain Questionnaire: appraisal and current status. In: Turk DC, Melzack R, eds. Handbook of Pain Assessment. New York: Guilford Press, 1992.
22. Grilo RM, Treves R, Preux PM, Vergne-Salle P, Bertin P. Clinically relevant VAS pain score change in patients with acute rheumatic conditions. Joint Bone Spine 2007;74(4):358–361.
23. Breivik H, Borchgrevink PC, Allen SM, Rosseland LA, Romundstad L, Hals EK, Kvarstein G, Stubhaug A. Assessment of pain. Br J Anaesth 2008;101(1):17–24.
24. Abahussin AA, West RM, Wong DC, Ziegler LE. PROMs for pain in adult cancer patients: a systematic review of measurement properties. Pain Pract 2019;19(1):93–117.
25. Campbell DT, Fiske DW. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol Bull 1959;56:81.
26. Renno P, Wood JJ. Discriminant and convergent validity of the anxiety construct in children with autism spectrum disorders. J Autism Dev Disord 2013;43(9):2135–2146.
27. Meenan RF, Mason JH, Anderson JJ, Guccione AA, Kazis LE. AIMS2. The content and properties of a revised and expanded Arthritis Impact Measurement Scales Health Status Questionnaire. Arthritis Rheum 1992;35(1):1–10.
28. Brown JH, Kazis LE, Spitz PW, Gertman P, Fries JF, Meenan RF. The dimensions of health outcomes: a cross-validated examination of health status measurement. Am J Public Health 1984;74(2):159–161.
29. Mason JH, Anderson JJ, Meenan RF. A model of health status for rheumatoid arthritis. A factor analysis of the Arthritis Impact Measurement Scales. Arthritis Rheum 1988;31(6):714–720.
30. Arthritis Impact Measurement Scales 2 (AIMS2). Available at http://geriatrictoolkit.missouri.edu/aims2/AIMS2-ORIGINAL.pdf. Accessed September 3, 2018.
31. Meenan RF, Gertman PM, Mason JH, Dunaif R. The arthritis impact measurement scales. Further investigations of a health status measure. Arthritis Rheum 1982;25(9):1048–1053.
32. Rikli RE, Jones CJ. Senior Fitness Test Manual. 2 ed. Champaign, IL: Human Kinetics, 2013.
33. Rikli RE, Jones CJ. Functional fitness normative scores for community-residing older adults, ages 60–94. J Aging Phys Act 1999;7(2):162–181.
34. Rikli RE, Jones CJ. Development and validation of criterion-referenced clinically relevant fitness standards for maintaining physical independence in later years. Gerontologist 2013;53(2):255–267.
35. Feldman AB, Haley SM, Coryell J. Concurrent and construct validity of the Pediatric Evaluation of Disability Inventory. Phys Ther 1990;70(10):602–610.
36. Haley SM, Fragala-Pinkham MA, Ni PS, Skrinar AM, Kaye EM. Pediatric physical functioning reference curves. Pediatr Neurol 2004;31(5):333–341.
37. Haley SM, Coster WI, Kao YC, Dumas HM, Fragala-Pinkham MA, Kramer JM, Ludlow LH, Moed R. Lessons from use of the Pediatric Evaluation of Disability Inventory: where do we go from here? Pediatr Phys Ther 2010;22(1):69–75.
38. Kothari DH, Haley SM, Gill-Body KM, Dumas HM. Measuring functional change in children with acquired brain injury (ABI): comparison of generic and ABI-specific scales using the Pediatric Evaluation of Disability Inventory (PEDI). Phys Ther
41. Soer R, Reneman MF, Vroomen PC, Stegeman P, Coppes MH. Responsiveness and minimal clinically important change of the Pain Disability Index in patients with chronic back pain. Spine 2012;37(8):711–715.
42. Khair RM, Nwaneri C, Damico RL, Kolb T, Hassoun PM, Mathai SC. The minimal important difference in Borg Dyspnea Score in pulmonary arterial hypertension. Ann Am Thorac Soc 2016;13(6):842–849.
43. Rodrigues JN, Mabvuure NT, Nikkhah D, Shariff Z, Davis TR. Minimal important changes and differences in elective hand surgery. J Hand Surg Eur Vol 2015;40(9):900–912.
44. Terwee CB, Roorda LD, Dekker J, Bierma-Zeinstra SM, Peat G, Jordan KP, Croft P, de Vet HC. Mind the MIC: large variation among populations and methods. J Clin Epidemiol 2010;63(5):524–534.
45. Tubach F, Ravaud P, Baron G, Falissard B, Logeart I, Bellamy N, Bombardier C, Felson D, Hochberg M, van der Heijde D, et al. Evaluation of clinically relevant changes in patient reported outcomes in knee and hip osteoarthritis: the minimal clinically important improvement. Ann Rheum Dis 2005;64(1):29–33.
46. Wang YC, Hart DL, Stratford PW, Mioduski JE. Baseline dependency of minimal clinically important improvement. Phys Ther 2011;91(5):675–688.
47. Danoff JR, Goel R, Sutton R, Maltenfort MG, Austin MS. How much pain is significant? Defining the minimal clinically important difference for the visual analog scale for pain after total joint arthroplasty. J Arthroplasty 2018;33(7S):S71–S75.e2.
48. de Vet HC, Terwee CB, Ostelo RW, Beckerman H, Knol DL, Bouter LM. Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health Qual Life Outcomes 2006;4:54.
49. Elbers RG, van Wegen EE, Verhoef J, Kwakkel G. Reliability and structural validity of the Multidimensional Fatigue Inventory (MFI) in patients with idiopathic Parkinson’s disease. Parkinsonism Relat Disord 2012;18(5):532–536.
50. McHorney CA, Tarlov AR. Individual-patient monitoring in clinical practice: are available health status surveys adequate? Qual Life Res 1995;4(4):293–307.
51. Terwee CB, Bot SD, de Boer MR, van der Windt DA, Knol DL, Dekker J, Bouter LM, de Vet HC. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol 2007;60(1):34–42.
52. Feeny DH, Eckstrom E, Whitlock EP, Perdue LA. A Primer for Systematic Reviewers on the Measurement of Functional Status and Health-Related Quality of Life in Older Adults. (Prepared by
2003;83(9):776–785. the Kaiser Permanente Research Affiliates Evidence-based Prac-
39. Redelmeier DA, Guyatt GH, Goldstein RS. Assessing the tice Center under Contract No. 290-2007-10057-I.) AHRQ
minimal important difference in symptoms: a comparison of Publication No. 13-EHC128-EF. Rockville, MD: Agency for
two techniques. J Clin Epidemiol 1996;49(11):1215–1219. Healthcare Research and Quality. September 2013. Available
40. Wright JG. The minimal important difference: who’s to say at https://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0076940/.
what is important? J Clin Epidemiol 1996;49(11):1221–1222. Accessed July 5, 2018.
6113_Ch11_141-158 02/12/19 4:49 pm Page 141
CHAPTER 11
Designing Surveys and Questionnaires
One of the most popular methods for collecting information is the survey approach. A survey consists of a set of questions that elicits quantitative or qualitative responses. Surveys can be conducted for a variety of purposes using different formats and delivery methods, including questionnaires in hard copy or online, and face-to-face or telephone interviews.

Surveys can be used as a data gathering technique in descriptive, exploratory, or explanatory studies. Data may be intended for generalization to a larger population or as a description of a smaller defined group. The purpose of this chapter is to present an overview of the structure of survey instruments, with a focus on essential elements of survey design, construction, question writing, and delivery methods.

■ Purposes of Surveys

Surveys can be used for descriptive purposes or to generate data to test specific hypotheses.

Descriptive Surveys

Descriptive surveys are intended to characterize knowledge, behaviors, patterns, attitudes, or demographics of individuals within a given population. Such surveys are often used to inform marketing decisions, public policy, or quality improvement. For instance, many healthcare institutions use surveys to determine patient satisfaction1 and to describe patient demographics.2

Data can be accessed for many federal and state surveys.3 For example, large national surveys such as the U.S. Census, the National Health Interview Survey,4 and the National Nursing Home Survey5 provide important population data that can be used for many varied purposes, including documenting the incidence of disease, estimating research needs, describing the quality of life of different groups, and generating policy recommendations.

Testing Hypotheses

Surveys may also involve testing hypotheses about the nature of relationships within a population. Data can be used to establish outcomes following intervention, explore relationships among clinical variables, or examine risk factors for disease or disability. Surveys can be used for data collection in experimental studies to compare outcomes across groups when studying variables such as attitudes and perceptions.

For example, questionnaires have been used to compare quality of life following bilateral or unilateral cochlear implants,6 to establish the relationship between medical adherence and health literacy,7 and to study the influence of life-style preferences and comorbidities on type 2 diabetes.8

Standardized questionnaires are also used extensively as self-report instruments for assessing outcomes related to broad clinical constructs such as function, pain, health status, and quality of life. These instruments are designed to generate scores that can be used to evaluate patient improvement and to support clinical decision making, requiring extensive validation procedures. We will discuss these types of scales in Chapter 12, but the concepts presented here related to designing surveys will apply to those instruments as well.
■ Survey Formats

The two main formats for surveys are questionnaires and interviews. The appropriate method will depend on the nature of the research question and the study variables.

Questionnaires

Most of us are familiar with questionnaires, a type of structured survey that is usually self-administered, asking individuals to respond to a series of questions. The advantages of using questionnaires are many. They are generally efficient, allowing respondents to complete them on their own time. Data can be gathered from a large sample in a wide geographical distribution in a relatively short period of time at minimal expense. Questionnaires are standardized so that everyone is exposed to the same questions in the same way. Respondents can take time to think about their answers and to consult records for specific information when needed. Questionnaires also provide anonymity, encouraging honest and candid responses.

Questionnaires are particularly useful as a research method for examining phenomena that can be best assessed through self-observation, such as attitudes, values, and perceptions. The primary disadvantages of the written questionnaire are the potential for misunderstanding or misinterpreting questions or response choices, and unknown accuracy or motivation of the respondent.

Surveys can be used for retrospective or prospective studies. Most often, surveys are based on a cross-sectional sample, meaning that a large group of respondents are tested at relatively the same point in time. Surveys can, however, be used in longitudinal studies by giving follow-up interviews or questionnaires to document changes in attitudes or behaviors over time.

Questionnaires are usually distributed through the mail or by electronic means, although many research situations do allow for in-person distribution.

Electronic distribution of questionnaires is less expensive, faster to send and return, and generally easier to complete and submit than written mail questionnaires. Many free electronic platforms are available online, and many provide automatic downloading of data, eliminating data entry and associated time or errors. Electronic surveys do require, however, that respondents have the appropriate technological skill and access to the

Interviews

In an interview, the researcher asks respondents specific questions and records their answers for later analysis. Interviews can take a few minutes or several hours, depending on the nature of the questions and the respondent's willingness to share information. Interviews can be conducted face-to-face or over the telephone.

Face-to-Face Interviews

Interviews conducted in person tend to be most effective for establishing rapport between the interviewer and the respondent. This interaction can be important for eliciting forthright responses to questions that are of a personal nature.

The advantage of this approach is the opportunity for in-depth analysis of respondents' behaviors and opinions because the researcher can follow up responses with probes to clarify answers and directly observe respondents' reactions. Therefore, interviews are a common approach to obtain qualitative information. To avoid errors of interpretation of replies, many researchers will record interview sessions (with the participant's consent).

The major disadvantages of face-to-face interviews include cost and time, need for trained personnel to carry out the interviews, scheduling, and the lack of anonymity of the respondents.

Structure of Interviews

Interviews can be structured, consisting of a standardized set of questions that will be asked of all participants. In this way, all respondents are exposed to the same questions, in the same order, and are given the same choices for responses.

Qualitative studies often use interviews with a less strict format. In a semi-structured interview, the interviewer follows a list of questions and topics that need to be covered but is able to follow lines of discussion that may stray from the guide when appropriate. In an unstructured interview, the interviewer does not have a fixed agenda, and can proceed informally to question and discuss issues of concern. This format is typically conversational and is often carried out in the respondent's natural setting. Data are analyzed based on themes of narrative data.

If a study involves the use of interviewers, a formal training process should be incorporated. Interviewers must be consistent in how questions are asked and how probing follow-up questions are used (if they are to be allowed). The interviewers must understand the process
of recording responses, an important skill when open-ended questions are used.

Further discussion of interviewing techniques for qualitative research is included in Chapter 21.

Telephone Interviews

Telephone interviews have been shown to be an effective alternative to in-person discussions and have the advantage of being less expensive, especially if respondents are geographically diverse. Disadvantages include not being able to establish the same level of rapport as personal interviews without a visual connection, more limited time as people become fatigued when on the phone too long, and the "annoyance factor" that we have all experienced when contacted over the phone (typically at dinner time!).10,11 Prior contact to set up an appointment can help with this. There are also issues with availability depending on what kinds of phone listings are available.

An important use of telephone interviews is gathering follow-up data in research studies.

Data were analyzed from a 17-center, prospective cohort study of infants (age <1 year) who were hospitalized with bronchiolitis during three consecutive fall/winter seasons.12 Parents were called at 6-month intervals after discharge to assess respiratory problems. The primary outcome was status at 12 months. They were able to follow up with 87% of the families and were able to document differences in response rate among socioeconomic and ethnic subgroups.

Survey researchers often utilize more than one mode of data collection to improve response rates.13 Using a mixed methods approach, offering written or online questionnaires and interviews or telephone surveys, researchers attempt to appeal to respondent preferences, reduce costs, and obtain greater coverage. Data are typically a combination of qualitative and quantitative information.

Self-Report

Survey data that are collected using either an interview or questionnaire are based on self-report. The researcher does not directly observe the respondent's behavior or attitudes, but only records the respondent's report of them. There is always some potential for bias or inaccuracy in self-reports, particularly if the questions concern personal or controversial issues. The phenomenon of recall bias can be a problem when respondents are asked to remember past events, especially if these events were of a sensitive nature or if they occurred a long time ago.

Research has shown, however, that self-report measures are generally valid. For instance, variables such as mobility have been reported accurately.14,15 However, some studies have shown a poorer correlation between performance, healthcare utilization, and self-report measures.16,17 These differences point out the need to understand the target population and the respondents' abilities to answer the questions posed.

For many variables, self-report is the only direct way to obtain information. Data obtained from standardized questionnaires that evaluate health and quality of life must be interpreted as the respondent's perception of those characteristics, rather than performance measures.

Although surveys can be administered through several methods, we will focus primarily on written questionnaires for the remainder of this chapter. Most of the principles in designing questionnaires can be applied to interviews as well.

■ Planning the Survey

The process of developing a survey instrument is perhaps more time consuming than people realize. Most problems that arise with survey research go back to the original design process, requiring well-defined goals and a full understanding of how the information gathered will be used.

Creating a survey involves several stages, within which the instrument is planned, written, and revised, until it is finally ready for use as a research tool (see Fig. 11-1).

Always Start with a Question

The first consideration in every research effort is delineation of the overall research question. The purpose of the research must be identified with reference to a target population and the specific type of information that will be sought. These decisions will form the structure for deciding the appropriate research design. A survey is appropriate when the question requires obtaining information from respondents, rather than measuring performance.

➤ CASE IN POINT
For many older adults, access to healthcare is a challenge. Healthcare systems have begun to explore the efficacy of telehealth technologies for the delivery of education, monitoring of health status, and supplementing personal visits. However, before such efforts can be implemented effectively, attitudes toward such technology must be evaluated. Edwards et al18 used a questionnaire to study acceptance of a telehealth approach for patients with chronic disease to determine which factors could influence successful outcomes.
casual interest or curiosity, but because they reflect essential pieces of information that, taken as a whole, will address the proposed research question.

Developing Content

The researcher begins by writing a series of questions that address each behavior, knowledge, skill, or attitude reflected in the guiding questions.

➤ The telehealth study was organized around six categories of questions:
1. Sociodemographics
2. Current access to healthcare
3. Technology-related factors
   • Availability of technology
   • Confidence in using technology
   • Perceived advantages or disadvantages of technology
4. Past telehealth satisfaction
5. Interest in using telehealth
6. Current health status

Questions should be grouped and organized to reflect each category or topic. The first draft of a questionnaire should include several questions for each topic so that these can eventually be compared and weeded out.

Using Existing Instruments

Depending on the nature of the research question, it is always helpful to review existing instruments to determine if they are applicable or adaptable for your study. Many investigators have developed and validated instruments for a variety of purposes, and these instruments can often be borrowed directly or in part, or they can be modified to fit new research situations, saving a great deal of time.19 Tools like the Beck Depression Inventory,20 the Mini-Mental State Examination (MMSE),21 and the Short Form (SF)-36 or SF-12 health status questionnaires22 are often applied as part of studies.

the review of literature. But whenever possible, don't reinvent the wheel!

Expert Review

The preliminary draft of the survey should be distributed to a panel of colleagues who can review the document and identify problems with questions, including wording and organization. Using the guiding questions as a reference, these experts should be asked to make suggestions for constructive change. No matter how carefully the survey has been designed, the researcher is often too close to it to see its flaws.

Based on the panel's comments, the survey should be revised and then presented to the panel again for further comment. The revision process should continue with additional feedback until the researcher is satisfied that the instrument is concise and clear, as well as serves its intended purpose. This process is indeed time consuming but necessary, and helps to establish the content validity of the instrument.

Pilot Testing and Revisions

The revised questionnaire should then be pilot tested on a small representative sample, perhaps 5 to 10 individuals from the target population. The researcher should interview these respondents to determine where questions were unclear or misleading. If the researcher is unsure about the appropriateness of specific wording, several versions of a question can be asked to elicit the same information in different ways, and responses can be compared for their reliability.

It is also useful to monitor the time it takes for respondents to complete the questionnaire and to examine missing answers and inconsistencies. It may be helpful to administer the survey to this group on two occasions, perhaps separated by several days, to see if the responses are consistent, as a way of estimating test–retest reliability. Based on the results of pilot testing, the questionnaire will again be revised and retested until the final instrument attains an acceptable level of validity.
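The test–retest check described for pilot testing can be quantified with simple summary statistics. The sketch below is illustrative only: the respondent scores are invented, and percent exact agreement and a Pearson correlation are just two of several reliability indices a researcher might report.

```python
# Hypothetical pilot data: the same 8 respondents answered a 5-point
# item on two occasions separated by several days (values invented).
time1 = [4, 3, 5, 2, 4, 1, 3, 5]
time2 = [4, 3, 4, 2, 5, 1, 3, 5]

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Percent exact agreement: proportion of respondents who gave the
# identical answer on both occasions.
agreement = sum(a == b for a, b in zip(time1, time2)) / len(time1)

print(f"test-retest r = {pearson_r(time1, time2):.2f}")
print(f"exact agreement = {agreement:.0%}")
```

A low correlation or low agreement for an item during piloting is a signal that its wording may be interpreted inconsistently and should be revised.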
respondents' feelings and opinions without biases or limits imposed by the researcher. For example:

How do you use technology, such as phones, computers, and social media?

This would require the respondent to provide specific examples of using devices or the internet. This format is useful when the researcher is not sure of all possible responses to a question. Therefore, respondents are given the opportunity to provide answers from their own perspective. Sometimes researchers will use open-ended questions in a pilot study to determine a range of responses that can then be converted to a multiple-choice item.

Open-ended questions are, however, difficult to code and analyze because so many different responses can be obtained. If open-ended questions are misunderstood, they may elicit answers that are essentially irrelevant to the researcher's goal. Respondents may not want to take the time to write a full answer, or they may answer in a way that is clear to them but uninterpretable, vague, or incomplete to the researcher. For instance, the question about technology use could elicit a long list of specific devices and apps, or it could result in a general description of technology use. If the purpose of the question is to find out if the respondent uses specific types of technology, it would be better to provide a list of options.

Some open-ended questions request a numerical response, such as exact age, annual income, height, or weight. The question must specify exactly how it should be answered, such as to the nearest year, gross annual income, or to the nearest inch or pound.

Open-ended questions can be most effective in interviews because the interviewer can clarify the respondent's answers using follow-up questions.

Closed-Ended Questions

Closed-ended questions ask respondents to select an answer from among several fixed choices. There are several formats that can be used for closed-ended questions, depending on the nature of the information being requested.

Questions used as examples in this section are taken or adapted from the telehealth study by Edwards et al.18 See the Chapter 11 Supplement for the full survey.

Multiple Choice Questions

One of the most common formats is the use of multiple choice options. This type of question is easily coded and provides greater uniformity across responses. Its disadvantage is that it does not allow respondents to express their own personal viewpoints and, therefore, may provide a biased response set.

There are two basic considerations in constructing multiple choice questions. First, the responses should be exhaustive, including all possible responses that can be expected. Second, the response categories should be mutually exclusive so that each choice clearly represents a unique answer with no overlap. In Question 1, for instance, it is possible that respondents could be in school and work at the same time, or that taking care of a household could be interpreted as unemployed. As a protection, it is often advisable to include a category for "not applicable" (NA), "don't know," or "other (please specify ____)."

Question 1
Which one of these best describes your current situation? Please check only one.
   Full-time paid work (≥30 hrs/wk)
   Part-time paid work (<30 hrs/wk)
   Full-time school
   Unemployed
   Unable to work due to illness or disability
   Fully retired
   Taking care of household
   Doing something else (please describe) ___________________

Where only one response is desired, it may be useful to add an instruction to the question asking respondents to select the one best answer or the answer that is most important. However, this technique does not substitute for carefully worded questions and choices.

Dichotomous Questions

The simplest form of closed-ended question is one that presents two choices, or dichotomous responses, as shown in Question 2a. Even this seemingly simple question, however, can be ambiguous, as definitions of gender identity are no longer obvious using this dichotomous description. To be sensitive to social issues, Question 2b is an alternative.23

Question 2a
What is your gender?
   Male   Female

Question 2b
What is your gender identity?
   Man
   Woman
   Another identity, please specify_____
   I would prefer not to answer
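Responses to closed-ended items like Questions 2a and 2b are usually converted to numeric codes before analysis. The sketch below shows one hypothetical coding scheme: the numeric values, the shortened labels, and the decision to treat a refusal as missing are the analyst's choices for illustration, not part of the survey itself.

```python
# Illustrative codebook for a question like Question 2b. Codes are
# arbitrary; "Prefer not to answer" is mapped to None (missing).
GENDER_CODES = {
    "Man": 1,
    "Woman": 2,
    "Another identity": 3,
    "Prefer not to answer": None,
}

def code_response(answer, codebook):
    """Return the numeric code for a response; reject anything outside
    the codebook, since the categories are meant to be exhaustive."""
    if answer not in codebook:
        raise ValueError(f"Response not in codebook: {answer!r}")
    return codebook[answer]

responses = ["Woman", "Man", "Prefer not to answer", "Woman"]
coded = [code_response(r, GENDER_CODES) for r in responses]
valid = [c for c in coded if c is not None]

print(coded)                      # None marks a refusal
print(len(valid), "usable responses")
```

Because the codebook is checked on every response, a data-entry value outside the planned categories surfaces immediately instead of silently corrupting the dataset.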
Check All That Apply

Sometimes the researcher is interested in more than one answer to a question, as in Question 3a. However, coding multiple responses in a single question can be difficult. When multiple responses are of interest, it is better to ask respondents to mark each choice separately, as in Question 3b.

Question 3a
Do you have any of the following easily available for your use at home, at work, or at the home of friends or family? Please check all that apply.
   A telephone (land line)
   A mobile phone
   Internet access
   Personal email

Question 3b
Please indicate if you have each of the following easily available for your use at home, at work, or at the home of friends or family.
(Response options for each item: Not Available / Available)
1. A telephone
2. A mobile phone
3. Internet access
4. Personal email

Measuring Intensity

When questions address a characteristic that is on a continuum, such as attitudes or quality of performance, it is useful to provide a range of choices so that the respondent can represent the appropriate intensity of response. Usually, three to five multiple-choice options are provided, as shown in Question 4. An option for "Don't know" or "Unsure" may be included.

Checklists

When a series of questions use the same format, a checklist can provide a more efficient presentation, as illustrated in Question 4. With this approach, instructions for using the response choices need only be given once, and the respondent can quickly go through many questions without reading a new set of choices. Question 5 shows a checklist using a Likert scale (see Chapter 12).

Measurement Scales

Questions may be formatted as a scale to create a summative outcome related to a construct such as function or quality of life. Instruments such as the Barthel Index24 and the Functional Independence Measure (FIM)25 are examples of health questionnaires that use total scores to summarize functional levels.

See Chapter 12 for a robust discussion of measurement scales.

Visual Analog Scales

A commonly used approach incorporates visual analog scales (VAS). This format is usually based on a 10-cm line with anchors on each side that reflect extremes of the variable being measured. Subjects are asked to place a mark along the scale that reflects the intensity of their response. Although mostly used to assess pain, the VAS can be used for any variable that has a relative intensity (see Chapter 12).
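Scoring a VAS typically means measuring the distance of the respondent's mark from the left anchor. A minimal sketch, assuming the conventional 10-cm line and a 0 to 100 rescaling (both common conventions rather than requirements):

```python
# A VAS response is recorded as the distance (in cm) of the mark from
# the left anchor of a 10-cm line; the measurement below is invented.
LINE_LENGTH_CM = 10.0

def vas_score(mark_cm, line_cm=LINE_LENGTH_CM):
    """Convert a mark position on the line to a 0-100 score."""
    if not 0 <= mark_cm <= line_cm:
        raise ValueError("mark must lie on the line")
    return round(100 * mark_cm / line_cm, 1)

print(vas_score(6.3))  # a mark 6.3 cm from the left anchor
```

Keeping the raw distance and deriving the score in code, rather than eyeballing a converted value, avoids transcription error and makes the scaling convention explicit.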
Question 4
Sometimes people find it hard to get the health support and advice they would like. Have you had any difficulty with the following:
(Response options for each item: No Difficulty / Some Difficulty / Lots of Difficulty)
a. Getting to appointments outside of your home due to your physical health?
b. Getting to appointments outside of your home due to psychologic or emotional difficulties?
c. Getting to appointments outside of your home due to difficulties with transportation or travel?
d. Cost of transportation and travel to get to appointments?

Question 5
How confident do you feel about doing the following:
(Response options for each item: Not at all confident / Quite confident / Extremely confident / I have never tried this / I don't know what this is)
1. Using a mobile phone for phone calls
2. Using a mobile phone for text messages
3. Using a computer
4. Sending and receiving emails
5. Finding information on the internet
6. Using social media sites like Facebook
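A checklist like Question 5 can be scored by coding each column and averaging across items. The sketch below is one possible scheme, not the authors' scoring rule: the 1 to 3 codes and the choice to treat "never tried" and "don't know" as missing are illustrative assumptions.

```python
# Illustrative codes for the Question 5 response columns. The two
# non-substantive options are mapped to None and ignored in the mean.
CODES = {
    "Not at all confident": 1,
    "Quite confident": 2,
    "Extremely confident": 3,
    "I have never tried this": None,
    "I don't know what this is": None,
}

def mean_confidence(answers):
    """Average the substantive codes across items, skipping missing ones."""
    scores = [CODES[a] for a in answers if CODES[a] is not None]
    return sum(scores) / len(scores) if scores else None

one_respondent = [
    "Extremely confident",       # mobile phone calls
    "Quite confident",           # text messages
    "Not at all confident",      # computer
    "I have never tried this",   # email
    "Quite confident",           # internet
    "I don't know what this is", # social media
]
print(mean_confidence(one_respondent))
```

How missing-like options are handled should be decided before data collection, since dropping them changes what the summary score means for respondents who skip many items.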
another language for specific sample groups. The translation must account for cultural biases.

Avoid Bias

Questions should not imply a "right" or "preferred" response. Similarly, response options should also be neutral. Words that suggest a bias should also be avoided. For example:

Do you often go to your physician with trivial problems?

This implies an undesirable behavior.

Clarity

It goes without saying that survey questions must be clear and unambiguous. Questions that require subtle distinctions to interpret responses are more likely to be misunderstood. For example, consider the question:

Do you use technology devices in your daily life?

There are two ambiguous terms here. First, there may be different ways to define technology devices. Do they include phones, computers, tablets, watches? Second, what constitutes use? Does it mean a simple phone call, accessing the internet, just telling time? This type of ambiguity can be corrected by providing the respondent with appropriate definitions.

Clarity also requires avoiding the use of jargon, abbreviations, or unnecessarily complex sentences. Generally, wording should be at the 8th grade reading level. Don't use words that can be ambiguous, like "most people" or "several." Phrase questions in the positive direction when using disagree–agree options. Avoid using "not," "rarely," "never," and negative prefixes.

Double-Barreled Questions

Each question should be confined to a single idea. Surveys should avoid the use of double-barreled questions, using "or" or "and" to assess two things within a single question. For instance:

"Do you have difficulty using mobile phones or computers to access information?"

It is obviously possible to experience these technologies differently, making it impossible to answer the question. It is better to ask two separate questions to assess each activity separately.

Frequency and Time Measures

Researchers are often interested in quantifying behavior in terms of frequency and time. For example, it might be of interest to ask:

How often do you use a mobile phone each day?

These types of questions may be difficult to answer, however, because the frequency of the behavior may vary greatly from day to day, month to month, or even season to season.

The researcher should determine exactly which aspect of the behavior is most relevant to the study and provide an appropriate time frame for interpreting the question (Question 7). For instance, the question could ask about a particular period, such as use of a phone within the last week, or the respondent can be asked to calculate an average daily time. This assumes, of course, that this time period is adequately representative for purposes of the study.

Question 7
Do you do any of the following at least once a day? (Yes / No)
   Phone call on mobile phone
   Access email
   Access the internet

Alternatively, the question could ask for an estimate of "typical" or "usual" behaviors. This approach makes an assumption about the respondent's ability to form such an estimate. Some behaviors are much more erratic than others. For example, it may be relatively easy to estimate the number of times someone uses a computer each day, but harder to estimate typical phone use because that can vary so much.

Questions related to time should also be specific. For instance, Question 8a may be difficult to answer if it is not a consistent problem. It is better to provide a time frame for reference, as in Question 8b. Many function, pain, and health status questionnaires specify time periods such as within the last month, last week, or last 24 hours. Avoid ambiguous terms like "recently" or "often."

Question 8a
Have you had difficulty obtaining transportation to attend visits to your healthcare provider?

Question 8b
Have you had difficulty obtaining transportation to attend visits to your healthcare provider within the past month?

Sensitive Questions

Questionnaires often deal with sensitive or personal issues that can cause some discomfort on the part of the respondent. Although some people are only too willing to express personal views, others are hesitant, even when they know their responses are anonymous. Some questions may address social issues or behaviors that have
negative associations, such as smoking, sexual practices, or drug use. Question 2b is an example where social issues should be addressed thoughtfully.

Questions may inquire about behaviors that respondents are not anxious to admit to, such as ignorance of facts they feel they should know, or compliance with medications or exercise programs. It may be useful to preface such questions with a statement to put them at ease so they feel okay if they fit into that category, as in Question 9.

Question 9
Many people have different levels of familiarity with newer technology and electronic communication. Which of these do you feel comfortable using?
   Mobile phone
   Computer
   Social media (like Facebook)

Sensitive questions may also be subject to recall bias. For example, respondents may be selective in their memory of risk factors for disease or disability. Respondents should be reminded in the introduction to the survey that they may refuse to answer any questions.

Personal information can also be sensitive. For example, respondents may not want to share income information, even when this is important to the research question. Asking an open-ended question for annual salary may not feel comfortable, but putting responses in ranges can make this response more palatable.

■ Formatting the Survey

The format of a survey is an essential consideration, as it can easily impact a respondent's desire to fill it out. The layout should be visually appealing and language should be easy to read. The survey should begin with a short introduction that provides general instructions

interest, or at least be "neutral." Sensitive questions should come later. Leaving a space for comments at the end of a survey or after specific questions can allow respondents to provide clarification of their answers.

Length of the Survey

The length of the survey should be considered. More often than not, the initial versions will be too long. Long questionnaires are less likely to maintain the respondent's attention and motivation, resulting in potentially invalid or unreliable responses or, more significantly, nonresponses. The importance of each item for the interpretation of the study should be examined, and only those questions that make direct and meaningful contributions should be retained. Best advice: keep it as short as possible.

Demographics

Typically, researchers include questions about important demographic information in a survey.

➤ The telehealth survey started with questions on gender, ethnic group, and age. These were followed by questions regarding current work status, level of education, and home status (own, rent).

Depending on the nature of the research question, other factors may be relevant, such as employment setting, length of employment, marital status, and income.

Demographic information is needed to describe the characteristics of the respondents, to compare the characteristics of the sample with those of the population to which the results will be generalized, and to interpret how personal characteristics are related to the participant's responses on other questions. Some researchers put demographic questions at the beginning, but many prefer to keep these less interesting questions for the end.
for answering questions. Researchers will often give ■ Selecting a Sample
the survey a title that provides an overview of its
purpose. Surveys are used to describe characteristics of a
population, the entire group to which the results will
Question Order be generalized. Most often, a smaller sample of the
Content should flow so that the respondent’s thought population will be asked to complete the survey as a
processes will follow a logical sequence, with smooth representative group. Therefore, it is vitally important
transitions from topic to topic. This allows the respon- that the sample adequately represents the qualities of
dent to continue with relevant thoughts as the survey that population.
progresses and requires that questions be organized There are many strategies for choosing samples, with
by topic. some important considerations for survey research. A
Questions should proceed from the general to the spe- probability sample is one that is chosen at random from
cific. The initial questions should pique the respondent’s a population, where every individual has the same chance
6113_Ch11_141-158 02/12/19 4:49 pm Page 151
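The notion of a probability sample (drawing at random so that every individual has the same chance of selection) can be sketched in a few lines of Python; the sampling frame and sample size here are invented purely for illustration:

```python
import random

# Hypothetical sampling frame: ID numbers for a population of 5,000 patients.
population = list(range(1, 5001))

random.seed(42)  # fixed seed only so the example is reproducible

# Simple random sample: every individual has the same chance of selection,
# and no one can be drawn twice (sampling without replacement).
sample = random.sample(population, k=350)

print(len(sample))  # 350
```

In practice, the sampling frame would come from the study's own records (e.g., a patient registry), and the sample size would be determined as discussed in the next section.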
Response rates can also be affected by the length of the survey, how long it takes to complete, and its complexity. With Web-based surveys, technical difficulties can interfere with return, as can the computer literacy of the respondent. Respondents may also not be comfortable sharing certain types of information online for safety reasons.

Incentives
Many researchers will use a form of incentive to inspire respondents to return a survey.32 These are often small monetary tokens, such as gift cards. This does require a budget for the survey and can be prohibitive with a large sample. The appropriate amount depends on the audience.
Many include this token with the initial request and mention it in a cover letter. Others will indicate that it will be available to all those who respond, or that respondents will be entered into a lottery, but these post-awards tend to be less effective. Estimates vary as to whether monetary incentives actually increase participation. Some data suggest that monetary incentives may produce biased differential responses by being more attractive to those with fewer economic resources.33

Sample Size
Because response rates for surveys are generally low, sample size becomes an important consideration to adequately represent the population. Survey sample size determinations are based on three projections regarding the sample and expected responses:
• The size of the population from which respondents will be recruited
• The acceptable margin of error
• The confidence level desired, typically 95%

You are probably familiar with the concept of margin of error from political polls, which always report this range; it is especially important when poll numbers are close. Most often they use a margin of ±5% or ±3%. The confidence level is based on the statistical concept of probability, which will be covered fully in Chapter 23. The most commonly used confidence level is 95%.

Determining Sample Size
Generally, as a population gets larger, a smaller proportion needs to be sampled to obtain a representative sample. Table 11-1 shows some estimates based on ±5% and ±3% margins of error. Most importantly, these sample sizes represent the actual number of completed surveys that are needed, not the number of respondents originally invited to participate.

Table 11-1  Survey Sample Size Estimates*

                    MARGIN OF ERROR
POPULATION SIZE     ±5%     ±3%
100                 80      92
500                 218     341
1,000               278     517
1,500               306     624
5,000               357     880
10,000              370     965

*Based on 95% confidence level. Data obtained using sample size calculator from Survey Monkey.34

Once the population exceeds 5,000, sample sizes of around 400 or 1,000 are generally sufficient for a ±5% or ±3% margin of error, respectively.10,26 While larger samples are generally better, they do not necessarily make a sample more representative. Although these numbers probably appear quite large, the ease of contacting people may vary with the population and their availability.

You can figure out your own sample size estimates and margin of error preferences using convenient online calculators.34,35 Actual recruitment needs are based on projected return rates. With an expected response rate of 30%, for instance, with a population of 5,000 and a ±5% margin of error, it would be necessary to send out 1,200 questionnaires to expect a return of 360 (divide 360 by 30%). With smaller populations, it may be necessary to contact the entire population to get a reasonable return.

Other considerations in the determination of sample size will be addressed in Chapter 23.

■ Contacting Respondents

Recruitment is an important part of survey research. Because response rates are of concern, multiple contacts should be planned.26 If possible, these communications should be personalized. The researcher needs to get buy-in to encourage respondents to take the time to fill out and return a survey.
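As a companion to the sample-size discussion above: the figures in Table 11-1 and the 30% return-rate arithmetic can be reproduced with the standard finite-population sample size formula, assuming a 95% confidence level (z = 1.96) and maximal variability (p = 0.5), which are the assumptions online calculators typically use. This is a sketch with illustrative function names, not the calculator cited in the text:

```python
import math

def required_sample(population: int, margin: float,
                    z: float = 1.96, p: float = 0.5) -> int:
    """Completed surveys needed for a given margin of error at 95% confidence."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population estimate
    n = n0 / (1 + (n0 - 1) / population)        # finite-population correction
    return math.ceil(n)

def surveys_to_send(returns_needed: int, response_rate: float) -> int:
    """Questionnaires to distribute, given an expected response rate."""
    return math.ceil(returns_needed / response_rate)

print(required_sample(1000, 0.05))  # 278, matching Table 11-1
print(required_sample(5000, 0.03))  # 880, matching Table 11-1
print(surveys_to_send(360, 0.30))   # 1200 questionnaires for 360 returns
```

Note that the result is the number of completed surveys needed; `surveys_to_send` then scales that up by the projected return rate, as in the example above.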
In the initial communication, respondents should be told the purpose of the survey, its importance, and why they are being asked to participate. It should identify the sponsor of the research. This will help to legitimize the survey.
For interviews, respondents will typically be within a local geographic area and may be recruited from agencies or clinics. Before an interview is administered, the potential respondents should be contacted to elicit their cooperation. For telephone interviews, it is appropriate to send advance notice in the mail that the phone call will be coming, as a means of introduction and as a way of establishing the legitimacy of the call.

The Cover Letter
The questionnaire should be sent to respondents, whether by mail or electronic means, with a cover letter explaining why their response is important. Because a questionnaire can easily be tossed away, the cover letter becomes vitally important to encourage a return. The letter should include the following elements (illustrated in Figure 11-2):

1. Start with the purpose of the study. Be concise and on message. Provide brief background that will emphasize the importance of the study. This should be
1 I am part of a research group that is trying to find the best ways to assure access to health care for older adults. Two weeks
ago you received a letter from your primary care provider, introducing our research team, and letting you know that we would
be contacting you regarding an important survey on the use of electronic devices and technology by older adults.
I am writing now to ask for your help in this study. The purpose of the study is to explore whether technology can be used to
contribute to your health care, and whether individuals like yourself would be comfortable with such use. Only by understanding
how patients feel about different ways of communicating with providers can we learn how to be most effective in providing care.
2 We would appreciate your completing the enclosed questionnaire. Your name was selected at random from a list of patients
from your health center because you gave your permission to be contacted for research purposes. Patients in this study are
over 65 years old and have been diagnosed with either depression or risk for cardiovascular disease.
3 The questionnaire is completely confidential and your name will not be attached to any responses. All data will be presented
in summary form only. Your health care will not be affected, whether or not you choose to participate. Your returning the
questionnaire will indicate your consent to use your data for this research. This study has been approved by the Institutional
Review Board at State University.
4 The questionnaire should take approximately 15 minutes to complete. We are interested in your honest opinions. If you
prefer not to answer a question, please leave it blank. A stamped envelope is included for your convenience in returning the
questionnaire. We would appreciate your returning the completed form within the next two weeks, by June 24, 2019.
5 If you have any questions or comments about this study, please feel free to contact me by mail at the address on the
letterhead, or by phone or email at the bottom of this letter.
6,7 Thank you in advance for your cooperation with this important effort. Although there is no direct compensation for your
participation, your answers will make a significant contribution to our understanding of contemporary issues that affect all of
us involved in health care.
8 If you would like a summary of our findings, please fill in your name and address on the enclosed request form, and we will
be happy to forward our findings to you when the study is completed.
9 Sincerely,
Figure 11–2 Sample cover letter. Numbers in the left margin match the list of elements in the text.
done in a way that appeals to the participant and presents no bias in its impact. If the research is sponsored by an agency or is a student project, this information should be included.
2. Indicate why the respondent has been chosen for the survey.
3. Assure the respondents that answers will be confidential and, if appropriate, that they will remain anonymous. Encourage them to be honest in their answers and assure them that they can refuse to answer any questions that make them uncomfortable.
4. Some electronic survey platforms allow a respondent to enter a personal identification number (PIN) so that their results can be tied to other information or a follow-up survey. If this is an important part of the study, it should be explained in the cover letter as a way of maintaining anonymity.
5. Give instructions for completing the survey, including how long it should take to fill out. Provide a deadline date with instructions for sending back the survey. It is reasonable to give 2 to 3 weeks for a response. A shorter time is an imposition, and a longer one may result in the questionnaire being put aside. Paper surveys should always include a stamped self-addressed envelope. Electronic surveys, of course, are automatically "returned" once submitted online.
6. Provide contact information in case respondents have any questions.
7. Thank respondents for their cooperation.
8. If you are providing incentives, indicate what they are and how they will be offered.
9. Offer an opportunity for respondents to receive a summary of the report.
10. Sign the letter, including your name, degrees, and affiliation. If there are several investigators, it is appropriate to include all signatures. Use real signatures rather than a fancy font.

The cover letter should be concise. If the research is being conducted as part of a professional activity, the organizational letterhead should be used. If possible, the letter should be personally addressed to each individual respondent. This is easily accomplished through mail merge programs.

Follow-Up Communication
Although the majority of surveys will be returned within the first 2 weeks, a reasonable improvement can usually be obtained through follow-up.
• Follow-up: A week or two after sending the questionnaire, a postcard or email can be sent out to thank respondents for their response or, if they have not yet returned the survey, to ask them to respond soon. Some people may have just forgotten about it.
• Replacement: A replacement questionnaire or the link can be resent to nonrespondents 2 to 4 weeks later, encouraging them to respond and emphasizing the importance of their response to the success of your study.
• Final contact: A final contact may be made by phone or email if needed to bolster the return rate.

■ Analysis of Questionnaire Data

The first step in the analysis of data is to collate responses and enter them into a computer. Each item on the survey is a data point and must be given a variable name, often an item number.
The researcher must sort through each questionnaire as it is returned, or through all responses from an interview, to determine if responses are valid. In some instances, a respondent will have incorrectly filled out the survey, and that respondent may have to be eliminated from the analysis. The researcher must keep track of all unusable questionnaires to report this percentage in the final report. Some questions may have to be eliminated from individual questionnaires because they were answered incorrectly, such as putting two answers in for a question that asked for a single response. There may be instances where a question is not answered correctly by most respondents, and the question may be eliminated from the study.

Coding
Analyzing survey data requires summarizing responses. This in turn requires a form of coding, which allows counting how many respondents answered a certain way.

Multiple Choice Responses
Responses to closed-ended questions are given numeric codes that provide labels for data entry. For instance, sex can be coded 0 = male, 1 = female. We could code hospital size as 1 = less than 50 beds, 2 = 50 to 100 beds, and 3 = over 100 beds. These codes are entered into the computer to identify responses. Using codes, the researcher can easily obtain frequency counts and percentages for each question to determine how many participants checked each response.
When responses allow for more than one choice ("check all that apply"), analysis requires that each item in the list be coded as checked or unchecked. Coding
does not allow for multiple answers to a single item. For example, referring back to Question 3a, each choice option would be considered an item in the final analysis. Question 3b provides an alternative format that makes this type of question easier to code.

Codes should be determined at the time the survey is written so that there is no confusion later about how to account for each response. See the codes embedded in the telehealth survey in the Chapter 11 Supplement.

Open-Ended Responses
Coding open-ended responses is more challenging. One approach is to use qualitative analysis to describe responses (see Chapter 21). If a more concise summary is desired, the researcher must sort through responses to identify themes and assign codes to reflect those themes. This requires interpretation of the answers to determine which responses are similar. The challenge comes when individual respondents provide unique answers, which must often be coded as "Other."

Summarizing Survey Data
The analysis of survey data may take many forms. Most often, descriptive statistics are used to summarize responses (see Chapter 22). When quantitative data such as age are collected, the researcher will usually present averages. With categorical data, the researcher reports the frequency of responses to specific questions. These frequencies are typically converted to a percentage of the total sample. For example, a researcher might report that 30% of the sample was male and 70% was female, or in a question about opinions, that 31% strongly agree, 20% agree, 5% disagree, and so on. Descriptive data are usually presented in tabular or graphic form. Percentages should always be accompanied by reference to the total sample size so that the reader can determine the actual number of responses in each category. Percentages are usually more meaningful than actual frequencies because sample sizes may differ greatly among studies.

Looking for Significant Associations
Another common approach to data analysis involves the description of relationships between two or more sets of responses (see Focus on Evidence 11-1). For instance, the telehealth survey asked questions about interest in various technologies and compared them across the two chronic disease groups.
Cross-tabulations are often used to analyze the relationship between two categorical variables. For instance, we can look at the number of males and females who indicate they do or do not use particular devices. Statistical tests can then be used to examine this association to determine if there is a significant relationship between gender and device use (see Chapter 28). Other statistical analyses can also be used to compare group scores or to establish predictive relationships. Correlational and regression statistics are often used for this purpose (see Chapters 29 and 30).

Missing Data
A common problem in surveys is the handling of missing data, when respondents leave questions blank. It may be difficult to know why they have not answered. It could be because they didn't understand the question, their preferred choice was not offered, they found the question offensive, they didn't know the answer, or they just missed it. The reason for leaving a question blank, however, can be important and can create bias in the results. Providing a response choice of "Don't know" or "I prefer not to answer" may allow some reasons to be clarified. For interviews, a code can be used to indicate a participant's refusal to answer a question.
When a response is not valid, the researcher will also record the response as "missing," but can code it to reflect the fact that it was an invalid answer. Chapter 15 includes discussion of strategies for handling missing data.

■ Ethics of Survey Administration

Like all forms of research, studies that involve interviews or questionnaires must go through a formal process of review and approval by an Institutional Review Board (IRB). Even though there may not be any direct physical risk, researchers must be able to demonstrate the protection of subjects from psychological or emotional risk or invasion of privacy. Researchers should also be vigilant about securing the confidentiality of data, even if surveys are de-identified (see Chapter 7).
The IRB will want assurances that all relevant individuals have been notified and are in support of the project. This may include healthcare providers, the rest of the research team, and administrators. Surveys will often receive expedited review.

Informed Consent
Consent for mail or electronic questionnaires is implied by the return of the questionnaire. The cover letter provides the information needed, serving the function of an informed consent form (see Fig. 11-2). Individuals who participate in face-to-face interviews can be given an informed consent form to sign in the presence of the interviewer. For telephone interviews, the researcher is obliged to give respondents full information at the beginning of the call so that they can decide whether they want to continue with the interview. Consent to participate is obtained verbally up front in the conversation, letting respondents know that they can refuse to continue.

Focus on Evidence 11–1
Presenting Lots of Information

In their study of telehealth, Edwards et al18 used a variety of presentations for results to look at interest in using phone-based, email/internet-based, and social media-based telehealth.
In a table, demographics of age, sex, and location were presented for responders and nonresponders to show differences for patients with CVD risk and those with depression. They were able to show that there were no significant differences between those who returned the questionnaire and those who did not. This provided confidence that the sample was not biased.
The researchers looked at the proportion of respondents interested in individual telehealth technologies by patient group and provided a bar chart to show these relationships, as shown. The responses were dichotomized to reflect some versus no interest. The data showed that patients with depression were more interested than those with CVD risk in most forms of technology. Landlines were clearly preferred, followed by finding information on the internet, with little interest in social media.
To address which factors influenced interest in telehealth, they used regression models to determine various relationships (see Chapter 30). They found several important outcomes:
• Those who had greater confidence in using technology showed more interest in telehealth.
• For patients with depression only, greater difficulties with getting convenient care were related to more interest in phone and internet technologies.
• Both groups showed greater interest if they had satisfactory experiences in the past using such technologies.
• Health needs, access difficulties, technology availability, and sociodemographic factors did not consistently have an effect on interest in telehealth across either patient group. Respondents with good or poor health, with more or less difficulty accessing care, and older and younger patients had similar levels of interest. Interestingly, availability of technology was high, but this also did not relate to interest in telehealth.
These findings provide important information for policy makers and health professionals seeking to expand health services using different formats.

Source: Edwards L, Thomas C, Gregory A, et al. Are people with chronic diseases interested in using telehealth? A cross-sectional postal survey. J Med Internet Res 2014;16(5):e123, Figure 2 (http://jmir.org/). Used under Creative Commons license (BY/2.0) (http://creativecommons.org/licenses/by/2.0/).
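The coding, frequency, and cross-tabulation steps described in the analysis section can be sketched in a few lines of Python. The data below are invented for illustration, echoing the 0 = male, 1 = female coding from the text; they are not taken from the telehealth survey:

```python
from collections import Counter

# Invented coded responses: sex (0 = male, 1 = female) and mobile phone use
# (1 = uses one, 0 = does not); None marks a question left blank (missing data).
responses = [
    {"sex": 0, "mobile": 1}, {"sex": 1, "mobile": 1},
    {"sex": 1, "mobile": 0}, {"sex": 0, "mobile": 1},
    {"sex": 1, "mobile": 1}, {"sex": 0, "mobile": None},
]

# Frequencies and percentages, reported alongside the total n.
valid = [r for r in responses if r["mobile"] is not None]
counts = Counter(r["mobile"] for r in valid)
print(f"n = {len(valid)}; mobile users = {counts[1]} "
      f"({100 * counts[1] / len(valid):.0f}%)")  # n = 5; mobile users = 4 (80%)

# Cross-tabulation of sex by device use, the kind of table a chi-square
# test of association would be computed from (see Chapter 28).
crosstab = Counter((r["sex"], r["mobile"]) for r in valid)
print(crosstab[(1, 1)])  # 2 female mobile users
```

Note how the blank response is excluded from the denominator rather than silently counted as a "no," which is one simple way to keep missing data from biasing the percentages.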
COMMENTARY
Surveys show that the #1 fear of Americans is public speaking. #2 is death. That
means that at a funeral, the average American would rather be in the casket
than doing the eulogy.
—Jerry Seinfeld (1954– )
American comedian
The term "survey research" is often interpreted as a research design, but it is more accurate to think of surveys as a form of data collection. Like any other data collection method, survey development requires an understanding of theory, content and construct validity, and basic principles of measurement. Surveys must be used for appropriate purposes and analyzed to ascribe valid meaning to results.
Because of the nature of clinical variables, and the emphasis on patient perceptions and values in evidence-based practice, surveys are an important methodology, providing information that cannot be ascertained through observational techniques. Many questionnaires have been developed as instruments for assessing constructs that are relevant to patient care. Such instruments are applicable to explanatory, exploratory, and descriptive research designs.
Those interested in pursuing the survey approach are encouraged to consult researchers who have had experience with questionnaires, as well as several informative texts listed at the end of this chapter.26,36–38
REFERENCES

1. Jenkinson C, Coulter A, Bruster S, Richards N, Chandola T. Patients' experiences and satisfaction with health care: results of a questionnaire study of specific aspects of care. Qual Saf Health Care 2002;11(4):335–339.
2. Calancie B, Molano MR, Broton JG. Epidemiology and demography of acute spinal cord injury in a large urban setting. J Spinal Cord Med 2005;28(2):92–96.
3. Freburger JK, Konrad TR. The use of federal and state databases to conduct health services research related to physical and occupational therapy. Arch Phys Med Rehabil 2002;83(6):837–845.
4. Centers for Disease Control and Prevention (CDC). National Health Interview Survey: National Center for Health Statistics. Available at https://www.cdc.gov/nchs/nhis/data-questionnaires-documentation.htm. Accessed June 30, 2017.
5. National Center for Health Statistics: National Nursing Home Survey. Survey methodology, documentation, and data files survey. Available at https://www.cdc.gov/nchs/nnhs/nnhs_questionnaires.htm. Accessed February 7, 2018.
6. van Zon A, Smulders YE, Stegeman I, Ramakers GG, Kraaijenga VJ, Koenraads SP, Zanten GA, Rinia AB, et al. Stable benefits of bilateral over unilateral cochlear implantation after two years: A randomized controlled trial. Laryngoscope 2017;127(5):1161–1168.
7. Owen-Smith AA, Smith DH, Rand CS, Tom JO, Laws R, Waterbury A, et al. Difference in effectiveness of medication adherence intervention by health literacy level. Perm J 2016;20(3):38–44.
8. Adjemian MK, Volpe RJ, Adjemian J. Relationships between diet, alcohol preference, and heart disease and type 2 diabetes among Americans. PLoS One 2015;10(5):e0124351.
9. Mandal A, Eaden J, Mayberry MK, Mayberry JF. Questionnaire surveys in medical research. J Eval Clin Pract 2000;6(4):395–403.
10. Leedy PD, Ormrod JE. Practical Research: Planning and Design. Upper Saddle River, NJ: Prentice Hall, 2001.
11. Carr EC, Worth A. The use of the telephone interview for research. J Res Nursing 2001;6(1):511–524.
12. Wu V, Abo-Sido N, Espinola JA, Tierney CN, Tedesco KT, Sullivan AF, et al. Predictors of successful telephone follow-up in a multicenter study of infants with severe bronchiolitis. Ann Epidemiol 2017;27(7):454–458.e1.
13. Insights Association: Mixed-Mode Surveys. May 24, 2016. Available at http://www.insightsassociation.org/issues-policies/best-practice/mixed-mode-surveys. Accessed July 3, 2017.
14. Shumway-Cook A, Patla A, Stewart AL, Ferrucci L, Ciol MA, Guralnik JM. Assessing environmentally determined mobility disability: self-report versus observed community mobility. J Am Geriatr Soc 2005;53(4):700–704.
15. Stepan JG, London DA, Boyer MI, Calfee RP. Accuracy of patient recall of hand and elbow disability on the QuickDASH questionnaire over a two-year period. J Bone Joint Surg Am 2013;95(22):e176.
16. Brusco NK, Watts JJ. Empirical evidence of recall bias for primary health care visits. BMC Health Serv Res 2015;15:381.
17. Reneman MF, Jorritsma W, Schellekens JM, Goeken LN. Concurrent validity of questionnaire and performance-based disability measurements in patients with chronic nonspecific low back pain. J Occup Rehabil 2002;12(3):119–129.
18. Edwards L, Thomas C, Gregory A, Yardley L, O'Cathain A, Montgomery AA, et al. Are people with chronic diseases interested in using telehealth? A cross-sectional postal survey. J Med Internet Res 2014;16(5):e123.
19. McDowell I. Measuring Health: A Guide to Rating Scales and Questionnaires. 3 ed. New York: Oxford University Press, 2006.
20. Beck AT, Steer RA, Carbin MG. Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clin Psychol Rev 1988;8(1):77–100.
21. Creavin ST, Wisniewski S, Noel-Storr AH, Trevelyan CM, Hampton T, Rayment D, et al. Mini-Mental State Examination (MMSE) for the detection of dementia in clinically unevaluated people aged 65 and over in community and primary care populations. Cochrane Database Syst Rev 2016(1):Cd011145.
22. Ware J Jr, Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med Care 1996;34(3):220–233.
23. National Survey of Student Engagement: Gender identity and sexual orientation: survey challenges and lessons learned. Available at http://nsse.indiana.edu/pdf/presentations/2016/AIR_2016_Gender_ID_handout.pdf. Accessed February 9, 2018.
24. Quinn TJ, Langhorne P, Stott DJ. Barthel index for stroke trials: development, properties, and application. Stroke 2011;42(4):1146–1151.
25. Brown AW, Therneau TM, Schultz BA, Niewczyk PM, Granger CV. Measure of functional independence dominates discharge outcome prediction after inpatient rehabilitation for stroke. Stroke 2015;46(4):1038–1044.
26. Dillman DA, Smyth JD, Christian LM. Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. 3 ed. Hoboken, NJ: John Wiley & Sons, 2014.
27. Survey Monkey: Survey sample size. Available at https://www.surveymonkey.com/mp/sample-size/. Accessed July 1, 2017.
28. Burgess TF: A general introduction to the design of questionnaires for survey research. University of Leeds, May, 2001. Available at http://iss.leeds.ac.uk/downloads/top2.pdf. Accessed June 29, 2017.
29. Fan W, Yan Z. Factors affecting response rates of the web survey: a systematic review. Computers Hum Behav 2010;26:132–139.
30. Boynton PM, Greenhalgh T. Selecting, designing, and developing your questionnaire. BMJ 2004;328(7451):1312–1315.
31. Shih TH, Fan X. Comparing response rates from web and mail surveys: a meta-analysis. Field Methods 2008;20:249–271.
32. Millar MM, Dillman DA. Improving response to web and mixed mode surveys. Public Opin Q 2011;75(2):249–269.
33. Galea S, Tracy M. Participation rates in epidemiologic studies. Ann Epidemiol 2007;17(9):643–653.
34. Survey Monkey: Sample Size Calculator. Available at https://www.surveymonkey.com/mp/sample-size-calculator/. Accessed June 30, 2017.
35. CheckMarket: Sample Size Calculator. Available at https://www.checkmarket.com/sample-size-calculator/. Accessed June 30, 2017.
36. Oppenheim AN. Questionnaire Design, Interviewing and Attitude Measurement. 2 ed. London: Bloomsbury Academic, 2000.
37. Converse JM, Presser S. Survey Questions: Handcrafting the Standardized Questionnaire. Thousand Oaks, CA: Sage, 1986.
38. Saris WE, Gallhofer IN. Design, Evaluation, and Analysis of Questionnaires for Survey Research. 2 ed. Hoboken, NJ: John Wiley & Sons, 2014.
6113_Ch12_159-178 02/12/19 4:48 pm Page 159
CHAPTER 12
Understanding Health Measurement Scales
—with Marianne Beninato

The foundational measurement concepts described in preceding chapters create a context for understanding how we can use evidence to interpret scores from appropriate measurement tools. Because many aspects of health reflect intangible constructs, they require data that must be converted to a scale that will indicate "how much" of a health factor is present in an individual.

The purpose of this chapter is to describe different types of scales used in questionnaires to assess health. These may be self-report inventories or performance-based instruments. This discussion will focus on the validity of four types of scales, including summative, visual analog, and cumulative scales, and the Rasch measurement model. Whether you are trying to use or modify established instruments or construct a new tool, an understanding of scale properties is essential for interpretation of scores and application to clinical decision-making.

■ Understanding the Construct

The first essential step in choosing or developing a health scale is understanding the construct being measured. The items in a scale must reflect the underlying latent trait and the dimensions that define it (see Chapter 10). Scales require substantial measurement scrutiny to establish their reliability and validity, as we strive to confirm utility with different populations and under different clinical conditions.

Theoretical Foundations

An important characteristic of a scale's validity is its focus on a common dimension of the trait being measured. This requires a clear theoretical foundation, including behavioral expectations when the trait is present to different degrees, and an understanding of distinct components of the construct.

Consider the study of pain, which can be manifested in several dimensions, including differences between acute and chronic pain, and potential biopsychosocial impacts related to symptoms, how it affects movement, or its impact on quality of life.1 For example, the McGill Pain Questionnaire incorporates pain ratings as well as qualitative descriptors, based on Melzack's theory of pain reflecting sensory, affective, and evaluative dimensions of pain.2 It focuses on various aspects of pain, such as intensity, type of pain, and duration. In contrast, the Roland Morris Disability Questionnaire (RMDQ) was designed to assess physical disability due to low back pain.3 Therefore, the framework for the RMDQ is based on how back pain influences daily life, not on a description of the pain itself.

Measuring Health

For many clinical variables, a single indicator, such as blood pressure or hematocrit, can provide relevant information regarding a specific aspect of health. These tests are typically interpreted with reference to a criterion that defines a normal range. These values play an important role in identifying disease or indicating appropriate treatment.
As the concept of health has broadened, however, single indicators are not sufficient to reflect the dimensions of important clinical traits. Assessing health requires the use of subjective measures that involve judgment on the part of a patient, caregiver, clinician, or researcher.4 These judgments may reflect different perceptions, which will influence which indicators are appropriate. Therefore, to measure variables such as depression, function, or balance, it is necessary to develop a set of items that address multiple aspects of health. For function, for example, questions should address physical, social, and emotional considerations as they affect daily activities.

Health Scales

A scale is an ordered system based on a series of questions or items, resulting in a score that represents the degree to which a respondent possesses a particular attitude, value, or characteristic. The items used will usually be in the same format so that responses can be summed. The purpose of a scale is to distinguish among people who demonstrate different intensities of the characteristic and to document change in individuals over time.

Typically designed as questionnaires, health scales are often composed of self-report items that probe the latent trait being studied according to the individual's own perceptions. Scales may also include tasks that are observed by another. For example, instruments like the Beck Depression Inventory ask respondents to indicate the intensity of emotional, behavioral, and somatic symptoms they are experiencing related to depression.5 Tests of function, such as the Barthel Index6 or Functional Independence Measure (FIM),7 rely on a clinician's judgment regarding a patient's performance in completing basic activities of daily living (ADLs). To characterize balance, clinicians usually observe patients performing several tasks using tests such as the Berg Balance Scale (BBS).8

Creating a scale requires establishing a scoring process that will yield a total score. Therefore, numbers have to be assigned to subjective judgments on a scale that can indicate high or low levels of the variable.4 Because most health scales use ordinal response categories, we must be cognizant of the limitations in interpretation of ordinal scores (see Chapter 8).

Unidimensionality

For a scale to be meaningful, it should be unidimensional, representing a singular overall construct. When different dimensions exist, subscales can be created for each one. Content validity should be evaluated to establish that items represent the full scope of the construct. Statistical techniques, such as factor analysis, are useful to determine the construct validity, indicating how items cluster into different components (see Chapter 31).

For instance, the Short Form-36 (SF-36) health status questionnaire is based on eight subscales, including vitality, physical functioning, bodily pain, general health perceptions, physical role functioning, emotional role functioning, social role functioning, and mental health.9 Each of these represents a different element of health that is assessed with different items. Users must be aware of the differences among subscales, understanding why a total score would be uninterpretable.

Short Forms

Health questionnaires are often lengthy, and short forms have been developed for many scales. For example, the original 116-item Medical Outcomes Study instrument was first reduced to the SF-36 health questionnaire9 and was then condensed to the SF-1210 and the SF-8.11 The numbers in these scale names refer to the number of items in the scale. Because fewer representative items are used in the short form, the subscales have been truncated to two summary subscales of physical and mental health functioning.

Validation of these short forms follows full validation of the original scale. Typically, researchers use factor analysis and other construct validity approaches to demonstrate that the shorter version can generate a meaningful score that correlates with the longer scale and is consistent with the target construct. The obvious benefit of a short-form test is efficiency of time and effort.

The Target Population

Scales may be generic or they may be geared toward specific patient groups, age ranges, or care settings. Therefore, a target population must be specified. For instance, the SF-36 is considered a generic measure of health status,12 whereas the Arthritis Impact Measurement Scale (AIMS),13 the Functional Living Index-Cancer,14 and the Spinal Cord Injury–Quality of Life measurement system15 were all created to address issues specific to those conditions. The Pediatric Quality of Life Inventory16 and the Older People's Quality of Life Questionnaire17 focus on those age-related populations, and the COOP Charts are specifically designed to measure quality of life in primary care settings.18

Cultural validity of an instrument also requires consideration of how language and other societal norms vary across populations. Many instruments have been validated in different languages and countries, often requiring nuanced changes in questions or wording.
Figure 12–1 Variety of Likert scales for different contexts. [Examples show 5-point response formats: Agreement (1 = Strongly Disagree, 2 = Disagree, 3 = Neither Agree nor Disagree, 4 = Agree, 5 = Strongly Agree) and Satisfaction (1 = Very Dissatisfied, 2 = Dissatisfied, 3 = Unsure, 4 = Satisfied, 5 = Very Satisfied).]
Form A

Instructions: Each item below is a belief statement about your medical condition with which you may agree or disagree. Beside each statement is a scale which ranges from strongly disagree (1) to strongly agree (6). For each item we would like you to circle the number that represents the extent to which you agree or disagree with that statement. The more you agree with a statement, the higher will be the number you circle. The more you disagree with a statement, the lower will be the number you circle. Please make sure that you answer EVERY ITEM and that you circle ONLY ONE number per item. This is a measure of your personal beliefs; obviously, there are no right or wrong answers.

1 = STRONGLY DISAGREE (SD)    4 = SLIGHTLY AGREE (A)
2 = MODERATELY DISAGREE (MD)  5 = MODERATELY AGREE (MA)
3 = SLIGHTLY DISAGREE (D)     6 = STRONGLY AGREE (SA)

Each item is rated 1 to 6 (SD MD D A MA SA):

I   1. If I get sick, it is my own behavior that determines how soon I get well again.
C   2. No matter what I do, if I am going to get sick, I will get sick.
P   3. Having regular contact with my physician is the best way for me to avoid illness.
C   4. Most things that affect my health happen to me by accident.
P   5. Whenever I don't feel well, I should consult a medically trained professional.
I   6. I am in control of my health.
P   7. My family has a lot to do with my becoming sick or staying healthy.
I   8. When I get sick, I am to blame.
C   9. Luck plays a big part in determining how soon I will recover from an illness.
P  10. Health professionals control my health.
C  11. My good health is largely a matter of good fortune.
I  12. The main thing that affects my health is what I myself do.
I  13. If I take care of myself, I can avoid illness.
P  14. When I recover from an illness, it's usually because other people (like doctors, nurses, family, friends) have been taking good care of me.
C  15. No matter what I do, I am likely to get sick.
C  16. If it's meant to be, I will stay healthy.
I  17. If I take the right actions, I can stay healthy.
P  18. Regarding my health, I can only do what my doctor tells me to do.

Figure 12–2 The Multidimensional Health Locus of Control Scale (Form A), illustrating a six-point Likert scale. Subscales are shown on the left: I = Internal, C = Chance, P = Powerful others. (These codes are not included when the test is administered.) Adapted from Wallston KA. Multidimensional Health Locus of Control Scales. Available at https://nursing.vanderbilt.edu/projects/wallstonk/. Accessed September 16, 2018.
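Summative scoring of a Likert-format instrument like this is simply a matter of summing the circled 1-to-6 ratings within each subscale. A minimal sketch in Python (the item-to-subscale assignments follow the I/C/P codes shown in Figure 12-2; the respondent's answers are hypothetical):

```python
# Item-to-subscale assignments follow the I/C/P codes shown in Figure 12-2:
# I = Internal, C = Chance, P = Powerful others; each item is rated 1 to 6.
SUBSCALES = {
    "I": [1, 6, 8, 12, 13, 17],
    "C": [2, 4, 9, 11, 15, 16],
    "P": [3, 5, 7, 10, 14, 18],
}

def score_mhlc(responses):
    """Sum the 1-6 ratings within each subscale; responses maps item number -> rating."""
    if set(responses) != set(range(1, 19)):
        raise ValueError("All 18 items must be answered.")
    return {name: sum(responses[i] for i in items) for name, items in SUBSCALES.items()}

# Hypothetical respondent who endorses the internal-control items (subscale range: 6-36):
answers = {i: 5 if i in SUBSCALES["I"] else 2 for i in range(1, 19)}
print(score_mhlc(answers))  # {'I': 30, 'C': 12, 'P': 12}
```

Because each subscale sums six items, subscale totals range from 6 to 36, and each subscale is interpreted separately rather than as a single overall locus-of-control score.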
Figure 12–3 Visual analogue pain scale, anchored by "No pain" and "Worst possible pain," showing a patient's mark (in red).
Focus on Evidence 12–1
It's Not Really That Simple

The VAS is widely used to reflect many different types of subjective assessments because of its great convenience and apparent simplicity. Appropriately, reliability and validity of the VAS have been the subject of substantial research, which has generally found strong measurement properties for this scale.31-35 However, while it may seem like a straightforward measure, we should be vigilant in deciding when to use it. Because of its unidimensional approach, the VAS may be limited in its use to evaluate complex clinical phenomena.

To illustrate this caveat, Boonstra et al36 conducted a study to determine the validity of a VAS to measure disability in patients with chronic musculoskeletal pain. Their aim was to determine if a single-item measure could be used to measure the overall construct of disability, in place of established multi-item instruments that are lengthy and time consuming to use.

They developed a VAS with a left anchor of "no disability" and a right anchor of "very severe disability." Validity testing was based on comparisons with two tools that have long been considered valid measures of disability for this population: the SF-369 and the RMDQ.37 They also included a standard VAS for pain.

They found weak correlations of the disability VAS with the SF-36 and RMDQ, but a strong correlation with the pain VAS. Their interpretation was that pain was likely a clearer concept than disability, and its impact on function was probably better understood than "disability." Therefore, they concluded that their VAS was not a valid measure of self-reported disability.

Their findings provide important general and specific lessons in the use of visual analog scales, as we seek to find more efficient ways to measure clinical variables. It is tempting to create measures with an easy format like the VAS, but this can be deceptive when dealing with multidimensional concepts, especially with variables like disability or pain that are less well understood. It is essential to appreciate the necessity for validation to assure that a construct is being adequately evaluated.
Figure 12–4 Variations of VAS scales. A. A 0-to-10 numerical pain rating scale anchored by "No pain" and "Worst possible pain"; B. NRS with interim descriptors (no pain, minimal pain, moderate pain, intense pain, worst possible pain); C. The Faces Pain Scale-Revised, scored 0, 2, 4, 6, 8, 10. (Part C used with permission of the International Association for the Study of Pain. Available at www.iasp-pain.org/FPSR. Accessed December 13, 2018.)
Analyzing VAS Scores

Because VAS scores are recorded in millimeters, the measurement is often interpreted as a ratio measure. However, the underlying scale is more accurately ordinal. Figure 12-5 illustrates hypothetical scores from three individuals rating their pain on two occasions. Even though they started at different points and ended at the same point, we don't know if actual pain levels are different at the first mark, the same at the second mark, or if the amount of change is different among patients. We only know that all three improved. This type of scale is consistent within a patient, but not across patients. This concept is important whether using a traditional VAS, an NRS, or a faces rating scale.

[Figure 12–5 Hypothetical pain ratings for Patients A, B, and C, each marked twice on a VAS anchored by "No pain" and "Worst possible pain."]

One argument suggests that the individual marking the line is not truly able to appreciate the nuances of a full continuum and, therefore, the mark does not represent a specific ratio score but more of a general estimate. Therefore, even though the actual readings from the scale are obviously at the ratio level, the true measurement properties are less precise.

The issue of level of measurement is relevant for the choice of statistical procedures in data analysis. Although truly ordinal, VAS scores are often analyzed as ratio scores using parametric statistics (see Chapter 8).

■ Cumulative Scales

The dilemma with a summative scale, where several item responses are added to create a total score, is that the total score can be interpreted in more than one way. Suppose we have a scale, scored from 0 to 100, that measures physical function, including elements related to locomotion, personal hygiene, dressing, and feeding. Individuals who achieve the same score of 50 may have obtained this score for very different reasons. One may be able to walk but is unable to perform the necessary upper extremity movements for self-care. Another may be in a wheelchair but is able to take care of his personal needs. Or a third individual may do all tasks, but poorly. Therefore, people with the same score will not necessarily have the same profile. This potential outcome reflects the fact that the items within the scale actually reflect different components of the latent trait being measured, in this case physical function, which are not all equal.

Cumulative scales (also called Guttman scales) provide an alternative approach, wherein a set of statements is presented that reflects increasing severity of the characteristic being measured. This technique is designed to ensure that there is only one dimension within a set of responses—only one unique combination of responses that can achieve a particular score. Each item in the scale is dichotomous.

➤ CASE IN POINT #4
Gothwal et al60 evaluated the Distance Vision Scale (DVS), an assessment of visual acuity in which letter reading becomes progressively difficult, to determine if the scale's properties held up in patients with cataracts.

The DVS asks individuals to respond to five statements, each scored "1" for yes and "0" for no. A total is computed, with a maximum score of 5.
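The defining property of a cumulative scale is that a person who passes an item should also pass every easier item, so a total score maps to exactly one response pattern. A small illustrative check in Python (the function and response patterns are hypothetical examples, not data from the DVS study):

```python
def is_guttman_consistent(responses):
    """responses: 0/1 item scores ordered from easiest to hardest item.
    A perfect Guttman pattern never passes an item after the first failure."""
    if 0 in responses:
        first_failure = responses.index(0)
        return 1 not in responses[first_failure:]
    return True

# A DVS-style total of 3 implies exactly one consistent pattern on five items:
print(is_guttman_consistent([1, 1, 1, 0, 0]))  # True
# The same total with a different pattern violates the cumulative structure:
print(is_guttman_consistent([1, 0, 1, 1, 0]))  # False
```

In practice, real respondents rarely fit the pattern perfectly, and scalogram analysis is used to quantify how closely observed responses approximate the ideal cumulative structure.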
The unidimensionality of a scale can be confirmed using factor analysis (see Chapter 31) and a measure of internal consistency (see Chapter 32).

The Rasch Model

Although there are several IRT models, the one most commonly used is a statistical technique called Rasch analysis, named for Georg Rasch (1901–1980), a Danish psychometrician. Using data from a patient sample and specialized computer programs, a model is generated that plots a linear metric on an interval scale that locates items in order of difficulty and persons in order of ability. In an approach similar to Guttman scaling, the Rasch model is based on the assumption that there is a relationship between an item's difficulty and an individual's ability level.

Figure 12–6 Person-item map, showing the relative position of items and persons along a linear scale in a Rasch model. The scale increases vertically from less able/easier items to more able/harder items. M = mean logit of item distribution. [Four hypothetical persons and three items are plotted against a logit axis running from –2 to +3.]
This discussion will focus on a conceptual understanding of Rasch analysis and how reported data can be interpreted. There are many good references for those who want to delve further into this approach.63-69

The Person-Item Map

The conceptual structure of a Rasch analysis is illustrated in Figure 12-6. The vertical line represents the continuum of the construct being measured. Using a common metric, items are plotted on the right, indicating their level of difficulty. On the left, people are positioned according to their ability level, based on "how much" or "how little" of the trait they have. This presentation is called a person-item map.

The scale is a linear measure in natural log units called logits. Logit scores are assigned to items and person ability, with a central zero point at the mean of the item distribution, allowing items to be scaled as positive or negative. A logit scale is considered an equal-interval measure, thereby creating an interval scale that has linear, additive properties.

This transformation of ordinal data into interval values is an important feature of Rasch analysis. It allows us not only to rank items, but to determine how much more or less difficult each item is than another. Likewise, given a specific trait, we can determine how much more or less able one person is than another.

The items are ordered on the map according to the probability that an individual with a specific ability level will score high on that item. Therefore, persons 1 and 2 have lower ability levels and can only achieve item 1. Person 3 can achieve items 1 and 3, and person 4 can achieve all three items, including item 2, which is the most difficult item. People are located on the person-item map according to the highest item on which they have a 50% probability of succeeding.70,71

A good-fitting scale should reflect the full range of difficulty of items. Ideally, the scale items should not be redundant of each other, but should be distributed across the continuum of the trait being measured. Therefore, there should be no gaps where items are missing along the continuum, and no floor or ceiling effects. We should also find that the rank order of difficulty does not change from person to person. The analysis examines the extent to which an instrument's responses fit the predicted hierarchical model.65,70,72,73

Establishing the Hierarchy

➤ CASE IN POINT #1: REVISITED
Recall the description of the Functional Gait Assessment (FGA), presented earlier. The FGA is a 10-item scale that assesses a person's ability to balance while walking.20 Each item is scored 0 to 3, with lower scores indicating greater impairment. Beninato and Ludlow74 studied the FGA with 179 older adults to determine if items mapped along a meaningful continuum, determine if the spread of tasks was sufficient to measure patients with varying functional abilities, and describe psychometric properties of individual items.

The 10 items in the FGA are listed in Table 12-2. Figure 12-7 illustrates the full person-item map obtained in this study, showing the distribution of these 10 items and the patients. The person-item map shows that item 2 (gait speed) was the easiest, and item 7 (narrow base) was the most difficult.
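The location logic rests on the dichotomous Rasch model, in which the probability of success depends only on the difference between person ability and item difficulty, both expressed in logits; when the two are equal, the probability is exactly 50%. A brief sketch:

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: P(success) depends only on ability - difficulty (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A person located at +1.7 logits has exactly a 50% chance on an item of difficulty +1.7:
print(rasch_probability(1.7, 1.7))            # 0.5
# The same person succeeds more often on an item one logit easier:
print(round(rasch_probability(1.7, 0.7), 2))  # 0.73
```

This is why equal logit distances mean equal changes in the odds of success anywhere along the scale, which is the sense in which the logit metric is an equal-interval measure.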
Figure 12–7 Person-item map for a Rasch analysis of the Functional Gait Assessment (FGA). Scores to the left (green bar) are the actual FGA scores. Logit scores are evenly spaced. The numbered test items are listed in blue (for example, item 7 is numbered I7). The number of patients who were able to achieve different ability levels is indicated in red (. = one person; # = 2 people). The gold dashed horizontal line indicates a cutoff score of ≤22, which indicates risk for falls on the FGA. Circled letters: M = mean ability or mean difficulty; S = 1 standard deviation from the mean; T = 2 standard deviations from the mean. From Beninato M, Ludlow LH. The Functional Gait Assessment in older adults: validation through Rasch modeling. Phys Ther 2016;96(4):456–468. Adapted from Figure 1, p. 459. Used with permission of the American Physical Therapy Association. [Items visible on the map run from harder to easier roughly as: I8 Eyes closed; I1 Level and I6 Obstacle; I10 Steps, I3 Horizontal Head Turns, and I9 Backwards; I4 Vertical Head Turns; I5 Pivot; I2 Change Speed.]
➤ The FGA scale demonstrated internal consistency (α = 0.83) and item separation reliability (r = 0.99), supporting overall consistency of scores and spread of task difficulty.70

Fit to the Rasch Model

Fit statistics indicate how well the data conform to the predicted Rasch measurement model.66,71,82 They tell us how well each individual item contributes to the measurement of the unidimensional construct. To demonstrate goodness of fit, high ratings on more difficult items should be accomplished by people of greater ability, and people should have a higher probability of attaining higher scores on easier items than on more difficult ones.

Table 12-3 Fit Statistics for Selected FGA Items (in Order of Decreasing Difficulty)74

                                          INFIT          OUTFIT
ITEM                           LOGIT   MSQ   ZSTD     MSQ   ZSTD
7. Gait with narrow base        2.98   1.94    6.8    1.53    3.3
   of support
9. Ambulating backwards        –0.25   0.67   –3.6    0.68   –3.4
10. Climbing steps             –0.25   0.63   –4.2    0.77   –2.4
5. Gait and pivot turn         –1.32   1.01    0.1    1.18    1.4
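Conceptually, the infit and outfit mean squares (MSQ) in Table 12-3 are averages of squared residuals between observed scores and the model's expected scores; outfit is unweighted, while infit weights each residual by its statistical information. A rough sketch for the dichotomous case (illustrative only; the FGA uses polytomous 0-3 items and dedicated Rasch software):

```python
import math

def rasch_p(theta, b):
    """Dichotomous Rasch probability of success (ability theta, difficulty b, in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_msq(item_difficulty, abilities, responses):
    """Return (outfit MSQ, infit MSQ) for one dichotomous item.
    Outfit: unweighted mean of squared standardized residuals.
    Infit: squared residuals weighted by item information (response variance)."""
    z_sq, resid_sq_sum, info_sum = [], 0.0, 0.0
    for theta, x in zip(abilities, responses):
        p = rasch_p(theta, item_difficulty)
        var = p * (1.0 - p)                 # information of a dichotomous response
        z_sq.append((x - p) ** 2 / var)     # squared standardized residual
        resid_sq_sum += (x - p) ** 2
        info_sum += var
    return sum(z_sq) / len(z_sq), resid_sq_sum / info_sum

# A perfectly ordered (Guttman-like) response string is *more* deterministic than
# the probabilistic model expects, so both mean squares fall below 1.0 (overfit):
outfit, infit = fit_msq(0.0, [-2, -1, 0, 1, 2], [0, 0, 0, 1, 1])
print(round(outfit, 2), round(infit, 2))  # 0.4 0.5
```

Values near 1.0 indicate responses behaving as the model predicts; values well above 1.0 (like item 7's infit of 1.94) flag erratic, underfitting responses, while values below 1.0 (items 9 and 10) indicate overly predictable, overfitting responses.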
how these analyses can inform their practice. For example, knowing that ordinal values are not meaningful in a scale, using logit values generated by a Rasch analysis can provide a sound picture of the tool's real scale.

Conversion tables can be generated, partially illustrated in Table 12-4, that show equivalent scores at different logit values. These conversion scores enable the clinician to estimate a patient's ability along the person-item map ability scale. This can inform intervention planning, targeting items at the right level of challenge for the patient and guiding progression of skill.

For example, the hierarchy shown on the person-item map for the FGA does not match the order of administration of the scale items (see Table 12-2). Considering the order of difficulty of items can allow for a more efficient process so that the test can start with easier or more difficult items to match a patient's ability, potentially making the test shorter.

The score conversion also helps the clinician interpret cutoff scores such as those related to fall risk. For example, Beninato and Ludlow74 superimposed a score of 22, which has been previously identified as a cutoff FGA score associated with fall risk in older adults, on the person-item map at a logit of +1.7 (see Fig. 12-7). This falls between items 8 and 7. Therefore, if a patient succeeds on item 8, there is a high probability that we could rule out fall risk. Such scores can identify which tasks will be most challenging for those near the fall risk criterion score (see Focus on Evidence 12-2).

Differential Item Functioning

Item functioning is intended to be invariant with respect to irrelevant characteristics of those who are measured. This means that an item's difficulty should be related to ability, regardless of other factors that are not related to the task. Differential item functioning (DIF) refers to potential item bias in the fit of data to the Rasch model. It occurs when subgroups within a sample respond differently to individual items despite having similar levels of the latent trait being measured. Even though individuals may be expected to respond differently to an item because of their age, for example, the expectation is that this difference will only be due to an actual difference in ability, not because they differ in age.4 DIF can reduce the validity of the scale. Typical factors that can contribute to DIF are age, gender, ethnicity, language, or cultural differences.

For example, researchers have evaluated scoring of the FIM in patients with strokes, finding significant DIF across six European countries.86 Item difficulty varied across countries, in part because of different cultural norms. For instance, "transfer to tub/shower" was more difficult in some countries than others, seemingly due to preferential use of tubs versus showers in certain cultures.

Table 12-4 Abridged Conversion Table for FGA Scores74

FGA SCORE   LOGIT    SE
    5       –2.66   0.57
   10       –1.30   0.49
   15       –0.10   0.49
   20        1.15   0.51
   25        2.67   0.61
   30        6.70   1.93

SE = standard error
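To estimate a logit for a raw score that does not appear in Table 12-4, one can interpolate between adjacent tabled rows. A sketch in Python (linear interpolation is only an approximation; in practice, the full conversion table produced by the Rasch software should be used):

```python
# FGA score-to-logit pairs from the abridged conversion in Table 12-4:
CONVERSION = [(5, -2.66), (10, -1.30), (15, -0.10), (20, 1.15), (25, 2.67), (30, 6.70)]

def fga_logit(score):
    """Linearly interpolate a logit estimate between adjacent tabled FGA scores."""
    for (s0, l0), (s1, l1) in zip(CONVERSION, CONVERSION[1:]):
        if s0 <= score <= s1:
            return l0 + (l1 - l0) * (score - s0) / (s1 - s0)
    raise ValueError("Score outside the tabled range (5-30).")

# A raw FGA score of 22 interpolates to about +1.76 logits, consistent with the
# fall-risk cutoff of 22 mapping near +1.7 on the person-item map:
print(round(fga_logit(22), 2))  # 1.76
```

Note how the table itself illustrates the nonlinearity of raw scores: the 5-point jump from 25 to 30 spans about 4 logits, far more than the jump from 10 to 15, which is why equal raw-score changes do not represent equal changes in ability.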
Focus on Evidence 12–2
Are We Measuring the Same Thing?

One of the advantages of Rasch analysis is the ability to translate scores across instruments by converting measurements to a common metric. For example, a score ≤22 points on the FGA has been used as a cutoff to indicate individuals at risk for falls.20 Beninato and Ludlow74 translated this score to a logit of +1.7 (see Figure 12-7). The BBS, another assessment of fall risk that uses a different scoring system, has identified ≤45 points on that scale as the cutoff.8 In their Rasch analysis of the BBS, Kornetti et al80 found that this matched a logit of +1.5. Because these close logit values represent a common construct, they support the validity of both scales for assessing balance and fall risk.

It is important to realize, however, that the hierarchy of items may not be consistent across different patient groups, depending on their specific motor control impairments. For example, one might expect that someone with hemiparesis from a stroke would approach a balance task with different limitations than someone with bilateral peripheral neuropathy. This is illustrated in Rasch analyses of the BBS, which have shown different results in studies with patients with lower limb amputations versus older military veterans.80,84 When items were compared across these two patient groups, 8 of 14 items changed location in the order of item difficulty. In another study of patients with PD, the BBS required substantial modification to meet the unidimensionality assumption and demonstrated a ceiling effect because it did not adequately represent postural control impairments typical of PD.85 These findings demonstrate the important contribution of Rasch analysis to understanding a scale's application, especially if we want to assess patients with different conditions or in different settings, or if we monitor patients across the continuum of care, where different instruments are often used to assess function.69
The important lesson in this analysis is the caution in pooling data or comparing FIM scores across countries unless adjustments are made to account for DIF.

Identification of DIF may be addressed using factor analysis (see Chapter 31), regression analysis (see Chapter 30), or Rasch analysis. The individuals in different groups would be matched on their scores and their responses compared for each item so that variant patterns can be observed. Further discussion of DIF analysis methods is beyond the scope of this book, but many good references are available.87-90

Computer Adaptive Testing

Computer adaptive tests (CATs) are designed to adjust the level of difficulty of items or questions based on the responses provided to match the knowledge and ability of a test-taker. CATs were originally designed for educational evaluation using the IRT framework, and more recently have been used in the administration of health status questionnaires as a way of potentially reducing the number of items that need to be tested for any one individual.

You may be familiar with CAT through licensing exams or standardized achievement tests (like the GRE) when the test is taken on a computer.91 Based on the algorithm used, some test-takers may be given different questions and different numbers of questions to complete the test. The goal of CATs is to match the level of item difficulty to the ability level of the examinee. Using a complex algorithm, each response, correct or incorrect, is used to adapt further questions to the appropriate level. This process can reduce the total number of questions needed, as easier items can be eliminated for someone who correctly passes the harder items.

Development of a CAT is based on an item bank of questions on a particular topic or related to a common construct, which are then ordered by level of difficulty. This ordering can be established through Rasch analysis. The items are then tailored to the individual by first administering an item of midrange difficulty. The response is used to derive an estimated score. The test algorithm then selects the next item from the bank to refine the estimated score. This process continues for each successive item until a specified measurement level is reached or the maximum number of questions has been administered. By tailoring content to the respondent, questions that are not relevant to the respondent's level of skill are omitted.
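The adaptive cycle just described can be sketched as a simple loop: start near mid-range difficulty, move the ability estimate up after a correct response and down after an incorrect one, and stop when the estimate stabilizes or the item budget is exhausted. A toy sketch (operational CATs instead use maximum-likelihood ability estimation and information-based item selection):

```python
def run_cat(item_bank, answer, max_items=10, min_step=0.25):
    """Toy adaptive test: item_bank holds item difficulties (logits);
    answer(difficulty) returns True/False for the examinee's response.
    Returns (ability estimate in logits, list of administered difficulties)."""
    theta, step = 0.0, 1.0               # begin near mid-range difficulty
    administered = []
    available = sorted(item_bank)
    while available and len(administered) < max_items and step > min_step:
        item = min(available, key=lambda b: abs(b - theta))  # closest unused item
        available.remove(item)
        administered.append(item)
        theta += step if answer(item) else -step   # step up on success, down on failure
        step *= 0.7                                # shrink steps as the estimate settles
    return theta, administered

# Deterministic simulated examinee who passes any item easier than 1.2 logits:
bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
theta, used = run_cat(bank, lambda b: b < 1.2)
print(round(theta, 2), used)  # estimate near the true threshold, using only 4 of 9 items
```

Even this crude version shows the key efficiency: the examinee never sees the very easy items at the bottom of the bank, because each response steers subsequent item selection toward the region where measurement is most informative.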
COMMENTARY

Creating scales to measure latent traits presents a formidable challenge, with considerable implications for evidence-based practice. The descriptions of reliability and validity testing in Chapters 9 and 10 are important in the design of a health status measure. As examples in those chapters illustrate, the process of developing a new tool is not simple or quick. The steps in development of a measuring instrument include several stages of planning, test construction, reliability testing, and validation. Those who are inexperienced in this process should consult experts who can provide guidance, including statistical support.

Part of the validation process for a new scale will involve establishing criterion-related validity against another similar measurement. It will be relevant to ask, therefore, why a new instrument is needed. For example, the new scale may be more efficient or easier to administer, or it may relate to the construct in a different way, such as focusing on a different target population. Clinicians often develop "home-grown" scales that meet their local need, but these are not typically subjected to the rigorous validation procedures that are needed to assure measures are meaningful. It is advisable to look for existing instruments that have gone through this process whenever possible.

The Rasch approach is often applied during scale development as a means of informing construction of a scale to determine which items should be included or excluded.73,92 However, it can also be applied after a scale is developed to investigate the psychometric properties of its items.72 Many widely used health measures have been subjected to validity testing, but recent attention to IRT has led to their reappraisal. The Rasch model has provided a way to examine the internal construct validity of a measure, including ordering of categories, unidimensionality, and whether items are biased across subgroups. Clinical use of these scales can be greatly enhanced by understanding how these measurement properties influence interpretation of scores.
6113_Ch12_159-178 02/12/19 4:48 pm Page 175
rheumatic disorders receiving standard care. Isr Med Assoc J 2015;17(11):691–696.
43. Wehby GL, Naderi H, Robbins JM, Ansley TN, Damiano PC. Comparing the Visual Analogue Scale and the Pediatric Quality of Life Inventory for measuring health-related quality of life in children with oral clefts. Int J Environ Res Public Health 2014;11(4):4280–4291.
44. Pickard AS, Lin HW, Knight SJ, Sharifi R, Wu Z, Hung SY, et al. Proxy assessment of health-related quality of life in African American and white respondents with prostate cancer: perspective matters. Med Care 2009;47(2):176–183.
45. Pang PS, Collins SP, Sauser K, Andrei AC, Storrow AB, Hollander JE, et al. Assessment of dyspnea early in acute heart failure: patient characteristics and response differences between likert and visual analog scales. Acad Emerg Med 2014;21(6):659–666.
46. Kleiner-Fisman G, Khoo E, Moncrieffe N, Forbell T, Gryfe P, Fisman D. A randomized, placebo controlled pilot trial of botulinum toxin for paratonic rigidity in people with advanced cognitive impairment. PLoS One 2014;9(12):e114733.
47. Bienkowski P, Zatorski P, Glebicka A, Scinska A, Kurkowska-Jastrzebska I, Restel M, et al. Readiness visual analog scale: A simple way to predict post-stroke smoking behavior. Int J Environ Res Public Health 2015;12(8):9536–9541.
48. Lansing RW, Moosavi SH, Banzett RB. Measurement of dyspnea: word labeled visual analog scale vs. verbal ordinal scale. Respir Physiol Neurobiol 2003;134(2):77–83.
49. Lucas C, Romatet S, Mekies C, Allaf B, Lanteri-Minet M. Stability, responsiveness, and reproducibility of a visual analog scale for treatment satisfaction in migraine. Headache 2012;52(6):1005–1018.
50. Wong DL, Baker CM. Pain in children: comparison of assessment scales. Pediatr Nurs 1988;14(1):9–17.
51. Keck JF, Gerkensmeyer JE, Joyce BA, Schade JG. Reliability and validity of the Faces and Word Descriptor Scales to measure procedural pain. J Pediatr Nurs 1996;11(6):368–374.
52. Stuppy DJ. The Faces Pain Scale: reliability and validity with mature adults. Appl Nurs Res 1998;11(2):84–89.
53. Garra G, Singer A, Taira BR, Chohan C, Cardoz H, Chisena E, et al. Validation of the Wong-Baker FACES Pain Rating Scale in pediatric emergency department patients. Acad Emerg Med 2010;17(1):50–54.
54. Decruynaere C, Thonnard JL, Plaghki L. How many response levels do children distinguish on FACES scales for pain assessment? Eur J Pain 2009;13(6):641–648.
55. Gift AG. Validation of a vertical visual analogue scale as a measure of clinical dyspnea. Rehabil Nurs 1989;14(6):323–325.
56. Stroke Impact Scale, version 3.0. Available at https://www.strokengine.ca/pdf/sis.pdf. Accessed November 5, 2018.
57. Amico KR, Fisher WA, Cornman DH, Shuper PA, Redding CG, Konkle-Parker DJ, et al. Visual analog scale of ART adherence: association with 3-day self-report and adherence barriers. J Acquir Immune Defic Syndr 2006;42(4):455–459.
58. Ramachandran S, Lundy JJ, Coons SJ. Testing the measurement equivalence of paper and touch-screen versions of the EQ-5D visual analog scale (EQ VAS). Qual Life Res 2008;17(8):1117–1120.
59. Altenburg J, Wortel K, de Graaff CS, van der Werf TS, Boersma WG. Validation of a visual analogue score (LRTI-VAS) in non-CF bronchiectasis. Clin Respir J 2016;10(2):168–175.
60. Gothwal VK, Wright TA, Lamoureux EL, Pesudovs K. Guttman scale analysis of the distance vision scale. Invest Ophthalmol Vis Sci 2009;50(9):4496–4501.
61. Green BF. A method of scalogram analysis using summary statistics. Psychometrika 1956;21(1):79–88.
62. Tractenberg RE. Classical and modern measurement theories, patient reports, and clinical outcomes. Contemp Clin Trials 2010;31(1):1–3.
63. DeMars C. Item Response Theory. New York: Oxford University Press, 2010.
64. Boone WJ, Staver JR, Yale MS. Rasch Analysis in the Human Sciences. New York: Springer, 2014.
65. Boone WJ. Rasch analysis for instrument development: Why, when, and how? CBE Life Sci Educ 2016;15(4):pii: rm4.
66. Bond TG, Fox CM. Applying the Rasch Model: Fundamental Measurement in the Human Sciences. 3rd ed. New York: Routledge, 2015.
67. Embretson SE, Reise SP. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates, 2000.
68. Tesio L. Measuring behaviours and perceptions: Rasch analysis as a tool for rehabilitation research. J Rehabil Med 2003;35(3):105–115.
69. Velozo CA, Kielhofner G, Lai JS. The use of Rasch analysis to produce scale-free measurement of functional ability. Am J Occup Ther 1999;53(1):83–90.
70. Wright BD, Masters GN. Rating Scale Analysis. Chicago: MESA Press, 1982.
71. Cappelleri JC, Jason Lundy J, Hays RD. Overview of classical test theory and item response theory for the quantitative assessment of items in developing patient-reported outcomes measures. Clin Ther 2014;36(5):648–662.
72. Tennant A, Conaghan PG. The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis Rheum 2007;57(8):1358–1362.
73. Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute of Educational Research, 1960.
74. Beninato M, Ludlow LH. The Functional Gait Assessment in older adults: validation through Rasch modeling. Phys Ther 2016;96(4):456–468.
75. Wrisley DM, Marchetti GF, Kuharsky DK, Whitney SL. Reliability, internal consistency, and validity of data obtained with the functional gait assessment. Phys Ther 2004;84(10):906–918.
76. Sick S. Rasch analysis software programs. JALT Testing and Eval SIG Newsletter;13(3):13–16. Available at http://hosted.jalt.org/test/PDF/Sick4.pdf. Accessed August 13, 2018.
77. WINSTEPS & Facets Rasch Software. Available at http://www.winsteps.com/index.htm. Accessed July 23, 2018.
78. Andrich D, Lyne A, Sheriden B, Luo G. RUMM 2020. Perth, Australia: RUMM Laboratory, 2003.
79. Linacre JM. Rasch analysis of rank-ordered data. J Appl Meas 2006;7(1):129–139.
80. Kornetti DL, Fritz SL, Chiu YP, Light KE, Velozo CA. Rating scale analysis of the Berg Balance Scale. Arch Phys Med Rehabil 2004;85(7):1128–1135.
81. Silverstein B, Fisher WP, Kilgore KM, Harley JP, Harvey RF. Applying psychometric criteria to functional assessment in medical rehabilitation: II. Defining interval measures. Arch Phys Med Rehabil 1992;73(6):507–518.
82. Chiu YP, Fritz SL, Light KE, Velozo CA. Use of item response analysis to investigate measurement properties and clinical validity of data for the dynamic gait index. Phys Ther 2006;86(6):778–787.
83. White LJ, Velozo CA. The use of Rasch measurement to improve the Oswestry classification scheme. Arch Phys Med Rehabil 2002;83(6):822–831.
84. Wong CK, Chen CC, Welsh J. Preliminary assessment of balance with the Berg Balance Scale in adults who have a leg amputation and dwell in the community: Rasch rating scale analysis. Phys Ther 2013;93(11):1520–1529.
85. La Porta F, Giordano A, Caselli S, Foti C, Franchignoni F. Is the Berg Balance Scale an effective tool for the measurement of early postural control impairments in patients with Parkinson's disease? Evidence from Rasch analysis. Eur J Phys Rehabil Med 2015;51(6):705–716.
86. Lundgren-Nilsson A, Grimby G, Ring H, Tesio L, Lawton G, Slade A, et al. Cross-cultural validity of functional independence measure items in stroke: a study using Rasch analysis. J Rehabil Med 2005;37(1):23–31.
87. Tennant A, Pallant JF. DIF matters: a practical approach to test if Differential Item Functioning makes a difference. Rasch Meas Transactions 2007;20(4):1082–1084. Available at https://www.rasch.org/rmt/rmt204d.htm. Accessed August 7, 2018.
88. Badia X, Prieto L, Linacre JM. Differential item and test functioning (DIF & DTF). Rasch Meas Transactions 2002;16(3):889. Available at https://www.rasch.org/rmt/rmt163g.htm. Accessed August 7, 2018.
89. Holland PW, Wainer H. Differential Item Functioning. New York: Routledge, 1993.
90. Krabbe PFM. The Measurement of Health and Health Status: Concepts, Methods and Applications from a Multidisciplinary Perspective. London: Academic Press, 2017.
91. Seo DG. Overview and current management of computerized adaptive testing in licensing/certification examinations. J Educ Eval Health Prof 2017;14:17.
92. Ludlow LH, Matz-Costa C, Johnson C, Brown M, Besen E, James JB. Measuring engagement in later life activities: Rasch-based scenario scales for work, caregiving, informal helping, and volunteering. Meas Eval Counsel Dev 2014;47(2):127–149.
PART 3 Designing Clinical Research
CHAPTER 13 Choosing a Sample 180
CHAPTER 14 Principles of Clinical Trials 192
CHAPTER 15 Design Validity 210
CHAPTER 16 Experimental Designs 227
CHAPTER 17 Quasi-Experimental Designs 240
CHAPTER 18 Single-Subject Designs 249
CHAPTER 19 Exploratory Research: Observational Designs 271
CHAPTER 20 Descriptive Research 285
CHAPTER 21 Qualitative Research 297
[Figure: The Research Process in the context of evidence-based practice: 1. Identify the research question; 2. Design the study; 3. Implement the study; 4. Analyze the data; 5. Disseminate findings.]
CHAPTER 13 Choosing a Sample
Our daily lives are filled with generalizations. Cook a good steak and decide if it's done by tasting a small piece—you don't eat the whole steak. Try out a new therapy with one patient and decide if you will use it again based on its effectiveness—you don't test it on every patient. Researchers use this same principle in studying interventions, tests, and relationships. They obtain data on a group of subjects to make generalizations about what will happen to others with similar characteristics in a similar situation. To make this process work, we have to develop a plan to select appropriate individuals for a sample in a way that will allow us to extrapolate beyond that group's performance for the application of research evidence to others.
Whether developing a sampling plan for a research study or determining if published evidence is relevant for clinical practice, we need to understand how study samples are recruited and how these methods impact our ability to generalize findings. The purpose of this chapter is to describe various sampling techniques and how they influence generalizations from representative subjects to make predictions about the larger world.

■ Populations and Samples

If we wanted to be truly confident in the effectiveness of an intervention, we would have to test it on every patient in the world for whom that treatment is appropriate. This larger group to which research results will be applied is called the population. It is the aggregate of persons or objects that meet a specified set of criteria and to whom we wish to generalize results of a study. For instance, if we were interested in studying the effects of various treatments for osteoarthritis, the population of interest would be all people in the world who have osteoarthritis.
Of course, it is not practical to test every person who has osteoarthritis, nor could we access them. Working with smaller groups is generally more economical, more time efficient, and potentially more accurate because it affords better control of measurement. Therefore, through a process of sampling, a researcher chooses a subgroup of the population, called a sample. This sample serves as the reference group for estimating characteristics of and drawing conclusions about the population.

Populations are not necessarily restricted to human subjects. Researchers may be interested in studying characteristics of institutions or geographical areas, and these may be the units that define the population. In test–retest reliability studies, the "population" will consist of all possible trials and the "sample" would be the actual measurements taken. Industrial quality control studies use samples of items from the entire inventory of a particular manufacturing lot. Surveys often sample households from a population of housing units. A population can include people, places, organizations, objects, or any other unit of interest.
Target and Accessible Populations

The first step in planning a study is to identify the overall group of people to which the researcher intends to generalize findings. This universe of interest is the target population.

➤ CASE IN POINT #1
Researchers were interested in studying the association between depression and risk factors for dementia in patients with type 2 diabetes.1 They accessed data from medical records for patients with diabetes between 30 and 75 years of age who were enrolled in a large integrated healthcare system.

Because it is not possible to gain access to every person with diabetes in this age range, some portion of the target population that has a chance to be selected must be identified. This is the accessible population. It represents the people you have access to and from which the actual sample will be drawn (see Fig. 13-1).

➤ For the study of depression and dementia, the accessible population was all patients enrolled in one healthcare system who were included in the diabetes registry. The units within this population were the individual patients.

Strictly speaking, a sample can only be representative of the accessible population. For example, there may be differences in healthcare practices in different systems, and not all people with diabetes would be receiving the same type of care. Such differences can complicate generalizations to the target population. Geographic differences, such as climate, socioeconomic characteristics, or cultural patterns, are often a source of variation in samples.
When the differences between the target and accessible populations are potentially too great, it may be necessary to identify a more restricted target population. For example, we could specify that our target population is people with type 2 diabetes who are enrolled in large managed-care systems. The results of the study would then be applicable only to people meeting this criterion. Because the validity of the accessible population is not readily testable, researchers must exercise judgment in assessing the degree of similarity with the target population.

Selection Criteria

In defining the target population, an investigator must specify selection criteria that will govern who will and will not be eligible to participate. Samples should be selected in a way that will maximize the likelihood that the sample represents the population as closely as possible. Inclusion and exclusion criteria should be fully described in the methods section of a research report.

Inclusion Criteria

Inclusion criteria describe the primary traits of the target and accessible populations that will make someone eligible to be a participant. The researcher must consider the variety of characteristics present in the population in terms of clinical findings, demographics, and geographic factors, as well as whether these factors are important to the question being studied.

➤ Subjects in the diabetes study were included if they were part of the diabetes registry. The investigators identified 20,188 eligible participants with type 2 diabetes between the ages of 30 and 75 years.

Exclusion Criteria

Exclusion criteria are those factors that would preclude someone from being a subject. These factors will generally be considered potentially confounding to the results and are likely to interfere with interpretation of the findings.

➤ In the diabetes study, subjects were excluded if they had a prior diagnosis of dementia, or if they had type 1 diabetes.

[Figure 13-1: The target population contains the accessible population, from which the sample is drawn.]
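To make the logic of selection criteria concrete, criteria like those in the diabetes example can be written as an explicit eligibility check. The sketch below is illustrative only; the field names (in_registry, age, prior_dementia, diabetes_type) are hypothetical and do not come from the study's actual database.

```python
def is_eligible(patient):
    """Apply inclusion and exclusion criteria to one patient record.

    Inclusion: in the diabetes registry, aged 30 to 75.
    Exclusion: prior dementia diagnosis, or type 1 diabetes.
    Field names are hypothetical, for illustration only.
    """
    included = patient["in_registry"] and 30 <= patient["age"] <= 75
    excluded = patient["prior_dementia"] or patient["diabetes_type"] == 1
    return included and not excluded

patients = [
    {"in_registry": True, "age": 55, "prior_dementia": False, "diabetes_type": 2},
    {"in_registry": True, "age": 55, "prior_dementia": True,  "diabetes_type": 2},
    {"in_registry": True, "age": 28, "prior_dementia": False, "diabetes_type": 2},
]
# Only records meeting all inclusion criteria and no exclusion criteria remain
cohort = [p for p in patients if is_eligible(p)]
```

Writing the criteria this way mirrors how they should appear in a methods section: every reader can see exactly who is in and who is out.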
Focus on Evidence 13–1
The Framingham Heart Study

The Framingham Heart Study is one of the most prominent longitudinal cohort studies, begun in 1948 and still going on today into its third generation.2,3 Its initial focus was on arteriosclerotic and hypertensive cardiovascular disease (CVD), but has expanded greatly over the decades to include neurological,4 musculoskeletal,5,6 and functional disorders,7 as well as other conditions such as diabetes.8
The site chosen for the study was a suburban area outside of Boston, the town of Framingham, Massachusetts, a primarily middle-class white community in 1948.9 Although the investigators recognized that it would have been ideal to study several different areas, the logistics were sufficiently challenging to start with one town. Framingham was deemed to be of adequate size, with a variety of socioeconomic and ethnic subgroups, a relatively stable population, and a strong medical infrastructure.10 The town had also been the site of a tuberculosis study nearly 30 years before that had had successful participation by the townspeople.
Given the intent to describe population risk factors, the sampling plan was to select a random group of participants from the community within the age range that CVD was likely to develop, with the original intent of following them for 20 years.11 The researchers determined that they could complete 6,000 examinations every 2 years, and therefore this was their target sample size out of a town population of 10,000.
The official sampling process focused on the population aged 30 to 59 years according to the town census, divided into eight precincts. Within each precinct, lists were arranged by family size and then by address. Using a systematic sampling procedure, two of every three families were selected, and all family members within the eligible age range were invited to participate. This generated a sample of 6,507 subjects—but only 69%, or 4,469, actually came in for an examination. It turned out that respondents tended to be healthier than nonrespondents. This was considered a biased sample with a low response rate. Therefore, researchers had to use volunteers to obtain their desired sample size, which violated their random strategy.
Although the town's population has become more diverse, and later cohorts have been included to increase ethnic diversity, the Framingham investigators were well aware that the community may not have been fully representative of the U.S. population.2,9 To address concerns about potential sampling biases, there has been a concerted effort, over several decades and generations of participants, to draw comparisons with other studies that have included subjects from many areas with varied ethnic, geographic, and socioeconomic profiles.12-15 These have continued to show substantial agreement with the Framingham findings, which has supported the study's generalizability.

Courtesy of the Framingham Heart Study, National Heart, Lung, and Blood Institute.
➤ In the study of type 2 diabetes, researchers invited 29,639 patients to participate, with 20,188 consenting. Of these, 801 were excluded based on not meeting eligibility criteria, and 148 were eventually excluded because of lack of any follow-up, leaving a final study cohort of 19,239 patients.

It is important to understand why subjects choose not to participate, as these reasons can influence interpretation of results. An investigator should have a contingency plan in mind, such as recruiting subjects from different sites or expanding inclusion criteria, when the sample size falls substantially short of expectations.

In the end, the sample used for data analysis may be further reduced because of errors in data collection or dropouts for various reasons. The relevant concern is whether those who do not complete the study are different in any substantive way from those who remain, changing the profile of the final sample. Participants may differ from nonparticipants in ways that could affect the variables being studied, such as education, employment, insurance, age, gender, or health beliefs.16-18 It is especially important to know if subjects drop out because of something they experienced as part of the study and if those reasons are not randomly distributed across groups. Strategies to address these issues will be discussed in Chapter 15.

Researchers will often compare baseline and demographic characteristics of those who do and do not participate to determine if there is a difference between them as a way of estimating the effect on sample bias.

Reporting Subject Participation

A flowchart that details how participants pass through various stages of a study should be included in a research report. Figure 13-2 shows a generic chart for this purpose, with expected information about how many subjects were included or excluded at the start of the study, how many left the study (and their reasons), and finally how many completed the study. Guidelines for reporting such data have been developed by the Consolidated Standards of Reporting Trials (CONSORT) workgroup.19

See Figure 15-3 in Chapter 15 for a similar flowchart with actual study data. The CONSORT guidelines for reporting research are discussed further in Chapter 38.

Sample Size

A question about the number of subjects needed is often the first to be raised as a researcher recruits a sample. The issue of sample size is an essential one, as it directly affects the statistical power of the study. Power is the ability to find significant effects when they exist. It is influenced by several factors, including the variability within the sample, the anticipated magnitude of the effect, and the number of subjects. With a small sample, power tends to be low and a study may not succeed in demonstrating the desired effects.

Issues related to sample size and power will be addressed in Chapter 23 and subsequent chapters.

■ Sampling

Sampling designs can be categorized as probability or nonprobability methods. Probability samples are created through a process of random sampling or selection. Random is not the same as haphazard. It means that every unit in the population has an equal chance, or "probability," of being chosen. Therefore, the sample should be free of any bias and is considered representative of the population from which it was drawn. Because this process involves the operation of chance, however, there is always the possibility that a sample will not reflect the spread of characteristics in its parent population.
The selection of nonprobability samples is made by nonrandom methods. Nonprobability techniques are probably used more often in clinical research out of necessity, but we recognize that their outcomes require some caution for generalization.

HISTORICAL NOTE
Much of our knowledge of prehistoric times comes from artifacts such as paintings or burial sites created in caves nearly 40,000 years ago. Any paintings or relics made outdoors would have washed away, and so we have no record of them. Therefore, our understanding of prehistoric people is associated with caves because that is where the data still exist, not necessarily because that's where they lived most of their lives. Using cave data as representative of all prehistoric behavior is likely an example of sampling bias, what has been termed the "caveman effect."20

Sampling Error

If we summarize sample responses using averaged data, this average will most likely be somewhat different from the total population's average responses, just by chance. The difference between sample values, called statistics, and population values, called parameters, is sampling error, or sampling variation.
In most situations, of course, we don't actually know the population parameters. We use statistical procedures to estimate them from our sample statistics, always understanding there will likely be some error. The essence
Figure 13–2 Flow diagram template for subject progression in a randomized trial, including data on those who withdrew or were lost to follow-up. From the CONSORT 2010 guidelines.19
of random sampling is that these sampling differences are due to chance and are not a function of any human bias, conscious or unconscious.
Although it is not perfect, random selection affords the greatest possible confidence in the sample's validity because, in the long run, it will produce samples that most accurately reflect the population's characteristics. In non-probability samples, the probability of selection may be zero for some members of the population and unknown for other members. Consequently, the degree of sampling error cannot be estimated. This limits the ability to generalize outcomes beyond the specific sample studied.

Sampling error is a measure of chance variability between a sample and population. The term "error" in statistics does not mean a mistake but is another word for unexplained variance (see Chapter 23). This must be distinguished from bias, which is not considered a matter of chance.

Sampling Bias

To make generalizations, the researcher must be able to assume that the responses of sample members will be representative of how the population members would respond in similar circumstances. Human populations are, by nature, heterogeneous, and the variations that exist in behavioral, psychological, or physical attributes should also be present in a sample. Theoretically, a good sample reflects the relevant characteristics and variations of the population in the same proportions as they exist in the population. Although there is no way to guarantee that a sample will be representative of a population, sampling procedures can minimize the degree of bias or error in choosing a sample.
Sampling bias occurs when the individuals selected for a sample overrepresent or underrepresent certain population attributes that are related to the phenomenon under study. Such biases can be conscious or unconscious. Conscious biases occur when a sample is selected purposefully. For example, a clinician might choose only patients with minimal dysfunction to demonstrate a treatment's effectiveness, eliminating those subjects who were not likely to improve.
Unconscious biases might occur if an interviewer interested in studying attitudes of the public toward the disabled stands on a busy street corner in a downtown area and interviews people "at random," or haphazardly. The interviewer may unconsciously choose to approach only those who look cooperative on the basis of appearance, sex, or some other characteristics. Persons who do not work or shop in that area will not be represented. The conclusions drawn from such a sample cannot be useful for describing attitudes of the "general public."
The validity of generalizations made from a sample to the population depends on the method of selecting subjects. Just getting a larger sample won't necessarily control for bias. A small representative sample of 50 may be preferable to an unrepresentative sample of 1,000. Therefore, some impartial mechanism is needed to make unbiased selections.

HISTORICAL NOTE
Large samples do not necessarily mean better representation. In the 1936 presidential election, the Literary Digest predicted that Alf Landon would beat Franklin Roosevelt with 57% of the vote. The prediction was based on responses from over 2 million voters (only 23% of those actually contacted), culled from lists of magazine readers, automobile owners, and telephone directories. Roosevelt ended up winning by a 3:2 margin, but his support came from lower-income voters, most of whom did not own cars or phones! In contrast, in the 1968 presidential election, with better sampling methods, Gallup and Harris polls predicted that Richard Nixon would receive 41% to 43% of the popular vote, based on a random sample of only 2,000 voters. Nixon actually received 42.9%.21

■ Probability Sampling

In probability sampling, every member of a population has an equal opportunity, or probability, of being selected. A truly random sample is unbiased in that each selection is independent and no one member of the population has any more chance of being chosen than any other member. Several approaches are used, each with a different strategy, but all aimed at securing a representative sample (see Table 13-1).

Often used for large surveys, polls, and epidemiologic studies, probability sampling is essential for accurately describing variables and relationships in a population. It is less practical in clinical research, except when using large databases. For clinical experiments, where samples are smaller and the focus is more on comparisons or explaining relationships, researchers often do not have access to true random samples.

Simple Random Sampling

In a simple random sample, subjects are drawn from the accessible population, often taken from a listing of persons, such as membership directories or census lists. For example, in the study of type 2 diabetes, researchers had
NON-PROBABILITY METHODS
Convenience Sampling: Also called accidental sampling; subjects are chosen on the basis of their availability. For example, they may be recruited as they enter a clinic or they may volunteer through advertising.
Quota Sampling: A form of stratified sampling that is not random, where subjects are recruited to represent various strata in proportion to their number in the population.
Purposive Sampling: Subjects are hand-picked and invited to participate because of known characteristics.
Snowball Sampling: Small numbers of subjects are purposively recruited and they help to identify other potential participants through networking. This approach is used when subjects are difficult to identify, especially when dealing with sensitive topics.
… access to the full membership list of the healthcare system. Once a listing is available, a random sampling procedure can be implemented.

Suppose we wanted to choose a sample of 100 patients to study the association between depression and dementia with diabetes, taken from patient lists from 10 area hospitals. Conceptually, the simplest approach would be to place each patient's name on a slip of paper, place these slips into a container, and blindly draw 100 slips—not a very efficient process! A more realistic method involves the use of a table of random numbers, which can be found in most statistics texts or generated by computers (see Appendix A-6). These tables consist of thousands of digits (0 to 9) with no systematic order or relationship. The selections made using a table of random numbers are considered unbiased, as the order of digits is completely due to chance. More conveniently today, computer programs can be used to generate random choices from a list of participants.

This procedure involves sampling without replacement, which means that, once a subject is chosen, he is no longer eligible to be chosen again when the next random selection is made.

See the Chapter 13 Supplement for a description of the process for using a random numbers table and for generating a random sample using SPSS.

The terms random sampling (selection) and random assignment are often confused. Sampling or selection is the process of choosing participants for a study. Assignment is the allocation of participants to groups as part of a study design once they have been recruited.

Systematic Sampling
Random sampling can be a laborious technique, especially with large samples. When lists are arranged with no inherent order that would bias the sample (such as alphabetical order), an alternative approach can be used that simplifies this procedure, called systematic sampling. To use this sampling technique, the researcher divides the total number of elements in the accessible population by the number of elements to be selected.
6113_Ch13_179-191 02/12/19 4:48 pm Page 187
Stratified Random Sampling
In random and systematic sampling, the resulting distribution of characteristics of the sample can differ from that of the population from which it was drawn just by chance because each selection is made independently of all others. You could end up with an older sample, or more men than women. It is possible, however, to modify these methods to improve a sample's representativeness (and decrease sampling error) through a process called stratification.

Stratified random sampling involves identifying relevant population characteristics and partitioning members of a population into homogeneous, nonoverlapping subsets, or strata, based on these characteristics. Random samples are then drawn from each stratum to control for bias. Therefore, although subjects are differentiated on the basis of the stratified variable, all other factors should be randomly distributed across groups.

➤ In the study of depression and risk of dementia in patients with type 2 diabetes, researchers were concerned about racial/ethnic diversity and wanted their sample to reflect the distribution in their community. The researchers wanted a sample of 20,000 patients. They stratified their sample into four groups based on their accessible population, which was 23% African American, 15% Caucasian, 24% Latino, and 38% identified as Other/Unknown.

Rather than leave the distributions in the final sample to chance, they used a proportional stratified sample by first separating the population into the four categories and then drawing random samples from each one based on the proportion in the population (see Fig. 13-3). Therefore, to obtain a sample of 20,000, they chose 4,600 African Americans, 3,000 Caucasians, and so on. The resulting sample would intuitively provide a better estimate of the population than simple random sampling.

Figure 13–3 Schematic representation of proportional stratified sampling for a study of depression as a risk factor for dementia.1 [Stratified sample, N = 20,000, each stratum drawn by random selection: African American n = 4,600; Caucasian n = 3,000; Latino n = 4,800; Other/Unknown n = 7,600.]

Cluster Sampling
In many research situations, especially those involving large dispersed populations, it is impractical or impossible to obtain a complete listing of a population or to consider recruiting a random sample from the entire population.

➤ CASE IN POINT #2
The National Health and Nutrition Examination Survey (NHANES) is a program of interviews and physical examination studies designed to assess the health and nutritional status of adults and children in the United States.22 Begun in the early 1960s, the survey examines a nationally representative sample of about 5,000 persons each year in counties across the country. The NHANES gathers demographic, socioeconomic, dietary, and health-related information and collects data on medical, dental, and physiological measurements. Findings are used to determine the prevalence of major diseases and risk factors associated with health promotion and disease prevention in the U.S. population.

For a study like the NHANES and other national surveys, which seek to represent the U.S. population, a strategy is needed that will link members of the population to some already established grouping that can be sampled. One approach is called cluster sampling, which involves grouping or clustering individuals according to some common characteristic and then randomly choosing specific clusters to study. For instance, we could divide the U.S. into eight different geographic areas, each composed of several states, and then randomly choose four of these clusters. From there, we can
either sample all residents in those states or randomly choose subjects from each cluster. This approach is often used with surveys to condense the total population into a workable subset.

Multi-Stage Cluster Sampling
With large datasets, like the NHANES, a more common approach is called multistage sampling. This involves several successive phases of sampling to arrive at a representative sample (see Fig. 13-4).

➤ The NHANES uses a four-stage model. The first stage is identification of the primary sampling units, in this case counties across the country. The randomly chosen counties are divided into segments, typically city blocks, and, in stage 2, segments are randomly selected. The third stage is a random selection of households within each chosen segment, and finally individuals are randomly selected within households.

… answers the phones. For certain survey research questions, however, random-digit dialing is most useful for generating a sizable sample over a large area.

Disproportional Sampling
The benefits of stratified and cluster sampling can be nullified when the researcher knows in advance that subgroups within the population are of greatly unequal size, creating a situation where one or more groups may provide insufficient samples for making comparisons. In such situations, certain groups may be oversampled, creating what is called a disproportional sample.

➤ Over the years, the NHANES has oversampled different groups as part of its selection procedures, such as African Americans, Mexican Americans, low-income white Americans, adolescents, and persons over 60 years old.22
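The proportional allocation described for the stratified example can be sketched in Python (a minimal illustration; the ID lists are fabricated to match the reported percentages, and the function name is an assumption, not from the text):

```python
import random

def proportional_stratified_sample(strata, total_n, seed=42):
    """Draw a random sample from each stratum, sized by its population share."""
    rng = random.Random(seed)
    pop_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n_stratum = round(total_n * len(members) / pop_size)
        sample[name] = rng.sample(members, n_stratum)   # without replacement
    return sample

# Hypothetical strata mirroring the 23% / 15% / 24% / 38% distribution
strata = {
    "African American": [f"aa_{i}" for i in range(23_000)],
    "Caucasian":        [f"ca_{i}" for i in range(15_000)],
    "Latino":           [f"la_{i}" for i in range(24_000)],
    "Other/Unknown":    [f"ou_{i}" for i in range(38_000)],
}
sample = proportional_stratified_sample(strata, total_n=20_000)
# stratum sizes: 4,600 / 3,000 / 4,800 / 7,600, as in Figure 13-3
```

A disproportional sample would simply override `n_stratum` for the groups to be oversampled.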
Convenience Sampling
The most common form of nonprobability sample is a convenience sample, or accidental sample. With this method, subjects are chosen on the basis of availability.

Investigators performed a trial to study the effectiveness of an exercise intervention for decreasing fatigue severity and increasing physical activity in individuals with pulmonary arterial hypertension.24 A convenience sample was recruited from local outpatient pulmonary clinics between 2009 and 2012. Inclusion criteria included being between 21 and 82 years of age, not pregnant, and tobacco free. Several physiological and functional criteria were specified for exclusion.

Consecutive Sampling
A practical approach to convenience sampling is consecutive sampling, which involves recruiting all patients who meet the inclusion and exclusion criteria as they become available, often as they enter a clinic for care. The sample is usually defined by a time period within which patients become enrolled.

Researchers looked at the effectiveness of screening spirometry for early identification of patients with chronic obstructive pulmonary disease (COPD) among a cohort of asymptomatic smokers.25 Data were obtained by recruiting subjects consecutively as they participated in a routine screening examination. Participants had to be ≥30 years old with a smoking history of ≥5 pack-years. Exclusions included a prior diagnosis of asthma, COPD, or another chronic respiratory condition.

Use of Volunteers
The use of volunteers is also a commonly used convenience sampling method because of its expedience. Researchers who put up signs in schools or hospitals to recruit subjects with specific characteristics are sampling by this method.

A study was done to determine whether a 6-week educational program on memory training, physical activity, stress reduction, and healthy diet could improve memory performance in older adults.26 Researchers recruited a volunteer sample by posting advertisements and flyers in two continuing care retirement communities. To be eligible, participants had to be at least 62 years of age, living independently, and expressing complaints about memory loss. They were excluded if they had a history of dementia or were unable to exercise.

The major limitation of this method is the potential bias of self-selection. It is not possible to know which attributes are present in those who offer themselves as subjects, as compared with those who do not, and it is unclear how these attributes may affect the ability to generalize experimental outcomes. Those who volunteered for the memory study may be atypical of the target population in terms of motivation, prior activity level, family history, and other correlates of health. It is also not possible to know if those interested in participating were those who did or did not already experience memory problems.

Although all samples, even random samples, are eventually composed of those who participate voluntarily, those who agree to be part of a random sample were not self-selected. Therefore, characteristics of subjects in a random sample can be assumed to represent the target population. This is not necessarily a safe assumption with volunteer samples.

Quota Sampling
Nonprobability sampling can also incorporate elements of stratification. Using a technique called quota sampling, a researcher can control for the potential confounding effect of known characteristics of a population by guiding the sampling process so that an adequate number of subjects are obtained for each stratum. This approach requires that each stratum is represented in the same proportion as in the population. The difference between this method and stratification is the lack of random selection. There is an assumption that the quota sample is representative on the factors that are used as strata, but not on other factors that cannot be accounted for. Although this method still faces the potential for nonprobability biases, it does improve on the process by proportionally representing each segment of the population in the sample.

Purposive Sampling
A third nonprobability approach is purposive sampling, in which the researcher hand-picks subjects on the basis of specific criteria. The researcher may locate subjects by chart review or interview patients to determine if they fit the study. A researcher must exercise fair judgment to make this process meaningful.

Because of the health consequences of excessive alcohol use in older adults, researchers were interested in gaining an in-depth understanding of experiences of and attitudes toward support for alcohol-related health issues in people aged 50 and over.27 They designed a qualitative study that included in-depth interviews of 24 participants. Staff from several local alcohol service clinics identified clients of various ages, varied self-reported drinking habits, and both genders, and invited them to participate.

Purposive sampling is similar to convenience sampling but differs in that specific choices are made, rather
than simple availability. This approach has the same limitations to generalization as a convenience sample in that it can result in a biased sample. However, in many instances, purposive sampling can yield a sample that will be representative of the population if the investigator wisely chooses individuals who represent the spectrum of population characteristics.28

Purposive samples are commonly used in qualitative research to assure that subjects have the appropriate knowledge and will be good informants for the study (see Chapter 21).29

Snowball Sampling
Snowball sampling is a method that is often used to study sensitive topics, rare traits, personal networks, and social relationships. This approach is carried out in stages. In the first stage, a few subjects who meet selection criteria are identified and tested or interviewed. In the second stage, these subjects are asked to identify others they know who have the requisite characteristics. This process of "chain referral" or "snowballing" is continued until an adequate sample is obtained. The researcher must be able to verify the eligibility of each respondent to ensure a representative group.30,31

Snowball sampling is used extensively, for example, in studies involving individuals who are difficult to identify, such as homeless persons, substance abusers, or individuals with mental health issues.32–35 This approach also lends itself well to qualitative studies, especially when dealing with sensitive topics.36

Researchers were interested in understanding the causes and effects of postpartum depression.37 They recruited 13 mothers through a hospital network who complained of feeling depressed after giving birth. These 13 mothers assisted by referring 17 other mothers to the study. Inclusion criteria included self-identification as postpartum depressed and giving birth within the past year. Mothers were excluded if they had a former history of depression.

As this example illustrates, snowball sampling is most useful when the population of interest is rare, unevenly distributed, hidden, or hard to reach.

COMMENTARY
With all apologies to Mark Twain, without the ability to generalize, research would be a futile activity. However, generalization is an "ideal," a goal that requires a leap from a derived sample to a defined population. Even though random selection is the best way to assure a representative sample, it does require a large enough sample to truly characterize a population. The reality is that final samples are seldom truly random and our assumptions are, therefore, always flawed to some extent.

For clinical studies, particularly those focused on comparing interventions, the main purpose is not application to the population, but whether the treatment works. The primary concern is avoiding bias in comparing groups to each other, rather than comparing the sample to the population. For example, research has shown that criteria for choosing participants for a randomized controlled trial (RCT) are highly selective and, therefore, samples tend to exclude patients with comorbidities.38 This effectively eliminates many of the patients who are typically seen in a healthcare practice. From an evidence-based practice perspective, this limits the understanding of how treatments may be relevant to a particular patient. PCTs, introduced in Chapter 2, have broad inclusion and fewer exclusion criteria, with a goal of recruiting subjects who represent the full spectrum of the population for which treatment will be used.39,40 Samples can still be chosen at random, but they may have broader criteria, allowing for better generalization. The implications of RCTs and PCTs will be addressed in future chapters.
CHAPTER 14
Principles of Clinical Trials

A clinical trial is a study designed to test hypotheses about comparative effects. Trials are based on experimental designs, within which the investigator systematically introduces changes into natural phenomena and then observes the consequences of those changes. The randomized controlled trial (RCT) is the standard explanatory design, intended to support a cause-and-effect relationship between an intervention (the independent variable) and an observed response or outcome (the dependent variable). The purpose of this chapter is to describe the basic structure of clinical trials using the RCT as a model and to discuss the important elements that provide structure and control for both explanatory and pragmatic studies.

■ Types of Clinical Trials
One of the most important influences in the advancement of scientific research over the past 60 years has been the development of the clinical trial as a prospective experiment on human subjects (see Box 14-1). Clinical trials have been classified as explanatory, focusing on efficacy, and pragmatic designs, focusing on effectiveness, differing in their degree of internal control and external generalizability (see Chapter 2). The National Institutes of Health (NIH) define a clinical trial as:

A research study in which one or more human subjects are prospectively assigned to one or more interventions (which may include placebo or other control) to evaluate the effects of those interventions on health-related biomedical or behavioral outcomes.1

Trials are primarily conducted to evaluate therapies, diagnostic or screening procedures, or preventive strategies.

• Therapeutic trials examine the effect of a treatment or intervention on a particular disease or condition.

Begun in the 1970s, 25 years of clinical trials have shown that radical mastectomy is not necessary for reducing the risk of recurrence or spread of breast cancer, and that limited resection can be equally effective in terms of recurrence and mortality.2

• Diagnostic trials help to establish the accuracy of tests to identify disease states. Although these tests are often assessed using observational studies based on comparison with a reference standard, clinical trials can establish a test's clinical importance and its relative efficacy and efficiency against similar tests.3

Randomized trials have compared selective tests for suspected deep venous thrombosis (DVT), showing that different strategies can be used with equal accuracy.4

• A preventive trial evaluates whether a procedure or agent reduces the risk of developing a disease.

One of the most famous preventive trials was the field study of poliomyelitis vaccine in 1954, which covered 11 states.5 The incidence of poliomyelitis in the vaccinated group was reduced by more than 50% compared to those receiving the placebo, establishing strong evidence of the vaccine's effectiveness.
The Randomized Controlled Trial
The randomized controlled trial (RCT) is considered the gold standard for experimental research because its structure provides a rigorous balance to establish …

Figure 14–1 Basic structure of an RCT: following random assignment, an experimental group (pretest, experimental intervention, posttest) is compared with a control group (pretest, control condition, posttest), with the independent variable applied between measurements.

■ Manipulation of Variables
Manipulation of variables refers to a deliberate operation performed by the investigator, imposing a set of predetermined experimental conditions, the independent variable, on at least two groups of subjects. The experimenter manipulates the levels of the independent variable by assigning subjects to varied conditions, usually administering the intervention to one group and withholding it from another.

… experiment because the researcher is able to manipulate the assignment of treatment levels for at least one independent variable.
… groups' average scores should not be different. However, because clinical samples tend to be limited in size, random assignment could result in groups that are quite disparate on certain important properties.

Researchers often use statistical means to compare groups on initial values that are considered relevant to the dependent variable to determine if the allocation plan did balance out extraneous factors.

➤ In the cardiac study, investigators gathered data on the two treatment groups upon enrollment, looking at age, gender, type of surgery, classification of heart disease at baseline, and a variety of comorbid conditions and physiological measures. They were then able to establish that the baseline values were balanced across the two groups.

When random assignment does not successfully balance the distribution of intersubject differences, there are several design and statistical strategies that can be used (see Chapter 15). This issue is only relevant, however, if characteristics influence treatment outcomes. In that case, a stratified randomization strategy, to be described shortly, can be used.

Assigning Subjects to Groups
Once a sample has been chosen, and subjects have agreed to participate following informed consent, a randomization scheme must be implemented to assure that assignment to groups is unbiased. The process of assigning subjects to groups can be carried out in several ways (see Table 14-1).

Simple Random Assignment
In simple random assignment, every member of the population has an equal chance of being chosen for either group. Many computer programs can generate random sequences to simplify this process.

➤ Researchers for the CopenHeartVR Trial recruited subjects by approaching eligible patients 2 days following surgery to invite them to participate.12 For those who consented, baseline measures were taken within 5 days post-surgery, and patients were randomly allocated to groups using a computer-generated sequence carried out by a centralized trials registry.

With smaller samples, a table of random numbers can also be used. To illustrate this process, we will use a portion of a table, shown in Table 14-2. Suppose we have a sample of 18 subjects and we want to assign them to two treatment groups, 1 (experimental) and 2 (control). The simplest approach is to create a stack of 18 sequentially numbered envelopes, each with a blank slip of paper. We enter the table anywhere by placing our finger randomly at one spot—for this example, we will start at the second digit in row two (in red). Because we only have two groups, and digits run from 0 to 9, we will arbitrarily designate group 1 as odd and group 2 as even (ignoring zeros). Therefore, since our first number is 2, …

Table 14-2  Random Numbers Table
61415 00948 80901 36314
62758 49516 23459 23896
50235 20973 19272 66903

Specific sequences of assignments can also be designated in advance for each block. For instance, with blocks of four, subject assignments can cycle through each unique sequence: ABAB, AABB, ABBA, BAAB, and so on.15
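The odd/even rule from the worked example, and the permuted blocks of four, can be sketched in Python (a minimal illustration under the rules stated in the text; the helper names are assumptions of mine):

```python
import random
from itertools import permutations

# Row two of Table 14-2
ROW_TWO = "62758 49516 23459 23896".replace(" ", "")

def assign_by_digit_parity(digits, n_subjects):
    """Odd digit -> group 1, even digit -> group 2; zeros are skipped."""
    groups = []
    for d in digits:
        if d == "0":
            continue
        groups.append(1 if int(d) % 2 == 1 else 2)
        if len(groups) == n_subjects:
            break
    return groups

def permuted_blocks(block_size=4):
    """All unique balanced blocks (equal numbers of A and B)."""
    half = block_size // 2
    return sorted({"".join(p) for p in permutations("A" * half + "B" * half)})

def block_randomize(n_subjects, block_size=4, seed=42):
    """Chain randomly chosen balanced blocks until every subject is assigned."""
    rng = random.Random(seed)
    blocks = permuted_blocks(block_size)
    sequence = []
    while len(sequence) < n_subjects:
        sequence.extend(rng.choice(blocks))
    return sequence[:n_subjects]

# Enter the table at the second digit of row two, as in the text's example
assignments = assign_by_digit_parity(ROW_TWO[1:], 18)
# the first usable digit is 2 (even), so subject 1 falls in group 2
```

With blocks of four there are six unique balanced sequences (AABB, ABAB, ABBA, BAAB, BABA, BBAA), and every completed block keeps the two groups within two subjects of each other.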
… collected for each individual. At the point of random assignment, all residents within one home had an equal opportunity of being allocated to the intervention or control conditions. Clusters may be other types of naturally occurring groups, such as communities or clinical practice settings, that can be chosen at random to represent a large population.

Assignment by Patient Preference
Another common concern arises when participants consent to take part in a study but then withdraw after randomization when they get assigned to a treatment they do not want, or they may comply poorly. Those who get the treatment they wanted may comply better than average, biasing the treatment effect.

An alternative method is the use of a partially randomized patient preference approach, in which all participants are asked if they have a preference for one treatment condition and only patients who have no preference are randomized. The others are assigned to their preferred treatment arm.17

Researchers studied the difference between oral and topical ibuprofen for chronic knee pain in people aged 50 years and older.18 The primary outcome was the score on the Western Ontario and McMaster Universities (WOMAC) Osteoarthritis Index. Eligible patients in 26 general practices were approached and provided consent. They were then asked if they preferred one treatment or the other. Those who did not care were assigned in a randomized study and those who joined the preference arm selected their treatment. The data from the two studies were analyzed separately and compared.

Not surprisingly, the researchers found that, in the randomized study, groups looked similar at baseline, but this was not the case for the preference groups. By essentially carrying out two studies, they were able to look at a controlled situation and to examine results under more pragmatic conditions that might be considered more like a true clinical experience where patients do get to choose treatments.19 Obviously, the potential for bias must be considered with preference groups and separate analyses should be carried out.17

Randomized Consent Design
The process of random assignment in clinical trials is often seen as problematic for participants who are reluctant to give informed consent without knowing if they will be assigned to a treatment or control group. To address this issue, Marvin Zelen, a statistician, proposed an alternative approach called a randomized consent design (also called a Zelen design), which involves randomizing participants to experimental treatment or standard care prior to seeking consent and then only approaching those who will be assigned to the experimental intervention.13 Those who will receive standard care do not supply consent.

Researchers were interested in the effect of a personalized program of physical and occupational therapy on disability for patients with systemic sclerosis.20 The experimental intervention was a 1-month individualized program of standardized exercises and splinting, followed by home visits. This was compared with usual care based on physician management. The primary outcome was performance on the Health Assessment Questionnaire Disability Index (HAQ-DI) at 12 months. Patients from four clinical centers were assigned to the experimental or usual care group by an independent statistician. Only those assigned to the experimental group were informed about the study and asked to provide consent. All others received their standard care.

The researchers used a Zelen design to avoid the possibility that those assigned to usual care would feel they were getting less than optimal care and seek more rigorous exercise opportunities.

A main reason for using this approach is to limit bias in recruitment, as those who prefer one treatment may withdraw once they learn of their assignment.21 It has not been used extensively, however, and remains controversial, primarily because of ethical concerns.22,23 Zelen13 addresses this concern by suggesting that every patient expects to get the best standard of care and therefore usual care patients are receiving the therapy they would otherwise have gotten.

Run-In Period
Investigators are often concerned about participants being cooperative, motivated, and adherent to study protocols. One way to ensure an efficient trial is to establish a run-in period, during which all subjects receive a placebo. Only those who demonstrate that they can adhere to the regimen are then randomized to be part of the study.24 This approach has been useful to reduce dropout, identifying issues that can improve compliance.25 It may reduce sample size and can dilute or enhance the clinical applicability of the results of a clinical trial, depending on the patient group to whom the results will be applied.26

Allocation Concealment
An important component of a randomization scheme is allocation concealment, ensuring that group assignment is done without the knowledge of those involved in the experimental process, thereby providing another assurance that bias is not involved in group formation.
Evidence suggests that a breakdown in concealment can create a substantial bias in estimates of treatment effect.27 Concealment can be accomplished in several ways. A common method is the use of sealed envelopes that contain a participant's group assignment, described earlier, so that investigators are not involved in the assignment process. An alternative technique is the use of personnel from an external service that is separate from the research institution. They generate the sequence of assignments and call the participants so that researchers have no involvement.

➤ In the cardiac rehabilitation study, the allocation sequence was generated by the Copenhagen Trial Unit, a centralized agency that collaborates with researchers to foster clinical trials.28 They maintained the patient assignment lists, which remained concealed from the investigators for the duration of the study.

■ Control Groups
An essential design strategy for ruling out extraneous effects is the use of a control group against which the experimental group is compared. Subjects in a control group may receive a placebo or sham treatment, a standard treatment that will act as a basis of comparison for a new intervention, or no intervention at all. To be effective, the difference between the groups should only be the essential element that is the independent variable.

Inactive Controls
A placebo control is similar in every way to the experimental treatment except that it does not contain the active component that comprises the intervention's action (see Box 14-2).29 For a drug study, for example, a pill can look exactly like the actual medication but not contain the active ingredient.

A sham treatment is analogous to a placebo but is used when treatments require physical or mechanical intervention. A sham is similar to the active treatment but has a benign effect. With a mechanical device, for instance, it could be "going through the motions" of treatment application without actually turning the machine on.

For a physical activity, like an exercise program, researchers often incorporate an attention control group, which provides some type of social or educational activity for the control group, to rule out the possible impact of the exercise group's interaction or the attention that a treatment group would receive.

Wait List Controls
Another strategy is the use of a wait list control group in which the experimental treatment is delayed for the control subjects. The control subjects may be asked to refrain from any activities that would mimic the treatment and are measured as control subjects during the first part of the study. They are then scheduled to receive the treatment following completion of the study. This approach has appeal when it is considered unethical to deny treatment, as long as the delay is not of unreasonable length.33

Active Controls
The use of a placebo or sham may be unfeasible in clinical situations, for practical or ethical reasons. Therefore, clinical researchers often evaluate a new experimental treatment against active controls, which are conventional methods of care or established standards of practice. Active controls can be used to show the efficacy of a treatment that may have large placebo effects, or to demonstrate the equivalence or superiority of the new treatment compared to established methods.

Comparative effectiveness research (CER), using pragmatic trials, is based on the head-to-head comparison of two treatments, not including a placebo. CER addresses the need to establish which treatments will work better for various types of patients. Usual care is typically the "control" condition.
… between efficacy and effectiveness trials (see Table 14-3). Although varied scales have been used, each component is typically graded on a 4- or 5-point scale, scored 0 for explanatory characteristics at the center. The wider the web on the wheel, the more pragmatic the trial; the closer the web is to the hub of the wheel, the more the study has explanatory qualities (see Focus on Evidence 14-1). Scoring is based on judgments of researchers and has been evaluated for reliability.44,45 Variations of the model have also been proposed.46

■ Phases of Clinical Trials
In the investigation of new therapies, including drugs, surgical procedures, and electromechanical devices, clinical trials are classified by phases. The sequential phases of trials are intended to provide different types of information about the treatment in relation to dosage, safety, and efficacy, with increasingly greater rigor in demonstrating the intervention's effectiveness and safety (see Fig. 14-2).49 In early phases, the focus is on minimizing
Focus on Evidence 14–1
Explanatory or Pragmatic—or Somewhere in Between

The pragmatic-explanatory continuum indicator summary (PRECIS) provides a useful planning tool to help researchers assess the degree to which a study fits its intended purpose.43 The figure shows results of a PRECIS analysis performed by an interprofessional team designing a study of intervention for pain-coping skills in patients undergoing total knee arthroplasty (TKA).48 The team included investigators from biostatistics, internal medicine, orthopedic surgery, physical therapy, psychology, and rheumatology, all with varying degrees of experience in RCTs and pragmatic trials.

The study was planned as a three-arm multicenter randomized trial. Patients were to be randomized to pain-coping skills training, an educational control condition, or usual care.

The team went through three rounds of analysis. The "initial" rating by individual investigators followed the development of a tentative study design. The "ideal" rating captured each investigator's own opinion about the best design, which showed a preference for a more pragmatic design. The "final" rating (in blue) reflected consensus among the team on the final design.

The results showed that the final design was closer to an explanatory study than pragmatic, with several items close to the hub of the wheel. For example, the rating for practitioner expertise in providing the experimental treatment ended up closer to explanatory than originally intended. Similarly, the researchers proposed eligibility criteria to be more flexible, but in the final design they agreed that, because this intervention had not been studied in this population before, they needed to be more restrictive in their sampling criteria.

These researchers found this visual approach to be particularly helpful in raising awareness of specific issues to consider in finalizing the design, several of which resulted in revisions from the original plan. Importantly, the graph shows that studies are not purely explanatory or pragmatic, and consideration must be given to all aspects of a study in terms of feasibility as well as desired outcomes.

[Figure: PRECIS wheel with spokes for flexibility of the experimental intervention, flexibility of the comparison intervention, practitioner expertise (experimental and comparison), eligibility criteria, follow-up intensity, outcomes, primary analysis, participant compliance, and practitioner adherence; initial, ideal, and final ratings are plotted from hub to rim.]

Adapted from Riddle DL, Johnson RE, Jensen MP, et al. The Pragmatic-Explanatory Continuum Indicator Summary (PRECIS) instrument was useful for refining a randomized trial design: experiences from an investigative team. J Clin Epidemiol 2010;63(11):1271–1275. Figure 2, p. 1274. Used with permission.
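As a rough sketch, a PRECIS-style profile like the one in the figure can be summarized numerically. The dimension names below follow the figure, but the ratings and the 0-to-4 scale anchors are invented for illustration; published applications use their own scoring conventions.

```python
from statistics import mean

# Hypothetical PRECIS-style ratings for one trial design, one per
# spoke of the wheel, scored 0 (fully explanatory, at the hub)
# to 4 (fully pragmatic, at the rim). Ratings are invented.
ratings = {
    "flexibility of the experimental intervention": 1,
    "flexibility of the comparison intervention": 2,
    "practitioner expertise (experimental)": 1,
    "practitioner expertise (comparison)": 2,
    "eligibility criteria": 1,
    "follow-up intensity": 3,
    "outcomes": 2,
    "primary analysis": 3,
    "participant compliance": 2,
    "practitioner adherence": 2,
}

# A larger average pulls the web toward the rim (more pragmatic);
# a smaller average pulls it toward the hub (more explanatory).
score = mean(ratings.values())
print(round(score, 2))
```

A single average hides which spokes sit near the hub, which is exactly why the wheel display is useful: individual dimensions can be explanatory even when the overall profile leans pragmatic.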
risk to subjects and ensuring that resources are used responsibly. Studies progress to establish efficacy and then to demonstrate effectiveness under real-world conditions—with the ultimate goal of establishing the intervention as a new standard.

The definitions of trial phases are not clearly distinct and there can be significant overlap, depending on the type of treatment being studied. Some trials will not fit neatly into one category and many will not specify a phase in their reports. The important element is the progression of evidence that either supports adoption or merits discontinuing a treatment based on sound comparisons and analyses.

At any point in this sequence, a trial may be stopped if adverse events become too common or if the treatment's benefit is so clear that it warrants immediate implementation (see Focus on Evidence 14-2).

Trial phases were introduced as part of the description of translational research (see Fig. 2-1 in Chapter 2). It will be useful to review that information as a context for this discussion.

• Preclinical Phase

Before a new therapy is used on humans, it is tested under laboratory conditions, often using animal models, as part of preclinical or basic research. The purpose of this type of research is to explore if and how the new treatment works, and to establish its physiological properties or mechanism of action. When a new drug or device
Preclinical Research
In 2001, researchers investigated the effects of roflumilast on airway inflammation in guinea pigs and rats.53 They found that the drug demonstrated a combination of bronchodilatory, antiallergic, and anti-inflammatory properties that supported its potential efficacy in humans.

Phase I Trials
Multiple phase I studies were conducted in healthy volunteers to establish the safety and dosage constraints of the drug.54 Men and women between 18 and 45 years of age were tested in small cohorts of fewer than 20 subjects. Findings revealed no food interactions and no effect on blood pressure or lab values. Up to 2.5 mg was safe, but above 1.0 mg there were many side effects, including headache, dizziness, nausea, and diarrhea. The limit of tolerability appeared to be a 1.0 mg dose.

Phase II Trials
A phase II trial was conducted with 516 patients with COPD. Using a randomized placebo-controlled study, results showed that roflumilast at 500 µg/day significantly improved FEV1 at 24 weeks compared with placebo.55 Data continued to demonstrate safety and adverse effects of nausea.

Phase IV Trials
Phase IV studies have examined the effectiveness of roflumilast in patients with different levels of exacerbations. They found that the drug was most effective in improving lung function and reducing exacerbations in participants with frequent exacerbations and/or hospitalization episodes.57 These trials have continued to document safety in the overall population.

Figure 14–2 Progression of the development of a new drug from preclinical to phase IV trials. Roflumilast (marketed as Daliresp® in the U.S.) was approved by the FDA in 2011 for treatment of moderate to severe chronic obstructive pulmonary disease (COPD).56,58
shows promise at this phase, researchers will then request permission to begin human trials from the FDA.

• Phase I Trials
Is the Treatment Safe?
In a phase I trial, researchers work to show that a new therapy is safe. The treatment is tested on small groups of people, usually around 20 healthy volunteers, to understand its mechanisms of action, determine a safe dosage range, and identify side effects.

• Phase II Trials
Does the Treatment Work?
Once a therapy has been shown to be safe in humans, it is studied in a phase II trial to explore its efficacy by measuring relevant outcomes. These trials continue to monitor adverse effects and optimal dosage. Also done on relatively small samples of 100 to 300 patients, these trials may take several years. Subjects in these studies are patients with the condition the therapy is intended to treat.

These trials may be further classified as phase IIa to study dosage and phase IIb to focus on efficacy. These may be randomized, double-blind studies that include a placebo control, or they may focus on cohorts of patients who all receive varied dosages of a drug. Phase II trials tend to have a relatively low success rate, with only 31% moving on to a phase III trial in 2015.50,51

• Phase III Trials
How Does this Treatment Compare with Standard Care?
Phase III trials build on prior research to establish efficacy through randomized controlled studies, which may include thousands of patients in multiple sites over many years. These studies will compare the new treatment to standard care to confirm its effectiveness, monitor side effects, and continue to study varied dosages. Successful
Focus on Evidence 14–2
When Things Don't Go as Planned—for Better or Worse

When studies are scheduled to run for many years, independent data monitoring boards are tasked with reviewing data along the way to monitor for adverse effects. When evidence is accumulated that demonstrates an important health finding, these boards can make recommendations for early termination. Two prominent examples that have had significant influence on healthcare practice demonstrate the importance of this process for public health—for positive and negative reasons.

The Women's Health Initiative: Between 1993 and 1998, 16,608 postmenopausal women between 50 and 79 years of age were enrolled in the Women's Health Initiative (WHI) study of the risks and benefits associated with estrogen plus progestin therapy.59 Women were randomly assigned to a daily dose of estrogen plus progestin or to a placebo, with a primary outcome of coronary heart disease (CHD). Hip fracture was designated as a secondary outcome and invasive breast cancer was considered a primary adverse outcome. The study was scheduled to run until 2005.

An interim review in 2002 revealed that, although there was a one-third reduction in hip fracture rates, there were 20% to 40% increases in strokes, heart attacks, blood clots, and cardiovascular disease (CVD), as well as a 26% increase in the number of cases of invasive breast cancer in the group receiving hormone therapy.60 Therefore, after an average follow-up of 5.2 years, the National Heart, Lung and Blood Institute stopped the trial, stating that the evidence of overall health risks exceeded any benefits, and that this regimen should not be initiated or continued for primary prevention of CHD. The health risks for breast cancer were still observed 3 years following cessation of the study.61

The Physicians' Health Study: In 1982, 22,071 male physicians were recruited for a randomized study of the effect of aspirin and beta carotene on primary prevention of CVD and cancer.62 Subjects were assigned to one of four groups: aspirin and beta carotene taken every other day, aspirin placebo and beta carotene, aspirin and beta carotene placebo, or aspirin placebo and beta carotene placebo. The trial was planned to run until 1995.

At the end of 1987, interim review of data showed an extreme beneficial effect on nonfatal and fatal myocardial infarction, with a 44% reduction in risk, specifically among those who were 50 years of age or older.63 The effect was present at all levels of cholesterol, but greatest at lower levels. Therefore, a recommendation was made to terminate the aspirin component of the trial and participants were informed of their aspirin group assignment. Data for strokes and mortality were inconclusive because too few subjects demonstrated these endpoints. The trial continued for the beta carotene arm until the scheduled end in 1995, at which time it was established that the supplement provided neither benefit nor harm in terms of the incidence of malignant neoplasms, CVD, or death from all causes.64
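Figures such as the 44% reduction cited above are relative risk reductions, computed from event rates in the two arms. A minimal sketch of that arithmetic follows; the event counts are invented to echo the magnitude of the aspirin result, not the actual trial data.

```python
def relative_risk_reduction(events_treated, n_treated,
                            events_control, n_control):
    """Relative risk reduction: 1 - (treated event rate / control event rate)."""
    rr = (events_treated / n_treated) / (events_control / n_control)
    return 1 - rr

# Invented counts: 139 events among 11,000 treated subjects vs.
# 248 events among 11,000 controls.
rrr = relative_risk_reduction(139, 11000, 248, 11000)
print(round(rrr, 2))
```

Note that a large relative reduction can correspond to a small absolute difference in risk when events are rare, which is one reason monitoring boards weigh both.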
[Figure: a horizontal scale from –M through 0 to +M, with the non-inferiority margin at –M and values below it indicating an inferior treatment; horizontal bars show the confidence-interval ranges consistent with superiority, non-inferiority, and equivalence.]

Figure 14–3 Relationship of superiority, non-inferiority, and equivalence trials. The negative and positive margins (–M and +M) represent the confidence intervals to establish the effect of the intervention.
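The regions in Figure 14-3 amount to a decision rule applied to the confidence interval for the treatment difference. The sketch below is illustrative only; the function name and the ordering of checks are my own, not a standard analysis routine.

```python
def classify_trial(ci_lower, ci_upper, margin):
    """Classify a result from the confidence interval for the
    treatment difference (new - standard), using a symmetric
    margin M as in Figure 14-3."""
    if ci_lower > 0:
        return "superior"            # CI entirely above 0
    if ci_lower > -margin and ci_upper < margin:
        return "equivalent"          # CI entirely within (-M, +M)
    if ci_lower > -margin:
        return "non-inferior"        # only the lower bound clears -M
    return "inferior or inconclusive"

print(classify_trial(0.3, 1.2, 1.0))
print(classify_trial(-0.5, 0.5, 1.0))
print(classify_trial(-0.5, 1.5, 1.0))
print(classify_trial(-1.5, 0.2, 1.0))
```

The rule makes the asymmetry visible: non-inferiority constrains only the lower bound, whereas equivalence constrains both ends of the interval.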
➤ The primary outcomes in the cLBP study were changes in function measured on the modified Roland Morris Disability Questionnaire (RMDQ) and pain intensity on a 0 to 10 numerical scale. To demonstrate that yoga was at least as good as PT, researchers set a non-inferiority margin of 1.5 points on the RMDQ and 1 point on the pain scale, both derived by halving the MCIDs for these measures.

If you can get past the double negatives, this means that as long as the patients getting yoga had RMDQ scores that were no more than 1.5 points less than observed with PT, and pain scores no more than 1 point less than with PT, yoga would be considered non-inferior. The data showed that both yoga and PT were comparable, and both were better than self-care education.

Researchers try to be conservative about the specified margin to avoid the possibility of concluding that a new treatment is non-inferior when it is actually worse than standard care. This is why the preferred margin will be smaller than the MCID or documented effect sizes. Because the non-inferiority trial is only concerned about whether the new treatment performs no worse than the standard, the analysis will focus on one direction only.

For further discussion of statistical considerations in non-inferiority trials, see Chapter 23. MCID is discussed further in Chapter 32. For those interested in exploring analysis issues for non-inferiority trials, several good references are available.66,72,74–76

Equivalence Trials

A third type of trial is called an equivalence trial, which is focused on showing that the new treatment is no better and no worse than the standard therapy. This approach is used less often, primarily in what is called a bioequivalence trial, in which a generic drug is compared to the original commercial drug to show that they have the same pharmacokinetic properties.76 These trials are not considered appropriate for therapeutic studies.74

Non-inferiority and equivalence trials are sometimes lumped under one classification as "equivalence trials" because they both focus on no difference between treatments. However, they are actually distinct and will use different analysis strategies.
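That one-directional logic can be sketched with a simple normal-approximation calculation. The 1.5-point margin echoes the RMDQ example above, but the group means and standard error are invented; this is a sketch of the decision rule, not the yoga trial's actual analysis.

```python
def noninferiority_check(mean_new, mean_std, se_diff, margin, z=1.645):
    """One-sided non-inferiority check on a mean difference.

    Non-inferiority is declared when the lower bound of the one-sided
    95% interval for (new - standard) lies above -margin, i.e., the
    new treatment is at worst only trivially worse than the standard.
    """
    diff = mean_new - mean_std
    lower_bound = diff - z * se_diff
    return lower_bound, lower_bound > -margin

# Hypothetical RMDQ improvement scores (higher = more improvement).
lower, noninferior = noninferiority_check(
    mean_new=3.9,   # invented mean improvement with the new treatment
    mean_std=4.1,   # invented mean improvement with the standard
    se_diff=0.4,    # invented standard error of the difference
    margin=1.5,     # margin from the cLBP example in the text
)
print(round(lower, 3), noninferior)
```

Here the observed difference slightly favors the standard, but because the lower bound stays above −1.5, the new treatment would be judged non-inferior under these assumed numbers.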
COMMENTARY

The rules of the "research road" for studies of interventions make clear the need to control for the effect of extraneous factors that can bias results. Clinicians understand, however, that such controls will often limit the ability to generalize findings to real patients and real treatment situations. This is especially true in rehabilitation contexts, for several reasons. The need for controls is problematic, as it would be unethical to deny adaptive activity to patients who require rehabilitation.77 Another challenge is the inherent patient–clinician interaction that influences intervention and outcome, as well as the use of combinations of approaches that are used to develop treatment regimens.78 In addition, treatments are not always protocol driven, requiring flexibility to meet unique patient needs. These issues have led to concerns about the feasibility of clinical trials in rehabilitation fields. Defining the active element of these interventions is much more difficult than characterizing a pharmacologic or surgical intervention. This presents a dilemma because clinical trials should allow replication of effectiveness by others.

Clinical trials are, however, necessary to build an evidence base for interventions and the use of this design is becoming more prevalent in rehabilitation.79 Clinicians must consider the need to build treatments according to theory so that protocols can be designed according to standardized guidelines that would make them reproducible.80–82 Although the specifics of the intervention may be personalized, the decision-making process can be made clear by delineating the rationale upon which treatments are based.83 Using the International Classification of Functioning, Disability and Health (ICF) provides a consistent framework for delineating variables and choosing interventions and outcome measures.84 Clinical trials will not be appropriate to answer every clinical question, but the PCT provides an important alternative to account for the real-time differences that will be encountered in clinical practice.
17. Torgerson D, Sibbald B. Understanding controlled trials: what is a patient preference trial? BMJ 1998;316(7128):360.
18. Underwood M, Ashby D, Cross P, Hennessy E, Letley L, Martin J, et al. Advice to use topical or oral ibuprofen for chronic knee pain in older people: randomised controlled trial and patient preference study. BMJ 2008;336(7636):138–142.
19. Sedgwick P. What is a patient preference trial? BMJ 2013;347.
20. Rannou F, Boutron I, Mouthon L, Sanchez K, Tiffreau V, Hachulla E, et al. Personalized physical therapy versus usual care for patients with systemic sclerosis: a randomized controlled trial. Arthritis Care Res 2017;69(7):1050–1059.
21. Adamson J, Cockayne S, Puffer S, Torgerson DJ. Review of randomised trials using the post-randomised consent (Zelen's) design. Contemp Clin Trials 2006;27(4):305–319.
22. Homer CS. Using the Zelen design in randomized controlled trials: debates and controversies. J Adv Nurs 2002;38(2):200–207.
23. Torgerson DJ, Roland M. What is Zelen's design? BMJ 1998;316(7131):606.
24. Stanley K. Design of randomized controlled trials. Circulation 2007;115(9):1164–1169.
25. Fukuoka Y, Gay C, Haskell W, Arai S, Vittinghoff E. Identifying factors associated with dropout during prerandomization run-in period from an mHealth Physical Activity Education study: the mPED trial. JMIR Mhealth Uhealth 2015;3(2):e34.
26. Pablos-Mendez A, Barr RG, Shea S. Run-in periods in randomized trials: implications for the application of results in clinical practice. JAMA 1998;279(3):222–225.
27. Schulz KF, Grimes DA. Allocation concealment in randomised trials: defending against deciphering. Lancet 2002;359(9306):614–618.
28. Centre for Clinical Intervention Research: Copenhagen Trial Unit. Available at http://www.ctu.dk/. Accessed March 24, 2017.
29. Sedgwick P, Hooper C. Placebos and sham treatments. BMJ 2015;351:h3755.
30. Kaptchuk TJ, Miller FG. Placebo effects in medicine. N Engl J Med 2015;373(1):8–9.
31. WMA Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects. Amended October, 2013. Available at http://www.wma.net/en/30publications/10policies/b3/. Accessed March 27, 2017.
32. D'Agostino RB, Sr., Massaro JM, Sullivan LM. Non-inferiority trials: design concepts and issues – the encounters of academic consultants in statistics. Stat Med 2003;22(2):169–186.
33. Elliott SA, Brown JS. What are we doing to waiting list controls? Behav Res Ther 2002;40(9):1047–1052.
34. Butchart EG, Gohlke-Barwolf C, Antunes MJ, Tornos P, De Caterina R, Cormier B, et al. Recommendations for the management of patients after heart valve surgery. Eur Heart J 2005;26(22):2463–2471.
35. Food and Drug Administration: Guidance for Industry. E9: Statistical Principles for Clinical Trials. September, 1998. Available at https://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm073137.pdf. Accessed March 23, 2017.
36. Moher D, Hopewell S, Schulz KF, Montori V, Gotzsche PC, Devereaux PJ, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c869.
37. Cochrane Bias Methods Group: Assessing risk of bias in included studies. Available at http://methods.cochrane.org/bias/assessing-risk-bias-included-studies. Accessed March 24, 2017.
38. Beyer-Westendorf J, Buller H. External and internal validity of open label or double-blind trials in oral anticoagulation: better, worse or just different? J Thromb Haemost 2011;9(11):2153–2158.
39. Horn SD, Gassaway J. Practice-based evidence study design for comparative effectiveness research. Med Care 2007;45(10 Suppl 2):S50–S57.
40. Glasgow RE, Magid DJ, Beck A, Ritzwoller D, Estabrooks PA. Practical clinical trials for translating research to practice: design and measurement recommendations. Med Care 2005;43(6):551–557.
41. Tunis SR, Stryer DB, Clancy CM. Practical clinical trials: increasing the value of clinical research for decision making in clinical and health policy. JAMA 2003;290(12):1624–1632.
42. Bolland MJ, Grey A, Gamble GD, Reid IR. Concordance of results from randomized and observational analyses within the same study: a re-analysis of the Women's Health Initiative Limited-Access Dataset. PLoS One 2015;10(10):e0139975.
43. Thorpe KE, Zwarenstein M, Oxman AD, Treweek S, Furberg CD, Altman DG, et al. A pragmatic-explanatory continuum indicator summary (PRECIS): a tool to help trial designers. CMAJ 2009;180(10):E47–E57.
44. Glasgow RE, Gaglio B, Bennett G, Jerome GJ, Yeh HC, Sarwer DB, et al. Applying the PRECIS criteria to describe three effectiveness trials of weight loss in obese patients with comorbid conditions. Health Serv Res 2012;47(3 Pt 1):1051–1067.
45. Selby P, Brosky G, Oh PI, Raymond V, Ranger S. How pragmatic or explanatory is the randomized, controlled trial? The application and enhancement of the PRECIS tool to the evaluation of a smoking cessation trial. BMC Med Res Methodol 2012;12:101.
46. Loudon K, Treweek S, Sullivan F, Donnan P, Thorpe KE, Zwarenstein M. The PRECIS-2 tool: designing trials that are fit for purpose. BMJ 2015;350:h2147.
47. Gartlehner G, Hansen RA, Nissman D, Lohr KN, Carey TS. Criteria for distinguishing effectiveness from efficacy trials in systematic reviews. Technical Review 12. AHRQ Publication No. 06-0046. Rockville, MD: Agency for Healthcare Research and Quality. April, 2006. Available at https://www.ncbi.nlm.nih.gov/books/NBK44029/pdf/Bookshelf_NBK44029.pdf. Accessed March 27, 2017.
48. Riddle DL, Johnson RE, Jensen MP, Keefe FJ, Kroenke K, Bair MJ, et al. The Pragmatic-Explanatory Continuum Indicator Summary (PRECIS) instrument was useful for refining a randomized trial design: experiences from an investigative team. J Clin Epidemiol 2010;63(11):1271–1275.
49. US National Library of Medicine: FAQ: ClinicalTrials.gov - Clinical Trial Phases. Available at https://www.nlm.nih.gov/services/ctphases.html. Accessed March 23, 2017.
50. Wechsler J: Low success rates persist for clinical trials. May 27, 2016. Applied Clinical Trials. Available at http://www.appliedclinicaltrialsonline.com/low-success-rates-persist-clinical-trials. Accessed March 24, 2017.
51. Harrison RK. Phase II and phase III failures: 2013–2015. Nat Rev Drug Discov 2016;15(12):817–818.
52. Sonnad SS, Mullins CD, Whicher D, Goldsack JC, Mohr PE, Tunis SR. Recommendations for the design of Phase 3 pharmaceutical trials that are more informative for patients, clinicians, and payers. Contemp Clin Trials 2013;36(2):356–361.
53. Bundschuh DS, Eltze M, Barsig J, Wollin L, Hatzelmann A, Beume R. In vivo efficacy in airway disease models of roflumilast, a novel orally active PDE4 inhibitor. J Pharmacol Exp Ther 2001;297(1):280–290.
54. Center for Drug Evaluation and Research: Application Number 022522Orig1s000, 2011. Clinical Pharmacology and Biopharmaceutics Reviews. Available at https://www.accessdata.fda.gov/drugsatfda_docs/nda/2011/022522Orig1s000ClinPharmR.pdf. Accessed March 26, 2017.
55. Roflumilast: APTA 2217, B9302-107, BY 217, BYK 20869. Drugs R D 2004;5(3):176–181.
56. Wagner LT, Kenreigh CA. Roflumilast: the evidence for its clinical potential in the treatment of chronic obstructive pulmonary disease. Core Evidence 2005;1(1):23–33.
57. Martinez FJ, Rabe KF, Sethi S, Pizzichini E, McIvor A, Anzueto A, et al. Effect of Roflumilast and Inhaled Corticosteroid/Long-Acting beta2-Agonist on Chronic Obstructive Pulmonary Disease Exacerbations (RE(2)SPOND). A randomized clinical trial. Am J Respir Crit Care Med 2016;194(5):559–567.
58. Food and Drug Administration: Highlights of Prescribing Information: DALIRESP® (roflumilast) tablets. Available at https://www.accessdata.fda.gov/drugsatfda_docs/label/2015/022522s006lbl.pdf. Accessed March 26, 2017.
59. Writing Group for the Women's Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women's Health Initiative randomized controlled trial. JAMA 2002;288(3):321–333.
60. NIH News Release. July 9, 2002. NHLBI stops trial of estrogen plus progestin due to increased breast cancer risk, lack of overall benefit. Available at https://www.nhlbi.nih.gov/whi/pr_02-7-9.pdf. Accessed March 24, 2017.
61. Heiss G, Wallace R, Anderson GL, Aragaki A, Beresford SA, Brzyski R, et al. Health risks and benefits 3 years after stopping randomized treatment with estrogen and progestin. JAMA 2008;299(9):1036–1045.
62. Hennekens CH, Eberlein K. A randomized trial of aspirin and beta-carotene among U.S. physicians. Prev Med 1985;14(2):165–168.
63. Steering Committee of the Physicians' Health Study Research Group. Final report on the aspirin component of the ongoing Physicians' Health Study. N Engl J Med 1989;321(3):129–135.
64. Hennekens CH, Buring JE, Manson JE, Stampfer M, Rosner B, Cook NR, et al. Lack of effect of long-term supplementation with beta carotene on the incidence of malignant neoplasms and cardiovascular disease. N Engl J Med 1996;334(18):1145–1149.
65. Hills RK. Non-inferiority trials: No better? No worse? No change? No pain? Br J Haematol 2017;176(6):883–887.
66. Kaul S, Diamond GA. Good enough: a primer on the analysis and interpretation of noninferiority trials. Ann Intern Med 2006;145(1):62–69.
67. Hayden JA, van Tulder MW, Malmivaara AV, Koes BW. Meta-analysis: exercise therapy for nonspecific low back pain. Ann Intern Med 2005;142(9):765–775.
68. Wieland LS, Skoetz N, Pilkington K, Vempati R, D'Adamo CR, Berman BM. Yoga treatment for chronic non-specific low back pain. Cochrane Database Syst Rev 2017;1:CD010671.
69. Saper RB, Lemaster C, Delitto A, Sherman KJ, Herman PM, Sadikova E, et al. Yoga, physical therapy, or education for chronic low back pain: a randomized noninferiority trial. Ann Intern Med 2017;167(2):85–94.
70. Snapinn SM. Noninferiority trials. Curr Control Trials Cardiovasc Med 2000;1(1):19–21.
71. Wiens BL. Choosing an equivalence limit for noninferiority or equivalence studies. Control Clin Trials 2002;23(1):2–14.
72. Jones B, Jarvis P, Lewis JA, Ebbutt AF. Trials to assess equivalence: the importance of rigorous methods. BMJ 1996;313(7048):36–39.
73. Pinto VF. Non-inferiority versus equivalence clinical trials in assessing biological products. Sao Paulo Med J 2011;129(3):183–184.
74. Lesaffre E. Superiority, equivalence, and non-inferiority trials. Bull NYU Hosp Jt Dis 2008;66(2):150–154.
75. Head SJ, Kaul S, Bogers AJJC, Kappetein AP. Non-inferiority study design: lessons to be learned from cardiovascular trials. Eur Heart J 2012;33(11):1318–1324.
76. Hahn S. Understanding noninferiority trials. Korean J Pediatr 2012;55(11):403–407.
77. Fuhrer MJ. Overview of clinical trials in medical rehabilitation: impetuses, challenges, and needed future directions. Am J Phys Med Rehabil 2003;82(10 Suppl):S8–S15.
78. Whyte J. Clinical trials in rehabilitation: what are the obstacles? Am J Phys Med Rehabil 2003;82(10 Suppl):S16–S21.
79. Frontera WR. Clinical trials in physical medicine and rehabilitation. Am J Phys Med Rehabil 2015;94(10 Suppl 1):829.
80. Hart T, Tsaousides T, Zanca JM, Whyte J, Packel A, Ferraro M, et al. Toward a theory-driven classification of rehabilitation treatments. Arch Phys Med Rehabil 2014;95(1 Suppl):S33–S44.e2.
81. Silverman WA. Human Experimentation: A Guided Step into the Unknown. New York: Oxford University Press, 1985.
82. Whyte J. Contributions of treatment theory and enablement theory to rehabilitation research and practice. Arch Phys Med Rehabil 2014;95(1 Suppl):S17–S23.e2.
83. Johnston MV. Desiderata for clinical trials in medical rehabilitation. Am J Phys Med Rehabil 2003;82(10 Suppl):S3–S7.
84. VanHiel LR. Treatment and enablement in rehabilitation research. Arch Phys Med Rehabil 2014;95(1 Suppl):S88–S90.
CHAPTER 15
Design Validity
[Figure: a stack of four questions, one per type of validity. Statistical conclusion validity: Is there a relationship between the independent and dependent variables? Internal validity: Is there evidence of a causal relationship between independent and dependent variables? Construct validity: To what constructs can results be generalized? External validity: Can the results be generalized to other persons, settings, or times?]

Figure 15–1 Four types of design validity. Each form is cumulatively dependent on the components below it.

factor when studies show no differences. The converse can also be true, when very large samples are used. In that case statistical differences may be found, even though they may not be meaningful.

• Violated Assumptions of Statistical Tests. Most statistical procedures are based on a variety of assumptions about levels of measurement and variance in the data or sample from which they are collected. If these assumptions are not met, statistical outcomes may lead to erroneous inferences.

• Reliability and Variance. Statistical conclusions are threatened by any extraneous factors that increase variability within the data, such as unreliable measurement, failure to standardize the protocol, environmental interferences, or heterogeneity of subjects.

• Failure to Use Intention to Treat (ITT) Analysis. When data are missing or subjects do not adhere to random group assignments, analysis procedures are designed to maintain the important element of randomization. This concept will be discussed later in this chapter.

Validity threats that impact design are often given the most press when we evaluate studies, focusing on concerns such as randomization, control groups, and operational definitions. The appropriate use of statistics is often assumed, perhaps because of a lack of understanding of statistical concepts. However, statistical outcomes may be erroneous or misleading if not based on valid procedures, potentially with serious implications for how conclusions are drawn. Evidence must be understood from a variety of perspectives, and statistical conclusion validity cannot be ignored.
Attrition, also called experimental mortality, is of concern when it results in a differential loss of subjects; that is, dropouts occur for specific reasons related to the experimental situation.

Studies with long follow-up periods are especially prone to attrition. Researchers may be able to compare pretest scores for those who remain and those who drop out to determine if there is a biasing effect. When subjects drop out for nonrandom reasons, the balance of individual characteristics that was present with random assignment may be compromised.

➤ Over the 3-month follow-up period in the AVERT trial, 21 patients dropped out, 6 for unknown reasons and 15 who refused to continue participation. There was no indication that their decisions were related to their group assignment.

Attrition is one of the few internal validity threats that cannot be controlled by having a comparison group because it can occur differentially across groups. This threat must be addressed through ITT analysis.

Testing
Did the pretest affect scores on the posttest?

Testing effects concern the potential effect of pretesting or repeated testing on the dependent variable. In other words, the mere act of collecting data changes the response that is being measured. Testing effects can refer to improved performance or increased skill that occurs because of familiarity with measurements.

For example, in educational tests, subjects may actually learn information by taking a pretest, thereby changing their responses on a posttest, independent of any instructional intervention. If a coordination test is given before and after therapy, patients may get higher scores on the posttest because they were able to practice the activity during the pretest. Testing effects also refer to situations where the measurement itself changes the dependent variable. For instance, if we take repeated measurements of range of motion, the act of moving the joint through range to evaluate it may actually stretch it enough to increase its range.

Reactive effects can occur when a testing process influences the response, rather than acting as a pure record of behavior. This can occur, for instance, when taping a session if the subject's responses are affected by knowing

and posttest. For example, if assessors in the AVERT trial were not equally trained in applying the mRS, the scores could be unreliable.

Changes that occur in calibration of hardware or shifts in criteria used by human observers can affect the magnitude of the dependent variable, independent of any treatment effect. Mechanical and bioelectronic instruments can threaten validity if linearity and sensitivity are not constant across the full range of responses and over time. This threat to internal validity can also bias results if the measurements are not the same for all groups, also called detection bias. These threats can be addressed by calibration and documenting test–retest and rater reliability to assure that measurement issues affect both groups equally.

Regression to the Mean
Is there evidence of regression from pretest to posttest scores?

Regression to the mean is also associated with reliability of a test (see Chapter 9). When there is substantial measurement error, there is a tendency for extreme scores on the pretest to regress toward the mean on the posttest; that is, extreme scores tend to become less extreme over time. This effect may occur even in the absence of an intervention effect. Statistical regression is of greatest concern when individuals are selected on the basis of extreme scores.

The amount of statistical regression is directly related to the degree of measurement error in the dependent variable. This can occur when the baseline and final scores are not correlated, when high baseline scores match to lower final scores, and vice versa. This effect is best controlled by including a control group in a study.

HISTORICAL NOTE
The term regression isn't intuitive. It was introduced by the English scientist Sir Francis Galton (1822–1911), who was an explorer, geographer, meteorologist, psychologist, inventor of fingerprint identification—and a statistician. With an interest in heredity, he used a linear equation to describe the relationship between heights of parents and children. He noticed that very tall fathers tended to have sons who were shorter than they were and very short fathers tended to have sons who were taller than they were.4 He called this phenomenon "regression towards mediocrity," which has become known as regression to the mean.
they are being recorded.
Instrumentation Selection
Did the dependent variable change because of how it was Is there bias in the way subjects have been assigned to experi-
measured? mental groups?
Instrumentation effects are concerned with the reli- The ability to assign subjects to groups at random is
ability of measurement. Observers can become more ex- an essential strategy to have confidence that the primary
perienced and skilled at measurement between a pretest difference between the intervention and control groups
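Regression toward the mean, as described above, is easy to demonstrate with a short simulation. The following Python sketch uses entirely hypothetical numbers (an unreliable test with true scores centered at 50): subjects selected for extreme pretest scores move back toward the mean on retest with no intervention at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores: a stable "true" score plus random measurement
# error, so the test is unreliable (pretest-posttest r of about 0.5).
n = 10_000
true_score = rng.normal(50, 10, n)
pretest = true_score + rng.normal(0, 10, n)
posttest = true_score + rng.normal(0, 10, n)  # new error draw, no treatment

# Select only subjects with extreme pretest scores (top 10%).
extreme = pretest >= np.quantile(pretest, 0.90)

# The selected group's posttest mean falls back toward the overall
# mean of 50, even though no intervention was given.
print(pretest[extreme].mean(), posttest[extreme].mean())
```

Because the simulated study has a comparison group in effect (the whole sample), the fallback is visible as selection artifact rather than a treatment effect, which is exactly why a control group protects against this threat.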
6113_Ch15_210-226 02/12/19 4:47 pm Page 214
Selection
Is there bias in the way subjects have been assigned to experimental groups?
The ability to assign subjects to groups at random is an essential strategy to have confidence that the primary difference between the intervention and control groups is the treatment. The threat of selection refers to factors other than the experimental intervention that can influence outcomes when groups are not comparable at baseline. This leaves open the potential for threats to internal validity that may affect the groups differentially.
Selection can interact with all the other threats to internal validity. For example, selection–history effects may occur when experimental groups have different experiences between the pretest and posttest. Selection–maturation effects occur when the experimental groups experience maturational change at different rates. Similar interactions can occur with attrition, testing, instrumentation, and regression.
Selection threats become an issue when intact groups are used, when subjects self-select groups, or when the independent variable is an attribute variable. Designs that do not allow random assignment are considered quasi-experimental, in that differences between groups cannot be balanced out (see Chapter 17). There may be several reasons that subjects belong to one group over another, and these factors can contribute to differences in their performance. Researchers can exert some degree of control for this effect when specific extraneous variables can be identified that vary between the groups. The strategy of matched pairs or the analysis of covariance (ANCOVA), to be discussed shortly, may be able to address these initial group differences.

Social Threats
Research results are often affected by the interaction of subjects and investigators. Social threats to internal validity refer to the pressures that can occur in research situations that may lead to differences between groups. These threats have also been called performance bias, occurring because those involved are aware of the other groups' circumstances or are in contact with one another.

Diffusion or Imitation of Treatments
Are subjects receiving interventions as intended?
Sometimes an intervention involves information or activities that are not intended to be equally available to all groups. Because the nature of many interventions makes blinding impractical, control subjects are often aware of the interventions intended for another group and may attempt to change their behaviors accordingly.5 This would diffuse the treatment effect and make it impossible to distinguish the behaviors of the two groups (see Focus on Evidence 15-1).
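The protective value of random assignment against selection threats can also be illustrated with a small, purely hypothetical Python sketch: randomly splitting a subject pool balances an extraneous characteristic such as height on average, whereas self-selection builds the bias into the groups.

```python
import random
import statistics

random.seed(1)

# Hypothetical pool of 200 subjects with one extraneous trait (height, cm).
heights = [random.gauss(170, 10) for _ in range(200)]

# Random assignment: shuffle the pool and split it in half.
shuffled = heights[:]
random.shuffle(shuffled)
rand_a, rand_b = shuffled[:100], shuffled[100:]

# Self-selection: suppose taller subjects tend to choose group A.
chose_a = [h for h in heights if h > 170]
chose_b = [h for h in heights if h <= 170]

rand_gap = abs(statistics.mean(rand_a) - statistics.mean(rand_b))
self_gap = abs(statistics.mean(chose_a) - statistics.mean(chose_b))
print(rand_gap, self_gap)  # random gap is small; self-selected gap is large
```

With many extraneous variables at once, the same logic applies: randomization balances them all in expectation, which is what quasi-experimental designs cannot guarantee.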
Time Frame
The timing of a study is important in determining the
effect of intervention. How long a treatment lasts, when
the treatment is begun relative to a patient’s diagnosis,
and how long patients are followed are examples of
elements that must be defined for a study. For example,
in the AVERT trial, patients were treated within the
first 2 weeks and then followed for 3 months.
Basing decisions on short-term data can be mislead-
ing if the observed effects are not durable or if they are
slow to develop. For instance, if in the AVERT trial
follow-up was only carried out for 1 month, there would
have been no way to determine if the outcome would
have been different if carried out over a longer period.
The construct validity of the independent variable must
include reference to the time frame used for data collection.

Multiple Treatment Interactions
Construct validity is also affected when a study involves the administration of multiple treatments. Generalization is limited by the possibility of multiple-treatment interaction, creating carryover or combined effects. When treatment components are combined, the researcher cannot generalize findings to a situation where only a single treatment technique is used.
Multiple treatment interactions may be of particular concern when complex combinations of techniques are used, as is often the case in rehabilitation. Interpretations of findings must take these potential interactions into account, taking care to base conclusions on the full operational definitions of the variables being studied. This can be an issue when one component of a treatment is examined in isolation, especially if the potential influence of other components on the outcome is ignored.

Experimental Bias
A fourth aspect of construct validity concerns biases that are introduced into a study by expectations of either the subjects or the investigators. Subjects often try their best to fulfill the researcher's expectations or to present themselves in the best way possible; therefore, responses are no longer representative of natural behavior (see Box 15-1).
Those who provide intervention or measure outcomes may also have certain expectations that can influence how subjects respond. Researchers may react more positively to subjects in the experimental group or give less attention to those in the control group because of an emotional or intellectual investment in their hypothesis.
Rosenthal16 described several types of experimenter effects in terms of the experimenter's active behavior and interaction with the subject, such as verbal cues and smiling, and passive behaviors, such as those related to appearance. This threat to construct validity can be avoided by employing testers and treatment providers who are blinded to subject assignment and the research hypothesis.

Box 15-1 A Myth Too Good to Be Untrue
From 1924 to 1927, a series of experiments were run at the Hawthorne plant of the Western Electric Company, located outside of Chicago.21,22 Managers were interested in studying how various levels of illumination affected workers' output. The primary test group consisted of five women who worked in the relay assembly test room. Their task was to complete relay switches and drop them down a chute that recorded their productivity—every 40 to 50 seconds, 8 hours a day.23
What they found was that whether they lowered or raised lights, the workers increased production. Further studies involved introducing changes in work schedules, coffee breaks, shorter work week, temperature, even wages, all resulting in better output, even when conditions were made worse or later returned to original schedules! No matter what they did, productivity in regular and test groups increased.24 This phenomenon became widely known as the Hawthorne effect, which is defined as the tendency of persons who are singled out for special attention to perform better merely because they know they are being observed.

■ External Validity
Can the results be generalized to persons, settings, and times that are different from those employed in the experimental situation?
External validity refers to the extent to which the results of a study can be generalized beyond the internal specifications of the study sample. Whereas internal validity is concerned specifically with the relationship between the independent and dependent variables within a specific set of circumstances, external validity is concerned with the usefulness of that information outside of the experimental situation. The generalizability of a study is primarily related to the specific patient context and conditions under investigation.29 Threats to external validity involve the interaction of treatment with the type of subjects tested, the setting in which the experiment is carried out, or the time in history when the study is done.

Influence of Selection
One of the major goals of clinical research is to apply results to a target population, the individuals who are not experimental subjects but who are represented by them (see Chapter 13). If subjects are sampled according to specific characteristics, those characteristics define the target population. For instance, subjects may be restricted to a limited age range, one gender, a specific diagnosis, or a defined level of function. Because patient characteristics and eligibility requirements can vary, generalizability will
depend on how closely the study sample represents the clinical situation.
External validity is threatened when documented cause-and-effect relationships do not apply across subdivisions of the target population; that is, when specific interventions result in differential treatment effects, depending on the subject's characteristics. Studies are especially vulnerable to this threat when volunteers are used as subjects. Those who choose to volunteer may do so because of certain personal characteristics that ultimately bias the sample. When studies demonstrate conflicting results, it is often because of the differences within the accessible populations.

Adherence
A related effect can occur if subjects do not comply with the experimental protocol. A study that is internally valid may still be compromised in relation to external validity under these circumstances. Researchers should examine adherence as they interpret findings to determine if results are realistic and have clinical applicability.30
From a sampling perspective, some investigators screen potential subjects prior to randomizing them to groups to eliminate those who are not adherent (see Chapter 14). For example, the Physicians' Health Study, which investigated the effect of aspirin and beta carotene in the prevention of ischemic heart disease, used an 18-week "run-in period" and eliminated 33% of the subjects from the final study based on non-compliance.31 This practice may help to build a compliant study sample, but it may also dilute the external validity of the findings. It is the researcher's responsibility to evaluate its potential effect in demonstrating the applicability of the findings.32

Influence of Setting
…sites that will have varied practice cultures, healthcare systems, and educational structures.

Ecological Validity
In an extension of external validity, ecological validity refers to the generalizability of study findings to real-world conditions, societal and cultural norms, and the health of populations. Pragmatic clinical trials, which are intended to be carried out in natural clinical settings, will have stronger ecological validity than studies that are done under controlled clinical conditions.

Influence of History
This threat to external validity concerns the ability to generalize results to different periods of time. It is actually an important consideration in evidence-based practice (EBP) because older studies may have limited application based on contemporary practices, such as reimbursement policies, available drugs, newer surgical procedures, updated guidelines, or changes in theoretical foundations for practice.

➤ Researchers in the AVERT study faced a dilemma regarding early mobilization practices that had shifted over time in practice guidelines. Therefore, clinicians in the control group had gotten many patients moving earlier, mimicking the experimental intervention.

This does not mean that older studies have no value. Their findings may indeed stand up to the test of time, or they may provide important background to how treatments or theories have changed. Clinical judgment and expertise are important for interpreting the applicability of such studies.
➤ If we think males and females will respond differently to the wearing of the orthoses, we can choose only male subjects for our sample. We could also control for height by choosing only subjects who fit within a certain height range.

Once a homogeneous group of subjects is selected, those subjects can be randomly assigned to treatment conditions. Because there is no variability on that variable, the effects of the potential confounder are essentially eliminated. The major disadvantage of this approach is that the research findings can be generalized only to the type of subjects who participate in the study, in this case to men of a certain height. This will limit external validity and application of results.

Matching also limits interpretation of research findings because the differential effect of the matching variables cannot be analyzed. For example, if we match subjects on height, then we cannot determine if the effect of treatment is different across height ranges. For most clinical studies, matching is not recommended when other methods of controlling extraneous variables are appropriate and practical.
When a researcher wants to control for several variables in a design, matching is more complicated. A technique of matching using propensity scores is often used in exploratory studies to match subjects on a set of potential confounders. This process uses multivariate regression
modeling to develop a single combined score that can then be used as the measure of confounding to match subjects. This technique will be discussed further in Chapter 19.

Repeated Measures
Research designs can be structured to facilitate comparisons between independent groups of subjects, or they may involve comparisons of responses across treatment conditions within a subject. When the levels of the independent variable are assigned to different groups, with an active or attribute variable, the independent variable is considered an independent factor. When all levels of the independent variable are experienced by all subjects, the independent variable is considered a repeated factor or a repeated measure. The use of a repeated measure is often described as using subjects as their own control.

➤ We could ask each subject to wear both orthoses, and then compare their step lengths. The independent variable of "orthosis" would then be a repeated factor because both levels are experienced by all subjects.

A repeated measures design is one of the most efficient methods for controlling intersubject differences. It ensures the highest possible degree of equivalence across treatment conditions. Because subjects are essentially matched with themselves, any observed differences should be attributable solely to the intervention.

Issues related to the design of repeated measures studies are explored further in Chapter 16.

The term independent factor should not be confused with independent variable. An independent variable can be either an "independent factor" or a "repeated factor," depending on how its levels are defined within the study design.

Analysis of Covariance
The last method of controlling for confounding effects does not involve a design strategy but instead uses a statistical technique to equate groups on extraneous variables. The ANCOVA is based on concepts of analysis of variance and regression (see Chapter 30). Without going into details of statistical procedure at this time, let us consider the conceptual premise for ANCOVA using the hypothetical study of step length in patients wearing two types of lower extremity orthoses.
The purpose of the ANCOVA is to statistically eliminate the influence of extraneous factors so that the effect of the independent variable can be seen more clearly. These identified extraneous variables are called covariates. Conceptually, the ANCOVA removes the confounding effect of covariates by making them statistically equivalent across groups and then estimating what the dependent variable would have been under these equivalent conditions.

➤ If the patients wearing orthosis A happen to be taller than those in the other group, the ANCOVA can be used to predict what the step lengths would have been had the heights been equally distributed. The analysis of differences between the two groups will then be based on these adjusted scores.

The ANCOVA can also be used when baseline scores on the outcome variables are different between groups. In that case, researchers will specify the baseline score as a covariate and compare the outcome scores using the ANCOVA. This effectively makes the groups equivalent on their baseline values so that differences can be more clearly seen between groups on the outcome measure.33,34

■ Non-compliance and Missing Data
In addition to controlling for confounding variables in a design, it is important to maximize adherence to the research protocol. Investigators in randomized trials often face two complications that affect analysis: non-compliance and missing data. These situations can occur for several reasons.

➤ CASE IN POINT #3
Researchers evaluated the efficacy of an exercise intervention (EI) to reduce work-related fatigue.35 Subjects were recruited from various industries and work settings. They were randomly allocated to either a 6-week EI or a wait-list control (WLC) group. The primary outcome measure was work-related fatigue, defined as three concepts, each measured by a validated questionnaire: emotional exhaustion, overall fatigue, and short-term recovery. These were measured at baseline (T0) and post-intervention (T1). Subjects in the exercise group were also measured at 6-week (T2) and 12-week (T3) follow-up.

As shown in Figure 15-2, 362 subjects were screened for eligibility, with 266 excluded because of not meeting eligibility requirements or declining participation. The remaining 96 subjects were randomly assigned to the two treatment groups. Not all of these subjects, however, completed the study as planned.
• Excluded (n=266): not meeting inclusion criteria (n=182), declined to participate (n=34), other reasons (n=50)
• Randomized (n=96)
• Allocation: allocated to intervention (EI; n=49), of whom 45 received the allocated intervention and 4 did not (practical reasons [n=2], unwilling [n=1], unknown [n=1]); allocated to wait-list condition (WLC; n=47)
• Follow-up: EI discontinued intervention (total: n=14; reasons: injury [n=7], severe burnout symptoms [n=1], illness [n=2], practical reasons [n=1], unknown [n=3]); WLC lost at T1 (total: n=2; reasons: family circumstances [n=1], unknown [n=1])
• Analysis at T1: EI intention-to-treat analyses (n=49), per-protocol analyses (n=31); WLC intention-to-treat analyses (n=47), per-protocol analyses (n=35)

Figure 15–2 Flowchart of patient recruitment and allocation for the RCT of exercise to reduce work-related fatigue. From de Vries JD, van Hooff ML, Geurts SA, Kompier MA. Exercise to reduce work-related fatigue among employees: a randomized controlled trial. Scand J Work Environ Health 2017;43(4):337–349. Figure 1, p. 2.
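The flow numbers also show how the intention-to-treat and per-protocol samples relate. The short Python check below uses the figure's values; treating the control group's per-protocol exclusions as the 2 subjects lost at T1 plus the 10 non-adherent controls described later in the text is our assumption, not a detail reported in the figure.

```python
# Per-protocol counts reconstructed from the Figure 15-2 flow numbers.
ei_allocated = 49            # allocated to exercise intervention (EI)
ei_never_received = 4        # did not receive the allocated intervention
ei_discontinued = 14         # discontinued during follow-up

wlc_allocated = 47           # allocated to wait-list control (WLC)
wlc_lost_t1 = 2              # lost at T1
wlc_non_adherent = 10        # increased exercise during the wait period

# ITT analyzes everyone as randomized; per-protocol drops deviations.
ei_itt, wlc_itt = ei_allocated, wlc_allocated
ei_pp = ei_allocated - ei_never_received - ei_discontinued
wlc_pp = wlc_allocated - wlc_lost_t1 - wlc_non_adherent

print(ei_itt, ei_pp)    # 49 31
print(wlc_itt, wlc_pp)  # 47 35
```

Note that the ITT denominators preserve the original randomization (49 + 47 = 96), while the per-protocol samples shrink to 31 and 35.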
Compliance refers to getting the assigned treatment, being evaluated according to the protocol, and adhering to protocol requirements.36 Subjects may also drop out, miss a test session, or leave survey questions unanswered, creating missing data.

➤ The flow diagram shows that four subjects in the EI did not comply following allocation. We can also see that 14 additional exercise subjects discontinued their intervention during the follow-up period for various reasons.

• Subjects may refuse the assigned treatment after allocation. A patient may initially consent to join a study, knowing that she may be assigned to either group, but after assignment is complete, the patient decides she wants the other treatment. Ethically, these patients must be allowed to receive the treatment they want.
• Subjects may cross over to another treatment during the course of the study. This situation may be due to patient preference or a change in a patient's condition.
• Subjects may not be compliant with assigned treatments. Although they may remain in a study, subjects may not fulfill requirements of their group assignment, negating the treatment effect.

➤ A condition of eligibility for the fatigue study was that subjects did not exercise more than 1 hour/day. The control subjects were asked to maintain that level, but 10 control subjects increased their exercise activity during the wait period of the study.
• Subjects may drop out of the study or terminate treatment before the study is complete. Those who remain in a study often differ in an important way from those who drop out.37 Subjects in the experimental group may find the commitment onerous, for example. Loss to follow-up is a concern when data collection extends over several measurement periods, potentially creating bias in the remaining sample.

➤ In the fatigue study, a total of 14 subjects in the exercise group discontinued their intervention over the course of the study, with 7 lost at the first post-intervention measure (T1). Two subjects were lost in the control group. Researchers in the fatigue study compared baseline scores for those who completed and did not complete the intervention. Their analysis showed no differences between them in either of the test groups.

Some research protocols call for random assignment prior to the point where eligibility can be determined. If subjects are later excluded from the study because of ineligibility, the balance of random assignment will be distorted. The issue of post-randomization exclusion is distinguished from patients dropping out because of non-compliance, withdrawal, or loss to follow-up.38

Per-Protocol Analysis
Each of the situations just described causes a problem for analysis, as the composition of the groups at the end of the study is biased from the initial random assignment, especially if the number of subjects involved is large. There is no clean way to account for this difficulty.
At first glance, it might seem prudent to simply eliminate any subjects who did not get or complete their assigned treatment, and include only those subjects who sufficiently complied with the trial's protocol. This is called per-protocol analysis.

➤ Based on the subjects who did not comply with their assigned interventions, researchers in the fatigue study did a per-protocol analysis with 31 subjects in the intervention group and 35 in the control group. Using this approach, they found that the exercise group scored significantly lower at T1 on emotional exhaustion and overall fatigue compared to the WLC group, suggesting a positive effect of the exercise program.

Generally, the per-protocol approach will bias results in favor of a treatment effect, as those who succeed at treatment are able to tolerate it well and are most likely to stick with it.39,40 For example, in the fatigue study, it is possible that those who complied with the exercise regimen were those who tended to see positive results, and those who stopped exercising were not seeing any benefits or found the activity inconvenient. Therefore, when analyzing the data using only those subjects who complied, the exercise program would look more successful compared to the control group.

Intention to Treat Analysis
A more conservative approach uses a principle called intention to treat (ITT), which means that data are analyzed according to the original random assignments, regardless of the treatment subjects actually received, whether they dropped out, or whether they were non-compliant; that is, we analyze data according to the way we intended to treat the subjects. This analysis ideally includes all subjects.

Purposes of Intention to Treat Analysis
The ITT approach is important because the effectiveness of a therapy is based on a combination of many factors beyond its biological effect, including how it is administered, variations in compliance and adherence to the study protocol, adverse events, and other patient characteristics that are related to the outcome.41 All of these factors are assumed to be balanced across groups through random assignment, which needs to be maintained to obtain an unbiased estimate of the treatment effect.
The ITT approach, therefore, serves two main purposes:
• It guards against the potential for bias if dropouts are related to outcomes or group assignment, and preserves the original balance of random assignment.
• It reflects routine clinical situations, in which some patients will be non-compliant. This is most relevant to pragmatic trials, where the design is intended to incorporate real-world practice.

Reporting Intention to Treat Results
The ITT principle is considered an essential process for the interpretation of randomized controlled trials. Current guidelines for reporting randomized trials have been endorsed by an international community of medical journals, recommending use of ITT analysis whenever possible, with a flow diagram that documents how subjects were included or excluded through different phases of the study, as shown in Figure 15-2 (see Chapter 38).42 This approach should be used with both explanatory and pragmatic trials.

Interpreting ITT Analysis
As might be expected, an ITT analysis can result in an underestimate of the treatment effect, especially if the number of subjects not getting their assigned treatment
is large. This will make it harder to find significant differences but is still considered a better representation of what would happen in practice. To be safe, then, many researchers will analyze data using both ITT and per-protocol analyses.

➤ Using an ITT analysis, results of the fatigue study did not show a significant difference between groups on the primary outcome of fatigue. The authors attributed these findings to the fact that many of the exercise subjects had stopped exercising prior to the end of the trial, and some of the control subjects increased their level of exercise. Therefore, the treatment effect was diffused.
In contrast, a per-protocol analysis, run on the "pure" groups of completers and true controls, did show the effect of exercise on reducing work fatigue. The authors interpreted this discrepancy in findings to suggest that exercise was effective for reducing fatigue, but that compliance was necessary to observe the beneficial effects.

When outcomes are the same using both analysis approaches, the researcher will have strong confidence in the results.43 If the two methods yield different results, however, the researcher is obliged to consider which factors may be operating to bias the outcome and to offer potential explanations.
Because the chance of systematic bias is reduced with ITT, the per-protocol results may be less valid. Therefore, the researcher must be cautious in this interpretation, considering the role of confounding, in this example the role of compliance. Data should be collected to determine how this variable contributes to the outcome.

■ Handling Missing Data
Studies often end up with some missing data for various reasons. Missing data present a challenge for analysis in RCTs and observational studies because they can create bias that reduces the validity of conclusions. The consequences of this bias can depend on the reasons the data are missing. Three types of "missingness" have been described.44-46
• Missing Completely at Random (MCAR). Data are assumed to be missing because of completely unpredictable circumstances that have no connection to the variables of interest or the study design, such as records getting lost or equipment failure. In this case, it is reasonable to assume that the subjects with missing data will be similar to those for whom full data are available.
• Missing at Random (MAR). Missing data may be related to the methodology of the study, but not to the treatment variable. For example, in the fatigue study, clinicians from different sites may have missed certain data points. In this case, missing data should also have no inherent bias.
• Missing Not at Random (MNAR). Sometimes the missing data will be related to group membership, such as patients dropping out because a treatment is not working, or not showing up because of inconvenience. This type of missing data is most problematic because it creates a bias in the data.

Strategies for handling missing data should be specified as part of the research plan. The approach may vary depending on the type of missing data. Sometimes, carrying out more than one strategy is helpful to see if conclusions will differ.

Completer Analysis
The most direct approach to handling missing data is to exclude subjects, using only the data from those who completed the study, called completer analysis or complete case analysis. This analysis will represent the efficacy of an intervention for those who persist with it but may be open to serious bias.
Using only complete cases can be justified if the number of incomplete cases is small and if data are MAR; that is, missing data are independent of group assignment or outcome. However, if those with missing data differ systematically from the complete cases, marked bias can exist. If a large number of cases are missing, this approach will also reduce sample size, decreasing the power of the study.

Data Imputation
To maintain the integrity of a sample and to fulfill assumptions for ITT, missing data points can be replaced with estimated values that are based on observed data, using a process called imputation. Analyses can then be run on the full dataset with the missing values "filled in" and conclusions drawn about outcomes based on those estimates. Of course, there is always some level of uncertainty in how well the imputed scores represent the unknown true scores. This brief introduction will include several of the more commonly used techniques to determine what this replacement value should be. The validity of results will vary depending on how imputed values are generated.
6113_Ch15_210-226 02/12/19 4:47 pm Page 224
analysis, then, those who drop out can be considered a "failure." This is the most conservative approach; thus, if the treatment group is better, we can be confident that the results are not biased by dropouts.47 If the results show no difference, however, we would not know if the treatment was truly ineffective or if the dropouts confounded the outcome because they were classified as failures.

Last Observation Carried Forward
When data are measured at several points in time, a technique called last observation carried forward (LOCF) is often used. With this method, the subject's last data point before dropping out is used as the outcome score. With multiple test points, the assumption is that patients improve gradually from the start of the study until the end; therefore, carrying forward an intermediate value is a conservative estimate of how well the person would have done had he remained in the study. It essentially assumes that the subject would have remained stable from the point of dropout, rather than improving.48 If the subject drops out right after the pretest, however, this last score could be the baseline score.
Although it is still often used, the LOCF approach has fallen into disfavor because it can grossly inflate or deflate the true treatment effect.49,50 This is especially true when the number of dropouts is greater in the treatment group.48 It does not take into account that some dropouts may have shown no change up to their last assessment, but might have improved had they continued.

Mean Value Imputation
Another approach involves the use of the distribution average, or mean, for the missing variable as the best estimate. For example, in the fatigue study, missing data for a control subject on the emotional exhaustion outcome could be replaced by the mean of that variable for those in the control group with complete data. This method assumes that the group mean is a "typical" response that can represent an individual subject's score. However, this approach can result in a large number of subjects with identical scores, thereby reducing variability, artificially increasing precision, and ultimately biasing results.51

Multiple Data Imputation
The LOCF and mean imputation techniques use a single value as the imputed score, leaving a great deal of uncertainty in how well the score represents the individual subject's true score. An approach called multiple imputation is considered a more valid method to estimate missing data.52
This procedure involves creating a random dataset using the available data, predicting plausible values derived from observed data. It relies on sampling with replacement, which means that the selected values are returned to the data and are available for selection with further sampling. This sampling procedure is repeated several times, each time using a different random dataset and therefore resulting in different imputed values. The data from each dataset are then subjected to the planned statistical analysis to see what the differences are in the results. One value is then derived, usually by taking a mean of the several outcomes to get a pooled estimate for the missing values.
Multiple imputation techniques require sophisticated methodology that is beyond the scope of this book, and a statistician should be consulted. Several excellent references provide comprehensive reviews of these issues.44,46,53-55 Most statistical programs include options for this process.

Pairwise and Listwise Deletion
Statistical programs provide several options for handling missing data in analysis. Listwise deletion means that a case is dropped if it has a missing value for one or more data points, and analysis will be run only on cases with complete data. Pairwise deletion can be used when a particular analysis does not include every score; cases are omitted only if they are missing any of the variables involved in the specific procedure (see Appendix C).

Trying to Limit Missing Data
Because missing data can create potentially biased outcomes in the results of a clinical trial, researchers should take purposeful steps to limit this problem in the design and conduct of a study.36 Efforts should be made to reduce non-compliance, dropouts, crossover of subjects, or loss to follow-up. Eligibility criteria should be as specific as possible to avoid exclusions. This process is important because those who are lost may differ systematically from those who remain on treatment.
A thorough consent process, including adequate warning of potential side effects and expectations, may help to inform subjects sufficiently to avoid non-compliance. Ongoing support for subjects during a trial, such as periodic check-ins, may also foster continuous participation. Researchers must also consider how they can follow up with subjects even if they withdrew early. It is often possible to obtain a final measurement at the end of the study period, even if subjects have been delinquent in their participation.47
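The simpler of these strategies can be sketched in a few lines of plain Python. The toy dataset below is invented purely for illustration; in practice these options are built into statistical packages, and multiple imputation should be done with dedicated software rather than by hand:

```python
from statistics import mean

# Hypothetical repeated-measures data: one score per visit, None = missing.
subjects = {
    "S1": [42, 45, 48, 50],      # complete case
    "S2": [40, 44, None, None],  # dropped out after visit 2
    "S3": [38, None, 41, 43],    # one missed visit
}

def locf(scores):
    """Last observation carried forward: replace each missing score
    with the most recent observed value."""
    filled, last = [], None
    for s in scores:
        last = s if s is not None else last
        filled.append(last)
    return filled

def mean_impute(all_scores, visit):
    """Single-value mean imputation: replace a missing value at one
    visit with the mean of the observed values at that visit."""
    observed = [s[visit] for s in all_scores if s[visit] is not None]
    return mean(observed)

print(locf(subjects["S2"]))                     # [40, 44, 44, 44]
print(mean_impute(list(subjects.values()), 1))  # mean of 45 and 44 -> 44.5

# Listwise deletion keeps only complete cases:
complete = {k: v for k, v in subjects.items() if None not in v}
print(sorted(complete))                         # ['S1']
```

Note how LOCF assumes the dropout stayed stable, and how mean imputation gives every imputed subject an identical, artificially "typical" score; the code makes both assumptions visible.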
COMMENTARY
The real purpose of the scientific method is to make sure that nature hasn’t
misled you into thinking you know something you actually don’t know.
—Robert Pirsig (1928–2017)
Author, Zen and the Art of Motorcycle Maintenance
Because there are so many potential threats to validity, researchers must make judgments about priorities. Not all threats to validity are of equal concern in every study. When steps are taken to increase one type of validity, it is likely that another type will be affected. Because clinical trials focus on control to improve interpretation of outcomes, internal validity often receives the most press. However, the application of evidence to clinical practice requires an appreciation for all validity issues and the ability to judge when findings will have applicability.
For instance, if we attempt to control for extraneous factors by using a homogeneous sample, we will improve internal validity, but at the expense of external validity. Pragmatic studies face unique challenges because their purpose specifically negates controlling for many variables for the purpose of improving external validity and construct validity, often at the expense of internal validity. If we increase statistical conclusion validity by limiting variability in our data, we will probably reduce external and construct validity. Similarly, if we work toward increasing construct validity by operationalizing variables to match clinical utilization, we run the risk of decreasing reliability of measurements.
After describing all the above threats, we might wonder: is it ever possible to design a completely valid study? In fact, there is probably no such thing. Every study contains some shortcomings. Clinical researchers operate in an environment of uncertainty, and it is virtually impossible to conduct experiments so that every facet of behavior, environment, and personal interaction is exactly the same for every subject.
Because clinical studies cannot be perfectly controlled, authors should provide sufficient details on the design and analysis of the study that will allow clinicians to judge how and to whom results can be reasonably applied. Limitations of a study should be included in the discussion section of a research report so that readers have a complete understanding of the circumstances under which results were obtained. Researchers must be able to justify the study conditions and analysis procedures as fair tests of the experimental treatment within the context of the research question. When important extraneous factors cannot be controlled, to the point that they will have a serious impact on the interpretation and validity of outcomes, it is advisable to consider alternatives to experimental research, such as observational approaches. Guidelines have been developed to assure consistency in reporting of clinical trials. These will be discussed in Chapter 38.
endpoints, coronary and cardiovascular, during the trial. J Am Heart Assoc 2012;1(5):e003640.
13. Grundy SM. Cholesterol and coronary heart disease. A new era. JAMA 1986;256(20):2849–2858.
14. Nelson SR. Steel Drivin' Man: John Henry, the Untold Story of an American Legend. Oxford: Oxford University Press, 2006.
15. Saretsky G. The OEO P.C. experiment and the John Henry effect. Phi Delta Kappa 1972;53:579.
16. Rosenthal R. Experimenter Effects in Behavioral Research. New York: Appleton-Century-Crofts, 1966.
17. Roethlisberger J, Dickson WJ. Management and the Worker. Cambridge, MA: Harvard University Press, 1966.
18. Gillespie R. Manufacturing Knowledge: A History of the Hawthorne Experiments. New York: Cambridge University Press, 1991.
19. Gale EA. The Hawthorne studies: a fable for our times? QJM 2004;97(7):439–449.
20. Kompier MA. The "Hawthorne effect" is a myth, but what keeps the story going? Scand J Work Environ Health 2006;32(5):402–412.
21. Questioning the Hawthorne effect. The Economist, June 4, 2009. Available at http://www.economist.com/node/13788427. Accessed April 1, 2017.
22. Kolata G. Scientific myths that are too good to die. The New York Times. December 6, 1998. Available at http://www.nytimes.com/1998/12/06/weekinreview/scientific-myths-that-are-too-good-to-die.html. Accessed March 31, 2017.
23. McCambridge J, Witton J, Elbourne DR. Systematic review of the Hawthorne effect: new concepts are needed to study research participation effects. J Clin Epidemiol 2014;67(3):267–277.
24. Cizza G, Piaggi P, Rother KI, Csako G. Hawthorne effect with transient behavioral and biochemical changes in a randomized controlled sleep extension trial of chronically short-sleeping obese adults: implications for the design and interpretation of clinical studies. PLoS One 2014;9(8):e104176.
25. Gould DJ, Creedon S, Jeanes A, Drey NS, Chudleigh J, Moralejo D. Impact of observing hand hygiene in practice and research: a methodological reconsideration. J Hosp Infect 2017;95(2):169–174.
26. On the magnitude and persistence of the Hawthorne effect: evidence from four field studies. Proceedings of the 4th European Conference on Behaviour and Energy Efficiency (Behave 2016); 2016.
27. De Amici D, Klersy C, Ramajoli F, Brustia L, Politi P. Impact of the Hawthorne effect in a longitudinal clinical study: the case of anesthesia. Control Clin Trials 2000;21(2):103–114.
28. Zinman R, Bethune P, Camfield C, Fitzpatrick E, Gordon K. An observational asthma study alters emergency department use: the Hawthorne effect. Pediatr Emerg Care 1996;12(2):78–80.
29. Gartlehner G, Hansen RA, Nissman D, Lohr KN, Carey TS. Criteria for distinguishing effectiveness from efficacy trials in systematic reviews. Technical Review 12. AHRQ Publication No. 06-0046. Rockville, MD: Agency for Healthcare Research and Quality. April, 2006. Available at https://www.ncbi.nlm.nih.gov/books/NBK44029/pdf/Bookshelf_NBK44029.pdf. Accessed March 27, 2017.
30. Guyatt GH, Sackett DL, Cook DJ. Users' guides to the medical literature. II. How to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients? Evidence-Based Medicine Working Group. JAMA 1994;271(1):59–63.
31. Hennekens CH, Buring JE, Manson JE, Stampfer M, Rosner B, Cook NR, et al. Lack of effect of long-term supplementation with beta carotene on the incidence of malignant neoplasms and cardiovascular disease. N Engl J Med 1996;334(18):1145–1149.
32. Pablos-Mendez A, Barr RG, Shea S. Run-in periods in randomized trials: implications for the application of results in clinical practice. JAMA 1998;279(3):222–225.
33. Rausch JR, Maxwell SE, Kelley K. Analytic methods for questions pertaining to a randomized pretest, posttest, follow-up design. J Clin Child Adolesc Psychol 2003;32(3):467–486.
34. Vickers AJ, Altman DG. Statistics notes: Analysing controlled trials with baseline and follow up measurements. BMJ 2001;323(7321):1123–1124.
35. de Vries JD, van Hooff ML, Guerts SA, Kompier MA. Exercise to reduce work-related fatigue among employees: a randomized controlled trial. Scand J Work Environ Health 2017;43(4):337–349.
36. Heritier SR, Gebski VJ, Keech AC. Inclusion of patients in clinical trial analysis: the intention-to-treat principle. Med J Aust 2003;179(8):438–440.
37. Mahaniah KJ, Rao G. Intention-to-treat analysis: protecting the integrity of randomization. J Fam Pract 2004;53(8):644.
38. Fergusson D, Aaron SD, Guyatt G, Hebert P. Post-randomisation exclusions: the intention to treat principle and excluding patients from analysis. BMJ 2002;325(7365):652–654.
39. Hollis S, Campbell F. What is meant by intention to treat analysis? Survey of published randomised controlled trials. BMJ 1999;319:670–674.
40. Stanley K. Evaluation of randomized controlled trials. Circulation 2007;115(13):1819–1822.
41. Detry MA, Lewis RJ. The intention-to-treat principle: how to assess the true effect of choosing a medical treatment. JAMA 2014;312(1):85–86.
42. CONSORT Website. Available at http://www.consort-statement.org/. Accessed April 5, 2017.
43. Motulsky H. Intuitive Biostatistics. New York: Oxford University Press, 1995.
44. Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393.
45. Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: John Wiley and Sons, 1987.
46. Li P, Stuart EA, Allison DB. Multiple imputation: a flexible tool for handling missing data. JAMA 2015;314(18):1966–1967.
47. Streiner D, Geddes J. Intention to treat analysis in clinical trials when there are missing data. Evid Based Ment Health 2001;4(3):70–71.
48. Molnar FJ, Hutton B, Fergusson D. Does analysis using "last observation carried forward" introduce bias in dementia research? CMAJ 2008;179(8):751–753.
49. Lachin JM. Fallacies of last observation carried forward analyses. Clin Trials 2016;13(2):161–168.
50. Shoop SJW. Should we ban the use of 'last observation carried forward' analysis in epidemiological studies? SM J Public Health Epidemiol 2015;1(1):1004.
51. Newgard CD, Lewis RJ. Missing data: how to best account for what is not known. JAMA 2015;314(9):940–941.
52. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999;8(1):3–15.
53. National Research Council. The Prevention and Treatment of Missing Data in Clinical Trials. Panel on Handling Missing Data in Clinical Trials. Committee on National Statistics, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press, 2010.
54. Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006;59(10):1087–1091.
55. Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Chapter 25: Missing-data imputation. Cambridge: Cambridge University Press, 2007.
6113_Ch16_227-239 02/12/19 4:44 pm Page 227
CHAPTER 16
Experimental Designs
In the study of interventions, experimental designs provide a structure for evaluating the cause-and-effect relationship between a set of independent and dependent variables. In explanatory and pragmatic trials, the researcher manipulates the levels of the independent variable within the design to incorporate elements of control so that the evidence supporting a causal relationship can be interpreted with confidence.
The purpose of this chapter is to present basic configurations for experimental designs and to illustrate the types of research situations for which they are most appropriate. For each design, the discussion includes strengths and weaknesses in terms of internal and external validity, and the most commonly used statistical procedures are identified, all of which are covered in succeeding chapters. This information demonstrates the intrinsic relationship between analysis and design.

■ Design Classifications
Although experimental designs can take on a wide variety of configurations, the important principles can be illustrated using a few basic structures. A basic distinction among them is the degree to which a design offers experimental control.1,2 In a true experiment, participants are randomly assigned to at least two comparison groups. An experimental design is theoretically able to exert control over most threats to internal validity, providing the strongest evidence for causal relationships.
Focus on Evidence 16–1
Asking the Right Question—for the Real World

The concept of knowledge translation is important in the consideration of experimental designs, as we strive to implement evidence-based strategies and guidelines into practice. Pragmatic trials are typically used to assess the effectiveness of interventions, often at the community level, to answer practical questions about interventions that have already been shown to be efficacious in RCTs.
For example, studies have established that decreases of 10 mm Hg in systolic blood pressure or 5 mm Hg in diastolic blood pressure can significantly reduce cardiovascular disease (CVD).7 However, strategies to implement preventive programs to reduce hypertension are often largely ineffective.8 In 2011, Kaczorowski et al9 described the Cardiovascular Health Awareness Program (CHAP), which evaluated the effectiveness of a community-based intervention to reduce cardiovascular admissions.
Using a cluster randomized design, the researchers identified 39 communities in Ontario with populations of at least 10,000 and stratified the communities by size and geographic area, randomly assigning them to receive the CHAP intervention (n = 20) or usual care (n = 19). The study was based on the following PICO question:

(P) Among midsized Ontario communities, for those >65 years old, (I) does a community-based program of blood pressure and CVD risk assessments with education and referral of all new or uncontrolled hypertensives to a source of continuing care, (C) compared with usual care, (O) reduce community rates of hospitalization for acute myocardial infarction, stroke, or congestive heart failure over 12 months?

The researchers acknowledged that this trial was highly pragmatic.8 For instance, all municipalities of sufficient size were included, and all members of the communities over 65 years old were eligible to participate, regardless of comorbidities. Interventions were provided within community clinics and primary care settings by regular practitioners. Patient compliance was recorded but not controlled.
The results showed a modest but significant effect of the intervention, with a 9% relative reduction in hospitalizations compared to the prior 2 years, which translated to 3.02 fewer admissions/1,000 people. Although this seems small, the authors note that it could mean hundreds of thousands of fewer admissions across the general population.
The relevance of this trial to our discussion of design is based on the need to consider the research question and how validity would be viewed. Because the study did not include blinding, and provision of services was maintained within the natural health-care environment, internal validity could be suspect. However, the study was not about the efficacy of taking blood pressure measurements or educational programs to reduce CVD risk; that was established in earlier studies. Rather, its focus was on the effectiveness of organizing services in a community setting to influence outcomes, strengthening external validity.
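The allocation logic of a stratified, cluster randomized trial such as CHAP can be sketched in plain Python. The community names, strata, and counts below are invented placeholders, not the CHAP data; the point is only that whole communities, not individuals, are the units of randomization, balanced within each stratum:

```python
import random

# Hypothetical communities grouped into strata by size/region.
strata = {
    "small/north": ["C01", "C02", "C03", "C04"],
    "small/south": ["C05", "C06", "C07", "C08"],
    "large/north": ["C09", "C10", "C11", "C12"],
}

def cluster_randomize(strata, seed=1):
    """Randomly assign whole communities (clusters) to intervention
    or usual care, separately within each stratum."""
    rng = random.Random(seed)
    allocation = {}
    for stratum, communities in strata.items():
        shuffled = communities[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for c in shuffled[:half]:
            allocation[c] = "intervention"
        for c in shuffled[half:]:
            allocation[c] = "usual care"
    return allocation

alloc = cluster_randomize(strata)
# Each stratum contributes equally to both arms:
print(sum(v == "usual care" for v in alloc.values()))  # 6
```

Stratifying before randomizing is what keeps community size and geography balanced across arms, even with a small number of clusters.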
equivalence of groups, strengthening the evidence that the causal effects of treatment must have brought about any measured group differences in posttest scores. Selection bias is controlled because subjects are randomly assigned to groups. History, maturation, testing, and instrumentation effects should affect all groups equally in both the pretest and posttest. Threats that are not controlled by this design are attrition and differential social threats.
The primary threat to external validity in the pretest–posttest design is the potential interaction of treatment and testing. Because subjects are given a pretest, there may be reactive effects, which would not be present in situations where a pretest is not given.

Analysis of Pretest–Posttest Designs. Pretest–posttest designs are often analyzed based on the change from pretest to posttest using difference scores. With interval–ratio data, difference scores are usually compared using an unpaired t-test (with two groups) or a one-way analysis of variance (with three or more groups). The pretest–posttest design can also be analyzed as a two-factor design, using a two-way analysis of variance with one repeated factor, with treatment as one independent variable and time (pretest and posttest) as the second (repeated) factor. With ordinal data, the Mann–Whitney U-test can be used to compare two groups, and the Kruskal–Wallis analysis of variance by ranks is used to compare three or more groups. The analysis of covariance can be used to compare posttest scores, using the pretest score as the covariate. Discriminant analysis can also be used to distinguish between groups with multiple outcome measures.

■ Posttest-Only Control Group Design
The posttest-only control group design (see Fig. 16-2) is identical to the pretest–posttest design, with the obvious exception that a pretest is not administered to either group.
A study was designed to test the effectiveness of a computer-based counseling tool on reproductive choices in community family planning clinics in North Carolina.10 A total of 340 women were randomly assigned to receive the Smart Choices packet or usual counseling over a 3-month period. Outcomes included
the patients' confidence in choosing a contraceptive method that was right for them, ability to engage in shared decision-making with their provider, and their perception of the quality of the counseling experience.

Figure 16–2 Posttest-only control group design: random assignment to an experimental intervention group or a control (no intervention or placebo) group, with measurement at posttest only.

In this study, researchers used the posttest-only design because the outcomes could only be assessed following the intervention condition. This design is a true experimental design that, like the pretest–posttest design, can be expanded to include multiple levels of the independent variable, with a control, placebo, or alternative treatment group.
The posttest-only design can also be used when a pretest is either impractical or potentially reactive. For instance, in surveys of attitudes, we might ask questions on a pretest that could sensitize subjects in a way that would influence their scores on a subsequent posttest. The posttest-only design avoids this form of bias, increasing the external validity of the study.

Design Validity
Because the posttest-only control group design involves randomization and comparison groups, its internal validity is strong, even without a pretest; that is, we can assume groups are equivalent prior to treatment because they have been randomly assigned. Because there is no pretest score to document the results of randomization, this design is most successful when the number of subjects is large so that the probability of truly balancing interpersonal characteristics is increased.

Analysis of Posttest-Only Designs. With two groups, an unpaired t-test is used with interval–ratio data, and a Mann–Whitney U-test with ordinal data. With more than two groups, a one-way analysis of variance or the Kruskal–Wallis analysis of variance by ranks should be used to compare posttest scores. An analysis of covariance can be used when covariate data on relevant extraneous variables are available. Regression or discriminant analysis procedures can also be applied.

■ Factorial Designs for Independent Groups
The designs presented thus far have involved the testing of one independent variable with two or more levels. Although easy to develop, these single-factor designs may impose an artificial simplicity on many clinical and behavioral phenomena; that is, they do not account for simultaneous and often complex interactions of several variables within clinical situations.
Interactions are generally important for developing a theoretical understanding of behavior and for establishing the construct validity of clinical variables. Interactions may reflect the combined influence of several treatments or the effect of several attribute variables on the success of a particular treatment.
A factorial design incorporates two or more independent variables, with independent groups of subjects randomly assigned to various combinations of levels of the variables. Although such designs can theoretically be expanded to include any number of independent variables, clinical studies usually involve two or three at most. As the number of variables increases, so does the number of experimental groups, creating the need for larger and larger samples, which are typically impractical in clinical situations.

Dimensions
Factorial designs are described according to their dimensions or number of factors, so that a two-way or two-factor design has two independent variables, a three-way or three-factor design has three independent variables, and so on. These designs can also be described by the number of levels within each factor, so that a 3 × 3 design includes two independent variables, each with three levels, and a 2 × 3 × 4 design includes three independent variables, with two, three, and four levels, respectively.

Factorial Diagrams
A factorial design is most easily diagrammed using a matrix that indicates how groups are formed relative to levels of each independent variable. Uppercase letters, typically A, B, and C, are used to label the independent variables and their levels. The number of groups is the product of the digits that define the design. For example, a 3 × 3 design yields 9 groups, and a 2 × 3 × 4 design yields 24 groups. Each cell of the matrix represents a unique combination of levels.
In this type of diagram, there is no indication if measurements within a cell include pretest–posttest scores or posttest scores only. This detail is generally described in words.
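The arithmetic of factorial dimensions is easy to check in code. A minimal sketch in plain Python, using the generic A/B/C labeling convention for factors and levels:

```python
from itertools import product

def design_cells(*levels):
    """Enumerate the groups (cells) of a factorial design, one per
    unique combination of factor levels. The number of groups is
    the product of the number of levels of each factor."""
    labels = [
        [f"{letter}{i}" for i in range(1, n + 1)]
        for letter, n in zip("ABC", levels)
    ]
    return list(product(*labels))

print(len(design_cells(3, 3)))     # 3 x 3 design -> 9 groups
print(len(design_cells(2, 3, 4)))  # 2 x 3 x 4 design -> 24 groups
print(design_cells(2, 2)[:2])      # [('A1', 'B1'), ('A1', 'B2')]
```

Enumerating the cells this way also makes the sample-size problem concrete: every added factor multiplies the number of groups that must be filled with subjects.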
Figure 16–3 A. A two-way factorial design for a study of joint protection and hand exercise for patients with OA, with factor A (joint protection: A1 joint protection, n = 127; A2 none, n = 130) crossed with factor B (hand exercise). All patients received the information leaflet as a control condition. B. Main effects for the two-way factorial design.
and exercise. Or we could find that exercise is most effective regardless of participation in the joint protection program.
This example illustrates the major advantage of the factorial approach, which is that it gives the researcher important information that could not be obtained with any one single-factor experiment. The ability to examine interactions greatly enhances the generalizability of results.
Factorial designs can be extended to include more than two independent variables. In a three-way factorial design, a two-way design is essentially crossed on a third factor. For example, we could use a 2 × 2 × 2 design, where subjects would be assigned to one of eight independent groups. We could examine the main effects of each independent variable, as well as their two-way and three-way interactions.

Analysis of Factorial Designs. A two-way or three-way analysis of variance is most commonly used to examine the main effects and interaction effects of a factorial design.

In studying the effect of position, the researchers were concerned that men and women would respond differently. They accounted for this potential effect by using gender as an independent variable, thereby eliminating it as a confounder.
We can think of the randomized block design as two single-factor randomized experiments, with each block representing a different subpopulation. Subjects are grouped by blocks and then random assignment is made within each block to the treatment conditions.
When the design is analyzed, we will be able to examine possible interaction effects between the treatment conditions and the blocking variable. When this interaction is significant, we will know that the effects of treatment do not generalize across the block classifications; in this case, across genders. If the interaction is not significant, we have achieved a certain degree of generalizability of the results.

Analysis of Randomized Block Designs. Data from a randomized block design can be analyzed using a two-way analysis of variance, multiple regression, or discriminant analysis.
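The assignment procedure for a randomized block design, grouping subjects by the blocking variable and then randomizing treatments within each block, can be sketched as follows. Subject IDs, block sizes, and the prone/supine labels are hypothetical, chosen only to echo the position-study example:

```python
import random

def randomize_within_blocks(subjects, treatments, seed=7):
    """Randomized block design: group subjects by the blocking
    variable, then randomly assign treatments within each block
    so that every block is balanced across conditions."""
    rng = random.Random(seed)
    blocks = {}
    for sid, block in subjects:
        blocks.setdefault(block, []).append(sid)
    assignment = {}
    for block, members in blocks.items():
        rng.shuffle(members)
        for i, sid in enumerate(members):
            # cycle through treatments so each block is balanced
            assignment[sid] = treatments[i % len(treatments)]
    return assignment

# Hypothetical sample with gender as the blocking variable.
sample = [("S1", "M"), ("S2", "M"), ("S3", "M"), ("S4", "M"),
          ("S5", "F"), ("S6", "F"), ("S7", "F"), ("S8", "F")]
assign = randomize_within_blocks(sample, ["prone", "supine"])

# Each gender block contributes equally to each condition:
men = [assign[s] for s in ("S1", "S2", "S3", "S4")]
print(sorted(men).count("prone"))  # 2
```

Because randomization happens inside each block, the blocking variable cannot end up unevenly distributed across treatment arms, which is exactly how it is removed as a confounder.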
Figure 16–4 One-way repeated measures design. The order of test conditions may be randomized.
be some concern about the potentially biasing effect of effects is to counterbalance the treatment conditions so
test sequence; that is, the researcher must determine if that their order is systematically varied. This creates a
responses might be dependent on which condition pre- crossover design in which half the subjects receive
ceded which other condition. Effects such as fatigue, treatment A followed by B, and half receive B followed
learning, or carryover may influence responses if sub- by A (see Fig. 16-6). Two subgroups are created, one
jects are all tested in the same order. for each sequence, and subjects are randomly assigned
One solution to the problem of order effects is to to one of the sequences.
randomize the order of presentation for each subject, as
Researchers were interested in comparing the effects
was done in the OA study. This can be done by the flip
of prone and supine positions on stress responses in me-
of a coin or other randomization scheme so that there is
chanically ventilated preterm infants.14 They randomly
no bias involved in choosing the order of testing. This
assigned 28 infants to a supine/prone or prone/supine
approach does theoretically control for order effects.
position sequence. Infants were placed in each position
However, there is still a chance that some sequences
for 2 hours. Stress signs were measured following each
will be repeated more often than others, especially if the
2-hour period, including startle, tremor, and twitch
sample size is small.
responses.
Another solution for order effects involves use of a Latin square, which is a matrix composed of equal numbers of rows and columns, designating random permutations of sequence combinations. For example, in the cane study, if we had 30 subjects, we could assign 10 subjects to each of three sequences, as shown in Figure 16-5. Using randomization, we would determine which group would get each sequence, and then assign each testing condition to A, B, or C. This design is sometimes considered a randomized block design, with the blocks being the specific sequences.

Crossover Design

When only two levels of an independent variable are repeated, a preferred method to control for order effects is the crossover design, in which each subject is randomly assigned to one of the sequences.

A crossover design should only be used in trials where the patient's condition or disease will not change appreciably over time. It is not a reasonable approach in situations where treatment effects are slow, as the treatment periods must be limited. It is similarly impractical where treatment effects are long term and a reversal is not likely.

Washout Period

The crossover design is especially useful when treatment conditions are immediately reversible, as in the positioning of infants in the previous example. When the treatment has cumulative effects, however, a washout period is essential, allowing a common baseline for each treatment condition (Fig. 16-6). The washout period must be long enough to eliminate any prolonged effects of the treatment.

    Researchers were interested in the effectiveness of a cranberry supplement for preventing urinary tract infections in persons with neurogenic bladders secondary to spinal cord injury.15 They treated 21 individuals, evaluating responses based on urinary bacterial counts and white blood cell counts. Subjects were randomly

              Testing Conditions
              1    2    3
    Block 1   A    B    C
    Block 2   B    C    A
    Block 3   C    A    B

Figure 16–5 A 3 x 3 Latin square.
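The construction and assignment logic just described can be sketched in a few lines of Python. This is a generic illustration, not code from any study cited here; the three conditions and 30 subjects follow the cane-study example above:

```python
import random

def latin_square(n):
    """Build an n x n cyclic Latin square of condition labels:
    each condition appears exactly once in every row (sequence)
    and every column (test position)."""
    conditions = [chr(ord("A") + i) for i in range(n)]
    return [[conditions[(i + j) % n] for j in range(n)] for i in range(n)]

def assign_to_sequences(subject_ids, square, seed=None):
    """Randomly distribute subjects equally across the sequences
    (rows) of the Latin square."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    per_seq = len(ids) // len(square)
    return {"-".join(row): ids[i * per_seq:(i + 1) * per_seq]
            for i, row in enumerate(square)}

square = latin_square(3)   # rows A-B-C, B-C-A, C-A-B, as in Figure 16-5
groups = assign_to_sequences(range(1, 31), square, seed=42)
```

Note that a cyclic square is only one of the possible 3 x 3 Latin squares; a fuller randomization scheme would also permute rows, columns, and condition labels.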
[Figure schematic: RANDOM ASSIGNMENT, Period 1, Washout Period, Crossover, Period 2]
Figure 16–6 Crossover design. At the point of crossover, the study may incorporate a washout period.
6113_Ch16_227-239 02/12/19 4:44 pm Page 236
of whether the experimental treatment has been successful. The sequential clinical trial is an alternative approach to the RCT that allows for continuous analysis of data as they become available, instead of waiting until the end of the experiment to compare groups. Results are accumulated as each subject is tested so that the experiment can be stopped at any point as soon as the evidence is strong enough to determine a significant difference between treatments. Consequently, a decision about treatment effectiveness can often be made earlier than in a fixed sample study, leading to a substantial reduction in the total number of subjects needed to obtain valid statistical outcomes and avoiding unnecessary administration of inferior treatments.

    Allard et al18 used a sequential design to study the effect of opioids to relieve dyspnea in patients with cancer. They compared 25% and 50% supplementary dosage regimens and measured intensity and frequency of respiratory problems following administration of the drug. Pairs of patients were alternately assigned to one dose and preference was based on the subjective assessment of a better outcome.

Sequential trials incorporate specially constructed charts that provide visual confirmation of statistical outcomes, without the use of formal statistical calculations (see Fig. 16-7). Upper, lower, and middle boundaries are constructed and successive pair preferences are charted in a continuous line, moving toward one of the boundaries, indicating if results are in favor of one treatment or the other, or show no preference. For instance, in the study of dyspnea and opioids, the first four pairs showed a preference for the 50% dose, the next three preferred the 25% dose, and further pairs alternated between the two outcomes. After 15 pairs were evaluated, the results showed no difference between the dose regimens, as the line that charted the preferences moved progressively toward the middle boundary.

Because of the ongoing sequential analysis, this early-stopping advantage can substantially shorten a trial. For example, in a retrospective sequential analysis of data that
[Figure 16–7 plot: excess preferences (–15 to 15) charted against number of preferences (up to 27); the upper region indicates the 25% supplementary dose is most effective, the lower region the 50% supplementary dose.]
Figure 16–7 Sequential analysis diagram of paired preferences in favor of the 25% or 50% supplementary dose of opioids to reduce dyspnea in patients with cancer. From Allard P, Lamontagne C, Bernard P, et al. How effective are supplementary doses of opioids for dyspnea in terminally ill cancer patients? A randomized continuous sequential clinical trial. J Pain Symptom Manage 1999;17(4):256–265. Figure 2, p. 260. Used with permission.
were collected in a clinical trial, researchers found that the sequential design would have reduced the trial sample by 234 patients and shortened its duration by 3 to 4 months.19

Although the sequential trial design has great appeal, it has been used primarily in drug trials. However, it has potential for creating an efficient design in other types of clinical trials.

See the Chapter 16 Supplement for a full description of sequential trials.
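The boundary-charting logic described above can be mimicked numerically. This is a simplified sketch: a real sequential plan derives its boundaries from prespecified error rates, whereas the fixed boundary of 10 used here is arbitrary. The preference pattern follows the dyspnea example, coding +1 when a pair prefers the 25% dose and -1 for the 50% dose:

```python
def chart_preferences(prefs, boundary=10):
    """Accumulate paired preferences and stop as soon as the running
    total crosses the upper (+boundary) or lower (-boundary) line.
    Returns the charted path and a verdict."""
    path, total = [], 0
    for p in prefs:
        total += p
        path.append(total)
        if total >= boundary:
            return path, "25% dose more effective"
        if total <= -boundary:
            return path, "50% dose more effective"
    return path, "no preference"

# Four pairs preferring the 50% dose, three preferring the 25% dose,
# then alternating preferences, as described for the study
prefs = [-1] * 4 + [1] * 3 + [1, -1] * 4
path, verdict = chart_preferences(prefs)
```

With these 15 pairs the path drifts back toward zero and no boundary is reached, mirroring the line in Figure 16-7 moving toward the middle boundary.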
COMMENTARY

The importance of understanding concepts of design cannot be overemphasized in the planning stages of an experimental research project. There is a logic in these designs that must be fitted to the research question and the scope of the project so that meaningful conclusions can be drawn once data are analyzed. The choice of a design should ultimately be based on the intent of the research question:

. . . the question being asked determines the appropriate research architecture, strategy, and tactics to be used—not tradition, authority, experts, paradigms, or schools of thought.20

The underlying importance of choosing an appropriate research design relates to consequent analysis issues that arise once data are collected. Many beginning researchers have had the unhappy experience of presenting their data to a statistician, only to find out that they did not collect the data appropriately to answer their research question. Fisher21 expressed this idea in his classical work, The Design of Experiments:

Statistical procedure and experimental design are only two different aspects of the same whole, and that whole comprises all the logical requirements of the complete process of adding to natural knowledge by experimentation.

Although the clinical trial is considered a “gold standard” for establishing cause and effect, it is by no means the best or most appropriate approach for many of the questions that are most important for improving practice. The real world does not operate with controls and schedules the way an experiment demands. Quasi-experimental and observational studies, using intact groups or nonrandom samples, play an important role in demonstrating effectiveness of interventions.22 These differences can be appreciated as the divergence between “evidence-based practice” and “practice-based evidence” (see Chapter 2). As we continue to examine outcomes as a primary focus of clinical research, we must consider many alternatives to the traditional clinical trial in order to discover the most “effective” courses of treatment.23
6. Donner A, Klar N. Pitfalls of and controversies in cluster randomization trials. Am J Public Health 2004;94(3):416–422.
7. Murray CJ, Lauer JA, Hutubessy RC, Niessen L, Tomijima N, Rodgers A, et al. Effectiveness and costs of interventions to lower systolic blood pressure and cholesterol: a global and regional analysis on reduction of cardiovascular-disease risk. Lancet 2003;361(9359):717–725.
8. Thabane L, Kaczorowski J, Dolovich L, Chambers LW, Mbuagbaw L. Reducing the confusion and controversies around pragmatic trials: using the Cardiovascular Health Awareness Program (CHAP) trial as an illustrative example. Trials 2015;16:387.
9. Kaczorowski J, Chambers LW, Dolovich L, Paterson JM, Karwalajtys T, Gierman T, et al. Improving cardiovascular health at population level: 39 community cluster randomised trial of Cardiovascular Health Awareness Program (CHAP). BMJ 2011;342:d442.
10. Koo HP, Wilson EK, Minnis AM. A computerized family planning counseling aid: a pilot study evaluation of smart choices. Perspect Sex Reprod Health 2017;49(1):45–53.
11. Dziedzic K, Nicholls E, Hill S, Hammond A, Handy J, Thomas E, et al. Self-management approaches for osteoarthritis in the hand: a 2x2 factorial randomised trial. Ann Rheum Dis 2015;74(1):108–118.
12. Dietsch AM, Cirstea CM, Auer ET Jr, Searl JP. Effects of body position and sex group on tongue pressure generation. Int J Orofacial Myology 2013;39:12–22.
13. Chan GN, Smith AW, Kirtley C, Tsang WW. Changes in knee moments with contralateral versus ipsilateral cane usage in females with knee osteoarthritis. Clin Biomech 2005;20(4):396–404.
14. Chang YJ, Anderson GC, Lin CH. Effects of prone and supine positions on sleep state and stress responses in mechanically ventilated preterm infants during the first postnatal week. J Adv Nurs 2002;40:161–169.
15. Linsenmeyer TA, Harrison B, Oakley A, Kirshblum S, Stock JA, Millis SR. Evaluation of cranberry supplement for reduction of urinary tract infections in individuals with neurogenic bladders secondary to spinal cord injury. A prospective, double-blinded, placebo-controlled, crossover study. J Spinal Cord Med 2004;27(1):29–34.
16. Dooley EE, Golaszewski NM, Bartholomew JB. Estimating accuracy at exercise intensities: a comparative study of self-monitoring heart rate and physical activity wearable devices. JMIR Mhealth Uhealth 2017;5(3):e34.
17. Stuge B, Laerum E, Kirkesola G, Vollestad N. The efficacy of a treatment program focusing on specific stabilizing exercises for pelvic girdle pain after pregnancy: a randomized controlled trial. Spine 2004;29(4):351–359.
18. Allard P, Lamontagne C, Bernard P, Tremblay C. How effective are supplementary doses of opioids for dyspnea in terminally ill cancer patients? A randomized continuous sequential clinical trial. J Pain Symptom Manage 1999;17(4):256–265.
19. Bolland K, Weeks A, Whitehead J, Lees KR. How a sequential design would have affected the GAIN International Study of gavestinel in stroke. Cerebrovasc Dis 2004;17(2-3):111–117.
20. Sackett DL, Wennberg JE. Choosing the best research design for each question. BMJ 1997;315:1636.
21. Fisher RA. The Design of Experiments. 9th ed. New York: Macmillan, 1971.
22. Gartlehner G, Hansen RA, Nissman D, Lohr KN, Carey TS. Criteria for distinguishing effectiveness from efficacy trials in systematic reviews. Technical Review 12. AHRQ Publication No. 06-0046. Rockville, MD: Agency for Healthcare Research and Quality; April 2006. Available at https://www.ncbi.nlm.nih.gov/books/NBK44029/pdf/Bookshelf_NBK44029.pdf. Accessed March 27, 2017.
23. Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med 2000;342(25):1887–1892.
CHAPTER 17
Quasi-Experimental Designs
Although the randomized trial is considered the optimal design for testing cause-and-effect hypotheses, the necessary restrictions of a randomized trial are not always possible within the clinical environment. Depending on the nature of the treatment under study and the population of interest, use of randomization and control groups may not be possible.

Quasi-experimental designs utilize structures similar to experimental designs, but lack random assignment, comparison groups, or both. Even with these limitations, these designs represent an important contribution to clinical research because they accommodate the limitations of natural settings, where scheduling treatment conditions and random assignment are often difficult, impractical, or unethical. They are often used in pragmatic studies because of the logistic limitations that occur in practice. The purpose of this chapter is to describe time series designs and nonequivalent group designs, the two basic structures of quasi-experimental research.

■ Validity Concerns

Because quasi-experimental designs lack at least one of the requirements for controlled trials, they cannot rule out threats to internal validity with the same confidence as experimental studies. Nonequivalent groups may differ from each other in many ways in addition to differences between treatment conditions. Therefore, the degree of control is reduced.

Quasi-experimental designs present reasonable alternatives to the randomized trial as long as the researcher carefully documents subject characteristics, controls the research protocol, and uses blinding as much as possible. The conclusions drawn from these studies must take into account the potential biases of
the sample, but may provide important information, nonetheless.

■ Time Series Designs

Many research questions focus on the variation of responses over time. In such a design, time becomes an independent variable with several levels and researchers will look for differences across time intervals. Such designs can be configured in several ways, with varying degrees of control.

One-Group Pretest–Posttest Design

The one-group pretest–posttest design is a quasi-experimental design that involves one set of repeated measurements taken before and after treatment on one group of subjects (see Fig. 17-1). The effect of treatment is determined by measuring the difference between pretest and posttest scores.

[Figure schematic: T1 (Pretest), Intervention, T2 (Posttest)]
Figure 17–1 A one-group pretest–posttest design. The independent variable is time, with two levels (T1 and T2).

In this design, the independent variable is time, with two levels (pretest and posttest). Treatment is not an independent variable because all subjects receive the intervention.

    A study was designed to examine the effect of four-direction shoulder stretching exercises for patients with idiopathic adhesive capsulitis.1 All subjects received the same exercise protocol. Researchers studied the effects of treatment on pain, range of motion (ROM), function, and quality of life measures. Comparisons were made between pretest scores and final scores at follow-up, with a mean duration of 22 months.

In this study, the researchers saw significant improvements in outcome variables and concluded that the treatment was successful.

Design Validity

We must be cautious in drawing conclusions from this design, however. The design is weak because it has no comparison group, making it especially vulnerable to threats to internal validity. Although the researcher can demonstrate change in the dependent variable by comparing pretest and posttest scores, there is always the possibility that some events other than the experimental treatment occurred within the time frame of the study that caused the observed change. External validity is also limited by potential interactions with selection because there is no comparison group.

The one-group pretest–posttest design may be defended, however, in cases where previous research has documented the behavior of a control group in similar circumstances. For instance, other studies may have shown that shoulder pain does not improve in this population over a 4-week period without intervention. On that basis, we could justify using a single experimental group to investigate just how much change can be expected with treatment. This documentation might also allow us to defend the lack of a control group on ethical grounds.

The design is also reasonable when the experimental situation is sufficiently isolated so that extraneous environmental variables are effectively controlled, or where the time interval between measurements is short so that temporal effects are minimized.2 In studies where data collection is completed within a single testing session, temporal threats to internal validity will be minimal, but testing effects remain uncontrolled. Under all circumstances, however, this design is not considered a true experiment and should be expanded whenever possible to compare two groups. It is often possible to use an alternative treatment as a control.

Analysis of One-Group Pretest–Posttest Designs. A paired t-test is usually used to compare pretest and posttest mean scores. With ordinal data or small samples, the sign test or the Wilcoxon signed-ranks test can be used.

Repeated Measures Design

Many research questions that deal with the effects of treatment on physiological or psychological variables are concerned with how those effects are manifested over time. As an extension of the pretest–posttest design, the repeated measures design is naturally suited to assessing such trends. Multiple measurements of the dependent variable are taken within prescribed time intervals. The intervention may be applied once or it may be repeated in between measurements (see Fig. 17-2).

    Researchers studied the effects of constraint-induced movement therapy (CIMT) on spasticity and upper extremity function in patients with chronic hemiplegia.3 They studied 20 patients, taking measurements before
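Returning to the analysis box above: the paired t statistic for one-group pretest–posttest data can be computed directly. This is a minimal sketch with invented scores, not values from any study cited here; in practice, scipy.stats.ttest_rel and scipy.stats.wilcoxon supply the same tests along with p-values.

```python
import math

def paired_t(pre, post):
    """Paired t statistic: t = mean(d) / (sd(d) / sqrt(n)),
    where d are the paired pretest-posttest differences, df = n - 1."""
    d = [b - a for a, b in zip(pre, post)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Invented scores for 10 subjects measured before and after treatment
pretest  = [54, 61, 48, 70, 65, 58, 62, 49, 57, 66]
posttest = [60, 66, 50, 78, 70, 65, 61, 55, 64, 73]
t = paired_t(pretest, posttest)   # compare against the critical t with df = 9
```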
[Figure 17–2 schematic: pretests T1, T2, T3, then Intervention, then posttests T4, T5, T6.]
Focus on Evidence 17–1
Documenting Trends

Taking blood cultures is an important diagnostic tool in emergency departments. However, false positives from contamination can lead to unnecessary morbidity, invasive procedures, antibiotic therapy, and healthcare costs. Laboratory standards recommend that hospitals maintain a contamination rate of less than 3%.

One hospital found that its contamination rate was consistently above 4%, and developed an interdisciplinary task force to design a QI effort to achieve a sustainable rate below the 3% benchmark.4 They used an ITS analysis to compare contamination rates prior to and after implementation of an intervention, shown in the figure. From March through December 2009, they documented their process of blood culture collection as a baseline. They found many factors that contributed to a high contamination rate, including significant variation in practice. This led to development of a standardized collection protocol and training sessions that were implemented across the hospital over a 2-month period (blue area).

Following the training period, the intervention was fully implemented. The data were analyzed using a segmented linear regression model, which looked at trends by dividing the study periods into 2-week intervals (segments), so that they had 24 time points each across baseline and intervention periods.
[Figure: percentage of blood cultures contaminated (%), plotted in 2-week intervals from 3/1/09 through 1/2/11.]
Using a forecasting model, the baseline trend was projected into the intervention period (dashed line) to estimate what the contamination rate would have been had the intervention not been implemented. During the baseline period, an average of 4.3% of cultures were contaminated, compared to 1.7% during the intervention period. The intervention was associated with an immediate 2.9% (95% CI = 2.2% to 3.2%) absolute reduction in contamination, which was significant. The contamination rate was maintained below 3% throughout the year of the monitored intervention period.

The ITS design is often used for this kind of QI project because it allows investigators to observe trends throughout the testing periods. As shown in this example, the variability in data can be further explored to determine factors that impacted outcomes. Simple pretest–posttest designs cannot provide this level of information, which can be extremely useful for decision-making and policy implementation.

From Self WH, Speroff T, Grijalva CG, et al. Reducing blood culture contamination in the emergency department: an interrupted time series quality improvement study. Acad Emerg Med 2013;20(1):89–97. Adapted from Figure 4, p. 94. Used with permission.
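A segmented regression of the kind used in this study can be sketched with NumPy. The model and the synthetic data below are illustrative only, not the study's actual data: the series sits near 4.3% for 24 baseline points and near 1.7% for 24 intervention points, so the fitted level-change coefficient should recover a drop of roughly 2.6 points.

```python
import numpy as np

def segmented_fit(y, change_point):
    """Least-squares fit of y = b0 + b1*t + b2*post + b3*(t - cp)*post,
    where post = 1 from the change point onward: b2 estimates the
    immediate level change and b3 the change in slope."""
    t = np.arange(len(y), dtype=float)
    post = (t >= change_point).astype(float)
    X = np.column_stack([np.ones_like(t), t, post, (t - change_point) * post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic contamination rates: ~4.3% baseline, ~1.7% after intervention
rng = np.random.default_rng(0)
y = np.concatenate([4.3 + rng.normal(0, 0.1, 24),
                    1.7 + rng.normal(0, 0.1, 24)])
b0, b1, b2, b3 = segmented_fit(y, change_point=24)   # b2 near -2.6
```

The fitted b2 plays the role of the immediate reduction reported in the study, and projecting b0 + b1*t beyond the change point gives the dashed counterfactual baseline trend.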
Design Validity

A time series design does involve repeated measures, but it is distinguished from standard repeated measures designs by the large number of measurements that are taken continuously across baseline and intervention phases.

The ITS design is an extension of the one-group pretest–posttest design. It offers more control, however, because the multiple pretests and posttests act as a pseudocontrol condition, demonstrating maturational trends that naturally occur in the data or the confounding effects of
design, except that subjects are not assigned to groups randomly. This design can be structured with one treatment group and one control group or with multiple treatment and control groups.

Intact Groups

    A study was done to determine the effectiveness of an individualized physical therapy intervention in treating neck pain.10 One treatment group of 30 patients with neck pain completed physical therapy treatment. The control group of convenience was formed by a cohort group of 27 subjects who also had neck pain but did not receive treatment for various reasons, such as delay in insurance approval, time constraints, or exacerbations. There were no significant differences between groups in demographic data or the initial test scores of the outcome measures. A physical therapist rendered an intervention to the treatment group based on a clinical decision-making algorithm. Treatment effectiveness was examined by assessing changes in ROM, pain, endurance, and function. Both the treatment and control groups completed the initial and follow-up examinations, with an average duration of 4 weeks between tests.

In this study of therapy for neck pain, the patients are members of intact groups by virtue of their personal circumstances.

Subject Preferences

    A study was designed to examine the influence of regular participation in chair exercises on postoperative deconditioning following hip fracture.11 Patients were distinguished by their willingness to participate and were not randomly assigned to groups. A control group received usual care following discharge. Physiological, psychological, and anthropometric variables were measured before and after intervention.

In the chair exercise study, subjects self-selected their group membership. Because groups are not randomly assigned, bias is a concern in these designs.

Design Validity

Although the nonequivalent pretest–posttest control group design is limited by the lack of randomization, it still has several strengths. Because it includes a pretest and a control group, there is some control over history, testing, and instrumentation effects. The pretest scores can be used to test the assumption of initial equivalence on the dependent variable, based on average scores and measures of variability.

The major threat to internal validity is the interaction of selection with history and maturation. For instance, if those who chose to participate in chair exercises were stronger or more motivated patients, changes in outcomes may have been related to physiological or psychological characteristics of subjects. These characteristics could affect general activity level or rate of healing. Such interactions might be mistaken for the effect of the exercise program. These types of interactions can occur even when the groups are identical on pretest scores.

Design Strategies for Nonequivalent Groups

When subjects cannot be assigned to groups, researchers are often concerned about whether confounding variables are balanced across groups at baseline. Several design and analysis strategies can be incorporated into a nonequivalent group design that help to balance groups.

Designs can incorporate stratification or a blocking variable, whereby a confounder is built into the design as an independent variable. For example, subjects can be divided into groups by age or gender. This allows the confounding variable to be controlled in comparisons within and across strata (see Chapter 14). This technique can only be effectively used with one or two variables at most.

A statistical strategy is the use of matching to equate groups on baseline scores that may present substantial confounding effects. For instance, subjects may be matched on age or gender to assure a balance of these factors. This is a challenge, however, when many variables may have a confounding effect, making one-on-one matching unfeasible. Researchers often use a technique of deriving propensity scores for this purpose. Through the use of regression analyses, a single variable is created that statistically represents a set of several confounders. This propensity score can then be used to match subjects so that observed covariates are balanced across groups (see Chapter 31).12

Analysis of Nonequivalent Pretest–Posttest Designs. Several statistical methods are suggested for use with nonequivalent groups, including the unpaired t-test (with two groups), analysis of variance, and analysis of covariance. Analyses can also include matched scores. Ordinal data can be analyzed using the Mann–Whitney U-test. Nonparametric tests may be more appropriate with nonequivalent groups, as variances are likely to be unequal. Preference for one approach will depend in large part on how groups were formed and what steps the researcher can take to ensure or document initial equivalence.

Tests such as the t-test, analysis of variance, and chi square are often used to test for differences in baseline measures. Regression analysis or discriminant analysis may be the most applicable approach to determine how the dependent variable differentiates the treatment groups, while adjusting for other variables. Statistical strategies must include mechanisms for controlling for group differences on potentially confounding variables.
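The unpaired comparison suggested in the analysis box above can be computed directly. Because variances are likely to be unequal in nonequivalent groups, this illustration uses Welch's form of the unpaired t-test, which does not assume equal variances; the change scores are invented. In practice, scipy.stats.ttest_ind with equal_var=False and scipy.stats.mannwhitneyu give the corresponding p-values.

```python
import math

def welch_t(x, y):
    """Unpaired t statistic without assuming equal variances
    (Welch's form), with the Welch-Satterthwaite approximation
    for the degrees of freedom."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    t = (mx - my) / se
    df = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# Hypothetical change scores for a treatment group and a
# convenience control group (illustrative values only)
treated = [12, 9, 15, 11, 14, 10, 13, 12]
control = [5, 7, 4, 8, 6, 5, 9, 6]
t, df = welch_t(treated, control)
```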
COMMENTARY

The only relevant test of the validity of a hypothesis is comparison of its prediction with experience.
—Milton Friedman (1912–2006), American Economist, 1976 Nobel Memorial Prize in Economic Sciences

In the pursuit of evidence-based practice, clinicians must be able to read research literature with a critical eye, assessing not only the validity of the study’s design, but also the generalizability of its findings. Published research must be applicable to clinical situations and individual patients to be useful. The extent to which any study can be applied to a given patient is always a matter of judgment.

Purists will claim that generalization of intervention studies requires random selection and random assignment—in other words, an RCT. In this view, quasi-experimental studies are less generalizable because they do not provide sufficient control of extraneous variables. In fact, such designs may be especially vulnerable to all the factors that affect internal validity.

One might reasonably argue, however, that the results of the RCT may not apply to a particular patient who does not meet all the inclusion or exclusion criteria, or who cannot be randomly assigned to a treatment protocol. Many of the quasi-experimental models will provide an opportunity to look at comparisons in a more natural context.

In the hierarchy of evidence that is often used to qualify the rigor and weight of a study’s findings, the RCT is considered the highest level (see Chapter 5). But the quasi-experimental study should not be dismissed as a valuable source of information. As with any study, however, it is the clinician’s responsibility to make the judgments about the applicability of the findings to the individual patient.
11. Nicholson CM, Czernwicz S, Mandilas G, Rudolph I, Greyling MJ. The role of chair exercises for older adults following hip fracture. S Afr Med J 1997;87(9):1131–1138.
12. Matas AJ, Kandaswamy R, Humar A, Payne WD, Dunn DL, Najarian JS, et al. Long-term immunosuppression, without maintenance prednisone, after kidney transplantation. Ann Surg 2004;240(3):510–516; discussion 16–17.
13. Moser M. Randomized clinical trials: alternatives to conventional randomization. Am J Emerg Med 1986;4(3):276–285.
14. Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharm Stat 2014;13(1):41–54.
15. Sacks HS, Chalmers TC, Smith H Jr. Sensitivity and specificity of clinical trials. Randomized v historical controls. Arch Intern Med 1983;143(4):753–755.
16. Micciolo R, Valagussa P, Marubini E. The use of historical controls in breast cancer. An assessment in three consecutive trials. Control Clin Trials 1985;6(4):259–270.
17. Bridgman S, Engebretsen L, Dainty K, Kirkley A, Maffulli N. Practical aspects of randomization and blinding in randomized clinical trials. Arthroscopy 2003;19(9):1000–1006.
18. Ng JY, Tam SF. Effect of exercise-based cardiac rehabilitation on mobility and self-esteem of persons after cardiac surgery. Percept Mot Skills 2000;91(1):107–114.
CHAPTER 18
Single-Subject Designs

DESCRIPTIVE  EXPLORATORY  EXPLANATORY
group of 20 adults was tested in a repeated measures design and no significant difference was seen between the two conditions based on a comparison of means.

Okay, so reading rate had no effect on stuttering—or did it? It turns out that a closer look at individual results showed that 8 subjects actually decreased their frequency of stuttering, 1 did not change, and 11 demonstrated an increase during the faster speaking condition. By drawing conclusions from average treatment effects, an important differentiation among individual performances was obscured, potentially affecting clinical outcomes.

Limitations of Randomized Trials

Because the application of research findings is often viewed through the lens of the randomized controlled trial (RCT), such studies generally incorporate large samples, randomization, and controlled protocols. Findings are assumed to be generalizable—but the assumption that one treatment’s effect can be generalized to all patients is rarely true.2 (If it were, we would not need so many different medications for arthritis, depression, diabetes, or heart disease—as evidenced by TV commercials!)

Because group studies typically take measurements at only two or three points in time, they may miss variations in response that occur over time. And sometimes a particular patient might benefit from a treatment that is shown to be inferior in a randomized trial. Therefore, randomized group studies can be ambiguous for purposes of clinical decision-making for an individual patient.

Individualized Care

In the spirit of EBP, there is an increasing recognition among providers that healthcare is not a “one-size-fits-all” process. The concept of “personalized” or “individualized” care is gaining acceptance, especially with the availability of relevant genetic information.6 The personalized approach seeks to identify subgroups within populations to better understand how an individual’s characteristics may affect treatment effectiveness and risks.7

Another aspect of this concept is a shift in the relationship between patient and provider. In addition to patients being “subjects” in research studies that will produce data to inform future care, there is a growing focus on patient involvement in studying issues that are most relevant to their lives, the idea of patient-centered outcomes research (see Chapter 2). Giving patients an opportunity to engage in the decision-making about their individual care allows all involved to take the unique circumstances of each patient into account. Therefore, there is a need to bring the search for evidence to the individual patient.

Single-subject studies provide an alternative approach that allows us to draw conclusions about the effects of treatment based on the responses of individuals under controlled conditions, with the flexibility to observe change before, during, and after treatment. By focusing on the individual patient, this approach provides information about which treatment works, for whom, and under which conditions.

The Research Question

Single-subject research can be used to study comparisons among several treatments, between components of treatments, or between treatment and no-treatment conditions. These designs are considered experimental because they require the same attention to logical design and control as other experimental designs. A research hypothesis should indicate the expected relationship between a defined treatment and a clinical outcome, as well as the characteristics that make subjects appropriate for the study. Hypotheses can be directional or nondirectional (see Chapter 3).

Single-subject designs have also been called single-case designs, single-system strategies, N-of-1 trials, small-N designs, and time series designs.

■ Structure of Single-Subject Designs

SSDs are structured around two core elements: repeated measurement and design phases.

Repeated Measurement

An SSD involves the systematic collection of repeated measurements of a behavioral response for one individual and is therefore considered a within-subjects design. The outcome response, or target behavior, is measured at frequent and regular intervals, such as at each treatment session or over days. These repeated assessments are required to observe response patterns, evaluate variability of the behavioral response, and modify the intervention as necessary while the study progresses.

Design Phases

The second core element is the delineation of at least two testing periods, or phases: a baseline phase, prior to treatment, and an intervention phase. The target behavior is measured repeatedly across both phases.

The baseline phase provides information about responses during a period of “no treatment,” or a control
6113_Ch18_249-270 02/12/19 4:43 pm Page 251
condition. The assumption is that baseline data reflect the natural state of the target behavior over time along with ongoing effects of background variables, such as daily activities, other treatments, and personal characteristics. When treatment is initiated, changes from baseline to the intervention phase should be attributable to intervention. Therefore, baseline data provide a standard of comparison for evaluating the potential cause-and-effect relationship between the intervention and target behavior.

It is important to differentiate an SSD from a case report (see Chapter 20), which is an after-the-fact intensive description of a patient's episode of care. A case report is intended to identify patterns and generate hypotheses, but does not include systematic manipulation of variables or serial measurements under controlled treatment conditions.1,8

Graphing Responses
Data are plotted on a chart showing baseline and intervention phases, as shown in Figure 18-1, with magnitude of the target behavior along the Y-axis and time (sessions, trials, days, weeks) along the X-axis. Besides serving as research documentation, this process provides a clinically useful method for visualizing patterns of change and progress over time. These hypothetical data will be used for various illustrations throughout this chapter.

Using conventional notation, the baseline period is represented by the letter A and the intervention period by the letter B. To facilitate description, this design, with one baseline phase and one intervention phase, is called an A–B design.

Figure 18–1 Example of a basic A–B design, showing baseline and intervention phases.

Collecting Baseline Data
The collection of baseline data is the single feature of an SSD that particularly distinguishes it from clinical practice, case reports, and traditional experimental designs, where treatment is initiated immediately following assessment, making it impossible to determine which components of treatment actually caused observed changes. More importantly, it is not possible to determine if observed changes would have occurred without intervention. Just as we need a control group to validate group comparisons, we must have a control period to make these determinations for a single-subject experiment (see Box 18-1).

Two characteristics of baseline data are important for interpretation of clinical outcomes: stability, which reflects the consistency of response over time, and trend or slope, which shows the direction and rate of change in the behavior (see Fig. 18-2).

The most desirable baseline pattern demonstrates a stable baseline, with a constant level of behavior and minimal variability, indicating that the target behavior is not changing. Therefore, changes that are observed after the intervention is introduced can be confidently attributed to a treatment effect. If treatment has no effect, we would expect to see this baseline pattern continue into the intervention phase.

A variable baseline can present a problem for interpretation. When this type of pattern emerges, it is generally advisable to continue to collect baseline data until some
stability is achieved. With extreme variability, the researcher is obliged to consider which factors might be influencing the target behavior to create such an erratic response.

Trending baselines, either accelerating or decelerating, indicate that the target behavior is changing without intervention. Either type of trend may represent improvement or deterioration, depending on the target behavior. Trends can also be characterized as stable or unstable; that is, the rate of change may be constant or variable.

Box 18-1 Being All Right With Baseline Data
Ethical objections often arise when the baseline concept is introduced, just as they do when a control group is proposed for a group comparison study. Two points must be made in this regard. First, we can argue that it is not unethical to withhold treatment for a relatively short period when we are unsure about the effectiveness of the intervention in the first place. Indeed, it may actually be unethical to continue to provide an inferior or ineffective treatment without testing it experimentally. Second, collecting baseline data does not mean that the clinician is denying all treatment to the patient. It only means that one portion of the patient's total treatment is being isolated for study while all other treatments and activities are continued as background.
Measuring baseline data, however, may not be an appropriate approach for studying every type of intervention, such as treatments for critical or life-threatening situations, where treatment effects are not questioned and withholding treatment would be harmful.
Although single-subject studies are designed to incorporate intervention within the framework of clinical treatment, informed consent is required from all patients participating in single-subject trials.
Figure 18–2 Types of baselines that may be encountered in single-subject designs: A) Stable, level baseline; B) Variable baseline; C) Stable accelerating trend; D) Variable decelerating trend.

Length of Phases
One of the first considerations when planning single-subject experiments concerns how long phases should be. There is some flexibility in these choices, allowing for consideration of the type of patient, the type of treatment, and the expected rate of change in the target behavior. This flexibility differentiates the SSD from traditional designs where onset and duration of treatment are established and fixed prior to experimentation, regardless of how the patient responds. There are some guidelines that can be followed to assist in these decisions.

Because trend is an important characteristic of repeated measurements, a minimum of five data points is recommended in each phase.9 The application of some analysis procedures is enhanced by having 12 or more points within each phase to establish stability.10

Phase length must take into account the time needed to see a meaningful change in the target behavior. If possible, it is generally desirable to use relatively equal phase lengths to control for potential time-related factors such as maturation or the motivational influence of continued attention over prolonged treatment periods. Many researchers set shorter baseline phases because of clinical considerations in treating the patient.
Despite these plans, however, it is usually advisable to extend baseline or intervention phases until stability is achieved, or at least until one is sure that responses are representative of the true condition under study. The decision on length, then, can rest with the researcher as a function of the subject's response, and may change during the course of the study.11

■ Defining the Research Question

Just like any other experiment, SSDs require rigor in the design protocol to assure confidence in observed effects of treatment. The intervention, or independent variable, should be operationally defined so that others could replicate it.

Some interventions, like drugs, can be controlled with placebos and can be easily blinded. That is not the case with many types of clinical interventions that require direct interaction between clinician and patient. Therefore, the strategies for treatment must be carefully delineated so that they are well understood by all those involved, including patients and families.

An important feature of SSDs is the ability to make adjustments in the intervention as the study progresses based on the subject's performance.4 For instance, if the subject is not responding to treatment, it would be fruitless to continue to collect data without trying to vary aspects of intervention that might be successful.12 This flexibility provides a valuable practical approach to understanding treatment effectiveness.

■ Measuring the Target Behavior

Because single-subject studies provide an individualized approach, target behaviors can be specified that reflect impairments as well as activity and participation restrictions in ways that are clinically relevant to the patient and practitioner. Like all studies, these measures must be operationally defined, reliable, observable, quantifiable, and a valid indicator of treatment effectiveness. Ideally, the behaviors will be stable and the degree of expected change will be meaningful.

Although patients usually present with several clinical problems, target behaviors should focus on a primary response. Is it how often a patient can perform a particular task, or how long a behavior can be maintained? Is it a score on a functional scale? Is the number of correct responses of interest? Or is it simply whether or not the behavior occurred? The answers to these questions should be based on the outcomes that are most meaningful to the patient, particularly activity and participation measures.

The most common techniques for measuring behaviors within SSDs are frequency, duration, and magnitude measures.

Frequency
A frequency count indicates the number of occurrences of a behavior within a fixed time interval or number of trials. Frequency can be expressed as a percentage, by dividing the number of occurrences of the target behavior by the total number of opportunities for the behavior to occur. This method is often used in studies in which accuracy of performance (percentage correct) is of primary interest.

Interval Recording
A useful procedure for measuring repetitive behavior is called interval recording, which involves breaking down the full measurement period into preset time intervals and determining if the behavior occurs or does not occur during each interval.

Xin et al13 used an SSD to study the effects of using a customized iPad app on self-monitoring behaviors in four students who were classified with autism spectrum disorder (ASD). Students were observed during 20-minute periods to determine if they were able to remain seated, pay attention to the teacher, and focus their work on a given assignment. Behaviors were observed in 2-minute intervals across the 20-minute sessions, with measures recorded as "+" for an occurrence of the target behavior or "–" for a nonoccurrence in each interval. The frequency of occurrences was summed over the 20-minute period and this sum was considered the child's score for that session. This study illustrates the importance of operational definitions for frequency counts, specifying exactly what constitutes an occurrence and nonoccurrence of the behavior.

Duration
Target behaviors can also be evaluated according to how long they last. Duration can be measured either as the cumulative total duration of a behavior during a treatment session or as the duration of each individual occurrence of the behavior. For example, this could include how long it takes to complete a task, how long one can maintain a certain posture, or how long pain lasts after an activity. Operational definitions for duration measures must specify criteria for determining when the behavior starts and ends.
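Scoring a session from interval recording, and expressing frequency as a percentage of opportunities, are simple computations. A minimal sketch (the helper name is ours, not from the study):

```python
def session_score(intervals):
    """Score one observation session from interval recording.

    intervals: one '+' (occurrence) or '-' (nonoccurrence) mark per
    preset interval, e.g., ten 2-minute intervals in a 20-minute session.
    Returns (count, percent of intervals containing the target behavior).
    """
    count = intervals.count('+')
    return count, 100 * count / len(intervals)
```

For example, a session recorded as "++-+-+-+++" yields a count of 7 occurrences, or 70% of intervals.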
Figure 18–3 Example of an A–B–A design, showing endurance using the
6MWT in a child with cerebral palsy. The intervention was an aquatic-based
aerobic exercise program. The horizontal bands represent 2 standard deviations
(SD) above and below the mean of the first baseline scores. From Retarekar R,
Fragala-Pinkham MA, Townsend EL. Effects of aquatic aerobic exercise for a
child with cerebral palsy: single-subject design. Pediatr Phys Ther 2009;21(4):
336–344. Figure 4, p. 341. Used with permission of the Section on Pediatrics of
the American Physical Therapy Association.
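The 2 standard deviation band shown in Figure 18-3 can be computed directly from the first baseline phase; one commonly described decision rule treats two or more successive intervention points falling outside the band as evidence of change. A minimal sketch (function names are illustrative):

```python
from statistics import mean, stdev

def two_sd_band(baseline):
    """Band of mean +/- 2 SD computed from baseline scores."""
    m, s = mean(baseline), stdev(baseline)
    return m - 2 * s, m + 2 * s

def outside_band(band, intervention):
    """Intervention-phase points that fall outside the baseline band."""
    lo, hi = band
    return [y for y in intervention if y < lo or y > hi]
```

Plotting the band across both phases, as in Figure 18-3, lets the reader make the same judgment visually.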
[Figure: A–B–A–B withdrawal design data for Subject 2, plotted over measurement sessions.]
three or more individuals who share common relevant characteristics.

All subjects begin with a concurrent collection of baseline data, which is then followed by a staggered introduction of the intervention. By allowing each baseline to run for a different number of data points, systematic changes in the target behavior that are correlated with the onset of intervention can be reliably attributed to a treatment effect.

Educators studied the effect of a student-centered precision teaching model on the development of basic reading skills and word building in three children with reading disorders.18 They used a multiple baseline design with staggered introduction of the teaching strategy to study reading comprehension and fluency.

Results are shown in Figure 18-5. Each of the baselines shows minimal variability, but changes were obvious for each child following onset of the intervention. The authors concluded that training improved student performance. Each student showed initial decrement during a follow-up period, but performance rebounded back to intervention levels within a few weeks. When multiple baseline designs use the A–B format, internal validity is controlled because the results are replicated across three conditions at staggered times, making it unlikely that external factors could coincidentally occur at the time each treatment was initiated to cause the response change. One of the major advantages of the multiple-baseline approach is that replication and experimental control can be achieved without withdrawal of treatment.

Other Multiple Baseline Designs
Multiple baseline designs can also be used to test the influence of environment. In the multiple baseline design across settings, one individual is monitored in multiple settings, with the same treatment applied sequentially across two or more environmental conditions testing the same intervention and behavior.

Sometimes, it is of interest to study the effect of one intervention on several related clinical behaviors. In the multiple baseline design across behaviors, the researcher monitors a minimum of three similar yet functionally independent behaviors in the same subject. Behaviors are addressed using the same intervention, which is introduced sequentially. This design requires that the targeted behaviors are similar enough that they will all respond to one treatment approach and that they are functionally independent of one another so that baselines will remain stable until treatment is introduced.

A basic premise of the multiple baseline design is that baseline data are available for all subjects simultaneously so that temporal effects cannot contaminate results. A variation, called a nonconcurrent multiple baseline design, can be used in the common clinical situation when similar subjects are not available for concurrent monitoring.19 This approach requires that the researcher arbitrarily determines the length of several baselines and assigns them to subjects at random as they become available
Figure 18–5 Example of a multiple baseline design across three subjects, showing the number
of correct responses in building words during reading tests before, during, and after a student-
centered precision teaching program. The A–B design incorporated staggered baselines. Green
points represent correct responses; red points represent incorrect responses. From Bonab BG,
Arjmandnia AA, Bahari F, Khoei BA. The impact of precision teaching on reading comprehension
in students with reading disabilities. J Educ Soc Sci 2016;4:366–376. Figure 2, p. 375. Reprinted
with permission.
for the study. This is a practical but weaker design because external factors related to the passage of time may be different for each subject.

See the Chapter 18 Supplement for description of multiple baseline studies across settings and behaviors, and the nonconcurrent multiple baseline design.

■ Designs With Multiple Treatments

Single-subject strategies can also be used to compare the effects of two or more treatments.

Alternating Treatment Design
One strategy for studying multiple treatments involves the alternating treatment design. The essential feature of this design is the rapid alternation of two or more interventions or treatment conditions, each associated with a distinct stimulus. Treatment can be alternated within a treatment session, session by session, or day by day.

This design can be used to compare treatment with a control or placebo, but is more commonly used to compare two different interventions to determine which one will be more effective. Data for each treatment condition
are plotted on the same graph, allowing for comparison of data points and trends between conditions.

A study was done to show the effect of service dog partnerships on energy conservation in individuals with mobility challenges.20 Researchers compared duration and perceived effort to complete functional tasks in 12 sessions over 6 weeks. Subjects worked with dogs to train them in the tasks in different settings, with a randomized order of testing with and without the service dog.

Figure 18-6 illustrates the outcome for one subject with a spinal cord injury, showing the time and effort needed to pick up a fanny pack from the floor while seated in her wheelchair. The upper line represents effort without the service dog.

The alternating treatment design has a primary focus on comparison of data points between treatment conditions at each session. For instance, we see that the effort with the service dog at each session was consistently lower than without, even though the performance showed some variability.

It is actually unnecessary to include a baseline phase in an alternating treatment design, as seen here, just as a control group may not be included in group designs in which two treatments are compared. Baseline data can, however, be useful in situations where both treatments turn out to be equally effective to show that they are better than no treatment at all.

Because target behaviors are measured in rapid succession, the alternating treatment design is appropriate where treatment effects are immediate and where behavior is a clear consequence of one specific condition. The target behavior must be capable of reflecting differences with each intervention and the interventions must be able to trigger those changes as they are applied and switched. The major advantage of the alternating treatment design is that it will usually provide answers to questions of treatment comparison in a shorter time frame than designs that require introduction and withdrawal of multiple treatment phases over time.

Scheduling of alternating treatments will depend on the potential for carryover from trial to trial. Because treatment conditions are continuously alternated, sequence effects are of primary concern. This must be addressed either by random ordering of the treatment applications on each occasion or by systematic counterbalancing. In addition, other conditions that might affect the target behavior, such as the clinician, time of day, or setting, should be counterbalanced.

Multiple Treatment Designs
Using a variation of the withdrawal design, a multiple treatment design typically involves the application of one treatment (B) following baseline, the withdrawal of that treatment, and introduction of one or more additional treatments (C), using various configurations of an A–B–A–C design.

As a reflection of everyday practice, single-subject studies can also be used to examine the interactive or joint effect of two or more treatments as they are applied individually or as a treatment package. This can be done by starting with a combined treatment and then systematically dropping one component, or by adding successive components to treatment.21 Designs can be arranged,
for instance, as A–B–A–(B+C), or A–(B+C)–A–B, where B+C represents the combined application of interventions B and C. The intent is to determine the extent to which the effects of individual components are independent or if their combination has an additive effect.

Changing Criterion Design
Another approach is needed when treatments require adjustment over time to change as the patient improves. A design variation called a changing criterion design provides the opportunity for incrementally increasing goals as the patient progresses.1,22 In this design, the treatment phase is broken down into multiple stages, each representing a new predetermined criterion, such as expected time to complete a task. Each criterion serves as a bar from which the researcher determines whether behavior gradually changes over time. As performance stabilizes and meets each consecutive criterion, the "bar" is raised and behavioral responses to criterion changes are measured repeatedly over time.23

See the Chapter 18 Supplement for description of multiple treatment and changing criterion designs.

■ N-of-1 Trials

Single-subject methodology, with origins in the behavioral sciences, has been used in behavioral and rehabilitation research for decades to investigate systematic variations in treatment and their effect on patient responses. A somewhat different approach has been adopted in clinical medicine using a design commonly called an N-of-1 trial. Considered a form of single-subject investigation, this approach takes its origins from principles of RCTs, applying a randomized crossover design to an individual patient.

Clinicians and patients are often faced with the need to choose among several available interventions when existing evidence does not provide clear directive for one over others, whether because of conflicting studies or lack of objective data. The N-of-1 trial provides a structure for identifying the best treatment for an individual patient, thereby serving as a clinical decision tool.24 Although the N-of-1 trial has largely been used to study the effectiveness of drugs, it can be applied to any discrete intervention that does not have carryover effects.8 It can be used to compare two treatments or a treatment and placebo.

Although they have the same basic premise and the terms are often used interchangeably, single-subject and N-of-1 designs are distinguished by their purpose. Single-subject studies are designed to explore treatment effects over time to answer the question, "Does this treatment work?" In contrast, the N-of-1 trial is used as an active decision-making tool to answer the question, "Which of these treatments should I use for this patient?" Although it can be used to inform further research, an N-of-1 trial typically ends with a decision to use one treatment or the other for the patient's continued care.

The Crossover Design
N-of-1 trials are based on a crossover design (see Chapter 16) in which the treatment and a placebo are systematically and randomly alternated until the patient and clinician reach a decision on preference for one, or satisfaction that both are equally effective or ineffective. More and longer measurement periods are recommended in situations where confounding effects are likely.

A washout period is an essential control for potential carryover effects. Some investigators use a run-in period prior to randomizing the order of treatment to assess the patient's commitment to the protocol (see Chapter 14).

An important element is using a double-blind design whenever possible. For drug trials, the assistance of a pharmacist is needed to create identical placebos.

Because randomizing the order of treatment could result in certain sequences being more common, trials will often use block randomization or a counterbalanced Latin Square procedure (see Chapter 16).

Patient Engagement
The primary purpose of N-of-1 trials is to assist with individual treatment decisions about the optimum approach for a single patient. The study typically involves self-report data collected by the patient during the trial. The decision as to which treatment works best, then, is based directly on that patient's information.

An important part of the planning involves cooperation from the patient as well as a shared commitment from both patient and clinician in the design of the study and choice of relevant outcome measures. Because the success of the trial depends on adherence to a schedule and consistent assessment of outcomes, patients often have to be trained regarding expectations. This includes providing support for interpreting results to understand meaningful improvement or ambiguous effects.25 Informed consent is required from all patients participating in N-of-1 trials.
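Block randomization of a crossover schedule can be sketched simply: each cycle of the trial is a block containing every condition once, shuffled independently, so no condition can dominate any stretch of the schedule. This is a generic illustration, not the procedure of any particular trial:

```python
import random

def nof1_schedule(conditions=("treatment", "placebo"), cycles=3, seed=None):
    """Block-randomized order for an N-of-1 crossover trial:
    each cycle contains every condition exactly once, in random order."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(cycles):
        block = list(conditions)
        rng.shuffle(block)  # randomize order within this cycle only
        schedule.extend(block)
    return schedule
```

With two conditions and three cycles, this yields a six-period schedule in which every consecutive pair of periods contains both conditions.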
Focus on Evidence 18–1 Gathering Evidence—One Patient at a Time

Fatigue is a common problem for patients with cancer, related to both treatment and the disease itself. One approach to treating this effect is the use of psychostimulants. In a systematic review of five RCTs, however, evidence on the effectiveness of these drugs was inconclusive—some studies showing benefits (large and small) and some showing none.35

In light of this uncertainty, Mitchell et al36 developed a series of N-of-1 trials to study the effect of methylphenidate hydrochloride (MPH), a central nervous system stimulant, for alleviating fatigue in patients with advanced cancer. They recruited 43 patients from six palliative care services to participate in 18-day trials, with three 6-day cycles: 3 days taking MPH alternating with 3 days on a placebo. The first day of each period was considered a washout period, and only 2 days were used for data collection. The order of treatment was determined by computer randomization.

The primary outcome was the score on the Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-F) subscale, a 13-item scale (0 to 52) with higher scores indicating less fatigue. A change of 8 points was considered to be meaningful based on previous research, and therefore was considered the threshold to determine a favorable response. Patients also kept a daily diary to record their symptoms.

[Figure: FACIT-F score difference (higher favors MPH) for each patient and the aggregated mean.]

Data analysis included looking at each individual's response, as well as combining all trials. The figure shows the mean change for each patient and the aggregated mean (shown in red). The circles represent the difference between MPH and placebo for each patient. The solid dots (in blue) represent those patients who had a positive response to MPH for all three cycles; the open circles represent those who did not have a response; and the grey circle indicates a negative response. The lines drawn on either side of the circle are the 95% credible intervals, which show the degree of variability around the difference score.* If an interval crosses the center line (at zero difference), there is no significant difference between the two treatments for that patient. The data show that there was no significant difference for most patients as well as the aggregated mean.

The aggregated results of this study can be viewed in the same way as a full RCT, given the consistent protocol used by all patients. Therefore, the researchers concluded that MPH had no effect on fatigue in this population.

However, the N-of-1 methodology made it possible for them to identify eight patients who did respond favorably to MPH. Their credible intervals do not cross the center line, with seven of those above the 8-point threshold. These data allowed clinicians to make meaningful decisions for those individuals immediately after the trial—an opportunity that would have been missed using traditional evidence with aggregated data in a group trial.

* A credible interval is analogous to a confidence interval, but based on a different probability concept. It is a range of values within which an observed parameter value falls with a particular subjective probability.37

From Mitchell GK, Hardy JR, Nikles CJ, et al. The effect of methylphenidate on fatigue in advanced cancer: an aggregated N-of-1 trial. J Pain Symptom Manage 2015;50(3):289–296. Adapted from Figure 3, p. 294. Used with permission.
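The per-patient logic of such a trial can be illustrated crudely. The sketch below computes the MPH-minus-placebo difference for each cycle and applies two informal criteria (a positive difference in every cycle, and a mean difference at or above the 8-point threshold); it is a simplification for illustration, not the Bayesian credible-interval analysis the authors actually used:

```python
def responded(mph, placebo, threshold=8.0):
    """Crude per-patient check for an aggregated N-of-1 trial:
    does every cycle favor MPH, with the mean difference at or
    above the meaningful-change threshold?"""
    diffs = [m - p for m, p in zip(mph, placebo)]
    return all(d > 0 for d in diffs) and sum(diffs) / len(diffs) >= threshold
```

A patient scoring 40, 42, and 41 on MPH against 30, 30, and 30 on placebo would qualify; one whose differences change sign across cycles would not.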
Figure 18–7 Examples of data patterns across baseline and intervention phases, showing changes in
level and trend: A) Change in level and trend (dotted lines represent means for each phase); B) Change
in level, no change in trend; C) Change in trend; D) Change in trend, with a curvilinear pattern during
phase B; E) No change in level or trend, but a change in slope.
can be compared across adjacent phases as a method of summarizing change.
Means are useful for describing stable data that have no slope, such as graph A, as stable values will tend to cluster around the mean. However, when data are very variable or when they exhibit a sharp slope, means can be misleading. For instance, based on means in graph C, one might assume that performance did not change once intervention was introduced. Obviously, this is not the case. Mean values should always be shown on a graph of the raw data to reduce the chance of misinterpretation.
Trend
Trend refers to the direction of change within a phase. Trends can be described as accelerating or decelerating and may be characterized as stable (constant rate of change) or variable. Trends can be linear, shown in all graphs in Figure 18-7 except for graph D, which shows a curvilinear pattern.
A trend in baseline data does not present a serious problem when it reflects changes in a direction opposite to that expected during intervention. One would then anticipate a distinct change in direction once treatment is initiated, as seen in graph C. It is a problem, however, when the baseline trend follows the direction of change expected during treatment. If the improving trend is continued into the intervention phase, it would be difficult to assess treatment effects, as the target behavior is already improving without treatment (graphs B and E). It is important to consider what other factors may be contributing to this baseline change. Perhaps changes reflect maturation, a placebo effect, or the effect of other treatments. Instituting treatment under these conditions would make it difficult to draw definitive conclusions. When such a trend occurs, it is usually advisable to extend the baseline phase in hopes of achieving a plateau or reversal.
Slope
The slope of a trend refers to its angle, or rate of change within the data. Slope can only be determined for linear trends. In graph B, trends in both phases have approximately the same slope (although their level has changed). In graph E, both phases exhibit a decelerating trend, but the slope within the intervention phase is much steeper than that in the baseline phase. This suggests that the rate of change in the target behavior increased once treatment was initiated.
Because different analysis methods can lead to different results, it is important to consider how they approach the data.44 A few of the most common methods will be presented. Many excellent texts provide thorough coverage of other techniques that can be applied to these designs.17,40,45
The Split-Middle Line
The reliability of assessing trend is greatly enhanced by drawing a straight line that characterizes rate of change.46–49 A popular method involves drawing a line that represents the linear trend and slope for a data series. This procedure results in a celeration line, which describes trends as accelerating or decelerating. A celeration line is considered a split-middle line when it divides the data within a phase equally above and below the line, with an equal number of data points on or above and on or below the line (a median line). This method was used, for example, in the study of weight distribution shown in Figure 18-5.
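The celeration-line construction described above can be sketched in code. This is a minimal Python illustration, not the book's worked procedure: the function name, the half-splitting convention, and the sample data are my own. It follows the common construction of drawing the line through the median point of each half of the phase; note that the full split-middle method additionally shifts the line until an equal number of points fall on or above and on or below it, which this sketch omits.

```python
from statistics import median

def split_middle_line(values):
    """Slope and intercept of a celeration line for one phase.

    The series is split into two halves (the middle point is skipped for an
    odd-length series); the line passes through the (median session,
    median value) point of each half. Sessions are numbered 0, 1, 2, ...
    """
    n = len(values)
    first, second = values[: n // 2], values[(n + 1) // 2 :]

    # Median session number and median measurement within each half
    x1 = median(range(0, n // 2))
    x2 = median(range((n + 1) // 2, n))
    y1, y2 = median(first), median(second)

    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# Hypothetical decelerating baseline: the fitted line slopes downward
baseline = [9, 8, 8, 7, 6, 6, 5, 4]
slope, intercept = split_middle_line(baseline)
print(slope)  # -0.625 (a decelerating trend)
```

Extending this line into the intervention phase (as in Figure 18-8) is then just evaluating `slope * session + intercept` for the intervention session numbers.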
Figure 18–8 The split-middle line is drawn for baseline data. The line is extended into the
intervention phase to test the null hypothesis. One point in the intervention phase falls below
the extended line.
data should also be the split-middle line for the intervention phase. Therefore, 50% of the data in the intervention phase should fall on or above that line, and 50% should fall on or below it. If treatment has caused a real change in observed behavior, then the extended baseline trend should not fit this pattern.
Statistically, we propose a null hypothesis (H0), which states that there is no difference across phases; that is, any changes observed from baseline to the intervention phase are due to chance, not treatment. Our intent, of course, is to discredit the null hypothesis, and so we also propose an alternative (H1), which can be phrased as a nondirectional or directional hypothesis. For example, we can propose that there will be an increase in response to intervention for the study shown in Figure 18-8.
The Binomial Test
To test H0, we apply a procedure called the binomial test, which is used when outcomes of a test are dichotomous; in this case, data points are either above or below the split-middle line. To do the test, we count the number of points in the intervention phase that fall above and below the extended line (ignoring points that fall directly on the line).
In our example, one point falls below and nine points fall above the extended line. Clearly, this is not a 50–50 split. On the basis of these data, we would like to conclude that the treatment did effect a change in response. However, we must first pose a statistical question: Could this pattern, with one point below and nine points above the line, have occurred by chance? Or can we be confident that this pattern shows a true treatment effect?
We answer this question by referring to Table 18-1, which lists probabilities associated with the binomial test. Two values are needed to use this table. First, we find the appropriate value of n (down the left side), which is the total number of points in the intervention phase that fall above and below the line (not including points on the line). In this case, there are a total of 10 points. We then determine if there are fewer points above or below the extended line. In our example, there are fewer points (one) below the line. The number of fewer points is given the value x. Therefore, x = 1. The probability associated with n = 10 and x = 1 is .011; that is, p = .011.
The probabilities listed in the table are one-tailed probabilities, which means they are used to evaluate directional alternative hypotheses, as we have proposed in this example. If a nondirectional hypothesis is proposed, a two-tailed test is performed, which requires doubling the probabilities listed in the table. If we had proposed a nondirectional hypothesis for this example, we would use p = .022.

Table 18-1  Abridged Table of Probabilities Associated With the Binomial Test

 n \ x    0      1      2      3      4      5
  5     .031   .188   .500   .812   .969     –
  6     .016   .109   .344   .656   .891   .984
  7     .008   .062   .227   .500   .773   .938
  8     .004   .035   .145   .363   .637   .855
  9     .002   .020   .090   .254   .500   .746
 10     .001   .011   .055   .172   .377   .623
 15      –      –    .004   .018   .059   .151
 20      –      –      –    .001   .006   .021

This table is an abridged version for purposes of illustration. Probabilities are one-tailed values, and must be doubled for a two-tailed test.

The probability value obtained from the table is interpreted in terms of a conventional upper limit of p = .05. Probabilities that exceed this value are considered not significant; that is, the observed pattern could have occurred by chance. In this example, our probability is less than .05 (p = .011) and, therefore, is considered significant. The pattern of response in the intervention phase is significantly different from baseline (see Chapter 23).
See the Chapter 18 Supplement for the full table of probabilities associated with the binomial test. See Chapter 27 for further discussion of the use of the binomial test.
Serial Dependency
It is possible to use conventional statistical tests, such as the t-test and analysis of variance (see Chapters 24 and 25) to test for significant changes across phases. With time series data, however, these applications are limited when large numbers of measurements are taken over time. Under these conditions, data points can be interdependent, as is often the case in single-subject research.50,51 This interdependence is called serial dependency, which means that successive observations in a series of data points are correlated; that is, the level of performance at one point is related to the value of subsequent points in the series. Serial dependency can violate assumptions of several statistical procedures and may also be a problem for making inferences based on visual analysis.52,53
The degree of serial dependency is reflected by the autocorrelation in the data, or the correlation between
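The tabled one-tailed values can be reproduced directly from the binomial distribution with a 50-50 chance per point, which is the basis of this test. The short Python sketch below is illustrative (the function name is mine); it computes the probability of observing x or fewer points on one side of the extended line among n points.

```python
from math import comb

def binomial_p_one_tailed(n, x):
    """One-tailed binomial probability: P(X <= x) with success probability .5.

    n: number of intervention points above/below the line (points on the
       line excluded); x: the smaller count on one side of the line.
    """
    return sum(comb(n, k) for k in range(x + 1)) / 2 ** n

# The worked example: n = 10 points, x = 1 point below the extended line
p = binomial_p_one_tailed(10, 1)
print(round(p, 3))  # 0.011, matching the n = 10, x = 1 entry in Table 18-1
```

For a nondirectional (two-tailed) hypothesis, this value is doubled, as the table footnote indicates.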
successive data points. The higher the value of autocorrelation, the greater the serial dependency in the data. Ottenbacher45 presents a method for computing autocorrelation by hand, but the process is easily performed by computer, especially with large numbers of data points.
The C statistic can be used to establish a difference in trends between baseline and intervention phases with as few as eight points in each phase. It is not affected by autocorrelation. This statistic first looks at baseline data to establish that there is no trend and, if so, it combines baseline and intervention scores to determine if there is a difference between them.45,54
See the Chapter 18 Supplement for calculation of the C statistic.
Two Standard Deviation Band Method
If at least two consecutive points in the intervention phase fall outside the two standard deviation band, changes from baseline to intervention are considered significant.
In this example, the mean response for baseline data is 5.0 with SD = ±1.33. The shaded areas show 2 SD above and below the mean (±2.66). Eight consecutive points in the intervention phase fall above this band. Therefore, we conclude that there is a significant change from the baseline to the intervention phase.
The two standard deviation band method is also illustrated in Figure 18-3.
Depending on the variability in the data, the two standard deviation band method might not be useful, especially if the width of the band takes scores below zero. Alternative analysis methods, like the binomial test, should be considered.
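Lag-1 autocorrelation, the index of serial dependency described above, is straightforward to compute. This is an illustrative Python sketch with hypothetical data (the function name is mine); it correlates the series with itself shifted by one observation.

```python
def lag1_autocorrelation(values):
    """Lag-1 autocorrelation: the correlation between each observation
    and the next one in the series, an index of serial dependency."""
    n = len(values)
    mean = sum(values) / n
    # Covariance of adjacent points, scaled by the total variability
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in values)
    return num / den

# A steadily rising series is serially dependent (positive autocorrelation);
# an alternating series is negatively autocorrelated.
print(lag1_autocorrelation([1, 2, 3, 4, 5, 6, 7, 8]))  # 0.625
print(lag1_autocorrelation([5, 1, 5, 1, 5, 1, 5, 1]))  # -0.875
```

High autocorrelation in a data series is a signal that t-tests and ANOVA, which assume independent observations, may give misleading results.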
[Figure 18–9 graph: baseline mean = 5.0; +2 SD = 7.66; −2 SD = 2.34]
Figure 18–9 Two standard deviation band method, showing mean performance for baseline
(solid line) and 2 SDs above and below the mean (broken lines). Because more than two
consecutive points in the intervention phase fall outside of this band, the difference between
phases is considered significant.
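The band calculation and the consecutive-points decision rule can be sketched in a few lines of Python. This is an illustration with hypothetical data, not the study data from the figure; the function names are mine, and the use of the sample standard deviation (statistics.stdev) is an assumption, since the text does not specify how the SD is computed.

```python
from statistics import mean, stdev

def two_sd_band(baseline):
    """Return (lower, upper) limits of the 2 SD band around the baseline mean."""
    m, sd = mean(baseline), stdev(baseline)  # sample SD, an assumption
    return m - 2 * sd, m + 2 * sd

def significant_change(baseline, intervention, k=2):
    """True if at least k consecutive intervention points fall outside the band."""
    lo, hi = two_sd_band(baseline)
    run = 0
    for v in intervention:
        run = run + 1 if (v < lo or v > hi) else 0
        if run >= k:
            return True
    return False

baseline = [5, 4, 6, 5, 7, 4, 5, 6, 4, 5]
intervention = [9, 10, 11, 12, 12, 13, 12, 13, 14, 13]
print(significant_change(baseline, intervention))  # True
```

As the caution above notes, when baseline variability is large the band may extend below zero, in which case the lower limit can never be crossed and another method should be used.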
to be expected. This variation is considered background “noise,” or what has been termed common cause variation.57 Statistically, such variation is considered to be “in control,” random, predictable, and tolerable.
There will be a point, however, at which the variation will exceed acceptable limits and the product will no longer be considered satisfactory. Such deviation is considered “out of control” or nonrandom, and is called special cause variation. This is variation that is not part of the normal process and has some cause that needs to be addressed. We can think of this variation in the context of reliability. How much variation is random error, and how much is meaningful, due to true changes in response?
Think about variations in your own signature. If you sign your name 10 times there will be a certain degree of expected (common cause) variation, due to random effects such as fatigue, distraction, or shift in position. If, however, someone comes by and hits your elbow while you are writing, your signature will look markedly different—a special cause variation.
Applying SPC to Single-Subject Data
We can apply this model to single-subject data in two ways. First, by looking at baseline data, we can determine if responses are within the limits of expected random variability (common cause variation).55 This would allow us to assess the degree to which the data represent a reasonable baseline.
Using the SPC model, intervention is a form of “special cause.” We expect a change in the subject’s response because of the intervention, not simply due to random variation. SPC offers a mechanism to determine if variations in response are sufficiently different from baseline to warrant interpretation as special cause.57
Control Charts
SPC is based on analysis of graphs called control charts. These are the same as the graphs we have been using to show the results of single-subject data, although some differences exist depending on the type of data being measured.
In SPC, the interpretation of data is based on variability around a mean baseline value. A central line is plotted, representing the mean baseline response. An upper control limit (UCL) and lower control limit (LCL) are then plotted at 3 SDs above and below the mean. Regardless of the underlying distribution, if the process is in “statistical control,” almost all data will fall within this range around the mean (see Chapter 22).58 Therefore, these boundaries define the threshold for special cause.
This process is illustrated in Figure 18-10. The baseline data fall within the upper and lower control limits, demonstrating common cause. Even though there is some variability in the baseline data, it can be considered random.
We then extend the control limits into the intervention phase. If only common cause variation is operating (the treatment has no real effect), the intervention data should also fall within these limits. We can see, however, that 9 points fall outside the UCL and all 10 points are above the baseline mean, indicating that there is a significant difference in the response during the intervention phase.
[Figure 18–10 graph: baseline mean = 5.0; UCL = 8.83; LCL = 1.16]
Figure 18–10 Statistical control chart showing upper (UCL) and lower (LCL) control limits.
The mean of 10 baseline scores is 5.0, which is extended into the intervention phase.
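The control-limit logic can be sketched in Python. This illustration uses hypothetical data and my own function names; it places the limits 3 sample standard deviations from the baseline mean, which is a simplification, since production control charts often estimate sigma from moving ranges or other chart-specific formulas.

```python
from statistics import mean, stdev

def control_limits(baseline):
    """Center line, UCL, and LCL placed 3 SDs above and below the baseline mean."""
    m, sd = mean(baseline), stdev(baseline)  # sample SD, a simplification
    return m, m + 3 * sd, m - 3 * sd

def points_beyond_limits(values, ucl, lcl):
    """Count points signalling special cause (falling outside the control limits)."""
    return sum(1 for v in values if v > ucl or v < lcl)

center, ucl, lcl = control_limits([5, 4, 6, 5, 7, 4, 5, 6, 4, 5])
n_out = points_beyond_limits([9, 10, 11, 12, 12, 13, 12, 13, 14, 13], ucl, lcl)
print(n_out)  # 10: every intervention point exceeds the UCL
```

With the control limits extended into the intervention phase, any substantial count of points beyond the UCL or LCL is interpreted as special cause variation rather than random fluctuation.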
As a process for total quality control, SPC can be used within healthcare settings to monitor variation in service delivery and health outcomes. The reader is encouraged to consult several excellent references that discuss this application.58–64
See the Chapter 18 Supplement for calculation and interpretation of upper and lower control limits using the SPC method.
Effect Size
In the analysis of quantitative studies, a standardized effect size index is often used to characterize the strength of changes seen with intervention. These indices are based on differences between group means and variance within and across groups (see Chapter 32). Important benefits of this approach include the ability to judge the importance of results, compare or combine studies, and develop an evidence-based summary of findings that can be applied to clinical judgment (see Chapter 37).65
Using the approaches to visual analysis and the statistical methods described thus far, SSDs do not allow such judgments. Even when results show a significant change from baseline to intervention, the assessment of change is primarily descriptive and, in many cases, may be difficult to interpret because of variability. These types of results also preclude comparisons across studies. The advantage of using a standardized effect size is that it can be applied to comparisons of data in different units and with different sample sizes.
Nonoverlapping Scores
One approach for establishing effect size for single-subject research is based on the concept of nonoverlap.66 If we look at the number of points in the intervention phase that do not overlap with values in the baseline phase, we get an idea of how different the two sets of scores are. The percent of nonoverlapping data (PND) is a reflection of the treatment effect.67 When all the intervention points overlap with baseline scores, there is 0% nonoverlap. In that case, the distributions of scores in both phases are conceptually superimposed, indicating no treatment effect.
As an example, consider the hypothetical data in Figure 18-10. If we were to draw a horizontal line at the highest point in the baseline phase (try it), we can count how many points fall above that line in the intervention phase. (If the intent of the treatment was to decrease performance, the line would be drawn at the lowest point.) Then we can determine the proportion of data points above this line in the B phase. In this case, 9/10 points would be above it. Therefore, PND is 90%.
Scruggs et al66 suggest the following interpretation of treatment effectiveness for nonoverlap:
Above 90%   Very effective
70-90%      Effective
50-70%      Questionable
Below 50%   Not effective
These are guidelines, not absolute thresholds, that should be used to make judgments about how important a treatment effect is. However, this approach provides a standardized common metric that can be used to compare studies of different sample size, different phase lengths, and different treatment presentations.
If we look back at the study of endurance measured using the 6MWT in Figure 18-3, we can see that eight of nine or 89% of data points in phase B are nonoverlapping with the baseline phase (one point is overlapped)—a strong effect. However, consider the data in Figure 18-4 for the study of weight distribution. With replicated phases, the total proportion of nonoverlapped intervention scores can be used.11 In this example, all intervention points overlap with baseline points—there is 0% nonoverlap. This suggests that the effect size is small, even though there is an obvious change in direction based on the split-middle lines.
Several useful references provide detailed descriptions of other effect size measures for SSDs, including nonparametric statistical methods for determining confidence intervals.42,43,66,68–73
■ Generalization
The special appeal of single-subject research is that it focuses on clinical outcomes and can provide real-time data for clinical decision-making. The strongest criticism of this method, however, suggests that results cannot be generalized beyond the individual patient, limiting its value as a research approach.74,75
Although it certainly would be unreasonable to generalize findings from one individual to a population, one might argue that the reverse can be equally precarious, to generalize from a group to an individual patient. In fact, one might contend that a single-subject case can provide superior evidence compared to group studies for certain types of conditions and interventions.8,14 The concept of pragmatic trials is based on a need to make research more reflective of actual practice, a good fit for single-subject research under the appropriate circumstances.
But, of course, all research has the ultimate goal of informing clinical decisions beyond those being studied.
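The PND calculation described above is simple enough to express directly in code. This Python sketch uses hypothetical data chosen to mirror the 9-of-10 example (the function name and data are mine, not the study's).

```python
def percent_nonoverlapping(baseline, intervention, goal="increase"):
    """Percent of intervention points beyond the most extreme baseline point.

    If the treatment aims to increase the behavior, count intervention points
    above the highest baseline value; if it aims to decrease it, count points
    below the lowest baseline value.
    """
    if goal == "increase":
        threshold = max(baseline)
        beyond = [v for v in intervention if v > threshold]
    else:  # goal == "decrease"
        threshold = min(baseline)
        beyond = [v for v in intervention if v < threshold]
    return 100 * len(beyond) / len(intervention)

baseline = [5, 4, 6, 5, 7, 4, 5, 6, 4, 5]      # highest baseline point = 7
intervention = [9, 10, 11, 7, 12, 13, 12, 13, 14, 13]
print(percent_nonoverlapping(baseline, intervention))  # 90.0
```

The resulting percentage can then be read against the Scruggs et al guidelines above; note that PND, like any nonoverlap index, says nothing about the direction or shape of trends within phases.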
With this intent, then, it is not sufficient just to demonstrate outcomes on a few patients during an experimental period. It is also necessary to show that improvements or changes in behavior will occur with other individuals and that improvement will be sustained after the intervention has ended.
Phases of Replication
Through a process of replication across subjects, settings, and clinicians, the single-subject study can establish generalizability. Three successive phases have been described to strengthen external validity.14
• Direct replication involves repeating single-subject experiments across several subjects. The setting, treatment conditions, and patient characteristics should be kept as constant as possible. Three successful replications are considered sufficient to demonstrate that findings are not a result of chance.14,76 Multiple baseline designs provide direct replication.
• Systematic replication is intended to demonstrate that findings can be observed under conditions different from those encountered in the initial study. It is, in essence, a search for exceptions.14 It involves the variation of one or two variables from the original study to allow generalization to other similar, but not identical, situations. It may include involvement of various clinicians, application in different settings, or patients with varied characteristics.
• Clinical replication is perhaps most important for generalization to complex clinical environments. It involves the replication of effects in typical clinical situations where treatment plans may be based on various combinations of techniques or approaches. In the long run, the true applicability of research data will depend on our ability to demonstrate that treatment packages can be successfully applied to patients with multiple behaviors that tend to cluster together.
Beyond the question of external validity, which concerns generalization of findings to different subjects and settings, researchers are also interested in the importance of treatment effects within a practical context. The term social validity has been used to indicate the application of single-subject study findings, including the applicability of procedures in real-world settings.77
The single-subject approach can readily demonstrate the meaningfulness of the goals of interventions, the acceptability of procedures by target groups, and the importance of a treatment’s impact. This methodology is also ideally suited to demonstrate implementation of clinical procedures that have been studied using RCTs, as a form of pragmatic trial.
COMMENTARY
EBP is about trying to find the right approach for the individual patient. Although clinical research has an important role in generalizing findings, single-subject studies provide a unique opportunity to discover which factors are necessary to expect certain outcomes.11,78 Understanding a treatment’s effectiveness is important even if it applies only to a narrow range of conditions, as long as we know what those conditions are.79
The maturation of single-subject research and N-of-1 methodology has been an important step in the development of clinical research. It provides an opportunity for the practitioner to evaluate treatment procedures within the context of clinical care, as well as to share insights about patient behavior and response that are typically ignored or indiscernible using traditional group research approaches. The process of single-subject research should eventually diminish the role of trial and error in clinical practice and provide useful guidelines for making clinical choices.
Of course, SSDs are not a panacea for all clinical research woes. Many research questions cannot be answered adequately using these designs. Single-subject studies can be limited in that they require time to document outcomes. They are not readily applicable to acute situations in which baseline periods are less reasonable and treatment does not continue for long periods. In addition, researchers often find that patient responses are too unstable, making attempts at analysis frustrating. Nonetheless, SSDs serve a distinct purpose in the ongoing search for scientific documentation of therapeutic effectiveness. This process will force us to challenge clinical theories and to document the benefits of our interventions in a convincing scientific way, for ourselves and for others. Generalization does not always mean that the results of one study have to apply to everyone in a target population. Single-subject strategies provide an important perspective with the ultimate purpose of finding what works for the individual patient.
advanced cancer: an aggregated N-of-1 trial. J Pain Symptom Manage 2015;50(3):289–296.
37. Haskins R, Osmotherly PG, Tuyl F, Rivett DA. Uncertainty in clinical prediction rules: the value of credible intervals. J Orthop Sports Phys Ther 2014;44(2):85–91.
38. Harbst KB, Ottenbacher KJ, Harris SR. Interrater reliability of therapists’ judgements of graphed data. Phys Ther 1991;71(2):107–115.
39. Brossart DF, Parker RI, Olson EA, Mahadevan L. The relationship between visual analysis and five statistical analyses in a simple AB single-case research design. Behav Modif 2006;30(5):531–563.
40. Kratochwill RT, Levin JR. Single-Case Intervention Research: Methodological and Statistical Advances. Washington, DC: American Psychological Association, 2014.
41. Orme JG, Cox ME. Note on research methodology. Analyzing single-subject design data using statistical process control charts. Social Work Res 2001;25(2):115–127.
42. Parker RI, Vannest KJ, Brown L. The improvement rate difference for single-case research. Exceptional Children 2009;75(2):135–150.
43. Brossart DF, Vannest KJ, Davis JL, Patience MA. Incorporating nonoverlap indices with visual analysis for quantifying intervention effectiveness in single-case experimental designs. Neuropsychol Rehabil 2014;24(3–4):464–491.
44. Nourbakhsh MR, Ottenbacher KJ. The statistical analysis of single-subject data: a comparative examination. Phys Ther 1994;74(8):768–776.
45. Ottenbacher KJ. Evaluating Clinical Change: Strategies for Occupational and Physical Therapists. Baltimore: Williams & Wilkins, 1986.
46. Bobrovitz CD, Ottenbacher KJ. Comparison of visual inspection and statistical analysis of single-subject data in rehabilitation research. Am J Phys Med Rehabil 1998;77:94–102.
47. Hojem MA, Ottenbacher KJ. Empirical investigation of visual-inspection versus trend-line analysis of single-subject data. Phys Ther 1988;68(6):983–988.
48. Johnson MB, Ottenbacher KJ. Trend line influence on visual analysis of single-subject data in rehabilitation research. Int Disabil Stud 1991;13(2):55–59.
49. Ottenbacher KJ. Interrater agreement of visual analysis in single-subject decisions: quantitative review and analysis. Am J Ment Retard 1993;98(1):135–142.
50. Ottenbacher KJ. Analysis of data in idiographic research: issues and methods. Am J Phys Med Rehabil 1992;71:202–208.
51. Jones RR, Vaught RS, Weinrott MR. Time-series analysis in operant research. J Appl Behav Anal 1977;10:151.
52. Jones RR, Weinrott MR, Vaught RS. Effects of serial dependency on the agreement between visual and statistical inference. J Appl Behav Anal 1978;11:277.
53. Bengali MK, Ottenbacher KJ. The effect of autocorrelation on the results of visually analyzing data from single-subject designs. Am J Occup Ther 1998;52(8):650–655.
54. Janosky JE, Leininger SL, Hoerger MP, Libkuman TM. Single Subject Designs in Biomedicine. New York: Springer, 2009.
55. Pfadt A, Cohen IL, Sudhalter V, Romanczyk RG, Wheeler DJ. Applying statistical process control to clinical data: an illustration. J Appl Behav Anal 1992;25(3):551–560.
56. Berwick DM. Controlling variation in health care: a consultation from Walter Shewhart. Med Care 1991;29(12):1212–1225.
57. Callahan CD, Barisa MT. Statistical process control and rehabilitation outcome: The single-subject design reconsidered. Rehabil Psychol 2005;50(1):24–33.
58. Benneyan JC, Lloyd RC, Plsek PE. Statistical process control as a tool for research and healthcare improvement. Qual Saf Health Care 2003;12(6):458–464.
59. Hantula DA. Disciplined decision making in an interdisciplinary environment: some implications for clinical applications of statistical process control. J Appl Behav Anal 1995;28(3):371–377.
60. Solodky C, Chen H, Jones PK, Katcher W, Neuhauser D. Patients as partners in clinical research: a proposal for applying quality improvement methods to patient care. Med Care 1998;36(8 Suppl):As13–20.
61. Diaz M, Neuhauser D. Pasteur and parachutes: when statistical process control is better than a randomized controlled trial. Qual Saf Health Care 2005;14(2):140–143.
62. Mohammed MA. Using statistical process control to improve the quality of health care. Qual Saf Health Care 2004;13(4):243–245.
63. Karimi H, O’Brian S, Onslow M, Jones M, Menzies R, Packman A. Using statistical process control charts to study stuttering frequency variability during a single day. J Speech Lang Hear Res 2013;56(6):1789–1799.
64. Eslami S, Abu-Hanna A, de Keizer NF, Bosman RJ, Spronk PE, de Jonge E, et al. Implementing glucose control in intensive care: a multicenter trial using statistical process control. Intensive Care Med 2010;36(9):1556–1565.
65. Shire SY, Kasari C. Train the trainer effectiveness trials of behavioral intervention for individuals with autism: a systematic review. Am J Intellect Dev Disabil 2014;119(5):436–451.
66. Scruggs TE, Mastropieri MA. Summarizing single-subject research. Issues and applications. Behav Modif 1998;22(3):221–242.
67. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2 ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.
68. Van den Noortgate W, Onghena P. The aggregation of single-case results using hierarchical linear models. Behav Anal Today 2007;8(2):196–209.
69. Lanovaz MJ, Rapp JT. Using single-case experiments to support evidence-based decisions: how much is enough? Behav Modif 2016;40(3):377–395.
70. Campbell JM. Statistical comparison of four effect sizes for single-subject designs. Behav Modif 2004;28(2):234–246.
71. Cushing CC, Walters RW, Hoffman L. Aggregated N-of-1 randomized controlled trials: modern data analytics applied to a clinically valid method of intervention effectiveness. J Pediatr Psychol 2014;39(2):138–150.
72. Nikles CJ, McKinlay L, Mitchell GK, Carmont SA, Senior HE, Waugh MC, et al. Aggregated n-of-1 trials of central nervous system stimulants versus placebo for paediatric traumatic brain injury—a pilot study. Trials 2014;15:54.
73. Zucker DR, Ruthazer R, Schmid CH. Individual (N-of-1) trials can be combined to give population comparative treatment effect estimates: methodologic considerations. J Clin Epidemiol 2010;63(12):1312–1323.
74. Francis NA. Single subject trials in primary care. Postgrad Med J 2005;81(959):547–548.
75. Newcombe RG. Should the single subject design be regarded as a valid alternative to the randomised controlled trial? Postgrad Med J 2005;81(959):546–547.
76. Chambless DL, Hollon SD. Defining empirically supported therapies. J Consult Clin Psychol 1998;66(1):7–18.
77. Wolf MM. Social validity: The case for subjective measurement or how applied behavior analysis is finding its heart. J Appl Behav Anal 1978;11:203.
78. Branch MN, Pennypacker HS. Generality and generalization of research findings. In: Madden GJ, Dube W, Hackenberg TD, Hanley GP, Lattal KA, eds. APA Handbook of Behavior Analysis. Washington, DC: American Psychological Association, 2011.
79. Johnston JM, Pennypacker HS. Strategies and Tactics of Behavioral Research. 3 ed. New York: Routledge, 2008.
CHAPTER 19
Exploratory Research: Observational Designs
—with K. Douglas Gross
Cohort Studies
Case-Control Studies
In contrast to experimental trials that test the effectiveness of interventions using variable manipulation and controlled comparisons, observational studies characterize unmodified relationships, analyzing how existing factors or individual characteristics are associated with health outcomes. Observational studies explore the causes, consequences, and predictors of disease or disability. They also characterize the impact of exposure to risk factors on individual or community health. This research approach may be regarded as exploratory since it often probes data about personal, environmental, behavioral, or genetic influences that may explain health outcomes.
The purpose of this chapter is to describe basic approaches to observational research with a focus on cohort and case-control study designs, and to discuss considerations that are relevant to using these designs for evidence-based practice. Analysis procedures for these designs will be covered in Chapter 34.
fully exclude the possibility that the observed groups differ in other ways that could explain their unequal health outcomes. Despite the comparative advantages of experimental trial designs for etiologic research, there are many situations in which investigator-mandated group assignment is infeasible. Consider the practical and ethical difficulties of assigning subjects to a group that will be exposed to wildfire smoke or to an unhealthy diet. In such situations, observational designs offer the only plausible method of investigating the causal impact of a potentially deleterious exposure.

Especially where there is a preponderance of evidence, observational studies can strengthen support for hypothesized causal relationships (see Focus on Evidence 19-1). For example, the repeated finding of an association between tobacco smoking and lung cancer was eventually sufficiently compelling that the surgeon general changed the required warning label on cigarette cartons from "smoking may cause cancer" to "smoking causes cancer." A more recent example is the emerging relationship between concussions and chronic traumatic encephalopathy (CTE). A variety of design and analytic strategies can be applied to observational studies to strengthen causal inference in the absence of random group assignment.

Interpreting Causation

To support a causal hypothesis, the observational researcher must first strive to minimize concerns about bias, confounding, and chance variation. Several principles have been proposed to support a purported cause-and-effect association.4-6

• Temporal Sequence. An essential criterion for causation is the establishment of a time sequence confirming that the causative exposure clearly preceded its supposed outcome or effect. Without this pattern, a relationship cannot be causal.
• Strength of Association. Observational researchers often evaluate the magnitude of the association between an exposure and an outcome using a statistical measure of relative risk. The greater the measured association, the more likely it is that a true causal relationship exists which cannot be explained away entirely by hidden biases.
• Biological Credibility. The researcher should be able to postulate a plausible mechanism by which the exposure might reasonably alter the risk of developing an outcome. This explanation may be theoretical, depending on the current state of knowledge, but conclusions are more credible if they are grounded in a realistic model and research evidence.
• Consistency. Findings are more credible when they are consistent with other studies. The more times a similar relationship is documented by different researchers using different samples, conditions, and study methods, the more likely it is true.
• Dose-Response. Lastly, the presence of a dose-response relationship can lend support to a causal interpretation, with the risk of the outcome rising as the level of exposure increases.
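The relative risk measure cited under Strength of Association can be sketched with a small example. The 2 × 2 counts below are hypothetical, invented only for illustration; analysis procedures for these designs are covered in Chapter 34.

```python
# Relative risk (risk ratio) from a 2x2 cohort table of hypothetical counts.
def relative_risk(a, b, c, d):
    """a = exposed with outcome, b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome."""
    risk_exposed = a / (a + b)      # incidence among the exposed
    risk_unexposed = c / (c + d)    # incidence among the unexposed
    return risk_exposed / risk_unexposed

# Hypothetical cohort: 30 of 200 exposed and 10 of 200 unexposed
# subjects develop the outcome.
rr = relative_risk(30, 170, 10, 190)
print(round(rr, 1))  # 3.0 -> exposed subjects carry three times the risk
```

A relative risk near 1.0 indicates no association; the farther the value moves from 1.0, the stronger the measured association.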
Focus on Evidence 19–1
"The Woman Who Knew Too Much"7

The history of x-rays, from their discovery in 1895, is one of rapid advancement and varied applications, which eventually led them to become commonplace in doctors' offices by the 1930s. Older readers may remember x-rays being used in shoe stores to assure a good fit,8 or when pregnant women were routinely subjected to pelvic x-rays, purportedly to establish baseline measures in case there were complications later in the pregnancy. The medical community believed that there was a threshold of safety, with a dosage that was small enough to avoid harm.9

Following WWII, an English physician, Alice Stewart (1906-2002), one of the youngest women to be elected to the Royal College of Physicians, became aware of an increasing incidence of leukemia in children in Oxford and began looking for an environmental cause. She interviewed mothers of children with and without the disease, not knowing what to look for, and asked questions about everything she could think of, like exposure to infection, inoculations, pets, fried foods, highly colored drinks, sweets, and whether they had had an x-ray.7

Starting in 1956, and over several years,10 Stewart submitted a series of letters and manuscripts to the Lancet and the British Medical Journal, sharing data that showed that twice as many cancer deaths occurred before the age of 10 among children whose mothers had received a series of pelvic x-rays while pregnant.11-16 This claim sparked a barrage of criticism, with some detractors rejecting her data on the grounds that they were acquired from observational and retrospective study and so could not provide evidence of causation.17-21 Stewart continued to argue that even low doses of radiation might be harmful, a view that put her at odds with physicians and the emerging nuclear industry. It was not until late into the 1970s that her admonitions were finally heeded.22 Today, the health effects of radiation exposure are better understood, but with a lack of appreciation for the strength of observational data, it took decades for clinical practice to catch up with the research evidence.

For those interested in this remarkable history, two wonderful videos provide great background, one with her coworkers and one with Alice Stewart telling her own story.23,24
[Figure 19–1: Study Sample; Retrospective Research; Cross-sectional Research]

➤ CASE IN POINT #1
De Ryck et al25 conducted a closed design prospective cohort study to determine incidence and risk factors for poststroke depression (PSD). They performed a baseline examination of 135 patients during the first week after stroke and then followed the cohort over a year and a half, testing for PSD at 1, 3, 6, 12, and 18 months. Results indicated an increased risk of developing PSD in patients with greater baseline impairments in mobility, cognition, or speech.
study, but potential confounding variables must always be enumerated prior to data collection so that appropriate procedures can be implemented to ascertain their presence (see Chapter 15).

Retrospective Studies

Retrospective studies involve the examination of data that were previously collected. Investigators may obtain previously recorded data from medical records, databases, completed surveys, or prior public health censuses (see Box 19-2). Available data must be sufficient to clearly establish each subject's exposure history and outcome status. When the target outcome is a transient condition, confirmation of prior outcome status can be challenging.

Box 19-2 Secondary Analysis

Retrospective research often involves secondary analysis, when a researcher uses existing data sets to re-examine variables and answer questions other than those for which the data were originally collected. Investigators may analyze subsets of variables or subjects, or they may be interested in exploring new relationships among the variables. Secondary analysis has become increasingly common in recent years because of the accessibility of large data sets.

Advantages of secondary analyses include their minimal expense, the ability to study large samples, and the elimination of the most time-consuming part of the research process: data collection. After formulating hypotheses, researchers can proceed without delay to test them. Retrospective researchers often develop their study questions after reviewing the existing data to determine what variables can be defined using available information. Alternatively, researchers may search for a database that can provide necessary information to answer their preformulated study question.

A variety of large databases are supported by data libraries such as the Archive of the Inter-University Consortium for Political and Social Research at the University of Michigan.27 The US government also sponsors continued collection of health-related data through the National Center for Health Statistics (NCHS),28 which includes data from questionnaires such as the National Health Interview Survey29 and the National Nursing Home Survey.30 These studies include longitudinal data that are used by researchers in public health, rehabilitation, medicine, and social science to document changes in healthcare utilization, the health status of various age groups, and the association of personal and lifestyle characteristics with health outcomes. With increasing availability of data through electronic access, secondary analyses are an increasingly important research option.

➤ CASE IN POINT #2
Unhealthy weight gain during pregnancy creates many problems for both mother and child. Lindberg et al26 designed a retrospective cohort study to evaluate insufficient and excess weight change among pregnant women who received prenatal care in a statewide health system. From the electronic medical records of 7,385 women, they determined whether maternal age, race/ethnicity, pre-pregnancy BMI, smoking status, insurance payor, and socioeconomic status were risk factors for unhealthy weight change.

Challenges in Retrospective Studies

Retrospective studies tend to be cheaper and faster than prospective studies, offering a more efficient way to investigate diseases with long latency or incubation periods. However, retrospective designs present several challenges. Medical records and historical records may be incomplete, and a retrospective researcher is unable to exert direct control over the procedures that were used previously to measure target variables. Consequently, measurements may not align perfectly with the current study question or with current expectations for accuracy.

➤ Researchers defined weight change as the difference between pre-pregnancy weight and weight within 8 weeks of delivery. Unfortunately, immediate pre-pregnancy weight measurements were missing for many women and had to be estimated using weight data from prior ambulatory visits.

Because the outcome status of participants is often known at the start of a retrospective study, investigators need to take action to protect against biased reporting or assessment. By assessing pre-pregnancy smoking status using medical records inscribed prior to the pregnancy, investigators hoped to avoid potentially biased reporting from mothers or doctors already aware of pregnancy weight changes that they might incorrectly attribute to prior health or smoking habits.

■ Cross-Sectional Studies

In a cross-sectional study, the researcher takes a "snapshot" of a population, studying a group of subjects at one point in time (see Fig. 19-1). Variables are measured concurrently, meaning that their time sequence cannot be confirmed. In cross-sectional studies investigating potential causal factors, the designation of predictor (exposure) and outcome variables depends solely on the hypothesis that is proposed.

Cross-sectional data can provide useful insight into the current health status of a population, informing decisions about how best to allocate public resources. For example, in 2016 the prevalence of deaths from liver
disease was found to be higher in New Mexico than other states, alerting public officials to a greater need for dialysis services there.31

➤ CASE IN POINT #3
Using questionnaire data acquired in 37 countries, Braithwaite et al32 studied the cross-sectional association of obesity with daily television viewing in adolescents and children. They obtained data for 77,003 children aged 5 to 8 years.

A cross-sectional approach is often used because of its efficiency. Data about the exposure and outcome variables are collected concurrently, eliminating the need to follow subjects over long periods.

➤ In the television viewing study, parents were asked to estimate their child's weekly TV watching time and indicate their child's current height and weight. In some centers, height and weight estimates were confirmed using a stadiometer and scale.

Challenges in Cross-Sectional Studies

In establishing cause and effect, the time sequence of predictor and outcome variables is vitally important, a situation that can be clarified in longitudinal studies. In cross-sectional studies, however, it may not be possible to know whether the presumed cause (exposure) truly preceded the outcome.

➤ In the television viewing study, researchers found that the prevalence of obesity/overweight among children who watched 3 to 5 hours of TV per day was 37% greater than the prevalence among children who watched less than 1 hour. However, it was not possible to know if the hours of sedentary television watching were responsible for children becoming obese or if already obese children were simply inclined to forego more social activities in favor of watching television.

This is an example of potential reverse causation, wherein the variable designated as the "outcome" may actually cause the "exposure." For instance, the finding that liver disease was highly prevalent in New Mexico left unanswered the question of whether the New Mexican lifestyle was an antecedent cause of liver disease in that state, or whether, as some have suggested, persons already at risk of liver disease were simply immigrating or retiring to this sunny border state. When examining cross-sectional relationships, researchers must exercise caution in drawing conclusions about cause and effect relationships (see Focus on Evidence 19-2).

Cross-Sectional Validity Studies

Cross-sectional designs are commonly employed in the investigation of diagnostic test accuracy (see Chapter 33) and other studies of concurrent validity (see Chapter 10). When determining the accuracy of a diagnostic test to detect the presence or absence of a target health condition, the investigational test should be administered in close temporal proximity to the criterion test used to confirm the diagnosis. Diagnostic studies may be retrospective insofar as results from the index and reference tests are collected in the past (such as using medical records). Other diagnostic studies are prospective when data are not yet available at the outset of the study.
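The 37% figure in the television viewing example is a comparison of prevalences. A minimal sketch, using invented counts chosen only to mimic that ratio (not the actual Braithwaite et al data):

```python
# Prevalence ratio from cross-sectional counts (hypothetical data).
# Prevalence is measured at one point in time, so no temporal ordering
# of exposure and outcome can be inferred from it.
def prevalence(cases, n):
    return cases / n

high_tv = prevalence(274, 1000)   # obese/overweight among heavy TV watchers
low_tv = prevalence(200, 1000)    # obese/overweight among light TV watchers
ratio = high_tv / low_tv
print(f"prevalence ratio = {ratio:.2f}")  # prevalence ratio = 1.37
```

A ratio of 1.37 corresponds to a prevalence 37% greater in the heavy-viewing group, but, as the text notes, it says nothing about which condition came first.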
Focus on Evidence 19–2
When Genius Errs33

Sir Ronald Fisher (1890-1962) is generally considered the father of modern statistics; he invented a little thing called the analysis of variance and a few other important statistical concepts. In addition to his reputation for statistical discovery, however, he was also known for his strong voice disputing the relationship between smoking and lung cancer, despite clear evidence to the contrary. As early as 1950, this relationship had been well documented.34

As a smoker, Fisher's argument was that we could not know if smoking caused cancer or cancer caused smoking.35 Their undeniable mutual occurrence, he proposed, could reflect a precancerous state that caused a chemical irritation in the lungs that was relieved by smoking, leading to an increased use of cigarettes. He later suggested that the association was most likely due to a third factor, a genetic predisposition, which made people more susceptible to lung cancer and at the same time created personality types that would lead to smoking. Fisher became a consultant for the tobacco companies in 1960 and was apparently instrumental in blocking many lawsuits at that time.

Fisher's arguments flew in the face of numerous studies that documented higher rates of lung cancer and mortality among smokers.36,37 Bolstered by Fisher's learned opinions and statistical arguments, however, tobacco companies argued that there was no proof of causation, as this could only legitimately be achieved through experimental studies, and of course, there were none.38

This regrettable story illustrates the need to keep focused on logic and science, and to continue to question and examine relationships to truthfully understand clinical phenomena. Observational studies provide an invaluable source of data to explore hypothesized relationships for many health-related questions. It remains our job to identify the theoretical premise that justifies our conclusions.
■ Cohort Studies

In clinical research, a cohort is a group of individuals who are followed over time. A cohort study, also called a follow-up study, is a longitudinal investigation in which the researcher identifies a group of subjects who do not yet have the outcome of interest. Subjects are interviewed, examined, or observed to determine their exposure to factors hypothesized to alter the risk of developing the target outcome. Exposed and unexposed cohort members are then monitored for a period of time to ascertain whether they do, in fact, develop the outcome. Comparisons are made of the frequency of newly developed outcomes across exposed and unexposed groups, indicating whether the hypothesized association is supported by the data.

Figure 19–2 The design of prospective and retrospective cohort studies. Participants are chosen based on their membership in a defined cohort. In a prospective study, the investigator enters the study to determine exposure status and follows subjects to determine who develops the disorder of interest. In a retrospective study, the investigator enters after exposure status has been determined, and tracks data from the past to determine outcomes.
Types of Cohort Studies

Cohort studies can be either prospective or retrospective. If the investigation is initiated after some individuals have already developed the target outcome, the study is considered retrospective. More commonly, the investigator contacts participants before any of them have developed the target condition, in which case the cohort study is considered prospective (see Fig. 19-2).

An inception cohort is a group that is assembled early in the development of a disorder, for example, at the time of diagnosis. Although the disorder is already present in all members of an inception cohort, examination findings and subject characteristics present at the time of diagnosis can be studied for their longitudinal association with incident outcomes, including incidence of disease progression or remission.

Cohorts can be identified in different ways. Geographic cohorts are based on area of residence, such as a particular city, town, or neighborhood. Birth cohorts represent generational groups, such as baby boomers or millennials. An historical cohort includes individuals who experience common events, such as veterans of the Vietnam war. Victims who are survivors of natural disasters would be considered members of environmental cohorts. Developmental cohorts are based on life changes, such as getting married or moving into a nursing home.

➤ CASE IN POINT #4
Dale and colleagues39 undertook a prospective cohort study to determine the relation of self-reported physical work exposures to risk of carpal tunnel syndrome (CTS) among clerical, service, and construction workers during 3 years of follow-up. CTS was confirmed using nerve conduction velocity (NCV) tests.

Cohort studies may be purely descriptive, with the intent of chronicling the natural history of a disease (see Chapter 20). More often, however, they are analytic,
endeavoring to estimate the magnitude of the risk associated with an exposure. This estimate is derived by comparing the incidence of new outcomes across groups of exposed and unexposed subjects. This is the strategy used in the study of CTS.

One advantage of a cohort study is the certainty that it affords in establishing the correct temporal sequence of events. Cohort studies involve only participants who, at the time exposure is ascertained, are free of the target outcome. Given this, we can be certain that a subject's baseline exposure status will have preceded any new occurrence of the outcome during the follow-up period.

➤ In the study of CTS, only workers who tested negative for existing nerve conduction impairment at the screening visit were eligible to participate. Investigators prospectively evaluated risk factors for developing incident CTS. Over 3 years of follow-up among 710 cohort members, 31 new cases of CTS were confirmed by NCV tests. Workers exposed to at least 4 hours per day of activities involving forceful gripping, use of vibrating tools, or lifting objects had two to three times the odds of developing symptomatic CTS.

A disadvantage of the cohort design is that it offers little assurance at the start of the study that the targeted outcome will occur with sufficient frequency to permit robust statistical comparisons of exposed and unexposed groups.

➤ Only eight subjects developed bilateral CTS, making it difficult to estimate the relation of physical work exposures to risk of bilateral CTS.

For this reason, the prospective cohort design may be inappropriate for studying rare outcomes. Large numbers of subjects would have to be followed over a long period to document new cases for analysis. To investigate risk factors for rare or slowly developing disorders, the case–control design is more appropriate, to be described shortly.

Cohort Subject Selection

The first step in subject selection for a prospective cohort study is identification of the target population. Eligible subjects are typically recruited from communities, healthcare practices, or professional groups. The subjects must be free of the target outcome at baseline but susceptible to developing it during the follow-up period.

➤ CTS occurs most frequently in occupations involving upper extremity physical work. To generalize findings to this target population, participants in the CTS study were recruited from eight employers and three trade unions where workers spent time in activities requiring forceful use of arms, hands, and fingers. Workers who had a history of carpal tunnel release were excluded because they were no longer at risk for developing new CTS.

Once a cohort is assembled, subjects are classified by their level of exposure to the hypothesized risk factor. In the CTS study, subjects were classified as exposed to a physical stressor if they were engaged in that activity for an average of at least 4 hours per day.

For common outcomes, cohorts may be recruited from the general population. For outcomes that are less common in the general population, special cohorts may be assembled from professions that have high rates of occupational exposure, as was done in the CTS study, from residents living near sources of environmental contamination, or from patients that share an alternative risk factor for the same outcome.

To facilitate a valid comparison, the unexposed group should be as similar as possible to the exposed group in all factors related to the outcome except for the specific exposure under study. No subject in either group should be immune to the outcome.

Challenges for Cohort Studies

Misclassification of Exposure
The first task of any cohort study is to accurately classify subjects by their exposure status. If an exposed person is mistakenly classified as unexposed or vice versa, the error is referred to as exposure misclassification. When this occurs randomly, as when exposed and unexposed subjects are misclassified with equal frequency, it is considered nondifferential. Unreliable measurement tools are susceptible to random errors that can cause nondifferential misclassification. Clinical cohort studies, especially those that rely on unreliable clinical assessments to ascertain exposure status, are particularly susceptible to this type of misclassification. When exposure misclassification is both nondifferential and independent of the outcome for participants, the true relationship between exposure and outcome may be diluted, resulting in findings that are biased "towards the null."

Conversely, when errors are not randomly distributed across groups, the resulting misclassification may be differential or dependent on outcome status. These types of errors can result in an exaggeration of the true relationship between exposure and outcome. Such bias "away from the null" can occur, for example, when classification of exposure status is based on the inaccurate self-report of study participants.
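The dilution "towards the null" caused by nondifferential misclassification can be demonstrated with a small simulation. All probabilities here are invented for illustration; this is not an analysis of the CTS data.

```python
import random

random.seed(1)

# Nondifferential exposure misclassification, simulated. True exposure
# doubles the risk of the outcome; randomly flipping recorded exposure
# labels with equal frequency in both groups dilutes the observed
# association toward a risk ratio of 1.0.
def simulate(n=100_000, flip=0.0):
    table = {("E", 1): 0, ("E", 0): 0, ("U", 1): 0, ("U", 0): 0}
    for _ in range(n):
        exposed = random.random() < 0.5
        risk = 0.20 if exposed else 0.10                  # true RR = 2.0
        outcome = 1 if random.random() < risk else 0
        # Record the wrong exposure label with probability `flip`.
        recorded = exposed if random.random() >= flip else not exposed
        table[("E" if recorded else "U", outcome)] += 1
    rr = (table[("E", 1)] / (table[("E", 1)] + table[("E", 0)])) / \
         (table[("U", 1)] / (table[("U", 1)] + table[("U", 0)]))
    return rr

print(f"no misclassification: RR ~ {simulate(flip=0.0):.2f}")  # close to 2.0
print(f"30% labels flipped:   RR ~ {simulate(flip=0.3):.2f}")  # pulled toward 1.0
```

Flipping 30% of exposure labels at random in both groups pulls the observed risk ratio from about 2.0 down toward 1.0, even though the true effect is unchanged.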
➤ In the study of CTS, physical work exposures were ascertained by self-report. Investigators acknowledged a potential for workers to overreport or underreport their true exposure status, potentially resulting in misclassification error and biased findings.

In the CTS study, differential misclassification would have occurred if those who did not actually engage in heavy work consistently overreported their exposure to physical stressors. Since it seems unlikely that workers who truly are engaged in heavy work would underreport their own high levels of activity with equal frequency, differential misclassification errors may have biased the findings away from the null.

Bias and Attrition
The longitudinal nature of prospective cohort studies makes them prone to attrition. Bias will result if the loss of subjects is related to either their exposure status, their outcome status, or both. Unfortunately, it can be difficult to know if a subject who departed the study went on to develop the outcome or not.

➤ Of the 1,107 subjects recruited for the study of CTS, only 751 (67.8%) completed the follow-up physical examination. It was not known whether any of the subjects lost to attrition developed CTS during what remained of the follow-up period.

Researchers generally take great pains to avoid attrition by maintaining good rapport and regular communication with all cohort participants. When possible, researchers look at the baseline measures among those who dropped out to see if they are different from those who were retained. In the CTS study, for instance, the researchers found no significant difference in the baseline age, gender, body mass index (BMI), medical history, or job category between those who were lost to attrition and those who completed follow-up.

■ Case-Control Studies

A case–control study is a method of observational investigation in which groups of individuals are purposely selected on the basis of whether or not they have the health condition under study. Cases are those who have the target condition, while controls do not. The investigator then looks to see if these two groups differ with respect to their exposure history or the presence of a risk factor. The comparison of cases to controls can be made cross-sectionally by looking at current characteristics, or by looking backward in time via interview, questionnaire, or chart review (see Fig. 19-3).

Figure 19–3 Design of a retrospective case-control study. Participants are chosen based on their status as a case or control, and then prior exposure status is ascertained.

➤ CASE IN POINT #5
Exercise is recognized as an important component of a healthy lifestyle but can also lead to physical injury. Using a case-control design, Sutton et al40 tested the hypothesis that development of knee osteoarthritis (OA) was related to past participation in sports and exercise. Data were obtained from a large national fitness survey in England. They identified 216 eligible cases of knee OA, including 66 men and 150 women.

The case–control design is especially useful for studying rare disorders or conditions with long latency or incubation periods. For instance, knee OA develops slowly and may remain undetected for years, so a longitudinal study would require following subjects for many years before they could detect a sufficient number of knee OA cases. Greater efficiency is achieved using a retrospective case-control design that compares the past sporting histories of seniors who currently have knee OA (cases) to seniors who do not (controls).

➤ The only strong risk factor identified for knee OA was prior knee injury. Knee OA cases were 8 times as likely as controls to report a previous sports-related knee injury. The authors concluded that evidence did not support the popular belief that, absent a knee injury, participation in sports or exercise increased risk of knee OA later in life.
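Because case-control samples are assembled by outcome status, incidence (and therefore relative risk) cannot be computed; the association is instead expressed as an exposure odds ratio. A sketch with hypothetical counts (the exposure figures below are invented for illustration, not the actual Sutton et al data):

```python
# Odds ratio from a case-control 2x2 table (hypothetical counts).
def odds_ratio(exposed_cases, unexposed_cases,
               exposed_controls, unexposed_controls):
    case_odds = exposed_cases / unexposed_cases            # exposure odds, cases
    control_odds = exposed_controls / unexposed_controls   # exposure odds, controls
    return case_odds / control_odds

# e.g., 40 of 216 cases vs. 24 of 864 controls reporting a prior knee injury
est_or = odds_ratio(40, 176, 24, 840)
print(f"OR = {est_or:.1f}")  # OR = 8.0
```

An odds ratio near 8 would correspond to the "8 times as likely" result quoted above.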
the ascertainment of exposure status. Because subjects are When cases are selected from hospitalized patients
purposefully selected based on whether a target disorder and controls are drawn from the general population, the
is present or not, selection bias is a special concern. problem of recall bias may be especially potent.
Cases and controls must be chosen regardless of their
Confounding
exposure histories. If they are differentially selected on
Another factor that may interfere with analytic interpre-
some variable that is related to the exposure of interest,
tations of case–control data is the confounding effect of
it will not be possible to determine if the exposure is
extraneous variables that are related to the exposure of in-
truly related to the disorder. When samples are com-
terest and that independently affect the risk of developing
posed of subjects who have volunteered to participate,
the disorder. For example, previous data have shown that
self-selection biases can also occur.
women have an increased likelihood of developing knee
OA. Female seniors are also less likely to have participated
➤ Out of 4,316 adults who completed the fitness survey, in sports than their male peers. Since gender is related to
216 cases of knee OA were identified, leaving 4,100 adults who both the outcome (knee OA) and the exposure, sporting
were potentially eligible to serve as controls. Even after narrow- history or gender could confound the results. To examine
ing the list of potential control subjects to include only those
the potential contribution of confounders, investigators
who matched to a case on age and gender, the list of potential
must collect data on them and analyze their association
controls still exceeded what was required. To minimize the
threat of bias, matching controls were selected randomly from with both the exposure and the disease.
among those who were eligible. Matching
When sufficient numbers are available, researchers often
Observation Bias attempt to match control subjects to a case on one or more
In a case-control study, the outcome of each participant characteristics, such as age, gender, neighborhood, or
is known. Foreknowledge of the outcome renders the other characteristics with potential to alter the risk of the
design particularly susceptible to biased measurement outcome. Greater similarity between controls and cases
by the assessor or biased reporting by the participant. facilitates more valid comparison of their exposure histo-
Observation bias occurs when there is a systematic ries. This is an effective method of reducing or eliminating
difference in the way information about the exposure is the possibility that the matched factors will confound the
obtained from the study groups. results. Matching on a small number of factors does not,
Interviewer bias can occur when the individual however, eliminate the possibility that other differences
between case and control groups may persist.

➤ In the knee OA study, each case was matched to four control subjects according to gender and age. It was possible to pick multiple controls for each case owing to the large number of people completing the health survey who did not have knee OA.

Matching the control group to the cases on gender and age eliminates the possibility that these differences could confound a comparison of their past sporting histories.

When feasible, researchers will try to recruit more than one control subject for each case to minimize bias and increase statistical precision during analysis. Matching may be difficult when many variables are relevant, such as gender, age, weight, or other health conditions. An alternative strategy for this purpose is the use of propensity score matching. These values are composite scores that represent a combination of several confounding variables. Scores are derived through a statistical process of logistic regression and can help to equate groups on baseline variables (see Chapter 30).

Interviewer bias occurs when the person collecting data elicits, records, or interprets information differently for controls than for cases. For example, an interviewer may be aware of the research hypothesis and might ascertain information to support this hypothesis. Because so many epidemiologic studies involve an interview or observation, this type of bias must be explicitly addressed by blinding interviewers to group and hypothesis and by making data collection as objective as possible.

Recall Bias

Recall bias occurs when case subjects who have experienced a particular disorder remember their exposure history differently than control subjects who have not experienced the disorder. It is not unusual for individuals who have a disease to analyze their habits or past experiences with greater depth or accuracy than those who are healthy. This bias may result in an underestimate or an overestimate of the risk associated with a particular exposure.

➤ Individuals with painful knee OA might be more inclined to remember a past knee injury, whereas healthy control subjects could easily forget.
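The propensity score matching strategy described earlier in this section can be sketched in code: each subject's probability of being a case is modeled from confounders by logistic regression, and each case is then paired with the available control whose score is closest. Everything below — the confounder values, the tiny gradient-descent fit, and the greedy pairing rule — is an invented illustration, not a recommended analysis procedure.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit a minimal logistic regression by gradient descent.
    Returns weights (intercept first) for p(case | confounders)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

def propensity(w, xi):
    """Propensity score: modeled probability of being a case."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical confounders per subject: [age (rescaled 0-1), gender (0/1)]
cases    = [[0.70, 1], [0.55, 0], [0.80, 1]]
controls = [[0.30, 1], [0.72, 1], [0.50, 0], [0.58, 0], [0.85, 1]]
X = cases + controls
y = [1] * len(cases) + [0] * len(controls)

w = fit_logistic(X, y)
scores = {tuple(x): propensity(w, x) for x in X}

# Greedy 1:1 matching: nearest available control score for each case
available = controls[:]
for c in cases:
    best = min(available, key=lambda k: abs(scores[tuple(k)] - scores[tuple(c)]))
    available.remove(best)
    print(f"case {c} matched to control {best}")
```

The point of the sketch is only that the several confounders collapse into one composite score per subject, so "similarity" between a case and a candidate control reduces to a single distance, as the text describes.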
COMMENTARY

In a classic tongue-in-cheek 2003 article, Smith and Pell46 described their attempt at a systematic review to determine whether parachutes are effective for preventing major trauma related to gravitational challenge. They based their research question on the observation that parachutes appear to reduce the risk of injury when jumping from an airplane, with data showing increased morbidity and mortality with parachute failure or free fall.47,48 The authors satirically noted, however, that as an intervention, parachute effectiveness had never been "proven" in any controlled trial. Using death or trauma as the primary outcome, the authors attempted a comprehensive literature search but found no randomized controlled trials of the parachute. They asserted that, despite this lack of formal evidence, the parachute industry continues to profit from the belief that their product is effective, a belief that is likely to remain unchallenged given the difficulty of finding volunteers willing to serve as control subjects in a randomized trial!

Even though observational studies are not able to establish direct cause-and-effect relationships with the same degree of control as in an RCT, they play an important role in clinical research, especially considering the lack of documented evidence that exists concerning most clinical phenomena. Before one can begin to investigate causal factors for behaviors and responses using experimental methods, one must first discover which variables are related and how they occur in nature. As several examples have shown, our understanding of the causes of an outcome can only move ahead if we recognize how those phenomena manifest themselves with regard to concurrent variables.

This chapter provides an introduction to issues that are relevant to observational designs. Analysis procedures will be discussed in more detail in succeeding chapters. Readers can refer to many excellent epidemiology texts to find more comprehensive discussions of observational designs that go beyond the scope of this chapter.49-51
20. Reid F. X-rays and leukemia. Lancet 1957;269(6965):428.
21. Lindsay DW. Effects of diagnostic irradiation. Lancet 1963;281(7281):602-603.
22. Greene G. Richard Doll and Alice Stewart: reputation and the shaping of scientific "truth". Perspect Biol Med 2011;54(4):504-531.
23. Medical Sleuth: Detecting Links Between Radiation and Cancer. City Club of Portland. Available at https://www.youtube.com/watch?v=H4l6frw8Uk8. Accessed May 20, 2017.
24. Alice Stewart: The woman who knew too much. Available at https://www.youtube.com/watch?v=proyrn2AAMA. Accessed August 23, 2016.
25. De Ryck A, Brouns R, Fransen E, Geurden M, Van Gestel G, Wilssens I, De Ceulaer L, Marien P, De Deyn PP, Engelborghs S. A prospective study on the prevalence and risk factors of poststroke depression. Cerebrovasc Dis Extra 2013;3(1):1-13.
26. Lindberg S, Anderson C, Pillai P, Tandias A, Arndt B, Hanrahan L. Prevalence and predictors of unhealthy weight gain in pregnancy. WMJ 2016;115(5):233-237.
27. ICPSR. Available at https://www.icpsr.umich.edu/icpsrweb/. Accessed December 5, 2017.
28. National Center for Health Statistics. Centers for Disease Control and Prevention. Available at https://www.cdc.gov/nchs/index.htm. Accessed December 5, 2017.
29. Centers for Disease Control and Prevention (CDC). National Health Interview Survey. National Center for Health Statistics. Available at https://www.cdc.gov/nchs/nhis/data-questionnaires-documentation.htm. Accessed June 30, 2017.
30. National Nursing Home Survey. National Center for Health Statistics. Centers for Disease Control and Prevention. Available at https://www.cdc.gov/nchs/nnhs/index.htm. Accessed December 5, 2017.
31. Centers for Disease Control and Prevention. National Center for Health Statistics. Stats of the State of New Mexico. Available at https://www.cdc.gov/nchs/pressroom/states/newmexico/newmexico.htm. Accessed March 7, 2019.
32. Braithwaite I, Stewart AW, Hancox RJ, Beasley R, Murphy R, Mitchell EA. The worldwide association between television viewing and obesity in children and adolescents: cross sectional study. PLoS One 2013;8(9):e74263.
33. Stolley PD. When genius errs: R.A. Fisher and the lung cancer controversy. Am J Epidemiol 1991;133:416-425.
34. Doll R, Hill AB. Smoking and carcinoma of the lung; preliminary report. Br Med J 1950;2(4682):739-748.
35. Gould SJ. The smoking gun of eugenics. Nat Hist 1991(12):8-17.
36. United States Public Health Service. Smoking and Health. Report of the Advisory Committee to the Surgeon General of the Public Health Service. Public Health Service Publication No. 1103. Washington, DC: Office of the Surgeon General, 1964. Available at https://profiles.nlm.nih.gov/ps/retrieve/ResourceMetadata/NNBBMQ. Accessed October 16, 2017.
37. Proctor RN. The history of the discovery of the cigarette–lung cancer link: evidentiary traditions, corporate denial, global toll. Tobacco Control 2012;21(2):87-91.
38. Brandt AM. Inventing conflicts of interest: a history of tobacco industry tactics. Am J Public Health 2012;102(1):63-71.
39. Dale AM, Gardner BT, Zeringue A, Strickland J, Descatha A, Franzblau A, Evanoff BA. Self-reported physical work exposures and incident carpal tunnel syndrome. Am J Ind Med 2014;57(11):1246-1254.
40. Sutton AJ, Muir KR, Mockett S, Fentem P. A case-control study to investigate the relation between low and moderate levels of physical activity and osteoarthritis of the knee using data collected as part of the Allied Dunbar National Fitness Survey. Ann Rheum Dis 2001;60(8):756-764.
41. Report of the expert committee on the diagnosis and classification of diabetes mellitus. Diabetes Care 2003;26 Suppl 1:S5-20.
42. Woolf SH, Rothemich SF. New diabetes guidelines: a closer look at the evidence. Am Fam Physician 1998;58(6):1287-1288, 1290.
43. Centers for Disease Control and Prevention. Revised surveillance case definition for HIV infection—United States, 2014. Morbidity and Mortality Weekly Report, April 11, 2014. Available at https://www.cdc.gov/mmwr/preview/mmwrhtml/rr6303a1.htm. Accessed December 10, 2017.
44. World Health Organization. AIDS and HIV Case Definitions. Available at http://www.who.int/hiv/strategic/surveillance/definitions/en/. Accessed December 9, 2017.
45. Bertone ER, Newcomb PA, Willett WC, Stampfer MJ, Egan KM. Recreational physical activity and ovarian cancer in a population-based case-control study. Int J Cancer 2002;99:431-436.
46. Smith GC, Pell JP. Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials. BMJ 2003;327(7429):1459-1461.
47. Highest fall survived without a parachute. Guinness Book of World Records, 2002. Available at http://www.guinnessworldrecords.com/world-records/highest-fall-survived-without-parachute/. Accessed May 19, 2017.
48. Belmont PJ Jr, Taylor KF, Mason KT, Shawen SB, Polly DW Jr, Klemme WR. Incidence, epidemiology, and occupational outcomes of thoracolumbar fractures among U.S. Army aviators. J Trauma 2001;50(5):855-861.
49. Rothman KJ. Epidemiology: An Introduction. 2nd ed. Oxford: Oxford University Press; 2012.
50. Merrill RM. Introduction to Epidemiology. 7th ed. Burlington, MA: Jones and Bartlett; 2017.
51. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. Philadelphia: Lippincott Williams & Wilkins; 2008.
CHAPTER 20

Descriptive Research

[Chapter-opener graphic: the research continuum — DESCRIPTIVE, EXPLORATORY, EXPLANATORY — with the descriptive approaches covered in this chapter:]
Developmental Research
Normative Research
Descriptive Surveys
Case Reports
Historical Research
Descriptive research is an observational approach designed to document traits, behaviors, and conditions of individuals, groups, and populations. Descriptive and exploratory elements are often combined, depending on how the investigator conceptualizes the research question. These studies document the nature of existing phenomena using both quantitative and qualitative methods and describe how variables change over time. They will generally be structured around a set of guiding questions or research objectives, often to serve as a basis for research hypotheses or theoretical propositions that can be tested using exploratory or explanatory techniques.

Descriptive studies may involve prospective or retrospective data collection and may be designed using longitudinal or cross-sectional methods. Surveys and secondary analysis of clinical databases are often used as sources of data for descriptive analysis. Several types of research can be categorized as descriptive based on their primary purpose, including developmental research, normative research, case reports, and historical research.
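The contrast between cross-sectional and longitudinal data collection can be made concrete with a small sketch. Using an entirely invented cohort, a cross-sectional survey yields point prevalence (how many are affected at one moment), while longitudinal follow-up of the initially unaffected subjects yields cumulative incidence (how many new cases arise over a period); these measures are developed formally in Chapter 34.

```python
# Hypothetical cohort: status at a cross-sectional survey date, and
# whether the subject newly develops the condition during a 2-year
# follow-up. All values are invented for illustration only.
subjects = [
    {"id": 1, "has_condition_at_survey": True,  "new_case_in_followup": False},
    {"id": 2, "has_condition_at_survey": False, "new_case_in_followup": True},
    {"id": 3, "has_condition_at_survey": False, "new_case_in_followup": False},
    {"id": 4, "has_condition_at_survey": True,  "new_case_in_followup": False},
    {"id": 5, "has_condition_at_survey": False, "new_case_in_followup": True},
]

# Point prevalence: affected at one point in time / total population
prevalence = sum(s["has_condition_at_survey"] for s in subjects) / len(subjects)

# Cumulative incidence: new cases during follow-up / population at risk
# (only those free of the condition at the survey are at risk)
at_risk = [s for s in subjects if not s["has_condition_at_survey"]]
incidence = sum(s["new_case_in_followup"] for s in at_risk) / len(at_risk)

print(f"prevalence = {prevalence:.2f}")  # 2 of 5 -> 0.40
print(f"incidence  = {incidence:.2f}")   # 2 of 3 at risk -> 0.67
```

Note that the incidence calculation is only possible because the same subjects were observed twice — exactly the temporal information a single cross-sectional snapshot cannot supply.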
Focus on Evidence 20–1
Developmental Trajectories in Children with Autism

Autism spectrum disorders (ASD) are a group of neurodevelopmental disorders characterized by deficits in communication and social interaction. Researchers continue to look for etiology, as well as trying to understand the developmental trajectory.

Landa et al14 used a prospective longitudinal approach to study 204 infants who were siblings of children with autism and 31 infants with no family history, considered at low risk. Children were evaluated six times over 3 years. The results identified three groups: early ASD diagnosis (before 14 months), later ASD diagnosis (after 14 months), and no ASD. They found that despite the children exhibiting similar developmental levels at 6 months, the groups exhibited atypical trajectories thereafter in social, language, and motor skills. The figure illustrates one behavior, shared positive affect (smiling and eye contact), showing greater impairments from 14 to 24 months in the early diagnosed group (results are expressed as standardized scores, with higher values indicating greater impairment). Although similar at 14 months, the late diagnosed group improved over the 2 years, whereas the behavior of the early diagnosed group stayed steady. This type of descriptive study is essential for advancing theories about ASD, developing screening tools, and early detection to improve access to early intervention.

[Figure: Shared positive affect (standardized score) plotted against age at 14, 18, and 24 months for the Early ASD, Later ASD, and Non-ASD groups.]

From Landa RJ, Stuart EA, Gross AL, Faherty A. Developmental trajectories in children with and without autism spectrum disorders: the first 3 years. Child Dev 2013;84(2):429-442. Figure 1, p. 436. Used with permission.
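Standardized scores like those plotted in the trajectory figure express a raw score relative to a normative mean and standard deviation. A minimal sketch of the conversion follows; the normative sample is invented, and the rescaling to mean 100 and SD 15 is just one common convention (the Landa scores use a different metric), so the numbers illustrate the arithmetic only.

```python
import statistics

# Invented raw scores from a hypothetical normative sample
norm_sample = [12, 15, 14, 10, 13, 16, 11, 14, 13, 12]
mean = statistics.mean(norm_sample)   # normative mean (13.0 here)
sd = statistics.stdev(norm_sample)    # normative standard deviation

def standardized(raw, target_mean=100, target_sd=15):
    """Convert a raw score to a standardized score on a chosen metric,
    e.g. the familiar mean-100, SD-15 scale."""
    z = (raw - mean) / sd             # distance from the norm in SD units
    return target_mean + target_sd * z

print(round(standardized(14), 1))     # a raw 14 sits above the mean -> 108.2
```

Because the transformation is linear, a subject's position relative to the normative group (the z-score) is preserved whatever target scale is chosen, which is why scores from differently scaled instruments can be compared once standardized.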
assumption is made that all patients with similar duration would have similar symptom development, which may not be true.

Pragmatically, the cross-sectional method makes it easier to sample large, representative groups than a longitudinal study does. If, however, the primary interest is the study of patterns of change, the longitudinal method is preferred, as only that method can establish the validity of temporal sequencing of behaviors and characteristics.15

Cross-sectional studies measure prevalence of a condition in a population, which is a measure of how many individuals are affected at any one point in time. This approach cannot measure incidence, which is a measure of the number of new cases within a period of time (see Chapter 34). Therefore, the cross-sectional design is not suited to studying the natural history of a disease or to drawing inferences about causes or prognosis.

Cohort Effects

Many of the extraneous variables that interfere in cross-sectional studies pertain to cohort effects, which are not age specific but are due to a subject's generation, or period effects, related to the particular time of observation. For instance, subjects can differ in their exposure to information about health or historical events that influence life choices and practices.

➤ Reither et al16 studied 1.7 million participants in the 1976–2002 National Health Interview Surveys to determine differences in the prevalence of obesity in different birth cohorts. They found that obesity at age 25 increased by 30% for cohorts born between 1955 and 1975. However, independent of birth cohort, the period of observation had a significant influence on the likelihood of obesity at age 25, with a predicted probability rising from 7% in 1976 to 20% in 2002.

The findings in this obesity study demonstrate both cohort and period effects, suggesting that changes such as societal norms, differences in food practices, and lifestyle choices have had an increasing influence on the rise in obesity.

The use of cross-sectional studies for analytic observational studies was described in Chapter 19.

■ Normative Studies

The utility of evaluative findings in the assessment of a patient's condition is based on the comparison of those findings with known standards. Clinicians
need these standards as a basis for documenting the presence and severity of disease or functional limitations and as a guide for setting goals. The purpose of normative research is to describe typical or standard values for characteristics of a given population. Normative studies are often directed toward a specific age group, gender, occupation, culture, or disability. Norms are usually expressed as averages within a range of acceptable values.

Physiological Standards

Over time, studies have provided the basis for laboratory test standards that are used to assess normally or abnormally functioning systems. For example, hematocrit or serum blood glucose are used to determine the presence of anemia or diabetes. Physiological measures such as normal nerve conduction velocity (NCV) are used to diagnose or explain weakness or pain associated with neural transmission.

Norms can vary across studies, often based on the characteristics of the sample or the laboratory practices employed. For example, studies have resulted in different values for normal NCV based on how the measurements were taken and the quality of the study.17-19 There are also correction factors that may need to be applied, such as corrections for limb temperature,20 age, and height21 in measures of NCV.

Normal values may be updated periodically, as indicators of disease are reevaluated. For instance, acceptable hemoglobin A1c levels, used to monitor type 2 diabetes, have been lowered and later raised, as new medical information has become available.22-24 Such changes often result in a documented increase in the incidence of the disease in the population—not because the disease is occurring more often, but because more people are diagnosed.

Standardized Measures of Performance

Norms can also represent standardized scores that allow interpretation of responses with reference to a "normed" value. For example, the Wechsler Intelligence Scales scores are normed against a mean of 100 and a standard deviation of 15.25 Several tests of balance have set standards for fall risk, and norms have been established for developmental milestones. The influence of joint motion on functional impairment is based on values "within normal limits."

The importance of establishing the validity of normative values is obvious. The estimation of "normal" behavior or performance is often used as a basis for prescribing corrective intervention or for predicting future performance. Data collected on different samples or in different settings may result in variations of normal values.

If the interpretation of assessments and the consequent treatment plan are based on the extent of deviation from normal, the standard values must be valid reflections of this norm. Because no characteristics of a population can be adequately described by a single value, normal values are typically established in ranges with reference to concomitant factors.

As an example, several studies have established normative values for grip strength in children based on age and hand dominance.26-28

➤ Häger-Ross and Rösblad29 studied 530 boys and girls aged 4 through 16 and found that there was no difference in grip strength between the genders until age 10, after which the boys were significantly stronger than the girls. Another study established norms for 5- to 15-year-olds based on age, gender, and body composition.30 Normative data for grip strength in adults31 and older persons32 are also distinguished by gender and level of fitness activity.

These data provide a reference for making clinical decisions about the management of hand problems and for setting goals after hand injury or surgery.

There is still a substantial need for normative research in health-related sciences. As new measurement tools are developed, research is needed to establish standards for interpretation of their scores. This is especially true in areas where a variety of instruments are used to measure the same clinical variables, such as balance, pain, function, and quality-of-life assessments. It is also essential that these norms be established for a variety of diagnostic and age groups, so that appropriate standards can be applied in different clinical situations.

Researchers should be aware of the great potential for sampling bias when striving to establish standard values. Samples for normative studies must be large, random, and representative of the population's heterogeneity. The specific population of interest should be delineated as accurately as possible. Replication is essential to this form of research to demonstrate consistency and thereby validate findings.

Chapter 10 includes further discussion of norm-referenced and criterion-referenced tests.

■ Descriptive Surveys

Descriptive studies attempt to provide an overall picture of a group's characteristics, attitudes, or behaviors, with the purpose of describing characteristics or risk factors
for disease or dysfunction. Data for such studies are often collected through the use of surveys (see Chapter 11).

➤ Jensen and coworkers33 used a survey to examine the nature and scope of pain in persons with neuromuscular disorders. Their data demonstrated that while pain is a common problem in this population, there are some important differences between diagnostic groups in the nature and scope of pain and its impact.

Descriptive studies are a good source of data for development of hypotheses that can be tested further with analytic research methods.

Large-scale surveys are often used to establish population characteristics that can be used for policy and resource decisions. For example, the Centers for Disease Control and Prevention (CDC) conduct many surveys, like the National Health Interview Survey or the National Ambulatory Medical Care Survey, to gather data on health habits, function, and healthcare utilization.34 These types of surveys are also used to document the occurrence or determinants of health conditions and to describe patterns of health, disease, and disability in terms of person, place, and time.

Designs related to descriptive epidemiology will be described in Chapter 34.

■ Case Reports

Clinicians and researchers have long recognized the importance of in-depth descriptions of individual patients for developing a clinical knowledge base, influencing practice and policy. From biblical times to modern day, healthcare providers have described interesting cases, often presenting important scientific observations that may not be obvious in clinical trials.35 A case report is a detailed description of an individual's condition or response to treatment but may also focus on a group, institution, or other social unit, such as a school, healthcare setting, or community.

The terms case report and case study have often been used interchangeably, but they actually reflect important differences in the nature of inquiry, how data are collected, and how conclusions are drawn.36-39 Unfortunately, terminology is not used consistently across journals. A case study is considered a qualitative research methodology, investigating broad questions related to individuals or systems (see Chapter 21). Although their contribution to evidence-based practice is well recognized, case reports are generally not considered a true form of research because the methodology is not systematic. Case reports must also be distinguished from single-subject and N-of-1 designs, which involve experimental control in the application of interventions (see Chapter 18).

Purposes of Case Reports

The purpose of a case report is to reflect on practice, including diagnostic methods, patient management, ethical issues, innovative interventions, or the natural history of a condition.40 A case report can present full coverage of a patient's management or focus on different aspects of the patient's care. For example, case reports have focused on the first procedure for a hand transplant,41 how the surgery was planned,42 and how anesthesia was managed.43

A case report can be prospective, when plans are made to follow a case as it progresses, or it can be retrospective, reporting on a case after patient care has ended. Prospective studies have the advantage of being able to purposefully design the intervention and data collection procedures, providing a better chance that consistent and complete data will be available.

A case series is an expansion of a case report involving observations in several similar cases. As more and more similar cases are reported, a form of "case law" gradually develops, whereby empirical findings are considered reasonable within the realm of accepted knowledge and professional experience. Eventually, with successive documented cases, a conceptual framework forms, providing a basis for categorizing patients and for generating hypotheses that can be tested using exploratory or experimental methods.

Clinical case reports can serve many purposes.

• Case reports often emphasize unusual patient problems and outcomes or diagnoses that present interesting clinical challenges.

➤ Prognosis following traumatic brain injury (TBI) is considered challenging. Nelson et al44 presented a case report of a 28-year-old male who experienced severe TBI following a motorcycle accident while not wearing a helmet. His initial examination revealed findings that were typically predictive of mortality and debilitating morbidity. Despite these factors, however, after 1 year the patient was living at home, performing simple activities of daily living at a higher level than anticipated. The improvement could not be attributed specifically to any given treatment, but the outcome was an important reminder of the variability among patients that can challenge prognostication for TBI.

• Case reports may focus on innovative approaches to treatment.

➤ MacLachlan et al45 described the first case of the successful use of "mirror treatment" in a person with a lower limb amputation to treat phantom limb pain. This intervention involves the patient feeling movement in their phantom limb while observing similar movement
in their intact limb. They demonstrated that this approach, which had been successful with upper extremity amputation, was equally effective for reducing lower extremity phantom pain.

• Case reports can present newer approaches to diagnosis, especially with rare conditions, which may provide more efficient, more accurate, or safer testing methods.

➤ Posterior epidural migration of lumbar disc fragments (PEMLDF) is extremely rare. It is often confused with other posterior lesions and is usually diagnosed intraoperatively. Takano et al46 described the case of a 78-year-old man with acute low back pain, gait disturbance, and paresthesia in both legs. The initial diagnosis suggested malignancy, hematoma, epidural abscess, or migration of lumbar disc fragments. Using preoperative lumbar discography, a provocation technique to identify disc pain, they were able to detect the migration of disc fragments, which was consistent with subsequent surgical findings. They concluded that discography is useful for the definitive diagnosis of PEMLDF prior to surgery.

➤ …musculature. Following 2 years of speech and physical therapy, the patient was able to significantly improve her ability to communicate intelligibly while walking. The results suggested that improved postural control could be tested as a treatment to facilitate less rigid compensation of oral musculature.

Format of a Case Report

This section will provide a general description of the format for case reports. Instructions for Authors for specific journals should be consulted to understand the terminology used and the required format (see Chapter 38). Several references provide useful tips for writing case reports and reporting guidelines.35,40,49-53

A clinical case report is an intensive investigation designed to analyze and understand those factors important to the etiology, care, and outcome of the subject's problems. It is a comprehensive description of the subject's background, present status, and responses to intervention.
narrative and therefore will only highlight important information. Authors must follow journal guidelines regarding required subheadings for a case report.

➤ The Journal of Neurologic Physical Therapy, where the robot study was published, requires an abstract for a case report to include the headings Background and Purpose, Case Description, Intervention, Outcomes, and Discussion.55 The abstract is limited to 250 words.

• Introduction. A case report begins with an introduction that describes background literature on the patient problem, including theoretical or epidemiologic information needed to understand the scope and context of the disorder. This section will provide a rationale for the study—why this case report provides important information regarding patient care. This may be due to gaps in the literature, conflicting reports, new approaches that improve on current practice, or documentation of an unusual condition. The purpose of the report should be described in terms of how this case contributes to the relevant knowledge base. A review of literature is essential for establishing the rationale and interpreting outcomes.

➤ The authors described literature supporting the use of virtual environments and repetitive practice for improving motor skills. However, studies using robotic technology to date had used standardized protocols that did not allow specification to a patient's impairments and individual goals.

• Case Description. A full description of the case starts with assessments and background information. A patient history and exam findings will delineate problems, symptoms, and prior treatments, as well as demographic and social factors that are pertinent to the subject's care and prognosis. A clinical impression should be included to clarify the patient's chief complaint and goals, providing a rationale for the choice of intervention.

➤ The patient was an 85-year-old gentleman with left hemiparesis secondary to a stroke 5 years prior to this examination. The authors described his medical and rehabilitation history and examination findings related to his upper extremity function. The patient lived at home, needing close guarding and supervision for functional activities. They provided a clinical impression to describe why this patient was appropriate for this intervention.

• Intervention. This information should be followed by a full description of all elements of the intervention. This will include necessary operational definitions, characteristics of the setting, how data were collected, and how progress was determined. Where relevant, literature should be cited to support the rationale for treatment and interpretation of outcomes. The author should reference theoretical considerations and the process of clinical decision making used to develop the treatment and assessment plan.

➤ The authors described the details of the procedures used, including timing and duration of sessions over 4 weeks, providing a rationale for their choices. They provided detailed descriptions of the equipment used, strategies incorporated into the intervention, and the patient's responses.

• Outcomes. This section includes a full description of measurements, which may include reference to literature on reliability and validity. Given that there are usually many ways to measure clinical variables, the reasons for choosing specific assessments should be given. If special assessments are used, they should be described in functional detail. Some case reports are actually geared to describing the applicability of new or unusual assessment instruments for diagnosing certain problems.

The patient's progress and level of improvement or decline should be described, often with data presented in tables, charts, or graphs. Literature may provide a basis for determining the minimal clinically important difference (MCID) to allow interpretation of the outcome effect.

➤ The authors described a series of outcome measures used to evaluate changes in impairments, functional abilities, and participation activities.56 They included reliability and validity assessments from the literature to support each measure. They provided details on the patient's improvements and challenges in the application of the intervention.

• Discussion. The discussion section of the report provides interpretation of outcomes and conclusions, including reference to research, theory, and other clinical information that may support or challenge the current findings. This section should include discussion of unique or special considerations in the patient's condition or responses that may explain unexpected outcomes, or that may distinguish the patient from others with the same condition. Authors should also describe limitations to the study that impact conclusions.
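The role of an MCID in interpreting an outcome, as described under Outcomes, can be sketched as a simple comparison of an observed change score against a published threshold. The scores and the 5-point MCID below are hypothetical; in practice, the MCID for a given instrument comes from the literature.

```python
def interpret_change(baseline, followup, mcid):
    """Compare an observed change score against a minimal clinically
    important difference (MCID) taken from the literature."""
    change = followup - baseline
    if abs(change) < mcid:
        return change, "below the MCID: may not be clinically meaningful"
    direction = "improvement" if change > 0 else "decline"
    return change, f"meets the MCID: clinically meaningful {direction}"

# Hypothetical outcome scores (higher = better) and an assumed 5-point MCID
change, verdict = interpret_change(baseline=42, followup=50, mcid=5)
print(change, "->", verdict)  # 8 -> meets the MCID: clinically meaningful improvement
```

The point of the comparison is that statistical change and clinical importance are different questions: a change score can be reliably measured yet still fall short of the smallest difference patients perceive as beneficial.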
or by finding no substantial contrary evidence.63 It is also important to understand the relevant definitions and concepts used during the historical period and to recognize that standards and terminology change over time.

➤ Documents in the siderail study were obtained from archival collections in medicine and nursing, and government guidelines published by the Food and Drug Administration and the US Department of Health and Human Services.

Synthesis of Historical Data

After the validity of historical data is established, the data must be synthesized and analyzed within an objective frame of reference. The researcher must be careful not to make assumptions about the past merely because no information can be found. The historian attempts to incorporate a scientific logic into this process, so that interpretations are made as objectively as possible.

➤ Through their historical research, Brush and Capezuti62 documented how siderails became prominently used, despite incidents of injury and death attributed to patients climbing over them. The practice continued because of legal actions linking it to hospital policy. Newer ideas regarding the importance of mobility and the federal redefinition of siderails as restraints began to change attitudes and practice.

Because sources, measurements, and organization of data are not controlled, cause-and-effect statements cannot be made in historical research. One can only synthesize what is already known into systematized accounts of the past and discuss potential relationships between variables based on the sequencing of events and associated underlying characteristics of variables.

➤ The authors argued that the use of bedrails was based on a gradual consensus between law and medicine, rather than empirical evidence. They concluded that changes in this practice would have to be driven by data and that alternative strategies would only become accepted once administrators, regulators, attorneys, patients, and clinicians understood why bedrails became common practice in the first place. This historical perspective is an important consideration as policy on the use of bedrails continues to be revisited.64
COMMENTARY
To do successful research, you don’t need to know everything, you just need to
know of one thing that isn’t known.
—Arthur Schawlow (1921-1999)
American physicist
Descriptive research is often a first step to understanding what exists, often the "first scientific toe in the water" in new areas of inquiry.65 Without this fundamental knowledge, it would be impossible to ask questions about behaviors or treatment effects or to propose theories to explain them. This approach is clearly contrasted with explanatory or observational research studies that seek to determine what will happen in a given set of circumstances. Clinical scientists must first discover how the world around them naturally behaves before they can manipulate and control those behaviors to test methods of changing them.

Even though descriptive studies do not involve manipulation of variables, they still require rigor in defining and measuring variables of interest, whether they emerge as narrative descriptions or quantitative summaries. Descriptive findings can be meaningful as a basis for explanation when they are the result of a well-designed study and when they are interpreted within the context of an appropriate research question. The results of descriptive studies may provide essential evidence for understanding the benefits of clinical trials and for describing or explaining why some subjects respond differently than others.

Descriptive studies such as case reports are often the first foray into publishing for clinicians, who are best suited to provide a comprehensive account of patient management. These studies have direct impact on practice as first-line reports of clinical decision making, an important form of communication within the clinical community.

With many behavioral and clinical concepts not yet fully understood, descriptive research presents a critical opportunity for the clinical researcher and a vital source of evidence for clinicians who seek to understand patient circumstances and characteristics.
REFERENCES

1. Gesell A, Amatruda CS. The Embryology of Behavior. New York: Harper & Brothers; 1945.
2. McGraw MB. The Neuromuscular Maturation of the Human Infant. New York: Hafner; 1963.
3. Erikson EH. Childhood and Society. 2nd ed. New York: Norton; 1963.
4. Tanner JM. Physical growth. In: Mussen PH, ed. Carmichael's Manual of Child Psychology. 3rd ed. New York: Wiley; 1970.
5. Honzik MP, MacFarland JW, Allen L. The stability of mental test performance between two and eighteen years. J Exp Educ 1949;17:309.
6. Horn JL, Donaldson G. Cognitive development: II. Adulthood development of human abilities. In: Brim OG, Kagan J, eds. Constancy and Change in Human Development. Cambridge, MA: Harvard University Press; 1980.
7. Freier MC, Babikian T, Pivonka J, Burley Aaen T, Gardner JM, Baum M, Bailey LL, Chinnock RE. A longitudinal perspective on neurodevelopmental outcome after infant cardiac transplantation. J Heart Lung Transplant 2004;23(7):857-864.
8. Munsat TL, Andres PL, Finison L, Conlon T, Thibodeau L. The natural history of motoneuron loss in amyotrophic lateral sclerosis. Neurology 1988;38(3):409-413.
9. Munsat TL, Taft J, Jackson IM, Andres PL, Hollander D, Skerry L, Ordman M, Kasdon D, Finison L. Intrathecal thyrotropin-releasing hormone does not alter the progressive course of ALS: experience with an intrathecal drug delivery system. Neurology 1992;42(5):1049-1053.
10. Miller RG, Bouchard JP, Duquette P, Eisen A, Gelinas D, Harati Y, Munsat TL, Powe L, Rothstein J, Salzman P, et al. Clinical trials of riluzole in patients with ALS. ALS/Riluzole Study Group-II. Neurology 1996;47(4 Suppl 2):S86-S90; discussion S90-S92.
11. Riviere M, Meininger V, Zeisser P, Munsat T. An analysis of extended survival in patients with amyotrophic lateral sclerosis treated with riluzole. Arch Neurol 1998;55(4):526-528.
12. DeLoach A, Cozart M, Kiaei A, Kiaei M. A retrospective review of the progress in amyotrophic lateral sclerosis drug discovery over the last decade and a look at the latest strategies. Expert Opin Drug Discov 2015;10(10):1099-1118.
13. Unruh KP, Kuhn JE, Sanders R, An Q, Baumgarten KM, Bishop JY, Brophy RH, Carey JL, Holloway BG, Jones GL, et al. The duration of symptoms does not correlate with rotator cuff tear severity or other patient-related features: a cross-sectional study of patients with atraumatic, full-thickness rotator cuff tears. J Shoulder Elbow Surg 2014;23(7):1052-1058.
14. Landa RJ, Gross AL, Stuart EA, Faherty A. Developmental trajectories in children with and without autism spectrum disorders: the first 3 years. Child Dev 2013;84(2):429-442.
15. Kraemer HC, Yesavage JA, Taylor JL, Kupfer D. How can we learn about developmental processes from cross-sectional studies, or can we? Am J Psychiatry 2000;157(2):163-171.
16. Reither EN, Hauser RM, Yang Y. Do birth cohorts matter? Age-period-cohort analyses of the obesity epidemic in the United States. Soc Sci Med 2009;69(10):1439-1448.
17. Ehler E, Ridzoň P, Urban P, Mazanec R, Nakládalová M, Procházka B, Matulová H, Latta J, Otruba P. Ulnar nerve at the elbow – normative nerve conduction study. J Brachial Plex Peripher Nerve Inj 2013;8:2.
18. Haghighat S, Mahmoodian AE, Kianimehr L. Normative ulnar nerve conduction study: comparison of two measurement methods. Adv Biomed Res 2018;7:47.
19. Chen S, Andary M, Buschbacher R, Del Toro D, Smith B, So Y, Zimmermann K, Dillingham TR. Electrodiagnostic reference values for upper and lower limb nerve conduction studies in adult populations. Muscle Nerve 2016;54(3):371-377.
20. Nelson R, Agro J, Lugo J, Gasiewska E, Kaur H, Muniz E, Nelson A, Rothman J. The relationship between temperature and neuronal characteristics. Electromyogr Clin Neurophysiol 2004;44(4):209-216.
21. Rivner MH, Swift TR, Malik K. Influence of age and height on nerve conduction. Muscle Nerve 2001;24(9):1134-1141.
22. Mayfield J. Diagnosis and classification of diabetes mellitus: new criteria. Am Fam Physician 1998;58(6):1355-1362, 1369-1370.
23. Sandoiu A. Type 2 diabetes: new guidelines lower blood sugar control level. Medical News Today, March 6, 2018. Available at https://www.medicalnewstoday.com/articles/321123.php. Accessed November 9, 2018.
24. Qaseem A, Wilt TJ, Kansagara D, Horwitch C, Barry MJ, Forciea MA. Hemoglobin A1c targets for glycemic control with pharmacologic therapy for nonpregnant adults with type 2 diabetes mellitus: a guidance statement update from the American College of Physicians. Ann Intern Med 2018;168(8):569-576.
25. Maddox T. Tests: A Comprehensive Reference for Assessment in Psychology, Education and Business. 5th ed. Austin, TX: Pro-Ed; 2001.
26. Lee-Valkov PM, Aaron DH, Eladoumikdachi F, Thornby J, Netscher DT. Measuring normal hand dexterity values in normal 3-, 4-, and 5-year-old children and their relationship with grip and pinch strength. J Hand Ther 2003;16(1):22-28.
27. Beasley BW, Woolley DC. Evidence-based medicine knowledge, attitudes, and skills of community faculty. J Gen Intern Med 2002;17:632-639.
28. Surrey LR, Hodson J, Robinson E, Schmidt S, Schulhof J, Stoll L, Wilson-Diekhoff N. Pinch strength norms for 5- to 12-year-olds. Phys Occup Ther Pediatr 2001;21(1):37-49.
29. Häger-Ross C, Rosblad B. Norms for grip strength in children aged 4-16 years. Acta Paediatr 2002;91(6):617-625.
30. Sartorio A, Lafortuna CL, Pogliaghi S, Trecate L. The impact of gender, body dimension and body composition on hand-grip strength in healthy children. J Endocrinol Invest 2002;25(5):431-435.
31. Hanten WP, Chen WY, Austin AA, Brooks RE, Carter HC, Law CA, Morgan MK, Sanders DJ, Swan CA, Vanderslice AL. Maximum grip strength in normal subjects from 20 to 64 years of age. J Hand Ther 1999;12(3):193-200.
32. Horowitz BP, Tollin R, Cassidy G. Grip strength: collection of normative data with community dwelling elders. Phys Occup Ther Geriatr 1997;15(1):53-64.
33. Jensen MP, Abresch RT, Carter GT, McDonald CM. Chronic pain in persons with neuromuscular disease. Arch Phys Med Rehabil 2005;86(6):1155-1163.
34. Centers for Disease Control and Prevention (CDC). Data & Statistics. Available at https://www.cdc.gov/DataStatistics/. Accessed November 10, 2018.
35. Cohen H. How to write a patient case report. Am J Health Syst Pharm 2006;63(19):1888-1892.
36. Jansen CWS. Case reports and case studies: a discussion of theory and methodology in pocket format. J Hand Ther 1999;12(3):230-232.
37. Porcino A. Not birds of a feather: case reports, case studies, and single-subject research. Int J Ther Massage Bodywork 2016;9(3):1-2.
38. Ridder H. The theory contribution of case study research designs. Business Res 2017;10(2):281-305.
39. Salminen AL, Harra T, Lautamo T. Conducting case study research in occupational therapy. Austr Occup Ther J 2006;53(1):3-8.
40. McEwen I. Writing Case Reports: A How-To Manual for Clinicians. 3rd ed. Alexandria, VA: American Physical Therapy Association; 2009.
41. Amaral S, Kessler SK, Levy TJ, Gaetz W, McAndrew C, Chang B, Lopez S, Braham E, Humpl D, Hsia M, et al. 18-month outcomes of heterologous bilateral hand transplantation in a child: a case report. Lancet Child Adolesc Health 2017;1(1):35-44.
42. Galvez JA, Gralewski K, McAndrew C, Rehman MA, Chang B, Levin LS. Assessment and planning for a pediatric bilateral hand transplant using 3-dimensional modeling: case report. J Hand Surg Am 2016;41(3):341-343.
43. Gurnaney HG, Fiadjoe JE, Levin LS, Chang B, Delvalle H, Galvez J, Rehman MA. Anesthetic management of the first pediatric bilateral hand transplant. Can J Anaesth 2016;63(6):731-736.
44. Nelson CG, Elta T, Bannister J, Dzandu J, Mangram A, Zach V. Severe traumatic brain injury: a case report. Am J Case Rep 2016;17:186-191.
45. MacLachlan M, McDonald D, Waloch J. Mirror treatment of lower limb phantom pain: a case study. Disabil Rehabil 2004;26(14-15):901-904.
46. Takano M, Hikata T, Nishimura S, Kamata M. Discography aids definitive diagnosis of posterior epidural migration of lumbar disc fragments: case report and literature review. BMC Musculoskelet Disord 2017;18(1):151.
47. Rosenbaum RS, Kohler S, Schacter DL, Moscovitch M, Westmacott R, Black SE, Gao F, Tulving E. The case of K.C.: contributions of a memory-impaired person to memory theory. Neuropsychologia 2005;43(7):989-1021.
48. Reinthal AK, Mansour LM, Greenwald G. Improved ambulation and speech production in an adolescent post-traumatic brain injury through a therapeutic intervention to increase postural control. Pediatr Rehabil 2004;7(1):37-49.
49. Riley DS, Barber MS, Kienle GS, Aronson JK, von Schoen-Angerer T, Tugwell P, Kiene H, Helfand M, Altman DG, Sox H, et al. CARE guidelines for case reports: explanation and elaboration document. J Clin Epidemiol 2017;89:218-235.
50. Rison RA. A guide to writing case reports for the Journal of Medical Case Reports and BioMed Central Research Notes. J Med Case Rep 2013;7:239.
51. Sun Z. Tips for writing a case report for the novice author. J Med Radiat Sci 2013;60(3):108-113.
52. Green BN, Johnson CD. How to write a case report for publication. J Chiropr Med 2006;5(2):72-82.
53. Gagnier JJ, Kienle G, Altman DG, Moher D, Sox H, Riley D. The CARE guidelines: consensus-based clinical case reporting guideline development. J Med Case Rep 2013;7:223.
54. Fluet GG, Merians AS, Qiu Q, Lafond I, Saleh S, Ruano V, Delmonico AR, Adamovich SV. Robots integrated with virtual reality simulations for customized motor training in a person with upper extremity hemiparesis: a case study. J Neurol Phys Ther 2012;36(2):79-86.
55. Journal of Neurologic Physical Therapy. Instructions for Authors. Available at https://journals.lww.com/jnpt/Pages/InstructionsforAuthors.aspx. Accessed November 24, 2018.
56. World Health Organization. International Classification of Functioning, Disability and Health. Geneva: World Health Organization; 2001.
57. Alzheimer A, Stelzmann RA, Schnitzlein HN, Murtagh FR. An English translation of Alzheimer's 1907 paper, "Uber eine eigenartige Erkankung der Hirnrinde." Clin Anat 1995;8(6):429-431.
58. Jordan WM. Pulmonary embolism. Lancet 1961;2:1146.
59. Centers for Disease Control. Pneumocystis pneumonia—Los Angeles. MMWR 1981;30:250.
60. Lusk B. Historical methodology for nursing research. Image J Nurs Sch 1997;29(4):355-359.
61. Rees C, Howells G. Historical research: process, problems and pitfalls. Nurs Stand 1999;13(27):33-35.
62. Brush BL, Capezuti E. Historical analysis of siderail use in American hospitals. J Nurs Scholarsh 2001;33(4):381-385.
63. Christy TE. The methodology of historical research: a brief introduction. Nurs Res 1975;24(3):189-192.
64. O'Keeffe ST. Bedrails rise again? Age Ageing 2013;42(4):426-427.
65. Grimes DA, Schulz KF. Descriptive studies: what they can and cannot do. Lancet 2002;359(9301):145-149.
CHAPTER 21
Qualitative Research
—with Heather Fritz and Cathy Lysack
DESCRIPTIVE EXPLORATORY EXPLANATORY
The types of research discussed thus far have included explanatory, exploratory, and descriptive approaches, with one commonality: they have all been based on quantitative methods, using empirical data to answer questions about cause and effect or to establish linkages among variables based on statistical applications. An important limitation of the quantitative paradigm, however, is its inability to capture human experience in more nuanced, contextual ways, including understanding the experience of patients, families, and healthcare providers who encounter illness and disability in clinical practice and daily life.

Qualitative research is based on the belief that all interactions are inherently social phenomena. It is used to provide evidence when the goal is to identify, describe, and ultimately understand human or organizational behaviors, attitudes, and experiences, and how they influence health. The purpose of this chapter is to provide an overview of qualitative research methods that will provide sufficient background to appreciate published qualitative studies.

Because the true complexity of qualitative research is beyond the scope of this text, readers who want to pursue qualitative study are encouraged to seek out expert consultants and to refer to the many excellent references that provide more in-depth coverage of qualitative methods.1-8

■ Human Experience and Evidence

Concepts of research typically involve quantitative assumptions within which human experience is assumed to be limited to logical and predictable relationships. Linked to the philosophy of logical positivism, this approach to inquiry is deductive and hypothesis-based, requiring that
the researcher operationally define the variables of interest in advance.

However, when we think about all the reasons that an intervention does or does not work, or why some patients adhere to therapeutic regimens while others do not, we realize that "counting" responses for groups is unlikely to bring us the full understanding that is needed to appreciate how personal differences impact outcomes. And while healthcare providers try to give patients the best care possible, they may be missing essential underlying elements that can influence outcomes.

In contrast, qualitative research favors a gestalt way of seeing the world that strives to represent social reality as it is found naturally. It goes beyond learning about individual beliefs and behaviors to more deeply discover and describe socioculturally constructed worldviews, values, and norms and how these are instilled, enacted, reinforced, or resisted and changed in everyday life. To obtain these insights, qualitative researchers must conduct inquiry in the natural settings where people interact and then find ways to systematically capture, mostly in narrative form, qualities that are often complex and difficult to measure. Where quantitative research is based on numbers and statistics, qualitative data are text-based, involving open-ended discussions and observation.

Whereas quantitative researchers try to control the influence of "extraneous variables" through standardized design and measurement, qualitative researchers work to capture as fully as possible the diverse experiences and events they study, so that the data gathered will be the truest possible representation of the real world, with all of its "messiness." The "insider view" of the individuals' world is particularly important for understanding the factors that are relevant to clinical assessment, care, and decision-making.

Qualitative research is generally considered an inductive process, with conclusions being drawn directly from the data. The value of qualitative methods lies in their ability to contribute insights and address questions that are not easily answerable by experimental methods.9 In its broadest sense, qualitative research must be understood as the systematic study of social phenomena, serving three important purposes:10

• Describing groups of people or social phenomena
• Generating hypotheses that can be tested by further research
• Developing theory to explain observed phenomena

Complementary Traditions

As part of a broad understanding of research, it is useful to consider how quantitative and qualitative approaches are distinguished, as well as how they can serve complementary functions.11 For example, qualitative research can be used as a preliminary phase in quantitative investigation, especially when there is a lack of prior research or theory. Understanding values and behaviors can help to identify relevant variables that can lead to generation of hypotheses, development and validation of instruments, setting a framework for evaluation, and developing research questions.12

Qualitative methods can also be used when a topic has been studied extensively with quantitative approaches, but the phenomenon is still not well understood and findings are contradictory. Qualitative research may be critical to "going deeper" to ask additional questions and may also help to explain unexpected findings of a randomized trial (see Focus on Evidence 21-1).

(Image courtesy of Publicdomainvectors.org.)

Contributing to Evidence-Based Practice

Qualitative research addresses essential components of evidence-based practice. As described in Chapter 5, EBP requires attention to several sources of information, including patient preferences and providers' clinical judgments. These components require an understanding of how patients and care providers view their interactions with each other, family members, and colleagues within their social environments, and how those interactions influence care and health (see Fig. 21-1).

Qualitative designs are uniquely suited to clarify our understanding of patients' perspectives about treatment. There is increasing attention to the patient and family's perspectives today, as health systems recognize that outcomes are ultimately more positive if the patient's needs and preferences are considered throughout treatment planning and delivery.19-20 A better understanding of patients' preferences can also give practitioners the opportunity to understand the concepts of health, illness, and disability from the direct perspective of the person who lives it.

Qualitative study can also contribute to understanding how providers' expertise, implicit behaviors, or professional cultures influence treatment encounters. For example, Jensen et al21 developed a theoretical model related to what constitutes expert practice in physical therapy. They identified four dimensions of virtue, knowledge, movement, and clinical reasoning that become increasingly
Focus on Evidence 21–1
You Won't Find What You Aren't Looking For!

In 2015, Pakkarinen et al13 reported on a randomized controlled trial of 201 patients comparing long-term weight loss with and without a maintenance program. The intervention consisted of a defined low-calorie diet for 17 weeks for both groups, and then the treatment group began a 1-year maintenance regimen, including behavioral modification and regular meetings of support groups. Follow-up after 2 years showed no significant difference between the two groups, which the authors attributed to subject attrition, poor compliance with the weight-loss program, and a lack of sensitivity in their measuring tools.

Unfortunately, the authors neglected to consult evidence from qualitative research studies and thus did not consider several other factors known to influence weight-loss behavior, including self-efficacy. The authors also disregarded evidence from qualitative studies that showed how social and emotional factors enable or inhibit maintenance of weight loss. If they had considered these data, they may have reframed the intervention and gotten different results. For example, Cioffi14 found that needing to come to classes and involvement with a group were inhibiting factors in weight-loss maintenance efforts. Other qualitative studies have shown that poor coping skills, inconsistent routines, unrealistic long-term goals, and social pressures all present challenges to weight-loss maintenance,15-18 while social support, readiness to change, effective coping skills, and self-monitoring contribute to success.

After 2 years of involvement with 201 patients, as well as scores of personnel involved in delivering the long-term maintenance program, it is noteworthy that individual perceptions about weight loss went unnoticed. Qualitative studies that focus on personal rationales and intentions, as in this field, often hold the key to understanding complex health behavior, what factors may influence it, and how it can be studied.
[Figure: Clinical Decision-Making, integrating the clinical question, best available research evidence, clinical expertise, and patient values and preferences.]

…apply evidence that is available (see Chapter 2). For example, clinicians may find it difficult to understand why a "proven" treatment program is not widely adopted by patients, even after RCTs have demonstrated positive outcomes. This challenge is heightened by the complexity of human behavior. New therapies may not be adopted because of treatment characteristics, the context where the intervention will be delivered, or even the attitudes and behaviors of providers themselves (see Focus on Evidence 21-2).22
Focus on Evidence 21–2
Understanding the Provider-Patient Encounter

African Americans who have cancer tend to experience worse health and quality of life outcomes when compared to their Caucasian counterparts with similar disease burden.24 Many factors can contribute to these disparities, including socioeconomic hardship, lack of health insurance, and environmental and behavioral risk factors. There are also disparities in how research supports care of different groups. For example, African Americans are often underrepresented in clinical trials that support state-of-the-art cancer treatment.25

Because oncologists play a central role in guiding patients with cancer through complex treatment decisions, it is important to consider the impact of provider behaviors in the clinical encounter. Qualitative studies using real-time video of provider-patient clinical interactions have provided important insights into how oncologists interact differently with Caucasians versus African Americans, potentially contributing to differences in treatment, which then result in disparities in outcomes.

One such study showed that oncologists spent less time with, and made fewer mentions of available clinical trials to, their African American patients than to Caucasian patients.25 Another study examined whether race or ethnicity was associated with the degree to which patients asked questions or raised treatment concerns when interacting with their oncologist.26 That study found that African American patients asked fewer questions and stated fewer direct concerns. Consideration of these findings can contribute to providers becoming more aware of their own biases and the potential for race/ethnicity-based differences in patient communication styles. Through this understanding, providers can be more vigilant and work to improve knowledge-sharing during clinical encounters. It may also trigger development of educational programs to help patients understand the importance of their own participation in treatment decisions.
These researchers observed what family caregivers and staff did to assist and support the older residents during meal times and took detailed notes about behaviors and gestures to encourage eating and drinking. In addition to the objective data of what residents ate and drank, data revealed widespread differences in patient health over that timeframe that could be directly linked to a systematic lack of time to provide support to eat. Detailed observations showed that many residents could eat and drink if someone were there to encourage and assist them.

Organizational Processes

Qualitative research is also well-suited to investigate complex activities and systems, including organizational processes and cultures that are otherwise difficult to assess. In a study of operational failures in hospitals, Tucker and coworkers29 stated this purpose:

Prior studies have found that explicit efforts to map the flow of materials in hospitals can identify opportunities for improvement. We build on these studies by explicitly examining underlying causes of operational failures in hospitals' internal supply chains. We looked at operational failures related to hospital room turnover, or the rate of getting a recently vacated hospital room ready for the next patient.

Through observation and interviews on the hospital units, the authors were able to identify themes related to specific routines, knowledge, and collaboration mechanisms that could improve material flow and efficiencies across departments.

Exploring Special Populations

Qualitative studies can also provide insights into why individuals participate in research and why they do or do not adhere to protocols, contributing to more effective design of clinical trials.30 Henshall et al31 studied adults with recently diagnosed type 1 diabetes (T1D) to explore motivators and deterrents toward participating in clinical trials:

Barriers to participation in clinical trials are well documented, although no studies have looked at barriers to recruitment in people with T1D. We sought to determine the overall experiences of newly diagnosed adults with T1D in an exercise study, to understand issues that influence the retention of trial participants in such studies.

They identified several themes that presented barriers, including too much study paperwork, long questionnaires, lack of feedback during a trial, and practical hurdles such as work pressures, travel time to appointments, and not liking blood tests. This information can be used to develop strategies to alleviate such barriers.

Developing Valid Measurements

Many measurement tools have been developed to quantify constructs such as quality of life, function, or pain (see Chapter 12). Qualitative information can provide essential insight into validity as such scales are developed (see Focus on Evidence 21-3).
Focus on Evidence 21–3
Fulfilling a PROMIS

No, it isn't misspelled. The Patient-Reported Outcomes Measurement Information System (PROMIS) is the result of an NIH-funded 5-year cooperative group program of research designed to develop, validate, and standardize item banks to measure patient-reported outcomes (PROs) across common medical conditions (see Chapter 2). This was a completely new effort to standardize patient outcomes data so comparisons could be made more easily and the data could be exploited more cost-effectively.33 The PROMIS Pediatric Working Group created self-report item banks for ages 8 to 17 years across five general health domains: emotional health, pain, fatigue, physical function, and social health, consistent with the larger PROMIS network.33 Qualitative data were an important aspect of this initiative, to understand the issues that are faced by patients to inform what questions to ask and how they should be worded.34

As an example, one study specifically looked at items relevant to children with asthma.35 Using focus groups with semi-structured interviews, the researchers spoke with 41 children aged 8 to 17 years, 20 with asthma, to determine the unique difficulties and challenges of this population. They identified five themes relevant for the children with asthma: 1) They experienced more difficulty in physical activities; 2) They could experience anxiety about their asthma at any time; 3) They experienced sleep disturbances and fatigue; 4) Their asthma affected their emotional well-being; and 5) They often had insufficient energy to complete school activities.

These findings were then used to modify the questions being asked of this age group. For example, although questions did address sources of anxiety, there was no item on "fear of dying," which was a concern commonly expressed by the children with asthma. This item was added to capture the concerns that some children with asthma or other chronic diseases might have.

Based on these qualitative data, the item bank was expanded to include items that address broader issues faced by children with chronic disorders. Without this approach, it would not have been possible to create questions that truly reflect relevant outcomes.
Grounded Theory

A second common qualitative research approach is called grounded theory. As its name implies, theory derived from the study is "grounded" in the data. What distinguishes the grounded theory method from other qualitative approaches is the formalized process for simultaneous data collection and theory development. This method calls for a continuous interplay between data collection and interpretation, leading to generalizations that are project-specific for that data.40

➤ CASE IN POINT #2
Vascular leg amputations often occur in frail and vulnerable populations. Rehabilitation generally focuses on gait training, rather than the patient's experience. Madsen et al41 designed a study to answer the question, "What is the main concern of patients shortly after having a leg amputated and how do they resolve it?" Understanding this experience can inform providers as to patient needs during hospitalization as well as planning for care post-discharge.

Constant Comparison

In the grounded theory tradition, the work of developing a theory or explanation from the data is built on a process of constant comparison. The process is one by which each new piece of information (a belief, explanation for an event, experience, symbol, or relationship) is compared to data already collected to determine where the data agree or conflict. As this iterative process continues, interrelationships between ideas begin to emerge. It continues until saturation occurs, a point where the researcher makes a judgment that additional data are yielding few new ideas or relationships.42

➤ Case #2: Researchers sought to build a theoretical model to explain patients' behavior shortly after having a leg amputated as a result of vascular disease. Through observation and interviews, researchers collected data using the constant comparative method, covering the period 4 weeks postsurgery in 11 patients. Data were examined for patterns until saturation occurred, yielding categories that focused on the patients' main concern, "How do I manage my life after having lost a leg?", and its resolution as a life-changing event: losing control, digesting the shock, and regaining control. These categories supported a core theory they termed "Pendulating," describing how patients were swinging both cognitively and emotionally throughout the process.

This theory, illustrated in Figure 21-2, offers a tool for clinicians to recognize where patients are in their recovery process and how they can be best supported by recognizing their reactions.

Phenomenology

Phenomenology is built on the premise that lived experiences and meanings are fully knowable only to those who experience them. In daily life, we share these experiences using language and sometimes nonverbal behaviors. Thus, the skill of the qualitative researcher rests on the ability to identify an appropriate source of information and the ability to find the right words to use to elicit rich descriptions of the study participants' experiences and the meanings they attribute to those experiences. Researchers must also have skills in a second step: finding their own words to report those findings in written form.

➤ CASE IN POINT #3
Hospitalized patients have significant medical and social issues that need to be considered in preparation for discharge, a process that is often time-consuming, inefficient, and complex. Pinelli and colleagues43 conducted a phenomenological study to assess challenges in the discharge process in a teaching hospital from the perspectives of healthcare professionals and patients. They specifically explored communication practices between providers, awareness of colleagues' roles and responsibilities, factors inhibiting or promoting interprofessional collaboration, and potential improvements in the discharge process.

The task of data collection in the qualitative tradition is to engage in lengthy discussions with participants about their experiences, to listen carefully, and then to summarize common themes in their expres-
These included being overwhelmed, facing dependency, sions of these experiences that convey a clear essential
managing consequences and building up hope. meaning.44,45
Figure 21-2 The Pendulating model: patients swing between losing control (being overwhelmed; facing dependency), digesting the shock (swallowing the life-changing decision; detecting the amputated body), and regaining control (managing consequences; building up hope and self-motivation).

other emotions. For example, this dimension can be affected by hospitalization that separates patients from family and the need to interact with many different professionals, particularly when care is seen as fragmented.

Temporality is a subjective sense of lived time, as opposed to clock time. Memories and perceptions can be influenced by how time is perceived under different circumstances. For instance, patient experiences during a hospital stay can affect how patients perceive the passage of time. It may feel like it takes forever for someone to help them. Providers may assess time differently in a busy and stressful environment, when it seems like there is never enough time. This can impact perception of the timeliness of discharge, for example.
These four cases will be used to illustrate varied concepts related to qualitative data collection and analysis throughout this chapter.

➤ Case #1: In the study of day-care staff, a plan was developed for the author to participate as a member of the staff during different times of day across three weeks.39 The staff were fully aware of her role. Handwritten field notes were recorded sporadically through the day.
quality of the insights from previous research, whether a broader sample is needed to expand exploration of a well-studied topic or a smaller sample is needed to focus in an area that is not yet well understood.

Theoretical Sampling
In the development of theory, qualitative researchers try to collect data that fully capture the scope of the phenomenon being studied. When building a theory, however, researchers often apply a purposive strategy in a different way, using a process called theoretical sampling. Once the research question is formulated, a small number of relevant participants are recruited and interviewed, and data are analyzed to identify initial concepts. As issues are brought up with early cases, new participants who are likely to expand or challenge those results are then interviewed, and further analysis is performed, leading to decisions about what additional questions should be added to the interview guide and queried in subsequent interviews.55 Thus, as more cases are interviewed, additional questions are added to the interview guide. It is important to note that the research question does not change, only the questions in the guide that are used to collect data.

New participants continue to be interviewed and data analyzed as components of the theory emerge and new perspectives are presented, continually reframing the theoretical premise. The process continues, moving back and forth between recruitment of new subjects, data collection, and analysis, until the researcher determines that data have reached saturation.

➤ Case #2: In the study of patient behavior following leg amputation, researchers used principles of theoretical sampling.41 Their interview guide was customized from interview to interview as analysis developed. They continued to recruit patients as they became accessible, until they found that the three components of the "pendulating" theory were explained.

■ Data Analysis and Interpretation
Qualitative data analysis is primarily an inductive process, with a constant interplay between data that represent the reality of the study participants and theoretical conceptualizations of that reality. Therefore, at least to some degree, analysis must be undertaken as data are collected. This is particularly true in observational studies and studies guided by grounded theory.

In studies using in-depth interviews, some analysis will be undertaken early on, but once the researchers are satisfied that they are getting high-quality data and the questions are yielding strong insights, it is not unusual to proceed with data collection and engage more heavily in analysis later in the process.

Data analysis involves systematic procedures to understand the meaning of the data.45 Although the details of how to undertake qualitative data analysis are beyond the scope of this chapter, readers must be aware that there is an array of different analytic approaches. The techniques of data analysis can vary from purely narrative descriptions of observations to creating a detailed system of coding the data from which categories can be developed and, in a systematic way, patterns or themes derived from the mass of information.

Drawing Meaning from Narrative Data
Coding
Because qualitative data are textual, coding is an important process to make sense of the narratives. It provides the foundation for determining how themes or concepts emerge from the data by giving structure to the data. Codes are the smallest units of text that represent a common theme. Codes can be phrases or words. While some researchers will develop a coding scheme prior to data collection, it is more common to proceed in an inductive fashion by establishing codes as the data are reviewed.

Content analysis is a systematic technique used to draw inferences by interpreting and coding textual material. It involves counting the frequency of words in a text to analyze meanings and relationships of certain words, themes, or categories within the data.

➤ Case #3: In their analysis of discharge processes, Pinelli et al43 used an inductive process to develop categories, integrating data from one-on-one and focus group interviews. After reading through transcripts to get a gestalt of the process, they reviewed words and phrases that recurred frequently to create an initial code book. The unit of analysis was phrases that contained one idea. The team met regularly to check progress of coding and made minor modifications as needed.

Five final categories were derived: barriers related to the discharge process, a lack of understanding of roles, communication breakdowns, patient issues, and poor collaboration processes. Identified themes are typically illustrated in quotes from participant narratives. For example, one resident described the chaotic nature of the discharge process:

It's like trying to pack for a trip to Europe for two weeks. You have five minutes to do it and try to throw everything you can think of in the suitcase. But you're
going to forget your toothbrush or something essential. Very often with discharges they evolve that way.

A consulting physician described the lack of delineation of responsibilities:

There's misunderstanding of who's going to do what. For example, when we say continue antibiotics for at least four weeks until your follow-up CT scan is done, who's scheduling that CT scan? We think the primary team is going to do it, the primary team thinks we're going to do it.

Content analysis benefits hugely from the power of computing to speed up the analysis and ensure a comprehensive enumeration of words. Bengtsson56 provides a helpful review of the types of content analysis and the steps in conducting a rigorous content analysis study.

Volume of Data
Because observational and interview data are often recorded as narratives, qualitative data are typically voluminous. This is especially true with the common use of digital audio recordings along with written memos, which are then transcribed. It is not unusual for a 1-hour interview to yield 40 to 60 pages of single-spaced text.

Because of the volume of detailed data, analysis can be a time-consuming process. An experienced interviewer will need to distinguish between data that are relevant and data that are not central to the study aims. Because of time and cost, judgments may be made to set aside less relevant data and transcribe and analyze only the most important data. These are important decisions to make. No researcher wants to invest time and resources in collecting data that are not used. Similarly, no researcher wants to lose important insights by making poor judgments about what data matter and what do not.

Analysis of qualitative data by hand involves many hours of sifting through written texts. Data analysis software can help manage large amounts of narrative data, assisting in recording, storing, indexing, coding, and interconnecting text-based material.57 Three of the more commonly used software packages for qualitative data analysis are NVIVO,58 Atlas.ti,59 and QDA Miner.60 There is some debate about the use of computer programs for data analysis, but most acknowledge their value as a data organizing tool. The fundamental objections focus on the loss of intimacy with the data and the criticism that an overreliance on software can result in mere quantification and counts of words in a text over more nuanced interpretations of themes and patterns. Computer software may be a valuable tool to assist in managing voluminous data, but it does not direct interpretive activity.36 The researchers themselves are responsible for analyzing data, developing theory, and drawing conclusions about the findings.

Evaluating Quality
Reliability and validity are issues of concern in qualitative research just as they are in all types of research. Because the data sources are words rather than numbers, different terms and techniques are used to describe rigor in terms of trustworthiness of the data. Though multiple frameworks have been proposed for assessing the quality of qualitative research,61-63 the four criteria proposed by Lincoln and Guba1 remain widely used: credibility, transferability, dependability, and confirmability (see Fig. 21-3).

Although the criteria of credibility, transferability, dependability, and confirmability reflect a substantially different paradigm from quantitative methods, they have been likened to internal validity, external validity, reliability, and objectivity, respectively.1

Credibility
Credibility refers to actions that are taken to ensure that the results of a qualitative study are believable: that there is confidence in the "truth" of the findings. In planning the study, researchers use well-developed research methods, operationalize the phenomenon of interest, make

Figure 21–3 Techniques used to evaluate "trustworthiness" of qualitative studies, based on the evaluative criteria developed by Lincoln and Guba.1 Credibility: triangulation, member checking, negative case analysis. Transferability: thick description, purposive sampling. Dependability: audit trail, triangulation. Confirmability: audit trail, triangulation, reflexivity.
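The comprehensive enumeration of words that content analysis software performs can be sketched in a few lines. This is a minimal illustration, not the behavior of NVIVO, Atlas.ti, or QDA Miner; the function name and the small stop-word list are invented for the example, which reuses one of the discharge quotes as sample text.

```python
# Minimal sketch of the counting step in content analysis: tally word
# frequencies in a transcript, ignoring common function words, so that
# recurring terms can point the analyst toward candidate codes.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "and", "of", "is", "it", "that", "we", "do"}

def word_frequencies(transcript, top_n=5):
    """Return the top_n most frequent non-stop-words in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

transcript = ("We think the primary team is going to do it, "
              "the primary team thinks we're going to do it.")
print(word_frequencies(transcript, top_n=3))
```

Even this crude tally surfaces "primary," "team," and "going" as recurring terms, mirroring how frequency counts can flag candidate codes; the interpretive work of deciding what those recurrences mean remains the researcher's.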
Mixed Methods Typologies
Mixed methods researchers have developed design typologies helpful for organizing both research approaches within a single study. Though not an exhaustive list, the four basic mixed methods designs include convergent, sequential, embedded, and multiphase designs.71,72

• Convergent designs involve the collection, often simultaneously, of both qualitative and quantitative data, combining the data during the data analysis and interpretation phase.
• Sequential designs are useful when the intent is to have one dataset build on the other to enhance understanding of a phenomenon. For example, a researcher may collect and analyze survey data followed by in-depth interviews to better understand the quantitative findings.
• Embedded designs are especially useful in health sciences as researchers work to develop and refine interventions. Researchers may collect quantitative data to determine the efficacy of a trial while also using interviews or other qualitative approaches to better understand participants' experiences of being in the intervention.
• Multiphase designs are best suited for linked projects that are conducted over time and that require different types of knowledge generated at each stage for the project as a whole to advance.

Each of these study designs can further be classified as predominantly quantitative, predominantly qualitative, or equally balanced depending on the methods used.

➤ Data in the DSM study were gathered sequentially, starting with (1) the Diabetes Care Profile followed by an interview; (2) participant-generated photography followed by an interview; and then (3) a time geographic diary followed by an interview. Data were collected in this sequence for all participants.

Sampling

➤ Fritz68 recruited a sample of 10 women for the DSM study from a list of participants who had previously participated in formal DSM classes. She intentionally recruited participants with a broad range of demographic characteristics, including age, education, family size, and socioeconomic status.

Data Collection
Within mixed methods studies, researchers are free to use a wide and creative range of data collection methods, including the use of technology.

➤ Fritz68 incorporated participant-generated photography, whereby participants were asked to "take photographs of the things you associate with diabetes self-management." These photos were then used as a source of discussion in a photo-elicitation interview. She also asked participants to complete a time geographic diary, in which they were asked to record all of their daily activities, with the assumption that DSM would emerge as part of the daily routine. Follow-up interviews focused on why certain activities occurred in different locations and how the DSM activities competed with daily demands.

Coding and Analysis
Investigators must also consider how to treat the varied types of qualitative and quantitative data during analysis and interpretation. Mixing of research data may be one of the least clear aspects of mixed methods research, yet it is essential for assessing the quality of inferences made. The three most common approaches discussed in the literature are merging, connecting, and embedding.71

Merging, also referred to as integrating, involves combining qualitative and quantitative data in both the analysis and reporting of results. The goal of merging is to reveal where findings from each method converge, offer complementary information on the issue, or contradict each other.73

The grounded theory process resulted in the development of the Transactional Model of Diabetes Self-Management Integration, which depicts a process whereby participants accepted how aspects of diabetes education and management can become integrated within the context of their daily lives. This process is summarized in Figure 21-4, showing how each method informed the other.

Figure 21–4 The process of integrating multiple data sources during data analysis: an interview in which the participant reports consistently eating a healthy diet and monitoring BGL is contradicted by multiple incidents of hyperglycemia on the DCP; photographs show mixed messages about eating and BGL monitoring habits, prompting additional lines of questioning in photo-elicitation interviews; the TGD contains multiple entries of eating unhealthy foods and skipping BGL monitoring, and the participant explains the contradictions, expanding existing codes. Inferences: the participant's habits and life situations still do not support consistent adherence to diet and BGL monitoring; the participant continues to explore how to negotiate diet into life context and practices different approaches. Note: BGL = blood glucose level, DCP = Diabetes Care Profile, and TGD = time geographic diary. From Fritz, H. Learning to do better: the Transactional Model of Diabetes Self-Management Integration. Qual Health Res 2015;25(7):875-886. Figure 1, p. 879. Used with permission.

Connecting data consists of analyzing one dataset and then using the inferences generated from that analysis to inform subsequent data collection. This approach is especially useful in survey development because initial qualitative interviews about a phenomenon may generate data to suggest potential items for a quantitative instrument.

When embedding data, researchers integrate data of secondary priority into a larger principal dataset. For example, in a study of pain management, qualitative data were embedded within an intervention trial.74 Even though the trial found a significant decrease in pain for the treatment group, qualitative data revealed that many patients still did not obtain satisfactory relief.
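The merging approach, aligning each participant's quantitative results with their qualitative codes to see where the sources converge or contradict, can be sketched with hypothetical data. This is not Fritz's analysis or any package's API: the participant IDs, measures, codes, and the contradiction rule are all invented for illustration.

```python
# Schematic sketch of "merging" (hypothetical data, not the DSM study):
# align each participant's quantitative result with their qualitative
# codes so the analyst can see where the two sources agree or conflict.

quantitative = {"P1": {"hyperglycemia_episodes": 6},
                "P2": {"hyperglycemia_episodes": 0}}
qualitative = {"P1": ["reports healthy diet", "skips glucose monitoring"],
               "P2": ["reports healthy diet", "monitors consistently"]}

def merge_by_participant(quant, qual):
    """Join the two datasets on participant ID and flag contradictions."""
    merged = {}
    for pid in quant.keys() & qual.keys():
        episodes = quant[pid]["hyperglycemia_episodes"]
        codes = qual[pid]
        # Flag cases where the interview account conflicts with the numbers.
        contradiction = episodes > 0 and "reports healthy diet" in codes
        merged[pid] = {"episodes": episodes, "codes": codes,
                       "contradiction": contradiction}
    return merged

merged = merge_by_participant(quantitative, qualitative)
```

A flagged contradiction is not a verdict; in a merged analysis it marks exactly the kind of discrepancy that, as in Figure 21-4, sends the researcher back to the participant with additional lines of questioning.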
COMMENTARY

You may have heard the world is made up of atoms and molecules, but it's really made up of stories.
—William Turner (1508-1568), British scientist

The purpose of qualitative descriptive research is to characterize phenomena so that we know what exists. Without this fundamental knowledge, it would be impossible to ask questions about behaviors or treatment effects or to propose theories to explain them. This approach is essential for bridging the gap between scientific evidence and clinical practice. It can help us understand the limitations in applying research literature to clinical decisions. Most importantly, clinicians should recognize that different research methods are needed to address different questions.

Qualitative research is often viewed as lacking rigor, and its generalizability is challenged because of the nature of data collection. This is a misguided characterization, however. The ever-increasing complexity of healthcare, including provider and organizational changes, requires new understandings of our environment and how patients and providers interact to develop optimal methods of care.11,75 Qualitative findings provide rigorous accounts of treatment regimens in everyday contexts.9 Without qualitative insights, we could not know what kinds of questions to ask in experimental situations, or what measurements would make sense of them.76 The best answer to a research question will be achieved by using the appropriate design, and that will often call for a qualitative approach.

Just because it does not require statistical analysis, qualitative research is by no means easier than other approaches, and doing it well requires its own set of skills. Novice investigators are encouraged to work with an experienced qualitative researcher who can assist with planning, implementation, and analysis. Many decisions need to be made in the planning of a qualitative study, including the appropriate approach, availability of resources, data collection strategies, and understanding the potential influence of personal biases.52

This brief introduction to qualitative methods is by no means sufficient to demonstrate the scope of data collection and analysis techniques that are used. It should, however, provide a good foundation for understanding qualitative research studies and how their findings can be evaluated and applied. Many topics relevant to qualitative research are also addressed with other topics, including translational research (Chapter 2), developing research questions (Chapter 3), application of theory (Chapter 4), evidence-based practice (Chapter 5), informed consent (Chapter 7), writing proposals (Chapter 35), and critically appraising literature (Chapter 36).
REFERENCES
1. Lincoln YS, Guba EG. Naturalistic Inquiry. Newbury Park, CA: Sage; 1985.
2. Creswell JW, Poth CN. Qualitative Inquiry and Research Design: Choosing Among Five Approaches. 4th ed. Thousand Oaks, CA: Sage; 2018.
3. Creswell JW, Creswell JD. Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. 5th ed. Thousand Oaks, CA: Sage; 2018.
4. Charmaz K. Constructing Grounded Theory. 2nd ed. Los Angeles: Sage; 2014.
5. Yin RK. Case Study Research and Applications: Design and Methods. 6th ed. Los Angeles: Sage; 2018.
6. Smith JA, Flowers P, Larkin M. Interpretative Phenomenological Analysis: Theory, Method and Research. Thousand Oaks, CA: Sage; 2009.
7. Pope C, Mays N. Qualitative Research in Health Care. Oxford: Blackwell; 2006.
8. Patton MQ. Qualitative Research and Evaluation Methods: Integrating Theory and Practice. 4th ed. Newbury Park, CA: Sage; 2015.
9. Green J, Britten N. Qualitative research and evidence based medicine. BMJ 1998;316(7139):1230-1232.
10. Benoliel JQ. Advancing nursing science: qualitative approaches. West J Nurs Res 1984;6(3):1-8.
11. Meadows KA. So you want to do research? 3. An introduction to qualitative methods. Br J Community Nurs 2003;8(10):464-469.
12. Ailinger RL. Contributions of qualitative research to evidence-based practice in nursing. Rev Lat Am Enfermagem 2003;11(3):275-279.
13. Pekkarinen T, Kaukua J, Mustajoki P. Long-term weight maintenance after a 17-week weight loss intervention with or without a one-year maintenance program: a randomized controlled trial. J Obes 2015;2015:651460.
14. Cioffi J. Factors that enable and inhibit transition from a weight management program: a qualitative study. Health Educ Res 2002;17(1):19-26.
15. Green AR, Larkin M, Sullivan V. Oh stuff it! The experience and explanation of diet failure: an exploration using interpretative phenomenological analysis. J Health Psychol 2009;14(7):997-1008.
16. McKee H, Ntoumanis N, Smith B. Weight maintenance: self-regulatory factors underpinning success and failure. Psychol Health 2013;28(10):1207-1223.
17. Rogerson D, Soltani H, Copeland R. The weight-loss experience: a qualitative exploration. BMC Public Health 2016;16:371.
18. Chambers JA, Swanson V. Stories of weight management: factors associated with successful and unsuccessful weight maintenance. Br J Health Psychol 2012;17(2):223-243.
19. Downing AM, Hunter DG. Validating clinical reasoning: a question of perspective, but whose perspective? Man Ther 2003;8(2):117-119.
20. Dougherty DA, Toth-Cohen SE, Tomlin GS. Beyond research literature: Occupational therapists' perspectives on and uses of "evidence" in everyday practice. Can J Occup Ther 2016;83(5):288-296.
21. Jensen GM, Gwyer J, Hack LM, Shepard KF. Expertise in Physical Therapy Practice. 2nd ed. St. Louis: Saunders; 2007.
22. Kane H, Lewis MA, Williams PA, Kahwati LC. Using qualitative comparative analysis to understand and quantify translation and implementation. Transl Behav Med 2014;4(2):201-208.
23. Creswell JC. Qualitative Inquiry and Research Design: Choosing among Five Approaches. 3rd ed. Thousand Oaks, CA: Sage; 2013.
24. National Cancer Institute. Cancer Disparities. Available at https://www.cancer.gov/about-cancer/understanding/disparities. Accessed February 28, 2019.
25. Eggly S, Barton E, Winckles A, Penner LA, Albrecht TL. A disparity of words: racial differences in oncologist-patient communication about clinical trials. Health Expect 2015;18(5):1316-1326.
26. Eggly S, Harper FW, Penner LA, Gleason MJ, Foster T, Albrecht TL. Variation in question asking during cancer clinical interactions: a potential source of disparities in access to information. Patient Educ Couns 2011;82(1):63-68.
27. Rubin HJ, Rubin IS. Qualitative Interviewing: The Art of Hearing Data. Thousand Oaks, CA: Sage; 1995.
28. Kayser-Jones J, Schell ES, Porter C, Barbaccia JC, Shaw H. Factors contributing to dehydration in nursing homes: inadequate staffing and lack of professional supervision. J Am Geriatr Soc 1999;47(10):1187-1194.
29. Tucker AL, Heisler WS, Janisse LD. Designed for workarounds: a qualitative study of the causes of operational failures in hospitals. Perm J 2014;18(3):33-41.
30. Sanchez H, Levkoff S. Lessons learned about minority recruitment and retention from the centers on minority aging and health promotion. Gerontologist 2003;43(1):18-26.
31. Henshall C, Narendran P, Andrews RC, Daley A, Stokes KA, Kennedy A, Greenfield S. Qualitative study of barriers to clinical trial retention in adults with recently diagnosed type 1 diabetes. BMJ Open 2018;8(7):e022353.
32. Jackson J, Carlson M, Rubayi S, Scott MD, Atkins MS, Blanche EI, Saunders-Newton C, Mielke S, Wolfe MK, Clark FA. Qualitative study of principles pertaining to lifestyle and pressure ulcer risk in adults with spinal cord injury. Disabil Rehabil 2010;32(7):567-578.
33. Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, Ader D, Fries JF, Bruce B, Rose M. The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap cooperative group during its first two years. Med Care 2007;45(5 Suppl 1):S3-S11.
34. Quinn H, Thissen D, Liu Y, Magnus B, Lai JS, Amtmann D, Varni JW, Gross HE, DeWalt DA. Using item response theory to enrich and expand the PROMIS(R) pediatric self report banks. Health Qual Life Outcomes 2014;12:160.
35. Walsh TR, Irwin DE, Meier A, Varni JW, DeWalt DA. The use of focus groups in the development of the PROMIS pediatrics item bank. Qual Life Res 2008;17(5):725-735.
36. Marshall C, Rossman GB. Designing Qualitative Research. 6th ed. Thousand Oaks, CA: Sage; 2016.
37. Harper D. Visual Sociology. New York: Routledge; 2012.
38. Wang C, Burris MA. Photovoice: concept, methodology, and use for participatory needs assessment. Health Educ Behav 1997;24(3):369-387.
39. Hasselkus BR. The meaning of activity: day care for persons with Alzheimer disease. Am J Occup Ther 1992;46(3):199-206.
40. Corbin J, Strauss A. Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory. 4th ed. Thousand Oaks, CA: Sage; 2015.
41. Madsen UR, Hommel A, Baath C, Berthelsen CB. Pendulating—a grounded theory explaining patients' behavior shortly after having a leg amputated due to vascular disease. Int J Qual Stud Health Well-being 2016;11:32739.
42. Saunders B, Sim J, Kingstone T, Baker SB, Waterfield J, Bartlam B, Burroughs H, Jinks C. Saturation in qualitative research: exploring its conceptualization and operationalization. Qual Quant 2018;52(4):1893-1907.
43. Pinelli V, Stuckey HL, Gonzalo JD. Exploring challenges in the patient's discharge process from the internal medicine service: a qualitative study of patients' and providers' perceptions. J Interprof Care 2017;31(5):566-574.
44. Schram TH. Conceptualizing and Proposing Qualitative Research. 2nd ed. Upper Saddle River, NJ: Pearson; 2006.
45. Luborsky M. The identification and analysis of themes and patterns. In: Gubrium J, Sankar A, eds. Qualitative Methods in Aging Research. New York: Sage; 1994.
46. Finlay L. Applying phenomenology in research: problems, principles and practice. Br J Occup Ther 1999;62(7):299-306.
47. Donnelly C, Brenchley C, Crawford C, Letts L. The integration of occupational therapy into primary care: a multiple case study design. BMC Fam Pract 2013;14:60.
48. Turner DW. Qualitative interview design: a practical guide for novice investigators. Qual Report 2010;15(3):754-760. Available at http://www.nova.edu/ssss/QR/QR715-753/qid.pdf. Accessed March, 2019.
49. Miller PR, Dasher R, Collins R, Griffiths P, Brown F. Inpatient diagnostic assessments: 1. Accuracy of structured vs. unstructured interviews. Psychiatry Res 2001;105(3):255-264.
50. Gall MD, Gall JP, Borg WR. Educational Research: An Introduction. 8th ed. New York: Pearson; 2006.
51. Krueger RA, Casey MA. Focus Groups: A Practical Guide for Applied Research. 5th ed. Thousand Oaks, CA: Sage; 2015.
52. Davies B, Larson J, Contro N, Reyes-Hailey C, Ablin AR, Chesla CA, Sourkes B, Cohen H. Conducting a qualitative culture study of pediatric palliative care. Qual Health Res 2009;19(1):5-16.
53. Teddlie C, Tashakkori A. Foundations of Mixed Methods Research: Integrating Quantitative and Qualitative Approaches in the Social and Behavioral Sciences. Thousand Oaks, CA: Sage; 2009.
54. Sandelowski M. Sample size in qualitative research. Res Nurs Health 1995;18(2):179-183.
55. Glaser BG, Strauss AL. The Discovery of Grounded Theory: Strategies for Qualitative Research. New York: Routledge; 1999.
56. Bengtsson M. How to plan and perform a qualitative study using content analysis. NursingPlus Open 2016;2:8-14. Available at https://doi.org/10.1016/j.npls.2016.1001.1001. Accessed March 10, 2019.
57. Denzin NK, Lincoln YS. Handbook of Qualitative Research. 4th ed. Thousand Oaks, CA: Sage; 2011.
58. NVIVO. QSR International. Available at https://www.qsrinternational.com/nvivo/home. Accessed March 6, 2019.
59. ATLAS.ti. Available at https://atlasti.com/. Accessed March 6, 2019.
60. QDA Miner. Available at https://provalisresearch.com/products/qualitative-data-analysis-software/. Accessed March 6, 2019.
61. Spencer L, Ritchie J, Lewis J, Dillon L. Quality in Qualitative Evaluation: A Framework for Assessing Research Evidence. National Center for Social Research; 2003. Available at http://dera.ioe.ac.uk/21069/2/a-quality-framework-tcm6-38740.pdf. Accessed November 25, 2017.
62. Tong A, Sainsbury P, Craig J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int J Qual Health Care 2007;19(6):349-357.
63. Sandelowski M, Docherty S, Emden C. Focus on qualitative methods. Qualitative metasynthesis: issues and techniques. Res Nurs Health 1997;20(4):365-371.
64. Sandelowski M. Rigor or rigor mortis: the problem of rigor in qualitative research revisited. Adv Nurs Sci 1993;16(2):1-8.
65. Koch T. Establishing rigour in qualitative research: the decision trail. J Adv Nurs 1994;19(5):976-986.
66. Malterud K. Qualitative research: standards, challenges, and guidelines. Lancet 2001;358(9280):483-488.
67. Palaganas EC, Sanchez MC, Molintas MP, Caricativo RD. Reflexivity in qualitative research: a journey of learning. Qual Report 2017;22(2):426-438. Available at https://nsuworkss.nova.edu/tqr/vol422/iss422/425. Accessed March 10, 2019.
68. Fritz H. Learning to do better: the Transactional Model of Diabetes Self-Management Integration. Qual Health Res 2015;25(7):875-886.
69. Fitzgerald JT, Davis WK, Connell CM, Hess GE, Funnell MM, Hiss RG. Development and validation of the Diabetes Care Profile. Eval Health Prof 1996;19(2):208-230.
70. Creswell JW, Tashakkori A. Developing publishable mixed methods manuscripts. J Mixed Methods Res 2007;1:107-111.
71. Creswell JC, Plano Clark VL. Designing and Conducting Mixed Methods Research. 3rd ed. Thousand Oaks, CA: Sage; 2018.
72. Teddlie C, Yu F. Mixed methods sampling: a typology with examples. J Mixed Methods Res 2007;1:77-100.
73. Farmer T, Robinson K, Elliott SJ, Eyles J. Developing and implementing a triangulation protocol for qualitative health research. Qual Health Res 2006;16(3):377-394.
74. Plano Clark VL, Schumacher K, West C, Edrington J, Dunn LB, Harzstark A, Melisko M, Rabow MW, Swift PS, Miaskowski C. Practices for embedding an interpretive qualitative approach within a randomized clinical trial. J Mixed Methods Res 2013;7(3):219-242.
75. Bachrach CA, Abeles RP. Social science and health research: growth at the National Institutes of Health. Am J Public Health 2004;94(1):22-28.
76. Sofaer S. Qualitative methods: what are they and why use them? Health Serv Res 1999;34(5 Pt 2):1101-1118.
6113_Ch22_317-332 02/12/19 4:40 pm Page 317
PART 4
Analyzing Data

CHAPTER 22 Descriptive Statistics 318
CHAPTER 23 Foundations of Statistical Inference 333
CHAPTER 24 Comparing Two Means: The t-Test 351
CHAPTER 25 Comparing More Than Two Means: Analysis of Variance 365
CHAPTER 26 Multiple Comparison Tests 383
CHAPTER 27 Nonparametric Tests for Group Comparisons 400
CHAPTER 28 Measuring Association for Categorical Variables: Chi-Square 415
CHAPTER 29 Correlation 428
CHAPTER 30 Regression 440
CHAPTER 31 Multivariate Analysis 468
CHAPTER 32 Measurement Revisited: Reliability and Validity Statistics 486
CHAPTER 33 Diagnostic Accuracy 509
CHAPTER 34 Epidemiology: Measuring Risk 529
[Figure: The Research Process within evidence-based practice — 1. Identify the research question; 2. Design the study; 3. Implement the study; 4. Analyze data; 5. Disseminate findings.]

The Data Download icon in tables will indicate that the original data and full output for the …
CHAPTER 22
Descriptive Statistics
Table 22-1 Distribution of Scores for the Coin Rotation Test (in seconds) (N = 32)  [DATA DOWNLOAD]

Raw scores (CRTScore):

Subject  Score    Subject  Score
   1       9        17      19
   2      21        18      28
   3      23        19      24
   4      26        20      15
   5      10        21      17
   6      16        22      23
   7      15        23      20
   8      22        24      25
   9      30        25      21
  10      24        26      26
  11      18        27      18
  12      32        28      20
  13      13        29      19
  14      20        30      27
  15      21        31      21
  16      29        32      22

Score    Frequency   Percent   Cumulative Percent
 9.00        1         3.1            3.1
10.00        1         3.1            6.3
13.00        1         3.1            9.4
15.00        2         6.3           15.6
16.00        1         3.1           18.8
17.00        1         3.1           21.9
18.00        2         6.3           28.1
19.00        2         6.3           34.4
20.00        3         9.4           43.8
21.00        4        12.5           56.3
22.00        2         6.3           62.5
23.00        2         6.3           68.8
24.00        2         6.3           75.0
25.00        1         3.1           78.1
26.00        2         6.3           84.4
27.00        1         3.1           87.5
28.00        1         3.1           90.6
29.00        1         3.1           93.8
30.00        1         3.1           96.9
32.00        1         3.1          100.0
Total       32       100.0

Portions of the output have been omitted for clarity.
presentation can identify outliers (extreme scores) and values that are out of the appropriate range, which may indicate errors in data collection or recording. The number of scores can also be confirmed to match the number of subjects in the study.

Percentages

Sometimes frequencies are more meaningfully expressed as percentages of the total distribution. We can look at the percentage represented by each score in the distribution or at the cumulative percent obtained by adding the percentage value for each score to all percentages that fall below that score. For example, it may be useful to know that 4 patients, or 12.5% of the sample, had a score of 21, or that 56.3% of the sample had scores of 21 and below.

Percentages are useful for describing distributions because they are independent of sample size. For example, suppose we tested another sample of 200 patients and found that 25 individuals obtained a score of 21 seconds. Although there are more people in this second sample with this score than in the first sample, they both represent the same percentage of the total sample. Therefore, the samples may be more similar than frequencies would indicate.

Grouped Frequency Distributions

When continuous data are collected, researchers will often find that very few subjects, if any, obtain the exact same score, as seen with our CRT scores. Obviously, creating a frequency distribution can be an impractical process if almost every score has a low frequency. In this situation, a grouped frequency distribution can be constructed by grouping the scores into intervals, each representing a unique range of scores within the distribution. Frequencies are then assigned to each interval.
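The frequency, percent, and cumulative percent columns of Table 22-1 can be checked with a short Python sketch. This is only an illustration of the arithmetic; the book's tables come from a statistical package, whose rounding conventions for display may differ slightly in the last decimal place.

```python
from collections import Counter

# CRT scores for the 32 subjects in Table 22-1
scores = [9, 21, 23, 26, 10, 16, 15, 22, 30, 24, 18, 32, 13, 20, 21, 29,
          19, 28, 24, 15, 17, 23, 20, 25, 21, 26, 18, 20, 19, 27, 21, 22]

def frequency_table(values):
    """Return (score, frequency, percent, cumulative percent) rows,
    where percent = 100 * frequency / n."""
    counts = Counter(values)
    n = len(values)
    rows, cum = [], 0.0
    for score in sorted(counts):
        pct = 100 * counts[score] / n
        cum += pct
        rows.append((score, counts[score], pct, cum))
    return rows

for score, freq, pct, cum in frequency_table(scores):
    # Round to one decimal when printing to match the tabular output
    print(f"{score:>5} {freq:>3} {pct:6.1f} {cum:6.1f}")
```

For example, the row for a score of 21 shows a frequency of 4 and 12.5% of the sample, and the cumulative column reaches 100% at the highest score.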
Table 22-2A shows a grouped frequency distribution for the CRT data. The intervals represent ranges of 5 seconds. They are mutually exclusive (no overlap) and exhaustive within the range of scores obtained. The choice of the number of intervals to be used and the range within each one is an arbitrary decision. It depends on the overall range of scores, the number of observations, and how much precision is relevant for analysis.

Although information is inherently lost in grouped data, this approach is often the only feasible way to present large amounts of continuous data. The groupings should be clustered to reveal the important features of the data. The researcher must recognize that the choice of the number of intervals and the range within each interval can influence the interpretation of how a variable is distributed.

Graphing Frequency Distributions

Graphic representation of data often communicates information about trends and general characteristics of distributions more clearly than a tabular frequency distribution. The most common methods of graphing frequency distributions are the histogram, line plot, and stem-and-leaf plot, illustrated in Table 22-2.

Histogram

A histogram is a type of bar graph, composed of a series of columns, each representing one score or group interval. A histogram is shown in panel B for the distribution of CRT scores. The frequency for each score is plotted on the Y-axis, and the measured variable, in this case CRT score, is on the X-axis. The bars are centered over the scores. This presentation provides a clear visual depiction of the frequencies.

The curve superimposed on the histogram shows how well the distribution fits a normal curve. This will be addressed later in this chapter.
Table 22-2 Various Methods to Display CRT Scores (from Table 22-1)  [DATA DOWNLOAD]

A. Grouped frequency distribution:

Interval   Frequency   Percent
 5-9           1          3.1
10-14          2          6.3
15-19          8         25.0
20-24         13         40.6
25-29          6         18.7
30-34          2          6.3
Total         32        100.0

B. [Histogram of CRTScore: frequency (Y-axis) against score (X-axis, 5.00 to 35.00), with a superimposed normal curve.]

C. [Line plot of the grouped CRT scores: frequency (Y-axis) against interval (X-axis, 5-9 through 30-34).]

D. Stem-and-leaf plot:

Frequency   Stem . Leaf
  1.00        0 . 9
  2.00        1 . 03
  8.00        1 . 55678899
 13.00        2 . 0001111223344
  6.00        2 . 566789
  2.00        3 . 02
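The grouped frequency distribution in Table 22-2A can be reproduced programmatically. The Python sketch below assigns each score to a mutually exclusive, exhaustive interval, assuming (as in the table) a width of 5 seconds starting at 5:

```python
# CRT scores from Table 22-1
scores = [9, 21, 23, 26, 10, 16, 15, 22, 30, 24, 18, 32, 13, 20, 21, 29,
          19, 28, 24, 15, 17, 23, 20, 25, 21, 26, 18, 20, 19, 27, 21, 22]

def grouped_frequencies(values, low, width):
    """Count values within mutually exclusive, exhaustive fixed-width intervals."""
    groups = {}
    for v in values:
        start = low + width * ((v - low) // width)  # lower bound of v's interval
        key = (start, start + width - 1)
        groups[key] = groups.get(key, 0) + 1
    return dict(sorted(groups.items()))

print(grouped_frequencies(scores, low=5, width=5))
# {(5, 9): 1, (10, 14): 2, (15, 19): 8, (20, 24): 13, (25, 29): 6, (30, 34): 2}
```

Changing `low` or `width` illustrates the point made above: the choice of intervals is arbitrary and can change how the distribution appears.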
Line Plot

A line plot, also called a frequency polygon, shows data points along a contiguous line. Panel C shows a line plot for the grouped CRT data. When grouped data are used, the points on the line represent the midpoint of each interval.

Stem-and-Leaf Plot

The stem-and-leaf plot is a grouped frequency distribution that is like a histogram turned on its side, but with individual values. It is most useful for presenting the pattern of distribution of a continuous variable, derived by separating each score into two parts. The leaf consists of the last or rightmost single digit of each score, and the stem consists of the remaining leftmost digits.

Panel D shows a stem-and-leaf plot for the CRT data. The scores have leftmost digits of 0 through 3. These values become the stem. The last digit in each score becomes the leaf. To read the stem-and-leaf plot, we look across each row, attaching each single leaf digit to the stem. Therefore, the first row represents the score 9; the second row, 10 and 13; the third row starts with 15, 15, 16, 17, and so on.

This display provides a concise summary of the data while maintaining the integrity of the original data. If we compare this plot with the grouped frequency distribution or the histogram, it is clear how much more information is provided by the stem-and-leaf plot in a small space and how it provides elements of both tabular and graphic displays.

Graphic displays can be essential to a full understanding of data, and to generate hypotheses that may not be obvious from tabular data (see Focus on Evidence 22-1).
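The splitting rule just described is easy to automate. The Python sketch below builds a simple stem-and-leaf table with one row per stem; note that statistical packages often split a wide stem across two rows, as in Panel D, which this minimal version does not do.

```python
# CRT scores from Table 22-1
scores = [9, 21, 23, 26, 10, 16, 15, 22, 30, 24, 18, 32, 13, 20, 21, 29,
          19, 28, 24, 15, 17, 23, 20, 25, 21, 26, 18, 20, 19, 27, 21, 22]

def stem_and_leaf(values):
    """Split each score into a stem (leading digits) and a leaf (last digit),
    collecting the sorted leaves under each stem."""
    plot = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot.setdefault(stem, []).append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in plot.items()}

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} . {leaves}")
```

Reading across the first row (stem 0, leaf 9) recovers the score 9, just as described above.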
Focus on Evidence 22-1
A Picture is Worth a Thousand Words

Most researchers think of statistics as a way to test specific hypotheses, looking for statistical tests that will tell us whether a treatment was effective, or if we can predict certain outcomes. This approach is called confirmatory data analysis, which relies on pre-set assumptions about the relationship between independent and dependent variables. One caveat in this effort is the realization that summary values may not always provide meaningful information about individuals, and we must make generalizations and assumptions that allow us to apply such findings to clinical decision making.

In his classic work, Tukey3 has urged researchers to consider a broader approach as a preliminary step in research, before hypotheses are generated. He has suggested a process called exploratory data analysis (EDA) to gain insight into patterns in data. EDA is often referred to as a philosophy as well as an approach to data analysis, requiring a "willingness to look for what can be seen, whether or not anticipated," along with flexibility and some graph paper!4

EDA relies heavily on descriptive statistics and graphic representations of data, allowing researchers to examine variability in different subgroups and to see patterns that might not have otherwise been obvious.5,6 To illustrate this, researchers were interested in studying the frequency of safety incidents reported in general practice in England and Wales from 2005 to 2013, to identify contributory factors.7 They used the EDA approach to examine more than 13,000 reports, and found that among 462 separate clinical locations that provided at least one incident report, over half of the reports originated from only 30 locations (52%), and 67 locations reported only one incident. Using a scatterplot, charting the number of incident reports (X-axis) against the degree of harm reported (Y-axis), they showed that some organizations did not report safety incidents at all, and that the top reporting locations tended to have incidents with lower harm.

[Scatterplot: number of incident reports per location (X-axis) against degree of harm, Harm % from 0 to 100 (Y-axis).]

These findings led them to look at thresholds for reporting. For instance, by looking at the dots high on harm but low on number of reports (upper left), the data suggested that some locations appear to file reports only if there is serious harm or deaths. It also shows that those sites with more frequent reporting tended to have less harmful incidents (bottom right). This approach illustrates how graphs can be more powerful than summary statistics to find gaps in scores within a certain range, to identify a particular score that is "somewhere in left field," or to show where there may be a "pile-up" of scores at one point in a distribution.8 This type of analysis can be used to generate hypotheses, to suggest alternative questions of the data, or to guide further analysis. Other, more complex, …
[Figure 22-2: Three distributions showing the positions of the mode, median, and mean. Caption: Typical relationship of the mean, median, and mode in A) normal, B) negatively skewed, and C) positively skewed distributions. The mean is pulled toward the tail of skewed curves.]

■ Measures of Variability

The shape and central tendency of a distribution are useful but incomplete descriptors of a sample. Consider the following dilemma:

You are responsible for planning the musical entertainment for a party of seven individuals, but you don't know what kind of music to choose, so you decide to use their average age as a guide.

If you based your decision on the mode, you would bring in characters from Sesame Street. Using the median you might hire a heavy metal band. And according to the mean age of 34.3 years, you might decide to play soft rock, although nobody in the group is actually in that age range. And the Frank Sinatra fans are completely overlooked!

What we are ignoring is the spread of ages within the group. Consider now a more concrete example using the hypothetical CRT scores reported in Table 22-3 for two groups of patients. If we were to describe these two distributions using only measures of central tendency, they would appear identical. However, a careful glance reveals that the scores for Group B are more widely scattered than those for Group A. This difference in variability, or dispersion of scores, is an essential element in data analysis.

The description of a sample is not complete unless we can characterize the differences that exist among the scores as well as the central tendency of the data. This section will describe five commonly used statistical measures of variability: range, percentiles, variance, standard deviation, and coefficient of variation.

Range

The simplest measure of variability is the range, which is the difference between the highest and lowest values in a distribution. For the test scores reported in Table 22-3,

Group A range: 20 – 14 = 6
Group B range: 28 – 9 = 19

These values suggest that Group A is more homogeneous.

Although the range is a relatively simple statistical measure, its applicability is limited because it is determined using only the two extreme scores in the distribution. It reflects nothing about the dispersion of scores between the two extremes. One aberrant extreme score can greatly increase the range, even though the variability within the rest of the data set is unchanged.

The range of scores also tends to increase with larger samples, making it an ineffective value for comparing distributions with different numbers of scores. Therefore, although it is easily computed, the range is usually employed only as a rough descriptive measure and is typically reported in conjunction with other indices of variability.
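The calculation above reduces to a single subtraction per group. A minimal Python sketch, using the Table 22-3 scores:

```python
# CRT scores for the two groups in Table 22-3
group_a = [14, 15, 16, 18, 18, 18, 19, 20]
group_b = [9, 10, 14, 18, 18, 19, 22, 28]

# Range = highest value minus lowest value
range_a = max(group_a) - min(group_a)   # 20 - 14 = 6
range_b = max(group_b) - min(group_b)   # 28 - 9 = 19
print(range_a, range_b)
```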
Table 22-3 Descriptive Statistics for CRT Scores (in seconds) for Two Groups  [DATA DOWNLOAD]

Subject   Group A   Group B
   1         14         9
   2         15        10
   3         16        14
   4         18        18
   5         18        18
   6         18        19
   7         19        22
   8         20        28

Descriptives          Group A    Group B
N (Valid)                 8          8
Mean                   17.250     17.250
Median                 18.000     18.000
Mode                   18.00      18.00
Std. Deviation          2.053      6.251
Variance                4.214     39.071
Range                   6.00      19.00
Minimum                14.00       9.00
Maximum                20.00      28.00
Percentiles   25       15.250     11.000
              50       18.000     18.000
              75       18.750     21.250

[Box plots of the Group A and Group B distributions on a common scale, 5 to 30 seconds.]
… box-and-whisker plots, these graphs are usually drawn with "whiskers" representing the range of scores. Sometimes these will indicate the highest and lowest scores, or they can be drawn at the 90th and 10th percentiles (see Focus on Evidence 22-2). These plots clearly show that Group B is more variable than Group A, even though their medians are the same.

Variance

Measures of range have limited application as indices of variability because they are not influenced by every score in a distribution and are sensitive to extreme scores. To more completely describe a distribution, we need an index that reflects the variation within a full set of scores. This value should be small if scores are close together and large if they are spread out. It should also be objective so that we can compare samples of different sizes and determine if one is more variable than another. It is useful to consider how these values are derived to better understand their application.

Deviation Scores

We can begin to examine variability by looking at the deviation of each score from the mean. By subtracting the mean from each score in the distribution we obtain a deviation score, X – X̄. Obviously, samples with larger deviation scores will be more variable around the mean. To illustrate this, let's just focus on the distribution of scores from Group B in Table 22-3. The deviation scores for this sample are shown in Table 22-4A. The mean of the distribution is 17.25. For the score X = 9, the deviation score will be 9 – 17.25 = –8.25. Note that the first three deviation scores are negative values because these scores are smaller than the mean.

As a measure of variability, the deviation score has intuitive appeal, as these scores will obviously be larger as scores become more heterogeneous and farther from the mean. It might seem reasonable, then, to take the average of these values, or the mean deviation, as an index of dispersion within the sample. This is a useless exercise, however, because the sum of the deviation scores will always equal zero, as illustrated in the second column in Table 22-4A [∑(X – X̄) = 0]. If we think of the mean as a central balance point for a distribution, then it makes sense that the scores will be equally dispersed above and below that central point.
Table 22-4 Calculation of Variance and Standard Deviation Using Group B CRT Scores (from Table 22-3)  [DATA DOWNLOAD]

A. Data

  X     (X – X̄)    (X – X̄)²
  9      –8.25       68.06
 10      –7.25       52.56
 14      –3.25       10.56
 18       0.75        0.56
 18       0.75        0.56
 19       1.75        3.06
 22       4.75       22.56
 28      10.75      115.56

∑X = 138    X̄ = 17.25    ∑(X – X̄) = 0.00    ∑(X – X̄)² = 273.48

B. Calculations

Variance:  s² = ∑(X – X̄)² / (n – 1) = 273.48 / 7 = 39.07

Standard deviation:  s = √[∑(X – X̄)² / (n – 1)] = √39.07 = 6.25
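The calculations in Table 22-4 can be verified with a few lines of Python. (Carrying full precision, the sum of squares is 273.5; the 273.48 shown in the table is the sum of the entries after rounding each to two decimals.)

```python
import math

# Group B CRT scores from Table 22-3
group_b = [9, 10, 14, 18, 18, 19, 22, 28]

n = len(group_b)
mean = sum(group_b) / n                      # 17.25
deviations = [x - mean for x in group_b]     # always sum to zero
ss = sum(d ** 2 for d in deviations)         # sum of squares (SS)
variance = ss / (n - 1)                      # sample variance, s^2
sd = math.sqrt(variance)                     # standard deviation, s

print(mean, round(ss, 2), round(variance, 2), round(sd, 2))
```

Note the `n - 1` divisor for the sample variance, exactly as in the formula above; dividing by `n` instead would give the population formula.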
Sum of Squares and Variance

Statisticians solve this dilemma by squaring each deviation score to get rid of the minus signs, as shown in the third column of Table 22-4A. The sum of the squared deviation scores, ∑(X – X̄)², shortened to sum of squares (SS), will be larger as variability increases. We now have a number we can use to describe the sample's variability. In this case,

∑(X – X̄)² = 273.48

As an index of relative variability, however, the sum of squares is limited because it can be influenced by the sample size; that is, as n increases, the sum will also tend to increase simply because there are more scores. To eliminate this problem, the sum of squares is divided by n, to obtain the mean of the squared deviation scores, shortened to mean square (MS). This value is a true measure of variability and is also called the variance.

Hold onto these concepts. The sum of squares (SS) and mean square (MS) will be used many times as we describe statistical tests in future chapters.

Population Parameters. For population data, the variance is symbolized by σ² (lowercase Greek sigma squared). When the population mean is known, deviation scores are obtained by X – μ. Therefore, the population variance is defined by

σ² = SS/N = ∑(X – μ)²/N

Sample Statistics. With sample data, deviation scores are obtained using X̄, not μ. Because sample data do not include all the observations in a population, the sample mean is only an estimate of the population mean. This substitution results in a sample variance slightly smaller than the true population variance. To compensate for this bias, the sum of squares is divided by n – 1 to calculate the sample variance, given the symbol s²:

s² = SS/(n – 1) = ∑(X – X̄)²/(n – 1)

This corrected statistic is considered an unbiased estimate of the parameter σ². The descriptive statistics in Table 22-3 show that s² = 4.21 for Group A and s² = 39.07 for Group B. The variance for Group B is substantially larger than for Group A.

Standard Deviation

The limitation of variance as a descriptive measure of a sample's variability is that it was calculated using the squares of the deviation scores. It is generally not useful to describe sample variability in terms of squared units, such as seconds squared. Therefore, to bring the index back into the original units of measurement, we take the square root of the variance. This value is called the standard deviation, symbolized by s. As shown in Table 22-3, s = 2.05 for Group A, and s = 6.25 for Group B.

Research reports may use sd or SD to indicate standard deviation.

The standard deviation of sample data is usually reported along with the mean so that the data are
characterized according to both central tendency and variability. Therefore, a mean may be expressed as

Group A: X̄ = 17.25 ± 2.05
Group B: X̄ = 17.25 ± 6.25

The smaller standard deviation for Group A indicates that the scores are less spread out around the mean than the scores for Group B.

Variance and standard deviation are fundamental components in any analysis of data. We will explore the application of these concepts to many statistical procedures throughout the coming chapters.

When means are not whole numbers, calculation of deviation scores can be biased by rounding. Computational formulae provide more accurate results. See the Chapter 22 Supplement for calculations using the computational formula for variance and standard deviations.

Coefficient of Variation

The coefficient of variation (CV) is another measure of variability that can be used to describe data measured on the interval or ratio scale. It is the ratio of the standard deviation to the mean, expressed as a percentage:

CV = (s / X̄) × 100

There are two major advantages to this index. First, it is independent of units of measurement because units will mathematically cancel out. Therefore, it is a practical statistic for comparing variability in distributions recorded in different units. Second, the coefficient of variation expresses the standard deviation as a proportion of the mean, thereby accounting for differences in the magnitude of the mean. The coefficient of variation is, therefore, a measure of relative variation, most meaningful when comparing two distributions.

These advantages can be illustrated using data from a study of normal values of lumbar spine range of motion, in which data were recorded in both degrees and inches of excursion.12 The mean ranges for 20- to 29-year-olds were

Degrees: X̄ = 41.2 ± 9.6 degrees
Inches: X̄ = 3.7 ± 0.72 inches

The absolute values of the standard deviations for these two measurements suggest that the measure of inches, using a tape measure, was much less variable, but because the means and units are substantially different, the standard deviations are not directly comparable. By calculating the coefficient of variation, we get a better idea of the relative variation of these two measurements:

Degrees: CV = (9.6 / 41.2) × 100 = 23.3%
Inches: CV = (0.72 / 3.7) × 100 = 19.5%

Now we can see that the variability within these two distributions is actually fairly similar. As this example illustrates, the coefficient of variation is a useful measure for making comparisons among patient groups or different clinical assessments to determine if some are more stable than others.

■ The Normal Distribution

Earlier in this chapter, the symmetrical normal curve was described. This distribution represents an important statistical concept because many biological, psychological, and social phenomena manifest themselves in populations according to this shape.

Unfortunately, in the real world we can only estimate such data from samples and therefore cannot expect data to fit the normal curve exactly. For practical purposes, then, the normal curve represents a theoretical concept only, with well-defined properties that allow us to make statistical estimates about populations using sample data. For example, a normal curve was superimposed on the CRT data in Table 22-2.

The normal distribution is called a Gaussian distribution after Carl Friedrich Gauss, who identified its properties.

Proportions of the Normal Curve

The statistical appeal of the normal distribution is that its characteristics are constant and therefore predictable. As shown in Figure 22-4, the curve is smooth, symmetrical, and bell-shaped, with most of the scores clustered around the mean. The mean, median, and mode have the same value.

The fact that the normal curve is important to statistical theory should not imply that data are not useful or valid if they are not normally distributed. Such data can be handled using statistics appropriate to non-normal distributions (see Chapter 27).
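Returning to the coefficient of variation for a moment: the lumbar range-of-motion comparison above reduces to a one-line function. A Python sketch (the function name is ours, not from the text):

```python
def coefficient_of_variation(mean, sd):
    """CV = (s / mean) * 100, a unit-free percentage of relative variability."""
    return 100 * sd / mean

# Values from the lumbar spine range-of-motion study (reference 12)
cv_degrees = coefficient_of_variation(41.2, 9.6)   # about 23.3%
cv_inches = coefficient_of_variation(3.7, 0.72)    # about 19.5%
print(round(cv_degrees, 1), round(cv_inches, 1))
```

Because the units cancel in s/mean, the two percentages can be compared directly even though one measurement was in degrees and the other in inches.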
[Figure 22-4: The standard normal curve, showing 34.13% of the area between the mean and one standard deviation on either side, and 13.59% between one and two standard deviations on either side.]
The vertical axis of the curve represents the frequency of data. The frequency of scores decreases steadily as scores move in a negative or positive direction away from the mean, with relatively rare observations at the extremes. Theoretically, there are no boundaries to the curve; that is, scores potentially exist with infinite magnitude above and below the mean. Therefore, the tails of the curve approach but never quite touch the baseline (called an asymptotic curve).

Because of these standard properties, we can also determine the proportional areas under the curve represented by the standard deviations in a normal distribution. Statisticians have shown that 34.13% of the area under the normal curve is bounded by the mean and the score one standard deviation above or below the mean. Therefore, 68.26% of the total distribution (the majority) will have scores within ±1 standard deviation (±1s) from the mean.

Similarly, ±2s from the mean will encompass 95.45%, and ±3s will cover 99.73% of the total area under the curve. At ±3s we have accounted for virtually the entire distribution. Because we can never discount extreme values at either end, we never account for the full 100%. This information can be used as a basis for interpreting standard deviations.

➤ CASE IN POINT #2
For this next section, properties of the normal curve will be illustrated using hypothetical scores for pulse rates from a sample of 100 patients, with X̄ = 68 and s = 10 beats per minute (bpm). Therefore, we can estimate that approximately 68% of the individuals in the sample have scores between 58 and 78 bpm.

Standardized Scores

Statistical data are meaningful only when they are applied in some quantitative context. For example, if a patient has a pulse rate of 58 bpm, the implication of that value is evident only if we know where that score falls in relation to a distribution of normal pulse rates. If we know that X̄ = 68 and s = 10 for a given sample, then we also know that an individual score of 58 is one standard deviation below the mean. This gives us a clearer interpretation of the score.

When we express scores in terms of standard deviation units, we are using standardized scores, also called z-scores. For this example, a score of 58 can be expressed as a z-score of –1.0, the minus sign indicating that it is one standard deviation unit below the mean. A score of 88 is similarly transformed to a z-score of +2.0, or two standard deviations above the mean.

A z-score is computed by dividing the deviation of an individual score from the mean by the standard deviation:

z = (X – X̄) / s

Using the example of pulse rates, for an individual score of 85 bpm, with X̄ = 68 and s = 10,

z = (85 – 68) / 10 = 17 / 10 = 1.7

Thus, 85 bpm is 1.7 standard deviations above the mean.

The Standardized Normal Curve

The normal distribution can also be described in terms of standardized scores. The mean of a normal distribution
of z-scores will always equal zero (no deviation from the mean), and the standard deviation will always be 1.0.

As shown in Figure 22-4, the area under the standardized normal curve between z = 0 and z = +1.0 is approximately 34%, the same as that defined by the area between the mean (z = 0) and one standard deviation. The total area within z = ±1.00 is 68.26%. Similarly, the total area within z = ±2.00 is 95.45%. Using this model, we can determine the proportional area under the curve bounded by any two points in a normal distribution.

Determining Areas under the Normal Curve

We can illustrate this process using our pulse rate example. Suppose we want to determine what percentage of our sample has a pulse rate above 80 bpm. First, we determine the z-score for 80 bpm:

z = (80 – 68) / 10 = 12 / 10 = 1.2

Therefore, 80 bpm is slightly more than one standard deviation above the mean. Now we want to determine the proportion of our total sample that is represented by all scores above 80, or above z = 1.2. This is the shaded area above 80 in Figure 22-5.

These values are given in Appendix Table A-1. This table is arranged in three columns, one containing z-scores and the other two representing areas either from 0 to z or above z (in one tail of the curve). For this example, we are interested in the area above z, or above 1.2. If we look to the right of z = 1.20 in Table A-1, we find that the area above z equals .1151. Therefore, scores above 80 bpm represent 11.51% of the total distribution.

For another example, let's determine the area above 50 bpm. First we determine the z-score for 50 bpm:

z = (50 – 68) / 10 = –18 / 10 = –1.8

Therefore, 50 bpm is slightly less than two standard deviations below the mean. Now we want to determine the proportion of our total sample that is represented by all scores above 50, or above z = –1.8. We already know that the scores above the mean (above z = 0) represent 50% of the curve, as shown by the dark green area in Figure 22-6. Therefore, we are now concerned with the light green area between 68 and 50, which is equal to the area from z = 0 to z = –1.8. Together these two shaded areas represent the total area above 50.

Table A-1 uses only absolute values of z. Because it includes standardized units for a symmetrical curve, the proportional area from 0 to z = +1.8 is the same as the area from 0 to z = –1.8. The area between z = 0 and z = –1.8 is .4641. Therefore, the total area under the curve for all scores above 50 bpm will be .50 + .4641 = .9641, or 96.41%.

Standardized scores are useful for interpreting an individual's standing relative to a normalized group. For example, many standardized tests, such as psychological, developmental, and intelligence tests, use z-scores to demonstrate that an individual's score is above or below the "norm" (the standardized mean) or to show what proportion of the subjects in a distribution fall within a certain range of scores.

The validity of estimates using the standard normal curve depends on the closeness with which a sample approximates the normal distribution. Many clinical samples are too small to provide an adequate approximation and are more accurately described as "non-normal." With skewed distributions, values like means and standard deviations can be misleading and may contribute to statistical misinterpretations.9 Areas under the curve are only relevant to the normal curve.

Goodness of fit tests can be applied to distributions to determine how well the data fit a normal curve (see Chapter 23).
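The z-score transformation and the table lookup in Appendix Table A-1 can both be reproduced in Python using the error function, which yields the cumulative area under the standard normal curve. This is a sketch of the arithmetic, not a substitute for the table:

```python
import math

def phi(z):
    """Cumulative area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mean, sd = 68, 10   # pulse-rate example

z_80 = (80 - mean) / sd          # 1.2
z_50 = (50 - mean) / sd          # -1.8

above_80 = 1 - phi(z_80)         # proportion with pulse > 80 bpm, about .1151
above_50 = 1 - phi(z_50)         # proportion with pulse > 50 bpm, about .9641
print(round(above_80, 4), round(above_50, 4))
```

Because the curve is symmetrical, `1 - phi(-1.8)` is the same as `phi(1.8)`, which is why Table A-1 needs only absolute values of z.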
COMMENTARY
A statistician is a person who lays with his head in an oven and his feet in a deep
freeze, and says “on the average, I feel comfortable.”
—C. Bruce Grossman (1921-2001)
Professor
Descriptive statistics are the building blocks for data analysis. They serve an obvious function in that they summarize important features of quantitative data. Every study will include a description of participants and responses using one or more measures of central tendency and variance as a way of profiling the study sample, understanding the variables being studied, and establishing the framework for further analysis.

Descriptive measures do not, however, provide a basis for inferring anything that goes beyond the data themselves. Such interpretations come from inferential statistics, which will be covered in following chapters. It is essential to understand, however, that these inferential tests are founded on certain assumptions about data such as central tendency, variance, and the degree to which the distribution approaches the normal curve. Examination of descriptive data through graphic representation can usually provide important insights into how the data are behaving and how they can best be analyzed.

The take-home message is the importance of descriptive statistics as the basis for sound statistical reasoning and the generation of further hypotheses.9 Descriptive analyses are necessary to demonstrate that statistical tests are used appropriately and that their interpretations are valid.13
9. Gonzales VA, Ottenbacher KJ. Measures of central tendency in rehabilitation research: what do they mean? Am J Phys Med Rehabil 2001;80(2):141-146.
10. von Hippel PT. Mean, median and skew: correcting a textbook rule. J Stat Educ 2005;13(2). Available at http://ww2.amstat.org/publications/jse/v13n12/vonhippel.html. Accessed October 26, 2017.
11. Morone G, Tramontano M, Iosa M, Shofany J, Iemma A, Musicco M, Paolucci S, Caltagirone C. The efficacy of balance training with video game-based therapy in subacute stroke patients: a randomized controlled trial. Biomed Res Int 2014;2014:580861.
12. Fitzgerald GK, Wynveen KJ, Rheault W, Rothschild B. Objective assessment with establishment of normal values for lumbar spinal range of motion. Phys Ther 1983;63(11):1776-1781.
13. Findley TW. Research in physical medicine and rehabilitation. IX. Primary data analysis. Am J Phys Med Rehabil 1990;69(4):209-218.
6113_Ch23_333-350 02/12/19 4:40 pm Page 333
CHAPTER 23
Foundations of Statistical Inference
In the previous chapter, statistics were presented that can be used to summarize and describe data. Descriptive procedures are not sufficient, however, for testing theories about the effects of experimental treatments, for exploring relationships among variables, or for generalizing the behavior of samples to a population. For these purposes, researchers use a process of statistical inference.
Inferential statistics involve a decision-making process that allows us to estimate unknown population characteristics from sample data. The success of this process requires that we make certain assumptions about how well a sample represents the larger population. These assumptions are based on two important concepts of statistical reasoning: probability and sampling error. The purpose of this chapter is to demonstrate the application of these concepts for drawing valid conclusions from research data. These principles will be applied across several statistical procedures in future chapters.

■ Probability

Probability is a complex but essential concept for understanding inferential statistics. We all have some notion of what probability means, as evidenced by the use of terms such as “likely,” “probably,” or “a good chance.” We use probability as a means of prediction: “There is a 50% chance of rain tomorrow,” or “This operation has a 75% chance of success.”
Statistically, we can view probability as a system of rules for analyzing a complete set of possible outcomes. For instance, when flipping a coin, there are two possible outcomes. When tossing a die, there are six possible outcomes. An event is a single observable outcome, such as the appearance of tails or a 3 on the toss of a die.

Probability is the likelihood that any one event will occur, given all the possible outcomes.

We use a lowercase p to signify probability, expressed as a ratio or decimal. For example, the likelihood of getting tails on any single coin flip will be 1 out of 2, or 1/2, or .5. Therefore, we say that the probability of getting tails is 50%, or p = .5. The probability that we will roll a 3 on one roll of a die is 1/6, or p = .167. Conversely, the probability that we will not roll a 3 is 5/6, or p = .833. These probabilities are based on the assumption that the coin or die is unbiased. Therefore, the outcomes are a matter of chance, representing random events.
For an event that is certain to occur, p = 1.00. For instance, if we toss a die, the probability of rolling a 3 or not rolling a 3 is 1.00 (p = .167 + .833). These two events are mutually exclusive and complementary events because they cannot occur together and they represent all possible outcomes. Therefore, the sum of their probabilities will always equal 1.00. We can also show that the probability of an impossible event is zero. For instance, the probability of rolling a 7 with one die is 0 out of 6, or p = 0.00. In the real world, the probability for most events falls somewhere between 0 and 1. Scientists will generally admit that nothing is a “sure bet” and nothing is impossible!
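The die example can be verified in a few lines of code; this sketch simply encodes the complementary probabilities described above:

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}         # sample space for one roll of a fair die

p_three = Fraction(1, len(outcomes))  # P(roll a 3) = 1/6, about .167
p_not_three = 1 - p_three             # complementary event: 5/6, about .833

# Mutually exclusive, complementary events always sum to 1.00
print(round(float(p_three), 3), round(float(p_not_three), 3), float(p_three + p_not_three))
# → 0.167 0.833 1.0
```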
Figure 23–1 Hypothetical distribution of birth weights, with μ = 7.5 and σ = 1.25. The green area represents ±1 σ, or 68.26% of the population. The orange area in the right tail represents ≥3 σ, or 0.13% of the population. (Horizontal axis: Population Birth Weight in pounds, 3.75 to 11.25.)
Figure 23–3 95% confidence interval for a sampling distribution of birth weights (n = 50, sx̄ = .21). The interval around X̄ = 6.8, bounded by z = ±1.96, spans 6.39 to 7.21, capturing the central 95% of the distribution (47.5% on each side of the mean).
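The interval shown in Figure 23-3 can be reproduced arithmetically; a minimal sketch using z = 1.96 for the 95% interval:

```python
# Sample mean and standard error of the sampling distribution from Figure 23-3
mean, se = 6.8, 0.21
z_95 = 1.96                      # z value bounding the central 95% of the normal curve

lower = mean - z_95 * se
upper = mean + z_95 * se
print(round(lower, 2), round(upper, 2))  # → 6.39 7.21
```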
interval depends on the nature of the variables being studied and the researcher’s desired level of accuracy. By convention, the 90%, 95%, and 99% confidence intervals are used most often. The corresponding z values are:

90% z = 1.645
95% z = 1.96
99% z = 2.576

Confidence intervals can be obtained for most statistics, not just for means. The process will be defined in future chapters as different statistical procedures are introduced.

■ Statistical Hypothesis Testing

The estimation of population parameters is only one part of statistical inference. More often inference is used to answer questions concerning comparisons or relationships, such as, “Is one treatment more effective than another?” or “Is there a relationship between patient characteristics and the degree of improvement observed?” These types of questions usually involve the comparison of means, proportions, or correlations.

➤ CASE IN POINT #2
Researchers proposed a hypothesis that a single intervention session of soft-tissue mobilization would be effective for increasing external rotation in patients with limited shoulder mobility.1 Experimental and control groups were formed through random assignment. Range of motion (ROM) measurements were taken and the improvement from pretest to posttest was calculated for each subject. The researchers found that the mean improvement in external rotation was 16.4 degrees for the treatment group and 1 degree for the control group.

On the basis of the difference between these means, should the researcher conclude that the research hypothesis has been supported? According to the concept of sampling error, we would expect to see some differences between groups even when a treatment is not at all effective. Therefore, we need some mechanism for deciding if an observed effect is or is not likely to be chance variation. We do this through a process of hypothesis testing.

The Null Hypothesis

Before we can interpret observed differences, we must consider two possible explanations for the outcome. The first explanation is that the observed difference between the groups occurred by chance. This is the null hypothesis (H0), which states that the group means are not different—the groups come from the same population. No matter how the research hypothesis is stated, the researcher’s goal will always be to statistically test the null hypothesis.
Statistical hypotheses are formally stated in terms of the population parameter, μ, even though the actual statistical tests will be based on sample data. Therefore, the null hypothesis can be stated formally as

H0: μA = μB (H0: μA – μB = 0)

which predicts that the mean of Population A is not different from the mean of Population B, or that there will be no treatment effect. For the example of joint mobilization and shoulder ROM, under H0 we would state:

➤ For patients with limited shoulder mobility, there will be no difference in the mean change in external rotation between those who do (μA) and do not (μB) receive mobilization.

The null hypothesis will either be true or false in any given comparison. However, because of sampling error, groups will probably not have exactly equal means, even when there is no true treatment effect. Therefore, H0 really indicates that observed differences are sufficiently small to be considered pragmatically equivalent to zero.2 So rather than saying the means are “equal” (which they are not), it is more accurate to say they are not significantly different.

The Alternative Hypothesis

The second explanation for the observed findings is that the treatment is effective and that the effect is too large to be considered a result of chance alone. This is the alternative hypothesis (H1), stated as

H1: μA ≠ μB (H1: μA – μB ≠ 0)

These statements predict that the observed difference between the two population means is not due to chance, or that the likelihood that the difference is due to chance is very small. The alternative hypothesis usually represents the research hypothesis (see Chapter 3). For the joint mobilization study, under H1 we would state:

➤ For patients with limited shoulder mobility, the mean change in external rotation will be different for those who do and do not receive mobilization.

This is a nondirectional alternative hypothesis because it does not specify which group mean is expected to be larger. Alternative hypotheses can also be expressed in directional form, indicating the expected direction of difference between sample means. We could state

H1: μA > μB (H1: μA – μB > 0)

or

H1: μA < μB (H1: μA – μB < 0)
The mobilization study used a directional alternative hypothesis:

➤ For patients with limited shoulder mobility, the mean change in external rotation will be greater for those who receive mobilization than for those who do not receive mobilization.

Note that a stated hypothesis should always include the four elements of PICO (see Chapter 3), including reference to the target population, the intervention and comparison conditions (levels of the independent variable), and the outcome (dependent variable). For the mobilization study: P = patients with limited shoulder disability, I = mobilization, C = no mobilization, O = change in external rotation range of motion.

“Disproving” the Null Hypothesis

We can never actually “prove” the null hypothesis based on sample data, so the purpose of a study is to give the data a chance of disproving it. In essence, we are using a decision-making process based on the concept of negative inference. No one experiment can establish that a null hypothesis is true—it would require testing the entire population with unsuccessful results to prove that a treatment has no effect. We can, however, discredit the null hypothesis by any one trial that shows that the treatment is effective. Therefore, the purpose of testing the statistical hypothesis is to decide whether or not H0 is false.
There is often confusion about the appropriate way to express the outcome of a statistical decision. We can only legitimately say that we reject or do not reject the null hypothesis. We start, therefore, with the assumption that there is no treatment effect in a study (the null hypothesis). Then we ask if the data are consistent with this assumption (see Box 23-1). If the answer is yes, then we must acknowledge the possibility that no effect exists, and we do not reject the null hypothesis. If the answer is no, then we have discredited that assumption. Therefore, there probably is an effect, and we reject the null hypothesis. When we reject the null we can accept the alternative hypothesis—we say that the alternative hypothesis is consistent with our findings.

Box 23-1 What’s the Verdict?
The decision to “reject” or “not reject” the null hypothesis is analogous to the use of “guilty” or “not guilty” in a court of law. We assume a defendant is innocent until proven guilty (the null hypothesis is true until evidence disproves it). A jury can reach a guilty verdict if evidence is sufficient (beyond a reasonable doubt) that the defendant committed the crime. Otherwise, the verdict is not guilty (H0). The jury cannot find the defendant “innocent.” Similarly, we cannot actually accept the null hypothesis. A “not guilty” verdict is expected if the evidence is not sufficient to establish guilt. But that doesn’t necessarily mean the defendant is innocent!
Although not technically correct, researchers sometimes do say they “accept” the null hypothesis to indicate that it is probably true. The word “retained” has also been used to denote this outcome.2

Superiority and Non-Inferiority

Chapter 14 introduced the concepts of superiority and non-inferiority trials. For most studies, the intent is to show that the experimental treatment is better than the control, hoping that H0 will be rejected. There may be situations, however, where the researcher does not want to find a difference, such as trying to show that a new treatment is as effective as a standard treatment. In such a case, demonstrating “no difference” is a bit more complicated, and the roles of the null and alternative hypotheses are reversed. Box 23-2 addresses this seemingly “backwards” logic.

Box 23-2 No Better, No Worse
Most of the time when researchers design a randomized trial, they are looking to show that a new intervention is more effective than a placebo or standard treatment. Such studies are considered superiority trials because their aim is to demonstrate that one treatment is “superior” to another. The process typically involves the statement of a null hypothesis and an alternative hypothesis asserting that there will be a difference. This may be directional or nondirectional, but in either case the intent is to reject the null and accept the alternative in favor of one treatment.
A non-inferiority trial focuses on demonstrating that the effect of a new intervention is the same as standard care, in an effort to show that it is an acceptable substitute. This approach is based on the intent to show no difference, or more precisely that the new treatment is “no worse” than standard care. It requires inverted thinking about how we look at significance. It is not as simple as just finding no difference based on a standard null hypothesis. Remember, it is not possible to “prove” no difference based on one study (see Chapter 5, Box 5-2).
The concept of “no worse” requires the specification of a non-inferiority margin, the largest difference between the two treatments that would still be considered functionally equivalent (see Chapter 14). If the difference between treatments is within that margin, then non-inferiority is established. The new treatment would be considered no worse than standard care, and therefore its preference for use would be based on its safety, convenience, cost, or comfort.3
In the scheme of a non-inferiority trial, the roles of the alternative and null hypotheses are actually reversed. The null (the hypothesis we want to reject) says the standard treatment is better than the new approach, while the alternative (the hypothesis we want to accept) states that the new treatment is not inferior to the standard treatment. In this case, it is the non-inferiority margin that must be considered and the confidence interval around that margin. Further discussion is beyond the scope of this book, but several useful references provide more detailed explanation.3-7
occurred by chance alone. Therefore, if we decide to reject H0 and conclude that the tested groups are different from each other, we have an 18% chance of being wrong, that is, an 18% chance of committing a Type I error.

Taking a Risk

The question facing the researcher is how to decide if this probability is acceptable. How much of a chance is small enough that we would be willing to accept the risk of being wrong? Is an 18% chance of being wrong acceptable?
For research purposes, the selected alpha level defines the maximal acceptable risk of making a Type I error if we reject H0. Traditionally, researchers have set this standard at 5%, which is considered a small risk. This means that we would be willing to accept a 5% chance of incorrectly rejecting H0, or a 5% risk that we will reject H0 when it is actually true. Therefore, for a given analysis, if p is equal to or less than .05, we would be willing to reject the null hypothesis. We then say that the difference is significant. If p is greater than .05, we would not reject the null hypothesis and we say that the difference is not significant.
For the earlier example, if we set α = .05, we would not reject the null hypothesis at p = .18. The probability that the observed difference is due to chance is too great. If a statistical test demonstrates that two means are different at p = .04, we could reject H0, with only a small acceptable risk (4%) of committing a Type I error.

What does it mean to take a risk? Let’s personalize it: The weather report says that there is a 75% chance of rain today. Will you take your umbrella? If you decide NOT to take it, and it rains, you will be wrong and you will get wet! Now let’s say that the report is for a 5% chance of rain. Are you more likely to leave your umbrella at home? What is the maximal risk you are willing to take that you will be wrong—that you will get wet—10% chance of rain, 25%, 50%? What is your limit? Of course, my bringing my umbrella is likely to make the probability 100% that it won’t rain!

Choosing a Level of Significance

How does a researcher decide on a level of significance as the criterion for statistical testing? The conventional designation of .05 is really an arbitrary, albeit accepted, standard. A researcher may choose other criterion levels depending on how critical a Type I error would be.
For example, suppose we were involved in the study of a drug to reduce spasticity, comparing control and experimental groups. This drug could be very beneficial to patients with upper motor neuron involvement. However, the drug has potentially serious side effects and is very expensive to produce. In such a situation we would want to be very confident that observed results were not due to chance. If we reject the null hypothesis and recommend the drug, we would want the probability of our committing a Type I error to be very small. We do not want to encourage the use of the drug unless it is clearly and markedly beneficial.
We can minimize the risk of statistical error by lowering the level of significance to .025 or .01. If we use α = .01 as our criterion for rejecting H0, we would have only 1 out of 100 chances of making a Type I error. This would mean that we could have greater confidence in our decision to reject the null hypothesis.
Researchers should specify the minimal level of significance required for rejecting the null hypothesis prior to data collection. The decision to use .05, .01, or any other value should be based on the concern for Type I error, not on what the data look like. If a researcher chooses α = .01 as the criterion, and statistical testing shows significance at p = .04, the researcher would not reject H0. If α = .05 had been chosen as the criterion, the same data would have resulted in the opposite conclusion.

If we want to be really confident in our findings, why not just set α really low, like .0001? That would certainly increase rigor in avoiding a Type I error—but practically it would also make finding significant differences very difficult, increasing risk of Type II error. For instance, a result at p = .01 would not be significant, although this represents a very low risk. Absent an otherwise compelling reason, using α = .05 is usually reasonable to balance both Type I and Type II error.

Interpreting Probability Values

Researchers must be aware of the appropriate interpretation of p values:

The p value is the probability of finding an effect as big as the one observed when the null hypothesis is true.

Therefore, with p = .02, even if there is no true difference, you would expect to observe this size effect 2% of the time. Said another way, if we performed 100 similar experiments, 2 of them would result in a difference this large, even if no true difference exists.
It is tempting to reverse this definition, to assume that there is a 98% probability that a real difference exists. This is not the case, however. The p value is based on the assumption that the null hypothesis is true, although it cannot be used to prove it. The p value will only tell us how rarely we would expect a difference this large in the population just by chance. It is not accurate to imply that a small p value means the observed result is “true.”8
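The dichotomous decision rule can be expressed as a tiny function (a sketch; `p_value` stands for the result of whatever statistical test was run):

```python
def decide(p_value, alpha=0.05):
    """Reject H0 when p is at or below the chosen level of significance."""
    if p_value <= alpha:
        return "reject H0: significant"
    return "do not reject H0: not significant"

# Examples from the text
print(decide(0.18))               # an 18% risk of Type I error is too high at alpha = .05
print(decide(0.04))               # significant at alpha = .05
print(decide(0.04, alpha=0.01))   # the same result is not significant at alpha = .01
```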
Confidence intervals are actually a more appropriate way to think about what the “true” value would be.

We must also be careful to avoid using the magnitude of p as an indication of the degree of validity of the research hypothesis. It is not advisable to use terms such as “highly significant” or “more significant” because they imply that the value of p is a measure of treatment effect, which it is not. We will revisit this important point again shortly when we discuss statistical power.

Significance as a Dichotomy

In conventional statistical practice, the level of significance is considered a point along a continuum that demarcates the line between “likely” and “unlikely.” Once the level of significance is chosen, it represents a dichotomous decision rule: yes or no, significant or not significant. Once the decision is made, the magnitude of p reflects only the relative degree of confidence that can be placed in that decision.
That said, researchers will still caution that a nonsignificant p value is not necessarily the end of the story, especially if it is close to α.9 We must, therefore, also consider the possibility of Type II error.

HISTORICAL NOTE
The history of statistics actually lies in agriculture, where significant differences were intended to show which plots of land, treated differently, yielded better crops. The ability to control the soil and common growing conditions made the use of probability practical in making decisions about best growing conditions. However, when dealing with the uncertainties of clinical practice and the variability in people and conditions, relying on a single fixed probability value does not carry the same logic. It is more reasonable to think of probability as a value that varies continuously between extremes, with no sharp line between “significant” and “nonsignificant.”9,10

■ Type II Error and Power

We have thus far established the logic behind classical statistical inference, based on the probability associated with rejecting a true null hypothesis, or Type I error. But what happens when we find no significant effect and we do not reject the null hypothesis? Does this necessarily mean that there is no real difference?
If we do not reject the null hypothesis when it is indeed false, we have committed a Type II error—we have found no significant difference when a difference really does exist. This type of error can have serious consequences for evidence-based practice.
The probability of making a Type II error is denoted by beta (β), which is the probability of failing to reject a false null hypothesis. If β = .20, there is a 20% chance that we will make a Type II error, or that we will not reject H0 when it is really false. The fact that a treatment really is effective does not guarantee that a statistically significant finding will result.

Statistical Power

The complement of β error, 1 – β, is the statistical power of a test.

Power is the probability that a test will lead to rejection of the null hypothesis, or the probability of attaining statistical significance.

If β = .20, power = .80. Therefore, for a statistical test with 80% power, the probability is 80% that we would correctly demonstrate a statistical difference and reject H0 if actual differences exist. The more powerful a test, the less likely one is to make a Type II error.
Power can be thought of as sensitivity. The more sensitive a test, the more likely it will detect important clinical differences that truly exist. Where α = .05 has become the conventional standard for Type I error, there is no standard for β, although β = .20, with corresponding power of 80%, represents a reasonable protection against Type II error.11 If β is set at .20, it means that the researcher is willing to accept a 20% chance of missing a difference of the expected effect size if it really exists.
Just as we might lower α to reflect the relative importance of Type I error, there may be reasons to use a lower value of β to avoid Type II error. For instance, researchers may be interested in newer therapies that may only show small but important effects. In that case, β might be set at .10, allowing for 90% power.

The Determinants of Statistical Power

Power analysis involves four interdependent concepts (using the mnemonic PANE):

P = power (1 – β)
A = alpha level of significance
N = number of subjects or sample size
E = effect size

Knowing any three of these will allow determination of the fourth factor.

Power

Power can be measured following completion of a statistical analysis when there is no significant difference, to determine if there was sufficient power to detect a true difference. The researcher can also set the desired level of power during planning stages of the study.

Level of Significance

Although there is no direct mathematical relationship between α and β, there is a trade-off between them.
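The PANE relationship can be illustrated with an approximate power calculation for a two-sided, two-sample comparison of means. This is a sketch using the standard normal approximation, not a method prescribed by the text; it ignores the negligible opposite-tail term:

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided two-sample z-test.
    PANE: fixing alpha, n, and effect size determines power."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                    # 1.96 when alpha = .05
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    return 1 - z.cdf(z_crit - noncentrality)

# A standardized effect of 0.5 with 64 subjects per group yields roughly 80% power;
# smaller samples are less likely to detect the same true effect.
print(round(approx_power(0.5, 64), 2))
print(round(approx_power(0.5, 20), 2))
```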
Lowering the level of significance reduces the chance of Type I error by requiring stronger evidence to demonstrate significant differences. But this also means that the chance of missing a true effect is increased. By making the standard for rejecting H0 more rigorous (lowering α), we make it harder for sample results to meet this standard.

Sample Size

The influence of sample size on the power of a test is critical—the larger the sample, the greater the statistical power. Smaller samples are less likely to be good representations of population characteristics, and, therefore, true differences between groups are less likely to be recognized. This is the one factor that is typically under the researcher’s control to affect power.

Effect Size

Power is also influenced by the size of the effect of the independent variable. When comparing groups, this effect will be the difference between sample means. In studies where relationships or measures of risk are of interest, this effect will be the degree of association between variables. This is the essence of most research questions: “How large an effect will my treatment have?” or “How strong is the relationship between two variables?” Studies that result in large differences or associations are more likely to produce significant outcomes than those with small or negligible effects. Effect size (ES) is a measure of the degree to which the null hypothesis is false.11 The null hypothesis states that no difference exists between groups—that the effect is zero.
An effect size index is a unitless standardized value that allows comparisons across samples and studies. There are many types of effect size indices, depending on the specific test used. They can focus on differences between means, proportions, correlations, and other statistics. The index is a ratio of the effect size to the variance within the data. The power of a test is increased as the effect gets larger and variance within a set of data is reduced, showing distinct differences between groups. Different indices will be described in the context of statistical tests in upcoming chapters.

Power Analysis

We can analyze power for two purposes: to estimate sample size during planning stages of a study, and to determine the probability that a Type II error was committed when a study results in a nonsignificant finding.

A Priori Analysis: Determining Sample Size

One of the first questions researchers must ask when planning a study is, “How many participants are needed?” An easy answer may be as many as possible. However, this is not helpful when one is trying to place realistic limits on time and resources for data collection. Many researchers have arbitrarily suggested that a sample size of 30 or 50 is “reasonable.”10 Unfortunately, these estimates may be wholly inadequate for many research designs or for studies with small effect sizes. Sample size estimates should be incorporated in the planning stages of every experimental or observational study. The lack of such planning often results in a high probability of Type II error and needlessly wasted efforts. Institutional Review Boards will demand such estimates before approving a study (see Box 23-3).

➤ CASE IN POINT #3
Nobles et al17 studied the effect of a prenatal exercise intervention on gestational diabetes mellitus (GDM) among pregnant women at risk for GDM. They compared an individualized 12-week exercise program against a comparison wellness program. They hypothesized that participants in the exercise intervention would have a lower risk of GDM and lower screening glucose values as compared to a health and wellness control intervention.

By specifying a level of significance and desired power in the planning stages of a study, a researcher can estimate how many participants would be needed to detect a significant difference for an expected effect size. The smaller the expected effect size, the larger the required sample. This is considered a priori power analysis.

Box 23-3 Goldilocks and Power: Getting It “Just Right”
Because of established guidelines for reporting, most journals now require authors to substantiate their basis for estimating sample size and power as part of the planning process for recruitment.12 Ethical issues related to power center on concern for participants and justification of resources.13 Institutional Review Boards typically require this type of analysis before a study will be approved, to be sure that a study has a chance of success and is thereby worth the effort and potential risks.14 However, some argue that small, underpowered trials are still defensible if they will be able to eventually contribute to estimates of effect size through meta-analyses (see Chapter 37).15
An equally serious concern has been expressed about samples that are too large, exposing more participants than necessary to potential risks.16 By having a very large sample, thereby increasing the power of a test, even trivial differences can be deemed statistically significant. Using a power analysis to determine the ideal sample size prior to data collection can assure a reasonable number, not too small or too large—giving the researcher the most realistic and reasonable opportunity to see an effect. Information on how sample size was determined is typically reported in the methods section of a research paper.
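The a priori logic can be sketched with the common normal-approximation formula for comparing two means (an illustration only, not the specific method used in any study cited here):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """A priori sample-size estimate (per group) for a two-sided
    two-sample comparison, using the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # guards against Type I error
    z_beta = z.inv_cdf(power)            # guards against Type II error
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))   # a medium standardized effect -> 63 per group
print(n_per_group(0.2))   # a small effect demands a much larger sample -> 393
```

Note how the required n grows rapidly as the expected effect shrinks, which is exactly why arbitrary samples of 30 or 50 can be wholly inadequate.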
significant treatment effect. When significance levels are close to α, investigators often talk about a "nonsignificant trend" toward a difference. This is a cautionary conclusion, however, and should probably only be supported when power is an issue, suggesting that a larger sample might have led to a significant outcome.8 However, when effect sizes are considered small, as in the GDM study, there may be reason to consider that there is truly no difference (see Focus on Evidence 23-1).

It is also important to remember that power can be quite high with very large samples, even when effect sizes are small and meaningless. In our search for evidence, we must be able to distinguish clinical versus statistical significance. As clinicians, we should never lose sight of the role our clinical expertise and experience play in understanding mechanisms of interventions, theoretical underpinnings, relevant changes in patient performance, and the context of all available research on a topic.19 Patient preferences can also help us to understand what outcomes are meaningful when evidence is not sufficient (see Focus on Evidence 23-2).20

Resources to determine power and sample size are available online and through statistical programs such as SPSS.21,22 G*Power is a widely used free online program that will be illustrated in chapter supplements for various tests.23 See the Chapter 23 Supplement for an introduction to this process. References are also available for hand calculations.
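The logic of an a priori analysis can be sketched with a simple normal-approximation formula for a two-group comparison: n per group ≈ 2[(z1−α/2 + zβ)/d]², where d is the standardized effect size. This is only a rough sketch of what programs such as G*Power compute (they use exact noncentral t distributions, which give slightly larger values), and the function name here is ours, not from any package:

```python
from statistics import NormalDist
from math import ceil

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sample, two-tailed comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    Exact t-based calculations (as in G*Power) run slightly larger.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# A "medium" standardized effect (d = 0.5) at alpha = .05 and 80% power
print(n_per_group(0.5))  # 63 per group (exact t-based tables give about 64)
```

Note how the required n grows rapidly as the expected effect shrinks, which is why small anticipated effects demand large samples.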
Focus on Evidence 23–1
What does it mean when we DON'T find a significant difference...

Researchers and clinicians have long understood the difference between statistical significance and clinical relevance. This distinction has been part of the impetus for translational research, to make research findings more applicable to practice. In our eagerness to find evidence for practice decisions, however, sometimes this distinction can get lost.

The phrase "Absence of evidence is not evidence of absence" was introduced in Chapter 5, serving as a caveat to remind us that we don't know what we don't know! Just because we do not have evidence that something exists does not mean that we have evidence that it does not exist, an essential premise for evidence-based practice. Remember that, in fact, we can never actually prove no difference, which is why we cannot "accept" the null hypothesis.24

So when is it reasonable to claim that a study has "proved" there is no effect of an intervention? The correct answer is close to "never," because uncertainty will always exist,25 although there needs to be some theoretical premise that warrants continued discovery. Missing the α threshold does not end the discussion. Unfortunately, when results are not significant, clinicians may assume that the experimental treatment was not effective and stop there. And compounding the dilemma, researchers and journal editors are often unable or unwilling to publish reports that end in nonsignificant outcomes, a problem called publication bias (see Chapter 37).26-29

The implications of this issue can be far-reaching. As practitioners and researchers, it is important for us to know about interventions that were and were not effective, and how to interpret such findings. Otherwise, we may be losing a great deal of valuable information, moving critically off course by ignoring potentially fruitful lines of research.
Focus on Evidence 23–2
… and what does it mean when we DO find a significant difference?

Several considerations are critical to understanding the results of significant statistical tests. The immediate interpretation will be to consider differences or relationships to be meaningful. This may certainly be true, but there is more to this interpretation than a simple "yes" or "no."

Even when significance is obtained, we must consider what the numbers mean. Looking at data plots is important to examine variability, and confidence intervals help to estimate the effect. Reliability and validity assessments may be relevant to determine if the measured change is clinically meaningful. Most importantly, it is the size of the effect that is going to indicate how important the results are, not statistical significance. There is no magic in p = .05!

We must also be careful not to confuse the significance level with the size of the effect. Remember that p values only tell us the likelihood of the observed outcome if the null hypothesis is true. A lower probability does not indicate a larger effect size, and a higher probability does not indicate a smaller effect; both are influenced by sample size. We can't compare p values across studies to say that one treatment is better than another, and a low p value doesn't mean the treatment effect was meaningful, especially if the sample is very large, which would increase power.

It is important to look at the size of the treatment effect, whether or not the null hypothesis is rejected. Effect size indices can help to assess the magnitude of the effect and will provide a basis for comparison across studies. Consider the effect size and confidence intervals in addition to statistical significance, to avoid discarding potentially important discoveries, especially with new treatment approaches.30
➤ CASE IN POINT #1 REVISITED
Assume we are interested in comparing the average birth weight in a particular community with the national average. The community has limited access to healthcare, and we want to establish if this affects birth weights. From national studies, we know the population birth weight values: μ = 7.5 lbs with σ = 1.25. We draw a random sample of n = 100 infants from the community. Results show X̄ = 7.2 lbs and s = 1.73.

The Null Hypothesis
If there is a difference between the sample mean and the known population mean, it may be the result of chance, or it may indicate that these children should not be considered part of the overall population. The null hypothesis proposes no difference between the sample and the general population. Therefore, we predict that the population from which the sample is drawn has a mean of 7.5.

H0: μ = 7.5

The alternative hypothesis states that the children come from another population with a mean birth weight different from 7.5, that is, different from the general population:

H1: μ ≠ 7.5

This is a nondirectional hypothesis. We are not predicting if the sample mean will be larger or smaller than the population mean, only that it will be different.

The z-ratio is

z = (X̄ − μ) / sX̄

For our example, with X̄ = 7.2, s = 1.73, n = 100, and μ = 7.5, we calculate

sX̄ = 1.73 / √100 = 0.173

Therefore,

z = (7.2 − 7.5) / 0.173 = −0.3 / 0.173 = −1.73

This tells us that the sample mean is 1.73 standard error units below the population mean of 7.5. The question we must ask is: Does this difference represent a significant difference? Is the difference large enough that we would consider the mean birth weight in our community to be significantly different from 7.5?

Importantly, the z-ratio can be understood in terms of a relationship between the difference between means (the numerator) and the variance within the sample (the denominator). The ratio will be larger as the difference between means increases and as variance decreases. This concept will be revisited in discussions of many statistical tests.

Applying Probability
We begin by assuming that the null hypothesis is true, that the observed difference of 0.3 lbs is due to chance. We then ask, how likely is a difference of this size if H0 is true? The answer to this question is based on the defined properties of the normal sampling distribution and our desired level of significance, α = .05. We must determine the probability that one would observe a difference as large as 0.3 lbs by chance if the population mean for infants in this community is truly 7.5.

Let us assume we could plot the sampling distribution of means of birth weight for the general population, which takes the form of the normal curve as shown in Figure 23-6. Using α = .05 as our criterion, we accept that any value that has only a 5% chance of occurring is considered "unlikely." So we look at the area under the curve that represents 95%, which we know is bounded by z = ±1.96. Therefore, there is a 95% chance that any one sample chosen from this population would have a mean within those boundaries.
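The arithmetic of this one-sample z-test is small enough to check directly. A minimal sketch in Python using only the standard library (the variable names are ours, not from the text):

```python
from math import sqrt, erf

# Birth weight example: population mean 7.5 lbs; sample of 100 infants
mu, x_bar, s, n = 7.5, 7.2, 1.73, 100

se = s / sqrt(n)       # standard error of the mean: 1.73/10 = 0.173
z = (x_bar - mu) / se  # z = -0.3/0.173 = -1.73

# Two-tailed p value from the standard normal CDF
phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
p = 2 * (1 - phi(abs(z)))

print(round(z, 2), round(p, 3))  # z = -1.73 falls inside ±1.96, so p > .05
```

Because |z| never reaches the two-tailed critical value of 1.96, the p value exceeds .05, matching the conclusion drawn in the text.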
Figure 23–6 Standard normal distribution of z-scores showing critical values for a two-tailed (nondirectional) test at α2 = .05. The critical region falls in both tails, and the 5% is divided between the two (2.5% in each); the central 95% is the noncritical region. The critical value is at ±1.96. The calculated z-score for community birth weights (z = –1.73) does not fall within the critical region and is not significant.
Said another way, it is highly likely that any one sample chosen from this general population would have a mean birth weight within this range. Conversely, there is only a 5% chance that any sample mean would fall above or below those points. This means that it is unlikely that a sample mean from this general population would have a z-score beyond ±1.96.

Therefore, if our sample yields a mean with a z-ratio beyond ±1.96, it is unlikely that the sample is from the general population (H1). If the sample mean of birth weights falls within z = ±1.96, then there is a 95% chance that the sample does come from the general population with μ = 7.5 (H0).

The Critical Region
The tails of the curve depicted in Figure 23-6, representing the area above and below z = ±1.96, encompass the critical region, or the region of rejection for H0. This region, in both tails, equals a total of 5% of the curve, which is our criterion for significance. Therefore, this test is considered a two-tailed test, to indicate that our critical region is split between both tails. Because we proposed a nondirectional alternative hypothesis, we have to be prepared for the difference between means to be either positive or negative.

The value of z that defines the critical region is the critical value. When we calculate z for our study sample, it must be equal to or greater than the absolute critical value if H0 is to be rejected. If a calculated z-ratio is less than the critical value (z < ±1.96), it will fall within the noncritical region, representing a difference that may be due to chance. If a calculated z-ratio falls within the critical region (z ≥ ±1.96), the ratio is significant.

Remember that the values for z are symmetrical. The values in Table A-1 will apply to positive or negative z-scores.

In our example, z = –1.73 for the sample mean of birth weights. Its absolute value is less than the critical value of 1.96, and it falls within the central noncritical region of the curve. Therefore, it is likely that this sample does come from the general population with μ = 7.5. The observed difference between the means is not large enough to be considered significant at α = .05, and we do not reject the null hypothesis.

See the data files with supplemental materials for the data used in the birth weight example.

Directional Versus Nondirectional Tests

Two-Tailed Test
The process just described is considered nondirectional because we did not predict the direction of the difference between the sample and population means. Consequently, the critical region was established in both tails, so that a large enough
Therefore, the sample should be a useful representation of population "parameters." With small samples, or with distributions that have not been previously described, it may be unreasonable to accept this assumption. Tests of "goodness of fit" can be performed to determine how well data match the normal distribution (see Box 23-4). Alternatively, data can be transformed to another scale of measurement, such as a logarithmic scale, to create distributions that more closely satisfy the necessary assumptions for parametric tests (see Appendix C).
• Variances in the samples being compared are roughly equal, or homogeneous. A test for homogeneity of variance can substantiate this assumption (see Chapters 24 and 25).
• Data should be measured on the interval or ratio scales.

Box 23-4 What Is Normal?
Because many statistical procedures are based on assumptions related to the normal distribution, researchers should evaluate the shape of the data as part of their initial analysis. Alternative nonparametric statistical operations can be used with skewed data, or data may be transformed to better reflect the characteristics of a normal distribution.
Unfortunately, researchers often do not test for normality as part of their initial analysis, running the risk of invalid statistical conclusions.31,32 "Goodness of fit" procedures can be applied to test for normality. The two most often applied tests are the Kolmogorov-Smirnov (KS) test33 and the Shapiro-Wilk (SW) test.34 The SW test is considered more powerful, providing better estimates with smaller samples.35
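To make the transformation idea concrete, here is a small sketch showing how a logarithmic transformation can pull a right-skewed distribution toward symmetry. The data are invented for illustration, and we use a simple moment-based skewness coefficient rather than a formal KS or SW test (those are available in statistical packages):

```python
from math import log, sqrt

def skewness(data):
    """Moment-based skewness: mean of cubed standardized deviations.

    Roughly zero for symmetric data; positive for a right (upper) tail.
    """
    n = len(data)
    m = sum(data) / n
    sd = sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum(((x - m) / sd) ** 3 for x in data) / n

raw = [1, 2, 2, 3, 4, 5, 8, 12, 30, 90]  # strongly right-skewed
logged = [log(x) for x in raw]           # logarithmic transformation

print(round(skewness(raw), 2), round(skewness(logged), 2))
```

The raw values are heavily skewed by the long upper tail, while the log-transformed values are much closer to symmetric, which is exactly why log scales are a common remedy before applying parametric tests.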
COMMENTARY

The emphasis placed on significance testing in clinical research must be tempered with an understanding that statistical tests are tools for analyzing data and should never be used as a substitute for knowledgeable and critical interpretation of outcomes. A simple decision based on a p value is too simplistic for clinical application. A lack of statistical significance does not necessarily imply a lack of practical importance, and vice versa.9 The pragmatic difference in clinical effect with p = .05 or p = .06 may truly be negligible. What would be the difference in a clinical decision with an outcome at p = .049 versus p = .051?

Sir Ronald Fisher, who developed the concept of the null hypothesis and other statistical principles, cautioned against drawing a sharp line between "significant" and "nonsignificant" differences, warning that

…as convenient as it is to note that a hypothesis is contradicted at some familiar level of significance, … we do not ever need to lose sight of the exact strength which the evidence has in fact reached, or to ignore the fact that with further trial it might come to be stronger or weaker.36

To be evidence-based practitioners, we must all recognize the role of judgment as well as an understanding of statistical principles in using published data for clinical decision making. Understanding statistics and design concepts is important, but the statistics do not make decisions about what works; that is the judgment of the researcher and clinician who applies the information. For example, being able to appreciate the information contained within a confidence interval will put a probability value within a context that can be more useful for clinical decision making.

Finally, we should always reflect on the collective impact of prior studies and critically assess the meaningfulness of the data in the aggregate. Interpreting results in the context of other studies requires that we recognize the need for replication and meta-analytic approaches (see Chapter 37).
CHAPTER 24
Comparing Two Means: The t-Test
The simplest experimental comparison involves the use of two independent groups within a randomized controlled trial (RCT). This design allows the researcher to assume that all individual differences are evenly distributed between the groups at baseline, and therefore, observed differences after the application of an intervention to one group should reflect a significant difference, one that is unlikely to be the result of chance.

The statistical ratio used to compare two means is called the t-test, also known as Student's t-test. The purpose of this chapter is to introduce procedures that can be applied to differences between two independent samples or between scores obtained with repeated measures. These procedures are based on parametric operations and, therefore, are subject to assumptions underlying parametric statistics.

■ The Conceptual Basis for Comparing Means

The concept of testing for statistical significance was introduced in Chapter 23 in relation to a one-sample test. The process for comparing two sample means is similar, with some important variations. To illustrate this process, suppose we divide a sample of 100 subjects into two randomly assigned groups (n = 50), one experimental and one control, to determine if an intervention makes a difference in their performance. Theoretically, if the treatment was ineffective (H0), we would expect all subjects in both groups to have the same response, with no difference between groups.

But if the treatment is effective, all other factors being equal, we would expect a difference, with all subjects within the experimental group getting the same score, and all subjects within the control group getting the same score, but scores would be different between groups. This is illustrated in Figure 24-1A, showing that everyone in the treatment group performed better than everyone in the control group.

Now think about combining all 100 scores. If we were asked to explain why some scores in this sample were different, we would say that all differences were due to the effect of treatment. There is a difference between the groups but no variance within the groups.

Error Variance
Of course, that scenario is not plausible, so let's consider the more realistic situation where subjects within a group do not all respond the same way. As shown in Figure 24-1B, the scores within each group are variable, but we still tend to see higher scores among those in the experimental group.

Now if we look at the entire set of scores and we were asked again to explain why scores were different, we would say that some of the differences can be explained by the treatment effect; that is, most of the higher scores were in the treatment group. However, the scores are also influenced by a host of other random factors that create variance within the groups, variance that is unexplained by the treatment. This unexplained portion is called error variance.
6113_Ch24_351-364 02/12/19 4:40 pm Page 352
Figure 24–1 Four sets of hypothetical distributions of control and experimental groups with the same means but different variances. A) All subjects in each group have the same score, but the groups are different from each other. There is no variance within groups, only between groups. B) Subjects' scores are more spread out, but the control and experimental conditions are still clearly different. The variances of both groups are equal. C) Subjects are more variable in responses, with greater variance within groups, making the groups appear less different. D) The variances of the two groups are not equal.

The concept of statistical "error" does not mean mistakes or miscalculation. It refers to all sources of variability within a set of data that cannot be explained by the independent variable.

The distributions in Figure 24-1B show two means that are far apart, with few overlapping scores at one extreme of each curve. The curves show that the individuals in the treatment and control groups behaved very differently, whereas subjects within each group performed within a narrow range (error variance is small). In such a comparison, the experimental treatment has clearly differentiated the groups, and the difference between groups is unlikely to be the result of chance. Therefore, the null hypothesis would probably be rejected.

Contrast this with the distributions in Figure 24-1C, which shows the same means but greater variability within the groups, as evidenced by the wider spread of the curves. Factors other than the treatment variable are causing subjects to respond very differently from each other. Here we find more overlap, indicating that many subjects from both groups had the same score, regardless of whether or not they received the treatment. These curves reflect a greater degree of error variance; that is, the treatment does not help to fully explain the differences among scores. In this case, it is less likely that the treatment is differentiating the groups, and the null hypothesis would probably not be rejected. Any differences observed here are more likely to be due to chance variation.

The t Distribution
The properties of the normal curve provide the basis for determining if significant differences exist. Recall from Chapter 22 how the normal curve was used to determine the probability associated with choosing a single value out of a population. By knowing the variance and the proportions under the curve, we could know the probability of obtaining a certain value to determine if it was a "likely" or "unlikely" event. By converting a value to a z-score, we were able to determine the area of the curve above or below the value, and thereby interpret the probability associated with it.

We can use this same logic to determine if the difference between two means is a likely outcome. If we were to choose a random sample of individuals, divide them randomly into two groups, and do nothing to them, theoretically we should find no difference between their means (X̄1 − X̄2 = 0). But because we might expect some difference just by chance, we have to ask if the observed difference is large enough to be considered significant.

The significance of the difference between two group means is judged by a ratio, the t-test, derived as follows:

t = Difference between means / Variability within groups
The numerator represents the separation between the groups, which is a function of all sources of variance, including treatment effects and error. The denominator reflects the variability within groups as a result of error alone.

The between and within sources of variance are essential elements of significance testing that will be used repeatedly. The ratio can be thought of as a signal-to-noise ratio, where the "signal" represents the difference between means, and the "noise" is all the other random variability. The larger the signal in proportion to the noise, the more likely the null hypothesis will be rejected. Most statistical tests are based on this relationship.

When H0 is false (the treatment worked), the numerator will be larger and the denominator will be smaller: a greater variance between groups, less variance within groups. When H0 is true (no treatment effect), the error variance gets larger and the t ratio will be smaller, approaching 1.0. If we want to demonstrate that two groups are significantly different, the value of t should be as large as possible. Thus, we would want the separation between the group means to be large and the variability within groups to be small: a lot of signal to little noise.

HISTORICAL NOTE
Following graduation from Oxford in 1899, with degrees in mathematics and chemistry, William S. Gosset (1876-1937) had the somewhat enviable job of working for the Guinness Brewery in Dublin, Ireland, with responsibility for assuring the quality of the beer. He began studying the results of his inspections on small samples and figured out that his accuracy improved when he estimated the standard error of his sample. With no computer to assist him, he worked for over a year, testing hundreds of random samples by hand (calculations we could do in seconds today), but he got it right!
It seems that another researcher at Guinness had previously published a paper that had revealed trade secrets, and Guinness prohibited its employees from publishing to prevent further disclosures. Gosset convinced his bosses that his information presented no threat to the company, and was allowed to publish, but only if he used a pseudonym. He chose "Student."1 Gosset's landmark 1908 article, "The Probable Error of a Mean," became the foundation for the t-test, which is why it is also known as "Student's t."2 Today a plaque hangs in the halls of the Guinness Brewery, honoring Gosset's historic contribution.

Statistical Hypotheses
The null hypothesis states that the two means are equal:

H0: μ1 = μ2 (μ1 − μ2 = 0)

The alternative hypothesis can be stated in a nondirectional format,

H1: μ1 ≠ μ2 (μ1 − μ2 ≠ 0)

or a directional format,

H1: μ1 > μ2 (μ1 − μ2 > 0)

Depending on the direction of the difference between the means, t can be a positive or negative value. Nondirectional hypotheses are tested using a two-tailed test, and directional hypotheses are tested using a one-tailed test (see Chapter 23).

Even though we are comparing sample means, hypotheses are written in terms of population parameters.

Homogeneity of Variance
A major assumption for the t-test, and most other parametric statistics, is equality of variances among groups, or homogeneity of variance. While there is an expectation that some error variance will be present in each group, the assumption is that the degree of variance will be roughly equivalent, or not significantly different. This is illustrated by the curves in Figure 24-1B and C. However, in D we can see that the treatment group is much less variable than the control group, and the assumption of homogeneity of variance is less tenable.

Most statistical procedures include tests to determine if variance components are not significantly different; that is, if observed differences between the variances are due to chance. With random assignment, larger samples will have a better chance of showing equal variances than small samples. When variances are significantly different (they are not equal), adjustments can be made in the test to account for this difference.

■ The Independent Samples t-Test

To compare means from two independent groups, usually created through random assignment, we use an independent or unpaired t-test. Groups are considered independent because each is composed of different sets of subjects, with no inherent relationship derived from repeated measures or matching.

The test statistic for the unpaired t-test is calculated using the formula

t = (X̄1 − X̄2) / sX̄1−X̄2

The numerator of this ratio represents the difference between the independent group means. The term in the
Table 24-1 Results of the Unpaired t-Test (Equal Variances): Change in Pinch Strength (Pounds of Force) Following Hand Splinting [DATA DOWNLOAD]

A. Descriptive Data: Group Statistics
B. Calculation of t
Step 2: Calculate the standard error of the difference between means. Step 3: Calculate t.

1 Levene's test compares the variances of the two groups. This difference is not significant (p = .419). Therefore, we will use the t-test for equal variances.
2 The two-tailed probability for the t-test is .014. This is a significant test, and we reject H0. SPSS uses the term "Sig" to indicate p values, and only reports two-tailed values by default. If we had performed a one-tailed test, we would divide this value by 2, and it would be significant at p = .007.
3 The difference between the means of Groups 1 and 2 (the numerator of the t-test).
4 The standard error of the difference between the means (the denominator of the t-test).
5 The 95% confidence interval does not contain the null value zero, confirming that the difference between the means is significant.
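The pooled (equal-variances) t in output like Table 24-1 can be reproduced by hand. The sketch below uses invented change scores, not the actual pinch-strength data (which are in the data download), and `unpaired_t` is our own helper name:

```python
from math import sqrt

def unpaired_t(g1, g2):
    """Pooled-variance (equal variances) independent-samples t-test."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)   # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))  # SE of the difference between means
    return (m1 - m2) / se, n1 + n2 - 2  # t ratio and degrees of freedom

# Hypothetical change scores for two groups of n = 5
experimental = [4.2, 5.1, 3.8, 6.0, 5.5]
control = [3.1, 2.8, 4.0, 3.5, 2.9]
t, df = unpaired_t(experimental, control)
print(round(t, 2), df)  # t ≈ 3.59 with 8 df, beyond the critical value 2.306
```

With 8 df, the two-tailed critical value at α2 = .05 from Table A-2 is 2.306, so this hypothetical result would be significant.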
Notice in Table A-2 that as the degrees of freedom get larger, the critical value of t approaches z. For instance, with samples larger than 120, the critical value of t at α2 = .05 is 1.96, the same as z.

Figure 24–2 Critical values of t for a two-tailed test at α2 = .05, with .025 in each tail. The calculated t ratio corresponds to a one-tailed probability of .007.

Figure 24-2 illustrates the critical value of t for 18 df at α2 = .05. Because we are using a two-tailed test, each tail holds α/2, or .025. The null hypothesis states that the difference between means will be zero, and therefore, the t ratio will also equal zero. The probability that a calculated t ratio will be as large or larger than 2.101 when means are only different by chance is 5% or less. Therefore, the critical value reflects the maximum difference under the null hypothesis. Anything larger would be
considered significant, and the null hypothesis would be rejected.

For a t ratio to represent a significant difference, the absolute value of the calculated ratio must be greater than or equal to the critical value. In this example, the calculated value t = 2.718 is greater than the critical value 2.101. Therefore, p < .05 for this calculated value, and the group means are considered significantly different. We reject H0, accept H1, and conclude that patients wearing the new splint improved more than those in the standard group.

The Sign of t
Critical values of t are absolute values, so that negative or positive ratios are tested against the same criteria. The sign of t can be ignored when a nondirectional hypothesis has been proposed. The critical region for a two-tailed test is located in both tails of the t distribution, and therefore either a positive or negative value can be considered significant (see Chapter 23). The sign will be an artifact of which group happened to be designated Group 1 or 2. If the group designations were arbitrarily reversed, the ratio would carry the opposite sign, with no change in outcome. For the current example, the ratio happens to be positive, because the experimental group was designated as Group 1 and X̄1 was larger than the mean improvement for the control group, X̄2.

The sign is of concern, however, when a directional alternative hypothesis is proposed. In a one-tailed test, the researcher is predicting that one specific mean will be larger or smaller than the other, and the sign must be in the predicted direction for the alternative hypothesis to be accepted (see Box 24-1).

Computer Output
Table 24-1 shows the output for an unpaired t-test for the example of pinch strength and hand splints. Descriptive statistics are given for each group. This information is useful as a first pass, to confirm the number of subjects, and to see how far apart the means and variances are. We can see that the standard deviations are similar, although not exactly the same.

The test for homogeneity of variance will indicate if the difference between variances is large enough to be meaningful, and not the result of chance. When variances are significantly different (they are not equal), adjustments can be made in the t-test that will account for these differences. Levene's Test for equality of variances compares the variances of the two groups. It uses the F statistic and reports a p value that indicates if the variances are significantly different. If Levene's test is significant (p ≤ .05), the assumption of homogeneity of variance has been violated. If p > .05, we can assume the variances are not significantly different, that is, they are homogeneous.

Notice that there are actually two lines of output for the t-test reported in Table 24-1C, labeled according to whether the assumption of equal variances can be applied. Computer packages automatically run the t-test for both equal and unequal variances, and the researcher must choose which one to use for analysis.
Focus on Evidence 24–1
Confidence by Intervals

Researchers in many disciplines have become disenchanted with the overemphasis placed on reporting p values in research literature.3 The concern relates to the lack of useful interpretation of these values, which focus solely on statistical significance or the viability of the null hypothesis, rather than the magnitude of the treatment effect.7 In an effort to make hypothesis testing more meaningful, investigators have begun to rely more on the confidence interval (CI) as a practical estimate of a population's characteristics. Many journals now require the reporting of confidence intervals, some to the exclusion of p values.

In hypothesis testing, we always wrestle with a degree of uncertainty. Confidence intervals provide an expression of that uncertainty, helping us understand a possible range of true differences, something the p value cannot do.8 Although confidence intervals incidentally tell us whether an effect is "significant" (if the null value is not included), their more important function is to express the precision with which the effect size is estimated.9,10 This information can then be used for evaluating the results of assessments and for framing practice decisions.

To illustrate this point, Obeid et al11 were interested in studying sedentary behavior of children with cerebral palsy (CP), recognizing the metabolic and cardiovascular risks associated with lack of physical activity. They hypothesized that children with CP would spend more hours sedentary than their typically developing peers and would take fewer breaks to interrupt sedentary time. Using an independent t-test, they found that the difference between groups was significant as follows:

Each of these values was averaged over a measurement period of 6 days. The differences between groups are significant for both variables (p < .05), supported by CIs that do not contain zero. But how do the CIs help to interpret these outcomes? For sedentary time, a difference of 4 min/hr may seem small (with a small CI), but the authors suggest that this can add up over a day or week to a meaningful variation. However, looking at the number of breaks shows a different story, with a much wider CI. The degree to which these differences have a clinical impact must be interpreted in relation to the amount of sedentary time that would be considered detrimental to the child's health and how intervention might address activity levels. Clearly, there is substantial variability in the sample, which makes interpretation less certain.

Although confidence intervals and p values are linked, they are based on different views of the same problem. The p value is based on the likelihood of finding an observed outcome if the null hypothesis is true. Confidence intervals make no assumptions about the null hypothesis, and instead reflect the likelihood that the limits include the population parameter. Like p values, however, confidence intervals do not tell us about the importance of the observed effect. That will always remain a matter of clinical judgment.

From Obeid J, Balemans AC, Nooruyn SG, et al. Objectively measured sedentary time in youth with cerebral palsy compared with age-, sex-, and season-matched youth who are developing typically: an explorative study. Phys Ther 2014;94(8):1163-7. Adapted from Table, p. 1165. Used with permission of the American Physical Therapy Association.
The null value for a difference between two means is zero. Because this confidence interval does not contain zero, we can reasonably reject H0 with a 5% risk. This confirms the results of the t-test for these same data using the p value.

Unequal Variances

Studies have shown that the validity of the independent t-test is not seriously compromised by violation of the assumption of equality of variance when n1 = n2.6 However, when sample sizes are unequal, differences in variance can affect the accuracy of the t ratio. If a test for equality of variance shows that variances are significantly different, the t ratio must be adjusted.

Computer Output

Consider the previous example in which we examined the effect of a hand splint on changes in pinch strength. Table 24-2 shows alternative data for this comparison, with unequal sample sizes (n1 = 15, n2 = 10). Descriptive statistics are provided for the performance of each group.

We first look at the results of Levene's test, which shows that the variances in this analysis are significantly different (p = .038, Table 24-2 1). Therefore, for this analysis, we use the second line of data in the output, "Equal variances not assumed."

Note that the degrees of freedom associated with the t-test for unequal variances are adjusted downward, so that the critical value for t is also modified. In this example, 20.6 degrees of freedom are used to determine the critical value of t (see Table 24-2 2). From Table A.2, we can find the critical value t(21) = 2.080.

Confidence Intervals

This test also includes a confidence interval for the mean difference (see Table 24-2 6). The standard error is
Table 24-2 Results of the Independent t-Test (Unequal Variances): Changes in Pinch Strength (Pounds of Force) Following Hand Splinting

A. Descriptive Data: Group Statistics

1 Levene's test is significant (p = .038), indicating that the variance of group 1 is significantly different from the variance of group 2. Therefore, we will use the t-test for unequal variances.
2 With 25 subjects, the adjusted total degrees of freedom for the test of unequal variances is 20.625.
3 The two-tailed probability for the t-test is .002. This is significant, and we reject H0.
4 The difference between the means of group 1 and 2 (numerator of the t-test).
5 Standard error of the difference (denominator of the t-test).
6 The 95% confidence interval does not contain zero, confirming that the difference between means is significant at p < .05.
different for the test of unequal variances, calculated as the denominator of the t-test (see Table 24-2 5). For this analysis, we use the critical value t(21) = 2.080 for α2 = .05.

95% CI = 10.8 – 5.65 ± 2.080 (1.472)
       = 5.15 ± 3.062
       = 2.088 to 8.212

See the Chapter 24 Supplement for calculations associated with the t-test for unequal variances.

■ Paired Samples t-Test

Researchers often use repeated measures to improve the degree of control over extraneous variables in a study. In these designs, subjects usually serve as their own controls, each being exposed to both experimental conditions, and then responses are compared across these conditions. A repeated measures design may also be based on matched samples.

In these types of studies, data are considered paired or correlated, because each measurement has a matched value for each subject. To determine if these values are significantly different from each other, a paired t-test is performed. This test analyzes difference scores (d) within each pair, so that subjects are compared only with themselves. Statistically, this has the effect of reducing the total error variance in the data because most of the extraneous factors that influence data will be the same across both treatment conditions. Therefore, tests of significance involving paired comparisons tend to be more powerful than unpaired tests.

The paired t-test may also be called a test for correlated, dependent, or matched samples.
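The unequal-variances adjustment described above (often called the Welch test) can be sketched in Python with scipy. The data below are hypothetical, invented only to show that the hand formulas for the t ratio and the downward-adjusted degrees of freedom match the packaged test; `welch_ttest` is an illustrative helper, not a library function.

```python
import numpy as np
from scipy import stats

def welch_ttest(x, y):
    """Independent t-test for unequal variances with adjusted df."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    nx, ny = len(x), len(y)
    se = np.sqrt(vx / nx + vy / ny)      # standard error of the difference
    t = (x.mean() - y.mean()) / se
    # Degrees of freedom are adjusted downward when variances differ
    df = (vx / nx + vy / ny) ** 2 / (
        (vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    p = 2 * stats.t.sf(abs(t), df)       # two-tailed probability
    return t, df, p

# Hypothetical pinch-strength changes (not the Table 24-2 data)
group1 = [12.1, 9.8, 11.4, 10.9, 13.2, 8.7, 10.1, 11.8]
group2 = [7.2, 9.5, 6.8, 8.1, 7.9]

t, df, p = welch_ttest(group1, group2)
t_sp, p_sp = stats.ttest_ind(group1, group2, equal_var=False)
```

The manual result agrees with `scipy.stats.ttest_ind(..., equal_var=False)`, and the adjusted df fall below the n1 + n2 – 2 used when equal variances are assumed.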
The test statistic for paired data is based on the ratio

t = d̄ / sd̄

where d̄ is the mean of the difference scores, and sd̄ represents the standard error of the difference scores. Like the independent test, this t ratio reflects the proportional relationship of between- and within-group variance components. The numerator is a measure of the differences between scores in each pair, and the denominator is a measure of the variability within the difference scores.

In accordance with requirements for parametric statistics, the paired t-test is based on the assumption that samples are randomly drawn from normally distributed populations, and data are at the interval/ratio level. Because the number of scores must be the same for both levels of the repeated measure, it is unnecessary to test the assumption of homogeneity of variance.

➤ CASE IN POINT #2
Suppose we set up a study to test the effect of using a lumbar support pillow on angular position of the pelvis in relaxed sitting. We hypothesize that pelvic tilt will be greater with a lumbar support pillow compared to sitting with no pillow, a directional hypothesis. We test eight subjects, each one sitting with and without the pillow, in random order. The angle of pelvic tilt is measured using a flexible ruler, with measurements transformed to degrees.

The mean angle with the pillow was 102.375, and the mean without the pillow was 99.00, a difference of 3.375 degrees.

The Critical Value

With a directional hypothesis, this study applies a one-tailed test. We are predicting that the pelvic tilt will be greater with a pillow than without. Statistical hypotheses can be expressed as

H0: μ1 = μ2 against H1: μ1 > μ2

where means represent repeated conditions. These hypotheses may also be expressed in terms of difference scores:

H0: d̄ = 0 and H1: d̄ > 0

Hypothetical data are reported in Table 24-3A. A difference score is calculated for each pair of scores, and the mean (d̄) and standard deviation (sd) of the difference scores are calculated. By calculating the standard error, sd̄, and substituting values in the formula for the paired t-test, we obtain t = 1.532 (see Table 24-3B). The value of t is positive because we designated the Pillow condition as group 1.

Because a repeated measures design uses only one sample of subjects (not separate groups), the total df associated with a paired t-test are n – 1, where n is the number of pairs of scores. The absolute value of the calculated ratio is compared with a critical value, in this case for a one-tailed test with n – 1 = 7 df. Using Appendix Table A-2, we find the critical value t(7) = 1.895 for α1 = .05. Because the calculated value is less than the critical value, these conditions are not considered significantly different.

The output in Table 24-3C 6 shows that p = .169 for a two-tailed test. Because we proposed a directional hypothesis, however, we must divide this value by 2 to get a one-tailed value, so p = .085. Because this is greater than .05, H0 is not rejected.

Confidence Intervals

For the paired t-test, a confidence interval is obtained using the formula:

95% CI = d̄ ± t(sd̄)

where t is the critical value for the appropriate degrees of freedom (α2 = .05), and sd̄ is the standard error of the difference scores. Even though we have proposed a one-tailed test, we use a two-sided test to determine high and low confidence limits. Table A-2 tells us that t(7) = 2.365. Therefore,

95% CI = 3.375 ± 2.365 (2.203)
       = 3.375 ± 5.210
       = –1.835 to 8.585

We are 95% confident that the true difference in pelvic angle between the pillow and non-pillow conditions is between –1.84 and 8.59 degrees (see Table 24-3C 7). This means the population mean could be negative, positive, or even zero. When zero is contained within a confidence interval, means are not significantly different.

■ Power and Effect Size

Power analysis for the t-test is based on the effect size index, d, originally proposed by Cohen.12 This index expresses the difference between the two sample means in standard deviation units (see Fig. 24-3), which is why it is sometimes called the standardized mean difference (SMD). The calculation of d is specific to the type of t-test performed.

Conventional effect sizes for d are:
Small d = .20
Medium d = .50
Large d = .80

These guidelines are intended to provide a way to interpret effect size along a continuum, not to serve as absolute cutoff values. They are open to interpretation
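The paired analysis just described can be reproduced with `scipy.stats.ttest_rel`, using the pelvic-tilt scores from Table 24-3A. This is a sketch of the computation, not the SPSS output itself.

```python
from scipy import stats

# Pelvic tilt (degrees) for eight subjects, with and without the pillow (Table 24-3A)
pillow    = [112, 96, 105, 110, 106, 98, 90, 102]
no_pillow = [108, 102, 98, 112, 100, 85, 92, 95]

# Two-tailed result, as most packages report it
t, p_two = stats.ttest_rel(pillow, no_pillow)

# One-tailed p for the directional hypothesis H1: pillow > no pillow
p_one = stats.ttest_rel(pillow, no_pillow, alternative='greater').pvalue
```

The test gives t ≈ 1.53 with a two-tailed p ≈ .169; the one-tailed value is exactly half of that (≈ .085), which exceeds .05, so H0 is not rejected.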
Table 24-3 Results of the Paired t-Test: Angle of Pelvis With and Without a Lumbar Support Pillow

A. Data
Subject   Pillow (X1)   No Pillow (X2)    d
1         112           108               4
2         96            102              –6
3         105           98                7
4         110           112              –2
5         106           100               6
6         98            85               13
7         90            92               –2
8         102           95                7
Σd = 27   d̄ = 3.375   sd = 6.232

B. Calculation of t
Step 1: Calculate the standard error of the difference scores: sd̄ = sd/√n = 6.232/√8 = 2.203
Step 2: Calculate t: t = d̄/sd̄ = 3.375/2.203 = 1.532

C. Output: Paired Differences
                                        95% CI of the Difference
Pair 1            Mean      Std.      Std. Error   Lower      Upper      t      df   Sig. (2-tailed)
                            Deviation Mean
Pillow - NoPillow 3.37500   6.23212   2.20339      –1.83518   8.58518    1.532  7    .169

1 The mean difference score, d̄, is 3.375 (numerator of the t-test), with a standard deviation, sd, of 6.232.
2 The standard error of the difference scores, sd̄, is 2.203 (denominator of the t-test).
3 The paired t-test results in t = 1.532.
4 Means are provided for each condition.
5 The paired t-test procedure includes the correlation of pillow and non-pillow scores and significance of the correlation.
6 Because a directional alternative hypothesis was proposed, a one-tailed test is used. Therefore, we must divide the 2-tailed significance value by 2, resulting in p = .085. This test is not significant.
7 The 95% confidence interval of the difference scores contains zero, confirming no significant difference.
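The confidence limits in Table 24-3C can be reproduced from the summary values alone; `scipy.stats.t.ppf` supplies the two-sided critical value t(7) = 2.365 used in the formula. A minimal sketch:

```python
import math
from scipy import stats

d_mean, s_d, n = 3.375, 6.23212, 8       # difference-score summary from Table 24-3
se = s_d / math.sqrt(n)                  # standard error of the difference scores
t_crit = stats.t.ppf(0.975, n - 1)       # two-sided critical value for alpha2 = .05

lower = d_mean - t_crit * se
upper = d_mean + t_crit * se
```

The interval runs from about –1.84 to 8.59 degrees; because it contains zero, the means are not significantly different.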
depending on the context of the variables being measured.

Don't confuse the use of d to indicate both a difference score and the effect size index. Unfortunately, these are conventional designations, but they represent different values.

Effect Size for the Independent t-Test

For the independent t-test (with equal variances), the effect size index is

d = (X̄1 – X̄2) / s
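As a sketch of the index above, the denominator s is commonly taken to be the pooled standard deviation of the two groups. The helper name `cohens_d` and the data below are illustrative, not from the text.

```python
import math

def cohens_d(x, y):
    """Effect size index d = (mean1 - mean2) / s, with s the pooled SD."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    s_pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / s_pooled

# Hypothetical scores: the groups differ by one point on average
d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

Here d comes out near –0.63, which the conventions listed above would read as a medium-to-large effect (the sign simply reflects which group was entered first).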
COMMENTARY

What the use of a p-value implies, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.
—Harold Jeffreys (1891–1989)
British mathematician and statistician
Author of Theory of Probability, 1939

The t-test has great appeal because of its simplicity—the basic comparison of two means, which is the foundation for the archetypical randomized clinical trial. Although the test is apparently straightforward, it still needs to be interpreted properly and reported clearly in a publication. Unfortunately, many authors do not provide adequate information to indicate what kind of test was run—whether it was independent or paired, one- or two-tailed, what values were used to estimate sample size (and how they were determined), and what level of significance was applied. Information on means, confidence intervals, exact p values, effect size, and power should also be included with results. All of these descriptors are essential for a full understanding of an analysis, interpretation of results, and clinical application. If the literature is going to provide evidence for decision making, it must foster statistical conclusion validity so that we can be confident in the work and the meaning of the results.
CHAPTER 25
Comparing More Than Two Means: Analysis of Variance

As knowledge and clinical theory have developed, clinical researchers have proposed more complex research questions, necessitating the use of elaborate multilevel and multifactor experimental designs. The analysis of variance (ANOVA) is a powerful analytic tool for analyzing such data when three or more groups or conditions are compared with independent groups or repeated measures. The ANOVA uses the F statistic, named for Sir Ronald Fisher, who developed the procedure.

The purpose of this chapter is to describe the application of the ANOVA for a variety of research designs. Although statistical programs can run the ANOVA easily, understanding the basic premise is important for using and interpreting results appropriately. An introduction to the basic concepts underlying analysis of variance is most easily addressed in the context of a single-factor experiment (one independent variable) with independent groups. Discussion will then follow with more complex models, including factorial designs, repeated measures, and mixed designs.

■ ANOVA Basics

Recall from our discussion of the t-test (Chapter 24) that differences between means are examined in relation to the distance between group means as well as the error variance within each group—the signal to noise ratio. The analysis of variance is based on the same principle, except that it is a little more complicated because the ratio must now account for the relationships among more than two means.

Also like the t-test, the ANOVA is based on statistical assumptions for parametric tests, including homogeneity of variance, normal distributions, and interval/ratio data. When assumptions for parametric statistics are not met, there are nonparametric analogs that can be applied (see Chapter 27).

The ANOVA can be applied to two-group comparisons, but the t-test is generally considered more efficient for that purpose. The results of a t-test and ANOVA with two groups will be the same. The t-test is actually a special case of the analysis of variance, with the relationship F = t².

■ One-Way Analysis of Variance

The one-way analysis of variance is used to compare three or more independent group means. The descriptor "one-way" indicates that the design involves one independent variable, or factor, with three or more levels.

➤ CASE IN POINT #1
Consider a hypothetical study of the differential effect of four intervention strategies to increase pain-free range of motion (ROM) in patients with elbow tendinitis. Through random assignment, we create 4 independent groups treated with ice, a nonsteroidal anti-inflammatory drug (NSAID), a splint support, or rest. We measure ROM at baseline and after 10 days of intervention, using the change in pain-free ROM as our outcome. We recruit a sample of 44 patients with comparable histories of tendinitis and randomly assign 11 subjects to each of the four groups.
In this study, the independent variable modality has four levels, incorporated in a single-factor, multigroup pretest-posttest design (see Chapter 16). The dependent variable is change in pain-free elbow ROM, measured in degrees. Hypothetical data for this study are reported in Table 25-1A.

The one-way ANOVA provides an overall test of the null hypothesis that there is no significant difference among the group means:

H0: μ1 = μ2 = μ3 = … = μk

where k is the number of groups or levels of the independent variable. In our example, k = 4.

The alternative hypothesis (H1) states that there will be a difference in the change in pain-free ROM between at least two of the four groups following the treatment period.

Partitioning the Variance

As we have discussed before, the total variance in a set of data can be attributed to two sources: a treatment

Table 25-1 One-Way ANOVA: Change in Pain-Free Elbow ROM Following Treatment for Tendinitis (k = 4, N = 44)

A. Data: Descriptive Statistics (ROM by group: Ice, NSAID, Splint, Rest)

B. SPSS Output: One-Way ANOVA
Test of Homogeneity of Variances

ANOVA (ROM)
Source           Sum of Squares   df   Mean Square   F        Sig.   Partial Eta Squared   Observed Power
Between Groups   3184.250          3   1061.417      12.058   .000   .475                  .999
Within Groups    3520.909         40   88.023
Total            6705.159         43

Figure 25–2 Curves representing the change in pain-free ROM for four modality groups and the combined sample. The center red line represents the grand mean, X̄G. The between-groups variance is the difference between group means and the grand mean.
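The F ratio in Table 25-1B can be reconstructed from its sums of squares and degrees of freedom. A minimal sketch, using the values from the table above:

```python
# Sums of squares and df from Table 25-1B
ss_between, df_between = 3184.250, 3     # k - 1 = 3 groups df
ss_within, df_within = 3520.909, 40      # N - k = 40 error df

ms_between = ss_between / df_between     # mean square between groups
ms_within = ss_within / df_within        # mean square within groups (error)
F = ms_between / ms_within               # the signal-to-noise ratio
```

This reproduces the mean squares (1061.417 and 88.023) and F = 12.058 reported in the output.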
Two forms of effect size indices are used with the one-way ANOVA, eta squared, η², and Cohen's f. These indices represent the proportion of the total variance associated with the independent variable. These effect size indices can be calculated using the SS values in the ANOVA:

η² = SSb/SSt        f = √(SSb/SSe)

They can be converted as follows:

f² = η²/(1 – η²)        η² = f²/(1 + f²)

Different procedures for power analysis will use either of these values. SPSS uses η² as the effect size (see Table 25-1 6), and the online G*Power calculator uses f. For the data in Table 25-1,

η² = 3184.25/6705.16 = 0.475        f = √(3184.25/3520.91) = 0.951

Conventional effect sizes have been proposed for these indices:2

Small    η² = .01   f = .10
Medium   η² = .06   f = .25
Large    η² = .14   f = .40

Using these criteria, both indices in this example show a substantially large effect.

Box 25-1 Power and Sample Size for the One-Way ANOVA

Post Hoc Analysis: Determining Power
The ANOVA output presented in Table 25-1 includes effect size (partial eta squared) and power. This is an option that can be requested when running the ANOVA in SPSS, and therefore we do not need to do further analysis to determine power. In this case we have achieved 99.9% power (can't do much better than that!).
If we did not have this information, however, we could use G*Power, entering f = .951 (converting from η²p), α = .05, N = 44 with 4 groups, resulting in power = 99.9% (see chapter supplement).

A Priori Analysis: Estimating Sample Size
If we wanted to determine a reasonable sample size as part of planning this study, we would need to project an effect size, the number of groups in the design, α level, and the desired level of power. This requires some foundation based on prior research to estimate what can be expected. Assume we have found a systematic review that reports on several studies showing medium effect sizes after 8 weeks of treatment with various modalities.3 Based on this information, we enter f = .25 (a medium effect), α = .05, four groups, and 80% power in G*Power, resulting in a total sample size of N = 180, or n = 45 per group (see chapter supplement).
In retrospect, we know we would not have needed that large a sample because our effect size was much larger than we projected. However, that is not something we would know beforehand—it's a best guess!
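The conversions between η² and f described above can be checked directly from the SS values in Table 25-1. Because SSt = SSb + SSe, the two routes give identical results; a short sketch:

```python
import math

ss_b, ss_e = 3184.25, 3520.91      # between-groups and error SS (Table 25-1)
ss_t = ss_b + ss_e                 # total SS = 6705.16

eta_sq = ss_b / ss_t               # eta squared = .475
f = math.sqrt(ss_b / ss_e)         # Cohen's f = .951

# Converting between the two indices
f_from_eta = math.sqrt(eta_sq / (1 - eta_sq))
eta_from_f = f ** 2 / (1 + f ** 2)
```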
When this is done, the calculated value of F is given, along with the associated degrees of freedom, probability, and effect size.

Factorial layout: Modality (Factor A) — Ice (A1): X̄A1B1, X̄A1B2, marginal X̄A1; Splint (A2): X̄A2B1, X̄A2B2, marginal X̄A2.

Figure 25–4 Main effects for modality and medication in a two-way factorial design with N = 60, showing marginal means for each group: Modality (A) — Ice X̄A1 (n = 20), Splint X̄A2 (n = 20), Rest X̄A3 (n = 20); Medication (B) — NSAID X̄B1 (n = 30), No Med X̄B2 (n = 30).

effect of each independent variable in the analysis. Comparison of the marginal means within each factor indicates how much of the variability in all 60 scores can be attributed to the overall effect of modality or medication alone.

In addition to main effects, the factorial design has the added advantage of being able to examine combinations of levels of each independent variable to determine if there are differential effects. These are referred to as interaction effects. Interaction is present when the effect of one variable is not constant across different levels of the second variable.

The ANOVA can be expanded to include more than two independent variables, although it is rare to see analyses beyond three dimensions. A three-way ANOVA would generate three main effects, three double interactions, and one triple interaction across all three variables. Although this type of design does allow for more comprehensive comparisons, it can become overly complex, requiring a large sample to generate sufficient power.

No Interaction

To illustrate this concept, consider first the hypothetical means given for the six treatment groups in Table 25-3A. Each mean represents a unique combination of modality and medication condition. We can plot these means to more clearly illustrate these relationships (Table 25-3B). The plot on the left shows the three modality groups along the X-axis. The means for change in ROM for each medication condition are plotted at each level of modality, with lines representing each main effect.

Note that in this example, the lines are parallel, which means that the pattern of response at each medication condition is consistent across all levels of modality. We can reverse the plot, as shown on the right, with medication condition on the X-axis, demonstrating a similar constant pattern for each level of modality. These graphs demonstrate a situation where there is no interaction. Ice will generate the highest response under both medication conditions, and the NSAID scores are higher across all levels of modality.

ANOVA Summary Table

The results of the two-way ANOVA are shown in Table 25-3C. Note that there are three between-groups sources of variance listed, two main effects and one interaction effect.

In a two-way ANOVA, these sources of variance are usually listed in the summary table according to the name of the independent variable. Thus, for our example, type of modality and medication condition are listed as main effects. The interaction between two variables may be signified by * or ×, such as modality * medication, read "modality by medication." The error term represents the unexplained variance among all subjects across all combinations of the two variables.

Degrees of Freedom

The degrees of freedom associated with each main effect is one less than the number of levels of that independent variable (k – 1). With two independent variables, we use (A – 1) degrees of freedom for Factor A, and (B – 1) for Factor B, where the letters A and B represent the number of levels of each factor. The number of degrees of freedom for the interaction is the product of the respective degrees of freedom for the two variables, (A – 1)(B – 1).

The total degrees of freedom will always be one less than the total number of observations, N – 1. The error degrees of freedom can be determined by using (A)(B)(n – 1) with equal-size groups, or by subtracting the combined between-groups degrees of freedom from the total degrees of freedom.

For modality with three levels, df = 2. For medication with two levels, df = 1. Therefore, the interaction effect in the modality study has 2 × 1 = 2 degrees of freedom. With N = 60 (n = 10 per group), dft = 59. For this example, dfe = (3)(2)(9) = 54, which is also 59 – 5 = 54.

The F Statistic

An F value is calculated for each main effect and interaction effect with an associated p value. The F ratio for each effect is calculated by dividing the MSb for that effect by MSe for the common error term.
Table 25-3 Two-Way ANOVA for the Effect of Medication and Modality on Pain-Free ROM: Showing No Interaction (N = 60)

A. Data
                 Medication Condition
Modality         NSAID     No Med    Marginal Means
Ice              45.00     36.00     40.50
Splint           33.00     25.00     29.00
Rest             22.00     12.00     17.00
Marginal Means   33.33     23.33

B. Plots of cell means, by modality (left) and by medication (right).
Critical Value

By looking at the summary table, we can see that the probabilities associated with the two main effects are significant. The interaction effect is not significant (p = .660). This is not surprising given the pattern of means across the groups.

Each F ratio is compared to a critical value from Appendix Table A-3 based on the degrees of freedom for that effect (dfb) and the error term. For modality and modality*medication (df = 2,54), the critical value is approximately 3.15. For medication (df = 1,54), the critical value is approximately 4.00. Because the table does not include all possible df values, these values are estimated using the closest value for dfe = 60.

The F ratios for the main effects of modality and medication are higher than the critical values. Therefore, this ANOVA demonstrates significant main effects. The F value for the interaction is smaller than the critical value and is, therefore, not significant. These conclusions are consistent with the p values listed in the table.

Interpreting F

We look at the outcome of a two-way ANOVA by considering the significance of each main and interaction effect. We focus on the interaction effect first, as the combination of variables is typically the primary reason for using the two-way design. If the interaction is not significant, as in this example, we then go to each main effect.

Further analysis requires a multiple comparison test to compare the three marginal means for the main effect of modality. There is no need for further tests for the medication condition since there are only two levels. The NSAID condition results in a significantly better improvement in ROM than no medication.

Interaction

Now consider a different set of results for the same study, given in Table 25-4. The plots for these data show lines that are not parallel—the pattern of one variable is not constant across all levels of the second variable. We can see in the plots that the combination of ice and NSAID has the greatest effect, whereas all the other conditions have means that are lower and fairly close.

Researchers generally propose a factorial design when the research question focuses on the potential interaction between two variables. Therefore, when the interaction is significant, main effects are usually ignored.

When lines are not parallel or when they cross, as shown here, interaction is present (p = .005). In this example, it is not the use of one modality alone that makes the treatment more effective. It is the combination of ice with an NSAID. Although the data clearly suggest that the ice/NSAID combination has a better outcome than all other combinations, we will have to use multiple comparison tests to determine which of these six means are significantly different from each other (see Chapter 26).

Power and Effect Size

For a two-way analysis, effect size and power estimates must be determined for each main and interaction effect. In a two-way design, the effect size indices f or η² have to be partitioned to represent the proportion of total variance that is accounted for by each effect. Therefore, we use partial eta squared, η²p, and f based on separate sums of squares:

η²p = SSb/(SSb + SSe)        f = √(SSb/SSe)

where SSb is the sum of squares for the particular effect being evaluated, and SSe is the common error sum of squares.

The ANOVA summary tables show η²p and power estimates for each of the three effects. Power analyses for data in Table 25-4 are illustrated in Box 25-2.

➤ Results
A two-way ANOVA was used to analyze the effect of medication and modality on pain-free ROM. The ANOVA shows significant outcomes for the main effect of modality (F(2,54) = 5.97, p = .005, η²p = .181), the main effect of medication (F(1,54) = 5.77, p = .02, η²p = .097), and their interaction (F(2,54) = 5.79, p = .005, η²p = .177).

Alternatively, a value for f can be reported instead of partial eta squared. This statement would be followed by results of a multiple comparison across the six interaction means (see Chapter 26).

■ Repeated Measures Analysis of Variance

Up to now we have discussed the ANOVA only as it is applied to completely randomized designs, also called between-subjects designs because all sources of variance represent differences between subjects. Clinical investigators, however, often use repeated factors to evaluate the performance of each subject under several experimental conditions. The repeated measures design is logically applied to study variables where practice or carryover effects are minimal and where differences in an individual's performance across levels are of interest.
Table 25-4 Two-Way ANOVA for the Effect of Medication and Modality on Pain-Free ROM Showing Interaction (N = 60)

A. Data
                 Medication
Modality         NSAID     No Med    Marginal Means
Ice              50.00     22.00     36.00
Splint           20.00     21.00     20.50
Rest             24.00     23.00     23.50
Marginal Means   31.33     22.00

B. Plots of cell means, by modality (left) and by medication (right).

1 The two main effects and the interaction effect are significant. Interactions are illustrated in graphs.
2 Using conventional effect sizes, values of η²p for all three effects represent medium to large effects. The power for modality and the interaction is above 80%. Power for medication is lower (65%), but this is irrelevant because the effect was significant.

ANOVA and mean plots were run with General Linear Model/Univariate. Portions of the output have been omitted for clarity.
This type of study can involve one or more independent variables (see Chapter 16).

In a repeated measures design, all subjects are tested under k treatment conditions. The repeated measures analysis of variance is modified to account for the correlation among successive measurements on the same individual. For this reason, such designs are also called within-subjects designs. The statistical hypotheses proposed for repeated measures designs are the same as those for independent samples, except that the means represent treatment conditions rather than groups. The simplest repeated measures design involves one independent variable, where all levels of treatment are administered to all subjects.
Box 25-2 Power and Sample Size for the Two-Way ANOVA

Post Hoc Analysis: Determining Power
SPSS can generate η²p and power estimates if requested as part of the ANOVA, in which case no further post hoc analysis is necessary.
For the data in Table 25-3, showing no interaction, the two main effects have large effect sizes and power of 1.00. However, the interaction effect was small, with extremely low power of 11.5%.
For the data in Table 25-4, we found significant main and interaction effects. All three effect sizes were medium to large. The main effect of modality and the interaction effect both showed power above .80. The main effect of medication had lower power of 65.5%, but this is irrelevant because the effect was significant.
Data from Table 25-4 will be used to illustrate G*Power for post hoc analysis. We enter parameters including effect size f, total sample size (N), and the number of groups (k) for each main and interaction effect. Effect sizes are calculated using SS values from the ANOVA. G*Power will convert values of η²p to f (see chapter supplement for calculations). For modality (N = 60, k = 3), f = .470, power = .867; for medication (N = 60, k = 2), f = .327, power = .655; for the interaction (N = 60, k = 3), f = .463, power = .851.

A Priori Analysis: Estimating Sample Size
In planning this study, we want to determine a total sample size to achieve sufficient power for the interaction means. Given that we are interested in comparing combinations of modality and medication, we focus this estimate on six groups. To use G*Power, we must specify parameters of effect size f, α = .05, power = .80, and the number of groups in the design, in this case k = 6 for the interaction effect. If we propose a medium effect, f = .25, we generate a sample size estimate of N = 158, or 26 subjects per group (see chapter supplement). In retrospect, however, we would not have needed that large a sample for the data in Table 25-4, as we saw a significant interaction, with f = .463 (η²p = .177), a much higher effect than we projected.
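The η²p-to-f conversion that G*Power performs is a one-line formula. A minimal Python sketch (the helper names are ours, not G*Power's) that reproduces the values in this box:

```python
import math

def eta2p_to_f(eta2p):
    # Cohen's f from partial eta-squared: f = sqrt(eta2p / (1 - eta2p))
    return math.sqrt(eta2p / (1.0 - eta2p))

def f_to_eta2p(f):
    # Inverse conversion: eta2p = f^2 / (1 + f^2)
    return f * f / (1.0 + f * f)

# Interaction in Table 25-4: eta2p = .177 corresponds to f close to .463
# (small discrepancies reflect rounding of the reported eta2p)
interaction_f = eta2p_to_f(0.177)
```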
Table 25-5  One-Way Repeated Measures ANOVA: Elbow Flexor Strength Tested in Three Forearm Positions (N = 9)
A. Data
Descriptive Statistics
N Mean Std Deviation
Pronation 9 17.33 10.15
Neutral 9 27.56 10.42
Supination 9 29.11 11.34
B. ANOVA Summary
Source        SS        df    MS       F       Sig.    η²p 4   Power 4
Subjects 5    2604.00    8    325.50
Position 3     736.89    2    368.44   50.34   .000    .863    1.000
Error          117.11   16      7.32
1 Mauchly’s Test of Sphericity is used to determine if adjustments are needed to degrees of freedom for the repeated measure. In this case, it is not significant (p = .239), and therefore no further adjustment is necessary.
2 If Mauchly’s test is significant, two versions of epsilon are used for the adjustment of df. In this instance, they are not applied because
the sphericity test is not significant. Separate output is provided showing df and p for the condition when sphericity is assumed, and
with each of the correction factors, showing the change in the df and p values (see chapter supplement).
3 The repeated measure of Forearm Position is significant.
4 Power and effect size estimate for the repeated measures test.
5 The effect of “subjects” as a source of variance is presented as “Between-Subjects Effects.” The error term in this analysis is the Subjects effect. No F value is generated for this effect, but one could be calculated by dividing the MS for subjects by the MS error. This effect is often omitted from ANOVA summary tables because it is not meaningful to the analysis.
This analysis was run in SPSS using the General Linear Model. Portions of output have been omitted for clarity.
effect of forearm position with three levels, dfb = 2, the within-subjects error, dfe = (2)(8) = 16, and the error component of between-subjects, df = 9 – 1 = 8.

The F Statistic
The sums of squares for the treatment effect and the error effect are divided by their associated degrees of freedom to obtain the mean squares. These MS values are then used to calculate the F ratio. For the data in Table 25-5, F = 50.34.

Critical Value
The critical value for the F ratio for treatment is located in Appendix Table A-3, using the degrees of freedom for treatment (dfb) and the degrees of freedom for the error term (dfe). Therefore, the critical value for this effect will be F(2,16) = 3.63. The calculated F ratio exceeds this critical value and, therefore, is significant.

The output for this analysis shows results for the between- and within-subjects sources of variance. The difference across the three positions is significant (p < .001, Table 25-5 3). We conclude that elbow flexor strength does differ across forearm positions. It will be appropriate at this point to perform a multiple comparison test on the three means to determine which forearm positions are significantly different from the others (see Chapter 26).

We can calculate an F ratio for the between-subjects effect, but this is generally not a meaningful test. We expect subjects to differ from each other, and it is of no experimental interest to establish that they are different. The F ratio for between-subjects is often omitted from summary tables (Table 25-5 5), and this effect is typically ignored in the interpretation of data for a repeated measures test. Although not important for experimental studies, the effect of subjects is of interest for some analyses, such as the intraclass correlation coefficient (see Chapter 32).

Variance Assumptions with Repeated Measures Designs
Past discussions have addressed the important assumption of homogeneity of variances among treatment groups. This assumption is also made with repeated measures designs, but we cannot examine variances of different groups because only one group is involved. Instead, the variances of interest reflect difference scores across treatment conditions within a subject. For example, with three repeated treatment conditions, a1, a2, a3, we will have three difference scores:

a1 – a2    a1 – a3    a2 – a3

With this approach, the homogeneity of variance assumption is called the assumption of sphericity, which states that the variances within each pair of difference scores will be relatively equal and correlated with each other.

Because independent tests with equal sample sizes are generally considered robust to violations of homogeneity of variance, one might think that violations of this assumption would be unimportant for repeated measures, where treatment conditions must have equal sample sizes. This is not the case, however. Because the repeated measures test examines correlated scores across treatment conditions, it is especially sensitive to variance differences, biasing the test in the direction of Type I error. In other words, the repeated measures test is considered too liberal when variances are not correlated, increasing the chances of finding significant differences above the selected α level.

Tests of Sphericity
The sphericity assumption is evaluated using Mauchly’s Test of Sphericity (Table 25-5B), which is tested using the chi-square statistic. This test is only relevant if the ANOVA results in a significant F ratio. If Mauchly’s test is not significant, the sphericity assumption is met, and no further adjustment is necessary. This is the case in our example (p = .239, Table 25-5 1).

However, if Mauchly’s test is significant (p < .05), the sphericity assumption is not met, and a correction is necessary. This is achieved by decreasing the degrees of freedom used to determine the critical value of F, thereby making the critical value larger. If the critical value is larger, then the calculated value of F must be larger to achieve significance. This compensates for bias toward Type I error by making it harder to demonstrate significant differences.

Sphericity Corrections
When Mauchly’s test is significant, the degrees of freedom for the F ratio are adjusted by multiplying them by a correction factor given the symbol epsilon, ε (Table 25-5 2). Two different versions of epsilon are used: the Greenhouse–Geisser correction6 and the Huynh–Feldt correction.7

Because the test for sphericity is not significant in our example, we are not concerned about this adjustment. If the test for sphericity had been significant, however, the dfb would be multiplied by ε, a different critical value would be applied, and a different p value would be used as the value for interpretation. Computer output for this analysis will generate results for each of these conditions, and the researcher can determine which results to use.

In studies that use a repeated measures ANOVA, authors should mention analysis of the sphericity assumption in their results section. Without such information, a significant finding cannot be properly interpreted.
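The variance partitioning for a one-way repeated measures ANOVA (subjects, treatment, error) can be sketched in a few lines of Python. This is a plain computation for balanced data with no missing values, not the SPSS procedure itself; an epsilon correction (Greenhouse–Geisser or Huynh–Feldt) would simply scale both df terms before looking up the critical F.

```python
def rm_anova_oneway(data):
    """One-way repeated measures ANOVA for data given as a list of
    subject rows, each containing k condition scores.
    Returns (F, df_treatment, df_error)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    subj_means = [sum(row) / k for row in data]
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subj_means)   # between subjects
    ss_treatment = n * sum((m - grand) ** 2 for m in cond_means)  # repeated factor
    ss_error = ss_total - ss_subjects - ss_treatment              # residual
    df_t, df_e = k - 1, (k - 1) * (n - 1)
    f = (ss_treatment / df_t) / (ss_error / df_e)
    return f, df_t, df_e
```

For the design in Table 25-5 (n = 9, k = 3), this partition yields df = 2 and 16, matching the critical value F(2,16) used above.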
Two-Factor Repeated Measures Designs

The concepts of repeated measures analysis can also be applied to multifactor experiments. With two repeated factors, the within-subjects design is an extension of the single-factor repeated measures design.

➤ CASE IN POINT #4
Suppose we redesigned our previous example to study isometric elbow flexor strength with the forearm in three positions and with the elbow at two different angles. We would then be able to see if the position of the forearm had any influence on strength when combined with different elbow positions.

Because all subjects would be tested in all conditions, this is a two-way repeated measures design. In this 3 × 2 design with n = 8, each subject would be tested six times, for a total of 48 measurements. Results for hypothetical data are shown in Table 25-6.

With two repeated factors, variance is partitioned to include a main effect for each independent variable and their interaction. Each of these effects is listed as a source of variance in the ANOVA table (see Table 25-6 1). Like the one-way repeated measures ANOVA, the effect of between-subject differences is first removed from the total variance. Each main and interaction effect is tested against its own error term, representing the random or chance variations among subjects for each treatment effect.

The assignment of degrees of freedom for each of these variance components follows the rules used for the regular two-way analysis of variance: for each main effect, df = k – 1; for each interaction effect, df = (A – 1)(B – 1).

As shown in Table 25-6B, each repeated factor is essentially being tested as it would be in a single-factor experiment. By separating out an error component for each treatment effect, we have created a more powerful test than we would have with one common error term; that is, the error component is smaller for each separate treatment effect than it would be with a combined error term. Therefore, F ratios tend to be larger. In this example, the main effect of forearm and the interaction are significant (Table 25-6 2).

Effect size is given by η²p = SSb/(SSb + SSe), where SSb is the repeated factor and SSe is the error term for that factor. For example, for the two-way repeated measures ANOVA (Table 25-6), for the effect of forearm position,

η²p = 1942.758/(1942.758 + 3236.576) = .375

f = √(1942.758/3236.576) = .775

These effect sizes would be considered large, and this effect was significant.

➤ Results
A two-way repeated measures ANOVA was used to look at the effect of forearm and elbow position on elbow flexor strength. The main effect of forearm position was significant (F(2,20) = 6.003, p = .009, η²p = .375); there was no significant difference between the two elbow positions (F(1,10) = 1.433, p = .259, η²p = .125). The interaction effect was significant (F(2,20) = 3.559, p = .048, η²p = .263).

This analysis will be followed with a multiple comparison comparing the six means for combinations of forearm and elbow position (see Chapter 26).

Application of these values using G*Power for power and sample size estimates is illustrated in the Chapter 25 Supplement.

■ Mixed Designs

When a single experiment involves at least one independent factor and one repeated factor, the design is called a mixed design. A mixed ANOVA, also called a split-plot ANOVA, is therefore used. The overall format for the ANOVA is a combination of between-subjects (independent factors) and within-subjects (repeated factors) analyses. The independent factor is analyzed as it would be in a regular one-way ANOVA, pooling all data for the repeated factor. The repeated factor is analyzed using techniques for a repeated measures analysis.
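The two effect-size formulas for a repeated factor can be checked directly from the SS values in the ANOVA table. A small sketch (helper names are ours) using the forearm-position values from Table 25-6:

```python
import math

def partial_eta_squared(ss_effect, ss_error):
    # eta2p = SS_effect / (SS_effect + SS_error)
    return ss_effect / (ss_effect + ss_error)

def cohens_f_from_ss(ss_effect, ss_error):
    # f = sqrt(SS_effect / SS_error), algebraically sqrt(eta2p / (1 - eta2p))
    return math.sqrt(ss_effect / ss_error)

eta2p = partial_eta_squared(1942.758, 3236.576)  # -> .375
f = cohens_f_from_ss(1942.758, 3236.576)         # -> .775
```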
Table 25-6  Two-Factor Repeated Measures ANOVA: Elbow Flexor Strength in Three Forearm Positions and Two Elbow Positions (N = 8)
A. Data
                Forearm Position
Elbow Angle     Pronation   Neutral   Supination   Marginal Means
90°             25.00       33.00     34.38        30.79
45°             17.38       28.13     30.25        25.25
Marginal Means  21.19       30.57     32.32
1 Because the main effect of forearm position is the only significant effect, the marginal means for forearm position are the only means
of interest.
2 Each repeated measure is tested against its own error term, which is the interaction between that effect and the effect of subjects.
3 The between-subjects effect is often eliminated from the summary table in a research report. It does not provide important
information to the interpretation of the main or interaction effects, but is used to determine the error term for each effect. This effect is
expected to be significant, as it reflects that the subjects’ scores were different from each other. In a repeated measures analysis, this
is not of interest. The meaningful comparisons are across subjects within the repeated factors.
This analysis was run in SPSS using the General Linear Model. Some portions of the output have been omitted for clarity.
Repeated measures often reflect testing subjects at different levels of a treatment variable, as illustrated in previous examples (see Focus on Evidence 25-1). However, many research questions include follow-up measures to determine if outcomes change over time. A common approach utilizes a two-way mixed design to assess changes in the independent factor, with time being an independent variable that is a repeated measure, each interval representing a level of the independent variable.

When reporting the analysis in a mixed design, the test should be clearly described as a mixed ANOVA or as a two-way ANOVA with one repeated measure.
Hypothetical data are reported in Table 25-7. The first part of the analysis for this study is the within-subjects analysis of all factors that include the repeated factor (Table 25-7 1). This section lists the main effect for time, the interaction between time and treatment, and a common error term to test these two effects. In this example, the main effect of time is significant (p < .001), but the interaction effect is not (p = .073).

Given that time is the only significant effect, we will need to do a multiple comparison to assess differences across the three time periods (see Chapter 26).

Power and Effect Size
Power and effect size estimates for the two-way mixed design reflect the same components as repeated measures and independent designs.
Focus on Evidence 25-1
All Play and No Work…

The health benefits of physical activity in children are well documented, although studies suggest that children are not meeting the recommendation for 60 minutes of moderate to vigorous physical activity each day. Wood et al9 were interested in the impact of school playing environment on the duration of vigorous physical activity during recesses. They developed a two-way mixed design to study the differential effect of gender and play environment: school playgrounds versus natural field settings.

They studied 23 children between 8 and 9 years old in one elementary school in the United Kingdom. Using a crossover design, students were split into two groups, each participating in play in one of the two environments for 2 weeks and then switched to the other setting. Children wore accelerometers to monitor time of physical activity in minutes.

Results were presented by gender and play environment (mean and 95% CI):

             BOYS               GIRLS
Playground   15.4 (13.5-17.3)   9.6 (7.7-11.4)
Field        18.8 (16.3-21.2)   16.1 (13.8-18.5)

Physical activity was higher in the field for both boys and girls, and boys were consistently more active than girls. The authors found significant main effects but no significant interaction:

• Environment (p < .001, η²p = .616)
• Gender (p = .002, η²p = .371)
• Gender × Environment (p = .078, η²p = .140)

How meaningful are these findings? Is it worth considering changes in school-based activity? If we consider effect size, the environment had a significantly greater effect for boys and girls. However, we can also look at the actual values and confidence intervals to determine if the observed differences are important. The actual difference in playtime between playground and field for boys, for example, is 3.4 minutes, and for girls 6.5 minutes, with relatively small confidence intervals. With a goal of 60 minutes of physical activity each day, are these differences impactful? Do these differences warrant policy or financial considerations that would influence a school being able to offer alternative environments? These interpretations are not black and white. It is never just about significance; decisions must be put in context.

Data from Wood C, Gladwell V, Barton J. A repeated measures experiment of school playing environment to increase physical activity and enhance self-esteem in UK school children. PLoS One 2014;9(9):e108701.
Table 25-7  Two-Way (3 × 2) Mixed ANOVA: Symptom Improvement over Time With and Without Exercise (N = 24)
A. Data
                        Time
Intervention    Baseline   6 Weeks   Follow-up   Marginal Means
Exercise        23.63      30.88     45.38       33.09
Stretch         20.00      26.25     33.88       26.71
Control         17.38      28.13     30.25       25.25
Marginal Means  20.34      28.42     36.50
B. ANOVA (Measure: STRENGTH, Sphericity Assumed) and Tests of Between-Subjects Effects 2

C. Summary Table
This table is not part of SPSS output, but is created to show how the above data might be combined in a concise way in a research article.

Source              SS         df    MS         F        Sig.   η²p    Power
Within subjects 1
  Time              3136.333    2    1568.167   40.320   .000   .658   1.000
  Time × Treatment   360.833    4      90.208    2.319   .073   .181    .623
  Error             1633.500   42      38.893
1 The within-subjects effect includes the repeated measure (Time) and the interaction of the repeated measure with the independent measure (Treatment). Only the main effect of time is significant (p < .001).
2 The between-groups effect is computed for the main effect of Treatment, which is not significant (p = .097). This section is equivalent to a one-way analysis of variance.
This analysis was run in SPSS using the General Linear Model. Portions of the printout have been omitted for clarity.
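Each F in the summary table is simply MS_effect/MS_error, with MS = SS/df, so the table can be reproduced from its SS and df columns. A minimal sketch using the within-subjects values from Table 25-7:

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    # F = (SS_effect / df_effect) / (SS_error / df_error)
    return (ss_effect / df_effect) / (ss_error / df_error)

f_time = f_ratio(3136.333, 2, 1633.500, 42)        # ~40.32
f_interaction = f_ratio(360.833, 4, 1633.500, 42)  # ~2.319
```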
COMMENTARY
Sometimes the questions are complicated and the answers are simple!
—Theodor Geisel (Dr. Seuss) (1904-1991)
American author
To understand clinical phenomena, it is necessary to explore various approaches to treatment and how responses vary under different conditions or with different types of patients. Research questions can become more complicated when we try to incorporate several types of interventions or conditions at one time. We must be careful, however, not to make studies so complicated that we cannot make clear comparisons.

Understanding reports that include ANOVA can help to clarify the design of a study. For example, the number of subjects actually tested and the number of groups or measurements can be discerned from degrees of freedom. This is important because the data that are finally analyzed in a study are often different than originally intended in a design, especially when subjects drop out or data are lost. If effect sizes are not reported, they can often be calculated using data provided in a research report. These can be used for clinical interpretation of the results.

The analysis of variance provides researchers with a statistical tool that can adapt to a wide variety of design situations. This chapter has covered only the most common applications. Many other designs, such as nested designs, randomized blocks, and studies with unequal samples, require mathematical adjustments in the analysis that are too complex to cover here. As designs get more complicated, the nuances of analysis of variance can be substantial, and a statistician should be consulted to make sure that the right tests are being used.

When the analysis of variance results in a significant finding, researchers will want to pursue the analysis to determine which specific levels of the independent variables are different from each other. Multiple comparison tests, designed specifically for this purpose, are described in the next chapter.
CHAPTER 26
Multiple Comparison Tests
When an analysis of variance (ANOVA) results in a significant F ratio, the researcher is justified in rejecting the null hypothesis and concluding that not all means are equal. However, this outcome tells us nothing about which means are significantly different from which other means. The purpose of this chapter is to describe the most commonly used multiple comparison tests for a variety of designs. Several procedures are available, named for the individuals who developed them. Each test involves contrasts of pairs of means, which are tested against a critical value to determine if the difference is large enough to be significant. These tests can be run with one-way or multidimensional designs, with independent or repeated measures.

With α = .05, we limit ourselves to a 5% chance that we will experience the random event of finding a significant difference for a given comparison when none exists. We must differentiate this per comparison error rate (αPC) from the situation where α is set at .05 for each of several comparisons in one experiment. If we test each one at α = .05, the potential cumulative error for the set of comparisons is actually greater than .05. This cumulative probability has been called the familywise error rate (αFW) and represents the probability of making at least one Type I error in a set or “family” of statistical comparisons.

Think of it this way: you are riding on a train. Quality testing has been performed on several components, all done by different inspectors. One found that there is only a 5% risk of failure for the engines. Ok, not bad. Knowing your statistics, you consider 5% a small risk. However, you didn’t know that another inspector found a 5% possibility of a problem with the tracks, and still another found a 5% risk for failure of communication systems. How does it feel now? What is your cumulative risk that at least one problem would occur? It’s a lot more than 5%.
6113_Ch26_383-399 02/12/19 4:39 pm Page 384
Bonferroni Correction: A correction for Type I error made by dividing alpha by the number of comparisons, thereby requiring a much smaller p value to attain significance. Used with the t-test for multiple comparisons but can be used with other tests as well.
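The Bonferroni correction is one line of arithmetic; a minimal sketch (helper names are ours):

```python
def bonferroni_alpha(alpha, m):
    # Per-comparison alpha: divide the overall alpha by the number of comparisons
    return alpha / m

def bonferroni_adjusted_p(p, m):
    # Equivalent form: inflate each p value by m (capped at 1.0) and compare
    # the adjusted value against the original alpha
    return min(1.0, p * m)
```

With three comparisons at an overall α = .05, each test must reach p < .0167 to be declared significant.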
Table 26-2  One-Way ANOVA: Change in Pain-Free ROM with Treatment for Elbow Tendinitis (N = 44)
Table 26-3  Abridged Table of Critical Values of the Studentized Range Statistic, q (α = .05)

              r
dfe      2      3      4      5
10     3.15   3.88   4.33   4.65
20     2.95   3.58   3.96   4.23
30     2.89   3.49   3.85   4.10
40     2.86   3.44   3.79   4.04
60     2.83   3.40   3.74   3.98
120    2.80   3.36   3.68   3.92
∞      2.77   3.31   3.63   3.86

r = number of means (HSD) or comparison interval (SNK). This is an abridged version of this table for purposes of illustration. Refer to the Chapter 26 Supplement for the full table.

where n is the number of subjects in each group (assuming equal sample sizes), and q = 3.79 for r = 4 and dfe = 40 (see Table 26-3). The MSD is compared with each pairwise mean difference. Absolute differences that are equal to or greater than this value are significant.

See the Chapter 26 Supplement for calculation of the harmonic mean, which is used to calculate the MSD when sample sizes are not equal.

Computer Output
SPSS uses two formats to display results of the Tukey HSD. In Table 26-4 2, multiple comparisons for each pairwise contrast include the mean difference and its significance, and a confidence interval (CI). If the mean difference is greater than 10.72, it is significant. For the Tukey procedure, the same MSD is applied for all comparisons.

The 95% confidence interval for each comparison is calculated using:

95% CI = mean difference ± MSD

Confidence intervals for comparisons of the mean for rest with means for ice, splint, or NSAID do not contain zero and therefore represent significant differences. All others are not significant. For example, for the difference between ice and NSAID,

95% CI = –1.09 ± 10.72 = –11.81 to 9.63

This confidence interval contains zero, denoting no difference between means, which indicates that the difference between ice and NSAID means is not significant.

An alternative presentation shows homogeneous subsets of means (see Table 26-4 4). In this table, each subset listed in the same column represents means that are not significantly different. Means that are listed in separate columns are different from one another. Means are listed in ascending order. These results show that the mean for the rest group is significantly different from the three other means.

Results of multiple comparison tests are often reported in narrative form, although they may also be included in data tables in an article. Some authors include graphs showing which means are different from each other. Results should include the ANOVA outcome, followed by specific description of the multiple comparison test and significant differences among groups.

➤ Results
The one-way ANOVA demonstrated a significant difference among the four group means (F(3,40) = 12.06, p < .001, η² = .475). The post hoc Tukey multiple comparison procedure showed that the change in pain-free ROM was not significantly different across ice (X̄ = 44.18 ± 10.87), NSAID (X̄ = 45.27 ± 8.40), and splint (X̄ = 35.27 ± 9.51) treatments. All three of these treatments produced significantly greater ROM change than the rest condition (X̄ = 24.09 ± 8.54).

Student-Newman-Keuls Method
The Student-Newman-Keuls (SNK) test (also called the Newman-Keuls test) is similar to the Tukey method, except that it uses an adjusted per comparison error rate. Therefore, α specifies the Type I error rate for each pairwise contrast rather than for the entire set of comparisons. The result is a more powerful test but with greater chances of Type I error, especially with a large number of comparisons.

Comparison Intervals
The SNK procedure is also based on the studentized range q. However, values of q are used differently for each contrast, depending on the number of adjacent means (r) within an ordered comparison interval.

This method is considered a sequential procedure because comparisons are made in a “step” fashion, starting with the two most extreme means and then examining successively smaller subsets. To illustrate how this is applied, consider the four sample means for the tendinitis study, ranked in ascending size order: (4) rest, (3) splint, (1) ice, (2) NSAID (see Fig. 26-1).

If we compare the two smaller means, the comparison interval for rest → splint includes two adjacent means. Therefore, r = 2 for that comparison. If we compare the largest and smallest means, the interval for rest → NSAID contains four adjacent means (4-3-1-2), and so r = 4. Table 26-5B shows these intervals for each pair of means.

In contrast to Tukey’s approach, which uses one critical difference for all comparisons, the SNK test will use a separate critical difference for each comparison interval.
Table 26-4  Significant Differences for Tukey’s HSD (α = .05): Change in Pain-Free ROM with Elbow Tendinitis (N = 44)
1 MSD = 10.72
1 All mean differences greater than the MSD 10.72 are significant.
2 Each pairwise mean difference (labeled I – J) is listed with its corresponding p value. The table is redundant because each group is
used as a reference. Minus signs are ignored. Significant effects are highlighted and tagged with an asterisk.
3 95% confidence intervals are presented for each mean difference. If the interval contains zero, it would indicate that the difference
is not significant.
4 Subsets in the same column are considered “homogeneous,” indicating they are not different from each other. Means not in the
same column are significantly different. Means for Splint, Ice, and NSAID are not different, but all three are different from Rest.
These data are taken from Table 26-2. Portions of the output have been omitted for clarity.
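Using the values reported for this analysis (MSe = 88.023 from Table 26-5, n = 11 per group, q = 3.79 for r = 4 and dfe = 40), the Tukey MSD and the confidence intervals above can be reproduced in a few lines. This is a sketch of the arithmetic, not SPSS’s procedure:

```python
import math

def tukey_msd(q, ms_error, n):
    # Minimum significant difference: MSD = q * sqrt(MSe / n)
    return q * math.sqrt(ms_error / n)

def tukey_ci(mean_diff, msd):
    # 95% CI = mean difference +/- MSD
    return (mean_diff - msd, mean_diff + msd)

msd = tukey_msd(3.79, 88.023, 11)   # ~10.72
low, high = tukey_ci(-1.09, 10.72)  # (-11.81, 9.63): contains zero, not significant
```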
Table 26-5  Significant Differences for the Student-Newman-Keuls Test (α = .05): Change in Pain-Free ROM with Elbow Tendinitis (N = 44)
B. Pairwise Comparisons

MSe = 88.023, n = 11; MSD(r) = q(r) × √(MSe/n) = q(r) × (2.83)

MSD (r = 2) = 2.86 (2.83) = 8.09
MSD (r = 3) = 3.44 (2.83) = 9.73
MSD (r = 4) = 3.79 (2.83) = 10.72

                    (4) Rest   (3) Splint    (1) Ice      (2) NSAID
Means                24.09      35.27         44.18        45.27
(4) Rest    24.09               11.18* (r2)   20.09* (r3)  21.18* (r4)
(3) Splint  35.27                             8.91* (r2)   10.00* (r3)
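The step pattern in Table 26-5 can be generated the same way, with one MSD per comparison interval r. A sketch using the abridged q values for dfe = 40:

```python
import math

def snk_msds(q_by_r, ms_error, n):
    # One minimum significant difference per comparison interval r:
    # MSD(r) = q(r) * sqrt(MSe / n)
    se = math.sqrt(ms_error / n)
    return {r: q * se for r, q in q_by_r.items()}

msd = snk_msds({2: 2.86, 3: 3.44, 4: 3.79}, 88.023, 11)
# msd[2] ~ 8.09, msd[3] ~ 9.73, msd[4] ~ 10.72, matching Table 26-5
```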
Table 26-6  Significant Differences for the Scheffé Comparison (α = .05): Change in Pain-Free ROM with Elbow Tendinitis (N = 44)
1 MSD = 11.68
1 The minimum significant difference is 11.68. Therefore, all mean differences greater than 11.68 are significant.
2 Each pairwise mean difference (labeled I – J) and the significance of the difference is listed. Significant effects are highlighted
and tagged with an asterisk.
3 Means that are listed in the same column are not significantly different. In this case Rest and Splint are not different from each
other, but both are different from Ice and NSAID. Splint, Ice, and NSAID are not significantly different from each other. This outcome
is difficult to interpret because of the overlap of Splint in both subsets.
These data are taken from Table 26-2. Portions of the output have been omitted for clarity.
would be meaningful. For example, a comparison of ice/NSAID with splint/no med would not provide a clinically useful contrast. Because these comparisons involve completely different combinations, they cannot logically help us understand the differential effect of modality or medication.

Simple Effects
With a significant interaction, it is usually more meaningful to consider an analysis of combinations within rows or columns of the design. These separate analyses are called simple effects. Essentially, we convert the factorial design into several smaller “single-factor” experiments. The plots in Table 26-7 show the interaction between levels of modality and medication, with each line representing a simple effect. Interaction is defined as a significant difference between simple effects.5

For example, we can look at the simple effect of modality for those who got the NSAID, represented by the blue line in plot A. Then we could look at the simple effect for those who got no medication, represented by the red line. Alternatively, we can look at the simple effect of medication for each level of modality, represented by the lines in plot B.

Simple effects are distinguished from main effects, which are based on averaged values across a second variable. An analysis of simple effects will reveal differential patterns within each level of the independent variables.
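A simple-effects analysis amounts to running a set of smaller one-way ANOVAs, one per row (or column) of the design. A self-contained sketch with a hand-rolled F ratio; the modality scores below are hypothetical, and in practice a statistical package would be used:

```python
def one_way_f(groups):
    """F ratio for a one-way ANOVA over a list of groups (lists of scores).
    Returns (F, df_between, df_within)."""
    scores = [x for g in groups for x in g]
    n_total, k = len(scores), len(groups)
    grand = sum(scores) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_b, df_w = k - 1, n_total - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

# Simple effect of modality within one medication level (hypothetical scores):
ice, splint, rest = [52, 49, 51], [22, 20, 18], [25, 24, 23]
f, df_b, df_w = one_way_f([ice, splint, rest])
```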
Focus on Evidence 26-1
The Importance of Asking the Right Question!

An investigator is interested in comparing three treatments: experimental treatments A and B, and a control C, using three independent groups. She performs a one-way ANOVA on the data and finds a significant F test at p = .05. She concludes that there is a significant difference among the three means and proceeds to perform a multiple comparison test to determine where those differences lie. But when she gets the results of the multiple comparison test, she finds that none of the means are significantly different! How can this be?4

Now, in another lab miles away, three different researchers are conducting three different experiments, each comparing two treatments using an unpaired t-test for analysis. One is comparing A with B, one is comparing B with C, and the third is comparing A with C. The first two of these researchers find no significant difference between their groups. The third researcher, however, does find a significant difference at p = .05. He is now able to report that his experimental treatment A is better than C!

Why should the investigator who analyzed all three treatments at once be unable to find a significant difference when the investigator who ran a single experiment can claim a successful outcome?

In the first case, the investigator proposed a question that required the comparison of two treatments with respect to a control. Her research hypothesis was “There will be a difference among these three groups.” The comparison of the three groups is an important part of the rationale for this study, to account for the potential theoretical connections of these treatments. The ANOVA looks at the entire sample as part of this analysis, partitioning the total variance across all three groups. Likewise, the multiple comparison uses data from the entire sample when examining variances, even though any individual comparison is only between two means. In this case, the omnibus F test showed a significant effect, but this effect could not be attributed to any one specific comparison. Although not common, sometimes the variances of the individual groups are not sufficiently independent to show a significant difference, even when the overall F test does.

The investigator who studied the single comparison, on the other hand, is only concerned with the variance of two groups, and can narrow his statistical search for a difference. His hypothesis is “There will be a difference between these two groups.” And so there was!

And by the way, it is also possible for the ANOVA to report a p value above .05 and still be able to find significant differences with multiple comparisons! This is the basis for planned comparisons.
Table 26-7  Simple Effects for Interaction: Pain-Free ROM Across Levels of Modality and Medication

[Plots A and B: mean pain-free ROM change (0-40 degrees) plotted across modality (Ice, Splint, Rest) for each medication group (plot A), and across medication (NSAID, No Med) for each modality (plot B).]

Pairwise Comparisons
Dependent Variable: ROMChange

Simple effects of modality at each level of medication:

Med      (I) Modality   (J) Modality   Mean Difference (I-J)   Sig.   95% CI Lower   95% CI Upper
NSAID    Ice            Splint         30.000*                 .000    16.509         43.491
         Ice            Rest           26.000*                 .000    12.509         39.491
         Splint         Rest           -4.000                  .555   -17.491          9.491
No Med   Ice            Splint          1.000                  .882   -12.491         14.491
         Ice            Rest           -1.000                  .882   -14.491         12.491
         Splint         Rest           -2.000                  .767   -15.491         11.491

Simple effects of medication at each level of modality:

Modality   (I) Med   (J) Med   Mean Difference (I-J)   Sig.   95% CI Lower   95% CI Upper
Ice        NSAID     No Med    28.000*                 .000    14.509         41.491
Splint     NSAID     No Med    -1.000                  .882   -14.491         12.491
Rest       NSAID     No Med     1.000                  .882   -12.491         14.491

These data are taken from Table 25-4 in Chapter 25. Analyses were run using GLM in SPSS. Portions of the output have been omitted for clarity.
Table 26-8  Pairwise Comparisons Following Repeated Measures ANOVA: Elbow Flexor Strength in Three Forearm Positions (N=9)

These data are taken from Table 25-5 in Chapter 25. Portions of the output have been omitted for clarity.
that the variance in one measure cannot be predicted by the other.8 Accordingly, the questions being answered can be considered in isolation, allowing them to be tested using the per comparison error rate with no further adjustments.5,9 When contrasts are not orthogonal, adjustments are made to the comparisons.10

Complex Contrasts

The contrasts we have described so far in this chapter are considered simple contrasts because they involve the comparison of pairs of means. We can also specify contrasts in terms of differences between subsets of means. For example, in the study of treatment for elbow tendinitis, suppose we want to compare the group that receives ice with both groups that receive passive treatments of splint and rest. This is a complex contrast because it involves more than two means. The multiple comparison test will respond to the question, "Is the mean of Group 1 significantly different from the average mean for Groups 3 and 4?"

The null hypothesis for this comparison would be

H0: μ1 = (μ3 + μ4)/2

This hypothesis indicates that the mean for ice (μ1) will be equal to the average mean of the splint and rest groups (μ3 and μ4). The mean for the NSAID group (μ2) would not be included in the comparison.

Contrast Coefficients

To program complex comparisons, we must indicate which groups we want to compare using a set of weights called contrast coefficients. These coefficients define the two sides of the equation. One side of the equation is given positive weights and the other side negative weights, so that the total will be zero. It does not matter which side is positive or negative, but the means must be included in the order they appear in the data file. For example, in the elbow tendinitis study, with four means, we would use (2, 0, -1, -1).

Group 2 is given the coefficient 0 because it is not included in this comparison. Group 1 is given the weight of 2 so it will balance against the two means for groups 3 and 4, remembering that the total must be zero.

As the number of means in a study increases, the number of potential complex contrasts also increases. Complex contrasts are designed to answer specific questions in a research study that offer different theoretical explanations of the data. By looking at subsets of data, investigators can often clarify relationships that may not be evident with simple contrasts.

Trend Analysis

Multiple comparison tests are most often used in studies where the independent variable is qualitative or nominal and where the researcher's interest focuses on determining which categories are significantly different from the others. When an independent variable is quantitative, the treatment levels no longer represent categories but differing amounts of something, such as age, dosage, or time. When the levels of an independent variable are ordered along a continuum, the researcher is often interested in examining the shape of the response rather than differences between levels. This approach is called a trend analysis.

The purpose of a trend analysis is to find the most reasonable description of continuous data based on the number of turns or "ups and downs" seen across the levels of the independent variable. Because this approach is an integral part of the study design, it is considered a planned comparison.5

➤ CASE IN POINT #4
Researchers have documented strength changes that occur with aging.11,12 A study was designed to test maximum isometric strength of knee extension in women from the ages of 10 to 80. Data were obtained for 8 age groups, each with 10 subjects.

A hypothetical plot of such data is shown in Figure 26-2, and the ANOVA is shown in Table 26-9A. There is a significant difference among the age groups. A multiple comparison of means will not tell us about the directions of change across age, but a trend analysis will.

[Figure: mean strength scores (y-axis, 0-90) plotted against age group (x-axis, 10-80).]

Figure 26-2 Plot of strength scores across 8 age groups.
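The complex contrast described earlier in this section, defined by the coefficients (2, 0, -1, -1), can be sketched as a small numeric check. The group means below are hypothetical stand-ins for the ice, NSAID, splint, and rest groups; only the mechanics of the weights are illustrated.

```python
# Sketch of a complex contrast comparing the ice group (mu1) with the
# average of the splint and rest groups (mu3, mu4); the NSAID group
# (mu2) is weighted 0 and excluded.

coefficients = [2, 0, -1, -1]       # contrast weights, in data-file order
means = [40.0, 28.0, 12.0, 14.0]    # hypothetical means: ice, NSAID, splint, rest

# A valid contrast must have weights that sum to zero.
assert sum(coefficients) == 0

# The contrast value L = sum(c_i * mean_i); under H0, L = 0.
L = sum(c * m for c, m in zip(coefficients, means))
print(f"Contrast value L = {L}")    # 2*40 - 12 - 14 = 54
```

Dividing the weights by 2, i.e., (1, 0, -1/2, -1/2), gives the same test and expresses L directly as the difference between the ice mean and the average of the splint and rest means.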
Table 26-9  Example of an ANOVA with a Trend Analysis: Changes in Strength Across Eight Age Groups (N = 80)
Figure 26-3 Examples of several types of trends: A) linear trend, B) quadratic trend, C) cubic trend, and D) quartic trend.
See Chapter 30 for further discussion of linear and quadratic trends associated with regression analysis.

Purpose of Trend Analysis

How one chooses to analyze data is directly related to how one frames the research question. When testing an ordered variable such as age or time, the research question would not be focused on whether one group or interval was different from another but rather would address how the dependent variable changes across the span.

Significance of Trend Components

Trends are tested for significance as part of an analysis of variance. Table 26-9A shows the one-way ANOVA for the strength data, resulting in a significant difference across age groups (p < .001). However, this information does not help us clarify how the data vary across the age groups.

The ANOVA can be run to include a trend analysis, as shown in Table 26-9B. The variance, indicated by the sum of squares, for the variable of age is now broken into linear and quadratic trend components. Each of these components is tested for significance.

Each specific trend component is tested by an F ratio. In this example, both linear and quadratic trends are significant. When a trend component is statistically significant, subjective examination of graphic patterns of the data is usually sufficient for further interpretation.

Just to make life interesting, some computer packages will refer to trend analyses as orthogonal decomposition or orthogonal polynomial contrasts. These terms derive from the types of equations that are used to characterize the trends, which you might remember from advanced algebra.

Interpreting Trend Analysis

Three important considerations must be addressed when interpreting trend analyses. First, the number and spacing of intervals between levels of the independent variable can make a difference to the visual interpretation of the curve. A linear trend requires a minimum of three points, a quadratic trend a minimum of four points, and so on. With larger spans in the quantitative variable, more intervals may be necessary.

Most investigators try to use equally spaced intervals to achieve consistency in the interpretation. Others will purposefully create unequal intervals to best represent the samples of interest. For instance, trends that are established over time may involve some intervals of hours and others of days. Most computer packages that perform trend analyses will accommodate equal or unequal intervals, but distances between unequal intervals must be specified.

A second caution for interpreting trend analysis is to avoid extrapolating beyond the upper and lower limits of the selected intervals. For example, based on Figure 26-2, if we had tested only individuals between 40 and 80, we would conclude that strength declines linearly with age, but we could not draw any conclusions about changes from ages 10 to 30. By limiting the range of intervals, we would have missed the quadratic function that more accurately describes the relationship. Therefore, interpretation of trends must be limited to the tested range of the independent variable.

Finally, it is important to consider the design of the study. For instance, in the current example, data were collected from individuals in the different age groups at the same time. This is a cross-sectional design. This can be contrasted with a longitudinal study, in which we would have followed individuals across the lifespan to see if their strength actually did decrease over time (see Chapter 19).
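The partitioning of the age effect into trend components can be sketched numerically. The coefficient sets below are the standard tabled orthogonal polynomial values for k = 8 equally spaced levels; the group means and n are hypothetical, chosen to mimic the rise-then-decline strength pattern in Figure 26-2.

```python
# Sketch: sum of squares for linear and quadratic trend components,
# computed from orthogonal polynomial coefficients (k = 8 levels).

linear    = [-7, -5, -3, -1, 1, 3, 5, 7]
quadratic = [7, 1, -3, -5, -5, -3, 1, 7]

# Orthogonal contrasts: each sums to zero and the two sets are uncorrelated.
assert sum(linear) == 0 and sum(quadratic) == 0
assert sum(l * q for l, q in zip(linear, quadratic)) == 0

means = [32.0, 48.0, 58.0, 62.0, 60.0, 52.0, 42.0, 30.0]  # hypothetical group means
n = 10                                                    # subjects per group

def trend_ss(coeffs, means, n):
    """Sum of squares for one trend component: n * L**2 / sum(c**2)."""
    L = sum(c * m for c, m in zip(coeffs, means))
    return n * L ** 2 / sum(c ** 2 for c in coeffs)

print(f"SS(linear)    = {trend_ss(linear, means, n):.1f}")
print(f"SS(quadratic) = {trend_ss(quadratic, means, n):.1f}")
```

Each component's SS is then divided by 1 df and tested against the ANOVA error term as an F ratio. With these inverted-U means, the quadratic component carries far more of the age variance than the linear component, as the chapter's example suggests.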
COMMENTARY

Statistics: the only science that enables different experts using the same figures to draw different conclusions.
—Evan Esar (1899-1995)
American humorist

There are no widely accepted criteria for choosing one multiple comparison test over another, and the selection of a particular procedure is often made arbitrarily. Two basic issues should guide the choice of a multiple comparison procedure.

The first issue relates to the decision to conduct either planned or unplanned contrasts. This decision, made during the planning phases of a study, will relate to theoretical expectations. With planned comparisons, the researcher asks, "Is this difference significant?" With post hoc tests the question shifts to, "Which differences are significant?" When the researcher is interested in exploring all possible combinations of variables, unplanned contrasts should be used.

The second issue concerns the importance of Type I or Type II error. Each multiple comparison test will control for these errors differently, depending on the use of per comparison or familywise error rates. Of the three post hoc comparisons described here, the Newman-Keuls test is the most powerful. Scheffé's comparison gives the greatest control over Type I error, but at the expense of power. Researchers often prefer Tukey's HSD because it offers a balanced option, with both reasonable power and protection against Type I error. The power of the Newman-Keuls procedure is increased by using different comparison intervals, but use of the per comparison error rate increases the risk of Type I error.

Other than these rather straightforward criteria, when there is no overriding concern for either Type I or Type II error, there may be no obvious choice for a specific test. This is probably why the Tukey procedure is so popular as a "middle of the road" test. A researcher is obliged to consider the rationale for comparing treatment conditions or groups and to justify the basis for making these comparisons. This rationale should be provided in an article for the reader but is rarely included.

Regardless of how the choice is made, consumers of research should make judgments about findings based on whether they think the author took too liberal or too conservative an approach. As we apply evidence to clinical decisions, this prerogative should be stressed, perhaps leading to only tentative use of findings. This emphasizes the importance of understanding the statistical procedures, so that an author's conclusions may be questioned based on their choice of analysis.

As the examples in this chapter illustrate, different outcomes can be achieved with different multiple comparison procedures applied to the same data, leading to different conclusions and perhaps different clinical decisions! This makes the choice of a test during the planning stages of a study that much more meaningful. The choice should not be made after the results are available, and perhaps most importantly, should not be based on trying all of the possible alternatives to see what the results are! Pick your poison and interpret your data fairly!
REFERENCES

1. Tukey JW. The Problem of Multiple Comparisons. Ditto: Princeton University; 1953.
2. Scheffé HA. A method for judging all contrasts in the analysis of variance. Biometrika 1953;40:87.
3. Scheffé HA. The Analysis of Variance. New York: Wiley; 1959.
4. Dallal GE. Multiple comparison procedures. The Little Handbook of Statistical Practice. Available at http://www.jerrydallal.com/LHSP/mc.htm. Accessed November 1, 2018.
5. Keppel G. Design and Analysis: A Researcher's Handbook. 4th ed. Englewood Cliffs, NJ: Prentice Hall; 2004.
6. Green SB, Salkind NJ. Using SPSS for Windows and Macintosh: Analyzing and Understanding Data. 8th ed. New York: Pearson; 2017.
7. Maxwell SE. Pairwise multiple comparisons in repeated measures designs. J Educ Stat 1980;5:269-287.
8. Hinkle DE, Wiersma W, Jurs SG. Applied Statistics for the Behavioral Sciences. 5th ed. Boston: Houghton Mifflin; 2003.
9. Hays WL. Statistics. 5th ed. Belmont, CA: Wadsworth; 1994.
10. Field A. Discovering Statistics Using IBM SPSS Statistics. 5th ed. Thousand Oaks, CA: Sage; 2018.
11. Keller K, Engelhardt M. Strength and muscle mass loss with aging process. Age and strength loss. Muscles Ligaments Tendons J 2014;3(4):346-350.
12. Borges O. Isometric and isokinetic knee extension and flexion torque in men and women aged 20-70. Scand J Rehabil Med 1989;21(1):45-53.
13. Yang CP, Lin CC, Li CI, Liu CS, Lin WY, Hwang KL, Yang SY, Chen HJ, Li TC. Cardiovascular risk factors increase the risks of diabetic peripheral neuropathy in patients with type 2 diabetes mellitus: The Taiwan Diabetes Study. Medicine (Baltimore) 2015;94(42):e1783.
14. Hayes H, Geers AE, Treiman R, Moog JS. Receptive vocabulary development in deaf children with cochlear implants: achievement in an intensive auditory-oral educational setting. Ear Hear 2009;30(1):128-135.
15. Mayer-Kress G, Newell KM, Liu YT. Nonlinear dynamics of motor learning. Nonlinear Dynamics Psychol Life Sci 2009;13(1):3-26.
CHAPTER 27
Nonparametric Tests for Group Comparisons

In previous chapters several statistical tests have been presented that are based on certain assumptions about the parameters of the population from which the samples were drawn. These parametric tests are appropriate for use with ratio/interval data. In this chapter, we present a set of statistical procedures classified as nonparametric, which test hypotheses for group comparisons without normality or variance assumptions and are appropriate for analysis of nominal or ordinal data. For this reason, these methods are sometimes referred to as distribution-free tests. The data for these tests are reduced to ranks, and comparisons are based on distributions, not means.

The purpose of this chapter is to describe five commonly used nonparametric comparison procedures: the Mann-Whitney U test, the Kruskal-Wallis ANOVA, the sign test, the Wilcoxon signed-ranks test, and the Friedman ANOVA. Nonparametric tests for measures of association will be presented in Chapters 28 and 29.

■ Criteria for Choosing Nonparametric Tests

Nonparametric comparison tests are applied to designs that are analogous to their parametric counterparts using t and F tests (see Table 27-1). Two major criteria are generally adopted for choosing a nonparametric test over a parametric procedure. The first criterion is that data are measured on the nominal or ordinal scales. This makes nonparametric tests useful for analysis of data from many assessment tools that are based on ordinal ranks.

The second criterion is that assumptions of population normality and homogeneity of variance cannot be satisfied. Many clinical investigations involve variables
that are represented by skewed distributions rather than symmetrical ones. In addition, small clinical samples and samples of convenience cannot automatically be considered representative of larger normal distributions (see Focus on Evidence 27-1). The Kolmogorov-Smirnov and Shapiro-Wilk tests can be used to determine if data follow a normal distribution (see Chapter 23, Box 23-4).

In parametric tests like the t-test and analysis of variance, statistical hypotheses are based on a comparison of means. The expectation is that sample means will
Focus on Evidence 27-1

[Figure: frequency distribution of cold pressor pain thresholds (sec), with most values clustered at low thresholds and a long right tail, illustrating a markedly skewed, non-normal distribution.]
This situation suggests a strong need for researchers to consider normality in deciding which statistics to use. In comparing outcomes using t-tests versus the nonparametric Wilcoxon test, the nonparametric test is almost always more statistically powerful with non-normal data, meaning that applying t-tests to non-normal data sets might compromise study outcomes and lead to false negative conclusions, especially with smaller samples.3 The Kolmogorov-Smirnov and Shapiro-Wilk tests for goodness of fit to a normal distribution were described in Chapter 23, Box 23-4.

Nonparametric tests are generally appropriate for use with ordinal data but may also be needed with continuous data from small samples or samples with non-normal distributions. Grant reviewers have become more aware of the need to consider appropriate analyses based on data characteristics. The importance of validating a normal distribution is often overlooked, or at least ignored, as part of a research report, but may be biasing important outcomes.

From Treister R, Nielsen CS, Stubhaug A, et al. Experimental comparison of parametric versus nonparametric analyses of data from the cold pressor test. J Pain 2015;16(6):537-548. From Figure 1A, p. 542. Used with permission.
estimate population means because of assumptions of homogeneity and normality. In nonparametric tests, we are unable to make the same type of comparison because of the ranked nature of scores. The median score is often more meaningful but still may not reflect the full distribution. Therefore, the null and alternative hypotheses are stated more generally, based on a comparison of the distributions of scores. This will be illustrated for each of the examples that follow.

Although nonparametric tests require fewer statistical assumptions than parametric procedures, this does not preclude the need for strong experimental design to avoid bias in the data. For example, whenever possible, random selection and random assignment to groups should be incorporated, and control groups and blinding should also be included.

The major disadvantage of nonparametric tests is that they generally cannot be adapted to complex clinical designs, such as factorial designs and tests of interactions. There are many newer tests, however, that have been developed to allow the application of regression procedures or tests of interaction effects. These are beyond the scope of this book but can be found in many recent texts.4-6

Power-Efficiency for Nonparametric Tests

Many researchers prefer to use parametric tests because they are considered more powerful. Nonparametric tests are less sensitive than parametric tests because most of them involve ranking scores rather than comparing precise metric changes. Nonparametric and parametric methods have been compared on the basis of their power-efficiency, which is a test's relative ability to identify significant differences for a given sample size.

Generally, an increase in sample size is needed to make a nonparametric test as powerful as an analogous parametric test. For instance, a nonparametric test may require a sample size of 50 to achieve the same degree of power as a parametric test with 30 subjects. This relationship can be expressed as a percentage that indicates the relative power-efficiency of the nonparametric test.

With equal sample sizes, nonparametric tests will generally be less powerful than their parametric counterparts. However, with larger samples this discrepancy is minimized. Most of the nonparametric tests described here can achieve approximately 95% power-efficiency in comparison to their most powerful parametric analogs.7-9 These figures apply to calculations based on comparisons of normal populations.

Estimating Sample Size

Researchers generally choose to use nonparametric tests when they are unable to make assumptions about the shape of or variance in underlying distributions. Unfortunately, however, without making explicit assumptions about a distribution, it is not possible to estimate sample size or power as we do with parametric tests.10 With very small samples, as with six subjects or fewer per group, many nonparametric tests will be as powerful as their parametric counterparts. With larger non-normal populations, the nonparametric statistics may actually be more powerful.11 Lehman10 suggests a rule of thumb: compute the sample that would be needed if you were doing a parametric test and add 15%. Of course, this requires an estimate of effect size, which will be described for different tests. Although most computer programs will not calculate effect sizes for nonparametric tests, they can be calculated quite easily by using information from the test result.

Nonparametric statistics are generally easy to calculate by hand, and some calculations will be illustrated here to clarify the rationale behind the tests. Most statistical packages will provide analyses for nonparametric tests, however, and this chapter will focus on interpretation of those results.

G*Power can be used to estimate sample size and power for the Mann-Whitney U test and the Wilcoxon signed-ranks test when data are continuous (not ranks). The process is the same as for the t-test (see the Chapter 27 Supplement).

■ Procedure for Ranking Scores

Most nonparametric tests are based on rank ordering of scores. The procedure for ranking will be illustrated using the two samples shown in Table 27-2. Scores are always ranked from smallest to largest, with the rank of 1 assigned to the smallest score. Algebraic values are taken into account, so that the lowest ranks are assigned to the largest negative values, if any. The highest rank

Table 27-2  Examples of Ranked Scores Without Ties (A) and With Ties (B) (n=8)

SAMPLE A          SAMPLE B
Score   Rank      Score   Rank
  6     4           8     3
  2     3          11     5
  8     5           3     1.5
  9     6          17     8
 -3     1          11     5
  0     2           3     1.5
 16     8          11     5
 12     7          12     7
will equal n. As shown in Sample A, the rank of 1 is assigned to the smallest score (-3), the rank of 2 goes to the next smallest (0), and so on, until the rank of 8 is assigned to the highest score (16).

When two or more scores in a distribution are tied, they are each given the same rank, which is the average of the ranks they occupy. For instance, in Sample B, there are two scores with the smallest value (3). They occupy ranks 1 and 2. Therefore, they are each assigned the average of their ranks: (1 + 2)/2 = 1.5. The next highest value (8) receives the rank of 3, as the first two ranks are filled. The next highest value is 11, which appears three times. As we have already filled ranks 1, 2, and 3, we average the next three ranks: (4 + 5 + 6)/3 = 5. Each score of 11 is assigned the rank of 5. Having filled the first 6 rank positions, the last two values in the distribution are assigned ranks 7 and 8.

■ Tests for Independent Samples

Nonparametric procedures can be applied to studies using independent samples, with two or more groups. These groups may be randomly assigned, or they may represent intact groups.

The Mann-Whitney U Test

The Mann-Whitney U test is one of the more powerful nonparametric procedures, designed to test the null hypothesis that two independent samples come from the same population. The U test does not require that groups be of the same size. It is, therefore, an excellent alternative to the independent t-test when parametric assumptions are not met.

➤ CASE IN POINT #1
A researcher is interested in the effect of vaginal versus cesarean section births on Apgar score, graded from 0-10, with higher scores indicating a healthier infant.12 Scores are obtained at 5 minutes from newborns of 11 women. The researcher hypothesizes that the Apgar score will be different for the two types of delivery (a nondirectional hypothesis). A nonparametric test is chosen for analysis because the data are ordinal and the sample is small.

Procedure
Hypothetical data for this example are given in Table 27-3A. The first step is to combine scores from both groups and rank all 11 scores in order of increasing size. The ranks are then examined within each group, and the sum of the ranks is designated R1 or R2. The null hypothesis states that there will be no difference between the distributions of scores for vaginal or cesarean delivery. Under the null hypothesis, then, we would expect the groups to be equally distributed with regard to high and low ranks, and the mean of the ranks would be the same for both groups. The test will determine if the difference between the rank sums is sufficiently large to be considered significant.

The sum of ranks for vaginal delivery is higher than for cesarean, but there are also more patients in the vaginal group. This difference in sample size is taken into account in the calculation of the test statistic.

The U Test Statistic
The test statistic, U, is calculated using each group as a reference, as shown in Table 27-3B, where n1 is the smaller sample size, n2 is the larger sample size, and R1 and R2 are the sums of the ranks for the groups. Designation of n1 or n2 is arbitrary if groups are of equal size.

These formulas will yield different values of U. For example, we obtain U1 = 8.5 with Group 1 as the reference group and U2 = 21.5 using Group 2 as the reference group. Either of these values will produce the same outcome. However, the smaller value of U is used with a table of critical values, so in this case we will use U = 8.5.

Test of Significance
Critical values of U are given in Table 27-4 for one- and two-tailed tests at α = .05 based on n1 and n2. The calculated value of U must be equal to or less than the tabled value to be significant. Note that this is opposite to the way we analyzed critical values with parametric tests.

For the current example, we proposed a nondirectional alternative hypothesis. Using α2 = .05, with n1 = 5 and n2 = 6, the critical value of U is 3. Because the calculated value, U = 8.5, is larger than this critical value, we cannot reject H0. Output for this analysis is shown in Table 27-3C.

Nonparametric tests may be subject to bias if there are many ties in the ranks. The analysis automatically generates an adjusted version. The p value can be affected if the number of ties is excessive.

Large Samples
With large samples (> 25), the value of U approaches a normal distribution and U is converted to z. Although the present example does not warrant it, we have used the data to illustrate this application in Table 27-3C. The calculated value of z must be equal to or larger than the critical value at the appropriate level of significance. Because we proposed a nondirectional hypothesis with α2 = .05, the critical value is 1.96.

Our calculated value is smaller than this critical value, and the null hypothesis is not rejected. We can also determine the exact p value associated with z = 1.2 using Appendix Table A-1. This score corresponds to .1151 in each tail, which corresponds to a two-tailed test at p = .230 (see Table 27-3).
Table 27-3  Mann-Whitney U Test: Difference in Apgar Score with Cesarean and Vaginal Delivery (N = 11)

A. Data

Cesarean Delivery (n1 = 5)     Vaginal Delivery (n2 = 6)
Apgar   Rank                   Apgar   Rank
  7     6                        6     4
  4     2                        5     3
  3     1                        9     10
  8     8.5                     10     11
  7     6                        8     8.5
                                 7     6

Ranks
Delivery   N    Mean Rank   Sum of Ranks
Cesarean   5    4.70        23.50
Vaginal    6    7.08        42.50
Total      11

B. Calculations

U1 = R1 - n1(n1 + 1)/2 = 23.5 - 5(5 + 1)/2 = 8.5
U2 = R2 - n2(n2 + 1)/2 = 42.5 - 6(6 + 1)/2 = 21.5
Table 27-4  Abridged Table of Critical Values for the U Statistic (α = .05)

               n2 (larger sample)
n1         4    5    6    7    8    9   10
4    α1    1    2    3    4    5    6    7
     α2    0    1    2    3    4    4    5
5    α1    —    4    5    6    8    9   11
     α2    —    2    3    5    6    7    8
6    α1    —    —    7    8   10   12   14
     α2    —    —    5    6    8   10   11
7    α1    —    —    —   11   13   15   17
     α2    —    —    —    8   10   12   14
8    α1    —    —    —    —   15   18   20
     α2    —    —    —    —   13   15   17
This table is an abridged version for purposes of illustration. See the Chapter 27 Supplement for the full table.
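The full Apgar calculation can be reproduced in a short, plain-Python sketch using the scores in Table 27-3A, covering the midrank assignment for ties, both U statistics, and the large-sample z approximation.

```python
# Sketch of the Mann-Whitney U calculation for the Apgar example,
# done by hand: midranks for tied scores, U1/U2, and the normal
# approximation (without the tie correction).
import math

cesarean = [7, 4, 3, 8, 7]          # n1 = 5
vaginal  = [6, 5, 9, 10, 8, 7]      # n2 = 6

def midranks(values):
    """Rank all values smallest to largest, averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2        # average of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

ranks = midranks(cesarean + vaginal)
n1, n2 = len(cesarean), len(vaginal)
R1, R2 = sum(ranks[:n1]), sum(ranks[n1:])

U1 = R1 - n1 * (n1 + 1) / 2
U2 = R2 - n2 * (n2 + 1) / 2
U = min(U1, U2)                      # the smaller U is compared with the table

# Large-sample normal approximation
mu_U = n1 * n2 / 2
sigma_U = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (U - mu_U) / sigma_U

print(f"R1 = {R1}, R2 = {R2}, U = {U}, z = {z:.2f}")
```

A useful arithmetic check is that U1 + U2 = n1·n2 (here 8.5 + 21.5 = 30), and the z of about -1.19 matches the rounded value of 1.2 used in the text.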
➤ Results
The Mann-Whitney U test was used to compare Apgar scores for infants born through cesarean (n = 5) or vaginal delivery (n = 6). This test was chosen because the sample was small and Apgar scores are considered ordinal measures. The test was not significant (U = 8.5, p = .247), and we conclude that there is no difference in Apgar scores for infants born by cesarean or vaginal delivery.

The Kruskal-Wallis Analysis of Variance

When three or more independent groups are compared (k ≥ 3), a nonparametric analysis of variance is appropriate for the same reasons that an F test is used with parametric data. The Kruskal-Wallis one-way analysis of variance by ranks is a nonparametric analog of the one-way analysis of variance. It is a powerful alternative to the ANOVA when variance and normality assumptions for parametric tests are not met. When k = 2, this test is equivalent to the Mann-Whitney U test. Multiple comparison procedures can also be applied.

Procedure
Hypothetical data are reported in Table 27-5A. The procedure for the Kruskal-Wallis ANOVA is essentially an expansion of the method used for the Mann-Whitney U test. The first step is to combine data for all groups and rank scores from the smallest to the largest. The ranks are then summed for each group separately, as shown in Table 27-5A.

The null hypothesis states that the distributions from the three groups come from the same population. Our alternative hypothesis states that these distributions come from different populations. If the null hypothesis is true, we would expect an equal distribution of ranks under the three conditions.

The H Statistic
The test statistic for the Kruskal-Wallis test is H, calculated according to

H = [12 / (N(N + 1))] Σ (R²/n) - 3(N + 1)
6113_Ch27_400-414 02/12/19 4:38 pm Page 406
Table 27-5 Kruskal–Wallis One-Way Analysis of Variance by Ranks: Change in Pain (VAS in mm) (N = 14)

A. Data
[Scores and ranks for Group 1 (Yoga, n = 5), Group 2 (Acupuncture, n = 5), and Group 3 (TENS, n = 4); rank sums R = 27, 29, and 49, respectively.] 1

B. Calculations

H = [12 / N(N + 1)] Σ (R²/n) − 3(N + 1)
  = [12 / 14(14 + 1)] [(27)²/5 + (29)²/5 + (49)²/4] − 3(14 + 1)
  = (12/210)(914.25) − 45 = 7.243

1 The rank sums and mean ranks show a distinct spread between TENS and the other two interventions.
2 The test statistic H is tested against χ² with 2 df. The overall test is significant (p = .025).
3 The homogeneous subsets show a significant difference between TENS and both Yoga and Acupuncture.
Calculated values vary due to rounding differences. Portions of the output have been omitted for clarity.
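The hand calculation in Table 27-5B can be reproduced with a short Python sketch (the function name kruskal_h is hypothetical). This simple form is uncorrected for ties, which is one reason statistical packages may report a slightly different H.

```python
def kruskal_h(rank_sums, sizes):
    """Kruskal-Wallis H = [12 / N(N+1)] * sum(R^2 / n) - 3(N+1), no tie correction."""
    N = total = sum(sizes)
    s = sum(R ** 2 / n for R, n in zip(rank_sums, sizes))
    return 12 / (N * (N + 1)) * s - 3 * (N + 1)

# Rank sums and group sizes from Table 27-5: Yoga, Acupuncture, TENS
h = kruskal_h([27, 29, 49], [5, 5, 4])   # 7.243
```

H is then compared with the critical χ² value with k − 1 = 2 degrees of freedom.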
Table A-5). The calculated value must be equal to or greater than the critical value to be significant. Our calculated value of 7.243 is greater than the critical value, and therefore, H is significant, and we can reject H0. The output for this analysis is shown in Table 27-5C, confirming this conclusion (p = .025).

Multiple Comparisons
Like the parametric ANOVA, this result does not indicate which groups are different. Some researchers will stop here, basing their interpretation of data on a subjective comparison of the mean ranks for each group. For example, we observe that ranks for the TENS group are higher than the other two groups (see Table 27-5 1). However, this subjective judgment can be misleading, and a multiple comparison procedure should be used to determine which groups are significantly different.

Dunn's multiple comparison is a nonparametric procedure used to perform pairwise comparisons for the Kruskal-Wallis test.17 The Mann–Whitney U test can also be used as a multiple comparison procedure, but a Bonferroni correction should be applied to control for the increased risk of Type I error.

The results of this test are shown using homogeneous subsets based on mean ranks (see Table 27-5C 3). Here we see that TENS is significantly better than both yoga and acupuncture, not unexpected given the difference among the ranks.

Effect Size for the Kruskal-Wallis ANOVA
The overall effect size index for the Kruskal-Wallis test is eta squared, η², based on the H statistic.14 For the study of pain interventions:

η² = (H − k + 1)/(N − k) = (7.340 − 3 + 1)/(14 − 3) = .668

where k is the number of groups and N is the total sample. According to conventional effect sizes for η², this would be considered a large effect (see Chapter 25).

■ Tests for Related Samples

Nonparametric procedures can be used with paired samples within a two-level repeated measures design, using either the sign test or the Wilcoxon signed-ranks test. These are analogous to the paired t-test. The Friedman Two-Way Analysis of Variance by Ranks can be used with three or more levels of the independent variable.

The Sign Test
The sign test is one of the simplest nonparametric tests because it requires no mathematical calculations. It is based on the binomial distribution and does not require that measurements be quantitative. As its name implies, the data are analyzed using plus and minus signs rather than numerical values. Therefore, this test provides a mechanism for testing relative differentiations such as more–less, higher–lower, or larger–smaller. It is particularly useful when quantification is impossible or infeasible and when subjective ratings are necessary.

➤ CASE IN POINT #3
Researchers have been interested in the effect of psychostimulants on attention deficits in patients with traumatic brain injury (TBI).18 A crossover study was designed to test the efficacy of a new drug versus a placebo. An attention task was administered to 10 patients under each test condition, in randomized order. A washout period was inserted between test conditions. The outcome was the number of correct responses out of 12 trials on an attention task. We hypothesize that there will be a difference in the attention task score between those who take the drug and those who take the placebo.

Procedure
Hypothetical data for this study are shown in Table 27-6A. The sign test is applied to the differences between each pair of scores, based on whether the direction of
Table 27-6 Sign Test and Wilcoxon Signed-Ranks Test: Effect of Drug and Placebo on an Attention Deficit Task (N = 10)

D. Calculation of z

• Sign test 4:
z = (|D| − 1)/√n = (5 − 1)/√7 = 1.51

• Wilcoxon signed-ranks test 5:
z = [T − n(n + 1)/4] / √[n(n + 1)(2n + 1)/24] = [1 − 7(7 + 1)/4] / √[7(7 + 1)(14 + 1)/24] = −2.197

1 The number of fewer signs (in this case negative signs) is 1. For the binomial sign test, x = 1. For the Wilcoxon test, T = 1.
2 The two-tailed sign test is not significant (p = .125).
3 The two-tailed Wilcoxon signed-ranks test is significant (p = .027).
4 The standardized z score for the sign test is less than the critical value 1.96 (α2 = .05) and is not significant.
5 The standardized z score for the Wilcoxon test is greater than the critical value 1.96 and is significant. The sign can be ignored with a two-tailed test.
Calculated values vary due to rounding differences. Portions of the output have been omitted for clarity.
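Assuming the values in Table 27-6D (six plus signs, one minus sign, T = 1, n = 7), the three calculations can be sketched in Python. The helper names are hypothetical, and the exact binomial probability uses only the standard library rather than a statistics package.

```python
import math

def sign_test_exact(x: int, n: int) -> float:
    """One-tailed binomial probability P(X <= x) when p = .5 (as in Table 27-7)."""
    return sum(math.comb(n, k) for k in range(x + 1)) / 2 ** n

def sign_test_z(n_plus: int, n_minus: int) -> float:
    """Large-sample sign test: z = (|D| - 1) / sqrt(n)."""
    n = n_plus + n_minus
    return (abs(n_plus - n_minus) - 1) / math.sqrt(n)

def wilcoxon_z(t: float, n: int) -> float:
    """Wilcoxon signed-ranks: z = (T - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24)."""
    mean_t = n * (n + 1) / 4
    sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - mean_t) / sd_t

p_one = sign_test_exact(1, 7)   # .0625; doubled -> .125 for a two-tailed test
z_sign = sign_test_z(6, 1)      # 1.51, below the 1.96 critical value
z_wilx = wilcoxon_z(1, 7)       # -2.197, beyond 1.96 in absolute value
```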
difference is positive or negative. In this example, we will use the scores for the drug condition as the reference and record whether that score is greater (+), the same (0), or less (–) than the placebo score, always maintaining the same direction of comparison. This order is only relevant if we propose a directional hypothesis.

Under the null hypothesis, the distributions of scores under each condition are not different, and chance will be the determinant of which score is greater in each pair. Therefore, we would expect half the differences to be positive and half to be negative (p = .50). We will reject H0 if one sign occurs sufficiently less often.

The alternative hypothesis states that the probability of one condition is greater than 50%. With a nondirectional hypothesis, it does not matter which value is used as the reference, as in our example. In the fourth column in Table 27-6A, the signs of the differences are listed. When no difference is obtained, a zero is recorded.

Test Statistic x
To proceed with the test, we count the number of plus signs and the number of minus signs. Ties, recorded as zeros, are discarded from the analysis, and n is reduced accordingly. In this example, 3 subjects showed no difference. Therefore, n = 7. There are six plus signs and one minus sign. We take the smaller of these two values, the number of fewer signs, and assign it the test statistic, x. In this case, x = 1, the number of minus signs.

Test of Significance
To determine the probability of obtaining x under H0, we refer to Table 27-7. This table lists one-tailed probabilities associated with x for values up to n = 20, where n is the number of pairs whose differences showed direction. Two-tailed tests require doubling the probabilities given in the table.

Table 27-7 Abridged Table of One-Tailed Probabilities Associated with x in the Binomial Test

                       x
n       0      1      2      3      4      5
5     .031   .188   .500   .812   .969     —
6     .016   .109   .344   .656   .891   .984
7     .008   .062   .227   .500   .773   .938
8     .004   .035   .145   .363   .637   .855
9     .002   .020   .090   .254   .500   .746
10    .001   .011   .055   .172   .377   .623
15      —      —    .004   .018   .059   .151
20      —      —      —    .001   .006   .021

This is an abridged version for purposes of illustration. See the Chapter 27 Supplement for the full table.

For x = 1 and n = 7, the table shows p = .062. We double this value for a two-tailed test, p = .124. This is greater than the acceptable level of .05, and we do not reject H0. The output for this test is shown in Table 27-6B, confirming these results.

➤ Results
A sign test was run to compare the distributions of attention deficit under two drug conditions. There was no significant difference in performance on the attention task between drug and placebo conditions (p = .125).

The determination of the probability associated with x is based on a theoretical distribution called the binomial probability distribution. A binomial outcome is one that can take only two forms, in this case either positive or negative. The binomial test determines the likelihood of getting the smaller number of plus or minus signs out of the total number of differences just by chance.

The binomial test was also described in Chapter 18 in relation to using the split middle line with single-subject designs.

Large Samples
With sample sizes greater than 30, x is converted to z and tested against the normal distribution according to the formula

z = (|D| − 1)/√n

where |D| is the absolute difference between the number of plus and minus signs (excluding ties). This calculation is illustrated in Table 27-6 4 for data with six plus signs and one minus sign, resulting in z = 1.51. Using the critical value of z = 1.96 for α2 = .05, this outcome does not achieve significance, and we do not reject H0. We can confirm from Table A-1 that the one-tailed probability associated with z = 1.51 is .0655.

The Wilcoxon Signed-Ranks Test
The sign test evaluates differences within paired scores based solely on whether one score is larger or smaller than the other. This is often the best approach with subjective clinical variables that offer no greater precision. However, if data provide information on the relative magnitude of differences, the more powerful Wilcoxon signed-ranks test can be used. This test examines both the direction of difference and the relative difference based on ranks.

Procedure
Consider the same example presented in the previous section. We obtain a difference score for each subject,
disease. Each patient will be placed in three positions—level, head down, and head elevated—in random order. Blood pressure will be measured within 1 minute of assuming each position.

Even though blood pressure is considered a ratio level measure, we may choose to use a nonparametric form of analysis for this study because the sample is small, and because we do not have sufficient reason to assume that blood pressure for a population of patients with this disease will be normally distributed.

Procedure
Hypothetical data for this study are reported in Table 27-9A. Data are arranged so that rows represent subjects (n) and columns represent experimental conditions (k). In this example, n = 6 and k = 3. We begin by converting all scores to ranks. However, for this test, the ranks are assigned across each row (within a subject). Ties are assigned average ranks within a row. The highest rank within a row will equal k.

The next step is to sum the ranks within each column. If the null hypothesis is true, we would expect the distribution of ranks to be a matter of chance, and high and low ranks should be evenly distributed across all treatment conditions. Therefore, the rank sums within each column should be equal. If the alternative hypothesis is true, at least one pair of conditions will show a difference.

The χ²r Statistic
The test statistic for the Friedman ANOVA is χ²r (read "chi square r"). It is computed using the formula

χ²r = [12 / nk(k + 1)] Σ R² − 3n(k + 1)

With 3 conditions, we have 2 df, and we compare our calculated value of 9.25 against the critical value χ²(2) = 5.99 at α = .05. The calculated value must be equal to or larger than the critical value to be significant. Therefore, the test is significant.

Multiple Comparisons
The output shows that the overall analysis of variance is significant (p = .008) (see Table 27-9 2). At this point, we can use a multiple comparison procedure to determine which groups are significantly different. Although the Wilcoxon signed-ranks test is often used for this purpose with a Bonferroni adjustment, Dunn's test is appropriate for multiple comparisons with the Friedman ANOVA.17 Results for homogeneous subsets are shown in Table 27-9.

➤ Results
A Friedman ANOVA identified a significant difference in blood pressure among the three body positions (χ²r(2) = 9.65, p = .008). Dunn's test showed no significant difference in blood pressure between elevated and head down positions. Blood pressure in the level position was significantly lower than in the other two positions.

Effect Size for Friedman ANOVA
The effect size index for the Friedman analysis of variance is the Kendall coefficient of concordance, W, a correlation of ranked data for more than two conditions.14 Based on the χ²r statistic from Table 27-9 for the comparison of blood pressure in three head positions:

W = χ²r / N(k − 1) = 9.652 / 6(3 − 1) = .802
Table 27-9 Friedman Two-Way Analysis of Variance by Ranks: Blood Pressure in Three Positions (N = 6)

A. Data
[Blood pressure scores and within-subject ranks for the three positions, with rank sums (R) for each column; squared rank sums R² = 42.25, 156.25, and 289.00.]

B. Calculations

χ²r = [12 / nk(k + 1)] Σ R² − 3n(k + 1)
    = [12 / (6)(3)(3 + 1)] [42.25 + 156.25 + 289.00] − (3)(6)(3 + 1)
    = (12/72)(487.50) − 72 = 9.25

1 The rank sums and mean ranks show a distinct spread between Level and the other two positions.
2 The test statistic is significant (p = .008).
3 The table of homogeneous subsets shows mean ranks for each group. Blood pressure is significantly lower in the Level condition than either of the other two conditions.
Calculated values differ due to rounding differences. Portions of the output have been omitted for clarity.
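The calculation in Table 27-9B can be sketched in Python. The rank sums 6.5, 12.5, and 17.0 are an assumption inferred from the squared values shown (42.25, 156.25, 289.00), and friedman_chi2r is a hypothetical helper; as the footnotes note, package output (9.65) differs slightly from this uncorrected hand value, so the derived W also differs from the printed .802.

```python
def friedman_chi2r(rank_sums, n, k):
    """chi2_r = [12 / nk(k+1)] * sum(R^2) - 3n(k+1), uncorrected for ties."""
    return 12 / (n * k * (k + 1)) * sum(R ** 2 for R in rank_sums) - 3 * n * (k + 1)

# Rank sums assumed from the squared values in Table 27-9B
chi2r = friedman_chi2r([6.5, 12.5, 17.0], n=6, k=3)   # 9.25

# Kendall's W = chi2_r / [N(k - 1)], based on the hand-calculated chi2_r
W = chi2r / (6 * (3 - 1))
```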
COMMENTARY
There are three kinds of lies: lies, damned lies, and statistics.
—Attributed to Benjamin Disraeli (1804 - 1881)
British Prime Minister
There is a longstanding debate among researchers and statisticians regarding the use of nonparametric versus parametric statistical procedures. This discussion continues today, with no more consensus than 50 years ago.19

Many researchers prefer to use parametric tests because they believe they are more powerful than nonparametric counterparts. However, simulations have shown that this is not necessarily the case, especially when scores are not normally distributed or when samples are small.8 In fact, nonparametric procedures like the Mann-Whitney and Wilcoxon signed-ranks tests have been shown to have greater power than the comparable t-test when normality is not established, even in larger samples.3,11,20,21 This effect is most prominent with between-group studies, and less of an influence in repeated measures designs.22

Many misconceptions continue to influence choices between parametric and nonparametric statistical tests.21 These include the classical view that nonparametric tests should only be used with ordinal measures or with small samples. As examples in this chapter have illustrated, these tests can be applied with continuous data when the conditions warrant their application. And we should remember the converse situations where parametric statistics are used with ordinal scores, when certain assumptions are appropriate regarding the nature of measurements and their intervals, such as use in cumulative scales (see Chapter 12).

The tests that have been described in this chapter are only a sampling of the most commonly applied nonparametric procedures that are available. Statisticians continue to develop and refine these tests and to expand the capabilities of nonparametric methods into areas such as regression and factorial designs. Many tests have been developed with very specific purposes, such as comparing several treatment groups with a single control or looking at differences in variables that have an inherent order. Nonparametric statistics can also be used for correlation procedures and for testing nominal scale data. These procedures are presented next.
19. Altman DG, Bland JM. Parametric v non-parametric methods for data analysis. BMJ 2009;338:a3167.
20. Blair RC, Higgins JJ, Smitley WDS. On the relative power of the U and t tests. Brit J Math Stat Psychol 1980;33(1):114-120.
21. Sawilowsky SS. Misconceptions leading to choosing the t-test over the Wilcoxon Mann-Whitney test for shift in location parameter. J Mod Appl Stat Methods 2005;4(2):598-600.
22. Bridge PD, Sawilowsky SS. Increasing physicians' awareness of the impact of statistics on research outcomes: comparative power of the t-test and Wilcoxon Rank-Sum test in small samples applied research. J Clin Epidemiol 1999;52(3):229-235.
6113_Ch28_415-427 02/12/19 4:38 pm Page 415
CHAPTER 28
Measuring Association for Categorical Variables: Chi-Square
The statistical procedures that have been described thus far have focused on comparisons, generally applied to experimental and quasi-experimental designs. We will now begin to examine procedures for exploratory analysis, where the purpose of the research question is to evaluate the relationship between two or more measured variables. Where studies of group differences ask if group A is different from group B, measures of association ask, "What is the relationship between A and B?"

Many such research questions involve categorical variables that are measured on a nominal or ordinal scale. Rather than using means, these questions deal with analysis of proportions or frequencies within various categories. The purpose of this chapter is to describe the use of several nonparametric tests based on the chi-square distribution. These tests have many applications in clinical research, in both experimental and descriptive analysis, including tests of goodness of fit to determine if a set of observed frequencies differs from a theoretical distribution and tests of independence to determine if two classification variables are related to each other.

■ Testing Proportions

Many research questions in clinical and behavioral science involve categorical variables that are measured on a nominal or ordinal scale. Such data are analyzed by determining if there is a difference between the proportions observed within a set of categories and the proportions that would be expected by chance.

For example, suppose we tossed a coin 100 times. The null hypothesis states that the coin is not biased, and therefore we would theoretically expect a 50:50 outcome—50 heads and 50 tails. But we actually observe 47 heads and 53 tails. Does this deviation from the null hypothesis occur because the coin is biased, or is it only a matter of chance? In other words, is the difference between the observed and expected frequencies sufficiently large to justify rejection of the null hypothesis?

Chi-Square
The chi-square statistic, χ², tests the difference between observed and expected frequencies:

χ² = Σ (O − E)²/E

where O represents the observed frequency and E represents the expected frequency. As the difference between observed and expected frequencies increases, the value of χ² increases.
Table 28-1 Chi-Square Goodness of Fit to a Uniform Distribution: Frequency of Left-Sided and Right-Sided Stroke (N = 233)1
It may seem strange to be dealing with fractions of a count in expected frequencies, as we obviously cannot have a fraction of an individual in a category. However, these values represent theoretical values based on probabilities and are not interpreted as actual counts.

Test of Significance
In the uniform distribution goodness of fit model, df = k − 1. With two categories (right and left), df = 1. Therefore, we compare the calculated value, χ² = 4.124, with the critical value χ²(1) = 3.84, obtained from Table A-5. The calculated value is higher than the critical value, and we can reject the null hypothesis. The computer output for this test shows that the test is significant at p = .042 (see Table 28-1 2).

The difference between the observed and expected frequencies is not likely to be a matter of chance, and the null hypothesis is rejected. By looking at the data, we can see that left-sided strokes occurred more often and right-sided strokes less often than expected. Therefore, researchers in this study concluded that left-hemispheric strokes were more common.1

It is important to emphasize that a χ² test of significance establishes that the observed frequencies are different from expected values. It does NOT indicate that the number of counts within categories are different from each other.

Sample size for goodness of fit tests should be large enough that no expected frequencies are less than 5.0 (see Table 28-1 3). When this criterion is not met, sample size should be increased or categories combined to create an appropriate distribution. Note that this criterion applies to the expected frequencies, not the observed counts.

➤ Results
The total number of patients with large vessel ischemic strokes was 233. The χ² test was used to examine the null hypothesis that there would be no difference in the proportion of left or right hemispheric strokes in this population. The results showed that left-hemispheric strokes (57%) were more common than right-hemispheric strokes (43%) (χ²(1) = 4.68, p = .031).

Known Distributions
A second goodness of fit model compares a sample distribution with a known distribution within an underlying population.
Table 28-2 Chi-Square Goodness of Fit to a Known Distribution of Blood Types (N = 85)

1 The χ² test statistic is significant (p = .004), indicating that the observed frequencies do not fit the proportion of blood types in the population.
2 Corrections can be applied to account for the small expected frequency for type AB (see discussion of Yates correction).
3 The residuals reported here are not standardized. They are the simple difference between observed and expected frequencies (O − E). A negative residual tells us that the observed frequency was smaller than expected. These values are not comparable, however, because they do not account for the number of people in each category.
4 Standardized residuals are obtained by (O − E)/√E. These values allow comparison across categories to indicate which contribute most to χ². These values are not generated by SPSS for the one-sample case. In this analysis, the categories of AB and O blood types have the largest standardized residuals. The proportion of type AB is more than expected, and type O is less than expected.
Calculated values vary due to rounding differences.
Focus on Evidence 28–1
Important Answers: Yes or No

Evidence-based respiratory therapy (RT) protocols, based on published guidelines, are used to standardize patient care in hospital settings, allowing timely assessment and management of patients, without waiting for a physician order. Studies have shown that this practice is cost-effective and safe, providing efficient utilization of resources.7,8 Werre et al9 looked at the effectiveness of RT protocols on length of stay (LOS) and 30-day readmissions for patients with chronic obstructive pulmonary disease (COPD) who were admitted with a diagnosis of acute bacterial pneumonia in one hospital between 2007 and 2012.

The researchers were interested in the effect of protocol type (RT or not), age (below or above 70), and severity index (based on presence of comorbidities, rated minor/moderate or major/severe) on LOS. Using a three-way ANOVA, they found that there was no difference in LOS regardless of age or use of RT protocol. Not surprisingly, they found that the only significant difference was the effect of severity.

They also used chi-square to determine whether readmission within 30 days of discharge was independent of RT protocol, and found that RT protocols were associated with fewer readmissions, as shown in the figure (p = .02). This type of graphic display is useful with a test of independence to visually demonstrate such differences in proportions. Based on these data, the researchers concluded that respiratory care delivered by RT protocols did not confer a disadvantage to patients in terms of hospital stay compared with physician-directed treatment, that treatment efficacy was not sacrificed regardless of disease severity, and 30-day post-discharge readmission rates were lower under RT protocols.

Creating dichotomous variables, as in this example, can be a useful way to look at differences, especially when outcomes can be

[Figure: Number of patients with and without 30-day readmission under RT protocol versus non-RT protocol care.]
Data taken from Werre ND, Boucher EL, Beachey WD. Comparison of therapist-directed and physician-directed respiratory care in COPD subjects with acute pneumonia. Respir Care 2015;60(2):151-4.
Table 28-3 Contingency Table Showing Relationship Between Presence of Dementia and Depression

                   No Dementia    Dementia    Total
Not Depressed      (a) 42         (b) 11       53
Depressed          (c) 7          (d) 15       22
Total              49             26           75

the table. The expected frequencies essentially define the null hypothesis. This process of determining expected frequencies is somewhat more complicated than with one-sample tests because we cannot just evenly divide the total sample among the four cells. We must account for the observed proportions within each cell. Each cell expected frequency (E) is determined using the following formula:

E = (row total × column total)/N

For cell A: E = (53 × 49)/75 = 34.63
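The expected-frequency formula, the χ² statistic, and the standardized residuals for Table 28-3 can be sketched together in Python (hypothetical variable names; small rounding differences from the printed values are expected). The last line anticipates the phi coefficient discussed later in the chapter.

```python
import math

# Observed counts from Table 28-3
# (rows: not depressed / depressed; columns: no dementia / dementia)
observed = [[42, 11],
            [7, 15]]

N = sum(map(sum, observed))                        # 75
row_totals = [sum(row) for row in observed]        # [53, 22]
col_totals = [sum(col) for col in zip(*observed)]  # [49, 26]

# E = (row total x column total) / N for each cell
expected = [[r * c / N for c in col_totals] for r in row_totals]

# chi2 = sum((O - E)^2 / E) over all four cells
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))          # about 15.4

# Standardized residuals: (O - E) / sqrt(E)
residuals = [[(o - e) / math.sqrt(e)
              for o, e in zip(o_row, e_row)]
             for o_row, e_row in zip(observed, expected)]

phi = math.sqrt(chi2 / N)                          # about .45 for this 2x2 table
```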
Test of Significance
Table 28-4A shows the calculation of χ² using the depression data. These calculations proceed as in previous examples, with all observed and expected frequencies listed (order is unimportant). The test value, χ² = 15.43, is compared with the critical value with (R − 1)(C − 1) degrees of freedom. In this case, we have (2 − 1)(2 − 1) = 1 degree of freedom. From Appendix Table A-5 we obtain the critical value χ²(1) = 3.84. Because the calculated value is higher than the critical value, χ² is significant and the null hypothesis is rejected. These variables are not independent of each other. There is a significant association between the presence of depression and dementia in this sample.

Interpreting Chi-Square
If χ² is not significant, our analysis ends, and we would conclude that there was no association between the variables. However, with a significant test, we can examine the frequencies within each cell to interpret these findings. The crosstabulation for this analysis is shown in Table 28-4B. Note that this table includes row and column percentages in addition to observed and expected frequencies. These values are helpful when trying to interpret proportions.

The frequency within each cell is given as a percentage of the column (% within Dementia) and the row (% within Depression). For instance, 15 patients who were depressed exhibited dementia. This represents 68.2% of all those who were depressed (the row %) and 57.7% of all those who had dementia (the column %). Those who had both dementia and depression represent 20.0% of the total sample.

If we examine the standardized residuals (R) for each cell (see Table 28-4 3), we can see that the two cells representing patients who were depressed contribute most to the significant outcome, with standardized residuals close to or greater than 1.96. For those who were depressed, the number of patients without dementia was less than expected by chance (R = –1.9), and the number of patients with dementia was more than expected by chance (R = 2.7). It is reasonable, then, to conclude that the presence of depression and dementia are related.

It is important to note that the presence of a relationship does not indicate that depression causes dementia or vice versa—only that there is a relationship, which may be due to other factors that were not examined. We do not know if one condition preceded the other (see Chapter 19).

Sample Size Considerations
Interpretation of chi-square is based on certain assumptions about the expected frequencies in each cell. Because of the continuous nature of the chi-square distribution, there are limitations when samples are too small, affecting the applicability of critical values and potentially increasing the risk of Type I error. No more than 20% of the cells should contain expected frequencies less than 5.10 When this occurs, the researcher may choose to collapse the table (if it is larger than 2 × 2) to combine adjacent categories and increase expected cell frequencies, or consider increasing the sample size.

Yates' Continuity Correction
When samples are too small in a 2 × 2 contingency table, a correction can be applied. Yates proposed a continuity correction that involves subtracting .5 from the absolute value of (O − E) prior to squaring in the calculation of χ². This results in a more conservative test with a smaller value for χ². This test is automatically included with the output but would be ignored for the depression study because there are no cells with too small expected frequencies (see Table 28-4 4).

A number of statistical sources suggest that Yates' correction for continuity is too conservative and unduly increases the chance of committing a Type II error.11-13 It has been suggested that χ² can provide a reasonable estimate of Type I error for 2 × 2 tables when random or mixed models are used with N ≥ 8.14

Fisher's Exact Test
Because the chi-square distribution can be inaccurate with very small samples, another approach is use of Fisher's exact test for 2 × 2 tables. It is recommended when any expected frequency is less than 1, or 20% of expected frequencies are 5 or less. The test statistic is itself a probability value. The calculation of this test is quite cumbersome, but it is fortunately available in most statistical programs. This value can be requested as part of the results of the chi-square test, but would not be applied in the depression study because there are no cells with small expected frequencies.

FUN FACT
On a sunny afternoon in Cambridge, England, in the 1920s, a group of friends were sharing tea when Muriel Bristol claimed that she could tell whether milk was poured into tea or tea was poured into milk. The renowned statistician, Sir Ronald Fisher happened to be one of the guests, and of course he took on the challenge. He devised a now famous experiment, The Lady Tasting Tea, with 8 cups of tea, 4 each of the two conditions, and presented them to the lady in random order.15,16 This exercise resulted in his development of Fisher's exact test to demonstrate the likelihood that she could identify the cups of tea correctly. It was a landmark advance in the principles of hypothesis testing and probability.
Although accounts disagree, according to legend the lady got them all correct!
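The probability underlying Fisher's exact test is hypergeometric. As a sketch (the helper name hypergeom_p is hypothetical), the Lady Tasting Tea experiment gives the chance of identifying all 4 "milk first" cups correctly purely by guessing:

```python
from math import comb

def hypergeom_p(a, row1, col1, n):
    """P(cell count a) in a 2x2 table with fixed margins, under independence."""
    return comb(col1, a) * comb(n - col1, row1 - a) / comb(n, row1)

# Lady Tasting Tea: 8 cups, 4 "milk first"; all 4 identified correctly
p_all_correct = hypergeom_p(4, 4, 4, 8)   # 1/70, about .014
```

Because .014 is below .05, guessing all cups correctly would itself count as a significant result; a full Fisher's exact test sums such probabilities for all tables at least as extreme as the one observed.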
Table 28-4 Chi-Square Test of Independence: Association between Depression and Dementia (N = 75)

A. Calculations
χ² = 15.43 1

B. SPSS Output: Chi-Square Crosstabs

Depression * Dementia Crosstabulation

                                            No Dementia    Dementia    Total
Not        Count                            (A) 42         (B) 11      53
Depressed  Expected Count                   34.6           18.4        53.0
           % within Depression              79.2%          20.8%       100.0%
           % within Dementia                85.7%          42.3%       70.7%
           % of Total                       56.0%          14.7%       70.7%
           Standardized Residual            1.3            −1.7
Depressed  Count                            (C) 7          (D) 15      22
           Expected Count                   14.4           7.6         22.0
           % within Depression              31.8%          68.2%       100.0%
           % within Dementia                14.3%          57.7%       29.3%
           % of Total                       9.3%           20.0%       29.3%
           Standardized Residual            −1.9           2.7
Total      Count                            49             26          75
           Expected Count                   49.0           26.0        75.0
           % within Depression              65.3%          34.7%       100.0%
           % within Dementia                100.0%         100.0%      100.0%
           % of Total                       65.3%          34.7%       100.0%

1 χ² = 15.43
2 Percentages can be confusing, so look at rows and columns. The row percentages are for depression, and column percentages are for dementia.
3 Standardized residuals indicate that cells C and D provide the largest effects. Chi-square is significant (p < .001).
4 The continuity correction is only applied when there are cells with expected frequencies less than 5. Therefore, it is not relevant here and we can ignore this value.
5 Effect size indices can be requested, including phi, V, C, and the odds ratio.
Calculated values vary due to rounding differences. Portions of the output have been omitted for clarity.
■ Power and Effect Size

The chi-square statistic tests the hypothesis that observed proportions differ from chance. However, it does not provide a measure of effect size. Effect size indices for chi-square are interpreted as correlation coefficients, which reflect the strength of the association between two variables (see Chapter 29). There are several measures that can be used for nominal data, each with conventional effect sizes (see Table 28-5).17 The choice of test depends on the design and type of categories.

Effect Size Indices

The effect size index, w, can be used with uniform and known distributions as well as contingency tables. This value is also called the phi coefficient, φ, which is used with 2 × 2 tables. For the data in Table 28-4,

    w = √(χ²/N) = √(15.44/75) = .454

This is the effect size used in G*Power. These values will range from 0 to 1.0. See Box 28-1 for a description of the process for power analysis and sample size estimation using this index.

With symmetrical tables larger than 2 × 2, the contingency coefficient, C, is used. The range of possible values will differ depending on the size of the table. Therefore, values cannot be compared unless tables are of equal size. When tables are asymmetrical, Cramer’s V is used. Conventional effect sizes vary depending on the smaller number of rows or columns (q).

These coefficients can be reported with chi-square output if requested (see Table 28-4 [5]). We can see that all of these indices are significant, approximating .45. Given the conventional effect sizes shown in Table 28-5, these would be considered close to large effects. It is the researcher’s task to determine if this degree of relationship was meaningful in the context of the study of depression and dementia.

It is worth reinforcing Cohen’s admonition that conventional effect sizes should not be used as absolutes, but must be put into the context of the variables being studied.17,18
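The chi-square statistic and the effect size w for Table 28-4 can be reproduced outside of SPSS. As a minimal sketch (Python with SciPy; the cell counts are taken from Table 28-4):

```python
import math

from scipy.stats import chi2_contingency

# Observed counts from Table 28-4 (rows: depression status; columns: dementia status)
observed = [[42, 11],   # not depressed: no dementia, dementia
            [7, 15]]    # depressed:     no dementia, dementia

# correction=False gives the uncorrected Pearson chi-square, as in the text;
# all expected frequencies exceed 5 here, so the continuity correction is not needed.
chi2, p, df, expected = chi2_contingency(observed, correction=False)

# Effect size w (equivalent to phi for a 2 x 2 table): w = sqrt(chi2 / N)
n = sum(sum(row) for row in observed)
w = math.sqrt(chi2 / n)

print(f"chi2({df}) = {chi2:.2f}, p = {p:.4g}, w = {w:.3f}")
```

This reproduces χ²(1) = 15.44 and w = .454 from the text; the slight difference from the 15.43 shown in the table output reflects rounding.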
Box 28-1 Power Analysis for Chi-Square

Post Hoc Analysis: Determining Power
We can use G*Power to determine the level of power achieved for chi-square data in Table 28-4. This requires entering the effect size, total sample size, level of significance, the number of cells in the design, and the degrees of freedom associated with the test. For this analysis we enter w = .454, N = 75, α = .05, 4 cells, df = 1. This results in 98% power (see chapter supplement).

but in a repeated measures design. The χ² test is not valid under these conditions.

The McNemar test is a form of the χ² statistic used with 2 × 2 tables that involve correlated samples, where subjects act as their own controls or where they are matched. This test is especially useful with pretest–posttest designs when the dependent variable is measured as a dichotomy on an ordinal or nominal scale.
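The G*Power result in Box 28-1 can be checked directly from the noncentral chi-square distribution: the noncentrality parameter is λ = N·w², and power is the probability that a noncentral χ² exceeds the critical value. A sketch of that calculation (Python with SciPy; the inputs are those from Box 28-1, and this route is an alternative to G*Power, not the tool the book uses):

```python
from scipy.stats import chi2, ncx2

# Inputs from Box 28-1
w, n, alpha, df = 0.454, 75, 0.05, 1

crit = chi2.ppf(1 - alpha, df)   # critical value of chi-square at alpha = .05
ncp = n * w**2                   # noncentrality parameter: lambda = N * w^2

power = ncx2.sf(crit, df, ncp)   # P(noncentral chi-square > critical value)
print(f"power = {power:.3f}")    # close to the 98% reported in Box 28-1
```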
DATA DOWNLOAD
Table 28-6 McNemar Test: Use of Pain Medications Before and After Vertebroplasty (N = 70)

                               After
                      No Meds      Meds      Total
Before  No Meds  Count         A 25       B 12       37
                 % of Total     35.7%      17.1%     52.9%
        Meds     Count         C 23       D 10       33
                 % of Total     32.9%      14.3%     47.1%
Total            Count           48         22        70
                 % of Total     68.6%      31.4%    100.0%

Test Statistics(a) — Before & After
    N               70
    Chi-Square(b)   2.857  [2]
    Asymp. Sig.     .091   [3]
a. McNemar Test
b. Continuity Corrected

Uncorrected:  χ² = (B − C)²/(B + C) = (12 − 23)²/(12 + 23) = 3.457
Corrected:    χ² = (|B − C| − 1)²/(B + C) = (|12 − 23| − 1)²/(12 + 23) = 2.857

1 The four cells in a 2 × 2 design are labelled A, B, C, D. The only cells of interest are B and C, representing subjects who changed their status from before to after. Percentages indicate the proportion of the total sample in each cell. The effect size odds ratio is the ratio of the proportions for cells B and C (.171/.329 = .519).
2 In SPSS, the McNemar chi-square test statistic is adjusted using the continuity correction. Other statistical programs may use the standard calculation.
3 The test statistic is less than the critical value χ²(1) = 3.84 and is not significant (p = .091). Note that this result would still be nonsignificant if the uncorrected value was used.
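The corrected and uncorrected statistics in Table 28-6 can be verified by hand; only the discordant cells enter the test. A minimal sketch (Python with SciPy; cell counts B and C are from the table):

```python
from scipy.stats import chi2

# Discordant cells from Table 28-6: B (No Meds -> Meds), C (Meds -> No Meds)
b, c = 12, 23
n = 70  # total pairs

chi2_uncorrected = (b - c) ** 2 / (b + c)         # standard McNemar statistic: 3.457
chi2_corrected = (abs(b - c) - 1) ** 2 / (b + c)  # continuity corrected, as SPSS reports: 2.857

p = chi2.sf(chi2_corrected, df=1)                 # .091, not significant

# Effect size: odds ratio for paired data, OR = P_B / P_C
odds_ratio = (b / n) / (c / n)                    # .52

print(f"chi2 = {chi2_corrected:.3f}, p = {p:.3f}, OR = {odds_ratio:.2f}")
```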
This calculated value is less than the critical value χ²(1) = 3.84 and is not significant (see Table 28-6 [3]).

Power and Effect Size

The effect size for the McNemar test is based on the odds ratio (OR). However, for this analysis we must account for the fact that the data are paired, not independent. With correlated samples OR = PB/PC, which is the ratio of the proportion (P) of cases in cells B and C, the cells that represent change. These cells denote discordant pairs. Cells A and D are considered concordant pairs, showing no change. For this analysis, PB is 12/70 = .171, and PC is 23/70 = .329. Therefore, OR = .171/.329 = .519. This value is used as an effect size index for the test.

This value is interpreted similarly to the standard OR. The null value is 1.0 (both B and C proportions are the same). With a value less than 1.0, as we see here, cell B has a smaller proportion than C, and with an OR above 1.0, cell B would have the larger proportion. In this example, a larger proportion of those who changed went from Meds to No Meds than the reverse. See Box 28-2 for further discussion of power analysis for the McNemar test.

➤ Results
The McNemar test was used to examine the proportion of patients who changed their medication status from before to after vertebroplasty. Of the total sample (N = 70), 47.1% were taking medications before surgery, which was reduced to 31.4% post-surgery. Of those using medications before (n = 33), 70% (n = 23) were no longer using medications after surgery. Of those not using medication before (n = 37), 32.4% (n = 12) started using medications post-surgery. Using a continuity correction, these associations were not significant (χ²(1) = 2.857, p = .091, OR = .52). Based on these results, the proportion of patients who changed their medication status was not different from what would be expected by chance.
COMMENTARY
Clinical researchers can find many uses for the chi- make these determinations and confirm the validity of
square statistic when proportions are of interest for data the randomization process.
analysis and for descriptive purposes. It is often useful Although chi-square seems like an uncomplicated test
as a way of establishing group equivalence following to run, its interpretation is not so simple. Analysis of
random assignment. For instance, once two groups have proportions is not as straightforward as comparisons
been assigned, it may be of interest to compare the between groups. That is why effect sizes are relevant, so
numbers of males and females in each group to see if that there is some guide for judging the strength of an
they were assigned in equal proportions. Or it may be association. However, even then, it is a matter of trans-
important to determine if certain age groups are bal- lating the story that frequencies provide into clinically
anced across the groups. Chi-square can be used to meaningful judgments.
12. Haviland MG. Yates’s correction for continuity and the analysis of 2 × 2 contingency tables. Stat Med 1990;9(4):363-7; discussion 369-83.
13. Campbell I. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Stat Med 2007;26(19):3661-75.
14. Mantel N, Benjamin E. The odds ratios of a 2 × 2 contingency table. Am Stat 1975;29(4):143-45.
15. Fisher RA. The Design of Experiments. 9th ed. New York: Macmillan; 1971.
16. Salsburg D. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York: Henry Holt and Company; 2001.
17. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum; 1988.
18. Cohen J. Things I have learned (so far). Am Psychologist 1990;45(12):1304-12.
19. Evans AJ, Jensen ME, Kip KE, et al. Vertebral compression fractures: pain reduction and improvement in functional mobility after percutaneous polymethylmethacrylate vertebroplasty—retrospective report of 245 cases. Radiology 2003;226(2):366-72.
20. Siegel S, Castellan NJ. Nonparametric Statistics for the Behavioral Sciences. 2nd ed. New York: McGraw-Hill; 1988.
CHAPTER
29 Correlation
Correlation is, by and large, a familiar concept. Pairs of observations, X and Y, are examined to see if they “go together.” For instance, we generally accept that heart rate increases with physical exertion, that weight loss will be related to caloric intake, or that weight loss will increase as the frequency of exercise increases. These variables are correlated in that the value of one variable is systematically related to values of the second variable, although not perfectly and to differing degrees. With a strong correlation, we can infer something about the second value by knowing the first. In Chapter 28, chi-square was used to look at the association between nominal variables. The purpose of this chapter is to introduce several correlation statistics for use with ordinal and ratio/interval data that can be applied to a variety of exploratory research designs.

scores in weight with higher exercise time in a negative relationship.

Scatter Plots

Visualizing these data can help to clarify patterns using a scatter plot, as shown in Figure 29-1. Each dot represents the intersection of a pair of related observations. The points in Plot A show a pattern in which the values of Y increase in exact proportion to the values of X, in a perfect positive relationship with data points falling on a straight line. In Plot B the data demonstrate a perfect negative relationship, with lower values of Y associated with higher values of X.

More realistic results are shown in the other plots. The closer the points approximate a straight line, the stronger the association. Plots C and D show a positive relationship, but the correlation in Plot D is not as strong. Plot E shows a negative relationship with greater variance from a straight line. In Plot F the points have a seemingly random pattern, reflecting no relationship.

Positive correlations are sometimes called direct relationships. Negative correlations may also be described as indirect or inverse relationships.
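These patterns can be illustrated numerically as well as graphically. As a small sketch (Python with NumPy; the data are invented for illustration, loosely echoing the plots of Figure 29-1), a perfect positive, a perfect negative, and an imperfect linear relationship yield correlations of +1.0, -1.0, and something in between:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

datasets = {
    "perfect positive": 2 * x + 1,     # every point on a rising line (like Plot A)
    "perfect negative": -3 * x + 20,   # every point on a falling line (like Plot B)
    # a linear trend with scatter, so the points only approximate a line
    "imperfect": 2 * x + np.array([0.5, -1.0, 0.8, -0.3, 0.2]),
}

results = {}
for label, y in datasets.items():
    # Pearson r is the off-diagonal element of the 2 x 2 correlation matrix
    results[label] = np.corrcoef(x, y)[0, 1]
    print(f"{label}: r = {results[label]:+.2f}")
```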
between 0.00 and ±1.00. These values are expressed as decimals, usually to two places, such as r = .14 or r = –.70. The plots in Figure 29-1 represent a variety of potential outcomes. The magnitude of the coefficient indicates the strength of the association between X and Y. The closer the value is to ±1.00, the stronger the relationship (see Box 29-1). The sign of the correlation coefficient indicates the direction of the relationship.

There is often misunderstanding of negative correlations as being weaker or less meaningful, but that is not the case. The sign only indicates direction.

Hypothesis Testing

Correlation coefficients can be tested against a population value to determine if they are significant. A significant correlation is unlikely to have occurred by chance. Typically, the null hypothesis will state that the population correlation will be zero.

An alternative hypothesis can be specified as nondirectional or directional. Nondirectional hypotheses are more common, analyzed using two-tailed tests, recognizing the potential for positive or negative outcomes. There may be instances, however, when a one-tailed test is justified when only one direction is of interest.

Assumptions for Correlation

Meaningful interpretation of correlation coefficients can only be made when certain assumptions are met.

• Subjects’ scores represent the underlying population, which is normally distributed.
• Each subject contributes a score for X and Y.
• X and Y are independent measures so that one score is not a component of the other. For instance, it would make no statistical sense to correlate a measure of gait velocity with distance walked, as distance is a component of velocity (distance/time). Similarly, it is fruitless to correlate a subscale score on a functional assessment with the total score, as the first variable is included in the second. In each case, correlations will be artificially high because part of the variance in each quantity is being correlated with itself.
• X values are observed, not controlled. It is not appropriate, for instance, to assign subjects to a particular intervention dosage to correlate with a response. If it is of interest to see the effect of a
Figure 29–1 Examples of scatterplots with different degrees of correlation (panels A–F). Plots A, B, C and D represent strong positive correlations. Plot E shows a moderately strong negative correlation. Plot F shows a scatterplot with a random pattern and a correlation close to zero.
relationships.

The pattern of a relationship between two variables is often classified as linear or nonlinear. The plots in Figures 29-1A and B are perfectly linear because the points fall on a single straight line. Plots C-E can also be considered linear, although as they begin to deviate from a straight line their correlation decreases.

The coefficient r is a measure of linear relationship only. When a curvilinear relationship is present, the linear correlation coefficient will not be able to describe it accurately. For instance, a curvilinear shape typically characterizes the relationship between strength and age. As age increases so does strength, until a plateau is reached in adulthood, followed by a decline in elderly years. This type of relationship is illustrated in Figure 29-2 where a strong systematic relationship is clearly evident between X and Y, although r = .58, suggesting only a moderate relationship. Because r measures only linear functions, the correlation coefficient for a curvilinear relationship can be small, even when X and Y are indeed related.

This should caution readers to be critical about the interpretation of correlation coefficients. It may not be reasonable to conclude that two variables are unrelated solely on the basis of a low correlation. By plotting a scatter diagram for all correlation analyses, researchers can observe whether the association in a set of data is linear or curvilinear and thereby decide if r is an appropriate statistic for analysis.

Figure 29–2 Illustration of a curvilinear relationship between age and strength (r = .58). These data were also used in the explanation of quadratic trends in Chapter 26.

The eta coefficient (η), also called the correlation ratio, is an index that does not assume a linear relationship between two variables. It can be interpreted in the same way as r, although η can only range from 0.00-1.00. This statistic requires that one variable is nominal. See the Chapter 29 Supplement for further description of the eta coefficient.
DATA DOWNLOAD
Table 29-1 Pearson Correlation Coefficient: Ambulation and Cognitive Status in Children with Traumatic Brain Injury (N = 93)

[Scatterplot of ambulation score against cognition score, and the SPSS correlation matrix for Cognition and Ambulation, omitted here.]

1 The scatterplot reveals a variable pattern with a positive linear trend, but does not approach a straight line.
2 The Pearson correlation is .348. This matrix is redundant, showing the correlation of each variable with the other.
3 The correlation of each variable with itself is 1.0.
4 The correlation coefficient is significant at p = .001.
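The r and p values in a matrix like Table 29-1 come from the Pearson formula and its associated t-test. A minimal sketch (Python with SciPy; the paired scores here are invented for illustration, not the Table 29-1 data) showing that the p value from `pearsonr` matches the t-test on r with n − 2 degrees of freedom:

```python
import math

from scipy.stats import pearsonr, t

# Hypothetical paired scores (not the actual Table 29-1 data)
x = [12, 15, 9, 20, 17, 11, 14, 18]
y = [26, 30, 22, 38, 31, 25, 27, 36]

r, p = pearsonr(x, y)  # Pearson r and its two-tailed p value

# Equivalent significance test: t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2
n = len(x)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_manual = 2 * t.sf(abs(t_stat), n - 2)

print(f"r = {r:.3f}, p = {p:.4f} (manual p = {p_manual:.4f})")
```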
Focus on Evidence 29–1
Correlation and Validity: Defining “Useful”

There is growing interest in the study of sedentary behavior and its correlates, in an effort to develop interventions that will encourage more physical activity. Studies have used specially designed questionnaires to derive data on self-report sedentary time.

Two recent studies have established criterion validity of questionnaires by correlating self-report sedentary time with minutes of activity recorded by a movement monitoring device.9,10 These studies examined activity in adolescents and adults, men and women. Both studies found their distributions were skewed and therefore used the nonparametric Spearman rank correlation (rs) for analysis.

The studies found that participants overestimated their sedentary time compared to the minutes of activity logged by the device. They also found various differences between adolescents and adults and between men and women. Most relevant, however, is that they both documented significant correlations ranging from .02 to .52, predominantly around .30 and with narrow confidence intervals.

In summarizing findings, one study defined correlations as low, moderate or high using Cohen’s conventional criteria.10 Interestingly, they considered rs = .52 as high, but both rs = .49 and rs = .38 as moderate. The other study compared findings to previous research, citing a systematic review that showed an average correlation of .30 in similar studies, suggesting that their findings of rs = .30 in women and .25 in men indicated “above average” criterion validity.9 One study involved 159 subjects10 and the other tested 2,175 people,9 assuring good power for their analyses. In their conclusions, both sets of authors said that the questionnaires would be “useful” and valid instruments for gathering information on sedentary behavior.

What does criterion validity mean, and how do correlation coefficients help to establish it? If the monitoring device is considered the “standard,” is a correlation below .50 helpful in establishing the utility of the questionnaire? And how does one make judgments about high versus moderate correlations, when they vary by .03? You can draw your own judgments about how well validity has been established, but these findings illustrate why conventional criteria do not always make sense for clinical application. Authors should find the clinical relevance in their data, not take statistical “rules” too literally. And those who look to evidence for guidance must draw their own conclusions about how to use data in a meaningful way.
Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient, given the symbol rs (also called Spearman’s rho), is a nonparametric analog of the Pearson r. The null hypothesis for the study of verbal and reading comprehension states that there is no association between these measures. The alternative hypothesis states that a positive correlation is expected:

    H0: rs = 0        H1: rs > 0

Note that the hypothesis for the Spearman test uses rs because this test does not approximate a population parameter.
DATA DOWNLOAD
Table 29-2 Rank Correlations Using Spearman and Kendall’s tau-b: Verbal and Reading Comprehension Scores (N = 16)
Table 29-3 Critical Values of Spearman’s Rank Correlation Coefficient, rs

        α2:   .10    .05    .02    .01
 n      α1:   .05    .025   .01    .005
10           .564   .648   .745   .794
11           .536   .618   .709   .755
12           .503   .587   .671   .727
13           .484   .560   .648   .703
14           .464   .538   .622   .675
15           .443   .521   .604   .654

correlated with a continuous variable on the ordinal scale.12 Like other nonparametric tests, the total sample is ranked, and then the average rank of each X category is computed. If there is an association between the variables, the higher ranks will be associated with the category labeled 1, and lower ranks will be associated with the category labeled 0. This analysis is analogous to the Mann-Whitney U test, where the dichotomous variable identifies two groups.

See the Chapter 29 Supplement for illustrations and calculations of phi, point biserial and rank biserial correlations.
Table 29-4 Anatomy and Physiology Grades (r = .87)

STUDENT    ANATOMY     PHYSIOLOGY
   1          50           60
   2          56           72
   3          52            7
   4          57           71
   5          47           63
   6          65           75
   7          60           76
Mean       55 ± 6.2     69 ± 5.9

actually be a function of some third variable, or a set of variables, that is related to both X and Y.

For example, researchers have shown that weak grip strength and slowed hand reaction time are associated with falling in elderly persons.13 Certainly, we would not infer that decreased hand function causes falls. However, weak hand musculature may be associated with general deconditioning and is related to balance and motor recovery deficits. These associated factors are more likely to be the contributory factors to falls. Therefore, a study that examined the correlation between falls and hand function would not be able to make any valid assumptions about causative factors (see Fig. 29-3).

Figure 29–3 The need for caution in assigning cause to a correlation. The relationship between grip strength and falls is actually a function of their mutual relationship with deconditioning, a confounding variable.

Outliers

If a set of data points represents a distribution of related scores, the points will tend to cluster in a pattern. But sometimes, one or two deviant scores are separated from the cluster, so that they distort the statistical association. For example, the blue data points in Figure 29-4 show some variability, but most of the points fall within a definite linear pattern (r = .63). With the addition of one point that does not fit with the rest of the pattern, the correlation reduces to r = .10. This point is called an outlier, because it lies outside the obvious cluster of scores. This point has an unusual XY value and has significantly altered the statistical description of the data.

Figure 29–4 The effect of an outlier on correlation. Without the outlier, the original distribution has a linear correlation of r = .63, p = .007. With the single outlier included (red triangle), the correlation is r = .095, p = .707.

Why Outliers Occur

Researchers must consider several possibilities regarding outliers. The score may, indeed, be a true score but an extreme one, because the sample is too small to generate a full range of observations. If more subjects were tested, there might be less of a discrepancy between the outlier and the rest of the scores.

There may also be circumstances peculiar to this data point that are responsible for the large deviation. For example, the score may be a function of error in measurement or recording, equipment malfunction, or some miscalculation. It may be possible to go back to the original data to find and correct this type of error. Other extraneous factors may also contribute to the aberrant score, some of which are correctable, others that are not. For instance, the data point may have been collected by a different tester who is not reliable. Or the researcher may find that the subject was inappropriately included in the sample.

Outliers can have serious effects on the outcome and should always be examined using a scatter plot. Some researchers consider scores beyond three standard deviations from the mean to be outliers. The researcher must determine if outliers should be retained or discarded in the analysis. This decision should be made only after a thorough evaluation of the experimental conditions, the data collection procedures, and the data themselves. As a general rule, there is no statistical rationale for discarding an outlier. However, if a causal factor can be identified, the point may be
of X and Y (rXY = .70), we can see that age and hospital stay no longer demonstrate as strong a relationship. A large part of the observed association between them could be accounted for by their common relationship with functional status.

The simple correlation between X and Y is called a zero-order correlation. The term rXY·Z is called a first-order partial correlation, because it represents a correlation with the effect of one variable eliminated. The significance of a first-order partial correlation can be determined by referring to critical values of r using n – 3 degrees of freedom. Partial correlation can be expanded to control for more than one variable at a time.

Partial correlation is a useful analytic tool for eliminating competing explanations for an association, thereby providing a clearer explanation of the true nature of an observed relationship and ruling out extraneous factors. This discussion will be expanded in Chapter 30 in relation to regression analysis.
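A first-order partial correlation can be computed directly from the three zero-order correlations. A sketch (Python; rXY = .70 echoes the example above, while rXZ = .80 and rYZ = .75 are invented here purely to illustrate the formula):

```python
import math

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation r_XY.Z: the X-Y correlation
    with the effect of Z removed from both variables."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# r_xy from the example in the text; r_xz and r_yz are hypothetical values
r_xy, r_xz, r_yz = 0.70, 0.80, 0.75

print(f"r_XY.Z = {partial_corr(r_xy, r_xz, r_yz):.2f}")
```

With these inputs, the zero-order rXY = .70 shrinks to roughly .25 once the shared relationship with Z is removed, mirroring how controlling for functional status weakened the age/hospital-stay association.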
COMMENTARY

Correlation is a statistic that is generally understood, yet often misinterpreted or misused. Researchers, practitioners, and consumers must be aware of the potential danger of using statistical correlation as evidence of a clinical association simply on the basis of numbers, recognizing that any two quantitative variables can be correlated, whether or not they are based on a logical framework. Most importantly, the interpretation must be able to support what the correlation means, or why the two variables are related—which often translates to trying to establish a causal relationship. All statistical analysis is limited by the clinical significance of the data being analyzed.

There are many examples of spurious correlations that illustrate this point. For example, Snedecor and Cochran7 cite a correlation of −.98 between the annual birth rate in Great Britain from 1875 to 1920 and the annual production of pig iron in the United States. Maybe we can view these variables as related to some general socioeconomic trends, but surely, neither one could seriously be considered a function of the other. Willoughby16 really puts this into perspective, arguing that a high positive correlation between a boy’s height and the length of his trousers would not mean that lengthening trousers would produce taller boys! Researchers are often tempted to draw conclusions about relationships without considering the potential relevance of a third variable that might actually be the cause for both. This important caveat will be addressed further in different contexts in several upcoming chapters.
CHAPTER
30 Regression
Correlation statistics are useful for describing the relative strength of a relationship between variables. However, when a researcher wants to predict an outcome or to explain the nature of that relationship, a regression procedure is used. These functions are crucial to effective clinical decision making, helping to explain empirical clinical observations and providing information that can be used to set realistic goals for our patients. They also have important implications for prognosis, efficiency, and quality of patient care, especially in situations where resources are limited.

The purpose of this chapter is to describe the foundations of regression and how it can be used to interpret clinical data. Discussions include techniques for simple linear regression, multiple regression, logistic regression, and analysis of covariance.

illustrated in Figure 30-1. In panel A there is almost total overlap of variance for two variables, X and Y. These variables would be highly correlated. By knowing values of X, we could accurately predict Y. In panel B, there is almost no overlap. These variables have very little variance in common and X would not be predictive of Y. In panel C, two variables have substantial, but not complete, overlap. In this case, some other

[Figure 30-1: diagrams of shared variance between X and Y for panels A, B, and C.]
Therefore, r² is a measure of the accuracy of prediction of Y based on X. This term is called the coefficient of determination, but you will typically only see it referred to as “R squared.” Values of r² will range from 0.00 to 1.00. No negative ratios are possible because it is a squared value.

■ Simple Linear Regression

The primary goal of regression is to develop an equation to calculate a predicted value based on a related variable. In its simplest form, linear regression involves the examination of two variables that are linearly related, to determine how well an independent variable X predicts values of a dependent variable Y. Both X and Y variables are expected to be continuous.

Dependent variables may also be called response or outcome variables. Independent variables may also be referred to as predictor variables, explanatory variables, or covariates.

➤ CASE IN POINT #1
Research has shown that there is a high positive correlation between body mass index (BMI) and systolic blood pressure (SBP).1 Let’s consider hypothetical data from a sample of 10 women, to determine if we can predict their SBP by knowing their BMI. These variables have a correlation of r = .87.

In this example, we want to predict values of SBP (Y) by knowing values of BMI (X). The scatterplot in Figure 30-2A illustrates the unlikely situation where X and Y are perfectly correlated, r = 1.00. Therefore, the line that connects the dots would allow us to predict an exact value of Y for each value of X simply by finding the coordinates along the line. That means that if we know the value of X, we wouldn’t even have to measure Y, because we could accurately predict it.

With correlations less than 1.00, however, we can’t be as accurate, and we would only be able to estimate Y values. So let’s consider a more realistic scenario, as shown in Figure 30-2B. We can see that the data fall in a linear pattern, with larger values of BMI associated with larger values of SBP, but it’s not perfect (r = .87).

Figure 30–2 Scatter plots of BMI (X) and systolic blood pressure (Y) for 10 women. A) Perfect correlation with regression line, B) Plot with strong correlation.

Scatterplots for correlation analysis do not specify independent and dependent variables, and it does not matter which variable is designated as X or Y. In regression, however, these designations are relevant. The independent variable (the predictor) is always plotted along the X axis, and the dependent variable (the response) is plotted along the Y axis.

The Regression Line

Even if correlation is not perfect, we can still use the distribution to predict Y scores. We want to draw a line to fit the data, but the line would not go through all the points, making it only an estimate. Because some points would fall above and below the line, its prediction would not be exact, and there will be some error. The process of regression identifies the one line that “best” describes the orientation of all data points in the scatter plot. This is called the linear regression line or the line of best fit.
The quantity Ŷ (Y-hat) is the predicted value of Y. The term a is the regression constant. It is the Y-intercept, representing the value of Y when X = 0. This can be a positive or negative value. The term b is the regression coefficient. It is the slope of the line, which is the rate of change in Y for each one-unit change in X. If b = 0, the slope of the line is horizontal, indicating no relationship between X and Y, where Y is constant for all values of X. The positive or negative direction of the slope will correspond to a positive or negative correlation between X and Y.

You may remember learning the equation for a straight line a little differently from algebra as Y = mX + b, or some version of that. It all means the same thing, just a little change in terminology and letters.

[Figure 30-3: scatterplots of SBP against BMI with the fitted line Ŷ = –29.80 + 6.81X and a residual (Y – Ŷ) indicated.]

The Regression Model
DATA
Table 30-1 Linear Regression of Systolic Blood Pressure (SBP) on Body Mass Index (BMI) (N = 10)
DOWNLOAD
Standardized Residuals
1 20.80 110 111.85 1.85
2 21.55 130 116.96 13.04 1
3 22.45 105 123.08 18.08
4 23.00 124 126.83 2.83
5 23.60 136 130.92 5.08 0
6 25.40 145 143.17 1.83
7 25.40 157 143.17 13.83
8 26.70 138 152.03 14.03 1
9 27.25 158 155.77 2.23
10 28.70 167 165.65 1.35
Based on data shown in Figure 30-3 2
110 120 130 140 150 160 170
C. SPSS Output: Linear Regression

Correlations 2   ANOVAa 4

Coefficientsa

                                                                           95% Confidence Interval for B
                 Unstandardized Coefficients   Standardized Coefficients
Model 5          B          Std. Error         Beta 6      t       Sig.    Lower Bound   Upper Bound
1  (Constant)    –29.800    33.931                         –.878   .405    –108.046      48.446
   BMI           6.812      1.379              .868        4.941   .001    3.633         9.992
a. Dependent Variable: SBP
1 The data show raw scores, predicted values of SBP (Ŷ), and residual scores (Y – Ŷ).
2 Descriptive statistics for each variable and their correlation. BMI and SBP have a strong positive correlation (r = .868).
3 R-square and SEE reflect the degree of accuracy in the regression equation.
4 The analysis of variance of regression demonstrates that there is a significant relationship (p = .001) between the independent and dependent variables (matches correlation). The df associated with SSreg equal the number of predictor variables, in this case k = 1. The total df will equal N – 1. The df for the residual variance, dfres, will be N – k – 1.
5 The regression equation is: Ŷ = –29.80 + 6.81(BMI).
6 A t-test is used to test the significance of the regression coefficient for BMI. This p value will equal the probability identified in the ANOVA and correlation, as they are all essentially testing the same thing. A significance test for the constant is run, but it is not meaningful and is ignored.
Portions of the output have been omitted for clarity.
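The estimates in this output can be reproduced from the ten raw scores with the standard least-squares formulas, b = Σ(x − x̄)(y − ȳ)/Σ(x − x̄)² and a = ȳ − b·x̄. A minimal sketch in plain Python (variable names are ours, not SPSS's):

```python
# Least-squares fit of SBP on BMI using the 10 subjects in Table 30-1.
bmi = [20.80, 21.55, 22.45, 23.00, 23.60, 25.40, 25.40, 26.70, 27.25, 28.70]
sbp = [110, 130, 105, 124, 136, 145, 157, 138, 158, 167]

n = len(bmi)
mean_x = sum(bmi) / n
mean_y = sum(sbp) / n

# Sums of squares and cross-products around the means
sxx = sum((x - mean_x) ** 2 for x in bmi)
syy = sum((y - mean_y) ** 2 for y in sbp)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bmi, sbp))

b = sxy / sxx                  # regression coefficient (slope)
a = mean_y - b * mean_x        # regression constant (Y-intercept)
r = sxy / (sxx * syy) ** 0.5   # Pearson correlation

print(round(a, 2), round(b, 2), round(r, 3))  # -29.8 6.81 0.868
```

This recovers the constant, coefficient, and correlation shown in the SPSS output above.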
Residuals

to the actual score, and subject #7 has a larger residual. Theoretically, if we took measurements for many women with a BMI of 25.40, we would get varied SBP scores that would form a normal distribution, and the mean of the Y scores would fall on the regression line (equal to Ŷ). Therefore, the least-squares line is actually an estimate of the population regression line, and each point on the line is an estimate of the population mean Ȳ at each value of X. Having several measures for each value of X reduces standard error and improves the accuracy of prediction.

Analysis of Residuals

One way to determine if the normality assumption for regression analysis has been met is to examine a plot of residuals, as shown in Figure 30-4. By plotting the residuals on the Y-axis against the predicted scores on the X-axis, we can appreciate the magnitude and distribution of the residual scores. The central horizontal axis represents the mean of the residuals, or zero deviation from the regression line. When the linear regression model is a good fit, the residual scores will be randomly dispersed close to zero, as in panel A. The wider the distribution of residuals around the zero axis, the greater the error.

Other patterns illustrate problematic residual distributions. The pattern in panel B indicates that the variance of the residuals is not consistent but is dependent on the value of the predicted variable. Residual error increases as the predicted value gets larger. Therefore, the
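The fan-shaped pattern described for panel B can be illustrated numerically. The sketch below uses invented data (not the chapter's) in which the error size grows with X, fits a least-squares line, and shows that the average absolute residual is larger at higher predicted values:

```python
# Synthetic illustration of panel B: residual spread that grows with the
# predicted value (heteroscedasticity). Data are invented for this sketch.
x_vals = [float(i) for i in range(1, 101)]
# Alternating +/- errors whose size is proportional to x (a "fan" shape)
y_vals = [2.0 + 0.5 * x + (0.1 * x if i % 2 == 0 else -0.1 * x)
          for i, x in enumerate(x_vals)]

n = len(x_vals)
mx = sum(x_vals) / n
my = sum(y_vals) / n
b = sum((x - mx) * (y - my) for x, y in zip(x_vals, y_vals)) / \
    sum((x - mx) ** 2 for x in x_vals)
a = my - b * mx

# Pair each predicted value with its residual, ordered by prediction
pairs = sorted((a + b * x, y - (a + b * x)) for x, y in zip(x_vals, y_vals))
low = [abs(res) for _, res in pairs[: n // 2]]   # residuals at small predictions
high = [abs(res) for _, res in pairs[n // 2:]]   # residuals at large predictions

# The fan pattern shows up as larger average residuals at higher predictions.
print(sum(high) / len(high) > sum(low) / len(low))  # True
```

A well-behaved model would show roughly equal spread in both halves, as in panel A.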
Such transformations may stabilize the variance in the data, normalize the distributions, or create a more linear relationship (see Appendix C). When curvilinear tendencies are observed, polynomial regression models may be used to better represent the data. This approach is discussed later in this chapter.

■ Multiple Regression

Multiple regression is an extension of simple linear regression analysis, allowing for prediction of Y using a set of several independent X variables. The intent is to explain more of the variance in Y than one predictor can do alone. Like simple linear regression, multiple regression is also based on the concept of least squares, so that the model minimizes error in predicting Y. However, we can no longer visualize regression as a single line, but conceptually as a linear relationship in a multidimensional space.

Multiple regression can accommodate continuous and categorical independent variables, which may be naturally occurring or experimentally manipulated. The dependent variable, Y, must be a continuous measure.

➤ CASE IN POINT #2
Consider again the question of BMI as a predictor of SBP, but this time we'll expand it with a hypothetical sample of 145 men and women. In this group, the correlation of BMI and SBP is r = .68 and r² = .46. Based on this relationship, we would expect that the remaining variance in SBP (54%) must be a function of other factors. For this example, we will look at the potential additional contributions of diet (daily grams of fat intake), cholesterol (CHOL), age, and gender.

Table 30-2 1 shows the correlations among these variables. We can see that BMI, diet, and CHOL have significant correlations with SBP. In Table 30-2B, the results of a multiple regression are presented. This looks like the simple linear regression output, with a few important differences.

The model summary shows the value of R², which tells us that together this set of 5 predictor variables accounts for 53% of the total variance in SBP (see Table 30-2 2 ). The adjusted R² is a slightly lower value. With more variables in the equation, there is more error variance, which is reflected in this value. The ANOVA confirms a significant model (p < .001) (see Table 30-2 3 ).

The Regression Equation

The multiple regression equation now accommodates several predictor variables:

Ŷ = a + b1X1 + b2X2 + ... + bkXk

The subscript k denotes the number of independent variables in the equation. The equation includes the regression constant, a, and regression coefficients, b1 through bk, for each independent variable.

Regression Coefficients

The regression coefficients are interpreted as weights that identify how much each variable contributes to the explanation of Y. As part of the analysis, a test of significance is performed on each regression coefficient, to test the null hypothesis, H0: b = 0 for that independent variable (see Table 30-2 6 ). In this example, only the coefficients for diet and CHOL are significant, while BMI, age, and gender are not. Therefore, these variables are not making a significant contribution to the prediction of SBP.

The p value for BMI is .054, arguably a significant outcome. The decision to consider BMI important may be justified by the researcher within the clinical context of these variables. However, as we shall see shortly, the strict demarcation of .05 to indicate significance may influence further analysis of the data.

In Table 30-2 4 , the coefficients for the regression equation are shown:

SBP = 75.130 + .673(BMI) + .110(diet) + .092(CHOL) – .029(age) – .047(gender)

Based on this equation, for a 27-year-old male subject with BMI = 24, diet = 45.0 g, CHOL = 155, gender = 1 (coded for male), we can predict SBP as follows:

SBP = 75.130 + .673(24) + .110(45) + .092(155) – .029(27) – .047(1) = 109.662

Assume this person's true SBP was 95. Therefore, the residual would be 95 – 109.662 = –14.662. The negative residual indicates that the predicted value is higher than the actual value.

Prediction Accuracy

Equations derived in a regression study are specific to the sample used. Prediction accuracy is often reduced when applied to different subjects.2 Cross-validation studies involve collection of data from a new sample to evaluate how well the formula predicts outcomes with different individuals. This is not just a replication study, but rather a way to validate results from the original study for clinical use.
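The worked prediction for this subject is easy to verify in code. The sketch below plugs the Table 30-2 coefficients into the multiple regression equation; the function name is our own:

```python
# Predicted SBP from the Table 30-2 multiple regression equation.
def predict_sbp(bmi, diet, chol, age, gender):
    """SBP = 75.130 + .673(BMI) + .110(diet) + .092(CHOL) - .029(age) - .047(gender)"""
    return (75.130 + 0.673 * bmi + 0.110 * diet + 0.092 * chol
            - 0.029 * age - 0.047 * gender)

# 27-year-old male (gender coded 1) with BMI = 24, diet = 45.0 g fat, CHOL = 155
predicted = predict_sbp(bmi=24, diet=45.0, chol=155, age=27, gender=1)
residual = 95 - predicted   # observed SBP of 95

print(round(predicted, 3))  # 109.662
print(round(residual, 3))   # -14.662
```

The negative residual reproduces the text's conclusion that the model over-predicts this subject's pressure.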
Table 30-2 Multiple Regression Analysis: Prediction of Systolic Blood Pressure from BMI, Diet, Cholesterol, Age, and Gender (N = 145)
Coefficientsa

                                                                          95% Confidence             Collinearity
                Unstandardized Coefficients   Standardized Coefficients   Interval for B             Statistics 7
Model           B 4        Std. Error         Beta 5       t 6     Sig.   Lower Bound   Upper Bound  Tolerance   VIF
1  (Constant)   75.130     7.775                           9.663   .000   59.757        90.503
   BMI          .673       .347               .233         1.941   .054   –.013         1.359        .236        4.240
   DIET         .110       .038               .352         2.922   .004   .036          .185         .233        4.286
   CHOL         .092       .036               .211         2.552   .012   .021          .163         .493        2.027
   AGE          –.029      .053               –.033        –.547   .585   –.135         .076         .959        1.043
   GENDER       –.047      1.407              –.002        –.034   .973   –2.829        2.735        .989        1.011
a. Dependent Variable: SBP
1 The correlation matrix provides an overview of correlations among all variables in the equation. It is redundant, and only the top triangular portion needs to be considered.
2 R² and adjusted R² are shown. The value of R is reported, but it is not meaningful with multiple independent variables and is ignored.
3 The ANOVA shows a significant relationship between SBP and the set of independent variables.
4 B regression coefficients for each variable in the equation:
Ŷ = 75.13 + .673(BMI) + .110(Diet) + .092(CHOL) – .029(Age) – .047(Gender)
5 Standardized regression coefficients (beta weights). These values allow comparison of weights across variables.
6 Tests of significance of regression coefficients for each independent variable. Values for the constant are ignored.
7 Tolerance and VIF reflect the degree of collinearity with other variables. Lower values for tolerance and higher values for VIF indicate greater collinearity.
Portions of the output have been omitted for clarity.
Some computer programs will automatically generate tolerance levels for each variable. Others offer options that must be specifically requested to include collinearity statistics in the output.

It is important to recognize that, in this type of regression, all five variables are included in the regression equation, regardless of whether they are significant.

The coefficients in a regression equation are dependent on which other variables are included and can change substantially if the analysis was repeated with a different combination of independent variables with different levels of collinearity. Therefore, regression coefficients do not provide absolute estimates of the importance of any independent variable. The interpretation must be made within the context of the set of variables used and characteristics of the sample.

➤ Results
The multiple regression examined the association between SBP and BMI, diet, cholesterol, age, and gender, resulting in a total R² of .530 (p < .001) and a final equation of Y = 75.130 + .673(BMI) + .110(diet) + .092(cholesterol) – .029(age) – .047(gender). Significant coefficients were obtained for BMI (p = .054), diet (p = .004), and cholesterol (p = .012). These three variables had high correlations with each other (r ≥ .674), and BMI and diet have low tolerance levels, indicating collinearity with other variables. Standardized beta coefficients were highest for diet (β = .352), BMI (β = .233), and cholesterol (β = .211). Age and gender did not provide significant contributions to predicting SBP within this model.

■ Power and Effect Size for Regression

The value of R² is often used as an effect size index for regression, like r is used for correlation. However, for power analyses this value is converted to f². For the SBP example:

f² = R²/(1 – R²) = .53/(1 – .53) = 1.128

Cohen3 proposed the following conventional effect sizes and corresponding R² values:

Small    f² = .02   R² = .02
Medium   f² = .15   R² = .13
Large    f² = .35   R² = .26

See Box 30-1 for description of power analysis and sample size estimates for regression.

Power is important to the interpretation of regression. Like correlation, regression can be sensitive to sample size, often finding significance with low values of R² when samples are large. It is the value of R² that must be considered in context, not solely a p value (see Focus on Evidence 30-1).

See the Chapter 30 Supplement for illustration of power and sample size calculations using G*Power for regression analysis.

Box 30-1 Power and Sample Size Estimates for Regression

Post Hoc Analysis: Determining Power
Using G*Power to determine power requires entering the effect size f², level of significance, total sample size, and the number of independent variables. For this analysis, f² = 1.128, α = .05, N = 145, with 5 independent variables. This results in power of 1.00 (see chapter supplement)—you can't do better than that!

A Priori Analysis: Estimating Sample Size
To estimate sample size we need to know how many predictor variables will be included, the desired levels of significance and power, and the estimated effect size. For this example, assume that we are expecting R² = .25, which converts to f² = .33. Using α = .05 and 80% power with 5 independent variables, the estimated sample size is 45 (see chapter supplement).

This outcome illustrates two important considerations in regression analysis. First, we can see that our sample of 145 subjects contributed marked power to the analysis. Also, the greater the number of independent variables, the larger the needed sample.

■ Stepwise Multiple Regression

Multiple regression can be run by "forcing" or "entering" a set of variables into the equation, as we have done in the SBP example. With all five variables included, the equation accounted for 53% of the variance in SBP values, although the results demonstrated that the independent variables did not all make significant contributions to that estimate, and there was some collinearity. We might ask then, can we establish a more efficient model, to achieve this level of prediction accuracy with fewer variables?

To answer this question, we use a procedure called stepwise multiple regression, which applies specific statistical criteria to retain or eliminate variables to maximize prediction accuracy with the smallest number of predictors. It is not unusual to find that only a few independent variables will explain almost as much of the variation in the dependent variable as can be explained by a larger number of variables. This approach is useful for honing in on those variables that make the most valuable contribution to a given relationship, thereby creating an economical model.
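The conversion from R² to f² can be checked directly with f² = R²/(1 − R²), reproducing both the post hoc value and the a priori estimate used in Box 30-1:

```python
# Cohen's effect size f^2 for regression: f^2 = R^2 / (1 - R^2)
def f_squared(r_squared):
    return r_squared / (1.0 - r_squared)

print(round(f_squared(0.53), 3))  # 1.128  (post hoc value for the SBP example)
print(round(f_squared(0.25), 2))  # 0.33   (a priori estimate in Box 30-1)
```

The same conversion underlies the conventional benchmarks above (e.g., R² = .26 gives f² ≈ .35).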
Focus on Evidence 30–1
Statistical Danger!

In 1986, a New York Times report cited a study of intelligence tests, with the headline "Children's Height Linked to Test Scores."26 The story was based on an article published in the journal Pediatrics, which involved the analysis of data from the National Health Examination Survey on more than 16,000 children from 6 to 17 years old.27 Intellectual ability was measured using the Wechsler Intelligence Scale for Children, yielding an IQ score. The Pearson product-moment correlation was used to detect the relationship between IQ and height, and multiple regression was used to examine the influence of other related variables, including family income, race, birth order, and family size.

The results showed a significant positive correlation, indicating a direct relationship between height and IQ, with r = .18, p < .001. The regression model showed that taken together, the other related variables accounted for 30% of the variance, and height only contributed another 2% (p < .001). In a separate longitudinal analysis of a subset of 2,177 children from 8 to 13 years of age, the researchers found no relationship between change in height and change in IQ over 2 to 5 years.

In their discussion, the authors cited several studies that showed significant correlations between height and intellectual performance dating from 1893 to 1975, with correlations between .12 and .36. They stated that their results of a small but statistically significant association supported these findings. However, given that they found no association between change in height and IQ in their sample, they suggested that any processes responsible for the relationship must occur early in childhood, before the age of 8. They also suggested that any therapies designed to facilitate physical growth were unlikely to promote intellectual growth.

But back to the New York Times—the report that would be more likely seen by the public. Imagine the impact of such a "significant" finding—especially for parents of shorter children! Of course, the newspaper article did not mention the actual correlation, just its significance, and offered the public no understanding of the meaning of a correlation versus R², the power of the test with such a large sample, or the difference between correlation and causation to put the predictive relationship in perspective. The primary author is quoted as saying, "We feel that it may well be possible that short children are treated somewhat different, and might be thought of as younger or less intellectually mature."

Studies continue to show some relationship between height and intelligence, generally with low correlations, offering many potential explanations for other contributing factors, including prenatal conditions and genetics.28-30 Understanding the impact of these types of findings, without understanding the statistical implications, requires considerable caution.
The Steps

Stepwise regression is accomplished in "steps" by evaluating the contribution of each independent variable in sequential fashion based on an inclusion criterion, usually set at p ≤ .05. First, all proposed independent variables are correlated with the dependent variable, and the one variable with the highest significant correlation is entered into the equation at step 1. For our example, referring back to Table 30-2A, we can see that diet has the highest correlation with SBP (r = .699, p < .001). Therefore, we expect diet to be entered in the first step.

After the first variable is entered, the variables that were excluded are examined to see if any have a partial correlation that is significant; that is, do they have any important information to add? If so, the variable with the highest partial correlation will be entered in the next step. This process continues until, at some point, either all variables have been entered or the addition of more variables will not significantly improve the prediction accuracy of the model, ending the analysis.

The Stepwise Models

Table 30-3 shows the output for the stepwise regression for the SBP study. In each section the number of steps is identified under the column labeled Model. Data for the first step are shown in Table 30-3 2 . As expected, diet was entered, resulting in R² = .484. The regression equation for this first step is:

SBP = 100.163 + .218(diet)

Next we examine the remaining "excluded" variables for their partial correlation with Y, that is, their correlation with SBP with the effect of diet removed (see Table 30-3 3 ). The variable with the highest significant partial correlation coefficient is then added to the equation in step 2, in this case, cholesterol (partial r = .248, p = .003). Data for the second step are shown in Table 30-3 4 . With the addition of cholesterol, we have increased R² to .516.

At this point we look again at the excluded variables. None of the partial correlations of the remaining three variables are significant, indicating that they have no new information to add to the explanation of SBP (see Table 30-3 5 ). Therefore, no further variables were entered after step 2, and the final model for the stepwise regression is

SBP = 81.918 + .167(diet) + .105(CHOL)

Note that the coefficient for diet has changed from step 1 to step 2 with the addition of cholesterol into the equation. Note, too, that for this analysis, an
Table 30-3 Stepwise Multiple Regression: Prediction of SBP
ANOVAa
Model Sum of Squares df Mean Square F Sig.
1 Regression 10146.209 1 10146.209 134.065 2 .000b
Residual 10822.453 143 75.681
Total 20968.662 144
2 Regression 10813.067 2 5406.533 75.597 4 .000c
Residual 10155.595 142 71.518
Total 20968.662 144
a. Dependent Variable: SBP
b. Predictors: (Constant), DIET
c. Predictors: (Constant), DIET, CHOL
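The R² value at each step can be recovered from this ANOVA table, since R² = SSreg/SStotal:

```python
# R^2 at each step of the stepwise regression, from the ANOVA sums of squares.
ss_total = 20968.662

r2_step1 = 10146.209 / ss_total   # step 1: DIET only
r2_step2 = 10813.067 / ss_total   # step 2: DIET + CHOL

print(round(r2_step1, 3))  # 0.484
print(round(r2_step2, 3))  # 0.516

# The increment in explained variance attributable to adding CHOL
print(round(r2_step2 - r2_step1, 3))  # 0.032
```

This matches the R² values of .484 and .516 cited in the text for steps 1 and 2.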
Coefficientsa

                                                                             95% Confidence            Collinearity
                  Unstandardized Coefficients   Standardized Coefficients    Interval for B            Statistics
Model             B          Std. Error         Beta        t        Sig.    Lower Bound  Upper Bound  Tolerance  VIF
1  (Constant) 2   100.163    2.024                          49.500   .000    96.163       104.163
   DIET           .218       .019               .696        11.579   .000    .181         .255         1.000      1.000
2  (Constant) 4   81.918     6.290                          13.022   .000    69.483       94.353
   DIET           .167       .025               .533        6.737    .000    .118         .216         .545       1.834
   CHOL           .105       .034               .241        3.054    .003    .037         .172         .545       1.834
a. Dependent Variable: SBP
Excluded Variablesa

                                           Partial        Collinearity Statistics
Model        Beta In    t       Sig.       Correlation    Tolerance   VIF
1  BMI       .306b      2.615   .010       .214           .253        3.960
   CHOL      .241b      3.054   .003 3     .248           .545        1.834
   AGE       .003b      .049    .961       .004           1.000       1.000
   GENDER    .013b      .223    .824       .019           .997        1.003
2  BMI       .233c      1.967   .051 5     .163           .237        4.213
   AGE       .034c      .564    .574       .047           .961        1.041
   GENDER    .012c      .212    .832       .018           .997        1.003
a. Dependent Variable: SBP
b. Predictors: (Constant), DIET
c. Predictors: (Constant), DIET, CHOL
Continued
Table 30-3 Stepwise Multiple Regression: Prediction of SBP—cont'd

1 The derived model has two steps, entering DIET on the first step and CHOL on the second step.
2 DIET enters on the first step with R² = .484 (p < .001 from ANOVA). The regression equation for the first step: Ŷ = 100.163 + .218(DIET).
3 Of the remaining variables after DIET is entered, partial correlations for both CHOL and BMI are significant, but CHOL has the higher partial correlation, and therefore will be the next variable entered.
4 The second step includes DIET and CHOL, increasing R² to .516, which is also significant. The regression equation for the second step: Ŷ = 81.918 + .167(DIET) + .105(CHOL).
5 In step 2, after DIET and CHOL are entered, none of the remaining variables have significant partial correlations, and the regression stops at step 2. BMI is no longer significant (using a strict .05 cutoff). BMI has a low tolerance and high VIF, indicating it is correlated with other variables already in the equation.
new set of independent variables to explain the dependent variable. It usually helps to look at correlations among variables first, to determine if relationships are worth analyzing. Sometimes backwards inclusion may find relationships that are missed with forward selection methods. Because the analysis starts with all variables entered, their relationships are more easily identified, as compared to bringing one variable into the equation at a time.

Hierarchical Regression

Another approach to multiple regression involves entering sets of variables in sequential stages or "blocks." Independent variables entered in the first block are often those the researcher wants to control, such as demographic data. After these effects are tested, other variables can be entered in subsequent blocks to see what they add. Each stage can be run with all variables entered or using other methods of inclusion.

For example, we might enter age and gender in the first block to account for these characteristics, thereby removing their effect from further analysis. In a second block, we can enter all of the remaining variables, or we might have a rationale for grouping subsets further. These results will vary from other methods because of the effect of entry order.

See online data files for Table 30-3 at www.davisplus.com for examples of output with enter, stepwise, forward, backward, and hierarchical inclusion methods using the SBP data.

■ Dummy Variables

One of the general assumptions for regression analysis is that variables are continuous. However, many of the variables that may be useful predictors for a regression analysis, such as gender, occupation, education, and race, are measured on a categorical scale. We can also look at group membership in an experimental study as a categorical variable. It is possible to include such variables in a regression equation, although the numbers assigned to categories cannot be treated as quantitative scores. One way to do this is to create a set of coded variables called dummy variables.

Coding

In statistics, coding is the process of assigning numerals to represent categorical or group membership. For regression analysis we use 0 and 1 to code for the absence and presence of a dichotomous variable, respectively. All dummy variables are dichotomous. For example, for a variable like gender, we might arbitrarily code male = 0 and female = 1. In essence, then, we are coding 1 for female and 0 for anyone who is not female. We can use these codes as scores in a regression equation and treat them as interval data. For instance, gender was included as a predictor of SBP in the multiple regression (see Table 30-2). Assume we tested only the effect of gender (X), getting the following regression equation:

Ŷ = 150 – 27.5X

Using the dummy code for Males,

SBP = 150 – 27.5(1) = 122.5

and for Females,

SBP = 150 – 27.5(0) = 150

With only this one dummy variable, these predicted values are actually the means for SBP for females and males. The regression coefficient for X is the difference between the means for the groups coded 0 and 1.

Coding with More than Two Categories

When a categorical variable has more than two levels, more than one dummy variable is required to represent it. For example, let's assume we had information on smoker status from our subjects. We could assess smoking in three categories: never smoked, former smoker, current smoker. If we coded these categories with numbers 1 through 3, it would represent an apparent ordinal scale, but this would make no sense in a regression because the numbers have no quantitative meaning. A current smoker is not three times more of something than someone who never smoked.

Therefore, we must create a dichotomous dummy variable for each category, as follows:

X1 = 1 if "current smoker", 0 if not "current smoker"
X2 = 1 if "former smoker", 0 if not "former smoker"

Each variable codes for the presence or absence of a specific class membership. We do not need to create a third variable for "never smoked" because anyone who has zero for X1 and X2 will be in that category. We can show how this works by defining each group with a unique combination of values for X1 and X2:

                  X1    X2
Current Smoker     1     0
Former Smoker      0     1
Never Smoked       0     0

The number of dummy variables needed to define a categorical variable will always be one less than the number of categories.
independent variable, we are asking if there is a relationship between gender and SBP. For the smoking example, we are asking if there is a relationship between smoking status and SBP. If we performed t-tests or ANOVAs, these variables would represent groups (levels of the independent variable), and we would analyze differences across these groups in SBP based on their variance components. Regression is essentially doing the same thing. The ANOVA that is generated as part of the regression analysis would examine how much variance in the data can be explained by gender or smoking status. This would be identical to a regular ANOVA looking at group means. Therefore, a regression technique is feasible with an experimental design with group membership as an independent variable.

Figure 30–6 Curvilinear data for the regression of psychomotor skill on age. The solid line represents the linear regression (Linear: Y = 7.539 + 0.06(Age), R² = .125), and the curve (dashed line) is derived through a second-order polynomial regression (Quadratic: Y = 1.002 + 0.633(Age) – 0.010(Age²)).
Based on this information alone, one would assume that X and Y were not related. However, examination of the scatter plot reveals that the data form a distinctly curved pattern with an increase in psychomotor ability through early years and then a decline in later years. Therefore, it makes sense to draw a "line," or more precisely a curve, that accurately reflects the relationship between X and Y. We can express this curve statistically in the form of a quadratic equation:

Ŷ = a + b1X + b2X²

This type of equation is considered a second order regression equation. It defines a parabolic or quadratic curve with one turn (see Chapter 26). The process of deriving its equation is called polynomial regression. Clearly, this fitted curve is more representative of the data points than the linear regression line, with R² = .449. Polynomial regression is also based on the concept of least squares, so that the vertical distance of each point from the curve is minimized. Therefore, the curve can be used for predicting Y scores in the same way as a linear regression line (see Focus on Evidence 30-2).

Table 30-4 shows the analysis of variance for both a linear and quadratic regression of psychomotor ability on age. The F-ratio for the linear regression is not significant (p = .279), and R² is quite weak at .042, as we might expect from looking at the data (see Table 30-4 1 ). The quadratic regression is significant (p = .007), indicating that the quadratic curve is a good fit for these data (see Table 30-4 2 ). The equation for the curve is

Score = 1.261 + .686(Age) – .011(Age²)

Therefore, a person who is 43 years old could be expected to have a psychomotor score of

Score = 1.261 + .686(43) – .011(1849) = 10.42

If you look at the curve in Figure 30-6 and draw a vertical line from age 43 on the X axis, it should hit the curve around this point. However, you will also notice that there were three people in the sample who were 43 years old, and only one got a score close to its predicted value.

A closer look at the analysis of variance helps us see how differently these two approaches explain the data. Note that the total sum of squares for both analyses is the same (SSt = 223.367); that is, the total variability in the sample is the same, regardless of which type of regression is performed. What is different is the amount of that variance that is explained by each of the regression models. The sum of squares attributable to the linear regression is 9.33, whereas for the quadratic regression it is 68.11 (see Table 30-4 3 ). This indicates that a greater proportion of the total variability is explained by the curve. We can also see that the residual error represents a smaller portion of the total variance in the quadratic regression.

➤ Results
The scatterplot of data shows a distinctly nonlinear relationship between psychomotor ability and age. Therefore, linear and polynomial regressions were performed. The linear regression was not significant (R² = .042, p = .279). Using a curve estimate, we found a significant quadratic function (R² = .305, p = .007), with a final equation of Y = 1.261 + .686(AGE) – .011(AGE²). These data suggest that there is a strong curvilinear relationship between psychomotor ability and age.
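Both the predicted score for a 43-year-old and the R² values in the Results box can be verified from the numbers quoted in the text (SSt = 223.367; SSreg = 9.33 linear, 68.11 quadratic):

```python
# Predicted psychomotor score from the quadratic equation in the text.
def predict_score(age):
    return 1.261 + 0.686 * age - 0.011 * age ** 2

print(round(predict_score(43), 2))  # 10.42

# R^2 for each model, from the sums of squares in Table 30-4: R^2 = SSreg / SSt
ss_total = 223.367
print(round(9.33 / ss_total, 3))    # 0.042 (linear)
print(round(68.11 / ss_total, 3))   # 0.305 (quadratic)
```

The same total sum of squares partitioned two ways shows directly why the curve explains so much more of the variance.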
Focus on Evidence 30-2
If Only They Could Talk

The Glasgow Coma Scale (GCS) is used to assess the level of consciousness in patients following trauma. Rutledge et al31 identified problems in using the GCS in intubated patients because the scale requires verbal responses that are blocked by intubation. The purpose of their study was to develop a basis for predicting the verbal score using only the motor and eye responses of the scale. The authors designed a prospective study to assess 2,521 patients in an intensive care unit who could provide verbal responses. They used a multiple regression procedure to determine if the motor and eye variables were strong predictors of the verbal score, resulting in the following second-order regression equation:

Verbal Score = 2.3976 – .9253(GCS motor) – .9214(GCS eye) + .2208(GCS motor²) + .2318(GCS eye²)

The accuracy of the model was extremely high, predicting 83% of the variance in the verbal score. To confirm the equation, the authors compared the predicted verbal score to the actual verbal score for a test sample of 736 patients, with R² = .85, p = .0001.

Of course, the ultimate purpose of developing such a model is to extend its use to a different set of subjects. Therefore, the model must be cross-validated by testing it on different groups. For example, Meredith et al32 tested the equation using a retrospective sample of over 14,000 patients taken from a trauma registry by comparing their predicted and actual GCS scores. Cheung et al33 also compared the equation with other estimates and found it provided adequate scores. Both findings supported the predictive validity of the model, confirming the ability to determine an accurate GCS score in the absence of a verbal component. Based on these findings, the equation could be used with some confidence with patients who are intubated to predict their verbal response, thereby allowing a reasonable estimate of the GCS score.
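The second-order equation quoted above is easy to evaluate directly; this sketch simply plugs motor and eye scores into the published coefficients (the example scores are mine):

```python
def predicted_verbal(motor, eye):
    """Evaluate the second-order equation quoted above (Rutledge et al)."""
    return (2.3976 - 0.9253 * motor - 0.9214 * eye
            + 0.2208 * motor**2 + 0.2318 * eye**2)

# Hypothetical intubated patient: motor = 6, eye = 4
print(round(predicted_verbal(6, 4), 2))  # → 4.82
```

The quadratic terms let the predicted verbal score rise steeply at the top of the motor and eye ranges, which a purely linear equation could not capture.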
6113_Ch30_440-467 02/12/19 5:29 pm Page 456
Table 30-4 Linear and Polynomial Regression of Psychomotor Ability on Age (N = 30)

Coefficients ( 5 )
               Unstandardized          Standardized ( 2 )
               B        Std. Error     Beta        t         Sig.
Age            .686     .204           3.242       3.364     .002
Age**2 ( 4 )   –.011    .004          –3.081      –3.197     .004
(Constant)     1.261    2.510                      .502      .619

1 The value of R-square shows that linear prediction is quite weak. The ANOVA and regression show that the linear model is not significant (p = .279).
2 The value of R-square is substantially increased with the quadratic curve, indicating that the curve is a better fit to the data. The ANOVA and regression show that the quadratic function is significant (p = .004).
3 The regression variance is larger and the error variance lower with the quadratic function, indicating that the curve explains more variance than the linear regression.
4 The symbol for exponent is a double asterisk. This term is Age squared.
5 The polynomial equation for the quadratic curve is: Ŷ = 1.261 + .686(Age) – .011(Age²)
Portions of the output have been omitted for clarity. Data are shown in Figure 30-6.
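The linear-versus-quadratic comparison in Table 30-4 can be illustrated with a short simulation; the data below are invented (noise added to the chapter's quadratic equation), so the R² values will not match the table exactly:

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of variance in y explained by predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
age = rng.uniform(10, 60, 30)
# Invented scores: the chapter's quadratic equation plus random noise
score = 1.261 + 0.686 * age - 0.011 * age**2 + rng.normal(0, 2, 30)

lin = np.polyfit(age, score, 1)    # first-order (linear) fit
quad = np.polyfit(age, score, 2)   # second-order (quadratic) fit

r2_lin = r_squared(score, np.polyval(lin, age))
r2_quad = r_squared(score, np.polyval(quad, age))
print(r2_lin, r2_quad)  # the quadratic fit explains far more variance
```

Because the underlying relationship is an inverted U, a straight line barely improves on the mean, while the second-order polynomial tracks the curve.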
and the other an outcome, and therefore lend themselves to a regression analysis. However, unlike linear regression that applies to continuous outcomes, these questions incorporate dichotomous dependent variables that represent the occurrence or nonoccurrence of a particular event, or the presence or absence of a condition—SUID or survival, depression or no depression, ADL deficit or no deficit. For this purpose, we use a process of logistic regression. This approach is helpful for clinical decision-making and is especially suited to epidemiologic study (see Chapter 34).

With a dichotomous outcome, this test is termed logistic regression, which will be the focus here. With more than two categories, the process involves multinomial logistic regression, which is beyond the scope of this text. Useful references are available.18-20 Help from a statistician is recommended.

The Regression Equation
Logistic regression is analogous to multiple regression in that it is used to explain or predict an outcome (dependent) variable using one or more predictor (independent) variables. The result is a best-fitting equation with coefficients, predicted values, and residuals that reflect the contribution of each independent variable. Predictors may be a combination of continuous or dummy-coded categorical variables, and logistic regression can be run with all variables entered or in a stepwise or hierarchical fashion.

Although similar in purpose, there are several important differences that distinguish logistic regression. First, because the outcome is dichotomous, rather than predicting a continuous value of an outcome variable, the logistic regression equation predicts the probability of an event occurring.

Another important distinction is the method used to develop the regression equation. Where multiple regression uses the least squares method to find the best-fitting line, logistic regression uses the process of maximum likelihood estimation.16 This is an iterative process whereby equations are generated and revised until a best-fitting model is obtained. This results in the "most likely" solution that demonstrates the best odds of the equation accurately predicting who did and did not meet the outcome event.

Coding Variables
The dependent variable in logistic regression is a dichotomous outcome that is coded 0 for the reference group and 1 for the target group. The target is typically considered the group with the adverse outcome. Therefore, for this example, we would code 0 for going home (reference) and 1 for going to LTC (target). Only codes of 0 and 1 are used, to represent the absence or presence of the outcome event.

In logistic regression, the independent predictor variables may be dichotomous or continuous. In this example, age is a continuous variable. The others are dichotomous and can be assigned the codes 0 and 1, with 1 representing potential risk factors. These variables are also called indicator variables. For variables like gender, with no obvious adverse condition, the codes may be arbitrary (although in some situations gender may be a risk factor). Categorical variables with more than two levels must be defined using dummy coding.

➤ For the discharge study, the following codes were assigned:
Function (ADL)    0 = independent; 1 = limited
Age               Continuous
Marital status    0 = married; 1 = not married
Gender            0 = male; 1 = female
Table 30-5 Relationship Between Level of Independence and Discharge Destination

                     LTC (1)   Home (0)   Total
Limited ADL (1)      31 (a)    25 (b)      56
Independent (0)      13 (c)    31 (d)      44
Total                44        56         100

without it. For example, for the 56 people who were limited in ADL, the odds of going to LTC are

31 (went to LTC) / 25 (went home) = 1.24

For the 44 people who were independent, the odds of going to LTC are

13 (went to LTC) / 31 (went home) = .419

The ratio of these two values is the odds ratio (OR)—the odds of going to LTC for those limited in ADL versus the odds of going to LTC for those who were independent:

OR = 1.24/.419 = 2.96

Therefore, the odds of going to LTC are 2.96 times higher for those who are limited in ADL than for those who are functionally independent.

If the proportions of individuals who go to LTC and home are the same, the OR will be 1.0, considered the null value. An OR less than 1.0 means that there is a decreased risk, and greater than 1.0 indicates an increased risk.

To make this process easier, the odds ratio can be calculated from the data in Table 30-5 as ad/bc, using the cell designations in the table. For these data,

OR = (31)(31)/(25)(13) = 2.96

Applications of the odds ratio will be discussed further in Chapter 34.

See the Chapter 30 Supplement for calculation and interpretation of probabilities associated with logistic regression.

Box 30-2 Probabilities and Odds
The probability of an event reflects the number of times you expect to see an outcome out of all possible trials, ranging between 0 and 1 (see Chapter 23). Odds represent the probability that the event will occur divided by the probability that the event will not occur.
If the probability of occurrence is X, then the probability of nonoccurrence is 1 – X. Therefore,

Odds = probability/(1 – probability)
Probability = odds/(1 + odds)

For example, if a horse runs 100 races and wins 50, the probability of winning is 50%. The odds are .50/.50 = 1 (even odds). If the horse runs 100 races and wins 25, the probability of winning is 25%, but the odds are .25/.75 = 0.33, or 1 win to 3 losses.
In the discharge study, for those with limitations in ADL, the odds of being discharged to LTC are 2.96 (Table 30-5). Therefore,

Probability = 2.96/3.96 = .75

There is a 75% probability that someone with limited ADL will be discharged to LTC. A patient with limited ADL is almost 3 times more likely to go to LTC than someone without limitation.

Strength of Logistic Regression
Like multiple regression, several statistics are used to indicate the degree to which the logistic regression accurately predicts membership of cases to the categories of the dependent variable. We cannot use an R² estimate, as we do for multiple regression, because it doesn't work with a dichotomous outcome.

One of the most useful measures is a classification table, providing data on how well the regression has correctly predicted each patient's true outcome (see Table 30-6 5 ). In this case, the overall correct prediction is 73%. We can see that prediction accuracy was higher for those discharged home (78.6% correctly classified) than for those who went to LTC (65.9% correctly classified).

Other statistics used to indicate the strength of a logistic regression equation are shown in Table 30-6A. An omnibus test ( 2 ) is an overall estimate of the significance of the full set of regression coefficients. Likened to an overall F test in multiple regression, this significant result (p < .001) indicates that the regression equation provides meaningful coefficients that are significantly different from zero.

Two "pseudo-R²" values are reported in the model summary ( 3 ). The Cox & Snell R Square has a theoretical maximum less than 1, even for a "perfect" model.21 The Nagelkerke R Square is an adjusted version of the Cox & Snell value that allows the scale to cover the full range from 0 to 1.22 These can be interpreted like R².

The –2 log likelihood index (–2LL) is based on the concept of deviance, or lack of fit to the regression
Table 30-6 Logistic Regression Analysis: Risk Factors Associated with Discharge Disposition Following Rehabilitation (N = 100)
1 The independent variables are entered into the equation simultaneously (as opposed to stepwise procedures).
2 Chi-square is used to test the null hypothesis that the regression coefficients equal zero. Because all variables were entered
simultaneously, this analysis includes only one block and one step, which is why values are redundant. These would be
different if a stepwise or hierarchical procedure was run. The df equal the number of predictor variables.
3 The Model Summary provides information on the degree to which the model accurately predicts group membership, similar to
the use of R 2 in multiple regression.
4 The Hosmer and Lemeshow test is a measure of goodness of fit to the logistic model using a chi-square distribution.
5 A classification table compares predicted outcomes to observed results. Overall, 73% of the subjects were correctly classified.
6 The (1) after each categorical variable name indicates the code for the target category.
7 Regression coefficients (B) and the constant are listed for each variable in the logistic regression equation. Each coefficient is
tested for significance using the Wald statistic. Gender is the only variable that is not significant.
8 The exponent of the regression coefficient (Exp(B)) is the adjusted odds ratio associated with each independent variable.
Data for the constant is ignored.
9 Confidence intervals for the odds ratios show that ADL, age and marital status do not contain the null value of 1.0. Only the
interval for gender contains 1.0, which is consistent with the associated probability.
model ( 3 ), analogous to looking at error sums of squares in linear regression (a smaller value means a better fit). This value is not useful by itself, but is used to calculate chi-square for the Hosmer-Lemeshow test ( 4 ), which is a goodness-of-fit test that indicates if the observed proportions match the expected proportions of event rates generated by the model. If this test is not significant, it indicates a good fit, as it does here (p = .189).

Regression Coefficients and Odds Ratios
The essential part of the logistic regression is the derivation of regression coefficients (B) for predictor variables (see Table 30-6 7 ). An OR is computed for each variable by taking the exponent of B ( 8 ). For example, for a subject who is limited in ADL (ADL = 1), the odds of going to LTC are e^.965 = 2.625 (use the e^x key on your scientific calculator). This number represents the odds of going to LTC with a one-unit change in the value of X. With a dichotomous variable, this means that an individual who is limited in ADL is 2.6 times more likely to go to LTC as compared to one who is independent (a change from 0 to 1 for ADL).

The coefficients are tested to determine if they are significantly different from zero using the Wald statistic ( 7 ), which is analogous to the t-test used in linear regression. Confidence intervals can also be determined for each OR ( 9 ). A significant OR will not contain the null value 1.0 within the confidence interval. We can see that this is true for the odds ratios associated with ADL, age, and marital status.

Adjusted Odds Ratios
When the logistic regression equation includes several independent variables, as in our example, each odds ratio is actually corrected for the influence of the other variables, called adjusted odds ratios (sometimes abbreviated aOR). Just as independent variables in multiple regression can exhibit collinearity, independent variables in logistic regression will affect each other. This is an important consideration for prediction models. For instance, for the simple association between discharge status and ADL, we found an odds ratio of 2.96, but the adjusted odds ratio is 2.63 when other variables are included.

Continuous Variables
When an independent variable is continuous, interpretation is more complex. Consider the effect of age on discharge status, with an adjusted OR of 1.125. Because the OR relates to the relative increase in odds with a one-unit increase in X, we can interpret this value as the odds associated with any 1-year difference in age. For instance, from 70 to 71 the odds of going to LTC have increased by .125.

To account for larger unit differences, we multiply the regression coefficient by the unit difference and then take the exponent to obtain the odds ratio. For age (B = .117), with a 2-year difference, we determine the odds ratio by e^(2 × .117) = 1.264. Someone who is 72 is 1.264 times more likely to go to LTC than someone who is 70—a little increase. However, for a 10-year difference in age, we find e^(10 × .117) = 3.222. The odds of going to LTC are more than 3 times greater for someone who is 80 as compared to someone who is 70. Many researchers choose to categorize continuous variables into subgroups to simplify this interpretation.

Interpretation of Results
The interpretation of logistic regression depends on the research question. In many research situations, the investigator is interested in one particular variable but wants to control for potential confounders. Using the discharge study, for example, we might be specifically interested in the effect of function on discharge status, but we would want to account for the influence of demographic factors as covariates. In that case we might report that the odds ratio for ADL was 2.63, adjusted for age, marital status, and gender. Adjusted odds ratios are often based on these types of demographic variables.

Alternatively, we could approach this analysis using a broader question, wanting to understand how these four factors are related to discharge status. In that case, we would summarize results, suggesting that ADL, marital status, and age are most influential in predicting discharge status. In addition to the increased likelihood of going to LTC if the patient is functionally limited, those who are not married are 2.77 times more likely to be sent to LTC than those who are married. The actual regression analysis is the same for both of these purposes. The difference lies only in the interpretation.

➤ Results
Results of the logistic regression showed that 73% of the sample was correctly classified by the final model. ADL, age, and marital status provided significant contributions to prediction of discharge status. The adjusted OR for ADL was 2.63 (95% CI, 1.02 to 6.74) and for marital status was 2.77 (95% CI, 1.08 to 7.10). Both are significant but with wide confidence intervals. The adjusted OR associated with age was 1.13 (95% CI, 1.05 to 1.21). There was an increased risk of discharge to LTC with limited function, being unmarried, or being older. The Hosmer & Lemeshow test (χ²(4) = 11.23, p = .189) indicates that the regression model is a good fit to the observed data. The Cox & Snell R² = .235 and the Nagelkerke R² = .315.
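The odds and odds-ratio arithmetic in this section (the ad/bc ratio from Table 30-5, the conversions in Box 30-2, and the exp(B) calculations) reduces to a few lines; the helper names are mine:

```python
import math

def odds_from_prob(p):
    """Odds = probability / (1 - probability)."""
    return p / (1 - p)

def prob_from_odds(o):
    """Probability = odds / (1 + odds)."""
    return o / (1 + o)

# 2x2 cells from Table 30-5: a = 31, b = 25, c = 13, d = 31
a, b, c, d = 31, 25, 13, 31
or_crude = (a * d) / (b * c)          # ad/bc, about 2.96

# Probability of LTC given limited ADL, as in Box 30-2: 2.96/3.96, about .75
p_ltc = prob_from_odds(or_crude)

# OR from a logistic coefficient is exp(B): B = .965 for ADL, about 2.625
or_adl = math.exp(0.965)

# Scaled OR for a continuous predictor is exp(k * B): for age B = .117,
# a 10-year difference gives about 3.22
or_age_10yr = math.exp(10 * 0.117)

print(or_crude, p_ltc, or_adl, or_age_10yr)
```

Note that the crude ad/bc ratio reproduces the unadjusted OR of 2.96, while exp(B) from the fitted model gives the adjusted value.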
These data will often be provided in tabular form to show ORs, confidence intervals, and probability values for each independent variable.

Sample Size Estimates
A commonly used "rule of thumb" in logistic regression is based on a minimum ratio of 10 cases per predictor variable (subjects per variable, SPV). There is, however, some debate that this approach might be too conservative, with simulation studies showing the regression can be stable with fewer subjects.23 Long24 suggests that the minimum number should be 100.

Peduzzi et al25 offer the following guideline for a minimum number of cases, based on knowing the proportion of positive cases in the population (p) and the number of independent variables in the equation (k):

N = 10 × k/p

For example, for the rehabilitation discharge study, we have four covariates in the model (k = 4). Assume that information from the literature indicates that the proportion of stroke survivors who are discharged to LTC is approximately 20%.26 The minimum number of cases required would be

N = 10 × 4/.20 = 200

Sample size estimates are important when using logistic regression to create predictive models, such as in the development of clinical prediction rules (see Chapter 33). If samples are too small, the predictive validity of the model will be limited.

Calculations for power and sample size for logistic regression in G*Power are complex. See the Chapter 30 Supplement for illustration of this process.

Propensity scores are based on a binary logistic regression, with group membership as the dependent variable and the possible confounders as the independent predictor variables. This results in a single propensity score, regardless of the number of covariates. The score will range from 0 to 1, reflecting the probability that the subject belongs to one group or the other.27 By matching subjects on the basis of their propensity scores, baseline differences can be controlled and researchers can have greater confidence in drawing conclusions about group differences.28,29 Although this approach cannot completely replace the RCT, it can reduce bias and provide some guidance in situations where random assignment is not tenable.

Importantly, propensity scores are only as good as the variables used to define them. They cannot provide adjustment for unmeasured variables, known or unknown—which is what random assignment is intended to do. The scores will also vary depending on which covariates are used. For these reasons, there is considerable debate on whether propensity scores should be used.30

Statisticians can provide advice for use of propensity scores with specific designs. Many statistical programs will match participants by calculating propensity scores, including SPSS, Stata, and SAS.

■ Analysis of Covariance

Matching is one design technique to equate groups on particular characteristics. A different statistical approach, called the analysis of covariance (ANCOVA), involves adjusting scores on the dependent variable to account for the effect of covariates in group comparisons. This procedure is a combination of ANOVA and regression.
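Stepping back to the logistic regression sample-size guideline above, the Peduzzi et al rule is one line of arithmetic (the function name is mine):

```python
def min_cases(k, p):
    """Peduzzi et al guideline: minimum N = 10k / p."""
    return 10 * k / p

# Discharge example: k = 4 covariates, 20% of cases discharged to LTC
print(min_cases(4, 0.20))  # → 200.0
```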
that a student's academic performance could confound the effect of teaching strategy on clinical performance. If there is a correlation between academic and clinical performance, we would want to know if the grade point average (GPA) in the two groups was evenly distributed. If one group happens to have a higher GPA than the other, our results could be misleading (assuming GPA does reflect academic performance).

In this posttest-only design, teaching strategy is the independent variable, clinical performance is the dependent variable, and GPA is a covariate, a potentially confounding variable that we want to control. For instance, if one group has a higher GPA, then that group's mean for the dependent variable is adjusted downward. A lower group mean on the covariate results in the clinical score being raised. The degree of adjustment depends on the difference between groups on GPA. Essentially, ANCOVA estimates what the group clinical performance means would have been had the groups been equal on their average GPA.

The null hypothesis for the ANCOVA is based on adjusted means:

H0: X̄1′ – X̄2′ = 0

where X̄′ is an adjusted mean.

Think of this analogy—you want to compare the heights of three people, each one standing on a different step. It will be hard to know who is taller or shorter because they are on different levels. The step height is the covariate. If we could "adjust" the steps, lowering the higher step and raising the lower step, so that everyone would be on the "same level," we would eliminate the effect of step level, and we can then observe how different their heights are.

The Research Question
To illustrate how the ANCOVA offers this control, let's first look at the hypothetical comparison between the two teaching strategies without considering GPA. Suppose we obtain the following means for clinical performance on a standardized test:

Strategy 1: Video        49.3 (± 18.5)
Strategy 2: Discussion   44.2 (± 24.1)

An ANOVA shows that the group means are not significantly different (p = .562) (see Table 30-7 2 ). Based on this result, we would conclude that the teaching strategies are not different. However, because we know that there is a strong correlation (r = .83) between academic and clinical performance (see Table 30-7 1 ), we might want to be sure that the two groups were equivalent on GPA before drawing that conclusion. Our question then becomes, "Is there a difference in clinical performance using video sessions or discussion groups when taking GPA into account?"

Typically, we would test the difference between two groups using an unpaired t-test. In this case, the ANOVA is used so that the results can be compared to the results of the ANCOVA. However, the outcome of a t-test would be identical.

Using Regression
Figure 30-7A shows the distribution of GPA and clinical performance scores for Strategy 1 (Video) and Strategy 2 (Discussion) with their respective regression lines. The dependent variable, clinical performance score, is plotted along the Y-axis, and the covariate, GPA, is plotted along the X-axis. We can see from this scatterplot that these variables are highly correlated within each group. We can also see that the regression line for Strategy 1 is higher than Strategy 2, indicating that Group 1 had higher values of clinical performance for any given GPA, even though the sample means for clinical score are not significantly different.

There is, however, another important difference. If we look at the mean GPA for each group, we can see that the Strategy 1 group has a lower average GPA than those using Strategy 2. Because GPA is a correlate of clinical performance, it is reasonable to believe that this difference could have confounded the comparison of teaching strategies.

Adjusted Means
To eliminate the differential effect of GPA, we want to artificially equate the two groups on GPA, using the GPA grand mean for the total sample as the best estimate for both groups. The means for GPA are

Strategy 1    2.73
Strategy 2    3.10
Grand mean    2.92

We now want to know what average clinical score we would expect for Strategy 1 and Strategy 2 if the groups had the same mean GPA. If we assign a mean GPA of 2.92 to both groups, we can use the regression lines to predict what the mean clinical score (Ŷ) would be. As shown in Figure 30-7B, by drawing a vertical line at X̄ = 2.92, we would intersect the regression lines at Ȳ1′ = 57.4 and Ȳ2′ = 36.1. These are the adjusted means for each group. Note that the adjusted mean for Strategy 1 is higher than the observed mean, and for Strategy 2 it is lower. These differences reflect variation in the covariate.
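The adjustment described above can be written as Ȳ′ = Ȳ − b(X̄group − X̄grand), where b is the pooled within-group slope. The slope value below is my illustrative assumption (the text does not report it), chosen so the results land near the chapter's adjusted means:

```python
def adjusted_mean(mean_y, mean_x, grand_mean_x, slope):
    """ANCOVA adjustment: slide a group's outcome mean along the
    within-group regression slope to the covariate's grand mean."""
    return mean_y - slope * (mean_x - grand_mean_x)

# Observed means from the chapter; the pooled within-group slope of 44
# is an illustrative assumption (its value is not reported in the text).
b_within = 44.0
video = adjusted_mean(49.3, 2.73, 2.92, b_within)       # GPA below grand mean -> adjusted up
discussion = adjusted_mean(44.2, 3.10, 2.92, b_within)  # GPA above grand mean -> adjusted down
print(round(video, 1), round(discussion, 1))  # lands near the text's 57.4 and 36.1
```

The group whose mean GPA sits below the grand mean is adjusted upward, and the group above it downward, exactly as Figure 30-7B shows.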
Table 30-7 Analysis of Covariance for Comparison of Clinical Performance Following Two Teaching Strategies (N = 24)

Estimates ( 7 )
Dependent Variable: Clinical Performance
                                       95% Confidence Interval
Strategy      Mean      Std. Error     Lower Bound    Upper Bound
Video         56.718a   1.943          52.666         60.770
Discussion    35.530a   1.906          31.555         39.505
a. Covariates appearing in the model are evaluated at the following values: GPA = 2.9158.

1 Observed means and correlation are presented for Clinical Performance and GPA for each group.
2 The ANOVA shows no significant difference in clinical performance between the two teaching strategies (p = .562).
3 The ANCOVA is run with GPA as the covariate.
4 There is a significant difference between the two strategies based on the adjusted means (p = .022).
5 The effect of GPA is significant. This is a necessary condition for validity of the ANCOVA, indicating a significant correlation with the dependent variable, clinical performance (see 1 ). The df equal the number of covariates.
6 The interaction between GPA and Strategy is not significant, indicating homogeneity of slopes. This is also a necessary condition for the validity of the ANCOVA. The df is the product of df for strategy and GPA.
7 These are the adjusted means based on the average GPA for both groups of 2.92. Note that the adjusted mean for Video is larger and the mean for Discussion is smaller than the observed means (see 1 ).
Portions of the output have been omitted for clarity.
Figure 30-7 Comparison of the effectiveness of two teaching strategies on clinical performance with GPA as a covariate. A) Regression lines (dashed lines) define the relationship between clinical performance score and GPA for each group (Video R² = .89, Discussion R² = .94). Solid blue and red lines intersect the regression lines at the means for each group on clinical performance (Ȳ1 = 49.3, Ȳ2 = 44.2) and GPA (X̄1 = 2.73, X̄2 = 3.10). B) The solid green line is the combined mean for GPA (X̄ = 2.92), and the dotted blue and red lines are the adjusted mean scores (Ȳ1′ = 57.4 and Ȳ2′ = 36.1) for clinical performance for each group.

Therefore, we have adjusted scores by removing the effect of GPA—as if they were now standing on the same step! This process can also work in the opposite direction, where group means initially appear significantly different, but are no longer different once adjusted.

Computer Output
An analysis of variance is run on the adjusted values. Table 30-7 3 shows the results of this analysis for the teaching strategy data. Recall that the ANOVA showed no significant difference based on the observed means; the ANCOVA now compares the groups based on the adjusted means (see Table 30-7 4 ). Now we find that the difference between the strategy groups is significant (p = .022), and we can reject the null hypothesis. We conclude that clinical performance does differ between those exposed to videotaped cases and discussion groups when adjusted for their grade point average, with better clinical performance following video sessions.

We have increased the power of our test by decreasing the unexplained variance. We have accounted for more of the variance in clinical performance by knowing GPA than we did by knowing teaching strategy alone.

Power and sample size estimates for the ANCOVA are similar to the ANOVA. See the Chapter 30 Supplement for illustration of power analysis with the ANCOVA using G*Power.

A criticism of the ANCOVA is that the adjusted means are not real scores, impeding generalization of data from an ANCOVA. Interpretation of findings, including estimates of effect size, must take these adjustments into account.

Linearity of the Covariate
For an ANCOVA to be valid, the covariate must be correlated with the dependent variable. This is reflected in the second line of the ANCOVA. More precisely, it tests the hypothesis that the slope of the regression line is different from zero (see Table 30-7 5 ). If this test is not significant, the covariate is not linearly related to the dependent variable, and therefore, there will be no basis for adjusting means. In this example, we can see that the covariate of GPA is significant (p < .001). This value should always be examined to determine that the ANCOVA is an appropriate test.

Homogeneity of Slopes
The ANCOVA requires that the slopes of the regression lines for each group be parallel. Unequal slopes indicate that the relationship between the covariate and dependent variable is different for each group. Therefore, the adjusted means will be based on different proportional relationships, and their comparison will be meaningless. A test for homogeneity of slopes can be performed by looking at the interaction between the independent variable and the covariate, in this case between Strategy and GPA. As shown in Table 30-7 6 , the interaction is not significant (p = .243)—a good thing—indicating that the slopes of the regression lines are not significantly different. This property is illustrated in Figure 30-7. This is a necessary condition for the validity of the ANCOVA.
Additional Assumptions
Two other assumptions are important considerations for running an ANCOVA. First, the variable chosen as the covariate must be related to the dependent variable but must also be independent of the independent variable; that is, the independent variable should not influence the value of the covariate. For example, we would not want to measure GPA after the clinical performance test, as it is possible that the clinical experience could influence future academic performance. To avoid this situation, covariates should always be measured prior to introduction of the independent variable.

Second, the validity of the ANCOVA is also influenced by the reliability of the covariate. Any error found in the covariate is compounded when the regression coefficients and adjusted means are calculated. Therefore, justification for using the adjusted scores is based on accuracy of the covariate.

➤ Results
An analysis of variance was used to compare the clinical performance means for the video (X̄ = 49.33 ± 18.5) and discussion (X̄ = 44.17 ± 24.1) groups, showing no significant difference (F(1,22) = .347, p = .562, η²p = .016). Based on a high correlation between clinical performance and GPA (r = .833, p < .001), an analysis of covariance was run using GPA as a covariate. The mean GPA for the video group (X̄ = 2.73) was lower than for the discussion group (X̄ = 3.10). The ANCOVA showed a significant difference between groups (F(1,20) = 6.14, p = .022, η²p = .235), based on adjusted means for the video (X̄′ = 57.43) and discussion (X̄′ = 36.07) groups. The video strategy was superior to the discussion strategy with GPA taken into account.

The ANCOVA can incorporate varied designs, including a 2-way analysis and repeated measures. Multiple comparisons can also be run on the adjusted means to confirm mean differences.

Using Multiple Covariates
There may be instances where one covariate is not sufficient to account for group baseline differences. Several characteristics may be relevant to understanding the dependent variable, but the covariates chosen should not be highly correlated with each other. If the covariates are correlated with each other, they provide redundant information and no additional benefit is gained by including them. In fact, using a large number of interrelated covariates can be a disadvantage because it will decrease statistical power. It is important, therefore, to make educated choices about the use of covariates. Previous research and pilot studies may be able to document which variables are most highly correlated with the dependent variable and which are least likely to be related to each other. Like regression, the outcome of an ANCOVA can also be significantly altered if a different combination of covariates is used. Researchers must decide which covariates will be most meaningful and decide early so that data are collected on the proper variables.

Pretest–Posttest Adjustments
The ANCOVA is often used to control for initial differences between groups in a pretest-posttest design. Even with randomization, initial measurements on the dependent variable are often different enough across groups to be of concern in further comparison. Therefore, the pretest score can be used as a covariate. Differences between pretest and posttest scores are frequently analyzed as change scores, but the use of the pretest as a covariate in an ANCOVA is often preferred because of the potential for compounded measurement error (see Chapter 9). However, the ANCOVA is not a remedy for a study with poor reliability. Although some research questions may be more readily answered by the use of change scores, the researcher should consider what type of data will best serve the analysis.

Intact Groups
Because of its "correction" function, the ANCOVA is widely used. However, there are limitations to its application. When studies involve groups that are not formed randomly, such as intact groups, there is often a concern about whether the groups are balanced on important variables. Many researchers use the ANCOVA as a way of controlling these differences in quasi-experimental studies, thinking that this approach can substitute for randomization. Ironically, in this type of situation, where such control is most needed, the ANCOVA does not work well.31,32 The problem is that baseline means
pendent variable. For example, we might want to include from intact groups cannot be considered to come from
prior clinical experience as another covariate in addition a common underlying population, an assumption that is
to GPA. The analysis of covariance can be extended to made with randomly assigned groups. This situation can
accommodate any number of covariates. bias the adjusted means, making them inaccurate esti-
When several covariates are used, the precision of the mates. Although a common strategy, readers should be
analysis can be greatly enhanced, as long as the covariates cautious about interpretation of results when this ap-
are all highly correlated with the dependent variable and not proach is used.
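To make the covariate-adjustment mechanics concrete, here is a minimal sketch (not from the text) of an ANCOVA computed as an ordinary linear regression on synthetic data. The variable names and numbers are hypothetical; the adjusted group means are simply the model's predictions with the covariate held at its grand mean.

```python
import numpy as np

# Synthetic example (not the book's data): hypothetical pretest/posttest
# scores for two teaching strategies, with a built-in group effect of 8 points.
rng = np.random.default_rng(1)
n = 60
group = np.repeat([0, 1], n // 2)              # 0 = discussion, 1 = video
pre = rng.normal(50, 10, n) + 4 * group        # groups differ at baseline
post = 0.8 * pre + 8 * group + rng.normal(0, 5, n)

# ANCOVA as a linear model: post ~ intercept + pretest covariate + group
X = np.column_stack([np.ones(n), pre, group])
(intercept, b_pre, b_group), *_ = np.linalg.lstsq(X, post, rcond=None)

# Adjusted group means: predicted outcome with the covariate at its grand mean
adj_discussion = intercept + b_pre * pre.mean()
adj_video = adj_discussion + b_group
print(round(adj_video - adj_discussion, 2))    # estimated group effect after adjustment
```

Note that the group coefficient `b_group` is exactly the difference between the adjusted means: the comparison of groups "with the covariate taken into account" described above.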
6113_Ch30_440-467 02/12/19 5:29 pm Page 466
COMMENTARY
Life is like a sewer — what you get out of it depends on what you put into it.
—Tom Lehrer (1928- )
American mathematician and satirist
Regression procedures are ubiquitous in healthcare literature, and therefore understanding how data can be interpreted is important to valid application. One of the most important lessons is that such analyses are dependent on the variables that are entered in the equation. With different combinations of variables, individual regression coefficients will vary, potentially changing the overall R² and conclusions.

From an evidence-based practice perspective, we need to be sure that we do not discount important variables because of collinearity. Different combinations of input variables can produce results that may suggest one variable is not relevant, when it is important, but related to another variable. The results of such an analysis could be interpreted as saying that a variable is unimportant to a decision, but this may not be the case at all. This is especially true with stepwise procedures. It is essential to remember, therefore, that the relationships defined by a regression equation can be interpreted only within the context of the specific variables included in that equation.

Criticisms of regression, especially stepwise procedures, suggest that because computers have improved our ability to produce complex analyses, we have come to rely more on statistical outcomes than on our own judgments about what variables are important in the context of a study or clinical practice.33 This point emphasizes the need to evaluate data based on logical and theoretical rationales, not just p values.
REFERENCES
1. Dua S, Bhuker M, Sharma P, Dhall M, Kapoor S. Body mass index relates to blood pressure among adults. N Am J Med Sci 2014;6(2):89-95.
2. Grimm LG, Yarnold PR, eds. Reading and Understanding Multivariate Statistics. Washington, D.C.: American Psychological Association; 1995.
3. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum; 1988.
4. UPI Dispatch, Science Desk. Children's height linked to test scores. New York Times October 8, 1986.
5. Wilson DM, Hammer LD, Duncan PM, Dornbusch SM, Ritter PL, Hintz RL, Gross RT, Rosenfeld RG. Growth and intellectual development. Pediatrics 1986;78(4):646-650.
6. Keller MC, Garver-Apgar CE, Wright MJ, Martin NG, Corley RP, Stallings MC, Hewitt JK, Zietsch BP. The genetic correlation between height and IQ: shared genes or assortative mating? PLoS Genet 2013;9(4):e1003451.
7. Beauchamp JP, Cesarini D, Johannesson M, Lindqvist E, Apicella C. On the sources of the height-intelligence correlation: new insights from a bivariate ACE model with assortative mating. Behav Genet 2011;41(2):242-252.
8. Silventoinen K, Posthuma D, van Beijsterveldt T, Bartels M, Boomsma DI. Genetic contributions to the association between height and intelligence: Evidence from Dutch twin data from childhood to middle age. Genes Brain Behav 2006;5(8):585-595.
9. Houx PJ, Jolles J. Age-related decline of psychomotor speed: effects of age, brain health, sex, and education. Percept Mot Skills 1993;76(1):195-211.
10. Rutledge R, Lentz CW, Fakhry S, Hunt J. Appropriate use of the Glasgow Coma Scale in intubated patients: a linear regression prediction of the Glasgow verbal score from the Glasgow eye and motor scores. J Trauma 1996;41(3):514-522.
11. Meredith W, Rutledge R, Fakhry SM, Emery S, Kromhout-Schiro S. The conundrum of the Glasgow Coma Scale in intubated patients: a linear regression prediction of the Glasgow verbal score from the Glasgow eye and motor scores. J Trauma 1998;44(5):839-844; discussion 844-845.
12. Cheng K, Bassil R, Carandang R, Hall W, Muehlschlegel S. The estimated verbal GCS subscore in intubated traumatic brain injury patients: is it really better? J Neurotrauma 2017;34(8):1603-1609.
13. Ostfeld BM, Schwartz-Soicher O, Reichman NE, Teitler JO, Hegyi T. Prematurity and sudden unexpected infant deaths in the United States. Pediatrics 2017;140(1).
14. Noh JW, Lee SA, Choi HJ, Hong JH, Kim MH, Kwon YD. Relationship between the intensity of physical activity and depressive symptoms among Korean adults: analysis of Korea Health Panel data. J Phys Ther Sci 2015;27(4):1233-1237.
15. Stamm TA, Pieber K, Crevenna R, Dorner TE. Impairment in the activities of daily living in older adults with and without osteoporosis, osteoarthritis and chronic back pain: a secondary analysis of population-based health survey data. BMC Musculoskelet Disord 2016;17:139.
16. Menard S. Logistic Regression: From Introductory to Advanced Concepts and Applications. Thousand Oaks, CA: SAGE Publications; 2010.
17. Pinedo S, Erazo P, Tejada P, Lizarraga N, Aycart J, Miranda M, Zaldibar B, Gamio A, Gomez I, Sanmartin V, et al. Rehabilitation efficiency and destination on discharge after stroke. Eur J Phys Rehabil Med 2014;50(3):323-333.
18. Field A. Discovering Statistics Using IBM SPSS Statistics. 5th ed. Thousand Oaks, CA: Sage; 2018.
19. Kwak C, Clayton-Matthews A. Multinomial logistic regression. Nurs Res 2002;51(6):404-410.
20. Laerd Statistics. Multinomial logistic regression using SPSS statistics. Available at https://statistics.laerd.com/spss-tutorials/multinomial-logistic-regression-using-spss-statistics.php. Accessed January 12, 2019.
21. Cox DR, Snell EJ. The Analysis of Binary Data. 2nd ed. Boca Raton, FL: Chapman & Hall; 1989.
22. Nagelkerke NJD. A note on the general definition of the coefficient of determination. Biometrika 1991;78(3):691-692.
23. Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol 2007;165(6):710-718.
24. Long JS. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage; 1997.
25. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996;49(12):1373-1379.
26. Bejot Y, Troisgros O, Gremeaux V, Lucas B, Jacquin A, Khoumri C, Aboa-Eboule C, Benaim C, Casillas JM, Giroud M. Poststroke disposition and associated factors in a population-based study: the Dijon Stroke Registry. Stroke 2012;43(8):2071-2077.
27. Andrade C. Propensity score matching in nonrandomized studies: a concept simply explained using antidepressant treatment during pregnancy as an example. J Clin Psychiatry 2017;78(2):e162-e165.
28. Streiner DL, Norman GR. The pros and cons of propensity scores. Chest 2012;142(6):1380-1382.
29. Staffa SJ, Zurakowski D. Five steps to successfully implement and evaluate propensity score matching in clinical research studies. Anesth Analg 2018;127(4):1066-1073.
30. King G, Nielsen R. Why propensity scores should not be used for matching. Political Analysis. Available at http://j.mp/2ovYGsW. Accessed February 9, 2019.
31. Henson RK. ANCOVA with intact groups: don't do it! Presentation at the Annual Meeting of the Mid-South Educational Research Association. New Orleans, LA. Available at https://archive.org/stream/ERIC_ED426086/ERIC_ED426086_djvu.txt. Accessed January 11, 2019.
32. Huck SW. Reading Statistics and Research. 6th ed. Boston: Pearson; 2012.
33. Dallal GE. Simplifying a multiple regression equation. The Little Handbook of Statistical Practice. Available at http://www.jerrydallal.com/lhsp/simplify.htm. Accessed January 8, 2019.
CHAPTER 31
Multivariate Analysis

Multivariate analysis refers to a set of statistical procedures that are distinguished by the ability to examine several response variables within a single study and to account for their potential interrelationships in the analysis of the data. These tests are distinguished from univariate analysis procedures, such as the t-test and analysis of variance, which accommodate only one measured variable.

In a brief introduction such as this, it is not possible to cover the full scope of these complex statistical procedures. The purpose of this chapter is to present concepts related to commonly used procedures including factor analysis, structural equation modeling, cluster analysis, multivariate analysis of variance, and survival analysis.

■ Exploratory Factor Analysis

The technique of factor analysis is quite different from any of the statistical procedures we have examined thus far. Rather than using data for comparison or prediction, factor analysis examines the structure within a large number of variables, to reflect how they cluster to represent different dimensions of a construct.

Factor analysis is more controversial than other analytic methods because it leaves room for subjectivity and judgment. However, it makes an important contribution to multivariate methods because it can provide insights into the nature of abstract constructs and allows us to superimpose order on complex phenomena. A statistician should be consulted to make decisions about using particular methods under specific research conditions. Those interested in greater detail should refer to several useful resources.1-5

The overarching term "factor analysis" is actually used to refer to several approaches with different goals. Exploratory factor analysis (EFA) is used to examine linear combinations of underlying factors that explain a latent variable or construct. Confirmatory factor analysis (CFA) is used to confirm hypotheses about the structure of underlying constructs. These two approaches will be discussed in the following sections. A third technique called principal components analysis (PCA) attempts to reduce data to a smaller set of summary variables or components that account for most of the variance in a set of observed variables. Although many of the statistical techniques used for these analyses are similar, there are important differences in assumptions and purpose.

See the Chapter 31 Supplement for further discussion of principal components analysis.

Exploratory Factor Analysis
The concept of EFA is illustrated in Figure 31-1. The set of "variables" at the top are part of an overarching construct of "green." We assume that there is some relationship among circles that have similar hues of green, that light green circles are related to other light green circles, but not to darker green circles, but together all the hues explain "green." Through factor analysis, these variables are reorganized into three relatively independent factors, each one representing a unique cluster of variables that are highly correlated among themselves but poorly correlated with items on other factors. Not all factors are equal, however. For instance, factor 1 accounts for more of the variance in the data than the others, with factors 2 and 3 accounting for successively less variance.
Table 31-1 Correlations of Chronic Low Back Pain Behaviors

Table 31-2 Exploratory Factor Analysis: Measures of Chronic Pain Behavior (N = 100)
1. The KMO value is above the minimal acceptable value of 0.6, indicating that the pattern of correlations is sufficiently tight to make the factor analysis meaningful.
2. The test of sphericity is significant, indicating that the correlations among variables are significantly different from zero.
3. Communalities indicate the extent to which each variable correlates with the rest of the data. Only the variable "Isolates oneself" shows a low communality, indicating that its variance is not likely to contribute to any factor.
4. Seven factors are initially identified in the data, corresponding to the number of variables entered. The eigenvalues reflect the amount of common variance accounted for by each factor. Using the 1.0 cutoff for the eigenvalue, the analysis stopped at 3 factors.
5. The three factors initially account for a cumulative 82.2% of the variance in the data.
6. The factor loadings in the unrotated factor matrix do not clearly distinguish the factors.
7. The loadings from the rotated factor matrix show a clearer picture. Highlighted cells identify variables assigned to each factor. The variable "Isolates oneself" loads highest on factor 3, but does not meet the 0.4 standard.
Portions of the output have been omitted for clarity.
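The sphericity check in note 2 can be illustrated with a short sketch. This uses the standard chi-square approximation for Bartlett's test applied to a correlation matrix; the data are synthetic (not the book's pain-behavior dataset), built so the items share a common latent influence and therefore intercorrelate.

```python
import numpy as np
from scipy import stats

# Synthetic data: 100 subjects, 7 items sharing a common latent influence
rng = np.random.default_rng(2)
n, p = 100, 7
latent = rng.normal(size=(n, 1))
X = 0.7 * latent + rng.normal(size=(n, p))

R = np.corrcoef(X, rowvar=False)

# Bartlett's test of sphericity: H0 says the correlation matrix is an identity
# (no intercorrelations). Standard chi-square approximation.
chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
df = p * (p - 1) / 2
p_value = stats.chi2.sf(chi2, df)
print(round(chi2, 1), p_value)
```

Because the items are deliberately correlated, the determinant of R is well below 1 and the test is significant, mirroring the interpretation in note 2.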
The lower the proportion, the more suited the data are to factor analysis—variables will likely break out into separate factors. The KMO statistic takes values between 0 and 1, with stronger values closer to 1, and 0.6 considered a minimal acceptable level.6 For the pain behavior data, the KMO test = 0.651, which just passes the minimal standard.

Bartlett's Test of Sphericity uses the chi-square distribution to test the null hypothesis that intercorrelations equal zero (Table 31-2 2). We want this to be a significant value; that is, we want to reject the null hypothesis. This value is significant for the pain behavior data (p < .001). Because sample sizes are usually large in factor analysis, this test has high power and is rarely nonsignificant.

Communalities give us a measure of the proportion of each variable's variance that can be explained by factors (Table 31-2 3). Higher communalities, above 0.4, indicate that variables will contribute to the factor structure. Values less than 0.4 reflect variables that may not be a significant part of any factor. For example, we may be concerned that the variable "Isolates oneself" will be problematic.

Extraction of Factors
Through a complex series of manipulations, the analysis derives factors through a process called extraction. The number of factors that are extracted depends on how much of the total variance in the data can be explained by a factor, and the extent to which factors are independent of each other. Several methods are available for this process. The most commonly used method for EFA is called principal axis factoring. This method defines factors based on their intercorrelations.

Although used less often, other methods of extraction include the maximum likelihood method, alpha factoring, image factoring, ordinary least squares, and generalized least squares methods.7 These methods differ in how variance is used to determine factors.8

Determining the Number of Factors
The first factor that is "extracted" from the data will account for as much of the variance in the data as possible. The second factor represents the extraction of the next highest possible amount of variance from the remaining variance. Each successive factor that is identified "uses up" another portion of the total variance, until all the variance within the test items has been accounted for. The number of factors derived from a set of variables will always equal the number of variables.

As shown in Table 31-2 4, this analysis has extracted seven factors (because there are seven variables). However, several factors account for small amounts of variance and do not really contribute to an understanding of the structure of the data. The data are usually characterized most efficiently using only the first few components. Therefore, we need to establish a cutoff point to limit the number of factors for further analysis.

The statistic used to set this cutoff is called an eigenvalue. Eigenvalues tell us how much of the total variance is explained by a factor. Factor 1 will always account for more variance than the other factors (in this example 45.0%). The most common approach restricts retaining factors to those with an eigenvalue of at least 1.00 (Table 31-2 5). Some statisticians recommend retaining factors that account for at least 70% of the total variance.9

➤ Using the criterion of eigenvalues above 1.00, three factors were identified for the pain behavior data, accounting for 82.2% of the variance in the original data.

These factors are not "real" yet—they are only statistical entities at this point—and do not indicate which variables are related to which factors. Up to this point the analysis is simply relying on variance to identify factors. It will not be until the end of the analysis that the "factors" will make sense.

The Factor Matrix
The result of the factor analysis is a factor matrix for the three factors that have been extracted (Table 31-2 6). The coefficients listed in the table are called factor loadings, which are measures of the correlation between the individual item and the overall factor. Loadings are interpreted like correlation coefficients, ranging from 0.00 to ±1.00. Generally, loadings above .40 are considered meaningful, although some researchers use .30 as the minimum. Only the absolute value is considered in this interpretation. The sign indicates if the variable is positively or negatively correlated with the factor.

Ideally, each variable will have a loading close to 1.00 on one factor and loadings close to 0.00 on all other factors. These would be considered "pure" factors. This rarely happens, however, and so we look at the highest loadings for each variable. When one variable loads heavily on more than one factor, those factors do not represent unique concepts, and there is some correlation between them. The researcher must then reconsider the nature of the variables included in the analysis and how they relate to the construct that is being studied.

Unfortunately, this factor matrix is often difficult to interpret because it does not provide the most unique structure possible; that is, several variables may be "loaded" on more than one factor. For instance, if we look across the rows for "Moves stiffly" and "Walks stiffly," we can see that factor loadings are moderately strong for both Factors 1 and 2. Therefore, the next step
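The eigenvalue logic just described can be sketched on a small synthetic correlation matrix (hypothetical values, not the Table 31-2 output): eigenvalues are extracted, factors with eigenvalues above 1.0 are retained, and cumulative variance is tracked.

```python
import numpy as np

# Hypothetical 4-variable correlation matrix: two tightly correlated pairs
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
])

eigvals = np.linalg.eigvalsh(R)[::-1]   # eigenvalues, largest first
prop_var = eigvals / eigvals.sum()      # proportion of total variance per factor
retained = eigvals > 1.0                # the eigenvalue-above-1.0 cutoff
print(eigvals.round(2), int(retained.sum()), prop_var.cumsum().round(2))
```

As expected for two correlated pairs, two eigenvalues exceed 1.0, so two factors would be retained; the eigenvalues always sum to the number of variables, which is why dividing by the sum gives the proportion of variance each factor explains.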
6113_Ch31_468-485 02/12/19 4:37 pm Page 472
is to develop a unique statistical solution so that each variable relates highly to only one factor.

Factor Rotation
This concept can be illustrated more simply using a two-dimensional example. Assume we have identified only two factors, Factor 1 and Factor 2. We could plot each of the seven variables against these two axes, as shown in Figure 31-2A. The vertical axis represents Factor 1 and the horizontal axis represents Factor 2. As we can see, variables sit in the geometric space somewhere between the two factors, not directly on either of the axes. Therefore, this plot does not present a clear "structure" in the data in terms of specific factor assignments. If, however, we could rearrange the orientation of axes and variables without altering the underlying relationships, we might be able to create a structure that will help us interpret these relationships.

We do this by a process of factor rotation, turning the axes in such a way as to maximize the orientation of variables near one of the axes. There are actually several ways that factor axes can be statistically rotated to arrive at this solution (see Box 31-1). This example uses the most common approach, called varimax rotation, which tries to minimize the number of variables that have high loadings on each factor, simplifying interpretation.10

This rotation is illustrated in Figure 31-2B. The rotation improves the spatial structure of the variables so that distinct factors are now visible; that is, several of the variables lie directly on or close to one of the axes. The higher the loading, the more the variable appears at the extremes of the axes. We find that variables 2, 6, and 7 now have the closest orientation to Factor 1, and variables 1 and 5 have the closest orientation to Factor 2. Variables 3 and 4, clustered around the origin (loadings close to zero), show little or no relationship to either factor.

Figure 31–2 Factor axes showing loadings of 7 variables on two factors. (A) Unrotated solution. (B) Rotated solution, showing how variables are now loaded closely on one of the two factors. For this illustration, the factor loadings are hypothetical. We cannot use the factor loadings given in Table 31-2, as those represent coordinates in a three-dimensional space.

Box 31-1 Types of Rotation
There are actually several forms of factor rotation that can be used under different circumstances. Varimax rotation is considered orthogonal rotation because the axes stay perpendicular to each other as they are rotated. This means that the two factors are independent of each other (orthogonal means independent); that is, they maintain maximal separation. Oblique rotation allows the axes to change their orientation to each other. Therefore, some variables could be close to two factors, … variable.
Each of these methods will result in a slightly different positioning of the axes. For some analyses it may be necessary to try different solutions to develop the one that best differentiates factors. Fortunately, these processes are easily requested in a computer analysis. See Tabachnick and Fidell8 and other references for more detailed discussion of these rotation strategies.
(Table 31-2 7). We interpret this information by looking across each row of the matrix to determine which factor has the highest loading for that variable. Highlighted values show the one loading for each variable that has the strongest relationship to one of the factors. The factors are starting to take on meaning.

➤ • "Complains of pain" and "Groans" load highest on factor 1.
• "Moves stiffly" and "Walks stiffly" load highest on factor 2.
• "Changes position" and "Rubs back" load highest on factor 3.
• The item for "Isolates oneself" does not have a sufficiently high loading on any factor.

Figure 31–3 Names assigned to three factors in the pain behavior study. Numbers on arrows are the factor loadings. Item 5 did not load on any factor and may not be appropriately included as part of pain behavior based on this analysis.

Naming Factors
The final step in a factor analysis is the naming of factors according to a common theme or theoretical construct that characterizes the variables in the factor. This is a subjective and sometimes difficult task, especially in situations when the variables within a factor do not have obvious ties. The researcher must look for commonalities and theoretical relationships that will explain the latent trait being studied and come up with a factor name that reflects it. Figure 31-3 shows how the seven variables have been organized into three factors, and the names that have been assigned to describe them.

Sometimes certain combinations of variables don't easily suggest a factor name, and it may be necessary to reexamine the nature of the construct being studied and the variables included. We must also recognize that similar analyses on different samples or with a different combination of variables may organize data differently, as will different methods of extraction or rotation.

Note that item 5, "Isolates oneself," does not appear to be a meaningful variable for understanding these pain behaviors. Its low communality and factor loadings suggest that it is not part of the larger construct. Two approaches may be taken to handle this situation. One would be to eliminate this item in future studies. But it is still possible that it has some importance to the overall construct of pain behavior. It may also be reasonable, then, to consider other items that should be included in the analysis that would better explicate the theoretical foundations of pain behavior.

What we have now is a set of variables that contribute to a construct we are calling "chronic pain behavior." We can begin to understand the structure of pain behavior by focusing on three elements that we have labelled mobility, verbal complaints, and nonverbal complaints. As we move forward in this research, we can explore how each of these elements contributes to a patient's reactions to treatment, interactions with family, participation in work or social activities, and so on. The factor analysis has provided a framework from which we can better understand these types of theoretical relationships.

Note that although "Changes position" and "Rubs back" have been assigned to factor 3 (nonverbal behavior), they both have loadings >.40 on factor 1, indicating that there may be a correlation between verbal and nonverbal behaviors.

➤ Results
Factor analysis, using principal axis factoring and varimax rotation, was conducted to determine the structure underlying
behaviors related to chronic LBP. We confirmed the suitability of the data using the Kaiser-Meyer-Olkin test, which was .651, and Bartlett's test of sphericity, which was significant (χ²(21) = 582.05, p < .001), both meeting minimal standards. Communalities for the 7 items were between .55 and .98, except for item 5 (Isolates oneself), which had a communality of .041.
The analysis derived three factors, retaining factors with an eigenvalue greater than 1.0, representing 82.2% of the original variance. Factor loadings above .40 were used to assign variables to factors. Factor 1 corresponded to verbal complaints (Complaints of pain and Groaning). Factor 2 represented difficult mobility (Moves stiffly and Walks stiffly). Factor 3 described nonverbal pain behaviors (Rubs back and Changes position). Item 5 (Isolates oneself) did not meet the .40 threshold for any factor and does not contribute to an understanding of the behaviors in this construct.

A table containing values in the rotated factor matrix is usually provided in the results section of a research report.

Factor Scores
An interesting use of factor analysis is the creation of a smaller set of composite scores that can be used in further statistical analysis. Scores can be created for each factor by multiplying each variable value by a weighting, and then summing the weighted scores for all variables within that factor. This results in a factor score. Factor scores can be calculated for each subject for each factor, and those scores can then be used for further statistical testing. This process decreases the total number of variables used, which improves variance estimates for analyses such as regression. By using factor scores, large data sets become much more manageable.

Sample Size for Factor Analysis
Because of the exploratory nature of factor analysis, there is an expectation of stability within the data for an EFA. One common approach to setting sample size is based on the subject-to-variables (STV) ratio. However, there is really no consensus regarding the "rules of thumb" that are recommended. One guideline is to have at least 300 cases for any study,11 although solutions that have the majority of factor loadings >.80 do not require as large a sample and may be sufficient with about 150 cases.8 Others have recommended a minimum number of observations at least five times the number of variables (a STV ratio of at least 5),11 with a minimum of 100 observations regardless of the STV ratio.12 Still others have argued that a STV of 10 to 1 should be used. Because we are not dealing with probabilities, power is not an issue per se, but larger samples will provide a better estimate of variance components within the data.

Confirmatory Factor Analysis
Confirmatory factor analysis is similar to EFA in many ways, but with a distinctly different purpose. The main differences are related to hypothesis testing and the final fit of a model. EFA is run with no prior assumptions about how factors will be manifested. It explores the variance structure and factors are then identified. For CFA, however, a model is proposed at the outset, and a hypothesis is set forth to assess how well sample data support the theorized factor model. The model is often based on findings from an EFA or it may be developed to represent a theory (see Focus on Evidence 31-1). The criteria for variable inclusion are usually more stringent in CFA, with a rule of thumb that variables with factor loadings <0.7 are dropped.
CFA is a measurement approach based on the assumption that you enter the factor analysis with a firm idea about the theoretical foundation of the model, the number of factors you will encounter, and which variables should load onto each factor. Therefore, it is reasonable to think of EFA as a theory-generating and CFA as a theory-testing technique.13 This approach is often used to establish construct validity for items in a questionnaire or scale (see Chapter 10).

Take a look back at the discussion of factor analysis in the continued development of the Arthritis Impact Measurement Scale (AIMS) in Chapter 10, which also illustrated exploratory and confirmatory approaches.

■ Structural Equation Modeling

Previous discussions of regression and factor analysis procedures have emphasized exploring relationships, a process in which we have been careful to avoid an assumption of causation. However, a different approach called causal modeling can be used to illustrate possible causal relationships. These procedures examine whether a pattern of intercorrelations among variables "fits" an underlying theory of how constructs are linked and the directionality of those relationships.17,18
Models are studied using a process of structural equation modeling (SEM), which is a confirmatory approach for hypothesis testing, combining elements of factor analysis and multiple regression. The process for SEM is quite complex, and this discussion will focus on a conceptual understanding of how results are reported and interpreted. Several useful references are available for those who are looking for greater detail.8,19-23
6113_Ch31_468-485 02/12/19 4:37 pm Page 475
Focus on Evidence 31–1
The Whole Is More Than the Sum of Its Parts

The Minnesota Living with Heart Failure Questionnaire (MLHFQ) is the most widely used measure of health-related quality of life (QOL) in patients with heart failure (HF).14 The MLHFQ is a self-administered tool consisting of 21 items with responses on a 6-point scale (0-5), ranging from "no impairment" to "very much impaired." Three summary scores are used: a total score as well as two subscales, physical and emotional. Only eight items are used in the total score. The MLHFQ has shown good responsiveness, sensitivity to change, and reliability.15

The MLHFQ is often used as a basis for evaluating changes in QOL, as a foundation for clinical decision-making, intervention, and monitoring progress or decline in patients with heart failure (HF). Therefore, it is important to recognize the underlying premise for each subscale to understand how the scores can influence these judgments. Because the dimensions of the test had not been verified, Garin et al16 explored the components of the MLHFQ using data for 3,847 patients with HF from 21 countries. The factorial structure of the MLHFQ was assessed by randomly dividing the sample into two subsamples, the first used to conduct an EFA, which was then tested on the second subsample using a CFA.

The EFA resulted in a three-factor solution, which was composed of physical and emotional items, but a third factor related to social environment was also identified. These factors were used for the CFA model, which also included a general factor based on the full 21-item scale, as well as the three specific factors. The CFA confirmed factors for physical, emotional, and social domains, and also confirmed that the total score represented a singular construct of QOL.

[Figure residue: model diagram showing the specific factors, including Factor 1 (Physical Domain, 8 items) and Factor 3 (Social Environment Domain, 4 items).]

There are two important concepts illustrated by this type of study. One is the value of an initial EFA to develop a theoretical framework for a scale, which revealed a third domain related to the social environment that was not considered part of the original scale. Given the chronic nature of HF, this finding highlights the relevance of social issues for evidence-based practice, considerations that could change decision-making regarding the long-term impact of the disease. The second is confirmation of the construct validity of the total score as a representation of the overall effect of HF on QOL. By studying the scale in this way, the researchers demonstrated that individual items could load directly onto the total score as well as onto one of the three specific domains. This was pertinent because not all of the items loaded onto one of the three subscales.

An added benefit of using a large international sample was the cross-cultural validation of the MLHFQ, which has been translated into many languages. Because most of the validation work was based on the original English version, these data support the use of the tool across settings and countries, providing essential clarification for using the tool in clinical trials.
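The exploratory step of this split-sample strategy can be sketched in miniature. This is an illustrative EFA only (scikit-learn assumed; the data are simulated to contain two latent factors and are not the MLHFQ data):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)

# Simulated responses: 200 subjects x 6 items, where items 0-2 reflect one
# latent factor and items 3-5 reflect another.
f1 = rng.normal(size=(200, 1))
f2 = rng.normal(size=(200, 1))
X = np.hstack([f1, f1, f1, f2, f2, f2]) + 0.5 * rng.normal(size=(200, 6))

# Exploratory factor analysis extracting two factors.
fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
loadings = fa.components_.T   # one row of loadings per item
```

In the Garin study, the structure suggested by this exploratory step was then tested on the held-out subsample with a CFA, which requires SEM software rather than an exploratory fit.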
Yes, another meaning for the acronym SEM! It has been used for standard error of the mean (Chapter 22), standard error of measurement (Chapter 9), and now structural equation modeling. A popular set of letters!

Confirmatory factor analysis is actually a special case of SEM and is often used as a precursor to define how variables are related within factors that fit a theory, but CFA is not used to explore causal connections. Structural models are designed to assess latent trait variables but may also include observed or measured variables as well (see Chapter 10).

➤ CASE IN POINT #2
Wang and Geng24 were interested in exploring the relationship between socioeconomic status (SES) and health, which has been documented in many countries. There is also a growing understanding that lifestyle has a significant effect on physical and psychological health.25 They used data from the 2015 Chinese General Social Survey to assess if these theoretical relationships were manifested in Chinese culture.

The Path Diagram
SEM begins with the proposal of a theoretical premise, a researcher's conceptualization of possible causal
Cluster 3 was the smallest cluster (n = 27). Close to 40% lived in nursing homes, tended to have more comorbidities, and experienced falls. Most could not walk without difficulty, although close to half had regained walking function.

Those in cluster 4 (n = 39) exhibited the poorest performance, with few being independent in walking or ADL. They were more likely to live in a nursing home, be disoriented, have comorbid problems, and be confined to a wheelchair or bed.

Clinicians can examine the characteristics within each cluster to understand how these profiles emerge, allowing for development of specific management strategies that are appropriate for patients with similar combinations of characteristics.

[Figure residue: dendrogram of 207 patients grouped into one to eight clusters; the four-cluster solution contains clusters of n = 79, 62, 27, and 39.]
Figure 31–5 Cluster analysis in a study of functional recovery following hip fracture. The authors determined that the solution with four clusters best described the characteristics of the sample. From Michel JP, Hoffmeyer P, Klopfenstein C, et al. Prognosis of functional recovery 1 year after hip fracture: typical patient profiles through cluster analysis. J Gerontol: Med Sci 2000;55A(9):M508-515, Figure 1, p. 511. Used with permission of the Gerontological Society of America.

■ Multivariate Analysis of Variance

Many clinical research designs incorporate tests for more than one dependent variable. For example, if we were interested in the physiological effects of exercise, we might measure heart rate, systolic and diastolic blood pressure, respiration, oxygen consumption, and other related variables on each subject at the same time. It makes sense to do this because it is efficient to collect data on as many relevant variables as possible at one time and because it is useful to see how one person's responses vary on all these parameters concurrently. These types of data are usually analyzed using t-tests or analyses of variance, with each dependent variable being tested in a separate comparison.

This approach to data analysis presents two major problems. First, the use of multiple tests of significance within a single study can increase the probability of a Type I error (see Chapter 26). The second problem relates to the assumption that each test represents an independent event. However, if we measure heart rate, blood pressure, and respiration on one person, we cannot assume that the responses are unrelated. Most likely, changes in one variable will influence the others. Therefore, these responses are not independent events and should not be analyzed as if they were.

The purpose of a multivariate analysis of variance (MANOVA) is to account for the relationship among several dependent variables when comparing groups or conditions. It addresses a research question that purposefully includes multiple outcome measures that have related effects. In many situations, a MANOVA can be more powerful than multiple analyses of variance if the dependent variables are correlated because it can account for multicollinearity that would not be evident with separate ANOVAs. In addition, it is possible that individual tests would not show a significant effect, but when the variables are tested as a composite defining a theoretical construct, significant relationships may become evident.

This test can be applied to all types of experimental designs, including repeated measures, factorial designs, and analyses of covariance. Like the ANOVA, the MANOVA is subject to assumptions of normality, homogeneity of variance, and use of continuous interval/ratio measures for dependent variables. It may be useful to review concepts presented in Chapter 25, as they will apply to the MANOVA as well.
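The first of these problems, inflation of Type I error across separate tests, can be quantified directly: with m independent tests each run at α = .05, the chance of at least one false positive is 1 − (1 − α)^m. A quick check (plain Python; the function name is mine):

```python
alpha = 0.05

# Familywise error rate: probability of at least one false positive
# across m independent significance tests, each at level alpha.
def familywise_error(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

rates = {m: round(familywise_error(m), 3) for m in (1, 3, 5, 10)}
```

Five separate tests inflate the nominal 5% error rate to roughly 23%, and ten tests to roughly 40%, which is part of the rationale for a single multivariate test.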
6113_Ch31_468-485 02/12/19 4:37 pm Page 479
[Figure residue: bar chart, y-axis "Percent of Cluster" (0 to 120), with panels for Cluster 1 (N = 79, 38%), Cluster 2 (N = 62, 30%), Cluster 3 (N = 27, 13%), and Cluster 4 (N = 39, 19%).]
Figure 31–6 Four clusters showing characteristics of patients who suffered a hip fracture. These clusters describe percentages for having one or more comorbidities, falls in previous 3 months, walking without difficulty, regaining prior ADL level, being confined to a wheelchair (WC), living in a nursing home (NH), or being disoriented. Blank spaces indicate 0.0%. Data source: Michel JP, Hoffmeyer P, Klopfenstein C, et al. Prognosis of functional recovery 1 year after hip fracture: typical patient profiles through cluster analysis. J Gerontol: Med Sci 2000;55A(9):M508-515.
in most cases the tests yield similar results.32 The rationale for choosing one test over the others is based on a complex consideration of statistical power and how well the assumptions underlying each test are met, concepts well beyond the scope of this discussion. Readers are encouraged to consult with a statistician to make these decisions based on the specific research situation.

➤ Results
A MANOVA was used to compare the three patient groups and the control group on several gait parameters. The MANOVA showed a significant difference in gait variables across the four groups (Roy's Largest Root = 2.95, F(8,94) = 34.66, p<.001).

Follow-Up Tests
The MANOVA is an omnibus test like the analysis of variance. Therefore, because this analysis demonstrates a significant effect, follow-up post hoc analyses are necessary. There are two common ways to do this. The first is to run univariate ANOVAs on each dependent variable, followed by multiple comparison tests (see Chapter 26). However, this strategy negates the purpose of looking at the dependent variables together to account for their correlation, and ignores the potential for Type I error. A preferred method uses a procedure called discriminant analysis.

Discriminant Analysis
Discriminant analysis is a form of multiple regression, used when the dependent variable is categorical. It is a technique for distinguishing between two or more groups based on a set of characteristics that are predictors of group membership. An equation is generated that classifies subjects according to their scores, and the model is then examined to see if the classifications were correct. Variables can be entered as a fixed set of variables or in a stepwise manner to reduce the discriminant function to a minimum number of relevant variables.

Functions are generated according to an eigenvalue, which, as you will recall from our discussion of factor analysis, indicates the amount of variance explained by that function. In this case the eigenvalue represents the between-groups/within-groups variance. The first function will account for the most variance, and the second function will account for variance that is not explained in the first. The number of functions will be the lesser of the number of groups minus 1, or the number of independent variables. In the TKA study, with four groups, three discriminant functions were generated.

The accuracy of the discriminant analysis is based on canonical correlation, which expresses the relationship of group membership with variables in the discriminant function. This is tested in two ways. The first method uses chi-square to test significance of the correlation. The second is Wilks' lambda, λ, which estimates the proportion of total variance that is NOT explained by the group effect, or the error variance. Therefore, 1-λ is an index of explained variance, which can be interpreted as R2.

Discriminant analysis has an important distinction from logistic regression, which is also based on a categorical dependent variable. In discriminant analysis, the independent variables are assumed to be normally distributed, and variances are assumed to be equal across groups. Logistic regression is best applied when independent variables are dichotomous.

Although discriminant analysis is multivariate, it is actually analogous to an ANOVA or t-test, just turned around. With an ANOVA, means are compared across groups (the independent variable) to see if they are significantly different on the dependent variable. With discriminant analysis, the group membership becomes the dependent variable (to be predicted), and the outcome measures become the independent variables (predictors). If there is only one measured variable, the results of an ANOVA and discriminant analysis will be the same. If there are several measured variables, the discriminant analysis can be more useful, accounting for their interdependence and controlling for potential Type I error.

Post-Hoc Test for MANOVA
Discriminant analysis is often used as a post-hoc procedure for a MANOVA, preferable to multiple ANOVAs because it maintains the integrity of the multivariate research question. This approach was used in the TKA study following a significant MANOVA.
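As an illustrative sketch of discriminant classification (scikit-learn assumed; the two gait variables and group means are invented, not the TKA study's data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Invented gait data: gait speed (m/s) and cadence (steps/min)
# for 30 controls and 30 patients.
controls = rng.normal([1.2, 110.0], [0.1, 5.0], size=(30, 2))
patients = rng.normal([0.9, 100.0], [0.1, 5.0], size=(30, 2))
X = np.vstack([controls, patients])
group = np.array([0] * 30 + [1] * 30)   # group membership is the outcome

# Fit the discriminant function, then see how well it classifies subjects.
lda = LinearDiscriminantAnalysis().fit(X, group)
accuracy = lda.score(X, group)   # proportion classified correctly
```

With two groups there is a single discriminant function; with the four groups of the TKA study, three functions would be generated, as described above.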
(λ = 0.222, χ2(24) = 144.50, p<.001). The three patient groups all showed a continued deficit in gait, with no significant improvement in the post-op groups.

The results using the first two discriminant functions are plotted in Figure 31–7. Each point in the graph represents the vector for each subject on the combined independent variables. There is a clear separation between the three patient groups and the controls. Further analysis of these data includes looking at coefficients in the discriminant function equations for each of the measured variables to determine which have contributed most to the group differentiation. For example, in the TKA study, researchers could determine which of the gait parameters were most effective in differentiating between patients and controls.

[Figure 31–7 residue: canonical discriminant functions plot, Function 2 (approximately −5.0 to 5.0) against the first function, showing individual scores and group centroids for the three patient groups and the controls.]

■ Survival Analysis

Many research questions focus on the effectiveness of intervention or prognoses, with measurement of short-term effects. Long-term outcomes, however, may be of greater interest in relation to survival or prevention because they better reflect the true impact of interventions. Long-term effects are typically evaluated with reference to survival time to an identified "event." The concept of survival analysis is important to understanding prognosis and treatment effectiveness. It answers questions relating to time, such as: "How long will it be before I am better?" "How long am I going to live?" "When in the future will the risk of recurrence of a disorder increase or decrease?"

Survival analysis is based on the documentation of a terminal event. The most common use is in the estimate of life expectancy. For many diseases, such as cancer or cardiovascular disease, the terminal event of interest may be death. Life expectancy can also be examined in relation to functional conditions. For example, Strauss et al33 looked at decline in function and life expectancy in older persons with cerebral palsy (CP). They were able to demonstrate that survival rates of ambulatory older adults with CP were only moderately worse than the general population but were much poorer for those who had lost mobility.

Survival time can also refer to other events, such as time to relapse, injury, or loss of function. For instance, Ruland et al34 used survival analysis to examine time to recurrence of stroke. Grossman and Moore35 followed the longitudinal course of aphasia and decline of sentence comprehension. Verghese et al36 documented that the incidence and prevalence of gait disorders were high in community-residing older adults and that gait disorders were associated with greater risk of institutionalization and death over a 5-year period.

➤ CASE IN POINT #5
Infectious gangrene of the foot is a serious complication of diabetes, often leading to amputation. Huang et al37 studied the association between several risk factors and mortality in 157 patients with type 2 diabetes who had received treatment for foot gangrene between 2002 and 2009. Within the sample, 90 patients had a major amputation (above the ankle), and 67 had a minor amputation (below the ankle).
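Survival curves such as those compared in this study are commonly built with the Kaplan-Meier product-limit method. A minimal sketch (plain Python, with made-up follow-up data; censored subjects leave the risk set without stepping the curve down):

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.
    times: follow-up time for each patient (e.g., years).
    events: 1 if the terminal event occurred, 0 if censored."""
    data = sorted(zip(times, events))
    n = len(data)
    s = 1.0
    curve = []                      # (event time, survival probability)
    for i, (t, died) in enumerate(data):
        at_risk = n - i             # patients still observed just before t
        if died:
            s *= (at_risk - 1) / at_risk   # step the curve down
            curve.append((t, s))
    return curve

# Made-up data: five patients, one censored at year 3.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1])
```

Note how the patient censored at year 3 shrinks the risk set for later event times without itself producing a step, which is how survival analysis accommodates incomplete follow-up.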
➤ At the start of the study, both groups of patients had the same probability of survival. This probability decreased over time in both groups, but rate of decline was steeper for patients with a major amputation. We can see that the distinction between the groups became evident within the first year of follow-up. After 3 years, the probability of survival for those with minor amputation was around 70%, whereas it was closer to 45% for those with a major amputation.

Cox Proportional Hazards Model
Survival time is often dependent on many interrelated factors that can contribute to increased or decreased probabilities of survival or failure. A regression model can be used to adjust survival estimates on the basis of several independent variables. Standard multiple regression methods cannot be used because survival times are typically not normally distributed and because
of censored observations. The most commonly used method is the Cox proportional hazards model, which is conceptually similar to multiple regression but without assumptions about the shape of distributions. For this reason, this analysis is often considered a nonparametric technique.

The proportional hazards model is based on the hazard function, which is related to the survival curve. This function represents the risk of dying (or the terminal event) at a given point in time, assuming that one has survived up to that point. The dependent variable is the hazard (risk), and the independent variables, or covariates, are those factors thought to explain or influence the outcome. Variables may be continuous, dichotomous, or ordinal.40 For example, in the amputation study, associated risk factors included age, diabetes duration, HbA1c, level of hypertension, renal function, and ankle-brachial index.

Survival curves allow for comparison of the hazard risk, and therefore survival time, associated with the different groups. In the amputation example, level of amputation is the intervention, and we can determine if the different levels contribute to a different survival risk.

Like odds ratios generated from a logistic regression, a hazard ratio (HR) can be generated from coefficients in the hazard function. A HR of 1.0 indicates that there is no excess risk associated with the covariates. A value greater than 1.0 indicates that a covariate is positively associated with the probability of the terminal event, thereby decreasing survival. A HR less than 1.0 indicates that the covariate is protective, decreasing the probability of the terminal event and thereby increasing survival time. Confidence intervals can be expressed for the hazard ratio to indicate significance, with a null value of 1.0.

➤ Results
For the 5 years of the amputation study, 109 of the 157 patients died, with a median survival time of 3.12 years and a 5-year survival rate of 40%. Patients with a minor amputation had a better median survival time (5.5 years) than those with a major amputation (1.9 years) (p < .001). Cox regression analysis demonstrated that age [HR 1.04 (95% CI, 1.01-1.06)] and major amputation [HR 1.80 (95% CI, 1.05-3.09)] were independent factors associated with mortality.

The confidence intervals for these hazard ratios do not contain 1.0 and therefore are significant. The authors concluded that efforts to limit amputations to the below-ankle level resulted in better survival. They suggested that this outcome could be due to other related disabilities with higher amputation level, such as limited mobility or higher risk for falls. Survival curves do not tell us why the groups are different. Separate curves can be generated for subgroups to determine how other covariates affect the outcome. For example, in the amputation study, separate hazard ratios were obtained for several covariates, including age and renal function.
COMMENTARY
Multivariate analyses have become popular in behavioral research because of the increased availability of computer programs to implement them. Their applications are, however, not well understood by many clinical researchers, and many studies using multivariate designs are still analyzed using univariate statistical methods.

Multivariate techniques can accommodate a wide variety of data and are able to account for the complex interaction and associations that exist in most clinical phenomena. Many research questions could be investigated more thoroughly if investigators considered multivariate models when planning their studies. This chapter has been limited to a discussion of the conceptual elements of multivariate analysis but with enough of an introduction to terminology and application that the beginning researcher should be able to understand research reports and follow computer output. Statisticians can be extremely helpful in sorting through the different methods, many of which provide different approaches to the same question.

Although there is great potential for improving explanation of clinical data using multivariate methods, an important caveat is that clinical research need not be complicated to be meaningful. A problem is not necessarily better solved by a complex analysis, nor should such an approach be taken just because computer programs are available. The indiscriminate use of multiple measurements is not a useful substitute for a well-defined study with a select number of variables. To be sure, the results of multivariate analyses are harder to interpret and involve some risk of judgmental error, such as in factor analysis or cluster analysis. Many important and concise research questions can be answered using simpler methods and designs, and many clinical variables can be studied effectively using a single criterion measure. On the other hand, simple analysis is not necessarily better just because the interpretation of results will be easier and clearer. The choice of analytic method must be based on the research question and theoretical foundations behind it.
REFERENCES
1. Pett MA, Lackey NR, Sullivan JJ. Making Sense of Factor Analysis: The Use of Factor Analysis for Instrument Development in Health Care Research. Thousand Oaks, CA: Sage; 2003.
2. Fabrigar LR, Wegener DT. Exploratory Factor Analysis. New York: Oxford University Press; 2012.
3. Kachigan SK. Multivariate Statistical Analysis: A Conceptual Introduction. 2nd ed. New York: Radius Press; 1991.
4. Grimm LG, Yarnold PR, eds. Reading and Understanding Multivariate Statistics. Washington, DC: American Psychological Association; 1995.
5. Field A. Discovering Statistics Using IBM SPSS Statistics. 5th ed. Thousand Oaks, CA: Sage; 2018.
6. Kaiser H. An index of factorial simplicity. Psychometrika 1974;39(1):31-36.
7. IBM Knowledge Center. Factor Analysis Extraction. Available at https://www.ibm.com/support/knowledgecenter/en/SSLVMB_24.0.0/spss/base/idh_fact_ext.html. Accessed February 13, 2019.
8. Tabachnick BG, Fidell LS. Using Multivariate Statistics. 7th ed. New York: Pearson; 2018.
9. Pituch KA, Stevens JP. Applied Multivariate Statistics for the Social Sciences: Analyses with SAS and IBM's SPSS. 6th ed. New York: Routledge; 2016.
10. IBM Knowledge Center. IBM SPSS Statistics V23.0 documentation. Factor Analysis Rotation. Available at https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/base/idh_fact_rot.html. Accessed February 13, 2019.
11. Bryant FB, Yarnold PR. Principal-components analysis and exploratory and confirmatory factor analysis. In: Grimm LG, Yarnold PR, eds. Reading and Understanding Multivariate Statistics. Washington, DC: American Psychological Association; 1995:99-136.
12. Gorsuch RL. Factor Analysis. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum; 1983.
13. Huck SW. Reading Statistics and Research. 6th ed. Boston: Pearson; 2012.
14. Rector TS, Cohn JN. Assessment of patient outcome with the Minnesota Living with Heart Failure Questionnaire: reliability and validity during a randomized, double-blind, placebo-controlled trial of pimobendan. Pimobendan Multicenter Research Group. Am Heart J 1992;124(4):1017-1025.
15. Garin O, Ferrer M, Pont A, Rue M, Kotzeva A, Wiklund I, Van Ganse E, Alonso J. Disease-specific health-related quality of life questionnaires for heart failure: a systematic review with meta-analyses. Qual Life Res 2009;18(1):71-85.
16. Garin O, Ferrer M, Pont A, et al. Evidence on the global measurement model of the Minnesota Living with Heart Failure Questionnaire. Qual Life Res 2013;22(10):2675-2684.
17. Aron A, Coups EJ, Aron EN. Statistics for the Behavioral and Social Sciences: A Brief Course. 6th ed. Boston: Pearson; 2019.
18. Schreiber JB, Nora A, Stage FK, Barlow EA, King J. Reporting structural equation modeling and confirmatory factor analysis results: a review. J Educ Res 2006;99(6):323-337.
19. Hox JJ, Bechger TM. An introduction to structural equation modeling. Fam Sci Rev 1998;11(4):354-373.
20. Blunch NJ. Introduction to Structural Equation Modeling Using IBM SPSS Statistics and AMOS. 2nd ed. Los Angeles, CA: Sage; 2013.
21. Schumacker RE, Lomax RG. A Beginner's Guide to Structural Equation Modeling. 4th ed. New York: Routledge; 2016.
22. Raykov T, Marcoulides GA. A First Course in Structural Equation Modeling. 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates; 2006.
23. Klem L. Structural equation modeling. In: Grimm LG, Yarnold PR, eds. Reading and Understanding More Multivariate Statistics. Washington, DC: American Psychological Association; 2000:227-260.
24. Wang J, Geng L. Effects of socioeconomic status on physical and psychological health: lifestyle as a mediator. Int J Environ Res Public Health 2019;16(281).
25. Simandan D. Rethinking the health consequences of social class and social mobility. Soc Sci Med 2018;200:258-261.
26. Hooper D, Coughlan J, Mullen MR. Structural equation modelling: guidelines for determining model fit. Electronic J Bus Res Meth 2008;6(1):53-60.
27. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling 1999;6(1):1-55.
28. SSI Scientific Software. LISREL 10.1 Release notes. Available at http://www.ssicentral.com/lisrel/lisrel10.pdf. Accessed February 2, 2019.
29. IBM SPSS Amos. Available at https://www.ibm.com/us-en/marketplace/structural-equation-modeling-sem. Accessed February 4, 2019.
30. Michel JP, Hoffmeyer P, Klopfenstein C, Bruchez M, Grab B, d'Epinay CL. Prognosis of functional recovery 1 year after hip fracture: typical patient profiles through cluster analysis. J Gerontol A Biol Sci Med Sci 2000;55(9):M508-515.
31. Rahman J, Tang Q, Monda M, Miles J, McCarthy I. Gait assessment as a functional outcome measure in total knee arthroplasty: a cross-sectional study. BMC Musculoskelet Disord 2015;16:66.
32. Warne RT. A primer on multivariate analysis of variance (MANOVA) for behavioral scientists. Pract Assess Res Eval 2014;19(17):1-10.
33. Strauss D, Ojdana K, Shavelle R, Rosenbloom L. Decline in function and life expectancy of older persons with cerebral palsy. NeuroRehabilitation 2004;19(1):69-78.
34. Ruland S, Richardson D, Hung E, et al. Predictors of recurrent stroke in African Americans. Neurology 2006;67(4):567-571.
35. Grossman M, Moore P. A longitudinal study of sentence comprehension difficulty in primary progressive aphasia. J Neurol Neurosurg Psychiatry 2005;76(5):644-649.
36. Verghese J, LeValley A, Hall CB, Katz MJ, Ambrose AF, Lipton RB. Epidemiology of gait disorders in community-residing older adults. J Am Geriatr Soc 2006;54(2):255-261.
37. Huang YY, Lin CW, Yang HM, Hung SY, Chen IW. Survival and associated risk factors in patients with diabetes and amputations caused by infectious foot gangrene. J Foot Ankle Res 2018;11:1.
38. Clark TG, Bradburn MJ, Love SB, Altman DG. Survival analysis part I: basic concepts and first analyses. Br J Cancer 2003;89(2):232-238.
39. Lystad RP, Brown BT. "Death is certain, the time is not": mortality and survival in Game of Thrones. Inj Epidemiol 2018;5(1):44.
40. Bradburn MJ, Clark TG, Love SB, Altman DG. Survival analysis part II: multivariate data analysis—an introduction to concepts and methods. Br J Cancer 2003;89(3):431-436.
6113_Ch32_486-508 02/12/19 4:36 pm Page 486
CHAPTER 32
Measurement Revisited: Reliability and Validity Statistics
—with K. Douglas Gross
Reliability and validity are foundational concepts for evidence-based practice, as we strive to measure bodily impairments, activity limitations, quality of life, and other outcomes in a meaningful way. Sound clinical decision-making depends on having confidence in our measurements to accurately determine when an intervention is effective, when changes in patient status have occurred, or when test results are diagnostic.

It would be misleading to suggest that any statistic should be used exclusively for a single purpose. Some statistics, commonly used to assess the reliability of a clinical test, are also well-suited for assessment of criterion-referenced validity, depending on the nature of the question being asked. Recognizing this versatility, the purpose of this chapter is to expand on the concepts presented in Chapters 9 and 10 by describing statistical procedures commonly used to assess the reliability or validity of a clinical test or measure, including the intraclass correlation coefficient, standard error of measurement, measures of agreement, internal consistency, and estimates of effect size and change. It will be helpful to review earlier material in preparation for understanding the statistical procedures described here.

■ Intraclass Correlation Coefficient

A relative reliability coefficient reflects true variance in a set of continuous scores as a proportion of the total variance (see Chapter 9). One of the most commonly used relative reliability indices is the intraclass correlation coefficient (ICC). The ICC is primarily used to assess the
test-retest or rater reliability of a quantitative (interval or ratio) measure. Possible ICC values range from 0.00 to 1.00, with higher values indicating greater reliability. Because the ICC is a unitless index, it is permissible to compare the ICCs of alternative testing methods to determine which is the more reliable test. This is possible even when scores are assigned using different units of measurement.

The ICC has the advantage of being able to assess reliability across two, three, or more sets of scores, giving it broad applicability for assessment of test-retest reliability across multiple test administrations, interrater reliability across multiple raters, or intrarater reliability over repeated trials. This applicability is one of several advantages that distinguish the ICC from other correlation coefficients, such as Pearson's r, which can only register correlation between two sets of scores and does not assess agreement.

➤ CASE IN POINT #1
The Timed 10-Meter Walk Test (10mWT) is often used as a measure of functional mobility and gait speed in patients with neurological disorders and mobility limitations.1 The patient walks without assistance for 10 meters. To account for initial acceleration and terminal deceleration, time is measured with a stopwatch during the intermediate 6 meters. The test can be performed at a comfortable walking pace or at the patient's fastest speed. To illustrate application of the ICC, we will use hypothetical 10mWT data from eight patients and four raters who acquired concurrent measurements of each patient's performance (see Table 32-1).

The ICC is calculated using variance estimates obtained from a repeated measures analysis of variance (ANOVA) that compares scores across raters. From these variance estimates, there are different equations that can be used to calculate an ICC value, depending on the type of measurements that are obtained (ICC form) and the intended purpose and design of the reliability study (ICC model).2

Table 32-1 Hypothetical Data for the 10mWT (seconds)

ID        RATER 1      RATER 2      RATER 3      RATER 4
          (Trial 1)    (Trial 2)    (Trial 3)    (Trial 4)
1            5.0          6.0          9.0          4.5
2            9.0         13.0         10.0          8.5
3            9.5         12.0         18.5         12.5
4           18.0         15.0         12.5         20.0
5           12.0         15.0         18.5         10.0
6           20.0         25.0         22.5         18.0
7           32.0         30.5         28.0         40.0
8           42.0         38.0         50.0         42.5
X̄ (sd)    18.4 (12.7)  19.3 (10.8)  21.1 (13.3)  19.5 (14.3)

Forms of the ICC
The ICC can be expressed in one of two forms, depending on whether the assigned scores are derived from single ratings or from mean ratings across several trials. Commonly, reliability studies are designed so that a single rating is acquired from each subject—this is form 1. In some studies, however, each subject's score is calculated as a mean value across several trials—called form k. If, for example, it is known ahead of time that measurements tend to be unstable, it may be necessary to record a mean of several trials in order to ensure more satisfactory reliability. Recording mean scores has the effect of improving reliability, since random error is reduced when overestimates and underestimates are "averaged out." Note, however, that a reliability coefficient calculated from mean scores should only be generalized to other settings in which a mean score is also recorded. It should never be used to estimate reliability in acquiring a single measured score.

Models of the ICC
The model of the ICC depends on whether the results of the reliability analysis are to be generalized beyond the particular testing situation of the study. Assume, for example, that we wish to establish the inter-rater reliability for four clinicians scoring the 10mWT. We must first decide whether these raters can be regarded as having been "randomly selected" from among a much larger population of clinicians with a similar background and training. If so, then our ultimate purpose in calculating the reliability of these raters would be to generalize the findings to other clinicians who comprise the larger population of raters. Since it would be exceedingly rare for investigators to have access to all members of the target population from which to make a truly random selection, our use of the term "random" in this context is theoretical. The essential point is that we intend to generalize the outcome to other similar raters. Therefore, these raters would represent a random effect.

Alternatively, the four raters in our study could represent a fixed effect, indicating that these are the only raters of interest. Perhaps we are planning a future investigation in which we will train these raters to administer the 10mWT to measure the effects of an experimental intervention. In that instance, our sole objective in a preliminary reliability study may be to establish that these raters can score consistently.
There are three models of the ICC, as defined by Shrout and Fleiss.2 These models are distinguished according to how the subjects are assigned to different raters and whether the raters and subjects are random or fixed effects.4

Model 1
In model 1, raters are considered randomly chosen from a larger population, but some subjects are assessed by different sets of raters. This model is rarely applied. While it may occasionally be appropriate, for example, in large multisite studies where different raters work with different sets of subjects, this model makes it difficult to derive meaningful estimates of rater variance.5 Because model 1 is seldom used, it will not be discussed further. For specific information on model 1, refer to references at the end of this chapter.2,6

Model 2
Model 2 is the most commonly used model of the ICC for assessing inter-rater reliability. In this model, each subject is assessed by the same set of raters, who are considered "randomly" chosen from the larger population of potential raters or clinicians. Study participants are similarly regarded as representative of a larger population of potential test subjects or patients. Using model 2, therefore, both subject and rater are considered random effects. Model 2 also works well for test-retest reliability assessments because each test administration is considered representative of the larger population of all possible test administrations.

➤ We can design a study to determine intra-rater reliability of a single clinician in assessing the 10mWT across four trials. Using data from Table 32-1, assume that each column represents a different trial, with each trial tested by the same rater who records a single score from each subject. We can look at the consistency of this one rater's scores across the trials. The results of our analysis would only apply to the one rater being studied. In this setting, model 3, form 1 is appropriate.

Classifications of the ICC
The ICC is classified using two numbers in parentheses. The first number designates the model (1, 2, or 3). The second number signifies the form, using either a single measurement (1) or the mean of several measurements (k). Accordingly, we can designate six types of ICCs:

• Model 1: ICC(1,1) or ICC(1,k)
• Model 2: ICC(2,1) or ICC(2,k)
• Model 3: ICC(3,1) or ICC(3,k)

The k represents the number of scores used to obtain a mean and can be replaced with the actual number of scores when notating the ICC. For the above examples:

• In the inter-rater reliability study, ICC(2,3) would be used to indicate a random model with the mean of three trials recorded as scores.
• In the intra-rater reliability study, the rater takes a single measurement at each trial. Therefore, ICC(3,1) would be used to indicate a mixed model with single scores.
6113_Ch32_486-508 02/12/19 4:36 pm Page 489
Table 32-2 ICC for Models 2 and 3 Based on Repeated Measures ANOVA (N = 8)

1 "Between People" is the between-subjects effect. "Between Items" is the effect of rater. "Residual" is the error term.
2 The F test for the between-subjects effect is not printed as part of the ANOVA. It can be computed using MS between people divided by MS residual. This value is reported with the ICC (see note 6).
3 There is no significant difference between raters (p = .554). This is a good thing when you are looking for reliability! This is a test of significance across the four raters.
4 The ICC for model 2 is based on a two-way random model, and for model 3 it is based on a two-way mixed model.
5 Each model is run for both single measures (form 1) and average measures (form k). The researcher must determine which value to use, based on the study design. Single measures = ICC(2,1) or ICC(3,1). Average measures = ICC(2,k) or ICC(3,k).
6 This is the F test for the between-subjects effect, which was not reported with the ANOVA (F = 43.99). This effect is significant (p < .001), which tells us that the subjects are different from each other.
7 The 95% confidence intervals are shown for both forms of the ICC in each model.
8 In SPSS, model 2 is run using type A ICC (absolute agreement). Model 3 is run using type C ICC (consistency). This parameter is specified by the researcher when the test is run.
Focus on Evidence 32–1
Reliability, Evidence, and Confidence

Several inventories have been developed to assess disability associated with chronic low back pain. The Roland Morris Disability Questionnaire (RMDQ)7 is a commonly used tool that has been subjected to reliability testing over several decades, including many versions in languages other than English. Several systematic reviews have shown that the questionnaire has strong reliability, with varying ICC values generally above .85.8

In 2002, however, Davidson and Keating9 compared the RMDQ with several other disability measures to determine differences in their reliability, finding that the RMDQ had a substantially lower ICC of .53 (95% CI = .29 to .71), and .42 (95% CI = –.07 to .75) in a subgroup of patients. Based on these values, the authors suggested that the RMDQ did not have sufficient reliability for clinical application.

As we consider evidence to help us with clinical decisions about choosing measurement tools, Riddle and Stratford10 suggest caution with this type of conclusion for two reasons. One stems from the need to consider the volume of evidence reported in the literature. Many articles published before and after the 2002 study showed very different results with much higher reliability values. We know that research is always subject to error, and it is possible that the Davidson study had a unique cohort of patients or raters, or other design elements that contributed to their findings.

The second concern is statistical. When we express indices like the ICC as a single value, it is considered a point estimate, a single statistic obtained from the sample that is used to estimate an uncertain population parameter. Such estimates are subject to sampling error and, therefore, may not be closely consistent with population scores. That is why we need to look at confidence intervals. In the 2002 study, the confidence intervals were quite wide, with upper bounds that did, in fact, approach acceptable levels of reliability. In one of the confidence intervals, the lower bound contained a negative value, which suggests that the between-subjects variance would not be significant.11 Obviously, the point estimate from this study included a great deal of uncertainty, and results should be viewed in this light.

Results can vary from study to study for many different reasons. As evidence-based clinicians, we must face the challenge of considering design issues, statistical approaches, and the overall state of knowledge in making clinical decisions. We want to have confidence in those decisions—not using tests that are unreliable and, just as importantly, not dismissing a test that could provide strong clinical data. Referring to systematic reviews and meta-analyses can be a useful way to look for consistency across studies (see Chapter 37).
Table 32-3 Test-Retest Reliability Results for Grip Strength in Patients with Carpal Tunnel Syndrome (N = 50)

A. Item Statistics
                  Mean       Std. Deviation    N
trial1            66.0800    15.03715          50
trial2            66.5400    13.19432          50
Difference          .4600     6.19812          50

ANOVA
                                   Sum of Squares    df    Mean Square    F       Sig.
Between People                     18668.890         49    380.998
Within People    Between Items         5.290          1      5.290        .275    .602
                 Residual            941.210         49     19.208
                 Total               946.500         50     18.930
Total                              19615.390         99    198.135

B. Calculation of SEM
1 Using the standard deviation of difference scores: SEM = sd_diff/√2 = 6.198/√2 = 4.383
2 Using the MS error term: SEM = √MSE = √19.208 = 4.383

Differences in calculations are due to rounding. Portions of the output have been omitted for clarity.
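The equivalence of the two SEM routes in panel B can be verified in a few lines. This is an illustrative Python sketch using the values printed in the output:

```python
from math import sqrt

# Two equivalent routes to the SEM for the grip-strength data in
# Table 32-3 (values taken from the output shown there).
sd_diff = 6.19812   # SD of the trial1 - trial2 difference scores
ms_error = 19.208   # residual mean square from the ANOVA

sem_from_sd = sd_diff / sqrt(2)   # route 1: SD of differences / sqrt(2)
sem_from_mse = sqrt(ms_error)     # route 2: square root of MS error

print(round(sem_from_sd, 3), round(sem_from_mse, 3))  # both ≈ 4.383
```

The small discrepancy between the two routes in practice is only rounding, as the table's footnote notes.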
Table 32-4 Percent Agreement and Kappa: Ratings of Functional Assessment for Two Raters (N = 100)

A. Data
Rater2 * Rater1 Crosstabulation
                                    Rater1
                            IND     ASST    DEP     Total
Rater2   IND    Count         25      5       7       37
                Expected    15.5    11.1    10.4     37.0
         ASST   Count          6     24       4       34
                Expected    14.3    10.2     9.5     34.0
         DEP    Count         11      1      17       29
                Expected    12.2     8.7     8.1     29.0
         Total  Count         42     30      28      100
                Expected    42.0    30.0    28.0    100.0

Observed agreements = 25 + 24 + 17 = 66
    P0 = 66/100 = .66
Expected agreements = 15.5 + 10.2 + 8.1 = 33.8
    Pc = 33.8/100 = .34

1 The crosstabulation table shows observed and expected frequencies in each cell. Along the diagonal, agreements are highlighted in gold and expected frequencies in green.
2 Kappa = .486.
3 This is the standard error of kappa (SE) that can be used to calculate confidence intervals.
4 The probability associated with kappa is p < .001.

Calculations vary due to rounding differences. Portions of the output have been omitted for clarity.
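The agreement statistics in Table 32-4 follow directly from the cell counts. As an illustrative Python sketch (variable names are ours), kappa can be recomputed as:

```python
# Percent agreement and Cohen's kappa for the crosstabulation in
# Table 32-4 (rows = Rater 2, columns = Rater 1; categories IND, ASST, DEP).
table = [
    [25, 5, 7],    # Rater 2 = IND
    [6, 24, 4],    # Rater 2 = ASST
    [11, 1, 17],   # Rater 2 = DEP
]
n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(table[i][j] for i in range(3)) for j in range(3)]

p_o = sum(table[i][i] for i in range(3)) / n                  # observed agreement
p_c = sum(row_tot[i] * col_tot[i] for i in range(3)) / n**2   # chance agreement

kappa = (p_o - p_c) / (1 - p_c)
print(f"P0 = {p_o:.2f}, Pc = {p_c:.3f}, kappa = {kappa:.3f}")  # kappa ≈ .486
```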
Kappa indicates the proportion of the available non-chance agreements that is attained by the raters. In this example, we find that with the effects of chance eliminated, kappa is .49, indicating that 49% of the available non-chance agreement was attained. This "chance-corrected" percentage is a more meaningful reliability estimate for categorical assignments.

Like other statistics, kappa is a point estimate. Confidence intervals can be calculated, as shown in Table 32-4C.

Strength of Association
Landis and Koch21 have suggested the following criteria for interpreting kappa:

• Excellent agreement: above .80
• Substantial agreement: .60 to .80
• Moderate agreement: .40 to .60
• Poor to fair agreement: below .40

For the functional assessment example, then, we have achieved only a moderate degree of reliability. However, the interpretation of this outcome, like any other reliability coefficient, must depend on how the data will be used and the degree of precision required for making rational clinical decisions. It is also helpful to consider confidence intervals to appreciate the application of these values. For example, the confidence interval for the functional assessment data shows a lower boundary in the poor to fair range.

➤ Results
The percent agreement for the two raters was 66% for identification of level of independence, assistance, or dependence in 100 participants. The kappa statistic resulted in a chance-corrected agreement of .49 (95% CI: .345 to .627). Using the criteria from Landis and Koch, this is considered a fair to moderate level of agreement.

Weighted Kappa
For some applications, kappa is limited in that it does not differentiate among disagreements of greater or lesser severity. Because it is calculated using only the frequencies along the agreement diagonal, kappa assumes that all disagreements (off the diagonal) are of equal seriousness. There may be instances, however, when a researcher wants to assign greater weight to some disagreements in order to account for differential risks.

For example, rating patients as IND when they are really DEP would mean they would be left without needed assistance, creating a potentially unsafe environment. By comparison, assigning ASST to a patient who is actually IND, while it could result in providing unneeded help, would not be as serious in terms of the patient's safety.

When disagreements can be differentiated in this way, a modified version of the kappa statistic, called weighted kappa, κw, can be used to estimate reliability.22 Weighted kappa allows the researcher to assign differential weights to the off-diagonal cells, thereby applying a greater penalty for some disagreements than for others. Different weighting schemes can be applied when calculating κw to accommodate desired levels of adjustment for particular instances of disagreement. Weighted kappa can only be used when the grades represent ordinal ranks, however. It cannot be used with nominal categories, whereas unweighted kappa can be used with nominal or ordinal categories.

See the Chapter 32 Supplement for further discussion of weighted kappa, including examples of calculations with different weighting schemes.

Limitations of Kappa

Average Agreement
It is important to recognize that kappa represents an average rate of agreement across a set of scores. It will not indicate if most of the disagreement is accounted for by one specific category or rater. It is useful to examine the data when discussing results, to see where the major discrepancies may lie.

Variance
As with other reliability measures, variance among study subjects is required for kappa to be meaningful. In a group of subjects with homogeneous characteristics, expected agreement by chance alone will necessarily be high, leaving little room for substantial non-chance agreement, thereby deflating kappa. This is also a relevant consideration when assessing the reliability of a diagnostic test to detect the presence or absence of a health condition. If the prevalence of the target condition is either very high or very low in the study sample, the calculated kappa may be depressed.

Number of Categories
Kappa is also influenced by the number of categories used. As the number of categories increases, the extent of agreement will generally decrease. This is logical, as with more possibilities of assignment, there is room for greater discrepancy between raters. Therefore, if values of kappa are to be compared across alternative testing methods, these tests should assign scores using the same number of possible categories. It may also be necessary to collapse categories to better reflect agreement within a set of data.
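Although worked weighted-kappa examples are left to the Chapter 32 Supplement, one common scheme, linear weights, can be sketched for the ordinal ratings of Table 32-4. This Python illustration shows the general idea only; it is not the supplement's example, and the weighting scheme is our assumption:

```python
# Linearly weighted kappa for the ordinal ratings in Table 32-4.
# Disagreement weights grow with the distance between categories:
# w = |i - j| / (k - 1), so an adjacent-category disagreement is
# penalized half as much as an IND-versus-DEP disagreement.
table = [
    [25, 5, 7],    # Rater 2 = IND
    [6, 24, 4],    # Rater 2 = ASST
    [11, 1, 17],   # Rater 2 = DEP
]
k = len(table)
n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]

obs = exp = 0.0
for i in range(k):
    for j in range(k):
        w = abs(i - j) / (k - 1)                   # disagreement weight
        obs += w * table[i][j]                     # observed weighted disagreement
        exp += w * row_tot[i] * col_tot[j] / n     # chance-expected weighted disagreement

kappa_w = 1 - obs / exp
print(round(kappa_w, 3))  # → 0.414
```

With these weights, the off-diagonal IND/DEP cells (7 and 11 subjects) dominate the penalty, so κw lands close to the unweighted kappa of .486 here; a table with mostly adjacent-category disagreements would show a larger difference.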
■ Internal Consistency

Some measuring instruments, including many questionnaires and physical performance batteries, are composed of multiple items that in total provide a comprehensive reflection of the broader construct being measured. For instance, the quantitative portion of the Graduate Record Examination (QGRE) includes numerous items to test a student's overall mathematical ability. Similarly, scales measuring overall functional capacity typically consist of multiple items corresponding to different functional tasks. In each of these examples, we want to draw conclusions about an individual's performance based on the total score across all contributing items (see Chapter 12).

One assumption that is inherent in the use of such scales is the internal consistency of the items. A good scale is one that assesses different aspects of the same overarching construct. For example, each item comprising a multi-item scale of physical function should reflect an aspect of physical performance and not an aspect of emotional function or some other distinct construct. Statistically, if the multiple items comprising a scale are truly measuring the same overarching attribute, then individual item scores should be moderately correlated with one another and with the total summative score on the scale.

The concept of internal consistency should not be confused with content or construct validity (see Chapter 10). Internal consistency is a measure of reliability, indicating if item scores are correlated and thereby measuring the same thing. However, even if the items in a scale are measuring the same thing, the scale may not be measuring what it is intended to measure, which is a validity issue.

Cronbach's Alpha (α)
The most commonly applied statistical index of internal consistency is Cronbach's alpha (α).25 Alpha can be used when item scores are dichotomous or when there are more than two response choices (such as ordinal multiple choice scores).

Internal consistency is a function of the correlation among these six item scores as well as the correlation of each individual item score with the total score. Hypothetical output for this analysis shows that α = .894 (see Table 32-5 1). Like other correlation coefficients, α is a proportion, with possible values ranging from 0.00 to 1.00. A value that approaches .90 is considered strong, so the scale can be considered to have high internal consistency.

In a bit of a twist, Cronbach's α should show moderate correlations among items, between .70 and .90.16,26,27 If items have too low a correlation, they are possibly measuring different traits, but if the items have too high a correlation, they are probably redundant, and some items could be removed without jeopardizing the reliability of the scale.

Item-to-Total Correlations
Using results of the analysis for Cronbach's α, we also examine individual items to determine how well they fit the overall scale. In Table 32-5 2, the means and standard deviations for each item are displayed. We can see that walking had the highest mean functional score and car transfer the lowest. The scale totals averaged 16.00, indicating that this sample was only moderately functional (see Table 32-5 3). We can also look at correlations among items. All item pairs have correlations above .60 except those involving car transfer, which has consistently low correlations with all other items (see Table 32-5 4). Perhaps this one variable should not be part of the scale, possibly representing a unique component of function that is distinct from the other items.

To investigate this possibility, we can compute α repeatedly, each time eliminating one item from the analysis. In Table 32-5C, we see what happens to the total score when each item is deleted. In the first column, the mean total score is higher when car transfer is deleted, whereas these values remain fairly stable for all other items. The third column in this panel shows the correlation of each item with the sum of the remaining items, or the item-to-total correlation. Only car transfer has a low correlation of .19, suggesting that this item is not related to the total test score derived from all other items.
Table 32-5 Output for Internal Consistency of a Functional Scale with 6 Items Using Cronbach's Alpha (N = 14)

A. Item Statistics

Reliability Statistics 1
Cronbach's    Cronbach's Alpha Based       N of
Alpha         on Standardized Items        Items
.894          .881                         6

Item Statistics 2
            Mean    Std. Deviation    N
Car         1.86     .770             14
Carry       2.79    1.251             14
Dressing    2.57    1.089             14
Reach       3.00    1.240             14
Stairs      2.36    1.336             14
Walk        3.43    1.089             14

Scale Statistics 3
Mean     Variance    Std. Deviation    N of Items
16.00    30.769      5.547             6

B. Correlations
Inter-Item Correlation Matrix 4 (matrix values omitted)

C. Item-Total Statistics
            Scale Mean if    Scale Variance     Corrected Item-      Squared Multiple    Cronbach's Alpha
            Item Deleted     if Item Deleted    Total Correlation    Correlation         if Item Deleted
Car         14.14            28.593             .192                 .566                .932 5
Carry       13.21            19.874             .836                 .904                .854
Dressing    13.43            21.033             .856                 .777                .854
Reach       13.00            20.615             .765                 .916                .867
Stairs      13.64            19.632             .790                 .817                .863
Walk        12.57            21.187             .837                 .893                .856
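The α reported in panel A can be reproduced from the item and scale variances alone, since α = [k/(k − 1)](1 − Σ s²item / s²total). A short illustrative Python sketch using the printed values:

```python
# Cronbach's alpha from the item and scale statistics reported in
# Table 32-5 (6-item functional scale, N = 14).
item_sd = {"Car": .770, "Carry": 1.251, "Dressing": 1.089,
           "Reach": 1.240, "Stairs": 1.336, "Walk": 1.089}
scale_variance = 30.769   # variance of the total scale score

k = len(item_sd)
sum_item_var = sum(sd ** 2 for sd in item_sd.values())

alpha = (k / (k - 1)) * (1 - sum_item_var / scale_variance)
print(round(alpha, 3))  # → 0.894, matching the output
```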
In contrast, each of the other 5 items has a correlation of approximately .80 or higher with the total score. These results can be interpreted in two ways. The initial α = .894 is considered strong, even though it increases when car transfer is removed. A Cronbach's α as high as .932 (without car transfer) suggests that items in the scale may be redundant (see Table 32-5 5).

➤ Results
The function scale was composed of six items, scored 1 to 5, with 5 representing full independence. The average total score was 16.00, indicating that this sample was composed of individuals with moderate function. Cronbach's alpha was .89. Correlations among the six items were generally strong (>.603) except for car transfer, which had correlations <.36 with all other items. Item-total statistics showed that when car transfer was removed, alpha increased to .932. The item-total correlations ranged from .82 to .92 for all items except car transfer, which was .19. These scores suggest that car transfer does not fit with other items on this scale. Without the car transfer item, the rest of the scale shows strong internal consistency.

■ Limits of Agreement

Reliability is an essential property when measurements are taken with alternate forms of an instrument. For example, we might want to compare different types of dynamometers for measuring strength,28 different types of spirometers for assessing pulmonary function,29 or different types of monitoring devices for measuring blood glucose.30 Although different in design and mechanics, we would expect these different methods to record similar values.

The analysis of reliability in this situation focuses on the agreement between alternative methods based on their difference scores. We can consider two methods in agreement when the difference between measurements is small enough for the methods to be considered interchangeable.31 This property is an important practical concern as we strive for effective and efficient clinical measurement across settings, where different types of instruments are used for similar purposes.32

Analyzing Difference Scores
Two analysis procedures have historically been applied to evaluate consistency of measurements using alternative methods. The correlation coefficient, r, has been used to demonstrate covariance among methods. However, we know this is a poor estimate of reliability, as it does not necessarily reflect the extent of agreement in the data and does not provide an absolute index using the same unit of measurement as the tool itself. The second procedure is the paired t-test (or repeated measures ANOVA), which is used to show that mean scores for two (or more) methods are not significantly different from one another. This approach is also problematic, however, as two group distributions may show no statistical difference but still be composed of pairs of observations that are in poor agreement. A pragmatic alternative for examining agreement across methods is an index called limits of agreement (LoA).32

➤ CASE IN POINT #5
Accurate measurement of blood pressure (BP) is important for detection and monitoring of hypertension. Automated and manual methods are often used interchangeably. Several studies have compared the two methods to determine their reliability, with varied results.33 We will consider hypothetical data for systolic blood pressure (SBP) measures on 50 patients using both devices.

The differences between these alternate methods were calculated by subtracting the values obtained using the automated device from the values obtained using the manual sphygmomanometer (this direction is arbitrary but must be consistent) (see Table 32-6). Therefore, a positive difference score reflects a higher reading for the manual system. The mean of the difference scores was .90 mmHg. On average, then, the manual measurement was higher, but the average difference between the methods was quite small.

Table 32-6 Systolic Blood Pressure (mmHg) Taken With Manual and Automated Methods (N = 50)

              MEAN      SD       MIN.     MAX.
Manual        135.16    19.47     90.0    172.0
Automated     134.26    18.12     85.0    175.0
Difference       .90    11.40    –17.0     24.0

Patterns of Difference Scores
A visual analysis can help to clarify this relationship. The scatterplot in Figure 32-1 demonstrates a relatively strong correlation between values obtained using the two methods (r = .82), although scores are not in complete agreement.

Bland-Altman Plots
A further understanding of this relationship can be achieved by examining the spread of difference scores. When differences between methods vary widely from a shared mean, poorer agreement is evident. Assuming a normal distribution and random variation, we would expect 95% of the difference scores to fall within ±1.96 SD
from the mean difference score (d̄). This would set the 95% limits of agreement. Scores beyond these limits would indicate outliers, suggesting poor reliability. The mean difference score was .90 ± 11.40 (see Table 32-6).

95% LoA = d̄ ± 1.96(SD)
        = .90 ± 1.96(11.40)
95% LoA = –21.44 to 23.24

Figure 32–1 Correlation of manual and automatic BP measures. The red line is the regression line. The black dashed line is the line of identity, which emerges from the origin. Points along the line of identity would indicate perfect agreement.

Figure 32-2 shows how difference scores are interpreted using a Bland-Altman plot, named for those who developed this strategy.32,34 The Y-axis shows the difference scores (manual – automated) for each subject. Values at zero indicate no difference. Values on the X-axis represent the mean of the two measures for each subject. For instance, for the leftmost point in Figure 32-2, the difference between the two measures for this individual was 5.0 mmHg, and the average of the manual and automated scores for that person was 87.5.

A line is drawn at the mean difference score for the total sample on the Y-axis. The closer the mean difference score is to zero, the more closely the two measurement methods agree. The 95% limits of agreement are shown at 1.96 SD above and below this mean. This is a range of almost 45 mmHg. Our question, then, is whether we would be comfortable using either method to measure SBP if we knew that the measured values might disagree by this much.

In a perfect scenario, there would be no difference between the two methods, and we should see all the difference scores equal zero. As this is unlikely, we would hope to see an unbiased pattern, with difference scores clustering above and below the zero line in a random distribution within the limits of agreement. The spread of scores above and below zero difference helps us decide if the disagreement is acceptable when we substitute one measurement method for the other. For instance, if most scores fall above the zero line, one method would be consistently measuring a higher score than the other, indicating that disagreement is systematic rather than random.

Because the difference scores are plotted for each subject against the mean of that subject's two measured scores, we can also determine if different patterns of disagreement exist for subjects with high or low mean scores. For example, we might see lower difference scores for those with low mean scores, and higher difference scores for those with high mean scores. Such a pattern would suggest that disagreement between the two methods would depend on whether a patient had a high or low SBP level. For example, in Figure 32-2, we can see larger difference scores (approaching the upper limit) for some individuals who had higher mean SBP values. Individuals with lower mean scores did not show as wide a spread.

➤ Results
The mean difference score between manual and automated measures was .90 mm Hg (SD = 11.40), with a range from –17.0 to 24.0 mmHg. The scores on both methods were highly correlated (r = .82). The Bland-Altman plot shows most difference scores were within the limits of agreement, with a range both above and below the mean difference. However, several larger difference scores were found in some patients with high mean SBP values, where manual measurements exceeded automated measurements. Generally, the two instruments show good reliability, but it may be advisable to use the same method for monitoring individuals with higher blood pressure.
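The limits of agreement for this example follow directly from the summary statistics in Table 32-6. A brief illustrative Python sketch:

```python
# 95% limits of agreement for the SBP difference scores in Table 32-6
# (manual minus automated), using the summary statistics shown there.
mean_diff = 0.90   # mean difference score (mmHg)
sd_diff = 11.40    # SD of the difference scores (mmHg)

lower = mean_diff - 1.96 * sd_diff
upper = mean_diff + 1.96 * sd_diff
print(f"95% LoA: {lower:.2f} to {upper:.2f} mmHg")
```

The resulting interval of roughly –21.4 to 23.2 mmHg spans almost 45 mmHg, which is the basis for the question posed above about whether the two devices can be used interchangeably.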
No change
RELIABILITY VALIDITY
ments exceeded automated measurements. Generally, the two
instruments show good reliability, but it may be advisable to Measurement Small real Meaningful
Error change change
use the same method for monitoring individuals with higher
blood pressure. MDC MCID
Minimal detectable Minimal amount of
change beyond change that is
threshold of error meaningful
In addition to assessing the reliability of alternative
instruments, the Bland-Altman plot can also be used to
test criterion validity when a new instrument is measured Figure 32–3 Along a continuum of change, MDC = minimal
against an established measure that derives scores in the detectable change, and MCID = minimal clinically important
difference. Determining the threshold of error is a matter of
same units. The same process can be used to establish
reliability, while characterizing meaningful change is a
test-retest or rater reliability, looking at the differences matter of validity.
between the two test scores. With highly reliable tests,
there should be only small variability around the center
line and narrow limits of agreement.
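The limits-of-agreement arithmetic is simple to carry out directly. The following sketch uses invented paired SBP readings (not the study's data) to compute the bias and the 95% limits:

```python
import statistics

def limits_of_agreement(method_a, method_b):
    """Bland-Altman statistics: mean difference (bias) and 95% limits of agreement."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    mean_diff = statistics.mean(diffs)   # systematic bias between the two methods
    sd_diff = statistics.stdev(diffs)    # SD of the difference scores
    # The 95% limits of agreement lie 1.96 SD above and below the mean difference
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff

# Hypothetical paired SBP readings (manual vs. automated), in mm Hg
manual    = [120, 135, 118, 142, 128, 150, 110, 133]
automated = [118, 139, 115, 147, 125, 154, 112, 130]
bias, lower, upper = limits_of_agreement(manual, automated)
```

Applied to the Results box above, an SD of 11.40 gives limits spanning 2 × 1.96 × 11.40 ≈ 44.7 mm Hg, which is the "almost 45 mm Hg" range described in the text.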
■ Measuring Change

A central objective of clinical care is the promotion of positive change or prevention of decline in our clients. Accurate documentation of this change is the cornerstone of successful decision-making, goal-setting, and reimbursement. Clinicians may use terms like "better," "improved," "worse," or "declined" to indicate when a patient's condition appears to have changed, yet these general descriptors are insufficient for making valid and reliable judgments based on measured values.

Responsiveness is the ability of a measuring instrument to register a change in score whenever an actual change in status has occurred.35 We can understand responsiveness indices as a ratio of signal (true change) to noise (random error variability).

Two concerns guide our interpretation of measured change scores. We first ask whether the recorded change score might be due entirely to measurement error, or whether we can have confidence that some real improvement or decline has occurred. Our second concern is whether the documented change is sufficient to have meaningful impact on a patient's care or an individual's behavior. When we think about measuring a difference in response from one time to another, we can conceptualize the amount of change along a continuum, as shown in Figure 32-3. Measured change scores are first evaluated against a threshold for measurement error, and second to a threshold that distinguishes impactful change. The first concern is a reliability issue, while the second is a validity issue.

Minimal Detectable Change

The minimal detectable change (MDC) is the minimal amount of measured change required before we can eliminate the possibility that measurement error is solely responsible. If measured change exceeds the MDC value for that instrument, we can confidently conclude that at least a portion of the measured change was due to real improvement (or decline) in performance.36

Calculating the MDC

The MDC is based on the SEM, obtained from a sample of stable persons who are not expected to change:

MDC = z × SEM × √2

The MDC establishes a threshold value that is linked to either the 90% (z = 1.65) or 95% (z = 1.96) confidence interval. MDC95 means that 95% of stable persons demonstrate random variation of less than this amount when tested on multiple occasions. When a measured change score exceeds the MDC95 value, we can have 95% confidence in ruling out random measurement error as solely responsible for the observed change.

MDC is dependent on the size of the reliability coefficient used to calculate the SEM. With a less reliable measure, both SEM and MDC values will increase (see Focus on Evidence 32-2). Instruments that do not demonstrate good test-retest stability will have larger MDCs, indicating that even greater change scores are required before measurement error can be ruled out as the sole source.

To test a sample of patients who are considered stable, researchers will often look at the scores for patients in a control group. Since these patients did not receive intervention, they are not expected to change.
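The MDC formula translates directly into code. In this sketch the SD and ICC are hypothetical values, and the SEM is derived with the conventional formula SEM = SD√(1 − ICC):

```python
import math

def mdc(sem, confidence=0.95):
    """Minimal detectable change: MDC = z * SEM * sqrt(2)."""
    z = {0.90: 1.65, 0.95: 1.96}[confidence]   # z-values used in the text
    return z * sem * math.sqrt(2)

# Hypothetical example: a scale with SD = 10 points and test-retest ICC = .82
sd, icc = 10.0, 0.82
sem = sd * math.sqrt(1 - icc)   # conventional SEM = SD * sqrt(1 - ICC)
mdc90 = mdc(sem, confidence=0.90)   # threshold at the 90% level
mdc95 = mdc(sem, confidence=0.95)   # larger threshold at the 95% level
```

Note how the less demanding 90% level yields a smaller threshold than the 95% level, and how a lower ICC (larger SEM) would inflate both.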
Focus on Evidence 32–2
Reliability and Change

Many tools are available to provide ratings of pain. Chuang and colleagues37 studied a combination of a numerical pain rating scale (NPRS) and a faces pain scale (FPS) to establish test-retest reliability in patients who had had a stroke (see Chapter 12). They believed that the combination of scales would provide a useful approach to account for potential language, visual, and perceptual problems associated with stroke.

They studied 50 patients who participated in an outpatient occupational therapy program, 29 with left-sided lesions, and 21 with right-sided lesions. The patients continued to receive routine occupational therapy focused on functional training during the study. The patients were asked to rate current pain intensity in their involved arm using both scales on two occasions with a one-week interval. Ratings were taken at the same time of day to minimize diurnal variation.

Relative test-retest reliability was reflected by an ICC of .82, which is considered good. Absolute reliability was assessed using the SEM, which was .81. They used this value to calculate an MDC90 of 1.87, which means that a measured change exceeding 1.87 points on the combined scale should (at a 90% confidence level) indicate that true change had occurred. Using limits of agreement, they were also able to show that there was strong test-retest agreement, with most scores falling within 95% limits. However, they did not account for the outliers and for the fact that differences between the measures were more variable for those with lower pain ratings.

The authors concluded that the combined NPRS-FPS is a reliable pain measure for people with stroke but acknowledged that further work is needed to determine its responsiveness to change with specific interventions, to examine its use with broader populations, and to establish reliability over different timeframes. Remember that reliability is not an absolute characteristic of a measurement, and how these scales function can differ across settings and samples. Every study adds more information to support further research, and we have to use the information wisely, knowing there is still uncertainty in these conclusions.

[Figure (from Chuang et al,37 Figure 2A): Bland-Altman plot of the difference in pain intensity (Time 2 minus Time 1, –3.0 to 3.0) against mean pain intensity (Time 1 and Time 2, 0.0 to 7.0) in participants with right- or left-side brain lesions.]
From Chuang L, Wu C, Lin K, Hsieh C. Relative and absolute reliability of a vertical numerical pain rating scale supplemented with a faces pain scale after stroke. Phys Ther. 2014;94:129-138, Figure 2A, p.134. Used with permission.
deviation of the change scores.47 Therefore, a distribution that has high variability in the degree of change will tend to have a small SRM.

Guyatt's responsiveness index (GRI) uses an anchor-based MCID for the measure being studied, or the smallest difference between baseline and posttest that would represent a meaningful benefit in a group of patients.48 We will discuss various methods to determine an anchor-based MCID shortly. When the MCID is not known, the difference between baseline and posttest can be used.35 The denominator for this index is obtained from an ANOVA of repeated observations in a group of subjects who are clinically stable, which is a measure of test-retest reliability. Therefore, the denominator reflects the intrinsic variability of the instrument.

Table 32-8 Effect Size Measures of Responsiveness for the WOMAC in Patients Following Total Hip Replacement

              AT 6 MONTHS             AT 2 YEARS
Subscale      ES     SRM    GRI       ES     SRM    GRI
Pain          2.10   1.86   1.10      2.24   1.98   2.18
Function      2.34   1.80   1.45      2.56   1.97   1.79
Stiffness     1.61   1.39   .81       1.81   1.53   1.12

ES = effect size index; SRM = standardized response mean; GRI = Guyatt's Responsiveness Index.
Data taken from Quintana et al,36 adapted from Table III, p. 1080. Used with permission.
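The ES and SRM differ only in their denominators, which the following sketch makes explicit. The pre/post scores are invented for illustration and are not the WOMAC data:

```python
import statistics

def effect_size_indices(baseline, followup):
    """ES divides mean change by the baseline SD; SRM divides it by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, followup)]
    mean_change = statistics.mean(changes)
    es = mean_change / statistics.stdev(baseline)   # effect size index
    srm = mean_change / statistics.stdev(changes)   # standardized response mean
    return es, srm

# Hypothetical pre/post scores for six patients
pre  = [40, 55, 38, 60, 47, 52]
post = [58, 70, 60, 75, 66, 73]
es, srm = effect_size_indices(pre, post)
```

Because the change scores here vary much less than the baseline scores, the SRM comes out larger than the ES; a sample with highly variable change would show the opposite pattern, as the text notes.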
using an ordinal scale. Scales generally range from "a great deal worse" to "a great deal better," with as few as 5 to as many as 15 ordinal rankings (see Table 32-9). A score of zero indicates that no change has been perceived.39,49 In studies that derive an MCID value, rankings of "somewhat better" or "somewhat worse" have usually been used to establish the threshold indicating when minimally important change was perceived.

➤ Results
In the WOMAC study, patients were asked to rate their perceived improvement following intervention based on a 5-point scale. Possible responses were "a great deal better," "somewhat better," "equal," "somewhat worse," or "a great deal worse." The MCID was based on the WOMAC scores for those who rated themselves at least "somewhat better," which corresponded to a change of at least 25 points for all three WOMAC subscales.36 Therefore, changes beyond 25 points were considered to reflect meaningful improvement.

When assessing meaningful change, anchor-based methods are generally preferred over distribution-based methods because they reflect a definition of what is considered important.50 However, using both distribution-based and anchor-based approaches should provide a strong foundation for understanding how best to interpret measured change in the context of imperfect test-retest reliability and differing perceptions of value for the patient.51,52

Interpreting the MCID

Several considerations are important for interpreting the MCID.

Going beyond MDC. The MCID sets a threshold for meaningful change, while the MDC only sets a threshold for substantive change, whether or not that substantive change is perceived as having real impact or value. Interestingly, in some studies, researchers find an MDC value that is larger than the MCID.53-56 This may seem illogical—in theory, it makes little sense that an MDC would ever be greater than an MCID. We would expect important differences to always be greater than measurement error. This is not an investigative error, however. It is merely a consequence of the fact that these two statistics quantify distinct constructs and are derived from different data.

The MDC is a reliability statistic and is calculated using test-retest data based on group values. In contrast, the MCID is a validity measure, based on individual patient perceptions about whether change was meaningful.57 When MDC and MCID values fail to align in the expected way, clinicians will typically use the larger of the two values as the threshold for most evidence-based applications, recognizing that these values provide guidelines and not absolute limits.
Table 32-9 15-Point Global Rating Scale to Assess Magnitude of Change

RATING   DESCRIPTION OF CHANGE
  7      A very great deal better    (Feeling Better →)
  …
 -4      Moderately worse
 -5      Quite a bit worse
 -6      A great deal worse
 -7      A very great deal worse

Baseline Scores. A second consideration is the potential effect of a patient's baseline score. Patients who have lower initial scores may have greater incentive to recognize change […] in the MCID obtained in different studies for the same instrument.57 Unfortunately for clinical decision making, this means that we cannot view the MCID as a universal fixed attribute of a test, but must consider the method of calculation and the specific context within which it was obtained, including the population and setting.
Lack of Confidence Intervals. Because the MCID is typically reported as a single value without confidence intervals, we don't have a good sense of an applicable range of possible scores and therefore run the risk of misclassifying patients whose scores do not meet the threshold, but for whom impactful change may still have occurred.

In addition to making judgments about a patient's degree of improvement, the MCID can be used to estimate how much of a change would be considered important in a research study. The MCID value for the primary outcome can be used to estimate needed sample size as part of a priori power analysis. Results from clinical trials can be compared to the MCID as a way of determining whether measured treatment effects are likely to have meaningful impact.62 The MCID can also be used to establish the margin in a non-inferiority study (see Chapter 14).

MDC and MCID Proportion

Group values must be interpreted with reference to their sample size and variability. We know that power is greatly influenced by the number of subjects in a sample. Therefore, with a large sample, small differences may turn out to be significant even when they are clinically meaningless. We also recognize that a mean is a measure of central tendency and that individuals in the sample do not all experience that amount of change—some will have achieved more and some less. Therefore, any conclusions about an individual patient's response based on a mean may be flawed. Without looking at variability within the sample, we may be missing important information.

When studying a dichotomous variable, decisions about improvement or decline are straightforward—the patient has either gotten better or not, cured or not cured. When dealing with a continuous measure, however, it is necessary to determine how much change is meaningful. Therefore, the proportion of individuals in a group that achieve minimal change can be considered another important benchmark to evaluate an intervention's effectiveness. The MDC or MCID proportion is the percentage of patients whose measured change scores exceed the minimal threshold, either for detectable change or for clinically meaningful change. These values can be especially useful for examining group data during program evaluation or quality assurance reviews.51

➤ In the study of responsiveness for the WOMAC, the MDC proportion was greater than 80% for pain and function subscales, and the MCID proportion was between 70% and 80% at 2 years following hip joint replacement. These values demonstrate that a large majority of patients in this population perceive themselves as "better" following intervention.
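The proportion calculation itself is a simple count. The change scores below are invented; the 25-point MCID echoes the WOMAC threshold reported earlier, while the 15-point MDC is a hypothetical value for illustration:

```python
def proportion_exceeding(change_scores, threshold):
    """Share of patients whose measured change score exceeds a threshold (MDC or MCID)."""
    n_exceed = sum(1 for c in change_scores if c > threshold)
    return n_exceed / len(change_scores)

# Hypothetical WOMAC-style change scores for ten patients
changes = [30, 12, 45, 27, 8, 33, 50, 26, 18, 41]
mdc_proportion = proportion_exceeding(changes, 15)    # hypothetical MDC of 15 points
mcid_proportion = proportion_exceeding(changes, 25)   # MCID of 25 points, as in the text
```

Reporting both proportions shows how many patients changed beyond measurement error and how many changed enough to matter, which are not the same group.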
COMMENTARY
If you can’t measure something, you can’t understand it. If you can’t understand
it, you can’t control it. If you can’t control it, you can’t improve it.
—H. James Harrington (1929- )
Leader in performance improvement methodologies
Measurement is essential to the care of patients, design of studies, management of systems, and decision making. Because there are so many approaches to measuring clinical or organizational phenomena, we must be clear in reporting and knowledgeable in the consumption of data. What we learn from looking through professional literature, however, is that preferred methods for analyzing reliability and validity seem to vary with different researchers and within different disciplines, with little consensus.

As evidence-based practitioners, we should be aware of the intended application of the data and the degree of precision needed to make safe and meaningful clinical decisions. When considering reliability or validity, we cannot allow ourselves to fall into the trap of using specific standards just because they have been used or recommended by others. Guidelines are just that—not gold standards. Researchers and clinicians are obligated to justify their interpretation of acceptable measurement characteristics.

Researchers should address each of these relevant issues in their reports, so that others can interpret their work properly. Many articles are published with no such discussion, leaving the reader to guess why a particular statistic was used or a standard applied. Because statistics can be applied in so many ways, it is important to maintain an exchange of ideas that promotes such accountability. By having to justify our choices, we are forced to consider what a statistic can really tell us about a variable and what conclusions are warranted.
Clinical judgments regarding validity of measurements must be based on some criterion that is relevant for a particular patient. Any given study will present data from a sample that has specific properties and that has been studied in a specific context over a given time period. Clinicians must appraise that information to determine if it is appropriately applied to their patients. Published values may provide an estimate that can be used to predict our own patients' responses, but it is essential that we remain cognizant of the limits of statistics as we apply them to our own decision making.

As our understanding of validity grows, we will continue to struggle with the definitions of clinical significance. The evidence-based practitioner will benefit from complete reporting of effect sizes and change values in clinical studies. Estimates are needed for different settings, conditions, age groups, and baselines. Clinicians, patients, and health policy analysts all want to appreciate how much better is truly "better."
37. Chuang LL, Wu CY, Lin KC, Hsieh CJ. Relative and absolute reliability of a vertical numerical pain rating scale supplemented with a faces pain scale after stroke. Phys Ther 2014;94(1):129-138.
38. Crosby RD, Kolotkin RL, Williams GR. Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol 2003;56(5):395-407.
39. Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal clinically important difference. Control Clin Trials 1989;10(4):407-415.
40. Beaton DE, Tarasuk V, Katz JN, Wright JG, Bombardier C. "Are you better?" A qualitative study of the meaning of recovery. Arthritis Rheum 2001;45(3):270-279.
41. Guyatt GH, Feeny DH, Patrick DL. Measuring health-related quality of life. Ann Intern Med 1993;118(8):622-629.
42. Osoba D, Rodrigues G, Myles J, Zee B, Pater J. Interpreting the significance of changes in health-related quality-of-life scores. J Clin Oncol 1998;16(1):139-144.
43. Stratford PW, Binkley JM, Riddle DL. Health status measures: strategies and analytic methods for assessing change scores. Phys Ther 1996;76:1109-1123.
44. Stratford PW, Binkley J, Solomon P, Gill C, Finch E. Assessing change over time in patients with low back pain. Phys Ther 1994;74(6):528-533.
45. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989;27(3):S178-S189.
46. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum; 1988.
47. Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Med Care 1990;28(7):632-642.
48. Guyatt G, Walter S, Norman G. Measuring change over time: Assessing the usefulness of evaluative instruments. J Chronic Dis 1987;40(2):171-178.
49. Guyatt GH, Osoba D, Wu AW, Wyrwich KW, Norman GR. Methods to explain the clinical significance of health status measures. Mayo Clin Proc 2002;77(4):371-383.
50. de Vet HC, Terwee CB, Ostelo RW, Beckerman H, Knol DL, Bouter LM. Minimal changes in health status questionnaires: Distinction between minimally detectable change and minimally important change. Health Qual Life Outcomes 2006;4:54.
51. Haley SM, Fragala-Pinkham MA. Interpreting change scores of tests and measures used in physical therapy. Phys Ther 2006;86(5):735-743.
52. Eton DT, Cella D, Yost KJ, Yount SE, Peterman AH, Neuberg DS, Sledge GW, Wood WC. A combination of distribution- and anchor-based approaches determined minimally important differences (MIDs) for four endpoints in a breast cancer scale. J Clin Epidemiol 2004;57(9):898-910.
53. Young BA, Walker MJ, Strunce JB, Boyles RE, Whitman JM, Childs JD. Responsiveness of the Neck Disability Index in patients with mechanical neck disorders. Spine J 2009;9(10):802-808.
54. Dawson J, Doll H, Boller I, Fitzpatrick R, Little C, Rees J, Carr A. Comparative responsiveness and minimal change for the Oxford Elbow Score following surgery. Qual Life Res 2008;17(10):1257-1267.
55. Kocks JW, Tuinenga MG, Uil SM, van den Berg JW, Stahl E, van der Molen T. Health status measurement in COPD: the minimal clinically important difference of the clinical COPD questionnaire. Respir Res 2006;7:62.
56. Carey H, Martin K, Combs-Miller S, Heathcock JC. Reliability and responsiveness of the Timed Up and Go Test in children with cerebral palsy. Pediatr Phys Ther 2016;28(4):401-408.
57. Wright A, Hannon J, Hegedus EJ, Kavchak AE. Clinimetrics corner: a closer look at the minimal clinically important difference (MCID). J Man Manip Ther 2012;20(3):160-166.
58. Beaton DE, Boers M, Wells GA. Many faces of the minimal clinically important difference (MCID): a literature review and directions for future research. Curr Opin Rheumatol 2002;14(2):109-114.
59. Wang YC, Hart DL, Stratford PW, Mioduski JE. Baseline dependency of minimal clinically important improvement. Phys Ther 2011;91(5):675-688.
60. Terwee CB, Roorda LD, Dekker J, Bierma-Zeinstra SM, Peat G, Jordan KP, Croft P, de Vet HC. Mind the MIC: large variation among populations and methods. J Clin Epidemiol 2010;63(5):524-534.
61. Norman GR, Stratford P, Regehr G. Methodological problems in the retrospective computation of responsiveness to change: the lesson of Cronbach. J Clin Epidemiol 1997;50(8):869-879.
62. Revicki D, Hays RD, Cella D, Sloan J. Recommended methods for determining responsiveness and minimally important differences for patient-reported outcomes. J Clin Epidemiol 2008;61(2):102-109.
6113_Ch33_509-528 03/12/19 9:51 am Page 509
CHAPTER 33
Diagnostic Accuracy

DESCRIPTIVE EXPLORATORY EXPLANATORY

One of the essential purposes of measurement is to distinguish people based on a criterion, distinctions that can influence how we proceed with a patient's care. Clinicians face uncertainty in many aspects of patient management and often apply their expertise, intuition, and judgment to make decisions about appropriate treatment or prevention. Given our focus on evidence-based practice, however, we are better prepared for decision-making when we can apply explicit criteria that help to reduce uncertainty.

An essential component of decision-making is the ability to diagnose a condition, to determine its presence or absence, to apply screening procedures to identify those at risk for certain disorders, and to classify patients who are likely to benefit from specific intervention strategies or further diagnostic tests. Because these procedures involve allocation of resources, influence patient management decisions, and present potential risks to patients, it is important to verify their validity and to understand how results are appropriately applied to clinical practice. The purpose of this chapter is to present statistical procedures related to accuracy of diagnostic and screening tools, making judgments regarding diagnostic probability, choosing cutoff scores, and the application of clinical prediction rules.

■ Validity of Diagnostic Tests

The term "diagnosis" can be used in different ways, depending on its context. It is literally the act of making a judgment about the exact character of a condition, situation or problem.1 In healthcare, the traditional
usage focuses on identification of a disease or systemic disorder through lab tests or other procedures. It can also be used to mean identification of impairments, functional limitation, or disability to make judgments about an individual's ability to participate in life roles. A diagnosis can be based on a score from a health scale, used to indicate the presence of a latent trait that impacts an individual's life, such as depression, dementia, risk for falls, or quality of life. In each of these instances, the diagnosis will lead to a clinical decision regarding therapy, medication, surgery or other strategy to improve the condition or prevent an adverse outcome. Therefore, sound decisions depend on the validity of the test, to assure that these actions are warranted.

The purpose of a study of diagnostic accuracy is to determine if a "new" test, the index test, is accurate, based on a gold standard which is "known" to be an accurate indication of the patient's true status, either the presence or absence of the condition. These studies usually use a cohort design, enrolling people who are at risk for the disease of interest. All participants are administered both the index test and the gold standard test, and the diagnostic outcomes are compared.

The results of a diagnostic procedure may be dichotomous, categorical, or continuous, although most tests indicate the presence or absence of a certain condition. Ordinal and continuous scales are often converted to dichotomous outcomes using cutoff scores to indicate a "normal" or "abnormal" state.

➤ CASE IN POINT #1
Vertebral fractures (VFx) are associated with various comorbidities and mortality, particularly in older individuals. However, many individuals with such fractures do not present overt symptoms, and more than two-thirds go undetected.2 Height loss has been used as a screening measure, although studies are not clear as to how much height loss is indicative of VFx.
Yoh et al3 studied the relationship between height loss and presence of VFx in 151 postmenopausal women to determine the best cutoff score for screening. Radiographic data were used to define the diagnosis of vertebral fracture. A cutoff of height loss ≥ 4 cm was considered the criterion for identifying potential VFx.

Assessing Validity

The ideal diagnostic test, of course, would always be accurate in discriminating between those with and without the disease or condition—it would always have a positive result for someone with the condition, whether a mild or severe case, and a negative result for everyone else. But we know that tests are not perfect. They may miss identifying some individuals with a particular disorder, or they may indicate illness in those without the disorder. Validity will depend on several criteria:

• Determination of accuracy is based on the validity of the gold standard. However, for many clinical variables there is no true "gold standard" and a reference standard must be operationally defined that is "assumed" to be an accurate outcome. This can be especially challenging with measurement of constructs such as balance, function, or pain. Clearly, the validity of the index test will depend on the validity of the standard used.
• Validity is dependent on the sample used to verify the test. Selection of participants will often be purposive, to assure that they adequately represent the target population for which the test would be used. This means that inclusion and exclusion criteria must be specified. Because many conditions can be manifested as mild, moderate, or severe, generalization of findings depends on the extent to which the sample reflects the full spectrum of the condition. Validity can be further supported by confirmation on a separate sample.
• How measurements are taken can influence the assessment of an index test. All subjects should get both the index and reference test. This is important to assure that the decision to use a test is not based on other clinical information, or an a priori opinion of whether the patient has the condition. Blinding those who take the measurements will reduce potential bias. Operational definitions should be clear, and testers should be evaluated for their reliability so that measurement error is minimized.
• Because many conditions will change over time, the timing of the index and reference test can be important, to be sure that they are measuring the same thing. For diagnosis, this should be concurrent.
• Clinicians should be able to evaluate the methodologic quality of diagnostic validity studies. Studies with poor quality of design or analysis may overestimate the accuracy of diagnostic tests.4-6 The Standards for Reporting of Diagnostic Accuracy (STARD) provide a checklist for inclusion of all relevant information that comprise essential elements of design, analysis, and interpretation of diagnostic test studies (see Chapter 38).8

➤ Radiographic data were considered a gold standard in the diagnosis of VFx. The index test was determination of height loss, calculated as maximum height recalled by the patient minus the current height. Although this method has potential
for inaccurate recall, it has been shown to be a useful measure for purposes of identifying VFx.8
Participants were invited to participate in the VFx study through outpatient clinics in one community. Exclusion criteria included pre-existing metabolic bone disease or severe skeletal deformity. The participants had a range of scores, with median height loss of 3.2 cm with an interquartile range of 1.5 to 7 cm. The x-rays and height measures occurred at the same session, controlling for any changes in condition.

Positive and Negative Tests

The validity of a diagnostic test is evaluated in terms of its ability to accurately detect the true presence or absence of the target condition. Our challenge is to determine how likely it is that someone who tests positive actually has the disorder and someone who tests negative does not.

If we compare the positive and negative results of a diagnostic test to the reference standard diagnosis, there are four possible scenarios, which are summarized in a familiar 2 × 2 arrangement (see Table 33-1). The columns indicate the true presence or absence of the disease (Dx+ or Dx–), as determined by the reference standard. The rows indicate whether positive or negative results were obtained using the index test. The cells labeled a and d represent true positive and true negative test results, respectively. Individuals who have the target condition are accurately detected by a positive test (cell a) and individuals who do not have the target condition are accurately detected by a negative test (cell d). The overall diagnostic accuracy of a test is the proportion of all correct tests in the sample, (a + d)/N.

Cell b represents the false positives, those who are incorrectly assigned a positive test result despite not having the target condition, and cell c represents the false negatives, individuals who are incorrectly assigned a negative test result despite the presence of the target disease.

Table 33-1 Summary of Diagnostic Test Results

                      Reference Standard (True Diagnosis)
                      Dx+                  Dx−                  Total
Positive Test         a (True positive)    b (False positive)   a + b
Negative Test         c (False negative)   d (True negative)    c + d
Total                 a + c                b + d                N

Sensitivity and Specificity

To analyze these values, we can calculate two proportions which indicate the accuracy of the height loss test in detecting the true presence or absence of vertebral fractures (see Table 33-2A and B). Proportions can range from 0.0 to 1.0. The closer to 1.0, the stronger the association.

Sensitivity is the test's ability to obtain a positive test when the target condition is really present, or the true positive rate (also called the hit rate):

Sensitivity = a / (a + c)

This value is the proportion of individuals who test positive for the condition out of all those who actually have it, or the probability of obtaining a correct positive test in patients who have the target condition. A highly sensitive test will correctly classify those who have the condition of interest.

Specificity is the test's ability to obtain a negative test when the condition is really absent, or the true negative rate:

Specificity = d / (b + d)

This value is the proportion of individuals who test negative for the condition out of all those who are truly normal, or the probability of a correct negative test in those who do not have the target condition. A highly specific test will rarely test positive when a person does not have the disease.

➤ Using ≥ 4 cm as the cutoff results in sensitivity of the height loss test of 79%. Of the 62 patients identified as having VFx, 49 tested positive using this criterion. The specificity of the test was also 79%. Of the 89 patients who did not have a VFx, 70 were considered negative. These statistics indicate that the height loss criterion of ≥ 4 cm is fairly accurate for identifying both those with and without fractures.

The complement of sensitivity (1 – sensitivity) is the false negative rate, or the probability of obtaining an incorrect negative test in patients who do have the target disorder. The complement of specificity (1 – specificity) is the false positive rate, sometimes called the false alarm rate. This is the probability of an incorrect positive test in those who do not have the target condition.
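These definitions can be checked directly against the cell counts for the height loss test (a = 49, b = 19, c = 13, d = 70 in the VFx study); the sketch below computes each statistic from the four cells:

```python
def diagnostic_stats(a, b, c, d):
    """Accuracy statistics from a 2x2 diagnostic table.
    a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    return {
        "sensitivity": a / (a + c),        # true positive rate
        "specificity": d / (b + d),        # true negative rate
        "false_neg_rate": c / (a + c),     # 1 - sensitivity
        "false_pos_rate": b / (b + d),     # 1 - specificity
        "accuracy": (a + d) / (a + b + c + d),  # (a + d) / N
    }

# Cell counts from the height-loss (>= 4 cm) vs. radiography comparison
stats = diagnostic_stats(a=49, b=19, c=13, d=70)
# stats["sensitivity"] -> 49/62, about .79; stats["specificity"] -> 70/89, about .79
```

The complements fall out for free: the false negative and false positive rates sum with sensitivity and specificity, respectively, to 1.0.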
Table 33-2  Diagnostic Accuracy Statistics for the Height Loss Test (VFx Study)

A. Data 1

                        Reference Standard
                        VFx         No VFx      Total
Height Loss ≥4 cm       49 (a)      19 (b)       68 (a+b)
Height Loss <4 cm       13 (c)      70 (d)       83 (c+d)
Total                   62 (a+c)    89 (b+d)    151 (N)

B. Diagnostic Statistics 3

Sensitivity = 49/62 = .79        Specificity = 70/89 = .79
PV+ = 49/68 = .72                PV− = 70/83 = .84
LR+ = 3.71                       LR− = .27

C. Confidence Intervals 4

Sensitivity 95% CI: 0.69 to 0.89    Specificity 95% CI: 0.70 to 0.87
PV+ 95% CI: 0.61 to 0.83            PV− 95% CI: 0.77 to 0.92
LR+ 95% CI: 2.44 to 5.63            LR− 95% CI: 0.16 to 0.44

1 The distribution of positive and negative tests for VFx for 151 patients based on x-rays (reference standard) and height loss of ≥4 cm. Cell a = true positives; cell b = false positives; cell c = false negatives; cell d = true negatives.
2 Visualization of the distribution of positive and negative tests (icon array omitted). Each patient is represented by an icon, red indicating those who do have a VFx and green indicating those who do not. A black dot indicates an individual who tested positive.
3 Diagnostic tests are typically reported as percentages or taken to 2 decimal places.
4 Confidence intervals were obtained using the CCRB CI Calculator: Diagnostic Statistics.9 Calculations vary due to rounding differences.

Based on data from Yoh K, Kuwahara A, Tanaka K. Detective value of historical height loss and current height/knee height ratio for prevalent vertebral fracture in Japanese postmenopausal women. J Bone Miner Metab 2014;32(5):533-538.
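The intervals in part C can be approximated with a simple normal (Wald) interval on each proportion; the cited calculator may use a different method, so results can differ slightly in the second decimal place. A minimal sketch for the sensitivity interval:

```python
import math

def wald_ci_95(successes, n):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - 1.96 * se, p + 1.96 * se

# Sensitivity: 49 of the 62 fracture patients tested positive
lo, hi = wald_ci_95(49, 62)
print(round(lo, 2), round(hi, 2))  # 0.69 0.89, matching the reported interval
```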
sensitivity, it will properly identify most of those who have the disorder. If the test has high specificity, it will properly identify most of those without the condition. But how do these definitions relate to confidence in diagnostic decisions? Consider these two questions:

• If a patient has a positive test, can we be confident in ruling IN the diagnosis?
• If a patient has a negative test, can we be confident in ruling OUT the diagnosis?

Sensitivity and specificity help us answer these questions but probably not the way you would expect. When a test has high specificity, a positive test rules in the diagnosis. When a test has high sensitivity, a negative test rules out the diagnosis. Straus and colleagues10 offer two mnemonics to remember these relationships (see Fig. 33-1). Unfortunately, there is no consensus on what a "high" value is. Mostly, it refers to values closer to 1.0—not a big help. Sensitivity and specificity values above .90 will generally result in "few" false negatives and false positives. The impact of these values will depend on the test and the risks associated with misdiagnosis.

Figure 33–1  Mnemonics to remember the relationship of sensitivity and specificity for ruling in and ruling out a diagnosis based on a test result.
    SpPin: With high Specificity, a Positive test rules in the diagnosis.
    SnNout: With high Sensitivity, a Negative test rules out the diagnosis.

Ruling in and out can get confusing, so think of it this way. A highly specific test will accurately identify most of the patients who do NOT have the disorder. If the test is so good at finding those who are normal, we can be pretty sure that someone with a positive test DOES have the disorder (ruling IN the diagnosis)—because if he didn't have the disorder, the test would have correctly identified him as normal!

Conversely, a highly sensitive test will identify most of those who DO have the disorder. Therefore, we can be pretty sure that someone with a negative test does NOT have the disorder (ruling OUT the diagnosis)—because if he did have the disorder, the test would have correctly diagnosed him!

These concepts are also related to predictive value. With a more specific test, negative cases are identified more readily. Therefore, it is less likely that an individual with a positive test will actually be normal. This results in a high positive predictive value. With a more sensitive test, positive cases are identified more readily; that is, we will not miss many true cases. Therefore, it is less likely that an individual with a negative test will have the disease. This leads to a high negative predictive value.

If we use the example of the VFx study (see Table 33-2), with specificity of 79% (and a PV+ of 72%), we can be confident that someone with a positive test is at risk for VFx. Similarly, with sensitivity of 79% (and a PV− of 84%), if someone has a negative test, we can be fairly sure that person is really not at risk. The height loss test seems to be a reasonable procedure to identify both those with and without risk for VFx, helping us reasonably rule the risk in or out.

■ Pretest and Posttest Probabilities

The validity of a diagnostic test is based on how strongly it can support a decision to rule the disorder in or out. Therefore, a test is considered a good one if it can help to increase our certainty about a patient's diagnosis (see Box 33-1).

Pretest Probabilities
When we begin to evaluate a patient by taking a history and using screening or other subjective procedures, we begin to rule in and rule out certain conditions and eventually generate one or more hypotheses about the likely diagnosis. A hypothesis can be translated into a measure of probability or confidence, indicating the clinician's estimate of how likely it is that a particular disorder is present. This has been termed the pretest probability or prior probability of the disorder––what we think might be the problem before we perform any formal testing.13

The concept of a pretest probability is related to differential diagnosis. Conceptually, it represents a "best guess" or clinical impression, which can be determined in different ways. Following initial examination, clinicians may exercise judgment based on experience with certain types of patients to estimate the probability of a diagnosis,14 although such estimates are not always reliable.15,16 In our example, for instance, a clinician who has a lot of experience with patients with vertebral fractures may be able to generate an initial hypothesis about a patient's condition just based on initial observation and complaints.

Information from the literature can also be used to help with this estimate by referring to prevalence of the disorder using population data. For instance, some studies have shown that the prevalence of vertebral fractures
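Predictive values come directly from the rows of the 2 × 2 table; using the VFx counts again (a sketch, with values from Table 33-2):

```python
a, b, c, d = 49, 19, 13, 70   # VFx height loss data

pv_pos = a / (a + b)   # of all positive tests, the proportion that truly have VFx
pv_neg = d / (c + d)   # of all negative tests, the proportion truly without VFx

print(round(pv_pos, 2), round(pv_neg, 2))  # 0.72 0.84
```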
among older women and men is about 20%.16 It would be important to determine if the characteristics of the samples used in these studies were sufficiently similar to your patient's profile to make them applicable. Data from a study sample can also be used to indicate the prevalence within the accessible population.

➤ The height loss study demonstrated a 41% prevalence of vertebral fractures (n=62) in its sample of 151 patients.3 Knowing this, before any further testing, our best estimate is that the pretest probability of a patient being at risk for VFx is 41%.

With no other information, this value gives us our best initial guess of the likelihood that a patient could have this condition.

Posttest Probabilities
A diagnostic test allows a clinician to revise the pretest probability estimate of the disorder, to be more confident in the diagnosis, to improve certainty. The revised likelihood of the diagnosis based on the outcome of a test is the posttest probability, or posterior probability––what we think the problem is (or is not) once we know the test result. A good test will allow us to have a high posttest probability, confirming the diagnosis, or a low posttest probability, causing us to abandon it.

Decision-Making Thresholds
Being able to estimate a pretest probability is central to deciding if a condition is present and if further testing or treatment is warranted. Based on the initial hypothesis and pretest probability of a condition, the clinician must decide if a diagnostic test is necessary or useful to confirm the actual diagnosis. Straus et al11 suggest that two thresholds should be considered, as illustrated in Figure 33-2A. With a very low pretest probability, the diagnosis is so unlikely that testing is not useful—it will provide little additional information to improve certainty. Even with a positive test, results are likely to be false positives. Therefore, treatment should not be initiated and other diagnoses should be considered.

In contrast, with a very high pretest probability, the likelihood of the diagnosis is so strong that testing may be unnecessary for the same reason—results are unlikely to offer any additional useful information. Therefore, treatment should just be initiated without taking time for testing. Even with a negative test, results are likely to be false negatives. Perhaps more common, however, the pretest probability is not definitive, and testing is necessary to pursue the diagnosis.

(Cartoon from Hemant Morparia.)

This approach must also take into consideration other factors, such as the relative severity of the disorder, the test's cost, or the discomfort or risk. Therefore, the
Figure 33–2  Decision-making thresholds. (A) Pretest probability scale: do not test (consider a different dx), test (determine posttest probability), or treat (test unnecessary). (B) Posttest probability scale.
threshold for testing may vary for different conditions. A test that requires a potentially harmful procedure may not be worthwhile if the condition has little consequence. However, when considering the possibility of a dangerous condition, the risk may be reasonable. For instance, lumbar puncture may be needed to rule out life-threatening conditions but may not be a first test to rule out less severe conditions. In contrast, a patient may exhibit symptoms that could mean the presence of a deep venous thrombosis (DVT), which is potentially life threatening. Even if the symptoms are minimal and the pretest probability is low, the clinician may feel compelled to test for the condition to safely rule it out before continuing with other tests or interventions.

If sensitivity and specificity add up to 100% (sensitivity = 1 − specificity), the test will not contribute to a change from pretest to posttest probabilities. For a test to be useful, the difference between sensitivity and 1 − specificity (true positive rate − false positive rate) should be greater than zero, preferably at least 50%, and ideally 100%.13

Once a test is performed, we still face decision-making thresholds. With a high posttest probability, the diagnosis is confirmed, and treatment can be initiated with confidence. With a low posttest probability, a different diagnosis should be considered. When the posttest probability is not definitive, further testing may be necessary. In that case, the posttest probability associated with the first test becomes the pretest probability for the next test. In this way, sequential testing continues to refine the decision-making process.

Likelihood Ratios
Once we have established an initial hypothesis that the patient may have a particular diagnosis, we want to determine if a test can make us more confident in that diagnosis. A measure called the likelihood ratio (LR) tells us how much more likely it is that a person has the diagnosis after the test is done; that is, it will help us determine the posttest probability. We can determine a LR for a positive or negative test.

The LR indicates the value of the test for increasing certainty about a positive diagnosis, or its "confirming power."18 Likelihood ratios are being reported more often in the literature as an important standard for evidence-based practice. The LR has an advantage over sensitivity, specificity, and predictive values because it is independent of disease prevalence and therefore can be applied across settings and patients.

Positive Likelihood Ratio
To understand this statistic, let us assume that a patient tests positive on the height loss test. If this were a perfect test, we would be certain that the patient has a VFx (true positive). But we hesitate to draw this conclusion definitively because we know that some patients who are not at risk will also test positive (false positive). Therefore, to determine if this test improves our diagnostic conclusion, we must correct the true positive rate by the false positive rate. This is our positive likelihood ratio (LR+):

    LR+ = true positive rate / false positive rate = sensitivity / (1 − specificity)

Values between 2 and 5 may be important, depending on the nature of the diagnosis being studied. Values close to 1.0 represent unimportant effects. A likelihood ratio of 1.0 means the true-positive and false-positive (or true-negative and false-negative) rates are the same, making the results of the test useless.

Figure 33–3  Scale for interpreting the relative importance of positive and negative likelihood ratios. LR− ranges from important (0 to .1) through .1 to .2 and .2 to .5; values from .5 to 2 are unimportant; LR+ ranges from 2 to 5 and 5 to 10 up to important (>10).
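Both likelihood ratios follow from the same table proportions. The LR− shown here uses the standard complement form, (1 − sensitivity)/specificity; the exact decimals differ slightly from the reported 3.71 and .27 because of rounding in intermediate steps:

```python
a, b, c, d = 49, 19, 13, 70   # VFx height loss data

sensitivity = a / (a + c)     # true positive rate
specificity = d / (b + d)     # true negative rate

lr_pos = sensitivity / (1 - specificity)   # LR+ = true positive rate / false positive rate
lr_neg = (1 - sensitivity) / specificity   # LR- = false negative rate / true negative rate

print(round(lr_pos, 2), round(lr_neg, 2))  # 3.7 0.27
```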
[Nomogram omitted: a straight line from the pretest probability (left scale) through the test's likelihood ratio (center scale) yields the posttest probability (right scale), with separate readings for a positive and a negative test.]
negative test, however, there is only a 16% chance the patient has a VFx.

The posttest probability is dependent on the sensitivity and specificity of the diagnostic test (translated to a likelihood ratio) as well as the clinician's estimate of the pretest probability for an individual patient. Using the nomogram, for example, with a positive test for height loss, if we had started with a pretest probability of 20%, we would get a posttest probability of about 50% for a positive test. But if the pretest probability was only 5%, a positive test would increase our posttest certainty to only 15%. Where we start will influence the degree to which our certainty can be improved by the test. When greater precision is desired, posttest probabilities can be calculated using the pretest probability and likelihood ratios (see Box 33-2).

Confidence Intervals
Sensitivity, specificity, predictive values, and likelihood ratios are all point estimates and should also be expressed in terms of confidence intervals (see Table 33-2C). Although these values are often not reported, they are important to understanding the true nature of these estimates. Given a sample of scores, the confidence interval will indicate the range within which we can be sure the true population value will fall. Although not interpreted in terms of significance testing, confidence intervals for these measures of diagnostic accuracy will
indicate the relative stability of the test's results; that is, with a wide confidence interval we would be less likely to consider the value a good estimate. The process for calculating confidence intervals for diagnostic tests is complex,22 and more easily carried out by a confidence interval calculator.

Box 33-2  Calculating Exact Posttest Probabilities

Posttest probabilities can be calculated directly by converting pretest probabilities to odds. For the vertebral fracture example, with prevalence of 41% and LR+ = 3.71, LR− = .27:

Step 1: Convert pretest probability (prevalence) to pretest odds:
    Pretest odds = pretest probability / (1 − pretest probability) = .41 / (1 − .41) = .695

Step 2: Multiply pretest odds by LR+ and LR− to get posttest odds:
    Posttest odds = pretest odds × LR
    For positive test = .695 × 3.71 = 2.578
    For negative test = .695 × .27 = .188

Step 3: Convert posttest odds to posttest probability:
    Posttest probability = posttest odds / (posttest odds + 1)
    For positive test: 2.578 / 3.578 = .721
    For negative test: .188 / 1.188 = .158

Compare these values with the results estimated from the nomogram. The calculations will typically be more exact.

See the Chapter 33 Supplement for a clean copy of the nomogram for your use and formulas for calculation of confidence intervals. Links are also provided for Web-based programs that provide interactive nomograms, as well as calculators for posttest probabilities, sample size, and confidence intervals for diagnostic tests.9,23,24

■ Receiver Operating Characteristic (ROC) Curves

Although continuous scales are considered preferable for screening because they are more precise, they are often converted to a dichotomous outcome for diagnostic purposes. This requires establishing a cutoff score to demarcate a positive or negative test. For example, in the VFx study, a height loss of 4 cm or more was considered indicative of vertebral fracture. However, if a cutoff score of 3 cm was used, the sensitivity and specificity would be different, influencing posttest probabilities.

The problem, then, is to determine what cutoff score should be used. This decision point must be based on the relative importance of and desired balance between sensitivity and specificity, or the cost of incorrect outcomes versus the benefits of correct outcomes. There will often be an inverse relationship between these estimates.

In weighing sensitivity and specificity, consider this analogy: Although there may be costs associated with unnecessary preparations for a predicted storm that does not occur (false positive), these costs would probably be considered minor relative to the danger of failing to prepare for a storm that does occur (false negative). There are criteria, for instance, for a storm "warning" versus a storm "watch." The point at which the alert moves to the more severe warning would be the cutoff point.

Cutoff Scores
Suppose we use the 4-cm height loss criterion to predict risk for vertebral fracture, and an individual with only 1 cm height loss is referred for an x-ray to determine if a fracture is present. If the individual is not truly at risk (false positive), the test may be considered low cost, compared to the situation where an individual who is at risk is not correctly diagnosed (false negative) and not referred for testing. Therefore, it might be reasonable to set the cutoff score low to avoid false negatives, thereby increasing sensitivity.

Conversely, consider a scenario where a test is used to determine the presence of a condition that requires potentially life-threatening surgery. A physician would want to avoid the procedure for a patient who does not truly have the condition. For that situation, the threshold might be set high to avoid false positives, increasing specificity. We would not want to perform this procedure unless we knew for certain that it was warranted.

Obviously, it is desirable for a screening test to be both sensitive and specific. Unfortunately, there is often a trade-off between these two characteristics. One way to evaluate this decision point would be to look at several cutoff points to determine the sensitivity and specificity at each point. We could then consider the relative trade-off to determine the most appropriate cutoff score based on the consequences of false negatives versus false positives. It is often necessary to combine the results of several tests to minimize the trade-off. The decision about the relative importance of false positives versus false negatives should be a collaborative one between clinician and patient (see Focus on Evidence 33-1).
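The three steps of Box 33-2 translate directly into a small function (a sketch using the book's rounded LR values):

```python
def posttest_probability(pretest_prob, lr):
    """Box 33-2: probability -> odds, apply the likelihood ratio, back to probability."""
    pretest_odds = pretest_prob / (1 - pretest_prob)   # Step 1
    posttest_odds = pretest_odds * lr                  # Step 2
    return posttest_odds / (posttest_odds + 1)         # Step 3

print(round(posttest_probability(0.41, 3.71), 2))  # 0.72 for a positive test
print(round(posttest_probability(0.41, 0.27), 3))  # 0.158 for a negative test
```

The same function reproduces the nomogram readings quoted earlier: a 20% pretest probability with LR+ = 3.71 gives a posttest probability of about 48%, and a 5% pretest probability gives about 16%, close to the approximate values read from the nomogram.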
Focus on Evidence 33–1
Moving the Goal Posts: Statistics Versus Meaningful Conversation

Prostate cancer is one of the most common forms of cancer in men, with estimates of 165,000 new cases and 29,000 deaths in 2018.25 Tests of prostate-specific antigen (PSA) levels may indicate prostate cancer, with a traditional cutoff of 4.0 ng/mL.26 At that level, sensitivity has been estimated at 21% and specificity at 91%.27 These values suggest that a high PSA test can "rule in" cancer (SpPin), but with such low sensitivity, the test cannot rule it out (SnNout). Therefore, as a screening test, it is likely to result in many false positives, with many other factors that can also result in an elevated PSA count, including urinary tract infection, recent ejaculation, prostatitis, prostate enlargement, and changes occurring after age 40.28 The estimated prevalence of prostate cancer is 25% for those with PSA between 4.0 and 10.0 ng/mL,29 with a LR− of .87 and LR+ of 2.33. These ratios are relatively uninformative (in the "unimportant" range). This was confirmed in one study that analyzed results with different PSA cutoff points, resulting in an ROC curve with AUC of only .678.30 The PV+ has been estimated at 30%, indicating that less than one in three men with an abnormal test will have cancer on biopsy.31

In 2008, the U.S. Preventive Services Task Force (USPSTF) issued guidelines that stated that there was insufficient evidence to make a recommendation on PSA-based screening, particularly in men younger than 75 years, but they did recommend against use of the test for men older than 75. Even though prevalence of prostate cancer is higher among older men, the slow progression of the disease indicates that testing is unlikely to affect mortality.

In 2012, based on results of a clinical trial, recommendations were revised, discouraging the use of PSA screening for all men, regardless of age, race, or family history, concluding that the benefits of such testing did not outweigh harms.32,33 The harms stem from treatment or further testing following a false-positive test, including infection and hospitalization from prostate biopsy, and surgery or radiation therapy causing bowel problems, urinary incontinence, or erectile dysfunction—all issues that can seriously impact quality of life.

In 2018, evidence was reexamined and recommendations were revised again. Now guidelines suggest that PSA testing could be selectively beneficial for men between 55 and 69 but still recommend against testing for men 70 years and older.34 The recommendation for men aged 55 to 69 states that the decision to undergo screening should be an individual one in consultation between patient and healthcare provider, considering the risks, benefits, and personal preferences of the patient—which fits well with the concept of evidence-based practice.35

This story provides two significant and related lessons. The first is the important consideration of numbers as guidelines, not absolute values to dictate decisions. As with all research findings, clinicians must work with patients to interpret risks and benefits and account for their preferences, acknowledging that statistics have inherent uncertainty. And the second is the recognition that evidence isn't stagnant, and revisions are common as research studies are reexamined or new information becomes available. Bayes would be pleased that the system is able to reconsider findings, allowing new information to change recommendations!

[ROC curve omitted: sensitivity plotted against 1−specificity for PSA cutoffs from 1.1 to 8.1 ng/mL, with AUC = .678. Curve derived from data from Thompson JM, Ankerst DP, Chi C, et al. Operating characteristics of prostate-specific antigen in men with an initial PSA of 3.0 ng/mL or lower. JAMA 2005;294(1):66-70, from Table 3, p. 69.]
The ROC Curve
The balance between sensitivity and specificity can be examined using a graphic representation called a receiver operating characteristic (ROC) curve—a way to examine the ratio of signal to noise.

Suppose we were listening to a radio station that has a weak signal. We turn the volume up so we can hear better, but as we do so, we not only pick up the desired signal but also background noise. Initially, the signal will increase faster than the noise, but there will come a point, as we increase volume, that the noise will grow faster than the signal. If we set the volume to its maximum, we may claim that the signal is strong, but the noise will be so great that the signal will be indecipherable. Therefore, the optimal setting will be that point where we detect the largest ratio of signal to noise.

This is what we are trying to do with a diagnostic test. We want to detect the "signal" (the true positive and true negative) with the least amount of interference possible (false positive and false negative). The ROC curve diagrams this relationship. It allows us to answer the question: How well can a test discriminate between signal and noise—can it discriminate between the presence or absence of disease?

Coordinates of the ROC Curve
The process of constructing an ROC curve involves setting several cutoff points for a test and calculating sensitivity and specificity at each one. The curve is then created by plotting a point for each cutoff score that represents the proportion of patients correctly identified as having the condition on the Y axis (true positives or sensitivity) against the proportion of patients incorrectly identified as having the condition (false positives or 1 − specificity) on the X axis.

To illustrate this process, consider again the example of vertebral fracture using a hypothetical dataset for the sample of 151 patients. Table 33-3A shows the distribution of scores for the patients who did and did not have a VFx, with height loss values from 1 cm to ≥6 cm. Table 33-3B shows the sensitivity and specificity at six cutoff points based on the number of true-positive and true-negative scores. It is generally recommended that at least five to six points be used to plot an ROC curve.

For example, if we use a cutoff score of 1 cm, then all those with a height loss of 1 cm or more will be identified as having a fracture. At that level, all those with a VFx have been correctly identified, but all those without a VFx have been incorrectly identified. This cutoff point is so low that everyone meets it, leading to sensitivity of 1.0 but specificity of zero (and 1 − specificity of 1.0). Alternatively, using a cutoff of 6 cm, we have correctly identified all those who do not have a VFx but have missed all but 11 who do have a fracture. This leads to a corresponding sensitivity of .18 and specificity of 1.00, which results in 1 − specificity of 0.00. Similarly, with a cutoff score of 3 cm, all those who had a height loss of less than 3 cm will be considered negative. Those with height loss ≥3 cm will be diagnosed positive. When this cutoff score is used, 53 individuals are correctly diagnosed as positive and 26 are incorrectly diagnosed as positive. This leads to a corresponding sensitivity of .86 and 1 − specificity of .29.

These values are plotted to create the ROC curve (see Table 33-3 5). The curve is completed at the origin and the upper right hand corner, reflecting cutoff points above and below the highest and lowest scores.

Interpreting the ROC Curve
The ROC curve is plotted on a square with values of 1.0 for sensitivity and 1 − specificity at the upper left and lower right corners, respectively. A perfect test instrument will have a true positive rate of 1.0 and a false positive rate of zero, resulting in a curve that essentially fills the square from the origin to the upper left corner to the upper right hand corner. A noninformative curve occurs when the true-positive and false-positive rates are equal, which means that the test provides no better information than 50:50 chance. Such a "curve" starts at the origin and moves diagonally to the upper right hand corner (shown as the broken line).

Area Under the Curve
If we want to get a sense of the quality of the ROC curve, we can look to see how closely it approximates the perfect curve. Although this provides a visual approximation, a quantitative standard is more definitive. The best index for this purpose is a measure of the area under the curve (AUC). This value equals the probability of correctly choosing between normal and abnormal signals.

➤ As shown in Table 33-3 3, the AUC = .815 for the height loss data, which is generally considered a good curve. This means that presented with a randomly chosen pair of patients, one with VFx and one without, the clinician would correctly identify the patient with a VFx 82% of the time.

Therefore, the area represents the ability of the test to discriminate between those with and without the target condition. A perfect test has an AUC of 1.00—allowing 100% identification of individuals who have the disorder. The AUC can also be used to compare curves from two tests to determine which is a better diagnostic tool.

Choosing a Cutoff Point
In addition to making comparisons or describing the relative effectiveness of a test for identifying a disorder, we can also use the ROC curve to decide which cutoff point would be most useful. Most ROC curves have a steep initial section, which reflects a large increase in sensitivity with little change in the false positive rate. A relatively flat region across the top is also typical. Neither of these sections of the curve makes sense for choosing a cutoff point, as they represent little change in one component of the curve.

The best cutoff point will typically be at the point where the curve turns.36 This will be the point at which there is a maximal difference between the true positive rate and the false positive rate—the difference between sensitivity and 1 − specificity, known as the Youden index.37

➤ Yoh et al3 used the Youden index to determine the best cutoff point at ≥4.0 cm. In Table 33-3 5 we can see a marked turn at that point, suggesting that this cutoff would provide the best balance between sensitivity and specificity for this test (79% and 79%, respectively). At that point we would miss diagnosing a VFx for 13 out of 62 individuals with a fracture and would incorrectly diagnose 19 out of the 89 individuals without fracture.
Table 33-3  Cutoff Points for Height Loss (HL) as a Predictor of Vertebral Fracture (N=151)

1  Known groups:

Height Loss    Fracture (n=62)    No Fracture (n=89)
1.0–1.9 cm           4                   19
2.0–2.9 cm           5                   44
3.0–3.9 cm           4                    7
4.0–4.9 cm          17                    8
5.0–5.9 cm          21                   11
≥6.0 cm             11                    0

2  Coordinates of the ROC curve (an individual has a positive test if height loss is equal to or greater than the cutoff point):

Cutoff     True        True        Sensitivity   Specificity   1−Specificity   Youden
Point      Positives   Negatives                                               Index
≥1.0 cm       62           0          1.00          0.00           1.00          0.0
≥2.0 cm       58          19           .94           .21            .79          .15
≥3.0 cm       53          63           .86           .71            .29          .57
≥4.0 cm       49          70           .79           .79            .21          .58
≥5.0 cm       32          78           .52           .88            .12          .40
≥6.0 cm       11          89           .18          1.00           0.00          .18

3  The area under the curve is .815, which is significant (p < .001) with a narrow confidence interval. We are 95% confident that the true area under the curve will fall between 74% and 89%.

4  The coordinates of the curve are listed in the SPSS output, although SPSS calculates the cutoff values a little differently (positive if greater than or equal to .0000, 1.5000, 2.5000, and so on). The values for sensitivity and 1−specificity may vary slightly from the hand calculations shown in the coordinates above, but these will not change the overall shape of the curve.

5  The ROC curve (not shown) plots sensitivity against 1−specificity, showing its relationship to the null curve at 50% (dashed line). The curve could be interpreted as turning at either ≥3.0 cm or ≥4.0 cm, depending on the relative importance of sensitivity and specificity. The study by Yoh et al3 used the 4-cm cutoff point.

Portions of output have been omitted for clarity. Values may vary due to rounding.
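The table's coordinates can be regenerated from the known-group counts, and a trapezoidal rule over the resulting points reproduces the reported AUC (a sketch; SPSS places the cutoffs slightly differently, as noted above):

```python
# Height loss bins (1.0-1.9, 2.0-2.9, 3.0-3.9, 4.0-4.9, 5.0-5.9, >=6.0 cm)
fx    = [4, 5, 4, 17, 21, 11]   # patients with fracture (n = 62)
no_fx = [19, 44, 7, 8, 11, 0]   # patients without fracture (n = 89)

points = [(0.0, 0.0)]           # (1 - specificity, sensitivity); origin anchors the curve
youden = {}
for k in range(6):              # positive test if height loss >= (k + 1) cm
    tp, fp = sum(fx[k:]), sum(no_fx[k:])
    sens, spec = tp / sum(fx), 1 - fp / sum(no_fx)
    points.append((fp / sum(no_fx), sens))
    youden[k + 1] = sens + spec - 1

points.sort()                   # order by 1 - specificity for integration

# Trapezoidal area under the ROC curve
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))

best = max(youden, key=youden.get)
print(best, round(youden[best], 2), round(auc, 3))  # 4 0.58 0.815
```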
The highest values for the Youden index are .58 for the cutoff at ≥4.0 cm and .57 at ≥3.0 cm, suggesting that there is little difference between these points in terms of how they would contribute to determination of posttest probabilities. Therefore, we might also consider the turn of the curve at ≥3 cm to indicate a meaningful cutoff point. Using this criterion, we would have a higher sensitivity at 86% and slightly lower specificity at 71%. The final choice of a cutoff must be based on how the clinician and patient see the impact of an incorrect identification in a positive or negative direction. For instance, one might consider it more important to identify those with a potential VFx using a more sensitive test, as opposed to taking an x-ray of someone who is suspected of having a VFx and finding out she does not. The ROC curve should act as a guide for that decision.

➤ Results
In a sample of 151 postmenopausal women, radiographic diagnosis of VFx was present in 41% of the subjects. Using a cutoff value of ≥4 cm of height loss, sensitivity was 79% (95% CI: .69 to .88), specificity was 79% (95% CI: .70 to .87), PV+ was 72% (95% CI: .61 to .83), PV– was 84% (95% CI: .77 to .92), LR+ was 3.71 (95% CI: 2.44 to 5.63), and LR– was .27 (95% CI: .16 to .44). ROC curve analysis showed AUC = .815 (95% CI: .743 to .888). These values show that height loss of ≥4 cm is a good cutoff value for screening women for vertebral fracture.

Note: These values will often be presented in a table to avoid having to express this much data in narrative form.

HISTORICAL NOTE
The ROC curve has its roots in World War II, with tests of the ability of radar operators to determine whether a blip on a radar screen represented an object or noise.38 Following the attack on Pearl Harbor, the U.S. Army began research to increase the accuracy of correctly detecting Japanese aircraft from radar signals. For this purpose, they measured the ability of radar receiver operators to make these important distinctions, which was called "receiver operating characteristics." This process later formed the basis for signal detection theory.

■ Clinical Prediction Rules

A clinical prediction rule (CPR) is a tool that quantifies the contributions of different variables to the diagnosis, prognosis, or likely response to treatment for an individual patient.39 It may include information derived from the history, physical exam, lab tests, or patient complaints. These "rules" are intended to simplify and increase accuracy of diagnostic and prognostic assessments by demonstrating how specific clusters of clinical findings can be used to efficiently predict outcomes.40,41

You may also see CPRs referred to as clinical decision rules, clinical decision guidelines, or clinical prediction guides. And of course, don't confuse this acronym with cardiopulmonary resuscitation!

Diagnosis
Perhaps the most obvious application of a CPR is to assist in the diagnosis of a disorder based on clinical signs.

Stiell et al42 developed clinical prediction rules for the use of radiography with suspected acute ankle injuries. They noted that many patients with ankle injuries did not have a fracture, and yet the typical response to emergency care was to order an x-ray. Estimates had shown that the prevalence of fractures with ankle injuries was less than 15%. This study addressed an interest in efficiency and cost-savings as well as a desire for diagnostic accuracy.

The prediction rules that were developed through this process have come to be known as the Ottawa Ankle Rules (OARs) (based on Stiell's affiliation with Ottawa Civic Hospital), which include rules for both ankle and midfoot injuries. The indicators for ruling out a fracture are based on a lack of tenderness in specific areas of the foot or ankle and the patient's ability to bear weight on the affected limb, even with a limp. Figure 33-5 shows these guidelines, which have been validated in different countries,43-47 in different populations,48,49 and in various settings.50,51

Systematic reviews of the OARs have shown that they are 95% to 100% sensitive, with a negative likelihood ratio of .08.52-54 If we apply the LR– to a pretest probability of 15% (prevalence estimate), the posttest probability is approximately 1.5%, indicating that a fracture is not likely (try it out on the nomogram). Using the logic of SnNout, with a test that is highly sensitive, a negative test will effectively rule out the disorder. Therefore, a negative result using the OARs will consistently and accurately rule out fractures after ankle or foot injury, making an x-ray unnecessary. This rule has effectively and substantially reduced the number of radiographs taken.55

Reviews have also shown that the OARs have much lower specificity, ranging from 40% to 50%, translating to a positive likelihood ratio around 1.5. With low specificity, a positive test does not necessarily mean a fracture is present, requiring an x-ray to confirm a fracture. For example, an LR+ of 1.5 translates to about 20% posttest probability, barely different from the pretest probability.
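The posttest probabilities quoted for the OARs follow from the odds form of Bayes' theorem: posttest odds = pretest odds × likelihood ratio. A short sketch of that arithmetic, using the prevalence and likelihood ratios cited above (the function name is my own):

```python
def posttest_probability(pretest_prob: float, lr: float) -> float:
    """Convert a pretest probability to a posttest probability via a likelihood ratio."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

pretest = 0.15  # ~15% prevalence of fracture among ankle injuries

after_negative = posttest_probability(pretest, 0.08)  # LR- = .08
after_positive = posttest_probability(pretest, 1.5)   # LR+ = 1.5
```

This gives about 1.4% after a negative test and about 21% after a positive test (the chapter rounds these to roughly 1.5% and 20%). A negative OAR result drives the fracture probability from 15% to well under 2%, while a positive result barely moves it, which is the quantitative content of SnNout.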
[Figure: diagrams of the ankle and foot marking the malleolar zone (A: posterior edge or tip of the lateral malleolus; B: posterior edge or tip of the medial malleolus; the zone extends 6 cm) and the midfoot zone (C: base of the fifth metatarsal; D: navicular bone).]

An ANKLE x-ray is needed if there is pain in the malleolar zone plus at least one of the following:
• Bone tenderness at the posterior edge or tip of the lateral malleolus (A)
• Bone tenderness at the posterior edge or tip of the medial malleolus (B)
• Inability to bear weight for at least 4 steps immediately after injury and at the time of evaluation.

A FOOT x-ray is needed if there is pain in the midfoot zone plus at least one of the following:
• Bone tenderness at the base of the fifth metatarsal (C)
• Bone tenderness at the navicular bone (D)
• Inability to bear weight for at least 4 steps immediately after injury and at the time of evaluation.

Figure 33–5 The Ottawa Ankle Rules. (From Starkey, Examination of Orthopedic & Athletic Injuries, 4th ed. FA Davis, Philadelphia, 2015, with permission.)
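The decision logic in Figure 33-5 is simple enough to express as code. This is an illustrative sketch only; the function name and boolean inputs are invented for the example, not a validated implementation of the OARs:

```python
def ottawa_ankle_rules(malleolar_pain: bool, midfoot_pain: bool,
                       tender_lat_malleolus: bool, tender_med_malleolus: bool,
                       tender_5th_metatarsal: bool, tender_navicular: bool,
                       can_bear_weight_4_steps: bool) -> dict:
    """Return whether ankle and/or foot radiographs are indicated per Figure 33-5."""
    # Ankle film: malleolar-zone pain plus tenderness at (A) or (B),
    # or inability to bear weight for 4 steps
    ankle_xray = malleolar_pain and (
        tender_lat_malleolus or tender_med_malleolus or not can_bear_weight_4_steps)
    # Foot film: midfoot-zone pain plus tenderness at (C) or (D),
    # or inability to bear weight for 4 steps
    foot_xray = midfoot_pain and (
        tender_5th_metatarsal or tender_navicular or not can_bear_weight_4_steps)
    return {"ankle_xray": ankle_xray, "foot_xray": foot_xray}
```

For example, a patient with malleolar-zone pain and tenderness at the lateral malleolus meets the ankle criterion, whereas a patient with no pain in either zone needs neither film regardless of the other findings.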
Other examples of diagnostic prediction rules include guidelines for detecting deep venous thrombosis,56 pulmonary embolism,57 coronary artery disease,58 and sleep apnea.59 Guidelines have also been developed for ordering x-rays for knee injuries, called the Ottawa Knee Rules,60,61 and cervical spine injuries, called the Canadian C-Spine Rule.62

Prognosis
Clinical prediction rules can also be established to determine the degree to which certain outcomes can be expected.

Dionne and colleagues63 developed a CPR to predict the 2-year work disability status of people who were treated for back pain in primary care settings. Based on data from 337 patients, they derived a model based on seven predictors to identify those who did not return to work (failure is a positive test) versus those who could be expected to return to work (success is a negative test). Predictors included patient expectations, radiating pain, previous back surgery, pain intensity, changing position, irritable temper, and difficulty sleeping. They found 16.9% prevalence of an adverse outcome, with 57 patients failing to return to work, and 280 returning to work—indicating 79% sensitivity and 64% specificity.

These data translated to a PV+ of 31% and a PV– of 94%. The high PV– indicates a high probability that a person who tests negative on the prediction rule is able to return to work. The authors suggest that although decision rules often focus on positive predictive values to confirm a diagnosis, the nature of the outcome must determine the more relevant measure of diagnostic validity. In this case, considering the frequency of back pain and the resources spent on its care even with benign cases, an instrument that allows identification of those at low risk of adverse outcome is quite useful.

Other examples of prognostic clinical prediction rules include identifying risk for functional decline in older community-dwelling women,64 identifying patients at risk of complications following cardiac surgery,65 and identifying factors that lead to hospitalization with asthma.66

Response to Intervention
Clinical prediction rules have also been developed to determine the likelihood that a patient will respond positively to a specific intervention.

Margolis et al67 used a retrospective cohort design to develop a prediction model to determine in which patients' venous leg ulcers will heal within 24 weeks with standard limb compression bandages. Data confirmed a simple prognostic model including two variables: wound area >5 cm2 (1 point) and duration >6 months (1 point). A score of 0 (negative test) was predictive of healing with compression bandages. They validated the data in another sample, confirming these findings.

Their data resulted in sensitivity of 65% and specificity of 91%, with PV+ = 59% and PV– = 93%. Therefore, healthcare providers can be reasonably certain that using compression bandages will be effective in patients with a score of 0. Patients with scores of 1 or 2 are less
likely to heal and should be considered for other therapies. This model was supported through construction of a ROC curve with area under the curve of .87, indicating that the model correctly discriminated between ulcers that healed as compared to those that did not heal 87% of the time.

It is important to note that CPRs are often misunderstood in this context, as providing evidence of treatment effectiveness. Because there are no control subjects in a prediction rule study that did not receive the treatment, it is impossible to infer whether treatment was more or less effective than an alternative. For instance, in the leg ulcer study, all subjects got the compression bandages, and we can say that those with certain characteristics got better with that intervention. However, we cannot say what might have happened had they not received the intervention.

Other examples of CPRs for intervention response include identifying patients who will benefit from cervical manipulation for neck pain,68 from spinal manipulation for low back pain,69 from patellar taping for anterior knee pain,70 and prediction of the need for hospitalization of children with acute asthma exacerbations using predictors available upon admission to the emergency department.71

Validating Clinical Prediction Rules

The development of a clinical prediction rule is a three-step process, involving creation of the rule, validation, and assessment of the impact on clinical practice.40,72 Some studies may include all three steps, but most will be published separately.

Deriving the Model
First, the factors that potentially contribute to prediction of the outcome are identified in a cohort of patients. This allows for derivation of the rule, establishing which variables are most predictive. This requires identification of a set of predictors and the relevant outcome, specified as a reference standard for those with and without the condition of interest. Data collection may include questionnaires, interviews, or examination of medical records. The adequacy of a prediction rule is dependent on which variables are included in the derivation and how the outcome is assessed.

CPRs are usually analyzed using logistic regression, often with stepwise procedures to delimit the predictors to the smallest number that can most accurately predict the outcome. This was illustrated in the study of limb compression bandages, which started with a complex model including 10 variables, such as race, wound area, inability to walk one block, and other wound characteristics. The final equation needed only wound area and duration to create a strong prediction model.67 The area under the ROC curve was 89% for the 10-item rule versus 87% for the 2-item model.

Validation of the Rule
The second step requires validation of the rule in several cohorts in different settings to demonstrate that the model continues to work for other individuals besides those used to derive it. This is an essential step to assure that chance was not a factor in the initial study and to verify that the study sample was not unique. This step may require several studies to fully assess how well the rule works in a variety of settings.

In the study of return to work status, Dionne et al63 randomly divided their total sample into two subgroups, the first used to derive the model and the second to validate it. This is called a split sample. They found strong confirmation, noting that 37% of subjects were correctly classified in the initial sample and 40% in the validation sample.

This technique of using a split sample is useful for early validation, but further validation studies should be run on different samples taken from other settings with characteristics that are different from the original sample, as well as with other investigators. Unfortunately, many prediction rules are adopted without the benefit of this validation.

Impact Analysis
Finally, an impact analysis will demonstrate if the rule has changed clinician behavior and resulted in beneficial outcomes. A CPR is only effective if it is widely implemented, if it can be applied accurately, and if it can be shown to reduce costs or risk. The Ottawa Ankle Rules, for example, have been studied in many countries and settings, and with different populations, all reporting significant decreases in the number of unnecessary ankle radiographs.47,73 Even with the widespread acceptance of this CPR, however, researchers and clinicians continue to test its validity, with varying degrees of success.52

Impact analyses are best performed using a randomized design, whereby patients are managed with usual care or the CPR. For example, the OARs have been tested against usual care and other prediction rules, consistently showing that the number of x-rays is reduced and that negative tests are accurate.74-77

Levels of Evidence for CPRs
McGinn et al40 have proposed a hierarchy of evidence to judge the applicability of a clinical decision rule, based on its having gone through the full process of validation (see Fig. 33-6). Widespread use of a CPR is not recommended until it has been validated in at least one prospective study in a variety of settings, and an impact analysis has demonstrated its clinical utility.
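Predictive values like those reported earlier for the Dionne return-to-work rule follow directly from sensitivity, specificity, and prevalence. A quick numerical check in Python (a standard Bayes computation, not the authors' own analysis code):

```python
def predictive_values(sens: float, spec: float, prev: float) -> tuple:
    """Positive and negative predictive values from sensitivity, specificity, prevalence."""
    pv_pos = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    pv_neg = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return pv_pos, pv_neg

# Dionne et al: 57 of 337 patients failed to return to work (prevalence ~16.9%),
# with 79% sensitivity and 64% specificity
pv_pos, pv_neg = predictive_values(0.79, 0.64, 57 / 337)
```

This reproduces the reported PV+ of 31% and PV– of 94%, and makes explicit why a rule with modest specificity can still be useful when its purpose is to rule out an adverse outcome.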
[Figure 33–6 shows a four-level hierarchy of evidence; only portions are recoverable in this copy, including: "At least one prospective validation in a different population AND one impact analysis demonstrating change in clinician behavior with beneficial consequences" and "Wide use of the CPR in a variety of settings, influencing clinician behavior and improving patient outcomes."]

Figure 33–6 Four levels of evidence for clinical prediction rules. Adapted from McGinn TG, Guyatt GH, Wyer PC, et al. Users' guides to the medical literature: XXII: how to use articles about clinical decision rules. JAMA 2000;284:79-84, Table 1, p. 81.
COMMENTARY

A diagnostic decision can be consequential, depending on the seriousness of the condition, as it will generally direct decisions regarding further management. If the diagnosis is incorrect (a false positive or false negative), then the intervention path (or lack of one) may be useless at best or harmful at worst. The cognitive process of clinical reasoning is not black and white, and how we proceed through it can be influenced by many factors.

Bayes' theorem suggests that we stay open to new information in decision-making but does not tell us how to revise our opinions, or how to reconcile quantitative findings with conflicting clinical intuition—emphasizing the need to improve our skills in critically appraising evidence.14 Bordage78 offers three types of potential diagnostic errors:

• Data gathering errors such as ineffective history taking (too little or too much information), missing symptoms, overreliance on someone else's history, or failure to validate findings with the patient.
• Data integration errors such as failing to consider findings, faulty estimate of prevalence, misconceptions about causal factors, or over- or underestimating meaning of information.
• Situational factors such as not enough time to pursue needed information or consultation.

Elstein and Schwartz14 also suggest that we are influenced by recalling recent diagnostic experiences, being overwhelmed by the complexity of a case, looking for more common conditions, and our own clinical experience. Data have also shown that the order in which we gather evidence can be a factor, as we tend to put more weight on more recent information.79

The use of evidence to improve diagnostic certainty is essential. At any point, we may seek new evidence, always recognizing that our current evidence may be insufficient or will be outdated in the future. Using evidence provides no guarantees—and diagnostic measures should not dictate decisions but only inform them.
REFERENCES

1. Cambridge Dictionary. Available at https://dictionary.cambridge.org/us/. Accessed August 5, 2019.
2. Schousboe JT. Epidemiology of vertebral fractures. J Clin Densitom 2016;19(1):8-22.
3. Yoh K, Kuwabara A, Tanaka K. Detective value of historical height loss and current height/knee height ratio for prevalent vertebral fracture in Japanese postmenopausal women. J Bone Miner Metab 2014;32(5):533-538.
4. Gatsonis C, Paliwal P. Meta-analysis of diagnostic and screening test accuracy evaluations: methodologic primer. AJR Am J Roentgenol 2006;187(2):271-281.
5. Deeks JJ. Systematic reviews in health care: Systematic reviews of evaluations of diagnostic and screening tests. BMJ 2001;323(7305):157-162.
6. Tatsioni A, Zarin DA, Aronson N, Samson DJ, Flamm CR, Schmid C, Lau J. Challenges in systematic reviews of diagnostic technologies. Ann Intern Med 2005;142(12 Pt 2):1048-1055.
7. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, Lijmer JG, Moher D, Rennie D, de Vet HC. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ 2003;326(7379):41-44.
8. Siminoski K, Warshawski RS, Jen H, Lee K. The accuracy of historical height loss for the detection of vertebral fractures in postmenopausal women. Osteoporos Int 2006;17(2):290-296.
9. Centre for Clinical Research and Biostatistics (CCRB). C.I. Calculator: Diagnostic Statistics. Available at https://www2.ccrb.cuhk.edu.hk/stat/confidence%20interval/Diagnostic%20Statistic.htm. Accessed February 20, 2019.
10. Straus SE, Glasziou P, Richardson WS, Haynes RB, Pattani R, Veroniki AA. Evidence-Based Medicine: How to Practice and Teach EBM. 5th ed. London: Elsevier; 2019.
11. Price R. An Essay towards solving a Problem in the Doctrine of Chances. By the Late Rev. Mr. Bayes, F. R. S. Communicated by Mr. Price, in a Letter to John Canton, A. M. F. R. S. Philos Transactions (1683-1775) 1763;53:370-418.
12. McGrayne SB. The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines & Emerged Triumphant from Two Centuries of Controversy. New Haven, CT: Yale University Press; 2011.
13. Davidson M. The interpretation of diagnostic test: a primer for physiotherapists. Aust J Physiother 2002;48(3):227-232.
14. Elstein AS, Schwartz A. Clinical problem solving and diagnostic decision making: selective review of the cognitive literature. BMJ 2002;324(7339):729-732.
15. Cahan A, Gilon D, Manor O, Paltiel O. Clinical experience did not reduce the variance in physicians' estimates of pretest probability in a cross-sectional survey. J Clin Epidemiol 2005;58(11):1211-1216.
16. Phelps MA, Levitt MA. Pretest probability estimates: a pitfall to the clinical utility of evidence-based medicine? Acad Emerg Med 2004;11(6):692-694.
17. Waterloo S, Ahmed LA, Center JR, Eisman JA, Morseth B, Nguyen ND, Nguyen T, Sogaard AJ, Emaus N. Prevalence of vertebral fractures in women and men in the population-based Tromsø Study. BMC Musculoskelet Disord 2012;13:3.
18. Van den Ende J, Moreira J, Basinga P, Bisoffi Z. The trouble with likelihood ratios. Lancet 2005;366(9485):548.
19. Attia J. Moving beyond sensitivity and specificity: using likelihood ratios to help interpret diagnostic tests. Aust Prescr 2003;26(3):111-113.
20. Geyman JP, Deyo RA, Ramsey SD. Evidence-Based Clinical Practice: Concepts and Approaches. Boston: Butterworth Heinemann; 2000.
21. Fagan TJ. Letter: Nomogram for Bayes theorem. N Engl J Med 1975;293(5):257.
22. Mercaldo ND, Lau KF, Zhou XH. Confidence intervals for predictive values with an emphasis to case-control studies. Stat Med 2007;26(10):2170-2183.
23. Center for Evidence Based Medicine. Likelihood Ratio nomogram. Available at http://www.cebm.net/shockwave/nomogram.html. Accessed February 20, 2019.
24. Schwartz A. Diagnostic Test Calculator. Available at http://araw.mede.uic.edu/cgi-bin/testcalc.pl. Accessed February 20, 2019. (Plots the nomogram from raw data.)
25. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2018. CA Cancer J Clin 2018;68(1):7-30.
26. Mettlin C, Lee F, Drago J, Murphy GP. The American Cancer Society National Prostate Cancer Detection Project. Findings on the detection of early prostate cancer in 2425 men. Cancer 1991;67(12):2949-2958.
27. Wolf AM, Wender RC, Etzioni RB, Thompson IM, D'Amico AV, Volk RJ, Brooks DD, Dash C, Guessous I, Andrews K, DeSantis C, Smith RA. American Cancer Society guideline for the early detection of prostate cancer: update 2010. CA Cancer J Clin 2010;60(2):70-98.
28. Carter HB. Prostate-specific antigen (PSA) screening for prostate cancer: Revisiting the evidence. JAMA 2018;319(18):1866-1868.
29. Hoffman RM, Clanon DL, Littenberg B, Frank JJ, Peirce JC. Using the free-to-total prostate-specific antigen ratio to detect prostate cancer in men with nonspecific elevations of prostate-specific antigen levels. J Gen Intern Med 2000;15(10):739-748.
30. Thompson IM, Ankerst DP, Chi C, Lucia MS, Goodman PJ, Crowley JJ, Parnes HL, Coltman CA Jr. Operating characteristics of prostate-specific antigen in men with an initial PSA level of 3.0 ng/ml or lower. JAMA 2005;294(1):66-70.
31. Wilbur J. Prostate cancer screening: the continuing controversy. Am Fam Physician 2008;78(12):1377-1384.
32. Barocas DA, Mallin K, Graves AJ, Penson DF, Palis B, Winchester DP, Chang SS. Effect of the USPSTF Grade D recommendation against screening for prostate cancer on incident prostate cancer diagnoses in the United States. J Urol 2015;194(6):1587-1593.
33. Moyer VA. Screening for prostate cancer: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med 2012;157(2):120-134.
34. US Preventive Services Task Force. Screening for prostate cancer: US Preventive Services Task Force recommendation statement [published online May 8, 2018]. JAMA doi:10.1001/jama.2018.3710.
35. US Preventive Services Task Force. Grade definitions, 2012. Available at https://www.uspreventiveservicestaskforce.org/Page/Name/grade-definitions. Accessed June 20, 2017.
36. Centor RM. Signal detectability: the use of ROC curves and their analyses. Med Decis Making 1991;11(2):102-106.
37. Akobeng AK. Understanding diagnostic tests 3: Receiver operating characteristic curves. Acta Paediatr 2007;96(5):644-647.
38. Fan J, Upadhye S, Worster A. Understanding receiver operating characteristic (ROC) curves. CJEM 2006;8(1):19-20.
39. Laupacis A, Sekar N, Stiell IG. Clinical prediction rules. A review and suggested modifications of methodological standards. JAMA 1997;277(6):488-494.
40. McGinn TG, Guyatt GH, Wyer PC, Naylor CD, Stiell IG, Richardson WS. Users' guides to the medical literature: XXII: how to use articles about clinical decision rules. Evidence-Based Medicine Working Group. JAMA 2000;284(1):79-84.
41. Wasson JH, Sox HC, Neff RK, Goldman L. Clinical prediction rules. Applications and methodological standards. N Engl J Med 1985;313(13):793-799.
42. Stiell IG, McKnight RD, Greenberg GH, McDowell I, Nair RC, Wells GA, Johns C, Worthington JR. Implementation of the Ottawa ankle rules. JAMA 1994;271(11):827-832.
43. Emparanza JI, Aginaga JR. Validation of the Ottawa Knee Rules. Ann Emerg Med 2001;38(4):364-368.
44. Broomhead A, Stuart P. Validation of the Ottawa Ankle Rules in Australia. Emerg Med (Fremantle) 2003;15(2):126-132.
45. Yazdani S, Jahandideh H, Ghofrani H. Validation of the Ottawa Ankle Rules in Iran: A prospective survey. BMC Emerg Med 2006;6:3.
46. Meena S, Gangary SK. Validation of the Ottawa Ankle Rules in Indian scenario. Arch Trauma Res 2015;4(2):e20969.
47. Can U, Ruckert R, Held U, Buchmann P, Platz A, Bachmann LM. Safety and efficiency of the Ottawa Ankle Rule in a Swiss population with ankle sprains. Swiss Med Wkly 2008;138(19-20):292-296.
48. Leisey J. Prospective validation of the Ottawa Ankle Rules in a deployed military population. Mil Med 2004;169(10):804-806.
49. Gravel J, Hedrei P, Grimard G, Gouin S. Prospective validation and head-to-head comparison of 3 ankle rules in a pediatric population. Ann Emerg Med 2009;54(4):534-540.e1.
50. Stiell IG, Bennett C. Implementation of clinical decision rules in the emergency department. Acad Emerg Med 2007;14(11):955-959.
51. McBride KL. Validation of the Ottawa Ankle Rules. Experience at a community hospital. Can Fam Physician 1997;43:459-465.
52. Bachmann LM, Kolb E, Koller MT, Steurer J, ter Riet G. Accuracy of Ottawa Ankle Rules to exclude fractures of the ankle and mid-foot: Systematic review. BMJ 2003;326(7386):417.
53. Beckenkamp PR, Lin CC, Macaskill P, Michaleff ZA, Maher CG, Moseley AM. Diagnostic accuracy of the Ottawa Ankle and Midfoot Rules: a systematic review with meta-analysis. Br J Sports Med 2017;51(6):504-510.
54. Jenkin M, Sitler MR, Kelly JD. Clinical usefulness of the Ottawa Ankle Rules for detecting fractures of the ankle and midfoot. J Athl Train 2010;45(5):480-482.
55. Gwilym SE, Aslam N, Ribbans WJ, Holloway V. The impact of implementing the Ottawa Ankle Rules on ankle radiography requests in A&E. Int J Clin Pract 2003;57(7):625-627.
56. Goodacre S, Sutton AJ, Sampson FC. Meta-analysis: The value of clinical assessment in the diagnosis of deep venous thrombosis. Ann Intern Med 2005;143(2):129-139.
57. Righini M, Bounameaux H. External validation and comparison of recently described prediction rules for suspected pulmonary embolism. Curr Opin Pulm Med 2004;10(5):345-349.
58. Genders TS, Steyerberg EW, Alkadhi H, Leschka S, Desbiolles L, Nieman K, et al. A clinical prediction rule for the diagnosis of coronary artery disease: Validation, updating, and extension. Eur Heart J 2011;32(11):1316-1330.
59. Miranda Serrano E, Lopez-Picado A, Etxagibel A, Casi A, Cancelo K, Aguirregomoscorta JI, et al. Derivation and validation of a clinical prediction rule for sleep apnoea syndrome for use in primary care. BJGP Open 2018;2(2):bjgpopen18X101481.
60. Stiell IG, Greenberg GH, Wells GA, McDowell I, Cwinn AA, Smith NA, Cacciotti TF, Sivilotti ML. Prospective validation of a decision rule for the use of radiography in acute knee injuries. JAMA 1996;275(8):611-615.
61. Stiell IG, Wells GA, Hoag RH, Sivilotti ML, Cacciotti TF, Verbeek PR, Greenway KT, McDowell I, Cwinn AA, Greenberg GH, Nichol G, Michael JA. Implementation of the Ottawa Knee Rule for the use of radiography in acute knee injuries. JAMA 1997;278(23):2075-2079.
62. Stiell IG, Wells GA, Vandemheen KL, Clement CM, Lesiuk H, De Maio VJ, Laupacis A, Schull M, McKnight RD, Verbeek R, Brison R, Cass D, Dreyer J, Eisenhauer MA, Greenberg GH, MacPhail I, Morrison L, Reardon M, Worthington J. The Canadian C-spine rule for radiography in alert and stable trauma patients. JAMA 2001;286(15):1841-1848.
63. Dionne CE, Bourbonnais R, Fremont P, Rossignol M, Stock SR, Larocque I. A clinical return-to-work rule for patients with back pain. CMAJ 2005;172(12):1559-1567.
64. Sarkisian CA, Liu H, Gutierrez PR, Seeley DG, Cummings SR, Mangione CM. Modifiable risk factors predict functional decline among older women: a prospectively validated clinical prediction tool. The Study of Osteoporotic Fractures Research Group. J Am Geriatr Soc 2000;48(2):170-178.
65. Fortescue EB, Kahn K, Bates DW. Prediction rules for complications in coronary bypass surgery: a comparison and methodological critique. Med Care 2000;38(8):820-835.
66. Schatz M, Cook EF, Joshua A, Petitti D. Risk factors for asthma hospitalizations in a managed care organization: development of a clinical prediction rule. Am J Manag Care 2003;9(8):538-547.
67. Margolis DJ, Berlin JA, Strom BL. Which venous leg ulcers will heal with limb compression bandages? Am J Med 2000;109(1):15-19.
68. Tseng YL, Wang WT, Chen WY, Hou TJ, Chen TC, Lieu FK. Predictors for the immediate responders to cervical manipulation in patients with neck pain. Man Ther 2006;11(4):306-315.
69. Fritz JM, Childs JD, Flynn TW. Pragmatic application of a clinical prediction rule in primary care to identify patients with low back pain with a good prognosis following a brief spinal manipulation intervention. BMC Fam Pract 2005;6(1):29.
70. Lesher JD, Sutlive TG, Miller GA, Chine NJ, Garber MB, Wainner RS. Development of a clinical prediction rule for classifying patients with patellofemoral pain syndrome who respond to patellar taping. J Orthop Sports Phys Ther 2006;36(11):854-866.
71. Arnold DH, Gebretsadik T, Moons KG, Harrell FE, Hartert TV. Development and internal validation of a pediatric acute asthma prediction rule for hospitalization. J Allergy Clin Immunol Pract 2015;3(2):228-235.
72. Childs JD, Cleland JA. Development and application of clinical prediction rules to improve decision making in physical therapist practice. Phys Ther 2006;86(1):122-131.
73. Nugent PJ. Ottawa Ankle Rules accurately assess injuries and reduce reliance on radiographs. J Fam Pract 2004;53(10):785-788.
74. Derksen RJ, Knijnenberg LM, Fransen G, Breederveld RS, Heymans MW, Schipper IB. Diagnostic performance of the Bernese versus Ottawa Ankle Rules: Results of a randomised controlled trial. Injury 2015;46(8):1645-1649.
75. Ho JK, Chau JP, Chan JT, Yau CH. Nurse-initiated radiographic-test protocol for ankle injuries: A randomized controlled trial. Int Emerg Nurs 2018;41:1-6.
76. Auleley GR, Ravaud P, Giraudeau B, Kerboull L, Nizard R, Massin P, Garreau de Loubresse C, Vallee C, Durieux P. Implementation of the Ottawa Ankle Rules in France. A multicenter randomized controlled trial. JAMA 1997;277(24):1935-1939.
77. Tajmir S, Raja AS, Ip IK, Andruchow J, Silveira P, Smith S, Khorasani R. Impact of clinical decision support on radiography for acute ankle injuries: randomized trial. West J Emerg Med 2017;18(3):487-495.
78. Bordage G. Why did I miss the diagnosis? Some cognitive explanations and educational implications. Acad Med 1999;74(10 Suppl):S138-S143.
79. Bergus GR, Chapman GB, Gjerde C, Elstein AS. Clinical reasoning about new symptoms despite preexisting disease: sources of error and order effects. Fam Med 1995;27(5):314-320.
CHAPTER 34

Epidemiology: Measuring Risk
presence. Epidemiologic questions often arise out of clinical experience, laboratory findings, or public health concerns about the relationship between societal practices and disease outcomes. Through the analysis of health status indicators and population characteristics, epidemiologists try to identify and explain the causal factors in disease patterns.
As medical cures and treatments have been developed to control many of these problems, and as patterns of disease have changed, the scope of epidemiology has broadened. Today clinical epidemiology includes the study of chronic disease, disability, and health status. This approach fits with the World Health Organization's definition of health, which encompasses social, psychological, and physical well-being.1

The terms disease, disorder, and disability will be used interchangeably to represent health outcomes, recognizing the variety of conditions of interest, including illness, injury, and physical, psychological, or social dysfunction.

■ Descriptive Epidemiology

Descriptive epidemiologic studies are done when little is known about the occurrence or determinants of health conditions. They will often provide information that can be used to set priorities for healthcare planning and will generate hypotheses that can be studied using analytic methods. Descriptive studies may be presented as case reports, correlational studies, or cross-sectional surveys.

Person, Place, and Time
The purpose of descriptive epidemiologic studies is to describe patterns of health, disease, and disability in terms of person, place, and time.

WHO Experiences this Disorder?
Relevant characteristics might include age, sex, religion, race, cultural background, education, socioeconomic status, occupation, and so on. This is the demography of the disorder. Epidemiologists try to determine if individuals with certain characteristics are more at risk for a particular disorder than others. For example, researchers have studied the increasing prevalence of type 2 diabetes in adolescents,2 the incidence of incontinence in women over age 45,3 and more recently the outbreak of measles among school-aged children.4

WHERE does the Disorder Occur?
Epidemiologists will look at environmental factors such as weather, local industry, water and food sources, and lifestyle as potential causative factors. For instance, the early studies in AIDS documented high incidence in San Francisco and New York.5 Legionnaire's disease6 and severe acute respiratory syndrome (SARS)7 are other examples of diseases that had specific geographic origins (see Box 34-1).

WHEN does the Disorder Occur Most or Least Frequently?
The epidemiologist will compare the present frequency of a disorder with that of different time periods. When the frequency of occurrence varies significantly at one point in time, some specific time-related causative factor is sought. Seasonal variations may become obvious, or trends may be related to other historical factors. For example, researchers have found a higher incidence of hip fractures in elderly individuals during winter months8 and an increased rate of hospitalization due to adult asthma symptoms in spring months.9

Disease Frequency
The statistical measures used to describe epidemiologic outcomes focus on quantification of disease occurrence. The simplest measure of disease frequency would be a count of the number of affected individuals. However, meaningful interpretation and comparisons of such a measure would also require knowing how many people there were in the total population who could have gotten the disease and the length of time over which the occurrence of the disease was monitored. Therefore, measures of disease frequency will always include reference to population size and time period of observation.
For example, we might document 35 cases of a disease within 1 year in a population of 3,200 people, or 35/3,200/year. Typically, population size is expressed in terms of thousands, such as 1,000 (10^3), 10,000 (10^4), and 100,000 (10^5). For instance, the preceding values could be expressed as 109.4 cases per 10,000 per year. To make estimates more useful, such rates are usually calculated in whole numbers, such as 1,094/100,000/year.
The number of cases of a disease that exist in a population reflects the risk of disease for that group. It describes the relative importance of the disease and can provide a basis for comparison with other groups who may have different exposure histories. The two most common measures of disease frequency are prevalence and incidence.
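The rate arithmetic in the example above (35 cases per 3,200 people per year, re-expressed per 10,000 and per 100,000) can be checked directly; a minimal sketch:

```python
# 35 cases in 1 year among a population of 3,200
cases, population = 35, 3200
rate_per_10k = cases / population * 10_000    # cases per 10,000 per year
rate_per_100k = cases / population * 100_000  # cases per 100,000 per year
print(round(rate_per_10k, 1))   # 109.4
print(round(rate_per_100k))     # 1094
```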
Prevalence
that time. Prevalent cases are all persons with a particular health outcome within a specified time period, regardless of when they developed or were diagnosed with the disorder. Prevalence (P) is calculated as:

P = (number of existing cases at a given point) / (total population at risk)

We know that obesity has become a national concern. The 2000 National Health Interview Survey data showed that the number of adults with self-reported obesity was 7,058 out of a sample of 32,375.14 The prevalence of obesity in this population is 7,058/32,375 = .22, or 22%. Because this value reflects the cross-sectional status of the population at a single point in time, it is also called point prevalence.
Period prevalence can be established for a specified period in time. For example, data obtained from a random sample of 973 newspaper employees found that the number of individuals categorized as having upper limb musculoskeletal complaints after one year was 395.15 Therefore, this estimate, combining existing with new cases of musculoskeletal complaints during the period of 1 year, is 41%.

Prevalence was also discussed in Chapter 33 in reference to establishing estimates of pretest probability for diagnostic testing.

Prevalence is most useful as an indicator for planning health services, because it reflects the impact of a disease on the population. Therefore, a measure of prevalence can be used to project requirements such as healthcare personnel, community services, specialized medical equipment, or number of hospital beds. Prevalence should not, however, be used as a basis for examining etiology of a disease because it is influenced by the length of survival of those with the disorder; that is, prevalence is a function of both the number of individuals who develop the disease and the duration or severity of the illness. Because this estimate looks at the total number of individuals who have the disease at a given time, that number will be large if the disease tends to be of long duration.
Burdorf et al16 studied the onset of low back pain in 196 novice golfers. In their initial assessment, they found that 80 subjects (41% of the sample) had experienced back pain before the study or at the time of entry (these were prevalent cases). They followed the sample over a 1-year period, and identified 16 new cases.
Therefore, the 1-year cumulative incidence of first-time back pain for this cohort was 16/196 = .08, or 8%. The specification of the time period of observation is essential to the interpretation of this value. The number of cases would be perceived differently if subjects were followed for 1 or 10 years. Other issues that require consideration in interpreting a measure of cumulative incidence include the possibility that the number of individuals at risk in the cohort will vary over time and the possibility that the condition under study is caused by other, competing risks.

Person-Time
Measuring the total population at risk for cumulative incidence assumes that all subjects were followed for the entire observation period. However, some individuals in the population may enter the study at different times, some may drop out, and others who acquire the disease are no longer at risk. Therefore, the length of the follow-up period is not uniform for all participants. To account for these differences, incidence rate (IR) can be calculated:

IR = (number of new cases during a given period) / (total person-time)
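The frequency measures defined above can be verified with the chapter's own figures (the NHIS obesity sample for point prevalence and the novice-golfer cohort for cumulative incidence); a small sketch:

```python
# Point prevalence: existing cases / total population at risk
prevalence = 7058 / 32375      # 2000 NHIS self-reported obesity
# Cumulative incidence: new cases / population at risk, over 1 year
cum_incidence = 16 / 196       # first-time back pain in novice golfers
print(round(prevalence, 2))    # 0.22
print(round(cum_incidence, 2)) # 0.08
```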
in this case 1,794,565 person-years of observation. The incidence rate was, therefore, 3,606/1,794,565 = .002, or 2 cases per 1,000 person-years (2 × 10^-3). Incidence rate is often a more efficient measure than cumulative incidence, as it allows for inclusion of all subjects, regardless of the amount of time they were able to participate. Cumulative incidence would only account for those subjects who were available for the entire study period.

The Relationship Between Prevalence and Incidence
The relationship between prevalence and incidence is a function of the average duration of the outcome of interest. If the incidence of the disorder is low (few new cases occur), but the duration of the disorder is long, then the prevalence or proportion of the population that has the disease at a given point in time may be large. If, however, incidence is high (many new cases of the disease occur), but the disorder is manifest for a short duration (either by quick recovery or death), the prevalence may be low. For example, a chronic disease such as arthritis may have a low incidence but high prevalence. A short-duration condition like a common cold may have a high incidence but low prevalence, because lots of people get colds but few actually have colds at any one point in time.

mortality specifically resulting from diseases such as cancer and heart disease or from motor vehicle accidents. The case-fatality rate is the number of deaths from a disease relative to the number of individuals who had the disease during a given time period.
Other commonly used categories are age, sex, and race. Age-specific rates are probably most common because of the differential effect of many diseases across the life span. For example, if one looks at the death rate for cancer across age groups, we would find that mortality was higher for older age categories. Therefore, it may be more meaningful to present age-specific mortality rates for each decade of life, rather than a crude mortality rate. However, this results in a long list of rates that may not be useful for certain comparisons. An overall rate would be more practical, but it would have to account for the variation in rates across age categories. For instance, if we compare the crude cancer mortality rate for today versus the crude rate from 50 years ago, we would have to account for the fact that a larger proportion of the total population now falls in the older age range. Therefore, epidemiologists will often report age-adjusted mortality rates that reflect different weightings for the uneven categories. Methods for calculating adjusted rates are described in most epidemiology texts.
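To make the idea of age adjustment concrete, here is a sketch of direct standardization: each age-specific rate is weighted by the standard population's share of that age group. The strata, rates, and weights below are hypothetical illustrations, not data from the text:

```python
# Direct age adjustment (hypothetical data): weight each age-specific
# rate by the standard population's proportion in that age group.
strata = [
    (50.0, 0.60),    # deaths per 100,000, weight: under 45
    (300.0, 0.30),   # 45-64
    (1200.0, 0.10),  # 65 and over
]
age_adjusted = sum(rate * weight for rate, weight in strata)
print(age_adjusted)  # 240.0 deaths per 100,000
```

Two crude rates computed on populations with different age structures are not comparable; weighting both by the same standard population removes that difference.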
causes the outcome. Criteria for evaluating causation were described in Chapter 19.

Relative vs Absolute Effects
Analyses of association are based on a measure of effect that looks at the frequency of disease among those who were and were not exposed to the risk factor. A relative effect is a ratio that describes the risks associated with the exposed group as compared to those who were not exposed. An absolute effect is the actual difference between the rate of disease in the exposed and unexposed groups, or the difference in the risk of developing the disease between these two groups. To illustrate the concepts of relative and absolute effect, suppose we purchased two books, one costing $3 and the other $6. The absolute difference is $3, whereas the relative difference is that the second book is twice as expensive as the first. If two other books cost $10 and $20, the absolute difference would be $10, but the relative difference would also be twice the cost of the first book. Therefore, the relative effect is based on the absolute effect, but takes into account the baseline value. Analogously, we can use measures of incidence of disease in exposed and unexposed groups to determine both relative and absolute effects of particular exposures.

Relative Risk
To analyze risk, data are typically organized in a 2 × 2 contingency table, a format that should be familiar by now (see Table 34-1). The columns represent the classification of disorder status (the outcome), and the rows represent exposure status. For consistency in presentation and calculation, the cells in the table are designated a, b, c, and d, as shown. Therefore, cell a represents those who have the disorder and were exposed, cell b represents those who do not have the disorder and were exposed, and so on. The marginal totals are the number of individuals who were exposed (a + b) and not exposed (c + d), and the number of individuals who have the disorder (a + c) and do not have the disorder (b + d). The sum of all four cells is the total sample size (N).
The incidence of a health outcome represents its absolute risk (AR), which can be calculated for the exposed (ARE) and unexposed (AR0) groups. The ARE is the number of cases of the disorder among the total exposed sample, or a/(a + b). The AR0 is the number of cases of the disorder among the total unexposed sample, or c/(c + d). Each of these estimates reflects the probability of the health outcome for the exposed or unexposed groups.
Although these measures are informative, it is usually more relevant to ask how many times more likely are exposed persons to develop the disorder, relative to those who were not exposed. Relative risk (RR) (also called risk ratio) is the ratio of these two probabilities:

RR = ARE / AR0 = [a/(a + b)] / [c/(c + d)]

This value indicates the likelihood that someone who has been exposed to a risk factor will develop the disorder, as compared with one who has not been exposed. Measures of relative risk are appropriate for use with cohort or experimental studies.
If the incidence rates of the outcome are the same for the exposed and unexposed groups, the RR is 1.0, indicating that risk is the same for exposed and unexposed persons. This represents the null value for relative risk. A RR greater than 1.0 indicates that exposed persons have a greater risk of developing the health outcome than unexposed. A RR less than 1.0 means that the exposure decreases the risk of developing the disorder.

The calculation for relative risk is based on the assumption that all subjects in the cohort were followed for the same amount of time. When follow-up time differs, it is the person-time for exposed and nonexposed groups that should be used for marginal totals, rather than just the total number of subjects in each category.

Table 34-1 Summary of Exposure and Disorder Frequencies
                         Disorder
                     Yes      No       Total
Exposure   Yes        a        b       a + b
           No         c        d       c + d
Total               a + c    b + d       N

➤ CASE IN POINT #1
Many studies have shown an inverse relationship between physical activity and risk of hip fracture. Hoidrup et al18 used a
DATA DOWNLOAD
Table 34-2 Relative Risk: Physical Activity and Risk of Hip Fracture (N = 130)
A. Crosstabulation: Activity level * Hip Fx (1)

                           Hip Fx
Count                    Yes    No    Total
Activity level ≥2 hr/wk   48    50      98
               Sedentary  20    12      32
Total                     68    62     130

(2) RR = [a/(a + b)] / [c/(c + d)] = (48/98) / (20/32) = 0.78

1 The crosstabulation shows the relationship between activity level (exposure) and hip fracture (disorder).
2 The relative risk is calculated for the risk of hip fracture, with the presence of hip fracture as the reference value (Hip Fx = Yes). Therefore, with ≥2 hr/wk of physical activity, the risk of hip fracture is reduced (less than 1.0).
4 These two values are ignored. SPSS automatically generates the OR, which is not relevant for a cohort study. The cohort value of RR will depend on which of the outcomes is considered the reference. If Hip Fx = No was the reference, the columns in the contingency table would be reversed.
5 The chi-square test indicates there is no significant difference (p = .184) between observed frequencies and what would be expected by chance.
Portions of the output have been omitted for clarity.
Data source: Høidrup S, Sørensen T, Strøger U, et al. Leisure-time physical activity levels and changes in relation to risk of hip fracture in men and women. Am J Epidemiol 2001;154:60-68.
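The values reported around Table 34-2 can be reproduced from the four cell counts. The sketch below uses the conventional ln-scale (Wald) confidence interval for RR and the Pearson chi-square with its normal-approximation p-value; these are standard formulas, not ones the text spells out:

```python
import math

# Cells from Table 34-2: a, b = active (>=2 hr/wk); c, d = sedentary
a, b, c, d = 48, 50, 20, 12
rr = (a / (a + b)) / (c / (c + d))                     # relative risk
se = math.sqrt(b / (a * (a + b)) + d / (c * (c + d)))  # SE of ln(RR)
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)

# Pearson chi-square on the 2x2 table, df = 1
n = a + b + c + d
expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
            (c + d) * (a + c) / n, (c + d) * (b + d) / n]
chi2 = sum((obs - exp) ** 2 / exp
           for obs, exp in zip([a, b, c, d], expected))
p = math.erfc(math.sqrt(chi2 / 2))                     # p-value for df = 1
print(round(rr, 2), round(lo, 3), round(hi, 3))        # 0.78 0.56 1.097
print(round(chi2, 3), round(p, 3))                     # 1.768 0.184
```

These match the RR of .78, the 95% CI of .560 to 1.097, and the chi-square of 1.768 (p = .184) reported for this study.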
question is: Does physical activity reduce the risk of hip fracture in elderly women? Physical activity is the exposure (≥2 hr/week compared to sedentary), and the occurrence of hip fracture is the health outcome.
Our first step is to determine the proportion of patients who exercised (the exposed group) who also sustained a hip fracture. This is the ARE of hip fracture among exercisers, 48 out of 98, or 49%. Then we determine what proportion of sedentary patients (the unexposed group) sustained a hip fracture. This is the AR0 of hip fracture for those who did not exercise, 20 out of 32, or 63%. Therefore, the RR = .78 for the hip fracture study (see Table 34-2 2). Because the RR is less than 1.0, physical activity would be considered a preventive factor, decreasing the risk of hip fracture compared to individuals who do not exercise. Those who were active at least 2 hours/week were .78 times as likely (~25% less likely) to suffer a hip fracture as compared with those who were sedentary.

Confidence Intervals for Relative Risk
An important assumption in any research study is that we can draw reasonable inferences about population characteristics based on sample data. This assumption holds true for epidemiologic studies as well. When a risk estimate is derived from a particular set of subjects, the researcher will use that estimate to make generalizations about expected behaviors or outcomes in others who have similar exposure histories. Therefore, it is important to determine a measure of true effect using a confidence interval. For the hip fracture study, 95% CI: .56 to 1.097 (Table 34-2 3). This interval contains the null value of 1.0 and therefore does not represent a significant risk estimate.
Chi-square can be used as a test of significance, to determine if the proportions differ across categories in a crosstabulation (see Chapter 28). In this example, the value of chi-square results in p = .184 (see Table 34-2 5), which is not significant. This confirms the conclusion drawn from the confidence interval. Chi-square tells us that the proportion of individuals with and without hip fracture who did or did not exercise was not different from what would be expected by chance.

➤ Results
Relative risk of hip fracture associated with physical activity was 0.78 (95% CI: 0.560 to 1.097). Although the RR shows a reduced risk for hip fracture with physical activity, this value is not significant (χ2(1) = 1.768, p = .184).

Odds Ratio
A case-control study differs from a cohort study in that subjects are purposefully chosen based on the presence or absence of disease (cases or controls) and therefore, we cannot determine the incidence rate of the disorder. Relative risk cannot be established because we cannot calculate absolute risk. The relative risk can be estimated, however, using an odds ratio (OR), which is calculated using the formula:

OR = (a/c) / (b/d) = ad/bc

The odds ratio is interpreted in the same way as relative risk, with a null value of 1.0.

➤ CASE IN POINT #2
Plantar fasciitis is a localized inflammatory condition leading to heel pain. Riddle et al19 conducted a case-control study to identify risk factors for the disorder, including body mass index (BMI). The researchers assembled a sample of 50 cases and 100 controls, with a 1:2 match on age and gender. Among the cases, 29 individuals had a BMI over 30 (considered obese); among the controls, 17 subjects had a BMI over 30.

Table 34-3 shows the distribution of scores for subjects with low and high BMI (exposure) and those with and without plantar fasciitis (outcome), with a crude odds ratio of 6.74. This means that the odds of developing plantar fasciitis are almost 7 times as much for those who are obese than for those who are not. The confidence interval has a wide range, but does not contain the null value, and therefore this estimate is significant, confirmed with a chi-square analysis (p < .001).

Please refer to Chapter 31 for a full explanation of the concept of odds and the derivation of odds ratios. Adjusted odds ratios can be generated as part of logistic regression analysis to account for the effect of multiple exposures on an outcome.

➤ Results
The odds ratio for the relationship between plantar fasciitis and BMI was 6.74 (95% CI: 3.13 to 14.51), which was significant (p < .001). Those with BMI above 30 were almost 7 times more likely to develop plantar fasciitis than those with BMI ≤30.

Power and Sample Size
Like other statistics, RR and OR are subject to Type I and Type II errors. To determine adequate sample size in planning, RR and OR are considered the effect size. Consider, for example, the study of physical activity and hip fracture (see Table 34-2), which resulted in RR = .78 (p = .184). It would be of interest to determine if this result could be an issue of a small sample. To solve for sample size, we must estimate RR or OR and the expected proportion of exposure among cases and controls. Rather than being concerned about power, for this analysis we must specify the degree of precision we want in our estimate. For instance, we could specify that we would like our results to be within 10% or 20% of the true risk. The greater the precision (lower percentage), the larger the needed sample size will be.

Estimating sample size for RR and OR requires working with proportions. See the Chapter 34 Supplement for illustration of these analyses for relative risk and odds ratios.

Effect Modification and Confounding
Quite often, in the analysis of the association between a risk factor and an outcome, clinicians recognize the potential influence of one or more extraneous factors that may obscure the association. In some cases, these extraneous variables provide important information to understand how the association varies across different
DATA DOWNLOAD
Table 34-3 Odds Ratio: Body Mass Index and Plantar Fasciitis (N = 150)
A. Crosstabulation: BMI * Plantar fasciitis (1)

                 Plantar fasciitis
Count           Yes    No    Total
BMI  >30         29    17      46
     ≤30         21    83     104
Total            50   100     150

(2) OR = ad/bc = (29)(83) / (21)(17) = 6.74

1 The crosstabulation shows the relationship between BMI (exposure) and plantar fasciitis (disorder).
2 The odds ratio is calculated for the risk of plantar fasciitis, with the presence of plantar fasciitis as the reference value. Therefore, with BMI >30, the risk of plantar fasciitis is increased 6.74 times.
subgroups, such as age or gender. In other situations, such variables create a bias in the interpretation, interfering with the true association being studied. These two complications of analysis are called effect modification and confounding.
The simplest type of analyses are based on crude data. These are data concerning the exposure and outcome status of all subjects regardless of any other risks or characteristics. Although analyses based on crude data are often reported in the literature, many studies require more complicated analyses to evaluate the role of other factors in the relationship of exposure and outcome. These analyses are accomplished through stratification or multivariate methods such as logistic regression, to provide adjusted measures of association (see Chapter 31).

Effect Modification
Effect modification refers to a difference in the magnitude of an effect measure across levels of another variable. An effect modifier will change the relative risk associated with the exposure for different subgroups in the population. Researchers study and report effect modification by stratifying the sample to explain why the subgroups respond differently. An effect modifier will interact with the exposure and disease variables in such a way that the effect is not constant across subgroups. It is a natural phenomenon that exists independent of the study design and will always be a factor in interpretation of risk. Effect modifiers tend to be biologically related to the variables being studied.

➤ CASE IN POINT #3
Researchers have studied the association between diabetes and risk of endometrial cancer, with many conflicting results. Friberg et al20 hypothesized that this relationship could be misunderstood because of major modifiers such as physical activity. They studied a cohort of over 36,000 women over 7 years and stratified the sample by high or low physical activity.

Data for this study are shown in Table 34-4, showing how the distributions varied across strata.

➤ For the total sample, RR = 2.37, indicating that those with diabetes had more than two times the risk of cancer compared to those without diabetes. However, for those who were physically active, RR = 1.06, indicating that those with diabetes had no increased risk as compared to those without diabetes who were active. For those with low activity, the RR = 2.67. The risk associated with endometrial cancer increased by more than two times for those who had diabetes who did not exercise.

The fact that the risk estimates are different for each stratum indicates that the association between diabetes and endometrial cancer is significantly modified by physical activity.

Table 34-4 Association of Diabetes and Endometrial Cancer Stratified by Physical Activity
                                   N      RR (95% CI)
Total              No diabetes    203     2.37 (1.51-3.74)
                   Diabetes        22
High Physical      No diabetes    103     1.06 (.43-2.60)
Activity           Diabetes         5
Low Physical       No diabetes    100     2.67 (1.58-4.53)
Activity           Diabetes        17
Data source: Friberg E, Mantzoros CS, Wolk A. Diabetes and risk of endometrial cancer: a population-based prospective cohort study. Cancer Epidemiol Biomarkers Prev 2007;16(2):276-280.

Confounding
Confounding variables can be thought of as nuisance variables. They may or may not be present, depending on the source population and how subjects are chosen. Measures of association must be interpreted in terms of the potential for confounding effects of extraneous variables in the design. Confounding is introduced when extraneous variables interfere with the observed association between the exposure and outcome.
A confounder is a variable that (1) is associated with the exposure, (2) is a risk factor for the disease independent of the exposure, and (3) is not part of the causal link between the exposure and the disease. When confounding is present, the statistical outcome may not be documenting the true causal factor. A commonly used method to adjust for a potential confounder is stratification, in which the comparison between exposure and disease is done at specific levels of the potential confounder. When a study mentions that they controlled for a factor in the analysis, they have tried to remove the effect of that variable (see Focus on Evidence 34-1).

➤ CASE IN POINT #4
Jackson and coworkers24 examined the association between risk of mortality and receiving influenza vaccine in elders over 65 years. They studied 252 cases who died during an influenza season and 576 age-matched controls. The crude odds ratio was .76 (95% CI: .47 to 1.06), indicating that receiving the vaccine decreased the risk of death.

The confidence interval for the odds ratio contains 1.0, indicating that the difference in mortality between those who do and do not get the vaccine is not significant. The researchers were interested, however, in the potentially confounding effect of limited functional status, which would be associated with not getting the vaccine, would be a risk factor for mortality, and is not a causal link between the vaccine and mortality. Because older individuals who have functional limitations would be less likely to visit a clinic to get the vaccine, and mortality is also related to functional decline, the crude odds ratio may be an underestimate of the protective nature of the vaccine on mortality.

➤ When rates were adjusted for the effect of functional limitations, the odds ratio was lowered to .59 (95% CI: .41 to .83), and was now significant.

If there was no discrepancy between the crude and adjusted estimates, there would be no confounding. The degree of discrepancy is indicative of the extent to which function confounded the original data. By eliminating the effect of functional status, a stronger and significant relationship was seen between getting the vaccine and decreased mortality.
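The crude odds ratio and confidence interval reported for the plantar fasciitis example (Table 34-3) can be reproduced from the cell counts. The sketch below uses the standard ln-scale (Wald) interval for the OR, a formula the text does not show explicitly:

```python
import math

# Cells from Table 34-3: a, b = BMI > 30; c, d = BMI <= 30
a, b, c, d = 29, 17, 21, 83
odds_ratio = (a * d) / (b * c)             # ad/bc
se = math.sqrt(1/a + 1/b + 1/c + 1/d)      # SE of ln(OR)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(round(odds_ratio, 2))        # 6.74
print(round(lo, 2), round(hi, 2))  # 3.13 14.51
```

This matches the reported OR of 6.74 (95% CI: 3.13 to 14.51). The wide interval reflects the small cell counts, even though the estimate is significant.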
Focus on Evidence 34-1
Confounding and Simpson's Paradox

Because confounding is a common problem, understanding its potential for influencing interpretation of data is important, especially when looking for causal explanations. An extreme form of confounding occurs when results for subgroups are the opposite of the crude result, a phenomenon popularly known as Simpson's Paradox.21,22
To illustrate this problem, Charig et al23 conducted a study to evaluate the success rates of different treatments for kidney stones. They studied 700 patients with kidney stones and classified them according to stones being <2 cm or ≥2 cm. Two treatment approaches were used based on clinical evaluation: open surgical procedures (SURG) and percutaneous nephrolithotomy (PN). Success was defined as no stones at 3 months or stones reduced to particles <2 mm in size. In comparing success rates, here is what they found:

Combined Sample
        Successful   Not Successful   Success Rate
SURG       273             77         273/350 = 78%
PN         289             61         289/350 = 83%

Small Stones (<2 cm)
        Successful   Not Successful   Success Rate
SURG        81              6         81/87 = 93%
PN         234             36         234/270 = 87%

Large Stones (≥2 cm)
        Successful   Not Successful   Success Rate
SURG       192             71         192/263 = 73%
PN          55             25         55/80 = 69%

If we look at the data for the total sample, we would conclude that PN is the more effective treatment (83% vs. 78% success). However, if we look separately at the data for patients with small or large stones, we can see that the SURG procedures had a higher success rate in both instances.
How does this paradoxical result happen? The confounding variable in this case relates to how the decision to use one treatment or the other is determined. Open SURG procedures were generally used for stones ≥2 cm, whereas the closed procedures (PN) were performed for smaller stones. The combined data ignores the fact that success is influenced by stone size more strongly than by treatment.
Charig et al23 did not do statistical analyses to determine if the values within the contingency tables represented significant effects, but if you are so inclined, you can calculate chi-square; you will find that no significant effects exist. Simpson's paradox has nothing to do with the significance of these findings, but it illustrates how numbers can be misleading without considering confounding variables.

Data from Charig CR, Webb DR, Payne SR, Wickham JEA. Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. BMJ 1986;292(6524):879-882.
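The reversal in Charig's data is easy to see when the stratified and combined success rates are computed side by side; a short sketch:

```python
# Success counts and totals from the Charig kidney-stone study
small = {"SURG": (81, 87), "PN": (234, 270)}    # stones < 2 cm
large = {"SURG": (192, 263), "PN": (55, 80)}    # stones >= 2 cm
for tx in ("SURG", "PN"):
    s1, n1 = small[tx]
    s2, n2 = large[tx]
    combined = (s1 + s2) / (n1 + n2)
    print(tx, round(s1 / n1, 2), round(s2 / n2, 2), round(combined, 2))
# prints: SURG 0.93 0.73 0.78, then PN 0.87 0.69 0.83
# SURG wins within each stratum, yet PN wins in the combined data.
```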
To evaluate the effect of confounding in an analysis, the researcher must collect information on the potentially confounding variable. If the investigators in the vaccine study did not collect data on the subjects' functional status, the preceding analysis would not have been possible. Therefore, the researcher must be able to predict what variables are possible confounders. It is conceivable that several confounding factors will be operating in one study. Age and gender are often considered potential confounders because of their common association with disease or disability as well as being related to the presence of many exposures. In addition to controlling for confounding in the analysis, researchers can use design strategies, such as matching or homogeneous subjects, to control for these effects.
Researchers attempt to cancel the effect of a confounding variable in the design or analysis of a study. In contrast, effect modifiers are studied so they can be reported. Data are stratified to evaluate and remove confounders. They are stratified to explain effect modification. Researchers must consider the potential influence of confounding and effect modification in all analyses and account for them in the design or analysis of data as much as possible.

■ Analytic Epidemiology: Measures of Treatment Effect

When making clinical decisions regarding the effectiveness of interventions, we will generally be concerned with three questions: (1) Does the treatment work? (2) If it works, how does it compare with other treatments? (3) Is the treatment safe? We want to know if the treatment will improve the patient's condition, or we may want to know if it will prevent or decrease the risk of an adverse event.
Randomized controlled trials (RCT) are the most effective approach for answering these questions. For many research questions, measurements of a dependent variable are taken before and after an intervention, and the degree of change is subjected to a statistical test to determine if there is a significant difference. This often becomes the basis for deciding whether to use the
intervention. Such a conclusion is limited, however, because it does not tell us if the difference is clinically important, nor does it help us estimate the likelihood that our own patient will respond favorably. We know that even with a well-established intervention, patients do not all respond the same way, so how can we weigh the risks and benefits for a particular patient?

We can consider the outcome of treatment in terms of an “event,” which is classified as a success or failure. The success of a treatment may be indicated by whether an adverse event is prevented, or if there is a beneficial outcome based on a specific threshold. In an RCT, then, the success of the intervention is determined by the difference in beneficial or adverse outcomes between treatment and control groups. The concept of relative risk (RR) can be used here in reference to treatment effect (the exposure), indicating if the risk of a particular outcome for the control group is higher or lower than for the treatment group.

➤ CASE IN POINT #5
Headaches arising from cervical musculoskeletal disorders are common. Jull et al25 designed a multicenter RCT that studied the effectiveness of a combination of exercise and manipulative therapy for reducing cervicogenic headaches, as compared to a control group that received no therapy. A primary outcome was the frequency of headaches over a 7-week treatment period.

• The experimental event rate (EER), which is the proportion of people in the treatment group who experienced the adverse outcome, a/(a+b).
• The control event rate (CER), which is the proportion of people in the control group who experienced the adverse outcome, c/(c+d).

The calculations of these values are shown in Table 34-5B. For this study, the EER is 18% and the CER is 71%. These values indicate the risk of headaches recurring for each group. But how do they compare? The ratio of these two values is the relative risk (RR) associated with the intervention:

RR = EER/CER = .18/.71 = .25

➤ The RR associated with the exercise and manipulative therapy is .25. This means that the intervention group was only one-quarter as likely (75% less likely) to experience a recurrence of headache as the control group.

EER and CER are actually the same as the estimate of absolute risk and incidence, described earlier in defining RR. The term “event rate” is used in the context of a clinical trial, where the outcome is a comparison of proportions for those who experience the adverse event in treatment and control groups. When the outcome is continuous, as in the frequency of headaches, it must be converted to a dichotomous outcome to determine when an event has or has not occurred.
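These event-rate calculations can be reproduced with a short script (a minimal sketch in Python, using the cell counts reported in Table 34-5):

```python
# Cell counts from the cervicogenic headache trial (Jull et al, Table 34-5)
a, b = 9, 40    # manipulation + exercise group: failures (a), successes (b)
c, d = 34, 14   # control group: failures (c), successes (d)

eer = a / (a + b)   # experimental event rate: 9/49  -> about .18
cer = c / (c + d)   # control event rate:      34/48 -> about .71

# The text computes RR from the rounded rates, .18/.71 = .25
rr = round(eer, 2) / round(cer, 2)

print(f"EER = {eer:.2f}, CER = {cer:.2f}, RR = {rr:.2f}")
```

Running this reproduces the values in the text: EER = 0.18, CER = 0.71, RR = 0.25.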
Table 34-5 Measures of Treatment Effect: RCT Comparing Exercise and Manipulative Therapy for Cervicogenic Headaches With a Control Condition (N = 97)

A. Data

                             Treatment Outcome
                       FAILURE              SUCCESS
                       (<50% reduction      (≥50% reduction
                       in headaches)        in headaches)      Total
Manipulation +
Exercise               9  (a)               40 (b)             49
Control                34 (c)               14 (d)             48
Total                  43                   54                 97

B. Risk Measures1
EER = a/(a+b) = 9/49 = .18        CER = c/(c+d) = 34/48 = .71

C. Confidence Intervals2
For ARR: 95% CI = .53 ± 1.96(.085) = .53 ± .167 → .363 to .697
For NNT: 95% CI = 1/.697 to 1/.363 → 1.43 to 2.75

1 The EER is the proportion of treated people who had the adverse outcome (absolute risk for the experimental group). The CER is the proportion of the control group who had the adverse outcome (absolute risk for the control group). The calculation of CER and EER will depend on which column is designated as the adverse event.
2 Confidence intervals for NNT are based on the inverse of values used to get the CI for ARR. Calculation of NNH is based on absolute risk increase (ARI), which equals EER − CER. Calculations proceed in the same way as NNT.
Data source: Jull G, Trott P, Potter H, et al. A randomized controlled trial of exercise and manipulative therapy for cervicogenic headache. Spine 2002;27(17):1835-1843.
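The confidence-interval arithmetic in the table can be reproduced as follows (a sketch in Python, assuming the usual standard-error formula for a difference of two independent proportions; small discrepancies from the table arise because the table starts from the rounded values .18, .71, and .53):

```python
import math

n1, n2 = 49, 48          # group sizes: treatment, control
eer, cer = 9 / n1, 34 / n2

arr = cer - eer                                   # absolute risk reduction, about .52
se = math.sqrt(eer * (1 - eer) / n1 +
               cer * (1 - cer) / n2)              # standard error, about .086
lo, hi = arr - 1.96 * se, arr + 1.96 * se         # 95% CI for ARR, about .36 to .69

# Confidence limits for NNT are the reciprocals of the ARR limits, in reverse order
nnt_lo, nnt_hi = 1 / hi, 1 / lo                   # about 1.44 to 2.81

print(f"ARR = {arr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
print(f"NNT 95% CI: {nnt_lo:.2f} to {nnt_hi:.2f}")
```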
limitation. Study A represents the headache study. Study B is a hypothetical study with different results, where the CER was 8% and the EER was 2%. These effects are substantially lower than in study A and may not be clinically meaningful, and yet the RRR would also be .75. Importantly, then, a high value for RRR may or may not indicate a large treatment effect.

A more clinically useful measure is the absolute risk reduction (ARR), which indicates the actual difference in event rate between the control and experimental groups, simply calculated by:

ARR = CER − EER = .71 − .18 = .53

For the headache study, there is a 53% reduction in event rate associated with the intervention. For study B, we can see that there is only a 6% reduction. Therefore, the ARR is a better reflection of the true difference in the studies’ outcomes, or the magnitude of the treatment effectiveness.
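This contrast between relative and absolute measures can be checked numerically (a sketch in Python; RRR is computed as ARR/CER, the risk difference expressed as a proportion of the control rate, and the 8% rate in Study B is taken as the control rate, consistent with the baseline-risk discussion and Figure 34-1):

```python
studies = {"A": (0.71, 0.18),   # (CER, EER) from the headache trial
           "B": (0.08, 0.02)}   # hypothetical low-baseline-risk study

for name, (cer, eer) in studies.items():
    arr = cer - eer        # absolute risk reduction
    rrr = arr / cer        # relative risk reduction
    print(f"Study {name}: ARR = {arr:.2f}, RRR = {rrr:.2f}")

# Both studies show RRR of .75, but the ARR is .53 versus only .06
```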
[Figure 34–1 appears here: a bar chart of control and intervention event rates for Study A and Study B. Labeled values include 18%, 8%, and 2%, with ARR 6% (NNT 17) for Study B and RRR 75% for both studies.]

Figure 34–1 Results from two studies of treatment for cervicogenic headache, illustrating the important difference in absolute and relative risk measures. Bars represent the event rates for the intervention and control groups, indicating the number of patients who did NOT benefit from treatment. Both studies show a reduced risk with intervention. Although the relative risk reduction (RRR) is the same for both studies, the absolute risk reduction (ARR) and number needed to treat (NNT) are substantially different. Data source for Study A: Jull G, Trott P, Potter H, et al. A randomized controlled trial of exercise and manipulative therapy for cervicogenic headache. Spine 2002;27(17):1835-1843.

This example illustrates the importance of considering baseline risk when interpreting risk reduction.26 The control event rate can be used as an estimate of the risk for all subjects at the start of the study, before intervention is applied. Although an RRR of 75% would be considered impressive, we can see that this value is not meaningful when the baseline risk is only 8%. The ARR indicates that the treatment will be of minor benefit in study B. Compare that to a baseline risk of 71% for study A, and we can see that treatment can be of greater benefit.

Number Needed to Treat

The degree of risk reduction should help us decide whether a treatment is worth pursuing, based on the likelihood that the outcome will be successful. The ARR, however, does not provide the clinician with a clinical value that is easily interpreted. A more useful statistic is the number needed to treat (NNT), which provides information about effectiveness in terms of patient numbers.27 For an intervention focused on prevention, such as medication to prevent a heart attack, it is defined as the number of patients that would need to be treated to prevent one adverse outcome. For an intervention used to improve performance, such as exercise to recover function, it is the number of patients needed to achieve one beneficial outcome. The larger the treatment effect, the smaller the NNT. It can be calculated for any trial that reports a binary outcome:

NNT = 1/ARR = 1/.53 = 1.9

The NNT is always rounded up to the nearest whole number. A large treatment effect will translate to a small number needed to treat.

➤ For the headache example, the NNT is 1.9, which is rounded up to 2.0. Therefore, we would need to treat two patients with manipulative therapy and exercise to prevent a recurrence of headaches in one patient. This result means one out of two patients is likely to experience a successful outcome, an outcome that would be considered clinically meaningful.

If the NNT is 1.0, this means that we would need to treat 1 patient to avoid 1 adverse outcome; that is, every patient will benefit from treatment. This is, of course, the ideal, although not often the case. The closer to 1.0, the better the NNT. If an NNT was 30, it would mean that 30 patients would need to be treated to prevent one adverse event. If the ARR is zero (no difference between CER and EER), the treatment has had no effect, and the NNT will be infinity. We would need to treat an infinite number of people to see any benefit. Because the NNT is based on the ARR, it is always interpreted in reference to the control condition, whether placebo or usual care.

Confidence Intervals for NNT

The NNT is a point estimate, and therefore, should be presented with a confidence interval for accurate interpretation. The confidence limits for NNT are the reciprocals of the confidence limits for ARR, in reverse order (see Table 34-5C). The null value for ARR is zero, which converts to infinity for the NNT.

➤ Results
In the study of cervicogenic headache, the absolute risk reduction was .53 (95% CI: .36 to .70), which translated to an NNT of 1.9 (95% CI: 1.43 to 2.75), which was rounded up to 2.0. Therefore, we would need to treat 2 patients with manipulative therapy and exercise to prevent a recurrence of headaches in 1 patient. Therefore, overall, 1 out of 2 patients should experience a successful outcome.

Number Needed to Harm (NNH)

The NNT reflects the prevention of an adverse event or improvement in performance, which is considered a successful outcome. These values must also be considered in the context of potential harms. We can apply this same concept to evaluate serious side-effects, risks, or
complications that occur from treatment outside of the intended effects. Unfortunately, many treatments that have strong therapeutic effects (low NNT) also tend to have more serious side effects.28 When an intervention poses excess risk to the patient, we want to assess the absolute risk increase (ARI) associated with it, or the magnitude of a detrimental treatment effect. This is the opposite of NNT, when the control group does better than the experimental group—the treatment causes an adverse outcome. Therefore, ARI = EER − CER. The reciprocal of this value is called the number needed to harm (NNH), which indicates the number of patients who would need to be treated to cause one adverse outcome.

Opposite to the interpretation of NNT, a larger NNH means a patient is less likely to experience an adverse outcome. The NNH is always rounded down to the nearest whole number. An NNH of 100 would mean that we would need to treat 100 patients to cause one adverse event. An NNH of 1.0, however, would mean that every patient would experience an adverse event. The NNH should be considered along with the NNT to evaluate the balance between benefit and harm of an intervention (see Focus on Evidence 34-2). The NNH may involve a different outcome than used to evaluate the success of treatment, such as a dangerous side effect.

It is important to understand that NNT and NNH are calculated based on binary outcomes. Outcomes based on continuous measures can be used by setting particular thresholds, such as minimal clinically important change.29

Interpretation of NNT and NNH

In terms of clinical decision making, NNT and NNH provide a useful number to help a clinician and patient decide if one or alternative treatments are worth pursuing, weighing their potential benefit or harm. These values are particularly useful for expressing the relative effectiveness or safety of different interventions. It quantifies the effort required to obtain a beneficial outcome. Several factors must be considered, however,
Focus on Evidence 34–2
Weighing Benefits and Harms

Bipolar depression (BD) is a common disorder with several evidence-based interventions, many of which present risks of substantial side effects. Ketter and colleagues28 examined data from more than 100 randomized, double-blind, placebo-controlled trials to evaluate potential benefits and risks of a variety of pharmacotherapies that are used for acute BD. Their approach focused on clinical effectiveness, basing NNT on at least 50% improvement in mood rating compared with a placebo. Incidence of adverse events was based on the typical side effects for each drug, including sedation, restlessness (akathisia), and anxiety. The advantage of using NNT and NNH can be illustrated by a portion of their findings:

TREATMENT                   NNT (95% CI)   SIDE EFFECT   NNH (95% CI)
Lurasidone monotherapy      5 (4-8)        Akathisia     15 (10-33)
Quetiapine                  6 (5-9)        Sedation      5 (4-5)
Lamotrigine                 12 (8-41)      Sedation      37 (Not Sig)
Armodafinil (adjunctive)    9 (4-43)       Anxiety       29 (17-107)

Here’s what we can take from this.
• Lurasidone had a relatively low NNT and a higher value for NNH. Therefore, the benefit of this treatment was greater than the risk of akathisia, indicating it is more likely to be effective than to cause harm. This distinction is further supported by the fact that the confidence intervals are narrow and do not overlap for NNT and NNH.
• Quetiapine’s effectiveness is supported by a relatively low NNT, but the NNH was even lower, indicating that while this treatment may be effective, it is also likely to cause sedation in many patients.
• Lamotrigine had a favorably high NNH, indicating that it was more tolerable than other medications, but that is offset by a double-digit NNT, suggesting that it is less effective.
• Adjunctive armodafinil therapy had a somewhat higher NNT (but still single digit) and a much higher NNH for anxiety, suggesting that it is more likely to yield a good response than it is to present harm. However, because the confidence intervals for NNT and NNH show a large overlap, these effects may not be as obvious in practice.

Using the data presented here (there were several other treatments described in the study), the authors concluded that lurasidone and lamotrigine presented the best balance between effectiveness and tolerability compared with other treatments. These types of findings may seem frustrating, as they do not provide a definitive answer, the way a test of significance would. But because these treatments have different benefits and risks, and can affect patients differently, these data give providers and patients information that can help them personalize benefits versus harm and weigh the options.

Data source: Ketter TA, Miller S, Dell’Osso B, et al. Balancing benefits and harms of treatments for acute bipolar depression. J Affect Disord 2014;169 Suppl 1:S24-S33.
when comparing values of NNT or NNH across different interventions or studies.

(1) NNT and NNH must be interpreted in terms of a time period for treatment and follow-up. The duration of the treatment may make a difference in the frequency of expected positive or negative outcomes. Generally, the NNT will be smaller for an intervention of longer duration.30 The comparison of values for different treatments is only valid when the outcome is measured within the same time period. For instance, for the headache study the report should read: The NNT equals 2 based on treatment over 7 weeks.

(2) The interpretation of NNT and NNH will depend on baseline risk. Because some patients will have a greater risk of an adverse outcome before treatment begins, the NNT or NNH must be adjusted to account for low and high baseline risks. It may not be reasonable to assume that the same relative risk applies to all patients.31 The patient’s age, gender or initial severity of disease may alter the relative risk associated with the intervention. Therefore, when extrapolating NNT or NNH measures from the literature, the baseline risk must be taken into account. The control event rate can be used as an indicator of the baseline risk without treatment.

(3) To compare values across studies, the outcomes of interest must be the same. For example, in the evaluation of exercise programs, an NNT of 20 for preventing falls in the elderly may be interpreted differently from an NNT of 20 for preventing hip fracture. The NNT for a drug treatment to reduce hypertension will be different if the adverse event is stroke, heart attack, or death. In the study of cervicogenic headache, the threshold for success was set at 50%. If this threshold were changed, the resulting NNTs would not be comparable. The same holds true for NNH, as specific side effects will be of varied interest across patients.28

(4) The validity of the clinical trial must be taken into account. Because NNTs are used to make individual patient decisions and to support reimbursement policies, the degree to which research studies represent actual clinical expectations will influence the application of effective treatments. Both internal and external validity must be considered. Replication is important to determine if results stand up to different samples. Patients seen in a practice setting may differ substantially from those used in studies.

(5) There are no standard limits for NNT or NNH that dictate a decision. Like all research results, NNT or NNH should be incorporated into decision-making along with the clinician’s experience and judgment, the patient’s preferences, and the nature of the disorder that is being treated.31 For some interventions an NNT of 2 or 3 would be considered good, whereas for others an NNT of 20 or 40 may still be considered clinically effective. The application of both NNT and NNH must be based on an understanding of the severity of the disease and its consequences, balanced with the probable benefit. If dealing with a disease that has a high mortality rate, for example, an intervention with an NNT of 40 may still be worth considering. An NNT of 100 may be acceptable if the treatment is easy to take and has few side effects, but an NNT of 5 may be too high for an expensive drug that carries substantial risk of side effects. Therefore, the value of NNT and NNH should always be interpreted within the context of a specific disorder, treatment, outcome, disease severity, and time period. Because treatments should be more likely to help than harm, we strive for interventions with lower NNTs and higher NNHs.
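The rounding conventions for these statistics (NNT rounded up, NNH rounded down) can be sketched in a short script (Python; the side-effect rates in the NNH example are hypothetical, chosen only for illustration):

```python
import math

def nnt(cer, eer):
    """Number needed to treat: reciprocal of the ARR, rounded up."""
    arr = cer - eer               # absolute risk reduction
    return math.ceil(1 / arr)

def nnh(eer, cer):
    """Number needed to harm: reciprocal of the ARI, rounded down."""
    ari = eer - cer               # absolute risk increase
    return math.floor(1 / ari)

print(nnt(0.71, 0.18))   # headache study: 1/.53 = 1.9, rounded up to 2
print(nnt(0.08, 0.02))   # study B: 1/.06 = 16.7, rounded up to 17
print(nnh(0.15, 0.05))   # hypothetical side-effect rates: 1/.10, rounded down to 10
```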
COMMENTARY
A traditional biomedical model defines health narrowly as the absence of disease. Based on this definition, epidemiologists collect data to determine frequency of occurrence of and risk factors for disease. Under this model, we open ourselves to looking at risk factors which have an effect at the biological level of the individual. Using a more complete model of health, epidemiology is able to focus on the occurrence of health-related states and risk factors related to physical, social, and psychological health status as well. The ICF model provides an excellent framework for this approach (see Chapter 1). In addition to disease, we can begin to look at the concept of risk in terms of impairments that lead to physical disability and use measures of risk to help us set priorities and evaluate functional outcomes.

As healthcare professionals strive to embrace evidence-based decision-making, our questions must take on a variety of forms, dealing with interventions or screening tools, as well as larger policy issues that have direct application to practice. Decisions regarding choice of intervention, who gets intervention (including prevention), and the intensity and frequency of intervention may be aided by an understanding of the risk reduction associated with specific patient characteristics, activities or treatments. We may understand the physiological rationale for applying specific exercises for reducing pain or improving mobility, but do we know what factors may alter the success of our treatments, or which patients are more likely to improve? Such information can provide new insight into the rationales we use for making treatment decisions.
28. Ketter TA, Miller S, Dell’Osso B, Calabrese JR, Frye MA, Citrome L. Balancing benefits and harms of treatments for acute bipolar depression. J Affect Disord 2014;169 Suppl 1:S24-33.
29. Froud R, Eldridge S, Lall R, Underwood M. Estimating the number needed to treat from continuous outcomes in randomised controlled trials: methodological challenges and worked example using data from the UK Back Pain Exercise and Manipulation (BEAM) trial. BMC Med Res Methodol 2009;9:35.
30. Osiri M, Suarez-Almazor ME, Wells GA, Robinson V, Tugwell P. Number needed to treat (NNT): implication in rheumatology clinical practice. Ann Rheum Dis 2003;62(4):316-321.
31. McQuay HJ, Moore RA. Using numerical results from systematic reviews in clinical practice. Ann Intern Med 1997;126(9):712-720.
PART 5
Putting It All Together

CHAPTER 35 Writing a Research Proposal 548
CHAPTER 36 Critical Appraisal: Evaluating Research Reports 557
CHAPTER 37 Synthesizing Literature: Systematic Reviews and Meta-Analyses 574
CHAPTER 38 Disseminating Research 597

[Figure: The Research Process, set within the context of evidence-based practice — 1. Identify the research question; 2. Design the study; 3. Implement the study; 4. Analyze the data; 5. Disseminate findings.]
CHAPTER 35
Writing a Research Proposal

The initial stages of the research process include development of the research question and delineation of methods of data collection. The success of the project depends on how well these elements have been defined in advance, so that the proper resources are gathered and methods proceed with reliability and validity. The plan that describes all these preparatory elements is the research proposal. The proposal describes the purpose of the study, the importance of the research question, the research protocol, and justifies the feasibility of the project. A proposal can be likened to a blueprint for a building—providing a foundation to guide the work to its conclusion. Proposals are necessary for both quantitative and qualitative studies.

Whether you are preparing a proposal for a funding agency, a thesis or dissertation, or a non-funded project, proposal guidelines are quite similar. The purpose of this chapter is to describe the process of developing and writing a research proposal.

• … that the research question is refined enough to be studied, that the assumptions and theoretical rationale on which the study is based are logical, and that the method is appropriate for answering the question.
• The well-prepared proposal constitutes the body of a grant application when external funding is required.
• The proposal will provide details about the project so that ethical issues can be addressed, the needed expertise and competence of researchers and supportive personnel can be documented, and the feasibility of the project can be substantiated.
• The proposal serves as an application for review by peer or administrative committees. This is the document that will be carefully scrutinized by the Institutional Review Board (IRB) (see Chapter 7).
• The proposal enhances communication among colleagues who may be co-investigators and with consultants whose advice may be needed.
• The careful, detailed account of the study procedures serves as a guide throughout the data collection phase to ensure that the researchers follow the outlined protocol.

The research proposal, therefore, is an indispensable instrument in initiating and implementing a project. The exact format depends on the requirements of the institution, funding agency, or academic committee that will review and approve the project. The order of presentation of material may vary, as may the extent of the information that is required.
responsibility for the project, providing leadership for the planning effort and eventual implementation. The PI is typically an experienced investigator with a track record of prior research. In some instances, a co-PI can be designated to strengthen the team leadership. Other members of the team may include other investigators, a statistician, project manager, research assistants, staff support, and other collaborators.

Although working on a thesis or dissertation may feel like you are on your own, even these projects benefit from collaboration, perhaps with a statistician or with colleagues who can assist with data collection–and of course, your advisor.

■ Components of a Proposal

A research proposal has two basic parts, as shown in Table 35-1. The first part provides details of the research plan, and the other describes the administrative support required to carry out the project. The body of the research proposal is the narrative portion that will explain the purpose and importance of the study and describe the design and procedures in detail.

Before the proposal is written, the researcher must review guidelines and required formats specified by funding agencies or academic institutions. As the project proceeds, it is helpful to follow an organized working plan that focuses the important elements of the project. Although this outline will vary for some agencies, the following sections reflect the basic components required for most research proposals.

Title and Abstract

The title of a research proposal will become the project’s introduction to all potential reviewers and should compel their interest. It should be concise but informative, describing what you will do, how you will do it, and what outcome you are looking for. For example, a title such as “Meniscal surgery and exercise” is certainly concise, but the reader is likely to say, “what about it?” With a few more words, this title will say much more: “A randomized controlled trial of meniscal surgery compared with exercise and patient education for treatment of meniscal tears in young adults.”1 This summarizes the research question, the study design, the interventions, and the intended population.

A summary or abstract of the project, often limited to one page, is required by most funding agencies and institutional review boards. Specific headings may be required to organize content. The abstract should highlight the purpose and importance of the proposed project. A brief description of the method should identify the study subjects, procedures, and methods for data analysis. The proposed duration of the study and overall projected costs may be stated. Because the summary is likely to be read before the detailed proposal, it must make a positive impression, conveying specifically what is to be done and why the study is important.

Table 35-1 General Format of a Research Proposal

The Research Plan
  Title
  Abstract
  Statement of Purpose
    1. Statement of the research problem
    2. Specific aims or objectives
    3. Research hypotheses or guiding questions
    4. Rationale and significance of the study
  Background
    1. Review of literature related to:
       • Theory and supportive rationale
       • Related studies
       • Methods
    2. Previous work by the investigator that supports the project
  Methods
    1. Design:
       • Type of study and appropriate design (randomized trial, cohort, case-control)
       • Interventions and outcomes
       • Explanation of potential confounders to be assessed
    2. Participants: Characteristics, sampling method, recruitment and selection strategy, sample size estimates
    3. Procedures
       • Data collection methods
       • Measurement: Instrumentation, plans to establish reliability and validity
       • Operational definitions
       • Timetable and organizational chart
    4. Data management
       • Analysis: statistics
       • Expected outcomes
  References
  Appendices/Letters of support
  Documentation of informed consent

The Administrative Plan
  Personnel: Qualifications, job descriptions, time commitment
  Facilities and Resources: Space, equipment, technical support
  Budget: Direct Costs (salaries, equipment, facilities, supplies, and travel); Indirect Costs (overhead)
accurate and complete. The funding agency will probably have a statistician review it.

References

The final part of the narrative portion of the proposal should be a listing of literature cited in the paper. Some agencies require the use of a specific bibliographic style, but often this is left to the discretion of the researcher. Whatever style is used, it should be consistent throughout the document.

Appendices

Appendices may include supplementary material such as sample surveys, equipment details, or letters of support to confirm resources that are needed to carry out the project. For example, letters may confirm the commitment of various sites or leaders to provide resources for the project by allowing data collection to occur in their institution. Consultants will clarify their expertise and contributions to the study.

Documentation of Informed Consent

All proposals should document the process for protection of human subjects. A copy of the informed consent form must accompany the proposal when subjects will be directly involved in the study. Researchers should be familiar with the Common Rule and should document their training in ethical principles (see Chapter 7).12 Reviewers will assess the potential risks for subjects associated with involvement in the research.

Most funding agencies and sponsoring institutions require IRB approval before a proposal is submitted and reviewed. The time delays inherent in obtaining this approval must be built into the timetable for submitting the proposal. Documentation of IRB approval must accompany the proposal.

■ Plan for Administrative Support

Personnel

Identification of the investigators and their qualifications is an important element of a proposal, especially when external funding is being sought. This will probably not be a factor in student research, except where expert assistance is required for carrying out parts of the project. Funding agencies will examine investigators’ education, experience, track record of research, and prior publications to determine that they have appropriate experience.

Qualifications are evaluated through a biosketch, a concise summary of each investigator’s background and expertise relevant to the project, including education, employment history, and selected publications or prior grants. A common format has been developed by NIH, which has been adopted by many other granting agencies. NIH restricts this document to five pages, so being concise is important. Some agencies will require that the PI is someone with an MD or PhD; others may accept
professional doctorates. Because of these kinds of criteria, the inclusion of information about the investigators in a proposed study is essential to the process of evaluation by an agency or foundation.

A biosketch should be written specifically for each proposal, to emphasize activities and accomplishments of the researcher that are relevant to the project. For example, particular publications may be listed that demonstrate the researcher’s prior work in the area of study. A personal statement should also be directed to the aims of the project and the agency’s priorities.

See the Chapter 35 Supplement for the NIH biosketch format.

Resources and Environment

Many funding agencies and academic or clinical institutions will also ask for information regarding existing resources for carrying out the proposed project to assess the organizational resources available to support the research. The investigator will be asked to describe available laboratory facilities, equipment, clinical sites, computer capability, office space, and so on, to demonstrate that the project is feasible within the institution’s environment. The areas in which data collection will take place should be described, as should the areas where equipment will be housed.

In addition, administrative support services may need to be described. Documentation of staff support and technical assistance will be evaluated by reviewers in regard to the feasibility and justification of the applicant’s budget request.

Budget

Every proposal, even those written for student research, should include an estimate of projected expenses, to demonstrate the feasibility of the project. For a grant application, the budget is an extremely important part of the proposal and must be complete and detailed accord- …

… expected to run more than 1 year, many agencies require only the first year’s budget to be itemized, with summaries of projected expenses for additional years.

The NIH now uses modular budgets for new, renewal, and revised applications. This approach uses “modules” or increments for direct costs rather than line items. Modules may represent from $25,000 up to $250,000 a year.13 A typical modular budget will request the same modules in each year, although there may be exceptions for expenses that will only exist in the first year, such as equipment costs.

Direct Costs

Direct costs are those associated with carrying out the project, including salaries, equipment, facilities, supplies, and travel. If study participants will be given honoraria, this should be described and budgeted (see Chapter 7).

Salaries. Salaries typically represent the largest part of the budget. The itemized personnel budget identifies the names of each individual who will participate in the study, their proposed title (such as principal investigator, consultant, statistician, project manager, research assistant), the salary for each individual, and the percentage of full-time or number of hours that will be devoted to the project. Dollar amounts may be based on percentage of the individual’s full-time salary or an hourly wage for a specified time period. Some personnel may be asked to participate in the project with no remuneration. These individuals should also be listed, showing no salary request. Associated fringe benefit amounts are listed separately based on the total amount of projected salaries and wages.

Reviewers will scrutinize the personnel budget particularly to evaluate the appropriateness of the time commitment of each participant. The budget justification should explain the responsibilities of each participant and should show that participants will realistically be able to achieve the desired outcomes. It is undesirable to have too few investigators to carry out the work, but it is equally undesirable to have too many, each with very small percentages of effort.

Equipment. Equipment costs are given for all equipment that will be purchased with grant funds, if such
ing to the instructions of the funding agency. Students
purchases would be covered. Costs should reflect rea-
may need to show how resources will be made available
sonable current prices and any charges related to instal-
to them if there is no funding associated with the project.
lation, calibration, and maintenance. Most granting
Many schools provide small seed grants for faculty to
agencies define “equipment” as items costing at least
run pilot studies. Some will also provide assistance to
$300 and having an extended life expectancy of at least
students for their thesis or dissertation projects.
3 to 5 years. The narrative should provide details of
The format and content of the budget will vary de-
equipment, such as manufacturer, model, and special
pending on the type of research proposal. Generally, the
accessories that are needed.
budget is presented by category as a summary of totals
and as an itemized budget. For grant applications, a Facilities. The budget may include a request for funds
narrative section, the budget justification, must explain for alteration or renovations to facilities. If space must
the projected costs in each category. For grants that are be altered to accommodate equipment or to provide a
work area, the contractors' estimates should be confirmed before specifying those costs in the budget. Explanations of all construction costs should be provided in detail, justifying why they are necessary for the study.
Many agencies will not fund equipment or construction costs, with an expectation that the institution will provide the infrastructure for carrying out the project.
Supplies. The category of supplies refers to consumable materials, as opposed to capital equipment. Questionnaires and survey instruments, office supplies, supplies for running research tests, and equipment items costing less than $300 are examples of "supply" items. Specific quantities of these supplies should be given with justification. A category of "other expenses" may also be included to account for miscellaneous items, such as telephone costs and photocopying.
Travel. Depending on the nature of the project and the regulations of the funding agency, travel expenses may be budgeted. Travel to and from the institutional "home base" to collect data is certainly part of conducting a project and is likely to be an acceptable expense. Travel to meetings where data may be presented is more indirectly related to the project but can often be justified. Travel costs may also be applied to patients who must be transported for purposes of the research.

Indirect Costs
Indirect costs relate principally to the overhead charged by the sponsoring institution for administrative activities, facility maintenance, and any other support services. Funding agencies usually limit the amount of support that may be used for indirect costs based on a defined percentage of the total budget. These rates will differ across agencies and may be negotiated. In cases where the customary institutional charge exceeds the set limit, the budget narrative should specify the manner in which such a discrepancy will be handled. The total budget for the project is the sum of all direct and indirect costs.

■ Writing the Proposal
Whether you are writing a proposal for academic requirements or funding, attention must be paid to formatting and writing style. It is helpful to read other proposals that were submitted to and funded by the agencies being considered; many can be found online, or colleagues may be willing to share their successful or rejected attempts. Maintaining a checklist of all the details required in the proposal allows the investigators to keep tabs on their own progress in completing the document.
The research proposal is a forward-looking document. The researcher's thinking begins with the present, acknowledges and draws from the past, but primarily leads to the future. Therefore, the statement of the problem is written in the present tense, the background is written in the past tense, and the method (which is the proposed research) is written in the future tense.

Format
The format required for a proposal varies among agencies and academic institutions. The researcher must follow the specific instructions provided by the sponsoring agency, including seemingly trivial details such as pagination, page limits, line spacing, fonts, and margins. The method of citing references should be consistent throughout, and tables and appendices should be clearly labeled and cited in the text. Some formats require tables and figures to be embedded with text, while others require that they be placed at the end of the document.
Agencies have little tolerance for applications that are not well organized, that have typographical errors, grammatical mistakes, or sloppy formatting, or that do not adhere to format requirements.14 Details are important.
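The budget arithmetic described in this section (salaries prorated by percent effort, fringe benefits computed on total salaries, and indirect costs limited to a defined percentage of direct costs) can be sketched in a few lines. This is an illustrative sketch only; the names, salaries, and rates below are invented for the example and are not drawn from any agency's rules.

```python
# Hypothetical budget sketch; all figures are illustrative.
FRINGE_RATE = 0.30          # assumed fringe benefit rate
INDIRECT_RATE = 0.45        # institution's customary indirect (overhead) rate
AGENCY_INDIRECT_CAP = 0.40  # assumed agency limit on indirect costs

personnel = [
    # (title, annual salary, percent effort)
    ("Principal investigator", 120_000, 0.20),
    ("Project manager", 70_000, 0.50),
    ("Research assistant", 45_000, 1.00),
    ("Statistician (no salary requested)", 95_000, 0.0),  # listed, no request
]

# Salaries prorated by percent effort; fringe computed on total salaries.
salaries = sum(salary * effort for _, salary, effort in personnel)
fringe = salaries * FRINGE_RATE

other_direct = {
    "equipment": 8_000,
    "supplies": 3_500,
    "travel": 2_000,
    "participant honoraria": 1_500,
}

direct_total = salaries + fringe + sum(other_direct.values())

# Apply the lower of the institutional rate and the agency cap; the budget
# narrative would explain how any difference is absorbed.
indirect = direct_total * min(INDIRECT_RATE, AGENCY_INDIRECT_CAP)
total_budget = direct_total + indirect

print(f"Direct: ${direct_total:,.2f}  Indirect: ${indirect:,.2f}  "
      f"Total: ${total_budget:,.2f}")
```

Changing `AGENCY_INDIRECT_CAP` illustrates the point about negotiated limits: when the cap is lower than the institution's customary rate, the cap, not the rate, determines the indirect total.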
that there is a need to conduct the proposed research, and that the research team has the knowledge, ability, and resources to accomplish the study objectives.
Becoming familiar with review criteria can be helpful to maximize the potential for your proposal to be scored successfully. As much as possible, use the same terminology as in the criteria so that the reviewer can readily find all the information needed to satisfy each criterion.
If several team members have been involved in writing sections of the proposal, one editor should do a final pass to assure consistency in writing style as well as formatting. A proposal that is sensible, factual, and realistic will receive the attention it deserves.

■ Funding
When seeking financial support, one of the first decisions that will be made in the preparation of a proposal will be identification of a funding agency. Researchers must focus proposals to address the agency's priorities. There are essentially four types of funding sources.
• Many government agencies, like the NIH, are dedicated to supporting research in different areas. Web resources are available that list grant opportunities.15,16
• Foundations and professional societies offer many opportunities for funding, often at lower levels than governmental agencies. A good source of information is the Foundation Center, which lists foundations and their funding priorities.17
• Industry support of research comes primarily from pharmaceutical companies and device manufacturers, typically with the intent to show their product's effectiveness. Conflicts of interest should be addressed up front under these circumstances.
• Intramural funding is often available from academic or clinical institutions. These are usually "seed grants," providing minimal funding to support pilot studies.

■ Submitting the Proposal
Some submissions are still required in hard copy, some with multiple copies, but most use electronic submission. These often include forms that need to be completed online, especially around the budget.
Take deadlines seriously. Electronic submission portals will cut off at the deadline. Best advice—don't leave things to the last minute. Build in time to get the approvals you need from the IRB, your budget office, or trial registries (see Chapter 14). Make sure the application is complete, including all necessary forms.
Don't be discouraged if your application is not accepted. Submitting proposals for small non-funded projects will often be successful, as they are intended to be initial efforts, but rejections are not unexpected with larger grants because of the competition for limited funds. Remember that most successful researchers paid their dues too! Rejections come with feedback that can be used constructively to improve the proposal for resubmission to the same or a different agency.
COMMENTARY
Murphy’s Laws
If anything can go wrong, it will.
Nothing is as easy as it looks.
Everything takes longer than you think.
—Said to be named for Edward Murphy, Jr. (1918-1990)
American aerospace engineer
For those who have not developed proposals before, getting advice from colleagues who have been successful is a great place to start. They may be willing to share their successes and rejections, which can be invaluable. Researchers should expect to edit and revise the proposal several times through the development process. Indeed, the original draft may be substantially changed when the process is finished.
Writers should seek three kinds of individuals to provide feedback on the proposal. First, a person who is knowledgeable about the topic and the relevance of the project should be asked to evaluate the appropriateness, accuracy, and thoroughness of the presentation. Another who understands research design and methodology will concentrate on the validity of the research methods relative to the research question and specific aims. The third should be someone who is unfamiliar with the subject matter and who will react to the readability of the paper with "fresh eyes." All three may notice inconsistencies, instances of unnecessary professional jargon, or redundancy. This kind of preliminary review by colleagues is valuable for inspiring the researcher's confidence that the proposal is ready for formal review and subsequent successful implementation.
But Murphy's law is always looming. Researchers need to keep track of what works and doesn't work along the way. Subjects drop out or don't respond, equipment fails, schedules don't work, measurements are less reliable than planned, data collection takes longer than planned, and problems may arise in the safety of subjects. These problems often don't surface until after data collection is started, and it is helpful to brainstorm in the planning stage as to how such issues will be handled, should they arise. Murphy probably never understood how much his initial observation strikes a true chord every day!
CHAPTER 36
Critical Appraisal: Evaluating Research Reports
The success of evidence-based practice will depend on how well we incorporate research findings into our clinical judgments and treatment decisions. We have a responsibility to evaluate research reports to determine whether the findings provide sufficient evidence to support the effectiveness of current practices or offer alternatives that will improve patient care.
The purpose of critical appraisal is to determine the scientific merit of a research report and its applicability to clinical decision making. Whether a published paper or conference presentation, we can evaluate a study for the general purpose of understanding current information in our areas of practice, or we may be focused on its application to a particular patient problem. In the latter case, the clinician will pose a clinical question using the PICO format to direct a search and identify appropriate research reports (see Chapter 5).
Clinicians often express that one of the foremost barriers to evidence-based practice is a lack of skill in searching for and appraising the literature.1-3 To be critical consumers, those who seek evidence must be able to retrieve and read published studies and make judgments about the validity and relevance of the information. The search process was covered in Chapter 6. The purpose of this chapter is to take the mystique out of appraisal by providing a structured framework to guide critical evaluation of individual published studies, and to foster informed judgments about the quality and usefulness of the research. This discussion will focus on studies for intervention, diagnosis, prognosis, and qualitative research, but concepts can be readily applied to most types of research. You can refer to the appropriate chapters indicated in each section for background on relevant design and analysis elements.

■ Levels of Evidence
Published studies come in all shapes and sizes—with stronger and weaker designs, with well-controlled or biased procedures, using statistics appropriately or not, and drawing conclusions that may or may not be warranted. As we strive to use the literature to support clinical decisions, our goal is to identify studies with sufficient validity so that we can be confident in the application of results to our own practice.
The classification of levels of evidence can be used as an initial criterion to judge the rigor of a study's design. Criteria reflect the degree of confidence one should have by considering the type of study.4 It will help to refer back to Chapter 5 to review these classifications for both quantitative and qualitative studies, which are broadly summarized in Table 36-1.5 Different criteria are used for different types of studies within each level. These criteria are not hard and fast, however, and any
Table 36-1 Summary of Levels of Evidence

Quantitative Studies
Level 1: Systematic review/Meta-analysis
Level 2: Individual study with strong design
Level 3: Study with less rigorous design
Level 4: Case series
Level 5: Mechanistic reasoning
Grade up: Strong effect size, strong design
Grade down: Poor design, lack of consistency with other studies, small effect

Qualitative Studies
Level 1: Generalizable studies
Level 2: Conceptual studies
Level 3: Descriptive studies
Level 4: Case studies

See Chapter 5 for a full description of levels of evidence.

study may be judged stronger or weaker depending on its implementation and precision.
When we think about questions that can help us sort through all the information in research reports, it is reasonable to seek the highest level of evidence that is available. For most categories, this will involve a systematic review or meta-analysis, which will synthesize literature on a topic. But no study is perfect, and ascribing a high level of evidence to a study by virtue of its design does not mean that the study will provide a valid or definitive result to answer your clinical question. Criteria have been developed to judge the quality of a study, allowing for "grading up" or "grading down," based on rigor of design and strength of effect.6 For instance, a well-designed observational study may be more relevant than a randomized trial that has design flaws. That's where critical appraisal comes in.
Criteria for grading evidence and assessing systematic reviews and meta-analyses will be addressed in Chapter 37.

■ The Appraisal Process
Appraisal begins, of course, with finding a relevant study. Search strategies must be applied to identify articles that are relevant to the clinical question. The initial impression of an article can influence how willing a reader will be to examine it further. The first pass is reading the abstract to determine if the study addresses the clinical question. Be aware, however, that there may be inconsistency between the information reported in an abstract and the full-text report.7,8 Abstracts may not include all relevant information, or they may mischaracterize the results of the study.7,9 If the relevance of the study is not clear from the abstract, reading the full text is important to make such a judgment. Without studying the details of the report, there is no way to truly understand the study's validity.
Although appraisal tends to focus on design and analysis, we can also pay attention to the clarity of the writing—the logical presentation of literature and background theory, development of the research question, description of methods and analysis procedures, and the depth of discussion. We should not have to wade through excessive jargon or try to decipher exactly what was done. We should not have to search for information that will allow us to judge the quality of the study. The contribution of a good study can be obscured if it is not presented in an understandable way.
To avoid being overwhelmed with this process, we need a systematic way of reviewing an article step by step, asking specific questions for each section and for each type of study, to make the process more straightforward. In the end, if and how the evidence is applied will be the reader's critical judgment.
The EQUATOR Network (Enhancing the QUAlity and Transparency Of health Research) is an international initiative that seeks to improve the quality of published research by promoting transparent and accurate reporting through the use of robust reporting guidelines.10 These guidelines are intended to help authors include appropriate information in published studies but are often used to guide critical appraisal as well, particularly to identify if appropriate information is included. Guidelines are available for a wide variety of study types, including randomized trials, observational studies, diagnostic and prognostic studies, clinical practice guidelines, and qualitative research. These reporting guidelines will be discussed in greater detail in Chapter 38.
See the Chapter 36 Supplement for links to EQUATOR guidelines, appraisal worksheets for several types of studies, and other critical appraisal resources.
■ Core Questions
Clinicians and researchers will evaluate literature from three perspectives. First, the assessment will help to determine if the study has validity in terms of design and analysis. Second, the results of the study are examined to determine if findings are meaningful—the importance of an intervention effect, the accuracy of diagnostic tests, the degree of risk associated with prognostic variables, or the implications of qualitative interpretations. Finally, the clinician may consider the relevance of results to general practice or to a particular patient's management.
Although methodological criteria for assessing validity do vary for different designs, there are several core questions that frame the evaluation of any research article. Table 36-2 lists suggested questions according to where the information should be found in an article: introduction, methods, results, discussion, and conclusions. There is no standard format for critical appraisal, and this template should be used only as a guide to the evaluation process. Some studies may require different questions that address issues specific to that study. Although most questions can be answered using "yes" or "no" responses, the intent is to provide justification for that answer—why is it yes or no, and what are the implications of the answer? These answers are not black and white, and readers must apply principles of design and analysis procedures to judge a study's value.

Table 36-2 Core Questions for Evaluating Research Studies

Is the Study Valid?
Introduction: What is the study about?
• Is the research question clearly stated?
• What type of research is being done?
Methods: How was the study designed and implemented?
• Who were the participants and how were they selected?
• How does the design control for potential confounding variables to support validity of findings?
• Were data collection procedures clearly defined with documentation of reliability and validity?
• Was data analysis appropriate to the research question and the data?

Are the Results Meaningful?
Results: What were the outcomes?
• Were results presented clearly in text, tables, or figures?
• Were the data analyzed appropriately?
• How large was the effect, and is it large enough to be clinically meaningful?
Discussion: How are the results interpreted?
• Is the author's interpretation of results supported by the data?
• Does the author acknowledge limitations that could impact how the results can be applied?

Are the Results Relevant to My Patient?
Conclusions: Are the findings useful?
• Does the author discuss the clinical relevance of findings?
• Were the participants in the study sufficiently similar to my patient?
• Is the approach feasible in my setting, and will it be acceptable to my patient?
• Is this approach worth the effort to incorporate it into my treatment plan?

Is the Study Valid?

Introduction: What Is the Study About?
Before we can assess the validity of a study, we must first understand its intent. Readers begin with the abstract, which will include fairly concise information about the purpose, participants, methods, results, and major conclusions of the presented work. If results and conclusions seem applicable, the next step is to read the full article to confirm methods and findings.

Establishing the Question
In the opening paragraphs of an article, the author should establish the general problem being investigated and its importance. The background material should demonstrate that the researchers have thoughtfully and thoroughly synthesized the literature and related theoretical models. This synthesis should provide a rationale for pursuing this particular line of research, establishing what it will contribute to what is already known on the topic (see Chapter 3). Readers should understand the context of the question and be convinced that the study was needed.
By the end of the introduction, the study's specific purpose, aims, or hypotheses should be apparent. Depending on the scope of the study, more than one question may be proposed. From these statements, readers should know whether the study is designed to be descriptive, observational, explanatory, qualitative, or mixed methods research. Questions should follow the PICO standard (see Chapter 5). They should clearly identify to whom the study applies, the methods used, how measurements were taken, what comparisons were made, and what outcomes were expected.
A research "question" is not always phrased as a question. Authors may state the purpose of the study, the study's aims, or specific hypotheses. Qualitative studies may state a general goal to explore certain characteristics or experiences of participants. Regardless of how it is stated, however, the reader should be able to discern what the researcher intended to study.

Methods: How Was the Study Designed and Implemented?
The method section of a research report is the heart of the study, giving us details about what the investigation involved. This is where we learn enough to decide if the results and conclusions of the study are valid. Flaws or omissions detected in the methods section affect the usefulness of interpretations derived from the study. The method section is typically divided into several parts.
• Do the results address the research question?
• Do the results support or refute the proposed hypotheses or aims?
• How large an effect was noted?
• Were all participants accounted for in the final measurements?
Results should provide a clear picture of whether data do or do not support the expected outcomes. If hypotheses were proposed, the author should indicate if each one is accepted or rejected based on the data. When statistics are used, data should be presented as descriptive values to characterize the sample, and test results should be tied to significance levels, effect size indices, and confidence intervals. When expected differences or effects are not obtained, the author should describe a power analysis to determine the likelihood of Type II error. Figures and tables are often a more efficient way to present detailed information. For qualitative studies, results will be in narrative form.
Authors should report the degree of follow-up that was achieved and if attrition was present in the study sample, providing a flowchart that indicates the number of participants involved at each stage of the process. Reasons for attrition should be provided to help determine if bias was present. The format of the flowchart will vary with the type of study, which can be found in reporting guidelines.
When a study has potential importance for your clinical decision-making, and you find that the design or analysis presented is beyond your experience, refer to your textbook or Web resources. Hopefully they will provide reasonable explanation for you to get through it. When that is not sufficient, take advantage of connections with clinical and academic colleagues to find resources that can help you sort through the relevant data.

Discussion: How Are the Results Interpreted?
A research report culminates in the author's discussion and conclusions, putting the results of the study in context. This is where the author will present interpretation of results. The discussion section of an article should address each of the research questions and all results, and should show how prior research supports or conflicts with the study's findings. The author should suggest the clinical importance of the findings, and whether the outcomes do or do not fit expectations.
• Do the stated conclusions flow logically from the obtained results?
• What alternative explanations does the author consider for the obtained findings?
• How are the findings related to prior research?
• Are limitations addressed?
• Are the results clinically important?
Readers must determine if the evidence is strong enough to support a change in some aspect of practice or how we think about an approach to patient management. When statistical analyses are done, authors should reflect on the relative clinical versus statistical significance. Sometimes the results do not provide a definitive conclusion but do suggest how the findings contribute to continued understanding of the variables that were studied. Data are strengthened by a cross-validation approach, where results are replicated on a separate but comparable sample. This is rarely done, however, and authors should provide evidence from prior research to support their findings. As critical consumers of the products of clinical research, readers should study the discussion and conclusion sections of the research report to decide whether the authors' conclusions are warranted.
Many journals require that authors declare conflicts of interest (COI), often in a statement at the beginning or end of the article. COI refers to situations in which financial or other personal considerations may compromise, or have the appearance of compromising, a researcher's judgment in conducting and reporting research. When conflicts occur, researchers must be vigilant to assure that they do not introduce bias.11 For instance, researchers funded by drug companies or equipment manufacturers must show that these connections do not influence outcomes. Readers should be aware of COIs and judge the degree to which bias has been prevented.

Are the Results Relevant to My Patient?

Conclusions: Are the Findings Useful?
Finally, we get to the bottom line. A conclusion section may be incorporated into the end of the discussion, or it may be a separate section. The author should summarize interpretation of the results and describe how results pertain to practice or to basic knowledge.
• Are the findings of sufficient strength to inform management of particular patients?
• Were the participants sufficiently similar to my patient to allow generalization of findings?
• If the study does provide potentially useful findings, will the patient find the treatment or test acceptable?
• Is the treatment or test feasible, given my clinical setting and available resources?
This last part of the evaluative process is based on the reader's perspective. When the intent is to inform a clinical decision for a particular patient, practitioners should consider the four elements of evidence-based
practice (see Fig. 36-1). First, we look for the best available evidence. After evaluating a study, we should have an idea of its scientific merit. The clinician's expertise and judgment will then be important as the results are weighed to determine if the findings are relevant. Knowing that no study will include participants exactly like your patient, a good question to ask is: "How different would my patient have to be from the patients in the study before this evidence was of no use to me?"
The patient's preferences will dictate the acceptability of a treatment or test. The ultimate decision rests on judgment by the clinician and patient based on relevant information—is the treatment or test worth adopting with consideration of its benefits and harms? Finally, it is worth considering if the methods used are feasible in your own setting, or if special equipment or training is needed for implementation.

Figure 36–1 The relationship of critical appraisal to the components of evidence-based practice. (The figure links the clinical question and the three appraisal questions—Is the study valid? Are the results meaningful? Are the results relevant to my patient?—to best available research evidence, clinical expertise, patient values and preferences, and clinical circumstances in clinical decision-making and patient management.)

■ Intervention Studies
The core questions described thus far provide a general foundation for understanding research studies. There are, however, several additional questions that need to be addressed depending on the type of study. This section will focus on intervention studies, which may address efficacy in randomized controlled trials (RCTs), or effectiveness in pragmatic trials, quasi-experimental studies, single-subject designs, or N-of-1 trials. The question should clearly state the population being studied, the intervention, the comparison treatment, and the outcomes that will be measured. Because intervention studies are intended to look for cause-and-effect relationships, issues of internal validity and generalizability should be scrutinized. Additional questions specific to intervention studies are listed in Table 36-3.
See content related to the design of intervention studies in Chapters 13-18. Relevant statistical tests are described in Chapters 22-34.

Are the Results of the Study Valid?
Participants: Authors should specify how participants were recruited using probability or nonprobability methods. Inclusion and exclusion criteria should be provided, and sample size estimates described, including how the effect size was determined. This is where the external validity of the study will be established. When nonprobability samples are used (as is often the case), it is important to ask if this affects the relevance of the study population.
Design: Description of the study should indicate the type of experimental design that is being used, including the number of groups, the number or levels of independent variables (the intervention), the number of dependent variables (the outcomes), and the frequency of measurements. The validity of an intervention study will depend in large part on how randomization was achieved, including concealed allocation, to create comparison groups that are balanced in terms of known and unknown confounding variables. It should be clear if repeated measures are used and how order or carryover effects are controlled. The author should document the extent to which groups were treated equally, except for the experimental treatment. When the design does not allow for random assignment, researchers should address how they controlled for potential bias.
Data Collection: Operational definitions should be provided so that all intervention and measurement procedures could be replicated. Researchers should provide evidence of validity for the chosen measurement tools, especially if they are not standard measures. As much as possible, bias should be controlled by blinding researchers, participants, and assessors to group assignment. Groups should be treated equally except for the intervention. Therefore, differences between groups observed at the end of the study should be attributable to the treatment. When designs do not adhere to these standards, authors should explain why and discuss how they controlled for potential biases. Information in this section, combined with information about design, will help you determine the internal validity of the study.
6113_Ch36_557-573 02/12/19 4:35 pm Page 563
Table 36-3 Critical Appraisal Questions for Intervention Studies

Are the Results of the Study Valid?
Question
• Does the question clearly state the population being studied, the treatments being compared, and the outcome measures (PICO)?
Participants
• What type of sampling procedure was used?
• How were participants recruited and selected?
• Was a power analysis done to determine sample size?
Design
• Does the study design address efficacy or effectiveness?
• Were research hypotheses proposed to indicate expected relationships?
• Have the independent and dependent variables been clearly described?
• Was random assignment used to form study groups or to systematically vary the order of repeated measurements?
• Was allocation concealment used?
• Was the study internally valid? Were other factors present that could have affected the outcome?
Data Collection
• Were outcome measurements taken at reasonable time intervals, and were participants followed for a sufficient time period?
• Were all relevant outcomes evaluated?
• Were the groups treated equally, aside from the intervention?
• Were investigators, participants, and those involved in data collection blinded to group assignment?
Data Analysis
• Were groups similar at baseline on relevant characteristics?
• Were all participants who entered the trial properly accounted for at its conclusion?
• Were participants analyzed in the groups to which they were initially assigned (intention to treat analysis)?
• Were effect sizes determined?
• Were confidence intervals provided?
• Were research hypotheses accepted or rejected?

Are the Results Meaningful?
• Are the results both statistically and clinically significant?
• Did the authors consider alternative explanations for outcomes?
• If outcomes were not significant, was a power analysis done?

Are the Results Relevant to My Patient?
• Are values of NNT and NNH included to put the intervention’s benefits or harms in context?
• Are the outcomes of the intervention consistent with my patient’s preferences?
• Are the treatment benefits worth the potential harm or costs?

See related content on design of intervention studies in Chapters 13-18.
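Several quantities named in Table 36-3 and the text that follows, such as the absolute and relative risk reductions and the number needed to treat, can be computed directly from group event rates. A minimal sketch with hypothetical event rates (not drawn from any study discussed here):

```python
# Hypothetical event rates for illustration only: 20% of control patients
# and 15% of treated patients experience the adverse outcome.

def risk_stats(control_event_rate: float, treated_event_rate: float) -> dict:
    """Absolute risk reduction, relative risk reduction, and NNT."""
    arr = control_event_rate - treated_event_rate   # absolute risk reduction
    rrr = arr / control_event_rate                  # relative risk reduction
    nnt = 1 / arr                                   # number needed to treat
    return {"ARR": arr, "RRR": rrr, "NNT": nnt}

stats = risk_stats(0.20, 0.15)
print(round(stats["ARR"], 3), round(stats["RRR"], 3), round(stats["NNT"]))
# 0.05 0.25 20
```

NNH follows the same logic applied to rates of harmful events, taking the reciprocal of the absolute risk increase.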
Data Analysis: The study design will influence the selection of statistical analysis procedures. Mean differences may be used to determine the effect of the experimental treatment with statistics such as t-tests or analysis of variance. These values should include confidence intervals as measures of precision. Analysis of covariance may be used if the researcher can justify appropriate covariates. When looking at the benefit of an intervention, researchers may look at effect size, relative risk reduction (RRR), attributable risk reduction (ARR), and number needed to treat (NNT) or number needed to harm (NNH).
Researchers should provide data to confirm that groups are similar at baseline on relevant characteristics. If significant differences are observed before treatment is initiated, authors should indicate how they handled data to account for those differences. Validity will also depend on the extent to which participants are adherent to the protocol, and the degree of attrition that occurs during the study. To maintain the benefit of randomized groups, investigators should account for all participants at the end of the study, including those who “dropped out,” and should analyze subject data according to the groups they were assigned to, even if they did not maintain that group membership. Authors should confirm that an intention-to-treat analysis was done. A flow diagram should be included to demonstrate how subjects were recruited and assigned to groups. Researchers should account for differential attrition in treatment groups.

Are the Results Meaningful?

The intent of an intervention study is to demonstrate the benefit of a treatment as compared to a control condition. The author should clearly state if hypotheses were supported or not and propose reasons based on previous research. Despite the pervasive use of p values to indicate significant differences, confidence intervals and effect sizes provide more relevant information to judge the importance of the findings. Researchers should present data that reflects the amount of change or difference and its relative importance. If differences are not significant, the author should address the potential for Type II error.

Are the Results Relevant to My Patient?

… may not be realistic or affordable. When the results of a study are not clear cut, which is probably more often than not, values of NNT and NNH can be directly applied to decisions in consultation with the patient, especially when weighing the relative benefits and risks associated with an intervention. Even if researchers do not express their results using these values, it is often possible to calculate them based on presented data.

■ Studies of Diagnostic Accuracy

This category of studies relates to the validity of a test and how meaningful the results will be for determining a patient’s condition or disease state. It should be clear if the intent is to compare two tests, to estimate the accuracy of one test, or to examine a test’s utility across participant groups. Additional questions for diagnostic accuracy studies are listed in Table 36-4.
Table 36-4 Critical Appraisal Questions for Studies of Diagnostic Accuracy

Are the Results of the Study Valid?
Participants
• Was the sample representative of the full spectrum of patients who are affected by the disorder?
Design
• Was there an independent comparison with an appropriate reference standard?
• How was validity of the reference standard established?
• Were all participants given the reference standard and index tests regardless of the test result?
Data Collection
• Were the methods for performing the test and reference standard described in sufficient detail to allow replication?
• Were those performing the index test blinded to the patient’s reference standard results?
Data Analysis
• Were measures of diagnostic accuracy calculated, including sensitivity, specificity, predictive values, and likelihood ratios, including confidence intervals?
• If the index test was continuous, how was a cutoff point determined?
• Were tests interpreted independently of clinical information about the patient?

Are the Results Meaningful?
• For diagnostic tests, do values of sensitivity, specificity, and likelihood ratios improve posttest probabilities of diagnosing the disorder?
• Is there a reasonable balance between the risk of false positives versus false negatives?
• Are the authors’ conclusions justified within a clinical context?

Are the Results Relevant to My Patient?
• What are the relative benefits of performing the test versus risk or cost?
• Do the results of the test provide meaningful information that contributes to informed choices in management of my patient?

See Chapter 33 for content related to design of studies of diagnostic accuracy.
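The accuracy statistics listed under Data Analysis in Table 36-4 all derive from a 2×2 table of index test results against the reference standard. A minimal sketch with hypothetical counts, including how a likelihood ratio moves a pretest probability to a posttest probability:

```python
# Hypothetical 2x2 counts (illustration only): index test vs. reference standard.
tp, fp = 45, 10   # test positive: condition present / condition absent
fn, tn = 5, 40    # test negative: condition present / condition absent

sensitivity = tp / (tp + fn)              # 45/50 = 0.90
specificity = tn / (tn + fp)              # 40/50 = 0.80
ppv = tp / (tp + fp)                      # positive predictive value
npv = tn / (tn + fn)                      # negative predictive value
lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio

def posttest_probability(pretest_prob: float, lr: float) -> float:
    """Apply a likelihood ratio to a pretest probability via odds (Bayes)."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# A positive result moves a 30% pretest probability to about 66%.
print(round(lr_pos, 1), round(posttest_probability(0.30, lr_pos), 2))
# 4.5 0.66
```

The same arithmetic lets a reader calculate these values when a report presents only the raw counts.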
Are the Results of the Study Valid?

Participants: In order to judge the validity of a test, the sample must be representative of the spectrum of patients to whom the test would apply, and inclusion and exclusion criteria should be specified. Some measurements may simply be judged present or absent, others may be assessed for different levels of severity or magnitude. The sample must reflect the full range of the condition. Without such variance, it will not be possible to accurately determine how well the test reflects true scores. Therefore, samples are usually purposive, to assure a diversity of subject characteristics. This information is usually described in a table.

Design: The meaningfulness of a test is based on criterion validity, in which the measure being assessed for accuracy (the index test) is tested against a criterion (the reference standard). The validity of the reference standard, therefore, must be documented. Sometimes there is a gold standard which is well-established as a known criterion, but an accepted reference standard must be defined for more abstract measures. Authors should be able to justify the reference standard through prior research. The index test and reference standard should be independent of each other, so that interpretation of findings is not biased.

Data Collection: Both index and reference tests should be given to all participants, even if the reference standard indicates the absence of the target condition. Methods of measurement should be well documented, including a full description of how bias was controlled by blinding those who administer and score tests to the subject’s true diagnosis or condition.

Studies that assess clinical measurements can be appraised using questions similar to those for diagnostic accuracy, but with a focus on reliability and content, criterion-related, or construct validity for the measurement of impairments and activity or participation restrictions. See Chapters 9, 10 and 32 for further discussion of this form of methodological research.

Data Analysis: … method for determining the cutoff point described. A flow diagram should be included to show how subjects were recruited and how attrition affected ratios.

Are the Results Meaningful?

Authors should include their interpretation of data by discussing if the obtained statistics represent strong or weak effects. These are subjective determinations and should be made within a clinical context. Although literature should be cited, authors should not rely solely on a standard range of values to indicate when a relationship is considered weak or strong. Clinicians should judge the relative risks associated with false positives versus false negatives. Results should be discussed in terms of costs and clinical consequences related to the index test.

Are the Results Relevant to My Patient?

Results should be considered within the context of the patients to whom they will apply to determine if they will impact patient management decisions. The research should lead to greater confidence in diagnostic accuracy and decisions to use a specific measure. When more than one test is available for a given purpose or when there are different risks or costs associated with them, the clinician and patient may weigh these considerations in deciding how to proceed.

■ Studies of Prognosis

Studies of prognosis are based on observational designs, involving the determination of the predictive value of exposures and patient characteristics on eventual health outcomes. The research question should clearly state the expected associations based on prior research as well as biologic mechanisms. Additional questions for prognostic studies are listed in Table 36-5.

See content related to the design of prognosis studies in Chapters 19 and 34. Relevant statistics are described in Chapters 30, 31, and 34.
Table 36-5 Critical Appraisal Questions for Studies of Prognosis

Are the Results of the Study Valid?
Participants
• Was the sample composed of a representative group of participants at a similar point in their disorder?
• Were participants recruited in an appropriate way?
Design
• Was the study prospective or retrospective?
• Were the prognostic indicators appropriate for predicting outcomes?
• Were cases and controls appropriately defined? What are the risks for misclassification?
• Were exposures adequately defined and measured for all participants?
• Was follow-up sufficiently long and complete?
• Were the individuals collecting data blinded to the participant’s prognostic status?
• Were objective and unbiased outcome measures used?
Data Collection
• Were evaluators blinded to the patient’s prognosis?
• Were outcome data collected from all participants?
• Were assessments reliable?
• Were potential confounders identified and measured?
Data Analysis
• Were appropriate prognostic estimates derived?
• Were subgroups of participants examined for differences in prognostic estimates?
• Were confounders accounted for in the analysis?
• Did the investigators account for missing data?

Are the Results Meaningful?
• How likely are the outcomes within a given period of time?
• How precise are the prognostic estimates?

Are the Results Relevant to My Patient?
• Are the participants in the study sufficiently similar to my patient?
• Will results lead to selecting or avoiding certain intervention or prevention approaches?
• Will the results affect what I tell my patient?

See Chapters 19 and 34 for content on design of prognosis studies.
Design: Prognosis can be assessed using cross-sectional, cohort, or case-control designs in retrospective or prospective ways. Each of these designs has strengths and limitations in terms of being able to document a temporal sequence. When trying to determine factors that will impact a long-term outcome, the duration of follow-up must be described, as interpretations can only be made within that timeframe, and there must be sufficient time for the outcome to develop.

Data Collection: The outcome of interest must be operationally defined and measured at the beginning of the study as well as in follow-up. The method for determining exposures should also be defined, including strategies for reducing bias. Reliability and validity should be documented for all measurements. Potential confounders must be identified at the outset so that data can be collected. These often include demographic factors such as age and gender, but will also include health conditions, personal and lifestyle characteristics, and comorbidities that could be connected to the outcome. Those who take measurements should be blinded to the subject’s prior status. Data must be documented throughout the study, as it is not uncommon for attrition to occur for various reasons. As much as possible, all participants should be followed to the study’s end point, even if they did not complete all measurement sessions.

Data Analysis: Depending on how the research question is stated, analysis may include statistics such as relative risk (RR) and odds ratios (OR), sensitivity and specificity, or regression analyses that will adjust values for covariates. When predicting continuous outcomes, multiple regression and R2 may be used to demonstrate the strength of predictive relationships. Survival analysis may be appropriate when comparing two or more subgroups on long-term outcomes with hazard ratios (HR). A flow diagram should be included to account for subjects who dropped out and describe how missing data were handled. Researchers should also establish that those who drop out are not significantly different on important covariates from those who remain.
Are the Results Meaningful?

Authors should be able to discuss how their findings are supported by previous research. Researchers should be able to discuss the clinical context and application of the findings. Effect sizes should be reported, typically as RR, OR, or HR estimates, with appropriate confidence intervals. When appropriate, adjusted values should be presented that account for the contributions of potentially confounding variables.

Are the Results Relevant to My Patient?

… factors, the provider and patient should discuss the meaning of prognostic implications. Because there are often several covariates that are potential confounders in establishing risk, it is important to consider just how well the sample resembles the patient’s profile—recognizing that there will never be a study with the exact characteristics of the patient. Understanding the concept of risk, and being able to explain it to a patient, is essential to using data for decision-making.
Chapter 5) The author of a CAT searches for and appraises “current best evidence” from the literature, summarizes the evidence, integrates it with clinical expertise and finally suggests how the information can be applied to a patient scenario.17 The critique will address internal, external, construct, and statistical conclusion validity.

Format of a CAT

A CAT provides a standardized format to present a critique of one or more articles and a statement of the clinical relevance of the results. Although the format of a CAT can vary, the essential elements remain fairly standard. It is intended to be a concise document, typically one to two pages. Table 36-7 shows one common layout. At minimum, a CAT should contain the following information:

• Title: A concise statement that will be used to catalogue the CAT.
• Author and Date: The author of the CAT should be specified as well as the date the search was executed. CATs are often revised as new evidence becomes available. Some authors will include an “expiration date” that suggests when the review should be revised to account for more recent literature. This could be as short as 1 year or as long as 5 years, depending on the depth of research on the topic.
• Clinical scenario: A brief description of the patient case that prompted the question.
• Clinical question: This is the question that was developed from the patient case. It includes the elements of PICO: The patient population, the intervention that is being considered (or diagnostic test or prognostic variables), a comparison (if relevant), and the outcomes of interest. This structure helps to refine the question and identifies key elements of an efficient database search.
• Clinical bottom line: A concise summary of how the results can be applied, including a description of how results will affect clinical decisions or actions.
• Search history: Description of the search strategy used to obtain the studies that are being appraised. This includes databases used and terms entered at each step. Relevant studies are selected, often based on achieving highest levels of evidence.
• Citations: Full bibliographic citations of studies selected for review.
• Summary of the study: Is the evidence in this study valid? This section should provide a description of the study based on the questions shown in Tables 36-3 through 36-6, including the type of study participants, procedures, and design and analysis elements.
• Summary of the evidence: The results of the study should be summarized in narrative and/or tabular format. Specific effect size estimates should be provided as appropriate, such as means and mean differences, confidence intervals, odds ratios, likelihood ratios, sensitivity and specificity, and NNT or NNH. If these estimates are not included in the research report, the CAT author may be able to calculate them if sufficient data are provided in the report. Results of qualitative studies will be presented in narrative form, organized around themes or theoretical propositions.
• Additional Comments: Provide critical comments on the study using appraisal guidelines, including issues related to sampling, methods, data analysis, quality of discussion, and interpretation of results. This section should address internal, external, construct, and statistical validity, with comments on positive and negative aspects of the study.

The Centre for Evidence-Based Medicine (CEBM) offers a CATMaker and EBM Calculators that assist with creation of the CAT as well as calculating outcomes such as risk and NNT.22

A CAT plays an important role in fostering evidence-based practice. First, it is based on a specific clinical question, prompted by a clinician’s need to clarify an aspect of patient care related to intervention, diagnosis, prognosis, or harm.19 This question is developed using the PICO format, as shown in Table 36-7. The clinician searches the literature explicitly to provide an answer to this question, with the intent of influencing management of the patient.
The second unique feature of a CAT is the inclusion of a clinical bottom line, or a conclusion by the CAT author as to the value of the findings. The usefulness of the bottom line depends on the ability of the CAT author to accurately assess the validity of the literature, and to grasp the relevance of findings for a particular patient’s management.17 This information is then useful to others who may encounter similar patients, who can take advantage of the summary for efficient decision-making.
Using CATs for Clinical Decision-Making

The utility of CATs will depend on the clinician’s ability to readily access the information within the clinical setting. Wyer16 suggests that CATs may serve to make the results of journal clubs or case conferences available to clinical staff to improve the care of subsequent patients (see Box 36-1). He offers the perspective that building such bridges and making them available at the point of care should be an essential part of evidence-based practice. Many institutions have established online “CAT Banks” that provide ready access to summaries of articles on specific topics.23-26

Links to several CAT databases are provided in the Chapter 36 Supplement under Critical Appraisal Resources. By the way, be careful when searching for “CAT banks,” as you will undoubtedly find lots of ceramic banks in the shape of felines!

Box 36-1 Join the Club!
Journal clubs have been around since the late 1850s when William Osler, then at McGill University, wanted to keep up with progress in European journals.27 Today journal clubs are quite common in clinical settings, presenting an opportunity for colleagues to meet regularly to discuss published research that is relevant to their professional interests.28
Journal clubs are useful for teaching practitioners how to search for articles relevant to their clinical practice. They also encourage clinicians to read and critically appraise articles with colleagues, sharing perspectives and knowledge, and discussing how the research can impact practice. These discussions may focus on a single patient problem, or they may relate to a general area of practice for the group.
Although formats can vary greatly, typically the group will choose an article to review, which is made available to all participants who will read it and come prepared with questions or comments. A clinician is identified as the leader of the session, with the responsibility to present the study, reviewing the background to the question, describing methods and their validity, focusing on results and summarizing key findings. The presentation leads to robust discussion by the group regarding the study’s quality and the impact of the findings for practice.
The journal club provides a nonthreatening way for colleagues to develop skills in critical appraisal and to expand their appreciation for evidence-based practice. Several good references discuss how to develop and run journal clubs, including onsite and online versions.29-33

Limitations of CATs

CATs do have limitations as a source of clinical information. They have a short shelf life as new evidence becomes available, which is why revision dates may be included. Clinicians must recognize that CATs may not represent a rigorous search of the literature, as in a systematic review. CATs are often based on only one or two references and may not represent the full scope of literature on a topic. Their usefulness is conditional on the critical appraisal skills and accuracy of the author. It is therefore the reader’s responsibility to determine the relevance of the clinical question, the logic of the search strategy, and the application of validity criteria in the review. It is also important to assess the strength of the CAT author’s conclusions in relation to the evidence provided. Notwithstanding these limitations, however, in the hands of a discerning reader, CATs provide a practical method for sharing evidence to inform clinical decisions.
COMMENTARY
Being a critical consumer of the literature means that we can make judgements about how we can apply the results of published studies to our patient care. Critical appraisal may not result in a firm decision about the quality of a study. The checklists don’t provide a score that tells us which study to use—they only provide the means for us to consider a study’s findings, and to determine whether to accept them based on evaluation of validity and relevance. This is a skill that must be developed and practiced. No study is perfect, but that does not mean that we should “tear apart” every study we read—that could lead us to discard important contributions. Sometimes in critical appraisal there is a tendency to be overly critical, finding many flaws, major and minor, real and potential. Most importantly, although we can use guidelines to address the components of a study, it will still come down to the reader’s ability to make judgments about the study’s validity, and to apply findings within a clinical context.
Because of the nature of clinical research, there will always be some aspects of the design that could have been tighter or cleaner. The important element is whether the limitations in the design are “fatal flaws,” making the findings useless. In refereed journals, the articles that make it to the publication stage have been screened, but this does not guarantee the results will be clinically useful. Regardless of the imperfections that exist in the clinical research process, it is our job as evidence-based practitioners to find the merits of the study and decide whether to accept them for our use.
The next step in EBP is integration of the evidence into clinical decision-making, which must then be followed by assessment of the process. Was the best evidence obtained, were articles sufficiently appraised, and did the evidence foster a successful outcome with my patient?
The task of critical appraisal can be fostered by reading systematic reviews, in which studies have been appraised and summarized by others in a “systematic” way to reflect the state of knowledge on a particular topic. Reviews may not be comprehensive, however, and it is not unusual to find that many of the reviewed studies are not methodologically sound for various reasons. We cannot blindly depend on systematic reviews to offer those judgments, making critical appraisal skills that much more important for our own evidence-based practice. We head there next, to discuss systematic reviews and how we can use them most effectively.
14. Sandelowski M. Sample size in qualitative research. Res Nurs Health 1995;18(2):179-183.
15. Palaganas EC, Sanchez MC, Molintas MP, Caricativo RD. Reflexivity in qualitative research: A journey of learning. Qual Report 2017;22(2):426-438. Available at https://nsuworkss.nova.edu/tqr/vol422/iss422/425. Accessed March 4, 2019.
16. Wyer PC. The critically appraised topic: closing the evidence-transfer gap. Ann Emerg Med 1997;30(5):639-640.
17. Fetters L, Figueiredo EM, Keane-Miller D, McSweeney DJ, Tsao CC. Critically appraised topics. Pediatr Phys Ther 2004;16(1):19-21.
18. Johnson CJ. Getting started in evidence-based practice for childhood speech-language disorders. Am J Speech Lang Pathol 2006;15(1):20-35.
19. Straus SE, Glasziou P, Richardson WS, Haynes RB, Pattani R, Veroniki AA. Evidence-Based Medicine: How to Practice and Teach EBM. 5th ed. London: Elsevier; 2019.
20. Sadigh G, Parker R, Kelly AM, Cronin P. How to write a critically appraised topic (CAT). Acad Radiol 2012;19(7):872-888.
21. Kelly AM, Cronin P. How to perform a critically appraised topic: part 1, ask, search, and apply. AJR Am J Roentgenol 2011;197(5):1039-1047.
22. Centre for Evidence-Based Medicine. CATMaker and EBM calculators. Available at https://www.cebm.net/2014/06/catmaker-ebm-calculators/. Accessed March 27, 2019.
23. University of British Columbia. Critically appraised topics (CATs). Rehabilitation Science Online Programs. Available at http://www.mrsc.ubc.ca/site_page.asp?pageid=98. Accessed March 27, 2019.
24. University of Texas Health Science Center. Welcome to CATs library. Available at https://cats.uthscsa.edu/. Accessed August 20, 2019.
25. University of Oklahoma Health Sciences Center College of Allied Health. Critically appraised topic. Available at https://alliedhealth.ouhsc.edu/Departments/Rehabilitation-Sciences/Critically-Appraised-Topics-CAT. Accessed August 20, 2019.
26. Pacific University. School of Occupational Therapy CATS. Available at https://commons.pacificu.edu/otcats/. Accessed March 27, 2019.
27. Cushing H. The Life of Sir William Osler. Oxford: Clarendon Press; 1925.
28. Aronson JK. Journal Clubs: 1. Origins. Evid Based Med 2017;22(6):231.
29. Johnson A, Thornby KA, Ferrill M. Critical appraisal of biomedical literature with a succinct journal club template: the ROOTs format. Hosp Pharm 2017;52(7):488-495.
30. Ellington AL. Insights gained from an online journal club for fieldwork educators. Occup Ther Health Care 2018;32(1):72-78.
31. Szucs KA, Benson JD, Haneman B. Using a guided journal club as a teaching strategy to enhance learning skills for evidence-based practice. Occup Ther Health Care 2017;31(2):143-149.
32. Chan TM, Thoma B, Radecki R, Topf J, Woo HH, Kao LS, Cochran A, Hiremath S, Lin M. Ten steps for setting up an online journal club. J Contin Educ Health Prof 2015;35(2):148-154.
33. Bowles P, Marenah K, Ricketts D, Rogers B. How to prepare for and present at a journal club. Br J Hosp Med (Lond) 2013;74 Suppl 10:C150-152.
6113_Ch37_574-596 02/12/19 4:35 pm Page 574
CHAPTER 37
Synthesizing Literature: Systematic Reviews and Meta-Analyses
—with David Scalzitti

Key concepts: Systematic Reviews • Synthesis • Meta-Analyses • Scoping Reviews
In seeking the evidence on which to base clinical practice, we are faced with myriad published papers and websites, often offering unclear or conflicting information regarding the choice of interventions, diagnostic tools, or expectations of outcomes. We also face the need to critically appraise these studies to determine how well they support clinical decisions—a potentially overwhelming task when the number of studies is large (see Chapter 36). Any single study, even if well-designed, is essentially a form of tentative evidence, which needs confirmation by additional research. Therefore, researchers and practitioners benefit from syntheses of literature that summarize and critique published studies.
Systematic reviews present comprehensive information regarding findings from several sources with an analysis of the methodological quality of the studies. Reviews can be extended by pooling data from several studies in a process of meta-analysis, which presents collective statistical measures of effect size. A more recent contribution to synthesis is the scoping
with the development of chronic pain in breast cancer survivors who were at least 6 months post-treatment.

Systematic reviews can also address qualitative research, to explore patient experiences related to their care.17 One Cochrane Review Group is dedicated to developing methods for critical analysis of qualitative studies as well as an appreciation for the role of qualitative studies in evidence-based practice.

➤ Traumatic spinal cord injury (SCI) impacts an individual's return to the community and to paid employment. Hilton et al18 published a systematic review to answer the question, "What are the barriers and facilitators influencing people's experience of return to work following SCI?" They reviewed qualitative studies that investigated themes related to personal and environmental factors.

demonstrate the prevalence of the condition being studied and its health impact. Information on etiology, pathophysiology, variations in treatment options and their effectiveness, and conflicting ideas about patient management provide a rationale for the review question. The background should also include findings of any existing systematic reviews and make a valid case as to why a new or updated review is needed.

➤ CASE IN POINT #1
Motor skills are fundamental to childhood development. However, the effectiveness of treatment options for children with mild to moderate movement disorders is not clear. Lucas et al20 conducted a systematic review to investigate the effectiveness of conservative interventions to improve gross motor skills in children with a range of neurodevelopmental disorders.
■ Methods

The methods section of a systematic review specifies criteria for choosing studies to include in the review, describes the process of searching for references, and explains how these studies were evaluated. Whereas in primary studies, researchers specify characteristics of subjects that will determine who will be chosen to participate in a study, in a systematic review, the "subjects" are the studies themselves. Therefore, "selection" criteria specify inclusion and exclusion requirements for studies to be eligible for the review. These include study design, description of participants, and definitions of interventions and outcomes. These requirements must be clearly stipulated prior to searching the literature to facilitate transparency and consistency in the selection process.

Selection Criteria

Study Design
The classification of levels of evidence for research delineates the types of studies that are considered strongest for drawing conclusions about interventions, diagnostic tools, prognosis, harm, and screening (see Chapter 5, Table 5-4). These levels act as a guide to decide which studies should be included in a systematic review. Many authors restrict systematic reviews of interventions to the "gold standard" of RCTs published in peer-reviewed journals. When the literature is replete with RCTs on a particular topic, this may be an effective strategy. However, when published work is less robust or when randomized trials on the topic are not feasible, the researcher may find it necessary to use studies at lower levels of evidence. Quasi-experimental studies, observational cohort or case-control studies, and case series can also be valuable for understanding the scope of knowledge about an intervention. The decision to broaden the scope of a systematic review of an intervention to include nonrandomized designs should be made in advance of conducting the search.

➤ The motor skills study restricted the systematic review to RCTs, quasi-randomized controlled trials, and randomized crossover trials. They also included conference abstracts and clinical trials registered with international registries.

Selection criteria for studies of prognosis or diagnosis will focus on observational designs and would have to specify inclusion criteria such as prospective or retrospective studies, including case-control, cohort, or cross-sectional designs.

Participants
The population of interest must be specified to identify the types of participants or patients recruited in the chosen studies. Characteristics may include age range, gender, or specific diagnostic categories. Case definitions should be specified. Settings from which participants are recruited should be described if relevant. Because participant characteristics can vary widely across studies, systematic reviews should set reasonably broad criteria that reflect the type of individuals to whom the intervention would apply.

➤ In the study of gross motor skills, researchers included studies with participants between 3 and 18 years who had neurological conditions that affect coordinated movement, such as fetal alcohol syndrome, cerebral palsy, low birth weight, and acquired brain injury, among others. They excluded studies that included participants with genetic disorders, hearing or visual impairments, or intellectual disability.

Interventions
Selection criteria must include definitions of the interventions and comparison treatments. It is virtually impossible to find studies that applied interventions in exactly the same way, so these definitions are generally broad enough to allow a reasonable match.

➤ The motor skills study defined intervention as any nonpharmacological or nonsurgical therapy that had the intent of improving gross motor performance. Treatment could be carried out in home, community, or school-based settings and had to be delivered by a trained healthcare professional, such as a physical or occupational therapist. Comparison treatments could include no treatment, placebo, waiting list, or usual therapy.

If relevant to conclusions, interventions can be restricted to a certain duration of treatment or length of follow-up. This will limit the number of acceptable studies but may be important for considering the optimal dose of the intervention and the duration of the treatment effect. For systematic reviews of diagnosis and prognosis, criteria for inclusion of the index tests and risk factors need to be defined.

Outcome Measures
Outcomes include measurements or endpoints, including improvement in condition or reduction of symptoms. In studies of risk, outcomes may relate to onset or prevention of disease or other conditions. Reviewers may specify that studies use particular instruments, such as a health status questionnaire or a diagnostic test. Researchers may require specific primary outcomes and
allow for a variety of secondary outcomes. The actual tools used to measure outcomes may differ across studies, but they should all be assessing the same construct. When doing a review for a meta-analysis, researchers may also restrict studies to those that report results using specific statistical outcomes, such as means and confidence intervals, number needed to treat (NNT), risk ratios, or measures of diagnostic accuracy.

➤ Lucas et al20 specified the primary outcome as gross motor performance measured with a standardized assessment tool. Secondary outcomes could include compliance, parental and child satisfaction, or cost.

After the methods for the systematic review have been developed, it is recommended that the protocol be registered. This helps to minimize duplication of reviews and to reduce bias by providing the opportunity to check the completed review against the protocol. For example, the protocol for the study by Lucas et al20 was registered with PROSPERO, a prospective registry for health-related topics, available through the UK National Health Service.21

The Search Strategy
A major component of the methods section is the description of the search strategy, guided by the selection criteria. The goal is to amass a comprehensive list of relevant references that can be considered for the review. Before beginning a systematic review, it is important to locate other reviews on the same topic to be sure your review will contribute new information. Depending on the rate of reporting research on a particular topic, systematic reviews may need updating every 3 to 5 years.

The first step of the search is to specify which databases and search terms will be used. The strategy should include any restrictions on language or dates of publication. For practical purposes, many reviewers only include articles written in English. However, such a restriction may introduce bias, as studies with positive findings are more likely to be published in larger English-language journals.22

Commonly used databases include MEDLINE, EMBASE, SCOPUS, CINAHL, and the Cochrane Collaboration, among many others (see Chapter 6). Authors should provide a detailed description of the search strategy and search terms as a useful reference and to allow for replication of the process. For this purpose, the search strategy should have high sensitivity to find all relevant references, although it will also retrieve some irrelevant studies. The intent, however, is to be as comprehensive as possible. More than one database should be employed to provide the most thorough search possible. In addition to online resources, hand searching through pertinent journals and reference lists of published papers can help to ensure that the search is complete. It is often helpful to identify a small number of exemplar articles on the topic that can assist in the selection of keywords that can trigger "similar articles" in a search.

➤ Investigators in the motor skills study performed searches of eight electronic databases. They restricted references to publications in English that were published from January 1980 to June 2015. They also performed hand searches of the reference lists of relevant articles, conference abstracts, and registries of clinical trials. A supplement to their article provides details of the search terms used for each of the electronic searches.

Because systematic reviews often take many months to complete, researchers may need to update the search periodically, especially as the final paper is being prepared, to ensure the most recent evidence is included.

Unpublished Findings
Systematic reviews are obviously influenced by the literature that is available at the time the search is performed. Therefore, results can be impacted if relevant data are not published, limiting their access and potentially affecting clinical decisions (see Focus on Evidence 37-1).

One source of concern is publication bias, the situation in which researchers fail to submit studies with results that are not statistically significant, or editors decline to publish such studies.23 As a result, a systematic review will not reflect the true nature of the evidence. This practice has created bias against studies with "negative" or inconclusive results. These studies may still have strong designs and important outcomes but may be underpowered. Because the value of a systematic review is to broaden the evidence for or against a proposed outcome, evidence of no effect must be part of the picture. Authors of systematic reviews should make a concerted effort to find these studies by contacting authors directly or by looking at entries in clinical trials registries.

Another concern is the accessibility of grey literature, research information that is only available through formats other than customary journals, such as conference proceedings, abstracts, agency reports, websites, theses, or dissertations.24-26 It is important to balance the desire for higher levels of evidence with the
Focus on Evidence 37–1
Many Shades of Grey

Research has shown that meta-analyses that exclude unpublished data or grey literature can exaggerate estimates of intervention effect, positively and negatively.27 For example, Turner et al28 reviewed 74 studies of 12 antidepressant drugs that were registered with the Food and Drug Administration (FDA) between 1987 and 2004. They found that 31% were not published. With four exceptions, all positive studies were published, and of 35 nonsignificant studies, 22 were not published and 11 were published in a way that implied a positive outcome. In another study, Hart et al29 investigated the effect of including grey literature, including clinical guidelines and conference proceedings, in 42 meta-analyses that studied nine different drugs. They found that incorporating unpublished FDA data resulted in lower efficacy of the drug in 46% of the trials compared to results using only published data; 7% had identical efficacy, and 46% showed greater efficacy. One study that estimated harm showed more harm from the drug after inclusion of unpublished data. They concluded that the effect of including unpublished outcome data varies by the drug and measured outcome. However, it was clear that when grey literature was included, the majority of studies in their analysis resulted in substantially different outcomes, potentially leading to serious bias in the conclusions, one way or the other. This influence of grey literature has been supported by the results of several other studies.24,25 Clearly, the impact of these omissions in the literature can be widespread, as it influences the credibility of hypotheses and clinical conclusions.30
need to find the "best evidence available." Although there is no true consensus, many advocates have recommended considering grey literature to assure that a systematic review is truly comprehensive. Benzies et al26 have published a checklist to use as an aid in deciding whether to include grey literature in a systematic review. Items that should trigger its use include complex interventions and outcomes, lack of consensus about outcomes, and low volume or quality of evidence. Many reference sources have emerged to help with this type of search (see Chapter 6).

Screening References
The initial search will yield numerous articles that need to be screened to determine if they fit the selection criteria. The screening process should be clearly described in the review. The first task is to remove duplicate titles, which is especially important when multiple databases are used. This task may be accomplished by hand but is greatly facilitated by the use of reference management databases.

Even with a careful search strategy, one should not expect that all identified papers will meet the pre-established criteria. Therefore, one or more authors must sort through the titles and abstracts of the remaining articles to eliminate any that are clearly not related to the search criteria. The reviewers should err on the side of caution during this initial step and not eliminate any article whose title and abstract do not provide adequate information to make a determination.

Abstracts may not contain sufficient detail to know if an article meets selection criteria. Therefore, once the list is honed to potentially relevant studies, the next step requires locating the full text. Many articles will be freely available online, but this may also require use of inter-library loan, purchasing full-text articles, or requesting full text by contacting study authors.

Although it may be tempting to filter a search by only references that are available in full-text, this is counterproductive, as it will result in a less than comprehensive search and is likely to bias the conclusions of the review authors.

At least two authors should review the full-text citations to determine which articles meet the inclusion criteria. Multiple reviewers are needed to improve the trustworthiness of the screening process. Reasons for exclusion of articles at this stage should be documented and reported in the published review. Titles can be stored in bibliographic databases or in a spreadsheet to facilitate tracking and management of the citations retrieved.

Data Extraction
With a list of appropriate references, the process continues with critical review of the selected studies. As a rule, a minimum of two primary reviewers should independently assess content and rate the quality and applicability of each selected paper or information source, a process called data extraction. This information is usually recorded on a data extraction form, which includes all of the elements that will be assessed for study quality. Using a consistent and independent process allows reviewers to gather the same information on each study, so that comparisons can be readily made with minimal bias.

Extraction forms include sections for the citation of the study, eligibility criteria, and characteristics related to the PICO question. They also include sections for assessing study quality and extracting data for the study's results. The extraction form should be designed and
pilot-tested prior to extracting information from the articles for the systematic review. Sample data extraction forms for intervention studies are available from the Cochrane website.31 Extraction forms for systematic reviews of diagnosis and prognosis should include the appropriate fields related to these types of clinical questions.

Because many interventions are complex, reviewers may consider utilizing the Template for Intervention Description and Replication (TIDieR) checklist to standardize the evaluation of interventions.32-34 This 12-item checklist is intended to improve the quality of description of interventions and comparison procedures in publications. Although the emphasis is on trials, the guidelines can be applied across all types of studies.

Reliability
As we do for all research studies, we want to establish the reliability of our "measurements," in this case the ability of reviewers to be consistent in the evaluation of articles. It is useful for reviewers to assess a few sample papers in a training process to judge their reliability. When disagreements occur, they should be resolved by consensus, by resolution from a third party, or by clarifying the evaluation criteria. A reliability coefficient, such as kappa, can be calculated to demonstrate how often the reviewers were in agreement for each item on the extraction form (see Chapter 32).

See the Chapter 37 Supplement for sample extraction forms and a link to the TIDieR template.

■ Results

The results section of the systematic review will report the outcome of the search and screening process.

The PRISMA Flow Diagram
Clinical trials use a flow diagram to demonstrate how subjects are recruited for a study, indicating those who are eligible, those who drop out, and those who remain in the study (see Chapter 15). In a similar way, a systematic review needs to document the process of identifying or excluding relevant studies using the PRISMA flow diagram.4 Figure 37-2 illustrates the flow of articles through the steps of selection in the systematic review of gross motor skills.

In order to complete the diagram, accurate records need to be maintained during each step of the article selection process. The diagram should include the number of articles retrieved from the search, the number of abstracts screened, the number of full-text articles assessed, the final number selected for the systematic review, and the number of full-text articles excluded. When possible, reasons should be reported for the exclusion of full-text articles that were initially assessed for eligibility.

➤ In the motor skills study, a total of 3,092 citations were identified through the electronic and hand searches. After duplicate and ineligible citations were removed, 190 full-text articles were screened for eligibility requirements. A standard form was used for reviewers to assess study eligibility (available as a supplement). The final synthesis was based on nine papers.

It is not uncommon to see this degree of elimination of citations because many false positives may be retrieved by the search in order to ensure that all the true positives are identified. It is important to have a systematic screening process to facilitate getting through such a large number of studies.

Assessing Methodological Quality
Studies will differ in quality based on their design and analysis, affecting the validity of their findings. Therefore, it is important to consider methodological quality when synthesizing several studies, to determine if forms of bias affect the value of results for answering the research question. Whether the designs are RCTs or observational, authors often neglect to provide important information in the research report that is necessary to determine validity. As part of a systematic review, quality assessment will provide possible explanations for differences in study results.

Several tools have been developed to assess study quality for interventions, diagnostic tools, or prognostic factors. These tools are checklists that contain a series of items reflecting essential information needed to judge quality and bias, typically requiring a "yes" or "no" response (some include an option for "unclear"). Some scales provide a total score based on summing the "yes" responses for each study and may suggest values that indicate poor, medium, or high quality.

There is no gold standard for this process, and many tools and scoring systems have not been validated.35 Nonetheless, these various tools generally focus on a number of relevant concepts to assess the internal validity of a study and risk of bias. The criteria used to judge quality and the resultant ratings must be described in a systematic review so that readers can evaluate the strength of the reviewers' conclusions. Those conducting the review must be familiar with the operational definitions for each criterion, and the reliability of their assessments should be established.

We will illustrate this process using two of the more commonly used methods for RCTs, the PEDro scale and the Cochrane Risk of Bias Tool. Assessments for
[Figure 37-2 appears here; legible fragments of the diagram include "Excluded (n 579) - Duplicate record" and "Screening: eligibility (n 190)."]
Figure 37–2 The PRISMA flow diagram for the study of gross motor skills, illustrating each
step in the process to identify the final selection of studies to be reviewed. Adapted from Lucas BR,
Elliott EJ, Coggan S, et al. Interventions to improve gross motor performance in children with
neurodevelopmental disorders: a meta-analysis. BMC Pediatrics 2016;16:193, Figure 1. Used under
Creative Commons license (BY/4.0) (http://creativecommons.org/publicdomain/zero/1.0/).
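The record-keeping that feeds a flow diagram is simple arithmetic on the counts recorded at each stage. A minimal sketch, using the counts reported in the motor skills review (3,092 records identified, 190 full-text articles assessed, 9 studies included); the stage labels are illustrative, not those of the published figure:

```python
# Sketch of PRISMA flow-diagram bookkeeping.
# Counts are those reported for the motor skills review; stage labels
# are illustrative placeholders, not taken from the published diagram.
stages = [
    ("records identified", 3092),
    ("full-text articles assessed", 190),
    ("studies included in synthesis", 9),
]

def exclusions(stages):
    """Number of records dropped between each pair of consecutive stages."""
    return [(f"excluded before {nxt[0]}", cur[1] - nxt[1])
            for cur, nxt in zip(stages, stages[1:])]

for label, n in exclusions(stages):
    print(f"{label}: {n}")
```

Keeping the tally in one place like this makes it easy to confirm that the numbers excluded at each step account for every record retrieved by the search.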
other types of designs will be described, including diagnostic accuracy, prognosis, case-control and cohort studies, and measurement studies.

The EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network provides a database of reporting guidelines for many different trial types.36 Although these are primarily for use by authors to standardize content, these guidelines may also be helpful as checklists in judging study quality (see Chapter 38).

The PEDro Scale
The Physiotherapy Evidence Database (PEDro) scale has been widely used in medical and rehabilitation literature to provide a summary score for study quality of clinical trials. Developed by physiotherapists at the University of Sydney, it is based on information reported from the methods and results sections. In addition to items related to randomization, blinding, and attrition, the scale also includes items related to analysis of design and statistics (see Table 37-2). Each of 10 items (items 2 through 11) is graded "yes" if the criterion is met or "no" if it is not met or unclear. The number of items marked "yes" is summed, with a maximum total score of 10. The first item listed on the scale relates to external validity and is not scored. The PEDro scale and individual items on the scale have demonstrated reasonable reliability.37,38 The scale scores may be considered interval level measurement for parametric analysis.39

Systematic reviews that use the PEDro scale usually provide the overall score for each study and often show scores for each individual item on the scale. Table 37-3
Table 37-3  PEDro Ratings for Eligible Studies in the Systematic Review by Lucas et al20

PEDro CRITERION SCORE
Study              2    3    4    5    6    7    8    9    10   11   Total
Chrysagis 2012     Y    Y    Y    Y    N    N    Y    Y    Y    Y    8
Fong 2012          Y    N    Y    N    N    Y    N    Y    Y    Y    6
Fong 2013          Y    N    Y    N    N    Y    N    Y    Y    Y    6
Hammond 2013       Y    N    Y    N    N    N    Y    N    Y    Y    5
Hillier 2010       Y    Y    Y    N    N    Y    Y    N    Y    Y    7
Ledebt 2005        Y    N    Y    N    N    N    N    N    Y    N    3
Peens 2008         Y    N    Y    N    N    N    N    N    Y    Y    4
Polatajko 1995     Y    N    Y    N    N    Y    Y    N    Y    Y    6
Tsai 2009          N    N    Y    N    N    N    N    N    Y    Y    3
Total Yes Ratings  8/9  2/9  9/9  1/9  0/9  4/9  4/9  3/9  9/9  8/9

Adapted from Lucas BR, Elliott EJ, Coggan S, et al. Interventions to improve gross motor performance in children with neurodevelopmental disorders: a meta-analysis. BMC Pediatrics 2016;16:193, Table 4. Used under Creative Commons license (BY/4.0) (http://creativecommons.org/publicdomain/zero/1.0/).
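The tallying behind a table like Table 37-3—summing "yes" ratings per study and per criterion—can be sketched in a few lines. This is an illustration only; the rating strings below encode the yes/no judgments shown in the table for items 2 through 11:

```python
# Sketch: summing PEDro "yes" ratings per study and per criterion.
# Each string encodes items 2-11 for one study, as shown in Table 37-3.
# Item 1 (external validity) is not scored and is omitted.
ratings = {
    "Chrysagis 2012": "YYYYNNYYYY",
    "Fong 2012":      "YNYNNYNYYY",
    "Fong 2013":      "YNYNNYNYYY",
    "Hammond 2013":   "YNYNNNYNYY",
    "Hillier 2010":   "YYYNNYYNYY",
    "Ledebt 2005":    "YNYNNNNNYN",
    "Peens 2008":     "YNYNNNNNYY",
    "Polatajko 1995": "YNYNNYYNYY",
    "Tsai 2009":      "NNYNNNNNYY",
}

# Total score for each study (maximum 10).
totals = {study: items.count("Y") for study, items in ratings.items()}

# Number of studies meeting each criterion (items 2 through 11).
item_counts = [sum(items[i] == "Y" for items in ratings.values())
               for i in range(10)]
```

Running the tally reproduces the table's bottom row (8/9, 2/9, 9/9, ...) and the per-study totals, a quick internal check that the published figures add up.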
of systematic reviews provide an explicit rationale if using a specific cut-off to include studies in a systematic review.

➤ Lucas et al20 used the PEDro scale to assess the nine studies in the motor skills systematic review. Scores indicate that six of the studies achieved a total rating of 5 or higher. The majority of studies did not conceal allocation (item 3), did not blind subjects or testers (items 5, 6, 7), and did not account for missing data or did not perform an intention-to-treat analysis (items 8 and 9).

Cochrane Risk of Bias Tool
Another approach to rating study quality has been developed by the Cochrane Collaboration. The Cochrane Risk of Bias Tool explores the risk of bias in each study included in a systematic review, focusing on how the design and conduct of a trial contributes to the believability of the results.3 The tool includes seven items with defined criteria, allowing the reviewer to provide a judgment of "low risk" or "high risk" of bias, or "unclear." Rather than serving as a checklist or a scale, this domain-based tool presents a visual overview of quality related to seven categories. This image is generally included in each Cochrane systematic review.6 Using the data from the motor skills systematic review by Lucas et al,20 Figure 37-3 illustrates how the tool would present their results.

[Figure 37-3 data; the domain column headings are not recoverable except the final column, "Other bias":]
Chrysagis 2012    +  +  +  −  +  ?  +
Fong 2012         +  −  −  +  +  ?  +
Fong 2013         +  −  −  +  +  ?  +
Hammond 2013      +  −  −  −  +  ?  +
Hillier 2010      +  +  −  +  +  ?  +
Ledebt 2005       +  −  −  −  −  ?  +
Peens 2008        +  −  −  −  −  ?  +
Polatajko 1995    +  −  −  +  +  ?  +
Tsai 2009         −  −  −  −  −  ?  −

Figure 37–3 Cochrane Risk of Bias domains completed using the data from the studies included in the systematic review by Lucas et al.20 + = low risk of bias, − = high risk of bias, and ? = unclear. This figure was generated using RevMan software available through the Cochrane Collaboration.

The categories that define risk include five addressing internal validity, one addressing reporting bias (such as publication bias), and a final category representing a judgment regarding other sources of bias, such as early stopping of the study, severe baseline imbalances, or claims of fraud. The assessment of internal validity is based on four threats:

• Selection bias can distort treatment effects because of the way comparison groups are formed and can be controlled with random assignment and concealment of allocation.
• Performance bias refers to differences in the provision of care to experimental and control groups. This is most effectively controlled through blinding of those who receive and give care.
• Attrition bias is related to the differential loss of subjects across comparison groups. This becomes especially relevant for studies with follow-up periods and is addressed through intention-to-treat analysis.
• Detection bias occurs if outcome assessment differs across comparison groups. Control is achieved by documenting reliability of measurements.

➤ The risk of bias tool indicates that allocation concealment and blinding were not used in the majority of studies in the motor skills systematic review. Given the information included
in the reports, reporting bias could not be determined, but other sources of bias were not identified. This approach confirms the same important methodological flaws identified by the PEDro scale.

Methodological Assessments for Other Designs

Studies of Diagnostic Accuracy
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS and QUADAS-2) checklist has been developed to assess the methodological quality of diagnostic accuracy studies in systematic reviews. The QUADAS-2 consists of 10 signaling questions in four key domains: patient selection, the index test, the reference standard, and flow and timing. This tool does not provide a total score. Instead, reviewers provide judgments of the risk of bias in each domain (high, low, or unclear) and concerns regarding applicability (high, low, or unclear).42

The original QUADAS was published in 2004 as a 14-item scale that generates a numeric score based on a sum of items with "yes" answers.43 The revised QUADAS-2 was developed in 2011, changing the assessment format, but the original scale is still in use.

Studies of Prognosis
The Quality in Prognosis Studies (QUIPS) checklist addresses six potential sources of bias in studies that look at predictive prognostic models.44 The six areas relate to participants and the risk of selection bias, attrition and handling of missing data, prognostic factor measurement, outcome measurement, risk of study confounding, and statistical analysis.

Observational Studies
The Newcastle-Ottawa Scale Quality Assessment Form (NOS) has been used to assess quality in systematic reviews of cohort and case-control studies, such as in epidemiologic studies of risk factors.45 The NOS includes eight items that assess three dimensions: selection of the study groups, comparability, and outcome assessment. Each of the eight items receives one star if it receives the highest rating, with the exception of the comparability item, which can receive two stars. This results in a maximum score of nine stars for the NOS.46

Studies of Clinical Measurement
The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) is a checklist that considers properties of reliability, validity, and responsiveness to assess the quality of studies included in a systematic review of clinical measurements.47 A COSMIN Risk of Bias checklist has been developed specifically for assessing the methodological quality of studies of patient-reported outcome measures (PROM).48 Standards refer to design and statistical methods based on classical test theory and item response theory (see Chapter 12).

See the Chapter 37 Supplement for links to methodological assessment tools used with a variety of research designs.

Data Synthesis
Once the articles of interest have been critically reviewed and data extracted, the researchers must then determine if and how the results of the studies can be summarized and synthesized. This is the final part of the results section. The process is facilitated by using a table to summarize the information obtained from data extraction, including design, description of participants, the interventions, and outcomes. Table 37-4 shows a small portion of such a table from the motor skills systematic review. The specific columns and included data can vary depending on the nature of the question. These tables can be extensive if the systematic review incorporates many studies. Many authors choose to include the score for methodological quality as well.

This summary leads to a judgment about the strength of the body of evidence. How much agreement is there among study findings? This is an assessment of the degree to which studies have similar or dissimilar findings—an indication of clinical homogeneity or heterogeneity.49 Relevant factors to consider are:

• Composition of groups, including different inclusion and exclusion criteria, different baseline levels, and the presence of complications or comorbid conditions.
• Differences in timing or dose of intervention or exposure.
• Design of the study, including length of follow-up and proportion of subjects who dropped out.
• Management of patients, including how treatments were delivered.

If papers have published conflicting or inconclusive findings, it can be difficult to interpret the results of the systematic review. It can be helpful to cluster studies that focused on similar outcomes or those that show agreement, to better understand the extent of consistency. Studies with small sample sizes or small effect sizes may show no significant effect of the intervention, leaving the possibility of a Type II error. The choice of measurement scale or tool may affect the sensitivity or responsiveness of measurement.
6113_Ch37_574-596 02/12/19 4:35 pm Page 586
Table 37-4 Excerpt of Study Summaries From the Systematic Review by Lucas et al20

Tsai et al 2009
  Design: RCT
  Participants: From classrooms in southern Taiwan; age 9-10 yrs; Dx: DCD; N = 27; gender not reported
  Intervention: Table tennis vs regular classroom activities
  Intervention dose: Intervention (n = 13): three 50-min training sessions per week. Control (n = 14): no treatment
  Outcomes: Primary: gross motor skills (M-ABC ball skills, static/dynamic balance). Secondary: none reported
  Intervention approach: Task-oriented
  PEDro score: 3

Hillier et al 2010
  Design: RCT
  Participants: From Minimal Motor Disorder Unit of hospital in Adelaide, Australia; age 5-8 yrs; Dx: DCD; N = 13; gender not reported
  Intervention: Aquatic therapy vs waiting list
  Intervention dose: Intervention (n = 6): weekly 30-min sessions over 8 weeks in 1:1 format. Control (n = 6): waiting list
  Outcomes: Primary: gross motor skills (M-ABC ball skills, static/dynamic balance). Secondary: self-concept, parent's perception of change
  Intervention approach: Traditional
  PEDro score: 7

Adapted from Lucas BR, Elliott EJ, Coggan S, et al. Interventions to improve gross motor performance in children with neurodevelopmental disorders: a meta-analysis. BMC Pediatrics 2016;16:193, Table 3. Used under Creative Commons license (BY/4.0) (http://creativecommons.org/publicdomain/zero/1.0/).
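A data-extraction summary table like Table 37-4 can also be held as structured records during a review. The sketch below is purely illustrative: the class name `ExtractionRecord` and its fields are hypothetical, chosen to mirror the columns of Table 37-4, and are not part of any published extraction tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExtractionRecord:
    """One row of a data-extraction summary table (fields mirror Table 37-4)."""
    study: str
    design: str
    participants: str
    intervention: str
    dose: str
    outcomes: List[str] = field(default_factory=list)
    approach: str = ""
    pedro_score: Optional[int] = None

# Example record built from the Tsai et al 2009 row of Table 37-4:
tsai = ExtractionRecord(
    study="Tsai et al 2009", design="RCT",
    participants="Children age 9-10 yrs with DCD, N = 27",
    intervention="Table tennis vs regular classroom activities",
    dose="Three 50-min training sessions per week",
    outcomes=["M-ABC ball skills", "static/dynamic balance"],
    approach="Task-oriented", pedro_score=3)
```

Keeping extraction in a structured form makes it easy to sort or filter studies (e.g., by PEDro score) before building the published summary table.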
Data shown in the forest plot (Figure 37–4):

Study                | Intervention             | Outcome                        | n  | n  | SMD (95% CI)
Polatajko et al 1995 | Traditional intervention | TOMI: ball skills              | 24 | 24 | 0.5 (1.1 to 0.1)
Hillier et al 2010   | Aquatic therapy          | M-ABC: static/dynamic balance  | 6  | 6  | 0.9 (2.1 to 0.3)
Tsai et al 2009      | Table tennis             | M-ABC: ball skills             | 14 | 13 | 0.7 (1.5 to 0.1)
Hammond et al 2013   | Wii Fit (Group A)        | *BOT2: bilateral coordination  | 8  | 10 | 1.0 (2.0 to 0.0)
Peens et al 2008     | Motor based intervention | M-ABC: static/dynamic balance  | 17 | 20 | 0.6 (1.3 to 0.1)
Figure 37–4 Forest plot illustrating the results of the meta-analysis of studies of interventions to improve gross motor
performance using the least conservative estimates of group differences. In the final meta-analysis one reference that had used
the same cohort as another study was eliminated, leaving eight studies. The # denotes studies with only one outcome measure.
The * denotes an outcome where a lower score is better. From Lucas BR, Elliott EJ, Coggan S, et al. Interventions to improve
gross motor performance in children with neurodevelopmental disorders: a meta-analysis. BMC Pediatrics 2016;16:193, Figure 4.
Used under Creative Commons license (BY/4.0) (http://creativecommons.org/publicdomain/zero/1.0/).
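The pooled effect shown as the diamond at the bottom of a forest plot is typically obtained by inverse-variance weighting of the individual study effects. The sketch below uses a simple fixed-effect model and the common standard-error approximation for a standardized mean difference; the group sizes and SMDs are hypothetical, and this is not the random-effects analysis a published meta-analysis such as Lucas et al's would necessarily report.

```python
import math

def smd_se(n1, n2, d):
    """Approximate standard error of a standardized mean difference."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))

def pooled_effect(studies):
    """Fixed-effect (inverse-variance) pooled SMD with a 95% CI.
    Each study is a tuple: (n group 1, n group 2, observed SMD)."""
    weights = [1 / smd_se(n1, n2, d) ** 2 for n1, n2, d in studies]
    pooled = (sum(w * d for w, (_, _, d) in zip(weights, studies))
              / sum(weights))
    se = math.sqrt(1 / sum(weights))           # SE of the pooled estimate
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se

# Hypothetical group sizes and SMDs (negative values favoring intervention):
smd, ci_low, ci_high = pooled_effect([(24, 24, -0.5), (6, 6, -0.9), (13, 14, -0.7)])
```

Larger studies get larger weights, so the pooled SMD sits closest to the most precise estimates, and its confidence interval is narrower than any single study's.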
The X-axis represents the treatment effect for each study, the SMD in this example. A vertical line is drawn at the mean effect size, the mean SMD. The Y-axis represents a measure of variance, such as standard error for mean differences, or log(odds) for risk measures. The scale is reversed so that smaller values are at the top (studies with less variance).3,61 In the absence of publication bias, differences between studies should be a matter of random error, and we would expect to see studies with large and small variance showing up with high and low SMDs, creating a symmetrical plot. In the presence of publication bias, however, we would expect to see an asymmetrical pattern, resulting in a gap in one corner of the funnel.3

Visual analysis of the plot in Figure 37-5 suggests that publication bias may be present. While most studies cluster around the mean at the top, there is one data point in the lower left corner, far to the left of the vertical line, resulting in a more asymmetrical appearance. Although funnel plots are more informative with a larger number of studies, we can see that most of the studies had small variability (near the top) with SMDs clustered around the mean. Only one study had substantially larger variability and a lower SMD (lower left). This pattern suggests that other studies may be missing from this analysis. This interpretation is a matter of judgment. There are statistical procedures that can be used to estimate the degree of bias, most requiring specialized software.61

For illustrative purposes, we have focused primarily on data using the least conservative estimates for most of this discussion. Refer to the Chapter 37 Supplement to see the full text of the article by Lucas et al,20 including forest plots and funnel plots for both least and most conservative values, as well as the additional files that are available online. Their discussion does a good job of explaining the rationale behind their analysis.

Grading the Body of Evidence

Authors of systematic reviews and meta-analyses can assess the quality of the body of evidence using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system.62-64 Four levels of quality have been defined for studies of interventions, associated with types of study designs (see Table 37-5). Reviews based on randomized trials or strong observational studies are rated high. This rating can be downgraded one or more levels, however, by the presence of one or more of the following factors:

• High likelihood of bias from study design and implementation
• Indirectness of population, intervention, control, or outcomes
• Unexplained heterogeneity or inconsistency of the findings
• Imprecision or very wide confidence intervals
• Publication bias
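The downgrading logic just listed can be sketched as a simple rule: start at the design-based level and drop one level per serious concern. This is only an illustrative paraphrase; actual GRADE judgments weigh how serious each concern is, can downgrade two levels for very serious concerns, and can also upgrade observational evidence (see Table 37-5). The function name and its inputs are hypothetical.

```python
GRADE_LEVELS = ["very low", "low", "moderate", "high"]

def grade_quality(randomized: bool, serious_concerns: list) -> str:
    """Illustrative GRADE-style rating: start at the design-based level,
    then drop one level for each serious concern (risk of bias,
    indirectness, inconsistency, imprecision, publication bias)."""
    level = 3 if randomized else 1   # randomized trials start high; observational low
    level -= len(serious_concerns)   # one level per serious concern
    return GRADE_LEVELS[max(level, 0)]

# e.g., a body of randomized trials with serious imprecision:
# grade_quality(True, ["imprecision"]) -> "moderate"
```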
Figure 37–5 Funnel plot showing potential publication bias based on outlier in lower left corner. The red dots represent the individual studies, and the blue diamond at the bottom represents the pooled effect size and its 95% confidence interval. Adapted from Lucas BR, Elliott EJ, Coggan S, et al. Interventions to improve gross motor performance in children with neurodevelopmental disorders: a meta-analysis. BMC Pediatrics 2016;16:193, Additional file 5. Used under Creative Commons license (BY/4.0) (http://creativecommons.org/publicdomain/zero/1.0/).

Clinical practice guidelines, which provide recommendations to optimize patient care based on systematic reviews and a comparison of benefits and harms of different options, may utilize the GRADE approach in summarizing a body of evidence for each clinical question.63 A database of current clinical practice guidelines for healthcare is maintained by the ECRI Guidelines Trust.65 This resource continues the work previously done by the National Guideline Clearinghouse.
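Among the statistical procedures for estimating funnel-plot asymmetry, one widely used approach is Egger's regression test, which regresses each study's standardized effect (SMD/SE) on its precision (1/SE); an intercept far from zero suggests asymmetry. The minimal sketch below uses hypothetical numbers and computes only the intercept (not its significance test); it is not the analysis performed by Lucas et al.

```python
def egger_intercept(smds, ses):
    """Intercept of Egger's regression: standardized effect (SMD/SE)
    regressed on precision (1/SE). Values far from 0 suggest asymmetry."""
    y = [d / s for d, s in zip(smds, ses)]   # standardized effects
    x = [1 / s for s in ses]                 # precisions
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx

# Identical effects at every precision give an intercept of ~0; adding one
# imprecise study with an extreme effect pulls the intercept away from 0.
symmetric = egger_intercept([-0.5, -0.5, -0.5], [0.2, 0.3, 0.5])
with_outlier = egger_intercept([-0.5, -0.5, -0.5, -1.5], [0.2, 0.3, 0.5, 1.0])
```

As with visual inspection of the funnel, the test is weak when the number of studies is small, so the result should be treated as supporting evidence rather than proof of bias.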
Table 37-5 GRADE Levels of Quality of a Body of Evidence for Interventions

QUALITY      STUDY DESIGN
• High       Randomized trials or double-upgraded observational studies
• Moderate   Downgraded randomized trials or upgraded observational studies
• Low        Double-downgraded randomized trials or observational studies
• Very low   Triple-downgraded randomized trials, downgraded observational studies, or case series/case reports

See the Chapter 37 Supplement for links to additional discussion on evaluating clinical practice guidelines and using the GRADE approach to rate the quality of studies of diagnostic accuracy and prognosis.66-68

■ Appraisal of Systematic Reviews and Meta-Analyses

Because systematic reviews and meta-analyses are typically seen as the highest form of evidence in a literature search, clinicians must take responsibility for critical appraisal of reviews, to determine if they are valid in their presentation of findings (see Chapter 36). We have to assure ourselves that the authors of the review have done an adequate job of locating, summarizing, evaluating, and synthesizing the information that we will use for our clinical decisions. In deciding to apply evidence from a systematic review to patient care, a clinician should be able to address three primary questions (see Table 37-6), the same questions used to critique all other types of studies: 1) Are the results of the study valid? 2) What are the results? 3) Are the results meaningful to my patient?

A MeaSurement Tool to Assess systematic Reviews (AMSTAR) was developed to provide a more formal appraisal of a systematic review. The original AMSTAR was a checklist with 11 items to critically appraise a systematic review. A new version, the AMSTAR-2, was recently developed and has undergone psychometric testing.69

The AMSTAR-2 includes 16 items that are each answered as yes or no but does not generate a total score. This checklist refers to adequacy of reporting in each component of the review, including a clear research question, criteria for selection of studies, a comprehensive search, reliability of reviewers, and assessment of risk of bias. For a meta-analysis, it also includes use of appropriate statistical estimates and evaluation of heterogeneity and publication bias. The appraiser is expected to make a judgment regarding overall confidence in the results of the systematic review (high, moderate, low, or critically low) based on the number of noncritical weaknesses and critical flaws identified by the checklist.

See Chapter 36 for discussion of critical appraisal. The Chapter 37 Supplement contains links to methods for appraising systematic reviews and information on AMSTAR-2.
Table 37-6 Core Questions for Appraising Systematic Reviews and Meta-Analyses
ARE THE RESULTS OF THE STUDY VALID?
• Was the research question clearly stated using components of PICO?
• Were inclusion criteria specified for design, participants, treatment and comparisons, and outcomes?
• Was a thorough search strategy detailed?
• Were reviews and data extraction performed by at least two reviewers using consistent criteria?
• Was the assessment of methodological quality or bias described?
• If a meta-analysis was done, were appropriate statistical procedures used?
ARE THE RESULTS MEANINGFUL?
• Was the result of screening presented as a PRISMA diagram, detailing included and excluded studies (with reasons)?
• What were the results of assessment of risk of bias, and how will these findings influence interpretation of results?
• How was heterogeneity evaluated?
• Were results of a meta-analysis presented as effect sizes in a forest plot?
• Was a funnel plot used to document publication bias?
• Did the authors interpret results?
ARE THE RESULTS RELEVANT TO MY PATIENT?
• Were included studies relevant to my patient in terms of participants, interventions and outcomes?
• Were results consistent with a meaningful effect that will influence clinical decisions?
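The AMSTAR-2 overall confidence judgment can be paraphrased as a simple decision rule in which critical flaws dominate the rating. This sketch is an illustrative reading of the published scheme, not the official algorithm; in practice, appraisers also judge which items count as critical for a given review, and the function name and inputs here are hypothetical.

```python
def amstar2_confidence(critical_flaws: int, noncritical_weaknesses: int) -> str:
    """Illustrative paraphrase of the AMSTAR-2 confidence judgment:
    critical flaws dominate; noncritical weaknesses only distinguish
    high from moderate confidence."""
    if critical_flaws > 1:
        return "critically low"
    if critical_flaws == 1:
        return "low"
    # No critical flaws: up to one noncritical weakness still rates high.
    return "high" if noncritical_weaknesses <= 1 else "moderate"
```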
■ Scoping Reviews

Systematic reviews typically address very specific clinical questions. However, not every topic in healthcare is appropriate for a systematic review. For example, background clinical questions on more general topics and foreground clinical questions with a paucity of supporting evidence cannot be answered by a systematic review. A scoping review plays a different role, mapping the body of knowledge on a topic for which the body of literature has not been comprehensively reviewed.70,71 It is an exploratory method that is especially useful when an area of inquiry exhibits a complex or heterogeneous nature that would preclude a formal systematic review. Scoping reviews summarize evidence on a topic to describe consistencies or inconsistencies in current practice, identify gaps in the literature, clarify concepts, and can serve as a precursor to systematic reviews.72

➤ CASE IN POINT #2
Despite an increased focus on patient and family involvement in healthcare, these concepts remain poorly defined in the literature. Olding et al73 completed a scoping review to investigate the range of literature on patient and family involvement in critical and intensive care settings.

Scoping Review Methods
Although conducted for a different purpose than systematic reviews, scoping reviews do require rigorous and transparent methods to establish the trustworthiness of the findings.72 Many of the processes associated with scoping reviews are similar to those in systematic reviews, including stating a question and providing background to support the rationale, specifying inclusion criteria for references, describing a comprehensive search strategy, screening citations, and extracting study characteristics. The PRISMA Extension for Scoping Reviews (PRISMA-ScR) includes a 22-item checklist that defines standard reporting items to guide the development and implementation of scoping reviews.74

The Research Question
There are some important differences between scoping reviews and systematic reviews, however. One distinction is how the research question is developed. Rather than using a framework like PICO, which is too specific, a broader approach is needed that identifies a context for the study's objectives, which will guide study selection criteria.75 These may be expressed as a question or statement of purpose.

➤ Olding et al73 set the following objective for their study on patient involvement in healthcare:
This scoping review investigates the extent and range of literature on patient and family involvement in critical and intensive care settings. The aim was to identify methodological and empirical gaps within the existing literature in order to inform an emerging research agenda in this topic area.

See the Chapter 37 Supplement for references on scoping reviews, including the full-text article by Olding et al73 and the PRISMA-ScR statement for reporting guidelines for scoping reviews.

A unique feature of scoping reviews is a recommendation for consultation with stakeholders as another form of input.70,76 This can include expert researchers, practitioners, and information scientists, as well as patients, policy makers, or health services professionals. These individuals can help to clarify relevant information, propose directions for the search, prioritize research questions, and put findings in context.

➤ The research questions that guided the review of patient and family involvement were developed in collaboration with an international group of researchers, physicians, nurses, policy directors, and health services analysts who served on an advisory board for a larger study of interprofessional collaboration and patient/family involvement in intensive care settings.

Methods
Even though questions are broad, scoping reviews must still be focused and precise in terms of their objectives so that criteria for searching can be defined.75 Scoping reviews use a more comprehensive approach to the search for sources, incorporating clinical trials, observational studies, and qualitative and case studies, and may include grey literature. The search can be applied using an iterative process that allows new sources to be continually identified based on obtained information, which is why scoping reviews can take longer to complete. They also provide an overview of existing evidence regardless of quality, providing a map of the evidence that has been generated rather than seeking information on a specific practice or policy.77 Therefore, scoping reviews do not involve methodological appraisal. Findings are generally organized along a thematic analysis.

➤ Olding et al73 established three criteria for study selection. Studies had to be set in an intensive or critical care setting,
address the topic of patient/family involvement, and include a sample of adult patients, their families, and care providers. They searched four databases for English-language literature published between 2003 and 2014. This time period was purposefully chosen to provide insight into the expansion of literature on the topic. The search strategy included an extensive list of MeSH terms, developed in consultation with a health information scientist. They also examined reference lists of existing literature and consulted with experts in the field to find eligible articles not identified in the search.

The articles identified in the search should be screened by multiple reviewers to determine which citations meet the inclusion criteria. Records should be kept and the screening process summarized using a PRISMA flow diagram.74 The data extraction process is referred to as charting the results in a table that aligns with the research question. The authors should define the process for collating this information.

➤ In the study of patient/family involvement, two reviewers charted articles by extracting information on study aim, setting, design, and population. They then applied a content analysis to inductively identify patterns in the ways patient and family involvement were described (see Chapter 21). The process allowed them to comment on general trends, address gaps, and thematically describe the components of patient and family involvement.

Results
The results section of a scoping review will report the results of the search and the number of articles selected. This section may be organized according to main conceptual categories that relate to the purpose and to the information charted. A map or diagram may be presented regarding the number of studies that relate to each of these categories, but results are generally described narratively. Results will include a rich description of the major concepts and themes that were identified through the review.

➤ The search resulted in a total of 892 articles. After screening, 124 studies were reviewed: 61 quantitative, 61 qualitative, and 2 mixed methods. The researchers identified components of patient involvement and family involvement involving participation in care and communication. They also noted that "patient involvement" was not a well-defined concept.

Discussion
The discussion section of a scoping review will be similar to a systematic review in that the results and limitations of the review are discussed. The results are placed in context of current literature, practice, and healthcare policy. Unlike a systematic review, however, the quality of the body of literature cannot be graded. As the body of literature used is more varied, the limitations in the sources should be highlighted. The conclusion of a scoping review may include recommendations for primary research in areas where gaps have been identified or for systematic reviews. Direct recommendations for practice are limited by the lack of formal methodological assessment. Therefore, practice implications should relate specifically to the objectives of the review.73

➤ Olding et al73 discussed knowledge gaps related to patients and families and socio-cultural issues that affected involvement. They identified study limitations related to the time frame of their search and not having looked into grey literature. They also suggested that future research focus on the nature of patient involvement, examination of contextual factors, and interprofessional dynamics that shape patient and family involvement in intensive care settings.

Several useful references are available for those who want to explore scoping reviews further.70,71,75,76 The Joanna Briggs Institute in Australia has developed a comprehensive resource manual.77

Although scoping reviews are a more recent contribution than systematic reviews, they are quickly becoming more common in the healthcare literature. They are not intended to provide a specific answer to a focused clinical question, but they support evidence-based practice by providing clinicians and researchers with information about the current state of knowledge on a topic. These studies typically focus on areas that are not well understood, creating an opportunity for practitioners to explore how they relate to their practice, and where further research is needed.
COMMENTARY
The value of conducting and reading systematic reviews, meta-analyses, or scoping reviews cannot be overestimated. It is reassuring that others are taking the time to carry out and share rigorous analyses of the existing literature.

A cautionary note is important, however, as these reviews are not panaceas and may not offer definitive treatment options for practice as much as they will help to clarify areas of practice that require further research. In a review of 2,535 reviews from the Cochrane Library, Clarke et al78 found that 82% of them indicated a need for further evaluation of the interventions studied.

To be useful as clinical references for evidence-based practice, systematic reviews should include transparent high-quality methods and be updated regularly as new data become available. For instance, the review groups for the Cochrane Collaboration are responsible for updates on a regular schedule depending on the currency of the question and availability of additional sources.3 Sometimes the update will include more recent studies, or it may indicate that no further research has been done since the original review was published. Both types of information are important.

Most importantly, as with any research report, systematic reviews are valuable tools to help us integrate information from many studies, but they are only tools. We must recognize that the results are only as good as the decisions that were made in developing and interpreting the review: which studies were included, how they were analyzed, and the degree to which the information is up to date. The results of reviews with only a small number of references should be viewed with some caution. In addition, descriptions of studies used in reviews are necessarily brief overviews, and do not provide details about interventions or measurements that may be important to a clinician for application to a particular patient. Strong statistical outcomes do not necessarily translate to clinically important ones.

Concerns have been raised that results of systematic reviews and meta-analyses may be misinterpreted or misused by the public and policy makers when the methods are questionable, including the pooling of studies with different populations and interventions that should not be combined.79 Systematic reviews and meta-analyses represent a vital component of evidence-based practice but will not replace sound clinical decision making.
15. Barelds I, Krijnen WP, van de Leur JP, van der Schans CP, Goddard RJ. Diagnostic accuracy of clinical decision rules to exclude fractures in acute ankle injuries: systematic review and meta-analysis. J Emerg Med 2017;53(3):353-368.
16. Leysen L, Beckwee D, Nijs J, Pas R, Bilterys T, Vermeir S, Adriaenssens N. Risk factors of pain in breast cancer survivors: a systematic review and meta-analysis. Support Care Cancer 2017;25(12):3607-3643.
17. Dixon-Woods M, Fitzpatrick R. Qualitative research in systematic reviews. BMJ 2001;323(7316):765-766.
18. Hilton G, Unsworth C, Murphy G. The experience of attempting to return to work following spinal cord injury: a systematic review of the qualitative literature. Disabil Rehabil 2018;40(15):1745-1753.
19. Agency for Healthcare Research and Quality. Research tools and data. Available at https://www.ahrq.gov/research/index.html. Accessed February 8, 2019.
20. Lucas BR, Elliott EJ, Coggan S, Pinto RZ, Jirikowic T, McCoy SW, Latimer J. Interventions to improve gross motor performance in children with neurodevelopmental disorders: a meta-analysis. BMC Pediatrics 2016;16(1):193.
21. PROSPERO. International Prospective Register of Systematic Reviews. National Institute for Health Research, National Health Service. Available at https://www.crd.york.ac.uk/prospero/. Accessed February 8, 2019.
22. Moher D, Pham B, Lawson ML, Klassen TP. The inclusion of reports of randomised trials published in languages other than English in systematic reviews. Health Technol Assess 2003;7(41):1-90.
23. Pickar JH. Do journals have a publication bias? Maturitas 2007;57(1):16-19.
24. Hopewell S, McDonald S, Clarke M, Egger M. Grey literature in meta-analyses of randomized trials of health care interventions. Cochrane Database Syst Rev 2007(2):MR000010.
25. Mahood Q, Van Eerd D, Irvin E. Searching for grey literature for systematic reviews: challenges and benefits. Res Synth Methods 2014;5(3):221-234.
26. Benzies KM, Premji S, Hayden KA, Serrett K. State-of-the-evidence reviews: advantages and challenges of including grey literature. Worldviews Evid Based Nurs 2006;3(2):55-61.
27. McAuley L, Pham B, Tugwell P, Moher D. Does the inclusion of grey literature influence estimates of intervention effectiveness reported in meta-analyses? Lancet 2000;356(9237):1228-1231.
28. Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med 2008;358(3):252-260.
29. Hart B, Lundh A, Bero L. Effect of reporting bias on meta-analyses of drug trials: reanalysis of meta-analyses. BMJ 2012;344:d7202.
30. Porter RJ, Boden JM, Miskowiak K, Malhi GS. Failure to publish negative results: a systematic bias in psychiatric literature. Aust N Z J Psychiatry 2017;51(3):212-214.
31. Cochrane Collaboration. Available at https://community.cochrane.org/organizational-info/resources/resources-groups/managing-editors-portal/editorialpolicy/editorial-resources-committee. Accessed March 28, 2019.
32. Hoffmann TC, Walker MF, Langhorne P, Eames S, Thomas E, Glasziou P. What's in a name? The challenge of describing interventions in systematic reviews: analysis of a random sample of reviews of non-pharmacological stroke interventions. BMJ Open 2015;5(11):e009051.
33. Hoffmann TC, Glasziou PP, Boutron I, Milne R, Perera R, Moher D, Altman DG, Barbour V, Macdonald H, Johnston M, et al. Better reporting of interventions: template for intervention description and replication (TIDieR) checklist and guide. BMJ 2014;348:g1687.
34. van Vliet P, Hunter SM, Donaldson C, Pomeroy V. Using the TIDieR Checklist to standardize the description of a functional strength training intervention for the upper limb after stroke. J Neurol Phys Ther 2016;40(3):203-208.
35. Olivo SA, Macedo LG, Gadotti IC, Fuentes J, Stanton T, Magee DJ. Scales to assess the quality of randomized controlled trials: a systematic review. Phys Ther 2008;88(2):156-175.
36. Equator Network. Enhancing the Quality and Transparency of Health Research. Available at http://www.equator-network.org/. Accessed April 3, 2019.
37. Tooth L, Bennett S, McCluskey A, Hoffmann T, McKenna K, Lovarini M. Appraising the quality of randomized controlled trials: inter-rater reliability for the OTseeker evidence database. J Eval Clin Pract 2005;11(6):547-555.
38. Maher CG, Sherrington C, Herbert RD, Moseley AM, Elkins M. Reliability of the PEDro scale for rating quality of randomized controlled trials. Phys Ther 2003;83(8):713-721.
39. de Morton NA. The PEDro scale is a valid measure of the methodological quality of clinical trials: a demographic study. Aust J Physiother 2009;55(2):129-133.
40. da Costa BR, Hilfiker R, Egger M. PEDro's bias: summary quality scores should not be used in meta-analysis. J Clin Epidemiol 2013;66(1):75-77.
41. Costa LO, Maher CG, Moseley AM, Elkins MR, Shiwa SR, Herbert RD, Sherrington C. da Costa and colleagues' criticism of PEDro scores is not supported by the data. J Clin Epidemiol 2013;66(10):1192-1193.
42. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MM, Sterne JA, Bossuyt PM. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155(8):529-536.
43. Whiting PF, Weswood ME, Rutjes AW, Reitsma JB, Bossuyt PN, Kleijnen J. Evaluation of QUADAS, a tool for the quality assessment of diagnostic accuracy studies. BMC Med Res Methodol 2006;6:9.
44. Hayden JA, van der Windt DA, Cartwright JL, Cote P, Bombardier C. Assessing bias in studies of prognostic factors. Ann Intern Med 2013;158(4):280-286.
45. Luchini C, Stubbs B, Solmi M, Veronese N. Assessing the quality of studies in meta-analyses: advantages and limitations of the Newcastle Ottawa Scale. World J Meta-Anal 2017;5(4):80-84.
46. Wells G, Shea B, O'Connell O, et al. The Newcastle-Ottawa Scale (NOS) for assessing the quality of nonrandomized studies in meta-analysis. Ottawa Hospital, Our Research. Available at http://www.ohri.ca/programs/clinical_epidemiology/oxford.asp. Accessed August 18, 2019.
47. COSMIN. Available at https://www.cosmin.nl/. Accessed April 17, 2019.
48. Mokkink LB, de Vet HCW, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, Terwee CB. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res 2018;27(5):1171-1179.
49. Simon S. Do the pieces fit together? Systematic reviews and meta-analysis. In: Simon SD, ed. Statistical Evidence in Medical Trials: What Do the Data Really Tell Us? New York: Oxford University Press; 2006:101-136.
50. Gosselin RA, Roberts I, Gillespie WJ. Antibiotics for preventing infection in open limb fractures. Cochrane Database Syst Rev 2004(1):CD003764.
51. Glass GV. Primary, secondary and meta-analysis of research. Educ Res 1976(5):3-8.
52. Sandelowski M, Barroso J. Creating metasummaries of qualitative findings. Nurs Res 2003;52(4):226-233.
53. Timulak L. Meta-analysis of qualitative studies: a tool for reviewing qualitative research findings in psychotherapy. Psychother Res 2009;19(4-5):591-600.
54. Fong SS, Chung JW, Chow LP, Ma AW, Tsang WW. Differential effect of Taekwondo training on knee muscle strength and reactive and static balance control in children with developmental coordination disorder: a randomized controlled trial. Res Dev Disabil 2013;34(5):1446-1455.
55. Farquhar C, Marjoribanks J. A short history of systematic reviews. BJOG 2019.
56. Roberts D, Brown J, Medley N, Dalziel SR. Antenatal corticosteroids for accelerating fetal lung maturation for women at risk of preterm birth. Cochrane Database Syst Rev 2017;3:CD004454.
57. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003;327(7414):557-560.
58. Hardy RJ, Thompson SG. Detecting and describing heterogeneity in meta-analysis. Stat Med 1998;17(8):841-856.
59. Chapter 9.5.4. Incorporating heterogeneity into random-effects models. In: Cochrane Handbook for Systematic Reviews of Interventions. Version 5.1.0 [updated March 2011]. The Cochrane Collaboration, 2011. Available at http://handbook.cochrane.org/chapter_9/9_5_4_incorporating_heterogeneity_into_random_effectsmodels.htm. Accessed April 21, 2019.
60. MEDCALC. Meta-analysis: introduction. Available at https://www.medcalc.org/manual/meta-analysis-introduction.php. Accessed April 19, 2019.
61. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to Meta-Analysis. West Sussex, UK: John Wiley & Sons; 2009.
62. Atkins D, Best D, Briss PA, Eccles M, Falck-Ytter Y, Flottorp S, Guyatt GH, Harbour RT, Haugh MC, Henry D, et al. Grading quality of evidence and strength of recommendations. BMJ 2004;328(7454):1490.
63. GRADEpro. Available at https://gradepro.org/. Accessed March 28, 2019.
64. Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y,
68. Huguet A, Hayden JA, Stinson J, McGrath PJ, Chambers CT, Tougas ME, Wozney L. Judging the quality of evidence in reviews of prognostic factor research: adapting the GRADE framework. Syst Rev 2013;2:71.
69. Shea BJ, Reeves BC, Wells G, Thuku M, Hamel C, Moran J, Moher D, Tugwell P, Welch V, Kristjansson E, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ 2017;358:j4008.
70. Arksey H, O'Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol 2005;8:19-32.
71. Peters MD, Godfrey CM, Khalil H, McInerney P, Parker D, Soares CB. Guidance for conducting systematic scoping reviews. Int J Evid Based Healthc 2015;13(3):141-146.
72. Munn Z, Peters MDJ, Stern C, Tufanaru C, McArthur A, Aromataris E. Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med Res Methodol 2018;18(1):143.
73. Olding M, McMillan SE, Reeves S, Schmitt MH, Puntillo K, Kitto S. Patient and family involvement in adult critical and intensive care settings: a scoping review. Health Expect 2016;19(6):1183-1202.
74. Tricco AC, Lillie E, Zarin W, O'Brien KK, Colquhoun H, Levac D, Moher D, Peters MDJ, Horsley T, Weeks L, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med 2018;169(7):
Alonso-Coello P, Schunemann HJ. GRADE: an emerging 467-473.
consensus on rating quality of evidence and strength of 75. Peters MD. In no uncertain terms: the importance of a defined
recommendations. BMJ 2008;336(7650):924-926. objective in scoping reviews. JBI Database System Rev Implement
65. ECRI Guidelines Trust. Available at https://guidelines.ecri.org/. Rep 2016;14(2):1-4.
Accessed April 12, 2019. 76. Levac D, Colquhoun H, O’Brien KK. Scoping studies:
66. Brożek JL, Akl EA, Jaeschke R, Lang DM, Bossuyt P, Glasziou P, advancing the methodology. Implement Sci 2010;5:69.
Helfand M, Ueffing E, Alonso-Coello P, Meerpohl J, et al. 77. Peters MDJ, Godfrey C, McInerney P, Soares CB, Khalil H,
Grading quality of evidence and strength of recommendations Parker D. Chapter 11: Scoping reviews. In: Aromataris E,
in clinical practice guidelines. Part 2 of 3. The GRADE Munn Z, eds. Joanna Briggs Institute Reviewer’s Manual:
approach to grading quality of evidence about diagnostic The Joanna Briggs Institute; 2017. Available from https://
tests and strategies. Allergy 2009;64(8):1109-1116. reviewersmanual.joannabriggs.org/. Accessed April 20, 2019.
67. Brożek JL, Akl EA, Alonso-Coello P, Lang D, Jaeschke R, 78. Clarke L, Clarke M, Clarke T. How useful are Cochrane
Williams JW, Phillips B, Lelgemann M, Lethaby A, Bousquet J, reviews in identifying research needs? J Health Serv Res Policy
et al. Grading quality of evidence and strength of recommenda- 2007;12(2):101-103.
tions in clinical practice guidelines. Part 1 of 3. An overview of 79. Barnard ND, Willett WC, Ding EL. The misuse of
the GRADE approach and grading quality of evidence about meta-analysis in nutrition research. JAMA 2017;318(15):
interventions. Allergy 2009;64(5):669-677. 1435-1436.
6113_Ch38_597-612 02/12/19 4:34 pm Page 597
CHAPTER 38

Disseminating Research
—with Peter S. Cahn
The culmination of the research process comes when you can share your findings with the clinical and scientific communities. Only shared information can clarify, amplify, and expand the professional body of knowledge. This is the important link between the research process and practice. Without dissemination, evidence-based practice cannot be realized.

Determining the most appropriate format to present your research will depend on the project and your intended audience. Research findings can be developed into a written article, which provides a permanent record that a broad range of professionals may access. Oral reports and poster presentations at professional meetings serve to disseminate research information in a timely fashion, although the audience is limited, and the record of research findings will be found only in abstract form. Graduate students may be required to document their work in the form of a thesis or dissertation but may be given the option of writing it in the form of a journal article. In the digital era, scholarly communication may also occur in nontraditional media. The purpose of this chapter is to describe the process of deciding the appropriate venue for your research study and preparing and submitting the results for publication or presentation.

■ Choosing a Journal

The typical place to disseminate a completed research study is through a peer-reviewed journal. This venue not only assures the widest audience of interested readers but also carries weight in the professional community. With over 28,000 peer-reviewed journals, you will likely face more than one possible choice of where to submit your work.1 To winnow the universe of possible publication channels, review the journals' aim and scope. Editors reveal that one of the most common reasons for rejecting a manuscript is that it does not fit the editorial mission of the journal. You can search for recently published articles to gain a sense of a journal's mission.

The Journal/Author Name Estimator (JANE) is a free online tool that will suggest possible journals based on a title or abstract.2 Some journals require subscription while others are online open access. Many print journals also provide options for online or open access publication.

Journal Impact

Despite all the considerations in selecting a target journal, one factor often outweighs the others: reputation, for which several indices have been developed. The most common index is the impact factor, also called the journal impact factor (JIF), which measures how frequently a journal's articles have been cited elsewhere in the previous 2 years. Web of Science, a subscription-based service, calculates impact factors for thousands of journals. It is considered a reflection of the relative importance of a journal and often influences where certain researchers choose to publish their work. Some journals also report a 5-year impact factor.

The impact factor, however, does not necessarily reflect the importance of a journal within a defined field
of study. The 2-year window for citation favors disciplines where discoveries occur frequently and consensus changes rapidly. For example, professional journals in rehabilitation fields often have ratings between 1.0 and 4.0, whereas the impact factor for some of the flagship medical journals can be over 50. Lower scores do not imply that those journals are lower quality or less influential. Articles that appear in specialty journals tend to receive fewer citations, but they are more likely to reach an appropriate specialized audience.

The use of the impact factor has become a point of controversy.3 Its original intention was to support budgetary decisions of libraries, not to serve as a measure of scientific worth for individual studies. In response, editors and authors have introduced several other indicators of scholarly impact (see Box 38-1).

Open Access Journals

The diminished viability of traditional publishing models and the desire to make research findings widely available have triggered the development of open access journals, which are distributed online at no cost to the reader. In exchange for rapid publication of manuscripts for free, authors bear some of the cost, typically ranging from $1,000 to $3,000 (see Box 38-2).

There are three types of open access:

• Green open access means that after an article is published in a traditional subscription journal, the author makes a copy available in a separate repository.
• Gold open access refers to journals that make all articles freely available on their website without restrictions.
• Hybrid gold access refers to traditional subscription journals that provide an option to an author to publish articles with free access immediately on the journal's website by paying a fee.

There are several types of licenses for open access articles, usually obtained through Creative Commons (CC) licensing.7 Two types are used most often. Noncommercial open access (CC-BY-NC) permits only noncommercial reuse, distribution, and reproduction of an article, provided the original work is properly cited and unchanged. Unrestricted open access (CC-BY) allows others to use the article for any purpose, including commercial use, as long as it is properly cited with a link to the Creative Commons license and a description of what changes were made.

Other useful resources regarding open access include the Directory of Open Access Journals (DOAJ)8 and Unpaywall,9 both of which maintain a database of full-text open access journals.

See the Chapter 38 Supplement for an example of an open access article. Other examples can also be found in the supplements for Chapters 37 and 38. The terms of the license are noted at the bottom of the first page of each article.

■ Authorship

Authorship credit is an important element of research publication. It implies responsibility and accountability for the published work. To avoid conflicts late in the process, collaborators should decide authorship before a project begins, including the order in which authors will be credited. As interprofessional research expands, articles are increasingly attributed to multiple authors.16 The names should be listed in order of contributions. The first author is typically designated as the primary author, and the last author often represents a senior researcher who provided oversight to the project.

Most journals require all authors to provide a signed copyright release that includes a statement of their contributions to the article. These may include:

• Conception and design of the study
• Participation in data collection
• Carrying out data analysis
• Writing the article
• Project management
• Obtaining funding
• Reviewing the final paper or other consultation.

Journals may include these disclosures on the first page or at the end of the article.

Criteria for Authorship

The International Committee of Medical Journal Editors (ICMJE) has established four criteria for authorship, all of which should be met for someone to be included as an author17:

1. Substantial contributions to the conception and design of the work or the acquisition, analysis or interpretation of data; AND
2. Drafting the work or revising it critically for important intellectual content; AND
3. Final approval of the version to be published; AND
4. Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

All authors should meet these four criteria, and anyone who meets these criteria should be included as an author (see Box 38-3). The definition of "substantial contribution" may be open to interpretation, but items 2 and 4 have been considered minimum thresholds.18 Other contributors who do not qualify for authorship should be recognized in an acknowledgment at the end of the document.

The primary author is also usually designated as the corresponding author, who will take responsibility for communication with the journal during manuscript submission, peer review, and any other administrative requirements, including IRB approval, clinical trial registration, and obtaining author statements of contributions and conflict of interest information. This person will also be identified on the first page of the article with an email address as a contact for readers.

Box 38-3 Taking Credit Where Credit Is Not Due
In 1981, John Darsee was a physician and researcher at Brigham & Women's Hospital in Boston, leading NIH-funded research in cardiology. He had published several studies in prestigious medical journals, studies that had been cited in more than 300 subsequent articles. When it was uncovered that he had fabricated data, nine published studies were retracted as well as many abstracts.19
But this isn't solely about one researcher's scientific misconduct. Many of Darsee's colleagues were coauthors on these studies and were unaware of his manipulations of data because they had little or no direct contact with the work that was reported. In some cases they were not even aware that the studies using their names had actually been performed, sometimes not knowing that their names were on an abstract until after it had been published.20
Although Darsee avowed that his coauthors had no responsibility for what happened, it isn't that simple. There are inherent responsibilities in assuming authorship, even if one is not directly involved in data collection. Coauthors should be able to defend the research and be confident of its integrity. That is why journals require signed statements with all authors testifying to their level of involvement, assuring that they have had meaningful participation in the planning, design, interpretation, and writing of the article.

■ Standards for Reporting

Replicability is an essential component of evidence-based practice. For scientists to validate and build on the work of others, the research process must be communicated
with sufficient detail and clarity. To support best practice in publishing scientific studies, many journal editors and researchers have agreed to unified standards for reporting each type of research. The standards provide guidelines for authors so that they will include appropriate sections and information in their manuscripts in a way that can be understood by readers, replicated by researchers, used by clinicians for evidence-based practice, and included in a systematic review (see Table 38-1). These guidelines have been brought together under the "umbrella" of the EQUATOR network (Enhancing the QUAlity and Transparency Of health Research), which is an international collaboration that promotes quality and consistency in research publications.21

This effort grew out of original work that focused solely on randomized controlled trials (RCT), with the development of the Consolidated Standards of Reporting Trials (CONSORT) (see Table 38-2). These guidelines consist of a checklist and template for a flow diagram to document the selection and attrition of subjects (refer to Figs. 13-2 and 15-2). Several extensions have since been developed that address unique features of specific designs, such as repeated measures, noninferiority, N-of-1, or pragmatic trials. The TIDieR (Template for Intervention Description and Replication) checklist complements CONSORT, with guidelines specific to describing intervention.22 Standards have also been developed for other forms of health-related research, including observational designs, diagnostic and prognostic studies, clinical practice guidelines, qualitative research, case reports, and systematic reviews. These checklists have many items in common with the CONSORT statement. All of these guidelines can be accessed on the EQUATOR website.21 Elaboration and Explanation Papers have been written to clarify each item on these checklists.
Table 38-2 CONSORT 2010 Checklist of Information to Include When Reporting a Randomized Trial
SECTION/TOPIC ITEM # CHECKLIST ITEM
Title and Abstract 1a Identification as a randomized trial in the title
1b Structured summary of trial design, methods, results, and conclusions (see CONSORT
for abstracts)
Introduction
Background and 2a Scientific background and explanation of rationale
Objectives 2b Specific objectives or hypotheses
Methods
Trial Design 3a Description of trial design (such as parallel, factorial) including allocation ratio
3b Important changes to methods after trial commencement (such as eligibility criteria),
with reasons
Participants 4a Eligibility criteria for participants
4b Settings and locations where the data were collected
Interventions 5 The interventions for each group with sufficient details to allow replication, including
how and when they were actually administered
Outcomes 6a Completely defined prespecified primary and secondary outcome measures, including
how and when they were assessed
6b Any changes to trial outcomes after the trial commenced, with reasons
Sample Size 7a How sample size was determined
7b When applicable, explanation of any interim analyses and stopping guidelines
Random Sequence 8a Method used to generate the random allocation sequence
Generation 8b Type of randomization; details of any restriction (such as blocking and block size)
Random Allocation 9 Mechanism used to implement the random allocation sequence (such as sequentially
Concealment Mechanism numbered containers), describing any steps taken to conceal the sequence until
interventions were assigned
Implementation of 10 Who generated the random allocation sequence, who enrolled participants, and who
Randomization assigned participants to interventions
Blinding 11a If done, who was blinded after assignment to interventions (for example, participants,
care providers, those assessing outcomes) and how
11b If relevant, description of the similarity of interventions
Statistical Methods 12a Statistical methods used to compare groups for primary and secondary outcomes
12b Methods for additional analyses, such as subgroup analyses and adjusted analyses
Results
Participant Flow 13a For each group, the numbers of participants who were randomly assigned, received
(Flowchart) intended treatment, and were analyzed for the primary outcome
13b For each group, losses and exclusions after randomization, together with reasons
Recruitment 14a Dates defining the periods of recruitment and follow-up
14b Why the trial ended or was stopped
Baseline Data 15 A table showing baseline demographic and clinical characteristics for each group
Numbers Analyzed 16 For each group, number of participants (denominator) included in each analysis and
whether the analysis was by original assigned groups
Outcomes and Estimation 17a For each primary and secondary outcome, results for each group, and the estimated
effect size and its precision (such as 95% confidence interval)
17b For binary outcomes, presentation of both absolute and relative effect sizes is recommended
Ancillary Analyses 18 Results of any other analyses performed, including subgroup analyses and adjusted
analyses, distinguishing prespecified from exploratory
Continued
Table 38-2 CONSORT 2010 Checklist of Information to Include When Reporting a Randomized Trial—cont’d
SECTION/TOPIC ITEM # CHECKLIST ITEM
Harms 19 All important harms or unintended effects in each group (for specific guidance see
CONSORT for harms)
Discussion
Limitations 20 Trial limitations, addressing sources of potential bias, imprecision, and, if relevant,
multiplicity of analyses
Generalizability 21 Generalizability (external validity, applicability) of the trial findings
Interpretation 22 Interpretation consistent with results, balancing benefits and harms, and considering
other relevant evidence
Other Information
Registration 23 Registration number and name of trial registry
Protocol 24 Where the full trial protocol can be accessed, if available
Funding 25 Sources of funding and other support (such as supply of drugs), role of funders
Source: Moher D, Hopewell S, Schulz KF, et al. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel
group randomised trials. BMJ 2010;340:c869. Available at http://www.consort-statement.org/downloads.
For purposes of illustration, this discussion will focus on standards for RCTs using the CONSORT statement as a guide. However, this checklist serves as reasonable guidance for all types of studies. See the Chapter 38 Supplement for links to checklists and extensions for intervention, observational and qualitative designs.

■ Structure of the Written Research Report

Journal articles typically follow the IMRAD format (introduction, methods, results, discussion). As the CONSORT statement illustrates, the elements of a research report require substantive detail. Whichever reporting guidelines you follow, each section should be indicated by headings and subheadings within the article. The introduction provides the rationale for the study, the methods and results describe what you did and what you found, and the discussion shapes the "story," where you situate the research question in a larger context. Together, the sections of the report make the case for the significance of your research and how it advances the scholarly conversation.23

The items in the CONSORT statement represent guidelines for the inclusion of needed information. The actual order in which specific information is presented may vary in different articles. The important message is that the information is there and is easy to find.

See the Chapter 38 Supplement for an annotated example of a research report that follows the CONSORT guidelines.

Title and Abstract

Most readers will encounter the title of your study first, so it should be appropriately descriptive. It should include the population or condition, the specific intervention or predictive factors, the outcome, and the type of research or design. Many readers will choose to read an article based solely on the title, so you want to make sure it provides adequate information. One journal estimates that 50% of its website traffic comes from Google, so including keywords in the title that might be used in a literature search will allow more readers to find your study.24 Limiting acronyms in the title will also attract a wider range of readers. Some journals will restrict the length of the title to a number of characters or words. A good guideline is to limit the title to 15 words.

Research reports include an abstract that summarizes the content of the article. The structure of the abstract will be defined by the journal, including such categories as purpose or background, description of subjects, procedures used, summary of the results, and major conclusions. The journal will also indicate a limit on words for the abstract, usually between 250 and 500 words, which means providing a complete but condensed description of the study. To engage the reader's interest in reading further, the abstract should be clearly written and include the key information needed to understand the study.25 To assure that it accurately reflects the finalized report, abstracts should be written after the paper is completed.

The title and abstract play an important role in electronic searches. Therefore, they must be able to stand alone, despite their brevity. Many journals ask authors to include four to eight keywords that will be listed at the
end of the abstract, words that can be used as search terms, and may include MeSH terms (see Chapter 6). Keywords also help editors identify appropriate peer reviewers for the manuscript.

Studies that have been submitted to a clinical trials registry are required to provide their identification number and date of registration.

Introduction

The body of the article begins with the introduction, which usually does not have a subheading. It provides a description of the research question and the specific motivation for the author to answer it. A compelling justification will point to gaps or conflicts in the literature and may include demographics, incidence, or costs associated with the condition of interest. After reading the first paragraph of the introduction, the reader should understand the problem being studied and why it is important (see Chapter 3). A review of literature will follow with a concise description of the current state of knowledge on the topic, referencing recent findings as well as classic studies. Systematic reviews can be helpful in the introduction to provide a comprehensive perspective.

Many journals now require that a systematic review be cited to demonstrate that the author has considered the full range of available information on the topic and that the research is warranted. If the search for a systematic review was unsuccessful, a statement to that effect should be included.

The literature review should reflect the relevant background that is necessary to support the theoretical rationale for the study without providing the equivalent of an annotated bibliography. Citing literature selectively will keep the focus on your research question. The introduction should logically end with a statement of the specific purpose of the study, delineating the variables that were studied and the research hypotheses or guiding questions that have been investigated in the study.

Methods

The methods section is the map you followed when conducting your research. The details should be clear enough that someone else could recreate the steps you took in conducting your research. Although the methods section comes second in the study, many authors write it first because it is the most straightforward.

Methods are composed of several subsections, each labeled with its own subheading. The first subsection is often a description of the study design, although this may be included at the end of the introduction.

The next section is typically labeled subjects or participants. It should begin by explaining how participants were recruited, the setting or accessible population from which they were drawn, and the timeframe for data collection. Observational studies should include how cases and controls were defined. This will be important information to establish external validity. Eligibility should be defined by inclusion and exclusion criteria. Most journals require a statement documenting that subjects read and signed an informed consent form and that the appropriate Institutional Review Board (IRB) approved the project.

The methods section continues with a description of the study procedures, the number of groups, the randomization procedure if applicable, and how blinding was incorporated. This section explains how data were obtained, through direct measurement, interview, or observation, or by accessing data from other sources such as medical records or questionnaires. The variables being studied should be clearly spelled out, including the independent and dependent variables, as well as potential confounders that were measured.

Operational definitions of interventions and outcome measurements should be included with sufficient detail to allow replication. If follow-up data are obtained, the methods for contacting participants should be explained. If appropriate, primary and secondary outcomes should be delineated. If the measurement or treatment procedures are standardized and well known, they can be described briefly and you can refer the reader to the original sources for a more detailed description. Description of equipment should include manufacturer and model.

Many researchers develop a written protocol that is used as a guide during data collection to be sure that all procedures are followed properly. This protocol can easily serve as an outline for this section of the report. Presenting procedures in chronological order also helps to create a logical presentation. Diagrams, photographs, and tables can clarify how data were collected. Be sure to get permission for any material taken from other sources and for the use of photographs from patients or participants.

The methods section ends with a full description of the procedures used for data analysis, including specific statistical procedures and the accepted level of significance. For intervention studies, a statement should clarify if an intention-to-treat analysis was done (see Chapter 15). This section should include a description of the method used to determine the needed sample size, including level of significance, desired power, estimated effect size and how this value was determined. Procedures for handling
missing data should be described. If unique statistical methods are used, they should be explained and referenced. Analysis plans should be clear for each of the specified outcomes.

Results

The results section contains a report of results without commentary or interpretation. It is a description of all findings, usually accompanied by a visual depiction of how data do or do not support the specific aims or hypotheses of the study. The narrative should present general descriptions of findings, and the tables and figures will present the details.

The results should begin with the description of the final sample, including a flow diagram showing how many participants were part of the final analysis and reasons for attrition (see Chapter 14). The results usually include a table that shows baseline measures for demographic and clinical variables for each group. Baseline characteristics should be examined and statistical comparison reported to indicate if groups were equivalent at the start. The flow diagram should indicate why attrition occurred, loss to follow-up, and to what extent participants remained in their assigned groups. The results of an intention-to-treat analysis should be described, including how missing data affected outcomes.

Statistical results should be presented for primary and secondary outcomes. Descriptive statistics should be included for each group and the total sample. Means should be expressed with standard deviations (or medians with ranges for non-normal distributions). The outcomes of statistical tests for group differences should include values for test statistics, degrees of freedom, p values, and effect size. Observational and correlational studies should include measures of risk or diagnostic accuracy, results of regression analyses, and, if appropriate, present the number needed to treat (NNT) or number needed to harm (NNH) (see Chapter 34). All point estimates should also be expressed with confidence intervals. If expected findings are not significant, the possibility of Type II error should be addressed through a power analysis.

Sometimes nonsignificant outcomes are referred to as "negative results," which is really a misrepresentation of findings. It means that the study hypothesis was not supported, but this can be useful information! Because this is a statistical issue, it is more appropriate to refer to such studies as having "nonsignificant" findings.

Two major principles guide the writing of the results section. First, data presented in tables and figures should not be repeated in the narrative. Refer to the tables and figures for details, but only summarize the details in the text. The reader should be able to understand the results without referring to the tables and should be able to understand the tables without referring to the text. Second, the author should not discuss how this information could be applied to practice or interpret the outcomes; that will come in the discussion section.

Discussion

The discussion section is where you will establish the take-home message of the research report. It reflects your interpretation of the results in terms of the purpose of the study and its relation to clinical practice. This part of the article is the least formulaic in format, but it must be logical and focused. It will include comments on the importance of the results, immediate or potential applicability of results to clinical practice, generalizability of findings, limitations of the study, suggestions for future research, and clinical implications.

The discussion may begin with an overview of findings. It should describe the importance of these findings, not simply reiterate the results, with a focus on explanations and interpretations of the observed outcomes, emphasizing how they either support or refute previous work or clinical theories. All results should be addressed, including those that were not statistically significant, explaining how effect size and clinical significance should be interpreted. It is helpful to develop an outline for the discussion, reflecting important points that you want to make about your findings, which can then be compared to the results of previous studies.

The limitations of the study, including possible extraneous variables that could have affected the outcomes, should be identified and their effect on internal validity explained. Some of these factors may have been identified before the study began and others will have become evident during the course of data collection or analysis. These may include a small sample size, attrition of subjects, or lack of subject adherence to the protocol. It is essential to delineate extraneous factors so that the reader can examine the results realistically. External validity may also present a limitation if the sample came from one setting or particular geographic area or if sample characteristics affect generalizability of findings.

Every research endeavor leads to further questions. Sometimes, these questions arise out of the expressed limitations of a study and the need to clarify extraneous factors, design issues, or the effect of specific treatment procedures. Given the results of a study, you may want to reconsider a particular theory and how it may be applied. Suggestions for future research will develop from
section. One is that tables and figures should not duplicate these ideas and should be proposed.
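The reporting conventions above (means with standard deviations, confidence intervals around point estimates, and effect sizes) can be illustrated with a short script. This is only a sketch using Python's standard library: the data are invented, and the normal-approximation 95% confidence interval and pooled-SD Cohen's d shown here are common conventions, not the book's worked example.

```python
import statistics
from math import sqrt

# Hypothetical outcome scores for two groups (illustrative data only)
treatment = [12.1, 14.3, 11.8, 15.0, 13.2, 14.8, 12.9, 13.7]
control   = [10.2, 11.5,  9.8, 12.0, 10.9, 11.1, 10.4, 11.8]

def describe(scores):
    """Mean, sample SD, and a normal-approximation 95% CI for the mean."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)        # sample SD (n - 1 denominator)
    se = sd / sqrt(len(scores))          # standard error of the mean
    return m, sd, (m - 1.96 * se, m + 1.96 * se)

def cohens_d(a, b):
    """Effect size: difference in means divided by the pooled SD."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

for name, grp in [("Treatment", treatment), ("Control", control)]:
    m, sd, (lo, hi) = describe(grp)
    print(f"{name}: mean = {m:.2f}, SD = {sd:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")

print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```

Reporting each group this way (point estimate, variability, interval, and effect size) gives readers everything they need to judge both statistical and clinical significance.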
6113_Ch38_597-612 02/12/19 4:34 pm Page 605
Understanding the structure of a research report is important for writing an article and for appraising it. The processes described in Chapters 36 and 37 for critical appraisal and systematic reviews are useful considerations, as they both reflect how others would appraise your work. CONSORT and other guidelines can also be used to focus critical appraisal. It is helpful to use this same approach when critiquing your own work for clarity and quality as your writing progresses.

Other Types of Written Submissions
Several types of journal submissions other than a full-length research article can provide an opportunity for being published. Each may have different limitations on words, references, or other material. Some will be peer-reviewed, and others will not. Depending on the stage of research completed or complexity of your methods, a briefer format may be appropriate to convey your findings.

• Case reports provide an in-depth description of a unique or novel case or clinical approaches to care (see Chapter 20).

• Reviews can be a good option for those newer to the research process, working with a team on a systematic review, meta-analysis, or scoping review summarizing major research (see Chapter 37). Reviews may also be submitted as clinical guidelines based on literature review. These will be treated like other research studies and will be peer-reviewed.

• Perspectives/Commentaries/Expert views are all different forms of overview in an area of contemporary interest, a description of a new approach to patient care, or presenting a point of view regarding issues of current concern in the discipline.

Writing Style
The goal of scientific writing is to convey research studies clearly so that readers can evaluate and apply the findings. Scientific writing is not the creative process we learned in school. The standard article template leaves little room for metaphors or colorful language, although this does not mean writing needs to be stilted. Pinker30 has identified "the curse of knowledge" as a primary source of the impenetrable prose that characterizes some journal articles. So steeped are researchers in their scholarly enterprise, they may not see the importance of communicating their work for a wider audience. Abbreviations and jargon creep in when authors write only for other specialists.

Writing with the reader in mind requires that you design a logical structure for the manuscript, explain ...
... with just enough background to justify the study and establish what it will contribute to current knowledge. Audiences appreciate your stating hypotheses or aims that will allow them to follow the logic of the presentation. The methods section should describe the research in sufficient detail for the audience to understand what was done. There is no time to include comprehensive descriptions, so important decisions have to be made about what information is necessary.

The most important parts of the presentation are the results and discussion. Keep to the key points. You want these sections to focus on the "take home" message. Include discussion of important limitations. The presentation should end with a clear statement of conclusions and implications for practice and further research.

Visual Presentation
Because oral presentations typically present a lot of information in a short amount of time, the use of slides is essential for the audience to be able to follow along. Slides should emphasize and illustrate content to focus the audience's attention on important details. Although some recommendations suggest having no more than 15-20 slides, the number of slides should be determined by identifying the key points of the written text—but keep it as minimal as possible.

People generally find it difficult to listen to one thing and read another at the same time. Slides should follow the language used in the presentation but should not contain every word. The slides should serve as an outline of the work, and the speaker should not read the slides. Use of bullets is helpful to avoid long sentences. Most importantly, the audience should be able to read the slides easily from a distance. Limit the verbiage on the slide to phrases or terms that will guide the listener to the important points.

The adage "a picture is worth a thousand words" applies here. Photographs and graphs greatly facilitate the audience's understanding of methods and results. Videos may be informative but should not last more than a minute or two. Don't put up a graphic and expect the audience to interpret it on the spot—guide them through the information.

When a study has generated a lot of data, you may have to be selective in what to include because listeners cannot absorb mounds of data in such a short time span. Tables may be useful when actual data are relevant—but they are often so busy that the audience cannot read them. Is every p value important, or can you show which variables are significantly different in a chart? Don't apologize for busy or unreadable slides—just don't include them! Graphics are usually more informative.

Many conferences request that you submit your slides in advance so they can be uploaded for participants to access them before they attend your session (avoiding the need for handouts). This timing needs to be built into your planning. Some will also request your slide deck in advance so that it can be loaded onto the computer in the presentation room. This allows for a smooth transition between successive speakers.

Some other hints: If you are not a seasoned speaker, it is a good idea to rehearse your presentation for timing and clarity. Ask colleagues to listen and give you feedback. Remember your public speaking lessons—speak slowly and clearly, don't sway at the podium, look up from your paper if you are reading it. Get to the room where you will be presenting with good lead time so you can load your slides onto the computer and you can become familiar with the set-up and surroundings. You will be able to see your slides on the computer at the podium, so you do not have to keep looking backwards at the screen to know what is being projected. Mostly, relax and enjoy the experience! Remember people are there because they are interested in your work.

Presentation programs like PowerPoint provide a lot of creative options—ornate backgrounds, fancy fonts, animations, colors, clip art. Keep it simple. The purpose of a professional presentation is to get the information across, not to entertain. You want the audience to be engaged, not distracted by watching words fly in and fade out or slides with novel transitions. Keep colors to a minimum and use a consistent theme. Some people like dark backgrounds, others like lighter backgrounds. Both work well—just make sure the backgrounds work with your graphics.

Poster Presentations
A poster presentation is a report of research that is displayed so that it can be read and viewed by large groups in a casual atmosphere. Posters afford a special opportunity for researchers and their colleagues to exchange ideas in conference settings. Poster sessions are organized so that each poster is available for several hours, although the author may be required to stand with it for only 1 or 2 hours. Posters are usually displayed in an open area where interested participants view the posters in a less formal manner. In this case, the researcher is available to answer questions or engage in discussion. The advantage of the poster presentation is that interested members of the audience can study the implications of a study at a comfortable pace, and the researcher has an opportunity to clarify or amplify details of the study. The disadvantage is that most posters are viewed by a small number of people.

Content and Layout
There is no fixed format for a poster. If you take the time to view posters at a conference, you will see a wide
variety, and you can judge for yourself which are more effective than others. You want a poster to look interesting, not overwhelming to those passing by. Figure 38-1 is one sample arrangement of a poster.

The poster should contain the major elements of the study in a clear, brief series of statements including title, purpose, hypothesis or specific aims, method, results and discussion, and conclusions. The poster should be self-explanatory, but "telegraphic" in style; that is, content should include key words and phrases and not necessarily complete sentences. Tables, graphs, or photographs should summarize and illustrate important findings or unique aspects of the method. The most effective posters do not contain so much written material that the observer gets lost but should be complete enough to allow the observer to understand the full intent of the study.

The conference sponsor will provide guidelines about the size and composition of the board that will be available and how the poster will be affixed (if you need to bring push pins). The customary size is 4 feet high and 4, 6, or 8 feet wide. Many free poster templates are available on the Web for different sizes. These can be designed using a single PowerPoint slide. The content elements can be moved about the template to find the best arrangement for the logical flow of information. The introductory materials should be placed at the top left and the conclusion at the bottom right. The effective poster should be legible and uncluttered with content presented in sharp contrast to its background. The size and type of the text should be readable from 4 feet away, including data in tables and figure legends. If a study is funded, you may include information about the grant and funding agency on the bottom of the poster.

Some researchers like to include references on a poster, but this may be a space issue. When the poster cannot accommodate important information, presenters may have handouts for viewers with copies of the poster and additional material.

Posters can be printed in a variety of ways. You should find out what resources are available to you through your institution. Many commercial vendors can print them, but beware of costs. Get estimates of time needed for printing to be sure the poster is ready when you need it!
... are not ready for a professional journal, but will still be of interest to readers. These are good publications for pilot studies that inform readers of the direction of your work. They may include reports of interviews with experts in certain areas, often sparked by published research. These are excellent venues for dissemination, either for those early in their research careers, or for the purpose of targeting specific audiences.

Tracking Metrics
Increasingly, journals are tracking how often an article is viewed, downloaded, and shared. Altmetrics provides a score that reflects mentions of your work on social media, including bookmarks, links, blog postings, tweets, policy documents, and online reference managers.43 The score is displayed in a ring with different colored sections that represent the various media sources. This score gauges the importance of scholarly output by authors, essentially monitoring the amount of attention the work has received.44 Reference managers like Mendeley will also let you know how many users have saved your citation to their personal libraries.45 ResearchGate can notify you when others cite your studies.37

These alternative metrics do not necessarily reflect the quality of your research, but they do indicate the breadth and variety of reactions to it. Just as journals earn impact factors, individual authors can calculate the impact of their published work using the h index, which measures a combination of the number of articles published by a researcher and the number of citations they receive.46 Creating a profile on Google Scholar will not only tabulate your h index score, but will also allow other researchers with similar interests to learn about your work.
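The h index described above has a simple computational definition: a researcher has index h if h of their papers have at least h citations each. A minimal sketch in Python (the citation counts are invented for illustration):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # at least `rank` papers have >= `rank` citations
        else:
            break
    return h

# Hypothetical citation counts for one author's articles
print(h_index([10, 9, 5, 4, 3, 1]))  # 4 papers have at least 4 citations -> 4
```

Note that a single highly cited paper cannot raise the index by itself; the h index rewards a sustained body of cited work, which is why it is used to profile authors rather than individual articles.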
COMMENTARY

Disseminating your work occupies this book's final chapter because it marks the culmination of the research process. Yet, the considerations that go into publishing should guide the decisions you make at all prior steps. Defining your question influences where your findings will ultimately appear. Similarly, choosing a research design and a data analysis technique will shape what kind of article you will write. Veteran researchers often map out from the beginning of a project how many manuscripts it will yield and where they will be submitted.

Writing often falls prey to procrastination. The lack of a fixed timeline allows researchers to find all sorts of excuses not to write. There are always more data to collect, more references to check, and more tests to run. Writing requires stretches of uninterrupted time that are increasingly rare in our multitasking clinical and academic careers. Setting up a schedule to write can keep you focused, and you may also find that deadlines are more compelling if you collaborate with multiple authors.

One of the hallmarks of the scientific process is a commitment to critical examination whereby researchers subject their findings to the scrutiny of others, providing a basis for validation of data from primary sources of information. When you publish your results, you expose your work to the scholarly community—contributing to the literature and possibly facing rejection. You should be aware of principles of academic integrity, assuring that your work is original and accurate (see Chapter 7).

Think of this process as one that contributes to your growth as a scholar. When we see a journal's tables of contents or the extensive resumes of senior scholars, it is not evident how many articles required multiple submissions before they were accepted, or how many projects never received funding or were never published. You will pay your dues just as they did, so don't be discouraged.

The process of publishing your work has tangible benefits for evidence-based practice. By adding your voice—backed by evidence—to the scholarly conversation, you allow others to build on your findings in the pursuit of scientific knowledge. Through this ongoing process, clinical practices improve so that we can fulfill our professional commitment to improve health and quality of life.
APPENDIX A
Statistical Tables

TABLE A-1  Areas Under the Normal Curve, z  614
TABLE A-2  Critical Values of t  616
TABLE A-3  Critical Values of F (α = .05)  617
TABLE A-4  Critical Values of the Pearson correlation, r  618
TABLE A-5  Critical Values of Chi-Square, χ2  619
TABLE A-6  Table of Random Numbers  620

These tables are also available in Appendix A Online.

Additional tables available in Appendix A Online and in chapter supplements:

TABLE A-7  Critical Values of the Studentized Range Statistic, q (α = .05)
    Also available in the Chapter 26 Supplement
TABLE A-8  Critical Values for the Mann-Whitney U Test
    Also available in the Chapter 27 Supplement
TABLE A-9  One-Tailed Probabilities Associated with x in the Binomial Test
    Also available in the Chapter 18 and 27 Supplements
TABLE A-10  Critical Values of T for the Wilcoxon Signed-Ranks Test
    Also available in the Chapter 27 Supplement
TABLE A-11  Critical Values of Spearman's Rank Correlation Coefficient, rs
    Also available in the Chapter 29 Supplement
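Values like those in Table A-1 can also be generated computationally. As a small illustration (not a replacement for the printed tables), Python's standard library reproduces areas under the normal curve and the familiar two-tailed critical z value:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution (mean 0, SD 1)

# Area under the normal curve between the mean (z = 0) and z = 1.96
area = z.cdf(1.96) - 0.5
print(f"Area from 0 to 1.96: {area:.4f}")   # about 0.4750

# Two-tailed critical z value for alpha = .05
critical = z.inv_cdf(1 - 0.05 / 2)
print(f"Critical z (alpha = .05, two-tailed): {critical:.2f}")  # about 1.96
```

Critical values of t, F, and chi-square require the corresponding distributions (available in packages such as SciPy), but the printed tables remain the quickest reference for hand checks.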
APPENDIX B
Relating the Research Question to the Choice of Statistical Test

RESEARCH QUESTION CHART
Chart B1: Is there a difference between means (or medians)?

Chart B2 (1 group, repeated measures): For a ratio/interval dependent variable where a normal distribution can be assumed, use the one-way repeated measures ANOVA (F), Chapter 25; with a confounding variable (covariate), use the one-way repeated measures ANCOVA, Chapter 30. A separate branch applies when a normal distribution cannot be assumed.
Chart B4: Is there an association between variables?
• Two continuous variables, ratio/interval, assume normal distribution: Pearson Correlation, Chapter 29.
• Two continuous variables, ordinal or normal distribution not assumed: Spearman Rank Correlation (rs), Chapter 29.
• One variable ratio, one variable nominal (one dichotomous variable, one continuous variable): Pearson Correlation (Point Biserial Correlation), Chapter 29.
• One variable ordinal, one variable nominal: Spearman Rank Correlation (Rank Biserial Correlation), Chapter 29.
• Two dichotomous variables, correlated data from one sample: McNemar test, Chapter 28.
• Two dichotomous variables, association: Phi Coefficient, Chapter 28 (a separate branch applies for small samples).
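The association branch of the chart can be rendered as a small lookup in code. This is an illustrative sketch of Chart B4's logic only; the keys, wording, and function name are mine, not the book's:

```python
# Measurement levels of the two variables -> suggested measure of association
# (simplified rendering of Chart B4; assumes exactly two variables)
ASSOCIATION_CHART = {
    ("continuous", "continuous", "normal"):     "Pearson correlation (Ch. 29)",
    ("continuous", "continuous", "non-normal"): "Spearman rank correlation (Ch. 29)",
    ("continuous", "dichotomous", None):        "Point biserial correlation (Ch. 29)",
    ("ordinal", "dichotomous", None):           "Rank biserial correlation (Ch. 29)",
    ("dichotomous", "dichotomous", None):       "Phi coefficient (Ch. 28)",
}

def suggest_association_test(level1, level2, distribution=None):
    """Look up a suggested statistic; fall back to the full chart otherwise."""
    return ASSOCIATION_CHART.get((level1, level2, distribution),
                                 "consult the full chart")

print(suggest_association_test("continuous", "continuous", "normal"))
```

The real charts carry more branches (correlated vs. independent data, small samples), so the lookup is a memory aid, not a decision rule.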
Chart B5: Do one (or more) independent variables predict one (or more) dependent variables?

Chart B6: Are measurements reliable?
• Method error and Limits of Agreement, Chapter 32.
• Alternate forms, Chapter 32.
• Internal consistency: Cronbach's alpha, Chapter 32.
• Coefficient of variation, Chapter 32.
APPENDIX C
Management of Quantitative Data
An important part of the research planning process is the development of a data management plan that specifies how data will be recorded, organized, and analyzed. This plan begins with the research proposal, specifying the research question, hypotheses, and design. These plans will also translate to the data analysis section under methods in a research report. It is not unusual for these plans to change once the project is under way, but nothing should begin without a firm plan in place. This planning requires knowledge of statistical programs, data coding and format requirements. The purpose of this appendix is to describe the process of data management. SPSS will be used to illustrate the process, consistent with the presentation of statistical output throughout the text.

■ Confidentiality and Security

The research proposal will include a plan for handling data, including maintaining confidentiality of participant information. All subjects should be assigned a unique ID number that is not related to their name, medical unit number, Social Security number, or other personal identifier. Documents for data collection should include the subject ID only. A list of subject names, addresses or phone numbers and corresponding ID codes can be kept separate from other files in case participants need to be contacted.

As part of informed consent, subjects should be assured that their personal information, data from medical records, and data collected as part of the project will be accessed only as necessary for research. The institutional review board (IRB) that approves the project will want to know the type of data to be collected, the purposes for which the data will be used now and in the future, who will have access to records, and what safeguards have been put in place for security and confidentiality (see Chapter 7). Many countries have regulations in place that define these standards. In the United States, these are part of the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA),1,2 and the Common Rule.3 In Canada, they are incorporated into the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans.4

■ Monitoring Subject Participation

Throughout the project, researchers should have procedures in place to keep accurate and complete records of participants' involvement. Records should indicate how many participants were recruited and why some were not eligible, how many agreed to participate, and how many eventually did participate. Attrition should be monitored and reasons obtained. Initial group assignments and deviations from these assignments should be documented. This information is relevant to the validity of the project and will be important if the researcher wants to complete an intention to treat analysis. This information should also be accurately reported in a flow diagram, including reasons for drop-outs (see Chapter 15).

■ Statistical Programs

Depending on the type of data, analysis can be accomplished with a variety of sources. Spreadsheets and database systems can accommodate some analyses. Available through academic institutions, integrated Web-based platforms, such as REDCap (Research Electronic Data Capture)5 or Survey Monkey,6 can be useful for developing formats for secure data collection such as recording data from surveys.

A wide variety of statistical packages are available. Table C-1 lists a few of the more popular ones. Most
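The ID-assignment scheme described under Confidentiality and Security can be sketched in code. This is an illustrative outline only (the names and fields are invented), and a real project must follow its IRB's requirements for storage and access control:

```python
# Hypothetical recruitment roster (in practice, read from a secure source)
roster = [
    {"name": "A. Jones", "phone": "555-0101", "baseline": 45},
    {"name": "B. Smith", "phone": "555-0102", "baseline": 52},
]

linking_file = []  # names and IDs only: stored separately, restricted access
data_file = []     # de-identified records used for analysis

for i, person in enumerate(roster, start=1):
    subject_id = f"S{i:03d}"   # sequential ID, unrelated to any personal identifier
    linking_file.append({"id": subject_id,
                         "name": person["name"],
                         "phone": person["phone"]})
    data_file.append({"id": subject_id, "baseline": person["baseline"]})

print(data_file)   # the analysis file carries the subject ID only
```

Keeping the linking file physically and logically separate from the analysis file is what allows participants to be re-contacted without exposing identifiers to the analysis.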
Figure C–1 SPSS screen shots of a data file, showing the variable view (top) and the data view (bottom). Data are shown
only for 15 out of 60 cases. These hypothetical data are taken from Chapter 25 for a two-way study of the effect of
modality and medication on change in pain-free range of motion in patients with lateral epicondylitis. Results can be
found in Table 25-4. Data are available in the Chapter 25 Supplement.
frequencies, but they cannot be subject to numeric operations. Clicking on a cell under this column will bring up a list of other types of variables, including dates, which can be added or subtracted to determine length of time in days, weeks, months, or years. Other than units such as money or dates, it is advisable to code all variables as numbers rather than using letters.

Variables can also be defined according to their level of measurement. Under Measure, values can be designated as Scale (interval or ratio), Ordinal, or Nominal. Under the final column, Roles, you can choose among several options that identify the role that variable will play in your analysis, such as an independent or dependent variable. This is not a required designation, and Input is listed by default, allowing all variables to be used in whatever way the design warrants. For this example, the variables have been designated as independent variables (Input), dependent variables (Target), or both.
Labels
Because variable names are usually kept short, it is sometimes confusing to read a printout of an analysis. It may be difficult to remember what each variable name means. For instance, items on a survey may be named Question1, Question2 and so on. To facilitate reading the output, programs allow the researcher to specify a Label for a variable name. These labels do not have the restrictions of variable names and can use spaces or other characters but are limited in length (256 characters in SPSS). For example, in a survey the actual question can be entered as a label. In the modality study a full label has been entered for Medication and ROM Change. Labels are not required, but with large data sets they are extremely useful.

Categorical codes can also be defined. In the Values column you can enter labels for codes. For instance, with ...

Listwise and Pairwise Deletion
For some statistical procedures, you can designate one of two strategies for handling missing data under Options. Listwise deletion means that a case is dropped if it has a missing value for one or more data points. Analysis will be run only on cases with complete data, eliminating cases with any missing data from all calculations. Pairwise deletion can be used for some procedures when a particular comparison, descriptive value, correlation, or regression does not include every score. Cases are omitted only if they are missing any of the variables involved in the specific procedure. For instance, means can be calculated for each variable, using only cases with a value for that variable. This means that n could be different for each variable.
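The difference between the two deletion strategies can be demonstrated with plain Python; the data below are invented, with a missing value represented as None:

```python
# Three cases; case 2 is missing variable y
cases = [
    {"x": 10, "y": 5},
    {"x": 12, "y": None},
    {"x": 14, "y": 7},
]

def mean(values):
    return sum(values) / len(values)

# Listwise deletion: drop any case with a missing value anywhere
complete = [c for c in cases if None not in c.values()]
listwise_x = mean([c["x"] for c in complete])                     # uses 2 cases

# Pairwise deletion: for each variable, use every case that has that value
pairwise_x = mean([c["x"] for c in cases if c["x"] is not None])  # uses 3 cases
pairwise_y = mean([c["y"] for c in cases if c["y"] is not None])  # uses 2 cases

print(listwise_x, pairwise_x, pairwise_y)  # 12.0 12.0 6.0
```

Note how pairwise deletion keeps all three x values but only two y values, which is exactly why n can differ from variable to variable in pairwise output.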
The term code book was originally used for a book that the researcher created to list all variable names and codes. The term is still used, although it is no longer an actual book.

■ Data Cleaning

Once data are entered and before analyses are run, the data should be examined to detect abnormalities or errors, a process called data cleaning. Although it may be time consuming, it is an essential step to ensure validity of the data analysis. The process involves running a series of descriptive analyses to inspect the distribution of scores for each variable. For instance, running a frequency distribution for categorical variables will let you see if the correct number ...

Computing New Variables
Computing a new variable requires that some arithmetic operations be performed on the existing data. New variables are created using TRANSFORM > COMPUTE VARIABLE. A new variable is defined by an arithmetic expression using various operators. The following symbols are used by most programs for arithmetic operations:

  +        Add        A + B
  −        Subtract   A − B
  *        Multiply   A*B
  /        Divide     A/B
  ** or ^  Exponent   A**2 or A^2

These are considered simple expressions because they contain one operator. For instance, in the original data file for the ROM study, only baseline and posttest scores were obtained. A new variable was then defined, called
Variable Information

Variable    Position  Label       Measurement Level  Role   Column Width  Alignment  Print Format  Write Format
id          1         <none>      Scale              Input  8             Right      F8            F8
Modality    2         <none>      Nominal            Input  8             Right      F8            F8
Med         3         Medication  Nominal            Input  8             Right      F8            F8
Baseline    4         <none>      Scale              Input  8             Right      F8.2          F8.2
Posttest    5         <none>      Scale              Input  8             Right      F8.2          F8.2
ROMChange   6         ROM Change  Scale              Input  11            Right      F8.2          F8.2

Variables in the working file

Variable Values

Variable   Value  Label
Modality   1      Ice
           2      Splint
           3      Rest
Med        1      NSAID
           2      No Med

Figure C–2 Code book generated by SPSS for data from the study of the effect of modality and medication on change in pain-free elbow ROM in patients with lateral epicondylitis. The top panel shows six variables and their characteristics. This information is similar to the material in variable view. The bottom panel shows the codes used for the two categorical variables, modality group and medication group. This output is generated using FILE > Display Data File Information.
ROMChange, calculated as the difference between the baseline and posttest scores:

ROMChange = posttest − baseline

When these computations are done, the values for the new variable will appear as a new line in variable view and a new column in the data view (see Fig. C-1). These new variables can now be used in statistical procedures.

When more than one operator is used, a compound expression is created, such as

A**2*B/(C+1.0)

This expression is equal to

(A²)(B) / (C + 1.0)

When compound expressions are used, specific rules apply to the order in which operations take place. First, all expressions within parentheses are carried out. Second, adjacent operations are carried out in the following order: (1) exponentiation, (2) division and multiplication, and (3) addition and subtraction. Within each of these levels, operations proceed from left to right. Therefore, in the preceding expression, the first operation will be to complete the addition (C+1.0) within the parentheses. Next, the value of A will be squared. This value will then be multiplied by B. Lastly, this product will be divided by the sum (C + 1.0). If the parentheses had been left out, the expression would be read differently. For instance, using

A**2*B/C+1.0

the expression would read

(A²)(B)/C + 1.0

Recoding Variables
Sometimes we want to create new codes for a categorical variable. Recodes are obtained using TRANSFORM > RECODE INTO DIFFERENT VARIABLES. Suppose we wanted to change our design, and combine all patients in the Splint and Rest groups to compare them against the group receiving Ice. We would want to recode the group designations, creating a new variable called NewModality. We would then designate old and new values, so that all those with original codes of 2 or 3 for Modality will now have a value of 2 for NewModality.
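The same precedence rules apply in most programming languages, so the two expressions above can be checked directly. A short Python verification, plus a recode mirroring the NewModality example (the sample values are invented):

```python
A, B, C = 2.0, 3.0, 1.0

# With parentheses: (A^2 * B) / (C + 1.0)
with_parens = A**2 * B / (C + 1.0)     # 4 * 3 / 2 = 6.0

# Without parentheses: ((A^2 * B) / C) + 1.0
without_parens = A**2 * B / C + 1.0    # 4 * 3 / 1 + 1 = 13.0

print(with_parens, without_parens)     # 6.0 13.0

# Recoding: Modality codes 2 (Splint) and 3 (Rest) collapse to NewModality = 2
modality = [1, 2, 3, 2, 1, 3]
new_modality = [1 if code == 1 else 2 for code in modality]
print(new_modality)                    # [1, 2, 2, 2, 1, 2]
```

With A = 2, B = 3, and C = 1, the two expressions give 6.0 and 13.0, confirming that the placement of parentheses changes the result.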
Data transformation helps in statistical calculations when the intent is to determine if groups are significantly different, or if there are significant relationships among variables. It is important, however, to remember that the transformed values are not the actual values, and units of outcomes cannot be interpreted as such.

Square Root Transformation
The square root transformation (√X) replaces each score in a distribution with its square root. This method is most appropriate when variances are roughly proportional to group means, when s²/X̄ is similar for all groups. The square root transformation will typically have the effect of equalizing variances.

Table C-2  Effect of Square Root Transformation

        ORIGINAL (X)      TRANSFORMED (√X)
        A       B         A′       B′
        1       8         1.00     2.83
        3       7         1.73     2.65
        8       12        2.83     3.46
        6       5         2.45     2.24
        2       18        1.41     4.24
Σ       20      50        9.42     15.42
X̄       4       10        1.88     3.08
s²      8.5     26.5      0.56     0.61
s²/X̄    2.125   2.65
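Table C-2 can be reproduced with a few lines of Python: the variance-to-mean ratio is similar for the two original groups, and the transformation brings the group variances much closer together.

```python
import statistics
from math import sqrt

# Data from Table C-2
group_a = [1, 3, 8, 6, 2]
group_b = [8, 7, 12, 5, 18]

for name, scores in [("A", group_a), ("B", group_b)]:
    transformed = [sqrt(x) for x in scores]
    var_orig = statistics.variance(scores)        # sample variance (n - 1)
    var_new = statistics.variance(transformed)
    ratio = var_orig / statistics.mean(scores)    # s2 / mean
    print(f"Group {name}: s2 = {var_orig:.2f} (s2/mean = {ratio:.3f}), "
          f"transformed s2 = {var_new:.2f}")
```

The original variances (8.5 and 26.5) differ by a factor of about 3, while the transformed variances (≈0.56 and ≈0.62) agree with Table C-2 within rounding and are nearly equal.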
Table C-3  Transformation Based on Largest and Smallest Scores in Two Distributions

                X             √X            LOG X         1/X
GROUP           1      2      1      2      1      2      1      2
Largest         18     40     4.24   6.32   1.26   1.60   .06    .02
Smallest        10     20     3.16   4.47   1.00   1.30   .10    .05
Range           8      20     1.08   1.85   0.26   0.30   .04    .03

Ratio (range largest / range smallest):  √X: 1.85/1.08 = 1.71;  LOG X: 0.30/0.26 = 1.15;  1/X: .04/.03 = 1.33
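The transformed ranges in Table C-3 can be checked with Python's math module. A brief sketch for the square root and log columns (the reciprocal column follows the same pattern, noting that 1/X reverses the order of the scores):

```python
from math import sqrt, log10

# Smallest and largest scores in the two distributions (from Table C-3)
scores = {"group 1": (10, 18), "group 2": (20, 40)}

for name, (smallest, largest) in scores.items():
    rng_sqrt = sqrt(largest) - sqrt(smallest)
    rng_log = log10(largest) - log10(smallest)
    print(f"{name}: sqrt range = {rng_sqrt:.2f}, log10 range = {rng_log:.2f}")
```

The ranges match the table (1.08 and 1.85 for √X; 0.26 and 0.30 for LOG X), and the ratio of the larger to the smaller range (1.85/1.08 = 1.71 for √X) is what the table uses to compare transformations: the transformation with the ratio closest to 1 makes the two distributions most comparable.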
Glossary
Numbers in parentheses indicate the chapter in which the term is introduced.

■ Terms

absolute reliability. Indicates how much of a measured value, expressed in the original units, is likely to be due to error. (9,32)
absolute risk increase (ARI). The increase in risk associated with an intervention as compared to the risk without the intervention (or the control condition); the absolute difference between the control event rate (CER) and the experimental event rate (EER). (34)
absolute risk reduction (ARR). The reduction in risk associated with an intervention as compared to the risk without the intervention (or the control condition); the absolute difference between the experimental event rate (EER) and the control event rate (CER). (34)
accessible population. The actual population of subjects available to be chosen for a study. This group is usually a nonrandom subset of the target population. (13)
active variable. An independent variable with levels that can be manipulated and assigned by the researcher. (14)
adjusted means. Means that have been adjusted based on the value of a covariate in an analysis of covariance. (30)
agreement. (see percent agreement)
allocation concealment. Implementation of a process of random assignment where those involved in the trial are shielded from knowing the upcoming participant group assignment. (14)
alpha coefficient. (see Cronbach's alpha)
alpha (α). Level of statistical significance, or risk of Type I error; maximum probability level that can be achieved in a statistical test to reject the null hypothesis. (23) (see also Cronbach's alpha)
alternate forms reliability. Reliability of two equivalent forms of a measuring instrument. (32)
alternating treatment design. A single-case design in which two (or more) treatments are compared by alternating them within a session or in alternate sessions. (18)
alternative hypothesis (H1). Hypothesis stating the expected relationship between independent and dependent variables; considered the negation of the null hypothesis. The alternative hypothesis is accepted when the null hypothesis is rejected. (23)
analysis of covariance (ANCOVA). Statistical procedure used to compare two or more conditions while controlling for the effect of one or more covariates. (15,30)
analysis of variance (ANOVA). Statistical procedure appropriate for comparison of three or more treatment groups or conditions, or the simultaneous manipulation of two or more independent variables; based on the F statistic. (25)
a priori comparisons. (see planned comparisons)
arm. (see treatment arm)
attributable risk. An estimate used to quantify the risk of disease in an exposed group that is attributable to the exposure, by removing the risk that would have occurred as a result of other causes (risk in the unexposed group). (34)
attribute variable. An independent variable with levels that cannot be manipulated or assigned by the researcher but that represent subject characteristics (such as age and sex). (14)
attrition (experimental mortality). A threat to internal validity, referring to the differential loss of participants during the course of data collection, potentially introducing bias by changing the composition of the sample. (15)
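The absolute risk reduction (ARR) defined above is a single subtraction of event rates. A minimal Python sketch, with hypothetical rates:

```python
# Absolute risk reduction sketch: ARR is the absolute difference between the
# control event rate (CER) and the experimental event rate (EER), per the
# glossary definition. The rates below are hypothetical.
def absolute_risk_reduction(cer, eer):
    return abs(cer - eer)

cer, eer = 0.20, 0.12   # hypothetical: 20% vs. 12% of subjects with the outcome
print(round(absolute_risk_reduction(cer, eer), 2))   # 0.08
```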
audit trail. Comprehensive process of documenting interpretation of qualitative data. (21)
autonomy. The capacity of individuals to make decisions affecting their own lives and to act on those decisions. (7)
background question. Question related to etiology or general knowledge about a patient's condition, referring to the cause of a disease or condition, its natural history, signs and symptoms, general management, or the anatomic or physiological mechanisms that relate to pathophysiology. (5)
backward selection. Form of stepwise multiple regression. The equation starts out with all independent variables in the model, and variables are eliminated one at a time if they do not contribute significantly to the model. (30)
basic research (preclinical research). Research that contributes to basic knowledge but that does not have immediate practical goals. (2,14)
Bayes' theorem. The calculation of the probability of an event based on the prior probability of another event; used to estimate posttest probabilities based on pretest probabilities of a diagnostic outcome. (33)
beneficence. Obligation to attend to the well-being of individuals engaged as research subjects. (7)
beta (β). 1. Probability of making a Type II error. (23) 2. Used to represent coefficients for the relationship between two endogenous variables in structural equation modeling. (31)
beta weight. In a multiple regression equation, the standardized coefficient for each independent variable. (30)
between-groups variance. That portion of the total variance in a set of scores that is attributed to the difference between groups. (24,25)
between-subjects design. A design that compares independent groups. (16)
bias. Any influence that may interfere with the valid relationship between variables, potentially resulting in misleading interpretation of outcomes. (15)
binomial test. A significance test of the deviations of observations from a theoretically expected deviation for a dichotomous variable. 1. Used to evaluate data points above or below the split middle line in single-subject designs. (18) 2. Used to compare two sets of data in a repeated sample with the nonparametric sign test. (27)
Bland-Altman plot. A plot analyzing the agreement between two measurements. The mean of the two measures for each subject is plotted against the difference between the means. (see also limits of agreement). (32)
blinding. Techniques to reduce experimental bias by keeping participants and/or investigators ignorant of group assignments and research hypotheses. (14)
blocking variable. Attribute variable in which subjects are divided into groups or blocks that are homogeneous on a particular characteristic. (15,16)
block randomization. Distributes subjects evenly among treatment groups within small, even-numbered subgroups or "blocks." (14)
Bonferroni correction (adjustment). A correction often used when multiple statistical tests are performed on the same data set to reduce Type I error. The desired level of significance (α) is divided by the number of comparisons. The resulting value is then used as the level of significance for each comparison to reject the null hypothesis. (26)
Boolean logic. In literature searches, the terms AND, NOT, and OR used to expand or narrow search terms. (6)
box plot (box-and-whisker plot). A graphic display of a distribution, showing the median, 25th and 75th percentiles (interquartile range), and highest and lowest scores. (22)
canonical correlation. A multivariate correlation procedure, whereby two sets of variables are correlated. (31)
carryover effect. A temporary or permanent change in behavior resulting from prior treatments. (10)
case-control study. A design in analytic epidemiology in which the investigator selects subjects on the basis of their having or not having a particular disease or condition and then determines their previous exposure to a risk factor. (1,19)
case report (case series). Detailed report of the symptoms, signs, diagnosis, treatment, and follow-up of one or more individual patients, usually describing interesting treatment options or uncommon conditions. (20)
case study. A qualitative research design used to study complex phenomena within their social context, investigating individuals or organizations. Can involve evaluation of interventions, policies, or theoretical premises. (21)
causal modeling. Statistical technique that examines patterns of intercorrelations among variables to determine if they fit an underlying theory of which variables cause others. (see also structural equation modeling) (31)
ceiling effect. A measurement limitation of an instrument whereby the scale cannot determine increased performance beyond a certain level. (10)
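The Bonferroni correction defined above divides the desired α by the number of comparisons. A minimal Python sketch, with illustrative values:

```python
# Bonferroni correction sketch: divide the desired level of significance by
# the number of comparisons, per the glossary definition. The alpha level and
# comparison count below are illustrative.
def bonferroni_alpha(alpha, n_comparisons):
    return alpha / n_comparisons

print(round(bonferroni_alpha(0.05, 5), 4))   # 0.01 per-comparison level
```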
celeration line. (see split middle line)
censored observation. An observation whose value is unknown because the subject has not been in the study long enough for the outcome to have occurred; used to estimate survival curves. (31)
central tendency. Descriptive statistics that represent "averages" or scores that are representative of a distribution; includes mean, median, and mode. (22)
centroid. A point determined from the intersection of two or more means of two dependent variables (X, Y), used in multivariate analysis. (31)
chi-square (χ²). A nonparametric test applied to categorical data, comparing observed frequencies within categories to frequencies expected by chance. (28)
classical measurement theory (CMT). Concept of measurement whereby a measured value is considered to be a function of an underlying true score and measurement error. (9,12)
clinical practice guideline (CPG). Systematically developed statement, based on available evidence, that provides recommendations to assist practitioner and patient decisions about appropriate healthcare for specific clinical circumstances. (5,37)
clinical prediction rule. A combination of clinical findings that have been shown to predict a diagnosis, prognosis, or treatment outcome. (33)
clinical trial. (see randomized controlled trial)
clinical trials registry. A database that provides listing of trials in planning and implementation phases. (14)
closed-ended question. A question on a survey (interview or questionnaire) that offers a set of specific response choices that are mutually exclusive and exhaustive. (11)
cluster analysis. A multivariate statistical procedure that classifies subjects into sets based on defined characteristics. (31)
cluster random assignment. The site, representing a "cluster" of subjects, is randomly assigned to an intervention, and all individuals at that site receive that treatment. (14)
cluster sampling. A form of probability sampling in which large subgroups (clusters) are randomly selected first, and then smaller units from these clusters are successively chosen; also called multistage sampling. (13)
coefficient alpha. (see Cronbach's alpha)
coefficient of determination (r²). Coefficient representing the amount of variance in one variable (Y) that can be explained (accounted for) by a second variable (X). (30)
coefficient of variation (CV). A measure of relative variation; based on the standard deviation divided by the mean, expressed as a percentage. Can be used to describe data measured on the interval or ratio scale. (22)
Cohen's d. The effect size index for comparing two groups, equals the difference between means divided by a pooled standard deviation. (see also standardized mean difference). (24)
cohort effects. Variations of effects of a given group as they move through time, such as individuals born in a particular era. (19,20)
cohort study. An observational study design in which a specific group is followed over time. Subjects are classified according to whether they do or do not have a particular risk factor or exposure and followed to determine disease outcomes. (19)
collinearity. The correlation between independent variables in a multiple regression equation, causing them to provide redundant information. Also called multicollinearity. (30)
Common Rule. Codification of policies related to informed consent and ethical conduct in research. (7)
communality. In factor analysis, the extent to which an item correlates with other variables. (31)
completer analysis (complete case analysis). Analysis of data in a clinical trial only for those subjects who complete the study. (15)
complex contrasts. A multiple comparison strategy in which means from two or more groups are combined as a subset and compared with other individual means or subsets of means. (26)
complex hypothesis. Contains more than one independent or dependent variable. (3)
computer adaptive testing (CAT). A computer-based test or scale in which items are adapted to the subject's ability level. (12)
concurrent validity. A form of criterion-related validity; the degree to which the outcomes of one test correlate with outcomes on a criterion test, when both tests are given at relatively the same time. (10,32)
confirmability. In qualitative research, ensuring, as much as possible, that findings are due to the experiences and ideas of the participants, rather than the characteristics and preferences of the researcher. (21)
confirmatory factor analysis (CFA). Used to support the theoretical structure of an instrument, to determine if it fits with current empirical understanding of a construct. (10,31)
confidence interval (CI). The range of values within which a population parameter is estimated
to fall, with a specific level of confidence, usually 95%. (23)
confounding. The contaminating effect of extraneous variables on interpretation of the relationship between independent and dependent variables. (15,19,34)
consecutive sampling. A form of nonprobability sampling, where subjects are recruited as they become available. (13)
constant comparison. In qualitative research, the process by which each new piece of information is compared with data already collected to determine where the data agree or conflict. (21)
construct. An abstract concept that is unobservable, that can only be measured by observing related behaviors. (see also latent trait) (4,8)
construct validity. 1. A type of measurement validity indicating the degree to which a theoretical construct is measured by an instrument. (10) 2. Design validity related to operational definitions of independent and dependent variables. (15)
content analysis. A procedure for analyzing and coding narrative data in a systematic way. (21)
content validity. A type of measurement validity indicating the degree to which the items in an instrument adequately reflect the content domain being measured. (10)
contingency table (crosstabulation). A two-dimensional table displaying frequencies or counts, with rows (R) and columns (C) representing categories of nominal or ordinal variables. (28)
continuity correction. (see Yates' continuity correction)
continuous variable. A quantitative variable that can theoretically take on values along a continuum. (8)
control event rate (CER). The number of subjects in the control group who develop the outcome of interest. (34)
control group. In an experiment, a group of subjects who resemble the experimental group but who do not receive the experimental treatment (assigned a placebo or control condition), providing a baseline of comparison to interpret effects of treatment. (14)
convenience sampling. A nonprobability sampling procedure, involving selection of the most available subjects for a study. (13)
convergent validity. An approach in construct validation, assessing the degree to which two different instruments or methods are able to measure the same construct. (10)
correlation. The tendency for variation in one variable to be related to variation in a second variable; those statistical procedures used to assess the degree of covariation between two variables. (29)
covariate. An extraneous variable that is statistically controlled in an analysis of covariance or regression analysis, so that the relationship between the independent and dependent variables is analyzed with the effect of the extraneous factor removed. (15,30)
Cox proportional hazards regression. A regression procedure used to measure the effect of one or more predictor variables on the rate (hazard) at which an outcome occurs, accounting for differing lengths of follow-up among subjects. Used in survival analysis. (31)
credibility. In qualitative research, a criterion for integrity of data, confidence in the truth or validity of findings. (21)
criterion-referenced test. A fixed standard that represents an acceptable level of performance. (10)
criterion-related validity. A type of measurement validity indicating the degree to which the outcomes of one test correlate with outcomes on a criterion test; can be assessed as concurrent validity or predictive validity. (10)
critically appraised topic (CAT). A short summary and appraisal of evidence focused on a clinical question. (36)
critical region. Area of a sample probability curve that corresponds to rejection of the null hypothesis based on the critical value of the test statistic. (23)
critical value. The value of a test statistic that must be exceeded for the null hypothesis to be rejected; the value of a statistic that defines the critical region. (23)
Cronbach's alpha (α). Reliability index of internal consistency of items on a scale. (9,32)
crossover design. A repeated measures design used to control order effects when comparing two treatments, where half of the sample receives treatment A first followed by treatment B, and the other half receives treatment B first followed by treatment A. May include a washout period. (16)
cross-sectional study. A study based on data collected at one point in time. Observations of different age or developmental groups may provide the basis for inferring trends over time. (19)
crosstabulation. (see contingency table)
cumulative incidence (CI). The number of new cases of a disease during a specified time period divided by the total number of people at risk; the proportion of new cases of a disease in a population. (34)
cumulative scale. A scale designed so that agreement with higher-level responses assumes agreement with all lower-level responses. Also called a Guttman scale. (12)
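The cumulative incidence defined above is a simple proportion. A minimal Python sketch, with hypothetical counts:

```python
# Cumulative incidence sketch: new cases during a period divided by the total
# number of people at risk, per the glossary definition. Counts are hypothetical.
def cumulative_incidence(new_cases, at_risk):
    return new_cases / at_risk

print(round(cumulative_incidence(15, 500), 4))   # 0.03, i.e., 3% over the period
```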
curvilinear relationship. The relationship between two variables that does not follow a linear proportional relationship. (29,30)
cut-off score. Score used as the demarcation of a positive or negative continuous test outcome. (12,33)
deductive reasoning. The logical process of developing specific hypotheses based on general principles. (4)
degrees of freedom (df). Statistical concept indicating the number of values within a distribution that are free to vary, given restrictions on the data set; usually n − 1. (23)
Delphi survey. Survey method whereby decisions on items are based on consensus of a panel over several rounds. (11 Supplement)
dependability. In qualitative research, the stability of data over time, the degree to which the study could be repeated with similar results. (21)
dependent variable. A response variable that is assumed to depend on or be caused by another (independent) variable. (3)
descriptive research. Research studies that are designed to describe the characteristics of individuals in specific populations. (20)
descriptive statistics. Statistics that are used to characterize the shape, central tendency, and variability within a set of data, often with the intent to describe a sample or population. (22)
developmental research. A descriptive research approach designed to document how certain groups change over time on specific variables. (20)
deviation score. The distance of a single data point from the mean of the distribution. The sum of the deviation scores for a given distribution will always equal zero. (22)
dichotomy (dichotomous variable). A nominal variable having only two categories, such as yes/no and male/female; a binomial variable. (8)
difference score (d). The difference between two scores taken on the same individual. (19)
differential item functioning (DIF). Potential item bias in the fit of data to the Rasch model, showing how an item may be measuring different abilities for subgroups. (12)
digital object identifier (DOI). A unique number assigned to scholarly articles that can be used to locate particular articles in a search. (6,38)
directional hypothesis. A research hypothesis (or alternative hypothesis) that predicts the direction of a relationship between two variables. (3)
discrete variable. A variable that can only be measured in separate units and that cannot be measured in intervals of less than 1. (8)
discriminant analysis. A multivariate statistical technique used to determine if a set of variables can predict group membership. (31)
discriminant validity. An approach in construct validation assessing the degree to which an instrument yields different results when measuring two different constructs; that is, the ability to discriminate between the constructs. (10)
disproportional sample. A sample stratified on a particular variable, when the number of subjects within a stratum is not proportional to the population size of that stratum. (13)
distribution. The total set of scores for a particular variable. (22)
divergent validity. (see discriminant validity)
DOI. (see digital object identifier)
dose-response relationship. An outcome in which risk varies, not only with the presence or absence of an exposure, but also with varying levels of the exposure's presence. (19)
double-blind study. An experiment in which both the investigator and the subject are kept ignorant of group assignment. (14)
dummy variable (coding). In regression procedures, the assignment of codes (0 and 1) to a nominal variable, reflecting the presence or absence of certain traits. (30)
ecological validity. The generalizability of study findings to real-world conditions, societal norms, and health of populations. (15)
effectiveness. Benefits of an intervention as tested under "real world" conditions. (2)
effect modification. A situation in which the strength of an association between two variables is affected by a third variable that differs across subgroups in the population. (34)
effect size. The magnitude of the difference between treatments or the magnitude of a relationship between variables. (23)
effect size index. A statistical value used to measure the effect size in standardized units based on the proportional relationship of the difference to variance. Different indices are used with various statistical tests. (23)
efficacy. Benefit of an intervention as tested under controlled experimental conditions, usually with a control group in a randomized controlled trial. (2)
eigenvalue. A measure of the proportion of the total variance accounted for by a factor in a factor analysis. (31)
endogenous variable. The variable being explained by a causal model, a dependent variable. (31)
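The effect size index described above can be illustrated with Cohen's d, defined earlier in the glossary as the difference between means divided by a pooled standard deviation. The Python sketch below is illustrative; the two samples are hypothetical.

```python
import math

# Cohen's d sketch: difference between two group means divided by the pooled
# standard deviation, per the glossary definition of this effect size index.
def mean(xs):
    return sum(xs) / len(xs)

def var_s(xs):
    """Sample variance, with n - 1 in the denominator."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cohens_d(a, b):
    pooled_sd = math.sqrt(((len(a) - 1) * var_s(a) + (len(b) - 1) * var_s(b))
                          / (len(a) + len(b) - 2))
    return (mean(a) - mean(b)) / pooled_sd

group1 = [10, 12, 14, 16, 18]   # hypothetical scores, mean 14
group2 = [8, 10, 12, 14, 16]    # hypothetical scores, mean 12
print(round(cohens_d(group1, group2), 2))   # 0.63
```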
epidemiology. That branch of research dedicated to exploring the frequency and determinants of disease or other health outcomes in populations. (34)
equipoise. The ethics of a clinical research situation in which there is genuine uncertainty regarding the comparative therapeutic merits of each arm of a trial. (14)
equivalence trial. An intervention study focused on showing that two treatments are not different, no better, no worse. (14)
error variance. That portion of the total variance in a data set that cannot be attributed to treatment effects but that is due to differences between subjects. (23)
eta squared (η²). Effect size index for analysis of variance. (25)
ethnography. An approach to qualitative research in which the experiences of a specific cultural group are studied. (21)
exclusion criteria. Specific criteria that are used to determine who is not eligible for a study. (8,13)
exogenous variable. A variable in a causal model that is not explained by the model, with variance that is accounted for by variables outside of the model, an independent variable. (31)
expected frequencies. In a contingency table, the frequencies that would be expected if the null hypothesis is true; frequencies that are expected just by chance. (28)
experimental design. A design in which the investigator manipulates the independent variables and randomly assigns subjects to groups, and in which a control group or comparison group is used. (16)
experimental event rate (EER). The number of subjects in the experimental or treatment group who develop the outcome of interest. (34)
explained variance. Between-groups variance; that portion of the total variance in a data set that can be attributed to the differences between groups or treatment conditions. (25)
explanatory research. Utilizes various types of experimental designs to compare two or more conditions or interventions. (1,3)
exploratory factor analysis (EFA). Used to study a set of items, when the purpose is to determine how the variables cluster, or to establish what underlying concepts may be present in the construct. (10,31)
exploratory research. Observational research that has as its purpose the exploration of data to determine relationships among variables. (3,19)
exposure. An experience or condition that may be responsible for a health outcome. (19)
external criticism. Assessment of generalizability in historical research. (20)
external validity. The degree to which results of a study can be generalized to persons or settings outside the experimental situation. (15)
fabrication. Reporting data or results that have been made up. (7)
face validity. The assumption of validity of a measuring instrument based on its appearance as a reasonable measure of a given variable. (10)
factor. 1. A variable. (3) 2. An independent variable in an experimental study. (25) 3. A set of interrelated variables in a factor analysis. (31)
factor analysis. An exploratory multivariate statistical technique used to examine the structure of a latent variable within a large set of variables and to determine the underlying dimensions that exist within that set of variables. (see also exploratory and confirmatory factor analysis). (10,31)
factorial design. A design that incorporates two or more independent variables, with independent groups of subjects randomly assigned to various combinations of levels of the variables. (10)
false negative. A test result that is negative in a person who has the disease or condition of interest. (34)
falsification. Manipulating, changing, or omitting data or results such that the research is not accurately represented in the research record. (7)
false positive. A test result that is positive in a person who does not have the disease or condition of interest. (34)
familywise error rate (αFW). The probability of at least one false significant finding in a series of hypothesis tests, Type I error. (26)
field notes. Notes recorded during participation or observation as part of qualitative research. (21)
Fisher's exact test. A nonparametric procedure applied to nominal data in a 2 × 2 contingency table, comparing observed frequencies within categories to frequencies expected by chance. Used when samples are too small to use the chi-square test. (28)
fixed effect. Variable levels that are nonrandom, representing the only qualities of interest. (32)
floor effect. A measurement limitation of an instrument whereby the scale cannot determine decreased performance beyond a certain level. (10)
foreground question. A clinical question that focuses on evidence to inform decisions about a particular patient's management. (5)
forest plot. Graphic display of effect sizes and confidence intervals for individual studies and pooled data in a meta-analysis. (37)
forward selection. A process used in multiple regression that enters variables one at a time into the equation based on the strength of their
association with the outcome variable, until all statistically significant variables are included. (30)
frequency distribution. A table of rank-ordered scores that shows the number of times each value occurs in a distribution. (22)
Friedman two-way analysis of variance by ranks (χ²r). A nonparametric statistical procedure for repeated measures, comparing more than two treatment conditions of one independent variable; analogous to the one-way repeated measures analysis of variance. (27)
funnel plot. Scatterplot of treatment effect size against a measure of study precision such as the standard error, used as a visual aid for detecting publication bias or study heterogeneity in a meta-analysis. (37)
gamma (γ). A coefficient representing the relationship between exogenous and endogenous variables in structural equation modeling. (31)
generalizability. 1. The quality of research that justifies application of outcomes to groups or situations other than those directly involved in the investigation, or external validity. (15) 2. The concept of reliability theory in which measurement error is viewed as multidimensional and must be interpreted under specific measurement conditions. (9)
gold standard. A measurement that defines the true value of a variable. 1. In criterion-related validity, an instrument that is considered a valid measure and that can be used as the standard for assessing validity of other instruments. (10) 2. In diagnostic testing, a procedure that accurately identifies the true disease condition (negative or positive) of the subject. (33)
goodness of fit. Use of a test to determine if an observed distribution of variables fits a given theoretical distribution. (28,31)
grand theory. A comprehensive idea that tries to explain phenomena at the societal level. (4)
Greenhouse-Geisser correction. A correction for variance differences in a repeated measures analysis of variance. (25)
grey literature. Written materials not produced by a commercial publisher, including government documents, reports of all types, fact sheets, practice guidelines, conference proceedings, and theses or dissertations. (6,37)
grounded theory. An approach to collecting and analyzing data in qualitative research, with the goal of developing theories to explain observations and experience. (21)
Guttman scale. (see cumulative scale)
Hawthorne effect. The effect of participants' knowledge that they are part of a study on their performance. (15)
hazard function. The probability that a subject will achieve a specific outcome in a certain time interval. (31)
hierarchical regression. Multiple regression model where sets of variables are entered in an iterative fashion. (30)
histogram. A type of bar graph, composed of a series of columns, each representing the frequency of one score or group interval. (22)
historical controls. Subjects from previous research studies that serve as controls for experimental subjects in a subsequent study. (17)
historical research. Research that seeks to examine relationships and facts based on documentation of past events. (20)
history effect. A threat to internal validity, referring to the occurrence of extraneous events prior to a posttest that can affect the dependent variable. (15)
homogeneity of slopes. Assumption in analysis of covariance that regression slopes of covariates are parallel. (30)
homogeneity of variance. An underlying assumption in parametric statistics that variances of samples are not significantly different. (23,24,25)
homogeneous sample. A sample where participants are similar on specific characteristics, such as all the same gender or age group. (15)
homoscedasticity. (see homogeneity of variance)
hospital-based study. An observational study that recruits cases from patients in a medical institution. (19)
Huynh-Feldt correction. A correction for variance differences in a repeated measures analysis of variance. (25)
hypothesis. A statement of the expected relationship between variables. (3)
impact factor. A rating of journal reputation based on the frequency with which articles from that journal are cited in reports in other journals. (38)
implementation science. Approach to research that focuses on understanding the influence of environment, attitudes, and resources on whether research findings are actually translated to practice. (2)
imputation. Replacement of missing data points with estimated values that are based on observed data. (15)
incidence. The proportion of people who develop a given disease or condition within a specified time period. (34)
inclusion criteria. Specific criteria that are used to determine who is eligible for a study. (13)
independent factor. An independent variable in which the levels represent independent groups of subjects. (15)
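A frequency distribution as defined above, a table of rank-ordered scores with the count of each value, can be built in a few lines. The Python sketch below uses hypothetical scores.

```python
from collections import Counter

# Frequency distribution sketch: tally how many times each score occurs and
# list the values in rank order, per the glossary definition. Scores are
# hypothetical.
scores = [3, 1, 4, 3, 2, 3, 1, 4, 4, 3]
freq = dict(sorted(Counter(scores).items()))   # value -> frequency, rank-ordered
for value, count in freq.items():
    print(value, count)
```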
independent variable. The variable that is presumed to cause, explain, or influence a dependent variable; a variable that is manipulated or controlled by the researcher, who sets its “values” or levels. (3)
index test. A diagnostic test evaluated against a reference standard to determine if it accurately identifies those with and without the condition of interest. (33)
inductive reasoning. The logical process of developing generalizations based on specific observations or facts. (4)
inductive theories. Data-based theories that evolve through a process of inductive reasoning. (4)
inferential statistics. That branch of statistics concerned with testing hypotheses and using sample data to make generalizations concerning populations. (23)
informed consent. An ethical principle that requires obtaining the consent of the individual to participate in a study based on full prior disclosure of risks and benefits. (7)
institutional review board (IRB). That group in an institution that is responsible for reviewing research proposals that will involve human subjects to determine adherence to ethical principles. (7)
instrumentation effect. A threat to internal validity in which bias is introduced by an unreliable or inaccurate measurement system. (15)
intention-to-treat (ITT). In a randomized trial, the principle whereby data are analyzed according to original group assignments, regardless of how subjects actually received treatment. (15)
interaction effect. The differential effect of levels of one independent variable on a second independent variable. (25)
internal consistency. A form of reliability, assessing the degree to which a set of items in an instrument all measure the same trait. Typically measured using Cronbach’s alpha. (9,26)
internal criticism. Assessment of the accuracy of sources in historical research. (20)
internal validity. The degree to which the relationship between the independent and dependent variables is free from the effects of extraneous factors. (15)
International Classification of Functioning, Disability and Health (ICF). A model developed by the World Health Organization that posits the relationship among a health condition, body structures/functions, activities, and participation. (1)
interprofessional. Members of multiple professions working together, contributing their various skills in an integrative fashion, sharing perspectives to inform decision making. (1)
interquartile range (IQR). The difference between the first and third quartiles in a distribution, often expressed graphically in a boxplot. (22)
inter-rater reliability. The degree to which two or more raters can obtain the same ratings for a given variable. (9,32)
interrupted time-series design (ITS). A quasi-experimental design involving a series of measurements over time, interrupted by one or more treatment occasions. (17)
interval scale. Level of measurement in which values have equal intervals, but no true zero point. (8)
intraclass correlation coefficient (ICC). A reliability coefficient used to assess test-retest and rater reliability based on an analysis of variance; a generalizability coefficient. (9,32)
intrarater reliability. The degree to which one rater can obtain the same rating on multiple occasions of measuring the same unchanging variable. (9)
item response theory (IRT). A model that examines error in a series of items that measure a latent trait, usually on a questionnaire, to determine how well the items differentiate individuals based on ability and difficulty of items. Also called latent trait theory. (12)
item-to-total correlation. Correlation of individual items in a scale with the total scale score; an indication of internal consistency. (9,32)
justice. Fairness in all aspects of the research process. (7)
Kaplan-Meier estimate. A nonparametric statistic used to estimate survival based on that portion of individuals living for a certain amount of time who have experienced a treatment or disease. (31)
kappa (κ). A correction factor for percent agreement measures of reliability, accounting for the potential effect of chance agreements. (9,32)
Kendall’s tau (τ). A nonparametric correlation procedure for use with ordinal data. (29)
knowledge translation. Describes the process of accelerating the application of knowledge to improve outcomes and change behavior for those involved in providing care. (5)
known groups method. A technique for construct validation, in which validity is determined by the degree to which an instrument can demonstrate different scores for groups known to vary on the variable being measured. (10)
Kolmogorov-Smirnov test of normality. A test of the null hypothesis that an observed distribution does not differ from the pattern of a normal distribution. If the test is not significant (p>.05), the distribution does not differ from a normal distribution. (see also Shapiro-Wilks test of normality) (23)
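The kappa (κ) entry above corrects percent agreement for chance, using the one-line formula κ = (po − pe) / (1 − pe). A minimal sketch (the agreement proportions here are hypothetical):

```python
def cohens_kappa(p_observed, p_chance):
    """Chance-corrected agreement: kappa = (po - pe) / (1 - pe)."""
    return (p_observed - p_chance) / (1 - p_chance)

# Two raters agree on 85% of ratings; 50% agreement is expected by chance.
kappa = cohens_kappa(0.85, 0.50)  # (0.85 - 0.50) / (1 - 0.50) = 0.70
```

With perfect observed agreement (po = 1.0), kappa is 1.0 regardless of the chance level.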
mean (X̄). A measure of central tendency, computed by summing the values of several observations and dividing by the number of observations; the value that is typically called the “average.” (22)
mean square (MS). A value representing the variance; calculated by dividing the sum of squares for a particular effect by the degrees of freedom for that effect. (22,25)
measurement error. The difference between an observed value for a measurement and the theoretical true score; may be the result of systematic or random effects. (9,32)
median. A measure of central tendency representing the 50th percentile in a ranked distribution of scores; that is, that point at which 50% of the scores fall below and 50% fall above. (22)
median survival time. In a survival analysis, the time when half the patients are expected to survive, or when the chance of surviving beyond that time is 50%. (31)
Medical Subject Headings (MeSH). Hierarchical structure of search terms developed by the National Library of Medicine. (6)
MEDLINE. Database of bibliographic references supported by the National Library of Medicine. (6)
member checking. In qualitative research, the process of sharing preliminary findings and interpretations with research participants, allowing them to offer validation, feedback, critique, and alternative explanations that ultimately contribute to the credibility of findings. (21)
meta-analysis. Use of statistical techniques to pool effect sizes from several studies with similar independent and dependent variables, resulting in an overall summary effect. (5,37)
meta-theory. An overarching theory that attempts to reconcile several theoretical perspectives in the explanation of sociological, psychological, and physiological phenomena. (4)
method error. A form of reliability testing for assessing response stability based on the discrepancy between two sets of repeated scores. (32 Supplement)
methodological research. Research designed to develop or refine procedures or instruments for measuring variables, generally focusing on reliability, validity, and change. (9,10)
middle range theories. Theories that sit between basic hypotheses that guide everyday practice and the systematic efforts to develop a unified theory to explain a set of social behaviors. (4)
minimal clinically important difference (MCID). The smallest difference in a measured variable that signifies an important rather than trivial difference in a measurement. (10,32)
minimal detectable change (MDC). That amount of change in a variable that must be achieved to reflect a true difference; the smallest amount of change that passes the threshold of error. (9,32)
misclassification. In cohort or case-control studies, classifying participants to an exposure or outcome category that is inaccurate. (19)
missing at random (MAR). Missing data that may be related to the methodology of the study but not to the treatment variable. (15)
missing completely at random (MCAR). Assumption that missing data are missing because of unpredictable circumstances that have no connection to the variables of interest or the study design. (15)
missing not at random (MNAR). Missing data related to group membership, creating bias in data. (15)
mixed design. A design that incorporates independent variables that are independent (between-subjects) and repeated (within-subjects) factors. Also called a split-plot design. (16)
mixed methods research. A study that incorporates both quantitative and qualitative research methods. (1,21)
mode. A measure of central tendency representing the most commonly occurring score in a distribution. (22)
model. Symbolic representation of reality delineating concepts or variables and their relationships, often demonstrating the structural components of a theory or process. (4)
multicollinearity. (see collinearity)
multiple baseline design. In single-subject research, a design for collecting data for more than one subject, behavior, or treatment condition wherein baseline phases are staggered to provide control. (18)
multiple comparison test. A test of differences between individual means following analysis of variance, used to control for Type I error. (26)
multiple imputation. A method of dealing with missing data by creating a random data set using the available data, predicting plausible values derived from observed data. (15)
multiple regression. A multivariate statistical technique for establishing the predictive relationship between one dependent variable and a set of independent variables. (31)
multistage sampling. (see cluster sampling)
multitrait-multimethod matrix (MTMM). An approach to validity testing to examine the
relationship between two or more traits measured by two or more methods. (10)
multivariate analysis. A set of statistical procedures designed to analyze the relationship among three or more variables; includes techniques such as multiple regression, discriminant analysis, factor analysis, and multivariate analysis of variance. (31)
multivariate analysis of variance (MANOVA). An advanced multivariate procedure that provides a global test of significance for multiple dependent variables using an analysis of variance. (31)
N-of-1 trial. A form of single-subject investigation that applies a randomized crossover design to an individual patient. (18)
natural history. Longitudinal study of a disease or disorder, demonstrating the typical progress of the condition. (20)
naturalistic inquiry. Qualitative observation and interaction with subjects in their own natural environment. (21)
negative case analysis. In qualitative research, a method of validation by searching for concepts that do not support or may contradict themes emerging from the data. (21)
negative likelihood ratio (LR–). (see likelihood ratio)
negative predictive value (PV–). (see predictive value)
Newman-Keuls (NK) multiple comparison test. (see Student-Newman-Keuls multiple comparison)
nominal scale. Level of measurement for classification variables; assignment of “values” based on mutually exclusive and exhaustive categories with no inherent rank order. (8)
nondirectional hypothesis. A research hypothesis (or alternative hypothesis) that does not indicate the expected direction of the relationship between independent and dependent variables. (3)
non-inferiority margin. In a non-inferiority study, the biggest difference that would be acceptable to consider the new treatment a reasonable substitute for the standard therapy. (14)
non-inferiority trial. A clinical trial designed to show that a new treatment is no better and no worse than standard care. (14)
nonparametric statistics. A set of statistical procedures that are not based on assumptions about population parameters, or the shape of the underlying population distribution; most often used when data are measured on the nominal or ordinal scales. (8,27)
nonprobability sample. A sample that was not selected using random selection. (11,13)
normal curve. (see normal distribution)
normal distribution. A symmetrical bell-shaped theoretical distribution with defined properties, where most of the scores fall in the middle of the scale and progressively fewer fall at the extremes. (22)
normative research. A descriptive research approach designed to determine normal values for specific variables within a population. (20)
norm referencing. Interpretation of a score based on its value relative to a standardized score. (10)
null hypothesis (H0). A statement of no difference or no relationship between variables; the statistical hypothesis. (3,23)
number needed to harm (NNH). The number of patients that need to be treated to observe one adverse outcome; the reciprocal of absolute risk increase (ARI). (34)
number needed to treat (NNT). The number of patients that need to be treated to prevent one adverse outcome or achieve one successful outcome; the reciprocal of absolute risk reduction (ARR). (34)
oblique rotation. In factor analysis, the rotation of factors that are correlated with each other. (31)
observational study. A study that does not involve an intervention or manipulation of an independent variable. (19)
observed frequencies. Frequencies that occur in a study of categorical variables. (28)
odds ratio (OR). An estimate of risk representing the odds that an outcome will occur given a particular exposure compared to the odds of the outcome occurring in the absence of the exposure; used as a risk estimate in case-control studies and logistic regression. (30,34)
one-tailed test. A statistical test based on a directional alternative hypothesis, in which critical values are obtained for only one tail of a distribution. (23)
one-way analysis of variance. An analysis of variance with one independent variable. (25)
one-way design. An experimental or quasi-experimental design that involves one independent variable. (16)
on-protocol analysis. Analysis of data in an experiment based only on subjects who completed the study according to assigned groups. Also called completer analysis or on-treatment analysis. (15)
open access. A term used to describe articles and journals that are free of restrictions on access. (6,38)
open-ended question. A question on a survey (interview or questionnaire) that does not restrict the respondent to specific choices but allows for a free response. (11)
open-label trial. A trial design wherein both the researchers and participants know which treatment is being administered. (14)
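The NNT and NNH entries above are reciprocals of absolute risk differences, so they can be computed in two lines. A minimal sketch with hypothetical event rates:

```python
# Hypothetical trial: events occur in 20% of controls and 12% of treated patients.
control_event_rate = 0.20
experimental_event_rate = 0.12

arr = control_event_rate - experimental_event_rate  # absolute risk reduction, 0.08
nnt = 1 / arr  # ~12.5, so about 13 patients must be treated to prevent one event
```

NNH is computed the same way from the absolute risk increase (ARI) when the treated group fares worse.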
operational definition. Definition of a variable based on how it will be used in a particular study; how a dependent variable will be measured, how an independent variable will be manipulated. (3)
order effects. The sequential effect of one subject being exposed to several treatments in the same order; potentially manifested as carryover or practice effects. (16)
ordinal scale. Level of measurement in which scores are ranks. (8)
orthogonal. A model in which variables are independent or uncorrelated. Literally means “perpendicular” in mathematics. (26,31)
outcome variable. (see dependent variable)
outlier. A numeric value that does not fall within the range of most scores in a distribution. (30)
paired t-test. A parametric test for comparing two means for correlated samples or repeated measures; also called a correlated t-test. (24)
pairwise deletion. Elimination of cases with missing data on particular variables in a specific analysis only. (15, Appendix C)
paradigm shift. Fundamental transition in the way a discipline thinks about priorities and relationships, stimulating change in perspectives, and fostering preferences for varied approaches to research. (1)
parallel group design. A design comparing two independent groups that are similar on all characteristics other than the intervention. (14)
parameter. A measured characteristic of a population. (13,22)
parametric statistics. Statistical procedures for estimating population parameters and for testing hypotheses based on population parameters, with assumptions about the distribution of variables, and for use with interval or ratio measures. (8,18)
partial correlation. The correlation between two variables, with the effect of a third variable removed; also called a first-order correlation. (29)
participant observation. A method of data collection in qualitative research in which the researcher is embedded as a participant in the group that is being observed. (21)
path diagram. A flow chart that shows interconnections among variables in a causal model. (31)
patient-centered outcomes research (PCOR). An approach that has the distinct goal of engaging patients and other stakeholders in the development of questions and outcomes measures, encouraging them to become integral members of the research process. (2)
patient-oriented evidence that matters (POEM). Refers to outcomes that measure things that a patient would care about, such as symptoms, quality of life, function, cost of care, length of stay. (2)
patient-reported outcome measures (PROM). Any report of the status of a patient’s health condition that comes directly from the patient, without interpretation of the patient’s response by a clinician or anyone else. (2)
Pearson product–moment coefficient of correlation (r). A parametric statistical technique for determining the relationship between two variables. (29)
peer review. Process of review of submitted manuscripts or research proposals by subject experts to determine quality. (7,38)
percent agreement. A reliability test for categorical variables, estimating the ability of researchers to agree on category ratings. (32)
per comparison error rate (αPC). Probability of making a Type I error in a single comparison. (26)
percentile. The percentage of a distribution that is below a specified value. A distribution is divided into 99 equal ranks, or percentiles, with 1% of the scores in each rank. (22)
performance bias. A source of bias related to differences in the provision of care to comparison groups. (15,37)
per-protocol analysis. Eliminating subjects who did not get or complete their assigned treatment, and including only those subjects who sufficiently complied with the trial’s protocol. (15)
person-item map. Output of a Rasch analysis that shows the distribution of persons according to ability and of items according to difficulty. (12)
person-time. The total amount of time each case in a study is at risk, used to calculate incidence rates. (34)
phase I trial. Researchers work to show that a new therapy is safe. (2,14)
phase II trial. Studies that explore efficacy and dosage effects of an intervention by measuring relevant outcomes. (2,14)
phase III trial. Builds on prior research to establish efficacy through randomized controlled studies that may include thousands of patients in multiple sites over many years. (2,14)
phase IV trial. Completed after a treatment has been approved with the purpose of gathering information on the treatment’s effect in various populations or subgroups, under different clinical conditions to explore side effects associated with long-term use, and to learn about risk factors, benefits, and optimal use patterns. (2,14)
phenomenology. An approach to qualitative research involving the study of complex human experience as it is actually lived. (21)
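The Pearson product–moment correlation defined above is the covariance of two variables scaled by their standard deviations. A minimal from-scratch sketch (toy data, not a library implementation):

```python
def pearson_r(x, y):
    """Pearson correlation: cross-product sum scaled by the two sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear data, r = 1.0
```

For real analyses a vetted routine (e.g., from a statistics library) is preferable; the point here is only that r falls out of sums of squares and cross products.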
phi coefficient (φ). A nonparametric correlation statistic for estimating the relationship between two dichotomous variables. (28,29)
PICO. An acronym used to represent identification of components of clinical and research questions: P = population, I = intervention, C = comparison, O = outcome. (3,5)
placebo. A control that is similar in every way to the experimental treatment except that it does not contain the active component that comprises the intervention’s actions. (14)
plagiarism. The appropriation of another person’s ideas, processes, results, or words without giving appropriate credit or attribution, and representing the work as one’s own. (7)
planned comparisons. Multiple comparison tests that are designated prior to running a study. (26)
point biserial correlation (rpb). A correlation statistic for estimating the relationship between a dichotomy and a continuous variable on the interval or ratio scale. (29)
point estimate. A single sample statistic that serves as an estimate of a population parameter. (22)
polynomial regression. Regression procedure for nonlinear data. (30)
polytomous. Variables that can have multiple values, such as a 5-point scale. (8,11)
population. The entire set of individuals or units to which data will be generalized. (11,13)
population-based study. A study in which participants are recruited from the general population. (19)
positive likelihood ratio (LR+). (see likelihood ratio)
positive predictive value (PV+). (see predictive value)
posterior probability. (see posttest probability)
post hoc comparisons. Multiple comparison tests that follow a significant analysis of variance. (26)
posttest probability (posterior probability). The probability of a condition existing after performing a diagnostic test; predictive value of a diagnostic test. Depends on the prior probability, and the test’s sensitivity and specificity. (33)
power (1-β). The ability of a statistical test to find a significant difference that really does exist; the probability that a test will lead to rejection of the null hypothesis, based on sample size, variance, level of significance, and effect size. (23)
practice-based evidence (PBE). The type of evidence that is derived from real patient care problems, identifying gaps between recommended and actual practice. (2)
practice effects. The effect of learning with repeated tasks in a repeated measures design. (16)
pragmatic clinical trial (practical clinical trial) (PCT). An effectiveness trial carried out in real-world settings. Participants represent the patients who would typically receive treatment, and testing takes place in clinical practice settings. Outcomes focus on issues of importance to patients and stakeholders. (2,14)
precision. 1. The number of decimal places to which a calculation is taken. (8) 2. The acceptable degree of error in estimating sample size for relative risk. (34)
preclinical research. (see basic research)
predictive validity. A form of measurement validity in which an instrument is used to predict future performance. (10)
predictive value (PV). In diagnostic testing, a positive predictive value (PV+) indicates the probability that individuals with a positive test truly have the disease. A negative predictive value (PV–) indicates the probability that individuals with a negative test truly do not have the disease. (33)
predictor variable. (see independent variable)
pretest probability (prior probability). The probability that a condition exists prior to performing a diagnostic test. Can be estimated based on the prevalence of the condition in a specified group of subjects. (33)
prevalence. The number of cases of a disease at a given point in time, expressed as a proportion of the total population at risk. (34)
primary outcome. The measure that will be used to arrive at a decision on the overall result of the study, and which represents the greatest therapeutic benefit. (3)
primary source. Reference source that represents the original document by the original author. (6)
principal axis factoring. A method of extraction in exploratory factor analysis. (31)
principal components analysis (PCA). Multivariate analysis that converts a set of possibly correlated variables into a set of uncorrelated principal components based on variance within the data. Differentiated from factor analysis, which uses a different model to account for variance. (31)
principal investigator. That person designated as the lead investigator who will have primary responsibility for overseeing the study. (35)
prior probability. (see pretest probability)
probability (p). The likelihood that an event will occur, given all possible events. (23)
probability sample. A sample chosen using random selection methods. (11,13)
propensity score. A single score based on a set of characteristics generated through logistic regression that can be used in matching subjects
in observational or nonrandomized studies to reduce baseline confounding. (15,31)
proportional hazards model. (see Cox’s regression)
proposition. Statement of the relationship between variables. (4)
prospective study. A study designed to collect data following development of the research question. (19)
publication bias. Tendency for researchers and editors to publish studies finding significant effects, to the exclusion of studies finding no effect. (6,37)
purposive (purposeful) sample. A nonprobability sample in which subjects are specifically selected by the researcher on the basis of subjective judgment that they will be the most representative. (13)
Q-sort. A research methodology based on ranking variables as to their importance. (11 Supplement)
quadratic trend. A nonlinear trend, with one turn in direction. (26,30)
qualitative research. Research that derives data from observation, interviews, or verbal interactions and focuses on the meaning of experience of the participants. (1,21)
quantitative research. Measurement of outcomes using numerical data under standardized conditions. (1)
quartile (Q). Three quartiles divide a distribution of ranked data into four equal groups, each containing 25% of the scores. (22)
quasi-experimental research. Comparative research approach in which subjects cannot be randomly assigned to groups or control groups are not used. (15,17)
quota sampling. Nonprobability sampling method in which stratification is used to obtain representative proportions of specific subgroups. (13)
random assignment (random allocation). Assignment of subjects to groups using probability methods, where every subject has an equal chance of being assigned to a group. (14)
random effect. Variable levels that are chosen at random. (32)
random error. A measurement error that occurs by chance, potentially increasing or decreasing the true score value to varying degrees. (8)
random sampling. Probability method of selecting subjects for a sample, where every subject in the population has an equal chance of being chosen. (13)
random selection. (see random sampling)
randomization. (see random assignment)
randomized block design. An experimental design in which one independent variable is an attribute variable (blocking variable), creating homogeneous blocks of subjects who are then randomly assigned to levels of the other independent variable. (16)
randomized consent design (Zelen design). Involves randomizing participants to experimental treatment or standard care prior to seeking consent, and then only approaching those who will be assigned to the experimental intervention. (14)
randomized controlled trial (RCT). An experimental study in which a clinical treatment is compared with a control condition, where subjects are randomly assigned to groups. Also called a randomized clinical trial. (1,14)
range. A measure of dispersion equal to the difference between the largest and smallest scores in a distribution. (22)
rank biserial correlation (rrb). A correlation procedure for estimating the degree of relationship between a dichotomy and an ordinal variable. (29)
Rasch analysis. Transformation of items on an ordinal scale to an interval scale, demonstrating the unidimensional nature of a scale. (12)
ratio scale. The highest level of measurement, in which there are equal intervals between score units and a true zero point. (8)
reactive measurement. A measurement that distorts the variable being measured, either by the subject’s awareness of being measured or by influence of the measurement process. (15)
recall bias. The possible inaccuracy of participants recalling medical history or previous exposures; of particular concern in retrospective studies. (11,19)
receiver operating characteristic (ROC) curve. In diagnostic testing, a plot of sensitivity (true positive rate) against 1-specificity (false positive rate), demonstrating strength of diagnostic accuracy, used to determine the most effective cut-off score. (33)
reference standard. A value used as a standard against which to judge a criterion; may or may not be a gold standard. Used to judge criterion-related validity or diagnostic accuracy. (10)
reflexivity. In qualitative research, the critical self-examination by researchers regarding their own biases or preferences and how they might influence interpretation of data. (21)
regression analysis. A statistical procedure for examining the predictive relationship between a dependent (criterion) variable and an independent (predictor) variable. (30)
regression coefficient. In a regression equation, the weight (b) assigned to the independent variable; the slope of the regression line. (30)
regression line. The straight line that is drawn on a scatter plot for bivariate data from the regression
equation, summarizing the relationship between risk factor. A characteristic or exposure that poten-
variables. (see also least squares method). (30) tially increases the likelihood of having a disease or
regression toward the mean (RTM). A statistical condition. (34)
phenomenon in which scores on a pretest are likely risk ratio. (see relative risk)
to move toward the group mean on a posttest be- ROC curve. (see receiver operating characteristic curve)
cause of inherent positive or negative measurement run-in period. A time during which all eligible partici-
error; also called statistical regression. (15) pants receive a placebo prior to implementing an
relative reliability. Reliability measures that reflect intervention, and only those who are adherent to
…true variance as a proportion of the total variance in a set of scores. (9)
relative risk (RR). Estimate of the magnitude of the association between an exposure and disease, indicating the likelihood that the exposed group will develop the disease relative to those who are not exposed. (34)
relative risk reduction (RRR). The reduction in risk associated with an intervention relative to the risk without the intervention (control); the absolute difference between the experimental event rate and the control event rate divided by the control event rate. (34)
reliability. The degree of consistency with which an instrument or rater measures a variable; the degree to which a measurement is free from error. (9,32)
reliability coefficient. Value used to quantify the degree of consistency in repeated measurements. (9)
repeated measure (repeated factor). An independent variable for which subjects act as their own control. Also called a within-subjects factor. (15,16)
research hypothesis. A statement of the researcher's expectations about the relationship between variables under study. (3,23)
reverse causation. A situation wherein the variable designated as the "outcome" may actually cause the "exposure." (19)
residual (Y − Ŷ). The difference between an observed value and a predicted value. (30,31)
response rate. Percentage of people who receive a survey who actually complete it. (11)
response stability. Consistency with which a response is manifested over repeated trials. (32)
responsiveness. The ability of a test to demonstrate change. (10,32)
retrospective study. A study that analyzes observations that were collected in the past. (13,19)
risk. Consideration of physical, psychological, or social harm that goes beyond expected experiences in daily life. (19,34)
risk–benefit ratio. An ethical principle that is an element of informed consent, in which the risks of a research study to the participant are evaluated in relation to the potential benefits of the study's outcomes. (7)
…the protocol are eligible for the formal trial. (14)
sample. Subset of a population chosen for study. (13)
sampling bias. Bias that occurs when individuals who are selected for a sample overrepresent or underrepresent the underlying population characteristics. (13)
sampling distribution. A theoretical frequency distribution of a statistic based on the value of the statistic over an infinite number of samples. (18)
sampling error. The difference between an observed statistic from a sample and the population parameter. (11,13)
saturation. In the ongoing analysis of qualitative data, the point at which no new knowledge contributes to the emerging theory. (21)
scale. An ordered system based on a series of questions or items, resulting in a score that represents the degree to which a respondent possesses a particular attitude, value, or characteristic. (12)
scale of measurement. (see level of measurement)
scatter plot. A graphic representation of the relationship between two variables. (29)
Scheffé's multiple comparison test. A multiple comparison procedure for comparing means following a significant analysis of variance based on the F distribution. Considered the most conservative of the multiple comparison methods. (26)
scoping review. An exploratory review of a broad question about a clinical topic, including a comprehensive synthesis of evidence with the aim of informing practice, programs, and policy, and providing direction to future research. (5,37)
secondary analysis. An approach to research involving the use of data that were collected for another purpose. (13)
secondary outcome. Other endpoint measures, besides the primary outcome, that may be used to assess the effectiveness of the intervention, as well as side effects, costs, or other outcomes of interest. (3)
secondary source. Reference source that represents a review or report of another's work. (6)
selection bias. A threat to internal validity in which bias is introduced by initial differences between groups, when these differences are not random. (15)
6113_Glossary_637-658 04/12/19 3:49 pm Page 652
652 Glossary
self-report measures. Data based on participants' reports of their own attitudes or conditions. (12)
sensitivity. 1. A measure of validity of a screening procedure, based on the probability that someone with a disease will test positive; true positive rate. (33) 2. In a literature search, the proportion of relevant articles identified out of all relevant articles on that topic; ability to identify all pertinent citations. (6)
sensitivity analysis. A procedure in decision making to determine how decisions change as values are systematically varied. (37)
sequential analysis. In mixed methods research, the design of studies that follow each other, whereby the first study informs the second. (21)
sequential clinical trial. Experimental research design that allows consecutive entrance to a clinical trial and continuous analysis of data, permitting stopping of the trial when data are sufficient to show a significant effect. (16)
serial dependency. Correlation in a set of data collected over time, in which one observation can be predicted based on previous observations. (18)
sham treatment. Analogous to a placebo, involving a "fake" treatment as a control condition. (14)
Shapiro-Wilk test of normality. A test of the null hypothesis that an observed distribution does not differ from the pattern of a normal distribution. If the test is not significant (p>.05), the distribution does not differ from a normal distribution. (see also Kolmogorov-Smirnov test of normality) (23)
sign test. A nonparametric statistical procedure for comparing two correlated samples, based on comparison of positive or negative outcomes. (27)
significance. (see statistical significance)
significance level (α). (see alpha level)
simple contrast. In a multiple comparison, the contrast of two means to determine if they are significantly different from one another. (26)
simple hypothesis. A hypothesis statement that includes one independent variable and one dependent variable. (3)
single-blind study. An experiment in which either the investigator or the subject is kept ignorant of group assignment, but not both. (14)
single-subject design (SSD). An experimental design based on time-series data from one or more subjects, with data compared across baseline and intervention phases. Also called single-case design. (18)
skewed distribution. A distribution of scores that is asymmetrical, with more scores to one extreme. (22)
slope. 1. In regression analysis, the rate of change in values of Y for one unit of change in X. (30) 2. In single-subject research, the rate of change in the magnitude of the target behavior over time. (18)
SnNout. Mnemonic to remember that with high sensitivity, a negative test rules out the diagnosis. (33)
snowball sampling. A nonprobability sampling method in which subjects are successively recruited by referrals from other subjects. (13)
Spearman's rank correlation coefficient (rs). A nonparametric correlation procedure for ordinal data. Also called Spearman's rho. (29)
specificity. 1. A measure of validity of a screening procedure, based on the probability that someone who does not have a disease will test negative; true negative rate. (34) 2. In a literature search, the proportion of relevant citations retrieved, the ability to exclude irrelevant citations. (6)
sphericity. An assumption in a repeated-measures ANOVA that the variances of the differences between all possible pairs of within-subject conditions (levels of the independent variable) are equal. (25)
split-half reliability. A reliability measure of internal consistency based on dividing the items on an instrument into two halves and correlating the results. (9)
split middle line. In single-subject research, a line used to separate data points within one phase into equal halves, reflecting the trend of the data within that phase. (18)
SpPin. Mnemonic to remember that with high specificity, a positive test rules in the diagnosis. (33)
standard deviation. A descriptive statistic reflecting the variability or dispersion of scores around the mean. (22)
standard error of measurement (SEM). A reliability measure of response stability, estimating the standard error in a set of repeated scores. (9,26)
standard error of the estimate (SEE). In regression analysis, an estimate of prediction accuracy; a measure of the spread of scores around the regression line. (24)
standard error of the mean (sX̄). The standard deviation of a distribution of sample means; estimated by dividing the sample standard deviation by the square root of the sample size. (18)
standardized mean difference (SMD). Difference between group means divided by their common standard deviation (analogous to the effect size index, d). (24,37)
standardized residual. Difference between observed and expected values in a chi-square test divided by
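The sensitivity and specificity entries above, along with the SnNout and SpPin mnemonics, can be illustrated with a hypothetical 2×2 screening table; the counts are invented for illustration:

```python
# Hedged sketch: sensitivity and specificity from a hypothetical
# screening-test 2x2 table (all counts are invented).
tp, fn = 90, 10    # diseased subjects: test positive / test negative
tn, fp = 80, 20    # non-diseased subjects: test negative / test positive

sensitivity = tp / (tp + fn)   # true positive rate: P(test+ | disease)
specificity = tn / (tn + fp)   # true negative rate: P(test- | no disease)

# SnNout: with high sensitivity, a negative result argues against disease;
# SpPin: with high specificity, a positive result argues for it.
print(sensitivity, specificity)   # 0.9 0.8
```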
the expected frequency, indicating the contribution of each cell to the overall statistic. (28)
standardized response mean (SRM). One approach to evaluating effect size with change scores. Calculated as the difference between pretest and posttest scores, divided by the standard deviation of the change scores. (32)
standardized score. (see z-score)
statistic. A measured characteristic of a sample. (13,22)
statistical conclusion validity. The validity of conclusions drawn from statistical analyses, based on the proper application of statistical tests and principles. (15)
statistical hypothesis. (see null hypothesis)
statistical process control (SPC). A method of charting production outcomes over time to identify and monitor variances; can be used as a method of analysis for single-subject designs. (18)
stem-and-leaf plot. A graphic display for numerical data in a frequency distribution showing each value in the distribution. (22)
stepwise multiple regression. An approach to multiple regression that involves a sequential process of selecting variables for inclusion in the prediction equation. (30)
stopping rule. In a sequential clinical trial, the threshold for stopping a study based on crossing a boundary that indicates a difference or no difference between treatments. (16)
stratified random assignment. Accomplished by first dividing subjects into strata, and then within each stratum randomly assigning them to groups. (14)
stratified random sampling. Identifying relevant population characteristics, and partitioning members of a population into homogeneous, nonoverlapping subsets, or strata, based on these characteristics. (13)
structural equation modeling (SEM). Multivariate method of modeling theoretical causal relationships of latent variables. (31)
studentized range (q). Critical values used to determine significant differences in Tukey and Student-Newman-Keuls multiple comparisons.
Student-Newman-Keuls (SNK) multiple comparison. A stepwise multiple comparison procedure used to determine if means are significantly different following an analysis of variance. Also called the Newman-Keuls (NK) test. (26)
Student's t-test. (see t-test)
summative scale. A scale that results in a total score by adding values across a set of items. (12)
sum of squares (SS). A measure of variability in a set of data, equal to the sum of squared deviation scores for a distribution [Σ(X − X̄)²]; the numerator in the formula for variance. Used in analysis of variance and other procedures as the basis for partitioning between-groups and within-groups variance components. (22)
superiority trials. Clinical trials that seek evidence in favor of a new treatment. (14)
survival analysis. Methods to analyze data where the variable of interest is the time until occurrence of an outcome event. (31)
systematic error. A form of measurement error, where error is constant across trials. (9)
systematic review. Review of a clearly formulated question that uses systematic and explicit methods to identify, select, and critically appraise relevant research. (1,5,37)
systematic sampling. A sampling method in which persons are randomly chosen from unordered lists using a fixed sampling interval, such as every tenth person. (13)
t-test. A parametric test for comparing two means; also called Student's t-test (see paired t-test and unpaired t-test). (24)
target behavior. In single-subject research, the response behavior that is monitored over time. (18)
target population. The larger population to which results of a study will be generalized, defined by clinical and demographic characteristics. (3,13)
testing effect. The effect that occurs when a test itself is responsible for observed changes in the measured variable. (15)
test–retest reliability. The degree to which an instrument is stable, based on repeated administrations of the test to the same individuals over a specified time interval. (9)
theoretical sampling. In qualitative research, the process of data collection for generating theory by continuous analysis and recruitment of participants to add to themes. (21)
time-series design. (see interrupted time-series design)
tolerance. In multiple regression, a measure of collinearity among independent variables. Lower values of tolerance (<.2) indicate greater collinearity. (30)
transferability. In qualitative research, the extent to which findings can be generalized beyond the study. (21)
translational research. Clinical investigations with human subjects in which knowledge obtained from basic research is translated into diagnostic
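The sum of squares entry above is the computational core of the variance formula. A minimal sketch with invented scores:

```python
# Hedged sketch: sum of squares SS = sum((x - mean)^2), the numerator
# of the variance formula. Scores are invented for illustration.
scores = [2, 4, 6, 8]
mean = sum(scores) / len(scores)              # 5.0
ss = sum((x - mean) ** 2 for x in scores)     # 9 + 1 + 1 + 9 = 20.0
variance = ss / (len(scores) - 1)             # sample variance, SS / (n - 1)
print(mean, ss, variance)
```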
or therapeutic interventions that can be applied to treatment or prevention. (2)
treatment arm. Another term for independent groups in a clinical trial. (14)
trend. 1. The shape of a distribution of scores taken over time, reflecting the distribution's linearity or lack of linearity. (26) 2. In single-subject research, the direction of change in the target behavior within a phase or across phases. (18)
trend analysis. Using an analysis of variance, a test to assess trend within data taken over ordered intervals; can express data as linear, quadratic, or cubic, reflecting the number of changes in direction in the data over time. (26)
triangulation. The use of multiple methods to document and confirm observations related to a phenomenon. (21)
true negative. A test result that is negative for those who do not have the disease or condition of interest. (33)
true positive. A test result that is positive for those who do have the disease or condition of interest. (33)
truncation. Usually an * used to substitute for word endings that may have varied forms in a literature search. (6)
Tukey's honestly significant difference (HSD). A multiple comparison test for comparing multiple means following a significant analysis of variance. (26)
two standard deviation band method. A method of data analysis in single-subject research; involves calculating the mean and standard deviation of data points within the baseline phase and extending these values into the intervention phase. If two or more consecutive points in the intervention phase fall outside these bands, the change from baseline to intervention is considered significant. (18)
two-tailed test. A statistical test based on a nondirectional alternative hypothesis, in which critical values represent both positive and negative tails of a distribution. (22)
two-way analysis of variance. An analysis of variance with two independent variables. (25)
two-way design. An experimental or quasi-experimental study that involves two independent variables. (16)
Type I error. An incorrect decision to reject the null hypothesis, concluding that a relationship exists when in fact it does not. (23)
Type II error. An incorrect decision to not reject the null hypothesis, concluding that no relationship exists when in fact it does. (23)
unpaired t-test. A parametric test for comparing two means for independent samples; also called an independent t-test. (24)
unplanned comparisons. Multiple comparison tests that explore all possible comparisons following a significant analysis of variance. (26)
validity. 1. The degree to which an instrument measures what it is intended to measure. (10) 2. The degree to which a research design allows for reasonable interpretations from the data, based on control of bias and confounding (internal validity), appropriate operational definitions (construct validity), appropriate analysis procedures (statistical conclusion validity), and generalizability (external validity). (15)
variable. A characteristic that can be manipulated or observed and that can take on different values, either quantitatively or qualitatively. (3)
variance. A measure of variability in a distribution, equal to the square of the standard deviation. (22)
variance inflation factor (VIF). In multiple regression, a measure of collinearity, the inverse of tolerance. Higher values indicate greater collinearity. (30)
vector. In MANOVA, the combined mean of several dependent variables for each group. (31)
visual analog scale (VAS). An instrument used to quantify a subjective experience by marking a line between two anchors that represent extremes of the experience. A commonly used visual analog scale is a 10 cm line labeled with "no pain" as the left anchor, and "worst pain imaginable" as the right anchor. (11,12)
wait list control group. A control group composed of subjects for whom the experimental treatment is delayed, scheduled to receive the treatment following completion of the study. (14)
Wald statistic. Statistic used to assess significance of coefficients in a logistic regression. (31)
washout period. In a crossover design, that period of time between administration of the two treatments, allowing effects of the experimental treatment to dissipate. (16)
weighted kappa (κw). An estimate of percentage agreement, corrected for chance, based on weights reflecting levels of seriousness of disagreements. (32)
wildcard. Usually an * used in a literature search to substitute for several letters within a search term, to allow for alternate spellings. (6)
Wilcoxon signed-ranks test (T). A nonparametric statistical procedure, comparing two correlated samples (repeated measures); analogous to the paired t-test. (27)
Wilks' lambda (λ). Statistical test of significance in multivariate analysis of variance. (31)
withdrawal design. In single-subject research, a design that involves withdrawal of the intervention. (18)
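The two standard deviation band method described above is easy to sketch; the single-subject data here are invented for illustration:

```python
# Hedged sketch of the two standard deviation band method:
# compute baseline mean and SD, extend +/- 2 SD bands into the
# intervention phase, and flag two or more consecutive points
# outside the bands. Data are invented for illustration.
baseline = [4, 5, 6, 5, 5]
intervention = [8, 9, 7, 9, 8]

mean = sum(baseline) / len(baseline)
sd = (sum((x - mean) ** 2 for x in baseline) / (len(baseline) - 1)) ** 0.5
upper, lower = mean + 2 * sd, mean - 2 * sd

# Significant change: at least two consecutive intervention points
# fall outside the baseline bands.
outside = [x > upper or x < lower for x in intervention]
significant = any(a and b for a, b in zip(outside, outside[1:]))
print(significant)
```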
within-groups variance. (see error variance)
within-subjects design (repeated measures design). A design in which subjects act as their own control. (16)
Yates' continuity correction. In the chi-square test, a correction factor applied when expected frequencies are too small, effectively reducing the chi-square statistic. (28)
z distribution. The standardized normal distribution, with a mean of 0 and a standard deviation of 1. (22)
Zelen design. (see randomized consent design)
z-score. The number of standard deviations that a given value is above or below the mean of the distribution; also called a standardized score. (22)
zero-order correlation. A bivariate correlation. (29)
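The z-score defined above is a one-line computation; the mean and standard deviation here are invented for illustration:

```python
# Hedged sketch: z = (value - mean) / sd expresses a score's distance
# from the mean in standard deviation units. Numbers are invented.
mean, sd = 100.0, 15.0
value = 130.0
z = (value - mean) / sd
print(z)   # 2.0: two standard deviations above the mean
```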
α alpha 1. Level of significance, denotes risk of Type I error; α1 and α2 represent one-tailed and
two-tailed levels of significance (23) 2. Also used in Cronbach’s alpha. (32)
αFW Familywise level of significance for a set of multiple comparisons. (26)
αPC Per comparison level of significance for individual comparisons when multiple comparisons
are made. (26)
β beta 1. Probability associated with Type II error. (23) 2. Coefficient in structural equation.
(31 Supplement)
γ gamma Coefficient in structural equation. (31 Supplement)
ε epsilon Statistic for repeated measures ANOVA, used to adjust degrees of freedom to correct for
unequal variances. (25)
η² eta Eta-squared, effect size in one-way ANOVA. (25)
η²p Partial eta-squared, effect size for two-way ANOVA. (25)
θ theta Effect size for proportions, used in sequential clinical trials. (16 Supplement)
κ kappa Chance-corrected measure of agreement. (32)
κw Weighted kappa. (32 Supplement)
λ lambda 1. A measure of association between categorical variables. (28) 2. Wilks’ lambda, multivariate
measure of effect. (31)
μ mu Population mean. (22)
ρ rho Population measure of correlation. (29)
σ sigma (lower case) Population standard deviation. (22)
σ² Sigma-squared, population variance. (22)
σX̄ Population standard error of the mean. (22)
∑ sigma (upper case) Used to mean "the sum of." (22)
τ tau Kendall’s tau. (29)
φ phi Phi coefficient measure of association. (28,29)
χ² chi Chi-square test of significance for association between categorical variables. (28)
χ²r Chi-square r, test statistic for the Friedman two-way ANOVA by ranks. (27)
ω² omega Omega squared, effect size in ANOVA. (25)
Abbreviations
Index
Note: Page numbers followed by f refer to figures, those followed by t refer to tables. Numbers following S refer to online chapter supplements.
A-B-A-B design, 255, 256f
A-B-A-C design, 258
A-B-A design, 254, 255f
A-B design, 251, 251f, 254
Absolute effect, 534
Absolute reliability, 118, 492
Absolute risk (AR), 534
Absolute risk increase (ARI), 543
Absolute risk reduction (ARR), 541, 541t, 542, 542f
Abstracts, 290–91, 549, 602
Accessible population, 181
Accidental sample, 189
Active controls, 198–99
Active variable, 194
Adjusted means. See Analysis of covariance
Adjusted odds ratio (aOR), 460
Adjusted R Square, 444
Agreement, measures of, 494–96
  percent agreement, 494
  kappa, 494–97, 495t
  limits of, 499–501
Allocation concealment, 197–198, 584
Alpha (α)
  level of significance, 340–42
  Cronbach's alpha, 122
Alphanumeric variable, 631
Alternate forms reliability, 122, 499–501
Alternating treatment design, 257–58
Alternative hypothesis (H1), 338
AMSTAR–2, 591
Analysis of covariance (ANCOVA), 461–466, 463t
  adjusted means, 462–64, 464f
  covariate, 220, 462, 464, 465
  for confounding effects, 220
  homogeneity of slopes, 464
  intact groups, 465
  multiple covariates, 465
  null hypothesis, 462
  pretest-posttest adjustments, 465
  regression analysis, 462–65
  power and sample size, 464, S30–3
Analysis of variance (ANOVA), 365–82. See also Friedman ANOVA, Kruskal-Wallis ANOVA, MANOVA
  critical values table, 618t, Appendix A Online
  degrees of freedom, 368, 371, 375, 377, 378
  effect size, 369
  F statistic, 368–69, 371, 373, 377
  interaction effect, 371–73, 372t, 374t
  Intraclass correlation coefficient (ICC), 489
  main effect, 370–371, 371f
  mixed designs, 378–81, 381t
  null hypothesis, 366
  of regression, 444
  one-way, 365–70, 366t
  partitioning variance, 366–67, 367f, 370, 375
  power and sample size, 369, 369b, 373, 375, 378, 380, S25–2
  repeated measures, 373–78, 376t, 379t
  two-way, 370–73, 370f
Analytic epidemiology
  measures of risk, 533–39
  measures of treatment effect, 539–44
Anchor-based approach, 504–505
ANCOVA. See Analysis of covariance (ANCOVA)
ANOVA. See Analysis of variance (ANOVA)
Applied research, 11
Appraisal. See Critical appraisal
a priori tests. See Power
Arc sine transformation, 636
Area probability sampling, 188
Area under normal curve, 329f, 330, 331f, 614–16
Area under the curve (AUC), 521, 521t
Arm. See Treatment arm
Attribute variable, 194
Attrition
  bias, 584
  internal validity, 213, 222
Audit trail, 311
Authority, 54, 56f
Authorship, 599
Autocorrelation, 264
Autonomy, 90
Background question, 58
Backward inclusion, 452, 453
Bartlett's test of sphericity, 470t, 471
Basic research, 11, 19, 20f, 58, 201
Bayes' theorem, 515b
Belmont Report, 90
Bench research, 11
Beneficence, 90–91
Beta (β), 342, 637. See also Power
Beta weights, 448
Between-subjects design, 228
Bias
  ascertainment, 199
  attrition, 584
  case control study, 281–82
  Cochrane risk of bias tool, 584, 584f
  cohort study, 280
  detection, 199, 213, 584
  experimental, 217
  interviewer, 282
  observation, 282
  performance, 199, 214, 584
  publication, 71, 345, 579
  reactive, 305
  recall, 282
  sampling, 185
  selection, 189, 213–214, 281–82, 584
Binomial test, 264, 264t, 613, S18–3t, S27–2t
Bland-Altman plot, 499–501, 500f, 501, 502b
Blinding, 199, 200, 201, 214, 215
Blocking variable, 219, 233
Block randomization, 195t, 196
Bonferroni correction, 384, 385t, S26–3
Boolean logic, 75–76, 77f
Box plot, 325–26, 325f
Canonical correlation, 480
Carryover effects, 120, 234
Case-control study, 12f, 13t, 280–82, 280f
  assessing methodological quality, 585
  case definition, 281, 281b
  confounding, 282
  interviewer bias, 282
  matching, 282
  observation bias, 282
  population-based/hospital-based study, 281
  recall bias, 282
  retrospective study, 280f
  selection bias, 281–82
  selection of cases and controls, 281
Case reports, 12f, 14t, 289–93, S20
  case series, 289
  epidemiologic reports, 292–93
  ethics, 292
  format, 290–292
  generalizability, 292
  purposes, 289–90
Case studies, 304–5
CAT. See Critically appraised topic
Causal modeling, 474
Cause-and-effect association, 273–74
  reverse causation, 277
Ceiling effect, 138
Celeration line, 263, S18–2
Censored observations, 482
Central tendency, 322–24
  and skewness, 323, 324f
  mean, 323, 324f
  median, 322–23, 324f
  mode, 322, 324f
Change, measuring, 136–38, 122–24, 501–6
  anchor-based approaches, 504
  distribution-based measures, 503
6113_Index_659-672 03/12/19 11:02 am Page 660
660 Index
  effect size, 503–4, 504t
  global rating scale, 504–5, 505t
  minimal clinically important difference (MCID), 136–38, 205–6, 501f, 503–6
  minimal detectable change (MDC), 123–24, 123f, 136–38, 501–2, 501f
  reliability, 122–24, 501
  responsiveness, 501
  validity, 136–38, 501
Change score, 122, 136
Changing criterion design, 259, S18–1
Chi-square, 415–27. See also McNemar test
  assumptions, 416
  chi-square statistic, 415–16
  contingency table, 419, 420f
  critical values table, 620t, Appendix A Online
  degrees of freedom, 416, 421, 424
  effect size, 423–24, 423t
  expected frequencies, 415, 416, 418, 420
  Fisher's exact test, 421
  goodness of fit, 416–19
  known distribution, 417–19, 418t
  McNemar test. See McNemar test
  null hypothesis, 415, 417–20, 424
  odds ratio, 425, 426b
  power and sample size, 423–424, 424b, S28
  standardized residuals, 419, 421
  testing proportions, 415–16
  tests of independence, 419–22, 422t
  uniform distribution, 416–17, 417t
  Yates' continuity correction, 421
CINAHL, 72–73, 72t, 579
Citation management applications, 85
Classical measurement theory, 115, 168
Classificatory scale, 109
Clinical practice guidelines (CPGs), 61, 575–76, 575t, 587, 590
Clinical prediction rules (CPR), 523–26
  diagnosis, 523–524, 524f
  intervention, 524
  prognosis, 524
  validation, 525–26, 526f
Clinical queries, 81, 82f
Clinical research
  defined, 2–3
  frameworks, 5–16
  types of research, 11–14, 12f, 13–14t
Clinical trials, 192–209
  active/attribute variable, 194
  allocation concealment, 197–98, 584
  blinding, 199, 200, 201, 214, 215
  block randomization, 195t, 196
  clinical trials registries, 193b
  cluster random assignment, 195t, 196–97
  control groups, 198–99
  defined, 192
  equipoise, 96b, 198b
  equivalence trial, 206
  explanatory vs. pragmatic trial, 200–201, 201t
  non-inferiority trial, 205–6, 206f
  open-label trial, 199
  parallel group, 193
  phases of clinical trials, 19–21, 201–4
  PICO, 194
  placebo, 198, 198b
  pragmatic clinical trial (PCT), 200–201, 201t, 202b
  random assignment (allocation), 194–98, 195t
  randomized consent design, 195t, 197
  randomized controlled trial (RCT), 193, 193f, 200
  run-in period, 195t, 197
  stratified random assignment, 195t, 196
  superiority trial, 205, 206f
  wait list control group, 198
Closed-ended questions, 146
Cluster analysis, 477–78, 478f, 479f
Cluster random assignment, 195t, 196–97
Cluster sampling, 186t, 187–88, 188f
  multistage, 188, 188f
Cochrane Collaboration, 72t, 73, 576b, 579, 587, 588
Cochrane risk of bias tool, 584, 584f
Cochran's Q, 589
Code books, 633
Coding
  mixed methods research, 313–14
  qualitative research, 309
  regression analysis, 453–54
Coefficient of determination, 440
Coefficient of variation (CV), 328, 494
Cohen's f, 369, S25–2
Cohort effect, 287
Cohort studies, 12f, 13t, 278–80, 278f
  advantages/disadvantages, 279
  assessing methodological quality, 585
  bias, 280
  misclassification, 279–80
  prospective study, 280f
  retrospective study, 280f
  subject selection, 279
  types of, 278
Collinearity, 448–49, 448f
Common cause variation, 266
Common Rule, 90, S7
Communalities, 470t, 471
Comparative effectiveness research (CER), 19
Comparative studies, 35–36
Compensatory equalization, 215
Completer analysis, 223
Complex contrasts, 395
Complex hypothesis, 90
Computer adaptive test (CAT), 174
Concurrent validity, 129t, 131, 277, 278
Confidence intervals (CI), 336–338, 337f, 336f
  defined, 336
  diagnostic accuracy, 518–19
  intraclass correlation coefficient (ICC), 489
  number needed to treat (NNT), 542
  paired t-test, 360
  relative risk, 535–36
  unpaired t-test, 357–60
Confidentiality, 93t, 97, 629
Confirmatory factor analysis (CFA), 134, 474, 475b
Conflict of interest (COI), 100
Confounding
  analysis of covariance (ANCOVA), 220
  case-control study, 282
  observational research, 275
  variable, 212, 538
Consecutive sampling, 189
CONSORT statement, 183, 184f, 600t, 601–2t
Constant comparison, 303
Constructs, 43–44, 108, 132, 133b, 159–73, 468–469, 473–478
  latent trait, 108, 469, 469f
Construct validity, 129f, 129t, 132–34, 215–17
  and content validity, 132
  convergence and discrimination, 133–34
  experimental bias, 217
  factor analysis, 134
  known groups method, 132
  methods of construct validation, 132
  multiple treatment interactions, 217
  multitrait-multimethod matrix (MTMM), 133–34
  Rasch analysis, 134
  subgroup differences, 216
  time frame, 216–17
  types of evidence, S10–2
Content analysis, 309
Content validity, 129–130, 129t, 129f
  universe of content, 130
  content validity index (CVI), 130, S10–1
Contingency coefficient (C), 423, 423t
Contingency table, 419
Continuity correction, 421
Continuous variable, 107
Contrast coefficient, 395
Control event rate (CER), 540, 541t
Control groups, 96, 198–99
  active controls, 198, 229
  inactive controls, 198
  placebo, 198, 229
  sham, 198
  wait list, 198
Convenience (accidental) sampling, 186t, 189
Conventional effect sizes
  d, 360
  η², 369
  f, 369
  f², 449
  r, 432
  R², 449
  w, 423
Convergent validity, 129t, 133 Crossover design, 235, 235f, 259 per-protocol analysis, 222
Correlation, 428–39 washout period, 235 statistical conclusion validity, 210–11
assumptions, 429–30 Cross-sectional study, 276–78, 398 Detection bias, 199, 213, 584
canonical, 480 accuracy of diagnostic test, 277, 278 Developmental research, 12f, 14t, 286–87
causation, and, 435–36, 436f challenges, 277–78 cohort effect, 287
comparison, and, 435 concurrent validity, 277, 278 natural history, 286
correlation coefficient, 428–29 developmental research, 286–87 Diagnostic tests, 509–28
critical values table, 619t, Appendix A Crosstabulation, 494 clinical prediction rules (CPR), 523–26
Online C statistic, 265, S18–5 confidence intervals, 518–19
curvilinear relationship, 430, 430f Cubic trend, 396 critical appraisal of diagnostic studies,
degrees of freedom, 432 Cumulative incidence. See Incidence 564–65, 564t, 585
of dichotomies, 435, S29–2 Cumulative scales, 167–68 diagnostic accuracy, 511, 513
hypothesis testing, 429 Cutoff scores, 161, 519–523 gold standard, 510
item-to-total, 497–99 index test, 510
Kendall’s tau-b, 434, 434t Data bases, 71–74, 72t, S6 likelihood ratios (LR), 516–17
linear relationships, 429f, 430 Data cleaning, 633 nomogram, 517–18, 518f
null hypothesis, 431 Data collection, 630–32 positive and negative tests, 511
outliers, 436–37, 436f Data management, 629–36 predictive values (PV), 513
null hypothesis, 431 confidentiality and security, 629 pretest and posttest probabilities,
partial, 437–38, 438f data cleaning, 633 514–19, 516f
Pearson correlation coefficient, data collection, 630–32 prevalence, 513
431–32, 432t, 433b data entry, 630, 633 QUADAS and QUADAS-2, 585
positive/negative, 428 data modification/transformation, ROC curves, 519–23, 522t
power and sample size, 432, S29–4 633–36 ruling in and ruling out, 513–14, 514f
range of test values, 437 listwise and pairwise deletion, 632 sensitivity and specificity, 131b, 511,
ranked data, 432–35 missing data, 632 512t
scatter plot, 428, 429f monitoring subject participation, 629 Dichotomous variables, 106
Spearman rank correlation coefficient. statistical programs, 629–30 Differences scores, 360, 499–501
See Spearman rank correlation type of variable, 630–31 Differential item functioning (DIF),
zero-order, 438 variable names, 630 173–74
COSMIN, 585 Data managing boards, 204b Differential misclassification, 279
Covariate, 220, 440, 462, 464, 465 Data modification, 633–36 Direct costs, 553
Cox proportional hazards model, 483 Data transformation, 635–36 Directional hypothesis, 38, 39
CPR. See Clinical prediction rules Declaration of Helsinki, 89–90, 96, 96b, Discrete variable, 107
Cramer’s V, 423, 423t 198b Discriminant analysis, 480–481, 481f
Criterion referencing, 135 Deductive reasoning, 45–46 Discriminant validity, 129t, 133
Criterion-related validity, 129t, 129f, deductive theories, 45–46 Disease-oriented evidence (DOE), 24
130–31, 277, 278 Degrees of freedom (df), 354 Disproportional sampling, 186t, 188,
concurrent validity, 128t, 131 Delphi survey, 142, S11–1 S13–2
predictive validity, 129t, 131–32 Dependent variable, 35, 36, 36f, 440 Disseminating research, 597–612
Critical appraisal, 557–73 Descriptive research, 11, 12f, 14t, 32, 34, authorship, 599, 599b
appraisal process, 558, 559t 285–96 choosing a journal, 597–99
appraisal worksheets, S36–2 case reports, 289–93 e-poster, 610
are results meaningful?, 560–61 descriptive survey, 12f, 288–89 journal impact factor, 597–8, 598b
are results relevant to patient?, 561–62 developmental research, 286–87, 287b open access journals, 84, 598–99, 598b
core questions, 558–62 epidemiology, 530–33 oral presentations, 607–8
critically appraised topic (CAT), historical research, 293–94 peer review, 605
568–71 normative studies, 287–88 poster presentation, 608–9, 609f
diagnostic accuracy studies, 564–65, Descriptive statistics, 318–32 reporting guidelines, 599–600, 600t,
564t frequency distribution, 318–21 602t, S38–1, S38–2
evidence-based practice, and, 562f, graphs, 320–21, 320t standards for reporting, 599–600, 600t
561–562 measures of central tendency, submitting the manuscript, 605–7
intervention studies, 562–64, 563t 322–24 supplementary materials, 605
is the study valid?, 559–60 measures of variability, 324–28 tracking metrics, 611
levels of evidence, 557–58, 558t normal distribution, 322, 328–31 types of submissions, 606, 610–11
meta-analysis, 591 Design validity, 210–26. See also Validity Distribution, 318
prognosis studies, 565–67, 566t construct validity, 215–17 frequency, 318, 319t
qualitative studies, 567–68, 567t controlling subject variability, 218–20 grouped frequency, 319–320
systematic review, 591 external validity, 217–18 normal, 328–31
Critically appraised topic (CAT), 568–71, intention to treat (ITT), 222–23 sampling, 335
570t internal validity, 212–15 skewed, 322, 322f
Critical region, 347, 347f missing data, 223–24 Distribution-based approach, 503
Critical value, 357 non-compliance, 220–22 Divergent validity, 133
Cronbach’s alpha, 497–99 overview, 211f, 211t Dose-response relationship, 272
6113_Index_659-672 03/12/19 11:02 am Page 662
662 Index
Factor analysis, 134, 135f, 468, 470–74, Greenhouse-Geisser correction, 376t, one-tailed test, 348
470t. See also Exploratory and 377 power (See Power)
Confirmatory factor analysis Grey literature, 70–71, 579, 580b p value, 341
Factorial design, 231–33 Grounded theory, 46, 303, 304f test statistic, 346
main effects, 232 Group centroid, 479 two-tailed test, 347
interaction effects, 232–33 Grouped frequency distribution, 319–20, Hypothetical-deductive theory, 45–46
randomized block design, 233 320f
Falsification, 99 Guiding questions, 39, 144 ICC. See Intraclass correlation coefficient
False negative, 510 Guttman scale, 167 (ICC)
False positive, 510 Guyatt’s responsiveness index (GRI), 504, ICF. See International Classification of
Familywise error rate (αFW), 383 504t Functioning, Disability and Health
Field notes, 305 Impact factor, 597
Filters, in literature search, 80 Harmonic mean, 387, S26–2 Implementation studies, 25–26, 66–67,
First-order partial correlation, 438 Hawthorne effect, 217b 67b
Fisher’s exact test, 421 Hazard ratio (HR), 483 Imputation, 223, 224
Fisher’s least significant difference Health measurement scales, 159–77 Inception cohort, 278
(LSD), 385t, S26–3 alternate forms, 164 Incidence, 532–33
Fixed effect, 419, 487 cumulative scales, 167–68 Inclusion criteria, 181–82, 551, 560,
Floor effect, 138 cut-off scores, 161, 519–523 562, 565, 578, 580, 585, 589, 592,
Flow diagram, 184f, 221f item response theory (IRT), 168. 593, 603
Focus group interviews, 307 See also Rasch analysis Independent t test. See t-test
Focusing, in literature search, 78 Likert scales, 162–64, S12–1 Independent variable, 34–35, 36, 36f
Follow-up study, 278. See also Cohort numerical rating scale (NRS), 165–66, independent factor, 220
study 166f levels, 36
Foreground question, 58, 60t Rasch analysis, 169–74 repeated measure, 220
Forest plot, 587–88, 588f short forms, 160 Index test, 510
Forward inclusion, 452 subscales, 162 Indirect costs, 554
Framingham Heart Study, 182b summative scales, 161–67 Inductive reasoning, 46
Frequency distribution, 318–21, target population, 160 Inferential statistics, 333–349
320f unidimensionality, 160 Informed consent, 92–98
Friedman two-way ANOVA, 410–12, visual analog scale (VAS), 164–67, and usual care, 98
412t 164f, 165f, 167f confidentiality, 93t, 97, 629
critical values table, chi-square, 620, Henrietta Lacks, 89b consent elements, 97–98
Appendix A Online Heterogeneity, in meta-analysis, 589 control group, 96
degrees of freedom, 411 Hierarchical regression, 453 disclosure, 96
effect size, 411 HIPAA, 90 elements, 93t
multiple comparison, 411, S27–1 Histogram, 320, 320f equipoise, 96b
null hypothesis, 411 Historical controls, 246 information elements, 93–97
F statistic. See Analysis of variance Historical research, 12f, 14t, 293–94 informed consent form, 93f, 94–95f, 98
Funnel plot, 589–90, 590f History, in internal validity, 212 lay language, 97
Hosmer-Lemeshow test, 459t, 460 placebo, 96b
Generalizability Homogeneity of slopes, 464 potential risks/benefits, 96–97
case reports, 292 Homogeneity of variance, 349, 353, 356, surveys, 156–57
reliability, 119, 119b 368 usual care, 98
single-subject design, 267–68 Levene’s test, 356, 368 “vulnerable” participants, 97–98
Gold standard, 130, 510 Hospital-based study, 281 withdrawing consent, 98
Goodness of fit Huynh-Feldt correction, 376t, 377 Institutional review board (IRB), 91–92,
known distribution, 417–19 Hypotheses 92f, 156
structural equation modeling (SEM), alternative, 338 Instrumentation effects, 213
476–77 complex, 39 Integrity. See Ethical principles
uniform distribution, 416–17 deductive, 45 Intention to treat (ITT), 211, 222–23
Google Scholar, 73, 73b, S6 directional/nondirectional, 38, 338–39 Interaction effects, 232, 371, 390–92
G*Power, 345, S23–34 null, 338–39 Interlibrary loan, 84
GRADE system, 590, 591t research, 38 Internal consistency, 122, 497–99, 498t
Grand mean, 367, 367f simple, 39 Cronbach’s alpha, 497
Grand theory, 50 statistical, 38, 338 item-to-total correlation, 497
Graphs, 320–321, 320t Hypothesis testing. See also Type I and Internal validity, 211t, 211f, 212–15, 584
box plot, 325–26, 325f, 326b Type II error attrition, 212–213
histogram, 320 alternative hypothesis (H1), 338 compensatory equivalization, 215
line plot (frequency polygon), 321 critical region, 347–48, 347f, 348f compensatory rivalry, 215
normal and skewed distribution, 322, disproving the null, 339 confounding variable, 212
322f errors in, 340–345 demoralization, 215
in presentations, 608 level of significance, 340–42 diffusion or imitation of treatments,
stem and leaf plot, 321 null hypothesis (H0), 338 214
history, 212 Known groups method, 132 Longitudinal studies, 274–76, 398
instrumentation, 213 Kolmogorov-Smirnov test of normality, developmental research, 286
maturation, 212 349b, 643 prospective study, 274–76
regression to the mean, 123, Kruskal-Wallis ANOVA, 405–7, 406t retrospective study, 276
213 critical values table, chi-square, 620,
ruling out the threats, 215 Appendix A Online Main effects, 232, 370, 371f, 390
selection, 213–14 degrees of freedom, 407 Manifest variable, 476
social threats, 214 effect size, 407 Mann-Whitney U test, 403–5, 404t
testing effect, 213 multiple comparison, 407, S27–1 critical values table, 613, S27–2t,
International Classification of null hypothesis, 405 Appendix A Online
Functioning, Disability and large samples, 403
Health (ICF), 6–8, 7f, S1 Last observation carried forward null hypothesis, 403
capacity and performance, 8 (LOCF), 224 MANOVA. See Multivariate analysis
contextual factors, 7 Latent trait, 43, 108, 168, 173, 469, 473, of variance (MANOVA)
Interprofessional research, 9–10 475 Marginal means, 370
Interquartile range, 325 Latin square, 235, 235f Matching, 219, 282
Inter-rater reliability, 121 Law, and theory, 50 Maturation, in internal validity, 212
Interrupted time series (ITS) design, Least squares method, 442 Mauchly’s test of sphericity, 376t, 377
242–44 Level of significance, 340–42 Maximum likelihood estimation, 457
Interval recording, 253 Levels of evidence, 62–65, 64t, 557–58, MCID. See Minimal clinically important
Interval scale, 111–12 558t difference
Interviewer bias, 282 qualitative research, 64–65 McNemar test, 424–26, 425t
Interviews, 142–43 Levels of measurement, 109–14, 110f degrees of freedom, 424
face-to-face, 142 interval scale, 111–12 effect size, 425
in qualitative research, 306–8 nominal scale, 109–10 null hypothesis, 424
structured/unstructured, 142–43, ordinal scale, 110–11 power and sample size, 426b, S28
306–7 ratio scale, 112 MDC. See Minimal detectable change
telephone, 143 Levene’s test, 356, 368 Mean, 323, 324f
Intraclass correlation coefficient (ICC), Likelihood ratios (LR), 516–17 Mean square (MS), 327, 368
486–90, 490t negative LR–, 517 Measurement, 105–77
ANOVA, 489 positive LR+, 516 of change. See Change
calculations, S32–1, S32–2 interpreting values, 517, 517f classical measurement theory, 115,
confidence intervals, 489 Likert scales, 162–64, S12–1 168
forms, 487 Limits of agreement (LoA), 132, indirect nature of, 107–8
models, 487–88 499–501. See also Bland-Altman levels of, 109–14, 110f
Intra-rater reliability, 121 plot precision, 107
IRB. See Institutional review board (IRB) Linear regression. See Regression scales. See Health measurement scales
I² statistic, 589 Linear trend, 396 Measurement error, 115–17
Item response theory (IRT), 168. See also Line of best fit, 441, 442 random, 116, 116f, 128f
Rasch analysis Line plot, 320f, 321 sources of, 116–17
Item-to-total correlation, 497–99 Listwise deletion, 224, 632 systematic, 116, 116f, 128f
ITS design. See Interrupted time series Literature review, 33, 59, 86, 144. Mechanistic reasoning, 64
(ITS) design See also Searching the literature Median, 322–23, 324f, 402, 482
ITT. See Intention to treat (ITT) Logical positivism, 297 Medical subject headings (MeSH),
Logistic regression, 456–61, 459t 77, 80t
Journal club, 571b adjusted odds ratio (aOR), 460 MEDLINE, 71–72, 72t, 579
Journal impact factor (JIF), 597, classification table, 458, 459t Member checking, 311
598b continuous variables, 460 Mendeley, 85
Justice, 91 Cox & Snell R Square, 458, 459t Merging, 313
Hosmer-Lemeshow test, 459t, MeSH. See Medical subject headings
Kaiser-Meyer-Olkin (KMO) measure, 460 Meta-analysis, 11, 12f, 61, 575t, 587–91,
469, 470t maximum likelihood estimation, S37–3. See also Systematic Reviews
Kaplan-Meier estimate, 482 457 advantages, 587
Kappa, 118, 494–96, 495t Nagelkerke R Square, 458, 459t critical appraisal of, 591, 591t, S37–2
weighted kappa, 496, S32–3 odds, 457–58, 458b forest plot, 587–88, 588f
Kendall’s tau-b, 434, 434t odds ratio, 457–58, 460 forest plot, 587–88, 588f
Keywords, 74–76, 579, 602, 609 probabilities, 458b, S30–2 funnel plot, 589–90, 590f
Knowing, ways of, 54, 56f propensity scores, 461 grading the evidence, 590, 591t
Knowledge translation (KT), 18, sample size, 461, S30–3 heterogeneity, 589
66 Wald statistic, 459t, 460 overview, 577f
Knowledge-to-action framework, 66, Logit scores, 169, 171f, 172t, sensitivity analysis, 589
67, 67b 173t standardized mean difference (SMD),
Known distribution, chi-square, 417–19, Log transformation, 635 587
418t Loansome Doc, 84 statistical heterogeneity, 589
Meta-theory, 50 trend analysis, 395–98, 396t, 397f N-of-1 trial, 12f, 259–60, 261b
Method error, 494, S32–4 unplanned, 384 Nominal scale, 109–10
Methodological studies, 11, 12f, 13t, 34, Tukey’s honestly significant difference Nomogram, 517, 518f, S33
124, 138–39 (HSD), 385t, 386–87 Non-compliance, 220–22
Middle-range theories, 49–50 Type I error rate, 383–84 Noncritical region, 347, 347f
Minimal clinically important difference Multiple imputation, 224 Noncurrent multiple baseline design,
(MCID), 136–38, 205–6, 501f, 503–6 Multiple regression, 446–49 256, S18–1
MCID proportion, 506 adjusted R², 444 null and alternative hypotheses, 339b
non-inferiority margin, 205–6 beta weights, 448 Nondirectional hypothesis, 38
Minimal detectable change (MDC), collinearity, 448–49, 448f Nonequivalent group designs, 244–47
123–24, 136–38, 137f, 501–2, 501f hierarchical, 453 Nonequivalent posttest-only control
MDC proportion, 506 null hypothesis, 446 group design, 246–47, 246f
Minimum significant difference (MSD), partial correlation, 448 Nonequivalent pretest-posttest control
385–86 prediction accuracy, 446 group design, 244–46
Friedman two-way ANOVA, S27–1 regression coefficients, 446, 448 Non-inferiority trial, 205–6, 206f, 339b
Kruskal-Wallis one-way ANOVA, S27–1 regression equation, 446 MCID, 205–206
Scheffé comparison, 390 stepwise, 449–53, 451t null and alternative hypotheses, 339b
SNK test, 388–89 tolerance, 448 non-inferiority margin, 205–6, 339b
Tukey’s HSD, 386–87 variance inflation factor (VIF), 448 non-inferiority margin, 205–6, 339b
Misclassification, 279–80 Multistage sampling, 188 regression
Missing data Multitrait-multimethod matrix Nonoverlapping scores, in single-subject
completer analysis, 223 (MTMM), 133–34 design, 267
data imputation, 223–24 Multivariate analysis, 468–85. See also Nonparametric tests, 113, 349, 400–414,
design validity, 223–24 Multiple regression and Logistic 400t
identifying missing values, 632 Regression binomial test, 409, 409t, S27–2t
last observation carried forward, 224 cluster analysis, 477–78 Friedman two-way ANOVA, 410–12,
per-protocol analysis, 222 confirmatory factor analysis (CFA), 412t
surveys, 155 474 Kruskal-Wallis ANOVA, 405–7, 406t
“missingness,” 223 defined, 468 Mann-Whitney U test, 403–5, 404t,
Mixed methods research, 14t, 312–14. exploratory factor analysis (EFA), 405t, 613, S27–2t
See also Qualitative Research 468–74 power-efficiency, 402, S27–3
coding and analysis, 313–14 multivariate analysis of variance ranking scores, 402–3, 402t
data collection, 313 (MANOVA), 478–81 sample size, 402
merging qualitative and quantitative structural equation modeling (SEM), sign test, 407–9, 408t
data, 313–14, 314f 474–77 Wilcoxon signed-ranks test, 408t,
sampling, 313 survival analysis, 481–83 409–10, 410t, 613, S27–2t
Mixed repeated measures design, 236 Multivariate analysis of variance Nonprobability sample, 151, 183, 186t,
Mode, 322, 324f (MANOVA), 478–81 188–90. See also Samples
Models, 44–45 discriminant analysis, 480–81, 481f Normal distribution, 328–31
Models of health and disability, 6, S1 follow-up tests, 480–81 area under normal curve, 328–29, 329f,
Multicollinearity. See Collinearity purpose, 478 330, 331f, 614–16
Multidimensional Health Locus of test statistics, 479–80 normality tests, 349b
Control scale, 162–164, 163f, S12–1 vectors, 479 standardized scores, 329–31, 330f, 331f
Multi-factor design, 228 My NCBI, 85 Normative research, 12f, 14t, 287–88
Multigroup pretest-posttest control Norm-referenced test, 135–36
group design, 229 National Research Act, 90 Null hypothesis (H0), 38, 338
Multiple baseline design, 255–57, S18–1 Natural history, 286 Number needed to harm (NNH),
Multiple comparison tests, 369, 383–99, Naturalistic inquiry, 302 542–44
385t, S26–3 Necessary exposure, 272b Number needed to treat (NNT), 541t,
Bonferroni correction, 384t Negative case analysis, 311 542, 542f, 543–44
complex contrasts, 395, S26–3 Negative likelihood ratio. See Likelihood Nuremberg Code, 88–89
factorial designs, 390–92 ratios
familywise error rate, 383 Negative predictive value. See Predictive Oblique rotation, 472b
minimum significant difference (MSD), value Observational research, 11, 12f, 13t, 34,
385 Newcastle-Ottawa Scale Quality 36–8, 37f, 271–84
planned, 394–98, S26–3 Assessment Form (NOS), 585 case-control studies, 280–82, 280f
post hoc, 384–90 Newman-Keuls test. See causation, 273–74, 274b
repeated measures, 392 Student-Newman-Keuls test cohort studies, 278–80, 278f
Scheffé comparison, 385t, 390, 391t NIH biosketch, 552, S35 confounding, 275
simple effects, 391–92, 393t NIH Roadmap, 17 cross-sectional studies, 276–78
studentized range, q, 376, 387t, 613, NNH. See Number needed to harm hypotheses, 38–39
Appendix A Online, S26–1t (NNH) longitudinal studies, 274–76
Student-Newman-Keuls (SNK) test, NNT. See Number needed to treat Newcastle-Ottawa Scale of Quality
385t, 387–89, 389t (NNT) Assessment (NOS), 585
prospective studies, 274–76, 275f PEDro, 72t, 73, 583–84, 583t Predictive value (PV), 513
retrospective studies, 275f, 276 Peer review, 70, 100, 605 negative PV–, 513
risk, 272, 533 Percent agreement, 494 positive PV+, 513
Observation bias, 282 Percentiles, 325 and prevalence, 513
Odds, 457, 458b Percent of nonoverlapping data (PND), Presentations, 607–10
Odds ratio (OR) 267 Pretest-posttest control group design,
adjusted odds ratio (aOR), 460 Per comparison error rate (αPC), 228–30
case-control studies, 536, 537t 383 Pretest probability, 515–16, 516f
chi-square, effect size, 421–22 Performance bias, 199, 214, 584 Prevalence, 513, 530–32
logistic regression, 458, 460 Per-protocol analysis, 222 Primary outcome, 25
McNemar test, effect size, 425 person-item map, 169, 169f, 171f Primary source, 71
power and sample size, 536, S30–3 Phases of clinical trials, 19–21, 20f, Principal axis factoring, 471
Omega squared, 369 201–204, 203f Principal components analysis (PCA),
One-group pretest-posttest design, 241 Phi coefficient, 423, 423t, 435, S29–2 468, S31–1
One-tailed test, 348, 357b Physiotherapy Evidence Database. Principal investigator (PI), 548–49
One-way ANOVA, 365–70, 366t See PEDro PRISMA, 581, 582f , S37–1
effect size, 369 PICO, 34, 35f, 37f, 58, 59, 59f, 60t, 73, PRISMA-ScR, 592, S37–1
homogeneity of variance, 368 75, 75f, 194 Privacy Rule, 90
partitioning the variance, 366–67, 367f Pillai-Bartlett trace, 479 Probability, 333–34, 341
power and sample size, 369, 369b Pilot testing, 124, 145 Probability sampling, 150, 185–188, 186t
One-way design, 228 Placebo, 96b, 198, 198b Prognosis studies, 565–67
One-way pretest-posttest design, 241–42, Plagiarism, 99–100 QUIPS, 585
242f Planned comparisons, 394–98 Propensity score matching, 282
One-way repeated measures design, Point biserial correlation coefficient, Propensity scores, 219, 461
233–36 435, S29–2 Proportions, testing, 415–16
Open access journals, 84, 598–99, 598b Point estimate, 336 Proposal. See Research proposal
Open-ended questions, 145–46 Polynomial regression, 454–56, 456t Proposition, 44
Open Grey, 71 Polytomous variables, 106 Prospective study, 274–76, 275f
Open-label trial, 199 Population, 150, 180 challenges, 275–76
Operational definition, 36, 216 accessible, 181 cohort study, 280f
Operators, arithmetic, 633 target, 160, 181 PTNow Article Search, 72t
Oral presentations, 607–8 Population-based study, 281 Publication bias, 71, 579
Order effects, 235 Positive likelihood ratio. See Likelihood PubMed, 72, 74, 78, 85
Ordinal scale, 110–11 ratio advanced search, 81f
Orthogonal rotation, 472b Positive predictive value. See Predictive clinical queries, 81, 82f
OT Search, 72t value limits and filters, 78f, 80, 81f
Outcomes research, 23–25, S2 Poster presentations, 608–9, 609f MeSH, 76f, 77, 80f
Outliers, 436–37, 436f Post hoc tests. See Multiple comparison My NCBI, 76f
OVID, 73 tests PMID, 79f
Posttest-only control group design, send to, 78f
Paired t-test. See t-test 230–31 tutorial, 76f
Pairwise deletion, 224, 632 Posttest probability, 515, 517–18, 518f, PubMed Central (PMC) repository, 72
Paradigm shift, 5 519b Purposive sampling, 186t, 189–90, 308–9,
Parallel forms reliability, 122 Power, 40, 102, 210, S24–34 311, 313
Parallel groups study, 193, 228 a priori sample size, 343–344 p value, 341
Parameters, 183, 318 beta (β), 342
Parametric statistics, 113, 348–49 defined, 342 Q-sort, 142, S11–2
Partial correlation, 437–38, 438f, 448 determinants, 342 QUADAS/QUADAS-2, 585
Partial eta squared, 373, 378 effect size index, 343 Quadratic trend, 396, 454
Participant observation, 305–6 post hoc power analysis, 344 Quadruple Aim, 3b
Path analysis, 476 underpowered trials, 102 Qualitative research, 11, 12f, 14t,
Path diagram, 476, 476f Power-efficiency, nonparametric tests, 297–316
Patient-Centered Outcomes Research 402 audit trail, 311
Institute (PCORI), 24 Practice-based evidence (PBE), 22 case study, 304–5
Patient-centered outcomes research Practice effects, 234 coding, 309
(PCOR), 24 Pragmatic clinical trial (PCT), 12f, confirmability, 310f, 312
Patient-oriented evidence that matters 13t, 19, 21–22, 200–201, 201t, content analysis, 309
(POEM), 23–24 202b constant comparison, 303
Patient-reported outcome measure pragmatic exploratory continuum credibility, 310, 310f
(PROM), 24 indicator summary (PRECIS), 200, critical appraisal, 567–68, 567t
Patient-reported outcome (PRO), 24 202b data analysis and interpretation,
Pearson product-moment coefficient Precision, 107 309–12
of correlation, 431–32. See also Preclinical research, 11, 201 data collection, 305–8
Correlation Predictive validity, 129t, 131–32 dependability, 310f, 311
Research question, 4, 29–41, 30f, 35f, 37f selection criteria, 181–82 medical subject headings (MeSH), 77,
clinical experience, 30–31 snowball sampling, 186t, 190 79, 80t
clinical theory, 31 stratified random sampling, 186t, 187, MEDLINE, 71–72, 72t
framing the question, 34–37 187f meta-analysis. See Meta-analysis
good research questions, 39–40 systematic sampling, 186–87, 186t My NCBI, 85
objectives and hypotheses, 38 target population, 181, 181f open access journals, 84
professional literature, 31 theoretical sampling, 309 primary sources, 71
research problem, 29–32 Sample size, 183, 343 PubMed, 72, 72t, 74–82
research rationale, 32–33 chi-square, 421, 424b, S28 reference lists, 83
theory, 49 exploratory factor analysis (EFA), 474 refining the search, 80–83
Research report, 602–605. See also logistic regression, 461, S30–3 saving searches, 85
Disseminating research McNemar test, 425, 426b, S28 saving searches, 85
Residuals, 172, 442, 444, 445 nonparametric tests, 402, S27–3 scoping review, 575t, 592–93
standardized, 419, 445 one-way ANOVA, 369, 369b, S25–2 search engines, 72t, 73, 73b
Respect for persons, 90 Pearson correlation coefficient, 432, secondary sources, 71
Responsiveness, 136, 501. See also Change 433b, S29–1 sensitivity and specificity, 81
Retractions, 101 power, and, 343 systematic review. See Systematic
Retrospective study, 275f, 276 a priori analysis, 343 review
case-control study, 280f qualitative research, 308–9 truncation, 75
challenges, 276 regression analysis, 449, 449b, S30–3 wildcards, 75
cohort study, 280f relative risk/odds ratio, 536, S34 Secondary analysis, 276b
secondary analysis, 276b repeated measures ANOVA, 378, S25–2 Secondary outcome, 25
Reverse causation, 277 surveys, 152 Secondary source, 71
Review of literature, 33, 59, 86, 144. two-way ANOVA, 373, 375b, S25–2 Selection
See also Searching the literature t test, 360–363, 362b, 363b, S24–2 case-control study, 281–82
Risk Sample variance, 327 cohort study, 279
absolute, 534 Sampling bias, 185 experimental design, 228
analytic epidemiology, 533–39 Sampling distribution, 335 internal validity, 213–14
observational research, 272 Sampling error, 151, 183–85, 334–35 research question, 29
relative, 534–36, 540, 541t Saturation, 303 sampling, 181–82
Risk-benefit ratio, 91 Scales. See Health measurement scales surveys, 150–52
Risk factor, 272, 533–34, 536, 538 Scalogram analysis, 168 systematic review, 578–79
Risk reduction, 540–42 Scatter plot, 428, 429f Selection bias, 281–82, 584
ROC curve, 519–23, 522t Scheffé comparison, 385t, 390, 391t, S26–3 Self-report, 143, 159, 160, 164
area under the curve (AUC), 521 minimum significant difference, 390 SEM. See Standard error of
cutoff scores, 519–523 Scientific method, 3–4 measurement; Structural equation
Youden index, 521 Scoping review, 11, 12f, 575t, 592–93, modeling
Rotated factor matrix, 470t, 472–73 S37–6 Sensitivity, 131b, 511, 512t
Roy’s largest characteristic root, 479 SCOPUS, 579, S6 in literature search, 81
Run-in period, 195t, 197 Screening accuracy, 131b Sensitivity analysis, 589
Searching the literature, 70–87. See also Sequential clinical trials, 236–38, S16
Samples and sampling, 180–91 Literature review Serial dependency, 264–65
accessible population, 181, 181f abstracts, 83–84 Sham, 198
area probability sampling, 188 automatic term mapping, 78 Shapiro-Wilk test of normality, 349b, 651
cluster sampling, 186t, 187–88, 188f Boolean logic, 75–76, 77f Sidak-Bonferroni procedure, 385t, S26–3
consecutive sampling, 189 CINAHL, 72t, 72–73 Significance, clinical vs. statistical,
convenience (accidental) sampling, citation management applications, 85 344–45, 345b
186t, 189 cited reference searching, 83 Sign test, 407–9, 408t
disproportional sampling, 186t, 188, clinical queries, 81, 82f Simple effects, 391, 393t
S13–2 Cochrane Database of Systematic Simple linear regression. See Linear
flow diagram, 183, 184f, 221f Reviews, 72t, 73 regression
inclusion/exclusion criteria, 181–82 databases, 71–73, 72t, S6 Simple random assignment, 195–96,
mixed methods research, 313 e-mail alerts, 85 195t
multistage sampling, 188, 188f expanding search strategies, 83 Simple random sampling, 185–86, 186t
nonprobability sampling, 186t, 188–90 exploding, 78–79 Simpson’s paradox, 539b
probability sampling, 185–88, 186t filters, 80 Single-blind study, 199
purposive sampling, 186t, 189–90, focusing, 79 Single-factor design, 228
308–9, 313 full text, 84–85 Single-subject design (SSD), 12f, 13t,
quota sampling, 186t, 189 grey literature, 70–71, 579–80, 249–70
random-digit dialing, 188 580b A-B-A-B design, 255, 256f
random sampling (selection), 183, interlibrary loan, 84 A-B-A design, 254, 255f
185–86, 186t, S13–1 journal and author searching, 83 A-B design, 251, 251f, 254
sampling bias, 185 keyword searching, 74–76 alternating treatment design, 257–58,
sampling error, 183–85 limits, 80 258f
baseline data, 251–52, 252f, 256b Standardized scores, 329 Studentized range, 386
binomial test, 264, S18–3t Statistic, 183, 318 critical values table, 613, S26–1,
changing criterion design, 259, S18–1 Statistical conclusion validity, 210–11, Appendix A Online
C statistic, 265, S18–5 211f, 211t Student-Newman-Keuls (SNK) test,
design phases, 250–53 Statistical hypothesis, 38 385t, 387–89, 389t
effect size, 267 Statistical hypothesis testing. See comparison intervals, 387, 388f
generalization, 267–68 Hypothesis testing critical values table, studentized range,
level, 260–62, 262f Statistical inference, 333–50 387, 613, S26–1t, Appendix A Online
multiple baseline designs, 255–57, confidence intervals, 336–38. See also minimum significant difference, 388
257f, S18–1 Confidence intervals Subject-to-variables (STV) ratio, 474
multiple treatment design, 258–59 hypothesis testing. See Hypothesis Sufficient exposure, 274b
N-of-1 trial, 259–60, 261b testing Summative scales, 161–67
nonoverlapping scores, 267 nonparametric tests, 349 Sum of squares (SS), 327, 327t, 367, 368,
replication of effects, 254, 268 parametric statistics, 348–49 S25–1
serial dependency, 264–65 probability, 333–34 Superiority trial, 205, 206f, 339b
slope, 263 sampling error, 334–35 Survey questions, See also Surveys
social validity, 268 Statistical packages, 629–30, 630t branching, 148
split middle line, 263–64, 263f, S18–2 Statistical power. See Power check-all-that-apply, 147
statistical process control (SPC), Statistical process control, S18–4. See also checklists, 147
265–67, 266f, S18–4 Single-subject designs closed-ended questions, 146
target behavior, 253–54 common cause and special cause dichotomous questions, 146
trend, 262–63, 262f variation, 266 double-barreled questions, 149
two standard deviation band method, control charts, 266 frequency and time measures, 149
265, 265f upper and lower control limits, 266, multiple choice questions, 146
visual data analysis, 260–63 266f open-ended questions, 145–46
withdrawal designs, 254–55, 255f, 256f Statistical significance vs. clinical purposeful language, 148–49
Skewed distributions, 322, 322f, 323 significance, 344–45 rank-order questions, 148
SnNout, 514f Statistical symbols, 654 sensitive questions, 149–50
Snowball sampling, 186t, 190 Statistical tables, 613–21 writing good questions, 148–50
Social threats, to internal validity, 214 area under normal curve, 614–16t Surveys, 141–58
Spearman rank correlation coefficient, binomial test, 613, S18–3t, S27–2t, analysis of data, 154–55
433–34, 434–35, 434t, Appendix A Online bias, 143, 149
critical values table, 613, S29–1t, chi-square distribution, 620t contacting respondents, 152–54
Appendix A Online F statistic, 618t cover letter, 153–54, 153f
Special cause variation, 266 Mann-Whitney U test, 613, S27–2t, cross-tabulations, 155
Specificity, 131, 511, 512t Appendix A Online demographics, 150
in literature search, 81 Pearson correlations (r), 619t descriptive, 141, 288–289
Sphericity, 377 Spearman’s rank correlation ethical issues, 156–57
Bartlett’s test, 470t, 471 coefficient, 613, S29–1, S29–1t, expert review, 145
Mauchly’s test, 376t, 377 Appendix A Online hypothesis testing, 141, 144
SPIDER, 59 studentized range statistic, 613, S26–1t, incentives, 152
Split-half reliability, 122 Appendix A Online interviews, 142–43
Split middle line, 263–64, 263f, S18–2 table of random numbers, 621t missing data, 155
Split-plot ANOVA, 378 t test, 617t planning, 143–45
SPORTDiscus, 72t, S6 Wilcoxon signed-ranks test, 613, response rate, 151–52
SpPin, 514f S27–2t, Appendix A Online sample size, 152, 152t
Square root transformation, 635 Stem-and-leaf plot, 320f, 321 sampling error, 151
Square transformation, 635 Stepwise multiple regression, 449–53, selecting a sample, 150–52
SRM. See Standardized response mean 451–52t. See also Multiple regression self-report, 143, 160
Standard deviation, 327–28, 329f, S22 forward or backward inclusion, 452–53 summarizing data, 155
Standard error hierarchical regression, 453 type of questions. See Survey questions
of measurement (SEM), 118, 120, step wise inclusion, 450 Survival analysis, 481–83
492–94, 493t Stratified random assignment, 195t, censored observations, 482
of the difference between the means, 196 Cox proportional hazards model,
354 Stratified random sampling, 186t, 187, 483
of difference scores, 360 187f hazard ratio (HR), 483
of the estimate (SEE), 444 Structural equation modeling (SEM), Kaplan-Meier estimate, 482
of the mean, 335, 335f 474–77 survival curve, 482f
Standardized mean difference (SMD), direct and indirect effects, 476, S31–2 Syllogism, 45, 46b
360, 587 endogenous variable, 476 Synthesizing literature, 11, 574–96
Standardized normal curve, 329–30 exogenous variable, 476 See also Meta-analysis and
Standardized residuals, 172, 419, 445 goodness of fit, 476–77 Systematic Reviews
Standardized response mean (SRM), interpretation of, 477 scoping review, 575t, 592–93
503–4, 504t path diagram, 475–76, 476f types of evidence synthesis, 575t
Systematic error, 116, 116f implementation studies, 25–26 Unplanned comparisons, 384
Systematic review, 11, 12f, 60, 71, outcomes research, 23–25 U test. See Mann-Whitney U test
575–87, 575t, S37–3 qualitative research, 299
AMSTAR-2, 591, S37–2 translation blocks, 19–21, 20f Validity, 127–40, 501–506
Cochrane risk of bias tool, 584, 584f translation continuum, 18–21 ceiling effect, 138
clinical practice guidelines, 590 translation gap, 17–18 change, and, 136–38, 501–506, 501f
critical appraisal of, 591, S37–2 Treatment arms, 193 concurrent, 129t, 131
data extraction, 580–81, S37–4 Tree plot, 587 construct, 129f, 129t, 132–34, 215–17,
data synthesis, 585, 586t Trend, in single-subject design, 262–63 211f, 211t, S10–2
methodological quality, 581–85 Trend analysis, 395–98, 396t, 397f content, 129–30, 129f, 129t, S10–1
overview, 577f Trial and error, 54, 56f convergent, 129t, 133
PEDro scale, 583–84, 583t Triangulation, 311 criterion-related, 129f, 129t, 130–31
PRISMA flow diagram, 581, 582f, S37–1 Triple Aim, 3b cross-sectional validity study, 277–78
purpose, 575–76 True negative, 510 cultural validity, 160
research question, 33, 576–77 True positive, 510 defined, 127
scoping review, 592–93 Truncation, 75 diagnostic accuracy, of, 509–14
screening references, 580 t-test, 351–64 discriminant, 129t, 133
search strategy, 579–80 alternative hypothesis, 353 external, 217–18, 211f, 211t
selection criteria, 578–79 confidence intervals, 357–358, 358b, face, 130
types of, 575t 359–60 floor effect, 138
Systematic sampling, 186–87, 186t critical values table, 617t, Appendix A internal, 212–215, 211f, 211t
Online methodological studies, 138–39
Table of random numbers, 186, 621 degrees of freedom, 354, 360 predictive, 129t, 131–32
Target behavior, in single-subject design, effect size, 360–63, 362f reliability, compared, 127–28
253 equal variances, 354–58 social, 268
Target population, 160, 181 error variance, 351 statistical conclusion, 210–211, 211f,
Testing effect, 120, 213 homogeneity of variance, 353 211t
Test-retest reliability, 119–20. See also independent t-test. See unpaired strategies to maximize validity,
Reliability misuse/inappropriate use, 363 138–39
Theoretical sampling, 309 null hypothesis, 353, 360 Variability, 324–28, 325t
Theory, 42–52 paired, 359–363, 361t coefficient of variation (CV), 328
characteristics, 47 power and sample size, 360–63, 362b, interquartile range, 325
clinical, 31 363b, S24–2 percentiles, 325
components, 43–44 unequal variances, 358–59, S24–1 quartiles, 325
constructs, 43 unpaired, 353–59, 355t, 356f, 359t range, 324
deductive, 45–46 Tukey’s honestly significant difference standard deviation, 327–28, 329f, 327t
defined, 42 (HSD), 385t, 386–87, 388t standardized scores, 329, 329f
development of, 45–46 critical values table, studentized range, variance, 326–27, 327t
grand, 50 387, 613, S26–1, Appendix A Online Variables
grounded, 46, 303, 304f minimum significant difference, 386 active and attribute, 194
hypothetical-deductive, 45–46 Two-factor repeated measures design, alphanumeric, 631
inductive, 46 378 blocking, 233
laws and, 50 Two standard deviation band method, categorical, 415, 419, 424, 453
meta-, 50 265 construct, 108
middle-range, 49–50 Two-tailed test, 347, 357b continuous, 107
models, 44 Two-way ANOVA, 370–73 dependent, 35, 440, 446, 457
practical applications, 48–49 degrees of freedom, 371 dichotomous, 106
propositions, 44 effect size, 373 discrete, 107
purposes, 42–43 interaction, 371–373, 372t, 374t extraneous, 211, 214, 219, 220, 536
scope, 49–50 partitioning the variance, 370–71 grouping, 632
testing, 46–47 power and sample size, 373, 375b, independent, 34, 440, 449, 457
“tomato effect”, 47b S25–2 latent, 43,108, 168, 173, 469, 473,
Time series design, 241–44 Two-way factorial design, 232–33, 475
interrupted time series (ITS) design, 370 outcome, 440
242–44 Two-way repeated measures designs, polytomous, 106
one-group pretest-posttest design, 241 236 response, 440
repeated measures design, 241–42, 242f Type I error, 340–342, 340f string, 631
Tolerance level, 448 error rate in multiple comparisons, Variance, 117, 326–27, S22
Transformation of data, 445–46, 383–84 homogeneity of, 353, 368, 377
635–36, 635t, 636t Type II error, 342–346, 340f Variance inflation factor (VIF), 448
Translational research, 6, 10, 17–28, 20f Varimax rotation, 472
basic research, 19, 20f Uniform distribution, chi-square, VAS. See Visual analog scale
defined, 6, 17 416–17, 417t Vectors, 479
effectiveness research, 21–22 Unpaired t-test. See t-test Venn diagram, 75, 77f
Visual analog scale (VAS), 147–48, 164–67, 164f, 165t, 167f
  anchors, 165
  variations, 165–66
Vital statistics, 533

Wait list control group, 198
Wald statistic, 459t, 460
Washout period, 235, 235f
Web of Science, S6
Weighted kappa, 496, S32–3
Wilcoxon rank sum test, 400t, 404t
Wilcoxon signed-ranks test, 408t, 409–10, 613
  critical values table, S27–2t, Appendix A Online
  degrees of freedom, large samples, 410
  null hypothesis, 410
  power and sample size, 410, S27–3
Wildcard, 75
Wilk’s lambda, 480
Withdrawal designs, 244, 254–55
Within-subject design, 228, 250
Writing
  for dissemination, 606
  in proposals, 554

Yates’ continuity correction, 421
Youden index, 521

Zelen design, 197
zero-order correlation, 438
z-score, 329, 329f, 346
  critical values table, 614–616t, Appendix A Online