
Lesson 1: Clinical Trials as Research

Introduction

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

Recognize a research objective that could be met through a clinical trial

Recognize a research objective that would be difficult to meet through a
clinical trial

Discuss the relative contributions of clinical judgment and clinical trials
in evaluating new medical therapies.

Apply six characteristics of statistical reasoning to a research objective
in a clinical trial setting.

1.1 - What is the role of statistics in clinical research?

Clinical research involves investigating proposed medical treatments,
assessing the relative benefits of competing therapies, and establishing
optimal treatment combinations. Clinical research attempts to answer
questions such as "Should a man with prostate cancer undergo radical
prostatectomy or radiation, or watchfully wait?" and "Is the incidence of
serious adverse effects among patients receiving a new pain-relieving therapy
greater than the incidence of serious adverse effects in patients receiving
the standard therapy?"
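
A question like the second one comes down to comparing two proportions. The
following is a minimal sketch of that comparison, assuming a simple
large-sample z test; the function name and the counts are hypothetical and
chosen only for illustration, not taken from any trial.

```python
import math

def compare_incidence(events_new, n_new, events_std, n_std):
    """Return (risk difference, z statistic, two-sided p-value) for two proportions."""
    p_new = events_new / n_new
    p_std = events_std / n_std
    # Pooled proportion under the null hypothesis of equal incidence
    p_pool = (events_new + events_std) / (n_new + n_std)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_std))
    z = (p_new - p_std) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_new - p_std, z, p_value

# Hypothetical counts: 30 of 500 patients with a serious adverse effect on the
# new therapy versus 18 of 500 on the standard therapy
diff, z, p = compare_incidence(30, 500, 18, 500)
print(f"risk difference = {diff:.3f}, z = {z:.2f}, p = {p:.3f}")
```

The point of the sketch is only that the observed difference is judged
against the variability expected by chance, not by inspection alone.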

Before the widespread use of experimental trials, clinicians attempted to
answer such questions by generalizing from the experiences of individual
patients to the population at large. Clinical judgement and reasoning were
applied to reports of interesting cases. The concepts of variability among
individuals and its sources were not formally addressed.

As the field of statistics, "the theoretical science or formal study of the
inferential process, especially the planning and analysis of experiments,
surveys, and observational studies" (Piantadosi 2005), has developed in the
twentieth century, clinical research has utilized statistical methods to provide
formal accounting for sources of variability in patients' responses to treatment.
The use of statistics allows clinical researchers to draw reasonable and
accurate inferences from collected information and to make sound decisions
in the presence of uncertainty. Mastery of statistical concepts can prevent
numerous errors and biases in medical research.

Statistical reasoning is characterized by the following:

1. Establishing an objective framework for conducting an investigation

2. Placing data and theory on an equal scientific footing

3. Designing data production through experimentation

4. Quantifying the influence of chance

5. Estimating systematic and random effects

6. Combining theory and data using formal methods

(Piantadosi, 2005)

Carter, Scheaffer, and Marks (1986) stated that:

"Statistics is unique among academic disciplines in that statistical thought is
needed at every stage of virtually all research investigations including
planning the study, selecting the sample, managing the data, and interpreting
the results."

Clinical and statistical reasoning are both crucial to progress in medicine.
Clinical researchers must generalize from the few to the many and combine
empirical evidence with theory. In both medical and statistical sciences,
empirical knowledge is generated from observations and data. Medical theory
is based upon established biology and hypotheses. Statistical theory is
derived from mathematical and probabilistic models (Piantadosi 2005). To
establish a hypothesis requires both a theoretical basis in biology and
statistical support for the hypothesis, based on the observed data and the
theoretical statistical model.

What constitutes a clinical trial?

An experiment is a series of observations made under conditions controlled
by the scientist.

A clinical trial is an experiment testing medical treatments on human
subjects. The clinical investigator controls factors that contribute to variability
and bias such as the selection of subjects, application of the treatment,
evaluation of outcome, and methods of analysis. The distinction of a clinical
trial from other types of medical studies is the experimental nature of the trial
and its occurrence in humans.

Design is the process or structure that isolates the factors of interest.
Although the researcher designs a trial to control variability due to factors
other than the treatment of interest, there is inherently larger variability in
research involving humans than in a controlled laboratory situation.

The term clinical trial is preferred over clinical experiment because the
latter may connote disrespect for the value of human life.

In what contexts are clinical trials used?

Clinical trials are used to develop and test interventions in nearly all areas of
medicine and public health. In many countries, approval for marketing new
drugs hinges on efficacy and safety results from clinical trials. Similar
requirements exist for the marketing of vaccines. The U.S. Food and Drug
Administration (FDA) now requires manufacturers of new or high-risk medical
devices to provide data demonstrating clinical safety and effectiveness (Scott
2004). Surgical interventions pose unique challenges since surgical
approaches are typically undertaken for patients with a good prognosis and
may not be amenable to randomization or masking investigators and patients
to the intervention, all conditions which can lead to biases. Clinical trials are
useful for demonstrating efficacy and safety of various medical therapies,
preventative measures and diagnostic procedures if the treatment can be
applied uniformly and potential biases controlled.

In addition to testing novel therapies, clinical trials frequently are used
to confirm findings from earlier studies. When the results of a study are
surprising or contradict biological theory, a confirmatory trial may follow.
Medical practice generally does not change based upon the results of one
study. Design flaws, methodological errors, problems with study conduct, or
analysis and reporting mistakes can render a clinical trial suspect. Hence,
confirmation of results in a replicative study, or a trial extending the use of the
therapy to a different population, is often warranted.

Clinical trials are time-consuming, labor-intensive, and expensive and require
the cooperative effort of physicians, patients, nurses, data managers,
methodologists, and statisticians. Patient recruitment can be difficult. Some
multi-center (across institutions) clinical trials cost up to hundreds of millions
of dollars and take five years or more to complete. Prevention trials, conducted
in healthy subjects to determine if treatments prevent the onset of disease,
are important but the most cumbersome, lengthy, and expensive to conduct.

Many studies have a window of opportunity during which they are most
feasible and will have the greatest impact on clinical practice. For comparative
trials, the window usually exists relatively early in the development of a new
therapy. If the treatment becomes widely accepted or discounted based on
anecdotal experience, it may become impossible to formally test the efficacy
of the procedure. Even when clinicians remain unconvinced of efficacy or
relative safety, patient recruitment can become problematic.

Some important medical advances have been made without the formal
methods of controlled clinical trials, i.e., without randomization, statistical
design, and analysis. Examples include the use of vitamins, insulin, some
antibiotics, and some vaccines.

Piantadosi (2005) gives the following requirements for a study based on a
non-experimental comparative design to provide valid and convincing
evidence:

1. The treatment of interest must occur naturally

2. The study subjects have to provide valid observations for the biological
question

3. The natural history of the disease with standard therapy, or in the
absence of therapy, must be known

4. The effect of the treatment must be large enough to overshadow
random error and bias

5. Evidence of efficacy must be consistent with biological knowledge

Examples of non-experimental designs that can yield convincing evidence of
treatment efficacy can be found among epidemiological studies,
historically controlled trials, and data mining.

1.2 - Summary
In this first lesson, Clinical Trials as Research, we learned to:

Recognize research objectives that can be met through a clinical trial

Recognize research objectives that would be difficult to meet through a
clinical trial

Evaluate the relative contributions of clinical judgment and clinical trials
in evaluating new medical therapies.

Recognize six characteristics of statistical reasoning that can be applied
to a clinical trial setting.


Lesson 2: Ethics of Clinical Trials


Introduction

Ethics of Clinical Trials

A well-designed clinical trial should answer important public health
questions without impairing the welfare of participants. Ethical
obligations to trial participants and to science and medicine pertain to all
stages of a clinical trial: design, conduct, and reporting of results
(Friedman, Furberg, DeMets, Reboussin, and Granger, 2015, Chapter 2).

Learning objectives & outcomes


Upon completion of this lesson, you should be able to:

State at least 3 internationally recognized conditions that are necessary
to justify conducting a medical experiment in humans.

Define the condition that is required to justify randomizing a patient to a
treatment.

Differentiate between ethical and unethical use of placebo control.

Recognize requirements for IRB approval of human research studies
involving any product regulated by the U.S. FDA.

Recognize U.S. requirements for investigators to report financial
interests that may affect design, conduct or reporting of research
regulated by the U.S. Public Health Service.

References:

Annas GJ and Grodin MA. (1992) The Nazi doctors and the Nuremberg Code:
Human Rights in Human Experimentation. New York: Oxford University Press.

Carter RL, Scheaffer RL, Marks RG. (1986) The role of consulting units in
statistics departments. Am. Stat. 40:260-264.

Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB. (2015)
Chapter 2: Ethical Issues. In: Friedman LM, Furberg CD, DeMets D,
Reboussin DM, Granger CB. Fundamentals of Clinical Trials. 5th ed.
Switzerland: Springer International Publishing. (Notes will refer to Friedman
et al 2015)

Piantadosi S. (2005) Clinical trials as research, Why clinical trials are
ethical, Contexts for clinical trials. In: Piantadosi S. Clinical Trials: A
Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

Scott PE. (2004) Medical device approvals: An assessment of the level of
evidence. PhD dissertation: Johns Hopkins University, Baltimore, MD.

2.1 - Requirements of Investigators


Physician-Patient Relationship

One source of ethical dilemmas for physicians and health care workers is
the conflict between the roles of helping the patient and gaining scientific
knowledge, as stated by Schafer (1982):

"In his traditional role of healer, the physician's commitment is exclusively to
his patient. By contrast, in his modern role of scientific investigator, the
physician engaged in medical research or experimentation has a commitment
to promote the acquisition of scientific knowledge."

When properly set, designed, and conducted, a clinical trial is an ethically
appropriate way to acquire new knowledge. Clinical decisions for treatment
that are based on weak or anecdotal evidence, opinion, or dogma, without the
evidence of rigorous scientific support, raise their own ethical questions.

It is not easy to distinguish between clinical research and clinical practice.
How often is a physician certain of the outcome from a specific therapy for a
particular patient? If the patient's reaction is predictable, applying the
treatment would be described as practice. In the cases when the physician is
unsure of the outcome, applying the treatment could be considered research.
Many actions by the physician for the benefit of individual patients have the
potential of increasing scientific knowledge. Analogously, scientific knowledge
gained from research can be of benefit to individual patients. Ethical questions
arise when unproven therapies are proposed to replace proven ones and are
particularly acute for chronic or fatal illnesses.

Clinical trials are only one of several settings in which the physician's duty
extends beyond his responsibility to the individual patient. For example,
vaccinations against communicable disease are promoted by physicians, yet
the individual vaccinated incurs a small risk to benefit the population as a
whole. Triage is another situation where, for the sake of maximizing benefit to
the whole, the needs of an individual may not be met.

The American Medical Association has a code of professional ethics that
includes the obligation of a physician to conduct medical research: "A
physician shall continue to study, apply and advance scientific knowledge,
maintain a commitment to medical education, make relevant information
available to patients, colleagues and the public" (AMA, 2001). There is also
a statement of the physician's responsibility to the patient as paramount.
Both obligations must be considered when the health professional considers
entering a patient in a clinical trial.

Responsible Conduct of Research Training:

All clinical investigators should have training in research ethics. The US NIH
website has resources [1] for training in the areas of scientific integrity, data,
publication, peer review, mentor/trainee relationships, collaboration, human
and animal subjects, and conflict of interest. Many funding sources, including
the US NIH and NSF, require responsible conduct of research training for all
students, trainees, fellows, scholars and faculty utilizing their funds to conduct
research.

Conflict of Interest:

A crucial component of ethical medical research is investigator objectivity.
Investigators are required by the US Public Health Service, the umbrella
organization for the FDA and the NIH, to report any significant financial
interest that would appear to affect the design, conduct, or reporting of
research. (See Guidance [2] from FDA)

Competence:

Investigators should be competent technically (evidenced by education,
certification, experience) and humanistically (showing compassion and
empathy).

2.2 - Historical Perspective


The current requirements for investigators and considerations in conducting
ethical research should be understood within the context of past abuses,
some horrific.

The 1946-1947 Nuremberg trials brought notoriety to the numerous atrocities
in WWII concentration camps committed by Nazi physicians under the guise
of experimentation. Of the 23 individuals tried for such crimes at Nuremberg,
20 were physicians. Sixteen of the 23 were convicted and given sentences
ranging from imprisonment to death. (Annas and Grodin, 1992)

At the time of the Nuremberg trial, there were no international standards for
ethical conduct in human experimentation. The trial resulted in the Nuremberg
Code, or directives for human experimentation, adopted in 1947:

1. Voluntary consent of the human subject is essential.

2. There must be no reasonable alternative to conducting the experiment.

3. The anticipated results must have a basis in biological knowledge and
animal experimentation such that the experiment has potential for
yielding fruitful results for the good of society.

4. The procedures should avoid unnecessary physical and mental
suffering and injury.

5. There is no expectation for death or disability as a result of the trial.

6. The degree of risk for the patient should not exceed the humanitarian
importance of the problem to be solved.

7. The subjects should be protected against even a remote possibility of
death or injury.

8. The study must be conducted by qualified scientists, with a high degree
of skill and care throughout the experiment.

9. The subject can stop participation at will.

10. The investigator has an obligation to terminate the experiment if
injury, disability or death of the subject seems likely.

The World Medical Association (WMA) held a meeting in 1964 in Helsinki,
Finland, and adopted a formal code of ethics for physicians engaged in clinical
research. The Declaration of Helsinki [3] reiterates the principles of the
Nuremberg Code, with particular attention to the duty of the physician to
protect the life, health and dignity of the human subject. The Declaration
states that research involving human subjects must be formulated in a written
protocol which has been reviewed by an ethical review committee distinct
from the investigator, must conform to generally accepted scientific principles
and should include written informed consent from the participants. Negative
as well as positive results should be disseminated and funding sources
disclosed.

The Declaration has been revised regularly by the WMA. Considerable
discussion has ensued regarding the statements on the ethics of placebo
control and the duty of the study planners to provide post-study access for all
study participants to what is regarded as beneficial treatment. The 2008
revision affirms the WMA position on the primacy of the patient and outlines
required consents for research on human material, such as blood, tissues,
and DNA, and human data as well as requiring clinical trials to be registered in
a publicly accessible database.

One well-known example of abuse of ethical research in the USA began in
1932, when the US Public Health Service started a study of untreated syphilis
in Tuskegee, Alabama (399 men with advanced disease and 201 controls).
The study continued long after the availability of penicillin, a proven cure, in
the 1950s. The study was stopped in the early 1970s after it was publicized
and became an embarrassment to the country. In response to the Tuskegee
Syphilis Study, the US Congress established the National Commission for the
Protection of Human Subjects of Biomedical and Behavioral Research
through the 1974 National Research Act. This Commission produced
the Belmont Report in 1979, which distilled basic ethical guidelines in
research with human subjects to three principles: respect for persons or
individual autonomy, beneficence, and justice.

Respect for persons (individual autonomy) means that patients have the
right to decide what should be done for them with respect to their illness
unless the result would be clearly detrimental to others. Respect for persons
means that potential subjects for clinical trials are informed of alternative
therapies and risks and benefits of participation in a particular trial before they
volunteer to participate in that study. Since clinical trials often require
participants to surrender some measure of autonomy in order to be
randomized to treatment and follow the established protocol, these aspects
will be described to the potential subject, along with their freedom to choose to
discontinue the study at any time.

Beneficence reflects the patient's right to receive advantageous or favorable
treatment. Investigators are obliged to make practical and useful assessments
of the risks and benefits involved in research, which necessitates resolving the
potential conflict between risk to participants and benefit to future patients.
The beneficence obligation extends to both the particular subjects in a study
and to the research endeavor.

Justice addresses the question of fairly distributing the benefits and burdens
of research. Compensation for injury due to research is an application of
justice. Injustice occurs when benefits are denied without good reason or
when burdens are unduly imposed on particular individuals, such as the poor
or uninsured.

These principles are applied in the requirements for informed consent of
subjects, in assessment of risks and benefits and fair procedures and
outcomes, and in the selection of subjects.

Other international guidelines have been proposed. The World Health
Organization (WHO) and the Council for International Organizations of
Medical Sciences (CIOMS) issued a document entitled International Ethical
Guidelines for Biomedical Research Involving Human Subjects [4], with its
latest revision published in 2002. UNESCO set forth a Universal Declaration
on Bioethics and Human Rights [5] in 2005.

Timeline of Laws Related to the Protection of Human Subjects [6]

2.3 - IRB and Informed Consent


IRB

The U.S. National Institutes of Health Policies for the Protection of Human
Subjects (1966) established the IRB (Institutional Review Board) as a
mechanism for the protection of human participants in research. In 1981, U.S.
regulations required IRB approval for all drugs or products regulated by the
US Food and Drug Administration (FDA), without regard to the funding source,
the research volunteers, or the location of the study. In 1991, core US DHHS
regulations (45 CFR Part 46, Subpart A) were adopted by most Departments
and Agencies involved in research with human subjects. This Federal Policy
for the Protection of Human Subjects, known as the "Common Rule" [7],
requires Institutional Review Board (IRB) review for all research
funded in whole or in part by the U.S. federal government.

An IRB may approve research in human subjects that meets specific
prerequisites set forth by the FDA
(http://www.fda.gov/oc/ohrt/irbs/review.html) [8]:

1. The risks to the study participants are minimized

2. The risks are reasonable in relation to the anticipated benefits

3. The selection of study participants is equitable

4. Informed consent is obtained and appropriately documented for each
participant

5. There are adequate provisions for monitoring data collected to ensure
the safety of the study participants

6. The privacy of the participants and the confidentiality of the data are
protected

Every accredited medical research organization maintains an IRB. An
investigator must submit his/her research plan, called the protocol, and the
informed consent form to the IRB for approval prior to the conduct of the
study. The IRB also requires the investigator to provide the following:

1. Progress reports on an annual basis

2. Reports of any serious adverse events in the human subjects when they
occur

Informed Consent:

The principle of respect for persons implies that each study participant will be
made aware of potential risks, benefits and costs prior to participating in a
clinical study. To document this, study participants (or parents/legally
authorized representatives) sign an informed consent document prior to
participation in a research study. By signing, the patient affirms having been
informed of the potential risks and benefits resulting from their participation
in the clinical study, understanding their treatment alternatives, and knowing
that their participation is voluntary. There are numerous examples of studies
in which patients have been exposed to potentially or definitively harmful
treatments without being fully apprised of the risk. The consent document
should be presented without any coercion. Even so, ill or dying patients and
their families are vulnerable, and it is questionable how much technical
information about new treatments they can truly understand, especially when
it is presented to them quickly.

In the United States and many other countries, an IRB must evaluate and
approve the informed consent documents prior to beginning a study.

The informed consent must describe explicitly the following information:

1. The research nature of the study

2. The reasonably foreseeable risks and discomfort

3. The potential benefits and alternatives

4. Procedures for maintaining privacy

5. Treatment for injuries incurred

6. Individuals to contact for questions

7. The voluntary nature of the study and the possibility of withdrawal at any
time

8. Not entering the study does not lead to loss of benefits

Research in emergency settings in which informed consent is not possible is a
special situation. The FDA and the National Institutes of Health (NIH) have
offered guidance [9] for research in emergency medical situations.

2.4 - Planning and Design


Applying ethical considerations in the planning and design phase of a study
requires optimal study design, weighing the balance of risk versus benefit,
consideration of patient confidentiality, and plans for impartial oversight of
informed consent procedures. All involved in clinical research have a
responsibility to promote high-quality clinical trials in order to provide evidence
to guide medical decisions (Friedman et al 2015).

Study Question:

The question addressed by a clinical trial should be important enough to
justify possible adverse effects of the treatment that will be administered. A
study design that cannot answer the biological question is unethical. Studies
that pose unimportant questions are unethical as well, even if they pose
minimal risk. What marketing question would have enough benefit to justify
risks to subjects? What about situations where there is an approved and
available therapy?

The study question also involves the choice of the study population. Which
population can answer the question? Does the potential benefit outweigh the
risk to these study subjects? Is the selection just?

The ethical considerations of respect for persons, beneficence, and justice
imply three requirements for the conduct of research, namely, informed consent,
disclosure of the risks and benefits, and the appropriate selection of research
subjects. Applying ethical principles also requires optimal study design, a
balance of risk and benefit for study participants, consideration of patient
privacy, and impartial oversight of consent procedures.

Study Sites:

The choice of the study population is also related to the place(s) where the
trial will be conducted. There is greater generalizability and potentially faster
enrollment if a trial is conducted in multiple and varied geographic locales.
The concept of justice should be applied: is the location selected due to
prevalent disease and relevance of the results? Or for sponsor conveniences
such as lower cost and fewer administrative and regulatory burdens? Is the
standard of care in this country less than optimal care and thus event rates
higher? What obligations do the trial sponsors have to the participants or to
residents of the country once the trial is complete? Will the treatment be
available in this locale once the trial is complete?

Randomization:

Some physicians and patients feel that it is inappropriate to base a patient's
treatment on chance, as is done in a randomized clinical trial. Some feel that
the physician is obligated to have a preference, even when the evidence does
not favor any particular treatment. Randomization is justified when there is
relative ignorance (collectively) about the best treatment. The situation in
which there is genuine uncertainty as to the best available therapy is
called equipoise. (Piantadosi 2005)

Patients and physicians with firm preferences for treating a particular disease,
even those based on weak evidence, should not participate in a clinical trial
involving that disease. Patients with strong convictions about preferred
treatments are likely to become easily dissatisfied with randomization to a
treatment in the clinical trial. Physicians with strong convictions could bias the
clinical trial in a different direction, especially if they are not blinded to
treatment assignment.

Although most often individual subjects are randomized to treatment, in some
situations, the unit of randomization is a larger entity, such as a hospital or
community. How would individuals consent to such research?
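
The difference between randomizing individuals and randomizing larger units
can be sketched in a few lines. This is an illustration only: the function
names, arm labels, and hospital names are hypothetical, and real trials
typically use more elaborate schemes (for example, blocking or
stratification).

```python
import random

def randomize_individuals(subject_ids, arms=("new", "standard"), seed=None):
    """Assign each subject to an arm independently, with equal probability."""
    rng = random.Random(seed)
    return {sid: rng.choice(arms) for sid in subject_ids}

def randomize_clusters(cluster_members, arms=("new", "standard"), seed=None):
    """Assign whole clusters (e.g., hospitals) to an arm; every member inherits it."""
    rng = random.Random(seed)
    assignment = {}
    for cluster, members in cluster_members.items():
        arm = rng.choice(arms)  # one random draw per cluster, not per person
        for sid in members:
            assignment[sid] = arm
    return assignment

# Hypothetical hospitals and patient identifiers
hospitals = {"hospital_A": ["a1", "a2", "a3"], "hospital_B": ["b1", "b2"]}
by_cluster = randomize_clusters(hospitals, seed=1)
by_person = randomize_individuals(["a1", "a2", "a3", "b1", "b2"], seed=1)
print(by_cluster)
print(by_person)
```

Under cluster randomization, a person's treatment is determined by where they
receive care rather than by a draw made for them individually, which is what
makes individual consent to the assignment itself difficult.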

Control Group:

Inclusion of a placebo group in a comparative trial can yield a straightforward
comparison with the experimental treatment to determine if the treatment is
safe and effective. Patients assigned to placebo may receive a facsimile of the
active therapy without the knowledge of whether or not the active ingredient is
present; thus, any observed effect is considered a result of the active agent
and not the process of being treated.

Assigning patients to a placebo treatment, however, is not always ethical.
Placebo control is untenable when the disease is life-threatening and an
effective therapy is available. A better approach when there is an effective and
available therapy and/or the condition is life-threatening is to use the standard
accepted therapy as an active control treatment. Comparison is made
between active control and the experimental therapy. Another option, if the
new therapy can be given in combination with standard therapy, is for all
subjects to receive standard therapy, with randomization to new therapy plus
standard or placebo plus standard.

Should the new intervention be compared with the best known therapy or with
placebo? Will a placebo control result in significant harm to subjects? What if
there is no accepted optimal therapy? What if the optimal therapy is very
costly or not available in some locations? The selection of the control group
has many ethical considerations.

Confidentiality:

The U.S. Department of Health and Human Services (HHS) issued the
Standards for Privacy of Individually Identifiable Health Information (the
Privacy Rule) under the Health Insurance Portability and Accountability Act of
1996 (HIPAA) to provide the first comprehensive Federal protection for
the privacy of personal health information. This became effective on April
14, 2003.

While certain provisions of the Rule specifically concern research and may
affect research activities, the Privacy Rule recognizes that the research
community has legitimate needs to use, access, and disclose Protected
Health Information (PHI) to carry out a wide range of health research
protocols and projects. The Privacy Rule protects the privacy of such
information while providing ways in which researchers can access and use
PHI when necessary to conduct research. The DHHS web site
(http://privacyruleandresearch.nih.gov/ [10]) should be examined for further
information about HIPAA requirements on research.

2.5 - Conduct
Recruitment:

Although it is vital to enroll enough subjects to answer the study questions
adequately, recruitment must also follow ethical norms. Coercion should be
avoided. For this reason, any financial compensation for the subject's time
and travel should reflect actual expenses or small amounts that would not
entice a person to enroll in the study for financial gain. In addition, study
personnel other than the primary investigator or the patient's doctor may be
designated to ask for informed consent.

Monitoring:

During the course of a comparative trial, evidence may become available that
one treatment is superior. Interim statistical analyses may be incorporated
into the study design to provide periodic investigations of treatment superiority
prior to study completion without sacrificing the statistical integrity of the trial
(discussed later in this course). Should patients receiving an inferior treatment
continue in this manner? If there is evidence that a particular type of patient is
unlikely to respond to therapy, should entrance criteria be modified? Is the
adverse experience profile markedly worse for one therapy? Investigators are
required to report such circumstances to their IRB.
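
Why interim looks have to be planned as part of the design, rather than added
ad hoc, can be illustrated with a small simulation. The sketch below is an
illustration under stated assumptions, not a method from these notes: it
simulates two-arm trials with no true treatment difference, applies an
unadjusted two-sided 0.05 test at each of several equally spaced looks, and
counts how often at least one look appears "significant". The function name,
sample sizes, and number of looks are arbitrary choices.

```python
import math
import random

def false_positive_rate(n_looks, n_per_look=20, n_trials=2000, z_crit=1.96, seed=0):
    """Estimate P(any look has |z| > z_crit) when both arms share one distribution."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        sum_a = sum_b = 0.0
        n = 0
        for _ in range(n_looks):
            # Accrue n_per_look new patients per arm; outcomes are N(0, 1) in both arms
            for _ in range(n_per_look):
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
            n += n_per_look
            # z statistic for the difference in means, assuming known unit variance
            z = (sum_a / n - sum_b / n) / math.sqrt(2.0 / n)
            if abs(z) > z_crit:
                rejections += 1
                break  # the trial would stop at the first "significant" look
    return rejections / n_trials

single_look = false_positive_rate(n_looks=1)
five_looks = false_positive_rate(n_looks=5)
print(f"one look: {single_look:.3f}, five unadjusted looks: {five_looks:.3f}")
```

Repeated unadjusted testing raises the chance of a false positive well above
the nominal level. Group sequential designs address this by using stricter
thresholds at each look so that the overall error rate stays at the planned
level.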

Most multi-center clinical trials involve an independent board of scientists to
monitor the trial results and render decisions as to whether the trial should
continue or be modified in some manner. Safety monitoring is required.
These committees have various names, such as Data and Safety Monitoring
Board, External Advisory Committee, etc.

Early Termination for other than scientific or safety reasons:

Related to planning, studies should only be conducted if resources are
adequate to complete the study. Early termination for reasons other than
science or safety reflects a lack of ethical concern. Subjects agreed to
participate so an important question could be answered. There should be an
answer.

Data Integrity:

Data falsification must not be tolerated in any manner. Central statistical
monitoring and other procedures may help detect potential fraud. See George
and Buyse (2015), Data Fraud in Clinical Trials [11].
2.6 - Reporting
Reporting:

Investigators should report trial results completely and in a timely manner.
Registration of trials on clinicaltrials.gov and publishing associated results
online can reduce publication bias, the bias resulting from journals favoring
studies with significant results. Investigators should publish or find other
mechanisms to disseminate results effectively.

Authorship:

'Ghost authorship' occurs when people writing a paper are not fully disclosed
(e.g., a draft written by a contract writer). A related problem, often called
honorary or guest authorship, occurs when authors are included who did not
actually participate in the research project (for example, an influential name).
Journals combat such deception by asking authors to specify the contribution
of each person listed as an author.

2.7 - Statistical Ethics


The American Statistical Association and the Royal Statistical Society have
published similar guidelines for the conduct of their members:

1. Maintain professional competence and keep abreast of developments

2. Have constant regard for human rights

3. Present findings and interpretations honestly and objectively

4. Avoid untrue, deceptive, or undocumented statements

5. Disclose financial or other interests that may affect or appear to affect
professional statements

6. Seek to advance public knowledge and understanding

7. Encourage and support fellow members in their professional
development

In terms of data collection in clinical trials, statisticians should do the following:


1. Collect only the data needed for the purposes of the inquiry

2. Inform each participant about the nature and sponsorship of the project
and intended uses of the data

3. Establish the intentions and ability of the sponsor to protect
confidentiality

4. Inform participants of the strengths and limitations of confidentiality
protections

5. Process the data collected according to the intentions and remove
participant-identifying information

6. Ensure that confidentiality is maintained when data are transferred to
other persons or organizations

In terms of dealing with sponsors/clients, statisticians should do the following:

1. Clarify their qualifications to undertake the inquiry

2. Reveal any factors that may conflict with impartiality

3. Accept no contingency fee arrangements

4. Apply statistical methods without regard for a desirable outcome

5. Outline alternate statistical methods along with the chosen methods

6. Maintain confidentiality with regard to other sponsors/clients

ASA Ethical Guidelines for Statistical Practice [12]

2.8 - Summary
In this second lesson, Ethics of Clinical Trials, we learned:

Three internationally recognized conditions necessary to justify conducting a
medical experiment in humans.

The condition that is required to justify randomizing a patient to a
treatment.

To differentiate between ethical and unethical use of placebo control.

Requirements for IRB approval of human research studies involving any
product regulated by the U.S. FDA.

U.S. requirements for investigators to report financial interests that may
affect design, conduct or reporting of research regulated by the U.S.
Public Health Service.

Lesson 3: Clinical Trial Designs


Introduction

Experimental design originated in agricultural research and influenced
laboratory and industrial research before being applied to trials of
pharmaceuticals in humans. Experimental design is characterized by control
of the experimental process to reduce experimental error, replication of the
experiment to estimate variability in the response, and randomization. For
example, in comparing the yields of two varieties of corn, the experimenter
uses the same type of corn planter and the same fertilizer and weed control
methods in each test plot. Multiple plots of ground are planted with the two
varieties of corn. The assignment of a seed variety to a test plot is
randomized.

Clinical trial design has its roots in classical experimental design, yet has
some different features. The clinical investigator is not able to control as many
sources of variability through design as a laboratory or industrial experimenter.
Human responses to medical treatments display greater variability than
observations from experiments in genetically identical plants and animals or
measurements of the effects of tightly controlled physical and chemical
processes. And of course, ethical issues are paramount in clinical research. To
study a clinical response with adequate precision, a trial may require lengthy
periods for patient accrual and follow-up. It is unlikely that all the study
subjects will be enrolled on the same day. There is opportunity for study
volunteers to decide to no longer participate.

Each of these issues will be considered as we extend classical experimental
design to clinical trials.

Let's get started!


Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

1. State 6 general objectives that will be met with proper trial design.

2. Name at least 6 sources of potential bias in clinical studies.

3. Suggest design strategies to reduce bias, variability and placebo
effects in a proposed clinical study.

4. Compare and contrast the following study designs with respect to the
ability of the investigator to minimize bias: Case report or case series,
database analysis, prospective cohort study, case-control study, parallel
design clinical trial, crossover clinical trial.

5. Identify the experimental unit in a proposed study.

6. Differentiate between Phase I - IV trials.

7. Recognize features that should be described in a written protocol for a
clinical trial.

8. Recognize confounding in a clinical study proposal.

9. Identify characteristics and purposes of translational studies.

3.1 - Clinical Trial Design


Good trial design and conduct are far more important than selecting the
correct statistical analysis. When a trial is well designed and properly
conducted, statistical analyses can be performed, modified, and if necessary,
corrected. On the other hand, inaccuracy (bias) and imprecision (large
variability) in estimating treatment effects, the two major shortcomings of
poorly designed and conducted trials, cannot be ameliorated after the trial.
Skillful statistical analysis cannot overcome basic design flaws.

Piantadosi (2005) lists the following advantages of proper design:

1. Allows investigators to satisfy ethical constraints

2. Permits efficient use of scarce resources


3. Isolates the treatment effect of interest from confounders

4. Controls precision

5. Reduces selection bias and observer bias

6. Minimizes and quantifies random error or uncertainty

7. Simplifies and validates the analysis

8. Increases the external validity of the trial

The objective of most clinical trials is to estimate the magnitude of treatment
effects or estimate differences in treatment effects. Precise statements about
observed treatment effects are dependent on a study design that allows the
treatment effect to be sorted out from person-to-person variability in response.
An accurate estimate requires a study design that minimizes bias.

Piantadosi (2005) states that clinical trial design should accomplish the
following:

1. Quantify and reduce errors due to chance

2. Reduce or eliminate bias

3. Yield clinically relevant estimates of effects and precision

4. Be simple in design and analysis

5. Provide a high degree of credibility, reproducibility, and external validity

6. Influence future clinical practice

3.2 - Controlled Clinical Trials Compared to Observational Studies
Medical research, as a scientific investigation, is based on careful
observation and theory. Theory directs the observation and provides a basis
for interpreting the results. The strength of the evidence from a clinical study
increases with the degree to which bias and variability were controlled when the
study was conducted, as well as with the magnitude of the observed effect.
Clinical studies can be characterized as uncontrolled observations,
observational comparative studies, and controlled clinical trials.

Case reports and case-series are uncontrolled observational studies.

A case report only demonstrates that a clinical event of interest is possible. In
a case report, there is no control of treatment assignment, endpoint
ascertainment, or confounders. There is no control group for the sake of
comparison. The report is descriptive in nature, not a formal statistical
analysis.

Case reports are useful in generating hypotheses for future testing. For
example, a physician may report that a patient in his practice, who was taking
a specific anorexic drug, developed primary pulmonary hypertension (PPH), a
rare condition that occurs in 1-2 out of every million Americans. Is this
convincing evidence that the anorexic drug causes PPH?

A case series carries more weight than a single case report, but cannot prove
efficacy of a treatment. Case series and case reports are susceptible to large
selection biases. Consider the example of laetrile, an apricot pit extract that
was reputed to cure cancer. Seven case series were reported; the strength of
evidence from these studies has been summarized by the US National Cancer
Institute (NCI [1]). While a proportion of patients may have experienced
spontaneous remission of cancer, rigorous testing in controlled environments
was never performed. After an estimated 70,000 patients had been treated, the
NCI undertook a retrospective analysis of laetrile, only to decide that no
definite conclusions supporting anti-cancer activity could be made (Ellison
1978 abstract [2]). The Cochrane review on laetrile [3] (2015) states that
there is no reliable evidence for the alleged effects of laetrile or amygdalin
for curative effects in cancer patients. Based on a series of reported cases,
many believed laetrile would cure their cancer, perhaps refusing other
effective treatments and subjecting themselves to the adverse effects of
cyanide. This continued for many years, with the anti-tumor efficacy of
laetrile unsupported while the associated adverse effects were coming to light.

A database analysis is similar to a case series, but may have a control
group, depending on the data source. The source and quality of the data used
for this secondary analysis are key. If the analysis attempts to evaluate
treatment differences from data in which treatment assignment was based on
physician and patient discretion, nonrandomized and open-label, bias is likely.

Databases are best used to study patterns with exploratory statistical
analyses. For example, the NIH sponsored a database analysis of interstitial
cystitis (IC) during the 1990s. This consisted of data from over 400 individuals
with IC who underwent various and numerous therapies for their condition.
The objective of the database analysis was to determine if there were patterns
of treatments that may be effective in treating the disease (Rovner et al.
2000 [4]).

As another example, in the case of genomic research, specific data mining
tools have been developed to search for patterns in large databases of
genetic data, leading to the discovery of particular candidate genes.

An epidemiologic study is often a case-control or a cohort design, both
comparative observational studies. An observational study lacks the key
component of an experiment, namely, control over treatment assignment.
Commonly these designs are used in assessing the influence of risk factors
for a disease. Subjects meeting entrance criteria may have been identified
through a database search. The choice of the control group is a crucial design
component in observational studies.

In a case-control study, the investigator identifies cases (subjects with the
disease) and controls (subjects without the disease) and retrospectively
assesses some type of treatment or exposure. Because the investigator has
selected the cases and controls, relative risk cannot be calculated directly
from a case-control study.
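
As a minimal sketch of why this is so: the numbers of cases and controls are fixed by the investigator, so the proportion of exposed subjects who are diseased does not estimate risk. The odds of exposure can still be compared between cases and controls, which gives an odds ratio. The counts below are hypothetical, chosen only for illustration, and the function name is our own.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio of exposure, cases versus controls, from a 2x2 table."""
    odds_in_cases = exposed_cases / unexposed_cases
    odds_in_controls = exposed_controls / unexposed_controls
    return odds_in_cases / odds_in_controls


# Hypothetical counts (not from any real study):
# 36 cases, 12 of them exposed; 72 controls, 8 of them exposed.
or_original = odds_ratio(12, 24, 8, 64)

# Sampling twice as many controls (same exposure proportion) leaves the
# odds ratio unchanged, but changes the apparent "risk" among the exposed.
or_more_controls = odds_ratio(12, 24, 16, 128)
apparent_risk_original = 12 / (12 + 8)
apparent_risk_more_controls = 12 / (12 + 16)

print("odds ratios:", or_original, or_more_controls)
print("apparent risks:", apparent_risk_original, apparent_risk_more_controls)
```

The apparent risk among exposed subjects moves with the investigator's choice of how many controls to sample, while the odds ratio does not, which is why the odds ratio is the usual summary for this design.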

In addition, levels of treatment or exposure may be recorded based on a
subject's recall of events that occurred many years previously. Thus, recall bias
(systematic differences in accuracy or completeness of recall) can affect the
study results.
In a prospective cohort study, individuals are followed forward in time with
subsequent evaluations to determine which individuals develop into cases. The
relationship of specific risk factors that were measured at baseline with the subsequent
outcome is assessed. The cohort study may consist of one or more samples with
particular risk factors, called cohorts. It is possible to control some sources of bias in
a prospective cohort study by following standard procedures in collecting data
and ascertaining endpoints. Since the subjects are not assigned risk factors in
a randomized manner, however, there may remain covariates that are
confounded with a risk factor. Sometimes, a particular treatment group (or
groups) from a randomized trial is followed as a cohort, providing a cohort in
which the treatment was assigned at random.

Prospective studies tend to have fewer design problems and less bias than
retrospective studies, but they are more expensive with respect to time and
cost.

An example of a case-control study: A cardiologist identifies 36 patients
currently in his practice with a specific form of cardiac valve disease. He
identifies another group of relatively healthy patients and matches two of them
to each of the patients with cardiac valve disease according to age (±5 years)
and BMI (±2.5). He plans to interview all 36 + 72 = 108 patients to assess
their use of diet drugs during the past ten years.

A classic example of a cohort study: U.S. National Heart Lung and Blood
Institute Framingham Heart Study [5]

Piantadosi (2005) lists the following conditions for convincing
non-experimental comparative studies:

1. The treatment of interest occurs naturally.

2. The study subjects provide valid observations for the biological
question.

3. The natural history of the disease with standard therapy, or in the
absence of therapy, is known.

4. The effect of the treatment is large enough to overshadow random error
and bias.

5. Evidence of efficacy is consistent with biological knowledge.

A controlled clinical trial contains all of the key components of a true
experimental design. Treatments are assigned by design; administration of
treatment and endpoint ascertainment follow a protocol. When properly
designed and conducted, especially with the use of randomization and
masking, the controlled clinical trial instills confidence that bias has been
minimized. Replication of a controlled clinical trial, if congruent with the results
of the first clinical trial, provides verification.

3.3 - Experimental Design Terminology


In experimental design terminology, the "experimental unit" is randomized to
the treatment regimen and receives the treatment directly. The "observational
unit" has measurements taken on it. In most clinical trials, the experimental
units and the observational units are one and the same, namely, the individual
patient.

One exception to this is a community intervention trial in which
communities, e.g., geographic regions, are randomized to treatments. For
example, communities (experimental units) might be randomized to receive
different formulations of a vaccine, whereas the effects are measured directly
on the subjects (observational units) within the communities. The advantages
here are strictly logistical: it is simply easier to implement in this fashion.
Another example occurs in reproductive toxicology experiments in which
female rodents (experimental units) are exposed to a treatment but
measurements are taken on the pups (observational units).

In experimental design terminology, factors are variables that are controlled
and varied during the course of the experiment. For example, treatment is a
factor in a clinical trial with experimental units randomized to treatment.
Another example is pressure and temperature as factors in a chemical
experiment.

Most clinical trials are structured as one-way designs, i.e., only one factor,
treatment, with a few levels.

Temperature and pressure in the chemical experiment are two factors that
comprise a two-way design in which it is of interest to examine various
combinations of temperature and pressure. Some clinical trials may have
a two-way factorial design, such as in oncology where various combinations
of doses of two chemotherapeutic agents comprise the treatments.
An incomplete factorial design may be useful if it is inappropriate to assign
subjects to some of the possible treatment combinations, such as no
treatment (double placebo). We will study factorial designs in a later lesson.

A parallel design refers to a study in which patients are randomized to a
treatment and remain on that treatment throughout the course of the trial. This
is a typical design. In contrast, with a crossover design patients are
randomized to a sequence of treatments and they cross over from one
treatment to another during the course of the trial. Each treatment occurs in a
time period with a washout period in between. Crossover designs are of
interest since, with each patient serving as their own control, there is potential
for reduced variability. However, there are potential problems with this type of
design. There should be investigation into possible carryover effects, i.e., the
residual effects of the previous treatment affecting a subject's response in the
later treatment period. In addition, only conditions that are likely to be similar
in both treatment periods are amenable to crossover designs. Acute health
problems that do not recur are not well-suited for a crossover study. We will
study crossover design in a later lesson.

Randomization is used to remove systematic error (bias) and to justify Type I
error probabilities in experiments. Randomization is recognized as an
essential feature of clinical trials for removing selection bias.

Selection bias occurs when a physician decides treatment assignment and
systematically selects a certain type of patient for a particular treatment.
Suppose the trial consists of an experimental therapy and a placebo. If the
physician assigns the healthier patients to the experimental therapy and the
less healthy patients to the placebo, the study could result in an invalid
conclusion that the experimental therapy is very effective.
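
A small simulation can make this concrete. The sketch below is illustrative only: the health scores, noise levels, and sample sizes are made-up assumptions, and the therapy is deliberately given no true effect, so any observed difference under physician-selected assignment comes from the selection alone.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed only so the sketch is reproducible

# Hypothetical baseline health scores for 200 patients (higher = healthier).
baseline = [rng.gauss(50, 10) for _ in range(200)]


def observed_outcome(health):
    """Outcome depends only on baseline health: the therapy has no true effect."""
    return health + rng.gauss(0, 5)


def mean_difference(therapy_group, placebo_group):
    """Mean outcome on therapy minus mean outcome on placebo."""
    therapy_mean = statistics.mean([observed_outcome(h) for h in therapy_group])
    placebo_mean = statistics.mean([observed_outcome(h) for h in placebo_group])
    return therapy_mean - placebo_mean


# Physician-selected assignment: the healthier half receives the therapy.
ranked = sorted(baseline)
biased_diff = mean_difference(ranked[100:], ranked[:100])

# Randomized assignment: a random half receives the therapy.
shuffled = baseline[:]
rng.shuffle(shuffled)
randomized_diff = mean_difference(shuffled[100:], shuffled[:100])

print(f"physician-selected assignment: {biased_diff:.1f}")
print(f"randomized assignment:         {randomized_diff:.1f}")
```

Under these assumptions the physician-selected comparison shows a large apparent benefit for an inert therapy, while the randomized comparison stays close to zero.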

Blocking and stratification are used to control unwanted variation. For
example, suppose a clinical trial is structured to compare treatments A and B in
patients between the ages of 18 and 65. Suppose that the younger patients
tend to be healthier. It would be prudent to account for this in the design by
stratifying with respect to age. One way to achieve this is to construct age
groups of 18-30, 31-50, and 51-65 and to randomize patients to treatment
within each age group.

Age        Treatment A    Treatment B
18 - 30        12             13
31 - 50        23             23
51 - 65         6              7

It is not necessary to have the same number of patients within each age
stratum. We do, however, want to have balance in the number on each
treatment within each age group. This is accomplished by blocking, in this
case, within the age strata. Blocking is a restriction of the randomization
process that results in a balance of numbers of patients on each treatment after
a prescribed number of randomizations. For example, blocks of 4 within these
age strata would mean that after 4, 8, 12, etc. patients in a particular age
group had entered the study, the numbers assigned to each treatment within
that stratum would be equal.
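
The blocking idea can be sketched in a few lines of code. This is a minimal illustration of permuted-block randomization within strata, not a production randomization system: the function name, the seed, and the stratum sizes (taken as the row totals of the table above) are assumptions for the example.

```python
import random


def permuted_block_schedule(n_patients, block_size=4, treatments=("A", "B"), rng=None):
    """Return a treatment list built from randomly permuted, balanced blocks."""
    if block_size % len(treatments) != 0:
        raise ValueError("block_size must be a multiple of the number of treatments")
    if rng is None:
        rng = random.Random()
    per_arm = block_size // len(treatments)
    schedule = []
    while len(schedule) < n_patients:
        block = list(treatments) * per_arm  # balanced block, e.g. A, B, A, B
        rng.shuffle(block)                  # random order within the block
        schedule.extend(block)
    return schedule[:n_patients]            # the last block may be incomplete


# One independent schedule per age stratum.
rng = random.Random(2024)  # fixed seed only so the sketch is reproducible
strata = {"18-30": 25, "31-50": 46, "51-65": 13}
schedules = {age: permuted_block_schedule(n, rng=rng) for age, n in strata.items()}

for age, schedule in schedules.items():
    print(age, "A:", schedule.count("A"), "B:", schedule.count("B"))
```

With blocks of 4 and two treatments, the arms are exactly balanced after every completed block, and the imbalance within a stratum can never exceed 2 even if enrollment stops partway through a block.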

If the numbers are large enough within a stratum, a planned subgroup
analysis may be performed. In the example, the smaller numbers of patients
in the upper and lower age groups would require care in the analyses of these
sub-groups specifically. However, with the primary question as the effect of
treatment regardless of age, the pooled data in which each sub-group is
represented in a balanced fashion would be utilized for the main analysis.

Even ineffective treatments can appear beneficial in some patients. This may
be due to random fluctuations, or variability in the disease. If, however, the
improvement is due to the patient's expectation of a positive response, this is
called a "placebo effect". This is especially problematic when the outcome is
subjective, such as pain or symptom assessment. Placebo effect is widely
recognized and must be accounted for in any clinical trial. For example, rather than
constructing a nonrandomized trial in which all patients receive an
experimental therapy, it is better to randomize patients to receive either the
experimental therapy or a placebo. A true placebo is an inert or inactive
treatment that mimics the route of administration of the real treatment, e.g., a
sugar pill.
Placebos are not acceptable ethically in many situations, e.g., in surgical
trials. (Although there have been instances where 'sham' surgical procedures
took place as the 'placebo' control.) When an accepted treatment already
exists for a serious illness such as cancer, the control must be an active
treatment. In other situations, a true placebo is not physically possible to
attain. For example, a few trials investigating dimethyl sulfoxide (DMSO) for
providing muscle pain relief were conducted in the 1970s and 1980s. DMSO
is rubbed onto the area of muscle pain, but leaves a garlicky taste in the
mouth, so it was difficult to develop a placebo.

Treatment masking or blinding is an effective way to ensure objectivity of
the person measuring the outcome variables. Masking is especially important
when the measurements are subjective or based on self-assessment.
Double-masked trials refer to studies in which both investigators and patients
are masked to the treatment. Single-masked trials refer to the situation when
only patients are masked. In some studies, statisticians are masked to
treatment assignment when performing the initial statistical analyses, i.e., not
knowing which group received the treatment and which is the control until
analyses have been completed. Even a safety-monitoring committee may be
masked to the identity of treatment A or B, until there is an observed trend or
difference that should evoke a response from the monitors. In executing a
masked trial, great care must be taken to keep the treatment allocation
schedule securely hidden from all except those with a need to know which
medications are active and which are placebo. This could be limited to the
producers of the study medications, and possibly the safety monitoring board
before study completion. There is always a provision for breaking the blind
for a particular patient in an emergency situation.

As with placebos, masking, although highly desirable, is not always possible.
For example, one could not mask a surgeon to the procedure he is to perform.
Even so, some have gone to great lengths to achieve masking. For example,
a few trials with cardiac pacemakers have consisted of every eligible patient
undergoing a surgical procedure to be implanted with the device. The device
was "turned on" in patients randomized to the treatment group and "turned off"
in patients randomized to the control group. The surgeon was not aware of
which devices would be activated.

Investigators often underestimate the importance of masking as a design
feature. This is because they believe that biases are small in relation to the
magnitude of the treatment effects (when the converse usually is true), or that
they can compensate for their prejudice and subjectivity.

Confounding is the effect of other relevant factors on the outcome that may
be incorrectly attributed to the difference between study groups.

Here is an example: An investigator plans to assign 10 patients to treatment
and 10 patients to control. There will be a one-week follow-up on each patient.
The first 10 patients will be assigned treatment on March 01 and the next 10
patients will be assigned control on March 15. The investigator may observe a
significant difference between treatment and control, but is it due to different
environmental conditions between early March and mid-March? The obvious
way to correct this would be to randomize 5 patients to treatment and 5
patients to control on March 01, followed by another 5 patients to treatment
and another 5 patients to control on March 15.

Validity

A trial is said to possess internal validity if the observed difference in
outcome between the study groups is real and not due to bias, chance, or
confounding. Randomized, placebo-controlled, double-blinded clinical trials
have high levels of internal validity.

External validity in a human trial refers to how well study results can be
generalized to a broader population. External validity is irrelevant if internal
validity is low. External validity in randomized clinical trials is enhanced by
using broad eligibility criteria when recruiting patients.

Large simple and pragmatic trials emphasize external validity. A large simple trial
attempts to discover small advantages of a treatment that is expected to be used in a
large population. Large numbers of subjects are enrolled in a study with simplified
design and management. There is an implicit assumption that the treatment effect is
similar for all subjects with the simplified data collection. In a similar vein,
a pragmatic trial emphasizes the effect of a treatment in practices outside academic
medical centers and involves a broad range of clinical practices.

Studies of equivalency and noninferiority have different objectives than the usual trial
which is designed to demonstrate superiority of a new treatment to a control. A study to
demonstrate non-inferiority aims to show that a new treatment is not worse than an
accepted treatment in terms of the primary response variable by more than a pre-
specified margin. A study to demonstrate equivalence has the objective of
demonstrating the response to the new treatment is within a prespecified margin in both
directions. We will learn more about these studies when we explore sample size
calculations.
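
One common way to operationalize these objectives is to compare a confidence interval for the treatment difference with the pre-specified margin. The sketch below assumes, for illustration only, that larger values of the response are better, that the difference is defined as new minus accepted treatment, and that the confidence interval has already been computed elsewhere. The function name and the example numbers are hypothetical.

```python
def classify_trial_result(ci_lower, ci_upper, margin):
    """Compare a confidence interval for (new - accepted) with a margin.

    Assumes larger responses are better and margin > 0.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    return {
        # Superiority: the whole interval lies above zero.
        "superior": ci_lower > 0,
        # Non-inferiority: the new treatment is not worse by more than the margin.
        "non_inferior": ci_lower > -margin,
        # Equivalence: the interval lies within the margin in both directions.
        "equivalent": -margin < ci_lower and ci_upper < margin,
    }


# Hypothetical intervals for a difference in response proportions, margin 0.05.
result_within_margin = classify_trial_result(-0.03, 0.04, margin=0.05)
result_outside_margin = classify_trial_result(-0.08, 0.01, margin=0.05)

print(result_within_margin)
print(result_outside_margin)
```

Note that an interval can support non-inferiority or equivalence without supporting superiority, which is why the margin must be fixed before the data are seen.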
3.4 - Clinical Trial Phases
When a drug, procedure, or treatment appears safe and effective based on
preclinical studies, it can be considered for trials in humans. Clinical studies of
experimental drugs, procedures, or treatments in humans have been
classified into four phases (Phase I, Phase II, Phase III, and Phase IV) based
on the terminology used when pharmaceutical companies interact with the
U.S. FDA. Greater numbers of patients are assigned to treatment in each
successive phase.

Phase 0 represents pre-clinical testing in animals to obtain pharmacokinetic
information.

Phase I trials investigate the effects of various dose levels on humans. The
studies are usually done in a small number of volunteers (sometimes persons
without the disease of interest or patients with few remaining treatment
options) who are closely monitored in a clinical setting. The purpose is to
determine a safe dosage range and to identify any common side effects or
readily apparent safety concerns. Data may be collected to provide a
description of the pharmacokinetics and pharmacodynamics of the compound,
estimate the maximum tolerated dose (MTD), or evaluate the effects of
multiple dose levels. Many trials in the early stage of therapy development
either investigate treatment mechanism (TM) or incorporate dose-finding (DF)
strategies.

To a pharmacologist, a TM trial is a pharmacokinetics study in which an
attempt is made to investigate the bioavailability of the drug at various sites in
the human system. To a surgeon, a TM study investigates the operative
procedure. A DF trial usually tries to determine the maximum tolerated dose,
or the minimum effective dose, etc. Thus, phase I (drug) trials can be
considered TM and DF trials.

A Phase II trial typically investigates preliminary evidence of efficacy and
continues to monitor safety. A Phase II trial may be the first time that the agent
is administered to patients with the disease of interest to answer questions
such as: What is the correct dosage for efficacy and safety in patients of this
type? What is the probability a patient treated with the compound will benefit
from the therapy or experience an adverse effect? Most trials in the middle
stage of therapy development investigate safety and efficacy (SE). The
experimental drug or treatment is administered to as many as several hundred
patients in Phase II trials.
At the end of Phase II, a decision will be made as to whether or not the drug is
promising and development should continue. In the U.S. there will be an End
of Phase II meeting between the pharmaceutical company and the FDA to
discuss safety and plans for Phase III studies. Ineffective or unsafe
compounds should not proceed into Phase III trials.

A Phase III trial is a rigorous clinical trial with randomization, one or more
control groups and definitive clinical endpoints. Phase III trials are often multi-
center, accumulating the experience of thousands of patients. Phase III trials
address questions of comparative treatment efficacy (CTE). A CTE
trial involves a placebo and/or active control group so that precise and valid
estimates of differences in clinical outcomes attributable to the investigational
therapy can be assessed.

If things go well during Phase III, the company with the license for the
compound will submit an application for approval to market the drug. U.S.
FDA approval hinges on adequate and well-controlled pivotal Phase III
studies that are convincing of safety and efficacy.

A Phase IV trial, or expanded safety trial, occurs after regulatory approval of
the new therapy. As usage of the new drug becomes widespread, there is an
opportunity to learn about rare side effects and interactions with other
therapies. An expanded safety (ES) study can provide important information
that was not apparent during the drug development. For example, a few
thousand patients might be involved in all of the SE and CTE trials for a
particular therapy. An ES study, however, could involve >10,000 patients.
Such large sample sizes can detect more subtle safety problems for the
therapy, if such problems exist. Some Phase IV studies will have a marketing
objective for the company as well as collecting safety data.
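
The benefit of larger sample sizes for rare events can be illustrated with a simple calculation. The sketch below assumes, as a simplification, that patients experience the event independently with a common probability; the event rate of 1 in 5,000 and the sample sizes are hypothetical, and the function name is our own.

```python
def prob_at_least_one_event(n_patients, event_rate):
    """P(at least one adverse event among n patients).

    Assumes independent patients with a common per-patient event rate.
    """
    return 1.0 - (1.0 - event_rate) ** n_patients


rate = 1 / 5000  # hypothetical rare adverse event

for n in (3000, 10000):
    print(n, round(prob_at_least_one_event(n, rate), 2))
```

Under these assumptions, a program with 3,000 patients has roughly a 45% chance of observing the event even once, while a study of 10,000 patients has roughly an 86% chance.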

The terminology of phase I, II, III, and IV trials does not work well for non-
pharmacologic treatments and does not account for translational trials.

Some studies performed prior to large scale clinical trials are characterized
as translational studies. Translational studies have as their primary outcome
a biological measurement or target that has been derived from an accepted
model of the disease process. The results of the translational study may
provide evidence of a mechanism of action for a compound. Target validation
can be an objective of such a study. Large effects on the target are sought.
For example, a large change in the level of a protein, or the activity of an
enzyme might support therapeutic activity of a compound. There is an
understanding that translational work may cycle from preclinical lab to a
clinical setting and back again. Although the translational studies have a
written protocol, the treatment may be modified during the study. The protocol
should clearly define what would be considered lack of effect and the next
experimental step for any possible outcome of the trial.

3.5 - Other Considerations


Some therapies are not developed in the same manner as drugs, such as
disease prevention therapies, vaccines, biologicals, surgical techniques,
medical devices, and diagnostic agents.

Prevention trials are conducted in:

1. healthy individuals to determine if the therapy prevents the onset of
disease,

2. patients with early-stage disease to determine if the therapy prevents
progression, or

3. patients with the disease to determine if the therapy prevents additional
episodes of disease expression.

Vaccine investigations are a type of primary prevention trial. They require large
numbers of patients and are very costly because of the numbers and the
length of follow-up that is required.

The objective of a diagnostic or screening trial is to determine if an agent
can diagnose the presence of disease. Usually, the agent is compared to a
gold standard diagnostic that is assumed to be perfectly accurate in its
diagnosis. The advantage of the newer diagnostic agent is less expense or a
less invasive procedure.

3.6 - Importance of the Research Protocol
A protocol is the document that specifies the research plan for the clinical
trial. It is the single most important quality control tool for all aspects of a
clinical trial (Piantadosi 2005). This is especially true in a multi-center clinical
trial, which requires collaboration in the research activities of many
investigators and their staffs at multiple institutions.

Every clinical trial experiences violations of the protocol. Some violations are
due to differences in interpretation, some are due to carelessness, and some
are due to unforeseen circumstances. Some protocol deviations are
inconsequential, but others can affect the validity of the trial. For instance, a
patient might be unaware of a condition that is present in its early or latent
stage, or a patient may intentionally mislead a researcher, thinking they will
receive special treatment from participating in a study. Both situations result
in violations of the patient exclusion criteria established in the research
protocol. Protocol amendments are common as a long-term multi-center study
progresses. The most serious violations are those which may affect the
conclusions of the study.

The SPIRIT 2013 statement [6], issued by an international collaboration of
persons or groups responsible for funding, conducting and publishing results of
clinical trials, along with ethicists, sets forth minimal elements that should be
included in a clinical trial protocol and provides a checklist [7]. The U.S. NIH
has its own template [8] for a phase 2 or 3 clinical trial protocol.

If the conduct of a particular trial is particularly difficult, and especially if it is a
multi-center study, the investigators will construct a manual of operations
(MOP). The MOP has more detailed explanations than the protocol for how
the measurements should be taken, how the data collection forms should be
completed, etc.

3.7 - Summary
In this lesson, among other things, we learned:

the 6 objectives that will be met with proper trial design.

the sources of potential bias in clinical studies.

the design strategies to reduce bias, variability and placebo effects in a
proposed clinical study.

to compare and contrast the following study designs with respect to the
ability of the investigator to minimize bias: Case report or case series,
database analysis, prospective cohort study, case-control study, parallel
design clinical trial, crossover clinical trial.

to identify the experimental unit in a proposed study.

to differentiate between Phases I-IV trials.

to recognize features that should be described in a written protocol for a
clinical trial.

to identify characteristics and purposes of translational studies.

Lesson 4: Bias and Random Error


Introduction

Error is defined as the difference between the true value of a measurement
and the recorded value of a measurement. There are many sources of error in
collecting clinical data. Error can be described as random or systematic.

Random error is also known as variability, random variation, or noise in the
system. The heterogeneity in the human population leads to relatively large
random variation in clinical trials.

Systematic error or bias refers to deviations that are not due to chance alone.
The simplest example occurs with a measuring device that is improperly
calibrated so that it consistently overestimates (or underestimates) the
measurements by X units.

Random error has no preferred direction, so we expect that averaging over a
large number of observations will yield a net effect of zero. The estimate may
be imprecise, but not inaccurate. The impact of random error, imprecision, can
be minimized with large sample sizes.

Bias, on the other hand, has a net direction and magnitude so that averaging
over a large number of observations does not eliminate its effect. In fact, bias
can be large enough to invalidate any conclusions. Increasing the sample size
is not going to help. In human studies, bias can be subtle and difficult to
detect. Even the suspicion of bias can lead to the judgment that a study is
invalid. Thus, the design of clinical trials focuses on removing known biases.

Random error corresponds to imprecision, and bias to inaccuracy. Here is a
diagram that will attempt to differentiate between imprecision and inaccuracy.

See the difference between these two terms? OK, let's explore these further!
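
The contrast between imprecision and inaccuracy can also be sketched with a small simulation. The true value, noise level, and calibration offset below are hypothetical numbers chosen only for illustration, and `simulate_mean` is a helper invented for this sketch. As the number of readings grows, the error of the unbiased average shrinks toward zero, while the error of the biased average settles near the offset.

```python
import random
import statistics

def simulate_mean(true_value, bias, noise_sd, n, seed):
    """Average n noisy readings taken with a device that has a fixed offset."""
    rng = random.Random(seed)
    readings = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)

TRUE_VALUE = 100.0  # hypothetical true measurement
NOISE_SD = 10.0     # hypothetical random error (standard deviation)
BIAS = 5.0          # hypothetical calibration offset (systematic error)

for n in (10, 1000, 100000):
    unbiased = simulate_mean(TRUE_VALUE, 0.0, NOISE_SD, n, seed=1)
    biased = simulate_mean(TRUE_VALUE, BIAS, NOISE_SD, n, seed=1)
    print(f"n={n:>6}: error without bias = {unbiased - TRUE_VALUE:+.3f}, "
          f"error with bias = {biased - TRUE_VALUE:+.3f}")
```

Increasing n helps only the first column of errors; the second stays close to the assumed offset no matter how many readings are averaged.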

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

1. Distinguish between random error and bias in collecting clinical data.

2. State how the significance level and power of a statistical test are
related to random error.

3. Accurately interpret a confidence interval for a parameter.

4.1 - Random Error


Random error (variability, imprecision) can be overcome by increasing the
sample size. This is illustrated in this section via hypothesis
testing and confidence intervals, two accepted forms of statistical inference.

Review of Hypothesis testing

In hypothesis testing, a null hypothesis and an alternative hypothesis are
formed. Typically, the null hypothesis reflects the lack of an effect and the
alternative hypothesis reflects the presence of an effect (supporting the
research hypothesis). The investigator needs to have sufficient evidence,
based on data collected in a study, to reject the null hypothesis in favor of the
alternative hypothesis.

Suppose an investigator is conducting a two-armed clinical trial in which
subjects are randomized to group A or group B, and the outcome of interest is
the change in serum cholesterol after 8 weeks. Because the outcome is
measured on a continuous scale, the hypotheses are stated as:

$H_0: \mu_A = \mu_B$ versus $H_1: \mu_A \neq \mu_B$

where $\mu_A$ and $\mu_B$ represent the population means for groups A and B,
respectively.

The alternative hypothesis of $H_1: \mu_A \neq \mu_B$ is labeled a two-sided
alternative because it does not indicate whether A is better than B or vice
versa. Rather, it just indicates that A and B are different. A one-sided
alternative of $H_1: \mu_A < \mu_B$ (or $H_1: \mu_A > \mu_B$) is possible, but
it is more conservative to use the two-sided alternative.

The investigator conducts a study to test his hypothesis with 40 subjects in
each of group A and group B ($n_A = 40$ and $n_B = 40$). The investigator
estimates the population means via the sample means (labeled $\bar{x}_A$ and
$\bar{x}_B$, respectively). Suppose the average changes that we observed are
$\bar{x}_A = 7.3$ mg/dl and $\bar{x}_B = 4.8$ mg/dl. Do these data provide
enough evidence to reject the null hypothesis that the mean changes in the
two populations are equal? (The question cannot be answered yet. We
do not know if this is a statistically significant difference!)

If the data approximately follow a normal distribution or are from large enough
samples, then a two-sample t test is appropriate for comparing groups A and
B where:

$t = (\bar{x}_A - \bar{x}_B) / (\text{standard error of } \bar{x}_A - \bar{x}_B)$.

We can think of the two-sample t test as representing a signal-to-noise ratio
and ask if the signal is large enough relative to the noise detected. In the
example, $\bar{x}_A = 7.3$ mg/dl and $\bar{x}_B = 4.8$ mg/dl. If the standard
error of $\bar{x}_A - \bar{x}_B$ is 1.2 mg/dl, then:

$t_{obs} = (7.3 - 4.8)/1.2 \approx 2.1$

But what does this value mean?

Each t value has associated probabilities. In this case, we want to know the
probability of observing a t value as extreme or more extreme than the t value
actually observed, if the null hypothesis is true. This is the p-value. At the
completion of the study, a statistical test is performed and its corresponding
p-value calculated. If the p-value < $\alpha$, then $H_0$ is rejected in favor
of $H_1$.

Two types of errors can be made in testing hypotheses: rejecting the null
hypothesis when it is true (a Type I error) or failing to reject the null
hypothesis when it is false (a Type II error). The probability of making a
Type I error, represented by $\alpha$ (the significance level), is determined
by the investigator prior to the onset of the study. Typically, $\alpha$ is set
at a low value, say 0.01 or 0.05.


In our example, the p-value = [probability that |t| > 2.1] = 0.04

Thus, the null hypothesis of equal mean change in the two populations is
rejected at the 0.05 significance level. The treatments differed in the
mean change in serum cholesterol at 8 weeks.
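
The arithmetic of this example can be retraced with a short sketch. It uses only the summary values given in the text (sample means 7.3 and 4.8 mg/dl, standard error 1.2 mg/dl). Two things are assumptions of the sketch rather than statements from the lesson: the degrees of freedom are taken as 40 + 40 - 2 = 78, as in a pooled-variance two-sample t test, and the tail probability is obtained by numerically integrating the t density with Simpson's rule in the hypothetical helper `two_sided_p`.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    scale = math.exp(log_coef) / math.sqrt(df * math.pi)
    return scale * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_value, df, steps=2000):
    """P(|T| > |t_value|), integrating the density with Simpson's rule."""
    upper = abs(t_value)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3           # P(0 < T < |t_value|)
    return 1.0 - 2.0 * central_half

mean_a, mean_b = 7.3, 4.8   # sample means from the text (mg/dl)
std_error = 1.2             # standard error of the difference, from the text
df = 40 + 40 - 2            # assumption: pooled-variance two-sample t test

t_obs = (mean_a - mean_b) / std_error
p_value = two_sided_p(t_obs, df)
print(f"t = {t_obs:.2f}, two-sided p = {p_value:.3f}")
```

Under these assumptions the computed p-value lands close to the 0.04 quoted above, below the 0.05 significance level.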

Note that $\beta$ (the probability of not rejecting $H_0$ when it is false) did
not play a role in the test of hypothesis.

The importance of $\beta$ came into play during the design phase when the
investigator attempted to determine an appropriate sample size for the study.
To do so, the investigator had to decide on the effect size of interest, i.e., a
clinically meaningful difference between groups A and B in average change in
cholesterol at 8 weeks. The statistician cannot determine this but can help the
researcher decide whether he has the resources to have a reasonable chance
of observing the desired effect or should rethink his proposed study design.

The effect size is expressed as: $\delta = \mu_A - \mu_B$.

The sample size should be determined such that there exists good statistical
power ($\beta$ = 0.1 or 0.2, i.e., 90% or 80% power) for detecting this effect
size with a test of hypothesis that has significance level $\alpha$.

A sample size formula that can be used for a two-sided, two-sample test with
$\alpha = 0.05$ and $\beta = 0.1$ (90% statistical power) is:

$n_A = n_B = 21\sigma^2/\delta^2$

where $\sigma$ = the population standard deviation (more detailed information
will be discussed in a later lesson).

Note that the sample size increases as $\sigma$ increases (noise increases).

Note that the sample size increases as $\delta$ decreases (effect size decreases).

In the serum cholesterol example, the investigator had selected a meaningful
difference, $\delta$ = 3.0 mg/dl, and located a similar study in the literature
that reported $\sigma$ = 4.0 mg/dl. Then:

$n_A = n_B = 21\sigma^2/\delta^2 = (21 \times 16)/9 \approx 37$

Thus, the investigator randomized 40 subjects to each of group A and group B
to assure 90% power for detecting an effect size that would have clinical
relevance.
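
The constant 21 in the sample size formula can be checked with a short sketch. It assumes the standard normal-approximation formula for comparing two means, n per group = 2(z_{1-alpha/2} + z_{1-beta})^2 sigma^2 / delta^2, which the lesson does not spell out; `n_per_group` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, beta=0.10):
    """Subjects per group for a two-sided, two-sample comparison of means
    (normal approximation; an assumed general form of the lesson's formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(1 - beta)         # about 1.28 for beta = 0.10
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

multiplier = n_per_group(sigma=1.0, delta=1.0)      # the constant in front of sigma^2/delta^2
n_exact = n_per_group(sigma=4.0, delta=3.0)         # serum cholesterol example

print(f"multiplier = {multiplier:.2f}")
print(f"n per group = {n_exact:.2f}, rounded up to {math.ceil(n_exact)}")
```

The multiplier comes out very close to 21, and the unrounded answer sits between 37 and 38, so rounding up would give 38 per group. The 40 subjects per group actually randomized cover either figure.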

Many studies suffer from low statistical power (large Type II error) because the
investigators do not perform sample size calculations.

If a study has very large sample sizes, then it may yield a statistically
significant result without any clinical meaning. Suppose in the serum
cholesterol example that $\bar{x}_A = 7.3$ mg/dl and $\bar{x}_B = 7.1$ mg/dl,
with $n_A = n_B = 5,000$. The two-sample t test may yield a p-value = 0.001,
but $\bar{x}_A - \bar{x}_B = 7.3 - 7.1 = 0.2$ mg/dl is not clinically
interesting.

Confidence Intervals

A confidence interval provides a plausible range of values for a population
measure. Instead of just reporting $\bar{x}_A - \bar{x}_B$ as the sample
estimate of $\mu_A - \mu_B$, a range of values can be reported using a
confidence interval.

The confidence interval is constructed in a manner such that it provides a high
percentage of confidence (95% is commonly used) that the true value of
$\mu_A - \mu_B$ lies within it.

If the data approximately follow a bell-shaped normal distribution, then a 95%
confidence interval for $\mu_A - \mu_B$ is

$(\bar{x}_A - \bar{x}_B) \pm \{1.96 \times (\text{standard error of } \bar{x}_A - \bar{x}_B)\}$

In the serum cholesterol example, $(\bar{x}_A - \bar{x}_B) = 7.3 - 4.8 = 2.5$
mg/dl and the standard error = 1.2 mg/dl. Thus, the approximate 95% confidence
interval is:

$2.5 \pm (1.96 \times 1.2) = [0.1, 4.9]$

Note that the 95% confidence interval does not contain 0, which is consistent
with the results of the 0.05-level hypothesis test (p-value = 0.04). 'No
difference' is not a plausible value for the difference between the treatments.

Notice also that the length of the confidence interval depends on the standard
error. The standard error decreases as the sample size increases, so the
confidence interval gets narrower as the sample size increases (hence,
greater precision).
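
The interval above, and the way it narrows as the standard error shrinks, can be reproduced with a minimal sketch. The function name `confidence_interval` is hypothetical, and the second call assumes (for illustration only) that the standard error is halved, which is roughly what quadrupling the sample size would do.

```python
from statistics import NormalDist

def confidence_interval(diff, std_error, level=0.95):
    """Normal-approximation confidence interval for a difference in means."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return diff - z * std_error, diff + z * std_error

# Values from the serum cholesterol example in the text.
low, high = confidence_interval(7.3 - 4.8, 1.2)
print(f"95% CI: [{low:.1f}, {high:.1f}]")        # prints 95% CI: [0.1, 4.9]

# Assumed for illustration: a standard error half as large.
low_half, high_half = confidence_interval(7.3 - 4.8, 0.6)
print(f"width with SE 1.2: {high - low:.2f}, width with SE 0.6: {high_half - low_half:.2f}")
```

Because the lower limit stays above 0, the interval agrees with rejecting the null hypothesis at the 0.05 level.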

A confidence interval is actually more informative than testing a hypothesis.
Not only does it indicate whether $H_0$ can be rejected, but it also provides a
plausible range of values for the population measure. Many of the major
medical journals request the inclusion of confidence intervals within submitted
reports and published articles.

4.2 - Clinical Biases


If a bias is small relative to the random error, then we do not expect it to be a
large component of the total error. A strong bias can yield a point estimate that
is very distant from the true value. Remember the 'bulls eye' graphic?
Investigators seldom know the direction and magnitude of bias, so
adjustments to the estimators are not possible.

There are many sources of bias in clinical studies:

1. Selection bias

2. Procedure selection bias

3. Post-entry exclusion bias

4. Bias due to selective loss of data

5. Assessment bias

1. Selection Bias

Selection bias refers to selecting a sample that is not representative of the
population because of the method used to select the sample. Selection bias in
the study cohort can diminish the external validity of the study findings. A
study with external validity yields results that are useful in the general
population. Suppose an investigator decides to recruit only hospital
employees in a study to compare asthma medications. This sample might be
convenient, but such a cohort is not likely to be representative of the general
population. The hospital employees may be more health conscious and
conscientious in taking medications than others. Perhaps they are better at
managing their environment to prevent attacks. The convenient sample easily
produces bias. How would you estimate the magnitude of this bias? An
undisputed estimate is unlikely to be found, and the study will be criticized
because of the potential bias.

If the trial is randomized with a control group, however, something may be
salvaged. Randomized controls increase internal validity of a study.
Randomization can also provide external validity for treatment group
differences. Selection bias should affect all randomized groups equally, so in
taking differences between treatment groups, the bias is removed via
subtraction. Randomization in the presence of selection bias cannot provide
external validity for absolute treatment effects. The graph below illustrates
these concepts.

The estimates of the response from the sample are clearly biased below the
population values. However, the observed difference between treatment and
control is of the same magnitude as that in the population. In other words, it
could be that the observed treatment difference accurately reflects the
population difference, even though the observations within the control and
treatment groups are biased.
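
The subtraction argument can be sketched numerically. All numbers below are hypothetical, and the sketch assumes the simplest case, in which selection shifts the response of every recruited subject by the same additive amount in both randomized arms.

```python
import random
import statistics

rng = random.Random(42)

POP_CONTROL, POP_TREATED = 50.0, 60.0  # hypothetical population mean responses
SELECTION_SHIFT = -8.0                 # hypothetical bias shared by all recruited subjects
N_PER_ARM = 5000

def arm_mean(pop_mean):
    """Mean response in one randomized arm of the selected (biased) cohort."""
    return statistics.fmean(
        pop_mean + SELECTION_SHIFT + rng.gauss(0.0, 10.0) for _ in range(N_PER_ARM)
    )

control = arm_mean(POP_CONTROL)
treated = arm_mean(POP_TREATED)

print(f"control arm mean: {control:.1f} (population value {POP_CONTROL})")
print(f"treated arm mean: {treated:.1f} (population value {POP_TREATED})")
print(f"observed difference: {treated - control:.1f} "
      f"(population difference {POP_TREATED - POP_CONTROL})")
```

Each arm mean is off by roughly the assumed shift, yet the difference between arms stays near the population difference. If selection affected the two arms differently, or changed the size of the treatment effect itself, this cancellation would not hold.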

2. Procedure Selection Bias

Procedure selection bias, a likely result when patients or investigators decide
on treatment assignment, can lead to extremely large biases. The investigator
may consciously or subconsciously assign particular treatments to specific
types of patients. Randomization is the primary design feature that removes
this bias.

3. Post-entry exclusion bias

Post-entry exclusion bias can occur when the exclusion criteria for subjects
are modified after examination of some or all of the data. Some enrolled
subjects may be recategorized as ineligible and removed from the study. In
the past, this may have been done for the purposes of manufacturing
statistically significant results, but would be regarded as unethical practice
now.

4. Bias due to selective loss of data

Bias due to selective loss of data is related to post-entry exclusion bias. In this
case, data from selected subjects are eliminated from the statistical analyses.
Protocol violations (including adding on other medications, changing
medications or withdrawal from therapy) and other situations may cause an
investigator to request an analysis using only the data from those who adhered
to the protocol or who completed the study on their assigned therapy.

The latter two types of biases can be extreme. Therefore, statisticians prefer
that intention-to-treat analyses be performed as the main statistical analysis.

In an intention-to-treat analysis, all randomized subjects are included in the
data analysis, regardless of protocol violations or lack of compliance. Though
it may seem unreasonable to include data from a patient who simply refused
to take the study medication or violated the protocol in a serious manner, the
intention-to-treat analysis usually prevents more bias than it introduces. Once
all the patients are randomized to therapy, use all of the data collected. Other
analyses may supplement the intention-to-treat analysis, perhaps
substantiating that protocol violations did not affect the overall inferences, but
the analysis including all subjects randomized should be primary.

5. Assessment bias

As discussed earlier, clinical studies that rely on patient self-assessment or
physician assessment of patient status are susceptible to assessment bias. In
some circumstances, such as in measuring pain or symptoms, there are no
alternatives, so attempts should be made to be as objective as possible and
invoke randomization and blinding. What is a mild cough for one person might
be characterized as a moderate cough by another patient. Not knowing
whether or not they received the treatment (blinding) when making these
subjective evaluations will help to minimize this self-assessment or
assessment bias.

Well-designed and well-conducted clinical trials can eliminate or minimize
biases.

Key design features that achieve this goal include:

1. Randomization (minimizes procedure selection bias)

2. Masking (minimizes assessment bias)

3. Concurrent controls (minimizes treatment-time confounding and/or
adjusts for disease remission/progression, as the graph below
illustrates. Both treatment and control had an increase in response, but
the treatment group experienced a greater increase.)

4. Objective assessments (minimizes assessment bias)

5. Active follow-up and endpoint ascertainment (minimizes assessment
bias)

6. No post hoc exclusions (minimizes post-entry exclusion bias)


4.3 - Statistical Biases
For a point estimator, statistical bias is defined as the difference between the
mathematical expectation of the estimator and the parameter to be estimated.

Statistical bias can result from methods of analysis or estimation. For
example, if the statistical analysis does not account for important prognostic
factors (variables that are known to affect the outcome variable), then it is
possible that the estimated treatment effects will be biased. Fortunately, many
statistical biases can be corrected, whereas design flaws lead to biases that
cannot be corrected.

The simplest example of statistical bias is in the estimation of the variance in
the one-sample situation with $Y_1, \ldots, Y_n$ denoting independent and
identically distributed random variables and $\bar{Y}$ denoting their sample
mean. Define:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$

and

$v^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$

The statistic $s^2$ is unbiased because its mathematical expectation is the
population variance, $\sigma^2$. The statistic $v^2$ is biased because its
mathematical expectation is $\sigma^2 (n - 1)/n$. The statistic $v^2$ tends to
underestimate the population variance.

Thus, bias($v^2$) is $\sigma^2 (n - 1)/n - \sigma^2 = -\sigma^2/n$. Obviously,
as the sample size, n, gets larger, the bias becomes negligible.
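
A quick simulation can make the bias of $v^2$ visible. This is only a sketch: the sample size n = 5, the standard deviation sigma = 2 (so the population variance is 4), and the number of repetitions are arbitrary choices, and `average_estimates` is a hypothetical helper.

```python
import random
import statistics

def average_estimates(n, sigma, reps, seed=0):
    """Average s^2 and v^2 over many simulated samples of size n."""
    rng = random.Random(seed)
    s2_sum = v2_sum = 0.0
    for _ in range(reps):
        y = [rng.gauss(0.0, sigma) for _ in range(n)]
        y_bar = statistics.fmean(y)
        sum_sq = sum((yi - y_bar) ** 2 for yi in y)
        s2_sum += sum_sq / (n - 1)   # divisor n - 1: unbiased
        v2_sum += sum_sq / n         # divisor n: biased downward
    return s2_sum / reps, v2_sum / reps

n, sigma = 5, 2.0                     # assumed values; population variance = 4
s2_avg, v2_avg = average_estimates(n, sigma, reps=20000)

print(f"average s^2 = {s2_avg:.2f} (target {sigma ** 2:.2f})")
print(f"average v^2 = {v2_avg:.2f} (expected {sigma ** 2 * (n - 1) / n:.2f})")
```

With n = 5 the expected value of $v^2$ is 4 x 4/5 = 3.2, a bias of -0.8, which matches $-\sigma^2/n$. Rerunning with a larger n shows the two averages drawing together.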

4.4 - Summary
In this lesson, among other things, we learned:

to distinguish between random error and bias in collecting clinical data.


to state how the significance level and power of a statistical test are
related to random error.

to accurately interpret a confidence interval for a parameter.

Check to see if there are any homework problems associated with this lesson.

Lesson 5: Objectives and Endpoints


Introduction

The objectives of a trial must be stated in specific terms. Achieving objectives
should not depend on observing a particular outcome of the trial, e.g., finding a
difference in mean weight loss of exactly 2 kg, but on obtaining a valid result.
For example, a randomized trial of 4 diets had as its objective "to assess
adherence rates and the effectiveness of 4 popular diets for weight loss and
cardiac risk factor reduction" (Dansinger et al. 2005).

The endpoints (or outcomes), determined for each study participant, are the
quantitative measurements required by the objectives. In the Dansinger
weight loss study, the primary endpoint was identified to be mean absolute
change from baseline weight at 1 year. In a cancer chemotherapy trial the
clinical objective is usually improved survival. Survival time is recorded for
each patient; the primary outcome reported may be median survival time or it
could be five-year survival.

Clinical trials typically have a primary objective or endpoint. Additional
objectives and endpoints are secondary. The sample size calculation is
based on the primary endpoint. Analysis involving a secondary objective has
statistical power that is calculated based on the sample size for the primary
objective.

"Hard" endpoints are well-defined in the study protocol, definitive with respect
to the disease process, and require no subjectivity. "Soft" endpoints are those
that do not relate strongly to the disease process or require subjective
assessments by investigators and/or patients. Some endpoints fall between
these two classifications. For example: the grading of x-rays by radiologists
and the grading of cell and tissue lesions/tumors by pathologists. There is
some degree of subjectivity, but they are valid and reliable endpoints in most
settings.

This lesson will help to differentiate between these types of objectives and
endpoints. Ready, let's get started!

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

1. Identify outcomes that are continuous, binary, event times, counts,
ordered or unordered categories and repeated measurements.

2. State the merits and problems of using a surrogate outcome.

3. Recognize types of censoring that can occur in studies of time-to-event
outcomes.

4. State the components of a typical dose-finding design.

5.1 - Endpoints
The endpoints used in a clinical trial must correspond to the scientific
objectives of the study and the methods of outcome assessment should be
accurate (free of bias).

A wide variety of endpoints are used in clinical trials, as displayed below:

Continuous: blood pressures, weight, blood chemistry variables
Event times: time to recurrence of cancer, survival time
Counts: frequency of occurrence of migraine headaches, number of uses of rescue meds for asthma
Binary: no recurrence/recurrence, major cardiac event yes or no
Ordered categories: absent, mild, moderate, severe pain; NYHA status
Unordered categories: categories of adverse experiences: GI, cardiac, etc.

Some endpoints are assessed many times during the study, leading to
repeated measurements.

5.2 - Special Considerations for Event Times
Event Times

Event times often are useful endpoints in clinical trials. Examples include
survival time from onset of diagnosis, time until progression from one stage of
disease to another, and time from surgery until hospital discharge. In each
case time is measured from study entry until the event occurs. With an
endpoint that is based on an event time, there always is the chance
of censoring. An event time is censored if there is some amount of follow-up
on a subject, but the event is not observed because of loss-to-follow-up, death
from a cause other than the trial endpoint, study termination, or other
reasons unrelated to the endpoint of interest. This is known as right censoring
and occurs frequently in studies of survival.

Right-censoring example

Consider the table above which displays time until infection for Patients 1-6. In
some cases, the event did not occur: Patient 1 (from top) was followed for a
year and was censored at the end of the study. The second patient
experienced an infection at approximately 325 days. Patients 3 and 6 dropped
out of the study and were censored when this occurred.

Left censoring occurs when the initiation time for the subject, such as time of
diagnosis, is unknown. Interval censoring occurs when the subject is not
followed for a period of time during the trial and it is unknown if the event
occurred during that period.

Right Censoring Types

There are three types of right censoring that are described in the statistical
literature.

Type I censoring occurs when all subjects are scheduled to begin the study at
the same time and end the study at the same time. This type of censoring is
common in laboratory animal experiments, but unlikely in human trials.

Type II censoring occurs when all subjects begin the study at the same time
and the study is terminated when a predetermined proportion of subjects have
experienced the event.

Type III censoring occurs when the censoring is random, which is the case in
clinical trials because of staggered entry (not every patient enters the study on
the first day) and unequal follow-up on subjects.

Statistical methods appropriate for event time data, survival analyses, do not
discard the right-censored observations. Instead, the methods account for the
knowledge that the event did not occur in a subject up to the censoring time.
Survival methods include life table analysis, Kaplan-Meier survival curves,
logrank and Wilcoxon tests, and proportional hazards regression (more
discussion on these in a later lesson).

In order to conduct event-time analyses, two measurements must be
recorded, namely, the follow-up time for a subject and an indicator variable as
to whether this is an event time or a censoring time. These statistical methods
assume that the censoring mechanisms and the event are independent. If this
is not the case, e.g., patients have a tendency to be censored prior to the
occurrence of the event, the event rate will be underestimated.
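
To show how the two recorded measurements are used together, here is a minimal sketch of the Kaplan-Meier product-limit estimate, one of the survival methods named above. The six follow-up times and indicators are hypothetical (they are not the values from the right-censoring example), and `kaplan_meier` is a helper written for this sketch, not a library function.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time for each subject
    events -- 1 if the time is an event time, 0 if it is a censoring time
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events_at_t = sum(1 for tt, e in data if tt == t and e == 1)
        leaving_at_t = sum(1 for tt, _ in data if tt == t)
        if events_at_t:
            # Censored subjects still count in at_risk up to their censoring time.
            survival *= 1 - events_at_t / at_risk
            curve.append((t, survival))
        at_risk -= leaving_at_t
        i += leaving_at_t
    return curve

# Hypothetical data: follow-up in days, 1 = infection observed, 0 = censored.
times = [365, 325, 120, 200, 280, 150]
events = [0, 1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"day {t}: estimated proportion infection-free = {s:.2f}")
```

The censored subjects are not discarded: each contributes to the number at risk until the day follow-up ended, which is how the method uses the knowledge that no event occurred up to the censoring time.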

When the event of interest is death, it is common to examine two different
endpoints, namely, death from all causes and death primarily due to the
disease.

At first glance, death primarily due to the disease appears to be the most
appropriate. It is, however, susceptible to bias because the assumption of
independent causes of death may not be valid. For example, subjects with a
life-threatening cancer are prone to death due to myocardial infarction. It can
also be very difficult to determine the exact cause of death.

5.3 - Surrogate Endpoints


A surrogate endpoint is one that is measured in place of the biologically
definitive or clinically meaningful endpoint. A surrogate endpoint usually tracks
the progress or extent of the disease.

Investigators choose a surrogate endpoint when the definitive endpoint is
inaccessible due to cost, time, or difficulty of measurement. The problem with
a surrogate endpoint in a clinical trial is determining whether it is valid, i.e.,
whether it is strongly associated with the definitive outcome.

Piantadosi (2005) gives the following characteristics of a useful surrogate
endpoint:

1. It can be measured simply and without invasive procedures

2. It is related to the causal pathway for the definitive endpoint

3. It yields the same statistical inference as that for the definitive endpoint

4. It should be responsive to the effects of treatments

Disease >>> Surrogate Endpoint >>> Definitive Endpoint

The disease affects the surrogate endpoint, which in turn affects the definitive
endpoints.

Examples of surrogate endpoints include CD4 counts in AIDS patients, tumor
size reduction in cancer patients, blood pressure in cardiovascular disease,
and intraocular pressure in glaucoma patients. The response variables in
translational research are surrogate endpoints.

Surrogate endpoints can potentially shorten and increase the efficiency of
clinical trials. If, however, the surrogate is imprecisely associated with
definitive endpoints, use of the surrogate can lead to misleading results.

5.4 - Considerations for Dose Finding Studies
The terms describing several types of early clinical studies are given below.

Term: Meaning
Treatment mechanism: Early developmental trial that investigates mechanism of treatment effect, e.g., a pharmacokinetics study [...] and elimination of the drug from the human body
Phase I: Imprecise term for dose-ranging studies
Dose escalation: Design or component of a design that specifies methods for increases in dose for subsequent subjects
Dose-ranging: Design that tests some or all of a prespecified set of doses (fixed design points)
Dose-finding: Design that titrates dose to a prespecified optimum based on biological or clinical considerations

Table from Piantadosi (2005)

Dose-finding (DF) trials are Phase I studies with the objective of determining
the optimal biological dose (OBD) of a drug. In order to determine the dose
with highest potential for efficacy in the patient population that still meets
safety criteria, dose-finding studies are typically conducted by
administering sequentially rising doses to successive groups of individuals.
Such studies may be conducted in healthy volunteers or in patients with
disease.

A question the investigator must answer in designing a dose-finding study is
how to characterize an optimum dose. Should the optimum dose be
selected on the basis of the highest therapeutic index (the maximal separation
between risk and benefit)? Or is the optimal dose the level which maximizes
therapeutic benefit while maintaining risk below a predetermined threshold?
What measures will denote risk and benefit?

An optimal dose can be selected on the basis of efficacy alone, such as when
a minimum effective dose (MED) is chosen for a pain-relieving medication,
and defined as the dose which eliminates mild-to-moderate pain in 80% of trial
participants. In another case, the optimal dose might be selected as the
highest dose that is associated with serious side effects in no more than 1 of
20 patients. This would be a maximum nontoxic dose (MND). In cancer
therapeutics, the optimal dose for a cytotoxic drug designed to shrink tumors
could be defined as the level that yields serious but reversible toxicity in no
more than 30% of the patients. This is a maximum tolerated dose (MTD). Care
in defining the conditions for optimality is critical to a dose-finding study.

Most DF trials are sequential studies such that the number of subjects is itself
an outcome of the trial. Convincing evidence characterizing the relationship of
dose and safety can be obtained after studying a small set of patients. Hence,
sample size is not a major concern in DF trials.

An idealized DF study would be similar to an animal bioassay design, with K
fixed doses at increasing levels, $d_1, d_2, \ldots, d_K$. The hypothesized
optimal dose would lie between $d_1$ and $d_K$. The n participants would be
randomized to each of the K dose groups and the binary response of toxicity
would be noted for each participant. A mathematical model could then be fit to
the proportional responses over the doses such that the optimal dose could be
determined.

Would you agree to participate in such a study? Think carefully ....

Most likely, your answer is no because you would not want to risk being
assigned to the highest dose level of this unproven drug as your first
treatment. There is a principle here: it is unethical to treat humans at high
doses of a drug without any prior knowledge of their responses at lower
levels. Furthermore, ethics compel a design that minimizes the numbers of
patients treated with both low ineffective doses and high toxic doses.

Thus, along with defining optimality, a DF study design usually includes a
method for determining the starting dose for the patient, specification of
dose increments and cohort sizes, definition of dose-limiting toxicities, and
the decision rules for escalation and de-escalation of the dose.

Continual Reassessment Method

The continual reassessment method (CRM) fits a mathematical model to the
data observed during the study and uses it to estimate an optimal dose via
extrapolation or interpolation. The next cohort of patients is assigned to
the estimated optimal dose. A study using CRM would not have an a priori
defined set of doses; thus it is a dose-finding study. The CRM itself can be
thought of as an algorithm for updating the best guess regarding the optimal
dose. Bayesian approaches have also been incorporated into the CRM and the
method is applicable for many types of responses.

In contrast to a dose-finding study, a dose-ranging study uses pre-specified
design points.

Fibonacci Dose-Ranging Designs

Fibonacci, a thirteenth-century Italian mathematician, popularized the number
sequence 1, 1, 2, 3, 5, 8, 13, 21, 34 ... (a number in the sequence is the sum
of the two previous numbers).

A fixed dosing scheme can be based on the Fibonacci sequence. For example,
the first cohort of n participants is assigned dose D, the initial dose. If
they tolerate this dose D well, the next cohort of n participants is assigned
dose 2D. If all goes well with the second cohort, then a third cohort is
assigned dose 3D, the fourth cohort is assigned dose 5D, etc. The process is
discontinued when one of the cohorts exhibits toxicity. Numerous
modifications have been proposed to the Fibonacci scheme, such as allowing
for de-escalation as well as escalation.
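The escalation rule just described can be sketched in a few lines of Python.
This is only an illustration under simple assumptions: the function names and
the list of cohort outcomes are hypothetical, each cohort is summarized by a
single tolerated/not-tolerated flag, and no de-escalation is modeled.

```python
def fibonacci_dose_multipliers(n_levels):
    """Return the dose multipliers 1, 2, 3, 5, 8, ... for n_levels cohorts."""
    multipliers = []
    a, b = 1, 2  # start at 1, 2 (skipping the repeated leading 1) so doses strictly increase
    for _ in range(n_levels):
        multipliers.append(a)
        a, b = b, a + b
    return multipliers


def escalate(initial_dose, cohort_tolerated):
    """Walk up the dose ladder until a cohort exhibits toxicity.

    cohort_tolerated holds one boolean per cohort, in order (hypothetical
    data): True means the cohort tolerated its dose well.
    Returns the doses that were actually administered.
    """
    doses = []
    multipliers = fibonacci_dose_multipliers(len(cohort_tolerated))
    for multiplier, tolerated in zip(multipliers, cohort_tolerated):
        doses.append(multiplier * initial_dose)
        if not tolerated:
            break  # the process is discontinued once a cohort shows toxicity
    return doses


print(escalate(100, [True, True, True, False, True]))
```

With an initial dose of 100 and toxicity in the fourth cohort, the doses given
are D, 2D, 3D and 5D; the fifth cohort is never dosed.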

5.5 - Summary
In this lesson, among other things, we learned how to:

1. identify outcomes that are continuous, binary, event times, counts,
ordered or unordered categories and repeated measurements.

2. state the merits and problems of using a surrogate outcome.

3. recognize types of censoring that can occur in studies of time-to-event
outcomes.

4. state the components of a typical dose-finding design.

Look for any homework assignments listed for this lesson in the ANGEL
course site...

Lesson 6: Sample Size and Power - Part A

Introduction

The underlying theme of sample size calculation in all clinical trials is
precision. Validity and unbiasedness do not necessarily relate to sample
size.

Usually, sample size is calculated with respect to two circumstances. The first
involves precision for an estimator, e.g., requiring a 95% confidence interval
for the population mean to be within $\pm\delta$ units. The second involves
statistical power for hypothesis testing, e.g., requiring 0.80 or 0.90
statistical power ($1 - \beta$) for a hypothesis test when the significance
level ($\alpha$) is 0.05 and the effect size (the clinically meaningful
effect) is $\Delta$ units.

The formulae for many sample size calculations will involve percentiles from
the standard normal distribution. The graph below illustrates the
2.5th percentile and the 97.5th percentile.

Fig. 1 Standard normal distribution centered on zero.

For a two-sided hypothesis test with significance level $\alpha$ and
statistical power $1 - \beta$, the percentiles of interest are
$z_{1-\alpha/2}$ and $z_{1-\beta}$.

For a one-sided hypothesis test, $z_{1-\alpha}$ is used instead. Usual choices
of $\alpha$ are 0.05 and 0.01, and usual choices of $\beta$ are 0.20 and 0.10,
so the percentiles of interest usually are:

$$z_{0.995} = 2.58,\; z_{0.99} = 2.33,\; z_{0.975} = 1.96,\; z_{0.95} = 1.65,\; z_{0.90} = 1.28,\; z_{0.80} = 0.84.$$

In SAS, the PROBIT function is available to generate percentiles from the
standard normal distribution function, e.g., Z = PROBIT(0.99) yields a value of
2.33 for Z. So, if you ever need to generate z-values you can get SAS to do
this for you.
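For readers working outside SAS, the same percentiles can be obtained in
Python. The sketch below uses the standard library's `statistics.NormalDist`;
the helper name `z_percentile` is an illustrative choice, not something from
the course materials.

```python
from statistics import NormalDist


def z_percentile(p):
    """Percentile of the standard normal distribution (analogous to SAS PROBIT)."""
    return NormalDist().inv_cdf(p)


# Print the percentiles that appear most often in sample size formulas.
for p in (0.995, 0.99, 0.975, 0.95, 0.90, 0.80):
    print(f"z_{p} = {z_percentile(p):.4f}")
```

Printing four decimals shows how much rounding is involved when two-decimal
table values are plugged into a sample size formula.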

It is important to realize that sample size calculations are approximations.
The assumptions that are made for the sample size calculation, e.g., the
standard deviation of an outcome variable or the proportion of patients who
succeed with placebo, may not hold exactly.

Also, we may base the sample size calculation on a t statistic for a hypothesis
test, which assumes an exact normal distribution of the outcome variable
when it may be only approximately normal.

In addition, there will be loss-to-follow-up, so not all of the subjects who
initiate the study will provide complete data. Some will deviate from the
protocol, including not taking the assigned treatment or adding on a treatment.
Sample size calculations and recruitment of subjects should reflect these
anticipated realities.

Learning objectives & outcomes

Upon completion of this week's lessons, you should be able to do the
following:

Identify studies for which sample size is an important issue.

Estimate the sample size required for a confidence interval for p for
given $\delta$ and $\alpha$, using normal approximation and Fisher's exact
methods.

Estimate the sample size required for a confidence interval for $\mu$ for
given $\delta$ and $\alpha$, using normal approximation when the sample size
is relatively large.

Estimate the sample size required for a test of $H_0: \mu_1 = \mu_2$ to have
$100(1 - \beta)$% power for given $\Delta$ and $\alpha$, using normal
approximation, with equal or unequal allocation.

Estimate the sample size required for a test of $H_0: p_1 = p_2$ for given
$\Delta$, $\alpha$ and $\beta$, using normal approximation and Fisher's exact
methods.

Use a SAS program to estimate the number of events required for a
logrank comparison of two hazard functions to have $100(1 - \beta)$% power
with given $\alpha$.

Use Poisson probability methods to determine the cohort size required
to have a certain probability of detecting a rare event that occurs at a
rate $\xi$.

Adjust sample size requirements to account for multiple comparisons
and the anticipated withdrawal rate.

References:

Friedman, Furberg, DeMets, Reboussin and Granger. (2015) Sample size. In:
FFDRG. Fundamentals of Clinical Trials. 5th ed. Switzerland: Springer.

Piantadosi, Steven. (2005) Sample size and power. In: Piantadosi,
Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ:
John Wiley and Sons, Inc.

Wittes, Janet. (2002) "Sample Size Calculations for Randomized Controlled
Trials [1]." Epidemiologic Reviews. Vol. 24. No 1. pp. 39-53.

6A.1 - Treatment Mechanism and Dose Finding Studies
For many treatment mechanism (TM) studies, sample size is not an
important issue because usually only a few subjects are enrolled to
investigate treatment mechanism. Here you are taking a lot of measurements
on a few subjects in order to find out what might be going on with your
treatment.

As presented last week, dose-finding (DF) and dose-ranging studies
typically involve a design scheme, such as a modified Fibonacci design or
continual reassessment. An example for phase I cytotoxic drug trials is as
follows. A set of doses is determined a priori, such as 100 mg, 200 mg, 300
mg, 500 mg, 800 mg, etc. Subjects are recruited into the DF study in groups
of three. The first group receives the lowest dose of 100 mg. If none of the
subjects experiences the effect (toxicity, side effect, etc.), then the next
group of three subjects is escalated to the next dose of 200 mg. If one of the
three subjects at 100 mg experiences the effect, however, then the next group
of three subjects will receive the same dose of 100 mg. Whenever at least two
of six subjects at the same dose experience the effect, the study is
terminated and the chosen dose for a safety and efficacy study is the
previous dose level.
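The scheme above can be traced with a short Python sketch, which also shows
why the final sample size depends on the outcomes. The text does not cover
every case, so the sketch makes three labeled assumptions: two or more
affected subjects in the first cohort of three at a dose stop the study, one
affected subject out of six allows escalation, and a fully tolerated top dose
becomes the chosen dose. The function name and input data are hypothetical.

```python
def three_plus_three(doses, cohort_toxicities):
    """Trace a cohorts-of-three escalation scheme.

    doses: increasing dose levels fixed in advance.
    cohort_toxicities: number of subjects (0 to 3) experiencing the effect
        in each successive cohort of three, in the order enrolled.
    Returns (chosen_dose, total_subjects); chosen_dose is None when even the
    lowest dose is too toxic.
    """
    outcomes = iter(cohort_toxicities)
    level = 0
    n_subjects = 0
    while level < len(doses):
        toxic = next(outcomes)
        n_subjects += 3
        if toxic == 1:
            # One of three affected: treat three more subjects at the same dose.
            toxic += next(outcomes)
            n_subjects += 3
        if toxic >= 2:
            # At least two affected at this dose: stop, choose the previous level.
            # Assumption: this also applies when 2+ occur in the first three.
            chosen = doses[level - 1] if level > 0 else None
            return chosen, n_subjects
        # Assumption: 0 of 3, or 1 of 6, affected allows escalation.
        level += 1
    # Assumption: if the highest dose is tolerated, it is the chosen dose.
    return doses[-1], n_subjects


print(three_plus_three([100, 200, 300, 500, 800], [0, 1, 0, 1, 1]))
```

In this made-up run the study stops at 300 mg after 15 subjects and carries
200 mg forward; a different pattern of toxicities would give a different
total, which is why sample size is an outcome rather than a design input.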

With such mechanisms in place to determine initial dosage levels, selection
of the study sample size is not a major consideration. In fact, the final
sample size is dependent on the patient outcomes.

6A.2 - Safety and Efficacy Studies


The U.S. FDA mandates that efficacy is proven prior to approval of a drug.
Efficacy means that the tested dose of the drug is effective at ameliorating the
treated condition. Phase II trials evaluate the potential for efficacy; Phase III
trials confirm efficacy. These trials can also be referred to as safety and
activity studies.

A typical goal of a safety and efficacy (SE) study is to estimate certain
clinical endpoints with a specified amount of precision. Confidence intervals
are useful for reflecting the amount of precision, and the width of a confidence
interval is a function of sample size.

The simplest example occurs when the outcome response is binary (success
or failure). Let p denote the true (but unknown) proportion of successes in the
population that will be estimated from a sample.

The sample size is denoted as n and the number of observed successes is r.
Thus, the point estimate of p is:

$$\hat{p} = r/n$$

If the sample size is large enough, then the $100(1 - \alpha)$% confidence
interval can be approximated as:

$$\hat{p} \pm z_{1-\alpha/2} \sqrt{\hat{p}(1 - \hat{p})/n}$$

Prior to the conduct of the study, however, the point estimate is undetermined
so that an educated guess is necessary for the purposes of a sample size
calculation.

If it is desirable for the confidence interval to have limits of
$\hat{p} \pm \delta$ for a $100(1 - \alpha)$% confidence interval, and the
researcher has a reasonable guess as to the value of p, then solving the
confidence interval expression for n gives the target sample size:

$$n = z_{1-\alpha/2}^2 \, p(1 - p)/\delta^2$$

If a researcher guesses that $p \approx 0.4$ and wants a 95% confidence
interval to have limits of $\pm\delta$ = 0.10, then the required sample size
is $n = (1.96)^2(0.4)(0.6)/(0.10)^2 = 92$.

Notice that p(1 - p) is maximized when p = 0.5. Therefore, because p has to
be guessed, it is more conservative to use p = 0.5 in the sample size
calculation. In the above example this yields
$n = (1.96)^2(0.5)(0.5)/(0.10)^2 = 96$, a slightly larger sample size.

Notice that the sample size is inversely proportional to the square of the
precision $\delta$. If $\delta$ = 0.05 is desired instead of 0.10 in the above
example, then $n = (1.96)^2(0.5)(0.5)/(0.05)^2 = 384$.

If you want the confidence interval to be tighter, remember that halving the
width of the confidence interval requires quadrupling the number of subjects
in the sample!
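The arithmetic in this section can be checked with a small Python function.
This is a sketch only: the function name is illustrative, and it returns the
unrounded value of the formula. The text rounds to the nearest whole subject,
whereas rounding up is the more conservative convention.

```python
def n_for_proportion_ci(p_guess, delta, z=1.96):
    """Normal-approximation sample size so the CI for p is p-hat +/- delta.

    p_guess: educated guess at the true proportion p.
    delta: desired half-width of the confidence interval.
    z: standard normal percentile z_(1 - alpha/2); 1.96 gives a 95% interval.
    """
    return z ** 2 * p_guess * (1 - p_guess) / delta ** 2


print(n_for_proportion_ci(0.4, 0.10))   # the guess p = 0.4
print(n_for_proportion_ci(0.5, 0.10))   # the conservative choice p = 0.5
print(n_for_proportion_ci(0.5, 0.05))   # half the width, four times the subjects
```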

The normal approximation for calculating the $100(1 - \alpha)$% confidence
interval for p works well if

$$n\hat{p}(1 - \hat{p}) \ge 5$$

Otherwise, exact binomial methods should be used.

In the exact binomial method, the lower $100(\alpha/2)$% confidence limit for
p is determined as the value $p_L$ that satisfies

$$\alpha/2 = \sum_{k=r}^{n} C(n,k)\, p_L^k (1 - p_L)^{n-k}$$

The upper $100(1 - \alpha/2)$% confidence limit for p is determined as the
value $p_U$ that satisfies

$$\alpha/2 = \sum_{k=0}^{r} C(n,k)\, p_U^k (1 - p_U)^{n-k}$$

SAS PROC FREQ provides the exact and asymptotic 100(1 - )% confidence
intervals for a binomial proportion, p.

SAS Example: 6.1_binomial_proportion.sas [2] (from Piantadosi, 2005) This is
a program that illustrates the use of PROC FREQ in SAS for determining an
exact confidence interval for a binomial proportion.

In the example above, n = 19 is the sample size and r = 3 successes are
observed in a binomial trial. The point estimate of p is

$$\hat{p} = 3/19 = 0.16$$

Note, however, that

$$n\hat{p}(1 - \hat{p}) = 19(0.16)(0.84) = 2.55 < 5$$

The 95% confidence interval for p, based on the exact method, is [0.03, 0.40].
The 95% confidence interval for p, based on the normal approximation, is
[-0.01, 0.32], which is modified to [0.00, 0.32] because p represents the
probability of success that is supposed to be restricted to lie within the
[0, 1] interval. Even with the correction to the lower endpoint, the confidence
interval based on the normal approximation does not appear to be very accurate
in this example.
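The two intervals for 3 successes in 19 trials can also be reproduced without
SAS. The Python sketch below solves the two exact binomial tail equations by
bisection and computes the normal-approximation interval for comparison. It is
an illustration, not the course's SAS program: the function names are
hypothetical and bisection is simply one convenient way to solve for the
limits.

```python
from math import comb, sqrt


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect(f, lo, hi, tol=1e-10):
    """Root of a continuous function that changes sign once on [lo, hi]."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies to the left of mid
    return (lo + hi) / 2


def exact_ci(r, n, alpha=0.05):
    """Exact binomial limits: each tail probability is set equal to alpha/2."""
    # Lower limit p_L solves P(X >= r | p_L) = alpha/2 (taken as 0 when r = 0).
    lower = 0.0 if r == 0 else bisect(
        lambda p: (1 - binom_cdf(r - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper limit p_U solves P(X <= r | p_U) = alpha/2 (taken as 1 when r = n).
    upper = 1.0 if r == n else bisect(
        lambda p: binom_cdf(r, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


def normal_ci(r, n, z=1.96):
    """Normal-approximation interval, before truncation to [0, 1]."""
    p_hat = r / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


print(exact_ci(3, 19))
print(normal_ci(3, 19))
```

The normal-approximation lower limit comes out slightly negative, which is the
symptom the text describes when $n\hat{p}(1 - \hat{p})$ is below 5.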

Now it's your turn!

Modify the SAS program above to reflect 11 successes out of 75 trials. Run the
program. Do the results round to (0.08, 0.25) for the 95% exact confidence
limits?

If an investigator estimates p = 0.15 and wants a 95% exact confidence
interval with $\pm\delta$ = 0.1, what sample size is needed? One way to solve
this is to use SAS PROC FREQ in a "guess and check" manner. In this case,
n = 73 with 11 successes will result in a 95% exact confidence interval of
(0.07, 0.25). It may be impossible to achieve the desired $\delta$ exactly,
but an estimate of the required sample size can be provided.

Using the exact confidence interval for a binomial proportion is the better
option if you are not sure that the normal approximation is adequate, for
example when $n\hat{p}(1 - \hat{p})$ is less than 5.

6A.3 - Example: Discarding Ineffective Treatment
An approach for discarding an ineffective treatment in an SE study, based on
the exact binomial method, is as follows. Suppose that the lowest success
rate acceptable to an investigator for the treatment is 0.20. Suppose that the
investigator decides to administer the treatment consecutively to a series of
patients. When can the investigator terminate the SE trial if he continues to
find no treatment successes?

SAS Example: Modifications to the exact confidence interval program used
earlier (6.1_binomial_proportion.sas [2]) can be made to determine when the
exact confidence interval for p no longer contains a certain value.

SAS PROC FREQ (trial-and-error) indicates that the exact one-sided 95%
upper confidence limit for p, when 0 successes out of 14 are observed, is
0.19. Thus, if the treatment fails in each of the first 14 patients, then the
study is terminated.
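When no successes are observed, the exact one-sided upper limit has a closed
form, so trial and error is not strictly needed. With r = 0 and a one-sided
level $\alpha$, the upper limit $p_U$ solves $(1 - p_U)^n = \alpha$. The
Python sketch below uses that relationship; the function names are
illustrative.

```python
from math import ceil, log


def upper_limit_no_successes(n, alpha=0.05):
    """Exact one-sided upper confidence limit for p after 0 successes in n trials.

    Solves (1 - p_U)^n = alpha for p_U.
    """
    return 1 - alpha ** (1 / n)


def failures_to_rule_out(p_min, alpha=0.05):
    """Smallest run of consecutive failures whose upper limit falls below p_min.

    Requires (1 - p_min)^n <= alpha, i.e. n >= log(alpha) / log(1 - p_min).
    """
    return ceil(log(alpha) / log(1 - p_min))


print(round(upper_limit_no_successes(14), 2))  # upper limit after 14 straight failures
print(failures_to_rule_out(0.20))              # failures needed to rule out p = 0.20
```

The same two functions can be used to check the exercises that follow.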

Now, try it yourself!

What is the upper 95% one-sided confidence limit for p when you have seen
no successes in 5 trials?

Here is another one to try... How many straight failures would it take to rule out
a 30% success rate?

6A.4 - Confidence Intervals for Means


For a clinical endpoint that can be approximated by a normal distribution in an
SE study, the $100(1 - \alpha)$% confidence interval for the population mean,
$\mu$, is

$$\bar{Y} \pm t_{n-1,\,1-\alpha/2}\; s/\sqrt{n}$$

where

$\bar{Y} = \sum_{i=1}^{n} Y_i / n$ is the sample mean,

$t_{n-1,\,1-\alpha/2}$ is the appropriate percentile from the $t_{n-1}$
distribution, and

$s^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 / (n - 1)$ is the sample variance and
estimates $\sigma^2$.

If $\sigma$ is known, then a z-percentile can replace the t-percentile in the
$100(1 - \alpha)$% confidence interval for the population mean, $\mu$, that is,

$$\bar{Y} \pm z_{1-\alpha/2}\; \sigma/\sqrt{n}$$

If n is relatively large, say $n \ge 60$, then
$z_{1-\alpha/2} \approx t_{n-1,\,1-\alpha/2}$.

If it is desired for the $100(1 - \alpha)$% confidence interval to be

$$\bar{Y} \pm \delta$$

then

$$n = z_{1-\alpha/2}^2 \, \sigma^2/\delta^2$$

For example, the necessary sample size for estimating the mean reduction in
diastolic blood pressure, where $\sigma$ = 5 mm Hg and $\delta$ = 1 mm Hg, is
$n = (1.96)^2(5)^2/(1)^2 = 96$.
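The blood pressure example can be checked with a one-line Python function.
This is a sketch with an illustrative function name; it returns the unrounded
value of the z-based formula, which the text rounds to the nearest whole
subject.

```python
def n_for_mean_ci(sigma, delta, z=1.96):
    """Sample size so the CI for the mean is Y-bar +/- delta (sigma assumed known).

    sigma: standard deviation of the endpoint.
    delta: desired half-width of the confidence interval.
    z: standard normal percentile z_(1 - alpha/2); 1.96 gives a 95% interval.
    """
    return z ** 2 * sigma ** 2 / delta ** 2


# Diastolic blood pressure example: sigma = 5 mm Hg, delta = 1 mm Hg.
print(n_for_mean_ci(5, 1))
```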

6A.5 - Comparative Treatment Efficacy Studies
Suppose that a comparative treatment efficacy (CTE) trial consists of
comparing two independent treatment groups with respect to the means of the
primary clinical endpoint. Let $\mu_1$ and $\mu_2$ denote the unknown
population means of the two groups, and let $\sigma$ denote the known standard
deviation common to both groups. Also, let $n_1$ and $n_2$ denote the sample
sizes of the two groups.

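The treatment difference in means is $\Delta = \mu_1 - \mu_2$ and the null
hypothesis is $H_0: \Delta = 0$. The test statistic is

$$Z = (\bar{Y}_1 - \bar{Y}_2) \Big/ \left( \sigma \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}} \right)$$

which follows a standard normal distribution when the null hypothesis is true.
If the alternative hypothesis is two-sided, i.e., $H_1: \Delta \ne 0$, then
the null hypothesis is rejected for large values of |Z|.

Under a particular alternative where there is some difference
$\Delta = \mu_1 - \mu_2 \ne 0$, the same statistic

$$Z = (\bar{Y}_1 - \bar{Y}_2) \Big/ \left( \sigma \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}} \right)$$

is still normally distributed with unit variance, but its mean is
$\Delta / (\sigma \sqrt{1/n_1 + 1/n_2})$ rather than zero. The power
calculation rests on this shift.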

Suppose we let $AR = n_1/n_2$ denote the allocation ratio (in most cases we
will assign AR = 1 to get equal sample sizes). If we wish to have a large
enough sample size to detect an effect size $\Delta$ with a two-sided,
$\alpha$-significance level test with $100(1 - \beta)$% statistical power, then

$$n_2 = \left( \frac{AR + 1}{AR} \right) (z_{1-\alpha/2} + z_{1-\beta})^2 \, \sigma^2/\Delta^2$$

and $n_1 = AR \cdot n_2$.

Note this formula matches the sample size formula in our FFDRG text on p.
180, assuming equal allocation to the two treatment groups and multiplying
the result here by 2 to get 2N, which FFDRG use to denote the total sample
size.

If the alternative hypothesis is one-sided, then $z_{1-\alpha}$ replaces
$z_{1-\alpha/2}$ in either formula.

Notice that the sample size expression contains $(\sigma/\Delta)^2$, the
squared reciprocal of the effect size expressed in standard deviation units.
Thus, the sample size increases with the square of the standard deviation and
decreases with the square of the effect size. For example, reducing the
effect size by one-half quadruples the required sample size.

Although this sample size formula assumes that the standard deviation is
known so that a z test can be applied, it works relatively well when the
standard deviation must be estimated and a t-test applied. A preliminary guess
of $\sigma$ must be available, however, either from a small pilot study or a
report in the literature. For smaller sample sizes ($n_1 \le 30$,
$n_2 \le 30$) percentiles from a t distribution can be substituted, although
this results in both sides of the formula involving $n_2$ so that it must be
solved iteratively:

$$n_2 = \left( \frac{AR + 1}{AR} \right) (t_{n_1+n_2-2,\,1-\alpha/2} + t_{n_1+n_2-2,\,1-\beta})^2 \, \sigma^2/\Delta^2$$

6A.6 - Example 1: Comparative Treatment Efficacy Studies
An investigator wants to determine the sample size for comparing two asthma
therapies with respect to the forced expiratory volume in one second (FEV1). A
two-sided, 0.05-significance level test with 90% statistical power is desired.
The effect size is $\Delta$ = 0.25 L and the standard deviation reported in the
literature for a similar population is $\sigma$ = 0.75 L. The investigator
plans to have equal allocation to the two treatment groups (AR = 1).

The first step is to identify the primary response variable. In this
example, FEV1 is a continuous response variable. Assuming that FEV1 has an
approximate normal distribution, the number of patients required for the
second treatment group based on the z formula is
$n_2 = (2)(1.96 + 1.28)^2(0.75)^2/(0.25)^2 = 189$.

Thus, the total sample size required is $n_1 + n_2$ = 189 + 189 = 378.

SAS Example (7.4_sample_size__normal_.sas [3]): This is a program that
illustrates the use of PROC POWER to calculate sample size when comparing two
normal means.

SAS PROC POWER, based on the t formula, yields $n_1 + n_2$ = 191 + 191 = 382.

If the investigator had wanted an allocation ratio of AR = 2 (twice as many
subjects in the first group), then
$n_2 = (1.5)(1.96 + 1.28)^2(0.75)^2/(0.25)^2 = 142$ and
$n_1 = 2 \times 142 = 284$.

The total sample size required is $n_1 + n_2$ = 142 + 284 = 426.

SAS PROC POWER, based on the t formula, yields $n_1 + n_2$ = 143 + 286 = 429.

Notice that the 2:1 allocation, when compared to the 1:1 allocation, requires
an overall larger sample size (429 versus 382).
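The z-formula calculations for the FEV1 example can be reproduced in Python.
This sketch is not PROC POWER and does not implement the t-based iteration.
The function name is illustrative, the defaults are the two-decimal table
values 1.96 and 1.28 used in the hand calculation, and results are rounded up
to whole subjects.

```python
from math import ceil


def n_two_means(delta, sigma, allocation_ratio=1.0, z_alpha=1.96, z_beta=1.28):
    """Group sizes (n1, n2) for comparing two means with the z formula.

    delta: effect size (difference in means to detect).
    sigma: common standard deviation, assumed known.
    allocation_ratio: AR = n1 / n2.
    z_alpha, z_beta: percentiles z_(1 - alpha/2) and z_(1 - beta).
    """
    factor = (allocation_ratio + 1) / allocation_ratio
    n2 = ceil(factor * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)
    n1 = ceil(allocation_ratio * n2)
    return n1, n2


print(n_two_means(0.25, 0.75))                      # equal allocation
print(n_two_means(0.25, 0.75, allocation_ratio=2))  # 2:1 allocation
```

Comparing the two totals shows the cost of unequal allocation under the z
formula, in line with the PROC POWER comparison above.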

Now it is your turn to give it a try!

How many subjects are needed to have 80% power in testing the equality of two
means when subjects are allocated 2:1, using an $\alpha$ = 0.05 two-sided
test? The standard deviation is 10 and the hypothesized difference in means
is 5.

6A.7 - Example 2: Comparative Treatment Efficacy Studies
What if the primary response variable is binary?

When the outcome in a CTE trial is a binary response and the objective is to
compare the two groups with respect to the proportion of success, the results
can be expressed in a 2 × 2 table as

            Group #1     Group #2
Success     r1           r2
Failure     n1 - r1      n2 - r2

There are a variety of methods for performing the statistical test of the null
hypothesis $H_0: p_1 = p_2$, such as a z-test using a normal approximation, a
$\chi^2$ test (basically, a square of the z-test), a $\chi^2$ test with
continuity correction, and Fisher's exact test.

The normal and $\chi^2$ approximations for comparing two proportions are
relatively accurate when these conditions are met:

$$\frac{n_1(r_1 + r_2)}{n_1 + n_2} \ge 5,\quad \frac{n_2(r_1 + r_2)}{n_1 + n_2} \ge 5,\quad \frac{n_1(n_1 + n_2 - r_1 - r_2)}{n_1 + n_2} \ge 5,\quad \frac{n_2(n_1 + n_2 - r_1 - r_2)}{n_1 + n_2} \ge 5$$

Basically, when the expected number in each cell is at least 5, the normal
or chi-square approximation is useful.

Otherwise, Fisher's exact test is recommended. All of these tests are available
in SAS PROC FREQ and will be discussed later in the course.

A sample size formula for comparing the proportions $p_1$ and $p_2$ using the
normal approximation is given below:

$$n_2 = \left( \frac{AR + 1}{AR} \right) (z_{1-\alpha/2} + z_{1-\beta})^2 \, \bar{p}(1 - \bar{p})/(p_1 - p_2)^2$$

where $p_1 - p_2$ represents the effect size and

$$\bar{p} = (AR \cdot p_1 + p_2)/(AR + 1)$$

is the weighted average of the proportions.

(Note this formula is the same as p. 173 in our text FFDRG if you assume the
allocation ratio is 1:1 and double the sample size here to get total sample size
2N as calculated in FFDRG)

Example: Sample size determination for difference in proportions

An investigator wants to compare an experimental therapy to placebo when
the response is success/failure via a two-sided, 0.05 significance level test
and 90% statistical power. She knows from the medical literature that 25% of
the untreated patients will experience success, so she decides that the
experimental therapy is worthwhile if it can yield a 50% success rate. With
equal allocation,
$n_2 = (2)(1.96 + 1.28)^2\{0.375(1 - 0.375)\}/(0.25)^2 = 79$. Thus, the
investigator should enroll $n_1$ = 79 patients into treatment and $n_2$ = 79
into placebo for a total of 158 patients.

With an unequal allocation ratio of AR = 3, $n_1$ = 168 and $n_2$ = 56. Again,
notice that the allocation ratio of AR = 3 yields a total sample size larger
than that for the allocation ratio of AR = 1 (224 vs. 158).
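The normal-approximation numbers in this example can be reproduced with the
Python sketch below. It takes group 1 to be the experimental therapy
($p_1$ = 0.50) and group 2 to be placebo ($p_2$ = 0.25), which is the labeling
that matches the totals quoted above. The function name is illustrative, and
the two-decimal z values from the text are used as defaults.

```python
from math import ceil


def n_two_proportions(p1, p2, allocation_ratio=1.0, z_alpha=1.96, z_beta=1.28):
    """Group sizes (n1, n2) for comparing two proportions (normal approximation).

    p1, p2: anticipated success proportions in groups 1 and 2.
    allocation_ratio: AR = n1 / n2.
    z_alpha, z_beta: percentiles z_(1 - alpha/2) and z_(1 - beta).
    """
    # Weighted average of the two proportions, weighted by group size.
    p_bar = (allocation_ratio * p1 + p2) / (allocation_ratio + 1)
    factor = (allocation_ratio + 1) / allocation_ratio
    n2 = ceil(factor * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2)
    n1 = ceil(allocation_ratio * n2)
    return n1, n2


print(n_two_proportions(0.50, 0.25))                      # AR = 1
print(n_two_proportions(0.50, 0.25, allocation_ratio=3))  # AR = 3
```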

SAS Example (7.5_sample_size__binary_.sas [4]): This is a program that
illustrates the use of PROC POWER to calculate sample size when comparing
two binomial proportions.

SAS PROC POWER for Fisher's exact test yields $n_1$ = 85 and $n_2$ = 85 for
AR = 1, and $n_1$ = 171 and $n_2$ = 57 for AR = 3.

Now, try it yourself!

What would be the sample size required to have 80% power to detect that a
new therapy has a significantly different success rate than the standard
therapy success rate of 30%, if it was expected that the new therapy would
result in at least 40% successes? Use a two-sided test with 0.05 significance
level.

6A.8 - Comparing Treatment Groups Using Hazard Ratios
For many clinical trials, the response is time to an event. The methods of
analysis for this type of variable are generally referred to as survival
analysis methods. The basic approach is to compare survival curves.

With an event time endpoint, it is mathematically convenient to compare
treatment groups (and curves) with respect to the hazard ratio. The survival
function for a treatment group is characterized by $\lambda$, the hazard rate.
At time t, $\lambda(t)$ for a treatment group is defined as the instantaneous
risk of the event (or failure) occurring at time t. In other words, given that
a subject has remained event-free up to time t, the hazard at time t is the
probability, per unit of time, of the event occurring within the next instant.
You can think of the hazard as how steeply the survival curve is falling at
time t, relative to the proportion of subjects still event-free.

The hazard ratio is defined as the ratio of two hazard functions,
$\lambda_1(t)$ and $\lambda_2(t)$, corresponding to two treatment groups.
Typically, we assume proportional hazards, i.e.,
$\Lambda = \lambda_1(t)/\lambda_2(t)$ is a constant function independent of
time. The graphs on the next two slides illustrate the concept of proportional
hazards.

A hazard function may be constant, increasing, or decreasing over time, or
even be a more complex function of time. In trials in which survival time is the
outcome, an increasing hazard function indicates that the instantaneous risk
of death increases throughout the trial.

An example where the hazard function might be decreasing involves the
disease ARDS (adult respiratory distress syndrome), whereby the risk of
death is highest during the early stage of the disease.

A sample size formula for comparing the hazards of two groups via the
logrank test (discussed later in the course) is expressed in terms of the total
number of events, E, that need to occur. For a two-sided, $\alpha$-level
significance test with $100(1 - \beta)$% statistical power, hazard ratio
$\Lambda$, and allocation ratio AR,

$$E = \left( \frac{(AR + 1)^2}{AR} \right) (z_{1-\alpha/2} + z_{1-\beta})^2 / (\log_e \Lambda)^2$$

(Note that the formula above matches the simple formula on p. 185 of the FFDRG
text, if it is assumed that all participants will have an event. However, we
most often have censored data, that is, a number of participants who do not
experience the event before the trial ends.)

Since we do not expect all persons in the trial to experience an event, the
sample size must be larger than the required number of events.

Suppose that $p_1$ and $p_2$ represent the anticipated event rates in the two
treatment groups. Then the sample sizes can be determined from
$n_2 = E/(AR \cdot p_1 + p_2)$ and $n_1 = AR \cdot n_2$.

If a hazard function is assumed to be constant during the follow-up period
[0, T], then it can be expressed as
$\lambda(t) = \lambda = -\log_e(1 - p)/T$. In such a situation, the hazard
ratio for comparing two groups is
$\Lambda = \log_e(1 - p_1)/\log_e(1 - p_2)$.

A constant hazard rate, $\lambda(t) = \lambda$ for all time points t,
corresponds to an exponential survival curve, i.e., survival at time t =
$\exp(-\lambda t)$. Survival curves plot the probability that a subject has
not yet experienced the event as a function of time.

Example

An investigator wants to compare an experimental therapy to placebo when
the response is time to infection via a two-sided, 0.05-significance level test
with 90% statistical power and equal allocation. He plans to follow each
patient for one year. He expects that 40% of the placebo group will
experience infection and he considers a 20% rate in the therapy group as
clinically relevant.

If he assumes constant hazard functions, then

$$\Lambda = \log_e(0.6)/\log_e(0.8) = 2.29$$

Then the number of required events is

$$E = (4)(1.96 + 1.28)^2/\{\log_e(2.29)\}^2 = 62$$

and the sample sizes are

$$n_2 = E/(AR \cdot p_1 + p_2) = 62/(0.4 + 0.2) = 104 \text{ and } n_1 = 104$$
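The three steps of this example (hazard ratio, required events, group sizes)
can be chained together in Python. The sketch assumes constant hazards, takes
group 1 to be placebo ($p_1$ = 0.4) and group 2 to be the therapy
($p_2$ = 0.2) as in the calculation above, and uses the two-decimal z values
from the text. The function name is illustrative, and this is not a
replacement for PROC POWER.

```python
from math import ceil, log


def logrank_events_and_sizes(p1, p2, allocation_ratio=1.0,
                             z_alpha=1.96, z_beta=1.28):
    """Hazard ratio, required events E, and group sizes under constant hazards.

    p1, p2: anticipated event proportions by the end of follow-up.
    allocation_ratio: AR = n1 / n2.
    z_alpha, z_beta: percentiles z_(1 - alpha/2) and z_(1 - beta).
    """
    # With constant hazards over the same follow-up time T, T cancels in the ratio.
    hazard_ratio = log(1 - p1) / log(1 - p2)
    factor = (allocation_ratio + 1) ** 2 / allocation_ratio
    events = ceil(factor * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)
    # Not everyone has an event, so enrollment must exceed the event count.
    n2 = ceil(events / (allocation_ratio * p1 + p2))
    n1 = ceil(allocation_ratio * n2)
    return hazard_ratio, events, n1, n2


print(logrank_events_and_sizes(0.4, 0.2))
```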


SAS Example (7.6_sample_size__time_.sas [5]): This is a program that
illustrates the use of PROC POWER to calculate sample size when comparing
two hazard functions.

Additional comments on this program: Note the curve statements indicate
points on the survival curves. In this example, at the end of study, at time
1.01 (follow-up plus accrual in SAS), the proportion in the placebo group
without an event is 0.6 and the proportion remaining in the therapy group is
0.8.

SAS PROC POWER for the logrank test requires information on the accrual
time and the follow-up time. It assumes that if the accrual (recruitment) period
is of duration $T_1$ and the follow-up time is of duration $T_2$, then the
total study time is of duration $T_1 + T_2$. It assumes, however, that if a
patient is recruited at time $T_1/2$, then the follow-up period for that
patient is $T_1/2 + T_2$ instead of $T_2$. This assumption may be reasonable
for observational studies, but not for clinical trials in which follow-up on
each patient is terminated when the patient reaches time $T_2$. Therefore, for
a clinical trial situation, set accrual time in SAS PROC POWER equal to a very
small positive number. For the given example, SAS PROC POWER yields
$n_1$ = 109 and $n_2$ = 109.

SAS notes for PROC POWER for survival [6]

6A.9 - Expanded Safety Studies


Expanded Safety (ES) trials are phase IV trials designed to estimate the
frequency of uncommon adverse events that may have been undetected in
earlier studies. These studies may be nonrandomized.

Typically, we assume that the study population is large, the probability of an
adverse event is small (because it did not crop up in prior trials), and all
participants in the cohort of size m are followed for approximately the same
length of time. Under these assumptions we can model the probability of
exactly d events occurring based on a Poisson probability function, i.e.,

$$\Pr[D = d] = (\xi m)^d \exp(-\xi m)/d!$$

where $\xi$ is the adverse event rate.

The cohort should be large enough to have a high probability of observing at
least one event when the event rate is $\xi$. Thus, we want

$$\beta = \Pr[D \ge 1] = 1 - \Pr[D = 0] = 1 - \exp(-\xi m)$$

to be relatively large. With respect to the cohort size, this means that m should
be selected such that

$$m = -\log_e(1 - \beta)/\xi$$

Example

Suppose a pharmaceutical company is planning an ES trial for a new
anti-arrhythmia drug. The company wants to determine the cohort size for
following patients on the drug for a period of two years in terms of myocardial
infarction. They want to have a 0.99 probability ($\beta$ = 0.99) for detecting
a myocardial infarction rate of one per thousand ($\xi$ = 0.001). This yields a
cohort size of m = 4,605.

(Note that the $\beta$ value in this problem is a probability, not quite the
same as the $\beta$ that we use in calculating power.)
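The cohort size calculation is simple enough to verify directly. The Python
sketch below follows the Poisson argument in this section; the function names
are illustrative, and the result is left unrounded (the text reports it to the
nearest whole subject).

```python
from math import exp, log


def cohort_size(detect_prob, event_rate):
    """Cohort size m giving probability detect_prob of seeing at least one event.

    Solves detect_prob = 1 - exp(-event_rate * m) for m.
    """
    return -log(1 - detect_prob) / event_rate


def prob_at_least_one(m, event_rate):
    """Poisson probability of observing one or more events in a cohort of size m."""
    return 1 - exp(-event_rate * m)


m = cohort_size(0.99, 0.001)
print(m)                                 # unrounded cohort size
print(prob_at_least_one(round(m), 0.001))  # check: close to 0.99
```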
6A.10 - Adjustment Factors for Sample Size Calculations

When calculating a sample size, we may need to adjust our calculations due
to multiple primary comparisons or for nonadherence to therapy.

If there is more than one primary outcome variable (for example, co-primary
outcomes) or more than one primary comparison (for example, 3 treatment
groups), then the significance level should be adjusted to account for the
multiple comparisons in order not to inflate the overall false-positive rate.

For example, suppose a clinical trial will involve two treatment groups and a
placebo group. The investigator may decide that there are two primary
comparisons of interest, namely, each treatment group compared to placebo.
The simplest adjustment to the significance level for each test is the
Bonferroni correction, which uses $\alpha/2$ instead of $\alpha$.

In general, if there are K comparisons of primary interest, then the Bonferroni
correction is to use a significance level of $\alpha/K$ for each of the K
comparisons. The Bonferroni correction is not the most powerful or most
sophisticated multiple comparison adjustment, but it is a conservative approach
and easy to apply.
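To see how the Bonferroni correction feeds into a sample size formula, the
Python sketch below computes the per-comparison significance level and the
corresponding normal percentile that would replace the unadjusted one. The
function name is illustrative and the two-comparison case is the example from
the text.

```python
from statistics import NormalDist


def bonferroni_z(alpha, k, two_sided=True):
    """Per-comparison significance level and its critical z value.

    alpha: overall significance level.
    k: number of primary comparisons.
    two_sided: if True, the per-comparison level is split across both tails.
    """
    per_comparison_alpha = alpha / k
    tail = per_comparison_alpha / 2 if two_sided else per_comparison_alpha
    return per_comparison_alpha, NormalDist().inv_cdf(1 - tail)


# Two treatment groups each compared with placebo at an overall alpha of 0.05.
print(bonferroni_z(0.05, 2))
```

The larger critical value enters the sample size formulas in place of the
unadjusted percentile, so each comparison needs more subjects than it would
without the correction.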

In the case of multiple primary endpoints, an adjustment to the significance
level may not be necessary, depending on how the investigator plans to
interpret the results. For example, suppose there are two primary outcome
variables. If the investigator plans to claim success of the trial
if either endpoint yields a statistically significant treatment effect, then
an adjustment to the significance level is warranted. If the investigator plans to
claim success of the trial only if both endpoints yield statistically significant
treatment effects, then an adjustment to the significance level is not
necessary. Thus, an adjustment to the significance level in the presence of
multiple primary endpoints depends on whether it is an "or" or an "and"
situation.

Another consideration is nonadherence to the protocol (noncompliance). All
participants randomized to therapy are expected to be included in the primary
statistical analysis, an intention-to-treat [7] analysis. Intention-to-treat
analysis will compare the treatments using all data from subjects in the
group to which they were originally assigned, regardless of whether or not
they followed the protocol, stayed on therapy, etc. Some participants will
choose to withdraw from a trial before it is complete. Every effort will be
made to continue obtaining data from all randomized subjects; for those who
withdraw from the study completely and do not provide data, an imputation
procedure may be required to represent their missing data in subsequent data
analyses. Some participants assigned to active therapy discontinue therapy but
continue to provide data (therapeutic drop-outs). Some on a placebo or control
add an active therapy (drop-ins) and continue to be observed. Nonadherence
(noncompliance) can dilute the treatment effect, which lowers the power of the
study and biases the estimates of treatment effects.

Thus, a further adjustment to the sample size estimate may be made based
on the anticipated drop-out and drop-in rates in each arm (see Wittes, 2002).
A similar formula is on p. 179 of FFDRG.

$$N^* = \frac{N}{(1 - R_O - R_I)^2}$$

where $N$ is the sample size without regard to nonadherence and $N^*$ is the
adjusted number for that treatment arm.

$R_O$ and $R_I$ represent the proportion of participants anticipated to
discontinue test therapy and the proportion in the control who will add or
change to a more effective therapy, respectively.

Let's work an example.

Suppose a study has two treatment groups and will compare test therapy to
placebo. With only one primary comparison, we do not need to adjust the
significance level for multiple comparisons. Suppose that the sample size for
a certain power, significance level and clinically important difference works
out to be 200 participants/group or 400 total.

To adjust for noncompliance/nonadherence, we must estimate the proportion
from the placebo group who will begin an active therapy before the study is
complete. Let's estimate these 'drop-ins' to be 0.20. In the test therapy group,
we estimate 0.10 will discontinue active therapy.

To adjust for noncompliance, we calculate

$$N^* = \frac{200}{(1 - 0.2 - 0.1)^2} = \frac{200}{0.49} \approx 408.2$$

which rounds up to $N^*$ = 409/group or 818 total. What an increase in sample
size to maintain the power! (Note that whether we use n/group, 200/0.49, or
total n, 400/0.49, we get the same sample sizes. Just remember what your N
represents. If there is any fraction at the end of sample size calculations,
round UP to the next number divisible by the number of treatment groups.)
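The worked example above can be checked with a short Python sketch. The function names are hypothetical; the formula is the nonadherence adjustment stated in this section, and the rounding rule is the one described in the note.

```python
import math


def adjust_for_nonadherence(n: float, drop_out: float, drop_in: float) -> float:
    """Inflate a sample size N to N* = N / (1 - R_O - R_I)^2."""
    retained = 1.0 - drop_out - drop_in
    if retained <= 0:
        raise ValueError("drop-out plus drop-in rates must be below 1")
    return n / retained ** 2


def round_up_to_multiple(n: float, groups: int) -> int:
    """Round up to the next integer divisible by the number of groups."""
    return math.ceil(n / groups) * groups


# 200 per group, 10% drop-outs on test therapy, 20% drop-ins on placebo.
per_group = math.ceil(adjust_for_nonadherence(200, drop_out=0.10, drop_in=0.20))
total = round_up_to_multiple(adjust_for_nonadherence(400, 0.10, 0.20), groups=2)
print(per_group, total)
```

Working from the per-group size or from the total gives the same answer once the total is rounded up to a multiple of the number of groups.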

These are relatively simple calculations to introduce the idea of adjusting for
noncompliance as well as for multiple comparisons. More complicated
processes can be modeled.

Finally, when estimating a sample size for a study, an iterative process may be
followed (adapted from Wittes, 2002):

1. Determine the null and alternative hypotheses as related to the primary
outcome.

2. What is the desired type I error rate and power? If there is more than one
primary outcome or comparison, make the required adjustments to the type I
error rate.

3. Determine the population that will be studied. What information is there
about the variability of the primary outcome in this population? What would
constitute a clinically important difference?

4. If the study is measuring time to failure, how long is the follow-up period?
What assumptions should be made about recruitment?

5. Consider ranges of rates or events, loss to follow-up, competing risks, and
noncompliance.

6. Calculate sample size over a range of reasonable assumptions.

7. Select a sample size. Plot power curves as the parameters range over
reasonable values.

8. Iterate as needed.
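Step 6 can be sketched as a small sensitivity table. The sketch below assumes the standard normal-approximation formula for a two-sided, two-sample comparison of means with equal allocation, $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2\sigma^2/\Delta^2$ per group, which is not restated in this section. The function name and the values of the standard deviation and clinically important difference are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_superiority(sigma: float, delta: float,
                            alpha: float = 0.05, power: float = 0.90) -> int:
    """Per-group n for a two-sided test of two means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical ranges for the standard deviation and the important difference.
sigmas = (8.0, 10.0, 12.0)
deltas = (4.0, 5.0, 6.0)
for sigma in sigmas:
    row = [n_per_group_superiority(sigma, delta) for delta in deltas]
    print(f"sigma={sigma}: n per group for deltas {deltas} -> {row}")
```

A table like this shows how strongly the required sample size depends on the assumed variability and effect size before a single value is selected.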

Which of these adjustments (or others, such as modeling dropout rates that
are not independent of outcome) is important for a particular study depends
on the study objectives. Not only must we consider whether there is more than
one primary outcome or multiple primary comparisons, we must also consider the
nature of the trial. For example, if the study results are headed to a regulatory
agency, using a primary intention-to-treat analysis, it is important to
demonstrate an effect of a certain magnitude. Adjusting the sample size to
account for non-adherence makes sense. On the other hand, in a comparative
effectiveness study, the objective may be to estimate the difference in effect
when the intervention is prescribed vs the control, regardless of adherence. In
this situation, the dilution of effect due to nonadherence may be of little
concern.

As we noted at the beginning of this lesson, sample size calculations
are estimates! When stating a required sample size, always state any
assumptions that have been made in the calculations.

6A.11 - Summary
In this lesson, among other things, we learned to:

Identify studies for which sample size is an important issue

Estimate the sample size required for a confidence interval for $p$ for
given $\delta$ and $\alpha$, using normal approximation and Fisher's exact
methods

Estimate the sample size required for a confidence interval for $\mu$ for
given $\delta$ and $\alpha$, using normal approximation when the sample size is
relatively large

Estimate the sample size required for a test of $H_0: \mu_1 = \mu_2$ to have
$100(1 - \beta)$% power for given $\delta$ and $\alpha$, using normal
approximation, with equal or unequal allocation.

Estimate the sample size required for a test of $H_0: p_1 = p_2$ for given
$\delta$, $\alpha$ and $\beta$, using normal approximation and Fisher's exact
methods

Use a SAS program to estimate the number of events required for a
logrank comparison of two hazard functions to have $100(1 - \beta)$% power
with given $\alpha$

Use Poisson probability methods to determine the cohort size required
to have a certain probability of detecting a rare event that occurs at a
given rate.

Adjust sample size requirements to account for multiple comparisons
and the anticipated noncompliance rates.

Let's put what we have learned to use by completing the homework problems!

Lesson 6: Sample Size and Power - Part B
Introduction

This week we continue exploring the issues of sample size and power, this
time with regard to the differing purposes of clinical trials. Often the objective
of the trial is to establish that a therapy is efficacious, but what is the proper
control group? Can superiority to placebo be clearly established when there
are other effective therapies on the market? These questions lead to special
considerations based on whether the trial has an objective of establishing
superiority, equivalence or non-inferiority. So, let's move ahead.

Learning objectives & outcomes

Upon completion of this week's lessons, you should be able to do the
following:

Distinguish between superiority, non-inferiority and equivalence trials in
terms of

o objectives

o control group

o hypotheses tested and

o formation of confidence intervals.

Recognize characteristics of a clinical trial with high external validity

Define which data are included in an intention-to-treat analysis

Recognize the major considerations for designing an equivalence or
non-inferiority trial.

Perform sample size calculations for some equivalence and non-inferiority
trials, using SAS programs.
References

Berger RL, Hsu JC. Bioequivalence trials, intersection-union tests, and
equivalence confidence sets. Statistical Science 1996, 11: 283-319.

Piantadosi Steven. (2005) Sample size and power. In: Piantadosi
Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ:
John Wiley and Sons, Inc.

6B.1 - Control Groups


Placebo-controlled trials typically provide an unambiguous statement of the
research hypothesis: either we want to show that the experimental treatment
is superior to placebo (one-sided alternative) or, as is more often the case, we
want to show that the experimental treatment is different from placebo
(two-sided alternative). For this reason, we frequently refer to a
placebo-controlled trial as a confirmatory trial or, in more recent
terminology, a superiority trial (even if we are using a two-sided
alternative).

Active control groups are often used because placebo control groups are
unethical, such as when:

1. the disease is life-threatening or debilitating, and/or

2. an effective therapy already exists and is considered standard-of-care.

Investigators can use an active control group in a superiority trial, an
equivalence trial, or a non-inferiority trial. The new treatment may be preferred
due to less cost, fewer side effects or less impact on quality of life. Or the new
treatment may have superior efficacy.

Equivalence trials and non-inferiority trials have different objectives than
superiority trials. The objective of an equivalence trial is to demonstrate that a
therapy is equivalent to the active control (it is not inferior to and not superior
to the active control). 'Equivalent' might not be the best word choice for this
type of trial, as we will see later. The objective of a non-inferiority trial is to
demonstrate that a therapy is not inferior to the active control, i.e., it is not
worse than the active control.

There are a number of issues related to the design and analysis of
equivalence and non-inferiority trials that are not well understood by clinical
investigators. We will examine these issues using examples of such trials in
this lesson.

6B.2 - Combination Therapy Trials


Combination therapy trials are an example where the appropriateness of a
placebo control must be carefully considered.

Suppose that for a particular disease or condition, there exists a standard
therapy that is accepted as the best available treatment (standard-of-care).
The standard-of-care could be a drug, a medical device, a surgical procedure,
diet, exercise, etc., or some combination of these various regimens.

In the MIRACLE trial, the standard-of-care for the eligible patients with heart
failure during the course of the trial consisted of some combination of the
following medications:

Diuretic

Angiotensin-converting enzyme (ACE) inhibitor or angiotensin-receptor
blocker

Digitalis

Beta-blocker

Suppose that the experimental therapy is of a different modality or different
mechanism of action than the standard-of-care. If so, then it may be possible
to use the experimental therapy in combination with the standard-of-care and
we can consider designing a two-armed trial that compares:

standard-of-care + experimental therapy

versus

standard-of-care + placebo therapy

This situation is comparable to a superiority trial because the research
objective is to demonstrate superiority of the combination therapy to the
standard-of-care.

In the MIRACLE trial, for example, the comparison consisted of:

standard-of-care + pacemaker

versus

standard-of-care + inactive pacemaker

In other situations, a superiority trial is not feasible. If the experimental therapy
is of a similar modality to one of the components of standard-of-care, then it
may not be appropriate to combine the experimental therapy with the
standard-of-care.

There are two other possibilities to consider for designing the clinical trial,
namely, an equivalence trial and a non-inferiority trial.

6B.3 - Equivalence Trials


For an equivalence trial, it is necessary to determine a zone of clinical
equivalence prior to the trial onset.

For example, consider standard (active control) and experimental
antihypertensive drugs (drugs that control blood pressure). Suppose that the
standard drug yields a mean reduction of 5 mm Hg in diastolic blood pressure
for a certain patient population. The investigator may decide that the
experimental drug is clinically equivalent to the standard drug if its mean
reduction in diastolic blood pressure is 3-7 mm Hg. This is based on clinical
judgment and there may be differences of opinion on this 'arbitrary' level of
equivalence.

Thus, the two drugs are considered equivalent if the difference in means
between the two therapies does not exceed 2 mm Hg. Let's suppose that we are
willing to accept this level.

In general, the zone of equivalence is defined by $\Psi$. The difference in
population means between the experimental therapy and the active control,
$\mu_E - \mu_A$, should lie within $(-\Psi, +\Psi)$. Differences in response
smaller than $\Psi$ in magnitude are considered 'equally effective' or
'noninferior'.

In nearly every equivalence trial, the selection of $\Psi$ is arbitrary and can
be controversial. Some researchers recommend that $\Psi$ be selected as less
than one-half of the magnitude of the effect observed from the superiority
trials comparing the active control to placebo.

Given what we know in the antihypertensive example above, $\Psi = 2$
satisfies this requirement (2 / 5 = 0.4 < 0.5), but why not select $\Psi = 1$?

Here is a second issue to consider...

Unlike a placebo-controlled trial, an equivalence trial does not provide a
natural check for internal validity because equivalence of the experimental
and active control therapies does not necessarily imply that either of them is
effective. In other words, if a third treatment arm of placebo had been included
in the trial, it is possible that neither the experimental therapy nor the active
control therapy would demonstrate superiority over placebo. There is no direct
establishment of superiority inherent in the way the trial is set up.

The investigator needs to select an active control therapy for the equivalence
trial that has been proven to be superior to placebo. An important assumption
is that the active control would be superior to placebo (had placebo been a
treatment arm in the current trial).

In the past, a few equivalence trials incorporated appropriate active controls,
but at doses less than recommended (rendering them ineffective). It is
important to select the proper control, and use it at an appropriate dose level.

One way to ascertain internal validity is through an external validity check,
e.g., compare the experimental and active control therapies of the current
study to published reports for comparative trials that involve the active control
therapy versus a placebo control. Are similar results observed for the active
therapy in the equivalence trial as in the published study against placebo?

External comparisons should examine response levels, patient compliance,
withdrawal rates, use of rescue medications, etc. An external validity check is
only possible if the chosen active control therapy for the equivalence trial was
determined effective in a superiority trial. An under-dosed or over-dosed
regimen for the active control therapy in an equivalence trial can bias the
results and interpretations. In addition, the design for the equivalence trial
should mimic (within reason) the design for the superiority trial. Some of this
advice is difficult to follow and may be impossible to implement.

(Another aspect of internal validity, of course, is the quality of the trial, in
terms of inclusion/exclusion criteria, dosing regimens, quality control, etc. Do
not run a sloppy study!)

The U.S. Food and Drug Administration (FDA) and the National Institutes of
Health (NIH) typically require intent-to-treat (ITT) analyses in placebo-
controlled trials. In an ITT analysis, data on all randomized patients are
included in the analysis, regardless of protocol violations, lack of adherence,
withdrawal, incorrectly taking the other treatment, etc. The ITT analysis
reflects what will happen in the real world, outside the realm of a controlled
clinical trial.

Is this appropriate? In a superiority trial, the ITT analysis usually is
conservative because it tends to dilute the difference between the treatment
arms. There is more 'noise' in an ITT analysis. This is due to the increased
variability from protocol violations, lack of adherence, withdrawal, etc. You can
overcome this noise by increasing sample size.

In an equivalence trial, the ITT analysis still is appropriate. There is a
misconception that the ITT analysis will have the opposite effect in an
equivalence trial, i.e., it will be easier to demonstrate equivalence. This is not
so. Even with an ITT analysis in an equivalence trial, it still is important to
conduct a well-designed study with sufficient sample size and good quality
control.

An alternative to an intent-to-treat analysis is a protocol analysis (often
called a per-protocol analysis), whereby subjects are analyzed according to the
treatment received. A protocol analysis excludes subjects who did not satisfy
the inclusion criteria, did not comply with taking study medications, violated
the protocol, etc. In other words, data from patients who did not follow the
protocol are excluded from the analysis. A protocol analysis is expected to
enhance differences between treatments, so it usually will be conservative for
an equivalence trial. Obviously, a protocol analysis is susceptible to many
biases and must be performed very carefully. You may think that you are
removing all of the biases, when in fact you may not be. A protocol analysis
could be considered as supplemental to the ITT analysis. The U.S. FDA moved to
ITT analyses years ago to avoid biases introduced when researchers selectively
excluded patients from analysis because of various protocol deviations. Many of
the major medical journals will only accept ITT analyses for the same reasons.

6B.4 - Non-Inferiority Trials


A non-inferiority trial is similar to an equivalence trial. The research question in
a non-inferiority trial is whether the experimental therapy is not inferior to the
active control (whereas the experimental therapy in an equivalence trial
should not be inferior to, nor superior to, the active control). Thus, a
non-inferiority trial is one-sided, whereas an equivalence trial is two-sided.
(For non-inferiority, we want the experimental therapy to be no worse than the
active control by more than a pre-specified margin.)

Assume that the larger response is the better response. The one-sided zone
of non-inferiority is defined by $-\Psi$, i.e., the difference in population
means between the experimental therapy and the active control,
$\mu_E - \mu_A$, should lie within $(-\Psi, +\infty)$.

Many of the same issues that are critical for designing an equivalence trial
also are critical for designing a non-inferiority trial, namely, appropriate
selection of an active control and appropriate selection of the zone of clinical
non-inferiority defined by $\Psi$.

Hypertensive Example

Consider the previous example with the standard and experimental
antihypertensive therapies.

The researchers may decide that the experimental drug is clinically not inferior
to the standard drug if its mean reduction in diastolic blood pressure is at least
3 mm Hg ($\Psi = 2$). Thus, the difference in population means between the
experimental therapy and the active control therapy, $\mu_E - \mu_A$, should
lie within $(-\Psi, +\infty) = (-2, +\infty)$. It does not matter if the
experimental drug is much better than the active control drug, provided that it
is not inferior to the active control drug.

Because a non-inferiority trial design allows for the possibility that the
experimental therapy is superior to the active control therapy, the non-
inferiority design is preferred over the equivalence design. The equivalence
design is useful when evaluating generic drugs.

6B.5 - Statistical Inference - Hypothesis Testing
Statisticians construct the null hypothesis and the alternative hypothesis for
statistical hypothesis testing such that the research hypothesis is the
alternative hypothesis:

$$H_0: \{\text{non-equivalence}\} \quad \text{vs.} \quad H_1: \{\text{equivalence}\}$$

or

$$H_0: \{\text{inferiority}\} \quad \text{vs.} \quad H_1: \{\text{non-inferiority}\}$$

In terms of the population means, the hypotheses for testing equivalence are
expressed as:

$$H_0: \{\mu_E - \mu_A \le -\Psi \ \text{ or } \ \mu_E - \mu_A \ge \Psi\}$$

vs.

$$H_1: \{-\Psi < \mu_E - \mu_A < \Psi\}$$

also expressed as

$$H_0: \{|\mu_E - \mu_A| \ge \Psi\} \quad \text{vs.} \quad H_1: \{|\mu_E - \mu_A| < \Psi\}$$

In terms of the population means, the hypotheses for testing non-inferiority are
expressed as

$$H_0: \{\mu_E - \mu_A \le -\Psi\} \quad \text{vs.} \quad H_1: \{\mu_E - \mu_A > -\Psi\}$$

The null and alternative hypotheses for an equivalence trial can be
decomposed into two distinct hypothesis testing problems, one for
non-inferiority:

$$H_{01}: \{\mu_E - \mu_A \le -\Psi\} \quad \text{vs.} \quad H_{11}: \{\mu_E - \mu_A > -\Psi\}$$

and one for non-superiority:

$$H_{02}: \{\mu_E - \mu_A \ge \Psi\} \quad \text{vs.} \quad H_{12}: \{\mu_E - \mu_A < \Psi\}$$

The null hypothesis of non-equivalence is rejected if and only if the null
hypothesis in the non-inferiority test (H01) is rejected AND the null hypothesis
in the non-superiority test (H02) is rejected.

This rationale leads to what is called two one-sided testing (TOST). If the data
are approximately normally distributed, then two-sample t tests can be
applied. If normality is suspect, then Wilcoxon rank-sum tests can be applied.

With respect to two-sample t tests, reject the null hypothesis of inferiority if:

$$t_{inf} = \frac{\bar{Y}_E - \bar{Y}_A + \Psi}{s\sqrt{\frac{1}{n_E} + \frac{1}{n_A}}} > t_{n_E + n_A - 2,\ 1 - \alpha}$$

and reject the null hypothesis of superiority if:

$$t_{sup} = \frac{\bar{Y}_E - \bar{Y}_A - \Psi}{s\sqrt{\frac{1}{n_E} + \frac{1}{n_A}}} < -t_{n_E + n_A - 2,\ 1 - \alpha}$$

where $s$ is the pooled sample estimate of the standard deviation, calculated as
the square root of the pooled sample estimate of the variance:

$$s^2 = \frac{\sum_{i=1}^{n_E} (Y_{Ei} - \bar{Y}_E)^2 + \sum_{j=1}^{n_A} (Y_{Aj} - \bar{Y}_A)^2}{n_E + n_A - 2}$$

Note that each one-sided t test is conducted at the $\alpha$ significance level.
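The two one-sided tests can be sketched in Python from summary statistics. This is a minimal illustration with hypothetical function names. The critical value is passed in as an argument because the Python standard library has no t quantile function. The numbers are the summary statistics of the worked example in Section 6B.6, including its quoted percentile of 1.67 for 58 degrees of freedom.

```python
import math


def tost_statistics(mean_e: float, mean_a: float, pooled_sd: float,
                    n_e: int, n_a: int, margin: float) -> tuple[float, float]:
    """Return (t_inf, t_sup) for the two one-sided t tests."""
    se = pooled_sd * math.sqrt(1.0 / n_e + 1.0 / n_a)
    diff = mean_e - mean_a
    t_inf = (diff + margin) / se  # tests H01: difference <= -margin
    t_sup = (diff - margin) / se  # tests H02: difference >= +margin
    return t_inf, t_sup


def tost_decision(t_inf: float, t_sup: float,
                  t_crit: float) -> tuple[bool, bool, bool]:
    """Return (reject H01, reject H02, equivalence concluded)."""
    reject_inferiority = t_inf > t_crit
    reject_superiority = t_sup < -t_crit
    return (reject_inferiority, reject_superiority,
            reject_inferiority and reject_superiority)


t_inf, t_sup = tost_statistics(17.4, 20.6, 6.5, 30, 30, margin=4.0)
print(round(t_inf, 2), round(t_sup, 2), tost_decision(t_inf, t_sup, t_crit=1.67))
```

With these inputs only the non-superiority null is rejected, so equivalence is not concluded, which agrees with the confidence interval conclusion reached in that worked example.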

6B.6 - Statistical Inference - Confidence Intervals
Confidence intervals can be used in place of the statistical tests. Reporting of
confidence intervals is more informative because it indicates the magnitude of
the treatment difference and how closely it approaches the equivalence zone.
The $100(1 - \alpha)$% confidence interval that corresponds to testing the null
hypothesis of non-equivalence versus the alternative hypothesis of
equivalence at the $\alpha$ significance level has the following limits:

$$\text{lower limit} = \min\left[0,\ (\bar{Y}_E - \bar{Y}_A) - s\sqrt{\frac{1}{n_E} + \frac{1}{n_A}}\ t_{n_E + n_A - 2,\ 1 - \alpha}\right]$$

$$\text{upper limit} = \max\left[0,\ (\bar{Y}_E - \bar{Y}_A) + s\sqrt{\frac{1}{n_E} + \frac{1}{n_A}}\ t_{n_E + n_A - 2,\ 1 - \alpha}\right]$$

This confidence interval does provide $100(1 - \alpha)$% coverage (see Berger RL,
Hsu JC. Bioequivalence trials, intersection-union tests, and equivalence
confidence sets. Statistical Science 1996, 11: 283-319).

Some researchers mistakenly believe that a $100(1 - 2\alpha)$% confidence
interval is consistent with testing the null hypothesis of non-equivalence
versus the alternative hypothesis of equivalence at the $\alpha$ significance
level. Note that the Berger and Hsu $100(1 - \alpha)$% confidence interval is
similar to the $100(1 - 2\alpha)$% confidence interval in its construction
except that (1) the lower limit, if positive, is set to zero, and (2) the upper
limit, if negative, is set to zero.

If the $100(1 - \alpha)$% confidence interval lies entirely within
$(-\Psi, +\Psi)$, then the null hypothesis of non-equivalence is rejected in
favor of the alternative hypothesis of equivalence at the $\alpha$ significance
level.

For a non-inferiority trial, the two-sample t statistic labeled $t_{inf}$,
previously discussed, can be applied to test:

$$H_0: \{\mu_E - \mu_A \le -\Psi\} \quad \text{vs.} \quad H_1: \{\mu_E - \mu_A > -\Psi\}$$

Because a non-inferiority design reflects a one-sided situation, only the
$100(1 - \alpha)$% lower confidence limit is of interest. If the
$100(1 - \alpha)$% lower confidence limit lies within $(-\Psi, +\infty)$, then
the null hypothesis of inferiority is rejected in favor of the alternative
hypothesis of non-inferiority at the $\alpha$ significance level.

The FDA typically is more stringent than a one-sided 0.05-level non-inferiority
test requires. It typically requires companies to use $\alpha = 0.025$ for a
non-inferiority trial, so that the one-sided test or lower confidence limit is
comparable to what would be used in a two-sided superiority trial.

[Figure panels: Equivalence, Non-Equivalence, Non-Inferiority, Inferiority]

Example

As an example, suppose an investigator conducted an equivalence trial with
30 patients in each of the experimental therapy and active control groups
($n_E = n_A = 30$). He defines the zone of equivalence with $\Psi = 4$. The
sample means and the pooled sample standard deviation are

$$\bar{Y}_E = 17.4, \quad \bar{Y}_A = 20.6, \quad s = 6.5$$

The t percentile, $t_{58, 0.95}$, can be found from the TINV function in SAS as
TINV(0.95,58), which yields $t_{58, 0.95} = 1.67$. The margin of error is
therefore $1.67 \times 6.5 \times \sqrt{1/30 + 1/30} \approx 2.8$. Thus, using
the formulas in the section above, the lower limit = min{0, -3.2 - 2.8} =
min{0, -6.0} = -6.0; the upper limit = max{0, -3.2 + 2.8} = max{0, -0.4} = 0.0.
This yields a 95% confidence interval for $\mu_E - \mu_A$ of (-6.0, 0.0) for
testing equivalence. Because the 95% confidence interval for $\mu_E - \mu_A$
does not lie entirely within $(-\Psi, +\Psi)$ = (-4, +4), the null hypothesis of
non-equivalence is not rejected at the 0.05 significance level. Hence, the
investigator cannot conclude that the experimental therapy is equivalent to the
active control.

Suppose this had been conducted as a non-inferiority trial instead of an
equivalence trial, and he defines the zone of non-inferiority with $\Psi = 4$,
i.e., $(-4, +\infty)$. The 95% lower confidence limit for $\mu_E - \mu_A$ is
-6.0, which does not lie within $(-4, +\infty)$. Therefore, the investigator
cannot claim non-inferiority of the experimental therapy to the active control.
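The interval in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the t percentile of 1.67 quoted in the example is supplied by hand because the Python standard library does not provide t quantiles.

```python
import math


def equivalence_confidence_interval(mean_e: float, mean_a: float,
                                    pooled_sd: float, n_e: int, n_a: int,
                                    t_crit: float) -> tuple[float, float]:
    """Limits of the interval described above: the lower limit is capped at
    zero from above and the upper limit is floored at zero from below."""
    diff = mean_e - mean_a
    half_width = t_crit * pooled_sd * math.sqrt(1.0 / n_e + 1.0 / n_a)
    lower = min(0.0, diff - half_width)
    upper = max(0.0, diff + half_width)
    return lower, upper


def lies_within_zone(lower: float, upper: float, margin: float) -> bool:
    """True when the whole interval lies inside (-margin, +margin)."""
    return -margin < lower and upper < margin


lower, upper = equivalence_confidence_interval(17.4, 20.6, 6.5, 30, 30, 1.67)
print(round(lower, 1), round(upper, 1))
print("equivalence shown:", lies_within_zone(lower, upper, margin=4.0))
print("non-inferiority shown:", lower > -4.0)
```

Both checks fail here because the lower limit of about -6.0 falls below -4.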

A real example of a non-inferiority trial is the VALIANT trial in patients with
myocardial infarction and heart failure. Patients were randomized to valsartan
monotherapy ($n_V$ = 4,909), captopril monotherapy ($n_C$ = 4,909), or
valsartan + captopril combination therapy ($n_{VC}$ = 4,885). The primary
outcome was death from any cause. One objective of the VALIANT trial was to
determine if the combination therapy is superior to each of the monotherapies.
Another objective of the trial was to determine if valsartan is non-inferior to
captopril, defined by $\Psi$ = 2.5% in the overall death rate.

Switching Objectives

Suppose that in a non-inferiority trial, the 95% lower confidence limit for
$\mu_E - \mu_A$ not only lies within $(-\Psi, +\infty)$ to establish
non-inferiority, but also lies within $(0, +\infty)$. It is safe to claim
superiority of the experimental therapy to the active control in such a
situation (without any statistical penalty).

In a superiority trial, suppose that the 95% lower confidence limit for
$\mu_E - \mu_A$ does not lie within $(0, +\infty)$, indicating that the
experimental therapy is not superior to the active control. If the protocol had
specified non-inferiority as a secondary objective and specified an appropriate
value of $\Psi$, then it is safe to claim non-inferiority if the 95% lower
confidence limit for $\mu_E - \mu_A$ lies within $(-\Psi, +\infty)$.
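The decision logic for switching objectives can be written as a small Python function. This is only a sketch with a hypothetical function name. It assumes that larger responses are better and that the non-inferiority margin was specified in the protocol before the data were seen.

```python
def claim_from_lower_limit(lower_limit: float, margin: float) -> str:
    """Classify the strongest claim supported by a lower confidence limit
    for the difference (experimental minus active control)."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    if lower_limit > 0:
        return "superiority"       # limit lies within (0, +infinity)
    if lower_limit > -margin:
        return "non-inferiority"   # limit lies within (-margin, +infinity)
    return "neither"


for limit in (0.5, -2.0, -6.0):
    print(limit, claim_from_lower_limit(limit, margin=4.0))
```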

6B.7 - Sample Size and Power


For a continuous outcome that is approximately normally distributed in
an equivalence trial, the number of patients needed in the active control arm,
$n_A$, where $AR = n_E/n_A$, to achieve $100(1 - \beta)$% statistical power
with an $\alpha$-level significance test is approximated by:

$$n_A = \left(\frac{AR + 1}{AR}\right) \frac{\left(t_{n_1 + n_2 - 2,\ 1 - \alpha} + t_{n_1 + n_2 - 2,\ 1 - \beta}\right)^2 \sigma^2}{(\Psi - |\Delta|)^2}$$

Notice the difference in the t percentiles between this formula and that for a
superiority comparison, described earlier. The difference is due to the two
one-sided testing that is performed.

Most investigators assume that the true difference in population means,
$\Delta = \mu_E - \mu_A$, is null in this sample size formula. This is an
optimistic assumption and may not be realistic.

(Note: the formula above simplifies to the formula on p. 189 in the FFDRG text
if $AR = 1$, $\Delta = 0$, and Z is substituted for t.)

For a binary outcome, the zone of equivalence for the difference in
population proportions between the experimental therapy and the active
control, $p_E - p_A$, is defined by the interval $(-\Psi, +\Psi)$. The number of
patients needed in the active control arm, $n_A$, where $AR = n_E/n_A$, to
achieve $100(1 - \beta)$% statistical power with an $\alpha$-level significance
test is approximated by:

$$n_A = \left(\frac{AR + 1}{AR}\right) \frac{(z_{1 - \alpha} + z_{1 - \beta})^2\ \bar{p}(1 - \bar{p})}{(\Psi - |p_E - p_A|)^2}$$

where

$$\bar{p} = \frac{AR \cdot p_E + p_A}{AR + 1}$$

How does this formula compare to FFDRG p. 189? The choice of the value for
p in our text is to use the control group value, assuming that $p_E - p_A = 0$.

For a time-to-event outcome, the zone of equivalence for the hazard ratio
between the experimental therapy and the active control, $\Lambda$, is defined
by the interval $(1/\Psi, +\Psi)$, where $\Psi$ is chosen > 1. The number of
patients who need to experience the event to achieve $100(1 - \beta)$%
statistical power with an $\alpha$-level significance test is approximated by

$$E = \left(\frac{(AR + 1)^2}{AR}\right) \frac{(z_{1 - \alpha} + z_{1 - \beta})^2}{\left(\log_e(\Psi/\Lambda)\right)^2}$$

If $p_E$ and $p_A$ represent the anticipated failure rates in the two treatment
groups, then the sample sizes can be determined from
$n_A = E/(AR \cdot p_E + p_A)$ and $n_E = AR \cdot n_A$.

If a hazard function is assumed to be constant during the follow-up period
[0, T], then it can be expressed as $\lambda(t) = \lambda = -\log_e(1 - p)/T$.
In such a situation, the hazard ratio for comparing two groups is
$\Lambda = \log_e(1 - p_E)/\log_e(1 - p_A)$. The same formula can be applied,
with different values of $p_E$ and $p_A$, to determine $\Psi$.

For a continuous outcome that is approximately normally distributed in a
non-inferiority trial, the number of subjects needed in the active control arm,
$n_A$, where $AR = n_E/n_A$, to achieve $100(1 - \beta)$% statistical power
with an $\alpha$-level significance test is approximated by:

$$n_A = \left(\frac{AR + 1}{AR}\right) \frac{\left(t_{n_1 + n_2 - 2,\ 1 - \alpha} + t_{n_1 + n_2 - 2,\ 1 - \beta}\right)^2 \sigma^2}{(\Psi - |\Delta|)^2}$$

Notice that the sample size formulae for non-inferiority trials are exactly the
same as the sample size formulae for equivalence trials. This is because of
the one-sided testing for both types of designs (even though an equivalence
trial involves two one-sided tests). Also notice that the choice of Z in the
formulas above has assumed a one-sided test or two one-sided tests, but
the requirements of regulatory agencies and the approach in our FFDRG text
is to use the Z value that would have been used for a 2-sided hypothesis test.
In homework, be sure to state any assumptions and the approach you are
taking.

6B.8 - SAS Examples


Example 1 (7.7_-_sample_size__normal__e.sas)

An investigator wants to determine the sample size for an asthma equivalence
trial with an experimental therapy and an active control. The primary outcome
is forced expiratory volume in one second (FEV1). The investigator desires a
0.05-significance level test with 90% statistical power and decides that the
zone of equivalence is $(-\Psi, +\Psi)$ = (-0.1 L, +0.1 L) and that the true
difference in means does not exceed $\Delta$ = 0.05 L. The standard deviation
reported in the literature for a similar population is $\sigma$ = 0.75 L. The
investigator plans to have equal allocation to the two treatment groups
(AR = 1).

Assuming that FEV1 has an approximate normal distribution, the approximate
number of patients required for the active control group is:

$$n_A = \frac{(2)(1.645 + 1.28)^2 (0.75)^2}{(0.1 - 0.05)^2} = 3{,}851$$

The total sample size required is nE + nA = 3,851 + 3,851 = 7,702.

SAS PROC POWER yields nE + nA = 3,855 + 3,855 = 7,710.
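The hand calculation can be sketched in Python using normal percentiles in place of the t percentiles, as the hand calculation itself does. The function name is hypothetical. Because the percentiles are not rounded to 1.645 and 1.28, the result differs from the hand-calculated value by a few patients.

```python
import math
from statistics import NormalDist


def n_active_control_continuous(sigma: float, margin: float, true_diff: float,
                                alpha: float, power: float,
                                allocation_ratio: float = 1.0) -> int:
    """Active control arm size for an equivalence or non-inferiority trial
    with a continuous outcome (normal approximation, one-sided alpha)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    ar = allocation_ratio
    n = ((ar + 1) / ar) * (z_alpha + z_beta) ** 2 * sigma ** 2 \
        / (margin - abs(true_diff)) ** 2
    return math.ceil(n)


n_a = n_active_control_continuous(sigma=0.75, margin=0.1, true_diff=0.05,
                                  alpha=0.05, power=0.90)
print("active control:", n_a, "total with equal allocation:", 2 * n_a)
```

Changing `power` or `allocation_ratio` shows how sensitive the total is to those design choices.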

Think About It!

What happens to the total sample size if the power is to be 0.95 and the
investigator uses 2:1 allocation?

Example 2 (7.8_-_sample_size__binary__n.sas)

An investigator wants to compare an experimental therapy to an active control
in a non-inferiority trial when the response is treatment success. She desires a
0.025 significance level test and 90% statistical power. She knows 70% of the
active control patients will experience success, so she decides that the
experimental therapy is not inferior if it yields at least 65% success. Thus,
$\Psi$ = 0.05 and she assumes that the true difference is $p_E - p_A = 0$.

With equal allocation, the number of patients in the active control group is:

$$n_A = \frac{(2)(1.96 + 1.28)^2 \{0.7(1 - 0.7)\}}{(0.05)^2} = 1{,}764$$

Thus, $n_E = n_A$ = 1,764 patients for a total of 3,528 patients.

SAS PROC POWER does not contain a feature for an equivalence trial or a
non-inferiority trial with binary outcomes. Fisher's exact test for a superiority
trial can be adapted to yield $n_E = n_A$ = 1,882 for a total of 3,764 patients.
The discrepancy is due to the superiority trial using p-bar = 0.675 instead of
0.7.
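The binary-outcome hand calculation can be sketched the same way. The function name is hypothetical, and the formula is the normal-approximation formula from Section 6B.7. Unrounded normal percentiles are used, so the answer is a few patients larger than the hand-calculated 1,764.

```python
import math
from statistics import NormalDist


def n_active_control_binary(p_e: float, p_a: float, margin: float,
                            alpha: float, power: float,
                            allocation_ratio: float = 1.0) -> int:
    """Active control arm size for a binary outcome (normal approximation)."""
    ar = allocation_ratio
    p_bar = (ar * p_e + p_a) / (ar + 1)  # allocation-weighted proportion
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    n = ((ar + 1) / ar) * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) \
        / (margin - abs(p_e - p_a)) ** 2
    return math.ceil(n)


n_a = n_active_control_binary(p_e=0.70, p_a=0.70, margin=0.05,
                              alpha=0.025, power=0.90)
print("active control:", n_a, "total with equal allocation:", 2 * n_a)
```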

Think About It!

Suppose the proportions were 0.65 and 0.75. How does the required sample
size, n, change?

Example 3 (7.9_-_sample_size__time__non.sas)

An investigator wants to compare an experimental therapy to an active control
in a non-inferiority trial. The response is time to infection. He desires a
0.025-significance level test with 90% statistical power and AR = 1. Follow-up
for each patient is one year and he expects 20% of the active control group will
get an infection ($p_A$ = 0.2). Although he believes that $p_E$ = 0.2, he
considers the experimental therapy to be non-inferior if $p_E \le 0.25$. The
accompanying SAS program, for a one-sided superiority trial, may approximate
the required sample size. The sample size can be worked out exactly, as follows:

Assuming constant hazard functions, the effect size (the hazard ratio) with
pE = pA = 0.2 is 1. With pE = 0.25 and pA = 0.2, the zone of non-inferiority
is defined by the hazard ratio

$$\frac{\log_e(0.75)}{\log_e(0.8)} = 1.29$$

The number of events is

$$E = \frac{(4)(1.96 + 1.28)^2}{\{\log_e(1.29)\}^2} = 648$$

and the sample sizes are nA = E/(AR·pE + pA) = 648/(0.2 + 0.2) = 1,620 and
nE = 1,620.
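
The three steps of this calculation can be reproduced in a few lines. The
following Python sketch is illustrative only: the function names are
hypothetical, it covers equal allocation with a true hazard ratio of 1, and it
rounds the margin to 1.29 before taking logarithms because that is what the
hand calculation does.

```python
import math


def hazard_ratio(p_e, p_a):
    """Hazard ratio implied by event proportions over a common follow-up
    period, assuming constant hazards: log(1 - p_e) / log(1 - p_a)."""
    return math.log(1 - p_e) / math.log(1 - p_a)


def events_needed(margin_hr, z_alpha=1.96, z_beta=1.28):
    """Total events for equal allocation when the true hazard ratio is 1."""
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(margin_hr) ** 2)


def control_group_size(events, p_e, p_a, ar=1.0):
    """n_A = E / (AR * pE + pA); the small offset guards against float round-off."""
    return math.ceil(events / (ar * p_e + p_a) - 1e-9)


margin = round(hazard_ratio(0.25, 0.20), 2)   # rounded to two decimals, as in the text
events = events_needed(margin)
n_a = control_group_size(events, p_e=0.20, p_a=0.20)
print(margin, events, n_a)
```

Carrying the unrounded margin through the calculation gives a slightly larger
number of events, so the rounding step is worth stating in a protocol.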

Since SAS PROC POWER does not contain a feature for an equivalence trial
or a non-inferiority trial with time-to-event outcomes, the results from the
logrank test for a superiority trial were adapted to yield nE = nA = 1,457. The
discrepancy in numbers between the program and the calculated n is due to
the superiority trial using pE = 0.25 instead of 0.2 in nA = E/(ARpE + pA).

Notice that the resultant sample sizes in SAS Examples 7.7-7.9 all are
relatively large. This is because the zone of equivalence or non-inferiority
is defined by a small margin. Generally, equivalence trials and
non-inferiority trials will require larger sample sizes than superiority
trials.

None of SAS Examples 7.7-7.9 accounted for withdrawals. If a withdrawal rate
of w is anticipated, then the sample size should be increased by the factor
1/(1 - w).
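
As a small illustration of this adjustment, the sketch below inflates a
calculated sample size by 1/(1 - withdrawal rate). The function name
`inflate_for_withdrawals` and the 15% withdrawal rate are hypothetical and
chosen only for the example.

```python
import math


def inflate_for_withdrawals(n, withdrawal_rate):
    """Increase a sample size by the factor 1 / (1 - withdrawal_rate)."""
    if not 0 <= withdrawal_rate < 1:
        raise ValueError("withdrawal_rate must be in [0, 1)")
    # The small offset guards against floating-point round-off before rounding up.
    return math.ceil(n / (1 - withdrawal_rate) - 1e-9)


# 1,620 patients per group from Example 3, with an assumed 15% withdrawal rate.
n_adjusted = inflate_for_withdrawals(1620, 0.15)
print(n_adjusted)
```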

6B.9 - Summary

In this lesson, among other things, we learned to:

Distinguish between superiority, non-inferiority and equivalence trials in
terms of
o objectives
o control group
o hypotheses tested and
o formation of confidence intervals.

Recognize characteristics of a clinical trial with high external validity.

Define which data are included in an intention-to-treat analysis.

Recognize the major considerations for designing an equivalence or
non-inferiority trial.

Perform sample size calculations for some equivalence and non-inferiority
trials, using SAS programs.

Let's put what we have learned to use to complete the homework 1 assignment!
(posted in ANGEL last week)

Lesson 7: The Study Cohort


Introduction

In a multi-center trial, even when study eligibility criteria are carefully
described and followed precisely by different investigators at different
locations, there can be enough patient heterogeneity and differences in
protocol interpretation that the results can vary greatly across institutions.
Thus, the differences in results actually could be due to different selection
factors at the different institutions. Recruitment strategies also might
differ, so that different types of patients are recruited at each site.

Eligibility criteria also affect the accrual rate for a trial. Although
tighter eligibility criteria lead to a more homogeneous study cohort, they
yield a slower accrual rate because fewer patients meet all of the criteria.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:


Compare the benefits and limitations of narrowly defined eligibility
criteria to broadly defined eligibility criteria

Recognize the impact of the 'healthy worker effect'.

Write inclusion and exclusion criteria that are less subject to
misinterpretation.

Consider the advantages and disadvantages of a run-in period or extended
baseline for a study.

Use a simple method to monitor patient accrual.

Recognize barriers to patient participation in clinical trials

Distinguish between an efficacy study and an effectiveness trial.

References

Gotay, CC. (1991). Accrual to cancer clinical trials: directions from the
research literature. Soc. Sci. Med. 33: 569-577.

Piantadosi Steven. (2005) The study cohort, Treatment allocation. In:
Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd ed.
Hoboken, NJ: John Wiley and Sons, Inc.

7.1 - Defining the Study Cohort


It can be more difficult to isolate a biological effect of a treatment if the
investigator uses a broadly defined cohort, i.e., patients with a variety of
disease types and/or severity. It is easier to isolate a biological effect of
a treatment in a narrowly defined cohort because of patient homogeneity. The
researcher's job is to balance these factors. Every situation is different and
the researcher needs to think carefully when defining the selection criteria.

Although a narrowly defined cohort may have some external validity for others
with the same disease if the treatment appears to be beneficial, in general it
will lack external validity because the study results may not apply to
patients with slightly altered versions of the disease. Again, these are
examples of the competing demands that the researcher must keep in mind.

Epidemiologists have defined the healthy worker effect as the phenomenon that
the general health of employed individuals is better than average. For
example, employed individuals may be unsuitable controls for a case-control
study if the cases are hospitalized patients. Similarly, individuals who
volunteer for clinical trials may have more favorable outcomes than those who
refuse to participate, even if the treatment is ineffective. This selection
effect is known as the trial participant effect and it can be strong. For a
randomized trial, however, this may not be a problem unless selection effects
somehow impact treatment assignment.

Because of the possible effects of prognostic factors (variables that can
affect the outcome) and selection factors on differences in outcome, the
eligibility criteria for the study cohort need to be defined carefully.

Two contrasting philosophies in defining these criteria are as follows:

1. Define very narrow eligibility criteria so that the study cohort is relatively
homogeneous, which may yield an outcome variable that has less
variability and result in a smaller sample size; however, the results may
not have external validity.

2. Define very broad eligibility criteria, and accommodate the larger amount
of variability by incorporating a larger sample size, which will provide much
more external validity. (This is easy for a statistician to say!)

In many instances, endpoints/outcomes are more easily evaluated if certain
complicating factors are prevented by patient exclusions. For example,
habitual smokers typically are excluded from asthma trials because their lung
function may be impaired by smoking as well as by their asthma. The smoking
behavior may confound the results of the study. Exclusions also may be invoked
for ethical reasons if the treatment is not expected to benefit a certain
subgroup of patients. For example, some oncology (cancer) trials might exclude
patients whose life expectancy does not exceed six months.

The difficulties with interpretation of inclusion and exclusion criteria can
be minimized via quantitative expressions. For example, inclusion criteria
should specify the range of allowable serum chemistry values, instead of just
stating that "we will require normal lab values". Different hospitals are
going to have different interpretations of what normal is, so the criteria
need to be specific.

Once the decisions are made about the study cohort and other design issues
resolved, the protocol approved and study medications obtained, the
investigator begins what can be the most difficult task in a clinical trial -
recruitment! Despite the most optimistic beliefs about the number of available
patients, a host of factors can make the recruitment of patients challenging.

(You may notice in this section we have defined a study cohort for the trial.
This does not mean, however, that every clinical trial is a cohort study in
the sense of a long-term study following a defined group of patients.)

7.2 - Assessing Accrual


It is unfortunate that some clinical trials are terminated early due to low
accrual, which is a waste of resources and time for all those involved.
Investigators often overestimate the accrual rate because they may not
account for (1) the restrictions imposed by the eligibility criteria and (2) the
refusal by some eligible patients to participate.

A famous saying among clinical trialists that speaks to the challenges
associated with recruitment is "The incidence of a disease diminishes when you
initiate a study on it." (source unknown)

Run-in periods or extended baseline periods are helpful in assessing which
eligible patients will adhere to the protocol. For example, patients can be
administered a placebo during the run-in period and monitored for treatment
compliance. At the completion of the run-in period, those patients who meet
the treatment compliance criteria are then randomized to treatment, whereas
those who do not are discontinued from the study. Another advantage of
incorporating a run-in period is that it may provide the opportunity for
patients to be stabilized via a standard medication prior to randomization.

One criticism of incorporating a run-in period is that it could decrease the
external validity of the trial because in the real world some patients will
not be very compliant. Thus, a trial based on very compliant patients may
overestimate the effectiveness of the treatment.

Sometimes it is possible to conduct a formal survey of patients prior to the
onset of a trial to determine the approximate proportion that would consider
participating. This enables the researcher to set a realistic timetable for
completing the trial.

In any event, it is extremely important to monitor accrual on a regular basis
throughout the course of a trial. An accrual graph with target and actual
number of recruited patients helps monitor the process. This task typically
falls on the statistician. Here is an example of a plot monitoring the accrual
of patients.

The target assumes a constant accrual of patients. At the beginning of the
trial the number of recruited patients lagged behind the target, but it caught
up with the target by the end of the study. This kind of struggle is very
typical, and everyone on the research team needs to help with recruitment.
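
A simple accrual check of this kind can be tabulated as well as plotted. The
sketch below is one possible way to do it in Python; the function name
`accrual_report`, the target of 120 patients over 10 weeks, and the weekly
enrollment counts are all hypothetical values made up for illustration.

```python
def accrual_report(target_total, n_periods, actual_per_period):
    """Compare cumulative actual accrual with a constant-rate target.

    Returns one tuple per period: (period, target, actual, actual - target).
    """
    rate = target_total / n_periods  # constant target accrual per period
    report = []
    cumulative = 0
    for period, enrolled in enumerate(actual_per_period, start=1):
        cumulative += enrolled
        target = rate * period
        report.append((period, target, cumulative, cumulative - target))
    return report


# Hypothetical trial: slow start, then recruitment catches up with the target.
weekly_enrollment = [6, 8, 10, 12, 13, 13, 14, 14, 15, 15]
report = accrual_report(target_total=120, n_periods=10,
                        actual_per_period=weekly_enrollment)
for period, target, actual, gap in report:
    print(f"week {period:2d}  target {target:6.1f}  actual {actual:4d}  gap {gap:+6.1f}")
```

A persistently negative gap is the signal to revisit eligibility criteria or
recruitment strategy before the trial falls too far behind.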

7.3 - Other Considerations


Among adult cancer patients in the USA, fewer than 3% participate in clinical
trials (Gotay, 1991). Since the process of clinical trials leads to
improvements in cancer therapy over time, it would seem that cancer patients
would be motivated to participate in increasing numbers over time. But this
has not happened. Most diseases, except for AIDS and some pediatric
conditions, exhibit similar participation rates. The reasons for lack of
participation fall into three general categories: physician-, patient-, or
administrative-related.

The reasons physicians give for failing to enroll patients in clinical trials are
the perception that the trial may compromise the physician-patient relationship
and the difficulties with informed consent. Many consent forms are
cumbersome, intimidating, and not written at an appropriate reading level. The
'experts' say that these documents should be written at an 8th grade reading
level. Using plain language is important. Also, many patients are mistrustful of
the medical establishment, although they may trust their individual physicians.
Often, ethnic minority groups express even stronger concerns about
participation in clinical trials.

There is a distinction between an efficacy trial and an effectiveness trial.
In an efficacy trial, the study cohort is relatively homogeneous and the
objective is to test a biological question. In an effectiveness trial, the
study cohort is relatively heterogeneous and the objective is to assess
effectiveness of a treatment. An effectiveness trial tends to be very large
and expensive, but has much more external validity because of broad
eligibility criteria and a heterogeneous population. Most clinical trials are
effectiveness studies.

Example

An example of an efficacy study is the trial conducted by the Asthma Clinical
Research Network (ACRN), entitled Dose of Inhaled Corticosteroids with
Equisystemic Effects (DICE). The primary objective of the trial was to
investigate dose-response effects of various inhaled corticosteroids (ICS) on
cortisol production by the adrenal glands. Subjects with mild-moderate asthma
were recruited. There were many exclusion criteria, such as obesity, pregnancy
or lactation, use of oral or injectable steroids during the past twelve
months, use of ICS or nasal steroids during the past six months, and use of
topical steroids during the past two months. Subjects were randomized to
either one of six different ICS (n = 24 per group) or placebo (n = 12). ICS
dose was increased on a weekly basis (0d, 1d, 2d, 4d, and 8d, where d is a
pre-selected low dose for each ICS). Subjects stayed overnight at a hospital
at the end of each week, during which blood was drawn hourly and analyzed to
determine the concentration of cortisol.

This study was examining a very specific biological question. The primary
objective of the trial was to establish whether increasing the dose for each ICS
yields a decrease in plasma cortisol (adrenal suppression). The researchers
were interested in looking at dose response curves. The DICE trial was not
powered to compare the dose-response curves of each ICS.

DICE is strictly an efficacy trial with very narrow eligibility criteria.
Furthermore, the protocol specified that the intent-to-treat paradigm would
not be followed. Subjects were dropped post-randomization if they received
other forms of steroids, became pregnant, or were non-compliant with dose
schedules and/or visit schedules.

On the other hand, over the past 20 years, there has been great interest in
the gender and ethnic composition of cohorts in clinical trials. Part of this
interest is due to ensuring external validity of the results of the trials.
For many years Caucasian males were the only patients recruited, for the
purpose of assuring homogeneity. This has been broadened by both the FDA and
NIH in their application processes. The broader eligibility requirements will
help to ensure broader external validity.

The NIH typically requires one-half female participation and one-third ethnic
minority participation in CTE trials that it sponsors. Obviously, there are
exceptions to this based on the disease of interest. Required representation in
clinical trials, however, could be a hindrance to acquiring new knowledge if it
consumes too many resources.

7.4 - Summary

In this lesson, among other things, we learned to:

Compare the benefits and limitations of narrowly defined eligibility criteria
to broadly defined eligibility criteria.

Recognize the impact of the 'healthy worker effect'.

Write inclusion and exclusion criteria that are less subject to
misinterpretation.

Consider the advantages and disadvantages of a run-in period or extended
baseline for a study.

Use a simple method to monitor patient accrual.

Recognize barriers to patient participation in clinical trials.

Distinguish between an efficacy study and an effectiveness trial.

Let's put what we have learned to use by completing the following homework
assignment:

Homework

Look for homework assignments in the Homework folder on ANGEL.


Lesson 8: Treatment Allocation and Randomization

Introduction

Treatment allocation in a clinical trial can be randomized or nonrandomized.
Nonrandomized schemes, such as investigator-selected treatment assignments,
are susceptible to large biases. Even nonrandomized schemes that are
systematic, such as alternating treatments, are susceptible to discovery and
could lead to bias. Obviously, to reduce biases, we prefer randomized schemes.
Credibility requires that the allocation process be non-discoverable. The
investigator should not know which treatment will be assigned until the
patient has been determined to be eligible. Even envelopes with the treatment
assignment sealed inside are prone to discovery.

Randomized schemes for treatment allocation are preferable in most
circumstances. When choosing an allocation scheme for a clinical trial, there
are three technical considerations:

1. reducing bias;

2. producing a balanced comparison;

3. quantifying errors attributable to chance.

Randomization procedures provide the best opportunity for achieving these
objectives.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

Identify three benefits of randomization

Distinguish simple randomization from constrained randomization.

State the purpose of randomization in permuted blocks.

State the objective of stratified randomization.


Contrast the benefits of permuted blocks to those of adaptive
randomization schemes.

Use a SAS program to produce a permuted blocks randomization plan.

Use an allocation ratio that will maximize statistical power in the situation
where greater variability is expected in one treatment group than the other.

Provide the rationale against randomizing prior to informed consent.

8.1 - Randomization

In some early clinical trials, randomization was performed by constructing two
balanced groups of patients and then randomly assigning the two groups to
the two treatment groups. This is not always practical as most trials do not
have all the patients recruited on day one of the study. Most clinical trials
today invoke a procedure in which individual patients, upon entering the study,
are randomized to treatment.

Randomization is effective in reducing bias because it guarantees that
treatment assignment will not be based on a patient's prognostic factors.
Thus, investigators cannot favor one treatment group over another by assigning
patients with better prognoses to it, either knowingly or unknowingly.
Procedure selection bias has been documented to have a very strong effect on
outcome variables.

Another benefit of randomization, which might not be as obvious, is that it
typically prevents confounding of the treatment effects with other prognostic
variables. Some of these factors may or may not be known. The investigator
usually does not have a complete picture of all the potential prognostic
variables, but randomization tends to balance the treatment groups with
respect to the prognostic variables.

Some researchers argue against randomization because it is possible to conduct
a statistical analysis, e.g., analysis of covariance (ANCOVA), that adjusts
for the prognostic variables. It always is best, however, to prevent a problem
rather than adjust for it later. In addition, ANCOVA does not necessarily
resolve the problem satisfactorily because the investigator may be unaware of
certain prognostic variables and because it assumes a specific statistical
model that may not be correct.

Although randomization provides great benefit in clinical trials, there are
certain methodological problems and biases that it cannot prevent. One example
where randomization has little, if any, impact is external validity in a trial
that has imposed very restrictive eligibility criteria. Another example occurs
with respect to assessment bias, which treatment masking and other design
features can minimize. For instance, measurement bias can be introduced when a
patient is asked subjective questions such as "how do you feel?" or "how bad
is your pain?" to describe their condition.

Simple Randomization

The most popular form of randomization is simple randomization. In this
situation, a patient is assigned a treatment without any regard for previous
assignments. This is similar to flipping a coin - the same chance regardless
of what happened in the previous coin flip. Each time, once treatment A or B
is selected, the selection process begins again with the same two treatment
possibilities.

One problem with simple randomization is the small probability of assigning
the same number of subjects to each treatment group. Severe imbalance in the
numbers assigned to each treatment is a critical issue with small sample
sizes.
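
To put numbers on this, the chance of exact balance under simple 1:1
randomization follows from the binomial distribution. The Python sketch below
is illustrative; the function names and the sample sizes tried are
hypothetical choices.

```python
from math import comb


def prob_exact_balance(n):
    """Probability that simple 1:1 randomization of n patients gives equal groups."""
    if n % 2:
        return 0.0  # an odd total can never split evenly
    return comb(n, n // 2) / 2 ** n


def prob_imbalance_at_least(n, k):
    """Probability that the two group sizes differ by at least k patients."""
    favorable = sum(comb(n, a) for a in range(n + 1) if abs(2 * a - n) >= k)
    return favorable / 2 ** n


for n in (10, 20, 100):
    print(n, round(prob_exact_balance(n), 4), round(prob_imbalance_at_least(n, 4), 4))
```

The first function shows how quickly the chance of a perfectly even split
falls as the trial grows, while the second gives the chance of a difference of
a given absolute size.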

Another disadvantage of simple randomization is that it can lead to imbalance
among the treatment groups with respect to prognostic variables that affect
the outcome variables.

For example, suppose disease severity in a trial is designated as mild,
moderate, and severe. Suppose that simple randomization to treatment groups A
and B is applied. The following table illustrates what possibly could occur.

Severity    Group A   Group B
Mild           17        28
Moderate       41        39
Severe         25        13

The moderate stratum is fairly well balanced, but the mild and severe strata
are much more imbalanced. This results in Group A getting more of the severe
cases and Group B more of the mild cases.

8.2 - Constrained Randomization

Randomization in permuted blocks is one approach to achieve balance across
treatment groups. The randomization scheme consists of a sequence of blocks
such that each block contains a pre-specified number of treatment assignments
in random order. The purpose of this is so that the randomization scheme is
balanced at the completion of each block. For example, suppose equal
allocation is planned in a two-armed trial (groups A and B) using a
randomization scheme of permuted blocks. The target sample size is 120
patients (60 in A and 60 in B) and the investigator plans to enroll 12
subjects per week. In this situation, blocks of size 12 are natural, so the
randomization plan looks like

Week #1 BABABAABABAB
Week #2 ABBBAAABAABB

Week #3 BBBABABAABAA

Each week, patients are assigned treatments in the order specified by that
week's block. Notice that there are exactly six As and six Bs within each
block, so that at the end of each week there is balance between the two
treatment arms. If the trial is terminated, say after 64 patients have been
enrolled, there may not be exact balance but it will be close.
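
A plan like the one above can be generated with a few lines of code. The
following is a minimal Python sketch rather than the course's SAS program; the
function name `permuted_blocks`, its parameters, and the seed are hypothetical.

```python
import random


def permuted_blocks(n_blocks, per_arm, arms=("A", "B"), seed=None):
    """Return a list of blocks, each a random ordering of `per_arm` copies of every arm."""
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    schedule = []
    for _ in range(n_blocks):
        block = [arm for arm in arms for _ in range(per_arm)]
        rng.shuffle(block)  # random order within the block
        schedule.append("".join(block))
    return schedule


# Ten weekly blocks of size 12 (six As and six Bs) for a target of 120 patients.
plan = permuted_blocks(n_blocks=10, per_arm=6, seed=509)
for week, block in enumerate(plan, start=1):
    print(f"Week #{week:<2d} {block}")
```

Because every block is balanced, the two arms can never differ by more than
half a block size at any point during enrollment.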

Ordinarily, a natural block size is not evident, so logistical procedures may
suggest a block size. A variation of blocked randomization is to use blocks of
unequal size. This is helpful for a trial in which the investigator is
unmasked. For example, if the investigator knows that the block size is six,
and within a particular block treatment A already has been assigned to three
patients, then it is obvious that the remaining patients in the block will be
assigned treatment B. If the investigator knows the treatment assignment prior
to evaluating the eligibility criteria, then this could lead to procedure
selection bias. Treatment assignments should not be discoverable, so varying
the block size helps keep the investigator's procedure selection bias to a
minimum.

To illustrate that randomization with permuted blocks is a form of constrained
randomization, let NA and NB denote the number of As and Bs, respectively, to
be contained within each block. Suppose that when an eligible patient is ready
to be randomized there are nA and nB patients already randomized to groups A
and B, respectively, within the current block. Then the probability that the
patient is randomized to treatment A is:

$$
\Pr[A] =
\begin{cases}
0 & \text{if } n_A = N_A \\
\dfrac{N_A - n_A}{N_A + N_B - n_A - n_B} & \text{if } n_A < N_A \text{ and } n_B < N_B \\
1 & \text{if } n_B = N_B
\end{cases}
$$

This probability rule is based on the model of NA "A" balls and NB "B" balls in
an urn or jar which are sampled without replacement. The probability of being
assigned treatment A changes according to how many patients already have
been assigned treatment A and treatment B within the block.

As an example, suppose each block is supposed to have NA = NB = 6 and nA = 3
and nB = 2 already have been assigned. Thus, there are NA - nA = 3 A balls
left in the urn and NB - nB = 4 B balls left in the urn, so the probability of
the next eligible patient being assigned treatment A is 3/7.
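
The urn rule can be written as a small function that counts the balls left in
the block. This is a hedged sketch in Python; the function name
`prob_next_is_a` and its argument names are hypothetical, and exact fractions
are used so the result can be compared with the hand calculation.

```python
from fractions import Fraction


def prob_next_is_a(N_A, N_B, n_A, n_B):
    """Probability that the next patient in the current block receives treatment A.

    N_A, N_B: number of As and Bs per block; n_A, n_B: already assigned in this block.
    """
    remaining_a = N_A - n_A
    remaining_b = N_B - n_B
    if remaining_a < 0 or remaining_b < 0:
        raise ValueError("more assignments than the block allows")
    if remaining_a + remaining_b == 0:
        raise ValueError("the block is complete; start a new block")
    return Fraction(remaining_a, remaining_a + remaining_b)


print(prob_next_is_a(6, 6, 3, 2))  # the worked example: 3 A balls and 4 B balls remain
```

The boundary cases fall out of the same expression: no A balls left gives
probability 0, and no B balls left gives probability 1.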

8.3 - Stratified Randomization


Another type of constrained randomization is called stratified randomization.
Stratified randomization refers to the situation in which strata are constructed
based on values of prognostic variables and a randomization scheme is
performed separately within each stratum. For example, suppose that there
are two prognostic variables, age and gender, such that four strata are
constructed:

Stratum               Treatment A   Treatment B
male, age < 18             12            12
male, age ≥ 18             36            37
female, age < 18           13            12
female, age ≥ 18           40            40

The strata sizes usually vary (maybe there are relatively fewer young males
and young females with the disease of interest). The objective of stratified
randomization is to ensure balance of the treatment groups with respect to the
various combinations of the prognostic variables. Simple randomization will
not ensure that the treatment groups are balanced within these strata, so
permuted blocks are used within each stratum to achieve balance.

If there are too many strata in relation to the target sample size, then some
of the strata will be empty or sparse. This can be taken to the extreme such
that each stratum consists of only one patient, which in effect would yield a
similar result as simple randomization. Keep the number of strata to a minimum
for good effect.
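
One way to picture stratified randomization is as a separate permuted-block
sequence kept for every stratum. The Python sketch below assumes two arms with
equal allocation; the class name `StratifiedBlockRandomizer`, the block size
of 4, and the seed are hypothetical.

```python
import random


class StratifiedBlockRandomizer:
    """Keeps a separate permuted-block sequence for each stratum (arms A and B, 1:1)."""

    def __init__(self, block_size=4, seed=None):
        if block_size % 2:
            raise ValueError("block_size must be even for 1:1 allocation")
        self._block_size = block_size
        self._rng = random.Random(seed)
        self._pending = {}  # stratum -> assignments left in its current block

    def assign(self, stratum):
        """Return the next treatment for a patient who belongs to `stratum`."""
        queue = self._pending.setdefault(stratum, [])
        if not queue:  # current block used up: start a new shuffled block
            queue.extend(["A", "B"] * (self._block_size // 2))
            self._rng.shuffle(queue)
        return queue.pop()


randomizer = StratifiedBlockRandomizer(block_size=4, seed=2024)
young_males = [randomizer.assign(("male", "age < 18")) for _ in range(8)]
older_females = [randomizer.assign(("female", "age >= 18")) for _ in range(4)]
print(young_males, older_females)
```

Each stratum is balanced after every completed block, regardless of the order
in which patients from different strata arrive.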

8.4 - Adaptive Randomization


Adaptive randomization refers to any scheme in which the probability of
treatment assignment changes according to assigned treatments of patients
already in the trial. Although permuted blocks can be considered as such a
scheme, adaptive randomization is a more general concept in which treatment
assignment probabilities are adjusted.

One advantage of permuted blocks over adaptive randomization is that the
entire randomization scheme can be determined prior to the onset of the study,
whereas many adaptive randomization schemes require recalculation of treatment
assignment probabilities for each new patient.

Urn models provide some approaches for adaptive randomization. Here is an
exercise that will help to explain this type of scheme. Suppose that there is
one "A" ball and one "B" ball in an urn and the objective of the trial is
equal allocation between treatments A and B. Suppose that an "A" ball is
blindly selected, so that the first patient is assigned treatment A. Then the
original "A" ball and another "B" ball are placed in the urn so that the
second patient has a 1/3 chance of receiving treatment A and a 2/3 chance of
receiving treatment B. At any point in time with nA "A" balls and nB "B" balls
in the urn, the probability of being assigned treatment A is nA/(nA + nB). The
scheme changes based on what treatments have already been assigned to
patients.

This type of urn model for adaptive randomization yields tight control of
balance in the early phase of a trial. As nA and nB get larger, the scheme tends
to approach simple randomization, so the advantage of such an approach
occurs when the trial has a small target sample size.
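
The urn scheme can be simulated directly. The Python sketch below is
illustrative: the function name `urn_randomization` and the seed are
hypothetical, and it assumes the rule is symmetric, so that drawing a "B" ball
returns it and adds an "A" ball (the text describes only the "A" case).

```python
import random


def urn_randomization(n_patients, seed=None):
    """Simulate the urn scheme: after each draw, add one ball of the opposite type."""
    rng = random.Random(seed)
    balls = {"A": 1, "B": 1}  # the urn starts with one ball of each type
    assignments = []
    for _ in range(n_patients):
        p_a = balls["A"] / (balls["A"] + balls["B"])
        arm = "A" if rng.random() < p_a else "B"
        assignments.append(arm)
        other = "B" if arm == "A" else "A"
        balls[other] += 1  # drawn ball goes back, plus one ball of the other type
    return assignments, balls


assignments, balls = urn_randomization(20, seed=7)
print("".join(assignments), balls)
```

Because each assignment tilts the urn toward the other arm, long runs of the
same treatment become unlikely early in the trial.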

8.5 - Minimization

Minimization is another, rather complicated type of adaptive randomization.
Minimization schemes construct measures of imbalance for each treatment when
an eligible patient is ready for randomization. The patient is assigned to the
treatment which yields the lowest imbalance score. If the imbalance scores are
all equal, then that patient is randomly assigned a treatment. This type of
adaptive randomization imposes tight control of balance, but it is more
labor-intensive to implement because the imbalance scores must be calculated
with each new patient. Some researchers have developed web-based applications
and automated 24-hour telephone services that solicit information about the
stratifiers, and a computer algorithm uses the data to determine the
randomization.

One popular minimization scheme is based on marginal totals of the stratifying
variables. As an example, consider a three-armed clinical trial (treatments A,
B, C). Suppose there are four stratifying variables, whereby each stratifier
has three levels (low, medium, high), yielding 3^4 = 81 strata in this trial.
Suppose that 200 patients have been randomized and the observations of the
stratifying variables are recorded as follows.

Patient   Stratifier #1   Stratifier #2   Stratifier #3   Stratifier #4
001       Low             Low             Medium          Low
002       High            Medium          Medium          High
...
200       Low             Low             Low             Medium

Suppose that patient #201 is ready for randomization and that this patient is
observed to have the low level of stratifier #1, the medium level of
stratifier #2, the high level of stratifier #3, and the high level of
stratifier #4. Based on the 200 patients already in the trial, the number of
patients with each of these levels is totaled for each treatment group.
(Notice that patients may be counted more than once in this table.)

Treatment   Low             Medium          High            High            Marginal
            Stratifier #1   Stratifier #2   Stratifier #3   Stratifier #4   Total
A           27              45              19              12              103
B           31              48              18              15              112
C           30              43              21              15              109

Patient #201 would be assigned to treatment A because it has the lowest
marginal total. If two or more treatment arms are tied for the smallest
marginal total, then the patient is randomly assigned to one of the tied
treatment arms. This is not a perfect scheme but it is a strategy for making
sure that the treatment groups are as balanced as possible with respect to
each of the four variables.
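
The assignment rule based on marginal totals is easy to express in code. The
Python sketch below is only an illustration; the function name
`minimization_assign` is hypothetical, and the first count for each treatment
(27, 31, 30) is inferred from the marginal totals of 103, 112, and 109 rather
than read directly from the table.

```python
import random


def minimization_assign(marginal_counts, rng=None):
    """Pick the arm with the smallest marginal total, breaking ties at random.

    marginal_counts maps each arm to the counts of prior patients who share the
    new patient's level on each stratifier.
    """
    rng = rng or random.Random()
    totals = {arm: sum(counts) for arm, counts in marginal_counts.items()}
    smallest = min(totals.values())
    tied = sorted(arm for arm, total in totals.items() if total == smallest)
    return rng.choice(tied), totals


# Counts for patient #201's levels: low #1, medium #2, high #3, high #4.
# The first entry in each list is inferred from the marginal totals.
counts = {
    "A": [27, 45, 19, 12],
    "B": [31, 48, 18, 15],
    "C": [30, 43, 21, 15],
}
arm, totals = minimization_assign(counts, rng=random.Random(201))
print(arm, totals)
```

After the assignment, the chosen arm's counts for those four levels would each
be increased by one before the next patient is considered.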

8.6 - "Play the Winner" Rule


Another type of adaptive randomization scheme is called the "play the winner"
rule. Suppose there is a two-armed clinical trial and the urn contains one "A"
ball and one "B" ball for the first patient. Suppose that the patient randomly
is assigned treatment A. Now you need to know if the treatment was successful
with the patient that received this treatment. If the patient does well on
treatment A, then the original "A" ball and another "A" ball are placed in the
urn. If the patient fails on treatment A, then the original "A" ball and a "B"
ball are placed in the urn. Thus, the second patient has probability of 2/3 or
1/3 of receiving treatment A depending on whether treatment A was a success or
failure for the first patient. This process continues. If one treatment is
more successful than the other, the odds are stacked in favor of that
treatment.

The advantage of the "play the winner" rule is that a higher proportion of
patients will be assigned to the more successful treatment. This seems to be
an ethical approach.

The disadvantages of the "play the winner" rule are that:

1. sample size calculations are difficult, and

2. the outcome on each patient must be determined prior to the entry of the
next patient.

Thus, the "play the winner" rule is not practical for most trials. The procedure
can be modified, however, to be performed in stages. For example, if the
target sample size is 200 patients, then the trial can be put on hold after each
set of 50 patients to assess outcome and redefine the probability of treatment
assignment for the patients yet to be recruited, i.e., "play the winner" after
every 50 patients instead of every patient.
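
A small simulation shows how the per-patient version of the rule shifts
allocation toward the better arm. This Python sketch is illustrative only: the
function name `play_the_winner`, the seed, and the success probabilities of
0.7 and 0.4 are hypothetical, and outcomes are simulated rather than observed.

```python
import random


def play_the_winner(success_prob, n_patients, seed=None):
    """Simulate the "play the winner" urn with simulated binary outcomes.

    success_prob maps each arm to an assumed probability of treatment success.
    """
    rng = random.Random(seed)
    urn = {"A": 1, "B": 1}  # one ball of each type for the first patient
    assignments = []
    for _ in range(n_patients):
        p_a = urn["A"] / (urn["A"] + urn["B"])
        arm = "A" if rng.random() < p_a else "B"
        success = rng.random() < success_prob[arm]
        other = "B" if arm == "A" else "A"
        # Success adds a ball for the same arm; failure adds one for the other arm.
        urn[arm if success else other] += 1
        assignments.append(arm)
    return assignments, urn


assignments, urn = play_the_winner({"A": 0.7, "B": 0.4}, n_patients=50, seed=3)
print(assignments.count("A"), assignments.count("B"), urn)
```

The loop makes the practical drawback visible: each outcome must be known
before the next patient can be randomized.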

8.7 - Administration of the Randomization Process

The RANUNI function in SAS yields random numbers from the Uniform(0,1)
distribution (a randomly selected decimal between 0 and 1). These random
numbers can be used to generate a randomization scheme. For example, suppose
that the probabilities of assignment to treatments A, B, and C are to be 0.25,
0.25, and 0.5, respectively. Let U denote the random number generated and
assign treatment as follows:

A, if 0.00 < U < 0.25
B, if 0.25 < U < 0.50
C, if 0.50 < U < 1.00

This can be adapted for whatever your scheme requires.
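
The same cut-point idea can be sketched outside of SAS. The Python example
below uses `random.random()` in place of RANUNI; the function name
`assign_from_uniform` and the seed are hypothetical. The interval endpoints
have probability zero, so it does not matter which arm they are attached to.

```python
import random

# Upper cut-points for each arm: A gets 25%, B gets 25%, C gets 50%.
CUTOFFS = (("A", 0.25), ("B", 0.50), ("C", 1.00))


def assign_from_uniform(u, cutoffs=CUTOFFS):
    """Map a Uniform(0,1) draw to a treatment using cumulative cut-points."""
    for arm, upper in cutoffs:
        if u < upper:
            return arm
    raise ValueError("u must be less than 1")


rng = random.Random(91)  # fixed seed so the list can be regenerated
assignments = [assign_from_uniform(rng.random()) for _ in range(1000)]
print({arm: assignments.count(arm) for arm, _ in CUTOFFS})
```

Changing the allocation probabilities only requires changing the cumulative
cut-points.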

SAS Example: (9.1_randomizion_plan.sas): Here is a SAS program that provides a
permuted blocks randomization scheme for equal allocation to treatments A and
B. In the example, the block size is 6 and the total sample size is 48.

Think About It!

Can you generate a permuted blocks randomization scheme for a total
sample size of 32 with a block size of 4?

Future treatment assignments in a randomization scheme should not be
discoverable by the investigator. Otherwise, the minimization of selection bias
offered by randomization is lost. The randomization
scheme should not be physically available to the investigator. In multi-center
trials the scheme usually is not available to the investigator, but the problem
often arises in small
single-center trials. Logistical problems can arise in trials with hospitalized
patients in which 24-hour access to randomization is necessary. Sometimes,
sealed envelopes are used as a means of keeping the randomized treatment
assignments confidential until a patient is eligible for entry. However, it is
relatively easy for investigators to tamper with the envelope system.

Many clinical trials rely on pharmacies to package the drugs so that they are
masked to investigators and patients. For example, consider a two-armed trial
with a target sample size of 96 randomized subjects (48 within each treatment
group). The pharmacist constructs 96 drug packets and randomly assigns
numeric codes from 01 to 96 which are printed on the drug packet labels. The
pharmacist gives the investigator the masked drug packets (with their numeric
codes). When a subject is eligible for randomization, the investigator selects
the next drug packet (in numeric order). In this way the investigator is kept
from knowing which treatment is assigned to which patient.

SAS Example: (SAS_Example_13.2_randomization_plan.sas): Here is a
SAS program that provides ....

8.8 - Unequal Treatment Allocation


To maximize the efficiency (statistical power) of treatment comparisons,
investigators typically employ equal allocation of patients to treatment groups
(this assumes that the variability in the outcome measure is the same for each
treatment).

Unequal allocation may be preferable in some situations. An unequal
allocation that favors an experimental therapy over placebo could help
recruitment and it would increase the experience with the experimental
therapy. This also provides the opportunity to perform some subset analyses
of interest, e.g., if more elderly patients are assigned to the experimental
therapy, then the unequal allocation would yield more elderly patients on the
experimental therapy.

Another example where unequal allocation may be desirable occurs when one
therapy is extremely expensive in comparison to the other therapies in the
trial. For budget reasons you may not be able to assign as many to the
expensive therapy.

If it is known that one treatment is more variable (less precise) in the outcome
response than the other treatments, then the statistical power for treatment
comparisons is maximized with unequal allocation. The allocation ratio should
be

$$r = \frac{n_1}{n_2} = \frac{\sigma_1}{\sigma_2}$$

which is the ratio of the known standard deviations. Thus, the treatment that
yields less precision (larger standard deviation) should receive more patients.
Because its outcomes are 'noisier', a larger
sample size in that group will help to cut through this noise.
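
As a small worked illustration of allocating in proportion to the standard deviations, the Python sketch below splits a total sample size between two groups. The function name `allocate` and the numbers (standard deviations of 10 and 5, a total of 150 patients) are hypothetical values chosen only for this example.

```python
def allocate(total_n, sd1, sd2):
    """Split total_n between two groups in the ratio n1/n2 = sd1/sd2."""
    r = sd1 / sd2
    n1 = round(total_n * r / (1 + r))
    return n1, total_n - n1


# Hypothetical inputs: group 1 is twice as variable as group 2.
n1, n2 = allocate(150, sd1=10, sd2=5)
print(n1, n2)
```

With these inputs the more variable group receives twice as many patients as the other group. When the two standard deviations are equal, the split reduces to equal allocation.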

8.9 - Randomization Prior to Informed Consent

Randomization prior to informed consent can increase the number of trial
participants, but it causes some difficulties. This is not recommended practice.
Here's why...

One particular scheme with experimental and standard treatments that has
received some attention is as follows. Eligible patients are randomized prior to
providing consent. If the patient is assigned to the standard therapy, then it is
offered to the patient without the need for consent. If the patient is randomized
to the experimental therapy, then the patient is asked for consent. If this
patient refuses, however, then he/she is offered the standard therapy. An
"intent-to-treat" analysis is performed based on the randomized assignment.

This approach can increase trial participation, but patients who are
randomized to the experimental treatment and refuse will dilute the treatment
difference at the time of data analysis. In addition, the "intent-to-treat" analysis
will introduce bias.

There are ethical problems as well because:

1. subjects are randomized to treatment without having been properly
informed and without providing their consent, and

2. subjects randomized to standard therapy have been denied the chance
of receiving the experimental therapy.

For all of these reasons, randomization prior to informed consent is not
recommended.

8.10 - Summary
In this lesson, among other things, we learned to:

Identify three benefits of randomization.

Distinguish simple randomization from constrained randomization.

State the purpose of randomization in permuted blocks.

State the objective of stratified randomization.

Contrast the benefits of permuted blocks to those of adaptive
randomization schemes.

Use a SAS program to produce a permuted blocks randomization plan.

Use an allocation ratio that will maximize statistical power in the
situation where greater variability is expected in one treatment group
than the other.

Provide the rationale against randomizing prior to informed consent.


Lesson 9: Treatment Effects Monitoring; Safety Monitoring

Introduction

During a clinical trial that runs over a lengthy period of time, it can be desirable to
monitor treatment effects as well as to track safety issues. "Interim analysis"
or "early stopping" procedures are used to interpret the accumulating
information during a clinical trial. There may be a variety of practical reasons
for terminating a clinical trial at an early stage. Some of these overlap:

1. Treatments are found to be convincingly different,

2. Treatments are found to be convincingly not different,

3. Side effects or toxicity are too severe to continue treatment, relative to
the potential benefits,

4. The data are of poor quality,

5. Accrual is too slow to complete the study in a timely fashion,

6. Definitive information is available from outside the study, making the trial
unnecessary or unethical (this is related to the next item),

7. The scientific questions are no longer important because of other
developments,

8. Adherence to the treatment is unacceptably poor, preventing an answer
to the basic question,

9. Resources to perform the study are lost or no longer available, and/or

10. The study integrity has been undermined by fraud or misconduct.

(Piantadosi, 2005)

This lesson will examine different methods or guidelines that can be used
to help decide whether or not to terminate a clinical trial in progress.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

Differentiate between valid and invalid reasons for interim analyses and
early termination of a trial.

Identify characteristics of a sound plan for interim analysis.

Understand the theoretical framework for a likelihood based interim
analysis.

Compare and contrast the Bayesian approach to analysis with the
frequentist approach.

Recognize the general effects of the choice of the prior on the posterior
probability distribution from a Bayesian analysis.

Compare spending functions for 3 group sequential methods for
interim analysis.

Comment on the use of a group sequential method in a published
statistical analysis.

Recognize a futility assessment and define conditional power.

List topics that should be covered in an interim report to an IRB.

List the advantages and disadvantages of a DSMB and describe who
might compose the DSMB.

List the issues of concern to a DSMB in a typical clinical study.

References

DeMets DL, Lan KK. (1994) Interim analysis: The alpha spending function
approach. Statistics in Medicine 13: 1341-1352.

Ellenberg SS, Fleming TR, DeMets DL. (2002) Data Monitoring Committees
in Clinical Trials. New York, NY: Wiley.

Piantadosi, Steven. (2005) Treatment Effects Monitoring. In: Piantadosi
Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ:
John Wiley and Sons, Inc.

Pocock SJ. (1983) Clinical Trials: A Practical Approach. Chichester: John
Wiley and Sons.

9.1 - Overview
Data-dependent stopping is a general term to describe any statistical or
administrative reason for stopping a trial. Consideration of the reasons given
earlier may lead you to stop the trial at an early stage, or at least change the
protocol.

The review, interpretation, and decision-making aspects of clinical trials based
on interim data are necessary but prone to error. If the investigators learn of
interim results, then it could affect objectivity during the remainder of the trial,
or if statistical tests are performed repeatedly on the accumulating data, then
the Type I error rate is increased.

There is a natural conflict. On one hand, terminating the trial as early as
possible will save costs and labor, expose as few patients as possible to
inferior treatments, and allow disseminating information about the treatments
quickly. On the other hand, there are pressures to continue the trial for as long
as possible in order to increase precision, reduce errors of inference, obtain
sufficient statistical power to account for prognostic factors and examine
subgroups of interest, and gather information on secondary endpoints.

All of the available statistical methods for interim analyses have some similar
characteristics. They

1. require the investigators to state the objectives clearly and in advance of
the study,

2. assume that the basic study design is sound,

3. require some structure to the problem beyond the data being observed,

4. tend to have similar performance characteristics, and

5. impose a penalty for early termination.

No statistical method is a substitute for judgment. The statistical
criteria provide only guidelines for terminating the trial, because the decision to
stop a trial is not based solely on statistical information collected on one
endpoint.

9.2 - Likelihood Methods


It may be possible to assess treatment effects after each patient is accrued,
treated, and evaluated. Such an approach is impractical in most
circumstances, especially for trials that require lengthy follow-up to determine
outcomes.

The first likelihood method proposed for this situation is called the sequential
probability ratio test (SPRT) and it is based on the likelihood function. (This
method is very rarely implemented because of its impracticality, but it is important
for historical reasons.) Let's review this method in general terms here.

A likelihood function is constructed from a probability model for a sequence of
random variables which correspond to the outcome measurements on the
experimental units. In the likelihood function, however, the observed data
points replace the random variables. Suppose we have a binary response
(success/failure) from each patient which is determined immediately after a
treatment is administered. (Again, not very practical.) For the
situation discussed, we are examining one treatment which is administered to
every patient. If there are N patients with K successes, and p represents the
probability of success within each patient, then the likelihood function is based
on the binomial probability function:

$$L(p, K) = p^{K} (1 - p)^{N - K}$$

This is a very simple likelihood function for a very simple example.

If the investigator is trying to decide whether p0 or p1 is the more appropriate
value of p, then the likelihood ratio can be constructed to assess the evidence:

$$R = \frac{L(p_0, K)}{L(p_1, K)} = \left(\frac{p_0}{p_1}\right)^{K} \left(\frac{1 - p_0}{1 - p_1}\right)^{N - K}$$

This is a ratio of two different likelihood functions. If R is large, then the
evidence is going to favor p0. If R is small, then the evidence is going to
favor p1. Therefore, when analyzing interim data, we can calculate the
likelihood ratio and stop the trial only if we have the amount of evidence that is
expected for the target sample size.

Suppose that N is the target sample size and that after n patients there
are k successes. After each treatment we will stop and analyze the data to
determine whether to continue the trial or not. Under this scenario, we stop
the trial if:

$$R = \frac{L(p_0, k)}{L(p_1, k)} = \left(\frac{p_0}{p_1}\right)^{k} \left(\frac{1 - p_0}{1 - p_1}\right)^{n - k} \le R_L \quad \text{or} \quad R \ge R_U$$

where RL and RU are prespecified constants. Let's not worry about the details
of the statistical calculation here. The values of RL and RU that correspond to
testing H0: p = p0 versus H1: p = p1 are RL = α/(1 - β) and RU = (1 - α)/β.

A sample schematic of the SPRT in practice is shown below. Here you would
calculate R after the treatment of each patient. As you accumulate patients
you can see that R is moving around as the trial proceeds. Before we had
accrued all of the patients that we wanted we hit the upper boundary and
would not recruit the remaining patients.
Here is another example...

The SPRT might be useful in a phase II SE trial in which a treatment is to be
monitored closely to determine if it reaches a certain level of success or
failure. For example, suppose the investigator considers the treatment
successful if p = 0.4 (40% or greater), but considers it a failure if p = 0.2 (20%
or less). Thus, the hypothesis testing problem is H0: p = 0.2 vs. H1: p = 0.4.
Suppose we take α = 0.05 and β = 0.05. Then the bounds would be calculated
as RL = 1/19 and RU = 19. We would reject H0 in favor of H1, and claim
success, as soon as R gets small enough, R = (0.5)^k (1.33)^(n-k) ≤ 1/19. On the
other hand, we would stop the trial, accept H0, and claim
failure, as soon as R ≥ 19.

The statistical formulation for the SPRT is relatively straightforward, but it is
not a common procedure in clinical trials. The obvious criticism is that each
patient's outcome must be observed quickly before you recruit the next
patient. The SPRT also has the statistical property that it has a positive
probability of never reaching the boundaries RL and RU. If this is the case after
the target sample size, N, is reached, then the trial is inconclusive.
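
The stopping rule for the phase II example can be traced with a short Python sketch. It is a minimal illustration, not a production monitoring tool: the function name `sprt`, the way outcomes are coded (1 for success, 0 for failure), and the outcome sequences are assumptions made for this example. The boundaries 1/19 and 19 and the values p0 = 0.2 and p1 = 0.4 are taken from the example above.

```python
def sprt(outcomes, p0=0.2, p1=0.4, r_lower=1 / 19, r_upper=19):
    """Update R = L(p0)/L(p1) after each patient and stop at the first boundary crossing.

    outcomes: sequence of 1 (success) or 0 (failure), in order of accrual.
    Returns (decision, number of patients used, final value of R).
    """
    r = 1.0
    for n, success in enumerate(outcomes, start=1):
        # A success multiplies R by p0/p1; a failure multiplies it by (1-p0)/(1-p1).
        r *= (p0 / p1) if success else ((1 - p0) / (1 - p1))
        if r <= r_lower:
            return "reject H0 (claim success)", n, r
        if r >= r_upper:
            return "accept H0 (claim failure)", n, r
    return "inconclusive", len(outcomes), r


# Hypothetical outcome sequences.
print(sprt([1] * 10))        # an unbroken run of successes
print(sprt([0] * 20))        # an unbroken run of failures
print(sprt([1, 0, 0]))       # too little information to cross either boundary
```

The last call shows the inconclusive case: if neither boundary is reached by the time the available outcomes run out, no decision is returned.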

9.3 - Bayesian Methods


First, let's review the Bayesian approach in general and then apply it to our
current topic of likelihood methods.
The Bayesian approach to statistical design and inference is very different
from the classical approach (the frequentist approach).

Before a trial begins, a Bayesian statistician summarizes the current
knowledge or belief about the treatment effect, say we call it θ, in the form of a
probability distribution. This is known as the prior distribution for θ. These
assumptions are made prior to conducting the study and collecting any data.

Next, the data from the trial are observed, say we call it X, and the likelihood
function of X given θ is constructed. Finally, the posterior distribution for θ
given X is constructed. In essence, the prior distribution for θ is revised into
the posterior distribution based on the data X. The data collection in the study
informs or revises the earlier assumptions.

The following schematic describes this Bayesian approach:

The development of the posterior distribution may be very difficult
mathematically and it may be necessary to approximate it through computer
algorithms.
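
In symbols, the revision of the prior into the posterior is Bayes' theorem. The notation below is an assumption made for this note: π(θ) denotes the prior density, L(X | θ) the likelihood, and π(θ | X) the posterior density.

$$\pi(\theta \mid X) = \frac{L(X \mid \theta)\,\pi(\theta)}{\int L(X \mid \theta')\,\pi(\theta')\,d\theta'} \;\propto\; L(X \mid \theta)\,\pi(\theta)$$

The integral in the denominator is the part that can be difficult to evaluate, which is why numerical approximation is often needed.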

The Bayesian statistician performs all inference for the treatment effect by
formulating probability statements based on the posterior distribution. This is a
very different approach and is not always accepted by the more traditional
frequentist oriented statisticians.

In the Bayesian approach, θ is regarded as a random variable, about which
probability statements can be made. This is the appealing aspect of the
Bayesian approach. In contrast, the frequentist approach regards θ as a fixed
but unknown quantity (called a parameter) that can be estimated from the
data.

As an example of the contrasting philosophies, consider the frequentist
description and the Bayesian description of a 95% confidence interval for θ.

Frequentist: "If a very large number of samples, each with the same sample
size as the original sample, were taken from the same population as the
original sample, and a 95% confidence interval constructed for each sample,
then 95% of those confidence intervals would contain the true value of θ." This
is an extremely awkward and dissatisfying definition but technically represents
the frequentist's approach.

Bayesian: "The 95% confidence interval defines a region that covers 95% of
the possible values of θ." This is much simpler and more straightforward. (As a
matter of fact, most people when they first take a statistics course believe that
this is the definition of a confidence interval.)

In a Bayesian analysis, if θ is a parameter of interest, the analysis results in a
probability distribution for θ. Using the probability distribution, many
statements can be made. For example, if θ represents a probability of success
for a treatment, a statement can be made about the probability that θ > 0.90
(or any other value).

9.4 - Back to Clinical Trials


With respect to clinical trials, a Bayesian approach can cause some difficulties
for investigators because they are not accustomed to representing their prior
beliefs about a treatment effect in the form of a probability distribution. In
addition, there may be very little prior knowledge about a new experimental
therapy, so investigators may be reluctant to or not be able to quantify their
prior beliefs. In the business world, the Bayesian approach is used quite often
because of the availability of prior information. In the medical field, more often
than not, this is not the case.

The choice of a prior distribution can be very controversial. Different
investigators may select different priors for the same situation, which could
lead to different conclusions about the trial. This is especially true when the
data, X, are based on a small sample size, because in such situations the prior
distribution is modified only slightly to form the posterior distribution. The
posterior distribution then sits very close to the prior, so you are basing your results
almost entirely on your prior assumptions.

When there is little prior information on which to base assumptions about the
distribution, Bayesians employ a reference (or vague or non-informative)
prior. These are intended to represent a minimal amount of prior information.
Although vague priors may yield results similar to those of a frequentist
approach, the priors may be unrealistic because they attempt to assign equal
weight to all values of θ. Below you can see a very flat distribution, very
spread out over a wide range of values.

Similarly, skeptical prior distributions are those that quantify the belief that
large treatment effects are unlikely. Enthusiastic prior distributions are those
that quantify the belief that large treatment effects are likely. Let's not worry about the calculations, but
focus instead on the concepts here...

An example of a Bayesian approach for interim monitoring is as follows.
Suppose an investigator plans a trial to detect a hazard ratio of 2 (Λ = 2) with
90% statistical power (β = 0.10) using a sample size of at least 90 events. The
investigator plans one interim analysis, approximately halfway through the trial,
and a final analysis. (This is the more standard approach, as opposed to the
SPRT where R was calculated after each treatment.)

The estimated logarithm of the hazard ratio is approximately normally
distributed with variance (1/d1) + (1/d2), where d1 and d2 are the numbers of
events in the two treatment groups. The null hypothesis is that the treatment
groups are the same, i.e., H0: Λ = 1. Note that the log_e hazard ratio is 0 under
the null hypothesis and the log_e hazard ratio is 0.693 when Λ = 2, the
proposed effect size.

Suppose the investigator has access to some pilot data or the published
report of another investigator, in which there appeared to be a very small
treatment effect with 16 events occurring within each of the two treatment
groups. The investigator decides that this preliminary study will form the basis
of a skeptical prior distribution for the log_e hazard ratio with a mean of 0 and a
standard deviation of 0.35 = {(1/16) + (1/16)}^(1/2). This is called a skeptical prior
because it expresses skepticism that the treatment is beneficial.

Next, suppose that at the time of the interim analysis (45 events have
occurred), there are 31 events in one group and 14 events in the other group,
such that the estimated hazard ratio is 2.25 (calculations not shown). These
values are incorporated into the likelihood function, which modifies the prior
distribution to yield the posterior distribution for the estimated log_e hazard ratio
that has a mean = 0.474 and standard deviation = 0.228 (calculations not
shown). Therefore we can calculate the probability that Λ exceeds 2. From the
posterior distribution we construct the following probability statement:

$$\Pr[\Lambda \ge 2] = 1 - \Phi\left(\frac{\log_e(2) - 0.474}{0.228}\right) = 1 - \Phi(0.961) = 0.168$$

where Φ represents the cumulative distribution function for the standard
normal and Λ is the true hazard ratio.

Conclusion: Based on the results from the interim analysis with a skeptical
prior, there is not strong evidence that the treatment is effective because the
posterior probability of the hazard ratio exceeding 2 is relatively small.
Therefore, there is not enough evidence here to suggest that the study be
stopped. How large must this probability be to justify stopping? A reasonable
threshold should be specified in your
protocol before these values are determined.

In contrast, suppose that before the onset of the trial the investigator is very
excited about the potential benefit of the treatment. Therefore, the investigator
wants to use an enthusiastic prior for the log_e hazard ratio, i.e., a normal
distribution with mean = log_e(2) = 0.693 and standard deviation = 0.35 (same
as the skeptical prior).

Suppose the interim data results are the same as those described above. This
time, the posterior distribution for the log_e hazard ratio is normal with mean =
0.762 and standard deviation = 0.228. Then the posterior probability
is:

$$\Pr[\Lambda \ge 2] = 1 - \Phi\left(\frac{\log_e(2) - 0.762}{0.228}\right) = 1 - \Phi(-0.302) = 0.619$$

This is a drastic change in the probability based on the assumptions that were
made ahead of time. In this case, the investigator still may not consider this to
be strong evidence that the trial should terminate because the posterior
probability of the hazard ratio exceeding 2 does not exceed 0.90.

Nevertheless, the example demonstrates the controversy that can arise with a
Bayesian analysis when the amount of experimental data is small, i.e., the
selection of the prior distribution drives the decision-making process. For this
reason, many investigators prefer to use non-informative priors. Using the
Bayesian methods, you can make probability statements about your expected
results.
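
The posterior means and standard deviations quoted in this example can be reproduced with a normal prior and a normal likelihood, where precisions (inverse variances) add. The Python sketch below is one way to arrive at those figures and rests on an assumption that is not stated in the text: the quoted values (0.474, 0.762, and 0.228) are recovered when the variance of the interim estimate is taken as 4/d with d = 45 total events, which equals (1/d1) + (1/d2) only when events are split evenly. The function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def normal_posterior(prior_mean, prior_var, estimate, estimate_var):
    """Combine a normal prior with a normal likelihood; returns (mean, sd)."""
    w_prior = 1.0 / prior_var
    w_data = 1.0 / estimate_var
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return post_mean, sqrt(post_var)


def prob_hazard_ratio_at_least(threshold, post_mean, post_sd):
    """Posterior probability that the hazard ratio is at least `threshold`."""
    return 1.0 - NormalDist(post_mean, post_sd).cdf(log(threshold))


prior_var = 1 / 16 + 1 / 16      # prior SD of about 0.35, from 16 events per group
estimate = log(2.25)             # interim log hazard ratio, as reported in the text
estimate_var = 4 / 45            # ASSUMPTION: 4/d with d = 45 total events

skeptical = normal_posterior(0.0, prior_var, estimate, estimate_var)
enthusiastic = normal_posterior(log(2), prior_var, estimate, estimate_var)

for label, (mean, sd) in (("skeptical", skeptical), ("enthusiastic", enthusiastic)):
    p = prob_hazard_ratio_at_least(2, mean, sd)
    print(f"{label}: mean={mean:.3f}, sd={sd:.3f}, Pr[HR >= 2]={p:.3f}")
```

Because the same data are combined with each prior, the two posteriors share a standard deviation and differ only in their means, which is what drives the different probability statements.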

9.5 - Frequentist Methods


From a frequentist point of view, repeated hypothesis testing of accumulating
data can increase the type I error rate of a clinical trial. Therefore, the
frequentist approach to interim monitoring of clinical trials focuses on
controlling the type I error rate.
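
A small simulation makes the inflation concrete. The Python sketch below is an illustration under simplifying assumptions (normally distributed outcomes with known variance, equally spaced looks, and an unadjusted critical value of 1.96 at every look); the function name, seed, and numbers of trials and patients are arbitrary choices for this example. It generates data under the null hypothesis and counts how often at least one analysis is nominally significant.

```python
import random
from math import sqrt


def false_positive_rate(looks, n_per_look=20, n_trials=20000, critical=1.96, seed=1):
    """Estimate the chance of any |Z| >= critical across `looks` analyses under H0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for _ in range(looks):
            # The sum of n_per_look standard normal outcomes is N(0, n_per_look).
            total += rng.gauss(0.0, sqrt(n_per_look))
            n += n_per_look
            if abs(total / sqrt(n)) >= critical:  # Z statistic on all data so far
                rejections += 1
                break
    return rejections / n_trials


rate_one_look = false_positive_rate(1)
rate_five_looks = false_positive_rate(5)
print(f"one analysis:  {rate_one_look:.3f}")
print(f"five analyses: {rate_five_looks:.3f}")
```

With a single analysis the estimated rate should sit near the nominal 0.05, while five unadjusted analyses push it well above that level.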

In most clinical trials, it is not necessary to perform a statistical analysis after
each patient is accrued. In fact, for most multi-center clinical trials, interim
statistical analyses are conducted only once or twice per year. Usually this
frequency of interim analyses detects treatment effects nearly as early as
continuous monitoring. The group sequential analysis is defined as the
situation in which only a few scheduled analyses are conducted. Again, let's
focus more on the concepts than the statistical details.

Suppose that the group sequential approach consists of R interim analyses,
and we let Z1, ... , ZR denote the test statistics at the R times of hypothesis
testing. Here we are accumulating data over time: at each analysis we add to the
dataset and analyze everything collected so far. Also, we
let B1, ... , BR denote the corresponding boundary points (critical values). At
the rth interim analysis, the clinical trial is terminated with rejection of the null
hypothesis if:

$$|Z_r| \ge B_r, \quad r = 1, 2, \ldots, R$$

The boundary points are chosen such that the overall significance level does
not exceed the desired α. Three main schemes for selecting the
boundary points have been proposed. These are illustrated in the
following table for an overall significance level of α = 0.05 and for R = 2,3,4,5.
The table is constructed under the assumption that n patients are accrued at
each of the R statistical analyses so that the total sample size is N = nR.

                O'Brien-Fleming       Haybittle-Peto*       Pocock
R   Analysis    B        α            B        α            B        α

2   1           2.782    0.0054       3.0      0.002        2.178    0.0294
2   2           1.967    0.0492       1.960    0.0500       2.178    0.0294

3   1           3.438    0.0006       3.291    0.0010       2.289    0.0221
3   2           2.431    0.0151       3.291    0.0010       2.289    0.0221
3   3           1.985    0.0471       1.960    0.0500       2.289    0.0221

4   1           4.084    0.00005      3.291    0.00100      2.361    0.0182
4   2           2.888    0.0039       3.291    0.00100      2.361    0.0182
4   3           2.358    0.0184       3.291    0.00100      2.361    0.0182
4   4           2.042    0.0412       1.960    0.0500       2.361    0.0182

5   1           4.555    0.000005     3.291    0.00100      2.413    0.0158
5   2           3.221    0.0013       3.291    0.00100      2.413    0.0158
5   3           2.630    0.0085       3.291    0.00100      2.413    0.0158
5   4           2.277    0.0228       3.291    0.00100      2.413    0.0158
5   5           2.037    0.0417       1.960    0.0500       2.413    0.0158

For example, if you were to have one interim analysis and a final analysis, in
this table, that means R=2. Use the first two rows of the table to find the
critical values.

If you were to have three interim analyses and then one final analysis, then
R=4. You would use the corresponding four rows in the middle of the table to
determine critical values.

Notice that the different approaches 'spend', or distribute, the overall
significance level differently across the interim and final analyses.
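
The relationship between a boundary point B and its nominal two-sided significance level, and the stopping rule itself, can be checked with a few lines of Python. This is an illustrative sketch: the function names and the sequence of test statistics are hypothetical, while the boundary values are the R = 3 entries from the table above.

```python
from statistics import NormalDist


def nominal_alpha(boundary):
    """Two-sided significance level corresponding to a critical value B."""
    return 2 * (1 - NormalDist().cdf(boundary))


def first_crossing(z_values, boundaries):
    """Return the analysis number at which |Z_r| >= B_r first occurs, or None."""
    for r, (z, b) in enumerate(zip(z_values, boundaries), start=1):
        if abs(z) >= b:
            return r
    return None


obrien_fleming = [3.438, 2.431, 1.985]   # R = 3 boundaries from the table
pocock = [2.289, 2.289, 2.289]

print([round(nominal_alpha(b), 4) for b in obrien_fleming])
print([round(nominal_alpha(b), 4) for b in pocock])

# Hypothetical test statistics observed at the three analyses.
z_observed = [2.30, 2.35, 2.40]
print("O'Brien-Fleming stops at analysis:", first_crossing(z_observed, obrien_fleming))
print("Pocock stops at analysis:", first_crossing(z_observed, pocock))
```

For the same hypothetical data, the constant boundary is crossed at an earlier analysis than the boundary that starts high and decreases toward the final analysis.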

The Pocock approach uses the same significance level at each of
the R interim analyses. Of the three procedures described in the table, it
provides the best chance of early trial termination. Many investigators dislike
the Pocock approach, however, because of its properties at the final stage of
analysis. For example, suppose R = 3 analyses are planned and that
statistical significance is not attained at any of the analyses. Suppose that
the p-value at the final analysis is 0.0350 (this is > 0.0221 found in the table
for the Pocock approach). If interim analyses had not been scheduled,
however, this p-value would be considered to provide a statistically significant
result (p = 0.0350 < 0.0500).

The Haybittle-Peto (based on intuitive reasoning) and O'Brien-Fleming (based
on statistical reasoning) approaches were designed to avoid this problem. On
the other hand, these two approaches render it very difficult to attain statistical
significance at an early stage. In any case, it is important to make it clear to
investigators in the study the approach that has been selected for the interim
analyses.

Example

An example of the Pocock approach is provided in Pocock's book (Pocock.
1983. Clinical Trials: A Practical Approach, New York, John Wiley & Sons). A
trial was conducted in patients with non-Hodgkin's lymphoma, in which two
drug combinations were compared, namely cytoxan-prednisone (CP) and
cytoxan-vincristine-prednisone (CVP). The primary endpoint was
presence/absence of tumor shrinkage, a surrogate variable.

Patient accrual lasted over two years and 126 patients participated. Statistical
analyses were scheduled after approximately every 25 patients. Chi-square
tests (without the continuity correction) were performed at each of the five
scheduled analyses. The Pocock approach to group sequential testing
requires a significance level of 0.0158 at each analysis. Here is a table with
the results of these analyses.

            Tumor shrinkage by treatment
Analysis    CP          CVP          p-value

1           3/14        5/11         p > 0.10
2           11/27       13/24        p > 0.10
3           18/40       17/36        p > 0.10
4           18/54       24/48        0.05 < p < 0.10
5           23/67       31/59        0.0158 < p < 0.10

Thus, the researchers were concerned that the CVP combination appeared to
be clinically better than the CP combination (53% success versus 34%
success), yet it did not lead to a statistically significant result with Pocock's
approach. Further analyses with secondary endpoints convinced the
researchers that the CVP combination is superior to the CP combination.
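
The p-value ranges in the table can be checked from the reported counts. The Python sketch below computes the chi-square statistic without continuity correction for a 2x2 table and converts it to a p-value using the fact that a chi-square variable with one degree of freedom is the square of a standard normal. The function name is hypothetical, and the counts are those shown in the table.

```python
from math import sqrt
from statistics import NormalDist


def chi_square_2x2(success1, n1, success2, n2):
    """Pearson chi-square (no continuity correction) and its p-value for two proportions."""
    a, b = success1, n1 - success1
    c, d = success2, n2 - success2
    n = n1 + n2
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))  # chi-square with 1 df
    return chi2, p_value


# (CP successes, CP total, CVP successes, CVP total) at each scheduled analysis.
analyses = [(3, 14, 5, 11), (11, 27, 13, 24), (18, 40, 17, 36), (18, 54, 24, 48), (23, 67, 31, 59)]
pocock_level = 0.0158

for number, counts in enumerate(analyses, start=1):
    chi2, p = chi_square_2x2(*counts)
    print(f"analysis {number}: chi-square={chi2:.3f}, p={p:.4f}, stop={p <= pocock_level}")
```

None of the five p-values falls at or below 0.0158, which matches the conclusion that the trial never crossed the Pocock boundary.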

9.6 - The O'Brien-Fleming Approach

The O'Brien-Fleming approach is the most popular group sequential approach
because the significance level at the final analysis is near the overall desired
significance level. The REMATCH clinical trial is a good example.

A few drawbacks to the group sequential approach to interim statistical testing
include the strict requirements that (1) the number of scheduled analyses, R,
must be determined prior to the onset of the trial, and (2) there must be equal
spacing between scheduled analyses with respect to patient accrual. The
alpha spending function approach was developed to overcome these
drawbacks (DeMets DL, Lan KK, 1994, Interim analysis: The alpha spending
function approach, Statistics in Medicine 13: 1341-1352).

Let τ denote the information fraction available during the course of a clinical
trial. For example, in a clinical trial with a target sample size, N, in which
treatment group means will be compared, the information fraction at an interim
analysis is τ = n/N, where n is the sample size at the time of the interim
analysis. If your target sample size is 500 and you have taken measurements
on 400 patients, then τ = 0.8.

If the clinical trial involves a time-to-event endpoint, then the information
fraction is τ = d/D, where D is the target number of events for the entire trial
and d is the number of events that have occurred at the time of the interim analysis.

The alpha spending function, α(τ), is an increasing function with α(0) = 0 and
α(1) = α, the desired overall significance level. In other words, every time you
are doing an analysis you are in a sense "spending part of your alpha." For
the rth interim analysis, where the information fraction is τ_r, 0 ≤ τ_r ≤ 1, α(τ_r)
determines the probability of any of the first r analyses leading to rejection of
the null hypothesis when the null hypothesis is true. As an example, suppose
investigators are planning a trial in which patients are examined every two
weeks over a 12-week period. The investigators would like to incorporate an
interim analysis when one-half of the subjects have completed at least one-
half of the trial. This corresponds to τ = 0.25.

A simple spending function that is a compromise between the Pocock and
O'Brien-Fleming functions is α(τ) = ατ, 0 ≤ τ ≤ 1. This leads to a significance
level of 0.012 at the interim analysis and a significance level of 0.04 at the
final analysis (calculations not shown). There are all types of variations that
statisticians have devised.
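
To see how spending functions differ, the Python sketch below tabulates the cumulative alpha spent at several information fractions. The linear function is the one described above. The other two are the forms commonly attributed to Lan and DeMets as approximations to the O'Brien-Fleming and Pocock boundaries; they are brought in here as an assumption for comparison and do not come from the text of this lesson. The function names are hypothetical.

```python
from math import e, log, sqrt
from statistics import NormalDist


def linear_spend(tau, alpha=0.05):
    """Linear spending function: alpha * tau."""
    return alpha * tau


def pocock_type_spend(tau, alpha=0.05):
    """Pocock-type spending function: alpha * ln(1 + (e - 1) * tau)."""
    return alpha * log(1 + (e - 1) * tau)


def obrien_fleming_type_spend(tau, alpha=0.05):
    """O'Brien-Fleming-type spending function: 2 * (1 - Phi(z_{alpha/2} / sqrt(tau))), tau > 0."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(tau)))


print("tau    O'Brien-Fleming-type  linear    Pocock-type")
for tau in (0.25, 0.50, 0.75, 1.00):
    print(f"{tau:.2f}   {obrien_fleming_type_spend(tau):.5f}               "
          f"{linear_spend(tau):.5f}   {pocock_type_spend(tau):.5f}")
```

At τ = 0.25 the linear function has spent 0.0125 of the overall 0.05, which lies between the very small amount spent by the O'Brien-Fleming-type function and the larger amount spent by the Pocock-type function. All three reach 0.05 at τ = 1.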

Regardless of whether a sequential, group sequential, or alpha spending
function approach is invoked, the estimates of a treatment effect will be biased
when a trial is terminated at an early stage. The earlier the decision, the larger
the bias. Intuitively, if your target sample size is 200 and you decide to
terminate the trial after 25 patients because you think you have found a
significant difference between treatment groups, there could be a lot of bias in
this type of result. Is this number of patients a representative sample from the
population?

9.7 - Futility Assessment with Conditional Power

As an alternative to the above methods, we might want to terminate a trial
when the results of the interim analysis are unlikely to change after accruing
more patients (futility assessment). It just doesn't look like there could ever be
a significant difference!

Conditional power is an approach that quantifies the statistical
power to yield an answer different from that seen at the interim analysis, given
the data observed so far. If this
quantity is really small, then you can conclude that it would be futile to
continue with the investigation.

As a simple example, consider the situation in which we want to determine if a
coin is fair, so the hypothesis testing problem is:

$H_0: p = \Pr[\text{Heads}] = 0.5$ versus $H_1: p = \Pr[\text{Heads}] > 0.5$

The fixed sample size plan is to toss the coin 500 times and count the number of
heads, X. But do we actually need to flip the coin 500 times? Under the fixed
plan, we would reject $H_0$ at the 0.025 significance level if:

$$Z = \frac{X - 250}{\sqrt{(500)(0.5)(0.5)}} \ge 1.96$$

This is equivalent to rejecting $H_0$ if $X \ge 272$. Suppose that after 400 tosses of
the coin there are 272 heads. It is futile to proceed further because even if the
remaining 100 tosses all yielded tails, the null hypothesis still would be rejected
at the 0.025 significance level. The calculation of the conditional power in this
example is trivial (it equals 1) because no matter what is assumed about the
true value of p, the null hypothesis would be rejected if the trial were taken to
completion.

You can also look at this in the other direction. Suppose that after 400 tosses
of the coin there are 200 heads. The null hypothesis will be rejected if there
are at least 72 heads during the remaining 100 tosses.

Even if p = 0.6 (an arbitrary assignment), the conditional power is:

$$\Pr[X \ge 72 \mid n = 100, p = 0.6] = \Pr\left[\frac{X - 60}{\sqrt{(100)(0.6)(0.4)}} \ge \frac{72 - 60}{\sqrt{(100)(0.6)(0.4)}}\right] = \Pr[Z \ge 2.45] = 0.007$$

The probability, based on a standard normal table, is 0.007, a very small
probability. Thus, it is futile to continue because there is such a small chance
of rejecting $H_0$.
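
The two coin-tossing cases can be checked numerically. The following is a minimal sketch, not part of the original lesson: the function names are hypothetical, the rejection threshold of 272 heads comes from the example, and the exact binomial tail is shown next to the normal approximation used in the text.

```python
import math

def binom_tail(n: int, k: int, p: float) -> float:
    """Exact Pr[X >= k] for X ~ Binomial(n, p)."""
    k = max(k, 0)
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def normal_tail(z: float) -> float:
    """Pr[Z >= z] for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def conditional_power(heads_so_far, remaining, p, critical_total=272):
    """Probability of reaching the rejection threshold given the interim count."""
    needed = critical_total - heads_so_far
    return binom_tail(remaining, needed, p)

# Case 1: 272 heads after 400 tosses, so rejection is already guaranteed.
cp_certain = conditional_power(272, 100, 0.5)

# Case 2: 200 heads after 400 tosses, assuming p = 0.6 for the remaining 100.
cp_exact = conditional_power(200, 100, 0.6)
z = (72 - 100 * 0.6) / math.sqrt(100 * 0.6 * 0.4)
cp_normal = normal_tail(z)
print(round(cp_certain, 3), round(cp_normal, 3))
```

The exact binomial value differs slightly from the normal approximation because no continuity correction is applied, but both are small enough to support the futility conclusion.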

9.8 - Monitoring and Interim Reporting for Trials
Single-Center Trials

Here are some practical issues as they relate to single center trials. Typically,
an investigator for a single-center trial needs to submit an annual report to
his/her IRB. The report should address whether the study is safe and whether
it is appropriate to continue.

The report should include the following topics:

1. Compliance with governmental and institutional oversight,

2. Review of eligibility (low frequency of ineligible patients entering the trial),

3. Treatment review (most patients are adhering to the treatment regimen),

4. Summary of response,

5. Summary of survival,

6. Adverse events,

7. Safety monitoring rules (possibly statistical criteria for evaluating safety
endpoints), and

8. Audit and other quality assurance reviews.

Multi-Center Trials

A multi-center trial is one in which there are one or more clinical investigators
at each of a number of locations (centers). Obviously, multi-center trials are of
great importance when the disease is not common and a single investigator is
capable of recruiting only a handful of patients.

Advantages of a multi-center trial (Pocock, 1983) include the following:

1. Larger sample size and quicker patient accrual,

2. Broader interpretations of results because of the multiple participants
involved in the study across various geographic regions (this adds to
external validity), and

3. Increased scientific merit of the trial because of collaborations among
experienced clinical scientists involved in the design and
implementation of the study.

Of course, there is a downside. Disadvantages of a multi-center trial include
the following:

1. Planning is more complex,

2. The study is going to be more expensive,

3. More effort needs to go into ensuring compliance with the clinical protocol
across all centers,

4. Quality control must be implemented for taking measurements and
recording data,

5. You need a data coordinating center (DCC) for storing and monitoring data
and organizing investigators,

6. A need develops to keep all investigators involved and motivated,

7. Avoidance of passive investigators,

8. Compromise between quantity and quality of centers, and

9. A need for strong leadership.

The NIH requires a Data and Safety Monitoring Board (DSMB) to monitor the
progress of a multi-center clinical trial that it sponsors. Although the FDA does
not require a pharmaceutical/biotech company to construct a DSMB for its
multi-center clinical trials, many companies are starting to use DSMBs on a
regular basis.

A DSMB provides several advantages, such as a mechanism for protecting
the interests and safety of the trial participants,
while maintaining scientific integrity. The manner in which it is constructed
should ensure that the DSMB is financially and scientifically independent of
the study investigators, so that decisions about early stopping or study
continuation are made objectively. Depending on the circumstances, a DSMB
may be composed of anywhere from three to ten experts in medicine,
statistics, epidemiology, data management, clinical chemistry, and ethics.
None of the study investigators should be a part of the DSMB. In addition, the
DSMB should not be masked to treatment assignment when it is evaluating a
clinical trial. Although investigators and statisticians may submit information
and materials to the DSMB for their study, most of the deliberations made by
the DSMB are kept confidential. The DSMB reports directly to the sponsor of
the multi-center trial (the NIH or the company) and does not report to the
investigators.

A DSMB typically examines the following issues when assessing the worth of
a multi-center clinical trial:

1. Are the treatment groups comparable at baseline?

2. Are the accrual rates meeting initial projections and is the trial on its
scheduled timeline?

3. Are the data of sufficient quality?

4. Are the treatment groups different with respect to safety and toxicity
data?

5. Are the treatment groups different with respect to efficacy data?

6. Should the trial continue?

7. Should the protocol be modified?

8. Are other descriptive statistics, graphs, or analyses needed for the
DSMB to make its decisions?

The major disadvantage of a DSMB holding the decision-making authority in a
multi-center clinical trial, instead of the investigators, is that expertise may be
sacrificed in order to maintain impartiality. Investigators gain valuable
knowledge during the course of the trial and it is not possible to provide the
DSMB with the totality of this knowledge. Nevertheless, the advantages of a
DSMB seem to outweigh this disadvantage during the conduct of a multi-center
trial.

A comprehensive book on the aspects of DSMBs is available: Ellenberg SS,
Fleming TR, DeMets DL. (2002). Data Monitoring Committees in Clinical
Trials. New York, NY: Wiley.

9.9 - Summary
In this lesson, among other things, we learned to:

Differentiate between valid and invalid reasons for interim analyses and
early termination of a trial.

Identify characteristics of a sound plan for interim analysis.

Understand the theoretical framework for a likelihood-based interim
analysis.

Compare and contrast the Bayesian approach to analysis with the
frequentist approach.

Recognize the general effects of the choice of the prior on the posterior
probability distribution from a Bayesian analysis.

Compare spending functions for 3 group sequential methods for
interim analysis.

Comment on the use of a group sequential method in a published
statistical analysis.

Recognize a futility assessment and define conditional power.

List topics that should be covered in an interim report to an IRB.

List the advantages and disadvantages of a DSMB and describe who
might compose the DSMB.

List the issues of concern to a DSMB in a typical clinical study.

Let's explore Bayesian methods further in this week's discussion and apply
what we have learned to the assessment questions.

Lesson 10: Missing Data and Intent-to-Treat
Introduction

Data imperfections can be classified as protocol non-adherence, missing or
incomplete observations, or methodologic errors. Two different perspectives
have been applied to address the situation with imperfect data.

The explanatory approach corresponds to acquiring information and
determining biological effects, which typically is done for efficacy studies.

The pragmatic approach corresponds to interpreting results according to
general use, which typically is done for effectiveness studies.

Distinguishing between the two approaches is useful in determining how to
deal with data imperfections.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

Apply the terms evaluable and inevaluable to patients in a clinical trial
appropriately.

Differentiate between a pragmatic approach and an explanatory
approach and the conditions for which each is appropriate.

State the conditions under which data imputation is reasonable.

State the advantage of multiple imputation over simple data imputation.

Recognize the conditions under which the International Conference on
Harmonization would allow a patient's data to be excluded from
statistical analysis.

10.1 - Specific Data Imperfections


"Evaluable"

Protocols sometimes contain improper plans which create or exacerbate
imperfections in the data. One problem in this regard involves the evaluation
of which patients received the correct treatment in the correct amounts.
Patients who meet certain criteria are said to be "evaluable".

As an example, consider an SE trial in which $N_E$ patients are considered
evaluable and $N_I$ patients are considered inevaluable. Suppose that the
numbers of evaluable and inevaluable patients with favorable outcomes
are $R_E$ and $R_I$, respectively. You may consider using one of the following
estimates for the probability of a favorable outcome, namely,

$$P = \frac{R_E + R_I}{N_E + N_I} \quad \text{and} \quad P_E = \frac{R_E}{N_E}$$

$P$ (pragmatic approach) is based on all patients (intention-to-treat),
whereas $P_E$ (explanatory approach) is based on evaluable patients only.
Usually $R_I$ is close to zero so that $P_E > P$.
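
To make the difference between the two estimates concrete, here is a small sketch with hypothetical counts (the numbers and the function name are assumptions for illustration only):

```python
def favorable_outcome_estimates(n_e, r_e, n_i, r_i):
    """Return (P, P_E): the all-patient and evaluable-only response proportions."""
    p_pragmatic = (r_e + r_i) / (n_e + n_i)
    p_explanatory = r_e / n_e
    return p_pragmatic, p_explanatory

# Hypothetical trial: 40 evaluable patients (20 responders),
# 10 inevaluable patients (1 responder).
p, p_e = favorable_outcome_estimates(n_e=40, r_e=20, n_i=10, r_i=1)
print(p, p_e)
```

With few responders among the inevaluable patients, the evaluable-only estimate is noticeably larger than the all-patient estimate.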

$P_E$ may appear to be a more appropriate estimate of the clinical effect, since
it is obvious that the treatment cannot have an effect if it is not received. Some
investigators will write a protocol to indicate that only data from those who
received treatment for at least some number of doses or longer than a
particular length of time will be used in analysis. Do you recall a major
difficulty with this explanatory approach? What about post-entry exclusion
bias?

Since evaluability criteria define inclusion retroactively based on treatment
adherence, which is not determined until completion of the study, there is
potential post-entry exclusion bias. Participant data should not be selected
for inclusion in data analysis based on an outcome variable.

The pragmatic approach does not encounter such difficulties, although
obviously it does not help elicit biological effects in an efficacy trial. It is
prudent, then, to select treatments and a protocol design that will result in a
high level of treatment adherence with the hope that the
pragmatic/intention-to-treat approach agrees as much as possible with the
explanatory approach.

Missing data

Usually, unrecorded data imply that methodologic errors have occurred. If
this happens frequently, there could be a fundamental problem with the design
or conduct of the study. Some missing data are due to human error, such as
forgetting to record or enter the data.

In longitudinal clinical trials some patients may be lost to follow-up. If losses
to follow-up occur for reasons not associated with outcome, then they have
little impact, other than reducing precision. If losses to follow-up occurred
independently of outcome, then the explanatory and pragmatic approaches
would be equivalent. Investigators, however, cannot assume that all losses
to follow-up are random events and conduct analyses that ignore such losses.
Being lost to follow-up may be associated with a higher chance of disease
progression, recurrence, or death. If a patient has not withdrawn consent,
then every effort should be made to recover lost information.

There are three generic approaches to handling missing data values:

1. disregard the observations that contain missing values;

2. disregard the outcome variable if it has a high proportion of missing
values;

3. replace the missing values by appropriate values (data imputation).

Data imputation is a reasonable approach under certain circumstances:

1. the frequency of missingness is relatively small (say less than 10%);

2. the outcome variable with the missing values is important clinically or
biologically;

3. reasonable strategies for the data imputation exist;

4. sensitivity of the conclusions to different data imputation strategies can
be determined.

Simple data imputation involves substituting one data point for each missing
value. Some substitution choices include the mean of the non-missing values
or a predicted value from a linear regression model.

Another simple data imputation method is the last observation carried forward
(LOCF) approach in longitudinal studies. With LOCF, the last observed value
for a patient is substituted for all of that patient's subsequent missing values.
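
A minimal sketch of LOCF for a single patient's visit series follows. The representation of a missing visit as `None`, the function name, and the sample values are assumptions for illustration.

```python
from typing import List, Optional

def locf(values: List[Optional[float]]) -> List[Optional[float]]:
    """Carry the last observed value forward over later missing visits (None)."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled

# Hypothetical patient with visits 4 and 5 missing.
visits = [142.0, 138.0, 135.0, None, None]
print(locf(visits))
```

Note that the imputed visits repeat the same value, so the filled series shows no change after the last observed visit; a missing value that precedes any observation is left as missing.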

The problems with simple data imputation methods are that they can yield a
very biased result and they tend to underestimate variability during the data
analysis.

Multiple imputation methods are preferred, in which

1. imputations are generated, usually via a regression model, and random
errors are added to the predicted values via random number generators,

2. multiple imputed data sets are created in this manner (say 10-20 data
sets), and

3. the results are averaged across the multiple data sets.
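
The three steps above can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: a single baseline covariate, a simple linear regression fitted to the complete cases, normally distributed random errors, and hypothetical data and function names. It does not draw the regression parameters from their sampling distribution or combine variances across data sets, both of which a full multiple imputation analysis would do.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD from complete cases."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, resid_sd

def multiply_impute_mean(pairs, n_imputations=20, seed=1):
    """Average the outcome mean over several randomly imputed data sets."""
    rng = random.Random(seed)
    complete = [(x, y) for x, y in pairs if y is not None]
    a, b, sd = fit_line([x for x, _ in complete], [y for _, y in complete])
    estimates = []
    for _ in range(n_imputations):
        # Step 1: predicted value plus a random error for each missing outcome.
        ys = [y if y is not None else a + b * x + rng.gauss(0.0, sd)
              for x, y in pairs]
        # Step 2: each pass through the loop is one imputed data set.
        estimates.append(statistics.fmean(ys))
    # Step 3: average the results across the imputed data sets.
    return statistics.fmean(estimates)

# Hypothetical (baseline, outcome) pairs; None marks a missing outcome.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, None), (5, 9.8), (6, None), (7, 14.1)]
mi_mean = multiply_impute_mean(data)
print(round(mi_mean, 2))
```

Because each imputed value carries its own random error, the imputed data sets differ from one another, and that between-data-set variation is what simple imputation ignores.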

In most clinical trials, it is common to find errors that yield ineligible patients
participating in the trial. Objective eligibility criteria are less susceptible to error
than subjective criteria. Also, patients can fail to comply with nearly every
aspect of treatment specification, such as reduced or missed doses and
improper dose scheduling.

Ineligible patients in the study can be (1) included in the analysis of the cohort
of eligible patients (pragmatic approach/intention-to-treat) or (2) excluded from
the analysis (explanatory approach).

In a randomized trial, if the eligibility criteria are objective and assessed prior
to randomization, then neither approach causes a bias. The pragmatic
approach, however, increases the external validity.

10.2 - Intention-to-Treat
Intention-to-treat (ITT) is the principle that patients in a randomized clinical
trial should be analyzed according to the group to which they were assigned,
even if they did not

1. receive the intended treatment,

2. adhere to the treatment regimen, or

3. comply with the protocol in any manner.

The ITT principle is a generalization of the pragmatic approach, whereas
treatment received (TR), or per-protocol analysis, is the principle that patients
should be analyzed according to the treatment they actually received.
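
The difference between the two principles comes down to which variable defines the comparison groups. The sketch below uses a handful of hypothetical patient records (the field names and values are assumptions for illustration) and computes success rates both ways.

```python
from collections import defaultdict

def success_rates(records, group_key):
    """Success proportion per arm, grouping by 'assigned' or 'received'."""
    totals, successes = defaultdict(int), defaultdict(int)
    for rec in records:
        arm = rec[group_key]
        totals[arm] += 1
        successes[arm] += rec["success"]
    return {arm: successes[arm] / totals[arm] for arm in totals}

# Hypothetical patients: two assigned to the new drug actually received placebo.
records = [
    {"assigned": "drug", "received": "drug", "success": 1},
    {"assigned": "drug", "received": "drug", "success": 1},
    {"assigned": "drug", "received": "placebo", "success": 0},
    {"assigned": "drug", "received": "placebo", "success": 0},
    {"assigned": "placebo", "received": "placebo", "success": 1},
    {"assigned": "placebo", "received": "placebo", "success": 0},
    {"assigned": "placebo", "received": "placebo", "success": 0},
    {"assigned": "placebo", "received": "placebo", "success": 0},
]
itt = success_rates(records, "assigned")  # analyze as randomized
tr = success_rates(records, "received")   # analyze as treated
print(itt, tr)
```

In this made-up data set the treatment-received analysis makes the drug look better than the ITT analysis does, because the patients who switched are removed from the drug arm after randomization.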

Most statisticians favor the ITT principle because it yields the best properties
for the test of the null hypothesis of no treatment difference. "If randomized,
then analyzed" is the view widely held among clinical trial statisticians and
considered a critical component of the ITT Principle to avoid biases due to
post-randomization exclusions. ITT also is favored by the federal agencies
because a clinical trial is a test of treatment policy, not a test of treatment
received. After a meeting to discuss clinical trials methodology, which included
US FDA representatives, the International Conference on Harmonization
(ICH) published a document entitled "Statistical Principles for Clinical Trials
(E9)" that discusses the ITT Principle under various circumstances.

According to the E9 document there are a limited number of circumstances in
which randomized patients can be excluded from the full analysis set. Patients
who failed to satisfy an entry criterion may be excluded from the full analysis
set only under the following circumstances:

1. The entry criterion was measured prior to randomization

2. The detection of the relevant eligibility violations can be objectively
determined

3. All patients underwent similar scrutiny for eligibility violations

4. All patients with detected violations of the eligibility criterion are
excluded

Although the ITT principle generally is preferred, it can be misleading in some
circumstances. For example, consider a situation in which a new therapy is
compared to placebo. Suppose that a patient undergoing treatment failure is
provided emergency medications for safety purposes. If the placebo group
has a higher failure rate, it actually could appear to be more beneficial than
the new therapy in an ITT analysis because of the emergency medications
(even though this may seem to be a design flaw of the trial). In such a
situation, the statistical analysis would be better served with time to treatment
failure as the primary endpoint. This analysis would still include all patients,
but using time to failure as the primary endpoint eliminates the problem of
misleading results from the ITT analysis.

Many factors can contribute to a patient's failure to complete the intended
therapy, including severe adverse reactions, disease progression, patient or
physician preference for an alternative treatment, and a change of mind. In
nearly all of these circumstances, failure to complete the assigned therapy is
partially a trial outcome. Patients cannot be eliminated from analysis for such
reasons without introducing bias.

10.3 - Summary
In this lesson, among other things, we learned to:

Apply the terms evaluable and inevaluable to patients in a clinical trial
appropriately.

Differentiate between a pragmatic approach and an explanatory
approach and the conditions for which each is appropriate.

State the conditions under which data imputation is reasonable.

State the advantage of multiple imputation over simple data imputation.

Recognize the conditions under which the International Conference on
Harmonization would allow a patient's data to be excluded from
statistical analysis.

Lesson 11: Estimating Clinical Effects
Introduction

The design of a clinical trial imposes structure on the resulting data. For
example, in pharmacologic treatment mechanism (Phase I) studies, blood
samples are used to display concentration-time curves, which relate to
simple physiologic models of drug distribution and/or metabolism. As another
example, in SE (Phase II) trials of cytotoxic drugs, investigators are interested
in tumor response and toxicity of the drug or regimen. The usual study design
permits estimating the unconditional probability of response or toxicity in
patients who met the eligibility criteria.

For every trial, investigators must distinguish between those analyses, tests of
hypotheses, or other summaries of the data that are specified a priori and
justified by the design and those which are exploratory. Remember, the results
from statistical analyses of endpoints that are specified a priori in the protocol
carry more validity. Although exploratory analyses are important and might
uncover biological relationships previously unsuspected, they may not be
statistically reliable because it is not possible to account for the random nature
of exploratory analyses. Exploratory analyses are not confirmatory by
themselves but generate hypotheses for future research.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

State the objectives of a pharmacokinetic model.

Use a SAS program to calculate a confidence interval for an odds ratio.

Use a SAS program to perform a Mantel-Haenszel analysis to estimate
an odds ratio adjusted for strata effects.

Recognize when odds ratios or relative risks differ significantly between
groups.

Modify a SAS program to perform JT and Cochran-Armitage tests for
trend.

Interpret a Kaplan-Meier survival curve.

Interpret SAS output comparing survival curves.

Describe the process of bootstrapping to estimate variability of an
estimator.

Reference

Piantadosi Steven. (2005) Counting Subjects and Events. Estimating Clinical
Effects. In: Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd
ed. Hoboken, NJ: John Wiley and Sons, Inc.

Friedman, Furberg, DeMets, Reboussin and Granger. (2015). Survival
Analysis. In: FFDRG. Fundamentals of Clinical Trials. 5th ed. NY: Springer.

11.1 - Dose-Finding (Phase I) Studies


One of the principal objectives of dose-finding (DF) studies is to assess the
distribution and elimination of drug in the human system. What level of the
drug is appropriate?

Pharmacokinetic (PK) models, also known as compartmental models, provide
useful analytical approaches for DF studies. The objective of a PK model is to
account for the absorption, distribution, metabolism, and excretion of a drug in
the human system. In a two-compartment PK model, for example, the objective
is to estimate the rates into, between, and out of the two compartments.

Estimates are made for each of these areas, i.e., absorption rate, distribution
rate, etc. for the drug in question.
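
To give a feel for what the rates in a two-compartment model describe, the sketch below simulates drug amounts forward in time for given rate constants. Everything specific here is an assumption for illustration: an intravenous bolus into the central compartment (so absorption is omitted), first-order transfer and elimination, the rate-constant names `k10`, `k12`, `k21`, and their values. An actual PK analysis works in the other direction, estimating these rates from measured concentrations.

```python
def simulate_two_compartment(dose, k10, k12, k21, dt=0.01, t_end=24.0):
    """Euler integration of drug amounts in central (a1) and peripheral (a2) compartments."""
    a1, a2, eliminated = dose, 0.0, 0.0
    history = [(0.0, a1, a2)]
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        out = k10 * a1 * dt          # elimination from the central compartment
        to_periph = k12 * a1 * dt    # central -> peripheral transfer
        to_central = k21 * a2 * dt   # peripheral -> central transfer
        a1 += to_central - to_periph - out
        a2 += to_periph - to_central
        eliminated += out
        history.append((step * dt, a1, a2))
    return history, eliminated

# Hypothetical rate constants (per hour) and a 100 mg intravenous bolus.
history, eliminated = simulate_two_compartment(dose=100.0, k10=0.3, k12=0.2, k21=0.1)
t_last, a1_last, a2_last = history[-1]
print(round(a1_last, 2), round(a2_last, 2), round(eliminated, 2))
```

At every time point the amounts in the two compartments plus the amount eliminated add up to the dose, which is a quick sanity check on the bookkeeping.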

11.2 - Safety and Efficacy (Phase II) Studies: The Odds Ratio
The main objectives of most safety and efficacy (SE) studies with a new
treatment are to estimate the frequency of adverse reactions and estimate the
probability of treatment success. These types of endpoints often are
expressed in binary form, presence/absence of adverse reaction,
success/failure of treatment, etc., although this is not always the case.

Adverse reactions are often classified on an ordinal scale, such as
absent/mild/moderate/severe. The primary efficacy endpoint also may be
measured on an ordinal scale, such as failure/partial success/success, or it
may be a time-to-event variable or measured on a continuous scale, such as
a measurement of blood pressure. There are many ways to assess efficacy.

Estimates of risk can be useful in SE studies. Suppose that an SE study
consists of a placebo group and a treatment group, and that the probability of
an adverse reaction is of primary interest. Let $p_1$ and $p_2$ denote the
respective probabilities of an adverse reaction for the treatment and placebo
groups. Three common parameters of risk are as follows:

Risk Difference: $\Delta = p_1 - p_2$

Relative Risk: $p_1 / p_2$

Odds Ratio: $\theta = \dfrac{p_1/(1 - p_1)}{p_2/(1 - p_2)}$

For relative risk, a number significantly different from 1 indicates a difference
between the two groups in the risk for the event. The odds ratio indicates the
relative odds of the event occurring between two groups. Because both are
ratios, the relative risk and the odds ratio are assessed in terms of their
distance from 1.0.

When would the odds ratio and the relative risk be about the same?
When p1 and p2 are relatively small, for instance, when you are dealing with a
very rare event.

The odds ratio is useful and convenient for assessing risk when the response
outcome is binary, but it does have some limitations.

p1      p2      Risk Diff   Rel Risk   Odds Ratio
0.25    0.05    0.20        5.00       6.33
0.30    0.10    0.20        3.00       3.86
0.45    0.25    0.20        1.80       2.45
0.70    0.50    0.20        1.40       2.33

Notice in the table above that while the absolute risk difference is constant,
the relative risk varies greatly, as does the odds ratio. Thus, the magnitudes of
the odds ratio and relative risk are strongly influenced by the initial probability
of the condition.
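
The entries in the table can be reproduced directly from the definitions of the three risk parameters. The short sketch below does so; the function name is hypothetical.

```python
def risk_measures(p1, p2):
    """Return (risk difference, relative risk, odds ratio) for two probabilities."""
    risk_diff = p1 - p2
    rel_risk = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return risk_diff, rel_risk, odds_ratio

# The four (p1, p2) pairs from the table above.
for p1, p2 in [(0.25, 0.05), (0.30, 0.10), (0.45, 0.25), (0.70, 0.50)]:
    rd, rr, or_ = risk_measures(p1, p2)
    print(f"{p1:.2f} {p2:.2f} {rd:.2f} {rr:.2f} {or_:.2f}")
```

Holding the risk difference fixed at 0.20 while raising the baseline probability shrinks both ratios, which is the pattern the table is meant to show.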

When the outcome in a CTE trial is a binary response and the objective is to
compare the two groups with respect to the proportion of success, the results
can be expressed in a 2 × 2 table as

           Group #1    Group #2
Success    r1          r2
Failure    n1 - r1     n2 - r2

The estimated relative risk is $(r_1/n_1)/(r_2/n_2)$ and the estimated odds ratio is:

$$\hat{\theta} = \frac{r_1/(n_1 - r_1)}{r_2/(n_2 - r_2)} = \frac{r_1(n_2 - r_2)}{r_2(n_1 - r_1)}$$

There are a variety of methods for performing the statistical test of the null
hypothesis $H_0: \theta = 1$ (or $H_0: \Delta = 0$), such as a z-test using a normal
approximation, a $\chi^2$ test (basically, the square of the z-test), a $\chi^2$ test with
continuity correction, and Fisher's exact test.

The normal and $\chi^2$ approximations for testing $H_0: \theta = 1$ are relatively accurate
if these conditions hold:

$$\frac{n_1(r_1 + r_2)}{n_1 + n_2} \ge 5, \quad \frac{n_2(r_1 + r_2)}{n_1 + n_2} \ge 5, \quad \frac{n_1(n_1 + n_2 - r_1 - r_2)}{n_1 + n_2} \ge 5, \quad \frac{n_2(n_1 + n_2 - r_1 - r_2)}{n_1 + n_2} \ge 5$$

These expressions are the expected values for the four cells of the 2 × 2
table. The first expression is the probability of success times the probability of
being in group 1 times the total number of subjects.

Otherwise, Fisher's exact test is recommended.
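
The expected-count conditions are easy to check by machine. The sketch below is an illustration with hypothetical function names; the counts are those of the adverse-reaction example used later in this section (12 of 44 in one group, 4 of 44 in the other).

```python
def expected_counts(r1, n1, r2, n2):
    """Expected cell counts of the 2 x 2 table under the null hypothesis."""
    total = n1 + n2
    successes = r1 + r2
    failures = total - successes
    return [n1 * successes / total, n2 * successes / total,
            n1 * failures / total, n2 * failures / total]

def normal_approximation_ok(r1, n1, r2, n2, minimum=5):
    """True when every expected count is at least `minimum`."""
    return all(count >= minimum for count in expected_counts(r1, n1, r2, n2))

print(expected_counts(12, 44, 4, 44), normal_approximation_ok(12, 44, 4, 44))
```

When any expected count falls below 5, the check fails and the exact test is the safer choice.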

If the above conditions are met, then the $\log_e$-transformed estimated odds ratio
has an approximate normal distribution:

$$\log_e(\hat{\theta}) \sim N\left(\mu = \log_e(\theta),\ \sigma^2 = \frac{1}{r_1} + \frac{1}{r_2} + \frac{1}{n_1 - r_1} + \frac{1}{n_2 - r_2}\right)$$

Therefore, an approximate $100(1 - \alpha)\%$ confidence interval for the log odds
ratio is:

$$\log_e(\hat{\theta}) \pm z_{1 - \alpha/2}\sqrt{\frac{1}{r_1} + \frac{1}{r_2} + \frac{1}{n_1 - r_1} + \frac{1}{n_2 - r_2}}$$

An approximate $100(1 - \alpha)\%$ confidence interval for the odds ratio is
constructed by exponentiating the endpoints of the $100(1 - \alpha)\%$ confidence
interval for the log odds ratio. The computer can do this for you.

SAS Example 12.1: An investigator conducted a small safety and efficacy
study comparing treatment to placebo with respect to adverse reactions. The
data are as follows:

                       Treatment   Placebo
adverse reaction       12          4
no adverse reaction    32          40

The estimated odds ratio is calculated as:

$$\hat{\theta} = \frac{(12)(40)}{(32)(4)} = 3.75$$

and the approximate 95% confidence interval for the $\log_e$ odds ratio is

$$1.32 \pm (1.96 \times 0.62) = (0.10, 2.54)$$

so the approximate 95% confidence interval for $\theta$ is (1.10, 12.68).

Because the approximate 95% confidence interval for $\theta$ does not contain 1.0,
the null hypothesis of $H_0: \theta = 1$ is rejected at the 0.05 significance level.

Even though this data table satisfies the criteria for the $\log_e$ estimated odds
ratio to follow an approximate normal distribution, there still is a discrepancy
between the approximate results and the exact results.

From PROC FREQ of SAS, the exact 95% confidence interval for $\theta$ is
(1.00, 17.25). Because the exact 95% confidence interval for $\theta$ does contain
1.0, $H_0: \theta = 1$ is not rejected at the 0.05 significance level based on Fisher's
exact test.
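
The approximate interval in this example can be reproduced with a few lines of code. The sketch below is not the course's SAS program; it is a plain Python illustration of the normal-approximation formula for the log odds ratio, with a hypothetical function name. It does not compute the exact interval reported by PROC FREQ.

```python
import math

def odds_ratio_wald_ci(r1, n1, r2, n2, z=1.96):
    """Estimated odds ratio and approximate CI from the log odds ratio."""
    odds_ratio = (r1 * (n2 - r2)) / (r2 * (n1 - r1))
    se_log = math.sqrt(1 / r1 + 1 / r2 + 1 / (n1 - r1) + 1 / (n2 - r2))
    log_or = math.log(odds_ratio)
    lower, upper = log_or - z * se_log, log_or + z * se_log
    return odds_ratio, (lower, upper), (math.exp(lower), math.exp(upper))

# Treatment: 12 of 44 with an adverse reaction; placebo: 4 of 44.
or_hat, log_ci, ci = odds_ratio_wald_ci(12, 44, 4, 44)
print(round(or_hat, 2), [round(v, 2) for v in log_ci], [round(v, 2) for v in ci])
```

Carrying full precision through the calculation gives an upper limit of about 12.7 rather than 12.68; the small difference arises because the text exponentiates the rounded endpoint 2.54.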
11.3 - Safety and Efficacy (Phase II)
Studies: The Mantel-Haenszel Test for
the Odds Ratio
Sometimes a safety and efficacy study is stratified according to some factor,
such as clinical center, disease severity, gender, etc. In such a situation, it still
may be desirable to estimate the odds ratio while accounting for strata effects.
The Mantel-Haenszel test for the odds ratio assumes that the odds ratio is
equal across all strata, although the rates, p1 and p2, may differ across strata.
This procedure calculates the odds ratio within each stratum and then
combines the strata estimates into one estimate of the common odds ratio.
For example,

Stratum   p1      p2      Odds Ratio
1         0.50    0.25    3.00
2         0.40    0.18    3.00
3         0.30    0.12    3.00
4         0.20    0.08    3.00

SAS Example (12.2_Mantel-Haenszel_test.sas): A company performed a
multi-center safety and efficacy study at six sites, with a binary outcome
(success/failure), for comparing placebo and treatment.

SAS PROC FREQ yields an estimated odds ratio of 1.84 with an approximate
95% confidence interval of (1.28, 2.66).
The exact 95% confidence interval is (1.26, 2.69). The exact and asymptotic
confidence intervals are nearly identical due to the large sample size across
the six clinical centers.

$H_0: \theta = 1$ is rejected at the 0.05 significance level (p = 0.0013), which is
consistent with the 95% confidence interval not containing 1.0. (Later in this
chapter we discuss the construction of the Mantel-Haenszel test statistic.)
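
As a rough sketch of how stratum-level tables are combined, the code below computes the usual textbook form of the Mantel-Haenszel common odds ratio estimate. The formula is stated here as the standard form rather than taken from this lesson, and the two-center data and function name are hypothetical; the six-site data from the SAS example are not reproduced.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio from strata given as (r1, n1, r2, n2) tuples."""
    numerator = 0.0
    denominator = 0.0
    for r1, n1, r2, n2 in strata:
        total = n1 + n2
        numerator += r1 * (n2 - r2) / total    # group-1 successes x group-2 failures
        denominator += r2 * (n1 - r1) / total  # group-2 successes x group-1 failures
    return numerator / denominator

# Hypothetical two-center data; each stratum has an odds ratio of exactly 3.
strata = [(50, 100, 25, 100), (60, 100, 30, 90)]
print(mantel_haenszel_odds_ratio(strata))
```

When every stratum shares the same odds ratio, the combined estimate returns that value even though the success rates differ between strata; with a single stratum it reduces to the ordinary estimated odds ratio.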

11.4 - Safety and Efficacy (Phase II) Studies: Trend Analysis
In some safety and efficacy studies, it is of interest to determine if an increase
in the dose yields an increase (or decrease) in the response. The statistical
analysis for such a situation is called a dose-response or trend analysis. We
want to see a trend here, not just a difference in groups. Typically, patients in
a dose-response study are randomized to K + 1 treatment groups (a placebo
dose and K increasing doses of the drug). The response variables of interest
may be binary, ordinal, or continuous (in some circumstances, the response
variable may be a time-to-event variable). In some instances, trend tests are
sensitive enough to reveal a mild trend where pairwise comparisons would fail
to find significant differences.

For the sake of illustration, suppose that the response is continuous and that
we want to determine if there is a trend in the K + 1 population means.

A one-sided hypothesis testing framework for investigating an increasing trend
is

$H_0: \{\mu_0 = \mu_1 = \cdots = \mu_K\}$ versus
$H_1: \{\mu_0 \le \mu_1 \le \cdots \le \mu_K \text{ with at least one strict inequality}\}$

A one-sided hypothesis testing framework for investigating a decreasing trend
is

$H_0: \{\mu_0 = \mu_1 = \cdots = \mu_K\}$ versus
$H_1: \{\mu_0 \ge \mu_1 \ge \cdots \ge \mu_K \text{ with at least one strict inequality}\}$

A two-sided hypothesis testing framework for investigating a trend is

$H_0: \{\mu_0 = \mu_1 = \cdots = \mu_K\}$ versus
$H_1: \{\mu_0 \le \mu_1 \le \cdots \le \mu_K \text{ or } \mu_0 \ge \mu_1 \ge \cdots \ge \mu_K \text{ with at least one strict inequality}\}$

More than likely we would use one of the one-sided tests, as you probably
have a hunch about the direction of the effect.

For a continuous response, an appropriate test is the Jonckheere-Terpstra
(JT) trend test, which was developed in the 1950s. The JT trend test is based
on a sum of Mann-Whitney-Wilcoxon tests:

$$JT = \sum_{k=0}^{K-1} \sum_{k'=k+1}^{K} MWW_{kk'}$$

where $MWW_{kk'}$ is the Mann-Whitney-Wilcoxon test for comparing group $k$ to
group $k'$, $0 \le k < k' \le K$. Essentially, every pair of groups is compared
and the results are summed. In this way the test looks for trends.

If $Y_{ki}$, $i = 1, \ldots, n_k$, denote the observations from group $k$, and $Y_{k'i'}$,
$i' = 1, \ldots, n_{k'}$, denote the observations from group $k'$, then

$$MWW_{kk'} = \sum_{i=1}^{n_k} \sum_{i'=1}^{n_{k'}} \operatorname{sign}(Y_{k'i'} - Y_{ki})$$

Note that each MWW should be constructed in a consistent manner. For
example, when comparing an observation from a lower dose group versus an
observation from a higher dose group, take the difference of the latter minus
the former.

As an example of how the JT statistic is constructed, suppose there are four
dose groups in a study (placebo, low dose, mid dose, and high dose). Then
the JT trend test is the sum of six Mann-Whitney-Wilcoxon test statistics:

{placebo vs. low dose} +
{placebo vs. mid dose} +
{placebo vs. high dose} +
{low dose vs. mid dose} +
{low dose vs. high dose} +
{mid dose vs. high dose}

Values of the statistic JT near zero support

$H_0: \{\mu_0 = \mu_1 = \cdots = \mu_K\}$ - they are equal

Large positive values of the statistic JT support

$H_1: \{\mu_0 \le \mu_1 \le \cdots \le \mu_K \text{ with at least one strict inequality}\}$ - an increasing trend

Large negative values of the statistic JT support

$H_1: \{\mu_0 \ge \mu_1 \ge \cdots \ge \mu_K \text{ with at least one strict inequality}\}$ - a decreasing trend

The JT trend test actually is testing hypotheses about population medians, but
if the underlying probability distribution is symmetric, the population mean and
the population median are equal to one another. The JT trend test is available
in PROC FREQ of SAS.
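
The construction of the JT statistic as a sum of pairwise sign comparisons can be sketched in a few lines. The data and function names below are hypothetical, and the code computes only the raw statistic as defined above, not a standardized version or a p-value.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def mww(lower_group, higher_group):
    """Sum of signs of (higher-dose observation minus lower-dose observation)."""
    return sum(sign(y_high - y_low)
               for y_low in lower_group for y_high in higher_group)

def jt_statistic(groups):
    """Sum the pairwise statistics over all dose-group pairs k < k'."""
    return sum(mww(groups[k], groups[kp])
               for k in range(len(groups))
               for kp in range(k + 1, len(groups)))

# Hypothetical responses for placebo, low, mid, and high dose.
groups = [[10, 12, 11], [12, 14, 13], [13, 15, 16], [17, 18, 16]]
print(jt_statistic(groups))
```

With four groups the sum runs over the six pairs listed earlier; a large positive value, as here, points toward an increasing trend, and reversing the order of the groups flips the sign.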

The parametric version of the JT trend test, based on the assumption of


normal data, is to substitute the difference between sample means for the
Mann-Whitney-Wilcoxon statistics. The numerator for the parametric test is as
follows:

k=0K1k=k+1K(YkYk)k=0K1k=k+1K(YkYk)

Next, we assume that the K + 1 groups have a homogeneous population


variance, 2 . The population variance is estimated by the pooled sample
variance, s 2 , and it has d degrees of freedom:

s2=1dk=0Ki=1nk(YkiYk)2,d=k=0K(nk1)s2=1dk=0Ki=1nk(YkiY
k)2,d=k=0K(nk1)

Letting ck = 2k - K, k = 0, 1, , K, the numerator reduces to:

k=0KckYkk=0KckYk

Then the trend statistic is:


T=(k=0KckYk)/s2k=0Kc2kn2k

T=(k=0KckYk)/(s2k=0Kck2nk2)

For example, if K = 3 (placebo, low dose, mid dose, and high dose), then c0 =
-3, c1 = -1, c2 = 1, c3 = 3. Notice, however, that if there is an odd number of
groups, then the middle group has a coefficient of zero. For example, with K =
2 (placebo, low dose, and high dose), c0 = -2, c1 = 0, c2 = 2 (equivalently -1,
0, 1, because T is unchanged when all coefficients are rescaled). This is not
ideal, because the middle group then contributes nothing to the numerator, and
there are better trend tests than JT and T for continuous data.

To use the actual dose values (denoted as d0, d1, ..., dK) in the parametric
test, set ck = dk - mean(d0, d1, ..., dK), k = 0, 1, ..., K.

The T trend statistic can be constructed by using the CONTRAST statement in
SAS PROC GLM.
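
As a rough cross-check of the T statistic outside SAS, the calculation can be sketched in Python. This is a minimal sketch with hypothetical data and function names; it relies on the standard result that the variance of the contrast Σ ck Ȳk is σ² Σ ck²/nk, and it defaults to the equally spaced coefficients ck = 2k - K.

```python
import math

def trend_statistic(groups, coefs=None):
    """Return (T, d) for samples ordered from lowest to highest dose."""
    K = len(groups) - 1
    if coefs is None:
        coefs = [2 * k - K for k in range(K + 1)]   # c_k = 2k - K
    means = [sum(g) / len(g) for g in groups]
    d = sum(len(g) - 1 for g in groups)             # pooled degrees of freedom
    s2 = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / d
    numerator = sum(c * m for c, m in zip(coefs, means))
    se = math.sqrt(s2 * sum(c * c / len(g) for c, g in zip(coefs, groups)))
    return numerator / se, d

# Hypothetical responses for placebo, low, mid, and high dose.
groups = [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0], [4.0, 6.0]]
t_stat, df = trend_statistic(groups)

# Using actual (hypothetical) dose values: c_k = d_k - mean(d).
dose_values = [0.0, 20.0, 60.0, 180.0]
dose_mean = sum(dose_values) / len(dose_values)
t_dose, _ = trend_statistic(groups, [dv - dose_mean for dv in dose_values])
```

Under the normality and equal-variance assumptions, T would be referred to a t distribution with d degrees of freedom.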

The JT trend test works well for binary and ordinal data, as well as being
available for continuous data.

Another trend test for binary data is the Cochran-Armitage (CA) trend test.
The difference between the JT and CA trend tests is that for the latter test, the
actual dose levels can be specified. In other words, instead of designating the
dose levels as low, mid, or high, the actual numerical dose levels can be used
in the CA trend test, such as 20 mg, 60 mg, and 180 mg.

The CA trend test, however, can yield unusual results if there is unequal
spacing among the dose levels. If the dose levels are equally spaced and the
sample sizes are equal (n0 = n1 = ... = nK), then the JT and CA trend tests yield
exactly the same results. Both the dose spacing and the group sample sizes
should therefore be taken into account when choosing the trend test for a
given data set.

SAS Example ( 12.3_trend_tests.sas [3] ) illustrates how to construct trend
tests.

11.5 - Safety and Efficacy (Phase II) Studies: Survival Analysis
In many clinical trials involving serious diseases, such as cancer and AIDS, a
primary objective is to evaluate the survival experience of the cohort. In
clinical trials not involving serious diseases, survival may not be an outcome,
but other time-to-event outcomes may be important. Examples include time to
hospital discharge, time to disease relapse, time to getting another migraine,
time to progression of disease, etc.

The Kaplan-Meier survival curve is a nonparametric technique for estimating
the probability of survival, even in the presence of censoring (e.g., the study
is completed before the patient experiences the event), at any point in time.
This statistical approach is nonparametric because it does not assume any
particular distribution for the data, such as lognormal, exponential, or Weibull.
It is a "robust" procedure because it is not adversely affected by one or more
unusual data points.

In order to construct the Kaplan-Meier survival curve, the actual failure times
need to be ordered from smallest to largest. In a sample of n patients,
denote the distinct times of failure as t1, t2, ... , tK. For convenience, let
t0 = 0 denote the start time and let tK+1 = ∞.

At the kth failure time, tk, the number of failures, dk, is noted, as well as the
number of patients who were at risk for failure immediately prior to tk, nk.
Notice that patients who are lost to follow-up (censored) prior to time tk are not
included in nk.

The algebraic formula for the Kaplan-Meier survival probability at time t is:

\[ \hat{S}(t) = 1, \qquad t_0 \le t < t_1 \]

\[ \hat{S}(t) = \prod_{k'=1}^{k} \left(1 - \frac{d_{k'}}{n_{k'}}\right), \qquad t_k \le t < t_{k+1}, \quad k = 1, 2, \ldots, K \]

The calculation of S(t) utilizes conditional probability: each factor
(1 - dk/nk) estimates the probability of surviving beyond tk, given that the
person has survived up to tk. The product of these factors, S(t), is the
probability of surviving beyond time t.

An example with an initial sample of n = 100 patients is as follows:

k    tk (days)    dk (events)    nk (at risk)    S^(tk)
1    127          1              98              0.99 = (1 - 1/98)
2    154          2              91              0.97 = (1 - 1/98)(1 - 2/91)
3    195          1              84              0.96 = (1 - 1/98)(1 - 2/91)(1 - 1/84)
4    221          3              75              0.92 = (1 - 1/98)(1 - 2/91)(1 - 1/84)(1 - 3/75)

Note that the probability estimate does not change until a failure event occurs,
so the Kaplan-Meier survival curve gives the appearance of a step function
when graphed. Also, censored values do not affect the numerator, dk, but do
affect the denominator, nk, at later failure times.

A graphical display of the Kaplan-Meier survival curve is as follows:

Each step down represents the occurrence of an event.
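
The product-limit calculation in the table above can be reproduced with a short Python sketch. The function name is hypothetical, and the input is assumed to be already summarized as (tk, dk, nk) rows, as in the example with n = 100 patients.

```python
def kaplan_meier(failure_table):
    """Return [(t_k, S(t_k))] from rows of (t_k, d_k, n_k)."""
    surv = 1.0
    curve = []
    for t_k, d_k, n_k in failure_table:
        surv *= 1.0 - d_k / n_k   # conditional survival past t_k
        curve.append((t_k, surv))
    return curve

# Rows (t_k in days, d_k events, n_k at risk) from the example above.
table = [(127, 1, 98), (154, 2, 91), (195, 1, 84), (221, 3, 75)]
curve = kaplan_meier(table)
for t_k, s in curve:
    print(t_k, round(s, 2))
```

Rounded to two decimals, the estimates are 0.99, 0.97, 0.96, and 0.92, matching the table. The estimate stays constant between failure times, which produces the step shape of the curve.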


11.6 - Comparative Treatment Efficacy (Phase III) Trials
For comparative treatment efficacy (CTE) trials, the primary endpoints often
are measured on a continuous scale. The sample mean (sample standard
deviation), the sample median (sample inter-quartile range), or the sample
geometric mean (sample coefficient of variation) serve as reasonable
descriptive statistics in such circumstances.

The sample mean (sample standard deviation) is suitable if the data are
normally distributed or symmetric without heavy tails. The sample median
(sample inter-quartile range) is suitable for symmetric or asymmetric data. The
sample geometric mean (sample coefficient of variation) is suitable when the
data are log-normally distributed.

Usually two-sample t tests or Wilcoxon rank tests are applied to compare the
two randomized groups. In some instances, baseline measurements (prior to
randomized treatment assignment) of the primary endpoints are taken.

Suppose Yi1 and Yi2 denote the baseline and final measurements of the
endpoint, respectively, for the ith subject, i = 1, 2, ..., n. Instead of
statistically analyzing the Yi2, there could be an increase in precision by
analyzing the change (or gain) in the response, namely, Yi = Yi2 - Yi1.

Suppose that the variance for each Yi1 and Yi2 is σ² and that the correlation
between Yi1 and Yi2 is ρ (we assume that subjects are independent of each
other but that the pair of measurements within each subject are correlated).

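This leads to

\[ \operatorname{Var}(Y_{i2} - Y_{i1}) = \operatorname{Var}(Y_{i2}) + \operatorname{Var}(Y_{i1}) - 2\operatorname{Cov}(Y_{i2}, Y_{i1}) = 2\sigma^2(1 - \rho) \]

Therefore,

\[ \operatorname{Var}(Y_{i2}) = \sigma^2 \quad \text{and} \quad \operatorname{Var}(Y_{i2} - Y_{i1}) = 2\sigma^2(1 - \rho) \]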

If ρ > 1/2, which often is the case for repeated measurements within patients,
then Var(Yi2 - Yi1) < Var(Yi2). Thus, there may be more precision if
the Yi2 - Yi1 are analyzed instead of the Yi2. This situation is common: each
patient serves as his or her own control. Because the within-patient changes
are of interest, each patient's baseline measurement is subtracted from the
treatment-period measurement. A two-sample t test or
the Wilcoxon rank sum test can be applied to the change-from-baseline
measurements if the CTE trial consists of two randomized groups, such as
placebo and an experimental therapy.
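
As a worked illustration of the threshold on the correlation (the values 0.8 and 0.2 are chosen only for illustration, with σ² denoting the common variance and ρ the within-subject correlation):

\[ \operatorname{Var}(Y_{i2} - Y_{i1}) < \operatorname{Var}(Y_{i2}) \iff 2\sigma^2(1 - \rho) < \sigma^2 \iff \rho > \tfrac{1}{2} \]

\[ \rho = 0.8: \; 2\sigma^2(1 - 0.8) = 0.4\,\sigma^2, \qquad \rho = 0.2: \; 2\sigma^2(1 - 0.2) = 1.6\,\sigma^2 \]

So with a strong within-subject correlation the change score has less than half the variance of the final measurement, whereas with a weak correlation the change score is actually noisier than the final measurement alone.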

An alternative approach with baseline measurements is analysis of covariance
(ANCOVA). In this situation, the baseline measurement, Yi1, serves as a
covariate, so that the final measurement for a subject is adjusted by the
baseline measurement. A linear model that describes this for a two-armed trial
with placebo and experimental treatment groups is as follows. The expected
value for the ith patient, i = 1, 2, ..., n, is:

\[ E(Y_{i2}) = \mu_P + T_i(\mu_E - \mu_P) + \beta Y_{i1} \]

where μP is the population mean for the placebo group, μE is the population
mean for the experimental treatment group, Ti = 0 if the ith patient is in the
placebo group and 1 if in the experimental treatment group, and β is the slope
for the baseline measurement.

The expectations for subjects in the placebo group and experimental
treatment group, respectively, can be rewritten as:

\[ E(Y_{i2} - \beta Y_{i1}) = \mu_P \quad \text{and} \quad E(Y_{i2} - \beta Y_{i1}) = \mu_E \]

These expectations are analogous to the expectations for the
change-from-baseline measurements:

\[ E(Y_{i2} - Y_{i1}) = \mu_P \quad \text{and} \quad E(Y_{i2} - Y_{i1}) = \mu_E \]

The only difference between the two approaches is that in the
change-from-baseline measurements, β is set equal to 1.0. In the ANCOVA
approach, β is estimated in the analysis and may differ from 1.0. Thus, the
ANCOVA approach is more flexible and can yield slightly more statistical power
and efficiency.
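
The contrast between the two approaches can be sketched numerically. The Python sketch below uses hypothetical data and function names. It relies on the standard least-squares result that, in a two-group model with a common slope, the estimated slope is the pooled within-group slope and the treatment estimate is the difference in final means minus the slope times the difference in baseline means. Setting the slope to 1.0 gives the change-from-baseline estimate.

```python
def mean(values):
    return sum(values) / len(values)

def pooled_slope(groups):
    """Common within-group slope of final (y) on baseline (x)."""
    sxy = sxx = 0.0
    for x, y in groups:
        xbar, ybar = mean(x), mean(y)
        sxy += sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
        sxx += sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

def treatment_effect(placebo, experimental, beta):
    """Estimate mu_E - mu_P after subtracting beta * baseline."""
    (xp, yp), (xe, ye) = placebo, experimental
    return (mean(ye) - mean(yp)) - beta * (mean(xe) - mean(xp))

# Hypothetical (baseline, final) measurements for each arm.
placebo = ([10.0, 12.0, 14.0], [11.0, 12.0, 13.0])
experimental = ([11.0, 13.0, 15.0], [9.0, 10.0, 11.0])

beta_hat = pooled_slope([placebo, experimental])   # estimated from the data
change_effect = treatment_effect(placebo, experimental, beta=1.0)
ancova_effect = treatment_effect(placebo, experimental, beta=beta_hat)
```

In this made-up data set the estimated slope is 0.5 rather than 1.0, so the two approaches give different treatment estimates (-3.0 for change from baseline, -2.5 for ANCOVA).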

11.7 - Comparing Survival Curves


If the primary endpoint in a CTE trial is a time-to-event variable, then it will be
of interest to compare the survival curves of the randomized treatment arms.
Again, we will focus on a nonparametric approach that corresponds to
comparing the Kaplan-Meier survival curves rather than a parametric
approach.

The Mantel-Haenszel test can be adapted here to compare two
groups, say P and E for placebo and experimental treatment. In this situation,
the Mantel-Haenszel test is called the logrank test.

The assumptions for the logrank test are that (1) the censoring patterns are
the same for the two treatment groups, and (2) the hazard functions for the
two treatment groups are proportional.

For each of the K distinct failure times across the two randomized groups at
times t1, t2, ..., tK, a 2 × 2 table is constructed. For failure time tk,
k = 1, 2, ..., K, the table is:

                 Placebo        Exp Treat
# events         dPk            dEk
# non events     nPk - dPk      nEk - dEk

The logrank statistic constructs an observed minus expected score, under the
assumption that the null hypothesis of equal event rates is true, for each of the
K tables and then sums over all tables:

\[ O - E = \sum_{k=1}^{K} \left( \frac{n_{Pk} d_{Ek} - n_{Ek} d_{Pk}}{n_{Pk} + n_{Ek}} \right) \]

The variance expression for the O - E score is as follows:

\[ V_L = \operatorname{Var}(O - E) = \sum_{k=1}^{K} \left( \frac{(d_{Pk} + d_{Ek})(n_{Pk} + n_{Ek} - d_{Pk} - d_{Ek})\, n_{Pk}\, n_{Ek}}{(n_{Pk} + n_{Ek} - 1)(n_{Pk} + n_{Ek})^2} \right) \]

Then the logrank statistic is:

\[ Z_L = (O - E) / \sqrt{V_L} \]

which has an approximate standard normal distribution.
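
The table-by-table construction can be sketched in Python as follows. This is a minimal illustration, not PROC LIFETEST: the function name and data are hypothetical, a subject censored at time t is assumed to still be at risk at t, and a table with only one subject at risk is given zero variance to avoid dividing by zero.

```python
import math

def logrank(placebo, experimental):
    """Return (O - E, V_L, Z_L) from lists of (time, event) pairs.

    event is True for a failure and False for a censored time.
    """
    failure_times = sorted({t for t, e in placebo + experimental if e})
    o_minus_e = 0.0
    variance = 0.0
    for t in failure_times:
        n_p = sum(1 for u, _ in placebo if u >= t)            # at risk
        n_e = sum(1 for u, _ in experimental if u >= t)
        d_p = sum(1 for u, e in placebo if e and u == t)      # events at t
        d_e = sum(1 for u, e in experimental if e and u == t)
        n, d = n_p + n_e, d_p + d_e
        o_minus_e += (n_p * d_e - n_e * d_p) / n
        if n > 1:  # a table with one subject at risk has zero variance
            variance += d * (n - d) * n_p * n_e / ((n - 1) * n ** 2)
    return o_minus_e, variance, o_minus_e / math.sqrt(variance)

# Hypothetical data: every subject fails, placebo slightly earlier.
placebo = [(1, True), (3, True)]
experimental = [(2, True), (4, True)]
oe, v_l, z_l = logrank(placebo, experimental)
```

A negative O - E means the experimental group had fewer events than expected under the null hypothesis. Swapping the two groups changes only the sign of O - E and of the Z statistic.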

The generalized Wilcoxon test also is a nonparametric test for comparing
survival curves and it is an extension of the Wilcoxon rank sum test in the
presence of censoring. It also requires that the censoring patterns for the two
treatment groups be the same, but it does not assume proportional hazards.

The first step in constructing the generalized Wilcoxon statistic is to pool the
two samples of survival times (including censored values) and order them
from lowest to highest. For the ith observation in the ordered sample with
survival (or censored) time ti, construct a score, Ui, which represents the
number of observations known to be less than ti minus the number of
observations known to be greater than ti. (A censored time is known only to
exceed its censoring time, so it is never counted as less than another
observation, and nothing is counted as greater than it.) The Ui are summed
over the experimental treatment group and a variance calculated, i.e.,

\[ U = \sum_{i=1}^{n_E} U_i \quad \text{and} \quad V_U = \operatorname{Var}(U) = \left( \frac{n_P n_E}{(n_P + n_E)(n_P + n_E - 1)} \right) \sum_{i=1}^{n_P + n_E} U_i^2 \]

such that:

\[ Z_U = U / \sqrt{V_U} \]

has an approximate standard normal distribution.

An example of constructing the Ui scores ("+" reflects censoring):

ti      Group        # < ti    # > ti    Ui
6       Exp Treat    0         7         -7
10      Placebo      1         6         -5
10+     Exp Treat    2         0         2
12      Exp Treat    2         4         -2
15+     Exp Treat    3         0         3
17      Placebo      3         2         1
21      Placebo      4         1         3
25+     Placebo      5         0         5

Then U = (-7) + 2 + (-2) + 3 = -4.
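
The scores in the table can be reproduced with a short Python sketch. The function name is hypothetical, and the tie rule (a time censored at t is treated as longer than a failure at t) is an assumption inferred from the table. The Z statistic is computed as U divided by the square root of its variance.

```python
import math

def gehan_scores(sample):
    """Score each (time, censored) pair against every other observation."""
    scores = []
    for i, (t_i, cens_i) in enumerate(sample):
        less = greater = 0
        for j, (t_j, cens_j) in enumerate(sample):
            if i == j:
                continue
            # j is known to be shorter only if j is an observed failure.
            if not cens_j and (t_j < t_i or (t_j == t_i and cens_i)):
                less += 1
            # j is known to be longer only if i is an observed failure.
            if not cens_i and (t_j > t_i or (t_j == t_i and cens_j)):
                greater += 1
        scores.append(less - greater)
    return scores

# (time, censored?, group) rows from the example table.
data = [(6, False, "E"), (10, False, "P"), (10, True, "E"), (12, False, "E"),
        (15, True, "E"), (17, False, "P"), (21, False, "P"), (25, True, "P")]
scores = gehan_scores([(t, c) for t, c, _ in data])
u = sum(s for s, (_, _, g) in zip(scores, data) if g == "E")
n_e = sum(1 for _, _, g in data if g == "E")
n_p = len(data) - n_e
v_u = n_p * n_e / ((n_p + n_e) * (n_p + n_e - 1)) * sum(s * s for s in scores)
z_u = u / math.sqrt(v_u)
```

For these eight observations the scores match the Ui column, U = -4, the variance works out to 36, and the Z statistic is therefore about -0.67.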

SAS Example ( 12.4_survival_analysis.sas [4] ): A safety and efficacy study
was conducted in 83 patients with malignant mesothelioma, an uncommon
lung cancer that is strongly associated with asbestos exposure. Patients
underwent one of three types of surgery, namely, biopsy, limited resection,
and extrapleural pneumonectomy (EPP). Treatment assignment was
nonrandomized and based on the extent of disease at the time of diagnosis.
Thus, there can be strong selection bias in the choice of procedure in this
example. The primary outcome variable was time to death (survival). SAS PROC
LIFETEST constructs the Kaplan-Meier survival curve for each surgery group
and compares the survival curves via the logrank test (p = 0.48) and the
generalized Wilcoxon test (p = 0.63).

Strength of Evidence

Although p-values are useful for hypothesis tests that are specified a priori,
they provide poor summaries of clinical effects. In particular, they do not
convey the magnitude of a clinical effect. The size of a p-value depends on
the magnitude of the estimated treatment effect and its estimated variability
(also a function of sample size). Thus, the p-value partially reflects the size of
the trial, which has no biological interpretation. In addition, the p-value can
mask the magnitude of the treatment effect, which does have biological
importance. P-values only quantify the type I error and do not characterize the
biologically important effects in the trial. Thus, p-values should not be used to
describe the strength of evidence in a trial. Investigators have to look at the
magnitude of the treatment effect.

Confidence intervals are more appropriate for describing the strength of
evidence in a clinical trial, although they also are affected by the sample size.
Most major journals now require this approach because a confidence interval is
more informative than the p-value alone.

11.8 - Special Methods of Analysis


One of the most difficult statistical tasks is assessing the precision of an
estimator, i.e., determining the variance of an estimator can be more difficult
than determining the appropriate estimator. In complicated situations the
bootstrap method can be applied to estimate the variance of an estimator. The
bootstrap is essentially a resampling plan.

For example, suppose an investigator collects a sample of N observations,
denoted as Y1, Y2, ... , YN, and wants to estimate the median, θ, and get an
expression for its variance. If the investigator does not want to make any
assumptions about the distribution of the sample, then an explicit expression
for the variance of the sample median does not exist.

The bootstrap can be used to construct a variance estimate of the sample
median.

The bootstrap process consists of constructing B data sets, each
with N observations, from the original data set. Each bootstrap sample is
constructed by sampling with replacement from the original data set. This
means that when constructing a bootstrap sample, N observations are
generated one at a time where each Yi has 1/N probability of being selected.
Here is an example of the resampling of the original data:

Original sample:          17, 25, 16, 32, 27, 19, 25, 23, 22, 30
Bootstrap sample #1:      30, 22, 27, 25, 25, 23, 32, 27, 19, 22
Bootstrap sample #2:      25, 16, 17, 23, 30, 16, 22, 19, 32, 30
...                       ...
Bootstrap sample #1000:   19, 32, 22, 16, 25, 16, 30, 22, 23, 17

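Thus, for b = 1,...,B, the bootstrap sample Yb1, Yb2, ... , YbN is constructed and
the sample median within the bth bootstrap sample is formed as:

\[ \hat{\theta}_b = \operatorname{median}(Y_{b1}, Y_{b2}, \ldots, Y_{bN}) \]

From the B estimates of the median we construct the estimated variance as:

\[ S^2 = \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\theta}_b - \bar{\theta}\right)^2 \quad \text{where} \quad \bar{\theta} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b \]

to get a sense of how these medians vary over the B bootstrap samples
(B = 1000 in the example above). The variance estimate can then be used to
construct a Z statistic for hypothesis testing, i.e.,

\[ Z = (\hat{\theta} - \theta) / S \]

where θ^ is the median of the original sample and θ is the value of the median
under the null hypothesis.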

Some statisticians at first were leery of this approach, which essentially uses
one sample to create many other samples from the original, i.e., "pulling
oneself up by one's bootstraps". Over time, however, the bootstrap process has
been shown to have sound statistical properties. A disadvantage of this
approach is that the random selection with replacement can produce slight
variations in results from one run to the next, whereas the FDA, for instance,
requires definitive results. The bootstrap is simply a nonparametric approach
for estimating the variance of an estimator.
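
A minimal Python sketch of the resampling plan, using the original sample shown earlier in this section, is given below. The function name, the number of resamples, and the seed value are arbitrary choices for illustration. Fixing the seed is one common way to make a bootstrap analysis reproducible from run to run.

```python
import random
import statistics

def bootstrap_median_variance(sample, n_boot=1000, seed=2024):
    """Return (bootstrap medians, S^2) for the sample median."""
    rng = random.Random(seed)          # fixed seed gives reproducible resamples
    n = len(sample)
    medians = []
    for _ in range(n_boot):
        resample = [rng.choice(sample) for _ in range(n)]  # with replacement
        medians.append(statistics.median(resample))
    return medians, statistics.variance(medians)           # divisor B - 1

original = [17, 25, 16, 32, 27, 19, 25, 23, 22, 30]
medians, s2 = bootstrap_median_variance(original)
```

Each draw selects one of the N original values with probability 1/N, and `statistics.variance` uses the B - 1 divisor that appears in the formula for S².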

Exploratory or Hypothesis-Generating Analyses


Clinical trial data provide the opportunity for exploratory analyses, which are
analyses in addition to those specified by the primary objectives in the
protocol. The trial design is usually not well-suited for all of the exploratory
analyses that are performed, so the results may not have much validity.

The results from exploratory analyses should not be regarded as confirmatory,
but rather as hypothesis-generating for future research. As a general rule, the
same data should not be used both to generate a new hypothesis and to test
that hypothesis. Unfortunately, many investigators do not follow this principle.

Data in sufficient quantity and detail can be made to yield some effect. A few
statistical sayings attest to this, such as "the data will confess to anything if
tortured enough." It has been well documented that increasing the number of
hypothesis tests inflates the Type I error rate. Exploratory analyses typically
fall into this category and the chances of finding statistically significant results,
when none truly exist, can be very high.
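
The size of this inflation is easy to illustrate under the simplifying assumption, made here only for illustration, that the tests are independent and each is run at level α. The chance of at least one false positive among m tests is then 1 - (1 - α)^m. The function name below is hypothetical.

```python
def familywise_error(alpha, m):
    """P(at least one false positive) for m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 5, 10, 20):
    print(m, round(familywise_error(0.05, m), 3))
```

With α = 0.05, twenty independent tests give roughly a 64% chance of at least one spurious "significant" finding.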

Subset analyses are a form of exploratory analyses that are very popular with
clinical trials data. For example, after performing the primary statistical
analyses, the investigators might decide to compare treatment groups within
certain subsets, such as male subjects, female subjects, minority subjects,
subjects over the age of 50, subjects with serum cholesterol above 220, etc.
Unless it is planned ahead of time, such analyses should remain exploratory.

11.9 - Summary
In this lesson, among other things, we learned how to:

State the objectives of a pharmacokinetic model.

Use a SAS program to calculate a confidence interval for an odds ratio.

Use a SAS program to perform a Mantel-Haenszel analysis to estimate
an odds ratio adjusted for strata effects.

Recognize when odds ratios or relative risks differ significantly between
groups.

Modify a SAS program to perform JT and Cochran-Armitage tests for
trend.

Interpret a Kaplan-Meier survival curve.

Interpret SAS output comparing survival curves.

Describe the process of bootstrapping to estimate variability of an
estimator.

Let's put what we have learned to use by completing the following homework
assignment:

Homework

Look for homework assignments in the Homework folder on ANGEL.

Lesson 12: Prognostic Factor Analyses


Introduction

Predictor variables in statistical analyses also are called independent
variables, prognostic factors, regressors, and covariates. Prognostic factor
analysis (PFA) is an analysis that attempts to assess the relative importance
of several predictor variables simultaneously. Typically, a PFA uses one or
more predictor variables that were not controlled by the investigator.

One reason for studying prognostic factors is to learn the relative importance
of several variables that might affect, or be associated with, disease outcome.
A second reason for studying prognostic factors is to improve the design of
clinical trials. For example, if a prognostic factor is identified as strongly
predictive of disease outcome, then investigators of future clinical trials with
respect to that disease should consider using it as a stratifying variable.

Knowledge of prognostic factors can improve the ability to analyze
randomized clinical trials.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

Differentiate between means and adjusted means.

State how ANCOVA can reduce difficulties resulting from imbalance in
prognostic factors.

Recognize situations in which using a prognostic factor as a covariate is
recommended.

Recognize the difficulty presented with time-dependent covariates.

Recognize interaction effects and differentiate between qualitative and
quantitative interactions.

Select the appropriate analysis of the data among ANCOVA, logistic
regression and proportional hazards.

Modify SAS programs to perform ANCOVA and logistic regression.

Interpret the relevant portions of SAS output for these analyses.

Identify 4 approaches to model building.

Recognize the effects of incomplete data on model-building.

Propose two methods of validating a model from a nonrandomized
study.

Reference

Piantadosi, Steven. (2005) Prognostic Factor Analyses. In: Piantadosi,
Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ:
John Wiley and Sons, Inc.

12.1 - Prognostic Factors


Knowledge of prognostic factors can improve the ability to analyze
randomized trials. Suppose that there is a strongly prognostic factor in a
clinical trial in which the investigators did not use it as a stratifying variable
(perhaps they were unaware of this factor) and that the treatment groups are
not balanced with respect to this prognostic factor. Then a simple comparison
of treatment groups A and B could yield misleading results as shown in the
illustration below.

You can see that most of the b's have the lower level of the covariate, and
most of the a's have the higher level of the covariate. The apparent difference
between the A mean and the B mean is due to the differing values in the
covariate. But how do we adjust for these differences and then compare A and
B?

Analysis of covariance (ANCOVA) models can eliminate or reduce the
problem of different levels of the covariates to yield a fair comparison of
treatment groups. With respect to nonrandomized studies, ANCOVA models
can have the same effect.

How does ANCOVA adjust for these differences?

The unadjusted or raw means for groups A and B, respectively, are:

\[ \bar{Y}_A = \frac{1}{n_A} \sum_{i=1}^{n_A} Y_{Ai} \quad \text{and} \quad \bar{Y}_B = \frac{1}{n_B} \sum_{i=1}^{n_B} Y_{Bi} \]

The adjusted means for groups A and B, respectively, from the ANCOVA are:

\[ \bar{Y}_{A,adj} = \frac{1}{n_A} \sum_{i=1}^{n_A} \left(Y_{Ai} - \hat{\beta} X_{Ai}\right) \quad \text{and} \quad \bar{Y}_{B,adj} = \frac{1}{n_B} \sum_{i=1}^{n_B} \left(Y_{Bi} - \hat{\beta} X_{Bi}\right) \]

where X represents the covariate and β^ represents the estimated slope for
X based on the entire set of covariate values across both groups. We are
interested in the adjusted values, which have accounted for the covariate. The
adjusted mean is an average of the adjusted values. In the graphical
illustration above, the covariate slope is positive and the covariate values for
the A group are greater, so the adjustment to the A mean is larger. Subtracting
these adjustments from the measured responses as indicated in the formula
for the adjusted mean (above) will bring the mean of A closer to the mean of
B.
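
A small numerical sketch in Python shows the adjustment at work. The data are hypothetical and constructed so that there is no treatment effect at all: the response is exactly twice the covariate in both groups, but group A has higher covariate values. The slope is estimated here by the pooled within-group slope, which is an assumed choice of estimator for this illustration.

```python
def mean(values):
    return sum(values) / len(values)

def adjusted_mean(y, x, beta_hat):
    """Average of the adjusted values Y - beta_hat * X."""
    return mean([yi - beta_hat * xi for yi, xi in zip(y, x)])

# Hypothetical data: response = 2 * covariate in both groups (no treatment
# effect), but group A has the higher covariate values.
x_a, x_b = [6.0, 7.0, 8.0], [2.0, 3.0, 4.0]
y_a, y_b = [2 * x for x in x_a], [2 * x for x in x_b]

# Pooled within-group slope (assumed estimator of the common slope).
sxy = sum((x - mean(g_x)) * (y - mean(g_y))
          for g_x, g_y in ((x_a, y_a), (x_b, y_b))
          for x, y in zip(g_x, g_y))
sxx = sum((x - mean(g_x)) ** 2 for g_x in (x_a, x_b) for x in g_x)
beta_hat = sxy / sxx

raw_difference = mean(y_a) - mean(y_b)
adjusted_difference = (adjusted_mean(y_a, x_a, beta_hat)
                       - adjusted_mean(y_b, x_b, beta_hat))
```

The raw means differ by 8 units even though the treatments are identical, while the adjusted means are equal. This is the kind of covariate-driven difference that the adjustment is meant to remove.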

12.2 - Interactions
It is important to examine treatment × covariate interactions. For example, it is
possible that the responses in the treatment groups differ for low levels of the
prognostic factor, but do not differ for high levels of the prognostic factor.
Interpretations of statistical results in the presence of treatment × covariate
interactions can become complex.

It is unfortunate that the detection of interactions is model-dependent. For
example, if the treatment × covariate interaction exists in an ANCOVA model
of the outcome variable (additive model), it is possible that it will disappear in
an ANCOVA model of the logarithm of the outcome variable (multiplicative
model). Therefore, interactions can exist or fail to exist based on the selection
of the statistical model and the assumptions associated with it.

Notice in the figure above that the difference between Treatments A and B is
constant and does not depend on the value of the covariate. There is no
treatment by covariate interaction in this scenario.

Now, in the figure above, the difference between Treatments A and B is not
constant and does depend on the value of the covariate.

Prognostic factors can be continuous measures (age, baseline cholesterol,
etc.), ordinal (age categories, baseline disease severity, etc.), binary (gender,
previous tobacco use, etc.), or categorical (ethnic group, geographic region,
institution in multi-center trials, etc.).

Most prognostic factors are measured at baseline and do not change over
time, hence they are called time-independent covariates. Time-independent
covariates are easily included in many types of statistical models.

On the other hand, some prognostic factors do change over time; these are
called time-dependent covariates. For example, consider a clinical trial in
diabetics with selected dietary intake variables measured on a regular basis
over the course of a six-month clinical trial. Dietary intake could affect the
severity of disease and exacerbate problems for diabetic patients, so a
statistical analysis of the trial might incorporate these prognostic factors as
time-dependent covariates. We have to be very careful when using time-
dependent covariates.

A statistical analysis that uses time-dependent covariates can yield misleading
results if the time-dependent covariates are affected by treatment. In such a
situation, removal of the effect of the time-dependent covariates can also
cause removal of the treatment effect. For example, suppose that one of the
treatments in the aforementioned diabetes example has an additional effect of
increasing appetite. Then removing the effect of dietary intake also could
remove the effect of the treatment!

The schematic below represents the situation where the treatment is affecting
both the covariates and the outcome. The covariates also affect the outcome.
Adjusting for the effect of the covariate over time may account for the majority
of the treatment effect on the outcome.

As another example, consider a clinical trial in which an experimental therapy
for decreasing diastolic blood pressure is compared to placebo. Suppose that
the investigator wants to use pulse as a time-dependent covariate in an
ANCOVA model because there is a positive correlation between pulse and
diastolic blood pressure. Suppose that the treatment not only reduces diastolic
blood pressure, but reduces pulse as well. Then an analysis that removes the
effect of pulse, measured over the course of the trial, could remove the effect
of treatment. The misleading results from the model would cause the
treatment effect on diastolic blood pressure to remain undetected.

12.3 - Model-Based Methods: Continuous Outcomes
A model relating the outcome variable to the prognostic factors and treatment
effects is a construct that makes use of theoretical knowledge and empirical
knowledge. In mathematical and statistical models, the theoretical component
is represented by one or more equations and the empirical component is
represented by data. The behavior of a model is governed by its structure or
functional form and by the unknown quantities or constants (parameters).
Objectives of the modeling exercise might include estimation of the
parameters, determination of model fit, or efficient summarization of large
amounts of data.

Dr. George Box once stated: "All models are wrong, but some are useful."

A linear model often is used for describing a continuous outcome variable. In
this situation, "linear" refers to the fact that the deterministic component of the
model is a linear combination of parameters and covariates. Statistical models
typically contain a deterministic component and a random component.

An example is the linear model that is used for multiple regression. Let Y
denote the outcome variable and X1, X2, . . . , XK denote K different regressors
(predictors) that are measured on each of n patients. Then the statistical
model for patient i, i = 1, 2, . . . , n, is

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + \varepsilon_i \]

where β0 is the intercept, β1, β2, . . . , βK are the slopes for the K regressors,
and εi represents the random error term for patient i.

In the multiple regression model, β0 + β1X1i + β2X2i + . . . + βKXKi represents the
deterministic portion of the model for patient i and εi represents the random
error term for patient i, i = 1, 2, . . . , n.

Typically, we assume that ε1, ε2, ... , εn are independent and identically
distributed random variables, each following a N(0, σ²) distribution.

A linear model for a one-way analysis of variance (ANOVA) is:

\[ Y_{ij} = \mu_i + \varepsilon_{ij} \]

where i = 1, 2, . . . , K denotes the ith treatment group, j = 1, 2, . . . , ni denotes
the jth patient within the ith treatment group, μi denotes the population mean
for the ith treatment group, and εij represents the random error term for the
jth patient within the ith treatment group.

A linear model for a one-way analysis of covariance (ANCOVA) with three
covariates is:

\[ Y_{ij} = \mu_i + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \beta_3 X_{3ij} + \varepsilon_{ij} \]

where the notation is similar to that for the one-way ANOVA with K treatment
groups, and X1ij, X2ij, X3ij denote the values of the three covariates for the
jth patient within the ith treatment group.

The covariates in an ANCOVA model (the regressors in a multiple regression
model) may be continuous, ordinal, or binary.

If a covariate is categorical with L levels, it might be necessary to recode it as
L - 1 distinct covariates that are binary (called dummy variables). One way to
do this is to select a reference level and let the dummy variables correspond
to the remaining L - 1 levels.

For example, suppose that there are four centers in a multi-center trial and
that it is desirable to model for center effects. The above ANCOVA model can
be invoked with center #4 as the reference level:

X1ij = 1, if patient (i,j) is in center #1; 0 otherwise
X2ij = 1, if patient (i,j) is in center #2; 0 otherwise
X3ij = 1, if patient (i,j) is in center #3; 0 otherwise

The implications of the model are that μ1, μ2, ... , μK represent treatment
means within the reference center (center #4). Patients within center #1 have
treatment means μ1 + β1, μ2 + β1, ... , μK + β1, so that β1 represents the
change in any treatment mean between center #4 and center #1.
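
The recoding with a reference level can be sketched in Python as follows. The function names and the numerical values of the treatment mean and center effects are hypothetical and serve only to illustrate how the dummy variables enter the expected response.

```python
def center_dummies(center, reference=4, n_centers=4):
    """Return the L - 1 indicator variables for one patient's center."""
    return [1 if center == level else 0
            for level in range(1, n_centers + 1) if level != reference]

def expected_response(mu_i, betas, center):
    """Treatment mean in the reference center plus the center shift."""
    return mu_i + sum(b * x for b, x in zip(betas, center_dummies(center)))

for center in (1, 2, 3, 4):
    print(center, center_dummies(center))

# Hypothetical values: treatment mean 100 in center #4, center effects below.
betas = [5.0, -2.0, 1.5]
in_reference = expected_response(100.0, betas, center=4)
in_center_1 = expected_response(100.0, betas, center=1)
```

A patient in the reference center has all three dummy variables equal to zero, so the expected response is the treatment mean itself. A patient in center #1 has that mean shifted by the first center coefficient.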

Statistical software packages for multiple regression typically require the user
to recode categorical regressors/covariates in this manner (SAS PROC REG),
whereas the statistical software packages for ANOVA and ANCOVA can
recode categorical regressors/covariates for the user (the CLASS statement in
SAS PROC ANOVA and SAS PROC GLM).

New regressors/covariates can be constructed as interactions among other
regressors. For example, suppose that X1 represents age and X2 represents
serum cholesterol. A third regressor, X3 = X1 × X2, can be constructed as the
product and might be important to include in the model if only old age in
combination with high cholesterol has an impact on the outcome.

This example is called a first-order interaction and higher-order interactions
can be constructed as products of more than two regressors. Of course,
constructing more regressors in this manner can get unwieldy and lead to an
unmanageable number of potential regressors to consider.

Treatment × covariate interactions are important to investigate in randomized
and nonrandomized studies. Treatment × center interactions are important to
investigate in multi-center trials.

If the treatment × covariate interactions are important, then it could be difficult
to interpret main effects for treatment.

For example, suppose that in a two-armed trial (treatments A and B) baseline
cholesterol level is considered an important covariate. Suppose that the
treatment × cholesterol interactions are significant such that for low
cholesterol levels treatment A is better than treatment B, but for high
cholesterol levels treatment B is better than treatment A. Thus, it is not
possible to conclude that one treatment is superior because the choice of the
best treatment depends on baseline cholesterol levels, an important discovery
in and of itself. This type of treatment × covariate interaction is called
"qualitative."

If the investigator focuses on a specific region of values for the covariate, such
as high baseline cholesterol, then it may be possible to determine which
treatment is superior in this region.

Sometimes it is possible to make general conclusions if the interactions are
due to the magnitude of the effect.

For example, suppose treatment A is 20 units better than treatment B in the
presence of low baseline serum cholesterol, but treatment A is 60 units better
than treatment B in the presence of high baseline serum cholesterol levels.
Even though there is a significant treatment × covariate interaction, it still
appears that treatment A is superior. This type of treatment × covariate
interaction is called "quantitative."

Profile plots (mean outcome response versus the covariate) for each
treatment group will indicate graphically whether the interactions are
qualitative (not parallel and crossing) or quantitative (not parallel but not
crossing).
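
The distinction can be expressed as a simple rule on the treatment differences (A minus B) at each covariate level. The Python sketch below uses a hypothetical function name and treats the supplied differences as known values, so it ignores sampling variability. It is an illustration of the definitions, not a statistical test for interaction.

```python
def classify_interaction(differences, tol=1e-9):
    """Classify from treatment differences (A minus B) at covariate levels."""
    if max(differences) - min(differences) <= tol:
        return "none"            # parallel profiles
    if min(differences) < -tol and max(differences) > tol:
        return "qualitative"     # profiles cross: the better treatment changes
    return "quantitative"        # same direction, different magnitude

# A is 20 units better at low cholesterol and 60 units better at high.
magnitude_only = classify_interaction([20.0, 60.0])
# A is better at low cholesterol, B is better at high (hypothetical sizes).
crossing = classify_interaction([20.0, -10.0])
parallel = classify_interaction([20.0, 20.0])
```

The first call corresponds to the 20-unit versus 60-unit example and is classified as quantitative. The second corresponds to the case in which the preferred treatment switches with baseline cholesterol and is classified as qualitative.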


12.4 - Examples
SAS Example ( 13.1_multiple regression.sas [1] )


Tershakovec et al (One-year follow-up of nutrition education for
hypercholesterolemic children [2]. American Journal of Public Health 1998; 88:
258-261) conducted a trial in which hypercholesterolemic children were
randomized to three different nutritional educational groups:

1. parent-child autotutorial (PCAT);

2. counseling;

3. control.

The primary outcome was LDL cholesterol and it was assessed at 0 (baseline),
3, 6, and 12 months.

A multiple regression model was applied to determine whether demographic
and nutritional intake variables were predictive of LDL cholesterol at baseline.
Run this program and look at the regression output. Do you see that only one
predictor, female sex, is statistically significant (p = 0.0004)? Now look at
the output for the means and the scatterplot of LDL by sex. Do you see that
the mean LDL among females is 6.2 mg/dL greater than the mean LDL among
males?

SAS Example ( 13.2_ANCOVA.sas [3] ): The longitudinal data from the one-
year follow-up are provided but not analyzed (beyond the scope of this
course).

SAS Example ( 13.3_repeated measurements.sas [4] ): Carithers et al
(Methylprednisolone therapy in patients with severe alcoholic
hepatitis. Annals of Internal Medicine 1989; 110: 685-690) conducted a multi-
center, double-blinded, randomized trial in which they compared
methylprednisolone to placebo in patients with severe alcoholic hepatitis.
Treatment was administered over four weeks. The primary outcome was time
to death, and the Kaplan-Meier survival curves and the logrank test indicated
that methylprednisolone was superior to placebo.

Secondary outcomes included laboratory measures such as albumin, prothrombin
time, bilirubin, and hematocrit. An ANCOVA model was invoked to compare
the treatments with respect to bilirubin at week 4, with baseline bilirubin and
clinical center as covariates.

Here are a couple of places at the end of the program that you will want to
make note of:

Now, run the program and look at the output. Notice 66 observations read, 47
used. This means 19 patients are missing data for a term in the model so
SAS cannot use their data. In this case, 19 are missing bilirubin4
measurements.

Considering the Type III SS in the output, do you see that baseline bilirubin
(bilirubin0) was an important covariate? The treatment groups do not differ
significantly. Nor was significant interaction observed between center and
treatment or baseline bilirubin and treatment.

A more parsimonious model is run; again only baseline bilirubin has a
significant effect. The difference in the raw means at week 4 is about 4 units,
but it is reduced after adjusting for the covariates (see the adjusted means).
The difference between treatments was not statistically significant (p =
0.2782), although methylprednisolone showed a numerical advantage over
placebo, with adjusted means of 7.84 and 10.84, respectively, for the two
treatments.

12.5 - Model-Based Methods: Binary Outcomes
For a binary outcome, logistic regression analysis is used to model the log
odds as a linear combination of parameters and regressors. Let p(X1, X2, ... ,
XK) denote the probability of success in the presence of the K regressors. The
logistic regression model for the log-odds for the ith patient is

\[
\log\left(\frac{p(X_{1i}, X_{2i}, \ldots, X_{Ki})}{1 - p(X_{1i}, X_{2i}, \ldots, X_{Ki})}\right)
= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki}
\]

Notice that β0 represents the reference log odds, i.e., when X1i = 0, X2i = 0,
... , XKi = 0. Consider a simple model with one covariate (K = 1) which is
binary, e.g., X1i = 0 if the ith patient is in the placebo group and 1 if the
ith patient is in the treatment group. Then the log odds ratio for comparing
the treatment to the placebo group is

\[
\log\left(\frac{p(X_{1i}=1)}{1 - p(X_{1i}=1)} \Big/ \frac{p(X_{1i}=0)}{1 - p(X_{1i}=0)}\right)
= (\beta_0 + \beta_1) - \beta_0 = \beta_1
\]

If the covariate is ordinal or continuous, then

\[
\log\left(\frac{p(X_{1i}=x)}{1 - p(X_{1i}=x)} \Big/ \frac{p(X_{1i}=0)}{1 - p(X_{1i}=0)}\right)
= (\beta_0 + \beta_1 x) - \beta_0 = \beta_1 x
\]

so that the odds ratio is exp(β1x). This illustrates that changes in a covariate
have a multiplicative effect on the baseline odds.

For example, suppose x represents (age - 18) in a study of adults, and that
the estimate of the coefficient β1 is 0.04 with a p-value < 0.05. Then the
estimated odds ratio is exp(0.04) = 1.041. This may not seem like a clinically
meaningful odds ratio, but remember that it represents the increase in odds
between a 19-year-old and an 18-year-old. For a 25-year-old person compared
with an 18-year-old, the estimated odds ratio is exp(0.04 × 7) = 1.323.
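The arithmetic in this example can be checked with a short Python sketch. The function name `odds_ratio` is only an illustrative helper, not part of the course's SAS programs; the coefficient 0.04 is the estimate quoted in the text.

```python
import math


def odds_ratio(beta, delta_x=1.0):
    """Odds ratio implied by a logistic coefficient for a change of delta_x."""
    return math.exp(beta * delta_x)


beta_age = 0.04  # estimated coefficient for x = age - 18 (from the text)

print(round(odds_ratio(beta_age), 3))     # 19 vs. 18 years old -> 1.041
print(round(odds_ratio(beta_age, 7), 3))  # 25 vs. 18 years old -> 1.323
```

Because the effect is multiplicative on the odds, the seven-year odds ratio equals the one-year odds ratio raised to the seventh power.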

For the logistic regression model, each βj, j = 1, 2, ... , K, represents the
log odds ratio for a one-unit increase in the jth covariate, holding the other
covariates fixed. An equivalent expression for the logistic regression model
in terms of the probability is

\[
p(X_{1i}, X_{2i}, \ldots, X_{Ki})
= \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki})\}}
\]

Logistic regression models are available for an ordinal response. Suppose
that an outcome variable, Y, is ordinal and that we designate its ordered
categories as 0, 1, ... , C. We model the ordinal logits as

\[
\log\left(\frac{\Pr[Y \ge c \mid X_{1i}, X_{2i}, \ldots, X_{Ki}]}{1 - \Pr[Y \ge c \mid X_{1i}, X_{2i}, \ldots, X_{Ki}]}\right)
= \beta_{0c} + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki},
\quad c = 1, 2, \ldots, C
\]

The ordinal logistic regression model has C intercept terms, but only one term
for each regressor. This reduced modeling for an ordinal outcome assumes
proportional odds (beyond the scope of this course).

SAS Example ( 13.4_logistic regression.sas [5] ): Boyle et al (Masking of
physicians in the Growth Failure in Children with Renal Disease clinical
trial. Pediatric Nephrology 1993; 7: 204-206) investigated the success of the
masking in the randomized, double-blinded, multi-center GFRD [6] clinical trial.
The clinical director at each center was asked to identify or guess the
assigned treatment for each randomized patient.

A logistic regression analysis was applied to the binary outcome of
incorrect/correct guess. Regressors included treatment group and months in
the study. Note the creation of the binary variable 'newscore' from the score
variable, within the data step before the PROC LOGISTIC statements. Similarly,
a binary variable 'treatment' is created from the variable 'trtgroup'.

Run the program. On the output, you see "probability modeled is
newscore=1," which also indicates the order for calculating the odds ratio. The
confidence intervals for the odds ratios all include 1. With no statistically
significant results, the investigators remained confident that the masking
scheme was successful.

12.6 - Model-Based Methods: Time-to-event Outcomes
For a time-to-event outcome variable, proportional hazards regression is
available. Let λ(t|X1i, X2i, ... , XKi) denote the hazard function for the ith
patient at time t, i = 1, 2, ... , n, where the K regressors are denoted as
X1i, X2i, ... , XKi. The baseline hazard function at time t, i.e., when X1i = 0,
X2i = 0, ... , XKi = 0, is denoted as λ0(t). The baseline hazard function is
analogous to the intercept term in the multiple regression model or logistic
regression model.

The proportional hazards regression model states that the log of the ratio of
the hazard function to the baseline hazard function at time t is a linear
combination of parameters and regressors, i.e.,

\[
\log\left(\frac{\lambda(t \mid X_{1i}, X_{2i}, \ldots, X_{Ki})}{\lambda_0(t)}\right)
= \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki}
\]

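The proportional hazards regression model is often described as nonparametric
(more precisely, semiparametric) because we do not specify a specific
distribution function for the time-to-event outcome variable, although the
covariate effects are still modeled through parameters. The proportionality
assumption, however, is important.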

The ratio of hazard functions can be considered a ratio of risk functions, so
the proportional hazards regression model is a function of relative risk (unlike
the logistic regression models which are a function of the odds ratio).
Changes in a covariate have a multiplicative effect on the baseline risk. The
model in terms of the hazard function at time t is:

\[
\lambda(t \mid X_{1i}, X_{2i}, \ldots, X_{Ki})
= \lambda_0(t) \exp(\beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki})
\]
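A minimal Python sketch may help show what "proportional" means here. The baseline hazard and the coefficient below are invented for illustration only; the point is that the ratio of hazards for two patients does not depend on t, whatever the baseline hazard looks like.

```python
import math


def hazard(t, x, betas, baseline_hazard):
    """Hazard at time t under a proportional hazards model."""
    linear_predictor = sum(b * xi for b, xi in zip(betas, x))
    return baseline_hazard(t) * math.exp(linear_predictor)


def baseline(t):
    # Hypothetical baseline hazard that increases with time.
    return 0.01 * t


betas = [0.5]  # hypothetical log hazard ratio for a binary treatment indicator

for t in (1.0, 5.0, 10.0):
    ratio = hazard(t, [1], betas, baseline) / hazard(t, [0], betas, baseline)
    print(t, round(ratio, 4))  # the same ratio, exp(0.5), at every time point
```

The hazards themselves change over time, but their ratio stays at exp(0.5), which is the proportionality assumption in action.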

12.7 - Model-Based Methods: Building a Model
Regression/ANCOVA models as described above are most useful when

1. they contain a few clinically relevant and interpretable prognostic
variables

2. the parameters or coefficients are estimated with relatively high
precision

3. the prognostic factors each carry independent information about the
outcome variable

4. the model is consistent with other clinical and biological data


For a given situation, however, it may not be easy to construct a model that
satisfies these criteria.

With available statistical software in modern computers, portions of the
model-building process are automatic. Caution must be exercised, however, for
the following reasons.

1. The criteria employed by the software may be inappropriate, e.g.,
relying solely on p-values.

2. There are poor statistical properties when performing a large number of
tests and refitting models.

3. It is not possible to incorporate outside information into the
model-building process.

4. The software may not handle the problem of missing data very well.

The model-building process requires thought and an understanding of the
clinical situation. Some statisticians only use prognostic variables in the
model for which there exist plausible biological reasons for their inclusion.

Approaches

Computer software to assist in the construction and evaluation of a model
follows several approaches.

One approach is called a step-up or forward selection process, in which the
initial model contains no regressors but they enter the model one at a time. In
this situation, a regressor enters the model if its p-value is less than a
critical value, say 0.05.

Another approach is called the step-down or backward selection process,
in which the initial model contains all of the regressors. In this situation, a
regressor is eliminated from the model if its p-value is not less than the
critical value.

A third approach, called stepwise selection, is a modification of forward
selection. In this situation, after a new variable enters the model, all the
variables that had entered the model previously are reexamined to see if
their p-values have changed. If any of the revised p-values exceed the critical
value, then the corresponding variables are eliminated from the model.

A fourth approach involves finding the best one-variable model, the best
two-variable model, etc. with the help of software, and then using judgment as
to which is the best overall model, i.e., if the (c+1)-variable model is only
slightly better than the c-variable model, the latter is selected. It is prudent
to attempt a variety of models and approaches to determine if the results are
consistent.
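The forward selection process described above can be sketched as a short loop. This is a schematic illustration, not what any particular package does: `forward_select` and `toy_pvalue` are hypothetical names, and the p-values are made up (in real use they would come from refitting the model each time a candidate regressor is added).

```python
def forward_select(candidates, pvalue_fn, alpha=0.05):
    """Step-up selection: repeatedly add the candidate with the smallest
    p-value, stopping when no remaining candidate falls below alpha."""
    selected = []
    remaining = list(candidates)
    while remaining:
        pvals = {c: pvalue_fn(selected, c) for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # nothing left meets the entry criterion
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_pvalue(selected, candidate):
    # Hypothetical p-values standing in for a model refit. x2 is assumed to be
    # strongly correlated with x1, so it adds little once x1 is in the model.
    marginal = {"x1": 0.001, "x2": 0.004, "x3": 0.30}
    if candidate == "x2" and "x1" in selected:
        return 0.45
    return marginal[candidate]


print(forward_select(["x1", "x2", "x3"], toy_pvalue))  # ['x1']
```

The toy example also shows why selection results depend on the order of entry: x2 looks significant on its own but is never selected once x1 has entered.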

Some statisticians favor the backward selection or step-down process,
although there is no universal agreement among statisticians. It is not unusual
to discover, for a particular data set, that step-up and step-down selection
algorithms lead to different models. The main reason for this is that the
regressors/covariates are not completely independent of each other.

When a variable is entered into or removed from a model, the p-values of the
other variables will change. Consider a linear model with two potential
regressors, X1 and X2, and suppose that they are strongly correlated
(so calling them "independent variables" is a misnomer). Suppose that in a model
with X1 only, X1 is significant, and in a model with X2 only, X2 is significant.
When a model is constructed with both X1 and X2, however, the contribution
by X2 to the model is no longer statistically significant. Because X1 and X2 are
strongly correlated, X2 has very little predictive power when X1 already is in the
model.

Initial screening of the entire set of candidate regressors/covariates is
advised. Many statisticians recommend that each potential regressor be
examined individually in a simple model. This can help identify potential
regressors for which there is not a strong biological justification.

Usually the critical significance level in this first-stage approach is more
lenient, say 0.10 or 0.15. Then all of the regressors that meet this first-stage
criterion and/or that have biological/clinical justification comprise the set of
regressors that are subjected to the model-building process. Clinical input
always should augment this first-stage process.

12.8 - Example
SAS Example ( 13.5_ph regression.sas [7] ): A safety and efficacy study was
conducted in 83 patients with malignant mesothelioma, an uncommon lung
cancer that is strongly associated with asbestos exposure. Patients underwent
one of three types of surgery, namely, biopsy, limited resection, and
extrapleural pneumonectomy (EPP). Treatment assignment was
nonrandomized and based on the extent of disease at the time of diagnosis.
Thus, there can be a strong procedure selection bias.

A proportional hazards regression analysis was applied that included a
stepwise selection process to build a model with prognostic factors in addition
to surgery type. Examine the program and note the following:

Run the program. Do you agree that histologic subtype is the only statistically
significant covariate (p = 0.025)?

If there is a large amount of correlation among a set of regressors, then a
problem known as collinearity can exist. Collinearity can cause difficulties in
interpretation because it is not obvious which regressor in a set of highly
correlated regressors should be used in the model.

In addition, collinearity, if not diagnosed, can lead to numerical instabilities
in the software and yield strange results. It is recommended that the
correlations among the set of regressors be examined prior to the
model-building process, e.g., using PROC CORR of SAS.

If two or more regressors are observed to be highly correlated (say with
correlations above 0.8 or below -0.8), then most of the variables in a
correlated set should not be included in the model-building process.
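The screening step can be sketched in Python as a stand-in for a PROC CORR run. The variable names and values below are hypothetical, and the 0.8 cutoff is the rule of thumb from the text; the sketch simply flags every pair of regressors whose Pearson correlation exceeds the cutoff in absolute value.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_collinear(data, threshold=0.8):
    """Return (name1, name2, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical candidate regressors for six patients.
data = {
    "weight_kg": [60, 72, 81, 90, 55, 68],
    "bmi": [21.0, 24.5, 27.9, 30.2, 19.8, 23.1],
    "age": [34, 61, 45, 29, 52, 40],
}

for pair in flag_collinear(data):
    print(pair)  # only the weight_kg / bmi pair is flagged
```

Which member of a flagged pair to keep is a clinical judgment, not something the correlation itself can decide.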

12.9 - Missing Values


Missing values could cause a problem during the model-building process if
subjects display different patterns of missingness for the set of
regressors/covariates. For example, consider the aforementioned linear model
with two potential regressors, X1 and X2. Suppose that there are 100 subjects,
but 50 are missing X1 and the remaining 50 are missing X2. Thus, no subject
has X1 and X2 observed simultaneously, so a model with both regressors is
not possible. This is an extreme case, but most model-building endeavors
encounter some form of missingness.
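The extreme case above can be reproduced with a few lines of Python. The records are constructed to match the description (100 subjects, 50 missing X1 and the other 50 missing X2); the helper name `complete_cases` and the observed values are illustrative assumptions.

```python
def complete_cases(records, variables):
    """Keep only the records with every listed variable observed."""
    return [r for r in records if all(r.get(v) is not None for v in variables)]


# 100 subjects: the first 50 are missing X1, the remaining 50 are missing X2.
records = [{"X1": None, "X2": 1.0} for _ in range(50)]
records += [{"X1": 2.0, "X2": None} for _ in range(50)]

print(len(complete_cases(records, ["X1"])))        # 50 usable for X1 alone
print(len(complete_cases(records, ["X2"])))        # 50 usable for X2 alone
print(len(complete_cases(records, ["X1", "X2"])))  # 0 usable for both
```

Statistical software typically performs this complete-case filtering silently, which is why the numbers of observations read and used should always be compared.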

Hopefully, missing data among the regressors/covariates are not related to the
outcome. If this is not the case, then it may not be possible to develop a
model that is unbiased. For example, if the patients with the most severe form
of the disease are the ones with missing values for the regressors/covariates,
then the resultant model that does not include these patients will be biased.

As has been discussed earlier, data imputation is one way to handle the
situation of missing values. Data imputation involves the estimation of the
missing values in a manner that is consistent and then imputing the
estimated values for the missing values. Thus, every subject will have a
complete set of regressors/covariates and the statistical analysis can proceed
without eliminating any subjects.

The values to be imputed can be estimated by averaging over the observed
values or by fitting regression models in which the regressors with missing
values become the outcome variables. Obviously, there is some danger of
introducing large biases with imputation, so it must be performed carefully on
a case-by-case basis.
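The simplest of these options, averaging over the observed values, might look like the following sketch. The function name and the data are hypothetical. Note that filling in the mean leaves the mean unchanged but understates the variability of the covariate, which is one reason imputation must be used with care.

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


# Hypothetical baseline cholesterol values with two missing entries.
cholesterol = [180.0, None, 220.0, 200.0, None]
print(mean_impute(cholesterol))  # missing entries become 200.0
```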

Make every effort to collect complete data to avoid such problems. When data
are missing, be certain to report the numbers of patients used in each analysis
and any methods used to impute missing values.

12.10 - Model Validation


Statisticians have recommended a number of approaches to evaluate a
model. One approach involves partitioning the data set into an estimation data
set and a validation data set (usually in a two-thirds versus one-third split).

The estimation data set is used to build the model, and hence, estimate the
parameters. The validation data set is used to validate the model by inserting
a patient's set of observed regressors into the estimated model equation and
predicting the outcome response for that subject.

If the predicted outcome is relatively close to the observed outcome for the
subjects in the validation data set, then the model is considered valid.

Another approach is called the leave-one-out method and consists of
eliminating the first patient from the data set with n subjects, estimating the
model equation based on the remaining n - 1 patients, calculating the
predicted outcome for the first patient, and then comparing the first patient's
predicted and observed outcomes.

This process is performed for each of the n patients and an overall validation
statistic is constructed.
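A minimal sketch of the leave-one-out method for a simple linear regression with one regressor follows. Several choices here are assumptions rather than anything prescribed by the text: the model is a straight line, the overall validation statistic is taken to be the mean squared prediction error, and the data are invented.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_mse(xs, ys):
    """Mean squared prediction error, leaving out one patient at a time."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]   # the remaining n - 1 patients
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        predicted = intercept + slope * xs[i]
        squared_errors.append((ys[i] - predicted) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical regressor and outcome for eight patients.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2]
print(leave_one_out_mse(x, y))
```

A small value relative to the spread of the outcome suggests that the model predicts well for patients who were not used to fit it.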

These validation procedures work fine for nonrandomized studies, but for
randomized clinical trials, they probably should be applied only to secondary
and exploratory statistical analyses.

12.11 - Adjusted Analyses of Comparative Efficacy (Phase III) Trials
Some statisticians do not like to perform adjusted analyses such as ANCOVA
in comparative efficacy (Phase III) trials because they feel that randomization
and proper analysis guarantee unbiasedness and the correctness of type I
error levels, even if there are chance imbalances among the treatment groups
with respect to prognostic factors.

This may be true, but the use of prognostic factors in ANCOVA models can
improve precision and verify biological information.

Investigators should consider using a prognostic factor as a covariate in the
analysis of data from a randomized trial under any of the following
circumstances:

1. the prognostic factor is significantly unbalanced among the treatment
groups;

2. the prognostic factor is strongly associated with the outcome variable, in
the presence of balance (i.e., stratifiers) or imbalance among treatment
groups;

3. it is of interest to determine whether the prognostic factor causes or
reduces the treatment effect;

4. the prognostic factor is clinically important and it is of interest to
illustrate and quantify its effect.

12.12 - Summary
In this lesson, among other things, we learned to:

Differentiate between means and adjusted means.

State how ANCOVA can reduce difficulties resulting from imbalance in
prognostic factors.

Recognize situations in which using a prognostic factor as a covariate is
recommended.

Recognize the difficulty presented with time-dependent covariates.

Recognize interaction effects and differentiate between qualitative and
quantitative interactions.

Select the appropriate analysis of the data among ANCOVA, logistic
regression and proportional hazards.

Modify SAS programs to perform ANCOVA and logistic regression.

Interpret the relevant portions of SAS output for these analyses.

Identify 4 approaches to model building.

Recognize the effects of incomplete data on model-building.

Propose two methods of validating a model from a nonrandomized
study.

Let's put what we have learned to use by completing the following homework
assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in
ANGEL.

Lesson 13: Reporting


Introduction
Reporting the results of a clinical trial is one of the most important and least
studied aspects of clinical research. Investigators have an obligation to
disseminate trial results in a timely and competent manner. Many features of
good reports are similar for all types of trials and find widespread acceptance
in journals.

Uniformity is important to readers, particularly to those who are the least
familiar with the details of the disease or the intervention under study. The
benefits of uniformity are evident in some chronic diseases like cancer, where
standardized staging has improved trial design, reporting, and interpretation.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

recognize critical elements in a journal article or abstract that reports
clinical research

apply guidelines from JAMA to reports of results from parallel group
randomized trials and reports of safety

access a vast number of trials registered with ClinicalTrials.gov

Reference

Piantadosi Steven. (2005) Reporting and Authorship. Factorial Designs. In:
Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd ed.
Hoboken, NJ: John Wiley and Sons, Inc.

13.1 - Publication Bias


Published reports should inform the reader about all aspects of study design,
conduct, analysis, and interpretation that are relevant for assessing the
internal and external validity of the trial. The content and quality of trial
reports in the literature, however, remain inconsistent on these points,
reflecting an imperfect editorial and peer review process. Despite the
limitations of the peer review process, good alternatives currently do not
exist for judging the merits of scientific papers.

Publication bias is the tendency for studies with positive findings to be
preferentially selected for publication over those with negative findings
(i.e., studies that did not find a statistically significant result). If an
editor has a choice between publishing a positive study and one with negative
results, the editor may prefer publishing the positive results for various
reasons. However, negative studies are very important and should be made known
as well. For instance, the early stopping of a study of interferon gamma-1b,
when an interim analysis showed that patients with idiopathic pulmonary
fibrosis (IPF) did not benefit from the treatment, is important information for
other IPF patients who may have been prescribed interferon gamma-1b in
off-label use and for others taking it for the conditions for which it already
has approval (FDA Public Health Advisory on Interferon gamma-1b). [1]

If the published literature is used as a basis for drawing conclusions about a
treatment or a group of related treatments from independent studies, then an
impression biased in favor of the treatment could result. Editors and referees
are not the only ones to blame for the presence of publication bias.
Investigators tend to lose enthusiasm for negative results because they may
be viewed as less glamorous and even viewed as failures. This could lead to
weaker reports, or even no reports, being submitted for publication. Journal
editors and referees can reduce publication bias in two ways. First, they
should assign greater weight to methodologic rigor and thorough reporting
than to statistical significance. Second, they must be willing to report negative
findings from sound studies with as much enthusiasm as positive reports.

13.2 - ClinicalTrials.gov and other means to access study results
Along with efforts of individual journal editors to include negative trials, there
have been recent public and private initiatives to increase public access to
clinical trial results.

The U.S. NIH policy is that results of its funded research should be available
to the public.

We have already discovered that ClinicalTrials.gov provides updated
information for locating U.S. government- and privately-supported clinical trials
and results of certain completed trials. Observational studies addressing
health issues in large groups of people or populations in natural settings are
also included in the ClinicalTrials.gov database.

The site was developed by the U.S. National Institutes of Health (NIH),
through its National Library of Medicine (NLM), in collaboration with the Food
and Drug Administration [2] (FDA), as a result of the FDA Modernization Act
(1997). The types of trials required to be registered at the site expanded with
the Food and Drug Administration Amendments Act of 2007. The "basic
results" of trials that study drugs, biologics or devices approved, licensed or
cleared by the FDA are now required to be posted in a timely manner. (Read
more about the requirements at
http://www.clinicaltrials.gov/ct2/manage-recs/fdaaa#WhenDoINeedToRegister [3]
and http://www.clinicaltrials.gov/ct2/about-site/results [4].)

Many medical journals now have policies of only publishing a manuscript from
a completed clinical trial if the trial has been registered at ClinicalTrials.gov.

Along with satisfying legislative requirements, pharmaceutical manufacturers
are providing access to results from sponsored studies through various
mechanisms (e.g. [5] GSK [6], Pfizer [7], Merck [8]), as the Pharmaceutical
Research and Manufacturers of America and the European Federation of
Pharmaceutical Industries and Associations adopted joint Principles for
Responsible Clinical Trial Data Sharing [9] (2013).

13.3 - Contents of Clinical Trial Reports


Using the proper summary/descriptive statistics is essential. Although
investigators may use standard deviations and standard errors
interchangeably, the standard deviation is appropriate as a descriptive
summary, whereas the standard error is intended to convey the uncertainty of
an estimate (such as the mean). Confidence intervals are more informative
than significance levels and p-values.
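The distinction between the standard deviation and the standard error can be made concrete with a small Python sketch. The data are hypothetical, and the 95% confidence interval uses the normal critical value 1.96 as a simplifying assumption (a t critical value would give a somewhat wider interval for a sample this small).

```python
import math
import statistics


def summarize(values):
    """Descriptive SD, SE of the mean, and an approximate 95% CI."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # describes the spread among patients
    se = sd / math.sqrt(n)          # describes the uncertainty of the mean
    half_width = 1.96 * se          # normal approximation (an assumption)
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "se": se,
        "ci95": (mean - half_width, mean + half_width),
    }


# Hypothetical LDL cholesterol values (mg/dL) for eight patients.
ldl = [128.0, 141.0, 119.0, 135.0, 150.0, 122.0, 138.0, 131.0]
for name, value in summarize(ldl).items():
    print(name, value)
```

The standard deviation stays roughly the same as more patients are enrolled, whereas the standard error shrinks with the square root of the sample size.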

Reports of clinical trials usually do not distinguish between clinical
significance and statistical significance - but they should. Clinical
significance can be expressed in terms of the magnitude and direction of
treatment effects or differences. Although this is important for superiority
trials, it is even more important for equivalence and non-inferiority trials.

Some journals require structured titles and abstracts because they are the
only part of many reports that some readers examine. The abstract is therefore
critical in the medical literature. A good abstract for the report of a
clinical trial includes objectives, design, setting or types of practices,
characteristics of the study population, interventions used, primary outcome
measurements, principal results, and conclusions. Abstracts should be no longer
than 250 words and usually do not include descriptions of the statistical
methods.

The reports for treatment mechanism and dose-finding studies (Phase I)
should include information about study design, demographics, toxicity and
side effects, and recommendations for later trials. The objectives of safety and
efficacy studies (Phase II) are to demonstrate treatment feasibility, estimate
treatment success, estimate treatment complications, and facilitate informal
comparisons with other therapies that might motivate comparative trials.

With respect to the latter objective, the report should recognize the potential
for strong selection bias and avoid overly-enthusiastic statements about
relative efficacy. The reports for safety and efficacy studies should follow
this outline:

introduction

objectives

study design

study setting

demographics

treatments

outcome measures

statistical methods

results

discussion and conclusions

13.4 - Phase III Trials


The outline of reports for comparative efficacy trials (Phase III) is similar to
that for safety and efficacy trials. There are many additional issues, however,
to consider. For example, the reporting of treatment assignment
(randomization) and masking procedures is necessary to assure readers
about the internal validity of the trial. Readers want to know how these
procedures were implemented, and reviewers will use this information to
assess the validity of the study.

The motivation and assumptions for the target sample size should be
included, especially in situations where the primary results are negative
findings. The impact of various prognostic variables should be addressed with
appropriate statistical analyses to demonstrate that treatment effects are not
due entirely to them. Although the intent-to-treat principle should be followed
in randomized trials, it is helpful to report on the results of various exploratory
analyses as well.

Because the methods have direct implications for the validity of the results,
top-line journals require thorough descriptions of them. They also expect
supplemental reports. When the article is available online, there can be links
to more detailed descriptions, figures, graphs and tables.

The CONSORT (Consolidated Standards of Reporting Trials)
Group (http://www.consort-statement.org/ [10]) updated its guidelines for the
reporting of clinical trials; the updated statement appeared in multiple
journals and is available online. Examine the CONSORT checklist [11] and Flow
Diagram [12] in the CONSORT statement. These standards have been adopted by
major medical journals.

Schulz K, Altman D, Moher D, for the CONSORT Group. CONSORT
2010 Statement: Updated Guidelines for Reporting Parallel Group
Randomized Trials [13]. Annals of Internal Medicine 2010; 152: 726-732.

There are several extensions of the original CONSORT statement, which can
also be examined at the CONSORT website [14], focusing on reporting patient
safety, equivalence and non-inferiority trials, cluster trials and other topics.

Ioannidis JPA, Evans SJW, Gøtzsche PC, O'Neill RT, Altman DG, Schulz K,
Moher D, for the CONSORT Group. Better reporting of harms in randomized trials:
An extension of the CONSORT statement [15]. Annals of Internal
Medicine 2004; 141: 781-788.

Piaggio G, Elbourne D, Altman D, Pocock S, Evans S, for the CONSORT
Group. Reporting of Noninferiority and Equivalence Randomized Trials: An
Extension of the CONSORT statement [16]. JAMA 2006; 295: 1152-1160.

Conflict of Interest

Major medical journals require that manuscript authors report any financial
support for the research presented in the article, and complete a form
describing their conflicts of interest. The information on the financial support
and conflicts usually appears at the end of the article, prior to the references.

Example: In the VALIANT trial (NEJM 2003, in Wk 5 course material) the
authors state (1) that the study was supported by a grant from Novartis
Pharmaceuticals, and (2) that some of them also received financial payments
from Novartis for serving as consultants, and some of them also have stock
equity in Novartis. Anyone who reads the article should attempt to examine the
statements about financial support and conflicts, in order to judge whether the
article may present a biased viewpoint.

The International Committee of Medical Journal Editors (ICMJE,
http://www.icmje.org/ [17]) has developed a standardized form for authors to
provide information about their financial interests that could influence how
their work is viewed. The form is designed to be completed and stored
electronically. It contains programming that allows appropriate data
display. Each author listed on the manuscript should submit a separate form
and is responsible for the accuracy and completeness of the information. The
disclosure form is a fillable pdf file
(http://www.icmje.org/coi_disclosure.pdf [18]). The complete list of journals that
require completion of the ICMJE form appears
at http://www.icmje.org/journals.html [19]

13.5 - Summary
In this lesson, among other things, we learned to:

recognize critical elements in a journal article or abstract that reports
clinical research

apply CONSORT standards to reports of results from parallel group
randomized trials and reports of safety

access a vast number of trials registered with ClinicalTrials.gov

Let's put what we have learned to use by completing the following homework
assignment:

Homework

Contribute to this week's discussion.

Lesson 14: Factorial Design


Introduction

A factor is a variable that is controlled and varied during the course of an
experiment. In a chemistry experiment, temperature and pressure may be the
factors that are deliberately changed over the course of the experiment. In a
clinical trial, treatment can be a factor. A study of experimental therapy vs.
placebo can be thought of as having a treatment factor with 2 levels, 0 or the
study dosage. A study with two different treatments has the possibility of a
two-way design, varying the levels of treatment A and treatment B.

Factorial clinical trials are experiments that test the effect of more than one
treatment using a type of design that permits an assessment of potential
interactions among the treatments.

In a factorial design there are two or more factors with multiple levels that are
crossed, e.g., three dose levels of drug A and two levels of drug B can be
crossed to yield a total of six treatment combinations:

low dose of A with low dose of B
low dose of A with high dose of B
mid dose of A with low dose of B
mid dose of A with high dose of B
high dose of A with low dose of B
high dose of A with high dose of B
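
The crossing of factor levels listed above can be sketched in a few lines of
Python. This is only an illustration: the names doses_a, doses_b, and
crossed_combinations are hypothetical and not part of the lesson.

```python
from itertools import product

# Hypothetical level labels taken from the example in the text.
doses_a = ["low", "mid", "high"]   # three dose levels of drug A
doses_b = ["low", "high"]          # two dose levels of drug B


def crossed_combinations(levels_a, levels_b):
    """Return every (A level, B level) pair of a fully crossed design."""
    return list(product(levels_a, levels_b))


combos = crossed_combinations(doses_a, doses_b)
for a, b in combos:
    print(f"{a} dose of A with {b} dose of B")
print(len(combos), "treatment combinations")
```

The number of combinations is the product of the numbers of levels (3 x 2
here), which is why factorial designs grow quickly as factors are added.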

Factorial designs offer certain advantages over conventional designs. There
are a number of ways that you could look at these groups. This lesson will
consider these alternatives...

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

identify the conditions that would allow a factorial design to be useful

recognize the difference between qualitative and quantitative
interactions

recognize the situation for which a min test is the appropriate analysis

Reference

Piantadosi Steven. (2005) Factorial Designs. In: Piantadosi Steven. Clinical
Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and
Sons, Inc.

14.1 - Characteristics of Factorial Designs
The simplest factorial design is the 2 × 2 factorial with two levels of factor A
crossed with two levels of factor B to yield four treatment combinations. A
special case is the 2 × 2 factorial with a placebo and an active formulation of
factor A crossed with a placebo and an active formulation of factor B. This
yields the four treatment regimens:

Placebo A + Placebo B
Placebo A + Active B
Active A + Placebo B
Active A + Active B

For example, here you could have a placebo for each treatment. In one case
you might have a placebo injection for A and a placebo pill for B. Such a
design allows the comparison of the levels of factor A (A main effects), the
comparison of the levels of factor B (B main effects), and the investigation of
A × B interactions.

There are some issues to consider before conducting a factorial clinical
trial.

First, the treatments must be amenable to being administered in combination
without changing dosage in the presence of each other.

Second, it must be acceptable not to administer the individual treatments
(i.e., a placebo is ethical) or to administer them at lower doses if that will be
required for the combination.

Third, we must be genuinely interested in learning about the treatment
combinations required for the factorial design. Otherwise some of the
treatment combinations are unnecessary, yet without them the advantages of
the factorial design are diminished.

Fourth, the therapeutic questions must be chosen appropriately, e.g.,
treatments that use different mechanisms of action are more suitable
candidates for a factorial clinical trial.

14.2 - Interactions
Factorial designs provide the only way to study interactions between
treatment A and treatment B. This is because the design has treatment groups
with all possible combinations of treatments.

The principles presented earlier about treatment × covariate interactions are
relevant to the discussion of treatment A × treatment B interactions in this
lesson. These concepts included:

(1) Detecting interactions may be dependent on the scale of measurement.

(2) In the presence of interactions, it may not be possible to assess the main
effects because the effect of treatment A changes according to the level of
treatment B.

(3) Quantitative interactions refer to the situation in which the direction of the
main effects does not change, although their magnitude may. Qualitative
interactions refer to the situation in which the direction of the main effects
does change.

The figure above indicates a quantitative interaction. The lines are not parallel
but they are not crossing either. The magnitude of the response is dependent
on whether treatment A is at a high or low dose. The greatest response is
achieved with both Treatment B and Treatment A at high dose. A greater
response is observed when Treatment B is at high dose than at mid or low
dose, regardless of the dose level of Treatment A, but how much greater is
dependent on the level of A. At the lowest dose of A, there is very little
difference in the response between the dose levels of B. This is called a
quantitative interaction.
The qualitative interaction occurring in the figure above is more difficult to
explain. The greatest response is achieved with the low dose of treatment B
and the high dose of treatment A. However, if a patient is on the low dose of
treatment A, the greatest response will be achieved with the high dose of
treatment B. Although difficult to sort out, this qualitative interaction is
intuitively reasonable for some drug combinations. There may be toxicity or a
threshold effect that contributes to making the response greater with only one
treatment at the highest dose.
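
The distinction between the two kinds of interaction can be sketched in code.
The following is a minimal illustration that assumes only two dose levels per
treatment (the figures in the lesson use more) and made-up cell means; the
function name classify_interaction and the tolerance tol are hypothetical.

```python
def classify_interaction(means, tol=1e-9):
    """Classify the interaction in a 2 x 2 table of mean responses.

    means[a][b] is the mean response at level a of treatment A and
    level b of treatment B, with 0 = low dose and 1 = high dose.
    """
    (a0b0, a0b1), (a1b0, a1b1) = means
    # Simple effects of B at each level of A, and of A at each level of B.
    b_effects = (a0b1 - a0b0, a1b1 - a1b0)
    a_effects = (a1b0 - a0b0, a1b1 - a0b1)
    if abs(b_effects[1] - b_effects[0]) <= tol:
        return "none"  # parallel lines: no interaction
    direction_changes = (b_effects[0] * b_effects[1] < 0
                         or a_effects[0] * a_effects[1] < 0)
    return "qualitative" if direction_changes else "quantitative"


# Made-up means for illustration only.
print(classify_interaction([[10, 12], [14, 16]]))  # parallel lines
print(classify_interaction([[10, 11], [14, 20]]))  # same direction, different size
print(classify_interaction([[10, 14], [20, 15]]))  # direction of B effect reverses
```

This works on population or sample means only as a description; it is not a
significance test for the interaction.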

14.3 - A Special Case with Drug Combinations
A special case of a partial factorial design that occasionally is used in clinical
research is the incomplete 2 × 2 factorial design with three treatment groups
consisting of drug A, drug B, and drug A in combination with drug B:

Placebo A + Active B
Active A + Placebo B
Active A + Active B

Notice that the Placebo A + Placebo B group is not included in the design,
hence the incompleteness. The incomplete factorial design has become
popular.

Why?

Combination therapies can be marketable and profitable. If a company can
combine the active ingredients for treatment A and treatment B into one
pill/tablet/capsule, more symptoms are relieved with one dose of medicine.
For example, combining an antihistamine with a decongestant for cold
symptoms produces a new cold remedy that will alleviate two major symptoms
with one capsule. Additionally, once the company has created the new
combination product, the company applies for a new patent, extending the
years of profitable returns from the research dollars expended to develop the
initial products. Approval of a combination therapy, however, requires evidence
demonstrating the superiority of the AB combination therapy to the A
monotherapy and the B monotherapy. A logical experimental design to
demonstrate these results would be the incomplete factorial.

Suppose that the response is continuous and that we want to compare the
means $\mu_A$, $\mu_B$, and $\mu_{AB}$, which represent the population means for the A
monotherapy, the B monotherapy, and the AB combination therapy,
respectively. The research objective is to show the superiority of the
combination therapy over the individual therapies.

Assuming that the higher response is more beneficial, a one-sided hypothesis
testing format can be constructed as

$$H_0: \{\mu_A \ge \mu_{AB} \text{ or } \mu_B \ge \mu_{AB}\} \quad \text{versus} \quad H_1: \{\mu_A < \mu_{AB} \text{ and } \mu_B < \mu_{AB}\}$$

Notice that the null hypothesis indicates that the AB combination therapy is
not better than at least one of the monotherapies, whereas the alternative
indicates that the AB combination is better than the A monotherapy and the B
monotherapy.

How do we do this?

The appropriate test statistic to use for this situation is called the min test. If
the data are normally distributed, construct two two-sample t statistics, one
comparing the AB combination therapy to the A monotherapy (call it tA) and
the other comparing the AB combination therapy to the B monotherapy (call
it tB).

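$$t_A = \frac{\bar{Y}_{AB} - \bar{Y}_A}{s\sqrt{\frac{1}{n_{AB}} + \frac{1}{n_A}}}, \qquad t_B = \frac{\bar{Y}_{AB} - \bar{Y}_B}{s\sqrt{\frac{1}{n_{AB}} + \frac{1}{n_B}}}$$

where

$$\bar{Y}_A = \frac{1}{n_A}\sum_{i=1}^{n_A} Y_{A,i}, \qquad \bar{Y}_B = \frac{1}{n_B}\sum_{i=1}^{n_B} Y_{B,i}, \qquad \bar{Y}_{AB} = \frac{1}{n_{AB}}\sum_{i=1}^{n_{AB}} Y_{AB,i}$$

and

$$s^2 = \frac{1}{n_A + n_B + n_{AB} - 3}\left(\sum_{i=1}^{n_A}(Y_{A,i} - \bar{Y}_A)^2 + \sum_{i=1}^{n_B}(Y_{B,i} - \bar{Y}_B)^2 + \sum_{i=1}^{n_{AB}}(Y_{AB,i} - \bar{Y}_{AB})^2\right)$$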

The null hypothesis is rejected at the $\alpha$ significance level in favor of the
alternative hypothesis when each of $t_A$ and $t_B$ is statistically significant at
the $\alpha$ significance level.

It is called the min test because in this situation it is comparable to rejecting
the null hypothesis if

$$\min(t_A, t_B) > t_{n_A + n_B + n_{AB} - 3,\ 1-\alpha}$$

As a simple example, suppose that:

$$\bar{Y}_A = 20, \quad \bar{Y}_B = 21, \quad \bar{Y}_{AB} = 24, \quad n_A = n_B = n_{AB} = 50, \quad s = 10$$

Then $t_A = 2$, $t_B = 1.5$, and $\min(t_A, t_B) = 1.5$, which is not greater than
$t_{147,\ 0.95} = 1.66$. Thus, the null hypothesis cannot be rejected at the 0.05
significance level, i.e., the AB combination has not been shown to be
significantly better than both the A monotherapy and the B monotherapy. The
result is close, but the statistical evidence falls short of significance.
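
The arithmetic of this example can be reproduced with a short sketch. It works
from the summary statistics given above rather than from raw data, and it takes
the critical value 1.66 from the text instead of computing a t quantile; the
function name min_test and its parameter names are hypothetical.

```python
from math import sqrt


def min_test(ybar_a, ybar_b, ybar_ab, n_a, n_b, n_ab, s, t_crit):
    """Min test from summary statistics.

    s is the pooled standard deviation across the three groups and
    t_crit is the upper critical value of the t distribution with
    n_a + n_b + n_ab - 3 degrees of freedom (supplied by the caller).
    """
    t_a = (ybar_ab - ybar_a) / (s * sqrt(1 / n_ab + 1 / n_a))
    t_b = (ybar_ab - ybar_b) / (s * sqrt(1 / n_ab + 1 / n_b))
    reject = min(t_a, t_b) > t_crit
    return t_a, t_b, reject


# Values from the worked example; 1.66 is the critical value quoted in the text.
t_a, t_b, reject = min_test(20, 21, 24, 50, 50, 50, s=10, t_crit=1.66)
print(round(t_a, 3), round(t_b, 3), reject)  # 2.0 1.5 False
```

Because the decision depends only on the smaller statistic, the combination
must beat both monotherapies, not just the weaker one.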

14.4 - Summary
In this lesson, among other things, we learned to:

identify the conditions that would allow a factorial design to be useful

recognize the difference between qualitative and quantitative
interactions

recognize the situation for which a min test is the appropriate analysis

Let's put what we have learned to use by completing the following homework
assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in
ANGEL.

Lesson 15: Crossover Designs


Introduction

A crossover design is a repeated measurements design such that each
experimental unit (patient) receives different treatments during the different
time periods, i.e., the patients cross over from one treatment to another during
the course of the trial. This is in contrast to a parallel design in which patients
are randomized to a treatment and remain on that treatment throughout the
duration of the trial.

The reason to consider a crossover design when planning a clinical trial is that
it could yield a more efficient comparison of treatments than a parallel design,
i.e., fewer patients might be required in the crossover design in order to attain
the same level of statistical power or precision as a parallel design. (This will
become more evident later in this lesson...) Intuitively, this seems reasonable
because each patient serves as his/her own matched control: every patient
receives both treatment A and treatment B, and a comparison is made of the
subject's response on A vs. B. Crossover designs are popular in medicine,
agriculture, manufacturing, education, and many other disciplines.

Although the concept of patients serving as their own controls is very
appealing to biomedical investigators, crossover designs are not preferred
routinely because of the problems that are inherent with this design. In
medical clinical trials the disease should be chronic and stable, and
the treatments should not result in total cures but only alleviate the disease
condition. If treatment A cures the patient during the first period, then
treatment B will not have the opportunity to demonstrate its effectiveness
when the patient crosses over to treatment B in the second period. Therefore
this type of design works only for those conditions that are chronic, such as
asthma, where there is no cure and the treatments attempt to improve quality
of life.

Crossover designs are the designs of choice for bioequivalence trials. The
objective of a bioequivalence trial is to determine whether test and reference
pharmaceutical formulations yield equivalent blood concentration levels. In
these types of trials, we are not interested in whether there is a cure; the aim
is to demonstrate that a new formulation (for instance, a new generic drug)
results in the same concentration in the bloodstream. Thus, it is highly
desirable to administer both formulations to each subject, which translates into
a crossover design.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

Distinguish between situations where a crossover design would or
would not be advantageous.

Use the following terms appropriately: first-order carryover, sequence,
period, washout, aliased effect.

State why an adequate washout period is essential between periods of
a crossover study in terms of aliased effects.

Evaluate a crossover design as to its uniformity and balance and state
the implications of these characteristics.

Understand and modify SAS programs for analysis of data from 2 × 2
crossover trials with continuous or binary data.

Provide an approach to analysis of event time data from a crossover
study.

Distinguish between population bioequivalence, average bioequivalence
and individual bioequivalence.

Relate the different types of bioequivalence to prescribability and
switchability.

Reference
Piantadosi Steven. (2005) Crossover Designs. In: Piantadosi Steven. Clinical
Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and
Sons, Inc.

15.1 - Overview of the Crossover Designs
The order of treatment administration in a crossover experiment is called a
sequence and the time of a treatment administration is called a period.
Typically, the treatments are designated with capital letters, such as A, B, etc.

The sequences should be determined a priori and the experimental units are
randomized to sequences. The most popular crossover design is the
2-sequence, 2-period, 2-treatment crossover design, with sequences AB and
BA, sometimes called the 2 × 2 crossover design.

In this particular design, experimental units that are randomized to the AB
sequence receive treatment A in the first period and treatment B in the second
period, whereas experimental units that are randomized to the BA sequence
receive treatment B in the first period and treatment A in the second period.

We express this particular design as AB|BA or diagram it as:

[Design 1]      Period 1   Period 2
Sequence AB     A          B
Sequence BA     B          A

Examples of 3-period, 2-treatment crossover designs are:

[Design 2]      Period 1   Period 2   Period 3
Sequence ABB    A          B          B
Sequence BAA    B          A          A

and

[Design 3]      Period 1   Period 2   Period 3
Sequence AAB    A          A          B
Sequence ABA    A          B          A
Sequence BAA    B          A          A

Examples of 3-period, 3-treatment crossover designs are

[Design 4]      Period 1   Period 2   Period 3
Sequence ABC    A          B          C
Sequence BCA    B          C          A
Sequence CAB    C          A          B

and

[Design 5]      Period 1   Period 2   Period 3
Sequence ABC    A          B          C
Sequence BCA    B          C          A
Sequence CAB    C          A          B
Sequence ACB    A          C          B
Sequence BAC    B          A          C
Sequence CBA    C          B          A

Some designs even incorporate non-crossover sequences such as Balaam's
design:

[Design 6]      Period 1   Period 2
Sequence AB     A          B
Sequence BA     B          A
Sequence AA     A          A
Sequence BB     B          B

Balaam's design is unusual, with elements of both parallel and crossover
design. There are advantages and disadvantages to all of these designs; we
will discuss some and the implications for statistical analysis as we continue
through this lesson.

15.2 - Disadvantages
The main disadvantage of a crossover design is that carryover effects may be
aliased (confounded) with direct treatment effects, in the sense that these
effects cannot be estimated separately. You think you are estimating the effect
of treatment A but there is also a bias from the previous treatment to account
for. Significant carryover effects can bias the interpretation of data analysis, so
an investigator should proceed cautiously whenever he/she is considering the
implementation of a crossover design.

A carryover effect is defined as the effect of the treatment from the previous
time period on the response at the current time period. In other words, if a
patient receives treatment A during the first period and treatment B during the
second period, then measurements taken during the second period could be a
result of the direct effect of treatment B administered during the second
period, and/or the carryover or residual effect of treatment A administered
during the first period. These carryover effects yield statistical bias.

What can we do about this carryover effect?

The incorporation of lengthy washout periods in the experimental design can
diminish the impact of carryover effects. A washout period is defined as the
time between treatment periods. Instead of stopping the first treatment and
immediately starting the new one, there is a period of time during which the
treatment from the first period is washed out of the patient's system.

The rationale for this is that the previously administered treatment is washed
out of the patient and, therefore, it cannot affect the measurements taken
during the current period. This may be true, but it is possible that the
previously administered treatment may have altered the patient in some
manner, so that the patient will react differently to any treatment administered
from that time onward. An example is when a pharmaceutical treatment
causes permanent liver damage so that the patients metabolize future drugs
differently. Another example occurs if the treatments are different types of
educational tests. Then subjects may be affected permanently by what they
learned during the first period.

How long should the washout period be?

In a trial involving pharmaceutical products, the length of the washout period
usually is determined as some multiple of the half-life of the pharmaceutical
product within the population of interest. For example, an investigator might
implement a washout period equivalent to 5 (or more) times the length of the
half-life of the drug concentration in the blood. The figure below depicts the
half-life of a hypothetical drug.
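
To see why a multiple such as 5 half-lives is a common choice, the sketch
below computes the fraction of drug remaining after a given number of
half-lives. It assumes simple exponential (first-order) elimination, which the
lesson does not state explicitly, and the 12-hour half-life is a made-up value;
the function names are hypothetical.

```python
def fraction_remaining(n_half_lives):
    """Fraction of the initial concentration left after n half-lives,
    assuming the concentration halves once per half-life."""
    return 0.5 ** n_half_lives


def washout_length(half_life_hours, n_half_lives=5):
    """Washout duration as a multiple of the half-life."""
    return n_half_lives * half_life_hours


for n in range(1, 8):
    print(n, "half-lives:", f"{100 * fraction_remaining(n):.2f}% remaining")

# Hypothetical drug with a 12-hour half-life.
print(washout_length(12), "hours of washout for 5 half-lives")
```

Under this assumption about 3% of the drug remains after 5 half-lives. As the
text notes, a drug-free patient is not necessarily free of carryover, because
the earlier treatment may have changed the patient in a lasting way.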

Actually, it is not the presence of carryover effects per se that leads to aliasing
with direct treatment effects in the AB|BA crossover, but rather the presence of
differential carryover effects, i.e., the carryover effect due to treatment A differs
from the carryover effect due to treatment B. If the carryover effects for A and
B are equivalent in the AB|BA crossover design, then this common carryover
effect is not aliased with the treatment difference. So, for crossover designs,
when the carryover effects are different from one another, this presents us
with a significant problem.

In the example of the educational tests, differential carryover effects could
occur if test A leads to more learning than test B. Another situation where
differential carryover effects may occur is in clinical trials where an active drug
(A) is compared to placebo (B) and the washout period is of inadequate
length. The patients in the AB sequence might experience a strong A
carryover during the second period, whereas the patients in the BA sequence
might experience a weak B carryover during the second period.

The recommendation for crossover designs is to avoid the problems caused
by differential carryover effects at all costs by employing lengthy washout
periods and/or designs where treatment and carryover are not aliased or
confounded with each other. It is always much more prudent to address a
problem a priori by using a proper design rather than a posteriori by applying
a statistical analysis that may require unreasonable assumptions and/or
perform unsatisfactorily. You will see this later on in this lesson...

For example, one approach for the statistical analysis of the 2 × 2 crossover is
to conduct a preliminary test for differential carryover effects. If this is
significant, then only the data from the first period are analyzed because the
first period is free of carryover effects. Essentially you would be throwing out
half of your data!

If the preliminary test for differential carryover is not significant, then the data
from both periods are analyzed in the usual manner. Recent work, however,
has revealed that this 2-stage analysis performs poorly because the
unconditional Type I error rate operates at a much higher level than desired.
We won't go into the specific details here, but part of the reason for this is that
the test for differential carryover and the test for treatment differences in the
first period are highly correlated and do not act independently.

Even worse, this two-stage approach could lead to losing one-half of the data.
If differential carryover effects are of concern, then a better approach would be
to use a study design that can account for them.

Before developing a general statistical model and investigating its
implications, we require more definitions.

15.3 - Definitions with a Crossover Design
First-order and Higher-order Carryover Effects

Within time period j, j = 2, ... , p, it is possible that there are carryover effects
from treatments administered during periods 1, ... , j - 1. Usually in period j we
only consider first-order carryover effects (from period j - 1) because:

1. if first-order carryover effects are negligible, then higher-order carryover
effects usually are negligible;

2. the designs needed for eliminating the aliasing between higher-order
carryover effects and treatment effects are very cumbersome and not
practical. Therefore, we usually assume that these higher-order
carryover effects are negligible.

In actuality, the length of the washout periods between treatment
administrations may be the determining factor as to whether higher-order
carryover effects should be considered. We focus on designs for dealing with
first-order carryover effects, but the development can be generalized if
higher-order carryover effects need to be considered.

Uniformity

A crossover design is labeled as:

1. uniform within sequences if each treatment appears the same
number of times within each sequence, and

2. uniform within periods if each treatment appears the same number of
times within each period.

For example, AB/BA is uniform within sequences and within periods (each
sequence and each period has 1 A and 1 B), while ABA/BAB is uniform within
periods but is not uniform within sequences because the sequences differ in
the numbers of A and B.

If a design is uniform within sequences and uniform within periods, then it is
said to be uniform. If the design is uniform within periods, you will be able to
remove the period effects. If the design is uniform within sequences, you
will also be able to remove the sequence effects. An example of a uniform
crossover is ABC/BCA/CAB.
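
The two uniformity checks can be sketched as follows. A design is represented
as a list of sequence strings, and the definition is read as "within any one
sequence (or period), every treatment has the same count"; that reading, the
representation, and the function names are assumptions made for this
illustration.

```python
from collections import Counter


def _all_treatments_equal(cells, treatments):
    """True if every treatment occurs equally often among the given cells."""
    counts = Counter(cells)
    return len({counts[t] for t in treatments}) == 1


def uniform_within_sequences(design):
    treatments = sorted(set("".join(design)))
    return all(_all_treatments_equal(seq, treatments) for seq in design)


def uniform_within_periods(design):
    treatments = sorted(set("".join(design)))
    # zip(*design) groups the cells of each period (column) together.
    return all(_all_treatments_equal(period, treatments)
               for period in zip(*design))


for design in (["AB", "BA"], ["ABA", "BAB"], ["ABC", "BCA", "CAB"]):
    print("/".join(design),
          "within sequences:", uniform_within_sequences(design),
          "within periods:", uniform_within_periods(design))
```

Run on the three designs named in the text, the checks agree with the
statements above: AB/BA and ABC/BCA/CAB are uniform, while ABA/BAB is uniform
within periods only.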

Latin Squares

Latin squares historically have provided the foundation for r-period,
r-treatment crossover designs because they yield uniform crossover designs in
that each treatment occurs only once within each sequence and once within
each period. As will be demonstrated later, Latin squares also serve as
building blocks for other types of crossover designs. Latin squares for
4-period, 4-treatment crossover designs are:

[Design 7]       Period 1   Period 2   Period 3   Period 4
Sequence ABCD    A          B          C          D
Sequence BCDA    B          C          D          A
Sequence CDAB    C          D          A          B
Sequence DABC    D          A          B          C

and

[Design 8]       Period 1   Period 2   Period 3   Period 4
Sequence ABCD    A          B          C          D
Sequence BDAC    B          D          A          C
Sequence CADB    C          A          D          B
Sequence DCBA    D          C          B          A

Latin squares are uniform crossover designs, uniform both within periods and
within sequences. Although with 4 periods and 4 treatments there are
4! = (4)(3)(2)(1) = 24 possible sequences from which to choose, the Latin square
only requires 4 sequences.

Balanced Designs

The Latin square in [Design 8] has an additional property that the Latin square
in [Design 7] does not have. Each treatment precedes every other treatment
the same number of times (once). For example, how many times is treatment
A followed by treatment B? Only once. How many times is treatment B
followed by any other given treatment? Only once. This is an advantageous
property of [Design 8] that [Design 7] lacks. When this property holds, as in
[Design 8], the crossover design is said to be balanced with respect to
first-order carryover effects.

Think About It!


Look back through each of the designs that we have looked at thus far and
determine whether or not it is balanced with respect to first-order carryover
effects.

When r is an even number, only 1 Latin square is needed to achieve balance
in the r-period, r-treatment crossover. When r is an odd number, 2 Latin
squares are required. For example, the design in [Design 5] is a 6-sequence,
3-period, 3-treatment crossover design that is balanced with respect to
first-order carryover effects because each treatment precedes every other
treatment twice.
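
Balance with respect to first-order carryover can be checked by counting how
often each treatment immediately precedes each other treatment. The sketch
below represents a design as a list of sequence strings; the function names are
hypothetical, and requiring every ordered pair to occur at least once is an
assumption made to exclude degenerate cases.

```python
from collections import Counter
from itertools import permutations


def carryover_counts(design):
    """Count ordered pairs (previous treatment, current treatment)
    over consecutive periods of every sequence."""
    counts = Counter()
    for seq in design:
        for prev, cur in zip(seq, seq[1:]):
            counts[(prev, cur)] += 1
    return counts


def is_balanced(design):
    """True if each treatment precedes every OTHER treatment equally often."""
    treatments = sorted(set("".join(design)))
    counts = carryover_counts(design)
    pair_counts = {counts[pair] for pair in permutations(treatments, 2)}
    return len(pair_counts) == 1 and 0 not in pair_counts


design_5 = ["ABC", "BCA", "CAB", "ACB", "BAC", "CBA"]
design_7 = ["ABCD", "BCDA", "CDAB", "DABC"]
design_8 = ["ABCD", "BDAC", "CADB", "DCBA"]

print(is_balanced(design_5), carryover_counts(design_5)[("A", "B")])
print(is_balanced(design_7))
print(is_balanced(design_8))
```

The same function can be applied to each of the earlier designs to answer the
"Think About It!" question.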

Strongly Balanced Designs

A crossover design is said to be strongly balanced with respect to first-order
carryover effects if each treatment precedes every other treatment, including
itself, the same number of times. A strongly balanced design can be
constructed by repeating the last period in a balanced design.

Here is an example:

[Design 9]        Period 1   Period 2   Period 3   Period 4   Period 5
Sequence ABCDD    A          B          C          D          D
Sequence BDACC    B          D          A          C          C
Sequence CADBB    C          A          D          B          B
Sequence DCBAA    D          C          B          A          A

This is a 4-sequence, 5-period, 4-treatment crossover design that is strongly
balanced with respect to first-order carryover effects because each treatment
precedes every other treatment, including itself, once. Obviously, the
uniformity of the Latin square design disappears because the design in
[Design 9] is no longer uniform within sequences.
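
The construction described above, repeating the last period of a balanced
design, can be sketched and verified as follows. The function names are
hypothetical, and the sequences are written as strings for convenience.

```python
from collections import Counter
from itertools import product


def repeat_last_period(design):
    """Append a copy of each sequence's final treatment as an extra period."""
    return [seq + seq[-1] for seq in design]


def is_strongly_balanced(design):
    """True if each treatment precedes every treatment, INCLUDING itself,
    the same (nonzero) number of times."""
    treatments = sorted(set("".join(design)))
    counts = Counter(pair for seq in design for pair in zip(seq, seq[1:]))
    pair_counts = {counts[pair] for pair in product(treatments, repeat=2)}
    return len(pair_counts) == 1 and 0 not in pair_counts


design_8 = ["ABCD", "BDAC", "CADB", "DCBA"]   # balanced Latin square
design_9 = repeat_last_period(design_8)

print(design_9)
print(is_strongly_balanced(design_8))  # no treatment follows itself
print(is_strongly_balanced(design_9))
```

The extra period supplies exactly the missing pairs in which a treatment
follows itself, at the cost of uniformity within sequences.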

Uniform and Strongly Balanced Design

Latin squares yield uniform crossover designs, but strongly balanced designs
constructed by replicating the last period of a balanced design are not uniform
crossover designs. The following 4-sequence, 4-period, 2-treatment crossover
design is an example of a strongly balanced and uniform design.

[Design 10]      Period 1   Period 2   Period 3   Period 4
Sequence ABBA    A          B          B          A
Sequence BAAB    B          A          A          B
Sequence AABB    A          A          B          B
Sequence BBAA    B          B          A          A

15.4 - Statistical Bias
Why are these properties important in statistical analysis?

We now investigate statistical bias issues. In other words, does a particular
crossover design have any nuisance effects, such as sequence, period, or
first-order carryover effects, aliased with direct treatment effects? We consider
first-order carryover effects only. If the design incorporates washout periods of
inadequate length, then treatment effects could be aliased with higher-order
carryover effects as well, but let us assume the washout period was adequate
for eliminating carryover beyond 1 treatment period.

The approach is very simple in that the expected value of each cell in the
crossover design is expressed in terms of a direct treatment effect and the
assumed nuisance effects. Then these expected values are averaged and/or
differenced to construct the desired effects.

For example, in the 2 × 2 crossover design in [Design 1], if we include
nuisance effects for sequence, period, and first-order carryover, then the model
for this would look like:

[Design 11]     Period 1                Period 2
Sequence AB     $\mu_A + \nu + \rho$    $\mu_B + \nu - \rho + \lambda_A$
Sequence BA     $\mu_B - \nu + \rho$    $\mu_A - \nu - \rho + \lambda_B$

where $\mu_A$ and $\mu_B$ represent population means for the direct effects of
treatments A and B, respectively, $\nu$ represents a sequence effect, $\rho$
represents a period effect, and $\lambda_A$ and $\lambda_B$ represent carryover effects of
treatments A and B, respectively.

A natural choice of an estimate of $\mu_A$ (or $\mu_B$) is simply the average over all
cells where treatment A (or B) is assigned: [12]

$$\hat{\mu}_A = \frac{1}{2}\left(\bar{Y}_{AB,1} + \bar{Y}_{BA,2}\right) \quad \text{and} \quad \hat{\mu}_B = \frac{1}{2}\left(\bar{Y}_{AB,2} + \bar{Y}_{BA,1}\right)$$

Will this give us a good estimate of the treatment means? Not quite...

The mathematical expectations of these estimates are as follows: [13]

$$E(\hat{\mu}_A) = \frac{1}{2}(\mu_A + \nu + \rho + \mu_A - \nu - \rho + \lambda_B) = \mu_A + \frac{1}{2}\lambda_B$$

$$E(\hat{\mu}_B) = \frac{1}{2}(\mu_B - \nu + \rho + \mu_B + \nu - \rho + \lambda_A) = \mu_B + \frac{1}{2}\lambda_A$$

$$E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \frac{1}{2}(\lambda_A - \lambda_B)$$

From [13] it is observed that the direct treatment effects and the
treatment difference are not aliased with sequence or period effects, but are
aliased with the carryover effects.

The treatment difference, however, is not aliased with carryover effects when
the carryover effects are equal, i.e., $\lambda_A = \lambda_B$. The results in [13] are due to the
fact that the AB|BA crossover design is uniform and balanced with respect to
first-order carryover effects. Any crossover design which is uniform and
balanced with respect to first-order carryover effects, such as the designs in
[Design 5] and [Design 8], also exhibits these results.
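
The aliasing in the AB|BA design can be checked numerically by plugging
arbitrary values into the expected cell means and averaging the cells for each
treatment. This is a sketch with made-up parameter values; the function and
argument names (nu for the sequence effect, rho for the period effect, lam_a
and lam_b for the carryover effects of A and B) are hypothetical.

```python
def expected_cells_ab_ba(mu_a, mu_b, nu, rho, lam_a, lam_b):
    """Expected cell means of the AB|BA crossover, keyed by (sequence, period)."""
    return {
        ("AB", 1): mu_a + nu + rho,
        ("AB", 2): mu_b + nu - rho + lam_a,   # B follows A: carryover of A
        ("BA", 1): mu_b - nu + rho,
        ("BA", 2): mu_a - nu - rho + lam_b,   # A follows B: carryover of B
    }


def expected_treatment_difference(cells):
    """Expectation of (estimate of mu_A) minus (estimate of mu_B)."""
    mu_a_hat = 0.5 * (cells[("AB", 1)] + cells[("BA", 2)])
    mu_b_hat = 0.5 * (cells[("AB", 2)] + cells[("BA", 1)])
    return mu_a_hat - mu_b_hat


# Made-up values: true difference is 3, carryover effects are unequal.
unequal = expected_cells_ab_ba(10, 7, nu=3, rho=2, lam_a=4, lam_b=1)
print(expected_treatment_difference(unequal))   # 3 - 0.5 * (4 - 1) = 1.5

# Equal carryover effects: the bias in the difference disappears.
equal = expected_cells_ab_ba(10, 7, nu=3, rho=2, lam_a=2, lam_b=2)
print(expected_treatment_difference(equal))     # 3.0
```

Changing nu or rho leaves both printed values unchanged, which is the
numerical counterpart of the statement that sequence and period effects are not
aliased with the treatment difference in this design.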

Example

Consider the ABB|BAA design, which is uniform within periods, not uniform
within sequences, and is strongly balanced.

                Period 1                  Period 2                              Period 3
Sequence ABB    $\mu_A + \nu + \rho_1$    $\mu_B + \nu + \rho_2 + \lambda_A$    $\mu_B + \nu - \rho_1 - \rho_2 + \lambda_B$
Sequence BAA    $\mu_B - \nu + \rho_1$    $\mu_A - \nu + \rho_2 + \lambda_B$    $\mu_A - \nu - \rho_1 - \rho_2 + \lambda_A$

A natural choice of an estimate of $\mu_A$ (or $\mu_B$) is simply the average over all
cells where treatment A (or B) is assigned: [15]

$$\hat{\mu}_A = \frac{1}{3}\left(\bar{Y}_{ABB,1} + \bar{Y}_{BAA,2} + \bar{Y}_{BAA,3}\right) \quad \text{and} \quad \hat{\mu}_B = \frac{1}{3}\left(\bar{Y}_{ABB,2} + \bar{Y}_{ABB,3} + \bar{Y}_{BAA,1}\right)$$

The mathematical expectations of these estimates are solved to be: [16]

$$E(\hat{\mu}_A) = \mu_A + \frac{1}{3}(\lambda_A + \lambda_B - \nu)$$

$$E(\hat{\mu}_B) = \mu_B + \frac{1}{3}(\lambda_A + \lambda_B + \nu)$$

$$E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \frac{2}{3}\nu$$

From [16], the direct treatment effects are aliased with the sequence effect
and the carryover effects, whereas the treatment difference only is aliased
with the sequence effect. The results in [16] are due to the ABB|BAA
crossover design being uniform within periods and strongly balanced with
respect to first-order carryover effects.

15.5 - Higher-order Carryover Effects
The lack of aliasing between the treatment difference and the first-order
carryover effects does not guarantee that the treatment difference and
higher-order carryover effects also will not be aliased or confounded. For
example, let $\lambda_{2A}$ and $\lambda_{2B}$ denote the second-order carryover effects of
treatments A and B, respectively, for the design in [Design 2]. (A second-order
carryover effect is the carryover from the treatment administered two periods
earlier.)

                Period 1                  Period 2                              Period 3
Sequence ABB    $\mu_A + \nu + \rho_1$    $\mu_B + \nu + \rho_2 + \lambda_A$    $\mu_B + \nu - \rho_1 - \rho_2 + \lambda_B + \lambda_{2A}$
Sequence BAA    $\mu_B - \nu + \rho_1$    $\mu_A - \nu + \rho_2 + \lambda_B$    $\mu_A - \nu - \rho_1 - \rho_2 + \lambda_A + \lambda_{2B}$

[18]
$$E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \frac{2}{3}\nu - \frac{1}{3}(\lambda_{2A} - \lambda_{2B})$$

The expectation of the treatment mean difference indicates that it is aliased
with second-order carryover effects.
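
The same kind of numerical check used for a two-period design extends to
ABB|BAA. The sketch below builds the six expected cell means, optionally adds
second-order carryover in period 3, and returns the expected treatment
difference. All parameter values are made up, and the names (nu, rho1, rho2,
lam_a, lam_b, lam2_a, lam2_b) are hypothetical stand-ins for the sequence,
period, first-order, and second-order carryover effects.

```python
def expected_difference_abb_baa(mu_a, mu_b, nu, rho1, rho2,
                                lam_a, lam_b, lam2_a=0.0, lam2_b=0.0):
    """Expected value of (estimate of mu_A) - (estimate of mu_B) in ABB|BAA."""
    rho3 = -rho1 - rho2  # period effects are constrained to sum to zero
    abb = [mu_a + nu + rho1,
           mu_b + nu + rho2 + lam_a,
           mu_b + nu + rho3 + lam_b + lam2_a]
    baa = [mu_b - nu + rho1,
           mu_a - nu + rho2 + lam_b,
           mu_a - nu + rho3 + lam_a + lam2_b]
    mu_a_hat = (abb[0] + baa[1] + baa[2]) / 3   # cells where A is given
    mu_b_hat = (abb[1] + abb[2] + baa[0]) / 3   # cells where B is given
    return mu_a_hat - mu_b_hat


# True difference is 3 in every call below.
print(expected_difference_abb_baa(10, 7, nu=0, rho1=2, rho2=-1, lam_a=4, lam_b=1))
print(expected_difference_abb_baa(10, 7, nu=3, rho1=2, rho2=-1, lam_a=4, lam_b=1))
print(expected_difference_abb_baa(10, 7, nu=0, rho1=2, rho2=-1, lam_a=4, lam_b=1,
                                  lam2_a=6, lam2_b=0))
```

The first call shows that unequal first-order carryover does not bias the
difference, the second that a sequence effect shifts it by two thirds of nu,
and the third that unequal second-order carryover shifts it by one third of
their difference.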

Summary of Impacts of Design Types

The ensuing remarks summarize the impact of various design features on the
aliasing of direct treatment and nuisance effects.

1. If the crossover design is uniform within sequences, then sequence
effects are not aliased with treatment differences.

2. If the crossover design is uniform within periods, then period effects are
not aliased with treatment differences.

3. If the crossover design is balanced with respect to first-order carryover
effects, then carryover effects are aliased with treatment differences
unless the carryover effects are equal, in which case they are not aliased
with treatment differences.

4. If the crossover design is strongly balanced with respect to first-order
carryover effects, then carryover effects are not aliased with treatment
differences.

Complex Carryover

The type of carryover effects we modeled here is called simple carryover
because it is assumed that the treatment in the current period does not
interact with the carryover from the previous period. Complex carryover
refers to the situation in which such an interaction is modeled. For example,
suppose we have a crossover design and want to model carryover effects.
With simple carryover in a two-treatment design, there are two carryover
parameters, namely, $\lambda_A$ and $\lambda_B$.

With complex carryover, however, there are four carryover parameters,
namely, $\lambda_{AB}$, $\lambda_{BA}$, $\lambda_{AA}$ and $\lambda_{BB}$, where $\lambda_{AB}$ represents the carryover effect of
treatment A into a period in which treatment B is administered, $\lambda_{BA}$ represents
the carryover effect of treatment B into a period in which treatment A is
administered, etc. As you might imagine, this will certainly complicate things!

15.6 - Implementation Overview
From the preceding results, an ideal crossover design appears to be one that
is uniform and strongly balanced.

There are situations, however, where it may be reasonable to assume that
some of the nuisance parameters are null, so that resorting to a uniform and
strongly balanced design is not necessary (although it provides a safety net if
the assumptions do not hold).

For example, some researchers argue that sequence effects should be null or
negligible because they represent randomization effects. Another example
occurs in bioequivalence trials where some researchers argue that carryover
effects should be null. This is because blood concentration levels of the drug
or active ingredient are monitored and any residual drug administered from an
earlier period would be detected.

The message to be emphasized is that every proposed crossover trial should
be examined to determine which, if any, nuisance effects may play a role.
Once this determination is made, then an appropriate crossover design should
be employed that avoids aliasing of those nuisance effects with treatment
effects. This is a decision that the researchers should be prepared to address.

For example, an investigator wants to conduct a two-period crossover design,
but is concerned that he will have unequal carryover effects, so he is reluctant
to invoke the 2 × 2 crossover design. If the investigator is not as concerned
about sequence effects, then Balaam's design in [Design 8] may be
appropriate. Balaam's design is uniform within periods but not within
sequences, and it is strongly balanced. Therefore, Balaam's design will not be
adversely affected in the presence of unequal carryover effects.

Some researchers consider randomization in a crossover design to be a minor
issue because a patient eventually undergoes all of the treatments (this is true
in most crossover designs). Obviously, randomization is very important if the
crossover design is not uniform within sequences because the underlying
assumption is that the sequence effect is negligible.

Randomization is important in crossover trials even if the design is uniform
within sequences because biases could result from investigators assigning
patients to treatment sequences.

At a minimum, it always is recommended to invoke a design that is uniform
within periods because period effects are common. Period effects can be due
to:

1. increased patient comfort in later periods with trial processes;

2. increased patient knowledge in later periods;

3. improvement in skill and technique of those researchers taking the
measurements.

The following is a listing of various crossover designs with some, all, or none
of the properties.
It would be a good idea to go through each of these designs and diagram out
what these would look like, the degree to which they are uniform and/or
balanced. Make sure you see how these principles come into play!

15.7 - Statistical Precision


Now that we have examined statistical biases that can arise in crossover
designs, we next examine statistical precision.

During the design phase of a trial, the question may arise as to which
crossover design provides the best precision. For our purposes, we label one
design as more precise than another if it yields a smaller variance for the
estimated treatment mean difference.

Although a comparison of treatment means may be the primary interest of the
experimenter, there may be other circumstances that affect the choice of an
appropriate design. For example, later we will compare designs with respect
to which designs are best for estimating and comparing variances.

At the moment, however, we focus on differences in estimated treatment
means in two-period, two-treatment designs.

The two-period, two-treatment designs we consider here are the 2 × 2
crossover design AB|BA in [Design 1], Balaam's design AB|BA|AA|BB in
[Design 6], and the two-period parallel design AA|BB.

In order for the resources to be equitable across designs, we assume that the
total sample size, n, is a positive integer divisible by 4. Then:

1. n/2 patients will be randomized to each sequence in the AB|BA design,

2. n/2 patients will be randomized to each sequence in the AA|BB design,
and

3. n/4 patients will be randomized to each sequence in the AB|BA|AA|BB
design.

Because the designs we are considering involve repeated measurements on
patients, the statistical modeling must account for between-patient variability
and within-patient variability.

Between-patient variability accounts for the dispersion in measurements from
one patient to another. Within-patient variability accounts for the dispersion in
measurements from one time point to another within a patient. Within-patient
variability tends to be smaller than between-patient variability.

The variance components we model are as follows:

1. $W_{AA}$ = between-patient variance for treatment A;

2. $W_{BB}$ = between-patient variance for treatment B;

3. $W_{AB}$ = between-patient covariance between treatments A and B;

4. $\sigma_{AA}$ = within-patient variance for treatment A;

5. $\sigma_{BB}$ = within-patient variance for treatment B.

The following table provides expressions for the variance of the estimated
treatment mean difference for each of the two-period, two-treatment designs:

Design       Variance

Crossover    $\sigma^2/n = \{1.0(W_{AA} + W_{BB}) - 2.0(W_{AB}) + (\sigma_{AA} + \sigma_{BB})\}/n$

Balaam       $\sigma^2/n = \{1.5(W_{AA} + W_{BB}) - 1.0(W_{AB}) + (\sigma_{AA} + \sigma_{BB})\}/n$

Parallel     $\sigma^2/n = \{2.0(W_{AA} + W_{BB}) - 0.0(W_{AB}) + (\sigma_{AA} + \sigma_{BB})\}/n$

Under most circumstances, $W_{AB}$ will be positive, so we assume this is so for
the sake of comparison. Not surprisingly, the 2 × 2 crossover design yields the
smallest variance for the estimated treatment mean difference, followed by
Balaam's design and then the parallel design.

The investigator needs to consider other design issues, however, prior to
selecting the 2 × 2 crossover. In particular, if there is any concern over the
possibility of differential first-order carryover effects, then the 2 × 2 crossover
is not recommended. In this situation the parallel design would be a better
choice than the 2 × 2 crossover design. Balaam's design is strongly balanced
so that the treatment difference is not aliased with differential first-order
carryover effects, so it also is a better choice than the 2 × 2 crossover design.

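With respect to a sample size calculation, the total sample size, n, required for
a two-sided, $\alpha$ significance level test with $100(1 - \beta)\%$ statistical power and
effect size $\mu_A - \mu_B$ is:

$n = (z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / (\mu_A - \mu_B)^2$

where $\sigma^2$ is the design-specific quantity in braces from the variance table
above.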

Suppose that an investigator wants to conduct a two-period trial but is not
sure whether to invoke a parallel design, a crossover design, or Balaam's
design. He wants to use a 0.05 significance level test with 90% statistical
power for detecting the effect size of $\mu_A - \mu_B = 10$. From published results, the
investigator assumes that:

$W_{AA} = W_{BB} = W_{AB} = 400$, and

$\sigma_{AA} = \sigma_{BB} = 100$

The sample sizes for the three different designs are as follows:

Parallel n = 190
Balaam n = 105
Crossover n = 21

The crossover design yields a much smaller sample size because the within-
patient variances are one-fourth that of the inter-patient variances (which is
not unusual).
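
The calculation behind these sample sizes can be sketched in a few lines of Python. This is only an illustration, not the course's SAS code: the function names are made up for this sketch, and it assumes the three rows of the variance table correspond to the crossover, Balaam, and parallel designs in that order (which is what the ordering remark in the text implies).

```python
from statistics import NormalDist

# Coefficients (c_between, c_cov) read off the variance table, so that
# sigma^2 = c_between*(W_AA + W_BB) - c_cov*W_AB + (s_AA + s_BB).
# The mapping of rows to designs is an assumption of this sketch.
DESIGN_COEFFS = {
    "crossover": (1.0, 2.0),
    "balaam": (1.5, 1.0),
    "parallel": (2.0, 0.0),
}


def design_sigma2(design, w_aa, w_bb, w_ab, s_aa, s_bb):
    """Design-specific sigma^2 for the estimated treatment mean difference."""
    c_between, c_cov = DESIGN_COEFFS[design]
    return c_between * (w_aa + w_bb) - c_cov * w_ab + (s_aa + s_bb)


def total_sample_size(sigma2, effect, alpha=0.05, power=0.90):
    """Unrounded n = (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / effect^2."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z_sum ** 2 * sigma2 / effect ** 2


# Inputs from the worked example: W's = 400, within-patient variances = 100,
# effect size 10, alpha = 0.05, power = 90%.
raw_n = {
    design: total_sample_size(design_sigma2(design, 400, 400, 400, 100, 100), 10)
    for design in DESIGN_COEFFS
}
for design, n in raw_n.items():
    print(f"{design}: unrounded n = {n:.1f}")
```

The unrounded values come out close to the figures quoted above; how the text rounds them to whole patients is not stated, so the sketch leaves them unrounded.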

Another issue in selecting a design is whether the experimenter wishes to
compare the within-patient variances $\sigma_{AA}$ and $\sigma_{BB}$.

For the 2 × 2 crossover design, the within-patient variances can be estimated
by imposing restrictions on the between-patient variances and covariances.
The resultant estimators of $\sigma_{AA}$ and $\sigma_{BB}$, however, may lack precision and be
unstable. Hence, the 2 × 2 crossover design is not recommended when
comparing $\sigma_{AA}$ and $\sigma_{BB}$ is an objective.

The parallel design provides optimal estimation of the within-unit variances
because it has n/2 patients who can provide data in estimating each of
$\sigma_{AA}$ and $\sigma_{BB}$, whereas Balaam's design has n/4 patients who can provide data
in estimating each of $\sigma_{AA}$ and $\sigma_{BB}$. Again, Balaam's design is a compromise
between the 2 × 2 crossover design and the parallel design.

15.8 - Analysis - Continuous Outcome


The statistical analysis of normally-distributed data from a 2 × 2 crossover
trial, under the assumption that the carryover effects are equal
($\lambda_A = \lambda_B = \lambda$), is relatively straightforward.

Remember the statistical model we assumed for continuous data from the
2 × 2 crossover trial:

[Design 11]    Period 1                Period 2

Sequence AB    $\mu_A + \nu + \rho$    $\mu_B + \nu - \rho + \lambda_A$

Sequence BA    $\mu_B - \nu + \rho$    $\mu_A - \nu - \rho + \lambda_B$

For a patient in the AB sequence, the Period 1 vs. Period 2 difference has
expectation $\mu_{AB} = \mu_A - \mu_B + 2\rho - \lambda$.

For a patient in the BA sequence, the Period 1 vs. Period 2 difference has
expectation $\mu_{BA} = \mu_B - \mu_A + 2\rho - \lambda$.

Therefore, we construct these differences for every patient and compare the
two sequences with respect to these differences using a two-sample t test or a
Wilcoxon rank sum test. Thus, we are testing:

$H_0 : \mu_{AB} - \mu_{BA} = 0$

The expression:

$\mu_{AB} - \mu_{BA} = 2(\mu_A - \mu_B)$

so testing $H_0 : \mu_{AB} - \mu_{BA} = 0$ is equivalent to testing:

$H_0 : \mu_A - \mu_B = 0$

To get a confidence interval for $\mu_A - \mu_B$, simply multiply each difference by
1/2 prior to constructing the confidence interval for the difference in population
means for two independent samples.
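
As a rough illustration of this procedure, the Python sketch below halves the contrast between the two sequences' mean period differences and computes the pooled two-sample t statistic. It is not the course's SAS program; the function name and the six data values are hypothetical, chosen only to show the mechanics.

```python
from math import sqrt
from statistics import mean, variance


def crossover_treatment_effect(diffs_ab, diffs_ba):
    """Estimate mu_A - mu_B from Period 1 minus Period 2 differences.

    Returns (estimate, standard error, t statistic, degrees of freedom)
    for the pooled two-sample t test comparing the two sequences.
    """
    n_ab, n_ba = len(diffs_ab), len(diffs_ba)
    df = n_ab + n_ba - 2
    # The contrast of sequence means estimates 2(mu_A - mu_B), hence the 1/2.
    estimate = 0.5 * (mean(diffs_ab) - mean(diffs_ba))
    pooled_var = (
        (n_ab - 1) * variance(diffs_ab) + (n_ba - 1) * variance(diffs_ba)
    ) / df
    std_err = 0.5 * sqrt(pooled_var * (1 / n_ab + 1 / n_ba))
    return estimate, std_err, estimate / std_err, df


# Hypothetical Period 1 minus Period 2 differences (not data from the text).
diffs_ab = [30.0, 50.0, 40.0]
diffs_ba = [-40.0, -60.0, -50.0]
estimate, std_err, t_stat, df = crossover_treatment_effect(diffs_ab, diffs_ba)
print(estimate, round(std_err, 4), round(t_stat, 2), df)
```

A confidence interval would then be the estimate plus or minus the appropriate t quantile (with the returned degrees of freedom) times the standard error; the quantile is left out here because the Python standard library does not provide it.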

SAS Example ( 16.1_-_2x2_crossover__contin.sas [1] )

This is an example of an analysis of the data from a 2 × 2 crossover trial. The
example is taken from Example 3.1 from Senn's book (Senn S. Cross-over
Trials in Clinical Research, Chichester, England: John Wiley & Sons, 1993).
The data set consists of 13 children enrolled in a trial to investigate the effects
of two bronchodilators, formoterol and salbutamol, in the treatment of asthma.
The outcome variable is peak expiratory flow rate (liters per minute) and was
measured eight hours after treatment. There was a one-day washout period
between treatment periods.

The estimated treatment mean difference was 46.6 L/min in favor of
formoterol (p = 0.0012) and the 95% confidence interval for the treatment
mean difference is (22.9, 70.3). The Wilcoxon rank sum test also indicated
a statistically significant difference between the treatments (p = 0.0276).

15.9 - Analysis - Binary Outcome


Suppose that the response from a crossover trial is binary and that there are
no period effects. Then the probabilities of response are:

                         Failure on B   Success on B   marginal probabilities

Failure on A             p00            p01            p0.

Success on A             p10            p11            p1.

marginal probabilities   p.0            p.1

The probability of success on treatment A is p1. and the probability of success
on treatment B is p.1. Testing the null hypothesis:

H0 : p1. - p.1 = 0

is the same as testing:

H0 : p1. - p.1 = (p10 + p11) - (p01 + p11) = p10 - p01 = 0

This indicates that only the patients who display a (1,0) or (0,1) response
contribute to the treatment comparison. For instance, if a patient failed on
both treatments, or was successful on both, there is no way to determine which
treatment is better. Therefore we will let:

               Failure on B   Success on B

Failure on A   n00            n01

Success on A   n10            n11

denote the frequency of responses from the study data instead of the
probabilities listed above.

McNemar's test for this situation is as follows. Given the number of patients
who displayed a treatment preference, n10 + n01, n10 follows a
binomial(p, n10 + n01) distribution and the null hypothesis reduces to testing:

H0 : p = 0.5

i.e., under the null hypothesis we would expect a 50-50 split between the
patients who were successful only on A and those who were successful only on
B. Only the cells with success on one treatment and failure on the other are
used. The data in the cells for success on both treatments or failure on both
treatments are ignored.

SAS Example ( 16.2_-_2x2_crossover__binary.sas [2] )

This is an example of an analysis of the data from a 2 × 2 crossover trial with
a binary outcome of failure/success. Fifty patients were randomized and the
following results were observed:

               Failure on B   Success on B

Failure on A   21             15

Success on A   7              7

Thus, 22 patients displayed a treatment preference, of which 7 preferred A
and 15 preferred B. McNemar's test, however, indicated that this was not
statistically significant (exact p = 0.1338).

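
The exact p-value can be checked with a short Python sketch of the binomial calculation described above. The function name is hypothetical, and the two-sided p-value is taken as twice the smaller tail (capped at 1), which is one common convention and an assumption here about how the SAS output was produced.

```python
from math import comb


def exact_mcnemar_p(n10, n01):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under H0 the count n10 is binomial(n10 + n01, 0.5); the p-value is
    twice the probability of the smaller tail, capped at 1.
    """
    n = n10 + n01
    k = min(n10, n01)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


# Discordant counts from the example: 7 preferred A, 15 preferred B.
p_value = exact_mcnemar_p(n10=7, n01=15)
print(round(p_value, 4))  # 0.1338
```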
A problem can arise when McNemar's test is applied to the binary
outcome from a 2 × 2 crossover trial if there are non-negligible period
effects. If that is the case, then the treatment comparison should account for
them. This is possible via logistic regression analysis.

The Rationale:

The probability of a 50-50 split between treatment A and treatment B
preferences under the null hypothesis is equivalent to the odds ratio for the
treatment A preference to the treatment B preference being 1.0. Because
logistic regression analysis models the natural logarithm of the odds, testing
whether there is a 50-50 split between treatment A preference and treatment
B preference is comparable to testing whether the intercept term is null in a
logistic regression analysis.

To account for the possible period effect in the 2 × 2 crossover trial, a term for
period can be included in the logistic regression analysis.

SAS Example ( 16.3_-_2x2_crossover__binary.sas [3] )


Use the same data set from SAS Example 16.2, only now partitioned by
the two sequences:

Sequence AB    Failure on B   Success on B

Failure on A   10             7

Success on A   3              5

Sequence BA    Failure on B   Success on B

Failure on A   11             8

Success on A   4              2

The logistic regression analysis yielded a nonsignificant result for the
treatment comparison (exact p = 0.2266). There is still no statistically
significant difference to report.

15.10 - Analysis - Time-to-Event Outcome

You don't often see a crossover design used in a time-to-event trial. If the
event is death, the patient would not be able to cross over to a second
treatment. Even when the event is treatment failure, this often implies that
patients must be watched closely and perhaps rescued with other medicines
when treatment failure occurs.

When it is implemented, a time-to-event outcome within the context of a 2 × 2
crossover trial actually can reduce to a binary outcome score of preference.
Suppose that in a clinical trial, time to treatment failure is determined for each
patient when receiving treatment A and treatment B.

If the time to treatment failure on A equals that on B, then the patient is
assigned a (0,0) score and displays no preference.

If the time to treatment failure on A is less than that on B, then the patient is
assigned a (0,1) score and prefers B.

If the time to treatment failure on B is less than that on A, then the patient is
assigned a (1,0) score and prefers A.

If the patient does not experience treatment failure on either treatment, then
the patient is assigned a (1,1) score and displays no preference.

Hence, we can use the procedures which we implemented with binary
outcomes.
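
The scoring rules above can be written as a small Python function. This is a sketch only: the function name is hypothetical, `None` is used to mean that no treatment failure was observed, and the handling of a patient who fails on only one treatment (treated as preferring the treatment without failure) is an assumption, since the text does not spell out that case.

```python
def preference_score(time_a, time_b):
    """Map times to treatment failure on A and B to a binary (A, B) score.

    A time of None means no treatment failure was observed on that
    treatment (a convention assumed for this sketch).
    """
    if time_a is None and time_b is None:
        return (1, 1)  # no failure on either treatment: no preference
    if time_a is None:
        return (1, 0)  # assumption: failure only on B counts as preferring A
    if time_b is None:
        return (0, 1)  # assumption: failure only on A counts as preferring B
    if time_a == time_b:
        return (0, 0)  # equal failure times: no preference
    return (0, 1) if time_a < time_b else (1, 0)


# Hypothetical failure times (in days) for four patients.
patients = [(12, 30), (45, 20), (15, 15), (None, None)]
scores = [preference_score(a, b) for a, b in patients]
print(scores)  # [(0, 1), (1, 0), (0, 0), (1, 1)]
```

The resulting (1,0) and (0,1) counts can then be passed to the same McNemar or logistic regression analyses used for binary outcomes.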

15.11 - Analysis - More Complex Designs


The analysis of continuous, binary, and time-to-event outcome data from a
design more complex than the 2 × 2 crossover is not as straightforward as
that for the 2 × 2 crossover design.

With respect to a continuous outcome, the analysis involves a mixed-effects
linear model (SAS PROC MIXED) to account for the repeated measurements
that yield period, sequence, and carryover effects and to model the various
sources of intra-patient and inter-patient variability.

With respect to a binary outcome, the analysis involves generalized estimating
equations (SAS PROC GENMOD) to account for the repeated measurements
that yield period, sequence, and carryover effects and to model the various
sources of intra-patient and inter-patient variability.

In either case, with a design more complex than the 2 × 2 crossover, extensive
modeling is required.

15.12 - Bioequivalence Trials


The objective of a bioequivalence trial is to determine whether test (T) and
reference (R) formulations of a pharmaceutical product are "equivalent" with
respect to blood concentration time profiles.

Bioequivalence trials are of interest in two basic situations:

1. Company A demonstrates the safety and efficacy of a drug formulation,
but wishes to market a more convenient formulation (e.g., an injection
vs. a time-release capsule). This situation is less common.

2. Company B wishes to market a drug formulation similar to the approved
formulation of Company A with an expired patent. Company B has to
prove that it can deliver the same amount of active drug into the
bloodstream as the approved formulation does.

Pharmaceutical scientists use crossover designs for such trials in order for
each trial participant to yield a profile for both formulations. The blood
concentration time profile is a multivariate response and is a surrogate
measure of therapeutic response. The pharmaceutical company does not
need to demonstrate the safety and efficacy of the drug because that already
has been established.

Are the reference and test blood concentration time profiles similar? The
test formulation could be toxic if it yields concentration levels higher than the
reference formulation. On the other hand, the test formulation could be
ineffective if it yields concentration levels lower than the reference formulation.
Typically, pharmaceutical scientists summarize the rate and extent of drug
absorption with summary measurements of the blood concentration time
profile, such as area under the curve (AUC), maximum concentration (CMAX),
etc. These summary measurements are subjected to statistical analysis (not
the profiles) and inferences are drawn as to whether or not the formulations
are bioequivalent.

There are numerous definitions for what is meant by bioequivalence:

1. population bioequivalence - the formulations are equivalent with respect
to their underlying probability distributions. You want to see that the
AUC or CMAX distributions are similar.

2. average bioequivalence - the formulations are equivalent with respect to
the means (medians) of their probability distributions.

3. individual bioequivalence - the formulations are equivalent for a large
proportion of individuals in the population, i.e., how well do the AUC
and CMAX values compare across patients?

Prescribability means that a patient is ready to embark on a treatment
regimen for the first time, so that either the reference or test formulations can
be chosen. Switchability means that a patient, who already has established a
regimen on either the reference or test formulation, can switch to the other
formulation without any noticeable change in efficacy and safety.

Prescribability requires that the test and reference formulations are population
bioequivalent, whereas switchability requires that the test and reference
formulations have individual bioequivalence.

Currently, the USFDA only requires pharmaceutical companies to establish
that the test and reference formulations are average bioequivalent. It is felt
that most consumers, however, assume bioequivalence refers to individual
bioequivalence, and that switching formulations does not lead to any health
problems.

The hypothesis testing problem for assessing average bioequivalence is
stated as:

$H_0 : \{\mu_T/\mu_R \le \Psi_1 \text{ or } \mu_T/\mu_R \ge \Psi_2\}$ vs. $H_1 : \{\Psi_1 < \mu_T/\mu_R < \Psi_2\}$

where $\mu_T$ and $\mu_R$ represent the population means for the test and reference
formulations, respectively, and $\Psi_1$ and $\Psi_2$ are chosen constants.

The FDA recommended values are $\Psi_1 = 0.80$ and $\Psi_2 = 1.25$ (i.e., the ratios
4/5 and 5/4) for responses such as AUC and CMAX, which typically follow
lognormal distributions.

Thus, a logarithmic transformation typically is applied to the summary
measure, the statistical analysis is performed for the crossover experiment,
and then the two one-sided testing approach or corresponding confidence
intervals are calculated for the purposes of investigating average
bioequivalence.

SAS Example ( 16.4_-_bioequivalence.sas [4] )

Test and reference formulations were studied in a bioequivalence trial that
used a 2 × 2 crossover design. There were 28 healthy volunteers (instead of
patients with disease) who were randomized (14 each to the TR and RT
sequences). AUC and CMAX were measured and transformed via the natural
logarithm.

The analysis yielded the following results:

                                   AUC               CMAX

Estimate of $\ln(\mu_R/\mu_T)$     0.0893            -0.104

Estimate of $\mu_R/\mu_T$          1.09              0.90

95% CI for $\ln(\mu_R/\mu_T)$      (-0.113, 0.294)   (-0.289, 0.080)

95% CI for $\mu_R/\mu_T$           (0.89, 1.34)      (0.75, 1.08)


Neither 95% confidence interval lies within (0.80, 1.25) specified by the
USFDA, therefore bioequivalence cannot be concluded in this example and
the USFDA would not allow this company to market their generic drug. Both
CMAX and AUC are used because they summarize the desired equivalence.
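
The conclusion can be checked by back-transforming the log-scale confidence limits from the table and comparing them with the (0.80, 1.25) limits. The Python sketch below does this; the function names are hypothetical and it simply takes the reported interval endpoints as given.

```python
from math import exp


def ratio_interval(log_ci):
    """Back-transform a confidence interval from the natural-log scale."""
    lower, upper = log_ci
    return exp(lower), exp(upper)


def within_limits(ratio_ci, lower_limit=0.80, upper_limit=1.25):
    """True only if the whole interval lies inside the equivalence limits."""
    lower, upper = ratio_ci
    return lower_limit < lower and upper < upper_limit


# Log-scale 95% confidence intervals reported in the table above.
log_cis = {"AUC": (-0.113, 0.294), "CMAX": (-0.289, 0.080)}
results = {}
for measure, log_ci in log_cis.items():
    ratio_ci = ratio_interval(log_ci)
    results[measure] = within_limits(ratio_ci)
    print(measure, tuple(round(x, 2) for x in ratio_ci), results[measure])
```

For AUC the upper limit exceeds 1.25, and for CMAX the lower limit falls below 0.80, so neither interval is contained in the equivalence range.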

15.13 - Summary
In this lesson, among other things, we learned how to:

Distinguish between situations where a crossover design would or
would not be advantageous.

Use the following terms appropriately: first-order carryover, sequence,
period, washout, aliased effect.

State why an adequate washout period is essential between periods of
a crossover study in terms of aliased effects.

Evaluate a crossover design as to its uniformity and balance and state
the implications of these characteristics.

Understand and modify SAS programs for analysis of data from 2 × 2
crossover trials with continuous or binary data.

Provide an approach to analysis of event time data from a crossover
study.

Distinguish between population bioequivalence, average bioequivalence
and individual bioequivalence.

Relate the different types of bioequivalence to prescribability and
switchability.

Let's put what we have learned to use by completing the following homework
assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in
ANGEL.
Lesson 16: Overviews and Meta-analysis
Introduction

An overview (also called a systematic review) attempts to summarize the
scientific evidence related to treatment, causation, diagnosis, or prognosis of
a specific disease. An overview does not generate any new data - it reviews
and summarizes already-existing studies.

Overviews, which are relied upon by many physicians, are important because
there usually exist multiple studies that have addressed a specific research
question. Yet these types of studies may differ with respect to:

Design

Patient population

Quality

Results

Although it appears that conducting an overview is easy, it requires a good
deal of effort and care to do it well. For example, determining inclusion and
exclusion criteria for studies is a major challenge for researchers when putting
together a useful overview.

What does this process involve? There are six basic steps to an overview:

1. Define a focused clinical question

2. Conduct a thorough literature search

3. Apply inclusion/exclusion criteria to the identified studies

4. Abstract/summarize the data from the eligible studies

5. Perform statistical analysis (meta-analysis), if appropriate

6. Disseminate the results

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

1. Describe the processes for conducting a systematic overview.

2. Describe how publication bias can affect the results of a systematic
review.

3. Recognize patterns in a funnel plot that would indicate potential
publication bias.

4. Evaluate the quality of a clinical report with the Jadad scale.

5. Recognize the appropriate use of a fixed effects model vs. a random
effects model for a meta-analysis. State how the weights differ between
the fixed and random approaches.

6. Describe the rationale for a test of heterogeneity among the studies
used in a meta-analysis.

7. Describe methods for performing a sensitivity analysis of the
meta-analysis.

Reference

Piantadosi Steven. (2005) Reporting and Authorship. Meta-Analyses. In:
Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd ed.
Hoboken, NJ: John Wiley and Sons, Inc.

16.1 - 1. Define a Focused Question


Examples:

Does garlic reduce serum cholesterol levels?

Is montelukast (Singulair) as effective as an inhaled steroid in treating
asthma?

Is induced abortion a risk factor for breast cancer?

If the question is too broad, it may not be useful when applied to a particular
patient. For example, whether chemotherapy is effective in cancer is too broad
a question (the number of studies addressing this question could exceed
10,000).

If the question is too narrow, there may not be enough evidence to answer the
question. For example, the following question is too narrow: Is a particular
asthma therapy effective in Caucasian females over the age of 65 years in
Central Pennsylvania?

16.2 - 2. Conduct a Thorough Literature Search

Many sources for studies (throughout the world) should be explored:

Bibliographic databases (Medline, Embase, etc.)

Publicly available clinical trials databases such as clinicaltrials.gov, etc.

Databanks of pharmaceutical firms
(e.g. http://www.clinicaltrialresults.com/ [1])

Conference proceedings

Theses/dissertations

Personal contacts

Unpublished reports

As discussed earlier in this course, beware of publication bias. Studies in
which the intervention is not found to be effective, or as effective as other
treatments, may not be submitted for publication. (This is referred to as the
"file-drawer problem".) Studies with "significant results" are more likely to
make it into a journal. Recent initiatives in online journals, such as PLoS
Medicine, and databases of trial results may encourage increased publication
of results from scientifically valid studies, regardless of outcome. Even so, in
an imperfect world, realize it is possible for an overview based only on
published studies to have a bias towards an overall positive effect.

Construction of a "funnel plot" is one method for evaluating whether or not
publication bias has occurred.

Suppose there are some relevant studies with small sample sizes. If nearly all
of them have a positive finding (p < 0.05), then this may provide evidence of
publication bias, for the following reason. It is more difficult to show
positive results with small sample sizes. Thus, there should be some negative
results (p > 0.05) among the small studies.

A "funnel plot" can be constructed to investigate this issue. Plot sample
size (vertical axis) versus p-value or magnitude of effect (horizontal axis).

When there is no publication bias, the p-values for some of the small studies
are relatively large, yielding a "funnel" shape for the scatterplot.

When none of the p-values for the small studies are large, the scatterplot
takes on a "band" shape, which raises the suspicion of publication bias.

16.3 - 3. Apply Inclusion/Exclusion Criteria

Eligibility criteria for studies need to be established prior to the analysis.

The researcher should base the inclusion/exclusion criteria on the design
aspects of the trials, the patient populations, treatment modalities, etc. that are
congruent with the objectives of the overview. Looking across a variety of
studies, this process can get quite complicated.

Although subjective, some researchers grade the selected studies according
to quality and may weight the studies accordingly in the analysis. One such
example of a quality rating of randomized trials is the Jadad scale (Jadad et
al. Assessing the quality of reports of randomized clinical trials: Is blinding
necessary? Controlled Clinical Trials, 1996; 17: 1-12). Here are the questions
that are asked as a part of this scale along with the scores that are associated
with these answers:

Is the study described as randomized?

No, or yes but inappropriate method (0 points)

Yes but no discussion of method (1 point)

Yes and appropriate method (2 points)

Is the study described as double blind?

No, or yes but inappropriate method (0 points)

Yes but no discussion of method (1 point)

Yes and appropriate method (2 points)

Is there a description of withdrawals/dropouts?

No (0 points)

Yes (1 point)
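
The scoring scheme above amounts to a simple sum, which can be sketched in Python as follows. The function name and the answer labels are hypothetical names chosen for this illustration; only the point values come from the list above.

```python
# Points for the randomization and double-blind questions, as listed above.
# The dictionary keys are labels invented for this sketch.
METHOD_POINTS = {
    "no_or_inappropriate": 0,    # No, or yes but inappropriate method
    "yes_method_not_discussed": 1,  # Yes but no discussion of method
    "yes_appropriate": 2,        # Yes and appropriate method
}


def jadad_score(randomization, blinding, withdrawals_described):
    """Total quality score (0 to 5) from the three questions."""
    score = METHOD_POINTS[randomization] + METHOD_POINTS[blinding]
    score += 1 if withdrawals_described else 0
    return score


# Hypothetical report: appropriate randomization, blinding claimed without
# a description of the method, withdrawals described.
example_score = jadad_score("yes_appropriate", "yes_method_not_discussed", True)
print(example_score)  # 4
```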

16.4 - 4. Abstract/Summarize the Data


In most circumstances, the researcher easily can gather the relevant
descriptive statistics (e.g., means, standard errors, sample sizes) from the
reports on the eligible studies.

Sometimes, older reports (say, prior to 1980) do not include variability
estimates (e.g., standard errors). If possible, the researcher should attempt to
contact the authors directly in such situations. This may not be successful,
however, because the authors may no longer have the data.

Ideally, the statistical analysis for a systematic review will be based on the raw
data from each eligible study. This has rarely occurred. Either the raw data
were no longer available or the authors were unwilling to share the raw data.
However, the success of shared data in the Human Genome Project has
given impetus to increased data sharing to promote rapid scientific progress.
Because the US NIH now requires investigators receiving large new NIH grants
to have a plan for data sharing (NIH Data Sharing Policy Guide [2]) and has
provided more guidance [3] on how federal data are to be shared, we may
anticipate more meta-analyses based on raw data.

As we have discussed earlier, problems to be solved before private entities
embrace data sharing include proprietary rights, authorship, patient consent
and confidentiality, common technology, proper use, enforcement of policy,
etc. As these challenges are overcome, the path to a systematic review and
meta-analysis based on raw data will be smoother.

16.5 - 5. Meta-analysis
The obvious advantage for performing a meta-analysis is that a large amount
of data, pooled across multiple studies, can provide increased precision in
addressing the research question. The disadvantage of a meta-analysis is that
the studies can be very heterogeneous in their designs, quality, and patient
populations and, therefore, it may not be valid to pool them. This issue is
something that needs to be evaluated very critically.

Researchers invoke two basic statistical models for meta-analysis, namely,
fixed-effects models and random-effects models.

A fixed-effects model is more straightforward to apply, but its underlying
assumptions are somewhat restrictive. It assumes that if all the involved
studies had tremendously large sample sizes, then they all would yield the
same result. In essence, a fixed-effects model assumes that there is no
inter-study variability (study heterogeneity). This statistical model accounts
only for intra-study variability.

A random-effects model, however, assumes that the eligible studies actually
represent a random sample from a population of studies that address the
research question. It accounts for intra-study and inter-study variability. Thus,
a random-effects model tends to yield a more conservative result, i.e., wider
confidence intervals and less statistical significance than a fixed-effects
model.

A random-effects model is more appealing from a theoretical perspective, but
it may not be necessary if there is very low study heterogeneity. A formal test
of study heterogeneity is available. Its results, however, should not determine
whether to apply a fixed-effects model or random-effects model. You need to
use your own judgment as to which model should be applied.

The test for study heterogeneity is very powerful and sensitive when the
number of studies is large. It is very weak and insensitive if the number of
studies is small. Graphical displays provide much better information as to the
nature of study heterogeneity. Some medical journals require that the authors
provide the test of heterogeneity, along with a fixed-effects analysis and a
random-effects analysis.

16.6 - The Fixed-Effects Model Approach


The basic step for a fixed-effects model involves the calculation of a weighted
average of the treatment effect across all of the eligible studies.

For a continuous outcome variable, the measured effect is expressed as the
difference between sample treatment and control means. The weight is
expressed as the inverse of the variance of the difference between the sample
means. Therefore, if the variance is large the study will be given a lower
weight. If the variance is smaller, the weight of the study is larger.

For a binary outcome variable, the measured effect usually is expressed as
the logarithm of the estimated odds ratio. The weight is expressed as the
inverse of the variance of the logarithm of the estimated odds ratio. Basically,
the weighting takes the same approach using this value.

Suppose that there are K studies.

The estimated treatment effect (e.g., difference between the sample treatment
and control means) in the kth study, k = 1, 2, ... , K, is $Y_k$.

The estimated variance of $Y_k$ in the kth study is $S_k^2$.

The weight for the estimated treatment effect in the kth study
is $w_k = 1/S_k^2$.

The overall weighted treatment effect is:

$$Y = \left(\sum_{k=1}^{K} w_k Y_k\right) \Big/ \left(\sum_{k=1}^{K} w_k\right)$$

The estimated variance of Y, the weighted treatment effect, is:

$$S^2 = 1 \Big/ \left(\sum_{k=1}^{K} w_k\right)$$

Testing the null hypothesis of no treatment effect (e.g., $H_0: \mu_1 - \mu_0 = 0$
for a continuous outcome) is performed by assuming that Y/S asymptotically
follows a standard normal distribution under $H_0$, so that |Y|/S is compared
with standard normal critical values.

The $100(1 - \alpha)\%$ confidence interval for the overall weighted treatment
effect is:

$$\left[\, Y - z_{1-\alpha/2}\, S,\;\; Y + z_{1-\alpha/2}\, S \,\right]$$

The statistic for testing $H_0$: {study homogeneity} is

$$Q = \sum_{k=1}^{K} w_k (Y_k - Y)^2$$

Under $H_0$, Q has an asymptotic $\chi^2$ distribution with K - 1 degrees of freedom.

Alternative: Mantel-Haenszel Test

An alternative, fixed-effects approach for a binary outcome is to apply the
Mantel-Haenszel test for the pooled odds ratio. The Mantel-Haenszel test for
the pooled odds ratio assumes that the odds ratio is equal across all studies.
For the kth study, k = 1, 2, ... , K, a 2 × 2 table is constructed:

           Control                  Treatment
Failure    $n_{0k} - r_{0k}$        $n_{1k} - r_{1k}$
Success    $r_{0k}$                 $r_{1k}$

The disadvantage of the Mantel-Haenszel approach, however, is that it cannot
adjust for covariates/regressors. Many researchers now use logistic
regression analysis to estimate the odds ratio from a study while adjusting for
covariates/regressors, so the weighted approach described previously is more
applicable.

16.7 - Example

Consider the following example for the difference in sample means between
an inhaled steroid and montelukast in asthmatic children. The outcome
variable is FEV1 (L) from four clinical trials. Note that only the first study
yields a statistically significant result (p-value < 0.05).

$Y_k$      $S_k$      $w_k$    p-value
0.070      0.032      977      0.028
0.043      0.049      416      0.375
0.058      0.052      370      0.260
0.075      0.041      595      0.067

The overall treatment effect is

$$Y = \frac{(0.070)(977) + (0.043)(416) + (0.058)(370) + (0.075)(595)}{977 + 416 + 370 + 595}$$

Overall effect Y = 152.363/2358 = 0.065

Overall estimated variance $S^2$ = 1/2358 = 0.0004 (S = 0.0206)

|Y|/S = |0.065|/0.0206 = 3.155, and the p-value = 0.002

The magnitude of the effect is described by the 95% confidence interval,
[0.025, 0.105].

It appears that there is a degree of homogeneity between these studies.

The statistic for testing homogeneity is Q = 0.303, which does not exceed 7.81,
the 95th percentile of the $\chi^2_3$ distribution. Therefore, we have no
evidence against homogeneity, although the small number of studies involved
in this overview means that this test has very little power.

Based on the evidence presented above, we can conclude that the inhaled
steroid is significantly better than montelukast in improving lung function in
children with asthma.
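
The calculations in this example can be checked with a few lines of code. The following is a minimal Python sketch, not the course's SAS code; the function name `fixed_effects` and the keys of the dictionary it returns are illustrative assumptions. Because it carries unrounded weights, its output may differ from the hand calculation above in the third decimal place.

```python
import math
from statistics import NormalDist


def fixed_effects(effects, std_errors, alpha=0.05):
    """Inverse-variance (fixed-effects) pooling of estimates Y_k with standard errors S_k."""
    weights = [1.0 / s ** 2 for s in std_errors]            # w_k = 1 / S_k^2
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w
    se = math.sqrt(1.0 / total_w)                            # S = sqrt(1 / sum of w_k)
    z = pooled / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))         # two-sided p-value
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    ci = (pooled - z_crit * se, pooled + z_crit * se)
    # Homogeneity statistic Q, referred to a chi-square with K - 1 df
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return {"pooled": pooled, "se": se, "z": z, "p_value": p_value, "ci": ci, "Q": q}


# Y_k and S_k from the four FEV1 trials in the table above
Y = [0.070, 0.043, 0.058, 0.075]
S = [0.032, 0.049, 0.052, 0.041]

result = fixed_effects(Y, S)
for name, value in result.items():
    print(name, value)
```

Working from the reported $Y_k$ and $S_k$ alone mirrors how a meta-analyst typically has only the descriptive statistics from each study report.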

The weighted analysis for the fixed-effects approach described previously
corresponds to the following linear model for the kth study, k = 1, 2, ... , K:

$$Y_k = \theta + e_k$$

where $Y_k$ is the observed effect in the kth study, $\theta$ is the pooled
population parameter of interest (difference in population treatment means,
natural logarithm of the population odds ratio, etc.) and $e_k$ is the random
error term for the kth study.

It is assumed that $e_1, e_2, \ldots, e_K$ are independent random variables
with $e_k$ having a $N(0, \sigma_k^2)$ distribution. The variance term
$\sigma_k^2$ then reflects intra-study variability and its estimate is $S_k^2$.
Usually, $Y_k$ and $S_k$ are provided as descriptive statistics in the kth
study report.

16.8 - Random Effects / Sensitivity Analysis

A corresponding linear model for the random-effects approach is as follows:

$$Y_k = \theta + t_k + e_k$$

where $Y_k$, $\theta$, and $e_k$ are the same as described above and $t_k$ is a
random effect for the kth study.

It is assumed that $t_1, t_2, \ldots, t_K$ are independent and identically
distributed as $N(0, \omega^2)$ random variables. The variance term $\omega^2$
reflects inter-study variability.

Whereas $\mathrm{Var}(Y_k) = \sigma_k^2$ is the variance associated with the
fixed-effects linear model, $\mathrm{Var}(Y_k) = \sigma_k^2 + \omega^2$ is the
variance associated with the random-effects linear model.

A weighted analysis will be applied, analogous to the weighted analysis for the
fixed-effects linear model, but the weights are different. The overall weighted
treatment effect is:

$$Y = \left(\sum_{k=1}^{K} w_k Y_k\right) \Big/ \left(\sum_{k=1}^{K} w_k\right)$$

where

$$w_k = 1/(S_k^2 + \hat{\omega}^2)$$

$$\hat{\omega}^2 = \max\left\{0,\; \frac{(Q - K + 1)\sum_{k=1}^{K} w_k^{F}}{\left(\sum_{k=1}^{K} w_k^{F}\right)^2 - \sum_{k=1}^{K} \left(w_k^{F}\right)^2}\right\}$$

and where Q is the heterogeneity statistic and $w_k^{F} = 1/S_k^2$ is the weight
for the kth study, which were defined previously for the weighted analysis in
the fixed-effects linear model.

Analogous to the fixed-effects linear model, the variance of Y in the
random-effects linear model is:

$$S^2 = 1 \Big/ \left(\sum_{k=1}^{K} w_k\right)$$

and statistical inference is performed in a similar manner.

If there exists a large amount of study heterogeneity, then $\hat{\omega}^2$
will be very large and will dominate in the expression for the weight in the
kth study, i.e.,

$$w_k = 1/(S_k^2 + \hat{\omega}^2) \approx 1/\hat{\omega}^2$$

Therefore, in such a situation with a relatively large amount of heterogeneity,
the weight for each study will approximately be the same and the weighted
analysis for the random-effects linear model will approximate an unweighted
analysis.
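
The random-effects weighting can be sketched in Python as follows. This is an illustration under stated assumptions rather than the course's own code: the function name `random_effects` is hypothetical, and the second data set is invented purely to show what happens when heterogeneity is large.

```python
import math


def random_effects(effects, std_errors):
    """Random-effects pooling using the moment estimator of the inter-study variance."""
    k = len(effects)
    w_fixed = [1.0 / s ** 2 for s in std_errors]             # fixed-effects weights 1 / S_k^2
    sum_w = sum(w_fixed)
    y_fixed = sum(w * y for w, y in zip(w_fixed, effects)) / sum_w
    q = sum(w * (y - y_fixed) ** 2 for w, y in zip(w_fixed, effects))
    # omega^2 estimate: max(0, (Q - K + 1) * sum(w) / (sum(w)^2 - sum(w^2)))
    denom = sum_w ** 2 - sum(w ** 2 for w in w_fixed)
    omega2 = max(0.0, (q - k + 1) * sum_w / denom)
    w_random = [1.0 / (s ** 2 + omega2) for s in std_errors]  # random-effects weights
    pooled = sum(w * y for w, y in zip(w_random, effects)) / sum(w_random)
    se = math.sqrt(1.0 / sum(w_random))
    return pooled, se, omega2


# FEV1 example from section 16.7: Q is below K - 1, so the estimate is truncated at 0
pooled_a, se_a, omega2_a = random_effects([0.070, 0.043, 0.058, 0.075],
                                          [0.032, 0.049, 0.052, 0.041])

# Hypothetical, strongly heterogeneous studies (values invented for illustration)
pooled_b, se_b, omega2_b = random_effects([0.10, 0.50, -0.20, 0.40],
                                          [0.05, 0.05, 0.05, 0.05])

print(pooled_a, se_a, omega2_a)
print(pooled_b, se_b, omega2_b)
```

When the variance estimate is truncated at zero, the random-effects weights equal the fixed-effects weights, so the two analyses coincide. In the hypothetical data the standard error of the pooled effect is several times larger than the fixed-effects value, which is the "more conservative result" described earlier.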

Graphs of the Treatment Differences

Graphical displays showing the estimated treatment difference and its
confidence interval for every study are very useful for evaluating treatment
effects over time or with respect to other factors. An example is given below:

Sensitivity Analyses

Statistical diagnostics (sensitivity analyses) should be performed to
investigate the validity and robustness of the meta-analysis via applying the
meta-analytic approach to subsets of the K studies, and/or applying the
leave-one-out method.

The steps for the leave-one-out method are as follows:

1. Remove the first of the K studies and conduct the meta-analysis on the
remaining K - 1 studies.

2. Remove the second of the K studies and conduct the meta-analysis on
the remaining K - 1 studies.

3. Continue this process until there are K distinct meta-analyses (each
with K - 1 studies).

If the results of the K meta-analyses in the leave-one-out method are
consistent, then there is confidence that the overall meta-analysis is robust.
The likelihood of consistency increases as K increases. The idea here is that
removing one of the studies from the meta-analysis should not affect the
overall results. If removing a study does change the results, this suggests a
lack of homogeneity among the studies involved.

Rather than invoke the leave-one-out method, researchers often prefer to
perform a sensitivity analysis by applying the meta-analysis to subsets of
studies based on high-quality versus low-quality studies, randomized versus
non-randomized studies, early studies versus late studies, etc.
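
The leave-one-out steps described above can be sketched in Python, here applied to the four FEV1 trials from section 16.7 with the fixed-effects weighting. The helper names `pooled_fixed` and `leave_one_out` are hypothetical, and the sketch reports only the pooled effect for each reduced analysis.

```python
def pooled_fixed(effects, std_errors):
    """Fixed-effects (inverse-variance) pooled estimate."""
    weights = [1.0 / s ** 2 for s in std_errors]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, std_errors):
    """Repeat the meta-analysis K times, omitting one study each time."""
    results = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_errors = std_errors[:i] + std_errors[i + 1:]
        results.append(pooled_fixed(kept_effects, kept_errors))
    return results


# Y_k and S_k from the FEV1 example
Y = [0.070, 0.043, 0.058, 0.075]
S = [0.032, 0.049, 0.052, 0.041]

loo = leave_one_out(Y, S)
for i, estimate in enumerate(loo, start=1):
    print(f"without study {i}: pooled effect = {estimate:.3f}")
```

If the K pooled effects stay close to the overall estimate, no single study is driving the result. The same loop could call a random-effects routine instead.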

16.9 - 6. Disseminate the Results


Many medical journals have guidelines on the process for publishing a
systematic review/meta-analysis.

The Cochrane Collaboration has been influential in improving the
methodology for systematic reviews. Cochrane Reviews
(http://www.cochrane.org) are based on the best available information
about healthcare interventions. They explore the evidence for and against the
effectiveness and appropriateness of treatments (medications, surgery,
education, etc.) in specific circumstances.

How to Evaluate an Overview/Meta-analysis

Here are a series of questions that we can ask ourselves as we evaluate the
value of a meta-analysis. You will have the opportunity to evaluate a meta-
analysis in the homework exercise.

1. Did the overview explicitly address a sensible clinical question?

2. Was the search for relevant studies detailed and exhaustive? Were the
inclusion/exclusion criteria for studies developed and applied
appropriately?

3. Were the studies of high methodologic quality?

4. Were the results consistent across studies?

5. What are the results and how precise were they?

6. How can the results be applied to patient care?

16.10 - Summary

In this lesson, among other things, we learned how to:

describe the processes for conducting a systematic overview,

describe how publication bias can affect the results of a systematic
review,

recognize patterns in a funnel plot that would indicate potential
publication bias,

evaluate the quality of a clinical report with the Jadad scale,

recognize the appropriate use of a fixed effects model vs. a random
effects model for a meta-analysis, stating how the weights differ
between the fixed and random approaches,

describe the rationale for a test of heterogeneity among the studies
used in a meta-analysis, and

describe methods for performing a sensitivity analysis of the
meta-analysis.

Let's put what we have learned to use by completing the following homework
assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in
ANGEL.

Lesson 17: Medical Diagnostic Testing


Introduction

A diagnostic test is any approach used to gather clinical information for the
purpose of making a clinical decision (i.e., diagnosis). Some examples of
diagnostic tests include X-rays, biopsies, pregnancy tests, medical histories,
and results from physical examinations.

From a statistical point of view there are two points to keep in mind:

1. the clinical decision-making process is based on probability;


2. the goal of a diagnostic test is to move the estimated probability of
disease toward either end of the probability scale (i.e., 0 rules out
disease, 1 confirms the disease).

Here is an example taken from Greenberg et al (2000, Medical Epidemiology,
Third Edition). A 54-year-old woman visits her family physician for an annual
check-up. The physician observes that:

she had no illnesses during the preceding year and there is no family
history of breast cancer,

her physical exam is unremarkable (nothing unusual is apparent),

her breast exam is normal (no signs of a palpable mass), and

her pelvic and rectal exams are unremarkable.

Based on the woman's age and medical history, the initial (prior) probability
estimate of breast cancer is 0.003. The physician recommends that the
woman have a mammogram, due to her age. Unfortunately, the results of the
mammogram are abnormal. This yields a modification of the woman's prior
probability of breast cancer from 0.003 to 0.13 (notice the Bayesian flavor of
this approach - prior probability modified via existing data). Next, the woman is
referred to a surgeon who agrees that the physical breast exam is normal. The
surgeon consults with a radiologist and they decide that the woman should
undergo fine needle aspiration (FNA) of the breast in which the mammogram
detected the abnormality (diagnostic test #2). The FNA specimen reveals
abnormal cells, which again revises the probability of breast cancer, from 0.13
to 0.64. Finally, the woman is scheduled for a breast biopsy the following week
to get a definitive diagnosis.

Ideally, diagnostic tests always would be correct, non-invasive, and inflict no
side effects. If this were the case, a positive test result would unequivocally
indicate the presence of disease and a negative result would indicate the
absence of disease. Realistically, however, every diagnostic test is fallible.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

calculate and provide confidence intervals for the sensitivity and
specificity of a diagnostic test,

calculate accuracy and predictive values of a diagnostic test,

state the relationship of prevalence of disease to the sensitivity,
specificity and predictive values of a diagnostic test,

test whether sensitivity or specificity of 2 tests are significantly different,
whether the results come from a study in two groups of patients or one
group of patients tested with both tests, and

select an appropriate cut-off for a positive test result, given an ROC
curve, for different cost ratios of false positive/false negative results.

17.1 - Analysis of Diagnostic Tests


To begin, let's consider a simple test which has only two possible outcomes,
namely, positive and negative. When a test is applied to a group of patients,
some with the disease and some without the disease, four groups can result,
as summarized in the following 2 × 2 table:

                 Disease                 No Disease
Test Positive    a (true positives)      b (false positives)
Test Negative    c (false negatives)     d (true negatives)

a (true-positives) = individuals with the disease, and for whom the test is
positive
b (false-positives) = individuals without the disease, but for whom the test is
positive
c (false-negatives) = individuals with the disease, but for whom the test is
negative
d (true-negatives) = individuals without the disease, and for whom the test is
negative

a + c = total number of individuals with the disease
b + d = total number of individuals without the disease

The "Gold Standard" is the method used to obtain a definitive diagnosis for a
particular disease; it may be biopsy, surgery, autopsy or an acknowledged
standard. Gold Standards are used to define true disease status against
which the results of a new diagnostic test are compared. The table below lists
the gold standard for several disorders. Some of these procedures are quite
invasive, which is a major reason why new diagnostic procedures are being
developed.

Target Disorder          Gold Standard
breast cancer            excisional biopsy
prostate cancer          transrectal biopsy
coronary stenosis        coronary angiography
myocardial infarction    catheterization
strep throat             throat culture

17.2 - Describing Diagnostic Tests


The following concepts have been developed to describe the performance of
a diagnostic test relative to the gold standard; these concepts are measures of
the validity of a diagnostic test.

Sensitivity is the probability that an individual with the disease of interest has a
positive test. It is estimated from the sample as a/(a+c).

Specificity is the probability that an individual without the disease of interest
has a negative test. It is estimated from the sample as d/(b+d).

Accuracy is the probability that the diagnostic test yields the correct
determination. It is estimated from the sample as (a+d)/(a+b+c+d).

Tests with high sensitivity are useful clinically to rule out a disease. A negative
result for a very sensitive test virtually would exclude the possibility that the
individual has the disease of interest. If a test has high sensitivity, it also
results in a low proportion of false-negatives. Sensitivity also is referred to as
"positive in disease" or "sensitive to disease".

Tests with high specificity are useful clinically to confirm the presence of a
disease. A positive result for a very specific test would give strong evidence in
favor of diagnosing the disease of interest. If a test has high specificity, it also
results in a low proportion of false-positives. Specificity also is referred to as
"negative in health" or "specific to health".

Sensitivity and specificity are, in theory, stable for all groups of patients.

In a study comparing FNA to the gold standard (excisional biopsy), 114
women with normal physical examinations (nonpalpable masses) and
abnormal mammograms received a FNA followed by surgical excisional
biopsy of the same breast (Bibbo M, et al: Stereotaxic fine needle aspiration
cytology of clinically occult malignant and premalignant breast lesions. Acta
Cytol 1988; 32:193-201).

                Cancer    No Cancer
FNA Positive    14        8
FNA Negative    1         91

Sensitivity = 14/15 = 0.93 or 93%
Specificity = 91/99 = 0.92 or 92%
Accuracy = 105/114 = 0.92 or 92%

Point estimates for sensitivity and specificity are based on proportions.
Therefore, we can compute confidence intervals using binomial theory. See
SAS Example (18.1_sensitivity_specifi.sas) for a SAS program that
calculates exact and asymptotic confidence intervals for sensitivity and
specificity.

For the FNA study, only 15 women with cancer, as diagnosed by the gold
standard, were studied. The rule of thumb for using the asymptotic confidence
interval, np(1 - p) ≥ 5, fails for sensitivity because
np(1 - p) = 15(0.93)(0.07) = 0.9765 < 5 (the rule does hold for specificity).

As the SAS output shows, the exact 95% confidence intervals for sensitivity
and specificity are (0.680, 0.998) and (0.847, 0.965), respectively.
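
For readers without SAS, exact binomial limits can be approximated numerically. The sketch below is a Python illustration that assumes the "exact" interval is the Clopper-Pearson interval; it finds each limit by bisection on the binomial distribution function. The function names are hypothetical.

```python
from math import comb


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(x + 1))


def _root(g):
    """Bisection on [0, 1] for an increasing function g with one sign change."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson confidence interval for a binomial proportion x/n."""
    # Lower limit: the p at which P(X >= x) equals alpha/2
    lower = 0.0 if x == 0 else _root(lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2.0)
    # Upper limit: the p at which P(X <= x) equals alpha/2
    upper = 1.0 if x == n else _root(lambda p: alpha / 2.0 - binom_cdf(x, n, p))
    return lower, upper


sens_ci = exact_ci(14, 15)   # sensitivity: 14 of 15 women with cancer test positive
spec_ci = exact_ci(91, 99)   # specificity: 91 of 99 women without cancer test negative
print(sens_ci, spec_ci)
```

The interval for sensitivity is wide and asymmetric because it rests on only 15 diseased women, which is exactly the situation in which the asymptotic interval is unreliable.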

17.3 - Estimating the Probability of Disease

Sensitivity and specificity describe the accuracy of a test. In a clinical setting,
we do not know who has the disease and who does not - that is why
diagnostic tests are used. We would like to be able to estimate the probability
of disease based on the outcome of one or more diagnostic tests. The
following measures address this idea.

Prevalence is the probability of having the disease, also called the prior
probability of having the disease. It is estimated from the sample as (a+c)/
(a+b+c+d).

Positive Predictive Value (PV+) is the probability of disease in an individual
with a positive test result. It is estimated as a/(a+b).

Negative Predictive Value (PV - ) is the probability of not having the disease
when the test result is negative. It is estimated as d/(c+d).

In the FNA study of 114 women with nonpalpable masses and abnormal
mammograms,

prevalence = 15/114 = 0.13
PV+ = 14/(14+8) = 0.64
PV - = 91/(1+91) = 0.99

Thus, a woman's prior probability of having the disease is 0.13 and is modified
to 0.64 if she has a positive test result. A woman's prior probability of not
having the disease is 0.87 and is modified to 0.99 if she has a negative test
result.

If the disease under study is rare, the investigator may decide to invoke a
case-control design for evaluating the diagnostic test, e.g., recruit 50 patients
with the disease and 50 controls. Obviously, prevalence cannot be estimated
from a case-control study because it does not represent a random sample
from the general population.

Predictive values allow us to determine the usefulness of a test and they vary
with the sensitivity and specificity of a test. If all other characteristics are
held constant, then:

1. as sensitivity of a test increases, PV - increases and

2. as specificity of a test increases, PV+ increases.

Predictive values vary with the prevalence of the disease in the population
being tested or the pre-test probability of disease in a given individual.

Sensitivity, specificity, and prevalence can be used in a clinical setting to
estimate post-test probabilities (predictive values), even though physicians
work with one patient at a time, not entire populations of patients. Three
pieces of information are necessary prior to performing the test, namely, (1)
either the prevalence of the disease or the prior probability of disease, (2)
sensitivity, and (3) specificity.

Then, formulae for PV+ and PV- are:

$$\mathrm{PV+} = \frac{\text{Prevalence} \times \text{Sensitivity}}{(\text{Prevalence} \times \text{Sensitivity}) + \{(1 - \text{Prevalence}) \times (1 - \text{Specificity})\}}$$

$$\mathrm{PV-} = \frac{(1 - \text{Prevalence}) \times \text{Specificity}}{\{(1 - \text{Prevalence}) \times \text{Specificity}\} + \{\text{Prevalence} \times (1 - \text{Sensitivity})\}}$$

Although PV+ = 14/(14+8) = 0.64 and PV - = 91/(1+91) = 0.99 can be
calculated directly from the 2 × 2 data table because the women constituted a
random sample, the above formulae yield the same results:

PV+ = (0.13)(0.93)/{(0.13)(0.93) + (0.87)(0.08)} = 0.64

PV- = (0.87)(0.92)/{(0.87)(0.92) + (0.13)(0.07)} = 0.99
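
The two formulae translate directly into code. The following Python sketch uses a hypothetical function name, `predictive_values`, and evaluates it on the FNA study. The loop at the end applies the same sensitivity and specificity to a few other prevalences, chosen only for illustration, to show how strongly the predictive values depend on prevalence.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PV+, PV-) from prevalence, sensitivity and specificity."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    true_neg = (1.0 - prevalence) * specificity
    false_neg = prevalence * (1.0 - sensitivity)
    pv_pos = true_pos / (true_pos + false_pos)
    pv_neg = true_neg / (true_neg + false_neg)
    return pv_pos, pv_neg


# FNA study: prevalence 15/114, sensitivity 14/15, specificity 91/99
pv_pos, pv_neg = predictive_values(15 / 114, 14 / 15, 91 / 99)
print(round(pv_pos, 2), round(pv_neg, 2))

# Same test characteristics at other prevalences (illustrative values only)
for prevalence in (0.003, 0.05, 0.13, 0.50):
    print(prevalence, predictive_values(prevalence, 14 / 15, 91 / 99))
```

With unrounded inputs the function reproduces the values obtained directly from the table, 14/22 and 91/92.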

The following example is taken from Sackett et al (1985, Clinical
Epidemiology). Suppose a patient with the following characteristics visits a
physician:

45-year-old man

ambulatory with episodic chest pain

no coronary risk factors except smoking one pack of cigarettes per day

3-week history of substernal and precordial pain - stabbing and fleeting

physical exam shows a single costochondral junction that is slightly
tender, but does not reproduce the patient's pain

From this information, the physician estimates an intermediate pre-test (prior)
probability of 60% that this patient has significant coronary artery narrowing.

The physician is not sure whether the patient should undergo an exercise
electrocardiogram (ECG). How useful would this test be for this patient?

Suppose it is known from the literature that the sensitivity and specificity of the
exercise ECG in coronary artery stenosis (as compared to the gold standard
of coronary arteriography) are 60% and 91%, respectively.

Then:

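PV+ = (0.6)(0.6)/{(0.6)(0.6) + (0.4)(0.09)} = 0.91
PV - = (0.4)(0.91)/{(0.4)(0.91) + (0.6)(0.4)} = 0.60

Thus, a positive exercise ECG would raise this patient's probability of disease
from 0.60 to 0.91, whereas a negative result would raise the probability of no
disease only from 0.40 to 0.60.

An additional test characteristic reported in the medical literature is the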
likelihood ratio, which is the probability of a particular test result (+ or - ) in
patients with the disease divided by the probability of the result in patients
without the disease. There exists one likelihood ratio for a positive test (LR+)
and one for a negative test (LR - ). Likelihood ratios express how many times
more (or less) likely the test result is found in diseased versus non-diseased
individuals:

LR+ = Sensitivity/(1 - Specificity)


LR - = (1 - Sensitivity)/Specificity

From the FNA study in 114 women with nonpalpable masses and abnormal
mammograms, LR+ = 0.933/0.081 = 11.52 and LR - = 0.067/0.919 = 0.07.
Thus, positive FNA results are 11.52 times more likely in women with cancer
as compared to those without, and negative FNA results are 0.07 times as
likely in women with cancer as compared to those without.
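
Likelihood ratios are easy to compute and connect to the predictive values through the odds form of Bayes' theorem (post-test odds = pre-test odds × likelihood ratio), a standard identity that the text above does not state explicitly. The Python sketch below uses hypothetical function names. Because it uses unrounded sensitivity and specificity, LR+ comes out slightly different from the 11.52 obtained above with rounded inputs.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a dichotomous test."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability of disease to a post-test probability via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# FNA study: sensitivity 14/15, specificity 91/99, prevalence 15/114
lr_pos, lr_neg = likelihood_ratios(14 / 15, 91 / 99)
p_after_positive = post_test_probability(15 / 114, lr_pos)   # equals PV+
p_after_negative = post_test_probability(15 / 114, lr_neg)   # equals 1 - PV-
print(lr_pos, lr_neg, p_after_positive, p_after_negative)
```

The probability of disease after a positive FNA matches PV+ from the 2 × 2 table, and the probability after a negative FNA matches 1 minus PV - .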

17.4 - Comparing Two Diagnostic Tests


Suppose that we want to compare sensitivity and specificity for two diagnostic
tests. Let p1 denote the test characteristic for diagnostic test #1 and let p2
denote the test characteristic for diagnostic test #2.

The appropriate statistical test depends on the setting. If diagnostic tests were
studied on two independent groups of patients, then two-sample tests for
binomial proportions are appropriate (chi-square, Fisher's exact test). If both
diagnostic tests were performed on each patient, then paired data result and
methods that account for the correlated binary outcomes are necessary
(McNemar's test).

Suppose two different diagnostic tests are performed in two independent
samples of individuals using the same gold standard. The following 2 × 2
tables result:

Diagnostic Test #1    Disease    No Disease
Positive              82         30
Negative              18         70

Diagnostic Test #2    Disease    No Disease
Positive              140        10
Negative              60         90

Suppose that sensitivity is the statistic of interest. The estimates of sensitivity
are p1 = 82/100 = 0.82 and p2 = 140/200 = 0.70 for diagnostic test #1 and
diagnostic test #2, respectively. The following SAS program will provide
confidence intervals for the sensitivity for each test as well as comparison of
the tests with regard to sensitivity.

SAS Example 18.2_comparing_diagnostic.sas

Run the program and look at the output. Do you see the exact 95%
confidence intervals for the two diagnostic tests as (0.73, 0.89) and (0.63,
0.76), respectively?

The SAS program also indicates that the p-value = 0.0262 from Fisher's exact
test for testing H0 : p1 = p2 .

Thus, diagnostic test #1 has a significantly better sensitivity than diagnostic
test #2.
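
Fisher's exact test for the two independent samples can also be sketched without SAS. The Python code below is an illustration, not the course's program: the function name is hypothetical, and the two-sided p-value is defined here as the sum of the probabilities of all tables that are no more likely than the observed one, which is a common convention but may not match every package exactly.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-7))


# Diseased patients only: test #1 gives 82 positive and 18 negative,
# test #2 gives 140 positive and 60 negative
p_value = fisher_exact_two_sided(82, 18, 140, 60)
print(p_value)
```

Only the diseased columns of the two tables enter the comparison, because sensitivity is defined among individuals who have the disease.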

Suppose both diagnostic tests (test #1 and test #2) are applied to a given set
of individuals, some with the disease (by the gold standard) and some without
the disease.

As an example, data can be summarized in a 2 × 2 table for the
100 diseased patients as follows:

                      Diagnostic Test #2
Diagnostic Test #1    Positive    Negative
Positive              30          35
Negative              23          12

The appropriate test statistic for this situation is McNemar's test. The patients
with a (+, +) result and the patients with a ( - , - ) result do not distinguish
between the two diagnostic tests. The only information for comparing the
sensitivities of the two diagnostic tests comes from those patients with a (+, - )
or ( - , +) result.

Testing that the sensitivities are equal, i.e., H0 : p1 = p2 , is comparable to
testing that

H0 : p = (probability of preferring diagnostic test #1 over diagnostic test #2) = 1/2

among the N patients with discordant results.

In the above example, N = 58 and 35 of the 58 display a (+, - ) result, so the
estimated binomial probability is 35/58 = 0.60. The exact p-value is 0.148 from
McNemar's test (see SAS Example 18.3_comparing_diagnostic.sas).
Thus, the two diagnostic tests are not significantly different with respect to
sensitivity.
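
The exact version of McNemar's test is a two-sided binomial test on the discordant pairs with null probability 1/2. A minimal Python sketch follows; the function name `mcnemar_exact` is a hypothetical choice.

```python
from math import comb


def mcnemar_exact(n_plus_minus, n_minus_plus):
    """Exact two-sided McNemar p-value from the two discordant counts."""
    n = n_plus_minus + n_minus_plus
    k = max(n_plus_minus, n_minus_plus)
    # P(X >= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


# 35 patients with a (+, -) result and 23 with a (-, +) result
p_value = mcnemar_exact(35, 23)
print(p_value)
```

The 30 concordant positive and 12 concordant negative patients do not appear in the calculation, in line with the reasoning above.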

17.5 - Selecting a Positivity Criterion


Methods for calculating sensitivity and specificity depend on test outcomes
that are dichotomous. Many lab tests and other diagnostic tools, however, are
measured on a numerical scale. In this case, sensitivity and specificity depend
on where the cutoff point is made between positive and negative.

The positivity criterion is the cutoff value on a numerical scale that separates
normal values from abnormal values. It determines which test results are
considered positive (indicative of disease) and negative (disease-free).
Because the distributions of test values for diseased and disease-free
individuals are likely to overlap, there will be false-positive and false-negative
results. When defining a positivity criterion, it is important to consider which
mistake is worse.

In the following example, a high value of the diagnostic test (positive result) is
indicative of disease. The chosen cutoff value will yield a poor sensitivity
because many of the diseased individuals will have a negative result (false
negatives). On the other hand, nearly all of the healthy individuals will have a
negative result, so the chosen cutoff value will yield a good specificity.

Now suppose a lower value is selected for the cutoff point. The chosen
cutoff value will yield a good sensitivity because nearly all of the diseased
individuals will have a positive result. Unfortunately, many of the healthy
individuals also will have a positive result (false positives), so this cutoff value
will yield a poor specificity.

When the consequences for missing a case are potentially grave, choose a
value for the positivity criterion that minimizes the number of false-negatives.
For example, in neonatal PKU screening, a false-negative result may delay
essential dietary intervention until mental retardation is evident. False-positive
results, on the other hand, are usually identified during follow-up testing.

When false-positive results may lead to a risky treatment, choose a value for
the positivity criterion that minimizes the number of false-positive results. For
example, false-positive results indicating certain types of cancer can lead to
chemotherapy which can suppress the patient's immune system and leave the
patient open to infection and other side effects.

An ROC curve (Receiver Operating Characteristic) is a graphical
representation of the relationship between sensitivity and specificity for a
diagnostic test measured on a numerical scale. The ROC curve consists of a
plot of sensitivity (true-positive rate) versus 1 - specificity (false-positive
rate) for several choices of the positivity criterion. PROC LOGISTIC of SAS
provides a means for constructing ROC curves.

The figure below depicts an ROC curve (drawn with x's). The point in the
upper left corner of the figure, (0,1), represents a perfect test, in which
sensitivity and specificity both are 1. When false-positive and false-negative
results are equally problematic, there are two choices:

1. Set the positivity criterion to the point on the ROC curve closest to the
upper left corner. (This will also be closest to the dashed line, as the cutoff
in the figure indicates.)

2. Set the positivity criterion to the point on the ROC curve farthest (vertical
distance) from the line of chance (Youden index).

When false-positive results are more undesirable, set the positivity criterion to
a point farther left on the ROC curve (increase specificity). If, instead,
false-negative results are more undesirable, set the positivity criterion to a
point farther right on the ROC curve (increase sensitivity).
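
The two selection rules can be illustrated with a small Python sketch. The test values below are hypothetical, invented only for illustration, and the function name `roc_points` is an assumption. A result is treated as positive when the value is at or above the cutoff, i.e., high values indicate disease.

```python
def roc_points(diseased, healthy):
    """Return (cutoff, sensitivity, specificity) for every observed cutoff.

    A result is called positive when the test value is >= the cutoff.
    """
    points = []
    for cutoff in sorted(set(diseased) | set(healthy)):
        sensitivity = sum(v >= cutoff for v in diseased) / len(diseased)
        specificity = sum(v < cutoff for v in healthy) / len(healthy)
        points.append((cutoff, sensitivity, specificity))
    return points


# Hypothetical test values, invented for illustration only
diseased = [4.8, 6.3, 6.7, 7.2, 7.8, 8.3, 8.9, 9.5]
healthy = [2.0, 2.9, 3.4, 4.0, 4.6, 5.1, 5.5, 6.9]

points = roc_points(diseased, healthy)

# Rule 2: Youden index J = sensitivity + specificity - 1 (vertical distance from chance line)
best_youden = max(points, key=lambda pt: pt[1] + pt[2] - 1.0)

# Rule 1: smallest distance from (1 - specificity, sensitivity) to the corner (0, 1)
closest_corner = min(points, key=lambda pt: (1.0 - pt[2]) ** 2 + (1.0 - pt[1]) ** 2)

print(best_youden, closest_corner)
```

In these data both rules select the same cutoff, but they need not agree in general. When one type of error is costlier, the choice would instead be restricted to the left or right portion of the curve as described above.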

SAS Example MDT.4

In the ACRN SOCS trial, the investigators wanted to determine if low values of
the methacholine PC20 at baseline are predictive of significant asthma
exacerbations. The methacholine PC20 is a measure of how reactive a
person's airways are to an irritant (methacholine); a low value of the
PC20 corresponds to a high level of airway reactivity.

Here is the SAS code that was used: ACRN_SOCS_trial.sas

Unfortunately, log2(methacholine PC20) is not statistically significant in
predicting the occurrence of significant asthma exacerbation (p = 0.27) and
the ROC curve is very close to the line of identity.

17.6 - Summary

In this lesson, among other things, we learned how to:

calculate and provide confidence intervals for the sensitivity and
specificity of a diagnostic test,

calculate accuracy and predictive values of a diagnostic test,

state the relationship of prevalence of disease to the sensitivity,
specificity and predictive values of a diagnostic test,

test whether sensitivity or specificity of 2 tests are significantly different,
whether the results come from a study in two groups of patients or one
group of patients tested with both tests, and

select an appropriate cut-off for a positive test result, given an ROC
curve, for different cost ratios of false positive/false negative results.

Let's put what we have learned to use by completing the homework
assignment posted in this week's folder!

Lesson 18: Correlation and Agreement


Introduction

Many biostatistical analyses are conducted to study the relationship between
two continuous or ordinal scale variables within a group of patients.

Purposes of these analyses include:

1. assessing correlation between the two variables, i.e., identifying
whether values of one variable tend to be higher (or possibly lower) for
higher values of the other variable;

2. assessing the amount of agreement between the values of the two
variables, i.e., comparing alternative ways of measuring or assessing
the same response;

3. assessing the ability of one variable to predict values of the other
variable, i.e., formulating predictive models via regression analyses.

This lesson will focus only on correlation and agreement (issues numbered 1
and 2 listed above).

Learning objectives & outcomes

Upon completion of this lesson, you should be able to do the following:

Recognize appropriate use of Pearson correlation, Spearman
correlation, Kendall's tau-b and Cohen's Kappa statistics.

Use a SAS program to produce confidence intervals for correlation
coefficients and interpret the results.

Adapt a SAS program to produce the correlation coefficients, their
confidence intervals and Kendall's tau-b.

Recognize situations that call for the use of a statistic measuring
concordance.

Distinguish between a concordance correlation coefficient and a Kappa
statistic based on the type of data used for each.

Interpret a concordance correlation coefficient and a Kappa statistic.

18.1 - Pearson Correlation Coefficient


Correlation is a general method of analysis useful when studying possible
association between two continuous or ordinal scale variables. Several
measures of correlation exist. The appropriate type for a particular situation
depends on the distribution and measurement scale of the data. Three
measures of correlation are commonly applied in biostatistics and these will
be discussed below.
Suppose that we have two variables of interest, denoted as X and Y, and
suppose that we have a bivariate sample of size n:

\[ (X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n) \]

and we define the following statistics:

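\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S_{XX} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 \]

\[ \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad S_{YY} = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2 \]

\[ S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \]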

These statistics above represent the sample mean for X, the sample variance
for X, the sample mean for Y, the sample variance for Y, and the sample
covariance between X and Y, respectively. These should be very familiar to
you.

The sample Pearson correlation coefficient (also called the sample product-
moment correlation coefficient) for measuring the association between
variables X and Y is given by the following formula:

\[ r_p = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} \]

The sample Pearson correlation coefficient, r_p, is the point estimate of the
population Pearson correlation coefficient

\[ \rho_p = \frac{\sigma_{XY}}{\sqrt{\sigma_{XX} \sigma_{YY}}} \]

The Pearson correlation coefficient measures the degree of linear relationship
between X and Y, and -1 ≤ r_p ≤ +1. It is a "unitless" quantity, i.e., when
you construct the correlation coefficient the units of measurement that are
used cancel out. A value of +1 reflects perfect positive correlation and a value
of -1 reflects perfect negative correlation.
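
As a hedged illustration of the formulas above, the following Python sketch computes the sample means, variances, covariance, and r_p directly from their definitions. The function names and the five data pairs are made up for illustration; this is not the course's SAS program, and PROC CORR remains the tool used in the lesson.

```python
import math

def sample_stats(x, y):
    """Return (xbar, ybar, s_xx, s_yy, s_xy), using the n - 1 divisor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    return xbar, ybar, s_xx, s_yy, s_xy

def pearson_r(x, y):
    """Sample Pearson correlation: S_XY / sqrt(S_XX * S_YY)."""
    _, _, s_xx, s_yy, s_xy = sample_stats(x, y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, chosen only so the arithmetic is easy to follow.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
print(round(r, 4))
```

For these values S_XX = 2.5, S_YY = 1.5 and S_XY = 1.5, so r_p = 1.5 / sqrt(3.75), roughly 0.77.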
For the Pearson correlation coefficient, we assume that both X and Y are
measured on a continuous scale and that each is approximately normally
distributed.

The Pearson correlation coefficient is invariant to location and scale
transformations. This means that if every X_i is transformed to

X_i* = aX_i + b

and every Y_i is transformed to

Y_i* = cY_i + d

where a > 0, b, c > 0, and d are constants, then the correlation
between X and Y is the same as the correlation between X* and Y*.
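
The invariance property can be checked numerically. The sketch below, in Python rather than SAS, rescales hypothetical temperature and heart-rate values (a change from Celsius to Fahrenheit and from beats per minute to beats per second) and compares the two correlations. All names and numbers are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation (the 1/(n - 1) factors cancel)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    s_xx = sum((a - xbar) ** 2 for a in x)
    s_yy = sum((b - ybar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical measurements on five patients.
temp_c = [36.5, 37.0, 38.2, 39.1, 37.8]
heart_rate = [62, 70, 88, 95, 80]

# Location and scale changes with positive multipliers a = 1.8 and c = 1/60.
temp_f = [1.8 * t + 32 for t in temp_c]
beats_per_sec = [h / 60 for h in heart_rate]

r_original = pearson_r(temp_c, heart_rate)
r_transformed = pearson_r(temp_f, beats_per_sec)
print(r_original, r_transformed)
```

The two printed values agree up to floating-point rounding, which is what the invariance statement predicts when a and c are both positive.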

With SAS, PROC CORR is used to calculate r_p. The output from PROC
CORR includes summary statistics for both variables and the computed value
of r_p. The output also contains a p-value corresponding to the test of:

H0: ρ_p = 0 versus H1: ρ_p ≠ 0

It should be noted that this statistical test generally is not very useful, and the
associated p-value, therefore, should not be emphasized. What is more
important is to construct a confidence interval.

The sampling distribution for Pearson's r_p is not normal. In order to attain
confidence limits for r_p based on a standard normal distribution, we
transform r_p using Fisher's Z transformation to get a quantity, z_p, that has an
approximate normal distribution. Then we can work with this value. Here is
what is involved in the transformation.

Fisher's Z transformation is defined as

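\[ z_p = \frac{1}{2}\log_e\left(\frac{1 + r_p}{1 - r_p}\right) \sim N\left(\zeta_p,\ sd = \frac{1}{\sqrt{n-3}}\right) \]

where

\[ \zeta_p = \frac{1}{2}\log_e\left(\frac{1 + \rho_p}{1 - \rho_p}\right) \]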

We will use this to get the usual confidence interval, so an approximate
100(1 - α)% confidence interval for ζ_p is given by [z_{p,α/2}, z_{p,1-α/2}], where

\[ z_{p,\alpha/2} = z_p - \left(t_{n-3,\,1-\alpha/2} / \sqrt{n-3}\right), \qquad z_{p,1-\alpha/2} = z_p + \left(t_{n-3,\,1-\alpha/2} / \sqrt{n-3}\right) \]

What we really want, however, is an approximate 100(1 - α)% confidence
interval for ρ_p. It is obtained by back-transforming the limits above and is
given by [r_{p,α/2}, r_{p,1-α/2}], where

\[ r_{p,\alpha/2} = \frac{\exp(2 z_{p,\alpha/2}) - 1}{\exp(2 z_{p,\alpha/2}) + 1}, \qquad r_{p,1-\alpha/2} = \frac{\exp(2 z_{p,1-\alpha/2}) - 1}{\exp(2 z_{p,1-\alpha/2}) + 1} \]

Again, you do not have to do this by hand. PROC CORR in SAS will do this
for you but it is important to have an idea of what is going on.
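
To give an idea of what is going on, here is a minimal Python sketch of the transform, interval, and back-transform steps. The function name `fisher_ci` and its `crit` argument are hypothetical. One assumption is worth flagging: the formula above is written with a t quantile, whereas this sketch defaults to the standard normal quantile (available in the Python standard library); a t quantile can be passed in through `crit` and would give a somewhat wider interval.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, alpha=0.05, crit=None):
    """Approximate 100(1 - alpha)% CI for the population correlation."""
    if crit is None:
        # Assumption: use the standard normal quantile as the critical value.
        crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Fisher's Z transformation of the sample correlation.
    z = 0.5 * math.log((1 + r) / (1 - r))
    half_width = crit / math.sqrt(n - 3)
    z_lo, z_hi = z - half_width, z + half_width

    def back(v):
        # Back-transform a limit from the z scale to the correlation scale.
        return (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)

    return back(z_lo), back(z_hi)

# Pearson estimate and sample size quoted in the example of section 18.4.
lo, hi = fisher_ci(0.7921, 18)
print(round(lo, 4), round(hi, 4))
```

With r_p = 0.7921 and n = 18, the normal-quantile version lands very close to the interval (0.5161, 0.9191) quoted in section 18.4. Note that the interval is not symmetric about r_p, a consequence of the back-transformation.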

18.2 - Spearman Correlation Coefficient


The Spearman rank correlation coefficient, r_s, is a nonparametric measure of
correlation based on data ranks. It is obtained by ranking the values of the two
variables (X and Y) and calculating the Pearson r_p on the resulting ranks, not
the data themselves. Again, PROC CORR will do all of these actual calculations for
you.
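
The "rank, then apply Pearson" recipe can be sketched in a few lines of Python. The helper names are hypothetical, the data are made up, and ties are given average ranks, which is an assumption of this sketch rather than something stated in the lesson.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    s_xx = sum((a - xbar) ** 2 for a in x)
    s_yy = sum((b - ybar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def spearman_r(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotone but nonlinear relationship: y = x cubed.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman_r(x, y), pearson_r(x, y))
```

Because y increases whenever x does, the ranks agree perfectly and r_s equals 1, while r_p on the raw values is below 1 because the relationship is not linear.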

The Spearman rank correlation coefficient has properties similar to those of
the Pearson correlation coefficient, although the Spearman rank correlation
coefficient quantifies the degree of linear association between the ranks
of X and the ranks of Y. Also, r_s does not estimate a natural population
parameter (unlike Pearson's r_p, which estimates ρ_p).

An advantage of the Spearman rank correlation coefficient is that
the X and Y values can be continuous or ordinal, and approximate normal
distributions for X and Y are not required. Similar to the Pearson r_p, Fisher's Z
transformation can be applied to the Spearman r_s to get a statistic, z_s, that
has an asymptotic normal distribution for calculating an asymptotic confidence
interval. Again, PROC CORR will do this as well.

18.3 - Kendall Tau-b Correlation Coefficient

The Kendall tau-b correlation coefficient, τ_b, is a nonparametric measure of
association based on the number of concordances and discordances in paired
observations.

Two observations (X_i, Y_i) and (X_j, Y_j) are concordant if they are in
the same order with respect to each variable. That is, if

(1) X_i < X_j and Y_i < Y_j, or if
(2) X_i > X_j and Y_i > Y_j

They are discordant if they are in the reverse ordering for X and Y, that is,
if the values are arranged in opposite directions:

(1) X_i < X_j and Y_i > Y_j, or if
(2) X_i > X_j and Y_i < Y_j

The two observations are tied if X_i = X_j and/or Y_i = Y_j.

The total number of pairs that can be constructed for a sample size of n is

\[ N = \binom{n}{2} = \frac{1}{2} n(n-1) \]

N can be decomposed into these five quantities:

\[ N = P + Q + X_0 + Y_0 + (XY)_0 \]

where P is the number of concordant pairs, Q is the number of discordant
pairs, X_0 is the number of pairs tied only on the X variable, Y_0 is the number of
pairs tied only on the Y variable, and (XY)_0 is the number of pairs tied on
both X and Y.

The Kendall tau-b for measuring order association between
variables X and Y is given by the following formula:

\[ t_b = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}} \]

This value is scaled so that it ranges between -1 and +1. Unlike the Spearman
r_s, it does estimate a population parameter: t_b is the sample estimate of

\[ \tau_b = \Pr[\text{concordance}] - \Pr[\text{discordance}] \]

The Kendall tau-b has properties similar to the properties of the Spearman r_s.
Because the sample estimate, t_b, does estimate a population parameter, τ_b,
many statisticians prefer the Kendall tau-b to the Spearman rank correlation
coefficient.
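
The pair-counting definition translates directly into code. The following Python sketch classifies every pair of observations into the five categories P, Q, X_0, Y_0 and (XY)_0 and then applies the t_b formula. The function name and the small data set are hypothetical; the brute-force loop over all pairs is chosen for clarity, not efficiency.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Return (t_b, counts), where counts holds P, Q, X0, Y0 and XY0."""
    counts = {"P": 0, "Q": 0, "X0": 0, "Y0": 0, "XY0": 0}
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            counts["XY0"] += 1   # tied on both X and Y
        elif dx == 0:
            counts["X0"] += 1    # tied only on X
        elif dy == 0:
            counts["Y0"] += 1    # tied only on Y
        elif dx * dy > 0:
            counts["P"] += 1     # concordant: same ordering on X and Y
        else:
            counts["Q"] += 1     # discordant: opposite ordering
    p, q = counts["P"], counts["Q"]
    denom = math.sqrt((p + q + counts["X0"]) * (p + q + counts["Y0"]))
    return (p - q) / denom, counts

# Hypothetical sample of n = 5 pairs, so N = 10 pairs of observations.
x = [1, 2, 2, 3, 4]
y = [1, 2, 3, 3, 2]
t_b, counts = kendall_tau_b(x, y)
print(counts, round(t_b, 4))
```

For this sample P = 5, Q = 2, X_0 = 1, Y_0 = 2 and (XY)_0 = 0, which sum to N = 10, and t_b = 3 / sqrt(8 × 9), roughly 0.35.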

18.4 - Example - Correlation Coefficients


SAS Example (19.1_correlation.sas [1]): Age and percentage body fat were
measured in 18 adults. SAS PROC CORR provides estimates of the Pearson,
Spearman, and Kendall correlation coefficients. It also calculates Fisher's Z
transformation for the Pearson and Spearman correlation coefficients in order
to get 95% confidence intervals.
The resulting estimates for this example are 0.7921, 0.7539, and 0.5762,
respectively for the Pearson, Spearman, and Kendall correlation coefficients.
The Kendall tau-b correlation typically is smaller in magnitude than the
Pearson and Spearman correlation coefficients.

The 95% confidence intervals are (0.5161, 0.9191) and (0.4429, 0.9029),
respectively for the Pearson and Spearman correlation coefficients. Because
the Kendall correlation typically is applied to binary or ordinal data, its 95%
confidence interval can be calculated via SAS PROC FREQ (this is not shown
in the SAS program above).

18.5 - Use and Misuse of Correlation Coefficients

Correlation is a widely-used analysis tool which sometimes is applied
inappropriately. Some caveats regarding the use of correlation methods
follow.

1. The correlation methods discussed in this chapter should be used only
with independent data; they should not be applied to repeated measures
data where the data are not independent. For example, it would not be
appropriate to use these measures of correlation to describe the relationship
between Week 4 and Week 8 blood pressures in the same patients.

2. Caution should be used in interpreting results of correlation analysis when
large numbers of variables have been examined, resulting in a large number
of correlation coefficients, some of which will appear notable by chance alone.

3. The correlation of two variables that both have been recorded repeatedly
over time can be misleading and spurious. Time trends should be removed
from such data before attempting to measure correlation.

4. To extend correlation results to a given population, the subjects under study
must form a representative (i.e., random) sample from that population. The
Pearson correlation coefficient can be very sensitive to outlying observations
and all correlation coefficients are susceptible to sample selection biases.

5. Care should be taken when attempting to correlate two variables where one
is a part and one represents the total. For example, we would expect to find a
positive correlation between height at age ten and adult height because the
second quantity "contains" the first quantity.

6. Correlation should not be used to study the relation between an initial
measurement, X, and the change in that measurement over time, Y - X. X will
be correlated with Y - X due to the regression to the mean phenomenon.

7. Small correlation values do not necessarily indicate that two variables are
unassociated. For example, Pearson's rp will underestimate the association
between two variables that show a quadratic relationship. Scatterplots should
always be examined.

8. Correlation does not imply causation. If a strong correlation is observed
between two variables A and B, there are several possible explanations: (a) A
influences B; (b) B influences A; (c) A and B are influenced by one or more
additional variables; (d) the relationship observed between A and B arose
by chance.

9. "Regular" correlation coefficients are often published when the researcher
really intends to compare two methods of measuring the same quantity with
respect to their agreement. This is a misguided analysis, because correlation
measures only the degree of association; it does not measure agreement. The
next section of this lesson will present a measure of agreement.
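
Caveat 7 above is easy to demonstrate numerically. In the hypothetical Python sketch below, Y is completely determined by X through a quadratic relationship, yet the Pearson coefficient is zero because the relationship has no linear component over the chosen range. The data are made up for illustration.

```python
import math

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    s_xx = sum((a - xbar) ** 2 for a in x)
    s_yy = sum((b - ybar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: Y is an exact quadratic function of X,
# with X values placed symmetrically around zero.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r_quadratic = pearson_r(x, y)
print(r_quadratic)
```

A scatterplot of these points would show the U-shaped pattern immediately, which is why the plot should be examined before relying on the coefficient.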

18.6 - Concordance Correlation Coefficient for Measuring Agreement

How well do two diagnostic measurements agree? Many times continuous
units of measurement are used in the diagnostic test. We may not be
interested in correlation or linear relationship between the two measures, but
in a measure of agreement.

The concordance correlation coefficient, r_c, for
measuring agreement between continuous variables X and Y (both
approximately normally distributed), is calculated as follows:

\[ r_c = \frac{2 S_{XY}}{S_{XX} + S_{YY} + (\bar{X} - \bar{Y})^2} \]

Similar to the other correlation coefficients, the concordance correlation
satisfies -1 ≤ r_c ≤ +1. A value of r_c = +1 corresponds to perfect agreement. A
value of r_c = -1 corresponds to perfect negative agreement, and a value of r_c =
0 corresponds to no agreement. The sample estimate, r_c, is an estimate of
the population concordance correlation coefficient:

\[ \rho_c = \frac{2 \sigma_{XY}}{\sigma_{XX} + \sigma_{YY} + (\mu_X - \mu_Y)^2} \]
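
The difference between association and agreement shows up clearly in a small numerical sketch. In the hypothetical Python example below, method Y always reads exactly 5 units higher than method X, so the two are perfectly correlated but do not agree. The function names and data are illustrative assumptions, not the ACRN data or the course's SAS program.

```python
import math

def moments(x, y):
    """Sample means, variances and covariance (n - 1 divisor)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xx = sum((a - xbar) ** 2 for a in x) / (n - 1)
    s_yy = sum((b - ybar) ** 2 for b in y) / (n - 1)
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    return xbar, ybar, s_xx, s_yy, s_xy

def pearson_r(x, y):
    _, _, s_xx, s_yy, s_xy = moments(x, y)
    return s_xy / math.sqrt(s_xx * s_yy)

def concordance_cc(x, y):
    """r_c = 2 S_XY / (S_XX + S_YY + (xbar - ybar)^2)."""
    xbar, ybar, s_xx, s_yy, s_xy = moments(x, y)
    return 2 * s_xy / (s_xx + s_yy + (xbar - ybar) ** 2)

# Hypothetical readings: method Y is method X plus a constant bias of 5.
x = [10.0, 12.0, 14.0, 16.0, 18.0]
y = [xi + 5.0 for xi in x]

print(pearson_r(x, y), concordance_cc(x, y), concordance_cc(x, x))
```

Here S_XX = S_YY = S_XY = 10 and the squared mean difference is 25, so r_c = 20 / 45, about 0.44, even though r_p = 1. The squared difference in means in the denominator is what penalizes the systematic bias.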

Let's look at an example that will help to make this concept clearer.
SAS Example (19.2_agreement_concordanc.sas [2]) : The ACRN DICE trial
was discussed earlier in this course. In that trial, participants underwent hourly
blood draws between 08:00 PM and 08:00 AM once a week in order to
determine the cortisol area-under-the-curve (AUC). The participants hated
this! They complained about the sleep disruption every hour when the nurses
came by to draw blood, so the ACRN wanted to determine for future studies if
the cortisol AUC calculated on measurements every two hours was in good
agreement with the cortisol AUC calculated on hourly measurements. The
baseline data were used to investigate how well these two measurements
agreed. If there is good agreement, the protocol could be changed to take
blood every two hours.

Run the program to view the output. This is higher-level SAS than you are expected to program yourself in this course.

The SAS program yielded rc = 0.95 and a 95% confidence interval = (0.93,
0.96). The ACRN judged this to be excellent agreement, so it will use two-
hourly measurements in future studies.

What about binary or ordinal data? Cohen's Kappa Statistic will handle this...

18.7 - Cohen's Kappa Statistic for Measuring Agreement

Cohen's kappa statistic, κ, is a measure of agreement between categorical
variables X and Y. For example, kappa can be used to compare the ability of
different raters to classify subjects into one of several groups. Kappa also can
be used to assess the agreement between alternative methods of categorical
assessment when new techniques are under study.

Kappa is calculated from the observed and expected frequencies on the
diagonal of a square contingency table. Suppose that there are n subjects on
whom X and Y are measured, and suppose that there are g distinct
categorical outcomes for both X and Y. Let f_ij denote the number of subjects
with the ith categorical response for variable X and the jth categorical
response for variable Y.

Then the frequencies can be arranged in the following g × g table:

        Y=1     Y=2     ...     Y=g
X=1     f_11    f_12    ...     f_1g
X=2     f_21    f_22    ...     f_2g
...     ...     ...     ...     ...
X=g     f_g1    f_g2    ...     f_gg

The observed proportional agreement between X and Y is defined as:

\[ p_0 = \frac{1}{n}\sum_{i=1}^{g} f_{ii} \]

and the expected agreement by chance is:

\[ p_e = \frac{1}{n^2}\sum_{i=1}^{g} f_{i+} f_{+i} \]

where f_{i+} is the total for the ith row and f_{+i} is the total for the ith column. The
kappa statistic is:

\[ \hat{\kappa} = \frac{p_0 - p_e}{1 - p_e} \]

Cohen's kappa statistic is an estimate of the population coefficient:

\[ \kappa = \frac{\Pr[X = Y] - \Pr[X = Y \mid X \text{ and } Y \text{ independent}]}{1 - \Pr[X = Y \mid X \text{ and } Y \text{ independent}]} \]

Generally, 0 ≤ κ ≤ 1, although negative values do occur on occasion. Cohen's
kappa is ideally suited for nominal (non-ordinal) categories. Weighted kappa
can be calculated for tables with ordinal categories.
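
The unweighted kappa calculation can be sketched as follows in Python. The function name and the 2 × 2 table of counts are hypothetical; they are not the radiologist data from the SAS example, and the sketch does not cover weighted kappa.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa from a square table of counts f_ij."""
    g = len(table)
    n = sum(sum(row) for row in table)
    # Observed proportion of agreement: the diagonal of the table.
    p0 = sum(table[i][i] for i in range(g)) / n
    # Row totals f_{i+} and column totals f_{+i}.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(g)) for j in range(g)]
    # Agreement expected by chance if X and Y were independent.
    pe = sum(row_totals[i] * col_totals[i] for i in range(g)) / n ** 2
    return (p0 - pe) / (1 - pe)

# Hypothetical ratings of 50 subjects by two raters (rows: X, columns: Y).
ratings = [
    [20, 5],
    [10, 15],
]
kappa = cohen_kappa(ratings)
print(round(kappa, 4))
```

For this table p_0 = 35/50 = 0.7 and p_e = (25 × 30 + 25 × 20) / 2500 = 0.5, giving kappa = 0.2 / 0.5 = 0.4. The raters agree on 70% of subjects, but half of that agreement is what chance alone would produce.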

SAS Example (19.3_agreement_Cohen.sas [3]) : Two radiologists rated 85
patients with respect to liver lesions. The ratings were designated on an
ordinal scale as:

0 = 'Normal', 1 = 'Benign', 2 = 'Suspected', 3 = 'Cancer'

SAS PROC FREQ provides an option for constructing Cohen's kappa and
weighted kappa statistics.

The weighted kappa coefficient is 0.57 and the asymptotic 95% confidence
interval is (0.44, 0.70). This indicates that the amount of agreement between
the two radiologists is modest (and not as strong as the researchers had
hoped it would be).

Note: Updated programs for examples 19.2 and 19.3 are in the folder for this
lesson. Take a look.

18.8 - Summary

In this lesson, among other things, we learned how to:

recognize appropriate use of Pearson correlation, Spearman correlation,
Kendall's tau-b and Cohen's Kappa statistics.

use a SAS program to produce confidence intervals for correlation
coefficients and interpret the results.

adapt a SAS program to produce the correlation coefficients, their
confidence intervals and Kendall's tau-b.

recognize situations that call for the use of a statistic measuring
concordance.

distinguish between a concordance correlation coefficient and a Kappa
statistic based on the type of data used for each.

interpret a concordance correlation coefficient and a Kappa statistic.

Let's put what we have learned to use by completing the following homework
assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in
ANGEL.