
Introduction

Upon completion of this lesson, you should be able to:

Define the term clinical trial.

Recognize the role of statistics in evaluating new medical therapies.

Apply basic research concepts to a clinical trial setting.

What is the role of statistics in clinical research?

Statistics provides formal methods for evaluating new therapies, assessing the relative benefits of competing therapies, and establishing optimal treatment combinations. Clinical research attempts to answer questions such as: should a man with prostate cancer undergo radical prostatectomy, radiation, or watchful waiting? Is the incidence of serious adverse effects among patients receiving a new pain-relieving therapy greater than the incidence of serious adverse effects in patients receiving the standard therapy?

Historically, clinical researchers attempted to answer such questions by generalizing from the experiences of individual patients to the population at large. Clinical judgement and reasoning were applied to reports of interesting cases. The concepts of variability among individuals and its sources were not formally addressed.

As the field of statistics, the theoretical science or formal study of the inferential process, especially the planning and analysis of experiments, surveys, and observational studies (Piantadosi 2005), developed in the twentieth century, clinical research began to utilize statistical methods to provide a formal accounting for sources of variability in patients' responses to treatment.

The use of statistics allows clinical researchers to draw reasonable and

accurate inferences from collected information and to make sound decisions

in the presence of uncertainty. Mastery of statistical concepts can prevent

numerous errors and biases in medical research.

(Piantadosi, 2005)

Statistical expertise is needed at every stage of virtually all research investigations, including planning the study, selecting the sample, managing the data, and interpreting the results.

Clinical researchers must generalize from the few to the many and combine empirical evidence with theory. In both medical and statistical sciences, empirical knowledge is generated from observations and data. Medical theory is based upon established biology and hypotheses. Statistical theory is derived from mathematical and probabilistic models (Piantadosi 2005). Establishing a hypothesis requires both a theoretical basis in biology and statistical support for the hypothesis, based on the observed data and the theoretical statistical model.

What constitutes a clinical trial?

A clinical trial is an experiment: a study in which the conditions are under the control of the scientist, and one that is conducted in human subjects. The clinical investigator controls factors that contribute to variability and bias, such as the selection of subjects, application of the treatment, evaluation of outcome, and methods of analysis. The distinction of a clinical trial from other types of medical studies is the experimental nature of the trial and its occurrence in humans.

Although the researcher designs a trial to control variability due to factors

other than the treatment of interest, there is inherently larger variability in

research involving humans than in a controlled laboratory situation.

The term clinical trial is preferred over clinical experiment because the

latter may connote disrespect for the value of human life.

Clinical trials are used to develop and test interventions in nearly all areas of

medicine and public health. In many countries, approval for marketing new

drugs hinges on efficacy and safety results from clinical trials. Similar

requirements exist for the marketing of vaccines. The U.S. Food and Drug

Administration (FDA) now requires manufacturers of new or high-risk medical

devices to provide data demonstrating clinical safety and effectiveness. (Scott

2004). Surgical interventions pose unique challenges since surgical

approaches are typically undertaken for patients with a good prognosis and

may not be amenable to randomization or masking investigators and patients

to the intervention, all conditions which can lead to biases. Clinical trials are

useful for demonstrating efficacy and safety of various medical therapies,

preventative measures and diagnostic procedures if the treatment can be

applied uniformly and potential biases controlled.

Clinical trials are often conducted to confirm findings from earlier studies. When the results of a study are

surprising or contradict biological theory, a confirmatory trial may follow.

Medical practice generally does not change based upon the results of one

study. Design flaws, methodological errors, problems with study conduct, or

analysis and reporting mistakes can render a clinical trial suspect. Hence,

confirmation of results in a replicative study, or a trial extending the use of the

therapy to a different population, is often warranted.

Clinical trials require the cooperative effort of physicians, patients, nurses, data managers, methodologists, and statisticians. Patient recruitment can be difficult. Some multi-center (across institutions) clinical trials cost up to hundreds of millions of

dollars and take five years or more to complete. Prevention trials, conducted

in healthy subjects to determine if treatments prevent the onset of disease,

are important but the most cumbersome, lengthy, and expensive to conduct.

Many studies have a window of opportunity during which they are most

feasible and will have the greatest impact on clinical practice. For comparative

trials, the window usually exists relatively early in the development of a new

therapy. If the treatment becomes widely accepted or discounted based on

anecdotal experience, it may become impossible to formally test the efficacy

of the procedure. Even when clinicians remain unconvinced of efficacy or

relative safety, patient recruitment can become problematic.

Some important medical advances have been made without the formal

methods of controlled clinical trials, i.e., without randomization, statistical

design, and analysis. Examples include the use of vitamins, insulin, some

antibiotics, and some vaccines.

Certain conditions must be satisfied for a non-experimental comparative design to provide valid and convincing evidence:

The study subjects have to provide valid observations for the biological question.

The natural history of the disease with standard therapy, or in the absence of therapy, must be known.

The treatment effect must be large compared with random error and bias.

Examples of non-experimental designs that can yield convincing evidence of

treatment efficacy can be found among epidemiological studies, historically-

controlled trials, and from data mining.

1.2 - Summary

In this first lesson, Clinical Trials as Research, we learned to:

Define the term clinical trial.

Recognize the role of statistics in evaluating new medical therapies.

Apply basic research concepts to a clinical trial setting.

Introduction

Clinical trials must answer important scientific questions without impairing the welfare of participants. Ethical obligations to trial participants and to science and medicine pertain to all stages of a clinical trial: design, conduct, and reporting of results (Friedman, Furberg, DeMets, Reboussin, Granger, 2015, Chapter 2).

Upon completion of this lesson, you should be able to:

State the conditions required to justify conducting a medical experiment in humans.

Define the condition that is required to justify randomizing a patient to a treatment.

Recognize that IRB review and approval are required for research involving any product regulated by the U.S. FDA.

Recognize the requirement to report significant financial interests that may affect design, conduct or reporting of research regulated by the U.S. Public Health Service.

References:

Annas GJ and Grodin MA. (1992) The Nazi doctors and the Nuremberg Code:

Human Rights in Human Experimentation. New York: Oxford University Press.

Carter RL, Scheaffer RL, Marks RG. (1986) The role of consulting units in

statistics departments. Am. Stat. 40:260-264.

Friedman, L.M., Furberg, C.D., DeMets, D., Reboussin, D.M., Granger, C.B.

(2015). Chapter 2: Ethical Issues. In: Friedman, L.M., Furberg, C.D., DeMets, D., Reboussin, D.M., Granger, C.B. Fundamentals of Clinical Trials. 5th ed.

Switzerland: Springer International Publishing. (Notes will refer to Friedman

et al 2015)

Piantadosi, Steven. (2005) Clinical trials as research; Why clinical trials are ethical; Contexts for clinical trials. In: Piantadosi, Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

evidence. PhD dissertation: Johns Hopkins University, Baltimore, MD.

Physician-Patient Relationship

One area of ethical dilemma of physicians and health care workers can be

attributed to the conflicting roles of helping the patient and gaining scientific

knowledge, as stated by Schafer (1982):

In his traditional role of healer, the physician's commitment is exclusively to

his patient. By contrast, in his modern role of scientific investigator, the

physician engaged in medical research or experimentation has a commitment

to promote the acquisition of scientific knowledge.

Yet a well-designed clinical trial can be an appropriate way to acquire new knowledge. Clinical decisions for treatment that are based on weak or anecdotal evidence, opinion, or dogma, without rigorous scientific support, raise their own ethical questions.

How often is a physician certain of the outcome from a specific therapy for a

particular patient? If the patient's reaction is predictable, applying the

treatment would be described as practice. In the cases when the physician is

unsure of the outcome, applying the treatment could be considered research.

Many actions by the physician for the benefit of individual patients have the

potential of increasing scientific knowledge. Analogously, scientific knowledge

gained from research can be of benefit to individual patients. Ethical questions

arise when unproven therapies are proposed to replace proven ones and are

particularly acute for chronic or fatal illnesses.

Clinical trials are only one of several settings in which the physician's duty

extends beyond his responsibility to the individual patient. For example,

vaccinations against communicable disease are promoted by physicians, yet

the individual vaccinated incurs a small risk to benefit the population as a whole.

Triage is another situation where for the sake of maximizing benefit to the

whole, the needs of an individual may not be met.

The American Medical Association (AMA) code of medical ethics includes the obligation of a physician to conduct medical research: "A physician shall continue to study, apply and advance scientific knowledge, maintain a commitment to medical education, make relevant information available to patients, colleagues and the public." (AMA, 2001). There is also a statement of the physician's responsibility to the patient as paramount. Both obligations must be considered when the health professional considers entering a patient in a clinical trial.

All clinical investigators should have training in research ethics. The US NIH

website has resources [1] for training in the areas of scientific integrity, data,

publication, peer review, mentor/trainee relationships, collaboration, human

and animal subjects and conflict of interest. Many funding sources, including

the US NIH and NSF require responsible conduct of research training for all

students, trainees, fellows, scholars and faculty utilizing their funds to conduct

research.

Conflict of Interest:

Investigators are required by the US Public Health Service, the umbrella organization for the FDA and the NIH, to report any significant financial interest that would appear to affect the design, conduct, or reporting of research. (See Guidance [2] from the FDA.)

Competence:

Investigators must be competent both scientifically (through education, certification, experience) and humanistically (showing compassion and empathy).

The current requirements for investigators and considerations in conducting

ethical research should be understood within the context of past abuses,

some horrific.

Among the most horrific were the atrocities committed in WWII concentration camps by Nazi physicians under the guise of experimentation. Of the 23 individuals tried for such crimes at Nuremberg,

20 were physicians. Sixteen of the 23 were convicted and given sentences

ranging from imprisonment to death. (Annas and Grodin, 1992)

At the time of the Nuremberg trial, there were no international standards for

ethical conduct in human experimentation. This resulted in the Nuremberg

Code, or directives for human experimentation, adopted in 1947:

3. The anticipated results must have a basis in biological knowledge and animal experimentation such that the experiment has potential for yielding fruitful results for the good of society.

4. The experiment should be conducted so as to avoid all unnecessary physical and mental suffering and injury.

6. The degree of risk for the patient should not exceed the humanitarian importance of the problem to be solved.

7. Proper preparations should be made and adequate facilities provided to protect the subject against even remote possibilities of death or injury.

8. The experiment should be conducted only by scientifically qualified persons, with the highest degree of skill and care throughout the experiment.

10. The scientist in charge must be prepared to terminate the experiment at any stage if injury, disability or death of the subject seems likely.

In 1964, the World Medical Association (WMA) met in Helsinki, Finland, and adopted a formal code of ethics for physicians engaged in clinical research. The Declaration of Helsinki [3] reiterates the principles of the Nuremberg Code, with particular attention to the duty of the physician to

protect the life, health and dignity of the human subject. The Declaration

states that research involving human subjects must be formulated in a written

protocol which has been reviewed by an ethical review committee distinct

from the investigator, must conform to generally accepted scientific principles

and should include written informed consent from the participants. Negative

as well as positive results should be disseminated and funding sources

disclosed.

The Declaration has been revised several times, and discussion has ensued regarding the statements on the ethics of placebo

control and the duty of the study planners to provide post-study access for all

study participants to what is regarded as beneficial treatment. The 2008

revision affirms the WMA position on the primacy of the patient and outlines

required consents for research on human material, such as blood, tissues,

and DNA, and human data as well as requiring clinical trials to be registered in

a publicly accessible database.

Another ethical breach in the United States began in 1932, when the US Public Health Service began a study of untreated syphilis

in Tuskegee, Alabama (399 men with advanced disease and 201 controls).

The study continued long after the availability of penicillin, a proven cure, in

the 1950s. The study was stopped in the early 1970s after it was publicized

and became an embarrassment to the country. In response to the Tuskegee

Syphilis Study, the US Congress established the National Commission for the

Protection of Human Subjects of Biomedical and Behavioral Research

through the 1974 National Research Act. This Commission produced

the Belmont Report in 1979 which distilled basic ethical guidelines in

research with human subjects to three principles: respect for persons or

individual autonomy, beneficence and justice.

Respect for persons (individual autonomy) means that patients have the

right to decide what should be done for them with respect to their illness

unless the result would be clearly detrimental to others. Respect for persons

means that potential subjects for clinical trials are informed of alternative

therapies and risks and benefits of participation in a particular trial before they

volunteer to participate in that study. Since clinical trials often require

participants to surrender some measure of autonomy in order to be

randomized to treatment and follow the established protocol, these aspects

will be described to the potential subject, along with their freedom to choose to

discontinue the study at any time.

Beneficence refers to the obligation to protect participants from harm and to maximize the possible benefits of treatment. Investigators are obliged to make practical and useful assessments

of the risks and benefits involved in research, which necessitates resolving the

potential conflict between risk to participants and benefit to future patients.

The beneficence obligation extends to both the particular subjects in a study

and to the research endeavor.

Justice addresses the question of fairly distributing the benefits and burdens

of research. Compensation for injury due to research is an application of

justice. Injustice occurs when benefits are denied without good reason or

when burdens are unduly imposed on particular individuals, such as the poor

or uninsured.

These principles are applied in the requirements for informed consent of

subjects, in assessment of risks and benefits and fair procedures and

outcomes and in the selection of subjects.

The World Health Organization (WHO) and the Council for International Organizations of

Medical Sciences (CIOMS) issued a document entitled International Ethical

Guidelines for Biomedical Research Involving Human Subjects [4], with its

latest revision published in 2002. UNESCO set forth a Universal Declaration

on Bioethics and Human Rights [5] in 2005.

IRB

The U.S. National Institutes of Health Policies for the Protection of Human

Subjects (1966) established the IRB (Institutional Review Board) as a

mechanism for the protection of human participants in research. In 1981, U.S.

regulations required IRB approval for all drugs or products regulated by the

US Food and Drug Administration (FDA), without regard to the funding source,

the research volunteers, or the location of the study. In 1991, core US DHHS

regulations (45 CFR Part 46, Subpart A) were adopted by most Departments

and Agencies involved in research with human subjects. This Federal Policy

for the Protection of Human Subjects, known as the "Common Rule" [7], requires Institutional Review Board (IRB) review for all research

funded in whole or in part by the U.S. federal government.

To approve a study, the IRB must determine that it satisfies prerequisites set forth by the FDA (http://www.fda.gov/oc/ohrt/irbs/review.html) [8], including:

4. Informed consent will be sought from each prospective participant

5. There are adequate provisions for monitoring data collected to ensure the safety of the study participants

6. The privacy of the participants and the confidentiality of the data are protected

The investigator must submit his/her research plan, called the protocol, and the informed consent form to the IRB for approval prior to the conduct of the study. The IRB also requires the investigator to provide the following:

1. Periodic progress reports for continuing review

2. Reports of any serious adverse events in the human subjects when they occur

Informed Consent:

The principle of respect for persons implies that each study participant will be

made aware of potential risks, benefits and costs prior to participating in a

clinical study. To document this, study participants (or parents/legally authorized representatives) sign an informed consent document prior to participation in a research study. The patient attests to having been informed of the potential risks and benefits of participation in the clinical study, to understanding the available treatment alternatives, and to the voluntary nature of participation. There are numerous examples of studies in which patients have been

exposed to potentially or definitively harmful treatments without being fully apprised of the risk. The

consent document should be presented without any coercion. Even so, ill or dying patients and their

families are vulnerable, and it is questionable how much technical information about new treatments they

can truly understand, especially when it is presented to them quickly.

In the United States and many other countries, an IRB must evaluate and

approve the informed consent documents prior to beginning a study.

Among the elements addressed in an informed consent document are:

Treatment for injuries incurred

The voluntary nature of the study and the possibility of withdrawal at any time

Obtaining consent for research in emergency medical care is a special situation. The FDA and the National Institutes of Health (NIH) have offered guidance [9] for research in emergency medical situations.

Applying ethical considerations in the planning and design phase of a study requires optimal study design, weighing the balance of risk versus benefit, consideration of patient confidentiality, and plans for impartial oversight of informed consent procedures. All involved in clinical research have a responsibility to promote high-quality clinical trials in order to provide evidence to guide medical decisions (Friedman et al 2015).

Study Question:

The study must address a question important enough to justify possible adverse effects of the treatment that will be administered. A

study design that cannot answer the biological question is unethical. Studies

that pose unimportant questions are unethical as well, even if they pose

minimal risk. What marketing question would have enough benefit to justify

risks to subjects? What about situations where there is an approved and

available therapy?

The study question also involves the choice of the study population. Which

population can answer the question? Does the potential benefit outweigh the

risk to these study subjects? Is the selection just?

The Belmont principles translate into three requirements for the conduct of research, namely, informed consent, disclosure of the risks and benefits, and the appropriate selection of research subjects. Applying ethical principles also requires optimal study design, a balance of risk and benefit for study participants, consideration of patient privacy, and impartial oversight of consent procedures.

Study Sites:

The choice of the study population is also related to the place(s) the trial will

be conducted. There is greater generalizability and potentially faster

enrollment if a trial is conducted in multiple and varied geographic locales.

The concept of justice should be applied: is the location selected due to prevalent disease and relevance of the results, or for sponsor convenience, such as lower cost and fewer administrative and regulatory burdens? Is the standard of care in this country less than optimal care, and thus are event rates higher? What obligations do the trial sponsors have to the participants or to

residents of the country once the trial is complete? Will the treatment be

available in this locale once the trial is complete?

Randomization:

Some physicians and patients are uncomfortable basing the choice of treatment on chance, as is done in a randomized clinical trial. Some feel that

the physician is obligated to have a preference, even when the evidence does

not favor any particular treatment. Randomization is justified when there is

relative ignorance (collectively) about the best treatment. The situation in

which there is genuine uncertainty as to the best available therapy is

called equipoise. (Piantadosi 2005)

Patients and physicians with firm preferences for treating a particular disease,

even those based on weak evidence, should not participate in a clinical trial

involving that disease. Patients with strong convictions about preferred

treatments are likely to become easily dissatisfied with randomization to a

treatment in the clinical trial. Physicians with strong convictions could bias the

clinical trial in a different direction, especially if they are not blinded to

treatment assignment.

In some situations, the unit of randomization is a larger entity, such as a hospital or

community. How would individuals consent to such research?

Control Group:

A control group provides a comparison with the experimental treatment to determine if the treatment is

safe and effective. Patients assigned to placebo may receive a facsimile of the

active therapy without the knowledge of whether or not the active ingredient is

present; thus, any observed effect is considered a result of the active agent

and not the process of being treated.

Placebo control is untenable when the disease is life-threatening and an

effective therapy is available. A better approach when there is an effective and

available therapy and/or the condition is life-threatening is to use the standard

accepted therapy as an active control treatment. Comparison is made

between the active control and the experimental therapy. Another option is possible if the new therapy can be given in combination with standard therapy: all subjects receive standard therapy, and randomization is to new therapy plus standard or placebo plus standard.

Should the new intervention be compared with the best known therapy or with

placebo? Will a placebo control result in significant harm to subjects? What if

there is no accepted optimal therapy? What if the optimal therapy is very

costly or not available in some locations? The selection of the control group

has many ethical considerations.

Confidentiality:

The U.S. Department of Health and Human Services (HHS) issued the

Standards for Privacy of Individually Identifiable Health Information (the

Privacy Rule) under the Health Insurance Portability and Accountability Act of

1996 (HIPAA) to provide the first comprehensive Federal protection for

the privacy of personal health information. This became effective on April

14, 2003.

While certain provisions of the Rule specifically concern research and may

affect research activities, the Privacy Rule recognizes that the research

community has legitimate needs to use, access, and disclose Protected

Health Information (PHI) to carry out a wide range of health research

protocols and projects. The Privacy Rule protects the privacy of such

information while providing ways in which researchers can access and use

PHI when necessary to conduct research. The DHHS web site

(http://privacyruleandresearch.nih.gov/ [10]) should be examined for further

information about HIPAA requirements on research.

2.5 - Conduct

Recruitment:

While a study must recruit enough subjects to answer its question adequately, recruitment must also follow ethical norms. Coercion should be avoided. For this reason, any financial compensation for the subject's time and travel should reflect actual expenses or small amounts that would not entice a person to enroll in the study for financial gain; study personnel other than the primary investigator or the patient's doctor may be designated to ask for informed consent.

Monitoring:

During the course of a comparative trial, evidence may become available that

one treatment is superior. Interim statistical analyses may be incorporated

into the study design to provide periodic investigations of treatment superiority

prior to study completion without sacrificing the statistical integrity of the trial (discussed later in this course). Should patients receiving an inferior treatment

continue in this manner? If there is evidence that a particular type of patient is

unlikely to respond to therapy, should entrance criteria be modified? Is the

adverse experience profile markedly worse for one therapy? Investigators are

required to report such circumstances to their IRB.

An independent committee may be established to monitor the trial results and render decisions as to whether the trial should

continue or be modified in some manner. Safety monitoring is required.

These committees have various names, such as Data and Safety Monitoring

Board, External Advisory Committee, etc.

A trial should be undertaken only if the resources and anticipated enrollment are adequate to complete the study. Early termination for reasons other than science or safety reflects a lack of ethical concern. Subjects agreed to

participate so an important question could be answered. There should be an

answer.

Data Integrity:

Investigators are responsible for the integrity of the data; on-site monitoring and other procedures may help detect potential fraud. See George and Buyse (2015), Data Fraud in Clinical Trials [11].

2.6 - Reporting

Reporting:

Registration of trials on clinicaltrials.gov and publishing associated results

online can reduce publication bias, the bias resulting from journals favoring

studies with significant results. Investigators should publish their findings or find other mechanisms to disseminate results effectively.

Authorship:

'Ghost authorship' occurs when people writing a paper are not fully disclosed

(e.g., a draft written by a contract writer) or when authors are included who did not

actually participate in the research project (for example, an influential name).

Journals combat such deception by asking authors to specify the contribution

of each person listed as an author.

The American Statistical Association and the Royal Statistical Society have published similar guidelines (professional statements) for the conduct of their members, including the following concerning data collected from human participants:

1. Collect only the data needed for the purposes of the inquiry

2. Inform each participant about the nature and sponsorship of the project and intended uses of the data

3. Protect the confidentiality of information provided by participants

4. Put appropriate data security protections in place

5. Limit the disclosure of participant-identifying information to other persons or organizations

2.8 - Summary

In this second lesson, Ethics of Clinical Trials, we learned:

To state the conditions required to justify conducting a medical experiment in humans.

To define the condition (equipoise) required to justify randomizing a patient to a treatment.

To differentiate between ethical and unethical use of placebo control.

To recognize that IRB review and approval are required for research involving any product regulated by the U.S. FDA.

To recognize the requirement to report significant financial interests that may affect design, conduct or reporting of research regulated by the U.S. Public Health Service.

Introduction

The principles of experimental design were developed in agricultural, laboratory, and industrial research before being applied to trials of pharmaceuticals in humans. Experimental design is characterized by control of the experimental process to reduce experimental error, replication of the experiment to estimate variability in the response, and randomization. For

example, in comparing the yields of two varieties of corn, the experimenter

uses the same type of corn planter and the same fertilizer and weed control

methods in each test plot. Multiple plots of ground are planted with the two

varieties of corn. The assignment of a seed variety to a test plot is

randomized.

Clinical trial design has its roots in classical experimental design, yet has

some different features. The clinical investigator is not able to control as many

sources of variability through design as a laboratory or industrial experimenter.

Human responses to medical treatments display greater variability than

observations from experiments in genetically identical plants and animals or

measuring effects of tightly-controlled physical and chemical processes. And

of course, ethical issues are paramount in clinical research. To study a clinical

response with adequate precision, a trial may require lengthy periods for

patient accrual and follow-up. It is unlikely that all the study subjects will be enrolled on the same day, and there is opportunity for study volunteers to decide to no longer participate.

This lesson relates the principles of experimental design to clinical trials.

Learning objectives & outcomes

1. State 6 general objectives that will be met with proper trial design.

2. Recognize potential sources of bias and confounding effects in a proposed clinical study.

4. Compare and contrast the following study designs with respect to the

ability of the investigator to minimize bias: Case report or case series,

database analysis, prospective cohort study, case-control study, parallel

design clinical trial, crossover clinical trial.

5. Describe the phases (I-IV) in the development of a clinical trial.

Good trial design and conduct are far more important than selecting the

correct statistical analysis. When a trial is well designed and properly

conducted, statistical analyses can be performed, modified, and if necessary,

corrected. On the other hand, inaccuracy (bias) and imprecision (large

variability) in estimating treatment effects, the two major shortcomings of

poorly designed and conducted trials, cannot be ameliorated after the trial.

Skillful statistical analysis cannot overcome basic design flaws.

Piantadosi (2005) states that clinical trial design should accomplish several general objectives; among them, a well-designed trial:

3. Isolates the treatment effect of interest from confounders

4. Controls precision

A good design allows the investigator to estimate treatment effects or estimate differences in treatment effects. Precise statements about observed treatment effects depend on a study design that allows the treatment effect to be sorted out from person-to-person variability in response. An accurate estimate requires a study design that minimizes bias.

Comparing Clinical Trials to Observational Studies

Medical research, as a scientific investigation, is based on careful

observation and theory. Theory directs the observation and provides a basis

for interpreting the results. The strength of the evidence from a clinical study is

proportional to amount of the control of bias and variability when the study

was conducted as well as the magnitude of the observed effect. Clinical

studies can be characterized as uncontrolled observations, observational

comparative and controlled clinical trials.

In a case report, there is no control of treatment assignment, endpoint

ascertainment, or confounders. There is no control group for the sake of

comparison. The report is descriptive in nature, not a formal statistical

analysis.

A case report may, however, suggest a hypothesis for future testing. For example, a physician may report

that a patient in his practice, who was taking a specific

anorexic drug, developed primary pulmonary

hypertension (PPH), a rare condition that occurs in 1-

2 out of every million Americans. Is this convincing

evidence that the anorexic drug causes PPH?

A case series, a collection of reports on similar patients, may be more persuasive than a single case report, but cannot prove efficacy of a treatment. Case series and case reports are susceptible to large selection biases. Consider the example of laetrile, an apricot pit extract that was reputed to cure cancer. Seven case series were reported; the strength of evidence from these studies has been summarized by the US National Cancer Institute (NCI [1]). While a

proportion of patients may have experienced spontaneous remission of

cancer, rigorous testing in controlled environments was never performed. After

an estimated 70,000 patients had been treated, the NCI undertook a

retrospective analysis of laetrile only to decide no definite conclusions

supporting anti-cancer activity could be made. (Ellison 1978 abstract[2]).

The Cochrane review on laetrile [3] (2015) states that there is no reliable evidence for the alleged curative effects of laetrile or amygdalin in cancer patients. Based on a series of reported cases, many believed laetrile would cure their cancer, perhaps refusing other effective treatments and subjecting themselves to the adverse effects of cyanide. This continued for many years, with the anti-tumor efficacy of laetrile unsupported while associated adverse effects came to light.

A database analysis may or may not include a comparison group, depending on the data source. The source and quality of the data used

for this secondary analysis is key. If the analysis attempts to evaluate

treatment differences from data in which treatment assignment was based on

physician and patient discretion, nonrandomized and open-label, bias is likely.

Databases can be useful for exploratory or hypothesis-generating analyses. For example, the NIH sponsored a database analysis of interstitial

cystitis (IC) during the 1990s. This consisted of data from over 400 individuals

with IC who underwent various and numerous therapies for their condition.

The objective of the database analysis was to determine if there were patterns

of treatments that may be effective in treating the disease (Rovner et al. 2000 [4]).

Data-mining tools have been developed to search for patterns in large databases of

genetic data, leading to the discovery of particular candidate genes.

Case-control studies and prospective cohort studies are comparative observational studies. An observational study lacks the key

component of an experiment, namely, control over treatment assignment.

Commonly these designs are used in assessing the influence of risk factors

for a disease. Subjects meeting entrance criteria may have been identified

through a database search. The choice of the control group is a crucial design

component in observational studies.

A case-control study selects cases (subjects with the disease) and controls (subjects without the disease) and retrospectively assesses some type of treatment or exposure. Because the investigator has selected the cases and controls, and thereby fixed the proportion of diseased subjects in the sample, relative risk cannot be calculated directly from a case-control study.

Case-control studies often depend on subjects' recall of events that occurred many years previously; thus recall bias

(systematic differences in accuracy or completeness of recall) can affect the

study results.

In a prospective cohort study, individuals are followed forward in time with

subsequent evaluations to determine which individuals develop into cases. The

relationship of specific risk factors that were measured at baseline with the subsequent

outcome is assessed. The cohort study may consist of one or more samples with

particular risk factors, called cohorts. It is possible to control some sources of bias in

a prospective cohort study by following standard procedures in collecting data

and ascertaining endpoints. Since the subjects are not assigned risk factors in

a randomized manner however, there may remain covariates that are

confounded with a risk factor. Sometimes, a particular treatment group (or

groups) from a randomized trial is followed as a cohort, providing a cohort in

which the treatment was assigned at random.

Prospective studies tend to have fewer design problems and less bias than

retrospective studies, but they are more expensive with respect to time and

cost.

For example, suppose a physician identifies 36 patients currently in his practice with a specific form of cardiac valve disease. He identifies another group of relatively healthy patients and matches two of them to each of the patients with cardiac valve disease according to age (± 5 years) and BMI (± 2.5). He plans to interview all 36 + 72 = 108 patients to assess

their use of diet drugs during the past ten years.

A classic example of a cohort study: U.S. National Heart Lung and Blood

Institute Framingham Heart Study [5]

Recall the conditions under which convincing evidence of treatment efficacy can be obtained without experimental comparative studies:

2. The study subjects provide valid observations for the biological question.

3. The natural history of the disease with standard therapy, or in the absence of therapy, is known.

4. The treatment effect is large compared with random error and bias.

A controlled clinical trial follows the principles of experimental design. Treatments are assigned by design; administration of

treatment and endpoint ascertainment follows a protocol. When properly

designed and conducted, especially with the use of randomization and

masking, the controlled clinical trial instills confidence that bias has been

minimized. Replication of a controlled clinical trial, if congruent with the results

of the first clinical trial, provides verification.

In experimental design terminology, the "experimental unit" is randomized to

the treatment regimen and receives the treatment directly. The "observational

unit" has measurements taken on it. In most clinical trials, the experimental

units and the observational units are one and the same, namely, the individual

patient.

In some trials, communities, e.g., geographic regions, are randomized to treatments. For

example, communities (experimental units) might be randomized to receive

different formulations of a vaccine, whereas the effects are measured directly

on the subjects (observational units) within the communities. The advantages

here are strictly logistical - it is simply easier to implement in this fashion.

Another example occurs in reproductive toxicology experiments in which

female rodents are exposed to a treatment (experimental units) but

measurements are taken on the pups (observational units).

A factor is a condition that is controlled by the experimenter and varied during the course of the experiment. For example, treatment is a

factor in a clinical trial with experimental units randomized to treatment.

Another example is pressure and temperature as factors in a chemical

experiment.

Most clinical trials are structured as one-way designs, i.e., only one factor,

treatment, with a few levels.

Temperature and pressure in the chemical experiment are two factors that

comprise a two-way design in which it is of interest to examine various

combinations of temperature and pressure. Some clinical trials may have

a two-way factorial design, such as in oncology where various combinations

of doses of two chemotherapeutic agents comprise the treatments.

An incomplete factorial design may be useful if it is inappropriate to assign

subjects to some of the possible treatment combinations, such as no

treatment (double placebo). We will study factorial designs in a later lesson.

In a parallel design, patients are randomized to a treatment and remain on that treatment throughout the course of the trial. This

is a typical design. In contrast, with a crossover design patients are

randomized to a sequence of treatments and they cross over from one

treatment to another during the course of the trial. Each treatment occurs in a

time period with a washout period in between. Crossover designs are of

interest since, with each patient serving as their own control, there is potential for reduced variability. However, there are potential problems with this type of design. There should be investigation into possible carryover effects, i.e., the residual effects of the previous treatment affecting the subject's response in the

later treatment period. In addition, only conditions that are likely to be similar

in both treatment periods are amenable to crossover designs. Acute health

problems that do not recur are not well-suited for a crossover study. We will

study crossover design in a later lesson.

Randomization provides the basis for valid estimates of error probabilities in experiments. Randomization is recognized as an essential feature of clinical trials for removing selection bias. Selection bias occurs when the investigator or physician systematically selects a certain type of patient for a particular treatment.

Suppose the trial consists of an experimental therapy and a placebo. If the

physician assigns the healthier patients to the experimental therapy and the

less healthy patients to the placebo, the study could result in an invalid

conclusion that the experimental therapy is very effective.

Stratification can be used to account for prognostic factors. For example, suppose a clinical trial is structured to compare treatments A and B in

patients between the ages of 18 and 65. Suppose that the younger patients

tend to be healthier. It would be prudent to account for this in the design by

stratifying with respect to age. One way to achieve this is to construct age

groups of 18-30, 31-50, and 51-65 and to randomize patients to treatment

within each age group.

Age group   Treatment A   Treatment B
18-30       12            13
31-50       23            23
51-65       6             7

It is not necessary to have the same number of patients within each age

stratum. We do, however, want to have balance in the number on each

treatment within each age group. This is accomplished by blocking, in this case, within the age strata. Blocking is a restriction of the randomization process that results in a balance of the numbers of patients on each treatment after

a prescribed number of randomizations. For example, blocks of 4 within these

age strata would mean that after 4, 8, 12, etc. patients in a particular age

group had entered the study, the numbers assigned to each treatment within

that stratum would be equal.
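To make the blocking idea concrete, here is a minimal sketch in Python of permuted-block randomization within strata. It is only an illustration: the block size of 4 and the age strata follow the example above, while the function name and the choice of 30 patients per stratum are hypothetical.

    import random

    def permuted_block_sequence(n_patients, block_size=4, treatments=("A", "B")):
        """Treatment assignments using permuted blocks.

        Within every block of `block_size` consecutive patients each treatment
        appears equally often, so the groups are balanced after 4, 8, 12, ...
        randomizations in a stratum.
        """
        assignments = []
        block = list(treatments) * (block_size // len(treatments))
        while len(assignments) < n_patients:
            random.shuffle(block)        # random order within each block of 4
            assignments.extend(block)
        return assignments[:n_patients]

    # A separate randomization schedule is prepared for each age stratum,
    # so balance between A and B is maintained within 18-30, 31-50 and 51-65.
    schedules = {stratum: permuted_block_sequence(30)
                 for stratum in ["18-30", "31-50", "51-65"]}
    print(schedules["18-30"][:8])        # e.g. ['B', 'A', 'A', 'B', 'B', 'A', 'A', 'B']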

If treatment effects within individual strata are of interest, a subgroup analysis may be performed. In the example, the smaller numbers of patients

in the upper and lower age groups would require care in the analyses of these

sub-groups specifically. However, with the primary question as the effect of

treatment regardless of age, the pooled data in which each sub-group is

represented in a balanced fashion would be utilized for the main analysis.

Even ineffective treatments can appear beneficial in some patients. This may

be due to random fluctuations, or variability in the disease. If, however, the

improvement is due to the patient's expectation of a positive response, this is called a "placebo effect". This is especially problematic when the outcome is

subjective, such as pain or symptom assessment. Placebo effect is widely

recognized and must be removed in any clinical trial. For example, rather than

constructing a nonrandomized trial in which all patients receive an

experimental therapy, it is better to randomize patients to receive either the

experimental therapy or a placebo. A true placebo is an inert or inactive

treatment that mimics the route of administration of the real treatment, e.g., a

sugar pill.

Placebos are not acceptable ethically in many situations, e.g., in surgical

trials. (Although there have been instances where 'sham' surgical procedures

took place as the 'placebo' control.) When an accepted treatment already

exists for a serious illness such as cancer, the control must be an active

treatment. In other situations, a true placebo is not physically possible to

attain. For example, a few trials investigating dimethyl sulfoxide (DMSO) for

providing muscle pain relief were conducted in the 1970s and 1980s. DMSO

is rubbed onto the area of muscle pain, but leaves a garlicky taste in the

mouth, so it was difficult to develop a placebo.

Masking (blinding) refers to concealing the treatment assignment from the patient, the investigator, and/or the person measuring the outcome variables. Masking is especially important

when the measurements are subjective or based on self-

assessment. Double-masked trials refer to studies in which both

investigators and patients are masked to the treatment. Single-masked

trials refer to the situation when only patients are masked. In some studies,

statisticians are masked to treatment assignment when performing the initial

statistical analyses, i.e., not knowing which group received the treatment and

which is the control until analyses have been completed. Even a safety-

monitoring committee may be masked to the identity of treatment A or B, until

there is an observed trend or difference that should evoke a response from

the monitors. In executing a masked trial, great care must be taken to keep the

treatment allocation schedule securely hidden from all except those with a

need to know which medications are active and which are placebo. This could

be limited to the producers of the study medications, and possibly the safety

monitoring board before study completion. There is always a caveat for

breaking the blind for a particular patient in an emergency situation.

Masking is not always feasible; for example, one could not mask a surgeon to the procedure he is to perform.

Even so, some have gone to great lengths to achieve masking. For example,

a few trials with cardiac pacemakers have consisted of every eligible patient

undergoing a surgical procedure to be implanted with the device. The device

was "turned on" in patients randomized to the treatment group and "turned off"

in patients randomized to the control group. The surgeon was not aware of

which devices would be activated.

Some investigators resist randomization or masking as a design feature. This is because they believe that biases are small in relation to the

magnitude of the treatment effects (when the converse usually is true), or that

they can compensate for their prejudice and subjectivity.

Confounding is the effect of other relevant factors on the outcome that may

be incorrectly attributed to the difference between study groups.

Suppose an investigator plans to assign 10 patients to treatment and 10 patients to control. There will be a one-week follow-up on each patient.

The first 10 patients will be assigned treatment on March 01 and the next 10

patients will be assigned control on March 15. The investigator may observe a

significant difference between treatment and control, but is it due to different

environmental conditions between early March and mid-March? The obvious

way to correct this would be to randomize 5 patients to treatment and 5

patients to control on March 01, followed by another 5 patients to treatment

and another 5 patients to control on March 15.

Validity

Internal validity refers to the extent to which an observed difference in outcome between the study groups is real and not due to bias, chance, or

confounding. Randomized, placebo-controlled, double-blinded clinical trials

have high levels of internal validity.

External validity in a human trial refers to how well study results can be

generalized to a broader population. External validity is irrelevant if internal

validity is low. External validity in randomized clinical trials is enhanced by

using broad eligibility criteria when recruiting patients.

Large simple and pragmatic trials emphasize external validity. A large simple trial

attempts to discover small advantages of a treatment that is expected to be used in a

large population. Large numbers of subjects are enrolled in a study with simplified

design and management. There is an implicit assumption that the treatment effect is

similar for all subjects with the simplified data collection. In a similar vein,

a pragmatic trial emphasizes the effect of a treatment in practices outside academic

medical centers and involves a broad range of clinical practices.

Studies of equivalency and noninferiority have different objectives than the usual trial

which is designed to demonstrate superiority of a new treatment to a control. A study to

demonstrate non-inferiority aims to show that a new treatment is not worse than an

accepted treatment in terms of the primary response variable by more than a pre-

specified margin. A study to demonstrate equivalence has the objective of

demonstrating the response to the new treatment is within a prespecified margin in both

directions. We will learn more about these studies when we explore sample size

calculations.
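Written out, and assuming for illustration that larger responses are better and that the comparison is of mean responses (the margin symbol delta and the means for the new and standard treatments are notation introduced here, not taken from the text), the hypotheses can be sketched as:

    % \mu_N, \mu_S: mean response on the new and standard treatments
    % \delta > 0: pre-specified margin

    % Non-inferiority: the new treatment is not worse than standard by more than \delta
    H_0\colon \mu_N - \mu_S \le -\delta
      \quad\text{versus}\quad
    H_1\colon \mu_N - \mu_S > -\delta

    % Equivalence: the treatments differ by less than \delta in either direction
    H_0\colon |\mu_N - \mu_S| \ge \delta
      \quad\text{versus}\quad
    H_1\colon |\mu_N - \mu_S| < \delta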

3.4 - Clinical Trial Phases

When a drug, procedure, or treatment appears safe and effective based on

preclinical studies, it can be considered for trials in humans. Clinical studies of

experimental drugs, procedures, or treatments in humans have been

classified into four phases (Phase I, Phase II, Phase III, and Phase IV) based

on the terminology used when pharmaceutical companies interact with the

U.S. FDA. Greater numbers of patients are assigned to treatment in each

successive phase.


Phase I trials investigate the effects of various dose levels in humans. The

studies are usually done in a small number of volunteers (sometimes persons

without the disease of interest or patients with few remaining treatment

options) who are closely monitored in a clinical setting. The purpose is to

determine a safe dosage range and to identify any common side effects or

readily apparent safety concerns. Data may be collected to provide a

description of the pharmacokinetics and pharmacodynamics of the compound,

estimate the maximum tolerated dose (MTD), or evaluate the effects of

multiple dose levels. Many trials in the early stage of therapy development

either investigate treatment mechanism (TM) or incorporate dose-finding (DF)

strategies.

In a TM trial of a drug, an attempt is made to investigate the bioavailability of the drug at various sites in

the human system. To a surgeon, a TM study investigates the operative

procedure. A DF trial usually tries to determine the maximum tolerated dose,

or the minimum effective dose, etc. Thus, phase I (drug) trials can be

considered TM and DF trials.

A Phase II trial provides a preliminary assessment of efficacy and continues to monitor safety. A Phase II trial may be the first time that the agent

is administered to patients with the disease of interest to answer questions

such as: What is the correct dosage for efficacy and safety in patients of this

type? What is the probability a patient treated with the compound will benefit

from the therapy or experience an adverse effect? Most trials in the middle

stage of therapy development investigate safety and efficacy (SE). The

experimental drug or treatment is administered to as many as several hundred

patients in Phase II trials.

At the end of Phase II, a decision will be made as to whether or not the drug is

promising and development should continue. In the U.S. there will be an End

of Phase II meeting between the pharmaceutical company and the FDA to

discuss safety and plans for Phase III studies. Ineffective or unsafe

compounds should not proceed into Phase III trials.

A Phase III trial is a rigorous clinical trial with randomization, one or more

control groups and definitive clinical endpoints. Phase III trials are often multi-

center, accumulating the experience of thousands of patients. Phase III trials

address questions of comparative treatment efficacy (CTE). A CTE

trial involves a placebo and/or active control group so that precise and valid

estimates of differences in clinical outcomes attributable to the investigational

therapy can be assessed.

If things go well during Phase III, the company with the license for the

compound will submit an application for approval to market the drug. U.S.

FDA approval hinges on adequate and well-controlled pivotal Phase III

studies that are convincing of safety and efficacy.

Phase IV trials occur after regulatory approval of the new therapy. As usage of the new drug becomes widespread, there is an

opportunity to learn about rare side effects and interactions with other

therapies. An expanded safety (ES) study can provide important information

that was not apparent during the drug development. For example, a few

thousand patients might be involved in all of the SE and CTE trials for a

particular therapy. An ES study, however, could involve >10,000 patients.

Such large sample sizes can detect more subtle safety problems for the

therapy, if such problems exist. Some Phase IV studies will have a marketing

objective for the company as well as collecting safety data.

The terminology of phase I, II, III, and IV trials does not work well for non-pharmacologic treatments and does not account for translational trials, or for trials that investigate treatment mechanism (TM) or incorporate dose-finding (DF) strategies.

Some studies performed prior to large scale clinical trials are characterized

as translational studies. Translational studies have as their primary outcome

a biological measurement or target that has been derived from an accepted

model of the disease process. The results of the translational study may

provide evidence of a mechanism of action for a compound. Target validation

can be an objective of such a study. Large effects on the target are sought.

For example, a large change in the level of a protein, or the activity of an

enzyme might support therapeutic activity of a compound. There is an

understanding that translational work may cycle from preclinical lab to a

clinical setting and back again. Although the translational studies have a

written protocol, the treatment may be modified during the study. The protocol

should clearly define what would be considered lack of effect and the next

experimental step for any possible outcome of the trial.

Some therapies are not developed in the same manner as drugs, such as

disease prevention therapies, vaccines, biologicals, surgical techniques,

medical devices, and diagnostic agents.

Prevention trials may aim to prevent the onset of disease, disease progression, or episodes of disease expression.

Vaccine investigations are a type of primary prevention trial. They require large

numbers of patients and are very costly because of the numbers and the

length of follow-up that is required.

A trial of a diagnostic agent investigates whether the agent can diagnose the presence of disease. Usually, the agent is compared to a gold standard diagnostic that is assumed to be perfectly accurate in its

diagnosis. The advantage of the newer diagnostic agent is less expense or a

less invasive procedure.

Protocol

A protocol is the document that specifies the research plan for the clinical

trial. It is the single-most important quality control tool for all aspects of a

clinical trial. (Piantadosi 2005) This is especially true in a multi-center clinical

trial, which requires collaboration in the research activities of many

investigators and their staffs at multiple institutions.

Every clinical trial experiences violations of the protocol. Some violations are

due to differences in interpretation, some are due to carelessness, and some

are due to unforeseen circumstances. Some protocol deviations are

inconsequential but others can affect the validity of the trial. For instance a

patient might be unaware of a condition that is present in its early or latent

stage, or a patient may mislead a researcher intentionally, thinking they will receive special treatment from participating in a study; both result in violations of the patient exclusion criteria established in the research protocol.

Protocol amendments are common as a long-term multi-center study

progresses. The most serious violations are those which may affect the

conclusions of the study.

An international collaboration of groups responsible for funding, conducting and publishing results of clinical trials, along with ethicists, sets forth minimal elements that should be included in a clinical trial protocol and provides a checklist. [7] The US NIH has its own template [8] for a phase 2 or 3 clinical trial protocol.

For a multi-center study, the investigators will also construct a manual of operations

(MOP). The MOP has more detailed explanations than the protocol for how

the measurements should be taken, how the data collection forms should be

completed, etc.

3.7 - Summary

In this lesson, among other things, we learned:

to recognize potential sources of bias and confounding effects in a proposed clinical study.

to compare and contrast the following study designs with respect to the

ability of the investigator to minimize bias: Case report or case series,

database analysis, prospective cohort study, case-control study, parallel

design clinical trial, crossover clinical trial.

to describe the phases (I-IV) in the development of a clinical trial.

Introduction

Error is the difference between the true value and the recorded value of a measurement. There are many sources of error in

collecting clinical data. Error can be described as random or systematic.

Random error is the variability due to chance in the measurement process or the biological system. The heterogeneity in the human population leads to relatively large

random variation in clinical trials.

Systematic error or bias refers to deviations that are not due to chance alone.

The simplest example occurs with a measuring device that is improperly

calibrated so that it consistently overestimates (or underestimates) the

measurements by X units.

Random error has no net direction, so averaging over a large number of observations will yield a net effect of zero. The estimate may

be imprecise, but not inaccurate. The impact of random error, imprecision, can

be minimized with large sample sizes.
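A small simulation illustrates how imprecision shrinks with sample size (a sketch only; the true value of 0, the standard deviation of 10, and the sample sizes are arbitrary numbers chosen for illustration):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    true_value, sd = 0.0, 10.0        # measurements vary randomly around the true value

    for n in (10, 100, 1000):
        # Average n noisy measurements; repeat 2000 times to see how the mean varies.
        sample_means = rng.normal(true_value, sd, size=(2000, n)).mean(axis=1)
        print(f"n = {n:>4}: spread of the estimated mean = {sample_means.std():.2f}")
    # The spread falls roughly like sd / sqrt(n): random error (imprecision) is
    # reduced by a larger sample size, whereas a systematic bias would remain.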

Bias, on the other hand, has a net direction and magnitude so that averaging

over a large number of observations does not eliminate its effect. In fact, bias

can be large enough to invalidate any conclusions. Increasing the sample size

is not going to help. In human studies, bias can be subtle and difficult to

detect. Even the suspicion of bias can render judgment that a study is invalid.

Thus, the design of clinical trials focuses on removing known biases.

Random error corresponds to imprecision, and bias to inaccuracy. (A 'bull's eye' diagram accompanies this point in the original lesson: imprecision corresponds to shots scattered widely around the center of the target, while inaccuracy corresponds to shots clustered away from the center.)

2. State how the significance level and power of a statistical test are

related to random error.

Random error (variability, imprecision) can be overcome by increasing the

sample size. This is illustrated in this section via hypothesis

testing and confidence intervals, two accepted forms of statistical inference.

In hypothesis testing, a null hypothesis and an alternative hypothesis are formed. Typically, the null hypothesis reflects the lack of an effect and the

alternative hypothesis reflects the presence of an effect (supporting the

research hypothesis). The investigator needs to have sufficient evidence,

based on data collected in a study, to reject the null hypothesis in favor of the

alternative hypothesis.

As an example, suppose subjects are randomized to group A or group B, and the outcome of interest is the change in serum cholesterol after 8 weeks. Because the outcome is measured on a continuous scale, the hypotheses are stated as:

H0: μA = μB versus H1: μA ≠ μB

where μA and μB denote the population mean changes in serum cholesterol for groups A and B, respectively.

The alternative hypothesis H1: μA ≠ μB is labeled a two-sided alternative because it does not indicate whether A is better than B or vice versa; it just indicates that A and B are different. A one-sided alternative of H1: μA < μB (or H1: μA > μB) is possible, but it is more conservative to use the two-sided alternative.

Suppose the investigator enrolls 40 subjects in each of group A and group B (nA = 40 and nB = 40). The investigator estimates the population means via the sample means (labeled x̄A and x̄B, respectively). Suppose the average changes observed are x̄A = 7.3 and x̄B = 4.8 mg/dl. Do these data provide enough evidence to reject the null hypothesis that the average changes in the two populations are equal? (The question cannot be answered yet. We do not know if this is a statistically significant difference!)

If the data approximately follow a normal distribution or are from large enough samples, then a two-sample t test is appropriate for comparing groups A and B, where:

t_obs = (x̄A - x̄B) / (standard error of x̄A - x̄B)

We can think of the numerator as the signal and the denominator as the noise, and ask whether the signal is large enough relative to the noise. In the example, x̄A = 7.3 and x̄B = 4.8 mg/dl. If the standard error of x̄A - x̄B is 1.2 mg/dl, then:

t_obs = (7.3 - 4.8)/1.2 = 2.1

Each t value has associated probabilities. In this case, we want to know the

probability of observing a t value as extreme or more extreme than the t value

actually observed, if the null hypothesis is true. This is the p-value. At the

completion of the study, a statistical test is performed and its corresponding p-

value calculated. If the p-value < α, then H0 is rejected in favor of H1.

Two types of errors can be made in testing hypotheses: rejecting the null

hypothesis when it is true or failing to reject the null hypothesis when it is

false. The probability of making a Type I error, represented by α (the significance level), is determined by the investigator prior to the onset of the study. Typically, α is set at a low value, say 0.01 or 0.05.

Here is a table that presents these options:

                        H0 is true                H0 is false
Reject H0               Type I error (prob. α)    Correct decision (power = 1 - β)
Fail to reject H0       Correct decision          Type II error (prob. β)

In our example, the p-value = [probability that |t| > 2.1] = 0.04

Thus, the null hypothesis of equal mean change in the two populations is

rejected at the 0.05 significance level. The treatments were different in the

mean change in serum cholesterol at 8 weeks.

Note that β (the probability of not rejecting H0 when it is false) did not play a

role in the test of hypothesis.

The importance of β came into play during the design phase when the

investigator attempted to determine an appropriate sample size for the study.

To do so, the investigator had to decide on the effect size of interest, i.e., a

clinically meaningful difference between groups A and B in average change in

cholesterol at 8 weeks. The statistician cannot determine this but can help the

researcher decide whether he has the resources to have a reasonable chance

of observing the desired effect or should rethink his proposed study design. The sample size should be determined such that there exists good statistical power (β = 0.1 or 0.2) for detecting this effect size with a test of hypothesis that has significance level α.

A sample size formula that can be used for a two-sided, two-sample test with α = 0.05 and β = 0.1 (90% statistical power) is:

nA = nB = 21σ²/Δ²

where σ is the standard deviation of the outcome, Δ is the clinically meaningful effect size, and the constant 21 ≈ 2(1.96 + 1.28)² (the general formula is discussed in a later lesson).

Note that the sample size increases as Δ (the effect size) decreases.

Suppose the investigator specified the clinically meaningful difference as Δ = 3.0 mg/dl and located a similar study in the literature that reported σ = 4.0 mg/dl. Then:

nA = nB = 21σ²/Δ² = (21 × 16)/9 ≈ 37

Thus, 37 subjects per group are needed to assure 90% power for detecting an effect size that would have clinical relevance.

Many studies suffer from low statistical power (large Type II error) because the

investigators do not perform sample size calculations.

If a study has very large sample sizes, then it may yield a statistically significant result without any clinical meaning. Suppose in the serum cholesterol example that x̄A = 7.3 and x̄B = 7.1 mg/dl, with nA = nB = 5,000. The two-sample t test may yield a p-value = 0.001, but x̄A - x̄B = 7.3 - 7.1 = 0.2 mg/dl is not clinically interesting.

Confidence Intervals

A confidence interval provides a plausible range of values for a population measure. Instead of just reporting x̄A - x̄B as the sample estimate of μA - μB, a range of values can be reported using a confidence interval. The interval is constructed so that it provides a high percentage of confidence (95% is commonly used) that the true value of μA - μB lies within it.

If the sample sizes are large, the approximate 95% confidence interval for μA - μB is:

(x̄A - x̄B) ± {1.96 × (standard error of x̄A - x̄B)}

In the serum cholesterol example, x̄A - x̄B = 7.3 - 4.8 = 2.5 mg/dl and the standard error is 1.2 mg/dl. Thus, the approximate 95% confidence interval is:

2.5 ± (1.96 × 1.2) = [0.1, 4.9]

Note that the 95% confidence interval does not contain 0, which is consistent

with the results of the 0.05-level hypothesis test (p-value = 0.04). 'No

difference' is not a plausible value for the difference between the treatments.

Notice also that the length of the confidence interval depends on the standard

error. The standard error decreases as the sample size increases, so the

confidence interval gets narrower as the sample size increases (hence,

greater precision).

Not only does it indicate whether H0 can be rejected, but it also provides a

plausible range of values for the population measure. Many of the major

medical journals request the inclusion of confidence intervals within submitted

reports and published articles.

If a bias is small relative to the random error, then we do not expect it to be a

large component of the total error. A strong bias can yield a point estimate that

is very distant from the true value. Remember the 'bulls eye' graphic?

Investigators seldom know the direction and magnitude of bias, so

adjustments to the estimators are not possible.

Several types of bias that can arise in clinical studies are discussed below:

1. Selection bias
2. Procedure selection bias
3. Post-entry exclusion bias
4. Bias due to selective loss of data
5. Assessment bias

1. Selection Bias

Selection bias occurs when the study sample is not representative of the target population because of the method used to select the sample. Selection bias in

the study cohort can diminish the external validity of the study findings. A

study with external validity yields results that are useful in the general

population. Suppose an investigator decides to recruit only hospital

employees in a study to compare asthma medications. This sample might be

convenient, but such a cohort is not likely to be representative of the general

population. The hospital employees may be more health conscious and

conscientious in taking medications than others. Perhaps they are better at

managing their environment to prevent attacks. The convenient sample easily

produces bias. How would you estimate the magnitude of this bias? It is

unlikely to find an undisputed estimate and the study will be criticized because

of the potential bias.

If treatments are assigned by randomization within the selected cohort, however, the comparison between treatment groups can be salvaged. Randomized controls increase the internal validity of a study. Randomization can also provide external validity for treatment group differences. Selection bias should affect all randomized groups equally, so in taking differences between treatment groups, the bias is removed via subtraction. Randomization in the presence of selection bias cannot provide external validity for absolute treatment effects. The graph below illustrates these concepts.

The estimates of the response from the sample are clearly biased below the

population values. However, the observed difference between treatment and

control is of the same magnitude as that in the population. In other words, it

could be that the observed treatment difference accurately reflects the population

difference, even though the observations within the control and treatment

groups are biased.

2. Procedure Selection Bias

Procedure selection bias, which is likely when patients or their physicians decide on treatment assignment, can lead to extremely large biases. The investigator

may consciously or subconsciously assign particular treatments to specific

types of patients. Randomization is the primary design feature that removes

this bias.

Post-entry exclusion bias can occur when the exclusion criteria for subjects

are modified after examination of some or all of the data. Some enrolled

subjects may be recategorized as ineligible and removed from the study. In

the past, this may have been done for the purposes of manufacturing

statistically significant results, but would be regarded as unethical practice

now.

Bias due to selective loss of data is related to post-entry exclusion bias. In this

case, data from selected subjects are eliminated from the statistical analyses.

Protocol violations (including adding on other medications, changing

medications or withdrawal from therapy) and other situations may cause an

investigator to request an analysis using only the data from those who adhered

to the protocol or who completed the study on their assigned therapy.

The latter two types of biases can be extreme. Therefore, statisticians prefer that intention-to-treat analyses be performed as the main statistical analysis. An intention-to-treat analysis includes every randomized subject, in the group to which they were assigned, in the data analysis, regardless of protocol violations or lack of compliance. Though

it may seem unreasonable to include data from a patient who simply refused

to take the study medication or violated the protocol in a serious manner, the

intention-to-treat analysis usually prevents more bias than it introduces. Once

all the patients are randomized to therapy, use all of the data collected. Other

analyses may supplement the intention-to-treat analysis, perhaps

substantiating that protocol violations did not affect the overall inferences, but

the analysis including all subjects randomized should be primary.

5. Assessment bias

As discussed earlier, clinical studies that rely on patient self-assessment or

physician assessment of patient status are susceptible to assessment bias. In

some circumstances, such as in measuring pain or symptoms, there are no

alternatives, so attempts should be made to be as objective as possible and

invoke randomization and blinding. What is a mild cough for one person might

be characterized as a moderate cough by another patient. Not knowing

whether or not they received the treatment (blinding) when making these

subjective evaluations will help to minimize this self-assessment or

assessment bias.

Randomization, blinding, and the use of a concurrent control group are the main design features for minimizing these biases. A concurrent control group adjusts for disease remission/progression, as the graph below illustrates: both the treatment and control groups had an increase in response, but the treatment group experienced a greater increase.

4.3 - Statistical Biases

For a point estimator, statistical bias is defined as the difference between the

parameter to be estimated and the mathematical expectation of the estimator.

For example, if the statistical analysis does not account for important prognostic

factors (variables that are known to affect the outcome variable), then it is

possible that the estimated treatment effects will be biased. Fortunately, many

statistical biases can be corrected, whereas design flaws lead to biases that

cannot be corrected.

Consider the one-sample situation with Y1, ... , Yn denoting independent and identically distributed random variables and Ȳ denoting their sample mean. Define:

s² = (1/(n - 1)) Σ (Yi - Ȳ)²,  summing over i = 1, ... , n

and

v² = (1/n) Σ (Yi - Ȳ)²

The statistic s² is an unbiased estimator of the population variance, σ². The statistic v² is biased because its mathematical expectation is σ²(n - 1)/n; the statistic v² tends to underestimate the population variance. As n gets larger, the bias becomes negligible.

4.4 - Summary

In this lesson, among other things, we learned:

to state how the significance level and power of a statistical test are

related to random error.

Check to see if there are any homework problems associated with this lesson.

Introduction

The success of a trial should not depend on observing a particular outcome of the trial, e.g., finding a difference in mean weight loss of exactly 2 kg, but on obtaining a valid result.

For example, a randomized trial of 4 diets had as its objective, To assess

adherence rates and the effectiveness of 4 popular diets for weight loss and

cardiac risk factor reduction. (Dansinger et al. 2005).

The endpoints (or outcomes), determined for each study participant, are the

quantitative measurements required by the objectives. In the Dansinger

weight loss study, the primary endpoint was identified to be mean absolute

change from baseline weight at 1 year. In a cancer chemotherapy trial the

clinical objective is usually improved survival. Survival time is recorded for

each patient; the primary outcome reported may be median survival time or it

could be five-year survival.

Objectives and endpoints other than the primary ones are secondary. The sample size calculation is

based on the primary endpoint. Analysis involving a secondary objective has

statistical power that is calculated based on the sample size for the primary

objective.

"Hard" endpoints are well-defined in the study protocol, definitive with respect

to the disease process, and require no subjectivity. "Soft" endpoints are those

that do not relate strongly to the disease process or require subjective

assessments by investigators and/or patients. Some endpoints fall between

these two classifications. For example: the grading of x-rays by radiologists

and the grading of cell and tissue lesions/tumors by pathologists. There is

some degree of subjectivity, but they are valid and reliable endpoints in most

settings.

This lesson will help to differentiate between these types of objectives and

endpoints. Ready, let's get started!

ordered or unordered categories and repeated measurements.

outcomes.

5.1 - Endpoints

The endpoints used in a clinical trial must correspond to the scientific

objectives of the study and the methods of outcome assessment should be

accurate (free of bias).

A wide variety of endpoints are used in clinical trials, as displayed below:

Type of endpoint        Examples
Event times             time to recurrence of cancer, survival time
Counts                  frequency of occurrence of migraine headaches, number of uses of rescue meds for asthma
Binary outcomes         no recurrence/recurrence, major cardiac event yes or no
Ordered categories      absent, mild, moderate, severe pain; NYHA status
Unordered categories    categories of adverse experiences: GI, cardiac, etc.

Some endpoints are assessed many times during the study, leading to

repeated measurements.

Event Times

Event times often are useful endpoints in clinical trials. Examples include

survival time from onset of diagnosis, time until progression from one stage of

disease to another, and time from surgery until hospital discharge. In each

case time is measured from study entry until the event occurs. With an

endpoint that is based on an event time, there always is the chance

of censoring. An event time is censored if there is some amount of follow-up

on a subject, but the event is not observed because of loss-to-follow-up, death

from a cause other than the trial endpoint, study termination, and other

reasons unrelated to the endpoint of interest. This is known as right censoring and occurs frequently in studies of survival.

Right-censoring example

Consider the table above, which displays time until infection for Patients 1-6. In some cases the event did not occur: Patient 1 (from the top) was followed for a year and was censored at the end of the study. The second patient experienced an infection at approximately 325 days. Patients 3 and 6 dropped out of the study and were censored when this occurred.

Left censoring occurs when the initiation time for the subject, such as time of

diagnosis, is unknown. Interval censoring occurs when the subject is not

followed for a period of time during the trial and it is unknown if the event

occurred during that period.

There are three types of right censoring that are described in the statistical

literature.

Type I censoring occurs when all subjects are scheduled to begin the study at

the same time and end the study at the same time. This type of censoring is

common in laboratory animal experiments, but unlikely in human trials.

Type II censoring occurs when all subjects begin the study at the same time

and the study is terminated when a predetermined proportion of subjects have

experienced the event

Type III censoring occurs when the censoring is random, which is the case in

clinical trials because of staggered entry (not every patient enters the study on

the first day) and unequal follow-up on subjects.

Statistical methods appropriate for event time data, survival analyses, do not

discard the right-censored observations. Instead, the methods account for the

knowledge that the event did not occur in a subject up to the censoring time.

Survival methods include life table analysis, Kaplan-Meier survival curves,

logrank and Wilcoxon tests, and proportional hazards regression (more

discussion on these in a later lesson).

For these analyses, two quantities are recorded, namely, the follow-up time for a subject and an indicator variable as

to whether this is an event time or a censoring time. These statistical methods

assume that the censoring mechanisms and the event are independent. If this

is not the case, e.g., patients have a tendency to be censored prior to the

occurrence of the event, the event rate will be underestimated.

A mortality trial may consider two possible endpoints, namely, death from all causes and death primarily due to the

disease.

At first glance, death primarily due to the disease appears to be the most

appropriate. It is, however, susceptible to bias because the assumption of

independent causes of death may not be valid. For example, subjects with a

life-threatening cancer are prone to death due to myocardial infarction. It can

also be very difficult to determine the exact cause of death.

A surrogate endpoint is one that is measured in place of the biologically

definitive or clinically meaningful endpoint. A surrogate endpoint usually tracks

the progress or extent of the disease.

Investigators choose a surrogate endpoint when the definitive endpoint is

inaccessible due to cost, time, or difficulty of measurement. The problem with

a surrogate endpoint in a clinical trial is determining whether it is valid (i.e., is

it strongly associated with the definitive outcome?)

Among the criteria for a valid surrogate endpoint is that it yields the same statistical inference as that for the definitive endpoint. The relationship can be depicted as:

Disease >>> Surrogate Endpoint >>> Definitive Endpoint

The disease affects the surrogate endpoint, which in turn affects the definitive

endpoints.

Examples of surrogate endpoints include tumor size reduction in cancer patients, blood pressure in cardiovascular disease, and intraocular pressure in glaucoma patients. The response variables in translational research are surrogate endpoints.

Surrogate endpoints can reduce the cost and duration of clinical trials. If, however, the surrogate is imprecisely associated with the definitive endpoint, use of the surrogate can lead to misleading results.

Studies

The terms describing several types of early clinical studies are given below.

Term                   Meaning
Treatment mechanism    Early developmental trial that investigates the mechanism of treatment effect, e.g., a pharmacokinetics study
Pharmacokinetics       Study of the absorption, distribution, metabolism, and elimination of the drug from the human body
Phase I                Imprecise term for dose-ranging studies
Dose-escalation        Design, or component of a design, that specifies methods for increases in dose for subsequent subjects
Dose-ranging           Design that tests some or all of a prespecified set of doses (fixed design points)
Dose-finding           Design that titrates dose to a prespecified optimum based on biological or clinical considerations

Dose-finding (DF) trials are Phase I studies with the objective of determining

the optimal biological dose (OBD) of a drug. In order to determine the dose

with highest potential for efficacy in the patient population that still meets

safety criteria, dose-finding studies are typically conducted by

administering sequentially rising doses to successive groups of individuals.

Such studies may be conducted in healthy volunteers or in patients with

disease.

A key question is how to characterize an optimum dose. Should the optimum dose be selected on the basis of the highest therapeutic index (the maximal separation

between risk and benefit)? Or is the optimal dose the level which maximizes

therapeutic benefit while maintaining risk below a predetermined threshold?

What measures will denote risk and benefit?

An optimal dose can be selected on the basis of efficacy alone, such as when

a minimum effective dose (MED) is chosen for a pain-relieving medication,

and defined as the dose which eliminates mild-to-moderate pain in 80% of trial

participants. In another case, the optimal dose might be selected as the

highest dose that is associated with serious side effects in no more than 1 of

20 patients. This would be a maximum nontoxic dose (MND). In cancer

therapeutics, the optimal dose for a cytotoxic drug designed to shrink tumors

could be defined as the level that yields serious but reversible toxicity in no

more than 30% of the patients. This is a maximum tolerated dose (MTD). Care

in defining the conditions for optimality is critical to a dose-finding study.

Most DF trials are sequential studies such that number of subjects is itself an

outcome of the trial. Convincing evidence characterizing the relationship of

dose and safety can be obtained after studying a small set of patients. Hence

sample size is not a major concern in DF trials.

Suppose a dose-ranging study specifies a set of fixed doses at increasing levels, d1, d2, ... , dK. The hypothesized optimal dose

would lie between d1 and dK. The n participants would be randomized to each

of the K dose groups and the binary response of toxicity would be noted for

each participant. A mathematical model could then be fit to the proportional

responses over the doses such that the optimal dose could be determined.

Would you be willing to enroll as one of the first participants in such a trial, knowing you could be randomized to any of the K doses? Most likely, your answer is no, because you would not want to risk being

assigned to the highest dose level of this unproven drug as your first

treatment. There is a principle here: it is unethical to treat humans at high

doses of a drug without any prior knowledge of their responses at lower

levels. Furthermore, ethics compel a design that minimizes the numbers of

patients treated with both low ineffective doses and high toxic doses.

A dose-escalation design specifies a method for determining the starting dose for the patient, specification of dose

increments and cohort sizes, definition of dose-limiting toxicities as well as the

decision rules for escalation and de-escalation of the dose.

The continual reassessment method (CRM) repeatedly fits a dose-response model to observed data during the study, from which it estimates an optimal

dose via extrapolation or interpolation. The next cohort of patients is assigned

to the estimated optimal dose. A study using CRM would not have an a priori

defined set of doses; thus it is a dose-finding study. The CRM itself can be thought of as an algorithm for updating the best guess regarding the optimal

dose. Bayesian approaches have also been incorporated into the CRM and

the method is applicable for many types of responses.

In contrast, a dose-ranging design uses a prespecified set of fixed design points. A commonly used escalation scheme is based on the Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34, ... (a number in the sequence is the sum of the two previous numbers). For

example, the first cohort of n participants is assigned dose D, the initial dose.

If they tolerate this dose D well, the next cohort of n participants is assigned

dose 2D. If all goes well with the second cohort, then a third cohort is

assigned dose 3D, the fourth cohort is assigned dose 5D, etc. The process is

discontinued when one of the cohorts exhibits toxicity. Numerous

modifications have been proposed to the Fibonacci scheme, such as allowing

for de-escalation as well as escalation.

5.5 - Summary

In this lesson, among other things, we learned how to:

ordered or unordered categories and repeated measurements.

outcomes.

Look for any homework assignments listed for this lesson in the ANGEL

course site...

A

Introduction

The statistical property most directly governed by sample size is precision. Validity and unbiasedness do not necessarily relate to sample

size.

Usually, sample size is calculated with respect to two circumstances. The first

involves precision for an estimator, e.g., requiring a 95% confidence interval for the population mean to be within ±δ units. The second involves statistical power for hypothesis testing, e.g., requiring 0.80 or 0.90 statistical power (1 - β) for a hypothesis test when the significance level (α) is 0.05 and the effect size (the clinically meaningful effect) is Δ units.

The formulae for many sample size calculations will involve percentiles from

the standard normal distribution. The graph below illustrates the

2.5th percentile and the 97.5th percentile.

Fig. 1 Standard normal distribution centered on zero.

For a two-sided hypothesis test with significance level α and statistical power 1 - β, the percentiles of interest are z(1-α/2) and z(1-β). The usual choices of α are 0.05 and 0.01, and the usual choices of β are 0.20 and 0.10, so the percentiles of interest usually are:

z0.995 = 2.58, z0.99 = 2.33, z0.975 = 1.96, z0.95 = 1.65, z0.90 = 1.28, z0.80 = 0.84

The SAS function PROBIT returns percentiles of the standard normal distribution function, e.g., Z = PROBIT(0.99) yields a value of

2.33 for Z. So, if you ever need to generate z-values you can get SAS to do

this for you.

Keep in mind that the assumptions made for the sample size calculation, e.g., the standard

deviation of an outcome variable or the proportion of patients who succeed

with placebo, may not hold exactly.

Also, we may base the sample size calculation on a t statistic for a hypothesis

test, which assumes an exact normal distribution of the outcome variable

when it only may be approximately normal.

In addition, not all subjects who initiate the study will provide complete data. Some will deviate from the

protocol, including not taking the assigned treatment or adding on a treatment.

Sample size calculations and recruitment of subjects should reflect these

anticipated realities.

By the end of this lesson, you should be able to do the following:

- Estimate the sample size required for a confidence interval for p for given α and precision δ, using the normal approximation and the exact binomial method.
- Estimate the sample size required for a confidence interval for μ for given α and precision δ, using the normal approximation when the sample size is relatively large.
- Estimate the sample size required for a two-sample comparison of means to have (1 - β)% power for given α and effect size Δ, using the normal approximation, with equal or unequal allocation.
- Estimate the sample size required for a two-sample comparison of proportions to have (1 - β)% power for given α, p1, and p2, using the normal approximation and Fisher's exact method.
- Estimate the number of events required for a logrank comparison of two hazard functions to have (1 - β)% power for a given α and hazard ratio.
- Estimate the cohort size required to have a certain probability of detecting a rare event that occurs at a rate λ.
- Adjust sample size estimates for multiple comparisons and the anticipated withdrawal rate.

References:

Friedman, Furberg, DeMets, Reboussin and Granger. (2015) Sample size. In:

FFDRG. Fundamentals of Clinical Trials. 5th ed. Switzerland: Springer.

Piantadosi Steven. (2005) Sample size and power. In: Piantadosi

Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ:

John Wiley and Sons, Inc.

Wittes, Janet. (2002) "Sample Size Calculations for Randomized Controlled Trials [1]." Epidemiologic Reviews. Vol. 24. No. 1. pp. 39-53.

Finding Studies

For many treatment mechanism (TM) studies, sample size is not an

important issue because usually only a few subjects are enrolled to

investigate treatment mechanism. Here you are taking a lot of measurements

on a few subjects in order to find out what might be going on with your

treatment.

Dose-finding (DF) studies typically involve a design scheme, such as a modified Fibonacci design or

continual reassessment. An example for phase I cytotoxic drug trials is as

follows. A set of doses is determined a priori, such as 100 mg, 200 mg, 300

mg, 500 mg, 800 mg, etc. Subjects are recruited into the DF study in groups

of threes. The first group receives the lowest dose of 100 mg. If none of the

subjects experience the effect (toxicity, side effect, etc.), then the next group

of three subjects is escalated to the next dose of 200 mg. If one of the three

subjects at 100 mg experiences the effect, however, then the next group of

three subjects will receive the same dose of 100 mg. Whenever six subjects at

the same dose reveal at least two subjects that experience the effect, then the

study is terminated and the chosen dose for a safety and efficacy study is the

previous dose level.

For this type of study, sample size is not a major consideration. In fact, the final sample

size is dependent on the patient outcomes.

The U.S. FDA mandates that efficacy is proven prior to approval of a drug.

Efficacy means that the tested dose of the drug is effective at ameliorating the

treated condition. Phase II trials evaluate the potential for efficacy; Phase III

trials confirm efficacy. These trials can also be referred to as safety and

activity studies.

A common objective of these studies is to estimate clinical endpoints with a specified amount of precision. Confidence intervals

are useful for reflecting the amount of precision, and the width of a confidence

interval is a function of sample size.

The simplest example occurs when the outcome response is binary (success

or failure). Let p denote the true (but unknown) proportion of successes in the

population that will be estimated from a sample. Suppose that r successes are observed among the n subjects studied.

Thus, the point estimate of p is:

p̂ = r/n

If the sample size is large enough, then the 100(1 - α)% confidence interval can be approximated as:

p̂ ± z(1-α/2) √(p̂(1 - p̂)/n)

Prior to the conduct of the study, however, the point estimate is undetermined

so that an educated guess is necessary for the purposes of a sample size

calculation.

If δ denotes the desired precision (half-width of the confidence interval) and an educated guess as to the value of p is available, then reworking through the sample size equation, the target sample size is:

n = z²(1-α/2) p(1 - p)/δ²

For example, if the educated guess is p = 0.4 and the 95% confidence interval should have limits of ±δ = 0.10, then the required sample size is n = (1.96)²(0.4)(0.6)/(0.10)² = 92.

Notice that p(1 - p) is maximized when p = 0.5. Therefore, because p has to

be guessed, it is more conservative to use p = 0.5 in the sample size

calculation. In the above example this yields n = (1.96)2(0.5)(0.5)/(0.10)2 = 96,

a slightly larger sample size.

If a precision of δ = 0.05 is desired instead of 0.10 in the above example, then n = (1.96)²(0.5)(0.5)/(0.05)² = 384.

If you want the confidence interval to be tighter remember that splitting the

width of the confidence interval in half will involve quadrupling the number of

subjects in the sample size!

The normal approximation for p works well if:

n p̂ (1 - p̂) ≥ 5

In the exact binomial method, the lower confidence limit for p is determined as the value pL that satisfies

α/2 = Σ(k = r to n) C(n, k) (pL)^k (1 - pL)^(n-k)

and the upper confidence limit is the value pU that satisfies

α/2 = Σ(k = 0 to r) C(n, k) (pU)^k (1 - pU)^(n-k)

SAS PROC FREQ provides the exact and asymptotic 100(1 - α)% confidence intervals for a binomial proportion, p. SAS Example (6.1_binomial_proportion.sas [2]): this is a program that illustrates the use of PROC FREQ in SAS for determining an exact confidence interval for a binomial proportion.

Suppose that 3 successes were observed among n = 19 subjects in a binomial trial. The point estimate of p is p̂ = 3/19 = 0.16, and

n p̂ (1 - p̂) = 19(0.16)(0.84) = 2.55 < 5

so the normal approximation is not expected to work well here.

The 95% confidence interval for p, based on the exact method, is [0.03, 0.40].

The 95% confidence interval for p, based on the normal approximation, is [-

0.01, 0.32], which is modified to [0.00, 0.32] because p represents the

probability of success that is supposed to be restricted to lie within the [0, 1]

interval. Even with the correction to the lower endpoint, the confidence interval

based on the normal approximation does not appear to be very accurate in

this example.

Modify the SAS program above to reflect 11 successes out of 75 trials and run the program. Do the results round to (0.08, 0.25) for the 95% exact confidence limits?

Suppose the goal is a 95% exact confidence interval with δ = 0.1; what sample size is needed? One way to solve this is to use SAS PROC FREQ in a 'guess and check' manner. In this case, n = 73 with 11 successes will result in a 95% exact confidence interval of (0.07, 0.25). It may be impossible to achieve the desired δ exactly, but an estimate of the required sample size can be provided.

Using the exact confidence interval for a binomial proportion is the better option whenever you are not sure that the sample is large enough for the normal approximation to be adequate.

Treatment

An approach for discarding an ineffective treatment in an SE study, based on

the exact binomial method, is as follows. Suppose that the lowest success

rate acceptable to an investigator for the treatment is 0.20. Suppose that the

investigator decides to administer the treatment consecutively to a series of

patients. When can the investigator terminate the SE trial if he continues to

find no treatment successes?

A modification of the SAS program used earlier (6.1_binomial_proportion.sas [2]) can be made to determine when the exact confidence interval for p no longer contains a certain value.

SAS PROC FREQ (trial-and-error) indicates that the exact one-sided 95%

upper confidence limit for p, when 0 out of 14 successes are observed, is

0.19. Thus, if the treatment fails in each of the first 14 patients, then the study

is terminated.

Try it yourself! What is the upper 95% one-sided confidence limit for p when you have seen no successes in 5 trials? Work out your answer first.

Here is another one to try: how many straight failures would it take to rule out a 30% success rate?

For a clinical endpoint that can be approximated by a normal distribution in an SE study, the 100(1 - α)% confidence interval for the population mean, μ, is

Ȳ ± t(n-1, 1-α/2) × s/√n

where Ȳ is the sample mean and s is the sample standard deviation (s² estimates σ²).

If the sample size will be reasonably large, percentiles from the standard normal distribution can be used instead, giving the approximate 100(1 - α)% confidence interval for the population mean, μ, that is,

Ȳ ± z(1-α/2) × σ/√n

If δ denotes the desired precision (half-width) of the confidence interval for the mean, then the required sample size is

n = z²(1-α/2) σ²/δ²

For example, the necessary sample size for estimating the mean reduction in diastolic blood pressure, where σ = 5 mm Hg and δ = 1 mm Hg, is n = (1.96)²(5)²/(1)² = 96.

Studies

Suppose that a comparative treatment efficacy (CTE) trial consists of

comparing two independent treatment groups with respect to the means of the

primary clinical endpoint. Let 1 and 2 denote the unknown population means

of the two groups, and let σ denote the known standard deviation common to

both groups. Also, let n1 and n2 denote the sample sizes of the two groups.

The null hypothesis is H0: μ1 - μ2 = 0. The test statistic is

Z = (Ȳ1 - Ȳ2) / (σ √(1/n1 + 1/n2))

which follows a standard normal distribution when the null hypothesis is true. If the alternative hypothesis is two-sided, i.e., H1: μ1 - μ2 ≠ 0, then the null hypothesis is rejected for large values of |Z|. Let Δ denote the effect size, i.e., the clinically relevant difference in population means that the trial should be able to detect.

Suppose we let AR = n1/n2 denote the allocation ratio (in most cases we will assign AR = 1 to get equal sample sizes). If we wish to have a large enough sample size to detect the effect size Δ with a two-sided, α-significance level test with 100(1 - β)% statistical power, then

n2 = ((AR + 1)/AR) (z(1-α/2) + z(1-β))² σ²/Δ²

and n1 = AR × n2.

Note this formula matches the sample size formula in our FFDRG text on p.

180, assuming equal allocation to the two treatment groups and multiplying

the result here by 2 to get 2N, which FFDRG use to denote the total sample

size.

Here are a few observations about this formula.

Notice that the sample size expression contains (σ/Δ)², the squared reciprocal of the effect size expressed in standard deviation units. Thus, the sample size depends quadratically on both the standard deviation and the effect size: doubling the standard deviation quadruples the required sample size, and reducing the effect size by one-half also quadruples the required sample size.

Although this sample size formula assumes that the standard deviation is

known so that a z test can be applied, it works relatively well when the

standard deviation must be estimated and a t-test applied. A preliminary guess

of σ must be available, however, either from a small pilot study or a report in the literature. For smaller sample sizes (n1 ≤ 30, n2 ≤ 30), percentiles from a t distribution can be substituted, although this results in both sides of the formula involving n2, so that it must be solved iteratively:

n2 = ((AR + 1)/AR) (t(n1+n2-2, 1-α/2) + t(n1+n2-2, 1-β))² σ²/Δ²

Example 1: Comparative Treatment Efficacy Studies

An investigator wants to determine the sample size for comparing two asthma

therapies with respect to the forced expiratory volume in one second (FEV1). A

two-sided, 0.05-significance level test with 90% statistical power is desired.

The effect size is Δ = 0.25 L and the standard deviation reported in the literature for a similar population is σ = 0.75 L. The investigator plans to have

equal allocation to the two treatment groups (AR = 1).

In this example, FEV1 is a continuous response variable. Assuming that FEV1 has an approximate

normal distribution, the number of patients required for the second treatment

group based on the z formula is n2 = (2)(1.96 + 1.28)2(0.75)2/(0.25)2 = 189.

Thus, the total sample size required is n1 + n2 = 189 + 189 = 378. SAS

Example (7.4_sample_size__normal_.sas [3]): This is a program that illustrates

the use of PROC POWER to calculate sample size when comparing two

normal means.

PROC POWER, which bases the calculation on the t test, yields a slightly larger total sample size of 382.

If the allocation ratio is AR = 2 (twice as many subjects in the first group), then n2 = (1.5)(1.96 + 1.28)²(0.75)²/(0.25)² = 142 and n1 = 2 × 142 = 284 based on the z formula; the corresponding PROC POWER calculation yields a total sample size of 429.

Notice that the 2:1 allocation, when compared to the 1:1 allocation, requires

an overall larger sample size (429 versus 382).

Try it yourself! How many subjects are needed to have 80% power in testing the equality of two means when subjects are allocated 2:1, using an α = 0.05 two-sided test? The standard deviation is 10 and the hypothesized difference in means is 5. Work out your answer first.

6A.7 - Example 2: Comparative

Treatment Efficacy Studies

What if the primary response variable is binary?

When the outcome in a CTE trial is a binary response and the objective is to

compare the two groups with respect to the proportion of successes, the results can be expressed in a 2 × 2 table:

            Group #1     Group #2
Success     r1           r2
Failure     n1 - r1      n2 - r2

There are a variety of methods for performing the statistical test of the null

hypothesis H0: p1 = p2, such as a z-test using a normal approximation, a χ² test (basically, the square of the z-test), a χ² test with continuity correction, and

Fisher's exact test.

The normal and χ² approximations for comparing two proportions are relatively accurate when these conditions are met:

n1(r1 + r2)/(n1 + n2) ≥ 5,  n2(r1 + r2)/(n1 + n2) ≥ 5,
n1(n1 + n2 - r1 - r2)/(n1 + n2) ≥ 5,  n2(n1 + n2 - r1 - r2)/(n1 + n2) ≥ 5

Basically, when the expected number in each cell of the 2 × 2 table is greater than 5, the normal or chi-square approximation is useful.

Otherwise, Fisher's exact test is recommended. All of these tests are available

in SAS PROC FREQ and will be discussed later in the course.

A sample size formula for comparing the proportions p1 and p2 using the normal approximation is given below:

n2 = ((AR + 1)/AR) (z(1-α/2) + z(1-β))² p̄(1 - p̄)/(p1 - p2)²

where p1 - p2 represents the effect size and

p̄ = (AR × p1 + p2)/(AR + 1)

(Note this formula is the same as p. 173 in our text FFDRG if you assume the

allocation ratio is 1:1 and double the sample size here to get total sample size

2N as calculated in FFDRG)

Suppose an investigator wants to compare an experimental therapy to placebo when the response is success/failure, via a two-sided, 0.05 significance level test and 90% statistical power. She knows from the medical literature that 25% of

the untreated patients will experience success, so she decides that the

experimental therapy is worthwhile if it can yield a 50% success rate. With

equal allocation, n2 = (2)(1.96 + 1.28)2{0.375(1-0.375)}/(0.25)2 = 79. Thus, the

investigator should enroll n1 = 79 patients into treatment and n2 = 79 into

placebo for a total of 158 patients.

Notice that the allocation ratio of AR = 3 yields a total sample size larger than that for

the allocation ratio of AR = 1 (224 vs. 158).

A SAS example program [4] illustrates the use of PROC POWER to calculate sample size when comparing two binomial proportions.

SAS PROC POWER for Fisher's exact test yields n1 = 85 and n2 = 85 for AR =

1, and n1 = 171 and n2 = 57 for AR = 3.

Try it yourself! What would be the sample size required to have 80% power to detect that a new therapy has a significantly different success rate than the standard therapy success rate of 30%, if it is expected that the new therapy will result in at least 40% successes? Use a two-sided test with a 0.05 significance level. Work out your answer first.

Using Hazard Ratios

For many clinical trials, the response is time to an event. The methods of

analysis for this type of variable are generally referred to as survival

analysis methods. The basic approach is to compare survival curves.

A common approach compares the two treatment groups (and their survival curves) with respect to the hazard ratio. The survival function for a treatment group is characterized by λ, the hazard rate. At time t, λ(t) for a treatment group is defined as the instantaneous risk of the event (or failure) occurring at time t. In other words, given that a subject has remained event-free up to time t, the hazard at time t is the probability of the event occurring within the next instant. You can think of the hazard as the slope of the survival curve.

The hazard ratio is defined as the ratio of two hazard functions, λ1(t) and λ2(t), corresponding to the two treatment groups. Typically, we assume proportional hazards, i.e., Δ = λ1(t)/λ2(t) is a constant function independent of time. The

graphs on the next two slides illustrate the concept of proportional hazards.

A hazard function may be constant, increasing, or decreasing over time, or

even be a more complex function of time. In trials in which survival time is the

outcome, an increasing hazard function indicates that the instantaneous risk

of death increases throughout the trial.

A decreasing hazard occurs, for example, with the disease ARDS (adult respiratory distress syndrome), whereby the risk of

death is highest during the early stage of the disease.

A sample size formula for comparing the hazards of two groups via the

logrank test (discussed later in the course) is expressed in terms of the total

number of events, E, that need to occur. For a two-sided, α-level significance test with 100(1 - β)% statistical power, hazard ratio Δ, and allocation ratio AR,

E = ((AR + 1)²/AR) (z(1-α/2) + z(1-β))²/(loge(Δ))²

(Note this formula matches the FFDRG text p. 185 simple formula, if it is assumed that all participants will have an event. However, we most often have censored data, that is, a number of participants who do not experience the event before the trial ends.)

Since we do not expect all persons in the trial to experience an event, the

sample size must be larger than the required number of events.

Suppose that p1 and p2 represent the anticipated event rates in the two

treatment groups. Then the sample sizes can be determined from n2 = E/

(ARp1 + p2) and n1 = ARn2

If the hazard is constant over the study period [0, T] and p denotes the probability that a subject experiences the event by time T, then the hazard can be expressed as λ(t) = λ = -loge(1 - p)/T. In such a situation, the hazard ratio for comparing two groups is Δ = loge(1 - p1)/loge(1 - p2).

A constant hazard corresponds to an exponential survival curve, i.e., the probability of remaining event-free at time t is exp(-λt). Survival curves plot the probability that a subject remains free of the event over time.

Example

Suppose an investigator wants to compare an experimental therapy to placebo when the response is time to infection, via a two-sided, 0.05-significance level test with 90% statistical power and equal allocation. He plans to follow each patient for one year; he expects that 40% of the placebo group will experience infection, and he considers a 20% infection rate in the therapy group as clinically relevant.

The hazard ratio is Δ = loge(1 - 0.4)/loge(1 - 0.2) = 2.29, so the required number of events is

E = (4)(1.96 + 1.28)²/{loge(2.29)}² = 62

SAS Example (7.6_sample_size__time_.sas [5]) This is a program that

illustrates the use of PROC POWER to calculate sample size when comparing

two hazard functions.


PROC POWER requires the specification of points on the survival curves. In this example, at the end of the study, at time 1.01 (follow-up plus accrual in SAS), the proportion in the placebo group without an event is 0.6 and the proportion remaining event-free in the therapy group is 0.8.

SAS PROC POWER for the logrank test requires information on the accrual

time and the follow-up time. It assumes that if the accrual (recruitment) period

is of duration T1 and the follow-up time is of duration T2, then the total study

time is of duration T1 + T2. It assumes, however, if a patient is recruited at time

T1/2, then the follow-up period for that patient is T1/2 + T2 instead of T2. This

assumption may be reasonable for observational studies, but not for clinical

trials in which follow-up on each patient is terminated when the patient

reaches time T2. Therefore, for a clinical trial situation, set accrual time in SAS

PROC POWER equal to a very small positive number. For the given example,

SAS PROC POWER yields n1 = 109 and n2 = 109.

Expanded Safety (ES) trials are phase IV trials designed to estimate the

frequency of uncommon adverse events that may have been undetected in

earlier studies. These studies may be nonrandomized.

Suppose that the rate, λ, of the adverse event is small (because it did not crop up in prior trials), and that all participants in the cohort of size m are followed for approximately the same length of time. Under these assumptions we can model the probability of exactly d events occurring with a Poisson probability function, i.e.,

Pr[D = d] = (λm)^d exp(-λm)/d!

We want a high probability, γ, of observing at least one event when the event rate is λ. Thus, we want

γ = Pr[D ≥ 1] = 1 - Pr[D = 0] = 1 - exp(-λm)

to be relatively large. With respect to the cohort size, this means that m should be selected such that

m = -loge(1 - γ)/λ

Example

Suppose a company is marketing a new anti-arrhythmia drug. The company wants to determine the cohort size for following patients on the drug for a period of two years with respect to myocardial infarction. They want a 0.99 probability (γ = 0.99) of detecting a myocardial infarction rate of one per thousand (λ = 0.001). This yields a cohort size of m = 4,605.

(Note that the value γ in this problem is a probability of detecting at least one event, not quite the same as the power that we use in hypothesis testing.)

6A.10 - Adjustment Factors for Sample

Size Calculations

When calculating a sample size, we may need to adjust our calculations due

to multiple primary comparisons or for nonadherence to therapy.

If there is more than one primary outcome variable (for example, co-

primary outcomes) or more than one primary comparison (for example,

3 treatment groups), then the significance level should be adjusted to

account for the multiple comparisons in order not to inflate the overall false-

positive rate.

For example, suppose a clinical trial will involve two treatment groups and a

placebo group. The investigator may decide that there are two primary

comparisons of interest, namely, each treatment group compared to placebo.

The simplest adjustment to the significance level for each test is the Bonferroni correction, which uses α/2 instead of α.

In general, the Bonferroni correction is to use a significance level of α/K for each of the K comparisons.

The Bonferroni correction is not the most powerful or most sophisticated

multiple comparison adjustment, but it is a conservative approach and easy to

apply.

In some circumstances an adjustment to the significance level may not be necessary, depending on how the investigator plans to

interpret the results. For example, suppose there are two primary outcome

variables. If the investigator plans to claim success of the trial

if either endpoint yields a statistically significant treatment effect, then

an adjustment to the significance level is warranted. If the investigator plans to

claim success of the trial only if both endpoints yield statistically significant

treatment effects, then an adjustment to the significance level is not

necessary. Thus, an adjustment to the significance level in the presence of

multiple primary endpoints depends on whether it is an 'or' or an 'and' situation.

A second adjustment accounts for nonadherence to the assigned therapy (noncompliance). All participants randomized to therapy are expected to be

included in the primary statistical analysis, an intention-to-

treat [7] analysis. Intention-to-treat analysis will compare the treatments using

all data from subjects in the group to which they were originally assigned,

regardless of whether or not they followed the protocol, stayed on therapy, etc.

Some participants will choose to withdraw from a trial before it is complete.

Every effort will be made to continue obtaining data from all randomized

subjects; for those who withdraw from the study completely and do not

provide data, an imputation procedure may be required to represent their

missing data in subsequent data analyses. Some participants assigned to

active therapy discontinue therapy but continue to provide data (therapeutic

drop-outs). Some on a placebo or control add an active therapy (drop-ins)

and continue to be observed. The nonadherence (noncompliance) can lead

to a dilution of the treatment effect and lead to lower power for the study as

well as biased estimates of treatment effects.

Thus, a further adjustment to the sample size estimate may be made based on the anticipated drop-out and drop-in rates in each arm (see Wittes, 2002 [1]). A similar formula is on p. 179 of FFDRG.

A simple version of the adjustment is N* = N/(1 - R0 - RI)², where N is the sample size calculated under the assumption of full adherence and N* is the adjusted number for that treatment arm. R0 and RI denote the anticipated proportion of participants who will discontinue (drop out of) the test therapy and the proportion in the control group who will add or change to a more effective therapy, respectively.

Suppose a study has two treatment groups and will compare test therapy to

placebo. With only one primary comparison, we do not need to adjust the

significance level for multiple comparisons. Suppose that the sample size for

a certain power, significance level and clinically important difference works to

be 200 participants/group or 400 total.

Suppose we anticipate that some participants from the placebo group will begin an active therapy before the study is

complete. Let's estimate these 'drop-ins' to be 0.20. In the test therapy group,

we estimate 0.10 will discontinue active therapy.

The adjusted sample size is N* = 200/(1 - 0.10 - 0.20)² = 200/0.49 ≈ 409 per group, or 818 total. What an increase in sample size to maintain the power! (Note that whether the adjustment is applied to the n per group, 200/0.49, or to the total n, 400/0.49, the resulting sample sizes are the same; just remember what your N represents. If there is any fraction at the end of a sample size calculation, round UP to the next number divisible by the number of treatment groups.)

These are relatively simple calculations to introduce the idea of adjusting for

noncompliance as well as for multiple comparisons. More complicated

processes can be modeled.

Finally, when estimating a sample size for a study, an iterative process may be followed (adapted from Wittes, 2002):

1. What is the primary outcome?
2. What is the desired Type I error rate and power? If there is more than one primary outcome or comparison, make the required adjustments to the Type I error.
3. What is known about the variability of the primary outcome in this population? What would constitute a clinically important difference?
4. If the study is measuring time to failure, how long is the follow-up period? What assumptions should be made about recruitment?
5. Adjust the calculation for anticipated rates of noncompliance.
7. Select a sample size. Plot power curves as the parameters range over reasonable values.
8. Iterate as needed.

Which of these adjustments (or others, such as modeling dropout rates that

are not independent of outcome) is important for a particular study depends

on the study objectives. Not only must we consider whether there is more than one primary outcome or multiple primary comparisons, we must also consider the

nature of the trial. For example, if the study results are headed to a regulatory

agency, using a primary intention-to-treat analysis, it is important to

demonstrate an effect of a certain magnitude. Adjusting the sample size to

account for non-adherence makes sense. On the other hand, in a comparative

effectiveness study, the objective may be to estimate the difference in effect

when the intervention is prescribed vs the control, regardless of adherence. In

this situation, the dilution of effect due to nonadherence may be of little

concern.

Remember that sample size calculations are estimates! When stating a required sample size, always state any

assumptions that have been made in the calculations.

6A.11 - Summary

In this lesson, among other things, we learned to:

- Estimate the sample size required for a confidence interval for p for given α and precision δ, using the normal approximation and the exact binomial method.
- Estimate the sample size required for a confidence interval for μ for given α and precision δ, using the normal approximation when the sample size is relatively large.
- Estimate the sample size required for a two-sample comparison of means to have (1 - β)% power for given α and effect size Δ, using the normal approximation, with equal or unequal allocation.
- Estimate the sample size required for a two-sample comparison of proportions to have (1 - β)% power for given α, p1, and p2, using the normal approximation and Fisher's exact method.
- Estimate the number of events required for a logrank comparison of two hazard functions to have (1 - β)% power for a given α and hazard ratio.
- Estimate the cohort size required to have a certain probability of detecting a rare event that occurs at a rate λ.
- Adjust sample size estimates for multiple comparisons and the anticipated noncompliance rates.

Let's put what we have learned to use by completing the homework problems!

B

Introduction

This week we continue exploring the issues of sample size and power, this

time with regard to the differing purposes of clinical trials. Often the objective

of the trial is to establish that a therapy is efficacious, but what is the proper

control group? Can superiority to placebo be clearly established when there

are other effective therapies on the market? These questions lead to special

considerations based on whether the trial has an objective of establishing

superiority, equivalence, or non-inferiority. So, let's move ahead!

By the end of this lesson, you should be able to do the following:

- Compare superiority, equivalence, and non-inferiority trials in terms of:
  o objectives
  o control group
- State the purpose of a non-inferiority trial.
- Calculate sample sizes for equivalence and non-inferiority trials, using SAS programs.

References

Berger RL, Hsu JC. (1996) Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science 1996, 11: 283-319.

Piantadosi Steven. (2005) Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

Placebo-controlled trials typically provide an unambiguous statement of the

research hypothesis: either we want to show that the experimental treatment

is superior to placebo (one-sided alternative) or as is more often the case, we

want to show that the experimental treatment is different than placebo (two-

sided alternative). For this reason, we frequently refer to a placebo-controlled

trial as a confirmatory trial or in most recent language it is called a superiority

trial (even if we are using a two-sided alternative).

Active control groups are often used when a placebo control group would be unethical, for example when withholding an available effective therapy would put patients at unacceptable risk.

A trial with an active control may be designed as a superiority trial, an equivalence trial, or a non-inferiority trial. The new treatment may be preferred due to less cost, fewer side effects, or less impact on quality of life. Or the new treatment may have superior efficacy.

Equivalence and non-inferiority trials have objectives that differ from those of superiority trials. The objective of an equivalence trial is to demonstrate that a

therapy is equivalent to the active control (it is not inferior to and not superior

to the active control). Equivalent might not be the best word choice for this

type of trial as we will see later. The objective of a non-inferiority trial is to

demonstrate that a therapy is not inferior to the active control, i.e., it is not

worse than the active control.

There are statistical issues associated with equivalence and non-inferiority trials that are not well understood by clinical

investigators. We will examine these issues using examples of such trials in

this lesson.

Combination therapy trials are an example where the appropriateness of a

placebo control must be carefully considered.

The active control typically is a therapy that is accepted as the best available treatment (standard-of-care).

The standard-of-care could be a drug, a medical device, a surgical procedure,

diet, exercise, etc., or some combination of these various regimens.

In the MIRACLE trial [1], the standard-of-care for the eligible patients with heart

failure during the course of the trial consisted of some combination of the

following medications:

Diuretic

ACE inhibitor or angiotensin receptor blocker

Digitalis

Beta-blocker

Suppose the experimental therapy has a different mechanism of action than the standard-of-care. If so, then it may be possible to use the experimental therapy in combination with the standard-of-care, and we can consider designing a two-armed trial that compares:

standard-of-care + experimental therapy

versus

standard-of-care alone

The objective is to demonstrate superiority of the combination therapy to the standard-of-care.

In the MIRACLE trial for example, the comparison consisted of:

standard-of-care + pacemaker

versus

standard-of-care alone

On the other hand, if the experimental therapy is a similar modality as one of the components of the standard-of-care, then it

may not be appropriate to combine the experimental therapy with the

standard-of-care.

There are two other possibilities to consider for designing the clinical trial,

namely, an equivalence trial and a non-inferiority trial.

For an equivalence trial, it is necessary to determine a zone of clinical

equivalence prior to the trial onset.

As an example, consider a trial comparing two antihypertensive drugs (drugs that control blood pressure). Suppose that the

standard drug yields a mean reduction of 5 mm Hg in diastolic blood pressure

for a certain patient population. The investigator may decide that the

experimental drug is clinically equivalent to the standard drug if its mean

reduction in diastolic blood pressure is 3-7 mm Hg. This is based on clinical

judgment and there may be differences of opinion on this 'arbitrary' level of

equivalence.

Thus, the difference in means between the two therapies does not exceed 2

mm Hg. Let's suppose that we are willing to accept this level.

The zone of equivalence is defined by a margin δ: the difference in population means between the experimental therapy and the active control, μE − μA, should lie within (−δ, +δ). Differences in response less than δ are considered 'equally effective' or 'noninferior'.

The choice of δ can be controversial. Some researchers recommend that δ be selected as less than one-half of the magnitude of the effect observed from the superiority trials comparing the active control to placebo.

In the antihypertensive example, δ = 2 satisfies this requirement (2/5 = 0.4 < 0.5), but why not select δ = 1?

One difficulty is that an equivalence trial does not have a natural check for internal validity, because equivalence of the experimental

and active control therapies does not necessarily imply that either of them is

effective. In other words, if a third treatment arm of placebo had been included

in the trial, it is possible that neither the experimental therapy nor the active

control therapy would demonstrate superiority over placebo. There is no direct

establishment of superiority inherent in the way the trial is set up.

The investigator needs to select an active control therapy for the equivalence

trial that has been proven to be superior to placebo. An important assumption

is that the active control would be superior to placebo (had placebo been a

treatment arm in the current trial).

Some trials have used appropriate active control therapies, but at doses less than recommended (rendering them ineffective). It is

important to select the proper control, and use it at an appropriate dose level.

An external validity check also is advisable, e.g., compare the experimental and active control therapies of the current

study to published reports for comparative trials that involve the active control

therapy versus a placebo control. Are similar results observed for the active

therapy in the equivalence trial as in the published study against placebo?

Other features to compare include withdrawal rates, use of rescue medications, etc. An external validity check is

only possible if the chosen active control therapy for the equivalence trial was

determined effective in a superiority trial. An under-dosed or over-dosed

regimen for the active control therapy in an equivalence trial can bias the

results and interpretations. In addition, the design for the equivalence trial

should mimic (within reason) the design for the superiority trial. Some of this

advice is difficult to follow and may be impossible to implement.

(Another aspect of internal validity of course, is the quality of the trial, in terms

of inclusion/exclusion criteria, dosing regimens, quality control, etc. Do not run

a sloppy study!)

The U.S. Food and Drug Administration (FDA) and the National Institutes of

Health (NIH) typically require intent-to-treat (ITT) analyses in placebo-

controlled trials. In an ITT analysis, data on all randomized patients are

included in the analysis, regardless of protocol violations, lack of adherence,

withdrawal, incorrectly taking the other treatment, etc. The ITT analysis

reflects what will happen in the real world, outside the realm of a controlled

clinical trial.

In a superiority trial, the ITT analysis is considered conservative because it tends to diffuse the difference between the treatment

arms. There is more 'noise' in an ITT study. This is due to the increased

variability from protocol violations, lack of adherence, withdrawal, etc. You can

overcome this noise by increasing sample size.

It is a misconception that the ITT analysis will have the opposite effect in an

equivalence trial, i.e., it will be easier to demonstrate equivalence. This is not

so. Even with an ITT analysis in an equivalence trial, it still is important to

conduct a well-designed study with sufficient sample size and good quality

control.

In a protocol (per-protocol) analysis, subjects are analyzed according to the treatment received. A protocol analysis

excludes subjects who did not satisfy the inclusion criteria, did not comply with

taking study medications, violated the protocol, etc. You are excluding data

from the patients that do not follow the protocol when it comes to the analysis.

A protocol analysis is expected to enhance differences between treatments,

so it usually will be conservative for an equivalence trial. Obviously, a protocol

analysis is susceptible to many biases and must be performed very carefully.

You may think that you are removing all of the biases, when in fact you may

not be. A protocol analysis could be considered as supplemental to the ITT

analysis. The U.S. FDA moved to ITT studies years ago to avoid biases

introduced when researchers selectively excluded patients from analysis

because of various protocol deviations. Many of the major medical journals

also will only accept ITT studies for these reasons as well.

A non-inferiority trial is similar to an equivalence trial. The research question in

a non-inferiority trial is whether the experimental therapy is not inferior to the

active control (whereas the experimental therapy in an equivalence trial

should not be inferior to, nor superior to, the active control). Thus, a non-

inferiority trial is one-sided, whereas an equivalence trial is two-sided. (For

non-inferiority, we do not mind if the experimental therapy turns out to be better than the active control; we only need to rule out that it is meaningfully worse.)

Assume that the larger response is the better response. The one-sided zone

of non-inferiority is defined by −δ, i.e., the difference in population means between the experimental therapy and the active control, μE − μA, should lie within (−δ, +∞).

Many of the same issues that are critical for designing an equivalence trial

also are critical for designing a non-inferiority trial, namely, appropriate

selection of an active control and appropriate selection of the zone of clinical

non-inferiority defined by δ.

Hypertensive Example

Consider again the comparison of two antihypertensive therapies.

The researchers may decide that the experimental drug is clinically not inferior

to the standard drug if its mean reduction in diastolic blood pressure is at least

3 mm Hg (δ = 2). Thus, the difference in population means between the experimental therapy and the active control therapy, μE − μA, should lie within (−δ, +∞). It does not matter if the experimental drug is much better than the active

control drug, provided that it is not inferior to the active control drug.

Because a non-inferiority trial design allows for the possibility that the

experimental therapy is superior to the active control therapy, the non-

inferiority design is preferred over the equivalence design. The equivalence

design is useful when evaluating generic drugs.

Testing

Statisticians construct the null hypothesis and the alternative hypothesis for

statistical hypothesis testing such that the research hypothesis is the

alternative hypothesis:

H0: {non-equivalence} vs. H1: {equivalence}

or

H0: {inferiority} vs. H1: {non-inferiority}

In terms of the population means, the hypotheses for testing equivalence are expressed as:

H0: {μE − μA ≤ −δ or μE − μA ≥ +δ} vs. H1: {−δ < μE − μA < +δ}

In terms of the population means, the hypotheses for testing non-inferiority are expressed as:

H0: {μE − μA ≤ −δ} vs. H1: {μE − μA > −δ}

The equivalence hypotheses can be decomposed into two distinct one-sided hypothesis testing problems, one for non-inferiority:

H01: {μE − μA ≤ −δ} vs. H11: {μE − μA > −δ}

and one for non-superiority:

H02: {μE − μA ≥ +δ} vs. H12: {μE − μA < +δ}

Equivalence is claimed only if the null hypothesis of inferiority (H01) is rejected AND the null hypothesis of superiority (H02) is rejected.

This rationale leads to what is called two one-sided testing (TOST). If the data

are approximately normally distributed, then two-sample t tests can be

applied. If normality is suspect, then Wilcoxon rank-sum tests can be applied.

With respect to two-sample t tests, reject the null hypothesis of inferiority if:

t_inf = (ȲE − ȲA + δ) / (s·√(1/nE + 1/nA)) > t_{nE+nA−2, 1−α}

and reject the null hypothesis of superiority if:

t_sup = (ȲE − ȲA − δ) / (s·√(1/nE + 1/nA)) < −t_{nE+nA−2, 1−α}

where s is the square root of the pooled sample estimate of the variance:

s² = ( Σ(YEi − ȲE)² + Σ(YAj − ȲA)² ) / (nE + nA − 2)

with the sums taken over the nE and nA subjects in the two groups.
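To make the two one-sided tests concrete, here is a minimal SAS data-step sketch of the TOST decision rule computed from summary statistics. It is not one of the course's posted programs, and the numerical values (means, pooled standard deviation, sample sizes, and margin) are hypothetical placeholders.

data tost;
   /* hypothetical summary statistics and equivalence margin (placeholders) */
   ybarE = 12.1;   ybarA = 11.5;      /* sample means                        */
   s     = 4.0;                       /* pooled standard deviation           */
   nE    = 50;     nA    = 50;        /* group sample sizes                  */
   delta = 2.0;                       /* equivalence margin                  */
   alpha = 0.05;

   se    = s*sqrt(1/nE + 1/nA);            /* standard error of the difference   */
   tcrit = tinv(1 - alpha, nE + nA - 2);   /* 100(1-alpha)th t percentile        */

   t_inf = (ybarE - ybarA + delta)/se;     /* reject inferiority null if  > tcrit */
   t_sup = (ybarE - ybarA - delta)/se;     /* reject superiority null if < -tcrit */

   equivalent = (t_inf > tcrit) and (t_sup < -tcrit);
   put t_inf= t_sup= tcrit= equivalent=;
run;

Equivalence is declared only when both one-sided tests reject, which is exactly the TOST logic described above.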

Confidence Intervals

Confidence intervals can be used in place of the statistical tests. Reporting of

confidence intervals is more informative because it indicates the magnitude of

the treatment difference and how close it approaches the equivalence zone.

The 100(1 − α)% confidence interval that corresponds to testing the null hypothesis of non-equivalence versus the alternative hypothesis of equivalence at the α significance level has the following limits:

lower limit = min{ 0, (ȲE − ȲA) − t_{nE+nA−2, 1−α} · s·√(1/nE + 1/nA) }

upper limit = max{ 0, (ȲE − ȲA) + t_{nE+nA−2, 1−α} · s·√(1/nE + 1/nA) }

This confidence interval does provide 100(1 − α)% coverage (see Berger RL,

Hsu JC. Bioequivalence trials, intersection-union tests, and equivalence

confidence sets. Statistical Science 1996, 11: 283-319).

This interval is consistent with testing the null hypothesis of non-equivalence versus the alternative hypothesis of equivalence at the α significance level. Note that the Berger and Hsu 100(1 − α)% confidence interval is similar to the 100(1 − 2α)%

confidence interval in its construction except that (1) the lower limit, if positive,

is set to zero, and (2) the upper limit, if negative, is set to zero.

If the 100(1 − α)% confidence interval lies entirely within (−δ, +δ), then the null hypothesis of non-equivalence is rejected in favor of the alternative hypothesis of equivalence at the α significance level.

For a non-inferiority trial, the two-sample t statistic labeled t_inf [2], previously discussed, can be applied to test:

H0: {μE − μA ≤ −δ} vs. H1: {μE − μA > −δ}

Alternatively, the 100(1 − α)% lower confidence limit is of interest:

(ȲE − ȲA) − t_{nE+nA−2, 1−α} · s·√(1/nE + 1/nA)

If the 100(1 − α)% lower confidence limit lies within (−δ, +∞), then the null hypothesis of inferiority is rejected in favor of the alternative hypothesis of non-inferiority at the α significance level.

The FDA typically is more stringent than is required in non-inferiority tests. The

FDA typically requires companies to use α = 0.025 for a non-inferiority trial, so

that the one-sided test or lower confidence limit is comparable to what would

be used in a two-sided superiority trial.

[Figure: confidence intervals relative to the zone (−δ, +δ), illustrating when equivalence, non-equivalence, non-inferiority, or inferiority is concluded.]

Example

Suppose that an investigator conducts an equivalence trial with 30 patients in each of the experimental therapy and active control groups (nE = nA = 30). He defines the zone of equivalence with δ = 4. The sample means and the pooled sample standard deviation are

ȲE = 17.4, ȲA = 20.6, s = 6.5

The t percentile, t58,0.95, can be found from the TINV function in SAS as TINV(0.95,58), which yields t58,0.95 = 1.67. Thus, using the formulas in the section above, the lower limit = min{0, −3.2 − 2.8} = min{0, −6.0} = −6.0; the upper limit = max{0, −3.2 + 2.8} = max{0, −0.4} = 0.0. This yields (−6.0, 0.0) as the 95% confidence interval for testing equivalence of μE − μA. Because the 95% confidence interval for μE − μA does not lie entirely within (−δ, +δ) = (−4, +4), the null hypothesis of non-equivalence is not rejected at the 0.05 significance level. Hence, the investigator cannot conclude that the experimental therapy is equivalent to the active control.
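The calculation above is easy to reproduce in a SAS data step. The following sketch (not one of the course's posted programs) uses the summary statistics from this example and the Berger-Hsu construction of the limits.

data bh_ci;
   /* summary statistics from the example above */
   ybarE = 17.4;   ybarA = 20.6;   s = 6.5;
   nE = 30;  nA = 30;  alpha = 0.05;  delta = 4;

   diff  = ybarE - ybarA;                       /* -3.2                        */
   se    = s*sqrt(1/nE + 1/nA);
   tcrit = tinv(1 - alpha, nE + nA - 2);        /* TINV(0.95, 58) = 1.67       */

   lower = min(0, diff - tcrit*se);             /* Berger-Hsu lower limit      */
   upper = max(0, diff + tcrit*se);             /* Berger-Hsu upper limit      */

   /* equivalence is concluded only if (lower, upper) lies within (-delta, +delta) */
   equivalent = (lower > -delta) and (upper < delta);
   put lower= upper= equivalent=;
run;

Running this reproduces the interval (−6.0, 0.0) and confirms that equivalence cannot be claimed.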

Suppose instead that the investigator had designed the study as a non-inferiority trial rather than an equivalence trial, and he defines the zone of non-inferiority with δ = 4, i.e., (−4, +∞). The 95% lower confidence limit for μE − μA is −6.0, which does not lie within (−4, +∞). Therefore, the investigator cannot claim non-inferiority of the experimental therapy to the active control.

A real example of a non-inferiority trial is the VALIANT [3] trial in patients with

myocardial infarction and heart failure. Patients were randomized to valsartan

monotherapy (nV = 4,909), captopril monotherapy (nC = 4,909), or valsartan +

captopril combination therapy (nVC = 4,885). The primary outcome was death

from any cause. One objective of the VALIANT trial was to determine if the

combination therapy is superior to each of the monotherapies. Another

objective of the trial was to determine if valsartan is non-inferior to captopril,

defined by δ = 2.5% in the overall death rate.

Switching Objectives

Suppose that in a non-inferiority trial, the 95% lower confidence limit for μE − μA not only lies within (−δ, +∞) to establish non-inferiority, but also lies within (0, +∞). It is safe to claim superiority of the experimental therapy to the active

control in such a situation (without any statistical penalty).

In a superiority trial, suppose that the 95% lower confidence limit for μE − μA does not lie within (0, +∞), indicating that the experimental therapy is not superior to the active control. If the protocol had specified non-inferiority as a secondary objective and specified an appropriate value of δ, then it is safe to claim non-inferiority if the 95% lower confidence limit for μE − μA lies within (−δ, +∞).

For a continuous outcome that is approximately normally distributed in

an equivalence trial, the number of patients needed in the active control arm,

nA, where AR = nE/nA, to achieve 100(1 − β)% statistical power with an α-level significance test is approximated by:

nA = ((AR + 1)/AR) · (t_{nE+nA−2, 1−α} + t_{nE+nA−2, 1−β})² · σ² / (δ − |ε|)²

Notice the difference in the t percentiles between this formula and that for a

superiority comparison, described earlier. The difference is due to the two

one-sided testing that is performed.

Note that it often is assumed that the true difference in population means, ε = μE − μA, is null in this sample size formula. This is an optimistic assumption and

may not be realistic.

(Note: the formula above simplifies to the formula on p. 189 in the FFDRG text

if AR = 1, ε = 0, and Z is substituted for t.)

For a binary outcome, the zone of equivalence for the difference in population proportions between the experimental therapy and the active control, pE − pA, is defined by the interval (−δ, +δ). The number of patients needed in the active control arm, nA, where AR = nE/nA, to achieve 100(1 − β)% statistical power with an α-level significance test is approximated by:

nA = ((AR + 1)/AR) · (z_{1−α} + z_{1−β})² · p̄(1 − p̄) / (δ − |pE − pA|)²

where

p̄ = (AR·pE + pA) / (AR + 1)

How does this formula compare to FFDRG p. 189? The choice of the value for

p̄ in our text is to use the control group value, assuming that pE − pA = 0.

For a time-to-event outcome, the zone of equivalence for the hazard ratio

between the experimental therapy and the active control, Δ, is defined by the interval (1/Δ*, +Δ*), where Δ* is chosen > 1. The number of patients who need to experience the event, E, to achieve 100(1 − β)% statistical power with an α-level significance test is approximated by

E = ((AR + 1)²/AR) · (z_{1−α} + z_{1−β})² / (loge(Δ*/Δ))²

If pE and pA represent the anticipated failure rates in the two treatment groups,

then the sample sizes can be determined from nA = E/(ARpE + pA) and nE =

ARnA

If the hazard function is assumed to be constant over the follow-up period [0, T], then it can be expressed as λ(t) = λ = −loge(1 − p)/T, where p is the probability of experiencing the event by time T. In such a situation, the hazard ratio for comparing two groups is Δ = loge(1 − pE)/loge(1 − pA). The same formula can be applied, with different values of pE and pA, to determine the required number of events and sample size for a non-inferiority trial with a time-to-event outcome.

For a continuous outcome that is approximately normally distributed in a non-inferiority trial, the number of subjects needed in the active control arm, nA, where AR = nE/nA, to achieve 100(1 − β)% statistical power with an α-level significance test is approximated by:

nA = ((AR + 1)/AR) · (t_{nE+nA−2, 1−α} + t_{nE+nA−2, 1−β})² · σ² / (δ − |ε|)²

Notice that the sample size formulae for non-inferiority trials are exactly the

same as the sample size formulae for equivalence trials. This is because of

the one-sided testing for both types of designs (even though an equivalence

trial involves two one-sided tests). Also notice that the choice of Z in the formulas above has assumed a one-sided test or two one-sided tests, but

the requirements of regulatory agencies and the approach in our FFDRG text

is to use the Z value that would have been used for a 2-sided hypothesis test.

In homework, be sure to state any assumptions and the approach you are

taking.

Example 1 ( 7.7_-_sample_size__normal__e.sas [4])

An investigator wants to design an equivalence trial with an experimental therapy and an active control. The primary outcome

is forced expiratory volume in one second (FEV1). The investigator desires a

0.05-significance level test with 90% statistical power and decides that the

zone of equivalence is (−δ, +δ) = (−0.1 L, +0.1 L) and that the true difference in means does not exceed ε = 0.05 L. The standard deviation reported in the literature for a similar population is σ = 0.75 L. The investigator plans to have

equal allocation to the two treatment groups (AR = 1).

[4]

The number of patients required for the active control group is:

Come up with an answer to this question by yourself and then click on the

icon to the left to reveal the solution.

What happens to the total sample size if the power is to be 0.95 and the

investigator uses 2:1 allocation?
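Here is a minimal SAS data-step sketch of the base calculation for Example 1 (it is not the posted 7.7 program). It applies the sample size formula above with normal (Z) percentiles in place of the t percentiles, which is a common simplifying assumption.

data ss_equiv_normal;
   /* Example 1 inputs */
   alpha = 0.05;   power = 0.90;
   delta = 0.1;    eps   = 0.05;     /* margin and assumed true difference (L)     */
   sigma = 0.75;   AR    = 1;        /* standard deviation (L), allocation ratio   */

   zalpha = probit(1 - alpha);       /* Z approximation to the t percentiles       */
   zbeta  = probit(power);

   nA = ((AR + 1)/AR) * (zalpha + zbeta)**2 * sigma**2 / (delta - abs(eps))**2;
   nA = ceil(nA);
   nE = AR*nA;
   total = nA + nE;
   put nA= nE= total=;

   /* for the follow-up question, rerun with power = 0.95 and AR = 2 */
run;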

Example 2. An investigator wants to compare an experimental therapy to an active control in a non-inferiority trial when the response is treatment success. She desires a

0.025 significance level test and 90% statistical power. She knows 70% of the

active control patients will experience success, so she decides that the

experimental therapy is not inferior if it yields at least 65% success. Thus, δ = 0.05 and she assumes that the true difference is pE − pA = 0.

[5]

With equal allocation, the number of patients in the active control group is:

SAS PROC POWER does not contain a feature for an equivalence trial or a

non-inferiority trial with binary outcomes. Fishers exact test for a superiority

trial can be adapted to yield nE = nA = 1,882 for a total of 3,764 patients. The

discrepancy is due to the superiority trial using p-bar = 0.675 instead of 0.7.

Come up with an answer to this question by yourself and then click on the icon to the left to reveal the solution.

Suppose the proportions were 0.65 and 0.75. How does the required sample

size, n, change?
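A minimal SAS data-step sketch of the calculation for Example 2 follows (again, not the posted course program). It implements the binary-outcome formula above with the stated assumptions (true difference of zero, equal allocation).

data ss_noninf_binary;
   /* Example 2 inputs */
   alpha = 0.025;   power = 0.90;
   pE = 0.70;  pA = 0.70;            /* assumed true success proportions          */
   delta = 0.05;  AR = 1;            /* non-inferiority margin, allocation ratio  */

   zalpha = probit(1 - alpha);
   zbeta  = probit(power);
   pbar   = (AR*pE + pA)/(AR + 1);   /* weighted average proportion               */

   nA = ((AR + 1)/AR) * (zalpha + zbeta)**2 * pbar*(1 - pbar)
        / (delta - abs(pE - pA))**2;
   nA = ceil(nA);
   nE = AR*nA;
   put nA= nE=;

   /* the follow-up question can be explored by changing pE and pA above */
run;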

Example 3. An investigator wants to compare an experimental therapy to an active control in a non-inferiority trial. The response is time to infection. He desires a 0.025-

significance level test with 90% statistical power and AR =1. Follow-up for

each patient is one year and he expects 20% of the active control group will

get an infection (pA = 0.2). Although he believes that pE = 0.2, he considers the

experimental therapy to be non-inferior if pE ≤ 0.25. The SAS program below,

for a one-sided superiority trial may approximate the required sample size.

The sample size can be worked out exactly as follows:

Assuming constant hazard functions, the effect size with pE = pA = 0.2 is Δ = 1. With pE = 0.25 and pA = 0.2, the zone of non-inferiority is defined by:

Δ* = loge(1 − 0.25)/loge(1 − 0.2) ≈ 1.29

This yields approximately E = 648 required events, and the sample sizes are nA = E/(AR·pE + pA) = 648/(0.2 + 0.2) = 1,620 and nE = 1,620.

Since SAS PROC POWER does not contain a feature for an equivalence trial

or a non-inferiority trial with time-to-event outcomes, the results from the

logrank test for a superiority trial were adapted to yield nE = nA = 1,457. The

discrepancy in numbers between the program and the calculated n is due to

the superiority trial using pE = 0.25 instead of 0.2 in nA = E/(ARpE + pA).
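The exact calculation for Example 3 can also be scripted. The sketch below (not the posted course program) follows the events formula above under the constant-hazard assumption; small differences from the quoted values of E and nA arise only from rounding of the hazard ratio and the normal percentiles.

data ss_noninf_tte;
   /* Example 3 inputs */
   alpha = 0.025;  power = 0.90;  AR = 1;
   pA = 0.20;                     /* anticipated one-year infection rate, control   */
   pE = 0.20;                     /* assumed true rate for the experimental therapy */
   pE_margin = 0.25;              /* largest experimental rate still non-inferior   */

   zalpha = probit(1 - alpha);
   zbeta  = probit(power);

   /* hazard ratio defining the non-inferiority boundary (constant hazards assumed) */
   delta_star = log(1 - pE_margin)/log(1 - pA);

   E  = ((AR + 1)**2/AR) * (zalpha + zbeta)**2 / (log(delta_star))**2;  /* events    */
   nA = ceil(E/(AR*pE + pA));                                           /* per group */
   nE = AR*nA;
   put delta_star= E= nA= nE=;
run;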

Notice that the resultant sample sizes in SAS Examples 7.7-7.9 all are

relatively large. This is because the zone of equivalence or non-inferiority is

defined by a small value of δ. Generally, equivalence trials and non-inferiority

trials will require larger sample sizes than superiority trials.

If a noncompliance rate of ψ is anticipated, then the sample size should be increased by the factor 1/(1 − ψ).

6B.9 - Summary

In this lesson, among other things, we learned:

- how superiority, equivalence, and non-inferiority trials differ in terms of:
  o objectives
  o control group
- the design issues particular to an equivalence trial and a non-inferiority trial.
- how to compute sample sizes for equivalence and non-inferiority trials, using SAS programs.

Let's put what we have learned to use by completing the homework assignment! (posted in ANGEL last week)

Introduction

Even when a study protocol is described and followed precisely by different investigators at different locations, there can be enough patient heterogeneity and differences in protocol interpretation that the results can vary greatly across institutions.

Thus, the differences in results actually could be due to different selection

factors at the different institutions. Recruitment strategies might be different.

Due to this, different patients are recruited into the study.

Eligibility criteria also define the accrual rate for a trial. Although tighter

eligibility criteria lead to a more homogeneous trial, they yield a slower accrual

rate. It might be more difficult to meet all of the criteria you specify using strict

eligibility criteria.

Upon completion of this lesson, you should be able to do the following:

- Compare the benefits and limitations of narrowly defined eligibility criteria to broadly defined eligibility criteria.
- Write eligibility criteria in quantitative terms to minimize misinterpretation.
- Recognize the advantages of a run-in period or extended baseline for a study.

References

Gotay CC. (1991). Accrual to cancer clinical trials: directions from the research literature. Soc. Sci. Med. 33: 569-577.

Piantadosi S. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

It can be more difficult to isolate a biological effect of a treatment if the

investigator uses a broadly-defined cohort, i.e., patients with a variety of

disease types and/or severity. It is easier to isolate a biological effect of a

treatment in a narrowly-defined cohort because of patient homogeneity. The

researcher's job is to balance these factors. Every situation is different and the

researcher needs to think carefully when defining the selection criteria.

Although a narrowly-defined cohort may have some external validity for others

with the same disease if the treatment appears to be beneficial, in general it

will lack external validity because the study results may not apply to patients

with slightly altered versions of the disease. Again, these are examples of the

competing demands that the researcher must keep in mind.

Epidemiologists have defined the healthy worker effect as the phenomenon

that the general health of employed individuals is better than average. For

example, employed individuals may be unsuitable controls for a case-control

study if the cases are hospitalized patients. Similarly, individuals who

volunteer for clinical trials may have more favorable outcomes than those who

refuse to participate, even if the treatment is ineffective. This selection effect is

known as the trial participant effect and it can be strong. For a randomized

trial, however, this may not be a problem unless selection effects somehow

impact treatment assignment.

Because of the possible effects of prognostic (variables that can affect the

outcome) and selection factors on differences in outcome, the eligibility criteria

for the study cohort need to be defined carefully.

1. Define very narrow eligibility criteria so that the study cohort is relatively

homogeneous, which may yield an outcome variable that has less

variability and result in a smaller sample size; however, the results may

not have external validity.

2. Define broad eligibility criteria so that the study cohort is more heterogeneous, and account for the larger amount of variability by incorporating a larger sample size, which will provide much more external validity. (This is easy for a statistician to say!)

Sometimes complicating factors are prevented by patient exclusions. For example,

habitual smokers typically are excluded from asthma trials because their lung

function may be impaired by smoking as well as by their asthma. The smoking

behavior may confound the results of the study. Exclusions also may be

invoked for ethical reasons if the treatment is not expected to benefit a certain

subgroup of patients. For example, some oncology (cancer) trials might

exclude patients whose life expectancy does not exceed six months.

Misinterpretation of eligibility criteria can be minimized via quantitative expressions. For example, inclusion criteria should

specify the range of allowable serum chemistry variables, instead of just

stating that, "we will require normal lab values". Different hospitals are going

to have different interpretations of what normal is. Obviously, you need to be

specific.

Once the decisions are made about the study cohort and other design issues

resolved, the protocol approved and study medications obtained, the

investigator begins what can be the most difficult task in a clinical trial -

recruitment! Despite the most optimistic beliefs about the existence of

available patients out there, a host of factors can make the recruitment of

patients challenging.

(You may notice in this section we have defined a study cohort for the trial.

This doesn't mean, however, that every clinical trial is a cohort study in the

sense of a long-term study following a defined group of patients.)

It is unfortunate that some clinical trials are terminated early due to low

accrual, which is a waste of resources and time for all those involved.

Investigators often overestimate the accrual rate because they may not

account for (1) the restrictions imposed by the eligibility criteria and (2) the

refusal by some eligible patients to participate.

A common saying among clinical trialists is "The incidence of a disease diminishes when you initiate a study on it." (source unknown)

A run-in period prior to randomization can help determine whether eligible patients will adhere to the protocol. For example, patients can be

administered a placebo during the run-in period and monitored for treatment

compliance. At the completion of the run-in period, those patients who meet

the treatment compliance criteria are then randomized to treatment, whereas

those who do not are discontinued from the study. Another advantage of

incorporating a run-in period is that it may provide the opportunity for patients

to be stabilized via a standard medication prior to randomization.

A disadvantage of a run-in period is that it may reduce the external validity of the trial because in the real world some patients will not be

very compliant. Thus, a trial based on very compliant patients may

overestimate the effectiveness of the treatment.

It may be useful to survey eligible patients prior to the onset of a trial to determine the proportion that would consider participation. This might indicate to the researcher the approximate proportion of patients that would consider participating and enable realistic timetables for completing trials.

Accrual should be monitored throughout the course of a trial. An accrual graph with target and actual numbers of recruited patients helps monitor the process. This task typically falls on the statistician. Here is an example of a plot monitoring the accrual of patients.

The target assumes a constant accrual of patients. There was a lag in the

number of patients recruited at the beginning, but recruitment caught up with the target by the end of the study. This struggle in the number

of patients recruited is very typical. Recruitment is always a struggle.

Everyone on the research team needs to help with this process.

Among adult cancer patients in the USA, less than 3% of these patients

participate in clinical trials. (Gotay, 1991) Since the process of clinical trials

leads to improvements in cancer therapy over time, it would seem that cancer

patients would be motivated to participate in increasing numbers over time. But

this has not happened. Most diseases, except for AIDS and some pediatric

conditions, exhibit similar types of participation rates. The three general

reasons for lack of participation are categorized as physician-, patient-, or

administrative-related.

The reasons physicians give for failing to enroll patients in clinical trials are

the perception that the trial may compromise the physician-patient relationship

and the difficulties with informed consent. Many consent forms are

cumbersome, intimidating, and not written at an appropriate reading level. The

'experts' say that these documents should be written at an 8th grade reading

level. Using plain language is important. Also, many patients are mistrustful of

the medical establishment, although they may trust their individual physicians.

Often, ethnic minority groups express even stronger concerns about

participation in clinical trials.

In an efficacy trial, the study cohort is relatively homogeneous and the objective is

to test a biological question. In an effectiveness trial, the study cohort is

relatively heterogeneous and the objective is to assess effectiveness of a

treatment. An effectiveness trial tends to be very large and expensive, but has

much more external validity because of broad eligibility criteria and a

heterogeneous population. Most clinical trials are effectiveness studies.

Example

An example of an efficacy trial is a study conducted by the Asthma Clinical Research Network (ACRN), entitled Dose of Inhaled Corticosteroids with

Equisystemic Effects (DICE) [1]. The primary objective of the trial was to

investigate dose-response effects of various inhaled corticosteroids (ICS) on

cortisol production by the adrenal glands. Subjects with mild-moderate asthma

were recruited. There were many exclusion criteria, such as obesity,

pregnancy or lactation, no oral or injectable steroids during the past twelve

months, no ICS or nasal steroids during the past six months, and no topical

steroids during the past two months. Subjects were randomized to either one

of six different ICS (n = 24 per group) or placebo (n = 12). ICS dose was

doubled on a weekly basis (0d, 1d, 2d, 4d, and 8d, where d is a pre-selected

low dose for each ICS). Subjects stayed overnight at a hospital at the end of

each week, during which blood was drawn hourly and analyzed to determine

the concentration of cortisol.

This study was examining a very specific biological question. The primary

objective of the trial was to establish whether increasing the dose for each ICS

yields a decrease in plasma cortisol (adrenal suppression). The researchers

were interested in looking at dose response curves. The DICE trial was not

powered to compare the dose-response curves of each ICS.

DICE is strictly an efficacy trial with very narrow eligibility criteria. Furthermore,

the protocol specified that the intent-to-treat paradigm would not be followed.

Subjects were dropped post-randomization if they received other forms of

steroids, became pregnant, or were non-compliant with dose schedules

and/or visit schedules.

On the other hand, over the past 20 years, there has been great interest in the

gender and ethnic composition of cohorts in clinical trials. Part of this interest

is due to ensuring external validity of the results of the trials. For many years

caucasian males were the only patients recruited for the purpose of assuring

homogeneity. This has been broadened by both the FDA and NIH in their

application process. The broader eligibility requirements will help to ensure

broader external validity.

The NIH typically requires one-half female participation and one-third ethnic

minority participation in CTE trials that it sponsors. Obviously, there are

exceptions to this based on the disease of interest. Required representation in

clinical trials, however, could be a hindrance to acquiring new knowledge if it

consumes too many resources.

7.4 - Summary

In this lesson, among other things, we learned:

- the benefits and limitations of narrowly defined eligibility criteria compared with broadly defined eligibility criteria.
- how to write eligibility criteria in quantitative terms to minimize misinterpretation.
- the advantages of a run-in period or extended baseline for a study.

Let's put what we have learned to use by completing the following homework

assignment:

Homework

Lesson 8: Treatment Allocation and

Randomization

Introduction

Nonrandomized schemes, such as investigator-selected treatment

assignments, are susceptible to large biases. Even nonrandomized schemes

that are systematic, such as alternating treatments, are susceptible to

discovery and could lead to bias. Obviously, to reduce biases, we prefer

randomized schemes. Credibility requires that the allocation process be non-

discoverable. The investigator should not know what treatment will be assigned until the patient has been determined to be eligible. Even envelopes with the treatment assignment sealed inside are prone to discovery.

Various allocation schemes are useful in different circumstances. When choosing an allocation scheme for a clinical trial, there are three technical considerations:

1. reducing bias;
2. producing a balanced comparison; and
3. quantifying errors attributable to chance.

Upon completion of this lesson, you should be able to meet the following objectives:

- Contrast the benefits of permuted blocks to those of adaptive randomization schemes.
- Determine the appropriate allocation ratio for a situation where greater variability is expected in one treatment group than the other.

8.1 - Randomization

In some early clinical trials, randomization was performed by constructing two

balanced groups of patients and then randomly assigning the two groups to

the two treatment groups. This is not always practical as most trials do not

have all the patients recruited on day one of the study. Most clinical trials

today invoke a procedure in which individual patients, upon entering the study,

are randomized to treatment.

With randomization, the treatment assignment will not be based on patients' prognostic factors. Thus,

investigators cannot favor one treatment group over another by assigning

patients with better prognoses to it, either knowingly or unknowingly.

Procedure selection bias has been documented to have a very strong effect

on outcome variables.

Randomization typically prevents confounding of the treatment effects with other prognostic

variables. Some of these factors may or may not be known. The investigator

usually does not have a complete picture of all the potential prognostic

variables, but randomization tends to balance the treatment groups with

respect to the prognostic variables.

If imbalance in prognostic variables does occur, the investigator can conduct a statistical analysis, e.g., analysis of covariance (ANCOVA), that

adjusts for the prognostic variables. It always is best, however, to prevent a

problem rather than adjust for it later. In addition, ANCOVA does not

necessarily resolve the problem satisfactorily because the investigator may be

unaware of certain prognostic variables and because it assumes a specific

statistical model that may not be correct.

Although randomization provides great benefit in clinical trials, there are

certain methodological problems and biases that it cannot prevent. One

example where randomization has little, if any, impact is external validity in a

trial that has imposed very restrictive eligibility criteria. Another example

occurs with respect to assessment bias, which treatment masking and other

design features can minimize. For instance, when a patient is asked "how do

you feel?" or "how bad is your pain?" to describe their condition, measurement bias can be introduced.

Simple Randomization

In the simplest situation, a patient is assigned a treatment without any regard for previous

assignments. This is similar to flipping a coin - the same chance regardless of

what happened in the previous coin flip. Here is an animation of this process.

Notice that each time, once treatment A or B is selected, the selection process

begins again starting with the two treatment possibilities.
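A minimal SAS sketch of simple randomization for a handful of hypothetical patients is shown below; the seed and the number of patients are arbitrary.

data simple_rand;
   do patient = 1 to 20;
      u = ranuni(20230501);          /* Uniform(0,1) draw; seed is arbitrary */
      if u < 0.5 then treatment = 'A';
      else treatment = 'B';
      output;
   end;
   drop u;
run;

proc freq data=simple_rand;
   tables treatment;                 /* chance imbalance between A and B is common */
run;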

Simple randomization is not guaranteed to assign the same number of subjects to each treatment group. Severe imbalance in the numbers assigned to each treatment is a critical issue with small sample sizes.

Simple randomization also may yield imbalance among the treatment groups with respect to prognostic variables that affect the outcome variables.

For example, suppose that disease severity is classified as mild, moderate, and severe. Suppose that simple randomization to treatment groups A and B is applied. The following table illustrates what possibly could occur.

Severity     Group A    Group B
Mild            17         28
Moderate        41         39
Severe          25         13

The moderate group is fairly well balanced, but the mild and severe groups are much

more imbalanced. This results in Group A getting more of the severe cases

and Group B more of the mild cases.

8.2 - Constrained Randomization

Randomization in permuted blocks is one approach to achieve balance across

treatment groups. The randomization scheme consists of a sequence of

blocks such that each block contains a pre-specified number of treatment

assignments in random order. The purpose of this is so that the randomization

scheme is balanced at the completion of each block. For example, suppose

equal allocation is planned in a two-armed trial (groups A and B) using a

randomization scheme of permuted blocks. The target sample size is 120

patients (60 in A and 60 in B) and the investigator plans to enroll 12 subjects

per week. In this situation, blocks of size 12 are natural, so the randomization

plan looks like

Week #1 BABABAABABAB

Week #2 ABBBAAABAABB

Week #3 BBBABABAABAA

As each eligible patient enters the trial, he or she receives the next assigned option specified for the week. Notice that there are exactly six As

and six Bs within each block, so that at the end of each week there is balance

between the two treatment arms. If the trial is terminated, say after 64 patients

have been enrolled, there may not be exact balance but it will be close.
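Below is a minimal SAS sketch (not a posted course program) that generates a permuted-block scheme like the one above: blocks of size 12 containing exactly six A's and six B's, shuffled within each block by sorting on a random key.

data blocks;
   do block = 1 to 10;               /* 10 blocks x 12 patients = 120 assignments */
      do i = 1 to 12;
         if i <= 6 then treatment = 'A';
         else treatment = 'B';
         u = ranuni(4821);           /* random sort key within the block          */
         output;
      end;
   end;
run;

proc sort data=blocks;
   by block u;                       /* shuffles the six A's and six B's per block */
run;

data blocks;
   set blocks;
   patient = _n_;                    /* sequential patient number 1-120            */
   drop i u;
run;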

Practical considerations, such as the planned enrollment rate, often suggest a block size. A variation of blocked randomization is to use block

sizes of unequal length. This might be helpful for a trial where the investigator

is unmasked. For example, if the investigator knows that the block size is six,

and within a particular block treatment A already has been assigned to three

patients, then it is obvious that the remaining patients in the block will be

assigned treatment B. If the investigator knows the treatment assignment prior

to evaluating the eligibility criteria, then this could lead to procedure selection

bias. It is not good to use a discoverable assignment of treatments. A next

step to take would be to vary the block size in order to keep the investigator's

procedure selection bias minimized.

To describe the assignment probabilities within a block under permuted-block randomization, let NA and NB denote the number of As and Bs, respectively, to

be contained within each block. Suppose that when an eligible patient is ready

to be randomized there are nA and nB patients already randomized to groups A

and B, respectively. Then the probability that the patient is randomized to

treatment A is:

P(treatment A) = 0, if nA = NA
P(treatment A) = (NA − nA) / (NA + NB − nA − nB), if nA < NA and nB < NB
P(treatment A) = 1, if nB = NB

This probability rule is based on the model of NA "A" balls and NB "B" balls in

an urn or jar which are sampled without replacement. The probability of being

assigned treatment A changes according to how many patients already have

been assigned treatment A and treatment B within the block.

For example, suppose NA = NB = 6 and that nA = 3 and nB = 2 already have been assigned. Thus, there are NA − nA = 3 A balls

left in the urn and NB - nB = 4 B balls left in the urn, so the probability of the

next eligible patient being assigned treatment A is 3/7. Below is an animation

of this process taking place.

Another type of constrained randomization is called stratified randomization.

Stratified randomization refers to the situation in which strata are constructed

based on values of prognostic variables and a randomization scheme is

performed separately within each stratum. For example, suppose that there

are two prognostic variables, age and gender, such that four strata are

constructed:

                     Treatment A    Treatment B
male, age ≥ 18            36             37
female, age < 18          13             12
female, age ≥ 18          40             40

The strata sizes usually vary (maybe there are relatively fewer young males

and young females with the disease of interest). The objective of stratified

randomization is to ensure balance of the treatment groups with respect to the

various combinations of the prognostic variables. Simple randomization will

not ensure that the treatment groups are balanced within these strata, so permuted blocks are used within each stratum to achieve balance.

If there are too many strata in relation to the target sample size, then some of

the strata will be empty or sparse. This can be taken to the extreme such that

each stratum consists of only one patient each, which in effect would yield a

similar result as simple randomization. Keep the number of strata used to a

minimum for good effect.

Adaptive randomization refers to any scheme in which the probability of

treatment assignment changes according to assigned treatments of patients

already in the trial. Although permuted blocks can be considered as such a

scheme, adaptive randomization is a more general concept in which treatment

assignment probabilities are adjusted.

With permuted blocks, the entire randomization scheme can be determined prior to the onset of the

study, whereas many adaptive randomization schemes require recalculation of

treatment assignment probabilities for each new patient.

Here is an exercise that will help to explain this type of scheme. Suppose that there is

one "A" ball and one "B" ball in an urn and the objective of the trial is equal

allocation between treatments A and B. Suppose that an "A" ball is blindly

selected, so that the first patient is assigned treatment A. Then the original "A"

ball and another "B" ball are placed in the urn so that the second patient has a

1/3 chance of receiving treatment A and a 2/3 chance of receiving treatment

B. At any point in time with nA"A" balls and nB"B" balls in the urn, the

probability of being assigned treatment A is nA/(nA+ nB). The scheme changes

based on what treatments have already been assigned to patients.

This type of urn model for adaptive randomization yields tight control of

balance in the early phase of a trial. As nA and nB get larger, the scheme tends

to approach simple randomization, so the advantage of such an approach

occurs when the trial has a small target sample size.
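A small SAS simulation of this urn scheme is sketched below (hypothetical seed and number of patients): after each assignment, a ball of the opposite type is added, so the probability of A adapts to the current imbalance.

data urn_rand;
   nA_balls = 1;   nB_balls = 1;             /* start with one A ball and one B ball */
   do patient = 1 to 30;
      pA = nA_balls/(nA_balls + nB_balls);   /* current probability of treatment A   */
      if ranuni(777) < pA then do;
         treatment = 'A';
         nB_balls + 1;                       /* add a B ball after assigning A       */
      end;
      else do;
         treatment = 'B';
         nA_balls + 1;                       /* add an A ball after assigning B      */
      end;
      output;
   end;
run;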

8.5 - Minimization

Minimization is another, rather complicated type of adaptive randomization.

Minimization schemes construct measures of imbalance for each treatment

when an eligible patient is ready for randomization. The patient is assigned to

the treatment which yields the lowest imbalance score. If the imbalance

scores are all equal, then that patient is randomly assigned a treatment. This

type of adaptive randomization imposes tight control of balance, but it is more

labor-intensive to implement because the imbalance scores must be

calculated with each new patient. Some researchers have developed web-

based applications and automated 24-hour telephone services that solicit

information about the stratifiers, and a computer algorithm uses the data to determine the randomization.

Minimization aims to balance the treatment groups with respect to several stratifying variables. As an example, consider a three-armed clinical trial (treatments A, B, C). Suppose there are four stratifying variables, whereby each stratifier has three levels (low, medium, high), yielding 3⁴ = 81 strata in this trial. Suppose that 200 patients have been randomized and patient #201 is ready for randomization.

The observations of the stratifying variables are recorded as follows.

Patient    Stratifier #1    Stratifier #2    Stratifier #3    Stratifier #4
001        Low              Low              Medium           Low
002        High             Medium           Medium           High
...
200        Low              Low              Low              Medium

Suppose that patient #201 is ready for randomization and that this patient is

observed to have the low level of stratifier #1, the medium level of stratifier #2,

the high level of stratifier #3, and the high level of stratifier #4. Based on the

200 patients already in the trial, the number of patients with each of these

levels is totaled for each treatment group. (Notice that patients may be double

counted in this table.)

Treatment A    45    19    12    103
Treatment B    48    18    15    112
Treatment C    43    21    15    109

Patient #201 would be assigned to treatment A because it has the lowest

marginal total. If two or more treatment arms are tied for the smallest marginal

total, then the patient is randomly assigned to one of the tied treatment arms.

This is not a perfect scheme, but it is a strategy for keeping the treatment assignments as balanced as possible with respect to each of the four stratifying variables.
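The decision rule for patient #201 is simple to express in code. The sketch below takes the marginal totals from the table above and picks the arm with the smallest total; a tie would be broken at random.

data minimization;
   array total{3} _temporary_ (103 112 109);   /* marginal totals for A, B, C */
   array arm{3} $ 1 _temporary_ ('A' 'B' 'C');
   length choice $ 1;
   best = total{1};
   choice = arm{1};
   do i = 2 to 3;
      if total{i} < best then do;              /* keep the arm with the smallest total */
         best = total{i};
         choice = arm{i};
      end;
   end;
   /* if two or more arms were tied at the minimum, the assignment would be
      made at random among the tied arms */
   put 'Patient #201 assigned to treatment ' choice;
run;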

Another type of adaptive randomization scheme is called the "play the winner"

rule. Suppose there is a two-armed clinical trial and the urn contains one "A"

ball and one "B" ball for the first patient. Suppose that the patient randomly is

assigned treatment A. Now you need to know if the treatment was successful

with the patient that received this treatment. If the patient does well on

treatment A, then the original "A" ball and another "A" ball are placed in the

urn. If the patient fails on treatment A, then the original "A" ball and a "B" ball

are placed in the urn. Thus, the second patient has probability of 1/3 or 2/3 of

receiving treatment A depending on whether treatment A was a success or

failure for the first patient. This process continues. If one treatment is more

successful than the other, the odds are stacked in favor of that treatment.

The advantage of the "play the winner" rule is that a higher proportion of

patients will be assigned to the more successful treatment. This seems to be

an ethical approach.

The disadvantage is that each patient's outcome must be evaluated before the randomization of the next patient.

Thus, the "play the winner" rule is not practical for most trials. The procedure

can be modified, however, to be performed in stages. For example, if the

target sample size is 200 patients, then the trial can be put on hold after each

set of 50 patients to assess outcome and redefine the probability of treatment

assignment for the patients yet to be recruited, i.e., "play the winner" after

every 50 patients instead of every patient.

8.7 - Administration of the

Randomization Process

The RANUNI function in SAS yields random numbers from the Uniform(0,1)

distribution (a randomly selected decimal between 0 and 1). These random

numbers can be used to generate a randomization scheme. For example,

suppose that the probability of assignment to treatments A, B, and C are to be

0.25, 0.25, and 0.5, respectively. Let U denote the random number generated

and assign treatment as follows:

A, if 0.00 < U < 0.25

B, if 0.25 < U < 0.50

C, if 0.50 < U < 1.00
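A minimal sketch of this rule in a SAS data step follows; the seed and the number of patients are arbitrary.

data three_arm;
   do patient = 1 to 40;
      u = ranuni(98765);               /* RANUNI draw from Uniform(0,1) */
      if u < 0.25 then treatment = 'A';
      else if u < 0.50 then treatment = 'B';
      else treatment = 'C';
      output;
   end;
   drop u;
run;

proc freq data=three_arm;
   tables treatment;                   /* should be roughly 25% / 25% / 50% */
run;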

The linked SAS program provides a permuted blocks randomization scheme for equal allocation to

treatments A and B. In the example, the block size is 6 and the total sample

size is 48.

[1]

Come up with an answer to this question by yourself and then click on the

icon to the left to reveal the solution.

Can you generate a permuted blocks randomization scheme for a total

sample size of 32 with a block size of 4?

The randomization scheme should not be discoverable by the investigator. Otherwise, the minimization of selection bias

offered by randomization is lost. The administration of the randomization

scheme should not be physically available to the investigator. This usually is

not a problem in multi-center trials, but the problem can arise in small

single-center trials. Logistical problems can arise in trials with hospitalized

patients in which 24-hour access to randomization is necessary. Sometimes,

sealed envelopes are used as a means of keeping the randomized treatment

assignments confidential until a patient is eligible for entry. However, it is

relatively easy for investigators to tamper with the envelope system.

Many clinical trials rely on pharmacies to package the drugs so that they are

masked to investigators and patients. For example, consider a two-armed trial

with a target sample size of 96 randomized subjects (48 within each treatment

group). The pharmacist constructs 96 drug packets and randomly assigns

numeric codes from 01 to 96 which are printed on the drug packet labels. The

pharmacist gives the investigator the masked drug packets (with their numeric

codes). When a subject is eligible for randomization, the investigator selects

the next drug packet (in numeric order). In this way the investigator is kept

from knowing which treatment is assigned to which patient.

SAS program that provides ....

[2]

To maximize the efficiency (statistical power) of treatment comparisons,

investigators typically employ equal allocation of patients to treatment groups

(this assumes that the variability in the outcome measure is the same for each

treatment).

In some situations, however, unequal allocation is preferred. An allocation that favors an experimental therapy over placebo could help

recruitment and it would increase the experience with the experimental

therapy. This also provides the opportunity to perform some subset analyses

of interest, e.g., if more elderly patients are assigned to the experimental

therapy, then the unequal allocation would yield more elderly patients on the

experimental therapy.

Another example where unequal allocation may be desirable occurs when one

therapy is extremely expensive in comparison to the other therapies in the

trial. For budget reasons you may not be able to assign as many to the

expensive therapy.

If it is known that one treatment is more variable (less precise) in the outcome

response than the other treatments, then the statistical power for treatment

comparisons is maximized with unequal allocation. The allocation ratio should

be

r = n1/n2 = σ1/σ2

which is a ratio of the known standard deviations. Thus, the treatment that

yields less precision (larger standard deviation) should receive more patients,

an unequal allocation. Because there is more 'noise', more patients, a larger

sample size will help to cut through this noise.

Randomization Prior to Informed Consent

Randomization prior to informed consent can increase the number of trial

participants, but it causes some difficulties. This is not recommended practice.

Here's why...

One particular scheme with experimental and standard treatments that has

received some attention is as follows. Eligible patients are randomized prior to

providing consent. If the patient is assigned to the standard therapy, then it is

offered to the patient without the need for consent. If the patient is randomized

to the experimental therapy, then the patient is asked for consent. If this

patient refuses, however, then he/she is offered the standard therapy. An

"intent-to-treat" analysis is performed based on the randomized assignment.

This approach can increase trial participation, but patients who are

randomized to the experimental treatment and refuse will dilute the treatment

difference at the time of data analysis. In addition, the "intent-to-treat" analysis

will introduce bias.

There also are ethical concerns: patients assigned to the standard therapy are treated without being informed about the trial and without providing their consent, and they are denied the chance of receiving the experimental therapy. For these reasons, randomization prior to informed consent is not recommended.

8.10 - Summary

In this lesson, among other things, we learned:

- how to contrast the benefits of permuted blocks with those of adaptive randomization schemes.
- how to determine the appropriate allocation ratio for a situation where greater variability is expected in one treatment group than the other.
- the rationale against randomizing prior to informed consent.

Let's put what we have learned to use by completing the following homework

assignment:

Safety Monitoring

Introduction

During a clinical trial it is necessary to monitor treatment effects as well as to track safety issues. "Interim analysis"

or "early stopping" procedures are used to interpret the accumulating

information during a clinical trial. There may be a variety of practical reasons

for terminating a clinical trial at an early stage. Some of these are overlapping:

- Side effects or toxicity are too severe relative to the potential benefits,
- Definitive information is available from outside the study, making the trial unnecessary or unethical (this also is related to the next item),
- The scientific questions are no longer important because of other medical developments,
- Treatment adherence is unacceptably poor, preventing an answer to the basic question.

(Piantadosi, 2005)

This lesson will examine different methods and guidelines that can be used to help decide whether or not to terminate a clinical trial in progress.

Upon completion of this lesson, you should be able to do the following:

- Differentiate between valid and invalid reasons for interim analyses and early termination of a trial.
- Distinguish the Bayesian approach from the frequentist approach to statistical analysis.
- Recognize the general effects of the choice of the prior on the posterior probability distribution from a Bayesian analysis.
- Describe the role of the Data and Safety Monitoring Board (DSMB) and who might compose the DSMB.

References

DeMets DL, Lan KK. (1994). Interim analysis: The alpha spending function approach. Statistics in Medicine 13: 1341-1352.

Ellenberg SS, Fleming TR, DeMets DL. (2002). Data Monitoring Committees in Clinical Trials. New York, NY: Wiley.

Piantadosi S. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

9.1 - Overview

Data-dependent stopping is a general term to describe any statistical or

administrative reason for stopping a trial. Consideration of the reasons given

earlier may lead you to stop the trial at an early stage, or at least change the

protocol.

Decisions based on interim data are necessary but prone to error. If the investigators learn of

interim results, then it could affect objectivity during the remainder of the trial,

or if statistical tests are performed repeatedly on the accumulating data, then

the Type I error rate is increased.

There are pressures to terminate a trial as early as possible: doing so will save costs and labor, expose as few patients as possible to

inferior treatments, and allow disseminating information about the treatments

quickly. On the other hand, there are pressures to continue the trial for as long

as possible in order to increase precision, reduce errors of inference, obtain

sufficient statistical power to account for prognostic factors and examine

subgroups of interest, and gather information on secondary endpoints.

All of the available statistical methods for interim analyses have some similar characteristics. Among other things, they

- require some structure to the problem beyond the data being observed, and
- tend to have similar performance characteristics.

In all cases the stopping criteria provide guidelines for terminating the trial, because the decision to stop a trial is not based just on statistical information collected on one endpoint.

It may be possible to assess treatment effects after each patient is accrued,

treated, and evaluated. Such an approach is impractical in most

circumstances, especially for trials that require lengthy follow-up to determine

outcomes.

The first likelihood method proposed for this situation is called the sequential

probability ratio test (SPRT) and it is based on the likelihood function. (This

method is very rarely implemented because of its impracticality, but it is important

for historical reasons.) Let's review this method in general terms here.

A likelihood function is constructed from the probability model for the random variables which correspond to the outcome measurements on the

experimental units. In the likelihood function, however, the observed data

points replace the random variables. Suppose we have a binary response

(success/failure) from each patient which is determined immediately after a

treatment is administered. (Again, not very practical.) However, for the

situation discussed, we are examining one treatment which is administered to

every patient. If there are N patients with K successes, and p represents the

probability of success within each patient, then the likelihood function is based

on the binomial probability function:

L(p, K) = p^K (1 − p)^(N−K)

If p0 represents the null value of p and p1 represents an alternative value of p, then the likelihood ratio can be constructed to assess the evidence:

R = L(p0, K) / L(p1, K) = (p0/p1)^K · ((1 − p0)/(1 − p1))^(N−K)

If R is large, then the evidence is going to favor p0. If R is small, then the evidence is going to

favor p1. Therefore, when analyzing interim data, we can calculate the

likelihood ratio and stop the trial only if we have the amount of evidence that is

expected for the target sample size.

Suppose that N is the target sample size and that after n patients there

are k successes. After each treatment we will stop and analyze the data to

determine whether to continue the trial or not. Under this scenario, we stop

the trial if:

R = L(p0, k) / L(p1, k) = (p0/p1)^k · ((1 − p0)/(1 − p1))^(n−k) ≤ RL or ≥ RU

where RL and RU are prespecified constants. Let's not worry about the details

of the statistical calculation here. The values of RL and RU that correspond to

testing H0: p = p0 versus H1: p = p1 are RL = β/(1 − α) and RU = (1 − β)/α.

A sample schematic of the SPRT in practice is shown below. Here you would

calculate R after the treatment of each patient. As you accumulate patients

you can see that R is moving around as the trial proceeds. Before we had

accrued all of the patients that we wanted we hit the upper boundary and

would not recruit the remaining patients.

Here is another example...

Suppose that a new treatment is monitored closely to determine if it reaches a certain level of success or

failure. For example, suppose the investigator considers the treatment

successful if p = 0.4 (40% or greater), but considers it a failure if p = 0.2 (20 %

or less). Thus, the hypothesis testing problem is H0: p = 0.2 vs. H1: p = 0.4.

Suppose we take α = 0.05 and β = 0.05. Then the bounds would be calculated as RL = 1/19 and RU = 19. We would reject H0 in favor of H1, and claim success, as soon as R gets small enough, i.e., R = (0.5)^k (1.33)^(n−k) ≤ 1/19. On the other hand, we would stop the trial, accept H0 and reject H1, and claim failure, as soon as R ≥ 19.
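The boundary monitoring is easy to mimic in a short SAS simulation. In the sketch below the patient outcomes are simulated Bernoulli draws with a true success probability of 0.4 (an assumption made purely for illustration); the monitoring "stops" at the first patient for which R crosses either boundary.

data sprt;
   length decision $ 20;
   call streaminit(2718);                      /* arbitrary seed                  */
   p0 = 0.2;   p1 = 0.4;
   RL = 0.05/(1 - 0.05);                       /* lower boundary = 1/19           */
   RU = (1 - 0.05)/0.05;                       /* upper boundary = 19             */
   k = 0;                                      /* running number of successes     */
   decision = ' ';
   do n = 1 to 100 until (decision ne ' ');
      y = rand('bernoulli', 0.4);              /* simulated outcome for patient n */
      k + y;
      R = (p0/p1)**k * ((1 - p0)/(1 - p1))**(n - k);   /* likelihood ratio        */
      if R <= RL then decision = 'success: reject H0';
      else if R >= RU then decision = 'failure: accept H0';
      output;
   end;
run;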

The SPRT is not a common procedure in clinical trials. The obvious criticism is that each patient's outcome must be observed quickly before you recruit the next

patient. The SPRT also has the statistical property that it has a positive

probability of never reaching the boundaries RL and RU. If this is the case after

the target sample size, N, is reached, then the trial is inconclusive.

First, let's review the Bayesian approach in general and then apply it to our

current topic of likelihood methods.

The Bayesian approach to statistical design and inference is very different

from the classical approach (the frequentist approach).

In the Bayesian approach, the investigator first expresses knowledge or belief about the treatment effect, say we call it θ, in the form of a probability distribution. This is known as the prior distribution for θ. These assumptions are made prior to conducting the study and collecting any data.

Next, the data from the trial are observed, say we call it X, and the likelihood function of X given θ is constructed. Finally, the posterior distribution for θ given X is constructed. In essence, the prior distribution for θ is revised into the posterior distribution based on the data X. The data collection in the study informs or revises the earlier assumptions.

The posterior distribution may be difficult to derive in closed form mathematically, and it may be necessary to approximate it through computer

algorithms.

The Bayesian statistician performs all inference for the treatment effect by

formulating probability statements based on the posterior distribution. This is a

very different approach and is not always accepted by the more traditional

frequentist oriented statisticians.

In the Bayesian approach, θ is regarded as a random variable, about which

probability statements can be made. This is the appealing aspect of the

Bayesian approach. In contrast, the frequentist approach regards θ as a fixed

but unknown quantity (called a parameter) that can be estimated from the

data.

Consider the difference between the frequentist description and the Bayesian description of a 95% confidence interval for θ.

Frequentist: "If a very large number of samples, each with the same sample

size as the original sample, were taken from the same population as the

original sample, and a 95% confidence interval constructed for each sample,

then 95% of those confidence intervals would contain the true value of θ." This

is an extremely awkward and dissatisfying definition but technically represents

the frequentist's approach.

Bayesian: "The 95% confidence interval defines a region that covers 95% of

the possible values of θ." This is much more simple and straightforward. (As a

matter of fact, most people when they first take a statistics course believe that

this is the definition of a confidence interval.)

The Bayesian analysis yields a posterior probability distribution for θ. Using the probability distribution, many

statements can be made. For example, if θ represents a probability of success for a treatment, a statement can be made about the probability that θ > 0.90

(or any other value).

With respect to clinical trials, a Bayesian approach can cause some difficulties

for investigators because they are not accustomed to representing their prior

beliefs about a treatment effect in the form of a probability distribution. In

addition, there may be very little prior knowledge about a new experimental

therapy, so investigators may be reluctant to or not be able to quantify their

prior beliefs. In the business world, the Bayesian approach is used quite often

because of the availability of prior information. In the medical field, more often

than not, this is not the case.

Another difficulty is that different investigators may select different priors for the same situation, which could

lead to different conclusions about the trial. This is especially true when the

data, X, are based on a small sample size, because in such situations the prior distribution is modified only slightly by the data to form the posterior distribution. The posterior then remains very close to the prior, so the results are based almost entirely on the prior assumptions.

When there is little information on which to base a prior distribution, Bayesians employ a reference (or vague or non-informative)

prior. These are intended to represent a minimal amount of prior information.

Although vague priors may yield results similar to those of a frequentist

approach, the priors may be unrealistic because they attempt to assign equal

weight to all values of θ. Such a prior is a very flat distribution, spread out over a wide range of values.

Similarly, skeptical prior distributions are those that quantify the belief that

large treatment effects are unlikely. Enthusiastic prior distributions are those that quantify the belief that large treatment effects are likely. Let's not worry about the calculations, but

focus instead on the concepts here...

Suppose an investigator plans a trial to detect a hazard ratio of 2 (Δ = 2) with 90% statistical power (β = 0.10) using at least a sample size of 90 events. The

investigator plans one interim analysis, approximately halfway through trial,

and a final analysis. (This is the more standard approach, as opposed to the

SPRT where R was calculated after each treatment.)

The estimated loge hazard ratio is approximately normally distributed with variance (1/d1) + (1/d2), where d1 and d2 are the numbers of

events in the two treatment groups. The null hypothesis is that the treatment

groups are the same, i.e., H0: Δ = 1. Note that the loge hazard ratio is 0 under the null hypothesis and the loge hazard ratio is 0.693 when Δ = 2, the

proposed effect size.

Suppose the investigator has access to some pilot data or the published

report of another investigator, in which there appeared to be a very small

treatment effect with 16 events occurring within each of the two treatment

groups. The investigator decides that this preliminary study will form the basis

of a skeptical prior distribution for the loge hazard ratio with a mean of 0 and a

standard deviation of 0.35 = {(1/16) + (1/16)}^(1/2). This is called a skeptical prior

because it expresses skepticism that the treatment is beneficial.

Next, suppose that at the time of the interim analysis, (45 events have

occurred), there are 31 events in one group and 14 events in the other group,

such that the estimated hazard ratio is 2.25 (calculations not shown). These

values are incorporated into the likelihood function, which modifies the prior

distribution to yield the posterior distribution for the estimated loge hazard ratio

that has a mean = 0.474 and standard deviation = 0.228 (calculations not

shown). Therefore we can calculate the probability that Δ > 2. From the

posterior distribution we construct the following probability statement:

Pr[Δ ≥ 2] = 1 − Φ((loge(2) − 0.474)/0.228) = 1 − Φ(0.961) = 0.168

where Φ denotes the cumulative distribution function of the standard normal and Δ is the true hazard ratio.

Conclusion: Based on the results from the interim analysis with a skeptical

prior, there is not strong evidence that the treatment is effective because the

posterior probability of the hazard ratio exceeding 2 is relatively small.

Therefore, there is not enough evidence here to suggest that the study be

stopped. How large a posterior probability would be large enough? A reasonable threshold should be specified in your protocol before these values are determined.

In contrast, suppose that before the onset of the trial the investigator is very

excited about the potential benefit of the treatment. Therefore, the investigator

wants to use an enthusiastic prior for the loge hazard ratio, i.e., a normal

distribution with mean = loge(2) = 0.693 and standard deviation = 0.35 (same

as the skeptical prior).

Suppose the interim data results are the same as those described above. This

time, the posterior distribution for the loge hazard ratio is normal with mean =

0.762 and standard deviation = 0.228. Then the posterior probability is:

Pr[Δ ≥ 2] = 1 − Φ((loge(2) − 0.762)/0.228) = 1 − Φ(−0.302) = 0.619

This is a drastic change in the probability based on the assumptions that were

made ahead of time. In this case, the investigator still may not consider this to

be strong evidence that the trial should terminate because the posterior

probability of the hazard ratio exceeding 2 does not exceed 0.90.
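For reference, the final step of each calculation can be reproduced with a short Python sketch, using the posterior means and standard deviations quoted above (scipy is assumed to be available).

# Posterior probability that the hazard ratio exceeds 2, given a normal
# posterior for the log-e hazard ratio (means/SDs taken from the text above).
from math import log
from scipy.stats import norm

posteriors = {"skeptical prior": (0.474, 0.228),
              "enthusiastic prior": (0.762, 0.228)}

for label, (mean, sd) in posteriors.items():
    prob = 1 - norm.cdf((log(2) - mean) / sd)   # Pr[log hazard ratio >= log 2]
    print(f"{label}: Pr[hazard ratio >= 2] = {prob:.3f}")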

Nevertheless, the example demonstrates the controversy that can arise with a

Bayesian analysis when the amount of experimental data is small, i.e., the

selection of the prior distribution drives the decision-making process. For this

reason, many investigators prefer to use non-informative priors. Using the

Bayesian methods, you can make probability statements about your expected

results.

From a frequentist point of view, repeated hypothesis testing of accumulating

data can increase the type I error rate of a clinical trial. Therefore, the

frequentist approach to interim monitoring of clinical trials focuses on

controlling the type I error rate.

In practice it is not feasible to perform a statistical analysis after each patient is accrued. In fact, for most multi-center clinical trials, interim

statistical analyses are conducted only once or twice per year. Usually this

frequency of interim analyses detects treatment effects nearly as early as

continuous monitoring. The group sequential analysis is defined as the

situation in which only a few scheduled analyses are conducted. Again, let's

focus more on the concepts than the statistical details.

Suppose R analyses are planned, and we let Z1, ..., ZR denote the test statistics at the R times of hypothesis

testing. So, here we are accumulating data over time. We are adding to the

dataset and analyzing the current set that you have collected. Also, we

let B1, ... , BR denote the corresponding boundary points (critical values). At

the rth interim analysis, the clinical trial is terminated with rejection of the null

hypothesis if:

|Zr| ≥ Br,   r = 1, 2, ..., R

The boundary points are chosen such that the overall significance level does

not exceed the desired α. There are primarily three schemes for selecting the

boundary points which have been proposed. These are illustrated in the

following table for an overall significance level of α = 0.05 and for R = 2, 3, 4, 5.

The table is constructed under the assumption that n patients are accrued at

each of the R statistical analyses so that the total sample size is N = nR.

[Table: for each of the three schemes (Pocock, O'Brien-Fleming, and Haybittle-Peto) and for R = 2, 3, 4, 5, the boundary value B and the corresponding nominal significance level at each analysis; for example, the Pocock nominal level is 0.0221 per analysis when R = 3 and 0.0158 when R = 5.]

For example, if you were to have one interim analysis and a final analysis, in

this table, that means R=2. Use the first two rows of the table to find the

critical values.

If you were to have three interim analyses and then one final analysis, then

R=4. You would use the corresponding four rows in the middle of the table to

determine critical values.
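As a hedged sketch of how such a rule is applied, the Python fragment below checks hypothetical z-statistics against a single Pocock-style boundary; the constant 2.413 is the standard Pocock value for R = 5 analyses at an overall α = 0.05, corresponding to the 0.0158 nominal level used in the example later in this lesson.

# Sketch of Pocock-style group sequential monitoring with R = 5 analyses.
# The constant 2.413 (nominal two-sided level ~0.0158 per analysis) is the
# standard Pocock boundary for R = 5 and overall alpha = 0.05.
pocock_boundary = 2.413

# Hypothetical z-statistics from the five scheduled analyses.
z_statistics = [0.85, 1.40, 2.05, 2.55, 2.10]

for r, z in enumerate(z_statistics, start=1):
    if abs(z) >= pocock_boundary:
        print(f"Analysis {r}: |Z| = {abs(z):.2f} >= {pocock_boundary} -> stop, reject H0")
        break
    print(f"Analysis {r}: |Z| = {abs(z):.2f} < {pocock_boundary} -> continue")
else:
    print("All analyses completed without crossing the boundary; do not reject H0.")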

The three schemes allocate significance differently across the interim and final analyses.

The Pocock approach applies the same nominal significance level at each of the R interim analyses. Of the three procedures described in the table, it provides the best chance of early trial termination. Many investigators dislike

the Pocock approach, however, because of its properties at the final stage of

analysis. For example, suppose R = 3 analyses are planned and that

statistical significance is not attained at any of the analyses. Suppose that

the p-value at the final analysis is 0.0350 (this is > 0.0221 found in the table

for the Pocock approach). If interim analyses had not been scheduled,

however, this p-value would be considered to provide a statistically significant

result (p = 0.0350 < 0.0500).

The O'Brien-Fleming and Peto (the latter based more on practical than on statistical reasoning) approaches were designed to avoid this problem. On

the other hand, these two approaches render it very difficult to attain statistical

significance at an early stage. In any case, it is important to make it clear to

investigators in the study the approach that has been selected for the interim

analyses.

Example

(This example is taken from Pocock SJ, 1983. Clinical Trials: A Practical Approach, New York, John Wiley & Sons.) A

trial was conducted in patients with non-Hodgkin's lymphoma, in which two

drug combinations were compared, namely cytoxan-prednisone (CP) and

cytoxan-vincristine-prednisone (CVP). The primary endpoint was

presence/absence of tumor shrinkage, a surrogate variable.

Patient accrual lasted over two years and 126 patients participated. Statistical

analyses were scheduled after approximately every 25 patients. Chi-square

tests (without the continuity correction) were performed at each of the five

scheduled analyses. The Pocock approach to group sequential testing

requires a significance level of 0.0158 at each analysis. Here is a table with

the results of these analyses.

[Table: for each of the five scheduled analyses, the numbers of successes in the CP and CVP groups and the chi-square p-value; none of the p-values fell below the required 0.0158 level.]

Thus, the researchers were concerned that the CVP combination appeared to

be clinically better than the CP combination (53% success versus 34%

success), yet it did not lead to a statistically significant result with Pocock's

approach. Further analyses with secondary endpoints convinced the

researchers that the CVP combination is superior to the CP combination.

The O'Brien-Fleming approach is the most popular group sequential approach

because the significance level at the final analysis is near the overall desired

significance level. The REMATCH [1] clinical trial is a good example.

Drawbacks of the group sequential approaches include the strict requirements that (1) the number of scheduled analyses, R,

must be determined prior to the onset of the trial, and (2) there is equal

spacing between scheduled analyses with respect to patient accrual. The

alpha spending function approach was developed to overcome these

drawbacks: (DeMets DL, Lan KK, 1994, Interim analysis: The alpha spending

function approach, Statistics in Medicine 13: 1341-1352.)

Let τ denote the information fraction available during the course of a clinical

trial. For example, in a clinical trial with a target sample size, N, in which

treatment group means will be compared, the information fraction at an interim

analysis is τ = n/N, where n is the sample size at the time of the interim

analysis. If your target sample size is 500 and you have taken measurements

on 400 patients, then τ = 0.8.

For a trial with a time-to-event endpoint, the information fraction is τ = d/D, where D is the target number of events for the entire trial and d is the number of events that have occurred at the time of the interim analysis.

The alpha spending function, α(τ), is an increasing function with α(0) = 0 and α(1) = α, the desired overall significance level. In other words, every time you perform an analysis you are, in a sense, "spending part of your alpha." For the rth interim analysis, where the information fraction is τr, 0 ≤ τr ≤ 1, α(τr)

determines the probability of any of the first r analyses leading to rejection of

the null hypothesis when the null hypothesis is true. As an example, suppose

investigators are planning a trial in which patients are examined every two

weeks over a 12-week period. The investigators would like to incorporate an

interim analysis when one-half of the subjects have completed at least one-

half of the trial. This corresponds to τ = 0.25.

A popular spending function, which lies between the Pocock and O'Brien-Fleming functions, is α(τ) = ατ, 0 ≤ τ ≤ 1. This leads to a significance

level of 0.012 at the interim analysis and a significance level of 0.04 at the

final analysis (calculations not shown). There are all types of variations that

statisticians have devised.

One caution: whether a group sequential method or an alpha spending function approach is invoked, the estimates of a treatment effect will be biased

when a trial is terminated at an early stage. The earlier the decision, the larger

the bias. Intuitively, if your target sample size is 200 and you decide to

terminate the trial after 25 patients because you think you have found a

significant difference between treatment groups, there could be a lot of bias in this type of result. Is this number of patients a representative sample from the

population?

Conditional Power

As an alternative to the above methods, we might want to terminate a trial

when the results of the interim analysis are unlikely to change after accruing

more patients (futility assessment). It just doesn't look like there could ever be

a significant difference!

The conditional power is the probability, given the data observed so far, that the completed trial would yield an answer different from that seen at the interim analysis. If this

quantity is really small, then you can conclude that it would be futile to

continue with the investigation.

As a simple example, suppose we want to determine whether a coin is fair, so the hypothesis testing problem is:

H0: p = Pr[Heads] = 0.5 versus H1: p = Pr[Heads] > 0.5

The fixed sample size plan is to toss the coin 500 times, count the number of

heads, X. But do we actually need to complete all 500 tosses? The fixed-sample test rejects H0 at the 0.025 significance level if:

Z = (X − 250) / √((500)(0.5)(0.5)) ≥ 1.96

Suppose that after 400 tosses of the coin there are 272 heads. It is futile to proceed further because even if the

remaining 100 tosses yielded tails, the null hypothesis still would be rejected

at the 0.025 significance level. The calculation of the conditional power in this

example is trivial (it equals 1) because no matter what is assumed about the

true value of p, the null hypothesis would be rejected if the trial were taken to

completion.

You can also look at this in the other direction. Suppose that after 400 tosses

of the coin there are 200 heads. The null hypothesis will be rejected if there

are at least 72 heads during the remaining 100 tosses.

Suppose the investigator assumes p = 0.6 for the remaining tosses. Then the conditional power is

Pr[X ≥ 72 | n = 100, p = 0.6]
= Pr[ (X − 60)/√((100)(0.6)(0.4)) ≥ (72 − 60)/√((100)(0.6)(0.4)) ]
= Pr[Z ≥ 2.45] = 0.007

This is a very small probability. Thus, it is futile to continue because there is such a

small chance of rejecting H0.
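Both futility calculations from the coin example can be reproduced with the following Python sketch; the normal approximation mirrors the hand calculation above.

# Conditional power sketch for the coin-tossing example.
from math import sqrt
from scipy.stats import norm

reject_cutoff = 272            # reject H0 if total heads >= 272 (from Z >= 1.96)

# Case 1: 272 heads after 400 tosses -- already past the cutoff,
# so the conditional power equals 1 no matter what p is.
heads = 272
print("Case 1 conditional power:", 1.0 if heads >= reject_cutoff else "needs calculation")

# Case 2: 200 heads after 400 tosses -- need at least 72 heads in the
# remaining 100 tosses.  Assume p = 0.6 for the remaining tosses.
needed, remaining, p = 72, 100, 0.6
z = (needed - remaining * p) / sqrt(remaining * p * (1 - p))
cond_power = 1 - norm.cdf(z)   # ~0.007, as in the text
print(f"Case 2 conditional power (normal approximation): {cond_power:.3f}")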

Interim Reports for Trials

Single-Center Trials

Here are some practical issues as they relate to single center trials. Typically,

an investigator for a single-center trial needs to submit an annual report to

his/her IRB. The report should address whether the study is safe and whether

it is appropriate to continue. The report typically includes items such as a summary of treatment administration (e.g., the number of patients completing the trial), a summary of response, a summary of survival, a summary of adverse events, and summaries of other outcomes (e.g., secondary endpoints).

Multi-Center Trials

A multi-center trial is one in which there are one or more clinical investigators

at each of a number of locations (centers). Obviously, multi-center trials are of

great importance when the disease is not common and a single investigator is

capable of recruiting only a handful of patients.

Other advantages include having patients involved in the study across various geographic regions (this adds to external validity), and having more experienced clinical scientists involved in the design and implementation of the study.

Multi-center trials also present logistical challenges, including the following: the protocol and treatment procedures must be standardized across all centers; uniform procedures are needed for evaluating patients and recording data; you need a data coordinating center (DCC) for storing and monitoring data and for organizing investigators; and a need develops to keep all investigators involved and motivated.

The NIH requires a Data and Safety Monitoring Board (DSMB) to monitor the

progress of a multi-center clinical trial that it sponsors. Although the FDA does

not require a pharmaceutical/biotech company to construct a DSMB for its

multi-center clinical trials, many companies are starting to use DSMBs on a

regular basis.

The DSMB provides a mechanism for protecting the interests and safety of the trial participants,

while maintaining scientific integrity. The manner in which it is constructed

should ensure that the DSMB is financially and scientifically independent of

the study investigators, so that decisions about early stopping or study

continuation are made objectively. Depending on the circumstances, a DSMB

may be composed of anywhere from three to ten experts in medicine,

statistics, epidemiology, data management, clinical chemistry, and ethics.

None of the study investigators should be a part of the DSMB. In addition, the

DSMB should not be masked to treatment assignment when it is evaluating a

clinical trial. Although investigators and statisticians may submit information

and materials to the DSMB for their study, most of the deliberations made by

the DSMB are kept confidential. The DSMB reports directly to the sponsor of

the multi-center trial (the NIH or the company) and does not report to the

investigators.

A DSMB typically examines the following issues when assessing the worth of

a multi-center clinical trial:

2. Are the accrual rates meeting initial projections and is the trial on its

scheduled timeline?

4. Are the treatment groups different with respect to safety and toxicity

data?

5. Are the treatment groups different with respect to efficacy data?

6. Are the data of sufficient quality and completeness for the DSMB to make its decisions?

One criticism of having a DSMB monitor a multi-center clinical trial, instead of the investigators, is that expertise may be

sacrificed in order to maintain impartiality. Investigators gain valuable

knowledge during the course of the trial and it is not possible to provide the

DSMB with the totality of this knowledge. Nevertheless, the advantages of a

DSMB seem to outweigh this disadvantage during the conduct of a multi-

center trial.

Fleming, TR. DeMets, DL. 2002, Data Monitoring Committees in Clinical

Trials, New York, NY: Wiley.

9.9 - Summary

In this lesson, among other things, we learned:

Differentiate between valid and invalid reasons for interim analyses and

early termination of a trial.

analysis.

frequentist approach.

Recognize the general effects of the choice of the prior on the posterior

probability distribution from a Bayesian analysis.

interim analysis.

Comment on the use of a group sequential method in a published

statistical analysis.

might compose the DSMB.

Let's explore Bayesian methods further in this week's discussion and apply

what we have learned to the assessment questions.

Lesson 10: Missing Data and Intention-to-Treat

Introduction

Data from a clinical trial may be imperfect because of protocol nonadherence, incomplete observations, or methodologic errors. Two different perspectives

have been applied to address the situation with imperfect data.

The explanatory approach emphasizes determining biological effects, which typically is done for efficacy studies. The pragmatic approach emphasizes evaluating the treatment as it would be applied in general use, which typically is done for effectiveness studies.

deal with data imperfections.

appropriately.

Differentiate between a pragmatic approach and an explanatory

approach and the conditions for which each is appropriate.

Identify the circumstances under which the International Conference on Harmonization would allow a patient's data to be excluded from statistical analysis.

"Evaluable"

Nearly every clinical trial has some imperfections in the data. One problem in this regard involves the evaluation

of which patients received the correct treatment at the correct amounts.

Patients who meet certain criteria are said to be "evaluable".

Suppose that NE patients are considered evaluable and NI patients are considered inevaluable. Suppose that the numbers of evaluable and inevaluable patients with favorable outcomes are RE and RI, respectively. You may consider using one of the following estimates for the probability of a favorable outcome, namely,

P = (RE + RI)/(NE + NI)   or   PE = RE/NE

P (the pragmatic, or intention-to-treat, approach) is based on all patients, whereas PE (explanatory approach) is based on evaluable patients only.

Usually RI is close to zero so that PE > P.

The rationale for the explanatory approach is that it seems obvious that the treatment cannot have an effect if it is not received. Some

investigators will write a protocol to indicate that only data from those who

received treatment for at least some number of doses or longer than a

particular length of time will be used in analysis. Do you recall a major

difficulty with this explanatory approach? What about post-entry exclusion

bias?

Since evaluability criteria define inclusion retroactively based on treatment

adherence which is not determined until completion of the study, there is

potential post-entry exclusion bias. Participant data should not be selected

for inclusion in data analysis based on an outcome variable.

Although the pragmatic approach avoids this bias, obviously it does not help elicit biological effects in an efficacy trial. It is

prudent then, to select treatments and a protocol design that will result in a

high level of treatment adherence with the hope that the pragmatic/intention-

to-treat approach agrees as much as possible with the explanatory approach.

Missing data

Some missing data are inevitable, but if this happens frequently, there could be a fundamental problem with the design or conduct of the study. Some missing data are due to human error, such as forgetting to record or enter the data.

If losses to follow-up occur for reasons not associated with outcome, then they have little impact, other than reducing precision. If losses to follow-up occurred independently of outcome, then the explanatory and pragmatic approaches would be equivalent. Investigators, however, cannot assume that all losses to follow-up are random events and conduct analyses that ignore such losses.

Being lost to follow-up may be associated with a higher chance of disease

progression, recurrence, or death. If a patient has not withdrawn consent,

then every effort should be made to recover lost information.

Data imputation may be considered when:

1. a relatively small proportion of the observations have missing values;

2. the outcome variable with the missing values is important clinically or biologically; and

3. the reasons that the values are missing can be determined.

Simple data imputation involves substituting one data point for each missing

value. Some substitution choices include the mean of the non-missing values

or a predicted value from a linear regression model.

Another simple data imputation method is the last observation carried forward

(LOCF) approach in longitudinal studies. With LOCF, the last observed value

for a patient is substituted for all of that patient's subsequent missing values.

The problems with simple data imputation methods are that they can yield a

very biased result and they tend to underestimate variability during the data

analysis.

A better strategy is multiple imputation, in which:

1. missing values are predicted from a statistical model and random errors are added to the predicted values via random number generators,

2. multiple imputed data sets are created in this manner (say 10-20 data sets), and

3. the statistical analysis is performed within each imputed data set and the results are combined across the data sets.

In most clinical trials, it is common to find errors that yield ineligible patients

participating in the trial. Objective eligibility criteria are less susceptible to error

than subjective criteria. Also, patients can fail to comply with nearly every

aspect of treatment specification, such as reduced or missed doses and

improper dose scheduling.

Ineligible patients in the study can be (1) included in the analysis of the cohort

of eligible patients (pragmatic approach/intention-to-treat) or (2) excluded from

the analysis (explanatory approach).

In a randomized trial, if the eligibility criteria are objective and assessed prior

to randomization, then both approaches do not cause a bias. The pragmatic

approach, however, increases the external validity.

10.2 - Intention-to-Treat

Intention-to-treat (ITT) is the principle that patients in a randomized clinical

trial should be analyzed according to the group to which they were assigned,

even if they did not receive or complete the assigned treatment.

Treatment received (TR) or a protocol analysis is the principle that patients

should be analyzed according to the treatment they actually received.

Most statisticians favor the ITT principle because it yields the best properties

for the test of the null hypothesis of no treatment difference. "If randomized,

then analyzed" is the view widely held among clinical trial statisticians and

considered a critical component of the ITT Principle to avoid biases due to

post-randomization exclusions. ITT also is favored by the federal agencies

because a clinical trial is a test of treatment policy, not a test of treatment

received. After a meeting to discuss clinical trials methodology, which included

US FDA representatives, the International Conference on Harmonization

(ICH) published a document entitled "Statistical Principles for Clinical Trials

(E9) [1]" that discusses the ITT Principle under various circumstances.

The ICH E9 document describes limited circumstances under which randomized patients can be excluded from the full analysis set. Patients

who failed to satisfy an entry criterion may be excluded from the full analysis

set only under the following circumstances:

1. The entry criterion was measured prior to randomization,

2. The violation of the entry criterion can be objectively determined,

3. All patients underwent similar scrutiny for eligibility violations, and

4. All patients with the detected violation of the entry criterion are excluded.

An ITT analysis can yield misleading results in some circumstances. For example, consider a situation in which a new therapy is

compared to placebo. Suppose that a patient undergoing treatment failure is

provided emergency medications for safety purposes. If the placebo group

has a higher failure rate, it actually could appear to be more beneficial than

the new therapy in an ITT analysis because of the emergency medications

(even though this may seem to be a design flaw of the trial). In such a

situation, the statistical analysis would be better served with time to treatment

failure as the primary endpoint. This analysis would still include all patients,

but using time to failure as the primary endpoint eliminates the problem of

misleading results from the ITT analysis.

There are many reasons why patients may fail to complete the assigned therapy, including severe adverse reactions, disease progression, patient or

physician preference for an alternative treatment, and a change of mind. In

nearly all of these circumstances, failure to complete the assigned therapy is

partially a trial outcome. Patients cannot be eliminated from analysis for such

reasons without introducing bias.

10.3 - Summary

In this lesson, among other things, we learned:

appropriately.

approach and the conditions for which each is appropriate.

Identify the circumstances under which the International Conference on Harmonization would allow a patient's data to be excluded from statistical analysis.

Lesson 11: Estimating Clinical Effects

Introduction

The design of a clinical trial imposes structure on the resulting data. For

example, in pharmacologic treatment mechanism (Phase I) studies, blood

samples are used to display concentration time curves, which relate to

simple physiologic models of drug distribution and/or metabolism. As another

example, in SE (Phase II) trials of cytotoxic drugs, investigators are interested

in tumor response and toxicity of the drug or regimen. The usual study design

permits estimating the unconditional probability of response or toxicity in

patients who met the eligibility criteria.

For every trial, investigators must distinguish between those analyses, tests of

hypotheses, or other summaries of the data that are specified a priori and

justified by the design and those which are exploratory. Remember, the results

from statistical analyses of endpoints that are specified a priori in the protocol

carry more validity. Although exploratory analyses are important and might

uncover biological relationships previously unsuspected, they may not be

statistically reliable because it is not possible to account for the random nature

of exploratory analyses. Exploratory analyses are not confirmatory by

themselves but generate hypotheses for future research.

an odds ratio adjusted for strata effects.

groups.

trend.

Interpret SAS output comparing survival curves.

estimator.

Reference

Effects. In: Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd

ed. Hoboken, NJ: John Wiley and Sons, Inc.

Analysis. In FFDRG. Fundamentals of Clinical Trials. 5th ed. NY: Springer.

One of the principal objectives of dose-finding (DF) studies is to assess the

distribution and elimination of drug in the human system. What level of the

drug is appropriate?

Pharmacokinetic (PK) models are useful analytical approaches for DF studies. The objective of a PK model is to

account for the absorption, distribution, metabolism, and excretion of a drug in

the human system. An example of a two-compartment PK model is as follows

and the objective is to estimate the rates into, between, and out of the two

compartments.

Estimates are made for each of these areas, i.e., absorption rate, distribution

rate, etc. for the drug in question.

11.2 - Safety and Efficacy (Phase II) Studies: The Odds Ratio

The main objectives of most safety and efficacy (SE) studies with a new

treatment are to estimate the frequency of adverse reactions and estimate the

probability of treatment success. These types of endpoints often are

expressed in binary form, presence/absence of adverse reaction,

success/failure of treatment, etc., although this is not always the case.

For example, the severity of an adverse reaction may be recorded on an ordinal scale, such as absent/mild/moderate/severe. The primary efficacy endpoint also may be

measured on an ordinal scale, such as failure/partial success/success, or it

may be a time-to-event variable or measured on a continuous scale, such as

a measurement of blood pressure. There are many ways to assess efficacy.

Suppose that a safety and efficacy study consists of a placebo group and a treatment group, and that the probability of an

adverse reaction is an important investigation. Let p1 and p2 denote the

respective probabilities of an adverse reaction for the treatment and placebo

groups. Three common parameters of risk are the risk difference, p1 − p2, the relative risk, p1/p2, and the odds ratio, {p1/(1 − p1)}/{p2/(1 − p2)}. The risk difference measures the absolute difference between the two groups in the risk for the event. The relative risk is the ratio of the two risks, and the odds ratio indicates the relative odds of the event occurring between the two groups. Because both are ratios, the relative risk and the odds ratio are assessed in terms of their distance from 1.0.

When would the odds ratio and the relative risk be about the same?

When p1 and p2 are relatively small, for instance, when you are dealing with a

very rare event.

The odds ratio is useful and convenient for assessing risk when the response

outcome is binary, but it does have some limitations.

p1      p2      p1 − p2    Relative Risk    Odds Ratio
0.25    0.05    0.20       5.00             6.33
0.30    0.10    0.20       3.00             3.86
0.45    0.25    0.20       1.80             2.45
0.70    0.50    0.20       1.40             2.33

Notice in the table above that while the absolute risk difference is constant,

the relative risk varies greatly, as does the odds ratio. Thus, the magnitudes of

the odds ratio and relative risk are strongly influenced by the initial probability

of the condition.
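The table can be reproduced with a few lines of Python, which makes the contrast between the fixed risk difference and the changing ratios explicit.

# Risk difference, relative risk, and odds ratio for the pairs in the table above.
pairs = [(0.25, 0.05), (0.30, 0.10), (0.45, 0.25), (0.70, 0.50)]

print(" p1     p2    diff    RR     OR")
for p1, p2 in pairs:
    diff = p1 - p2
    rr = p1 / p2
    orr = (p1 / (1 - p1)) / (p2 / (1 - p2))
    print(f"{p1:5.2f}  {p2:5.2f}  {diff:5.2f}  {rr:5.2f}  {orr:5.2f}")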

When the outcome in a CTE trial is a binary response and the objective is to

compare the two groups with respect to the proportion of success, the results

can be expressed in a 2 × 2 table as

             Group #1      Group #2
Success      r1            r2
Failure      n1 − r1       n2 − r2

The estimated relative risk is (r1/n1)/ (r2/n2) and the estimated odds ratio is:

ψ̂ = {r1/(n1 − r1)} / {r2/(n2 − r2)} = r1(n2 − r2) / {r2(n1 − r1)}

There are a variety of methods for performing the statistical test of the null

hypothesis H0: ψ = 1 (or, equivalently, H0: loge(ψ) = 0), such as a z-test using a normal approximation, a χ² test (basically, a square of the z-test), a χ² test with continuity correction, and Fisher's exact test.

The normal and χ² approximations for testing H0: ψ = 1 are relatively accurate

if these conditions hold:

n1(r1 + r2)/(n1 + n2) ≥ 5,   n2(r1 + r2)/(n1 + n2) ≥ 5,
n1(n1 + n2 − r1 − r2)/(n1 + n2) ≥ 5,   n2(n1 + n2 − r1 − r2)/(n1 + n2) ≥ 5

This expression is basically what we would have calculated for the expected

values in the 2 × 2 table. The first part of the expression is the probability of

success times the probability of being in group 1 times the number of

subjects.

Otherwise, Fisher's exact test is recommended.

If the above condition is met, then the loge-transformed estimated odds ratio

has an approximate normal distribution:

loge(ψ̂) ~ N( μ = loge(ψ),  σ² = 1/r1 + 1/r2 + 1/(n1 − r1) + 1/(n2 − r2) )

Thus, an approximate 100(1 − α)% confidence interval for the loge odds ratio is possible and would look like:

loge(ψ̂) ± z1−α/2 √( 1/r1 + 1/r2 + 1/(n1 − r1) + 1/(n2 − r2) )

An approximate 100(1 − α)% confidence interval for ψ itself is constructed by exponentiating the endpoints of the 100(1 − α)% confidence interval for the log odds ratio. The computer can do this for you.

SAS Example 12.1 [1]: An investigator conducted a small safety and efficacy

study comparing treatment to placebo with respect to adverse reactions. The

data are as follows:

                       Treatment    Placebo
adverse reaction          12            4
no adverse reaction       32           40

The estimated odds ratio is calculated as:

ψ̂ = (12)(40) / {(32)(4)} = 3.75

and the approximate 95% confidence interval for the loge odds ratio is

1.32 ± (1.96)(0.62) = (0.10, 2.54)

Because the approximate 95% confidence interval for ψ does not contain 1.0, the null hypothesis H0: ψ = 1 is rejected at the 0.05 significance level.

Even though this data table satisfies the criteria for loge estimated odds ratio

to follow an approximate normal distribution, there still is a discrepancy

between the approximate results and the exact results.

From PROC FREQ of SAS, the exact 95% confidence interval for ψ is (1.00, 17.25). Because this interval does contain 1.0, H0: ψ = 1 is not rejected at the 0.05 significance level based on Fisher's exact test.
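The approximate calculations in SAS Example 12.1 can be reproduced with the Python sketch below; scipy's fisher_exact supplies the exact test p-value, while the exact confidence interval quoted above comes from SAS PROC FREQ.

# Odds ratio, approximate 95% CI, and Fisher's exact test for SAS Example 12.1.
from math import log, exp, sqrt
from scipy.stats import fisher_exact

r1, n1 = 12, 44     # treatment: adverse reactions / total
r2, n2 = 4, 44      # placebo:   adverse reactions / total

or_hat = (r1 * (n2 - r2)) / (r2 * (n1 - r1))                  # 3.75
se_log = sqrt(1/r1 + 1/r2 + 1/(n1 - r1) + 1/(n2 - r2))        # ~0.62
lo, hi = log(or_hat) - 1.96 * se_log, log(or_hat) + 1.96 * se_log

print(f"odds ratio = {or_hat:.2f}")
print(f"approx 95% CI for log OR: ({lo:.2f}, {hi:.2f})")
print(f"approx 95% CI for OR:     ({exp(lo):.2f}, {exp(hi):.2f})")

table = [[12, 4], [32, 40]]                                   # 2 x 2 table
_, p_value = fisher_exact(table)
print(f"Fisher's exact test p-value: {p_value:.3f}")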

11.3 - Safety and Efficacy (Phase II)

Studies: The Mantel-Haenszel Test for

the Odds Ratio

Sometimes a safety and efficacy study is stratified according to some factor,

such as clinical center, disease severity, gender, etc. In such a situation, it still

may be desirable to estimate the odds ratio while accounting for strata effects.

The Mantel-Haenszel test for the odds ratio assumes that the odds ratio is

equal across all strata, although the rates, p1 and p2, may differ across strata.

This procedure calculates the odds ratio within each stratum and then

combines the strata estimates into one estimate of the common odds ratio.

For example, the success probabilities p1 and p2 may vary from stratum to stratum even though the odds ratio is constant across strata. Consider a multi-center safety and efficacy study at six sites, with a binary outcome (success/failure), for comparing placebo and treatment.

SAS PROC FREQ yields an estimated odds ratio of 1.84 with an approximate

95% confidence interval is (1.28, 2.66).

The exact 95% confidence interval is (1.26, 2.69). The exact and asymptotic

confidence intervals are nearly identical due to the large sample size across

the six clinical centers.

The Mantel-Haenszel test rejects H0: ψ = 1 at the 0.05 significance level, consistent with the 95% confidence interval not containing 1.0. (Later in this

chapter we discuss the construction of the Mantel-Haenszel test statistic.)

11.4 - Safety and Efficacy (Phase II) Studies: Trend Analysis

In some safety and efficacy studies, it is of interest to determine if an increase

in the dose yields an increase (or decrease) in the response. The statistical

analysis for such a situation is called a dose-response or trend analysis. We

want to see a trend here, not just a difference in groups. Typically, patients in

a dose-response study are randomized to K + 1 treatment groups (a placebo

dose and K increasing doses of the drug). The response variables of interest

may be binary, ordinal, or continuous (in some circumstances, the response

variable may be a time-to-event variable). In some instances a trend test can be more sensitive, revealing a mild trend where pair-wise comparisons would not be able to find significant differences.

For the sake of illustration, suppose that the response is continuous and that

we want to determine if there is a trend in the K + 1 population means.

The hypothesis for an increasing trend is

H0: {μ0 = μ1 = ... = μK} versus
H1: {μ0 ≤ μ1 ≤ ... ≤ μK with at least one strict inequality}

The hypothesis for a decreasing trend is

H0: {μ0 = μ1 = ... = μK} versus
H1: {μ0 ≥ μ1 ≥ ... ≥ μK with at least one strict inequality}

The two-sided hypothesis, for a trend in either direction, is

H0: {μ0 = μ1 = ... = μK} versus
H1: {μ0 ≤ μ1 ≤ ... ≤ μK or μ0 ≥ μ1 ≥ ... ≥ μK with at least one strict inequality}

More than likely we would use one of the one-sided tests as you probably

have a hunch about the effect that will result.

A useful nonparametric procedure is the Jonckheere-Terpstra (JT) trend test that was developed in the 1950's. The JT trend test is based on a sum of Mann-Whitney-Wilcoxon tests:

JT = Σ_{k=0}^{K−1} Σ_{k'=k+1}^{K} MWW_kk'

where MWW_kk' is the Mann-Whitney-Wilcoxon statistic comparing group k with group k', 0 ≤ k < k' ≤ K. Essentially, each pair of groups is compared and the results are summed. In this way the test looks for trends.

If Yki, i = 1, ..., nk, denote the observations from group k and Yk'i', i' = 1, ..., nk', denote the observations from group k', then

MWW_kk' = Σ_{i=1}^{nk} Σ_{i'=1}^{nk'} sign(Yk'i' − Yki)

In other words, when comparing an observation from a lower dose group versus an observation from a higher dose group, take the sign of the difference of the latter minus the former.

Suppose there are K + 1 = 4 dose groups in a study (placebo, low dose, mid dose, and high dose). Then the JT trend test is the sum of six Mann-Whitney-Wilcoxon test statistics:

{placebo vs. low dose} +

{placebo vs. mid dose} +

{placebo vs. high dose} +

{low dose vs. mid dose} +

{low dose vs. high dose} +

{mid dose vs. high dose}

Large positive values of the statistic JT support the alternative hypothesis of an increasing trend.

The JT trend test actually is testing hypotheses about population medians, but

if the underlying probability distribution is symmetric, the population mean and

the population median are equal to one another. The JT trend test is available

in PROC FREQ of SAS.
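Here is a minimal Python sketch of the JT statistic as a sum of pairwise sign comparisons; the response data are hypothetical, made up only to illustrate the computation.

# Jonckheere-Terpstra statistic as a sum of pairwise sign comparisons.
def sign(x):
    return (x > 0) - (x < 0)

def jt_statistic(groups):
    """groups: list of response lists ordered from lowest to highest dose."""
    jt = 0
    for k in range(len(groups) - 1):
        for kp in range(k + 1, len(groups)):
            # MWW_kk' = sum of sign(higher-dose obs - lower-dose obs)
            jt += sum(sign(y_hi - y_lo) for y_lo in groups[k] for y_hi in groups[kp])
    return jt

# Hypothetical data: placebo, low, mid, and high dose groups.
data = [[3, 5, 4, 6],      # placebo
        [5, 6, 7, 5],      # low dose
        [6, 8, 7, 9],      # mid dose
        [8, 9, 10, 9]]     # high dose

print("JT statistic:", jt_statistic(data))   # large positive value -> increasing trend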

A parametric analogue of the JT trend test, suitable for approximately normal data, is to substitute the differences between sample means for the Mann-Whitney-Wilcoxon statistics. The numerator for the parametric test is as follows:

Σ_{k=0}^{K−1} Σ_{k'=k+1}^{K} (Ȳk' − Ȳk)

Assume that each group has the same population variance, σ². The population variance is estimated by the pooled sample variance, s², which has d degrees of freedom:

s² = (1/d) Σ_{k=0}^{K} Σ_{i=1}^{nk} (Yki − Ȳk)²,   d = Σ_{k=0}^{K} (nk − 1)

The numerator can be written as a linear contrast of the sample means, Σ_{k=0}^{K} ck Ȳk, and the corresponding test statistic is

T = ( Σ_{k=0}^{K} ck Ȳk ) / √( s² Σ_{k=0}^{K} ck²/nk )

For example, if K = 3 (placebo, low dose, mid dose, and high dose), then c0 =

-3, c1 = -1, c2 = 1, c3 = 3. Notice, however, that if there are an odd number of

groups, then the middle group has a coefficient of zero. For example, with K =

2 (placebo, low dose, and high dose) c0 = - 1, c1 = 0, c2 = 1. This is not ideal

and there are better trend tests than JT and T for continuous data.

To use the actual dose values (denoted as d0, d1, ..., dK) in the parametric test, set ck = dk − mean(d0, d1, ..., dK), k = 0, 1, ..., K. Such a contrast can be tested in SAS PROC GLM.

The JT trend test works well for binary and ordinal data, as well as being

available for continuous data.

Another trend test for binary data is the Cochran-Armitage (CA) trend test.

The difference between the JT and CA trend tests is that for the latter test, the

actual dose levels can be specified. In other words, instead of designating the

dose levels as low, mid, or high, the actual numerical dose levels can be used

in the CA trend test, such as 20 mg, 60 mg, and 180 mg.

The CA trend test, however, can yield unusual results if there is unequal

spacing among the dose levels. If the dose levels are equally spaced and the

sample sizes are equal (n0 = n1 = ... = nK), then the JT and CA trend tests yield

exactly the same results. Each of these parameters needs to be taken into

account to make sure you are applying the best test for your data.


11.5 - Safety and Efficacy (Phase II) Studies: Survival Analysis

In many clinical trials involving serious diseases, such as cancer and AIDS, a

primary objective is to evaluate the survival experience of the cohort. In

clinical trials not involving serious diseases, survival may not be an outcome,

but other time-to-event outcomes may be important. Examples include time to

hospital discharge, time to disease relapse, time to getting another migraine,

time to progression of disease, etc.

The Kaplan-Meier (product-limit) method provides an estimate of the probability of survival, even in the presence of censoring (e.g., the study is

completed before the patient experiences the event), at any point in time. This

statistical approach is nonparametric because it does not assume any

particular distribution for the data, such as lognormal, exponential, or Weibull.

It is a "robust" procedure because it is not adversely affected by one or more

unusual data points.

In order to construct the Kaplan-Meier survival curve, the actual failure times

need to be ordered from smallest to largest. In a sample size of n patients,

denote these times of failure as t1, t2, ... , tK. For convenience, let t0 = 0 denote

the start time and let tK+1 = ∞.

At the kth failure time, tk, the number of failures, dk, are noted as well as the

number of patients who were at risk for failure immediately prior to tk, nk.

Notice that patients who are lost to follow-up (censored) prior to time tk are not

included in nk.

The algebraic formula for the Kaplan-Meier survival probability at time t is:

Ŝ(t) = 1,   t0 ≤ t < t1

Ŝ(t) = Π_{j=1}^{k} (1 − dj/nj),   tk ≤ t < tk+1,   k = 1, 2, ..., K

The calculation of Ŝ(t) utilizes conditional probability: at each failure time tk, the conditional probability of surviving beyond tk, given survival up to tk, is estimated by 1 − dk/nk, and these conditional probabilities are multiplied together. Ŝ(t) is the probability of surviving beyond time t.

[Example table: for each k, the failure time tk (days), the number of events dk, the number at risk nk, and the Kaplan-Meier estimate Ŝ(tk).]

Note that the probability estimate does not change until a failure event occurs.

Also, censored values do not affect the numerator, but do affect the

denominator. Thus, the Kaplan-Meier survival curve gives the appearance of a

step function when graphed.
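A minimal Python sketch of the product-limit calculation, using a small hypothetical data set with censoring, illustrates the step-function construction.

# Minimal Kaplan-Meier calculation for a small, hypothetical data set.
# Each record is (time, event) where event = 1 for a failure, 0 for censoring.
records = [(6, 1), (7, 0), (10, 1), (13, 0), (16, 1), (22, 1), (23, 0)]

records.sort()
n_at_risk = len(records)
surv = 1.0
print(" time  d_k  n_k   S(t)")
for time, event in records:
    if event == 1:
        d_k = 1
        surv *= (1 - d_k / n_at_risk)       # multiply in the conditional survival
        print(f"{time:5d}  {d_k:3d}  {n_at_risk:3d}  {surv:.3f}")
    n_at_risk -= 1                          # this subject leaves the risk set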

11.6 - Comparative Treatment Efficacy

(Phase III) Trials

For comparative treatment efficacy (CTE) trials, the primary endpoints often

are measured on a continuous scale. The sample mean (sample standard

deviation), the sample median (sample inter-quartile range), or the sample

geometric mean (sample coefficient of variation) serve as reasonable

descriptive statistics in such circumstances.

The sample mean (sample standard deviation) is suitable if the data are

normally distributed or symmetric without heavy tails. The sample median

(sample inter-quartile range) is suitable for symmetric or asymmetric data. The

sample geometric mean (sample coefficient of variation) is suitable when the

data are log-normally distributed.

Usually two-sample t tests or Wilcoxon rank tests are applied to compare the

two randomized groups. In some instances, baseline measurements (prior to

randomized treatment assignment) of the primary endpoints are taken.

Suppose Yi1 and Yi2 denote the baseline and final measurements of the

endpoint, respectively, for the ith subject, i = 1, 2, , n. Instead of statistically

analyzing the Yi2s, there could be an increase in precision by analyzing the

change (or gain) in the response, namely, the differences Yi2 − Yi1.

Suppose that the variance for each Yi1 and Yi2 is σ² and that the correlation between Yi1 and Yi2 is ρ (we assume that subjects are independent of each

other but that the pair of measurements within each subject are correlated).

This leads to

Var(Yi2 − Yi1) = Var(Yi2) + Var(Yi1) − 2Cov(Yi2, Yi1) = 2σ²(1 − ρ)

Therefore, Var(Yi2 − Yi1) < Var(Yi2) whenever ρ > 0.5. Because ρ > 0.5 often is the case for repeated measurements within patients, there may be more precision if

the Yi2 - Yi1 are analyzed instead of the Yi2. This happens all the time. Using

the patient as their own control is a good thing. We are interested in the

differences that are occurring, therefore we will subtract the treatment period

measurements from the baseline data for the patient. A two-sample t test or

the Wilcoxon rank sum test can be applied to the change-from-baseline

measurements if the CTE trial consists of two randomized groups, such as

placebo and an experimental therapy.

An alternative analysis uses analysis of covariance (ANCOVA). In this situation, the baseline measurement, Yi1, serves as a

covariate, so that the final measurement for a subject is adjusted by the

baseline measurement. A linear model that describes this for a two-armed trial

with placebo and experimental treatment groups is as follows. The expected

value for the ith patient, i = 1, 2, , n, is:

E(Yi2) = μP + Ti(μE − μP) + βYi1

where μP is the population mean for the placebo group, μE is the population mean for the experimental treatment group, Ti = 0 if the ith patient is in the placebo group and 1 if in the experimental treatment group, and β is the slope for the baseline measurement.

The expected values for the ith patient in the placebo group and in the experimental treatment group, respectively, can be rewritten as E(Yi2) = μP + βYi1 and E(Yi2) = μE + βYi1. Compare this with the corresponding model for the change-from-baseline measurements, E(Yi2 − Yi1) = μP + Ti(μE − μP), which can be rewritten as E(Yi2) = μP + Ti(μE − μP) + Yi1.

The only difference between the two approaches is that in the change-from-

baseline measurements, β is set equal to 1.0. In the ANCOVA approach, β is estimated in the analysis and may differ from 1.0. Thus, the ANCOVA approach is more flexible and can yield slightly more statistical power and efficiency.
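A hedged sketch of both analyses on simulated data, using the statsmodels formula interface (the variable names and simulated effect sizes are illustrative assumptions, not taken from any trial), shows the change-from-baseline model as ANCOVA with the baseline slope forced to 1.

# ANCOVA vs change-from-baseline on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 100
treat = rng.integers(0, 2, size=n)                  # 0 = placebo, 1 = experimental
baseline = rng.normal(100, 10, size=n)              # Y_i1
final = 20 + 0.8 * baseline - 5 * treat + rng.normal(0, 5, size=n)   # Y_i2

df = pd.DataFrame({"treat": treat, "baseline": baseline, "final": final,
                   "change": final - baseline})

# ANCOVA: final value adjusted for baseline (slope beta estimated from the data).
ancova = smf.ols("final ~ treat + baseline", data=df).fit()

# Change-from-baseline analysis: equivalent to forcing beta = 1.
change = smf.ols("change ~ treat", data=df).fit()

print("ANCOVA treatment effect:      ", round(ancova.params["treat"], 2))
print("Change-score treatment effect:", round(change.params["treat"], 2))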

If the primary endpoint in a CTE trial is a time-to-event variable, then it will be

of interest to compare the survival curves of the randomized treatment arms.

Again, we will focus on a nonparametric approach that corresponds to

comparing the Kaplan-Meier survival curves rather than a parametric

approach.

Suppose there are two randomized treatment groups, say P and E for placebo and experimental treatment. In this situation,

the Mantel-Haenszel test is called the logrank test.

The assumptions for the logrank test are that (1) the censoring patterns are

the same for the two treatment groups, and (2) the hazard functions for the

two treatment groups are proportional.

For each of the K distinct failure times across the two randomized groups, t1, t2, ..., tK, a 2 × 2 table is constructed. For failure time tk, k = 1, 2, ..., K, the table contains the numbers of failures, dPk and dEk, and the numbers of patients at risk, nPk and nEk, in the placebo and experimental groups, respectively.

The logrank statistic constructs an observed minus expected score, under the

assumption that the null hypothesis of equal event rates is true, for each of the

K tables and then sums over all tables:

O − E = Σ_{k=1}^{K} (nPk dEk − nEk dPk) / (nPk + nEk)

VL = Var(O − E) = Σ_{k=1}^{K} [ (dPk + dEk)(nPk + nEk − dPk − dEk) nPk nEk ] / [ (nPk + nEk − 1)(nPk + nEk)² ]

ZL = (O − E) / √VL
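The logrank computation can be sketched in Python as a loop over the distinct failure times, accumulating O − E and its variance exactly as in the formulas above; the event and at-risk counts below are hypothetical.

# Logrank statistic computed from per-failure-time 2 x 2 tables.
# Each row: (n_Pk at risk, d_Pk events, n_Ek at risk, d_Ek events) at time t_k.
from math import sqrt

tables = [(20, 1, 20, 0),
          (19, 0, 20, 1),
          (18, 2, 19, 0),
          (16, 1, 19, 1),
          (14, 2, 18, 0)]   # hypothetical counts

o_minus_e = 0.0
var_sum = 0.0
for nP, dP, nE, dE in tables:
    n = nP + nE
    d = dP + dE
    o_minus_e += (nP * dE - nE * dP) / n
    var_sum += d * (n - d) * nP * nE / ((n - 1) * n ** 2)

z = o_minus_e / sqrt(var_sum)
print(f"O - E = {o_minus_e:.3f}, Var = {var_sum:.3f}, Z = {z:.3f}")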

The generalized Wilcoxon (Gehan) test provides another nonparametric method for comparing survival curves and it is an extension of the Wilcoxon rank sum test in the

presence of censoring. It also requires that the censoring patterns for the two

treatment groups be the same, but it does not assume proportional hazards.

The first step in constructing the generalized Wilcoxon statistic is to pool the

two samples of survival times (including censored values) and order them

from lowest to highest. For the ith observation in the ordered sample with

survival (or censored) time ti, construct a score, Ui, which represents the

number of survival (or censored) times less than ti minus the number of

survival (or censored) times greater than ti. The Ui are summed over the

experimental treatment group and a variance calculated, i.e.,

U = Σ_{i=1}^{nE} Ui   and   VU = Var(U) = [ nP nE / {(nP + nE)(nP + nE − 1)} ] Σ_{i=1}^{nP+nE} Ui²

such that

ZU = U / √VU

For example, scores might be computed as follows (a '+' denotes a censored time):

ti     Group        # less than ti    # greater than ti    Ui
6      Exp Treat    0                 7                    −7
10     Placebo      1                 6                    −5
12     Exp Treat    2                 4                    −2
17     Placebo      3                 2                     1
21     Placebo      4                 1                     3
25+    Placebo      5                 0                     5
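A hedged Python sketch of the score computation is shown below; the survival times are hypothetical, and the scoring rules simply count the comparisons that are "definite" in the presence of censoring.

# Gehan-type scores U_i for a pooled sample of survival times.
# Each record is (time, event) with event = 1 for a failure, 0 for censoring.
data = [(6, 1), (8, 0), (10, 1), (12, 1), (15, 0), (17, 1), (21, 1), (25, 0)]

def gehan_score(i, data):
    ti, ei = data[i]
    less = greater = 0
    for j, (tj, ej) in enumerate(data):
        if j == i:
            continue
        # j is definitely less than i if j failed no later than i's observed time
        # (strictly earlier when i is itself a failure).
        if ej == 1 and (tj < ti or (tj <= ti and ei == 0)):
            less += 1
        # j is definitely greater than i only when i is a failure and either
        # j failed later or j was censored at or after t_i.
        elif ei == 1 and (tj > ti or (ej == 0 and tj >= ti)):
            greater += 1
    return less - greater

scores = [gehan_score(i, data) for i in range(len(data))]
for (t, e), u in zip(data, scores):
    label = "event" if e else "censored"
    print(f"t = {t:>3} ({label:8s}): U = {u:+d}")
print("sum of all scores:", sum(scores))   # definite comparisons cancel to 0 overall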

As an example, a study was conducted in 83 patients with malignant mesothelioma, an uncommon

lung cancer that is strongly associated with asbestos exposure. Patients

underwent one of three types of surgery, namely, biopsy, limited resection,

and extrapleural pneumonectomy (EPP). Treatment assignment was

nonrandomized and based on the extent of disease at the time of diagnosis.

Thus, there can be a strong procedure selection bias here in this example.

The primary outcome variable was time to death (survival). SAS PROC

LIFETEST constructs the Kaplan-Meier survival curve for each surgery group

and compares the survival curves via the logrank test (p = 0.48) and the

generalized Wilcoxon test (p = 0.63).

Strength of Evidence

Although p-values are useful for hypothesis tests that are specified a priori,

they provide poor summaries of clinical effects. In particular, they do not

convey the magnitude of a clinical effect. The size of a p-value depends on

the magnitude of the estimated treatment effect and its estimated variability

(also a function of sample size). Thus, the p-value partially reflects the size of

the trial, which has no biological interpretation. In addition, the p-value can

mask the magnitude of the treatment effect, which does have biological

importance. P-values only quantify the type I error and do not characterize the

biologically important effects in the trial. Thus, p-values should not be used to

describe the strength of evidence in a trial. Investigators have to look at the

magnitude of the treatment effect.

Confidence intervals provide a much better summary of the strength of evidence in a clinical trial, although they also are affected by the sample size. Most major journals now require this approach, as it is far more informative than the p-value alone.

One of the most difficult statistical tasks is assessing the precision of an

estimator, i.e., determining the variance of an estimator can be more difficult

than determining the appropriate estimator. In complicated situations the

bootstrap method can be applied to estimate the variance of an estimator. The

bootstrap is essentially a resampling plan.

Suppose an investigator has a sample of N observations, denoted as Y1, Y2, ..., YN, and wants to estimate the median, θ, and get an expression for its variance. If the investigator does not want to make any assumptions about the distribution of the sample, then an explicit expression for the variance of the sample median does not exist.

The bootstrap can provide an estimate of the variance of the sample median. The approach is to create B bootstrap samples, each with N observations, from the original data set. Each bootstrap sample is

constructed by sampling with replacement from the original data set. This

means that when constructing a bootstrap sample, N observations are

generated one at a time where each Yi has 1/N probability of being selected.

Here is an example of the resampling of the original data:

Original sample: 17, 25, 16, 32, 27, 19, 25, 23, 22, 30


Thus, for b = 1, ..., B, the bootstrap sample Yb1, Yb2, ..., YbN is constructed and the sample median within the bth bootstrap sample is formed as θ̂b = median(Yb1, Yb2, ..., YbN). From the B estimates of the median we construct the estimated variance as:

S² = {1/(B − 1)} Σ_{b=1}^{B} (θ̂b − θ̄)²,   where θ̄ = (1/B) Σ_{b=1}^{B} θ̂b

Examining the B bootstrap medians (say, B = 100) gives a sense of how these medians are varying over the samples.

The variance estimate can then be used to construct a Z statistic for

hypothesis testing, i.e.,

Z = (θ̂ − θ) / S

Some statisticians at first were leery of this approach, essentially using one

sample to create many other samples from the original, i.e., "pulling oneself

up by your bootstraps". The bootstrap process, however, over time has shown

to have sound statistical properties. The disadvantage of this approach has to

do with the random selection with replacement which could result in slight

variations in results. The FDA, for instance, requires definitive results. This is

simply a nonparametric approach for estimating the variance of a sample statistic.
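Here is a minimal Python sketch of this bootstrap procedure, applied to the original sample of ten observations shown earlier, with B = 100 resamples.

# Bootstrap estimate of the variance of the sample median.
import random
import statistics

random.seed(2024)                       # for reproducibility of the sketch
original = [17, 25, 16, 32, 27, 19, 25, 23, 22, 30]
B = 100

boot_medians = []
for _ in range(B):
    # sample N observations with replacement from the original data
    resample = [random.choice(original) for _ in original]
    boot_medians.append(statistics.median(resample))

mean_median = sum(boot_medians) / B
s2 = sum((m - mean_median) ** 2 for m in boot_medians) / (B - 1)
print(f"sample median: {statistics.median(original)}")
print(f"bootstrap variance of the median: {s2:.3f}")
print(f"bootstrap standard error: {s2 ** 0.5:.3f}")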

Clinical trial data provide the opportunity for exploratory analyses, which are

analyses in addition to those specified by the primary objectives in the

protocol. The trial design is usually not well-suited for all of the exploratory

analyses that are performed, so the results may not have much validity.

Exploratory analyses should be viewed not as confirmatory, but rather as hypothesis-generating for future research. As a general rule, the

same data should not be used both to generate a new hypothesis and to test

that hypothesis. Unfortunately, many investigators do not follow this principle.

Data in sufficient quantity and detail can nearly always be made to yield some apparent effect. A few

statistical sayings attest to this, such as "the data will confess to anything if

tortured enough." It has been well documented that increasing the number of

hypothesis tests inflates the Type I error rate. Exploratory analyses typically

fall into this category and the chances of finding statistically significant results,

when none truly exist, can be very high.

Subset analyses are a form of exploratory analyses that are very popular with

clinical trials data. For example, after performing the primary statistical

analyses, the investigators might decide to compare treatment groups within

certain subsets, such as male subjects, female subjects, minority subjects,

subjects over the age of 50, subjects with serum cholesterol above 220, etc.

Unless it is planned ahead of time, such analyses should remain exploratory.

11.9 - Summary

In this lesson, among other things, we learned:

an odds ratio adjusted for strata effects.

groups.

trend.

Interpret a Kaplan-Meier survival curve.

estimator.

Let's put what we have learned to use by completing the following homework

assignment:

Homework

Introduction

Predictor variables in clinical studies go by several names: independent variables, prognostic factors, regressors, and covariates. Prognostic factor

analysis (PFA) is an analysis that attempts to assess the relative importance

of several predictor variables simultaneously. Typically, a PFA uses one or

more predictor variables that were not controlled by the investigator.

One reason for studying prognostic factors is to learn the relative importance

of several variables that might affect, or be associated with, disease outcome.

A second reason for studying prognostic factors is to improve the design of

clinical trials. For example, if a prognostic factor is identified as strongly

predictive of disease outcome, then investigators of future clinical trials with

respect to that disease should consider using it as a stratifying variable.

randomized clinical trials.

State how ANCOVA can reduce difficulties resulting from imbalance in

prognostic factors.

recommended. Recognize the difficulty presented with time-dependent

covariates.

quantitative interactions.

regression and proportional hazards.

Interpret the relevant portions of SAS output for these analyses.

study.

Reference

Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ:

John Wiley and Sons, Inc.

Knowledge of prognostic factors can improve the ability to analyze

randomized trials. Suppose that there is a strongly prognostic factor in a

clinical trial in which the investigators did not use it as a stratifying variable

(perhaps they were unaware of this factor) and that the treatment groups are

not balanced with respect to this prognostic factor. Then a simple comparison

of treatment groups A and B could yield misleading results as shown in the

illustration below.

You can see that most of the b's have the lower level of the covariate, and

most of the a's have the higher level of the covariate. The apparent difference

between the A mean and the B mean is due to the differing values in the

covariate. But how do we adjust for these differences and then compare A and

B?

An ANCOVA model can adjust for the problem of different levels of the covariates to yield a fair comparison of

treatment groups. With respect to nonrandomized studies, ANCOVA models

can have the same effect.

The unadjusted sample means are ȲA = (1/nA) Σ_{i=1}^{nA} YAi and ȲB = (1/nB) Σ_{i=1}^{nB} YBi.

The adjusted means for groups A and B, respectively, from the ANCOVA are:

ȲA,adj = (1/nA) Σ_{i=1}^{nA} (YAi − XAi β̂)   and   ȲB,adj = (1/nB) Σ_{i=1}^{nB} (YBi − XBi β̂)

where X represents the covariate and β̂ represents the estimated slope for

X based on the entire set of covariate values across both groups. We are

interested in the adjusted values, which have accounted for the covariate. The

adjusted mean is an average of the adjusted values. In the graphical

illustration above, the covariate slope is positive and the covariate values for

the A group are greater, so the adjustment to the A mean is larger. Subtracting

these adjustments from the measured responses as indicated in the formula

for the adjusted mean (above) will bring the mean of A closer to the mean of

B.

12.2 - Interactions

It is important to examine treatment × covariate interactions. For example, it is

possible that the responses in the treatment groups differ for low levels of the

prognostic factor, but not differ for high levels of the prognostic factor.

Interpretations of statistical results in the presence of treatment × covariate

interactions can become complex.

The presence of an interaction can depend on the scale of analysis. For example, if the treatment × covariate interaction exists in an ANCOVA model

of the outcome variable (additive model), it is possible that it will disappear in

an ANCOVA model of the logarithm of the outcome variable (multiplicative

model). Therefore, interactions can exist or fail to exist based on the selection

of the statistical model and the assumptions associated with it.

Notice in the figure above that the difference between Treatments A and B is

constant and does not depend on the value of the covariate. There is no

treatment by covariate interaction in this scenario above.

Now, in the figure above, the difference between Treatments A and B is not

constant and does depend on the value of the covariate.

Prognostic factors can be continuous measures (age, baseline cholesterol,

etc.), ordinal (age categories, baseline disease severity, etc.), binary (gender,

previous tobacco use, etc.), or categorical (ethnic group, geographic region,

institution in multi-center trials, etc.).

Most prognostic factors are measured at baseline and do not change over

time, hence they are called time-independent covariates. Time-independent

covariates are easily included in many types of statistical models.

On the other hand, some prognostic factors do change over time; these are

called time-dependent covariates. For example, consider a clinical trial in

diabetics with selected dietary intake variables measured on a regular basis

over the course of a six-month clinical trial. Dietary intake could affect the

severity of disease and exacerbate problems for diabetic patients, so a

statistical analysis of the trial might incorporate these prognostic factors as

time-dependent covariates. We have to be very careful when using time-

dependent covariates.

A statistical analysis can produce misleading results if the time-dependent covariates are affected by treatment. In such a

situation, removal of the effect of the time-dependent covariates can also

cause removal of the treatment effect. For example, suppose that one of the

treatments in the aforementioned diabetes example has an additional effect of

increasing appetite. Then removing the effect of dietary intake also could

remove the effect of the treatment!

The schematic below represents the situation where the treatment is affecting

both the covariates and the outcome. The covariates also affect the outcome.

Adjusting for the effect of the covariate over time may account for the majority

of the treatment effect on the outcome.

As another example, suppose that a treatment for decreasing diastolic blood pressure is compared to placebo. Suppose that

the investigator wants to use pulse as a time-dependent covariate in an

ANCOVA model because there is a positive correlation between pulse and

diastolic blood pressure. Suppose that the treatment not only reduces diastolic

blood pressure, but reduces pulse as well. Then an analysis that removes the

effect of pulse, measured over the course of the trial, could remove the effect

of treatment. The misleading results from the model would cause the

treatment effect on diastolic blood pressure to remain undetected.

12.3 - Models for Continuous Outcomes

A model relating the outcome variable to the prognostic factors and treatment

effects is a construct that makes use of theoretical knowledge and empirical

knowledge. In mathematical and statistical models, the theoretical component

is represented by one or more equations and the empirical component is

represented by data. The behavior of a model is governed by its structure or

functional form and by the unknown quantities or constants (parameters).

Objectives of the modeling exercise might include estimation of the

parameters, determination of model fit, or efficient summarization of large

amounts of data.

Dr. George Box once stated: "All models are wrong, but some are useful."

A linear model often is used; in this situation, "linear" refers to the fact that the deterministic component of the model is a linear combination of parameters and covariates. Statistical models typically contain a deterministic component and a random component.

An example is the linear model that is used for multiple regression. Let Y

denote the outcome variable and X1, X2, . . . , XK denote K different regressors

(predictors) that are measured on each of n patients. Then the statistical

model for patient i, i = 1, 2, . . . , n, is

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki} + \varepsilon_i$

where $\beta_0, \beta_1, \ldots, \beta_K$ are unknown parameters and $\varepsilon_i$ represents the random error term for patient i.

In the multiple regression model, $\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki}$ represents the deterministic portion of the model for patient i and $\varepsilon_i$ represents the random error term for patient i, i = 1, 2, . . . , n.

The error terms are assumed to be independent and identically distributed random variables, each following a N(0, $\sigma^2$) distribution.

The statistical model for a one-way ANOVA with K treatment groups is

$Y_{ij} = \mu_i + \varepsilon_{ij}$

where $Y_{ij}$ denotes the response for the jth patient within the ith treatment group, $\mu_i$ denotes the population mean for the ith treatment group, and $\varepsilon_{ij}$ represents the random error term for the jth patient within the ith treatment group.

The statistical model for a one-way ANCOVA with K treatment groups and three covariates is:

$Y_{ij} = \mu_i + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \beta_3 X_{3ij} + \varepsilon_{ij}$

where the notation is similar to that for the one-way ANOVA with K treatment groups, and $X_{1ij}, X_{2ij}, X_{3ij}$ denote the values of the three covariates for the jth patient within the ith treatment group.

The covariates (or regressors in a regression model) may be continuous, ordinal, or binary. A categorical covariate with L levels can be represented by L - 1 distinct covariates that are binary (called dummy variables). One way to do this is to select a reference level and let the dummy variables correspond to the remaining L - 1 levels.

For example, suppose that there are four centers in a multi-center trial and

that it is desirable to model for center effects. The above ANCOVA model can

be invoked with center #4 as the reference level:

X1ij = 1, if patient (i,j) is in center #1; 0 otherwise

X2ij = 1, if patient (i,j) is in center #2; 0 otherwise

X3ij = 1, if patient (i,j) is in center #3; 0 otherwise

The implications of the model are that $\mu_1, \mu_2, \ldots, \mu_K$ represent treatment means within the reference center (center #4). Patients within center #1 have treatment means $\mu_1 + \beta_1, \mu_2 + \beta_1, \ldots, \mu_K + \beta_1$, so that $\beta_1$ represents the change in any treatment mean between center #4 and center #1.

Statistical software packages for multiple regression typically require the user

to recode categorical regressors/covariates in this manner (SAS PROC REG),

whereas the statistical software packages for ANOVA and ANCOVA can

recode categorical regressors/covariates for the user (the CLASS statement in

SAS PROC ANOVA and SAS PROC GLM).
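As a minimal sketch of the two approaches (the data set and variable names here are hypothetical), the center dummy variables can be coded by hand for PROC REG, whereas PROC GLM handles the recoding through its CLASS statement:

/* Hand-coded dummy variables for PROC REG; center #4 is the reference level */
data trial2;
   set trial;
   x1 = (center = 1);
   x2 = (center = 2);
   x3 = (center = 3);
run;

proc reg data=trial2;
   model y = trt x1 x2 x3;           /* trt assumed to be coded numerically (0/1) */
run;

/* PROC GLM recodes categorical covariates automatically via CLASS */
proc glm data=trial;
   class trt center;
   model y = trt center / solution;
run;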

Interaction terms also can be included as regressors. For example, suppose that X1 represents age and X2 represents serum cholesterol. A third regressor, X3 = X1 × X2, can be constructed as the product and might be important to include in the model if only old age in combination with high cholesterol has an impact on the outcome.

Higher-order interaction terms can be constructed as products of more than two regressors. Of course,

constructing more regressors in this manner can get unwieldy and lead to an

unmanageable number of potential regressors to consider.

Treatment × covariate interactions are important to investigate in both randomized and nonrandomized studies. Treatment × center interactions are important to investigate in multi-center trials.

In the presence of strong interactions, it may not be possible to interpret main effects for treatment.

For example, suppose that in a trial comparing treatments A and B, baseline cholesterol level is considered an important covariate. Suppose that the treatment × cholesterol interactions are significant such that for low cholesterol levels treatment A is better than treatment B, but for high cholesterol levels treatment B is better than treatment A. Thus, it is not possible to conclude that one treatment is superior because the choice of the best treatment depends on baseline cholesterol levels, an important discovery in and of itself.

If the investigator focuses on a specific region of values for the covariate, such

as high baseline cholesterol, then it may be possible to determine which

treatment is superior in this region.

Sometimes it is possible to make general conclusions if the interactions are

due to the magnitude of the effect.

For example, suppose that treatment A is better than treatment B by a smaller amount in the presence of low baseline serum cholesterol, but treatment A is 60 units better than treatment B in the presence of high baseline serum cholesterol levels. Even though there is a significant treatment × covariate interaction, it still appears that treatment A is superior. This type of treatment × covariate interaction is called "quantitative."

Profile plots (mean outcome response versus the covariate) for each treatment group will indicate graphically whether the interactions are qualitative (lines not parallel and crossing) or quantitative (lines not parallel but not crossing).

12.4 - Examples

SAS Example ( 13.1_multiple regression.sas [1] )


The investigators of a study of nutrition education in hypercholesterolemic children [2] (American Journal of Public Health 1998; 88: 258-261) conducted a trial in which hypercholesterolemic children were randomized to three different nutritional educational groups:

2. counseling;

3. control.

Measurements were taken at 0 (baseline), 3, 6, and 12 months. A multiple regression analysis was performed to determine whether demographic and nutritional intake variables were predictive of LDL cholesterol at baseline.

Run this program. Look at the regression output. Do you see only one

statistically significant predictor? Female status (p = 0.0004). Now look at the

output for the means and the scatterplot of LDL by Sex. Do you see that the mean LDL among females is 6.2 mg/dL greater than the male average LDL?
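A minimal sketch of the kind of analysis in this example (this is not the 13.1 program itself; the data set name and the regressor names other than sex are hypothetical placeholders):

/* Baseline multiple regression of LDL on demographic and intake variables */
proc reg data=cholesterol;
   model ldl0 = age sex calories satfat;
run;

/* Mean LDL by sex, to compare with the regression coefficient for sex */
proc means data=cholesterol n mean std;
   class sex;
   var ldl0;
run;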

SAS Example ( 13.2_ANCOVA.sas [3] ): The longitudinal data from the one-

year follow-up are provided but not analyzed (beyond the scope of this

course).

The investigators of a trial of methylprednisolone (Methylprednisolone therapy in patients with severe alcoholic hepatitis. Annals of Internal Medicine 1989; 110: 685-690) conducted a multi-

center, double-blinded, randomized trial in which they compared

methylprednisolone to placebo in patients with severe alcoholic hepatitis.

Treatment was administered over four weeks. The primary outcome was time

to death, and the Kaplan-Meier survival curves and the logrank test indicated

that methylprednisolone was superior to placebo.

Important baseline covariates included prothrombin time, bilirubin, and hematocrit. An ANCOVA model was invoked to compare

the treatments with respect to bilirubin at week #4 with baseline bilirubin and

clinical center as covariates.
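A sketch of an ANCOVA of this form (not the 13.2 program itself; the data set name and the variable names treatment and center are assumptions, while bilirubin0 and bilirubin4 follow the names used in the output described below):

/* Week-4 bilirubin adjusted for baseline bilirubin and clinical center,    */
/* with treatment interactions included so they can be tested (Type III SS) */
proc glm data=hepatitis;
   class treatment center;
   model bilirubin4 = treatment center bilirubin0
                      treatment*center treatment*bilirubin0 / ss3;
   lsmeans treatment / stderr pdiff;   /* covariate-adjusted treatment means */
run;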

Here are a couple of places at the end of the program that you will want to

make note of:

Now, run the program and look at the output. Notice 66 observations read, 47

used. This means 19 patients are missing data for a term in the model so

SAS cannot use their data. In this case, 19 are missing bilirubin4

measurements.

Considering the Type III SS in the output, do you see that baseline bilirubin

(bilirubin0) was an important covariate? The treatment groups do not differ

significantly. Nor was significant interaction observed between center and

treatment or baseline bilirubin and treatment.

Baseline bilirubin has a highly significant effect. The difference in the raw means at Week 4 is about 4 units, but it is reduced after adjusting for the covariates (see the adjusted means).

The difference between treatments was not statistically significant (p = 0.2782), although methylprednisolone showed a numerical advantage over placebo, with adjusted means of 7.84 and 10.84, respectively, for the two treatments.

Binary and Ordinal Outcomes

For a binary outcome, logistic regression analysis is used to model the log

odds as a linear combination of parameters and regressors. Let p(X1, X2, ... ,

XK) denote the probability of success in the presence of the K regressors. The

logistic regression model for the log-odds for the ith patient is

$\log\left(\dfrac{p(X_{1i}, X_{2i}, \ldots, X_{Ki})}{1 - p(X_{1i}, X_{2i}, \ldots, X_{Ki})}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki}$

Notice that $\beta_0$ represents the reference log odds, i.e., when X1i = 0, X2i = 0, ... ,

XKi = 0. Consider a simple model with one covariate (K = 1) which is binary,

e.g., X1i = 0 if the ith patient is in the placebo group and 1 if the ith patient is in

the treatment group. Then the log odds ratio for comparing the treatment to

the placebo group is

$\log\left(\dfrac{p(X_{1i}=1)}{1-p(X_{1i}=1)} \Big/ \dfrac{p(X_{1i}=0)}{1-p(X_{1i}=0)}\right) = (\beta_0 + \beta_1) - \beta_0 = \beta_1$

so that the odds ratio for the treatment group versus the placebo group is $\exp(\beta_1)$. More generally, if X1 is a continuous covariate, the log odds ratio comparing a patient with X1i = x to a patient with X1i = 0 is

$\log\left(\dfrac{p(X_{1i}=x)}{1-p(X_{1i}=x)} \Big/ \dfrac{p(X_{1i}=0)}{1-p(X_{1i}=0)}\right) = (\beta_0 + \beta_1 x) - \beta_0 = \beta_1 x$

so that the odds ratio is $\exp(\beta_1 x)$. This illustrates that changes in a covariate have a multiplicative effect on the baseline risk.

For example, suppose x represents (age - 18) in a study of adults, and that the estimated coefficient is $\hat{\beta}_1$ = 0.04 with a p-value < 0.05. Then the estimated odds ratio is exp(0.04) = 1.041. This may not seem like a clinically meaningful odds ratio, but remember that it represents the increase in odds between a 19-year-old and an 18-year-old. For a 25-year-old person, the estimated odds ratio is exp(0.04 × 7) = 1.323.
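A quick way to verify these numbers (a sketch; the data set name is arbitrary):

data oddsratio;
   beta1 = 0.04;                /* estimated log odds ratio per year above age 18 */
   or_19 = exp(beta1);          /* 19- vs 18-year-old: about 1.041 */
   or_25 = exp(beta1 * 7);      /* 25- vs 18-year-old: about 1.323 */
run;

proc print data=oddsratio;
run;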

For the logistic regression model, each $\beta_j$, j = 1, 2, ... , K, represents the log odds ratio for the jth covariate. An equivalent expression for the logistic regression model in terms of the probability is

$p(X_{1i}, X_{2i}, \ldots, X_{Ki}) = \dfrac{1}{1 + \exp\{-(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki})\}}$

Next, suppose that an outcome variable, Y, is ordinal and that we designate its ordered categories as 0, 1, ... , C. We model the ordinal logits as

$\log\left(\dfrac{\Pr[Y \ge c \mid X_{1i}, X_{2i}, \ldots, X_{Ki}]}{1 - \Pr[Y \ge c \mid X_{1i}, X_{2i}, \ldots, X_{Ki}]}\right) = \beta_{0c} + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki}, \quad c = 1, 2, \ldots, C$

The ordinal logistic regression model has C intercept terms, but only one term

for each regressor. This reduced modeling for an ordinal outcome assumes

proportional odds (beyond the scope of this course).
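In SAS, PROC LOGISTIC fits this proportional odds model automatically when the response variable has more than two ordered levels. A minimal sketch with hypothetical data set and variable names:

/* Ordinal (proportional odds) logistic regression: y has ordered levels 0, 1, ..., C */
proc logistic data=ordinal_example;
   model y = x1 x2;             /* C intercepts, one slope per regressor */
run;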

A report on the masking of physicians in the Growth Failure in Children with Renal Disease clinical trial (Pediatric Nephrology 1993; 7: 204-206) investigated the success of the masking in the randomized, double-blinded, multi-center GFRD [6] clinical trial.

The clinical director at each center was asked to identify or guess the

assigned treatment for each randomized patient.

A logistic regression analysis was performed with a binary outcome of incorrect/correct guess. Regressors included treatment group and months in the study. Note the creation of the binary variable 'newscore' from the score variable, within the data step before the PROC LOGISTIC statements. Similarly a binary variable 'treatment' is created from the variable 'trtgroup'.

The output states "Probability modeled is newscore=1," which also indicates the order for calculating the odds ratio. The confidence intervals for the odds ratios all include 1. With no statistically significant results, the investigators remained confident that the masking scheme was successful.
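A sketch of the kind of analysis described here (this is not the course's program; the data set name, the variable months, and the exact recoding rules are hypothetical, although newscore, score, treatment, and trtgroup are the variable names mentioned above):

data gfrd2;
   set gfrd;
   newscore  = (score > 0);              /* hypothetical recoding of score into 0/1 */
   treatment = (trtgroup = 2);           /* hypothetical recoding of trtgroup into 0/1 */
run;

proc logistic data=gfrd2 descending;     /* DESCENDING models Pr(newscore = 1) */
   model newscore = treatment months;
run;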

Time-to-event Outcomes

For a time-to-event outcome variable, proportional hazards regression is available. Let $\lambda(t \mid X_{1i}, X_{2i}, \ldots, X_{Ki})$ denote the hazard function for the ith patient at time t, i = 1, 2, ... , n, where the K regressors are denoted as $X_{1i}, X_{2i}, \ldots, X_{Ki}$. The baseline hazard function at time t, i.e., when X1i = 0, X2i = 0, ... , XKi = 0, is denoted as $\lambda_0(t)$. The baseline hazard function is analogous to the intercept term in the multiple regression model or logistic regression model.

The proportional hazards regression model states that the log of the ratio of the hazard function to the baseline hazard function at time t is a linear combination of parameters and regressors, i.e.,

$\log\left(\dfrac{\lambda(t \mid X_{1i}, X_{2i}, \ldots, X_{Ki})}{\lambda_0(t)}\right) = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki}$

The model is semi-parametric in that it does not specify a specific distribution function for the time-to-event outcome variable. The proportionality assumption, however, is important.

Each exponentiated regression coefficient in the proportional hazards regression model is a function of relative risk (unlike the logistic regression models, which are a function of the odds ratio).

Changes in a covariate have a multiplicative effect on the baseline risk. The

model in terms of the hazard function at time t is:

$\lambda(t \mid X_{1i}, X_{2i}, \ldots, X_{Ki}) = \lambda_0(t)\exp\left(\beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_K X_{Ki}\right)$

Building a Model

Regression/ANCOVA models as described above are most useful when:

the model includes the relevant prognostic variables

adjustment for the covariates improves precision

the assumed form of the relationship between the covariates and the outcome variable is approximately correct

For a given situation, however, it may not be easy to construct a model that satisfies these criteria.

Many statistical software packages contain algorithms in which aspects of the model-building process are automatic. Caution must be exercised, however, for the following reasons.

1. The automated algorithms rely solely on p-values.

2. A multiplicity problem arises from performing many significance tests and refitting models.

3. Clinical and biological knowledge is not incorporated into the automated model-building process.

4. The software may not handle the problem of missing data very well.

The final model should make sense in terms of the clinical situation. Some statisticians only use prognostic variables in the model for which there exist plausible biological reasons for their inclusion.

Model-Building Approaches

The model-building process typically follows one of several approaches.

One approach is step-up (forward) selection, in which the initial model contains no regressors but they enter the model one at a time. In this situation, a regressor enters the model if its p-value is less than a critical value, say 0.05.

A second approach is step-down (backward) elimination, in which the initial model contains all of the regressors. In this situation, a regressor is eliminated from the model if its p-value is not less than the critical value.

A third approach is stepwise selection, which combines step-up and step-down selection. In this situation, after a new variable enters the model, all the variables that had entered the model previously are reexamined to see if their p-values have changed. If any of the revised p-values exceed the critical value, then the corresponding variables are eliminated from the model.

A fourth approach involves finding the best one-variable-model, the best two-

variable model, etc. with the help of software, and then using judgment as to

which is the best overall model, i.e., if the (c+1)-variable model is only

slightly better than the c-variable model, the latter is selected. It is prudent to

attempt a variety of models and approaches to determine if the results are

consistent.
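The automated approaches are available as options in SAS regression procedures. A minimal sketch with hypothetical data set and regressor names:

/* Stepwise selection; SLENTRY and SLSTAY are the entry and stay p-value criteria */
proc reg data=mydata;
   model y = x1 x2 x3 x4 x5
         / selection=stepwise slentry=0.15 slstay=0.15;
run;

Replacing selection=stepwise with selection=forward or selection=backward gives the step-up and step-down approaches, respectively.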

No single approach is preferred in every situation, and there is no universal agreement among statisticians. It is not unusual to discover for a particular data set that step-up and step-down selection algorithms lead to different models. The main reason for this is that the regressors/covariates are not completely independent of each other.

When a variable is entered into or removed from a model, the p-values of the

other variables will change. Consider a linear model with two potential

regressors, X1 and X2, and suppose that they are strongly correlated

(independent variables is a misnomer). Suppose that in a model

with X1 only, X1 is significant, and in a model with X2 only, X2 is significant.

When a model is constructed with both X1 and X2, however, the contribution

by X2 to the model is no longer statistically significant. Because X1 and X2 are

strongly correlated, X2 has very little predictive power when X1 already is in the

model.

Careful screening of the candidate regressors prior to the automated model-building process is advised. Many statisticians recommend that each potential regressor be examined individually in a simple model. This can help identify potential regressors for which there is not a strong biological justification.

The critical p-value for this first-stage screening should be lenient, say 0.10 or 0.15. Then all of the regressors that meet this first-stage criterion and/or that have biological/clinical justification comprise the set of regressors that are subjected to the model-building process. Clinical input always should augment this first-stage process.

12.8 - Example

SAS Example ( 13.5_ph regression.sas [7] ): A safety and efficacy study was

conducted in 83 patients with malignant mesothelioma, an uncommon lung

cancer that is strongly associated with asbestos exposure. Patients underwent

one of three types of surgery, namely, biopsy, limited resection, and

extrapleural pneumonectomy (EPP). Treatment assignment was

nonrandomized and based on the extent of disease at the time of diagnosis.

Thus, there can be a strong procedure selection bias.

The statistical analysis used a stepwise selection process to build a model with prognostic factors in addition

to surgery type. Examine the program and note the following:

Run the program. Do you agree that histologic subtype is the only statistically significant covariate (p = 0.025)?
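A sketch of a stepwise proportional hazards regression of this type (not the 13.5 program itself; the data set and variable names are hypothetical, with status = 0 denoting a censored observation):

proc phreg data=meso;
   model survtime*status(0) = surg_lr surg_epp subtype age sex
         / selection=stepwise slentry=0.15 slstay=0.15;
run;

Here surg_lr and surg_epp would be hand-coded dummy variables for limited resection and EPP, with biopsy as the reference surgery type.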

When regressors/covariates are strongly correlated with one another, a problem known as collinearity can exist. Collinearity can cause difficulties in interpretation because it is not obvious which regressor in a set of highly correlated regressors should be used in the model.

Severe collinearity also can cause numerical problems for the software and yield strange results. It is recommended that the correlations among the set of regressors be examined prior to the model-building process, e.g., using PROC CORR of SAS.

If some of the pairwise correlations are strong (e.g., correlations above 0.8 or below -0.8), then most of the variables in a correlated set should not be included in the model-building process.
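A minimal sketch of this preliminary check (hypothetical data set and regressor names):

/* Pairwise correlations among candidate regressors, examined before model building */
proc corr data=mydata;
   var x1 x2 x3 x4 x5;
run;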

Missing values could cause a problem during the model-building process if

subjects display different patterns of missingness for the set of

regressors/covariates. For example, consider the aforementioned linear model

with two potential regressors, X1 and X2. Suppose that there are 100 subjects,

but 50 are missing X1 and the remaining 50 are missing X2. Thus, no subject

has X1 and X2 observed simultaneously, so a model with both regressors is

not possible. This is an extreme case, but most model-building endeavors

encounter some form of missingness.

Hopefully, missing data among the regressors/covariates are not related to the

outcome. If this is not the case, then it may not be possible to develop a

model that is unbiased. For example, if the patients with the most severe form

of the disease are the ones with missing values for the regressors/covariates,

then the resultant model that does not include these patients will be biased.

As has been discussed earlier, data imputation is one way to handle the

situation of missing values. Data imputation involves the estimation of the

missing values in a manner that is consistent and then imputing the

estimated values for the missing values. Thus, every subject will have a

complete set of regressors/covariates and the statistical analysis can proceed

without eliminating any subjects.

Imputation can be performed by substituting the mean of the observed values or by fitting regression models in which the regressors with missing values become the outcome variables. Obviously, there is some danger of introducing large biases with imputation, so it must be performed carefully on a case-by-case basis.

Make every effort to collect complete data to avoid such problems. When data

are missing, be certain to report the numbers of patients used in each analysis

and any methods used to impute missing values.

Statisticians have recommended a number of approaches to evaluate a

model. One approach involves partitioning the data set into an estimation data

set and a validation data set (usually in a two-thirds versus one-third split).

The estimation data set is used to build the model, and hence, estimate the

parameters. The validation data set is used to validate the model by inserting a patient's set of observed regressors into the estimated model equation and predicting the outcome response for that subject.

If the predicted outcome is relatively close to the observed outcome for the

subjects in the validation data set, then the model is considered valid.
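A sketch of one way to create such a split in SAS (the data set names are hypothetical); PROC SURVEYSELECT draws a random two-thirds sample and flags the selected records:

proc surveyselect data=mydata out=split samprate=0.667 outall seed=20101;
run;

data estimation validation;
   set split;
   if selected = 1 then output estimation;   /* two-thirds: build the model */
   else output validation;                   /* one-third: check the predictions */
run;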

Another approach is called the leave-one-out method and consists of

eliminating the first patient from the data set with n subjects, estimating the

model equation based on the remaining n - 1 patients, calculating the

predicted outcome for the first patient, and then comparing the first patient's predicted and observed outcomes.

This process is performed for each of the n patients and an overall validation

statistic is constructed.

These validation procedures work fine for nonrandomized studies, but for

randomized clinical trials, they probably should be applied only to secondary

and exploratory statistical analyses.

Comparative Efficacy (Phase III) Trials

Some statisticians do not like to perform adjusted analyses such as ANCOVA

in comparative efficacy (Phase III) trials because they feel that randomization

and proper analysis guarantee unbiasedness and the correctness of type I

error levels, even if there are chance imbalances among the treatment groups

with respect to prognostic factors.

This may be true, but the use of prognostic factors in ANCOVA models can

improve precision and verify biological information.

Many statisticians recommend performing an adjusted analysis to supplement the unadjusted analysis of data from a randomized trial under any of the following circumstances:

prognostic factors are imbalanced among the treatment groups;

strong prognostic factors exist, whether in the presence of balance (i.e., stratifiers) or imbalance among treatment groups;

adjustment for a covariate noticeably increases or reduces the treatment effect;

a treatment × covariate interaction is suspected, in order to illustrate and quantify its effect.

12.12 - Summary

In this lesson, among other things, we learned:

how ANCOVA can reduce difficulties resulting from imbalance in prognostic factors

when covariate adjustment is recommended, and the difficulty presented with time-dependent covariates

the distinction between qualitative and quantitative interactions

the use of logistic regression and proportional hazards regression for binary and time-to-event outcomes

how to interpret the relevant portions of SAS output for these analyses

how to select a model appropriate for a particular study

Let's put what we have learned to use by completing the following homework

assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in

ANGEL.

Introduction

Reporting the results of a clinical trial is one of the most important and least

studied aspects of clinical research. Investigators have an obligation to

disseminate trial results in a timely and competent manner. Many features of

good reports are similar for all types of trials and find widespread acceptance

in journals.

Other features, however, are specific to the particular trial and assume a reader familiar with the details of the disease or the intervention under study. The benefits of uniformity are evident in some chronic diseases like cancer, where standardized staging has improved trial design, reporting, and interpretation.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to:

recognize the issues surrounding publication and reporting in clinical research

identify the recommended elements of reports of randomized trials and reports of safety and efficacy studies

Reference

Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

Published reports should inform the reader about all aspects of study design,

conduct, analysis, and interpretation that are relevant for assessing the

internal and external validity of the trial. The content and quality of trial reports

in the literature, however, remain inconsistent on these points - this process

is not perfect. The content of the medical literature reflects an imperfect

editorial and peer review process. Despite the limitations of the peer review

process, good alternatives currently do not exist for judging the merits of

scientific papers.

Publication bias is the tendency for studies with positive findings to be preferentially selected for publication over those with negative findings (i.e., studies that did not find a statistically significant result). If an editor has a choice between publishing a positive study and one with negative results, they may prefer publishing the positive results for various reasons. However, negative studies are very important and should be made known as well. For instance, the early stopping of a study of interferon gamma-1b, when an interim analysis showed that patients with IPF did not benefit from the treatment, is important information for other IPF patients who may have been prescribed interferon gamma-1b in off-label use and for others taking it for the conditions for which it already has approval (FDA Public Health Advisory on Interferon gamma-1b). [1]

If only positive findings are published about a treatment or a group of related treatments from independent studies, then an impression biased in favor of the treatment could result. Editors and referees are not the only ones to blame for the presence of publication bias.

Investigators tend to lose enthusiasm for negative results because they may

be viewed as less glamorous and even viewed as failures. This could lead to

weaker reports, or even no reports, being submitted for publication. Journal

editors and referees can reduce publication bias in two ways. First, they

should assign greater weight to methodologic rigor and thorough reporting

than to statistical significance. Second, they must be willing to report negative

findings from sound studies with as much enthusiasm as positive reports.

Public access to study results

Along with efforts of individual journal editors to include negative trials, there

have been recent public and private initiatives to increase public access to

clinical trial results.

The U.S. NIH policy is that results of its funded research should be available

to the public.

ClinicalTrials.gov provides information for locating U.S. government- and privately-supported clinical trials and results of certain completed trials. Observational studies addressing health issues in large groups of people or populations in natural settings are also included in the ClinicalTrials.gov database.

The site was developed by the U.S. National Institutes of Health (NIH),

through its National Library of Medicine (NLM), in collaboration with the Food

and Drug Administration [2] (FDA), as a result of the FDA Modernization Act

(1997). The types of trials required to be registered at the site expanded with

the Food and Drug Administration Amendments Act of 2007. The "basic

results" of trials that study drugs, biologics or devices approved, licensed or

cleared by the FDA are now required to be posted in a timely manner. (read

more about the requirements http://www.clinicaltrials.gov/ct2/manage-

recs/fdaaa#WhenDoINeedToRegister [3] and http://www.clinicaltrials.gov/ct2/ab

out-site/results [4])

Many medical journals now have policies of only publishing a manuscript from

a completed clinical trial if the trial has been registered at ClinicalTrials.gov.

Pharmaceutical companies are providing access to results from sponsored studies through various mechanisms (e.g., [5] GSK [6], Pfizer [7], Merck [8]), and the Pharmaceutical Research and Manufacturers of America and the European Federation of Pharmaceutical Industries and Associations adopted joint Principles for Responsible Clinical Trial Data Sharing [9] (2013).

Using the proper summary/descriptive statistics is essential. Although

investigators may use standard deviations and standard errors

interchangeably, the standard deviation is appropriate as a descriptive

summary, whereas the standard error is intended to convey the uncertainty of

an estimate (such as the mean). Confidence intervals are more informative

than significance levels and p-values.

Authors do not always distinguish between clinical significance and statistical significance - but they should. Clinical significance can be expressed in terms of the magnitude and direction of treatment effects or differences. Although this is important for superiority trials, it is even more important for equivalence and non-inferiority trials.

Some journals require structured titles and abstracts because they are the

only part of many reports that some readers examine. Therefore, the abstract

becomes very important - in the medical literature the abstract is critical. A

good abstract for the report of a clinical trial includes objectives, design,

setting or types of practices, characteristics of the study population,

interventions used, primary outcome measurements, principal results, and

conclusions. Abstracts should be no longer than 250 words and usually do not

include descriptions of the statistical methods.

Reports of dose-finding and other early developmental trials should include information about study design, demographics, toxicity and

side effects, and recommendations for later trials. The objectives of safety and

efficacy studies (Phase II) are to demonstrate treatment feasibility, estimate

treatment success, estimate treatment complications, and facilitate informal

comparisons with other therapies that might motivate comparative trials.

With respect to the latter objective, the report should recognize the potential

for strong selection bias and avoid overly-enthusiastic statements about

relative efficacy. The reports for safety and efficacy studies should consist of

the following outline:

introduction

objectives

study design

study setting

demographics

treatments

outcome measures

statistical methods

results

The outline of reports for comparative efficacy trials (Phase III) is similar to

that for safety and efficacy trials. There are many additional issues, however,

to consider. For example, the reporting of treatment assignment

(randomization) and masking procedures is necessary to assure readers

about the internal validity of the trial. This is important because the readers

want to know how this was implemented. This is a mechanism that reviewers

will use to assess the validity of the study.

The motivation and assumptions for the target sample size should be

included, especially in situations where the primary results are negative

findings. The impact of various prognostic variables should be addressed with

appropriate statistical analyses to demonstrate that treatment effects are not

due entirely to them. Although the intent-to-treat principle should be followed

in randomized trials, it is helpful to report on the results of various exploratory

analyses as well.

Since the methods have direct implications on the validity of the results, top-

line journals require thorough descriptions. They also expect supplemental

reports. When the article is available online, there can be links to more

detailed descriptions, figures, graphs and tables.

In 2010 the CONSORT (Consolidated Standards of Reporting Trials) Group (http://www.consort-statement.org/ [10]) updated its guidelines for the reporting of clinical trials, which appeared in multiple journals and is available

online: Examine the CONSORT checklist [11] and Flow Diagram [12] in the

CONSORT statement. These standards have been adopted by major medical

journals.

Schulz KF, Altman DG, Moher D for the CONSORT Group. CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomized Trials [13]. Ann Intern Med June 1, 2010; 152:726-732.

There are several extensions of the original CONSORT statement, which can also be examined at the CONSORT website [14], focusing on reporting patient safety, equivalence and non-inferiority trials, cluster trials, and other topics.

Ioannidis JPA, Evans SJW, Gøtzsche PC, O'Neill RT, Altman DG, Schulz K, Moher D for the CONSORT Group. Better reporting of harms in randomized trials: An extension of the CONSORT statement. [15] Annals of Internal Medicine 2004; 141:781-788.

Piaggio,G., Elbourne, D., Altman, D., Pocock, S., Evans, S. for the CONSORT

Group. Reporting of Noninferiority and Equivalence Randomized Trials: An

Extension of the CONSORT statement [16]. JAMA 2006; 295: 1152-1160.

Conflict of Interest

Major medical journals require that manuscript authors report any financial

support for the research presented in the article, and complete a form

describing their conflicts of interest. The information on the financial support

and conflicts usually appears at the end of the article, prior to the references.

Example: In the VALIANT trial (NEJM 2003, in Wk 5 course material) the

authors state

(1) "Supported by a grant from Novartis Pharmaceuticals," and

(2) that some of them also received financial payments from Novartis for

serving as consultants, and some of them also have stock equity in Novartis.

Anyone who reads the article should attempt to examine the statements about

financial support and conflicts, in order to judge whether the article may

present a biased viewpoint.

The International Committee of Medical Journal Editors (ICMJE

http://www.icmje.org/ [17]) has developed a standardized form for authors to

provide information about their financial interests that could influence how

their work is viewed. The form is designed to be completed electronically and

stored electronically. It contains programming that allows appropriate data

display. Each author listed on the manuscript should submit a separate form

and is responsible for the accuracy and completeness of the information. The

disclosure form is a fillable pdf file

(http://www.icmje.org/coi_disclosure.pdf [18]). The complete list of journals that

require completion of the ICMJE form appears

at http://www.icmje.org/journals.html [19]

13.5 - Summary

In this lesson, among other things, we learned to:

recognize the issues surrounding publication and reporting in clinical research

identify the recommended elements of reports of randomized trials and reports of safety and efficacy studies

access a vast number of trials registered with ClinicalTrials.gov

Let's put what we have learned to use by completing the following homework

assignment:

Homework

Look for the homework assignment and the discussion in the folder for this week in ANGEL.

Introduction

A factor is a variable that the investigator deliberately varies during an experiment. In a chemistry experiment, temperature and pressure may be the factors that are deliberately changed over the course of the experiment. In a clinical trial, treatment can be a factor. A study of experimental therapy vs.

placebo can be thought of as having a treatment factor with 2 levels, 0 or the

study dosage. A study with two different treatments has the possibility of a

two-way design, varying the levels of treatment A and treatment B.

Factorial clinical trials are experiments that test the effect of more than one

treatment using a type of design that permits an assessment of potential

interactions among the treatments.

In a factorial design there are two or more factors with multiple levels that are

crossed, e.g., three dose levels of drug A and two levels of drug B can be

crossed to yield a total of six treatment combinations:

low dose of A with low dose of B

low dose of A with high dose of B

mid dose of A with low dose of B

mid dose of A with high dose of B

high dose of A with low dose of B

high dose of A with high dose of B

There are a number of ways that you could look at these groups. This lesson will consider these alternatives...

Learning objectives & outcomes

Upon completion of this lesson, you should be able to:

recognize the structure of factorial designs and the importance of treatment interactions

recognize the situation for which a min test is the appropriate analysis

Reference

Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

14.1 - Factorial Designs

The simplest factorial design is the 2 × 2 factorial with two levels of factor A crossed with two levels of factor B to yield four treatment combinations. A special case of the 2 × 2 factorial crosses a placebo and an active formulation of factor A with a placebo and an active formulation of factor B. This

yields the four treatment regimens:

Placebo A + Placebo B

Placebo A + Active B

Active A + Placebo B

Active A + Active B

For example, here you could have a placebo for each treatment. In one case

you might have a placebo injection for A and a placebo pill for B. Such a

design allows the comparison of the levels of factor A (A main effects), the

comparison of the levels of factor B (B main effects), and the investigation of A × B interactions.

Several conditions should be satisfied before investigators conduct a factorial trial.

First, the treatments must be amenable to being administered in combination

without changing dosage in the presence of each other treatment.

Second, it must be ethical not to administer the individual treatments (i.e., a placebo is ethical) or to administer them at lower doses if that will be required for the combination.

Third, there must be genuine interest in all of the treatment combinations required for the factorial design. Otherwise some of the treatment combinations are unnecessary, yet without them the advantages of the factorial design are diminished.

Finally, treatments that use different mechanisms of action are more suitable candidates for a factorial clinical trial.

14.2 - Interactions

Factorial designs provide the only way to study interactions between

treatment A and treatment B. This is because the design has treatment groups

with all possible combinations of treatments.

Concepts introduced in the earlier lesson on prognostic factors are relevant to the discussion of treatment A × treatment B interactions in this lesson. These concepts included:

(1) Interactions can exist or fail to exist depending on the selection of the statistical model (e.g., additive versus multiplicative).

(2) In the presence of interactions, it may not be possible to assess the main

effects because the effect of treatment A changes according to the level of

treatment B.

(3) Quantitative interactions refer to the situation in which the direction of the

main effects does not change although it could change in magnitude.

Qualitative interactions refer to the situation in which the direction of the main

effects does change.

The figure above indicates a quantitative interaction. The lines are not parallel

but they are not crossing either. The magnitude of the response is dependent

on whether treatment A is at a high or low dose. The greatest response is

achieved with both Treatment B and Treatment A at high dose. A greater

response is observed when Treatment B is at high dose than at mid or low

dose, regardless of the dose level of Treatment A, but how much greater is

dependent on the level of A. At the lowest dose of A, there is very little

difference in the response between the dose levels of B. This is called a

quantitative interaction.

The qualitative interaction occurring in the figure above will be difficult to

explain. The greatest response is achieved with the low dose of treatment B

and the high dose of treatment A. However, if a patient is on low dose of

treatment A, the greatest response will be achieved with the high dose of treatment B. Although difficult to sort out, this qualitative interaction is

intuitively reasonable for some drug combinations. There may be toxicity or a threshold effect that contributes to making the response greater with only one treatment at the highest dose.

14.3 - Combination Therapies

A special case of a partial factorial design that occasionally is used in clinical

research is the incomplete 2 × 2 factorial design with three treatment groups

consisting of drug A, drug B, and drug A in combination with drug B:

Placebo A + Active B

Active A + Placebo B

Active A + Active B

Notice that the Placebo A + Placebo B group is not included in the design,

hence the incompleteness. The incomplete factorial design has become

popular.

Why? If a pharmaceutical company can combine the active ingredients for treatment A and treatment B into one pill/tablet/capsule, more symptoms are relieved with one dose of medicine.

For example, combining an antihistamine with a decongestant for cold

symptoms produces a new cold remedy that will alleviate two major symptoms

with one capsule. Additionally, once the company has created the new combination product, the company applies for a new patent, extending the years of profitable returns from the research dollars expended to develop the initial products. Approval of a combination therapy, however, requires evidence demonstrating the superiority of the AB combination therapy to the A monotherapy and the B monotherapy. A logical experimental design to demonstrate these results would be the incomplete factorial.

Suppose that the response is continuous and that we want to compare the means $\mu_A$, $\mu_B$, and $\mu_{AB}$, which represent the population means for the A monotherapy, the B monotherapy, and the AB combination therapy, respectively. The research objective is to show the superiority of the combination therapy over the individual therapies.

A hypothesis testing format can be constructed as $H_0: \{\mu_A \ge \mu_{AB} \text{ or } \mu_B \ge \mu_{AB}\}$ versus $H_1: \{\mu_A < \mu_{AB} \text{ and } \mu_B < \mu_{AB}\}$.

Notice that the null hypothesis indicates that the AB combination therapy is

not better than at least one of the monotherapies, whereas the alternative

indicates that the AB combination is better than the A monotherapy and the B

monotherapy.

How do we do this?

The appropriate test statistic to use for this situation is called the min test. If

the data are normally distributed, construct two two-sample t statistics, one

comparing the AB combination therapy to the A monotherapy (call it tA) and

the other comparing the AB combination therapy to the B monotherapy (call

it tB).

$t_A = \dfrac{\bar{Y}_{AB} - \bar{Y}_A}{s\sqrt{\dfrac{1}{n_{AB}} + \dfrac{1}{n_A}}}, \qquad t_B = \dfrac{\bar{Y}_{AB} - \bar{Y}_B}{s\sqrt{\dfrac{1}{n_{AB}} + \dfrac{1}{n_B}}}$

where

$\bar{Y}_A = \dfrac{1}{n_A}\sum_{i=1}^{n_A} Y_{A,i}, \quad \bar{Y}_B = \dfrac{1}{n_B}\sum_{i=1}^{n_B} Y_{B,i}, \quad \bar{Y}_{AB} = \dfrac{1}{n_{AB}}\sum_{i=1}^{n_{AB}} Y_{AB,i}$

and

$s^2 = \dfrac{1}{n_A + n_B + n_{AB} - 3}\left(\sum_{i=1}^{n_A}\left(Y_{A,i} - \bar{Y}_A\right)^2 + \sum_{i=1}^{n_B}\left(Y_{B,i} - \bar{Y}_B\right)^2 + \sum_{i=1}^{n_{AB}}\left(Y_{AB,i} - \bar{Y}_{AB}\right)^2\right)$

The min test rejects the null hypothesis in favor of the alternative hypothesis when each of $t_A$ and $t_B$ is statistically significant at the $\alpha$ significance level. In other words, reject the null hypothesis if

$\min(t_A, t_B) > t_{n_A + n_B + n_{AB} - 3,\ 1-\alpha}$

For example, suppose that $\bar{Y}_A = 20$, $\bar{Y}_B = 21$, $\bar{Y}_{AB} = 24$, $n_A = n_B = n_{AB} = 50$, and $s = 10$.

Then $t_A$ = 2, $t_B$ = 1.5, and minimum($t_A$, $t_B$) = 1.5, which is not greater than $t_{147,\,0.95}$ = 1.66. Thus, the null hypothesis cannot be rejected at the 0.05

significance level, i.e., the AB combination is not significantly better than the A

monotherapy and the B monotherapy. It is close, but there clearly is not

enough statistical evidence to show a significant difference.
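The arithmetic of this example can be reproduced with a short SAS data step (a sketch; TINV returns the t-distribution quantile used as the critical value):

data mintest;
   ybarA = 20;  ybarB = 21;  ybarAB = 24;
   nA = 50;  nB = 50;  nAB = 50;  s = 10;
   tA = (ybarAB - ybarA) / (s * sqrt(1/nAB + 1/nA));   /* = 2.0 */
   tB = (ybarAB - ybarB) / (s * sqrt(1/nAB + 1/nB));   /* = 1.5 */
   tmin   = min(tA, tB);
   df     = nA + nB + nAB - 3;                         /* = 147 */
   tcrit  = tinv(0.95, df);                            /* about 1.655 */
   reject = (tmin > tcrit);                            /* 0 here: cannot reject H0 */
run;

proc print data=mintest;
run;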

14.4 - Summary

In this lesson, among other things, we learned to:

recognize the structure of factorial designs and the importance of treatment interactions

recognize the situation for which a min test is the appropriate analysis

Let's put what we have learned to use by completing the following homework

assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in

ANGEL.

Introduction

A crossover design is a repeated-measures design in which each experimental unit (patient) receives different treatments during the different

time periods, i.e., the patients cross over from one treatment to another during

the course of the trial. This is in contrast to a parallel design in which patients

are randomized to a treatment and remain on that treatment throughout the

duration of the trial.

The reason to consider a crossover design when planning a clinical trial is that

it could yield a more efficient comparison of treatments than a parallel design,

i.e., fewer patients might be required in the crossover design in order to attain

the same level of statistical power or precision as a parallel design.(This will

become more evident later in this lesson...) Intuitively, this seems reasonable

because each patient serves as his/her own matched control. Every patient

receives both treatment A and B. Crossover designs are popular in medicine,

agriculture, manufacturing, education, and many other disciplines. A

comparison is made of the subject's response on A vs. B.

Although crossover designs are appealing to biomedical investigators, they are not preferred

routinely because of the problems that are inherent with this design. In

medical clinical trials the disease should be chronic and stable, and

the treatments should not result in total cures but only alleviate the disease

condition. If treatment A cures the patient during the first period, then

treatment B will not have the opportunity to demonstrate its effectiveness

when the patient crosses over to treatment B in the second period. Therefore

this type of design works only for those conditions that are chronic, such as

asthma where there is no cure and the treatments attempt to improve quality

of life.

Crossover designs are the designs of choice for bioequivalence trials. The

objective of a bioequivalence trial is to determine whether test and reference

pharmaceutical formulations yield equivalent blood concentration levels. In

these types of trials, we are not interested in whether there is a cure; the goal is to demonstrate that a new formulation (for instance, a new generic drug)

results in the same concentration in the blood system. Thus, it is highly

desirable to administer both formulations to each subject, which translates into

a crossover design.

Learning objectives & outcomes

Upon completion of this lesson, you should be able to:

identify situations in which a crossover design would not be advantageous

define the terms sequence, period, washout, and aliased effect

state the disadvantages of a crossover study in terms of aliased effects

distinguish uniform, balanced, and strongly balanced crossover designs and state the implications of these characteristics

apply appropriate statistical analyses to crossover trials with continuous or binary data

select a crossover design appropriate for a particular study

distinguish between average and individual bioequivalence

relate these concepts to prescribability and switchability

Reference

Piantadosi Steven. (2005) Crossover Designs. In: Piantadosi Steven. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.

15.1 - Crossover Designs

The order of treatment administration in a crossover experiment is called a

sequence and the time of a treatment administration is called a period.

Typically, the treatments are designated with capital letters, such as A, B, etc.

The sequences should be determined a priori and the experimental units are

randomized to sequences. The most popular crossover design is the 2-

sequence, 2-period, 2-treatment crossover design, with sequences AB and

BA, sometimes called the 2 × 2 crossover design.

Experimental units that are randomized to the AB sequence receive treatment A in the first period and treatment B in the second

period, whereas experimental units that are randomized to the BA sequence

receive treatment B in the first period and treatment A in the second period.

Sequence AB A B

Sequence BA B A

Sequence ABB A B B

Sequence BAA B A A

and

Sequence AAB A A B

Sequence ABA A B A

Sequence BAA B A A

Sequence ABC A B C

Sequence BCA B C A

Sequence CAB C A B

and

Sequence ABC A B C

Sequence BCA B C A

Sequence CAB C A B

Sequence ACB A C B

Sequence BAC B A C

Sequence CBA C B A

Another possibility is Balaam's design, a 4-sequence, 2-period, 2-treatment crossover design:

Sequence AB A B

Sequence BA B A

Sequence AA A A

Sequence BB B B

Balaam's design is unusual, with elements of both parallel and crossover

design. There are advantages and disadvantages to all of these designs; we

will discuss some and the implications for statistical analysis as we continue

through this lesson.

15.2 - Disadvantages

The main disadvantage of a crossover design is that carryover effects may be

aliased (confounded) with direct treatment effects, in the sense that these

effects cannot be estimated separately. You think you are estimating the effect

of treatment A but there is also a bias from the previous treatment to account

for. Significant carryover effects can bias the interpretation of data analysis, so

an investigator should proceed cautiously whenever he/she is considering the

implementation of a crossover design.

A carryover effect is defined as the effect of the treatment from the previous

time period on the response at the current time period. In other words, if a

patient receives treatment A during the first period and treatment B during the

second period, then measurements taken during the second period could be a

result of the direct effect of treatment B administered during the second

period, and/or the carryover or residual effect of treatment A administered

during the first period. These carryover effects yield statistical bias.

Investigators often include washout periods in crossover designs to diminish the impact of carryover effects. A washout period is defined as the time between treatment periods. Instead of stopping one treatment and immediately starting the next, there is a period of time during which the drug from the first treatment period is washed out of the patient's system.

The rationale for this is that the previously administered treatment is washed

out of the patient and, therefore, it can not affect the measurements taken

during the current period. This may be true, but it is possible that the

previously administered treatment may have altered the patient in some

manner, so that the patient will react differently to any treatment administered

from that time onward. An example is when a pharmaceutical treatment

causes permanent liver damage so that the patients metabolize future drugs

differently. Another example occurs if the treatments are different types of

educational tests. Then subjects may be affected permanently by what they

learned during the first period.

How long should a washout period be? The length of the washout period usually is determined as some multiple of the half-life of the pharmaceutical product within the population of interest. For example, an investigator might

implement a washout period equivalent to 5 (or more) times the length of the

half-life of the drug concentration in the blood. The figure below depicts the

half-life of a hypothetical drug.

Actually, it is not the presence of carryover effects per se that leads to aliasing

with direct treatment effects in the AB|BA crossover, but rather the presence of

differential carryover effects, i.e., the carryover effect due to treatment A differs

from the carryover effect due to treatment B. If the carryover effects for A and

B are equivalent in the AB|BA crossover design, then this common carryover

effect is not aliased with the treatment difference. So, for crossover designs,

when the carryover effects are different from one another, this presents us

with a significant problem.

In the educational testing example above, differential carryover effects could occur if test A leads to more learning than test B. Another situation where

differential carryover effects may occur is in clinical trials where an active drug

(A) is compared to placebo (B) and the washout period is of inadequate

length. The patients in the AB sequence might experience a strong A

carryover during the second period, whereas the patients in the BA sequence

might experience a weak B carryover during the second period.

The investigator should seek to avoid the problems caused by differential carryover effects at all costs by employing lengthy washout periods and/or designs where treatment and carryover are not aliased or confounded with each other. It is always much more prudent to address a

problem a priori by using a proper design rather than a posteriori by applying

a statistical analysis that may require unreasonable assumptions and/or

perform unsatisfactorily. You will see this later on in this lesson...

For example, one approach for the statistical analysis of the 2 × 2 crossover is

to conduct a preliminary test for differential carryover effects. If this is

significant, then only the data from the first period are analyzed because the

first period is free of carryover effects. Essentially you would be throwing out half of your data!

If the preliminary test for differential carryover is not significant, then the data

from both periods are analyzed in the usual manner. Recent work, however,

has revealed that this 2-stage analysis performs poorly because the

unconditional Type I error rate operates at a much higher level than desired.

We won't go into the specific details here, but part of the reason for this is that

the test for differential carryover and the test for treatment differences in the

first period are highly correlated and do not act independently.

Even worse, this two-stage approach could lead to losing one-half of the data.

If differential carryover effects are of concern, then a better approach would be

to use a study design that can account for them.

Before we can discuss differential carryover and its implications for design, we require more definitions.

15.3 - Definitions and Properties of a Crossover Design

First-order and Higher-order Carryover Effects

Within time period j, j = 2, ... , p, it is possible that there are carryover effects

from treatments administered during periods 1, ... , j - 1. Usually in period j we

only consider first-order carryover effects (from period j - 1) because:

(1) higher-order carryover effects (from periods prior to j - 1) usually are negligible; and

(2) designs that account for both higher-order carryover effects and treatment effects are very cumbersome and not practical.

Therefore, we usually assume that these higher-order carryover effects are negligible.

In actuality, the length of the washout periods between treatment

administrations may be the determining factor as to whether higher-order

carryover effects should be considered. We focus on designs for dealing with

first-order carryover effects, but the development can be generalized if higher-

order carryover effects need to be considered. We will focus on the following properties:

Uniformity

A crossover design is uniform within sequences if each treatment appears the same number of times within each sequence, and uniform within periods if each treatment appears the same number of times within each period.

For example, AB/BA is uniform within sequences and period (each sequence

and each period has 1 A and 1 B) while ABA/BAB is uniform within period but

is not uniform within sequence because the sequences differ in the numbers

of A and B.

A design that is uniform within both sequences and periods is said to be uniform. If the design is uniform across periods you will be able to

remove the period effects. If the design is uniform across sequences then you

will be also be able to remove the sequence effects. An example of a uniform

crossover is ABC/BCA/CAB.

Latin Squares

Latin squares are a natural starting point for r-period, r-treatment crossover designs because they yield uniform crossover designs in

that each treatment occurs only once within each sequence and once within

each period. As will be demonstrated later, Latin squares also serve as

building blocks for other types of crossover designs. Latin squares for 4-

period, 4-treatment crossover designs are:

[Design 7]

Sequence ABCD A B C D

Sequence BCDA B C D A

Sequence CDAB C D A B

Sequence DABC D A B C

and

[Design 8]

Sequence ABCD A B C D

Sequence BDAC B D A C

Sequence CADB C A D B

Sequence DCBA D C B A

Latin squares are uniform crossover designs, uniform both within periods and

within sequences. Although with 4 periods and 4 treatments there are 4! = (4)

(3)(2)(1) = 24 possible sequences from which to choose, the Latin square only

requires 4 sequences.

Balanced Designs

The Latin square in [Design 8] has an additional property that the Latin square

in [Design 7] does not have. Each treatment precedes every other treatment

the same number of times (once). For example, how many times is treatment

A followed by treatment B? Only once. How many times do you have one

treatment B followed by a second treatment? Only once. This is an

advantageous property for Design 8. This same property does not occur in

[Design 7]. When this occurs, as in [Design 8], the crossover design is said to

be balanced with respect to first-order carryover effects.


Look back through each of the designs that we have looked at thus far and

determine whether or not it is balanced with respect to first-order carryover

effects.

A single Latin square can provide a balanced design when r is an even number in the r-period, r-treatment crossover. When r is an odd number, 2 Latin

squares are required. For example, the design in [Design 5] is a 6-sequence,

3-period, 3-treatment crossover design that is balanced with respect to first-

order carryover effects because each treatment precedes every other

treatment twice.

A crossover design is strongly balanced with respect to first-order carryover effects if each treatment precedes every other treatment, including itself, the same number of times. A strongly balanced design can be constructed by repeating the last period in a balanced design.

Here is an example:

[Design 9]

Sequence ABCDD A B C D D

Sequence BDACC B D A C C

Sequence CADBB C A D B B

Sequence DCBAA D C B A A

This design is strongly balanced with respect to first-order carryover effects because each treatment precedes every other treatment, including itself, once. Obviously, the uniformity of the Latin square design disappears because the design in [Design 9] is no longer uniform within sequences.

Latin squares yield uniform crossover designs, but strongly balanced designs

constructed by replicating the last period of a balanced design are not uniform

crossover designs. The following 4-sequence, 4-period, 2-treatment crossover

design is an example of a strongly balanced and uniform design.

Sequence ABBA:  A  B  B  A
Sequence BAAB:  B  A  A  B
Sequence AABB:  A  A  B  B
Sequence BBAA:  B  B  A  A
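These properties are easy to check mechanically. Below is a minimal Python sketch (an illustrative aside, not part of the lesson's SAS materials) that counts treatment occurrences and first-order predecessor pairs to decide whether a set of sequences is uniform within sequences, uniform within periods, balanced, and strongly balanced; the function name and layout are choices made here for illustration.

from collections import Counter
from itertools import product

def crossover_properties(sequences):
    """Check uniformity and first-order carryover balance of a crossover design.

    `sequences` is a list of strings, e.g. ["ABBA", "BAAB", "AABB", "BBAA"],
    one string per sequence, one character per period."""
    treatments = sorted(set("".join(sequences)))
    n_periods = len(sequences[0])

    # Uniform within sequences: each treatment appears equally often in every sequence.
    uniform_seq = all(
        len(set(Counter(seq)[t] for t in treatments)) == 1 for seq in sequences
    )
    # Uniform within periods: each treatment appears equally often in every period.
    uniform_per = all(
        len(set(Counter(seq[j] for seq in sequences)[t] for t in treatments)) == 1
        for j in range(n_periods)
    )
    # Count how often each treatment is immediately preceded by each treatment.
    precede = Counter()
    for seq in sequences:
        for j in range(1, n_periods):
            precede[(seq[j - 1], seq[j])] += 1
    balanced = len(set(precede[(a, b)] for a in treatments for b in treatments if a != b)) == 1
    strongly_balanced = len(set(precede[pair] for pair in product(treatments, repeat=2))) == 1
    return uniform_seq, uniform_per, balanced, strongly_balanced

# The strongly balanced, uniform 4-sequence, 4-period, 2-treatment design above:
print(crossover_properties(["ABBA", "BAAB", "AABB", "BBAA"]))  # (True, True, True, True)
# The 2 x 2 crossover AB|BA: uniform and balanced, but not strongly balanced.
print(crossover_properties(["AB", "BA"]))                      # (True, True, True, False)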

15.4 - Statistical Bias

Why are these properties important in statistical analysis?

Does the crossover design have any nuisance effects, such as sequence, period, or

first-order carryover effects, aliased with direct treatment effects? We consider

first-order carryover effects only. If the design incorporates washout periods of

inadequate length, then treatment effects could be aliased with higher-order

carryover effects as well, but let us assume the washout period was adequate

for eliminating carryover beyond 1 treatment period.

The approach is very simple in that the expected value of each cell in the

crossover design is expressed in terms of a direct treatment effect and the

assumed nuisance effects. Then these expected values are averaged and/or

differenced to construct the desired effects.

If we include nuisance effects for sequence, period, and first-order carryover, then the model for the expected response in each cell of the AB|BA design would look like:

Sequence AB:  Period 1: \mu_A + \nu + \rho    Period 2: \mu_B + \nu - \rho + \lambda_A
Sequence BA:  Period 1: \mu_B - \nu + \rho    Period 2: \mu_A - \nu - \rho + \lambda_B

Here \mu_A and \mu_B represent the population means for treatments A and B, respectively, \nu represents a sequence effect, \rho represents a period effect, and \lambda_A and \lambda_B represent first-order carryover effects of treatments A and B, respectively.

A natural choice of an estimate of \mu_A (or \mu_B) is simply the average over all cells where treatment A (or B) is assigned: [12]

\hat{\mu}_A = \tfrac{1}{2}(\bar{Y}_{AB,1} + \bar{Y}_{BA,2}) \quad \text{and} \quad \hat{\mu}_B = \tfrac{1}{2}(\bar{Y}_{AB,2} + \bar{Y}_{BA,1})

Will this give us a good estimate of the means across the treatment? Not

quite...

The mathematical expectations of these estimates are as follows: [13]

E(\hat{\mu}_A) = \tfrac{1}{2}\{(\mu_A + \nu + \rho) + (\mu_A - \nu - \rho + \lambda_B)\} = \mu_A + \tfrac{1}{2}\lambda_B

E(\hat{\mu}_B) = \tfrac{1}{2}\{(\mu_B + \nu - \rho + \lambda_A) + (\mu_B - \nu + \rho)\} = \mu_B + \tfrac{1}{2}\lambda_A

E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \tfrac{1}{2}(\lambda_A - \lambda_B)

From [13] it is observed that the direct treatment effects and the

treatment difference are not aliased with sequence or period effects, but are

aliased with the carryover effects.

The treatment difference, however, is not aliased with carryover effects when the carryover effects are equal, i.e., \lambda_A = \lambda_B. The results in [13] are due to the

fact that the AB|BA crossover design is uniform and balanced with respect to

first-order carryover effects. Any crossover design which is uniform and

balanced with respect to first-order carryover effects, such as the designs in

[Design 5] and [Design 8], also exhibits these results.
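The expectations in [13] can also be verified symbolically. The following sketch uses the sympy package (an assumption of this illustration; the lesson itself does not use it) to reproduce the aliasing of the naive treatment-mean estimates with the carryover effects in the AB|BA design.

import sympy as sp

muA, muB, nu, rho, lamA, lamB = sp.symbols('mu_A mu_B nu rho lambda_A lambda_B')

# Expected cell means under the model: sequence AB (periods 1, 2) and sequence BA.
E_AB1 = muA + nu + rho
E_AB2 = muB + nu - rho + lamA
E_BA1 = muB - nu + rho
E_BA2 = muA - nu - rho + lamB

# Naive treatment-mean estimates: average the cells where each treatment appears.
E_muA_hat = sp.Rational(1, 2) * (E_AB1 + E_BA2)
E_muB_hat = sp.Rational(1, 2) * (E_AB2 + E_BA1)

print(sp.simplify(E_muA_hat))              # mu_A + lambda_B/2
print(sp.simplify(E_muB_hat))              # mu_B + lambda_A/2
print(sp.simplify(E_muA_hat - E_muB_hat))  # mu_A - mu_B - lambda_A/2 + lambda_B/2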

Example

Consider the ABB|BAA design, which is uniform within periods, not uniform within sequences, and is strongly balanced.

Sequence ABB:  Period 1: \mu_A + \nu + \rho_1    Period 2: \mu_B + \nu + \rho_2 + \lambda_A    Period 3: \mu_B + \nu - \rho_1 - \rho_2 + \lambda_B
Sequence BAA:  Period 1: \mu_B - \nu + \rho_1    Period 2: \mu_A - \nu + \rho_2 + \lambda_B    Period 3: \mu_A - \nu - \rho_1 - \rho_2 + \lambda_A

A natural choice of an estimate of \mu_A (or \mu_B) is simply the average over all cells where treatment A (or B) is assigned: [15]

\hat{\mu}_A = \tfrac{1}{3}(\bar{Y}_{ABB,1} + \bar{Y}_{BAA,2} + \bar{Y}_{BAA,3}) \quad \text{and} \quad \hat{\mu}_B = \tfrac{1}{3}(\bar{Y}_{ABB,2} + \bar{Y}_{ABB,3} + \bar{Y}_{BAA,1})

The mathematical expectations of these estimates are as follows: [16]

E(\hat{\mu}_A) = \mu_A - \tfrac{1}{3}\nu + \tfrac{1}{3}(\lambda_A + \lambda_B)

E(\hat{\mu}_B) = \mu_B + \tfrac{1}{3}\nu + \tfrac{1}{3}(\lambda_A + \lambda_B)

E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \tfrac{2}{3}\nu

From [16], the direct treatment effects are aliased with the sequence effect

and the carryover effects, whereas the treatment difference only is aliased

with the sequence effect. The results in [16] are due to the ABB|BAA

crossover design being uniform within periods and strongly balanced with

respect to first-order carryover effects.

The lack of aliasing between the treatment difference and the first-order

carryover effects does not guarantee that the treatment difference and higher-

order carryover effects also will not be aliased or confounded. For example, let

\lambda_{2A} and \lambda_{2B} denote the second-order carryover effects of treatments A and B, respectively, for the design in [Design 2] (second-order carryover effects reflect the carryover of the treatment administered two periods earlier):

Sequence ABB:  Period 1: \mu_A + \nu + \rho_1    Period 2: \mu_B + \nu + \rho_2 + \lambda_A    Period 3: \mu_B + \nu - \rho_1 - \rho_2 + \lambda_B + \lambda_{2A}
Sequence BAA:  Period 1: \mu_B - \nu + \rho_1    Period 2: \mu_A - \nu + \rho_2 + \lambda_B    Period 3: \mu_A - \nu - \rho_1 - \rho_2 + \lambda_A + \lambda_{2B}

[18]  E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \tfrac{2}{3}\nu - \tfrac{1}{3}(\lambda_{2A} - \lambda_{2B})

Thus, the treatment difference is aliased with second-order carryover effects.

The ensuing remarks summarize the impact of various design features on the

aliasing of direct treatment and nuisance effects.

1. If the crossover design is uniform within sequences, then sequence effects are not aliased with treatment differences.

2. If the crossover design is uniform within periods, then period effects are

not aliased with treatment differences.

3. If the crossover design is balanced with respect to first-order carryover effects, then carryover effects are aliased with treatment differences. If the carryover effects are equal, however, then carryover effects are not aliased with treatment differences.

4. If the crossover design is strongly balanced with respect to first-order carryover effects, then carryover effects are not aliased with treatment differences.

Complex Carryover

The carryover effects considered so far are called simple carryover effects because it is assumed that the treatment in the current period does not

interact with the carryover from the previous period. Complex carryover

refers to the situation in which such an interaction is modeled. For example,

suppose we have a crossover design and want to model carryover effects.

With simple carryover in a two-treatment design, there are two carryover parameters, namely, \lambda_A and \lambda_B. With complex carryover, however, there are four carryover parameters, namely, \lambda_{AB}, \lambda_{BA}, \lambda_{AA} and \lambda_{BB}, where \lambda_{AB} represents the carryover effect of treatment A into a period in which treatment B is administered, \lambda_{BA} represents the carryover effect of treatment B into a period in which treatment A is administered, etc. As you might imagine, this will certainly complicate things!

Obviously, it appears that an ideal crossover design is uniform and strongly

balanced.

There are situations, however, where it may be reasonable to assume that some of the nuisance parameters are null, so that resorting to a uniform and

strongly balanced design is not necessary (although it provides a safety net if

the assumptions do not hold).

For example, some researchers argue that sequence effects should be null or

negligible because they represent randomization effects. Another example

occurs in bioequivalence trials where some researchers argue that carryover

effects should be null. This is because blood concentration levels of the drug

or active ingredient are monitored and any residual drug administered from an

earlier period would be detected.

Prior to designing the trial, the particular application should be examined to determine which, if any, nuisance effects may play a role.

Once this determination is made, then an appropriate crossover design should

be employed that avoids aliasing of those nuisance effects with treatment

effects. This is a decision that the researchers should be prepared to address.

Suppose that an investigator wants to conduct a two-treatment crossover trial but is concerned that he will have unequal carryover effects, so he is reluctant to invoke the 2 × 2 crossover design. If the investigator is not as concerned about sequence effects, then Balaam's design in [Design 6] may be appropriate. Balaam's design is uniform within periods but not within sequences, and it is strongly balanced. Therefore, Balaam's design will not be adversely affected in the presence of unequal carryover effects.

A sequence effect usually is not a major issue because a patient eventually undergoes all of the treatments (this is true

in most crossover designs). Obviously, randomization is very important if the

crossover design is not uniform within sequences because the underlying

assumption is that the sequence effect is negligible.

Randomization is important in crossover trials even if the design is uniform

within sequences because biases could result from investigators assigning

patients to treatment sequences.

It also is advisable to use a design that is uniform within periods because period effects are common. Period effects can arise, for example, from changes over time in the patients, the study personnel, or the equipment used to take the measurements.

The following is a listing of various crossover designs with some, all, or none

of the properties.

It would be a good idea to go through each of these designs and diagram out

what these would look like, the degree to which they are uniform and/or

balanced. Make sure you see how these principles come into play!

Now that we have examined statistical biases that can arise in crossover

designs, we next examine statistical precision.

During the design phase of a trial, the question may arise as to which

crossover design provides the best precision. For our purposes, we label one

design as more precise than another if it yields a smaller variance for the

estimated treatment mean difference.

Although this criterion is a reasonable one for the experimenter, there may be other circumstances that affect the choice of an

appropriate design. For example, later we will compare designs with respect

to which designs are best for estimating and comparing variances.

First, we consider the estimation and comparison of treatment means in two-period, two-treatment designs. The designs of interest are the 2 × 2 crossover design AB|BA in [Design 1], Balaam's design AB|BA|AA|BB in [Design 6], and the two-period parallel design AA|BB.

In order for the resources to be equitable across designs, we assume that the

total sample size, n, is a positive integer divisible by 4. Then:

1. n/2 patients will be randomized to each sequence in the AB|BA design and in the AA|BB parallel design, and

2. n/4 patients will be randomized to each sequence in Balaam's design.

Because a crossover design involves repeated measurements on the same patients, the statistical modeling must account for between-patient variability and within-patient variability.

Between-patient variability accounts for the dispersion in measurements from one patient to another. Within-patient variability accounts for the dispersion in

measurements from one time point to another within a patient. Within-patient

variability tends to be smaller than between-patient variability.

The following table provides expressions for the variance of the estimated

treatment mean difference for each of the two-period, two-treatment designs:

[Table: variance of the estimated treatment mean difference for the 2 × 2 crossover, Balaam's design, and the parallel design.]

Each variance is expressed in terms of the same total sample size n for the sake of comparison. Not surprisingly, the 2 × 2 crossover design yields the smallest variance for the estimated treatment mean difference, followed by Balaam's design and then the parallel design.

Caution is warranted before selecting the 2 × 2 crossover. In particular, if there is any concern over the possibility of differential first-order carryover effects, then the 2 × 2 crossover is not recommended. In this situation the parallel design would be a better choice than the 2 × 2 crossover design. Balaam's design is strongly balanced, so that the treatment difference is not aliased with differential first-order carryover effects, so it also is a better choice than the 2 × 2 crossover design.

With respect to a sample size calculation, the total sample size, n, required for a two-sided, \alpha significance level test with 100(1 - \beta)% statistical power and effect size \mu_A - \mu_B is:

n = (z_{1-\alpha/2} + z_{1-\beta})^2 \, \sigma^2 / (\mu_A - \mu_B)^2

where \sigma^2 denotes the appropriate design-specific variance expression.
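As a rough illustration of this formula, here is a small Python helper. The function name and the example variance value passed to it are assumptions for demonstration only; the design-specific variance expressions themselves come from the table discussed above.

from scipy.stats import norm

def total_sample_size(delta, sigma2_design, alpha=0.05, power=0.90):
    """n = (z_{1-alpha/2} + z_{1-beta})^2 * sigma2_design / delta^2.

    sigma2_design is the design-specific variance expression for the
    estimated treatment mean difference (on the "per total sample size n"
    scale) and delta is the effect size mu_A - mu_B.  Round up in practice."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return (z ** 2) * sigma2_design / delta ** 2

# Hypothetical illustration -- the variance value below is an assumption for
# demonstration, not the lesson's tabled expression:
print(total_sample_size(delta=10, sigma2_design=200))  # roughly 21 patients in total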

Suppose an investigator is planning a two-treatment trial but is not sure whether to invoke a parallel design, a crossover design, or Balaam's design. He wants to use a 0.05 significance level test with 90% statistical power for detecting an effect size of \mu_A - \mu_B = 10. From published results, the investigator assumes that the within-patient variances are \sigma_{AA}^2 = \sigma_{BB}^2 = 100.

The sample sizes for the three different designs are as follows:

Parallel n = 190

Balaam n = 105

Crossover n = 21

The crossover design yields a much smaller sample size because the within-

patient variances are one-fourth that of the inter-patient variances (which is

not unusual).

Now suppose the objective is to compare the within-patient variances \sigma_{AA}^2 and \sigma_{BB}^2. The 2 × 2 crossover design can yield estimates of these variances only by imposing restrictions on the between-patient variances and covariances. The resultant estimators of \sigma_{AA}^2 and \sigma_{BB}^2, however, may lack precision and be unstable. Hence, the 2 × 2 crossover design is not recommended when comparing \sigma_{AA}^2 and \sigma_{BB}^2 is an objective.

The two-period parallel design AA|BB is better suited for this objective because it has n/2 patients who can provide data in estimating each of \sigma_{AA}^2 and \sigma_{BB}^2, whereas Balaam's design has n/4 patients who can provide data in estimating each of \sigma_{AA}^2 and \sigma_{BB}^2. Again, Balaam's design is a compromise between the 2 × 2 crossover design and the parallel design.

The statistical analysis of normally-distributed data from a 2 × 2 crossover trial, under the assumption that the carryover effects are equal (\lambda_A = \lambda_B = \lambda), is relatively straightforward.

Remember the statistical model we assumed for continuous data from the 2 × 2 crossover trial:

Sequence AB:  Period 1: \mu_A + \nu + \rho    Period 2: \mu_B + \nu - \rho + \lambda_A
Sequence BA:  Period 1: \mu_B - \nu + \rho    Period 2: \mu_A - \nu - \rho + \lambda_B

For a patient in the AB sequence, the Period 1 vs. Period 2 difference has expectation \theta_{AB} = \mu_A - \mu_B + 2\rho - \lambda.

For a patient in the BA sequence, the Period 1 vs. Period 2 difference has expectation \theta_{BA} = \mu_B - \mu_A + 2\rho - \lambda.

Therefore, we construct these differences for every patient and compare the two sequences with respect to these differences using a two-sample t test or a Wilcoxon rank sum test. Thus, we are testing:

H0 : \theta_{AB} - \theta_{BA} = 0

The expression

\theta_{AB} - \theta_{BA} = 2(\mu_A - \mu_B)

shows that this is equivalent to testing H0 : \mu_A - \mu_B = 0, and that the estimated difference in period differences (and its confidence limits) must be halved prior to constructing the confidence interval for the difference in population means for two independent samples.
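To make the procedure concrete, here is a hedged Python sketch of the period-difference analysis. The data values are invented for illustration (they are not the Senn data analyzed below), and the course's own analyses use SAS rather than Python.

import numpy as np
from scipy import stats

# Hypothetical per-patient outcomes (period 1, period 2).
seq_AB = np.array([[310, 270], [295, 260], [330, 300], [285, 250]])  # received A then B
seq_BA = np.array([[260, 305], [250, 290], [275, 320], [240, 280]])  # received B then A

d_AB = seq_AB[:, 0] - seq_AB[:, 1]   # period 1 - period 2 differences, sequence AB
d_BA = seq_BA[:, 0] - seq_BA[:, 1]   # period 1 - period 2 differences, sequence BA

# Two-sample t test of H0: theta_AB - theta_BA = 0 (equivalently mu_A = mu_B).
t_stat, p_t = stats.ttest_ind(d_AB, d_BA)
# Nonparametric alternative: Wilcoxon rank sum (Mann-Whitney) test.
u_stat, p_w = stats.mannwhitneyu(d_AB, d_BA, alternative="two-sided")

# The estimated treatment mean difference is half the difference in mean period differences.
est_diff = 0.5 * (d_AB.mean() - d_BA.mean())
print(est_diff, p_t, p_w)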

The following example is taken from Example 3.1 of Senn's book (Senn S. Cross-over Trials in Clinical Research. Chichester, England: John Wiley & Sons, 1993).

The data set consists of 13 children enrolled in a trial to investigate the effects

of two bronchodilators, formoterol and salbutamol, in the treatment of asthma.

The outcome variable is peak expiratory flow rate (liters per minute) and was

measured eight hours after treatment. There was a one-day washout period

between treatment periods.

The estimated treatment mean difference was 46.6 L/min in favor of

formoterol (p = 0.0012) and the 95% confidence interval for the treatment

mean difference is (22.9, 70.3). The Wilcoxon rank sum test also indicated statistical significance between the treatment groups (p = 0.0276).

Suppose that the response from a crossover trial is binary and that there are

no period effects. Then the probability of success on treatment A is denoted p_{1.} and the probability of success on treatment B is denoted p_{.1}. We are interested in testing the null hypothesis:

H0 : p_{1.} - p_{.1} = 0

This indicates that only the patients who display a (1,0) or (0,1) response

contribute to the treatment comparison. For instance, if they failed on both, or

were successful on both, there is no way to determine which treatment is

better. Therefore we will let n_{10} denote the number of patients who were successes on treatment A and failures on treatment B, and n_{01} the number who were failures on A and successes on B; these frequencies from the study data take the place of the probabilities listed above.

McNemar's test for this situation is as follows. Given the number of patients

who displayed a treatment preference, n10 + n01 , then n10 follows a

binomial(p, n10 + n01) distribution and the null hypothesis reduces to testing:

H0 : p = 0.5

i.e., we would expect a 50-50 split in the number of patients that would be

successful with either treatment in support of the null hypothesis, looking at

only the cells where there was success with one treatment and failure with the

other. The data in cells for both success or failure with both treatment would

be ignored.

Consider a 2 × 2 crossover trial with a binary outcome of failure/success. Fifty patients were randomized and the following results were observed:

                Failure on B    Success on B
Failure on A         21              15
Success on A          7               7

Seven patients preferred treatment A and 15 preferred treatment B. McNemar's test, however, indicated that this difference was not statistically significant (exact p = 0.1338).
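The exact McNemar calculation can be reproduced with a binomial test on the discordant counts. The following sketch (a Python stand-in, not the course's SAS program) uses scipy and the counts read from the table above.

from scipy.stats import binomtest

# Discordant counts from the table above: 7 patients preferred A
# (success on A, failure on B) and 15 preferred B (failure on A, success on B).
n_prefer_A, n_prefer_B = 7, 15

# McNemar's exact test: under H0, preferences split 50-50 among the
# n_prefer_A + n_prefer_B discordant patients.
result = binomtest(n_prefer_A, n_prefer_A + n_prefer_B, p=0.5)
print(result.pvalue)   # about 0.134, matching the exact p = 0.1338 quoted above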

A problem with the application of McNemar's test to the binary outcome from a 2 × 2 crossover trial can occur if there are non-negligible period effects. If that is the case, then the treatment comparison should account for

this. This is possible via logistic regression analysis.

The Rationale:

A 50-50 split in treatment preferences under the null hypothesis is equivalent to the odds ratio for the

treatment A preference to the treatment B preference being 1.0. Because

logistic regression analysis models the natural logarithm of the odds, testing

whether there is a 50-50 split between treatment A preference and treatment

B preference is comparable to testing whether the intercept term is null in a

logistic regression analysis.

To account for the possible period effect in the 2 2 crossover trial, a term for

period can be included in the logistic regression analysis.

Use the same data set from SAS Example 16.2 only now it is partitioned as to

patients within the two sequences:

Patients in one sequence:
                Failure on B    Success on B
Failure on A         10               7
Success on A          3               5

Patients in the other sequence:
                Failure on B    Success on B
Failure on A         11               8
Success on A          4               2

The logistic regression analysis yielded a nonsignificant result for the

treatment comparison (exact p = 0.2266). There is still no significant statistical

difference to report.

Time-to-Event Outcome

You don't often see a cross-over design used in a time-to-event trial. If the

event is death, the patient would not be able to cross-over to a second

treatment. Even when the event is treatment failure, this often implies that

patients must be watched closely and perhaps rescued with other medicines

when event failure occurs.

The time-to-event outcome from a crossover trial actually can reduce to a binary outcome score of preference.

Suppose that in a clinical trial, time to treatment failure is determined for each

patient when receiving treatment A and treatment B.

If the patient experiences treatment failure on both treatments at the same time, then the patient is assigned a (0,0) score and displays no preference.

If the time to treatment failure on A is less than that on B, then the patient is

assigned a (0,1) score and prefers B.

If the time to treatment failure on B is less than that on A, then the patient is

assigned a (1,0) score and prefers A.

If the patient does not experience treatment failure on either treatment, then

the patient is assigned a (1,1) score and displays no preference.

McNemar's test then can be applied to these binary preference outcomes.

The analysis of continuous, binary, and time-to-event outcome data from a design more complex than the 2 × 2 crossover is not as straightforward as that for the 2 × 2 crossover design.

For continuous data, one approach is to fit a mixed-effects linear model (SAS PROC MIXED) to account for the repeated measurements that yield period, sequence, and carryover effects and to model the various sources of intra-patient and inter-patient variability.

For binary data, one approach is to use generalized estimating equations (SAS PROC GENMOD) to account for the repeated measurements that yield period, sequence, and carryover effects and to model the various sources of intra-patient and inter-patient variability.

In either case, with a design more complex than the 2 × 2 crossover, extensive modeling is required.

The objective of a bioequivalence trial is to determine whether test (T) and

reference (R) formulations of a pharmaceutical product are "equivalent" with

respect to blood concentration time profiles.

One scenario is that a company already has an approved formulation but wishes to market a more convenient formulation (e.g., an injection vs. a time-release capsule). This situation is less common.

The more common scenario is that Company B wishes to market a generic version of a formulation of Company A with an expired patent. Company B has to prove that it can deliver the same amount of active drug into the blood stream as the approved formulation does.

Pharmaceutical scientists use crossover designs for such trials in order for

each trial participant to yield a profile for both formulations. The blood

concentration time profile is a multivariate response and is a surrogate

measure of therapeutic response. The pharmaceutical company does not

need to demonstrate the safety and efficacy of the drug because that already

has been established.

Are the reference and test blood concentration time profiles similar? The

test formulation could be toxic if it yields concentration levels higher than the

reference formulation. On the other hand, the test formulation could be

ineffective if it yields concentration levels lower than the reference formulation.

Typically, pharmaceutical scientists summarize the rate and extent of drug

absorption with summary measurements of the blood concentration time

profile, such as area under the curve (AUC), maximum concentration (CMAX),

etc. These summary measurements are subjected to statistical analysis (not

the profiles) and inferences are drawn as to whether or not the formulations

are bioequivalent.

Population bioequivalence compares the test and reference formulations with respect to their underlying probability distributions; you want to see that the AUC or CMAX distributions would be similar.

Average bioequivalence compares the test and reference formulations with respect to the means (medians) of their probability distributions.

Individual bioequivalence compares the formulations within individuals for a large proportion of individuals in the population, i.e., how well do the AUCs and CMAX values compare across patients?

Prescribability means that a patient is starting a drug regimen for the first time, so that either the reference or test formulations can

be chosen. Switchability means that a patient, who already has established a

regimen on either the reference or test formulation, can switch to the other

formulation without any noticeable change in efficacy and safety.

Prescribability requires that the test and reference formulations are population

bioequivalent, whereas switchability requires that the test and reference

formulations have individual bioequivalence.

Regulatory approval, though, typically requires only a demonstration that the test and reference formulations are average bioequivalent. It is felt

that most consumers, however, assume bioequivalence refers to individual

bioequivalence, and that switching formulations does not lead to any health

problems.

The criterion for average bioequivalence typically is stated as:

\psi_1 \le \mu_T / \mu_R \le \psi_2

where \mu_T and \mu_R represent the population means for the test and reference formulations, respectively, and \psi_1 and \psi_2 are chosen constants.

The FDA recommended values are \psi_1 = 0.80 and \psi_2 = 1.25 (i.e., the ratios 4/5 and 5/4), for responses such as AUC and CMAX, which typically follow lognormal distributions.

After logarithmic transformation of each summary measure, the statistical analysis is performed for the crossover experiment, and then the two one-sided testing approach or corresponding confidence intervals are calculated for the purposes of investigating average bioequivalence.
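A rough sketch of the average-bioequivalence calculation is shown below. The subject-level log(AUC) values are invented for illustration, the paired-difference analysis ignores period effects for simplicity, and the 90% interval shown corresponds to the usual two one-sided tests (the example later in this section reports 95% intervals); this is not the lesson's SAS analysis.

import numpy as np
from scipy import stats

# Hypothetical log(AUC) values for each subject under test (T) and reference (R).
log_auc_T = np.log(np.array([95, 102, 88, 110, 97, 105, 93, 101]))
log_auc_R = np.log(np.array([100, 98, 90, 115, 95, 100, 96, 99]))

d = log_auc_T - log_auc_R                     # within-subject log differences
n = len(d)
se = d.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.95, df=n - 1)          # 90% two-sided CI (two one-sided 5% tests)
ci_log = (d.mean() - t_crit * se, d.mean() + t_crit * se)
ci_ratio = tuple(np.exp(ci_log))              # back-transform to the ratio scale

bioequivalent = 0.80 < ci_ratio[0] and ci_ratio[1] < 1.25
print(ci_ratio, bioequivalent)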

A bioequivalence trial of a test versus reference formulation used a 2 × 2 crossover design. There were 28 healthy volunteers, (instead of

patients with disease), who were randomized (14 each to the TR and RT

sequences). AUC and CMAX were measured and transformed via the natural

logarithm.

The analysis yielded the following results:

                                 AUC       CMAX
Estimated ln(T) - ln(R)        0.0893     -0.104
Estimated ratio T/R              1.09       0.90

Neither 95% confidence interval lies within (0.80, 1.25) specified by the

USFDA, therefore bioequivalence cannot be concluded in this example and

the USFDA would not allow this company to market their generic drug. Both

CMAX and AUC are used because they summarize the desired equivalence.

15.13 - Summary

In this lesson, among other things, we learned:

would not be advantageous.

period, washout, aliased effect.

a crossover study in terms of aliased effects.

the implications of these characteristics.

Understand and modify SAS programs for analysis of data from 2x2

crossover trials with continuous or binary data.

study.

and individual bioequivalence.

switchability.

Let's put what we have learned to use by completing the following homework

assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in

ANGEL.

Lesson 16: Overviews and Meta-analysis

Introduction

An overview, or systematic review, is a synthesis of the available scientific evidence related to treatment, causation, diagnosis, or prognosis of

a specific disease. An overview does not generate any new data - it reviews

and summarizes already-existing studies.

Overviews, which are relied upon by many physicians, are important because

there usually exist multiple studies that have addressed a specific research

question. Yet these types of studies may differ with respect to:

Design

Patient population

Quality

Results

Conducting an overview is a research project in its own right and requires a great deal of effort and care to do it well. For example, determining inclusion and

exclusion criteria for studies is a major challenge for researchers when putting

together a useful overview.

What does this process involve? There are six basic steps to an overview:

Upon completion of this lesson, you should be able to do the following:

review.

publication bias.

effects model for a meta-analysis. State how the weights differ between

the fixed and random approaches.

used in a meta-analysis.

analysis.

Reference

Piantadosi S. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.; 2005.

Example: Is a particular therapy effective in the treatment of asthma?

If the question is too broad, it may not be useful when applied to a particular

patient. For example, whether chemotherapy is effective in cancer is too broad

a question (the number of studies addressing this question could exceed

10,000).

If the question is too narrow, there may not be enough evidence to answer the

question. For example, the following question is too narrow: Is a particular

asthma therapy effective in Caucasian females over the age of 65 years in

Central Pennsylvania?

Search

Many sources for studies (throughout the world) should be explored:

Electronic bibliographic databases and registries of trial results (e.g. http://www.clinicaltrialresults.com/ [1])

Conference proceedings

Theses/dissertations

Personal contacts

Unpublished reports

Publication bias is a serious concern: studies in which the intervention is not found to be effective, or as effective as other treatments, may not be submitted for publication. (This is referred to as the "file-drawer problem".) Studies with 'significant results' are more likely to make it into a journal. Recent initiatives in online journals, such as PLoS

Medicine, and databases of trial results may encourage increased publication

of results from scientifically valid studies, regardless of outcome. Even so, in

an imperfect world, realize it is possible for an overview based only on

published studies to have a bias towards an overall positive effect.

There are informal ways to assess whether publication bias has occurred.

Suppose there are some relevant studies with small sample sizes. If nearly all

of them have a positive finding (p < 0.05), then this may provide evidence of a

"publication bias" because of the following reason. It is more difficult to show

positive results with small sample sizes. Thus, there should be some negative

results (p > 0.05) among the small studies.

A "funnel plot" can be constructed to investigate the latter issue. Plot sample

size (vertical axis) versus p-value or magnitude of effect (horizontal axis).

In the absence of publication bias, the p-values for some of the small studies will be relatively large, yielding a "funnel" shape for the scatterplot. If none of the p-values for the small studies are large, the scatterplot instead has a "band" shape, which is evidence to suggest that a degree of publication bias exists.

Criteria

Eligibility criteria for studies need to be established prior to the analysis.

The criteria should identify aspects of the trials, the patient populations, treatment modalities, etc. that are congruent with the objectives of the overview. Looking across a variety of studies, this process can get quite complicated.

Some researchers rate the eligible studies with respect to quality and may weight the studies accordingly in the analysis. One such

example of a quality rating of randomized trials is the Jadad scale (Jadad et

al. Assessing the quality of reports of randomized clinical trials: Is blinding

necessary? Controlled Clinical Trials , 1996; 17: 1-12). Here are the questions

that are asked as a part of this scale along with the scores that are associated

with these answers:

Is the study described as randomized?

No (0 points)

Yes (1 point)

In most circumstances, the researcher easily can gather the relevant

descriptive statistics (e.g., means, standard errors, sample sizes) from the

reports on the eligible studies.

Sometimes, however, published reports do not provide all of the necessary summary estimates (e.g., standard errors). If possible, the researcher should attempt to

contact the authors directly in such situations. This may not be successful,

however, because the authors may no longer have the data.

Ideally, the statistical analysis for a systematic review will be based on the raw

data from each eligible study. This has rarely occurred. Either the raw data

were no longer available or the authors were unwilling to share the raw data.

However, the success of shared data in the Human Genome Project has

given impetus to increased data sharing to promote rapid scientific progress.

Since the US NIH now requires investigators receiving large new NIH grants to have a plan for data-sharing (NIH Data Sharing Policy Guide [2]) and has provided more guidance [3] on how federal data are to be shared, we may anticipate more meta-analyses based on raw data.

Challenges that must be overcome before the research community fully can embrace data sharing include proprietary rights, authorship, patient consent

and confidentiality, common technology, proper use, enforcement of policy,

etc. As these challenges are overcome, the path to a systematic review and

meta-analysis based on raw data will be smoother.

16.5 - 5. Meta-analysis

The obvious advantage for performing a meta-analysis is that a large amount

of data, pooled across multiple studies, can provide increased precision in

addressing the research question. The disadvantage of a meta-analysis is that

the studies can be very heterogeneous in their designs, quality, and patient

populations and, therefore, it may not be valid to pool them. This issue is

something that needs to be evaluated very critically.

Two statistical models commonly are used in meta-analysis: fixed-effects models and random-effects models. The fixed-effects model's assumptions are somewhat restrictive. It assumes that if all the involved

studies had tremendously large sample sizes, then they all would yield the

same result. In essence, a fixed-effects model assumes that there is no inter-

study variability (study heterogeneity). This statistical model accounts only for

intra-study variability.

A random-effects model assumes that the eligible studies represent a random sample from a population of studies that address the

research question. It accounts for intra-study and inter-study variability. Thus,

a random-effects model tends to yield a more conservative result, i.e., wider

confidence intervals and less statistical significance than a fixed-effects

model.

Although a random-effects model often is more realistic, it may not be necessary if there is very low study heterogeneity. A formal test

of study heterogeneity is available. Its results, however, should not determine

whether to apply a fixed-effects model or random-effects model. You need to

use your own judgment as to which model should be applied.

The test for study heterogeneity is very powerful and sensitive when the

number of studies is large. It is very weak and insensitive if the number of

studies is small. Graphical displays provide much better information as to the

nature of study heterogeneity. Some medical journals require that the authors

provide the test of heterogeneity, along with a fixed-effects analysis and a

random-effects analysis.

The basic step for a fixed-effects model involves the calculation of a weighted

average of the treatment effect across all of the eligible studies.

For a continuous outcome, the treatment effect in each study usually is the difference between sample treatment and control means. The weight is expressed as the inverse of the variance of the difference between the sample means. Therefore, if the variance is large the study will be given a lower weight. If the variance is smaller, the weight of the study is larger.

For a binary outcome, the treatment effect in each study usually is the logarithm of the estimated odds ratio. The weight is expressed as the inverse of the variance of the logarithm of the estimated odds ratio. Basically, the weighting takes the same approach using this value.

The estimated treatment effect (e.g., difference between the sample treatment

and control means) in the kth study, k = 1, 2, ... , K, is Yk .

The weight for the estimated treatment effect in the kth study is w_k = 1/S_k^2, where S_k is the estimated standard error of Y_k. The overall weighted treatment effect and its variance are

\bar{Y} = \left(\sum_{k=1}^{K} w_k Y_k\right) \Big/ \left(\sum_{k=1}^{K} w_k\right)

S^2 = 1 \Big/ \left(\sum_{k=1}^{K} w_k\right)

The test of the null hypothesis of no treatment effect (for a continuous outcome) is performed by assuming that |\bar{Y}|/S asymptotically follows a standard normal distribution.

The 100(1 - \alpha)% confidence interval for the overall weighted treatment effect is \bar{Y} \pm z_{1-\alpha/2} S. A test of study homogeneity can be based on the statistic

Q = \sum_{k=1}^{K} w_k (Y_k - \bar{Y})^2

which asymptotically follows a chi-square distribution with K - 1 degrees of freedom when the studies are homogeneous.

An alternative approach with binary outcomes is the Mantel-Haenszel test for the pooled odds ratio. The Mantel-Haenszel test for the pooled odds ratio assumes that the odds ratio is equal across all studies. For the kth study, k = 1, 2, ... , K, a 2 × 2 table of treatment group (control vs. treatment) by outcome is constructed.

The Mantel-Haenszel approach, however, does not adjust for covariates/regressors. Many researchers now use logistic regression analysis to estimate the odds ratio from a study while adjusting for covariates/regressors, so the weighted approach described previously is more applicable.

16.7 - Example

Consider the following example for the difference in sample means between

an inhaled steroid and montelukast in asthmatic children. The outcome

variable is FEV 1 (L) from four clinical trials. Note that only the first study

yields a statistically significant result (p-value < 0.05).

[Table: study-specific Y_k, S_k, w_k, and p-values, not reproduced here.] The overall weighted treatment effect is

\bar{Y} = \frac{(0.070)(977) + (0.043)(416) + (0.058)(370) + (0.075)(595)}{977 + 416 + 370 + 595} \approx 0.065

The overall effect is statistically significant; the 95% confidence interval is [0.025, 0.105], which excludes zero.

The statistic for testing homogeneity is Q = 0.303, which does not exceed 7.81, the 95th percentile of the chi-square distribution with 3 degrees of freedom. Therefore, we have further evidence that the studies are homogeneous, although the small number of studies involved in this overview does not give this test very much power.

Based on the evidence presented above, we can conclude that the inhaled

steroid is significantly better than montelukast in improving lung function in

children with asthma.
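The fixed-effects computations for this example can be reproduced in a few lines. The sketch below (Python rather than the course's SAS tools) uses the study effects and weights quoted above; the individual standard errors themselves are not reproduced in this extract.

import numpy as np
from scipy.stats import norm, chi2

# Study-specific effects Y_k (difference in mean FEV1) and weights w_k = 1/S_k^2,
# as read from the example above.
Y = np.array([0.070, 0.043, 0.058, 0.075])
w = np.array([977.0, 416.0, 370.0, 595.0])

Y_bar = np.sum(w * Y) / np.sum(w)            # overall weighted treatment effect
S = np.sqrt(1.0 / np.sum(w))                 # its standard error
ci = (Y_bar - norm.ppf(0.975) * S, Y_bar + norm.ppf(0.975) * S)
z = abs(Y_bar) / S                           # test of H0: no treatment effect

Q = np.sum(w * (Y - Y_bar) ** 2)             # heterogeneity statistic
p_heterogeneity = chi2.sf(Q, df=len(Y) - 1)

print(Y_bar, ci, z, Q, p_heterogeneity)
# Y_bar is about 0.065 with a 95% CI close to (0.025, 0.105) and Q is about 0.30,
# in agreement with the numbers quoted above.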

The fixed-effects meta-analysis corresponds to the following linear model for the kth study, k = 1, 2, ... , K:

Y_k = \mu + e_k

where Y_k is the observed effect in the kth study, \mu is the pooled population parameter of interest (difference in population treatment means, natural logarithm of the population odds ratio, etc.) and e_k is the random error term for the kth study.

The e_k are assumed to be independent, with e_k having a N(0, \sigma_k^2) distribution. The variance term \sigma_k^2 then reflects intra-study variability and its estimate is S_k^2. Usually, Y_k and S_k are provided as descriptive statistics in the kth study report.

Random-Effects Analysis

A corresponding linear model for the random-effects approach is as follows:

Y_k = \mu + t_k + e_k

where t_k is the random effect for the kth study. The t_k, k = 1, 2, ... , K, are assumed to be independent N(0, \tau^2) random variables; the variance term \tau^2 reflects inter-study variability. The e_k are independent N(0, \sigma_k^2) random variables reflecting intra-study variability, as in the fixed-effects linear model.

A weighted analysis will be applied, analogous to the weighted analysis for the

fixed-effects linear model, but the weights are different. The overall weighted

treatment effect is:

\bar{Y} = \left(\sum_{k=1}^{K} w_k^{*} Y_k\right) \Big/ \left(\sum_{k=1}^{K} w_k^{*}\right)

where

w_k^{*} = 1 / (S_k^2 + \hat{\tau}^2)

\hat{\tau}^2 = \max\left(0, \; (Q - K + 1) \cdot \frac{\sum_{k=1}^{K} w_k}{\left(\sum_{k=1}^{K} w_k\right)^2 - \sum_{k=1}^{K} w_k^2}\right)

and where Q is the heterogeneity statistic and wk is the weight for the kth study,

which were defined previously for the weighted analysis in the fixed-effects

linear model.

The estimated variance of the overall weighted treatment effect in the random-effects linear model is:

S^2 = 1 \Big/ \left(\sum_{k=1}^{K} w_k^{*}\right)

If there exists a large amount of study heterogeneity, then \hat{\tau}^2 will be very large and will dominate the expression for the weight in the kth study, i.e.,

w_k^{*} = 1 / (S_k^2 + \hat{\tau}^2) \approx 1 / \hat{\tau}^2

In that case the weight for each study will be approximately the same and the weighted analysis for the random-effects linear model will approximate an unweighted analysis.
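For completeness, here is a small sketch of the random-effects weighting described above (a DerSimonian-Laird-type estimate of tau-squared), applied to the same four studies; the function name is a choice made here for illustration.

import numpy as np

def random_effects_weights(Y, w):
    """Random-effects weighted mean using the tau^2 estimator sketched above.

    Y are the study effects and w = 1/S_k^2 the fixed-effects weights."""
    Y, w = np.asarray(Y, float), np.asarray(w, float)
    Y_bar = np.sum(w * Y) / np.sum(w)
    Q = np.sum(w * (Y - Y_bar) ** 2)
    K = len(Y)
    tau2 = max(0.0, (Q - (K - 1)) * np.sum(w) / (np.sum(w) ** 2 - np.sum(w ** 2)))
    w_star = 1.0 / (1.0 / w + tau2)           # 1/(S_k^2 + tau^2)
    Y_star = np.sum(w_star * Y) / np.sum(w_star)
    S_star = np.sqrt(1.0 / np.sum(w_star))
    return tau2, Y_star, S_star

# With the four-study example above, Q < K - 1, so tau^2 is truncated to 0 and
# the random-effects result coincides with the fixed-effects result.
print(random_effects_weights([0.070, 0.043, 0.058, 0.075], [977, 416, 370, 595]))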

Graphical displays that show the estimated treatment effect and confidence interval for every study are very useful for evaluating treatment effects over time or with respect to other factors.

Sensitivity Analyses

It is prudent to investigate the validity and robustness of the meta-analysis by applying the meta-analytic approach to subsets of the K studies, and/or applying the leave-one-out method.

The leave-one-out method works as follows: remove the first of the K studies and conduct the meta-analysis on the remaining K - 1 studies; then replace the first study, remove the second, and conduct the meta-analysis on the remaining K - 1 studies; continue until each study has been left out once (yielding K meta-analyses, each with K - 1 studies).

If the results of the K leave-one-out meta-analyses are relatively consistent, then there is confidence that the overall meta-analysis is robust. The likelihood of consistency increases as K increases. The idea here is that removing one of the studies from the meta-analysis should not affect the overall results. If it does, this suggests that there exists a lack of homogeneity among the studies involved.

The researcher also can perform a sensitivity analysis by applying the meta-analysis to subsets of

studies based on high-quality versus low-quality studies, randomized versus

non-randomized studies, early studies versus late studies, etc.

Many medical journals have guidelines on the process for publishing a

systematic review/meta-analysis.

The Cochrane Collaboration has developed a rigorous methodology for systematic reviews. Cochrane Reviews

( http://www.cochrane.org [4] ) are based on the best available information

about healthcare interventions. They explore the evidence for and against the

effectiveness and appropriateness of treatments (medications, surgery,

education, etc) in specific circumstances.

Here are a series of questions that we can ask ourselves as we evaluate the

value of a meta-analysis. You will have the opportunity to evaluate a meta-

analysis in the homework exercise.

2. Was the search for relevant studies detailed and exhaustive? Were the

inclusion/exclusion criteria for studies developed and applied

appropriately?

16.10 - Summary

In this lesson, among other things, we learned how to:

describe how publication bias can affect the results of a systematic

review,

publication bias,

effects model for a meta-analysis, stating how the weights differ

between the fixed and random approaches,

used in a meta-analysis, and

analysis.

Let's put what we have learned to use by completing the following homework

assignment:

Homework

Look for homework assignment and the dropbox in the folder for this week in

ANGEL.

Introduction

A diagnostic test is any approach used to gather clinical information for the

purpose of making a clinical decision (i.e., diagnosis). Some examples of

diagnostic tests include X-rays, biopsies, pregnancy tests, medical histories,

and results from physical examinations.

From a statistical point of view there are two points to keep in mind:

2. the goal of a diagnostic test is to move the estimated probability of

disease toward either end of the probability scale (i.e., 0 rules out

disease, 1 confirms the disease).

Consider an example adapted from a medical decision-making text (Third Edition). A 54-year-old woman visits her family physician for an annual

check-up. The physician observes that:

she had no illnesses during the preceding year and there is no family

history of breast cancer,

Based on the woman's age and medical history, the initial (prior) probability

estimate of breast cancer is 0.003. The physician recommends that the

woman have a mammogram, due to her age. Unfortunately, the results of the

mammogram are abnormal. This yields a modification of the woman's prior

probability of breast cancer from 0.003 to 0.13 (notice the Bayesian flavor of

this approach - prior probability modified via existing data). Next, the woman is

referred to a surgeon who agrees that the physical breast exam is normal. The

surgeon consults with a radiologist and they decide that the woman should

undergo fine needle aspiration (FNA) of the abnormal breast detected by the

mammogram. (diagnostic test #2) The FNA specimen reveals abnormal cells,

which again revises the probability of breast cancer, from 0.13 to 0.64. Finally,

the woman is scheduled for a breast biopsy the following week to get a

definitive diagnosis.

Ideally, a diagnostic test would be perfectly accurate, noninvasive, and free of side effects. If this were the case, a positive test result would unequivocally

indicate the presence of disease and a negative result would indicate the

absence of disease. Realistically, however, every diagnostic test is fallible.

specificity of a diagnostic test,

calculate accuracy and predictive values of a diagnostic test,

specificity and predictive values of a diagnostic test,

whether the results come from a study in two groups of patients or one

group of patients tested with both tests, and

curve, for different cost ratios of false positive/false negative results.

To begin, let's consider a simple test which has only two possible outcomes,

namely, positive and negative. When a test is applied to a group of patients,

some with the disease and some without the disease, four groups can result,

as summarized in the following 2 × 2 table:

                    Disease                   No Disease
Test Positive       a (true positives)        b (false positives)
Test Negative       c (false negatives)       d (true negatives)

a (true-positives) = individuals with the disease, and for whom the test is

positive

b (false-positives) = individuals without the disease, but for whom the test is

positive

c (false-negatives) = individuals with the disease, but for whom the test is

negative

d (true-negatives) = individuals without the disease, and for whom the test is

negative

a + c = total number of individuals with the disease; b + d = total number of individuals without the disease

The "Gold Standard" is the method used to obtain a definitive diagnosis for a

particular disease; it may be biopsy, surgery, autopsy or an acknowledged

standard. Gold Standards are used to define true disease status against

which the results of a new diagnostic test are compared. Here are a number

of definitive diagnostic tests that will confirm whether or not you have the

disease. Some of these are quite invasive and this is a major reason why new

diagnostic procedures are being developed.

breast cancer: excisional biopsy
prostate cancer: transrectal biopsy
coronary stenosis: coronary angiography
myocardial infarction: catheterization
strep throat: throat culture

The following concepts have been developed to describe the performance of

a diagnostic test relative to the gold standard; these concepts are measures of

the validity of a diagnostic test.

Sensitivity is the probability that an individual with the disease of interest has a

positive test. It is estimated from the sample as a/(a+c).

Specificity is the probability that an individual without the disease of interest has a negative test. It is estimated from the sample as d/(b+d).

Accuracy is the probability that the diagnostic test yields the correct

determination. It is estimated from the sample as (a+d)/(a+b+c+d).

Tests with high sensitivity are useful clinically to rule out a disease. A negative

result for a very sensitive test virtually would exclude the possibility that the

individual has the disease of interest. If a test has high sensitivity, it also

results in a low proportion of false-negatives. Sensitivity also is referred to as

"positive in disease" or "sensitive to disease".

Tests with high specificity are useful clinically to confirm the presence of a

disease. A positive result for a very specific test would give strong evidence in

favor of diagnosing the disease of interest. If a test has high specificity, it also

results in a low proportion of false-positives. Specificity also is referred to as

"negative in health" or "specific to health".

Sensitivity and specificity are, in theory, stable for all groups of patients.

For example, in one study 114 women with normal physical examinations (nonpalpable masses) and abnormal mammograms received a FNA followed by surgical excisional biopsy of the same breast (Bibbo M, et al: Stereotaxic fine needle aspiration cytology of clinically occult malignant and premalignant breast lesions. Acta Cytol 1988; 32:193-201.)

                Cancer    No Cancer
FNA Positive      14           8
FNA Negative       1          91

Sensitivity = 14/15 = 0.93 or 93%
Specificity = 91/99 = 0.92 or 92%
Accuracy = 105/114 = 0.92 or 92%

Estimated sensitivity and specificity are sample proportions. Therefore, we can compute confidence intervals using binomial theory. See

SAS Example (18.1_sensitivity_specifi.sas [1]) below for a SAS program that

calculates exact and asymptotic confidence intervals for sensitivity and

specificity.

For the FNA study, only 15 women with cancer, as diagnosed by the gold

standard, were studied. The rule for using the asymptotic confidence interval

fails for sensitivity because np(1 - p) = 0.9765 < 5 (the rule does hold for

specificity).

As the output shows below, the exact 95% confidence intervals for sensitivity

and specificity are (0.680, 0.998) and (0.847, 0.965), respectively.

17.3 - Estimating the Probability of Disease

Sensitivity and specificity describe the accuracy of a test. In a clinical setting,

we do not know who has the disease and who does not - that is why

diagnostic tests are used. We would like to be able to estimate the probability

of disease based on the outcome of one or more diagnostic tests. The

following measures address this idea.

Prevalence is the probability of having the disease, also called the prior

probability of having the disease. It is estimated from the sample as (a+c)/

(a+b+c+d).

Positive Predictive Value (PV+) is the probability of having the disease given a positive test result. It is estimated as a/(a+b).

Negative Predictive Value (PV-) is the probability of not having the disease when the test result is negative. It is estimated as d/(c+d).

In the FNA study of 114 women with nonpalpable masses and abnormal

mammograms,

PV+ = 14/(14+8) = 0.64

PV - = 91/(1+91) = 0.99

Thus, a woman's prior probability of having the disease is 0.13 and is modified

to 0.64 if she has a positive test result. A woman's prior probability of not

having the disease is 0.87 and is modified to 0.99 if she has a negative test

result.

If the disease under study is rare, the investigator may decide to invoke a

case-control design for evaluating the diagnostic test, e.g., recruit 50 patients

with the disease and 50 controls. Obviously, prevalence cannot be estimated

from a case-control study because it does not represent a random sample

from the general population.

Predictive values allow us to determine the usefulness of a test, and they vary with the sensitivity and specificity of the test. With all other characteristics held constant, a higher specificity yields a higher PV+ and a higher sensitivity yields a higher PV-.

Predictive values vary with the prevalence of the disease in the population

being tested or the pre-test probability of disease in a given individual.

Bayes' rule provides a way to estimate post-test probabilities (predictive values), even though physicians work with one patient at a time, not entire populations of patients. Three pieces of information are necessary prior to performing the test, namely, (1) either the prevalence of the disease or the prior probability of disease, (2) sensitivity, and (3) specificity.

PV+ = \frac{\text{Prevalence} \times \text{Sensitivity}}{(\text{Prevalence} \times \text{Sensitivity}) + (1 - \text{Prevalence}) \times (1 - \text{Specificity})}

PV- = \frac{(1 - \text{Prevalence}) \times \text{Specificity}}{(1 - \text{Prevalence}) \times \text{Specificity} + \text{Prevalence} \times (1 - \text{Sensitivity})}

Although in the FNA study the predictive values could be calculated directly from the 2 × 2 data table because the women constituted a random sample, the above formulae yield the same results:

Consider another example (adapted from a clinical epidemiology text). Suppose a patient with the following characteristics visits a physician:

45-year-old man

no coronary risk factors except smoking one pack of cigarettes per day

chest wall tender to palpation, but palpation does not reproduce the patient's pain

Based on these findings, the physician estimates a prior probability of 60% that this patient has significant coronary artery narrowing.

The physician is not sure whether the patient should undergo an exercise

electrocardiogram (ECG). How useful would this test be for this patient?

Suppose it is known from the literature that the sensitivity and specificity of the

exercise ECG in coronary artery stenosis (as compared to the gold standard

of coronary arteriography) are 60% and 91%, respectively.

Then:

PV+ = (0.6)(0.6)/{(0.6)(0.6) + (0.4)(0.09)} = 0.91

PV- = (0.4)(0.91)/{(0.4)(0.91) + (0.6)(0.4)} = 0.60

An additional test characteristic reported in the medical literature is the

likelihood ratio, which is the probability of a particular test result (+ or - ) in

patients with the disease divided by the probability of the result in patients

without the disease. There exists one likelihood ratio for a positive test (LR+)

and one for a negative test (LR - ). Likelihood ratios express how many times

more (or less) likely the test result is found in diseased versus non-diseased

individuals:

LR+ = Sensitivity/(1 - Specificity)

LR- = (1 - Sensitivity)/Specificity

From the FNA study in 114 women with nonpalpable masses and abnormal

mammograms, LR+ = 0.933/0.081 = 11.52 and LR - = 0.067/0.919 = 0.07.

Thus, positive FNA results are 11.52 times more likely in women with cancer

as compared to those without, and negative FNA results are .07 times as

likely in women with cancer as compared to those without.
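These formulas are simple to compute directly. The sketch below (an illustrative Python helper, not part of the lesson's SAS materials) evaluates the predictive values and likelihood ratios for the exercise ECG and FNA examples discussed above.

def predictive_values(prevalence, sensitivity, specificity):
    """Post-test probabilities and likelihood ratios from the formulas above."""
    pv_pos = (prevalence * sensitivity) / (
        prevalence * sensitivity + (1 - prevalence) * (1 - specificity))
    pv_neg = ((1 - prevalence) * specificity) / (
        (1 - prevalence) * specificity + prevalence * (1 - sensitivity))
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return pv_pos, pv_neg, lr_pos, lr_neg

# Exercise ECG example: prior probability 0.60, sensitivity 0.60, specificity 0.91.
print(predictive_values(0.60, 0.60, 0.91))           # PV+ about 0.91, PV- about 0.60

# FNA example: prevalence 15/114, sensitivity 14/15, specificity 91/99.
print(predictive_values(15 / 114, 14 / 15, 91 / 99))  # LR+ about 11.5, LR- about 0.07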

Suppose that we want to compare sensitivity and specificity for two diagnostic

tests. Let p1 denote the test characteristic for diagnostic test #1 and let p2 =

test characteristic for diagnostic test #2.

The appropriate statistical test depends on the setting. If diagnostic tests were

studied on two independent groups of patients, then two-sample tests for

binomial proportions are appropriate (chi-square, Fisher's exact test). If both

diagnostic tests were performed on each patient, then paired data result and

methods that account for the correlated binary outcomes are necessary

(McNemar's test).

Suppose the two diagnostic tests were studied on independent samples of individuals using the same gold standard. The following 2 × 2 tables result:

Diagnostic Test #1    Disease    No Disease
Positive                  82          30
Negative                  18          70

Diagnostic Test #2    Disease    No Disease
Positive                 140          10
Negative                  60          90

The estimated sensitivities are p1 = 82/100 = 0.82 and p2 = 140/200 = 0.70 for diagnostic test #1 and

diagnostic test #2, respectively. The following SAS program will provide

confidence intervals for the sensitivity for each test as well as comparison of

the tests with regard to sensitivity.

Run the program and look at the output. Do you see the exact 95%

confidence intervals for the two diagnostic tests as (0.73, 0.89) and (0.63,

0.76), respectively?

The SAS program also indicates that the p-value = 0.0262 from Fisher's exact

test for testing H0 : p1 = p2 .

Thus, diagnostic test #1 has a significantly better sensitivity than diagnostic

test #2.
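Because the two samples are independent, the comparison of sensitivities reduces to a two-sample comparison of binomial proportions. The following sketch applies Fisher's exact test to the positive/negative counts among the diseased patients from the two tables above (a Python stand-in for the SAS program mentioned in the text).

from scipy.stats import fisher_exact

# Diseased patients testing positive/negative with each diagnostic test.
table = [[82, 18],    # test #1: 82 of 100 diseased patients positive
         [140, 60]]   # test #2: 140 of 200 diseased patients positive

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_value)   # p should agree with the reported 0.0262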

Suppose both diagnostic tests (test #1 and test #2) are applied to a given set

of individuals, some with the disease (by the gold standard) and some without

the disease.

Suppose the two tests yield the following results in 100 diseased patients:

                        Test #2 Positive    Test #2 Negative
Test #1 Positive               30                  35
Test #1 Negative               23                  12

The appropriate test statistic for this situation is McNemar's test. The patients

with a (+, +) result and the patients with a ( - , - ) result do not distinguish

between the two diagnostic tests. The only information for comparing the

sensitivities of the two diagnostic tests comes from those patients with a (+, -) or (-, +) result. Under the null hypothesis of equal sensitivities, the number of (+, -) results among these discordant pairs follows a binomial distribution with p = 0.5, and McNemar's exact test is the binomial test of that hypothesis.

In the above example, N = 58 and 35 of the 58 display a (+, - ) result, so the

estimated binomial probability is 35/58 = 0.60. The exact p-value is 0.148 from

McNemar's test (see SAS Example 18.3_comparing_diagnostic.sas [3] below).

Thus, the two diagnostic tests are not significantly different with respect to

sensitivity.

Methods for calculating sensitivity and specificity depend on test outcomes

that are dichotomous. Many lab tests and other diagnostic tools, however, are

measured on a numerical scale. In this case, sensitivity and specificity depend

on where the cutoff point is made between positive and negative.

The positivity criterion is the cutoff value on a numerical scale that separates

normal values from abnormal values. It determines which test results are

considered positive (indicative of disease) and negative (disease-free).

Because the distributions of test values for diseased and disease-free

individuals are likely to overlap, there will be false-positive and false-negative

results. When defining a positivity criterion, it is important to consider which

mistake is worse.

Suppose that a high value of the diagnostic test is indicative of disease, and that a relatively low value is selected for the cutoff point. The chosen cutoff value will yield a good sensitivity because nearly all of the diseased individuals will have a positive result. Unfortunately, many of the healthy individuals also will have a positive result (false positives), so this cutoff value will yield a poor specificity.

Now suppose a greater value is selected for the cutoff point. The chosen cutoff value will yield a poor sensitivity because many of the diseased individuals will have a negative result (false negatives). On the other hand, nearly all of the healthy individuals will have a negative result, so the chosen cutoff value will yield a good specificity.

When the consequences for missing a case are potentially grave, choose a

value for the positivity criterion that minimizes the number of false-negatives.

For example, in neonatal PKU screening, a false-negative result may delay

essential dietary intervention until mental retardation is evident. False-positive

results, on the other hand, are usually identified during follow-up testing.

When false-positive results may lead to a risky treatment, choose a value for

the positivity criterion that minimizes the number of false-positive results. For

example, false-positive results indicating certain types of cancer can lead to

chemotherapy which can suppress the patient's immune system and leave the

patient open to infection and other side effects.

A receiver operating characteristic (ROC) curve is a graphical representation of the relationship between sensitivity and specificity for a

diagnostic test measured on a numerical scale. The ROC curve consists of a

plot of sensitivity (true-positives) versus 1 - specificity (false-positives) for

several choices of the positivity criterion. PROC LOGISTIC of SAS provides a

means for constructing ROC curves.

The figure below depicts an ROC curve (drawn with xs). The point in the

upper left corner of the figure, (0,1), represents a perfect test, in which

sensitivity and specificity both are 1. When false-positive and false-negative

results are equally problematic, there are two choices: 1. Set the positivity

criterion to the point on the ROC curve closest to the upper left corner. (This

will also be closest to the dashed line, as the cutoff in the figure indicates.) or 2. Set the positivity criterion to the point on the ROC curve farthest (vertical distance) from the line of chance (Youden index).

When false-positive results are more undesirable, set the positivity criterion to

the point farthest left on the ROC curve (increase specificity). If instead, false-

negative results are more undesirable, set the positivity criterion to a point

farther right on the ROC curve (increase sensitivity).
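As an illustration, the sketch below builds an ROC curve from simulated test values (the data, and the use of scikit-learn rather than SAS PROC LOGISTIC, are assumptions of this example) and selects the cutoff that maximizes the Youden index.

import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Hypothetical test values: diseased individuals tend to have higher values.
disease = np.r_[np.ones(100), np.zeros(100)]
value = np.r_[rng.normal(2.0, 1.0, 100), rng.normal(0.0, 1.0, 100)]

fpr, tpr, thresholds = roc_curve(disease, value)   # fpr = 1 - specificity, tpr = sensitivity

# Youden index: the cutoff farthest (vertically) above the line of chance.
youden = tpr - fpr
best = np.argmax(youden)
print(thresholds[best], tpr[best], 1 - fpr[best])  # cutoff, sensitivity, specificity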

In the ACRN SOCS trial, the investigators wanted to determine if low values of

the methacholine PC20 at baseline are predictive of significant asthma

exacerbations. The methacholine PC20 is a measure of how reactive a person's airways are to an irritant (methacholine); a low value of the PC20 corresponds to a high level of airway reactivity.

Unfortunately, log2(methacholine PC20) is not statistically significant in

predicting the occurrence of significant asthma exacerbation (p = 0.27) and

the ROC curve is very close to the line of identity.

17.6 - Summary

In this lesson, among other things, we learned how to:

specificity of a diagnostic test,

specificity and predictive values of a diagnostic test,

whether the results come from a study in two groups of patients or one

group of patients tested with both tests, and

curve, for different cost ratios of false positive/false negative results.

assignment posted in this week's folder!

Introduction

There are three common objectives when studying the relationship between two continuous or ordinal scale variables within a group of patients:

1. assessing the correlation between the two variables, i.e., determining whether values of one variable tend to be higher (or possibly lower) for higher values of the other variable;

2. assessing the amount of agreement between the values of the two variables, i.e., comparing alternative ways of measuring or assessing the same response;

3. predicting the value of one variable from the value of the other variable, i.e., formulating predictive models via regression analyses.

This lesson will focus only on correlation and agreement (issues 1 and 2 listed above).

correlation, Kendalls tau-b and Cohens Kappa statistics.

coefficients and interpret the results.

confidence intervals and Kendalls tau-b.

concordance.

statistic based on the type of data used for each.

Correlation is a general method of analysis useful when studying possible

association between two continuous or ordinal scale variables. Several

measures of correlation exist. The appropriate type for a particular situation

depends on the distribution and measurement scale of the data. Three

measures of correlation are commonly applied in biostatistics and these will

be discussed below.

Suppose that we have two variables of interest, denoted as X and Y, and

suppose that we have a bivariate sample of size n, (X_1, Y_1), \ldots, (X_n, Y_n). Define

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S_{XX} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2

\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad S_{YY} = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2

S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})

These statistics above represent the sample mean for X, the sample variance

for X, the sample mean for Y, the sample variance for Y, and the sample

covariance between X and Y, respectively. These should be very familiar to

you.

The sample Pearson correlation coefficient (also called the sample product-

moment correlation coefficient) for measuring the association between

variables X and Y is given by the following formula:

r_p = S_{XY} / \sqrt{S_{XX} S_{YY}}

The statistic r_p estimates the population Pearson correlation coefficient

\rho_p = \sigma_{XY} / \sqrt{\sigma_{XX} \sigma_{YY}}

The coefficient r_p measures the degree of linear association between X and Y, and -1 \le r_p \le +1, so that r_p is a "unitless" quantity, i.e., when you construct the correlation coefficient the units of measurement that are used cancel out. A value of +1 reflects perfect positive correlation and a value of -1 reflects perfect negative correlation.

For the Pearson correlation coefficient, we assume that both X and Y are

measured on a continuous scale and that each is approximately normally

distributed.

The Pearson correlation coefficient is invariant to location and scale transformations. This means that if every X_i is transformed to

X_i^* = aX_i + b

and every Y_i is transformed to

Y_i^* = cY_i + d

where a > 0 and c > 0, then the correlation between X and Y is the same as the correlation between X^* and Y^*.

With SAS, PROC CORR is used to calculate rp . The output from PROC

CORR includes summary statistics for both variables and the computed value

of rp . The output also contains a p-value corresponding to the test of:

H0 : \rho_p = 0 versus H1 : \rho_p \ne 0

It should be noted that this statistical test generally is not very useful, and the

associated p-value, therefore, should not be emphasized. What is more

important is to construct a confidence interval.

Because it is not appropriate to construct confidence limits for r_p directly from a standard normal distribution, we transform r_p using Fisher's Z transformation to get a quantity, z_p, that has an approximate normal distribution. Then we can work with this value. Here is what is involved in the transformation.

z_p = \tfrac{1}{2}\log_e\!\left(\frac{1+r_p}{1-r_p}\right) \;\sim\; N\!\left(\zeta_p, \; sd = \frac{1}{\sqrt{n-3}}\right)

where

\zeta_p = \tfrac{1}{2}\log_e\!\left(\frac{1+\rho_p}{1-\rho_p}\right)

We will use this to get the usual confidence interval. An approximate 100(1 − α)% confidence interval for ζp is given by [zp,α/2 , zp,1−α/2], where

$$z_{p,\alpha/2} = z_p - \frac{t_{n-3,\,1-\alpha/2}}{\sqrt{n-3}}, \qquad z_{p,1-\alpha/2} = z_p + \frac{t_{n-3,\,1-\alpha/2}}{\sqrt{n-3}}$$

Back-transforming, an approximate 100(1 − α)% confidence interval for ρp is given by [rp,α/2 , rp,1−α/2], where

$$r_{p,\alpha/2} = \frac{\exp(2z_{p,\alpha/2}) - 1}{\exp(2z_{p,\alpha/2}) + 1}, \qquad r_{p,1-\alpha/2} = \frac{\exp(2z_{p,1-\alpha/2}) - 1}{\exp(2z_{p,1-\alpha/2}) + 1}$$
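As a numerical check, take rp = 0.7921 and n = 18 (the Pearson estimate from SAS Example 19.1 below) and use the standard normal multiplier 1.96 in place of the t quantile, which is a close approximation here:

$$z_p = \tfrac{1}{2}\log_e\!\left(\frac{1.7921}{0.2079}\right) \approx 1.077, \qquad \frac{1}{\sqrt{15}} \approx 0.258$$

$$1.077 \pm 1.96(0.258) = (0.571,\ 1.583) \;\Rightarrow\; \left(\frac{e^{2(0.571)}-1}{e^{2(0.571)}+1},\ \frac{e^{2(1.583)}-1}{e^{2(1.583)}+1}\right) \approx (0.52,\ 0.92)$$

which agrees with the 95% confidence interval reported for the Pearson coefficient in the example below.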

Again, you do not have to do this by hand. PROC CORR in SAS will do this

for you but it is important to have an idea of what is going on.
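For instance, a minimal call along the following lines requests the Pearson and Spearman coefficients together with Fisher's Z-based confidence limits; the dataset and variable names (mydata, x, y) are placeholders for illustration, not names from the course programs.

   /* Minimal sketch; dataset and variable names are placeholders. */
   proc corr data=mydata pearson spearman fisher;
      var x y;
   run;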

The Spearman rank correlation coefficient, rs , is a nonparametric measure of

correlation based on data ranks. It is obtained by ranking the values of the two

variables (X and Y) and calculating the Pearson rp on the resulting ranks, not

the data itself. Again, PROC CORR will do all of these actual calculations for

you.
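As a small hypothetical illustration, suppose X = (1, 2, 3, 4) and Y = (1, 2, 4, 100). The relationship is perfectly monotone, so the ranks of X and the ranks of Y are both (1, 2, 3, 4) and the Pearson correlation of the ranks gives rs = 1. The Pearson coefficient computed on the raw values, in contrast, is only about 0.79, because the relationship is strongly nonlinear even though it is perfectly monotone.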

The Spearman rank correlation coefficient has properties similar to those of the Pearson correlation coefficient, although the Spearman rank correlation coefficient quantifies the degree of linear association between the ranks of X and the ranks of Y. Also, rs does not estimate a natural population parameter (unlike Pearson's rp, which estimates ρp).

An advantage of the Spearman rank correlation coefficient is that the X and Y values can be continuous or ordinal, and approximate normal distributions for X and Y are not required. Similar to the Pearson rp, Fisher's Z

transformation can be applied to the Spearman rs to get a statistic, zs , that

has an asymptotic normal distribution for calculating an asymptotic confidence

interval. Again, PROC CORR will do this as well.

Kendall Tau-b Correlation Coefficient

The Kendall tau-b correlation coefficient, b , is a nonparametric measure of

association based on the number of concordances and discordances in paired

observations.

Two observations (Xi, Yi) and (Xj, Yj) are concordant if they are in the same order with respect to each variable, that is, if

(1) Xi < Xj and Yi < Yj, or

(2) Xi > Xj and Yi > Yj.

They are discordant if they are in the reverse ordering for X and Y, i.e., the values are arranged in opposite directions, that is, if

(1) Xi < Xj and Yi > Yj, or

(2) Xi > Xj and Yi < Yj.

The total number of pairs that can be constructed for a sample size of n is

$$N = \binom{n}{2} = \frac{1}{2}\,n(n-1)$$

N can be decomposed into five quantities:

$$N = P + Q + X_0 + Y_0 + (XY)_0$$

where P is the number of concordant pairs, Q is the number of discordant pairs, X0 is the number of pairs tied only on the X variable, Y0 is the number of pairs tied only on the Y variable, and (XY)0 is the number of pairs tied on both X and Y.

The sample Kendall tau-b correlation coefficient for measuring the association between variables X and Y is given by the following formula:

$$t_b = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}}$$

This value is scaled so that it ranges between −1 and +1. Unlike the Spearman rs, it does estimate a population parameter; tb is the sample estimate of

$$\tau_b = \Pr[\text{concordance}] - \Pr[\text{discordance}]$$

The Kendall tau-b has properties similar to the properties of the Spearman rs.

Because the sample estimate, tb, does estimate a population parameter, τb,

many statisticians prefer the Kendall tau-b to the Spearman rank correlation

coefficient.
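To make the pair counting concrete, consider a small hypothetical sample of n = 4 observations: (1, 2), (2, 3), (3, 1), (4, 4). Of the N = 6 possible pairs, P = 4 are concordant and Q = 2 are discordant (for instance, (1, 2) and (3, 1) are discordant because X increases while Y decreases), and there are no ties, so

$$t_b = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}} = \frac{4 - 2}{\sqrt{6 \times 6}} \approx 0.33$$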

SAS Example (19.1_correlation.sas [1]): Age and percentage body fat were

measured in 18 adults. SAS PROC CORR provides estimates of the Pearson,

Spearman, and Kendall correlation coefficients. It also calculates Fisher's Z

transformation for the Pearson and Spearman correlation coefficients in order

to get 95% confidence intervals.

The resulting estimates for this example are 0.7921, 0.7539, and 0.5762,

respectively for the Pearson, Spearman, and Kendall correlation coefficients.

The Kendall tau-b correlation typically is smaller in magnitude than the

Pearson and Spearman correlation coefficients.

The 95% confidence intervals are (0.5161, 0.9191) and (0.4429, 0.9029),

respectively for the Pearson and Spearman correlation coefficients. Because

the Kendall correlation typically is applied to binary or ordinal data, its 95%

confidence interval can be calculated via SAS PROC FREQ (this is not shown

in the SAS program above).
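Although it is not part of the program for this example, a call of roughly the following form would produce Kendall's tau-b together with asymptotic confidence limits; the dataset and variable names (bodyfat, age, pctfat) are assumptions for illustration only.

   /* Sketch only; dataset and variable names are assumed. */
   proc freq data=bodyfat;
      tables age*pctfat / measures cl noprint;  /* MEASURES requests tau-b; CL adds confidence limits; NOPRINT suppresses the large crosstab */
   run;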

Caveats Concerning the Use of Correlation Coefficients

Correlation is a widely-used analysis tool which sometimes is applied

inappropriately. Some caveats regarding the use of correlation methods

follow.

1. The correlation methods described above should be used only with independent data; they should not be applied to repeated measures data, where the observations are not independent. For example, it would not be appropriate to use these measures of correlation to describe the relationship between Week 4 and Week 8 blood pressures in the same patients.

2. Interpret results cautiously when large numbers of variables have been examined, resulting in a large number of correlation coefficients; some will appear sizable by chance alone.

3. The correlation of two variables that both have been recorded repeatedly

over time can be misleading and spurious. Time trends should be removed

from such data before attempting to measure correlation.

4. To draw inferences about correlation in a population, the sample must form a representative (i.e., random) sample from that population. The

Pearson correlation coefficient can be very sensitive to outlying observations

and all correlation coefficients are susceptible to sample selection biases.

5. Care should be taken when attempting to correlate two variables where one

is a part and one represents the total. For example, we would expect to find a

positive correlation between height at age ten and adult height because the

second quantity "contains" the first quantity.

6. Care should be taken when correlating a baseline measurement, X, with the change in that measurement over time, Y − X; X will be correlated with Y − X due to the regression to the mean phenomenon.

7. Small correlation values do not necessarily indicate that two variables are

unassociated. For example, Pearson's rp will underestimate the association

between two variables that show a quadratic relationship. Scatterplots should

always be examined.

8. If a significant correlation is observed between two variables A and B, there are several possible explanations: (a) A influences B; (b) B influences A; (c) A and B are influenced by one or more additional variables; (d) the apparent relationship between A and B arose by chance.

9. A researcher sometimes calculates a correlation coefficient when the real intent is to compare two methods of measuring the same quantity with respect to their agreement. This is a misguided analysis, because correlation measures only the degree of association; it does not measure agreement. The next section of this lesson will present a measure of agreement.

The Concordance Correlation Coefficient for Measuring Agreement

How well do two diagnostic measurements agree? Many times continuous

units of measurement are used in the diagnostic test. We may not be

interested in correlation or linear relationship between the two measures, but

in a measure of agreement.

The concordance correlation coefficient, rc, for measuring agreement between continuous variables X and Y (both approximately normally distributed) is calculated as follows:

$$r_c = \frac{2 S_{XY}}{S_{XX} + S_{YY} + (\bar{X} - \bar{Y})^2}$$

The concordance correlation coefficient satisfies −1 ≤ rc ≤ +1. A value of rc = +1 corresponds to perfect agreement. A value of rc = −1 corresponds to perfect negative agreement, and a value of rc = 0 corresponds to no agreement. The sample estimate, rc, is an estimate of the population concordance correlation coefficient:

$$\rho_c = \frac{2\sigma_{XY}}{\sigma_{XX} + \sigma_{YY} + (\mu_X - \mu_Y)^2}$$

Let's look at an example that will help to make this concept clearer.

SAS Example (19.2_agreement_concordanc.sas [2]) : The ACRN DICE trial

was discussed earlier in this course. In that trial, participants underwent hourly

blood draws between 08:00 PM and 08:00 AM once a week in order to

determine the cortisol area-under-the-curve (AUC). The participants hated

this! They complained about the sleep disruption every hour when the nurses

came by to draw blood, so the ACRN wanted to determine for future studies if

the cortisol AUC calculated on measurements every two hours was in good

agreement with the cortisol AUC calculated on hourly measurements. The

baseline data were used to investigate how well these two measurements

agreed. If there is good agreement, the protocol could be changed to take

blood every two hours.

Run the program to view the output. This is higher-level SAS than you are expected to program yourself in this course, but some of the code may be of interest.

The SAS program yielded rc = 0.95 and a 95% confidence interval = (0.93,

0.96). The ACRN judged this to be excellent agreement, so it will use two-

hourly measurements in future studies.
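The actual program uses more advanced SAS than is expected in this course. Purely as a sketch of the arithmetic behind rc, the following applies the formula to the sample means and covariances produced by PROC CORR; the dataset and variable names (dice, auc1 for the hourly AUC, auc2 for the two-hourly AUC) are assumptions, and this is not the 19.2 program itself.

   /* Sketch only: compute rc from PROC CORR summary output. */
   proc corr data=dice cov outp=stats noprint;
      var auc1 auc2;
   run;

   data ccc;
      retain sxx syy sxy m1 m2;
      set stats end=last;
      if _TYPE_ = 'COV'  and upcase(_NAME_) = 'AUC1' then do; sxx = auc1; sxy = auc2; end;
      if _TYPE_ = 'COV'  and upcase(_NAME_) = 'AUC2' then syy = auc2;
      if _TYPE_ = 'MEAN' then do; m1 = auc1; m2 = auc2; end;
      if last then do;
         rc = 2*sxy / (sxx + syy + (m1 - m2)**2);  /* concordance correlation coefficient */
         put 'Concordance correlation coefficient rc = ' rc;
         output;
      end;
   run;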

What about binary or ordinal data? Cohen's Kappa Statistic will handle this...

Cohen's Kappa Statistic for Measuring Agreement

Cohen's kappa statistic, κ, is a measure of agreement between categorical

variables X and Y. For example, kappa can be used to compare the ability of

different raters to classify subjects into one of several groups. Kappa also can

be used to assess the agreement between alternative methods of categorical

assessment when new techniques are under study.

Kappa is calculated from the observed and expected frequencies on the main diagonal of a square contingency table. Suppose that there are n subjects on

whom X and Y are measured, and suppose that there are g distinct

categorical outcomes for both X and Y. Let fij denote the frequency of the

number of subjects with the ith categorical response for variable X and the

jth categorical response for variable Y.

          Y = 1   Y = 2   ...   Y = g   Total
X = 1     f11     f12     ...   f1g     f1+
X = 2     f21     f22     ...   f2g     f2+
...       ...     ...     ...   ...     ...
X = g     fg1     fg2     ...   fgg     fg+
Total     f+1     f+2     ...   f+g     n

The observed proportion of agreement is

$$p_0 = \frac{1}{n}\sum_{i=1}^{g} f_{ii}$$

and the proportion of agreement expected by chance alone is

$$p_e = \frac{1}{n^2}\sum_{i=1}^{g} f_{i+}\, f_{+i}$$

where fi+ is the total for the ith row and f+i is the total for the ith column. The

kappa statistic is:

$$\hat{\kappa} = \frac{p_0 - p_e}{1 - p_e}$$

The statistic κ̂ estimates the population parameter

$$\kappa = \frac{\Pr[X = Y] - \Pr[X = Y \mid X \text{ and } Y \text{ independent}]}{1 - \Pr[X = Y \mid X \text{ and } Y \text{ independent}]}$$

Kappa is ideally suited for nominal (non-ordinal) categories. A weighted kappa

can be calculated for tables with ordinal categories.
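As a small hypothetical illustration of the unweighted kappa, suppose two raters classify n = 50 subjects as 'yes' or 'no': both say yes for 20 subjects, both say no for 15, rater 1 says yes while rater 2 says no for 10, and rater 1 says no while rater 2 says yes for 5. The marginal totals are 30 and 20 for rater 1 and 25 and 25 for rater 2, so

$$p_0 = \frac{20 + 15}{50} = 0.70, \qquad p_e = \frac{(30)(25) + (20)(25)}{50^2} = 0.50, \qquad \hat{\kappa} = \frac{0.70 - 0.50}{1 - 0.50} = 0.40$$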

SAS Example (19.3): Two radiologists rated the same set of patients with respect to liver lesions. The ratings were designated on an ordinal scale as:

0 = 'Normal'   1 = 'Benign'   2 = 'Suspected'   3 = 'Cancer'

SAS PROC FREQ provides an option for constructing Cohen's kappa and

weighted kappa statistics.
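A sketch of such a call is shown below; the dataset and variable names (liver, rater1, rater2, count) are assumptions, not the names used in the course program.

   /* Sketch only; dataset and variable names are assumed. */
   proc freq data=liver;
      weight count;                  /* omit if there is one record per patient */
      tables rater1*rater2 / agree;  /* AGREE requests kappa and weighted kappa for the square table */
   run;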

The weighted kappa coefficient is 0.57 and the asymptotic 95% confidence

interval is (0.44, 0.70). This indicates that the amount of agreement between

the two radiologists is modest (and not as strong as the researchers had

hoped it would be).

Note: Updated programs for examples 19.2 and 19.3 are in the folder for this

lesson. Take a look.

18.8 - Summary

In this lesson, among other things, we learned how to:

- distinguish among the Pearson correlation, Spearman rank correlation, Kendall's tau-b and Cohen's Kappa statistics;
- use SAS to calculate these correlation and agreement coefficients and interpret the results;
- construct Fisher's Z-based confidence intervals for the Pearson and Spearman correlations and for Kendall's tau-b;
- assess agreement using the concordance correlation coefficient and Cohen's kappa;
- choose the appropriate correlation or agreement statistic based on the type of data used for each.

Let's put what we have learned to use by completing the following homework

assignment:

Homework

Look for the homework assignment and the dropbox in the folder for this week in ANGEL.