
Interpreting and reporting clinical trials with results of borderline significance

Amy Kirkwood and Professor Allan Hackshaw
CRUK and UCL Cancer Trials Centre

What is the problem?


After submitting several phase III trials to high-profile journals, we noticed a disparity in the language they allowed us to use when forming conclusions about borderline results. Some initially stated that a p-value just above 0.05 indicated that there was no effect, despite a clinically relevant effect size. Should these results really just be ignored?

Why do we stick to a 0.05 limit?


The cut-off of 0.05 was first suggested by RA Fisher in 1925 as being low enough to make decisions. It has become widely adopted. It is an arbitrary cut-off, but many researchers seem to adhere to it strictly.

Examples
Relative risk of 0.75 (95% CI: 0.57-0.99), p-value = 0.048. Clear evidence of an effect?
Relative risk of 0.75 (95% CI: 0.55-1.03), p-value = 0.07. No effect?

Size of treatment effects


In the past, many new interventions were compared to minimal or no treatment, so we were looking for (and finding) large treatment effects. Today, new interventions are being compared to standard treatments that already work well, so smaller differences are expected. It is therefore not as easy to get very small p-values.

How are p-values determined?


The size of a p-value depends on:
The size of the treatment effect (e.g. hazard ratio, relative risk, absolute risk difference, or mean difference).
The size of the standard error, which is influenced by the number of subjects, the number of events* and the standard deviation*.

Interpretation
Very small p-values (easy to interpret) arise when the effect size is large and the standard error is small.
Borderline p-values arise when either:
We have a clinically meaningful treatment effect but a moderate or large standard error (usually when there are insufficient participants or events),
or
The treatment effect is smaller than expected (we should have had a larger trial).
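The dependence of the p-value on both the effect size and the standard error can be sketched in a few lines. This is an illustration only: it assumes a Wald test on the log hazard ratio and the common approximation SE(log HR) ≈ 2/sqrt(number of events) for a trial with 1:1 randomisation. The function names and the event counts are hypothetical and are not taken from any trial in this talk.

```python
import math
from statistics import NormalDist


def two_sided_p(effect, se):
    """Two-sided p-value for a normally distributed estimate (Wald test)."""
    z = effect / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def p_for_hazard_ratio(hr, events):
    """p-value for a hazard ratio.

    Assumption: SE(log HR) is approximated by 2 / sqrt(events),
    which is reasonable for 1:1 allocation.
    """
    se = 2 / math.sqrt(events)
    return two_sided_p(math.log(hr), se)


# Same hazard ratio, increasing numbers of events (hypothetical counts):
# the effect size is fixed, only the standard error changes.
for events in (100, 250, 500, 1000):
    print(events, round(p_for_hazard_ratio(0.83, events), 3))
```

With the hazard ratio held fixed, the p-value falls as the number of events rises, because only the standard error is changing.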

An example: EICESS-92 phase III trial


A trial comparing standard chemotherapy with or without etoposide for treating Ewing's sarcoma at high risk of recurrence or death (mainly a childhood cancer).
Primary endpoint: event-free survival.
Powered to detect a hazard ratio of 0.60.
Sample size: 400 patients (but 492 actually recruited).

The EICESS-92 Phase III Trial


Observed HR: 0.83 (95% CI: 0.65-1.05), p=0.12.
p>0.05: should we conclude no effect?
But a 17% risk reduction is clinically significant, although smaller than the 40% initially expected.
How do we interpret this?

EICESS-92: The Confidence Interval


Most people understand that the true effect is likely to lie somewhere within the CI range - hence the possibility of it being 1.0 (no effect). But there is a common misconception that it lies anywhere within this range with equal probability.

EICESS-92: The Confidence Interval


The true HR is more likely to lie around the estimated HR (0.83) than at the extremes of the confidence interval.

EICESS-92: The Confidence Interval


There is a 50% chance that the range 0.77 to 0.90 contains the true hazard ratio.

Similarly, there is a 75% chance that the range 0.72 to 0.95 contains the true HR.
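Narrower intervals of this kind can be approximately recovered from the reported 95% CI. The sketch below assumes that the log hazard ratio is normally distributed, so that the standard error can be back-calculated from the width of the 95% CI on the log scale. The function names are hypothetical, and because the published HR and CI are rounded, the output may differ from the slide values in the second decimal place.

```python
import math
from statistics import NormalDist


def log_se_from_ci(lower, upper, level=0.95):
    """Standard error of log(HR), back-calculated from a confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (math.log(upper) - math.log(lower)) / (2 * z)


def interval(hr, se, level):
    """Interval for the HR at the given level, built on the log scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = math.log(hr)
    return math.exp(centre - z * se), math.exp(centre + z * se)


# EICESS-92: HR 0.83 (95% CI 0.65-1.05)
se = log_se_from_ci(0.65, 1.05)
for level in (0.50, 0.75, 0.95):
    lo, hi = interval(0.83, se, level)
    print(f"{level:.0%} interval: {lo:.2f} to {hi:.2f}")
```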

EICESS-92: The Confidence Interval


The upper limit of the confidence interval is 1.05 and only just exceeds 1.0.

There is only a 6% chance that the true HR is 1.0 or greater.
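The figure of about 6% can be checked with the same normal approximation on the log scale. This is a sketch under that assumption, reading the estimate in the same way as the slide does; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def prob_hr_at_least(threshold, hr, ci_lower, ci_upper):
    """Approximate chance that the true HR is at or above `threshold`.

    Assumption: log(HR) is normal, centred on the estimate, with the SE
    back-calculated from the reported 95% CI.
    """
    z95 = NormalDist().inv_cdf(0.975)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z95)
    dist = NormalDist(mu=math.log(hr), sigma=se)
    return 1 - dist.cdf(math.log(threshold))


# EICESS-92: HR 0.83 (95% CI 0.65-1.05)
print(round(prob_hr_at_least(1.0, 0.83, 0.65, 1.05), 2))
```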

The conclusion reported in the paper was that the addition of etoposide seemed to be beneficial.
This is the only randomised trial of etoposide in these children.
The disorder is uncommon: it took 6.5 years to recruit 492 patients across Europe. Another trial is unlikely.
Although the target sample size was exceeded, the treatment effect was smaller than expected (HR 0.83 vs 0.60), which is probably why the result was not statistically significant (i.e. the trial was not big enough).
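To see why a smaller-than-expected effect leaves a trial underpowered, the number of events needed can be sketched with Schoenfeld's approximation for a two-arm trial with 1:1 allocation: events = 4 (z_alpha/2 + z_beta)^2 / (ln HR)^2. The two-sided alpha of 0.05 and power of 80% used here are assumptions for illustration, not the design parameters of EICESS-92, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def events_required(hr, alpha=0.05, power=0.80):
    """Approximate number of events needed to detect `hr`.

    Schoenfeld's formula, 1:1 allocation, two-sided test.
    alpha and power are illustrative assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hr) ** 2)


print("HR 0.60:", events_required(0.60))
print("HR 0.83:", events_required(0.83))
```

Under these assumptions, detecting an HR of 0.83 needs about (ln 0.60 / ln 0.83)^2, or roughly 7.5 times, as many events as detecting an HR of 0.60.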

Are these sorts of results common, and how are they reported?

The Literature Search


We conducted a literature search to see how often trials with borderline results arose and how they were reported. We looked through every issue of 6 major journals in 2009. The journals chosen were:
The BMJ
The Lancet
JAMA
New England Journal of Medicine
Journal of the National Cancer Institute
Journal of Clinical Oncology

The Literature Search


To be selected a paper had to:
Report the results of a phase III randomised trial.
Have borderline results for the primary outcome measure.

What counted as borderline?


To count as a borderline result we needed to see:
A non-zero effect size AND
a p-value between 0.05 and 0.1 OR
one end of the 95% confidence interval close to the no-effect value (e.g. for ratios, the upper limit of the CI had to be <1.1 or the lower limit >0.90).
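These selection rules can be written as a small check for ratio measures (hazard ratio, odds ratio, relative risk). This is one possible reading of the criteria, not the exact procedure used in the search: it assumes that "close to the no-effect value" means the 95% CI includes 1.0 and one limit lies within 0.1 of it, and that the p-value range includes its end points. The function and parameter names are hypothetical.

```python
def is_borderline(estimate, ci_lower, ci_upper, p_value=None,
                  null_value=1.0, margin=0.1):
    """Flag a ratio-scale result as borderline.

    Assumed reading of the criteria:
      - the effect size differs from the no-effect value, AND
      - either 0.05 <= p <= 0.10,
        or the 95% CI includes the no-effect value with one limit
        less than `margin` away from it.
    """
    if estimate == null_value:
        return False
    p_rule = p_value is not None and 0.05 <= p_value <= 0.10
    includes_null = ci_lower <= null_value <= ci_upper
    ci_rule = includes_null and (
        ci_upper - null_value < margin or null_value - ci_lower < margin
    )
    return p_rule or ci_rule


# EICESS-92: p=0.12 is outside the p-value range, but the upper limit (1.05) is close to 1.0
print(is_borderline(0.83, 0.65, 1.05, p_value=0.12))
```

For risk differences or mean differences the no-effect value is 0 and a different margin would be needed; the slides give thresholds only for ratios.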

Literature search results


Below is a table showing the numbers of phase III trials found and the number with borderline p-values.

Journal       Number of phase III trials   Number with borderline p-values
BMJ           44                           2
The Lancet    64                           3
JAMA          40                           2
NEJM          70                           8
JNCI          6                            3
JCO           64                           6
Total         288                          24 (1 in 12)

Literature Search Results


We examined the conclusion given in the abstract because this is what most people focus on. Was the language used appropriate? Some authors discussed their results further in the Discussion section.

Literature Search Results


Conclusion             Number of studies   Range of p-values
No effect              10                  0.06 - 0.17
Some evidence          11                  0.06 - 0.13
Confidence in effect   3                   0.056 - 0.1

Example 1
Interventions and patient group: Nurse-led psychoeducational intervention versus usual care for palliative care in patients with advanced cancer
Primary endpoint: Symptom intensity, assessed by an assessment scale (quality of life and resource use were other endpoints); N=322
Main result: Mean difference in scores -27.8 (95% CI -57.2 to +1.6), P=0.06
Conclusion reported in the Abstract: Those receiving nurse-led intervention had higher scores for quality of life and mood, but did not have improvements in symptom intensity scores

Bakitas et al, JAMA 2009;302:741-9.

Example 2
Interventions and patient group: Tailored care plan versus usual care in patients with coronary heart disease
Primary endpoint: Patients with systolic blood pressure >140 mm Hg at 18 months (hospital admission was another endpoint); N=903
Main result: Odds ratio 0.66 (95% CI 0.43 to 1.01), P=0.06
Conclusion reported in the Abstract: Admissions to hospital were significantly reduced but no other clinical benefits were shown

Murphy et al, BMJ 2009;339:b4220.

Example 3
Interventions and patient group: Pre-surgical chemoradiotherapy versus chemotherapy in patients with locally advanced cancer of the esophagogastric junction
Primary endpoint: Overall survival; N=126 (target was 576)
Main result: Hazard ratio 0.67 (95% CI 0.41 to 1.07), P=0.07
Conclusion reported in the Abstract: Although statistical significance was not achieved, results point to a survival advantage for preoperative chemoradiotherapy

Stahl et al, J Clin Oncol 2009;27:851-6.

Example 4
Interventions and patient group: Aerobic exercise training plus usual care versus usual care alone, in patients with chronic heart failure
Primary endpoint: All-cause mortality or hospitalisation; N=2331
Main result: Hazard ratio 0.93 (95% CI 0.84 to 1.02), P=0.13
Conclusion reported in the Abstract: exercise training resulted in non-significant reductions in the primary endpoint.

O'Connor et al, JAMA 2009;301:1439-50.

Example 5
Interventions and patient group: Artesunate suppository versus placebo in patients with severe malaria who cannot be treated orally; N=12,068
Primary endpoint: Mortality
Main result: Risk difference -0.4% (95% CI -1.0% to +0.2%), P=0.1
Conclusion reported in the Abstract: ...a single inexpensive artesunate suppository substantially reduces the risk of death or permanent disability

Gomes et al, Lancet 2009;373:557-66.

Example 6
Interventions and patient group: Telephone counselling using cognitive behavioural skills vs. no intervention to encourage smoking cessation in adolescents; N=2151
Primary endpoint: 6-month prolonged abstinence from smoking
Main result: Absolute risk difference 4.0% (95% CI -0.2% to 8.1%), P=0.06
Conclusion reported in the Abstract: personalized motivational interviewing...is effective in increasing teen smoking cessation

Peterson et al, J Natl Cancer Inst 2009;101:1378-92.

Papers with borderline negative results


What if a new intervention appears to show harm but has a borderline p-value? Perhaps authors would be inclined to be firmer with conclusions than if a new intervention shows possible benefit? We found two such papers where the authors only concluded that it did not show benefit.

Papers with borderline negative results


Trial of calcium dobesilate vs placebo for the prevention of clinically significant macular oedema (CSME) in 635 patients with type 2 diabetes.
86 patients in the calcium dobesilate group and 69 in the placebo group developed CSME.
Hazard ratio 1.32 (95% CI 0.96-1.81), p=0.08.
Conclusion reported: Calcium dobesilate did not reduce the risk of development of CSME.

Borderline results elsewhere


We were sent this table after our paper was published, showing the results and conclusions from papers on statins and mortality.
Meta-analysis                           Risk estimate (95% CI)   Authors' conclusion
Arch Intern Med 2005;165:725-730        0.87 (0.81-0.94)         Decreases mortality
Arch Intern Med 2006;166:2307-2313      0.92 (0.84-1.01)         No effect
J Am Coll Cardiol 2008;52:1769-81       0.93 (0.87-0.99)         Decreases mortality
BMJ 2009;338:b2376                      0.88 (0.81-0.96)         Decreases mortality
Arch Intern Med 2010;170:1024-1031      0.91 (0.83-1.01)         No effect

Possible solutions
Design trials with larger numbers.
But this is not always feasible (e.g. high costs or a rare disorder).
However, even a relatively large trial can produce an effect size smaller than expected (Ewing's sarcoma example).
Meta-analyses. Example (doublet chemotherapy for pancreatic cancer):
One trial: HR 0.86, 95% CI 0.72-1.02, p=0.08
Meta-analysis of 3 trials: HR 0.86, 95% CI 0.75-0.98, p=0.02
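The way pooling narrows a confidence interval can be sketched with a fixed-effect, inverse-variance meta-analysis on the log hazard ratio scale. This is an assumed method for illustration and not necessarily the one used in the cited meta-analysis. The results of the individual pancreatic cancer trials are not given here, so the second call simply pools three copies of the one reported trial, a purely hypothetical input that shows the mechanics only. Function names are hypothetical.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def pool_fixed_effect(trials):
    """Fixed-effect inverse-variance pooling.

    trials: list of (hr, ci_lower, ci_upper) with 95% CIs.
    Returns (pooled_hr, ci_lower, ci_upper, two_sided_p).
    """
    weights = []
    weighted_logs = []
    for hr, lower, upper in trials:
        # SE of log(HR) back-calculated from the 95% CI
        se = (math.log(upper) - math.log(lower)) / (2 * Z95)
        w = 1 / se ** 2
        weights.append(w)
        weighted_logs.append(w * math.log(hr))
    pooled = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    p = 2 * (1 - NormalDist().cdf(abs(pooled / pooled_se)))
    return (
        math.exp(pooled),
        math.exp(pooled - Z95 * pooled_se),
        math.exp(pooled + Z95 * pooled_se),
        p,
    )


single = (0.86, 0.72, 1.02)  # the one trial reported above
print(pool_fixed_effect([single]))
# Hypothetical: three identical trials, to show the interval narrowing
print(pool_fixed_effect([single] * 3))
```

The pooled hazard ratio stays at 0.86, but the standard error shrinks as trials are added, so the interval narrows and the p-value falls.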

Conclusions
Borderline results cannot be used as strong evidence either in favour of or against an intervention.
But do not completely dismiss an effect if p>0.05 when the treatment effect is clinically meaningful.
Do not conclude no effect; look at other endpoints and other evidence.
A lack of statistical significance does not mean lack of an effect (Altman & Bland, BMJ 1995).

Conclusions
Say that there is probably evidence of an effect, but use appropriate language, e.g. words such as suggestion, indication and seems.
The same principles apply to other areas of research (e.g. risk factors).
