The Significance of Statistical Significance

What is “Statistical Significance” and Who Cares Anyway?
If you are a cancer patient, the phrase “statistical significance” is certainly not one of the first
things you are likely to think about. It’s probably not the first thing you want to think
about either! But if you start getting into the data in depth and read papers and technical
articles, it’s going to come up again and again. You will need to understand statistical
significance if:

 You read the technical literature and see (as you are bound to):

o Results characterized as statistically significant (or not).

o References to “p values” like “p=.01” or “p<.05”.

 The phrase, “statistically significant” comes up in conversation with your doctor.

Statistically significant may not mean what you think it does. You might tell your doctor
that treatment A had a response rate of 67% and treatment B 77% in some paper you
read, and he might say, “I know about that, but the difference is not statistically
significant.” Does that mean the higher response rate doesn’t matter? Does it mean that
Treatment B is really no better than treatment A? It would be easy to assume either of
these, but actually the answer is, “none of the above”! Read on to learn what “statistical
significance” is really about.

Statistical Significance is about the Reliability of Conclusions
Statistical significance is about deciding whether differences observed between groups
in experiments are “real” or whether they might well just be due to chance. The groups
can be groups of patients who were given different treatments as in a randomized trial.
They can also be groups of patients with different characteristics, rather than people
who were treated differently. For instance, you might want to compare survival in male
and female patients with the same cancer or the survival of patients with different
stages of some cancer.

Examples
It may not be obvious exactly how the question of whether observed differences are real
or might be merely due to chance actually arises, so here are some examples which
should give you some intuition:

Suppose that in a clinical trial patients were randomly assigned to either Treatment A or
Treatment B. Suppose that so far, Treatment A has a cure rate of 100% and Treatment
B has a cure rate of only 50%. That’s a pretty dramatic sounding difference! But is the
difference real, or is the difference just due to chance? Well, suppose that so far two
patients have been assigned to each treatment and both were cured with treatment A,
but only one of the two was cured with treatment B. Maybe the next patient assigned to
treatment A will not be cured and the next patient assigned to treatment B will be. Now,
with just two more patients, the cure rate is 2 out of 3, or about 67%, for both treatments. What
does all this tell you about the “real” cure rate for these treatments? Can you conclude
that they are both the same, 67%? Probably not! With so few patients you should be
feeling pretty unsure about what the real cure rates are.

But what if you treated 100 patients with A, and 100 with B, and the cure rate was still
100% for treatment A and still only 50% for B? I bet you would conclude that there was
a major and very real difference between A and B and you would certainly opt for A,
given the choice!
Now suppose with 100 patients in each group the cure rates were 66% and 60% for A
and B. Is this a real difference, or is the difference just due to chance? It’s not so
obvious! Here is where statistical significance comes to the rescue.

Intuitions
These examples bring out some basic intuitions which are behind the idea of statistical
significance:

 The bigger the difference between groups the less likely it is that it’s due to chance.

 The larger the sample size (number of patients) the more likely it is that the observed
difference is close to the actual difference. This is an example of the “law of large
numbers.”

These two principles interact, so that if you observe a large difference between the
groups, the groups could be relatively small before you conclude that the difference is
likely to be real, and not due to chance alone. Similarly if the observed difference is
small you would need more patients before you are convinced that it’s probably not due
to chance alone.
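The second intuition can be seen in a small simulation. This is only a sketch: the 60% "true" cure rate, the group sizes, and the function names are made up for illustration and do not come from any real trial. It runs many imaginary trials of one treatment and reports how far the observed cure rate wandered from trial to trial.

```python
import random

def observed_cure_rate(true_rate, n_patients, rng):
    """Simulate one group of patients and return the fraction cured."""
    cured = sum(1 for _ in range(n_patients) if rng.random() < true_rate)
    return cured / n_patients

def spread_of_observed_rates(true_rate, n_patients, n_trials=500, seed=1):
    """Return the (lowest, highest) observed cure rate over many simulated trials."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rates = [observed_cure_rate(true_rate, n_patients, rng) for _ in range(n_trials)]
    return min(rates), max(rates)

if __name__ == "__main__":
    # Hypothetical treatment whose real cure rate is 60%.
    for n in (2, 20, 200, 2000):
        low, high = spread_of_observed_rates(0.60, n)
        print(f"{n:>5} patients: observed cure rate ranged from {low:.0%} to {high:.0%}")
```

With only a handful of patients the observed rate can land almost anywhere. With thousands it stays close to the real rate.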

Statistical testing, “p Values”, and Statistical Significance
Roughly speaking, statistical testing uses mathematical procedures to examine
particular differences between groups to see if it is likely that the observed difference
could have arisen by chance alone. If it is unlikely enough that the difference would
have arisen by chance alone, the difference is “statistically significant.”

More precisely, statistical testing works by assuming that the groups are actually the
same – that there is no difference – and then mathematically estimating the probability
that you would see a difference between groups at least as big as the one you actually
saw – just due to chance. This probability is called ‘p’ and is referred to as a “p-value”.
Mathematical probabilities, like p values, range from 0 to 1 where 0 means no chance
and one means certainty. 0.5 means a 50% chance and 0.05 means a 5% chance.
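To make the idea of a p value concrete, here is a minimal sketch of one well-known procedure, Fisher's exact test, applied to the cure-rate examples from earlier. It is an illustration only, not necessarily the test any particular paper would use, and the function name is made up here. It assumes the cures were dealt out to the two groups purely at random, then adds up the probability of every split that is no more likely than the one actually seen.

```python
from math import comb

def fisher_exact_two_sided(cured_a, n_a, cured_b, n_b):
    """Two-sided Fisher's exact test p value for cured / not cured in two groups."""
    total_cured = cured_a + cured_b
    total = n_a + n_b

    def prob(k):
        # Probability that exactly k of the cured patients land in group A
        # if there is no real difference between the treatments.
        return comb(n_a, k) * comb(n_b, total_cured - k) / comb(total, total_cured)

    observed = prob(cured_a)
    lowest = max(0, total_cured - n_b)
    highest = min(n_a, total_cured)
    # Sum every possible split that is no more likely than the observed one.
    # The tiny tolerance guards against floating point ties.
    p = sum(prob(k) for k in range(lowest, highest + 1)
            if prob(k) <= observed * (1 + 1e-9))
    return min(p, 1.0)

if __name__ == "__main__":
    print("2 of 2 vs 1 of 2:        p =", fisher_exact_two_sided(2, 2, 1, 2))
    print("100 of 100 vs 50 of 100: p =", fisher_exact_two_sided(100, 100, 50, 100))
    print("66 of 100 vs 60 of 100:  p =", fisher_exact_two_sided(66, 100, 60, 100))
```

The two-patient example gives p = 1, meaning the split tells you nothing. The 100% versus 50% example with 100 patients per group gives a vanishingly small p. The 66% versus 60% example lands well above .05.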

When it comes to p values, bigger is not usually better!

The usual case is that you are hoping a difference between groups is real and not due
to chance. In this case, smaller p-values are better than larger ones. p=.05 is better than
p=.10. If you are hoping there is no real difference, then larger p values are better
than smaller ones. This might be the case if you were hoping less radical surgery was
just as good as more radical or disfiguring surgery. In this case p=.5 is better than
p=.05! Often in a randomized trial various characteristics of the two groups are
compared in the hopes that they are not significantly different. Again in this case bigger
is better!

If the p-value is relatively large, so that the chances are relatively high that the
difference could have arisen by chance alone then the results are at least consistent
with the idea that there is no real difference between the groups for the characteristic
being tested. But if p is very small, then the results are not consistent with the idea that
there was no real difference between the groups, or at least it is very unlikely that there
is no difference.

By convention if p<.05, the difference is said to be “statistically significant.” Again, this
means that if there was no true difference the probability of seeing a difference at least
as big as the one you actually saw by chance alone is less than 5%. Roughly speaking,
“p<.05” means that a difference this big would turn up less than 5 percent of the time if
chance alone were at work.

The choice of .05 (or a 1 in 20 false positive rate) as the usual value to declare
statistical significance is arbitrary – scientists do want to have a high confidence in their
conclusions but attaining ever smaller p values requires larger samples, more time and
expense, and possibly subjecting people to treatments which the data already indicates
are very likely to be inferior.

As a cancer patient, you may choose to make different judgements about what p value
you want to use to guide your decision making – suppose two treatments have about
the same side effects, and one looks like 5 year survival may be better than the other,
but p=.06 – not quite statistically significant. Would you decide it didn’t matter which
treatment you got because the chance of a false positive is six percent instead of five? I
wouldn’t! (Note that p between .10 and .05 is often referred to as a “trend” – a marginal
result). You might decide to choose based on even higher values such as p=.2 (20%
chance of false positive), but at some point you would conclude the wisp of evidence
favoring one treatment isn’t enough to believe it. You need to consider not only the p
value but also the totality of circumstances – how different the treatments are in side
effects, how big the difference in the results was and so forth. Man does not make
rational decisions on p values alone!

While I don’t intend to get into the mathematics of statistical significance (I couldn’t even
if I wanted to!) the basic intuition I described earlier can be recast in the terms of
statistical testing:

 The larger the sample size, the smaller an observed difference has to be in order to be
statistically significant.
 The smaller the sample size, the larger an observed difference would have to be in
order to be statistically significant.
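A rough way to see both statements at once is to hold the observed difference fixed and change only the number of patients. The sketch below uses a standard normal-approximation test for two proportions, which is an assumption made here for simplicity (a real trial might well use a different test), and the function name is made up for illustration. It reuses the 66% versus 60% cure rates from the earlier example.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(rate_a, rate_b, n_per_group):
    """Approximate two-sided p value for a difference between two observed rates.

    Assumes two groups of equal size and uses the normal approximation.
    """
    pooled = (rate_a + rate_b) / 2  # combined rate if the groups are really the same
    std_error = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = abs(rate_a - rate_b) / std_error
    return 2 * (1 - NormalDist().cdf(z))

if __name__ == "__main__":
    # Same observed cure rates every time; only the group size changes.
    for n in (50, 100, 500, 1000, 5000):
        p = two_proportion_p_value(0.66, 0.60, n)
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"{n:>5} per group: p = {p:.4f} ({verdict})")
```

The same six point difference is nowhere near significant with 100 patients per group but is clearly significant with 1000.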

If you design a clinical trial to detect a possible difference between treatments, you need
to have a large enough sample size so that if there is a difference big enough that you
care about it, you are actually likely to find that difference to be statistically significant
when you are done with the trial. Some intuitions about the number of patients in a
clinical trial:

 The smaller the real difference is, the more patients you need to be likely to detect a
statistically significant difference in a clinical trial.

 The larger the real difference is, the fewer patients you need to be likely to detect a
statistically significant difference in an actual clinical trial.

Perhaps the above suggests to you that even when a trial is designed to detect a
difference of a given size, and even when that difference is really there, it is possible
that the trial will not result in a statistically significant difference – that is, there is a
chance the trial will be a false negative. The chance of a false negative can be
calculated given the size of the trial and the size of the minimum difference of interest.

Statistics and Statistical Tests


I have been talking about conducting statistical tests on a “difference” between groups
without being very specific about what kind of difference is tested. Take survival. You
could test the difference between the average or mean survival, or (more commonly in
cancer research) the median survival, or the five or ten year survival, or any number of
other statistics. The point is that you test for a difference in a particular statistic like the
mean or the median. A difference in median survival may be statistically significant but
for the same experiment the difference in survival at five years may not be statistically
significant.
More of the Terminology of Statistical Significance
You will actually see these terms and concepts in papers in the medical literature!

The Null Hypothesis


The “Null Hypothesis” is the hypothesis that the treatments (or characteristics) being
compared are all the same. The scientific approach is not to assume one thing is better
than another until there is reliable evidence that it is. So the “null hypothesis” is
accepted by default. Only if enough evidence accumulates – in the form of a statistically
significant difference from an experiment – is the null hypothesis “rejected”. So
“rejecting the null hypothesis” just means finding a statistically significant difference.

Power
When a clinical trial is designed it is important to have a large enough group of patients
(sample size) so that if there is a difference between treatments big enough that you
care about it, you are likely to get a statistically significant difference from the actual
trial. You can never be guaranteed that you will see a statistically significant difference
even if there is a real difference, since just by chance you might not happen to get
results for the better treatment as good as its real success rate, or you might happen to
get results for the less effective treatment that are better than its real success rate. The
larger the sample size the lower the chance this will happen.

Given the minimum difference you want to detect, and the p value you require to
declare the results statistically significant (usually .05), you can calculate, for any
sample size, the probability that you will detect a statistically significant difference. This
probability is the power of the experiment. In practice, what is done is to figure out what
sample size is needed to achieve a desired power – usually between 0.8 and 0.9 (an 80
to 90 percent chance of getting a statistically significant difference, if there is a real
difference of at least the specified size). A trial which doesn’t have enough patients is
often called “underpowered” in the literature. Underpowered trials risk not finding real
differences even when they are there.
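Here is a rough sketch of the kind of power calculation described above. It rests on assumptions that are mine, not the article's: two equal-sized groups, a yes/no outcome such as cured versus not cured, and a textbook normal approximation. The function names and the 66% versus 60% rates are illustrative only, and real trial design software will give somewhat different numbers.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(rate_a, rate_b, n_per_group, alpha=0.05):
    """Approximate chance of a significant result when the real rates are as given."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    pooled = (rate_a + rate_b) / 2
    se_null = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    se_real = sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n_per_group)
    # Ignores the tiny chance of a significant result in the wrong direction.
    return norm.cdf((abs(rate_a - rate_b) - z_crit * se_null) / se_real)

def patients_needed(rate_a, rate_b, target_power=0.8, alpha=0.05):
    """Smallest group size whose approximate power reaches the target."""
    if rate_a == rate_b:
        raise ValueError("no difference to detect")
    n = 2
    while approximate_power(rate_a, rate_b, n, alpha) < target_power:
        n += 1
    return n

if __name__ == "__main__":
    print("Power with 100 per group:", round(approximate_power(0.66, 0.60, 100), 2))
    print("Patients per group for 80% power:", patients_needed(0.66, 0.60))
```

Under these assumptions a trial with 100 patients per group would be badly underpowered for a real 66% versus 60% difference, and something on the order of a thousand patients per group would be needed for 80% power.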

Type I and II errors


You will sometimes see references to “Type I” or “Type II” errors.

 A Type I Error is a false positive – deciding that there is a real difference when in fact
there is no difference. If results are declared significant when p<.05, then there is a 5%
chance of a Type I error whenever there is really no difference. Type I error
is also called “alpha error”.

 A Type II Error is a false negative – that is, there is a real difference but the statistical
test fails to show the difference to be statistically significant. The chance of a Type II
error is one minus the power of the experiment. Type II error is also called “beta
error”.
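The Type I error rate can be checked by simulation. In this sketch both treatments have exactly the same real cure rate, so every "significant" result is a false positive. The 60% cure rate, the group size of 100, the normal-approximation test, and the function names are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def trial_is_significant(cured_a, cured_b, n_per_group, alpha=0.05):
    """Normal-approximation test for two equal-sized groups."""
    pooled = (cured_a + cured_b) / (2 * n_per_group)
    if pooled in (0.0, 1.0):
        return False  # both groups had identical all-or-nothing results
    std_error = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = abs(cured_a - cured_b) / n_per_group / std_error
    return 2 * (1 - NormalDist().cdf(z)) < alpha

def false_positive_rate(true_rate=0.6, n_per_group=100, n_trials=2000, seed=7):
    """Fraction of simulated trials that come out 'significant' with no real difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        cured_a = sum(rng.random() < true_rate for _ in range(n_per_group))
        cured_b = sum(rng.random() < true_rate for _ in range(n_per_group))
        hits += trial_is_significant(cured_a, cured_b, n_per_group)
    return hits / n_trials

if __name__ == "__main__":
    print("False positive rate:", false_positive_rate())
```

The fraction printed should come out in the neighborhood of 5%, which is what the .05 threshold promises.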

Names of Statistical Tests


Different statistical tests are used depending on what kind of data is being tested (for
instance something with a discrete outcome like responded versus didn’t respond
requires a different test than something with a continuous outcome like survival time)
and what statistic is being tested – say the mean or the median. Each of these tests is a
mathematical procedure (often complex) which has a unique set of assumptions. I do
not intend to try to get into the details of which test is used when, but here are the
names of a few of the many statistical tests you may see used in research papers.

 Fisher’s exact test

 Student’s t test

 Chi-Squared test

 Mann-Whitney test
 Log Rank Test

Pitfalls
There are a lot of gotchas in interpreting statistics of all kinds and statistical testing
seems to have a particularly large number. So by the time you finish this next section
you might think “statistically significant” is completely insignificant! Nothing could be
further from the truth. It’s just that you have to be very careful about the conclusions you
draw from statistics.

Statistically Significant Doesn’t Mean it Matters!
There are several reasons why statistically significant doesn’t necessarily mean
significant to you. There is always the question of whether the right thing is being tested.
For instance, in cancer treatment, the “response rate” is very roughly the percentage of
patients who experience at least a total 50% tumor shrinkage with treatment. The
problem is that “responses” can be, and very often are, only temporary. So if Treatment A
has a response rate of 40% and treatment B a response rate of 50%, and this difference
is statistically significant, does that mean B is “better” than A? It certainly doesn’t if the
responses for A seem to be very long lasting and the responses to B seem to be
very short lived! At least you should look to see if there appears to be any difference in
the quality of the response! Also you have to look at the big picture: What about quality
of life? If treatment B is extremely toxic and difficult, while A is easy and non-toxic, then
even if there is a real difference in a meaningful outcome in favor of B, it’s a personal
judgement whether it’s worth paying the price of the increased side effects for a
relatively small increase in benefit.

It’s also important to realize that “statistically significant difference” does not mean “big
difference”. If two treatments are very similar in outcome, but not exactly the same, you
can find a statistically significant difference by just testing enough people. In general the
more people you include in a trial, the smaller a difference needs to be before that
difference proves to be statistically significant. So if in some trial, treatment A has a cure
rate of 52% and treatment B 54%, then even if they tested enough patients to make this
difference “statistically significant” you are not likely to decide that B is really much
better, and very likely other characteristics of A and B such as side effects would guide
your choice between them.
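To get a feel for how many patients "enough" is, the sketch below searches for the group size at which an observed 52% versus 54% cure rate first becomes statistically significant. It assumes equal-sized groups and a normal-approximation test, which are simplifications made here, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(rate_a, rate_b, n_per_group):
    """Approximate two-sided p value for two observed rates in equal-sized groups."""
    pooled = (rate_a + rate_b) / 2
    std_error = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = abs(rate_a - rate_b) / std_error
    return 2 * (1 - NormalDist().cdf(z))

def smallest_significant_group_size(rate_a, rate_b, alpha=0.05, step=100):
    """First group size (in steps of `step`) at which the difference has p < alpha."""
    if rate_a == rate_b:
        raise ValueError("identical rates never become significant")
    n = step
    while two_proportion_p_value(rate_a, rate_b, n) >= alpha:
        n += step
    return n

if __name__ == "__main__":
    n = smallest_significant_group_size(0.52, 0.54)
    print(f"52% vs 54% becomes significant at about {n} patients per group")
```

Under these assumptions it takes several thousand patients in each group before a two point difference crosses the p<.05 line, and the difference is still only two points.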

Statistically Significant Doesn’t Mean the Difference was Due to Treatment Differences!
Suppose you have a trial in which people decide whether they would prefer to get
treatment A or treatment B, and then are given the treatment of their choice. You
observe that the patients who choose treatment A have a statistically significant
advantage in survival compared to those who chose treatment B. But suppose
treatment A and B are really no different in their effect on survival, but A has more
difficult side effects. Perhaps the patients who chose B tended to be sicker so they
decided they would prefer a treatment with easier side effects, while conversely if A is
new perhaps patients who are healthier might tend to decide to go for the “new hope”
more often despite the side effects.

Similarly, if a group of patients getting the new treatment, Treatment A, is compared to a
group of patients who got Treatment B, the old treatment, in the past (a so-called
“historical control”) it is possible that with new technology, patients in the current group
tended to be diagnosed earlier in the course of their disease than patients who in the
past got the old treatment. So even if A and B have the same effect on the disease it is
possible that patients who got the new treatment may appear to survive longer than
patients who got the old one.

Finally, not all statistical testing is about testing some intervention – like a
treatment. Often epidemiological studies will look to see if there is an association
between some characteristic or behavior, and the chance of getting or
surviving some disease. It might be seen that people who drink coffee are more likely to
get lung cancer than those who do not and the difference in lung cancer rates between
coffee drinkers and non coffee drinkers might be found to be statistically significant. But
this doesn’t mean coffee drinking causes lung cancer! Instead coffee drinking might be
associated with some other characteristic which pre-disposes people to lung cancer,
such as smoking. In this case it’s obvious and statisticians would try to “adjust” for
smoking, but in general, no one knows all of the causes of any form of cancer, so it
could be an association of the tested characteristic with some completely unknown risk
factor which is actually causing the difference.
In sum, you need to look carefully at the study design to see if it is possible that
something other than treatment differences might explain the difference in results.
Noticing that a trial has a historical control rather than a randomized design should
reduce your confidence in the results. But you may still choose to make decisions based
on uncertain information. After all, that’s the only kind of information there is!
Sometimes, the trick is figuring out which information is least uncertain and that’s where
awareness of things like randomized designs versus historical control designs helps.

Statistically Significant doesn’t mean the real difference is the observed difference!
“Statistically significant” means that it is unlikely the treatments have the same success
rate, but it’s quite likely that the real difference between the treatments is not exactly the
observed difference. So if Treatment A has a 57% success rate and Treatment B, a
46% success rate, and the difference is statistically significant, this does not mean
these are the exact true success rates. It only means that it’s unlikely that the
advantage of treatment A is entirely due to chance. The actual difference could either
be larger or smaller. All “statistically significant” tells you is that there is probably some
difference that is not due to chance alone.
This does not mean that the difference is equally likely to be any size other than zero.
Usually it is likely to be close to the observed difference and this is more true the larger
the trial. The larger the sample sizes, the greater the reliability of the estimate of the
success rate of each treatment, and the less likely it is that the observed success rates
differ much from the real ones. A related statistical concept called “confidence intervals”
(which will be the subject of a future article) can give better insight into how big the
difference might really be.
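As a rough preview of what a confidence interval looks like, here is a sketch for the 57% versus 46% example. The article does not say how many patients were in each group, so the 200 per group used below is purely an assumption, as are the function name and the simple normal-approximation formula.

```python
from math import sqrt
from statistics import NormalDist

def difference_confidence_interval(rate_a, rate_b, n_a, n_b, confidence=0.95):
    """Approximate confidence interval for the real difference rate_a - rate_b."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    diff = rate_a - rate_b
    std_error = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return diff - z * std_error, diff + z * std_error

if __name__ == "__main__":
    # Assumed group sizes; the observed rates come from the example above.
    for n in (200, 2000):
        low, high = difference_confidence_interval(0.57, 0.46, n, n)
        print(f"{n} per group: real difference plausibly between {low:.1%} and {high:.1%}")
```

With 200 patients per group the observed 11 point advantage is compatible with a real advantage anywhere from about 1 point to about 21 points. With 2000 per group the range is much tighter.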

Not Statistically Significant Doesn’t Mean There’s No Difference!
“Not statistically significant” doesn’t mean that the observed differences are due to
chance – only that it would not be surprising if they turned out to be due to chance. It
may be that there is a difference which you care about, but not enough people were
included in the test for the difference to show as statistically significant. If a trial is way
too small, it’s even possible that there’s a big and meaningful difference between the
treatments but hardly any difference was actually seen – just due to chance. When
there is a negative result you need to consider the “power” of the trial to detect
meaningful differences as statistically significant (see above for a definition of “power”).

You need to decide whether they are testing the right “statistic”
Statistical tests test for differences in specific “statistics,” often the median, or the mean.
So what is said to be a statistically significant difference is a difference in the specific
statistic tested, such as the median. The median is absolutely not the whole story and
real differences that matter need not show up as statistically significant differences in a
mean or median. For a wonderful introduction to why this is, please read Stephen Jay
Gould’s The Median Isn’t the Message.
As a specific example, consider my own case. The treatment which saved me,
Interleukin-2, has a low response rate – only 15 or 20 percent – and maybe only five
percent of patients get a long term remission like I have gotten. Still, if no other
treatment offers any real chance of survival (as was the case when I chose IL-2), then
a small but real chance is something worth going for – at least it was to me.

Now the median survival is the amount of time half the patients survive. IL-2
dramatically increases the survival of kidney cancer patients – but only a few of them.
This could not affect the median survival much. If all of the patients who got a long
term remission happen to be those who were destined to live a little longer than the
median anyway, then IL-2 treatment might not improve the median one bit! In fact, a
treatment which is toxic, and slightly shortens the survival for many patients, but
produces a few cures could actually decrease the median survival but still be worth
trying for people who are willing to pay a price for a chance at long term survival. In
cases like these long term survival rather than median survival is the statistic of interest.
Of course it takes years to accumulate reliable data on the proportion of long term
survivors, and if this proportion is small, it will take a huge sample size as well. You may
have to take your best guess based on the data… even when statistical significance for
what matters to you has not been achieved.
Often in technical papers and presentations, median survival is just referred to as
“survival” and if there is no statistically significant difference in median survival, you will
hear or read that, “there was no difference in overall survival.” You need to think about
exactly what they are saying, and about what the data says about the effect of the
treatment on the entire population – not just the effect on an average or median.
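The point about the median can be seen with a toy example. The survival times below are invented purely for illustration and are not data from any study. The "treated" group is the same as the "untreated" group except that the three patients who would have lived longest anyway become long term survivors.

```python
from statistics import median

# Hypothetical survival times in months, made up for this illustration.
untreated = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 20, 24, 30]

# Same patients, except the three longest survivors go into long term
# remission (recorded here as 120 months).
treated = untreated[:-3] + [120, 120, 120]

def fraction_alive_at(times, months):
    """Fraction of patients whose survival time is at least `months`."""
    return sum(t >= months for t in times) / len(times)

if __name__ == "__main__":
    print("Median survival, untreated:", median(untreated), "months")
    print("Median survival, treated:  ", median(treated), "months")
    print("Alive at 5 years, untreated:", fraction_alive_at(untreated, 60))
    print("Alive at 5 years, treated:  ", fraction_alive_at(treated, 60))
```

Both groups have exactly the same median survival, 8.5 months, yet 15% of the treated group is alive at five years and none of the untreated group is.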

Statistically Significant Doesn’t Mean There’s a Real Difference!
That a difference is statistically significant only means it’s unlikely that the observed
difference is due to chance. It doesn’t mean it’s impossible. For instance if p=.05 there
is a 5%, or 1 in 20 chance, that a difference this large could have occurred by chance
alone. A one in 20 chance is just not that rare!
I think that for most cancer patients, knowing that a difference this large in favor of
some treatment would turn up by chance alone only 5% of the time is good enough
that they will not spend undue time worrying that the difference isn’t real.

Depending on the p value, the chances might be a lot better than that anyway. You can
look at the p value to tell how unlikely it is that a result like this would occur by chance
alone. So if you have p=.000001, then a difference this large would occur by chance
alone only one time in a million.

You have to decide what to test before you do the experiment!
A common, but dangerous, practice is to do an experiment and notice some interesting
pattern in the data only after you are done. Maybe women survived longer, or maybe
black people, or maybe left handers, or maybe… well you get the idea. The temptation
is to then do a statistical test and voila! The difference is highly statistically significant
and you’ve made a great discovery! Not so fast.

The mind is a pattern finding machine extraordinaire! There is nothing else like it in the
universe! When you look at data, searching for patterns, you consciously or
unconsciously check many different possibilities to see if it looks like this or that
property makes a difference in the outcome. Similarly you might check many different
outcomes to see if the treatment made a difference in that outcome – if the patients
didn’t live any longer maybe they spent less time in the hospital or used fewer pain pills
or… well who knows. There can be hundreds or thousands of possibilities.

The problem is that if you check hundreds of things then it is actually likely that you will
see a statistically significant difference in at least one of them just by chance alone!
Remember that you could get a result that is statistically significant at the p<.05 level
one in twenty times just by chance alone. So if you check enough things this is
actually likely to happen at least once. If you buy enough lottery tickets you’re probably
gonna win – only in this case the prize is the bitter fruit of false conclusions. It is easy to
be fooled by this trick of the statistical light.
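The arithmetic behind this is short. The sketch below assumes the things you check are independent of one another, which real data rarely is exactly, and that none of them involves any real difference. The function name is made up for illustration.

```python
def chance_of_false_alarm(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is 'significant' by luck alone."""
    # Each test has a (1 - alpha) chance of correctly staying quiet;
    # all of them must stay quiet for there to be no false alarm.
    return 1 - (1 - alpha) ** n_tests

if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(f"{n:>3} things checked: {chance_of_false_alarm(n):.0%} chance of at least one false alarm")
```

Checking 20 unrelated things gives roughly a 64% chance of at least one false alarm, and checking 100 makes it a near certainty.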
The way not to be fooled is to specify exactly what you are going to test before you start
the experiment (this is called defining endpoints prospectively) and to specify a limited
number of things. There are statistical methods of correcting for multiple tests if you
know in advance exactly what you are testing.
If you see something interesting in the data afterwards and do a statistical test that was
not planned in advance then often all you have is an interesting hypothesis which must
be confirmed by doing the experiment again, rather than a statistically valid conclusion.
If the p value is very small (much less than .05) and if the problem space is not one that
lends itself to an indefinite number of possibilities to check, then it is more likely that you
really have something. It is also more likely if it makes sense in terms of what is known
about the disease – the observation that, say, left handers have worse survival doesn’t
make sense in terms of any known biological mechanism and seems unlikely – a trick of
the statistical light. In contrast if you were to observe that people with higher levels of
testosterone had worse survival in prostate cancer it would be relatively plausible
because prostate cancer cells are known to be stimulated by male hormones.

This CancerGuide Page By Steve Dunn. © Steve Dunn


Page Created: 2000, Last Updated: January 21, 2001

Postcards from Beyond The Zero


This is a statistical walk through the odds I've faced in my battle with metastatic renal
cell cancer and how those odds have changed with time and circumstances. To
understand the background, read my complete story. I hope this inspires you to see that
even the most terrible prognosis and odds can be overcome. I also intend it to illustrate
some of the considerations involved in interpreting survival curves.

Diagnosis
I was diagnosed with widely metastatic kidney cancer only one month after surgery for a
huge, but seemingly localized tumor. I found the survival curve to the right in a review
article my own doctor had written. This curve shows survival of patients with what I had,
who either relapsed within six months of surgery (like me), or who had metastasis at
diagnosis. It applied to me.

The curve is approximately constant risk. Constant risk means that the chance of dying
does not change over time. Constant risk curves have a constant half life, which is the
time it takes for half the remaining patients to die. Living longer brings no relief, the risk
is always the same. For my curve, the half life (and therefore also median survival) is
only about four months. The chance of surviving one year is only about 12%. At four
and one half years, the curve reaches the zero.
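The numbers in this paragraph follow from the half life. The sketch below treats the curve as a perfectly constant risk curve with a four month half life, which is an idealization (the real curve is only approximately constant risk), and the function name is made up for illustration.

```python
def fraction_surviving(months, half_life_months=4.0):
    """Fraction of patients still alive after `months` on a constant risk curve."""
    # Every half life that passes cuts the surviving fraction in half.
    return 0.5 ** (months / half_life_months)

if __name__ == "__main__":
    for months in (4, 8, 12, 24, 54):
        print(f"{months:>2} months: {fraction_surviving(months):.4%} surviving")
```

One year is three half lives, so the surviving fraction is one half times one half times one half, or 12.5%, which matches the roughly 12% read off the curve. By four and a half years the idealized curve is down to less than one patient in ten thousand.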
Taken at face value, this is a statistical nightmare, a stark portrait of the odds against
terminal cancer. There is no hint of even the slightest possibility of a cure or way out.
Only a relentless descent to death. But this superficial appearance was not the reality.
Escape was possible!
Caveats: The data was old, and strongly reflects the fact that at that time there was no
standard effective treatment. But that was still true at the time of my diagnosis. This
curve is also based on a fairly small number of patients and cannot exclude a tiny
fraction of truly long term survivors. While the chance of surviving 5 years is surely very
small, I don’t believe it is actually zero!

Treatment
I changed my odds and got off the curve above by entering a clinical trial of a then
experimental drug called Interleukin-2. Here is a long term survival curve for 255
patients given high dose IL-2 (I had Interferon with the IL-2, but this is as close as I can find).

Notice that this curve, while still very rough, has 20% survivors at four and a half years
(instead of zero percent in the previous curve) and maybe 15% survivors at 10 to 11
years. There is a small but real chance here! Best of all it flattens out towards the end,
suggesting that a few may actually be cured.

When I started my treatment, IL-2 was still in development. So since it took more than a
decade to accumulate enough follow-up to make this curve, I didn’t have it to look at
when I had to make my decision. What I had instead were hints that IL-2 might improve
the curve – hints in the form of a report in the literature of dramatic responses that were
continuing after two or three years (To see just exactly what I was looking at, see my
article, The Hint). Given the survival with standard treatment, that was more than enough
for me. Seeing the first curve with its message of doom turned out to be a positive
because only in comparison to that dreadful curve could I have known that a mere hint
was worth pursuing with everything I had.
Caveats: Unlike the first curve, this curve is not limited to treatment of patients who
relapsed within 6 months of diagnosis, so the inclusion of patients who had a somewhat
better prognosis to begin with may represent part of the improvement. More generally,
there is considerable uncertainty any time you compare survival curves from completely
different trials or groups. There can easily be a difference between the groups which,
rather than the difference in treatment, accounts for a difference in survival. Ideally,
survival curves are best compared when they are from a randomized trial where bias
due to group difference is eliminated by the trial design. But in the real world, decisions
often have to be made with less than ideal data. In this case the fraction of really long
term survivors is still better than what could reasonably be expected with metastatic renal
cancer with earlier treatments, so while the curve I changed to may not be exactly what
I present here, I surely did change my odds for the better by taking IL-2.

Response
I could have responded… or not responded. I had control over my choice of treatment,
but not over whether it worked. I was fortunate in that I did respond. Once that
happened, the odds changed again, and again dramatically for the better! Here is a curve for patients who
responded to high dose IL-2. Notice that nearly 40% are still in remission over a decade
later. Had I not responded, my odds would have changed for the worse and probably
would not be much better than they were when I started.

Caveats: Note that this curve charts response duration rather than actual survival.
Since patients survive at least some time after relapse, and since it takes at least some
time to get into remission, an actual survival curve would look at least slightly better than this.
There is a classic statistical trap in comparing responders to non-responders (or to all
patients treated) because those who respond may have been those who were healthier
to begin with and who would have lived longer anyway. So it may not be the case that
the patients really benefited from treatment, even if they achieve a temporary shrinkage
of their tumors, as is often the case with chemotherapy (My treatment was
immunotherapy). I do not think that this is the case here because many of the
responses that did occur were long term, and because without treatment, long term
survival for this disease is almost zero.
Flat Line!
Once I got into remission, I resumed my life hoping I would be one of those whose
responses lasted. As time passed I got to see more and more data which showed, as
these curves do, that the longer I stayed in remission, the greater the chance I was
going to stay there! As you
can see from the previous two curves, the risk of relapse or death is highest during the
first 30 months and then decreases substantially. So the mere passage of time
improved my odds as I moved past the initial high risk 30 months to a substantially
flatter area of the curve.

Since there have been no relapses after 85 months, for survivors who are out at least
that far, the curve is flat at 100%. In fact, as I write this in September 2001, I am off the
end of the curve in remission at 142 months, so my curve is now flat at 100%! (If you
don’t actually see a curve here, it’s the line at the very top!)

Caveats: That 100% is based on a small number of patients, so just as I don’t believe
the first curve really guaranteed death, I also don’t believe this one guarantees life,
though things are looking very, very good. Finally, because I am “off the curve,” I am
extrapolating a little in time to claim 100%. But despite these caveats, the difference
between the curve I am on now and the one on which I started my journey is not in
doubt. It is infinite.
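
The reason the mere passage of time improves the odds can be put into a little arithmetic: the chance of still being in remission at some later time, given that you have already reached an earlier time, is the later height of the curve divided by the earlier height. The sketch below uses made-up curve values. They are illustrative assumptions shaped loosely like the responder curve described above, not numbers read from it.

```python
# Hypothetical points on a "fraction still in remission" curve.
# Keys are months since response; values are the fraction of responders
# still in remission. These numbers are illustrative assumptions only.
curve = {0: 1.00, 30: 0.50, 85: 0.40, 120: 0.40}

def conditional_remission(curve, reached, later):
    """Chance of still being in remission at `later` months,
    given that remission has already lasted `reached` months."""
    return curve[later] / curve[reached]

for reached in (0, 30, 85):
    chance = conditional_remission(curve, reached, 120)
    print(f"In remission at {reached} months -> "
          f"{chance:.0%} chance of remission at 120 months")
```

With these assumed values the chance of being in remission at 120 months rises from 40% at the start, to 80% for someone who has reached 30 months, to 100% for someone who has reached 85 months, which is the "flat line" effect.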
This CancerGuide Page By Steve Dunn. © Steve Dunn
Page Created: 2001, Last Updated: March 14, 2002

Endpoints: How the Results of Clinical Trials are Measured

What are “Endpoints” and Why are they Important?
Oncologists use the term endpoint to refer to an outcome they are trying to measure
with a clinical trial. Understanding endpoints is absolutely critical to understanding the
technical medical literature. All journal articles reporting on clinical trials will report the
results in terms of the endpoints which were measured. If you don’t understand what
they mean, you can’t understand the article.
Endpoints can include all kinds of things, but this article is about the most important kind
of endpoint – those which relate to the effectiveness of treatment, called efficacy
endpoints. There are many ways to measure the effectiveness of treatment – the
endpoints I talk about here are the most commonly used ones.

Oncologists frequently use some of the same terms in talking to patients about benefits
that might be expected from treatment. In particular, it’s common for an oncologist to
estimate the chance you will “respond” to treatment. Doctors will also use the jargon of
endpoints in talking about how well treatment is working as in, “you are responding to
the treatment.” Knowing what “respond” might mean will help you know exactly what
questions to ask.

Rather than just reciting a laundry list of different efficacy endpoints, I discuss when
they’re used, how they relate to clinical trials (this article partially overlaps some of
my clinical trials articles), technical details of how they’re measured, and the pros and
cons of each and, of course, the associated jargon, which (surprise!) can be just a bit
thick at times.

Capsule Summary of Endpoints

 Response and Related Endpoints: Measures tumor shrinkage in response to treatment and
how long that shrinkage lasts. Response is the typical main endpoint for phase II trials, and is
frequently measured in other trials as well.

 Survival: Just what you think it is! A typical main endpoint (survival is important!) for phase III
and adjuvant trials.

 Progression Free Survival and Disease Free Survival: Measures the length of time that a
patient is both alive and without worsening of their cancer. These are typical endpoints for
phase III and adjuvant trials.

 Quality of Life: Based on subjective measures of how well the patient is functioning and
enjoying life. This takes into account both benefits of treatment and loss of quality of life due to
the side effects of treatment. Quality of life is typically an endpoint of phase III and adjuvant
trials.

Detailed Description of Endpoints


Response and Related Endpoints
Response is about measuring tumor shrinkage. Because cancers only very rarely get
smaller without treatment, a significant tumor shrinkage shows treatment is having an
effect on the tumor. Response is usually the primary endpoint in Phase II trials but is
often measured in Phase I and III trials as well. Response is not an endpoint for
adjuvant clinical trials where the primary tumor has been removed surgically since in
that case there are no detectable tumors to measure.

Rough Definitions of Response


This section will give you a rough view of what the different response categories mean.
There are specific technical criteria for each of these categories which I cover later.
Note that a patient’s response is characterized by the greatest amount of shrinkage they
achieve from the time treatment started. So if a patient has a major shrinkage which is a
response but later has tumor growth it’s still categorized as a response. The duration of
the response is from the time response is achieved until renewed growth is detected.
Main Categories

 Complete Response (CR): Complete response means all detectable tumor has disappeared.
If a treatment does cure some patients, those patients will have their tumor disappear. A CR is
a potential cure. If you have advanced cancer a CR is also the best result you can actually
see from treatment. Even so, a complete response does not necessarily mean the patient is
cured. Even when no tumor can be seen on scans, there can be residual tumor which is too
small to detect, and so unfortunately, complete responses may not last. Whether a complete
response is likely to last can often be gauged by looking at the history of the type of treatment
that produced the response in your type of cancer. In some situations very few complete
responses are cures and in others most are cures. To find out, you have to research your
cancer and the treatment in question. A patient who has had a complete response may be said
to be “NED”. NED means “No Evidence of Disease”.

 Partial Response (PR): This roughly corresponds to at least a 50% decrease in the total
tumor volume but with evidence of some residual disease still remaining. Partial responses
aren’t usually cures and usually aren’t a long term benefit because significant tumor remains.
In some cases the residual disease in a deep partial response may actually be dead tumor or
scar so that a few patients classified as having a PR may actually have a CR. Also many
patients who show shrinkage during treatment show further shrinkage with continued
treatment and may achieve a CR.

 Minor Response (MR): “Minor response” roughly means a small amount of shrinkage. Minor
response is not really a standard term but is increasingly used. Roughly speaking, a minor
response is a shrinkage of more than 25% of total tumor volume but less than the 50% that
would make it a PR. A minor response is not enough to be considered a true response, but data
on minor responses may be given in reports on clinical trials. If minor responses are not
categorized then they would be considered Stable Disease (see below). Although minor
responses are often considered insignificant, I believe that some of the new anti-angiogenic
therapies (treatments that target tumor blood vessels) which show a high rate of minor
response are likely to be benefiting patients, as long as those responses last for a reasonable
length of time. Also, of course, an “almost partial” response is hardly different from a “barely
partial” response. Such boundaries are artificial. Again, with more treatment, a minor response
may evolve into a deeper response, a PR or a CR.

 Stable Disease (SD): Although stable disease intuitively means the tumors stay the same
size, to account for measurement errors on scans and to discount “insignificant” changes,
stable disease includes either a small amount of growth (typically less than 20 or 25%) or a
small amount of shrinkage (anything less than a PR, unless minor responses are broken out, in
which case SD typically means shrinkage of less than 25%). Because of this, slow growing
tumors may be classed as stable for quite some time, or for several scans if scanning is
frequent. Also some periods of stability are relatively common in some kinds of cancer even
without treatment. Therefore, it is difficult to know if stable disease is the result of treatment.
Claims of benefit for new treatments involving stable disease should be examined skeptically.
For more on the difficulties of Stable Disease, see my article, Stable Disease: A Problematic
Endpoint. Like a minor response, stable disease is not considered a true response. At the same
time, if you are experiencing stability with or without treatment that is better than growth.
Finally, stable disease can evolve into a response with more treatment.


 Progressive Disease (PD): Progressive disease means the tumor has grown significantly or
that new tumors have appeared. The appearance of new tumors is always progressive
disease regardless of the response of other tumors. Progressive disease normally means the
treatment has failed and in most cases is the signal that it’s time to try something else (or stop
treatment altogether if no good options remain). If you are on a clinical trial and have
progressive disease during treatment, you are likely to be taken off study (in a few cases you
may be allowed to cross-over to the other arm of a randomized trial, or your treatment may be
otherwise modified). Most clinical trials for advanced cancer which allow prior treatment
require that you have had progressive disease since your last treatment (or in other words,
that your last treatment didn’t work or has stopped working).
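
The rough categories above can be sketched as a simple decision rule. This is only an illustration under stated assumptions: the function name, its parameters, and the choice of 25% as the growth cut-off are mine, the change is measured against the pre-treatment tumor size (a simplification), and real trials spell out their own exact criteria.

```python
def classify_response(percent_change, new_tumors=False, all_gone=False,
                      report_minor=True):
    """Rough response category from the change in total tumor size.

    percent_change: negative for shrinkage, positive for growth.
    Thresholds are the rough ones from the text (50% for PR, 25% for MR
    and for progression); they are not any trial's official criteria.
    """
    if new_tumors or percent_change >= 25:
        return "PD"   # any new tumor counts as progressive disease
    if all_gone:
        return "CR"   # no detectable tumor left
    if percent_change <= -50:
        return "PR"   # at least a 50% decrease
    if report_minor and percent_change < -25:
        return "MR"   # more than 25% but less than 50% shrinkage
    return "SD"       # small growth or small shrinkage

for change in (-100, -60, -30, -10, 10, 40):
    print(change, classify_response(change, all_gone=(change == -100)))
```

Note how a 30% shrinkage is reported as MR when minor responses are broken out, and as SD (`report_minor=False`) when they are not.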


Summary Categories:

 Objective Response (OR): Objective response means either a partial or complete response
(In the literature you’ll frequently see “CR+PR” which means the same thing). When you see
an objective response rate be sure to look at how many are complete responses and how
many are partial since benefits from complete response tend to be greater. Often news
reports and especially press releases by self-interested companies blur this and don’t reveal
that the CR rate is low or non-existent. Track down the original source and find out!

 “Clinical Benefit”: Clinical benefit is an informal term which usually means anything other
than progressive disease. Use of this term is suspect, particularly if it is in a press release or
news report. It isn’t automatically clear that patients with stable disease are benefiting from
treatment since the natural history of cancer can include periods of apparent stable disease
and since tumor shrinkage is not equal to clinical benefit to begin with. When you see this
term you should look at both the CR and PR rates and also the duration of “benefit” including
for stable disease cases.


Note that you’ll constantly see the abbreviations CR, PR, etc. in the research literature.
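
To see how much the choice of summary category matters, here is a small sketch that tallies one set of results three ways. The patient counts are invented purely for illustration and do not come from any real trial.

```python
from collections import Counter

# Hypothetical best responses for 20 patients in an imaginary trial.
best_responses = ["CR"] * 1 + ["PR"] * 4 + ["SD"] * 7 + ["PD"] * 8

counts = Counter(best_responses)
total = len(best_responses)

cr_rate = counts["CR"] / total
objective_rate = (counts["CR"] + counts["PR"]) / total   # CR + PR
clinical_benefit = (total - counts["PD"]) / total        # anything but PD

print(f"CR rate:            {cr_rate:.0%}")
print(f"Objective response: {objective_rate:.0%}")
print(f"'Clinical benefit': {clinical_benefit:.0%}")
```

The same imaginary trial can be described as 60% "clinical benefit", 25% objective response, or 5% complete response, which is why it pays to find the breakdown.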

Duration of Response
Cancer therapies can produce temporary responses without any lasting benefit. On the
other hand, some cancer therapies can be curative or at least give a meaningful respite
from the disease. If there are no data to show an improvement in survival, then you
want to look at responses and see how many are lasting. If a treatment is going to make
a real difference, tumor shrinkages should last at least long enough to give a meaningful
respite. If a treatment is to cure some patients, then some complete responses should
last indefinitely. There are treatments which give a small percentage of patients durable
CRs, which may be cures. If you have a difficult cancer, then even if survival is reported
or is the main endpoint, it’s worth looking to see if there are lasting CRs (Frustratingly,
they don’t always report on this).

Reports on clinical trials in the medical literature commonly give not only the number of
responses, but also the duration of responses, particularly for phase II trials where
response is the primary endpoint. Some papers only provide summary information on
this but often the duration of response is given for every responding patient.

Response duration for an individual patient is given as a number possibly with a plus
sign after it like “6+”. Usually the unit is months but not always, so you should read
carefully to be sure. A plus sign after the response duration for an individual patient
means the response was still ongoing at the last evaluation. Conversely, the lack of a
plus sign means the patient relapsed at the given time. The plus sign is anything but a
minor detail – it’s key to telling whether a treatment is giving some patients lasting
benefit!

Often all of the responses are listed sorted by length like:

45+, 38+, 32+, 30, 23+, 21 or

12, 8, 6+, 5+, 3+, 2+

CRs and PRs are usually listed separately.


The difference between these two examples is important. In the first there is relatively
long term follow-up and the responses are mostly holding. It’s encouraging. In the
second, follow-up time is much shorter and the fact that four of the responses are
holding at six months or less doesn’t really tell you much, especially since the two
slightly longer responses didn’t last. Obviously the two relapses at 8 and 12 months in
this small example are not nearly enough data to know what the range of possibilities
might be, but you’d certainly want to look hard for other options if this was the best you
could find on a treatment.
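
If you want to tabulate lists like these yourself, the plus-sign convention is easy to handle mechanically. The sketch below is one possible way to do it; the function names are my own, and it assumes durations are whole numbers separated by commas.

```python
def parse_durations(text):
    """Turn '45+, 38+, 30' into (duration, ongoing) pairs.
    A trailing '+' means the response was still ongoing at last evaluation."""
    pairs = []
    for item in text.split(","):
        item = item.strip()
        ongoing = item.endswith("+")
        pairs.append((int(item.rstrip("+")), ongoing))
    return pairs

def summarize(text):
    """Count ongoing responses and list the times at which relapses occurred."""
    pairs = parse_durations(text)
    ongoing = [d for d, still_going in pairs if still_going]
    relapsed = [d for d, still_going in pairs if not still_going]
    return {
        "responses": len(pairs),
        "ongoing": len(ongoing),
        "relapsed_at": relapsed,
        "longest_ongoing": max(ongoing) if ongoing else None,
    }

print(summarize("45+, 38+, 32+, 30, 23+, 21"))
print(summarize("12, 8, 6+, 5+, 3+, 2+"))
```

Both lists have four ongoing responses out of six, so the count alone does not separate them. What differs is the follow-up: the longest ongoing response is 45 in the first list and only 6 in the second.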

You will find that the results of early phase trials often include only relatively short
follow-up when they are first presented or published, particularly if they’re presented at
medical meetings where very early results are often presented. Naturally, this makes it
hard to judge response duration. For more established therapies, especially anything
FDA approved, there should be better data on response duration. Similarly, if a new
therapy has been in testing for a while, there may be long enough follow-up that there is
useful data on response duration. This was the case for the experimental treatment
which saved me. See my article The Hint for the actual data.
There are a few nits about measuring response duration. The duration of response will
depend on two things: when the response is counted as starting and when the
response is counted as ending. Unsurprisingly, a response is normally counted as
lasting from the time response is first achieved to the time progression from the best
response is detected. What is a little more subtle is that these things depend a bit on
how often scans and tests are scheduled including how close to the start of treatment
the pre-treatment measurement scan is taken. The longer the response durations, the less
all of this matters.

Questions to Ask Your Doctor


If your oncologist quotes a chance of “responding” to a proposed treatment it’s worth
asking some basic questions:
 Does “response” mean CR + PR? If not, what is the CR + PR rate? What is the CR rate?

 What is the range of durations of response? Are there long term CRs?

 Is there any data to show a survival benefit from this treatment?

 If this is a local or regional treatment like radiation, how much will it help me to have a
response in only the treated tumors?

Technicalities
Measuring Tumor Volume
How tumor volume is measured depends on the kind of cancer. Most common cancers
form discrete nodules or masses. These kinds of cancer are called “solid tumors” and
there are very standardized methods for measuring solid tumors. I talk about
response criteria and measurement of solid tumors below. After that I briefly discuss the
other cases, liquid tumors, and types of solid tumor where the standard response
criteria don’t work well.

Measuring Solid Tumors: General


 Response or stable disease always requires that you have no new tumors. A new tumor
means you have progressive disease.

 Normally, a response has to last at least a month or be confirmed by the next set of scans
before it counts. Often a longer duration is required for apparently Stable Disease to count. I
am seeing 6 months more and more often. For most kinds of cancer 6 months would be a long
time for there not to be significant growth if treatment weren’t having an effect. Watch out for
early reports, especially meeting presentations (and those ever-spinning press releases),
which don’t specify how long counts as stable or which don’t have any minimum to count as
stable.

 Individual trials can have differing definitions of response. Any clinical trial report in the
technical literature will describe what those criteria were. Often this is just referenced as the
standard WHO or RECIST criteria (see below) but sometimes more details are given.

 Blood markers like PSA for prostate cancer or CA-125 for ovarian cancer are not used as a
substitute for measuring tumor masses in the standard cases. If there is a standard blood
marker for your cancer, your levels must return to within the normal range for you to be
declared to be in complete response. If the standard response criteria don’t work well for your
type of cancer and there is a standard blood marker, it will probably be used in constructing
response criteria for your kind of cancer.

Measuring Solid Tumors: Measurable Disease
In order to quantify tumor shrinkage, it is necessary to be able to accurately measure
the size of the tumor using scans or, in a few cases, physical exam. Most metastases
can be accurately measured but some types such as bone metastases and
accumulations of fluid caused by tumors (called effusions) are not considered to be
measurable (this is not a complete list of non-measurable tumors). A measurable tumor
usually has to be at least a minimum size such as 1 centimeter in diameter.

You may have both measurable tumors and non-measurable tumors. For instance you
might have measurable lung metastases and bone metastases. If you have at least one
measurable tumor then you have measurable disease.

Measurable disease is required in any trial where response is a primary endpoint.
Phase II trials which usually have response as the primary endpoint normally require
measurable disease, and phase III trials sometimes do. Phase I trials usually do not
require measurable disease. Since the criteria for measurable disease can be technical,
if you are considering trials which require measurable disease, please discuss whether
you have measurable disease with your doctor if you have any doubt.
Measuring Solid Tumors: WHO Criteria
Versus RECIST Criteria
You may sometimes hear partial response described as a 30% shrinkage instead of a
50% shrinkage. So when is 30% equal to 50%? Well, when there is only one dimension
considered instead of two.

The older standard for response, the WHO (World Health Organization) criteria, defined
a shrinkage of a tumor as the decrease in the product of the largest perpendicular
diameters in the largest “slice” of the tumor on a scan. This defines the area of a rectangle
and is proportional to the area of an ellipse or circle (a more likely cross-section of a tumor).
Measurable disease under this system is called “bidimensionally measurable”.

The newer standard, the RECIST criteria, defines a tumor’s shrinkage as the decrease
in the length of its largest diameter. This is called “unidimensionally measurable”. This
makes measurement easier and has been shown to be as good as measuring in two
dimensions.

It turns out that on average, a 30% reduction in one dimension is the same as a 50%
reduction in the product of two dimensions. If you had a circle and its diameter shrank
by 30%, as in the RECIST criteria, then the product of two perpendicular diameters as
in the WHO criteria would fall to 49% of its original value
($100\% \times (1 - 0.3)^2 = 100\% \times 0.7^2 = 49\%$), a shrinkage of 51%, which is
close enough to 50%. Anyway, the main point here is that the two systems give
very close to equal results.

Note that actually tumors have three dimensions so a partial response in either system
actually means more than a 50% reduction in tumor volume.
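
The arithmetic behind these comparisons can be checked with a few lines of code. This is a sketch under an idealized assumption that the tumor shrinks by the same proportion in every direction (as a circle or sphere would); real tumors are irregular, and the function name is my own.

```python
def remaining_fraction(diameter_shrinkage, dimensions):
    """Fraction of the original measure left after every diameter
    shrinks by `diameter_shrinkage` (0.30 means 30%)."""
    return (1 - diameter_shrinkage) ** dimensions

shrink = 0.30  # partial response threshold on the longest diameter (RECIST)
for dims, label in [(1, "longest diameter (RECIST)"),
                    (2, "product of two diameters (WHO)"),
                    (3, "volume of a sphere")]:
    reduction = 1 - remaining_fraction(shrink, dims)
    print(f"{label}: {reduction:.0%} reduction")
```

Under this assumption a 30% shrinkage in diameter corresponds to a reduction of about 51% in the product of two diameters and about 66% in volume.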

Special Tumor Types


Some kinds of cancer require different definitions of response. This includes blood
cancers like the leukemias which don’t form solid tumors. It also includes a few other
kinds of cancer such as prostate cancer which predominantly metastasizes to the bone,
or such as multiple myeloma which starts in the bone. As I mentioned above, bone
tumors aren’t considered measurable. In all of these cases the broad concept of
response is still similar to the rough categories I’ve given above.

If you have one of these tumors, you will need to learn what the criteria are for response
in your specific cancer. You’ll pick it up as you read papers, from your own experience
as a patient, and from the basic information on your cancer which is likely to talk about
any specialized tests particular to your cancer. There may or may not be standardized
criteria, and you may also find that definitions of response incorporate blood marker tests
like PSA for prostate cancer, or incredibly sensitive DNA based tests for finding cancer
cells in some types of leukemia.

Resources
 RECIST Criteria Quick Reference from the US National Cancer Institute.

 Therasse P, Arbuck SG, Eisenhauer EA, Wanders J, Kaplan RS, Rubinstein L, Verweij J,
Van Glabbeke M, van Oosterom AT, Christian MC, Gwyther SG. New Guidelines to Evaluate
the Response to Treatment in Solid Tumors. J Natl Cancer Inst. 2000;92(3):205-16.

Comment: A more complete description of the RECIST criteria along with validation
studies comparing to the WHO criteria.

Thoughts about Response


It’s often emphasized that tumor shrinkage is not in and of itself a benefit to patients –
only to the extent that it relieves symptoms or improves survival. While this is true, I also
think that tumor shrinkage is extremely important.
Although tumor shrinkage may not be proof that a treatment is truly beneficial, if a
treatment really is beneficial one expects at least it will stop the cancer from growing. If
a treatment cures some patients one expects that all detectable cancer will disappear
under treatment for cured patients. Also if tumors shrink and this effect lasts for a
significant amount of time it is likely the patient has gotten a benefit.

Unlike other endpoints, response shows treatment has a direct biological effect on the
cancer. Unlike survival related endpoints, response does not require randomized trials
to demonstrate that the treatment is having an effect. People survive various lengths of
time with no treatment, but tumors only very rarely go away on their own.

Knowing that a treatment creates response in advanced cancer supports positive
results for trials of adjuvant therapy using that treatment. Positive results with an adjuvant
trial with a marginal “p value” (To understand p values see my article, The Significance
of Statistical Significance) might still be due to chance. Knowing the treatment is
biologically active should add some confidence in the results.
Documentation of response also buttresses and clarifies results with survival related
endpoints in advanced cancer. A treatment which at best slows the growth of the tumor
and doesn’t even create stable disease is surely a weak treatment. A treatment which
does not give complete responses is not curing anyone.

If you have advanced cancer which has no good treatment and are looking for new and
promising therapies you may very well need to rely on phase II trial results which have
response as an endpoint. Phase III trials to prove survival take a long time to conduct
and phase II results will be available much sooner. If there are complete responses and
if responses appear to be lasting this is real evidence the treatment is helping some
patients, even if there is uncertainty.

Survival
Improved survival is a major goal of cancer treatment (Cure is the goal!). Survival is
therefore a very important endpoint in cancer trials and is one which, unlike some other
endpoints, is a direct benefit to patients.
Because showing an improvement in survival requires comparison to a control group,
survival is a primary endpoint in Phase III and Adjuvant trials. These trials are normally
randomized so that the survival of comparable groups treated with different treatments
can be compared. Survival is normally reported using survival curves. See the Statistics
Section for several articles on these important curves.

Thoughts on Survival
Survival is technically easy to measure unambiguously and objectively since there is
nothing subjective about the date someone dies. This is an advantage over endpoints
like response or progression free survival which require measuring whether the tumor is
growing as well as how long patients live. Survival also accounts for any long term increase
in the death rate due to long term side effects even if they are unrelated to acute side
effects.

In situations where long term survival is common, measuring survival differences may
take a long time. There is a risk that ideas, technologies and treatments will have
improved by the time the results are in. Survival always takes longer than progression-
free survival (see below) to measure.

When you see positive results from a survival trial, it’s important to ask whether the trial
is actually increasing the cure rate or whether instead it extends life without increasing
the cure rate (An increase in cure rate will result in a higher plateau on the right tail of
the curve). If the treatment improves survival without improving the cure rate, you want
to know the likely and possible benefits. Not infrequently a big deal is made out of
treatments which improve median survival by only a few weeks or months. But what
may be lost is whether some patients get a much larger benefit and survive years longer
than they would have. The best way to answer these questions is to look at the survival
curves. For much more on this, look at my articles on survival curves in the Statistics
Section.
Measuring survival in a randomized trial for advanced cancer requires that patients
given one treatment who progress cannot be allowed to try the other treatment
(switching arms is called cross-over). This may not be perceived as a disadvantage by
those who design trials, but it most certainly is by patients whenever denying cross-over
denies them any additional hope that might come from trying the other treatment.
Although cross-over isn’t allowed in these circumstances, ethical considerations
normally require that patients who have advanced disease be permitted to seek out
other therapy of their choice. This actually is a scientific disadvantage since the results
may be blurred or even biased by treatments chosen later. Note that if the endpoint is
progression free survival (see below) then cross-over can be permitted.
Finally, historical or concurrent non-randomized controls have very low standing
because they are subject to bias. For instance, improvement in tests for detecting
recurrence which result in finding recurrences sooner can make it look like more recent
patients live longer after a diagnosis of recurrence. Concurrent non-randomized controls
suffer from all kinds of possible bias related to characteristics of patients who choose (or
are referred to) one kind of treatment versus another. For instance, more motivated and
stronger patients might choose to travel for a difficult experimental treatment compared
to those who came from the local area and took the standard treatment. Patients with
those characteristics might well live longer than others even if the new treatment is
actually no better than the old.

A closing thought: Almost every intelligent layman I have
ever talked to comes up with the idea that with rigorous recording of treatment results
and patient characteristics, the need for expensive (twice as many patients), and
sometimes ethically uncomfortable randomized trials could be eliminated. I believe the
dogma of the randomized trial inhibits efforts to find a better way.

Jargon
 OS: Overall Survival
Progression Free Survival and Disease
Free Survival
Progression Free Survival is the length of time you are both alive and free from any
significant increase in your cancer (free from progression). Progression is defined the
same way as I described under response.

Disease Free Survival


Disease Free Survival is a special case of Progression Free Survival used as an
endpoint in the clinical trials of adjuvant therapy to prevent recurrence after surgery to
completely remove all visible cancer. In this case “progression” means the patient has
had a recurrence.
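
The difference between these endpoints and overall survival comes down to which event stops the clock. The sketch below illustrates this for a single patient. It assumes time is counted from a `start` date (for example, the date of entering the trial), approximates a month as 30.44 days, and uses invented dates and function names; for Disease Free Survival, read "progression" as "recurrence".

```python
from datetime import date

def months_between(start, end):
    """Approximate number of months between two dates."""
    return (end - start).days / 30.44  # average month length in days

def survival_endpoints(start, last_follow_up, progression=None, death=None):
    """Return (PFS, OS), each as (months, event_seen).
    event_seen is False when no event had occurred by last follow-up."""
    # PFS ends at whichever comes first: progression or death.
    pfs_events = [d for d in (progression, death) if d is not None]
    pfs_end = min(pfs_events) if pfs_events else last_follow_up
    # OS ends only at death.
    os_end = death if death is not None else last_follow_up
    pfs = (round(months_between(start, pfs_end), 1), bool(pfs_events))
    overall = (round(months_between(start, os_end), 1), death is not None)
    return pfs, overall

# Hypothetical patient: progressed after about 6 months, died at about 12.
print(survival_endpoints(date(2000, 1, 1), date(2001, 1, 1),
                         progression=date(2000, 7, 1),
                         death=date(2001, 1, 1)))
# Hypothetical patient: alive and progression free at last follow-up.
print(survival_endpoints(date(2000, 1, 1), date(2001, 1, 1)))
```

A patient who is still event free at last follow-up has `event_seen` set to False, which plays the same role as the plus sign used for response durations.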

Thoughts on Progression Free Survival and Disease Free Survival
Like survival, these endpoints normally require a randomized trial to measure. An
improvement in progression free survival isn’t guaranteed to translate into an
improvement in survival although I think it usually does.

For adjuvant therapy, if the same treatment works to some extent in recurrent cancer,
the question arises whether treating patients after they relapse is just as effective. The
hope is that treating when there is so little cancer that the patient appears to be free of
disease will be more effective than treating when there are large detectable tumors. In
some cases this has proven to be true. If adjuvant treatment appears to improve the
cure rate when the same treatment doesn’t cure relapsed patients, it’s a win.

Because no one knows for sure who will relapse, adjuvant therapies usually mean
treating some patients who are already cured. That disadvantage grows the more likely
it is that the treated patients are in fact already cured, and the more toxic and
expensive the adjuvant therapy.

Progression free survival and disease free survival can translate to an improvement in
quality of life since symptoms from the cancer are delayed, but only if side effects of
treatment aren’t worse.

Also unlike trials with a survival endpoint, randomized trials in metastatic cancer which
measure Progression Free Survival can allow cross-over on progression if it makes
sense. In adjuvant therapy trials cross-over doesn’t make sense, since patients get only
one adjuvant therapy and are treated for recurrence if they relapse.

Both of these endpoints are subject to some uncertainty compared to plain survival
because determining that there is progression or relapse involves reading scans, which
always has some associated uncertainty. Doing these trials well requires a rigorous
schedule for testing, and a rigorous protocol for examining the scans and declaring a
relapse. This includes reading of the scans at a central location by radiologists who are
not told which treatment the patients got (the radiologists are said to be “blinded”).

Jargon Summary
 PFS: Progression Free Survival

 DFS: Disease Free Survival

 RFS: Relapse Free Survival

Quality of Life
Quality of Life is supposed to measure how you feel and how you function. Although
quality of life is certainly important in the broad sense, unfortunately, there is no
unambiguous physical measurement or definable property which corresponds to your
“Quality of Life”. Quality of Life is therefore measured using a brief questionnaire in
which patients rate their ability to function in various ways and enjoy life. Patients
typically fill out the questionnaire several times during the course of the trial.
Quality Of Life is typically measured in Phase III trials or Adjuvant Trials and it is
typically a secondary endpoint, less important than survival related endpoints.

The most commonly used questionnaire is called the Functional Assessment of Cancer
Therapy (FACT) scale [Cella 1993], and there are specialized FACT questionnaires for
several different types of cancer (which tend to affect quality of life in different ways).

Technicalities
The FACT questionnaire was constructed (very roughly here) by surveying patients and
also doctors about what is important to quality of life and then constructing test
questions which were evaluated by patients. The resulting questionnaire was then
tested in several ways for validity, such as comparing results to other measures of
Quality Of Life.

The FACT questionnaire asks you to indicate how true statements are for you in five
different life areas:

 Physical Well Being

o Sample Test Statement: “I have a lack of energy.”

 Social/Family Well Being

o Sample Test Statement: “I am satisfied with my sex life.” (Honestly!)

 Relationship With Doctor

o Sample Test Statement: “I have confidence in my doctor(s).”

 Emotional Well Being

o Sample Test Statement: “I feel sad.”

 Functional Well Being

o Sample Test Statement: “I am able to work (include work in home).”

Thoughts on Quality of Life


Quality of Life is a much fuzzier endpoint than the others and that doesn’t sit well with
me. It seems most appropriate for helping to decide whether a treatment with side
effects which has only modest benefits is worth doing, especially if the treatment is
taken for as long as it helps, as is common with many treatments for advanced cancer.
Quality of Life also seems appropriate for judging the cost in quality of life of adjuvant
therapies, which often make apparently healthy people sick in the hope of
improving their prognosis for the future. If any treatment is extremely beneficial and
doesn’t need to be taken indefinitely, I think there is much less reason to be concerned
with how much quality of life is affected during treatment.

Quality of Life is determined by two main variables: the side effects of treatment and the
symptoms of the disease. Intuitively, a treatment which is highly effective against cancer
is likely to improve quality of life by preventing or relieving disease associated
symptoms. Similarly, the side effects of treatment obviously affect your quality of life. In
a sense, a Quality of Life score measures the tradeoff between the adverse effects of
treatment and its benefits.

The timing of side effects and benefits is also important in Quality of Life
Measurements. If the treatment really works, it will have long lasting benefits. It may
have only short term side effects in which case it’s hard to imagine a way to balance the
chance of a long term benefit against the acute side effects. This is less of a problem if
the benefits are temporary or if there are long term side effects. In essence, really good
treatments, those that are highly effective without bad side effects, obviously improve
Quality Of Life and you don’t need a questionnaire to know it. Problem is, there aren’t
enough of those!

Finally, keep in mind that while a very honest effort was made to make the
questionnaire reflect the values of the average patient, you are not the average patient.
You very likely will have either more or less severe side effects from any treatment than
average and will react to them either more or less than average. You don’t have
average symptoms from the disease either, and not only that, you won’t get the average
benefit from the treatment either. Therefore, Quality of Life tradeoffs can give only a
rough idea of how you might feel about the same treatment. Normally, you can get a
similar feel for this just from the reported severity of the side effects of the treatment and
the extent of benefit as estimated by other endpoints.

Jargon
Quality Of Life is often abbreviated QOL or QL.

Reference
Cella DF, Tulsky DS, Gray G, Sarafian B, Linn E, Bonomi A, Silberman M, Yellen
SB, Winicour P, Brannon J, et al.
The Functional Assessment of Cancer Therapy scale: development and validation of
the general measure.
J Clin Oncol. 1993 Mar;11(3):570-9.

The essential concepts of statistics

If you know twelve concepts about a given topic you will look like an expert to people who only
know two or three.
Scott Adams, creator of Dilbert
When learning statistics, it is easy to get bogged down in the details, and lose track of the big
picture. Here are the twelve most important concepts in statistical inference.
Statistics lets you make general conclusions from limited data.
The whole point of inferential statistics is to extrapolate from limited data to make a general
conclusion. "Descriptive statistics" simply describes data without reaching any general
conclusions. But the challenging and difficult aspects of statistics are all about reaching general
conclusions from limited data.
Statistics is not intuitive.
The word ‘intuitive’ has two meanings. One meaning is “easy to use and understand.” That was
my goal when I wrote Intuitive Biostatistics. The other meaning of 'intuitive' is “instinctive, or
acting on what one feels to be true even without reason.” Using this definition, statistical
reasoning is far from intuitive. When thinking about data, intuition often leads us astray. People
frequently see patterns in random data and often jump to unwarranted conclusions. Statistical
rigor is needed to make valid conclusions from data.
Statistical conclusions are always presented in terms of probability.
"Statistics means never having to say you are certain." If a statistical conclusion ever seems
certain, you probably are misunderstanding something. The whole point of statistics is to
quantify uncertainty.
All statistical tests are based on assumptions.
Every statistical inference is based on a list of assumptions. Don't try to interpret any statistical
results until after you have reviewed that list. An assumption behind every statistical calculation
is that the data were randomly sampled from, or at least representative of, a larger population of
values that could have been collected. If your data are not representative of a larger set of data
you could have collected (but didn't), then statistical inference makes no sense.
Decisions about how to analyze data should be made in advance.
Analyzing data requires many decisions. Parametric or nonparametric test? Eliminate outliers
or not? Transform the data first? Normalize to external control values? Adjust for covariates?
Use weighting factors in regression? All these decisions (and more) should be part of
experimental design. When decisions about statistical analysis are made after inspecting the
data, it is too easy for statistical analysis to become a high-tech Ouija board -- a method to
produce preordained results, rather than an objective method of analyzing data. The new name for
this is p-hacking.
A confidence interval quantifies precision, and is easy to interpret.
Say you've computed the mean of a set of values you've collected, or the proportion of subjects
where some event happened. Those values describe the sample you've analyzed. But what
about the overall population you sampled from? The true population mean (or proportion) might
be higher, or it might be lower. The calculation of a 95% confidence interval takes into account
sample size and scatter. Given a set of assumptions, you can be 95% sure that the
confidence interval includes the true population value (which you could only know for sure by
collecting an infinite amount of data). Of course, there is nothing special about 95% except
tradition. Confidence intervals can be computed for any degree of desired confidence. Almost
all results -- proportions, relative risks, odds ratios, means, differences between means, slopes,
rate constants... -- should be accompanied by a confidence interval.
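
As a rough illustration of how sample size drives the width of a confidence interval, the sketch below computes an approximate 95% interval for a proportion. It uses the simple normal-approximation (Wald) formula, which is a choice made for this example rather than a method the text prescribes. The function name and the patient counts are made up.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Approximate (Wald) confidence interval for a proportion.

    Uses the normal approximation, which is only reasonable when the
    sample is not tiny and the proportion is not close to 0 or 1.
    """
    p = successes / n
    # z is about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical data: the same 2/3 response rate seen in 30 and in 300 patients.
small_trial = proportion_ci(20, 30)
large_trial = proportion_ci(200, 300)
print("30 patients:  %.2f to %.2f" % small_trial)
print("300 patients: %.2f to %.2f" % large_trial)
```

With these invented numbers the interval runs from roughly 0.50 to 0.84 for 30 patients and from roughly 0.61 to 0.72 for 300 patients: the same observed rate, but a much more precise estimate.
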
A P value tests a null hypothesis, and is hard to understand at first.
The logic of a P value seems strange at first. When testing whether two groups differ (different
mean, different proportion, etc.), first hypothesize that the two populations are, in fact, identical.
This is called the null hypothesis. Then ask: If the null hypothesis were true, how likely would
it be to randomly obtain samples where the difference is as large as (or even larger than) the one
actually observed? The answer is the P value. If the P value is large, your data are consistent
with the null hypothesis. If the P value is small, there is only a small chance that random
sampling would have created a difference as large as the one actually observed. This makes you
question whether the null hypothesis is true. If
you can't identify the null hypothesis, you cannot interpret the P value.
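
One way to make this logic concrete is a permutation test, sketched below. It is only one of several ways to obtain a P value and is not necessarily what any given paper used. If the null hypothesis is true, the group labels are arbitrary, so the code shuffles them many times and counts how often the shuffled difference is at least as large as the observed one. The function name and the measurements are invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, shuffles=10_000, seed=0):
    """Estimate a two-sided P value for a difference in means.

    Null hypothesis: both groups come from the same population, so the
    group labels are arbitrary and can be shuffled.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / shuffles


# Made-up measurements for two groups of six subjects each.
treated = [18, 21, 17, 20, 19, 22]
control = [12, 15, 11, 14, 13, 16]
p_value = permutation_p_value(treated, control)
print("Estimated P value:", p_value)
```

Because the two invented groups do not overlap at all, very few shuffles reproduce a difference that large, and the estimated P value comes out small.
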
"Statistically significant" does not mean the effect is large or scientifically
important.
If the P value is less than 0.05 (an arbitrary, but well accepted threshold), the results are
deemed to be statistically significant. That phrase sounds so definitive. But all it means is that,
by chance alone, the difference (or association or correlation...) you observed (or one even
larger) would happen less than 5% of the time. That's it. A tiny effect that is scientifically or
clinically trivial can be statistically significant (especially with large samples). That conclusion
can also be wrong: when the null hypothesis is true, you'll still conclude that results are
statistically significant 5% of the time just by chance.
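
The effect of sample size on significance can be sketched with a standard pooled two-proportion z test (normal approximation). The choice of test, the function name and the patient counts are assumptions made for this illustration. The same one-point difference in response rate is tested with 100 and with 100,000 patients per group.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided P value for a difference between two proportions.

    Uses the pooled normal-approximation z test.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = abs(p_a - p_b) / std_error
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical trials: 50% versus 51% response, a one-point difference.
p_small = two_proportion_p_value(50, 100, 51, 100)
p_huge = two_proportion_p_value(50_000, 100_000, 51_000, 100_000)
print("100 patients per group:     P = %.3f" % p_small)
print("100,000 patients per group: P = %.6f" % p_huge)
```

The difference is equally small in both cases. Only the very large trial labels it statistically significant, which says nothing about whether one percentage point matters clinically.
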
"Not significantly different" does not mean the effect is absent, small or
scientifically irrelevant.
If a difference is not statistically significant, you can conclude that the observed results are not
inconsistent with the null hypothesis. Note the double negative. You cannot conclude that the
null hypothesis is true. It is quite possible that the null hypothesis is false, and that there really
is a difference between the populations. This is especially a problem with small sample
sizes. It makes sense to define a result as being statistically significant or not statistically
significant when you need to make a decision based on this one result. Otherwise, the concept
of statistical significance adds little to data analysis.
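
To see why a real difference can easily come out "not significant" in a small study, the simulation below assumes that two treatments truly have response rates of 67% and 77% (the illustrative figures from the start of this document). It counts how often a trial of a given size reaches P < 0.05. The pooled z test, the trial sizes and all function names are assumptions made for this sketch.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(successes_a, successes_b, n, alpha=0.05):
    """Pooled two-proportion z test for two groups of equal size n."""
    pooled = (successes_a + successes_b) / (2 * n)
    std_error = sqrt(pooled * (1 - pooled) * 2 / n)
    if std_error == 0:
        return False
    z = abs(successes_a - successes_b) / n / std_error
    return 2 * (1 - NormalDist().cdf(z)) < alpha


def fraction_significant(rate_a, rate_b, n, trials=1000, seed=0):
    """Simulate many trials in which the treatments truly differ."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = sum(rng.random() < rate_a for _ in range(n))
        b = sum(rng.random() < rate_b for _ in range(n))
        hits += is_significant(a, b, n)
    return hits / trials


# Assumed true response rates of 67% and 77%.
power_small = fraction_significant(0.67, 0.77, n=30)
power_large = fraction_significant(0.67, 0.77, n=300)
print("30 patients per group: ", power_small)
print("300 patients per group:", power_large)
```

Under these assumptions most of the small simulated trials miss a difference that really exists, while most of the large ones detect it.
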
Multiple comparisons make it hard to interpret statistical results.
When many hypotheses are tested at once, the problem of multiple comparisons makes it very
easy to be fooled. If 5% of tests will be "statistically significant" by chance, you expect lots of
statistically significant results if you test many hypotheses. Special methods can be used to
reduce the problem of finding false, but statistically significant, results, but these methods also
make it harder to find true effects. Multiple comparisons can be insidious. It is only possible to
correctly interpret statistical analyses when all analyses are planned, and all planned analyses
are conducted and reported. However, these simple rules are widely broken.
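
The arithmetic behind this warning is short. Assuming the tests are independent and every null hypothesis is true, the chance of at least one "significant" result among k tests is 1 - (1 - 0.05)^k. The sketch below also shows the Bonferroni correction as one example of the special methods mentioned above. The function names are made up for illustration.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests is
    "statistically significant" when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test P value cutoff under the Bonferroni correction."""
    return alpha / num_tests


for k in (1, 10, 20):
    print(k, "tests:",
          "chance of a false positive = %.2f," % chance_of_any_false_positive(k),
          "Bonferroni cutoff = %.4f" % bonferroni_threshold(k))
```

With 10 tests the chance of at least one false positive is about 40%, and with 20 tests about 64%. The Bonferroni cutoff for 20 tests drops to 0.0025, which is also why such corrections make true effects harder to detect.
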
Correlation does not mean causation.
A statistically significant correlation or association between two variables may indicate that one
variable causes the other. But it may just mean that both are influenced by a third variable. Or it
may be a coincidence.
Published statistics tend to be optimistic.
By the time you read a paper, a great deal of selection has occurred. When experiments are
successful, scientists continue the project. Lots of other projects get abandoned. When the
project is done, scientists are more likely to write up projects that lead to remarkable results, or
to keep analyzing the data in various ways to extract a "statistically significant" conclusion.
Finally, journals are more likely to publish “positive” studies. If the null hypothesis were true,
you would expect a statistically significant result in 5% of experiments. But those 5% are more
likely to get published than the other 95%.