Significance
You read the technical literature and you are bound to see the words “statistically
significant.” But statistically significant may not mean what you think it does. You might tell your doctor
that Treatment A had a response rate of 67% and Treatment B 77% in some paper you
read, and he might say, “I know about that, but the difference is not statistically
significant.” Does that mean the higher response rate doesn’t matter? Does it mean that
Treatment B is really no better than Treatment A? It would be easy to assume either of
these, but actually the answer is “none of the above”! Read on to learn what “statistical
significance” is really about.
Examples
It may not be obvious exactly how the question of whether observed differences are real
or might be merely due to chance actually arises, so here are some examples which
should give you some intuition:
Suppose that in a clinical trial patients were randomly assigned to either Treatment A or
Treatment B. Suppose that so far, Treatment A has a cure rate of 100% and Treatment
B has a cure rate of only 50%. That’s a pretty dramatic sounding difference! But is the
difference real, or is the difference just due to chance? Well suppose that so far two
patients have been assigned to each treatment and both were cured with treatment A,
but only one of the two was cured with treatment B. Maybe the next patient assigned to
treatment A will not be cured and the next patient assigned to treatment B will be. Now,
with just two more patients, the cure rate is 2 out of 3 or 66% for both treatments. What
does all this tell you about the “real” cure rate for these treatments? Can you conclude
that they are both the same, 66%? Probably not! With so few patients you should be
feeling pretty unsure about what the real cure rates are.
But what if you treated 100 patients with A, and 100 with B, and the cure rate was still
100% for treatment A and still only 50% for B? I bet you would conclude that there was
a major and very real difference between A and B and you would certainly opt for A,
given the choice!
Now suppose with 100 patients in each group the cure rates were 66% and 60% for A
and B. Is this a real difference, or is the difference just due to chance? It’s not so
obvious! Here is where statistical significance comes to the rescue.
Intuitions
These examples bring out some basic intuitions which are behind the idea of statistical
significance:
The bigger the difference between groups the less likely it is that it’s due to chance.
The larger the sample size (number of patients) the more likely it is that the observed
difference is close to the actual difference. This is an example of the “law of large
numbers.”
These two principles interact, so that if you observe a large difference between the
groups, the groups could be relatively small before you conclude that the difference is
likely to be real, and not due to chance alone. Similarly if the observed difference is
small you would need more patients before you are convinced that it’s probably not due
to chance alone.
More precisely, statistical testing works by assuming that the groups are actually the
same – that there is no difference – and then mathematically estimating the probability
that you would see a difference between groups at least as big as the one you actually
saw – just due to chance. This probability is called ‘p’ and is referred to as a “p-value”.
Mathematical probabilities, like p values, range from 0 to 1, where 0 means no chance
and 1 means certainty. 0.5 means a 50% chance and 0.05 means a 5% chance.
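The 66% versus 60% example from earlier can be turned into a rough sketch of how such a probability is estimated by simulation. The numbers below are the illustrative ones from the text, not data from any actual trial, and this brute-force approach is just one of several ways a p-value can be computed:

```python
import random

random.seed(0)

# Illustrative data from the example in the text: 100 patients per arm,
# 66 cured with Treatment A, 60 cured with Treatment B.
n = 100
cured_a, cured_b = 66, 60
observed_diff = abs(cured_a - cured_b) / n  # 0.06

# Null hypothesis: both treatments share the same cure rate.
# Pool the two arms to estimate that common rate.
pooled_rate = (cured_a + cured_b) / (2 * n)  # 0.63

# Simulate many trials under the null and count how often chance alone
# produces a difference at least as big as the one actually observed.
trials = 20_000
extreme = 0
for _ in range(trials):
    sim_a = sum(random.random() < pooled_rate for _ in range(n))
    sim_b = sum(random.random() < pooled_rate for _ in range(n))
    if abs(sim_a - sim_b) / n >= observed_diff:
        extreme += 1

p_value = extreme / trials
print(f"estimated p-value: {p_value:.3f}")
```

With these numbers the estimated p-value comes out somewhere around 0.4 – a roughly 40% chance of seeing a 6-point difference by chance alone, which is nowhere near statistical significance.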
The usual case is that you are hoping a difference between groups is real and not due
to chance. In this case, smaller p-values are better than larger ones: p=.05 is better than
p=.10. If you are hoping there is no real difference, then larger p values are better
than smaller ones. This might be the case if you were hoping less radical surgery was
just as good as more radical or disfiguring surgery. In this case p=.5 is better than
p=.05! Often in a randomized trial various characteristics of the two groups are
compared in the hopes that they are not significantly different. Again, in this case bigger
is better!
If the p-value is relatively large, so that the chances are relatively high that the
difference could have arisen by chance alone, then the results are at least consistent
with the idea that there is no real difference between the groups for the characteristic
being tested. But if p is very small, then the results are not consistent with the idea that
there was no real difference between the groups – or at least it is very unlikely that there
is no difference.
The choice of .05 (or a 1 in 20 false positive rate) as the usual value to declare
statistical significance is arbitrary – scientists do want to have a high confidence in their
conclusions but attaining ever smaller p values requires larger samples, more time and
expense, and possibly subjecting people to treatments which the data already indicates
are very likely to be inferior.
As a cancer patient, you may choose to make different judgements about what p value
you want to use to guide your decision making – suppose two treatments have about
the same side effects, and one looks like 5 year survival may be better than the other,
but p=.06 – not quite statistically significant. Would you decide it didn’t matter which
treatment you got because the chance of a false positive is six percent instead of five? I
wouldn’t! (Note that p between .10 and .05 is often referred to as a “trend” – a marginal
result). You might decide to choose based on even higher values such as p=.2 (20%
chance of false positive), but at some point you would conclude the wisp of evidence
favoring one treatment isn’t enough to believe it. You need to consider not only the p
value but also the totality of circumstances – how different the treatments are in side
effects, how big the difference in the results was and so forth. Man does not make
rational decisions on p values alone!
While I don’t intend to get into the mathematics of statistical significance (I couldn’t even
if I wanted to!) the basic intuition I described earlier can be recast in the terms of
statistical testing:
The larger the sample size, the smaller an observed difference has to be in order to be
statistically significant.
The smaller the sample size, the larger an observed difference would have to be in
order to be statistically significant.
If you design a clinical trial to detect a possible difference between treatments, you need
to have a large enough sample size so that if there is a difference big enough that you
care about it, you are actually likely to find that difference to be statistically significant
when you are done with the trial. Some intuitions about the number of patients in a
clinical trial:
The smaller the real difference is, the more patients you need to be likely to detect a
statistically significant difference in a clinical trial.
The larger the real difference is, the fewer patients you need to be likely to detect a
statistically significant difference in an actual clinical trial.
Perhaps the above suggests to you that even when a trial is designed to detect a
difference of a given size, and even when that difference is really there, it is possible
that the trial will not result in a statistically significant difference – that is there is a
chance the trial will be a false negative. The chance of a false negative can be
calculated given the size of the trial and the size of the minimum difference of interest.
Power
When a clinical trial is designed it is important to have a large enough group of patients
(sample size) so that if there is a difference between treatments big enough that you
care about it, you are likely to get a statistically significant difference from the actual
trial. You can never be guaranteed that you will see a statistically significant difference
even if there is a real difference, since just by chance you might not happen to get
results for the better treatment as good as its real success rate, or you might happen to
get results for the less effective treatment that are better than its real success rate. The
larger the sample size, the lower the chance this will happen.
Given the minimum difference you want to detect, the p value you require to
declare the results statistically significant (usually .05), and the sample size, you can
calculate the probability that you will detect a statistically significant difference. This
probability is called the power of the experiment. Of course, what is usually done is to figure out what
sample size is needed to achieve a desired power – usually between 0.8 and 0.9 (an 80
to 90 percent chance of getting a statistically significant difference, if there is a real
difference of at least the specified size). A trial which doesn’t have enough patients is
often called “underpowered” in the literature. Underpowered trials risk not finding real
differences even when they are there.
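Power can be estimated with a small simulation sketch like the one below. The cure rates (an assumed real difference of 70% versus 50%) and the sample sizes are made-up illustrations, and the significance test used is a plain normal-approximation test for two proportions rather than whatever a real trial protocol would specify:

```python
import math
import random

random.seed(1)

def two_prop_p(c1, c2, n):
    """Two-sided p-value for a difference in cure counts c1 vs c2 out of n
    patients per arm, using the normal approximation."""
    p1, p2 = c1 / n, c2 / n
    pooled = (c1 + c2) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # two-sided tail probability

def power(rate_a, rate_b, n, alpha=0.05, trials=2000):
    """Fraction of simulated trials that reach statistical significance."""
    hits = 0
    for _ in range(trials):
        cured_a = sum(random.random() < rate_a for _ in range(n))
        cured_b = sum(random.random() < rate_b for _ in range(n))
        if two_prop_p(cured_a, cured_b, n) < alpha:
            hits += 1
    return hits / trials

results = {}
for n in (25, 50, 100):
    results[n] = power(0.70, 0.50, n)
    print(f"{n} patients per arm: power ~ {results[n]:.2f}")
```

The same real 20-point difference gives a power of only around 0.3 with 25 patients per arm but over 0.8 with 100 per arm – which is why a trial that is too small is called underpowered.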
A Type I Error is a false positive – deciding that there is a real difference when in fact
there is no difference. If p=.05 then there is a 5% chance of a type I error. A Type II
Error is a false negative – concluding there is no difference when in fact there is a real
one, because the test fails to show the difference to be statistically significant. The
chance of a type II error is estimated with the power of the experiment. Type II error is
also called “beta error”.
Some common statistical tests you will see named in the literature:
Student’s t test
Chi-Squared test
Mann-Whitney test
Log Rank Test
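As an illustration of one of these, here is a sketch of the Chi-Squared test applied to the 66% versus 60% example from earlier, laid out as a 2x2 table of cured versus not cured. The formula is the standard one for a 2x2 table (without continuity correction), and the numbers are the illustrative ones from the text:

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-squared statistic (no continuity correction) for a 2x2 table:
             cured   not cured
    Arm A      a         b
    Arm B      c         d
    """
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

def p_value(chi2):
    # Tail probability for chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# 100 patients per arm: 66 of 100 cured with A, 60 of 100 cured with B.
stat = chi2_2x2(66, 34, 60, 40)
print(f"chi-squared = {stat:.2f}, p = {p_value(stat):.2f}")
```

For these numbers p comes out around 0.4, agreeing with the intuition that a 66% versus 60% split in 100 patients per arm could easily be due to chance alone.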
Pitfalls
There are a lot of gotchas in interpreting statistics of all kinds and statistical testing
seems to have a particularly large number. So by the time you finish this next section
you might think “statistically significant” is completely insignificant! Nothing could be
further from the truth. It’s just that you have to be very careful about the conclusions you
draw from statistics.
It’s also important to realize that, “statistically significant difference” does not mean “big
difference”. If two treatments are very similar in outcome, but not exactly the same, you
can find a statistically significant difference by just testing enough people. In general the
more people you include in a trial, the smaller a difference is needed before that
difference proves to be statistically significant. So if in some trial, treatment A has a cure
rate of 52% and treatment B 54%, then even if they tested enough patients to make this
difference “statistically significant” you are not likely to decide that B is really much
better, and very likely other characteristics of A and B such as side effects would guide
your choice between them.
Finally, not all statistical testing is done about testing some intervention – like a
treatment. Often epidemiological studies will look to see if there is
an association between some characteristic or behavior, and the chance of getting or
surviving some disease. It might be seen that people who drink coffee are more likely to
get lung cancer than those who do not, and the difference in lung cancer rates among
coffee drinkers and non-coffee drinkers might be found to be statistically significant. But
this doesn’t mean coffee drinking causes lung cancer! Instead coffee drinking might be
associated with some other characteristic which pre-disposes people to lung cancer,
such as smoking. In this case it’s obvious and statisticians would try to “adjust” for
smoking, but in general, no one knows all of the causes of any form of cancer, so it
could be an association of the tested characteristic with some completely unknown risk
factor which is actually causing the difference.
In sum, you need to look carefully at the study design to see if it is possible that
something other than treatment differences might explain the difference in results.
Noticing that a trial has a historical control rather than a randomized design should
reduce your confidence in the results. But you may still choose to make decisions based
on uncertain information. After all, that’s the only kind of information there is!
Sometimes, the trick is figuring out which information is least uncertain and that’s where
awareness of things like randomized designs versus historical control designs helps.
Now, the median survival is the amount of time half the patients survive. IL-2
dramatically increases the survival of kidney cancer patients – but only a few of them.
This could not affect the median survival much. If the patients who got a long
term remission all happened to be those who were destined to live a little longer than the
median anyway, then IL-2 treatment might not improve the median one bit! In fact, a
treatment which is toxic, and slightly shortens the survival of many patients, but
produces a few cures, could actually decrease the median survival and still be worth
trying for people who are willing to pay a price for a chance at long term survival. In
cases like these, long term survival rather than median survival is the statistic of interest.
Of course it takes years to accumulate reliable data on the proportion of long term
survivors, and if this proportion is small, it will take a huge sample size as well. You may
have to take your best guess based on the data… even when statistical significance for
what matters to you has not been achieved.
Often in technical papers and presentations, median survival is just referred to as
“survival” and if there is no statistically significant difference in median survival, you will
hear or read that, “there was no difference in overall survival.” You need to think about
exactly what they are saying, and about what the data says about the effect of the
treatment on the entire population – not just the effect on an average or median.
Depending on the p value, the chances of a false positive can be much better than one
in twenty. You can look at the p value to tell how unlikely it is that the result occurred by
chance alone: if you have p=.000001, then the chances are only one in a million that the
result was due to chance.
The mind is a pattern finding machine extraordinaire! There is nothing else like it in the
universe! When you look at data, searching for patterns, you consciously or
unconsciously check many different possibilities to see if it looks like this or that
property makes a difference in the outcome. Similarly you might check many different
outcomes to see if the treatment made a difference in that outcome – if the patients
didn’t live any longer, maybe they spent less time in the hospital or used fewer pain pills
or… well who knows. There can be hundreds or thousands of possibilities.
The problem is that if you check hundreds of things then it is actually likely that you will
see a statistically significant difference in at least one of them just by chance alone!
Remember that you could get a result that is statistically significant at the p<.05 level
one in twenty times just by chance alone. So if you check enough things this is
actually likely to happen at least once. If you buy enough lottery tickets you’re probably
gonna win – only in this case the prize is the bitter fruit of false conclusions. It is easy to
be fooled by this trick of the statistical light.
The way not to be fooled is to specify exactly what you are going to test before you start
the experiment (this is called defining endpoints prospectively) and to specify a limited
number of things. There are statistical methods of correcting for multiple tests if you
know in advance exactly what you are testing.
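The arithmetic behind this “check enough things” trap is simple enough to sketch. The text does not name a particular correction method; one standard choice is the Bonferroni correction, which simply divides the significance threshold by the number of planned tests:

```python
# Chance of at least one "significant" result when testing k independent
# hypotheses at the .05 level, when in reality there are no differences:
for k in (1, 5, 20, 100):
    print(f"{k:>3} tests: chance of a false positive = {1 - 0.95 ** k:.2f}")

# Bonferroni correction: test each of the k hypotheses at .05/k instead,
# which holds the overall chance of any false positive to roughly .05.
k = 20
alpha_each = 0.05 / k
overall = 1 - (1 - alpha_each) ** k
print(f"with correction, overall false positive chance = {overall:.3f}")
```

With 20 uncorrected tests the chance of at least one spurious “significant” finding is about 64% – more likely than not – which is exactly the lottery-ticket effect described above.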
If you see something interesting in the data afterwards and do a statistical test that was
not planned in advance then often all you have is an interesting hypothesis which must
be confirmed by doing the experiment again, rather than a statistically valid conclusion.
If the p value is very small (much less than .05) and if the problem space is not one that
lends itself to an indefinite number of possibilities to check, then it is more likely that you
really have something. It is also more likely if it makes sense in terms of what is known
about the disease – the observation that, say left handers have worse survival doesn’t
make sense in terms of any known biological mechanism and seems unlikely – a trick of
the statistical light. In contrast if you were to observe that people with higher levels of
testosterone had worse survival in prostate cancer it would be relatively plausible
because prostate cancer cells are known to be stimulated by male hormones.
Diagnosis
I was diagnosed with widely metastatic kidney cancer only one month after surgery for a
huge, but seemingly localized, tumor. I found the survival curve to the right in a review
article my own doctor had written. This curve shows survival of patients with what I had,
who either relapsed within six months of surgery (like me), or who had metastasis at
diagnosis. It applied to me.
The curve is approximately constant risk. Constant risk means that the chance of dying
does not change over time. Constant risk curves have a constant half life, which is the
time it takes for half the remaining patients to die. Living longer brings no relief; the risk
is always the same. For my curve, the half life (and therefore also the median survival) is
only about four months. The chance of surviving one year is only about 12%. At four
and one half years, the curve reaches zero.
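The arithmetic of a constant risk curve is easy to check: with a half life of four months, the fraction surviving at time t is 0.5 raised to the power t/4.

```python
half_life = 4.0  # months, as read off the curve described in the text

def surviving_fraction(months):
    # Constant risk: each half-life, half of the remaining patients die.
    return 0.5 ** (months / half_life)

print(surviving_fraction(4))   # 0.5 -- the median survival
print(surviving_fraction(12))  # 0.125 -- close to the ~12% one-year survival in the text
print(surviving_fraction(54))  # at 4.5 years the model gives a tiny fraction, never exactly zero
```

Note that a strictly constant-risk model never actually reaches zero, which fits the caveat below that while the chance of surviving 5 years is surely very small, it is probably not exactly zero.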
Taken at face value, this is a statistical nightmare, a stark portrait of the odds against
terminal cancer. There is no hint of even the slightest possibility of a cure or way out.
Only a relentless descent to death. But this superficial appearance was not the reality.
Escape was possible!
Caveats: The data was old, and strongly reflects the fact that at that time there was no
standard effective treatment. But that was still true at the time of my diagnosis. This
curve is also based on a fairly small number of patients and cannot exclude a tiny
fraction of truly long term survivors. While the chance of surviving 5 years is surely very
small, I don’t believe it is actually zero!
Treatment
I changed my odds and got off the curve above by entering a clinical trial of a then
experimental drug called Interleukin-2. Here is a long term survival curve for 255
patients treated with IL-2.
Notice that this curve, while still very rough, has 20% survivors at four and a half years
(instead of zero percent in the previous curve) and maybe 15% survivors at 10 to 11
years. There is a small but real chance here! Best of all it flattens out towards the end,
suggesting that a few may actually be cured.
When I started my treatment, IL-2 was still in development. So since it took more than a
decade to accumulate enough follow-up to make this curve, I didn’t have it to look at
when I had to make my decision. What I had instead were hints that IL-2 might improve
the curve – hints in the form of a report in the literature of dramatic responses that were
continuing after two or three years (to see just exactly what I was looking at, see my
article, The Hint). Given the survival with standard treatment, that was more than enough
for me. Seeing the first curve with its message of doom turned out to be a positive
because only in comparison to that dreadful curve could I have known that a mere hint
was worth pursuing with everything I had.
Caveats: Unlike the first curve, this curve is not limited to treatment of patients who
relapsed within 6 months of diagnosis, so the inclusion of patients who had a somewhat
better prognosis to begin with may represent part of the improvement. More generally,
there is considerable uncertainty any time you compare survival curves from completely
different trials or groups. There can easily be a difference between the groups which,
rather than the difference in treatment, accounts for a difference in survival. Ideally,
survival curves are best compared when they are from a randomized trial where bias
due to group difference is eliminated by the trial design. But in the real world, decisions
often have to be made with less than ideal data. In this case the fraction of really long
term survivors is still better than what could reasonably be expected with metastatic renal
cancer with earlier treatments, so while the curve I changed to may not be exactly what
I present here, I surely did change my odds for the better by taking IL-2.
Response
I could have responded… or not responded. I had control over my choice of treatment,
but not over whether it worked. I was fortunate in that I did respond. Once that
happened, the odds changed again, and again dramatically for the better! Here is a
curve for patients who
responded to high dose IL-2. Notice that nearly 40% are still in remission over a decade
later. Had I not responded, my odds would have changed for the worse and probably
would not be much better than they were when I started.
Caveats: Note that this curve charts response duration rather than actual survival.
Since patients survive at least some time after relapse, and since it takes at least some
time to get into remission, an actual survival curve would look at least
slightly better than this.
There is a classic statistical trap in comparing responders to non-responders (or to all
patients treated) because those who respond may have been those who were healthier
to begin with and who would have lived longer anyway. So it may not be the case that
the patients really benefited from treatment, even if they achieve a temporary shrinkage
of their tumors, as is often the case with chemotherapy (My treatment was
immunotherapy). I do not think that this is the case here because many of the
responses that did occur were long term, and because without treatment, long term
survival for this disease is almost zero.
Flat Line!
Once I got into remission, I resumed my life hoping I would be one of those whose
responses lasted. As time passed I got to see more and more data on how the
responders were doing. Since there have been no relapses after 85 months, for
survivors who are out at least that far, the curve is flat at 100%. In fact, as I write this in
September 2001, I am off the end of the curve, in remission at 142 months, so my curve
is now flat at 100%! (If you don’t actually see a curve here, it’s the line at the very top!)
Caveats: That 100% is based on a small number of patients, so just as I don’t believe
the first curve really guaranteed death, I also don’t believe this one guarantees life,
though things are looking very, very good. Finally, because I am “off the curve,” I am
extrapolating a little in time to claim 100%. But despite these caveats, the difference
between the curve I am on now and the one on which I started my journey is not in
doubt. It is infinite.
This CancerGuide Page By Steve Dunn. © Steve Dunn
Page Created: 2001, Last Updated: March 14, 2002
Oncologists frequently use some of the same terms in talking to patients about benefits
that might be expected from treatment. In particular, it’s common for an oncologist to
estimate the chance you will “respond” to treatment. Doctors will also use the jargon of
endpoints in talking about how well treatment is working as in, “you are responding to
the treatment.” Knowing what “respond” might mean will help you know exactly what
questions to ask.
Rather than just reciting a laundry list of different efficacy endpoints, I discuss when
they’re used, how they relate to clinical trials (this article partially overlaps some of
my clinical trials articles), technical details of how they’re measured, the pros and
cons of each, and, of course, the associated jargon, which (surprise!) can be just a bit
thick at times.
Response and Related Endpoints: Measures tumor shrinkage in response to treatment and
how long that shrinkage lasts. Response is the typical main endpoint for phase II trials.
Survival: Just what you think it is! A typical main endpoint (survival is important!) for phase III
trials.
Progression Free Survival and Disease Free Survival: Measures the length of time that a
patient is both alive and without worsening of their cancer. These are typical endpoints for
phase III and adjuvant trials.
Quality of Life: Based on subjective measures of how well the patient is functioning and
enjoying life. This takes into account both benefits of treatment and loss of quality of life due to
the side effects of treatment. Quality of life is typically an endpoint of phase III and adjuvant
trials.
Complete Response (CR): Complete response means all detectable tumor has disappeared.
If a treatment does cure some patients, those patients will have their tumor disappear. A CR is
a potential cure. If you have advanced cancer a CR is also the best result you can actually
see from treatment. Even so, a complete response does not necessarily mean the patient is
cured. Even when no tumor can be seen on scans, there can be residual tumor which is too
small to detect, and so unfortunately, complete responses may not last. Whether a complete
response is likely to last can often be gauged by looking at the history of the type of treatment
that produced the response in your type of cancer. In some situations very few complete
responses are cures and in others most are cures. To find out, you have to research your
cancer and the treatment in question. A patient who has had a complete response may be said
to be in complete remission.
Partial Response (PR): This roughly corresponds to at least a 50% decrease in the total
tumor volume but with evidence of some residual disease still remaining. Partial responses
aren’t usually cures and usually aren’t a long term benefit because significant tumor remains.
In some cases the residual disease in a deep partial response may actually be dead tumor or
scar so that a few patients classified as having a PR may actually have a CR. Also many
patients who show shrinkage during treatment show further shrinkage with continued
treatment.
Minor Response (MR): “Minor response” roughly means a small amount of shrinkage. Minor
response is not really a standard term but is increasingly used. Roughly speaking, a minor
response is more than 25% of total tumor volume but less than the 50% that would make it a
PR. A minor response is not enough to be considered a true response, but data on minor
responses may be given in reports on clinical trials. If minor responses are not categorized
then they would be considered Stable Disease (see below). Although minor responses are
often considered insignificant, I believe that some of the new anti-angiogenic therapies
(treatments that target tumor blood vessels) which show a high rate of minor response are
likely to be benefiting patients, as long as those responses last for a reasonable length of
time. Also, of course, an “almost partial” response is hardly different from a “barely partial”
response. Such boundaries are artificial. Again, with more treatment, a minor response may
improve to a partial response.
Stable Disease (SD): Although stable disease intuitively means the tumors stay the same
size, to account for measurement errors on scans and to discount “insignificant” changes,
stable disease includes either a small amount of growth (typically less than 20 or 25%) or a
small amount of shrinkage (anything less than a PR unless minor responses are broken out;
if so, SD is typically defined as less than 25% shrinkage). Because of this, slow growing tumors may be
classed as stable for quite some time, or for several scans if scanning is frequent. Also some
periods of stability are relatively common in some kinds of cancer even without treatment.
Therefore, it is difficult to know if stable disease is the result of treatment. Claims of benefit for
new treatments involving stable disease should be examined skeptically. Like a minor
response, stable disease is not considered a true response. At the same time, if you are
experiencing stability, with or without treatment, that is better than growth. Finally, stable
disease lasts only until significant growth is seen.
Progressive Disease (PD): Progressive disease roughly means significant growth of existing
tumors or that new tumors have appeared. The appearance of new tumors is always progressive
disease regardless of the response of other tumors. Progressive disease normally means the
treatment has failed and in most cases is the signal that it’s time to try something else (or stop
treatment altogether if no good options remain). If you are on a clinical trial and have
progressive disease during treatment, you are likely to be taken off study (in a few cases you
may be allowed to cross-over to the other arm of a randomized trial, or your treatment may be
otherwise modified). Most clinical trials for advanced cancer which allow prior treatment
require that you have had progressive disease since your last treatment (in other words,
that your last treatment has stopped working).
Objective Response (OR): Objective response means either a partial or complete response
(In the literature you’ll frequently see “CR+PR” which means the same thing). When you see
an objective response rate be sure to look at how many are complete responses and how
many are partial since benefits from complete response tend to be greater. Often news
reports and especially press releases by self-interested companies blur this and don’t reveal
that the CR rate is low or non-existent. Track down the original source and find out!
“Clinical Benefit”: Clinical benefit is an informal term which usually means anything other
than progressive disease. Use of this term is suspect, particularly if it is in a press release or
news report. It isn’t automatically clear that patients with stable disease are benefiting from
treatment since the natural history of cancer can include periods of apparent stable disease
and since tumor shrinkage is not equal to clinical benefit to begin with. When you see this
term you should look at both the CR and PR rates and also the duration of “benefit”, including
how long the stable disease lasted.
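The response categories above can be summarized as a sketch. The thresholds below follow the approximate figures given in this article; real trials use formal response criteria whose details differ:

```python
def classify_response(change_pct, new_tumors=False):
    """Rough response category from the percent change in total tumor
    volume (negative = shrinkage). Thresholds are the approximate ones
    described in the text, not a formal trial protocol."""
    if new_tumors or change_pct >= 25:
        return "PD"  # progressive disease: significant growth or new tumors
    if change_pct <= -100:
        return "CR"  # complete response: all detectable tumor gone
    if change_pct <= -50:
        return "PR"  # partial response: at least 50% shrinkage
    if change_pct <= -25:
        return "MR"  # minor response: 25-50% shrinkage, if reported separately
    return "SD"      # stable disease: anything in between

for change in (-100, -60, -30, 10):
    print(change, classify_response(change))
print(classify_response(-60, new_tumors=True))  # new tumors trump shrinkage: PD
```

Note how the classification depends only on measured shrinkage or growth, which is exactly why categories like “clinical benefit” need to be read with care: the label says nothing by itself about how long the state lasted or whether the patient actually benefited.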
Duration of Response
Cancer therapies can produce temporary responses without any lasting benefit. On the
other hand, some cancer therapies can be curative or at least give a meaningful respite
from the disease. If there are no data to show an improvement in survival, then you
want to look at responses and see how many are lasting. If a treatment is going to make
a real difference, tumor shrinkages should last at least long enough to give a meaningful
respite. If a treatment is to cure some patients, then some complete responses should
last indefinitely. There are treatments which give a small percentage of patients durable
CRs, which may be cures. If you have a difficult cancer, then even if survival is reported
or is the main endpoint, it’s worth looking to see if there are lasting CRs (Frustratingly,
they don’t always report on this).
Reports on clinical trials in the medical literature commonly give not only the number of
responses, but also the duration of responses, particularly for phase II trials where
response is the primary endpoint. Some papers only provide summary information on
this but often the duration of response is given for every responding patient.
Response duration for an individual patient is given as a number possibly with a plus
sign after it like “6+”. Usually the unit is months but not always, so you should read
carefully to be sure. A plus sign after the response duration for an individual patient
means the response was still ongoing at the last evaluation. Conversely, the lack of a
plus sign means the patient relapsed at the given time. The plus sign is anything but a
minor detail – it’s key to telling whether a treatment is giving some patients lasting
benefit!
You will find that the results of early phase trials often include only relatively short
follow-up when they are first presented or published, particularly if they’re presented at
medical meetings where very early results are often presented. Naturally, this makes it
hard to judge response duration. For more established therapies, especially anything
FDA approved, there should be better data on response duration. Similarly, if a new
therapy has been in testing for a while, there may be long enough follow-up that there is
useful data on response duration. This was the case for the experimental treatment
which saved me. See my article The Hint for the actual data.
There are a few nits about measuring response duration. The duration of response
depends on two things: when the response is counted as starting, and when it is
counted as ending. Unsurprisingly, a response is normally counted as lasting from the
time response is first achieved to the time progression from the best response is
detected. What is a little more subtle is that both of these depend a bit on how often
scans and tests are scheduled, including how close to the start of treatment the
pre-treatment measurement scan is taken. The longer the response durations, the less
all of this matters.
What is the range of durations of response? Are there long term CRs?
Technicalities
Measuring Tumor Volume
How tumor volume is measured depends on the kind of cancer. Most common cancers
form discrete nodules or masses. These kinds of cancer are called “solid tumors” and
there are very standardized methods for measuring solid tumors. I talk about
response criteria and measurement of solid tumors below. After that I briefly discuss the
other cases, liquid tumors, and types of solid tumor where the standard response
criteria don’t work well.
Normally, a response has to last at least a month or be confirmed by the next set of scans before
it counts. Often a longer duration is required for apparently Stable Disease to count. I am
seeing 6 months more and more often. For most kinds of cancer 6 months would be a long
time for there not to be significant growth if treatment weren’t having an effect. Watch out for
early reports, especially meeting presentations (and those ever-spinning press releases),
which don’t specify how long counts as stable or which don’t have any minimum to count as
stable.
Individual trials can have differing definitions of response. Any clinical trial report in the
technical literature will describe what those criteria were. Often this is just referenced as the
standard WHO or RECIST criteria (see below) but sometimes more details are given.
Blood markers like PSA for prostate cancer or CA-125 for ovarian cancer are not used as a
substitute for measuring tumor masses in the standard cases. If there is a standard blood
marker for your cancer, your levels must return to within the normal range for you to be
declared to be in complete response. If the standard response criteria don’t work well for your
type of cancer and there is a standard blood marker, it will probably be used in constructing
the response criteria.
You may have both measurable tumors and non-measurable tumors. For instance you
might have measurable lung metastases and bone metastases. If you have at least one
measurable tumor then you have measurable disease.
The older standard for response, the WHO (World Health Organization) criteria, defined
a shrinkage of a tumor as the decrease in the product of the largest perpendicular
diameters in the largest “slice” of the tumor on a scan. This product is the area of a
rectangle and is proportional to the area of an ellipse (a more likely cross-section of a
tumor). Measurable disease under this system is called “bidimensionally measurable”.
The newer standard, the RECIST criteria, defines a tumor’s shrinkage as the decrease
in the length of its largest diameter. This is called “unidimensionally measurable”. This
makes measurement easier and has been shown to be as good as measuring in two
dimensions.
It turns out that, on average, a 30% reduction in one dimension is about the same as a
50% reduction in the product of two dimensions. If you had a circle and its diameter
shrunk by 30%, as in the RECIST criteria, then the product of two perpendicular
diameters, as in the WHO criteria, would shrink by 51%, since 0.7 x 0.7 = 0.49 means
49% of the original product remains. That is close enough to 50%. Anyway, the main
point here is that the two systems give very close to equal results.
Note that tumors actually have three dimensions, so a partial response in either system
actually means more than a 50% reduction in tumor volume (roughly 65%, since
0.7 x 0.7 x 0.7 is about 0.34).
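This arithmetic is easy to check directly. Below is a minimal Python sketch, assuming an idealized spherical tumor (real tumors are irregular, so treat the volume figure as a rough guide):

```python
# Effect of a 30% shrinkage in one diameter (the RECIST partial-response
# threshold) on the WHO two-diameter product and on the volume of an
# idealized spherical tumor.

d = 1.0                  # original diameter (arbitrary units)
d_new = d * (1 - 0.30)   # 30% shrinkage in one dimension (RECIST PR)

# WHO criterion: product of two perpendicular diameters
area_reduction = 1 - (d_new * d_new) / (d * d)   # 1 - 0.49 = 0.51

# A sphere's volume scales as the cube of its diameter
volume_reduction = 1 - (d_new / d) ** 3          # 1 - 0.343 = 0.657

print(f"WHO product reduction: {area_reduction:.0%}")   # 51%
print(f"Volume reduction:      {volume_reduction:.1%}") # 65.7%
```

So a partial response under either system corresponds to roughly a two-thirds reduction in volume, not just 30%.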
If you have one of these tumors, you will need to learn what the criteria are for response
in your specific cancer. You’ll pick it up as you read papers, from your own experience
as a patient, and from the basic information on your cancer which is likely to talk about
any specialized tests particular to your cancer. There may or may not be standardized
criteria, and you may also find that definitions of response incorporate blood marker tests
like PSA for prostate cancer, or incredibly sensitive DNA based tests for finding cancer
cells in some types of leukemia.
Resources
RECIST Criteria Quick Reference from the US National Cancer Institute.
Therasse P, Arbuck SG, Eisenhauer EA, Wanders J, Kaplan RS, Rubinstein L, Verweij J, et al.
New guidelines to evaluate the response to treatment in solid tumors.
J Natl Cancer Inst. 2000;92(3):205-16. [PubMed Abstract (will open in new window)]
Comment: A more complete description of the RECIST criteria along with validation
studies comparing to the WHO criteria.
Unlike other endpoints, response shows treatment has a direct biological effect on the
cancer. Unlike survival related endpoints, response does not require randomized trials
to demonstrate that the treatment is having an effect. People survive various lengths of
time with no treatment, but tumors only very rarely go away on their own.
If you have advanced cancer which has no good treatment and are looking for new and
promising therapies you may very well need to rely on phase II trial results which have
response as an endpoint. Phase III trials to prove survival take a long time to conduct
and phase II results will be available much sooner. If there are complete responses and
if responses appear to be lasting this is real evidence the treatment is helping some
patients, even if there is uncertainty.
Survival
Improved survival is a major goal of cancer treatment (Cure is the goal!). Survival is
therefore a very important endpoint in cancer trials, and one which, unlike some other
endpoints, is a direct benefit to patients.
Because showing an improvement in survival requires comparison to a control group,
survival is a primary endpoint in Phase III and Adjuvant trials. These trials are normally
randomized so that the survival of comparable groups treated with different treatments
can be compared. Survival is normally reported using survival curves. See the Statistics
Section for several articles on these important curves.
Thoughts on Survival
Survival is technically easy to measure unambiguously and objectively, since there is
nothing subjective about the date someone dies. This is an advantage over endpoints
like response or progression free survival, which require determining whether the tumor
is growing, not just how long patients live. Survival also captures any long term increase
in the death rate due to delayed side effects, even ones unrelated to the acute side
effects of treatment.
In situations where long term survival is common, measuring survival differences may
take a long time. There is a risk that ideas, technologies and treatments will have
improved by the time the results are in. Survival always takes longer than progression-
free survival (see below) to measure.
When you see positive results from a survival trial, it’s important to ask whether the trial
is actually increasing the cure rate or whether instead it extends life without increasing
the cure rate (An increase in cure rate will result in a higher plateau on the right tail of
the curve). If the treatment improves survival without improving the cure rate, you want
to know the likely and possible benefits. Not infrequently a big deal is made out of
treatments which improve median survival by only a few weeks or months. But what
may be lost is whether some patients get a much larger benefit and survive years longer
than they would have. The best way to answer these questions is to look at the survival
curves. For much more on this, look at my articles on survival curves in the Statistics
Section.
Measuring survival in a randomized trial for advanced cancer requires that patients
given one treatment who progress cannot be allowed to try the other treatment
(switching arms is called cross-over). This may not be perceived as a disadvantage by
those who design trials, but it most certainly is by patients whenever denying cross-over
denies them any additional hope that might come from trying the other treatment.
Although cross-over isn’t allowed in these circumstances, ethical considerations
normally require that patients who have advanced disease be permitted to seek out
other therapy of their choice. This actually is a scientific disadvantage since the results
may be blurred or even biased by treatments chosen later. Note that if the endpoint is
progression free survival (see below) then cross-over can be permitted.
Finally, historical or concurrent non-randomized controls have very low standing
because they are subject to bias. For instance, improvement in tests for detecting
recurrence which result in finding recurrences sooner can make it look like more recent
patients live longer after a diagnosis of recurrence. Concurrent non-randomized controls
suffer from all kinds of possible bias related to characteristics of patients who choose (or
are referred to) one kind of treatment versus another. For instance, more motivated and
stronger patients might choose to travel for a difficult experimental treatment compared
to those who came from the local area and took the standard treatment. Patients with
those characteristics might well live longer than others even if the new treatment is
actually no better than the old. A closing thought: Almost every intelligent layman I have
ever talked to comes up with the idea that with rigorous recording of treatment results
and patient characteristics, the need for expensive (twice as many patients), and
sometimes ethically uncomfortable randomized trials could be eliminated. I believe the
dogma of the randomized trial inhibits efforts to find a better way.
Jargon
OS: Overall Survival
Progression Free Survival and Disease Free Survival
Progression Free Survival is the length of time you are both alive and free from any
significant increase in your cancer (free from progression). Progression is defined the
same way as I described under response.
For adjuvant therapy, if the same treatment works to some extent in recurrent cancer,
the question arises whether treating patients after they relapse is just as effective. The
hope is that treating when there is so little cancer that the patient appears to be free of
disease will be more effective than treating when there are large detectable tumors. In
some cases this has proven to be true. If adjuvant treatment appears to improve the
cure rate when the same treatment doesn’t cure in relapsed patients, it’s a win.
Because no one knows for sure who will relapse, adjuvant therapies usually mean
treating some patients who are already cured. The disadvantage of that increases the
more likely it is that the patients who are treated are in fact already cured, and the more
toxic and expensive the adjuvant therapy.
Progression free survival and disease free survival can translate to an improvement in
quality of life since symptoms from the cancer are delayed – but only if side effects of
treatment aren’t worse.
Also unlike trials with a survival endpoint, randomized trials in metastatic cancer which
measure Progression Free Survival can allow cross-over on progression if it makes
sense. In adjuvant therapy trials cross-over doesn’t make sense, since patients get only
one adjuvant therapy and are treated for recurrence if they relapse.
Both of these endpoints are subject to some uncertainty compared to plain survival
because determining that there is progression or relapse involves reading scans, which
always has some associated uncertainty. Doing these trials well requires a rigorous
schedule for testing, and a rigorous protocol for examining the scans and declaring a
relapse. This includes reading of the scans at a central location by radiologists who are
not told which treatment the patients got (the radiologists are said to be “blinded”).
Jargon Summary
PFS: Progression Free Survival
DFS: Disease Free Survival
Quality of Life
Quality of Life is supposed to measure how you feel and how you function. Although
quality of life is certainly important in the broad sense, unfortunately, there is no
unambiguous physical measurement or definable property which corresponds to your
“Quality of Life”. Quality of Life is therefore measured using a brief questionnaire in
which patients rate their ability to function in various ways and enjoy life. Patients
typically fill out the questionnaire several times during the course of the trial.
Quality Of Life is typically measured in Phase III trials or Adjuvant Trials and it is
typically a secondary endpoint, less important than survival related endpoints.
The most commonly used questionnaire is called the Functional Assessment of Cancer
Therapy (FACT) scale [Cella 1993], and there are specialized FACT questionnaires for
several different types of cancer (which tend to affect quality of life in different ways).
Technicalities
The FACT questionnaire was constructed (very roughly here) by surveying patients and
also doctors about what is important to quality of life and then constructing test
questions which were evaluated by patients. The resulting questionnaire was then
tested in several ways for validity, such as comparing results to other measures of
Quality Of Life.
The FACT questionnaire asks you to indicate how true various statements are for you,
covering five different life areas.
Quality of Life is determined by two main variables, the side effects of treatment, and
symptoms of the disease. Intuitively, a treatment which is highly effective against cancer
is likely to improve quality of life by preventing or relieving disease associated
symptoms. Similarly, the side effects of treatment obviously affect your quality of life. In
a sense, this measures the tradeoff between the adverse effects of treatment and its
benefits.
The timing of side effects and benefits is also important in Quality of Life
measurements. If the treatment really works, it will have long lasting benefits but
perhaps only short term side effects, in which case it’s hard to see how a single
questionnaire score can balance the chance of a long term benefit against the acute
side effects. This is less of a problem if the benefits are temporary or if there are long
term side effects. In essence, really good
treatments – those that are highly effective without bad side effects, obviously improve
Quality Of Life and you don’t need a questionnaire to know it. Problem is there aren’t
enough of those!
Finally, keep in mind that while a very honest effort was made to make the
questionnaire reflect the values of the average patient, you are not the average patient.
You very likely will have either more or less severe side effects from any treatment than
average and will react to them either more or less than average. You don’t have
average symptoms from the disease either, and not only that, you won’t get the average
benefit from the treatment either. Therefore, Quality of Life tradeoffs can give only a
rough idea of how you might feel about the same treatment. Normally, you can get a
similar feel for this just from the reported severity of the side effects of the treatment and
the extent of benefit as estimated by other endpoints.
Jargon
Quality Of Life is often abbreviated QOL or QL.
Reference
Cella DF, Tulsky DS, Gray G, Sarafian B, Linn E, Bonomi A, Silberman M, Yellen
SB, Winicour P, Brannon J, et al.
The Functional Assessment of Cancer Therapy scale: development and validation of
the general measure.
J Clin Oncol. 1993 Mar;11(3):570-9.[PubMed Abstract (will open in new window)]
If you know twelve concepts about a given topic you will look like an expert to people who only
know two or three.
Scott Adams, creator of Dilbert
When learning statistics, it is easy to get bogged down in the details, and lose track of the big
picture. Here are the twelve most important concepts in statistical inference.
Statistics lets you make general conclusions from limited data.
The whole point of inferential statistics is to extrapolate from limited data to make a general
conclusion. "Descriptive statistics" simply describes data without reaching any general
conclusions. But the challenging and difficult aspects of statistics are all about reaching general
conclusions from limited data.
Statistics is not intuitive.
The word ‘intuitive’ has two meanings. One meaning is “easy to use and understand.” That was
my goal when I wrote Intuitive Biostatistics. The other meaning of 'intuitive' is “instinctive, or
acting on what one feels to be true even without reason.” Using this definition, statistical
reasoning is far from intuitive. When thinking about data, intuition often leads us astray. People
frequently see patterns in random data and often jump to unwarranted conclusions. Statistical
rigor is needed to make valid conclusions from data.
Statistical conclusions are always presented in terms of probability.
"Statistics means never having to say you are certain." If a statistical conclusion ever seems
certain, you probably are misunderstanding something. The whole point of statistics is to
quantify uncertainty.
All statistical tests are based on assumptions.
Every statistical inference is based on a list of assumptions. Don't try to interpret any statistical
results until after you have reviewed that list. An assumption behind every statistical calculation
is that the data were randomly sampled, or at least representative of, a larger population of
values that could have been collected. If your data are not representative of a larger set of data
you could have collected (but didn't), then statistical inference makes no sense.
Decisions about how to analyze data should be made in advance.
Analyzing data requires many decisions. Parametric or nonparametric test? Eliminate outliers
or not? Transform the data first? Normalize to external control values? Adjust for covariates?
Use weighting factors in regression? All these decisions (and more) should be part of
experimental design. When decisions about statistical analysis are made after inspecting the
data, it is too easy for statistical analysis to become a high-tech Ouija board -- a method to
produce preordained results, rather than an objective method of analyzing data. The new name
for this is p-hacking.
A confidence interval quantifies precision, and is easy to interpret.
Say you've computed the mean of a set of values you've collected, or the proportion of subjects
where some event happened. Those values describe the sample you've analyzed. But what
about the overall population you sampled from? The true population mean (or proportion) might
be higher, or it might be lower. The calculation of a 95% confidence interval takes into account
sample size and scatter. Given a set of assumptions, you can be 95% sure that the
confidence interval includes the true population value (which you could only know for sure by
collecting an infinite amount of data). Of course, there is nothing special about 95% except
tradition. Confidence intervals can be computed for any degree of desired confidence. Almost
all results -- proportions, relative risks, odds ratios, means, differences between means, slopes,
rate constants... -- should be accompanied with a confidence interval.
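As a concrete (if simplified) illustration, here is a sketch of the common normal-approximation ("Wald") interval for a proportion. The function name and numbers are ours, and better intervals (such as Wilson's) exist:

```python
import math

def wald_ci(successes, n, z=1.96):
    """Approximate 95% CI for a proportion via the normal (Wald)
    approximation -- reasonable for moderate n and proportions
    away from 0 or 1."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of the proportion
    return p - z * se, p + z * se

# 67 responders out of 100 patients sampled:
lo, hi = wald_ci(67, 100)
print(f"observed 67%; 95% CI roughly {lo:.0%} to {hi:.0%}")  # 58% to 76%
```

Even with 100 patients, the true response rate could plausibly be anywhere from the high 50s to the mid 70s, which is exactly the precision the interval is quantifying.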
A P value tests a null hypothesis, and is hard to understand at first.
The logic of a P value seems strange at first. When testing whether two groups differ (different
mean, different proportion, etc.), first hypothesize that the two populations are, in fact, identical.
This is called the null hypothesis. Then ask: If the null hypothesis were true, how unlikely would
it be to randomly obtain samples where the difference is as large as (or even larger than) actually
observed? If the P value is large, your data are consistent with the null hypothesis. If the P
value is small, there is only a small chance that random chance would have created as large a
difference as actually observed. This makes you question whether the null hypothesis is true. If
you can't identify the null hypothesis, you cannot interpret the P value.
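This logic can be made concrete with a simulation. In the sketch below the null hypothesis is true by construction: both groups are drawn from the same 60% cure rate (all numbers are purely illustrative), and we ask how often chance alone produces a gap of at least 15 percentage points between two groups of 50:

```python
import random

random.seed(0)  # reproducible illustration

def simulated_gap(p=0.60, n=50):
    """Cure-rate gap between two groups of n patients when both
    treatments truly cure a fraction p (the null hypothesis)."""
    cured_a = sum(random.random() < p for _ in range(n))
    cured_b = sum(random.random() < p for _ in range(n))
    return abs(cured_a - cured_b) / n

trials = 20_000
observed_gap = 0.15   # the gap seen in our hypothetical trial
p_value = sum(simulated_gap() >= observed_gap for _ in range(trials)) / trials
print(f"simulated P value: {p_value:.3f}")
```

With these numbers the simulated P value comes out above 0.1: a 15-point gap between groups of 50 is quite compatible with pure chance, which is exactly what a large P value is telling you.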
"Statistically significant" does not mean the effect is large or scientifically
important.
If the P value is less than 0.05 (an arbitrary, but well accepted threshold), the results are
deemed to be statistically significant. That phrase sounds so definitive. But all it means is that,
by chance alone, the difference (or association or correlation...) you observed (or one even
larger) would happen less than 5% of the time. That's it. A tiny effect that is scientifically or
clinically trivial can be statistically significant (especially with large samples). That conclusion
can also be wrong, as you'll reach a conclusion that results are statistically significant 5% of the
time just by chance.
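The point is easy to demonstrate with a quick calculation. The sketch below hand-rolls a two-sided two-proportion z-test (pooled normal approximation; the function and numbers are illustrative) for a clinically trivial 1-point difference in response rates:

```python
import math

def two_proportion_p(p1, p2, n):
    """Two-sided P value for comparing two proportions with n patients
    per group, using the pooled normal approximation."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))   # = 2 * (1 - Phi(z))

# The same trivial 51% vs 50% difference, at two sample sizes:
print(two_proportion_p(0.51, 0.50, n=1_000))    # ~0.65: not significant
print(two_proportion_p(0.51, 0.50, n=100_000))  # tiny: "significant"
```

The effect is identical in both cases; only the sample size changed. "Statistically significant" measures detectability, not importance.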
"Not significantly different" does not mean the effect is absent, small or
scientifically irrelevant.
If a difference is not statistically significant, you can conclude that the observed results are not
inconsistent with the null hypothesis. Note the double negative. You cannot conclude that the
null hypothesis is true. It is quite possible that the null hypothesis is false, and that there really
is a difference between the populations. This is especially a problem with small sample
sizes. It makes sense to define a result as being statistically significant or not statistically
significant when you need to make a decision based on this one result. Otherwise, the concept
of statistical significance adds little to data analysis.
Multiple comparisons make it hard to interpret statistical results.
When many hypotheses are tested at once, the problem of multiple comparisons makes it very
easy to be fooled. If 5% of tests will be "statistically significant" by chance, you expect lots of
statistically significant results if you test many hypotheses. Special methods can be used to
reduce the problem of finding false, but statistically significant, results, but these methods also
make it harder to find true effects. Multiple comparisons can be insidious. It is only possible to
correctly interpret statistical analyses when all analyses are planned, and all planned analyses
are conducted and reported. However, these simple rules are widely broken.
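The 5%-by-chance arithmetic can be seen directly in a simulation. The sketch below runs 100 comparisons in which the null hypothesis is true by construction (both samples come from the same distribution), so every "significant" result is a false positive:

```python
import math
import random
import statistics

random.seed(1)  # reproducible illustration

def null_comparison_p(n=30):
    """Two-sided P value comparing the means of two samples drawn from
    the SAME normal distribution (known sigma = 1, so a z-test is exact)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = math.sqrt(2 / n)  # standard error of the difference in means
    z = abs(statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(z / math.sqrt(2))

false_positives = sum(null_comparison_p() < 0.05 for _ in range(100))
print(f"{false_positives} of 100 true-null tests were 'significant' at p < 0.05")
```

On average 5 of 100 such tests come out "significant" even though no real difference exists; test a few hundred hypotheses and some false positives are all but guaranteed.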
Correlation does not mean causation.
A statistically significant correlation or association between two variables may indicate that one
variable causes the other. But it may just mean that both are influenced by a third variable. Or it
may be a coincidence.
Published statistics tend to be optimistic.
By the time you read a paper, a great deal of selection has occurred. When experiments are
successful, scientists continue the project. Lots of other projects get abandoned. When the
project is done, scientists are more likely to write up projects that lead to remarkable results, or
to keep analyzing the data in various ways to extract a "statistically significant" conclusion.
Finally, journals are more likely to publish “positive” studies. If the null hypothesis were true,
you would expect a statistically significant result in 5% of experiments. But those 5% are more
likely to get published than the other 95%.