## Musculoskeletal Care Volume 3 Number 2 Whurr Publishers 2005

Main article

An introduction to medical statistics for health care professionals: Hypothesis tests and estimation
Elaine Thomas PhD MSc BSc Lecturer in Biostatistics, Primary Care Sciences Research Centre, Keele University, North Staffordshire, UK

Abstract

## Hypothesis testing and the meaning of a p-value

Data are collected to answer burning research questions. One approach is to perform a hypothesis test. These tests generally involve comparisons, such as between treatment groups or between groups of subjects. The intention of the test is to determine whether the effect of interest is statistically different between the groups under investigation. The research question used to illustrate this process is 'What is the relative effectiveness of acupuncture (A) and massage (M) as treatments for chronic low back pain in patients presenting to primary care?', where the effect of interest is the difference in pain scores between the two treatment groups.

From the research question we define one statement (the null hypothesis, H0), which states that there is no true effect between the groups and that any observed effect has arisen from sampling error. This can be written more formally as:

H0: mean pain scores in acupuncture and massage patients are the same (meanA = meanM).

A second, alternative hypothesis (HA) is also defined, which states that any observed effect is real and not due to sampling error. This can either take the form of a general statement of difference, i.e.

HA: mean pain scores in acupuncture and massage patients are different (meanA ≠ meanM),

or an a priori direction to the effect can be applied, for example,

HA: mean pain score is higher in acupuncture than in massage patients (meanA > meanM).

In the first alternative hypothesis (meanA ≠ meanM) the direction of the difference in pain scores is not indicated; it is a simple statement that the mean pain scores are different. This is known as a two-sided test, as we believe that the difference may be in either direction. Conversely, in the second alternative hypothesis (meanA > meanM) it is believed a priori that the difference can only be in one direction. This is an example of a one-sided test. One-sided tests are much less commonly used than two-sided tests, as it is rare that the direction of the effect of interest can be determined before the study.

For each research question there is an appropriate statistical test to apply to the data to determine whether or not to reject the null hypothesis. Whichever test is applied, a p-value is calculated which assesses the evidence against the null hypothesis.
The p-value is the probability of getting the study results, or results more extreme, when the null hypothesis is true. The smaller the p-value, the stronger the evidence that the observed result is not due to sampling error or chance alone, and hence the more readily we reject the null hypothesis and accept that there is a difference. Convention states that differences are statistically significant if the p-value is less than 0.05, i.e. the observed outcome, or one more extreme, would occur less than 1 time in 20 when the null hypothesis is true, which is an unlikely occurrence.

Returning to the example, the study examined pain scores, on a scale from 0 (no pain) to 100 (worst pain), recorded six months after treatment for 60 patients in each of the treatment groups. A two-sided test is carried out as no assumptions are being made about the direction of the effect. The mean pain scores at six months post-treatment are 20.3 for the acupuncture group and 28.4 for the massage group. Hence, the study data suggest that a mean difference of 8.1 points is the best estimate of the difference in the population. To determine whether this difference is statistically significant or is a consequence of chance, the relevant hypothesis test (here, an independent two-sample t-test) is carried out and the p-value is calculated as 0.04. This implies that the probability of getting a mean difference of at least 8.1 points, given that the population mean pain scores are the same, is 0.04, i.e. an outcome that will occur about 1 time in 25. Hence, we reject the null hypothesis that the two groups have the same mean pain scores. The study data suggest that acupuncture is statistically more effective than massage at reducing pain scores six months post-treatment in chronic low back pain sufferers presenting to primary care clinics, by a mean of 8.1 points.
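The link between a test statistic and one-sided and two-sided p-values can be sketched in a few lines of Python. This is a minimal illustration and not the analysis used in the study: it assumes a test statistic that follows a standard normal distribution under the null hypothesis (a close approximation to the t-distribution for samples of this size), and both the value 2.0 and the function name `p_values_from_z` are hypothetical.

```python
from statistics import NormalDist


def p_values_from_z(z: float) -> dict:
    """Return one-sided and two-sided p-values for a standard normal test statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # One-sided test (HA: meanA > meanM): probability of a statistic at least this large under H0
    one_sided = 1.0 - std_normal.cdf(z)
    # Two-sided test (HA: meanA != meanM): probability of a statistic at least this extreme
    # in either direction under H0
    two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return {"one_sided_greater": one_sided, "two_sided": two_sided}


# Hypothetical test statistic, chosen for illustration only
result = p_values_from_z(2.0)
print(f"one-sided p = {result['one_sided_greater']:.4f}")
print(f"two-sided p = {result['two_sided']:.4f}")
print("significant at 0.05 (two-sided):", result["two_sided"] < 0.05)
```

For a positive test statistic the two-sided p-value is twice the one-sided value, which is why the choice between a one-sided and a two-sided test has to be made before the data are seen.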

## Confidence intervals - what are they and why do we use them?

Data from an individual study can give an estimate of the difference between the groups examined; this is known as a point estimate. This point estimate may not be exactly the mean difference in the population, but it will be close to the unknown population difference provided the study sample is representative of the population. When hypothesis tests are performed, the question of interest is answered in a simple way: 'Is the difference observed statistically significant?' An alternative analysis method allows us to address an additional question: 'In what range of values is it likely that the population difference lies?' By calculating a range of values around our point estimate, known as a confidence interval or CI, we can produce a statement about how precisely our sample point estimate matches the population figure. Hence, a confidence interval is a range of values, symmetrical about our point estimate, that we are confident includes the true population value.

Confidence intervals can be calculated for most summary measures, such as means, proportions and differences in means, and in most cases can be easily calculated from the value of the summary measure, its variability and the sample size. It is not possible to be 100% confident of the range the population value will fall into, so a confidence level is attached to the range of values. The most commonly applied confidence level is 95%, which is analogous to a significance level of p = 0.05, but alternatives are 90% (p = 0.1) and 99% (p = 0.01). By increasing the confidence level, say from 95% to 99%, we increase the width of the confidence interval, which increases our confidence that the population value lies within the interval, but consequently decreases the precision of our statement about the actual population value.
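How the confidence level drives the width of the interval can be sketched as follows. This is a normal-approximation sketch and not the method used in the article: the point estimate of 8.1 is the mean difference in pain scores from the example, but the standard error of 3.5 and the function name `confidence_interval` are assumptions made purely for illustration.

```python
from statistics import NormalDist


def confidence_interval(estimate: float, std_error: float, level: float) -> tuple[float, float]:
    """Normal-approximation confidence interval: estimate +/- z * standard error."""
    # z is the standard normal quantile that leaves (1 - level) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * std_error
    return (estimate - half_width, estimate + half_width)


POINT_ESTIMATE = 8.1  # mean difference in pain scores from the example
ASSUMED_SE = 3.5      # hypothetical standard error, not reported in the article

intervals = {
    level: confidence_interval(POINT_ESTIMATE, ASSUMED_SE, level)
    for level in (0.90, 0.95, 0.99)
}

for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: ({lower:.1f}, {upper:.1f}), width {upper - lower:.1f}")
```

Under these assumptions the interval widens as the confidence level rises, and only the 99% interval extends below 0.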
Decreasing the confidence level, say from 95% to 90%, has the opposite effect: the interval is narrower, so the statement about the population value is more precise, but we are less confident that the interval contains it. For the chronic low back pain study, a confidence interval for the mean difference in pain scores can be calculated to give a range of values within which we are confident that the true, unknown population mean difference lies. Figure 1 presents the three commonest confidence intervals calculated on this same set of data to illustrate the effect of changing the confidence level. The horizontal line in the figure represents a difference in mean scores of 0, i.e. the mean pain scores in the two treatment groups are the same. The 95% confidence interval for the mean difference in pain scores of 8.1 points can be written as (1.2, 15.0) and hence we can be 95% confident that the true population difference in pain scores lies between 1.2 and 15.0 points. The 90% confidence interval (2.3, 13.9) is narrower than the 95% CI, whereas the 99% confidence interval (-1.0, 17.2) is wider than the 95% CI. Neither the 95% nor the 90% confidence interval includes the value 0, and hence at these levels of confidence there is evidence of a statistically significant difference in mean pain scores. By increasing the confidence level to 99%, i.e. widening the range of values and increasing the uncertainty about the actual value, the interval now includes 0, which indicates that it is possible, at this confidence level, that there is no difference in mean scores between the treatment groups, i.e. the observed difference in mean pain scores is non-significant at this level of confidence.

FIGURE 1: Point estimate and 90%, 95%, and 99% confidence intervals (CI) for the mean difference in pain score between the acupuncture and massage treatment groups.

## Statistical vs. clinical significance

Suppose that in a clinical trial of a new wonder drug for rheumatoid arthritis, the remission rates turn out to be 5% higher in those taking the new drug than in those taking the standard drug. Is this difference in remission rates large enough not to be explained by random variation? Let us say that in a large study of 5,000 women the finding is statistically significant, i.e. users of the new wonder drug are statistically more likely to experience a remission than users of the standard drug. However, is a difference of 5% clinically relevant? What decisions should be made about the allocation of the wonder drug if it was found that the rate of gastric bleeds in the group using the wonder drug is twice that seen in those using the standard drug?

In an alternative pilot study of 50 participants examining Colles' fracture rate in users of calcium supplementation compared to non-users, the calcium users were found to have a 25% reduced rate of fracture over a one-year period. However, when tested, the difference is not statistically significant. Does this mean that calcium supplements do not affect the likelihood of a Colles' fracture? A result that is not statistically significant does not necessarily indicate that there is no effect of the treatment: the best estimate from this pilot study tells us that the rate is 25% lower in the calcium users. What we can say is that from the current data we have no evidence to suggest a statistical difference in Colles' fracture rate between users and non-users of calcium supplementation. This lack of statistical significance is likely to be due to the small sample size. If the sample size were increased and the same difference in fracture rate were maintained, it is likely that the result would achieve statistical significance.

These two examples illustrate the importance of considering both statistical significance and clinical relevance. One way of describing a clinically relevant change in a measure is through the minimally clinically important difference (MCID), the smallest difference in scores that patients perceive as important and which would mandate a change in the patient's management (Jaeschke et al., 1989).
The two commonest methodologies used in the determination of MCIDs are (1) the use of an expert panel (van der Heijde et al., 2001; Maillefert et al., 2002; Wyrwich et al., 2003a, b) and (2) the use of statistical analysis (Jaeschke et al., 1989; Peto et al., 2001; Angst et al., 2002; Lee et al., 2003). It is now generally accepted that MCIDs are context-specific rather than fixed values. As an example, MCIDs on a visual analogue scale (range 0–100) for acute low back pain patients may be different from those for chronic low back pain patients. Similarly, MCIDs for patients enrolled into a primary care study may differ from those in secondary care or from a population-based study (Beaton et al., 2002).
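The point made in the calcium example, that the same observed difference can be non-significant in a small study yet significant in a large one, can be sketched with a simple two-proportion z-test. This is an illustrative sketch only: the fracture rates (16% in non-users and 12% in users, i.e. a 25% relative reduction), the group sizes and the function name `two_proportion_p_value` are hypothetical, and a pooled normal approximation is assumed, which is crude for groups as small as 25.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(rate_a: float, rate_b: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in proportions (pooled z-test, equal group sizes)."""
    # With equal group sizes the pooled rate is the simple average of the two rates
    pooled = (rate_a + rate_b) / 2.0
    std_error = sqrt(pooled * (1.0 - pooled) * (2.0 / n_per_group))
    z = (rate_a - rate_b) / std_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical one-year fracture rates: a 25% relative reduction in calcium users
RATE_NON_USERS = 0.16
RATE_USERS = 0.12

p_pilot = two_proportion_p_value(RATE_NON_USERS, RATE_USERS, n_per_group=25)
p_large = two_proportion_p_value(RATE_NON_USERS, RATE_USERS, n_per_group=2500)

print(f"25 per group:   p = {p_pilot:.3f}")
print(f"2500 per group: p = {p_large:.6f}")
```

The observed difference is identical in both calls; only the sample size changes, and with it the standard error and the p-value. Whether a difference of this size matters to patients is a separate question that the p-value does not answer.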

## Generalizability of study findings

As stated previously, the extrapolation of the findings from a sample to a population of interest is crucially dependent upon the sample being representative of the population. In theory, the study sample should be randomly chosen, but this is almost never the case. In practice we need some way of assessing whether the sample may be considered representative, and this is usually done by describing the characteristics of the subjects in the sample and comparing them with the known characteristics of the population. The whole process of statistical inference fails if the sample is not representative, hence it is important that health care professionals (HCPs) both recognize and address the issue of generalizability.

There is a collection of evidence across several areas of research which suggests that people who choose to take part in research are different from those who do not. For example, in the field of pain research, numerous population-based surveys suggest that taking part in research is strongly linked to both age and gender (Walsh, 1994; Papageorgiou et al., 1995; Jinks et al., 2002; Boardman et al., 2003). Authors who examine non-response more closely have reported that response is also related to such things as social class (Tickle et al., 1996; Hill et al., 1997) and health status: in a population-based survey of the prevalence of low back pain, responders to the survey were found to have a higher consultation rate for back pain in the year subsequent to the survey than those who did not respond (7.1% compared with 5.5%) (Papageorgiou et al., 1995).

Conclusion
The aim of this second article was to discuss the two main ways of assessing statistical significance, along with issues that need to be addressed when considering the implications of the study results. The final article in the series will address tests of association. The aim of the series is to give health care professionals a greater understanding of the use of statistics in health research, so that they may feel confident when reading the literature in their area of interest. It is hoped that this introduction to statistics will encourage health care professionals to be more aware of statistics and less fearful.

Acknowledgements
I would like to thank Sarah Ryan for asking me to write this series, the two anonymous reviewers for their constructive comments on an earlier draft, and Professor Peter Croft for allowing me the time to complete this series.

References
Angst F, Aeschlimann A, Michel BA, Stucki G (2002). Minimally clinically important rehabilitation effects in patients with osteoarthritis of the lower extremities. Journal of Rheumatology 29: 131–8.

Beaton DE, Boers M, Wells GA (2002). Many faces of the minimal clinically important difference (MCID): A literature review and directions for future research. Current Opinions in Rheumatology 14: 109–14.

Boardman HF, Thomas E, Croft PR, Millson DS (2003). Epidemiology of headache in an English district. Cephalalgia 23: 129–37.

Hill A, Roberts J, Ewings P, Gunnell D (1997). Non-response bias in a lifestyle survey. Journal of Public Health 19: 203–7.

Jaeschke R, Singer J, Guyatt GH (1989). Measurement of health status. Ascertaining the minimally clinically important difference. Controlled Clinical Trials 10: 407–15.

Jinks C, Jordan K, Croft P (2002). Measuring the population impact of knee pain and disability with the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC). Pain 100: 55–64.

Lee JS, Hobden E, Stiell IG, Wells GA (2003). Clinically important change in the visual analog scale after adequate pain control. Academic Emergency Medicine 10: 1128–30.

Maillefert JF, Nguyen M, Gueguen A, Berdah L, Lequesne M, Mazieres B, Vignon E, Dougados M (2002). Relevant change in radiological progression in patients with hip osteoarthritis. II. Determination using an expert opinion approach. Rheumatology 41: 148–52.

Papageorgiou AC, Croft PR, Ferry S, Jayson MIV, Silman AJ (1995). Estimating the prevalence of low back pain in the general population: Evidence from the South Manchester Back Pain Survey. Spine 20: 1889–94.

Peto V, Jenkinson C, Fitzpatrick R (2001). Determining minimally important difference for the PDQ-39 Parkinson's disease questionnaire. Age and Ageing 30: 299–302.

Thomas E (2004). An introduction to medical statistics for health care professionals: Describing and presenting data. Musculoskeletal Care 4(2): 218–28.

Tickle M, Craven R, Blinkhorn AS (1996). Use of self-reported postal questionnaires for district-based adult oral health needs assessment. Community Dental Health 13: 193–8.

van der Heijde D, Lassere M, Edmonds J, Kirwan J, Strand V, Boers M (2001). Minimally clinically important difference in plain films in RA: Group discussions, conclusions, and recommendations. OMERACT Imaging Task Force. Journal of Rheumatology 28: 914–7.

Walsh K (1994). Evaluation of the use of general practice age–sex registers in epidemiological research. British Journal of General Practice 44: 118–22.

Wyrwich KW, Fihn SD, Tierney WM, Kroenke K, Babu AN, Wolinsky FD (2003a). Clinically important changes in health-related quality of life for patients with chronic obstructive pulmonary disease: An expert consensus panel report. Journal of General Internal Medicine 18: 196–202.

Wyrwich KW, Nelson HS, Tierney WM, Babu AN, Kroenke K, Wolinsky FD (2003b). Clinically important differences in health-related quality of life for patients with asthma: An expert consensus panel report. Annals of Allergy, Asthma and Immunology 91: 148–53.

Correspondence should be sent to Elaine Thomas, Primary Care Sciences Research Centre, Keele University, North Staffordshire, ST5 5BG. Tel: +44 1782 583924, Fax: +44 1782 583911. E-mail: e.thomas@keele.ac.uk

Received January 2004. Accepted May 2004.