Diagnosing Model Issues and Reporting

F. Reporting the Results of a Statistical Analysis (STAT 840: Linear Regression)
1 Simple Report vs. Statistical Analysis Report

In the materials for this lesson, I have provided an overview of best practices for
Writing a Statistical Analysis Report.
• The primary purpose of that document is to make you aware of the level of communication required to fully convey the results of any statistical analysis to a client, collaborator, or community.
• In contrast, a simple report should be thought of as a preliminary step from which you build the full and final statistical analysis report.
• In the following pages, I will first define the general structure of a simple report and then walk through an example to clarify the level of detail required of such a report.
• You will practice writing simple reports through homework problems that require analysis. You will practice writing a full statistical analysis report through the final project (see details in Blackboard).
2 Structure of a Simple Report

There is no “one-size-fits-all” report; use your best judgment. However, some guidelines on structure and content are provided below.
I. The Study Design, Aims, and Model.
A. How was the data obtained?
B. Include the research question(s) the model is being built to address.
C. Include all candidate predictor variables (transformed if applicable) and
higher order terms if appropriate.
II. Preliminary Analyses.
A. Brief comments on (details should be included in Appendix):
i. leverage points
ii. excluded points
iii. multicollinearity (making sure not to include higher order terms)
iv. other relevant details
B. Screening of candidate predictor variables using measures of goodness of fit (e.g., R²adj, Cp, etc.)
i. Refinement¹ of model: transformations, higher order terms, partial R², high influence points, etc.
ii. Final model: show that the variables that are included are needed, and that the variables that are excluded are not needed
III. Statistical Analysis.²
A. Formula for Ŷ
B. If requested, CI (or joint CI) on β
C. If requested, PI (or joint PI) for Xh1, Xh2, ...
IV. Conclusion/Summary.
V. Appendix
A. Justify exclusion of observations via outlier/leverage analysis
B. Justify transformations
C. Justify higher order terms
D. Justify model assumptions (e.g., linearity, normality, independence, etc.)
¹ Note: Be very clear what points and terms are included in each table or figure.
² This section is fully determined by the study design, aims, and final model.
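For reference, the two screening criteria named in II.B are commonly defined as shown here. This is a sketch in standard textbook notation that the outline itself does not spell out, so the symbols are assumptions: n is the number of observations, p the number of parameters in the candidate model, SSE_p its error sum of squares, SSTO the total sum of squares, and MSE_full the mean squared error of the model containing all candidate predictors.

```latex
R^2_{\mathrm{adj}} = 1 - \frac{n-1}{n-p}\cdot\frac{SSE_p}{SSTO},
\qquad
C_p = \frac{SSE_p}{MSE_{\mathrm{full}}} - (n - 2p)
```

Larger values of the adjusted R² are preferred, and a candidate model with little bias should have Cp close to p.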
3 A Worked Example: Muscle Mass and Aging

I. The Study Design, Aims, and Model.
A. Study Design. A study on muscle mass and aging is considered. To
explore this relationship in women, 15 women from each 10-year age group
(40 - 79 years old) were randomly selected. A total of n = 60 women were
selected and a measure of their muscle mass was collected.
B. A person’s muscle mass is expected to decrease with age. The goals of the study are to:
1. Test for a linear relationship between age and muscle mass.
2. Provide an estimate of the precision with which the linear relationship can be estimated from these data.
3. Estimate the difference in the mean muscle mass for women differing in age by 10 years.
4. Predict the muscle mass for women of age 60.
C. The following simple linear regression model is considered. Let
Yi = a continuous measure of muscle mass of the i-th randomly selected woman
Xi = known age in years of the i-th randomly selected woman
with
Yi = β0 + β1 Xi + εi, i = 1, 2, ..., 60
where εi ~ iid N(0, σ²) and β0, β1, σ² are the unknown parameters of interest.
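As a minimal sketch of how the least squares estimates b0 and b1 and the MSE are obtained for a model of this form, the following Python function uses only the standard library. The function name fit_slr and the toy data are illustrative assumptions; they are not the study data.

```python
from statistics import fmean


def fit_slr(x, y):
    """Least squares fit of y = b0 + b1 * x. Returns (b0, b1, mse)."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope estimate
    b0 = y_bar - b1 * x_bar             # intercept estimate
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)                 # estimate of sigma^2 (two parameters fitted)
    return b0, b1, mse


# Hypothetical toy data for illustration only (NOT the muscle mass data).
toy_age = [41, 50, 58, 66, 75]
toy_mass = [110, 95, 88, 74, 68]
b0, b1, mse = fit_slr(toy_age, toy_mass)
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}, MSE = {mse:.2f}")
```

Dividing SSE by n - 2 rather than n reflects the two regression parameters estimated from the data.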
II. Preliminary Analyses.
A. A scatterplot (Figure 1) suggests that a negative linear relationship between age and muscle mass may be a useful model (r = −0.87).
B. One observation (Observation 53) has studentized and studentized deleted residuals larger than 2. We are unable to determine with certainty any reason for excluding the observation. However, a sensitivity analysis shows that the model fit is robust with respect to its inclusion (Appendix).
C. The assumptions of linearity, constant error variance, independence of error terms, and normality of error terms are assessed via residual plots and hypothesis tests. All model assumptions are reasonably met (Appendix).
III. Statistical Analysis. The lm() function in R was used to estimate the least squares fit of the simple linear regression model. The final fitted model is given by
Ŷ = 156.35 − 1.19X
and is shown in Figure 2 (solid line). The model MSE is 66.8. Women’s age explains 75% of the variation in muscle mass (R² = 0.75).
1. A two-sided t-test is used to choose between H0: β1 = 0 and H1: β1 ≠ 0 with a type I error rate of α = 0.05. Since b1 = −1.19 is more than 13 standard errors below the value expected under H0: β1 = 0 (t(58) = −13.19, p < 0.001), the observed data would be very unlikely if H0 were true, and H0 is rejected. Thus, there is sufficient evidence of a linear association between age and muscle mass.
Figure 1: A scatterplot of muscle mass and age
Figure 2: A scatterplot of muscle mass and age including the least squares line (solid) and 95% confidence band (dashed)
2. A 95% confidence band on the regression line is used to determine the precision with which the line is estimated (Figure 2).³ It is apparent from the figure that the regression line has been estimated fairly precisely. The slope of the regression line is clearly negative, and the levels of the regression line at different levels of X are estimated fairly precisely, with precision decreasing the further X is from the mean age.
³ The level of confidence indicates the proportion of the time that the estimating procedure will yield a band that covers the entire line.
3. The estimated difference in mean muscle mass for women differing in age
by 10 years is −11.9 (95% CI: −13.7, −10.1).
4. The predicted muscle mass for women at age 60 is 84.9 (95% PI: 68.5,
101.4).
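The interval estimates in items 3 and 4 can be approximately reconstructed from the summary statistics reported in this section. The sketch below is only a rough check under stated assumptions: the critical value t(0.975; 58) ≈ 2.0017 comes from a t table and is not given in the text, the standard error of b1 is backed out from the reported slope and t statistic, and the reported prediction of 84.9 is used directly. Because the inputs are rounded, the results agree with the reported intervals only to about one decimal place.

```python
import math

# Summary values reported in the text.
n = 60            # sample size
b1 = -1.19        # estimated slope
t_stat = -13.19   # t statistic for H0: beta1 = 0
mse = 66.8        # mean squared error
x_bar = 59.98     # mean age
y_hat_60 = 84.9   # reported prediction at age 60

# Assumption: 0.975 quantile of the t distribution with n - 2 = 58 df,
# taken from a t table (not stated in the text).
t_crit = 2.0017

# Back out SE(b1) and Sxx = sum of squared deviations of age.
se_b1 = b1 / t_stat
sxx = mse / se_b1 ** 2

# 95% CI for the difference in mean muscle mass per 10 years of age (10 * beta1).
half_ci = t_crit * 10 * se_b1
ci_10yr = (10 * b1 - half_ci, 10 * b1 + half_ci)

# 95% PI for a new observation at age 60.
x_h = 60
se_pred = math.sqrt(mse * (1 + 1 / n + (x_h - x_bar) ** 2 / sxx))
pi_60 = (y_hat_60 - t_crit * se_pred, y_hat_60 + t_crit * se_pred)

print(f"95% CI for 10-year difference: ({ci_10yr[0]:.1f}, {ci_10yr[1]:.1f})")
print(f"95% PI at age 60: ({pi_60[0]:.1f}, {pi_60[1]:.1f})")
```

The extra "1 +" inside the square root is what distinguishes a prediction interval for a new observation from a confidence interval for the mean response, and it is why the prediction interval is so much wider than the band in Figure 2.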
IV. Conclusions. There is a decreasing linear relationship between a woman’s muscle mass and age. The fitted model is given by
Ŷ = 156.35 − 1.19X
with σ̂² = MSE = 66.8. One outlying observation was identified but retained in the final model after a sensitivity analysis showed the results to be robust to its inclusion. Given the well-behaved residuals and the high percentage of variation explained by the model (R² = 0.75), this model has predictive power for the muscle mass of women between 40 and 79 years of age.
V. Appendix.
A. Diagnostics for Predictors.
1. Figure 3 is a clustered dot plot⁴,⁵ of ages for the n = 60 subjects. The minimum and maximum ages are 41 and 78 years, respectively. The ages appear to be spread evenly throughout this interval. There are also no unusually large or small observations.
Figure 3: A clustered (by decade) dot plot of age
2. Figure 4 is a boxplot⁶ of age for the n = 60 subjects. In addition to reiterating the range (minimum to maximum) of age, the boxplot shows the first (50.25 years) and third (70 years) quartiles and the median age (60 years). We can see that the middle half of the ages ranges from 50.25 years to 70 years and that the ages are distributed very symmetrically, because the median is located in the middle of the box and is very similar to the mean (x̄ = 59.98 years).
Figure 4: A boxplot of age
⁴ Dot plots are ideal for small datasets.
⁵ Bar charts and stem-and-leaf plots would provide similar information and are useful for identifying any important patterns in the values of X.
⁶ Boxplots are ideal for large datasets.
3. Figure 5 is a sequence plot⁷ of age for the n = 60 subjects. Age is plotted here against the index (i.e., the time sequence of observation). The points in the plot are connected to show the time sequence of observations more effectively. There is a discernible pattern within the sequence of observations. However, we know from the study design that subjects were randomly selected according to decade of age (40 - 49, 50 - 59, etc.). It appears that subjects were selected randomly within increasing decades of age. Within each decade, there is no discernible pattern. Thus, it appears that the subject selection process is reflected in the sequencing of ages.
B. Residual Analysis. Residual plots allow us to identify any systematic pattern in the deviations around the fitted regression line.
1. Linearity of Regression Function. Figure 6 is a scatterplot of residuals ei = yi − ŷi versus X. We can see that, on the whole, the residuals depart from 0 (their assumed mean) in random fashion. The model does appear to be overestimating Y for ages near 60, as evidenced by all negative residuals in the neighborhood of X = 60.⁸ Overall, however, the plot supports the linearity assumption of the model. No transformations are necessary to meet this assumption.
Figure 5: A sequence plot of age
Figure 6: A scatterplot of residuals versus age with the expected value of 0 indicated by a dotted line
⁷ Sequence plots should be used when data are obtained in a sequence, such as over time or for adjacent areas in space.
⁸ An examination of the scatterplot of the data with the fitted line overlaid would be helpful here.
2. Constant Error Variance. In addition to linearity, Figure 6 shows constant variation in the residuals across the entire range of X. Figure 7 is a scatterplot of the absolute values of the residuals |ei| versus X. Since the signs of the ei are not meaningful for examining the constancy of variation, this plot is useful for amplifying any issues with changing magnitudes of the residuals as a function of X. We can see that Figure 7 also supports constant variation in the residuals across the entire range of X. We apply the Breusch-Pagan test to the data to test for an association between σ² and X (i.e., nonconstancy of variance). At the α = 0.05 level of significance, we fail to reject H0: σ² is not associated with X (χ²(1) = 3.8, p = 0.051). That is, there is insufficient evidence of nonconstancy of error variance, so we are reasonably assured that the assumption is valid.
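The text does not say which form of the Breusch-Pagan statistic was computed. The sketch below assumes one common textbook form for a single predictor: regress the squared residuals on X, then compare (SSR*/2) / (SSE/n)² to a chi-square distribution with 1 degree of freedom. The function names and the toy data are hypothetical. For 1 degree of freedom the upper-tail probability can be written with the complementary error function, which avoids any dependency beyond the standard library.

```python
import math
from statistics import fmean


def slr_coefficients(x, y):
    """Return least squares (intercept, slope) for y regressed on x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def chi2_1df_upper_tail(stat):
    """P(chi-square with 1 df > stat), via the normal distribution."""
    return math.erfc(math.sqrt(stat / 2))


def breusch_pagan(x, y):
    """Breusch-Pagan statistic and p-value for one predictor (assumed form)."""
    n = len(x)
    b0, b1 = slr_coefficients(x, y)
    e2 = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]
    sse = sum(e2)
    # Auxiliary regression of squared residuals on x.
    g0, g1 = slr_coefficients(x, e2)
    e2_bar = fmean(e2)
    ssr_star = sum((g0 + g1 * xi - e2_bar) ** 2 for xi in x)
    stat = (ssr_star / 2) / (sse / n) ** 2
    return stat, chi2_1df_upper_tail(stat)


# Hypothetical toy data for illustration only (NOT the muscle mass data).
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [1.1, 1.9, 3.2, 3.7, 5.4, 5.6]
bp_stat, bp_p = breusch_pagan(toy_x, toy_y)
print(f"BP statistic = {bp_stat:.3f}, p = {bp_p:.3f}")

# Check of the value reported in the text: chi-square(1) = 3.8.
print(f"p for statistic 3.8: {chi2_1df_upper_tail(3.8):.3f}")
```

The last line reproduces the p-value quoted above for a statistic of 3.8, which sits just above the 0.05 cutoff.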
Figure 7: A scatterplot of absolute residuals versus age
3. Detection of Outlying Observations. Figures 8 and 9 are plots of studentized and studentized deleted residuals, respectively, versus X. Standardized versions of residuals are helpful for distinguishing observations that are fit poorly by the model, as their values represent distances from zero in standard deviation units. We would expect most studentized residuals to fall between −2 and 2. That appears to be the case here, with the exception of Observation 53, whose value is near 3.⁹ We fit the model without Observation 53 and determined that the results were robust to its inclusion.
Figure 8: A scatterplot of studentized residuals versus age
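To make the two kinds of standardized residuals concrete, here is a minimal sketch for simple linear regression using the usual textbook formulas: the studentized residual is e_i / sqrt(MSE (1 − h_ii)), and the studentized deleted residual is e_i sqrt((n − p − 1) / (SSE (1 − h_ii) − e_i²)), where h_ii is the leverage and p = 2. The function name and the toy data, which contain one deliberately planted outlier, are hypothetical and are not the study data.

```python
import math
from statistics import fmean


def studentized_residuals(x, y):
    """Return (studentized, studentized deleted) residuals for an SLR fit."""
    n, p = len(x), 2
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ei ** 2 for ei in e)
    mse = sse / (n - p)
    # Leverage of each observation in simple linear regression.
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    r = [ei / math.sqrt(mse * (1 - hi)) for ei, hi in zip(e, h)]
    t = [ei * math.sqrt((n - p - 1) / (sse * (1 - hi) - ei ** 2))
         for ei, hi in zip(e, h)]
    return r, t


# Hypothetical toy data with a planted outlier at index 5 (NOT the study data).
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0, 14.2, 15.9]
r_stud, t_del = studentized_residuals(toy_x, toy_y)

# Flag observations whose studentized deleted residual exceeds 2 in absolute value.
flagged = [i for i, ti in enumerate(t_del) if abs(ti) > 2]
print("flagged indices:", flagged)
```

The deleted version measures how far an observation lies from the line fitted without it, so a single aberrant point cannot mask itself by pulling the fit toward it. A sensitivity analysis of the kind described above would then refit the model with the flagged observation removed and compare the estimates.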
⁹ A conservative rule of thumb for removal of outlying observations: only discard an observation if there is direct evidence that it represents an error in recording, a miscalculation, a malfunctioning of equipment, or a similar type of circumstance. We do need to address its potential impact on the model fit and determine whether our results are sensitive to its inclusion in the model.
Figure 9: A scatterplot of studentized deleted residuals versus age
4. Independence of Error Terms. Figure 10 is a sequence plot of the residuals, used to see whether any correlation exists between error terms that are near each other in the sequence.¹⁰ No correlation is apparent, as evidenced by the random scatter of points about 0.
Figure 10: A sequence plot of residuals
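The text mentions that a formal test of correlation is possible but does not name one. One common choice, assumed here purely for illustration, is the Durbin-Watson statistic, which compares successive differences of the residuals to their overall size. The sketch below computes only the statistic; a formal decision would also need tabulated bounds, which are not reproduced here. The function name and the example residual sequences are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences / SSE."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences in time order (NOT the study residuals).
alternating = [1.0, -1.0, 1.0, -1.0]   # neighbors have opposite signs
clustered = [1.0, 1.0, -1.0, -1.0]     # neighbors tend to share a sign

print(durbin_watson(alternating))
print(durbin_watson(clustered))
```

Values near 2 are consistent with uncorrelated errors, values well below 2 point toward positive autocorrelation, and values well above 2 point toward negative autocorrelation.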
¹⁰ We can also do a formal test of correlation.
5. Normality of Error Terms. Figure 11 is a normal probability plot (also known as a Q-Q plot) of the residuals. If the assumption holds, the observed quantiles of the residuals should closely match the expected quantiles under the normal distribution, and the points will follow the line closely. We see that, though there appear to be some issues with the tails of the distribution, the normality assumption is reasonable.
Figure 11: A normal probability plot of residuals
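A normal probability plot pairs each ordered residual with the value expected for that rank under normality. The sketch below assumes one widely used approximation for the expected value of the k-th smallest of n residuals, sqrt(MSE) · z((k − 0.375) / (n + 0.25)), and summarizes the plot by the correlation between ordered residuals and expected values. The text does not state which plotting positions were used for Figure 11, and the function names and toy residuals are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def expected_normal_values(n, mse):
    """Expected values of n ordered residuals under normality (assumed formula)."""
    z = NormalDist()
    return [
        math.sqrt(mse) * z.inv_cdf((k - 0.375) / (n + 0.25))
        for k in range(1, n + 1)
    ]


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    a_bar, b_bar = fmean(a), fmean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical residuals from a simple linear regression (NOT the study data).
toy_resid = [-1.8, 0.4, 1.1, -0.6, 2.3, -0.2, -1.1, 0.7, 0.1, -0.9]
toy_mse = sum(e ** 2 for e in toy_resid) / (len(toy_resid) - 2)

expected = expected_normal_values(len(toy_resid), toy_mse)
qq_corr = pearson_r(sorted(toy_resid), expected)
print(f"correlation between ordered residuals and expected values: {qq_corr:.3f}")
```

A correlation close to 1 corresponds to points lying near a straight line in the plot; departures in the tails, like those noted for Figure 11, pull the correlation down.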
