
# Multiple Linear Regression

© Rollin Brant 2007

Contents

- 2 Multiple Regression
  - 2.1 Assumptions of the Multiple Regression Model
  - 2.2 Fitting the Multiple Linear Regression Model
  - 2.3 Applying Multiple Regression
    - 2.3.1 Initial variable selection
    - 2.3.2 Initial data examination
    - 2.3.3 Examining the Regression Coefficients
    - 2.3.4 Predictions
  - 2.4 Examining Assumptions
    - 2.4.1 Examining Linearity
    - 2.4.2 Assessing Independence
    - 2.4.3 Examining the pattern of dispersion
    - 2.4.4 Examining normality
    - 2.4.5 Applying a Variance Stabilizing Transformation
    - 2.4.6 Influence diagnostics
  - 2.5 Analysis of Variance/Covariance
    - 2.5.1 Allowing for varying slope terms
    - 2.5.2 Allowing for more than 2 categories
    - 2.5.3 Analysis of Variance: groups of variables
    - 2.5.4 Multi-collinearity


Chapter 2
Multiple Regression
The multiple linear regression model extends the simple linear regression model to incorporate two or more explanatory variables in a prediction equation for a response variable. Multiple regression modeling is now a mainstay of statistical analysis in most fields because of its power and flexibility. As you will quickly learn, it requires very little effort (and sometimes even less thought) to estimate very complicated models with large numbers of variables. Practical experience has shown, however, that such models may be very hard to interpret and can give very misleading impressions. As a first example, we will consider a reasonably uncomplicated analysis with two predictor variables, beginning with an initial analysis based on simple linear regressions.

Heparin is a drug used in the treatment and prevention of deep vein thrombosis. The most commonly used form of the drug requires careful monitoring to prevent under- or over-anticoagulation, leading to possible treatment failure or bleeding, respectively. In a study concerning the efficacy of heparin therapy, levels of heparin sulfate in the blood were monitored. Separate plots of heparin vs. body weight for the 81 females and 66 males in the study are given below:

[Figure: Heparin level versus weight, plotted separately for males and females.]

The plots indicate the potential utility of linear regression. Estimates for the
regression fits are described below:
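    -> sex = female
    ------------------------------------------------------------------------------
           hsulf |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0078113   .0023395    -3.34   0.001     -.012468   -.0031546
           _cons |   1.071086   .1603247     6.68   0.000     .7519672    1.390204
    ------------------------------------------------------------------------------

    -> sex = male
    ------------------------------------------------------------------------------
           hsulf |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0033953   .0016333    -2.08   0.042    -.0066582   -.0001324
           _cons |   .6249125   .1348726     4.63   0.000     .3554735    .8943515
    ------------------------------------------------------------------------------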

We can examine these fitted relations graphically as seen in the following combined plot.

[Figure: Heparin level versus weight with the separately fitted regression lines; points are labeled f (female) and m (male).]

Both regressions suggest negative relationships between heparin levels and weight, though the relationship seems weaker for males. A quick large-sample comparison of slopes uses the generic formula for the standard error of a difference of two independent estimates:

$$\mathrm{se}(\mathrm{est}_1 - \mathrm{est}_2) = \sqrt{\mathrm{se}(\mathrm{est}_1)^2 + \mathrm{se}(\mathrm{est}_2)^2}$$

The estimated difference between slopes is .0044 with a standard error of .0029, which does not provide compelling evidence of a difference ($Z = .0044/.0029 \approx 1.5$).
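The slope comparison can be checked numerically. The sketch below simply plugs the slopes and standard errors reported in the two separate fits into the formula for the standard error of a difference; the variable names are illustrative and not part of the original analysis.

```python
import math

# Slope estimates and standard errors copied from the two separate fits above.
slope_f, se_f = -0.0078113, 0.0023395   # females
slope_m, se_m = -0.0033953, 0.0016333   # males

diff = slope_m - slope_f                      # estimated difference between slopes
se_diff = math.sqrt(se_f ** 2 + se_m ** 2)    # standard error of the difference
z = diff / se_diff                            # large-sample Z statistic

print(f"difference = {diff:.4f}, se = {se_diff:.4f}, Z = {z:.2f}")
```

Because the unrounded values are used, the Z statistic comes out slightly above the rounded hand calculation, but the conclusion is the same.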
This suggests the potential utility of combining the information for males and females in a single more precise estimate. To do this we consider a comprehensive model that incorporates weight and sex effects, i.e. a multiple regression model. The common slope model can be derived from a simple enhancement of the SLR model. Letting $y$ represent heparin and $x_1$ be weight, the two regression lines below have separate intercepts but the same slope in accordance with the model.

$$\mu_{y|x} = \alpha_f + \beta_1 x_1, \quad \text{for females}$$

and

$$\mu_{y|x} = \alpha_m + \beta_1 x_1, \quad \text{for males}$$

Notice that these two equations encapsulate the prediction of heparin level based on two predictor variables, one categorical (sex) and the other continuous (weight).

By defining an indicator variable, $x_2$, that takes on the value 0 for females and 1 for males, this can be put into a single equation

$$\mu_{y|x_1,x_2} = \alpha_f + \beta_1 x_1 + \beta_2 x_2$$

Simple algebraic comparison shows that the parameter $\beta_2$ is really just the difference in intercepts, $\alpha_m - \alpha_f$. By applying the same principle of least squares used to obtain simple linear regression estimates, we can obtain multiple regression estimates, as given in the following STATA output:
    ------------------------------------------------------------------------------
           hsulf |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            male |  -.1162196   .0468531    -2.48   0.014    -.2088282    -.023611
          weight |  -.0057582   .0014804    -3.89   0.000    -.0086844   -.0028321
           _cons |   .9333015   .1032388     9.04   0.000     .7292423    1.137361
    ------------------------------------------------------------------------------

Note that the combined slope estimate is a compromise between the two slopes in the separate fits and indicates an overall significant negative relationship. The coefficient for the male term is negative (and significant), indicating that males have lower heparin levels compared to females (after adjustment for weight).¹ A plot of the fitted regressions for males and females makes this clear.
¹ An alternate form of the model could be obtained by defining $x_2$ to be 0 for males and 1 for females. Defining things this way would imply that $\beta_2$ is just $\alpha_f - \alpha_m$, i.e. the negative of its previous value. Though these models are defined differently, in statistics (and science in general) they are in practice essentially the same, because in the end they both give rise to identical fitted values. Two models that look different algebraically but that still give rise to the same set of predictions are really the same model and are called re-parameterizations of each other. For this reason, it is often best to consider models in terms of the predictions they generate, rather than in terms of the particular symbols or algebraic equations used.
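The re-parameterization idea can be illustrated with the estimates from the common-slope fit. The following is a minimal sketch, not part of the original analysis: the function names and the trial weights are made up for illustration, and the female-coded coefficients are derived algebraically from the reported male-coded ones rather than refitted.

```python
# Estimates from the fitted common-slope model above (indicator = 1 for males).
a, b_weight, b_male = 0.9333015, -0.0057582, -0.1162196

def predict_male_coding(weight, male):
    """Fitted heparin level with the indicator coded 0 = female, 1 = male."""
    return a + b_weight * weight + b_male * male

# Re-parameterized version: indicator coded 0 = male, 1 = female.
# The intercept now belongs to males and the indicator coefficient flips sign.
a_alt = a + b_male
b_female = -b_male

def predict_female_coding(weight, female):
    """Fitted heparin level with the indicator coded 0 = male, 1 = female."""
    return a_alt + b_weight * weight + b_female * female

gaps = []
for weight in (50, 80, 110):          # illustrative weights, not study data
    for male in (0, 1):
        p1 = predict_male_coding(weight, male)
        p2 = predict_female_coding(weight, 1 - male)
        gaps.append(abs(p1 - p2))
        print(weight, "male" if male else "female", round(p1, 4), round(p2, 4))
```

The two columns of predictions agree for every case, which is what it means for the two codings to be the same model.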

[Figure: Heparin level versus weight with the fitted parallel regression lines for females and males; points are labeled f and m.]

## 2.1 Assumptions of the Multiple Regression Model

Given a continuous response (dependent, outcome) variable and a set of $k$ numerical explanatory variables, $x_1, x_2, \dots, x_k$, the multiple linear regression model is characterized by the following assumptions.

- The population mean of $y$ within strata defined by the $x$s follows a linear and additive form:

  $$\mu_{y|x_1,x_2,\dots,x_k} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

  This will be referred to as the regression equation.

  Note: In the following we will use $x$ as shorthand for the set of $x$ variables, so we can write $\mu_{y|x}$ for the left hand side in the above. I'll also use the term $x$-strata to describe strata determined by a particular set of values $x$.
- The $y$ observations are assumed to be statistically independent.
- The standard deviation of $y$ within particular $x$-strata, $\sigma_{y|x}$, is constant over all values of $x$. We'll simplify notation by calling this value $\sigma$.
- The distribution of $y$ within $x$-strata is normal.

## 2.2 Fitting the Multiple Linear Regression Model

Let us consider applying the MLR model to the heparin example of the previous section. By doing so we are implicitly accepting (at least tentatively) the assumptions above. The assumed regression equation

$$\mu_{y|x_1,x_2} = \alpha_f + \beta_1 x_1 + \beta_2 x_2$$

means that the assumption of linearity is equivalent to assuming a parallel lines model, which is most easily understood from a geometric or graphical perspective. In addition we are assuming that the dispersion of individual points about the relevant line (one for men, one for women) is the same for men and women and follows a normal distribution. Even if we have doubts about these assumptions, it is reasonable to fit first and check assumptions later.

Just as for a simple linear regression model, the principle of least squares provides a basis for estimating both the regression coefficients, $\alpha$ (here $\alpha_f$), $\beta_1$ and $\beta_2$, and the dispersion (or scale) parameter, $\sigma^2$.

If we label potential estimates as $a$, $b_1$ and $b_2$, the least squares estimates are the values of $a$, $b_1$ and $b_2$ that minimize the residual sum of squares,

$$SS_{resid} = \sum_{\text{cases}} \bigl(y - (a + b_1 x_1 + b_2 x_2)\bigr)^2$$

The typical magnitude of the deviations from the fit (i.e. residual values) $y_i - (a + b_1 x_1 + b_2 x_2)$ is given by the residual standard deviation

$$s_{y|x} = \sqrt{\frac{SS_{resid}}{n - 3}}$$
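To make the least squares recipe concrete, here is a small self-contained sketch that solves the normal equations for a model with two predictors and then computes the residual standard deviation with the $n - 3$ divisor. The data are invented for illustration (they are not the heparin study data), and the helper names `solve` and `fit_two_predictors` are hypothetical; statistical software does the same job with more numerically careful methods.

```python
import math

def solve(A, b):
    """Solve the linear system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def fit_two_predictors(x1, x2, y):
    """Return least squares estimates (a, b1, b2), the residuals, and the residual SD."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]       # constant column plus x1, x2
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    a, b1, b2 = solve(XtX, Xty)                        # normal equations
    resid = [yi - (a + b1 * u + b2 * v) for u, v, yi in zip(x1, x2, y)]
    ss_resid = sum(e * e for e in resid)
    s = math.sqrt(ss_resid / (len(y) - 3))             # n - 3: three estimated coefficients
    return (a, b1, b2), resid, s

# Made-up data for illustration only (not the heparin study).
weight = [50, 62, 70, 81, 95, 58, 66, 77, 88, 104]
male = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
hep = [0.71, 0.55, 0.58, 0.43, 0.40, 0.49, 0.52, 0.35, 0.33, 0.20]

(a, b1, b2), resid, s = fit_two_predictors(weight, male, hep)
print(f"a = {a:.4f}, b1 = {b1:.5f}, b2 = {b2:.4f}, s = {s:.4f}")
```

A useful check on any least squares fit with a constant term is that the residuals sum to zero and are uncorrelated with each predictor.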

In fitting the heparin data, we have applied a multiple linear regression model with $k = 2$, $x_1$ = weight and $x_2$ = 1 if male, 0 if female. Pictorially, this particular instance of the model can be thought of as fitting separate regression lines for males and females, constraining the slopes to be equal, and assuming an equal degree of dispersion about each line. The previous plot of the fitted regression lines makes these assumptions clear. The slope of the parallel fitted lines is determined by $b_1$. The value of $b_2$ corresponds to the difference in intercepts. Because the lines are parallel, this value also describes the difference in predicted heparin level between males and females in the same weight-stratum.²

² $x_2$ only takes on values 0 and 1. When some (or all) of the $x$ variables are constructed to represent categories, the model is sometimes called the general linear model.

The model is essentially a means of dissecting a high-dimensional relationship between $y$ and $x$ into simple 2-dimensional components. For instance, if we wish to consider the partial contribution of varying a particular variable, say $x_i$, in a stratum where all the remaining $x$s are fixed, the model reduces to simple linear form

$$\mu_{y|x} = \alpha_{x \text{ without } x_i} + \beta_i x_i$$

In the above, $\alpha_{x \text{ without } x_i}$ just represents the combined contribution of the remaining variables, i.e.

$$\alpha_{x \text{ without } x_i} = \alpha + \sum_{j \neq i} \beta_j x_j$$

When $k = 2$ the overall relationship can be understood geometrically in terms of a 2-dimensional flat surface (plane) which gives values of $\mu_{y|x_1,x_2}$ over the $(x_1, x_2)$ co-ordinate plane. For $k > 2$ one can refer mathematically to planes of higher order (hyper-planes).

In one sense, the model allows us to borrow strength in examining the relationship between $y$ and $x_i$ under the assumption that this relationship is essentially the same (i.e. linear, fixed slope) irrespective of the values of the remaining variables. The contribution of the remaining variables is contained in a changing intercept. This additivity of effects is a key simplifying assumption which can be checked, as we'll see later on.

## 2.3 Applying Multiple Regression

### 2.3.1 Initial variable selection

Modern computational facilities now make it quite easy to fit a multiple regression model. To use it in a meaningful way, however, a number of practical and statistical issues must be addressed:

- The explanatory variables to be included in the regression model must be selected.
- Potential deficiencies in the fit of the model must be identified and corrected (if possible).
- Results of the model fit must be carefully interpreted and presented.

One cannot even begin, of course, without first choosing which predictor variables
to include in the model (we'll leave the remaining issues to later sections). In
general the rationale for choosing variables depends mainly on the objectives of
study. Loosely speaking, objectives can be classified as descriptive, predictive and
comparative in character (with a large degree of overlap).
At the level considered so far, our analysis of the heparin data has been merely
descriptive. In descriptive analysis (which may be a tentative precursor to development of predictive or causal models) one is interested in identifying patterns of
relationship without worrying overmuch about the underlying mechanisms or extrapolation into the future. If patterns do emerge in the analysis they may engender
more careful thought about their meaning and subsequently more focused analysis
or perhaps an additional step of data collection. The choice of variables is then
largely subjective, depending on the investigator's own thoughts about what constitutes an interesting pattern. Conventions specific to individual disciplines may
exist - for instance in human epidemiology, age and sex are usually considered to be
important and will more or less automatically be considered for inclusion in models.
In predictive modeling the aim is to develop a formula or rule for making predictions. The term black-box prediction is used when there is no interest in explaining
or interpreting the individual roles of the variables in the prediction equation. For
example, if there is direct clinical utility in predicting heparin levels, one may not
care too much what particular variables are used, as long as they are commonly
measured. In general, though, predictive equations tend to be more reliable when some physiological (or other theoretical) rationale can be applied. In the setting of a study of vocabulary of elementary school students, while it may well be that shoe size has some predictive utility with regard to vocabulary size, much more powerful predictive relationships can be developed by considering more direct determinants.

Studies which aim at comparative inferences typically (ideally?) focus on a small number of primary factors (variables) and often relate to explicit (and sometimes implicit) underlying causal hypotheses. For comparisons to be relevant (especially in
attempting to support causal hypotheses from observational data) there may be
a number of nuisance or confounding factors that need to be taken into account.
In the above example, one might wish to investigate the fundamental role of sex
of the patient in determining heparin levels. Because patterns in weight differ between sexes, weight needs to be accounted for in the model; otherwise a distorted picture of the relationship may result. Thus in causal modeling one attempts to be as comprehensive as possible in including variables that could realistically be thought to be competing causal determinants. Other criteria for making valid comparisons may enter into variable selection in other settings. In the end, though, the focus for interpretation will be on the role of the principal factors of interest.

### 2.3.2 Initial data examination

As always, it is imperative to apply simple summaries and graphs (graphs especially) before proceeding with more formal analysis, as a part of data verification and becoming familiar with the data. Histograms, boxplots and scatterplots are especially helpful. Measures of association (such as correlations) may also be examined.
Returning to the heparin example, let's add two additional variables, age and initial heparin infusion rate, into consideration, i.e. let's consider fitting a model

$$\mu_{y|x_1,x_2,x_3,x_4} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$

1. $x_1$ = weight
2. $x_2$ = sex indicator (0=f, 1=m)
3. $x_3$ = age
4. $x_4$ = initial dose indicator (0=low, 1=high)

The last variable is introduced to account for the fact that two different initial infusion rates were applied.³
Before we begin building a model, it is always wise to examine all variables
separately. At this stage we may be alerted to the presence of outliers or uncommon
values in any of the variables which may require special attention in building a sound
model. In this case we have already considered weight and sex, so we should look
at some univariate summaries and graphs for age and initial dose.
    . summarize age

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |       147    67.10884    16.43548         20         92

    . tabulate hepdos0

     Initial IV |
           Rate |      Freq.     Percent        Cum.
    ------------+-----------------------------------
            Low |         84       57.14       57.14
           High |         63       42.86      100.00
    ------------+-----------------------------------
          Total |        147      100.00

³ While we could enter the actual rates in $x_4$, it is convenient for purposes of interpretation to follow the indicator approach. However, both approaches yield the same fundamental models in the sense of model prediction.

[Figure: Initial Plots. Panels: distribution of age (fraction), distribution of initial IV rate (fraction), heparin level versus age, heparin level by initial IV rate (low, high).]
In our initial exploration of the relationships, we begin with bivariate graphs relating $y$ to each of the $x$s in turn. For age and initial dose levels, a scatter diagram and comparative boxplots are most relevant. Such plots may provide initial hints about which variables may be important in the model, about non-linear effects for some variables or about non-homogeneous dispersion in others. However, features noted in separate bivariate plots and summaries may not translate directly to the multiple variable model. Nonetheless, if strong non-linearities are evident in any of the plots, there is the suggestion that some transformation of either the response or the relevant explanatory variable may be useful, provided the effect is a dramatic improvement in the apparent linearity in the plots (as judged by the IOT⁴ test).

The examination of these plots suggests it is reasonable to begin with a model incorporating the variables "as is". Before doing so, we can get an initial indication of the potential utility of adding age and initial dose by considering some simple analyses, in this case a simple linear regression of heparin level on age and a t-test comparing heparin levels between the low and high dose groups.

⁴ Intra-Ocular Trauma


    . ttest hsulf, by(hepdos0)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         Low |      84    .4192857    .0265336    .2431844    .3665115      .47206
        High |      63    .5094444    .0403743     .320461    .4287374    .5901515
    ---------+--------------------------------------------------------------------
    combined |     147    .4579252    .0232166    .2814863    .4120411    .5038092
    ---------+--------------------------------------------------------------------
        diff |           -.0901587    .0464767               -.1820179    .0017005
    ------------------------------------------------------------------------------

    . regress hsulf age
    ------------------------------------------------------------------------------
           hsulf |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0040398   .0013822     2.92   0.004      .001308    .0067716
           _cons |    .186818   .0954784     1.96   0.052    -.0018912    .3755271
    ------------------------------------------------------------------------------
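The `diff` row of the t-test output can be reproduced from the group summaries alone. The sketch below assumes the usual equal-variance (pooled) two-sample formulas, which is what the heading of the output indicates; the variable names are illustrative.

```python
import math

# Group summaries copied from the ttest output above.
n_low, mean_low, sd_low = 84, 0.4192857, 0.2431844
n_high, mean_high, sd_high = 63, 0.5094444, 0.320461

# Pooled variance: a degrees-of-freedom weighted average of the group variances.
df = n_low + n_high - 2
pooled_var = ((n_low - 1) * sd_low ** 2 + (n_high - 1) * sd_high ** 2) / df

diff = mean_low - mean_high
se_diff = math.sqrt(pooled_var * (1 / n_low + 1 / n_high))
t = diff / se_diff

print(f"diff = {diff:.7f}, se = {se_diff:.7f}, t = {t:.2f} on {df} df")
```

The t statistic is a little under 2 in absolute value, in line with a confidence interval for the difference that only just includes zero.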

While these simple tests are not foolproof indicators of the utility of including age and initial dose in a model with weight and sex, the above findings ($P \approx .05$ for initial dose, $P = .004$ for age) heighten our expectations for fitting the full model below.
    . regress hsulf weight male age hepdos0

          Source |       SS       df       MS              Number of obs =     147
    -------------+------------------------------           F(  4,   142) =   15.78
           Model |  3.55976945     4  .889942361           Prob > F      =  0.0000
        Residual |  8.00847268   142  .056397695           R-squared     =  0.3077
    -------------+------------------------------           Adj R-squared =  0.2882
           Total |  11.5682421   146  .079234535           Root MSE      =  .23748

    ------------------------------------------------------------------------------
           hsulf |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0058301   .0014201    -4.11   0.000    -.0086374   -.0030227
            male |  -.1098883   .0442424    -2.48   0.014    -.1973471   -.0224295
             age |   .0039068   .0012984     3.01   0.003     .0013401    .0064735
         hepdos0 |    .176446   .0422362     4.18   0.000     .0929529     .259939
           _cons |   .5979384   .1459204     4.10   0.000     .3094813    .8863955
    ------------------------------------------------------------------------------

We return now to consider the statistical side of things more carefully, deferring until later lectures the consideration of goodness of fit issues. Issues of interpretation will be considered by example as the opportunity arises.

Let's return to the previous output, which arises from fitting the combined regression model relating heparin levels to weight, sex, age and initial heparin dose. As described in the last lecture, the primary components of this output are the intercept and the coefficients for the weight, sex, age and dose variables, $a$ and $b_1$, $b_2$, $b_3$ and $b_4$. These may be interpreted on an informal basis, as can the residual standard deviation $s_{y|x}$, which gives us a notion of the closeness of the overall fit. In order to make more precise interpretations, we must begin to integrate various components of our analysis. In general, meaningful interpretations of any statistical analysis require this sort of synthesis.

### 2.3.3 Examining the Regression Coefficients

The most immediate application of the estimated regression coefficients is in forming predicted values,

$$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$$

based on choosing particular values of the $x$ variables. As in the simple case, when we set the $x$-values to correspond to the data in the sample we obtain the fitted values.
As in simple linear regression, there is an analysis of variance table, based on the additive decomposition of the data into fitted values and residuals:

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$

Since the explained component of variation is based on 4 variables, it has 4 degrees of freedom in this example. In general the numerator degrees of freedom equals $k$, the number of $x$-variables. The residual degrees of freedom is $n - (k + 1)$, where the extra degree of freedom subtracted is for the constant term.
The practical application of this analysis of variance decomposition is in testing the utility of the regression relationship. The model is not useful if all the true $\beta$s are equal to 0, for then there are no useful predictive relationships. To test the null hypothesis that all the coefficients (i.e. the TRUE values corresponding to the $b_i$ estimates) are 0 we use the F-statistic,

$$F = \frac{\sum (\hat{y} - \bar{y})^2 / k}{\sum (y - \hat{y})^2 / (n - (k + 1))}$$

which has $k$ numerator and $n - (k + 1)$ denominator degrees of freedom. Values of $F$ close to one indicate that the apparent total explained variation is about equal to what you'd expect on the basis of pure chance, with all of the $\beta$s being 0 (i.e. the null hypothesis). The statistical significance of larger values is assessed by reference to the theoretical F distribution.
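As a check on the formulas, the summary statistics in the header of the four-variable regression output can be recomputed from its sums of squares. This is a sketch using only the reported Model and Residual sums of squares; the variable names are illustrative.

```python
import math

# Sums of squares from the regression output for the four-variable model.
ss_model, ss_resid = 3.55976945, 8.00847268
n, k = 147, 4

df_model = k                  # one degree of freedom per x-variable
df_resid = n - (k + 1)        # the extra 1 is for the constant term
ss_total = ss_model + ss_resid

f_stat = (ss_model / df_model) / (ss_resid / df_resid)
r_squared = ss_model / ss_total
root_mse = math.sqrt(ss_resid / df_resid)   # residual standard deviation

print(f"F({df_model}, {df_resid}) = {f_stat:.2f}")
print(f"R-squared = {r_squared:.4f}, Root MSE = {root_mse:.5f}")
```

The results agree with the F, R-squared and Root MSE entries printed by the software, which shows that those entries are just rearrangements of the analysis of variance table.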
We are also typically interested in examining the $\beta_i$s individually. The quick answer for now is based on examining the estimated values $b_i$ in relation to their standard errors $\mathrm{se}(b_i)$. Algebraically we have a series of formulas of the form

$$\mathrm{se}(b_i) = s_{y|x} \times (\text{matrix formula involving the } x \text{ values})$$

which statistical software conveniently takes care of for us.

The generic formulas for t-statistics and confidence intervals apply, based on t-values on $n - (k + 1)$ degrees of freedom. The overall F-test can be thought of as an omnibus test, combining all the separate t-tests in a way which controls the overall type I error. We will come back later to examine common pitfalls in interpretation of individual coefficients and associated tests.
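The generic t-statistic and confidence interval formulas can be applied directly to the reported estimates. In this sketch the critical value `T_CRIT` is an assumption: the upper 2.5% point of the t distribution on 142 degrees of freedom is approximately 1.977, and it is hard-coded here because the Python standard library has no t quantile function.

```python
# Estimates and standard errors copied from the four-variable regression output.
estimates = {
    "weight": (-0.0058301, 0.0014201),
    "male": (-0.1098883, 0.0442424),
    "age": (0.0039068, 0.0012984),
    "hepdos0": (0.176446, 0.0422362),
}

# Assumed critical value: upper 2.5% point of the t distribution
# on n - (k + 1) = 142 degrees of freedom (approximately 1.977).
T_CRIT = 1.9768

results = {}
for name, (b, se) in estimates.items():
    t = b / se                                   # t-statistic for H0: coefficient = 0
    lower, upper = b - T_CRIT * se, b + T_CRIT * se
    results[name] = (t, lower, upper)
    print(f"{name:8s} t = {t:6.2f}   95% CI = ({lower:.7f}, {upper:.7f})")
```

Up to rounding of the critical value, the intervals match those printed in the regression output.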

### 2.3.4 Predictions

As mentioned above, we can form predictions which, just as in simple linear regression, can be interpreted either as $x$-stratum specific estimates of $\mu_{y|x}$, or as predictions of particular observations. In forming prediction intervals, we follow the same framework as in simple linear regression to choose the appropriate standard error. The standard error for the actual predicted value, appropriate for inference for the stratum population mean, $\mathrm{se}(\hat{y})$, has a formula similar in form to $\mathrm{se}(b_i)$. For prediction, just as in simple linear regression, we have

$$\mathrm{se}_{pred}(y) = \sqrt{s_{y|x}^2 + \mathrm{se}(\hat{y})^2}$$

As before we use t-values on $n - (k + 1)$ degrees of freedom to form intervals.

As seen in the initial model involving only weight and sex, plotting fitted values
is usually the most informative way to examine and interpret the fitted model.

When there are a number of variables, this can become complicated, as it is not clear how to graph data to look at a single variable when there are a number of important explanatory variables at work. One convenient approach is to look at adjusted predictions, where only one or two variables are considered at their actual values, while the effects of the remaining variables are adjusted out by considering predictions for hypothetical cases where the adjustment variables are set to the overall mean (or other relevant value).

For example, the following plot illustrates the age and sex effects, adjusted for weight and dose.
[Figure: Adjusted linear predictions versus age (20 to 92).]
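Adjusted predictions of this kind are easy to compute by hand from the fitted coefficients. The sketch below is illustrative only: the text does not report the mean weight, so `ADJ_WEIGHT = 75.0` is a hypothetical stand-in, and `ADJ_DOSE` is set to the proportion of patients on the high initial rate (42.86%) as one possible "overall mean" for the dose indicator.

```python
# Coefficients copied from the four-variable regression output.
COEF = {"_cons": 0.5979384, "weight": -0.0058301, "male": -0.1098883,
        "age": 0.0039068, "hepdos0": 0.176446}

def predict(weight, male, age, hepdos0):
    """Fitted heparin level for a hypothetical case."""
    return (COEF["_cons"] + COEF["weight"] * weight + COEF["male"] * male
            + COEF["age"] * age + COEF["hepdos0"] * hepdos0)

# Adjustment values. The mean weight is not reported in the text, so 75 is a
# hypothetical stand-in; 0.4286 is the proportion on the high initial rate.
ADJ_WEIGHT = 75.0
ADJ_DOSE = 0.4286

adjusted = {}
for age in (20, 40, 60, 80, 92):
    female = predict(ADJ_WEIGHT, 0, age, ADJ_DOSE)
    male = predict(ADJ_WEIGHT, 1, age, ADJ_DOSE)
    adjusted[age] = (female, male)
    print(f"age {age:2d}: female {female:.3f}, male {male:.3f}")
```

Whatever adjustment values are chosen, the two lines stay parallel: the male and female predictions differ by the male coefficient at every age, and changing the adjustment values only shifts both lines up or down together.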

## 2.4 Examining Assumptions

Though we have previously discussed applying the results of fitting a multiple linear regression model, logic dictates that we should assess the plausibility of the basic assumptions before investing much effort in interpretation. Recall that the basic assumptions of the multiple linear regression model are

- Linearity
- Independence
- Homogeneous dispersion
- Normality


### 2.4.1 Examining Linearity

Let's consider once more the model for heparin levels that includes the initial IV rate and age (partial regression output below).

    ------------------------------------------------------------------------------
           hsulf |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0058301   .0014201    -4.11   0.000    -.0086374   -.0030227
            male |  -.1098883   .0442424    -2.48   0.014    -.1973471   -.0224295
             age |   .0039068   .0012984     3.01   0.003     .0013401    .0064735
         hepdos0 |    .176446   .0422362     4.18   0.000     .0929529     .259939
           _cons |   .5979384   .1459204     4.10   0.000     .3094813    .8863955
    ------------------------------------------------------------------------------

As previously described, the first steps in assessing assumptions begin (even before fitting the model) with initial descriptive plots of all the individual variables, as well as plots of $y$ versus each of the $x$s separately. From the strict mathematical viewpoint, these initial plots (or the simple models and tests associated with them) do not pertain directly to the multiple regression assumptions, but they often provide a useful guide to later analysis.

To assess whether the linearity assumption is tenable, it is customary to plot the residuals versus each of the $x$-variables in turn, looking for curvature.
[Figure: Residual Plots. Panels: residuals versus weight, residuals by sex, residuals versus age, residuals by initial dose (low, high).]

Note that when particular xs are discrete, it usually makes more sense to use
boxplots.
When there is very pronounced non-linearity in some, but not all of the plots, a
transformation of the offending x-variable may be appropriate.

### 2.4.2 Assessing Independence

Assessing the validity of the statistical independence of the observations depends mainly on an understanding of how the data were collected. For example, if the observations were selected from a population using simple random sampling, then the assumption of independence is justified. When sampling is not actually random, as in convenience sampling, one must assess firstly whether the sampling process is likely to have been susceptible to bias. With regard to independence, the key issue is whether the inclusion of one case is likely to lead to inclusion of a closely related case, e.g. husbands and wives.

### 2.4.3 Examining the pattern of dispersion

The plots above may give indications of non-homogeneous dispersion. Another plot that is often used is the plot of the residual values versus the fitted values. The plot below illustrates this approach, though in this plot I have chosen to use a modified form of the residual known as the standardized residual. This residual is defined as the ordinary residual divided by an estimate of the standard deviation for the residual. Standard deviations for residuals actually decrease according to how far the $x$-values are from their means, so this puts the residuals on an equal footing in terms of variance.
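The way residual standard deviations shrink for outlying $x$-values is easiest to see with a single predictor, where the leverage of case $i$ is $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$ and the estimated standard deviation of its residual is $s\sqrt{1 - h_i}$. The sketch below uses invented data and a one-variable fit as a simplified stand-in; with several predictors the same idea applies, with the leverage coming from a matrix formula.

```python
import math

# Made-up single-predictor data, for illustration only.
x = [39, 48, 55, 61, 70, 76, 83, 95, 120]
y = [0.80, 0.62, 0.70, 0.51, 0.47, 0.55, 0.36, 0.30, 0.28]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx   # slope
a = y_bar - b * x_bar                                                # intercept

resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))   # residual standard deviation

# Leverage grows with the squared distance of x from its mean.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# The estimated SD of each residual is s * sqrt(1 - leverage),
# so it is smallest for the most outlying x-value.
resid_sd = [s * math.sqrt(1 - h) for h in leverage]
standardized = [e / sd for e, sd in zip(resid, resid_sd)]

for xi, h, r in zip(x, leverage, standardized):
    print(f"x = {xi:3d}  leverage = {h:.3f}  standardized residual = {r:6.2f}")
```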

[Figure: Standardized residuals versus fitted (predicted) values.]

A typical pattern is increasing dispersion as $y$ increases, indicating that variability increases with bigger values of $y$. If this is very pronounced, a logarithmic or square root transformation (variance stabilizing transformation) for $y$ may help. However, if the previous checks for linearity all seem in order, transforming $y$ may foul up the linearity assumption. If it is not desirable to transform $y$ and the increasing dispersion is very pronounced, more advanced models which allow for non-homogeneous dispersion can be applied.

### 2.4.4 Examining normality

The assumption of normality is the least critical assumption for most purposes, in
the sense that with large samples, the central limit theorem will provide enough
normality to allow the application of tests and confidence intervals. The one critical
area is in prediction intervals for new observations, which depend on the assumption
that the individual observations (both old and new) are normal. As in simple linear regression, we can apply a normal quantile plot (a.k.a. QQ plot) to examine this assumption and, more importantly, to highlight deviant observations (outliers).
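The coordinates of a normal quantile plot can be computed with the standard library. This is a minimal sketch with invented residuals; the plotting positions $(i - 0.5)/n$ are one common convention and are an assumption here, since the text does not say which convention its software uses.

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Pair each sorted residual with the standard normal quantile for its rank."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5) / n are one common convention; others exist.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Made-up standardized residuals, for illustration only.
example = [-1.2, 0.3, 2.9, -0.4, 0.8, -0.9, 0.1, 1.1, -0.2, 0.5]
pairs = normal_quantile_pairs(example)
for q, r in pairs:
    print(f"{q:6.2f}  {r:6.2f}")
```

If the residuals were close to normal, the pairs would fall near a straight line; a point such as the largest residual here, sitting well above the quantile expected for its rank, is the kind of deviant observation the plot is meant to highlight.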

[Figure: Normal quantile plot of the standardized residuals (standardized residuals versus inverse normal).]

### 2.4.5 Applying a Variance Stabilizing Transformation

As indicated by the plot of residuals versus predicted values, it is plausible to consider a logarithmic transformation of the response variable. The results follow; they do not differ much in overall pattern from the untransformed analysis.
    . generate log_hsulf = log10(hsulf)
    . regress log_hsulf weight male age hepdos0

          Source |       SS       df       MS              Number of obs =     147
    -------------+------------------------------           F(  4,   142) =   10.22
           Model |  2.89804262     4  .724510655           Prob > F      =  0.0000
        Residual |   10.067947   142  .070901035           R-squared     =  0.2235
    -------------+------------------------------           Adj R-squared =  0.2016
           Total |  12.9659896   146  .088808148           Root MSE      =  .26627

    ------------------------------------------------------------------------------
       log_hsulf |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0050508   .0015923    -3.17   0.002    -.0081985   -.0019031
            male |   -.124347    .049606    -2.51   0.013    -.2224087   -.0262854
             age |   .0028794   .0014558     1.98   0.050     1.50e-06    .0057573
         hepdos0 |   .1470529   .0473566     3.11   0.002     .0534377     .240668
           _cons |  -.2562236   .1636107    -1.57   0.120     -.579651    .0672038
    ------------------------------------------------------------------------------
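As a reading aid, the summary statistics in the header of this output can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. The sketch below uses the standard definitions; the value 0.2016 corresponds to the adjusted R-squared. Variable names are illustrative only.

```python
import math

# Values copied from the ANOVA table of the log_hsulf regression.
ss_model, df_model = 2.89804262, 4
ss_resid, df_resid = 10.067947, 142
ss_total, df_total = 12.9659896, 146

ms_model = ss_model / df_model   # mean square for the model
ms_resid = ss_resid / df_resid   # residual mean square

f_stat = ms_model / ms_resid                          # overall F statistic
r_squared = ss_model / ss_total                       # proportion of variation explained
adj_r_squared = 1 - ms_resid / (ss_total / df_total)  # adjusted for degrees of freedom
root_mse = math.sqrt(ms_resid)                        # residual standard deviation

print(round(f_stat, 2), round(r_squared, 4), round(adj_r_squared, 4), round(root_mse, 5))
```

These reproduce the reported 10.22, 0.2235, 0.2016 and .26627 up to rounding.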


Examining residual plots (below) indicates that the fit seems much improved.
[Figure: Residual Plots. Panels: residuals versus weight; Residuals by Sex (female, male); residuals versus age; Residuals by Initial Dose (Low, High).]
The following plots indicate that the transformation has introduced negative skewness in the distribution of residuals while improving the homogeneity of variance. The latter criterion is more important for the validity of our inferences, leading us to prefer (on a statistical basis, anyway) the transformed analysis. In practice, however, since the two analyses are in substantial agreement, one might prefer to report the untransformed results.

[Figure: studentized residuals versus fitted values, and Norm. Quant. Plot of Stand. Residuals (studentized residuals versus inverse normal).]

2.4.6 Influence diagnostics

As in simple linear regression, cases whose y-values do not follow the general pattern of association with the x's can sometimes have undue influence on the model results. This is especially true if the x-values for the case are also outlying. In simple linear regression we can spot the x-outliers easily in initial plots. In multiple linear regression, the initial plots may not reveal all. One measure of the distance of a specific case's x-values from the overall means is called the leverage (for algebraic reasons these values are also called the hat values).
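The name "hat values" comes from the matrix that maps the observed responses to the fitted values. The notation below is assumed for this sketch, not taken from the notes: X is the n-row design matrix including a column of ones, x_i is its i-th row written as a column vector, and p is the number of explanatory variables.

```latex
\[
\hat{y} = H y, \qquad H = X (X^{\top} X)^{-1} X^{\top}, \qquad
h_{ii} = x_i^{\top} (X^{\top} X)^{-1} x_i
\]
\[
0 \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = p + 1
\]
```

The leverages therefore average (p + 1)/n, which gives a natural yardstick for judging whether a particular case's leverage is large.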

[Figure: Cook's Distance Case Plot (Cook's D versus patnum).]

In addition, we may consider a more direct measure of overall influence such as Cook's distance measure, which reflects the overall change in fitted values that results if a particular case is deleted. In the accompanying figures, I have plotted Cook's distance and leverage values. When Cook's distance is close to 1 (or greater), the peculiar influence of the corresponding case can be examined by refitting the model with the case excluded. However, one should not routinely delete such cases from analyses.
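To make the two diagnostics concrete, here is a small sketch for the simple (one-predictor) case, where leverage has a closed form and no matrix algebra is needed. The data are hypothetical and the function name is illustrative; it uses the standard formula for Cook's distance in terms of the residual, the leverage and the residual variance.

```python
def cooks_distance_simple(x, y):
    """Leverage and Cook's distance for the simple regression y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    n_params = 2  # intercept and slope
    s2 = sum(e ** 2 for e in residuals) / (n - n_params)  # residual variance
    # Leverage: distance of each x from the mean of x, scaled to lie in (0, 1).
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Cook's distance combines the size of the residual with the leverage.
    cooks_d = [
        (e ** 2 / (n_params * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
    return leverage, cooks_d

# Hypothetical data: the last case is outlying in both x and y.
x = [1, 2, 3, 4, 5, 6, 7, 15]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 8.0]
leverage, cooks_d = cooks_distance_simple(x, y)
most_influential = max(range(len(x)), key=lambda i: cooks_d[i])
print(most_influential, round(leverage[most_influential], 3))
```

In this toy example the last case has by far the largest Cook's distance, so it is the one that would be examined by refitting without it.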

[Figure: Leverage Caseplot (leverage versus patnum).]

2.5 Analysis of Variance/Covariance

In this section we will see more examples of the application of constructed variables, such as group indicator variables. The use of constructed variables provides a way to incorporate categorical predictor variables in a regression model, and leads to a number of testing techniques based on the analysis of variance.

## 2.5.1 Allowing for varying slope terms

In the example at the beginning of this chapter, we examined simple linear regressions for heparin levels against age in males and females separately. We informally compared slopes and, noting no strong statistical evidence for differing slopes, followed a combined approach.
The following example arises from the study into sleep apnea that gave rise to the
neck and weight data described in the first lecture. The main aim of the study was
to try to predict the results of overnight sleep testing, which yields a measure of sleep
disturbance called the respiratory distress index (RDI), which is a main diagnostic
factor for obstructive sleep apnea (OSA). Since age and gender are thought to be
important determining factors in the occurrence of OSA, an analysis was conducted,
separately by sex, relating the (log transformed) RDI to age, with results below:
    ---------------------------------Females--------------------------------------

      Source |       SS       df       MS                  Number of obs =      19
    ---------+------------------------------               F(  1,    17) =   19.85
       Model |  3.56438156     1  3.56438156               Prob > F      =  0.0003
    Residual |  3.05199836    17  .179529316               R-squared     =  0.5387
    ---------+------------------------------               Adj R-squared =  0.5116
       Total |  6.61637993    18  .367576663               Root MSE      =  .42371

    ------------------------------------------------------------------------------
        lrdi |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         age |    .049469   .0111022     4.456   0.000       .0260454    .0728925
       _cons |  -1.266384   .4854723    -2.609   0.018      -2.290641   -.2421271
    ------------------------------------------------------------------------------

    -----------------------------------Males--------------------------------------

      Source |       SS       df       MS                  Number of obs =      56
    ---------+------------------------------               F(  1,    54) =    7.44
       Model |  2.09990916     1  2.09990916               Prob > F      =  0.0086
    Residual |  15.2482793    54  .282375542               R-squared     =  0.1210
    ---------+------------------------------               Adj R-squared =  0.1048
       Total |  17.3481884    55  .315421608               Root MSE      =  .53139

    ------------------------------------------------------------------------------
        lrdi |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
         age |   .0163944   .0060119     2.727   0.009       .0043413    .0284474
       _cons |   .4734697   .3038537     1.558   0.125      -.1357204     1.08266
    ------------------------------------------------------------------------------

The plot above indicates quite a difference in fitted lines for males and females, and the informal test for comparing slopes is significant. This would invalidate using a common slope model such as the one considered in the heparin example. However, a sex-specific slope model can be constructed using a constructed variable to accommodate a different slope for women. Recall that in the heparin example we included a variable called male, which was 1 for males and 0 for females, which allowed for different intercepts.
In this example we'll define a variable called female, which is 1 for females and 0 for males. To allow for different slopes we can construct a variable femage, which equals 0 for males and equals age in years for females. This is easily done by multiplying the age variable by the female variable, as below:
    . generate femage = female*age
    . list age sex femage

           age   sex   femage
      1.    39     f       39
      2.    55     m        0
      3.    50     m        0
      4.    54     m        0
      5.    33     m        0
      6.    70     m        0
      7.    57     m        0
      8.    53     f       53
      9.    40     m        0
     10.    40     m        0
    etc. ..................

Including this variable in the model produces the following results:

      Source |       SS       df       MS                  Number of obs =      75
    ---------+------------------------------               F(  3,    71) =   10.66
       Model |  8.24083511     3  2.74694504               Prob > F      =  0.0000
    Residual |  18.3002776    71  .257750389               R-squared     =  0.3105
    ---------+------------------------------               Adj R-squared =  0.2814
       Total |  26.5411127    74  .358663686               Root MSE      =  .50769

    ------------------------------------------------------------------------------
        lrdi |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
      female |  -1.739854   .6501125    -2.676   0.009      -3.036141   -.4435663
         age |   .0163944   .0057437     2.854   0.006       .0049417    .0278471
      femage |   .0330746   .0144898     2.283   0.025       .0041828    .0619663
       _cons |   .4734697   .2903025     1.631   0.107       -.105377    1.052316
    ------------------------------------------------------------------------------


By considering the male and female subjects separately, we can see that the above model gives the same intercepts and slopes for men and women as the previous separate fits, but with a pooled common standard deviation. In particular, the femage coefficient represents the difference between slopes (female minus male). Testing that the true coefficient is 0 allows us to test the hypothesis of equal slopes.
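The correspondence with the separate fits can be checked by simple arithmetic on the reported coefficients. The sketch below copies the numbers from the combined output; the variable names are illustrative only.

```python
# Coefficients copied from the combined model with female, age and femage.
cons, b_female, b_age, b_femage = 0.4734697, -1.739854, 0.0163944, 0.0330746
se_femage = 0.0144898

# For males (female = 0, femage = 0) the line is cons + b_age * age.
male_intercept, male_slope = cons, b_age

# For females (female = 1, femage = age) both extra terms switch on.
female_intercept = cons + b_female
female_slope = b_age + b_femage

# t statistic for the hypothesis of equal slopes (femage coefficient = 0).
t_equal_slopes = b_femage / se_femage

print(round(female_intercept, 4), round(female_slope, 4), round(t_equal_slopes, 2))
```

The female intercept and slope agree with the females-only regression (-1.266384 and .049469), and the t statistic matches the reported 2.283 for femage up to rounding.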

## 2.5.2 Allowing for more than 2 categories

Let's return to look at more data arising in the study of sleep apnea. We'll now consider an earlier study similar in design to the RDI study discussed in the first lecture. In this earlier study, the severity of disease was measured by another approach, giving rise to values of the Apnea-Hypopnea Index (AHI). The variables measured in this study are summarized below.
          ahi               bmi          neck.circ          age
     Min.   :  0.000   Min.   :18.72   Min.   : 77.9   Min.   :24.00
     1st Qu.:  0.833   1st Qu.:26.22   1st Qu.:100.5   1st Qu.:37.00
     Median :  8.257   Median :29.38   Median :105.6   Median :45.50
     Mean   : 22.080   Mean   :30.48   Mean   :106.7   Mean   :45.72
     3rd Qu.: 28.082   3rd Qu.:33.36   3rd Qu.:114.4   3rd Qu.:53.75
     Max.   :124.909   Max.   :68.59   Max.   :143.3   Max.   :74.00

        stopbr       snorehx       partgasp        gender
     Never :74    Never : 18    Never :55    female: 39
     Some  :34    Some  : 24    Some  :44    male  :111
     Freqly:42    Freqly:108    Freqly:51

Note that some of the variables above are categorical but with more than two categories. Suppose that we want to incorporate such variables in our model. In particular, let's consider the variable partgasp, which indicates how often (if ever) a subject's sleeping partner has observed the subject gasping for breath during sleep.
If we start simply by looking at y in relation to each of the x variables, we would consider the following plots.

[Figure: plots of ahi (original scale) and of log-transformed ahi by partgasp category (Never, Some, Freqly).]

It is apparent that the log transformation of ahi is an appropriate way to handle the extreme skewness on the original scale. The summaries and analysis below are of the log-transformed data, logahi = log10(ahi + 1).

    Never
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.00000 0.04938 0.44410 0.64260 1.11000 2.02300
    Some
      Min. 1st Qu. Median   Mean 3rd Qu.   Max.
    0.0000  0.3612 1.0920 1.0060  1.6960 2.1000
    Freqly
      Min. 1st Qu. Median   Mean 3rd Qu.   Max.
    0.0000  0.5989 1.2300 1.1060  1.6250 2.0370
    -------------------------------------------------
    Analysis of Variance Table
    Response: logahi
               Df Sum Sq Mean Sq F value    Pr(>F)
    partgasp    2  6.286   3.143  7.3253 0.0009277 ***
    Residuals 147 63.075   0.429
    ---
    Signif. codes:


Now we face the question of how to incorporate this variable in a multi-predictor model in a way that is a natural extension of the above. An equivalent analysis can be carried out in a regression setting by considering the three indicator (0/1) variables for the three categories of this variable, as follows. We can consider z1 = 1 if partgasp = Never, 0 otherwise; z2 = 1 if partgasp = Sometimes, 0 otherwise; and z3 = 1 if partgasp = Frequently, 0 otherwise.
Looking at the expanded data set we have:

                ahi      bmi age snorehx partgasp stopbr neck.circ gender z1 z2 z3
    #001   2.388759 26.73010  28  Freqly     Some  Never    100.25   male  0  1  0
    #002   7.692308 24.43744  34  Freqly    Never  Never     97.50   male  1  0  0
    #003   1.071429 23.48516  37  Freqly    Never  Never     84.40 female  1  0  0
    #004  13.404255 34.46945  44    Some    Never  Never    122.50   male  1  0  0
    #005  28.957529 36.93444  46  Freqly   Freqly   Some    118.90   male  0  0  1
    #006   0.000000 37.28191  65    Some     Some  Never    103.50 female  0  1  0
    #007  61.281139 29.09986  56  Freqly     Some Freqly    110.60   male  0  1  0
    #008 104.347826 38.55213  60  Freqly    Never Freqly    126.40   male  1  0  0

    Call: lm(formula = logahi ~ z1 + z2 + z3)
    Residuals:
          Min       1Q   Median      3Q     Max
     -1.10554 -0.58645 -0.03734 0.51895 1.38003
    Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error    t value Pr(>|t|)
    (Intercept)  1.105539 0.09172438 12.0528367   <0.001
    z1          -0.462942 0.12733752 -3.6355505   <0.001
    z2          -0.099565 0.13477839 -0.7387312    0.461
    Residual standard error: 0.655 on 147 degrees of freedom
    Multiple R-Squared: 0.09063
    F-statistic: 7.325 on 2 and 147 degrees of freedom, the p-value is 0.0009277

We get the same F-value, because the use of indicators allows for different means in the three groups. Testing whether any of these variables contribute to the predictive value of the model is the same as testing for equal means. In addition, note that the variable z3 has been dropped from the model. This is because it only requires two of the indicators, plus a constant term, to allow for three different means. In the above model, the constant term is the estimated mean for the frequently gasping group (z3), and the other coefficients are differences relative to this reference group.
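A short sketch of the indicator construction, and of why one indicator becomes redundant once a constant term is present. The partgasp values are those of the eight listed cases; the helper name `indicators` is hypothetical.

```python
# partgasp values for the eight listed cases (#001 to #008).
partgasp = ["Some", "Never", "Never", "Never", "Freqly", "Some", "Some", "Never"]
levels = ["Never", "Some", "Freqly"]  # correspond to z1, z2, z3 in that order

def indicators(value):
    """Return (z1, z2, z3) for one partgasp value."""
    return tuple(int(value == level) for level in levels)

rows = [indicators(value) for value in partgasp]
for value, z in zip(partgasp, rows):
    print(value, z)

# Each case falls in exactly one category, so z1 + z2 + z3 = 1 for every row.
# That sum duplicates the constant column, which is why one indicator must go.
row_sums = [sum(z) for z in rows]
print(row_sums)
```

Because the three indicators always add up to the constant term, a model with a constant and all three cannot be fitted uniquely, and the software drops one of them.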
We can choose the group we wish to make the reference group by leaving out the corresponding indicator variable. In the analysis below, I have left out the indicator for subjects whose partners reported Never. The coefficients change, but when interpreted correctly we obtain exactly the same inferences as in the first output. These two models are equivalent; that is, they produce the same predictions for this or for other similarly structured data sets.
    Call: lm(formula = logahi ~ z2 + z3)
    Residuals:
          Min       1Q   Median      3Q     Max
     -1.10554 -0.58645 -0.03734 0.51895 1.38003
    Coefficients:
                Estimate Std. Error  t value Pr(>|t|)
    (Intercept) 0.642597  0.0883260 7.275288   <0.001
    z2          0.363377  0.1324890 2.742696    0.007
    z3          0.462942  0.1273375 3.635550   <0.001
    Residual standard error: 0.655 on 147 degrees of freedom
    Multiple R-Squared: 0.09063
    F-statistic: 7.325 on 2 and 147 degrees of freedom, the p-value is 0.0009277
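The equivalence of the two codings can be checked directly from the reported coefficients: each fit implies a mean for every partgasp group, and the two sets of means agree. The dictionary keys and the function name below are illustrative only.

```python
# Coefficients copied from the two fits (reference group Freqly, then Never).
fit_ref_freqly = {"const": 1.105539, "z1": -0.462942, "z2": -0.099565}
fit_ref_never = {"const": 0.642597, "z2": 0.363377, "z3": 0.462942}

def group_means(fit):
    """Fitted mean of logahi in each partgasp group under one indicator coding."""
    return {
        "Never": fit["const"] + fit.get("z1", 0.0),   # z1 = 1 only for Never
        "Some": fit["const"] + fit.get("z2", 0.0),    # z2 = 1 only for Some
        "Freqly": fit["const"] + fit.get("z3", 0.0),  # z3 = 1 only for Freqly
    }

means_a = group_means(fit_ref_freqly)
means_b = group_means(fit_ref_never)
for group in ("Never", "Some", "Freqly"):
    print(group, round(means_a[group], 4), round(means_b[group], 4))
```

Both codings return the same three group means, which also match the group means of logahi in the summaries (0.6426, 1.0060 and 1.1060) up to rounding.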

## 2.5.3 Analysis of Variance: groups of variables

In the following section we will consider another version of analysis of variance, with applications to examining the general linear model. As seen in previous examples, using analysis of variance in regression allows us to test whether a number of coefficients are 0, based on a single omnibus test, which is more convenient than separate tests and which also eliminates some multiple testing concerns. In addition, we will see how analysis of variance exposes some apparent paradoxes that arise from the uncritical examination of the t-tests for individual coefficients. We will see this in examples that I have chosen somewhat eclectically (and perhaps artificially) to illuminate statistical problems.
Let's consider a model that includes not only partgasp, but also age, sex (or gender), and bmi, with the result below:


    Call: lm(formula = logahi ~ age + female + bmi + z2 + z3)
    Residuals:
          Min       1Q  Median      3Q     Max
     -1.33948 -0.41276 0.04742 0.45400 1.03964
    Coefficients:
                   Estimate  Std. Error   t value Pr(>|t|)
    (Intercept) -0.80603306 0.283480652 -2.843344    0.005
    age          0.01098533 0.004171628  2.633343    0.009
    female      -0.37904052 0.109837682 -3.450915   <0.001
    bmi          0.03759992 0.007194607  5.226125   <0.001
    z2           0.24480014 0.117572907  2.082113    0.039
    z3           0.26815610 0.115860514  2.314474    0.022
    Residual standard error: 0.5698 on 144 degrees of freedom
    Multiple R-Squared: 0.3259
    F-statistic: 13.92 on 5 and 144 degrees of freedom, the p-value is 4.269e-11

To incorporate gender in the model, we have set up the indicator variable female, which is 1 for females, 0 for males. Again, we could have chosen, according to convenience, to use an indicator variable male = 1 − female, or we could have used another choice of indicators for partgasp. The coefficients will change, but the fitted values, and hence the residual variance $s^2_{\mathrm{resid}}$ (and its square root, the residual standard deviation), are always the same.
It is sensible to reconsider the predictive contribution of partgasp in the light of the other variables added. We note that the t-tests for z2 and z3 are both significant. Since we could have used z1 and z2, or even z1 and z3, in the above, one can ask: is there a proper way to test the overall significance of partgasp without reference to the t-test results, which depend on our choice of indicators? The answer of course is yes (otherwise I never would have raised this issue!), and it lies in comparing the residual standard deviations from the models with and without partgasp. Here is the latter analysis.
    Call: lm(formula = logahi ~ age + female + bmi)
    Residuals:
          Min       1Q  Median      3Q     Max
     -1.40878 -0.41867 0.05892 0.48241 1.06714
    Coefficients:
                   Estimate  Std. Error   t value Pr(>|t|)
    (Intercept) -0.70693507 0.285214928 -2.478605    0.014
    age          0.01142104 0.004194976  2.722553    0.007
    female      -0.44763783 0.107933083 -4.147364   <0.001
    bmi                                              <0.001
    Residual standard error: 0.5787 on 146 degrees of freedom
    Multiple R-Squared: 0.295
    F-statistic: 20.36 on 3 and 146 degrees of freedom, the p-value is 4.439e-11

The two residual standard errors are derived from sums of squared residuals. The sum of squared residuals in the first model is 46.757, based on 144 degrees of freedom. The sum of squared residuals in the second model is 48.902, based on 146 degrees of freedom. Note that because the variables in the second model are a subset of the variables in the first model, the latter sum of squared residuals is necessarily at least as large. (Why? Because the estimates for the first model are the least squares estimates, and the second model is just the first with the coefficients of z2 and z3 forced to be 0, so it cannot achieve a smaller sum of squares.)
The difference in the sum of squared residuals between model one, which we'll call the FULL model, and model two, which will be referred to as the RESTRICTED model, indicates how much the variable partgasp adds to the predictive accuracy of the model. We can incorporate this difference in an F-test that measures the statistical significance of this improvement. This F-test for additional variables depends on the sum of squared residuals, SSresid, and the associated degrees of freedom, dfresid, for the FULL and RESTRICTED models:

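$$F = \frac{SS_{\text{regression}} \,/\, \bigl(df_{\text{resid}}(\text{RESTRICTED MODEL}) - df_{\text{resid}}(\text{FULL MODEL})\bigr)}{SS_{\text{resid}}(\text{FULL MODEL}) \,/\, df_{\text{resid}}(\text{FULL MODEL})}$$

where the regression sum of squares, $SS_{\text{regression}}$, is defined as

$$SS_{\text{regression}} = SS_{\text{resid}}(\text{RESTRICTED MODEL}) - SS_{\text{resid}}(\text{FULL MODEL}).$$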
This calculation is available in R as follows:

    Analysis of Variance Table
    Model 1: logahi ~ age + female + bmi
    Model 2: logahi ~ age + female + bmi + z2 + z3
      Res.Df    RSS Df Sum of Sq      F  Pr(>F)
    1    146 48.902
    2    144 46.757  2     2.145 3.3023 0.03961 *
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
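The same F statistic can be reproduced by hand from the two residual sums of squares. The sketch below is an illustration, not the R implementation; the function names are hypothetical. The p-value uses a closed form that holds only when the numerator has exactly 2 degrees of freedom, as it does here.

```python
def f_test_additional(ss_restricted, df_restricted, ss_full, df_full):
    """F statistic for the variables in the FULL model but not the RESTRICTED one."""
    ss_regression = ss_restricted - ss_full  # improvement in the sum of squares
    df_extra = df_restricted - df_full       # number of added coefficients
    return (ss_regression / df_extra) / (ss_full / df_full), df_extra

def f_upper_tail_two_numerator_df(f_stat, df_denominator):
    """P(F > f_stat) for an F distribution with exactly 2 numerator df."""
    return (1 + 2 * f_stat / df_denominator) ** (-df_denominator / 2)

# Sums of squared residuals and degrees of freedom quoted in the text.
f_stat, df_extra = f_test_additional(48.902, 146, 46.757, 144)
p_value = f_upper_tail_two_numerator_df(f_stat, 144)
print(round(f_stat, 2), round(p_value, 3))
```

The result agrees with the reported F of 3.3023 and p-value of 0.03961 to within the rounding of the sums of squares.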


This is the only valid test of the hypothesis that a group of two or more regression coefficients are all 0. While it may seem that this could also be done informally by combining the results of separate t-tests, this is not so, due to possible effects of collinearity, which we discuss next.

2.5.4 Multi-collinearity

To illustrate the utility of the test for additional variables and the non-intuitive nature of individual significance tests, consider the following analysis. The data were gathered in an attempt to understand how characteristics of 20 segments of highways relate to the occurrence of accidents on them.
The data set contains the outcome variable rate (the accident rate per million vehicle miles) as well as the length of the segment in miles (len), the speed limit in miles per hour (slim), the width of the driving lanes (lwid), the width of the shoulder (shld) and the number of interchanges (itg). We start with a fairly superficial approach and fit a model with all of the explanatory variables, yielding the following fit:
    . regress rate len slim lwid shld itg

          Source |       SS       df       MS              Number of obs =      20
    -------------+------------------------------           F(  5,    14) =    3.76
           Model |  23.3491998     5  4.66983997           Prob > F      =  0.0228
        Residual |  17.3849747    14  1.24178391           R-squared     =  0.5732
    -------------+------------------------------           Adj R-squared =  0.4208
           Total |  40.7341746    19  2.14390393           Root MSE      =  1.1144

    ------------------------------------------------------------------------------
            rate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             len |  -.1113296   .0665881    -1.67   0.117    -.2541469    .0314878
            slim |  -.1374671   .0850988    -1.62   0.129    -.3199858    .0450515
            lwid |  -1.407115   1.222619    -1.15   0.269    -4.029373    1.215142
            shld |   .0367669   .1770846     0.21   0.839    -.3430419    .4165756
             itg |  -.2603523   .5988827    -0.43   0.670    -1.544828    1.024123
           _cons |   29.38981   14.60633     2.01   0.064    -1.937656    60.71728
    ------------------------------------------------------------------------------

          Source |       SS       df       MS              Number of obs =      20
    -------------+------------------------------           F(  2,    17) =    9.25
           Model |  21.2286975     2  10.6143487           Prob > F      =  0.0019
        Residual |  19.5054771    17    1.147381           R-squared     =  0.5212
    -------------+------------------------------           Adj R-squared =  0.4648
           Total |  40.7341746    19  2.14390393           Root MSE      =  1.0712

    ------------------------------------------------------------------------------
            rate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             len |  -.1175567   .0536484    -2.19   0.043     -.230745   -.0043684
            slim |  -.1270314   .0486591    -2.61   0.018    -.2296932   -.0243696
           _cons |   12.08022   2.619819     4.61   0.000     6.552885    17.60756
    ------------------------------------------------------------------------------

Note that in the first model, none of the individual t-tests for the explanatory variables is significant at the 5% level (the constant term comes closest, with P=.064), but the overall regression F-test is significant (P=.023). The following matrix plot and correlation matrix hold clues to this puzzle.
[Figure: scatterplot matrix of accident rate per million vehicle miles, length of segment in miles, speed limit, lane width in feet, shoulder width in feet, and interchanges.]

    . correlate len slim lwid shld itg
    (obs=20)
                 |      len     slim     lwid     shld      itg
    -------------+---------------------------------------------
             len |   1.0000
            slim |   0.3747   1.0000
            lwid |  -0.0625  -0.0756   1.0000
            shld |  -0.0795   0.6981  -0.2324   1.0000
             itg |   0.1810   0.3489  -0.1904   0.2862   1.0000

We note that there is a fairly high correlation between some of the variables, especially between slim (speed limit) and shld (shoulder width). When the two variables are in the equation together it is difficult, if not impossible, to parcel out which variable is making the significant contribution, so that the contribution of speed limit may be masked by the presence of shld in the model. After comparing the full model with the smaller model containing only len and slim, using the test below, we can see that it is plausible (depending on our objectives in fitting a model in the first place!) to adopt the smaller model.
    . test lwid shld itg

     ( 1)  lwid = 0.0
     ( 2)  shld = 0.0
     ( 3)  itg = 0.0

           F(  3,    14) =    0.57
                Prob > F =    0.6444
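This is the F-test for additional variables applied to the two highway models. As a worked check, using the residual sums of squares and degrees of freedom from the two regression tables (19.5054771 on 17 df for the smaller model, 17.3849747 on 14 df for the full model):

```latex
\[
F = \frac{(19.5054771 - 17.3849747)/(17 - 14)}{17.3849747/14}
  = \frac{2.1205024/3}{1.24178391}
  \approx 0.57
\]
```

The three dropped variables together add very little to the fit, which is consistent with the reported P-value of 0.6444.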