
Statistics for Health Research

Entering Multidimensional
Space: Multiple
Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics

Objectives of session

Recognise the need for multiple regression
Understand methods of selecting variables
Understand the strengths and weaknesses of selection methods
Carry out multiple regression in SPSS and interpret the output

Why do we need multiple regression?

Research is rarely as simple as the effect of one variable on one outcome, especially with observational data
Need to assess many factors simultaneously, giving more realistic models

Consider the fitted plane

y = a + b1x1 + b2x2

[Figure: 3-D plot of the fitted plane, with axes Dependent (y), Explanatory (x1) and Explanatory (x2)]

3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age
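The fitted plane can be reproduced outside SPSS. Below is a minimal least-squares sketch on simulated data; the variable names, coefficients and distributions are invented for illustration and are not the course dataset.

```python
import numpy as np

# Hypothetical data: fit y = a + b1*x1 + b2*x2 by least squares,
# analogous to regressing Min LDL on baseline LDL and age.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(3.5, 1.0, size=n)    # e.g. baseline LDL (mmol/L), assumed
x2 = rng.normal(60.0, 10.0, size=n)  # e.g. age (years), assumed
y = 0.7 + 0.25 * x1 - 0.005 * x2 + rng.normal(0.0, 0.5, size=n)

# Design matrix: the column of ones estimates the intercept a
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The recovered b1 and b2 should sit close to the simulated values 0.25 and -0.005.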

When to use multiple regression modelling (1)

Assess the relationship between two variables while adjusting or allowing for another variable
Sometimes the second variable is considered a nuisance factor
Example: physical activity allowing for age and medications

When to use multiple regression modelling (2)

In an RCT, whenever there is imbalance between the arms of the trial at baseline in the characteristics of subjects
e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage, and co-morbidity

When to use multiple regression modelling (2)

A special case of this is adjusting for the baseline level of the primary outcome in an RCT
The baseline level is added as a factor in the regression model
This will be covered in the Trials part of the course

When to use multiple regression modelling (3)

With observational data, in order to produce a prognostic equation for future prediction of risk of mortality
e.g. predicting future risk of CHD using 10-year data from the Framingham cohort

When to use multiple regression modelling (4)

With observational data, in order to adjust for possible confounders
e.g. survival in colorectal cancer in those with hypertension, adjusted for age, gender, social deprivation and co-morbidity

Definition of Confounding

A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway

Example of Confounding

[Diagram: Deprivation is related to both Smoking and Lung Cancer]

But, also worth adjusting for factors only related to outcome

[Diagram: Exercise is related to the outcome, Lung Cancer, but not to Deprivation]

Not worth adjusting for an intermediate factor in a causal pathway

[Diagram: Exercise → Blood viscosity → Stroke]

In a causal pathway each factor is merely a marker of the other factors, i.e. they are correlated - collinearity

SPSS: Add both baseline LDL and age in the independent box in linear regression

Output from SPSS: linear regression on Age at baseline

Coefficients (Dependent Variable: Min LDL achieved)

                  B       Std. Error   Beta    t        Sig.   95% CI for B      Tolerance   VIF
(Constant)        2.024   .105                 19.340   .000   1.819 to 2.229
Age at baseline   -.008   .002         -.121   -4.546   .000   -.011 to -.004    1.000       1.000
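The Tolerance and VIF columns are collinearity diagnostics: for predictor j, Tolerance = 1 - R²_j and VIF = 1/Tolerance, where R²_j comes from regressing x_j on the remaining predictors (with a single predictor there is nothing to be collinear with, hence VIF = 1.000 here). A minimal sketch on made-up data, not from the course:

```python
import numpy as np

def vif(X, j):
    """VIF of column j of predictor matrix X: 1 / (1 - R^2), where R^2
    comes from regressing X[:, j] on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

# Hypothetical predictors: x2 is built to be correlated with x1 (r ~ 0.8),
# so both should show VIF well above 1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)
X = np.column_stack([x1, x2])
print(vif(X, 0), vif(X, 1))
```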

Output from SPSS: linear regression on Baseline LDL

Coefficients (Dependent Variable: Min LDL achieved)

               B      Std. Error   Beta   t        Sig.   95% CI for B
(Constant)     .668   .066                10.091   .000   .538 to .798
Baseline LDL   .257   .018         .351   13.950   .000   .221 to .293

Output: Multiple regression

Model Summary (R² now improved to 13%)

Model 1: R = .360, R Square = .130, Adjusted R Square = .129, Std. Error of the Estimate = .6753538
Predictors: (Constant), Age at baseline, Baseline LDL

Coefficients (Dependent Variable: Min LDL achieved)

                  B       Std. Error   Beta    t        Sig.   95% CI for B
(Constant)        1.003   .124                 8.086    .000   .760 to 1.246
Baseline LDL      .250    .019         .342    13.516   .000   .214 to .286
Age at baseline   -.005   .002         -.081   -3.187   .001   -.008 to -.002

Both variables still significant INDEPENDENTLY of each other
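The 95% confidence interval for each B is B plus or minus a t quantile times the standard error. A minimal sketch using the rounded Age coefficient from the output above; note it uses the 1.96 normal approximation rather than the exact t quantile SPSS uses, and the printed B and SE are themselves rounded, so the digits will not match the table exactly.

```python
def ci95(b, se, z=1.96):
    """Approximate 95% confidence interval for a regression coefficient."""
    return (b - z * se, b + z * se)

# Illustration with the rounded Age coefficient (B = -.005, SE = .002)
lo, hi = ci95(-0.005, 0.002)
print(f"({lo:.4f}, {hi:.4f})")
```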

How do you select which variables to enter the model?

Usually consider: what hypotheses are you testing?
If there is a main exposure variable, enter it first and assess confounders one at a time
For derivation of a CPR you want powerful predictors
Also clinically important factors, e.g. cholesterol in CHD prediction
Significance is important, but it is acceptable to keep an important variable without statistical significance

How do you decide what variables to enter in the model?

Correlations? With great difficulty!

3-dimensional scatterplot from SPSS of Time from Surgery in relation to Dukes' staging and age

Approaches to model building

1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above

1) Let Science or Clinical factors guide selection

Baseline LDL cholesterol is an important factor determining LDL outcome, so enter it first
Next allow for age and gender
Add adherence as important?
Add BMI and smoking?

1) Let Science or Clinical factors guide selection

Results in a model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a good model?

1) Let Science or Clinical factors guide selection: Final Model

Note: three variables entered but not statistically significant

1) Let Science or Clinical factors guide selection

Is this the best model?
Should I leave out the non-significant factors (Model 2)?

Model   Adj R²   F from ANOVA
1       0.137    37.48
2       0.134    72.021

Adj R² is lower, F has increased, and the number of parameters is smaller in the 2nd model. Is this better?

Kullback-Leibler Information

Kullback and Leibler (1951) quantified the meaning of information in relation to Fisher's sufficient statistics
Basically we have reality f, and a model g to approximate f
The K-L information is I(f, g)

Kullback-Leibler Information

We want to minimise I(f, g) to obtain the best model over other models
I(f, g) is the information lost, or the distance between reality and a model, so we need to minimise:

I(f, g) = ∫ f(x) log( f(x) / g(x) ) dx
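As an illustration of this integral (not from the lecture), the sketch below evaluates I(f, g) for two normal densities, where a closed form exists, and checks it against direct numerical integration on a grid. The choice f = N(0, 1), g = N(1, 1.5²) is arbitrary.

```python
import math

def kl_normal(mu_f, sd_f, mu_g, sd_g):
    """Closed-form K-L information I(f, g) for two normal densities."""
    return (math.log(sd_g / sd_f)
            + (sd_f**2 + (mu_f - mu_g)**2) / (2 * sd_g**2) - 0.5)

def pdf(x, mu, sd):
    """Normal density, used for the numerical check."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Riemann-sum approximation of  I(f, g) = ∫ f(x) log(f(x)/g(x)) dx
dx = 0.001
numeric = sum(pdf(x * dx, 0, 1) * math.log(pdf(x * dx, 0, 1) / pdf(x * dx, 1, 1.5)) * dx
              for x in range(-10000, 10001))
print(kl_normal(0, 1, 1, 1.5), numeric)  # the two values agree closely
```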

Akaike's Information Criterion

It turns out that the function I(f, g) is related to a very simple measure of goodness-of-fit: Akaike's Information Criterion, or AIC

Selection Criteria

With a large number of factors the type 1 error is large, so we are likely to end up with a model with many variables
Two standard criteria:
1) Akaike's Information Criterion (AIC)
2) Schwarz's Bayesian Information Criterion (BIC)
Both penalise models with a large number of variables if the sample size is large

Akaike's Information Criterion

AIC = -2 × log likelihood + 2p

where p = number of parameters, and -2 × log likelihood is in the output
Hence AIC penalises models with a large number of variables
Select the model that minimises (-2LL + 2p)
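The formula translates directly into a small function; a minimal sketch with made-up log-likelihood and parameter values, not the course data:

```python
def aic(log_likelihood, p):
    """Akaike's Information Criterion: AIC = -2*logL + 2p (smaller is better)."""
    return -2.0 * log_likelihood + 2 * p

# Hypothetical comparison: the second model fits slightly worse (lower logL)
# but saves two parameters, so its AIC is lower and it would be preferred.
full = aic(-150.0, 6)     # -> 312.0
reduced = aic(-151.5, 4)  # -> 311.0
print(full, reduced)
```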

Generalized linear models

Unfortunately the standard REGRESSION procedure in SPSS does not give these statistics
Need to use Analyze > Generalized Linear Models...

Generalized linear models

Default is linear
Add Min LDL achieved as dependent, as in REGRESSION in SPSS
Next go to predictors...

Generalized linear models: Predictors

WARNING! Make sure you add the predictors in the correct box:
Categorical in the FACTORS box
Continuous in the COVARIATES box

Generalized linear models: Model

Add all factors and covariates in the model as main effects

Generalized Linear Models: Parameter Estimates

Note: identical to the REGRESSION output

Generalized Linear Models: Goodness-of-fit

Note: the output gives the log likelihood and AIC = 2835
(AIC = -2 × (-1409.6) + 2 × 7 ≈ 2835)
The footnote explains that a smaller AIC is better

Let Science or Clinical factors guide selection: Optimal model

The log likelihood is a measure of GOODNESS-OF-FIT
Seek the optimal model that maximises the log likelihood or minimises the AIC

Model                                  Log likelihood   AIC
1 Full model                           -1409.6          2835.6
2 Non-significant variables removed    -1413.6          2837.2

Change in AIC is 1.6

1) Let Science or Clinical factors guide selection

Key points:
1. Results demonstrate a significant association with baseline LDL, Age and Adherence
2. Difficult choices with Gender, smoking and BMI
3. AIC only changes by 1.6 when these are removed
4. Generally, changes of 4 or more in AIC are considered important

1) Let Science or Clinical factors guide selection

Key points:
1. Conclude there is little to choose between the models
2. AIC is actually lower with the larger model, and Gender and BMI are considered important factors, so keep the larger model - but you have to justify this
3. Model building is manual, logical, transparent and under your control

2) Use automatic selection procedures

These are based on automatic mechanical algorithms, usually related to statistical significance
Common ones are stepwise, forward or backward elimination
Can be selected in SPSS using Method in the dialogue box

2) Use automatic selection procedures (e.g. Stepwise)

Select Method = Stepwise

2) Use automatic selection procedures (e.g. Stepwise)

[Screenshots: SPSS output for the 1st step, 2nd step and final model]

2) Change in AIC with Stepwise selection

Note: only available from Generalized Linear Models

Step   Model          Log likelihood   AIC      Change in AIC
1      Baseline LDL   -1423.1          2852.2
2      +Adherence     -1418.0          2844.1   8.1
3      +Age           -1413.6          2837.2   6.9

2) Advantages and disadvantages of stepwise

Advantages:
Simple to implement
Gives a parsimonious model
Selection is certainly objective

Disadvantages:
Non-stable selection - stepwise considers many models that are very similar
The p-value on entry may be smaller once the procedure is finished, so p-values are exaggerated
Predictions in an external dataset are usually worse for stepwise procedures

2) Automatic procedures: Backward elimination

Backward elimination starts by removing the least significant factor from the full model, and has a few advantages over forward selection:
The modeller has to consider the full model and sees results for all factors simultaneously
Correlated factors can remain in the model (in forward methods they may not even enter)
Criteria for removal tend to be more lax in backward selection, so you end up with more parameters
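SPSS performs the elimination internally using significance criteria; as a rough illustration of the idea (not SPSS's exact algorithm - this sketch drops variables by AIC rather than p-values), here is a minimal backward elimination on made-up data. All variable names and data below are hypothetical.

```python
import numpy as np

def fit_aic(X, y):
    """Gaussian OLS AIC = -2*logL + 2p, where p counts the regression
    coefficients in X plus the residual variance."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * (X.shape[1] + 1)

def backward_eliminate(X, y, names):
    """Greedy backward elimination: on each pass drop the predictor whose
    removal gives the lowest AIC; stop when no removal improves AIC."""
    keep = list(range(X.shape[1]))
    def design(cols):
        return np.column_stack([np.ones(len(y)), X[:, cols]])
    best = fit_aic(design(keep), y)
    while len(keep) > 1:
        trials = [(fit_aic(design([k for k in keep if k != j]), y), j)
                  for j in keep]
        aic_min, j_min = min(trials, key=lambda t: t[0])
        if aic_min >= best:
            break
        best = aic_min
        keep.remove(j_min)
    return [names[k] for k in keep], best

# Hypothetical data: x3 is pure noise, so it will usually be eliminated,
# while the genuine predictors x1 and x2 are retained.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
selected, best_aic = backward_eliminate(X, y, ["x1", "x2", "x3"])
print(selected, round(best_aic, 1))
```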

2) Use automatic selection procedures (e.g. Backward)

Select Method = Backward

2) Backward elimination in SPSS

1st step: Gender removed
2nd step: BMI removed
[Screenshot: final model]

Summary of automatic selection

Automatic selection may not give the optimal model (it may leave out important factors)
Different methods may give different results (forward vs. backward elimination)
Backward elimination is preferred as it is less stringent
Too easily fitted in SPSS!
Model assessment still requires some thought

3) A mixture of automatic procedures and self selection

Use automatic procedures as a guide
Think about what factors are important
Add important factors
Do not blindly follow statistical significance
Consider AIC

Summary of Model selection

Selection of factors for multiple linear regression models requires some judgement
Automatic procedures are available, but treat the results with caution
They are easily fitted in SPSS
Check AIC or log likelihood for fit

Summary

Multiple regression models are the most used analytical tool in quantitative research
They are easily fitted in SPSS
Model assessment requires some thought
Parsimony is better - Occam's Razor

Remember Occam's Razor

"Entia non sunt multiplicanda praeter necessitatem"
"Entities must not be multiplied beyond necessity"

William of Ockham
14th century friar and logician
1288-1347

Summary

After fitting any model, check the assumptions:
Functional form - linearity or not
Check residuals for normality
Check residuals for outliers
All accomplished within SPSS
See publications for further info:
Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A GoDARTS study. Pharmacogenetics and Genomics, 2008; 18: 279-87.

Practical on Multiple Regression

Read in LDL Data.sav

1) Try fitting a multiple regression model on Min LDL obtained, using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.
2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?