
REGRESSION

This procedure performs multiple linear regression with five methods for entry and
removal of variables. It also provides extensive analysis of residuals and influential
cases. A caseweight (CASEWEIGHT) and a regression weight (REGWGT) can be
specified in the model fitting.

Notation

The following notation is used throughout this chapter unless otherwise stated:

y_i      Dependent variable for case i, with variance σ²/g_i
c_i      Caseweight for case i; c_i = 1 if CASEWEIGHT is not specified
g_i      Regression weight for case i; g_i = 1 if REGWGT is not specified
l        Number of distinct cases
w_i      c_i g_i
W        Sum of weights: W = Σ_{i=1}^l w_i
p        Number of independent variables
C        Sum of caseweights: C = Σ_{i=1}^l c_i
x_ki     The kth independent variable for case i
X̄_k      Sample mean for the kth independent variable: X̄_k = Σ_{i=1}^l w_i x_ki / W
Ȳ        Sample mean for the dependent variable: Ȳ = Σ_{i=1}^l w_i y_i / W
h_i      Leverage for case i
h̃_i      g_i/W + h_i
S_kj     Sample covariance for X_k and X_j
S_yy     Sample variance for Y
S_ky     Sample covariance for X_k and Y
p*       Number of coefficients in the model; p* = p if the intercept is not included,
         otherwise p* = p + 1
R        The sample correlation matrix for X_1, ..., X_p and Y

Descriptive Statistics

The sample correlation matrix is

        | r_11  ...  r_1p  r_1y |
        | r_21  ...  r_2p  r_2y |
    R = |  ...  ...   ...   ... |
        | r_y1  ...  r_yp  r_yy |

where

    r_kj = S_kj / √(S_kk S_jj)

and

    r_yk = r_ky = S_ky / √(S_kk S_yy)
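As a minimal NumPy sketch (not SPSS code), the augmented correlation matrix above can be formed from the weighted covariances; the function name and the default unit weights are illustrative assumptions.

```python
import numpy as np

def augmented_correlation_matrix(X, y, c=None, g=None):
    """Sample correlation matrix R for (X_1, ..., X_p, Y).

    c: caseweights (default 1), g: regression weights (default 1).
    Combined weight w_i = c_i * g_i; the covariance divisor is C - 1,
    where C is the sum of caseweights, as in the text.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n = len(y)
    c = np.ones(n) if c is None else np.asarray(c, float)
    g = np.ones(n) if g is None else np.asarray(g, float)
    w = c * g
    Z = np.column_stack([X, y])                      # augment predictors with Y
    mean = (w[:, None] * Z).sum(axis=0) / w.sum()    # X̄_k = Σ w_i x_ki / W
    D = Z - mean
    S = (w[:, None] * D).T @ D / (c.sum() - 1.0)     # S_kj, divisor C − 1
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)                        # r_kj = S_kj / √(S_kk S_jj)
```

With unit weights this reduces to the ordinary correlation matrix of the augmented data.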

The sample means X̄_i and covariances S_ij are computed by a provisional means
algorithm. Define

    W_k = Σ_{i=1}^k w_i = cumulative weight up to case k

then

    X̄_i^(k) = X̄_i^(k−1) + (x_ik − X̄_i^(k−1)) w_k / W_k

and, if the intercept is included,

    C_ij^(k) = C_ij^(k−1) + (x_ik − X̄_i^(k−1))(x_jk − X̄_j^(k−1)) (w_k − w_k²/W_k)

Otherwise,

    C_ij^(k) = C_ij^(k−1) + w_k x_ik x_jk

where

    X̄_i^(1) = x_i1   and   C_ij^(1) = 0

The sample covariance S_ij is computed as the final C_ij divided by C − 1.
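The provisional means recursion can be sketched directly; this is an illustrative one-pass implementation, assuming the update order shown above (cumulate W_k first, then update C before the mean).

```python
import numpy as np

def provisional_stats(Z, w=None, intercept=True):
    """One-pass (provisional means) accumulation of means and cross-products.

    Implements
      X̄^(k) = X̄^(k−1) + (z_k − X̄^(k−1)) w_k / W_k
      C^(k)  = C^(k−1) + outer(z_k − X̄^(k−1)) * (w_k − w_k²/W_k)   (intercept case)
    and raw weighted cross-products in the no-intercept case.
    """
    Z = np.asarray(Z, float)
    n, m = Z.shape
    w = np.ones(n) if w is None else np.asarray(w, float)
    mean = np.zeros(m)
    C = np.zeros((m, m))
    W = 0.0
    for k in range(n):
        W += w[k]                    # cumulative weight W_k
        d = Z[k] - mean              # deviation from the provisional mean
        if intercept:
            C += np.outer(d, d) * (w[k] - w[k] ** 2 / W)
        else:
            C += w[k] * np.outer(Z[k], Z[k])
        mean += d * w[k] / W
    return mean, C
```

For unit weights the final C equals the ordinary centered (or raw) cross-product matrix, which is the usual correctness check for this recursion.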

Sweep Operations (Dempster, 1969)

For a regression model of the form

    Y_i = β_0 + β_1 X_1i + β_2 X_2i + ... + β_p X_pi + e_i

sweep operations are used to compute the least squares estimates b of β and the
associated regression statistics. The sweeping starts with the correlation matrix R.

Let R̃ be the new matrix produced by sweeping on the kth row and column of R.
The elements of R̃ are

    r̃_kk = 1 / r_kk

    r̃_ik = r_ik / r_kk,    i ≠ k

    r̃_kj = −r_kj / r_kk,   j ≠ k

and

    r̃_ij = (r_ij r_kk − r_ik r_kj) / r_kk,   i ≠ k, j ≠ k

If the above sweep operations are repeatedly applied to each row of R_11 in

    R = | R_11  R_12 |
        | R_21  R_22 |

where R_11 contains the independent variables in the equation at the current step,
the result is

    R̃ = | R_11^(−1)         −R_11^(−1) R_12             |
        | R_21 R_11^(−1)     R_22 − R_21 R_11^(−1) R_12 |

The last row of R_21 R_11^(−1) contains the standardized coefficients (also called
BETA), and R_22 − R_21 R_11^(−1) R_12 can be used to obtain the partial
correlations for the variables not in the equation, controlling for the variables
already in the equation. Note that this routine is its own inverse; that is, exactly the
same operations are performed to remove a variable as to enter a variable.
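A minimal sweep-operator sketch in NumPy, following the element formulas above (the sign convention on the swept row is the one used in this reconstruction). It can be checked against a direct least-squares solve and against the self-inverse property.

```python
import numpy as np

def sweep(R, k):
    """Sweep the symmetric matrix R on row/column k:
    r̃_kk = 1/r_kk, r̃_ik = r_ik/r_kk, r̃_kj = −r_kj/r_kk,
    r̃_ij = (r_ij r_kk − r_ik r_kj)/r_kk.
    """
    R = np.asarray(R, float)
    n = R.shape[0]
    S = np.empty_like(R)
    d = R[k, k]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                S[i, j] = 1.0 / d
            elif i == k:
                S[i, j] = -R[i, j] / d          # swept row changes sign
            elif j == k:
                S[i, j] = R[i, j] / d           # swept column keeps sign
            else:
                S[i, j] = (R[i, j] * d - R[i, k] * R[k, j]) / d
    return S
```

After sweeping the rows of R_11, the y-row entries over the swept columns are the standardized coefficients R_21 R_11^(−1), and the y-diagonal becomes 1 − R².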

Variable Selection Criteria

Let r_ij be the element in the current swept matrix associated with X_i and X_j.
Variables are entered or removed one at a time. X_k is eligible for entry if it is an
independent variable not currently in the model with

    r_kk ≥ t   (tolerance, with a default of 0.0001)

and also, for each variable X_j that is currently in the model,

    r_jj − (r_jk r_kj) / r_kk ≤ t^(−1)

The above condition is imposed so that entry of the variable does not reduce the
tolerance of variables already in the model to unacceptable levels.

The F-to-enter value for X_k is computed as

    F-to-enter_k = (C − p* − 1) V_k / (r_yy − V_k)

with 1 and C − p* − 1 degrees of freedom, where p* is the number of coefficients
currently in the model and

    V_k = r_yk r_ky / r_kk

The F-to-remove value for X_k is computed as

    F-to-remove_k = (C − p*) V_k / r_yy

with 1 and C − p* degrees of freedom.

Methods for Variable Entry and Removal


Five methods for entry and removal of variables are available. The selection
process is repeated until the maximum number of steps (MAXSTEP) is reached or
no more independent variables qualify for entry or removal. The algorithms for
these five methods are described below.

Stepwise

If there are independent variables currently entered in the model, choose X_k such
that F-to-remove_k is minimum. X_k is removed if F-to-remove_k < F_out
(default = 2.71) or, if probability criteria are used, P(F-to-remove_k) > P_out
(default = 0.1). If the inequality does not hold, no variable is removed from the
model.

If there are no independent variables currently entered in the model, or if no
entered variable is to be removed, choose X_k such that F-to-enter_k is maximum.
X_k is entered if F-to-enter_k > F_in (default = 3.84) or, if probability criteria are
used, P(F-to-enter_k) < P_in (default = 0.05). If the inequality does not hold, no
variable is entered.

At each step, all eligible variables are considered for removal and entry.

Forward
This procedure is the entry phase of the stepwise procedure.

Backward
This procedure is the removal phase of the stepwise procedure and can be used only
after at least one independent variable has been entered in the model.

Enter (Forced Entry)

Choose X_k such that r_kk is maximum and enter X_k. Repeat for all variables to
be entered.

Remove (Forced Removal)

Choose X_k such that r_kk is minimum and remove X_k. Repeat for all variables to
be removed.

Statistics
Summary
For the summary statistics, assume p independent variables are currently entered in
the equation, of which a block of q variables have been entered or removed in the
current step.
Multiple R

    R = √(1 − r_yy)

R Square

    R² = 1 − r_yy

Adjusted R Square

    R²_adj = R² − (1 − R²) p / (C − p*)

R Square Change (when a block of q independent variables was added or removed)

    ΔR² = R²_current − R²_previous

F Change and Significance of F Change

    F = { ΔR² (C − p*) / [q (1 − R²_current)]          for the addition of q independent variables
        { −ΔR² (C − p* − q) / [q (1 − R²_previous)]    for the removal of q independent variables

The degrees of freedom for the addition are q and C − p*, while the degrees of
freedom for the removal are q and C − p* − q.

Residual Sum of Squares

    SS_e = r_yy (C − 1) S_yy

with degrees of freedom C − p*.

Sum of Squares Due to Regression

    SS_R = R² (C − 1) S_yy

with degrees of freedom p.
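A small NumPy sketch of the summary statistics above for the unweighted intercept case (so C is the number of cases and p* = p + 1); the function name and return structure are illustrative, not SPSS's.

```python
import numpy as np

def summary_stats(X, y):
    """R, R², adjusted R², SS_e and SS_R for an intercept model with unit
    weights, using the identities R² = 1 − r_yy (swept) and
    SS_e = r_yy (C − 1) S_yy from the text."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n, p = X.shape
    p_star = p + 1                               # intercept included
    A = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    sse = resid @ resid                          # SS_e
    syy = y.var(ddof=1)                          # S_yy
    r2 = 1.0 - sse / ((n - 1) * syy)             # R² = 1 − r_yy
    adj = r2 - (1.0 - r2) * p / (n - p_star)     # R²_adj = R² − (1 − R²) p / (C − p*)
    ssr = r2 * (n - 1) * syy                     # SS_R = R² (C − 1) S_yy
    return {"R": np.sqrt(r2), "R2": r2, "adjR2": adj, "SSe": sse, "SSR": ssr}
```

A quick sanity check is the decomposition SS_e + SS_R = (C − 1) S_yy.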

ANOVA Table

Analysis of Variance

                 df          Sum of Squares    Mean Square
    Regression   p           SS_R              SS_R / p
    Residual     C − p*      SS_e              SS_e / (C − p*)

Variance-Covariance Matrix for Unstandardized Regression Coefficient Estimates

A square matrix of size p with diagonal elements equal to the variances, the
below-diagonal elements equal to the covariances, and the above-diagonal elements
equal to the correlations:

    var(b_k) = r_kk r_yy S_yy / ( S_kk (C − p*) )

    cov(b_k, b_j) = r_kj r_yy S_yy / ( √(S_kk S_jj) (C − p*) )

    cor(b_k, b_j) = r_kj / √(r_kk r_jj)

Selection Criteria

Akaike Information Criterion (AIC)

    AIC = C ln(SS_e / C) + 2 p*

Amemiya's Prediction Criterion (PC)

    PC = (1 − R²)(C + p*) / (C − p*)

Mallows' Cp (CP)

    CP = SS_e / σ̂² + 2 p* − C

where σ̂² is the mean square error from fitting the model that includes all the
variables in the variable list.

Schwarz Bayesian Criterion (SBC)

    SBC = C ln(SS_e / C) + p* ln(C)
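The four criteria above are direct arithmetic; a minimal sketch (function name illustrative, inputs as defined in the text):

```python
import math

def selection_criteria(sse, r2, sigma2_full, C, p_star):
    """AIC, Amemiya's PC, Mallows' CP and SBC as defined above.
    sigma2_full is the mean square error of the model containing all
    candidate variables; C is the sum of caseweights, p_star = p*."""
    aic = C * math.log(sse / C) + 2 * p_star
    pc = (1 - r2) * (C + p_star) / (C - p_star)
    cp = sse / sigma2_full + 2 * p_star - C
    sbc = C * math.log(sse / C) + p_star * math.log(C)
    return aic, pc, cp, sbc
```

Note that for the full model itself, SS_e = σ̂² (C − p*), so CP reduces to p*, which is the usual check for a Mallows' Cp implementation.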

Collinearity

Variance Inflation Factors

    VIF_i = 1 / r_ii

Tolerance

    Tolerance_i = r_ii

Eigenvalues

The eigenvalues λ_i of the scaled and uncentered cross-products matrix for the
independent variables in the equation are computed by the QL method
(Wilkinson and Reinsch, 1971).

Condition Indices

    η_k = √( λ_max / λ_k )

Variance-Decomposition Proportions

Let

    v_i = (v_i1, ..., v_ip)

be the eigenvector associated with eigenvalue λ_i. Also, let

    Φ_ij = v_ij² / λ_i   and   Φ_j = Σ_{i=1}^p Φ_ij

The variance-decomposition proportion for the jth regression coefficient associated
with the ith component is defined as

    π_ij = Φ_ij / Φ_j
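A NumPy sketch of the collinearity diagnostics, assuming "scaled and uncentered" means columns (including the constant, if present) scaled to unit length; it uses a symmetric eigensolver rather than the QL routine cited above, and the function name is illustrative.

```python
import numpy as np

def collinearity_diagnostics(X, intercept=True):
    """Condition indices η_k = √(λ_max/λ_k) and variance-decomposition
    proportions π_ij = Φ_ij / Φ_j from the scaled, uncentered
    cross-products matrix."""
    X = np.asarray(X, float)
    A = np.column_stack([np.ones(len(X)), X]) if intercept else X
    A = A / np.linalg.norm(A, axis=0)        # scale columns to unit length
    lam, V = np.linalg.eigh(A.T @ A)         # eigenvalues λ_i, eigenvectors v_i
    phi = V.T ** 2 / lam[:, None]            # Φ_ij = v_ij² / λ_i (row i = component)
    pi = phi / phi.sum(axis=0)               # π_ij = Φ_ij / Φ_j
    eta = np.sqrt(lam.max() / lam)           # condition indices
    return lam, eta, pi
```

By construction each column of π sums to 1, and the component with the largest eigenvalue has condition index 1.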

Statistics for Variables in the Equation

Regression Coefficient b_k

    b_k = r_yk √( S_yy / S_kk )   for k = 1, ..., p

The standard error of b_k is computed as

    σ̂_bk = √( r_kk r_yy S_yy / ( S_kk (C − p*) ) )

A 95% confidence interval for β_k is constructed from

    b_k ± σ̂_bk t(0.025, C − p*)

If the model includes the intercept, the intercept is estimated as

    b_0 = Ȳ − Σ_{k=1}^p b_k X̄_k

The variance of b_0 is estimated by

    σ̂²_b0 = (C − 1) r_yy S_yy / ( C (C − p*) )
             + Σ_{k=1}^p X̄_k² σ̂²_bk
             + 2 Σ_{j=1}^{p−1} Σ_{k=j+1}^p X̄_k X̄_j est. cov(b_k, b_j)

Beta Coefficients

    Beta_k = r_yk

The standard error of Beta_k is estimated by

    σ̂_Betak = √( r_yy r_kk / (C − p*) )

F-test for Beta_k

    F = ( Beta_k / σ̂_Betak )²

with 1 and C − p* degrees of freedom.

Part Correlation of X_k with Y

    Part Corr(X_k) = r_yk / √(r_kk)

Partial Correlation of X_k with Y

    Partial Corr(X_k) = r_yk / √( r_kk r_yy − r_yk r_ky )

Statistics for Variables Not in the Equation

The standardized regression coefficient Beta_k, if X_k enters the equation at the
next step, is

    Beta_k = r_yk / r_kk

The F-test for Beta_k is

    F = (C − p* − 1) r_yk² / ( r_kk r_yy − r_yk² )

with 1 and C − p* − 1 degrees of freedom.

Partial Correlation of X_k with Y

    Partial(X_k) = r_yk / √( r_yy r_kk )

Tolerance of X_k

    Tolerance_k = r_kk

The minimum tolerance among variables already in the equation, if X_k enters at
the next step, is

    min( min_{1≤j≤p} [ 1 / ( r_jj − r_kj r_jk / r_kk ) ], r_kk )

Residuals and Associated Statistics

There are 19 temporary variables that can be added to the active system file. These
variables can be requested with the RESIDUAL subcommand.

Centered Leverage Values

For all cases, compute

    h_i = { (g_i/C) Σ_{j=1}^p Σ_{k=1}^p (x_ji − X̄_j)(x_ki − X̄_k) r̃_jk / √(S_jj S_kk)   if the intercept is included
          { (g_i/C) Σ_{j=1}^p Σ_{k=1}^p x_ji x_ki r̃_jk / √(S_jj S_kk)                    otherwise

where r̃_jk denotes the (j, k)th element of R_11^(−1).

For selected cases, leverage is h_i; for unselected case i with positive caseweight,
leverage is

    h'_i = { g_i (1/W + h_i) / [ 1 + g_i (1/W + h_i) ]   if the intercept is included
           { g_i h_i / ( 1 + g_i h_i )                   otherwise

Unstandardized Predicted Values

    Ŷ_i = { Σ_{k=1}^p b_k x_ki          if no intercept is included
          { b_0 + Σ_{k=1}^p b_k x_ki    otherwise

Unstandardized Residuals

    e_i = y_i − Ŷ_i

Standardized Residuals

    ZRESID_i = { e_i / s    if no regression weight is specified
               { SYSMIS     otherwise

where s is the square root of the residual mean square.

Standardized Predicted Values

    ZPRED_i = { (Ŷ_i − Ȳ) / sd    if no regression weight is specified
              { SYSMIS            otherwise

where sd is computed as

    sd = √( Σ_{i=1}^l c_i (Ŷ_i − Ȳ)² / (C − 1) )

Studentized Residuals

    SRES_i = { e_i / ( s √( (1 − h̃_i) / g_i ) )    for selected cases with c_i > 0
             { e_i / ( s̃ √( (1 + h_i) / g_i ) )    otherwise

Deleted Residuals

    DRESID_i = { e_i / (1 − h̃_i)    for selected cases with c_i > 0
               { e_i                 otherwise

Studentized Deleted Residuals

    SDRESID_i = { DRESID_i / s(i)                    for selected cases with c_i > 0
                { e_i / ( s̃ √( (1 + h_i) / g_i ) )  otherwise

where s(i) is computed as

    s(i) = √( [ (C − p*) s² − (1 − h̃_i) DRESID_i² ] / (C − p* − 1) )
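A NumPy sketch of the deleted and studentized deleted residuals for the unweighted intercept case, where h̃_i = 1/n + h_i is simply the uncentered hat diagonal; the function name is illustrative. The closed forms can be verified against a brute-force refit with the case removed.

```python
import numpy as np

def residual_diagnostics(X, y):
    """DRESID_i = e_i/(1 − h̃_i) and SDRESID_i = DRESID_i/s(i) with
    s(i)² = [(C − p*)s² − (1 − h̃_i)DRESID_i²] / (C − p* − 1),
    assuming c_i = g_i = 1."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n, p = X.shape
    p_star = p + 1
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T      # hat matrix; h̃_i = H_ii
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p_star)                 # residual mean square s²
    dresid = e / (1 - h)
    s_i = np.sqrt(((n - p_star) * s2 - (1 - h) * dresid ** 2) / (n - p_star - 1))
    return dresid, dresid / s_i
```

DRESID_i equals the prediction error for case i from the model fitted without case i, and s(i) equals that model's residual standard deviation.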

Adjusted Predicted Values

    ADJPRED_i = Ŷ_i − DRESID_i

DfBeta

    DFBETA_i = b − b(i) = g_i e_i (X^t W X)^(−1) X_i^t / (1 − h̃_i)

where

    X_i^t = { (1, x_1i, ..., x_pi)    if the intercept is included
            { (x_1i, ..., x_pi)       otherwise

and W = diag(w_1, ..., w_l).

Standardized DfBeta

    SDBETA_ij = ( b_j − b_j(i) ) / ( s(i) √( [(X^t W X)^(−1)]_jj ) )

where b_j − b_j(i) is the jth component of b − b(i).

DfFit

    DFFIT_i = X_i (b − b(i)) = h̃_i e_i / (1 − h̃_i)

Standardized DfFit

    SDFIT_i = DFFIT_i / ( s(i) √(h̃_i) )

Covratio

    COVRATIO_i = ( s(i) / s )^(2p*) × 1 / (1 − h̃_i)
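The closed-form influence quantities above avoid refitting the model once per case. A minimal NumPy sketch for the unweighted intercept case (W = I, g_i = 1), with illustrative names, verifiable against an explicit leave-one-out refit:

```python
import numpy as np

def dfbeta_dffit(X, y):
    """Case-deletion influence via the closed forms
    DFBETA_i = (X'X)⁻¹ x_i e_i / (1 − h̃_i) and
    DFFIT_i  = h̃_i e_i / (1 − h̃_i),
    with h̃_i the uncentered hat diagonal (unit weights)."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    G = np.linalg.inv(A.T @ A)
    h = np.einsum('ij,jk,ik->i', A, G, A)       # h̃_i = x_i (X'X)⁻¹ x_i'
    e = y - A @ (G @ (A.T @ y))
    dfbeta = (A @ G) * (e / (1 - h))[:, None]   # row i = b − b(i)
    dffit = h * e / (1 - h)
    return dfbeta, dffit
```

The design choice here mirrors the text: one fit plus O(1) work per case, instead of l separate regressions.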

Mahalanobis Distance

For selected cases with c_i > 0,

    MAHAL_i = { (C − 1) h_i    if the intercept is included
              { C h_i          otherwise

For unselected cases with c_i > 0,

    MAHAL_i = { C h_i          if the intercept is included
              { (C + 1) h_i    otherwise

Cook's Distance (Cook, 1977)

For selected cases with c_i > 0,

    COOK_i = { DRESID_i² h̃_i g_i / ( s² (p + 1) )    if the intercept is included
             { DRESID_i² h_i g_i / ( s² p )           otherwise

For unselected cases with c_i > 0,

    COOK_i = { DRESID_i² (h_i + 1/W) g_i / ( s̃² (p + 1) )    if the intercept is included
             { DRESID_i² h_i g_i / ( s̃² p )                   otherwise

where h_i is the leverage for unselected case i, and s̃² is computed as

    s̃² = { [ SS_e + e_i² / (1 + h_i + 1/W) ] / (C − p*)       if the intercept is included
         { [ SS_e + e_i² / (1 + h_i) ] / (C − p* + 1)          otherwise

Standard Errors of the Mean Predicted Values

For all cases with positive caseweight,

    SEPRED_i = { s √( h̃_i / g_i )    if the intercept is included
               { s √( h_i / g_i )     otherwise

95% Confidence Interval for Mean Predicted Response

    LMCIN_i = Ŷ_i − t(0.025, C − p*) SEPRED_i

    UMCIN_i = Ŷ_i + t(0.025, C − p*) SEPRED_i

95% Confidence Interval for a Single Observation

    LICIN_i = { Ŷ_i − t(0.025, C − p*) s √( (h̃_i + 1) / g_i )    if the intercept is included
              { Ŷ_i − t(0.025, C − p*) s √( (h_i + 1) / g_i )     otherwise

    UICIN_i = { Ŷ_i + t(0.025, C − p*) s √( (h̃_i + 1) / g_i )    if the intercept is included
              { Ŷ_i + t(0.025, C − p*) s √( (h_i + 1) / g_i )     otherwise

Durbin-Watson Statistic

    DW = Σ_{i=2}^l (ẽ_i − ẽ_{i−1})² / Σ_{i=1}^l c_i ẽ_i²

where ẽ_i = e_i √g_i.
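The Durbin-Watson ratio is a one-liner; this sketch assumes the √g_i standardization of the residuals given in this reconstruction, with caseweights defaulting to 1 and an illustrative function name.

```python
import numpy as np

def durbin_watson(e, c=None, g=None):
    """DW = Σ_{i=2}^l (ẽ_i − ẽ_{i−1})² / Σ_{i=1}^l c_i ẽ_i²
    with ẽ_i = e_i √g_i; c and g default to 1."""
    e = np.asarray(e, float)
    n = len(e)
    c = np.ones(n) if c is None else np.asarray(c, float)
    g = np.ones(n) if g is None else np.asarray(g, float)
    et = e * np.sqrt(g)                           # weighted residuals ẽ_i
    return np.sum(np.diff(et) ** 2) / np.sum(c * et ** 2)
```

Values near 2 indicate little first-order autocorrelation; perfectly alternating residuals push DW toward 4, constant ones toward 0.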

Partial Residual Plots


The scatterplots of the residuals of the dependent variable and an independent
variable when both of these variables are regressed on the rest of the independent
variables can be requested in the RESIDUAL branch. The algorithm for these
residuals is described in Velleman and Welsch (1981).

Missing Values

By default, a case that has a missing value for any variable is deleted from the
computation of the correlation matrix on which all subsequent computations are
based. Users are allowed to change the treatment of cases with missing values.

References

Cook, R. D. 1977. Detection of influential observations in linear regression.
Technometrics, 19: 15–18.

Dempster, A. P. 1969. Elements of Continuous Multivariate Analysis. Reading,
Mass.: Addison-Wesley.

Velleman, P. F., and Welsch, R. E. 1981. Efficient computing of regression
diagnostics. The American Statistician, 35: 234–242.

Wilkinson, J. H., and Reinsch, C. 1971. Linear algebra. In: Handbook for
Automatic Computation, Volume II, J. H. Wilkinson and C. Reinsch, eds. New
York: Springer-Verlag.