
Regression Analysis: Multiple Linear Regression

Dr. Madhuri Kulkarni

Consider the data set from the file ‘DeliveryTimeData.txt’. The data contain the delivery time (y) for a certain number of cases (x1) of soft drink bottles and the distance (x2) walked by the carrier.
We fit the following linear regression model to these data:
y = β0 + β1 x1 + β2 x2 + ε
1 Getting Data

DelTime = read.delim("D:/Regression Analysis/Datasets/DeliveryTimeData.txt",row.names=1)


summary(DelTime)

## Delivery.Time..y Number.of.Cases..x1 Distance..x2..ft.
## Min. : 8.00 Min. : 2.00 Min. : 36.0
## 1st Qu.:13.75 1st Qu.: 4.00 1st Qu.: 150.0
## Median :18.11 Median : 7.00 Median : 330.0
## Mean :22.38 Mean : 8.76 Mean : 409.3
## 3rd Qu.:21.50 3rd Qu.:10.00 3rd Qu.: 605.0
## Max. :79.24 Max. :30.00 Max. :1460.0

attach(DelTime)

2 Model Fitting and Summary

y = DelTime$Delivery.Time..y
x1 = DelTime$Number.of.Cases..x1
x2 = DelTime$Distance..x2..ft.
n = length(y)
p = ncol(DelTime)
reg = lm(y~x1+x2)
summary(reg)

##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## x1 1.615907 0.170735 9.464 3.25e-09 ***
## x2 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16

The last column in the summary of the regression model gives p-values for the tests of significance of the individual regression coefficients. The null hypothesis under consideration is H0 : βj = 0 for j = 0, 1, 2. Observe that both predictors, x1 and x2, are significant. The adjusted R2 is 95.59%. Lastly, the F statistic for testing the overall significance of the regression, i.e., the hypothesis H0 : β1 = β2 = 0, is 261.2. The hypothesis is rejected; that is, the regression is significant.
F tests for the regression coefficients can also be obtained from the analysis of variance. Note that anova() reports sequential (Type I) sums of squares, so only the F test for the last term entered (here x2) matches the square of the corresponding t statistic from the summary.
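The quantities discussed above need not be read off the printed summary; they can be extracted from the fitted object directly. A minimal sketch, illustrated on the built-in mtcars data (an assumption, since the delivery-time file is not reproduced here); the same calls apply to reg:

```r
# Fit a two-regressor model on built-in data as a stand-in example
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

pvals <- coef(s)[, "Pr(>|t|)"]   # p-values of the individual t tests
fstat <- s$fstatistic            # overall F value with its degrees of freedom

# p-value of the overall F test (not stored directly in the summary object)
f_p <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
print(pvals)
print(f_p)
```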
3 ANOVA

anova(reg)

## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 5382.4 5382.4 506.619 < 2.2e-16 ***
## x2 1 168.4 168.4 15.851 0.0006312 ***
## Residuals 22 233.7 10.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

names(anova(reg))

## [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
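The relation between the sequential F test of the last term and its t test can be checked numerically. A sketch, again on the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

F_last <- anova(fit)["hp", "F value"]          # sequential F for the last term
t_hp   <- coef(summary(fit))["hp", "t value"]  # t statistic for the same term

all.equal(F_last, t_hp^2)   # F = t^2 holds for the last term entered
```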

4 Confidence Intervals for β1 and β2

confint.lm(reg,level=0.9)

## 5 % 95 %
## (Intercept) 0.457987107 4.22447518
## x1 1.322730706 1.90908372
## x2 0.008180636 0.02058902

confint.lm(reg,level=0.95)

## 2.5 % 97.5 %
## (Intercept) 0.066751987 4.61571030
## x1 1.261824662 1.96998976
## x2 0.006891745 0.02187791

The model is invalid if the necessary assumptions are not satisfied by the fitted model. The assumptions include normality of errors, constant variance, and uncorrelatedness of errors. This procedure of validating the model is referred to as residual analysis.
5 Residual Analysis
• Normality

reg.stdres = rstandard(reg)
qqnorm(reg.stdres)
qqline(reg.stdres)

[Figure: Normal Q-Q plot of the standardized residuals (sample vs theoretical quantiles), with reference line]

shapiro.test(reg.stdres)

##
## Shapiro-Wilk normality test
##
## data: reg.stdres
## W = 0.9229, p-value = 0.05952

Since the p-value 0.0595 exceeds 0.05, the hypothesis of normality of the errors is not rejected.

• Constant Variance
The assumption of constant variance is checked by plotting the residuals against the fitted values. If the plot shows no systematic pattern (in particular, no funnel shape), the variance is taken to be constant. Equivalently, the residuals can be plotted against each regressor.

plot(reg$fitted.values,reg.stdres)

[Figure: standardized residuals (reg.stdres) vs fitted values (reg$fitted.values)]

plot(x1,reg.stdres)

[Figure: standardized residuals vs x1]

plot(x2,reg.stdres)

[Figure: standardized residuals vs x2]

The above plots show no pattern; hence the variance of the errors is taken to be constant.


• Uncorrelatedness of Errors
The plot of the residuals in time sequence shows whether there is autocorrelation among the errors.

plot(reg.stdres)

[Figure: standardized residuals vs observation index]
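Beyond the index plot, the Durbin-Watson statistic gives a numeric check of first-order autocorrelation: values near 2 suggest uncorrelated errors (roughly, DW is about 2(1 - r1), where r1 is the lag-1 residual autocorrelation). A sketch computing it directly, without extra packages, on the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
e   <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences
# of the residuals, divided by the residual sum of squares
DW <- sum(diff(e)^2) / sum(e^2)
DW
```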

6 Outlier Detection

ep1 = abs(reg.stdres)
plot(ep1)

[Figure: absolute standardized residuals (ep1) vs observation index]

out = which(ep1>2)
out

## 9
## 9
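An alternative screen uses externally studentised residuals, rstudent(), with a Bonferroni-adjusted t cutoff in place of the fixed |r_i| > 2 rule used above; the cutoff rule is our addition, not part of the original analysis. A sketch on the built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
ti  <- rstudent(fit)                 # externally studentised residuals
n   <- nrow(mtcars)
p   <- length(coef(fit))

# Bonferroni cutoff at overall level 0.05 across all n residuals
cutoff <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
which(abs(ti) > cutoff)              # indices of flagged observations
```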

7 Constructing the design matrix and hat matrix

E = rep(1,n)
X = cbind(E,x1,x2)
dim(X)

## [1] 25 3

H = X%*%solve(t(X)%*%X)%*%t(X)
dim(H)

## [1] 25 25

H1 = H%*%H
H1[1,1]

## [1] 0.1018018

H[1,1]

## [1] 0.1018018
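R computes the hat diagonals directly via hatvalues(), and model.matrix() returns the design matrix with the intercept column included, so the manual construction above can be cross-checked. A sketch on the built-in mtcars data, also verifying that H is idempotent (H %*% H = H), which is why H1[1,1] equalled H[1,1] above:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
X   <- model.matrix(fit)                  # design matrix, intercept included
H   <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix

all.equal(diag(H), hatvalues(fit), check.attributes = FALSE)  # TRUE
all.equal(H %*% H, H)                                         # TRUE: idempotent
```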

8 Different Types of Residuals and Their Usage


• Standardised Residuals

SS = anova(reg)$"Sum Sq"
MSS = anova(reg)$"Mean Sq"
SSRes = SS[3]
MSRes = MSS[3]
res = reg$residuals
rstd = res/sqrt(MSRes)

• Studentised Residuals

studres = res/sqrt(MSRes*(1-diag(H)))

which are the same as the residuals obtained earlier in reg.stdres via rstandard().


• PRESS (Prediction Error Sum of Squares) Residuals and PRESS Statistic

PR = res/(1-diag(H))   # PRESS residual: e_i/(1 - h_ii)
PRESS = sum(PR^2)
PRESS

## [1] 457.4024

The PRESS residual e_i /(1 − h_ii ) is the error in predicting observation i from a model fitted without it; large values of PRESS therefore indicate that the model predicts poorly.
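The leave-one-out interpretation of the PRESS residual can be verified by brute force: refit the model without each observation in turn and compare the resulting prediction errors with e_i/(1 - h_ii). A sketch on the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
e   <- residuals(fit)
h   <- hatvalues(fit)
press_res <- e / (1 - h)                 # PRESS residuals from one fit

# Leave-one-out prediction errors, one refit per observation
loo <- sapply(seq_len(nrow(mtcars)), function(i) {
  f_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(f_i, newdata = mtcars[i, ])
})

all.equal(unname(press_res), unname(loo))  # TRUE: same quantities
sum(press_res^2)                           # the PRESS statistic
```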

9 Testing of Linear Hypotheses H0 : β1 = β2

New Model:
y = β0 + β1 (x1 + x2 ) + ε

xnew = x1 + x2
regnew = lm(y~xnew)
summary(regnew)

##
## Call:
## lm(formula = y ~ xnew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.015 -4.735 -0.529 5.919 12.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.802052 2.294227 2.093 0.0476 *
## xnew 0.042058 0.004337 9.698 1.36e-09 ***

## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.03 on 23 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.795
## F-statistic: 94.05 on 1 and 23 DF, p-value: 1.359e-09

anova(regnew)

## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## xnew 1 4647.9 4647.9 94.052 1.359e-09 ***
## Residuals 23 1136.6 49.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SSnew = anova(regnew)$"Sum Sq"


SSResnew = SSnew[2]
SSH0 = SSResnew - SSRes
Ftest = (SSH0/1)/MSRes
Ftest

## [1] 84.9849

qf(0.95,1,n-3)

## [1] 4.30095

Since Ftest = 84.98 exceeds the critical value 4.30, the hypothesis H0 : β1 = β2 is rejected.
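When the restricted model keeps the same response, as here, this extra-sum-of-squares F test can also be obtained by passing both fitted models to anova(), e.g. anova(regnew, reg). A sketch of the analogous comparison on the built-in mtcars data, testing equality of the two slopes:

```r
full  <- lm(mpg ~ wt + hp, data = mtcars)
restr <- lm(mpg ~ I(wt + hp), data = mtcars)  # restricted fit: equal slopes

a <- anova(restr, full)   # F test of the restriction via model comparison
a
```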

H0 : 0.71β1 − 100β2 = 0, i.e., β2 = (0.71/100)β1 . Under H0 the model becomes y = β0 + β1 (x1 + (0.71/100)x2 ) + ε.

xnew = x1 + 0.71/100*x2
regnew = lm(y~xnew)
summary(regnew)

##
## Call:
## lm(formula = y ~ xnew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7724 -1.1454 0.3474 1.4496 7.7296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.41534 1.07573 2.245 0.0347 *
## xnew 1.71171 0.07391 23.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##
## Residual standard error: 3.216 on 23 degrees of freedom
## Multiple R-squared: 0.9589, Adjusted R-squared: 0.9571
## F-statistic: 536.3 on 1 and 23 DF, p-value: < 2.2e-16

anova(regnew)

## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## xnew 1 5546.7 5546.7 536.31 < 2.2e-16 ***
## Residuals 23 237.9 10.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SSnew = anova(regnew)$"Sum Sq"


SSResnew = SSnew[2]
SSH0 = SSResnew - SSRes
Ftest = (SSH0/1)/MSRes
Ftest

## [1] 0.3899549

qf(0.95,1,n-3)

## [1] 4.30095

Since Ftest = 0.39 is below the critical value 4.30, H0 is not rejected.

Alternative way to test the hypothesis

C = solve(t(X)%*%X)
betahat = reg$coefficients
num = 0.71*betahat[2]-100*betahat[3]
SE = MSRes*(0.71^2*C[2,2]+100^2*C[3,3]-2*0.71*100*C[2,3])
t0 = num/sqrt(SE)
t0

## x1
## -0.6244637

t0^2

## x1
## 0.3899549

Since t0^2 = 0.39 is below the critical value F0.95 (1, 22) = 4.30, we do not reject H0.

H0 : β1 + β2 = 1.9. Under H0 we get

y = β0 + β1 x1 + (1.9 − β1 )x2 + ε, which can be simplified as
y − 1.9x2 = β0 + β1 (x1 − x2 ) + ε

xnew = x1 - x2
ynew = y - 1.9*x2
regnew = lm(ynew~xnew)
summary(regnew)

##
## Call:
## lm(formula = ynew ~ xnew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5159 -1.8556 0.7375 1.6021 6.5780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.879830 1.094361 1.718 0.0993 .
## xnew 1.890362 0.002153 878.137 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.37 on 23 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.711e+05 on 1 and 23 DF, p-value: < 2.2e-16

anova(regnew)

## Analysis of Variance Table
##
## Response: ynew
## Df Sum Sq Mean Sq F value Pr(>F)
## xnew 1 8756922 8756922 771125 < 2.2e-16 ***
## Residuals 23 261 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SSnew = anova(regnew)$"Sum Sq"


SSResnew = SSnew[2]
SSH0 = SSResnew - SSRes
Ftest = (SSH0/1)/MSRes
Ftest

## [1] 2.584413

qf(0.95,1,n-3)

## [1] 4.30095

Since Ftest = 2.58 is below the critical value 4.30, H0 : β1 + β2 = 1.9 is not rejected.

Alternative approach

num = betahat[2]+betahat[3]-1.9
SE = MSRes*(C[2,2]+C[3,3]+2*C[2,3])
t0 = num/sqrt(SE)
t0

## x1
## -1.607611

t0^2

## x1
## 2.584413

Another method to compute the sum of squares due to H0 is via the general linear hypothesis form H0 : T β = k.

T = matrix(nrow=1,ncol=3)
T[1,] = c(0,1,1)
k = 1.9
Term1 = t(T%*%betahat - k)
Term2 = solve(T%*%C%*%t(T))
Term3 = T%*%betahat - k
F0 = (Term1%*%Term2%*%Term3/1)/MSRes
F0

## [,1]
## [1,] 2.584413
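The matrix computation above generalises to any linear hypothesis, so it is convenient to wrap it in a function; the name glh_F and its arguments are ours, not base R. A sketch on the built-in mtcars data, checked against the t test of a single coefficient (for which F = t^2):

```r
# F statistic for the general linear hypothesis H0: T beta = k
glh_F <- function(fit, T, k) {
  b     <- coef(fit)
  C     <- summary(fit)$cov.unscaled   # (X'X)^{-1}
  MSRes <- summary(fit)$sigma^2        # residual mean square
  r     <- nrow(T)                     # number of restrictions
  d     <- T %*% b - k
  drop(t(d) %*% solve(T %*% C %*% t(T)) %*% d / r) / MSRes
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
T1  <- matrix(c(0, 0, 1), nrow = 1)    # H0: coefficient of hp is 0
F0  <- glh_F(fit, T1, 0)
all.equal(F0, coef(summary(fit))["hp", "t value"]^2)  # TRUE
```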

H0 : β0 = 0; β1 + β2 = 3
New Model: y = β1 x1 + (3 − β1 )x2 + ε, which becomes
y − 3x2 = β1 (x1 − x2 ) + ε

ynew = y - 3*x2
xnew = x1 -x2
regnew = lm(ynew~0+xnew)
summary(regnew)

##
## Call:
## lm(formula = ynew ~ 0 + xnew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0552 -1.1957 0.8209 3.6191 12.3353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xnew 3.009725 0.002465 1221 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.265 on 24 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.491e+06 on 1 and 24 DF, p-value: < 2.2e-16

SSResnew = anova(regnew)$"Sum Sq"[2]
SSResnew

## [1] 941.8816

SSH0 = SSResnew - SSRes


Ftest = (SSH0/2)/MSRes
Ftest

## [1] 33.32731

qf(0.95,2,n-3)

## [1] 3.443357

Since Ftest = 33.33 exceeds the critical value 3.44, H0 is rejected.

Alternative method: H0 : T β = k

T = matrix(nrow=2,ncol=3)
T[1,] = c(1,0,0)
T[2,] = c(0,1,1)
k = c(0,3)
k = as.matrix(k)
dim(k)

## [1] 2 1

Term1 = t(T%*%betahat - k)
Term2 = solve(T%*%C%*%t(T))
Term3 = T%*%betahat - k
F0 = (Term1%*%Term2%*%Term3/2)/MSRes
F0

## [,1]
## [1,] 33.32731
