
Regression Analysis: Multiple Linear Regression

Dr. Madhuri Kulkarni

Consider the data set from the file ‘DeliveryTimeData.txt’. The data contain the delivery time (y) for a certain number of cases (x1) of soft drink bottles and the distance (x2) walked by the carrier.
We fit the following linear regression model to these data:
y = β0 + β1 x1 + β2 x2 + ε
1 Getting Data

DelTime = read.delim("D:/Regression Analysis/Datasets/DeliveryTimeData.txt",row.names=1)


summary(DelTime)

## Delivery.Time..y Number.of.Cases..x1 Distance..x2..ft.
## Min. : 8.00 Min. : 2.00 Min. : 36.0
## 1st Qu.:13.75 1st Qu.: 4.00 1st Qu.: 150.0
## Median :18.11 Median : 7.00 Median : 330.0
## Mean :22.38 Mean : 8.76 Mean : 409.3
## 3rd Qu.:21.50 3rd Qu.:10.00 3rd Qu.: 605.0
## Max. :79.24 Max. :30.00 Max. :1460.0

attach(DelTime)

2 Model Fitting and Summary

y = DelTime$Delivery.Time..y
x1 = DelTime$Number.of.Cases..x1
x2 = DelTime$Distance..x2..ft.
n = length(y)
p = ncol(DelTime)
reg = lm(y~x1+x2)
summary(reg)

##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## x1 1.615907 0.170735 9.464 3.25e-09 ***
## x2 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16

The last column in the summary of the regression model gives p-values for the tests of significance of the individual regression coefficients. The null hypothesis under consideration is H0 : βj = 0 for j = 0, 1, 2. Observe that both predictors, x1 and x2, are significant. The adjusted R2 is 95.59%. Lastly, the F statistic for testing the overall significance of the regression, i.e., the hypothesis H0 : β1 = β2 = 0, is 261.2. The hypothesis is rejected; that is, the regression is significant.
F tests for the regression coefficients can also be obtained from the analysis of variance. Note that anova() reports sequential (Type I) sums of squares, so only the F test for the last term entered (here x2) matches the square of the corresponding t statistic from the summary.
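The quantities discussed above need not be read off the printed summary; they can be extracted from the fitted object directly. A minimal sketch, illustrated on the built-in mtcars data (an assumption, since the delivery-time file is not reproduced here); the same calls apply to reg:

```r
# Fit a two-regressor model on built-in data as a stand-in example
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

pvals <- coef(s)[, "Pr(>|t|)"]   # p-values of the individual t tests
fstat <- s$fstatistic            # overall F value with its degrees of freedom

# p-value of the overall F test (not stored directly in the summary object)
f_p <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
print(pvals)
print(f_p)
```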
3 ANOVA

anova(reg)

## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 5382.4 5382.4 506.619 < 2.2e-16 ***
## x2 1 168.4 168.4 15.851 0.0006312 ***
## Residuals 22 233.7 10.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

names(anova(reg))

## [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
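The relation between the sequential F test of the last term and its t test can be checked numerically. A sketch, again on the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

F_last <- anova(fit)["hp", "F value"]          # sequential F for the last term
t_hp   <- coef(summary(fit))["hp", "t value"]  # t statistic for the same term

all.equal(F_last, t_hp^2)   # F = t^2 holds for the last term entered
```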

4 Confidence Intervals for β1 and β2

confint.lm(reg,level=0.9)

## 5 % 95 %
## (Intercept) 0.457987107 4.22447518
## x1 1.322730706 1.90908372
## x2 0.008180636 0.02058902

confint.lm(reg,level=0.95)

## 2.5 % 97.5 %
## (Intercept) 0.066751987 4.61571030
## x1 1.261824662 1.96998976
## x2 0.006891745 0.02187791

The model is invalid if the necessary assumptions are not satisfied by the fitted model. The assumptions include normality of errors, constant variance, and uncorrelatedness of errors. This procedure of validating the model is referred to as residual analysis.
5 Residual Analysis
• Normality

reg.stdres = rstandard(reg)
qqnorm(reg.stdres)
qqline(reg.stdres)

[Figure: Normal Q-Q plot of the standardized residuals (sample vs theoretical quantiles), with reference line]

shapiro.test(reg.stdres)

##
## Shapiro-Wilk normality test
##
## data: reg.stdres
## W = 0.9229, p-value = 0.05952

Since the p-value 0.0595 exceeds 0.05, the hypothesis of normality of the errors is not rejected.

• Constant Variance
The assumption of constant variance is checked by plotting the residuals against the fitted values. If the plot shows no systematic pattern (in particular, no funnel shape), the variance is taken to be constant. Equivalently, the residuals can be plotted against each regressor.

plot(reg$fitted.values,reg.stdres)

[Figure: standardized residuals (reg.stdres) vs fitted values (reg$fitted.values)]

plot(x1,reg.stdres)

[Figure: standardized residuals vs x1]

plot(x2,reg.stdres)

[Figure: standardized residuals vs x2]

The above plots show no pattern; hence the variance of the errors is taken to be constant.


• Uncorrelatedness of Errors
The plot of the residuals in time sequence shows whether there is autocorrelation among the errors.

plot(reg.stdres)

[Figure: standardized residuals vs observation index]
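Beyond the index plot, the Durbin-Watson statistic gives a numeric check of first-order autocorrelation: values near 2 suggest uncorrelated errors (roughly, DW is about 2(1 - r1), where r1 is the lag-1 residual autocorrelation). A sketch computing it directly, without extra packages, on the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
e   <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences
# of the residuals, divided by the residual sum of squares
DW <- sum(diff(e)^2) / sum(e^2)
DW
```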

6 Outlier Detection

ep1 = abs(reg.stdres)
plot(ep1)

[Figure: absolute standardized residuals (ep1) vs observation index]

out = which(ep1>2)
out

## 9
## 9
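An alternative screen uses externally studentised residuals, rstudent(), with a Bonferroni-adjusted t cutoff in place of the fixed |r_i| > 2 rule used above; the cutoff rule is our addition, not part of the original analysis. A sketch on the built-in mtcars data:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
ti  <- rstudent(fit)                 # externally studentised residuals
n   <- nrow(mtcars)
p   <- length(coef(fit))

# Bonferroni cutoff at overall level 0.05 across all n residuals
cutoff <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
which(abs(ti) > cutoff)              # indices of flagged observations
```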

7 Constructing the design matrix and hat matrix

E = rep(1,n)
X = cbind(E,x1,x2)
dim(X)

## [1] 25 3

H = X%*%solve(t(X)%*%X)%*%t(X)
dim(H)

## [1] 25 25

H1 = H%*%H
H1[1,1]

## [1] 0.1018018

H[1,1]

## [1] 0.1018018
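R computes the hat diagonals directly via hatvalues(), and model.matrix() returns the design matrix with the intercept column included, so the manual construction above can be cross-checked. A sketch on the built-in mtcars data, also verifying that H is idempotent (H %*% H = H), which is why H1[1,1] equalled H[1,1] above:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
X   <- model.matrix(fit)                  # design matrix, intercept included
H   <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix

all.equal(diag(H), hatvalues(fit), check.attributes = FALSE)  # TRUE
all.equal(H %*% H, H)                                         # TRUE: idempotent
```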

8 Different Types of Residuals and Their Usage


• Standardised Residuals

SS = anova(reg)$"Sum Sq"
MSS = anova(reg)$"Mean Sq"
SSRes = SS[3]
MSRes = MSS[3]
res = reg$residuals
rstd = res/sqrt(MSRes)

• Studentised Residuals

studres = res/sqrt(MSRes*(1-diag(H)))

which are the same as the residuals obtained earlier in reg.stdres via rstandard().


• PRESS (Prediction Error Sum of Squares) Residuals and PRESS Statistic

PR = res/(1-diag(H))   # PRESS residual: e_i/(1 - h_ii)
PRESS = sum(PR^2)
PRESS

## [1] 457.4024

The PRESS residual e_i /(1 − h_ii ) is the error in predicting observation i from a model fitted without it; large values of PRESS therefore indicate that the model predicts poorly.
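The leave-one-out interpretation of the PRESS residual can be verified by brute force: refit the model without each observation in turn and compare the resulting prediction errors with e_i/(1 - h_ii). A sketch on the built-in mtcars data as a stand-in:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
e   <- residuals(fit)
h   <- hatvalues(fit)
press_res <- e / (1 - h)                 # PRESS residuals from one fit

# Leave-one-out prediction errors, one refit per observation
loo <- sapply(seq_len(nrow(mtcars)), function(i) {
  f_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(f_i, newdata = mtcars[i, ])
})

all.equal(unname(press_res), unname(loo))  # TRUE: same quantities
sum(press_res^2)                           # the PRESS statistic
```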

9 Testing of Linear Hypotheses H0 : β1 = β2

New Model:
y = β0 + β1 (x1 + x2 ) + ε

xnew = x1 + x2
regnew = lm(y~xnew)
summary(regnew)

##
## Call:
## lm(formula = y ~ xnew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.015 -4.735 -0.529 5.919 12.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.802052 2.294227 2.093 0.0476 *
## xnew 0.042058 0.004337 9.698 1.36e-09 ***

## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.03 on 23 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.795
## F-statistic: 94.05 on 1 and 23 DF, p-value: 1.359e-09

anova(regnew)

## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## xnew 1 4647.9 4647.9 94.052 1.359e-09 ***
## Residuals 23 1136.6 49.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SSnew = anova(regnew)$"Sum Sq"


SSResnew = SSnew[2]
SSH0 = SSResnew - SSRes
Ftest = (SSH0/1)/MSRes
Ftest

## [1] 84.9849

qf(0.95,1,n-3)

## [1] 4.30095

Since Ftest = 84.98 exceeds the critical value 4.30, the hypothesis H0 : β1 = β2 is rejected.
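When the restricted model keeps the same response, as here, this extra-sum-of-squares F test can also be obtained by passing both fitted models to anova(), e.g. anova(regnew, reg). A sketch of the analogous comparison on the built-in mtcars data, testing equality of the two slopes:

```r
full  <- lm(mpg ~ wt + hp, data = mtcars)
restr <- lm(mpg ~ I(wt + hp), data = mtcars)  # restricted fit: equal slopes

a <- anova(restr, full)   # F test of the restriction via model comparison
a
```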

H0 : 0.71β1 − 100β2 = 0, i.e., β2 = (0.71/100)β1 . Under H0 the model becomes y = β0 + β1 (x1 + (0.71/100)x2 ) + ε.

xnew = x1 + 0.71/100*x2
regnew = lm(y~xnew)
summary(regnew)

##
## Call:
## lm(formula = y ~ xnew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7724 -1.1454 0.3474 1.4496 7.7296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.41534 1.07573 2.245 0.0347 *
## xnew 1.71171 0.07391 23.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##
## Residual standard error: 3.216 on 23 degrees of freedom
## Multiple R-squared: 0.9589, Adjusted R-squared: 0.9571
## F-statistic: 536.3 on 1 and 23 DF, p-value: < 2.2e-16

anova(regnew)

## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## xnew 1 5546.7 5546.7 536.31 < 2.2e-16 ***
## Residuals 23 237.9 10.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SSnew = anova(regnew)$"Sum Sq"


SSResnew = SSnew[2]
SSH0 = SSResnew - SSRes
Ftest = (SSH0/1)/MSRes
Ftest

## [1] 0.3899549

qf(0.95,1,n-3)

## [1] 4.30095

Since Ftest = 0.39 is below the critical value 4.30, H0 is not rejected.

Alternative way to test the hypothesis

C = solve(t(X)%*%X)
betahat = reg$coefficients
num = 0.71*betahat[2]-100*betahat[3]
SE = MSRes*(0.71^2*C[2,2]+100^2*C[3,3]-2*0.71*100*C[2,3])
t0 = num/sqrt(SE)
t0

## x1
## -0.6244637

t0^2

## x1
## 0.3899549

Since t0^2 = 0.39 is below the critical value F0.95 (1, 22) = 4.30, we do not reject H0.

H0 : β1 + β2 = 1.9. Under H0 we get

y = β0 + β1 x1 + (1.9 − β1 )x2 + ε, which can be simplified as
y − 1.9x2 = β0 + β1 (x1 − x2 ) + ε

xnew = x1 - x2
ynew = y - 1.9*x2
regnew = lm(ynew~xnew)
summary(regnew)

##
## Call:
## lm(formula = ynew ~ xnew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5159 -1.8556 0.7375 1.6021 6.5780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.879830 1.094361 1.718 0.0993 .
## xnew 1.890362 0.002153 878.137 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.37 on 23 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.711e+05 on 1 and 23 DF, p-value: < 2.2e-16

anova(regnew)

## Analysis of Variance Table
##
## Response: ynew
## Df Sum Sq Mean Sq F value Pr(>F)
## xnew 1 8756922 8756922 771125 < 2.2e-16 ***
## Residuals 23 261 11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SSnew = anova(regnew)$"Sum Sq"


SSResnew = SSnew[2]
SSH0 = SSResnew - SSRes
Ftest = (SSH0/1)/MSRes
Ftest

## [1] 2.584413

qf(0.95,1,n-3)

## [1] 4.30095

Since Ftest = 2.58 is below the critical value 4.30, H0 : β1 + β2 = 1.9 is not rejected.

Alternative approach

num = betahat[2]+betahat[3]-1.9
SE = MSRes*(C[2,2]+C[3,3]+2*C[2,3])
t0 = num/sqrt(SE)
t0

## x1
## -1.607611

t0^2

## x1
## 2.584413

Another method to compute the sum of squares due to H0 is via the general linear hypothesis form H0 : T β = k.

T = matrix(nrow=1,ncol=3)
T[1,] = c(0,1,1)
k = 1.9
Term1 = t(T%*%betahat - k)
Term2 = solve(T%*%C%*%t(T))
Term3 = T%*%betahat - k
F0 = (Term1%*%Term2%*%Term3/1)/MSRes
F0

## [,1]
## [1,] 2.584413
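The matrix computation above generalises to any linear hypothesis, so it is convenient to wrap it in a function; the name glh_F and its arguments are ours, not base R. A sketch on the built-in mtcars data, checked against the t test of a single coefficient (for which F = t^2):

```r
# F statistic for the general linear hypothesis H0: T beta = k
glh_F <- function(fit, T, k) {
  b     <- coef(fit)
  C     <- summary(fit)$cov.unscaled   # (X'X)^{-1}
  MSRes <- summary(fit)$sigma^2        # residual mean square
  r     <- nrow(T)                     # number of restrictions
  d     <- T %*% b - k
  drop(t(d) %*% solve(T %*% C %*% t(T)) %*% d / r) / MSRes
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
T1  <- matrix(c(0, 0, 1), nrow = 1)    # H0: coefficient of hp is 0
F0  <- glh_F(fit, T1, 0)
all.equal(F0, coef(summary(fit))["hp", "t value"]^2)  # TRUE
```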

H0 : β0 = 0; β1 + β2 = 3
New Model: y = β1 x1 + (3 − β1 )x2 + ε, which becomes
y − 3x2 = β1 (x1 − x2 ) + ε

ynew = y - 3*x2
xnew = x1 -x2
regnew = lm(ynew~0+xnew)
summary(regnew)

##
## Call:
## lm(formula = ynew ~ 0 + xnew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0552 -1.1957 0.8209 3.6191 12.3353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xnew 3.009725 0.002465 1221 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.265 on 24 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.491e+06 on 1 and 24 DF, p-value: < 2.2e-16

SSResnew = anova(regnew)$"Sum Sq"[2]
SSResnew

## [1] 941.8816

SSH0 = SSResnew - SSRes


Ftest = (SSH0/2)/MSRes
Ftest

## [1] 33.32731

qf(0.95,2,n-3)

## [1] 3.443357

Since Ftest = 33.33 exceeds the critical value 3.44, H0 is rejected.

Alternative method: H0 : T β = k

T = matrix(nrow=2,ncol=3)
T[1,] = c(1,0,0)
T[2,] = c(0,1,1)
k = c(0,3)
k = as.matrix(k)
dim(k)

## [1] 2 1

Term1 = t(T%*%betahat - k)
Term2 = solve(T%*%C%*%t(T))
Term3 = T%*%betahat - k
F0 = (Term1%*%Term2%*%Term3/2)/MSRes
F0

## [,1]
## [1,] 33.32731
