Linear Regression Analysis
Gaurav Garg (IIM Lucknow)

Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R² and Adjusted R²
Overall Validity of the Model (F test)
Testing for individual regressors (t test)
Problem of Multicollinearity
Smoking and Lung Capacity
Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity.
We might ask a group of people about their smoking habits, and measure their lung capacities.

Cigarettes (X) | Lung Capacity (Y)
0  | 45
5  | 42
10 | 33
15 | 31
20 | 29
Scatterplot of the data
We can see that as smoking goes up, lung capacity tends to go down.
The two variables change their values in opposite directions.
[Scatterplot: Lung Capacity (Y) against Cigarettes (X), showing a downward trend]
Height and Weight
Consider the following data on the heights and weights of 5 women swimmers:
Height (inches): 62 64 65 66 68
Weight (pounds): 102 108 115 128 132
We can observe that weight increases with height.
[Scatterplot: Weight against Height, showing an upward trend]
Sometimes two variables are related to each other.
The values of both variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population.
For example:
Height and Weight
Advertising Expenditure and Sales Volume
Unemployment and Crime Rate
Rainfall and Food Production
Expenditure and Savings
We have already studied one measure of relationship between two variables: covariance.
Covariance between two random variables X and Y is given by
$$\sigma_{XY} = Cov(X, Y) = E(XY) - E(X)E(Y)$$
For paired observations on variables X and Y,
$$Cov(X, Y) = \sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
Properties of Covariance:
Cov(X+a, Y+b) = Cov(X, Y) [not affected by a change in location]
Cov(aX, bY) = ab Cov(X, Y) [affected by a change in scale]
Covariance can take any value from -∞ to +∞.
Cov(X, Y) > 0 means X and Y change in the same direction.
Cov(X, Y) < 0 means X and Y change in the opposite direction.
If X and Y are independent, Cov(X, Y) = 0 [the converse may not be true].
It is not unit free, so it is not a good measure of relationship between two variables.
A better measure is the correlation coefficient.
It is unit free and takes values in [-1, +1].
Correlation
Karl Pearson's correlation coefficient is given by
$$r_{XY} = Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)\,Var(Y)}}$$
When the joint distribution of X and Y is known:
$$Cov(X,Y) = E(XY) - E(X)E(Y), \quad Var(X) = E(X^2) - [E(X)]^2, \quad Var(Y) = E(Y^2) - [E(Y)]^2$$
When observations on X and Y are available:
$$Cov(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \quad Var(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2, \quad Var(Y) = \frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2$$
Properties of Correlation Coefficient
Corr(aX+b, cY+d) = Corr(X, Y) (for a, c of the same sign).
It is unit free.
It measures the strength of relationship on a scale of -1 to +1.
So it can be used to compare the relationships of various pairs of variables.
Values close to 0 indicate little or no correlation.
Values close to +1 indicate very strong positive correlation.
Values close to -1 indicate very strong negative correlation.
Scatter Diagram
[Panels: positively correlated, negatively correlated, weakly correlated, strongly correlated, not correlated]
The correlation coefficient measures the strength of the linear relationship.
r = 0 does not necessarily imply that there is no relationship; a relationship may exist, but not a linear one.
[Two scatterplots: clearly related data with nonlinear patterns, both giving r near 0]
x    | y   | x-x̄   | y-ȳ | (x-x̄)²       | (y-ȳ)²     | (x-x̄)(y-ȳ)
1.25 | 125 | -0.90 | 45  | 0.8100       | 2025       | -40.50
1.75 | 105 | -0.40 | 25  | 0.1600       | 625        | -10.00
2.25 | 65  | 0.10  | -15 | 0.0100       | 225        | -1.50
2.00 | 85  | -0.15 | 5   | 0.0225       | 25         | -0.75
2.50 | 75  | 0.35  | -5  | 0.1225       | 25         | -1.75
2.25 | 80  | 0.10  | 0   | 0.0100       | 0          | 0
2.70 | 50  | 0.55  | -30 | 0.3025       | 900        | -16.50
2.50 | 55  | 0.35  | -25 | 0.1225       | 625        | -8.75
Sum: 17.20 | 640 | 0 | 0 | 1.560 (SSX) | 4450 (SSY) | -79.75 (SSXY)

$$r = \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}} = \frac{SSXY}{\sqrt{SSX \cdot SSY}} = \frac{-79.75}{\sqrt{1.56 \times 4450}} = -0.957$$
Alternative Formulas for Sum of Squares

x    | y   | x²     | y²    | xy
1.25 | 125 | 1.5625 | 15625 | 156.25
1.75 | 105 | 3.0625 | 11025 | 183.75
2.25 | 65  | 5.0625 | 4225  | 146.25
2.00 | 85  | 4.0000 | 7225  | 170.00
2.50 | 75  | 6.2500 | 5625  | 187.50
2.25 | 80  | 5.0625 | 6400  | 180.00
2.70 | 50  | 7.2900 | 2500  | 135.00
2.50 | 55  | 6.2500 | 3025  | 137.50
Sum: 17.20 | 640 | 38.54 | 55650 | 1296.25

$$SSX = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad SSY = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad SSXY = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

This gives SSX = 1.56, SSY = 4450, SSXY = -79.75, and
$$r = \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}} = \frac{SSXY}{\sqrt{SSX \cdot SSY}} = \frac{-79.75}{\sqrt{1.56 \times 4450}} = -0.957$$
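As an illustrative sketch (added here, not part of the original slides), the following Python snippet reproduces SSX, SSY, SSXY and r for the eight observations above using the alternative formulas:

```python
# Minimal sketch: correlation via the alternative sum-of-squares formulas.
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)

ssx = sum(xi**2 for xi in x) - sum(x)**2 / n
ssy = sum(yi**2 for yi in y) - sum(y)**2 / n
ssxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

r = ssxy / (ssx * ssy) ** 0.5
print(ssx, ssy, ssxy, r)   # 1.56, 4450.0, -79.75, r ≈ -0.957
```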
Smoking and Lung Capacity Example

Cigarettes (X) | Lung Capacity (Y) | X²  | XY   | Y²
0  | 45  | 0   | 0    | 2025
5  | 42  | 25  | 210  | 1764
10 | 33  | 100 | 330  | 1089
15 | 31  | 225 | 465  | 961
20 | 29  | 400 | 580  | 841
Sum: 50 | 180 | 750 | 1585 | 6680

$$r_{xy} = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{(5)(1585) - (50)(180)}{\sqrt{\left[(5)(750) - 50^2\right]\left[(5)(6680) - 180^2\right]}} = \frac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}} = \frac{-1075}{\sqrt{1250 \times 1000}} = -0.9615$$
Regression Analysis
Having determined the correlation between X and Y, we wish to determine a mathematical relationship between them.
Dependent variable: the variable you wish to explain.
Independent variables: the variables used to explain the dependent variable.
Regression analysis is used to:
Predict the value of the dependent variable based on the value of the independent variable(s).
Explain the impact of changes in an independent variable on the dependent variable.
Types of Relationships
[Panels: linear relationships vs. curvilinear relationships]
Types of Relationships
[Panels: strong relationships vs. weak relationships]
Types of Relationships
[Panels: no relationship]
Simple Linear Regression Analysis
The simplest mathematical relationship is
Y = a + bX + error (linear)
Changes in Y are related to changes in X.
What are the most suitable values of a (intercept) and b (slope)?
[Plot: the line y = a + bx; a is the height where the line meets the Y axis, b is the rise per unit increase in x]
Method of Least Squares
[Plot: observed points (x_i, y_i) scattered around the line a + bX; the vertical distance between y_i and a + bx_i is the error for observation i]
The best fitted line is the one for which all the errors are minimum.
We want to fit a line for which all the errors are minimum.
That is, we want to obtain the values of a and b in Y = a + bX + error for which all the errors are minimum.
To minimize all the errors together, we minimize the sum of squares of errors (SSE):
$$SSE = \sum_{i=1}^{n}(Y_i - a - bX_i)^2$$
To get the values of a and b which minimize SSE, we proceed as follows:
$$\frac{\partial SSE}{\partial a} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n}(Y_i - a - bX_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n}Y_i = na + b\sum_{i=1}^{n}X_i \qquad (1)$$
$$\frac{\partial SSE}{\partial b} = 0 \;\Rightarrow\; -2\sum_{i=1}^{n}X_i(Y_i - a - bX_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n}X_iY_i = a\sum_{i=1}^{n}X_i + b\sum_{i=1}^{n}X_i^2 \qquad (2)$$
Equations (1) and (2) are called the normal equations.
Solving the normal equations gives a and b.
Solving the above normal equations, we get
$$b = \frac{n\sum_{i=1}^{n}X_iY_i - \left(\sum_{i=1}^{n}X_i\right)\left(\sum_{i=1}^{n}Y_i\right)}{n\sum_{i=1}^{n}X_i^2 - \left(\sum_{i=1}^{n}X_i\right)^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{SSXY}{SSX}$$
$$a = \bar{Y} - b\bar{X}$$
The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b.
Thus, the LSE of a and b are given by
$$b = \frac{SSXY}{SSX}, \qquad a = \bar{Y} - b\bar{X}$$
Also, the correlation coefficient between X and Y can be written as
$$r_{XY} = \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}} = \frac{SSXY}{\sqrt{SSX \cdot SSY}} = \frac{SSXY}{SSX}\sqrt{\frac{SSX}{SSY}} = b\sqrt{\frac{SSX}{SSY}}$$
For the example data, using the deviations table computed earlier:
X̄ = 2.15, Ȳ = 80, SSX = 1.560, SSY = 4450, SSXY = -79.75
$$r = \frac{SSXY}{\sqrt{SSX \cdot SSY}} = -0.957$$
$$b = \frac{SSXY}{SSX} = \frac{-79.75}{1.560} = -51.12, \qquad a = \bar{Y} - b\bar{X} = 80 - (-51.12)(2.15) = 189.91$$
Fitted line: Ŷ = 189.91 - 51.12 X
[Plot: the data points together with the fitted line Ŷ = 189.91 - 51.12X]
189.91 is the estimated mean value of Y when the value of X is zero.
-51.12 is the change in the average value of Y as a result of a one-unit change in X.
We can predict the value of Y for a given value of X.
For example, at X = 2.15 the predicted value of Y is 189.91 - 51.12 × 2.15 = 80.002.
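A minimal Python sketch of this fit (illustrative, assuming the same eight observations) that recovers a, b and the prediction at X = 2.15:

```python
# Minimal sketch: least squares fit of Y = a + bX for the example data.
x = [1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50]
y = [125, 105, 65, 85, 75, 80, 50, 55]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ssx = sum((xi - x_bar) ** 2 for xi in x)
ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = ssxy / ssx              # slope:     ≈ -51.12
a = y_bar - b * x_bar       # intercept: ≈ 189.91
print(a + b * 2.15)         # predicted Y at X = 2.15: ≈ 80.0
```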
Residuals: e_i = Y_i - Ŷ_i
The residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of the residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model to the given data.
Coefficient of Determination
[Plot: for each point, the deviation (Y - Ȳ) splits into (Ŷ - Ȳ), explained by the line, and (Y - Ŷ), unexplained]
$$\text{Total Sum of Squares: } SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
$$\text{Regression Sum of Squares: } SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$$
$$\text{Error Sum of Squares: } SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$
Also, SST = SSR + SSE.
The fraction of SST explained by regression is given by R²:
R² = SSR/SST = 1 - (SSE/SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² will be close to 1; regression explains most of the variability in Y (the fit is good).
When SSE is close to SST, R² will be close to 0; regression does not explain much variability in Y (the fit is not good).
R² is the square of the correlation coefficient between X and Y (proof omitted).
r = +1 or r = -1, R² = 1: perfect linear relationship; 100% of the variation in Y is explained by X.
0 < R² < 1: weaker linear relationship; some but not all of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.
X    | Y   | Ŷ     | Y-Ȳ | Y-Ŷ  | Ŷ-Ȳ   | (Y-Ȳ)² | (Y-Ŷ)² | (Ŷ-Ȳ)²
1.25 | 125 | 126.0 | 45  | -1.0 | 46.0  | 2025   | 1.00   | 2116.00
1.75 | 105 | 100.5 | 25  | 4.5  | 20.5  | 625    | 20.25  | 420.25
2.25 | 65  | 74.9  | -15 | -9.9 | -5.1  | 225    | 98.01  | 26.01
2.00 | 85  | 87.7  | 5   | -2.7 | 7.7   | 25     | 7.29   | 59.29
2.50 | 75  | 62.1  | -5  | 12.9 | -17.9 | 25     | 166.41 | 320.41
2.25 | 80  | 74.9  | 0   | 5.1  | -5.1  | 0      | 26.01  | 26.01
2.70 | 50  | 51.9  | -30 | -1.9 | -28.1 | 900    | 3.61   | 789.61
2.50 | 55  | 62.1  | -25 | -7.1 | -17.9 | 625    | 50.41  | 320.41
Sum: 17.20 | 640 | | | | | 4450 (SST) | 373.0 (SSE) | 4078.0 (SSR)

Coefficient of Determination: R² = (4450 - 373.0)/4450 ≈ 0.916
Correlation coefficient: r = -0.957
Coefficient of Determination = (Correlation Coefficient)²
Example:
Watching television also reduces the amount of physical exercise, causing weight gains.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded. These data are listed here:

TV:         42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight: 18  6  0  1 13 14  7  7  9  8  8  5  3 14  7

Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.
Fitted line: Ŷ = -24.709 + 0.967X, with R² = 0.768.
[Line chart: observed Y and predicted Y for the 15 children]
Standard Error
Consider a data set. All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}}$$
Assumptions
The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (homoscedasticity): Var(e_i) = σ², where e_i = Y_i - Ŷ_i and E(e_i) = 0.
No distributional assumption about the errors is required for the least squares method.
Linearity
[Residual plots against X: a linear relationship gives patternless residuals; a nonlinear relationship gives a systematic pattern in the residuals]
Independence
[Residual plots against X: independent errors show no pattern; dependent errors show a systematic pattern]
Equal Variance
[Residual plots against X: equal variance (homoscedastic) shows constant spread; unequal variance (heteroscedastic) shows spread changing with X]
TV Watching and Weight Gain Example
[Scatter plot of X and Y; scatter plot of X and residuals]
The Multiple Linear Regression Model
In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
We assume that Y is regressed on only one regressor variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X₁, X₂, X₃, ...).
For example:
Cost: labor cost, electricity cost, raw material cost
Salary: education, experience
Sales: cost, advertising expenditure
Example:
A distributor of frozen dessert pies wants to evaluate factors which influence the demand.
Dependent variable:
Y: pie sales (units per week)
Independent variables:
X₁: price (in $)
X₂: advertising expenditure ($100s)
Data are collected for 15 weeks.
Week | Pie Sales | Price ($) | Advertising ($100s)
1  | 350 | 5.50 | 3.3
2  | 460 | 7.50 | 3.3
3  | 350 | 8.00 | 3.0
4  | 430 | 8.00 | 4.5
5  | 350 | 6.80 | 3.0
6  | 380 | 7.50 | 4.0
7  | 430 | 4.50 | 3.0
8  | 470 | 6.40 | 3.7
9  | 450 | 7.00 | 3.5
10 | 490 | 5.00 | 4.0
11 | 340 | 7.20 | 3.5
12 | 300 | 7.90 | 3.2
13 | 440 | 5.90 | 4.0
14 | 450 | 5.00 | 3.5
15 | 300 | 7.00 | 2.7
Using the given data, we wish to fit a linear function of the form
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i, \qquad i = 1, 2, \ldots, 15$$
where
Y: pie sales (units per week)
X₁: price (in $)
X₂: advertising expenditure ($100s)
Fitting means we want to get the values of the regression coefficients, denoted by the βs.
The original values of the βs are not known; we estimate them using the given data.
The Multiple Linear Regression Model
We examine the linear relationship between one dependent variable (Y) and two or more independent variables (X₁, X₂, ..., X_k).
Multiple linear regression model with k independent variables:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \epsilon_i, \qquad i = 1, 2, \ldots, n$$
Here β₀ is the intercept, β₁, ..., β_k are the slopes, and ε_i is the random error.
Multiple Linear Regression Equation
The intercept and slopes are estimated using the observed data.
Multiple linear regression equation with k independent variables:
$$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}, \qquad i = 1, 2, \ldots, n$$
Ŷ_i is the estimated value, b₀ is the estimate of the intercept, and b₁, ..., b_k are the estimates of the slopes.
Multiple Regression Equation
Example with two independent variables:
$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$$
[Plot: the fitted regression plane over the (X₁, X₂) space]
Estimating Regression Coefficients
The multiple linear regression model is
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \epsilon_i, \qquad i = 1, 2, \ldots, n$$
In matrix notation,
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
or
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (homoscedasticity): Var(ε_i) = σ².
In the long run, the mean effect of the random errors is zero: E(ε_i) = 0.
No assumption on the distribution of the random errors is required for the least squares method.
In order to find the estimate of β, we minimize
$$S(\boldsymbol{\beta}) = \sum_{i=1}^{n}\epsilon_i^2 = \boldsymbol{\epsilon}'\boldsymbol{\epsilon} = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{Y}'\mathbf{Y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$$
We differentiate S(β) with respect to β and equate to zero, i.e.,
$$\frac{\partial S}{\partial \boldsymbol{\beta}} = 0$$
This gives
$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
b is called the least squares estimator of β.
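As an illustrative sketch of the matrix formula (the code is an addition, not the slides' own computation), NumPy can evaluate b = (X'X)⁻¹X'Y directly on the pie-sales data; the results reproduce the estimates quoted on the next slide up to rounding:

```python
import numpy as np

# Sketch: b = (X'X)^{-1} X'Y on the pie-sales data (price X1, advertising X2).
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
         7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
adv   = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
         3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]
sales = [350, 460, 350, 430, 350, 380, 430, 470,
         450, 490, 340, 300, 440, 450, 300]

X = np.column_stack([np.ones(15), price, adv])   # design matrix with intercept
Y = np.array(sales, dtype=float)

b = np.linalg.solve(X.T @ X, X.T @ Y)            # solves (X'X) b = X'Y
print(b)   # ≈ [306.53, -24.98, 74.13]

# Prediction for price $5.50 and advertising $350 (X2 = 3.5):
print(np.array([1.0, 5.50, 3.5]) @ b)            # ≈ 428.62
```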
Example: Consider the pie example. We want to fit the model
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$$
The variables are
Y: pie sales (units per week)
X₁: price (in $)
X₂: advertising expenditure ($100s)
Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:

LSE of intercept β₀: b₀ = 306.53
LSE of slope β₁: b₁ = -24.98
LSE of slope β₂: b₂ = 74.13

Pie Sales = 306.53 - 24.98 × Price + 74.13 × Adv. Expend.
Sales = 306.53 - 24.98 X₁ + 74.13 X₂
b₁ = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
b₂ = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while the selling price is kept fixed.
Prediction:
Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350:
Sales = 306.53 - 24.98 X₁ + 74.13 X₂ = 306.53 - 24.98(5.50) + 74.13(3.5) = 428.62
Predicted sales is 428.62 pies.
Note that advertising is in $100s, so X₂ = 3.5.
Ŷ = 306.52619 - 24.97509 X₁ + 74.13096 X₂

Y   | X₁  | X₂  | Predicted Y | Residual
350 | 5.5 | 3.3 | 413.77 | -63.80
460 | 7.5 | 3.3 | 363.81 | 96.15
350 | 8.0 | 3.0 | 329.08 | 20.88
430 | 8.0 | 4.5 | 440.28 | -10.31
350 | 6.8 | 3.0 | 359.06 | -9.09
380 | 7.5 | 4.0 | 415.70 | -35.74
430 | 4.5 | 3.0 | 416.51 | 13.47
470 | 6.4 | 3.7 | 420.94 | 49.03
450 | 7.0 | 3.5 | 391.13 | 58.84
490 | 5.0 | 4.0 | 478.15 | 11.83
340 | 7.2 | 3.5 | 386.13 | -46.16
300 | 7.9 | 3.2 | 346.40 | -46.44
440 | 5.9 | 4.0 | 455.67 | -15.70
450 | 5.0 | 3.5 | 441.09 | 8.89
300 | 7.0 | 2.7 | 331.82 | -31.85
[Line chart: observed Y and predicted Y across the 15 weeks]
Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression:
R² = SSR/SST = 1 - (SSE/SST)
R² is the proportion of variation in Y explained by the regression.
$$\text{Total Sum of Squares: } SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
$$\text{Regression Sum of Squares: } SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$$
$$\text{Error Sum of Squares: } SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$
Also, SST = SSR + SSE.
Since SST = SSR + SSE and all three quantities are non-negative,
0 ≤ SSR ≤ SST, so 0 ≤ SSR/SST ≤ 1, i.e., 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good: the X variables do not contribute to explaining the variability in Y.
When R² is close to 1, the linear fit is good.
In the previously discussed example, R² = 0.5215.
If we consider Y and X₁ only, R² = 0.1965.
If we consider Y and X₂ only, R² = 0.3095.
Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase occurs regardless of the contribution of the newly added regressor.
So an adjusted value of R², called adjusted R², is defined as
$$Adj\,R^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$$
This adjusted R² will only increase if the additional variable contributes to explaining the variation in Y.
For our example, adjusted R² = 0.4417.
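An illustrative NumPy sketch computing R² and adjusted R² for the pie-sales fit (the data repeat the earlier example; variable names are assumptions):

```python
import numpy as np

# Sketch: R^2 and adjusted R^2 from the residuals of the pie-sales model.
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
         7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
adv   = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
         3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]
sales = [350.0, 460, 350, 430, 350, 380, 430, 470,
         450, 490, 340, 300, 440, 450, 300]

X = np.column_stack([np.ones(15), price, adv])
Y = np.asarray(sales)
b = np.linalg.solve(X.T @ X, X.T @ Y)                # least squares estimates

sse = float(np.sum((Y - X @ b) ** 2))                # error sum of squares
sst = float(np.sum((Y - Y.mean()) ** 2))             # total sum of squares
n, k = len(Y), 2
r2 = 1 - sse / sst                                   # ≈ 0.5215
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))   # ≈ 0.4417
print(r2, adj_r2)
```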
F Test for Overall Significance
We check if there is a linear relationship between the set of regressors (X₁, X₂, ..., X_k) and the response (Y).
We use the F test statistic to test
H₀: β₁ = β₂ = ... = β_k = 0 (no regressor is significant)
H₁: at least one β_i ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
Assumptions:
n > k, Var(ε_i) = σ², E(ε_i) = 0.
The ε_i are independent; this implies that Corr(ε_i, ε_j) = 0 for i ≠ j.
The ε_i have a Normal distribution: ε_i ~ N(0, σ²). [NEW ASSUMPTION]
The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE):
$$SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2, \qquad SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2, \qquad SSR = SST - SSE$$
where the e_i are called the residuals.
Analysis of Variance Table

Source            | df    | SS  | MS  | F_c
Regression        | k     | SSR | MSR | MSR/MSE
Residual or Error | n-k-1 | SSE | MSE |
Total             | n-1   | SST |     |

Test statistic: F_c = MSR/MSE ~ F(k, n-k-1)

For the previous example, we wish to test H₀: β₁ = β₂ = 0 against H₁: at least one β_i ≠ 0.

ANOVA table:
Source            | df | SS       | MS       | F      | F(2,12) at 5%
Regression        | 2  | 29460.03 | 14730.01 | 6.5386 | 3.89
Residual or Error | 12 | 27033.31 | 2252.78  |        |
Total             | 14 | 56493.33 |          |        |

Since F_c = 6.5386 > 3.89, H₀ is rejected at the 5% level of significance.
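An illustrative check of this ANOVA table (assuming SciPy for the F distribution; SSR and SSE are the values tabulated above):

```python
from scipy import stats

# Sketch: F = MSR/MSE compared against F(k, n-k-1) for the pie-sales example.
n, k = 15, 2
ssr, sse = 29460.03, 27033.31

msr = ssr / k                              # 14730.01
mse = sse / (n - k - 1)                    # 2252.78
f_c = msr / mse                            # ≈ 6.5386

f_crit = stats.f.ppf(0.95, k, n - k - 1)   # ≈ 3.89
p_value = stats.f.sf(f_c, k, n - k - 1)    # ≈ 0.012
print(f_c, f_crit, p_value)                # reject H0 since f_c > f_crit
```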
Individual Variables Tests of Hypothesis
We test if there is a linear relationship between a particular regressor X_j and Y.
Hypotheses:
H₀: β_j = 0 (no linear relationship)
H₁: β_j ≠ 0 (a linear relationship exists between X_j and Y)
We use a two-tailed t test.
If H₀: β_j = 0 is accepted, this indicates that the variable X_j can be deleted from the model.
Test statistic:
$$T_c = \frac{b_j}{\sqrt{\hat{\sigma}^2\,C_{jj}}}, \qquad \hat{\sigma}^2 = MSE$$
T_c follows Student's t with (n-k-1) degrees of freedom.
b_j is the least squares estimate of β_j.
C_{jj} is the (j, j)th element of the matrix (X'X)⁻¹.
(MSE is obtained in the ANOVA table.)
In our example, σ̂² = 2252.7755 and
$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} 5.7946 & -0.3312 & -1.0165 \\ -0.3312 & 0.0521 & -0.0038 \\ -1.0165 & -0.0038 & 0.2993 \end{pmatrix}$$
To test H₀: β₁ = 0 against H₁: β₁ ≠ 0: T_c = -2.3057.
To test H₀: β₂ = 0 against H₁: β₂ ≠ 0: T_c = 2.8548.
Two-tailed critical values of t at 12 d.f. are
3.0545 for the 1% level of significance,
2.6810 for the 2% level of significance,
2.1788 for the 5% level of significance.
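A short illustrative sketch reproducing the two t statistics from MSE and the diagonal elements C_jj given above:

```python
import math

# Sketch: t statistics T_c = b_j / sqrt(MSE * C_jj) for the pie-sales model.
mse = 2252.7755
c11, c22 = 0.0521, 0.2993          # diagonal elements of (X'X)^{-1} for b1, b2
b1, b2 = -24.975, 74.131           # slope estimates from earlier

t1 = b1 / math.sqrt(mse * c11)     # ≈ -2.31
t2 = b2 / math.sqrt(mse * c22)     # ≈ 2.85
print(t1, t2)   # compare with the 5% critical value 2.1788 at 12 d.f.
```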
Standard Error
Consider a data set. All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
In multiple regression it is given by
$$S_{YX} = \sqrt{\frac{SSE}{n-k-1}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-k-1}}$$
Assumption of Linearity
[Residual plots against Ŷ: a linear model gives patternless residuals; a nonlinear relationship gives a systematic pattern]
Assumption of Equal Variance
We assume that Var(ε_i) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷ_i against the residuals e_i = Y_i - Ŷ_i.
Residual Analysis for Equal Variance
[Residual plots against Ŷ: equal variance shows constant spread; unequal variance shows spread changing with Ŷ]
Assumption of Uncorrelated Residuals
The Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation.
It is given by
$$d = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$
The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values (d < 2) indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d greater than 3 or less than 1 are alarming.
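A minimal sketch (the helper function is an assumption, not from the slides) computing d for the pie-sales residuals tabulated earlier:

```python
import numpy as np

def durbin_watson(residuals):
    # d = sum of squared successive differences / residual sum of squares
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Residuals of the pie-sales fit, in week order (from the earlier table):
e = [-63.80, 96.15, 20.88, -10.31, -9.09, -35.74, 13.47, 49.03,
     58.84, 11.83, -46.16, -46.44, -15.70, 8.89, -31.85]
print(durbin_watson(e))   # values near 2 suggest no autocorrelation
```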
Residual Analysis for Independence (Uncorrelated Errors)
[Residual plots against Ŷ: independent errors show no pattern; dependent errors show a systematic pattern]
Assumption of Normality
When we use the F test or t test, we assume that ε₁, ε₂, ..., ε_n are normally distributed.
This assumption can be examined with a histogram of the residuals.
[Histograms of residuals: approximately normal vs. not normal]
Normality can also be examined using a Q-Q plot or normal probability plot.
[Q-Q plots: approximately normal vs. not normal]
Standardized Regression Coefficient
In a multiple linear regression, we may like to know which regressor contributes more.
We obtain standardized estimates of the regression coefficients.
For that, first we standardize the observations:
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
$$\bar{X}_1 = \frac{1}{n}\sum_{i=1}^{n} X_{1i}, \qquad s_{X_1}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_{1i} - \bar{X}_1)^2$$
$$\bar{X}_2 = \frac{1}{n}\sum_{i=1}^{n} X_{2i}, \qquad s_{X_2}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_{2i} - \bar{X}_2)^2$$
Standardize all Y, X₁ and X₂ values as follows:
$$\text{Standardized } Y_i = \frac{Y_i - \bar{Y}}{s_Y}, \qquad \text{Standardized } X_{1i} = \frac{X_{1i} - \bar{X}_1}{s_{X_1}}, \qquad \text{Standardized } X_{2i} = \frac{X_{2i} - \bar{X}_2}{s_{X_2}}$$
Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless (unit free) and can be compared.
Look for the regression coefficient having the highest magnitude; the corresponding regressor contributes the most.
For the standardized pie-sales data, the fitted equation is
Ŷ = 0 - 0.461 X₁ + 0.570 X₂
Since |-0.461| < 0.570, X₂ contributes the most.

Standardized data:
Week | Pie Sales | Price | Advertising
1  | -0.78 | -0.95 | -0.37
2  | 0.96  | 0.76  | -0.37
3  | -0.78 | 1.18  | -0.98
4  | 0.48  | 1.18  | 2.09
5  | -0.78 | 0.16  | -0.98
6  | -0.30 | 0.76  | 1.06
7  | 0.48  | -1.80 | -0.98
8  | 1.11  | -0.18 | 0.45
9  | 0.80  | 0.33  | 0.04
10 | 1.43  | -1.38 | 1.06
11 | -0.93 | 0.50  | 0.04
12 | -1.56 | 1.10  | -0.57
13 | 0.64  | -0.61 | 1.06
14 | 0.80  | -1.38 | 0.04
15 | -1.56 | 0.33  | -1.60
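Equivalently, a standardized slope can be obtained from the unstandardized slope as b_j × s_{X_j} / s_Y, without refitting. An illustrative sketch for the pie-sales data (the slopes are those estimated earlier):

```python
import numpy as np

# Sketch: standardized slopes via b_j * s_Xj / s_Y for the pie-sales model.
price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
         7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
adv   = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
         3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]
sales = [350.0, 460, 350, 430, 350, 380, 430, 470,
         450, 490, 340, 300, 440, 450, 300]

s = lambda v: np.std(v, ddof=1)        # sample standard deviation
b1, b2 = -24.975, 74.131               # unstandardized slopes from earlier

print(b1 * s(price) / s(sales))        # ≈ -0.461
print(b2 * s(adv) / s(sales))          # ≈ 0.570
```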
Note that:
Adjusted R² can be negative.
Adjusted R² is always less than or equal to R².
Inclusion of an intercept term is not necessary; it depends on the problem, and the analyst may decide on this.
Useful identities:
$$Adj\,R^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}, \qquad F_c = \frac{(n-k-1)\,R^2}{k\,(1-R^2)}$$
Example: The following data were collected on the sales, number of advertisements published, and advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

Sales ('0,000 Rs) | Ads (Nos.) | AdvEx ('000 Rs)
43.6 | 12 | 13.9
38.0 | 11 | 12
30.1 | 9  | 9.3
35.3 | 7  | 9.7
46.4 | 12 | 12.3
34.2 | 8  | 11.4
30.2 | 6  | 9.3
40.7 | 13 | 14.3
38.5 | 8  | 10.2
22.6 | 6  | 8.4
37.6 | 8  | 11.2
35.2 | 10 | 11.1
ANOVA(b)
Source | Sum of Squares | df | Mean Square | F | Sig.
Regression | 309.986 | 2 | 154.993 | 9.741 | .006(a)
Residual | 143.201 | 9 | 15.911
Total | 453.187 | 11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients(a)
Predictor | B | Std. Error | Beta | t | Sig.
(Constant) | 6.584 | 8.542 | | .771 | .461
No_Adv | .625 | 1.120 | .234 | .558 | .591
Ex_Adv | 2.139 | 1.470 | .611 | 1.455 | .180
a. Dependent Variable: Sales

F test: p-value = .006 < 0.05, so H₀ is rejected: all βs are not zero.
t tests: all p-values > 0.05, so no H₀ is rejected: β₀ = 0, β₁ = 0, β₂ = 0.
CONTRADICTION
Multicollinearity
We assume that the regressors are independent variables.
When we regress Y on the regressors X₁, X₂, ..., X_k, we assume that all regressors X₁, X₂, ..., X_k are statistically independent of each other.
All the regressors affect the values of Y.
One regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met; we then face the problem of multicollinearity.
The correlated variables contribute redundant information to the model.
Including two highly correlated independent variables can adversely affect the regression results.
It can lead to unstable coefficients.
Some indications of strong multicollinearity:
Coefficient signs may not match prior expectations.
A large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes insignificant when a new independent variable is added.
The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
A standard error is large while the corresponding regressor is still significant.
MSE is very high and/or R² is very small.
Examples in which this might happen:
Miles per gallon vs. horsepower and engine size
Income vs. age and experience
Sales vs. number of advertisements and advertising expenditure

Variance Inflationary Factor:
VIF_j is used to measure the multicollinearity generated by variable X_j. It is given by
$$VIF_j = \frac{1}{1 - R_j^2}$$
where R_j² is the coefficient of determination of a regression model that uses X_j as the dependent variable and all other X variables as the independent variables.
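An illustrative sketch of the VIF computation (the helper function and its interface are assumptions), applied to the sales-advertising example discussed above:

```python
import numpy as np

def vif(xj, others):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    # X_j on the remaining regressors (with an intercept).
    X = np.column_stack([np.ones(len(xj))] + list(others))
    y = np.asarray(xj, dtype=float)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2_j = 1 - sse / sst
    return 1.0 / (1.0 - r2_j)

# Sales-advertising example: each regressor against the other.
no_adv = [12, 11, 9, 7, 12, 8, 6, 13, 8, 6, 8, 10]
ex_adv = [13.9, 12, 9.3, 9.7, 12.3, 11.4, 9.3, 14.3, 10.2, 8.4, 11.2, 11.1]
print(vif(no_adv, [ex_adv]), vif(ex_adv, [no_adv]))   # both ≈ 5.02
```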
If VIF_j > 5, X_j is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X'X is singular.
The matrix X'X becomes singular when the columns of X have exact linear dependence, i.e., when some eigenvalue of X'X is zero.
Thus, a near-zero eigenvalue of X'X is also an indication of multicollinearity.
Methods of dealing with multicollinearity:
Collecting additional data
Variable elimination
Coefficients(a)
Predictor | B | Std. Error | Beta | t | Sig. | Tolerance | VIF
(Constant) | 6.584 | 8.542 | | .771 | .461 | |
No_Adv | .625 | 1.120 | .234 | .558 | .591 | .199 | 5.022
Ex_Adv | 2.139 | 1.470 | .611 | 1.455 | .180 | .199 | 5.022
a. Dependent Variable: Sales
Both VIF values are greater than 5 (Tolerance = 1/VIF).

Collinearity Diagnostics(a)
Dimension | Eigenvalue | Condition Index | Variance Proportions (Constant, No_Adv, Ex_Adv)
1 | 2.966 | 1.000 | .00, .00, .00
2 | .030 | 9.882 | .33, .17, .00
3 | .003 | 30.417 | .67, .83, 1.00
a. Dependent Variable: Sales
The largest condition index is large, and the smallest eigenvalue is negligible, indicating multicollinearity.
We may use the method of variable elimination.
In practice, if Corr(X₁, X₂) is more than 0.7 or less than -0.7, we eliminate one of them.
Techniques:
Stepwise (based on ANOVA)
Forward Inclusion (based on correlation)
Backward Elimination (based on correlation)
Stepwise Regression
Full model: Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅ + ε
Step 1: Run 5 simple linear regressions:
Y = β₀ + β₁X₁
Y = β₀ + β₂X₂
Y = β₀ + β₃X₃
Y = β₀ + β₄X₄   <== has lowest p-value (ANOVA) < 0.05
Y = β₀ + β₅X₅
Step 2: Run 4 two-variable linear regressions:
Y = β₀ + β₄X₄ + β₁X₁
Y = β₀ + β₄X₄ + β₂X₂
Y = β₀ + β₄X₄ + β₃X₃   <== has lowest p-value (ANOVA) < 0.05
Y = β₀ + β₄X₄ + β₅X₅
Step 3: Run 3 three-variable linear regressions:
Y = β₀ + β₃X₃ + β₄X₄ + β₁X₁
Y = β₀ + β₃X₃ + β₄X₄ + β₂X₂
Y = β₀ + β₃X₃ + β₄X₄ + β₅X₅
Suppose none of these models has a p-value < 0.05.
STOP.
The best model is the one with X₃ and X₄ only.
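An illustrative forward-stepwise sketch (the function names and structure are assumptions; the selection rule, lowest ANOVA p-value below 0.05, follows the slides):

```python
import numpy as np
from scipy import stats

def f_pvalue(y, cols):
    # Overall-F p-value for a model with an intercept and the given columns.
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y))] + list(cols))
    k = X.shape[1] - 1
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    f_c = ((sst - sse) / k) / (sse / (len(y) - k - 1))
    return stats.f.sf(f_c, k, len(y) - k - 1)

def stepwise(y, candidates, alpha=0.05):
    # At each step, add the candidate whose model has the lowest
    # overall-F p-value; stop when no addition achieves p < alpha.
    chosen = []
    while True:
        trials = {j: f_pvalue(y, [candidates[i] for i in chosen] + [candidates[j]])
                  for j in range(len(candidates)) if j not in chosen}
        if not trials:
            return chosen
        best = min(trials, key=trials.get)
        if trials[best] >= alpha:
            return chosen              # STOP
        chosen.append(best)

# Usage (hypothetical): stepwise(y, [x1, x2, x3, x4, x5])
# returns the indices of the selected regressors, e.g. [3, 2].
```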
Example (continued): We return to the sales and advertising data used earlier (sales, number of advertisements published, and advertising expenditure for 12 weeks) and fit each candidate model in turn.
Summary Output 1: Sales vs. No_Adv

Model Summary
R = .781(a) | R Square = .610 | Adjusted R Square = .571 | Std. Error of the Estimate = 4.20570
a. Predictors: (Constant), No_Adv

ANOVA(b)
Source | Sum of Squares | df | Mean Square | F | Sig.
Regression | 276.308 | 1 | 276.308 | 15.621 | .003(a)
Residual | 176.879 | 10 | 17.688
Total | 453.187 | 11
a. Predictors: (Constant), No_Adv
b. Dependent Variable: Sales

Coefficients(a)
Predictor | B | Std. Error | Beta | t | Sig.
(Constant) | 16.937 | 4.982 | | 3.400 | .007
No_Adv | 2.083 | .527 | .781 | 3.952 | .003
a. Dependent Variable: Sales
Summary Output 2: Sales vs. Ex_Adv

Model Summary
R = .820(a) | R Square = .673 | Adjusted R Square = .640 | Std. Error of the Estimate = 3.84900
a. Predictors: (Constant), Ex_Adv

ANOVA(b)
Source | Sum of Squares | df | Mean Square | F | Sig.
Regression | 305.039 | 1 | 305.039 | 20.590 | .001(a)
Residual | 148.148 | 10 | 14.815
Total | 453.187 | 11
a. Predictors: (Constant), Ex_Adv
b. Dependent Variable: Sales

Coefficients(a)
Predictor | B | Std. Error | Beta | t | Sig.
(Constant) | 4.173 | 7.109 | | .587 | .570
Ex_Adv | 2.872 | .633 | .820 | 4.538 | .001
a. Dependent Variable: Sales
Summary Output 3: Sales vs. No_Adv and Ex_Adv

Model Summary
R = .827(a) | R Square = .684 | Adjusted R Square = .614 | Std. Error of the Estimate = 3.98888
a. Predictors: (Constant), Ex_Adv, No_Adv

ANOVA(b)
Source | Sum of Squares | df | Mean Square | F | Sig.
Regression | 309.986 | 2 | 154.993 | 9.741 | .006(a)
Residual | 143.201 | 9 | 15.911
Total | 453.187 | 11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients(a)
Predictor | B | Std. Error | Beta | t | Sig.
(Constant) | 6.584 | 8.542 | | .771 | .461
No_Adv | .625 | 1.120 | .234 | .558 | .591
Ex_Adv | 2.139 | 1.470 | .611 | 1.455 | .180
a. Dependent Variable: Sales
Qualitative Independent Variables
Johnson Filtration, Inc., provides maintenance service for water filtration systems throughout southern Florida.
To estimate the service time and the service cost, the managers want to predict the repair time necessary for each maintenance request.
Repair time is believed to be related to two factors:
Number of months since the last maintenance service
Type of repair problem (mechanical or electrical)
Data for a sample of 10 service calls are given:

Service Call | Months Since Last Service | Type of Repair | Repair Time in Hours
1  | 2 | electrical | 2.9
2  | 6 | mechanical | 3.0
3  | 8 | electrical | 4.8
4  | 3 | mechanical | 1.8
5  | 2 | electrical | 2.9
6  | 7 | electrical | 4.9
7  | 9 | mechanical | 4.2
8  | 8 | mechanical | 4.8
9  | 4 | electrical | 4.4
10 | 6 | electrical | 4.5

Let Y denote the repair time and X₁ denote the number of months since the last maintenance service.
The regression model that uses X₁ only to regress Y is
Y = β₀ + β₁X₁ + ε
Using the least squares method, we fitted the model
$$\hat{Y} = 2.1473 + 0.3041\,X_1$$
with R² = 0.534.
At the 5% level of significance, we reject
H₀: β₀ = 0 (using the t test)
H₀: β₁ = 0 (using the t and F tests)
X₁ alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable:
$$X_2 = \begin{cases} 0, & \text{if type of repair is mechanical} \\ 1, & \text{if type of repair is electrical} \end{cases}$$
The regression model that uses X₁ and X₂ to regress Y is
Y = β₀ + β₁X₁ + β₂X₂ + ε
Is the new model improved?
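An illustrative sketch of encoding the dummy variable and fitting the extended model by least squares (the fitted numbers are not quoted from the slides):

```python
import numpy as np

# Sketch: dummy-variable model for the Johnson Filtration data;
# X2 = 1 for electrical repairs, 0 for mechanical.
months = [2, 6, 8, 3, 2, 7, 9, 8, 4, 6]
rtype  = ["electrical", "mechanical", "electrical", "mechanical",
          "electrical", "electrical", "mechanical", "mechanical",
          "electrical", "electrical"]
hours  = [2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5]

x2 = [1.0 if t == "electrical" else 0.0 for t in rtype]   # dummy variable
X = np.column_stack([np.ones(10), months, x2])
b, *_ = np.linalg.lstsq(X, np.asarray(hours), rcond=None)
print(b)   # b[2] is the estimated shift in mean repair time for electrical repairs
```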
Summary
Multiple linear regression model: Y = Xβ + ε
Least squares estimate of β: b = (X'X)⁻¹X'Y
R² and adjusted R²
Using ANOVA (F test), we examine whether all the βs are zero or not.
A t test is conducted for each regressor separately; using it, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality