Linear Regression

10/6/2016
STAT 501: Regression Methods
1.2 What is the "Best Fitting Line"?
https://onlinecourses.science.psu.edu/stat501/node/252
Since we are interested in summarizing the trend between two quantitative variables, the natural question arises: "what is the best fitting line?" At some point in your education, you were probably shown a scatter plot of (x, y) data and were asked to draw the "most appropriate" line through the data. Even if you weren't, you can try it now on a set of heights (x) and weights (y) of 10 students, (student_height_weight.txt) (/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/student_height_weight.txt). Looking at the plot below, which line, the solid line or the dashed line, do you think best summarizes the trend between height and weight?
Hold on to your answer! In order to examine which of the two lines is a better fit, we first need to introduce some common notation:
$y_i$ denotes the observed response for experimental unit $i$
$x_i$ denotes the predictor value for experimental unit $i$
$\hat{y}_i$ is the predicted response (or fitted value) for experimental unit $i$
Then, the equation for the best fitting line is:
$$\hat{y}_i = b_0 + b_1 x_i$$
Incidentally, recall that an "experimental unit" is the object or person on which the measurement is made. In our height and weight example, the experimental units are students.
Let's try out the notation on our example with the trend summarized by the line $w = -266.53 + 6.1376h$. (Note that this line is just a more precise version of the above solid line, $w = -266.5 + 6.1h$.) The first data point in the list indicates that student 1 is 63 inches tall and weighs 127 pounds. That is, $x_1 = 63$ and $y_1 = 127$. Do you see this point on the plot? If we know this student's height but not his or her weight, we could use the equation of the line to predict his or her weight. We'd predict the student's weight to be -266.53 + 6.1376(63) or 120.1 pounds. That is, $\hat{y}_1 = 120.1$. Clearly, our prediction wouldn't be perfectly correct; it has some "prediction error" (or "residual error"). In fact, the size of its prediction error is 127 - 120.1 or 6.9 pounds.
To make sure you understand the notation used to keep track of the predictor values, the observed responses and the predicted responses, review the values for each of the 10 data points:
$i$    $x_i$    $y_i$    $\hat{y}_i$
1      63       127      120.1
2      64       121      126.3
3      66       142      138.5
4      69       157      157.0
5      69       162      157.0
6      71       156      169.2
7      71       169      169.2
8      72       165      175.4
9      73       181      181.5
10     75       208      193.8
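The fitted values in the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the rounded coefficients of the solid line quoted in the text; the names `heights`, `weights` and `predict_weight` are illustrative and not part of the original course material. Because the coefficients are rounded, a value may differ from the table in the last printed digit.

```python
# Heights (inches) and weights (pounds) of the 10 students from the table.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

# Intercept and slope of the solid line, rounded as quoted in the text.
B0, B1 = -266.53, 6.1376


def predict_weight(height):
    """Return the predicted weight (fitted value) for a given height."""
    return B0 + B1 * height


# One fitted value per student, in the same order as the table.
fitted = [predict_weight(h) for h in heights]

for i, (h, w, w_hat) in enumerate(zip(heights, weights, fitted), start=1):
    print(f"{i:>2}  x = {h}  y = {w}  y_hat = {w_hat:.1f}")
```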
As you can see, the size of the prediction error depends on the data point. If we didn't know the weight of student 5, the equation of the line would predict his or her weight to be -266.53 + 6.1376(69) or 157 pounds. The size of the prediction error here is 162 - 157, or 5 pounds.
In general, when we use $\hat{y}_i = b_0 + b_1 x_i$ to predict the actual response $y_i$, we make a prediction error (or residual error) of size:
$$e_i = y_i - \hat{y}_i$$
A line that fits the data "best" will be one for which the n prediction errors, one for each observed data point, are as small as possible in some overall sense. One way to achieve this goal is to invoke the "least squares criterion," which says to "minimize the sum of the squared prediction errors." That is:
The equation of the best fitting line is: $\hat{y}_i = b_0 + b_1 x_i$
We just need to find the values $b_0$ and $b_1$ that make the sum of the squared prediction errors the smallest it can be.
That is, we need to find the values $b_0$ and $b_1$ that minimize:
$$Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
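Here's how you might think about this quantity Q:
The quantity $e_i = y_i - \hat{y}_i$ is the prediction error for data point $i$.
The quantity $e_i^2 = (y_i - \hat{y}_i)^2$ is the squared prediction error for data point $i$.
And, the symbol $\sum_{i=1}^{n}$ tells us to add up the squared prediction errors for all n data points.
Incidentally, if we didn't square the prediction error $e_i = y_i - \hat{y}_i$ to get $e_i^2 = (y_i - \hat{y}_i)^2$, the positive and negative prediction errors would cancel each other out when summed. For the line that satisfies the least squares criterion, they always sum to exactly 0.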
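The cancellation described above can be checked numerically. The sketch below assumes the student data and the rounded solid-line coefficients quoted in the text; the variable names are illustrative. With rounded coefficients the plain sum of errors comes out close to 0 rather than exactly 0, while the sum of squared errors stays clearly positive.

```python
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

# Solid line from the text, with coefficients rounded as quoted there.
B0, B1 = -266.53, 6.1376

# Prediction error e_i = y_i - y_hat_i for each student.
errors = [w - (B0 + B1 * h) for h, w in zip(heights, weights)]

# Squared prediction error e_i^2 for each student.
squared_errors = [e ** 2 for e in errors]

sum_of_errors = sum(errors)  # positive and negative errors largely cancel
Q = sum(squared_errors)      # every term is non-negative, so nothing cancels

print(f"sum of errors         = {sum_of_errors:.4f}")
print(f"sum of squared errors = {Q:.1f}")
```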
Now, being familiar with the least squares criterion, let's take a fresh look at our plot again. In light of the least squares criterion, which line do you now think is the best fitting line?
Let's see how you did! The following two tables illustrate the implementation of the least squares criterion for the two lines up for consideration, the dashed line and the solid line.

$w = -331.2 + 7.1h$ (the dashed line)
$i$    $x_i$    $y_i$    $\hat{y}_i$    $(y_i - \hat{y}_i)$    $(y_i - \hat{y}_i)^2$
1      63       127      116.1          10.9                   118.81
2      64       121      123.2          -2.2                   4.84
3      66       142      137.4          4.6                    21.16
4      69       157      158.7          -1.7                   2.89
5      69       162      158.7          3.3                    10.89
6      71       156      172.9          -16.9                  285.61
7      71       169      172.9          -3.9                   15.21
8      72       165      180.0          -15.0                  225.00
9      73       181      187.1          -6.1                   37.21
10     75       208      201.3          6.7                    44.89
Sum                                                            766.5

$w = -266.53 + 6.1376h$ (the solid line)
$i$    $x_i$    $y_i$    $\hat{y}_i$    $(y_i - \hat{y}_i)$    $(y_i - \hat{y}_i)^2$
1      63       127      120.139        6.8612                 47.076
2      64       121      126.276        -5.2764                27.840
3      66       142      138.552        3.4484                 11.891
4      69       157      156.964        0.0356                 0.001
5      69       162      156.964        5.0356                 25.357
6      71       156      169.240        -13.2396               175.287
7      71       169      169.240        -0.2396                0.057
8      72       165      175.377        -10.3772               107.686
9      73       181      181.515        -0.5148                0.265
10     75       208      193.790        14.2100                201.924
Sum                                                            597.4
Based on the least squares criterion, which equation best summarizes the data? The sum of the squared prediction errors is 766.5 for the dashed line, while it is only 597.4 for the solid line. Therefore, of the two lines, the solid line, $w = -266.53 + 6.1376h$, best summarizes the data. But, is this equation guaranteed to be the best fitting line of all of the possible lines we didn't even consider? Of course not!
If we used the above approach for finding the equation of the line that minimizes the sum of the squared prediction errors, we'd have our work cut out for us. We'd have to implement the above procedure for an infinite number of possible lines; clearly, an impossible task! Fortunately, somebody has done some dirty work for us by figuring out formulas for the intercept $b_0$ and the slope $b_1$ for the equation of the line that minimizes the sum of the squared prediction errors.
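To get a feel for the trial-and-error approach described above, here is a minimal Python sketch that scores a handful of candidate lines by their sum of squared prediction errors. The two lines from the text are included; the grid of nearby lines and the function name `sum_squared_errors` are hypothetical choices made only for illustration. Any such list is finite, so it can only tell us which of the lines we tried is best, not which line is best overall.

```python
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]


def sum_squared_errors(b0, b1):
    """Q: sum of squared prediction errors for the line w = b0 + b1 * h."""
    return sum((w - (b0 + b1 * h)) ** 2 for h, w in zip(heights, weights))


# The dashed and solid lines from the text.
dashed = (-331.2, 7.1)
solid = (-266.53, 6.1376)

# A small, arbitrary grid of lines near the solid line (illustrative only).
candidates = [dashed, solid]
for d0 in (-5.0, 0.0, 5.0):
    for d1 in (-0.1, 0.0, 0.1):
        candidates.append((solid[0] + d0, solid[1] + d1))

# Pick the candidate with the smallest Q among the lines we tried.
best = min(candidates, key=lambda line: sum_squared_errors(*line))

print(f"dashed line Q = {sum_squared_errors(*dashed):.1f}")
print(f"solid line  Q = {sum_squared_errors(*solid):.1f}")
print(f"best candidate tried: w = {best[0]} + {best[1]}h")
```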
The formulas are determined using methods of calculus. We minimize the equation for the sum of the squared prediction errors:
$$Q = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2$$
(that is, take the derivative with respect to $b_0$ and $b_1$, set to 0, and solve for $b_0$ and $b_1$) and get the "least squares estimates" for $b_0$ and $b_1$:
$$b_0 = \bar{y} - b_1 \bar{x}$$
and:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
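For readers who want to see the calculus step, the derivation can be sketched as follows. The intermediate equations are not part of the original text; they are the standard steps, written in the same symbols. Setting both partial derivatives of $Q$ to 0 gives:

$$
\begin{aligned}
\frac{\partial Q}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  \quad\Rightarrow\quad b_0 = \bar{y} - b_1 \bar{x}, \\
\frac{\partial Q}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 .
\end{aligned}
$$

Substituting $b_0 = \bar{y} - b_1 \bar{x}$ into the second equation, and using the fact that $\sum_{i=1}^{n} \bar{x}\,\big[(y_i - \bar{y}) - b_1 (x_i - \bar{x})\big] = 0$ because deviations from a mean sum to 0, gives:

$$
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) - b_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0
\quad\Rightarrow\quad
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} .
$$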
Because the formulas for $b_0$ and $b_1$ are derived using the least squares criterion, the resulting equation $\hat{y}_i = b_0 + b_1 x_i$ is often referred to as the "least squares regression line," or simply the "least squares line." It is also sometimes called the "estimated regression equation." Incidentally, note that in deriving the above formulas, we made no assumptions about the data other than that they follow some sort of linear trend.
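The formulas for $b_0$ and $b_1$ can be applied directly to the height and weight data. The following is a minimal sketch in plain Python; the variable names are illustrative. Applied to the 10 students, it should reproduce the coefficients of the solid line up to rounding.

```python
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

n = len(heights)
x_bar = sum(heights) / n  # mean height
y_bar = sum(weights) / n  # mean weight

# Numerator of b1: sum of products of deviations from the means.
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))

# Denominator of b1: sum of squared deviations of x from its mean.
denominator = sum((x - x_bar) ** 2 for x in heights)

b1 = numerator / denominator  # least squares slope
b0 = y_bar - b1 * x_bar       # least squares intercept

print(f"b1 = {b1:.4f}")
print(f"b0 = {b0:.2f}")
```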
We can see from these formulas that the least squares line passes through the point $(\bar{x}, \bar{y})$, since when $x = \bar{x}$, then $\hat{y} = b_0 + b_1 \bar{x} = \bar{y} - b_1 \bar{x} + b_1 \bar{x} = \bar{y}$.
In practice, you won't really need to worry about the formulas for $b_0$ and $b_1$. Instead, you are going to let statistical software, such as Minitab, find least squares lines for you. But, we can still learn something from the formulas, for $b_1$ in particular.
If you study the formula for the slope $b_1$:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
you see that the denominator is necessarily positive since it only involves summing squared terms (provided the x values are not all identical). Therefore, the sign of the slope $b_1$ is solely determined by the numerator. The numerator tells us, for each data point, to sum up the product of two (signed) distances: the distance of the x-value from the mean of all of the x-values and the distance of the y-value from the mean of all of the y-values. Let's see how this determines the sign of the slope $b_1$ by studying the following two plots.
When is the slope $b_1 > 0$? Do you agree that the trend in the following plot is positive, that is, as x increases, y tends to increase? If the trend is positive, then the slope $b_1$ must be positive. Let's see how!
Consider a data point in the upper right quadrant (that is, to the right of the mean of x and above the mean of y). Note that the product of the two distances for this data point is positive. In fact, the product of the two distances is positive for any data point in the upper right quadrant.
Now, consider a data point in the lower left quadrant. Note that the product of the two distances for this data point is also positive. In fact, the product of the two distances is positive for any data point in the lower left quadrant.
Adding up all of these positive products must necessarily yield a positive number, and hence the slope of the line $b_1$ will be positive.
When is the slope $b_1 < 0$? Now, do you agree that the trend in the following plot is negative, that is, as x increases, y tends to decrease? If the trend is negative, then the slope $b_1$ must be negative. Let's see how!
Consider a data point in the upper left quadrant (that is, to the left of the mean of x and above the mean of y). Note that the product of the two distances for this data point is negative. In fact, the product of the two distances is negative for any data point in the upper left quadrant.
Now, consider a data point in the lower right quadrant. Note that the product of the two distances for this data point is also negative. In fact, the product of the two distances is negative for any data point in the lower right quadrant.
Adding up all of these negative products must necessarily yield a negative number, and hence the slope of the line $b_1$ will be negative.
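The quadrant argument can be tried out on the height and weight data from earlier in this section. The sketch below is an illustration only, with made-up variable names; it labels each student's point by its quadrant relative to the means and prints the product of the two distances. Real data rarely fall in only two quadrants, so a few products may have the opposite sign, and it is the sign of the total that decides the sign of the slope.

```python
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

x_bar = sum(heights) / len(heights)
y_bar = sum(weights) / len(weights)

# Product of the two signed distances for each data point.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(heights, weights)]

for x, y, p in zip(heights, weights, products):
    horizontal = "right" if x > x_bar else "left"
    vertical = "upper" if y > y_bar else "lower"
    print(f"({x}, {y}): {vertical} {horizontal} quadrant, product = {p:+.2f}")

# The numerator of b1 is the sum of these products; its sign is the sign of b1.
numerator = sum(products)
print(f"numerator = {numerator:.2f}")
```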
Now that we finished that investigation, you can just set aside the formulas for $b_0$ and $b_1$. Again, in practice, you are going to let statistical software, such as Minitab, find least squares lines for you. We can obtain the estimated regression equation in two different places in Minitab. The following plot illustrates where you can find the least squares line (in box) on Minitab's "fitted line plot."
The following Minitab output illustrates where you can find the least squares line (in box) in Minitab's "standard regression analysis" output.
Note that the estimated values $b_0$ and $b_1$ also appear in a table under the columns labeled "Predictor" (the intercept $b_0$ is always referred to as the "Constant" in Minitab) and "Coef" (for "Coefficients"). Also, note that the value we obtained by minimizing the sum of the squared prediction errors, 597.4, appears in the "Analysis of Variance" table appropriately in a row labeled "Residual Error" and under a column labeled "SS" (for "Sum of Squares").
Although we've learned how to obtain the "estimated regression coefficients" $b_0$ and $b_1$, we've not yet discussed what we learn from them. One thing they allow us to do is to predict future responses, one of the most common uses of an estimated regression line. This use is rather straightforward:
A common use of the estimated regression line: $\hat{y}_{i,wt} = -267 + 6.14 x_{i,ht}$
Predict (mean) weight of 66-inch tall people: $\hat{y}_{i,wt} = -267 + 6.14(66) = 138.24$
Predict (mean) weight of 67-inch tall people: $\hat{y}_{i,wt} = -267 + 6.14(67) = 144.38$
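The same two predictions can be written as a small Python function. This is a sketch that assumes the rounded coefficients -267 and 6.14 used in the box above; the function name `predict_mean_weight` is made up for illustration.

```python
# Rounded coefficients of the estimated regression line, as used in the text.
B0, B1 = -267, 6.14


def predict_mean_weight(height_in):
    """Predicted mean weight (pounds) for people of the given height (inches)."""
    return B0 + B1 * height_in


w66 = predict_mean_weight(66)
w67 = predict_mean_weight(67)

print(f"66-inch tall people: {w66:.2f} pounds")
print(f"67-inch tall people: {w67:.2f} pounds")
```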
Now, what does $b_0$ tell us? The answer is obvious when you evaluate the estimated regression equation at x = 0. Here, it tells us that a person who is 0 inches tall is predicted to weigh -267 pounds! Clearly, this prediction is nonsense. This happened because we "extrapolated" beyond the "scope of the model" (the range of the x values). It is not meaningful to have a height of 0 inches, that is, the scope of the model does not include x = 0. So, here the intercept $b_0$ is not meaningful. In general, if the "scope of the model" includes x = 0, then $b_0$ is the predicted mean response when x = 0. Otherwise, $b_0$ is not meaningful. There is more information on this here (http://blog.minitab.com/blog/adventuresinstatistics/regressionanalysishowtointerprettheconstantyintercept).
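One way to keep the scope of the model in view when predicting is to check the requested x value against the range of observed x values. The sketch below is only one possible design, with hypothetical names (`predict_within_scope`, `X_MIN`, `X_MAX`); it assumes the scope is the range of the 10 observed heights and refuses to extrapolate outside it.

```python
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]

# Rounded coefficients of the estimated regression line, as used in the text.
B0, B1 = -267, 6.14

# Assumed scope of the model: the range of observed heights (63 to 75 inches).
X_MIN, X_MAX = min(heights), max(heights)


def predict_within_scope(height_in):
    """Predict mean weight, raising an error instead of extrapolating."""
    if not X_MIN <= height_in <= X_MAX:
        raise ValueError(
            f"height {height_in} is outside the scope of the model "
            f"({X_MIN} to {X_MAX} inches)"
        )
    return B0 + B1 * height_in


print(f"{predict_within_scope(70):.2f}")  # inside the scope: a usable prediction

try:
    predict_within_scope(0)  # x = 0 is far outside the observed heights
except ValueError as err:
    print(err)
```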
And, what does $b_1$ tell us? The answer is obvious when you subtract the predicted weight of 66-inch tall people from the predicted weight of 67-inch tall people. We obtain 144.38 - 138.24 = 6.14 pounds, the value of $b_1$. Here, it tells us that we predict the mean weight to increase by 6.14 pounds for every additional one-inch increase in height. In general, we can expect the mean response to increase or decrease by $b_1$ units for every one-unit increase in x.
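The subtraction above works for any pair of x values one unit apart, not just 66 and 67. As a short worked equation, writing $\hat{y}(x)$ for the predicted mean response at x (a notation introduced here only for this illustration):

$$
\hat{y}(x + 1) - \hat{y}(x) = \big[b_0 + b_1 (x + 1)\big] - \big[b_0 + b_1 x\big] = b_1 .
$$

With the rounded coefficients from the example, this is $144.38 - 138.24 = 6.14$.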