1.2 What is the "Best Fitting Line"? | STAT 501
STAT 501
Regression Methods
1.2 What is the "Best Fitting Line"?
Since we are interested in summarizing the trend between two quantitative variables, the natural question arises: "what is the best fitting line?" At some point in your education, you were probably shown a scatter plot of (x, y) data and were asked to draw the "most appropriate" line through the data. Even if you weren't, you can try it now on a set of heights (x) and weights (y) of 10 students (student_height_weight.txt, /stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/student_height_weight.txt). Looking at the plot below, which line, the solid line or the dashed line, do you think best summarizes the trend between height and weight?

Hold on to your answer! In order to examine which of the two lines is a better fit, we first need to introduce some common notation:
$y_i$ denotes the observed response for experimental unit $i$
$x_i$ denotes the predictor value for experimental unit $i$
$\hat{y}_i$ is the predicted response (or fitted value) for experimental unit $i$
Then, the equation for the best fitting line is:

$$\hat{y}_i = b_0 + b_1 x_i$$

https://onlinecourses.science.psu.edu/stat501/node/252
10/6/2016
Incidentally, recall that an "experimental unit" is the object or person on which the measurement is made. In our height and weight example, the experimental units are students.
Let's try out the notation on our example with the trend summarized by the line w = -266.53 + 6.1376h. (Note that this line is just a more precise version of the above solid line, w = -266.5 + 6.1h.) The first data point in the list indicates that student 1 is 63 inches tall and weighs 127 pounds. That is, $x_1 = 63$ and $y_1 = 127$. Do you see this point on the plot? If we know this student's height but not his or her weight, we could use the equation of the line to predict his or her weight. We'd predict the student's weight to be -266.53 + 6.1376(63) or 120.1 pounds. That is, $\hat{y}_1 = 120.1$. Clearly, our prediction wouldn't be perfectly correct; it has some "prediction error" (or "residual error"). In fact, the size of its prediction error is 127 - 120.1 or 6.9 pounds.
You might want to go through each of the 10 data points to make sure you understand the notation used to keep track of the predictor values, the observed responses and the predicted responses:
 i    x_i    y_i    \hat{y}_i
 1    63     127    120.1
 2    64     121    126.3
 3    66     142    138.5
 4    69     157    157.0
 5    69     162    157.0
 6    71     156    169.2
 7    71     169    169.2
 8    72     165    175.4
 9    73     181    181.5
10    75     208    193.8
As you can see, the size of the prediction error depends on the data point. If we didn't know the weight of student 5, the equation of the line would predict his or her weight to be -266.53 + 6.1376(69) or 157 pounds. The size of the prediction error here is 162 - 157, or 5 pounds.
In general, when we use $\hat{y}_i = b_0 + b_1 x_i$ to predict the actual response $y_i$, we make a prediction error (or residual error) of size:

$$e_i = y_i - \hat{y}_i$$
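As a rough illustration of this notation, the following Python sketch recomputes the predicted responses and prediction errors for the 10 students using the solid line quoted in the text. The names `predict` and `residuals` are hypothetical helpers chosen for this example; they are not part of the course material or of any library.

```python
# Heights (inches) and weights (pounds) of the 10 students, as listed in the text.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

# Coefficients of the solid line w = -266.53 + 6.1376h quoted in the text.
B0, B1 = -266.53, 6.1376


def predict(x, b0=B0, b1=B1):
    """Predicted response y_hat = b0 + b1 * x (hypothetical helper)."""
    return b0 + b1 * x


def residuals(xs, ys, b0=B0, b1=B1):
    """Prediction errors e_i = y_i - y_hat_i, one per data point."""
    return [y - predict(x, b0, b1) for x, y in zip(xs, ys)]


errors = residuals(heights, weights)
for i, (x, y, e) in enumerate(zip(heights, weights, errors), start=1):
    print(f"student {i:2d}: x={x}, y={y}, y_hat={predict(x):.1f}, e={e:+.1f}")
```

Keeping the sign of each error, rather than only its size, shows which students the line under-predicts (positive error) and which it over-predicts (negative error).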
A line that fits the data "best" will be one for which the n prediction errors, one for each observed data point, are as small as possible in some overall sense. One way to achieve this goal is to invoke the "least squares criterion," which says to "minimize the sum of the squared prediction errors." That is:

The equation of the best fitting line is: $\hat{y}_i = b_0 + b_1 x_i$
We just need to find the values $b_0$ and $b_1$ that make the sum of the squared prediction errors the smallest it can be.
That is, we need to find the values $b_0$ and $b_1$ that minimize:

$$Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Here's how you might think about this quantity Q:

The quantity $e_i = y_i - \hat{y}_i$ is the prediction error for data point $i$.
The quantity $e_i^2 = (y_i - \hat{y}_i)^2$ is the squared prediction error for data point $i$.
And, the symbol $\sum_{i=1}^{n}$ tells us to add up the squared prediction errors for all $n$ data points.
Let's see how you did! The following two tables illustrate the implementation of the least squares criterion for the two lines up for consideration, the dashed line and the solid line.
w = -331.2 + 7.1h (the dashed line)

 i    x_i    y_i    \hat{y}_i    (y_i - \hat{y}_i)    (y_i - \hat{y}_i)^2
 1    63     127    116.1         10.9                118.81
 2    64     121    123.2         -2.2                  4.84
 3    66     142    137.4          4.6                 21.16
 4    69     157    158.7         -1.7                  2.89
 5    69     162    158.7          3.3                 10.89
 6    71     156    172.9        -16.9                285.61
 7    71     169    172.9         -3.9                 15.21
 8    72     165    180.0        -15.0                225.00
 9    73     181    187.1         -6.1                 37.21
10    75     208    201.3          6.7                 44.89
                                              Sum:    766.5

w = -266.53 + 6.1376h (the solid line)

 i    x_i    y_i    \hat{y}_i    (y_i - \hat{y}_i)    (y_i - \hat{y}_i)^2
 1    63     127    120.139        6.8612              47.076
 2    64     121    126.276       -5.2764              27.840
 3    66     142    138.552        3.4484              11.891
 4    69     157    156.964        0.0356               0.001
 5    69     162    156.964        5.0356              25.357
 6    71     156    169.240      -13.2396             175.287
 7    71     169    169.240       -0.2396               0.057
 8    72     165    175.377      -10.3772             107.686
 9    73     181    181.515       -0.5148               0.265
10    75     208    193.790       14.2100             201.924
                                              Sum:    597.4
Based on the least squares criterion, which equation best summarizes the data? The sum of the squared prediction errors is 766.5 for the dashed line, while it is only 597.4 for the solid line. Therefore, of the two lines, the solid line, w = -266.53 + 6.1376h, best summarizes the data. But, is this equation guaranteed to be the best fitting line of all of the possible lines we didn't even consider? Of course not!
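The comparison of the two candidate lines can be sketched in a few lines of Python. This is only an illustration of the least squares criterion, assuming the 10 data points and the two sets of coefficients quoted in the text; `sum_squared_errors` is a hypothetical helper name. Up to rounding, it should reproduce the two totals reported above.

```python
# Heights (inches) and weights (pounds) of the 10 students, as listed in the text.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]


def sum_squared_errors(b0, b1, xs, ys):
    """Least squares criterion Q = sum over i of (y_i - (b0 + b1 * x_i))^2."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Dashed line: w = -331.2 + 7.1h; solid line: w = -266.53 + 6.1376h.
q_dashed = sum_squared_errors(-331.2, 7.1, heights, weights)
q_solid = sum_squared_errors(-266.53, 6.1376, heights, weights)

print(f"dashed line: Q = {q_dashed:.1f}")
print(f"solid line:  Q = {q_solid:.1f}")
print("better fit:", "solid" if q_solid < q_dashed else "dashed")
```

Any other candidate pair (b0, b1) can be passed to the same function, which makes it clear why comparing lines one at a time never ends: there is always another line left to try.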
If we used the above approach for finding the equation of the line that minimizes the sum of the squared prediction errors, we'd have our work cut out for us. We'd have to implement the above procedure for an infinite number of possible lines, clearly an impossible task! Fortunately, somebody has done some dirty work for us by figuring out formulas for the intercept $b_0$ and the slope $b_1$ for the equation of the line that minimizes the sum of the squared prediction errors.
The formulas are determined using methods of calculus. We minimize the equation for the sum of the squared prediction errors:

$$Q = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2$$

and obtain the least squares estimates:

$$b_0 = \bar{y} - b_1 \bar{x}$$

and:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

We can see from these formulas that the least squares line passes through the point $(\bar{x}, \bar{y})$, since when $x = \bar{x}$, then $y = b_0 + b_1 \bar{x} = \bar{y} - b_1 \bar{x} + b_1 \bar{x} = \bar{y}$.
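To make the formulas concrete, here is a minimal Python sketch that applies them to the height and weight data. It assumes the 10 data points listed earlier; `least_squares_line` is a hypothetical helper written for this example, not a library function. The result should agree with the solid line w = -266.53 + 6.1376h to the precision quoted in the text.

```python
# Heights (inches) and weights (pounds) of the 10 students, as listed in the text.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]


def least_squares_line(xs, ys):
    """Return (b0, b1) from the closed-form least squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x from its mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares_line(heights, weights)
print(f"b0 = {b0:.2f}, b1 = {b1:.4f}")

# The fitted line passes through the point of means (x_bar, y_bar).
x_bar = sum(heights) / len(heights)
y_bar = sum(weights) / len(weights)
print(f"fitted value at x_bar: {b0 + b1 * x_bar:.1f}, y_bar: {y_bar:.1f}")
```

The guard on a zero denominator is a design choice for this sketch: the formulas in the text implicitly assume that the x values are not all equal.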
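If you study the formula for the slope $b_1$:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

you see that the denominator is necessarily positive since it only involves summing squared terms (it could be zero only if every x value were identical). Therefore, the sign of the slope $b_1$ is solely determined by the numerator. The numerator tells us, for each data point, to take the product of two distances, the distance of the x value from the mean of all of the x values and the distance of the y value from the mean of all of the y values, and then to sum these products over all of the data points. Let's see how this determines the sign of the slope $b_1$ by studying the following two plots.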
When is the slope $b_1 > 0$? Do you agree that the trend in the following plot is positive, that is, as x increases, y tends to increase? If the trend is positive, then the slope $b_1$ must be positive. Let's see how!

Consider a data point in the upper right quadrant, that is, a point whose x value is above the mean of the x values and whose y value is above the mean of the y values. Both distances are positive, so the product of the two distances for this data point is positive. In fact, the product of the two distances is positive for any data point in the upper right quadrant.

Now consider a data point in the lower left quadrant. Both distances are negative, so the product of the two distances for this data point is also positive. In fact, the product of the two distances is positive for any data point in the lower left quadrant.

Adding up all of these positive products must necessarily yield a positive number, and hence the slope of the line $b_1$ will be positive.
When is the slope $b_1 < 0$? Now, do you agree that the trend in the following plot is negative, that is, as x increases, y tends to decrease? If the trend is negative, then the slope $b_1$ must be negative. Let's see how!

Consider a data point in the upper left quadrant, that is, a point whose x value is below the mean of the x values and whose y value is above the mean of the y values. The two distances have opposite signs, so the product of the two distances for this data point is negative. In fact, the product of the two distances is negative for any data point in the upper left quadrant.

Now consider a data point in the lower right quadrant. Again the two distances have opposite signs, so the product of the two distances for this data point is also negative. In fact, the product of the two distances is negative for any data point in the lower right quadrant.

Adding up all of these negative products must necessarily yield a negative number, and hence the slope of the line $b_1$ will be negative.
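The quadrant argument can be checked numerically. The sketch below, which assumes the height and weight data listed earlier in this section, computes the product of the two signed distances for every student and then sums them. With real data the points rarely fall in only two quadrants, so a few products may have the opposite sign; what fixes the sign of the slope is the sign of the total.

```python
# Heights (inches) and weights (pounds) of the 10 students, as listed in the text.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(weights) / n

# Product of the two signed distances (x_i - x_bar) * (y_i - y_bar) per data point.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(heights, weights)]

# The numerator of the slope formula is the sum of these products.
numerator = sum(products)
n_positive = sum(1 for p in products if p > 0)
n_negative = sum(1 for p in products if p < 0)

print(f"positive products: {n_positive}, negative products: {n_negative}")
print(f"numerator = {numerator:.2f}")
print("slope sign:", "positive" if numerator > 0 else "negative or zero")
```

For these data most students lie in the upper right or lower left quadrant relative to the point of means, which is consistent with the positive slope of the fitted line.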
Now that we finished that investigation, you can just set aside the formulas for $b_0$ and $b_1$. Again, in practice, you are going to let statistical software, such as Minitab, find least squares lines for you. We can obtain the estimated regression equation in two different places in Minitab. The following plot illustrates where you can find the least squares line (in box) on Minitab's "fitted line plot."

The following Minitab output illustrates where you can find the least squares line (in box) in Minitab's "standard regression analysis" output.
[Figure: the estimated regression equation, annotated with $\hat{y}_{i,wt}$ at the predicted (mean) weight of 66-inch-tall people and at the predicted (mean) weight of 67-inch-tall people.]

$$\hat{y}_{i,wt} = -267 + 6.14 x_{i,ht}$$
Now, what does $b_0$ tell us? The answer is obvious when you evaluate the estimated regression equation at x = 0. Here, it tells us that a person who is 0 inches tall is predicted to weigh -267 pounds! Clearly, this prediction is nonsense. This happened because we "extrapolated" beyond the "scope of the model" (the range of the x values). It is not meaningful to have a height of 0 inches, that is, the scope of the model does not include x = 0. So, here the intercept $b_0$ is not meaningful. In general, if the "scope of the model" includes x = 0, then $b_0$ is the predicted mean response when x = 0. Otherwise, $b_0$ is not meaningful. There is more information on this here (http://blog.minitab.com/blog/adventuresinstatistics/regressionanalysishowtointerprettheconstantyintercept).
And, what does $b_1$ tell us? The answer is obvious when you subtract the predicted weight of 66-inch-tall people from the predicted weight of 67-inch-tall people. We obtain 144.38 - 138.24 = 6.14 pounds, the value of $b_1$. Here, it tells us that we predict the mean weight to increase by 6.14 pounds for every additional one-inch increase in height. In general, we can expect the mean response to increase or decrease by $b_1$ units for every one-unit increase in x.
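The interpretation of both coefficients can be reproduced with a short Python sketch. It assumes the rounded equation with intercept -267 and slope 6.14 discussed above, and takes the observed heights (63 to 75 inches) as the scope of the model; `predicted_mean_weight` and `in_scope` are hypothetical helper names introduced only for this illustration.

```python
# Rounded coefficients of the estimated regression equation discussed in the text.
B0, B1 = -267.0, 6.14

# Assumed scope of the model: the range of observed heights, in inches.
X_MIN, X_MAX = 63, 75


def predicted_mean_weight(height_in):
    """Predicted mean weight (pounds) for a given height (inches)."""
    return B0 + B1 * height_in


def in_scope(height_in):
    """True when the height lies inside the observed range of x values."""
    return X_MIN <= height_in <= X_MAX


w66 = predicted_mean_weight(66)
w67 = predicted_mean_weight(67)
print(f"66 inches: {w66:.2f} lb, 67 inches: {w67:.2f} lb")
print(f"difference for one extra inch: {w67 - w66:.2f} lb")

# Evaluating at x = 0 returns the intercept, but x = 0 is outside the scope.
print(f"x = 0: {predicted_mean_weight(0):.0f} lb, in scope: {in_scope(0)}")
```

The difference between any two predictions one inch apart equals the slope, whichever starting height is chosen, while the prediction at x = 0 is only as meaningful as the scope check allows.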