1.2 What is the "Best Fitting Line"?
Since we are interested in summarizing the trend between two quantitative variables, the natural question arises: "What is the best fitting line?" At some point in your education, you were probably shown a scatter plot of (x, y) data and were asked to draw the "most appropriate" line through the data. Even if you weren't, you can try it now on a set of heights (x) and weights (y) of 10 students (student_height_weight.txt) (/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/student_height_weight.txt). Looking at the plot below, which line, the solid line or the dashed line, do you think best summarizes the trend between height and weight?

Hold on to your answer! In order to examine which of the two lines is a better fit, we first need to introduce some common notation:

\(y_i\) denotes the observed response for experimental unit \(i\)
\(x_i\) denotes the predictor value for experimental unit \(i\)
\(\hat{y}_i\) is the predicted response (or fitted value) for experimental unit \(i\)

Then, the equation for the best fitting line is:

\[ \hat{y}_i = b_0 + b_1 x_i \]

Incidentally, recall that an "experimental unit" is the object or person on which the measurement is made. In our height and weight example, the experimental units are students.
Let's try out the notation on our example with the trend summarized by the line w = -266.53 + 6.1376h. (Note that this line is just a more precise version of the above solid line, w = -266.5 + 6.1h.) The first data point in the list indicates that student 1 is 63 inches tall and weighs 127 pounds. That is, x1 = 63 and y1 = 127. Do you see this point on the plot? If we know this student's height but not his or her weight, we could use the equation of the line to predict his or her weight. We'd predict the student's weight to be -266.53 + 6.1376(63), or 120.1 pounds. That is, \(\hat{y}_1 = 120.1\). Clearly, our prediction wouldn't be perfectly correct; it has some "prediction error" (or "residual error"). In fact, the size of its prediction error is 127 - 120.1, or 6.9 pounds.
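As a quick check, here is a minimal Python sketch (my own illustration, not part of the course materials; the variable names are hypothetical) that reproduces this calculation:

```python
# Sketch: fitted value and prediction error for student 1,
# using the solid line w = -266.53 + 6.1376 h.
b0, b1 = -266.53, 6.1376   # intercept and slope of the solid line
x1, y1 = 63, 127           # student 1: height (inches) and observed weight (pounds)

y1_hat = b0 + b1 * x1      # predicted weight, about 120.1 pounds
e1 = y1 - y1_hat           # prediction (residual) error, about 6.9 pounds

print(f"y1_hat = {y1_hat:.1f}, e1 = {e1:.1f}")
```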

The following table keeps track of the predictor values, the observed responses, and the predicted responses for each of the 10 data points, so you can make sure you understand the notation:

x_i    y_i    ŷ_i
63     127    120.1
64     121    126.3
66     142    138.5
69     157    157.0
69     162    157.0
71     156    169.2
71     169    169.2
72     165    175.4
73     181    181.5
75     208    193.8

As you can see, the size of the prediction error depends on the data point. If we didn't know the weight of student 4, the equation of the line would predict his or her weight to be -266.53 + 6.1376(69), or 157 pounds. The size of the prediction error here is 162 - 157, or 5 pounds.
In general, when we use \(\hat{y}_i = b_0 + b_1 x_i\) to predict the actual response \(y_i\), we make a prediction error (or residual error) of size:

\[ e_i = y_i - \hat{y}_i \]

A line that fits the data "best" will be one for which the n prediction errors (one for each observed data point) are as small as possible in some overall sense. One way to achieve this goal is to invoke the "least squares criterion," which says to "minimize the sum of the squared prediction errors." That is:

The equation of the best fitting line is \(\hat{y}_i = b_0 + b_1 x_i\).
We just need to find the values b0 and b1 that make the sum of the squared prediction errors the smallest it can be.
That is, we need to find the values b0 and b1 that minimize:

\[ Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Here's how you might think about this quantity Q:

The quantity \(e_i = y_i - \hat{y}_i\) is the prediction error for data point i.
The quantity \(e_i^2 = (y_i - \hat{y}_i)^2\) is the squared prediction error for data point i.
And, the symbol \(\sum_{i=1}^{n}\) tells us to add up the squared prediction errors for all n data points.

Incidentally, if we didn't square the prediction error \(e_i = y_i - \hat{y}_i\) to get \(e_i^2 = (y_i - \hat{y}_i)^2\), the positive and negative prediction errors would cancel each other out when summed, always yielding 0.
Now, being familiar with the least squares criterion, let's take a fresh look at our plot again. In light of the least squares criterion, which line do you now think is the best fitting line?

Let's see how you did! The following two tables illustrate the implementation of the least squares criterion for the two lines up for consideration: the dashed line and the solid line.

w = -331.2 + 7.1h (the dashed line)

x_i   y_i   ŷ_i      y_i - ŷ_i   (y_i - ŷ_i)^2
63    127   116.1     10.9        118.81
64    121   123.2     -2.2          4.84
66    142   137.4      4.6         21.16
69    157   158.7     -1.7          2.89
69    162   158.7      3.3         10.89
71    156   172.9    -16.9        285.61
71    169   172.9     -3.9         15.21
72    165   180.0    -15.0        225.00
73    181   187.1     -6.1         37.21
75    208   201.3      6.7         44.89
                                  ______
                                   766.5

w = -266.53 + 6.1376h (the solid line)

x_i   y_i   ŷ_i        y_i - ŷ_i   (y_i - ŷ_i)^2
63    127   120.139      6.8612      47.076
64    121   126.276     -5.2764      27.840
66    142   138.552      3.4484      11.891
69    157   156.964      0.0356       0.001
69    162   156.964      5.0356      25.357
71    156   169.240    -13.2396     175.287
71    169   169.240     -0.2396       0.057
72    165   175.377    -10.3772     107.686
73    181   181.515     -0.5148       0.265
75    208   193.790     14.2100     201.924
                                    ______
                                     597.4

Based on the least squares criterion, which equation best summarizes the data? The sum of the squared prediction errors is 766.5 for the dashed line, while it is only 597.4 for the solid line. Therefore, of the two lines, the solid line, w = -266.53 + 6.1376h, best summarizes the data. But, is this equation guaranteed to be the best fitting line of all of the possible lines we didn't even consider? Of course not!
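To make the comparison concrete, here is a small Python sketch (my own illustration, not from the course; the function name sse is hypothetical) that computes the sum of squared prediction errors for each candidate line from the ten data points above:

```python
# Sketch: sum of squared prediction errors Q for the two candidate lines.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

def sse(b0, b1, x, y):
    """Sum of squared prediction errors for the line yhat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(sse(-331.2, 7.1, heights, weights))      # dashed line: about 766.5
print(sse(-266.53, 6.1376, heights, weights))  # solid line: about 597.4
```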
If we used the above approach for finding the equation of the line that minimizes the sum of the squared prediction errors, we'd have our work cut out for us. We'd have to implement the above procedure for an infinite number of possible lines, clearly an impossible task! Fortunately, somebody has done some dirty work for us by figuring out formulas for the intercept b0 and the slope b1 for the equation of the line that minimizes the sum of the squared prediction errors.
The formulas are determined using methods of calculus. We minimize the equation for the sum of the squared prediction errors:

\[ Q = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2 \]

(that is, take the derivative with respect to b0 and b1, set to 0, and solve for b0 and b1) and get the "least squares estimates" for b0 and b1:
\[ b_0 = \bar{y} - b_1 \bar{x} \]

and:

\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
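As an illustration, here is a minimal Python sketch (my own, not from the course; names like heights and weights are hypothetical) that applies these two formulas to the height and weight data and recovers approximately the solid line's slope 6.1376 and intercept -266.53:

```python
# Sketch: least squares estimates b0 and b1 computed from the formulas above.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

x_bar = sum(heights) / len(heights)   # mean of the x values (69.3)
y_bar = sum(weights) / len(weights)   # mean of the y values (158.8)

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
denominator = sum((x - x_bar) ** 2 for x in heights)
b1 = numerator / denominator          # about 6.1376
b0 = y_bar - b1 * x_bar               # about -266.53

print(b0, b1)
```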

Because the formulas for b0 and b1 are derived using the least squares criterion, the resulting equation, \(\hat{y}_i = b_0 + b_1 x_i\), is often referred to as the "least squares regression line," or simply the "least squares line." It is also sometimes called the "estimated regression equation." Incidentally, note that in deriving the above formulas, we made no assumptions about the data other than that they follow some sort of linear trend.

We can see from these formulas that the least squares line passes through the point \((\bar{x}, \bar{y})\), since when \(x = \bar{x}\), then \(\hat{y} = b_0 + b_1\bar{x} = \bar{y} - b_1\bar{x} + b_1\bar{x} = \bar{y}\).
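As a quick numerical check (my own, using the ten data points above, for which \(\bar{x} = 69.3\) and \(\bar{y} = 158.8\)), the solid line does pass through the point of the means:

\[ -266.53 + 6.1376(69.3) \approx 158.8 \]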

In practice, you won't really need to worry about the formulas for b0 and b1. Instead, you are going to let statistical software, such as Minitab, find least squares lines for you. But, we can still learn something from the formulas, for b1 in particular.
If you study the formula for the slope b1:

\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

you see that the denominator is necessarily positive, since it only involves summing positive terms. Therefore, the sign of the slope b1 is solely determined by the numerator. The numerator tells us, for each data point, to sum up the product of two distances: the distance of the x value from the mean of all of the x values, and the distance of the y value from the mean of all of the y values. Let's see how this determines the sign of the slope b1 by studying the following two plots.
When is the slope b1 > 0? Do you agree that the trend in the following plot is positive, that is, as x increases, y tends to increase? If the trend is positive, then the slope b1 must be positive. Let's see how!

Consider a data point in the upper right quadrant of the plot. The product of the two distances for this data point is positive. In fact, the product of the two distances is positive for any data point in the upper right quadrant.
Now consider a data point in the lower left quadrant. The product of the two distances for this data point is also positive. In fact, the product of the two distances is positive for any data point in the lower left quadrant.

Adding up all of these positive products must necessarily yield a positive number, and hence the slope of the line b1 will be positive.
When is the slope b1 < 0? Now, do you agree that the trend in the following plot is negative, that is, as x increases, y tends to decrease? If the trend is negative, then the slope b1 must be negative. Let's see how!

Consider a data point in the upper left quadrant. The product of the two distances for this data point is negative. In fact, the product of the two distances is negative for any data point in the upper left quadrant.
Now consider a data point in the lower right quadrant. The product of the two distances for this data point is also negative. In fact, the product of the two distances is negative for any data point in the lower right quadrant.

Adding up all of these negative products must necessarily yield a negative number, and hence the slope of the line b1 will be negative.
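To see the positive case on the height and weight data, here is a small Python sketch (my own illustration, not from the course) that computes each cross product \((x_i - \bar{x})(y_i - \bar{y})\); most of the products are positive, their sum is positive, and so the slope b1 is positive:

```python
# Sketch: the cross products that make up the numerator of b1.
heights = [63, 64, 66, 69, 69, 71, 71, 72, 73, 75]
weights = [127, 121, 142, 157, 162, 156, 169, 165, 181, 208]

x_bar = sum(heights) / len(heights)   # 69.3
y_bar = sum(weights) / len(weights)   # 158.8

products = [(x - x_bar) * (y - y_bar) for x, y in zip(heights, weights)]
print(products)       # mostly positive: the points fall mainly in the upper right and lower left quadrants
print(sum(products))  # positive sum (about 847.6), so b1 > 0
```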
Now that we've finished that investigation, you can just set aside the formulas for b0 and b1. Again, in practice, you are going to let statistical software, such as Minitab, find least squares lines for you. We can obtain the estimated regression equation in two different places in Minitab. The following plot illustrates where you can find the least squares line (boxed) on Minitab's "fitted line plot."

The following Minitab output illustrates where you can find the least squares line (boxed) in Minitab's "standard regression analysis" output.

Note that the estimated values b0 and b1 also appear in a table under the columns labeled "Predictor" (the intercept b0 is always referred to as the "Constant" in Minitab) and "Coef" (for "Coefficients"). Also, note that the value we obtained by minimizing the sum of the squared prediction errors, 597.4, appears in the "Analysis of Variance" table, appropriately in a row labeled "Residual Error" and under a column labeled "SS" (for "Sum of Squares").
Although we've learned how to obtain the "estimated regression coefficients" b0 and b1, we've not yet discussed what we learn from them. One thing they allow us to do is to predict future responses, one of the most common uses of an estimated regression line. This use is rather straightforward:

A common use of the estimated regression line:
\[ \hat{y}_{i,wt} = -267 + 6.14\, x_{i,ht} \]

Predict (mean) weight of 66-inch tall people:
\[ \hat{y}_{wt} = -267 + 6.14(66) = 138.24 \]

Predict (mean) weight of 67-inch tall people:
\[ \hat{y}_{wt} = -267 + 6.14(67) = 144.38 \]
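Here is a minimal Python sketch of this use (my own illustration, not from the course; the function name predict_weight is hypothetical):

```python
# Sketch: predicting mean weight from the rounded estimated regression line w = -267 + 6.14 h.
def predict_weight(height_in):
    return -267 + 6.14 * height_in

print(predict_weight(66))   # 138.24 pounds
print(predict_weight(67))   # 144.38 pounds
```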

Now, what does b0 tell us? The answer is obvious when you evaluate the estimated regression equation at x = 0. Here, it tells us that a person who is 0 inches tall is predicted to weigh -267 pounds! Clearly, this prediction is nonsense. This happened because we "extrapolated" beyond the "scope of the model" (the range of the x values). It is not meaningful to have a height of 0 inches; that is, the scope of the model does not include x = 0. So, here the intercept b0 is not meaningful. In general, if the "scope of the model" includes x = 0, then b0 is the predicted mean response when x = 0. Otherwise, b0 is not meaningful. There is more information on this here (http://blog.minitab.com/blog/adventuresinstatistics/regressionanalysishowtointerprettheconstantyintercept).

And, what does b1 tell us? The answer is obvious when you subtract the predicted weight of 66-inch tall people from the predicted weight of 67-inch tall people. We obtain 144.38 - 138.24 = 6.14 pounds, the value of b1. Here, it tells us that we predict the mean weight to increase by 6.14 pounds for every additional one-inch increase in height. In general, we can expect the mean response to increase or decrease by b1 units for every one-unit increase in x.
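In symbols (a short check of my own), the difference between the predictions at heights one unit apart is always the slope:

\[ \hat{y}(x+1) - \hat{y}(x) = \bigl(b_0 + b_1(x+1)\bigr) - \bigl(b_0 + b_1 x\bigr) = b_1 \]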