Linear Regression
Fat Versus Protein: An Example
The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu:
SIT191 (week 3)
The Linear Model
Examining the scatterplot shows that there seems to be a linear relationship between these two variables.
We can say more about the linear relationship between two quantitative variables with a model/equation.
A model simplifies real data to help us understand underlying patterns and relationships.
The equation can be used to predict values of the response (y) variable for given values of the explanatory (x) variable.
Residuals
The model won't be perfect: regardless of the line we draw, some points will be above the line and some will be below.
The estimate made from a model is the predicted value (denoted $\hat{y}$).
The difference between the observed value and its associated predicted value is called the residual.
To find the residuals, we always subtract the predicted value from the observed one:
residual = observed value - predicted value
The Linear Model (cont.)
The linear model is just an equation of the straight line that best fits the data.
The points in the scatterplot don't all line up perfectly, but a straight line can summarise the general pattern.
The linear model can help us better understand how the values are associated numerically.
Residuals (cont.)
A negative residual means the predicted value (the line) is above the observation.
A positive residual means the predicted value (the line) lies below the observation.
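As a small illustration of this sign convention, the sketch below computes residuals as observed minus predicted. The line y_hat = 2 + 3x and the two observations are made-up values chosen only for the example; they are not taken from the Burger King data.

```python
def predict(x, b0=2.0, b1=3.0):
    """Predicted value from a hypothetical line y_hat = b0 + b1 * x."""
    return b0 + b1 * x


def residual(observed, predicted):
    """Residual = observed value - predicted value."""
    return observed - predicted


# Two made-up observations, for illustration only.
e_above = residual(6.0, predict(1.0))  # the line predicts 5, the point sits above it
e_below = residual(7.0, predict(2.0))  # the line predicts 8, the point sits below it

print(e_above)  # positive: the line lies below the observation
print(e_below)  # negative: the line lies above the observation
```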
19/03/2015
Best Fit Means Least Squares
Some residuals are positive, others are negative, and, on average, they cancel each other out.
So, we can't assess how well the line fits by adding up all the residuals.
Similar to what we did with standard deviations, we square the residuals and add the squares.
The smaller the sum, the better the fit.
The line of best fit is the line for which the sum of the squared residuals is smallest.
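The idea can be checked numerically. The sketch below fits a least squares line to a small made-up data set (not the Burger King data), confirms that the raw residuals add up to essentially zero, and compares the sum of squared residuals with that of two nearby lines. The function names are illustrative only.

```python
def fit_least_squares(xs, ys):
    """Return (b0, b1) for the line that minimises the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Observed minus predicted, for every point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, b0, b1):
    return sum(e ** 2 for e in residuals(xs, ys, b0, b1))


# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_least_squares(xs, ys)
residual_total = sum(residuals(xs, ys, b0, b1))            # positives and negatives cancel
sse_best = sum_squared_residuals(xs, ys, b0, b1)
sse_shifted = sum_squared_residuals(xs, ys, b0 + 0.5, b1)  # same slope, line moved up
sse_tilted = sum_squared_residuals(xs, ys, b0, b1 + 0.1)   # same intercept, steeper line

print(abs(residual_total) < 1e-9)
print(sse_best < sse_shifted and sse_best < sse_tilted)
```

Because the residuals of the fitted line sum to zero, their total says nothing about fit; the squared total is what separates the best line from its neighbours.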
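The Least Squares Line
We write the linear model as $\hat{y} = b_0 + b_1 x$.
The model/equation has a slope $b_1$ and an intercept $b_0$.
The slope is built from the correlation and the standard deviations:
$b_1 = r \frac{s_y}{s_x}$
The intercept is built from the means and the slope:
$b_0 = \bar{y} - b_1 \bar{x}$
The intercept is always in units of y.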
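A minimal sketch of these two formulas, assuming a small made-up data set (not the Burger King data): the slope is computed as the correlation times the ratio of standard deviations, and the intercept from the two means. The helper `correlation` is written out by hand here so the example does not depend on any particular library version.

```python
from math import sqrt
from statistics import mean, stdev


def correlation(xs, ys):
    """Pearson correlation r, computed from deviations about the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(xs, ys)
b1 = r * stdev(ys) / stdev(xs)  # slope: correlation times s_y / s_x
b0 = mean(ys) - b1 * mean(xs)   # intercept: y-bar minus slope times x-bar

print(round(b1, 3), round(b0, 3))
```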
The Least Squares Line (cont.)
Since regression and correlation are closely related, we need to check the same conditions for regressions as we did for correlations:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition
Fat Versus Protein Example
The regression line for the Burger King data fits the data well:
The equation is $\widehat{fat} = 6.8 + 0.97 \cdot protein$.
The predicted fat content for a BK Broiler chicken sandwich (with 30 g of protein) is 6.8 + 0.97(30) = 35.9 grams of fat.
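The same arithmetic as a short sketch, using the intercept 6.8 and slope 0.97 from the calculation above. The function name `predicted_fat` is illustrative only.

```python
def predicted_fat(protein_g):
    """Predicted total fat (g) from the fitted line quoted in the slides."""
    return 6.8 + 0.97 * protein_g


bk_broiler = predicted_fat(30)  # 30 is the value plugged in on the slide
print(round(bk_broiler, 1))
```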
Residuals Revisited
The linear model assumes that the relationship between the two variables is a perfect straight line. The residuals are the part of the data that hasn't been modeled.
Data = Model + Residual
or
Residual = Data - Model
Or, in symbols,
$e = y - \hat{y}$
Residuals Revisited (cont.)
Residuals help us to see whether the model makes sense.
When a regression model is appropriate, nothing interesting should be left behind.
After we fit a regression model, we usually plot the residuals in the hope of finding nothing.
The residual plot should not show any patterns or trends.
The plot should show a random cloud of points.
Residuals Revisited (cont.)
The residuals for the BK menu regression look appropriately boring:
R²: The Variation Accounted For
The variation in the residuals is the key to assessing how well the model fits.
In the BK menu items, total fat has a standard deviation of 16.4 grams. The standard deviation of the residuals is 9.2 grams.
R²: The Variation Accounted For (cont.)
If the correlation were 1.0 and the model predicted the fat values perfectly, the residuals would all be zero and have no variation.
As it is, the correlation is 0.83: not perfect.
However, we did see that the model residuals had less variation than total fat alone.
We can determine how much of the variation is accounted for by the model and how much is left in the residuals.
R²: The Variation Accounted For (cont.)
The squared correlation, r², gives the fraction of the data's variance accounted for by the model.
Thus, 1 - r² is the fraction of the original variance left in the residuals.
For the BK model, r² = 0.83² = 0.69, so 31% of the variability in total fat has been left in the residuals.
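As a rough cross-check, the sketch below reworks these figures from the three numbers quoted in the slides: the correlation 0.83, the standard deviation of total fat (16.4 g) and the standard deviation of the residuals (9.2 g). Comparing the squared ratio of the two standard deviations with 1 - r² is an approximation added here for illustration; the inputs are rounded, so the two values agree only to about two decimal places.

```python
r = 0.83            # correlation quoted for the BK data
sd_fat = 16.4       # standard deviation of total fat (g)
sd_residuals = 9.2  # standard deviation of the residuals (g)

r_squared = r ** 2                             # fraction accounted for by the model
fraction_left = 1 - r_squared                  # fraction left in the residuals
variance_ratio = (sd_residuals / sd_fat) ** 2  # residual variance over total variance

print(round(r_squared, 2), round(fraction_left, 2), round(variance_ratio, 2))
```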
R²: The Variation Accounted For (cont.)
All regression analyses include this statistic, although by tradition, it is written R² (pronounced "R-squared"). An R² of 0 means that none of the variance in the data is in the model; all of it is still in the residuals.
When interpreting a regression model you need to interpret what R² means.
In the BK example, 69% of the variation in total fat is accounted for by the model.
How Big Should R² Be?
R² is always between 0% and 100%. What makes a good R² value depends on the kind of data you are analysing and on what you want to do with it.
The standard deviation of the residuals can give us more information about the usefulness of the regression by telling us how much scatter there is around the line.
How Big Should R² Be? (cont.)
Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data.
Statistics is about variation, and R² measures the success of the regression model in terms of the fraction of the variation of y accounted for by the regression.
Regression Assumptions and Conditions
Quantitative Variables Condition:
Regression can only be done on two quantitative variables, so make sure to check this condition.
Straight Enough Condition:
The linear model assumes that the relationship between the variables is linear.
A scatterplot will let you check that the assumption is reasonable.
Regression Assumptions and Conditions (cont.)
Outlier Condition:
Watch out for outliers.
Outlying points can dramatically change a regression model.
Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables.
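To make the Outlier Condition above concrete, the sketch below shows one outlying point reversing the sign of a least squares slope. The five well-behaved points and the single outlier are made-up values chosen to exaggerate the effect.

```python
def slope(xs, ys):
    """Least squares slope b1 for the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Made-up data lying exactly on an upward-sloping line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]

slope_without = slope(xs, ys)
# Add one made-up outlier far from the rest of the data.
slope_with = slope(xs + [20.0], ys + [-20.0])

print(slope_without > 0, slope_with < 0)
```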
Cautions
Don't fit a straight line to a nonlinear relationship.
Beware of extraordinary points (y-values that stand off from the linear pattern or extreme x-values).
Don't extrapolate beyond the data: the linear model may no longer hold outside of the range of the data.
Don't infer that x causes y just because there is a good linear model for their relationship: association is not causation.
Don't choose a model based on R² alone.
Summary
When the relationship between two quantitative variables is fairly straight, a linear model can help summarise that relationship.
Regression models are nearly always calculated using a calculator or software such as SPSS.
The correlation tells us how strong the relationship is.
R² gives us the fraction of the variation in the response accounted for by the regression model.
Extrapolation: Reaching Beyond the Data
Linear models give a predicted value for each case in the data.
We cannot assume that a linear relationship in the data exists beyond the range of the data.
Once we venture into new x territory, such a prediction is called an extrapolation.
Extrapolation (cont.)
Extrapolations are dubious because they require the additional and very questionable assumption that nothing about the relationship between x and y changes even at extreme values of x.
Extrapolations can get you into deep trouble. You're better off not making extrapolations.
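One way to guard against accidental extrapolation is to flag any prediction whose x-value falls outside the range of the data used to fit the line. The sketch below uses the intercept 6.8 and slope 0.97 from the BK example; the protein range of 0 to 50 g is a hypothetical stand-in, since the slides do not state the observed range, and the function name is illustrative only.

```python
def predict_with_range_check(x, b0, b1, x_min, x_max):
    """Return (y_hat, is_extrapolation) for a fitted line y_hat = b0 + b1 * x."""
    y_hat = b0 + b1 * x
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation


B0, B1 = 6.8, 0.97          # line from the BK example
X_MIN, X_MAX = 0.0, 50.0    # hypothetical range of protein values in the data

inside = predict_with_range_check(30.0, B0, B1, X_MIN, X_MAX)
outside = predict_with_range_check(120.0, B0, B1, X_MIN, X_MAX)

print(inside[1])   # False: within the assumed data range
print(outside[1])  # True: the prediction is an extrapolation
```

The line still returns a number for x = 120; the flag is a reminder that nothing in the data supports it.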
Extrapolation (cont.)
A regression of mean age at first marriage for men vs. year, fit to the years from 1890 to 1998, does not hold for later years:
After 1950, linearity did not hold.
Outliers, Leverage, and Influence
Outlying points can strongly influence a regression. Even a single point far from the body of the data can dominate the analysis.
Outliers, Leverage, and Influence (cont.)
The following scatterplot shows that something was awry in Palm Beach County, Florida, during the 2000 presidential election.
Any point that stands away from the others can be called an outlier and deserves special attention.
Outliers, Leverage, and Influence (cont.)
The red line shows the effects that one unusual point can have on a regression:
Outliers, Leverage, and Influence (cont.)
A data point can also be unusual if its x-value is far from the mean of the x-values. Such points are said to have high leverage.
A point with high leverage has the potential to change the regression line.
We say that a point is influential if omitting it from the analysis gives a very different model.
Outliers, Leverage, and Influence (cont.)
Warning:
Influential points can hide in plots of residuals.
Points with high leverage pull the line close to them, so they often have small residuals.
You'll see influential points more easily in scatterplots of the original data or by finding a regression model with and without the points.
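The warning can be illustrated by fitting a line with and without a single high-leverage point. In the sketch below the data are made up: five ordinary points plus one point at x = 20, far from the other x-values. The added point changes the slope substantially, yet its own residual is smaller than the largest residual among the ordinary points.

```python
def fit_least_squares(xs, ys):
    """Return (b0, b1) for the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up data: five ordinary points and one high-leverage point.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 3.0, 2.0, 3.0]
lev_x, lev_y = 20.0, 20.0

b0_without, b1_without = fit_least_squares(xs, ys)
b0_with, b1_with = fit_least_squares(xs + [lev_x], ys + [lev_y])

# Residuals measured against the line that includes the high-leverage point.
lev_residual = lev_y - (b0_with + b1_with * lev_x)
other_residuals = [y - (b0_with + b1_with * x) for x, y in zip(xs, ys)]
largest_other = max(abs(e) for e in other_residuals)

print(round(b1_without, 2), round(b1_with, 2))
print(abs(lev_residual) < largest_other)
```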
Lurking Variables and Causation
No matter how strong the association, no matter how large the R² value, no matter how straight the line, there is no way to conclude from a regression alone that one variable causes the other.
There's always the possibility that some third variable is driving both of the variables you have observed.
With observational data, as opposed to data from a designed experiment, there is no way to be sure that a lurking variable is not the cause of any apparent association.
Lurking Variables and Causation (cont.)
The following scatterplot shows that the average life expectancy for a country is related to the number of doctors per person in that country:
Lurking Variables and Causation (cont.)
This new scatterplot shows that the average life expectancy for a country is related to the number of televisions per person in that country:
Lurking Variables and Causation (cont.)
Since televisions are cheaper than doctors, send TVs to countries with low life expectancies in order to extend lifetimes. Right? No!
How about considering a lurking variable? That makes more sense.
Countries with higher standards of living have both longer life expectancies and more doctors (and TVs!).
If higher living standards cause changes in these other variables, improving living standards might be expected to prolong lives and increase the numbers of doctors, and TVs.
Working With Summary Values
Scatterplots of statistics summarised over groups tend to show less variability than we would see if we measured the same variable on individuals.
This is because the summary statistics themselves vary less than the data on the individuals do.
Working With Summary Values (cont.)
There is a strong, positive, linear association between weight (in pounds) and height (in inches) for men:
Working With Summary Values (cont.)
If instead of data on individuals we only had the mean weight for each height value, we would see an even stronger association:
Working With Summary Values (cont.)
Means vary less than individual values.
Scatterplots of summary statistics show less scatter than the baseline data on individuals.
This can give a false impression of how well a line summarises the data.
There is no simple correction for this phenomenon.
Once we have summary data, there's no simple way to get the original values back.
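The effect can be reproduced with a small constructed example. In the sketch below the heights and weights are made up (three men at each of five heights, with weights spread evenly around a straight line); they are not the data behind the scatterplots in these slides. The correlation computed from the group means is much stronger than the correlation computed from the individuals.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r, computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: three individuals per height, scattered around a line.
heights = [64.0, 66.0, 68.0, 70.0, 72.0]
offsets = [-15.0, 0.0, 15.0]

ind_heights, ind_weights, mean_weights = [], [], []
for h in heights:
    group = [130.0 + 5.0 * (h - 64.0) + off for off in offsets]
    ind_heights.extend([h] * len(group))
    ind_weights.extend(group)
    mean_weights.append(sum(group) / len(group))  # one summary value per height

r_individuals = correlation(ind_heights, ind_weights)
r_means = correlation(heights, mean_weights)

print(round(r_individuals, 2), round(r_means, 2))
print(r_means > r_individuals)
```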
Cautions
Make sure the relationship is straight.
Check the Straight Enough Condition.
Beware of extrapolating.
Look for unusual points.
Beware of lurking variables, and don't assume that association is causation.
Watch out when dealing with data that are summaries.