Chapter 2

The Simple Regression Model
The simple regression model can be used to study the relationship between two variables. For reasons we will see, the simple regression model has limitations as a general tool for empirical analysis. Nevertheless, it is sometimes appropriate as an empirical tool. Learning how to interpret the simple regression model is good practice for studying multiple regression, which we will do in subsequent chapters.

2.1 DEFINITION OF THE SIMPLE REGRESSION MODEL
Much of applied econometric analysis begins with the following premise: y and x are two variables, representing some population, and we are interested in "explaining y in terms of x," or in "studying how y varies with changes in x." We discussed some examples in Chapter 1, including: y is soybean crop yield and x is amount of fertilizer; y is hourly wage and x is years of education; y is a community crime rate and x is number of police officers.

In writing down a model that will "explain y in terms of x," we must confront three issues. First, since there is never an exact relationship between two variables, how do we allow for other factors to affect y? Second, what is the functional relationship between y and x? And third, how can we be sure we are capturing a ceteris paribus relationship between y and x (if that is a desired goal)?

We can resolve these ambiguities by writing down an equation relating y to x. A simple equation is

y = β₀ + β₁x + u.    (2.1)
Equation (2.1), which is assumed to hold in the population of interest, defines the simple linear regression model. It is also called the two-variable linear regression model or bivariate linear regression model because it relates the two variables x and y. We now discuss the meaning of each of the quantities in (2.1). (Incidentally, the term "regression" has origins that are not especially important for most modern econometric applications, so we will not explain it here. See Stigler [1986] for an engaging history of regression analysis.)

When related by (2.1), the variables y and x have several different names used interchangeably, as follows. y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand. x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor. (The term covariate is also used for x.) The terms "dependent variable" and "independent variable" are frequently used in econometrics. But be aware that the label "independent" here does not refer to the statistical notion of independence between random variables (see Appendix B).

The terms "explained" and "explanatory" variables are probably the most descriptive. "Response" and "control" are used mostly in the experimental sciences, where the variable x is under the experimenter's control. We will not use the terms "predicted variable" and "predictor," although you sometimes see these. Our terminology for simple regression is summarized in Table 2.1.

Table 2.1
Terminology for Simple Regression

y                       x
----------------------  ----------------------
Dependent Variable      Independent Variable
Explained Variable      Explanatory Variable
Response Variable       Control Variable
Predicted Variable      Predictor Variable
Regressand              Regressor

The variable u, called the error term or disturbance in the relationship, represents factors other than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as being unobserved. You can usefully think of u as standing for "unobserved."
Equation (2.1) also addresses the issue of the functional relationship between y and x. If the other factors in u are held fixed, so that the change in u is zero, Δu = 0, then x has a linear effect on y:

Δy = β₁Δx if Δu = 0.    (2.2)

Thus, the change in y is simply β₁ multiplied by the change in x. This means that β₁ is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter β₀ also has its uses, although it is rarely central to an analysis.
Part 1  Regression Analysis with Cross-Sectional Data
EXAMPLE 2.1
(Soybean Yield and Fertilizer)

Suppose that soybean yield is determined by the model

yield = β₀ + β₁fertilizer + u,    (2.3)

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by β₁. The error term u contains factors such as land quality, rainfall, and so on. The coefficient β₁ measures the effect of fertilizer on yield, holding other factors fixed: Δyield = β₁Δfertilizer.

EXAMPLE 2.2
(A Simple Wage Equation)

A model relating a person's wage to observed education and other unobserved factors is

wage = β₀ + β₁educ + u.    (2.4)

If wage is measured in dollars per hour and educ is years of education, then β₁ measures the change in hourly wage given another year of education, holding all other factors fixed. Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and innumerable other things.

on y'
The linearity of (2.1) implies that a one-unitchangein "t has the sunrceffect
regarclless of t5c initial valueof .r. Tl-iisis unrcalistic1or ntan)/ecollonlicapplications'
for increu'sing
Foi exarnple,in the rvage-educationexampie. i.l'e might want to allorv
has a lurge r el'l'cct on wages than did tltc prcvious
rcturns:the ncxt ycar ol- educaticln
year.we rvill sec*howto aliow for sucit possibilities in section 2.4.
The most difficult issue to address is whether model (2.1) really allows us to draw ceteris paribus conclusions about how x affects y. We just saw in equation (2.2) that β₁ does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?
As we will see in Section 2.5, we are only able to get reliable estimators of β₀ and β₁ from a random sample of data when we make an assumption restricting how the unobservable u is related to the explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus effect, β₁. Because u and x are random variables, we need a concept grounded in probability.

Before we state the key assumption about how x and u are related, there is one assumption about u that we can always make. As long as the intercept, β₀, is included in the equation, nothing is lost by assuming that the average value of u in the population is zero.

Mathematically,

E(u) = 0.    (2.5)

Importantly, assumption (2.5) says nothing about the relationship between u and x but simply makes a statement about the distribution of the unobservables in the population. Using the previous examples for illustration, we can see that assumption (2.5) is not very restrictive. In Example 2.1, we lose nothing by normalizing the unobserved factors affecting soybean yield, such as land quality, to have an average of zero in the population of all cultivated plots. The same is true of the unobserved factors in Example 2.2. Without loss of generality, we can assume that things like average innate ability are zero in the population of all working people. If you are not convinced, you can work through Problem 2.2 to see that we can always redefine the intercept in equation (2.1) to make (2.5) true.
We now turn to the crucial assumption regarding how u and x are related. A natural measure of the association between two random variables is the correlation coefficient. (See Appendix B for definitions and properties.) If u and x are uncorrelated, then, as random variables, they are not linearly related. Assuming that u and x are uncorrelated goes a long way toward defining the sense in which u and x should be unrelated in equation (2.1). But it does not go far enough, because correlation measures only linear dependence between u and x. Correlation has a somewhat counterintuitive feature: it is possible for u to be uncorrelated with x while being correlated with functions of x, such as x². (See Section B.4 for further discussion.) This possibility is not acceptable for most regression purposes, as it causes problems for interpreting the model and for deriving statistical properties. A better assumption involves the expected value of u given x.

Because u and x are random variables, we can define the conditional distribution of u given any value of x. In particular, for any x, we can obtain the expected (or average) value of u for that slice of the population described by the value of x. The crucial assumption is that the average value of u does not depend on the value of x. We can write this as

E(u|x) = E(u) = 0,    (2.6)

where the second equality follows from (2.5). The new assumption embodied in equation (2.6) is the zero conditional mean assumption. It says that, for any given value of x, the average of the unobservables is the same and therefore must equal the average value of u in the entire population.
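The content of assumption (2.6) can be illustrated with a short simulation. In the following Python sketch, where all numbers are made up purely for illustration, the unobservable u is drawn independently of x, so the average of u within each slice of the population defined by x is close to the common mean E(u) = 0:

```python
# Simulation sketch of the zero conditional mean assumption E(u|x) = 0.
# All numbers here are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(8, 17, size=n)      # a discrete x, e.g. years of education 8..16
u = rng.normal(0.0, 1.0, size=n)     # unobservables drawn independently of x

# Average of u within each slice of the population defined by a value of x:
cond_means = [u[x == v].mean() for v in range(8, 17)]
print([round(m, 3) for m in cond_means])
```

If u were instead constructed to depend on x, say by adding a multiple of x to u, the conditional means would differ across the slices and (2.6) would fail.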
Let us see what (2.6) entails in the wage example. To simplify the discussion, assume that u is the same as innate ability. Then (2.6) requires that the average level of ability is the same, regardless of years of education. For example, if E(abil|8) denotes the average ability for the group of all people with eight years of education, and E(abil|16) denotes the average ability among people in the population with 16 years of education, then (2.6) implies that these must be the same. In fact, the average ability level must be the same for all education levels. If, for example, we think that average ability increases with years of education, then (2.6) is false. (This would happen if, on average, people with more ability choose to become more educated.) As we cannot observe innate ability, we have no way of knowing whether or not average ability is the same

for all education levels. But this is an issue that we must address before applying simple regression analysis.
In the fertilizer example, if fertilizer amounts are chosen independently of other features of the plots, then (2.6) will hold: the average land quality will not depend on the amount of fertilizer. However, if more fertilizer is put on the higher quality plots of land, then the expected value of u changes with the level of fertilizer, and (2.6) fails.

QUESTION 2.1
Suppose that a score on a final exam, score, depends on classes attended (attend) and unobserved factors that affect exam performance (such as student ability):

score = β₀ + β₁attend + u.    (2.7)

When would you expect this model to satisfy (2.6)?

Assumption (2.6) gives β₁ another interpretation that is often useful. Taking the expected value of (2.1) conditional on x and using E(u|x) = 0 gives

E(y|x) = β₀ + β₁x.    (2.8)
Equation (2.8) shows that the population regression function (PRF), E(y|x), is a linear function of x. The linearity means that a one-unit increase in x changes the expected value of y by the amount β₁. For any given value of x, the distribution of y is centered about E(y|x), as illustrated in Figure 2.1.

[Figure 2.1: E(y|x) as a linear function of x.]

When (2.6) is true, it is useful to break y into two components. The piece β₀ + β₁x is sometimes called the systematic part of y, that is, the part of y explained by x, and u is called the unsystematic part, or the part of y not explained by x. We will use assumption (2.6) in the next section for motivating estimates of β₀ and β₁. This assumption is also crucial for the statistical analysis in Section 2.5.

2.2 DERIVING THE ORDINARY LEAST SQUARES ESTIMATES
ttl?i
Now that we have discussed the basic ingredients of the simple regression model, we will address the important issue of how to estimate the parameters β₀ and β₁ in equation (2.1). To do this, we need a sample from the population. Let {(xᵢ, yᵢ): i = 1, ..., n} denote a random sample of size n from the population. Since these data come from (2.1), we can write

yᵢ = β₀ + β₁xᵢ + uᵢ    (2.9)
for each i. Here, uᵢ is the error term for observation i since it contains all factors affecting yᵢ other than xᵢ.

As an example, xᵢ might be the annual income and yᵢ the annual savings for family i during a particular year. If we have collected data on 15 families, then n = 15. A scatterplot of such a data set is given in Figure 2.2, along with the (necessarily fictitious) population regression function.

We must decide how to use these data to obtain estimates of the intercept and slope in the population regression of savings on income.
There are several ways to motivate the following estimation procedure. We will use (2.5) and an important implication of assumption (2.6): in the population, u has a zero mean and is uncorrelated with x. Therefore, we see that u has zero expected value and that the covariance between x and u is zero:

E(u) = 0    (2.10)

and

Cov(x, u) = E(xu) = 0,    (2.11)
where the first equality in (2.11) follows from (2.10). (See Section B.4 for the definition and properties of covariance.) In terms of the observable variables x and y and the unknown parameters β₀ and β₁, equations (2.10) and (2.11) can be written as

E(y - β₀ - β₁x) = 0    (2.12)

and

E[x(y - β₀ - β₁x)] = 0,    (2.13)
[Figure 2.2: Savings and income for 15 families, and the population regression function E(savings|income) = β₀ + β₁income.]

respectively. Equations (2.12) and (2.13) imply two restrictions on the joint probability distribution of (x, y) in the population. Since there are two unknown parameters to estimate, we might hope that equations (2.12) and (2.13) can be used to obtain good estimators of β₀ and β₁. In fact, they can be. Given a sample of data, we choose the estimates β̂₀ and β̂₁ to solve the sample counterparts of (2.12) and (2.13):
n⁻¹ Σᵢ₌₁ⁿ (yᵢ - β̂₀ - β̂₁xᵢ) = 0    (2.14)

and

n⁻¹ Σᵢ₌₁ⁿ xᵢ(yᵢ - β̂₀ - β̂₁xᵢ) = 0.    (2.15)

This is an example of the method of moments approach to estimation. (See Section C.4 for a discussion of different estimation approaches.) These equations can be solved for β̂₀ and β̂₁.
Using the basic properties of the summation operator from Appendix A, equation (2.14) can be rewritten as

ȳ = β̂₀ + β̂₁x̄,    (2.16)

where ȳ = n⁻¹ Σᵢ₌₁ⁿ yᵢ is the sample average of the yᵢ, and likewise for x̄. This equation allows us to write β̂₀ in terms of β̂₁, ȳ, and x̄:

β̂₀ = ȳ - β̂₁x̄.    (2.17)

Therefore, once we have the slope estimate β̂₁, it is straightforward to obtain the intercept estimate β̂₀, given ȳ and x̄.

Dropping the n⁻¹ in (2.15) (since it does not affect the solution) and plugging (2.17) into (2.15) yields

Σᵢ₌₁ⁿ xᵢ[yᵢ - (ȳ - β̂₁x̄) - β̂₁xᵢ] = 0,

which, upon rearrangement, gives

Σᵢ₌₁ⁿ xᵢ(yᵢ - ȳ) = β̂₁ Σᵢ₌₁ⁿ xᵢ(xᵢ - x̄).
From basic properties of the summation operator [see (A.7) and (A.8)],

Σᵢ₌₁ⁿ xᵢ(xᵢ - x̄) = Σᵢ₌₁ⁿ (xᵢ - x̄)²  and  Σᵢ₌₁ⁿ xᵢ(yᵢ - ȳ) = Σᵢ₌₁ⁿ (xᵢ - x̄)(yᵢ - ȳ).

Therefore, provided that

Σᵢ₌₁ⁿ (xᵢ - x̄)² > 0,    (2.18)

the estimated slope is

β̂₁ = Σᵢ₌₁ⁿ (xᵢ - x̄)(yᵢ - ȳ) / Σᵢ₌₁ⁿ (xᵢ - x̄)².    (2.19)

Equation (2.19) is simply the sample covariance between x and y divided by the sample variance of x. (See Appendix C. Dividing both the numerator and the denominator by n - 1 changes nothing.) This makes sense because β̂₁ equals the population covariance divided by the variance of x when E(u) = 0 and Cov(x, u) = 0. An immediate implication is that if x and y are positively correlated in the sample, then β̂₁ is positive; if x and y are negatively correlated, then β̂₁ is negative.
Although the method for obtaining (2.17) and (2.19) is motivated by (2.6), the only assumption needed to compute the estimates for a particular sample is (2.18). This is hardly an assumption at all: (2.18) is true provided the xᵢ in the sample are not all equal to the same value. If (2.18) fails, then we have either been unlucky in obtaining our sample from the population or we have not specified an interesting problem (x does not vary in the population). For example, if y = wage and x = educ, then (2.18) fails only if everyone in the sample has the same amount of education. (For example, if everyone is a high school graduate. See Figure 2.3.) If just one person has a different amount of education, then (2.18) holds, and the OLS estimates can be computed.
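Equations (2.17) and (2.19) are easy to compute directly from a sample. The following Python sketch, using a small data set made up for illustration, implements both formulas and raises an error when condition (2.18) fails:

```python
# A sketch of the OLS formulas (2.17) and (2.19) for simple regression.
# The data set below is made up purely for illustration.
import numpy as np

def ols_simple(x, y):
    """Return (b0_hat, b1_hat) for the simple regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = ((x - x.mean()) ** 2).sum()
    if sxx <= 0:                   # condition (2.18) fails: x has no variation
        raise ValueError("x must take at least two distinct values")
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / sxx   # equation (2.19)
    b0 = y.mean() - b1 * x.mean()                        # equation (2.17)
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
b0, b1 = ols_simple(x, y)
print(round(b0, 3), round(b1, 3))   # → 1.8 0.8
```

The slope is the sample covariance between x and y divided by the sample variance of x, exactly as the text describes.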
[Figure 2.3: A scatterplot of wage against education when educᵢ = 12 for all i.]

The estimates given in (2.17) and (2.19) are called the ordinary least squares (OLS) estimates of β₀ and β₁. To justify this name, for any β̂₀ and β̂₁, define a fitted value for y when x = xᵢ such as

ŷᵢ = β̂₀ + β̂₁xᵢ,    (2.20)

for the given intercept and slope. This is the value we predict for y when x = xᵢ. There is a fitted value for each observation in the sample. The residual for observation i is the difference between the actual yᵢ and its fitted value:

ûᵢ = yᵢ - ŷᵢ = yᵢ - β̂₀ - β̂₁xᵢ.    (2.21)
Again, there are n such residuals. (These are not the same as the errors in (2.9), a point we return to in Section 2.5.) The fitted values and residuals are indicated in Figure 2.4.

Now, suppose we choose β̂₀ and β̂₁ to make the sum of squared residuals,

Σᵢ₌₁ⁿ ûᵢ² = Σᵢ₌₁ⁿ (yᵢ - β̂₀ - β̂₁xᵢ)²,    (2.22)
[Figure 2.4: Fitted values and residuals (ŷ = β̂₀ + β̂₁x; ûᵢ is the residual for observation i).]
as small as possible. The appendix to this chapter shows that the conditions necessary for (β̂₀, β̂₁) to minimize (2.22) are given exactly by equations (2.14) and (2.15), without n⁻¹. Equations (2.14) and (2.15) are often called the first order conditions for the OLS estimates, a term that comes from optimization using calculus (see Appendix A). From our previous calculations, we know that the solutions to the OLS first order conditions are given by (2.17) and (2.19). The name "ordinary least squares" comes from the fact that these estimates minimize the sum of squared residuals.
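That (2.17) and (2.19) minimize (2.22) can also be checked numerically: on any sample, perturbing the OLS estimates never lowers the sum of squared residuals. A Python sketch with made-up data:

```python
# Numerical check that the OLS estimates minimize the SSR in (2.22).
# The data are made up purely for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

def ssr(a0, a1):
    """Sum of squared residuals for candidate intercept a0 and slope a1."""
    return float(((y - a0 - a1 * x) ** 2).sum())

best = ssr(b0, b1)
# No small perturbation of (b0, b1) gives a smaller SSR:
for d0 in (-0.1, 0.0, 0.1):
    for d1 in (-0.1, 0.0, 0.1):
        assert ssr(b0 + d0, b1 + d1) >= best
print(round(best, 3))   # → 2.4
```

Because the SSR is a strictly convex quadratic in the two estimates, every perturbation other than (0, 0) strictly increases it.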
Once we have determined the OLS intercept and slope estimates, we form the OLS regression line:

ŷ = β̂₀ + β̂₁x,    (2.23)

where it is understood that β̂₀ and β̂₁ have been obtained using equations (2.17) and (2.19). The notation ŷ, read as "y hat," emphasizes that the predicted values from equation (2.23) are estimates. The intercept, β̂₀, is the predicted value of y when x = 0, although in some cases it will not make sense to set x = 0. In those situations, β̂₀ is not, in itself, very interesting. When using (2.23) to compute predicted values of y for various values of x, we must account for the intercept in the calculations. Equation (2.23) is also called the sample regression function (SRF) because it is the estimated version of the population regression function E(y|x) = β₀ + β₁x. It is important to remember that the PRF is something fixed, but unknown, in the population. Since the SRF is
lyl
Analysis
Regression Data
with Cross-Sectional
Pa|{ I

a differentslopeand
obtflined1bra givensantpleof dau, a n()wsanple will generate
intercept in equation(2.23).
In most cases, the slope estimate, which we can write as

β̂₁ = Δŷ/Δx,    (2.24)

is of primary interest. It tells us the amount by which ŷ changes when x increases by one unit. Equivalently,

Δŷ = β̂₁Δx,    (2.25)

so that given any change in x (whether positive or negative), we can compute the predicted change in y.
We now present several examples of simple regression obtained by using real data. In other words, we find the intercept and slope estimates with equations (2.17) and (2.19). Since these examples involve many observations, the calculations were done using an econometric software package. At this point, you should be careful not to read too much into these regressions; they are not necessarily uncovering a causal relationship. We have said nothing so far about the statistical properties of OLS. In Section 2.5, we consider statistical properties after we explicitly impose assumptions on the population model equation (2.1).

EXAMPLE 2.3
(CEO Salary and Return on Equity)

For the population of chief executive officers, let y be annual salary (salary) in thousands of dollars. Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1452.6 indicates a salary of $1,452,600. Let x be the average return on equity (roe) for the CEO's firm for the previous three years. (Return on equity is defined in terms of net income as a percentage of common equity.) For example, if roe = 10, then average return on equity is 10 percent.

To study the relationship between this measure of firm performance and CEO compensation, we postulate the simple model

salary = β₀ + β₁roe + u.

The slope parameter β₁ measures the change in annual salary, in thousands of dollars, when return on equity increases by one percentage point. Because a higher roe is good for the company, we think β₁ > 0.

The data set CEOSAL1.RAW contains information on 209 CEOs for the year 1990; these data were obtained from Business Week (5/6/91). In this sample, the average annual salary is $1,281,120, with the smallest and largest being $223,000 and $14,822,000, respectively. The average return on equity for the years 1988, 1989, and 1990 is 17.18 percent, with the smallest and largest values being 0.5 and 56.3 percent, respectively.

Using the data in CEOSAL1.RAW, the OLS regression line relating salary to roe is

salaryhat = 963.191 + 18.501 roe,    (2.26)

where the intercept and slope estimates have been rounded to three decimal places; we use "salaryhat" to indicate that this is an estimated equation. How do we interpret the equation? First, if the return on equity is zero, roe = 0, then the predicted salary is the intercept, 963.191, which equals $963,191 since salary is measured in thousands. Next, we can write the predicted change in salary as a function of the change in roe: Δsalaryhat = 18.501(Δroe). This means that if the return on equity increases by one percentage point, Δroe = 1, then salary is predicted to change by about 18.5, or $18,500. Because (2.26) is a linear equation, this is the estimated change regardless of the initial salary.

We can easily use (2.26) to compare predicted salaries at different values of roe. Suppose roe = 30. Then salaryhat = 963.191 + 18.501(30) = 1518.221, which is just over $1.5 million. However, this does not mean that a particular CEO whose firm had an roe = 30 earns $1,518,221. There are many other factors that affect salary. This is just our prediction from the OLS regression line (2.26). The estimated line is graphed in Figure 2.5, along with the population regression function E(salary|roe). We will never know the PRF, so we cannot tell how close the SRF is to the PRF. Another sample of data will give a different regression line, which may or may not be closer to the population regression line.

[Figure 2.5: salaryhat = 963.191 + 18.501 roe and the (unknown) population regression function E(salary|roe).]

EXAMPLE 2.4
(Wage and Education)

For the population of people in the work force in 1976, let y = wage, where wage is measured in dollars per hour. Thus, for a particular person, if wage = 6.75, the hourly wage is $6.75. Let x = educ denote years of schooling; for example, educ = 12 corresponds to a complete high school education. Since the average wage in the sample is $5.90, the consumer price index indicates that this amount is equivalent to $16.64 in 1997 dollars.

Using the data in WAGE1.RAW where n = 526 individuals, we obtain the following OLS regression line (or sample regression function):

wagehat = -0.90 + 0.54 educ.    (2.27)

We must interpret this equation with caution. The intercept of -0.90 literally means that a person with no education has a predicted hourly wage of -90 cents an hour. This, of course, is silly. It turns out that no one in the sample has less than eight years of education, which helps to explain the crazy prediction for a zero education value. For a person with eight years of education, the predicted wage is wagehat = -0.90 + 0.54(8) = 3.42, or $3.42 per hour (in 1976 dollars).

The slope estimate in (2.27) implies that one more year of education increases hourly wage by 54 cents an hour. Therefore, four more years of education increase the predicted wage by 4(0.54) = 2.16, or $2.16 per hour. These are fairly large effects. Because of the linear nature of (2.27), another year of education increases the wage by the same amount, regardless of the initial level of education. In Section 2.4, we discuss some methods that allow for nonconstant marginal effects of our explanatory variables.
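The calculations in Example 2.4 can be reproduced the same way from the rounded estimates in (2.27):

```python
# Predicted hourly wage (1976 dollars) from equation (2.27),
# using the rounded estimates reported in the text.
b0_hat, b1_hat = -0.90, 0.54

wage_8 = b0_hat + b1_hat * 8     # predicted wage with eight years of education
raise_4 = 4 * b1_hat             # predicted effect of four more years of education

print(round(wage_8, 2), round(raise_4, 2))   # → 3.42 2.16
```

Because the equation is linear, the $2.16 effect of four more years is the same at every starting level of education.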

EXAMPLE 2.5
(Voting Outcomes and Campaign Expenditures)

The file VOTE1.RAW contains data on election outcomes and campaign expenditures for 173 two-party races for the U.S. House of Representatives in 1988. There are two candidates in each race, A and B. Let voteA be the percentage of the vote received by Candidate A and shareA be the percentage of total campaign expenditures accounted for by Candidate A. Many factors other than shareA affect the election outcome (including the quality of the candidates and possibly the dollar amounts spent by A and B). Nevertheless, we can estimate a simple regression model to find out whether spending more relative to one's challenger implies a higher percentage of the vote.

The estimated equation using the 173 observations is

voteAhat = 40.90 + 0.306 shareA.    (2.28)

This means that, if the share of Candidate A's expenditures increases by one percentage point, Candidate A receives almost one-third of a percentage point more of the

total vote. Whether or not this is a causal effect is unclear, but it is not unbelievable.

In some cases, regression analysis is not used to determine causality but to simply look at whether two variables are positively or negatively related, much like a standard correlation analysis. An example occurs in Problem 2.12, where you are asked to use data from Biddle and Hamermesh (1990) on time spent sleeping and working to investigate the tradeoff between these two factors.

QUESTION 2.2
In Example 2.5, what is the predicted vote for Candidate A if shareA = 60 (which means 60 percent)? Does this answer seem reasonable?
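A Python sketch of how Question 2.2 can be answered with the estimates in (2.28):

```python
# Predicted vote share for Candidate A from equation (2.28),
# using the rounded estimates reported in the text.
b0_hat, b1_hat = 40.90, 0.306

voteA_hat = b0_hat + b1_hat * 60     # shareA = 60 percent
print(round(voteA_hat, 2))           # → 59.26
```

The prediction is just under 60 percent of the vote for a candidate who accounts for 60 percent of total spending.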

A Note on Terminology

In most cases, we will indicate the estimation of a relationship through OLS by writing an equation such as (2.26), (2.27), or (2.28). Sometimes, for the sake of brevity, it is useful to indicate that an OLS regression has been run without actually writing out the equation. We will often indicate that equation (2.23) has been obtained by OLS in saying that we run the regression of

y on x,    (2.29)

or simply that we regress y on x. The positions of y and x in (2.29) indicate which is the dependent variable and which is the independent variable: we always regress the dependent variable on the independent variable. For specific applications, we replace y and x with their names. Thus, to obtain (2.26), we regress salary on roe, or to obtain (2.28), we regress voteA on shareA.

When we use such terminology in (2.29), we will always mean that we plan to estimate the intercept, β̂₀, along with the slope, β̂₁. This case is appropriate for the vast majority of applications. Occasionally, we may want to estimate the relationship between y and x assuming that the intercept is zero (so that x = 0 implies ŷ = 0); we cover this case briefly in Section 2.6. Unless explicitly stated otherwise, we always estimate an intercept along with a slope.

2.3 MECHANICS OF OLS

In this section, we cover some algebraic properties of the fitted OLS regression line. The best way to think about these properties is to realize that they are features of OLS for a particular sample of data. They can be contrasted with the statistical properties of OLS, which requires deriving features of the sampling distributions of the estimators. We will discuss statistical properties in Section 2.5.

Several of the algebraic properties we are going to derive will appear mundane. Nevertheless, having a grasp of these properties helps us to figure out what happens to the OLS estimates and related statistics when the data are manipulated in certain ways, such as when the measurement units of the dependent and independent variables change.

Fitted Values and Residuals

We assume that the intercept and slope estimates, β̂₀ and β̂₁, have been obtained for the given sample of data. Given β̂₀ and β̂₁, we can obtain the fitted value ŷᵢ for each observation. [This is given by equation (2.20).] By definition, each fitted value ŷᵢ is on the OLS regression line. The OLS residual associated with observation i, ûᵢ, is the difference between yᵢ and its fitted value, as given in equation (2.21). If ûᵢ is positive, the line underpredicts yᵢ; if ûᵢ is negative, the line overpredicts yᵢ. The ideal case for observation i is when ûᵢ = 0, but in most cases, every residual is not equal to zero. In other words, none of the data points must actually lie on the OLS line.
EXAMPLE 2.6
(CEO Salary and Return on Equity)

Table 2.2 contains a listing of the first 15 observations in the CEO data set, along with the fitted values, called salaryhat, and the residuals, called uhat.

Table 2.2
Fitted Values and Residuals for the First 15 CEOs

obsno    roe     salary    salaryhat    uhat
1        14.1    1095      1224.058     -129.0581
2        10.9    1001      1164.854     -163.8542
3        23.5    1122      1397.969     -275.9692
4         5.9     578      1072.348     -494.3484
5        13.8    1368      1218.508      149.4923
6        20.0    1145      1333.215     -188.2151
7        16.4    1078      1266.611     -188.6108
8        16.3    1094      1264.761     -170.7606
9        10.5    1237      1157.454       79.54626
10       26.3     833      1449.773     -616.7726
11       25.9     567      1442.372     -875.3721
12       26.8     933      1459.023     -526.0231
13       14.8    1339      1237.009      101.9911
14       22.3     937      1375.768     -438.7678
15       56.3    2011      2004.808        6.191895

The first four CEOs have lower salaries than what we predicted from the OLS regression line (2.26); in other words, given only the firm's roe, these CEOs make less than what we predicted. As can be seen from the positive uhat, the fifth CEO makes more than predicted from the OLS regression line.
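A few rows of Table 2.2 can be recomputed from the rounded estimates in (2.26). Because the table was produced with unrounded estimates, the recomputed values can differ slightly in the third decimal. A Python sketch:

```python
# Recomputing fitted values (salaryhat) and residuals (uhat) for selected
# rows of Table 2.2, using the rounded estimates from equation (2.26).
b0_hat, b1_hat = 963.191, 18.501

rows = [(1, 14.1, 1095.0), (5, 13.8, 1368.0), (15, 56.3, 2011.0)]  # (obsno, roe, salary)
for obsno, roe, salary in rows:
    fitted = b0_hat + b1_hat * roe    # salaryhat, equation (2.20)
    resid = salary - fitted           # uhat, equation (2.21)
    print(obsno, round(fitted, 2), round(resid, 2))
```

For observation 1, for instance, this gives a fitted value near 1224.06 and a residual near -129.06, matching the table up to the rounding of the estimates.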

Algebraic Properties of OLS Statistics

There are several useful algebraic properties of OLS estimates and their associated statistics. We now cover the three most important of these.

(1) The sum, and therefore the sample average, of the OLS residuals is zero. Mathematically,

$$\sum_{i=1}^{n} \hat{u}_i = 0. \qquad (2.30)$$

This property needs no proof; it follows immediately from the OLS first order condition (2.14), when we remember that the residuals are defined by $\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$. In other words, the OLS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to make the residuals add up to zero (for any data set). This says nothing about the residual for any particular observation $i$.
(2) The sample covariance between the regressors and the OLS residuals is zero. This follows from the first order condition (2.15), which can be written in terms of the residuals as

$$\sum_{i=1}^{n} x_i \hat{u}_i = 0. \qquad (2.31)$$

The sample average of the OLS residuals is zero, so the left hand side of (2.31) is proportional to the sample covariance between $x_i$ and $\hat{u}_i$.

(3) The point $(\bar{x}, \bar{y})$ is always on the OLS regression line. In other words, if we take equation (2.23) and plug in $\bar{x}$ for $x$, then the predicted value is $\bar{y}$. This is exactly what equation (2.16) shows us.
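These three properties are easy to verify numerically. The following sketch uses made-up data (not any data set from the text) and computes the OLS estimates directly from the slope and intercept formulas:

```python
# Minimal OLS on made-up data, then numerical checks of properties (1)-(3).
def ols(x, y):
    """Return (intercept, slope) from the OLS formulas (2.17) and (2.19)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 3.9, 4.1, 5.6]
b0, b1 = ols(x, y)
uhat = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

assert abs(sum(uhat)) < 1e-9                                 # (1) residuals sum to zero
assert abs(sum(xi * ui for xi, ui in zip(x, uhat))) < 1e-9   # (2) zero covariance with x
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
assert abs((b0 + b1 * xbar) - ybar) < 1e-9                   # (3) (xbar, ybar) is on the line
```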

Part 1  Regression Analysis with Cross-Sectional Data

EXAMPLE 2.7
(Wage and Education)

For the data in WAGE1.RAW, the average hourly wage in the sample is 5.90, rounded to two decimal places, and the average education is 12.56. If we plug educ = 12.56 into the OLS regression line (2.27), we get $\widehat{wage} = -0.90 + 0.54(12.56) = 5.8824$, which equals 5.9 when rounded to the first decimal place. The reason these figures do not exactly agree is that we have rounded the average wage and education, as well as the intercept and slope estimates. If we did not initially round any of the values, we would get the answers to agree more closely, but this practice has little useful effect.

Writing each $y_i$ as its fitted value, plus its residual, provides another way to interpret an OLS regression. For each $i$, write

$$y_i = \hat{y}_i + \hat{u}_i. \qquad (2.32)$$

From property (1) above, the average of the residuals is zero; equivalently, the sample average of the fitted values, $\hat{y}_i$, is the same as the sample average of the $y_i$, or $\bar{\hat{y}} = \bar{y}$. Further, properties (1) and (2) can be used to show that the sample covariance between $\hat{y}_i$ and $\hat{u}_i$ is zero. Thus, we can view OLS as decomposing each $y_i$ into two parts: a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.
Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (SSR) (also known as the sum of squared residuals), as follows:

$$\text{SST} \equiv \sum_{i=1}^{n} (y_i - \bar{y})^2. \qquad (2.33)$$

$$\text{SSE} \equiv \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2. \qquad (2.34)$$

$$\text{SSR} \equiv \sum_{i=1}^{n} \hat{u}_i^2. \qquad (2.35)$$

SST is a measure of the total sample variation in the $y_i$; that is, it measures how spread out the $y_i$ are in the sample. If we divide SST by $n - 1$, we obtain the sample variance of $y$, as discussed in Appendix C. Similarly, SSE measures the sample variation in the $\hat{y}_i$ (where we use the fact that $\bar{\hat{y}} = \bar{y}$), and SSR measures the sample variation in the $\hat{u}_i$. The total variation in $y$ can always be expressed as the sum of the explained variation and the unexplained variation SSR. Thus,

$$\text{SST} = \text{SSE} + \text{SSR}. \qquad (2.36)$$


Proving (2.36) is not difficult, but it requires us to use all of the properties of the summation operator covered in Appendix A. Write

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 = \sum_{i=1}^{n} [\hat{u}_i + (\hat{y}_i - \bar{y})]^2$$
$$= \sum_{i=1}^{n} \hat{u}_i^2 + 2\sum_{i=1}^{n} \hat{u}_i(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \text{SSR} + 2\sum_{i=1}^{n} \hat{u}_i(\hat{y}_i - \bar{y}) + \text{SSE}.$$

Now (2.36) holds if we show that

$$\sum_{i=1}^{n} \hat{u}_i(\hat{y}_i - \bar{y}) = 0. \qquad (2.37)$$

But we have already claimed that the sample covariance between the residuals and the fitted values is zero, and this covariance is just (2.37) divided by $n - 1$. Thus, we have established (2.36).

Some words of caution about SST, SSE, and SSR are in order. There is no uniform agreement on the names or abbreviations for the three quantities defined in equations (2.33), (2.34), and (2.35). The total sum of squares is called either SST or TSS, so there is little confusion here. Unfortunately, the explained sum of squares is sometimes called the "regression sum of squares." If this term is given its natural abbreviation, it can easily be confused with the term residual sum of squares. Some regression packages refer to the explained sum of squares as the "model sum of squares."
To make matters even worse, the residual sum of squares is often called the "error sum of squares." This is especially unfortunate because, as we will see in Section 2.5, the errors and the residuals are different quantities. Thus, we will always call (2.35) the residual sum of squares or the sum of squared residuals. We prefer to use the abbreviation SSR to denote the sum of squared residuals, because it is more common in econometric packages.
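The decomposition (2.36) can be confirmed numerically. This is a sketch with made-up data, computing SST, SSE, and SSR directly from their definitions:

```python
# Sketch on made-up data: compute SST, SSE, and SSR as defined in
# (2.33)-(2.35) and confirm the decomposition SST = SSE + SSR in (2.36).
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 3.9, 4.1, 5.6]
b0, b1 = ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
ybar = sum(y) / len(y)

sst = sum((yi - ybar) ** 2 for yi in y)        # total sum of squares
sse = sum((fi - ybar) ** 2 for fi in fitted)   # explained sum of squares
ssr = sum(ui ** 2 for ui in resid)             # residual sum of squares

assert abs(sst - (sse + ssr)) < 1e-9           # SST = SSE + SSR
```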

Goodness-of-Fit

So far, we have no way of measuring how well the explanatory or independent variable, $x$, explains the dependent variable, $y$. It is often useful to compute a number that summarizes how well the OLS regression line fits the data. In the following discussion, be sure to remember that we assume that an intercept is estimated along with the slope.
Assuming that the total sum of squares, SST, is not equal to zero (which is true except in the very unlikely event that all the $y_i$ equal the same value), we can divide (2.36) by SST to get $1 = \text{SSE}/\text{SST} + \text{SSR}/\text{SST}$. The R-squared of the regression, sometimes called the coefficient of determination, is defined as

$$R^2 \equiv \text{SSE}/\text{SST} = 1 - \text{SSR}/\text{SST}. \qquad (2.38)$$

$R^2$ is the ratio of the explained variation to the total variation, and thus it is interpreted as the fraction of the sample variation in $y$ that is explained by $x$. The second equality in (2.38) provides another way of computing $R^2$.
From (2.36), the value of $R^2$ is always between zero and one, since SSE can be no greater than SST. When interpreting $R^2$, we usually multiply it by 100 to change it into a percent: $100 \cdot R^2$ is the percentage of the sample variation in $y$ that is explained by $x$.
If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case, $R^2 = 1$. A value of $R^2$ that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the $y_i$ is captured by the variation in the $\hat{y}_i$ (which all lie on the OLS regression line). In fact, it can be shown that $R^2$ is equal to the square of the sample correlation coefficient between $y_i$ and $\hat{y}_i$. This is where the term "R-squared" came from. (The letter R was traditionally used to denote an estimate of a population correlation coefficient, and its usage has survived in regression analysis.)
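The equivalence of the three ways of arriving at $R^2$ can be checked numerically. This sketch uses made-up data and computes $R^2$ as SSE/SST, as $1 - \text{SSR}/\text{SST}$, and as the squared sample correlation between $y_i$ and $\hat{y}_i$:

```python
import math

# Sketch on made-up data: three equivalent computations of R-squared.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 3.9, 4.1, 5.6]
b0, b1 = ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
ybar = sum(y) / len(y)

sst = sum((yi - ybar) ** 2 for yi in y)
sse = sum((fi - ybar) ** 2 for fi in fitted)
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r2_a = sse / sst
r2_b = 1.0 - ssr / sst

# squared sample correlation between y and the fitted values
fbar = sum(fitted) / len(fitted)
cov = sum((yi - ybar) * (fi - fbar) for yi, fi in zip(y, fitted))
corr = cov / math.sqrt(sst * sum((fi - fbar) ** 2 for fi in fitted))
r2_c = corr ** 2

assert abs(r2_a - r2_b) < 1e-9 and abs(r2_a - r2_c) < 1e-9
```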

EXAMPLE 2.8
(CEO Salary and Return on Equity)

In the CEO salary regression, we obtain the following:

$$\widehat{salary} = 963.191 + 18.501\,roe \qquad (2.39)$$
$$n = 209,\ R^2 = 0.0132.$$

We have reproduced the OLS regression line and the number of observations for clarity. Using the R-squared (rounded to four decimal places) reported for this equation, we can see how much of the variation in salary is actually explained by the return on equity. The answer is: not much. The firm's return on equity explains only about 1.3% of the variation in salaries for this sample of 209 CEOs. That means that 98.7% of the salary variations for these CEOs is left unexplained! This lack of explanatory power may not be too surprising, since there are many other characteristics of both the firm and the individual CEO that should influence salary; these factors are necessarily included in the errors in a simple regression analysis.

In the social sciences, low R-squareds in regression equations are not uncommon, especially for cross-sectional analysis. We will discuss this issue more generally under multiple regression analysis, but it is worth emphasizing now that a seemingly low R-squared does not necessarily mean that an OLS regression equation is useless. It is still possible that (2.39) is a good estimate of the ceteris paribus relationship between salary and roe; whether or not this is true does not depend directly on the size of R-squared. Students who are first learning econometrics tend to put too much weight on the size of the R-squared in evaluating regression equations. For now, be aware that using R-squared as the main gauge of success for an econometric analysis can lead to trouble.
Sometimes the explanatory variable explains a substantial part of the sample variation in the dependent variable.


EXAMPLE 2.9
(Voting Outcomes and Campaign Expenditures)

In the voting outcome equation (2.28), $R^2 = 0.505$. Thus, the share of campaign expenditures explains just over 50 percent of the variation in the election outcomes for this sample. This is a fairly sizable portion.

2.4 UNITS OF MEASUREMENT AND FUNCTIONAL FORM

Two important issues in applied economics are (1) understanding how changing the units of measurement of the dependent and/or independent variables affects OLS estimates, and (2) knowing how to incorporate popular functional forms used in economics into regression analysis. The mathematics needed for a full understanding of functional form issues is reviewed in Appendix A.
The Effects of Changing Units of Measurement on OLS Statistics

In Example 2.3, we chose to measure annual salary in thousands of dollars, and the return on equity was measured as a percent (rather than as a decimal). It is crucial to know how salary and roe are measured in this example in order to make sense of the estimates in equation (2.39).
We must also know that OLS estimates change in entirely predictable ways when the units of measurement of the dependent and independent variables change. In Example 2.3, suppose that, rather than measuring salary in thousands of dollars, we measure it in dollars. Let salardol be salary in dollars (salardol = 845,761 would be interpreted as $845,761). Of course, salardol has a simple relationship to the salary measured in thousands of dollars: salardol = 1,000 $\cdot$ salary. We do not need to actually run the regression of salardol on roe to know that the estimated equation is:

$$\widehat{salardol} = 963{,}191 + 18{,}501\,roe. \qquad (2.40)$$

We obtain the intercept and slope in (2.40) simply by multiplying the intercept and the slope in (2.39) by 1,000. This gives equations (2.39) and (2.40) the same interpretation. Looking at (2.40), if roe = 0, then $\widehat{salardol} = 963{,}191$, so the predicted salary is $963,191 [the same value we obtained from equation (2.39)]. Furthermore, if roe increases by one, then the predicted salary increases by $18,501; again, this is what we concluded from our earlier analysis of equation (2.39).
Generally, it is easy to figure out what happens to the intercept and slope estimates when the dependent variable changes units of measurement. If the dependent variable is multiplied by the constant c (which means each value in the sample is multiplied by c), then the OLS intercept and slope estimates are also multiplied by c. (This assumes nothing has changed about the independent variable.) In the CEO salary example, c = 1,000 in moving from salary to salardol.

We can also use the CEO salary example to see what happens when we change the units of measurement of the independent variable. Define roedec = roe/100 to be the decimal equivalent of roe; thus, roedec = 0.23 means a return on equity of 23 percent.

QUESTION 2.4
Suppose that salary is measured in hundreds of dollars, rather than in thousands of dollars, say salarhun. What will be the OLS intercept and slope estimates in the regression of salarhun on roe?

To focus on changing the units of measurement of the independent variable, we return to our original dependent variable, salary, which is measured in thousands of dollars. When we regress salary on roedec, we obtain

$$\widehat{salary} = 963.191 + 1{,}850.1\,roedec. \qquad (2.41)$$

The coefficient on roedec is 100 times the coefficient on roe in (2.39). This is as it should be. Changing roe by one percentage point is equivalent to $\Delta roedec = 0.01$. From (2.41), if $\Delta roedec = 0.01$, then $\Delta\widehat{salary} = 1{,}850.1(0.01) = 18.501$, which is what is obtained by using (2.39). Note that, in moving from (2.39) to (2.41), the independent variable was divided by 100, and so the OLS slope estimate was multiplied by 100, preserving the interpretation of the equation. Generally, if the independent variable is divided or multiplied by some nonzero constant, c, then the OLS slope coefficient is also multiplied or divided by c, respectively.
The intercept has not changed in (2.41) because roedec = 0 still corresponds to a zero return on equity. In general, changing the units of measurement of only the independent variable does not affect the intercept.
In the previous section, we defined R-squared as a goodness-of-fit measure for OLS regression. We can also ask what happens to $R^2$ when the unit of measurement of either the independent or the dependent variable changes. Without doing any algebra, we should know the result: the goodness-of-fit of the model should not depend on the units of measurement of our variables. For example, the amount of variation in salary explained by the return on equity should not depend on whether salary is measured in dollars or in thousands of dollars, or on whether return on equity is a percent or a decimal. This intuition can be verified mathematically: using the definition of $R^2$, it can be shown that $R^2$ is, in fact, invariant to changes in the units of $y$ or $x$.
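The rescaling rules just described, and the invariance of R-squared, can be checked numerically. This sketch uses made-up data (not the CEO sample):

```python
# Check: multiplying y by c multiplies both OLS estimates by c; dividing x
# by 100 multiplies the slope by 100 and leaves the intercept alone; and
# R-squared is unchanged in both cases. Data are made up.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

def r_squared(x, y):
    b0, b1 = ols(x, y)
    ybar = sum(y) / len(y)
    ssr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ssr / sst

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 3.9, 4.1, 5.6]
b0, b1 = ols(x, y)

# rescale the dependent variable: y in "dollars" instead of "thousands"
b0_dol, b1_dol = ols(x, [1000.0 * yi for yi in y])
assert abs(b0_dol - 1000.0 * b0) < 1e-6 and abs(b1_dol - 1000.0 * b1) < 1e-6

# rescale the independent variable: x divided by 100 (percent -> decimal)
b0_dec, b1_dec = ols([xi / 100.0 for xi in x], y)
assert abs(b0_dec - b0) < 1e-9 and abs(b1_dec - 100.0 * b1) < 1e-9

# R-squared is invariant to both rescalings
assert abs(r_squared(x, y) - r_squared(x, [1000.0 * yi for yi in y])) < 1e-9
assert abs(r_squared(x, y) - r_squared([xi / 100.0 for xi in x], y)) < 1e-9
```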

Incorporating Nonlinearities in Simple Regression

So far, we have focused on linear relationships between the dependent and independent variables. As we mentioned in Chapter 1, linear relationships are not nearly general enough for all economic applications. Fortunately, it is rather easy to incorporate many nonlinearities into simple regression analysis by appropriately defining the dependent and independent variables. Here we will cover two possibilities that often appear in applied work.
In reading applied work in the social sciences, you will often encounter regression equations where the dependent variable appears in logarithmic form. Why is this done? Recall the wage-education example, where we regressed hourly wage on years of education. We obtained a slope estimate of 0.54 [see equation (2.27)], which means that each additional year of education is predicted to increase hourly wage by 54 cents.

Because of the linear nature of (2.27), 54 cents is the increase for either the first year of education or the twentieth year; this may not be reasonable. Suppose, instead, that the percentage increase in wage is the same given one more year of education. Model (2.27) does not imply a constant percentage increase: the percentage increase depends on the initial wage. A model that gives (approximately) a constant percentage effect is

$$\log(wage) = \beta_0 + \beta_1 educ + u, \qquad (2.42)$$

where $\log(\cdot)$ denotes the natural logarithm. (See Appendix A for a review of logarithms.) In particular, if $\Delta u = 0$, then

$$\%\Delta wage \approx (100 \cdot \beta_1)\Delta educ. \qquad (2.43)$$

Notice how we multiply $\beta_1$ by 100 to get the percentage change in wage given one additional year of education. Since the percentage change in wage is the same for each additional year of education, the change in wage for an extra year of education increases as education increases; in other words, (2.42) implies an increasing return to education. By exponentiating (2.42), we can write $wage = \exp(\beta_0 + \beta_1 educ + u)$. This equation is graphed in Figure 2.6, with $u = 0$.

Figure 2.6
$wage = \exp(\beta_0 + \beta_1 educ)$, with $\beta_1 > 0$.

Estimating a model such as (2.42) is straightforward when using simple regression. Just define the dependent variable, $y$, to be $y = \log(wage)$. The independent variable is represented by $x = educ$. The mechanics of OLS are the same as before: the intercept and slope estimates are given by the formulas (2.17) and (2.19). In other words, we obtain $\hat{\beta}_0$ and $\hat{\beta}_1$ from the OLS regression of log(wage) on educ.

EXAMPLE 2.10
(A Log Wage Equation)

Using the same data as in Example 2.4, but with log(wage) as the dependent variable, we obtain the following relationship:

$$\widehat{\log(wage)} = 0.584 + 0.083\,educ \qquad (2.44)$$
$$n = 526,\ R^2 = 0.186.$$

The coefficient on educ has a percentage interpretation when it is multiplied by 100: wage increases by 8.3 percent for every additional year of education. This is what economists mean when they refer to the "return to another year of education."
It is important to remember that the main reason for using the log of wage in (2.42) is to impose a constant percentage effect of education on wage. Once equation (2.44) is obtained, the natural log of wage is rarely mentioned. In particular, it is not correct to say that another year of education increases log(wage) by 8.3%.
The intercept in (2.44) is not very meaningful, as it gives the predicted log(wage) when educ = 0. The R-squared shows that educ explains about 18.6 percent of the variation in log(wage) (not wage). Finally, equation (2.44) might not capture all of the nonlinearity in the relationship between wage and schooling. If there are "diploma effects," then the twelfth year of education (graduation from high school) could be worth much more than the eleventh year. We will learn how to allow for this kind of nonlinearity in Chapter 7.
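Mechanically, estimating a log-level model like (2.42) is just OLS with the logged dependent variable. This sketch uses constructed data (not WAGE1.RAW) that satisfy an exact log-linear relation, so OLS recovers the coefficients and $100 \cdot \hat{\beta}_1$ gives the percent effect:

```python
import math

# Sketch: the log-level model is estimated by ordinary OLS after defining
# y = log(wage). The data here are constructed so that
# log(wage) = 0.5 + 0.08*educ holds exactly (no error term).
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

educ = [8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
wage = [math.exp(0.5 + 0.08 * e) for e in educ]   # exact log-linear relation

b0, b1 = ols(educ, [math.log(w) for w in wage])
# 100*b1 = 8: each year of education raises wage by about 8 percent here
assert abs(b0 - 0.5) < 1e-9 and abs(b1 - 0.08) < 1e-9
```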

a constantelasticitymodel'
A'other i*poitrnt useof thenatu'allog is in obtaining

EXAMPLE 2.11
(CEO Salary and Firm Sales)

We can estimate a constant elasticity model relating CEO salary to firm sales. The data set is the same one used in Example 2.3, except we now relate salary to sales. Let sales be annual firm sales, measured in millions of dollars. A constant elasticity model is

$$\log(salary) = \beta_0 + \beta_1\log(sales) + u, \qquad (2.45)$$

where $\beta_1$ is the elasticity of salary with respect to sales. This model falls under the simple regression model by defining the dependent variable to be $y = \log(salary)$ and the independent variable to be $x = \log(sales)$. Estimating this equation by OLS gives

$$\widehat{\log(salary)} = 4.822 + 0.257\,\log(sales) \qquad (2.46)$$
$$n = 209,\ R^2 = 0.211.$$

The coefficient of log(sales) is the estimated elasticity of salary with respect to sales. It implies that a 1 percent increase in firm sales increases CEO salary by about 0.257 percent, the usual interpretation of an elasticity.
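The log-log estimation can be sketched in the same way. The data below are constructed (not the CEO data) so that salary has a constant elasticity of 0.25 with respect to sales; OLS on the logged variables recovers that elasticity as the slope:

```python
import math

# Sketch: constant elasticity model estimated by OLS of log(salary) on
# log(sales). Constructed data: salary = exp(4.8) * sales**0.25 exactly.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

sales = [100.0, 500.0, 1000.0, 5000.0, 20000.0]        # millions of dollars
salary = [math.exp(4.8) * s ** 0.25 for s in sales]

b0, b1 = ols([math.log(s) for s in sales],
             [math.log(w) for w in salary])
assert abs(b1 - 0.25) < 1e-9    # estimated elasticity
assert abs(b0 - 4.8) < 1e-9
```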

il*[' The two funclionalformscoverecl in thissectionwill oftenarisehr theremainder


this text. We have coverednodels containingnaturallogarithms
appearso frequentlyin appliedwork. The interpretalion
here because
of suchmodelswill not be
they
of

ffifi*i muchdifferentin the multipieregression


It is alsouseful
theunitsof measurement
Becauscthe changeto
to note what
of the
happens
dependent
logarithmic form
to
case.
theinterceptandslopeestimates
variable
approximates
when
a
it appea$
propoltionate
in
if we change
logarithtnic
change, it
form.
makes
1o;,'t''
sensethat nothinghappensto the slope.We can seethisby writing tlie rescaledvati-
ableas c,1,,foreachobservation I' The originalecluation is log(l') = Fo* Brx,* u,'If
:
rveaddlog(cr)to bothsides,rvegetlog(c1)* log('li) [log(r:,)+ Fn] + Br;trt 4,,or
log(cr),i): Uog(cr)+ Frl + B,xi* tti.(Rernember thatthesumof thelogsis equalto
thelog of t.heir product as shown in Appcndix A.) Therefore, theslopcis still B', butthe
interceptis now log(c,) t Bo. Similarly, if the independent vzriableis log(x),and we
ctrangethe unitsof measurement of x before taking the log, the sloperemainsthe same
but ttreinterceptdoes not change, You will be asked to verify tlnese claimsin Problern2.9.
We end this subsection by summarizing four combinations of functional forms available from using either the original variable or its natural log. In Table 2.3, $x$ and $y$ stand for the variables in their original form. The model with $y$ as the dependent variable and $x$ as the independent variable is called the level-level model, because each variable appears in its level form. The model with $\log(y)$ as the dependent variable and $x$ as the independent variable is called the log-level model. We will not explicitly discuss the level-log model here, because it arises less often in practice. In any case, we will see examples of this model in later chapters.

Table 2.3
Summary of Functional Forms Involving Logarithms

Model         Dependent Variable   Independent Variable   Interpretation of $\beta_1$
level-level   $y$                  $x$                    $\Delta y = \beta_1\Delta x$
level-log     $y$                  $\log(x)$              $\Delta y = (\beta_1/100)\%\Delta x$
log-level     $\log(y)$            $x$                    $\%\Delta y = (100\beta_1)\Delta x$
log-log       $\log(y)$            $\log(x)$              $\%\Delta y = \beta_1\%\Delta x$


The last column in Table 2.3 gives the interpretation of $\beta_1$. In the log-level model, $100 \cdot \beta_1$ is sometimes called the semi-elasticity of $y$ with respect to $x$. As we mentioned in Example 2.11, in the log-log model, $\beta_1$ is the elasticity of $y$ with respect to $x$. Table 2.3 warrants careful study, as we will refer to it often in the remainder of the text.
The Meaning of "Linear" Regression

The simple regression model that we have studied in this chapter is also called the simple linear regression model. Yet, as we have just seen, the general model also allows for certain nonlinear relationships. So what does "linear" mean here? You can see by looking at equation (2.1) that $y = \beta_0 + \beta_1 x + u$. The key is that this equation is linear in the parameters, $\beta_0$ and $\beta_1$. There are no restrictions on how $y$ and $x$ relate to the original explained and explanatory variables of interest. As we saw in Examples 2.10 and 2.11, $y$ and $x$ can be natural logs of variables, and this is quite common in applications. But we need not stop there. For example, nothing prevents us from using simple regression to estimate a model such as $cons = \beta_0 + \beta_1\sqrt{inc} + u$, where cons is annual consumption and inc is annual income.
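The consumption model just mentioned illustrates the point: it is nonlinear in inc but linear in the parameters, so simple OLS applies once we define the regressor as $\sqrt{inc}$. A sketch with made-up data that satisfy the relationship exactly:

```python
import math

# "Linear" means linear in the parameters: estimate
# cons = b0 + b1*sqrt(inc) by simple OLS of cons on x = sqrt(inc).
# Made-up data constructed so that cons = 200 + 50*sqrt(inc) exactly.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

inc = [100.0, 400.0, 900.0, 1600.0, 2500.0]
cons = [200.0 + 50.0 * math.sqrt(i) for i in inc]

b0, b1 = ols([math.sqrt(i) for i in inc], cons)
assert abs(b0 - 200.0) < 1e-6 and abs(b1 - 50.0) < 1e-6
```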
While the mechanics of simple regression do not depend on how $y$ and $x$ are defined, the interpretation of the coefficients does depend on their definitions. For successful empirical work, it is much more important to become proficient at interpreting coefficients than to become efficient at computing formulas such as (2.19). We will get much more practice with interpreting the estimates in OLS regression lines when we study multiple regression.
There are plenty of models that cannot be cast as a linear regression model because they are not linear in their parameters; an example is $cons = 1/(\beta_0 + \beta_1 inc) + u$. Estimation of such models takes us into the realm of the nonlinear regression model, which is beyond the scope of this text. For most applications, choosing a model that can be put into the linear regression framework is sufficient.

2.5 EXPECTED VALUES AND VARIANCES OF THE OLS ESTIMATORS

In Section 2.1, we defined the population model $y = \beta_0 + \beta_1 x + u$, and we claimed that the key assumption for simple regression analysis to be useful is that the expected value of $u$ given any value of $x$ is zero. In Sections 2.2, 2.3, and 2.4, we discussed the algebraic properties of OLS estimation. We now return to the population model and study the statistical properties of OLS. In other words, we now view $\hat{\beta}_0$ and $\hat{\beta}_1$ as estimators for the parameters $\beta_0$ and $\beta_1$ that appear in the population model. This means that we will study properties of the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ over different random samples from the population. (Appendix C contains definitions of estimators and reviews some of their important properties.)

Unbiasedness of OLS

We begin by establishing the unbiasedness of OLS under a simple set of assumptions. For future reference, it is useful to number these assumptions using the prefix "SLR" for simple linear regression. The first assumption defines the population model.

ASSUMPTION SLR.1 (LINEAR IN PARAMETERS)
In the population model, the dependent variable $y$ is related to the independent variable $x$ and the error (or disturbance) $u$ as

$$y = \beta_0 + \beta_1 x + u, \qquad (2.47)$$

where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.

To be realistic, $y$, $x$, and $u$ are all viewed as random variables in stating the population model. We discussed the interpretation of this model at some length in Section 2.1 and gave several examples. In the previous section, we learned that equation (2.47) is not as restrictive as it initially seems; by choosing $y$ and $x$ appropriately, we can obtain interesting nonlinear relationships (such as constant elasticity models).
We are interested in using data on $y$ and $x$ to estimate the parameters $\beta_0$ and, especially, $\beta_1$. We assume that our data were obtained as a random sample. (See Appendix C for a review of random sampling.)
sLR .2 (RANDOM sAMPLING)
f;tsuMPrroN r t , i : 1 , 2 , . . . , fnr\o, mt h eP o P u l a t i o n
w. samPIe
.un usea random o f s t z e l ( x , , Y , ) ' .
II
I
mooet. t
I
:lr6ttri j:ri
:f :e-siq\:i:' 't:::":il\<---':i:\--cin'3iT
U<-r-j'bi=-:t Er:'*\i-:'r=-

il r-=ll -rlm.:rrl= =:B: a-::iss

s;arrpbs car:tre rie.r,$ lS JJllJi)r\


We can write (2.47) in terms of the random sample as

$$y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, 2, \ldots, n, \qquad (2.48)$$

where $u_i$ is the error or disturbance for observation $i$ (for example, person $i$, firm $i$, city $i$, etc.). Thus, $u_i$ contains the unobservables for observation $i$ which affect $y_i$. The $u_i$ should not be confused with the residuals, $\hat{u}_i$, that we defined in Section 2.3. Later on, we will explore the relationship between the errors and the residuals. For interpreting $\beta_0$ and $\beta_1$ in a particular application, (2.47) is most informative, but (2.48) is also needed for some of the statistical derivations.
The relationship (2.48) can be plotted for a particular outcome of data as shown in Figure 2.7.
In order to obtain unbiased estimators of $\beta_0$ and $\beta_1$, we need to impose the zero conditional mean assumption that we discussed in some detail in Section 2.1. We now explicitly add it to our list of assumptions.

ASSUMPTION SLR.3 (ZERO CONDITIONAL MEAN)
$$E(u|x) = 0.$$

Figure 2.7
Graph of $y_i = \beta_0 + \beta_1 x_i + u_i$, showing the population regression line $E(y|x) = \beta_0 + \beta_1 x$.

For a random sample, this assumption implies that $E(u_i|x_i) = 0$, for all $i = 1, 2, \ldots, n$.
In addition to restricting the relationship between $u$ and $x$ in the population, the zero conditional mean assumption, coupled with the random sampling assumption, allows for a convenient technical simplification. In particular, we can derive the statistical properties of the OLS estimators as conditional on the values of the $x_i$ in our sample. Technically, in statistical derivations, conditioning on the sample values of the independent variable is the same as treating the $x_i$ as fixed in repeated samples. This process involves several steps. We first choose $n$ sample values for $x_1, x_2, \ldots, x_n$ (these can be repeated). Given these values, we then obtain a sample on $y$ (effectively by obtaining a random sample of the $u_i$). Next, another sample of $y$ is obtained, using the same values for $x_1, \ldots, x_n$. Then another sample of $y$ is obtained, again using the same $x_i$. And so on.
The fixed in repeated samples scenario is not very realistic in nonexperimental contexts. For instance, in sampling individuals for the wage-education example, it makes little sense to think of choosing the values of educ ahead of time and then sampling individuals with those particular levels of education. Random sampling, where individuals are chosen randomly and their wage and education are both recorded, is representative of how most data sets are obtained for empirical analysis in the social sciences.
Once we assume that $E(u|x) = 0$, and we have random sampling, nothing is lost in derivations by treating the $x_i$ as nonrandom. The danger is that the fixed in repeated samples assumption always implies that $u_i$ and $x_i$ are independent. In deciding when
simple regression analysis is going to produce unbiased estimators, it is critical to think in terms of Assumption SLR.3.
Once we have agreed to condition on the $x_i$, we need one final assumption for unbiasedness.

ASSUMPTION SLR.4 (SAMPLE VARIATION IN THE INDEPENDENT VARIABLE)
In the sample, the independent variables $x_i$, $i = 1, 2, \ldots, n$, are not all equal to the same constant. This requires some variation in $x$ in the population.

We encountered Assumption SLR.4 when we derived the formulas for the OLS estimators: it is equivalent to $\sum_{i=1}^{n}(x_i - \bar{x})^2 > 0$. Of the four assumptions made, this is the least important because it essentially never fails in interesting applications. If Assumption SLR.4 does fail, we cannot compute the OLS estimators, which means statistical analysis is irrelevant.
Using the fact that $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i - \bar{x})y_i$ (see Appendix A), we can write the OLS slope estimator in equation (2.19) as

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}. \qquad (2.49)$$
Because we are now interested in the behavior of $\hat{\beta}_1$ across all possible samples, $\hat{\beta}_1$ is properly viewed as a random variable.
We can write $\hat{\beta}_1$ in terms of the population coefficients and errors by substituting the right hand side of (2.48) into (2.49). We have

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{s_x^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{s_x^2}, \qquad (2.50)$$

where we have defined the total variation in $x_i$ as $s_x^2 \equiv \sum_{i=1}^{n}(x_i - \bar{x})^2$ in order to simplify the notation. (This is not quite the sample variance of the $x_i$ because we do not divide by $n - 1$.) Using the algebra of the summation operator, write the numerator of $\hat{\beta}_1$ as

$$\sum_{i=1}^{n}(x_i - \bar{x})\beta_0 + \sum_{i=1}^{n}(x_i - \bar{x})\beta_1 x_i + \sum_{i=1}^{n}(x_i - \bar{x})u_i = \beta_0\sum_{i=1}^{n}(x_i - \bar{x}) + \beta_1\sum_{i=1}^{n}(x_i - \bar{x})x_i + \sum_{i=1}^{n}(x_i - \bar{x})u_i. \qquad (2.51)$$

As shown in Appendix A, $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x})x_i = \sum_{i=1}^{n}(x_i - \bar{x})^2 = s_x^2$. Therefore, we can write the numerator of $\hat{\beta}_1$ as $\beta_1 s_x^2 + \sum_{i=1}^{n}(x_i - \bar{x})u_i$. Writing this over the denominator gives

$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \bar{x})u_i}{s_x^2} = \beta_1 + (1/s_x^2)\sum_{i=1}^{n} d_i u_i, \qquad (2.52)$$

where $d_i = x_i - \bar{x}$. We now see that the estimator $\hat{\beta}_1$ equals the population slope $\beta_1$, plus a term that is a linear combination of the errors $\{u_1, u_2, \ldots, u_n\}$. Conditional on the values of $x_i$, the randomness in $\hat{\beta}_1$ is due entirely to the errors in the sample. The fact that these errors are generally different from zero is what causes $\hat{\beta}_1$ to differ from $\beta_1$.
Using the representation in (2.52), we can prove the first important statistical property of OLS.

THEOREM 2.1 (UNBIASEDNESS OF OLS)
Using Assumptions SLR.1 through SLR.4,

$$E(\hat{\beta}_0) = \beta_0, \quad \text{and} \quad E(\hat{\beta}_1) = \beta_1, \qquad (2.53)$$

for any values of $\beta_0$ and $\beta_1$. In other words, $\hat{\beta}_0$ is unbiased for $\beta_0$, and $\hat{\beta}_1$ is unbiased for $\beta_1$.

PROOF: In this proof, the expected values are conditional on the sample values of the independent variable. Since $s_x^2$ and $d_i$ are functions only of the $x_i$, they are nonrandom in the conditioning. Therefore, from (2.52),

$$E(\hat{\beta}_1) = \beta_1 + E\Big[(1/s_x^2)\sum_{i=1}^{n} d_i u_i\Big] = \beta_1 + (1/s_x^2)\sum_{i=1}^{n} E(d_i u_i) = \beta_1 + (1/s_x^2)\sum_{i=1}^{n} d_i E(u_i) = \beta_1 + (1/s_x^2)\sum_{i=1}^{n} d_i \cdot 0 = \beta_1,$$

where we have used the fact that the expected value of each $u_i$ (conditional on $\{x_1, x_2, \ldots, x_n\}$) is zero under Assumptions SLR.2 and SLR.3.
The proof for $\hat{\beta}_0$ is now straightforward. Average (2.48) across $i$ to get $\bar{y} = \beta_0 + \beta_1\bar{x} + \bar{u}$, and plug this into the formula for $\hat{\beta}_0$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \beta_0 + \beta_1\bar{x} + \bar{u} - \hat{\beta}_1\bar{x} = \beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{u}.$$

Then, conditional on the values of the $x_i$,

$$E(\hat{\beta}_0) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)\bar{x}] + E(\bar{u}) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)]\bar{x},$$

since $E(\bar{u}) = 0$ by Assumptions SLR.2 and SLR.3. But we showed that $E(\hat{\beta}_1) = \beta_1$, which implies that $E[(\beta_1 - \hat{\beta}_1)] = 0$. Thus, $E(\hat{\beta}_0) = \beta_0$. Both of these arguments are valid for any values of $\beta_0$ and $\beta_1$, and so we have established unbiasedness.
Remember that unbiasedness is a feature of the sampling distributions of $\hat{\beta}_1$ and $\hat{\beta}_0$, which says nothing about the estimate that we obtain for a given sample. We hope that, if the sample we obtain is somehow "typical," then our estimate should be "near" the population value. Unfortunately, it is always possible that we could obtain an unlucky sample that would give us a point estimate far from $\beta_1$, and we can never know for sure whether this is the case. You may want to review the material on unbiased estimators in Appendix C, especially the simulation exercise in Table C.1 that illustrates the concept of unbiasedness.
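Theorem 2.1 can be illustrated with a short Monte Carlo experiment in the spirit of the Table C.1 exercise. The sketch below is ours, not from the text: the parameter values, sample size, and use of Python/NumPy are all illustrative choices. It draws many samples with $E(u_i) = 0$, holding the $x_i$ fixed, and averages the OLS slope estimates:

```python
import numpy as np

# Monte Carlo sketch of Theorem 2.1: with E(u|x) = 0 and random sampling,
# OLS slope estimates average out to the true beta1.
# All parameter values here are illustrative choices, not from the text.
rng = np.random.default_rng(0)
beta0, beta1 = 3.0, 2.0
n, n_reps = 100, 10_000

x = rng.uniform(0.0, 10.0, size=n)  # condition on one draw of the x_i
slopes = np.empty(n_reps)
for r in range(n_reps):
    u = rng.normal(0.0, 1.0, size=n)  # E(u_i) = 0 by construction
    y = beta0 + beta1 * x + u
    # OLS slope: sum of (x_i - xbar)(y_i - ybar) divided by SST_x
    d = x - x.mean()
    slopes[r] = np.sum(d * (y - y.mean())) / np.sum(d**2)

avg_slope = slopes.mean()
print(avg_slope)  # very close to beta1 = 2.0
```

Any single slope estimate can land noticeably above or below 2.0; unbiasedness only pins down the average across repeated samples, which is exactly what the experiment shows.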
Unbiasedness generally fails if any of our four assumptions fail. This means that it is important to think about the veracity of each assumption for a particular application. As we have already discussed, if Assumption SLR.4 fails, then we will not be able to obtain the OLS estimates. Assumption SLR.1 requires that $y$ and $x$ be linearly related, with an additive disturbance. This can certainly fail. But we also know that $y$ and $x$ can be chosen to yield interesting nonlinear relationships. Dealing with the failure of (2.47) requires more advanced methods that are beyond the scope of this text.

Later, we will have to relax Assumption SLR.2, the random sampling assumption, for time series analysis. But what about using it for cross-sectional analysis? Random sampling can fail in a cross section when samples are not representative of the underlying population; in fact, some data sets are constructed by intentionally oversampling different parts of the population. We will discuss problems of nonrandom sampling in Chapters 9 and 17.
The assumption we should concentrate on for now is SLR.3. If SLR.3 holds, the OLS estimators are unbiased. Likewise, if SLR.3 fails, the OLS estimators generally will be biased. There are ways to determine the likely direction and size of the bias, which we will study in Chapter 3.

The possibility that $x$ is correlated with $u$ is almost always a concern in simple regression analysis with nonexperimental data, as we indicated with several examples in Section 2.1. Using simple regression when $u$ contains factors affecting $y$ that are also correlated with $x$ can result in spurious correlation: that is, we find a relationship between $y$ and $x$ that is really due to other unobserved factors that affect $y$ and also happen to be correlated with $x$.

EXAMPLE 2.12
(Student Math Performance and the School Lunch Program)

Let $\mathit{math10}$ denote the percentage of tenth graders at a high school receiving a passing score on a standardized mathematics exam. Suppose we wish to estimate the effect of the federally funded school lunch program on student performance. If anything, we expect the lunch program to have a positive ceteris paribus effect on performance: all other factors being equal, if a student who is too poor to eat regular meals becomes eligible for the school lunch program, his or her performance should improve. Let $\mathit{lnchprg}$ denote the percentage of students who are eligible for the lunch program. Then a simple regression model is

$$\mathit{math10} = \beta_0 + \beta_1\,\mathit{lnchprg} + u, \qquad (2.54)$$

Part 1  Regression Analysis with Cross-Sectional Data

where $u$ contains school and student characteristics that affect overall school performance. Using the data in MEAP93.RAW on 408 Michigan high schools for the 1992–93 school year, we obtain

$$\widehat{\mathit{math10}} = 32.14 - 0.319\,\mathit{lnchprg}$$

$$n = 408,\quad R^2 = 0.171.$$

This equation predicts that if student eligibility in the lunch program increases by 10 percentage points, the percentage of students passing the math exam falls by about 3.2 percentage points. Do we really believe that higher participation in the lunch program actually causes worse performance? Almost certainly not. A better explanation is that the error term $u$ in equation (2.54) is correlated with $\mathit{lnchprg}$. In fact, $u$ contains factors such as the poverty rate of children attending school, which affects student performance and is highly correlated with eligibility in the lunch program. Variables such as school quality and resources are also contained in $u$, and these are likely correlated with $\mathit{lnchprg}$. It is important to remember that the estimate $-0.319$ is only for this particular sample, but its sign and magnitude make us suspect that $u$ and $x$ are correlated, so that simple regression is biased.

In addition to omitted variables, there are other reasons for $x$ to be correlated with $u$ in the simple regression model. Since the same issues arise in multiple regression analysis, we will postpone a systematic treatment of the problem until then.
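The mechanism just described can be mimicked in a small simulation. Everything below is hypothetical: an unobserved factor $z$ (a stand-in for something like the poverty rate in Example 2.12) is built to raise $x$ and to lower $y$ through $u$, while the true ceteris paribus effect $\beta_1$ is set to zero:

```python
import numpy as np

# Hypothetical sketch of omitted-variable bias: an unobserved factor z
# lowers y through u and is positively correlated with x. The true
# ceteris paribus effect of x on y is zero, yet the simple regression
# slope comes out negative. All numbers are made up for illustration.
rng = np.random.default_rng(1)
beta0, beta1 = 32.0, 0.0
n, n_reps = 500, 2_000

slopes = np.empty(n_reps)
for r in range(n_reps):
    z = rng.normal(0.0, 1.0, size=n)
    x = 50.0 + 10.0 * z + rng.normal(0.0, 5.0, size=n)  # x rises with z
    u = -5.0 * z + rng.normal(0.0, 2.0, size=n)          # z lowers y, but sits in u
    y = beta0 + beta1 * x + u
    d = x - x.mean()
    slopes[r] = np.sum(d * (y - y.mean())) / np.sum(d**2)

avg_slope = slopes.mean()
print(avg_slope)  # about Cov(x, u)/Var(x) = -50/125 = -0.4, not the true 0
```

The estimated slopes cluster around $\mathrm{Cov}(x,u)/\mathrm{Var}(x) = -0.4$ rather than around the true value of zero: the regression picks up a relationship between $y$ and $x$ that is really due to the omitted factor, which is the spurious correlation described above.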

Variances of the OLS Estimators


In addition to knowing that the sampling distribution of $\hat{\beta}_1$ is centered about $\beta_1$ ($\hat{\beta}_1$ is unbiased), it is important to know how far we can expect $\hat{\beta}_1$ to be away from $\beta_1$ on average. Among other things, this allows us to choose the best estimator among all, or at least a broad class of, the unbiased estimators. The measure of spread in the distribution of $\hat{\beta}_1$ (and $\hat{\beta}_0$) that is easiest to work with is the variance or its square root, the standard deviation. (See Appendix C for a more detailed discussion.)

It turns out that the variance of the OLS estimators can be computed under Assumptions SLR.1 through SLR.4. However, these expressions would be somewhat complicated. Instead, we add an assumption that is traditional for cross-sectional analysis. This assumption states that the variance of the unobservable, $u$, conditional on $x$, is constant. This is known as the homoskedasticity or "constant variance" assumption.

ASSUMPTION SLR.5 (HOMOSKEDASTICITY)

$$\mathrm{Var}(u|x) = \sigma^2.$$
We must emphasize that the homoskedasticity assumption is quite distinct from the zero conditional mean assumption, $E(u|x) = 0$. Assumption SLR.3 involves the expected value of $u$, while Assumption SLR.5 concerns the variance of $u$ (both conditional on $x$). Recall that we established the unbiasedness of OLS without Assumption SLR.5: the homoskedasticity assumption plays no role in showing that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We add Assumption SLR.5 because it simplifies the variance calculations for


$\hat{\beta}_0$ and $\hat{\beta}_1$ and because it implies that ordinary least squares has certain efficiency properties, which we will see in Chapter 3. If we were to assume that $u$ and $x$ are independent, then the distribution of $u$ given $x$ does not depend on $x$, and so $E(u|x) = E(u) = 0$ and $\mathrm{Var}(u|x) = \sigma^2$. But independence is sometimes too strong an assumption.

Because $\mathrm{Var}(u|x) = E(u^2|x) - [E(u|x)]^2$ and $E(u|x) = 0$, we have $\sigma^2 = E(u^2|x)$, which means $\sigma^2$ is also the unconditional expectation of $u^2$. Therefore, $\sigma^2 = E(u^2) = \mathrm{Var}(u)$, because $E(u) = 0$. In other words, $\sigma^2$ is the unconditional variance of $u$, and so $\sigma^2$ is often called the error variance or disturbance variance. The square root of $\sigma^2$, $\sigma$, is the standard deviation of the error. A larger $\sigma$ means that the distribution of the unobservables affecting $y$ is more spread out.
It is often useful to write Assumptions SLR.3 and SLR.5 in terms of the conditional mean and conditional variance of $y$:

$$E(y|x) = \beta_0 + \beta_1 x. \qquad (2.55)$$

$$\mathrm{Var}(y|x) = \sigma^2. \qquad (2.56)$$

In other words, the conditional expectation of $y$ given $x$ is linear in $x$, but the variance of $y$ given $x$ is constant. This situation is graphed in Figure 2.8, where $\beta_0 > 0$ and $\beta_1 > 0$.

[Figure 2.8: The simple regression model under homoskedasticity. The figure shows the conditional mean line $E(y|x) = \beta_0 + \beta_1 x$.]