
Partitioned Regression: FWL Theorem

(Handout Version)∗

Walter Belluzzo
Econ 507 Econometric Analysis
Spring 2013

1 Introduction and Motivation


• In this lecture we use projections to present the Frisch-Waugh-Lovell (FWL) theorem.

• The main result of the FWL theorem is sometimes called Partitioned Regression
because only part of the covariates are included explicitly in the estimated model.

• This theorem is very useful in regression analysis because it allows us to partition the
covariates to facilitate derivation.

• We will develop the ideas underlying the FWL theorem starting from the properties of
least squares projections and the linear transformation of the regressors.

2 Linear Transformations of the Regressors


• Let us consider a linear transformation of X, obtained by post-multiplying X by any
nonsingular k × k matrix A.

• We are interested in understanding how this linear transformation of the regressors changes
the vector of fitted values ŷ and the residuals vector û.

• Let A be partitioned by its columns: A = [a1 , . . . , ak ].

• Then, the linear transformation of X can be written as

XA = [Xa1 Xa2 . . . Xak ] .

• Note that each ai is a k-vector, just like β, and each block Xai is an n-vector, which is a
linear combination of the columns of X.
∗ This lecture is based on D & M’s Chapter 2.

Linear Transformations and the Column Space
• Since S(X) contains all linear combinations of the columns of X, every Xai must be in S(X).

• Now, take any w ∈ S(X). Since w = Xb for some b ∈ Rᵏ (why?), we can write
Xb = X(AA⁻¹)b = (XA)(A⁻¹b).

• Note that A−1 b is a k-vector, so that the RHS is actually a linear combination of the
columns of XA, and thus it is in S(XA).

• As a result, we can conclude that every element of S(X) is in S(XA) and vice-versa. That
is, these two subspaces must be identical.

• Note: consequently, the projection of y on to either of these subspaces is the same.


• We can easily verify that the projections Px and Pxa are the same:

    Pxa = XA(A′X′XA)⁻¹A′X′
        = XA A⁻¹(X′X)⁻¹(A′)⁻¹A′X′
        = X(X′X)⁻¹X′
        = Px ,

and the equality of Mxa and Mx follows directly (since M = I − P).

• Because S(X) and S⊥ (X) are not affected by the linear transformation, the fitted values
and residuals vectors are the same regardless of whether we use X or XA as regressors.

Linear Transformations and the OLS Estimates


• It is important to note that even though ŷ and û are invariant to linear transformations
of the regressors, the vector of OLS estimates will change.

• Remember that we can write linear transformations of the columns as Xb for any (real)
k-vector.

• Replacing b by β̂ does not change our previous argument, so that we can write


ŷ = Xβ̂ = X(AA⁻¹)β̂ = XA(A⁻¹β̂).

• Thus, we see that replacing X by XA requires adjusting the vector of OLS estimates by
A⁻¹, so that ŷ will remain unchanged.
• Alternatively, we can substitute XA into the expression for β̂ and obtain the same result.
Writing β̃ for the OLS estimate from the regression on XA,

    β̃ = ((XA)′(XA))⁻¹(XA)′y
      = (A′X′XA)⁻¹A′X′y
      = A⁻¹(X′X)⁻¹(A′)⁻¹A′X′y    (why?)
      = A⁻¹(X′X)⁻¹X′y
      = A⁻¹β̂.
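
• The following short Python/NumPy sketch (not part of D & M; the data, the seed, and the
matrix A below are made up for illustration) checks both results numerically: the fitted
values from X and from XA coincide, and the estimates from XA equal A⁻¹β̂.

    # Sketch: invariance of fitted values under a nonsingular transformation of X,
    # and the corresponding adjustment of the OLS estimates.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

    A = np.array([[1.0, 0.3, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.5, 0.0, 1.0]])      # any nonsingular k x k matrix

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)        # regression on X
    beta_tilde, *_ = np.linalg.lstsq(X @ A, y, rcond=None)  # regression on XA

    print(np.allclose(X @ beta_hat, (X @ A) @ beta_tilde))        # same fitted values
    print(np.allclose(beta_tilde, np.linalg.solve(A, beta_hat)))  # beta_tilde = A^{-1} beta_hat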

Linear Transformations and the Regression Intercept
• The invariance of ŷ and û to nonsingular linear transformations can be extended to the
addition of a constant amount to one or more of the regressors, provided that the regression
equation includes an intercept.

• Consider the n-vector β1 ι + β2 x.

• Note that adding β1 ι to β2 x is equivalent to adding the constant β1 to each element of β2 x.

• Thus, if ι is a regressor, we can always write a matrix A that produces the effect of adding
a constant to a regressor.
• For instance, with k = 3 we have:

              ⎡ x11  x12  x13 ⎤
              ⎢ x21  x22  x23 ⎥  ⎡ a11  a12  a13 ⎤
         XA = ⎢   ⋮    ⋮    ⋮ ⎥  ⎢ a21  a22  a23 ⎥
              ⎣ xn1  xn2  xn3 ⎦  ⎣ a31  a32  a33 ⎦

• The resulting n × k matrix is

         ⎡ Σi ai1 x1i   Σi ai2 x1i   Σi ai3 x1i ⎤
         ⎢      ⋮            ⋮            ⋮     ⎥
         ⎣ Σi ai1 xni   Σi ai2 xni   Σi ai3 xni ⎦

where each sum runs over i = 1, . . . , k.

• Setting xi1 = 1 for all i and writing the first row of XA as a column (just to facilitate
typesetting), we get:

    ⎡ a11 + a21 x12 + a31 x13 ⎤
    ⎢ a12 + a22 x12 + a32 x13 ⎥
    ⎣ a13 + a23 x12 + a33 x13 ⎦

• Note that the j-th column of XA includes the additive constant a1j (an element of the first
row of A), and therefore we can always accommodate any additive constant in x2 and/or x3
by choosing suitable a’s.

• For instance, to add c to x3 we can make

         ⎡ 1  0  c ⎤
    A =  ⎢ 0  1  0 ⎥
         ⎣ 0  0  1 ⎦
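
• A small numerical sketch of this construction (simulated data; the value of c and the
coefficients are arbitrary): post-multiplying by the A above really does add c·ι to the
third column, and the fitted values are unchanged.

    # Sketch: with an intercept column, A adds the constant c to the third regressor.
    import numpy as np

    rng = np.random.default_rng(1)
    n, c = 100, 5.0
    X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
    y = 1.0 + 0.5 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(size=n)

    A = np.array([[1.0, 0.0, c],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

    XA = X @ A
    print(np.allclose(XA[:, 2], X[:, 2] + c))   # third column is x3 + c*iota

    def fitted(Z):
        """Return the OLS fitted values from regressing y on Z."""
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return Z @ coef

    print(np.allclose(fitted(X), fitted(XA)))   # identical fitted values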

What happens to the OLS Estimates?


• We know that ŷ and û are invariant to linear transformations, but β̂ will change to
accommodate the changes.

• Let us consider now the role of the intercept in the adjustment necessary in the vector β̂.

• Consider a simple regression model,


yi = β1 + β2 xi + ui . (1)

• Adding a constant a to the regressor will change the OLS estimates. So we can write the
corresponding model as
yi = γ1 + γ2 (xi + a) + ui .
• The OLS estimators for γ2 and β2 are the same,

    γ̂2 = β̂2 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² ,

because adding a constant to x does not change the deviations from the mean:

    (xi + a) − (1/n) Σj (xj + a) = xi − x̄ .

• But the estimates of the intercepts do change:

    β̂1 = ȳ − β̂2 x̄ ≠ ȳ − β̂2 x̄ − β̂2 a = γ̂1 ,

and the last equality follows from the fact that γ̂2 = β̂2 and that the sample mean of x + a is x̄ + a.
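
• The relations γ̂2 = β̂2 and γ̂1 = β̂1 − β̂2 a are easy to check numerically; the sketch
below uses simulated data with arbitrary parameter values.

    # Sketch: adding a constant a to the regressor leaves the slope unchanged and
    # shifts the intercept estimate by -beta2_hat * a.
    import numpy as np

    rng = np.random.default_rng(2)
    n, a = 200, 3.0
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)

    X  = np.column_stack([np.ones(n), x])
    Xa = np.column_stack([np.ones(n), x + a])

    b_hat = np.linalg.lstsq(X,  y, rcond=None)[0]   # (beta1_hat, beta2_hat)
    g_hat = np.linalg.lstsq(Xa, y, rcond=None)[0]   # (gamma1_hat, gamma2_hat)

    print(np.allclose(g_hat[1], b_hat[1]))                  # slopes identical
    print(np.allclose(g_hat[0], b_hat[0] - b_hat[1] * a))   # gamma1 = beta1 - beta2*a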

What Happens to the Intercept?


• What is happening is that the intercept will assimilate any additive constant term we toss
into the regression equation.

• It is useful, to aid our understanding, to derive this result directly from our simple regression
model.

• Let zi = xi + c. Then,

    yi = β1 + β2 (zi − c) + ui = (β1 − β2 c) + β2 zi + ui ,

where zi − c = xi and (β1 − β2 c) plays the role of γ1 . That is, running a regression of y on x
is equivalent to running a regression of y on z with intercept β1 − β2 c.

• What if E(u) ≠ 0? Can you cast this case into the additive constant term framework?
• We can also illustrate these ideas geometrically, using one of the previous diagrams with
ι as one of the regressors:


    [Figure: geometric illustration with ι and x2 as regressors and z = x2 + cι; the component
    β̂2 cι is absorbed into the coefficient on ι (α̂1 versus β̂1 ), so the fitted value ŷ is unchanged.]

More on Adding Constants to Regressors

• Because there is no restriction on the arbitrary placement of the vector ι in our figure, we
can simply rotate the whole picture at will.

• Thus, we can replace ι with a non-constant regressor x1 .

• Note, however, that in the simple regression example illustrated in our figure the trans-
formation cannot be the addition of a constant, because there would be no intercept.

• Without an intercept, adding a constant to a regressor will change other coefficients in


the β̂ vector.

Linear Transformations When There is no Intercept

• Even without an intercept, we can still have α̂2 = β̂2 and ŷ invariant, provided the transformation
produces the necessary adjustment in the coefficient of the other variable.

• But what sort of transformation would produce such adjustment?

• The answer is evident once you can see that the magic happens only because we can write
model (1) as
yi = (β1 − β2 c) + β2 zi + ui . (2)

• Consider for instance the “interceptless” model

yi = β1 x1 + β2 x2 + ui .

• Try to rewrite it for z = x2 + c, just like we did before. Can you get something like model
(2)?

• Substituting x2 = z − c, as we did before, we get

yi = β1 x1 + β2 zi − β2 c + ui . (3)

• Now we cannot get rid of the −β2 c term by absorbing it into an existing coefficient while
keeping the model unchanged.

• Note that −β2 c works like an intercept in the transformed model, so that we now have 3
parameters.

• It should be clear to you at this point that the subspaces spanned by the regressors before
and after the transformation have different dimensions.

• If the new column space is different, the fitted values and residuals vectors from the
transformed model will not be the same as those from the original model.

• With the intercept in place, x1 = ι, and we can simply combine β1 and β2 c into a single
parameter, while keeping the column space of X intact.

• When x1 ≠ ι we can get the same result only if the transformation is a linear function of
the other regressor.

• For instance, making z = x2 + cx1 we get

y = β1 x1 + β2 (z − cx1 ) + residuals

= (β1 − cβ2 )x1 + β2 z + residuals

• Now we see that the effect of the transformation “migrates” to the coefficient of x1 , leaving
β2 intact in the transformed model.

• But note that β2 is associated with z in the transformed model, and not x2 as in the
original model.

• What is the magic producing the result here?

• Because z is a linear combination of x1 and x2 , the subspace spanned by x1 and z is
the same as the subspace spanned by x1 and x2 , that is, S(x1 , x2 ) = S(x1 , z).

Generalizing to Multiple Regression Models

• This result is valid for the general multiple regression model.

• Let us partition X such that X = [X1 X2 ], where X1 is n × k1 and X2 is n × k2 , with


k1 + k2 = k.

• Partitioning the parameter vector accordingly, the regression model can be written as

    y = X1 β1 + X2 β2 + u.

• Then, for a k1 × k2 matrix A, adding X1 A to X2 while keeping X1 intact will not change
the estimates β̂2 , nor ŷ and û.

• Do not confuse this A matrix with the k × k matrix we defined to get a nonsingular linear
transformation of X. This k1 × k2 matrix is actually part of that previous one.

• If we partition A accordingly, for the model above we get

         ⎡ A11  A12 ⎤   ⎡ I1  A12 ⎤
    A =  ⎢          ⎥ = ⎢         ⎥ ,
         ⎣ A21  A22 ⎦   ⎣ O   I2  ⎦

where A11 = I1 is k1 × k1 , A12 is k1 × k2 , A21 = O is k2 × k1 , and A22 = I2 is k2 × k2 .

• So, it might be better to call it A12 instead.
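
• The following sketch (simulated data, arbitrary A12) illustrates the claim: adding X1 A12
to X2 , while keeping X1 intact, changes β̂1 but leaves β̂2 , ŷ, and û unchanged.

    # Sketch: partitioned transformation [X1, X2] -> [X1, X2 + X1 A12].
    import numpy as np

    rng = np.random.default_rng(3)
    n, k1, k2 = 80, 2, 2
    X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
    X2 = rng.normal(size=(n, k2))
    y  = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([-1.0, 2.0]) + rng.normal(size=n)

    A12 = rng.normal(size=(k1, k2))             # arbitrary k1 x k2 matrix

    X  = np.column_stack([X1, X2])
    Xt = np.column_stack([X1, X2 + X1 @ A12])   # transformed regressor matrix

    b  = np.linalg.lstsq(X,  y, rcond=None)[0]
    bt = np.linalg.lstsq(Xt, y, rcond=None)[0]

    print(np.allclose(b[k1:], bt[k1:]))         # beta2_hat unchanged
    print(np.allclose(X @ b, Xt @ bt))          # same fitted values (hence same residuals)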

3 Deviations from the mean
• We know that adding/subtracting a constant to a regressor does not change fitted values
or residuals. Moreover, parameter estimates other than the intercept will not change
either.

• Because the mean of a regressor is a constant for a given sample, running a regression on
deviations from the mean should affect only the intercept estimate.

• Consider a simple regression model y = β1 ι + β2 x + u.

• Replacing x by z = x − x̄, the estimated model can be written as

y = α̂1 ι + β̂2 z + û,

where we have used the invariance of β̂2 and û.

Centering and Orthogonal Projections

• An interesting characteristic of the centered variable z is that it is orthogonal to the vector


ι.

• To see this, note that

    ι′z = ι′(x − x̄ι) = ι′x − x̄ ι′ι
        = n x̄ − n x̄ = 0.

• Therefore, even though x is not orthogonal to ι, centering will transform it into an or-
thogonal vector.

• In fact, centering can be performed by premultiplying x by an orthogonal projection
matrix:

    Mι x = (I − Pι )x = x − ι(ι′ι)⁻¹ι′x
         = x − x̄ι = z.
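
• A quick numerical sketch of centering as a projection (the data below are simulated and
purely illustrative):

    # Sketch: M_iota x equals x - xbar*iota, and the centered vector is orthogonal to iota.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 60
    x = rng.normal(loc=10.0, size=n)
    iota = np.ones(n)

    P_iota = np.outer(iota, iota) / n          # iota (iota'iota)^{-1} iota'
    M_iota = np.eye(n) - P_iota

    z = M_iota @ x
    print(np.allclose(z, x - x.mean()))        # M_iota x = x - xbar * iota
    print(np.isclose(iota @ z, 0.0))           # iota'z = 0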

Why should we center variables?

• The main reason for centering covariates is that the resulting change in the intercept gives
it a special interpretation, which is very convenient sometimes.

• Because zi = 0 when xi = x̄, the intercept of the centered regression equals the mean of y
conditional on x = x̄.

• Remember that E(y|x) = α1 + α2 z (with α2 = β2 ).

• Thus, setting x = x̄ makes z = 0, so that E(y | x = x̄) = α1 .

Graphic Illustration of the Effect of Centering

• The figure below depicts a scatter plot of 50 observations simulated from our model,
together with the estimated regression line.

• Note that the transformed regression model is

    y = (β1 + β2 x̄) ι + β2 z + u,

where the intercept is α1 = β1 + β2 x̄.

• What we are really doing is shifting the horizontal axis so that the origin is at x̄.

    [Figure: the same scatter plot and fitted line in the original coordinates (intercept β̂1 at
    x = 0) and in the centered coordinates (intercept α̂1 = β̂1 + β̂2 x̄ at x = x̄).]

4 Orthogonal Regressors
• A very special feature of the regression on centered covariates is that the regressors are
orthogonal to ι (remember that ι′z = 0).

• Because the regressors are orthogonal, the estimate α̂2 in our simple regression example
is the same whether or not we include the regressor ι.

• To see why this is true, remember that we are decomposing y into three vectors:

y = ŷ + û = α̂1 ι + α̂2 z + û.

• Remember that ŷ is the projection of y on to S(z).

• But projecting y on to S(z) annihilates both α̂1 ι and û, because these vectors are orthog-
onal to S(z).

• As a result, we are actually projecting α̂2 z on to S(z), which results in α̂2 z itself. (why?)

• It follows, then, that α̂2 is the same as given by regressing y on both ι and z.

• For any regressors x1 and x2 we can illustrate geometrically, what happens as we transform
one of them to get orthogonality:


x2 z

β̂2 x2 β̂2 z

θ1
O α̂1 β̂1 x1 x1

Orthogonal Projection on to a Subspace of the Image


Theorem 1. Let Pz and Pw be the orthogonal projection matrices that project any vector in Eⁿ
on to S(Z) and S(W), respectively. Then, if S(Z) is a subspace of S(W), Pz Pw = Pw Pz = Pz .

Proof. Since S(Z) ⊂ S(W), projecting the columns of Z on to S(W) yields Z again, that is,
Pw Z = Z. As a result, we can write Pw Pz = Pw Z(Z′Z)⁻¹Z′ = Z(Z′Z)⁻¹Z′ = Pz .
Now, transposing the previous result we obtain Pz′ = (Pw Pz )′ = Pz′ Pw′ , and the result follows
by noting that Pz and Pw are symmetric.

• Note that DM derive this for Px and P1 , stressing that orthogonality of X1 and X2 is
not necessary to obtain the result, but only symmetry of the corresponding orthogonal
projection matrices.

• I changed the notation to Z and W to avoid any confusion with this. The result presented
above uses only the relevant property of the projections Pz and Pw when S(Z) ⊂ S(W).

• Confusion may arise because orthogonality is linked to symmetry. It can be shown
that a square matrix P (i) is a projection matrix iff it is idempotent, and (ii) is an orthogonal
projection matrix iff it is idempotent and symmetric.

• It should be clear that symmetry of the projection matrices is not linked to orthogonality
of Z and W in the preceding derivation.

• We consider here only P, but the same results hold for the complementary projection.
You should check this (DM Exercise 2.15).
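
• A quick numerical check of Theorem 1 (illustrative only; the matrices below are simulated,
and Z is taken to be a subset of the columns of W so that S(Z) ⊂ S(W)):

    # Sketch: nested column spaces imply Pz Pw = Pw Pz = Pz.
    import numpy as np

    rng = np.random.default_rng(5)
    n = 30
    W = rng.normal(size=(n, 4))
    Z = W[:, :2]                      # columns of Z are a subset of those of W

    def proj(M):
        """Orthogonal projection matrix on to the column space of M."""
        return M @ np.linalg.inv(M.T @ M) @ M.T

    Pz, Pw = proj(Z), proj(W)

    print(np.allclose(Pz @ Pw, Pz))   # Pz Pw = Pz
    print(np.allclose(Pw @ Pz, Pz))   # Pw Pz = Pz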

5 Short and Long Regression
Short and Long Regression with Orthogonal Regressors

• We can generalize the orthogonal regressors case to any two groups of regressors using
the partitioned regression model

y = X1 β1 + X2 β2 + u. (4)

• Side note: Remember that we already defined the partition of the matrix X such that
X = [X1 X2 ], where X1 is n × k1 and X2 is n × k2 , with k1 + k2 = k.

• We will refer to the partitioned regression on the full set of regressors (4) as the long
regression and the regressions on a subset of the regressors (that is, X1 or X2 ) as the
short regression.

Partitioned Orthogonal Regression


Theorem 2 (Partitioned Orthogonal Regression). If X1 is orthogonal to X2 , the least squares
estimate β̂1 obtained from the long regression y = X1 β1 + X2 β2 + u is the same as the one
obtained from the short regression of y on X1 alone. Similarly, β̂2 will be the same as the one
obtained from the regression of y on X2 alone.
Proof. Let β̂1 and β̃1 be the OLS estimates of β1 from the long regression and from the short
regression on X1 , respectively. Note that S(X1 ) ⊂ S(X), and therefore it follows from the
previous theorem that P1 Px = P1 . Now, since X1 β̃1 = P1 y, we can write

    X1 β̃1 = P1 y = P1 Px y = P1 ŷ
           = P1 X1 β̂1 + P1 X2 β̂2
           = X1 β̂1 ,

where we used the facts that P1 X2 = O and P1 X1 = X1 . (why?)

The result now follows directly from X1 β̃1 = X1 β̂1 .
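
• The sketch below checks Theorem 2 numerically; X2 is constructed as M1 times a random
matrix so that X1′X2 = O by construction (the data are simulated and the coefficient values
arbitrary).

    # Sketch: with orthogonal blocks of regressors, short and long regressions agree.
    import numpy as np

    rng = np.random.default_rng(6)
    n, k1, k2 = 100, 2, 2
    X1 = rng.normal(size=(n, k1))
    M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
    X2 = M1 @ rng.normal(size=(n, k2))          # by construction X1'X2 = O
    y  = X1 @ np.array([1.0, -0.5]) + X2 @ np.array([2.0, 0.3]) + rng.normal(size=n)

    b_long   = np.linalg.lstsq(np.column_stack([X1, X2]), y, rcond=None)[0]
    b1_short = np.linalg.lstsq(X1, y, rcond=None)[0]
    b2_short = np.linalg.lstsq(X2, y, rcond=None)[0]

    print(np.allclose(b_long[:k1], b1_short))   # beta1_hat: long = short
    print(np.allclose(b_long[k1:], b2_short))   # beta2_hat: long = short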

6 Residuals Regression and FWL Theorem


Transforming the Regressors into Orthogonal Regressors

• In general, orthogonality of two sets of regressors can only be produced by transforming


one of them.

• With two groups of regressors, we can also produce orthogonality by projecting X2 on to
the subspace that is orthogonal to S(X1 ) using the matrix M1 , because M1 X2 ⊥ X1 .

• As a result, we obtain exactly the same α̂2 from either the long regression

y = X1 α1 + M1 X2 α2 + u

or the short regression


y = M1 X2 α2 + v

Long Regression on X2 × Long Regression on M1 X2

• But how is α̂2 related to β̂2 in the long regression

    y = X1 β1 + X2 β2 + u ?

• To see this, rewrite the long regression on M1 X2 as

    y = X1 α1 + (I − P1 )X2 α2 + u
      = X1 α1 + (X2 − X1 (X1′X1 )⁻¹X1′X2 )α2 + u
      = X1 α1 + (X2 − X1 A)α2 + u.

• But this is just the generalized result for adding a linear transformation of a set of regres-
sors to the other set of regressors.

• Thus, we know that α̂2 = β̂2 and also that the vectors of fitted values ŷ and residuals û
will be the same.

Long Regression × Short Regression

• Because M1 X2 ⊥ X1 we know that α̂2 from the short regression equals β̂2 from the long
regression.

• However, the fitted values and the residuals from the short regression will be different,
because dropping regressors changes the column space in which ŷ lies.

• To see this, note that y can be decomposed into fitted values plus residuals using either the
short or the long regression. Therefore,

X1 α̂1 + M1 X2 α̂2 + û = M1 X2 α̂2 + v̂ =⇒ v̂ = X1 α̂1 + û.

• But since X1 α̂1 = P1 y, we obtain that v̂ = P1 y + û.

Residuals Regression

• Because α̂2 = β̂2 , the short regression can be written as

y = M1 X2 α̂2 + v̂

• Substituting v̂ = P1 y + û, we get

y = M1 X2 α̂2 + P1 y + û
y − P1 y = M1 X2 α̂2 + û
M1 y = M1 X2 α̂2 + û

• Finally, note that M1 y is the residuals vector of the regression of y on X1 , and the columns
of M1 X2 are the residuals vectors of the regressions of the columns of X2 on X1 . Hence the
reference to residuals regression.

• These results are summarized in the Frisch-Waugh-Lovell Theorem.

Frisch-Waugh-Lovell Theorem
Theorem 3 (Frisch-Waugh-Lovell). The OLS estimates of β2 and the residuals from the re-
gressions
y = X1 β1 + X2 β2 + u
and
M1 y = M1 X2 β2 + residuals
are numerically identical.

Proof

• Let β̂1 and β̂2 be the OLS estimates from the original model. Then we can write y =
Px y + Mx y = X1 β̂1 + X2 β̂2 + Mx y.

• Premultiplying both sides by X2′M1 we get

    X2′M1 y = X2′M1 X1 β̂1 + X2′M1 X2 β̂2 + X2′M1 Mx y,

which reduces to X2′M1 y = X2′M1 X2 β̂2 : the first term vanishes because M1 annihilates X1 ,
and the third term vanishes because M1 Mx = Mx and Mx y is orthogonal to every column
of X, in particular to those of X2 .

• Solving for β̂2 yields

    β̂2 = (X2′M1 X2 )⁻¹X2′M1 y.

• Next, we show that the OLS estimate of β2 from the transformed model, β̃2 , is equal to β̂2 .

• The OLS estimate of β2 from the transformed model is

    β̃2 = ((M1 X2 )′(M1 X2 ))⁻¹(M1 X2 )′y
        = (X2′M1′M1 X2 )⁻¹X2′M1′y
        = (X2′M1 X2 )⁻¹X2′M1 y = β̂2 ,

where we used the fact that M1 is symmetric and idempotent.

• To show that residuals are also numerically identical, we start again with y = X1 β̂1 +
X2 β̂2 + Mx y, but now we premultiply both sides by M1 ,

M1 y = M1 X1 β̂1 + M1 X2 β̂2 + M1 Mx y.

• Since M1 annihilates X1 and M1 Mx = Mx (don’t forget to do the recommended exercise),


we get
M1 y = M1 X2 β̂2 + Mx y.

• Finally, from the first part of the theorem β̂2 = β̃2 , and thus we can write

    M1 y − M1 X2 β̃2 = Mx y = û,

which is the desired result, since the LHS is the residuals vector from the transformed
regression.
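
• As a final numerical check of the FWL theorem, the sketch below (simulated data, arbitrary
coefficients) verifies that the residuals regression reproduces both β̂2 and û from the long
regression.

    # Sketch: regressing M1 y on M1 X2 gives the same beta2_hat and the same residuals
    # as the long regression of y on [X1, X2].
    import numpy as np

    rng = np.random.default_rng(7)
    n, k1, k2 = 120, 2, 2
    X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
    X2 = rng.normal(size=(n, k2)) + X1 @ np.ones((k1, k2))   # correlated with X1
    y  = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([-2.0, 1.0]) + rng.normal(size=n)

    X = np.column_stack([X1, X2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u_hat = y - X @ b                                        # long-regression residuals

    M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
    b2_fwl = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]
    u_fwl  = M1 @ y - (M1 @ X2) @ b2_fwl                     # residuals-regression residuals

    print(np.allclose(b2_fwl, b[k1:]))          # same beta2_hat
    print(np.allclose(u_fwl, u_hat))            # same residuals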
