


respectively, where the latter condition is implied by the first. Under $E\{\varepsilon_i | x_i\} = 0$,
we can interpret the regression model as describing the conditional expected value of
$y_i$ given values for the explanatory variables $x_i$. For example, what is the expected
wage for an arbitrary woman of age 40, with a university education and 14 years of
experience? Or, what is the expected unemployment rate given wage rates, inflation
and total output in the economy? The first consequence of (3.2) is the interpretation
of the individual coefficients. For example, $\beta_k$ measures the expected change in $y_i$
if $x_{ik}$ changes by one unit but all the other variables in $x_i$ do not change. That is,
$$\frac{\partial E\{y_i | x_i\}}{\partial x_{ik}} = \beta_k. \tag{3.3}$$


It is important to realize that we had to state explicitly that the other variables in $x_i$ did
not change. This is the so-called ceteris paribus condition. In a multiple regression
model single coefficients can only be interpreted under ceteris paribus conditions. For
example, $\beta_k$ could measure the effect of age on the expected wage of a woman, if the
education level and years of experience are kept constant. An important consequence
of the ceteris paribus condition is that it is not possible to interpret a single coefficient
in a regression model without knowing what the other variables in the model are. If
interest is focussed on the relationship between $y_i$ and $x_{ik}$, the other variables in $x_i$ act
as control variables. For example, we may be interested in the relationship between
house prices and the number of bedrooms, controlling for differences in lot size and
location. Depending upon the question of interest, we may decide to control for some
factors but not for all (see Wooldridge, 2003, Section 6.3, for more discussion).
Sometimes these ceteris paribus conditions are hard to maintain. For example, in
the wage equation case, it may be very common that a changing age almost always
corresponds to changing years of experience. Although the $\beta_k$ coefficient in this case
still measures the effect of age keeping years of experience (and the other variables)
fixed, it may not be very well identified from a given sample, due to the collinearity
between the two variables. In some cases, it is just impossible to maintain the ceteris
paribus condition, for example if $x_i$ includes both age and age-squared. Clearly, it is
ridiculous to say that a coefficient $\beta_k$ measures the effect of age given that age-squared
is constant. In this case, one should go back to the derivative (3.3). If $x_i'\beta$ includes,
say, $\text{age}_i \beta_2 + \text{age}_i^2 \beta_3$, we can derive
$$\frac{\partial E\{y_i | x_i\}}{\partial \text{age}_i} = \beta_2 + 2\,\text{age}_i \beta_3,$$


which can be interpreted as the marginal effect of a changing age if the other variables
in $x_i$ (excluding $\text{age}_i^2$) are kept constant. This shows how the marginal effects
of explanatory variables can be allowed to vary over the observations by including
additional terms involving these variables (in this case $\text{age}_i^2$). For example, we can
allow the effect of age to be different for men and women by including an interaction
term $\text{age}_i \text{male}_i$ in the regression, where $\text{male}_i$ is a dummy for males. Thus, if the
model includes $\text{age}_i \beta_2 + \text{age}_i \text{male}_i \beta_3$ the effect of a changing age is
$$\frac{\partial E\{y_i | x_i\}}{\partial \text{age}_i} = \beta_2 + \text{male}_i \beta_3,$$
which is $\beta_2$ for females and $\beta_2 + \beta_3$ for males. Sections 3.4 and 3.5 will illustrate the
use of such interaction terms.
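
As a minimal sketch of the two marginal effects just derived, the functions below evaluate $\beta_2 + 2\,\text{age}_i\beta_3$ and $\beta_2 + \text{male}_i\beta_3$. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from any data set.

```python
def marginal_effect_quadratic(beta2, beta3, age):
    """Effect of age when the model contains age*beta2 + age**2*beta3."""
    return beta2 + 2.0 * age * beta3


def marginal_effect_interaction(beta2, beta3, male):
    """Effect of age when the model contains age*beta2 + age*male*beta3."""
    return beta2 + male * beta3


# Hypothetical coefficient values, for illustration only.
b2, b3 = 0.08, -0.001
for age in (25, 40, 55):
    # The effect of age differs across observations because it depends on age.
    print(age, marginal_effect_quadratic(b2, b3, age))

# With an interaction term the effect differs between women (male=0) and men (male=1).
print("female:", marginal_effect_interaction(0.05, 0.02, male=0))
print("male:  ", marginal_effect_interaction(0.05, 0.02, male=1))
```

With a negative coefficient on age-squared, the effect of age declines with age and eventually changes sign, which a single coefficient could not capture.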
Frequently, economists are interested in elasticities rather than marginal effects.
An elasticity measures the relative change in the dependent variable due to a relative
change in one of the $x_i$ variables. Often, elasticities are estimated directly from a linear
regression model involving the logarithms of most explanatory variables (excluding
dummy variables), that is
$$\log y_i = (\log x_i)'\gamma + v_i, \tag{3.6}$$
where $\log x_i$ is shorthand notation for a vector with elements $(1, \log x_{i2}, \ldots, \log x_{iK})'$
and it is assumed that $E\{v_i | \log x_i\} = 0$. We shall call this a loglinear model. In
this case,
$$\frac{\partial E\{y_i | x_i\}}{\partial x_{ik}} \cdot \frac{x_{ik}}{E\{y_i | x_i\}} \approx \frac{\partial E\{\log y_i | \log x_i\}}{\partial \log x_{ik}} = \gamma_k,$$
where the $\approx$ is due to the fact that $E\{\log y_i | \log x_i\} = E\{\log y_i | x_i\} \neq \log E\{y_i | x_i\}$. Note
that (3.3) implies that in the linear model
$$\frac{\partial E\{y_i | x_i\}}{\partial x_{ik}} \cdot \frac{x_{ik}}{E\{y_i | x_i\}} = \frac{x_{ik}}{x_i'\beta}\,\beta_k,$$
which shows that the linear model implies that elasticities are nonconstant and vary
with $x_i$, while the loglinear model imposes constant elasticities. While in many cases
the choice of functional form is dictated by convenience in economic interpretation,
other considerations may play a role. For example, explaining $\log y_i$ rather than $y_i$
may help to reduce heteroskedasticity problems, as illustrated in Section 3.5 below.
In Section 3.3 we shall briefly consider statistical tests for a linear versus a loglinear
specification.
If $x_{ik}$ is a dummy variable (or another variable that may take nonpositive values)
we cannot take its logarithm and we include the original variable in the model. Thus
we estimate
$$\log y_i = x_i'\beta + \varepsilon_i. \tag{3.9}$$
Of course, it is possible to include some explanatory variables in logs and some in
levels. In (3.9) the interpretation of a coefficient $\beta_k$ is the relative change in $y_i$ due
to an absolute change of one unit in $x_{ik}$. So if $x_{ik}$ is a dummy for males, $\beta_k$ is the
(ceteris paribus) relative wage differential between men and women. Again this holds
only approximately; see Subsection 3.5.2.
The inequality of $E\{\log y_i | x_i\}$ and $\log E\{y_i | x_i\}$ also has some consequences for
prediction purposes. Suppose we start from the loglinear model (3.6) with $E\{v_i | \log x_i\} = 0$.
Then, we can determine the predicted value of $\log y_i$ as $(\log x_i)'\gamma$. However, if we
are interested in predicting $y_i$ rather than $\log y_i$, it is not the case that $\exp\{(\log x_i)'\gamma\}$
is a good predictor for $y_i$ in the sense that it corresponds to the expected value of $y_i$,
given $x_i$. That is, $E\{y_i | x_i\} \neq \exp\{E\{\log y_i | x_i\}\} = \exp\{(\log x_i)'\gamma\}$. The reason is that
taking logarithms is a nonlinear transformation, while the expected value of a nonlinear
function is not this nonlinear function of the expected value. The only way to get
around this problem is to make distributional assumptions. If, for example, it can be
assumed that $v_i$ in (3.6) is normally distributed with mean zero and variance $\sigma_v^2$, it
implies that the conditional distribution of $y_i$ is lognormal (see Appendix B) with mean
$$E\{y_i | x_i\} = \exp\left\{E\{\log y_i | x_i\} + \tfrac{1}{2}\sigma_v^2\right\} = \exp\left\{(\log x_i)'\gamma + \tfrac{1}{2}\sigma_v^2\right\}.$$
Sometimes, the additional half-variance term is also added when the error terms are
not assumed to be normal. Often, it is simply omitted. Additional discussion on
predicting $y_i$ when the dependent variable is $\log(y_i)$ is provided in Wooldridge (2003,
Section 6.4).
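
The role of the half-variance term can be illustrated with a small simulation. The sketch below assumes normally distributed errors and uses made-up parameter values ($\gamma = (1, 0.5)'$, $\sigma_v = 0.8$); it compares the sample mean of $y_i$ with the average of the naive predictor $\exp\{(\log x_i)'\hat\gamma\}$ and of the predictor that adds $\tfrac{1}{2}s^2$, where $s^2$ is the usual estimate of the error variance.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Simulated loglinear model with normal errors (parameter values are assumptions).
log_x = rng.normal(size=N)
sigma_v = 0.8
log_y = 1.0 + 0.5 * log_x + rng.normal(scale=sigma_v, size=N)
y = np.exp(log_y)

# OLS of log y on a constant and log x.
X = np.column_stack([np.ones(N), log_x])
gamma_hat, *_ = np.linalg.lstsq(X, log_y, rcond=None)
resid = log_y - X @ gamma_hat
s2 = resid @ resid / (N - X.shape[1])  # estimate of the error variance

# Naive predictor versus the predictor with the half-variance correction.
naive = np.exp(X @ gamma_hat)
corrected = np.exp(X @ gamma_hat + 0.5 * s2)

print("mean of y:          ", y.mean())
print("mean naive pred.:   ", naive.mean())
print("mean corrected pred:", corrected.mean())
```

In this design the naive predictor underestimates the average level of $y_i$, while the corrected predictor matches it closely. The correction relies on normality of $v_i$, which is imposed here by construction.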
It should be noted that the assumption that $E\{\varepsilon_i | x_i\} = 0$ is also important, as it says
that changing $x_i$ should not lead to changes in the expected error term. There are many
cases in economics where this is hard to maintain and the models we are interested
in do not correspond to conditional expectations. We shall come back to this issue in
Chapter 5.
Another consequence of (3.2) is often overlooked. If we change the set of explanatory
variables $x_i$ to $z_i$, say, and estimate another regression model,
$$y_i = z_i'\gamma + v_i$$
with the interpretation that $E\{y_i | z_i\} = z_i'\gamma$, there is no conflict with the previous model
that said that $E\{y_i | x_i\} = x_i'\beta$. Because the conditioning variables are different, both
conditional expectations could be correct in the sense that both are linear in the
conditioning variables. Consequently, if we interpret the regression models as describing
the conditional expectation given the variables that are included there can never be
any conflict between them. These are just two different things we might be interested in.
For example, we may be interested in the expected wage as a function of gender only,
but also in the expected wage as a function of gender, education and experience. Note
that, because of a different ceteris paribus condition, the coefficients for gender in
these two models do not have the same interpretation. Often, researchers implicitly
or explicitly make the assumption that the set of conditioning variables is larger than
those that are included. Sometimes, it is suggested that the model contains all relevant
observable variables (implying that observables that are not included in the model are
in the conditioning set but irrelevant). If it were argued, for example, that the two
linear models above should be interpreted as
$$E\{y_i | x_i, z_i\} = z_i'\gamma$$
and
$$E\{y_i | x_i, z_i\} = x_i'\beta,$$
respectively, then the two models are typically in conflict and at most one of them can
be correct.¹ Only in such cases does it make sense to compare the two models statistically
and to test, for example, which model is correct and which one is not. We come back
to this issue in Subsection 3.2.3.

¹ We abstract from trivial exceptions, like $x_i = z_i$ and $\beta = \gamma$.




3.2 Selecting the Set of Regressors

3.2.1 Misspecifying the Set of Regressors

If one is (implicitly) assuming that the conditioning set of the model contains more
variables than the ones that are included, it is possible that the set of explanatory
variables is misspecified. This means that one or more of the omitted variables are
relevant, i.e. have nonzero coefficients. This raises two questions: what happens when
a relevant variable is excluded from the model and what happens when an irrelevant
variable is included in the model? To illustrate this, consider the following two models
$$y_i = x_i'\beta + z_i'\gamma + \varepsilon_i, \tag{3.12}$$
$$y_i = x_i'\beta + v_i, \tag{3.13}$$
both interpreted as describing the conditional expectation of $y_i$ given $x_i$, $z_i$ (and maybe
some additional variables). The model in (3.13) is nested in (3.12) and implicitly
assumes that $z_i$ is irrelevant ($\gamma = 0$). What happens if we estimate model (3.13) while
in fact model (3.12) is the correct model? That is, what happens when we omit $z_i$ from
the set of regressors?
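The OLS estimator for $\beta$ based on (3.13), denoted $b_2$, is given by
$$b_2 = \left(\sum_{i=1}^{N} x_i x_i'\right)^{-1} \sum_{i=1}^{N} x_i y_i. \tag{3.14}$$
The properties of this estimator under model (3.12) can be determined by substituting
(3.12) into (3.14) to obtain
$$b_2 = \beta + \left(\sum_{i=1}^{N} x_i x_i'\right)^{-1} \sum_{i=1}^{N} x_i z_i'\gamma + \left(\sum_{i=1}^{N} x_i x_i'\right)^{-1} \sum_{i=1}^{N} x_i \varepsilon_i. \tag{3.15}$$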



Depending upon the assumptions made for model (3.12), the last term in this expression
will have an expectation or probability limit of zero.² The second term on the
right-hand side, however, corresponds to a bias (or asymptotic bias) in the OLS estimator
due to estimating the incorrect model (3.13). This is referred to as an omitted variable
bias. As expected, there will be no bias if $\gamma = 0$ (implying that the two models are
identical), but there is one more case in which the estimator for $\beta$ will not be biased
and that is when $\sum_{i=1}^{N} x_i z_i' = 0$, or, asymptotically, when $E\{x_i z_i'\} = 0$. If this happens
we say that $x_i$ and $z_i$ are orthogonal. This does not happen very often in economic
applications. Note, for example, that the presence of an intercept in $x_i$ implies that
$E\{z_i\}$ should be zero.
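
The decomposition of $b_2$ into $\beta$, a bias term and a noise term can be checked numerically. The following sketch uses simulated data in which $z_i$ is correlated with $x_i$; all parameter values (intercept 1, slope 2, $\gamma = 3$, and $z_i = 0.5 x_i + \text{noise}$) are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

x = rng.normal(size=N)
z = 0.5 * x + rng.normal(size=N)   # omitted variable, correlated with x
eps = rng.normal(size=N)

beta = np.array([1.0, 2.0])        # intercept and slope on x (assumed values)
gamma = 3.0                        # coefficient on z (assumed value)

X = np.column_stack([np.ones(N), x])
y = X @ beta + gamma * z + eps     # the "long" model, with z included

# OLS in the "short" model that omits z.
XtX = X.T @ X
b2 = np.linalg.solve(XtX, X.T @ y)

# The two extra terms of the decomposition.
bias_term = np.linalg.solve(XtX, X.T @ z) * gamma
noise_term = np.linalg.solve(XtX, X.T @ eps)

# OLS in the long model, for comparison.
b_long, *_ = np.linalg.lstsq(np.column_stack([X, z]), y, rcond=None)

print("short regression:", b2)
print("beta + bias + noise:", beta + bias_term + noise_term)
print("long regression: ", b_long)
```

In this design the slope in the short regression settles near $2 + 0.5 \times 3 = 3.5$ rather than 2, while the long regression recovers the assumed coefficients.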
The converse is less of a problem. If we estimate model (3.12) while in fact model
(3.13) is appropriate, that is, we needlessly include the irrelevant variables $z_i$, we would
simply be estimating the $\gamma$ coefficients, which are zero. In this case, however, it would
be preferable to estimate $\beta$ from the restricted model (3.13) rather than from (3.12)
because the latter estimator for $\beta$ will usually have a higher variance and thus be less
reliable. While the derivation of this result requires some tedious matrix manipulations,
it is intuitively obvious: model (3.13) imposes more information, so that we can expect
that the estimator that exploits this information is, on average, more accurate than one
which does not. Thus, including irrelevant variables in your model, even though they
have a zero coefficient, will typically increase the variance of the estimators for the
other model parameters. Including as many variables as possible in a model is thus not
a good strategy, while including too few variables has the danger of biased estimates.
This means we need some guidance on how to select the set of regressors.

² Compare the derivations of the properties of the OLS estimator in Chapter 2.

3.2.2 Selecting Regressors

Again, it should be stressed that if we interpret the regression model as describing the
conditional expectation of $y_i$ given the included variables $x_i$, there is no issue of a
misspecified set of regressors, although there might be a problem of functional form
(see the next section). This implies that statistically there is nothing to test here. The
set of $x_i$ variables will be chosen on the basis of what we find interesting and often
economic theory or common sense guides us in our choice. Interpreting the model in a
broader sense implies that there may be relevant regressors that are excluded or
irrelevant ones that are included. To find potentially relevant variables we can use economic
theory again. For example, when specifying an individual wage equation we may use
the human capital theory which essentially says that everything that affects a person's
productivity will affect his or her wage. In addition, we may use job characteristics
(blue or white collar, shift work, public or private sector, etc.) and general labour
market conditions (e.g. sectoral unemployment).
It is good practice to select the set of potentially relevant variables on the basis of
economic arguments rather than statistical ones. Although it is sometimes suggested
otherwise, statistical arguments are never certainty arguments. That is, there is always
a small (but not ignorable) probability of drawing the wrong conclusion. For example,
there is always a probability (corresponding to the size of the test) of rejecting the null
hypothesis that a coefficient is zero, while the null is actually true. Such type I errors
are rather likely to happen if we use a sequence of many tests to select the regressors
to include in the model. This process is referred to as data snooping or data mining
(see Leamer, 1978; Lovell, 1983; or Charemza and Deadman, 1999, Chapter 2), and in
economics it is not a compliment if someone accuses you of doing it. In general, data
snooping refers to the fact that a given set of data is used more than once to choose a
model specification and to test hypotheses. You can imagine, for example, that if you
have a set of 20 potential regressors and you try each one of them, you are quite likely
to conclude that one of them is significant, even though there is no true relationship
between any of these regressors and the variable you are explaining. Although statistical
software packages sometimes provide mechanical routines to select regressors, these
are not recommended in economic work. The probability of making incorrect choices
is high and it is not unlikely that your model captures some peculiarities in the
data that have no real meaning outside the sample. In practice, however, it is hard
to avoid that some amount of data snooping enters your work. Even if you do not
perform your own specification search and happen to know which model to estimate,
this knowledge may be based upon the successes and failures of past investigations.
Nevertheless, it is important to be aware of the problem. In recent years, the possibility
of data snooping biases has played an important role in empirical studies that model stock
returns. Lo and MacKinlay (1990), for example, analyse such biases in tests of financial
asset pricing models, while Sullivan, Timmermann and White (2001) analyse to what
extent the presence of calendar effects in stock returns, like the January effect discussed
in Section 2.7, can be attributed to data snooping.
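
To get a feeling for the example with 20 potential regressors, the sketch below computes the probability of at least one type I error in a sequence of tests. It assumes, purely for simplicity, that the tests are independent and that each has a size of 5%; neither assumption is made in the text, and with correlated test statistics the exact number will differ.

```python
def prob_at_least_one_rejection(n_tests, size=0.05):
    """P(at least one rejection) when every null hypothesis is true.

    Assumes the n_tests tests are mutually independent, each with the given size.
    """
    return 1.0 - (1.0 - size) ** n_tests


for m in (1, 5, 20, 50):
    print(m, round(prob_at_least_one_rejection(m), 3))
```

Under these assumptions, trying 20 unrelated regressors one by one produces at least one "significant" coefficient in roughly 64% of the cases.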
The danger of data mining is particularly high if the specification search is from
simple to general. In this approach, you start with a simple model and you include
additional variables or lags of variables until the specification appears adequate. That
is, until the restrictions imposed by the model are no longer rejected and you are
happy with the signs of the coefficient estimates and their significance. Clearly, such
a procedure may involve a very large number of tests. An alternative is the
general-to-specific modelling approach, advocated by Professor David Hendry and others,
typically referred to as the LSE methodology.³ This approach starts by estimating a
general unrestricted model (GUM), which is subsequently reduced in size and
complexity by testing restrictions that can be imposed; see Charemza and Deadman (1999)
for an extensive treatment. The idea behind this approach is appealing. Assuming that
a sufficiently general and complicated model can describe reality, any more
parsimonious model is an improvement if it conveys all of the same information in a simpler,
more compact form. The art of model specification in the LSE approach is to find
models that are valid restrictions of the GUM, and that cannot be reduced to even
more parsimonious models that are also valid restrictions. While the LSE methodology
involves a large number of (mis)specification tests, it can be argued to be relatively
insensitive to data-mining problems. The basic argument, formalized by White (1990),
is that as the sample size grows to infinity only the true specification will survive all
specification tests. This assumes that the true specification is a special case of the
GUM that a researcher starts with. Rather than ending up with a specification that is
most likely incorrect, due to an accumulation of type I and type II errors, the
general-to-specific approach in the long run would result in the correct specification. While
this asymptotic result is insufficient to assure that the LSE approach works well with
sample sizes typical for empirical work, Hoover and Perez (1999) show that it may
work pretty well in practice in the sense that the methodology recovers the correct
specification (or a closely related specification) most of the time. An automated
version of the general-to-specific approach is developed by Krolzig and Hendry (2001)
and available in PcGets (see Bardsen, 2001, or Owen, 2003, for a review).
In practice, most applied researchers will start somewhere in the middle with a
specification that could be appropriate and, ideally, then test (1) whether restrictions
imposed by the model are correct and test (2) whether restrictions not imposed by the
model could be imposed. In the first category are misspecification tests for omitted
variables, but also for autocorrelation and heteroskedasticity (see Chapter 4). In the
second category are tests of parametric restrictions, for example that one or more
explanatory variables have zero coefficients.

³ The adjective LSE derives from the fact that there is a strong tradition of time-series econometrics at the
London School of Economics (LSE), starting in the 1960s. Currently, the practitioners of LSE econometrics
are widely dispersed among institutions throughout the world.



In presenting your estimation results, it is not a sin to have insignificant variables
included in your specification. The fact that your results do not show a significant
effect on $y_i$ of some variable $x_{ik}$ is informative to the reader and there is no reason
to hide it by re-estimating the model while excluding $x_{ik}$. Of course, you should be
careful about including many variables in your model that are multicollinear so that, in the
end, almost none of the variables appears individually significant.
Besides formal statistical tests there are other criteria that are sometimes used to
select a set of regressors. First of all, the $R^2$, discussed in Section 2.4, measures the
proportion of the sample variation in $y_i$ that is explained by variation in $x_i$. It is
clear that if we were to extend the model by including $z_i$ in the set of regressors, the
explained variation would never decrease, so that also the $R^2$ will never decrease if
we include additional variables in the model. Using the $R^2$ as criterion would thus
favour models with as many explanatory variables as possible. This is certainly not
optimal, because with too many variables we will not be able to say very much about
the model's coefficients, as they may be estimated rather inaccurately. Because the $R^2$
does not punish the inclusion of many variables, it is better to use a measure
which incorporates a trade-off between goodness-of-fit and the number of regressors
employed in the model. One way to do this is to use the adjusted $R^2$ (or $\bar{R}^2$), as
discussed in the previous chapter. Writing it as
$$\bar{R}^2 = 1 - \frac{1/(N-K)\sum_{i=1}^{N} e_i^2}{1/(N-1)\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
and noting that the denominator in this expression is unaffected by the model under
consideration, shows that the adjusted $R^2$ provides a trade-off between goodness-of-fit,
as measured by $\sum_{i=1}^{N} e_i^2$, and the simplicity or parsimony of the model, as measured by
the number of parameters $K$. There exist a number of alternative criteria that provide
such a trade-off, the most common ones being Akaike's Information Criterion (AIC),
proposed by Akaike (1973), given by
$$\text{AIC} = \log \frac{1}{N}\sum_{i=1}^{N} e_i^2 + \frac{2K}{N}$$
and the Schwarz Bayesian Information Criterion (BIC), proposed by Schwarz
(1978), which is given by
$$\text{BIC} = \log \frac{1}{N}\sum_{i=1}^{N} e_i^2 + \frac{K}{N}\log N.$$


Models with a lower AIC or BIC are typically preferred. Note that both criteria add a
penalty that increases with the number of regressors. Because the penalty is larger for
BIC whenever $\log N > 2$ (that is, for $N \geq 8$), the latter criterion tends to favour more
parsimonious models than AIC. The use
of either of these criteria is usually restricted to cases where alternative models are not
nested (see Subsection 3.2.3) and economic theory provides no guidance on selecting
the appropriate model. A typical situation is the search for a parsimonious model that
describes the dynamic process of a particular variable (see Chapter 8).
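
The two criteria are easy to compute from the OLS residuals. The sketch below implements the AIC and BIC formulas given above and applies them to two non-nested candidate models on simulated data; the helper names and the data-generating process (in which only `x1` matters) are assumptions made for the illustration.

```python
import numpy as np


def ols_residuals(X, y):
    """Residuals from an OLS regression of y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b


def aic(resid, K):
    """AIC = log((1/N) * sum of squared residuals) + 2K/N."""
    N = resid.size
    return np.log(resid @ resid / N) + 2.0 * K / N


def bic(resid, K):
    """BIC = log((1/N) * sum of squared residuals) + (K/N) * log N."""
    N = resid.size
    return np.log(resid @ resid / N) + K / N * np.log(N)


rng = np.random.default_rng(0)
N = 500
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
y = 1.0 + 2.0 * x1 + rng.normal(size=N)   # only x1 is relevant (assumed design)

const = np.ones(N)
e_a = ols_residuals(np.column_stack([const, x1]), y)  # candidate model with x1
e_b = ols_residuals(np.column_stack([const, x2]), y)  # candidate model with x2

print("model with x1: AIC", aic(e_a, 2), "BIC", bic(e_a, 2))
print("model with x2: AIC", aic(e_b, 2), "BIC", bic(e_b, 2))
```

Both candidates have the same number of parameters here, so the ranking is driven entirely by the residual sum of squares; the penalty terms matter when the candidates differ in $K$.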



Alternatively, it is possible to test whether the increase in $R^2$ is statistically
significant. Testing this is exactly the same as testing whether the coefficients for the newly
added variables $z_i$ are all equal to zero, and we have seen a test for that in the previous
chapter. Recall from (2.59) that the appropriate F-statistic can be written as
$$f = \frac{(R_1^2 - R_0^2)/J}{(1 - R_1^2)/(N - K)},$$
where $R_1^2$ and $R_0^2$ denote the $R^2$ in the model with and without $z_i$, respectively, and $J$
is the number of variables in $z_i$. Under the null hypothesis that $z_i$ has zero coefficients,
the $f$ statistic has an $F$ distribution with $J$ and $N - K$ degrees of freedom, provided we
can impose conditions (A1)-(A5) from Chapter 2. The F-test thus provides a statistical
answer to the question whether the increase in $R^2$ due to including $z_i$ in the model
was significant or not. It is also possible to rewrite $f$ in terms of adjusted $R^2$s. This
would show that $\bar{R}_1^2 > \bar{R}_0^2$ if and only if $f$ exceeds a certain threshold. In general, these
thresholds do not correspond to 5% or 10% critical values of the $F$ distribution, but are
substantially smaller. In particular, it can be shown that $\bar{R}_1^2 > \bar{R}_0^2$ if and only if the $f$
statistic is larger than one. For a single variable ($J = 1$) this implies that the adjusted
$R^2$ will increase if the additional variable has a t-ratio with an absolute value larger
than unity. (Recall that for a single restriction $t^2 = f$.) This reveals that the adjusted
$R^2$ would lead to the inclusion of more variables than standard t or F-tests.
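
The equivalence between an increase in the adjusted $R^2$ and $f > 1$ can be verified numerically. In the sketch below, $K$ is taken to be the number of regressors in the larger model (so that the smaller model has $K - J$ regressors), which is the reading consistent with the $N - K$ degrees of freedom above; the values of $N$, $K$, $J$ and the $R^2$s are hypothetical.

```python
def f_from_r2(r2_1, r2_0, J, N, K):
    """F-statistic for J added regressors, written in terms of the two R-squareds."""
    return ((r2_1 - r2_0) / J) / ((1.0 - r2_1) / (N - K))


def adjusted_r2(r2, N, K):
    """Adjusted R-squared for a model with K regressors (including the intercept)."""
    return 1.0 - (1.0 - r2) * (N - 1) / (N - K)


# Hypothetical setting: the larger model has K regressors, the smaller one K - J.
N, K, J = 50, 5, 2
r2_0 = 0.30

for r2_1 in (0.31, 0.32, 0.34, 0.40):
    f = f_from_r2(r2_1, r2_0, J, N, K)
    increases = adjusted_r2(r2_1, N, K) > adjusted_r2(r2_0, N, K - J)
    # The adjusted R-squared rises exactly when f exceeds one.
    print(r2_1, round(f, 3), increases)
```

The loop shows that modest increases in $R^2$ that would be far from significant at conventional levels can already raise the adjusted $R^2$.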
Direct tests of the hypothesis that the coefficients $\gamma$ for $z_i$ are zero can be obtained
from the t and F-tests discussed in the previous chapter. Compared to $f$ above, a test
statistic can be derived which is more generally appropriate. Let $\hat{\gamma}$ denote the OLS
estimator for $\gamma$ and let $\hat{V}\{\hat{\gamma}\}$ denote an estimated covariance matrix for $\hat{\gamma}$. Then, it
can be shown that under the null hypothesis that $\gamma = 0$ the test statistic
$$\xi = \hat{\gamma}'\,\hat{V}\{\hat{\gamma}\}^{-1}\hat{\gamma}$$
has an asymptotic $\chi^2$ distribution with $J$ degrees of freedom. This is similar to the
Wald test described in Chapter 2 (compare (2.63)). The form of the covariance matrix
of $\hat{\gamma}$ depends upon the assumptions we are willing to make. Under the Gauss-Markov
assumptions we would obtain a statistic that satisfies $\xi = Jf$.
It is important to recall that two single tests are not equivalent to one joint test. For
example, if we are considering the exclusion of two single variables with coefficients
$\gamma_1$ and $\gamma_2$, the individual t-tests may reject neither $\gamma_1 = 0$ nor $\gamma_2 = 0$, whereas the
joint F-test (or Wald test) rejects the joint restriction $\gamma_1 = \gamma_2 = 0$. The message here
is that if we want to drop two variables from the model at the same time, we should
be looking at a joint test rather than at two separate tests. Once the first variable is
omitted from the model, the second one may appear significant. This is particularly of
importance if collinearity exists between the two variables.
3.2.3 Comparing Non-nested Models

Sometimes econometricians want to compare two different models that are not nested.
In this case neither of the two models is obtained as a special case of the other. Such
a situation may arise if two alternative economic theories lead to different models for
the same phenomenon. Let us consider the following two alternative specifications:
$$\text{Model A:}\quad y_i = x_i'\beta + \varepsilon_i$$
$$\text{Model B:}\quad y_i = z_i'\gamma + v_i,$$
where both are interpreted as describing the conditional expectation of $y_i$ given $x_i$
and $z_i$. The two models are non-nested if $z_i$ includes a variable that is not in $x_i$ and
vice versa. Because both models are explaining the same endogenous variable, it is
possible to use the $\bar{R}^2$, AIC or BIC criteria, discussed in the previous subsection. An
alternative and more formal idea that can be used to compare the two models is that of
encompassing (see Mizon, 1984; Mizon and Richard, 1986): if model A is believed
to be the correct model it must be able to encompass model B, that is, it must be
able to explain model B's results. If model A is unable to do so, it has to be rejected.
Vice versa, if model B is unable to encompass model A, it should be rejected as well.
Consequently, it is possible that both models are rejected, because neither of them
is correct. If model A is not rejected, we can test it against another rival model and
maintain it as long as it is not rejected.
The encompassing principle is very general and it is legitimate to require a model
to encompass its rivals. If these rival models are nested within the current model, they
are automatically encompassed by it, because a more general model is always able to
explain results of simpler models (compare (3.15) above). If the models are not nested,
encompassing is nontrivial. Unfortunately, encompassing tests for general models are
fairly complicated, but for the regression models above things are relatively simple.
We shall consider two alternative tests. The first is the non-nested F-test or
encompassing F-test. Writing $x_i' = (x_{1i}' \;\; x_{2i}')$ where $x_{1i}$ is included in $z_i$ (and $x_{2i}$ is not), model
B can be tested by constructing a so-called artificial nesting model as
$$y_i = z_i'\gamma + x_{2i}'\delta_A + v_i.$$
This model typically has no economic rationale, but reduces to model B if $\delta_A = 0$.
Thus, the validity of model B (model B encompasses model A) can be tested using
an F-test for the restrictions $\delta_A = 0$. In a similar fashion, we can test the validity of
model A by testing $\delta_B = 0$ in
$$y_i = x_i'\beta + z_{2i}'\delta_B + \varepsilon_i,$$
where $z_{2i}$ contains the variables from $z_i$ that are not included in $x_i$. The null hypotheses
that are tested here state that one model encompasses the other. The outcome of the
two tests may be that both models have to be rejected. On the other hand, it is also
possible that neither of the two models is rejected. Thus the fact that model A is
rejected should not be interpreted as evidence in favour of model B. It just indicates
that something is captured by model B which is not adequately taken into account in
model A.
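
The two encompassing F-tests can be sketched as follows. The data are simulated from model A, the only common regressor is the intercept, and all names and parameter values are assumptions made for the illustration. The F-statistics are computed from restricted and unrestricted residual sums of squares, which is the standard form under the Gauss-Markov assumptions rather than a formula taken from this section.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares from an OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e


def f_stat(X_restricted, X_unrestricted, y):
    """F-statistic for the restrictions that reduce X_unrestricted to X_restricted."""
    N, K = X_unrestricted.shape
    J = K - X_restricted.shape[1]
    rss_r, rss_u = rss(X_restricted, y), rss(X_unrestricted, y)
    return ((rss_r - rss_u) / J) / (rss_u / (N - K))


rng = np.random.default_rng(0)
N = 500
z2 = rng.normal(size=N)
x2 = 0.5 * z2 + rng.normal(size=N)        # rival regressors are correlated
y = 1.0 + 2.0 * x2 + rng.normal(size=N)   # data generated by model A (assumed)

const = np.ones(N)                         # x_1i: the part common to both models
XA = np.column_stack([const, x2])          # model A
ZB = np.column_stack([const, z2])          # model B
nesting = np.column_stack([const, x2, z2]) # artificial nesting model

f_test_A = f_stat(XA, nesting, y)  # H0: delta_B = 0 (model A is valid)
f_test_B = f_stat(ZB, nesting, y)  # H0: delta_A = 0 (model B is valid)

print("F-test of model A:", f_test_A)
print("F-test of model B:", f_test_B)
```

With one regressor unique to each model, both tests involve a single restriction and each statistic would be compared with critical values of the $F$ distribution with 1 and $N - 3$ degrees of freedom.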



A more parsimonious non-nested test is the J-test. Let us start again from an artificial
nesting model that nests both model A and model B, given by
$$y_i = (1 - \delta)x_i'\beta + \delta z_i'\gamma + u_i, \tag{3.25}$$
where $\delta$ is a scalar parameter and $u_i$ denotes the error term. If $\delta = 0$, equation (3.25)
corresponds to model A and if $\delta = 1$ it reduces to model B. Unfortunately, the nesting
model (3.25) cannot be estimated because in general $\beta$, $\gamma$ and $\delta$ cannot be separately
identified. One solution to this problem (suggested by Davidson and MacKinnon, 1981)
is to replace the unknown parameters $\gamma$ by $\hat{\gamma}$, the OLS estimates from model B, and
to test the hypothesis that $\delta = 0$ in
$$y_i = x_i'\beta^* + \delta z_i'\hat{\gamma} + u_i = x_i'\beta^* + \delta\hat{y}_{iB} + u_i,$$
where $\hat{y}_{iB}$ is the predicted value from model B and $\beta^* = (1 - \delta)\beta$. The J-test for the
validity of model A uses the t-statistic for $\delta = 0$ in this last regression. Computationally,
it simply means that the fitted value from the rival model is added to the model that
we are testing and that we test whether its coefficient is zero using a standard t-test.
Compared to the non-nested F-test, the J-test involves only one restriction. This means
that the J-test may be more attractive (have more power) if the number of additional
regressors in the non-nested F-test is large. If the non-nested F-test involves only one
additional regressor, it is equivalent to the J-test. More details on non-nested testing can
be found in Davidson and MacKinnon (1993, Section 11.3) and the references therein.
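
Computationally, the J-test takes only a few lines. The sketch below uses simulated data generated by model B, so that the test of model A should reject while the test of model B should typically not; the helper names, the data-generating process and the use of conventional (homoskedastic) OLS standard errors are all assumptions made for the illustration.

```python
import numpy as np


def ols(X, y):
    """OLS coefficients and conventional standard errors."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (N - K)
    return b, np.sqrt(s2 * np.diag(XtX_inv))


def j_test_t(X_null, X_rival, y):
    """t-statistic on the rival model's fitted values added to the null model."""
    b_rival, _ = ols(X_rival, y)
    y_hat_rival = X_rival @ b_rival
    b, se = ols(np.column_stack([X_null, y_hat_rival]), y)
    return b[-1] / se[-1]


rng = np.random.default_rng(0)
N = 500
z = rng.normal(size=N)
x = 0.5 * z + rng.normal(size=N)
y = 1.0 + 2.0 * z + rng.normal(size=N)   # data generated by model B (assumed)

const = np.ones(N)
XA = np.column_stack([const, x])          # model A
ZB = np.column_stack([const, z])          # model B

t_A = j_test_t(XA, ZB, y)  # tests model A: add fitted values from model B
t_B = j_test_t(ZB, XA, y)  # tests model B: add fitted values from model A

print("J-test of model A, t-statistic:", t_A)
print("J-test of model B, t-statistic:", t_B)
```

Each statistic is compared with standard normal (or t) critical values. Because each model contains only one regressor that the other lacks, this design is the case in which the J-test and the non-nested F-test coincide.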
Another relevant case with two alternative models that are non-nested is the choice
between a linear and loglinear functional form. Because the dependent variable is
different ($y_i$ and $\log y_i$, respectively) a comparison on the basis of goodness-of-fit
measures, including AIC and BIC, is inappropriate. One way to test the
appropriateness of the linear and loglinear models involves nesting them in a more general
model using the so-called Box-Cox transformation (see Davidson and MacKinnon,
1993, Section 14.6), and comparing them against this more general alternative.
Alternatively, an approach similar to the encompassing approach above can be chosen by
making use of an artificial nesting model. A very simple procedure is the PE test,
suggested by MacKinnon, White and Davidson (1983). First, estimate both the linear and
loglinear models by OLS. Denote the predicted values by $\hat{y}_i$ and $\log\tilde{y}_i$, respectively.
Then the linear model can be tested against its loglinear alternative by testing the null
hypothesis that $\delta_{LIN} = 0$ in the test regression
$$y_i = x_i'\beta + \delta_{LIN}(\log\hat{y}_i - \log\tilde{y}_i) + u_i.$$
Similarly, the loglinear model corresponds to the null hypothesis $\delta_{LOG} = 0$ in
$$\log y_i = (\log x_i)'\gamma + \delta_{LOG}(\hat{y}_i - \exp\{\log\tilde{y}_i\}) + u_i.$$
Both tests can simply be based on the standard t-statistics, which under the null
hypothesis have an approximate standard normal distribution. If $\delta_{LIN} = 0$ is not rejected, the
linear model may be preferred. If $\delta_{LOG} = 0$ is not rejected, the loglinear model is
preferred. If both hypotheses are rejected, neither of the two models appears to be