
Kian-Guan Lim

Financial Valuation
and
Econometrics
Final Draft version under Copy-Editing by
World Scientific Publishing, June 2010


About the Author


Kian-Guan Lim received his doctorate from Stanford University in
1986 and works in the field of risk management and financial asset
pricing. He is Professor of Quantitative Finance in the Business School
at the Singapore Management University and adjunct Professor in the
Mathematics Department of the National University of Singapore. Prior
to joining SMU, Kian-Guan was at NUS and founded the University
Center for Financial Engineering, starting the Master of Science
program in Financial Engineering. He has consulted for several banks
on risk validation and valuation. He was also a reservist captain in the
Singapore Armed Forces and had held administrative positions at SMU
and NUS including deanships and headships.


To Leng


Contents

Preface                                                          vi-viii

Chapter 1    Probability Distribution and Statistics            1-24
Chapter 2    Statistical Laws and Central Limit Theorem
             Application: Stock Return Distributions            25-40
Chapter 3    Two-Variable Linear Regression
             Application: Financial Hedging                     41-68
Chapter 4    Model Estimation
             Application: Capital Asset Pricing Model           69-84
Chapter 5    Constrained Regression
             Application: Cost of Capital                       85-100
Chapter 6    Time Series Analysis
             Application: Inflation Forecasting                 101-124
Chapter 7    Random Walk
             Application: Market Efficiency                     125-144
Chapter 8    Autoregression and Persistence
             Application: Predictability                        145-154
Chapter 9    Estimation Errors and T-Tests
             Application: Event Studies                         155-172
Chapter 10   Multiple Linear Regression and Stochastic
             Regressors                                         173-194
Chapter 11   Dummy Variables and ANOVA
             Application: Time Effect Anomalies                 195-206
Chapter 12   Specification Errors                               207-234
Chapter 13   Cross-Sectional Regression
             Application: Testing CAPM                          235-253
Chapter 14   More Multiple Linear Regressions
             Application: Multi-Factor Asset Pricing            254-269
Chapter 15   Errors-in-Variable
             Application: Exchange Rates and Risk Premium       270-284
Chapter 16   Unit Root Processes
             Application: Purchasing Power Parity               285-303
Chapter 17   Conditional Heteroskedasticity
             Application: Risk Estimation                       304-324
Chapter 18   Mean Reverting Continuous Time Process
             Application: Bonds and Term Structures             325-346
Chapter 19   Implied Parameters
             Application: Option Pricing                        347-359
Chapter 20   Generalized Method of Moments
             Application: Consumption-Based Asset Pricing       360-373
Appendix A   Matrix Algebra                                     374-394
Appendix B   Eviews Guide                                       395-405
Appendix C   Linear Regression in EXCEL                         406-410
Appendix D   Multiple Choice Question Tests                     411-435
Appendix E   Solutions to Problem Sets                          436-465
Index                                                           466-470


Preface
This book is an introduction to financial valuation and financial data analyses
using econometric methods. The complexity and enormity of global financial
markets today have far-reaching implications for financial decision-making and
practice. The uncertainty that drives and perturbs financial market prices often
draws rigorous scientific search, in the hope of finding clues to reproduce
winning formulas for striking gold, or else to find the secret recipe to avert disasters.
This scientific drive is partly fueled by the rise of mathematical analyses,
including economic and statistical theory, to the occasion, and is also spurred
on by the increasing availability of financial market data as well as the
abundance of computing power for crunching data.
Financial valuation is the key to investment decision-making and risk management,
which are central to any economy today. In a nutshell, investment leads to the
production of tomorrow's consumption goods. Risk management leads to
prudence in investment and savings so as to avoid bankruptcies that cost
aplenty. Since the 1950s, the field of finance has developed a rich set of
theories and a rigorous framework with which to understand how stock prices
are formed, whether rationally or sometimes perhaps behaviorally or with
anomalies. Besides stock prices and returns, bond prices, interest rates,
exchange rates, futures and option prices are other major financial variables in
the capital markets.
The empirical validation of financial valuation models and of market
phenomena by data, and in turn the feedback to appropriate and effective
theoretical modeling, form an interesting and exciting experience in the study
of finance. There are really three key domains of knowledge here. Financial
valuation or pricing theories are typically constructed from more fundamental
economic axioms such as investor rationality and insatiability. Mathematical
tools such as optimization and conditional expectations are utilized. In the
process of deriving a closed-form or else analytical theoretical model, market
equilibrium conditions are often added as part of the necessary conditions for a
solution. A major output of such theorizing efforts is an asset pricing model.
Theoretical models help to explain positively how market variables come about
and provide a vehicle to develop optimal decision-making.
Yet pragmatic investment decisions typically require parameter inputs that
have to be estimated, or require forecasts of future prices. Such considerations
inevitably lead to applications of statistical models to historical prices and
time series of the relevant economic variables. In more formal language, this
is the construction of a probability space on these variables of interest. If we
model a particular variable over time, it is a statistical model. If we model a
collection of variables over time, where these variables mutually influence
each other, then it is also called an econometric model. Thus the second key
domain of knowledge is probability and statistical theory, or else
econometrics in the context of problems to do with capital market finance.
Finally, another key domain is how data are collected and used. Raw data
are just numbers that in themselves do not lend much insight. For example, if
we collect twelve past monthly return rates of a particular stock, and find that
the sample average of these is one percent, this average of one percent should
not be used simply as an expectation of what next month's return would
be. But suppose we have in addition a statistical model showing that the
monthly return rate of this stock follows an upward trend of half a percent while
any deviation is due to random error; then it is more accurate to expect next
month's return to be half a percent. Sometimes great attention has to be paid to
whether the monthly return rate, the daily return rate, or the intra-day return rate is
more appropriate for the question under study. It is of paramount importance
to understand how the data are obtained, whether there are observational or
recording errors, and whether there are better proxy variables to represent the
effect we seek.
This book is a modest attempt to bring together these domains in financial
valuation theory, in econometrics modeling, and in the empirical analyses of
financial data. These domains are highly intertwined and should be properly
understood in order to correctly and effectively harness the power of data and
statistical or econometric methods for investment and financial decision-making.
One can think of many good books in basic econometrics and also many
good books in finance theory and modeling. The contribution of this book, and
at the same time its novelty, is in employing materials in basic econometrics,
particularly linear regression analyses, and weaving into them threads of
foundational finance theory, concepts, ideas, and models. The treatment in this
book is at a basic level. It is hoped that advanced undergraduate or first-year
postgraduate students learning finance and/or basic econometrics or linear
regression analyses could go through the materials in this book with a
heightened appreciation of how applied econometrics intertwines with the
discovery of financial market knowledge.
It is also hoped that all students who work through this book will begin to
understand that it may indeed be erroneous to make a forecast by simply
taking a bunch of financial time series data and running a straight-line least
squares regression. We should seek to know what the theory, if any, behind
the linkage of the variables is, be able to choose the appropriate time
series data, and employ useful econometric modeling to address not-so-apparent
features of the time series such as non-homogeneity, non-linearity,
measurement errors, and so on. Students should also appreciate that the
estimates or test statistics are in themselves random variables that behave
in some prescribed manner as sample size varies, and that misbehave if there
is a spurious regression problem, and thus be able to interpret
empirical results with more scientific precision.
The chapters of the book are organized along a general progression of
topics taught in basic econometrics, particularly in linear regression analyses,
although there is also coverage of time series analyses and of the nonlinear
generalized method-of-moments technique. There is a clear attempt on my
part to make this coincide with the teaching of key concepts and theories in
financial valuation. In fact, this feature of covering both finance and
econometrics at the same time should be especially rewarding and interesting
to students who are learning both subjects concurrently.
At the beginning of each chapter in this book, key points of learning are
listed so that students can check their own progress on whether they have covered the
major materials of the chapter. Some econometrics materials, especially those
involving multiple variables, are more conveniently developed in terms of
matrices instead of scalar algebra. Therefore, some prior knowledge of
matrix algebra will be helpful. Appendix A contains a short refresher on matrix
algebra as preparation.
Most chapters contain one or more finance application examples
where finance concepts, and sometimes theory, are taught. I have tried to
incorporate real examples of companies and practice where useful, and due to
my nationality, I naturally use some examples of Singapore-based companies.
References to articles and sources are usually listed as footnotes on the same
pages. Data sources for the empirical examples are cited. The empirical
examples were developed using EVIEWS, a statistical software package that
is easily available. Alternative software such as R or SAS, or even EXCEL
(with VBA), can be used as well. A beginner's guide to using EVIEWS and
also EXCEL regression is provided in Appendices B and C respectively. Each
chapter ends with a problem set for the student to practice on, and more reading
references should the student desire to learn more advanced materials related
to the contents of that chapter. Appendix D contains sets of multiple choice
question tests so students can quickly check if they understand the concepts
taught. Appendix E provides solutions to the problem sets.
This manuscript is a substantial expansion and revision of a draft
version that I used to teach a course on Investment and Financial Data
Analysis. I wish to express my thanks to Dharma, Hong Chao, Jane Lim, Yi
Bao, Christopher Ting, and several other colleagues who provided
valuable feedback. Finally, any updates or errata will be available at
http://www.mysmu.edu/faculty/kglim.

Kian Guan
Singapore, April 2010


Chapter 1
PROBABILITY DISTRIBUTION AND STATISTICS
Key Points of Learning
Random variable, Joint probability distribution, Marginal probability
distribution, Conditional probability distribution, Expected value, Variance,
Covariance, Correlation, Independence, Normal distribution function, Chi-square distribution, Student-t distribution, F-distribution, Data types and
categories, Sampling distribution, Hypothesis, Statistical test

1.1  PROBABILITY

Joint probability, marginal probability, and conditional probability are
important basic tools in financial valuation and regression analyses. These
concepts and their usefulness in financial data analyses will become clearer at
the end of the chapter. To motivate the idea of a joint probability distribution,
let us begin by looking at a time series plot or graph of two financial economic
variables over time, Xt and Yt: for example, the S&P 500 Index aggregate
price-to-earnings ratio Xt, and the S&P 500 Index return rate Yt. The values or numbers
that variables Xt and Yt will take are uncertain before they happen, i.e. before
time t. At time t, both economic variables take realized values or numbers xt
and yt. xt and yt are said to be realized jointly or simultaneously at the same
time t. Thus we can describe their values as a joint pair (xt, yt). If their order is
preserved, it is called an ordered pair. Note that subscript t represents the time
index.
The P/E or price-to-earnings ratio of a stock or a portfolio is a financial
ratio showing the price paid for the stock relative to the annual net income or
profit per share earned by the firm for the year. The reciprocal of the P/E ratio
is called the earnings yield. The earnings yield or E/P reflects the risky annual
rate of return, R, on the stock. This is easily seen from the relationship
$E = $P × R, so that E/P = R and P/E = 1/R.
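As a simple numerical illustration (the figures are hypothetical), a stock trading at $20 with annual earnings of $1 per share has P/E = 20, so its earnings yield E/P = 1/20 = 5 percent, reflecting R = 5 percent.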
In Figure 1.1, it seems that low return corresponded to, or lagged, high P/E
especially at the beginnings of the years 1929-1930, 1999-2002, and 2008-2009.
Conversely, high returns followed relatively low P/E ratios at the
beginning of the years 1949-1954, 1975-1982, and 2006-2007. We shall explore
the issue of the predictability of stock return in a later chapter.
The idea that random variables correspond with each other over time, or
that they display some form of association, is called statistical correlation, which is
defined, or which has interpretative meaning, only when there exists a
joint probability distribution describing the random variables.
Figure 1.1
S&P 500 Index Portfolio Return Rate and Price-Earning Ratio 1872-2009
(Data from Prof Shiller, Yale University)
[Figure: time series plot, 1870-2010, of the S&P 500 Index return rate and the S&P 500 Index aggregate P/E ratio.]

In Figure 1.2, we plot the U.S. national aggregate consumption versus national
disposable income in US$ billion. Disposable income is defined as Personal
Income less personal taxes. Personal Income is National Income less corporate
taxes and corporate retained earnings. In turn, National Income is Gross
Domestic Product (GDP) less depreciation and indirect business taxes such as
sales tax. GDP is essentially the total dollar output or gross income of the
country. If we include repatriations from citizens working abroad, then it
becomes Gross National Product (GNP).
In Figure 1.2, it appears that consumption increases in disposable income.
The relationship is approximately linear. This is intuitive as on a per capita
basis, we would expect that for each person, when his or her disposable
income rises, he or she would consume more. In life-cycle models of financial
economics theory, some types of individual preferences could lead to
consumption as an increasing function of individual wealth which consists of
inheritance as well as fresh income. Sometimes analysis of income also
breaks it down into a permanent part and a transitory part. More on these can
be read in economics articles on life-cycle models and hypotheses.
Figure 1.2
U.S. Annual National Aggregate Consumption versus Disposable Income
1999-2009 (Data from Federal Reserve Board of U.S. in $billion)
[Figure: scatter plot of consumption (vertical axis, $7,200 to $9,600) against disposable income (horizontal axis, $7,000 to $11,000).]

In Figure 1.3 we evaluate the annual year-to-year change in consumption and
disposable income and plot them on an X-Y graph. The point P1 refers to the
bivariate values (x1,y1) where x1 is change in disposable income and y1 is
change in consumption in 2000. P2 refers to the bivariate values (x2,y2) where
x2 is change in disposable income and y2 is change in consumption in 2001,
and so on. Subscripts to x and y indicate time. It may be construed as the end
of a time period and the beginning of the next time period. In this case,
subscript 1 refers to time t1, end of year 2000.
The pattern in Figure 1.3 reveals that disposable income change dropped
from t=1 to t=2, then rose back at t=3. After that there was a sharp drop at t=4
before a wild swing back up at t=5, and so on. The changes seem to be
cyclical. A cyclical but decreasing trend can be seen in consumption.
However, what is more interesting is that consumption and disposable income
visibly increased and decreased together. Thus, if we construe consumption as
purchases of goods and services, then the plot displays the positive income
effect on such effective demand. Theoretically, each Xt and each Yt for every
time t is a random variable.
Figure 1.3
U.S. Annual Year-to-Year Change in National Aggregate Consumption
versus Change in Disposable Income 2000-2009 (Data from Federal
Reserve Board of U.S. in $billion)
[Figure: scatter plot of the points P1 to P10, change in consumption (vertical axis) against change in disposable income (horizontal axis).]

A random variable is a variable that takes on different values each with a
given probability. It is a variable with an associated probability distribution.
For the above scatter plot, since Xt and Yt occur jointly together in (Xt,Yt), the
pair is a bivariate random variable, and thus has a joint bivariate probability
distribution. There are two generic classes of probability distributions: discrete
probability distribution where the random variable takes on only a finite set of
possible values, and continuous probability distribution where the random
variable takes on an uncountable number of possible values. In what follows,
we construct a bivariate discrete probability distribution of the return rates on
two stocks.
Let t denote the day number. Thus, time t=1 is the end of day 1, t=2 is the end
of day 2, and so on. Let Pt be the price in $ of stock ABC at time t. Let Xt+1 be
stock ABC's holding or discrete return rate at time t+1: Xt+1 = Pt+1/Pt − 1. The
corresponding continuously compounded return rate at t+1 is ln(Pt+1/Pt), which
is approximately Xt+1 when Xt+1 is close to 0. Another stock XYZ has discrete
return rate Yt+1 at time t+1.
Table 1.1
Discrete Bivariate Joint Probability of Two Stock Return Rates

                                   Xt+1
 P(xt+1, yt+1)   a1      a2      a3      a4      a5      a6      P(yt+1)
 Yt+1   b1       0.005   0.03    0.03    0.015   0.005   0.01    0.095
        b2       0.015   0.02    0.04    0.015   0.005   0.02    0.115
        b3       0.015   0.025   0.05    0.02    0.015   0.05    0.175
        b4       0.03    0.03    0.07    0.08    0.025   0.035   0.27
        b5       0.02    0.06    0.04    0.05    0.045   0.02    0.235
        b6       0.015   0.035   0.02    0.02    0.005   0.015   0.11
 P(xt+1)         0.1     0.2     0.25    0.2     0.1     0.15

In Table 1.1, we must take care to distinguish between the random variable Xt+1
and the realized value it takes in an outcome, e.g. xt+1 = a3. For example, a3
could be 0.08 or 8%. In the bivariate discrete probability distribution shown in
the table, Xt+1 takes one of six possible values, viz. a1, a2, a3, a4, a5, and a6. The
probability of any one of these six events or outcomes is given by P(Xt+1 = xt+1),
where xt+1 = ak, or in short P(xt+1), and is shown in the last row of the table. The
probability function P(.) for a discrete probability distribution is also called a
probability mass function (pmf). We should think of a probability or chance as
a function that maps or assigns one and only one number in [0,1] ⊂ ℝ to each
realized value of the random variable; ℝ denotes the real line or (−∞, +∞).
Likewise, the probability of any one of the six outcomes of random variable
Yt+1 is given by P(yt+1) and is shown in the last column of the table. Note that
the probabilities of events that make up all the possibilities must sum to 1.
The joint probability of the event or outcome with realized values (xt+1, yt+1) is
given by P(Xt+1 = xt+1, Yt+1 = yt+1). These probabilities are shown in the cells
within the inner box. For example, P(a3, b5) = 0.04. This means that the
probability or chance of Xt+1 = a3 and Yt+1 = b5 simultaneously occurring is
0.04 or 4%. Clearly the sum of all the joint probabilities within the inner box
must equal 1. The marginal probability of Yt+1 = b3 in the context of the
(bivariate) joint probability distribution is the probability that Yt+1 takes the
realized value yt+1 = b3 regardless of the simultaneous value of xt+1. We write

this marginal probability as P_Y(Yt+1 = b3). The subscript Y to the probability
function P(.) is to highlight that it is the marginal probability of Y. Sometimes this
is omitted. Note that this marginal probability is also a univariate probability.
In this case, P_Y(b3) = P(a1,b3) + P(a2,b3) + P(a3,b3) + P(a4,b3) + P(a5,b3) +
P(a6,b3). Notice we simplify the notations, indicating the aj's and bk's are
values of xt+1 and yt+1 respectively where the context is understood. In full
summation notation,

    P_Y(Yt+1 = b3) = Σ_{j=1}^{6} P(Xt+1 = aj, Yt+1 = b3).

This is obviously the sum of numbers in the row involving b3, and is equal to
0.175. The marginal probability of Xt+1 = a2 is given by

    P_X(Xt+1 = a2) = Σ_{k=1}^{6} P(Xt+1 = a2, Yt+1 = bk) = 0.2.

Thus, given the joint probability distribution, the marginal probability
distribution of any one of the joint random variables can be found.

What is Σ_{j=1}^{6} Σ_{k=1}^{6} P(Xt+1 = aj, Yt+1 = bk)? Employing the concept of marginal
probability we just learned,

    Σ_{j=1}^{6} Σ_{k=1}^{6} P(Xt+1 = aj, Yt+1 = bk) = Σ_{j=1}^{6} P_X(Xt+1 = aj) = 1.

In the bivariate probability case, we know that future risk or uncertainty is
characterized by one and only one of the 36 pairs of values (aj, bk) that will
occur. Suppose the event has occurred, and we know only that it is event
{Xt+1 = a2} that occurred, but without knowing which of the events b1, b2, b3, b4,
b5, or b6 had occurred in simultaneity. An interesting question is to ask what is
the probability that {Yt+1 = b3} had occurred, given that we know {Xt+1 = a2}
occurred. This is called a conditional probability, and is denoted by
P(Yt+1 = b3 | Xt+1 = a2). The symbol | represents "given" or "conditional on".
From Table 1.1, we focus on the column where it is given that {xt+1 = a2}
occurred. This is shown below as Table 1.2. The highlighted 0.025 is the joint
probability of (a2, b3).
Intuitively, the higher (lower) this number, the higher (lower) is the
conditional probability that b3 in fact had occurred simultaneously. Given that
a2 had occurred, we are finding the conditional probabilities given {xt+1 = a2},
which in themselves form a proper probability distribution and thus must have
probabilities that add to 1. Then the conditional probability must be the
relative size of 0.025 to the other joint probabilities in the above column.

Table 1.2
Joint Probability of Two Stock Return Rates when Xt+1 = a2

        Xt+1 = a2
 b1      0.03
 b2      0.02
 b3      0.025
 b4      0.03
 b5      0.06
 b6      0.035
We recall Bayes' rule on event sets, that

    P(A | B) = P(A ∩ B) / P(B)

where A and B are events or event sets in a universe. We can think of the
outcome {Xt+1 = a2} as event B, and outcome {Yt+1 = b3} as event A. Events
can be more general, as occurrences {Xt+1 = aj}, {Yt+1 = bk}, {Xt+1 = aj, Yt+1 = bk}
are all events or event sets. More exactly,

    P(b3 | a2) = P(a2, b3) / P_X(a2) = 0.025 / 0.2 = 0.125.

In general,

    P(Yt+1 = bk | Xt+1 = aj) = P(Xt+1 = aj, Yt+1 = bk) / P_X(Xt+1 = aj)
                            = P(Xt+1 = aj, Yt+1 = bk) / Σ_{m=1}^{6} P(Xt+1 = aj, Yt+1 = bm).
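The marginal and conditional probabilities above are easy to verify numerically. The following is a minimal sketch in Python with NumPy (the book's empirical examples use EVIEWS; Python is used here purely as an illustrative alternative), with the joint probabilities of Table 1.1 stored in a matrix whose rows are b1 to b6 and whose columns are a1 to a6.

    import numpy as np

    # Joint probabilities P(x = a_j, y = b_k) from Table 1.1; rows b1..b6, columns a1..a6
    P = np.array([[0.005, 0.030, 0.030, 0.015, 0.005, 0.010],
                  [0.015, 0.020, 0.040, 0.015, 0.005, 0.020],
                  [0.015, 0.025, 0.050, 0.020, 0.015, 0.050],
                  [0.030, 0.030, 0.070, 0.080, 0.025, 0.035],
                  [0.020, 0.060, 0.040, 0.050, 0.045, 0.020],
                  [0.015, 0.035, 0.020, 0.020, 0.005, 0.015]])

    p_x = P.sum(axis=0)   # marginal P(a1)..P(a6): 0.1, 0.2, 0.25, 0.2, 0.1, 0.15
    p_y = P.sum(axis=1)   # marginal P(b1)..P(b6): 0.095, 0.115, 0.175, 0.27, 0.235, 0.11
    print(P.sum())        # the joint probabilities sum to 1

    # Conditional probability P(Y = b3 | X = a2) = P(a2, b3)/P_X(a2) = 0.025/0.2 = 0.125
    print(P[2, 1] / p_x[1])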

When we move from discrete probability distribution, where event sets
consist of discrete elements, to continuous probability distribution, where
event sets are continuous, such as intervals on a real line, we have to deal with
continuous functions.
The continuous joint probability density function (pdf) of bivariate (Xt+1,
Yt+1) is represented by a continuous function f(x,y) where Xt+1 = x, and Yt+1 =
y, and x, y are usually numbers on the real line ℝ. Note that we simplify the
notations of the realized values by dropping their time subscripts here. For a
continuous probability distribution, the events are described not as point
values e.g. x=3, y=4, but rather as intervals, e.g. event A = {(x,y): −2 < y < 3},
B = {(x,y): 0 < x < 9.5}. Then,

    P(A, B) = P(0 < x < 9.5, −2 < y < 3) = ∫_0^{9.5} ∫_{−2}^{3} f(x,y) dy dx.

The support for a random variable such as Xt+1 is the range of x. For joint
normal densities, the ranges are usually (−∞, ∞). Thus, Yt+1 also has the same
support. It is usually harmless to use (−∞, ∞) as supports even if the range is
a finite [a,b], since the probabilities of the null events (−∞, a) and (b, ∞) are zero.
However, when more advanced mathematics is involved, it is typically better
to be precise. Notice also that probability is essentially an integral of a
function, whether continuous or discrete, and is area under the pdf curve.
The marginal probability density functions of Xt+1 and Yt+1 are given by

    f_Y(y) = ∫_{−∞}^{∞} f(x,y) dx   and   f_X(x) = ∫_{−∞}^{∞} f(x,y) dy.

Notice that while f(x,y) is a function containing both x and y, f_Y(y) is a
function containing only y since x is integrated out. Likewise f_X(x) is a
function that contains only x.
The conditional probability density functions are:

    f(x|y) = f(x,y)/f_Y(y)   and   f(y|x) = f(x,y)/f_X(x).

These conditional pdfs contain both x and y in their arguments.


1.2  EXPECTATIONS

The expected value of the random variable Xt+1 is given by

    E(Xt+1) = Σ_{j=1}^{6} aj P_X(aj)   for the discrete distribution in Table 1.1,

and for a continuous pdf,

    E(Xt+1) = ∫ x f_X(x) dx.

The conditional expected value or conditional expectation of Xt+1 given b4 is given
by

    E(Xt+1 | b4) = Σ_{j=1}^{6} aj P(aj | b4)   for the discrete distribution in Table 1.1,

and for a continuous pdf,

    E(Xt+1 | y) = ∫ x f(x|y) dx.

Notice that for the continuous pdf, the conditional expected value given y is a
function containing only y. This means that one can further evaluate more
specific conditional expectations based on given sets of y values, e.g. {y: −2 <
y < 3}. Then E(Xt+1 | −2 < y < 3) is found via

    ∫ x f(x | −2 < y < 3) dx = ∫ x [ ∫_{−2}^{3} f(x,y) dy / ∫_{−2}^{3} ∫ f(x,y) dx dy ] dx
                             = ∫ x [ ∫_{−2}^{3} f(x,y) dy ] dx / ∫_{−2}^{3} ∫ f(x,y) dx dy
                             = ∫_{−2}^{3} [ ∫ x f(x,y) dx ] dy / ∫_{−2}^{3} f_Y(y) dy.

The interchange of integrals in the last step above uses the Fubini Theorem,
assuming some mild regularity conditions are satisfied by the functions.
The variance of a continuous random variable Xt+1 is given by

    var(Xt+1) = σ_X² = ∫ (x − μ_X)² f_X(x) dx.

Variance measures the degree of movement or variability of the random
variable itself. The standard deviation of a random variable Xt+1 is the square
root of the variance. Standard deviation (s.d.) is sometimes referred to as
volatility and sometimes as risk in the finance literature.
The covariance between two continuous random variables Xt+1 and Yt+1 is
given by

    cov(Xt+1, Yt+1) = σ_{XY} = ∫∫ (x − μ_X)(y − μ_Y) f(x,y) dx dy.

Covariance measures the degree of co-movements between two random


variables. If the two random variables tend to move together, i.e. when one
increases (decreases), the probability of the other increasing (decreasing) is

high, then the covariance will be a positive number. If they vary inversely,
then the covariance will be a negative number. If there is no co-moving
relationship and each random variable moves independently, then their
covariance is zero. Notice that covariance is also an expectation or integral.
The co-movement of two random variables is typically better characterized
by their correlation coefficient which is covariance normalized or divided by
their s.d.s.
    corr(Xt+1, Yt+1) = ρ_{XY} = σ_{XY} / (σ_X σ_Y).

One other advantage of using the correlation coefficient rather than the covariance is
that the correlation coefficient is not denominated in the value units of X or Y
but is a ratio.
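As a continuation of the earlier sketch based on Table 1.1, the expected values, variances, covariance, and correlation of a discrete bivariate distribution can be computed directly from the joint probability table. The outcome values assigned to a1,…,a6 and b1,…,b6 below are hypothetical, since Table 1.1 leaves them unspecified; only the probabilities are taken from the table.

    import numpy as np

    # Joint probabilities from Table 1.1 (rows b1..b6, columns a1..a6)
    P = np.array([[0.005, 0.030, 0.030, 0.015, 0.005, 0.010],
                  [0.015, 0.020, 0.040, 0.015, 0.005, 0.020],
                  [0.015, 0.025, 0.050, 0.020, 0.015, 0.050],
                  [0.030, 0.030, 0.070, 0.080, 0.025, 0.035],
                  [0.020, 0.060, 0.040, 0.050, 0.045, 0.020],
                  [0.015, 0.035, 0.020, 0.020, 0.005, 0.015]])

    a = np.array([-0.10, -0.05, 0.00, 0.04, 0.08, 0.12])   # hypothetical values of X (a1..a6)
    b = np.array([-0.08, -0.04, 0.00, 0.03, 0.06, 0.10])   # hypothetical values of Y (b1..b6)

    p_x, p_y = P.sum(axis=0), P.sum(axis=1)
    EX, EY = a @ p_x, b @ p_y                               # expected values
    varX = ((a - EX) ** 2) @ p_x
    varY = ((b - EY) ** 2) @ p_y
    covXY = ((b - EY)[:, None] * (a - EX)[None, :] * P).sum()   # E[(X - EX)(Y - EY)]
    corrXY = covXY / np.sqrt(varX * varY)
    print(EX, EY, varX, varY, covXY, corrXY)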
It is important to understand that correlation measures association but not
causality. In Figure 1.3, clearly changes in consumption and income are
strongly positively correlated. Suppose one concludes that increasing
consumption will increase income, the resulting action will be disastrous. Or
even if one simply concludes (based on some understanding of
macroeconomics theory or by intuition) that increased income causes
increased consumption, it may still be premature as there are so many other
possibilities and qualifications. For example, some other unobserved variables
such as general education level could lead to increases in both income and
consumption.
Or, suppose we think of Yt+1 as GDP and Xt+1 as population. Both increase
with time due to various economic and geo-political reasons. But it will be
disastrous for policy implication to think that increasing population leads to or
causes increase in GDP. This has to assume fairly constant employment and
output.
For general random variables X and Y, (dropping time subscripts), we can
write their means, variances, and covariance as follows.
    E(X) = μ_X
    E(Y) = μ_Y
    var(X) = E(X − μ_X)² = E(X²) − μ_X²
    var(Y) = E(Y − μ_Y)² = E(Y²) − μ_Y²
    cov(X,Y) = E[(X − μ_X)(Y − μ_Y)] = E(XY) − μ_X μ_Y.
Covariances are actually linear operators. A function is f: A → B, or
f: f(a) = b, a ∈ A, b ∈ B, in which A is the domain set and B the range
set, and each a is mapped onto one and only one element b in B. We can think
of an operator as a special case of a function where the domain and range
consist of normed spaces such as vector spaces. These technicalities are not
important except in more advanced courses.
Now consider N random variables Xi, where i = 1, 2, …, N. A
very useful property of covariance is shown below.

    cov( Σ_{i=1}^{N} Xi , Σ_{j=1}^{N} Xj )
        = E[ ( Σ_{i=1}^{N} Xi − E(Σ_{i=1}^{N} Xi) ) ( Σ_{j=1}^{N} Xj − E(Σ_{j=1}^{N} Xj) ) ]
        = E[ Σ_{i=1}^{N} (Xi − E(Xi)) Σ_{j=1}^{N} (Xj − E(Xj)) ]
        = Σ_{i=1}^{N} Σ_{j=1}^{N} E[ (Xi − E(Xi)) (Xj − E(Xj)) ]
        = Σ_{i=1}^{N} Σ_{j=1}^{N} cov(Xi, Xj).

A special case of the above is

    var(X + Y) = cov(X + Y, X + Y)
               = cov(X,X) + cov(X,Y) + cov(Y,X) + cov(Y,Y)
               = var(X) + var(Y) + 2 cov(X,Y).
A convenient property of the correlation coefficient is that it lies between −1
and +1. This is shown as follows. For any real λ,

    var(λX + Y) = λ² σ_X² + σ_Y² + 2λ σ_{XY} ≥ 0.

Put λ = −σ_{XY}/σ_X². Then σ_{XY}²/σ_X² + σ_Y² − 2σ_{XY}²/σ_X² = σ_Y² − σ_{XY}²/σ_X² ≥ 0.
Thus, for any random variables X and Y, σ_Y²(1 − ρ²) ≥ 0, and hence
1 − ρ² ≥ 0, or ρ² ≤ 1. Therefore, −1 ≤ ρ ≤ 1.

1.3  DISTRIBUTIONS

Continuous probability distributions are commonly employed in regression
analyses. The most common probability distribution is the normal (Gaussian)
distribution. The pdf of a normally distributed random variable X is given by

    f(x) = (1/√(2πσ²)) exp( −(1/2) ((x − μ)/σ)² )   for −∞ < x < ∞,

where the mean of x is μ, and the s.d. of x is σ; μ and σ are given constants.

    E(X) = ∫_{−∞}^{∞} x f(x) dx
    Var(X) = E(X − μ)² = ∫_{−∞}^{∞} (x − μ)² f(x) dx

The cumulative distribution function (cdf) of X is

    F(x) = ∫_{−∞}^{x} f(u) du.

We can write the distribution of X as X ~ N(μ, σ²), in which the arguments
indicate the mean and variance of the normal random variable. Suppose we
define a corresponding random variable

    Z ≜ (X − μ)/σ,   or   X ≜ μ + σZ,

where the symbol ≜ means "is defined as". The second equality is interpreted
not just as equivalence in distribution, but that whenever Z takes value z, then
X takes value x = μ + σz. Then,

    E(Z) = 0, and Var(Z) = 1.

Since a constant multiple of a normal random variable is normally distributed,
and a sum of normal random variables is also a normal random variable,
Z ~ N(0,1). Z is called the standard normal variable, and its pdf is denoted φ(z),
where z = (x − μ)/σ.
For the normal distribution N(μ, σ²),

    F(x) = ∫_{−∞}^{(x−μ)/σ} φ(z) dz = Φ((x − μ)/σ),

where φ(·) is the standard normal pdf and z = (x − μ)/σ. The standard
normal cdf is often written as Φ(z). For the standard normal Z,

    P(a ≤ Z ≤ b) = Φ(b) − Φ(a).
The normal distribution is a familiar workhorse in statistical estimation and
testing. The normal distribution pdf curve is bell-shaped. Areas under the
curve are associated with probabilities. The following Figure 1.4 shows a
standard normal pdf N(0,1) and the associated probabilities.
Figure 1.4
Standard Normal Probability Density Function of Z
[Figure: bell-shaped N(0,1) density; the total area from −∞ to ∞ is 1, and the shaded left tail below a = −1.645 has area 5%.]

The corresponding z values of the random variable (r.v.) Z can be seen in the
following standard normal distribution Table 1.3.
For example, the probability P(−∞ < Z < 1.5) = 0.933. This same probability
can be written as P(−∞ ≤ Z < 1.5) = 0.933, P(−∞ < Z ≤ 1.5) = 0.933, or
P(−∞ ≤ Z ≤ 1.5) = 0.933. This is because for a continuous pdf, P(Z = 1.5) = 0.
From the symmetry of the normal pdf, P(−a < Z < ∞) = P(−∞ < Z < a), and we
can also compute the following.
P(Z > 1.5) = 1 − P(−∞ < Z ≤ 1.5) = 1 − 0.933 = 0.067.
P(−∞ < Z ≤ −1.0) = P(Z > 1.0) = 1 − P(−∞ < Z ≤ 1.0) = 1 − 0.841 = 0.159.
P(−1.0 < Z < 1.5) = P(−∞ < Z < 1.5) − P(−∞ < Z ≤ −1.0) = 0.933 − 0.159 = 0.774.
P(Z ≤ −1.0 or Z ≥ 1.5) = 1 − P(−1.0 < Z < 1.5) = 1 − 0.774 = 0.226.

Table 1.3

 Z        Area under curve        Z        Area under curve
          from −∞ to Z                     from −∞ to Z
 0.000    0.500                   1.600    0.945
 0.100    0.539                   1.645    0.950
 0.200    0.579                   1.700    0.955
 0.300    0.618                   1.800    0.964
 0.400    0.655                   1.960    0.975
 0.500    0.691                   2.000    0.977
 0.600    0.726                   2.100    0.982
 0.700    0.758                   2.200    0.986
 0.800    0.788                   2.300    0.989
 0.900    0.816                   2.330    0.990
 1.000    0.841                   2.400    0.992
 1.100    0.864                   2.500    0.994
 1.282    0.900                   2.576    0.995
 1.300    0.903                   2.600    0.996
 1.400    0.919                   2.700    0.997
 1.500    0.933                   2.800    0.998

Several values of Z under N(0,1) are commonly encountered, viz. 1.282,


1.645, 1.960, 2.330, and 2.576.
P(Z > 1.282) = 0.10 or 10% .
P(Z < -1.645 or Z > 1.645) = 0.10 or 10% .
P(Z > 1.960) = 0.025 or 2.5% .
P(Z < -1.960 or Z > 1.960) = 0.05 or 5% .
P(Z > 2.330) = 0.01 or 1% .
P(Z < -2.576 or Z > 2.576) = 0.01 or 1% .
The case for P(Z<-1.645) = 5% is shown in Figure 1.4.
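These tail probabilities need not be read off Table 1.3; any statistical package tabulates the standard normal cdf. A minimal sketch in Python using scipy.stats (a tooling choice of mine for illustration; the book itself works with EVIEWS and EXCEL) reproduces the figures quoted above.

    from scipy.stats import norm

    print(norm.cdf(1.5))                    # P(Z < 1.5) = 0.933
    print(norm.sf(1.5))                     # P(Z > 1.5) = 1 - 0.933 = 0.067
    print(norm.cdf(1.5) - norm.cdf(-1.0))   # P(-1.0 < Z < 1.5) = 0.774
    print(norm.sf(1.282))                   # P(Z > 1.282) = 0.10
    print(2 * norm.sf(1.960))               # P(Z < -1.960 or Z > 1.960) = 0.05
    print(norm.ppf(0.95))                   # the z value with 95% of the area to its left: 1.645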
The bivariate normal distribution of random variables X, Y is given by

    f(x, y) = 1/(2π σ_X σ_Y √(1 − ρ²)) exp(−q/2)                                  (1.1)

where

    q = 1/(1 − ρ²) [ ((x − μ_X)/σ_X)² − 2ρ ((x − μ_X)/σ_X)((y − μ_Y)/σ_Y) + ((y − μ_Y)/σ_Y)² ]

and ρ = cov(x, y)/(σ_X σ_Y) ≡ ρ_{XY}.
The multivariate normal distribution pdf (p-variate normal pdf) is given by

    f(x1, x2, …, xp) = (2π)^{−p/2} |Σ|^{−1/2} exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) )

where x is the vector of random variables X1 to Xp, μ is the p×1 vector of
means of x, and Σ is the p×p covariance matrix of x. If p = 2 is substituted into
the above, the bivariate pdf shown in equation (1.1) can be obtained.
The kth moment of random variable X is ∫ x^k f(x) dx, where f(x) is the pdf
of X. If μ = E(X) is the mean of X, the kth central moment of X is
∫ (x − μ)^k f(x) dx. Notice that the variance is the second central moment of
X. The third central moment divided by variance^{3/2} is known as skewness. The fourth central
moment divided by variance² is known as kurtosis.
The normal r.v. X ~ N(μ, σ²) has mean μ, variance σ², skewness 0, and
kurtosis 3 (its fourth central moment is 3σ⁴). Hence the standard normal variate Z ~ N(0,1)
has mean 0, variance 1, skewness 0, and kurtosis 3.
Many financial variables, e.g. daily stock returns, currency rate of change,
etc. display skewness as well as large kurtosis compared with the benchmark
normal distribution with symmetrical pdf, skewness = 0, and kurtosis = 3.
Figure 1.5
Example of a Pdf with Negative Skewness and Large Kurtosis
[Figure: a normal pdf (shaded) overlaid with a pdf f(x) showing negative or left skewness (longer left tail) and fat tails with kurtosis > 3.]
Departure from normality is illustrated by a pdf in Figure 1.5. The shaded
area in Figure 1.5 shows a normal pdf. The unshaded curve shows pdf of a
random variable with negative skewness and a kurtosis larger than that of the
normal random variable.
The concept of stochastic independence between random variables is
important. Two random variables X and Y are said to be stochastically
independent if and only if their joint pdfs can be expressed as follows:
f(X,Y) = fx(X) fy(Y).
One implication of the above is that for any function h(.) of X and any
function g(.) of Y, their expectation can be found as
E(h(X) g(Y)) = E(h(X)) E(g(Y)).
A special case is the covariance operator. If X and Y are (stochastically)
independent, then it implies their covariance is zero:
    cov(X,Y) = E[(X − μ_X)(Y − μ_Y)] = E(X − μ_X) E(Y − μ_Y) = 0.
The converse is not always true. It is true only for special cases such as when
X and Y are jointly normally distributed. When X and Y are jointly normally
distributed, then if they have zero covariance, they are stochastically
independent. For the bivariate normal pdf, the conditional pdf is

    g(x|y) = f(x, y) / f_Y(y).

Writing this out,

    g(x|y) = [ (2π σ_X σ_Y √(1 − ρ²))^{−1} e^{−q/2} ] / [ (√(2π) σ_Y)^{−1} e^{−(1/2)((y − μ_Y)/σ_Y)²} ]
           = (2π σ_X² (1 − ρ²))^{−1/2} exp{ −1/(2(1 − ρ²)) [ (x − μ_X)/σ_X − ρ (y − μ_Y)/σ_Y ]² }
           = (2π σ²_{X|Y})^{−1/2} exp{ −(x − μ_{X|Y})² / (2 σ²_{X|Y}) }

where σ²_{X|Y} = (1 − ρ²) σ_X² is the variance of X conditional on Y = y,
and μ_{X|Y} = μ_X + ρ (σ_X/σ_Y)(y − μ_Y) is the mean of X conditional on Y = y.

There are some common continuous probability distributions that are
related to the normal distribution. If random variable X ~ N(μ, σ²), then the
random variable V = ((X − μ)/σ)² ~ χ²_1 has a chi-square distribution with 1
degree of freedom. If X1, X2, X3, …, Xn are n random variables each
independently drawn from the same population distribution N(μ, σ²), or
think of {Xi}_{i=1 to n} as a random sample of size n, then

    Σ_{i=1}^{n} ((Xi − μ)/σ)² ~ χ²_n

is a chi-square distribution with n degrees of freedom.
If X ~ N(0,1), and V ~ χ²_r, and X, V are stochastically independent,
then X/√(V/r) has a Student-t distribution with r degrees of freedom. If U ~ χ²_{r1},
V ~ χ²_{r2}, and U, V are stochastically independent, then

    (U/r1) / (V/r2) ~ F_{r1, r2}

is an F-distribution with degrees of freedom r1 and r2. If random variable
X ~ N(μ, σ²), and Y = exp(X) or X = ln(Y), then Y is a random variable with
a lognormal distribution.
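These relationships can be checked by simulation. The sketch below, in Python with NumPy (illustrative only and not part of the book's EVIEWS examples), builds chi-square, Student-t, and F random variables out of independent normals exactly as defined above; the sample moments printed at the end are close to their theoretical values.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Chi-square with 1 degree of freedom: the square of a standard normal
    chi1 = rng.standard_normal(n) ** 2                  # sample mean should be close to 1

    # Student-t with r degrees of freedom: Z / sqrt(V/r), Z ~ N(0,1), V ~ chi-square(r), independent
    r = 5
    t_r = rng.standard_normal(n) / np.sqrt(rng.chisquare(r, n) / r)   # variance close to r/(r-2)

    # F with (r1, r2) degrees of freedom: (U/r1)/(V/r2), U and V independent chi-squares
    r1, r2 = 3, 8
    f_r = (rng.chisquare(r1, n) / r1) / (rng.chisquare(r2, n) / r2)   # mean close to r2/(r2-2)

    print(chi1.mean(), t_r.var(), f_r.mean())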
1.4  STATISTICAL ESTIMATION

Suppose a random variable X with a fixed normal distribution N(μ, σ²) is
given. Suppose there is a random draw of a number or outcome from this
distribution. This is the same as stating that random variable X takes a realized
value x. Let this value be x1; it may be, say, 3.89703. Suppose we repeatedly
make random draws and thus form a sample of n observations: x1, x2, x3, …,
xn−1, xn. This is called a random sample with a sample size of n. Each xi comes
from the same distribution N(μ, σ²), but each of xi, xj are realizations from
independent sampling.
We next compute a statistic, which is a function of the realized values
{xk}, k = 1, 2, …, n. Consider a statistic, the sample mean

    x̄ = (1/n) Σ_{k=1}^{n} xk.

Another common sample statistic is the unbiased sample variance

    s² = (1/(n−1)) Σ_{k=1}^{n} (xk − x̄)².

Each time we select a random sample of size n, we obtain a realization x̄.
Thus, x̄ is itself a realization of a random variable, and this r.v. can be
denoted by

    X̄_n = (1/n) Σ_{k=1}^{n} Xk

where Xk above is clearly the random variable from N(μ, σ²) itself. X̄_n is a
random variable and its probability distribution is called the sampling
distribution of the mean, or perhaps more clearly, the distribution of the
sample mean.
What is the exact probability distribution of X̄_n?

    E(X̄_n) = E[ (1/n) Σ_{k=1}^{n} Xk ] = (1/n) Σ_{k=1}^{n} E(Xk) = (1/n) Σ_{k=1}^{n} μ = μ.
    var(X̄_n) = (1/n²) var( Σ_{k=1}^{n} Xk ) = (1/n²) Σ_{k=1}^{n} var(Xk) = (1/n²) n σ² = σ²/n.

Since X̄_n is a normal random variable, therefore

    X̄_n ~ N(μ, σ²/n).

The standardized normal random variable then becomes

    (X̄_n − μ) / (σ/√n) = √n (X̄_n − μ)/σ ~ N(0,1).

On the other hand, E(s²) = σ². But s² itself has a sampling distribution:

    (n − 1) s²/σ² ~ χ²_{n−1}.

Thus it can be seen that E(χ²_{n−1}) = n − 1, the number of
degrees of freedom of the chi-square random variable. Therefore,

    [ √n (X̄_n − μ)/σ ] / √(s²/σ²) = √n (X̄_n − μ)/s

is distributed as Student-t with (n − 1) degrees of freedom and zero mean.
Denote the random variable with a t-distribution and n − 1 degrees of freedom as t_{n−1}.
Then,

    √n (X̄_n − μ)/s ~ t_{n−1}.

Suppose we find (−a, +a), a > 0, such that Prob(−a ≤ t_{n−1} ≤ +a) = 95%. Since t_{n−1} is
symmetrically distributed, then Prob(−a ≤ t_{n−1}) = 97.5% and Prob(t_{n−1} ≤ +a) =
97.5%. Thus,

    Prob( −a ≤ √n (X̄_n − μ)/s ≤ a ) = 0.95.

Also,

    Prob( X̄_n − a s/√n ≤ μ ≤ X̄_n + a s/√n ) = 0.95.

Suppose x1, x2, x3, …, xn−1, xn are randomly sampled from X ~ N(μ, σ²).
Sample size n = 30. The t-statistic value a such that Prob(t29 ≤ a) = 97.5% is
a = 2.045. Then

    Prob( X̄_n − 2.045 s/√30 ≤ μ ≤ X̄_n + 2.045 s/√30 ) = 0.95.

Hence the 95% confidence interval estimate of μ is given by

    ( X̄_n − 2.045 s/√30 , X̄_n + 2.045 s/√30 )

when the estimated s is entered.
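The confidence interval calculation is mechanical once the critical t value is available. Below is a minimal sketch in Python with SciPy (the data are simulated and hypothetical; in practice x would hold the observed sample) that reproduces the 2.045 critical value for 29 degrees of freedom and forms the 95% interval.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.01, scale=0.05, size=30)   # hypothetical random sample, n = 30

    n = len(x)
    xbar = x.mean()
    s = x.std(ddof=1)                               # unbiased sample standard deviation
    t_crit = stats.t.ppf(0.975, df=n - 1)           # = 2.045 for n - 1 = 29 degrees of freedom

    lower = xbar - t_crit * s / np.sqrt(n)
    upper = xbar + t_crit * s / np.sqrt(n)
    print(t_crit, (lower, upper))                   # 95% confidence interval estimate of mu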

1.5  STATISTICAL TESTING

In many situations there is a priori (or ex-ante) information about the value of
the mean μ, and it may be desirable to use observed data to test if the
information is correct. μ is called a parameter of the population or fixed
distribution N(μ, σ²). A statistical hypothesis is an assertion about the true
value of the population parameter, in this case μ. A simple hypothesis
specifies a single value for the parameter while a composite hypothesis
specifies more than one value. We will work with the simple null hypothesis H0
(sometimes this is called the maintained hypothesis), which is what is
postulated to be true. The alternative hypothesis HA is what will be the case if
the null hypothesis is rejected. Together the values specified under H0 and HA
should form the total universe of possibilities of the parameter. For example,

    H0: μ = 1
    HA: μ ≠ 1.
A statistical test of the hypothesis is a decision rule that, given the inputs from
the sample values and hence sampling distribution, chooses to either reject or
else not reject (intuitively similar in meaning to accept) the null H 0. Given
this rule, the set of sample outcomes or sample values that lead to rejection of
the H0 is called the critical region. If H0 is true but is rejected, a Type I error is
committed. If H0 is false but is accepted, a Type II error is committed.
The statistical rule on H0: μ = 1, HA: μ ≠ 1, is that if the test statistic

    t_{n−1} = (X̄_n − 1) / (s/√n),

which is t-distributed with (n − 1) degrees of freedom under H0, falls
within the critical region (shaded), defined as {t_{n−1} < −a or t_{n−1} > +a}, a > 0, as
shown in Figure 1.6 below, then H0 is rejected in favor of HA. Otherwise H0 is
not rejected, and is accepted.
Figure 1.6
Critical Region under the Null Hypothesis H0: μ = 1
[Figure: pdf of t_{n−1} centered at 0, with the critical region shaded in the two tails beyond −a and +a.]

If H0 were true, then the t-distribution would be correct, and therefore the
probability of rejecting H0 would be the area of the critical region, or 5% in
this case. Notice that for n = 61, P(−2 < t60 < 2) ≈ 0.95. Moreover, the
t-distribution is symmetrical, so each of the right and left shaded tails makes up
2.5%. This is called a 2-tailed test with a significance level of 5%. The
significance level is the probability of committing a Type I error when H0 is
true. In the above example, if the sample t-statistic is 1.045, then it is < 2, and
we cannot reject H0 at the 2-tailed 5% significance level. Given a sample
t-value, we can also find its p-value, which is the probability under H0 of t60
exceeding 1.045 in a one-tailed test, or of |t60| exceeding 1.045 in a 2-tailed test.
In the above 2-tailed test, the p-value of a sample statistic of 1.045 would be
2 × Prob(t60 > 1.045) = 2 × 0.15 = 0.30 or 30%. Another way to state the decision rule
is that if the p-value < test significance level, reject H0; otherwise H0 cannot
be rejected.
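The two-tailed test of H0: μ = 1 can be carried out in a few lines. The sketch below uses Python with SciPy on a simulated sample of size n = 61 (the data are hypothetical); it computes the t-statistic both directly and via scipy.stats.ttest_1samp, along with the p-value and the 5% decision.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(loc=1.02, scale=0.2, size=61)    # hypothetical sample, n = 61

    # t-statistic for H0: mu = 1, computed directly as (xbar - 1)/(s/sqrt(n))
    t_manual = (x.mean() - 1.0) / (x.std(ddof=1) / np.sqrt(len(x)))

    # The same two-tailed test via scipy; the p-value is 2 * Prob(t_60 > |t|) under H0
    t_stat, p_value = stats.ttest_1samp(x, popmean=1.0)

    t_crit = stats.t.ppf(0.975, df=len(x) - 1)      # about 2.00 for 60 degrees of freedom
    print(t_manual, t_stat, p_value)
    print(abs(t_stat) > t_crit, p_value < 0.05)     # the two rejection criteria agree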
In theory, if we reduce the probability of Type I error, the probability of
Type II error increases, and vice-versa. This is illustrated as follows.

Figure 1.7
[Figure: pdf of t_{n−1} under H0 (solid, centered at 0) and under an alternative with μ > 1 (dotted, shifted right); the critical region beyond −2 and +2 is shaded.]

Suppose H0 is false, and μ > 1, so the true t_{n−1} distribution is represented by the
dotted curve in Figure 1.7. The critical region {t_{n−1} < −2.00 or t_{n−1} > 2.00}
remains the same, so the probability of committing a Type II error is 1 − the sum of
the shaded areas. Clearly, this probability increases as we reduce the critical
region in order to reduce Type I error. Although it is ideal to reduce both types
of errors, the tradeoff forces us to choose between the two. In practice, we fix
the probability of Type I error when H0 is true, i.e. determine a fixed
significance level, e.g. 10%, 5%, or 1%. The power of a test is the probability
of rejecting H0 when it is false. Thus power = 1 − P(Type II error). Or, power
equals the sum of the shaded areas in Figure 1.7. Clearly this power is a function of the
alternative parameter value μ ≠ 1. We may determine such a power function
of μ ≠ 1.
Thus reducing the significance level also reduces power, and vice-versa. In
statistics, it is customary to want to design a test so that its power function of
μ ≠ 1 equals or exceeds that of any other test with equal significance level for
all plausible parameter values μ ≠ 1 in HA. If such a test is found, it is called a
uniformly most powerful test.
We have seen the performance of a 2-tailed test. Sometimes we embark
instead on a one-tailed test such as H0: μ = 1, HA: μ > 1, in which we
theoretically rule out the possibility of μ < 1, i.e. P(μ < 1) = 0. In this case, it
makes sense to limit the critical region to only the right side, for when μ > 1,
t_{n−1} will tend to become larger. Thus at the one-tail 5% significance level, the
critical region is {t60 > 1.671} for n = 61.
1.6  DATA TYPES

Consider the types of data series that are commonly encountered in regression
analyses. There are four generic types, viz.
(a) Time series
(b) Cross-sectional
(c) Pooled time series cross-sectional
(d) Panel/longitudinal/micropanel

Time series are the most prevalent in empirical studies in finance. They are
data indexed by time. Each data point is a realization of a random variable at a
particular point in time. The data occur as a series over time. A sample of such
data is typically a collection of the realized data over time, such as the history
of ABC stock's prices on a daily basis from 1970 January 2 till 2002
December 31.
Cross-sectional data are also common in finance. An example is the
reported annual net profit of all companies listed on an exchange for a specific
year. If we collect the cross sections for each year over a 20-year period, then
we have a pooled time series cross section of companies over 20 years. Panel
data are less used in finance. They are data collected by tracking specific
individuals or subjects over time and across subjects.
The nature of data also differs according to the following categories.
(a) Quantitative
(b) Ordinal, e.g. very good, good, average, poor
(c) Nominal/categorical, e.g. married/not married, college graduate/non-graduate
Quantitative data such as return rates, prices, volume of trades, etc. have
the fewest limitations and therefore the greatest use in finance. These data
provide not only ordinal rankings or comparisons of magnitudes, but also
exact degrees of comparison. There are some limitations and therefore
special considerations to the use of the other categories of data. In the
treatment of ordinal and nominal data, we may have to use specific tools such
as dummy variables in regression.
1.7  PROBLEM SET

1.1 X, Y, Z are r.v.s with a joint pdf f(X,Y,Z) that is integrable. Show using
the concept of marginal pdfs that E(X+Y+Z) = E(X)+E(Y)+E(Z) by
integrating over (X+Y+Z).

1.2 Show how one could express cov( Σ_{i=1}^{N} Xi , Σ_{j=1}^{N} Xj ) in terms of the
N×N covariance matrix Σ_{N×N}.
1.3 The following is the probability distribution table of a trivariate U1, U2,
and U3.

    U1             -1     -1     -1     -1      1      1      1      1
    U2             -2     -2      2      2     -2     -2      2      2
    U3             -3      3     -3      3     -3      3     -3      3
    P(U1,U2,U3)   .125   .125   .125   .125   .125   .125   .125   .125

Find the bivariate probability distribution P(U1, U2). Find the marginal
P(U3).
1.4 In the probability distribution table of a trivariate U1, U2, and U3,

    U1             -1     -1     -1     -1      1      1      1      1
    U2             -2     -2      2      2     -2     -2      2      2
    U3             -3      3     -3      3     -3      3     -3      3
    P(U1,U2,U3)   .125   .125   .125   .125   .125   .125   .125   .125

after finding P(U1,U2), suppose Yi = bXi + Ui, i = 1, 2, and X1 = 1, X2 = 2.
(i)   Find the E(Ui)'s, and cov(U1, U2).
(ii)  Find the probability distribution of the estimator
      b̂ = ( Σ_{i=1}^{2} Xi Yi ) / ( Σ_{i=1}^{2} Xi² ).
      This probability distribution of the estimator is called the sampling
      distribution of b̂.
(iii) Find the mean and variance of b̂ from its probability distribution.

1.5 X, Y have joint pdf f(X,Y) = exp(−X−Y) for 0 < X, Y < ∞, and the pdf is 0
elsewhere. Find the marginal pdfs of X and Y. Are X and Y stochastically
dependent?
1.6 X, Y have a joint pdf f(X,Y) = 1 on the set {0 ≤ X ≤ 2, 0 ≤ Y ≤ X/2}.
(i)   Find the marginal distributions of X and Y.
(ii)  Find the variances of X and Y, and the covariance of X, Y.
(iii) Find the conditional means E(X|Y), E(Y|X), and conditional
      variances var(X|Y), var(Y|X).
1.7 Xit is distributed as independent univariate normal, N(0,1), for i = 1, 2, 3, and
t = 1, 2, …, 60. Yt = 0.5X1t + 0.3X2t + 0.2X3t. What are the mean and the
standard deviation of Yt? If a computer program runs and churns out
3K random values Zj belonging to the univariate normal N(0,1)
distribution, and Wi = 0.5Z_{3i−2} + 0.3Z_{3i−1} + 0.2Z_{3i} for i = 1, 2, …, K, what is
the variance of the sampling mean K⁻¹ Σ_{i=1}^{K} Wi?

1.8 Suppose r.v. Xi ~ N(0, 1/60) for i = 1, 2, …, K, and Xi and Xj are
independent when i ≠ j. If A·Xi ~ N(0,1) where A is a constant, what is A?
If random vector Y = (X1, X2, …, XK), what is the distribution of Y Y^T?
1.9 If cov(a,b) = 0.1, cov(c,a) = 0.2, cov(d,a) = 0.3, and x = b + 2c + 3d, what
is cov(a, x)?
1.10 Suppose X, Y, Z are jointly distributed as follows.

    Probability    0.5    0.5
    X              +1     -1
    Y              -1      0
    Z               0     +1

Find cov(X,Y), cov(X,Z), and cov(Y,Z).


FURTHER RECOMMENDED READINGS
[1] Alexander M. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics, 3rd or later editions, McGraw-Hill.
[2] Robert V. Hogg and Allen T. Craig, Introduction to Mathematical Statistics, 4th or later editions, Collier Macmillan.


Chapter 2
STATISTICAL LAWS AND
CENTRAL LIMIT THEOREM
APPLICATION: STOCK RETURN DISTRIBUTIONS
Key Points of Learning
Stochastic process, Stationarity, Law of large numbers, Central limit theorem,
Rates of return, Lognormal distribution, Information sets, Random walk, Law
of iterated expectations, Unconditional expectation, Conditional mean,
Conditional variance, Jarque-Bera test

In this chapter, we shall build on the fundamental notions of probability
distribution and statistics in the last chapter, and extend consideration to a
sequence of random variables. In financial application, it is mostly the case
that the sequence is indexed by time, hence a stochastic process. Interesting
statistical laws or mathematical theories result when we look at the
relationships within a stochastic process. We introduce an application of the
Central Limit Theorem to the study of stock return distributions.
2.1  STOCHASTIC PROCESS

A stochastic process is a sequence of random variables X1, X2, X3, …, and so
on. Each Xi has a probability density function or pdf. A common type of
sequence is indexed by time t1 < t2 < t3 < … for X_{t1}, X_{t2}, X_{t3}, …, and so on.
A stochastic process {Xi}_{i=1,2,…} is said to be weakly (covariance) stationary if
each Xi has the same mean and variance, and cov(Xi, Xi+k) = γ(k), i.e. a
function dependent only on k. As an example, suppose monthly stock return
rates r̃t, where

    r̃1 = return rate in Jan 2009
    r̃2 = return rate in Feb 2009
    … etc.

form a stochastic process that is weakly stationary. If Var(r̃1) = 0.25, what is
Var(r̃5)? Clearly this is the same constant, 0.25. If Cov(r̃1, r̃3) = 0.10, what is
Cov(r̃7, r̃9)? Clearly, this is 0.10 since the time gap between the two random
variables is similarly two months in either case.
Suppose we have a realized history of the past 60 monthly return rates
{rt}_{t=1,2,…,60}. Each of these rt's is a known number, e.g. 0.01, one percent, or
−0.005, negative half a percent. The realized number rt is a sample point taken
from the pdf of the random variable r̃t. We have to learn to distinguish between
what is a random variable that has an attached pdf, and what is a realized
sample point that is a given number. Notice that sometimes a tilde is put
over the variable to denote it as being random. The past history or realized
values of the stochastic process, {rt}_{t=1,2,…,60}, e.g. {0.010, −0.005, 0.003, 0.008,
−0.012, …, 0.008}, is called a time series, which is a time-indexed
sequence of sample points of each random variable r̃t in the stochastic process
{r̃t}t.
A stochastic process {Xi}i is said to be strongly stationary if each set of
{Xi, Xi+1, Xi+2, …, Xi+k} for any i and the same k has the same joint
multivariate pdf independent of i. As an example, consider joint multivariate
normal distributions, MVN. Suppose the following is strongly stationary,

    (r̃1, r̃2, r̃3) ~ MVN(M_{3×1}, Σ_{3×3}),

then clearly the joint multivariate pdf of (r̃3, r̃4, r̃5) is the same MVN(M, Σ).
There are two very important and essential theorems dealing with
stochastic processes and therefore applicable to the study of time series of
empirical data. They are the Law of Large Numbers and the Central Limit
Theorem.
2.2  LAW OF LARGE NUMBERS

The Law of Large Numbers (LLN) states that if x1, x2, …, xn is a realized
sample randomly chosen from a random variable Xi with a fixed pdf, where
each draw is taken from an independent Xi, then the sample average or
sample mean converges to the expected value of the random variable Xi, or E(Xi).
This is sometimes referred to as Kolmogorov's LLN when the convergence
refers to a sample mean taken from a time series, and the corresponding
stochastic process is stationary and also independently distributed. The latter
implies that any Xj and Xk within {Xi}i are independent. We will discuss
convergence in a later chapter, but for now, it suffices to understand it as
"approaching in value in some arbitrarily close fashion". Thus, the law of
large numbers states:

    lim_{n→∞} (1/n) Σ_{i=1}^{n} xi = μ,   where E(Xi) = μ.
An extension of the above, relaxing the assumption of independence, states
that in a (stationary) ergodic stochastic process {Xt}t with mean μ, i.e. E(Xt) =
μ for all t, if x1, x2, …, xn is a realized sample randomly chosen from the
stochastic process {Xt}t, then

    lim_{n→∞} (1/n) Σ_{i=1}^{n} xi = μ.

An ergodic stochastic process is one in which two random variables in the
process that are sufficiently far apart in time index tend toward (asymptotic)
independence. The independently identically distributed process (i.i.d.
process) is a special case of the stationary ergodic process. In many
applications, usually stationary ergodicity or some related variation is
assumed and then sample means are used to estimate population means.
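A quick simulation illustrates the law. The sketch below (Python with NumPy; the parameters are hypothetical and chosen only for illustration) draws i.i.d. observations with mean 0.005 and shows the running sample mean settling toward that value as n grows.

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma = 0.005, 0.05                     # hypothetical monthly mean return and s.d.
    x = rng.normal(mu, sigma, size=100_000)     # i.i.d. draws from one fixed distribution

    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
    print(running_mean[9], running_mean[999], running_mean[-1])   # approaches mu = 0.005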
2.3  CENTRAL LIMIT THEOREM

The Central Limit Theorem states that if X1, X2, …, Xn is a vector of random
variables drawn from the same stationary distribution with mean μ and
variance σ², and suppose we let

    Y = ( Σ_{i=1}^{n} Xi − nμ ) / (σ√n),   or equivalently   Y = √n (X̄ − μ)/σ,

then Y is a random variable that converges in distribution, as the sample size n
approaches infinity, to a standard normal random variable, i.e.

    lim_{n→∞} Y ~ N(0,1).

For sufficiently large n, suppose Y = √n ( (1/n) Σ_{i=1}^{n} Xi − μ ) / σ ≈ N(0,1). Then

    (1/n) Σ_{i=1}^{n} Xi = μ + (σ/√n) Y ≈ N(μ, σ²/n).

Or,

    Σ_{i=1}^{n} Xi = nμ + σ√n Y ≈ N(nμ, nσ²).                                       (2.1)

This says that when n is large, the sample mean X̄, itself a random variable, is
normally distributed with mean μ and variance σ²/n.
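The theorem is easy to see by simulation. In the sketch below (Python with NumPy; the population and parameters are hypothetical), the draws come from a skewed, non-normal distribution, yet the sample mean over n = 100 observations has mean close to μ and standard deviation close to σ/√n, and its histogram would look bell-shaped.

    import numpy as np

    rng = np.random.default_rng(4)
    mu, sigma, n, trials = 0.005, 0.05, 100, 20_000   # hypothetical parameters

    # A skewed (exponential) population shifted to have mean mu and s.d. sigma
    draws = rng.exponential(scale=sigma, size=(trials, n)) - sigma + mu

    xbar = draws.mean(axis=1)              # one sample mean per trial
    print(xbar.mean(), xbar.std())         # close to mu and to sigma/sqrt(n) = 0.005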
2.4  STOCK RETURN RATES

In finance, the lognormal distribution is important. A pertinent example is a
stock price at time t, Pt. There are several empirically observed characteristics
of stock prices that are neat and could be appropriately captured by the
lognormal probability distribution model.
(a) Pt > 0, i.e. prices must be strictly positive.
(b) Return rates derived from stock prices over time are normally distributed
when measured over a sufficiently long interval, e.g. a month.
(c) Returns could display a small trend or drift, i.e. increases or decreases over
time.
(d) The ex-ante (anticipated) variance of return rate increases with the holding
period.
We examine the case when Pt is lognormally distributed to see if this
distribution offers the above characteristics. Lognormally distributed Pt means
that ln Pt ~ Normal. Thus, Pt = exp(N) > 0 where N is a normal random
variable. Hence (a) is satisfied. Likewise, ln P_{t+1} ~ Normal. Therefore,

    ln P_{t+1} − ln Pt ~ Normal,   or   ln(P_{t+1}/Pt) ~ Normal.

Now ln(P_{t+1}/Pt) = r_{t,t+1} is the continuously compounded stock return over
the holding period or interval [t, t+1). If the time interval or each period is small,
this is approximately the discrete return rate P_{t+1}/Pt − 1. However, the discrete
return rate is bounded from below by −1. Contrary to that, the return r_{t,t+1} has
(−∞, ∞) as support, as in a normal distribution.
We can justify how r_{t,t+1} can be reasonably normally distributed, or
equivalently, that the price is lognormally distributed, over a longer time
interval. Consider a small time interval or period Δ = 1/T, such that
ln(P_{t+Δ}/Pt), the small-interval continuously compounded return, is a random
variable (not necessarily normal) with mean μ_Δ = μ/T and variance σ²_Δ =
σ²/T. The allowance of a small μ_Δ ≠ 0 in the above satisfies (c).
Aggregating the returns,

    ln(P_{t+Δ}/Pt) + ln(P_{t+2Δ}/P_{t+Δ}) + ln(P_{t+3Δ}/P_{t+2Δ}) + … + ln(P_{t+TΔ}/P_{t+(T−1)Δ}) = ln(P_{t+TΔ}/Pt).   (2.2)

The right-hand side of equation (2.2) is simply the continuously compounded
return ln(P_{t+1}/Pt) = r_{t,t+1} over the longer period [t, t+1), whose length is
made up of T = 1/Δ periods. The left-hand side of equation (2.2),
invoking the Central Limit Theorem, for large T, is ≈ N(Tμ_Δ, Tσ²_Δ) or N(μ, σ²)
since TΔ = 1. Hence r_{t,t+1} ≈ N(μ, σ²), which satisfies (b) and justifies the use of the
lognormal distribution for prices.
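Equation (2.2) is just the statement that continuously compounded returns add across sub-periods, while discrete returns compound multiplicatively. The following sketch (Python with NumPy; the price path is simulated and hypothetical) checks both facts on one simulated path.

    import numpy as np

    rng = np.random.default_rng(5)
    T, mu, sigma = 21, 0.01, 0.04                       # hypothetical: 21 sub-periods in one period

    sub_r = rng.normal(mu / T, sigma / np.sqrt(T), T)   # small-interval log returns
    prices = 100 * np.exp(np.cumsum(sub_r))             # price path starting from 100; always > 0

    # Continuously compounded returns add up across sub-periods, as in equation (2.2)
    print(np.log(prices[-1] / 100), sub_r.sum())        # identical

    # Discrete holding-period returns instead aggregate geometrically
    R_sub = np.exp(sub_r) - 1
    print(np.prod(1 + R_sub) - 1, prices[-1] / 100 - 1) # identical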
Moreover, Pt k Pt e t ,t k 0 even if return rt+k may sometimes be
negative. Suppose the returns rt,t+1 , rt+1,t+2 , rt+2,t+3 , .. , rt+k-1,t+k are
independent. Then
r

k1

P
var ln t k var rt j,t j1 k 2 .
Pt
j0

Thus, ex-ante variance of return increases with holding period [t, t+k). This
satisfies characteristic (d).
It is important to recognize that the discrete or holding period return rate

Pt 1
1 does not display some of the appropriate characteristics. The discrete
Pt
period returns have to be aggregated geometrically in the following way.

Pt 1
1 R t,t 1
Pt
Pt k k 1 Pt j1 k 1

1 R t j,t j1 0,
Pt
j0 Pt j
j0
The lower boundary of zero is implied by the limited liability of owners of
listed stocks. This discrete setup is cumbersome and poses analytical
intractability when it comes to computing drifts and variances. It is
straightforward to compute the means and variances of sums of random
variables as in the case of the continuously compounded returns, but not so for

30
products of random variables when they are not necessarily independent, as in
the case of the discrete period returns here.
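A small sketch (not from the text) of the aggregation in equation (2.2): summing T small-interval continuously compounded returns drawn from a deliberately non-normal distribution produces a longer-horizon return whose skewness and kurtosis move toward the normal values of 0 and 3. The per-period drift and volatility are illustrative.

```python
# Aggregating T non-normal small-interval log returns: the sum looks approximately
# normal, with mean T*mu and variance T*sigma^2, as the Central Limit Theorem predicts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
T = 252                                    # e.g. daily returns aggregated over one year
mu_d, sigma_d = 0.0004, 0.013              # illustrative per-period drift and volatility

# Small-interval returns from a shifted exponential: skewed and clearly non-normal.
r_small = mu_d + sigma_d * (rng.exponential(1.0, size=(20_000, T)) - 1.0)
r_long = r_small.sum(axis=1)               # ln(P_{t+1}/P_t) as a sum of T pieces

print("per-period skew, kurt:", stats.skew(r_small[:, 0]), stats.kurtosis(r_small[:, 0], fisher=False))
print("aggregated skew, kurt:", stats.skew(r_long), stats.kurtosis(r_long, fisher=False))
print("aggregated mean, var vs. T*mu, T*sigma^2:",
      r_long.mean(), r_long.var(), T * mu_d, T * sigma_d**2)
```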
2.5 CONDITIONAL MEAN AND VARIANCE

Earlier we have seen how, when two random variables X, Y are jointly bivariate normal, we can express the conditional mean or expectation of one in terms of the other, viz.

E( X | Y ) = E( X ) + ρ ( σ_X/σ_Y ) [ Y − E( Y ) ].        (2.3)

We could be more precise in the use of notation to denote the expectation of X:

E^{X,Y}[ X ] = E^{X}[ X ]

where the superscripts in the expectation operator denote that the integral is taken with respect to those random variables. We could also use the small letter x to denote the sample realization of random variable X, although sometimes we ignore this if the context is clear as to which is used, whether it is a random variable or a realized value. We could also employ the notation E^{X|y}[ X | y ] to denote an expected value taken on random variable X based on the conditional probability of X given Y = y.
When two random variables X, Y are not jointly normal, the linear
relationship in equation (2.3) is not possible based just on distributional
assumptions. Instead we have to impose the linear relationship directly. For
example, we may assume or specify:
X = a + bY + e
(2.4)
where a, b are constants, and e is a random variable with zero mean and variance σ_e², and is independent of Y. Equation (2.4) is called a linear regression model, or a linear relationship connecting two or more random variables including at least one unobservable random variable e. Then
E(X) = a + b E(Y).
Now E(X|Y) = a + bY, since E(Y|Y) = Y, and E(e|Y) = E(e) = 0. Then E(X|Y) = a + b E(Y) + b[Y − E(Y)] = E(X) + b[Y − E(Y)].
Also, cov(Y,X) = cov(Y,a) + b cov(Y,Y) + cov(Y,e) = b var(Y). Thus,

b = σ_XY / σ_Y² = ρ σ_X / σ_Y.

Hence we can write E(X|Y) = E(X) + ρ ( σ_X/σ_Y ) [ Y − E(Y) ], which is identical to the case under bivariate normality. Thus, it may be seen that the linear regression model in (2.4) plays a crucial role.
From (2.4), var(X) = b² var(Y) + σ_e² where var(e) = σ_e². But the conditional variance var(X|Y) = σ_e². Therefore var(X|Y) < var(X). Thus it is seen that
providing information Y reduces the ex-ante uncertainty or variance of X. Of
course, if X and Y are not linearly related, i.e. b = 0, then var(X|Y) = var(X),
in which case knowing Y does not reduce the uncertainty. This idea of
reducing uncertainty with given relevant information Y about X is central in
the thinking and theory of finance. For example, if we know pertinent
information about tomorrow's stock return movements, then the risk of
investing in stocks will be suitably reduced.
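A small sketch of this section's point, with illustrative "true" parameters chosen by us: when X = a + bY + e and e is independent of Y, the slope b is recovered from cov(Y,X)/var(Y), and the conditional variance var(X|Y) = σ_e² is smaller than the unconditional var(X) whenever b ≠ 0.

```python
# Conditional mean and variance under a linear relation X = a + bY + e.
import numpy as np

rng = np.random.default_rng(7)
a, b, sigma_e = 1.0, 0.8, 0.5                      # assumed illustrative parameters
y = rng.normal(0.0, 2.0, size=100_000)
x = a + b * y + rng.normal(0.0, sigma_e, size=y.shape)

b_implied = np.cov(y, x)[0, 1] / np.var(y, ddof=1)  # cov(Y,X)/var(Y)
print("slope from moments:", round(b_implied, 4), " true b:", b)
print("var(X):", round(x.var(ddof=1), 4), " var(X|Y) = sigma_e^2:", sigma_e**2)
```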
2.6 INFORMATION SET AND RANDOM WALK
We have introduced the idea of a stochastic process earlier. Let the time sequence of random variables P_t, P_{t+1}, P_{t+2}, … represent the prices of a stock at time t, t+1, t+2, etc. Then {P_t}_t is a stochastic process.
Let Φ_t, Φ_{t+1}, Φ_{t+2}, … represent the information set at time t, t+1, t+2, etc. that is available to the decision-maker for making forecasts of the future price of the stock. We may interpret an information set Φ_t essentially as a random variable that can take realized sample values φ_t that are called information. A piece of information at different times t, t+1, t+2, etc., viz. φ_t, φ_{t+1}, φ_{t+2}, …, can be thought of as some function of another random variable Y_t at time t, i.e. φ_t(y_t). Y_t has a joint density with P_t, and possibly also with P_{t+1}. Therefore, since Y_t and P_t, P_{t+1} are jointly determined in a probabilistic manner, then given information φ_t(y_t), a better forecast of next period's P_{t+1} can be attained.
E_t(P_{t+1}) ≡ E(P_{t+1}|Φ_t) is a conditional expectation or forecast of next period's P_{t+1} based on information available at t, i.e. Φ_t. Notice that the subscript t is used to denote evaluation of the integral over the information set at t, i.e. Φ_t. Such applications of conditional expectations are plentiful in the finance literature. Early studies in finance suggest simple stochastic processes for prices such as a random walk:

P_{t+1} = μ + P_t + e_{t+1}        (2.5)

where μ is a constant drift, and e_{t+1} is a disturbance or white noise, i.e. a random variable that is independent of past information as well as prices. Equation (2.5) is sometimes called an arithmetic random walk in prices. The latter name arises since when an arithmetic or subtraction operation is performed on the prices, such as taking the price difference, it is equal to a constant with an added disturbance.
Since P_t ∈ Φ_t, E_t(P_{t+1}) ≡ E(P_{t+1}|Φ_t) = E(P_{t+1}|P_t), as only P_t is relevant according to the random walk process above. Other information within Φ_t except for P_t is redundant in equation (2.5). This is an implication of the arithmetic random walk in prices. Thus, E_t(P_{t+1}) ≡ E(P_{t+1}|Φ_t) = μ + P_t.

If μ = 0, then the best forecast of tomorrow's price is today's price P_t according to the random walk theory as in (2.5). Suppose we construct a random walk in the natural logarithm of price. Then

ln P_{t+1} = μ + ln P_t + e_{t+1}.

This is sometimes called the geometric random walk in prices. Or, ln( P_{t+1}/P_t ) = r_{t,t+1} = μ + e_{t+1}. If we specify e_{t+1} ~ N(0, σ²), then we are back to the lognormal model described earlier. Thus we see that we can construct meaningful return rate distributions using the linear model of a stochastic price process as in the random walk model above. The linear model is essentially a difference equation in the logarithms of price.
Suppose the information set Φ'_t is a subset of Φ_t; the Law of Iterated Expectations states:

E[ E(P_{t+1}|Φ_t) | Φ'_t ] = E(P_{t+1}| Φ'_t).

If we condition on the null set ∅, E[ E(P_{t+1}|Φ_t) | ∅ ] = E(P_{t+1}| ∅ ) = E(P_{t+1}), which is also the unconditional expectation or unconditional forecast. Applying the Law of Iterated Expectations to information revelation over time,

E[ E(P_{t+2}|Φ_{t+1}) | Φ_t ] = E(P_{t+2}| Φ_t)   since Φ_t ⊂ Φ_{t+1}.

Or, E_t[ E_{t+1}(P_{t+2}) ] = E_t(P_{t+2}). The best forecast of tomorrow (t+1)'s forecast of P at t+2 is equal to the best forecast of P at t+2 made today at t.
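The following is a minimal sketch (with illustrative parameters) of the geometric random walk ln P_{t+1} = μ + ln P_t + e_{t+1} just described: prices stay strictly positive even though many individual log returns are negative.

```python
# Simulating a geometric random walk in prices.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, T, P0 = 0.0004, 0.013, 1000, 100.0
e = rng.normal(0.0, sigma, size=T)
log_p = np.log(P0) + np.cumsum(mu + e)     # ln P_t follows an arithmetic random walk
prices = np.exp(log_p)                     # P_t = exp(ln P_t) > 0

returns = np.diff(np.log(prices))          # r_{t,t+1} = ln(P_{t+1}/P_t) = mu + e_{t+1}
print("min price:", round(prices.min(), 2),
      " share of negative returns:", round((returns < 0).mean(), 3))
```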
2.7 LAW OF ITERATED EXPECTATIONS

In the following, we shall show more formally that E[ E(P_{t+1}|Φ_t) | Φ'_t ] = E(P_{t+1}| Φ'_t) when Φ'_t ⊂ Φ_t. Although not necessary, for convenience of proof we shall assume that the jointly distributed random variables X, Y, and Z have joint probability density function f(x, y, z). Then, f(x, y) = ∫ f(x, y, z) dz.
The Law of Iterated Expectations can take various forms. At the simplest level, consider

E^{X,Y}[ Y ] = ∫_x ∫_y y f(x, y) dy dx
            = ∫_x ∫_y y [ f(x, y)/f(x) ] f(x) dy dx
            = ∫_x [ ∫_y y f(y|x) dy ] f(x) dx
            = ∫_x E^{Y|x}[ Y | x ] f(x) dx
            = E^{X}[ E^{Y|x}[ Y | x ] ]

where we use the notation E^{Y|x}[ Y | x ] to denote an expected value taken on random variable Y based on the conditional probability density function f(y|x) expressed in the superscript to the expectation operator. We could also have used E^{Y|x}[ Y[x] ], indicating that the integrand is y, which is a function of x.
Similarly, we should be able to show

E^{X,Y|z}[ Y | z ] = ∫_x ∫_y y f(x, y | z) dy dx
                  = ∫_x ∫_y y [ f(x, y | z)/f(x | z) ] f(x | z) dy dx
                  = ∫_x [ ∫_y y f(y | x, z) dy ] f(x | z) dx
                  = ∫_x E^{Y|x,z}[ Y | x, z ] f(x | z) dx
                  = E^{X|z}[ E^{Y|x,z}[ Y | x, z ] ].

We can think of the random variables {X, Z} as the information set Φ_t, and values x, z as the realized information φ_t at time t. Likewise {Z} is an information set Φ'_t, and clearly Φ'_t ⊂ Φ_t. Thus we may rewrite the equation of the law of iterated expectations above as:

E[ Y | Φ'_t ] = E[ E( Y | Φ_t ) | Φ'_t ]

where we have simplified the notation by dropping the superscripts to the expectation operator, as long as the integration is taken over a proper probability distribution and the conditioning is reflected by the argument term |Φ'_t or |Φ_t. Let Y ≡ P_{t+1}; then E[ E(P_{t+1}|Φ_t) | Φ'_t ] = E(P_{t+1}| Φ'_t) as asserted earlier.
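A small Monte Carlo check (not from the text) of the result just proved, treating {X, Z} as the larger information set and {Z} as the smaller one: averaging E[Y | X, Z] over X, given Z, reproduces E[Y | Z]. The data-generating process below is purely illustrative.

```python
# Numerical check of the Law of Iterated Expectations: E[ E(Y|X,Z) | Z ] = E[Y|Z].
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, size=n)                    # Z in {0,1}
x = rng.integers(0, 3, size=n) + z                # X depends on Z
y = 2.0 * x - z + rng.normal(0.0, 1.0, size=n)    # Y depends on both plus noise

df = pd.DataFrame({"x": x, "z": z, "y": y})
inner = df.groupby(["z", "x"])["y"].transform("mean")    # E[Y | X, Z] for each row
lhs = inner.groupby(df["z"]).mean()                      # E[ E(Y|X,Z) | Z ]
rhs = df.groupby("z")["y"].mean()                        # E[Y | Z]
print(pd.concat([lhs.rename("E[E(Y|X,Z)|Z]"), rhs.rename("E[Y|Z]")], axis=1))
```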
2.8 TEST OF NORMALITY

Given the importance of the role of the normal distribution in financial returns data, it is not surprising that many statistics have been devised to test if a given sample of data {r_i} comes from a normal distribution. One such statistic is the Jarque-Bera (JB) test of normality.¹ The test is useful only when the sample size n is large (sometimes we call such a test an asymptotic test). The JB test statistic is

JB = n [ ŝ²/6 + ( k̂ − 3 )²/24 ]  ~  χ²(2)  in distribution,

where ŝ is the skewness measure of {r_i}, and k̂ is the kurtosis measure of {r_i}. The inputs of these measures to the JB test statistic are usually sample estimates. For {r_i} to follow a normal distribution, its skewness sample estimate should converge to 0, since the normal distribution is symmetrical with third moment being zero, and its kurtosis sample estimate should converge to 3. If the JB statistic is too large, exceeding say the 95th percentile of a χ² distribution with 2 degrees of freedom, or 5.99, then the null hypothesis, H0, of normal distribution is rejected. The JB statistic is large if ŝ and (k̂ − 3) deviate materially from zero.
Recall that the p-value (probability-value) of a realized test statistic t* based on the null hypothesis distribution is either:
(a) the probability of obtaining test statistic values whose magnitudes are even larger than t*, i.e. P(t ≥ t*) for a one-tail (right-tail) test, or
(b) the probability of obtaining test statistic values whose absolute magnitudes are larger than |t*| in a symmetrical zero-mean null distribution, i.e. P(t ≤ −|t*| or t ≥ |t*|) for a two-tail test. Thus if in a statistical test the significance level is set at α%, and the p-value is x%, then reject H0 if x ≤ α, and accept H0 if x > α.

¹ See C. M. Jarque and A. K. Bera, (1987), "A Test for Normality of Observations and Regression Residuals," International Statistical Review, vol. 55, 163-172.
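A sketch of the JB computation above, applied to a simulated fat-tailed return series (the data and seed are illustrative, not the samples used later in this section). The hand-computed statistic is compared with SciPy's built-in version.

```python
# Jarque-Bera test statistic computed from sample skewness and kurtosis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
r = rng.standard_t(df=5, size=1250) * 0.013      # fat-tailed "daily returns"

n = r.size
s = ((r - r.mean())**3).mean() / r.std()**3      # sample skewness
k = ((r - r.mean())**4).mean() / r.std()**4      # sample kurtosis
jb = n * (s**2 / 6.0 + (k - 3.0)**2 / 24.0)      # JB statistic
p_value = 1.0 - stats.chi2.cdf(jb, df=2)         # upper-tail chi-square(2) probability

print(f"skew={s:.3f} kurt={k:.3f} JB={jb:.2f} p-value={p_value:.4f}")
print("scipy check:", stats.jarque_bera(r))      # library version for comparison
```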
2.9 CORPORATE STOCK RETURNS

In this section, we provide two examples of the stock return sample


distributions of two companies.
The American Express Company (AXP) is one of the Dow Jones Industrial Average's 30 companies, and is a large diversified international company specializing in financial services including the iconic AMEX card services. It is globally traded. The AXP daily stock returns in a 5-year period from 1/3/2003 to 12/31/2007 are collected from the public source Yahoo Finance and processed as follows. The return rates are daily continuously compounded return rates ln(P_{t+1}/P_t). Thus, weekly as well as monthly stock returns can be easily computed from the daily return rates. The continuously compounded weekly return rate would be the sum of the daily returns for the week, Monday to Friday. The continuously compounded monthly return rate would be the sum of the daily returns for the month. As there are typically about 5 trading days in a week, since stocks are traded mainly through stock exchanges that operate on a 5-day week, the weekly return is computed as the sum of 5 daily return rates. Similarly, there are only about 21 to 22 trading days on average in a month for adding up to the monthly return. A yearly or annual return will be summed over about 252 to 260 trading days in a year. (An outlier of more than a 12% drop in a single day in the database, on 2005 October 3rd, was dropped.)

Tuesday's return is usually computed as the log (natural logarithm) of the close of Tuesday's stock price relative to the close of Monday's price. Unlike other days, however, one has to be sensitive to the fact that Monday's return cannot usually be computed as the log (natural logarithm) of the close of Monday's stock price relative to the close of Friday's price. The latter return spans 3 days, and some may argue that the Monday daily return should be a third of this, although it is also clearly the case that Saturday and Sunday have no trading. Some may use the closing price relative to the opening price on the same day to compute daily returns. The open-to-close return signifies return captured during daytime trading when the Exchange is open. However, the close-to-open return signifies the price change taking place overnight. We shall not be concerned with these issues for the present purpose.
The three series of daily, weekly, and monthly return rates are tabulated in
histograms. Descriptive statistics of these distributions such as mean, standard
deviation, skewness, and kurtosis are reported. The Jarque-Bera tests for
normality of the distributions are also conducted.
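A minimal sketch of how daily continuously compounded returns can be summed into weekly and monthly returns and summarized as described above. The file name axp_daily.csv and its columns "date" and "close" are hypothetical; adapt them to your own data source.

```python
# Aggregate daily log returns to weekly and monthly returns and print descriptive stats.
import numpy as np
import pandas as pd

px = pd.read_csv("axp_daily.csv", parse_dates=["date"]).set_index("date")  # hypothetical file
daily = np.log(px["close"]).diff().dropna()            # ln(P_{t+1}/P_t)

weekly = daily.resample("W-FRI").sum()                 # Monday-to-Friday sums
monthly = daily.resample("M").sum()                    # calendar-month sums ("ME" in newer pandas)

for name, series in [("daily", daily), ("weekly", weekly), ("monthly", monthly)]:
    print(name, "n =", len(series),
          "mean =", round(series.mean(), 6),
          "std =", round(series.std(), 6),
          "skew =", round(series.skew(), 3),
          "kurt =", round(series.kurt() + 3.0, 3))     # pandas kurt() is excess kurtosis
```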
Figure 2.1
Histogram and Statistics of Daily AXP Stock Return Rates
[Histogram omitted. Series: DR; Sample 1 1257; Observations 1257]
Mean 0.000377;  Median -0.000210;  Maximum 0.063081;  Minimum -0.057786
Std. Dev. 0.013331;  Skewness 0.129556;  Kurtosis 5.817207
Jarque-Bera 419.1988;  Probability 0.000000

In Figure 2.1, the JB test statistic shows a p-value of less than 0.0005, thus normality is rejected at significance level 0.0005 or 0.05% for the daily returns. The mean return in the sampling period is 0.0377% per day, or about 252 × 0.0377% = 9.5% per annum. The daily return standard deviation or volatility is 1.333%. If the continuously compounded return were indeed normally distributed, the annual volatility may be computed as √252 × 0.01333 = 21.16%. Figure 2.1 indicates that AXP stock has positive skewness during this sampling period. Its kurtosis of 5.817 exceeds 3.0, which is the kurtosis of a normally distributed random variable.

In Figure 2.2, the JB test statistic shows a p-value of <0.000, thus
normality is also rejected at significance level 0.0005 or 0.05% for the weekly
returns.
Figure 2.2
Histogram and Statistics of Weekly AXP Stock Return Rates
[Histogram omitted. Series: WR; Sample 1 1257; Observations 251]
Mean 0.001815;  Median 0.001179;  Maximum 0.112051;  Minimum -0.103370
Std. Dev. 0.025712;  Skewness -0.040191;  Kurtosis 5.345425
Jarque-Bera 57.59904;  Probability 0.000000

Figure 2.3
Histogram and Statistics of Monthly AXP Stock Return Rates
[Histogram omitted. Series: MR; Sample 1 1257; Observations 57]
Mean 0.008601;  Median 0.004985;  Maximum 0.121969;  Minimum -0.092568
Std. Dev. 0.047514;  Skewness 0.166812;  Kurtosis 2.721483
Jarque-Bera 0.448583;  Probability 0.799082

In Figure 2.3, the JB test statistic shows a p-value of 0.449. Thus normality is
not rejected at significance level 0.10 or 10%. (Indeed it is not rejected even at
significance level of 44%. Sometimes we may call the p-value the exact
significance level.)

In the tables of the Figures, note that the sample size is n = 1257 for the daily returns, n = 251 for the weekly returns, and n = 57 for the monthly returns. The mean formula is r̄ = (1/n) Σ_{i=1}^{n} r_i. The standard deviation (Std. Dev.) formula is σ̂ = [ (1/(n−1)) Σ_{i=1}^{n} ( r_i − r̄ )² ]^{1/2}. The skewness and kurtosis formulae are, respectively,

(1/n) Σ_{i=1}^{n} ( r_i − r̄ )³ / σ̂³   and   (1/n) Σ_{i=1}^{n} ( r_i − r̄ )⁴ / σ̂⁴.
It is interesting to note that daily and weekly stock return rates are usually not
normal, but aggregation to monthly return rates produces normality as would
be expected by our earlier discussion on Central Limit Theorem. This result
has important implications in financial modeling of stock returns. Short
interval return rates should not be modeled as normal given our findings. In
fact, the descriptive statistics of the return rates for different intervals above
show that shorter interval return rates tend to display higher kurtosis or fat
tail in the pdf. Many recent studies of shorter interval return rates introduce
other kinds of distributions or else stochastic volatility to produce returns with
fatter tails or higher kurtosis than that of the normal distribution.
The next example is that of the Overseas Chinese Banking Corporation
(OCBC) which is one of the 3 largest banks in Singapore. OCBC is a strong
blue-chip stock with plenty of liquidity in trading. The OCBC bank daily
stock returns in a 5-year period from 10/27/1997 to 10/25/2002 are collected
from the Singapore Stock Exchange (SGX) source and processed as follows.
The return rates are daily continuously compounded return rates ln(P t+1/Pt).
Weekly as well as monthly stock returns are computed from the daily return
rates. Likewise, the three series of daily, weekly, and monthly return rates are
tabulated in histograms shown in Figures 2.4, 2.5, and 2.6. Descriptive
statistics of these distributions such as mean, standard deviation, skewness,
and kurtosis are reported. The Jarque-Bera tests for normality of the
distributions are also conducted.
As in the case of the American Express Company, the daily return rates of OCBC show very high kurtosis deviating from normality. There is also skewness. In Figures 2.4 and 2.5, the JB test statistics show p-values of less than 0.0005. Thus normality is rejected at significance level 0.0005 or 0.05% for the daily as well as the weekly returns.
From Figure 2.4 the mean return in the sampling period is 0.0259% per day, or about 253 × 0.0259% = 6.55% per annum. The daily return standard deviation or volatility is 2.527%. The annual volatility may be computed as √253 × 0.02527 = 40.19%.
Figure 2.4
Histogram and Statistics of Daily OCBC Stock Return Rates
[Histogram omitted. Series: DR; Sample 1 1305; Observations 1305]
Mean 0.000259;  Median 0.000000;  Maximum 0.163979;  Minimum -0.116818
Std. Dev. 0.025270;  Skewness 0.297441;  Kurtosis 6.720127
Jarque-Bera 771.7571;  Probability 0.000000

Figure 2.5
Histogram and Statistics of Weekly OCBC Stock Return Rates
[Histogram omitted. Series: WR; Sample 1 1301; Observations 255]
Mean 0.001324;  Median 0.007725;  Maximum 0.203466;  Minimum -0.261330
Std. Dev. 0.061151;  Skewness -0.272323;  Kurtosis 5.723938
Jarque-Bera 81.98756;  Probability 0.000000

In Figure 2.6, the JB test statistic shows a p-value of 0.858. Thus normality is not rejected at significance level 0.10 or 10%. In the tables of the Figures, note that the sample size is n = 1305 for the daily returns, n = 255 for the weekly returns, and n = 65 for the monthly returns.
Figure 2.6
Histogram and Statistics of Monthly OCBC Stock Return Rates
[Histogram omitted. Series: MR; Sample 1 1281; Observations 65]
Mean 0.005055;  Median 3.76E-16;  Maximum 0.244560;  Minimum -0.247624
Std. Dev. 0.113431;  Skewness -0.081671;  Kurtosis 2.706749
Jarque-Bera 0.305165;  Probability 0.858488
2.10 PROBLEM SET

2.1 There is a sample size of 60 taken from an unknown distribution except


that the variance is known to be 0.24. The sample mean is 0.5. Find
numbers a, b, so that the unknown population mean lies in the interval
(a,b) with a 95% probability. (a,b) is called the 95% confidence interval
for the population mean. What theorem are we implicitly using in deriving
this result?
2.2 Let R_{t,t+Δ} be the continuously compounded rate of return over the interval (t, t+Δ], where Δ is small. Suppose R_{t,t+Δ}, R_{t+Δ,t+2Δ}, etc. are each i.i.d. with mean μ_Δ and variance σ_Δ². Suppose an interval of 1 month is divided into N such Δ-intervals.
(i) Explain why the monthly continuously compounded return rate R_{t,t+NΔ} may be approximated by N(μ_Δ N, σ_Δ² N).
(ii) Given a monthly return rate time series R1, R2, …, R60 over 5 years, explain how you would construct a test of normality for R_t (assuming a stationary stochastic process), showing a test statistic in terms of the data R_t's.
(iii) Suppose R_{t,t+Δ}, R_{t+Δ,t+2Δ}, etc. are independent and have constant mean μ_Δ, but different variances, where variance tends to be high when return is low and low when return is high. Intuitively explain whether the monthly return sample distribution will or will not show fatter tails.
2.3 X1, X2, and X3 are n × 1 vectors of stock A, B, and C's observed monthly return rates from month t = 1 to month t = n. L is an n × 1 vector with each element equal to 1. Express the average returns of A, B, and C in terms of the X's and L. What is the multivariate distribution of these average returns if the stock returns are independent of each other, but each stock return is theoretically MVN N(μ, Σ_{n×n})?
2.4 Suppose we are testing a theory that says that the market as a whole holds a conditional expectation of a certain stock's return R_{t+1} given by E(R_{t+1}|Φ_t) = X_t Q, where Φ_t is all the information available to the market at t, X_t is an observable variable that has a stationary history, and Q is a known constant. Show how you would test this model by employing the Law of Iterated Expectations.
2.5 Prove that E(et+1|Pt) = 0 implies that et+1 and Pt have zero correlation.
2.6 Find the mean and variance of the lognormally distributed random variable X where ln X ~ N(μ, σ²).

FURTHER RECOMMENDED READINGS


[1] Blattberg, R.C., and N.J. Gonedes (1974), A comparison of the stable
and student distributions as statistical models for stock prices, Journal of
Business, April, 244-280.
[2] Ibbotson and Sinquefield (1982), Stocks, bonds, bills, and inflation: The
Past and the Future, The Financial Analysts Research Foundation
Monograph Number 15, USA.


Chapter 3
TWO-VARIABLE LINEAR REGRESSION
APPLICATION: FINANCIAL HEDGING
Key Points of Learning
Regression model, Transformations, Dependent variable, Regressand,
Regressor, Ordinary least squares method, Classical conditions, Unbiased
linear estimator, Estimation efficiency, Gauss-Markov theorem, Testing of
coefficient estimates, Decomposition of squares, Coefficient of determination,
Forecasting, Stock index futures, Cost of carry, Arbitrage, Hedging, Hedge
ratio

In this chapter we shall proceed one step further to investigate how a


conditional mean as mentioned in chapter 2 could be estimated using linear
statistical relationship between the two variables. The two-variable linear
regression is studied and an application on financial futures hedging will be
investigated later in the chapter.
3.1 REGRESSION

A regression is an association between a random variable and other


independent or exogenous variables. The idea of checking out the association
is basically for two major purposes: to provide some positive theory of how
occurrence of some events could explain, not necessarily causing, occurrence
of another event, and a normative or prescriptive theory of how to use the
association to predict future occurrences or to influence them.
In Chapter One, if we refer back to Figure 1.3 and wonder how we can
characterize the association between random variables Y and X, one may
intuitively draw a linear (straight) line that passes by the points as closely as
possible according to some notions of distance between the line and the
points. This nails the regression idea as one of linear regression. One can think
of many interesting finance situations where association between two finance
variables Y and X would be important to know. Examples are stock price
movements and economic fundamentals, interest rate and inflationary
expectations, interest rate and money supplies, housing market rates and tax
regulations, car prices and quotas in the Singapore case, exchange rates and
country economic performance, and so on.

In Figure 3.1, we show sample observations of X and Y variables that
occur simultaneously.
The bold line shows an attempt to draw a linear line as close to the
observed occurrences as possible. Or does it make sense for us to draw a
nonlinear line that fits all the sample points exactly as seen in the dotted
curve? Obviously not. This is because the bivariate points are just realized
observations of bivariate random variables, and at the next sampling, no
matter how large the sample size is, the points will change positions. Drawing
a line through all or most of the 8 points is like a model with 8 parameters,
and results in over-fitting problem.
Figure 3.1
[Scatterplot omitted: eight sample points (X1,Y1), (X2,Y2), …, (X8,Y8) plotted against X, with a straight line drawn through the cloud of points and a dotted curve passing through every point.]
What we want is a straight line in a linear model (or a curve in a nonlinear model) that is estimated in such a way that, whatever the sample, as long as the size is sufficiently large, the line will pretty much remain at about the same position. This will then enable the purposes of (1) explaining Y given any X (not just those observed) that is within the normal range of X in the context, and (2) forecasting given a new X or, in the case of time series, the next period X_{t+1}. When the sample size is small, there will be large sampling errors in the parameter estimates.
Therefore, the idea of a regression model (which needs not be linear), Y = f(X; θ) + ε, where ε is a random error or noise, is one where the parameter(s) θ are suitably estimated as θ̂ (θ̂ is close to the true θ) given a sample {(X_i, Y_i)}_{i=1,…,n} of size n, such that

Σ_{i=1}^{n} g( Y_i − f(X_i; θ̂) )

is small in some statistical sense, where g(.) is a criterion function. For example, g(z) = z² is one such criterion function. Thus a linear regression model does not fit random variables X, Y perfectly, but allows for a residual noise in order that the model is not over-parameterized or over-fitted. This would then serve purposes (1) and (2).
A Linear (bivariate) Regression Model is
Yt = a + bXt + et ,
where a, b are constants. In the linear regression model, Yt is the dependent
variable or regressand. Xt is the explanatory variable or regressor. et is a
residual noise, disturbance, or innovation.
If a constant a has been specified in the linear regression model, then the
mean of et is zero. If a constant has not been specified, then et has a non-zero
mean. It is common to add the specification that et is independently and
identically distributed (in short, i.i.d.). This means that the probability
distributions of et , et+1, et-1, etc. are all identical, and that et is stochastically
independent of all other r.v.s, including its own lags and leading terms, i.e.
cov(et, et-k) = 0 and cov(et, et+k) = 0 for k = 1,2,3,.
An even stronger specification is that et is i.i.d. and also normally distributed, and we can write this as et ~ n.i.d. N(μ, σ²). In trying to employ the
model to explain, and also to forecast, the constant parameters a and b need to
be estimated effectively, and perhaps some form of testing on their estimates
could be done to verify if they accord with theory. This forms the bulk of the
material in the rest of this chapter.
It is also important to recognize that a linear model provides for correlation between Xt and Yt (this needs not be the only type of model providing correlation; e.g. a nonlinear model such as Yt = exp(Xt) also does the job), as we saw occurs in a joint bivariate distribution (X,Y). For example, in Yt = a + bXt + et, with i.i.d. et, we have cov(Xt, Yt) = b var(Xt) ≠ 0 provided b ≠ 0.
Sometimes we encounter a timeplot (a timeplot shows a variable's realized
values against time) or a scatterplot (a graph of simultaneous pairs of realized
values of random variables) that does not look linear, unlike Figure 3.1. As an
example, consider the following two regressions both producing straight lines
that appear to cut evenly through the collection of points in each graph if we
use the criterion that minimizes z2.
The point is that using some intuitively appropriate criterion to fit linear
lines is not enough. It is important to first establish that the relationships are
linear before fitting a linear regression model makes sense.

Figure 3.2
[Scatterplot omitted: Y versus X, where the points trace a nonlinear pattern and a straight line has been fitted through them.]
Figure 3.3
[Scatterplot omitted: Y versus X, where a single outlier with a very high Y-value pulls the fitted straight line above the rest of the points.]
In Figure 3.2, the graph of Y versus X is clearly a nonlinear curve toward the origin. If it is quadratic, then it is appropriate in that case to use a nonlinear regression model such as Y = a + bX + cX² + ε.
In Figure 3.3, for Y versus X, there is clearly an outlier point with a very
high Y-value. As a result, the fitted line is actually above the normal points
that form the rest of the sample. This can be treated either by excluding the
outlier point if the assessment is that it is an aberration or distortion, or else by
providing for another explanatory variable to explain that point that may be a
rare event.
Thus, a visual check on the data plots is useful to ascertain if a linear
regression model is appropriate and whether there are outliers.
Sometimes there are theoretical models that specify relationships between random variables that look nonlinear, but can be transformed to linear models so that linear regression methods can be applied for estimation and testing. Examples are as follows.
When Y = AX^B ε, take a Log-Log transformation (taking logs on both sides), so

ln Y = ln A + B ln X + ln ε.

Note that here the disturbance noise ε must necessarily be larger than zero, otherwise ln ε will have non-feasible values. Here, ln ε can range from −∞ to ∞. Sometimes Y is called the constant elasticity function since B is the constant elasticity (when ln ε is fixed at zero).
When Y = exp(a + bX + ε), taking logs on both sides ends up with a semi-log transformation, so ln Y = a + bX + ε. This is also called a Semi-log model.
When e^Y = AX^B ε, taking logs on both sides ends up again with a Semi-log model, Y = ln A + B ln X + ln ε. Sometimes when the regressor X is a fast-increasing series relative to Y, then taking the natural log of X as regressor will produce a more stable result, as long as theory has nothing against this ad hoc data transformation practice.
There are examples of interesting nonlinear curves that are important in
economics. An example is the Phillips curve as follows.
Figure 3.4
Phillips Curve Relating Short-Run Wage Inflation with Unemployment Level
[Curve omitted: the vertical axis Y is the rate of wage change; the horizontal axis X is the unemployment level.]

Y versus X is highly nonlinear, but we can use a linear regression model on the reciprocal of X, i.e. Y = a + b(1/X) + ε, or use 1/Y as the regressand or dependent variable, thus 1/Y = a + bX + ε. A serious econometric study of the
Phillips curve is of course much more involved as it is now known that
rational forces cause the curve to shift over time in the longer run. In other
words, observed pairs of wage inflation and unemployment levels over time
belong to different Phillips curves.

Next we study one major class of estimators of the linear regression model
and the properties of such estimators. This class is synonymous with the
criterion method for deriving the estimates of the model parameters. This is
the ordinary least squares criterion. For this chapter, we will cover only the
two-variable linear regression model.
3.2 ORDINARY LEAST SQUARES METHOD

In the linear regression model, the dependent variable is assumed to be a linear function of one or more independent (or explanatory, exogenous, or pre-determined) variables plus an error introduced to account for all other factor(s) that are either not observed or not known and which are random variable(s).
In a two-variable linear regression model,

Ỹ_i = a + b X̃_i + ẽ_i ,   i = 1, 2, …, N.        (3.1)

Y_i is a dependent variable, X_i is an independent or explanatory variable, and e_i is a disturbance or residual error. The random variables X̃_i, Ỹ_i's are observed as (X_i, Y_i)'s; the disturbances or residual errors e_i's are not observed, and a, b are constants to be estimated. E(e_i) = 0, and var(e_i) is assumed to be a constant σ_e², which is also not observed. The task is to estimate the parameters a and b and σ_e². Notice that we have dropped the tildes, and the context should be clear as to which is a r.v. and which is a realized value.
The Classical assumptions (desirable conditions) for Ordinary Least Squares regression are:

(A1) E( e_i ) = 0 for every i.
(A2) E( e_i² ) = σ_e², the same constant for every i.
(A3) E( e_i e_j ) = 0 for every i ≠ j.
(A4) X_i and e_j are stochastically independent (of all other r.v.s) for each i, j.

In assumption (A2), disturbances with constant variance are called homoskedastic. On the flip side, disturbances with non-constant variance are called heteroskedastic, a subject we shall address in later chapters. Condition (A3) implies zero cross-correlation if the sample is a cross-section, or zero autocorrelation if the sample is a time series.
In simpler treatment, the X_i's are assumed to be given, so we can treat them as constants. We shall adhere to this mostly in this chapter. This treatment can be justified easily in the case when there is repeated sampling of Y_i's given that the X_i's can be pre-selected. If not, the results that ensue are interpreted as being conditional on the given X_i's. Such an X_t is deterministic or exogenous, and is also referred to as pre-determined in a time series context. At other times, X_t is stochastic and occurs jointly with Y_t. It is theoretically easier if X_t is deterministic, for then it is easier to accept that the e_i's are uncorrelated, or stronger still, i.i.d. Such properties of the disturbance will be seen to simplify the estimation theory.
In addition to assumptions (A1) through (A4), we could also add a distributional assumption to the random variables, e.g.

(A5) e_i ~ N( 0, σ_e² ).
In Figure 3.5 below, the dots represent the data points (X_i, Y_i) for each i. The regression lines passing amidst the points represent attempts to provide a linear association between X_i and Y_i. The scalar value ê_i indicates a measure of the vertical distance between the point (X_i, Y_i) and the fitted regression line. The solid line provides a better fit than the dotted line, and we shall elaborate on this.

Figure 3.5
Ordinary Least Squares Regression of Observations (Xi, Yi)
[Scatterplot omitted: points (X1,Y1), (X2,Y2), (X3,Y3), … in the (X, Y) plane, with a solid fitted regression line, a dotted alternative line, and the vertical distances ê1, ê2, ê3 from the points to the fitted line marked.]
The requirement of linear regression model estimation is to estimate a and b. The ordinary least squares (OLS) method of estimating a and b is to find â and b̂ so as to minimize the residual sum of squares (RSS), Σ_{i=1}^{N} ê_i².
Note that this is different from minimizing the sum of squares of the random variables e_i, which we do not observe. This is an important concept that should not be missed. It harks back to the distinction of what is a random variable and what is its sample value in a single draw.
It should also be noted that the estimators â, b̂, and ê_i are themselves random variables. However, given a particular sample, the computed number â = 0.245, for example, is a realized value of the estimator, and is called an estimate. Although the same notation is used, the context should be distinguished.
The key criterion in OLS is to minimize the sum of the squares of the vertical distances from the points to the fitted OLS straight line (or plane if the problem is of a higher dimension):

min_{â, b̂}  Σ_{i=1}^{N} ê_i² = Σ_{i=1}^{N} ( Y_i − â − b̂ X_i )².

Since this is an optimization problem and the objective function is continuous in â and b̂, we set the slopes with respect to â and b̂ to zeros.
The First Order Conditions (FOC) yield the following two equations.

∂( Σ ê_i² )/∂â = −2 Σ_{i=1}^{N} ( Y_i − â − b̂ X_i ) = 0
∂( Σ ê_i² )/∂b̂ = −2 Σ_{i=1}^{N} X_i ( Y_i − â − b̂ X_i ) = 0

Note that the above left-side quantities are partial derivatives. The equations above are called the normal equations for the linear regression of Y_i on X_i.
From the FOC,

Σ_{i=1}^{N} Y_i = N â + b̂ Σ_{i=1}^{N} X_i ,   i.e.   N Ȳ = N â + b̂ N X̄ , so

â = Ȳ − b̂ X̄        (3.2)

Σ_{i=1}^{N} X_i Y_i = â Σ_{i=1}^{N} X_i + b̂ Σ_{i=1}^{N} X_i² = â N X̄ + b̂ Σ_{i=1}^{N} X_i².        (3.3)

Putting (3.2) into (3.3):

Σ X_i Y_i = Σ X_i ( Ȳ − b̂ X̄ ) + b̂ Σ X_i²
Σ X_i ( Y_i − Ȳ ) = b̂ Σ X_i ( X_i − X̄ )
b̂ = Σ_{i=1}^{N} X_i ( Y_i − Ȳ ) / Σ_{i=1}^{N} X_i ( X_i − X̄ ).

b̂ can also be expressed as follows.

b̂ = Σ_{i=1}^{N} ( X_i − X̄ )( Y_i − Ȳ ) / Σ_{i=1}^{N} ( X_i − X̄ )² = Σ_{i=1}^{N} x_i y_i / Σ_{i=1}^{N} x_i²        (3.4)

where x_i ≡ X_i − X̄ and y_i ≡ Y_i − Ȳ. Note that Σ_{i=1}^{N} ( Y_i − Ȳ ) = 0.
We see that â and b̂ are linear estimators with fixed weights (when the X_i's are deterministic) on the Y_i's. They are linear functions of the Y_i's. The weights for b̂ are as follows.

b̂ = Σ_{i=1}^{N} w_i Y_i   where   w_i = x_i / Σ_{i=1}^{N} x_i².

It can be seen that the following properties of w_i hold:

Σ w_i = 0 ,   Σ w_i x_i = Σ w_i X_i = 1 ,   Σ w_i² = 1 / Σ x_i².

The weights for â are as follows.

â = Σ_{i=1}^{N} (1/N) Y_i − X̄ Σ_{i=1}^{N} w_i Y_i = Σ_{i=1}^{N} v_i Y_i   where   v_i = 1/N − w_i X̄ = 1/N − x_i X̄ / Σ x_i².

Then

Σ v_i = 1 − X̄ Σ x_i / Σ x_i² = 1 ,   Σ v_i X_i = X̄ − X̄ Σ w_i X_i = 0 ,
Σ v_i² = Σ ( 1/N − w_i X̄ )² = 1/N − (2X̄/N) Σ w_i + X̄² Σ w_i² = 1/N + X̄² / Σ x_i².

In the above, Σ x_i = Σ ( X_i − X̄ ) = N X̄ − N X̄ = 0.
Now, for the finite sample properties of the OLS estimators:

b̂ = Σ w_i ( a + b X_i + e_i ) = b + Σ w_i e_i.

This way of expression allows it to be easily seen that b̂ is a random variable since it is a function of the random variables e_i's, or more precisely, b plus a weighted average of the e_i's. Then, clearly E( b̂ ) = b. Now,

var( b̂ ) = E( b̂ − b )² = E( Σ w_i e_i )² = Σ w_i² E( e_i² ) = σ_e² / Σ x_i².

Similarly, â = Σ v_i ( a + b X_i + e_i ) = a + Σ v_i e_i. Thus, E( â ) = a.

var( â ) = E( â − a )² = E( Σ v_i e_i )² = σ_e² Σ v_i² = σ_e² ( 1/N + X̄² / Σ x_i² ).

The above results show that the means of the estimators â and b̂ are centered at the true population parameters a and b respectively. Thus we say that the OLS estimators â and b̂ are unbiased.
What is the probability distribution of b̂? Using (A5), since b̂ is a linear combination of the e_i's that are normally distributed, b̂ is also normally distributed:

b̂ ~ N( b, σ_e² / Σ x_i² ).

What is the distribution of â? Similarly, we can show that

â ~ N( a, σ_e² ( 1/N + X̄² / Σ x_i² ) ).

The covariance between the estimators â and b̂ is obtained as follows.

cov( â, b̂ ) = E( â − a )( b̂ − b ) = E( Σ v_i e_i )( Σ w_i e_i ) = σ_e² Σ v_i w_i = −σ_e² X̄ / Σ x_i².

3.3 GAUSS-MARKOV THEOREM

The Gauss-Markov Theorem states that amongst all linear and unbiased estimators of the form

Â = Σ_{i=1}^{N} α_i Y_i ,   B̂ = Σ_{i=1}^{N} β_i Y_i ,

where the α_i and β_i are constant weights in the X_i's (and not in the Y_i's or a or b), and E( Â ) = a, E( B̂ ) = b, the OLS estimators â, b̂ have the minimum variances, i.e.

var( b̂ ) ≤ var( B̂ ) ,   var( â ) ≤ var( Â ).
In this sense, OLS estimators (under the classical conditions) are called BLUE, viz. Best Linear Unbiased Estimators, for the linear regression model in (3.1). They are efficient estimators (estimation efficiency) when they have the least variances and are unbiased. They are best in this linear unbiased class.²
What happens to the estimators â and b̂ when the sample size N goes toward infinity? In such a situation, when the sample size approaches infinity (or practically when we are in a situation of a very large sample, though still of finite sample size), we are discussing asymptotic (large sample) theory.
Consider the following sample moments as the sample size increases toward ∞. Earlier we saw that the population means are E( X ) = μ_X and E( Y ) = μ_Y, and the population covariance is E( X − μ_X )( Y − μ_Y ) = E( XY ) − μ_X μ_Y. The sample means are X̄ = (1/N) Σ_{i=1}^{N} X_i and Ȳ = (1/N) Σ_{i=1}^{N} Y_i. From the Law of Large Numbers, lim_{N→∞} X̄ = μ_X and lim_{N→∞} Ȳ = μ_Y when X and Y are stationary.
The sample covariance (1/(N−1)) Σ ( X_i − X̄ )( Y_i − Ȳ ) is unbiased, but we can also employ (1/N) Σ ( X_i − X̄ )( Y_i − Ȳ ) = S_XY if N approaches ∞. Both the unbiased version and this S_XY will converge to the population covariance σ_XY as N → ∞. In dealing with large sample theory, we shall henceforth use the latter version, and also the sample variance (1/N) Σ ( X_i − X̄ )² = S_X², and observe that S_X² also converges to σ_X² as N approaches ∞.
The population correlation coefficient is ρ_XY = σ_XY / ( σ_X σ_Y ). The sample correlation coefficient is

r_XY = Σ ( X_i − X̄ )( Y_i − Ȳ ) / [ Σ ( X_i − X̄ )² Σ ( Y_i − Ȳ )² ]^{1/2} = S_XY / ( S_X S_Y ).

Likewise, when we take the limit, lim_{N→∞} r_XY = ρ_XY. Theoretically ρ_XY lies within [−1, +1] as seen earlier in Chapter 1. Now, the sample estimate r_XY is defined above so that it also lies within [−1, +1]. This can be shown using the Cauchy-Schwarz Inequality:

( Σ x_i y_i )² ≤ ( Σ x_i² )( Σ y_i² ).

² There are some estimators that are biased but may possess smaller variances, e.g. the Stein estimators.
Other definitions of the sample correlation, though convergent to the population correlation, may not lie within [−1, +1]. One such example is when we use the unbiased sample covariance divided by the unbiased standard deviations.
Now, the OLS estimator

b̂ = S_XY / S_X² = r_XY S_X S_Y / S_X² = r_XY S_Y / S_X.

Is there a population equivalent? In the rest of this section we can treat X_i more generally as stochastic rather than as an exogenous or predetermined constant. Let

Y_i = a + b X_i + e_i ,   where e_i is i.i.d. with mean 0 and variance σ_e².

Then cov( e_i, X_i ) = 0 since e_i is independent of X_i. Now,

cov( X_i, Y_i ) = cov( a + b X_i + e_i, X_i ) = b var( X_i ) + cov( e_i, X_i ) = b var( X_i ).

Then, b = cov( X_i, Y_i ) / var( X_i ). Hence we see that lim_{N→∞} b̂ = b.
If b̂ is an estimator of b, and lim_{N→∞} b̂ = b, then b̂ is said to be a consistent estimator. We can show likewise that lim_{N→∞} â = a, and hence â is also consistent. The estimated residual is ê_i = Y_i − â − b̂ X_i. It can be further expressed as

ê_i = Y_i − Ȳ + b̂ X̄ − b̂ X_i = ( Y_i − Ȳ ) − b̂ ( X_i − X̄ ) = y_i − b̂ x_i.

It is important to distinguish this estimated residual ê_i = Y_i − Ŷ_i from the actual unobserved e_i. From the FOCs in (3.2) and (3.3), we see that

Σ_{i=1}^{N} ê_i = 0   and   Σ_{i=1}^{N} X_i ê_i = 0 ,  or also  Σ_{i=1}^{N} ( X_i − X̄ ) ê_i = 0.

What are their population equivalents? They are, respectively, E( e_i ) = 0, and E( X_i e_i ) = 0 or cov( X_i, e_i ) = 0.
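The following sketch computes the two-variable OLS estimates directly from the formulas derived above, b̂ = Σ x_i y_i / Σ x_i² and â = Ȳ − b̂ X̄, on simulated data with assumed "true" coefficients a = 1.0 and b = 2.0 (all numbers illustrative), and compares them against a library fit.

```python
# Two-variable OLS by hand, equations (3.2) and (3.4).
import numpy as np

rng = np.random.default_rng(11)
N, a, b, sigma_e = 200, 1.0, 2.0, 0.5
X = rng.uniform(0.0, 10.0, size=N)
Y = a + b * X + rng.normal(0.0, sigma_e, size=N)

x = X - X.mean()                        # deviations from the mean
y = Y - Y.mean()
b_hat = (x * y).sum() / (x**2).sum()    # equation (3.4)
a_hat = Y.mean() - b_hat * X.mean()     # equation (3.2)

print("a_hat =", round(a_hat, 4), " b_hat =", round(b_hat, 4))
print("check against numpy.polyfit:", np.polyfit(X, Y, deg=1))  # returns [slope, intercept]
```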
3.4 DECOMPOSITION

We now analyse the decomposition of the sum of squares Σ_{i=1}^{N} ( Y_i − Ȳ )². Recall that in the OLS method, we minimize the sum of squares of the estimated residual errors Σ_{i=1}^{N} ê_i². Now,

Σ ( Y_i − Ȳ )² = Σ ( Ŷ_i − Ȳ + Y_i − Ŷ_i )²
             = Σ ( Ŷ_i − Ȳ )² + Σ ( Y_i − Ŷ_i )² + 2 Σ ( Ŷ_i − Ȳ )( Y_i − Ŷ_i )
             = Σ ( Ŷ_i − Ȳ )² + Σ ê_i²        (3.5)

since Σ ( Ŷ_i − Ȳ ) ê_i = Σ ( â + b̂ X_i − Ȳ ) ê_i = â Σ ê_i + b̂ Σ X_i ê_i − Ȳ Σ ê_i = 0.
Let us define the Total Sum of Squares (TSS) = Σ ( Y_i − Ȳ )².
Define the Explained Sum of Squares (ESS) = Σ ( Ŷ_i − Ȳ )².
Define the Residual Sum of Squares (RSS) = Σ ê_i² = Σ ( Y_i − Ŷ_i )².
Thus, from (3.5), TSS = ESS + RSS. RSS is also called the unexplained sum of squares (USS).
Now,

ESS = Σ ( Ŷ_i − Ȳ )² = Σ ( â + b̂ X_i − â − b̂ X̄ )² = b̂² Σ ( X_i − X̄ )² = r²_XY ( S_Y² / S_X² ) N S_X² = N S_Y² r²_XY ,
TSS = Σ ( Y_i − Ȳ )² = N S_Y².

So, ESS/TSS = r²_XY. Also, r²_XY = 1 − RSS/TSS.
The Explained Sum of Squares as a fraction of the total sum of squares or variation is the square of the sample correlation coefficient in the two-variable linear regression model. But r_XY² lies between 0 and 1 inclusive, since r_XY lies in [−1, +1]. This term,

R² = ESS/TSS ,

where 0 ≤ R² ≤ 1, is called the coefficient of determination. This coefficient R² determines the degree of fit of the linear regression line to the data points in the sample. The closer R² is to 1, the better is the fit. A perfect fit occurs if all points lie on the straight line; then R² = 1.

R² = ESS/TSS = 1 − RSS/TSS.

The unbiased estimator of the residual variance σ_e² is

σ̂_e² = ( 1/(N−2) ) Σ_{i=1}^{N} ê_i².

Since ê_i = Y_i − â − b̂ X_i, ê_i is a normally distributed random variable because Y_i, â, and b̂ (given X_i) are normally distributed. This is obtained using the result that a linear combination of normal random variables is itself a normal random variable. Moreover, E( ê_i ) = E( Y_i ) − a − b X_i = 0. Conditional on X_i,

var( ê_i ) = var( Y_i ) + var( â ) + X_i² var( b̂ ) − 2 cov( Y_i, â ) − 2 X_i cov( Y_i, b̂ ) + 2 X_i cov( â, b̂ ),

which, after substituting the variances and covariances derived earlier, simplifies to

var( ê_i ) = σ_e² ( 1 − 1/N − x_i² / Σ x_j² ) ,   where x_i = X_i − X̄.

Similarly, we can show that for i ≠ j,

cov( ê_i, ê_j ) = cov( Y_i − â − b̂ X_i , Y_j − â − b̂ X_j )
              = cov( e_i − ( â − a ) − ( b̂ − b ) X_i , e_j − ( â − a ) − ( b̂ − b ) X_j )
              = −σ_e² ( 1/N + x_i x_j / Σ x_k² ).

Note that although the true e_i and e_j are independent according to the classical conditions, their OLS estimates are correlated. Now,

E( Σ_{i=1}^{N} ê_i² ) = ( N − 2 ) σ_e²

is a useful relationship to note involving the sample estimate σ̂_e² and the unknown population parameter σ_e².


After obtaining the OLS estimates â and b̂, there is sometimes a need to perform statistical inference and testing, as well as forecasting and confidence interval estimation. Recall that

â ~ N( a, σ_e² [ 1/N + X̄² / Σ x_i² ] )   and   b̂ ~ N( b, σ_e² / Σ x_i² ).

So,

Z = ( b̂ − b ) / s.e.( b̂ ) ~ N(0, 1).

For testing the null hypothesis H0: b = 1, employ the sample estimate of σ_e², i.e. σ̂_e², and use

t_{N−2} = ( b̂ − 1 ) / [ σ̂_e ( 1 / Σ x_i² )^{1/2} ].

For testing the null hypothesis H0: a = 0, use

t_{N−2} = ( â − 0 ) / [ σ̂_e ( 1/N + X̄² / Σ x_i² )^{1/2} ].

It should be noted that most statistical or econometrics computing packages by default report tests of coefficients that are based on a null of zero, i.e. H0: a = 0, H0: b = 0.
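Continuing the earlier OLS sketch, the following computes the residual variance, R-squared, standard errors, and the default t-statistics for H0: a = 0 and H0: b = 0 using the formulas in this section. The simulated data and "true" parameters are illustrative only.

```python
# R-squared, residual variance, standard errors, and default t-tests for a two-variable OLS.
import numpy as np

rng = np.random.default_rng(11)
N, a, b, sigma_e = 200, 1.0, 2.0, 0.5
X = rng.uniform(0.0, 10.0, size=N)
Y = a + b * X + rng.normal(0.0, sigma_e, size=N)

x = X - X.mean()
b_hat = (x * (Y - Y.mean())).sum() / (x**2).sum()
a_hat = Y.mean() - b_hat * X.mean()
resid = Y - a_hat - b_hat * X

rss = (resid**2).sum()
tss = ((Y - Y.mean())**2).sum()
r2 = 1.0 - rss / tss                                # coefficient of determination
sigma2_hat = rss / (N - 2)                          # unbiased residual variance

se_b = np.sqrt(sigma2_hat / (x**2).sum())
se_a = np.sqrt(sigma2_hat * (1.0 / N + X.mean()**2 / (x**2).sum()))
print("R^2 =", round(r2, 4))
print("t(a=0) =", round(a_hat / se_a, 3), " t(b=0) =", round(b_hat / se_b, 3))
```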
3.5 FORECASTING

For forecasting, recall the model in (3.1): Y_i = a + b X_i + e_i, i = 1, 2, …, N, where e_i is i.i.d. N(0, σ_e²). The OLS forecast of Y_{N+1} given X_{N+1} is

Ŷ_{N+1} = â + b̂ X_{N+1} ,

where â and b̂ are the OLS BLUE estimates. This forecast is the most efficient amongst all linear forecasts. In general, the most efficient forecast is E( Y_{N+1} | X_{N+1} ), and this needs not be the OLS forecast if the regression model is not linear as in (3.1).
We can express the above in terms of the population parameters.

Ŷ_{N+1} = â + b̂ X_{N+1} = Ȳ − b̂ X̄ + b̂ X_{N+1} = Ȳ + b̂ x_{N+1} ,   where x_{N+1} = X_{N+1} − X̄.

But (3.1) gives Ȳ = a + b X̄ + (1/N) Σ_{i=1}^{N} e_i, and here we are dealing with a random variable rather than a sample estimate. So,

Ŷ_{N+1} = a + b X̄ + (1/N) Σ_{i=1}^{N} e_i + b̂ x_{N+1} ,

which is again a representation as a random variable. But Y_{N+1} = a + b X_{N+1} + e_{N+1}, so the forecast (or prediction) error is

Y_{N+1} − Ŷ_{N+1} = b x_{N+1} − b̂ x_{N+1} + e_{N+1} − (1/N) Σ_{i=1}^{N} e_i = −( b̂ − b ) x_{N+1} + e_{N+1} − (1/N) Σ_{i=1}^{N} e_i.

The forecast error (conditional on x_{N+1}) is normally distributed. Now, E( Y_{N+1} − Ŷ_{N+1} | x_{N+1} ) = 0, so the forecast is unbiased.

var( Y_{N+1} − Ŷ_{N+1} | x_{N+1} ) = x²_{N+1} var( b̂ ) + σ_e² + σ_e²/N = σ_e² ( 1 + 1/N + x²_{N+1} / Σ_{i=1}^{N} x_i² ).

So,

( Y_{N+1} − Ŷ_{N+1} ) / [ σ̂_e ( 1 + 1/N + x²_{N+1} / Σ_{i=1}^{N} x_i² )^{1/2} ] ~ t_{N−2}.

Therefore, a 95% confidence interval for Y_{N+1} is

Ŷ_{N+1} ± t_{N−2, 95%} σ̂_e ( 1 + 1/N + x²_{N+1} / Σ_{i=1}^{N} x_i² )^{1/2} ,

where t_{N−2, 95%} is the two-tailed 95% critical value of the t-distribution with N−2 degrees of freedom.
As a final comment, suppose a distributional assumption for e_i is made, e.g. normality; then another important class of estimators, maximum likelihood estimators (MLE), can be developed. MLE essentially chooses estimators that maximize the sample likelihood function. There is equivalence of the OLS and ML estimators in the specific case of normally distributed i.i.d. e_i's. However, MLE is in general a nonlinear estimator.³

³ We shall re-visit the idea of maximum likelihood estimation in more detail in a later chapter. The theory of maximum likelihood estimation and nonlinear estimation can be read in more advanced econometrics textbooks. There are numerous such excellent books. One example is: Russell Davidson and James G. MacKinnon, (1993), Estimation and Inference in Econometrics, Oxford University Press.
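A sketch of the forecast interval formula above: a point forecast and a 95% interval for Y_{N+1} at a new regressor value, on simulated data with illustrative parameters (X_new = 7.5 is a hypothetical value).

```python
# OLS point forecast and 95% forecast interval for Y_{N+1}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
N, a, b, sigma_e = 200, 1.0, 2.0, 0.5
X = rng.uniform(0.0, 10.0, size=N)
Y = a + b * X + rng.normal(0.0, sigma_e, size=N)

x = X - X.mean()
b_hat = (x * (Y - Y.mean())).sum() / (x**2).sum()
a_hat = Y.mean() - b_hat * X.mean()
sigma_hat = np.sqrt(((Y - a_hat - b_hat * X)**2).sum() / (N - 2))

X_new = 7.5                                          # hypothetical new regressor value
y_hat = a_hat + b_hat * X_new
half_width = (stats.t.ppf(0.975, df=N - 2) * sigma_hat *
              np.sqrt(1.0 + 1.0 / N + (X_new - X.mean())**2 / (x**2).sum()))
print(f"forecast = {y_hat:.3f}, 95% interval = [{y_hat - half_width:.3f}, {y_hat + half_width:.3f}]")
```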
3.6 STOCK INDEX FUTURES

Although the framework of linear regression can be applied to explain and


also predict many financial variables, it is usually not enough to know just the
econometric theory as seen so far in this chapter. To do a good job of
exposition and predicting some financial variables, there would usually be a
finance-theoretic framework and an appropriate way to think about how
financial variables may interact and dynamically change over time as a result
of investor actions and market conditions. Therefore we will introduce these
as the chapters proceed. In the rest of this chapter, we shall concentrate on a
very important financial instrument used in the futures market as well as used
by portfolio managers for hedging purposes.
A stock index is usually a capitalization- or value-weighted average of stock prices. The Standard and Poor's (S&P) 500 stock index, for example, is a weighted average number reflecting the average price of 500 major stocks trading in the U.S. The Nikkei 225 (or Nikkei 300) is the average price of 225 (or 300) Japanese stocks. The FTSE (The Financial Times and London Stock
(or 300) Japanese stocks. The FTSE (The Financial Times and London Stock
Exchange) indexes reflect portfolios of UK stocks. The Straits Times Index is
the average price of 30 of the biggest stocks traded on the Singapore Stock
Exchange. There are numerous stock indexes reflecting price-averages of
stocks in a country, in sectors of a country, and sometimes across bourses in a
region.
These stock index numbers change every day, and usually more frequently
on an intraday basis, as long as there is an agency or mechanism that
computes the new average number as the constituent stocks change their
traded prices in the market. While the index numbers themselves are not
directly trade-able, derivatives or contracts written on them can be traded. One
such type of contract is the stock index futures. Others include stock index
options, exchange-traded funds, and so on.
We shall consider stock index futures that are traded in Stock or Futures
Exchanges. In September, for example, one can trade on a Nikkei 225 Index
futures contract that matures in December. This is called a December Nikkei
225 Index futures contract to reflect its maturity. After its maturity date, the
contract is worthless. In September, however, the traded price (this is not a
currency price, but an index price or a notional price) of this December
contract will reflect how the market thinks the final at-maturity Nikkei 225
index will be. If the September market trades the index futures at a notional
price of 12,000, and you know that the December index number is going to be
higher, then you will buy (long) say N of the Nikkei 225 Index December
futures contracts. At maturity in December, if you still have not sold your
position, and if the Nikkei 225 index is really higher at 14,000, then you will
make a big profit. This profit is calculated as the increase in the futures notional price (2,000 points in this case) × the Yen value per point per contract × the number of contracts N.
Thus, a current stock index futures notional price is related to the index
notional price at a future time. At maturity, the index futures notional price
also converges to the underlying stock index number. As stock index
represents the average price of a large portfolio of stocks, the corresponding
stock index futures notional price is related to the value of the underlying
large portfolio of stocks making up the index. This relationship is sometimes
called the no-arbitrage model pricing. It can be explained briefly as follows.
3.7 COST OF CARRY MODEL
Suppose we define the stock index value to be St at current time t. This value
is the capitalization-weighted average of the underlying portfolio stock prices.
The actual market capitalization currency value of the portfolio is of course a
very large constant multiplier of this index value. Nevertheless, the percentage
return to the index changes reflects the overall portfolios gain or loss.
Suppose an institutional investor holds a large diversified portfolio say of the
major Japanese stocks. Even if this portfolio consists of only 80% of the
number of stocks in the Nikkei 225 Index, the return to this portfolio would
look quite similar to the return computed on the changes in the Nikkei 225
Index value.
Let the effective risk-free interest rate or cost of carry be Rt,T over [t,T].
An arbitrageur could in principle buy or short-sell a portfolio of N225 stocks in proportions equal to their value weights in the index. Let the cost of this portfolio be λS_t, whereby λ is a constant multiplier reflecting the proportionate relationship of the portfolio value to the index notional value S_t. The percentage return on the index is then also the same percentage return on the portfolio.
The arbitrageur either carries or holds the portfolio, or short-sells the portfolio till maturity T, with a final cost at T of λS_t(1+R_{t,T}) after the opportunity cost of interest compounding is added. Suppose the Japanese stocks in the N225 index issue an aggregate amount of dividends D over the period [t,T]. Since the N225 index notional value is proportional to the overall market value of the 225 Japanese stocks, the dividend yield d as a fraction of the total market value is the same dividend yield as a fraction of the N225 index notional value. Then, the dividends issued to the arbitrageur's portfolio amount to dλS_t. Suppose that the dividends to be received are perfectly anticipated; then the present value of this amount, d*λS_t, can be deducted from the cost of carry. Let D* = d*S_t. The net cost of carry of the stocks as at time T is then

λ[S_t − D*](1+R_{t,T}).
Suppose the N225 index futures notional price is now trading at F_{t,T}. The subscript notation implies the price at t for a contract that matures at T. The arbitrageur would enter a buy or long position in the stocks if at t, F_{t,T} > [S_t − D*](1+R_{t,T}). At the same time t, the arbitrageur sells an index futures contract at notional price F_{t,T}. For simplicity, we assume the currency value per point per contract is 1. Without loss of generality, assume λ = 1. At T, whatever the index value S_T = F_{T,T}, the arbitrageur would:
Sell the portfolio at S_T, gaining (in Yen or $, whichever may be)
 $ S_T − [S_t − D*](1+R_{t,T}).
Cash-settle the index futures trade, gaining
 $ F_{t,T} − F_{T,T}, or $ F_{t,T} − S_T.
The net gain is the sum of the two terms, or $ F_{t,T} − [S_t − D*](1+R_{t,T}) > 0. Thus, the arbitrageur risklessly makes a profit equivalent to the net gain above.
Conversely, the arbitrageur would enter a short position in the stocks if at t, F_{t,T} < [S_t − D*](1+R_{t,T}). At the same time t, the arbitrageur buys an index futures contract at notional price F_{t,T}. At T, whatever the index value S_T = F_{T,T}, the arbitrageur would:
Buy back the portfolio at S_T, gaining (in Yen or $, whichever may be)
 $ [S_t − D*](1+R_{t,T}) − S_T.
Cash-settle the index futures trade, gaining
 $ F_{T,T} − F_{t,T}, or $ S_T − F_{t,T}.
The net gain is the sum of the two terms, or $ [S_t − D*](1+R_{t,T}) − F_{t,T} > 0. Thus, the arbitrageur risklessly makes a profit equivalent to the net gain above.
We have of course ignored transaction costs in this analysis, which would mean that it is even more difficult to try to make a riskless arbitrage⁴ profit. The cost-of-carry model price of the index futures, F_{t,T} = [S_t − D*](1+R_{t,T}), is also called the fair value price.
We employ data from the Singapore Exchange (SGX) that contain daily end-of-day Nikkei 225 Index values and Nikkei 225 Index December 1999 futures contract prices traded at SIMEX/SGX during the period September 1 to October 15, 1999. During this end-1999 period, the Japan money market interest rate was very low at 0.5% p.a. We use this as the cost-of-carry interest rate. We also assume the Nikkei 225 stock portfolio's aggregate dividend was 1.0% p.a. at present value. During these trade dates, the term-to-maturity is about 1/4 of a year or 3 months, from September/October till December.
Based on the finance theory above, we plot in Figure 3.6 the two time series of the N225 futures price F_{t,T} and the fair price F_t* = [S_t − D*](1+R_{t,T}), where D* = (1/4 × 1.0%)S_t so that S_t − D* = 0.9975 S_t, and R_{t,T} = 1/4 × 0.5% = 0.125%.
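A sketch of the fair value and the deviation measure p used in this application. The index levels and futures prices below are made-up numbers; the quarterly dividend yield and interest rate follow the assumptions just stated in the text.

```python
# Cost-of-carry fair value F* = [S - D*](1 + R) and percent deviation p = (F - F*)/F*.
import numpy as np

S = np.array([17600.0, 17450.0, 17800.0, 18050.0])   # hypothetical N225 index levels
F = np.array([17630.0, 17420.0, 17850.0, 18020.0])   # hypothetical futures prices

d_star = 0.25 * 0.010          # present value of the dividend yield over the quarter
R = 0.25 * 0.005               # cost of carry over the quarter
F_fair = S * (1.0 - d_star) * (1.0 + R)               # F* = [S - D*](1 + R), with D* = d*S

p = (F - F_fair) / F_fair                             # normalized percent deviation
print("fair values:", np.round(F_fair, 1))
print("p (%):", np.round(100 * p, 3), " mean p (%):", round(100 * p.mean(), 4))
```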
Figure 3.6
Prices of Nikkei 225 December futures contract from 9/1/1999 to
10/15/1999
[Time-series plot omitted: the daily Futures Price and Fair Value series from 9/1/1999 to 10/15/1999, moving closely together between roughly 16,000 and 18,500.]
⁴ One of the earliest studies showing that such risk-free arbitrage in the Nikkei 225
Stock Index futures had largely disappeared, after transaction costs, in the late 1980s
is a paper by Kian-Guan Lim, (1992), Arbitrage and Price Behavior of the Nikkei
Stock Index Futures, Journal of Futures Markets, Vol 12, No 2, 151-162.

We also compute a normalized percent difference p = (F_{t,T} − F_t*)/F_t*, indicating the percentage deviation from the fair price F_t*. We test the statistical hypothesis H0: the mean of p is zero, during this period. The simple t-test is used here. If p is highly positive, then arbitrageurs can make a profit by shorting the index futures and buying stocks to carry. If p is highly negative, then arbitrageurs can make a profit by longing the index futures and shorting stocks (borrowing them through a brokerage and paying extra transaction costs in this case). The time series of p is also shown in Figure 3.7.
Figure 3.7
Difference between futures price and fair price
[Time-series plot omitted: daily p from 9/1/1999 to 10/15/1999, fluctuating around zero roughly between −0.015 and +0.01.]

Figure 3.6 shows that the futures price and the fair value track each other closely. The sample size for the two variables is 30. The t-statistic of their daily difference p, with 29 degrees of freedom, is p̄/s.e.(p̄) = −0.1278, which is statistically not significant at a 2-tailed significance level of 10%.
The daily differences p shown in Figure 3.7 were contained within ±1% of the fair value Yen cost. Though p was statistically not significantly different from zero, the results could still be economically significant in that actual arbitrage profits (though small, if any) could take place on some days if the arbitrageur's total cost of transactions was much less than 1%.
The transaction costs would include the costs paid to the Exchange of
buying and selling the stocks and of the futures. This may exclude cost paid to
brokers if the arbitrageur itself is a broking house or a large institutional
player. There is also the spread cost, i.e. buying at the market's offer price and selling at the market's bid price.

Sometimes when large transactions take place, the market price may be sensitive to the price pressure: prices may move against a large buy order, resulting in a higher average price paid for the total order, or prices may dip during a large sell order, resulting in a lower average revenue. This is called impact cost.
For a short position in stocks, the cost could be even higher as the arbitrageur
normally would have to also pay for borrowing the stocks.
The graph in Figure 3.7 suggests that if p > 0 (or futures price > fair value), except for one outlier, it reverses direction downward when it hits about ½%. If p < 0 (or futures price < fair value), it reverses direction upward when it hits −1%. One possible interpretation, though this is by no means conclusive, is the following. Suppose the transaction cost to arbitrageurs in a long spot-short futures situation, where p > 0, was a little less than ½%; then arbitrage profits occurred at about p = ½%. This drew infinitely (or theoretically close to infinitely) many sell orders for futures and buy orders for spot, and thus pushed the futures price down and the spot price up. Then, p started to drop toward zero. This is the observed reversal pattern close to p = ½%.
On the other hand, suppose the transaction cost to arbitrageurs in a short spot-long futures situation, where p < 0, was a little less than 1%; then arbitrage profits occurred at about p = −1%. This drew infinitely (or theoretically close to infinitely) many buy orders for futures and sell orders for spot, and thus pushed the futures price up and the spot price down. Then, p started to increase toward zero. This is the observed reversal pattern close to p = −1%.
The above interpretation would be consistent with (note that we did not say it is conclusive or even convincing evidence of)5 the presence of some arbitrage opportunities for large institutional players with relatively low costs of transaction. The pattern basically says that if on day t, Δpt (the change in p from t-1 to t) hit a high note and drew in arbitrage, then p would reverse, and the following Δpt+1 would be opposite in sign to Δpt.
We can statistically examine the reversal in p by regressing the daily change in p on its lag. Specifically, we perform the linear regression Δpt+1 = a + b Δpt + et+1, where Δpt+1 = pt+1 − pt, a and b are coefficients, and et+1 is assumed to be an i.i.d. residual error. In this case, the first data point Δp2 = p2 − p1 is the change in p from 9/1 to 9/2; the last data point Δp30 = p30 − p29 is the change in p from 10/14 to 10/15. Since we employ a lag for the regression, the number of sample observations used in the regression is further reduced by 1, so there are only N = 28 data points
5
There are other stories from a market micro-structure perspective. One example of such possibilities is the bid-ask bounce causing negative serial correlations even when there was no information and the market was efficient. See Richard Roll, (1984), A Simple Implicit Measure of the Effective Bid-Ask Spread in an Efficient Market, Journal of Finance, 1127-1139.
{ (p30 − p29), (p29 − p28), …, (p3 − p2) } for the dependent variable, and 28 data points { (p29 − p28), (p28 − p27), …, (p2 − p1) } for the explanatory variable, which is the first lag of the dependent variable in the regression. Thus the linear regression produces a t-statistic with N − 2, or in this case 26, degrees of freedom. The regression results are reported in Table 3.1.
The regression results show that the estimate of b is -0.523708, which is statistically significant at the two-tailed 1% significance level (p-value 0.0057), thus rejecting H0: b = 0. This shows the presence of reversals in p across days.
Table 3.1
Regression of Difference of Actual and Fair N225 Futures Price on its
Lag: Δpt+1 = a + b Δpt + et+1 (CHANGEP is Δpt+1)

Dependent Variable: CHANGEP
Method: Least Squares
Sample: 1 28
Included observations: 28

Variable          Coefficient   Std. Error   t-Statistic   Prob.
C                    0.000161     0.000891         0.181   0.8579
LAGCHANGEP          -0.523708     0.173972        -3.010   0.0057

R-squared            0.2585      F-statistic                 9.062
Adjusted R-squared   0.2299      Prob(F-statistic) df 1,26   0.00574
S.E. of regression   0.004701    Sum squared resid           0.000200

The probability reported in the output Table 3.1 is the two-tailed p-value, Prob(|t| > 3.010).
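A minimal sketch of this lag regression, using statsmodels with a hypothetical p series in place of the 30 daily percent differences, might look as follows.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical series standing in for the 30 daily percent differences p.
    rng = np.random.default_rng(1)
    p = rng.normal(0, 0.005, 30)

    dp = np.diff(p)               # delta p_t = p_t - p_{t-1}, 29 observations
    y = dp[1:]                    # delta p_{t+1} for t+1 = 3,...,30 (28 points)
    x = sm.add_constant(dp[:-1])  # delta p_t   for t   = 2,...,29 (28 points)

    result = sm.OLS(y, x).fit()
    print(result.summary())       # a significantly negative slope indicates reversals in p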


3.8

HEDGING

We can also use linear regression to study optimal hedging. Suppose a large institutional investor holds a huge well-diversified portfolio of Japanese stocks whose returns closely follow the N225 stock index return, or rate of change. Suppose that in September 1999 the investor was nervous about an imminent big fall in Japanese equity prices, and wished to protect his portfolio value over the period September to mid-October 1999. He could liquidate his stocks. But this would be unproductive, since his main business was to invest in the Japanese equity sector. Besides, liquidating a huge holding or even a big part of it would likely result in a loss due to impact costs. Thus, the investor decided to hedge the potential drop in index value by selling some h Nikkei 225 index futures contracts. If Japanese stock prices did fall, then the gain
in the short position of the futures contracts would make up for the loss in the
actual portfolio value.
The investor's original stock position has a total current value Vt. For example, this could be 10 billion Yen. Suppose his stock position value is a constant factor f times the N225 index value St, so Vt = f St. Then Vt+1 = f St+1, and the portfolio return rate Vt+1/Vt = St+1/St, as mentioned in the last paragraph.
In essence, the investor forms a hedged portfolio comprising f St Yen of stocks and h short positions in N225 index futures contracts. The contract with maturity T has notional traded price Ft,T and an actual value of ¥500 × Ft,T, where the contract is specified to have a value of ¥500 per notional price point. At the end of the risky period, his hedged portfolio value change would be:
Pt+1 − Pt = f (St+1 − St) − h × 500 × (Ft+1,T − Ft,T).     (3.6)
In effect, the investor wished to minimize the risk or variance of Pt+1 − Pt ≡ ΔP. Now, simplifying notation, from (3.6): ΔP = f ΔS − h (500) ΔF.
So, var(ΔP) = f² var(ΔS) + h² (500²) var(ΔF) − 2h (500) f cov(ΔS, ΔF).
The first order condition (FOC) for minimizing var(ΔP) with respect to the decision variable h yields:
2h (500²) var(ΔF) − 2 (500 f) cov(ΔS, ΔF) = 0,
or a risk-minimizing optimal hedge of
h* = f cov(ΔS, ΔF) / {500 var(ΔF)}.
This is a positive number of contracts, since St and Ft,T would move together; recall that at maturity T of the futures contract, ST = FT,T. h* can be estimated by substituting in the sample estimates of the covariance in the numerator and of the variance in the denominator. It can also be estimated through the following linear regression employing the OLS method:
ΔS = a + b ΔF + e,
where e is the usual residual error that is uncorrelated with ΔF. We run this regression and the results are shown in Table 3.2. Theoretically, b = cov(ΔS, ΔF)/var(ΔF) = (500/f) h*. (Recall that earlier in the chapter, when dealing with the 2-variable linear regression, b = cov(Xi, Yi)/var(Xi).) The OLS estimate of b is thus the risk-minimizing or optimal hedge ratio, i.e. an estimate6 of (500/f) h*. The h* estimate is then found as (estimated b) × f / 500, the number of futures contracts to short in this case.
6
One of the earliest studies to highlight the use of least squares regression in optimal hedging is Louis H. Ederington, (1979), The Hedging Performance of the New Futures Markets, Journal of Finance, Vol 34, No 1, 157-170.
Table 3.2
Regression of Change in Nikkei Index (SPOTCHANGE) on Change in
Nikkei Futures Price (FUTCHANGE): ΔS = a + b ΔF + e

Dependent Variable: SPOTCHANGE
Method: Least Squares
Sample: 2 30
Included observations: 29

Variable      Coefficient   Std. Error   t-Statistic   Prob.
FUTCHANGE        0.715750     0.092666      7.723968   0.0000
C                4.666338    24.01950       0.194273   0.8474

R-squared            0.688436    Mean dependent var        1.087586
Adjusted R-squared   0.676897    S.D. dependent var      227.5159
S.E. of regression 129.3249      Akaike info criterion    12.62900
Sum squared resid  451573.2      Schwarz criterion        12.72330
Log likelihood    -181.1206      Hannan-Quinn criter.     12.65854
F-statistic         59.65968     Durbin-Watson stat        2.706941
Prob(F-statistic)    0.000000

From Table 3.2, b is 0.71575. With a 10 billion portfolio value and spot
N225 index on 1 September 1999 at 17479, f = 10b/17479 = 572115. Number
of futures contract to short in this case is estimated at
h* = b f /500 = 0.71575 572115 / 500 819 N225 futures contracts.
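A minimal sketch of this hedge ratio regression and contract count, with hypothetical spot and futures series standing in for the actual Nikkei data, is:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical daily spot index and futures price series (30 observations each).
    rng = np.random.default_rng(2)
    futures = 17000 + np.cumsum(rng.normal(0, 150, 30))
    spot = futures + rng.normal(0, 80, 30)

    dS = np.diff(spot)            # delta S
    dF = np.diff(futures)         # delta F

    # OLS of delta S on delta F: the slope estimate is cov(dS, dF)/var(dF)
    res = sm.OLS(dS, sm.add_constant(dF)).fit()
    b_hat = res.params[1]

    portfolio_value = 10e9        # 10 billion Yen stock position
    spot_level = 17479            # N225 index on 1 September 1999
    f = portfolio_value / spot_level
    h_star = b_hat * f / 500      # contracts to short, at 500 Yen per index point
    print(round(h_star))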
3.9

PROBLEM SET

3.1 Suppose Yt is the excess stock return at month t, and Xt is the excess market return at time t. Suppose we run an OLS regression
Yt = a + b Xt + et
on a sample of size 60. Assume normally distributed returns.
(i) Find the alpha and the beta estimate of the stock using the following sample data:
$$\sum_{t=1}^{60} X_t Y_t = 5.0\times10^{-3}, \quad \bar{X} = 0.005, \quad \sum_{t=1}^{60} X_t^2 = 4.0\times10^{-3}, \quad \bar{Y} = 0.003.$$
Further, given SSR = 5.8 × 10⁻⁵, provide a test of H0: a = 0 and of H0: b = 1.
(It is enough to show the computed t-statistics.)
(ii) Suppose you had instead run an OLS regression
Yt = c + d Zt + nt
where Zt is another factor variable, e.g. market trading volume. If the OLS estimates of c and d are significantly different from zero, and R² is high, does it imply that the market premium is not important in this case? Explain why or why not.

3.2 Consider a random sample of size n, {xi}, i = 1, 2, …, n, drawn repeatedly from a normally distributed variable X ~ N(μ, σ²). The sample variance is computed as
$$s^2 = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})^2.$$
Find the expected value of s² and thus show that s² is an unbiased estimate of σ².


3.3 If we define basis = spot index notional price − index futures price, suppose we calculate a time series of the basis values of a futures contract from time t when it is traded till the futures contract matures or expires at time T. Let the basis value at τ, t ≤ τ ≤ T, be a realized value of the corresponding random variable Bτ. Is Bτ, for t ≤ τ ≤ T, a stationary process?
3.4 In the OLS regression ΔS = a + b ΔF + e, what statistic would we look for to be able to tell how well the variations in ΔS are explained by ΔF?
3.5 Show a scatterplot of (Xi, Yi) of the following data.

i     1    2    3    4    5    6    7    8    9    10
Xi    2    2.5  3    3    4    5    4.5  6    7    8
Yi    3    4    4.5  5    6.5  7    8    8.5  9    9.5

(i) Find X̄ and Ȳ.
(ii) Find the OLS estimates of a and b in Yi = a + b Xi + ei, where ei is independent of Xi.
3.6 Given data on Y and X, explain how you will estimate the parameters

in the following equation using OLS? Do you need to assume anything about the data?
Y = (α + β X²) / X
3.7 You hypothesize that demand for a particular credit card is linearly
related to income. In using OLS of demand on income to estimate the
coefficients of the regression model assuming the hypothesis is true,
would you attempt to sample the card subscription patterns or demand
using a survey of people from a homogeneous income group or from a
wide range of income groups? Explain.
3.8 A well-diversified stock portfolio has $40m value, and a beta 1.22
relative to S&P500. The portfolio manager anticipates a bearish horizon.
Instead of liquidating, he sells N contracts of S&P500 stock index futures.
Suppose 1 futures contract = $500 x futures price in points or notional
value. Suppose the S&P 500 notional value is now 1500 points. Suppose the
rate of change of the S&P 500 index is the same as the rate of change of
the futures index. What is N if the manager wants to hedge perfectly?
3.9 Given the market model of stock returns rit = αi + βi rmt + eit, where eit is i.i.d. N(0, σe²), suppose the OLS estimates of αi and βi are found. Given sample observations {rit} and {rmt} for t = 1, 2, 3, …, T, and the estimates of αi and βi, show the appropriate formulae for computing the unbiased estimates of
(i) σe²
(ii) the conditional variance var(rit|rmt)
(iii) the (unconditional) variance of rit
FURTHER RECOMMENDED READINGS
[1] Russell Davidson and James G. MacKinnon, (1993), Estimation and
Inference in Econometrics, Oxford University Press.
[2] Damodar N. Gujarati, (1995), Basic Econometrics, 3rd edition,
McGraw-Hill.
[3] Jack Johnston, and John DiNardo, (1997), Econometric Methods, 4th
edition, McGraw-Hill.


Chapter 4
MODEL ESTIMATION
APPLICATION: CAPITAL ASSET PRICING MODEL
Key Points of Learning
Market model, Capital asset pricing model, Beta, Alpha, Systematic risk,
Unsystematic risk, Security Characteristic Line, Test of Zero-Beta CAPM,
Jensen measure, Treynor measure, Sharpe measure, Appraisal ratio, Market
timing

In this chapter we look at how estimation is performed on an important


finance valuation model to address the problem of risk pricing and also to
apply the model to compute investment performances.
A key foundation of finance is the tradeoff between risk and return. Assets
returning higher ex-ante (or conditional expected) return should also bear
higher conditional risk, and vice-versa. Finance theory and models seek to
establish the exact relationship between this return and risk. One of the earliest
finance models is the Sharpe-Lintner7 capital asset pricing model (CAPM) that
links expected return to only systematic risk. The unsystematic or residual risk
can be diversified away and is therefore not priced. Higher systematic risk in
the form of market risk is priced. For any asset, this market risk is reflected in
the sensitivity of the asset return to the market factor movement. Higher
sensitivity implies a higher beta or systematic risk. When we refer to risk, it is
usually quantified in terms of standard deviation instead of variance since the
former is expressed in terms of the same unit as the underlying variable. We
shall begin our study of CAPM with the statistical market model.
4.1

MARKET MODEL

Suppose rit is the return rate of stock i at time t, and rmt is the market portfolio return rate (or alternatively the return rate of a market index) at time t. If rit and rmt are bivariate normally distributed (all stock returns being MVN is sufficient to give this bivariate relationship), then the conditional distribution of rit|rmt is normal
7
Sometimes this is also called the Sharpe-Lintner-Mossin CAPM in recognition of all three scholars who did work in similar areas in the early 1960s. Reference as Sharpe-Lintner-Black CAPM also abounds, given that Fischer Black generalized the CAPM to the zero-beta version.

with (conditional) mean and (conditional) variance
$$E(r_{it}\,|\,r_{mt}) = E(r_{it}) + \frac{\sigma_{im}}{\sigma_m^2}\left[r_{mt} - E(r_{mt})\right] = \left[E(r_{it}) - \frac{\sigma_{im}}{\sigma_m^2}E(r_{mt})\right] + \frac{\sigma_{im}}{\sigma_m^2}\,r_{mt} \qquad (4.1)$$
$$\mathrm{var}(r_{it}\,|\,r_{mt}) = \sigma_i^2 - \frac{\sigma_{im}^2}{\sigma_m^2} \qquad (4.2)$$

Then we can write the linear regression model of rit on rmt as
$$r_{it} = a + b\,r_{mt} + e_{it} \qquad (4.3)$$
where
$$a = E(r_{it}) - \frac{\sigma_{im}}{\sigma_m^2}\,E(r_{mt}), \qquad b = \frac{\sigma_{im}}{\sigma_m^2}, \qquad (4.4)$$
and $E(e_{it}\,|\,r_{mt}) = 0$.

This last condition can be taken as a specification that eit is uncorrelated with rmt, which in the context of MVN amounts to stochastic independence. It also implies, via the law of iterated expectations, that E(eit) = 0. The above regression model is a precursor to the CAPM, and is called the market model. This is purely a statistical model, but it is consistent8 with the CAPM. Note that in general cov(eit, eit-1) is not necessarily zero. If we restrict this covariance to zero, then we are looking at what has been called a single-index model. In the latter case, the index need not actually be the market portfolio return, although it is natural to identify the appropriate single index or factor as the market return.
The verification of (4.1) is straightforward as in a bivariate normal distribution. If we take the unconditional variance of rit, or var(rit), in (4.3),
$$\sigma_i^2 = b^2\sigma_m^2 + \sigma_e^2, \qquad b = \frac{\sigma_{im}}{\sigma_m^2}.$$
Now bσm is called the systematic risk, and σe is called the unsystematic or idiosyncratic or diversifiable risk of stock i's return. But the conditional variance is var(rit | rmt) = σe². Hence var(rit | rmt) = σi² − σim²/σm², which is (4.2).
8
The market model does not necessarily imply the Sharpe-Lintner CAPM, as parameter a is not necessarily constrained to be the CAPM intercept. Nevertheless, a is a constant that can possibly follow that constraint. The CAPM does not necessarily imply the market model, as quadratic utility, rather than MVN, can be a sufficient condition for the CAPM. But MVN is a sufficient condition both for the CAPM and for the market model.

4.2

CAPITAL ASSET PRICING MODEL (CAPM)

Suppose we add economic equilibrium to the statistical regression model or market model in (4.3). Then the theory of the CAPM requires that the parameter a be restricted to
$$a = r_f\,(1 - b)$$
where rf is the riskfree rate, which is constant. This condition is not exactly (4.4), since rf is not required there nor does it appear there. However, it can be thought of as a special case of (4.4). Then, imposing this theoretical CAPM restriction, (4.3) becomes
$$r_{it} - r_{ft} = b\,(r_{mt} - r_{ft}) + e_{it} \qquad (4.5)$$
where we have added time indexing for the riskfree rate in order to allow the rate to vary over time. The familiar CAPM equation we usually see is the expectation condition (take the expectation of (4.5)):
$$E(r_{it}) = r_{ft} + b\,[E(r_{mt}) - r_{ft}] \qquad (4.6)$$
which holds for each time period t. In other words, this CAPM is actually a single period model where the cross-sectional expected returns of stocks are related to the expected excess market portfolio return. It has become common, for empirical as well as some theoretical reasons, including assuming stationary processes in the returns, to treat the estimation and testing of the CAPM in the time series version of (4.5) rather than just the expectation condition above.
In (4.5), the relationship applies for every stock i in the economy, so it is convenient to write b as bi (still a constant) to indicate the association of a different bi with a different stock i. bi is called the beta of stock i. Thus,
$$r_{it} - r_{ft} = b_i\,(r_{mt} - r_{ft}) + e_{it}$$
for stocks i = 1, 2, …, n in the economy or market, and for t = 1, 2, 3, …, T periods according to the time series sample size. Moreover, eit is uncorrelated with (rmt − rft). This beta bi for stock i is the same b in the related market model of (4.3).
When estimating the beta of a stock in the context of the market model (4.3), we use a time series of returns {rit, rmt}, t = 1, 2, 3, …, T. We can employ OLS on (4.3) and obtain the OLS estimator
$$\hat{b} = \frac{\sum_{t=1}^{T}(r_{mt} - \bar{r}_m)(r_{it} - \bar{r}_i)}{\sum_{t=1}^{T}(r_{mt} - \bar{r}_m)^2}$$

where $\bar{r}_x = \frac{1}{T}\sum_{t=1}^{T} r_{xt}$. Since the classical OLS conditions are met, $\hat{b}$ is BLUE. It is also consistent, converging asymptotically to $b = \sigma_{im}/\sigma_m^2$.

An alternative CAPM time series version to (4.5) is
$$r_{it} - r_{ft} = a_i + b_i\,(r_{mt} - r_{ft}) + e_{it} \qquad (4.7)$$
where $E(e_{it}\,|\,r_{mt}) = 0$ for each i and t. Since rft is supposed to be a constant at t, eit is independent of rft, and thus also of (rmt − rft). This version is slightly more general and has an added advantage as follows. It involves regression of the excess stock i return rate rit − rft on the excess market return rate rmt − rft, and an intercept ai. ai is also called the alpha of stock i. It is theoretically 0 in equilibrium, but could become positive or negative in an actual regression. The interpretation of the latter then becomes one of financial performance:
ai > 0: positive abnormal return
ai < 0: negative abnormal return

In the investment context, suppose rit is the return of a stock or a portfolio


over time. Positive alpha indicates that the stock or portfolio is providing
returns above normal or above the equilibrium according to CAPM where ai
should be zero. Negative alpha indicates that the stock or portfolio is
providing returns below normal or below the equilibrium according to CAPM
where ai should be zero.
Alpha is also called the Jensen measure in a portfolio context. Good quantitative strategy portfolio managers hunt for stocks with significantly positive alphas in order to form super-performing portfolios. The advantage of using (4.7) is not only that it allows for the possibility of disequilibrium situations and the uncovering of abnormal returns in the form of alphas that are significantly different from zero, but also that the use of excess returns as dependent and explanatory variables purges any inflationary components in the return rates. The theoretical model deals strictly with real rates of return, so it is good to use real excess rates of return for this reason, as in (4.7).
There are some practical issues. What is the ideal sampling size for
estimating betas, alphas, and the other risk and performance measures? The
Law of Large Numbers advises using as large a sample as possible when the
stochastic processes are stationary. However, it is observed that in practice
one can only obtain in any case a finite sample. Between 5 years of monthly
data and 10 years of monthly data, it may make sense to use only 5 years or 60
monthly sampling points. This is because the market changes over time in the

sense of changing its distribution so that if there is some random break in
between, the sampling moments computed for 5 years will be more robust
(less prone to deviation from the true underlying distribution during the
sample period) and make more sense than sampling moments across 10 years.
Another example is when the underlying distribution is stationary, but
conditional distribution changes dramatically such as 5 years of recession
followed by 5 years of boom. In such a case, taking a sample from the entire
10 years to model a situation of boom may provide incorrect inferences. Some
studies also recommend adjustments to the estimation of beta to minimize
sampling errors. For details, see Blume (1975).9
4.3

ESTIMATING BETA

Weekly stock return data of Chuan Hup (marine sector) and of the market index are collected for the sampling period 9/10/2000 to 8/25/2002 from the Singapore Stock Exchange. The market index is represented by the Singapore Straits Times Index, comprising large capitalization stocks from the main listing (the number as well as the constituent stocks are updated from time to time). The ST Index return proxies for the market portfolio return. The Singapore Government Treasury bill 3-month rates of return, which serve as a proxy for the riskfree rate, are also collected from Datastream.
To compute the excess weekly return on a stock, we subtract the weekly riskfree return rate from the weekly stock return. Ideally we should use the return of a Treasury bill with one week left till maturity as the weekly riskfree rate. However, short-term T-bill rates such as 3-month rates are more easily available, and we may use these as approximations. The interest rates are typically quoted on a per annum basis, so we need to convert them to a weekly basis, viz. ln(1 + r_weekly) = (1/52) × ln(1 + r_p.a.).
The linear regression model using (4.7) is employed to estimate alpha and beta:
$$r_{it} - r_{ft} = a_i + b_i\,(r_{mt} - r_{ft}) + e_{it}.$$
The case of Chuan Hup Limited (CHL) is illustrated as follows. The dependent variable is the weekly excess return rate, denoted CHL_EXC_RET. The formulae for the various reported statistics in Table 4.1 are explained as follows.
The sample sequence is from 2 to 103, thus yielding a sample size of 102 (Included observations). The number of regressors is k = 2, as shown by the number of explanatory variables in the Variable column, viz. C and MKT_EXC_RET.
9
Blume, Marshall, (1975), Betas and Their Regression Tendencies, Journal of Finance, Vol 30, No 3, 785-795.

$\hat{a}$ is the intercept estimate C in the table. $\hat{b}$ is the slope estimate, reflected as the coefficient of the explanatory variable MKT_EXC_RET, or the market excess return, in the table. The standard error of $\hat{a}$, the Std. Error of C, is
$$\hat{\sigma}_e\,\sqrt{\frac{1}{T} + \frac{\bar{X}^2}{\sum_{t=1}^{T}(X_t - \bar{X})^2}}.$$

Table 4.1
Regression Results of $r_{it} - r_{ft} = a_i + b_i\,(r_{mt} - r_{ft}) + e_{it}$
[EViews least squares output for CHL_EXC_RET on C and MKT_EXC_RET.]

The standard error of $\hat{b}$, the Std. Error of the coefficient of MKT_EXC_RET, is
$$\frac{\hat{\sigma}_e}{\sqrt{\sum_{t=1}^{T}(X_t - \bar{X})^2}}.$$
The SSR, Sum squared resid, is $\mathrm{SSR} = \sum_{t=1}^{T}\hat{e}_t^2$.


The standard error of regression, S.E. of regression, is
$$\hat{\sigma}_e = \sqrt{\frac{1}{T-2}\,\mathrm{SSR}}.$$
Mean dependent var in the table is $\bar{Y}$. The standard deviation, or S.D. dependent var, is
$$\sqrt{\frac{1}{T-1}\sum_{t=1}^{T}(Y_t - \bar{Y})^2}.$$
The F-statistic in the table is
$$F_{k-1,\,T-k} = \frac{R^2/(k-1)}{(1-R^2)/(T-k)}.$$
For the case k = 2, the t-Statistic for MKT_EXC_RET is also the square root of the $F_{1,\,T-k}$ statistic, where the first degree of freedom in the F-statistic is one, i.e. $t_{T-k}^2 = F_{1,\,T-k}$. This result does not generalize to k > 2.
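The estimation here is reported as EViews output; a minimal Python sketch of the same regression (with hypothetical weekly series in place of the CHL, STI and T-bill data) is shown below, and statsmodels reports the same statistics as Table 4.1 (coefficients, standard errors, t-statistics, R-squared, F-statistic, SSR).

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical weekly returns standing in for CHL, the market (STI) and the
    # riskfree rate over 102 weeks.
    rng = np.random.default_rng(3)
    rm = rng.normal(0.001, 0.02, 102)                              # market returns
    rf = np.full(102, 0.0004)                                      # weekly riskfree rate
    ri = rf + 0.003 + 0.6 * (rm - rf) + rng.normal(0, 0.025, 102)  # stock returns

    y = ri - rf                       # excess stock return (CHL_EXC_RET)
    x = sm.add_constant(rm - rf)      # excess market return (MKT_EXC_RET)

    res = sm.OLS(y, x).fit()
    alpha_hat, beta_hat = res.params  # intercept (alpha) and slope (beta) estimates
    print(res.summary())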
4.4
INTERPRETATION OF REGRESSION RESULTS
Chuan Hup Limited (CHL) is a firm that has substantial interests in the marine
business among others. We employ all 2 years of data, although a case can be
made for using a slightly smaller sample size, as was noted earlier.
What is the OLS estimate of alpha? It is $\hat{a}_i = (\bar{r}_i - \bar{r}_f) - \hat{b}_i\,(\bar{r}_m - \bar{r}_f)$, where the bar denotes the sampling average. From Table 4.1, it is seen that the OLS estimate for alpha is 0.0032, but the p-value of its t-statistic is 0.24, which means that we cannot reject H0: a = 0 at significance levels up to 20% for a 2-tailed test. The estimate of a appears not to be significantly different from zero. Thus there is a positive alpha (which would mean a good performing stock), but it is not significant.
The OLS estimate for beta is 0.624. Thus the stock is not well-diversified (or else its beta would be close to 1), and it is positively correlated with, but not strongly sensitive to, market movements. Its t-statistic is 7.755 with a p-value of practically zero, so beta is certainly significantly positive. The F-statistic with degrees of freedom k-1 = 1 and T-k = 102-2 = 100 is 60.143. Notice that the higher the coefficient of determination R², the higher the F-value. This is a test of the null H0: b = 0. Thus a good fit with reasonably high R² = 0.376 implies that b is unlikely to be zero. Therefore H0 is rejected, since the p-value for F1,100 is practically zero. The estimate of stock i's systematic risk is
$$\hat{b}_i\,\sqrt{\frac{1}{T-1}\sum_{t=1}^{T}(r_{mt} - \bar{r}_m)^2}.$$
The estimate of stock i's unsystematic risk is $\hat{\sigma}_e = \sqrt{\mathrm{RSS}/(T-2)}$. From the standard error of regression $\hat{\sigma}_e$ and the t-value for $\hat{b}$, we can compute $\sum_{t=1}^{T}(X_t - \bar{X})^2 = 0.11$. The systematic risk of stock i is thus estimated as $0.624 \times \sqrt{0.11/101} = 0.021$ or 2.1%. Unsystematic risk is estimated as $\hat{\sigma}_e = 0.0266$ or 2.66%. It is seen that the idiosyncratic risk for Chuan Hup during the 9/2000 to 8/2002 period is of the same magnitude as, and in fact slightly larger than, the systematic risk induced by market movements. Total (weekly) variance is 0.021² + 0.0266² = 0.0011. Total risk is therefore (0.0011)^0.5 = 0.0335 or 3.35%. This is indeed the standard deviation of the dependent variable.
We can also find R² from the TSS deduced from the standard deviation of the dependent variable, and from the sum of squared residuals or RSS. Then R² = 1 − RSS/TSS. For the 2-variable linear regression (including the constant, i.e. k = 2), F1,100 = t100², where t100 is the t-statistic of the beta OLS estimate.
The regression line as a result of the OLS method can be written as:
Excess CHL Return = 0.0031 + 0.624 × Excess Market Return.
Figure 4.1
CHL Security Characteristic Line

This linear relationship, or the line if expressed graphically, is called CHL's Security Characteristic Line (SCL). It shows the slope as the CHL stock's beta, the intercept as alpha, and indications of unsystematic risk as the dispersion of the returns about the line. We produce such a plot in Figure 4.1. The stock's SCL should not be confused with the market's security market line (SML), which is represented by a graph of expected returns versus their corresponding betas. We also plot the estimated (or fitted) residuals of the SCL in Figure 4.2.
Figure 4.2
Estimated Residuals

The OLS regression produces:
Excess CHL Return = 0.0031 + 0.624 × Excess Market Return.
In Figure 4.2, the actual curve refers to the Excess CHL Return, Yit. The fitted curve refers to 0.0031 + 0.624 × Excess Market Return, or $\hat{Y}_{it} = \hat{a} + \hat{b}\,X_{it}$. The estimated (or fitted) residuals refer to $\hat{e}_{it} = Y_{it} - \hat{a} - \hat{b}\,X_{it}$. Later we will see that it is important to check the properties of the residuals to ensure that they follow the classical assumptions.
4.5

PERFORMANCE MEASURES

The Jensen measure of portfolio performance is given by the alpha estimate in regressions similar to (4.7), where the excess return is used as the dependent variable. Jensen's alpha is also called the risk-adjusted return or abnormal return.
The Treynor measure of portfolio performance is given by the expected excess portfolio return rate per unit of beta, i.e.
$$\frac{E(r_{pt} - r_{ft})}{b_p}$$
where the subscript p denotes association with a portfolio. Theoretically, in equilibrium when there is no abnormal performance, as in a zero Jensen measure, it is equivalent to the expected excess market portfolio return rate. Therefore, this measure shows whether a portfolio is performing better than, equal to, or worse than the market portfolio. It is estimated using $(\bar{r}_p - \bar{r}_f)/\hat{b}_p$.
It can be shown that if the Jensen measure indicates superior (inferior) performance, then the Treynor measure indicates performance better than (worse than) the market's. Jensen's alpha is characterized as follows. In practice, the alpha is calculated by using sample averages instead of the population means, and bp would be an estimated input.
$$a \equiv E(r_{pt} - r_{ft}) - b_p\,E(r_{mt} - r_{ft})\ >\,(<)\ 0$$
$$\Leftrightarrow\quad \frac{E(r_{pt}) - r_{ft}}{b_p}\ >\,(<)\ E(r_{mt}) - r_{ft}.$$
The above performance measures are useful for well-diversified portfolios, but could also be interpreted for individual stocks. For portfolios that are not well-diversified, their total risk σp becomes important. The Sharpe measure or Sharpe ratio
$$\frac{E(r_{pt} - r_{ft})}{\sigma_p}$$
shows how well the portfolio is performing relative to the capital market line with slope $\frac{E(r_{mt} - r_{ft})}{\sigma_m}$:
$$\frac{E(r_{pt} - r_{ft})}{\sigma_p}\ >\,(<)\ \frac{E(r_{mt} - r_{ft})}{\sigma_m}$$
$$\Leftrightarrow\quad E(r_{pt} - r_{ft}) - \frac{\sigma_p}{\sigma_m}\,E(r_{mt} - r_{ft})\ >\,(<)\ 0$$
$$\Rightarrow\quad E(r_{pt} - r_{ft}) - \frac{\sigma_{pm}}{\sigma_p\sigma_m}\cdot\frac{\sigma_p}{\sigma_m}\,E(r_{mt} - r_{ft})\ \ (\text{uncertain})\ \ 0$$
$$\Rightarrow\quad E(r_{pt} - r_{ft}) - b_p\,E(r_{mt} - r_{ft}) = a\ \ (\text{uncertain})\ \ 0.$$
Thus, there is also some relationship between the Sharpe performance measure and the other two measures. All the measures identify superior performance consistently with one another. The Sharpe measure is estimated using $(\bar{r}_p - \bar{r}_f)/\hat{\sigma}_p$.
There is also the appraisal ratio $a_p/\sigma_e$, which is estimated by $\hat{a}_p/\hat{\sigma}_e$, where the numerator is Jensen's measure and the denominator is the residual or idiosyncratic risk, not total risk. Thus, the appraisal ratio ranks stocks with positive alphas according to the relative increase in idiosyncratic risk they bring to a portfolio. The higher the ratio, the better is the stock for investment or inclusion in a portfolio, all other things being equal. An excellent discussion of some of these issues on performance measures and attribution can be found in Bodie, Kane, and Marcus (1999), especially chapter 24.
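A minimal sketch of these sample estimates (the helper function and the return series below are hypothetical, not from any particular library) could be:

    import numpy as np
    import statsmodels.api as sm

    def performance_measures(rp, rm, rf):
        """Sample estimates of Jensen alpha, beta, Treynor, Sharpe and appraisal ratios."""
        y = rp - rf
        x = sm.add_constant(rm - rf)
        res = sm.OLS(y, x).fit()
        alpha, beta = res.params
        sigma_e = np.sqrt(res.ssr / (len(y) - 2))    # residual (idiosyncratic) risk
        return {"jensen_alpha": alpha,
                "beta": beta,
                "treynor": y.mean() / beta,
                "sharpe": y.mean() / y.std(ddof=1),
                "appraisal": alpha / sigma_e}

    # Hypothetical 60 months of portfolio, market and riskfree returns
    rng = np.random.default_rng(4)
    rm = rng.normal(0.008, 0.04, 60)
    rf = np.full(60, 0.003)
    rp = rf + 0.002 + 1.1 * (rm - rf) + rng.normal(0, 0.02, 60)
    print(performance_measures(rp, rm, rf))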
An early study by Banz (1981)10 documents an important observation that stock capitalization value or size matters (at least during the sampling period, and in many subsequent studies well into the 2000s) for in-sample and usually ex-post realized returns. This is an aberration from the CAPM, which prescribes the market index as the only common factor affecting all stock returns.
We employ stock return data from Professors Fama and French of CRSP stocks divided into 10 value-weighted portfolios, with each portfolio containing a decile ranked by capitalization value or firm size. Market returns (S&P 500 index returns) and 1-month U.S. Treasury bill rates of return for the sampling period January 2002 till December 2006 are also used. Monthly (end-of-month) return rates are used in the regression of (4.7) for each of the portfolios.
The alphas and betas for each sized-portfolio are obtained and plotted in Figure 4.3 against the size deciles. On the horizontal axis, decile 1 denotes the smallest capitalization portfolio while decile 10 denotes the largest capitalization portfolio.
In Figure 4.3, all the betas are significantly different from zero, with very small p-values of less than 0.0005. Only 3 of the alpha values are significantly different from zero at the 1% significance level. It is seen that the smallest size firms in decile 1 realize the largest positive Jensen's alpha, while the biggest size firms in decile 10 realize the only negative alpha amongst all portfolios. This result appears to be consistent with the empirical observations by Banz. Betas are seen to fall slightly as size increases.
10

Banz, Rolf W., (1981), The Relationship Between Return and Market Value of
Common Stocks, Journal of Financial Economics, 3-18.

Figure 4.3
Alphas and Betas in CAPM regressions using monthly returns of 10
Sized-Portfolios in Sampling Period January 2002 to December 2006
[Figure: ALPHA and BETA plotted against portfolio capitalization size deciles, from smallest size (decile 1) to biggest size (decile 10); vertical axis from -0.4 to 1.6.]

The investment performance measure of alpha, and also the risk measure (systematic risk sensitivity) of beta to the market risk premium E(rmt − rft), however, need to be applied with care when dealing with hedge funds and investment strategies such as market timing.
Since the 1990s, hedge funds have become quite fashionable. These funds, unlike traditional investment funds or unit trusts that go long and hold selected assets over selected horizons, can go short, roll over derivatives, and perform all kinds of investments in virtually any asset class there is in the financial markets. Therefore, it is not appropriate to measure the performance of hedge funds using the traditional performance measures described above. They may display a very high return-to-risk (Sharpe) ratio, but a lot of risk could be contained in huge negative skewness or huge kurtosis that does not show up readily in variance. Fung and Hsieh (2001)11 have described a method using complicated lookback straddles to track these funds'
11

Fung, William, and David A Hsieh, (2001), The Risk in Hedge Fund Strategies:
Theory and Evidence from Trend Followers, Review of Financial Studies, Vol 14,
No 2, 313-342.

performances. Research in hedge fund strategies has been especially
voluminous in recent years.
Market timing refers to the ability of fund managers to shift investment funds into the market portfolio when the market is rising, and to shift out of the stock market into money assets or safe Treasury bonds when the market is falling, particularly if the market falls below the riskfree return. If a particular fund can perform in this way, then its returns profile over time will look as follows; see Figure 4.4.
Figure 4.4
[Figure: excess fund return plotted against excess market return for a market-timing fund; the fund switches to the riskfree asset when the market falls below rf.]

Note the nonlinear profile of the return realizations over time. Suppose we represent the above on a different set of axes by squaring the X variable; see Figure 4.5. It can be seen that the existence of market timing ability in a fund portfolio will show up as a slope when we regress excess fund return on the square of excess market return, as Figure 4.5 indicates. If there is no market timing ability, there will be about as many points in the negative fourth quadrant, and the slope of a fitted line will be flat or close to zero. This idea was first proposed by Treynor and Mazuy (1966)12. Many important subsequent studies include
12
Treynor, J. L., and Kay Mazuy, (1966), Can Mutual Funds Outguess the Market? Harvard Business Review, Vol 44.

Merton (1981)13. If we employ a multiple linear regression using another explanatory variable which is the square of the market excess return,
$$r_{it} - r_{ft} = a_i + b_i\,(r_{mt} - r_{ft}) + c_i\,(r_{mt} - r_{ft})^2 + e_{it}$$
where eit is independent of rmt, then market timing ability in a fund will show up as a significantly positive estimate of ci. Unfortunately, many mutual funds that were studied did not display such market timing abilities.
Figure 4.5
[Figure: excess fund return plotted against (excess market return)²; the fund switches to the riskfree asset when the market falls below rf.]
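A minimal sketch of this timing regression (hypothetical fund, market and riskfree series) is:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical monthly fund, market and riskfree returns over 60 months.
    rng = np.random.default_rng(5)
    rm = rng.normal(0.008, 0.045, 60)
    rf = np.full(60, 0.003)
    r_fund = rf + 0.9 * (rm - rf) + 1.5 * (rm - rf) ** 2 + rng.normal(0, 0.01, 60)

    excess_mkt = rm - rf
    X = sm.add_constant(np.column_stack([excess_mkt, excess_mkt ** 2]))
    res = sm.OLS(r_fund - rf, X).fit()

    a_hat, b_hat, c_hat = res.params
    print(res.summary())   # a significantly positive c_hat suggests market timing ability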

4.6

PROBLEM SETS

4.1 Let Ri be the average return over a particular month of portfolio i comprising stocks with betas that are largely similar. Let βi be the beta of portfolio i. In a cross-sectional linear regression of
Ri = a + b βi + ei
where there are 100 observations per variable, ei is assumed to be homoskedastic and normally distributed with mean zero. The OLS estimates are
$$\hat{a} = 0.004, \quad \hat{b} = 0.004, \quad \sum_{i=1}^{100}\hat{e}_i^2 = 0.00245, \quad X^T X = \begin{pmatrix} 100 & 120 \\ 120 & 160 \end{pmatrix}.$$
(i) Find the t-statistics of the OLS estimators â and b̂ under the nulls H0: a = 0, HA: a ≠ 0; and H0: b = 0, HA: b ≠ 0.
(ii) What is the average return of all the portfolio average returns Ri's?
(iii) According to the CAPM, how would you interpret â = 0.004 and b̂ = 0.004 to be?
13
Merton, R.C., (1981), On Market Timing and Investment Performance, I: An Equilibrium Theory of Value for Market Forecasts, Journal of Business, Vol 54, 363-406.
4.2 A researcher runs the following regression of stock i returns rit on market portfolio returns rmt:
rit = a + b rmt + eit
where eit is a residual noise that is i.i.d. and independent of rmt.
(i) Is eit independent of rit?
(ii) He performs an OLS regression and obtains OLS estimates of a and b. He takes the estimate of b as proportionate to the systematic risk of stock i, and the estimate of a as Jensen's alpha. Comment on whether he is estimating these appropriately.
(iii) The researcher then selects all stocks with a positive estimated a and forms a portfolio. Is this portfolio likely to outperform the market index on average? (You may assume the CAPM is true.)
4.3 Suppose an investor decides to short a stock selling at price Pt, and to
buy it back at Pt+1. He can place the shortsale proceeds with the broker
and earn riskfree interest rate r. However, he also has to pay interest
rate r for borrowing the stocks that he has shorted. How may we
compute a rate of return to this shortsale?
4.4 Is it theoretically possible to find an asset with expected return smaller
than that of the risk-free or riskless return?
4.5 If gold prices are very high relative to market index values even when the latter fall, what would you conclude about gold's beta?

FURTHER RECOMMENDED READINGS
[1] Bodie Z., Alex Kane, and A.J. Marcus, (1999), Investments, 4th edition,
Irwin McGraw Hill.
[2] Merton, R.C., (1981), On Market Timing and Investment Performance, I:
An Equilibrium Theory of Value for Market Forecasts, Journal of
Business, Vol 54, 363-406.
[3] Sharpe, William, (1964), Capital Asset Prices: A Theory of Market
Equilibrium under Conditions of Risk, Journal of Finance, 19, 425-442.


Chapter 5
CONSTRAINED REGRESSION
APPLICATION: COST OF CAPITAL
Key Points of Learning
Constrained regression, Net present value analysis, Internal rate of return,
Capital budgeting, Weighted average cost of capital, Levered beta, Levered
cost of equity, Dividend growth model, Residual income valuation model,
Earnings forecast, Security market line, Capital market line, Excess market
return, Fair rates of return to Utilities

In this chapter, we extend the usage of the CAPM to the problem of estimating the cost of capital in funding risky projects. This involves the estimation of betas, which we showed in the last chapter, as well as the estimation of the market risk premium. The latter is a bit more tricky and sometimes requires auxiliary regressions in which the intercept is constrained to be zero, i.e. regression through the origin. Although constrained regression is more general and can apply to constraints on any set of coefficients in a linear regression equation, we consider only the case of regression through the origin here.
5.1

REGRESSION THROUGH THE ORIGIN


[Figure: scatter of sample points (X1, Y1), (X2, Y2), (X3, Y3), … in the (X, Y) plane, with the constrained least squares line L through the origin.]
We reconsider the sample observations of the X and Y variables in Figure 3.5 of Chapter 3. Instead of the OLS line without constraint (which is now shown as the dotted line), we seek the least squares line that must pass through the origin 0. This line is the bold line L as seen on the graph.
The linear (bivariate) regression model is
$$Y_i = b\,X_i + e_i, \qquad i = 1, 2, \ldots, N \qquad (5.1)$$
for a sample of size N, where b is the constant slope and there is zero intercept. As usual, Yi is the dependent variable and Xi is the explanatory variable. ei is the residual noise. As in Chapter 3, we assume conditions (A1), (A2), (A3), (A4), and (A5) hold.
Employing the same least squares criterion,
$$\min_{b}\ \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N}\left(Y_i - b\,X_i\right)^2$$
yields the following first order condition (FOC):
$$\frac{\partial \sum_{i=1}^{N}\hat{e}_i^2}{\partial b} = -2\sum_{i=1}^{N} X_i\left(Y_i - \hat{b}\,X_i\right) = 0. \qquad (5.2)$$
Solving,
$$\hat{b} = \frac{\sum_{i=1}^{N} X_i Y_i}{\sum_{i=1}^{N} X_i^2}. \qquad (5.3)$$

Note that in the constrained OLS estimator $\hat{b}$, the numerator and denominator arguments contain the values of Xi and Yi, and not their deviations from the sample mean as in the unconstrained OLS case. The estimator $\hat{b}$ is similarly a linear function of the Yi's.
Now, for the finite sample properties of the constrained OLS estimator:
$$\hat{b} = \frac{\sum_{i=1}^{N} X_i\,(b\,X_i + e_i)}{\sum_{i=1}^{N} X_i^2} = b + \frac{\sum_{i=1}^{N} X_i e_i}{\sum_{i=1}^{N} X_i^2}.$$
$\hat{b}$ is a random variable and clearly $E(\hat{b}) = b$; hence $\hat{b}$ is unbiased. Now,
$$\mathrm{var}(\hat{b}) = E(\hat{b} - b)^2 = E\left[\frac{\sum_{i=1}^{N} X_i e_i}{\sum_{i=1}^{N} X_i^2}\right]^2 = \frac{\sigma^2}{\sum_{i=1}^{N} X_i^2}.$$

From equation (5.2), it is seen that if $\hat{e}_i = Y_i - \hat{b}\,X_i$ is the fitted residual, then $\sum_{i=1}^{N}\hat{e}_i X_i = 0$. This is the sampling moment corresponding to the population moment E[ei Xi] = 0, or essentially the zero correlation condition between ei and Xi.
However, unlike the unconstrained OLS case, here the sampling average $\frac{1}{N}\sum_{i=1}^{N}\hat{e}_i$ is not necessarily 0. This can be seen by taking the sum over $\hat{e}_i = Y_i - \hat{b}\,X_i$. If $\sum_{i=1}^{N}\hat{e}_i = 0$, then it is necessary that
$$\hat{b} = \frac{\sum_{i=1}^{N} Y_i}{\sum_{i=1}^{N} X_i},$$
which is of course not true in general.
What is the probability distribution of $\hat{b}$? Using (A5), since $\hat{b}$ is a linear combination of the ei's that are normally distributed, $\hat{b}$ is also normally distributed:
$$\hat{b} \sim N\!\left(b,\ \frac{\sigma^2}{\sum_{i=1}^{N} X_i^2}\right).$$
For the constrained regression, the coefficient of determination R² can be negative at times.
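A minimal sketch of a regression through the origin (hypothetical data; in statsmodels the constraint simply means not adding a constant) is:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical sample of (X, Y) observations.
    rng = np.random.default_rng(6)
    X = rng.uniform(1, 10, 50)
    Y = 0.8 * X + rng.normal(0, 1, 50)

    # Constrained OLS slope from (5.3): b_hat = sum(X*Y) / sum(X^2)
    b_hat = (X * Y).sum() / (X ** 2).sum()

    # The same regression through the origin via statsmodels (no constant added)
    res = sm.OLS(Y, X).fit()
    print(b_hat, res.params[0])   # the two estimates coincide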
5.2

CAPITAL PROJECTS

One of the most important applications of financial valuation and empirical


estimation is to find the cost of capital. The cost of capital is an essential
concept in finance. In business, new projects such as building of a production
plant, setting up a joint venture, leasing an office space, project financing, etc.
all require an analysis of the expected incoming cashflows or revenues, and
the outgoing cashflows or costs. In addition, an initial outlay or capital outlay
is typically required. This initial outlay and also subsequent injections of
additional capital are funded through borrowing at some interest costs or
through plough-back of retained earnings. Even if self-financed capital is
utilized, there is opportunity cost. Measuring this cost of capital is crucial in
comparing with the benefits of the returns, so as to come to a decision whether

the project is profitable or not, or at all financially feasible. This is the area of
capital budgeting.
The central idea in finance when it comes to capital budgeting and project
evaluation is that the cost of capital of a project must be commensurate with
its market risk. As we have seen earlier, the market will require higher
expected return for higher risk. Therefore if we are to fund our project with
market capital, the market will require a certain expected return based on the
risk of our project. This expected return is equivalent to our cost of capital.
For example, if the market requires 10% p.a. return for 10 years of loan or
debt on borrowing to finance a project, then we have to apply 10% p.a. as our
cost of capital. This 10% rate is then used as a discount rate to compute the
Net Present Value (NPV) of the project. Obviously, only projects with
positive NPVs are financially feasible. And given capital budgets or
constraints, only those projects with the highest NPVs are selected. Thus,
estimating cost of capital is closely connected to (1) whether it is financially
justifiable to fund a particular capital project, and (2) which capital project
should receive priority in funding with limited capital available (capital
budgeting).
Many fallacies were committed with respect to (1) and (2). For example, a large firm may be able to borrow funds from a bank based on a general credit facility at a prime rate of 10% p.a. If a division in the firm proposed a project with a market risk-adjusted (or risk-assessed) cost of 15% p.a., should it be funded based on the firm's 10% cost? You may suppose the NPV at 15% cost is negative, but the NPV at 10% cost is positive. If in another firm, division A has projects with an internal rate of return (IRR) of 12% p.a. while division B has projects with an IRR of only 8% p.a., should all the firm's borrowing capacity at 10% p.a. cost be allocated to fund only division A's projects?
These are deceptively simple questions. The answers are negative. In the first case, the firm's credit line at 10% cost is based on the firm's weighted average cost of capital (WACC) and its weighted average risks of all existing projects, and possibly new projects with similar average risks. Should the firm decide to fund a new marginal project at 15% risk-adjusted cost with a negative NPV, its firm value will decrease, and its future average cost will rise from the current 10%. Thus, a firm should only fund a project with positive NPV using the project's own risk-adjusted cost, and not the firm's lower overall WACC.
In the second case, it should again be noted that the firm's overall borrowing cost of 10% p.a. is likely an average of the costs of capital in its operating and ongoing divisions. Also note that the IRR does not reflect anything about the risk of the projects in a division. IRR is simply the discount rate that equates a project's present value of cash inflows with its present value of cash outflows. Some of division A's projects may be highly risky with risk-

adjusted costs of over 10%. On the other hand, some of division B's projects may have risk-adjusted costs of less than 10%. The overall funding should be allocated to both divisions' projects with risk-adjusted costs lower than their IRRs (here we assume there is no problem of cashflow ambiguity or multiple solutions in the computation of the IRRs), i.e. with positive NPVs.
5.3

NET PRESENT VALUE ANALYSIS

Net present value (NPV) analysis is the key method to assess the worth of a
new capital project proposal. It could be a replacement plant or a new
investment in an ongoing firm, or it could be in project financing where
oftentimes the project will have a terminal time when it would be sold off.
In NPV analysis, there are two key inputs: Expected cash flows and the
risk-adjusted discount rate(s) over the future horizon of the cashflows. The
steps in the analysis may be summarized as follows.
(a) Forecast the nominal cash flows.
Sometimes this may require the forecast of inflation rates. The inflation
rates are utilized to estimate subsequent years' nominal revenues by
multiplying present revenue by (1 + expected inflation rate), if no real
growth is anticipated. If real growth is expected, then this must be used to
gross up the nominal revenues.
(b) Ascertain the currency of the cashflows and anticipate the risks that come
with currency conversions. This foreign exchange risk can be hedged by
using currency derivatives.
(c) Ascertain the cost of capital or risk-adjusted discount rate r.
(d) Ascertain the initial capital outlay.
For example, a risky stream of T years of annual future (nominal) cashflows has a PV computed from its expected value C discounted at the nominal per annum risk-adjusted discount rate r. This annuity of $C from t = 1 till t = T has PV = $ C/(1+r) + C/(1+r)² + … + C/(1+r)^T = (C/r) [1 − 1/(1+r)^T], using the summation of a geometric progression. If the cashflow is grossed up by an inflation component i, then the same risky stream of T years of annual future (nominal) cash flows has a PV = $ C(1+i)/(1+r) + C(1+i)²/(1+r)² + … + C(1+i)^T/(1+r)^T = [C(1+i)/(r−i)] [1 − (1+i)^T/(1+r)^T]. One can also add a growth rate g to i.
Suppose for the annuity we use the certainty equivalent cash flow X < C. Then the risk of the project is not reflected in the cash flows, but should be reflected by discounting at the lower risk-free rate k. Thus, PV = (X/k) [1 − 1/(1+k)^T] = (C/r) [1 − 1/(1+r)^T]. Therefore, the certainty equivalent X is related to C as follows: X = C (k/r) [1 − 1/(1+r)^T] / [1 − 1/(1+k)^T]. The NPV of the

project is then the PV less the initial capital outlay, or else less the present
value of capital outlays.
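A minimal sketch of these present value formulas (the cashflow, rate, horizon and outlay figures below are hypothetical) is:

    def annuity_pv(C, r, T):
        """PV of a T-year annuity of expected cashflow C at risk-adjusted rate r."""
        return (C / r) * (1 - 1 / (1 + r) ** T)

    def growing_annuity_pv(C, r, i, T):
        """PV when each cashflow is grossed up by an inflation (or growth) rate i."""
        return (C * (1 + i) / (r - i)) * (1 - ((1 + i) / (1 + r)) ** T)

    # Hypothetical example: C = 100 per year for T = 10 years, r = 10%, i = 3%
    pv = annuity_pv(100, 0.10, 10)
    pv_inflated = growing_annuity_pv(100, 0.10, 0.03, 10)
    npv = pv - 500                 # less a hypothetical initial outlay of 500
    print(round(pv, 2), round(pv_inflated, 2), round(npv, 2))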
5.4

CAPITAL STRUCTURE

When we consider firm-level share valuation, the following steps can be performed. Let $X = EBIT, the expected earnings before interest and tax. The expected earnings after interest and tax charges are
EAIT = (X − rD D)(1-t)
where rD is the cost of debt, D is the debt level of the firm, and t is the firm's tax rate. This cashflow EAIT goes to equity holders, or to equity value E.
For a perpetual stream of cashflows, the cost of equity is
rE = EAIT/E = (X − rD D)(1-t)/E
or rE = [ X(1-t) − rD D(1-t) ] / E.     (5.4)
Then the firm's weighted average cost of capital, WACC, is defined as:
rA = X(1-t)/[D+E].
Let V = D + E, where V is the firm's total market value, assuming the firm has only debt and equity. Note that in this V = D + E equation, we do not consider any tax effect. Without any tax effect, the value of the firm should be the same whether it is leveraged or not, so we can write V = VU for the unlevered firm value VU, ceteris paribus. From (5.4),
rA = [rEE + rDD(1-t)]/[D+E]
or rA = rE E/[D+E] + rDD(1-t)/[D+E]
or rA = rE E/V + rD(1-t) D/V.

(5.5)

(5.5) shows clearly the weighting in the overall cost of capital by E/V and
D/V. For the debt part, it is the after-tax cost of debt rD(1-t). Note that t is
marginal tax rate (marginal to this project or incremental taxable revenues, so
that the entire analysis is to find the cost of the next funding dollar), and rE and
rD are fund costs payable to suppliers of equity capital and of debt or loans.
We assume competitive capital markets throughout.
Assuming perpetual flows, an unlevered firm value
VU = X(1-t)/rU

(5.6)

The total expected cashflow to equity and debt holders is X(1-t) + t rD D. Note that with corporate tax, interest payments are tax-deductible, and this accounts for the second term above, which is called a tax shield. The levered firm value, ceteris paribus, is
VL = VU + t D     (5.7)
From (5.6), X(1-t) = rU VU. Now, the expected cashflow to the levered firm L, ceteris paribus, is X(1-t) + t rD D = rU VU + t rD D. But in L, the total expected cashflow is also expressed as rE E + rD D.
Put rE E + rD D = rU VU + t rD D.
Then14,
rE = rU + [rU − rD] (1-t) D/E,     (5.8)
using VU = VL − tD = E + (1-t) D.
Now, we link this up with a result from about 8 years later in the CAPM world, which we saw in the previous chapter. Using (5.8), but instead treating the returns as random variables and not expected values, then since the beta of equity is bE = cov(rE, rM) / var(rM), we have
bE = bU + [bU − bD] (1-t) D/E     (5.9)
If we assume the cost of debt is constant or independent of D, i.e. cov(rD, rM) = 0, hence bD = 0, then
bE = bU [ 1 + (1-t) D/E ]     (5.10)

In (5.10), the left-hand side bE is the levered beta, or beta of stocks in a levered firm L, while bU is the unlevered beta, or beta of stocks in an otherwise similar but unlevered firm U. Assuming the firm's cashflows do not change with leverage alone, (5.10) allows the computation of the firm's levered cost of equity when the leverage changes. One way to do this is first to calibrate bU by having the values of bE for a particular D/E level. Once bU is obtained from (5.10), it is treated as a constant, and at a different debt-equity or leverage level D/E, the new equity beta is computed using bU {1 + (1-t) D/E}.
14

Equations (5.7) and (5.8) represent some of the ingenious results obtained in a series
of papers by Modigliani and Miller, including the path-breaking paper in corporate
finance:- Franco Modigliani and Merton Miller, (1958), The Cost of Capital,
Corporation Finance, and the Theory of Investment, American Economic Review,
June, 261-297. (5.8) is sometimes called MM Proposition II.

The initial bE could be obtained using a regression as follows:
ri − rf = a + bE (rM − rf) + e,
where ri is the return rate on the levered firm's equity, constraining the intercept to zero if the unconstrained version does not produce an intercept significantly different from 0.
As an illustration, suppose a levered firm with debt-equity ratio 0.1 and
marginal tax rate 40% produces an estimate of levered equity beta at 1.2. To
find a new beta supposing the firm intends to lever debt up to a ratio of 25%,
we can compute:
bU = 1.2/[1 + 0.6(0.1)] = 1.13
then new bE = 1.13 [1+0.6(0.25)] = 1.30.
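A minimal sketch of this unlever/relever computation, using the figures in the illustration, is:

    def unlever_beta(beta_levered, debt_equity, tax_rate):
        """Invert (5.10): b_U = b_E / [1 + (1 - t) D/E], assuming the debt beta is zero."""
        return beta_levered / (1 + (1 - tax_rate) * debt_equity)

    def relever_beta(beta_unlevered, debt_equity, tax_rate):
        """Apply (5.10): b_E = b_U [1 + (1 - t) D/E] at the new leverage level."""
        return beta_unlevered * (1 + (1 - tax_rate) * debt_equity)

    b_u = unlever_beta(1.2, 0.10, 0.40)        # about 1.13
    b_e_new = relever_beta(b_u, 0.25, 0.40)    # about 1.30
    print(round(b_u, 2), round(b_e_new, 2))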
Equations (5.5), (5.8), and (5.10) are useful in the computation of costs of capital for capital budgeting and project valuation purposes, especially in firms where there is leverage, provided that at some point either rU or rE is known. Even with the CAPM method, we still need to estimate the market premium, which is the expected excess market return for the period in which the cost is to be computed.
There are two main approaches to this issue. The first is to estimate the cost of equity (whether for levered or unlevered firms) directly by estimating the cashflows that accrue to the equity. The second is to employ the CAPM in its expected value form, and in that process necessarily estimate beta and the market risk premium. We shall now explain these approaches.
5.5

ESTIMATING COST OF CAPITAL

In what follows, we shall assume that market data are available, and that the market is efficient so that the CAPM holds. In a well-functioning market, a (listed) stock entitles its holder to a perpetual stream of after-tax15 dividends (random variables as at time now t = 0) starting next period t = 1, through to ∞. There is no terminal date unless the firm goes bankrupt. This infinite horizon is in contrast to project financing with a finite terminal date T as seen in the earlier sections. Let the expected values (based on the current date) of these future dividends be
D1, D2, D3, …, Dn, …
15

We ignore the small difference created by dividend tax rebates allowable for
investors with personal income tax lower than imputed tax. So, all investors face the
same tax rate on dividends and so would agree on the same discount formulation.

and the after-tax required rates of return (or risk-adjusted expected rates of return) on equity for each of the future dividends are
R1, R2, R3, …, Rn, …,
quoted on a per period basis. Then the price of the stock now, in a rational efficient market, is
$$P_0 = \frac{D_1}{1+R_1} + \frac{D_2}{(1+R_2)^2} + \frac{D_3}{(1+R_3)^3} + \cdots + \frac{D_n}{(1+R_n)^n} + \cdots = \sum_{t=1}^{\infty}\frac{D_t}{(1+R_t)^t}.$$
Suppose the dividend growth rate per period g is a known constant, so that
D1 = D0(1+g), D2 = D0(1+g)², D3 = D0(1+g)³, …, Dn = D0(1+g)^n, …,
and that the term structure of risky rates is flat, i.e. R1 = R2 = R3 = … = Rn = … = R.
Then,
$$P_0 = \sum_{t=1}^{\infty}\frac{D_0(1+g)^t}{(1+R)^t} = \frac{D_0(1+g)}{R-g} = \frac{D_1}{R-g},$$
where R > g. $P_0 = \frac{D_1}{R-g}$ is called the Dividend Growth Model (or sometimes the Gordon Growth Model).
The Dividend Growth Model (DGM) also provides a formulation of the cost of equity as
$$R = \frac{D_1}{P_0} + g. \qquad (5.11)$$

We provide an illustration as follows. Suppose a firm j has a steady dividend history, seen in the following table, with each half year as one period.

Table 5.1
$Dividend Payout and Dividend Growth of A Firm

Period    1     2     3      4      5     6     7     8     9       10      11
$Div      0.02  0.02  0.025  0.025  0.03  0.03  0.03  0.03  0.035   0.04    0.04
Growth    n.a.  0%    25%    0%     20%   0%    0%    0%    16.67%  14.29%  0%

Period    12    13    14     15     16    17     18    19    20     21      Average
$Div      0.04  0.04  0.04   0.05   0.05  0.045  0.045 0.045 0.045  0.05    0.037
Growth    0%    0%    0%     25%    0%    -10%   0%    0%    0%     11.11%  5.10%

Thus, the annual % growth rate of dividends is projected at 2 × 5.10% or 10.2% p.a. If the current stock price is $5 and the last annual dividend is $0.045 + $0.05 = $0.095, then using (5.11):
Rj = (0.095 × 1.102) / 5 + 10.2% = 12.3% p.a.
Using the DGM, the cost of equity is estimated at 12.3% p.a.
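A minimal sketch of this DGM estimate, using the illustration's figures, is:

    def dgm_cost_of_equity(last_annual_dividend, price, growth):
        """Cost of equity from (5.11): R = D1/P0 + g, with D1 = D0 (1 + g)."""
        d1 = last_annual_dividend * (1 + growth)
        return d1 / price + growth

    g = 2 * 0.0510                 # semi-annual average growth, annualized
    r = dgm_cost_of_equity(0.095, 5.0, g)
    print(f"{r:.3%}")              # about 12.3% p.a.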
Note that sometimes DGM cannot be used because the stock has no steady
dividend record, so the estimation of dividend growth is not reliable. Or in a
volatile market, current stock price can change dramatically by the minute, so
that the expected dividend yield component (D1/P0) in (5.11) is unreliable
because of the highly variable denominator.
The DGM method is not the only method in the approach using direct
estimation of cashflows and discount rate. The Residual Income Valuation
model16 uses direct estimation or forecasts of a firm's earnings instead of dividends.

$$P_0 = BV_0 + \sum_{t=1}^{\infty}\frac{E\left[Y_t - r\,BV_{t-1}\right]}{(1+r)^t} \qquad (5.12)$$
where BVt is the book value of equity at time t from now, Yt is the earnings forecast (some averaging of the many earnings forecasts on active stocks made by analysts in the finance industry), and r is the risk-adjusted discount rate on equity earnings and also the charge on the book value of equity. The earnings are abnormal in the sense that they meet the condition of being at least equal to or more than the capital charge r BVt-1. It can be shown that this model is equivalent to the DGM when BV0 is related to some discounted value of the excess of Yt over the dividend issued at t.
One advantage of (5.12) over (5.11) or other forms of dividend growth
model is that earnings forecasts are much more readily available from the
industry.17 They are intuitively more accurate than dividend forecasts since
expected dividends are derived from forecast earnings, and are discretionary
issues after considering new investment requirements of the firm and retained
earnings plough-back. The latter are much harder to forecast.18

16

See for example Ohlson, J., (1995), Earnings, book values, and dividends in equity
valuation, Contemporary Accounting Research, 11, 661-687; and also Feltham G.,
and J. Ohlson, (1995), Valuation and clean surplus accounting for operating and
financial activities, Contemporary Accounting Research, 11, 689-731.
17
See for example the data provider www.ibes.com or the I/B/E/S data.
18
There remain other issues. For a detailed discussion and further references, see
Kothari, S.P., (2001), Capital markets research in accounting, Journal of
Accounting and Economics, 31, 105-231.

5.6

SECURITY MARKET LINE APPROACH

The Security Market Line (SML) approach is based on the CAPM. Equation (5.13) below is the SML, and is typically represented as a line in a graph of expected return versus beta. It is oftentimes confused with the CAPM, which is not only the result in (5.13) but also the model, including the underlying assumptions and equilibrium conditions, that produces (5.13).
$$R_j \equiv E(r_j) = r_f + \beta_j\,\{E(r_m) - r_f\} \qquad (5.13)$$
Suppose we use a recent past history of excess stock return rates (we suggest using the last 5 years or 60 months of returns data on the stock, the market, and the riskfree rate) versus excess market return rates to run OLS without intercept, i.e.
$$r_{jt} - r_{ft} = \beta_j\,\{r_{mt} - r_{ft}\} + e_{jt}.$$
This is to estimate the beta j of stock j. Suppose we use

1
rf T+1 + j
rmt rft
T t 1

to estimate the required rate of return presently at T+1 where the sample
history of returns [1,T] were used. During certain historical sampling period
T

within, for example, 2000 to 2002, it may be that

1 T

T rmt rft < 0


t 1

because the market was falling, and therefore this does not suitably represent
the expected market risk premium E(r m T+1) rf T+1 that must be positive.
Ferson and Locke (1998)19 also pointed out in their study that estimating this
component of market risk premium in the CAPM or SML setup produces the
most critical errors, while errors due to beta estimation and even riskfree rate
assumption are minor. In some industry practice, the market risk premium,
E(rm) rf , is estimated at t using a long history of past realized premium that
averaged out to give more stable positive estimates.20 We can also approach
the estimation of market risk premium in an interesting econometric
framework.
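A minimal Python sketch of this step, using simulated monthly data in place of actual stock and market returns (the series, the true beta of 1.2, and the sample length are hypothetical assumptions), is as follows.

```python
import numpy as np

# Hypothetical monthly data for illustration: 60 months of stock, market, and riskfree rates
rng = np.random.default_rng(0)
rf = np.full(60, 0.003)                                       # riskfree rate, 0.3% per month
rm = 0.006 + 0.04 * rng.standard_normal(60)                   # market return
rj = rf + 1.2 * (rm - rf) + 0.05 * rng.standard_normal(60)    # stock with true beta 1.2

x = rm - rf                                                   # excess market return
y = rj - rf                                                   # excess stock return

# OLS through the origin: beta_hat = sum(x*y) / sum(x*x)
beta_hat = x @ y / (x @ x)

# Naive required return at T+1: riskfree rate plus beta times the average realized premium
required = rf[-1] + beta_hat * x.mean()
print(f"beta_hat = {beta_hat:.3f}, required return = {required:.4%} per month")
```

If the averaged realized premium x.mean() happens to be negative over the sample, the naive estimate breaks down, which is exactly the problem the auxiliary regression below is designed to avoid.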

19 Ferson, W.E., and D.H. Locke, (1998), Estimating the Cost of Capital through Time: An Analysis of Sources of Error, Management Science, Vol 44, 4, 485-500.
20 Market risk premium estimates over long horizons were provided, for example, in Ibbotson Associates, Stocks, Bonds, Bills and Inflation: 1987 Yearbook, Ibbotson Associates Inc., Chicago.

To estimate a positive market risk premium, we employ an auxiliary regression as follows. First we look at the Capital Market Line (CML) also implied by the CAPM.

The CML is the line containing all possible portfolios of investors, each portfolio being a simple linear combination of two assets: the market portfolio and the riskfree asset. Any portfolio P on the CML or portfolio efficient frontier with expected return E(r_p) and risk σ_p satisfies the following equation:

$$\frac{E(r_m) - r_f}{\sigma_m} = \frac{E(r_p) - r_f}{\sigma_p}\,.$$

The right-hand side is basically the Sharpe index or ratio, a reward-to-risk ratio: the higher the number, the better the performance of the portfolio. In CAPM theory, of course, all portfolios on the CML share the same reward-to-risk ratio (call it λ) in equilibrium.
Figure 5.1  The Capital Market Line (expected excess return E(r_p) − r_f plotted against risk σ_p; the CML starts at r_f and passes through the market portfolio at (σ_m, E(r_m) − r_f))

Putting in the time subscript to denote the relationship at each time point t, we have

$$\frac{E(r_{mt}) - r_{ft}}{\sigma_{mt}} = \lambda > 0\,,$$

assuming the market Sharpe ratio λ is a constant over time. So,

$$E(r_{mt}) - r_{ft} = \lambda\,\sigma_{mt}\,.$$   (5.14)

Therefore, r_{mt} − r_{ft} = λσ_{mt} + u_t, where u_t is assumed to be n.i.d. with zero mean. Now, r_{mt} and r_{ft} are the realized market return at t and the realized riskfree rate at t respectively. For the purpose of this study, they are monthly rates. σ_{mt} > 0 here is allowed to vary over time, so the left-side excess expected market return in (5.14) is effectively a conditional expected return. Merton (1980)21 also suggested that the excess market return at time t may be estimated by

$$E(r_{mt}) - r_{ft} = \lambda\,\sigma^2_{mt}$$   (5.15)

where λ is a positive constant now denoting relative risk aversion, i.e. a larger λ implies requirement of a higher risk premium against the total market risk σ²_{mt} at time t.

The advantage of both of the above specifications, (5.14) and (5.15), of the market risk premium is that the OLS estimate of the expected market premium will not be negative. In either of the above specifications, σ²_{mt} is practically not observable, and so is estimated as follows according to Merton (1980):

$$\hat\sigma^2_{mt} = \frac{1}{12}\left[\sum_{k=1}^{6}\left(\ln R_{m(t-k)}\right)^2 + \sum_{k=1}^{6}\left(\ln R_{m(t+k)}\right)^2\right]$$

where R_{mt} = 1 + r_{mt}. This measure obviously contains measurement error.

Sometimes we may have to adjust this formula. For example, if we are estimating σ²_{mT} at the current time T, then there are no future observations T+1, T+2, ... yet. In this case we may do with just the lagged information part. We may also include the current term, thus

$$\hat\sigma^2_{mt} = \frac{1}{13}\sum_{k=-6}^{6}\left(\ln R_{m(t-k)}\right)^2\,.$$

Now we can estimate λ using OLS linear regression through the origin in

$$r_{mt} - r_{ft} = \lambda\,\sigma^p_{mt} + u_t$$

(p = 1 or 2 depending on which of the above versions is used), where u_t is assumed to be n.i.d., and u_t is not correlated with σ^p_{mt}.

If in the above regression the OLS estimate λ̂ is not positive, then we need to employ an alternative approximate regression to ensure that the estimate λ̂ is positive:

$$\left(r_{mt} - r_{ft}\right)^2 = \lambda^2\,\sigma^2_{mt} + \epsilon_t\,.$$

21 Merton, Robert, (1980), Estimating the expected return on the market: An exploratory investigation, Journal of Financial Economics, Vol 8, 323-361.

The estimate λ̂² is obtained, hence λ̂ > 0. Finally, the market risk premium or the expected excess market return is computed as

$$E(r_{mt}) - r_{ft} = \hat\lambda\,\hat\sigma^p_{mt}\,,$$

from which we estimate the cost of equity or required rate of return to stock/firm j at time t as

$$E(r_{jt}) = r_{ft} + \hat\beta_j\left[E(r_{mt}) - r_{ft}\right] = r_{ft} + \hat\beta_j\,\hat\lambda\,\hat\sigma^p_{mt}$$

where β̂_j is obtained from the regression mentioned earlier:

$$r_{jt} - r_{ft} = \beta_j\left(r_{mt} - r_{ft}\right) + e_{jt}\,.$$
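The following Python sketch puts these steps together for the p = 1 specification; the market series, the two-sided variance window, and the assumed beta of 1.2 are all hypothetical inputs for illustration, not from the text.

```python
import numpy as np

# Hypothetical monthly market and riskfree data for illustration
rng = np.random.default_rng(1)
T = 120
rf = np.full(T, 0.003)
rm = 0.006 + 0.05 * rng.standard_normal(T)
log_rm = np.log(1.0 + rm)                    # ln R_mt, with R_mt = 1 + r_mt

# sigma^2_mt estimated from 6 lagged and 6 lead squared monthly log returns
sigma2 = np.full(T, np.nan)
for t in range(6, T - 6):
    window = np.r_[log_rm[t - 6:t], log_rm[t + 1:t + 7]]
    sigma2[t] = np.mean(window ** 2)

mask = ~np.isnan(sigma2)
y = (rm - rf)[mask]                          # realized excess market return
s = np.sqrt(sigma2[mask])                    # sigma_mt, for the p = 1 specification (5.14)

lam = s @ y / (s @ s)                        # OLS through the origin: excess return on sigma_mt
if lam <= 0:
    # alternative regression (r_mt - r_ft)^2 = lambda^2 * sigma^2_mt + error, then take sqrt
    lam = np.sqrt((y ** 2) @ (s ** 2) / ((s ** 2) @ (s ** 2)))

premium = lam * s[-1]                        # market risk premium at the last usable month
beta_j = 1.2                                 # assumed beta from the earlier excess-return regression
print(f"lambda = {lam:.3f}, premium = {premium:.4%}, cost of equity = {rf[-1] + beta_j * premium:.4%}")
```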


Suppose we estimate the required rates for every past month t = 1, 2, 3, ..., T up to the current time T. To project into the next 5 years or 60 months, we can either take an average of the past required rates or fit some ARIMA process for forecasting. The subject of ARIMA will be discussed in a later chapter.

Next, estimate firm j's cost of debt if it is leveraged. If we are trying to estimate the firm's overall cost of capital over a horizon of, say, 5 years, then we look at the cost of its debt over a 5-year term. (Note that a different term will produce a different cost number; the market will usually require a higher interest cost for a longer term loan.) For simplicity, assume a 5-year outstanding bond, and measure the yield-to-maturity for this 5-year bond, e.g. Y_{j0}. Firm j's weighted average cost of capital is

WACC = (E/V) R_{j0} + (D/V) Y_{j0} (1 − t_C)

where total firm market value V = firm market equity value E + firm market debt value D, t_C is the corporate tax rate, and R_{j0} and Y_{j0} are appropriately expressed in terms of monthly or yearly rates depending on the cashflow intervals used for discounting.
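A minimal sketch of the WACC computation, with hypothetical capital structure, cost, and tax figures chosen purely for illustration:

```python
# A minimal WACC sketch; the inputs are hypothetical, not from the text
E, D = 800.0, 200.0          # market values of equity and debt ($m)
V = E + D
r_equity = 0.123             # cost of equity R_j0 (e.g. from the DGM or SML estimates above)
y_debt = 0.06                # 5-year yield-to-maturity on the firm's debt, Y_j0
tax = 0.17                   # corporate tax rate t_C

wacc = (E / V) * r_equity + (D / V) * y_debt * (1 - tax)
print(f"WACC = {wacc:.2%} p.a.")
```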
So far, we have employed CAPM to compute the cost of equity of a
presumably listed firm with liquidly traded stocks and efficient market prices.
We also computed the WACC of the firm. How is this cost to be utilized? We
have seen at the introduction to the chapter how not to use this cost incorrectly
on the firms division projects when they carry different systematic risks
compared to the firms overall systematic risk. The cost calculated above is
useful as a hurdle rate to discount expected cashflows over the next 5 years for
a 5-year project with a systematic risk that is similar to the overall firms risk,
and that will be financed based on similar debt-equity ratio of the firm.
What happens if the project carries a different risk from that of the firm? If
the firm has a hurdle rate of say 15% p.a., but the project has expected ROE of
10% but risk-adjusted rate of 8%, do we still embark on the project? Yes. If

the project has expected ROE of 8% but risk-adjusted rate of 10%, do we still
embark on the project? No. It is all about positive NPV as we had discussed
earlier.
5.7 FAIR RATES TO UTILITY

The CAPM methodology has been extensively used since mid-1960s, and
especially prior to 1980s, to investigate the cost of capital to investments (or
equivalently required rate of return) in regulated industries in the U.S. such as
electric power, natural gas, insurance, utilities, telecommunications, and so
on.22 The method has also been applied, sometimes with slight variation, to
investigating required returns to potential investments in new technology or
continuing investments in industry. Baldwin, Tribendis, and Clark (1984)23
studied this problem of the cost of continuing investments in the U.S. steelmaking industry. This was of course an important question when there was so
much cost competition from overseas production at that time.
As an example of the important application of finance, suppose a utility
firm has a rate base or book equity value of $100m. It supplies an expected
10m units of power. The allowed fair rate of return (or cost of capital) is 5%
p.a. A fair rate is one where return should be commensurate with the
(systematic) risks in the market, but also sufficient to maintain the credit
standing of the firm, and also to be able to attract fresh capital. These are
usually the stipulations of court when consumers or consumer association tries
to sue utilities to bring down rates. The last 2 conditions are more difficult and
subjective in the rate fixing process.
If expected costs are $30m per year, including depreciation and taxes (so
as to maintain firms steady state), then it should set per unit power price at
($30m + 0.05 × $100m) / 10m = $3.50.
Thus, one can see that the rate of $3.50 that consumers pay is very much
determined by the allowed fair rate of return of 5% p.a. Much of the finance
theory and econometrics in this chapter could be put to good use in
determining scientifically this fair rate.
22 See Myers, Stewart C., (1972), The Application of Finance Theory to Public Utility Rate Cases, Bell Journal of Economics and Management Science, Spring, 58-97.
23 Baldwin, C.Y., J.J. Tribendis, and J.P. Clark, (1984), The evolution of market risk in the U.S. steel industry and implications for required rates of return, The Journal of Industrial Economics, September, 73-98.

5.8 PROBLEM SET

5.1 Why is it necessary either for the government or the courts to set rates or prices per unit of power or water for utilities, but not necessary for setting prices charged on goods sold by private companies in general?
5.2 According to the DGM, what would a high P/E ratio imply about the stock's future earnings prospects?
5.3 Suppose a steady firm has $100 million market value of assets paid for
by 10 million shares of equity funding, and in steady state the assets are
expected to generate earnings of $10 million a year forever, what is the
equilibrium market price per share if the cost of equity is 5% p.a.?
5.4 Suppose the same firm as in Q5.3 decides to retain 60% of its earnings
every year as internal financing, but could plough back this retained
earnings into new investments that provide only return of 5% p.a., what
would be its new price per share? Explain why. (Ignore tax, potential
bankruptcy, and temporary business cycles.)
5.5 Suppose for a firm with a constant dividend growth rate of 3% p.a., we
estimated the cost of equity to be 8% p.a., what is its fair share price if
the current dividend is $0.50 per share?
5.6 Comment if the SML model is consistent with DGM, and what if any, are
their key differences?
FURTHER RECOMMENDED READINGS
[1] Grinblatt, M., and Sheridan Titman, (2004), Financial Markets and
Corporate Strategy, McGraw-Hill.
[2] Jensen, M., and W. Meckling, (1976), Theory of the Firm: Managerial Behavior, Agency Costs, and Ownership Structure, Journal of Financial Economics, October, 305-360.
[3] Miller, Merton, (1977), Debt and Taxes, Journal of Finance, 32, 261-275.
[4] Myers, Stewart C., (1984), The Capital Structure Puzzle, Journal of
Finance, 39, 575-592.


Chapter 6
TIME SERIES ANALYSIS
APPLICATION: INFLATION FORECASTING
Key Points of Learning
Time series model, White noise, Autoregressive process, Moving average
process, ARMA process, Autocovariance function, Autocorrelation function,
Autocorrelogram, Box-Pierce Q-statistic, Ljung-Box test, Backward shift
operator, Invertibility, Yule-Walker equations, Partial autocorrelation
function, ARIMA, Exponential-weighted moving average, De-seasonalization,
Gross domestic product, Inflation, Out-of-sample forecast, Fisher effect

In this chapter we study the Box-Jenkins approach to the modelling of


stochastic processes. It is a versatile and practically useful method for
modeling stationary processes as the fundamental building blocks of most
stochastic processes. The approach is systematic, involving a recipe of careful
steps to arrive at the appropriate process for modeling and forecasting
purposes. The steps involve identification of the process, making it stationary
or making it as some differencings of an underlying non-stationary process,
estimating the identified model, then validating it. We shall also study an
important application to the modeling of inflation rates in the economy.
6.1 STATIONARY PROCESSES

We have seen how {Y_t}, t = −T, ..., +T, is a stochastic process when each Y_t is a random variable. T is typically a very large number. For a stationary stochastic process, the mean24 and variance25 are constant at every t, and the covariance26 is either zero or a function only of the lag k, not of the time t. The realized data over time form a time series. Time series is a common data form in finance. Sometimes, the stochastic process producing the time series of realized data is also called the Time Series Model.

A basic building block of stationary processes is an independently identically distributed (i.i.d.) process {u_t}, where u_t is stationary and in addition is also independent of u_{t−k} and u_{t+k} for any k ≠ 0. Thus, pdf(u_t | u_{t−k}, u_{t+k}) = pdf(u_t), k ≠ 0. {u_t} is called a white noise. In the slightly weaker case where E(u_t) = 0, var(u_t) = constant σ_u², and cov(u_t, u_{t+k}) = 0 for any k ≠ 0, {u_t} is called a weakly stationary zero serial correlation process. Most covariance-stationary processes can be formed from linear combinations of white noises.

Since covariance-stationary processes have a constant variance at any time t, they do not display changing conditional variances. Some common examples of covariance-stationary processes can be modelled using the Box-Jenkins (1976) approach.

Consider the following stochastic processes to model how a random variable Y_t evolves through time. They are all constructed from the basic white noise process {u_t}, t = −∞, ..., +∞, where each u_t is i.i.d. with zero mean and variance σ_u².

24,25 These are the unconditional mean and variance, i.e. taking expectation without any other information available at t.
26 For time series, this is also called autocovariance or sometimes serial covariance.
(a) Autoregressive order one, AR(1), process:

Y_t = φY_{t−1} + u_t ,  φ ≠ 0   (6.1)

where Y_t depends on, or autoregresses on, its lag value Y_{t−1}.

(b) Moving Average order one, MA(1), process:

Y_t = u_t + θu_{t−1} ,  θ ≠ 0   (6.2)

where the residual is made up of a moving average of two white noises u_t and u_{t−1}.

(c) Autoregressive Moving Average order one-one, ARMA(1,1), process:

Y_t = φY_{t−1} + u_t + θu_{t−1} ,  φ ≠ 0, θ ≠ 0   (6.3)

where Y_t autoregresses on its first lag, and the residual is also a moving average.
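As a quick illustration of (6.1)-(6.3), the following Python sketch simulates one sample path of each process; the parameter values φ = 0.5 and θ = 0.3 and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
T, phi, theta = 500, 0.5, 0.3
u = rng.standard_normal(T + 1)          # white noise u_t, i.i.d. N(0, 1)

ar1 = np.zeros(T)                       # (6.1)  Y_t = phi*Y_{t-1} + u_t
ma1 = np.zeros(T)                       # (6.2)  Y_t = u_t + theta*u_{t-1}
arma11 = np.zeros(T)                    # (6.3)  Y_t = phi*Y_{t-1} + u_t + theta*u_{t-1}
for t in range(1, T):
    ar1[t] = phi * ar1[t - 1] + u[t]
    ma1[t] = u[t] + theta * u[t - 1]
    arma11[t] = phi * arma11[t - 1] + u[t] + theta * u[t - 1]

print("sample means:", ar1.mean().round(3), ma1.mean().round(3), arma11.mean().round(3))
```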
6.2 AUTOREGRESSIVE PROCESS

Consider the AR(1) process:

Y_t = δ + φY_{t−1} + u_t ,  (φ ≠ 0),  t = −T, ..., T   (6.4)

where {u_t} is i.i.d. with zero mean, i.e. E(u_t) = 0 for every t, and var(u_t) = σ_u². Note that this is a regression of a stationary random variable Y_t on its lag Y_{t−1}. It is important to recognize that since the process holds for t = −T, ..., T, the process is equivalent to a system of equations as follows.

Y_T = δ + φY_{T−1} + u_T
Y_{T−1} = δ + φY_{T−2} + u_{T−1}
⋮
Y_1 = δ + φY_0 + u_1
⋮
Y_{−T+1} = δ + φY_{−T} + u_{−T+1}

These equations are stochastic, not deterministic, as each equation contains a random variable u_t that is not observable.
By repeated substitution for Y_t in this AR(1) process, (6.4) becomes

$$Y_t = \delta + \phi\left(\delta + \phi Y_{t-2} + u_{t-1}\right) + u_t$$

or

$$Y_t = \delta(1+\phi) + \phi^2 Y_{t-2} + \left(u_t + \phi u_{t-1}\right) = \cdots = \delta\left(1 + \phi + \phi^2 + \cdots\right) + \left(u_t + \phi u_{t-1} + \phi^2 u_{t-2} + \cdots\right).$$

For each t,

$$E(Y_t) = \delta\left(1 + \phi + \phi^2 + \cdots\right) = \frac{\delta}{1-\phi} \quad \text{provided } |\phi| < 1\,.$$

Otherwise, if |φ| ≥ 1, either a finite mean does not exist or the mean is not constant for every t.

$$\mathrm{var}(Y_t) = \mathrm{var}\left(u_t + \phi u_{t-1} + \cdots + \phi^k u_{t-k} + \cdots\right) = \sigma_u^2\left(1 + \phi^2 + \phi^4 + \cdots\right) = \frac{\sigma_u^2}{1-\phi^2} \quad \text{provided } |\phi| < 1\,.$$

Otherwise, if |φ| ≥ 1, either a finite variance does not exist or the variance is not constant.

Autocovariance of Y_t and Y_{t−1}:

$$\mathrm{cov}(Y_t, Y_{t-1}) = \mathrm{cov}\left(\delta + \phi Y_{t-1} + u_t\,,\ Y_{t-1}\right) = \phi\,\mathrm{var}(Y_{t-1})$$

or

corr(Y_t, Y_{t−1}) = φ.  Also, corr(Y_t, Y_{t−k}) = φ^k ≡ ρ(k).

The autocorrelation coefficient at lag k, ρ(k), is obtained by dividing the autocovariance at lag k of Y_t and Y_{t−k}, γ(k), by the variance of Y_t. Hence we see that the AR(1) process Y_t is covariance-stationary with constant mean δ/(1 − φ), constant variance σ_u²/(1 − φ²), and autocorrelation at lag k, ρ(k) = φ^{|k|}, a function of k only, provided |φ| < 1.


As a numerical example, suppose

Y_t = 2.5 + 0.5Y_{t−1} + u_t ,   (6.5)

where Y_{t−1} and u_t are stationary, normally distributed, and not correlated. It is given that E(u_t) = 0 and var(u_t) = 3. Stationarity implies E(Y_t) = E(Y_{t−1}) = μ_Y and var(Y_t) = var(Y_{t−1}) = σ_Y². If we take the unconditional expectation of (6.5),

E(Y_t) = 2.5 + 0.5E(Y_{t−1}) + E(u_t).

So μ_Y = 2.5 + 0.5μ_Y, and then μ_Y = 2.5/(1 − 0.5) = 5.

If we take the unconditional variance of (6.5),

var(Y_t) = 0.5² var(Y_{t−1}) + var(u_t).

So σ_Y² = 0.25σ_Y² + 3. Then σ_Y² = 3/(1 − 0.25) = 4.

The first order autocovariance (or autocovariance at lag 1) is

cov(Y_t, Y_{t−1}) = cov(2.5 + 0.5Y_{t−1} + u_t , Y_{t−1}) = 0.5 cov(Y_{t−1}, Y_{t−1}) = 0.5 × 4 = 2.

Since Y_t is stationary, cov(Y_{t+k}, Y_{t+k+1}) = cov(Y_{t+k}, Y_{t+k−1}) = γ_1 = 2 for any k. The first order autocorrelation (autocorrelation at lag 1) is

corr(Y_{t+k}, Y_{t+k+1}) = corr(Y_{t+k}, Y_{t+k−1}) = ρ_1 = γ_1/σ_Y² = 0.5.

Second and higher order autocovariances:

cov(Y_t, Y_{t−j}) = cov(2.5 + 0.5Y_{t−1} + u_t , Y_{t−j}) = 0.5γ_{j−1} = γ_j ≠ 0 for j ≥ 1.

Note that we can also write var(Y_t) = σ_Y² ≡ γ_0. So in general, ρ_k = γ_k/γ_0.
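A small simulation sketch in Python can verify these moments for (6.5); the sample size is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000
u = rng.normal(0.0, np.sqrt(3.0), T)     # var(u_t) = 3
y = np.zeros(T)
y[0] = 5.0                               # start at the unconditional mean
for t in range(1, T):
    y[t] = 2.5 + 0.5 * y[t - 1] + u[t]   # the AR(1) in (6.5)

print("mean   ~", y.mean().round(2), "(theory 5)")
print("var    ~", y.var().round(2), "(theory 4)")
print("gamma1 ~", np.cov(y[1:], y[:-1])[0, 1].round(2), "(theory 2)")
print("rho1   ~", np.corrcoef(y[1:], y[:-1])[0, 1].round(2), "(theory 0.5)")
```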

6.3 MOVING AVERAGE PROCESS

For the MA(1) process in (6.2), Y_t = μ + u_t + θu_{t−1} (θ ≠ 0), t = −T, ..., T, where {u_t} is i.i.d. with zero mean and variance σ_u², we have:

E(Y_t) = μ
var(Y_t) = σ_u²(1 + θ²)
cov(Y_t, Y_{t−1}) = cov(μ + u_t + θu_{t−1} , μ + u_{t−1} + θu_{t−2}) = θσ_u²
corr(Y_t, Y_{t−1}) = θ/(1 + θ²)
corr(Y_t, Y_{t−k}) = 0 for k > 1.

Hence we see that the MA(1) process Y_t is covariance-stationary with constant mean μ, constant variance σ_u²(1 + θ²), and autocorrelation at lag k a function of k only:

$$\rho(k) = \begin{cases}\dfrac{\theta}{1+\theta^2}\,, & k = 1\\[4pt] 0\,, & k > 1\,.\end{cases}$$

6.4 AUTOREGRESSIVE MOVING AVERAGE PROCESS

For the ARMA(1,1) process in (6.3), Y_t = δ + φY_{t−1} + u_t + θu_{t−1} (φ ≠ 0, θ ≠ 0), t = −T, ..., T, where {u_t} is i.i.d. with zero mean and variance σ_u², (6.3) implies

$$Y_t = \delta + \phi\left(\delta + \phi Y_{t-2} + u_{t-1} + \theta u_{t-2}\right) + u_t + \theta u_{t-1}
= \delta(1+\phi) + \phi^2 Y_{t-2} + u_t + (\phi+\theta)u_{t-1} + \phi\theta u_{t-2}
= \delta\left(1+\phi+\phi^2+\cdots\right) + u_t + (\phi+\theta)\left[u_{t-1} + \phi u_{t-2} + \phi^2 u_{t-3} + \cdots\right].$$

For each t,

E(Y_t) = δ/(1 − φ) provided |φ| < 1,

$$\mathrm{var}(Y_t) = \sigma_u^2\left[1 + (\phi+\theta)^2\left(1 + \phi^2 + \phi^4 + \cdots\right)\right] = \sigma_u^2\left[1 + \frac{(\phi+\theta)^2}{1-\phi^2}\right] \quad \text{provided } |\phi| < 1,$$

cov(Y_t, Y_{t−1}) = cov(φY_{t−1} + u_t + θu_{t−1} , Y_{t−1}) = φ var(Y_{t−1}) + θ cov(u_{t−1}, Y_{t−1}) = φ var(Y_{t−1}) + θσ_u²,

cov(Y_t, Y_{t−k}) = φ^k var(Y_{t−k}) + θφ^{k−1}σ_u²  for k ≥ 1.

Hence we see that the ARMA(1,1) process Y_t is covariance-stationary with constant mean δ/(1 − φ), constant variance

$$\sigma_u^2\left[1 + \frac{(\phi+\theta)^2}{1-\phi^2}\right] \equiv \sigma_Y^2\,,$$

and autocovariance at lag k, a function of k only,

$$\gamma_k = \begin{cases}\phi\,\sigma_Y^2 + \theta\,\sigma_u^2\,, & k = 1\\ \phi\,\gamma_{k-1}\,, & k > 1\,,\end{cases}$$

if and only if |φ| < 1.


So far we have dealt with covariance-stationary processes. Why is it important to consider such stationary processes? It is important certainly from the point of view of estimation and inference using sample data. In the two-variable linear regression Y_t = α + βX_t + u_t, t = 1, 2, ..., T, where {Y_t}, {X_t}, {u_t} are stochastic processes and the classical conditions are met, {Y_t} and {X_t} are typically observable, while {u_t} is an unobservable disturbance or noise. The OLS estimators α̂ and β̂ involve the actual observed sample data (X_1, Y_1), (X_2, Y_2), ..., (X_T, Y_T). Theoretically, for the stochastic relationship Ỹ_t = α + βX̃_t + ũ_t (here we emphasize that Y_t, X_t, u_t are random variables by putting tildes),

$$\beta = \frac{\mathrm{cov}(\tilde X_t, \tilde Y_t)}{\mathrm{var}(\tilde X_t)}\,, \qquad \alpha = E(\tilde Y_t) - \beta\,E(\tilde X_t)\,.$$

Then only when {X_t}, {Y_t} are stationary can we apply the Law of Large Numbers so that α̂ and β̂ are consistent. Moreover, with û_t = Y_t − α̂ − β̂X_t, the sample moment

$$\frac{1}{T}\sum_{t=1}^{T}\hat u_t^2 \;\rightarrow\; \sigma_u^2\,,$$

a meaningful number that is the variance of u_t, only when {X_t}, {Y_t}, and hence {u_t}, are stationary. And only then can we perform appropriate statistical inference based on the usual t_{T−2}-statistics.

Hence we see the importance of stationarity of the dependent variable process {Y_t} or regressand, and of {X_t} or regressor. Even though the OLS α̂ and β̂ may be unbiased, if that is all we know and there is nothing about the sampling distribution of these OLS estimators that can be estimated, then statistical inference would be impossible.
It is important to know that while we are dealing with covariance-stationary processes, e.g. AR(1) (|φ| < 1) or MA(1), their conditional means will change over time. This is distinct from the constant mean we talk about, which is the unconditional mean. The distinction is often a source of confusion among beginning students, but it is important enough to elaborate.
6.5 CHANGING CONDITIONAL MEANS

Consider the AR(1) process:

Y_t = δ + φY_{t−1} + u_t  (φ ≠ 0).

Also, Y_{t+1} = δ + φY_t + u_{t+1}. At t+1, the information Y_t is already known, so

E(Y_{t+1} | Y_t) = δ + φY_t + E(u_{t+1} | Y_t).

The last term is 0 since u_{t+1} is not correlated with Y_t. Therefore, the conditional mean27 at t is δ + φY_t. This is different from the constant unconditional mean δ/(1 − φ). As Y_{t+k} changes with k, the conditional mean changes with Y_{t+k}. However, the conditional variance at t is

var(Y_{t+1} | Y_t) = var(u_{t+1} | Y_t) = σ_u²,

which is still a constant no matter what Y_t is.

This conditional variance, however, is smaller than the unconditional variance σ_u²/(1 − φ²) (|φ| < 1). This is because the unconditional variance includes the variance φ²σ_Y² on the regressor, since Y_{t−1} is not assumed to be known yet.

Next consider another example, an MA(1) process: Y_t = μ + u_t + θu_{t−1} (θ ≠ 0).

E(Y_{t+1} | u_t) = μ + θu_t
var(Y_{t+1} | u_t) = σ_u² < σ_u²(1 + θ²)

Likewise, for the MA(1) covariance-stationary process, the conditional mean changes at each t, but the conditional variance is constant and is smaller than the unconditional variance.

27 The conditional mean also serves as a forecast when estimated parameters are used, e.g. δ̂ + φ̂Y_t.
We will branch off from this point to two other important topics: (1) non-stationary processes, and (2) conditional variance that is not constant. These topics will be covered in later chapters.

Given a time series, and knowing it is from a stationary process, the next step is to identify the statistical time series model or the generating process for the time series. Since AR and MA models produce different autocorrelation functions (ACF) ρ_k, k > 0, we can find the sample autocorrelation function r(k) and use this to try to differentiate between an AR or MA or perhaps ARMA model. To get to the sample autocorrelation function, we need to estimate the sample autocovariance function.
6.6 SAMPLE AUTOCORRELATION FUNCTION

Given a sample {Y_t}, t = 1, 2, 3, ..., T, the sample autocovariance at lag k is

$$c(k) = \frac{1}{T}\sum_{t=1}^{T-k}\left(Y_t - \bar Y\right)\left(Y_{t+k} - \bar Y\right)$$

for k = 0, 1, 2, 3, ..., p. As a rule of thumb, p < T/4, i.e. given a sample size T, we usually would not want to estimate a sample autocovariance with a lag larger than T/4. To use a larger lag means we will have a smaller number of terms in the summation, and this may affect convergence.

Note that the divisor is T and c(k) is consistent, i.e. lim_{T→∞} c(k) = γ_k. The sample autocorrelation at lag k is then

$$r(k) = \frac{c(k)}{c(0)}$$

for k = 0, 1, 2, 3, ..., p.

In the above estimator of the autocorrelation function, there are sometimes different versions in different statistical packages due to the use of different divisors, e.g.

$$r(k) = \frac{\dfrac{1}{T-k}\displaystyle\sum_{t=1}^{T-k}\left(Y_t - \bar Y\right)\left(Y_{t+k} - \bar Y\right)}{\dfrac{1}{T-1}\displaystyle\sum_{t=1}^{T}\left(Y_t - \bar Y\right)^2}$$

which is the ratio of two unbiased estimators of covariances. Asymptotically there is no difference as T gets to be very large. However, in finite samples, the estimates will look slightly different. A more convenient formula is

$$r(k) = \frac{\displaystyle\sum_{t=1}^{T-k}\left(Y_t - \bar Y\right)\left(Y_{t+k} - \bar Y\right)}{\displaystyle\sum_{t=1}^{T}\left(Y_t - \bar Y\right)^2}$$   (6.6)

This formulation implicitly assumes division by T in both the numerator and the denominator, and thus represents the consistent estimates of both autocovariances, rather than unbiased finite sample estimates. Its advantage is that (using the Cauchy-Schwarz inequality) r(k) will always fall within (−1, +1), which accords with theory. In the other formulations, there may be sampling possibilities where r(k) lies outside (−1, +1). c(k) and r(k) are symmetrical functions about k = 0, i.e. c(k) = c(−k) and r(k) = r(−k).
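A short Python sketch of formula (6.6), applied to a simulated AR(1) series with φ = 0.5 (the series and sample size are hypothetical inputs for illustration):

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations r(k), k = 1..max_lag, using formula (6.6)."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    denom = np.sum(dev ** 2)                        # common divisor-T denominator
    return np.array([np.sum(dev[:-k] * dev[k:]) / denom for k in range(1, max_lag + 1)])

# Illustration on a simulated AR(1) with phi = 0.5: r(k) should decay roughly like 0.5**k
rng = np.random.default_rng(3)
y = np.zeros(1000)
for t in range(1, 1000):
    y[t] = 2.5 + 0.5 * y[t - 1] + rng.normal(0, np.sqrt(3))
print(sample_acf(y, 4).round(2))
```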
The sample autocorrelation function r(k) can also be represented as an autocorrelation matrix. The following may be that of an AR(1) process Y_t = 2.5 + 0.5Y_{t−1} + u_t.

corr      Y_t    Y_{t-1}  Y_{t-2}  Y_{t-3}
Y_t       1      0.53     0.24     0.12
Y_{t-1}   0.53   1        0.53     0.24
Y_{t-2}   0.24   0.53     1        0.53
Y_{t-3}   0.12   0.24     0.53     1

r(k) is a random variable with a sampling distribution of the autocorrelation. The general shapes of the sample autocorrelation functions for AR and MA processes are depicted in Figure 6.1 below.

Figure 6.1  Sample Autocorrelation Functions of AR and MA Processes (plot of r(k) against lag k, showing the gradually decaying AR pattern and the MA pattern that cuts off to zero)
Based on the AR(1) process in equation (6.1) and the sample autocorrelation measure r(k) in (6.6), the variance of r(k) is approximately28

$$\mathrm{var}\left[r(k)\right] \approx \frac{1}{T}\left[\frac{\left(1+\phi^2\right)\left(1-\phi^{2k}\right)}{1-\phi^2} - 2k\,\phi^{2k}\right].$$   (6.7)

For k = 1, var[r(1)] ≈ (1/T)(1 − φ²). For k = 2,

$$\mathrm{var}\left[r(2)\right] \approx \frac{1}{T}\left(1 + 2\phi^2 - 3\phi^4\right) = \frac{1}{T}\left(1-\phi^2\right)\left(1+3\phi^2\right) = \left(1+3\phi^2\right)\mathrm{var}\left[r(1)\right].$$

28 For the derivation of var[r(k)] above, see M.S. Bartlett, (1946), On the theoretical specification of sampling properties of autocorrelated time series, Journal of the Royal Statistical Society, B, Vol. 27.

6.7 TEST OF ZERO AUTOCORRELATIONS FOR AR(1)

For the AR(1) process, the autocorrelation at lag k is ρ(k) = φ^k, provided |φ| < 1. Suppose we test the null hypothesis H0: ρ(k) = 0 for all k > 0. This is essentially a test of the null hypothesis H0: φ = 0. Then

var[r(k)] = 1/T

for all k > 0. Under H0, (6.4) becomes Y_t = δ + u_t, where {u_t} is i.i.d. Hence {Y_t} is i.i.d. Therefore, asymptotically as the sample size T increases to +∞, the vector (r(1), r(2), ..., r(m))′ is approximately distributed as N(0, (1/T)I_m), i.e. each r(k) is approximately N(0, 1/T) and the r(k)'s are mutually uncorrelated.
As T becomes large, the MVN distribution is approached. To test that the j-th autocorrelation is zero, evaluate the z_j ~ N(0,1) statistic as follows:

$$z_j = \frac{r(j) - 0}{\sqrt{1/T}}\,.$$

Reject H0: ρ(j) = 0 at the 5% significance level if the z value exceeds 1.96 in absolute value for a 2-tailed test.

Since it is known that AR processes have an ACF ρ(k), approximated by r(k), that decays to zero slowly, but MA(q) processes have an ACF ρ(k), approximated by r(k), that is zero for k > q, a check of the autocorrelogram (graph of r(k)) in Figure 6.2 below shows that we cannot reject H0: ρ(k) = 0 for k > 1 at the 5% significance level. For MA(q) processes where ρ(k) = 0 for k > q,

$$\mathrm{var}\left[r(k)\right] = \frac{1}{T}\left(1 + 2\sum_{j=1}^{q}\rho(j)^2\right), \quad k > q\,.$$

Thus for MA(1), var[r(k)] for k > 1 may be estimated by

$$\frac{1}{T}\left(1 + 2\,r(1)^2\right).$$

Thus the MA(1) is identified. Compare this with the AR(3) that is also shown. The standard error used in most statistical programs is √(1/T). This standard error is reasonably accurate for AR(1) and for MA(q) processes when q is small.

Figure 6.2  Identification Using the Sample Autocorrelogram (sample r(k) with ±1.96 s.e. bands for an MA(1) process and an AR(3) process)

To test if all autocorrelations are simultaneously zero, H0: ρ_1 = ρ_2 = ⋯ = ρ_m = 0 (it is common practice to set m at 6, 12, 18, ..., provided T > 4m as a rule of thumb), we can apply the Box and Pierce (1970) Q-statistic:

$$Q = T\sum_{k=1}^{m} r(k)^2 = \sum_{k=1}^{m} z_k^2 \;\sim\; \chi^2_m\,.$$   (6.8)

This is an asymptotic test statistic. The Ljung and Box (1978) test statistic provides an approximate finite sample correction to the above asymptotic test statistic:

$$Q' = T(T+2)\sum_{k=1}^{m}\frac{r(k)^2}{T-k} \;\sim\; \chi^2_m\,.$$   (6.9)

This Ljung-Box test is appropriate in situations where the null hypothesis is a white noise or approximately white noise, such as a stock return rate.
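A compact Python sketch of both statistics (the white-noise test series and the choice m = 12 are arbitrary illustrations; the chi-square p-value uses scipy):

```python
import numpy as np
from scipy.stats import chi2

def q_tests(y, m):
    """Box-Pierce Q (6.8) and Ljung-Box Q' (6.9) for the first m sample autocorrelations."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    dev = y - y.mean()
    r = np.array([np.sum(dev[:-k] * dev[k:]) / np.sum(dev ** 2) for k in range(1, m + 1)])
    q_bp = T * np.sum(r ** 2)
    q_lb = T * (T + 2) * np.sum(r ** 2 / (T - np.arange(1, m + 1)))
    pval = 1 - chi2.cdf(q_lb, df=m)          # p-value of the Ljung-Box statistic
    return q_bp, q_lb, pval

# Illustration: white noise should not reject H0 of zero autocorrelations
rng = np.random.default_rng(7)
print(q_tests(rng.standard_normal(500), m=12))
```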
6.8 ARMA(P,Q) PROCESSES

A p-th order autoregressive process AR(p) or ARMA(p,0) is

Y_t = δ + φ_1Y_{t−1} + φ_2Y_{t−2} + ⋯ + φ_pY_{t−p} + u_t .   (6.10)

A q-th order moving average process MA(q) or ARMA(0,q) is

Y_t = μ + u_t + θ_1u_{t−1} + θ_2u_{t−2} + ⋯ + θ_qu_{t−q} .   (6.11)

We shall define the backward shift operator B as a function such that its map is its lag, BY_t = Y_{t−1}. The operator follows the usual algebraic properties: B²Y_t = B(BY_t) = BY_{t−1} = Y_{t−2}, and B^kY_t = Y_{t−k}.

A finite q-th order MA process is always stationary as the mean and variance are finite constants. A finite p-th order AR process may be non-stationary, and is stationary provided a restriction on the parameters φ_k holds. For the AR(p) in (6.10), we express it as

Φ(B)Y_t ≡ (1 − φ_1B − φ_2B² − ⋯ − φ_pB^p)Y_t = δ + u_t .

Φ(B) ≡ (1 − φ_1B − φ_2B² − ⋯ − φ_pB^p) = 0 is called the characteristic equation of the AR(p) process. For the process to be stationary, the roots or zeros of the characteristic equation must lie outside the unit circle (complex roots lie outside the Argand circle). For example, in the AR(1) case, Φ(B) ≡ 1 − φ_1B = 0, and B = 1/φ_1, so if B is to lie outside the unit circle, then the requirement for stationarity is |1/φ_1| > 1, or |φ_1| < 1.

Next we show how MA processes can sometimes be represented by infinite order AR processes. As an example, consider the MA(1):

Y_t = μ + u_t + θu_{t−1} (θ ≠ 0), or Y_t = μ + (1 + θB)u_t .

So, (1 + θB)^{−1}Y_t = (1 + θB)^{−1}μ + u_t. Note that (1 + x)^{−1} = 1 − x + x² − x³ + x⁴ − ⋯ for |x| < 1. Also, let the constant c = (1 + θB)^{−1}μ. Then,

(1 − θB + θ²B² − θ³B³ + θ⁴B⁴ − ⋯)Y_t = c + u_t , or
Y_t − θY_{t−1} + θ²Y_{t−2} − θ³Y_{t−3} + θ⁴Y_{t−4} − ⋯ = c + u_t .

Thus

Y_t = c + θY_{t−1} − θ²Y_{t−2} + θ³Y_{t−3} − θ⁴Y_{t−4} + ⋯ + u_t ,

which is an infinite order AR process.

This AR(∞) process is not a proper representation that allows an infinite number of past Y_{t−k}'s to forecast a finite Y_t unless it is stationary. If it is not stationary, Y_t may increase by too much, based on an infinite number of explanations of past Y_{t−k}'s.

It is stationary provided that the root of (1 + θB) = 0 lies outside the unit circle. If a stationary MA(q) process can be equivalently represented as a stationary AR(∞) process, then the MA(q) process is said to be invertible. Although all finite order MA(q) processes are stationary, not all are invertible. For example, Y_t = u_t − 0.3u_{t−1} is invertible, but Y_t = u_t − 1.3u_{t−1} is not.

Invertibility of an MA(q) process to a stationary AR(∞) allows expression of the current Y_t and future Y_{t+k} in terms of past Y_{t−k}, k > 0. This could facilitate forecasts and interpretations of past impact.

It is not an interesting issue to consider inverting AR(p) processes into infinite order MA processes.

A mixed autoregressive moving average process of order (p,q) is

Y_t = δ + φ_1Y_{t−1} + φ_2Y_{t−2} + ⋯ + φ_pY_{t−p} + u_t + θ_1u_{t−1} + θ_2u_{t−2} + ⋯ + θ_qu_{t−q} .

This is also invertible to an infinite order AR or, less interestingly, an infinite order MA.

Consider the AR(p) process in (6.10): Y_t = δ + φ_1Y_{t−1} + φ_2Y_{t−2} + ⋯ + φ_pY_{t−p} + u_t. Note that u_t is i.i.d. and has zero correlation with Y_{t−k}, k > 0. Multiply both sides by Y_{t−k}. Then,

Y_{t−k}Y_t = δY_{t−k} + φ_1Y_{t−k}Y_{t−1} + φ_2Y_{t−k}Y_{t−2} + ⋯ + φ_pY_{t−k}Y_{t−p} + Y_{t−k}u_t .

Taking unconditional expectation on both sides, and noting that E(Y_{t−k}Y_t) = γ(k) + μ², where μ = E(Y_{t−k}) for any k, then

γ(k) + μ² = δμ + φ_1[γ(k−1) + μ²] + φ_2[γ(k−2) + μ²] + ⋯ + φ_p[γ(k−p) + μ²].   (6.12)

Finding the unconditional mean in (6.10), μ = δ + φ_1μ + φ_2μ + ⋯ + φ_pμ, and using this in (6.12), we have

γ(k) = φ_1γ(k−1) + φ_2γ(k−2) + ⋯ + φ_pγ(k−p) for k > 0 .

Dividing both sides by γ(0),

ρ(k) = φ_1ρ(k−1) + φ_2ρ(k−2) + ⋯ + φ_pρ(k−p) for k > 0 .

If we set k = 1, 2, 3, ..., p, we obtain p equations. Put ρ(0) = 1. The following equations derived from the AR(p) are called the Yule-Walker equations. They solve for the p parameters φ_1, φ_2, ..., φ_p:

ρ(1) = φ_1 + φ_2ρ(1) + ⋯ + φ_pρ(p−1)
ρ(2) = φ_1ρ(1) + φ_2 + ⋯ + φ_pρ(p−2)
ρ(3) = φ_1ρ(2) + φ_2ρ(1) + ⋯ + φ_pρ(p−3)
⋮
ρ(p) = φ_1ρ(p−1) + φ_2ρ(p−2) + ⋯ + φ_p

If we replace the ρ(k) by the sample r(k) as approximations, then the p Yule-Walker equations can be solved for the parameter estimates φ̂_k. In matrix form,

$$\begin{pmatrix} r(1)\\ r(2)\\ \vdots\\ r(p)\end{pmatrix} = \hat R\begin{pmatrix}\hat\phi_1\\ \hat\phi_2\\ \vdots\\ \hat\phi_p\end{pmatrix}$$   (6.13)

where R̂ is the p×p matrix whose (i,j) element is r(|i−j|), with r(0) = 1 on the diagonal and entries r(1), r(2), ..., r(p−1) off the diagonal. Therefore, (φ̂_1, φ̂_2, ..., φ̂_p)′ = R̂^{−1}(r(1), r(2), ..., r(p))′. The other parameters can be estimated as follows.


$$\hat\mu = \frac{1}{T}\sum_{t=1}^{T} Y_t\,, \qquad \hat\delta = \hat\mu\left(1 - \hat\phi_1 - \hat\phi_2 - \cdots - \hat\phi_p\right).$$

The estimate of the variance of u_t can be obtained from

$$\hat\sigma_u^2 = \left[\frac{1}{T}\sum_{t=1}^{T}\left(Y_t - \bar Y\right)^2\right]\left(1 - \hat\phi_1 r(1) - \hat\phi_2 r(2) - \cdots - \hat\phi_p r(p)\right).$$

It is important to note that identification of the appropriate process must be done before estimation of the parameters is possible, since the approach to estimation requires knowledge of the relevant process to be estimated.
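A Python sketch of the Yule-Walker estimation in (6.13) and the two formulas above, applied to a simulated AR(2) (the process parameters and sample size are hypothetical):

```python
import numpy as np

def yule_walker(y, p):
    """Solve the sample Yule-Walker equations (6.13) for AR(p) estimates."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    c0 = np.sum(dev ** 2) / len(y)
    r = np.array([np.sum(dev[:-k] * dev[k:]) / np.sum(dev ** 2) for k in range(1, p + 1)])
    R = np.array([[1.0 if i == j else r[abs(i - j) - 1] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(R, r)                     # phi_hat = R^{-1} r
    delta = y.mean() * (1 - phi.sum())              # intercept estimate
    sigma2_u = c0 * (1 - phi @ r)                   # residual variance estimate
    return phi, delta, sigma2_u

# Illustration on a simulated AR(2): Y_t = 2 + 0.5 Y_{t-1} + 0.3 Y_{t-2} + u_t
rng = np.random.default_rng(11)
y = np.zeros(5000)
for t in range(2, 5000):
    y[t] = 2 + 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()
print(yule_walker(y, p=2))
```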
6.9 PARTIAL AUTOCORRELATION FUNCTION

The sample ACF allows the identification of either an AR or an MA process depending on whether the sample r(k) decays (reduces to zero) slowly or is clearly zero after some lag k. However, even if an AR(p) is identified, it is still difficult to identify the order of the lag, p, since all AR(p) processes show similar decay patterns in the ACF. A complementary tool, the partial autocorrelation function (PACF), is used to identify p in the AR(p) process. The PACF also helps to confirm MA(q) processes.

Suppose the true process is AR(p) for some p > 0. Suppose we take k = 1, and based on AR(1):

Y_t = δ + φ_{11}Y_{t−1} + u_t .

Note that we add a double subscript, with the first indicating the premise of the k-th order and the second denoting the position of the parameter in the k-th order AR(k). Compute the sample Yule-Walker equations in (6.13); then

φ̂_{11} = r(1).

Suppose we take k = 2, and based on AR(2): Y_t = δ + φ_{21}Y_{t−1} + φ_{22}Y_{t−2} + u_t (note that we add a double subscript with the first indicating the premise of the 2nd order, and the second denoting the position of the parameter in the AR(2)), solve the sample Yule-Walker equations in (6.13); then

$$\hat\phi_{22} = \frac{r(2) - r(1)^2}{1 - r(1)^2}\,.$$

Note that the computation of φ̂_{21} is not important at all here. Suppose we take k = 3, and based on AR(3): Y_t = δ + φ_{31}Y_{t−1} + φ_{32}Y_{t−2} + φ_{33}Y_{t−3} + u_t, compute the sample Yule-Walker equations in (6.13), and obtain φ̂_{33}. In a similar way, we compute φ̂_{44}, φ̂_{55}, ..., φ̂_{kk}, and so on. Each of these φ̂_{kk} (k = 1, 2, 3, ...) is called a (sample) partial autocorrelation. Consider

Y_t = δ + φ_{k1}Y_{t−1} + φ_{k2}Y_{t−2} + φ_{k3}Y_{t−3} + ⋯ + φ_{kk}Y_{t−k} + u_t

and, assuming all Y_{t−j}, j < k, are held constant, the autocorrelation coefficient of Y_t and Y_{t−k} is simply φ_{kk}, which is the partial autocorrelation. The function or sequence φ_{kk}(k), k = 1, 2, 3, ..., is called the partial autocorrelation function (PACF). They can be computed from a time series {Y_t}. Their theoretical counterparts are φ_{kk} where the theoretical ρ(k) are used instead of r(k) in (6.13).

What is interesting to note is that, given the true order p of AR(p), theoretically φ_{11} ≠ 0, φ_{22} ≠ 0, φ_{33} ≠ 0, ..., φ_{pp} ≠ 0, while φ_{p+1,p+1} = 0, φ_{p+2,p+2} = 0, φ_{p+3,p+3} = 0, and so on. That is, φ_{kk} ≠ 0 for k ≤ p, and φ_{kk} = 0 for k > p. Therefore, while an AR(p) process has a decaying ACF that does not disappear to zero, it has a PACF that disappears to zero after lag p.

For k > p, var(φ̂_{kk}) = T^{−1}. Therefore, we can apply hypothesis testing to determine if, for a particular k, H0: φ_{kk} = 0 should be rejected or not by considering if the statistic |φ̂_{kk}/√(T^{−1})| exceeds 1.96 at the 5% level of significance.

For any MA(q) process, if it is invertible, it is an AR(∞), so its PACF will not disappear, but will decay. In addition to the sample ACF shown in Figure 6.2, the sample PACF is also computed and shown in Figure 6.3 below.
Figure 6.3  Sample Partial Autocorrelogram (sample φ̂_kk with ±1.96 s.e. bands for the MA(1) and AR(3) processes of Figure 6.2)
6.10 GDP GROWTH

In growing economies, Gross Domestic Product dollars, or the national output, are usually on an upward trend with obviously increasing means. Thus the process is not stationary per se, and some form of transformation is required to arrive at a stationary process. We shall see how the Box-Jenkins approach deals with such a non-stationary process. In Figure 6.4, we show the quarterly time series of US$ per capita GDP of the fast growing Singapore economy from 1985 to 1995. The data are available from the Singapore Government's Department of Statistics.

Figure 6.4  Singapore's per capita GDP in US$ from 1985 to 1995

This is the gross domestic output in the economy, and is seen to be rising with a time trend. If we input a time trend variable T, and then run an OLS regression on an intercept and time trend29, we obtain

Y = 7977.59 + 292.74 T

where Y is GDP. We then perform a time series analysis on the residuals of this regression, i.e. e = Y − 7977.59 − 292.74 T. Sample ACF and PACF correlograms are plotted for the estimated residuals e to determine its time series model. This is shown in Figure 6.5.
29 At this juncture, another type of analysis may proceed alternatively by taking first differences. We shall look at this in a later chapter on Unit Roots.
Figure 6.5  Sample Autocorrelation and Partial Autocorrelation Statistics for the detrended GDP residuals

Figure 6.5 indicates a significant ACF that decays slowly, so an AR(p) process is suggested. The PACF is significant only at lag 1, so an AR(1) is identified. The residual process is not i.i.d. since the Ljung-Box Q-statistics in (6.9) indicate rejection of zero correlation for m = 1, 2, 3, etc. Then, we may approximate the process as

(Y_t − 7977.59 − 292.74 T) = 0.573 (Y_{t−1} − 7977.59 − 292.74 [T−1]) + u_t

where u_t is i.i.d. Then, Y_t = 3598 + 126 T + 0.573 Y_{t−1} + u_t. If we create a fitted (in-sample fit) series

Ŷ_t = 3598 + 126 T + 0.573 Y_{t−1},

and plot this fitted Ŷ_t and Y_t together, we obtain Figure 6.6.

Figure 6.6  Fitted GDP Equation Ŷ_t = 3598 + 126 T + 0.573 Y_{t−1} (fitted GDP plotted against actual GDP)
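The two-step procedure (detrend with a linear time trend, then fit an AR(1) to the residuals) can be sketched in Python as follows; since the actual Singapore data are not reproduced here, a simulated trending series with similar parameters is used as a stand-in.

```python
import numpy as np

# Simulated stand-in for a trending quarterly GDP series (the actual data are not reproduced here)
rng = np.random.default_rng(9)
T = np.arange(1, 45)                                   # 44 quarters
e = np.zeros(len(T))
for t in range(1, len(T)):
    e[t] = 0.573 * e[t - 1] + 400 * rng.standard_normal()
gdp = 7977.59 + 292.74 * T + e

# Step 1: OLS of GDP on an intercept and time trend
X = np.column_stack([np.ones_like(T, dtype=float), T])
a, b = np.linalg.lstsq(X, gdp, rcond=None)[0]
resid = gdp - a - b * T

# Step 2: AR(1) on the detrended residuals (regression of e_t on e_{t-1}, no intercept)
phi = resid[:-1] @ resid[1:] / (resid[:-1] @ resid[:-1])

# Step 3: one-step-ahead in-sample fit of GDP itself, Y_t = a(1-phi) + b(T - phi(T-1)) + phi*Y_{t-1}
fitted = a * (1 - phi) + b * (T[1:] - phi * T[:-1]) + phi * gdp[:-1]
print(f"trend: {a:.2f} + {b:.2f}*T, AR(1) coefficient: {phi:.3f}")
```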

6.11 ARIMA

Suppose for a stationary time series {Y_t}, both its ACF and PACF decay slowly or exponentially without reducing to zero; then an ARMA(p,q) model, p ≠ 0, q ≠ 0, is identified. Sometimes a time series' ACF does not reduce to zero, and yet its PACF also does not reduce to zero, not because it is ARMA(p,q), but rather because it is an autoregressive integrated moving average process, ARIMA(p,d,q). This means that we need to take d differences of {Y_t} in order to arrive at a stationary ARMA(p,q). For example, if {Y_t} is ARIMA(1,1,1), then {Y_t − Y_{t−1}} is ARMA(1,1). In such a case, we have to take differences and then check the resulting ACF and PACF in order to proceed to determine the ARMA.

There is a special case, ARIMA(0,1,1), that is interesting, amongst others:

ΔY_t ≡ Y_t − Y_{t−1} = u_t − θu_{t−1} ,  |θ| < 1.

Then (1 − B)Y_t = (1 − θB)u_t. So,

$$\frac{1-B}{1-\theta B}\,Y_t = u_t$$
$$(1-B)\left(1 + \theta B + \theta^2 B^2 + \theta^3 B^3 + \cdots\right)Y_t = u_t$$
$$Y_t = \bar Y_{t-1} + u_t\,, \qquad \bar Y_{t-1} \equiv (1-\theta)\sum_{j\geq 1}\theta^{j-1}Y_{t-j}\,, \quad |\theta| < 1$$
$$E\left(Y_t\,|\,Y_{t-1}, Y_{t-2}, \ldots\right) = \bar Y_{t-1} = (1-\theta)Y_{t-1} + \theta(1-\theta)Y_{t-2} + \cdots$$

Thus, the forecast of the next Y_t made at time t−1 is a weighted average of the last observation Y_{t−1} and the exponential-weighted moving average of the observations before Y_{t−1} (equivalently, Ȳ_{t−1} = (1 − θ)Y_{t−1} + θȲ_{t−2}).
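A minimal Python sketch of this exponential smoothing recursion; the input series and the smoothing parameter θ = 0.6 are arbitrary illustrations.

```python
import numpy as np

def ewma_forecast(y, theta):
    """One-step forecast from the ARIMA(0,1,1) / exponential smoothing recursion
    Ybar_t = (1 - theta) * y_t + theta * Ybar_{t-1}."""
    ybar = y[0]                       # initialize the smoothed level at the first observation
    for value in y[1:]:
        ybar = (1 - theta) * value + theta * ybar
    return ybar                       # forecast of the next observation

# Illustration with an arbitrary series and theta = 0.6
series = np.array([10.0, 10.4, 10.1, 10.8, 11.0, 10.9, 11.3])
print(round(ewma_forecast(series, theta=0.6), 3))
```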
Another special application of ARIMA is de-seasonalization. Suppose a time series {S_t} is monthly sales. It is noted that every December (when Christmas and the New Year come around) sales will be higher because of an additive seasonal component X (assume this is a constant for simplicity). Otherwise S_t = Y_t. Assume Y_t is stationary. Then the stochastic process of sales {S_t} will look as follows:

Y_1, Y_2, ..., Y_11, Y_12 + X, Y_13, Y_14, ..., Y_23, Y_24 + X, Y_25, Y_26, ..., Y_35, Y_36 + X, Y_37, Y_38, ...

This is clearly a non-stationary series even if {Y_t} by itself is stationary, because the means jump by X each December. A stationary series can be obtained from the above, for the purpose of forecasting and analysis, by performing the appropriate differencing:

(1 − B¹²) S_t = Y_t − Y_{t−12} .

Suppose (1 − B¹²)S_t = u_t, a white noise. Then this can be notated as (0,1,0)₁₂. If (1 − φB¹²)S_t = u_t, then it is (1,0,0)₁₂. Notice that the subscript 12 denotes the power of B. If (1 − φB)(1 − B¹²)S_t = u_t, then it is (1,0,0)(0,1,0)₁₂. And (1 − B¹²)S_t = (1 − θB)u_t is (0,1,0)₁₂(0,0,1).

In this case, the X component is removed, and we perform analyses on the differenced series {Y_t − Y_{t−12}}. Suppose a stationary sales series

S_t = 1000 + 0.3 S_{t−1} + 0.2 S_{t−2} + 0.05 S_{t−3} + u_t

where the previous month's sales have a 0.3 impact on the current S_t, and so on. A forecast of this month's sales given all previous months' sales information yields

E(S_t | S_{t−1}, S_{t−2}, ...) = 1000 + 0.3 S_{t−1} + 0.2 S_{t−2} + 0.05 S_{t−3} .

Thus, if the past three months' sales records are 25,000 units, 30,000 units, and 18,000 units, then

E_{t−1}(S_t) = 1000 + 0.3 × 25,000 + 0.2 × 30,000 + 0.05 × 18,000 = 15,400 units.

Note the subscript t−1 denotes all information available at time t−1.

6.12 MODELLING INFLATION RATES

Modelling inflation rates using the Box-Jenkins approach is an important application in finance, especially during periods in which the economy is experiencing significant inflation.

We define the monthly inflation rate I_t as the change from month t−1 to month t in the natural log of the Consumer Price Index (CPI), denoted by P_t:

$$I_t = \ln\frac{P_t}{P_{t-1}}\,.$$

Using US data for the period 1953 to 1977, Fama and Gibbons (1984) reported the following sample autocorrelations in their Table 1 (shown as follows).

N = 299         r(1)    r(2)    r(3)    r(4)    r(5)    r(6)
I_t             .55     .58     .52     .52     .52     .52
I_t − I_{t−1}   −.53    .11     −.06    −.01    −.00    .03

N = 299         r(7)    r(8)    r(9)    r(10)   r(11)   r(12)
I_t             .48     .49     .51     .48     .44     .47
I_t − I_{t−1}   −.04    −.04    .06     .02     −.08    .09

Given the sample size N, the standard error of r(k) is approximately 0.058. Using a 5% significance level, or a critical region outside of about two standard errors (about 0.113), the r(k)'s for the I_t process are all significantly greater than 0, but the r(k)'s for the I_t − I_{t−1} process are all not significantly different from 0 except for r(1). The autocorrelation of I_t is seen to decline particularly slowly, suggesting the plausibility of an ARIMA process. The ACF of I_t − I_{t−1} suggests an MA(1) process. Thus I_t is plausibly ARIMA(0,1,1). Using this identification,

I_t − I_{t−1} = u_t + θu_{t−1} ,  |θ| < 1.   (6.14)

From (6.14), E_{t−1}(I_t) = I_{t−1} + θu_{t−1} since u_t is i.i.d. Substituting this conditional forecast or expectation back into (6.14), then

I_t = E_{t−1}(I_t) + u_t .   (6.15)

But E_t(I_{t+1}) = I_t + θu_t and E_{t−1}(I_t) = I_{t−1} + θu_{t−1} imply that

E_t(I_{t+1}) − E_{t−1}(I_t) = I_t − I_{t−1} + θ(u_t − u_{t−1}) = u_t + θu_{t−1} + θ(u_t − u_{t−1}) = (1 + θ)u_t .   (6.16)

This indicates that the forecast follows a random walk since ut is i.i.d. From
(6.15), ut = It - Et-1(It ) can also be interpreted as the unexpected inflation rate
realized at t.
Suppose θ is estimated as −0.8. Then (6.14) can be written as

I_t − I_{t−1} = u_t − 0.8 u_{t−1} .

From (6.16), the variance of the change in expected inflation is (1 − 0.8)²σ_u², or 0.04σ_u², while the variance of the unexpected inflation is σ_u². Thus, the latter variance is much bigger. This suggests that using past inflation rates to forecast future inflation, as in the time series model (6.15) above, is not that efficient. Other relevant economic information could be harnessed to produce forecasts with less variance in the unexpected inflation, i.e. to produce forecasts with less surprise. Fama and Gibbons (1984) show such approaches using the Fisher effect, which says that
Rt = Et-1(rt) + Et-1(It)
where Rt is the nominal riskfree interest rate from end of period t-1 to end of
period t, and this is known at t-1, rt is the real interest rate for the same period
as the nominal rate, and It is the inflation rate.
Notice that the right-side of the Fisher equation contains terms in
expectations. This is because inflation at t-1 when Rt is known, has not
happened, and so is only expected in form or ex-ante. If the real rate rt is
known at t−1, then the equation would be R_t = r_t + E_{t−1}(I_t), and this creates an immediate contradiction in that E_{t−1}(I_t) = R_t − r_t would become known at t−1, which it is
Thus, at t-1, real rate, like inflation, is ex-ante and not known. In fact,
unlike inflation which may be measured ex-post fairly accurately, real rate
cannot be accurately observed ex-post and has to be estimated.
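A minimal Python sketch of identifying and using the ARIMA(0,1,1) inflation model is given below. The data are simulated stand-ins (not the Fama-Gibbons data), θ is recovered from the lag-1 autocorrelation of the differenced series by solving ρ(1) = θ/(1 + θ²) for the invertible root, and the clipping of r(1) is a purely numerical safeguard.

```python
import numpy as np

# Simulated monthly inflation following ARIMA(0,1,1) (stand-in data, not the Fama-Gibbons series)
rng = np.random.default_rng(13)
T, theta_true = 300, -0.5
u = 0.002 * rng.standard_normal(T)
infl = 0.003 + np.cumsum(u[1:] + theta_true * u[:-1])   # I_t - I_{t-1} = u_t + theta*u_{t-1}

# Identify theta from r(1) of the differenced series: rho(1) = theta / (1 + theta^2)
d = np.diff(infl)
dev = d - d.mean()
r1 = np.sum(dev[:-1] * dev[1:]) / np.sum(dev ** 2)
r1 = np.clip(r1, -0.499, 0.499)                          # keep r1 inside the MA(1)-feasible range
theta_hat = (1 - np.sqrt(1 - 4 * r1 ** 2)) / (2 * r1)    # invertible root, |theta| < 1

# One-step forecast E_t(I_{t+1}) = I_t + theta_hat * u_t, with u_t from the MA(1) recursion
u_hat = np.zeros_like(d)
for t in range(1, len(d)):
    u_hat[t] = d[t] - theta_hat * u_hat[t - 1]
forecast = infl[-1] + theta_hat * u_hat[-1]
print(f"theta_hat = {theta_hat:.2f}, next-month inflation forecast = {forecast:.4f}")
```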
6.13 PROBLEM SET

6.1 If the return rate r_t follows an autoregressive process of order 1, i.e. an AR(1) process, r_t = α + φr_{t−1} + ε_t, where α and φ are constants and ε_t is i.i.d. (0, σ²), find the mean and variance of r_t if it is covariance-stationary.
6.2 A particular stock return rate time series is analyzed, and its
autocorrelogram is as follows (in bold). The partial autocorrelogram is
in dotted lines.
(i) What kind of stochastic process is evidenced by this stock return time
series?
(ii) Is there weak-form market efficiency with regard to this stock?
(iii) If the lagged one correlation is estimated as -0.099, what is a forecast
of next period return (to nearest two decimal points) if the mean return

is 1.5% and the lagged one, lagged two, and lagged three return rates
are 2%, 1%, and 1.2% respectively? Do you need to check any
condition on the stochastic process before you perform this forecast?

(Figure for Problem 6.2: autocorrelogram r(k) in bold and partial autocorrelogram in dotted lines, with ±1.96 s.e. bands.)

6.3 Consider the MA(1) process Y_t = 5 + u_t − 0.4u_{t−1}.


(i) What is the unconditional mean of Yt ?
(ii) What is the ACF of Yt ?
(iii) Is Yt covariance-stationary?
(iv) Show what is the invertible AR representation?
6.4 Consider the AR(2) process Yt = 2 + 0.5Yt-1 + 0.4Yt-2 + ut
(i) Is Yt stationary?
(ii) Find the unconditional mean of Yt.
(iii) Find the first two autocorrelations, and show how autocorrelations
of higher lags are related to these.
(iv)Find the PACF.
6.5 An infinite order MA process of Yt is expressed as
Y_t = u_t + A·[u_{t−1} + u_{t−2} + ⋯], where A ≠ 0 is a constant.
(i) Is Yt covariance-stationary?
(ii) Is the differenced series Yt Yt-1 stationary? What is the ACF of this
differenced series?
6.6 Suppose an equal-weighted portfolio is formed from N securities or stocks. The portfolio return in any period can be written as r_p = (1/N)Σ_{i=1}^{N} r_i, where r_i is the return of asset i during the same period. Suppose that each return r_i may be characterized as r_i = a_i + b_i r_m + e_i during the period, where a_i, b_i are constants, r_m is the market return, and e_i is asset i's idiosyncratic residual return that has a mean of zero, is independent of r_m, and is not correlated with the other e_j's, except that var(e_i) = c_i > 0 and (1/N*)Σ_{i=1}^{N*} c_i = c̄. Assume that (1/N*)Σ_{i=1}^{N*} b_i = b_p, where b_p is a constant > 0. Find the variance of r_p for N* that is finite, and for N that is almost infinite.

FURTHER RECOMMENDED READINGS


[1] Box, George and Jenkins, Gwilym, (1970), Time Series Analysis:
Forecasting and Control, Holden-Day Publishers.
[2] Fama Eugene and M R Gibbons, 1984, A Comparison of Inflation
Forecasts, Journal of Monetary Economics, Vol 13, 327-348.


Chapter 7
RANDOM WALK
APPLICATION: MARKET EFFICIENCY
Key Points of Learning
Random walk, Rational expectation, Informational efficiency, Equilibrium
asset pricing model, Benchmark model, Joint test, Market efficiency, Euler
condition, Martingale, Markov process, Strong-form market efficiency, Semi-strong form market efficiency, Weak-form market efficiency, Variance ratio,
Long term memory

In this chapter we shall study the mathematical construction of a random walk


and how this has been a workhorse for developing the concept of market
efficiency in the finance literature. While market efficiency (or informational
efficiency, to be more precise about the type of market efficiency discussed
here) is a benchmark consideration in an economy with rational agents or
investors, it is also important to realize that anomalies and deviations from the
benchmark such as in behavioral finance are plausible and possible.
7.1 RANDOM WALK

Suppose a process P_{t+1} follows a random walk:

P_{t+1} = μ + P_t + e_{t+1}   (7.1)

where μ is a constant, e_{t+1} has zero mean and is not correlated with P_t, or E(e_{t+1}|P_t) = 0, and the coefficient of P_t is b = 1. Moreover, e_{t+1} is not correlated with any other economic variables (except P_{t+1}) existing at t, t−1, t−2, etc. The property of e_{t+1} essentially implies that prices at t+1 are so unpredictable at t that very few investors, if any at all, can claim to beat the market over any reasonable stretch of time. Note that for a random walk, e_{t+1} need not follow any specific probability distribution. It could as well be a normally distributed noise or else a discrete Bernoulli random variable.

The strongest version of a random walk process as in equation (7.1) is where e_{t+1} is i.i.d. and E(e_{t+1}) = 0. A second, strong version is where e_{t+1} is independently but not necessarily identically distributed and E(e_{t+1}) = 0. In particular, this allows the variance of e_{t+1} to change over time conditionally. A third, weak version is where e_{t+1} has zero correlation with all its leads and lags.

Now, E(e_{t+1}|P_t) = 0 implies E(e_{t+1}) = 0 and also cov(e_{t+1}, P_t) = 0 by taking
iterated expectations. Note that cov(e_{t+1}, P_t) = 0 need not necessarily imply E(e_{t+1}|P_t) = 0. The former represents zero correlation, while the latter represents a stronger condition of conditional stochastic independence between e_{t+1} and P_t.

Note that the strongest version implies the strong version, which in turn implies the weak version of the random walk. For an example showing that the reverse is not true, suppose cov(e_{t+1}, e_t) = 0. However, it could be that cov(e²_{t+1}, e²_t) ≠ 0. These conditions satisfy the weak version, but not the strong or strongest versions, since stochastic independence means E(e²_{t+1}e²_t) = E(e²_{t+1})E(e²_t), and hence cov(e²_{t+1}, e²_t) necessarily equals 0 in that case.

A related process P_{t+1} = μ + bP_t + e_{t+1}, where b = 1 and e_{t+1} is stationary with zero mean but could be stochastically dependent over time, is called a Unit Root Process.
In the special case of the random walk in (7.1) when the drift μ = 0,

P_{t+1} = P_t + e_{t+1} .   (7.2)

Suppose Z_t represents observed information variables in the economy at time t. Hence by time t and afterward, investors or agents in the economy would know about Z_t. At t, such a random variable (or variables) is taken as predetermined or given constants. They are realizations of information random variables at time t.

If we take the conditional expectation of (7.2), then

E(P_{t+1} | Z_t) = E(P_t | Z_t) + E(e_{t+1} | Z_t) = P_t + 0 .

The last term is zero whether we have the strongest or strong or weak version of the random walk. Hence also,

E(P_{t+1} − P_t | Z_t) = 0 .   (7.3)

The converse, where equation (7.3) implies (7.1), is not necessarily true, as nonlinear forms, e.g. P_{t+1} = P_t ε_{t+1} with E(ε_{t+1}|Z_t) = 1, can also lead to (7.3). Thus, the random walk process is slightly stronger than (7.3), and is a special case of (7.3). Nevertheless, the random walk process is a convenient workhorse to test a condition such as (7.3). In other words, if empirical data satisfy (7.3), they are consistent with (7.1), though (7.1) is not exactly validated.

The equation (7.3), or E(P_{t+1}|Z_t) = P_t, has an important implication. It means that conditional on, or given, all information in the market at time t, Z_t, the last market price P_t is as good a predictor as any of next period's price E(P_{t+1}). Thus P_t is said to fully reflect all information available at time t. Another way to say it is that all information in the market has been fully absorbed and revealed in its current price P_t at time t. A price following a process with property (7.3), such that the conditional expectation is the last period's price, is called a martingale process.
Suppose ln ε_{t+1} ~ N(μ, σ²). Then y_{t+1} ≡ exp(ln ε_{t+1}) ≡ ε_{t+1} is lognormally distributed. Now,

E(y_{t+1} | Z_t) = E(ε_{t+1} | Z_t) = e^{μ + ½σ²} .

Thus if we have μ = −½σ², then E(ε_{t+1}|Z_t) = 1, and P_{t+1} = P_t ε_{t+1} leads to (7.3). Taking natural logs, ln P_{t+1} = ln P_t + ln ε_{t+1}, and E(ln ε_{t+1}) = −½σ². Or, we can write

ln P_{t+1} = −½σ² + ln P_t + u_{t+1} ,  u_{t+1} ~ N(0, σ²).

This is a random walk in log prices.

When u_{t+1} need not be constrained to be normally distributed, and the drift is not constrained to be negative, as in a growth economy, then a more general natural log form of the random walk in prices is

ln P_{t+1} = μ + ln P_t + u_{t+1} ,   (7.4)

where u_{t+1} is i.i.d. (or, in the weaker version, zero-correlated over time). This form has been commonly used to test for market efficiency. (7.4) can also be re-written as

r_{t+1} = μ + u_{t+1}   (7.5)

where the left-hand side r_{t+1} = ln(P_{t+1}/P_t) is the continuously compounded return rate of the asset. What is the expected return r_{t+1} conditional on past information, including ln P_t? Does this show why, under a random walk, there can be no past information that helps to produce higher returns? This is another variation of the sense of market efficiency. The random walk hypothesis in (7.5) is directly contrary to the idea of technical charting, whereby past returns or their patterns, such as heads and shoulders (buy on the first shoulder before the head; sell on the second shoulder after the head), can be utilized to predict and thereby earn a return higher than μ.
Suppose u_{t+1} = ρu_t + v_{t+1} (v_{t+1} is i.i.d.), so that ln P_{t+1} is no longer a random walk because the residual noises are autocorrelated. We can show how past information or history can produce a return different from the population mean:

E(r_{t+1} | u_t ∈ Z_t) = μ + ρu_t .

Thus, it is not the random walk in (7.5) where all conditional means are simply μ. To test for the random walk (weak version), or zero correlations in u_{t+1}, we compute the sample correlation coefficients of u_{t+1} as was discussed in the chapter on time series models. We can apply

$$z_j = \frac{\hat\rho_j - 0}{\sqrt{1/T}}$$

to test that the j-th autocorrelation is zero. To test if all autocorrelations are zero, viz.

H0: ρ_1 = ρ_2 = ⋯ = ρ_m = 0 [with m usually set at 6, 12, 18],

we can apply the Ljung and Box (1978) test statistic that provides an approximate finite sample correction:

$$Q' = T(T+2)\sum_{k=1}^{m}\frac{r(k)^2}{T-k} \;\sim\; \chi^2_m\,.$$

Under (7.5), the null hypothesis is H0: cov(r_t, r_{t−k}) = 0 for any k ≠ 0, so we can use the z statistic and also the Q-statistic on the continuously compounded return rates. Processes such as ARMA with dependence on lags are said to exhibit long memory and are predictable.
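A Python sketch of applying these tests to a price series (the simulated random-walk prices and the choice m = 12 are hypothetical illustrations):

```python
import numpy as np

def random_walk_tests(prices, m=12):
    """z-statistics for each r(k) and the Ljung-Box Q' on continuously compounded returns."""
    r = np.diff(np.log(np.asarray(prices, dtype=float)))    # r_t = ln(P_t / P_{t-1})
    T = len(r)
    dev = r - r.mean()
    acf = np.array([np.sum(dev[:-k] * dev[k:]) / np.sum(dev ** 2) for k in range(1, m + 1)])
    z = acf * np.sqrt(T)                                    # z_j = r(j) / sqrt(1/T)
    q_lb = T * (T + 2) * np.sum(acf ** 2 / (T - np.arange(1, m + 1)))
    return z, q_lb

# Illustration: a simulated random walk in log prices should give |z| mostly below 1.96
rng = np.random.default_rng(21)
log_p = np.cumsum(0.0005 + 0.01 * rng.standard_normal(1000))
z, q = random_walk_tests(np.exp(log_p))
print(np.round(z, 2), round(q, 1))
```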
7.2 INFORMATIONAL EFFICIENCY

The issue of whether stock returns are predictable is a very old one, and will
always be terribly important as long as stock market remains a major source of
funding for the corporate world. The issue of predictability is intimately
connected to the issue of market efficiency. Because the microeconomics and
finance literature are very intense on this topic, at least during a good stretch
of the last 3 decades with such an enormous collection of writings, there has
always been some confusion on the constructions of definitions and ideas.
To begin with, it is important first to understand what rational expectation
is. Rational expectation is to some economists a very fundamental assumption
of rationality in any financial valuation or price modeling. It assumes that
agents in the market make forecasts in a rational way by incorporating all
given information at hand and then act optimally on these forecasts. The way
information is incorporated is typically in a statistical sense through the
Bayesian rule when risks and uncertainties in relevant variables are expressed
as a multivariate probability distribution.
Suppose we are given a model Yt = a + bXt + ut , ut being i.i.d. If Xt is
known or observed by the agents in the market, the rational expectation of Y t
is then the conditional mean E(Yt|Xt) = a + bXt . This is the best forecast with
minimum error. It is a statistical result no doubt, but it is the way the model
and agents in the economy work. If the forecast is made any other way, e.g. as a + bX_t − v, v > 0, then obviously it is irrational, since the forecast error

Y_t − (a + bX_t − v) = u_t + v

is expected to be larger.

Suppose that instead of using the given information X_t, the agents in the market use a subset of the information, Z_t ⊂ X_t, to form the forecast E(Y_t|Z_t) instead.
information. Not all information in Xt is applied. The market based on such
inefficient information usage, with respect to the information available to the
market, is informationally inefficient. Otherwise it is informationally efficient.
In this context, it is also referred to as market efficiency30 if there is
informational efficiency.
According to our definition, informational efficiency includes necessarily
rational expectations. However, rational expectation by itself strictly does not
ensure informational efficiency as is seen in the case above where only subset
information Zt is used. Rational expectation is a necessary but not sufficient
condition for informational efficiency. Conversely, it is not ordinarily
conceivable in usual circumstances to have irrationality and yet informational
efficiency. We shall illustrate the concept of informational efficiency with
respect to use of available information by an example as follows.
Suppose a true statistical model of returns generation over two periods is
as follows.
Figure 7.1  Two-period information tree of news and earnings (first-period good news with probability 0.4 or bad news with probability 0.6; second-period good or bad news with probabilities 0.7/0.3 following good news and 0.2/0.8 following bad news; earnings of $20 on good news and $10 on bad news)

30 There have been various other definitions of market efficiency with regard to transaction costs. However, the chief usage in finance is in the sense of an informationally efficient or inefficient market.
In the first period, there is either good news with probability 0.4 or bad news with probability 0.6. If it is good news, then in the second period, there is either good news with probability 0.7 or bad news with probability 0.3. If there is bad news, then in the second period, there is either good news with probability 0.2 or bad news with probability 0.8. The information structure is depicted in Figure 7.1 above.
One period has just passed, and the asset giving rise to the earnings is to
be priced by the market today at the start of the second period. Suppose by
some equilibrium asset pricing model taking into account risk aversion and
economy wide factors, the required risk-adjusted rate of return for this asset
over one period is 10%. This equilibrium model is sometimes referred to as
the benchmark model of equilibrium asset price.
If good news had happened in the first period, and the market has
efficiently taken this information into account, the expected earnings for next
period is 0.7 x $20 + 0.3 x $10 = $17. The price of the asset today is $17/1.1 =
$15.45.
If bad news had happened in the first period, and the market has
efficiently taken this information into account, the expected earnings for next
period is 0.2 x $20 + 0.8 x $10 = $12. The price of the asset today is $12/1.1 =
$10.91.
However, if the market is inefficient with respect to this relevant news,
and does not take this into account, the probability of good news next period is
0.4x0.7+0.6x0.2=0.4, and the probability of bad news next period is
0.4x0.3+0.6x0.8=0.6. The expected earnings for next period is 0.4 x $20 + 0.6
x $10 = $14. The price of the asset today under this informationally inefficient
market is $14/1.1 = $12.73.
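The three prices in this example can be reproduced with a few lines of Python (the numbers are exactly those of the example above):

```python
# Reproducing the example's prices under a 10% one-period required return
earn_good, earn_bad, r = 20.0, 10.0, 0.10

p_if_good = (0.7 * earn_good + 0.3 * earn_bad) / (1 + r)   # market incorporates first-period good news
p_if_bad = (0.2 * earn_good + 0.8 * earn_bad) / (1 + r)    # market incorporates first-period bad news

# Informationally inefficient market: first-period news ignored, unconditional probabilities used
p_good_news = 0.4 * 0.7 + 0.6 * 0.2
p_ignored = (p_good_news * earn_good + (1 - p_good_news) * earn_bad) / (1 + r)

print(round(p_if_good, 2), round(p_if_bad, 2), round(p_ignored, 2))   # 15.45, 10.91, 12.73
```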
Thus we see that in an informationally efficient market, when relevant
information affecting price arrives (as at end of period 1), the asset price will
adjust quickly to reflect this information. If it is good news, price will rise to
$15.45. If it is bad news, price will fall to $10.91. But if the market is not
informationally efficient, the news will not be quickly incorporated into price,
and it will stay at $12.73. If only a subset, including null set, of the revealed
information is utilized by the market, the asset price does not fully incorporate
the information and therefore will not adjust quickly. The market is then
informationally inefficient. This idea of quick price adjustment in the face of
relevant information reaching the market will again be taken up in the chapter
on Event Studies.
There are two salient points to note in the above example.
Firstly, rational expectations of prices are employed throughout, whether the
market is informationally efficient or not. If a market is rational, yet
inefficient, it means that not all available information out there is utilized or
taken as given. It could be that information is costly and thus cannot be fully
appropriated, or that market participants are slow to utilize all existing
information at once. Thus rational expectation is not a
sufficient condition for concepts of market efficiency. However, it is usually a
necessary condition, just as rationality is usually the foundation to equilibrium
asset pricing. Secondly, we note that if we use the asset price (whether or by
how much it moves or does not move) to assess whether the market is
informationally efficient, then this assessment's (or, more formally, this test's)
conclusion also depends on the benchmark model, i.e. the equilibrium required
risk-adjusted rate of return. Thus, if in the above example the benchmark model is
incorrectly specified, and the required rate of return for the asset is actually
33.5% per period, then when there is good news and the market is
informationally efficient, the resulting price is $17/1.335 = $12.73.
Using the incorrect benchmark of 10% will lead to the wrong conclusion
that the market is informationally inefficient. Hence the often quoted
expression that "a test of market efficiency is a joint test of informational
efficiency and of the correctness of the equilibrium asset pricing model". A rejection of the
null hypothesis of market efficiency could be due, not just to sampling or type
I error, but to an incorrect benchmark model even when the market is
informationally efficient. (More rarely, but plausibly, an acceptance of
informational efficiency could be due, not just to sampling or type II error, but
to incorrect benchmark model.) In summary, under a joint hypothesis, the
rejection of H0: Market Efficiency could be rejection of the benchmark model
and not market efficiency. Vice-versa, it could be rejection of market
efficiency and not the benchmark model. It could also be the rejection of both
efficiency and model. However, acceptance of H0 could also occur when the market
is in fact inefficient and the benchmark model is incorrect, the two errors offsetting
in a coincidence of sorts.
Therefore we need to understand the benchmark model a little more here.
A very general and necessary condition for intertemporal asset pricing
model is the Euler condition31

31

This condition is necessary for many forms of rational asset pricing models,
including CAPM and its many variations. It becomes a testable implication for asset
pricing, and beginning with an important theoretical work by Lucas (1978) [Asset
Prices in an Exchange Economy, Econometrica 46, 1429-1445], and an important
empirical work by Hansen and Singleton (1983) [Stochastic Consumption, Risk
Aversion, and Temporal Behavior of Asset Returns, Journal of Political Economy
91, 249-265], the investigations have spawned a whole new era of research, leading to
important results such as consumption-based asset pricing, Mehra and Prescott's
equity premium puzzle, and Hansen and Jagannathan's lower bound on stochastic
discount factor volatility. A related econometrics paper, Hansen and Singleton (1982)
[Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations
Models, Econometrica 50, 1269-1288] has also generalized results in classical


$$P_t = E_t\left[\beta\,\frac{U'(C_{t+1})}{U'(C_t)}\,P_{t+1}\right]  \qquad (7.6)$$

where P_{t+1} is next period's asset price, U'(C_t) is the marginal utility of
consumption C_t at t, and 0 < β < 1 is a time preference discount factor or
inverse of the impatience (to consume) coefficient, with higher preference for
present consumption instead of postponement if β is smaller. The term
β U'(C_{t+1})/U'(C_t) multiplying the price P_{t+1} inside the expectation is sometimes called the pricing kernel.
Such a necessary condition presupposes liquid trading in a securitized market
and the availability of information to all agents.
All agents are assumed to be alike or homogeneous, so equilibrium in the
market or market clearing price Pt can be attained at t. Importantly, it is
assumed that the conditional expectation in (7.6) is taken by agents with
respect to all available information Zt in the economy at t. This includes the
history of the asset prices {Pt}, denoted $\Phi_t^P$, and the history of all available public
information, denoted $\Phi_t^Y$, e.g. earnings records, trading records, announcements, etc.
Information in Zt outside $\Phi_t^P$ and $\Phi_t^Y$ would be any other information
at t that was not publicly available, and would include insider information.
The three information sets satisfy the following set relationship:

$$\Phi_t^P \subseteq \Phi_t^Y \subseteq Z_t .$$
Equation (7.6) provides conditions for fixing the required risk-adjusted
rate of return in the market. For example, if U(.) is quadratic, or if return rates
are MVN, we can derive the single-period CAPM as seen in earlier chapters.
In general, when rational expectation is taken in (7.6) for a general kernel, the
price P_{t+1} need not be a martingale or exhibit a random walk. A random walk
is a special case of a martingale as discussed earlier.32 Therefore we should
keep in mind that rejection of a random walk does not necessarily amount to
rejecting rational expectations and market efficiency.
As a special case, if U(.) is linear in C_t, which is the same as assuming
agents are risk-neutral, then U'(C_{t+1}) is a constant, and the ratio U'(C_{t+1})/U'(C_t) = 1.
Then we have (letting β ≈ 1 for a sufficiently small interval),

$$E_t\left(P_{t+1}\right) = P_t .  \qquad (7.7)$$

nonlinear multi-stage least squares regression, and led to an enormous production of


empirical papers applying the Generalized Method of Moments (GMM) techniques.
32
See Leroy, S.F., (1973), Risk Aversion and the Martingale Property of Stock
Returns, International Economic Review, 14, 436-446, for such a discussion.

It is important to emphasize that this is a conditional expectation taken by the
agents in the economy (and this benchmark model has only one representative
agent, so we are not dealing with a heterogeneous agents model33) with
information set Zt used by the agent. Another way to state (7.7) is that
the present price Pt is a sufficient statistic for the conditioning information set
Zt that is used by the agent. "Sufficient statistic" means that using Pt alone yields
exactly the same conditional expectation as using the entire information set Zt
at t, which of course contains the sufficient statistic.

Another representation of (7.7) takes us back to (7.3): $E\left[P_{t+1} - P_t \,|\, Z_t\right] = 0$.

Applying the law of iterated expectations, this condition implies that

$$E\left[\,E\left(P_{t+1} - P_t \,|\, Z_t\right) \,\big|\, \Phi_t^Y\right] = 0, \quad\text{or}$$

$$E\left[P_{t+1} - P_t \,|\, \Phi_t^Y\right] = 0  \qquad (7.8)$$

since $\Phi_t^Y \subseteq Z_t$. Likewise (7.3) or (7.8) implies

$$E\left[\,E\left(P_{t+1} - P_t \,|\, Z_t\right) \,\big|\, \Phi_t^P\right] = 0, \quad\text{or}$$

$$E\left[P_{t+1} - P_t \,|\, \Phi_t^P\right] = 0  \qquad (7.9)$$

since $\Phi_t^P \subseteq \Phi_t^Y \subseteq Z_t$.
In (7.3), the stochastic process {Pt} is a martingale. A related property of
random variables in a stochastic process {Qt} is that P(Q_{t+1} | Q_t, Q_{t-1}, Q_{t-2}, ...) = P(Q_{t+1} | Q_t).
Such a process is called a Markov process, and is said to exhibit the Markov property.
A Markov process {Pt} is necessary for a martingale in prices. Martingale
theory was developed in connection to the idea of fair games. As an example
of a fair game: bet $1 ; outcome 50% $2, 50% $0 (double or zero). Expected
wealth next round = $1. Under the double or zero strategy, wealth is a
martingale. Or, expected gain is always zero.
Historically, Et (Pt+1 ) = Pt or (7.3) has come to be well studied. If we
assume infinite supply of funds for arbitrage, and a very small period interval,
this should hold approximately true under speculative efficiency. 34 If the
market is efficient, any information leading to equilibrium price increase (or
33
In some cases, it may be shown that an equilibrium resulting from heterogeneous
agents is equivalent to one using a representative agent whose preference
parameters satisfy restrictions related to the heterogeneity of the agents.
34
This was called speculative efficiency by John F.O. Bilson, (1981), "The
Speculative Efficiency Hypothesis", Journal of Business, Vol 54 No 3.

decrease) as illustrated by Figure 7.1, must cause current price to adjust till the
forecast based on all available information at t equals the price Pt. Intuitively,
even if (7.3) is strictly not a benchmark model, it may be a good
approximation. It is also plausible for the economy to allow a positive drift,
i.e. $E_t(P_{t+1}) = \mu + P_t$, for μ > 0.
In view of the earlier discussion of the joint hypothesis of informational
efficiency and correct benchmark model, we simplify the issue by either
assuming that the benchmark model is correct, or that the market is
informationally efficient. Indeed this has been the implicit practice of most
empirical finance research. If we assume that the market is informationally
efficient, then whatever test of (7.6) is a test of asset pricing model, and there
has been a copious amount of research testing the Euler condition in (7.6). If
we assume the benchmark model, e.g. (7.3) or (7.8) or (7.9) is correct, then
the test of the condition amounts to testing whether the market is
informationally efficient. We shall adopt this latter assumption of a correct
benchmark, and thus (7.3) is called the strong-form market efficiency
hypothesis, (7.8) is called the semi-strong form market efficiency hypothesis,
and (7.9) is called the weak-form market efficiency hypothesis. Note that
these hypotheses deal with aspects of informational efficiency. Much of this
framework was a result of the seminal contribution by Eugene Fama. In 1970,
he surveyed the idea of an informationally efficient capital market, and made
the following famous definition: "A market in which prices always fully
reflect available information is called efficient."
What are the hypothesis implications of (7.3), (7.8), and (7.9)? Suppose
we accept (7.3) and thus also (7.8) and (7.9), the market is said to have strong-form (informational) efficiency. If we reject (7.3), but accept (7.8) and hence
also (7.9), the market is said to have semi-strong form (informational)
efficiency. If we reject (7.3) and (7.8), but accept (7.9), the market is said to
have weak-form (informational) efficiency.
What exactly are the testable implications of (7.3), (7.8), and (7.9)? How
do we test (7.3), (7.8), and (7.9)? It is easiest to test (7.9), the martingale
property of asset price. Let the price change at t+1 be defined by
et+1 = Pt+1 - Pt .
According to (7.9), its conditional expectation is zero, i.e.

$$E\left[e_{t+1} \,|\, \Phi_t^P\right] = 0 .  \qquad (7.10)$$

Applying the law of iterated expectations on (7.10) with the unconditional expectation,

$$E\left[\,E\left(e_{t+1} \,|\, \Phi_t^P\right)\right] = E\left[e_{t+1}\right] = 0 .  \qquad (7.11)$$

(7.10) likewise implies $E\left[e_t\, e_{t+1} \,|\, \Phi_t^P\right] = 0$ since $e_t \in \Phi_t^P$. Thus $E\left[e_t\, e_{t+1}\right] = 0$. Given
that the means of $e_t$, $e_{t+1}$, etc. are zero, then cov(e_{t+1}, e_t) = E(e_{t+1} e_t) = 0.
In a similar way, we can ascertain that (7.9) or equivalently (7.10)
implies more generally,
cov(et+k , et ) = 0,
(7.12)
for any k > 0 or any k < 0. Thus {et} is serially uncorrelated, or equivalently,
has zero autocorrelations at all lags k ≠ 0.
Suppose we think of the investment as buying at price Pt, selling next
period at Pt+1, and thus making gains (loss) et+1 > (<) 0, then if we repeat the
investment process long enough, the average payoff will be zero, i.e. E(e t+1) =
0. This is the notion of a fair game studied in mathematics and that has been
closely related to finance theory.
For the semi-strong form market efficiency test of (7.8), suppose we run a
regression:
$$e_{t+1} = c_0 + c_1 y_{1t} + c_2 y_{2t} + \cdots + c_k y_{kt} + \varepsilon_{t+1}$$

where the c_i's are constants, y_{it} is the i-th (i = 1, 2, ..., k) publicly available
information variable observed at t, and ε_{t+1} is a zero mean disturbance term
satisfying the classical conditions.
Then (7.8), i.e. E(e_{t+1} | y_{1t}, y_{2t}, ..., y_{kt}) = 0, is equivalent to H0: c_0 = c_1 = c_2 = ... = c_k = 0. This can
be empirically tested. The strong-form, semi-strong form and weak-form
(martingale) market efficiency conditions of (7.3), (7.8), and (7.9) are thus
closely related to the price process as a random walk process. Test of random
walk by Ljung-Box Q-statistics on the serial correlation property of a return
process is thus a weak test on the concept of market or informational
efficiency. It is a weak test for the reason discussed earlier that rejection of the
random walk is not necessarily a rejection of (7.6).
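
As a concrete illustration of the semi-strong form regression test just described, the sketch below is a minimal example with simulated placeholder data (the variable names and the use of statsmodels are my own choices, not taken from the text); it regresses e_{t+1} on the public information variables and reports the joint test on the slope coefficients.

```python
# Minimal sketch of the semi-strong form regression test, assuming e[t] holds the
# benchmark-adjusted price change e_{t+1} and Y[t, :] holds (y_{1t}, ..., y_{kt}).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T, k = 500, 3
Y = rng.normal(size=(T, k))    # placeholder public information variables
e = rng.normal(size=T)         # placeholder price changes; unforecastable under H0

X = sm.add_constant(Y)         # adds the intercept c0
res = sm.OLS(e, X).fit()
print(res.params)              # estimates of c0, c1, ..., ck
print(res.f_pvalue)            # F-test that c1 = ... = ck = 0; a t-test on the
                               # intercept handles c0 = 0 separately
```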
7.3

VARIANCE RATIO TESTS OF MARKET EFFICIENCY

A variation of the autocorrelation test is variance ratio test for random walk
processes. Suppose stock prices {Pt} follow process:
$$\ln P_{t+1} = \mu + \ln P_t + e_{t+1}  \qquad (7.13)$$

and $r_{t+1} = \ln(P_{t+1}/P_t)$ is the continuously compounded return rate. The residual
noise $e_{t+1}$ is i.i.d. and normally distributed N(0, σ²) over the interval of one
period from t to t+1. The length of this period is arbitrary, but for our
empirical data work, we shall make this one month. Then σ² is the variance of
monthly return rates. Call the return over one period r_{t+1} = r_t(1). Call the return
over two periods or two months $r_t(2) = r_{t+1} + r_{t+2} \equiv \ln(P_{t+2}/P_t)$. The return over
q periods or q months is $r_t(q) = r_{t+1} + r_{t+2} + \cdots + r_{t+q} \equiv \ln(P_{t+q}/P_t)$. This
discrete random walk process is consistent with, and implied by continuous
time geometric diffusion process that will be discussed later in the chapter on
bonds and term structure.
The important point is that var[r_t(1)] = σ², var[r_t(2)] = 2σ², ...,
var[r_t(q)] = qσ², and so on.
Therefore, the random walk model (7.13) implies that:

$$\mathrm{var}\left[r_t(q)\right] = q\,\mathrm{var}\left[r_t(1)\right],  \qquad (7.14)$$

or in general, p var[r_t(q)] = q var[r_t(p)].


Moreover, if we take the variance ratio in (7.14):

$$\frac{\mathrm{var}\left[r_t(q)\right]/q}{\mathrm{var}\left[r_t(1)\right]} = 1 .$$

Or,

$$\frac{\mathrm{var}\left[r_t(q)\right]/q}{\mathrm{var}\left[r_t(1)\right]} - 1 = 0 .  \qquad (7.15)$$

Equation (7.15) is thus the testable implication of the random walk process in
(7.13).
For a large sample size N+1 relative to q, with price data P_0, P_1, P_2, ..., P_N,
the sampling estimates of the terms in (7.15), respectively for the
numerator and denominator, are (choose N so that N/q is an integer):

$$\hat{\sigma}_q^2/q = \frac{1}{q}\,\frac{1}{N/q}\sum_{j=1}^{N/q}\left(\ln P_{qj} - \ln P_{q(j-1)} - q\hat{\mu}\right)^2  \qquad (7.16)$$

and

$$\hat{\sigma}_1^2 = \frac{1}{N}\sum_{j=1}^{N}\left(\ln P_j - \ln P_{j-1} - \hat{\mu}\right)^2  \qquad (7.17)$$

where $\hat{\mu} = \frac{1}{N}\sum_{j=1}^{N}\left(\ln P_j - \ln P_{j-1}\right)$.

By the Law of Large Numbers for the strongly stationary process r_t(q), for
any q, as N → ∞, $\hat{\sigma}_q^2 \to \sigma^2$ and $\hat{\sigma}_1^2 \to \sigma^2$ in terms of asymptotic convergence. In
(7.17), by the Central Limit Theorem, the sample average $\hat{\sigma}_1^2$ converges
asymptotically to a normal distribution with mean

$$E\left(\hat{\sigma}_1^2\right) = E\left[\frac{1}{N}\sum_{j=1}^{N}\left(\ln P_j - \ln P_{j-1} - \hat{\mu}\right)^2\right] \approx \sigma^2$$

and asymptotic variance

$$\mathrm{var}\left(\hat{\sigma}_1^2\right) \approx \frac{1}{N}\,\mathrm{var}\left[\left(\ln P_j - \ln P_{j-1} - \mu\right)^2\right] = \frac{1}{N}\,\mathrm{var}\left(e_{t+1}^2\right) = \frac{1}{N}\,\mathrm{var}\left(\sigma^2 z_{t+1}^2\right) = \frac{2\sigma^4}{N}$$

where $z_{t+1} \sim N(0,1)$, and $\mathrm{var}(z_{t+1}^2)$ is the variance of a $\chi_1^2$ (chi-square with 1 degree of
freedom), i.e. 2. The above statement is informal since as N → ∞ the right hand
side of the asymptotic variance goes to zero. More appropriately, we write:

$$\sqrt{N}\left(\hat{\sigma}_1^2 - \sigma^2\right) \xrightarrow{d} N\left(0,\ 2\sigma^4\right).  \qquad (7.18)$$

In (7.16), by the Central Limit Theorem, the sample average $\hat{\sigma}_q^2/q$ converges
asymptotically to a normal distribution with mean

$$E\left(\hat{\sigma}_q^2/q\right) = \frac{1}{q}\,E\left[\frac{1}{M}\sum_{j=1}^{M}\left(\ln P_{qj} - \ln P_{q(j-1)} - q\hat{\mu}\right)^2\right] \approx \frac{q\sigma^2}{q} = \sigma^2$$

and asymptotic variance

$$\mathrm{var}\left(\hat{\sigma}_q^2/q\right) \approx \frac{1}{Nq}\,\mathrm{var}\left[\left(\ln P_{qj} - \ln P_{q(j-1)} - q\mu\right)^2\right] = \frac{1}{Nq}\,\mathrm{var}\left(Y^2\right) = \frac{1}{Nq}\,\mathrm{var}\left(q\sigma^2 z_{t+1}^2\right) = \frac{q}{N}\,\mathrm{var}\left(\sigma^2 z_{t+1}^2\right) = \frac{2q\sigma^4}{N}$$

where $Y \sim N(0,\, q\sigma^2)$, and $\mathrm{var}(z_{t+1}^2)$ is the variance of a $\chi_1^2$ (chi-square with 1 degree
of freedom), i.e. 2. We let M = N/q for clearer notation. In this case M → ∞ as
N → ∞ for a fixed q.
Hence,

$$\sqrt{N}\left(\hat{\sigma}_q^2/q - \sigma^2\right) \xrightarrow{d} N\left(0,\ 2q\sigma^4\right).  \qquad (7.19)$$

Consider the statistic $J_d \equiv \hat{\sigma}_q^2/q - \hat{\sigma}_1^2$.
It can be shown that asymptotically, by the LLN and combining the sampling variations of $\hat{\sigma}_q^2/q$ and $\hat{\sigma}_1^2$,

$$\sqrt{N}\,J_d = \sqrt{N}\left(\hat{\sigma}_q^2/q - \hat{\sigma}_1^2\right) \xrightarrow{d} N\left(0,\ (2q-1)\,\sigma^4\right).$$

Dividing by $\hat{\sigma}_1^2$, which in large samples tends to the constant σ², we may
approximately obtain for large N:

$$\sqrt{N}\,J_r \xrightarrow{d} N\left(0,\ 2q-1\right)$$
where $J_r \equiv J_d/\hat{\sigma}_1^2$. Note that $J_r = \mathrm{VR}(q) - 1$, where the variance ratio of order q is

$$\mathrm{VR}(q) = \frac{\hat{\sigma}_q^2/q}{\hat{\sigma}_1^2} .$$
We may call $\sqrt{N}\,J_r \sim N(0,\ 2q-1)$ the variance ratio test statistic35 on the
random walk hypothesis H0: $\ln P_{t+1} = \mu + \ln P_t + e_{t+1}$. If the test statistic
deviates from zero, it means $\hat{\sigma}_q^2/q$ and $\hat{\sigma}_1^2$ are different, or that the ratio
deviates from one.
We use the following Z-test statistic, which is approximately distributed as
standard normal N(0,1) for finite sample size N:

$$Z = \sqrt{\frac{N}{2q-1}}\;J_r \;\sim\; N(0,1), \qquad \text{where } J_r = \frac{\hat{\sigma}_q^2/q}{\hat{\sigma}_1^2} - 1 .$$
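
A compact way to see the mechanics of this statistic is to code it directly. The sketch below is my own minimal illustration (function name and the use of simulated prices are assumptions, not from the text); it implements the non-overlapping estimators (7.16)–(7.17) and the Z statistic exactly as defined above.

```python
# Variance ratio test statistic from non-overlapping q-period and 1-period returns,
# following (7.16), (7.17) and Z = sqrt(N/(2q-1)) * Jr as defined in the text.
import numpy as np

def variance_ratio_z(log_prices, q):
    r = np.diff(log_prices)                      # one-period log returns
    N = (len(r) // q) * q                        # trim so that N/q is an integer
    r = r[:N]
    mu = r.mean()
    sig2_1 = np.mean((r - mu) ** 2)              # (7.17)
    rq = r.reshape(N // q, q).sum(axis=1)        # non-overlapping q-period returns
    sig2_q_over_q = np.mean((rq - q * mu) ** 2) / q   # (7.16)
    Jr = sig2_q_over_q / sig2_1 - 1.0
    Z = np.sqrt(N / (2 * q - 1)) * Jr
    return Jr, Z

# Illustration on a simulated random walk with drift (not the S&P 500 data of Table 7.1)
rng = np.random.default_rng(1)
logP = np.cumsum(0.005 + 0.04 * rng.standard_normal(720))
for q in (2, 3, 6, 12, 24, 36):
    print(q, variance_ratio_z(logP, q))          # Z should be small under the random walk null
```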
We collect monthly (month-end) S&P 500 index prices Pt from Yahoo
Finance in the sampling period 1949 December to December 2009. The
statistic Jr , the variance ratio test statistic Z =

N
J r , and the p-values of
2q - 1

the normality tests are reported as follows for different q = 2, 3, 6, 12, 24, and
36. N is 720 observations here.
Table 7.1 shows that the random walk hypothesis is rejected in 2-tailed
test at 10% significance level for all cases except when q=36. If significance
level is 5%, then it is also not rejected at q=24. However, for q ≤ 12, there is a
very strong indication of positive autocorrelation in the monthly returns. This
35

See Lo, Andrew and Craig MacKinlay, (1988), Stock Prices do not Follow
Random Walks: Evidence from a Simple Specification Test, The Review of Financial Studies,
Vol.1, Spring, 41-66, for a more general discussion involving overlapping data.

led to the variance ratio being larger than one, and hence the test statistic is
greater than zero.
Table 7.1
Variance Ratio Test of Random Walk Hypothesis on S&P 500 Prices

q          2         3         6         12        24        36
Jr         0.2298    0.4992    0.4667    0.9622    0.4507    0.4330
Z          3.5605    5.9905    3.7762    5.3838    1.7640    1.3788
p-value    0.0004    0.0000    0.0002    0.0000    0.0777    0.1679

7.4

ALTERNATIVE HYPOTHESES

We can see how the variance ratio is related to autocorrelations when random
walk does not hold. Define rt(2) = rt + rt-1 . This is a 2-period return ln[Pt/Pt-2] .
Define the variance ratio of a 2-period continuously compounded return to
twice of a 1-period continuously compounded return:

$$\mathrm{VR}(2) = \frac{\mathrm{var}\left[r_t(2)\right]}{2\,\mathrm{var}\left[r_t\right]} = \frac{\mathrm{var}\left[r_t + r_{t-1}\right]}{2\,\mathrm{var}\left[r_t\right]} = \frac{2\,\mathrm{var}\left[r_t\right] + 2\,\mathrm{cov}\left(r_t, r_{t-1}\right)}{2\,\mathrm{var}\left[r_t\right]} = 1 + \rho(1) .$$

Under the random walk (weak version), ρ(1) = 0, so VR(2) = 1. Under the
random walk, VR(q) = 1 for all q. However, we see that the variance ratio test
statistic VR(2) − 1 = ρ(1). In Table 7.1, for q = 2 (aggregation of returns over 2
months), the test statistic Jr = 0.2298, which is an estimate of ρ(1). Hence there is positive
autocorrelation or serial correlation in the case of S&P 500 index returns
during the sampling period.
Indeed alternative hypothesis such as negative serial correlation in an AR
process of return rates would produce a VR(q) that deviates significantly from
1. When serial correlation is negative, VR(q) < 1, and vice-versa. We show
that as follows.
Suppose the generating process of returns is not (7.5) but an AR(1),
$r_t = \phi\,r_{t-1} + \varepsilon_t$, where φ < 0. Recall from the previous chapter on time
series models that the first order autocorrelation is then φ < 0, and φ must satisfy |φ| <
1 for a stationary AR(1).
Now $r_t(q) = r_t + r_{t-1} + r_{t-2} + \cdots + r_{t-(q-1)}$.

The covariance matrix of $\left(r_t, r_{t-1}, \ldots, r_{t-(q-1)}\right)$ is

$$\mathrm{Var}\left(r_t\right)\begin{pmatrix} 1 & \phi & \phi^2 & \cdots & \phi^{q-1} \\ \phi & 1 & \phi & \cdots & \phi^{q-2} \\ \phi^2 & \phi & 1 & \cdots & \phi^{q-3} \\ \vdots & & & \ddots & \vdots \\ \phi^{q-1} & \phi^{q-2} & \phi^{q-3} & \cdots & 1 \end{pmatrix},$$

so that, summing all the elements,

$$\mathrm{Var}\left[r_t(q)\right] = \mathrm{Var}\left(r_t\right)\left[q + 2(q-1)\phi + 2(q-2)\phi^2 + 2(q-3)\phi^3 + \cdots + 2\phi^{q-1}\right].$$

Thus,

$$\begin{aligned}
\mathrm{VR}(q) &= \frac{\mathrm{Var}\left[r_t(q)\right]}{q\,\mathrm{Var}\left(r_t\right)} \\
&= 1 + \frac{2}{q}\left[(q-1)\phi + (q-2)\phi^2 + \cdots + \phi^{q-1}\right] \\
&= 1 + \frac{2}{q}\cdot\frac{q\phi(1-\phi) - \phi\left(1-\phi^q\right)}{(1-\phi)^2} \\
&= 1 + \frac{2\phi}{1-\phi} - \frac{2\phi\left(1-\phi^q\right)}{q\left(1-\phi\right)^2} .
\end{aligned}$$
From the above, if VR(q) < 1 (q > 1), then φ < 0 where |φ| < 1. Conversely, in
general, when there is positive first order correlation, φ > 0, then VR(q) > 1.
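
A quick numerical check of the closed-form expression above is straightforward; the short sketch below (my own illustration, with arbitrary values of φ and q) compares it against a brute-force sum of the AR(1) autocovariance weights.

```python
# Quick numerical check of the closed-form VR(q) above against a brute-force sum.
def vr_closed_form(phi, q):
    return 1 + 2 * phi / (1 - phi) - 2 * phi * (1 - phi ** q) / (q * (1 - phi) ** 2)

def vr_brute_force(phi, q):
    s = sum((q - k) * phi ** k for k in range(1, q))   # sum of (q-k) * phi^k
    return 1 + 2 * s / q

for phi in (-0.2, 0.2):
    for q in (2, 6, 12):
        print(phi, q, round(vr_closed_form(phi, q), 6), round(vr_brute_force(phi, q), 6))
# phi < 0 gives VR(q) < 1; phi > 0 gives VR(q) > 1, as stated in the text.
```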
The early literature on stock prices indicated that they behave like random
walks.36 If the strong-form market efficiency is true, stock return is not
predictable based on any past information. If price is a random walk and there
is no memory or correlation in returns, then VR(q) = 1 for any q. Statistically
significant deviation from 1 indicates long term memory.
Variance ratio tests indicated that stocks have smaller variances over a
long horizon than indicated by a random walk process. Thus there is some
exhibition of price memory. The existence of prices exhibiting memory may
be explained by a short-run disequilibrium during which prices do not adjust
quickly because of the cost of acquiring and evaluating information. Slow
price adjustment and informational inefficiency for limited periods of time,
can offer a window for profitable trading using simple technical trading
methods. This may explain why technical trading or charting activities persist
from time to time because of such short windows of informational
inefficiencies. But of course, it has also been found that many claimed
technical profits disappear when the trading costs such as commissions and
fees are counted. Smaller variances over a longer horizon could also reflect
negative return autocorrelation due to price reversion as a result of initial over-reaction.
7.5

PROBLEM SET

7.1 Suppose we are testing a joint hypothesis that both statements A and B
are simultaneously true. Coming up with some statistical measures, we
perform a test and show that the joint hypothesis that A and B are true is
rejected. Does this mean that by itself statement B is not true?
36

The intellectual debate of the 1960s through the early 1980s was typified by the
popular book, A Random Walk down Wall Street, written by Burton Malkiel. Those
were exciting times when funds managers tried to invite investments by promising
superior returns that beat the market. As it turned out, funds generally did not
consistently beat the market, and indeed most funds lost to the market index in terms
of performance after commission and costs were deducted.

7.2 A simple two period market works as follows. In the first period, the
market opens for trade and an equilibrium price of $10 is reached for
stock X. It is common knowledge that in the next period when market
trades, 4 states of the world are possible. S1, S2, S3, and S4 can occur
with probabilities 0.1, 0.4, 0.4, and 0.1 respectively. Share prices P2 of
$15, $12, $10, and $7 are associated with each of these states in period 2
respectively. Investors' risk-adjusted required rate of return in the market
is homogeneously 10% per period. In period 1, suppose an informed
investor has private information that one of the states in the set
{S1, S2, S3} will occur in period 2.
(i) Conditional on this private information, what is the informed investor's expected stock price
E(P2) for X in period 1? What is the informed investor's new revised
price for X in period 1?
(ii) What is the uninformed price?
(iii) If the investor managed to buy the stock X at $10.20, what is his
abnormal expected return?
7.3 Suppose we can collect the trading activity report of corporate officers
(filed with Exchange Commission) and run a regression
$e_{t+1} = c_0 + c_1 I_t + \varepsilon_{t+1}$
where et+1 is residual return over day [t,t+1] after fitting a model such as
market model or CAPM, and It = 1 if officer buys own stocks, -1 if he
sells own stocks at time t. What does the significance of the OLS estimate of
c1 imply about market efficiency? Which form of efficiency does this test concern?
7.4 The tree below shows the resolution of uncertainties over time. Each
branch represents a state of nature occurring at that time period (t,t+1].
The number on the branch indicates the probability of occurrence of the
state. The letter at each node denotes the state at that time t. By t=3, the
security prices P3 are revealed as shown on the nodes.
(i) Find conditional expectations E(P3|B) and E(P3|X) at t=1.
(ii) Hence find the expectation of P3 at t=1 when there is no information.
(iii) Suppose the per period risk-adjusted discount rate of the security is
(4/3)1/2-1, independent of the information. What would be the
different prices of the security at t=1 given information B, given
information X, and given no information?
(iv) If the observed market price at t=1 is $3.45, what is a possible
conclusion to draw about market informational efficiency?

[Figure for Problem 7.4: an event tree from t=0 (state A) through t=1 (states B and X) and t=2 (including a state Z) to t=3. Branch probabilities shown on the tree are 2/3, 1/3, 1/5 and 4/5; the security prices P3 revealed at the terminal nodes at t=3 are $10, $8, $8, $6, $5, $4, $2 and $1.]
7.5 For the AR(1) process $r_t = \phi\,r_{t-1} + \varepsilon_t$, show that the variance ratio

$$\mathrm{VR}(q) = 1 + \frac{2\phi}{1-\phi} - \frac{2\phi\left(1-\phi^q\right)}{q\left(1-\phi\right)^2}$$

where VR(q) = var[r_t(q)]/(q var[r_t]), r_t = ln(P_t/P_{t-1}),
var[r_t(q)] = var[r_t + r_{t-1} + r_{t-2} + ... + r_{t-(q-1)}], and P_t is the stock price at time t.
How does the variance ratio show whether there is long term memory in the
returns process?
7.6 Suppose an investor trades with the following rules. Tracking a particular
stock, if within trading hours in a day its price starts to climb from a low
turning point by y% (take this as 2 ticks where 1 tick refers to the
minimum trading price change), he buys the stock and holds, as well as
liquidating any outstanding short positions. Subsequently if the price starts
to drop from a high turning point by y% (take this as 2 ticks), he sells
the long position and then further shorts the stock. If price changes are
less than y% in the interim, he does nothing. You may call this some form
of technical analysis or charting, although there are many fanciful charting
techniques out in the market. If the trader can make positive returns after
paying for the transaction costs, what would you conclude about market
efficiency?

FURTHER RECOMMENDED READINGS
[1] John F.O. Bilson, (1981), The Speculative Efficiency Hypothesis, Vol
54 No 3, Journal of Business.
[2] John Y. Campbell, Andrew W. Lo, and A. Craig MacKinlay, (1997), The
Econometrics of Financial Markets, Princeton University Press.
[3] Fama, Eugene F., (1970), Efficient Capital Markets: A Review Of
Theory And Empirical Work, Journal Of Finance, Volume 25, Issue 2,
Papers And Proceedings Of The Twenty-Eighth Annual Meeting Of The
American Finance Association New York, N.Y. December, 28-30, 1969,
383-417.
[4] Kian-Guan Lim, (2007), The Efficient Markets Hypothesis: A
Developmental Perspective, Chapter 9 of Pioneers of Financial
Economics Volume II: Twentieth Century Contributions edited by
Geoffrey Poitras.
[5] Burton Malkiel, (1996), A Random Walk Down Wall Street. Revised
Edition, W Norton & Company.


Chapter 8
AUTOREGRESSION AND PERSISTENCE
APPLICATION: PREDICTABILITY
Key Points of Learning
Mean Reversion, Dividend-Price ratio, Forward shift operator, Long-run
return, Serial correlation, Out-of-sample forecast, Price-earnings ratio,
Business cycle, Momentum, Volume

In this chapter we explore how long-run mean reversion in stock returns could
have significant implication on the forecast or predictability of such returns.
8.1

MEAN REVERSION

Since the late 1980s, there has been a strong surge of research interest in the
belief that stock returns are after all predictable to some extent. We shall see
that this means a certain mean-reverting pattern or mean reversion in returns
over the long run, and also long-term predictability based on some current
fundamental information. It also means that random walk holds approximately
true only for the very short-run prices, but is not appropriate for long-run
prices.
An example of a mean-reversion process is

$$Y_{t+1} = c - \phi\,Y_t + e_{t+1}$$

where 0 < φ < 1, and e_{t+1} is a zero mean i.i.d. process. When Y_t is large, Y_{t+1}
tends to be smaller, and vice-versa. The unconditional mean of the stationary
AR(1) process is c/(1+φ). The conditional mean is E_t(Y_{t+1}) = c − φY_t.
If Y_t is above the mean c/(1+φ), i.e. Y_t > c/(1+φ), then
E_t(Y_{t+1}) = c − φY_t < c − φ[c/(1+φ)] = c/(1+φ).
Conversely, if Y_t is below the mean, Y_t < c/(1+φ), then E_t(Y_{t+1}) > c/(1+φ).
Thus, we see the mean reversion effect. When Yt exceeds the unconditional
mean or the long run mean level, the expected Yt+1 next period will be
below this mean. Conversely, when Yt is below this long run mean level,
then the expected Yt+1 will rise above it.
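
To see this pull towards the long-run mean in action, here is a small simulation sketch (the values of c, the coefficient and the noise scale are my own illustrative choices, not taken from the text).

```python
# Simulate Y_{t+1} = c - phi*Y_t + e_{t+1} and compare against the long-run mean c/(1+phi).
import numpy as np

rng = np.random.default_rng(42)
c, phi, sigma, T = 1.0, 0.6, 0.2, 1000
long_run_mean = c / (1 + phi)              # = 0.625 here

Y = np.empty(T)
Y[0] = long_run_mean
for t in range(T - 1):
    Y[t + 1] = c - phi * Y[t] + sigma * rng.standard_normal()

print(long_run_mean, Y.mean())             # sample mean should be close to c/(1+phi)
# Whenever Y[t] is above the long-run mean, the conditional mean c - phi*Y[t] lies below it.
above = Y[:-1] > long_run_mean
print(np.all((c - phi * Y[:-1][above]) < long_run_mean))   # True
```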
As seen in the earlier chapter on market efficiency, when stock returns
exhibit memory, as in negative autocorrelation, they would follow a mean-reversion process, though the reversion could be very small and detectable
only over a long horizon.

This idea of stock return memory is also affirmed by a separate study by
Fama and French (1988)37 who ran regressions of dependent variable
$r_{t,t+k} = \sum_{j=1}^{k} r_{t+j}$ (the return over a k-period horizon) on its own lag, i.e.

$$r_{t,t+k} = a + b_k\,r_{t-k,t} + e_{t+k} .$$
They found the estimated coefficient $b_k$ to be negative. There is thus
empirical evidence that market prices exhibit some form of memory pattern
within them and future prices have some predictability content within them
from past prices.
For example, suppose $r_{t+1} = \phi\,r_t + \varepsilon_{t+1}$, with a small mean reversion, φ < 0.
Then,

$$\ln P_{t+1} - \ln P_t = \phi\left(\ln P_t - \ln P_{t-1}\right) + \varepsilon_{t+1},$$

for φ < 0. Thus, $\ln P_{t+1} = (1+\phi)\ln P_t - \phi\ln P_{t-1} + \varepsilon_{t+1}$.
Thus it is seen that the past price P_{t-1} (apart from the current P_t) does predict
the future P_{t+1}, i.e. there is also price memory as a result of, or in connection with,
predictability in stock returns when φ ≠ 0.
Though random walk of prices has come to be pretty much accepted as
given in the short run, the long-run issue has become quite important from the
point of view of investors who invest for the long-run. The long-run situation
has paramount implications for pension funds, insurance companies, and the
generally greying population considering how their investment plans would
work out in the long horizon. One implication of long-term memory and mean
reversion in stock prices is that volatility of stock returns over a long horizon
does not increase as much as is prescribed by a random walk process. This
smaller risk would induce a larger fraction of wealth to be allocated to equity
if the investor holds a longer horizon, and is not about to liquidate the
positions over short horizons.
8.2

PRICE-DIVIDEND RATIO

One possibility that could explain returns predictability is that the log dividend-price ratio is positively correlated with return, and more significantly so over a
longer horizon. We shall explore this idea further.
We shall show how dividend-price ratio or its inverse price-dividend ratio
is a fundamental input to stock return. Let us start with the definition of stock
return over one period as the sum of capital gain Pt+1/Pt and dividend yield
37

Fama, E.F., and K. French, (1988), Permanent and Temporary Components of


Stock Prices, Journal of Political Economy 96, 246-273.

D_{t+1}/P_t :

$$R_{t+1} = \frac{P_{t+1} + D_{t+1}}{P_t} .  \qquad (8.1)$$

For special cases, if D_t = 0 for each t, then ln R_{t+1} = ln P_{t+1} − ln P_t. In other


words, the return is fully characterized as a capital gain if there is no dividend.
However, for most stocks, dividend issue will be a regular and important
concern to stock holders.
From (8.1), we can obtain the following expression:

$$R_{t+1}\,\frac{P_t}{D_t} = \frac{P_{t+1} + D_{t+1}}{D_t} = \frac{D_{t+1}}{D_t}\left(1 + \frac{P_{t+1}}{D_{t+1}}\right), \quad\text{or}\quad
\frac{P_t}{D_t} = \frac{1}{R_{t+1}}\,\frac{D_{t+1}}{D_t}\left(1 + \frac{P_{t+1}}{D_{t+1}}\right).$$

The left-hand side of the above definition is the price-dividend ratio of the
stock. Taking natural logarithms:

$$p_t - d_t = -r_{t+1} + \Delta d_{t+1} + \ln\left(1 + e^{\,p_{t+1}-d_{t+1}}\right)  \qquad (8.2)$$

where $p_t \equiv \ln P_t$, $d_t \equiv \ln D_t$, and $\Delta d_{t+1} = d_{t+1} - d_t$. Note that $r_{t+1} = \ln R_{t+1}$ is now a
continuously compounded return rate.
We employ a Taylor series expansion to obtain a linear approximation of

$$\ln\left(1 + e^{\,p_{t+1}-d_{t+1}}\right) \approx \ln\left(1 + e^{k}\right) + \rho\left(p_{t+1} - d_{t+1} - k\right)$$

where k is a constant close to the stochastic $(p_{t+1} - d_{t+1})$, and $\rho = \frac{1}{1 + e^{-k}}$.
Then from (8.2):

$$\begin{aligned}
p_t - d_t &= -r_{t+1} + \Delta d_{t+1} + \ln\left(1 + e^{k}\right) + \rho\left(p_{t+1} - d_{t+1} - k\right) \\
&= c + \rho\left(p_{t+1} - d_{t+1}\right) + \Delta d_{t+1} - r_{t+1}
\end{aligned}  \qquad (8.3)$$

where the constant $c = \ln\left(1 + e^{k}\right) - \rho k$ and $0 < \rho < 1$.

We shall assume that

$$c + \rho\left(p_{t+1} - d_{t+1}\right) > 0  \qquad (8.4)$$

since this term is really the approximation for $\ln\left(1 + e^{\,p_{t+1}-d_{t+1}}\right) > 0$.

Let F be the one-period forward shift operator. This is the inverse of the
backward shift operator B we saw in an earlier chapter: F = B^{-1}, so F(B Y_t) = B^{-1}(B Y_t) = Y_t.
The operator can be manipulated with the usual algebraic properties in this case. Now, (8.3) gives

$$\begin{aligned}
p_t - d_t - \rho F\left(p_t - d_t\right) &= c + \Delta d_{t+1} - r_{t+1} \\
p_t - d_t &= \frac{1}{1 - \rho F}\left(c + \Delta d_{t+1} - r_{t+1}\right) \\
&= \left(1 + \rho F + \rho^2 F^2 + \cdots\right)\left(c + \Delta d_{t+1} - r_{t+1}\right) \\
&= \frac{c}{1 - \rho} + \sum_{j=1}^{\infty}\rho^{\,j-1}\left(\Delta d_{t+j} - r_{t+j}\right).
\end{aligned}  \qquad (8.5)$$

Taking conditional expectation based on information up to and including t,

$$p_t - d_t = \frac{c}{1 - \rho} + E_t\!\left[\sum_{j=1}^{\infty}\rho^{\,j-1}\left(\Delta d_{t+j} - r_{t+j}\right)\right].  \qquad (8.6)$$

Equation (8.6) shows that high price-dividend ratio on the left-hand side is
associated with high dividend growth or else low return in the future. (8.6)
shows that price/dividend ratios vary over time because they forecast changing
future dividend growth rates and returns.
If future dividend growth or returns in (8.6) are not forecastable, then
$p_t - d_t$ may be assumed to be constant. Volatility in $p_t - d_t$ is largely due to
changing expected future returns (i.e. volatile information impact on
expectations). From (8.3),
$$r_{t+1} = c + \rho\left(p_{t+1} - d_{t+1}\right) + \Delta d_{t+1} + \left(d_t - p_t\right).  \qquad (8.7)$$

In (8.7), we should require the expected future stock return to be positive (even if
realized returns may sometimes be negative). We can rewrite (8.7) as

$$E\left(r_{t+1}\right) = E\left[c + \rho\left(p_{t+1} - d_{t+1}\right)\right] + E\left[\Delta d_{t+1} + d_t - p_t\right].$$

By (8.4), the first term on the right-hand side is > 0, so a sufficient condition for
E(r_{t+1}) > 0 is

$$E\left[\Delta d_{t+1} + d_t - p_t\right] \ge 0 .  \qquad (8.8)$$
Suppose the log dividend-price ratio38 (or simply dividend-price ratio from
38

We should be careful to point out that this is not the usual dividend yield used in
existing literature. Dividend yield is dividend divided by the previous period price,
while dividend-price ratio is dividend divided by the current end-of-period price.
However, this difference is sometimes blurred.

now on) has long persistence (i.e. it is a slow-moving variable):

$$d_{t+1} - p_{t+1} = \phi\left(d_t - p_t\right) + \epsilon_{t+1}  \qquad (8.9)$$

where $0 < \phi < 1$ is close to 1, and $\epsilon_{t+1}$ is i.i.d. with a mean not necessarily
equal to 0. Then from (8.7),

$$r_{t+1} = c + \Delta d_{t+1} + \left(1 - \rho\phi\right)\left(d_t - p_t\right) - \rho\,\epsilon_{t+1},$$

or

$$r_{t+1} = c + \left(1 - \rho\phi\right)\left(d_t - p_t\right) + \Delta d_{t+1} - \rho\,\epsilon_{t+1} .  \qquad (8.10)$$

Note that the last term $\Delta d_{t+1} - \rho\,\epsilon_{t+1}$ could be highly serially correlated if $\Delta d_{t+1}$ is highly
serially correlated. This may be seen from (8.9) if we take the first difference:

$$\Delta d_{t+1} - \Delta p_{t+1} = \phi\left(\Delta d_t - \Delta p_t\right) + \Delta\epsilon_{t+1}, \quad\text{or}\quad
\Delta d_{t+1} = \phi\,\Delta d_t + \left(\Delta p_{t+1} - \phi\,\Delta p_t + \Delta\epsilon_{t+1}\right)$$

where the last term in parentheses is close to i.i.d. given that prices are close to random
walks. From (8.10),

$$E\left(r_{t+1}\right) = c + \left(1 - \rho\phi\right)E\left(d_t - p_t\right) + E\left(\Delta d_{t+1} - \rho\,\epsilon_{t+1}\right)
= \beta\,E\left(d_t - p_t\right) + c + E\left(\Delta d_{t+1} - \rho\,\epsilon_{t+1}\right) > 0$$

where $\beta \equiv 1 - \rho\phi \in (0,1)$.

We assume that $D_t < P_t$ or $(d_t - p_t) < 0$, so it is necessary that

$$c + E\left(\Delta d_{t+1} - \rho\,\epsilon_{t+1}\right) > 0  \qquad (8.11)$$

in order that E(r_{t+1}) may be positive. Now, let

$$c + \Delta d_{t+1} - \rho\,\epsilon_{t+1} = \alpha + \eta_{t+1}$$

where $\eta_{t+1}$ has zero mean and may be serially correlated. Substituting this into
(8.10), then

$$r_{t+1} = \alpha + \beta\left(d_t - p_t\right) + \eta_{t+1} .  \qquad (8.12)$$

If $r_{t+1}$ is a constant, in this case $\alpha$, then de facto $\beta\left(d_t - p_t\right)$ and also $\eta_{t+1}$ are
zeros. In this case, $\alpha$ is the riskfree rate at time t+1. We define

$$r^*_{t+1} \equiv r_{t+1} - \alpha$$

to be the excess return at time t+1, so $r^*_{t+1} = \beta\left(d_t - p_t\right) + \eta_{t+1}$.
Then the system of regression equations can be written as

$$\begin{aligned}
r^*_{t+1} &= \beta\left(d_t - p_t\right) + \eta_{t+1} \\
r^*_{t+2} &= \beta\left(d_{t+1} - p_{t+1}\right) + \eta_{t+2} \\
&\;\;\vdots \\
r^*_{t+k} &= \beta\left(d_{t+k-1} - p_{t+k-1}\right) + \eta_{t+k} .
\end{aligned}  \qquad (8.13)$$
Notice that β, where 0 < β < 1, is small since ρφ is close to 1, and that φ close to 1 is large. Then,
adding up the equations in (8.13) and using (8.9),

$$r^*_{t+k} + r^*_{t+k-1} + \cdots + r^*_{t+1} = \beta\left(1 + \phi + \phi^2 + \cdots + \phi^{k-1}\right)\left(d_t - p_t\right) + u_{t,t+k}$$

where the stationary disturbance u_{t,t+k} involves terms in η_{t+k}, η_{t+k-1}, ..., η_{t+1}, ε_{t+k-1},
ε_{t+k-2}, ..., ε_{t+1}, and may be serially correlated. Let the left-hand side of the
above equation be the excess continuously compounded k-period return rate,
which can be written as

$$r^*_{t,t+k} = \gamma\left(d_t - p_t\right) + u_{t,t+k}  \qquad (8.14)$$

where $\gamma = \beta\,\dfrac{1 - \phi^k}{1 - \phi}$. It should be noted that γ increases with k. This
means that if we perform a regression using (8.14) by using an excess return


over a longer holding period or horizon, i.e. larger k, then the effect seen by
the current dividend-price ratio (d_t − p_t) in predicting this return will be larger.
This is due to the persistence of the dividend-price ratio according to (8.9),
with a large 0 < φ < 1, hence the notion that stock returns are predictable
especially for longer horizons.
8.3

OUT-OF-SAMPLE FORECASTS

Fama and French (1988, 1989)39 employed the idea in (8.14) to investigate the
influence of dividend-price ratio on future stock returns and found non-trivial
effects. (8.14) can be performed as it is, or in ratio forms without taking logs,
or by adding an intercept (as long as the estimate of this is not significantly
large). Thus, we can run OLS on
$$r_{t,t+k} = a + b\left(d_t - p_t\right) + u_{t,t+k}  \qquad (8.15)$$

using sample t = 1, ..., T−k. Now, even though u_{t,t+k} may be serially correlated, as
long as it is not contemporaneously correlated with d_t − p_t, then $\hat{a}$ and $\hat{b}$ are
unbiased and consistent. Because of the limitation in the lack of a long enough
time series, overlapping data usually have to be used in the regression
especially for large k>1, or long horizon. For example, if we use dependent
variables that are excess returns over 4-year holding period or horizon, the
regression equations would look like
r1,5 = a + b(d1 − p1) + u1,5
39

Fama Eugene F., and Kenneth R. French, (1988), Dividend Yields and Expected
Stock Returns, Journal of Financial Economics 22, 3-27. Fama Eugene F., and
Kenneth R. French, (1989), Business Conditions and Expected Returns on Stocks
and Bonds, Journal of Financial Economics 25, 23-49.

r2,6 = a + b(d2 − p2) + u2,6
r3,7 = a + b(d3 − p3) + u3,7
...
rT-4,T = a + b(dT-4 − pT-4) + uT-4,T
where subscripts (i,j) denote end of year i to end of year j. Thus the excess
returns are 4-year returns. More specifically, a dependent data point, say r3,7,
would mean log(P7/P3), or the difference between the log of price at the end of the 7th
year and the log of price at the end of the 3rd year. And the explanatory variable
(d3 − p3) would be the log of total dividends issued over the 3rd year less the
log of price at the end of the 3rd year.
If more frequent dividends e.g. quarterly dividends are available, and each
period t denotes a quarter, then in the above setup, r3,7 would mean log(P7/P3)
or the difference of log of price at the end of the 7th quarter and log of price at
the end of the 3rd quarter. And the explanatory variable (d3 p3) would be the
log of total dividends issued over the 3rd quarter (not year) less the log of price
at the end of the 3rd quarter. Moreover, as dividends are sometimes reported in
annualized terms, be sure to convert the annualized figure to a quarterly
figure for such reported dividends. Care with the data cannot be over-emphasized if the theory or model is to be properly tested and utilized.
The OLS regression produces estimates $\hat{a}$ and $\hat{b}$ that are unbiased and
consistent. However, since the errors u_{t,t+k} may be serially correlated (and
perhaps even heteroskedastic), the OLS t-statistics may not be appropriate for
inference, and must be read with care. Given estimates $\hat{a}$ and $\hat{b}$, the out-of-sample
forecast of a 4-year return starting at the end of the regression sample T
would be for the return over [T, T+4], and is given by:

$$E_T\left(r_{T,T+4}\right) = \hat{a} + \hat{b}\left(d_T - p_T\right)  \qquad (8.16)$$

provided we enter the current (d_T − p_T) on the right-hand side.
Based on (8.16), suppose we let $X_T \equiv d_T - p_T$ for notational
convenience, and with sample size N = T−4, then the 90% confidence interval
of the forecast $E_T(r_{T,T+4})$ or $\hat{r}_{T,T+4}$ is

$$\hat{r}_{T,T+4} \;\pm\; t_{N-2,\,95\%}\;\hat{\sigma}_e\,\sqrt{\frac{1}{N} + \frac{x_T^2}{\sum_{t=1}^{N} x_t^2}}$$

where $x_T = X_T - \bar{X}$ (and similarly $x_t = X_t - \bar{X}$ in the denominator).
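As a sketch of how (8.15) and (8.16) might be put to work (purely illustrative: the annual price and dividend series here are simulated, and the variable names are my own assumptions), the following Python code builds the overlapping 4-year returns, runs the OLS, and produces the out-of-sample forecast.

```python
# Overlapping 4-year return regression (8.15) and out-of-sample forecast (8.16).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
T, k = 60, 4                                              # 60 years of annual data, 4-year horizon
logP = np.cumsum(0.06 + 0.15 * rng.standard_normal(T))    # simulated log prices
logD = logP - 3.0 + 0.3 * rng.standard_normal(T)          # simulated log dividends

dp = logD - logP                      # d_t - p_t
y = logP[k:] - logP[:-k]              # overlapping k-year returns r_{t,t+k}
x = dp[:-k]                           # regressor d_t - p_t, t = 1, ..., T-k

res = sm.OLS(y, sm.add_constant(x)).fit()
a_hat, b_hat = res.params
forecast = a_hat + b_hat * dp[-1]     # E_T(r_{T,T+k}) using the latest d_T - p_T
print(a_hat, b_hat, forecast)
# Note: with overlapping observations u_{t,t+k} is serially correlated, so the
# reported OLS t-statistics should be read with care, as the text cautions.
```
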
8.4

OTHER STUDIES RELATED TO PREDICTABILITY ISSUE

Shiller (2001)40 studied the relationship between aggregate U.S. market price-earnings ratios from 1881 to 1989 and the stock market's aggregate 10-year
real return over the 10-year period immediately after the P/E ratio is
registered. His plot looks as follows in our simplistic reproduction.
Figure 8.1
[Scatter plot (simplified reproduction): annualized 10-year real return at t+10 years (vertical axis, up to about 20%) against the P/E ratio at t (horizontal axis, 0 to 30), showing a clearly negative relationship.]

Clearly, there is a negative relationship. When today's P/E is high, the return
over 10 years into the future would be low. High stock market price today
therefore is bad for investors who wish to hold stocks over 10 years. On the
contrary, low P/E or low market price today is good for long term investors
who wish to buy and hold the stocks over the next 10 years, for the return will
be good. This result appears to be in sync with the dividend-price ratio D/P we
just studied. High P/E or low E/P is highly similar to low D/P, and hence
predicts low future return. Low P/E or high E/P, similar to high dividend-price
ratio, predicts high future returns.
The similarity could be due to the relationship:
40

Robert J. Shiller, (2001), Irrational Exuberance, Broadway Books, NY, is an


interesting and highly readable account of how stock markets work in today's modern
markets.

P/E = P/D × DPR
where DPR ≡ D/E (E is EPS or earnings per share) is the dividend payout
ratio. If DPR = 1, then P/D = P/E and the results apply equivalently to P/E. Or,
we can express the relationship as P/D = P/E × DC where DC ≡ E/D, the
dividend cover or number of times earnings can cover the dividend on a per
share basis.
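
As a quick numerical illustration of this conversion (the figures below are hypothetical, chosen only to make the arithmetic visible):

```python
# Hypothetical figures: price P = 40, earnings per share E = 2, dividend per share D = 0.8
P, E, D = 40.0, 2.0, 0.8
DPR = D / E                          # dividend payout ratio = 0.4
print(P / E, P / D, (P / D) * DPR)   # P/E = 20, P/D = 50, and P/D * DPR = 20 = P/E
```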
The predictability is about time variation in the risk premium (excess over
riskfree returns), and not about time varying interest or discount rate which is
found to be of less an impact. It makes economic sense in that at the low of a
business cycle, prices are low (low P/D but high D/P) and investors are
willing to hold stocks because of high expected returns (high risk premium
involved) in the future. This would be compensated by high expected return
over the long future horizon. High returns happen when the business cycle
reverts.
Conversely, at the high point of a business cycle in a boom, prices are
high (high P/D but low D/P), and low D/P predicts long-term lower returns.
Reversals in business cycles can be seen in the boom time of the 1960s
followed by the 1973 oil and exchange rate crisis, the 1970s recession, and
then followed by the 1980s and 1990s boom (despite a couple of incidences
of sharp market falls in 1987 and 1989), the tech-bubble burst in 2001, and the
subsequent and current boom since 2002-2003 till the global financial crisis in
2008. Markets then rebounded in late 2009. Business cycle reversals
contribute to the observed predictability phenomenon.
There is an exciting and impactful area of work in investment on
momentum41 which says that stocks that climb in prices for the last several
months would likely continue to do so in the next few months; and stocks that
are losers would continue to be losers over the next few months. Other studies
point out that momentum strategies to buy winners and sell losers (long-short
strategy) are particularly effective only in stocks with some of the largest trade
volumes, and are not evident in illiquid stocks. Hedge funds also come into
the picture and long-short is one popular though increasingly plain vanilla
strategy. Yet other studies of momentum indicate that the strategy works
mostly for small cap stocks. Momentum strategies come in various shapes and
sizes. In China, some studies indicate that the estimation of momentum would
take place over a window a few months back, followed by a window of
waiting, before a short one to two weeks window of momentum-induced
buying and then liquidation within the same period. Such strategies are said to
yield returns of over 15% p.a. after transactions costs.
41

Jegadeesh, N., and Sheridan Titman, (1993), Returns to Buying Winners and
Selling Losers: Implications for Stock Market Efficiency", Journal of Finance 48, 65-91.

8.5

PROBLEM SET

8.1 Three researchers A, B, C each has some interesting findings about stock
market research. A finds that using the Box-Ljung test, stock returns are
not correlated, so he concludes that stock prices are not predictable. B
uses the variance-ratio tests and finds that they are significantly less than
1 at the 10% level, and concludes that there is evidence of stock return
correlation. C regresses long-term stock returns on their dividend-price
ratios and finds significant coefficients. He concludes that there is long-term stock price predictability. Are their conclusions all in conflict? Or
can you reconcile their findings? Explain.
8.2 In a regression of 5-year future excess return on current dividend/price
variable, the estimated regression equation is ln(1+Rt,t+5) = 0.56 + 0.2
ln(Dt/Pt), where Rt,t+5 is the 5-year nominal excess return rate. If current
dividend per share is Dt = $1, current price per share is Pt = $10, what is
the forecast of the return rate over the next 5 years if the 5-year riskfree
rate is 1% p.a.?
FURTHER RECOMMENDED READINGS
[1] John H. Cochrane, (2001), Asset Pricing, Princeton University Press.
(Chapter 20 of the book is particularly relevant for the initial part of this
chapter.)
[2] John Y. Campbell, and Luis M. Viceira, (2002), Strategic Asset
Allocation: Portfolio Choice for Long-Term Investors, Oxford University
Press.


Chapter 9
ESTIMATION ERRORS AND T-TESTS
APPLICATION: EVENT STUDIES
Key Points of Learning
Measurement (Estimation) period, Post-event period, Event window, Event
date, Benchmark model, Abnormal return, Standardized abnormal return,
Average abnormal return, Cumulative abnormal return, Cumulative average
abnormal return, Tests of significance, Price pressure effect, Substitution
effect, Information effect, Information leakage

In this chapter, we show how the t-statistic is used in testing whether


systematic corporate financial events such as dividend payout, bonus issue,
earnings announcements, and so on, affect the market prices of the
corresponding stocks.
9.1

EVENTS

In finance, an event study is essentially a statistical test of whether a financial


happening or event such as a rights issue would affect the stock price. The
events are usually publicly available information at the time of announcement
or happening. In event study, an underlying benchmark asset pricing model is
assumed in order to make proper or meaningful inferences about the return
deviations.
Thus, event studies are never about tests of an asset pricing model, but are
tests of (1) whether the market is informationally efficient given the public
information of the event, assuming the benchmark model, and also assuming
we know or have a prior belief of whether the event has positive, negative, or
neutral impact on returns, or (2) whether the event has positive, negative, or
neutral impact on returns, assuming the benchmark model, and also assuming
the market is informationally efficient.
The distinction between (1) and (2) should be kept in view in the
interpretation of event study results in order not to confuse things. In most
practical situations dealing with policy implications, it is usually more useful
to apply (2), i.e. assuming the benchmark model and market efficiency. It is also
common to assume strong-form market efficiency, so that significant return

deviations before the time of the publicly announced event may be interpreted
as information leakage or inside information revelation.
There are three stages in an event study analysis. First, we need to define
the event of interest and identify the period over which the event will be
examined. Next, we have to design a testing framework to define the way to
measure impact and to test its significance. Finally, we need to collect
appropriate data to perform the testing of the event's impact and draw
conclusions in a model-theoretic and statistical sense.
Event studies are of various types. It is typically in the form of an
announcement.
(a) Firm-specific event e.g. insider trading, announcement of board change,
announcement of major strategic change, unusual rights issue
announcement, announcement to file or seek Chapter Eleven protection (reorganization in the US in a last effort to avoid bankruptcy and liquidation,
which is Chapter Seven), executive stock option issues, employee stock
option issues, etc.
(b) Across firms system-wide event e.g. unanticipated better-than-forecast
earnings announcement (good news), unanticipated worse-than-forecast
earnings announcement (bad news), anticipated better-than or worse-than forecast earnings announcement (no news). Others include
announcement of bonus issues, stock splits, mergers and acquisitions,
better than expected dividends or worse than expected dividends, new
debt issues, seasoned equity issues, block sales, purchase of other
companies' assets and stocks, etc.
(c) Macro events e.g. increase in CPF employer contribution, GDP growth
decline projection, regulatory changes etc.
Event studies focus on the performance of stock prices (or equivalently
stock returns) before, during, and after the event announcement. On the flip
side, it is also about the reaction of stock investors to news or information, if
any, from the event.
9.2

TESTING FRAMEWORK

The subject of event studies in finance is usually approached using a formal


treatment of statistical and data analyses. The sampling frame or study period
must also be clearly designed and stated. The study period is depicted as
follows. The measurement (or estimation) period is used for estimating the
parameters of the stock return process using historical data during this period.
Daily (continuously compounded) stock return, i.e. ln(Pt+1/Pt), using typically

end of day prices, is the variable commonly employed in conjunction with the
daily continuously compounded market return, and also the daily riskfree rate.
Figure 9.1
Event Sampling Frame
[Timeline of trading days t: Measurement (Estimation) Period from about t = -250 to t = -11, Event Window from t = -10 to t = +10, Post-Event Period from t = +10 to t = +250.]

Days are measured according to number of trading days before Event Day 0
(or the event announcement day), or number of trading days after Event Day
0. In the above, the measurement or estimation is about 240 sample points
from t=-250 to t=-11. Sometimes when we suspect that the market is
disruptive and beta may have changed over the nearly 1 calendar year (about
250+ trading days), then we use a shorter time series, e.g. 60 trading days (t = -70 to -11). During a stable measurement period, a longer or larger sample is
better in order to reduce sampling errors in the parameter estimators.
The post-event period that goes up to 1 calendar year is less often used
except for studies such as mergers and acquisitions, buyouts, IPOs, when a
longer time is required before the effect of the event is to be seen. The event
window (or period) is the most important part of the time frame and is further
delineated as follows.
Figure 9.2
Announcement Windows
[Event window timeline: pre-announcement window from t = -10 up to the (event) announcement day t = 0, and post-announcement window from t = 0 to t = +10.]
The number of days to use in the event window typically includes 2 calendar
weeks (or 10 trading days) before the announcement day, the announcement
day itself, and 10 trading days after the announcement day. This window

should be large enough to show up any possible changes to returns due to the
event. The event date is the day when the event becomes public information
e.g. when the announcement or news is broadcast as public information. It is
denoted as day 0 in the event study calendar.
The parameters estimated from the measurement period are used in the
event period to compute the defined deviation from normal return as a
measure of impact of event, if any.
9.3

BENCHMARK

A normal return will have to be defined. Various benchmark models are used.
(A) Market Model (MVN of stock returns leads to this specification)

$$r_{it} = \alpha_i + \beta_i r_{mt} + e_{it}$$

for the return r_{it} of stock i at time t, where conditions (a), (b), (c) are assumed:
(a) cov(r_{mt}, e_{it}) = 0 ;
(b) var(e_{it}) = σ_i², a constant ;
(c) cov(e_{it}, e_{it-k}) = 0 for k ≠ 0.
Then OLS regression on the data set, t = -250 to t = -11, will yield BLUE $\hat{\alpha}_i$ and
$\hat{\beta}_i$ that are also consistent. In addition, the estimate of σ_i²,


$$\hat{\sigma}_i^2 = \frac{1}{L-2}\sum_{t=-L-10}^{-11}\left(r_{it} - \hat{\alpha}_i - \hat{\beta}_i r_{mt}\right)^2 = \frac{1}{L-2}\sum_{t=-L-10}^{-11}\hat{e}_{it}^2$$

is unbiased and consistent. Note that L is the number of sample points in the
measurement window, and could be 240 or 60 or in-between.
(B) CAPM Model (more exactly, the regression model consistent with the CAPM)

$$r_{it} - r_{ft} = \beta_i\left(r_{mt} - r_{ft}\right) + u_{it}$$

In (A), the benchmark or normal return during the EVENT WINDOW is
defined as $\bar{r}_{i\tau} = \hat{\alpha}_i + \hat{\beta}_i r_{m\tau}$, where the time subscript τ is used to distinguish it
from t outside the Event Window. So τ = -10 to +10 in the Event Window.
The normal return on day τ is an expected return conditional on
information available up to and including day τ. In the case of the market model,
this relevant information is just r_{mτ}.
The Abnormal Return to stock i at time τ ∈ (-10, +10) is

$$AR_{i\tau} = r_{i\tau} - \bar{r}_{i\tau} = r_{i\tau} - \hat{\alpha}_i - \hat{\beta}_i r_{m\tau} .  \qquad (9.1)$$
In (B), the normal return during the EVENT WINDOW is defined as

$$\bar{r}_{i\tau} = r_{f\tau} + \hat{\beta}_i\left(r_{m\tau} - r_{f\tau}\right).$$

The abnormal return would then be

$$AR_{i\tau} = r_{i\tau} - \bar{r}_{i\tau} = r_{i\tau} - r_{f\tau} - \hat{\beta}_i\left(r_{m\tau} - r_{f\tau}\right).$$

Alternative measures of abnormal return include:
(C) Market Adjusted Excess Return, defined as $r_{i\tau} - r_{m\tau}$.
This measure may be appropriate especially when the stocks have betas close
to one. Note that when beta exactly equals one, the CAPM abnormal
return is also the market adjusted excess return.
(D) Mean Adjusted Excess Return, defined as $r_{i\tau} - \bar{r}_i$ where $\bar{r}_i = \frac{1}{L}\sum_{t=-L-10}^{-11} r_{it}$.

Suppose our event is a systematic one such as bonus issue information


effect (whether bonus dividend announcement is good news, bad news, or no
news?) We can test this over not just one firm with a bonus dividend
announcement, but over N (e.g. 30) firms each with a bonus dividend
announcement not all at the same time but scattered through time e.g. all
within 2 years.
Figure 9.3
Grouping Events

[Diagram: each firm-event i = 1, 2, ..., N has its own event calendar centered at its own announcement day 0; the events occur at different calendar dates.]
However, the announcement date (day of announcement of the bonus


dividend event) for each firm is denoted Event Day 0 even though they are
from different calendar dates. Also, event day -3 would refer to 3 days before
announcement relative to all Day 0 of each firm-event. Note also that we
could have two events from the same firm if the firm happened to make 2
bonus issue announcements separated by time.

Avoiding clustering of the events in time prevents confounding events e.g.
9/11 when all stocks dived. If we had done an event study of positive
announcements and all firm-events lined up around the month of late
November-December 2001, then the 9/11 would produce a negative effect that
is certainly not due to the positive earnings. Spreading out also has the
advantage of ensuring the various ARis of the various firms do not correlate
so that it is easier to estimate the variance of a portfolio of the AR is across
firms. In other words, if the firm-events all cluster together at the same time,
then since firm returns (even after adjusting for market) may still possess
correlation, the portfolio variance will be harder to measure because of the
covariance terms.
Suppose we use the Market Model (A) as our correct benchmark model.
Then the ARis computed during the Event Window would be randomly
varying about zero provided (a) the model is true (which we assumed already);
and (b) the market is efficient (quick to process information and update price)
and rational (will process relevant information correctly given the
information). The last statement (b) is basically an assumption about market
efficiency. The implication of (a) and (b) is that residual noise does not
display a mean > 0 or < 0 when there is no news to alter the firm or stock's
mean and volatility. In an efficient and rational market, without significant
information impact, the expected value of AR_iτ is zero, conditional on market
information up to and including those at τ. Hence, average abnormal returns
over the event window should not be significantly different from zero when
there is no significant information in the event.
Significant information in the event announcement is taken to be
unanticipated news that causes the market to either (a) re-evaluate the stock's
expected future earnings (thus also dividends), and/or (b) re-evaluate the
stock's risk-adjusted discount rate, resulting in the immediate efficient
adjustment of the stock price. With significant information impact, the
expected value of AR_iτ is non-zero (positive if good news on the stock and
negative if bad news on the stock), conditional on market information up to and
including those at τ. Average abnormal returns over the event window would
then be significantly different from zero. The abnormal return essentially
reflects a change in the conditional expectation of the stock return by the
market.
We define the null hypothesis H0: the event has no impact on stock returns (or
more specifically, no impact on the stock's abnormal returns). This same
hypothesis can be made more detailed in several ways as follows. From
equation (9.1), the market model abnormal return is

$$AR_{i\tau} = r_{i\tau} - \hat{\alpha}_i - \hat{\beta}_i r_{m\tau}$$
so that


$$E\left(AR_{i\tau} \,|\, r_{m\tau}\right) = E\left(r_{i\tau} \,|\, r_{m\tau}\right) - E\left(\hat{\alpha}_i \,|\, r_{m\tau}\right) - r_{m\tau}\,E\left(\hat{\beta}_i \,|\, r_{m\tau}\right) = E\left(r_{i\tau} \,|\, r_{m\tau}\right) - \alpha_i - \beta_i r_{m\tau} = 0$$

and

$$\begin{aligned}
\mathrm{var}\left(AR_{i\tau} \,|\, r_{m\tau}\right) &= \mathrm{var}\left(r_{i\tau} \,|\, r_{m\tau}\right) + \mathrm{var}\left(\hat{\alpha}_i \,|\, r_{m\tau}\right) + r_{m\tau}^2\,\mathrm{var}\left(\hat{\beta}_i \,|\, r_{m\tau}\right) \\
&\quad + 2 r_{m\tau}\,\mathrm{cov}\left(\hat{\alpha}_i, \hat{\beta}_i \,|\, r_{m\tau}\right) - 2\,\mathrm{cov}\left(r_{i\tau}, \hat{\alpha}_i \,|\, r_{m\tau}\right) - 2 r_{m\tau}\,\mathrm{cov}\left(r_{i\tau}, \hat{\beta}_i \,|\, r_{m\tau}\right) \\
&= \sigma_i^2\left[1 + \frac{1}{L} + \frac{\left(r_{m\tau} - \bar{r}_m\right)^2}{\sum_{t=-L-10}^{-11}\left(r_{mt} - \bar{r}_m\right)^2}\right]
\end{aligned}$$

where $\bar{r}_m = \frac{1}{L}\sum_{t=-L-10}^{-11} r_{mt}$.

Recall that while t ranges from -L-10 to -11 within the estimation period, τ
ranges from -10 to +10 within the event window. The last two terms in
var(AR_iτ | r_mτ), i.e. cov(r_iτ, α̂_i | r_mτ) and cov(r_iτ, β̂_i | r_mτ), are zero
because r_iτ | r_mτ involves e_iτ whereas α̂_i and β̂_i involve e_it (t = -L-10 to -11),
and e_iτ and e_it are uncorrelated or, perhaps even stronger,
independent. Note that the estimators' sampling errors are added to the variance of AR_iτ.
9.4

TEST STATISTICS

If the estimation period sample size L is large, we can use an asymptotic
argument to show that var(AR_iτ | r_mτ) converges to σ_i², or use the latter
as an approximation when L is fairly large, e.g. L = 240.

So, $AR_{i\tau} \,|\, r_{m\tau} \xrightarrow{d} N\left(0,\ \sigma_i^2\right)$.
Test for each stock i using

$$SAR_{i\tau} = \frac{AR_{i\tau}}{\hat{\sigma}_i} \;\sim\; N(0,1)\ \text{ or }\ t_{L-2} .  \qquad (9.2)$$

This is sometimes called SAR_iτ, the standardized abnormal return. The
distribution is sometimes also interpreted as Student-t with L-2 degrees of
freedom. This is because σ_i² is estimated via

$$\hat{\sigma}_i^2 = \frac{1}{L-2}\sum_{t=-L-10}^{-11}\left(r_{it} - \hat{\alpha}_i - \hat{\beta}_i r_{mt}\right)^2 \quad\text{and}\quad \frac{(L-2)\,\hat{\sigma}_i^2}{\sigma_i^2} \sim \chi^2_{L-2},$$

so that

$$\frac{AR_{i\tau}}{\hat{\sigma}_i} = \frac{AR_{i\tau}}{\sigma_i}\cdot\frac{\sigma_i}{\hat{\sigma}_i} = \frac{Z}{\sqrt{\hat{\sigma}_i^2/\sigma_i^2}} \;\sim\; t_{L-2}, \quad\text{where } Z \sim N(0,1).$$
If we have N firm-events, at any time τ within the event window, the
average abnormal return (in some instances, it has been called the aggregated
abnormal return) is

$$AAR_\tau = \frac{1}{N}\sum_{i=1}^{N} AR_{i\tau}.$$

Assuming independence of disturbances across events because they are not
clustered, for large L,

$$\operatorname{var}(AAR_\tau \mid r_m) = \frac{1}{N^2}\sum_{i=1}^{N}\sigma_i^2.$$

So, $AAR_\tau \mid r_m \sim N\!\left(0, \frac{1}{N^2}\sum_{i=1}^{N}\sigma_i^2\right)$.

Test for each event window day τ using

$$\frac{AAR_\tau}{\sqrt{\frac{1}{N^2}\sum_{i=1}^{N}\hat{\sigma}_i^2}} \sim N(0,1). \qquad (9.3)$$

There is another way to represent this test statistic in terms of the $SAR_{i\tau}$:

$$\sqrt{N}\cdot\frac{1}{N}\sum_{i=1}^{N}\frac{AR_{i\tau}}{\hat{\sigma}_i} \sim N(0,1).$$

This is approximately equal to (9.3).
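To make the mechanics of (9.2) and (9.3) concrete, the following is a minimal Python sketch (not from the text): it estimates the market model over an estimation window of L days, computes event-window abnormal returns, and forms the standardized abnormal returns. The series r_i and r_m and all parameter values are hypothetical, used only for illustration.

```python
import numpy as np

# Illustrative sketch only: r_i, r_m are hypothetical daily returns covering
# L estimation days followed by 21 event-window days (-10, ..., +10).
rng = np.random.default_rng(0)
L = 240
r_m = rng.normal(0.0005, 0.01, L + 21)                   # market returns
r_i = 0.001 + 1.2 * r_m + rng.normal(0, 0.02, L + 21)    # stock returns

est_i, est_m = r_i[:L], r_m[:L]                          # estimation window
evt_i, evt_m = r_i[L:], r_m[L:]                          # event window

# OLS market-model estimates over the estimation window
beta_hat = np.cov(est_i, est_m, ddof=1)[0, 1] / np.var(est_m, ddof=1)
alpha_hat = est_i.mean() - beta_hat * est_m.mean()

# Residual standard deviation with divisor L - 2, as in the text
resid = est_i - alpha_hat - beta_hat * est_m
sigma_i = np.sqrt(resid @ resid / (L - 2))

# Event-window abnormal returns and standardized abnormal returns (9.2)
AR = evt_i - alpha_hat - beta_hat * evt_m
SAR = AR / sigma_i                                        # approx N(0,1) or t_{L-2}
print(np.round(SAR, 2))
```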
In order to test for the persistence of the impact of the event during the period τ_1
to τ_k (where τ_1 is the start of the event window, τ_21 is the end of the event window,
and τ_1 ≤ τ_k ≤ τ_21), the abnormal returns can be added to obtain the cumulative abnormal
return

$$CAR_i(\tau_1, \tau_k) = \sum_{\tau=\tau_1}^{\tau_k} AR_{i\tau}.$$

Assuming independence of disturbances across time, viz. $\operatorname{cov}(e_{i\tau}, e_{i\tau-k}) = 0$ for
k ≠ 0:

$$\operatorname{var}\big(CAR_i(\tau_1,\tau_k) \mid r_{m\tau_k},\ldots\big) = \sum_{\tau=\tau_1}^{\tau_k}\operatorname{var}(AR_{i\tau}\mid r_{m\tau}) = (\tau_k - \tau_1 + 1)\,\sigma_i^2.$$

So, $CAR_i(\tau_1,\tau_k) \mid r_{m\tau_k},\ldots \sim N\big(0,\ (\tau_k-\tau_1+1)\,\sigma_i^2\big)$.

Test for each event window period (τ_1, τ_k) using

$$\frac{CAR_i(\tau_1,\tau_k)}{\sqrt{(\tau_k-\tau_1+1)\,\hat{\sigma}_i^2}} \sim N(0,1). \qquad (9.4)$$
The cumulative average abnormal return is

$$CAAR(\tau_1,\tau_k) = \frac{1}{N}\sum_{i=1}^{N}\sum_{\tau=\tau_1}^{\tau_k} AR_{i\tau} = \sum_{\tau=\tau_1}^{\tau_k}\frac{1}{N}\sum_{i=1}^{N} AR_{i\tau},$$

$$\operatorname{var}\big(CAAR(\tau_1,\tau_k) \mid r_{m\tau_k},\ldots\big) = \frac{1}{N^2}(\tau_k-\tau_1+1)\sum_{i=1}^{N}\sigma_i^2.$$

So, $CAAR(\tau_1,\tau_k)\mid r_{m\tau_k}\ldots \sim N\!\left(0,\ \frac{1}{N^2}(\tau_k-\tau_1+1)\sum_{i=1}^{N}\sigma_i^2\right)$.

CAAR may also be construed as the average of the CAR_i's or the cumulation of the AAR_τ's.

Now test for each event window period (τ_1, τ_k) using

$$\frac{CAAR(\tau_1,\tau_k)}{\sqrt{\frac{1}{N^2}(\tau_k-\tau_1+1)\sum_{i=1}^{N}\hat{\sigma}_i^2}} \sim N(0,1). \qquad (9.5)$$
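The aggregation in (9.3), (9.4), and (9.5) is equally mechanical. The sketch below is an illustration (not the book's code) under the same large-L approximation; it assumes a hypothetical N × 21 array AR of event-window abnormal returns for N firm-events and a length-N vector sigma of the estimated residual standard deviations σ̂_i, e.g. as produced by the previous sketch.

```python
import numpy as np

def event_study_z_stats(AR, sigma):
    """AR: (N, T) abnormal returns over T event days; sigma: (N,) sigma_i hats.
    Returns z-statistics for AAR per day (9.3), CAAR (9.5), and firm-level CAR (9.4)."""
    N, T = AR.shape
    var_sum = np.sum(sigma**2)                      # sum_i sigma_i^2

    AAR = AR.mean(axis=0)                           # average abnormal return per day
    z_AAR = AAR / np.sqrt(var_sum / N**2)           # (9.3)

    CAAR = np.cumsum(AAR)                           # CAAR(tau_1, tau_k), k = 1..T
    horizons = np.arange(1, T + 1)                  # tau_k - tau_1 + 1
    z_CAAR = CAAR / np.sqrt(horizons * var_sum / N**2)      # (9.5)

    CAR = np.cumsum(AR, axis=1)                     # firm-level CAR
    z_CAR = CAR / np.sqrt(horizons * sigma[:, None]**2)     # (9.4)
    return z_AAR, z_CAAR, z_CAR

# Hypothetical inputs for illustration only:
rng = np.random.default_rng(1)
AR = rng.normal(0, 0.02, size=(25, 21))
sigma = np.full(25, 0.02)
z_AAR, z_CAAR, z_CAR = event_study_z_stats(AR, sigma)
```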

The z-statistics in (9.2), (9.3), (9.4), and (9.5) can be used to test the H0. The
interpretation in each case will be slightly different. For example, in (9.3) a
rejection says that there is an unexpected large increase or decrease in return
for that event day, while in (9.5) a rejection says that there is an unexpected
large increase or decrease in cumulative return.
The tests are based on the null that the return's mean level and variance or
volatility remain constant. A rejection would mean that the conditional mean
had changed due to the event announcement. If we want to test if there is a
conditional mean change not due to increased volatility because of the event,
i.e. not due to finite sampling errors caused by bigger variances, then we can
use the sample variance of AAR during the event window, i.e.

$$\hat{\sigma}^2(AAR) = \frac{1}{20}\sum_{\tau=\tau_1}^{\tau_{21}}\left(AAR_\tau - \overline{AAR}\right)^2$$

(with $\overline{AAR}$ the mean of $AAR_\tau$ over the 21 event-window days), to construct the approximate $t_{20}$ test statistic

$$\frac{AAR_\tau}{\hat{\sigma}(AAR)} \sim t_{20}$$

for testing. It is important to interpret the CAR_i or CAAR graph appropriately.
Figure 9.4
Graph of Cumulative Abnormal Returns based on Events
[The figure plots CAR against event time from -10 to +10, with 2 s.e. bands. Three annotated paths are shown: (i) significant leakage of information from about day -3; (ii) informational efficiency, with significant impact on (-1,0) and less on (0,+1), and a permanent effect on asset value with no reversal in AR; (iii) a temporary or transitory effect such as a price pressure effect, where AR reverses and CAR or CAAR returns to 0.]
The three events in Figure 9.4 show different paths of CAR. In the dotted
event path, CAR is significant only at event date due to a significantly positive
AR. After that, AR is negative, and thus CAR falls back to zero. CAR remains
at zero thereafter. This shows a price pressure effect that could be due to
excessive buying (or selling) driven not by information but by liquidity. Because
of temporary inelasticity, large volume selling will drive down price
temporarily and large volume buying will drive up price temporarily. As there
is no information, the prices will revert, thus showing a reversal in AR over
one or two days: CAR reverts back to zero after that. The positive AR is
possibly due to buying pressure and illiquidity. Such price pressure effects do
not last and are said to be temporary or transitory.
However, if there is excessive buying pressure, but sellers are plentiful and
can easily substitute into other shares, then the supply curve is flat or highly
elastic, so the buying pressure will not force up price. This substitution effect
will not produce any significant AR or CAR in the first place.
Similarly, if there is excessive selling pressure, but buyers are plentiful and
can easily substitute from other shares, then the demand curve is flat or highly
elastic, so the selling pressure will not force down the market share
price. It is more plausible that the market operates under an information effect as
well as a substitution effect, so the dotted event path illustrated in Figure 9.4 will
be rare.
At times, even a piece of information that is known to produce asset price
changes will not produce any significant AR or CAR at its day-0 announcement
if the information is already known or anticipated.
However, the solid-line event shown in Figure 9.4 illustrates a significant
event with a positive impact on price and returns, e.g. a positive earnings
announcement. The impact is permanent. The news is also quickly absorbed
and the price adjusts quickly, so that after day +1 of the event there is no more price
adjustment, AR is zero, and CAR stays constant thereafter.
In the other dotted event with positive CAR, the news appeared to hit
before event date, at about t=-3. The AR and thus CAR are significantly
positive. This may indicate information leakage. It appears that inside
information has caused the significant price changes before the public news.
In what follows, we study an actual event on a firm, and show how to
conduct the various event study tests. The data is collected from Center for
Research in Security Prices (CRSP) database managed at the University of
Chicago.
9.5 BANK OF AMERICA ACQUIRES MERRILL LYNCH

Around the beginning of the recent global financial crisis, the Bank of
America (BOA) announced on September 15, 2008 (Monday), that it was
acquiring Merrill Lynch (ML) in a $50 billion stock-for-stock exchange. BOA
would exchange 0.8595 of a BOA share for each ML share. The offer price for ML's
shares, in terms of the BOA share market value at that time, represented about
US$29 a share.
It was to be one of the most tumultuous weeks in the financial history of
global capital markets. Early on that Monday, Lehman Brothers, the number
four investment bank in the U.S., had filed for Chapter 11 bankruptcy. Earlier in
March, JP Morgan Chase had bought out ailing Bear Stearns at a deep
discount. These investment firms had gotten into deep trouble by sinking huge
chunks of investment capital into real estate derivatives and collateralized debt
obligations (CDOs), and these instruments were either becoming worth a tiny
fraction of their original values or could not be traded as liquidity had dried up
in the fear of troubled mortgages.
Merrill Lynch's share last traded at the close of the previous week at
US$17.05 a share, so BOA appeared to pay a premium of 70% above the last
market price. At its peak in 2007, ML's share was selling above US$98. The
deal also carried substantial risks for BOA as ML had billions of dollars in
assets tied to mortgages that had dived in value. Merrill had reported four
straight quarterly losses.
Our task in this single-event single-firm study is to try as scientifically as
possible to test (a) if the acquisition announcement on September 15, 2008
(event day 0) contained significant information, whether good news or bad
news for the stockholders, and (b) if there was a leakage42 of information
during the 2 weeks (10 trading days) before announcement.
We use daily traded data during one calendar year (about 252 market price
observations) before the event window [-10,+10] to estimate the market model
(A) as benchmark. The $\hat{\alpha}_i$ and $\hat{\beta}_i$ estimates of the market model parameters
for BOA are then used to estimate the abnormal returns

$$AR_{i\tau} = r_{i\tau} - \hat{\alpha}_i - \hat{\beta}_i\, r_{m\tau}$$

in the event window. The abnormal returns during the event window, from 10 days
prior to the announcement date until 10 days post-announcement, are shown in
Figure 9.5.
Figure 9.5
Abnormal Returns Around the Event that Bank of America (BOA)
Announced Acquisition of Merrill Lynch on September 15, 2008
[The figure plots the daily abnormal return rate of BOA from August 29 to September 29, 2008, with the announcement date of September 15 marked.]
42 All information and data are collected from known published sources such as Yahoo
Finance, CRSP, and the SGX database. The term leakage is purely a technical word
denoting information being known by a subset of investors. There is no connotation of
wrongdoing or of issues that would be implicated by law.
The cumulative abnormal return is shown in Figure 9.6.
Figure 9.6
Cumulative Abnormal Returns of Bank of America (BOA) from August
29 to September 29, 2008
[The figure plots the CAR of BOA over event days -10 to +10, with the week before the announcement and the September 15 event date marked.]

The t-statistics (approximated by standard normal variates) in (9.4) for
testing if the CAR is significant are also computed for different event days
during the event window. They are shown in Table 9.1 below.

Table 9.1
T-statistics of the CAR from t = -10 to t = +9

Dates     N(0,1)       Dates     N(0,1)
Aug 29    0.7719       Sep 15    0.3517
Sep 2     2.1730*      Sep 16    1.2196
Sep 3     2.6002**     Sep 17    1.3112
Sep 4     1.9566       Sep 18    1.6551
Sep 5     2.5640**     Sep 19    2.9633**
Sep 8     2.9481**     Sep 22    2.6989**
Sep 9     2.7704**     Sep 23    2.6798**
Sep 10    2.3756*      Sep 24    2.5824**
Sep 11    2.1583*      Sep 25    2.5253*
Sep 12    2.2696*      Sep 26    3.0170**
CAR values that are significantly different from 0 at the 5% 2-tail significance
level are marked with an *, whereas those significant at the 1% 2-tail level
are marked with two asterisks **.
Figures 9.5 and 9.6 and Table 9.1 indicate that BOA's prices had increased
in the two weeks prior to the announcement date. At the event date on September 15, there
was a significant drop in price. There did not appear to be any leakage as the
price did not show any marked changes in the few days before the event date.
The significant drop was probably due to the market taking the news as bad,
believing BOA was undertaking a high risk in acquiring ML.
However, in the 2 to 4 days subsequent to the announcement, BOA's prices
showed reversion back to normality and in fact climbed somewhat. This may
be due to stockholders realizing after all that the acquisition was actually good
for the future business development of BOA.
The news reported that BOA was combining its own large consumer and
corporate banking business with ML's global wealth management, advisory
expertise, and financial services capabilities and capacities, creating a huge
finance corporation that would rival Citigroup Inc.
9.6 A SEMICONDUCTOR FIRM

In September 2002, Chartered Semiconductor Manufacturing (CSM) drew a wave of
criticism after it stunned the market on Monday, September 2, by announcing a massive
rights issue (8 rights for every 10 ordinary shares) of S$1.1 billion at S$1 a share when
its share had been trading at Friday's close of S$2.10. There was talk of information
leakage to the big boys and of a market sell-down in the week before.
Even as late as August 31, CSM (the world's third-largest chip maker, but
one that had seen its share price fall from a height of over S$19 in 2000 to
just over S$2 in mid-2002), which had lost money in the last 2 years before
2002, had maintained that its cash balance and access to credit were strong
enough, and the market deduced that further fund-raising was not imminent.
Ordinarily, ceteris paribus, a rights issue should not increase or reduce
existing share value. We illustrate as follows.
An ordinary shareholder holds 1 share at $X. Suppose he is given a 1:1 right
to buy a second newly issued share at $Y < $X. The holder's original wealth
is $X + $Y, with the $Y held in some other asset, e.g. a bank deposit. After all ordinary
shareholders exercise their rights, the new share price is $N(X+Y)/(2N) =
$(X+Y)/2, where N is the original number of shares of the firm. Since Y < X, it
can be seen that the new share price is diluted or diminished. But the original
holder now has 2 shares, so his wealth is still $X + $Y, and there is no change
in value.
The exception is that his $Y was originally in his own pocket and could
seek other investments, but is now constrained to be used as new capital in the
firm. Thus a firm will issue rights in 2 kinds of situations: (1) when it requires
additional capital funding for new growth projects that will improve returns
(good news, welcomed by the shareholders because the $Y earns a higher
return in the firm than in the shareholders' alternatives), and (2) when it requires
additional capital funding because it is losing money and needs to put in more
capital (bad news, not welcomed by the shareholders because the $Y earns a
lower return in the firm than in the shareholders' alternatives).
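As a quick numerical check of the dilution arithmetic above, the short Python sketch below computes the theoretical ex-rights price for a general n-for-m rights issue and verifies that, ceteris paribus, the subscribing holder's wealth is unchanged. The figures (X = 2.10, Y = 1.00, an 8-for-10 ratio in the spirit of the CSM episode) are illustrative assumptions only.

```python
# A minimal sketch of the rights-issue dilution arithmetic above.
# For an n-for-m rights issue at subscription price Y, with cum-rights
# price X, the theoretical ex-rights price is (m*X + n*Y) / (m + n).

def ex_rights_price(X, Y, n, m):
    """Theoretical ex-rights price after an n-for-m rights issue."""
    return (m * X + n * Y) / (m + n)

X, Y, n, m = 2.10, 1.00, 8, 10          # illustrative numbers only
p_ex = ex_rights_price(X, Y, n, m)

# Wealth check for a holder of m shares who subscribes fully:
wealth_before = m * X + n * Y           # m shares plus the cash Y per new share
wealth_after = (m + n) * p_ex           # m + n shares at the ex-rights price
print(f"ex-rights price = {p_ex:.4f}")
print(f"wealth before = {wealth_before:.2f}, after = {wealth_after:.2f}")
```

The two wealth figures are identical, which is the point of the illustration: mechanical dilution alone does not change value, so any significant abnormal return around a rights announcement has to come from the information it conveys.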
Good news or good signals by the rights issue announcement will lead to
increased share price, while bad news or bad signals by the rights issue
announcement will lead to decreased share price.
In this particular episode, bad news appeared to be the case with CSM.
There are of course happier episodes with CSM, including its return to prosperity
in the middle of the 2000s. The sequence of events can be tabulated as follows:

Close Friday, August 30, 2002:  CSM share price fell 19 cents (8.3%) to S$2.10
6 am Monday, Sep 2, 2002:       CSM announced an 8-for-10 rights issue at S$1 per new share
Close Monday, Sep 2, 2002:      CSM share price fell a further 50 cents (23.8%) to S$1.60

Business Times, September 5, 2002, reported: "But the biggest question of all
concerns the activity in its shares in the run-up to Monday's announcement of
the rights issue. ... Pre-announcement or not, it did not go unnoticed that some
quarters seemed to have started offloading Chartered's stock several days
before Monday's announcement. This heavy-volume selling drove down the
stock's price by more than 16 per cent during the week. 'There was clearly
a leak,' thundered the director of a local broking house, 'How else do you
explain the share price falling sharply since last week, way ahead of the
announcement.'"
Our task in this single-event single-firm study is to try as scientifically as
possible to test (a) if the rights issue announcement on September 2, 2002
(event day 0) contained significant information that influenced share price,
and (b) if there was a leakage43 of information during the 2 weeks (10 trading
days) before announcement.

43 All information and data are collected from known published sources.
Data were collected from the Straits Times stock reporting pages. The
Market Model in (A) is used as the benchmark model. The abnormal returns
10 trading days before announcement date and up to 8 trading days post event
[-10, +8] are shown in Figure 9.7 below. (We use +8 because we collected the
data till that date.)
Figure 9.7
Abnormal Returns Around the Event that Chartered Semiconductor
Manufacturing (CSM) Announced a Massive Rights Issue on September
2, 2002
[The figure plots the daily abnormal return rate of CSM from August 19 to September 12, 2002, with the announcement date of September 2 marked.]
The event date or the date of the announcement was September 2nd, 2002. In
event time, this is t=0.
The cumulative abnormal returns are shown in Figure 9.8 below. The CAR graph
covers the event window [-10,+8], and it shows that the CAR remains flat in the
post-announcement window, indicating a permanent information effect.
Table 9.2 then shows the t-test results of the CAR (starting from t= -5) for
the days of the week prior to the Monday September 2, 2002, morning
announcement, and including the event date itself. As shown in the chapter,
the z-statistic is often used as an approximation for the t-statistic when the
sample size is large.
Figure 9.8
Cumulative Abnormal Returns of Chartered Semiconductor
Manufacturing (CSM) from August 19 to September 12, 2002
[The figure plots the CAR of CSM over event days -12 to +8, with the week before the announcement and the September 2 event date marked.]
Table 9.2
T-statistics of the CAR from t = -5 to t = 0

Dates     N(0,1)
Aug 26    -0.11001
Aug 27    -0.54175
Aug 28    -0.98432
Aug 29    -1.21243
Aug 30    -2.64587**
Sep 2     -6.17074**

There appears to be statistical evidence at the 1% significance level that
leakage occurred on Friday, August 30, 2002, before the announcement on Monday,
since there is a large negative z-statistic with a two-tail p-value below 0.01.
9.7 PROBLEM SET

9.1 (i) Suppose $AR_{it} \sim_d N(0, \sigma_i^2)$, where $\sigma_i^2 = 0.01$; describe the
exact distribution of the cumulative abnormal return

$$CAR_i(\tau_1, \tau_2) = \sum_{t=\tau_1}^{\tau_2} AR_{it}$$

where $\tau_1 = -10$ and $\tau_2 = +10$.
(ii) Suppose the sum of the variances of the $AR_{it}$ of 5 event-stocks (i =
1, 2, 3, 4, and 5) is given by $\sum_{i=1}^{5}\sigma_i^2 = 0.06$; what are the exact
distributions of CAAR(-10, -1) and CAAR(-10, +10)?


(iii) If CAAR (-10, -1) = 0.17 and CAAR (-10, +10) = 0.45, what can you
infer about the event ?
9.2 Suppose firm A stages a takeover of target firm B by making a public tender
offer to buy B's shares at a price of $20 when B's shares are currently
traded at $15. Upon announcement, the CAR of B increases to 3% while
that of A falls to 2%. Explain why there may be a fall in the CAR of A. Is
this fall permanent if the takeover bid fails?
9.3 In general, rights issues are greeted with a positive abnormal return as the
information generally indicates that the firm has plans to invest more
capital in good prospective businesses. However, is it true that such an
event may sometimes show up negative abnormal return on some firms?
Why?
9.4 If a really badly managed firm A announces it will seek to take over
another smaller firm B, and actually puts out a public tender for a higher
price than the current market price of B's share, will B's share price
increase or decrease during the announcement period, if the market is
efficient?
FURTHER RECOMMENDED READINGS
There have been numerous studies on various events of corporate finance
involving announcement effects. Two representative studies for further
readings are as follows.
[1] Aharony, J., and I. Swary, (1980), "Quarterly Dividend and Earnings
Announcements and Stockholders' Returns: An Empirical Analysis,"
Journal of Finance 35, 1-12.
[2] Keown, Arthur, and J. Pinkerton, (1981), Merger Announcements and
Insider Trading Activity, Journal of Finance 36.
Chapter 10
MULTIPLE LINEAR REGRESSION
AND STOCHASTIC REGRESSORS
Key Points of Learning
Stochastic regressor, Consistency, Asymptotic efficiency, Orthogonal
projection, Ordinary least squares method, Tests of restrictions, Adjusted R2,
Schwarz criterion, Akaike information criterion, Hannan-Quinn criteria,
Forecasting

In this chapter we continue with the linear regression theory and extend the
Two-Variable case in Chapter 3 to more than two variables. This is called
multiple linear regression when there is more than one explanatory variable
besides the constant.
10.1

STOCHASTIC REGRESSORS

In Chapter 3, we considered the method of ordinary least squares in a simple


linear regression with only one explanatory variable. There the explanatory
variable Xi is treated as a constant or a pre-determined value. Now we shall
consider the broader picture and treat Xi as a random variable. When the
method in Chapter 3 is carried out in the context of random Xi, the results,
including the estimates and their statistical properties, hold, provided we
interpret those results as being conditioned on the given values of X i.
Basically this means that if we have to run the regression all over again, unless
we use explanatory Xis that are similar, we may not be able to obtain similar
regression results. Note that we use may not. Sometimes, the results can be
robust, but if the stochastic process of Xi is extreme, we may obtain different
results, e.g. estimates and R2 that look very different and that may be biased or
have quite different estimator sampling variances. We shall consider this
issue here.

The linear regression equation is $\tilde{Y}_i = \alpha + \beta\tilde{X}_i + \tilde{u}_i$, where $\tilde{Y}_i$ is the dependent
variable (or regressand), $\tilde{X}_i$ is the independent or explanatory variable (regressor),
$\tilde{u}_i$ is the disturbance or error variable, and α and β are regression coefficients or
constants. We have added tildes to emphasize that they are random variables (we usually
omit the tilde if the context makes it clear whether they are to be treated as random
variables or as the realized values of the random variables). Each pair of observables
$(X_i, Y_i)$, for i = 1, 2, 3, …, N, is a realization of the bivariate random variables
$(\tilde{X}_i, \tilde{Y}_i)$. In other words, if we could repeat the experiment, so to speak, we could
have obtained different values of $(X_i, Y_i)$ for each i.

Consider the OLS estimators $\hat{\alpha}$ and $\hat{\beta}$ of the regression equation

$$\tilde{Y}_i = \alpha + \beta\tilde{X}_i + \tilde{u}_i.$$
The Classical Conditions (desirable conditions) for OLS regression are:
(A1) $E(\tilde{u}_i) = 0$ for every i,
(A2) $E(\tilde{u}_i^2) = \sigma^2$, a same constant for every i,
(A3) $E(\tilde{u}_i\tilde{u}_j) = 0$ for every i ≠ j,
(A4) $\tilde{X}_i$ and $\tilde{u}_j$ are stochastically independent (of all other r.v.'s) for each i, j,
(A5) $\tilde{u}_i \sim N(0, \sigma^2)$.

The classical conditions are meant to produce desirable properties in OLS
estimators. We could use slightly weaker conditions instead of (A4) (weaker
by obviating stochastic independence of $\tilde{X}$ and $\tilde{U}$) as follows:

$$E(\tilde{u}_i \mid \tilde{X}_j, \text{ all } j) = 0 \ \text{ for every } i.$$

This condition implies, via the iterated expectations theorem, that $E(\tilde{u}_i) = 0$ and
$\operatorname{cov}(\tilde{u}_i, \tilde{X}_j) = 0$ for all i, j. The latter is not as strong as
stochastic independence between $\tilde{X}_i$ and $\tilde{u}_j$, and is implied by the latter. The
condition can be written in matrix format as $E(\tilde{U}_{N\times 1} \mid \tilde{X}_{N\times 1}) = 0$. Note that in
the 2-variable regression, there is only one non-constant X column: $X_1, X_2, \ldots, X_N$.

We could also use weaker conditions in place of (A2) and (A3):

$$E(\tilde{u}_i^2 \mid \tilde{X}_j, \text{ all } j) = \sigma^2, \ \text{ a same constant for every } i. \qquad (10.1)$$

This condition implies via the iterated expectations theorem that $E(\tilde{u}_i^2) = \sigma^2$
for every i.

$$E(\tilde{u}_i\tilde{u}_j \mid \tilde{X}_k, \text{ all } k) = 0 \ \text{ for every } i \neq j. \qquad (10.2)$$

This condition implies via the iterated expectations theorem that
$E(\tilde{u}_i\tilde{u}_j) = 0$ for every i ≠ j.

Equations (10.1) and (10.2) can be written in matrix format as

$$E\left(\tilde{U}_{N\times 1}\tilde{U}_{N\times 1}^T \mid \tilde{X}_{N\times 1}\right) = \sigma^2 I_{N\times N}.$$

Note that stochastic or probabilistic independence of $\tilde{X}_i$ and $\tilde{u}_j$ is obviated. We can
also replace (A5) by $U\mid X \sim N(0, \sigma^2 I)$; (A5) together with stochastic
independence is a stronger condition. With stochastic independence of $\tilde{X}_i$ and $\tilde{u}_j$,
the distribution of $\tilde{u}_j$ is not at all affected by X, so (A5) and stochastic independence
imply $U\mid X \sim N(0, \sigma^2 I)$. The converse is not true, i.e. $U\mid X \sim N(0, \sigma^2 I)$ does not
necessarily imply that $\tilde{u}_i \sim N(0, \sigma^2)$.44
Using similar analyses as in Chapter 3, the OLS method in this case of
stochastic regressors produces

$$\hat{\alpha}_{OLS} = \frac{1}{N}\sum_{i=1}^{N}\tilde{Y}_i - \hat{\beta}\,\frac{1}{N}\sum_{i=1}^{N}\tilde{X}_i
\quad\text{and}\quad
\hat{\beta}_{OLS} = \frac{\sum_{i=1}^{N}\left(\tilde{X}_i - \frac{1}{N}\sum_{i=1}^{N}\tilde{X}_i\right)\left(\tilde{Y}_i - \frac{1}{N}\sum_{i=1}^{N}\tilde{Y}_i\right)}{\sum_{i=1}^{N}\left(\tilde{X}_i - \frac{1}{N}\sum_{i=1}^{N}\tilde{X}_i\right)^2}.$$

Notice we have chosen to express the OLS estimators as functions of all
random variables at this point. Then quite clearly, the OLS estimators $\hat{\alpha}$ and $\hat{\beta}$
are themselves random variables. Given that $(\tilde{X}_i, \tilde{Y}_i)$ for each i takes
realizations $(X_i, Y_i)$, the estimators $\hat{\alpha}$ and $\hat{\beta}$ also take realizations in the
form of estimates (unique for each sample $\{X_i, Y_i\}_{i=1,2,3,\ldots,N}$).
The estimates are

$$\hat{\beta} = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}.$$

44 See William H. Greene, (2000), Econometric Analysis, 4th edition, Prentice-Hall,
pp. 213-223 for details.
Again the symbols $\hat{\alpha}$ and $\hat{\beta}$ shall be used both in the context of random
variables (estimators) and in the context of estimates. Now, as an estimator,

$$\hat{\beta} = \frac{\sum_{i=1}^{N}\left(\tilde{X}_i - \bar{X}\right)\left(\tilde{Y}_i - \bar{Y}\right)}{\sum_{i=1}^{N}\left(\tilde{X}_i - \bar{X}\right)^2}
= \beta + \frac{\sum_{i=1}^{N}\left(\tilde{X}_i - \bar{X}\right)\left(\tilde{u}_i - \bar{u}\right)}{\sum_{i=1}^{N}\left(\tilde{X}_i - \bar{X}\right)^2},$$

where $\bar{X} = \frac{1}{N}\sum_{i=1}^{N}\tilde{X}_i$, $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N}\tilde{Y}_i$, and $\bar{u} = \frac{1}{N}\sum_{i=1}^{N}\tilde{u}_i$.

Taking the expectation of the random variable $\hat{\beta}$, under stochastic independence of
X and U as in (A4),

$$E(\hat{\beta}) = \beta + E\left[\frac{\sum_{i=1}^{N}(\tilde{X}_i - \bar{X})(\tilde{u}_i - \bar{u})}{\sum_{i=1}^{N}(\tilde{X}_i - \bar{X})^2}\right] = \beta.$$

If we use the weaker set of assumptions without invoking stochastic
independence between X and U, then

$$E(\hat{\beta}) = \beta + E\left[E\left(\frac{\sum_{i=1}^{N}(\tilde{X}_i - \bar{X})(\tilde{u}_i - \bar{u})}{\sum_{i=1}^{N}(\tilde{X}_i - \bar{X})^2}\,\Bigg|\, X\right)\right]
= \beta + E_X\left[\frac{\sum_{i=1}^{N}(\tilde{X}_i - \bar{X})\,E(\tilde{u}_i - \bar{u}\mid X)}{\sum_{i=1}^{N}(\tilde{X}_i - \bar{X})^2}\right] = \beta,$$

since conditional on X = {X_1, X_2, X_3, …, X_N}, $E(\tilde{u}_i\mid X) = 0$. Likewise it
can be shown that $E(\hat{\alpha}) = \alpha$. This is because

$$E(\hat{\alpha}) = E(\bar{Y}) - E(\hat{\beta}\,\bar{X}) = \left[\alpha + \beta\, E(\bar{X})\right] - E\!\left[E(\hat{\beta}\mid X)\,\bar{X}\right] = \alpha + \beta\, E(\bar{X}) - \beta\, E(\bar{X}) = \alpha.$$

Since $\hat{\alpha}$ and $\hat{\beta}$ purport to estimate α and β, and their expected values or
means are exactly α and β, we say that the estimators $\hat{\alpha}$ and $\hat{\beta}$ are unbiased.
Graphically, we show in Figure 10.1 how the distribution of an estimator looks.

Figure 10.1
[The figure shows the sampling density of the estimator $\hat{\beta}_N$, centered at its mean, with one standard error marked on either side of the mean.]
If, amongst the class of all unbiased linear estimators $\sum_{i=1}^{N} w_i Y_i$, for different
weights $w_i$ that are not functions of the $Y_i$'s nor of the parameters, but just
functions of X, the estimators $\hat{\alpha}$ and $\hat{\beta}$ have the least variance given the same
sample size N, then $\hat{\alpha}$ and $\hat{\beta}$ are said to be efficient estimators conditional on X.
This is the Gauss-Markov Theorem in the case of stochastic regressors. Thus OLS
$\hat{\alpha}$ and $\hat{\beta}$ are called Best Linear45 Unbiased Estimators, or BLUE, conditional
on X. Note that BLUE refers to a finite sample property, i.e. it applies for any
finite sample size N.
Now consider what happens when the sample size N goes to infinity.
Divide the numerator and denominator in the estimator $\hat{\beta}$ by N:

$$\hat{\beta} = \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2}.$$

By the law of large numbers for stationary random variables,
$\operatorname{plim}_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N} X_i = E(X_i)$. So,

$$\operatorname{plim}\hat{\beta} = \frac{\operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\left[X_i - E(X_i)\right]\left[Y_i - E(Y_i)\right]}{\operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\left[X_i - E(X_i)\right]^2} = \frac{\operatorname{cov}(X_i, Y_i)}{\operatorname{var}(X_i)} = \beta.$$

This convergence is in the sense of a probability limit (plim), or it is
convergence in probability. It means that, in the ordinary limit sense of
numbers in Euclidean distance,

$$\lim_{N\to\infty}\operatorname{prob}\left(|\hat{\beta} - \beta| > \epsilon\right) = 0 \ \text{ for any } \epsilon > 0, \quad\text{or}\quad \operatorname{plim}\hat{\beta} = \beta.$$

When the estimators $\hat{\alpha}$ and $\hat{\beta}$ converge in the probability limit to the population
parameters α and β that they are meant to estimate, $\hat{\alpha}$ and $\hat{\beta}$ are
said to be consistent estimators. Remember that they are still random variables
although the variances are infinitesimally small. The variances are called
asymptotic variances.
45 When dealing with stochastic regressors X, some authors, e.g. Judge et al., (1980),
in Introduction to the Theory and Practice of Econometrics (2nd ed., p. 573), John
Wiley, prefer not to call this a linear estimator since it is now a stochastic function of
Y. It is strictly a stochastic function because it contains $w_i$ that is a function of
stochastic X. Hence it is strictly not BLUE, unless we again condition on X.
Conditioned on X, it of course has minimum variance amongst all unbiased estimators
of the same linear form.
The convergence can be depicted as a series of graphs with different
sample sizes N, as in Figure 10.2.

Figure 10.2
[The figure overlays the sampling densities of $\hat{\beta}_N$ for N = 60, N = 200, and N = 1000; the densities are centered at the same mean and become more concentrated as N increases.]
The above density functions depict the sampling distributions of the
estimator $\hat{\beta}_N$ for different N. Furthermore, if amongst all the linear consistent
estimators, $\hat{\alpha}$ and $\hat{\beta}$ have the least asymptotic variances, then they are
asymptotically efficient.
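A small simulation makes the unbiasedness and consistency picture concrete. The Python sketch below is illustrative only (the true parameter values and distributions are assumptions): it repeatedly redraws both the stochastic regressor X and the disturbances, re-estimates β̂ by OLS for several sample sizes, and reports the mean and standard deviation of the β̂ draws, which should centre on the true β and tighten as N grows, as in Figure 10.2.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, beta, sigma_u = 1.0, 0.5, 2.0      # assumed "true" parameters (illustrative)
n_rep = 5000

for N in (60, 200, 1000):
    b_hats = np.empty(n_rep)
    for r in range(n_rep):
        X = rng.normal(10.0, 3.0, N)      # stochastic regressor, redrawn each replication
        u = rng.normal(0.0, sigma_u, N)   # disturbances independent of X
        Y = alpha + beta * X + u
        b_hats[r] = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
    print(f"N={N:5d}: mean(beta_hat)={b_hats.mean():.4f}, s.d.={b_hats.std(ddof=1):.4f}")
```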
Under the classical conditions, OLS estimators are BLUE, consistent, and
asymptotically efficient (amongst the class of linear consistent estimators).
With additional distributional assumption (A5), OLS estimators are also
normally distributed, and statistical inference becomes possible in finite
sample. (Note that without (A5), statistical inference is sometimes possible,
and if so, only in the asymptotic sense when the sample size is very large and
that the statistics converge to some kind of normality under the Central Limit
Theorem.)
However, there are situations in which OLS will not possess these
desirable sampling properties.
It is a good time to pause and ask why a stochastic regressor $\tilde{X}$ is necessary.
In economics and finance, the explanatory variables are not drawn within a
laboratory control setting. Instead, they are drawn from some probability
distribution, and therefore are not determined ex-ante. Therefore the
assumption of stochastic independence between X and U is important, or else
it is important to be able to say that given X, U behaves like white noise.
However, if we are able to reproduce a new time series sample {$(X_i, Y_i)$ for
each i} by repeatedly sampling new $Y_i$ while keeping $X_i$ fixed, then the
weaker set of classical conditions that are conditioned on X becomes quite
intuitive and obvious. We shall keep to the more general setting of stochastic
regressors except when situations allow otherwise.
The above discussion extends to multivariate linear regression with more
than two regressors including the constant. For multivariate regression
involving many variables, it is more convenient to work with matrices. For the
set of k explanatory variables including the constant, the matrix notation X Nxk
is used. The k columns denote the k number of regressors, while the N rows
denote the sample values of each regressor in a sample of size N.
The homoskedasticity and non-autocorrelation condition is written as
$E(U_{N\times 1}U_{N\times 1}^T \mid X_{N\times k}) = \sigma^2 I_{N\times N}$, and the conditional zero mean condition is
written as $E(U_{N\times 1}\mid X_{N\times k}) = 0$. It should be noted that in the case when we
treat the explanatory variables as constants, or pre-determined (i.e. known to the
econometrician when performing the regression), or when we interpret the
results as conditional on X, then $\operatorname{var}(\hat{B}) = \sigma^2(X^TX)^{-1}$. These notations
will become clearer as we proceed to the next section.


If in addition we have (A5) or normal disturbances, then OLS conditional
on X gives exactly the same estimators as maximum likelihood estimators,
and under the Rao-Blackwell Theorem, OLS estimators are not only BLUE,
but also of minimum variance or best (efficient) amongst the bigger class of
all unbiased estimators (including nonlinear ones).
However, when we treat X as independently stochastic, then

$$\operatorname{var}(\hat{B}) = \sigma^2\, E\!\left[(X^TX)^{-1}\right] \qquad (10.3)$$

where the expectation is taken with respect to the multivariate distribution of X.


This gives a vivid idea of the difference in result between constant or pre-determined X and stochastic regressor X. However, in practice, with time
series X, it may not be possible to estimate E[(XTX)-1]. Therefore, most
econometric packages still apply (XTX)-1 as a hopefully good approximation
when it comes to statistical inference. The limitation must therefore be noted.
In particular, we are testing based on given X, even when X is stochastic. This
limitation is sometimes unavoidable, much as we would like to expand the analyses fully
to a stochastic X.
Sometimes this distinction between constant and stochastic regressors can
become very confusing for students, and basic textbooks sometimes waive the
stochastic regressor treatment and show results under the conditional X (or
fixed X, or repeated X sampling) situation. When X and U are stochastically
independent, and the classical conditions are met, the OLS test statistics such
as the t- and F-statistics remain valid provided the appropriate variances are used for
normalizing.
10.2

MULTIPLE LINEAR REGRESSION

When we extend the linear regression analyses to more than two variables, we
are dealing with the multiple linear regression model, viz.
Yi = b0 + b1X1i + b2X2i + b3X3i + .. + bk-1Xk-1i + ui

(10.4)

where the subscript ji of X denotes the jth non-constant explanatory variable and


the ith sample point (we can also use t instead when we deal with a time
series). Note that, including the constant, there are k explanatory variables.
The matrix formulation of (10.4) can be written concisely as

$$Y_{N\times 1} = X_{N\times k}\, B_{k\times 1} + U_{N\times 1},$$

where

$$Y_{N\times 1} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix}, \quad
X_{N\times k} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{k-1,1} \\ 1 & X_{12} & \cdots & X_{k-1,2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{k-1,N} \end{pmatrix}, \quad
B_{k\times 1} = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{k-1} \end{pmatrix}, \quad\text{and}\quad
U_{N\times 1} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_N \end{pmatrix}.$$
The classical conditions in matrix format are:

(A1) $E(U) = 0_{N\times 1}$
(A2) $\operatorname{var}(U) = \sigma_u^2 I_{N\times N}$, where $E(u_i^2) = \sigma_u^2$, a same constant for every i, and
$E(u_iu_j) = 0$ for every i ≠ j. Thus this matrix condition provides for both the
condition of variance homoskedasticity (constant variance) and the condition of
cross-variable zero correlation that are separately specified in the univariate case in
Chapter 3.
(A3) X and U are stochastically independent of each other. Being a matrix and a
vector, this means any pair of elements from X and U are stochastically independent.
(A4) $U \sim N(0, \sigma_u^2 I)$.

Unless otherwise mentioned, we shall mostly use the simpler treatment
where X is assumed to be given, so we can treat it as a constant matrix. This is
to facilitate learning of the multivariate case in the rest of this chapter. Later,
we may require the consideration of cases of stochastic regressor X. In this
scenario, the classical conditions can be relaxed by removing (A3) and
strengthening (A1) to $E(U\mid X) = 0_{N\times 1}$, (A2) to $\operatorname{var}(U\mid X) = E(UU^T\mid X) = \sigma_u^2 I_{N\times N}$,
and (A4) to $U\mid X \sim N(0, \sigma_u^2 I)$. Note that $E(U\mid X) = 0_{N\times 1}$ implies $E(X^TU) = 0$ by the
Law of Iterated Expectations. Thus, it also implies that X and U are not
contemporaneously correlated, i.e. $\operatorname{cov}(X_{ji}, u_i) = 0$ for every j.
Let the estimated residuals be $\hat{U} = Y - X\hat{B}$. The sum of squared residuals
(or residual sum of squares, RSS) is $\sum_{t=1}^{N}\hat{u}_t^2 = \hat{U}^T\hat{U}$. Applying the Ordinary
Least Squares method to obtain the OLS estimate of B:

$$\min_{\hat{B}}\ \hat{U}^T\hat{U} = (Y - X\hat{B})^T(Y - X\hat{B}) = Y^TY - 2\hat{B}^TX^TY + \hat{B}^TX^TX\hat{B}.$$

The first order condition gives:

$$\frac{\partial\,\hat{U}^T\hat{U}}{\partial\hat{B}} = -2X^TY + 2X^TX\hat{B} = 0.$$

So, $\hat{B} = (X^TX)^{-1}X^TY$, provided that the inverse of $(X^TX)$ exists. $(X^TX)$ is of
dimension k × k. For the inverse of $(X^TX)$ to exist, $(X^TX)$ must have full rank k.
The rank of $X_{N\times k}$ is at most min(N, k). Thus, k must be no larger than N. This
means that when we perform a regression involving k explanatory variables
(including the constant), we must employ a sample size N of at least k.
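In code, the OLS formula above is essentially a one-liner once X includes the column of ones. The minimal numpy sketch below is an illustration with simulated data (all values are assumptions); it solves the normal equations rather than inverting X^TX explicitly, which is numerically safer but algebraically equivalent to B̂ = (X^TX)^{-1}X^TY.

```python
import numpy as np

rng = np.random.default_rng(7)
N, k = 100, 3
X = np.column_stack([np.ones(N),                 # constant regressor
                     rng.normal(size=N),         # X1
                     rng.normal(size=N)])        # X2
B_true = np.array([0.5, 1.0, -2.0])              # illustrative "true" coefficients
Y = X @ B_true + rng.normal(0, 0.3, N)

# OLS estimate: solve (X'X) B = X'Y
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

U_hat = Y - X @ B_hat                            # estimated residuals
sigma2_hat = (U_hat @ U_hat) / (N - k)           # unbiased estimate of sigma_u^2
cov_B_hat = sigma2_hat * np.linalg.inv(X.T @ X)  # var(B_hat) = sigma^2 (X'X)^{-1}
print(B_hat, np.sqrt(np.diag(cov_B_hat)))
```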
10.3 LEAST SQUARES THEORY

The idea of least squares can be thought of as a linear orthogonal (least
distance) projection, as follows. See Figure 10.3.
Think of the N×1 vector Y as a vector in N-dimensional Euclidean
space (or Cartesian space, or the R^N vector space) and each N×1 vector XB (for
each B) as a vector in the same N-dimensional space but lying on a subspace which is a
k-dimensional hyperplane. This is a non-trivial departure in conception. For
example, in the 2-variable case we look at Y versus X in 2 dimensions, but here
we think of Y and XB in N dimensions. The idea is to find a projection vector.

Figure 10.3
[The figure shows the vector Y in R^N, the k-dimensional hyperplane formed by XB (for any B), i.e. by linear combinations of the k columns of $X_{N\times k}$, the fitted vector $X\hat{B}$ lying on the hyperplane, and the residual vector $Y - X\hat{B}$ orthogonal to the hyperplane.]

If $X\hat{B}$ is the orthogonal projection from Y to the hyperplane formed
by XB (any B), then $(XB)^T(Y - X\hat{B}) = 0$, or $B^TX^T(Y - X\hat{B}) = 0$, for any B.
Thus $X^T(Y - X\hat{B}) = 0$. Therefore, we obtain $\hat{B} = (X^TX)^{-1}X^TY$, as in the OLS
regression estimate.

10.4 PROPERTIES OF OLS ESTIMATORS

$$\hat{B} = (X^TX)^{-1}X^TY = (X^TX)^{-1}X^T(XB + U) = B + (X^TX)^{-1}X^TU.$$

Conditional on X, $E(\hat{B}) = B$. If X and U are stochastically independent as in
(A3), then also $E(\hat{B}) = B + E\left[(X^TX)^{-1}X^T\right]E(U) = B$.
In fact, as in the two-variable case, the OLS estimator $\hat{B}$ is BLUE. Note also
that in this case, there is no restriction on X (i.e. it is not necessary that there
must be a constant regressor amongst X for $\hat{B}$ to be BLUE). Therefore,

$$\hat{B} - B = (X^TX)^{-1}X^TU.$$

The k × k covariance matrix of $\hat{B}$, given X, is

$$\begin{aligned}
\operatorname{var}(\hat{B}) &= E\left[(\hat{B}-B)(\hat{B}-B)^T\right]
= E\left[(X^TX)^{-1}X^TU\,U^TX(X^TX)^{-1}\right] \\
&= (X^TX)^{-1}X^T E(UU^T)\,X(X^TX)^{-1}
= \sigma_u^2 (X^TX)^{-1}X^T I X(X^TX)^{-1}
= \sigma_u^2 (X^TX)^{-1}.
\end{aligned}$$

Thus, $\hat{B}_{k\times 1} \sim N\!\left(B_{k\times 1},\ \sigma_u^2(X^TX)^{-1}\right)$.

Or, if X is stochastic, but $E(UU^T\mid X) = \sigma_u^2 I$, then

$$\operatorname{var}(\hat{B}) = E\left\{E\left[(X^TX)^{-1}X^T UU^T X(X^TX)^{-1}\mid X\right]\right\}
= \sigma_u^2\, E\left[(X^TX)^{-1}X^T I X(X^TX)^{-1}\right]
= \sigma_u^2\, E\left[(X^TX)^{-1}\right].$$

Now, the unbiased sample estimate of $\sigma_u^2$ is $\hat{\sigma}_u^2 = \dfrac{\hat{U}^T\hat{U}}{N-k}$. If $H_0$: $B_j = 1$, the
statistical distribution of the test statistic

$$\frac{\hat{B}_j - 1}{\hat{\sigma}_u\sqrt{j\text{th diagonal element of }(X^TX)^{-1}}}$$

is $t_{N-k}$.

10.5 TESTS OF RESTRICTIONS

Consider a linear combination $R_{q\times k}\hat{B}_{k\times 1}$. Suppose the null hypothesis is
based on q linear restrictions on the coefficients, $H_0$: $R_{q\times k}B_{k\times 1} = r_{q\times 1}$. Then
$R_{q\times k}\hat{B}_{k\times 1} - r_{q\times 1}$, where R and r are constants, is normally distributed since $\hat{B}$ is
normally distributed given the classical assumptions. The idea is to test if the
deviation of $R\hat{B}$ from r is attributable to sampling error (i.e. we cannot reject $H_0$) or
whether it is significant (i.e. we reject $H_0$).

$$E(R\hat{B}) = RB.$$

The covariance matrix of $R\hat{B} - r$ is the covariance matrix of $(R\hat{B})_{q\times 1}$ since r is
constant. This is also the covariance matrix of $R(\hat{B} - B)$ since RB is constant.

$$\operatorname{var}(R\hat{B} - r) = E\left[R(\hat{B}-B)(\hat{B}-B)^TR^T\right] = \sigma^2 R(X^TX)^{-1}R^T,$$

or $\operatorname{var}(R\hat{B} - r) = R\operatorname{cov}(\hat{B})R^T$, where $\operatorname{cov}(\hat{B}) = \sigma_u^2(X^TX)^{-1}$ as noted in the last section.

We provide an illustration as follows. Note in what follows that
$\operatorname{var}(R_{2\times 3}\hat{B}_{3\times 1})$ yields a 2×2 covariance matrix.

$$\operatorname{var}\!\left(\begin{pmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \end{pmatrix}\begin{pmatrix} \hat{a} \\ \hat{b} \\ \hat{c} \end{pmatrix}\right)
= \operatorname{var}\!\begin{pmatrix} x_1\hat{a} + x_2\hat{b} + x_3\hat{c} \\ y_1\hat{a} + y_2\hat{b} + y_3\hat{c} \end{pmatrix}
= \begin{pmatrix} x^TVx & x^TVy \\ y^TVx & y^TVy \end{pmatrix}
= \begin{pmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \end{pmatrix} V \begin{pmatrix} x_1 & y_1 \\ x_2 & y_2 \\ x_3 & y_3 \end{pmatrix} = RVR^T$$

where $x = (x_1\ x_2\ x_3)^T$, $y = (y_1\ y_2\ y_3)^T$, and $V_{3\times 3} = \operatorname{cov}\!\begin{pmatrix} \hat{a} \\ \hat{b} \\ \hat{c} \end{pmatrix}$.

A standard result is that if a random vector $Z_{q\times 1} \sim N(0, \Sigma_{q\times q})$, then $Z^T\Sigma^{-1}Z \sim \chi^2_q$.

So,

$$\left(R\hat{B} - r\right)_{q\times 1}^T\left[\sigma_u^2\, R(X^TX)^{-1}R^T\right]^{-1}\left(R\hat{B} - r\right)_{q\times 1} \sim \chi^2_q,$$

where $\hat{B}$ is the unconditional OLS estimate of B in Y = XB + U. This is also
called the Wald criterion. It may be used as a Wald $\chi^2$-test if the sample size is
very large so that the unknown $\sigma_u^2$ can be substituted by the estimate $\hat{\sigma}_u^2$.

But $\dfrac{\hat{U}^T\hat{U}}{\sigma_u^2} \sim \chi^2_{N-k}$, and is independent of $\hat{B}$. Thus,

$$\frac{\left(R\hat{B} - r\right)^T\left[R(X^TX)^{-1}R^T\right]^{-1}\left(R\hat{B} - r\right)/q}{\hat{U}^T\hat{U}/(N-k)} \sim F_{q,\,N-k}. \qquad (10.5)$$

This provides a test of $H_0$: RB = r.


The test of this restriction, $H_0$: RB = r, can also be constructed in another
way. Suppose we run OLS on a restricted regression by imposing RB = r.
Hence, minimize $(Y-XB)^T(Y-XB)$ subject to RB = r. Solve the Lagrangian

$$\min_{B,\lambda}\ (Y - XB)^T(Y - XB) + 2\lambda^T(RB - r).$$

The first order conditions are:

$$-2X^T\left(Y - X\hat{B}_C\right) + 2R^T\hat{\lambda} = 0$$

and also $2\left(R\hat{B}_C - r\right) = 0$, where $\hat{B}_C$ is the OLS estimator under the constrained
regression. Or,

$$\begin{pmatrix} X^TX & R^T \\ R & 0 \end{pmatrix}\begin{pmatrix} \hat{B}_C \\ \hat{\lambda} \end{pmatrix} = \begin{pmatrix} X^TY \\ r \end{pmatrix}.$$

Then,

$$\begin{pmatrix} \hat{B}_C \\ \hat{\lambda} \end{pmatrix}
= \begin{pmatrix} X^TX & R^T \\ R & 0 \end{pmatrix}^{-1}\begin{pmatrix} X^TY \\ r \end{pmatrix}
= \begin{pmatrix} (X^TX)^{-1}\left[I - R^TCR(X^TX)^{-1}\right] & (X^TX)^{-1}R^TC \\ CR(X^TX)^{-1} & -C \end{pmatrix}\begin{pmatrix} X^TY \\ r \end{pmatrix}$$

where $C = \left[R(X^TX)^{-1}R^T\right]^{-1}$, using the partitioned matrix inverse. Thus,

$$\begin{aligned}
\hat{B}_C &= (X^TX)^{-1}\left[I - R^TCR(X^TX)^{-1}\right]X^TY + (X^TX)^{-1}R^TCr \\
&= (X^TX)^{-1}X^TY - (X^TX)^{-1}R^TC\left[R(X^TX)^{-1}X^TY - r\right] \\
&= \hat{B} - (X^TX)^{-1}R^TC\left(R\hat{B} - r\right) \\
&= \hat{B} - (X^TX)^{-1}R^T\left[R(X^TX)^{-1}R^T\right]^{-1}\left(R\hat{B} - r\right).
\end{aligned}$$

Therefore,

$$\hat{B}_C - \hat{B} = -(X^TX)^{-1}R^T\left[R(X^TX)^{-1}R^T\right]^{-1}\left(R\hat{B} - r\right).$$

Now, the constrained estimated residual is

$$\hat{U}_C = Y - X\hat{B}_C = Y - X\hat{B} - X(\hat{B}_C - \hat{B}) = \hat{U} - X(\hat{B}_C - \hat{B}).$$

Hence

$$\hat{U}_C^T\hat{U}_C = \hat{U}^T\hat{U} + (\hat{B}_C - \hat{B})^TX^TX(\hat{B}_C - \hat{B}), \quad\text{since } X^T\hat{U} = 0.$$

Then $\hat{U}_C^T\hat{U}_C - \hat{U}^T\hat{U} = \left(R\hat{B} - r\right)^T\left[R(X^TX)^{-1}R^T\right]^{-1}\left(R\hat{B} - r\right)$, which is the
numerator in (10.5) before dividing by q. Hence we see that

$$\frac{(\text{CSSR} - \text{USSR})/q}{\text{USSR}/(N-k)} \sim F_{q,\,N-k}$$

where CSSR is the constrained sum of squared residuals and USSR is the
unconstrained sum of squared residuals.
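The CSSR/USSR form of the F-test is easy to verify numerically. The sketch below is an illustration (not the book's code; the data are simulated and all names are assumptions): it tests H0: b1 = b2 = 0 in a 3-regressor model by running the unconstrained regression, the constrained regression (here just a constant), and forming [(CSSR - USSR)/q] / [USSR/(N - k)].

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N, k, q = 120, 3, 2
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
Y = X @ np.array([1.0, 0.4, 0.0]) + rng.normal(0, 1.0, N)   # illustrative data

def rss(Xmat, y):
    """Residual sum of squares from an OLS fit of y on Xmat."""
    b = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)
    e = y - Xmat @ b
    return e @ e

ussr = rss(X, Y)                        # unconstrained: all k regressors
cssr = rss(X[:, :1], Y)                 # constrained by b1 = b2 = 0: constant only

F = ((cssr - ussr) / q) / (ussr / (N - k))
p_value = stats.f.sf(F, q, N - k)       # upper-tail p-value of F(q, N-k)
print(F, p_value)
```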
If we apply R = $I_{k\times k}$ and r = 0 in the restriction, then we are really testing $H_0$:
$b_0 = b_1 = b_2 = \cdots = b_{k-1} = 0$. The test statistic for this $H_0$ is

$$\frac{\hat{B}^T X^TX\, \hat{B}/k}{\hat{U}^T\hat{U}/(N-k)} \sim F_{k,\,N-k}.$$
The above is a test of whether the Xs explain Y or allow a linear fitting of Y.
Suppose the Xs do not explain Y. But as long as the mean of Y is not zero,
then the constraint b0=0 should not be used. If used, there will be rejection of
the H0, but this will lead to the wrong conclusion that Xs explain Y.
In other words, if we allow or maintain the mean of Y to be non-zero, a
more suitable test of whether Xs affect Y is H0: b1= b2=..=bk-1=0. We leave
out the constraint b0=0. How do we obtain such a test statistic?
The restrictions $H_0$: $b_1 = b_2 = \cdots = b_{k-1} = 0$ are equivalent to the matrix
restriction $R_{(k-1)\times k}B = 0_{(k-1)\times 1}$ where $R = [0\mid I_{k-1}]$, with its first column containing
all zeros. Partition $X = [\mathbf{1}\mid X^*]$ where the first column $\mathbf{1}$ contains all ones, and
$X^*$ is N × (k-1). Then

$$X^TX = [\mathbf{1}\mid X^*]^T[\mathbf{1}\mid X^*] = \begin{pmatrix} N & \mathbf{1}^TX^* \\ X^{*T}\mathbf{1} & X^{*T}X^* \end{pmatrix}.$$

Now $R(X^TX)^{-1}R^T = [0\mid I_{k-1}](X^TX)^{-1}[0\mid I_{k-1}]^T$. This produces the bottom-right
(k-1) × (k-1) submatrix of $(X^TX)^{-1}$. But by the partitioned matrix result in
linear algebra, this submatrix is

$$\left[X^{*T}X^* - X^{*T}\mathbf{1}\,\tfrac{1}{N}\,\mathbf{1}^TX^*\right]^{-1} = \left[X^{*T}\left(I - \tfrac{1}{N}\mathbf{1}\mathbf{1}^T\right)X^*\right]^{-1}$$

where $\left(I - \tfrac{1}{N}\mathbf{1}\mathbf{1}^T\right) \equiv M^0$ is symmetric and idempotent, and transforms a
matrix into deviation (from mean) form, i.e.

$$M^0X^* = \begin{pmatrix}
X_{11}-\bar{X}_1 & X_{21}-\bar{X}_2 & \cdots & X_{(k-1)1}-\bar{X}_{k-1} \\
X_{12}-\bar{X}_1 & X_{22}-\bar{X}_2 & \cdots & X_{(k-1)2}-\bar{X}_{k-1} \\
\vdots & \vdots & & \vdots \\
X_{1N}-\bar{X}_1 & X_{2N}-\bar{X}_2 & \cdots & X_{(k-1)N}-\bar{X}_{k-1}
\end{pmatrix} \equiv X^{**}.$$

Hence $R(X^TX)^{-1}R^T = (X^{*T}M^0X^*)^{-1} = [X^{**T}X^{**}]^{-1}$. So $[R(X^TX)^{-1}R^T]^{-1} = X^{**T}X^{**}$.

From the OLS definition of the estimated residuals $\hat{e}$, $Y = X\hat{B} + \hat{e}$. So,
$M^0Y = M^0X\hat{B} + M^0\hat{e}$. Now $M^0\hat{e} = \hat{e}$, since for any column of X, including
$\mathbf{1}$, $X^T\hat{e} = 0$. This comes from the normal equations $X^TX\hat{B} - X^TY = 0$, or
$-X^T(Y - X\hat{B}) = 0$, or $-X^T\hat{e} = 0$. Then $M^0Y = 0 + X^{**}\hat{B}^{**} + \hat{e}$, since $M^0\mathbf{1} = 0$. Let

$$M^0Y = \begin{pmatrix} Y_1 - \bar{Y} \\ Y_2 - \bar{Y} \\ \vdots \\ Y_N - \bar{Y} \end{pmatrix} \equiv Y^{**}.$$

Then, from the above,

$$Y^{**} = X^{**}\begin{pmatrix} \hat{b}_1 \\ \hat{b}_2 \\ \vdots \\ \hat{b}_{k-1} \end{pmatrix} + \hat{e}$$

where the constant estimate $\hat{b}_0$ is left out. If we define

$$\hat{B}^{**} = \begin{pmatrix} \hat{b}_1 \\ \hat{b}_2 \\ \vdots \\ \hat{b}_{k-1} \end{pmatrix}$$

then $Y^{**} = X^{**}\hat{B}^{**} + \hat{e}$, where $\hat{B}^{**} = \left(X^{**T}X^{**}\right)^{-1}X^{**T}Y^{**}$.

Thus, $Y^{**T}Y^{**} = \hat{B}^{**T}X^{**T}X^{**}\hat{B}^{**} + \hat{e}^T\hat{e}$,

since $X^{**T}\hat{e} = (M^0X^*)^T\hat{e} = X^{*T}M^0\hat{e} = X^{*T}\hat{e} = 0$.
Hence, TSS = ESS + RSS.
Now,

$$\frac{\left(R\hat{B}\right)^T\left[R(X^TX)^{-1}R^T\right]^{-1}\left(R\hat{B}\right)}{\sigma_u^2}
= \frac{\hat{B}^{**T}X^{**T}X^{**}\hat{B}^{**}}{\sigma_u^2}
= \frac{\left(X^{**}\hat{B}^{**}\right)^T\left(X^{**}\hat{B}^{**}\right)}{\sigma_u^2} \sim \chi^2_{k-1}$$

under $H_0$. Hence the test statistic is

$$\frac{\hat{B}^{**T}X^{**T}X^{**}\hat{B}^{**}/(k-1)}{\hat{e}^T\hat{e}/(N-k)} \sim F_{k-1,\,N-k}.$$

Moreover, since the numerator is ESS/(k-1), then

$$\frac{\text{ESS}/\left[\text{TSS}\,(k-1)\right]}{\text{RSS}/\left[\text{TSS}\,(N-k)\right]} \sim F_{k-1,\,N-k}.$$

Therefore,

$$\frac{R^2/(k-1)}{(1-R^2)/(N-k)} \sim F_{k-1,\,N-k}.$$

Note that if $R^2$ is large, the F statistic is also large. What happens to the test on
$H_0$: $b_1 = b_2 = \cdots = b_{k-1} = 0$? Do we tend to reject or accept?
While $R^2$ measures how well the regression line $X\hat{B}$ fits the sample, usually the
adjusted $R^2$ is used to check how well Y is explained by the model XB:

$$\bar{R}^2 = 1 - \frac{\text{RSS}/(N-k)}{\text{TSS}/(N-1)} = \frac{1-k}{N-k} + \frac{N-1}{N-k}\,R^2.$$

As we can arbitrarily increase $R^2$ by using more explanatory variables, or
overfitting, for any fixed N an increase in k will be compensated for by a
reduction to a smaller $\bar{R}^2$, ceteris paribus.
Three other common criteria for comparing the fit of various specifications
or models XB are:

Schwarz criterion: $\text{SC} = -2\dfrac{L^*}{N} + \dfrac{k}{N}\ln N$

Akaike information criterion: $\text{AIC} = -2\dfrac{L^*}{N} + \dfrac{2k}{N}$

Hannan-Quinn criterion: $\text{HQ} = -2\dfrac{L^*}{N} + \dfrac{2k}{N}\ln\ln N$

where $L^* = \ln\left[\left(2\pi\hat{\sigma}^2\right)^{-N/2}\exp\left(-\dfrac{1}{2\hat{\sigma}^2}\sum_{i=1}^{N}\hat{e}_i^2\right)\right]$ is the maximized log-likelihood, and $\sum_{i=1}^{N}\hat{e}_i^2$ is the SSR.

Unlike $R^2$ and $\bar{R}^2$, where a better fit yields a larger number, here smaller SC,
AIC, and HQ values indicate better fits. This is due to the penalty imposed by larger k.
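Given a fitted regression, these diagnostics follow directly from the formulas above. The sketch below is illustrative only (the function name and data are assumptions): it computes R², adjusted R², and the per-observation Schwarz, Akaike, and Hannan-Quinn criteria from the maximized log-likelihood L*; note that different econometric packages may scale or sign these criteria slightly differently.

```python
import numpy as np

def fit_criteria(X, Y):
    """R^2, adjusted R^2, and per-observation SC, AIC, HQ for an OLS fit of Y on X."""
    N, k = X.shape
    B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ B_hat
    ssr = e @ e
    tss = np.sum((Y - Y.mean())**2)

    r2 = 1.0 - ssr / tss
    adj_r2 = 1.0 - (ssr / (N - k)) / (tss / (N - 1))

    sigma2_ml = ssr / N                                        # ML variance estimate
    L_star = -0.5 * N * (np.log(2 * np.pi * sigma2_ml) + 1.0)  # maximized log-likelihood
    aic = -2 * L_star / N + 2 * k / N
    sc = -2 * L_star / N + k * np.log(N) / N
    hq = -2 * L_star / N + 2 * k * np.log(np.log(N)) / N
    return r2, adj_r2, sc, aic, hq
```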
10.6 FORECASTING

For prediction or forecasting involving multiple explanatory variables, let the
new explanatory values be $c^T = (1\ X_2^*\ X_3^*\ \cdots\ X_k^*)$. So, the forecast is

$$\hat{Y} = c^T\hat{B}.$$

The variance of the forecast is $\operatorname{var}(c^T\hat{B}) = c^T\operatorname{var}(\hat{B})\,c$. Note that the variance
of the forecast is not the variance of the forecast error.
However, in terms of the prediction or forecast error in the next period,
$Y = c^TB + U^*_{1\times 1}$, so

$$Y - \hat{Y} = U^* - c^T(\hat{B} - B).$$

So $\operatorname{var}(Y - \hat{Y}) = \sigma_u^2\left[1 + c^T(X^TX)^{-1}c\right]$, and

$$\frac{Y - \hat{Y}}{\hat{\sigma}_u\sqrt{1 + c^T(X^TX)^{-1}c}} \sim t_{N-k}.$$

Thus the 95% confidence interval for the next Y is

$$\hat{Y} \pm t_{0.025}\,\hat{\sigma}_u\sqrt{1 + c^T(X^TX)^{-1}c}.$$
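The forecast interval formula translates directly into code. The following sketch is an illustration under simulated data (the vector c of next-period regressor values is a hypothetical choice): it computes the point forecast c^T B̂, the forecast-error standard error σ̂_u[1 + c^T(X^TX)^{-1}c]^{1/2}, and the 95% interval using the t_{N-k} critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
N, k = 80, 3
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
Y = X @ np.array([2.0, 0.8, -0.5]) + rng.normal(0, 0.4, N)   # illustrative data

XtX_inv = np.linalg.inv(X.T @ X)
B_hat = XtX_inv @ X.T @ Y
resid = Y - X @ B_hat
sigma_u = np.sqrt(resid @ resid / (N - k))

c = np.array([1.0, 0.2, -0.1])                     # hypothetical next-period regressors
y_fcst = c @ B_hat                                 # point forecast c'B_hat
se_err = sigma_u * np.sqrt(1.0 + c @ XtX_inv @ c)  # forecast-error standard error
t025 = stats.t.ppf(0.975, N - k)
print(y_fcst - t025 * se_err, y_fcst + t025 * se_err)   # 95% interval for next Y
```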

10.7 PROBLEM SETS

10.1 The following linear regression output shows a cross-sectional
regression of the increase in return rate of the stock of a takeover
target firm as dependent variable, and the capitalization size of the
firm as explanatory variable.

Dependent Variable: IRET
Method: Least Squares
Sample: 1 100
Included observations: 100

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           0.0900         0.0300        3.000          0.0034
SIZE        -0.003         ????          ????

R-squared             ????        Mean dependent var    0.1130
Adjusted R-squared    ????        S.D. dependent var    0.0650
S.E. of regression    0.0547      F-statistic           ????
Sum squared resid     0.2928      Prob(F-statistic)     0.000000

(i) Find R2 and adjusted R2


(ii) Find the F-statistic testing H0: SIZE Coefficient = 0
(iii) Find the SIZE Coefficient t-statistic and its standard error
(iv) What can you conclude about the size effect here?
10.2

In a multiple linear regression of sample size 60 with 4 explanatory
variables including the constant, the OLS estimate is $\hat{\beta}_3 = 0.867$ with a
standard error of 0.113. What is the correct value of the t-statistic to
use to test the null hypothesis $H_0$: $\beta_3 = 1$? Would you reject $H_0$ at the
5% significance level with a 2-tail test?

10.3

If you perform an OLS simple regression of Y on X, and then of $Y - \bar{Y}$ on
$X - \bar{X}$ (each regression containing a constant), will the OLS
estimates of the 2 regressions differ? Explain.

10.4

The forecast of sales of equipment is in terms of log of actual number


of units. This month, the log of sales is 6.593. The forecast for next month
is 6.648. The 95% confidence interval for the forecast is 6.648 ± 0.06.
(i) What is the equivalent forecast of percentage change in sales
level?
(ii) Indicate the 95% confidence interval for this %-change forecast.
(iii) What is the conditional expectation forecast of the level of unit
sales?
(iv) What is the confidence interval for level sales next month?
10.5

An investor runs an OLS regression on


Rt+1 = a + bYt + c Zt + et+1
where Rt is return of a well-diversified stock portfolio, Yt , and Zt are
prices of oil per barrel, and gold per kilogram, respectively. et+1 is a
disturbance noise that is i.i.d. and not correlated with any of the
explanatory variables.
(i) If he tests the H0: b=c=0, and reject H0, how would you conclude
on the empirical evidence about market efficiency?
(ii) If he cannot reject H0: b=c=0, but can reject H0: a=0, how would
you conclude on the empirical evidence about market efficiency?

10.6

The Cobb-Douglas production function states that $Q = \gamma\, L^{\alpha}K^{\beta}e^{\varepsilon}$,
where Q is output or $GDP, L is the $labor input factor, K is the $capital
input factor, γ, α, β are constant parameters, and ε is an i.i.d. noise.
Use an OLS regression to estimate γ, α, and β. What are the
estimates of the elasticity of output with respect to the input
production factors? What is the out-of-sample forecast of next period
GDP if next period inputs are expected to be L = $exp(30) and K =
$exp(25)?
ln $GDP    ln $Labor    ln $Capital
16.84      14.5         16.7
16.56      15.3         16.8
15.64      16.1         19.5
18.04      17.4         22.1
20.68      18.4         22.3
20.48      18.8         17.5
21.92      18.8         20.2
24.12      19.7         20.4
19.92      20.1         12.7
22.2       20.3         22.9
21         20.8         19.3
20.28      21.2         17.1
20.64      21.3         16.8
22.48      22.9         19.8
26.48      23.1         31.9
25.28      23.7         26.3
27.56      24.8         25.9
25.72      25.5         22.1
25.52      25.7         24.1
27.96      28.8         25.7
10.7 Random variable R is linearly related to P and H. The regression R =
c0 + c1P + c2H + e is performed, where c0, c1, and c2 are constants, and
e is an i.i.d. disturbance with mean zero. Suppose there are 52
observations per variable.
(i) If the regression is written in matrix form as Y = XB + E, what is
the matrix dimension of X? of B?
(ii) If $(X^TX)^{-1}$ is given as

$$\begin{pmatrix} 1.141 & 0.539 & 0.004 \\ 0.539 & 0.276 & 0.001 \\ 0.004 & 0.001 & 0.001 \end{pmatrix}$$

and $(X^TY)$ as

$$\begin{pmatrix} 6256 \\ 12521 \\ 64389 \end{pmatrix},$$

what is the OLS estimate of B?
(iii) If the sum of squared residuals is 1805.2, find the unbiased
estimate of the covariance matrix of E.
(iv) Compute the covariance matrix of the OLS estimator $\hat{B}$.
10.8 In a cross-sectional linear regression $y_i = a + bx_i + e_i$, where there
are 125 observations per variable, $e_i$ is homoskedastic and normally
distributed with mean zero. The OLS estimates are $\hat{a} = 0.3$, $\hat{b} = 1.09$,
$\sum_{i=1}^{125}\hat{e}_i^2 = 0.85$, and

$$X^T = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ x_1 & x_2 & x_3 & \cdots & x_{125} \end{pmatrix}, \qquad
X^TX = \begin{pmatrix} 125 & 1.5 \\ 1.5 & 3 \end{pmatrix}.$$

(i) Find the t-statistics of the OLS estimators $\hat{a}$, $\hat{b}$ under the nulls
H0: a = 0, HA: a ≠ 0; and H0: b = 1, HA: b ≠ 1.
(ii) Find the F-statistic for the test against the null H0: a = 0, b = 1. Is the
null rejected at the 5% significance level?
(iii) If the cross-section is thought to be divided into two categories I
and II in which $y_i$ would be affected by whether it is in I or II, and
this is independent of the values of $x_i$, explain whether the OLS
estimates $\hat{a}$, $\hat{b}$ above are BLUE.
(iv) If the disturbance $e_i$ in fact has a variance that changes according
to whether it belongs to category I or II as in part (iii) above, how
would you perform a least squares estimation?
10.9.

The Cobb-Douglas production function is $Y_t = a\,C_t^{\,b}\,L_t^{\,d}\,e_t$, where $Y_t$, $C_t$, and
$L_t$ are national output, capital, and labor aggregates in year t, and $e_t$ is
a residual error. Describe how you would proceed to estimate the
constants a, b, and d using linear regression.

10.10

For a financial application, consider the task of setting up a small


portfolio of stocks that tracks the market index as closely as possible.
Denote this by the vector of portfolio weights $W^{*T} = (w_1^*\ w_2^*\ w_3^*\ \cdots\ w_N^*)$.
If stock j is not included in this portfolio, then
$w_j^* = 0$. This kind of portfolio is often employed in passive fund
investing. Obviously, if we include all the index stocks, then the
tracking will be almost if not exact. However, transaction cost and
execution time constraints (e.g. when index portfolio is to be
liquidated when investor takes profit) often dictate use of a smaller set
of stocks. The tradeoff is that tracking will be less than exact, and can
only be described in terms of 90% or sometimes 80% accuracy.
Let the market portfolio be represented by the vector of portfolio
weights $W^T = (w_1\ w_2\ w_3\ \cdots\ w_N)$. Let the tracking error be $1 - R^2$,
where R2 is the coefficient of determination of the regression of
tracking portfolio return on actual market index return. If the market
returns covariance matrix is V, find 1-R2 in terms of W,W*, and V.
10.11 The following estimation output table shows a linear regression of
monthly stock i returns $r_{it}$ against a constant $C_i$ and the monthly market
return $r_{mt}$.

Dependent Variable: ri
Method: Least Squares
Sample: 1 60
Included observations: 60

Variable    Coefficient    Std. Error    t-Statistic    Prob.
Ci          -0.0005        0.05          -0.01          0.992
rm          1.52635        0.7471        2.0430         0.0456

R-squared             ????        Mean dependent var       0.0085
Adjusted R-squared    ????        S.D. dependent var       0.02
S.E. of regression    0.003       Akaike info criterion    10.75
Sum squared resid     0.000522    Schwarz criterion        10.76
Log likelihood        -588        F-statistic              ????
Durbin-Watson stat    1.57

(i) Find the missing F-statistic, R2, and adjusted R2 statistics in the
Table. (Hint: All the information required are available within the
table).
(ii) What is the null hypothesis that the reported F-statistic is testing?
What is the p-value of this F-test?
(iii) Given the market model of stock returns
$r_{it} = \alpha_i + \beta_i r_{mt} + e_{it}$
where $e_{it}$ is i.i.d. N(0, $\sigma_e^2$), as in the table above, find the unbiased estimates of
$\sigma_e^2$ and of the conditional variance var($r_{it}\mid r_{mt}$).
(iv) In (iii), what is the unconditional variance of rit ?
FURTHER RECOMMENDED READINGS
[1] Fama, Eugene F., and James D. MacBeth, (1973), Risk, Return, and
Equilibrium, Journal of Political Economy, Vol 81, 607-636.
[2] Fama, Eugene F., and Kenneth R. French, (1992), "The Cross-Section of
Expected Stock Returns," Journal of Finance, Vol XLVII, No. 2, June, 427-465.
[3] Jensen, MC, and Myron Scholes, (1972), The capital asset pricing model:
some empirical tests, in M. Jensen ed., Studies in the Theory of Capital
Markets (Praeger Press).
Chapter 11
DUMMY VARIABLES AND ANOVA
APPLICATION: TIME EFFECT ANOMALIES
Key Points of Learning
Dummy variables, Asset pricing anomalies, Day-of-the-week effect, Seasonal
Effect, January Effect, Fear Gauge, Test of equality of the Means, Wald test,
Analysis of variance

In this chapter we consider the use of dummy variables or qualitative variables as


explanatory variables, and show how they are used in financial pricing.
11.1

DUMMY VARIABLES

In an employment survey carried out in Singapore in the early 1990s, 60


randomly surveyed employees reported their monthly S$ salary, their gender,
their job position whether it carries managerial responsibilities or not, their
number of years of formal education, and number of years of working
experience.
We can try to verify if salary was determined by the factors of gender,
managerial position, education, and working experience. This can be done by
linear regression. The last two variables across the 60 respondents are
quantitative data. The first two variables are, however, categorical or nominal
data. Categorical or nominal data cannot be represented in regression by any
other way except by dummy variables.
For example, if a respondent is female, the gender variable will take the
value 1. If the respondent is male, the gender variable will take the value 0
instead. Thus the gender variable is a dummy variable taking the values of
either 1 or 0. The values 1 or 0 do not represent any quantities except
indicating belonging to a category or not. If we let
S = S$ salary/month
D1 = dummy variable (gender). We usually write
D1 = 1 if female, 0 if male.
D2 = dummy variable (management position). Specifically,
D2 = 1 if the job carries managerial responsibility, 0 if no managerial responsibility.
E = number of years of formal education
W = number of years of working experience

Then a linear regression model is

$$S_i = c_0 + c_1D_{1i} + c_2D_{2i} + c_3E_i + c_4W_i + u_i \qquad (11.1)$$

for subjects i = 1, 2, 3, …, 60. $u_i$ is the disturbance term that is i.i.d. and independent
of the explanatory variables. Using the OLS method for the linear regression, the results
are reported below. The dependent variable is salary and the sample size is 60.
Table 11.1
Explanation of Monthly Salary

The tests of the coefficients indicate that managerial position, education, and
working experience are significantly related to higher salary. The last two are
especially significant at p-values of less than 0.001%. Each additional year of
education adds on average S$408 to salary, ceteris paribus (everything else
being equal). Each additional year of work experience adds on average S$245
to salary, ceteris paribus.
Gender is not significant in explaining differences in salary. If this were
significant, the coefficient of S$274.39 on gender D1 indicates that if the
respondent were female (thus D1=1), then there is an average increase of
S$274.39 in monthly salary over that of a male, ceteris paribus. What is the
interpretation of the constant of regression C? Its estimate is -S$4,606. Usually
the constant is an estimate of some base level or fixed-level effect. Here, due
to about 6 years of mandatory education in Singapore and the respondents'
average working experience of about 6.5 years, their incremental income due to these
would have put them at a baseline of about zero to start with. This is of course a
rough explanation of why there is such a low negative constant.
Suppose now we run the same OLS regression except without the
constant. Now we create two alternative dummies in place of the gender
dummy:

G1 = 1 if male, 0 if female, and
G2 = 1 if female, 0 if male.

$$S_i = g_1G_{1i} + g_2G_{2i} + c_2D_{2i} + c_3E_i + c_4W_i + u_i \qquad (11.2)$$

Notice that $G_{1i} + G_{2i} = 1$ for any i. The regression results are as follows.
Table 11.2
Explanation of Monthly Salary

In the OLS regression of (11.1) and (11.2), it is notable that almost all the
reported results are identical. All unaffected explanatory variables D2i , Ei ,
and Wi , produced the same estimates and t-tests. SSE, R2 , mean of dependent
variable, DW d-statistic all remain the same.
The estimated coefficient $\hat{g}_1$ and its t-statistic in (11.2) are identical
with the estimate of the constant and its t-statistic in (11.1). Thus it is seen that
if the male dummy is left out, as in (11.1), it is actually contained in the constant.
In (11.2), the male dummy $G_{1i}$ is included, but the constant is left out, so it
replaces the role of the constant.
However, the estimated coefficient $\hat{g}_2$ of $G_{2i}$ and its t-statistic in (11.2)
do not look similar to those of the coefficient of $D_{1i}$ in (11.1), even
though $G_{2i} = D_{1i}$ for every i. However, note that $\hat{g}_2 - \hat{g}_1$ = S$274.39 = $\hat{c}_1$.
Thus in (11.1), the estimate of $c_1$ is really about the relative additional mean
salary of female over male. This relative figure is more explicitly revealed in
(11.2) as $\hat{g}_2 - \hat{g}_1$.
A key point to note is that in any regression, when we allocate the
respondents into x mutually exclusive and exhaustive categories and provide
dummies D2, D3, …, Dx (equal to 1 for membership in category
2, 3, …, x respectively, otherwise 0), and then perform a linear regression with a constant c,
the estimated coefficients of the dummy variables D2, D3, …, Dx are to be
interpreted as membership effects relative to category 1, which does not have a
dummy. The membership effect of category 1 is, however, captured in $\hat{c}$.
Thus to obtain the full (not relative) membership effect of, say, category 2, we
add the estimated coefficient of dummy variable D2 to $\hat{c}$. This full effect can
also be obtained if we perform an MLR without a constant, but instead assign a
dummy to every category.
In (11.2), we cannot have dummies for all mutually exclusive and
exhaustive categories and still introduce a regression constant. In other words,
if the dummies are exhaustive, i.e. $G_{1i} + G_{2i} = 1$ for every i, then introducing a
constant, viz.
$S_i = c + g_1G_{1i} + g_2G_{2i} + c_2D_{2i} + c_3E_i + c_4W_i + u_i$,
will create a singularity condition in the explanatory $X_{60\times 6}$ matrix, and thus
OLS estimates cannot be derived because X has rank less than 6, so $X^TX$ is not
invertible. This is sometimes termed the
dummy variable trap.
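The dummy variable trap is easy to see numerically. The sketch below is a toy illustration with simulated data (not the survey data used above); it codes gender as two exhaustive dummies and shows that a design with the constant plus only one of them has full column rank, while adding both dummies and the constant makes X^TX singular.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
female = rng.integers(0, 2, n)            # D1 / G2: 1 = female, 0 = male
male = 1 - female                         # G1: 1 = male, 0 = female
educ = rng.integers(6, 20, n).astype(float)
salary = -4000 + 300 * female + 400 * educ + rng.normal(0, 500, n)  # toy data

const = np.ones(n)

# (11.1)-style design: constant + one gender dummy -- full column rank
X1 = np.column_stack([const, female, educ])
print(np.linalg.matrix_rank(X1.T @ X1))   # 3: invertible, OLS works

# Trap: constant + BOTH exhaustive gender dummies, since male + female = const
X2 = np.column_stack([const, male, female, educ])
print(np.linalg.matrix_rank(X2.T @ X2))   # 3 < 4: X'X singular, no unique OLS estimate
```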
11.2

TIME EFFECT ASSET PRICING ANOMALIES

The CAPM contains only one market factor. Sometimes, other factors are
found that explain higher or lower mean returns on portfolios or stocks that
cannot be explained by existing equilibrium asset pricing models. More
specifically, the issue is that if such abnormally high average returns exist,
why should the rational and efficient market (assumed) not act to cream away
this abnormal return quickly enough? For this reason, such noticed empirical
irregularities are called asset pricing anomalies.46 Since the early 1980s, time
effect anomalies (unusual observations) on stock returns were reported in the US
stock market.

46 There is some similarity with new research in the area of behavioral finance, which
tries to explain why the classical rational expectations approach sometimes cannot explain
all empirically observed or stylized facts about market prices and investment results,
because human beings can sometimes be irrational or behave in ways yet to be totally
understood. Not all anomalies are, however, candidates for behavioral finance research,
as some may be explained purely by taxation or accounting effects.
Many researchers reported that mean daily stock returns on Mondays tend to be lower than those of other weekdays.47 Other time or seasonal anomalies, such as the January effect (higher monthly return in January relative to the average), the holiday effect, and so on, have since been reported. Non-time anomalies include higher expected returns from investing in smaller capitalization firms, firms with lower P/E ratios, and so on.
Of course, many anomalies do not survive closer analysis once transaction costs are taken into account. If a so-called anomaly exists but cannot be profitably exploited because transaction costs are too high, then it cannot be anomalous by definition. Time anomalies are interesting in finance and useful for instruction, from the point of view of teaching basic econometrics, when it comes to using dummy variables. In this chapter we will experience the Monday blues: lower than normal returns on Mondays. Much anecdotal evidence and many explanations have been advanced for such anomalies. Some of these anomalies seem to last for a long time and do not disappear easily. However, similar studies showed that the day-of-the-week effect in the U.S. stock market may have largely disappeared in the 1990s.
Let the daily continuously compounded return rate Rt of a market portfolio, or of a well-traded stock, at time t be expressed as
Rt = c1D1t + c2D2t + c3D3t + c4D4t + c5D5t + et        (11.3)
where the cj's are constants and the Djt's are dummy variables. These dummy variables take values as follows.

D1t = 1 if the return at time t is on a Monday, 0 otherwise
D2t = 1 if the return at time t is on a Tuesday, 0 otherwise
D3t = 1 if the return at time t is on a Wednesday, 0 otherwise
D4t = 1 if the return at time t is on a Thursday, 0 otherwise
D5t = 1 if the return at time t is on a Friday, 0 otherwise

47 For a survey of the literature, see Pettengill, G.N., (2003), A Survey of the Monday Effect Literature, Quarterly Journal of Economics, June.

et is a disturbance term that is assumed to be n.i.d. Notice that we choose to regress without a constant. If we perform OLS on (11.3) with a time series sample of size N, the regression in matrix form, Y = Xc + U, looks like

[ R1 ]   [ 1 0 0 0 0 ] [ c1 ]   [ u1 ]
[ R2 ] = [ 0 1 0 0 0 ] [ c2 ] + [ u2 ]
[ ...]   [    ...    ] [ ...]   [ ...]
[ RN ]   [ 0 0 0 0 1 ] [ c5 ]   [ uN ]

where Y = (R1, ..., RN)ᵀ is N×1, c = (c1, ..., c5)ᵀ is 5×1, and U = (u1, ..., uN)ᵀ is N×1. Each row of the X_Nx5 matrix contains all zero elements except for one unit element. Assume U ~ N(0, σu²I_N).
The day-of-the-week effect refers to a price anomaly whereby a particular day of the week has a systematically higher or lower mean return than the other days of the week. To test for the day-of-the-week effect, a regression based on equation (11.3) is performed.
To test if the day-of-the-week effect occurred in the Singapore market, we employ Singapore Stock Exchange data. Continuously compounded daily returns are computed based on the Straits Times Industrial Index (STII) from 11 July 1994 to 28 August 1998, and on the re-constructed Straits Times Index (STI) from 31 August 1998 till the end of the sample period. The data were collected from Datastream. At that time the STI was a value-weighted index based on 45 major stocks that made up approximately 61% of the total market capitalization in Singapore. Since the STII and STI were indexes that captured the major stocks that were the most liquidly traded, the day-of-the-week effect, if any, would show up in the returns based on the index movements.
We use 1075 daily observations in each period: July 18, 1994 to August 28, 1998 (period 1), and September 7, 1998 to October 18, 2002 (period 2). Table 11.3 below shows descriptive statistics of the return rates for each trading day of the week.
Table 11.4 shows the multiple linear regression result of daily return rates on the weekday dummies.

Table 11.3
Return Characteristics on Different Days of the Week

                   MON          TUE          WED          THU          FRI
Mean           -0.002278    -0.001052     0.001194    -0.000394    -0.000359
Median         -0.002272    -0.001255     0.000269    -0.000206     3.71E-05
Maximum         0.160307     0.095055     0.069049     0.040004     0.039870
Minimum        -0.078205    -0.092189    -0.039137    -0.079436    -0.069948
Std. Dev.       0.018646     0.013078     0.012631     0.013085     0.010729
Skewness        3.192659     0.263696     0.920829    -1.154059    -1.147256
Kurtosis        32.39143     25.84039     8.182601     10.15761     11.22779
Jarque-Bera     8103.961     4675.906     270.9991     506.6721     653.6114
Probability     0.000000     0.000000     0.000000     0.000000     0.000000
Sum            -0.489672    -0.226154     0.256656    -0.084740    -0.077153
Sum Sq. Dev.    0.074401     0.036600     0.034139     0.036640     0.024632
Observations    215          215          215          215          215

Table 11.4
Rt = c1D1t + c2D2t + c3D3t + c4D4t + c5D5t + et
July 18, 1994 to August 28, 1998 (period 1)

Dependent Variable: PERIOD1RE
Method: Least Squares
Sample: 1 1075
Included observations: 1075

Variable     Coefficient    Std. Error    t-Statistic    Prob.
D1           -0.002278      0.000947      -2.404414      0.0164
D2           -0.001052      0.000947      -1.110473      0.2670
D3            0.001194      0.000947       1.260246      0.2079
D4           -0.000394      0.000947      -0.416095      0.6774
D5           -0.000359      0.000947      -0.378842      0.7049

R-squared             0.006554    Mean dependent var      -0.000578
Adjusted R-squared    0.002840    S.D. dependent var       0.013909
S.E. of regression    0.013889    Akaike info criterion   -5.710774
Sum squared resid     0.206413    Schwarz criterion       -5.687611
Log likelihood        3074.541    Durbin-Watson stat       1.678574

The OLS regression result shows that the coefficient c1 = -0.002278 is the only one that is significantly different from (smaller than) zero. This corresponds to D1, the Monday dummy variable. Thus, there is a significantly negative mean return on Monday. This Monday day-of-the-week effect in Singapore is similar to evidence in the US and in other exchanges in Asia.
Why do we interpret this as a negative mean return on Monday? (11.3) implies that
Rt = c1D1t + et
since the other cj coefficients are not significantly different from zero. Then E(Rt) = c1D1t. For Mondays, D1t = 1. So the mean Monday return is E(Rt|Monday) = c1, which is estimated by -0.002278. From Tables 11.3 and 11.4, it is seen that the means of the returns on Monday, Tuesday, etc. are indeed the coefficient estimates of the dummies D1t, D2t, etc. However, if there were other non-dummy quantitative explanatory variables for Rt, then cj would in general not be the mean return on the jth day of the week, but just its marginal contribution.
The negative or lower Monday return effect is sometimes addressed as the weekend effect, due to the explanation that most companies typically put out bad news, if any, during the weekend, so that Monday prices on average end relatively lower than on other days of the week. This weekend effect sometimes appears only in some months of the year. In the US, the Monday effect does not typically appear in January. It has also been empirically found that Friday returns are on average the highest, at least in studies before 2002.
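A regression such as (11.3) can be run in any statistical package. The following Python sketch assumes a hypothetical input file returns.csv with columns date and ret containing trading-day index returns; the column names and file are illustrative, not the data set used in this chapter.

```python
# Sketch of a day-of-the-week dummy regression without a constant, as in (11.3).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("returns.csv", parse_dates=["date"])   # hypothetical data file
df["weekday"] = df["date"].dt.dayofweek                  # 0 = Monday, ..., 4 = Friday
dummies = pd.get_dummies(df["weekday"]).astype(float)    # assumes all five weekdays occur
dummies.columns = ["D1_Mon", "D2_Tue", "D3_Wed", "D4_Thu", "D5_Fri"]

# No constant is included, so with no other regressors each coefficient is that
# weekday's mean return, as discussed in the text.
model = sm.OLS(df["ret"], dummies).fit()
print(model.summary())
```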
11.3 TEST OF EQUALITY OF MEANS

We test the null hypothesis that the means of all weekday returns are equal. This takes the form of testing whether the coefficients of the 5 dummies are all equal, viz. H0: c1 = c2 = c3 = c4 = c5. The results are shown in Table 11.5 below.
Table 11.5
Wald and F-Test of Equal Mean Returns on All Weekdays

Equation: PERIOD1_REGRESSION
Null Hypothesis: C(1)=C(5), C(2)=C(5), C(3)=C(5), C(4)=C(5)

F-statistic     1.764813     Probability     0.133652
Chi-square      7.059252     Probability     0.132790

The Wald chi-square statistic is asymptotic (it assumes that the estimate σ̂u² converges to σu²) in

(RB̂ - r)ᵀ [ σ̂u² R(XᵀX)⁻¹Rᵀ ]⁻¹ (RB̂ - r) ~ χ²q

where q is the number of restrictions. The F-statistic, however, is exact. In general, the asymptotic Wald statistic is more useful in nonlinear constraint testing. In the above, the Wald chi-square statistic (d.f. 4) shows that we can reject H0 only in a critical region with a p-value of 13.28%. The F-test statistic shows that we can reject H0 only in a critical region with a p-value of 13.36%. Thus, the statistical evidence that the Monday return is different from the rest is not that strong.
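Continuing the hypothetical Python regression sketched above, the same joint hypothesis can be written as four restrictions against the Friday coefficient; the constraint strings below use the illustrative column names assumed earlier.

```python
# Sketch of the test H0: c1 = c2 = c3 = c4 = c5 via statsmodels constraint strings.
hyp = "D1_Mon = D5_Fri, D2_Tue = D5_Fri, D3_Wed = D5_Fri, D4_Thu = D5_Fri"
print(model.f_test(hyp))                    # exact F-statistic with 4 restrictions
print(model.wald_test(hyp, use_f=False))    # asymptotic chi-square (Wald) version
```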
For period 2, we run OLS on Rt = c1N1t + c2N2t + c3N3t + c4N4t + c5N5t + et. Nit is equivalent to Dit; it is the same dummy. In Table 11.6, however, it is seen that although the coefficient of the Monday dummy N1t is the only negative coefficient, it is nevertheless not significant, with a p-value of 0.41. The test of whether the coefficients of the 5 dummies are all equal, viz. H0: c1 = c2 = c3 = c4 = c5, does not reject the null hypothesis at the 10% significance level.
Table 11.6
Rt = c1N1t + c2N2t + c3N3t + c4N4t + c5N5t + et
September 7, 1998 to October 18, 2002 (period 2)

Dependent Variable: PERIOD2RE
Method: Least Squares
Sample: 1 1075
Included observations: 1075

Variable     Coefficient    Std. Error    t-Statistic    Prob.
N1           -0.000877      0.001055      -0.831321      0.4060
N2            0.000799      0.001055       0.756916      0.4493
N3            0.000193      0.001055       0.183048      0.8548
N4            0.001417      0.001055       1.342474      0.1797
N5            0.001840      0.001055       1.743269      0.0816

R-squared             0.003815    Mean dependent var       0.000674
Adjusted R-squared    0.000091    S.D. dependent var       0.015475
S.E. of regression    0.015474    Akaike info criterion   -5.494645
Sum squared resid     0.256212    Schwarz criterion       -5.471482
Log likelihood        2958.372    Durbin-Watson stat       1.768158

Has the Monday day-of-the-week effect disappeared after August 1998? It seems so.
The day-of-the-week effect in the U.S. may have disappeared in the late 1990s partly because arbitrageurs would enter to cream away the profit by buying low at the close of Monday and selling the same stock high on Friday, earning on average above-normal returns. The arbitrageurs' activities have the effect of raising Monday's prices and lowering Friday's. This would wipe out the observed differences. Recently, some studies48 documented high Monday VIX (volatility index) prices relative to Friday prices, and high fall (autumn) prices relative to summer prices. This would allow abnormal trading profit by buying VIX futures at the CBOE on Friday and selling on Monday, and buying the same in summer and selling as autumn approaches. The day-of-the-week and seasonal effects are not explained by risk premia, but perhaps rather by behavioral patterns exhibiting pessimism or fear of uncertainty, hence greater anticipated volatility or VIX index (sometimes called the "Fear Gauge") on Monday for the whole working week ahead, and in autumn when the chilly winds start to blow in North America.
11.4 ANALYSIS OF VARIANCE

Earlier we saw that E(Rt|Monday) = c1. Hence also E(Rt|Tuesday) = c2, and so on. Thus the means of Monday returns, Tuesday returns, ..., and Friday returns are c1, c2, c3, c4, and c5 respectively. We can also test for the equality of the means of each weekday, i.e. c1, c2, c3, c4, and c5, using Analysis of Variance (ANOVA).
The sum of squares treatments (SST), measuring the variability between the sample means of the different days/groups, is

SST = Σ (i=1 to 5) ni ( R̄i - R̄ )²

where n1 is the number of Mondays in the sample space, n2 is the number of Tuesdays, n3 is the number of Wednesdays, and so on. R̄1 is the sample mean of Monday returns, R̄2 is the sample mean of Tuesday returns, and so on. R̄ is the sample average return over all days.


The sum of squares for errors (SSE) measures the variability within the groups:

SSE = Σ (j=1 to n1) (R1j - R̄1)² + Σ (j=1 to n2) (R2j - R̄2)² + Σ (j=1 to n3) (R3j - R̄3)² + Σ (j=1 to n4) (R4j - R̄4)² + Σ (j=1 to n5) (R5j - R̄5)²

48 See for example, Haim Levy (2010), Volatility Risk Premium, Market Sentiment and Market Anomalies, Melbourne Conference in Finance, March 2010.
where R1j is a Monday return for a day j, R2j is a Tuesday return for a day j, and so on.
The population means of Monday returns, Tuesday returns, Wednesday returns, and so on, are c1, c2, ..., c5. Intuitively, H0: c1 = c2 = c3 = c4 = c5 is true if the variability between groups (SST) is small relative to the variability within groups (SSE). H0 is false if the variability between groups is large relative to the variability within groups.
Analysis of the between-group variance and the within-group variance leads to the test statistic

[ SST/(g - 1) ] / [ SSE/(N - g) ] ~ F(g-1, N-g)  (F-distribution)        (11.4)

where g is the number of groups (here g = 5), and the total number of days is N = n1 + n2 + n3 + n4 + n5 = 1075.
From the row of unbiased estimates of the standard deviations of Monday returns, etc. in Table 11.3, we obtain
SSE = 214 × [0.018646² + 0.013078² + 0.012631² + 0.013085² + 0.010729²] = 0.20642,
SST = 0.001362343.
Therefore, via (11.4), F(4, 1070) = 1.765. This is identical with the F-test statistic reported in Table 11.5.
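The ANOVA arithmetic above can be verified directly from the Table 11.3 summary statistics. The short Python sketch below reproduces SST, SSE, and the F-statistic from the five weekday means and standard deviations (215 observations each).

```python
# Sketch reproducing the ANOVA F-statistic from the Table 11.3 summary statistics.
import numpy as np

n_i = np.array([215, 215, 215, 215, 215])
means = np.array([-0.002278, -0.001052, 0.001194, -0.000394, -0.000359])
stds = np.array([0.018646, 0.013078, 0.012631, 0.013085, 0.010729])  # unbiased std devs

grand_mean = (n_i * means).sum() / n_i.sum()
sst = (n_i * (means - grand_mean) ** 2).sum()      # between-group sum of squares
sse = ((n_i - 1) * stds ** 2).sum()                # within-group sum of squares
g, N = 5, n_i.sum()
F = (sst / (g - 1)) / (sse / (N - g))
print(round(sst, 6), round(sse, 5), round(F, 3))   # ~0.001362, ~0.20642, ~1.765
```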
11.5 PROBLEM SET

11.1 A researcher wants to check whether daily stock returns on the 5 trading days just before any public holiday have a different mean level compared to daily returns otherwise. In addition, he wants to find out whether these daily returns in the pre-holiday period, if they also fall in the month of January, are particularly different. He runs a regression of daily returns of a market portfolio on dummy variables, with a sample size of 4 years of daily returns. Show a regression equation without a constant that he may use, indicating clearly what the explanatory dummy variables are.

11.2 An analyst suspected that, for some strange reason, the daily return rates of some stocks are usually lower on Fridays and higher on the other days of the week. To test this hypothesis, he collected the daily return data rt of a particular stock, and performed the following linear regression using ordinary least squares:

rt = c1 + c2I1 + c3I2 + c4I3 + c5I4 + et

where the ci's are the regression coefficients, et is the disturbance that is assumed to be i.i.d., and

I1 = 1 if it is Monday, 0 otherwise
I2 = 1 if it is Tuesday, 0 otherwise
I3 = 1 if it is Wednesday, 0 otherwise
I4 = 1 if it is Thursday, 0 otherwise

It is noted that trading takes place only on weekdays. If the estimated equation is

r̂t = 0.0001 + 0.0005I1 + 0.0002I2 + 0.0001I3 + 0.0003I4 ,

what is the estimated increment to return (relative to zero) if it is on a Friday? What are the estimated increments to return (relative to zero) if it is on a Monday, Tuesday, Wednesday, and Thursday?

FURTHER RECOMMENDED READINGS


[1] M. Gibbons and P. Hess, (1981), Day of the Week Effects and Asset
Returns, Journal of Business 54.
[2] D. Keim and R. Stambaugh, (1984), A Further Investigation of the
Weekend Effect in Stock Returns, Journal of Finance 39.
[3] J. Jaffe and R. Westerfield, (1985), The Weekend Effect in Common
Stock Returns: The International Evidence, The Journal of Finance 40.

Chapter 12
SPECIFICATION ERRORS

Key Points of Learning
Generalized least squares, Heteroskedasticity, Weighted least squares, Relevance exclusion, Irrelevant inclusion, Multi-collinearity, Lagged endogenous variable, Contemporaneous correlation, Measurement error, Simultaneous equations bias, Probability limits, Instrumental variables, White's heteroskedasticity-consistent covariance matrix estimator, Goldfeld-Quandt test, Breusch-Pagan & Godfrey (LM) test, Cochrane-Orcutt procedure

After considering the desirable properties of OLS estimators of the multiple


linear regression (MLR) under the classical conditions of disturbances, it is
time to consider specification errors. A specification error or misspecification of the MLR model is a problem with an assumption of the model such that the problem leads to OLS not being BLUE. Some specification errors are practically not serious enough to lose sleep over, but some can be serious enough to merit detailed investigation.
We shall consider the different types of specification errors as follows, and then suggest remedies. Specifically, we shall consider misspecifications of the disturbances or residual errors, misspecifications of the explanatory variables, misspecifications of the relationship between the explanatory variables and the disturbances, and misspecifications of the coefficients. Special cases of misspecification of the disturbances, such as heteroskedasticity and serial correlation, will be considered in more detail.
At the end of the chapter, Table 12.2 contains a succinct summary of the major problems and the remedies, if any.
12.1 MISSPECIFICATION WITH DISTURBANCES

The classical conditions state that E(UUᵀ|X) = σu²I_N, in which the disturbances are homoskedastic. Suppose instead E(UUᵀ|X) = σu²Ω_NxN ≠ σu²I_N, where each element of the covariance matrix Ω_NxN is a constant. Taking iterated expectations, E(UUᵀ) = σu²Ω ≠ σu²I_N. If Ω_NxN is known, i.e. in
Y_Nx1 = X_Nxk B_kx1 + U_Nx1 ,  U ~ N(0, σu²Ω_NxN) and Ω ≠ I ,
we can apply generalized least squares (GLS) estimation.
Since Ω is a covariance matrix, any non-zero vector x_Nx1 must yield σu² xᵀΩx > 0 as the variance of xᵀU. Hence Ω is positive definite. In linear algebra, there is a theorem that if Ω is positive definite, then it can be expressed as

Ω = PPᵀ        (12.1)

where P is an N×N non-singular matrix. Note that P is fixed and non-stochastic. From (12.1),

P⁻¹ΩP⁻¹ᵀ = I        (12.2)
Ω⁻¹ = P⁻¹ᵀP⁻¹        (12.3)
Define Y* = P⁻¹Y, X* = P⁻¹X, and U* = P⁻¹U. Pre-multiply the original model Y = XB + U by P⁻¹. Then,
P⁻¹Y = P⁻¹XB + P⁻¹U, or Y* = X*B + U* .
cov(U*) = E(U*U*ᵀ) = E(P⁻¹UUᵀP⁻¹ᵀ) = P⁻¹E(UUᵀ)P⁻¹ᵀ = σu² P⁻¹ΩP⁻¹ᵀ .
By (12.2), cov(U*) = σu²I.
Thus, Y* = X*B + U* satisfies the classical conditions. The OLS regression of Y* on X* gives

B̂ = (X*ᵀX*)⁻¹X*ᵀY*        (12.4)

This B̂ is BLUE. It is not the original OLS estimator, since the regression is made using the transformed Y* and X*.
We can also express B̂ above in terms of the original Y and X. We do this by substituting the definitions of X* and Y* in (12.4) and utilizing (12.3):

B̂ = (X*ᵀX*)⁻¹X*ᵀY*
  = [(P⁻¹X)ᵀ(P⁻¹X)]⁻¹ (P⁻¹X)ᵀ(P⁻¹Y)
  = (XᵀP⁻¹ᵀP⁻¹X)⁻¹ XᵀP⁻¹ᵀP⁻¹Y

Or, B̂ = (XᵀΩ⁻¹X)⁻¹ XᵀΩ⁻¹Y .

Given Ω, the latter is the Generalized Least Squares (GLS) estimator of the regression of Y on X. This GLS estimator (exactly the same as B̂ in (12.4)) is BLUE. It is interesting to note that the OLS regression of Y on X based on Y_Nx1 = X_Nxk B_kx1 + U_Nx1, with non-homoskedastic disturbances U, produces an OLS estimator that is unbiased, i.e. E[(XᵀX)⁻¹XᵀY] = B, but is not best in the sense of efficiency.
Given that Ω is known, an unbiased estimate of σu² is (since cov(U*) is σu²I)

σ̂u² = (Y* - X*B̂)ᵀ(Y* - X*B̂) / (N - k)
    = (P⁻¹Y - P⁻¹XB̂)ᵀ(P⁻¹Y - P⁻¹XB̂) / (N - k)
    = (Y - XB̂)ᵀ P⁻¹ᵀP⁻¹ (Y - XB̂) / (N - k)
    = ÛᵀΩ⁻¹Û / (N - k)

Note that Û above is found by using the GLS B̂ = (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹Y. Note also that

cov(B̂) = σu²(X*ᵀX*)⁻¹ = σu²(XᵀP⁻¹ᵀP⁻¹X)⁻¹ = σu²(XᵀΩ⁻¹X)⁻¹ .

Thus, the usual procedures of confidence interval estimation and testing of the parameters can be carried out.
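The following Python sketch, using simulated data and an arbitrary known diagonal Ω (all names are illustrative), checks the equivalence stated above: OLS on the P⁻¹-transformed data coincides with the direct GLS formula (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹Y.

```python
# Sketch: GLS via the P^{-1} transformation versus the direct GLS formula.
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
B_true = np.array([1.0, 0.5])

omega = np.diag(rng.uniform(0.5, 4.0, N))              # a known positive definite Omega
U = rng.multivariate_normal(np.zeros(N), omega)        # sigma_u^2 taken as 1 here
Y = X @ B_true + U

P = np.linalg.cholesky(omega)                          # Omega = P P'
P_inv = np.linalg.inv(P)
X_star, Y_star = P_inv @ X, P_inv @ Y                  # transformed model (classical conditions hold)

B_gls_transformed = np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)
omega_inv = np.linalg.inv(omega)
B_gls_direct = np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ Y)
print(B_gls_transformed, B_gls_direct)                 # identical up to rounding
```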
As a special case, suppose the disturbance exhibits heteroskedasticity of a known form. For example, Y = XB + U, X = (X1 | X2 | X3 | ... | Xk)_Nxk, and the first column of X is X1 = (1 1 ... 1)ᵀ. Suppose the N disturbances have variances proportional to the square of a certain jth explanatory variable Xij as follows:

cov(U) = σu²Ω , where Ω_NxN = diag( X1j², X2j², ..., XNj² ) .

Then Ω⁻¹_NxN = diag( 1/X1j², 1/X2j², ..., 1/XNj² ), and

B̂ = (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹Y = (X*ᵀX*)⁻¹X*ᵀY*

where

Y* = ( Y1/X1j , Y2/X2j , ... , YN/XNj )ᵀ

and

      [ 1/X1j    X12/X1j   ...   X1k/X1j ]
X* =  [ 1/X2j    X22/X2j   ...   X2k/X2j ]
      [  ...                             ]
      [ 1/XNj    XN2/XNj   ...   XNk/XNj ]

For the notation Xij, the first subscript i represents the time/cross-sectional position 1, 2, ..., N, and the second subscript j represents the column number of X, or the jth explanatory variable.
The above GLS estimator B̂ is also called the weighted least squares estimator, since we are basically weighting each observation (Yi Xi1 Xi2 ... Xij ... Xik) by 1/Xij. Note that Y* is a MLR on X* with a constant. Where is the constant?
Then, to test H0: Bj = 0, we use

( B̂j - 0 ) / ( σ̂u √[ jth diagonal element of (X*ᵀX*)⁻¹ ] )

which is distributed as t(N-k).
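The sketch below (simulated data, illustrative names) performs this weighted least squares estimation in two equivalent ways: by dividing each observation by Xij and running OLS, and by using the statsmodels WLS routine with weights equal to 1/Xij². Note that after the transformation the "constant" appears as the column 1/Xij, which answers the question posed above.

```python
# Sketch of weighted least squares when var(u_i) is proportional to X_ij^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
N = 300
xj = rng.uniform(1.0, 5.0, N)                        # the regressor driving the variance
X = np.column_stack([np.ones(N), xj])
u = rng.normal(scale=0.3 * xj)                       # sd proportional to xj, so var ~ xj^2
Y = X @ np.array([2.0, 1.5]) + u

w = 1.0 / xj
X_star, Y_star = X * w[:, None], Y * w               # first column of X_star is 1/xj
wls_manual = sm.OLS(Y_star, X_star).fit()            # OLS on the transformed data
wls_builtin = sm.WLS(Y, X, weights=1.0 / xj**2).fit()  # statsmodels weights = 1/variance
print(wls_manual.params, wls_builtin.params)         # the two estimators coincide
```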

If Ω_NxN is unknown, it is not possible to estimate all the N×N unknown parameters of the covariance matrix using only the N×1 values of Y and the N×k (k < N, otherwise XᵀX does not have an inverse) values of X. Typically some restrictions based on theory are placed on the form of Ω to reduce its number of unknown parameters.
There are two major possibilities in the case of an unknown Ω_NxN.
(a) Disturbances are heteroskedastic, viz. (with σu² ≡ 1 here)

Ω_NxN = diag( σ1², σ2², ..., σN² ) .

The variances of ui, uj, etc. in U are not the same, at least for some of the u's. However, all cross-covariances cov(ui, uj) = 0 for i ≠ j.
What happens? If {σi²}, i = 1, 2, ..., N, are known, then we can apply GLS and obtain a BLUE estimator of B. More discussion of GLS for the case when the {σi²} are unknown will be presented after the listing of the other types of specification errors.
(b) Disturbances are serially correlated (if ut is a time series), or cross-correlated (if ui is cross-sectional).
For example, if the Y's, X's, and U's are stochastic processes (over time), then if ut is AR(1) or MA(1), the autocorrelation is not zero at least for lag one. Then E(UUᵀ) = σu²Ω_NxN ≠ σu²I_N. Specifically, the off-diagonal elements are not zero. For example, if the disturbance is
ut+1 = ρut + et+1 ,
where et+1 is zero-mean i.i.d., then

          [ 1         ρ         ρ²        ...   ρ^(N-1) ]
          [ ρ         1         ρ         ...   ρ^(N-2) ]
Ω_NxN =   [ ρ²        ρ         1         ...   ρ^(N-3) ]
          [ ...                                          ]
          [ ρ^(N-1)   ρ^(N-2)   ρ^(N-3)   ...   1       ]

What happens? If Ω is known, then we can apply GLS and obtain a BLUE estimator of B. More discussion of GLS will be presented later for the case when Ω is unknown, after the listing of the other types of specification errors.
12.2 MISSPECIFICATION WITH EXPLANATORY VARIABLE

There are several categories of specification errors with the explanatory variables X.
(a) Exclusion of relevant explanatory variables
Exclusion or omission of relevant explanatory variables can lead to serious finite sample bias and also large sample inconsistency. This is sometimes called an error of omission.
For example, in Singapore, demand (proxied by the number of bids for COEs) for cars, Dt, is specified as follows:
Dt = B0 + B1Pt-1 + ut
where Pt is the average price of a car at t. What have we left out? What happens to the OLS estimates B̂0, B̂1 if the true model is
Dt = b0 + b1Pt-1 + b2Yt + b3Qt + b4Mt + εt
where Yt is the average income level at t, Qt is the amount of quota allocated for bidding at t, and Mt is the unemployment rate in the economy at t? Thus there is an error of omission. In general E(B̂0) ≠ b0 and E(B̂1) ≠ b1. Thus the OLS estimators are biased and not consistent.
(b) Inclusion of irrelevant explanatory variables
The unnecessary inclusion of irrelevant explanatory variables is sometimes called an error of commission. Suppose the true specification is
Dt = b0 + b1Pt-1 + b2Yt + b3Qt + b4Mt + εt ,
but additional explanatory variables
but additional explanatory variables
Z1t = number of cars in Hong Kong
Z2t = number of cars in New York
Z3t = number of cars in Tokyo

Zkt = number of cars in Mexico City


are employed in the MLR.
Clearly, the {Zit} variables should have no explanatory power for the demand for cars/COEs in Singapore, Dt. They are irrelevant variables. However, given the finite sample size t = 1, 2, ..., N, it is highly likely that some of the irrelevant variables may appear to help explain the variation in Dt and thus increase the coefficient of determination, R², of the MLR.
This is a statistical artifact and has nothing to do with economic theory. If
the sample is sufficiently large relative to the number of included irrelevant
variables, we should find the estimates of their coefficients not to be
significantly different from zero. Estimates of b0, b1, b2, b3, b4, and u2 are still
unbiased. However, the sampling error may increase if the superfluous
variables are correlated with the relevant ones. If the estimated regression
model is employed to do forecast, then it could lead to serious errors
especially when the variances of the irrelevant variables are large.
One approach to reduce the problem of exclusion is to start with a larger set of plausible explanatory variables. Take note of the adjusted R² or the Akaike Information Criterion (AIC). Then reduce the number of explanatory variables until the adjusted R² is maximized or the AIC is minimized. Since the problem
of unnecessary inclusion is not as serious, it is useful to observe the estimates
with the larger set of explanatory variables, and then compare with estimates
of a subset of explanatory variables. If an appropriate set of explanatory
variables is selected, the estimates should not change much from the results in
the regression with the larger set. This procedure is part of an area in
econometrics called model selection.
(c) Multi-collinearity
Suppose we partition X = (X1 | X2 | X3 | ... | Xk). If at least one Xi is nearly a linear combination of the other Xj's, i.e. there is a strong degree of collinearity amongst the columns of X, or a multi-collinearity problem, then the determinant |XᵀX| is very small and the diagonal elements of (XᵀX)⁻¹ are very large. But var(B̂j) = jth diagonal element of σu²(XᵀX)⁻¹, so this would be very large.
This means that the sampling errors of the estimators B are large in the
face of multi-collinear X. It leads to small t-statistic based on H0: Bj = 0, and
thus the zero null is not rejected. Thus it is difficult to obtain accurate
estimators. Even if Bj is actually > (or <) 0, we cannot reject H0 : Bj = 0.
What can we do to fix the problem? We can fall back on a priori restrictions based on theory. For example, if explanatory variable Xk is highly correlated with X1, X2, ..., Xk-1, Xk+1, etc., but bk is theoretically close to zero, we can restrict bk = 0 and thus avoid the inclusion of Xk. This will eliminate the multi-collinearity problem.
Or else we live with the shortcoming which is a data problem, and not a
model problem. The problem can of course be mitigated when the sample size
increases. It should be noted that the OLS estimators are still BLUE, and
asymptotically, the OLS estimators are still consistent.
(d) Non-stationarity
When X has a time trend, e.g. Xit = a0 + a1t + uit , uit being i.i.d., then it is
problematic to regress Yit on Xit because Xit is not stationary. When Yit is also
non-stationary, there is the problem of false correlation between Yit and Xit .
This problem will be explored more fully in the chapter on unit roots.
12.3 MISSPECIFICATION OF INDEPENDENCE BETWEEN X AND U

When there is dependence between the regressor X and the disturbance U, this dependence can take 2 major forms. These problems arise only when we treat X as stochastic and do not impose the use of the conditional density f(Y|X). The major misspecification forms basically result from the inclusion of a lagged dependent variable as an explanatory variable. This is also called the lagged endogenous variable problem. The two major forms are contemporaneous zero correlation but stochastic dependence, and contemporaneous non-zero correlation.
(a) Contemporaneous zero correlation but stochastic dependence
Suppose Yt-1 is included as an explanatory variable in the MLR
Yt = B0 + B1X1t + B2X2t + ... + C Yt-1 + ut .
Then even if Yt-1 and ut are contemporaneously uncorrelated, because Yt is correlated with ut, the processes {Yt} and {ut} are not stochastically independent. (Stochastic independence would require all past and future Yt+k's to be uncorrelated with ut.) What happens? The OLS estimator is not BLUE, but is still consistent.
We can show this as follows. In matrix notation, Y_Nx1 = (Y1 Y2 ... YN)ᵀ, X_Nxk = (1 | X1 | X2 | ... | Y*) where 1 is N×1, each Xi is N×1, and Y* = (Y0 Y1 ... YN-1)ᵀ. U_Nx1 = (u1 u2 ... uN)ᵀ and B_kx1 = (B0 B1 B2 ... C)ᵀ.
Then, OLS B̂ = (XᵀX)⁻¹XᵀY. The expected value of this OLS estimator is

E(B̂) = E[ (XᵀX)⁻¹Xᵀ(XB + U) ] = B + E[ (XᵀX)⁻¹XᵀU ] .

If {X1, X2, ...} or X_Nxk is stochastically independent of U, then the expectation of the last term above becomes
E[ (XᵀX)⁻¹XᵀU ] = E[ (XᵀX)⁻¹Xᵀ ] E(U) , which is 0_kx1 since E(U) = 0.
However, if one element of X_Nxk, e.g. Y*, is not independent of U, then the last term E[ (XᵀX)⁻¹XᵀU ] is not zero, since the expectation cannot be taken as the product of the expectations of (XᵀX)⁻¹Xᵀ and U, as some random elements in (XᵀX)⁻¹Xᵀ will be dependent on U.
The OLS estimator can always be written as B̂ = B + (XᵀX)⁻¹XᵀU. Therefore, if Yt-1 and ut are contemporaneously uncorrelated, and all the other Xit's are also uncorrelated with ut, then

(1/N) XᵀU = ( (1/N) Σt X1t ut , (1/N) Σt X2t ut , ... , (1/N) Σt Yt-1 ut )ᵀ
          → ( cov(X1t, ut) , cov(X2t, ut) , ... , cov(Yt-1, ut) )ᵀ = 0_kx1 .

Since XᵀX/N converges to a non-singular k×k matrix, say Q, then

plim B̂ = B + plim( XᵀX/N )⁻¹ plim( XᵀU/N ) = B + Q⁻¹ · 0 = B .

Thus, zero contemporaneous correlation, but not independence, does not yield BLUE for the OLS B̂, but does yield consistency. In other words, B̂ is biased in finite samples but is consistent.
(b) Contemporaneous non-zero correlation
In addition to Yt = B0 + B1X1t + B2X2t + ... + C Yt-1 + ut, suppose ut is an AR(1) process, i.e. ut = ρut-1 + et. Then
cov(Yt-1, ut)
= cov(B0 + B1X1t-1 + B2X2t-1 + ... + C Yt-2 + ut-1 , ut)
= cov(ut-1, ut) ≠ 0.
Thus, Yt-1 and ut are contemporaneously correlated. What happens? The OLS estimator is not BLUE, and is also not consistent.
There are some special situations when stochastic dependence between X
and U arises and causes problems.
12.4 MEASUREMENT ERROR PROBLEM

The measurement error problem in X (the errors-in-variables problem) occurs when the observed Xt* is not the intended explanatory variable Xt, but is Xt measured with error. Suppose the true model is Yt = a + bXt + ut, but we employ Xt* as the regressor instead, where
Xt* = Xt + et
is what is observed and used as the regressor in place of Xt. et is the measurement error from the actual but unobserved Xt, and cov(et, ut) = 0.
Under the true model, Yt = a + bXt + ut, so Yt = a + b(Xt* - et) + ut. Therefore,
Yt = a + bXt* + (ut - bet) .
Now cov(Xt*, [ut - bet]) = cov(Xt + et, ut - bet) = -b var(et) ≠ 0 even when cov(Xt, ut) = 0. Thus the MLR with measurement error induces a contemporaneous non-zero correlation. If we regress Yt on Xt*, the OLS estimator is not BLUE, and is not consistent. Does measurement error in the dependent variable cause inconsistency? No.
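A simple simulation illustrates this inconsistency. In the Python sketch below (all numbers illustrative), the OLS slope on the error-ridden regressor is attenuated toward zero and does not converge to the true b even as the sample size grows.

```python
# Sketch: errors-in-variables makes the OLS slope inconsistent (attenuation bias).
import numpy as np

rng = np.random.default_rng(3)
a, b = 0.5, 2.0
for N in (100, 10_000, 1_000_000):
    x_true = rng.normal(size=N)
    y = a + b * x_true + rng.normal(size=N)
    x_obs = x_true + rng.normal(size=N)                 # measurement error with var(e) = 1
    slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
    print(N, round(slope, 3))   # tends to b*var(x)/(var(x)+var(e)) = 1.0, not b = 2.0
```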

Some regression specifications require a simultaneous equations model. For example, if demand is modeled as a regression equation
Dt = B0 + B1Pt + et ,
then there could be another simultaneous equation at work, viz.
St = A0 + A1Pt + ut ,
with cov(ut, et) = 0. The two regressions constitute the simultaneous equations model. In economic equilibrium, Dt = St. Then,
B0 + B1Pt + et = A0 + A1Pt + ut
or, et = (A0 - B0) + (A1 - B1)Pt + ut .
Since cov(et, ut) = 0, we have (A1 - B1)cov(Pt, ut) + var(ut) = 0, or cov(Pt, ut) = -var(ut)/(A1 - B1). Therefore, cov(et, Pt) = (A1 - B1)var(Pt) + cov(Pt, ut) = (A1 - B1)var(Pt) - var(ut)/(A1 - B1) ≠ 0. Thus the MLR under simultaneous equations induces a contemporaneous non-zero correlation. If we regress Dt on Pt without considering the simultaneous equations, the OLS estimator is not BLUE and is not consistent. This simultaneous equations bias can be shown graphically as follows.
Figure 12.1
Simultaneous Equations Bias under Demand and Supply Equations
[Figure: demand and supply curves; a shift of the demand curve by et moves the equilibrium price along the supply curve.]

Given Pt, demand Dt moves up or down by the disturbance et. Because of the elastic supply curve, movement in the demand curve due to et induces a new equilibrium price along the supply curve, and thus Pt also changes. Thus, it is seen that cov(Pt, et) ≠ 0.
12.5 PROBABILITY LIMITS

Contemporaneous correlation between a regressor and the disturbance yields OLS estimators that are biased and inconsistent. To show this, we need the concept of a probability limit.
An ordinary limit, e.g. lim(N→∞) 1/N = 0, is easy to understand, as the real number 1/N continuously decreases toward 0. Suppose we show a probability distribution (cumulative distribution function, or simply distribution function) as follows.

Figure 12.2
Illustration of Probability Limits
[Figure: the distribution function F(X) of XN plotted against X, collapsing toward a point as N increases.]

If the probability distribution of the random variable XN (the random variable is a function of N, i.e. the probability distribution changes with N) collapses to a point A as N → ∞, i.e. for any small ε > 0,

lim(N→∞) prob( |XN - A| > ε ) = 0 ,

then A is called the probability limit of XN, or plim(N→∞) XN = A.
For example, given X̄ = (1/N) Σ(i=1 to N) Xi where the Xi ~ n.i.d. N(0, σ²), then plim X̄ = 0.
There are some results on probability limits that we will use. If plim XN = A, and f(·) is a continuous function, then plim f(XN) = f(A). This is also called the Slutsky theorem.
If plim XN = A and plim YN = B, then plim(XN + YN) = A + B, plim(XN YN) = AB, and plim(XN/YN) = A/B (for B ≠ 0).

Suppose in a MLR Y_Nx1 = X_Nxk B_kx1 + U_Nx1, where U is spherical, but the contemporaneous correlation of Xij and ui is not zero.
Then plim(N→∞) (1/N) Σ(i=1 to N) Xij ui ≠ 0 for at least some columns j amongst the k columns of X. So plim (1/N)XᵀU = Σ_XU ≠ 0_kx1.
Let plim (1/N)XᵀX = Σ_XX ≠ 0_kxk. Moreover, suppose Σ_XX is non-singular, i.e. has an inverse, since asymptotically for stationary X, Σ_XX represents the second moments.
Since B̂_OLS = B + (XᵀX)⁻¹XᵀU, then

plim B̂_OLS = B + plim[ (1/N)XᵀX ]⁻¹ plim[ (1/N)XᵀU ] = B + Σ_XX⁻¹ Σ_XU ≠ B .

Hence OLS is not BLUE and is not consistent when the disturbance is contemporaneously correlated with the regressor(s).
12.6 INSTRUMENTAL VARIABLES
To overcome the problem of inconsistency when there is contemporaneous


correlation, instrumental variables regression can be performed. Suppose we
can find k explanatory variables (maybe including some of the original X
columns that do not have contemporaneous correlation with U) ZNxk such that
(a)
(b)

they are (contemporaneously) correlated with X, and


they are not (contemporaneously) correlated with U.

Property (a) yields plim(N→∞) (1/N)ZᵀX = Σ_ZX ≠ 0_kxk, with Σ_ZX non-singular.
Property (b) yields plim(N→∞) (1/N)ZᵀU = Σ_ZU = 0_kx1.
As plim(N→∞) (1/N)ZᵀZ = Σ_ZZ ≠ 0_kxk, an instrumental variables (IV) estimator is

B̂_IV = (ZᵀX)⁻¹ZᵀY .

So, B̂_IV = (ZᵀX)⁻¹ZᵀY = B + (ZᵀX)⁻¹ZᵀU, and

plim B̂_IV = B + plim[ (1/N)ZᵀX ]⁻¹ plim[ (1/N)ZᵀU ] = B + Σ_ZX⁻¹ 0_kx1 = B .

Thus, B̂_IV is consistent. The asymptotic variance of B̂_IV conditional on Z, X is

E[ (B̂_IV - B)(B̂_IV - B)ᵀ | X, Z ] = E[ (ZᵀX)⁻¹Zᵀ UUᵀ Z(XᵀZ)⁻¹ | X, Z ] = σu² (ZᵀX)⁻¹ZᵀZ(XᵀZ)⁻¹ ,

which in finite samples is approximated by σ̂u² (ZᵀX)⁻¹ZᵀZ(XᵀZ)⁻¹. Hence, unconditionally, cov(B̂_IV) is σu² E[ (ZᵀX)⁻¹ZᵀZ(XᵀZ)⁻¹ ]. Asymptotically as N → ∞, this covariance matrix can be written as

(σu²/N) E{ plim[ (1/N)ZᵀX ]⁻¹ plim[ (1/N)ZᵀZ ] plim[ (1/N)XᵀZ ]⁻¹ } = (σu²/N) Σ_ZX⁻¹ Σ_ZZ Σ_XZ⁻¹ .
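The simulation sketch below (illustrative numbers, not from the text) compares OLS and the IV estimator B̂_IV = (ZᵀX)⁻¹ZᵀY when the regressor is contemporaneously correlated with the disturbance; OLS remains biased while IV converges to the true coefficient.

```python
# Sketch: IV estimator versus OLS under contemporaneous correlation.
import numpy as np

rng = np.random.default_rng(4)
N = 50_000
z = rng.normal(size=N)                       # instrument: correlated with x, not with u
u = rng.normal(size=N)
x = 0.8 * z + 0.6 * u + rng.normal(size=N)   # x is contemporaneously correlated with u
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)    # inconsistent: slope biased away from 2
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)     # consistent
print(b_ols, b_iv)
```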

12.7 MISSPECIFICATION WITH COEFFICIENTS

In the study of the MLR so far we have assumed that B is constant. When B is not constant, e.g. because of a structural break, then not accounting for the changing B would yield incorrect estimates. Other types of non-constant parameters arise when B is a random coefficient, or when B follows a switching regression model over the business cycle. For example, B could take different values conditional on the state of the economy. The states of the economy could be driven by a Markov chain model. Each period the state could be good G or bad B. If the state is G, next period the probability of G is 0.6 and the probability of B is 0.4. This is shown in Table 12.1 below.
Table 12.1
Random Coefficient in Two States

                Next state G    Next state B
State G             0.6             0.4
State B             0.2             0.8

12.8 HETEROSKEDASTICITY

Suppose Ω_NxN is not known and the elements need to be estimated, but assume the form of the heteroskedasticity is known. Suppose the N disturbances have variances proportional to the δth power of a certain jth explanatory variable Xij as follows:

cov(U) = σu²Ω , where Ω_NxN = diag( X1j^δ, X2j^δ, ..., XNj^δ ) .

The suggested procedure is as follows.
(a) Use OLS to obtain B̂_OLS = (XᵀX)⁻¹XᵀY, which is unbiased but not BLUE.
(b) Find the estimated residuals ûi = Yi - Σ(j=1 to k) B̂j,OLS Xij. (Note: Û is an estimate of U.)
(c) Since var(ûi) ≈ σu² Xij^δ for i = 1, 2, ..., N, then log[var(ûi)] = log σu² + δ log Xij. δ is estimated by OLS regression using log[ûi²] = constant + δ log Xij + residual error.
Obtaining δ̂, this is used as follows.
(d) Ω̂_NxN = diag( X1j^δ̂, X2j^δ̂, ..., XNj^δ̂ ) .
Then, B̂ = (XᵀΩ̂⁻¹X)⁻¹XᵀΩ̂⁻¹Y will be approximately BLUE (subject to sampling error in δ̂).
In situations where Ω is unknown but its form is approximately I_NxN, we may stick to OLS. It is still unbiased, and approximately BLUE, as in the case of any estimated GLS estimator.
Suppose the heteroskedasticity implies

Ω_NxN = diag( σ1², σ2², ..., σN² )

where σu² ≡ 1 here, but the {σt²}, t = 1, 2, ..., N, are not known.
The covariance matrix of B̂ conditional on X is (XᵀX)⁻¹(XᵀΩX)(XᵀX)⁻¹. Since Ω is diagonal,

XᵀΩX = Σ(t=1 to N) σt² Xtᵀ Xt ,

where X_Nxk has rows X1, X2, ..., XN, and Xt is the 1×k vector containing all the explanatory variables at the same time point.
Assume σt² and X are stationary random variables, and that the σt²'s are independent of X. Then for each t, E(σt² XtᵀXt) = E(σt²) E(XtᵀXt). Given the stationarities of {σt²} and X,

plim(N→∞) (1/N) Σ(t=1 to N) σt² XtᵀXt = [ plim (1/N) Σ σt² ] [ plim (1/N) Σ XtᵀXt ] = σ0² Σ_XX ,

where σ0² ≡ plim (1/N) Σ(t=1 to N) σt² and Σ_XX ≡ plim (1/N) Σ(t=1 to N) XtᵀXt. Then XᵀΩX = Σ σt² XtᵀXt is estimated in a large sample (large N) by N σ0² Σ_XX.
Suppose we employ OLS and use N(XᵀX)⁻¹(XᵀΩX)(XᵀX)⁻¹ as the estimate of the covariance of √N (B̂_OLS - B). Asymptotically, this OLS covariance estimator is

plim[ (1/N)XᵀX ]⁻¹ [ (1/N)XᵀΩX ] [ (1/N)XᵀX ]⁻¹ = Σ_XX⁻¹ ( σ0² Σ_XX ) Σ_XX⁻¹ = σ0² Σ_XX⁻¹ .

Thus, when the sample size is large, OLS is still a useful estimator under such unknown heteroskedasticity, as the covariance matrix of the OLS estimators can be estimated and is consistent.
When σt² is a function of variables dependent on X, the covariance matrix49 of B̂ is (XᵀX)⁻¹(XᵀΩX)(XᵀX)⁻¹, which cannot be estimated by the method in the earlier exposition, and the OLS covariance estimator σ̂u²(XᵀX)⁻¹ is also incorrect.
For plausible inference, however, if (σt², Xt) are jointly stationary, White50 (1980) suggests a heteroskedasticity-consistent covariance matrix estimator (HCCME) that is consistent.
First, run OLS and obtain ût = Yt - Xt B̂_OLS. Then let

49 Refer to Estimation and Inference in Econometrics, by R. Davidson and J.G. MacKinnon, Oxford University Press, 1993, pp. 548-554, for more information.
50 White, Halbert, 1980, A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity, Econometrica 48, 817-838.

Ω̂ = diag( û1², û2², ..., ûN² )

where N is the sample size, and apply N(XᵀX)⁻¹(XᵀΩ̂X)(XᵀX)⁻¹ as the HCCME estimator of the covariance of √N (B̂_OLS - B). It can be shown that this estimator is consistent, i.e.

plim N(XᵀX)⁻¹(XᵀΩ̂X)(XᵀX)⁻¹ = plim[ (1/N)XᵀX ]⁻¹ plim[ (1/N)XᵀΩ̂X ] plim[ (1/N)XᵀX ]⁻¹
                            = Σ_XX⁻¹ plim[ (1/N)XᵀΩ̂X ] Σ_XX⁻¹ .

The middle term plim (1/N)XᵀΩ̂X, a k×k matrix with k < N, has only (k² + k)/2 distinct elements, which can be far fewer than N if k is small. Thus,

plim(N→∞) (1/N)XᵀΩ̂X = plim(N→∞) (1/N) Σ(t=1 to N) ût² XtᵀXt

converges to plim (1/N)XᵀΩX. Hence,

plim (XᵀX)⁻¹(XᵀΩ̂X)(XᵀX)⁻¹ = plim (XᵀX)⁻¹(XᵀΩX)(XᵀX)⁻¹ .

Thus, statistical inference on B̂ using this HCCME can be done when the sample size is large. In practice, sometimes a finite-sample correction is used, and the HCCME is computed as

[ N/(N-k) ] (XᵀX)⁻¹(XᵀΩ̂X)(XᵀX)⁻¹ .
There are several tests for the presence of heteroskedasticity.51 If the form is such that the disturbance variance is positively associated with the level of a corresponding explanatory variable, the Goldfeld-Quandt test may be used. In this test, the sample data are sorted in order of the value of the explanatory variable that is associated with the disturbance variance, starting with the data with the lowest disturbance variance. OLS regression is then performed using the first third and the last third of this sorted sample. If the association is true, then the disturbances of the first third will have smaller variance (approximately homoskedastic) than the variance of the disturbances of the last third. Since SSR/(N-k) (or RSS/(N-k)) is the unbiased estimate of the variance of the residuals, the ratio [SSR(last third)/SSR(first third)] × [(n1-k)/(n3-k)], where n1 and n3 are the sample sizes of the first third and last third respectively, is distributed as F(n3-k, n1-k) under the null of no heteroskedasticity.
If the heteroskedasticity is suspected to be of the form σt² = f(Zt), linked linearly to some k-1 exogenous variables Zt, then the Breusch-Pagan & Godfrey test (LM test) can be performed. Here the estimated ût² is regressed against a constant and Zt, and an asymptotic χ²(k-1) test statistic and an equivalent F(k-1, N-k) statistic are reported based on the null hypothesis of zero restrictions on the slope coefficients of Zt.
If the form of the heteroskedasticity is unknown, except that the {σt²}, t = 1, 2, ..., N, are not constants but are functions of variables possibly dependent on X, then White's test can be applied. There are two cases. Either the estimated ût² is regressed against a constant, X, and its squared non-constant terms, e.g. X1t², X2t², etc., or the estimated ût² is regressed against a constant, X, its squared non-constant terms, e.g. X1t², X2t², etc., as well as cross-product terms, e.g. X1tX2t, etc. In either case, an asymptotic χ²(k-1) test statistic and an equivalent F(k-1, N-k) statistic are reported based on the null hypothesis of zero restrictions on the slope coefficients of the k-1 regressors.

51 Refer to Estimation and Inference in Econometrics, by R. Davidson and J.G. MacKinnon, Oxford University Press, (1993), pp. 560-564, for more information.
12.9 SERIAL CORRELATION

When E(UUᵀ) = σu²Ω_NxN ≠ σu²I_N, another possibility is that the off-diagonal elements are not zero. This is equivalent to serial correlation in the disturbances or residuals. Specifically, suppose the disturbance is
ut+1 = ρut + et+1
where et+1 is zero-mean i.i.d. If ρ can be accurately estimated, then Estimated or Feasible GLS can be applied. Specifically, the Cochrane-Orcutt (iterative) procedure is explained here. It tries to transform the disturbances into i.i.d. disturbances so that the problem is less severe, and then the FGLS estimators are approximately consistent and asymptotically efficient. (Recall that if there are lagged dependent variables on the right-hand side with AR(1) disturbances, OLS estimates are biased and inconsistent, and the IV method has to be used.)
The covariance matrix of the disturbances is

          [ 1         ρ         ρ²        ...   ρ^(N-1) ]
          [ ρ         1         ρ         ...   ρ^(N-2) ]
Ω_NxN =   [ ρ²        ρ         1         ...   ρ^(N-3) ]
          [ ...                                          ]
          [ ρ^(N-1)   ρ^(N-2)   ρ^(N-3)   ...   1       ]

Suppose we transform the data into
YN* = YN - ρYN-1 , ... , Y2* = Y2 - ρY1
and
XN* = XN - ρXN-1 , ... , X2* = X2 - ρX1 .
Recall Xt is the 1×k row vector containing all explanatory variables at time t. The system of regression equations is
YN = XN B + uN
YN-1 = XN-1 B + uN-1
...
Y2 = X2 B + u2
Y1 = X1 B + u1
Therefore,
YN* = YN - ρYN-1 = XN B + uN - ρ(XN-1 B + uN-1) = XN* B + eN
...
Y2* = X2* B + e2
Thus, we are back to the classical conditions, and OLS is BLUE on the transformed data. Of course, this is equivalent to GLS on the original data. In practice, since ρ is not known, it has to be estimated.
First run OLS on the original data. Obtain the estimated residuals
ût = Yt - Xt B̂_OLS .
Next, estimate ρ using

ρ̂ = [ (1/N) Σ(t=2 to N) ût ût-1 ] / [ (1/N) Σ(t=2 to N) û²t-1 ] .

Note that the index starts from 2. ρ̂ can also be obtained from the reported Durbin-Watson d-statistic in the OLS regression, since d ≈ 2(1 - ρ̂). Then the transformations are done, viz.
YN* = YN - ρ̂YN-1 , ... , Y2* = Y2 - ρ̂Y1 , Y1* = √(1 - ρ̂²) Y1
XN* = XN - ρ̂XN-1 , ... , X2* = X2 - ρ̂X1 , X1* = √(1 - ρ̂²) X1 .
Then, OLS is run again using the transformed {Yt*} and {Xt*}. The estimators are approximately consistent and asymptotically efficient. Iterations can be performed using the updated OLS ût and ρ̂ for another round of OLS regression till the estimates converge.
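One pass of this procedure is easy to sketch in Python on simulated AR(1)-disturbance data (all values illustrative). The first observation is simply dropped here rather than rescaled by √(1 - ρ̂²); iterating the two steps until ρ̂ settles gives the full Cochrane-Orcutt procedure.

```python
# Sketch: one Cochrane-Orcutt pass (estimate rho, quasi-difference, re-run OLS).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
N, rho = 400, 0.7
x = rng.normal(size=N)
u = np.zeros(N)
for t in range(1, N):
    u[t] = rho * u[t - 1] + rng.normal(scale=0.5)      # AR(1) disturbances
y = 1.0 + 2.0 * x + u
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
e = ols.resid
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])         # rho from lagged residuals

y_star = y[1:] - rho_hat * y[:-1]                      # quasi-differenced data
X_star = X[1:] - rho_hat * X[:-1]
co = sm.OLS(y_star, X_star).fit()
print(rho_hat, ols.params, co.params)
```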
The presence of serial correlation, and hence non-spherical disturbances and a breakdown of the classical conditions, can be tested using the Durbin-Watson test, the Box-Pierce-Ljung Q-tests discussed earlier, and the Breusch-Godfrey test (LM test).
There are three limitations of the DW test as a test for serial correlation. First, the distribution of the DW d-statistic under the null hypothesis depends on the data matrix. Thus bounds are placed on the critical region, within which the test results are inconclusive. Second, if there are lagged dependent variables on the right-hand side of the regression, the DW d-test is no longer valid. Lastly, the d-test is strictly valid as a test of the null hypothesis of no serial correlation only against the alternative hypothesis of first-order serial correlation.
The other tests of serial correlation, viz. the Box-Pierce-Ljung Q-test and the Breusch-Godfrey test (LM test), overcome these limitations. The essential idea of the LM test is to test the null hypothesis that there is no serial correlation in the residuals up to the specified order. Here the estimated ût is regressed against a constant and the lagged ût-k's, and an asymptotic χ² test statistic and an equivalent F statistic are reported based on the null hypothesis of zero restrictions on the slope coefficients of the ût-k's. Strictly speaking, the distribution of the F-statistic is not known, and the F-distribution is an approximation since the residuals ût are estimated.

12.10 PROBLEM SET

12.1 Suppose an OLS regression
Yt = A + BZt + et        [1]
is run, where A, B are constants, and et ~ N(0, σe²) is independent of Zt. The OLS estimates are Â and B̂. We re-arrange the terms in [1] as follows:
B⁻¹Yt = B⁻¹A + Zt + B⁻¹et , so
Zt = C + DYt + ηt        [2]
where C = -A/B, D = 1/B, and ηt ~ N(0, σe²/B²). If we perform another OLS regression on [2], do we obtain the OLS estimate D̂ = 1/B̂? Explain why or why not.

12.2 A theoretical model is postulated as

Yt = exp(a + bXt + cZt + εt) / [1 + exp(a + bXt + cZt + εt)] ,

where a, b, and c are constants, and εt is a zero-mean i.i.d. noise. In addition, it is thought that the restriction 2b = c holds. If time series data on {Xt, Yt, Zt} are available, explain how you would perform a linear regression to estimate a, b, and c. Perform a manual computation of the OLS estimates of a, b, and c using the following small sample, and test H0: a = 1 and b = 0.

t      1      2      3      4
Xt     10     15     5      20
Yt     0.6    0.5    0.7    0.8
Zt     3      6      2      5
12.3 Two researchers A and B performed the following regressions of stock returns on various factors:
A:  rt = a0 + a1Xt + a2Yt + et
B:  rt = b0 + b1Xt + b2Zt + ut
where et and ut are disturbance terms satisfying the classical conditions. Both reported significant coefficients for â1, â2, b̂1, and b̂2. A third researcher C felt that A had underspecified, since he did not include Zt, which was obviously significant in B's regression. Similarly, he felt that B had underspecified, since he did not include Yt, which was obviously significant in A's regression. So C performed the following regression to improve on the perceived deficiency in the specification:
C:  rt = c0 + c1Xt + c2Yt + c3Zt + vt
where the disturbance vt satisfies the classical conditions. He was surprised to find that now his OLS estimates ĉ2 and ĉ3 are not significantly different from zero. Should he then drop the variables Yt and Zt and run rt = c0 + c1Xt + wt, where wt is a disturbance term? Or should he pick one of A's or B's regressions? Explain.
12.4 A researcher wants to test a model to explain variations in Yt. He has two plausible explanatory variables Xt and Zt. He performs three separate regressions as follows:
Yt = a + bXt + et        [1]
Yt = a + cZt + et        [2]
Yt = a + bXt + cZt + et        [3]
where et represents the disturbance in each case. The following table indicates the estimates and their p-values in brackets.

          a               b               c
[1]   -0.01 (0.36)    0.21 (0.04)        -
[2]    0.01 (0.55)        -          0.18 (0.06)
[3]    0.02 (0.87)    0.16 (0.34)    0.08 (0.41)

Explain whether there is any problem with the data here. How would you proceed to handle the problem?

Table 12.2
Glossary of Specification Errors and Remedies

Problem: Heteroskedasticity
Special case: Non-constant variances but zero correlations, i.e. covariance matrix Ω ≠ I is diagonal. Variance form is known, e.g. Xi².
Effect on OLS: Unbiased and consistent BUT not efficient (in both finite sample and large sample). Incorrect standard error when using var(B̂) = σ̂²(XᵀX)⁻¹.
Test if problem exists: Goldfeld-Quandt test; Breusch-Pagan & Godfrey test (LM test).
Solution: Weighted LS is BLUE, consistent and asymptotically efficient (consistent estimation of the parameter in the heteroskedastic form gives rise to this estimator consistency).

Problem: Heteroskedasticity
Special case: Non-constant variances but zero correlations, i.e. covariance matrix Ω ≠ I is diagonal. Variance form is not known.
Effect on OLS: Unbiased and consistent BUT not efficient (in both finite sample and large sample). Incorrect standard error when using var(B̂) = σ̂²(XᵀX)⁻¹.
Test if problem exists: White's test.
Solution: White's heteroskedasticity-consistent covariance matrix adjustment to the standard error estimate. The adjusted standard error is now correct asymptotically. OLS is consistent and the adjusted covariance matrix allows correct statistical inference. However, the estimator is not asymptotically efficient.

Problem: Non-zero cross correlations or autocorrelated disturbances in time series
Special case: Covariance matrix Ω ≠ I contains non-zero off-diagonal elements, but the form is unknown (disturbance could be MA or AR).
Effect on OLS: Unbiased and consistent BUT not efficient (in both finite sample and large sample). Incorrect standard error when using var(B̂) = σ̂²(XᵀX)⁻¹.
Test if problem exists: Box-Pierce-Ljung Q-tests; Breusch-Godfrey test (LM test).
Solution: -

Problem: Non-zero cross correlations or autocorrelated disturbances in time series
Special case: Disturbances identically distributed (homoskedastic) but serially correlated; disturbance form known, e.g. AR(1).
Resulting from: Covariance matrix Ω ≠ I contains non-zero off-diagonal elements.
Effect on OLS: Unbiased and consistent BUT not efficient (in both finite sample and large sample). Incorrect standard error when using var(B̂) = σ̂²(XᵀX)⁻¹.
Test if problem exists: Durbin-Watson d-test; Box-Pierce-Ljung Q-tests; Breusch-Godfrey test (LM test).
Solution: Estimated or Feasible GLS (Cochrane-Orcutt or else Hildreth-Lu or other versions of iterative procedures are used). Consistent and asymptotically efficient; approximately BLUE (in finite sample).

Problem: Disturbances ut with zero contemporaneous correlation with explanatory variable(s) BUT stochastically dependent
Special case: Lagged endogenous (dependent) variable, Yt = a + bXt + cYt-1 + ut, such that cov(Yt-1, ut) = 0 but cov(Yt-1, ut-1) ≠ 0, so in general Y, U are not stochastically independent.
Resulting from: Partial adjustment models. The actual dependent variable is not observed, Yt* = a + bXt + et, but the observed Yt = Yt-1 + q(Yt* - Yt-1) + ut, i.e. with delayed adjustment.
Effect on OLS: Biased and not efficient, but still consistent because there is zero contemporaneous correlation. Asymptotically efficient.
Test if problem exists: -
Solution: OLS is consistent and asymptotically efficient.

Problem: Disturbances ut with contemporaneous correlation with explanatory variable(s) (hence also stochastically dependent)
Special case: Yt = a + bXt + ut such that cov(Xt, ut) ≠ 0.
Resulting from: Errors-in-variables, or measurement error in the explanatory variable.
Effect on OLS: Biased and not efficient. Not consistent because there is contemporaneous correlation. Not asymptotically efficient.
Test if problem exists: -
Solution: If the error form and specification are known, a correction can be made to establish a consistent estimator with asymptotic inference.

Problem: Disturbances ut with contemporaneous correlation with explanatory variable(s) (hence also stochastically dependent)
Special case: Yt = a + bXt + ut such that cov(Xt, ut) ≠ 0.
Resulting from: Simultaneous equations bias.
Effect on OLS: Biased and not efficient. Not consistent because there is contemporaneous correlation. Not asymptotically efficient.
Test if problem exists: -
Solution: For an exactly identified system, use Instrumental Variables or Indirect Least Squares. For an over-identified system, use two-stage least squares or else the limited information maximum likelihood method. All these enable consistency with asymptotic variance, and are thus testable, but are usually not asymptotically efficient.

Problem: Disturbances ut with contemporaneous correlation with explanatory variable(s) (hence also stochastically dependent)
Special case: Lagged endogenous (dependent) variable, Yt = a + bXt + cYt-1 + ut, AND disturbances follow an MA process ut = et + θet-1, et i.i.d., so cov(Yt-1, ut) ≠ 0 and cov(Yt-1, ut-1) ≠ 0; in general Y, U are not stochastically independent, and the contemporaneous correlation is not zero.
Resulting from: Imposing Koyck lags on lagged explanatory variables, or Adaptive Expectations models where Yt = a + b·XtE + et, and XtE is an expectation formed at t that follows XtE = Xt-1E + q(Xt - Xt-1E).
Effect on OLS: Biased and not efficient. Not consistent because there is contemporaneous correlation. Not asymptotically efficient.
Test if problem exists: Box-Pierce-Ljung Q-tests; Breusch-Godfrey test (LM test). Note: MA processes are also infinite AR processes.
Solution: If θ can be accurately estimated, then Estimated or Feasible GLS (e.g. the Zellner-Geisel method) tries to transform the model into one with i.i.d. disturbances so that the problem is less severe, and the FGLS is then approximately consistent and asymptotically efficient. Or else perform Instrumental Variables regression: consistent with asymptotic variance, thus testable, but usually not asymptotically efficient.

Problem: Disturbances ut with contemporaneous correlation with explanatory variable(s) (hence also stochastically dependent)
Special case: Lagged endogenous (dependent) variable, Yt = a + bXt + cYt-1 + ut, AND disturbances follow an AR process ut = ρut-1 + et, et i.i.d., so cov(Yt-1, ut) ≠ 0 and cov(Yt-1, ut-1) ≠ 0; in general Y, U are not stochastically independent, and the contemporaneous correlation is not zero.
Effect on OLS: Biased and not efficient. Not consistent because there is contemporaneous correlation. Not asymptotically efficient.
Test if problem exists: Durbin-Watson h-test; Box-Pierce-Ljung Q-tests; Breusch-Godfrey test (LM test).
Solution: If ρ can be accurately estimated, then Estimated or Feasible GLS (Cochrane-Orcutt or else Hildreth-Lu or other versions of iterative procedures) tries to transform the problem into one with i.i.d. disturbances so that the problem is less severe, and the FGLS is then approximately consistent and asymptotically efficient. Or else perform Instrumental Variables regression: consistent with asymptotic variance, thus testable, but usually not asymptotically efficient.

Problem: Wrong regression form or specification
Special case: Nonlinear form.
Effect on OLS: Spurious fit and bad forecasts.
Test if problem exists: Check the scatterplot and also the display of estimated residuals.
Solution: Transform into linear form or perform nonlinear regressions.

Problem: Wrong explanatory variables
Special case: Exclusion or omission of relevant explanatory variables.
Effect on OLS: Biased and not efficient. Not consistent. Not asymptotically efficient. Incorrect standard error when using var(B̂) = σ̂²(XᵀX)⁻¹.
Test if problem exists: Low R² and/or unreasonable economic interpretations of estimated coefficients.
Solution: Add variables based on theoretical reasoning or a model. Or start with overfitting and reduce the number of explanatory variables by optimal model selection.

Problem: Wrong explanatory variables
Special case: Inclusion of irrelevant explanatory variables.
Resulting from: Data mining/overfitting or a wrong theoretical model.
Effect on OLS: Still BLUE in the sense that the estimators of the coefficients of the irrelevant explanatory variables will be expected to be zero (and also consistent). But this may introduce unnecessary data multi-collinearity in finite samples and hence less precision for the relevant estimators, i.e. the sampling variances of the estimators of the coefficients of relevant variables are now larger. Also, forecasts are seriously biased when irrelevant variables are used (as non-zero coefficient estimates appear in the forecast equation).
Test if problem exists: Test for insignificant t-statistics for the coefficient estimates.
Solution: Model selection by maximal adjusted R² or minimal AIC or SC or HQ, trimming excess insignificant variables.

Problem: Multi-collinearity
Special case: Lagged explanatory variables are one cause.
Resulting from: At least one explanatory variable is close to being a linear combination of other explanatory variable(s).
Effect on OLS: Low t-statistics (large sampling errors of the estimators).
Test if problem exists: Correlation matrix and test of autocorrelation amongst lagged explanatory variables.
Solution: Remove one of the two highly correlated regressors if theory allows. Or increase the sample size.

Problem: Multi-collinearity
Special case: Explanatory variables consist of lags of the same variable, e.g. Yt = c0 + c1Xt + c2Xt-1 + c3Xt-2 + c4Xt-3 + ... + et.
Resulting from: The lagged explanatory variables are highly correlated.
Effect on OLS: Low t-statistics (large sampling errors of the estimators).
Test if problem exists: Correlation matrix and test of autocorrelation amongst lagged explanatory variables.
Solution: Impose restrictions on the coefficients, e.g. a Koyck lag so as to concentrate the effect onto just one Xt [but this leads to the lagged endogenous (dependent) variable problem], or else a polynomial distributed lag to concentrate the effect onto just a few composite (linear averages of the Xt-k's) explanatory variables. Or increase the sample size.

234
Problem: Non-stationary regressand and regressor
Special cases: Regressand and regressor are trend stationary.
Test if problem exists: Test for presence of a time trend.
Resulting from: both being a linear function of a strong time trend.
Effect on OLS: spurious coefficients showing a significant slope when it does not exist.
Solution: Remove the trend by first differencing and run the regression using the first differenced series.

Problem: Non-stationary regressand and regressor
Special cases: Stochastic trend.
Test if problem exists: Augmented Dickey-Fuller unit root test.
Resulting from: both being unit root processes.
Effect on OLS: spurious coefficient estimates.
Solution: Cointegration regression if there is cointegration between the unit root processes. If not, perform OLS on first differences. (A sketch of the unit root and cointegration tests follows below.)
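The following is a minimal sketch of the diagnosis just described, using simulated random walks: an augmented Dickey-Fuller test on each series and an Engle-Granger cointegration test on the pair. All series names are hypothetical.

# Sketch: unit root and cointegration checks on simulated random walks
import numpy as np
from statsmodels.tsa.stattools import adfuller, coint

rng = np.random.default_rng(1)
T = 300
x = np.cumsum(rng.normal(size=T))          # random walk regressor
y = 0.5 * x + rng.normal(size=T)           # cointegrated with x by construction

adf_y = adfuller(y, regression="c")         # H0: y has a unit root
adf_x = adfuller(x, regression="c")
print("ADF p-values (y, x):", round(adf_y[1], 3), round(adf_x[1], 3))

# Engle-Granger test: H0 of no cointegration between y and x
t_stat, p_val, _ = coint(y, x)
print("Cointegration test p-value:", round(p_val, 3))

# If no cointegration, regress first differences instead of levels
dy, dx = np.diff(y), np.diff(x)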

Problem: Variable coefficients or parameters
Special cases: structural breaks, or random coefficient models, or switching regression models.
Test if problem exists: Chow test, CUSUM and CUSUM-of-squares tests, Hausman test.
Resulting from: a very long time series, or economic regime changes, or a major economic event.
Effect on OLS: spurious coefficient estimates.
Solution: Apply model-specific techniques principally based on maximum likelihood.

Problem: Wrong distributional assumption of disturbance
Special cases: Assumption of normally distributed disturbances.
Test if problem exists: Jarque-Bera test, Shapiro-Wilk test.
Effect on OLS: Wrong statistical inference on OLS estimators and forecasts.
Solution: Ascertain the distribution and employ maximum likelihood methods.

Chapter 13
CROSS-SECTIONAL REGRESSION
APPLICATION: TESTING CAPM

Key Points of Learning
Mean-Variance Efficiency, Roll's critique, Fama-MacBeth Procedure, Cross-sectional Regression Test, Coefficient Restriction, Wishart Distribution, Hotelling's T² Statistic, Asymptotic Chi-Square Test, Asymptotic Tests, Wald Test Statistic, Idiosyncratic Risk

In this chapter we continue investigation of the CAPM as discussed in the


earlier chapters. We shall employ multiple linear regressions. The testing of
the Sharpe-Lintner CAPM was a very hot topic of research in finance
academia especially during the 1970s and 1980s.
13.1  TEST OF MEAN-VARIANCE EFFICIENCY

The CAPM is closely related to Markowitz's mean-variance portfolio optimization problem.52 When all investors in the market choose the same portfolio that is optimally diversified with minimum variance for a given level of expected return, then this market portfolio must lie on the mean-variance efficient frontier. The investor's optimization problem is:

min over w of  wᵀVw   (minimize variance of portfolio return)
subject to  wᵀE(R) = k  and  wᵀj = 1,

where wᵀ ≡ (w1, w2, w3, ..., wN) is the vector of portfolio weights on the N stocks in the market, V is the N×N covariance matrix of the stock returns, R is the N×1 vector of random stock returns in the market over the investment period, and j is an N×1 vector of ones. There is no constraint on short-selling or on borrowing and lending at the riskfree rate rf over the period. The required rate of portfolio return is k > rf. This is a single-period static optimization problem.
We can form the Lagrangian L and minimize it. The Lagrangian incorporates the constraints into the objective function:

min over w, a>0, b>0 of  L ≡ wᵀVw + a[k − wᵀE(R)] + b[1 − wᵀj].

The FOCs are:
∂L/∂w = Vw − aE(R) − bj = 0   (13.1)
∂L/∂a = k − wᵀE(R) = 0   (13.2)
∂L/∂b = 1 − wᵀj = 0   (13.3)
where 0 is an N×1 vector of zeros (the factor of 2 from differentiating wᵀVw is absorbed into the multipliers a and b).

From equation (13.1), the optimal (mean-variance, MV, efficient) portfolio weights are given by:
w* = aV⁻¹E(R) + bV⁻¹j.   (13.4)

52 Markowitz, Harry, (1952), Portfolio Selection, Journal of Finance, 7, 77-99.

Therefore, using (13.4) and (13.2),
k = [E(R)]ᵀ{aV⁻¹E(R) + bV⁻¹j} = a E(R)ᵀV⁻¹E(R) + b E(R)ᵀV⁻¹j.   (13.5)
Using (13.4) and (13.3),
1 = jᵀ{aV⁻¹E(R) + bV⁻¹j} = a jᵀV⁻¹E(R) + b jᵀV⁻¹j.   (13.6)
Let A = jᵀV⁻¹E(R), B = E(R)ᵀV⁻¹E(R), C = jᵀV⁻¹j, and D = BC − A².
Then, solving (13.5) and (13.6):
a* = [Ck − A]/D  and  b* = [B − Ak]/D.
From (13.4) again,
w* = a*V⁻¹E(R) + b*V⁻¹j
   = {[Ck − A]/D} V⁻¹E(R) + {[B − Ak]/D} V⁻¹j
   = [B V⁻¹j − A V⁻¹E(R)]/D + {[C V⁻¹E(R) − A V⁻¹j]/D} k
   = p + qk   (13.7)
where p = [B V⁻¹j − A V⁻¹E(R)]/D and q = [C V⁻¹E(R) − A V⁻¹j]/D are constant N×1 vectors. One can see that p is a MV portfolio when k = 0, and p + q is a MV portfolio when k = 1. For different values of k, we can map out a MV efficient frontier or locus of portfolios on the graph of k versus portfolio risk √(w*ᵀVw*), which is a hyperbola. This was shown in Figure 5.1 of chapter 5.

Equation (13.7) provides a characterization of MV portfolios. Similarly, any portfolio that satisfies (13.7) lies on the MV efficient portfolio frontier. The fund p + q with expected return 1, and the fund p with expected return 0, both lie on the MV frontier (sometimes the term efficient frontier is reserved only for the part of the MV frontier above the minimum variance portfolio). (A numerical sketch of p, q, and the frontier weights is given below.)
Consider any two MV frontier portfolios g and h with weight vectors g = p + q kg, with expected return kg, and h = p + q kh, with expected return kh. We can form a linear combination of these two MV frontier portfolios:
z = αg + (1 − α)h
  = α[p + q kg] + (1 − α)[p + q kh]
  = p + q {α kg + (1 − α) kh}.
Clearly, portfolio z is itself a MV frontier portfolio since its portfolio weights satisfy the property in (13.7). Its expected return is kz = α kg + (1 − α) kh.

We have just shown an important result in portfolio mathematics, that there exists a two-fund separation theorem, i.e. any efficient portfolio can be created by investing in only two efficient funds. In other words, any linear combination of two efficient portfolios is itself an efficient portfolio. We let portfolio h have expected return kh equal to the riskfree rate rf. Then portfolio h's weight vector is p + q rf. Such a portfolio exists on the MV frontier.

By the above result, the market portfolio m, which is MV efficient, can be expressed as a linear combination of any arbitrary MV efficient portfolio g and the MV efficient portfolio h with expected return rf, such that
m = αm g + (1 − αm) h
  = αm [p + q kg] + (1 − αm) [p + q rf]
  = p + q {αm kg + (1 − αm) rf}
  = p + q km
where km = αm kg + (1 − αm) rf.   (13.7)

We can continue with an equilibrium condition to derive the CAPM's security market line (SML) as seen in Chapter 5:
E(ri) = rf + bi [E(rm) − rf],   (13.8)
where E(ri) ≡ ki is the expected return of the ith portfolio (it can also be a stock, simply by taking the weights to be [0,0,0,...,0,1,0,...,0] with 1 in the ith position of the N×1 vector). E(rm) ≡ km and rf are respectively the expected market return and the riskfree rate for the period.

Substituting (13.7) into (13.8):
E(ri) = rf + bi [αm E(rg) + (1 − αm) rf − rf]
      = rf + bi αm [E(rg) − rf]
      = rf + βi [E(rg) − rf]   (13.9)
where E(rg) ≡ kg, and βi = bi αm is a constant associated with portfolio g. Equation (13.9) gives rise to the regression equation:
ri = rf + βi [rg − rf] + ei   (13.10)
where it is assumed that cov(rg, ei) = 0, and hence βi = cov(ri, rg)/var(rg), which acts like a market beta if indeed rg is the market return.
Roll's critique53 was a clever and deep insight into the CAPM tests of (13.8). Typically, the S&P 500 or some other market index is used as a proxy for the market portfolio price. Its returns are assumed to be market returns rmt, for a range of t. However, Roll noted that there is no way to verify if the proxy return is indeed the market return. Thus acceptance of the test restrictions in (13.8) merely implies that the proxy or the arbitrary portfolio g follows (13.9) and (13.10) instead. This in turn merely implies that there is evidence the proxy is a MV efficient portfolio. Nothing more. Hence all the tests of CAPM in the past using proxies are nothing more than tests of whether the proxies are mean-variance efficient, i.e. whether the proxy portfolio lies on the MV efficient frontier. They do not test for CAPM in (13.8), and unless a market return can truly be observed, the CAPM is unlikely to be testable in the formal and exact sense. It may also be said that rejection of a test on the de facto (13.9) or (13.10) is not necessarily a rejection of CAPM in (13.8), but rather a rejection of the hypothesis that the chosen proxy is MV efficient.

Though the idealized situation of a definitive test of the CAPM cannot be realized, given that we do not know exactly what the market return is, as it may embody human capital returns and other assets' returns not captured in standard proxies such as market index returns, researchers still agree that CAPM tests are useful for knowing whether, given the set of stocks, the standard proxies are mean-variance efficient (or sometimes called minimum variance efficient). If the proxy is MV efficient, then it is still useful to employ the proxy to describe differences in expected returns of different assets based on their return correlations with the proxy return. In what follows, a stated CAPM test would assume the market index is a good proxy for the market return.

13.2  CROSS SECTION OF EXPECTED RETURNS

The CAPM as primarily represented by the restriction in (13.8), or the SML, has at least three major testable implications. In each period, the expected returns of all assets are linearly related to their betas. (13.8) applies to individual stocks as well as to portfolios that are linear combinations of such stocks. Hence we may write a testable restriction as:
Ei = γ0 + γ1 bi ,  for i = 1,2,...,N   (13.11)
where N is the number of assets, Ei ≡ E(ri), and γ0, γ1 are constants. In particular, γ0 can be interpreted as the riskfree rate, and γ1 as the excess market return or the market risk premium, according to (13.8). In other words, the variation in assets' betas must explain the variation in assets' expected returns, or the cross section of expected returns. In fact, under CAPM, the implication is quite strong in that only betas are systematic in explaining all the variation of the expected returns.
Intuitively, the CAPM result or SML relates to Markowitz portfolio diversification. An asset whose return has a high covariance with rmt does not contribute as much to market portfolio variance reduction as does an asset whose return has a low or even negative covariance with rmt. Thus the high covariance asset, or equivalently the high beta asset, has higher systematic risk (risk that cannot be diversified away due to covariation with rmt) and thus a higher expected return according to (13.8). Thus, when there is a relatively high expected return (or high required return) on an asset given its current price Pt, which is the same as saying there is a relatively high expected future price E(PT), one key possibility is that the asset has a high risk premium attached to it, so that the relatively higher required return or risk premium compensates the investor for bearing the higher systematic risk.
Tests using restriction (13.11) across assets for a particular period are called cross-sectional tests of the CAPM. Alternative cross-sectional tests of the CAPM include:
Ei = γ0 + γ1 bi + γ2 bi² + ... + γk σi ,  for i = 1,2,...,N   (13.12)
and testing H0: γ2 = γ3 = ... = γk = 0, while γ0 = rf and γ1 = E(rm) − rf. The additional regressors, such as the squared beta bi² and the residual (idiosyncratic) risk σi, should carry zero coefficients if the CAPM holds.

A second testable implication is that the expected market risk premium must be positive. The random rmt at any point in time should have an expectation that is greater than rft, which is constant at time t. This is an implication of positive risk aversion.54 The (expected) market risk premium E(rmt − rft) > 0. This also ensures that assets whose systematic risk is high, i.e. with high positive beta bi, must yield a high (expected) return. This is true only if γ1 > 0. This implication is usually validated at the same time when the cross-sectional regression is performed and γ1 is estimated.
The third testable implication involves time series testing of the asset returns rit over time for i = 1,2,...,N. We assume stationary stochastic processes for the returns, and extend the single-period CAPM to a time series representation:
rit = rft + bi [rmt − rft] + eit ,  for i = 1,2,...,N and t = 1,2,...,T
that is consistent with (13.8), though with additional auxiliary assumptions such as cov(rmt, eit) = 0 and bi being constant over time. The time series representation can be re-written as:
rit − rft = ai + bi [rmt − rft] + eit ,  for i = 1,2,...,N and t = 1,2,...,T.   (13.13)
A testable restriction is then H0: ai = 0 for all i = 1,2,...,N. If we had instead run time series regressions on rit = ai + bi rmt + eit, then an alternative and more complicated testable restriction on the coefficients is H0: ai = rf(1 − bi) for all i = 1,2,...,N, assuming rft = constant rf for all t = 1,2,...,T. Even more complicated tests involve cross-sectional time series regressions.55

54 A basic definition of risk aversion is that if an investor prefers a certain outcome $K to an uncertain experiment in which the expected outcome is $K but which could have actual outcomes larger or smaller than K, then the investor is said to be risk averse.
We provide a cross-sectional regression of (13.11) as follows. 135 large and liquid (actively traded) stocks on the Singapore Exchange were grouped into 14 portfolios according to the size of their betas. The first portfolio contains stocks with the lowest betas and the 14th portfolio contains stocks with the highest betas. The betas range from 0.6 to 1.6. The portfolios are chosen so that the number of stocks in each portfolio is about the same, while at the same time the resulting portfolio betas are spread out as widely as possible. The spread in the explanatory variable, beta, is to ensure adequate statistical power for the test. The betas were computed based on returns over the past 5 years and were checked against a published source, Corporate Handbook Singapore, Thomson Financial Publishing, to ensure consistency. The averaged return of the stocks in a portfolio is then used as the proxy for the expected return of that portfolio, i.e. the dependent variable in (13.11). For the following example, we use the averaged portfolio returns of the averaged monthly returns of the stocks during the period October 2001 to September 2002.
Equation (13.11) is consistent with the following regression equation:
ri = γ0 + γ1 bi + εi ,  for i = 1,2,...,N   (13.14)
where E(ri) = Ei, E(εi) = 0, and cov(bi, εi) = 0. Moreover, we shall add the usual classical OLS conditions that the error εi is homoskedastic and serially uncorrelated.

55 See Gibbons, M., (1982), Multivariate tests of financial models: A new approach, Journal of Financial Economics, 10, 3-27.
Using averaged portfolio returns introduces less error into the dependent variable, which is supposed to be an expected return. This basically means that the residual error variance is smaller. Betas are typically measured with sampling errors. Portfolio betas allow for averaging across the stock betas in the portfolio and thus have smaller measurement errors. In the chapter on Specification Errors, we saw how reducing this measurement error is important to ensure betas are not biased downward.
The graphical plot of return versus beta is shown in Figure 13.1 below. The monthly return is an average of the returns during October 2001 to September 2002.

Figure 13.1
Plot of Portfolio Averaged Monthly Return versus Portfolio Beta for Singapore stocks, October 2001 to September 2002
[Scatter of average monthly portfolio return (vertical axis, -.02 to .03) against portfolio beta (horizontal axis, 0.6 to 1.6), with the realized SML and the flatter OLS regression line drawn through the points.]
OLS regression is performed on (13.14) and the results are shown in Table 13.1. Table 13.1 shows that the estimated slope is 0.0005, suggesting an expected market risk premium of only 0.05% per month or 0.6% p.a. The intercept is positive at 0.3% per month. Both estimates are not significantly different from zero, with p-values above 0.8. The realized riskfree rate in the Singapore economy at the time was about 0.6% p.a., close to zero. The ex-post market return was about 6% p.a. Though both the OLS regression intercept and slope estimates are positive and hence support the implications of the SML in CAPM, Figure 13.1 shows that the OLS fitted line is flatter than the realized SML according to (13.8):
E(ri) = 0.0005 + bi × 0.0045.
It means that beta alone cannot fully explain the cross-sectional variation of expected returns.56
Table 13.1
Cross-sectional Regression of Portfolio Return on Portfolio Beta
ri = γ0 + γ1 bi + εi

Dependent Variable: PORTRET
Method: Least Squares
Sample: 1 14
Included observations: 14

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           0.002941       0.012179      0.241508       0.8132
BETA        0.000493       0.010413      0.047316       0.9630

R-squared            0.000187    Mean dependent var      0.003502
Adjusted R-squared  -0.083131    S.D. dependent var      0.010120
S.E. of regression   0.010532    Akaike info criterion  -6.137247
Sum squared resid    0.001331    Schwarz criterion      -6.045953
Log likelihood      44.96073     Hannan-Quinn criter.   -6.145698
F-statistic          0.002239    Durbin-Watson stat      1.684388
Prob(F-statistic)    0.963039
The large standard errors of the regression estimates in Table 13.1 could be due to the measurement errors in the betas. Specifically, in the regression of (13.14), the estimated betas (from time series) contain errors as they are sampling estimates of the population parameters βi = cov(rit, rmt)/var(rmt). Performing an OLS regression with an independent variable bi measured with error incurs an errors-in-the-variables problem, something we will consider in detail in a later chapter. Therefore, to avoid this error and yet test (13.11), Fama and MacBeth (1973)57 used a three-step procedure that has come to be known as the Fama-MacBeth procedure or approach in cross-sectional regressions involving asset expected returns. A second issue that the Fama-MacBeth procedure addressed is that any non-zero cross-sectional correlations of εi in (13.14) would lead to an estimator distribution that is different from the usual t-statistics and hence incorrect inference based on the t-tests.

56 A similar but stronger result for U.S. stocks during 1928 to 2003 is discussed in Fama, Eugene F., and Kenneth R. French (2004), The Capital Asset Pricing Model: Theory and Evidence, Journal of Economic Perspectives, 18, 25-46. The result was also found by some earlier research on the U.S. stock market.
13.3  FAMA-MACBETH PROCEDURE

This approach was used in the original article, Fama and MacBeth (1973), and in later articles such as Fama and French (1992).

(1) In step one, the N assets are divided into M portfolios. These are divided based on criteria such as initial beta estimates58 (which are subject to the measurement errors). Or the N assets could first be divided into M portfolios based on firm size, and within each sized portfolio there is a further division into M sub-portfolios by initial betas. The latter would create M² portfolios. Then compute the average or equal-weighted portfolio return R̄j for each of the M (or M², whichever case it may be) portfolios.
The idea is that the stocks in each portfolio sorted by initial beta will behave similarly (with the same distribution) given their betas are approximately the same. Averaging their returns in a particular period, say month t, will produce an observation close to the portfolio's expected return at time t. Any remaining measurement error in the expected return is not critical as this variable is a dependent variable. This point will also become clearer in later chapters.

(2) In step two, the beta for each portfolio is re-computed using a time series regression of returns data on rjt − rft = βj{rmt − rft} + ejt, typically over 60 monthly observations. This regression could be performed in two ways: (a) on the equal-weighted portfolio return as the dependent variable rjt, thus obtaining the portfolio β̂j; or (b) on each stock return rjt in a portfolio, and then averaging these betas to obtain the portfolio beta.
The idea is that the estimated portfolio beta for each of the M portfolios will contain as small a measurement error as possible since there is initial beta sorting. Finding a beta from such a portfolio containing assets with closely similar betas will yield a more accurate estimate of beta. This is one of the key merits of the Fama-MacBeth approach for cross-sectional regressions.

57 See Fama, Eugene F., and James D. MacBeth, (1973), Risk, Return, and Equilibrium, Journal of Political Economy, Vol 81, 607-636. See also Fama, Eugene F., and Kenneth R. French, (1992), The Cross-Section of Expected Stock Returns, Journal of Finance, Vol XLVII, No 2, June, 427-465.
58 Betas may be computed as prior betas, i.e. using past time series of the stocks, e.g. the last 60 months, to compute the beta for the next month's regression. Sometimes forward betas, i.e. the next 60 months, are used instead if it is supposed that the investor in the market looked forward rather than backward to infer beta.
Steps (1) and (2) above produce M bivariate data points {R̄j, β̂j} for j = 1,2,...,M. These would be sufficient for the cross-sectional regression59
R̄j = a + λ β̂j + ej ,  j = 1,2,3,...,M   (13.15)
to estimate a and λ, in which case a should be an estimate of the riskfree rate for month t, and λ an estimate of the market risk premium for month t.
(3) The entire cycle of steps one and two is then repeated by moving one month forward to t+1. Portfolios are reconstructed, and {R̄j, β̂j} re-estimated. Sometimes the portfolio reconstruction is done only once each year ahead. In this case, monthly returns and betas are still used for the cross-sectional regression, but the changes from month to month within that year will be due to delistings and other reasons. For other FM-type regressions that involve other explanatory variables, as in Fama and French (1992), regressions could be run with individual stock returns as the dependent variable. In this case, the associated beta could still be computed based on something like step (2) above to reduce errors-in-variables. However, in this case, all stocks in the same portfolio for that period will have the same estimated beta. Since the cross-section could involve thousands of stocks from, say, 100 portfolios (M² = 100), using the same beta for some stocks is not a big issue.

The a and λ are estimated cross-sectionally for each month, and then the monthly time series of estimates are averaged across time (or across the different cross-sectional regressions). The averages are then tested to see if they are significantly different from zero, or positive as in the case of the average of λ̂. This last step avoids the problem that any non-zero cross-sectional correlations of εi in (13.14) would lead to incorrect inference based on the t-tests for a single-month cross-sectional regression. There are of course implicit assumptions that the alphas and betas are approximately constant, that the estimators are uncorrelated over time, and that the sampling distributions of the estimated alphas and betas follow the central limit theorem, in which case the time series has to be reasonably long.

59 Some studies use returns as the dependent variable, while others use excess returns as the dependent variable.
Suppose such cross-sectional regressions are performed over T number of months, and T such slope estimates λ̂t are obtained. We then test if the time-average of λ̂,

λ̄ = (1/T) Σ from t=1 to T of λ̂t ,

is significantly positive, or H0: E(λ̂) ≤ 0, by considering the t-statistic

λ̄ / [sd(λ̂)/√T]

where sd(λ̂) = √[ (1/T) Σ from t=1 to T of (λ̂t − λ̄)² ]. In the Fama and MacBeth (1973) study, one particular λ̄ averaged over 72 months during 1935 to 1940 was found to be 1.09%, but sd(λ̂) was 11.6%. Hence the t-statistic with d.f. 71,

√72 × (1.09/11.6) ≈ 0.79 ,

was found not to be significantly positive. We can also test if the averaged â differs significantly from the riskfree rate that prevails during the sampling period. (A sketch of this procedure is given below.)
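The following sketch implements the flavor of this last step: it runs a cross-sectional regression of returns on betas month by month, collects the monthly slope estimates, and applies the time-series t-test to their average. All data are simulated and the variable names are hypothetical.

# Sketch: Fama-MacBeth second-pass regressions and t-test on the averaged slope (simulated data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_months, n_ports = 72, 20
beta = rng.uniform(0.5, 1.5, size=n_ports)            # hypothetical portfolio betas

lambdas = []                                           # monthly slope estimates
for t in range(n_months):
    # hypothetical monthly portfolio returns with a true premium of 0.5% per month
    ret_t = 0.002 + 0.005 * beta + rng.normal(scale=0.05, size=n_ports)
    res = sm.OLS(ret_t, sm.add_constant(beta)).fit()
    lambdas.append(res.params[1])

lambdas = np.array(lambdas)
lam_bar = lambdas.mean()
sd_lam = lambdas.std(ddof=0)
t_stat = lam_bar / (sd_lam / np.sqrt(n_months))        # as in the text, d.f. = n_months - 1
print("average slope:", round(lam_bar, 4), " t-statistic:", round(t_stat, 2))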

13.4  ASYMPTOTIC TESTS

In the previous sections, we discussed cross-sectional implications of the CAPM given that the betas of portfolios are estimated correctly; in other words, the tests are conditional on the betas.

We now consider tests based on the third testable implication of CAPM as discussed in equation (13.13):
rit − rft = ai + bi [rmt − rft] + eit ,  for i = 1,2,...,N and t = 1,2,...,T
where the null hypothesis under CAPM is H0: ai = 0 for all i. This is a hypothesis of coefficient restrictions on the intercept terms. The idea of the test is that if we could estimate the âi's for all i, then on average they should be close to zero, deviating randomly on either side of it, in order for the CAPM to hold true. We shall run such time series regressions based on (13.13), and collect the âi's for the test. It is still a cross-sectional test as we are considering the deviations of the âi's across section, i.e. across the different stocks i. However, to produce test distributions, we assume, as in the market model, that the zero mean residual errors in (13.13) for each i are jointly normally distributed at each t, and are stationary over time. Specifically, for each t, the variance-covariance matrix of the vector of residuals is:

cov(et) = E(et etᵀ) = E [ e1t²      e1te2t    ...   e1teNt
                          e2te1t    e2t²      ...   e2teNt
                          ...
                          eNte1t    eNte2t    ...   eNt²   ]  ≡ Σ , an N×N matrix.

We can write the random vector et ≡ (e1t, e2t, e3t, ..., eNt)ᵀ as jointly normally distributed N_N(0, Σ), where the subscript to N(·,·) denotes multivariate
normality of dimension N. Now consider the T×N matrix:

ε = [ e1,t+1   e2,t+1   ...   eN,t+1
      e1,t+2   e2,t+2   ...   eN,t+2
      ...
      e1,t+T   e2,t+T   ...   eN,t+T ]

where we assume (e1,t, e2,t, ..., eN,t) and (e1,t′, e2,t′, ..., eN,t′) are independent for t ≠ t′.

Let S (N×N) = εᵀε. Clearly S is positive definite since any non-zero vector x (N×1) is such that xᵀSx = xᵀεᵀεx = (εx)ᵀ(εx) > 0. S is also symmetric. The random square matrix S has a Wishart distribution. For most purposes, we consider the case T > N, and the Wishart distribution above is said to have T degrees of freedom, and a pdf f(·) characterized by parameters T, N, and matrix Σ:

f(s) = |s|^((T−N−1)/2) exp(−½ trace(Σ⁻¹ s)) / [ 2^(NT/2) |Σ|^(T/2) Γ_N(T/2) ]

where s is a realized value of S, and Γ_N(T/2) = π^(N(N−1)/4) ∏ from j=1 to N of Γ(T/2 + (1−j)/2) is a multivariate gamma function.

Notationally, we write S (N×N) ~ W_N(Σ, T). There are some useful results about the Wishart distribution.
If x (N×1) is a vector (0,...,0,1,0,...,0)ᵀ with all elements zero except the kth element, which is one, then xᵀSx = Σ from t=1 to T of e_kt² ~ σk² χ²(T), a chi-square distribution with T degrees of freedom multiplied by σk², where Σ (N×N) ≡ [σjk] and σkk ≡ σk².

Note that in (13.13), the residual errors are normally distributed with var(eit) = σi² for all t, and cov(eit, ejt) = σij for all t. The random matrix S (N×N) has ij-th element Σ from t=1 to T of eit ejt.

If x (N×1) is a nonzero constant vector, then xᵀSx ~ (xᵀΣx) χ²(T), a chi-square distribution with T degrees of freedom multiplied by the variance of xᵀet, which is xᵀΣx. More generally, if X (N×q) is an N×q matrix of constants, then XᵀSX ~ W_q(XᵀΣX, T).
In the above statements, the vector x is constant. Suppose instead x (N×1) is a random vector that is multivariate normally distributed, x ~ N_N(μ, Σ), i.e. E(x) = μ (N×1) and cov(x) = E[(x − μ)(x − μ)ᵀ] = Σ (N×N), and that x is independent of S (N×N) which has a Wishart distribution W_N(Σ, T). Then the statistic

T_H² = T (x − μ)ᵀ S⁻¹ (x − μ)   (13.16)

has a Hotelling's T-square distribution with parameters N and T, i.e. T_H²(N, T). It can be shown that the T_H² distribution is a multiple of the familiar F-distribution:

T_H²(N, T) = [TN / (T − N + 1)] F(N, T−N+1).

As an example, for a normally distributed random vector x (N×1) ~ N_N(μ, Σ), its sample mean x̄ = (1/T) Σ from t=1 to T of xt over sample size T, where each xt is an N×1 normally distributed vector ~ N_N(μ, Σ), has distribution x̄ ~ N_N(μ, Σ/T), or √T x̄ ~ N_N(√T μ, Σ). The unbiased sampling covariance matrix of x̄ is

Û = [1/(T−1)] Σ from t=1 to T of (xt − x̄)(xt − x̄)ᵀ ~ [1/(T−1)] W_N(Σ, T−1).

Hence, (T−1) Û ~ W_N(Σ, T−1).
From (13.16), we see that

(T−1) [√T(x̄ − μ)]ᵀ [(T−1)Û]⁻¹ [√T(x̄ − μ)]

is distributed as T_H²(N, T−1). Re-arranging, we have

T (x̄ − μ)ᵀ Û⁻¹ (x̄ − μ) ~ T_H²(N, T−1) = [(T−1)N / (T−N)] F(N, T−N).   (13.17)

If we use the sampling covariance matrix that is normalized by sample size T,

Σ̂ = (1/T) Σ from t=1 to T of (xt − x̄)(xt − x̄)ᵀ = [(T−1)/T] Û ,  or  Σ̂ ~ (1/T) W_N(Σ, T−1),

then (T−1) (x̄ − μ)ᵀ Σ̂⁻¹ (x̄ − μ) ~ T_H²(N, T−1) = [(T−1)N / (T−N)] F(N, T−N).

In (13.17), as sample size T → ∞, the statistic [(T−1)N / (T−N)] F(N, T−N) converges to N F(N, ∞), which is a chi-square statistic with N degrees of freedom. The latter is the result of a Wald test. A Wald test in this case is a test that the population mean is μ, given sample evidence x̄. The estimator x̄ is a maximum likelihood estimator of μ (under the normal distributional assumption, for example), and the Wald statistic is basically of the form:

(Â − A0)ᵀ [lim as T→∞ of V̂(Â)]⁻¹ (Â − A0) ~ χ²(N)

where A0 is the N×1 vector population parameter value under the null hypothesis H0, Â is the maximum likelihood estimator of A, and V̂(Â) is a consistent estimator of the covariance matrix of Â.
Now consider a time series regression on stock j,
rjt − rft = aj + bj [rmt − rft] + ejt ,  ejt ~ N(0, σj²),  cov(eit, ejt) = σij,
and cov(eit, ejt′) = 0 for any i, j and t ≠ t′, using the sample t = 1,2,...,T. From the two-variable regression theory seen in chapter 3, we know the sampling distributions of the OLS estimators âj and b̂j.

For notational simplicity, let Yjt = rjt − rft and Xt = rmt − rft. Then

âj − aj = Σ from t=1 to T of vt ejt ,  where  vt = 1/T − xt X̄ / (Σ from t=1 to T of xt²),  xt = Xt − X̄,  and X̄ = (1/T) Σ from t=1 to T of Xt,

for j = 1,2,...,N assets.

Now,

Σ from t=1 to T of vt² = 1/T + X̄²/(Σ from t=1 to T of xt²) = (1/T)[1 + μ̂m²/σ̂m²]

where we use the notations μ̂m = (1/T) Σ from t=1 to T of (rmt − rft) and σ̂m² = (1/T) Σ from t=1 to T of (rmt − rft − μ̂m)².

The variance-covariance matrix of (â1, â2, â3, ..., âN)ᵀ is

var[(â1, â2, ..., âN)ᵀ] = (1/T)[1 + μ̂m²/σ̂m²] E(et etᵀ) = (1/T)[1 + μ̂m²/σ̂m²] Σ , an N×N matrix.

Under the null hypothesis of the CAPM, the restrictions H0: a1 = a2 = ... = aN = 0 apply. Then Â ≡ (â1, â2, â3, ..., âN)ᵀ is a random vector that is multivariate normally distributed:

Â ~ N_N( 0 , (1/T)[1 + μ̂m²/σ̂m²] Σ ).

Or, {(1/T)[1 + μ̂m²/σ̂m²]}^(−1/2) Â ~ N_N(0, Σ).

The covariance matrix Σ of the random vector et ≡ (e1t, e2t, e3t, ..., eNt)ᵀ is estimated as follows. Let êjt = Yjt − âj − b̂j Xt, and êt ≡ (ê1t, ê2t, ..., êNt)ᵀ. When T is large,

Û = [1/(T−2)] Σ from t=1 to T of êt êtᵀ ~ [1/(T−2)] W_N(Σ, T−2).

From (13.16), we see that

(T−2) {[(1/T)(1 + μ̂m²/σ̂m²)]^(−1/2) Â}ᵀ [(T−2)Û]⁻¹ {[(1/T)(1 + μ̂m²/σ̂m²)]^(−1/2) Â}

is distributed as T_H²(N, T−2). Re-arranging, we have

T [1 + μ̂m²/σ̂m²]⁻¹ Âᵀ Û⁻¹ Â ~ T_H²(N, T−2) = [(T−2)N / (T−N−1)] F(N, T−N−1).   (13.18)
Two learning points to note here are: (1) equation (13.18) is tantamount to finding the statistic

T [1 + μ̂m²/σ̂m²]⁻¹ Âᵀ Û⁻¹ Â ,

and (2) under joint normality of returns, the OLS estimators âj's are also maximum likelihood estimators, hence (13.18) is a Wald test when T → ∞. In (13.18),

lim as T→∞ of T [1 + μ̂m²/σ̂m²]⁻¹ Âᵀ Û⁻¹ Â ~ N F(N, ∞) = χ²(N).

We can apply the test statistic in (13.18) to test the CAPM and reject if the test statistic is too large, i.e. if the deviations of the âi's across section from zero are too large.60 (A numerical sketch of this statistic is given below.)
60

See also Gibbons, M., S.A. Ross, and Jay Shanken, (1989), A Test of the
Efficiency of a Given Portfolio, Econometrica, 57, 1121-1152.
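Below is a minimal numerical sketch of the joint test of zero intercepts: it runs the N time-series regressions in (13.13) on simulated data, forms the statistic in (13.18), and compares it with the F and asymptotic chi-square critical values. All inputs are simulated for illustration only.

# Sketch: joint test that all time-series alphas are zero, as in (13.18) (simulated data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
T, N = 120, 5
rm_ex = 0.005 + 0.04 * rng.normal(size=T)             # simulated market excess returns
betas_true = rng.uniform(0.5, 1.5, size=N)
E = rng.normal(scale=0.03, size=(T, N))                # idiosyncratic errors
R_ex = rm_ex[:, None] * betas_true + E                 # excess returns with zero true alphas

X = np.column_stack([np.ones(T), rm_ex])
coef, *_ = np.linalg.lstsq(X, R_ex, rcond=None)        # OLS for all N stocks at once
a_hat = coef[0]                                        # vector of estimated intercepts
resid = R_ex - X @ coef
U_hat = resid.T @ resid / (T - 2)                      # residual covariance estimate

mu_m, sig2_m = rm_ex.mean(), rm_ex.var()
stat = T / (1 + mu_m**2 / sig2_m) * a_hat @ np.linalg.inv(U_hat) @ a_hat
f_stat = stat * (T - N - 1) / ((T - 2) * N)            # F(N, T-N-1) form of (13.18)
print("F statistic:", round(f_stat, 3),
      " p-value:", round(stats.f.sf(f_stat, N, T - N - 1), 3))
print("asymptotic chi-square p-value:", round(stats.chi2.sf(stat, N), 3))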

13.5  IDIOSYNCRATIC RISK

The cross-sectional regression tests of the CAPM have produced mixed results, with some clearly rejecting the model while others could not. There is much discussion centered around whether small sample bias could bias the results toward more rejections than warranted. There are also objections that those tests that support the Sharpe-Lintner CAPM did not impose enough restrictions implied by the model, and thus could not be said to have produced convincing evidence.

Yet another test of the CAPM is whether idiosyncratic risk (or unsystematic risk) is in fact priced by the market (as observed in market traded prices). This type of test is somewhat similar to the tests under the first major implication of the CAPM, that only the betas explain cross-sectional variation in expected stock returns. However, in substance it challenges the basic CAPM premise that the market diversifies away all unsystematic risk (that a representative investor holds a large diversified portfolio) and so would only be compensated for bearing systematic risk related to the market. The trend of development is to accept that there may be more than one systematic factor other than the market index factor, but a large part of mainstream finance academia would still debate whether unsystematic risk should be priced.

Regardless, there is a line of literature showing that idiosyncratic risk is indeed priced. This implies that the market is under-diversified, and thus (on average) each under-diversified market investor would demand compensation for stocks with high expected idiosyncratic risk.61

Fu (2009)62 showed that expected idiosyncratic risk indeed explained the cross-section of stock expected returns in the period July 1963 to December 2006. U.S. stocks on the NYSE, AMEX, and NASDAQ are covered in the study. Monthly idiosyncratic risk for each stock is basically estimated as the sample standard deviation of daily stock excess return residual errors (after fitting a multi-factor model to explain most of the time series variations) multiplied by the square root of the number of trading days in the month. The monthly idiosyncratic risks were found to be highly (positively) persistent. An exponential GARCH model was used to model the dynamics of the idiosyncratic risk and enabled expected idiosyncratic risk (as distinct from unexpected idiosyncratic risk, or the total realized idiosyncratic risk) to be estimated. Using a Fama-MacBeth style approach, monthly cross-sectional regressions of stock returns on various factors such as betas, sizes, values, expected idiosyncratic risk, and so on were performed. The estimated coefficients or slopes on the expected idiosyncratic risk were collected and their time series properties were evaluated. We show below an excerpt of the results on the estimated coefficient on expected idiosyncratic volatility under 3 different regression specifications involving slightly different explanatory variables as controls.

61 See Levy, Haim, (1978), Equilibrium in an imperfect market: A constraint on the number of securities in the portfolio, American Economic Review, 643-658, and more recently Ang, A., R.J. Hodrick, Y. Xing, and X. Zhang, (2006), The cross section of volatility and expected returns, Journal of Finance 61, 259-299.
62 Fangjian Fu, (2009), Idiosyncratic Risk and the Cross-Section of Expected Stock Returns, Journal of Financial Economics 91, 24-37.
Specification   Averaged Coefficient of Expected Idiosyncratic Volatility   t-statistic   Averaged R²
1               0.11                                                         9.05          3.02%
2               0.13                                                        11.41          4.98%
3               0.15                                                        13.65          6.89%

It is seen that expected idiosyncratic volatility or risk during the sampling


period has a significant positive relation to the expected returns.
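As a small illustration of the first step of such studies, the sketch below estimates one stock's monthly idiosyncratic volatility as the standard deviation of daily residuals from a single-factor regression, scaled by the square root of the number of trading days; the daily data are simulated and the factor model is deliberately simplified relative to Fu (2009).

# Sketch: monthly idiosyncratic volatility from daily residuals (simulated data, one-factor model)
import numpy as np

rng = np.random.default_rng(5)
n_days = 21                                            # trading days in a hypothetical month
mkt_ex = rng.normal(scale=0.01, size=n_days)           # daily market excess returns
stock_ex = 1.2 * mkt_ex + rng.normal(scale=0.015, size=n_days)

X = np.column_stack([np.ones(n_days), mkt_ex])
coef, *_ = np.linalg.lstsq(X, stock_ex, rcond=None)    # daily one-factor regression
resid = stock_ex - X @ coef

idio_vol_month = resid.std(ddof=1) * np.sqrt(n_days)   # monthly idiosyncratic volatility
print("estimated monthly idiosyncratic volatility:", round(idio_vol_month, 4))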

13.6  PROBLEM SET

13.1  What is the expected return on the portfolio q in equation (13.7)?

13.2  Use equation (13.7) to show that w*ᵀE(R) = k. Is it correct to conclude that since w*ᵀ = pᵀ + qᵀk, then k = pᵀE(R) + qᵀE(R) k, and hence k = pᵀE(R)/[1 − qᵀE(R)]?

13.3  If the CAPM is correct, and a researcher runs the following regression:
R̄j = a + λ β̂j + γ β̂j² + ej ,  j = 1,2,3,...,N,
comment on whether the estimate of γ (the coefficient on the squared beta) would be significant.

13.4  Suppose the market proxy used is rmt, and three other factors x1t, x2t, and x3t are used in a cross-sectional regression CAPM test of stock returns rjt's, for j = 1,2,...,N:
rjt = a0 + a1 rmt + a2 x1t + a3 x2t + a4 x3t + ejt
where ejt is the residual error. Suppose rmt, x1t, x2t, x3t have pairwise zero correlations. Suppose the OLS estimates â0, â1, â2, â3 are significantly different from zero, but â4 is both close to zero in value and also not significantly different from zero. Under the above scenario, given Roll's critique, is it still possible that CAPM is in fact true?
13.5  In running the cross-sectional regression of returns
rjt = a0 + a1 rmt + a2 x1t + a3 x2t + a4 x3t + ejt ,
where ejt is the residual error, for j = 1,2,...,N, suppose all the xjt's are highly correlated with each other and with rmt. The estimated coefficient â1 is not significantly different from zero, but at least one of â2, â3, or â4 is significantly different from zero. Do the regression results confirm that CAPM is not valid?

13.6  In the regression Yjt = aj + bj Xt + ejt, for j = 1,2,...,N and t = 1,2,...,T, there are N assets and T time periods, and Xt is a predetermined market factor that is common to all stocks. For each time period t, the cross section of residual errors, or the vector et ≡ (e1t, e2t, ..., eNt)ᵀ, has covariance matrix Σ (N×N). Let Yt ≡ (Y1t, Y2t, ..., YNt)ᵀ. The joint likelihood function of Yt|Xt, or equivalently the joint likelihood function of et, is
f(et) = (2π)^(−N/2) |Σ|^(−1/2) exp(−½ etᵀ Σ⁻¹ et).
Suppose et is i.i.d. across time t. Find the log-likelihood function log f(Y1, Y2, ..., YT | X1, X2, ..., XT). Maximize this function with respect to the parameters aj's and bj's and hence derive the maximum likelihood estimators. Are they the same as the OLS estimators?
FURTHER RECOMMENDED READINGS
[1] Gourieroux, C., and A. Monfort, (1989), Statistics and Econometric Models, Volume 2, Cambridge University Press.
[2] Johnson, N.L., and Samuel Kotz, (1972), Distributions in Statistics: Continuous Multivariate Distributions, John Wiley & Sons.
[3] Judge, G., R.C. Hill, W.E. Griffiths, H. Lütkepohl, and T.C. Lee, (1988), Introduction to the Theory and Practice of Econometrics, John Wiley & Sons.

Chapter 14
MORE MULTIPLE LINEAR REGRESSIONS
APPLICATION: MULTI-FACTOR ASSET PRICING

Key Points of Learning
Arbitrage Pricing Theory, Intertemporal Capital Asset Pricing Model, Risk factors, Cross-sectional regression, Model selection, Multi-collinearity, Fama-French three-factor model, Heteroskedasticity, Test of heteroskedasticity, White heteroskedasticity-consistent adjustment

In this chapter we discuss a very important topic, arbitrage pricing theory, and its connection to multi-factor asset pricing models. The latter are at the heart of many quantitative investment strategies aimed at portfolio selection and rebalancing to capture positive alphas and hence above-market returns. White's Heteroskedasticity-Consistent Covariance Matrix Estimator adjustment is also discussed.
14.1  BEYOND ONE FACTOR

The CAPM is a single factor asset pricing model, with the market as the factor. The implication is that systematic risk connected to covariation with market returns is the only risk that is priced. However, there is no reason why the universe of assets in the economy could not have their prices and required returns depend on more than one economy-wide systematic factor.

There have been two approaches to the issue of finding more factors in asset pricing. One approach, by Chen, Roll, and Ross (1986)63 and others, is to specify macroeconomic and financial market variables that have economically valid intuitions to explain co-movements with stock returns in a systematic fashion. For example, industrial production in an economy could increase and correlate with higher stock prices, especially for firms that have business exposures to industrial activities, and this is in addition to the general stock market movement.

Another approach is to look for factors to which certain firm characteristics, e.g. firm size, firm growth potential, the firm's industry sector, etc., could potentially be sensitive.
63

Chen, NF, R Roll, and S A Ross, (1986), Economic forces and the stock market,
Journal of Business, 59, 383-403.

As in the CAPM theory, finding factors is not enough. It is systematic factors that we want. Unsystematic factors are not considered to be important when an investor holds a large portfolio, because then these unsystematic noises cancel out one another, and there is no net impact on the portfolio's expected return. But for systematic factors that affect all, or mostly all, of the economy's stocks, notwithstanding different loadings, these risks cannot be diversified away in a large portfolio. Sneezes in the factor will come to haunt the portfolio performance, be it good or bad. This is the risk business.

As stock market and portfolio performance research continues, it is interesting to note that empirical data research has oftentimes come up with evidence of new systematic factors that seem valuable to consider. Over time some proved to be spurious results, some were due to data-snooping64, some were over-shadowed by new factors that seem to subsume the old ones, and some disappeared with new and more recent data.

In what follows, we describe the framework for thinking about systematic factors that require risk-adjusted return compensation, discussing the Arbitrage Pricing Theory, and then moving on to empirical spadework.
14.2  ARBITRAGE PRICING THEORY

Starting with the CAPM, Ross's (1976) Arbitrage Pricing Theory65 (APT) and Merton's (1973) Intertemporal Capital Asset Pricing Model66 (ICAPM) set the stage for decades of excitement and research into the pricing of assets. Essentially, if researchers can know what at a certain point in time is the correct model to price an asset, then if the market prices the asset too low, it is opportune to long the asset with the view that it will increase in price soon enough, and vice-versa. The equilibrium or correct price of an asset is of course related to the equilibrium (ex-ante) expected return, since future expected payoffs discounted by this expected return give the price. The variation of returns in the cross-section of firms in the market is important enough to be well researched, as seen in the previous chapter.

64 This is similar in idea to over-fitting a regression with too many explanatory variables to get a high R², which does not promise better forecasts and sometimes works adversely in forecasting. Data snooping is more about trying models and specifications, including searching over constructions of data variables, to try to explain cross-sectional return variations. Heuristically, if a true null of no relationship is rejected 5% of the time at the 5% significance level, then by searching hard enough within a dataset we may well find an apparent relationship purely by chance.
65 Ross, A. Stephen, (1976), The Arbitrage Theory of Capital Asset Pricing, Journal of Economic Theory, Vol 13, 341-360.
66 Merton, R.C., (1973), An Intertemporal Capital Asset Pricing Model, Econometrica, 41, 867-887.
Ross's APT has come to be understood as essentially a statistical model of prices. In a very large economy with no friction and many assets, a no-arbitrage argument (without the need to specify investors' risk-return preferences) gives rise to equilibrium expected returns. These returns are related to an unknown number of factors in the economy that exogenously affect the returns in a statistical way. Merton's ICAPM is an intertemporal equilibrium model where investors make optimal consumption versus investment decisions constrained by their preferences and resources. The risks in the economy are driven by some finite number of economic state factors. Expected returns are related to the nature of these economic factors as well as, implicitly, investor preferences. Although the characters of the APT and the ICAPM are quite different, they both have a common intention of explaining equilibrium expected returns based on some other market factors, whether observed or not.

Tests of the APT tend to go along statistical methods such as the principal components method, factor analysis, etc., with a view to understanding how many factors there are in the economy. There have been various debates about whether the APT is testable, but we shall not worry about that here. In some sense, the ICAPM is more natural in suggesting a regression relationship between asset returns and observed market variables that are possibly the ones producing risks that investors must hedge. The use of observed market variables to proxy as factors in the APT, or as the risks in the ICAPM, led to multi-factor asset pricing models. There is a colossal amount of research output on such models. One of the earlier papers, effectively on multi-factor models, is by Chen, Roll and Ross (1986), who found the following four macroeconomic variables to be significant in explaining the variation in the cross-section of firms' returns:
(a) index of industrial production
(b) differences in promised yields to maturity on AAA versus Baa corporate bonds (default premium)
(c) differences in promised yields to maturity on long- and short-term government bonds (term structure of interest rates), and
(d) unanticipated inflation.

Keim and Stambaugh (1986) found67 the following three ex-ante observable variables that affected risk premia of stocks and bonds:
(a) difference between yields on long-term Baa grade and below corporate bonds and yields on short-term Treasury bills
    This proxied for default premium.
(b) loge(ratio of real S&P composite index to previous long-run S&P level)
    This may proxy for inflationary tendencies.
(c) loge(average share price of the lowest market value quintile of firms on NYSE)
    There appears to be some business cycle and size effect.

67 Keim, D.B., and R.F. Stambaugh, (1986), Predicting returns in the bond and stock markets, Journal of Financial Economics, 17, 357-390.
There have been many other similar works. All suggest that there are at most 3 to 4 significant factors or economic variables that affect variation in the cross-sectional returns. It is instructive to understand the APT so that a better understanding of why multi-factors are used in cross-sectional regressions can be obtained. Strictly speaking, however, the APT neither implies nor is implied by multi-factor asset pricing with identified observable economic variables explaining the cross-sectional variation in average returns.
14.3  ARBITRAGE PRICING THEORY

Suppose asset returns are generated by a K-factor model (K < N, where N is the total number of assets in the economy):

Ri = E(Ri) + Σ from j=1 to K of bij δj + εi ,   i = 1,2,...,N   (14.1)

where E(Ri) = Ei, the δj's are zero mean common risk factors (i.e., they affect asset i's return Ri via the bij's), the bij's are the factor loadings (or sensitivity coefficients to the factors) for asset i, and εi is the mean zero asset i's specific or unique risk, with

Cov(εi, εj) = 0   for i ≠ j
Cov(εi, δj) = 0   for every i, j.

Using matrix notations,

R (N×1) = E (N×1) + B (N×K) δ (K×1) + ε (N×1)
E(ε εᵀ) = σε² I (N×N)
E(δ εᵀ) = 0 (K×N).
An example of (14.1) is, for stock returns,

Ri = E(Ri) + big g̃ + biR R̃ + εi

where E(Ri) is the stock's unconditional expected return in the absence of any other news, and g̃ and R̃ are unexpected deviations, with zero means, in GDP and in the prime interest rate, that will affect Ri. The factor betas (or factor loadings or sensitivities) are typically big > 0 and biR < 0. When GDP is unexpectedly high with a booming economy, the firm's revenues will unexpectedly rise and give rise to a higher return Ri, hence positive big. When the prime rate or business cost unexpectedly rises, the firm's revenues will suffer unexpectedly, leading to a fall in the return Ri.
Suppose we can find a portfolio x (N×1), where the elements are weights, such that

xᵀ l = 0 ,  where l = (1, 1, ..., 1)ᵀ is an N×1 vector of ones   (14.2)
xᵀ B = 0 (1×K).   (14.3)

By (14.2), x is an arbitrage portfolio, i.e., it has zero outlay, and it has zero systematic risk via (14.3). Suppose also that x is a well-diversified portfolio, so xᵀε ≈ 0 (1×1). Hence, the portfolio return is

xᵀR = xᵀE + xᵀBδ + xᵀε ≈ xᵀE   (as xᵀε → 0 when N → ∞).

But since x is costless and riskless, then to prevent arbitrage profit, the return to x must be zero, i.e.,

xᵀE = 0.   (14.4)
Since (14.2) and (14.3) economically imply (14.4) always, then

E (N×1) = λ0 l (N×1) + B (N×K) λ (K×1)   (14.5)

where λ0 and λ (K×1) are constants.

To verify (14.5): xᵀE = λ0 xᵀl + xᵀBλ = 0 + 0 = 0, using the results in (14.2) and (14.3). Equation (14.5) can be explained by the standard linear algebra result that any vector orthogonal to l and to the columns of B is orthogonal to E, and this implies that E is a linear combination of l and the columns of B, i.e., (14.5). (A small numerical check of this is sketched below.)
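The sketch below is a small numerical check of this spanning argument under assumed values: expected returns are constructed exactly as in (14.5), and any portfolio x with zero cost and zero factor loadings is verified to earn zero expected return.

# Sketch: numerical check of the APT pricing relation (14.5) with assumed inputs
import numpy as np

rng = np.random.default_rng(6)
N, K = 6, 2
B = rng.normal(size=(N, K))                 # assumed factor loadings
lam0, lam = 0.02, np.array([0.03, 0.01])    # assumed riskfree rate and factor risk premia
l = np.ones(N)
E = lam0 * l + B @ lam                      # expected returns constructed as in (14.5)

# Build an arbitrage portfolio x orthogonal to l and the columns of B: (14.2) and (14.3)
M = np.column_stack([l, B])                 # N x (K+1)
proj = M @ np.linalg.inv(M.T @ M) @ M.T     # projection onto span{l, columns of B}
x = (np.eye(N) - proj) @ rng.normal(size=N) # residual lies in the orthogonal complement

print("x'l  =", round(x @ l, 12))           # zero cost
print("x'B  =", np.round(x @ B, 12))        # zero systematic risk
print("x'E  =", round(x @ E, 12))           # zero expected return, as (14.4) requires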
(14.5) is sometimes called the Arbitrage Pricing Model. The ex-ante conditional expected return E is related to an economy-wide constant λ0 and to the risk premia λ (K×1) for each source of the K factor risks. If we put B = 0, then clearly λ0 is the riskfree rate. Note that if the asset is more sensitive, i.e., if B increases, then the systematic risk compensation Bλ increases, and thus E also increases.

For a single asset i, (14.5) implies:

E(Ri) = rf + bi1 λ1 + bi2 λ2 + ... + biK λK .   (14.6)
Putting (14.6) side by side with the underlying process

Ri = E(Ri) + Σ from j=1 to K of bij δj + εi ,

it is seen that

Ri = rf + bi1(λ1 + δ1) + bi2(λ2 + δ2) + ... + biK(λK + δK) + εi .   (14.7)

There are some empirical problems in estimating and testing the APT. Firstly, the number of factors, K, is not known theoretically. Setting a different K will affect the estimation of the factor loadings B. Secondly, even if K is known, it is not known what the factors are. One can only make guesses about economic variables that may have an impact on Ei.
Equation (14.7) can be expressed in regression form as

R̃ (N×1) = rf l (N×1) + B (N×K) λ̃ (K×1) + U (N×1)   (14.8)

where E(R̃ (N×1)) = E (N×1),
λ̃ ≡ λ + δ̃, so that E(λ̃) = λ (K×1), a K×1 vector of risk premia,   (14.9)
E(U) = 0, and E(δ̃ Ũᵀ) (K×N) = 0.

Each equation in the system in (14.8) at time t is

Ri = rf + bi1 λ̃1 + bi2 λ̃2 + ... + biK λ̃K + εi   (14.10)

for stocks i, where i = 1, 2, ..., N. Here bij is the sensitivity of stock i to the jth risk premium factor variable λ̃j that is common to all stocks. We may call λ̃j the jth risk factor, and its mean λj the jth factor risk premium. bij is the same sensitivity as in the risk premium form of the APT equation in (14.5). Except for the special case of the single-period CAPM, where K = 1 and E(λ̃1) = E(rm − rf), the risk premia factor variables λ̃k's, also called common risk factors, are generally not observable nor known.
Do we regress

Rit − rft = bi1 λ̃1t + bi2 λ̃2t + ... + biK λ̃Kt + uit ?   (14.11)

where λ̃jt is a risk (premium) factor that can vary over time and E(λ̃jt) = λj. It would be a straightforward test of the APT, but unfortunately the risk factors λ̃jt, and the number K of them, are usually unknown. The multi-factor approaches using macroeconomic variables or using own-firm variables are attempts to guess such a structure and hope to approximate it.

In the regression specification in (14.11), which we can adopt for multi-factor models, λ̃1t can be defined as ones, i.e. allowing for a constant intercept. Further, uit may in general covary with some or all of the λ̃jt's. This is because the APT deals with a single period time frame and, strictly speaking, has little to say about inter-temporal properties of stochastic processes. However, we shall add a bit more restriction to enable nice econometric results, i.e. we assume E(uit λ̃jt) = 0, for every t, i, and j.

Provided λ̃ is a stationary random variable, OLS regression can be performed on (14.11). To re-iterate, in practice we do not know a priori what the bij's (sensitivities) are for each i, nor what the λ̃jt's are for each t. Nor do we know the regressor span K. However, if we link this framework with a multi-factor asset pricing model, then we can postulate the λ̃jt's to be some specific observed market-wide economic variables for j = 1, 2, ..., K. (Though this also implies their expected values are the λj's, the risk premia themselves.)

Suppose we can perform a time series regression on (14.11), assuming we observe the risk factors λ̃jt for each j and over time t. We can also check the effectiveness of out-of-sample forecasts against realized returns. At t, the out-of-sample forecast of Rit+1 − rft+1, conditional on given λ̃1t+1, ..., λ̃Kt+1, is

âi + b̂i1 λ̃1t+1 + ... + b̂iK λ̃Kt+1

where âi and the b̂ij's are estimated from regressing

Rit − rft = ai + bi1 λ̃1t + ... + biK λ̃Kt + uit .   (14.12)

We shall come back to this framework when we discuss the Fama and French three-factor model.
14.4  CROSS-SECTIONAL REGRESSION

The cross-section of expected returns on assets is much like the soul of stock investment; it pops up everywhere, and we shall see a more sophisticated usage here, further to what was covered in earlier chapters. After all, think about it: if we knew the expected returns of all securities in the market for the next trading period, we could do many wonderful long-short strategies over huge portfolios, so as to minimize volatility risks and to make super- or abnormal profits. Hedge funds and many investment funds tracking some indexes with enhanced alphas do this all the time, among other things.
Suppose we go back to using the APT in (14.5) and (14.6). Adding a constant intercept:

E(Ri) = rf + a + bi1 λ1 + bi2 λ2 + ... + biK λK .

This is a perfect candidate for cross-sectional regression, as in the single-factor Fama-MacBeth study of CAPM testing seen in Chapter 13. Using an estimate such as the portfolio average return as dependent variable, and estimated betas or loading sensitivities b̂ij, j = 1,2,...,K; i = 1,2,...,N, as explanatory variables, the regression is run across i as follows:

R̄i = rf + a + λ1 b̂i1 + λ2 b̂i2 + ... + λK b̂iK + ui .   (14.13)

We can also write it in the form of (14.11), in which case ui in (14.13) is

bi1 δ1 + bi2 δ2 + ... + biK δK + εi .

We can test this multi-factor asset pricing model by testing if the (14.13) multiple linear regression coefficient estimates of the factor means λ̂1, λ̂2, ..., λ̂K are significantly different from zeros. If so, there is statistical evidence for the model. (A two-pass estimation sketch follows below.)
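The sketch below illustrates the two-pass logic behind (14.13) on simulated data: a first pass of time-series regressions to estimate the loadings b̂ij, and a second pass cross-sectional regression of average returns on those loadings to estimate the factor risk premia. Factor names and sizes are hypothetical.

# Sketch: two-pass estimation of factor risk premia as in (14.13) (simulated data)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
T, N, K = 240, 25, 2
factors = rng.normal(scale=0.03, size=(T, K))          # observed zero-mean factor realizations
true_lam = np.array([0.004, 0.002])                     # hypothetical factor risk premia
B_true = rng.uniform(0.2, 1.5, size=(N, K))             # true loadings
rf = 0.001

# Asset returns generated by the factor model: E(R_i) = rf + b_i' lambda
R = rf + np.tile(B_true @ true_lam, (T, 1)) + factors @ B_true.T \
    + rng.normal(scale=0.02, size=(T, N))

# First pass: time-series regressions of excess returns on factors give loadings B_hat
X_ts = sm.add_constant(factors)
B_hat = np.vstack([sm.OLS(R[:, i] - rf, X_ts).fit().params[1:] for i in range(N)])

# Second pass: cross-sectional regression of average excess returns on estimated loadings
avg_ex_ret = R.mean(axis=0) - rf
res = sm.OLS(avg_ex_ret, sm.add_constant(B_hat)).fit()
print("estimated premia:", np.round(res.params[1:], 4), " true premia:", true_lam)
print("intercept a:", round(res.params[0], 4))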
Fama and French (1992)68 suggested (for US NYSE, AMEX, and NASDAQ exchange stocks that are recorded in the CRSP database) that for a cross-sectional regression with stock return as dependent variable, in addition to the estimated CAPM beta b̂i, the ln(ME), ln(BE/ME), ln(BA/ME), and E/P variables of stock i in year m can be used for bi1, bi2, bi3, ..., biK.

ME = market equity $ value = stock i's last price × number of stock i's shares outstanding in the market.
BE = book equity value
BA = total book asset value = BE + BL, where BL is book liability value
E/P = earnings-to-price ratio of stock i

ln(ME) represents the size of market equity. Small firms have lower ln(ME) values than larger firms. We expect smaller firms to be systematically more risky, and thus the market will require higher returns on them.
ln(BE/ME) represents the book-to-market equity value. Underpriced stocks lead to high book-to-market equity value, and vice-versa. Underpriced stocks are systematically more risky and will fetch higher required returns.
ln(BA/ME) represents relative leverage. Higher ln(BA/ME) implies a relatively higher component of debt. This increases beta, but beyond that it also increases default risk, which leads to higher expected return. A high E/P indicates an underpriced stock (despite good earnings) and will explain higher expected returns.69

68 Fama, Eugene, and K.R. French, (1992), The Cross-Section of Expected Stock Returns, The Journal of Finance, Vol 47, Issue 2, 427-465.
In the Fama and French studies on cross-sectional regressions, for each period such as a month, the cross-sectional multiple linear regression (14.13) yields coefficient estimates of the factor means λ̂1, λ̂2, ..., λ̂K and their t-statistics. Over all the periods or months, the estimated first coefficient, e.g. λ̂1, can be treated like a time series, and its t-statistic can be obtained as

λ̄1 / [sd(λ̂1)/√T] ,  where λ̄1 = (1/T) Σ from t=1 to T of λ̂1t and sd(λ̂1) = √[ (1/T) Σ from t=1 to T of (λ̂1t − λ̄1)² ] ,

to test if the estimates are significantly different from zero, assuming they are randomly distributed about zero under the null.
14.5  SINGAPORE FACTORS

The reported annual financial data of 60 major firms listed on the first board of the Singapore Exchange and featured in the Straits Times Index are used. For each firm the following data are collected from the source book Corporate Handbook Singapore, published by CEIC Holdings Ltd in connection with Thomson Financial Publishing. The data are contained in Multi Factor.xls:

Beta
Closing Share Price (S$)
Price/Earnings Close
Number of Ordinary Shares Outstanding
Gross Dividend Per Share (cent)
Debt/Equity Ratio
Total Book Assets (in thousands $)
Total Book Equity (in thousands $)

69 Jaffe, Keim, and Westerfield (1989) suggested a U-shape for average return versus E/P ratio. See their paper, Earnings yields, market values, and stock returns, Journal of Finance 44, 135-148.
Similar data for each of the years 1996, 1997, and 1998 were collected. We run a cross-sectional OLS regression similar to (14.13), replacing rf with the constant of regression a, and adding two more regressors (excess returns could also be used, though similar results would be obtained):

Ri = a + λ1 bi1 + λ2 bi2 + λ3 bi3 + λ4 bi4 + λ5 bi5 + λ6 bi6 + ui   (14.14)

where the classical conditions are assumed to be satisfied.

Stock i's annual return is formed from the previous year-end price Pt and the current year-end price Pt+1, taking ln(Pt+1/Pt) = Ri. For the beta of each stock i, instead of using the Fama-MacBeth procedure, or the procedure in Fama and French (1992) where portfolios of stocks with similar beta are first formed and used to estimate their beta from time series (then using this portfolio beta as the same beta applied to each stock in the portfolio), we use the reported beta in the source book. The regressors are constructed as follows (a sketch of this construction is given below):

ME = S$ Closing Share Price × Number of Ordinary Shares Outstanding
bi1 = stock i's beta
bi2 = stock i's ln(ME)
bi3 = stock i's ln(BE/ME) = ln(Total Book Equity / ME)
bi4 = stock i's ln(BA/ME) = ln(Total Book Assets / ME)
bi5 = stock i's P/E ratio
bi6 = stock i's debt/equity ratio
The results of the OLS cross-sectional regression in (14.14) for the above
regressors for the year 1996 are reported in Table 14.1.
Starting with a larger set of regressors, we shrink the set till adjusted R2 is
approximately maximized. At the same time, we notice that regressors that are not
significant can be removed and adjusted R2 increases as a result. In the sample
data, we find that the coefficients on ln(BA/ME), E/P, and D/E are not
significantly different from zero. Coefficients on beta are negative and mostly
insignificant, so beta is not considered. We arrive at
        E(Ri) = -1.898 + 0.048 ln(ME) - 0.117 ln(BE/ME)
as the best-fitting model, where the coefficient estimates are also significant
and make sense.
This result is similar to the US study in that beta is found to be
unimportant in explaining the cross-sectional returns. The data show that beta
is positively correlated with firm size. However, there are two major
differences. Firstly, the size effect appears to be reversed: large-capitalization
firms appear to fetch higher expected returns, unlike in the U.S. Secondly, larger
book-to-market equity produces lower returns across the cross-section. These results are
direct opposites of the US results. Why?
Table 14.1
The regressors are shown in the first row. The OLS estimates of the
coefficients are reported below, with the corresponding p-values in
parentheses. Blanks indicate that the regression does not use that
regressor. Each row (case) reports a different regression using a
different set of regressors.

case  constant          beta              ln(ME)          ln(BE/ME)         ln(BA/ME)         E/P             D/E              adj R2
1     -2.273 (0.0037)   -0.00008 (0.748)  0.060 (0.056)   -0.0949 (0.1519)  -0.0413 (0.4773)  0.239 (0.7351)  0.0230 (0.4605)  0.0715
2     -2.061 (0.0028)   -0.00011 (0.648)  0.055 (0.0672)  -0.117 (0.0232)                                     0.0172 (0.5563)  0.0965
3     -2.029 (0.0031)   -0.00009 (0.710)  0.053 (0.074)   -0.101 (0.1131)   -0.0217 (0.6691)                                   0.0938
4     -1.975 (0.0031)   -0.0001 (0.641)   0.052 (0.0784)  -0.117 (0.0225)                                                      0.1070
5     -1.978 (0.0030)                     0.051 (0.0775)  -0.0980 (0.1182)  -0.0254 (0.6088)                                   0.1077
6     -1.986 (0.0027)                     0.052 (0.0736)  -0.117 (0.0226)                                     0.0174 (0.5498)  0.1093
7     -1.898 (0.0030)                     0.048 (0.0867)  -0.117 (0.022)                                                       0.1192
8     -1.266 (0.0309)                     0.057 (0.047)                                                                        0.0363
9     -1.333 (0.0302)   -0.0001 (0.687)   0.061 (0.0452)                                                                       0.0502

It could be that systematic factors differ in their impacts across countries.
Many of Singapore's largest firms are government-linked and have invested
heavily overseas, unlike smaller firms. Therefore these large firms may be
exposed to higher risks and thus fetch higher returns. It should also be noted
that the Fama and French (1992) study used monthly data for each regression
(not yearly data as used here). Also, the cross-sectional sample size based on
the US stock market is much larger, so their sampling error will be
smaller. Such are the caveats of working with small sample sizes.

It is instructive to note that the variable log(BA/ME) [which by itself is
not significant] is highly correlated with several others, especially
log(BE/ME). This introduces a multicollinearity problem into the regression.
In Table 14.1, the case 5 and case 6 regressions show that when both
log(BE/ME) and log(BA/ME) are used together, the standard error (hence p-value)
of the coefficient estimator of log(BE/ME) increases. Without
log(BA/ME), the coefficient of log(BE/ME) is significantly different from zero at
a p-value of 0.02. When log(BA/ME) is also present, multicollinearity
introduces more sampling noise, and the coefficient of log(BE/ME) is then
different from zero only at a p-value of 0.12.
This suggests dropping log(BA/ME) from the set of regressors since its
coefficient does not appear to be significantly different from zero. Further, we run the
optimal (maximal adjusted R2) model (case 7 of Table 14.1),
        Ri = c0 + c1 log(ME) + c2 log(BE/ME) + ui ,   i = 1, 2, 3, …, 60,
and then perform a White's test of (unknown) heteroskedasticity.
Heteroskedasticity is typically a problem in a cross-sectional regression since
the stocks in the sample have heterogeneous variances, and may also have return
correlations. The results are shown in Table 14.2 below.
Table 14.2
White's Heteroskedasticity Test

Thus, the null of no heteroskedasticity is rejected at the 5% significance level
since the p-value is 0.042445 for the F5,54 test. The 5 degrees of freedom
come from the restrictions to zero in the regression of ût² on logme, logme^2,
logme*logbeme, logbeme, and logbeme^2. We also perform OLS with
White's HCCME (Heteroskedasticity Consistent Covariance Matrix
Estimator). The results are reported in Table 14.3. It is seen that the results do
not change much in this case from the OLS.
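For readers replicating this step outside Eviews, a minimal Python sketch of the same two computations (White's test and OLS with White's HCCME) is given below; the file name and column names are assumptions made only for illustration.

```python
# Sketch: White's heteroskedasticity test and OLS with White's robust (HC) covariance.
# Assumes a hypothetical file "multifactor.csv" with columns RET, LOGME, LOGBEME.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

df = pd.read_csv("multifactor.csv")
X = sm.add_constant(df[["LOGME", "LOGBEME"]])
ols = sm.OLS(df["RET"], X).fit()

# White's test regresses squared residuals on the regressors, their squares and cross-products.
lm_stat, lm_pval, f_stat, f_pval = het_white(ols.resid, X)
print(f"White test F = {f_stat:.3f}, p-value = {f_pval:.4f}")

# OLS with White's HCCME: same point estimates, heteroskedasticity-consistent standard errors.
robust = sm.OLS(df["RET"], X).fit(cov_type="HC0")
print(robust.summary())
```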
Table 14.3
OLS Regression with White's HCCME

14.6

FAMA-FRENCH THREE-FACTOR MODEL

Earlier we saw how the APT provides a nice linkage to cross-sectional
expected returns. Returning to the other specification that is consistent with the APT in
(14.7), (14.8), (14.9), (14.10), (14.11), and (14.12), if some common risk
factors δ1t, δ2t, δ3t, …, δKt are appropriately guessed and their time series
data, t = 1, 2, …, T, are obtained, then a time series regression can be performed:

        Rit − rft = αi + bi1 δ1t + … + biK δKt + uit .        (14.15)

In contrast to the cross-sectional regression, here the bij's are estimated for
each asset i.
One of several very influential papers by Fama and French (1993)70
suggests that the market factor and two other firm attributes, viz. capitalization
size and book-to-market equity ratio, are three variables that each correspond
to a common risk factor affecting the cross-section of stocks. Stocks with a high
book-to-market equity ratio are called value stocks, and those with a low
book-to-market equity ratio are called growth stocks. Of course, there are also stocks
with persistently high book-to-market equity ratios that are about to go belly up
or default. Together with the earlier 1992 study, Fama and French broke
completely new and fascinating ground in the world of investment finance by
pointing out presumably better explanations for the cross-sectional expected
returns of stocks than what the single-factor CAPM does. The new proxies of
systematic factors they suggested led to voluminous research that followed.
From the Fama and French (1992) U.S. study, the basic message is that
small-size stocks and high book-to-market equity ratio (value) stocks tend to provide
higher expected returns.

70 Fama, Eugene, and K. R. French, (1993), "Common risk factors in the returns on
stocks and bonds", Journal of Financial Economics 33, 3-56.
14.7

TIME SERIES REGRESSION

The Fama and French (1993) approach is essentially a time series regression
employing (14.15), where the variables associated with the common random
risk factors {δ1t, δ2t, δ3t, …, δKt}, t = 1, 2, …, T, are the explanatory variables whose
expectations are the risk premia λ1, λ2, …, λK, and the coefficients to be
estimated are the sensitivities of stock or portfolio i. The dependent variable is
the time series of stock or portfolio i's excess return.
The choice of {δ1t, δ2t, δ3t, …, δKt}, t = 1, 2, …, T, of course comes from the
earlier Fama and French (1992) research which had shown that their means {λ1,
λ2, …, λK} do affect the expected returns of stocks cross-sectionally.
Under the CAPM, we already identified δ1t as the excess market return, so that
E(δ1t) = λ1 is the market premium. Fama and French (1992) show that small-sized
firms produce cross-sectionally (consistently the same each period)
higher returns than large-sized firms. Thus a brilliant guess of a δ2t that
corresponds with this size factor λ2 is δ2t = difference in average return of
small-sized stocks at t and large-sized stocks at t. Small versus large portfolios
formed for this purpose produce a return difference called the SMB (small
minus big) factor variable that can be determined each period.
Likewise, if in (14.13) it is shown that high book-to-market equity ratio
firms produce cross-sectionally (consistently the same each period) higher
returns than low book-to-market equity ratio firms, then a brilliant guess of a
δ3t that corresponds with this BE/ME factor λ3 is δ3t = difference in
average return of high book-to-market equity ratio stocks at t and low
book-to-market equity ratio stocks at t. High BE/ME versus low BE/ME stock
portfolios formed for this purpose produce a return difference called the
HML (high minus low) factor variable that can be determined each period.
Fama and French (1993) found that these two variables are important risk
factors (in addition to the market factor rmt − rft that we already know through
the CAPM) in explaining every stock i's return variations over time. It is in
explaining every stock's variations that these risk factors may be
considered as systematic across the market.
Clearly, it may be interpreted that the mean over time of the random
systematic risk δ2t = difference in average return of small-sized stocks at t and
large-sized stocks at t, is λ2 ≠ 0 (> 0). That is why this becomes a non-zero
coefficient in the cross-sectional regression in (14.13). The same can be said
for the time-averaged mean of δ3t = difference in average return of high
book-to-market equity ratio stocks at t and low book-to-market equity ratio stocks at
t. This mean is also a non-zero coefficient λ3 in the cross-sectional regression
in (14.13). Unlike factors linked to macroeconomic variables, these Fama-French
factors can be constructed like funds that can be traded and used for
hedging. Thus, there is greater plausibility in their use by the market, and in their
role as systematic factors.
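A minimal Python sketch of the time series regression (14.15) with these three factors is given below; the factor and return files, and their column names, are hypothetical assumptions for illustration only.

```python
# Sketch of the Fama-French three-factor time series regression (14.15), run stock by stock.
# Assumes hypothetical files: "ff_factors.csv" with columns EXMKT (r_mt - r_ft), SMB, HML,
# and "excess_ret.csv" with one column of excess returns per stock, aligned by period.
import pandas as pd
import statsmodels.api as sm

factors = pd.read_csv("ff_factors.csv")
excess = pd.read_csv("excess_ret.csv")

X = sm.add_constant(factors[["EXMKT", "SMB", "HML"]])
for stock in excess.columns:                      # one time series regression per stock i
    res = sm.OLS(excess[stock], X).fit()
    print(stock, res.params.round(4).to_dict())   # alpha_i and loadings on market, SMB, HML
```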

14.8

PROBLEM SET

14.1

A particular stock's monthly return rate at t is r1t = 0.01 + 1.2 rmt + e1t ,
where rmt is the market portfolio return at t, and e1t is a residual noise
statistically independent of rmt. Another stock's monthly return rate at
t is r2t = 0.005 − 0.1 rmt + e2t , where e2t is a residual noise also statistically
independent of rmt. Further, suppose cov(e1t , e2t) = 0.036. (Assume all
returns are covariance-stationary.)
(i) Given that cov(r1t , r2t) > 0, find an upper bound for the variance of rmt.
(ii) Suppose r1t and r2t can also be represented by
        r1t = 0.01 + 1.2 rmt + 0.2 It + ε1t
        r2t = 0.005 − 0.1 rmt + 0.3 It + ε2t
where It is an industry factor at t that is statistically
independent of rmt , and ε1t and ε2t are i.i.d. noises. What
is the variance of It?

14.2

If in (14.15) we apply the correct risk factors and perform time
series regressions on all stocks i:

        Rit − rft = αi + bi1 δ1t + … + biK δKt + uit ,

then for each stock we collect the estimated residuals ûit and find
the time series estimate of their variance, σ̂i² = (1/T) Σt=1..T ûit². Suppose
we then perform a cross-sectional regression as in (14.13):

        R̄i − rf = a + λ1 bi1 + λ2 bi2 + … + λK biK + λK+1 σ̂i² + ui .

(i) If all stocks form part of investors' well-diversified portfolios, do you
expect λK+1 to be significantly different from zero?
(ii) If investors generally do not hold well-diversified portfolios, do
you expect λK+1 to be significantly positive? What do you call λK+1?

FURTHER RECOMMENDED READINGS

[1] Fama, Eugene, and K. R. French, (1992), "The Cross-Section of Expected
Stock Returns", The Journal of Finance, Vol 47, Issue 2, 427-465.
[2] Fama, Eugene F., and Kenneth R. French, (1993), "Common risk factors
in the returns on stocks and bonds", Journal of Financial Economics 33, 3-56.
[3] Fama, Eugene, and K. R. French, (1995), "Size and Book-to-Market
Factors in Earnings and Returns", Journal of Finance, Vol L, 131-155.
[4] Merton, R. C., (1973), "An Intertemporal Capital Asset Pricing Model",
Econometrica, 41, 867-887.
[5] Ross, Stephen A., (1976), "The Arbitrage Theory of Capital Asset
Pricing", Journal of Economic Theory, Vol 13, 341-360.
For the principal components method and factor analysis, an applied multivariate
statistics book will be a convenient source of reference. See, for example,
Richard A. Johnson and D. W. Wichern, (2002), Applied Multivariate
Statistical Analysis, Prentice-Hall.


Chapter 15
ERRORS-IN-VARIABLE
APPLICATION: EXCHANGE RATES
AND RISK PREMIUM
Key Points of Learning
Spot exchange rate, Forward exchange rate, Unbiased expectations hypothesis,
Test of restrictions, Overlapping data problem, Serial correlation, Durbin-Watson
d-statistic, Errors-in-variable, Missing variable problem, Forward risk
premium, Forward premium, Forward discount, Interest rate parity

This chapter covers the interesting topic of exchange rates. Currency trading
and speculation is one of the oldest and largest games in town. Multi-national
corporations also enter the currency market to hedge their currency exposures.
Corporate treasurers and finance controllers are familiar with transacting in
the forward market to hedge transactions denominated in other currencies. It is
important to understand the relationship between the forward and the spot
prices. The cost-of-carry model learned earlier in the context of the Nikkei stock index can
be applied in most forward contract situations such as this. We discuss and test
hypotheses about unbiased expectations and about the risk premium. The
interesting case of the overlapping data problem is shown and its pitfalls are
pointed out. A test of serial correlation using the Durbin-Watson statistic is discussed. We
introduce the topic of tests of restrictions on the coefficients.
15.1

FOREIGN EXCHANGE

Exchange rates have been highly volatile since 1973, when the US and most of
the developed countries departed from the fixed exchange rate regime. Exchange rates are
heavily studied because of the critical role they play in lubricating the world trade
machinery. They are also an important market for hedgers as well as
speculators. The currency forward market has become one of the principal
world markets.
The spot exchange rate is the rate valued at spot for current transactions,
although market practices require actual exchange of currencies only one to
two trading days later. A direct quote to a U.S. citizen is in terms of the number of
US$ per unit of the foreign currency, e.g. $0.7 per S$, or $1.35 per Euro. An
indirect quote to a U.S. citizen is in terms of the number of foreign currency units
per US$, e.g. S$1.4286 per US$ or 0.74 euro per US$. A direct quote of
US$1.02 per C$ is an indirect quote to Canadian citizens. To avoid these
confusions, we shall use the notation US$/FC to denote the number of US$
per unit of foreign currency, or FC/US$ to denote the number of units of
foreign currency per US$. Unless otherwise stated, $ refers to US$. The "/"
sign above denotes a quotient. This is sometimes confusing, as the way
quotations are read in the foreign exchange market is to put the base currency
first followed by the variable currency, written in short form as
base currency/variable currency. We shall keep to the quotient notation.
A forward x-month rate is a rate valued currently for a transaction x months
from the present. Thus a forward rate is a contractual rate locking in a rate for the
future. Common forward contracts have x equal to 1-month, 3-months, 6-months,
and so on. The near-term (shorter maturity x) contracts tend to be
more liquid and heavily traded. For example, suppose a forward 3-month S$/$ rate is
1.45. This means that in 3 months' time, a buyer of $ forward will pay S$1.45
for each US$. The seller will collect S$1.45 for each US$ delivered.
15.2

UNBIASED EXPECTATIONS HYPOTHESIS

The unbiased expectations hypothesis (UEH) states that today's forward x-month
rate is an unbiased predictor or forecast or expectation of the future x-month spot rate:
        Ft,t+x = Et(St+x) .        (15.1)
In the UEH in (15.1), Ft,t+x, in $/Y, is the forward rate of currency Y contracted
at time t for transaction at t+x > t. Thus the forward rate is an x-month forward
rate. St+x, in $/Y, is the spot rate of currency Y at time t+x when the forward
contract matures. From (15.1), Et(St+x − Ft,t+x) = 0, so the UEH implies that forward
exchange speculation, i.e. buying a cheaper currency Y in $ forward and selling at
a dearer expected spot price of Y in $ later, or vice versa, will not produce any
expected profit. Thus, intuitively, it suggests that the currency market is speculatively
efficient.
Bilson (1981)71 suggests that the hypothesis that forward prices are the
best unbiased forecast of future spot prices is not necessarily equivalent to the
efficient markets hypothesis (EMH). However, strictly speaking, the rational
expectations hypothesis is required in (15.1) when conditioning on market
information is done. In fact, if the time interval of the forecast is long, and market
investors are risk-averse and not risk-neutral, then a positive cost-of-carry
model would predict St = Ft,t+x / (1+rt,t+x) = Et(St+x) / (1+k), where rt,t+x is the
riskfree interest cost of carry over [t, t+x], and k > rt,t+x is the risk-adjusted
cost. Hence, Ft,t+x = Et(St+x) (1+rt,t+x)/(1+k) = Et(St+x) + π, where π = Et(St+x)
(rt,t+x − k)/(1+k) < 0 is a risk premium. Hence, this is an equilibrium
model in which market expectations are rational but in which there is a forward
bias or negative risk premium. If the forecast interval is short, and the interest cost
rt,t+x and k are small, then the UEH may hold true in an approximate sense.
Of course, if investors are risk-neutral, i.e. k = rt,t+x, then the UEH holds exactly.
Despite the existence of many models of currency risk bias, the UEH has been
widely tested and investigated. Especially over short horizons and over
periods when market uncertainties are not rampant, the UEH may be fairly
accurate.

71 John F. O. Bilson, (1981), "The Speculative Efficiency Hypothesis", Journal of
Business, Vol 54, No 3.
15.3

TEST OF RESTRICTIONS

The UEH in (15.1) may be tested for its restrictions on the observed time series of
spot rates St+x and forward rates Ft,t+x. Suppose we employ monthly (end of
month) data, and we match the spot rate at t+x, St+x, with the forward rate applicable
at t+x but contracted earlier at t, i.e. Ft,t+x. At any one point in time, there
could be various forward rate matches for a single spot St+x, since x could be
1-month, 3-months, 6-months, etc. Suppose
        St+x = E(St+x | Ωt) + et+x ,        (15.2)
where Ωt is the market information set used at t to predict St+x at t+x. The
prediction or forecast error et+x can be characterized as follows. Taking
conditional expectations of (15.2) gives
        E(St+x | Ωt) = E[ E(St+x | Ωt) | Ωt ] + E(et+x | Ωt) .
Using the law of iterated expectations, the right-hand side is E(St+x | Ωt) +
E(et+x | Ωt), so
        E(et+x | Ωt) = 0.        (15.3)
But et+x = St+x − Ft,t+x, using (15.1) and (15.2). Therefore the UEH implies E(St+x −
Ft,t+x | Ωt) = 0, which was discussed earlier. Also, (15.3) implies E(et+x) = 0.
Since Ft,t+x is known at t, Ft,t+x ∈ Ωt. So, by (15.3), E(et+x | Ωt) = 0 implies E(Ft,t+x
et+x | Ωt) = 0, and hence cov(Ft,t+x , et+x | Ωt) = 0. The UEH implies (15.2), which can be
written as
        St+x = a + b Ft,t+x + et+x        (15.4)
where a = 0, b = 1, the mean of et+x is zero, and et+x has zero correlation with Ft,t+x.
Thus, a testable hypothesis of the UEH is to test the restrictions H0: a = 0 and b
= 1 jointly in the linear regression model in (15.4).
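A minimal Python sketch of this joint restriction test, assuming a hypothetical data file with already-aligned spot and forward series, is as follows.

```python
# Sketch: OLS of S_{t+x} on F_{t,t+x} and a joint Wald/F test of a = 0, b = 1.
# Assumes a hypothetical file "fx.csv" with columns SPOT (S_{t+x}) and FWD (F_{t,t+x}).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("fx.csv")
X = sm.add_constant(df["FWD"])
res = sm.OLS(df["SPOT"], X).fit()
print(res.params)                            # estimates of a and b

joint = res.f_test("const = 0, FWD = 1")     # joint H0: a = 0 and b = 1
print("F =", float(joint.fvalue), "p-value =", float(joint.pvalue))
```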
15.4

OVERLAPPING DATA PROBLEM

In testing (15.4), suppose we use x = 3-months, i.e. forward 3-month rates are

used. The system of linear equations in the linear regression model involving
the time series variables St+x and Ft,t+x is as follows, starting with the forward rate
F0,3 as the earliest observed data point.
        ST = c0 + c1FT-3,T + eT
        ...
        St+3 = c0 + c1Ft,t+3 + et+3
        ...
        S4 = c0 + c1F1,4 + e4
        S3 = c0 + c1F0,3 + e3
This system of equations leads to the following characterization of the
disturbance errors.
        eT = ST - c0 - c1FT-3,T
        ...
        e5 = S5 - c0 - c1F2,5
        e4 = S4 - c0 - c1F1,4
        e3 = S3 - c0 - c1F0,3
This can be depicted as follows.
Figure 15.1
Overlapping of Forecast Errors
The forecast error e5 spans the period [2,5]. Since this forecast error
overlaps with the earlier forecast error e4 over [1,4], there may be
correlation in the forecast errors. (Horizontal axis: time t in months.)
Some events happening between t=3 and t=4, for example, will affect S5 and
also S4. Hence they affect e5 and e4. Thus e5 and e4 may be correlated. This
results in overlapping forecast errors. To characterize this further, we employ
an auxiliary model on the spot rates. Suppose the spot market dynamics is
St+1 = μ + St + vt+1, where vt+1 is i.i.d. and has zero mean. μ is small enough that, after
transaction costs, there is no arbitrage profit to be made. If μ = 0, then {St} is a
strong-version random walk with zero drift. Then F2,5 = E2(S5) = E2( [S5−S4]+[S4−S3]+[S3−S2]+S2 )
= E2( 3μ + v5 + v4 + v3 + S2 ) = 3μ + S2.
In general, Ft,t+x = xμ + St. Then, the forecast error is
        et+x = St+x − Ft,t+x = St+x − St − a ,
where a = xμ, a constant.
The covariance of overlapping forecast errors is, for example,
        cov(e6 , e5)
        = cov(S6−S3 , S5−S2)
        = cov([S6−S5]+[S5−S4]+[S4−S3] , [S5−S4]+[S4−S3]+[S3−S2])
        = cov(v6 + v5 + v4 , v5 + v4 + v3)
        = var(v5) + var(v4) > 0.
Indeed they are positively correlated. Therefore e5 is correlated with e4 , e4
with e3, and so on. This is called the overlapping data problem. It is a
problem because the serial correlation in the et+x's implies that OLS using (15.4)
will not satisfy all the classical conditions with regard to disturbance qualities.
In the above, it is empirically plausible that the disturbance et is AR(2) or AR(1),
so estimated GLS will need to be done instead of OLS if the serial correlation
is not close to zero. OLS on (15.4) without correcting for serial correlation
will result in estimators that are not BLUE, although they are still unbiased.
In the latter case, however, statistical inference using the OLS estimated
covariances cannot be applied.
We provide a regression of the monthly spot C$/$ rate from March 1989 to May
2002 on the monthly 3-month forward C$/$ rate from December 1988 to February
2002. We may view this as testing the UEH from the point of view of the US$
in terms of C$. In other words, we would like to validate whether the forward C$ value of a
US$ is an unbiased forecast of the future spot US$ value in terms of C$.
We show the results of such OLS regressions as follows.
Table 15.1
Regression: St+x = a + b Ft,t+x + et+x
Dependent variable is the spot rate, Sample size 159
Serial correlation in the disturbance term can be checked using the Durbin-Watson
(DW) d-statistic,

        d = Σt=2..N (ût − ût-1)² / Σt=1..N ût² .

The Durbin-Watson d-statistic is shown at the bottom left of Table 15.1. It is a
test of serial correlation when the null is zero correlation. The DW d-statistic
is approximately

        d ≈ [ 2 Σt=2..N ût² − 2 Σt=2..N ût ût-1 ] / Σt=2..N ût-1²  ≈  2 ( 1 − ρ̂ ) ,

where

        ρ̂ = Σt=2..N ût ût-1 / Σt=2..N ût-1²

is the sample first-order correlation coefficient of ût. Clearly,

        ρ̂ ≈ 1 − d/2        (15.5)

may be inferred from the DW statistic. If d > 2, then ρ̂ < 0, and ut is likely to
be negatively correlated. If d < 2, then ρ̂ > 0, and ut is likely to be positively
correlated. If d is close to 2, then ρ̂ ≈ 0, and ut is likely to be zero-correlated.
In the above regression of (15.4), the DW d-statistic is 0.819. Thus, there
appears to be a strong positive serial correlation in ut+x. This is suggested by
the overlapping data problem.
This statistic follows a Durbin-Watson or DW distribution when the null
is zero correlation.72 The DW distribution is reported in table form giving, for a
sample size N, a given number K of regressors (including the constant), and
the significance level, two numbers dL and dH where dL < dH. The null
hypothesis is H0: ρ = 0.
If DW d < 2, then if d < dL, reject H0 in favor of the alternative hypothesis
HA: ρ > 0. But if DW d > dH, then we cannot reject (thus accept) H0.
If DW d > 2, then if 4−d < dL, reject H0 in favor of the alternative hypothesis
HA: ρ < 0. But if DW 4−d > dH, then we cannot reject (thus accept) H0.
If d < 2 and dL < d < dH, or if d > 2 and dL < 4−d < dH, then we cannot
conclude whether to reject or accept H0. For N=159, K=2, at the 5% significance
level, dL=1.730, dH=1.752. Hence for d=0.819, we reject H0: ρ = 0 in favor of
HA: ρ > 0.

72 When the regression includes a lagged dependent variable, the Durbin h-statistic is
used instead of the DW d-statistic.
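A quick numerical check of the d-statistic and the implied ρ̂ in (15.5) can be done as in the following Python sketch; the residuals here are simulated placeholders rather than the actual regression residuals.

```python
# Sketch: compute the Durbin-Watson d-statistic from residuals and the implied
# first-order serial correlation rho-hat = 1 - d/2, as in (15.5).
import numpy as np

def durbin_watson_d(resid):
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Hypothetical residuals with positive autocorrelation (AR(1) with coefficient 0.6):
rng = np.random.default_rng(1)
u = np.empty(159)
u[0] = rng.normal()
for t in range(1, 159):
    u[t] = 0.6 * u[t - 1] + rng.normal()

d = durbin_watson_d(u)
print("DW d =", round(d, 3), " implied rho-hat =", round(1 - d / 2, 3))
```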
Suppose the disturbance term is AR(1), non-homoskedastic, and thus the
covariance matrix of the disturbance is not a diagonal matrix. We can apply
estimated GLS. In this case, it is basically the Cochrane-Orcutt procedure for
an AR(1) process. We transform the spot rate and forward rate data as follows.
From (15.5), if d = 0.819, then ρ̂ ≈ 1 − 0.819/2 ≈ 0.59.
        SN* = SN − ρ̂ SN-1                  FN-3,N* = FN-3,N − ρ̂ FN-4,N-1
        ...                                ...
        S4* = S4 − ρ̂ S3                    F1,4* = F1,4 − ρ̂ F0,3
        S3* = √(1 − ρ̂²) S3                 F0,3* = √(1 − ρ̂²) F0,3
We can also omit the initial points73 S3 and F0,3.


Then we perform OLS on the transformed data. We can iterate
until ρ̂ converges. Another solution to the overlapping data problem, instead of
using estimated GLS, is to employ a subset of the data that avoids the
overlap. Use the sample data over the sampling period [0, T] as
follows.
        SN = c0 + c1FN-3,N + eN
        ...
        S9 = c0 + c1F6,9 + e9
        S6 = c0 + c1F3,6 + e6
        S3 = c0 + c1F0,3 + e3

73 Suggestions of the initial point corrections were apparently made by Prais and Winsten
in Prais, S. J., and C. B. Winsten, "Trend Estimates and Serial Correlation", Unpublished
Cowles Commission Discussion Paper, Chicago, 1954.

Here, without necessarily using the auxiliary model, we can show that the
forecast errors are zero-correlated. Note that we employ the law of iterated
expectations:
        E(e6 e3) = E[ E3(e6 e3) ] = E[ e3 E3(e6) ] = E[ e3 · 0 ] = 0.
Since E(et+x) = 0, therefore cov(et+x , et) = E(et+x et) − E(et+x) E(et) = 0. Thus the
OLS regression on the non-overlapping dataset now satisfies the classical
conditions.
(a) Perform OLS again with the non-overlapping data.
(b) Check the estimated residuals. This is to ensure that serial correlation no
longer exists.
In Table 15.2, note that the DW d-statistic is now 2.17 and does not
appear to indicate the presence of positive or negative serial correlation. For
N=53, K=2, at the 5% significance level, dL=1.518, dH=1.595. Now 4−d = 1.83 >
dH, hence we cannot reject H0: ρ = 0.
Under the UEH in (15.4), we test H0: a = 0 and b = 1. Table 15.3,
employing the Wald test, shows that the null hypothesis, hence the UEH, is
not rejected.

Table 15.2
Regression of spot rate on forward rate
Sample size 53 after adjusting endpoints

Table 15.3
Wald test using the non-overlapping regression

The probability figures at the top right of Table 15.3 show that we cannot
reject H0: c0 = 0, c1 = 1 at up to the 38.71% significance level.
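A minimal Python sketch of this non-overlapping approach (keeping every third monthly observation when x = 3 months) is given below; the data file and column names are hypothetical.

```python
# Sketch: form the non-overlapping subsample and re-run the OLS, DW check and joint Wald test.
# Assumes a hypothetical file "fx.csv" with monthly rows of SPOT (S_{t+3}) and FWD (F_{t,t+3}).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("fx.csv")
nonoverlap = df.iloc[::3]                 # keep every third row so forecast windows do not overlap

X = sm.add_constant(nonoverlap["FWD"])
res = sm.OLS(nonoverlap["SPOT"], X).fit()
print("DW d =", round(durbin_watson(res.resid), 3))
print(res.f_test("const = 0, FWD = 1"))   # joint H0: c0 = 0, c1 = 1
```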
15.5

MEASUREMENT ERROR PROBLEM

According to (15.1), Ft,t+x = c0 + c1 Et(St+x), where c0 = 0 and c1 = 1. The right-hand
explanatory variable in this case is a conditional forecast which is not
observed by the econometrician.74
We shall suppose that this conditional forecast E(St+x | Ωt) is estimated by
the current spot St but with a measurement error εt at t, i.e.
        St = E(St+x | Ωt) + εt .        (15.6)
We shall assume that εt is uncorrelated with E(St+x | Ωt). This means that the
error εt is not related to Ωt. This is sometimes also called the errors-in-variables
problem. Though the measurement error εt is not correlated with E(St+x | Ωt),
it is, however, necessarily correlated with St since cov(St , εt) = var(εt) >
0. Suppose we instead run an OLS regression on

74 Here it is not appropriate to use the auxiliary model on spot rates, St+x = St + vt+x
where vt+x is i.i.d., because taking the conditional expectation with respect to Ωt yields
E(St+x | Ωt) = St + E(vt+x | Ωt) = St, which suggests the market forecast is observed. This
is not our purpose here.
        Ft,t+x = c0 + c1 St + ut .        (15.7)

Under the true model, Ft,t+x = c0 + c1 Et(St+x), so Ft,t+x = c0 + c1 [St − εt] via
(15.6). Thus, (15.7) implies that ut = −c1 εt.
Moreover, cov(St , ut) in the regression (15.7) is cov(St , −c1 εt) = −c1
cov(St , εt) = −c1 var(εt) ≠ 0. Hence, OLS regression with measurement error
in the right-hand explanatory variable St leads to contemporaneous
correlation between the regressor and the disturbance.
Thus the OLS estimators ĉ0 and ĉ1 are biased and not consistent. They
are not BLUE. To see this, write the OLS slope estimator (with x = 3 months, so the
sample runs over t = 0, 1, …, N−3) as

        ĉ1 = Σt=0..N-3 (St − S̄) Ft,t+3 / Σt=0..N-3 (St − S̄) St
           = Σt=0..N-3 (St − S̄)(c0 + c1 St + ut) / Σt=0..N-3 (St − S̄) St
           = c1 + Σt=0..N-3 (St − S̄) ut / Σt=0..N-3 (St − S̄) St .        (15.8)

Therefore,

        E(ĉ1) = c1 + E[ Σt=0..N-3 (St − S̄) ut / Σt=0..N-3 (St − S̄) St ] ≠ c1

since St and ut are not independent. Also, from (15.8),

        ĉ1 − c1 = [ (1/N) Σt=0..N-3 (St − S̄) ut ] / [ (1/N) Σt=0..N-3 (St − S̄) St ]
                = [ (1/N) Σt=0..N-3 (St − S̄) ut ] / [ (1/N) Σt=0..N-3 (St − S̄)(St − S̄) ] ,

which does not converge asymptotically to zero, hence ĉ1 does not converge to c1, since St
and ut are contemporaneously correlated. Hence ĉ1 is not consistent. Taking probability
limits,

        plim ĉ1 = c1 + plim[ (1/N) Σ (St − S̄) ut ] / plim[ (1/N) Σ (St − S̄)² ]
                = c1 + cov(St , ut) / var(St)
                = c1 − c1 var(εt) / var(St)
                = c1 [ 1 − var(εt)/var(St) ] < c1 .

Thus when the sample size is large, the estimator ĉ1 will be biased downward.
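A small simulation illustrates this attenuation effect; the numbers below are hypothetical and the sketch simply mimics the structure of (15.6) and (15.7).

```python
# Sketch: simulate the errors-in-variables (attenuation) bias derived above.
# True model: F_t = c0 + c1 * E_t(S_{t+x}); the regressor actually used is S_t = E_t(S_{t+x}) + eps_t.
import numpy as np

rng = np.random.default_rng(2)
n, c0, c1 = 5000, 0.0, 1.0
expectation = rng.normal(1.5, 0.10, n)      # unobserved conditional forecast E_t(S_{t+x})
eps = rng.normal(0.0, 0.05, n)              # measurement error eps_t
S = expectation + eps                       # observed regressor S_t
F = c0 + c1 * expectation                   # forward rate under the true model

X = np.column_stack([np.ones(n), S])
c0_hat, c1_hat = np.linalg.lstsq(X, F, rcond=None)[0]
attenuation = 1 - eps.var() / S.var()       # plim of c1_hat relative to c1
print("c1_hat =", round(c1_hat, 3), " predicted plim =", round(c1 * attenuation, 3))
```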
The OLS regression result of (15.7) with the measurement error problem
is shown in Table 15.4.
Table 15.4
Dependent variable is F3M
Sample size 159

It is interesting to note that the slope estimate is now 0.9688, biased
downward from one, whereas in the earlier regressions of (15.4), whether with
overlapping or non-overlapping data, the slope estimates were about
1.01 and 1.02.
15.6

MISSING VARIABLE PROBLEM

Other kinds of serial correlation (error autocorrelation) problems can occur
with a missing explanatory variable in the regression specification. Suppose in
the OLS regression model
        Yt = c0 + c1Xt + et ,
there is a missing variable Zt. Moreover, suppose the process for Zt is
        Zt = φZt-1 + vt ,   φ ≠ 0 .
Including this variable to produce the correct model leads to
        Yt = c0 + c1Xt + c2Zt + ut ,
where {ut} and {vt} are independent i.i.d. processes.
Therefore,
        et = c2Zt + ut .        (15.9)
Thus, the misspecified model's disturbance is
        et = c2 (φZt-1 + vt) + ut
           = c2φZt-1 + c2vt + ut .
Using (15.9), c2Zt-1 = et-1 − ut-1, so
        et = φ(et-1 − ut-1) + c2vt + ut .
Hence et is serially correlated in most cases. However, if there are no missing
explanatory variables, then we should not detect serial correlation. We show such
an example in Table 15.5.
Table 15.5
Dependent variable is the differenced spot rate

Later we will see that in most cases spot rates (or their loge) and forward rates
are unit root processes, I(1), in which case they contain stochastic trends,
and so (St+x − Ft,t+x) may itself be I(1). Then the above regression
may be spurious, i.e. the estimates may not mean anything and may not
converge to any population parameters or moments.
For the regression to make sense, which we implicitly assume here, St+x −
bFt,t+x (b a constant, which can be 1) should be I(0), i.e. stationary. Then St+x and Ft,t+x
are said to be cointegrated with cointegrating vector (1, -b). Then the OLS
regression in (15.4),
        St+x = a + bFt,t+x + et+x ,
will produce superconsistent (hence consistent) estimators. To avoid
spuriousness, we can also re-run the regression in Table 15.5 using differenced
series, i.e.
        (St+x − St)/St = a + b (Ft,t+x − St)/St + ut+x .
15.7

FORWARD RISK PREMIUM

We have concentrated on the UEH to work through some interesting
econometrics issues. There have been many other studies to establish if there
exists a risk premium in forward exchange rates. This forward risk premium,
πt,t+x, is defined as the excess of the forward rate over the expected future spot
rate, i.e.
        πt,t+x = Ft,t+x − Et(St+x).        (15.10)
Note that this forward risk premium is different from the x-period forward
premium, which is the excess of the forward rate over the current spot rate, viz. (Ft,t+x −
St)/St. If this rate is negative, then it is a forward exchange discount. The
forward premium or discount is directly related to the x-period interest rate
differential across the two related countries via the interest rate parity
theorem. The interest rate parity theorem states that if Ft,t+3 is the $ per £ 3-month
forward exchange rate, St is the present $ per £ spot exchange rate, RUS is the 3-month
US money market interest rate, and RUK is the 3-month UK money market interest
rate, then Ft,t+3/St = (1+RUS)/(1+RUK). Or, (Ft,t+3 − St)/St ≈ RUS − RUK. Thus, if
RUS < RUK, then Ft,t+3 < St, and there is a forward discount on the £ (or a forward
premium on the $).
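A short numerical sketch of this interest rate parity relation, with hypothetical rates, is as follows.

```python
# Sketch: interest rate parity F/S = (1 + R_US) / (1 + R_UK), with hypothetical 3-month
# rates quoted per 3-month period (not annualized).
S = 1.60          # spot $ per pound (hypothetical)
R_US = 0.010      # 3-month US money market rate
R_UK = 0.015      # 3-month UK money market rate

F = S * (1 + R_US) / (1 + R_UK)
forward_premium = (F - S) / S
print("forward rate =", round(F, 4))                        # below spot: forward discount on the pound
print("forward premium =", round(forward_premium, 4),
      "approx.", round(R_US - R_UK, 4))                     # approximately R_US - R_UK
```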

15.8

PROBLEM SET

15.1

In a classic study by Chow (1957), the demand for automobile stock per
capita in the United States, Dt, was found to be related to price Pt and
disposable income Ydt as follows:
        Dt = 1.1666 - 0.039544 Pt + 0.020827 Ydt        R2 = 0.850
If Friedman's real expected income per capita data, Yet, were used
instead, the regression was estimated as:
        Dt = -0.7247 - 0.048802 Pt + 0.025487 Yet        R2 = 0.895
(i) Suppose that disposable income Ydt = Yet + error, i.e. disposable
income is expected income plus a disturbance. How would this
explain the relative magnitudes of the estimated coefficients of Ydt
and Yet?
(ii) In forecasting next year's demand for automobile stock per
capita, suppose price will be fixed at the current year's level of
151.1, and real expected income per capita is taken to be 828.8.
What will be the forecast?
(iii) Will the forecast error be affected by the value taken as the real
expected income per capita next year, given that this is a very
accurate assessment?
(iv) Suppose in the above regression, Friedman's real expected
income per capita variable was not observable, but instead it was
assumed to evolve according to adaptive expectations, i.e.
Yet = Yet-1 + θ(Ydt - Yet-1). Would the estimated coefficients using
OLS still be BLUE? Explain.
15.2

In the regression of the future horizon return on the current dividend-price
ratio:
        rt,t+k = a + b(dt − pt) + ut,t+k ,
suppose you wish to avoid using overlapping data. How would you
write out the regression equations?

15.3

The following estimation output table shows a linear regression of the
number of oil rigs in the USA against the $ price per barrel of crude oil.
The dependent variable, the number of oil rigs, is called COUNT. The $
price per barrel of oil is called PRICE. A constant is added to the
regression. (Assume there is no simultaneous equation bias and that the
regression variables are correctly specified.)

Dependent Variable: COUNT
Method: Least Squares
Sample: 1 751
Included observations: 751

Variable      Coefficient    Std. Error    t-Statistic    Prob.
C             -678.2789      59.66590      -11.36795      0.0000
PRICE         84.29917       2.701127      31.20888       0.0000

R-squared            ????          Mean dependent var      1131.280
Adjusted R-squared   ????          S.D. dependent var      584.6037
S.E. of regression   385.7005      Akaike info criterion   14.75066
Sum squared resid    111424892     Schwarz criterion       14.76297
Log likelihood       -5536.873     F-statistic             ????
Durbin-Watson stat   0.041786      Prob(F-statistic)       0.000000

(i) Are the explanatory variables significant in explaining COUNT?
(ii) What would be a likely source of problems in the estimates?
(iii) Explain how you would attempt to address the problem in (ii), and
what the limitations, if any, may be.
(iv) The F-statistic, R2, and adjusted R2 statistics were left out. Fill in
the correct numbers based on consistency with the rest of the
table (all the information required is already within the table).
15.4

Why in practice do we perform the regression
        St − St-k = c0 + c1 (Ft-k,t − St-k) + et ,   k > 0,
where E[ et | Ft-k,t − St-k ] = 0, with testable restrictions
H0: c0 = 0, and H0: c1 = 1?
(Or sometimes the joint null H0: c0 = 0 and c1 = 1.)
FURTHER RECOMMENDED READINGS
[1] David Backus, S. Foresi, and C. I. Telmer, (1995), "Interpreting the
Forward Premium Anomaly", The Canadian Journal of Economics, Vol
28, 108-119.
[2] Ravi Bansal, (1997), "An Exploration of the Forward Premium Puzzle in
Currency Markets", The Review of Financial Studies, 369-403.
[3] John F. O. Bilson, (1981), "The Speculative Efficiency Hypothesis",
Journal of Business, Vol 54, No 3.
[4] Jerome Stein, (1980), "The Dynamics of Spot and Forward Prices in an
Efficient Foreign Exchange Market with Rational Expectations", The
American Economic Review, Vol. 70, Issue 4, 565-583.


Chapter 16
UNIT ROOT PROCESSES
APPLICATION: PURCHASING POWER PARITY
Key Points of Learning
Non-stationary process, Deterministic trend, Trend stationary process,
Stochastic trend, Difference stationary process, Augmented Dickey-Fuller
statistic, Spurious regression, Cointegration, Super-consistency, Real
exchange rate, Long-run PPP equilibrium, Exchange rate forecasting, Error
correction model

In this chapter we introduce the important topic of non-stationary processes and
highlight a major area of econometric research over the last three decades on trend
stationary processes and cointegration. This research has been important in
shedding light on spurious regression results that arise when the underlying process
dynamics is not properly understood or investigated. We consider the
application to purchasing power parity, one of the tools for determining long-run
exchange rate levels.
16.1

NON-STATIONARY PROCESS

We already have some ideas about stationary stochastic processes in an earlier


chapter. A random variable that is at least weakly stationary will have constant
mean and variance as it moves through time. This means that any deviation
from some mean level is just a random draw and has no permanent
consequence. If a rate of return goes up this period, at some point in the near
enough future, it will come down. This is unlike prices where the price level
can continue to creep up without sign of wanting to revert back to old levels.
It is in this spirit that we consider non-stationary processes, especially in
relation to price levels of stocks, of GDP, of consumption, and so on.
Consider the process Yt = μ + φYt-1 + εt , μ ≠ 0, where εt is a covariance-stationary
process, i.e. E(εt) = 0 and var(εt) = σε², a constant; cov(εt , εt+k) is not
necessarily zero for any k ≠ 0, but is a function of k only. Further, cov(Yt-1 , εt) = 0.
Yt is covariance stationary provided |φ| < 1. However, if φ = 1, then
        Yt = μ + Yt-1 + εt .        (16.1)
Or (1−B)Yt = μ + εt, and the root of (1−B) = 0 is unity. Thus Yt is said to contain
a unit root, and {Yt} is called a unit root process. It is also called a difference
stationary process since ΔYt is stationary with a general stationary noise
without any special structure, which need not be i.i.d. By repeated substitution,
        Yt = μ + ( μ + Yt-2 + εt-1 ) + εt
           = 2μ + ( μ + Yt-3 + εt-2 ) + εt + εt-1
           = …
           = μt + Y0 + εt + εt-1 + εt-2 + … + ε1 .
Thus we see that a unit root process in Yt leads to Yt having a time trend μt as
well as a stochastic trend Σj=0..t-1 εt-j. Note that for the unit root process in Yt, its
starting value Y0 is still a random variable, although its variance may be very
small. Clearly, if E(Y0) = μ0 and var(Y0) = σ0², then
        E(Yt) = μ0 + μt ≠ 0 , provided μ ≠ 0.
Hence the mean of Yt increases (decreases) with time according to drift μ >
(<) 0. Also,
        var(Yt) = σ0² + var[ Σj=0..t-1 εt-j ] ≠ σ0² .
The variance of Yt changes over time due to the presence of a stochastic trend
in the unit root process. Therefore, {Yt} is not covariance-stationary, or we
shall simply call it non-stationary.
Suppose the random variable Yt is trend stationary, i.e. stationary about a
deterministic time trend. By definition, a trend stationary process, unlike a
unit root process, does not have a stochastic trend, and thus does not display
changing variance over time, although its mean does change over time. The
unit root process, however, possesses both a time trend, as in the trend
stationary process, and also an additional stochastic trend. The following is a
trend stationary process fluctuating randomly about the deterministic trend α +
δt:
        Yt = α + δt + εt        (16.2)

where t is time, α and δ are constants, and εt is a stationary random variable
with zero mean and i.i.d., such that var(Yt) = var(εt) = σε². Then
        Yt-1 = α + δ(t−1) + εt-1
and so
        ΔYt = Yt − Yt-1 = δ + εt − εt-1 ,
or
        Yt = Yt-1 + δ + ( εt − εt-1 ) ,        (16.3)
where var(εt − εt-1) = 2σε², since εt is i.i.d.
Equation (16.3) may look like the unit root process in (16.1). However, it
is really not so75 because the stationary noise term εt − εt-1 carries a special
structure (thus we do not call this a difference stationary process). If we iterate
the process through time,
        Yt = δ + Yt-1 + ( εt − εt-1 )
           = 2δ + Yt-2 + ( εt − εt-2 )
           = …
           = δt + Y0 + ( εt − ε0 ) ,
where var(εt − ε0) = 2σε². Here we treat the starting value Y0 as a constant.
We thus see that for a trend stationary process, the variance of Yt stays the
same even as t increases. There is no stochastic trend, and the variance of Yt
does not change through time. The big difference is that the noise at the end of
the trend stationary process in (16.3), εt − ε0, does not add up variance as fast as the
accumulated noise in a unit root process.
Let us recall. A unit root process contains a deterministic time trend plus a
stochastic trend. The latter causes the unit root process to have changing
variances over time. A process with just a deterministic time trend plus a
stationary noise, but not a stochastic trend, is called a trend stationary process.

75 One of the earliest and exciting papers to point out this difference is Nelson,
Charles, and Charles Plosser, (1982), "Trends and Random Walks in Macroeconomic
Time Series: Some Evidence and Implications", Journal of Monetary Economics 10,
130-162.
A unit root process may thus be construed as a trend stationary process to
which is added a stochastic trend Σj=1..t εj. If εt is positively correlated through
time, then the stochastic trend will induce even larger variances in the unit
root process. While both a trend stationary process and a unit root process will
display a similar increasing trend (in expected values or means) if the drift is > 0, the unit
root process will display increasing volatility over time relative to the trend
stationary process. This distinction is important for differentiating the two.
In more general terms, equation (16.1) can be represented by ARIMA
(p,1,q), where p and q need not be zero for a unit root process. Thus, more
general unit root processes can be modeled by ARIMA (p,2,q), and so on.
Figure 16.1
Time Series Graphs of Stochastic Processes
[Simulated sample paths over 100 periods of a unit root process 0.01t + Σi Vi,
a trend stationary process 0.01t + Vi, and a stationary process Vi, with
Vi ~ N(0.1, 0.1).]
In Figure 16.1 above, we show how the three different processes might look.
Clearly the unit root process can produce large deviations away
from the mean.
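The three paths in Figure 16.1 can be simulated with a few lines of Python; the sketch below follows the figure legend, assuming N(0.1, 0.1) denotes a mean of 0.1 and a variance of 0.1.

```python
# Sketch: simulate the three processes of Figure 16.1 -- stationary noise, a trend
# stationary process, and a unit root (difference stationary) process with the same drift.
import numpy as np

rng = np.random.default_rng(3)
T = 100
v = rng.normal(0.1, np.sqrt(0.1), T)      # V_i ~ N(0.1, 0.1): mean 0.1, variance 0.1 (assumed)
t = np.arange(1, T + 1)

stationary = v                             # stationary noise
trend_stationary = 0.01 * t + v            # deterministic trend + stationary noise
unit_root = 0.01 * t + np.cumsum(v)        # deterministic trend + stochastic trend

print("variance over the last 20 observations:",
      round(trend_stationary[-20:].var(), 3),
      round(unit_root[-20:].var(), 3))     # the unit root path wanders far more
```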

16.2

SPURIOUS REGRESSION

Suppose
        Yt = α + Yt-1 + et ,   et ~ stationary with mean 0
        Zt = θ + Zt-1 + ut ,   ut ~ stationary with mean 0
and et and ut are independent of each other. They are also not correlated with
Yt-1 and Zt-1. {Yt} and {Zt} are unit root processes with drifts α and θ
respectively. Then,
        Yt = αt + Y0 + ( et + et-1 + … + e1 )
        Zt = θt + Z0 + ( ut + ut-1 + … + u1 ) ,
showing their deterministic as well as stochastic trends. Let Y0 and Z0 be
independent. Then,
        cov(Yt , Zt) = cov(Y0 , Z0) + cov( Σj=1..t ej , Σk=1..t uk ) = 0
since {et} and {ut} are stochastically independent.
Now Yt and Zt are uncorrelated. If we set up a linear regression of Yt on
Zt,
        Yt = a + bZt + ξt ,   cov(Zt , ξt) = 0 ,   var(ξt) < ∞ .
Since cov(Yt , Zt) = 0, the slope b = cov(Yt , Zt)/var(Zt) = 0.
However, if we expand the regression into its time trend and additive
stochastic components, we obtain:
        Yt = a + bZt + ξt ,   cov(Zt , ξt) = 0 ,        (16.4)
and where ξt has finite variance, we can write
        αt + Y0 + Σj=0..t-1 et-j = a + b( θt + Z0 + Σj=0..t-1 ut-j ) + ξt .
Dividing through by t,
        α + Y0/t + (1/t) Σ et-j = a/t + bθ + bZ0/t + (b/t) Σ ut-j + ξt/t .
Since var(ξt) < ∞, the variance of the last term, var(ξt/t) = var(ξt)/t², approaches 0 as t
approaches infinity. Likewise var(Y0/t) and var(Z0/t) → 0 as t → ∞. Thus, as t
→ ∞, the terms Y0/t , Z0/t , and ξt/t approach zero in the mean square sense.
As t increases, the time-averages of the noise terms in et and ut converge to
zero via some version of the law of large numbers, due to their stationarity.
All terms except the following also converge to zero. We are then left with
        α = bθ ,   so   b = α/θ ≠ 0 .
Hence the regression in (16.4) between two independent processes produces a
slope coefficient b that is non-zero! This is what is termed a spurious
(seemingly true yet false) regression result: b ≠ 0 is obtained when
theoretically b = 0.
More specifically, a regression method such as OLS will provide a non-zero
estimate of b that is spurious. Thus, the point to note is that when we perform
an OLS regression of a unit root process on another independent unit root
process, instead of obtaining the expected zero slope, we are likely to end up
with a spurious non-zero slope estimate. In other words, under OLS, the
sampling estimate of cov(Yt , Zt) will be spurious and not zero because Yt and
Zt are unit root processes.
Only when we perform the OLS regression using stationary first
differences, i.e. regress ΔYt = α + et on ΔZt = θ + ut, or
        ΔYt = a + bΔZt + ζt
where ζt is stationary, do we get b = cov(et , ut)/var(ut) = 0. Thus we obtain an OLS
estimator b̂ that converges to b = 0.
The spurious regression above applies also to Yt and Zt if they are trend
stationary instead of being unit root processes. Consider
        Yt = αt + εt
        Zt = θt + ηt ,
where εt and ηt are mean-zero i.i.d. random variables that have zero
correlation.
Even though Yt and Zt are not correlated,
        Yt = αt + εt = (α/θ)( Zt − ηt ) + εt = (α/θ) Zt + ( εt − (α/θ) ηt ) .
So OLS regression of Yt on Zt will give a spurious slope estimate of α/θ ≠ 0.

It suggests that the spurious non-zero correlation between Yt and Zt (even


when they are independent processes) comes from their deterministic trend,
not the stochastic trend. Suppose

Zt Z t-1 u t

, u t ~ stationary with mean 0

w t w t-1 t

, t ~ stationary with mean 0

are independent unit root processes. Then in general, a linear combination of


the unit root processes Zt and wt , Yt , is also a unit root process as shown
below.

Yt c dZ t w t

(c ) d dZ t 1 du t w t 1 t
( d) Yt 1 du t t

is also a unit root process where c,d 0. Here Yt is correlated with Zt due to Yt
being a linear combination involving Zt.
If we perform OLS on Yt = c + dZt + wt , d0, the effects are as follows.
The OLS estimate of d will involve cov(Yt , Zt) = cov(c + dZt + wt , Zt) = d
var(Zt) + cov (c+wt , Zt). But the latter is a covariance of two independent unit
root processes each with a deterministic trend (and a stochastic trend as well),
that produces spurious sampling estimate that is not zero. Thus, the sampling
estimate of cov(Yt , Zt) under OLS will also be spurious even if d 0.
At this point we can almost see when OLS on two related unit root
processes such as Yt and Zt can or cannot be feasible. It has to do with the
covariance of the explanatory variable and the residual variable, cov (wt , Zt).
If both are unit root processes, then there is spuriousness.
Suppose instead, wt is a stationary process, and not a unit root process,
independent of Zt . Then the sample estimate of cov (wt,Zt) = 0. In this case,
the OLS estimate of d converges correctly.
In summary, suppose unit root processes Yt and Zt are truly related as
follows: Yt = c + dZt + wt , where disturbance wt has a unit root and is not
correlated with Zt . Then, it will not be appropriate to perform OLS of Yt on Zt
since wt is not stationary. The OLS result will be spurious.
When two unit root variables are not correlated, it is not appropriate to
perform OLS of one on the other, as the slope coefficient estimate will be
spurious. Even when two unit root variables are correlated, OLS regression
will also produce spurious slope estimate if the disturbance is a unit root
variable as seen above. This is because of non-zero covariance in the unit root
regressor and the unit root disturbance. In the latter, the regression will not be
spurious only if the disturbance is stationary.
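A short simulation, with hypothetical drifts and independent shocks, illustrates the spurious result in levels and its disappearance in first differences.

```python
# Sketch: spurious regression of one independent unit root process on another.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
T = 500
Y = np.cumsum(0.05 + rng.normal(size=T))      # unit root with drift alpha = 0.05
Z = np.cumsum(0.02 + rng.normal(size=T))      # independent unit root with drift theta = 0.02

levels = sm.OLS(Y, sm.add_constant(Z)).fit()
diffs = sm.OLS(np.diff(Y), sm.add_constant(np.diff(Z))).fit()
print("slope in levels      :", round(levels.params[1], 3),   # spurious, near alpha/theta
      " R2 =", round(levels.rsquared, 3))
print("slope in differences :", round(diffs.params[1], 3))    # close to zero, as it should be
```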

16.3

COINTEGRATION

Sometimes two processes may be non-stationary in the sense that they carry unit
roots. However, they could have a long-term equilibrium relationship so that,
over a long time interval, a linear combination of them
behaves like a stationary process and they do not drift away from each other
aimlessly. We say they are cointegrated with each other. To model such a
relationship, we proceed as follows.
In Yt = c + dZt + wt ,
if wt is stationary, then Yt and Zt are said to be cointegrated with cointegrating
vector (1, −d), i.e.
        ( 1   −d ) ( Yt   Zt )' = Yt − dZt = c + wt
is stationary. In this case, OLS of Yt on Zt when they are cointegrated indeed
produces OLS estimators that are super-consistent, i.e. they converge even
faster than normally consistent estimators. In this case, the usual t-statistic
inference is valid.
A special case of the unit root process (16.1), when εt is i.i.d. (which also means it
is independent of lagged Yt), is a random walk in Yt. The discussion about trend
and difference stationarity carries over into this special case as well. Here,
        var(Yt) = σ0² + (t − 1) σε² ,
where σ0² = var(Y0).
This (unconditional) variance increases linearly with time t. A finance
example is when Yt is the random walk of (ln) prices,
        ln Pt = μ + ln Pt-1 + εt .
This is similar to the stationary i.i.d. continuously compounded return
        Rt = ln( Pt / Pt-1 ) = μ + εt
with mean μ and var(Rt) = σε². The mean and variance of the return RT over a
longer interval [t−1, T] are (T − t + 1)μ and (T − t + 1)σε². This is the same as
        Et-1( ln PT ) = (T − t + 1)μ + ln Pt-1 ,   and   vart-1( ln PT ) = (T − t + 1)σε² .
Let us continue with the discussion of the unit root process or, what is
equivalent, the difference stationary process.
Suppose
        Yt = Yt-1 + εt ,   εt ~ stationary .        (16.5)

Then the first difference
        ΔYt = Yt − Yt-1 = εt
is stationary (without any special structure like in the trend stationary
process). In this case Yt is said to be an integrated order 1, or I(1), process. The
first difference ΔYt is an integrated order 0, or I(0), process and is stationary.
I(k) denotes an integrated order k process. The kth difference of an I(k) process
becomes stationary for the first time. Thus an I(1) process is not
stationary but its first difference is.
But suppose
        Yt = μ + Yt-1 + εt ,   εt ~ stationary ;        (16.6)
then the first difference is
        ΔYt = Yt − Yt-1 = μ + εt .
If
        Yt = μ + δt + Yt-1 + εt ,   εt ~ stationary ,        (16.7)
then
        ΔYt = μ + δt + εt .
The above (16.5), (16.6), and (16.7) are all unit root processes. The alternative
stationary autoregressive hypotheses are
        Yt = φYt-1 + εt                  (|φ| < 1)        (16.8)
        Yt = μ + φYt-1 + εt              (|φ| < 1)        (16.9)
        Yt = μ + δt + φYt-1 + εt         (|φ| < 1)        (16.10)

16.4

UNIT ROOT TEST

How do we test for the unit root processes (16.5), (16.6), or (16.7)? Using the
alternative specifications in (16.8), (16.9) and (16.10), we can write:
        ΔYt = βYt-1 + εt                         (16.11)
        ΔYt = μ + βYt-1 + εt                     (16.12)
        ΔYt = μ + δt + βYt-1 + εt                (16.13)
where β = φ − 1. For the I(1) processes in (16.5), (16.6) and (16.7), however, β = (φ − 1) = 0.
Thus, we can test the null hypothesis of a unit root process by testing
H0: β = 0.
In practice, before any test is carried out, specifications (16.11), (16.12), or
(16.13) are generalized to include lags of ΔYt so that any elements of
stationarity in ΔYt are explicitly modeled and we are left with a residual
noise et that has the autocorrelations removed and is then i.i.d. with mean zero.
This is the same as saying that we model the stationary εt in (16.11), (16.12), and
(16.13) as an MA process. (16.11), (16.12), and (16.13) in general can be
expressed as:
        ΔYt = βYt-1 + θ1ΔYt-1 + θ2ΔYt-2 + … + θkΔYt-k + et        (no constant)        (16.14)
        ΔYt = μ + βYt-1 + Σj=1..k θjΔYt-j + et                    (with a constant)        (16.15)
        ΔYt = μ + δt + βYt-1 + Σj=1..k θjΔYt-j + et               (with a constant and time trend)        (16.16)
where et is i.i.d.
Therefore in practice, we run OLS for some k on (16.14), (16.15), or
(16.16), depending on whether the model excludes a constant, includes a
constant, or includes both a constant and a time trend. Usually k is initially
chosen to be large, e.g. 6. Then we can employ a smaller k by removing the
explanatory variable ΔYt-j when θj is not significant. This will reduce (16.14),
(16.15), and (16.16) to regressions involving a minimal k.
To test if (16.14), (16.15), or (16.16) contains a unit root, i.e. Yt is a unit
root process, we run OLS on (16.14), (16.15), or (16.16). If β̂ is
significantly < 0, then we reject the H0 of a unit root. If not, then there is evidence
of a unit root process. Tests using the specifications with lagged ΔYt are also
called Augmented Dickey-Fuller (ADF) tests. Next we compute
        x = β̂OLS / s.e.( β̂OLS ) .

This is the usual formula for t-value function, but in this case it is not
distributed as Student-tT-n statistic where T is the sample size, and n is the
number of parameters, i.e. n=k+1 for (16.14), n=k+2 for (16.15), and n=k+3
for (16.16). The distribution was found through some simulation and is
reported in studies by Dickey and Fuller.76 For the computed t-value, we
76

See Dickey, D., (1976), Estimation and Hypothesis Testing in Nonstationary Time
Series, PhD Dissertation, Iowa State University, and Fuller, W., (1976),
Introduction to Statistical Time Series, Wiley New York. See also an early but

295
therefore use the Dickey-Fuller (ADF) critical values for inference to test the
null hypothesis. The critical values for the test are shown in Table 16.1 below.
Table 16.1
Critical Values for the Dickey-Fuller t-Test
Source: Fuller, W., (1996), Introduction to Statistical Time Series, (2nd ed.)
New York: Wiley

                                          p-values (probability of a smaller test value)
Case                      Sample Size N      0.01        0.025       0.05        0.10
No constant                          25     -2.65^77     -2.26       -1.95       -1.60
Equation (16.5)                      50     -2.62        -2.25       -1.95       -1.61
                                    100     -2.60        -2.24       -1.95       -1.61
                                    250     -2.58        -2.24       -1.95       -1.62
                                      ∞     -2.58        -2.23       -1.95       -1.62
Constant                             25     -3.75        -3.33       -2.99       -2.64
Equation (16.6)                      50     -3.59        -3.23       -2.93       -2.60
                                    100     -3.50        -3.17       -2.90       -2.59
                                    300     -3.45        -3.14       -2.88       -2.58
                                      ∞     -3.42        -3.12       -2.86       -2.57
Constant and time trend              25     -4.38        -3.95       -3.60       -3.24
Equation (16.7)                      50     -4.16        -3.80       -3.50       -3.18
                                    100     -4.05        -3.73       -3.45       -3.15
                                    300     -3.98        -3.69       -3.42       -3.13
                                      ∞     -3.96        -3.67       -3.41       -3.13
If x < the critical value (at, say, p-value 1%), then we reject H0: β = 0 (or φ = 1),
i.e. there is no unit root, at the 1% significance level.
If x > the critical value (p-value 1%), then we cannot reject the evidence of a
unit root at the 1% level.
As another check on whether a process is a unit root process, the
autocorrelation function (ACF) of the process is computed. A unit root
process will typically show a highly persistent ACF, i.e. one where the
autocorrelation decays very slowly with increase in the lags.

77 Monte Carlo results show that under the null of a unit root process in this case, 99%
of the t-values are above -2.65.
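In practice the ADF regressions (16.15) and (16.16) can be run with standard software; the following Python sketch applies statsmodels' ADF test to a simulated placeholder series.

```python
# Sketch: Augmented Dickey-Fuller tests corresponding to (16.15) (constant) and
# (16.16) (constant and trend). The series y is a hypothetical simulated placeholder.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(size=200))          # a simulated unit root series

for reg in ("c", "ct"):                      # "c": constant; "ct": constant and trend
    stat, pval, usedlag, nobs, crit, _ = adfuller(y, maxlag=6, regression=reg)
    print(f"ADF ({reg}): stat = {stat:.3f}, p-value = {pval:.3f}, 5% critical = {crit['5%']:.2f}")
```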
16.5

PURCHASING POWER PARITY

The absolute purchasing power parity (PPP) version states that Pt = et Pt*, where
Pt is the UK national price index in £, Pt* is the US national price index in USD,
and et is the spot exchange rate: the number of £ per $. Then
        ln Pt = ln et + ln Pt*
        d ln Pt = d ln et + d ln Pt*
        dPt/Pt = det/et + dPt*/Pt* .
The relative PPP version states that
        det/et = dPt/Pt − dPt*/Pt* .
Thus if the US inflation rate is 5% and the UK inflation rate is 10%, both over a horizon of T
years, then det/et = 10% − 5% = 5%, and the $ is expected to appreciate by 5%
over T years. et is the nominal exchange rate, exchanging et number of
pounds for one US$. The real exchange rate, or real £ per $, is the number of
units of real goods in the U.K. that can be obtained in exchange for one unit of the
same real good purchased in the U.S. Here, the number of units of real goods
purchased in the U.S. per US$ is 1/Pt*, supposing the US$ price per unit of good is Pt*. The
number of units of the same good that can be obtained in the U.K. by exchanging
one US$ is et/Pt, where we suppose the £ price per unit of good is Pt.
The real exchange rate (real £ per $) is therefore et Pt*/Pt. If the real exchange rate of the
$ is rising over time, then it means goods prices in the US are becoming more
expensive relative to the UK, as more units of goods in the U.K. can be obtained in
exchange for one unit in the U.S. This can happen if the nominal et increases, or
if inflation in the U.S. rises relative to that in the U.K. If PPP holds exactly, then the
real exchange rate is 1.
In log form, the real exchange rate is
        rt = ln et + ln Pt* − ln Pt = 0 under PPP.        (16.17)
297
In reality, in the short run, rt deviates from zero at any t. In the long-run, if
PPP holds, then rt will be a stationary process with mean zero. This means that
rt may deviate, but will over time revert back to its mean at 0. This is the
realistic interpretation of PPP (sometimes called the long-run PPP), rather than
stating rt as being equal to 0 at every time t.
If long-run PPP does not hold, then rt may deviate from 0 and not return to
it. It can then be described as following a unit root process, viz.
rt = rt-1 + t
(16.18)
where t is stationary with zero mean.
If equation ( 16.18) is the case, it means the t has a permanent effect of
causing rt to move away from 0. This is because if 1> 0, then r1=r0+1, so
new r2=r1+2 is a stationary deviation from r1 that has permanently absorbed
1. Moreover, if rt has a drift, then the unit root also incorporates a
deterministic trend moving rt away from zero deterministically as well.
We test the validity of the long-run PPP by testing the null of unit root of
rt. We run OLS on
j rt-j + t
rt = + rt-1 +

j1

to test .
Suppose we test ln et , ln Pt*, and ln Pt separately and they are all unit root
processes. Then it is plausible that rt = ln et + ln Pt* - ln Pt in (16.17) is also a
unit root process. However, it is also possible that rt may be a stationary
process in the following way.
Suppose their linear combination
rt = ( 1   1   −1 ) ( ln et , ln Pt* , ln Pt )′
is stationary and not unit root. Then the processes ln et, ln Pt*, and ln Pt are cointegrated with cointegrating vector ( 1  1  −1 ).
In empirical work, some currency pairs satisfy long-run PPP while others
do not. For those that satisfy long-run PPP, i.e. rt is stationary, then we may try
to forecast ln et+1 as follows. In some studies, there is some relaxation of PPP
to allow for a more general cointegrating vector (1 b a) as long as a, b are
close to one. Run OLS on
ln et = α + a ln Pt − b ln Pt* + λ ( ln et-1 + ln Pt-1* − ln Pt-1 ) + ut        (16.19)
where ut is assumed to be i.i.d.
Notice how error correction is now built into the equation. When there is a long-run tendency to revert back to zero for rt, or else for (ln et + b ln Pt* − a ln Pt), then the past deviation ( ln et-1 + ln Pt-1* − ln Pt-1 ) is likely to be of influence via a negative λ to bring the error back towards zero. The above regression is called an error correction model, i.e. using the short-run deviation rt-1 to help explain variation in the next period deviation. Error correction specification is possible
provided cointegration exists. If there is no cointegration or long-run reversion
to zero, then error correction is meaningless.
With the error correction model in (16.19), provided ln et , ln Pt , and ln Pt* are cointegrated, the forecast of the next period nominal exchange rate et+1, £ per $, can be obtained as follows:
et+1 (forecast) = exp( α + λ rt ) × Pt+1^a / (Pt+1*)^b , evaluated at the OLS estimates,
or a and b can be fixed at one if they are not significantly different from one.
In this case we also need to input next period Pt+1 and Pt+1*. These may be
more readily available in the form of price inflation forecast.
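A minimal sketch of this exercise follows (assuming the levels specification of (16.19) as reconstructed above, and hypothetical column names in a hypothetical file ppp.csv): estimate the error correction regression by OLS and form a one-step-ahead forecast of the nominal exchange rate.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("ppp.csv")                      # hypothetical data file
ln_e = np.log(data["pound_per_usd"])
ln_p_uk = np.log(data["uk_cpi"])
ln_p_us = np.log(data["us_cpi"])
r = ln_e + ln_p_us - ln_p_uk                       # log real exchange rate

X = pd.DataFrame({"ln_p_uk": ln_p_uk, "ln_p_us": ln_p_us, "r_lag": r.shift(1)}).iloc[1:]
y = ln_e.iloc[1:]
res = sm.OLS(y, sm.add_constant(X)).fit()
print(res.params)

# One-step forecast, given assumed next-period price forecasts (e.g. from inflation forecasts):
ln_p_uk_next = ln_p_uk.iloc[-1] + 0.02
ln_p_us_next = ln_p_us.iloc[-1] + 0.01
b = res.params
e_next = np.exp(b["const"] + b["ln_p_uk"] * ln_p_uk_next
                + b["ln_p_us"] * ln_p_us_next + b["r_lag"] * r.iloc[-1])
print("forecast pound per USD:", e_next)
```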
A caveat in the above exercise is that exchange rate forecasting is
extremely elusive, and PPP is not a particularly effective forecasting
specification, especially in the short-run.
However, there are many instances where currency pairs do not display
cointegration, and the real exchange rate is not stationary. The following
empirical results show that. We employ the £ per $ nominal exchange rate, US
CPI and UK CPI annual data from 1960 till 2001. The data series are
contained in the datafile PPP.xls. The real exchange rate is shown below.
Figure 16.2
£ per $ Real Exchange Rate, 1960 till 2001
[Figure: time-series plot of the log real exchange rate over 1960-2001]
Figure 16.2 shows that the log real exchange rate in £ per $ appears to be mostly larger than zero, indicating better terms of trade and competitiveness favoring U.K. from 1960 to 2001, except for a period in the early 1970s and during
mid 1980s.
Recall that et is the spot exchange rate in number of £ per $. Results of a unit
root test of ln et , of ln(Pt*) or U.S. price, of ln(Pt) or UK price, of the first
difference of ln(et), and of the real exchange rate rt are shown in Tables 16.2, 16.3,
16.4, 16.5 and 16.6 respectively.
From Table 16.2, the ADF test-statistic of -1.7760 > -3.1988 at 10%
critical level. Hence we cannot reject that the spot £ per $ exchange rate
during 1960-2001 follows a unit root process. All the ADF tests in Tables
16.2, 16.3, and 16.4 show that loge of the nominal spot exchange rates and the
loge of the prices are unit root processes. The real exchange rate also has a unit
root as seen in Table 16.6. Its first difference, however, is stationary.
Sometimes, bank reports show the use of PPP in trying to forecast or make
prediction about the future movement of a currency. For example, in the above
£ per $ spot rate, the actual spot $ value may lie above the theoretical PPP $ value, and if the real exchange rate is stationary, a bank report may suggest that the $ is way overvalued and that PPP will bring about a correction soon to see the $ trending back to the PPP level. However, this is a dangerous prediction as the real exchange rate seems to be a unit root process during the period, and the $ value may indeed continue to move upward or not revert down.
Table 16.2
Augmented D-F unit root test of ln et

Table 16.3
Unit root test of ln Pt* (US price)

Table 16.4
Unit root test of ln Pt (UK price)

Table 16.5
Unit root test on First Difference of ln et

Table 16.6
Unit root test of real exchange rate rt

16.6
PROBLEM SET
16.1
The following augmented Dickey-Fuller unit root test statistics were
collected on a certain price series Pt with a sample size of 1000. The
critical values were also shown.
Test without constant and trend
ADF Test Statistic   -1.313870      1% Critical Value    -2.5678
                                    5% Critical Value    -1.9397
                                    10% Critical Value   -1.6158

Test with constant and no trend
ADF Test Statistic   -2.118800      1% Critical Value    -3.4397
                                    5% Critical Value    -2.8649
                                    10% Critical Value   -2.5685

Test with both constant and trend
ADF Test Statistic   -3.847680      1% Critical Value    -3.9722
                                    5% Critical Value    -3.4167
                                    10% Critical Value   -3.1303

The augmented Dickey-Fuller unit root test was performed again, this time on the first difference of the Pt series with a sample size of 999.

Test without constant and trend
ADF Test Statistic   -14.57404      1% Critical Value    -2.5678
                                    5% Critical Value    -1.9397
                                    10% Critical Value   -1.6158

Explain if the above results would lead us to express
Pt = c + δt + Pt-1 + εt
where c, δ are constants, and εt is a mean zero stationary process.
Suppose that we wish to avoid the small probability that the price Pt may become negative if εt takes a value that is highly negative. Show how we can model the price process differently so as to give rise to similar unit root test results as shown above.
16.2

If Yt and Xt are cointegrated as Yt − βXt = et, where et is I(0), then an error correction model is a regression
ΔYt = a + b( Yt-1 − βXt-1 ) + vt , where vt is i.i.d.
Suppose the estimate for b is significant; could you infer that temporary disequilibrium in the relationship between Yt and Xt provides information about the dynamics of Yt?

16.3

If a stochastic process Yt , t = …, -T, -T+1, …, -1, 0, 1, 2, …, T-1, T, is such that Yt+2 − 2Yt+1 + Yt is stationary, is Yt a unit root process? What is the process of Yt?

FURTHER RECOMMENDED READINGS


Other interesting details of cointegration and error correction models can be
read in various books.
[1] Walter Enders, (1995), Applied Econometric Time Series, John Wiley.
[2] Hayashi, Fumio, (2000), Econometrics, Princeton University Press.
[3] Hamilton, James D., (1994), Time Series Analysis, Princeton University Press.


Chapter 17
CONDITIONAL HETEROSKEDASTICITY
APPLICATION: RISK ESTIMATION
Key Points of Learning
Risk management, Value-at-risk, Historical approach, Parametric approach,
Volatility cluster, Volatility persistence, Conditional distribution,
Unconditional distribution, Conditional variance, ARCH, GARCH, ARCH-in-mean, Maximum likelihood, Fisher's information matrix, Cramer-Rao lower
bound, Asymptotic efficiency, Estimating GARCH, Futures margin

The landscape of financial econometrics was forever changed and augmented


greatly when Prof Engle introduced the Autoregressive Conditional
Heteroskedasticity model in 1982. It was an ingeniously embedded tool in
specifying the dynamics of volatility coupled with the underlying asset price
process. This greatly extended the space of stochastic processes including
those in the Box Jenkins approach. ARCH was very successfully extended to
GARCH by Prof Bollerslev, and to-date there continues to be a huge number
of variations of processes building on the embedded tooling idea. We consider
such process modeling to be particularly relevant in estimating risks in market
prices and in risk management in today's market of high and persistent
volatilities. The topic of maximum likelihood is given more discussion here,
as well as ideas of asymptotic efficiency.
17.1

RISK MANAGEMENT

One major application of finance is risk management of investments. Risk


management is used in so many different contexts, ranging from regulatory
capital adequacy standards78, to a bank's internal risk control over its

78

The Bank for International Settlements set up a Basel Committee to advise on capital adequacy standards for international banks of participating countries. Under what has come to be known as the BASEL II standard, regulators in member countries, typically central banks, exercise regulatory control over their country's banks to ensure proper governance and prudence in keeping sufficient capital to contain the risks of market price changes, credit risks, and operational risks that banks face from day to day in their loans, investments, and other business activities. Sound operating banks in a country ensure economic stability and a well-functioning capital market.

proprietary trading positions and lending exposures, to corporations' investments, to firms' hedging of transaction and translation risks79, and to an Exchange's margining rules and practices, and so on.
Major sources of risks that impact on asset values or prices are market
risks, credit risks, operational risks, liquidity risks, legal risks, political risks,
and model risks.80 The most prevalent and obvious of these is market risk.
When a bear market starts to run, be it in equities or bonds or futures or
options or commodities or the real estate sector, those with long positions will
start to worry and for good reasons. Conversely, in a bull run, those who sell
short ought to beware. Market movements can erode the value of asset
positions or net wealth, and thus is a major concern for investors, be they
institutions or individuals.
The key concept in market risk management is Value-at-Risk. Value-at-Risk (VaR) is the maximum loss (or worst loss) over a specified horizon at a
given confidence level. It is also the minimum loss with probability equal to 1
less the confidence level. The idea of estimating such losses is related to
regulatory bodies requiring firms to hold enough capital so that the firms can
bear any foreseeable losses arising out of their trading and market activities.
The practice of such computations appeared to have formally started in 1980
when the Securities and Exchange Commission required financial firms to report
potential losses over a 30-day horizon with a 95% confidence level. The
computation of VaR became commonplace when Basel regulatory bodies
quickened supervisory roles in banking operations worldwide especially in the
developed markets.
Suppose a portfolio of stocks is currently valued at $100 million. This
$100m value is the marked-to-market value81 of the portfolio. If the
consideration of market risk exposure is over a day, then we are analyzing
daily VaR. If it is exposure over a week, then we think in terms of a weekly
79
Firms that export or import, and thus receive or pay foreign currencies for the
goods, typically hedge currency risk by selling or buying currency forward contracts
in order to lock in a certain exchange rate. This is an example of hedging transaction
risk. Translation risk has to do with accounting exposures.
80
See Steven Allen (2003), Financial Risk Management: A Practitioner's Guide to Managing Market and Credit Risk, Wiley Finance. There are many excellent books
written both by academics and by practitioners on this growing and important subject.
81
This means that the assets could be sold if need be to realize the marked-to-market dollar value. On the other hand, in situations of an illiquid market or when the asset is not placed onto the market for sale, there is no ready market price. In such cases, theoretical models to price the assets are used, and the assets are said to be marked-to-model when a value is assessed based on the model price.

VaR. Over a day, suppose the probability distribution of the daily portfolio return rate r̃ is normal and shown in Figure 17.1, where σ is the daily return volatility. We assume that over a day, the expected return is negligibly small and cannot be estimated accurately, so we set it to zero, i.e. μ = 0. Suppose σ = 0.02.

Figure 17.1
Daily Ex-ante Portfolio Return
[Figure: normal density of r̃ with the 99% left-tail cutoff at −2.33σ = −0.0466 marked]

At the 99% confidence level (the critical value on the left tail of the standard normal Z is −2.33), Prob( [r̃ − μ]/σ ≡ Z < −2.33 ) = 1%. Hence Prob( r̃ < −2.33σ ) = Prob( r̃ < −0.0466 ) = 1%. Since r̃ = P1/P0 − 1, where P1 is the next period or next day's portfolio price, then if r̃ < 0, the portfolio value loss is P0 − P1 = −P0 r̃. In this case, the daily VaR (or portfolio value loss) at the 99% confidence level is −P0 r̃99% = −100m × −0.0466 = $4.66 m. There is a chance of losing $4.66 m or more out of the portfolio of $100 m, with a probability of 1%. This is also called the absolute VaR when μ = 0.
Suppose instead μ = 0.01 or a 1% expected increase in return. Then the return distribution is shifted to the right by μ. This is shown in Figure 17.2.
At the 99% confidence level (the critical value on the left tail of the standard normal Z is −2.33), Prob( [r̃ − μ]/σ ≡ Z < −2.33 ) = 1%. Hence Prob( r̃ < μ − 2.33σ ) = Prob( r̃ < 0.01 − 0.0466 = −0.0366 ) = 1%. Since r̃ = P1/P0 − 1, where P1 is the next period or next day's portfolio price, then if r̃ < 0, the portfolio value loss is P0 − P1 = −P0 r̃. In this case, the daily VaR (or portfolio value loss) at the 99% confidence level is −P0 r̃99% = −100m × −0.0366 = $3.66 m. There is a chance of losing $3.66 m or more out of the portfolio of $100 m, with a probability of 1%. This is also called the absolute VaR when μ > 0.

Figure 17.2
Daily Ex-ante Portfolio Return
[Figure: normal density of r̃ shifted right by μ = 0.01, with the 99% left-tail return −0.0466 + 0.01 = −0.0366 marked; Absolute VaR $3.66m, Relative VaR $4.66m]
The absolute VaR is the loss measured with respect to the current marked-to-market portfolio value regardless of μ. On the other hand, if the loss is to be computed taking into account the loss also of the expected profit μP0, then the loss with respect to the expected value E(P1), not the current value P0, is E(P1) − P1 = P0(1 + μ) − P1 = μP0 − P0 r̃. This is called the relative VaR. Relative VaR = Absolute VaR + μP0. Here μP0 = 0.01 × 100 m = $1 m. Hence relative VaR = $4.66 m. For μ > (<) 0, Relative VaR > (<) Absolute VaR. For μ > 0, Relative VaR is more conservative, giving a higher VaR or loss number.
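A minimal sketch (not from the text) of the absolute and relative daily VaR computation of the worked example follows, with P0 = $100m, μ = 0.01, σ = 0.02, at the 99% level.

```python
from scipy.stats import norm

P0 = 100_000_000          # current marked-to-market portfolio value
mu, sigma = 0.01, 0.02    # one-day expected return and volatility
z = norm.ppf(0.01)        # left-tail critical value, about -2.33

r_cutoff = mu + z * sigma              # 99% worst-case return, about -0.0366
absolute_var = -P0 * r_cutoff          # loss relative to current value P0
relative_var = absolute_var + mu * P0  # loss relative to expected value E(P1)
print(f"Absolute VaR: ${absolute_var:,.0f}")
print(f"Relative VaR: ${relative_var:,.0f}")
```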
There are 3 major approaches to measuring the probabilities:
(a) Historical Approach
Collect the immediate past historical daily returns and form an empirical
distribution.
(b) Parametric (VaR) Approach
Assume a normal distribution. Use immediate past daily returns to estimate the mean and variance of the normal distribution:
μ̂ = (1/T) Σ t=1..T rt ,   σ̂² = (1/(T−1)) Σ t=1..T ( rt − μ̂ )² .

Sometimes we may wish to measure VaR over a longer horizon, e.g. 1 week (5 trading days). Then
σ²(5-day) = 5 × σ²(daily) , or σ(5-day) = √5 × σ(daily) .
For the mean, μ(5-day) = 5 × μ(daily) .
(c) Monte Carlo Approach
The price processes are specified. Usually they are complicated and
cannot be solved analytically. Computer simulations of the possible
paths of the prices are made millions of times, and the distribution of
such path prices is then collected and used for determining the critical
regions of losses.
In the Parametric Approach, we see how critical is the volatility σ, from now to the end of the next day, in determining VaR. If σ is underestimated, then VaR will be too small and excessive risk may be overlooked, creating grave dangers for the credit standing of the exposed firm or bank.
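A minimal sketch (using a hypothetical array of past daily returns) contrasting the historical and parametric estimates of the one-day 99% VaR, together with the square-root-of-time scaling to a 5-day horizon, is as follows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
returns = rng.normal(0.0, 0.02, size=500)   # stand-in for observed daily returns
P0 = 100_000_000

# (a) Historical approach: 1st percentile of the empirical return distribution.
hist_var = -P0 * np.percentile(returns, 1)

# (b) Parametric approach: fit a normal distribution to the same returns.
mu_hat, sigma_hat = returns.mean(), returns.std(ddof=1)
para_var = -P0 * (mu_hat + norm.ppf(0.01) * sigma_hat)

# 5-day parametric VaR: sigma scales with the square root of time, the mean scales linearly.
para_var_5d = -P0 * (5 * mu_hat + norm.ppf(0.01) * np.sqrt(5) * sigma_hat)

print(f"Historical 1-day VaR:  ${hist_var:,.0f}")
print(f"Parametric 1-day VaR:  ${para_var:,.0f}")
print(f"Parametric 5-day VaR:  ${para_var_5d:,.0f}")
```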
Why does volatility matter in the world of finance? To motivate, let us
consider the Nick Leeson lesson. In late 1994 and early 1995, Leeson, who
was chief trader for Barings Futures in Singapore, accumulated a speculative
long position on the Nikkei 225 Futures contracts worth nominally US$7.7
billion. By mid-February 1995, the N225 index had fallen more than 15%.
This is clearly a case of great volatility. Together with losses on options, Leeson lost US$1.3 billion and wiped out the entire equity capital of the Barings PLC Bank based in the U.K. at that time. History seems to repeat itself when in January 2008 it was reported that a rogue trader at Société Générale through fraudulent practice lost more than US$7 billion.
If we use stationary ARMA process seen in Chapter 6 to model return, we
would have constant variances, and this will often underestimate volatility
when market becomes uncertain and volatility clusters together. This is
evidence of volatility persistence and is illustrated in Figure 17.3.
Clearly, such clustering over say a few days as depicted in Figure 17.3 means that based on the most recent information, e.g. the information set Φt available the day before at t, one can provide a much more accurate forecast of the next period t+1 volatility. This is a forecast of conditional (on Φt) volatility. We will require the modeling of the return process using other than an ARMA process since the latter has constant conditional volatilities.
Modeling clustering avoids underestimating volatility and hence also VaR
during uncertain periods. We employ models of autoregressive conditional
heteroskedastic processes (ARCH) with changing conditional variance.
It should also be mentioned that VaR as reported by banks to regulatory
agencies on a daily basis is meant to take care of normal day-to-day risks.
Therefore, in situations where major market movements are expected e.g. 9/11

or at the aftermath of Lehman Brothers bankruptcy in September 2008,
additional extreme risk tools would be deployed in addition to the tool of
VaR.
Figure 17.3
Illustrating a volatility cluster
[Figure: volatility plotted against time; a cluster of elevated volatility appears when market uncertainty appears]

17.2

ARCH-GARCH

For ARMA(1,1), which includes the AR(1) and MA(1) processes,
yt = c + φyt-1 + ut + aut-1 ,
the conditional variance Var(yt | yt-1) = (1 + a²)σu² is constant. Or, if we condition on all lagged information available at t, Φt, then Var(yt | yt-1, ut-1) = σu² is constant. Note however that the conditional mean E(yt | yt-1) = c + φyt-1 changes. This is the motivation behind building a conditionally changing variance as well. Similarly, we can show that for
yt = α + βxt + ut ,        (17.1)

where E(ut) = 0, Cov(ut, ut-k) = 0 for k ≠ 0, and xt, ut are stochastically stationary and independent, then Var(yt | xt) = Var(ut) = σu² is constant. In (17.1), we can
modeled anything about the variance of ut.
However, suppose we model a process on the variance of ut (not ut itself
note this distinction) such that:
Var(ut) = α0 + α1ut-1² .        (17.2)

This is an autoregressive conditional heteroskedasticity or ARCH(1) model82 in the disturbance ut. Then in (17.1) and (17.2), we see that Var(yt | xt) = Var(ut) = α0 + α1ut-1² ≠ σu². The conditional variance of yt indeed changes with past levels of ut-1, although the latter cannot be directly observed.
We can also write (17.2) in terms of a ut process as follows:
ut = et √( α0 + α1ut-1² )        (17.3)
where et ~ N(0,1). To be precise, (17.3) implies (17.2), but the converse is not
necessarily true. It is interesting to know what is the nature of the distribution
of the disturbance ut. From (17.3), it should be evident that ut is
unconditionally not a normal distribution. However, conditional on ut-1, ut is
normally distributed. Using Monte Carlo simulation with a sample size of
10000, a histogram of the distribution of ut is produced as follows.
Figure 17.4
Monte Carlo Simulation of errors ut = et √( α0 + α1ut-1² )
[Histogram of 10,000 simulated ut values. Summary statistics: Mean 0.004843, Median 0.000929, Maximum 3.423084, Minimum −3.927891, Std. Dev. 0.497121, Skewness −0.036257, Kurtosis 5.531066, Jarque-Bera 2671.481 (p-value 0.000000)]
Clearly, unlike the unit normal et, the unconditional ut's empirical distribution as shown in Figure 17.4 has a larger kurtosis (>3) than a normal random variable. The Jarque-Bera test statistic rejects the null of a normal distribution for ut.
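A minimal sketch of the Monte Carlo experiment behind Figure 17.4 follows; the parameter values α0 = 0.2 and α1 = 0.2 are assumptions chosen purely for illustration, since the text does not report the values used.

```python
import numpy as np
from scipy.stats import kurtosis, jarque_bera

alpha0, alpha1 = 0.2, 0.2        # assumed ARCH(1) parameters (illustrative only)
n = 10_000
rng = np.random.default_rng(42)
e = rng.standard_normal(n)

u = np.zeros(n)
for t in range(1, n):
    u[t] = e[t] * np.sqrt(alpha0 + alpha1 * u[t - 1] ** 2)

print("sample kurtosis:", kurtosis(u, fisher=False))    # > 3 indicates fat tails
print("Jarque-Bera statistic, p-value:", jarque_bera(u))
```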
82

The seminal article in this area is Engle, R., (1982), Autoregressive Conditional
Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation,
Econometrica, 50, 987-1008. Its significant generalization is in Bollerslev, T., (1986),
Generalized Autoregressive Conditional Heteroscedasticity, Journal of
Econometrics, 31, 307-327.

When Var(ut) = α0 + α1ut-1² + α2ut-2² + … + αq-1ut-q+1² + αqut-q², we call the conditional variance of ut above an ARCH(q) process.
Besides (17.2), another model of changing conditional variance is
Var(ut) = α0 + α1ut-1² + β1 Var(ut-1) .        (17.4)
This is the generalized autoregressive conditional heteroskedasticity or GARCH(1,1) model in ut. It includes a lagged volatility term. When Var(ut) = α0 + α1ut-1² + α2ut-2² + … + αqut-q² + β1 Var(ut-1) + β2 Var(ut-2) + … + βp Var(ut-p), we call the conditional variance of ut a GARCH(q,p) process with weighted averages of q lagged ut-j²'s and weighted averages of p lagged Var(ut-j)'s. Due
to the large number of parameters that usually has to be estimated in a
GARCH process, parsimony typically dictates modeling with GARCH(1,1)
processes or something not too much more complicated.
Suppose a yt process contains a disturbance random error that behaves according to ARCH in equation (17.2) or GARCH in equation (17.4). Then Var(yt | Φt) = Var(ut | xt, ut-1, σt-1²) = α0 + α1ut-1² + β1σt-1² is no longer constant, but instead changes with t, or more precisely with the information available at t, Φt, even as ut-1 and also σt-1 change over time. Processes such as yt or ut itself are said to exhibit conditional heteroskedasticity or dynamic volatility. However, conditional on the lagged observations, it is possible to model the conditional ut's and also yt's as normally distributed, though their unconditional distributions are in general not normally distributed.
As an example, if Var(ut-1) = 0.1, α0 = 0.02, α1 = 0.2, β1 = 0.7, and ut-1 = 0.5, then Var(ut) = 0.02 + 0.2 × 0.5² + 0.7 × 0.1 = 0.14. We see that this changes over time with new lagged values of residuals and new lagged values of conditional variances.
In what follows, we simulate sample paths of a process {yt}t=1,2,…,200 that is weakly stationary without conditional heteroskedasticity:
yt = ½ + ½ xt + ut ,        (17.5)
xt ~ i.i.d. N(1, 0.4) , ut ~ i.i.d. N(0, 2) .
Suppose we chart the path of yt:
y0 = ½ + ½ x0 + u0 , y1 = ½ + ½ x1 + u1 , and so on.
E(yt) = ½ + ½ E(xt) = 1 ,
Var(yt) = (½)² σx² + σu² = ¼ × 0.4 + 2 = 2.1 .
The plot of the time-path of yt is shown in Figure 17.5. It is seen that yt behaves like a random series with a mean at 1; the two dotted lines are two standard deviations away from the mean. In this case, they are 1 ± 2.9 (about 4 and −2 respectively), with roughly 2% probability of exceeding each way the region between the two dotted lines.
Figure 17.5
Stationary process yt ~ N(1, 2.1)
[Figure: simulated time path of yt about its mean of 1, with dotted bands at two standard deviations]
Next we simulate sample paths of another process {yt}t=1,2,…,200 that follows (17.1) and (17.4) instead:
yt = ½ + ½ xt + ut ,
xt ~ i.i.d. N(1, 0.4) ,
Var(ut) = α0 + α1ut-1² + β1 Var(ut-1) .
We set the initial Var(u0) = σ0² = 2. The initial u0 may be drawn from N(0, 2), but subsequent ut's are not unconditionally normally distributed. Unlike the constant variance of 2 in the earlier stationary process, this GARCH(1,1) process will have a changing variance over time.
We set α = ½, β = ½, α0 = ½, α1 = ¼, and β1 = ½, and draw
y0 = ½ + ½ x0 + u0 , u0 ~ N(0, 2) .
Once u0² and Var(u0) are obtained, we can use (17.4) to obtain Var(u1). Next simulate u1 = e1 √( α0 + α1u0² + β1 Var(u0) ) for e1 ~ N(0,1).
Put y1 = ½ + ½ x1 + u1 .
Next use u1² and Var(u1) to obtain Var(u2) via (17.4).
Then simulate u2 = e2 √( α0 + α1u1² + β1 Var(u1) ) for e2 ~ N(0,1), and so on.
In general,
ut = et √( α0 + α1ut-1² + β1 Var(ut-1) ) .
The plot is shown in Figure 17.6.
Figure 17.6
GARCH error process Var(ut) = 0.5 + 0.25ut-1² + 0.5 Var(ut-1);
unconditional yt has mean 1 and variance 2.1
[Figure: simulated time path of yt with dotted two-standard-deviation bands]
Figure 17.6 shows a similar yt process as in Figure 17.5, with yt = α + βxt + ut. Its unconditional mean and variance are the same as the yt in Figure 17.5: the unconditional mean and variance of yt are 1 and 2.1. However, its variance follows the GARCH error process Var(ut) = 0.5 + 0.25ut-1² + 0.5 Var(ut-1). The Figure shows that yt behaves like a random series with a mean at 1; the two dotted lines are two standard deviations away from the mean. In this case, they are 1 ± 2.9 (about 4 and −2 respectively), with roughly 2% probability of exceeding each way the region between the two dotted lines. There appears to be more volatility. At about the 50th observation, the variance clusters together and y-values persist below −2.
Figure 17.7
GARCH error process Var(ut) = 0.1 + 0.25ut-1² + 0.7 Var(ut-1);
unconditional yt has mean 1 and variance 2.1
[Figure: simulated time path of yt showing persistent, clustered volatility]
We provide another simulation using the same yt = α + βxt + ut with the unconditional mean and variance of yt at the same values of 1 and 2.1 respectively. However, its variance now follows the GARCH error process Var(ut) = 0.1 + 0.25ut-1² + 0.7 Var(ut-1), where clustering or persistence in volatility should be more evident because of the high α1 = 0.25 and the higher β1 = 0.7. Indeed Figure 17.7 shows the persistent and much higher volatility with yt's exceeding +15 and falling below −15 in the observations from 100 to 150. Thus we see that GARCH modeling of variance is able to produce the kind of persistence and clustering in volatility sometimes observed in market prices.
In addition to models such as (17.1) and (17.4), suppose
yt = α + βxt + γσt² + ut ,
where σt² ≡ Var(yt | xt, ut-1) = α0 + α1ut-1². Then the yt process is an ARCH-in-mean or ARCH-M model. This version basically has the variance σt² driving the mean effect E(yt).
It is interesting to note that if we perform a regression of
Yt =c0 + c1 Xt + ut ,
where ut follows a GARCH process, OLS estimators are still BLUE as long as ut
is unconditionally stationary. However in such cases the maximum likelihood
estimators will usually be more efficient. We shall discuss the estimation later.
17.3

STATIONARITY CONDITIONS

It is interesting to note that while GARCH processes are conditionally nonstationary with changing variances, they are still unconditionally stationary
processes. For reasons of data analyses, when we have only one time series or
one sample path, it is important to be able to invoke the law of large numbers
or ergodic theory so that sample averages can converge to population
parameters. The convergence requires stationarity, hence stationarity is
essential for most purposes. We shall show how GARCH processes are also
stationary.
If we expand (17.4):
Var(ut) = α0 + α1ut-1² + β1 Var(ut-1)
        = α0 + α1ut-1² + β1[ α0 + α1ut-2² + β1 Var(ut-2) ]
        = α0(1 + β1) + α1( ut-1² + β1ut-2² ) + β1²[ α0 + α1ut-3² + β1 Var(ut-3) ]
        = α0(1 + β1 + β1² + … ) + α1( ut-1² + β1ut-2² + β1²ut-3² + … )        (17.6)
Let σ² = Var(ut) be the unconditional variance of ut. Taking unconditional expectations on both sides, and assuming there exists stationarity so that σ² = E(ut-1²) = E(ut-2²) = E(ut-3²) = … , then
σ² = α0/(1 − β1) + α1( σ² + β1σ² + β1²σ² + … )
   = α0/(1 − β1) + α1σ²/(1 − β1)
   = [ α0 + α1σ² ] / (1 − β1), supposing |β1| < 1.
Then, σ² = α0 / (1 − α1 − β1), supposing α0 > 0 and |α1 + β1| < 1. In the simulation example of Figure 17.6 above, given the parameters α0 = ½, α1 = ¼, and β1 = ½, the unconditional variance of the GARCH disturbance is σ² = 0.5/(1 − 0.75) = 2. In the same way, in the simulation example of Figure 17.7 above, given the parameters α0 = 1/10, α1 = ¼, and β1 = 7/10, the unconditional variance of the GARCH disturbance is σ² = 0.1/(1 − 0.95) = 2.
17.4

MAXIMUM LIKELIHOOD ESTIMATORS

Let us consider non-linear models such as
Yt = f(X; θ) + et
where X could be more than one explanatory variable, and f(.) is a known non-linear function dependent on parameters θ to be estimated. et is a normally distributed disturbance with zero mean and unknown variance σe². Then Yt − f(X; θ) is distributed as i.i.d. N(0, σe²). Its probability density function is
( 1/√(2πσe²) ) exp{ −½ [ (Y − f(X; θ)) / σe ]² } .
At this point, we can assume σe to be a constant, i.e. the residuals et to be homoskedastic. Later we can relax this to allow et to be affected by past observations as in the case of GARCH residuals.
Given a sample of size N of {Y1, Y2, …, YN, and X}, the likelihood function is
L = Π t=1..N ( 1/√(2πσe²) ) exp{ −½ [ (Yt − f(X; θ)) / σe ]² } .
This is the density function of observing the sample, or the chance (we should strictly be careful not to interpret density as chance) of the sample taking the values that it did.
Taking the natural logarithm, which preserves the relative magnitudes, the log-likelihood function is
log L = −(N/2) log( 2πσe² ) − ½ Σ t=1..N [ (Yt − f(X; θ)) / σe ]² .
Suppose we find estimate values of θ and σe that maximize this log-likelihood function log L. Then the estimates are called Maximum Likelihood estimates. Note that since the log function is monotonic or preserves relative magnitudes, the estimates also maximize L, although it is usually easier to work with the log L function, since for a normal density the log-likelihood takes a simple quadratic form.

Suppose the parameters to be estimated are put in a vector θ [if we use the earlier example, then this vector comprises θ and σe], and the variables Y, X are also notationally subsumed in Z. Then
∫ L(Z; θ) dZ = 1 ,
which is the property of any density function, that the area under the density curve sums to one. Differentiating the above with respect to the parameter θ,
∫ ∂L(Z; θ)/∂θ dZ = 0 .
Since ∂log L/∂θ = (1/L) ∂L/∂θ, so that ∂L/∂θ = L ∂log L/∂θ, then
∫ [ ∂log L(Z; θ)/∂θ ] L dZ = E[ ∂log L(Z; θ)/∂θ ] = 0 .        (17.7)

Note that any parameter values θ that satisfy (17.7), and from this step onward, are values that had maximized the likelihood function L. Hence we are dealing with properties of the maximum likelihood (ML) estimators as follows.
Differentiate (17.7) again, in matrix format since θ is a vector, to obtain
∫ [ ∂²log L(Z; θ)/∂θ∂θT ] L dZ + ∫ [ ∂log L(Z; θ)/∂θ ] [ ∂L(Z; θ)/∂θT ] dZ = 0 .
Or,
∫ [ ∂²log L(Z; θ)/∂θ∂θT ] L dZ + ∫ [ ∂log L(Z; θ)/∂θ ] [ ∂log L(Z; θ)/∂θT ] L dZ = 0 .
Hence
−∫ [ ∂²log L(Z; θ)/∂θ∂θT ] L dZ = ∫ [ ∂log L(Z; θ)/∂θ ] [ ∂log L(Z; θ)/∂θT ] L dZ .
The left-hand side is called Fisher's information matrix,
−E[ ∂²log L(Z; θ)/∂θ∂θT ] , defined as R(θ).


When the parameters are the ML estimates, R(θ) is also the covariance matrix of the vector ∂log L(Z; θ)/∂θ, which has a mean of 0 as seen in (17.7).

Suppose h(Z) is an unbiased estimator of θ. Then, by unbiasedness,
E[ h(Z) ] = ∫ h(Z) L(Z; θ) dZ = θ .
In differentiating the left-hand side, recall that ∂L/∂θ = L ∂log L/∂θ. Differentiating the right-hand side with respect to θT yields the identity matrix I. Therefore the above becomes
∫ h(Z) [ ∂log L(Z; θ)/∂θT ] L(Z; θ) dZ = E[ h(Z) ∂log L(Z; θ)/∂θT ] = I .
Thus, I is the covariance matrix between h(Z) and ∂log L(Z; θ)/∂θT since the latter is a zero-mean vector. Then we can specify the covariance matrix of the stacked vector ( h(Z) , ∂log L/∂θ ) as
cov =  [ cov[h(Z)]   I
         I           R(θ) ] .



Suppose θ is an n × 1 vector. Then h(Z) is also n × 1, and ∂log L/∂θ is n × 1. So the left-hand side is a 2n × 2n covariance matrix. The right-hand side has 4 elements of n × n matrices.
A covariance matrix is always positive semidefinite, so we can choose any arbitrary nonzero vector ( pT , −pTR−1(θ) ), where pT is 1 × n and −pTR−1(θ) is also 1 × n, such that
( pT , −pTR−1 ) [ cov[h(Z)]  I ;  I  R(θ) ] ( p ; −R−1p )
= pT cov[h(Z)] p − pT R−1 p ≥ 0 .

The last line above shows the Cramer-Rao inequality. Thus the covariance
matrix of any unbiased estimator, cov[h(Z)] is larger than or equal to the
inverse of the information matrix R.
We can write, for any arbitrary n × 1 vector p, pT cov[h(Z)] p ≥ pT R−1 p, a 1 × 1 number. Clearly, if we choose pT = (1, 0, 0, …, 0), then pT cov[h(Z)] p = the variance of the unbiased estimator of the first parameter in vector θ, and this is bounded below by the first row-first column element of R−1, say r11. Suppose the diagonal elements of R−1 are r11, r22, …, rkk. Then all unbiased estimators have variances bounded below by the Cramer-Rao lower bounds (r11, r22, …, rkk) respectively. An estimator that attains the lower bound is said to be a minimum variance unbiased or efficient estimator.
Though maximum likelihood estimators are not always unbiased in finite samples (one exception is the linear regression model when et is normally distributed; the OLS and ML estimators are identical in that case),
they are in most cases consistent and asymptotically efficient. This makes ML
a favorite estimation method especially when the sample size is large.
17.5

ESTIMATING GARCH

Given a historical time series of futures price and its changes, how do you
estimate the daily value-at-risk at 95% confidence interval83? Traditional
methods have used sampling variance method (assuming constant conditional
variance). Since there is plenty of evidence of volatility clustering and
contagion effects when markets moved together over a period of several days,
modeling volatility as a dynamic process such as in GARCH (including
ARCH) is useful for the purpose of estimating risk and developing margins
for risk control at the Exchange.84
Suppose for the following day, the volatility of DFt/Ft is forecast at σ̂ in order to estimate the Value-at-Risk or VaR of a long N225 futures contract
83

For bank risk control, it is usual to estimate daily 95%, 97.5%, or 99% confidence
intervals Value-at-risk, sometimes doing this several times in a day or at least at the
close of the trading day. Sometimes a 10-day 95% confidence interval VaR is also
used.
84
Perhaps one of the earliest applications of GARCH technology to Exchange risk
margin setting could be found in Lim, KG, (1996), Weekly Volatility Study of
SIMEX Nikkei 225 Futures Contracts using GARCH Methodology, Technical
Report, Singapore: SIMEX, December, 15pp.

320
position. Daily VaR at the 95% confidence interval is such that Prob( F1 − F0 < −1.645 σ̂ F0 ) = 5%. Recall therefore that the VaR is 1.645 σ̂ F0 .
An Exchange is interested to decide the level of maintenance margin per contract, $x, so that within the next day, the chance of the Exchange taking the risk of a loss before top-up by the client or broker member, i.e. when the event ( |F1 − F0| > x ), or loss exceeding the maintenance margin, happens, is 5% or less. Then x is set by the Exchange to be large enough, i.e. set x ≥ 1.645 σ̂ F0 .
Thus forecasting or estimating σ for the next day is an important task for setting the Exchange contract margin for the following day. We shall model the conditional variance of the daily rate of futures price change DFt/Ft as a GARCH(1,1) process. Assume E[ DFt/Ft ] = 0 over a day.
Let DFt/Ft = ut and then assume Var(ut) follows a GARCH(1,1) process as follows.
Var(ut) = c + αut-1² + β Var(ut-1)        (17.8)
We wish to estimate the parameters {c, α, β} in (17.8) and then use them to forecast the following day's volatility Var(ut+1). Notationally, we let Var(ut) ≡ σt², and an estimate of the volatility σt+1 as σ̂t+1. If we take a historical sampling variance of a past daily time series { DFt/Ft }t=1,2,…,N , and let this sample variance be s², then we assume the initial Var(u0) = s², and the initial u0 = 0. Furthermore, assume ut conditional on ut-1 and on σt-1 is normally distributed. This shall allow us to perform a ML estimation of {c, α, β} in (17.8). Although assuming other distributions could in principle allow a ML estimation, the form of the density function could be complicated for accurate computations.
Here we observe a past history {ut}t=1,2,…,N. The likelihood function of this history is
Π t=1..N ( 1/√(2πσt²) ) exp{ −½ ( ut/σt )² } .
The log-likelihood function of observing these past ut's can be written as

Log L = −(N/2) log(2π) − ½ Σ t=1..N log σt² − ½ Σ t=1..N ut²/σt² .

From (17.8), for a general m > 0 we can write, by expanding:


σm² = c + αum-1² + βσm-1²
    = c[ 1 + β + β² + … + β^(m-1) ] + α[ um-1² + βum-2² + β²um-3² + … + β^(m-2)u1² + β^(m-1)u0² ] + β^m σ0²
    = c (1 − β^m)/(1 − β) + α Σ t=1..m β^(t-1) um-t² + β^m s²        (17.9)
where we have put in u0 = 0 and σ0² = s². In the above formula, for example, when m = 1, σ1² = c + βs².
Putting expression (17.9) in the log-likelihood function, we maximize the following objective function:
Log L* = −½ Σ t=1..N log[ c(1 − β^t)/(1 − β) + α Σ j=1..t β^(j-1) ut-j² + β^t s² ]
         − ½ Σ t=1..N ut² / [ c(1 − β^t)/(1 − β) + α Σ j=1..t β^(j-1) ut-j² + β^t s² ]
where we have removed the constant term −(N/2) log(2π). Note that Log L* is a function of the parameters {c, α, β} given the sample size N and s.
Using (17.9), once the parameters are estimated, the next-day volatility can be estimated as:
σ̂N+1² = c (1 − β^(N+1))/(1 − β) + α Σ t=1..N+1 β^(t-1) uN+1-t² + β^(N+1) s² ,
evaluated at the estimated parameter values.

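A minimal sketch of this ML estimation follows; it is not the author's implementation, and for simplicity it evaluates the conditional variance by the recursion (17.8) rather than by the expanded expression (17.9). The array u is assumed to hold daily futures price change rates DFt/Ft with approximately zero mean.

```python
import numpy as np
from scipy.optimize import minimize

def garch_neg_loglik(params, u, s2):
    c, alpha, beta = params
    if c <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf                         # crude way to enforce stationarity
    n = len(u)
    sigma2 = np.empty(n)
    sigma2[0] = c + beta * s2                 # sigma_1^2 = c + beta*s^2 with u_0 = 0
    for t in range(1, n):
        sigma2[t] = c + alpha * u[t - 1] ** 2 + beta * sigma2[t - 1]
    return 0.5 * np.sum(np.log(sigma2) + u ** 2 / sigma2)

def fit_garch(u):
    s2 = u.var(ddof=1)                        # historical sample variance as Var(u_0)
    x0 = np.array([0.1 * s2, 0.1, 0.8])       # rough starting values
    res = minimize(garch_neg_loglik, x0, args=(u, s2), method="Nelder-Mead")
    return res.x

# Example usage with simulated data standing in for DF_t/F_t:
rng = np.random.default_rng(3)
u_demo = rng.standard_normal(1000) * 0.01
c_hat, a_hat, b_hat = fit_garch(u_demo)
print("estimates (c, alpha, beta):", c_hat, a_hat, b_hat)
```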
17.6

DIAGNOSTIC FOR ARCH-GARCH

It is useful to check if a time series {yt} contains conditional


heteroskedasticity in its residuals. Note that ARCH(q) is modeled by E(ut²) = α0 + α1ut-1² + α2ut-2² + … + αq-1ut-q+1² + αqut-q² , assuming E(ut) = 0. We may also write this heuristically as a regression of ut² on its lags up to lag q, adding a white noise et:
ut² = α0 + α1ut-1² + α2ut-2² + … + αq-1ut-q+1² + αqut-q² + et .
GARCH(q,p) can also have a representation in terms of an infinite number of
lags in squares of the residuals. We show this for the case of GARCH(1,1) in

(17.4). From the section on stationarity, we see that the GARCH process can be expanded in a similar way. The last term on the right-hand side shows an infinite number of lags in the ut-j²'s.
Var(ut) = α0 + α1ut-1² + β1 Var(ut-1)
        = α0 + α1ut-1² + β1[ α0 + α1ut-2² + β1 Var(ut-2) ]
        = α0(1 + β1) + α1( ut-1² + β1ut-2² ) + β1²[ α0 + α1ut-3² + β1 Var(ut-3) ]
        = α0(1 + β1 + β1² + … ) + α1( ut-1² + β1ut-2² + β1²ut-3² + … )
Thus, heuristically, a GARCH process may be expressed as follows for an arbitrarily large number of lags N, where the cj's are constants:
ut² = c0 + c1ut-1² + c2ut-2² + … + cqut-q² + … + cNut-N² + et .
It is clear from both the expressions of ARCH and GARCH above that there are autocorrelations (serial correlations) in the square of the residuals. For ARCH(q), autocorrelation in ut² is non-zero up to lag q, and becomes zero after that lag. For GARCH(q,p), autocorrelation in ut² is non-zero for an arbitrarily long number of lags.
Considering (17.1), suppose we estimate via OLS and then obtain the estimated residuals:
ût = yt − α̂ − β̂xt .
Compute the time series { ût² }. Then, using the Ljung and Box Q-test in Chapter 6 on the ût² and its auto-correlogram (not ût), test if the correlations H0: ρ(1) = ρ(2) = ρ(3) = … = ρ(q) = 0. If H0 is rejected for an auto-correlogram that is significant out to lag q, then ARCH(q) is plausible. If the correlations do not appear to decay or disappear, then a GARCH process is likely. We should also follow up with the Ljung and Box Q-test on the standardized squares ût² / var(ût).
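A minimal sketch of this diagnostic follows (not from the text): regress y on x by OLS, then apply the Ljung-Box Q-test to the squared residuals; the data here are simulated stand-ins.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(7)
x = rng.normal(1.0, 0.6, size=500)            # stand-in explanatory variable
y = 0.5 + 0.5 * x + rng.standard_normal(500)  # stand-in dependent variable

res = sm.OLS(y, sm.add_constant(x)).fit()
u2 = res.resid ** 2

lb = acorr_ljungbox(u2, lags=[5, 10, 20], return_df=True)
print(lb)    # small p-values at many lags would point to ARCH/GARCH effects
```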
17.7

PROBLEM SET

17.1

A portfolio's return rate process rt is modeled as follows:


rt = a + bxt + ht1/2 et
where a, b are constants, xt is an explanatory variable that is observed,

and et is an unobserved i.i.d. disturbance and is distributed as N(0,1).
ht is an unobserved process that is uncorrelated with xt and is modeled
as follows:
ht = 0.2 ht-1 + 0.2 et-12
Suppose consistent estimates of a, b are 0.03 and 0.5 respectively. ht-1
is estimated consistently at 0.01. xt-1 has value 0.10 and current return
rt-1 is effectively 5%. Current price of the portfolio Pt-1 is $10 million.
Next period value xt is 0.12 and its variance is 0.02.
Show how you would compute the absolute and also the relative
Value-at-Risk of the portfolio at time t at 99% confidence interval?
Given:
z-value
% Area to the right of the unit normal curve
1.28
10.0%
1.645
5.0%
1.96
2.5%
2.33
1.0%
2.56
0.5%
17.2

A portfolio's return rate process rt is modeled as follows:


rt = a + bxt + exp(0.5ht) et
where a, b are constants, xt is an explanatory variable that is observed,
and et is the unobserved i.i.d. disturbance and is distributed as N(0, σ²). ht is an unobserved process that is modeled as follows:
ht = 0.5 ht-1 + 2 et-1².
Suppose consistent estimates of a, b, ht-1, and σ² are 0.05, 0.1, 2, and 0.001 respectively. xt-1 = 10 and return rt-1 = 0.05. What is the variance of the disturbance at time t?

17.3

Current price of the portfolio Pt-1 is $100 million. Next period variance var(xt) = 0.2. Suppose the next period expected return is 0. Show how you would compute the Value-at-Risk of the portfolio at time t at the 99% confidence interval. The following table is given.
z-value     % Area to the right of the unit normal curve
1.28        10.0%
1.645       5.0%
1.96        2.5%
2.33        1.0%
2.56        0.5%


17.4

A portfolio's return rate process rt is modeled as follows:


rt = a + bxt + exp(0.5ht) et
where a, b are constants, xt is an explanatory variable that is observed,
and et is the unobserved i.i.d. disturbance and is distributed as N(0,1).
ht is an unobserved process that is independent of xt and et and is
modeled as follows:
ht = 0.9 ht-1 + 3.9 et-12
Suppose consistent estimates of a, b are 0.02 and 0.0005 respectively.
ht-1 is estimated consistently at 6.5. xt-1 took value 10 and current
return rt-1 is effectively 3%. Current price of the portfolio Pt-1 is $100
million. Next period variance of xt is 2. Suppose next period expected
return is 0.
Show how you would compute the Value-at-Risk of the portfolio
at time t at 99% confidence interval?
Given:
z-value
% Area to the right of the unit normal curve
1.28
10.0%
1.645
5.0%
1.96
2.5%
2.33
1.0%
2.56
0.5%

FURTHER RECOMMENDED READINGS


[1] John Y. Campbell, Andrew W. Lo, and A. C. MacKinlay, (1997), The Econometrics of Financial Markets, Princeton University Press.
[2] For risk, a good place to start reading will be: Jorion, Philippe, (2000), Value at Risk, 2nd edition, McGraw-Hill.


Chapter 18
MEAN REVERTING CONTINUOUS TIME PROCESS
APPLICATION: BONDS AND TERM STRUCTURES
Key Points of Learning
Bonds, Credit ratings, Yield-to-maturity, Bond equivalent yield, Bond total
return, Zero coupon bond, Zero yield, Spot rate, Bootstrapping, Spot yield
curve, Credit spread, Continuous Time Process, Short rate, Mean reversion,
Ornstein-Uhlenbeck process, Vasicek model, Cox-Ingersoll-Ross model,
Bond pricing, Treasury slope, Business cycle

In this chapter we shall study bonds and their term structures. Bonds are a
major class of investment assets distinct from equities, and they have salient
features that will be explained. Continuous time stochastic processes are
briefly introduced in this chapter to show their usage in modeling bond prices
and therefore the resulting credit spreads and yields. Multiple regression
analyses involving explanation of credit spreads and of bond returns are
described.
18.1

BONDS

Capital markets in the world may be broadly divided into money markets
where borrowing and lending of short-term monies take place, equity markets
where companies raise capital for production by issuing stocks or shares to
investors, and the bond markets (sometimes called Fixed Income markets)
where medium to long-term borrowings and lendings occur.
Medium one-year to ten-year borrowings are usually called notes while
longer-term borrowings are called debts or bonds. While in the past, bonds
took the form of nice colorful certificates stating the entitlement of holders
(lenders) to payments of interests and principal repayments, these days they
are mostly in electronic entry forms. In the past, when an interest on a bond is
due, holders will tear off a coupon attached to the bond certificate or package
and send it to the company to require payment of interest. As a result, bonds with
periodic interest payments are usually called coupon bonds.
Debts are of two major categories: those that are negotiable instruments
may be bought and sold in a secondary market. These debts include not just
coupon bonds and notes, but also asset-backed and mortgaged-backed and

structured notes with attached derivatives. Non-negotiable debts on the other
hand are usually loans made privately between two parties, and are generally
not tradeable.
Borrowings in the form of bonds come from various sectors of the
economy. When sovereign governments borrow, the bonds are called
government bonds and usually carry the guarantee of the government in
repayment. Thus sovereign bonds typically have very high credit ratings. Of
course, there are governments in some countries that do not appear stable and
their economies appear to be in serious trouble with high inflation and low
incomes. In such cases, default did occur and the credit ratings might be bad.
Corporate bonds issued by commercial companies are very large in size. Debt
borrowing by companies is the other major source of capital financing for
companies apart from equity issues. Banks and financial institutions also issue
debts for various purposes. They usually borrow at cheaper rates and lend at
higher rates.
We shall look at some major characteristics of a bond as follows. These
characteristics would have some impact on the value of the bonds or the prices
that the bonds would be able to sell for in a liquid market.
(a) Credit rating of issuer:
There are several key rating agencies, such as Standard and Poor's, Moody's, and Fitch, that regularly provide updates on ratings of major bonds and corporations. S&P ratings, for example, go from the almost risk-free AAA (Aaa in Moody's) to D which is default. Quality investment grade bonds are
typically BBB and better, while speculative risky bonds are BB and below,
and sometimes also referred to as junk bonds which means they give high
returns but also carry non-trivial probabilities of default. When a bond
defaults, all promised payments in terms of coupons and principals are
canceled, and the investor may hope for only a fraction of the outlay once the
liquidation process is completed.
Bonds issued by firms or counterparties that are deemed to be credit-risky
and which face a positive probability of default (or going bankrupt in some
cases) are priced lower to induce investors to buy. This translates into a higher
yield. Bonds with low credit ratings therefore have higher yields than bonds
with higher credit ratings. If we substract the Treasury yield from the creditrisky yield of a rated bond, we obtain the credit spread of that rated bond.
Credit spreads increase with lower ratings. AAA bonds typically have low
spreads of several basis points. B rated bonds may have spreads of several
hundred basis points (or several % above Treasury rate of the same tenor or
maturity).

(b) Bond coupon and frequency:
The higher the coupon interest and frequency means more interest payments
and at a faster rate, which is good for the investor.
(c) Maturity of Bond:
The longer maturity means that the principal will be repaid only after a longer
time, and is thus more risky for the investor.
(d) Taxability:
Some bonds carry tax-free status, which is common for municipal bonds. Tax-free bonds however usually give a lower interest rate. In general, a high
income-tax investor would prefer tax-free bonds because coupon interest is
considered as taxable income in most other bonds.
(e) Type of bond (straight, callable, puttable, sinking fund, zero-coupon):
Straight bonds have well defined coupon payment dates and final redemption
or maturity date when principal would be repaid. Callable bonds subject the
investor to risk of the bond being called for early redemption by the issuer if
the bond prices are rising. Therefore callable bonds usually carry higher
interest rates as compensation to investors for this call risk. Puttable bonds are
usually the reverse situation when investors can sell the bonds back to the
issuer at an earlier date than maturity. In this case, the interest rate received by
the investor is usually lower to pay for this option they hold. In a common
type of sinking fund bond, the issuer may periodically call back parts of the
bond issue. Like callable bonds, sinking fund bonds carry higher interest
because there is risk to the investor who may have to surrender the bonds at a
time when alternative interest rates are low in the market. Zero coupon bonds
are also discount bonds where the bonds are issued at a deep discount. No
interim coupons are paid. At redemption, the investor is paid the bond's par
value.
(f) Liquidity:
Bonds that cannot be traded or sold in a liquid market usually require a
compensation in terms of higher interest or yield for bearing this risk.
There are other macroeconomic factors that affect the prices of bonds
generally. If inflation is uncertain and inflation risk premium in the form of
expected inflation is high, then via the Fisher effect discussed in an earlier
chapter, this induces a higher nominal interest rate in the economy, and thus
lower bond prices. During economic recession, credit spreads will be
especially wide, so credit-risky bonds will be priced at very low prices. The
converse is true of boom times. Increased money supplies by governments

also tend to reduce the general level of interest rates, and vice-versa.
Governments can directly intervene in the bond market through open market
operation by buying up Treasury bonds to increase money supply and by
selling such bonds to reduce money supply in the economy.
18.2

YIELD-TO-MATURITY

Suppose a coupon bond has the following properties:
Maturity = 10 years
Coupon rate = 5% p.a.
Frequency of payment = twice a year, every 6 months
Par value = $100,000
Final redemption at maturity = $100,000
Suppose a price of $88,000 is paid for this bond. If we find the internal rate of return that equates this current price to the present value of the promised stream of payoffs from holding the bond, the rate is 6.665% p.a. This is worked out as follows, where y = 6.665% below.
88000 = (½ × 5% × 100000)/(1 + y/2) + (½ × 5% × 100000)/(1 + y/2)² + … + (½ × 5% × 100000 + 100000)/(1 + y/2)^20

Notice that the numerator of the cashflows on the right-hand sides are the
interest rate payments. The last payment includes the redemption of $100,000.
The denominator represents the opportunity cost of the investment, and is a
discount rate. As y is annualized, the 6-monthly discount uses y/2 %.
The discount rate y% is also called the Yield-to-maturity of the bond. In
other words, if an investor should hold the bond to maturity, then he can
realize the yield of y% p.a.
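A minimal sketch (not from the text) of solving this pricing equation for the yield-to-maturity by bisection follows, using the same 10-year, 5% semi-annual coupon bond priced at $88,000.

```python
def bond_price(y, coupon_rate=0.05, par=100_000, years=10, freq=2):
    c = coupon_rate / freq * par                  # periodic coupon, $2,500
    n = years * freq                              # number of periods, 20
    pv = sum(c / (1 + y / freq) ** t for t in range(1, n + 1))
    return pv + par / (1 + y / freq) ** n

def ytm(price, lo=0.0001, hi=1.0, tol=1e-10):
    """Bisection: bond_price is decreasing in y."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bond_price(mid) > price:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"YTM = {ytm(88_000):.4%}")                 # about 6.665% p.a.
```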
There is an inverse relationship between the current bond price and the yield-to-maturity (YTM). As YTM rises for a particular bond, its price falls, and vice-versa.
Although YTM is simple to compute and offers some idea of what kinds
of returns accrue in investing in a particular bond, the YTM method has some
shortcomings.
Firstly, YTM assumes that cashflows, specifically the coupon interests,
are re-invested at the same YTM rates over time. This may not be feasible as
short-term interest rates for investment change and are uncertain. Secondly,
the YTM assumes that discount rate for every cashflow in the future till
maturity is the same. In other words, the discount for a cashflow in ten years' time is the same as the discount for a cashflow in one year's time, except that the

ten-year discount is compounded ten times. This is not a true picture of the
real market where sometimes grave uncertainty even two years out will mean
that the discount applied to a two year future cashflow will be much more than
twice that of a one-year cashflow. YTM is a single rate that applies across all
cashflows up to a certain maturity T, and is not as flexible as a framework
whereby different discount rates are applied to different maturities up to T.
In U.S., corporate bonds usually issue coupons twice a year, so the coupon
rate is a semi-annual coupon rate. If a bond issues coupon only once a year,
the annualized rate is usually converted to a semi-annual equivalent rate called
the bond-equivalent yield. For example, if a European bond pays one annual coupon of 6% p.a., then its bond-equivalent yield (BEY), comparable to US bonds that typically pay semi-annual coupons, is found via 2 × [(1.06)^0.5 − 1] = 5.91% p.a. Conversely, a US semi-annual bond at BEY 5.91% p.a. is equivalent in payment at the end of the year to (1 + ½ × 0.0591)² − 1 = 6%.
18.3

BOND TOTAL RETURN

For investments with different horizons other than the bonds maturities, it
makes sense to compute the bonds total rate of return. For investments with a
portfolio comprising both equities and bonds, the bond returns also allow for
measuring returns correlation for diversification purpose.
For computing the bonds total rate of return, we require the following
information:
(a)
(b)
(c)
(d)
(e)
(f)
(g)

initial price
end of horizon price
holding period
reinvestment rate
interim cash-flows
compounding frequency
daycount convention

An example of investing in a 10Y semi-annual 5% p.a. Treasury Bond proceeds as follows. The initial quoted price of 99-16 for the bond implies 99 16/32 % of $1,000,000, or $995,000 of cost of purchase. The end-of-horizon price, assuming a horizon yield (forward yield-to-maturity in one year's time) of 5.25% p.a., is $1m/(1 + 5.25%/2)^18 + the present value of the semi-annual coupons discounted at 5.25%/2 over 18 six-month periods = $982,250. This is the expected amount to be received in one year's time.
The holding period is 1 year. The reinvestment rate is assumed to be 5.2% p.a. The interim cash-flows are $25,000 in 6 months, and another $25,000 in 1 year. The first cash-flow is compounded to $25,000 × (1 + 5.2% × 183/365) = $25,652 in 1 year. The compounding frequency is semi-annual, and the daycount convention is (actual number of days)/365.
Suppose the accrued interest is 10/182 × $25,000 = $1,374 at the start. This means that the bond buyer will pay this additional amount to the bond seller. Suppose the accrued interest at the end of the horizon when selling the bond is 11/183 × $25,000 = $1,503. The seller will obtain this amount in addition to the clean or quoted bond price. The total price is sometimes called the full or dirty price.
Then, the total bond return rate
= Future Value/Present Value − 1
= $(982,250 + 50,652 + 1,503)/$(995,000 + 1,374) − 1
= $1,034,405/$996,374 − 1
= 3.817% p.a.
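A minimal sketch reproducing this total return arithmetic follows; all dollar inputs are taken from the worked numbers above.

```python
initial_clean_price = 995_000                 # 99-16 on $1m par
accrued_at_purchase = 25_000 * 10 / 182
end_clean_price = 982_250                     # expected price at the one-year horizon
accrued_at_sale = 25_000 * 11 / 183

coupon_1 = 25_000 * (1 + 0.052 * 183 / 365)   # 6-month coupon reinvested at 5.2% p.a.
coupon_2 = 25_000                             # coupon received at the horizon

future_value = end_clean_price + coupon_1 + coupon_2 + accrued_at_sale
present_value = initial_clean_price + accrued_at_purchase
total_return = future_value / present_value - 1
print(f"Total bond return over 1 year: {total_return:.3%}")   # about 3.817% p.a.
```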
18.4

ZEROS

In 1982, Merrill Lynch created financial innovation by buying Treasury


Bonds, putting them up as collateral in a trust, and then selling pieces of the
bonds' future cash flows (both coupons and principals) to the public as TIGRs
or tigers (Treasury Investment Growth Receipts). These tiger bonds do
not have intermediate coupons, and so their prices were low with a par return.
This is the beginning of zero coupon bonds or zeros.
Huge success in the sales of tigers led to Salomon Brothers' Certificates of Accrual on Treasury Securities (CATS) and Shearson Lehman's Lehman Investment Opportunity Notes (LIONs). These are
essentially dealers zero coupon bonds.
In 1984, the US Treasury started the Separate Trading of Registered
Interest and Principal of Securities program (STRIPS). In this program, the
Fed served as the bank and essentially enabled dealers to trade in for cash and
also to buy others trade-in of future coupons of a Treasury bond. Essentially,
the US Treasury had started the buying and selling of zero coupon bonds with
maturities from near-term coupons to longer-term principal payments.
Zeros became a generic product (no brand premium) and not attached to dealers' names. Liquidity made TIGRs, LIONs, and CATS all alike, and they (dealers' products) became extinct soon after.
As an example, suppose a semi-annual 10Y 6.7% p.a. T-bond was
stripped so that its principal of $100,000 is sold as a 10 Y zero coupon bond.
The YTM of the original 10Y coupon Treasury may be 6.7% since it sells at par when issued; however, the yield (or internal rate of return) of the new 10Y zero need not be 6.7% p.a. as, unlike the coupon bond, there is no averaging
across the coupon flows every 6 months till maturity. Suppose the zero price
is $50,835, or nearly 49% discount from the par value of $100,000. Then the

yield is computed as (100,000/50,835)^0.1 − 1 = 7% p.a. on annual compounding. Its BEY z is found via (1.07)^10 = (1 + z/2)^20, or z = semi-annual 6.88% p.a. Thus, it is about 18 basis points (0.18%) higher for the zero
compared to a coupon bond of the same maturity. This semi-annual 6.88%
p.a. or annual 7% p.a. is called the 10 Y zero rate and is also called a 10 Y
spot rate.
The idea of zero or spot rate overcomes the shortcomings and problems of
using YTM. The spot rate RT at a given maturity T is the YTM on a zero
coupon bond with that maturity. It is the appropriate discount rate for valuing
the present value of any cash flow occurring on a specific maturity date in the
future. Any two different amounts $A or $2A occurring at T in the future would unambiguously have present values of $a and $2a, using the same T-year spot rate for discounting (assuming also use of the same compounding
frequency and day-count convention).
18.5

SPOT RATES

Suppose the T-year zero-coupon bond is priced at $P(0,T). Define this to be


the percentage per unit dollar par value. The annualized yield-to-maturity
(YTM) y for maturity or tenor T is
y = (1/P(0,T))1/T - 1
where T is number of years. If we use continuous compounding, then
y = T-1 ln (1/P(0,T))
is the annualized continuously compounded YTM.
The zero coupon discount function P(0,T), or zero price, giving rise to y
is a very important function. It provides the appropriate discounting for valuing
any cash flow occurring on that date T. The yield y on a zero coupon bond, or the
zero yield, is also called the spot rate at the given maturity T. The spot rate y is
also the rate that will yield $(1+y)^T at the end of T years for an initial
investment of $1 with no interim cashflows.
The function of spot rate versus maturity is called the zero yield curve or
the spot rate curve. This is not the YTM yield curve. Spot rates for different
maturities can be obtained from the zeros market and plotted on a graph to
give the spot rate curve. However, in the absence of arbitrage opportunities,
which is a useful assumption in a competitive market setting, we can also
derive the zero yield curve from coupon bond prices using a method called
bootstrapping. We illustrate this method as follows.
Suppose there are four coupon bonds as follows. To show how the method
works, we shall assume for simplicity that the bonds pay annual coupons. The
current prices and coupon cashflow payments of a series of bonds with
maturities covering one to four years are shown in Table 18.1.

Table 18.1
Current Coupon Bond Prices

$ Price   Coupon   End Year 1   End Year 2   End Year 3   End Year 4
 99.30    4%         104
 99.50    4.5%         4.5        104.5
100.20    5%           5            5          105
101.20    5.5%         5.5          5.5          5.5        105.5

To find the term structure of yields, first determine the one-year spot rate y1.
Set the one-year coupon bond price equal to the present value of its cashflow in
one year's time:
99.30 = 104/(1 + y1), which implies y1 = 4.733%. Thus, the one-year zero price =
1/(1.04733) = 0.9548.
Now set the two-year coupon bond price equal to the present value of its
cashflows in one year's and two years' time:
99.50 = 4.5/1.04733 + 104.5/(1 + y2)^2, which implies a two-year spot rate y2 =
4.769%. The two-year zero price = 1/(1.04769)^2 = 0.9110. Similarly,
100.20 = 5/1.04733 + 5/(1.04769)^2 + 105/(1 + y3)^3 implies a three-year spot
rate y3 = 4.935%. The three-year zero price = 0.8654.
And 101.20 = 5.5/1.04733 + 5.5/(1.04769)^2 + 5.5/(1.04935)^3 + 105.5/(1 + y4)^4
implies a four-year spot rate y4 = 5.187%. The four-year zero price is 0.8169.
Hence the spot rates are found and can be plotted as a spot rate curve or zero
yield curve. See Figure 18.1 below.
Figure 18.1
Spot Rate Curve
[Plot of yield % against maturity in years; for example, the three-year spot rate of 4.935% lies on the curve.]
It is important to emphasize the usefulness of the spot rates in pricing a fixed
income security. The spot curve is the fundamental building block for the
valuation of any fixed income security. Just as a coupon bond can be
decomposed into a portfolio of zero coupon bonds, spot rates allow any
combination of timed cashflows or coupon flows to be priced as such a
portfolio. For example, a 4Y coupon bond may have the following specifications:
(a) Repayment of principal of par $100,000 at end of 4 years
(b) First coupon payment of effective 5% at end of 1Y
(c) Second coupon payment of effective 10% at end of 3Y
(d) No coupon payment in the second and final years.
To find its no-arbitrage current price, we just need the spot rates (or
equivalently the zero prices) at the points of the cashflows, i.e. at the end
of 1Y, 3Y, and 4Y. Using the numbers from the bootstrapping illustration, the
1Y, 3Y, and 4Y zero prices are 0.9548, 0.8654, and 0.8169. Hence, the present
value of this bond is $5,000(0.9548) + $10,000(0.8654) + $100,000(0.8169)
= $95,118. If the market price of this bond is anything other than this, then
arbitrage opportunities would be possible by trading in the 1Y, 3Y, and 4Y zeros
against the bond.
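Continuing the illustrative Python sketch above (hypothetical, using the bootstrapped zero prices), the pricing of this 4Y bond is a one-line discounting exercise:

    # Sketch: price the 4Y bond described above from the bootstrapped discount factors.
    zero_price = {1: 0.9548, 3: 0.8654, 4: 0.8169}   # from the bootstrapping step
    cashflows = {1: 5_000, 3: 10_000, 4: 100_000}    # dated cash flows of the bond

    pv = sum(cf * zero_price[t] for t, cf in cashflows.items())
    print(f"no-arbitrage price = ${pv:,.0f}")        # about $95,118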
18.6  TERM STRUCTURE OF INTEREST

The relationship between yield and maturity on securities differing only in
length of time to maturity (credit quality, tax status, and liquidity conditions
across maturities assumed constant) is called the term structure of interest rates
or yields. Usually the term structure is displayed using a yield curve showing
yield % on the vertical scale versus maturity T in years on the horizontal scale.
As we have seen, the term structure of YTM is not very useful for pricing.
However, the term structure of spot rates or zero yields is essential. The
most common term structure of spot rates is that of Treasury spot rates (or rates
bootstrapped from Treasury coupon bonds). Other term structures for Aaa, Aa,
A, Baa, and lower credit-rating bonds are also possible when there is
sufficient liquidity in them.
As there are only a finite number of bonds to bootstrap from, spot rates are
available only at a finite number of maturities along the spot yield curve or
spot rate curve (sometimes less appropriately termed the yield curve). A
continuous spot rate curve can be constructed from this finite number of points
by cubic splines or some other form of curve-fitting, as sketched below.
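A minimal illustration of such curve-fitting (assuming Python with SciPy is available; the cubic spline is only one of several reasonable interpolation choices):

    # Sketch: fit a continuous spot curve through the bootstrapped points of Table 18.1.
    import numpy as np
    from scipy.interpolate import CubicSpline

    maturities = np.array([1.0, 2.0, 3.0, 4.0])
    spot_rates = np.array([0.04733, 0.04769, 0.04935, 0.05187])

    curve = CubicSpline(maturities, spot_rates)      # cubic spline through the four points
    print(f"interpolated 2.5-year spot rate: {float(curve(2.5)):.4%}")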
Figure 18.2 shows the spot rate curves of Treasury, Aa, and Ba bonds.
Treasury spot curves usually go out to 30 years as there are long-term U.S.
Treasury bonds of up to 30Y in the market. Less credit-worthy bonds do not
have as long maturities and may also be illiquidly traded at long maturities.

The credit spread for the Aa bonds is shown, and in this example, the credit
spread widens as maturity increases, given the greater uncertainties and risks
associated with a longer horizon.
Figure 18.2
Term structures: spot yield curves for different credit ratings
[Plot of yield % against maturity, showing upward-shifted curves for Treasury, Aa, and Ba bonds; the gap between the Aa and Treasury curves is the credit spread.]

18.7  CONTINUOUS TIME STOCHASTIC PROCESSES

A continuous time stochastic process that possesses the strong Markov property
and has sample paths (continuous observations in time) that are continuous
functions of time (no discontinuities or discrete jumps) is called a diffusion
process.85 The application of continuous time mathematics to finance was
largely pioneered by Merton.86 Many practical Markov processes used in
financial modeling can be approximated by a diffusion process when the time
interval between observation points becomes very small. Likewise, for many
diffusion processes, when the interval is very small, we may sometimes choose
to use a discrete process as an approximation.
Diffusion processes can be expressed as stochastic differential equations
(SDEs), such as the lognormal diffusion process (sometimes called Geometric
Brownian Motion), dS_t = μ S_t dt + σ S_t dW_t, where S_t is the underlying
asset price at time t, W_t is a Wiener process with W_0 ≡ 0 and (W_t − W_0)
distributed as N(0, t), μ is the instantaneous or infinitesimal mean or drift of
the process, and σ is the instantaneous volatility. Just as the solution of a
partial differential equation is a multivariate deterministic function, the
solution of a stochastic differential equation is a random variable characterized
by a probability distribution.

85 See a classic such as Samuel Karlin and Howard M. Taylor, (1981), A Second
Course in Stochastic Processes, Academic Press.
86 See Robert C. Merton, (1990), Continuous-Time Finance, Basil Blackwell.
The short rate is the spot interest rate as the term goes to zero. Let the
short rate be r; this is the instantaneous spot rate. An example of a diffusion
process for the short rate is:

dr_t = (α − β r_t) dt + σ r_t^γ dW_t        (18.1)

where α, β, σ, and γ are constants. In the literature on interest rate
modeling, many different models have been applied to study interest rate
dynamics and the associated bond prices. Many of the models are
subsumed under the class represented by (18.1), and take different forms
for different restricted values of the parameters α, β, σ, and γ. When α =
β = 0 and γ = 1, there is the Dothan model.87 When just γ = 1, there is the
Brennan-Schwartz model.88 When just γ = 0, there is the Vasicek
model.89 And when γ = ½, there is the Cox-Ingersoll-Ross (CIR)
model.90
The Vasicek model, dr_t = (α − β r_t) dt + σ dW_t, for α > 0 and β > 0, is
essentially a version of the well-established Ornstein-Uhlenbeck
process, which is known to be a mean-reverting process and has an
analytical solution. It can also be rewritten as

dr_t = κ(θ − r_t) dt + σ dW_t        (18.2)

where the equivalence of (18.2) and (18.1) is readily seen when we put α =
κθ and β = κ. Mean reversion occurs because when r_t exceeds (falls
below) θ, there is a drift with positive speed κ that brings the future
short rate back toward θ. The solution to the SDE (18.2) is:

87 Dothan, Uri, (1978), On the term structure of interest rates, Journal of Financial
Economics 6, 59-69.
88 Brennan, M.J. and Eduardo S. Schwartz, (1977), Savings bonds, retractable bonds,
and callable bonds, Journal of Financial Economics 3, 133-155.
89 Vasicek, Oldrich, (1977), An equilibrium characterization of the term structure,
Journal of Financial Economics 5, 177-188.
90 Cox, John C., J.E. Ingersoll, and Stephen A. Ross, (1985), A theory of the term
structure of interest rates, Econometrica 53, 385-407.

r_t = e^(−κt) r_0 + θ(1 − e^(−κt)) + σ e^(−κt) ∫_0^t e^(κu) dW_u        (18.3)

where r_0 is the short rate at initial time t = 0. The mean or expectation of r_t is

E[r_t] = θ + e^(−κt)(r_0 − θ),        (18.4)

and the variance is

var[r_t] = σ² e^(−2κt) ∫_0^t e^(2κu) du = (σ²/(2κ)) (1 − e^(−2κt)).        (18.5)

Moreover, in the Vasicek model, r_t is seen from (18.3) to be normally
distributed, as it is an integral or summation of normal increments dW_u. This
last feature of the Vasicek model is not desirable, as it implies a non-zero
probability that r_t attains negative values. Since r_t is a nominal interest rate, it
is not proper for r_t to be negative, or there would be infinite arbitrage from
borrowing at a negative cost of funds.
From the mean and variance equations in (18.4) and (18.5), it is seen that
over time, as t increases, the short rate r_t converges to a stationary random
variable about the long-run mean θ > 0, regardless of the starting point r_0.
When the time interval Δ between two observations on the short rate is
small, the process may be approximated by a discrete process as follows:

r_{t+Δ} − r_t = (α − β r_t)Δ + e_{t+Δ}        (18.6)

where e_{t+Δ} is an i.i.d. normally distributed r.v. with mean E(e_{t+Δ}) = 0
and var(e_{t+Δ}) = σ² r_t^{2γ} Δ. (18.6) may be expressed alternatively as
r_{t+Δ} − r_t = κ(θ − r_t)Δ + e_{t+Δ}. This approximation was used in Chan
et al. and others.91 A simulation sketch of this discretization is shown below.
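The following is a minimal simulation sketch of the Euler discretization (18.6) for the Vasicek case (γ = 0); the parameter values are assumed purely for illustration and are not estimates from the text:

    # Sketch: simulate the Vasicek/Ornstein-Uhlenbeck short rate via discretization (18.6).
    import numpy as np

    kappa, theta, sigma = 0.5, 0.03, 0.01      # speed, long-run mean, volatility (assumed)
    dt = 1 / 365                               # daily step
    n_steps = 5 * 365
    rng = np.random.default_rng(0)

    r = np.empty(n_steps + 1)
    r[0] = 0.01                                # starting short rate
    for t in range(n_steps):
        shock = sigma * np.sqrt(dt) * rng.standard_normal()
        r[t + 1] = r[t] + kappa * (theta - r[t]) * dt + shock

    print(f"simulated mean over the last year: {r[-365:].mean():.4f}")  # drifts toward theta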
Short rate models in the class of (18.1) are one-factor models
because there is only one state variable or source of uncertainty
affecting the stochastic changes in r_t, namely dW_t. Short rate models
are, however, extremely important in bond pricing under a no-arbitrage
rational expectations framework. This is because if we assume a full
spectrum of discount bonds B(0,t) for maturities t in the future, where
0 < t < T, then each of these can be priced by the following expectation:

91 See K.C. Chan, G. Andrew Karolyi, F.A. Longstaff, and Anthony B. Sanders, (1992),
An Empirical Comparison of Alternative Models of the Short-Term Interest Rate,
The Journal of Finance, Vol 47, No. 3, 1209-1227.

B(0,t) = E_0^P[ exp( −∫_0^t r_u du ) ]        (18.7)

where the superscript P on the conditional expectation at time 0 denotes
employment of a risk-neutral probability distribution. In the
derivatives literature, this is akin to the no-arbitrage condition. If we
look at the solution of a short rate model such as (18.3), then the RHS
of (18.7) is intuitively solvable and would lead to an expression
involving the parameters of the short rate process; this in turn gives the
t-period-to-maturity discount bond price B(0,t) on the LHS.92
For many short-rate models, the discount bond price at time t, B(t,T),
can be solved as B(t,T) = exp[ a(t,T) − b(t,T) r_t ] for functions a(t,T)
and b(t,T) that depend on the short-rate model parameters.
Thus, theoretical zero coupon bond prices and their derivatives,
such as coupon bonds and bond derivatives, are related to the short rate
models and are affected by the parameter values in the short rate
models. If actual empirical bond prices are available, they can be used
to calibrate the parameters of the short rate, that is, to imply out
the parameter values from the theoretical bond price. On the other
hand, observed short rates (or observed proxies) can be used to directly
test different short rate models and estimate their parameters. The
parameters and the bond price models can then be used in different
ways to explain or forecast bond derivative prices for hedging or for
speculative trading purposes.
If a short-rate model such as (18.2) is solved as in (18.3), then the
analytical probability distribution of r_t is obtained, and this can be used
to estimate the parameters via the maximum likelihood method.
However, for many short-rate models, including multi-factor
models, a complete solution is not possible, and thus the maximum
likelihood method cannot be applied. Some distribution-free semi-parametric
or numerical methods can be used instead. Another
approach is to use approximations such as (18.6). Discrete
approximations usually produce a simpler short-rate model that can be
solved to obtain a probability distribution of r_t.
92 Interest rate modeling is quite mathematical. An example of the many books that
can be consulted is Musiela, M. and M. Rutkowski, (1998), Martingale Methods in
Financial Modelling, Springer.

18.8  ESTIMATION OF DISCRETIZED MODELS

Daily one-month Treasury bill rates in the secondary market from August
2001 to March 2010 are obtained from the Federal Reserve Bank of New
York public website. The one-month spot rates are treated as proxies for the
short rate r_t. The time series of this rate is shown in Figure 18.3.
Figure 18.3
Daily One-Month Treasury Bill Rates in the Secondary Market from August 2001 to March 2010
[Plot of the annualized daily Treasury one-month spot bond-equivalent yield, ranging between 0 and about 6%, against time from August 2001 to March 2010.]

It is seen that the rates increased spectacularly from 2003 till 2007, when the
U.S. property market was booming. The rates collapsed in 2008 and 2009
together with the global financial crisis, as central banks cut interest rates.
We shall use the linear regression method to provide a preliminary investigation
of the plausibility of the Dothan, Vasicek, Brennan-Schwartz, and CIR
short-rate models.
Dothan's approximate discrete model is

r_{t+Δ} − r_t = σ r_t ε_{t+Δ}        (18.8)

where ε_{t+Δ} ~ N(0, Δ). The implication is that (r_{t+Δ} − r_t)/r_t ~ N(0, σ²Δ).
It should be pointed out that, of the models mentioned in this section, all
except the Vasicek model (including Dothan's) do
not imply a normal distribution for the short rate in continuous time.
Hence the discretized version with normal errors is merely an
approximation.
Figure 18.4
Test of Normality of (r_{t+Δ} − r_t)/r_t ~ N(0, σ²Δ)
[Histogram of DRR = (r_{t+Δ} − r_t)/r_t with the following summary statistics.]

Series: DRR, Sample 2 2172, Observations 2170
Mean 0.027614, Median 0.000000, Maximum 15.99488, Minimum -0.794818
Std. Dev. 0.450052, Skewness 22.86198, Kurtosis 748.8531
Jarque-Bera 50487537 (Probability 0.000000)

Figure 18.4 and the embedded statistics show that (r_{t+Δ} − r_t)/r_t is not normally
distributed. Thus (18.8) may not be a good description of the proxy short
rates.
Next we explore regression using the Vasicek discrete model in (18.6):

r_{t+Δ} − r_t = (α − β r_t)Δ + e_{t+Δ}        (18.9)

where e_{t+Δ} is i.i.d. normally distributed N(0, σ²Δ). Let a = αΔ and
b = −βΔ. Then we perform the regression

Δr_t = a + b r_t + e_{t+Δ}.

The results are shown in Table 18.2. White's heteroskedasticity-consistent
adjustment is made. Though the estimates are not
significantly different from zero, they nevertheless have the correct
signs. We use Δ = 1/365, so κ = β = −365b = 0.549 and θ = α/β =
a/(βΔ) = 0.0108. If the model is correct, it suggests a long-run mean
of 1.08% on an annualized basis and a mean-reversion adjustment
speed of 0.549. However, the residuals are not normal, as seen in Figure
18.5, and this contradicts one implication of the Vasicek model. There
is also strong correlation in the residuals.
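The regression in Table 18.2 below was run in EViews; as an illustrative sketch only, the same estimation with White standard errors could be set up in Python as follows, assuming a hypothetical pandas Series named rate holding the daily one-month rates:

    # Sketch of the Table 18.2 regression (the text uses EViews; `rate` is hypothetical here).
    import statsmodels.api as sm

    dr = rate.diff().dropna()                    # dependent variable: r_{t+D} - r_t
    X = sm.add_constant(rate.shift(1).dropna())  # regressor: lagged level r_t plus intercept
    X, dr = X.align(dr, join="inner", axis=0)

    res = sm.OLS(dr, X).fit(cov_type="HC0")      # White heteroskedasticity-consistent SEs
    a, b = res.params
    delta = 1 / 365
    kappa = -b / delta                           # mean-reversion speed beta (= kappa)
    theta = a / (kappa * delta)                  # long-run mean alpha/beta
    print(res.summary())
    print(f"kappa = {kappa:.3f}, theta = {theta:.4f}")

The same template applies, with the transformed dependent variables and regressors given below, to the Brennan-Schwartz and CIR regressions of Tables 18.3 and 18.4.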

Table 18.2
Regression Δr_t = a + b r_t + e_{t+Δ}, e_{t+Δ} ~ i.i.d. N(0, σ²Δ)

Dependent Variable: DR
Method: Least Squares
Date: 04/18/10  Time: 23:58
Sample: 2 2172; Included observations: 2171
White heteroskedasticity-consistent standard errors & covariance

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           1.63E-05       2.81E-05      0.579256       0.5625
R          -0.001504       0.001179     -1.275752       0.2022

R-squared 0.000695           Mean dependent var -1.67E-05
Adjusted R-squared 0.000234  S.D. dependent var 0.000977
S.E. of regression 0.000977  Akaike info criterion -11.02359
Sum squared resid 0.002070   Schwarz criterion -11.01836
Log likelihood 11968.11      Hannan-Quinn criter. -11.02168
F-statistic 1.508434         Durbin-Watson stat 1.653049
Prob(F-statistic) 0.219512

Figure 18.5
Test of Normality of e_{t+Δ} ~ N(0, σ²Δ)
[Histogram of the regression residuals with the following summary statistics.]

Series: Residuals, Sample 2 2172, Observations 2171
Mean 5.50e-20, Median -2.44e-06, Maximum 0.009721, Minimum -0.011617
Std. Dev. 0.000977, Skewness -0.642009, Kurtosis 38.81535
Jarque-Bera 116183.6 (Probability 0.000000)
The discrete approximation of the Brennan-Schwartz short-rate model is:

r_{t+Δ} − r_t = (α − β r_t)Δ + r_t e_{t+Δ}

where e_{t+Δ} is i.i.d. normally distributed N(0, σ²Δ). Let y_{t+Δ} =
(r_{t+Δ} − r_t)/r_t; then the model implies

y_{t+Δ} = −βΔ + αΔ(1/r_t) + e_{t+Δ}.

Let a = −βΔ and b = αΔ. Then we perform the regression

y_{t+Δ} = a + b(1/r_t) + e_{t+Δ}.

The results are shown in Table 18.3.

Table 18.3
Regression y_{t+Δ} = a + b(1/r_t) + e_{t+Δ}, e_{t+Δ} ~ i.i.d. N(0, σ²Δ)

Dependent Variable: DRR
Method: Least Squares
Date: 04/19/10  Time: 00:01
Sample (adjusted): 2 2171; Included observations: 2170 after adjustments
White heteroskedasticity-consistent standard errors & covariance

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           0.000216       0.008442      0.025591       0.9796
RINV        6.73E-05       1.60E-05      4.209572       0.0000

R-squared 0.032504           Mean dependent var 0.027614
Adjusted R-squared 0.032058  S.D. dependent var 0.450052
S.E. of regression 0.442780  Akaike info criterion 1.209432
Sum squared resid 425.0444   Schwarz criterion 1.214669
Log likelihood -1310.233     Hannan-Quinn criter. 1.211347
F-statistic 72.83674         Durbin-Watson stat 2.181601
Prob(F-statistic) 0.000000

In the above regression, the estimated intercept a = −βΔ is not negative,
implying a negative mean-reversion speed β, which is inconsistent with the model.
The discrete approximation of the CIR short-rate model is:

r_{t+Δ} − r_t = (α − β r_t)Δ + r_t^{1/2} e_{t+Δ}

where e_{t+Δ} is i.i.d. normally distributed N(0, σ²Δ). Let y_{t+Δ} =
(r_{t+Δ} − r_t)/r_t^{1/2}; then the model implies

y_{t+Δ} = αΔ(1/r_t^{1/2}) − βΔ r_t^{1/2} + e_{t+Δ}.

In the regression we add a constant c, which should be insignificantly different
from zero. Let b1 = αΔ and b2 = −βΔ. Then we perform the regression

y_{t+Δ} = c + b1(1/r_t^{1/2}) + b2 r_t^{1/2} + e_{t+Δ}.

The results are shown in Table 18.4.
Table 18.4
Regression y_{t+Δ} = c + b1(1/r_t^{1/2}) + b2 r_t^{1/2} + e_{t+Δ}, e_{t+Δ} ~ i.i.d. N(0, σ²Δ)

Dependent Variable: DRSQRR
Method: Least Squares
Date: 04/19/10  Time: 00:07
Sample (adjusted): 2 2171; Included observations: 2170 after adjustments
White heteroskedasticity-consistent standard errors & covariance

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C          -0.000630       0.001408     -0.447633       0.6545
SQRRINV     7.82E-05       2.64E-05      2.960648       0.0031
SQRR       -0.000523       0.007337     -0.071293       0.9432

R-squared 0.009705           Mean dependent var 0.000317
Adjusted R-squared 0.008791  S.D. dependent var 0.012484
S.E. of regression 0.012429  Akaike info criterion -5.936154
Sum squared resid 0.334769   Schwarz criterion -5.928298
Log likelihood 6443.727      Hannan-Quinn criter. -5.933281
F-statistic 10.61857         Durbin-Watson stat 2.018086
Prob(F-statistic) 0.000026

In the regression reported in Table 18.4, the estimated coefficients are of the
correct signs and one of them is highly significant. With Δ = 1/365,
κ = β = −365 b2 = 0.191, and θ = α/β = b1/(−b2) = 0.149.
However, the estimate of the long-run mean θ appears too high for a Treasury
short rate during this sampling period.
The above preliminary results, based on a simple discretized
approximation of the various continuous time short rate models, appear to
reject the models. Similar results are found in a study by Nowman (1997).93

93 See Nowman, K.B., (1997), Gaussian Estimation of Single-Factor Continuous
Time Models of the Term Structure of Interest Rates, The Journal of Finance, Vol.
52, No. 4, 1695-1706.

18.9  CREDIT SPREAD CHANGES

In Duffee (1998) and Collin-Dufresne et al. (2001),94 an increase in the 3-month
U.S. Treasury bill rate and an increase in the slope of the Treasury yield curve
each lead, separately, to a decrease in the credit spread of corporate bonds,
especially bonds with longer maturities and lower ratings.
One way of understanding this is that when times are good, such as the boom
around a business cycle peak, and demand for loanable funds is high, short-maturity
Treasury interest rates (and other loanable funds rates as well)
go up. Likewise, the expected next-period short rate is high. The latter implies
that today's long rate, which incorporates the high expected next-period rate,
would also be high.
Higher long rates mean that the slope of the yield curve also goes up. In
these situations, the market's assessment of the credit risks of corporate
bonds is lower, and thus the credit spread (or credit premium, the difference
between the credit-risky bond's interest rate and the Treasury interest rate of the
same maturity) would narrow.
On the other hand, when the Treasury yield curve slope starts to decrease
and turn negative, there is an expectation of future short rate decreases, a
signal of the market's negative assessment of future economic prospects.95
To empirically examine this economic intuition, we perform a regression
analysis as follows. Monthly spot interest rates of U.S. Treasury bills and
bonds, and also of Ba-rated bonds, from February 1992 till March 1998 were
obtained from the Lehman Brothers Fixed Income Database. We first construct
monthly Ba-rated credit spreads for the 9-year term by taking the Ba-rated
9-year spot rate less the Treasury 9-year spot rate. We then construct
the Treasury yield curve slope (or a proxy for the slope, since the yield curve is
not exactly a straight line) by taking the Treasury 10-year spot rate less the
Treasury 1-month spot rate.
Finally, we employ the Treasury 1-month spot rate as the proxy for the
short rate (a very short-term Treasury spot rate). A regression of the credit
spread on the 1-month Treasury spot rate and the Treasury slope using the set of
74 monthly observations is performed. The results are shown in Table 18.5. It
is indeed seen that the coefficient estimates on the 1-month Treasury spot rate
and on the Treasury slope are both negative, as predicted. Regression using the
3-month Treasury spot rate as the proxy for the short rate yields similar
results. However, the coefficients for this sampling period are not significantly
negative based on the t-tests.

94 See Duffee, Gregory R., (1998), The Relation Between Treasury Yields and
Corporate Bond Yield Spreads, The Journal of Finance, Vol LIII, No 6, 2225-2241,
and also Pierre Collin-Dufresne, Robert S. Goldstein, and J. Spencer Martin, (2001),
The Determinants of Credit Spread Changes, The Journal of Finance, Vol LVI, No
6, 2177-2207.
95 Ang, Andrew, M. Piazzesi, and Min Wei, (2006), What does the yield curve tell us
about GDP growth? Journal of Econometrics 131, 359-403, comment that every
recession in the U.S. after the mid-1960s was predicted by a negative yield curve
slope within 6 quarters.
Table 18.5
Regression of credit spread on Treasury short rate and Treasury slope, 2/1992 to 3/1998 (74 observations)

Dependent Variable: CREDITSPREAD
Method: Least Squares
Date: 04/18/10  Time: 10:36
Sample: 1 74; Included observations: 74

Variable        Coefficient    Std. Error    t-Statistic    Prob.
C               0.071755       0.022873      3.137055       0.0025
T1M            -0.387103       0.337696     -1.146305       0.2555
TERMSTRSLOPE   -1.120837       0.907097     -1.235631       0.2207

R-squared 0.024888            Mean dependent var 0.043620
Adjusted R-squared -0.002580  S.D. dependent var 0.017919
S.E. of regression 0.017943   Akaike info criterion -5.163593
Sum squared resid 0.022857    Schwarz criterion -5.070185
Log likelihood 194.0529       Hannan-Quinn criter. -5.126331
F-statistic 0.906070          Durbin-Watson stat 2.093021
Prob(F-statistic) 0.408730
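As a sketch of how the above credit-spread regression could be set up outside EViews (the column names below are hypothetical placeholders for the Lehman data, and pandas/statsmodels are assumed):

    # Sketch of the Table 18.5 regression; `df` is a hypothetical monthly DataFrame with
    # columns 'ba_9y', 'tsy_9y', 'tsy_10y', and 'tsy_1m' holding the relevant spot rates.
    import statsmodels.formula.api as smf

    df["credit_spread"] = df["ba_9y"] - df["tsy_9y"]      # Ba 9Y minus Treasury 9Y
    df["term_slope"] = df["tsy_10y"] - df["tsy_1m"]       # proxy yield curve slope

    model = smf.ols("credit_spread ~ tsy_1m + term_slope", data=df).fit()
    print(model.summary())   # expect negative signs on tsy_1m and term_slope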

The credit spread and the Treasury slope are indeed important factors in
the economy. They can also explain cross-sectional variation in bond
returns. We refer the reader to the article by Fama and French (1993) on the
recommended reading list for this idea.
18.10  PROBLEM SET

18.1
Suppose a record of the last 50 years was taken, and 10 of the years
had negative GDP growth while the other 40 years had positive GDP
growth. In the 10 years with negative GDP growth, or recession, 7 of
these years were preceded by an inverted yield curve in the year just prior
to the recession. In the other 40 years of positive GDP growth, 7 of those
years were likewise preceded by an inverted yield curve in the prior year.
If you see an

inverted yield curve this year, what is the probability of a recession
next year given just the above information?
If you could run an OLS linear regression using actual GDP growth
numbers, Y, as dependent variable and actual yield curve slope, X, as
explanatory variable, and suppose the OLS result is
Y_t = 0.01 + 1.2 X_{t−1}
where the estimated residual variance is σ̂_u² = 0.0004. Assuming the
residual is i.i.d. normally distributed with zero mean, and that this
period X_t = −0.02, what is the estimated probability of a recession next
year using the linear model? (Use the following unit normal
distribution table.) If the two answers above do not match, explain
why this is so (a couple of sentences will do). If they match, you
do not need to explain.
a          0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
Prob(z<a)  0.50  0.54  0.58  0.62  0.66  0.69  0.73  0.76  0.79  0.82

18.2

Estimate the parameters in the following relationship between the
constant-maturity 5-year Treasury bond price P and the 3-month Treasury
bill rate R of the same country. These are end-of-month data. Interpret
your linear OLS estimates briefly.

P_t = A e^(B R_t)

where t = 1, 2, 3, ..., T denotes consecutive trading months (end of month).

Time   Jan   Feb   Mar   Apr   May   Jun   Jly   Aug   Sep   Oct
Rt %   3.5   3.3   3.2   3.4   3.5   3.5   3.6   3.7   3.7   4.0
Pt     78    80    81    79    78    78.5  77    76.5  76    75

FURTHER RECOMMENDED READINGS


[1] Duffee, Gregory R., (1998), The Relation Between Treasury Yields and
Corporate Bond Yield Spreads, The Journal of Finance, Vol LIII, No 6,
2225-2241.
[2] Fabozzi, Frank J., (2005), The Handbook of Fixed Income Securities,
7th edition, McGraw-Hill.
[3] Fama, Eugene, and Kenneth French, (1993), Common risk factors in the
returns on stocks and bonds, Journal of Financial Economics 33, 3-56.
[4] Musiela, Marek, and Marek Rutkowski, (1998), Martingale Methods in
Financial Modelling, Springer Applications of Mathematics series 36.
[5] Pierre Collin-Dufresne, Robert S. Goldstein, and J. Spencer Martin,
(2001), The Determinants of Credit Spread Changes, The Journal of
Finance, Vol LVI, No 6, 2177-2207.
[6] Sundaresan, Suresh, (1997), Fixed Income Markets and Their
Derivatives, South-Western Publishing.


Chapter 19
IMPLIED PARAMETERS
APPLICATION: OPTION PRICING
Key Points of Learning
Warrants, Call, Put, Option premium, Intrinsic value, Time value, Put-call
parity, Black-Scholes formula, Historical volatility, Implied volatility,
Efficient forecast, Errors-in-variable problem, Leverage effect

In this chapter we provide a basic introduction to two fascinating instruments
that have mesmerized Wall Street for a while: the call and the put. In their
complexities hide great opportunities for trading profits, but also
treacherous paths that lead to destruction for the unwary. Together with
futures and swaps, these derivatives form the basic building blocks of the
derivatives jungle. Knowing how they are priced, and how their prices may be
used to infer information about the market, is important.
19.1  WARRANTS AND OPTIONS

In the early 1960s, equity warrants issued by companies in the U.S. garnered
interest from the investing public due to the leverage they can provide. It cost
only a small fraction of a share to purchase a warrant, but when the stock got
into the money, the warrant could be converted into a share for a handsome
profit. However, unlike common equity, which does not expire unless the company
goes belly-up, warrants typically have an expiry date. If the underlying stock
price was not sufficiently high at the expiry date of the warrant, then the
warrant simply expired, and the holder of the warrant lost the original
amount paid for it.
The complex payoff nature of warrants and their inherent mathematical structure
motivated serious research from the 1960s through the early 1970s by notable
academics such as Paul Samuelson, Jack Treynor, Robert Merton, Myron
Scholes, and Fischer Black, most of whom were at one point or another based
at MIT.96 Myron Scholes and Robert Merton won the Nobel prize in
Economics in 1997 for their discovery and development of the Black-Scholes-Merton
option pricing model; Fischer Black regrettably passed away in 1995
before the grand accolade.

96 See an interesting account of the discovery of the Black-Scholes formula in Fischer
Black (1989), How we came up with the option formula, The Journal of Portfolio
Management, Vol. 15, 4-8.
A European (style) call option on an asset A is the right to purchase, on a
fixed future date T, one unit of the underlying asset A at a pre-determined
strike or exercise price K. A European put option on an asset A is the right to
sell, on a fixed future date T, one unit of the underlying asset A at a
pre-determined strike or exercise price K. An American (style) call option on an
asset A is the right to purchase, at any time up to and including a fixed
future date T, one unit of the underlying asset A at a pre-determined strike or
exercise price K. An American put option on an asset A is the right to sell, at
any time up to and including a fixed future date T, one unit of the
underlying asset A at a pre-determined strike or exercise price K.
Calls and puts are pretty much the basic types and the most prevalent of all
options. Second-generation options include many exotic types97 such as Asian
options, barrier options, lookbacks, flexi-options, digitals, quantos,
rainbows, and spread and basket options. Current trends in the 2000s include the
issue of many different types of structured products that combine various
types of exotic or vanilla options. Besides selling structured products to
investing institutions such as hedge funds, pension funds, mutual funds,
insurance companies, and corporations, increasingly a large amount of
structured products are also placed out to high net-worth individual investors,
directly or through private banking distribution. Sell-side banks or structuring
arms also issue structured warrants on third-party equity that are backed by an
inventory of the equity purchased from the open market. After the technology
bubble burst in the U.S. in 2001, and from about 2004 until the middle of 2007,
the major financial markets of the world saw a giant wave
of financial product and process innovation, and enjoyed a boom time. Since
the middle of 2007, however, the U.S. subprime woes have continued to erode
market confidence.
A particular option, whether European call or put, or American
call or put, is of course defined by what the underlying asset is. Many
different types of options have been traded on exchanges or on the over-the-counter
(OTC) inter-bank market, based on assets ranging from stock or equity,
to bonds, to money-market instruments, to futures contracts, to commodities
such as oil, metals, raw materials, and agricultural produce, to currencies, to
synthetic constructions such as indices, and to verifiable conditions such as
credit events, weather, and so on.

97 See for example, Peter G. Zhang (1997), Exotic Options: A Guide to Second
Generation Options, World Scientific Press; or Israel Nelken (2000), Pricing,
Hedging, and Trading Exotic Options, McGraw-Hill.
19.2  OPTION PRICING

A call option has a maturity or expiry date T in the future. During the period
from the current time t till T, a call option has a price that is determined in the
market, whether on an exchange or OTC. If the call is not exercised by the buyer
or holder before maturity, then at maturity T the call price should be max(0, S_T −
K), where S_T is the underlying asset price at time T. After T, whether the
option is exercised or not, the option is worthless as the contract has expired.
At t < T, a European call (option) price $c(t,S_t) is less than or equal to an
American call (option) price $C(t,S_t) when their exercise prices and terms-to-maturity
(remaining time to expiry) are the same. This is because an American
call option can always be utilized like a European call option by not
exercising till maturity; the converse is not possible. Thus an American call
option is at least as valuable as its European counterpart. The difference,
C(t,S_t) − c(t,S_t) ≥ 0, is called the early exercise premium.
For a call option,98 at t ≤ T, the situations S_t > K, S_t = K, and S_t < K are
called in-the-money (ITM), at-the-money (ATM), and out-of-the-money
(OTM) respectively. We show below the dollar payoff outcomes of a long call
and a short call position at maturity T.
Figure 19.1a  Value of a long call contract at T
[Payoff diagram: long call payoff = max(0, S_T − K), plotted against the asset price at T, S_T.]

Figure 19.1b  Value of a short call contract at T
[Payoff diagram: short call payoff = −max(0, S_T − K), plotted against the asset price at T, S_T.]

98 In some countries, as in Singapore, certain banks are allowed to issue covered
warrants or structured warrants on third-party firms and sell them through the
Exchange. These are call warrants and put warrants, being essentially calls or puts.
They differ from usual calls and puts insofar as the issuing bank is obliged to hold an
inventory of the third-party equity before issuing call warrants on them. These
warrants are typically European.
A put option has a maturity or expiry date T in the future. During the period
from the current time t till T, a put option has a price that is determined in the
market, whether on an exchange or OTC. If the put is not exercised by the buyer
or holder before maturity, then at maturity T the put price should be max(0, K −
S_T), where S_T is the underlying asset price at time T. After T, whether the
option is exercised or not, the option is worthless as the contract has expired.
At t < T, a European put (option) price $p(t,S_t) is less than or equal to an
American put (option) price $P(t,S_t) when their exercise prices and terms-to-maturity
(remaining time to expiry) are the same. This is because an American
put option can always be utilized like a European put option by not
exercising till maturity; the converse is not possible. Thus an American put
option is at least as valuable as its European counterpart. The difference,
P(t,S_t) − p(t,S_t) ≥ 0, is called the early exercise premium.
For a put option, at t ≤ T, the situations S_t < K, S_t = K, and S_t > K are
called in-the-money (ITM), at-the-money (ATM), and out-of-the-money
(OTM) respectively. We show below the dollar payoff outcomes of a long put
and a short put position at maturity T.

Figure 19.2a  Value of a long put contract at T
[Payoff diagram: long put payoff = max(0, K − S_T), plotted against the asset price at T, S_T.]

Figure 19.2b  Value of a short put contract at T
[Payoff diagram: short put payoff = −max(0, K − S_T), plotted against the asset price at T, S_T.]

Since an option position involves an initial outlay if it is a long position, or
initial revenue if it is a short position, the final profit/loss at T must take
into account this initial option premium. We illustrate the profit/loss at
maturity T for European-style long and short call positions as follows. Note
that the factor e^{r(T−t)} reflects the cost of carrying the premium to maturity,
where r is the continuously compounded rate of interest.

Figure 19.3a  Profit/Loss of a long call contract at T
[Diagram: long call profit = max(0, S_T − K) − C e^{r(T−t)}, plotted against S_T; the loss is −C e^{r(T−t)} until the breakeven point, after which profit rises with S_T.]

Figure 19.3b  Profit/Loss of a short call contract at T
[Diagram: short call profit = −max(0, S_T − K) + C e^{r(T−t)}, plotted against S_T.]

While it is easy to price a call or a put option at its maturity, it is not so
straightforward to price an option before its maturity.

Figure 19.4a  Value of a long call contract at t < T
[Diagram: call price as a function of the asset price at t, S_t, lying above the intrinsic value; the gap between them is the time value.]

Figure 19.4b  Value of a long put contract at t < T
[Diagram: put price as a function of the asset price at t, S_t, lying above the intrinsic value.]

We may intuitively guess the equilibrium price solution c(t,S_t) or p(t,S_t) as
follows. For a call, when S_t becomes 0 (the firm issuing the share goes bankrupt),
the call becomes worthless, i.e. c(t,0) = 0. When the call is deep in-the-money,
or S_t − K is highly positive, c(t,S_t) gets close to max(0, S_t − K). For a put,
when S_t becomes 0 (the firm issuing the share goes bankrupt), the put value
becomes max(0, K), or K. When the put is deep out-of-the-money, or S_t − K is
highly positive, p(t,S_t) gets close to 0. The figures show that the call price
function (bold curve) is a convex function originating at the origin, while the
put price function is a convex function declining toward zero as S_t increases.
The dotted curve represents the intrinsic value function of the option.
Consider the situation in Figure 19.4a and suppose that at t the price of the option
is $3. This option price is also called the option premium. If the underlying
asset is trading at $8 and the call strike is K = $5.50, then the call
option is ITM. The intrinsic value of the call option is max(0, S_t − K), or $2.50
in this case. If the option is OTM or ATM, the intrinsic value is defined as
zero.
The time value or time premium of the option is the option premium less
the intrinsic value. In this case, the time value is $3 − $2.50, or $0.50. The time
premium is the value the option buyer has to pay for the chance of the
option getting more into the money during the remaining life of the option.
The time value intuitively increases with term-to-maturity and with the
volatility of the underlying asset return.
19.3  PUT-CALL PARITY THEOREMS

Equilibrium call and put prices are also related in certain ways. A
European call and a European put with the same strike price K and the same
maturity T are related by the put-call parity theorem:

c − p = S_t − K e^{−r(T−t)}.

Thus, the equilibrium put price can be inferred from the call price, and vice
versa. Of course, in reality, market prices of calls and puts only
approximately satisfy the above relationship, since there are transaction costs and
bid-ask spreads not considered in the formulation.
Ignoring the issue of dividends, American call and put prices are governed
by the following bounds:

S_t − K ≤ C − P ≤ S_t − K e^{−r(T−t)}.

The easiest way to prove the above is to look at a payoff table such as Table 19.1
below. It serves to prove C − P ≤ S_t − K e^{−r(T−t)}, or P − C + S_t − K e^{−r(T−t)} ≥ 0.
Whatever may happen, the initial portfolio in the table produces either a zero or a
positive future payoff. To prevent arbitrage, the portfolio must cost something
at t. Thus P − C + S_t − K e^{−r(T−t)} ≥ 0.
The put-call parity theorem for European options is also proved using the
table. In the European case, there is no exercise before T. Therefore the
payoffs according to the table will be zero whatever the future outcome. To
prevent arbitrage, the current outlay must then equal zero. Hence p − c +
S_t − K e^{−r(T−t)} = 0.
Table 19.1
Proof of C − P ≤ S_t − K e^{−r(T−t)}, i.e. P − C + S_t − K e^{−r(T−t)} ≥ 0

Position     Initial outlay at t            If call exercised against put       No exercise till T:
                                            holder at t' < T (liquidate all)    S_T > K        S_T ≤ K
Long put     P                              +P(t',S_t') > 0 (sell put)          0              K − S_T
Short call   −C                             −(S_t' − K)                         −(S_T − K)     0
Long stock   S_t                            +S_t' (sell stock)                  +S_T           +S_T
Borrow       −K e^{−r(T−t)}                 −K e^{−r(T−t')} > −K (repay)        −K             −K
Total        P − C + S_t − K e^{−r(T−t)}    > 0                                 0              0
Now to prove S_t − K ≤ C − P, or C − P − S_t + K ≥ 0, see Table 19.2.

Table 19.2
Proof of S_t − K ≤ C − P, i.e. C − P − S_t + K ≥ 0

Position      Initial outlay at t      If put exercised against call          No exercise till T:
                                       holder at t' < T (liquidate all)       S_T > K             S_T ≤ K
Long call     C                        +C(t',S_t') > 0 (sell call)            S_T − K             0
Short put     −P                       −(K − S_t')                            0                   −(K − S_T)
Short stock   −S_t                     −S_t' (buy back stock)                 −S_T                −S_T
Lend          +K                       +K e^{r(t'−t)} ≥ K                     +K e^{r(T−t)} > K   +K e^{r(T−t)} > K
Total         C − P − S_t + K          > 0                                    > 0                 > 0
Thus, whatever may happen, the initial portfolio will produce a positive future
payoff. To prevent arbitrage, the portfolio must cost something at t. Thus C −
P − S_t + K ≥ 0. The latter could be a strict inequality, just as the American put
price lies above the European put price: as long as the underlying is low
enough, there is a positive probability of early exercise in the American put
case.
However, it can be shown that an American call on a stock without dividends
will not be exercised before maturity. Thus in this special case the American call
price is equal to the European call price.
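As a small numeric sketch of the European put-call parity relation above (the prices and parameters are made up purely for illustration):

    # Sketch: check c - p against S - K e^(-r*tau) with hypothetical quotes.
    import math

    S, K, r, tau = 100.0, 95.0, 0.03, 0.5      # spot, strike, rate, time to maturity (assumed)
    c, p = 9.73, 3.31                          # hypothetical quoted call and put prices

    lhs = c - p
    rhs = S - K * math.exp(-r * tau)
    print(f"c - p = {lhs:.2f}, S - K e^(-r tau) = {rhs:.2f}")  # a large gap suggests mispricing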
19.4  IMPLIED VOLATILITY

In 1987, the U.S. stock market saw a dramatic rise of 42% in its market value
until October 19 that year, when over two days the market fell a whopping
23%. This has come to be known as the Stock Market Crash of '87 and Black
Monday, and an enormous amount of study has been poured over this
subject. Two of the most important empirical studies on this subject, Schwert
(1990) and Bates (1991),99 basically indicated that stock index options
contained information, beyond that carried in the stocks themselves, that
could shed light on the event. This highlights the importance of empirical
studies of option prices in investment. Specifically, we shall explore the
information contained in the volatility that is implied in option prices.
A graphical depiction of prices and implied volatility during the time of
the October 1987 crash is given in Figure 19.5. The numbers are inferred
from the Schwert study.
From the earlier chapters, we know the volatility of a stock or stock
index return is its standard deviation, usually expressed on an annual basis.
Since the lognormal diffusion process underlying the Black-Scholes option
pricing model has the random walk as its discrete analogue, the variance of
return over horizon T, relative to the variance σ² over one unit of time, is σ²T.
Hence, the volatility of return over T is σ√T.
Annualized historical volatility can be estimated as σ̂√T, where the number of
trading days in a year is, for example, T = 252, and the daily volatility is
estimated as

σ̂ = sqrt[ (1/251) Σ_{d=1}^{252} (r_d − r̄)² ],   where   r̄ = (1/252) Σ_{d=1}^{252} r_d,

and r_d is the daily stock or stock index return on day d, found as r_d =
ln(S_d / S_{d−1}), where S_d is the day-d stock or stock index price. Note that
r_d is a continuously compounded return rate.

99 G. William Schwert, (1990), Stock Volatility and the Crash of '87, The Review
of Financial Studies, Vol. 3, No. 1, 77-102, and David Bates, (1991), The Crash of '87:
Was it expected? The evidence from options markets, Journal of Finance, Vol. 46,
No. 3, 1009-1044.
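An illustrative sketch of this estimate (assuming a hypothetical pandas Series named prices holding daily closing levels):

    # Sketch: annualized historical volatility from daily log returns, as in the formula above.
    import numpy as np

    log_returns = np.log(prices / prices.shift(1)).dropna()   # r_d = ln(S_d / S_{d-1})
    daily_vol = log_returns.std(ddof=1)                       # sample standard deviation
    annual_vol = daily_vol * np.sqrt(252)                     # scale by sqrt(trading days)
    print(f"annualized historical volatility = {annual_vol:.2%}")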
Figure 19.5
S&P 500 Index and Implied Volatility in October 1987 During the U.S. Stock Market Crash
[Plot over October 13 to 21, 1987, of the S&P 500 index (divided by 10) and the implied volatility in % p.a., both on a scale from 0 to 35.]

In the path-breaking Black-Scholes (1973) model,100 the prices of a European
call option and a European put option on a non-dividend-paying stock are:

c(t, S_t) = S_t N(x) − K e^{−rτ} N(x − σ√τ)
p(t, S_t) = K e^{−rτ} N(−x + σ√τ) − S_t N(−x)        (19.1)

where

x = [ ln(S_t/K) + (r + ½σ²)τ ] / (σ√τ)   and   τ = T − t.

T is the expiry date, so τ is the term-to-maturity (remaining life) of the option,
usually expressed as a fraction of a calendar year. The calendar year is more
commonly construed as the number of actual calendar days in a year, though
elsewhere the number of trading days is also used. Here, σ is the instantaneous
return volatility, S_t is the underlying stock price at current time t, and r is the
constant risk-free interest rate over the period [t,T]. The model can be
generalized to one where the volatility σ(t) is a deterministic function of time,
in which case σ² is replaced by the average variance (1/τ) ∫_t^T σ(s)² ds.

100 Black, F., and M. Scholes, 1973, The Pricing of Options and Corporate
Liabilities, Journal of Political Economy 81, 637-659.

From the equations in (19.1), the call option price at time t is a function of the
underlying stock price S_t, the option strike or exercise price K, the prevailing
risk-free rate r (at least covering [t,T]), the term-to-maturity of the option τ,
and the volatility σ. Thus, σ̂ can be found such that the call price observed at t,
c, is equal to

c = S_t N(x̂) − K e^{−rτ} N(x̂ − σ̂√τ),   with   x̂ = [ ln(S_t/K) + (r + ½σ̂²)τ ] / (σ̂√τ).

In best practice, the call that is selected to back out σ̂ is one that at t is
closest to being at-the-money, or nearest-the-money. There is some empirical
evidence that using such a call provides the best information about the volatility
of the underlying asset.
One key reason could be that nearest-the-money options are typically the
most liquid in trading. This means that market information will be largely
reflected in the prices of these options. Note that using calls with different
strikes, or puts, of the same maturity would produce slightly different
volatilities. σ̂ is called the implied volatility (implied from the Black-Scholes
model).
There is a large area of relatively recent research on the pattern of
implied volatilities across different strikes (and sometimes also different
maturities), called volatility smiles or sometimes volatility smirks. It is
thought that recognizing and predicting volatility smiles would help to
forecast option prices better.
There are also more recent studies using less restrictive assumptions, such
as avoiding imposition of the Black-Scholes model, to derive what is termed
model-free implied volatility. We shall not pursue these subjects here, but
further reading references are provided at the end of the chapter for interested
students to follow up.
Using open sources on Bloomberg, we obtain two particular SPX (S&P 500)
European call option prices, on 2005 Jan 03 and 2005 Mar 01, and then extract
the implied volatility. These two options are closest to the money, i.e. their
strike prices are closest to the underlying index value at the time. Their
specifications are shown below.
Dates    SPX index    Strike K    Time-to-maturity    Riskfree rate    Call price
1/03     1202.08      1200        73/365              3% p.a.          31.5
3/01     1210.41      1200        80/365              3% p.a.          33.0

The implied volatilities for the nearest-the-money calls on Jan 03 and Mar 01
are 12.5% and 10.1% respectively.
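A minimal sketch of how such an implied volatility can be backed out numerically (the bisection solver and Python implementation are illustrative; the inputs are those quoted above for the Jan 03 call):

    # Sketch: back out the Black-Scholes implied volatility by bisection.
    import math
    from statistics import NormalDist

    N = NormalDist().cdf

    def bs_call(S, K, r, tau, sigma):
        x = (math.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
        return S * N(x) - K * math.exp(-r * tau) * N(x - sigma * math.sqrt(tau))

    def implied_vol(price, S, K, r, tau, lo=1e-4, hi=2.0, tol=1e-8):
        # the call price is increasing in sigma, so bisection brackets the root
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if bs_call(S, K, r, tau, mid) > price:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    iv = implied_vol(31.5, S=1202.08, K=1200.0, r=0.03, tau=73 / 365)
    print(f"implied volatility = {iv:.1%}")   # close to the 12.5% reported above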
19.5  EFFICIENT FORECAST

In the Black-Scholes option pricing model, the underlying asset price
stochastic process is assumed to be

dS_t = μ S_t dt + σ S_t dW_t        (19.2)

where W_t is a Wiener process or Brownian motion such that ∫_t^T dW_s ~
N(0, T − t). The σ that appears in (19.2) is the instantaneous volatility of
the rate of return process of the asset. This same σ appears in the Black-Scholes
option price in (19.1). At time t, when we find the implied volatility of an
option that matures at time T, this σ̂_[t,T] is assumed to be the market's risk-neutral
assessment of the volatility of dS_t/S_t over [t,T].
Suppose we compute the realized (ex-post) volatility of dS_t/S_t as

σ*_[t,T] = sqrt[ (1/N) Σ_{k=t+1}^{T} (r_k − r̄)² ]

where N is the number of days between t and T, r_k is the daily return rate on
day k, and r̄ is the daily mean return rate over [t,T]. Then it would be
reasonable to test whether σ̂_[t,T] contains information about σ*_[t,T].
Christensen and Prabhala (1998)101 used monthly European S&P 100 index
options from November 1983 to May 1995 to compute implied volatilities
σ̂_[t,t+1] of the underlying S&P 100 index returns. The subscript [t,t+1] means
that the implied volatility is obtained at the end of month t for a horizon till the
end of the next month, which is the option's expiry. The options have about one
month left to maturity. At the same time, they computed the realized index
return volatility over the month ahead, σ*_[t,t+1]. The options are nearest-the-money
calls. Thus, for the sampling period t = Nov 83 to T = May 95, using 139 monthly
sample points, the following non-overlapping regression was performed:

ln(σ*_[t,t+1]) = a + b ln(σ̂_[t,t+1]) + e_t        (19.3)

where e_t is residual noise. If implied volatility contains information and is an
efficient forecast of realized volatility, then we would expect a to be close to
zero and b to be close to 1. Indeed, if the implied volatility is an unbiased
estimator, then a = 0 and b = 1. If the forecast is efficient, then the estimated
residuals ê_t should also be white noise and not correlated with other information
sources. They found b to be significantly positive but less than 1. This could
be due to an errors-in-variable problem with the measurement of the implied
volatility. It could also be model error, since the Black-Scholes model is assumed.
A new explanatory variable, constructed by regressing ln(σ̂_[t,t+1]) on lagged
ln(σ*_[t−1,t]) so as to serve as an instrumental variable, would produce better
regression performance.

101 Christensen, B.J. and N.R. Prabhala, (1998), The relation between implied and
realized volatility, Journal of Financial Economics 50, 125-150.
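A bare-bones sketch of regression (19.3) (assuming hypothetical, already-aligned monthly pandas Series named realized_vol and implied_vol, where the implied value at the end of month t forecasts the realized value over month t+1):

    # Sketch: OLS of log realized volatility on log implied volatility, as in (19.3).
    import numpy as np
    import statsmodels.api as sm

    y = np.log(realized_vol)
    X = sm.add_constant(np.log(implied_vol))
    res = sm.OLS(y, X).fit()
    print(res.params)    # an unbiased, efficient forecast would give a near 0 and b near 1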
Other studies of implied volatility also suggest that over time, changes in
implied volatility appear to be negatively correlated with changes in the index
level. This could be explained as follows. When implied volatility
rises to a high level, investors perceive high market risk and thus expect
higher returns. This has the effect of depressing the price levels of stocks, and
hence the price level of the index.
On the other hand, as the equity price level falls, the firm's debt-to-equity
ratio rises. This is equivalent to increasing leverage, which brings about higher
default risk. The increased leverage has the effect of increasing stock return
volatility. Thus the firm's leverage effect can also explain the negative correlation
between the stock or index return and the change in stock or index return
volatility.

19.6  PROBLEM SET

19.1  Two call options on the same underlying stock are traded such that
one has a strike price of $2.05 and the other has a strike price of
$2.10. Both have the same maturity in 3 months. What are the
possible reasons that their implied volatilities derived from the Black-Scholes
model are different?

19.2  Company A has just announced the surprising news that its CEO is
quitting, and that a search for a replacement will be started. In the light
of this, supposing options are available on the stock of this firm, what
would you expect the behavior of the implied volatility to be? In the
event studies you have come across, what may be a necessary
adjustment to make in trying to test the significance of any abnormal
returns post announcement?

FURTHER RECOMMENDED READINGS

[1] Black, F., and M. Scholes, (1973), The Pricing of Options and Corporate
Liabilities, Journal of Political Economy 81, 637-659.
[2] Cox, J., and Mark Rubinstein, (1985), Options Markets, Prentice-Hall.
[3] Cox, J., and S. A. Ross, (1976), The Valuation of Options for Alternative
Stochastic Processes, Journal of Financial Economics 3, 145-166.
[4] Cox, J., S. Ross and M. Rubinstein, (1979), Option Pricing: a Simplified
Approach, Journal of Financial Economics 7, 229-263.
[5] Harrison, J. M., and D. M. Kreps, (1979), Martingales and Arbitrage in
Multiperiod Securities Markets, Journal of Economic Theory 20, 381-408.
[6] Hull, John C., (2006), Options, Futures, and Other Derivatives, 6th
edition, Pearson-Prentice Hall.
[7] Jackwerth, J., and M. Rubinstein, (1996), Recovering Probability
Distributions from Option Prices, Journal of Finance 51, 1611-1631.
[8] Jiang, George J., and Yisong S. Tian, (2005), The model-free implied
volatility and its information content, The Review of Financial Studies,
Vol. 18, No. 4, 1305-1341.
[9] McDonald, R. L., (2006), Derivatives Markets, 2nd edition, Addison-Wesley.


Chapter 20
GENERALIZED METHOD OF MOMENTS
APPLICATION: CONSUMPTION-BASED ASSET
PRICING
Key Points of Learning:
Consumption, Utility, Representative agent, Utility-based asset pricing model,
Euler condition, Moment restrictions, Orthogonality conditions, Consumption
beta, Equity premium puzzle, Hansen-Jagannathan bound, Generalized
method of moments (GMM), Asymptotic chi-square test, Newey-West
covariance estimator

It is suitable to think of the theoretic aspects of finance as financial economics.


In this regard, main areas of finance have had strong linkages to economics,
including microeconomics, where the foundations of optimal consumer and
investor choices are laid. Many of the economics giants in the late 1800s and
well into the last century labored on mathematical results prescribing optimal
consumption and investment choices building on fundamental axioms of
rationality and non-satiability in human wants. One key construction is that of
a utility function of consumption. Assigning a higher utility number to one
bundle of consumption goods than to another is simply another way of
expressing the individual's preference for the former bundle over the latter. A
celebrated study by the mathematicians Von Neumann and Morgenstern102
produced the useful result that, when probability estimates are introduced into
consumption outcomes or utilities, an individual will prefer one risky
gamble to another provided the expected utility of the former is larger.
Therefore, we can perform optimization on a set of gambles or choices based
on expected outcomes of utility functions of consumption goods.
In this chapter we consider the necessary Euler condition for utility
optimization that leads to asset pricing modeling. As the theoretical
restrictions on the model are nonlinear, we apply a nonlinear method, the
generalized method of moments, for the estimation and testing of such
models.
Many interesting results and constructions in finance theory are exposited
102

Von Neumann, J. and O. Morgenstern, 1953, Theory of Games and Economic


Behavior, Princeton University Press.

in excellent books such as Huang and Litzenberger (1988) and Ingersoll
(1987). Similarly excellent books in macroeconomics but using utility-based
frameworks are Blanchard and Fischer (1989) and Stokey and Lucas (1989).
These classics are shown in the further reading list at the end of the chapter.
20.1  CONSUMPTION-BASED ASSET PRICING

Suppose a representative agent or investor maximizes his or her expected
lifetime utility

Max_{C_t}  E_t[ Σ_{k=0}^{∞} β^k U(C_{t+k}) ]        (20.1)

where U(.) is a Von Neumann-Morgenstern utility function of consumption C_t,
and we can think of a unit of the consumption good as a numeraire or a real
dollar, much as goods were bartered for pieces of rare metal a long time
ago. β (0 < β < 1) is the time discount factor on utility. It means that a more
distant consumption unit is preferred less than a more recent
consumption unit.
In the optimization problem (20.1), one of the necessary first-order
conditions, called Euler conditions, is that the marginal utility of consuming
δ real dollars less at time t equals the expected marginal utility of
consuming δ(1 + R_{t+1}) real dollars in the next period, obtained by investing the
foregone δ real dollars at t at return rate R_{t+1} over t to t+1.
The investor chooses δ = 0 when the optimum is attained. Thus the solution to
(20.1) requires solving the following:

Max_δ  U(C_t − δ) + E_t[ β U(C_{t+1} + δ(1 + R_{t+1})) ].        (20.2)

The first-order condition to (20.2) is:

−U_C(C_t − δ) + E_t[ β (1 + R_{t+1}) U_C(C_{t+1} + δ(1 + R_{t+1})) ] |_{δ=0} = 0.

Note that when the substitution is optimal as above, the marginal net gain
is zero, hence the 0 on the RHS.
Then,

−U_C(C_t) + E_t[ β (1 + R_{t+1}) U_C(C_{t+1}) ] = 0,

or

E_t[ (1 + R_{t+1}) β U_C(C_{t+1}) / U_C(C_t) ] = 1.        (20.3)

Equation (20.3) is an Euler condition. It applies to all traded assets in the
economy. Let any such asset with price P_t at t have random price P_{t+1} at
t+1. Then 1 + R_{t+1} = P_{t+1}/P_t, and we substitute this into (20.3). Thus,

E_t[ β U_C(C_{t+1}) / U_C(C_t) × P_{t+1} ] = P_t.        (20.4)

Let M_{t+1} ≡ β U_C(C_{t+1}) / U_C(C_t) be the kernel within the expectation in
(20.4). Then (20.4) can be re-written as

E_t[ M_{t+1} P_{t+1} ] = P_t.        (20.5)

In the conditional moment, or more specifically the conditional mean, in (20.5),
M_{t+1} is a very interesting quantity and unifies many areas of research in
rational expectations macroeconomics as well as in rational financial asset pricing
theory. It has been variously called the marginal rate of substitution (between
U_C(C_{t+1}) and U_C(C_t)), the pricing kernel, the state-price density, and the
stochastic discount factor (SDF).
The LHS of (20.5) is a conditional expectation based on the latest information
at time t. Applying the law of iterated expectations, it becomes an
unconditional expectation (or one conditional on the null information set):

E[ M_{t+1} (1 + R_{t+1}) ] = 1.        (20.6)

Hence

E(1 + R_{t+1}) E(M_{t+1}) + Cov(M_{t+1}, R_{t+1}) = 1.        (20.7)

Since the riskfree asset with return R_f should also satisfy (20.6), we have
(1 + R_f) E(M_{t+1}) = 1, or E(M_{t+1}) = 1/(1 + R_f). Hence (20.7) becomes

E(R_{t+1}) − R_f = −Cov(M_{t+1}, R_{t+1}) / E(M_{t+1}),        (20.8)

or

E(R_{t+1}) − R_f = [ −Cov(M_{t+1}, R_{t+1}) / var(M_{t+1}) ] × [ var(M_{t+1}) / E(M_{t+1}) ].        (20.9)

(20.9) is one version of the Consumption-Based Asset Pricing Model
(CCAPM). If we put β_C = −Cov(M_{t+1}, R_{t+1}) / var(M_{t+1}) and
λ_M = var(M_{t+1}) / E(M_{t+1}), then λ_M
is a market risk premium or price of risk common to all assets, and
β_C is a consumption beta specific to the asset with return R_{t+1}. The
intuition of this beta is as follows. Suppose the asset's consumption
beta is positive, so that its return R_{t+1} correlates negatively with M_{t+1}. This
implies a positive correlation between R_{t+1} and C_{t+1}, since U_CC < 0
(decreasing marginal utility of consumption; U is a concave function) makes
M_{t+1} a decreasing function of C_{t+1}. Thus holding that asset adds to
consumption volatility, which requires risk compensation in the form of a
higher expected return on the asset.
(20.8) can be re-written as

E(R_{t+1}) − R_f = −ρ_{M,R} σ_M σ_R / E(M_{t+1}),

or

[ E(R_{t+1}) − R_f ] / σ_R = −ρ_{M,R} σ_M / E(M_{t+1}) ≤ σ_M / E(M_{t+1}).        (20.10)

Equation (20.10) provides for the Hansen-Jagannathan bound to the Sharpe


ratio on the LHS, and is the subject of interesting research into why postwar
U.S. has produced very high equity market premium or Sharpe ratio for the
market index. For something like 8% value-weighted index annual return,
16% annual volatility, and on average 1 to 2% p.a. riskfree rate, the market
Sharpe ratio is about 0.4. Recall the RHS is Rf volatility of the SDF. This
means that the volatility of the SDF > 0.4/0.02 = 20. This implies too high a
volatility in the per capita consumption, or an extremely high investor risk
aversion. This high equity premium puzzle is still an ongoing area of research.
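As a quick numerical illustration of the bound, here is a minimal Python sketch using the round numbers quoted above (these figures are assumptions for illustration, not estimates from data):

```python
import numpy as np

equity_return = 0.08   # value-weighted index annual return (assumed)
riskfree = 0.015       # average riskfree rate, about 1 to 2% p.a. (assumed)
sigma_R = 0.16         # annual volatility of the market index (assumed)
sigma_dc = 0.02        # annual volatility of per capita consumption growth (assumed)

sharpe = (equity_return - riskfree) / sigma_R     # about 0.4
E_M = 1.0 / (1.0 + riskfree)                      # E(M) = 1/(1 + Rf)
sigma_M_bound = sharpe * E_M                      # HJ bound: sd(M) >= Sharpe ratio x E(M)

# Under CRRA utility, sd(M) is roughly (risk aversion) x sd(consumption growth),
# so the bound translates into a lower bound on the risk aversion coefficient.
implied_risk_aversion = sigma_M_bound / sigma_dc

print(f"Sharpe ratio              : {sharpe:.2f}")
print(f"HJ lower bound on sd(M)   : {sigma_M_bound:.2f}")
print(f"Implied risk aversion     : {implied_risk_aversion:.0f}")
```

Running this reproduces the back-of-the-envelope numbers in the discussion above: a Sharpe ratio of about 0.4 and an implied risk aversion of roughly 20.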
20.2 GENERALIZED METHOD OF MOMENTS (GMM)

This econometric method¹⁰³ of estimating parameters and testing the plausibility of a model is based on expectations or moment conditions derived from the theoretical model itself. Assume a stationary stochastic process {X_t}_{t=1,2,...}, where X_t is a vector of random variables at time t. Hence {X_t} is a stochastic vector process. Suppose we have a finite sample from this stochastic vector process, {x_1, x_2, ..., x_T}, of sample size T, where each x_t is a vector of realized values at t that are observable. The subscript t need not be time; it could refer to a more general context such as sampling across sections at a point in time.

The model is based on a set of K moment conditions:

\[
E[\,f_1(X_t, \theta)\,] = 0, \quad
E[\,f_2(X_t, \theta)\,] = 0, \quad \ldots, \quad
E[\,f_K(X_t, \theta)\,] = 0, \tag{20.11}
\]

¹⁰³ The GMM was developed by Prof Lars P. Hansen (1982), "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica 50, 1029-1054.

where f_j(·) is a function and θ is a unique m-dimensional vector of unknown parameters with m < K. The power of this method arises in cases where f_j(·) is nonlinear in X_t and θ, so that linear regression methods cannot be applied.

The essential idea in deriving an estimate of θ lies in finding a vector θ̂ such that the corresponding sample (empirical) moments are close to zero in some fashion:

\[
\frac{1}{T}\sum_{t=1}^{T} f_1(x_t, \hat\theta) \approx 0, \quad
\frac{1}{T}\sum_{t=1}^{T} f_2(x_t, \hat\theta) \approx 0, \quad \ldots, \quad
\frac{1}{T}\sum_{t=1}^{T} f_K(x_t, \hat\theta) \approx 0 . \tag{20.12}
\]

The Law of Large Numbers tells us that as T → ∞, the sample moments evaluated at the true parameter converge to E[f_j(X_t, θ)] for all j = 1, 2, ..., K. Hence, if the sample moment conditions are approximately zero, then intuitively the vector estimator θ̂ would also be close to θ in some fashion, given that θ is unique. We also assume some regularity conditions on the moments E[f_j(·)] such that E[f_j(θ̂)] behaves smoothly and does not jump about as θ̂ gets closer and closer to θ. Let all the observable values of sample size T be Y_T ≡ {x_t}_{t=1,2,...,T}.

Let

\[
g(Y_T, \theta) \equiv
\begin{pmatrix}
\frac{1}{T}\sum_{t=1}^{T} f_1(x_t, \theta) \\
\frac{1}{T}\sum_{t=1}^{T} f_2(x_t, \theta) \\
\vdots \\
\frac{1}{T}\sum_{t=1}^{T} f_K(x_t, \theta)
\end{pmatrix}
\]

be a K×1 vector of sampling moments. Suppose W_T is a K×K symmetric positive definite weighting matrix, which may be a function of the data Y_T. The GMM estimator is found by minimizing the scalar function:

\[
\min_{\theta}\; g(Y_T, \theta)^T\, W_T(Y_T, \theta)\, g(Y_T, \theta) . \tag{20.13}
\]

Note that if W_T(·,·) is any arbitrary symmetric positive definite matrix, then the estimator θ̂ will still be consistent, though not efficient. However, given some regularity smoothness conditions on the functions f_j(·,·), an optimal weighting matrix function W_T*(·,·) can be found so that the estimator θ̂ is asymptotically efficient, i.e. has the lowest asymptotic covariance in the class of estimators satisfying (20.13) for arbitrary W_T(·,·).
Let the vector function F_{K×1}(X_t, θ) = ( f_1(X_t, θ), f_2(X_t, θ), ..., f_K(X_t, θ) )^T. Then,

\[
g(Y_T, \theta) = \frac{1}{T}\sum_{t=1}^{T} F(x_t, \theta) .
\]

Let

\[
\Omega_0 \equiv \sum_{j=-N}^{N} E\!\left[F(X_t, \theta)\, F(X_{t-j}, \theta)^T\right]
\]

be a K×K covariance matrix that is the sum of the contemporaneous covariance matrix cov[F(X_t,θ), F(X_t,θ)^T] as well as the 2N serial covariance matrices cov[F(X_t,θ), F(X_{t-j},θ)^T] for lags j = ±1, ±2, ..., ±N. It is assumed that beyond N lags, cov[F(X_t,θ), F(X_{t-N-j},θ)^T] = 0 for j ≥ 1. Assume Ω_0 exists and has full rank K.

Note that the covariance matrix of the random vector g(Y_T, θ) is T^{-1}Ω_0. By the Central Limit Theorem, as T → ∞,

\[
T\, g(Y_T, \theta)^T\, \Omega_0^{-1}\, g(Y_T, \theta) \;\xrightarrow{d}\; \chi^2_K .
\]
The minimization of (20.13) gives the first-order condition

\[
2\left[\frac{\partial g(Y_T, \theta)}{\partial \theta}\right]^T W_T(Y_T, \theta)\, g(Y_T, \theta) = 0_{m\times 1}
\]

conditional on a given weighting matrix W_T(Y_T, θ). In principle, the m equations in this vector equation can be solved, i.e.

\[
\left[\frac{\partial g(Y_T, \theta)}{\partial \theta}\right]^T W_T(Y_T, \theta)\, g(Y_T, \theta) = 0_{m\times 1} , \tag{20.14}
\]

to obtain the m estimates in the m×1 vector θ̂. In this first step, W_T(Y_T, θ) can be initially selected as I_{K×K}. The solution θ̂_1 is consistent but not efficient. The consistent estimates θ̂_1 from this first step are then employed to find the optimal weighting matrix W_T*(Y_T, θ̂_1).

Let

\[
g'(Y_T, \hat\theta_1) = \frac{1}{T}\sum_{t=1}^{T} \frac{\partial F(x_t, \hat\theta_1)}{\partial \theta} .
\]

Let a consistent estimator of Ω_0 be

\[
\hat\Omega_0(\hat\theta_1) = \frac{1}{T}\sum_{t=1}^{T} F(x_t,\hat\theta_1)\, F(x_t,\hat\theta_1)^T
+ \sum_{j=1}^{N}\left[\frac{1}{T}\sum_{t=j+1}^{T} F(x_t,\hat\theta_1)\, F(x_{t-j},\hat\theta_1)^T
+ \frac{1}{T}\sum_{t=j+1}^{T} F(x_{t-j},\hat\theta_1)\, F(x_t,\hat\theta_1)^T\right] . \tag{20.15}
\]

Then employ W_T* = [\hat\Omega_0(\hat\theta_1)]^{-1} as the optimal weighting matrix in (20.13) and minimize the function again in the second step to obtain the efficient and consistent GMM estimator θ̂*.
The minimized function in (20.13) is now

\[
T\, g(Y_T, \hat\theta^*)^T\, \hat\Omega_0^{-1}\, g(Y_T, \hat\theta^*),
\]

where the first-order condition below is satisfied:

\[
\left[\frac{\partial g(Y_T, \hat\theta^*)}{\partial \theta}\right]^T \hat\Omega_0^{-1}\, g(Y_T, \hat\theta^*) = 0_{m\times 1} .
\]

As T → ∞, θ̂_1 → θ, θ̂* → θ, \hat\Omega_0^{-1} → \Omega_0^{-1}, and

\[
J_T \equiv T\, g(Y_T, \hat\theta^*)^T\, \hat\Omega_0^{-1}\, g(Y_T, \hat\theta^*) \;\xrightarrow{d}\; \chi^2_{K-m} . \tag{20.16}
\]
The LHS is called the J-statistic, J_T. Notice that this test statistic is asymptotically chi-square with K − m degrees of freedom, and not K degrees of freedom as when the population parameter θ is instead in the arguments. This is because m degrees of freedom are taken up by the m linear dependencies created by the first-order conditions in the solution for θ̂*. (20.16) also indicates that a test is only possible when K > m, i.e. the number of moment conditions or restrictions is greater than the number of parameters to be estimated. We can always create additional moment conditions by using instruments. One common instrument is lagged variables contained in the information set at t. If the moment conditions such as those in (20.11) are generated by conditional moments such as (20.6), then it is easy to enter the information variables observed at t into the expectation operator in (20.6), and then take iterated expectations over the null set to arrive at unconditional moments such as those in (20.11).

For example, a theoretical model may prescribe E_{t-1}[f_1(X_t, θ)] = 0 as a conditional expectation. By the iterated expectations theorem, this leads to a moment restriction as in (20.11). Since X_{t-1} is observed at t−1, we can add it as an instrument to obtain E_{t-1}[f_1(X_t, θ) X_{t-1}] = 0, hence another moment restriction E[f_1(X_t, θ) X_{t-1}] = 0. From Appendix A, we may say the vector f_1(X_t, θ) is orthogonal to X_{t-1} for any realization of X_{t-1}, X_t. Thus such moment restrictions are sometimes also called orthogonality conditions.

The excess of the number of moment conditions over the number of parameters is called the number of overidentifying restrictions, and it is the number of degrees of freedom of the asymptotic χ² test.
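To make the two-step procedure concrete, here is a minimal Python sketch with simulated data and hypothetical variable names (it is an illustration of the mechanics only, not any particular empirical application). The model is Y_t = a + bX_t + e_t with instruments z_t = (1, X_{t-1}, Y_{t-1}), giving K = 3 moment conditions for m = 2 parameters, so there is one overidentifying restriction:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=T)   # simulated data; true a = 1, b = 0.5

def moments(theta, y, x):
    """(T-1) x K matrix of moment values f(x_t, theta) for K = 3 instruments."""
    a, b = theta
    e = y[1:] - a - b * x[1:]
    z = np.column_stack([np.ones(T - 1), x[:-1], y[:-1]])   # constant and lagged variables
    return e[:, None] * z

def gbar(theta, y, x):
    return moments(theta, y, x).mean(axis=0)                # the K x 1 vector g(Y_T, theta)

def objective(theta, W, y, x):
    g = gbar(theta, y, x)
    return g @ W @ g                                        # g' W g as in (20.13)

# Step 1: arbitrary (identity) weighting matrix gives a consistent but inefficient estimate.
step1 = minimize(objective, x0=np.zeros(2), args=(np.eye(3), y, x), method="BFGS")

# Step 2: weight by the inverse of the estimated covariance matrix of the moments.
F1 = moments(step1.x, y, x)
W_opt = np.linalg.inv(F1.T @ F1 / len(F1))
step2 = minimize(objective, x0=step1.x, args=(W_opt, y, x), method="BFGS")

J = len(F1) * objective(step2.x, W_opt, y, x)               # J ~ chi-square(K - m) under the null
print(step2.x, J)
```

When the moment values are serially correlated, the simple covariance matrix used in the second step should be replaced by a HAC estimator such as the Newey-West form given below.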

Now expand g(θ̂*) about the true population parameter θ by a linear Taylor series:

\[
g(\hat\theta^*) = g(\theta) + \frac{\partial g(\tilde\theta)}{\partial \theta}\,(\hat\theta^* - \theta),
\]

where θ̃ is some value between θ and θ̂* (a linear combination of the two), so θ̃ is also consistent. Pre-multiplying by the m×K matrix [∂g(Y_T, θ̂*)/∂θ]^T Ω̂_0^{-1}, we obtain

\[
\left[\frac{\partial g(Y_T,\hat\theta^*)}{\partial\theta}\right]^T \hat\Omega_0^{-1}\, g(\hat\theta^*)
= \left[\frac{\partial g(Y_T,\hat\theta^*)}{\partial\theta}\right]^T \hat\Omega_0^{-1}\, g(\theta)
+ \left[\frac{\partial g(Y_T,\hat\theta^*)}{\partial\theta}\right]^T \hat\Omega_0^{-1}\, \frac{\partial g(\tilde\theta)}{\partial\theta}\,(\hat\theta^* - \theta).
\]

The LHS is the first-order condition and equals 0_{m×1}. Then

\[
\hat\theta^* - \theta = -\left\{\left[\frac{\partial g(Y_T,\hat\theta^*)}{\partial\theta}\right]^T \hat\Omega_0^{-1}\, \frac{\partial g(\tilde\theta)}{\partial\theta}\right\}^{-1}
\left[\frac{\partial g(Y_T,\hat\theta^*)}{\partial\theta}\right]^T \hat\Omega_0^{-1}\, g(\theta).
\]

Since \(\sqrt{T}\, g(\theta) \xrightarrow{d} N(0, \Omega_0)\), the asymptotic covariance matrix of \(\sqrt{T}(\hat\theta^* - \theta)\) is

\[
V\left[\frac{\partial g}{\partial\theta}\right]^T \Omega_0^{-1}\,\Omega_0\,\Omega_0^{-1}\left[\frac{\partial g}{\partial\theta}\right]V = V,
\quad \text{where} \quad
V = \left\{\left[\frac{\partial g(Y_T,\hat\theta^*)}{\partial\theta}\right]^T \Omega_0^{-1}\left[\frac{\partial g(Y_T,\hat\theta^*)}{\partial\theta}\right]\right\}^{-1}
\]

is an m×m matrix. Asymptotically, therefore,

\[
\sqrt{T}\,(\hat\theta^* - \theta) \;\xrightarrow{d}\; N(0, V). \tag{20.17}
\]
(20.16) is the test statistic measuring the sampling moment deviations from the means imposed by the theoretical restrictions in (20.11). If the test statistic is too large and exceeds the critical boundaries of the chi-square random variable, then the moment conditions of (20.11), and thus the theoretical restrictions, are rejected. (20.17) provides the asymptotic standard errors of the GMM estimator θ̂* that can be used to infer whether the estimates are statistically significant at a given significance level.

In estimating Ω_0 in (20.15), the estimator Ω̂_0(θ̂_1) is consistent. However, in finite samples this computed matrix may not be positive semi-definite, which may result in (20.16) being negative. The Newey-West HACC (heteroskedastic and autocorrelation consistent covariance) matrix estimator¹⁰⁴ provides a positive semi-definite covariance estimator of Ω_0 that can be used. This takes the form

\[
\hat\Omega_0(\hat\theta_1) = \frac{1}{T}\sum_{t=1}^{T} F(x_t,\hat\theta_1)\, F(x_t,\hat\theta_1)^T
+ \sum_{j=1}^{N}\left(1 - \frac{j}{N+1}\right)\left(\hat\Gamma_j + \hat\Gamma_j^T\right),
\]

where

\[
\hat\Gamma_j = \frac{1}{T}\sum_{t=j+1}^{T} F(x_t,\hat\theta_1)\, F(x_{t-j},\hat\theta_1)^T .
\]
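A compact numerical sketch of this estimator is given below, for a T×K array of moment values evaluated at the first-step estimate (this is an illustration of the formula above, assuming such an array is already available; it is not any particular package's routine):

```python
import numpy as np

def newey_west(F, n_lags):
    """Newey-West HACC estimate of Omega_0 from a T x K array F of moment values."""
    T = F.shape[0]
    omega = F.T @ F / T                                 # contemporaneous term
    for j in range(1, n_lags + 1):
        gamma_j = F[j:].T @ F[:-j] / T                  # lag-j cross-product of the moments
        omega += (1.0 - j / (n_lags + 1.0)) * (gamma_j + gamma_j.T)   # Bartlett kernel weight
    return omega
```

The inverse of newey_west(F, N) can then serve as the optimal weighting matrix W_T* in the second GMM step; the estimation reported in Table 20.1 below uses N = 6 lags.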

20.3 ESTIMATING PREFERENCE PARAMETERS IN THE EULER EQUATION

Hansen and Singleton tested the Euler condition in (20.3),

\[
E_t\!\left[\beta (1+R_{t+1}) \frac{U_C(C_{t+1})}{U_C(C_t)}\right] = 1 .
\]

The utility function is assumed to be a power function, U(C_t) = C_t^γ/γ for γ < 1. The risk aversion coefficient is 1 − γ > 0. This utility function is also called the Constant Relative Risk Aversion (CRRA) utility function, as its relative risk aversion parameter, −C_t U_CC/U_C = 1 − γ, is a constant. Suppose the S&P 500 index return is used as a proxy for the market value-weighted return. Let this be R_{t+1}. Then the Euler equation is:

\[
E_t\!\left[\beta (1+R_{t+1})\, Q_{t+1}^{\gamma-1}\right] = 1 , \tag{20.18}
\]

where Q_{t+1} ≡ C_{t+1}/C_t is the per capita consumption ratio, since U_C(C_t) = C_t^{γ-1}.
Since there are two parameters, β and γ, to be estimated, we require at least 3 moment restrictions. This will yield 1 overidentifying restriction. We could employ the lagged values of the (1+R_t)'s or the Q_t's as instruments. Here we form 3 moment restrictions:

(a) E[ β(1+R_{t+1}) Q_{t+1}^α − 1 ] = 0 ,
(b) E[ {β(1+R_{t+1}) Q_{t+1}^α − 1}(1+R_t) ] = 0 ,
(c) E[ {β(1+R_{t+1}) Q_{t+1}^α − 1}(1+R_{t-1}) ] = 0 ,

where we let α = γ − 1 < 0 when risk aversion is positive.
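A minimal sketch of how restrictions (a) to (c) might be coded as moment functions for the two-step routine sketched in Section 20.2 is given below; ret and q are hypothetical arrays holding the market return R_t and the consumption growth ratio Q_t, aligned by date (these names and the alignment convention are assumptions for illustration):

```python
import numpy as np

def euler_moments(theta, ret, q):
    """Moment values for restrictions (a)-(c); ret[t] = R_t and q[t] = Q_t are aligned series."""
    beta, alpha = theta                                   # alpha = gamma - 1
    u = beta * (1.0 + ret[2:]) * q[2:] ** alpha - 1.0     # pricing error dated t+1
    inst_b = 1.0 + ret[1:-1]                              # instrument for (b): 1 + R_t
    inst_c = 1.0 + ret[:-2]                               # instrument for (c): 1 + R_{t-1}
    return np.column_stack([u, u * inst_b, u * inst_c])   # (T-2) x 3 array of moment values
```

Minimizing g'Wg over (β, α) and forming the J-statistic then gives a test with 3 − 2 = 1 overidentifying restriction, matching the degrees of freedom reported with Table 20.1.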


¹⁰⁴ See Newey, W.K., and Kenneth D. West, (1987), A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix, Econometrica, Vol. 55, No. 3, 703-708.

If a distributional assumption is made about Q_t, then the Maximum Likelihood method can be employed.¹⁰⁵ Using the GMM method discussed in the previous section, the GMM estimation and test statistics are shown as follows. The quarterly consumption data from 2000 to 2009 used in the analysis are obtained from the public website of the Bureau of Economic Analysis of the U.S. Department of Commerce. In particular, the real durable consumption series is divided by population to obtain per capita durable consumption. Q_t measures the quarterly growth ratio.
Table 20.1
GMM Estimation of the Euler Equation under Constant Relative Risk Aversion

Dependent Variable: Implicit Equation
Method: Generalized Method of Moments
Included observations: 37 after adjustments
Estimation weighting matrix: HAC Newey-West
Standard errors & covariance computed using estimation weighting matrix
Convergence achieved after 13 iterations
Equation: C(1)*MKTRET*CONS^C(2)-1
Instrument specification: MKTRET(-1) MKTRET(-2)
Constant added to instrument list

                     Coefficient    Std. Error    t-Statistic    Prob.
C(1)                 0.998049       0.023597      42.29641       0.0000
C(2)                -1.361539       0.514941      -2.644066      0.0122

Mean dependent var   0.000000       S.D. dependent var   0.000000
S.E. of regression   0.091516       Sum squared resid    0.293129
Durbin-Watson stat   1.897556       J-statistic          4.115583
Instrument rank      3              Prob(J-statistic)    0.042490

From Table 20.1, the time discount factor β is estimated to be 0.998. The relative risk aversion coefficient 1 − γ = −α is estimated to be 1.36. Both are significantly different from zero, with p-values of less than 2%. The J-statistic, which is distributed as χ² with 1 degree of freedom, is 4.11 with a p-value of 0.0425. Therefore, the moment restrictions (a), (b), and (c) implied by the model and rational expectations are rejected at the 5% significance level, though not at the 1% level. The result is similar to those reported in Hansen and Singleton (1982).¹⁰⁶ In this GMM estimation, the covariance matrix of the sampling moments is estimated using the Newey-West HACC matrix estimator with 6 lags.

¹⁰⁵ See Hansen, L.P., and K.J. Singleton (1983), Stochastic consumption, risk aversion, and the temporal behavior of asset returns, Journal of Political Economy, Vol. 91, 249-268.


20.4 OTHER GMM APPLICATIONS

There is a copious amount of GMM applications. We look at one here, whereby a seemingly simple linear specification of the Kraus-Litzenberger (1976) three-moment asset pricing model¹⁰⁷,

\[
E(R_i) = \beta_i \left[\frac{\sigma_0}{\sigma_0 + \lambda m_0}\right] E(R_m)
+ \gamma_i \left[\frac{\lambda m_0}{\sigma_0 + \lambda m_0}\right] E(R_m), \tag{20.19}
\]

can be rigorously tested for its full implications using the GMM.¹⁰⁸ R_i and R_m are the ith asset's and the market's excess return rates. β_i is the usual market beta σ_im/σ_m². γ_i is the ith asset's co-skewness with the market, E{(R_i − E R_i)(R_m − E R_m)²}/m_0³. μ_0, σ_0², and m_0³ are the usual mean, variance, and third central moment of the excess market return, and λ is the marginal rate of substitution of terminal wealth skewness m_0 for terminal wealth volatility σ_0. Then the moment restrictions testable under the null of (20.19), with the β_i's, the γ_i's, μ_0, σ_0, m_0, and λ as parameters to be estimated, are:

\[
E\!\left[ R_i - \beta_i \frac{\sigma_0}{\sigma_0 + \lambda m_0} R_m - \gamma_i \frac{\lambda m_0}{\sigma_0 + \lambda m_0} R_m \right] = 0, \quad \text{for all } i,
\]
\[
E\!\left[ R_i R_m - \mu_0 R_i - \beta_i \left( R_m^2 - \mu_0 R_m \right) \right] = 0, \quad \text{for all } i,
\]
\[
E\!\left[ R_i R_m^2 - 2\mu_0 R_i R_m + (\mu_0^2 - \sigma_0^2) R_i - \gamma_i (R_m - \mu_0)^3 \right] = 0, \quad \text{for all } i,
\]

and

\[
E\!\left[ R_m - \mu_0 \right] = 0, \qquad
E\!\left[ (R_m - \mu_0)^2 - \sigma_0^2 \right] = 0, \qquad
E\!\left[ (R_m - \mu_0)^3 - m_0^3 \right] = 0 .
\]

¹⁰⁶ Hansen, L.P., and Kenneth Singleton, (1982), Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations Models, Econometrica, Vol. 50, No. 5, 1269-1286.
¹⁰⁷ See Kraus, A., and R.H. Litzenberger, (1976), Skewness Preference and the Valuation of Risk Assets, Journal of Finance, 31, 1085-1100.
¹⁰⁸ Kian-Guan Lim, (1989), A New Test of the Three-Moment Capital Asset Pricing Model, Journal of Financial and Quantitative Analysis, Vol. 24, No. 2, 205-216, is one of the earliest to employ a full test of the moment implications of such skewness models.
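As a rough sketch of how the full set of restrictions could be stacked into a single moment matrix for GMM estimation, consider the function below; the array shapes and the way the parameters are packed are assumptions made only for illustration, not the code used in Lim (1989):

```python
import numpy as np

def kl_moments(theta, Ri, Rm):
    """Stack the three-moment-model conditions for n assets into a T x (3n + 3) array.

    Ri: T x n matrix of asset excess returns; Rm: length-T market excess return.
    theta packs (beta_1..n, gamma_1..n, mu0, sigma0, m0, lam).
    """
    T, n = Ri.shape
    beta, gamma = theta[:n], theta[n:2 * n]
    mu0, sigma0, m0, lam = theta[2 * n:]
    w = sigma0 / (sigma0 + lam * m0)                  # weight on beta risk in (20.19)
    Rm = Rm.reshape(-1, 1)
    cols = []
    cols.append(Ri - (beta * w + gamma * (1 - w)) * Rm)                # pricing restriction
    cols.append(Ri * Rm - mu0 * Ri - beta * (Rm ** 2 - mu0 * Rm))      # defines beta_i
    cols.append(Ri * Rm ** 2 - 2 * mu0 * Ri * Rm
                + (mu0 ** 2 - sigma0 ** 2) * Ri
                - gamma * (Rm - mu0) ** 3)                             # defines gamma_i
    market = np.column_stack([Rm - mu0,
                              (Rm - mu0) ** 2 - sigma0 ** 2,
                              (Rm - mu0) ** 3 - m0 ** 3])              # market moments
    return np.column_stack(cols + [market])
```

With n assets there are 3n + 3 moment conditions and 2n + 4 parameters, so the model carries n − 1 overidentifying restrictions that the J-statistic can test.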

Since 1982, the GMM technique has been one of the most widely applied methods for empirical estimation and testing, especially of nonlinear rational expectations models, in many fields of economics and finance. This attests to its usefulness in many situations where serial correlation or a lack of distributional information hampers empirical investigation. However, its drawback is its reliance on asymptotic theory. Many Monte Carlo and econometric studies have arisen to investigate how inference under the GMM can be improved.
20.5 PROBLEM SET

20.1 For the CRRA utility function U(C_t) = C_t^γ/γ, evaluate C_t(−U_CC/U_C), where the subscripts on U denote partial derivatives with respect to the subscripts. This constant is called the relative risk aversion parameter. If the parameter has to be positive, is there any implication on the range of γ? What happens to the utility function when γ converges to zero?
20.2 In (20.5), E_t[M_{t+1} P_{t+1}] = P_t. Suppose we have random variables G_t and H_t that are observed at t. Show that

\[
E_t\!\left[\frac{H_t}{G_t}\left(M_{t+1} P_{t+1} - P_t\right)\right] = 0 .
\]

What consideration(s), if any, is there in choosing between testing the restriction

\[
E\!\left[\frac{H_t}{G_t}\left(M_{t+1} P_{t+1} - P_t\right)\right] = 0
\quad \text{or} \quad
E\!\left[\frac{H_t}{G_t}\left(M_{t+1}\frac{P_{t+1}}{P_t} - 1\right)\right] = 0
\]

using GMM?
20.3 Show how you would apply GMM to estimate a and b and to test the linear regression model Y_t = a + bX_t + e_t, where E_{t-1}(e_t) = 0 and var_{t-1}(e_t) = σ², though e_t is serially correlated. We also do not know the distribution of e_t, and e_t is correlated with X_t. Assume X_{t-1} is observed. Assume {X_t} and {Y_t} are stationary series.
20.4 Given short rates r_t that follow the discrete stochastic process

\[
r_{t+\Delta} - r_t = \kappa(\mu - r_t)\Delta + e_{t+\Delta},
\]

where e_{t+Δ} is an i.i.d. normally distributed r.v. with mean E(e_{t+Δ}) = 0 and var(e_{t+Δ}) = σ² r_t^{2γ} Δ, use lagged r_t as an instrument and create 4 moment conditions for GMM testing.

20.5

What are the advantages and disadvantages of the GMM versus the
maximum likelihood estimation methods?

RECOMMENDED FURTHER READINGS

[1] Blanchard, Olivier Jean, and Stanley Fischer, (1989), Lectures on Macroeconomics, MIT Press.
[2] Cochrane, John H., (2001), Asset Pricing, Princeton University Press.
[3] Davidson, R., and J.G. MacKinnon, (1993), Estimation and Inference in Econometrics, Oxford University Press.
[4] Hamilton, J.D., (1994), Time Series Analysis, Princeton University Press.
[5] Huang, Chi-fu, and Robert H. Litzenberger, (1988), Foundations for Financial Economics, North-Holland Elsevier Science Publishing.
[6] Ingersoll, Jonathan E., (1987), Theory of Financial Decision Making, Rowman & Littlefield Publishers.
[7] Stokey, Nancy L., and Robert E. Lucas Jr., (1989), Recursive Methods in Economic Dynamics, Harvard University Press.


Appendix A
MATRIX ALGEBRA
A.1 MATRICES

A matrix M is a rectangular array of numbers or elements a_ij, where i represents the ith row and j represents the jth column:

M = [a_ij]_{r×c}

Matrix M has r rows and c columns; M is an r × c matrix with dimensions r by c.

\[
\begin{pmatrix} 2 & 1 & 3 & 4 \\ 4 & 2 & 0 & 8 \\ 5 & 1 & 7 & 12 \end{pmatrix} \text{ is a } 3\times 4 \text{ matrix.}
\]

The 2nd row and 3rd column element is a_23 = 0.

(2  1  3  4) is a row vector, and \(\begin{pmatrix}2\\4\\5\end{pmatrix}\) is a column vector.

A vector is a special matrix. A scalar is a number, e.g. 3. A scalar λ multiplied by a matrix M is a scalar operation, such that λ[a_ij]_{r×c} = [λa_ij]_{r×c}. As an illustration,

\[
3\begin{pmatrix} 2 & 3 \\ 1 & 4 \\ 6 & 1 \end{pmatrix}
= \begin{pmatrix} 6 & 9 \\ 3 & 12 \\ 18 & 3 \end{pmatrix} .
\]

Matrix addition goes as follows, much like simple algebraic addition, except that the matrices to be added must be of the same dimension, and addition is performed element by element:

\[
\begin{pmatrix} a & b \\ c & d \end{pmatrix} + \begin{pmatrix} e & f \\ g & h \end{pmatrix}
= \begin{pmatrix} a+e & b+f \\ c+g & d+h \end{pmatrix} .
\]

Or, [a_ij]_{r×c} + [b_ij]_{r×c} = [a_ij + b_ij]_{r×c}.

Likewise, matrix subtraction proceeds as follows:

\[
\begin{pmatrix} 2 & 0 \\ 1 & 5 \end{pmatrix} - \begin{pmatrix} 3 & 4 \\ 2 & 6 \end{pmatrix}
= \begin{pmatrix} -1 & -4 \\ -1 & -1 \end{pmatrix} .
\]

Or, [a_ij]_{r×c} − [b_ij]_{r×c} = [a_ij] + [−b_ij]_{r×c} = [a_ij − b_ij]_{r×c}. Subtraction is equivalent to scalar multiplication by −1 followed by addition.
Matrix multiplication proceeds as follows:

\[
\begin{pmatrix} a & b & c \\ d & e & f \end{pmatrix}_{2\times 3}
\begin{pmatrix} u & x \\ v & y \\ w & z \end{pmatrix}_{3\times 2}
= \begin{pmatrix} au+bv+cw & ax+by+cz \\ du+ev+fw & dx+ey+fz \end{pmatrix}_{2\times 2}
\]

Matrices [a_ij]_{m×n} and [b_ij]_{n×m} are conformable if they can be multiplied together. Conformity occurs when the number of columns in the left matrix equals the number of rows in the right matrix. The resulting product matrix has m rows and m columns.

\[
\begin{pmatrix} 3 & 1 & 2 \\ 1 & 0 & 3 \\ 4 & 0 & 2 \end{pmatrix}
\begin{pmatrix} 0 & -\tfrac{1}{5} & \tfrac{3}{10} \\ 1 & -\tfrac{1}{5} & -\tfrac{7}{10} \\ 0 & \tfrac{2}{5} & -\tfrac{1}{10} \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} .
\]

Matrices also follow certain common laws of arithmetic operations.
(a) A(B + C) = AB + AC   (Distributive Law)
(b) (A + B)C = AC + BC   (Distributive Law)
(c) A(BC) = (AB)C        (Associative Law)
Matrices do not follow the Commutative Law, i.e. in general AB ≠ BA.

[a_ij]_{n×n} is a square matrix. For a square matrix, if all diagonal elements a_ii are one and all off-diagonal elements a_ij (i ≠ j) are zero, it is called an identity matrix, usually denoted by I_{n×n}. Each row or column of I is a unit vector. However, these unit vectors are themselves different because their units occur at different positions in the vector.

A square matrix in which not all diagonal elements are zero, and all off-diagonal elements are zero, is a diagonal matrix. This is different from I_{n×n} since the diagonal elements need not all be ones.

The trace of a square matrix with dimension n × n, or dim(n), is the sum of the n diagonal elements:

\[
\mathrm{tr}\left([a_{ij}]\right) = \sum_{i=1}^{n} a_{ii} .
\]

For example, tr \(\begin{pmatrix} 3 & 2 \\ 5 & 4 \end{pmatrix}\) = 7. Also,

tr(A) = tr(Aᵀ)
tr(AB) = tr(BA)
tr(A + B) = tr(A) + tr(B)

[a_ij]_{r×c}, where all a_ij = 0, is a zero or null matrix 0.
The transpose of M is Mᵀ:

\[
\begin{pmatrix} 2 & 3 \\ 1 & 4 \\ 6 & 1 \end{pmatrix}^{T}
= \begin{pmatrix} 2 & 1 & 6 \\ 3 & 4 & 1 \end{pmatrix}
\]

Or, ([a_ij]_{m×n})ᵀ = [a_ji]_{n×m}: the ith row becomes the ith column, for all i. Suppose tr(A) = 6; then clearly tr(Aᵀ) = 6 also.

[a_i]_{r×1}, where all a_i = 1, is a ones vector 1. Note that a unit vector is not, and should not be confused with, a ones vector in which all elements are ones.

\[
\begin{pmatrix} 1 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = 4 .
\]

Also,

\[
\begin{pmatrix} 5 & 6 & 3 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = 5 + 6 + 3 = 14 .
\]

More generally,

\[
\begin{pmatrix} x_1 & x_2 & x_3 & \cdots & x_n \end{pmatrix}\mathbf{1}_{n\times 1} = \sum_{i=1}^{n} x_i .
\]

If M × M = M, then M is an idempotent matrix. For M to be idempotent, M must be a square matrix. If M is a square zero matrix 0, it is also idempotent since 0 × 0 = 0. If M = I, it is also idempotent since I × I = I.

The determinant of M is |M|, which is a scalar. However, unlike for a scalar, the notation | · | on a matrix does not denote absolute value; it is computed as follows. For a 2 × 2 matrix, finding the determinant is simple:

\[
M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad
|M| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc .
\]

As an example,

\[
\begin{vmatrix} 10 & 3 \\ 6 & 2 \end{vmatrix} = 10\times 2 - 3\times 6 = 2 .
\]

This is a second-order determinant since dim(M) = 2.


For a 3 x 3 matrix M, dim(M) = 3,
a 11
M a 21
a 31

a 12
a 22
a 32

a 13
a 23 .
a 33

The third order (dim 3)M is


a
a 23
a
a 11 (1) 11 22
a 12 (1) 1 2 21
a 32 a 33
a 31

a 23
a 33

a 13 (1) 1 3

a 21

a 22

a 31

a 32

378
In the above expression,

a22

a23

a32

a33

is called minor of element a11. It is a subdeterminant of M, and a

determinant of the smaller matrix obtained by deleting the 1 st row and the 1st
column of original M.
Consider a minor Mijof element aij. Mij is the remaining matrix after
deleting the ith row and the jth column. The corresponding co-factor of
element aij is
Cij = (-1)i+j Mij.
The determinant M nxn

a
j 1

ij

(1) i j M ij for any given i.

Note that the same determinant can be found by expansion along a column k
n

instead of a row i as above, i.e.

M nxn a ik ( 1) i k M ik

i 1

As an example,

\[
\begin{vmatrix} 5 & -6 & 1 \\ 2 & -3 & -2 \\ 7 & 3 & 0 \end{vmatrix}
= 5\begin{vmatrix} -3 & -2 \\ 3 & 0 \end{vmatrix}
- (-6)\begin{vmatrix} 2 & -2 \\ 7 & 0 \end{vmatrix}
+ 1\begin{vmatrix} 2 & -3 \\ 7 & 3 \end{vmatrix}
= 5(6) + 6(14) + (27) = 141 .
\]
If A, B are conformable, A × B is an operation where one pre-multiplies B by A, or post-multiplies A by B. If A, B are square matrices and A × B = I, then A is the inverse of B, A ≡ B⁻¹; also, B is the inverse of A, B ≡ A⁻¹. For any general X whose inverse exists, XX⁻¹ = X⁻¹X = I.

For a 2 × 2 matrix, finding the inverse is simple. Given

\[
M = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad
M^{-1} = \frac{1}{|M|}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix} .
\]

As an illustration,

\[
\begin{pmatrix} 3 & 1 \\ 4 & 2 \end{pmatrix}^{-1}
= \frac{1}{2}\begin{pmatrix} 2 & -1 \\ -4 & 3 \end{pmatrix}
= \begin{pmatrix} 1 & -\tfrac{1}{2} \\ -2 & \tfrac{3}{2} \end{pmatrix} .
\]

To verify this is indeed the inverse, we multiply it with the original matrix to yield the identity matrix:

\[
\begin{pmatrix} 1 & -\tfrac{1}{2} \\ -2 & \tfrac{3}{2} \end{pmatrix}
\begin{pmatrix} 3 & 1 \\ 4 & 2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} .
\]

The adjoint of an n × n matrix M is another n × n matrix, as follows:

\[
\mathrm{adj}\, M =
\begin{pmatrix}
c_{11} & c_{21} & \cdots & c_{n1} \\
c_{12} & c_{22} & \cdots & c_{n2} \\
\vdots & \vdots & & \vdots \\
c_{1n} & c_{2n} & \cdots & c_{nn}
\end{pmatrix}
\]

where c_ij is the co-factor of element a_ij in M. Note that the ijth element of the adjoint matrix is the co-factor of the jith element a_ji of M. Now we can express the inverse of M in terms of the adjoint matrix and the determinant of M:

\[
M^{-1} = \frac{\mathrm{adj}\, M}{|M|} .
\]

As an example, if

\[
M = \begin{pmatrix} 4 & 1 & -1 \\ 0 & 3 & 2 \\ 3 & 0 & 7 \end{pmatrix},
\quad \text{then} \quad
\mathrm{adj}\, M = \begin{pmatrix} 21 & -7 & 5 \\ 6 & 31 & -8 \\ -9 & 3 & 12 \end{pmatrix} .
\]

Now |M| = 99, thus

\[
M^{-1} = \frac{1}{99}\begin{pmatrix} 21 & -7 & 5 \\ 6 & 31 & -8 \\ -9 & 3 & 12 \end{pmatrix} .
\]

Now suppose |M| = 0: does M have an inverse? No. If M⁻¹ does not exist, M is called a singular matrix. If M⁻¹ exists, M is a non-singular matrix. Thus we see that

\[
\begin{pmatrix} 1 & 2 \\ 0 & 0 \end{pmatrix}
\]

is singular since its determinant is zero.
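These computations are easy to check numerically; the following minimal NumPy sketch verifies the determinant, adjoint, and inverse of the 3 × 3 example above:

```python
import numpy as np

M = np.array([[4.0, 1.0, -1.0],
              [0.0, 3.0,  2.0],
              [3.0, 0.0,  7.0]])

det_M = np.linalg.det(M)                  # about 99
inv_M = np.linalg.inv(M)
adj_M = det_M * inv_M                     # adj M = |M| * M^{-1}

print(round(det_M))                       # 99
print(np.round(adj_M))                    # [[21 -7 5], [6 31 -8], [-9 3 12]]
print(np.allclose(M @ inv_M, np.eye(3)))  # True
```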

A large matrix M can be partitioned into smaller matrices as follows:

\[
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}
\]

For example,

\[
M = \begin{pmatrix} 2 & 5 & 6 \\ 4 & 1 & 0 \\ 3 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}
= \begin{pmatrix} A & B \\ C & I_2 \end{pmatrix} .
\]

In the above example, A is the vector (2, 4)ᵀ.


A useful result for partitioned matrices is as follows. Suppose M is a square non-singular matrix,

\[
M = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
\]

where A₁₁⁻¹ and A₂₂⁻¹ exist. Then

\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1}
= \begin{pmatrix}
A_{11}^{-1}\left(I + A_{12} B A_{21} A_{11}^{-1}\right) & -A_{11}^{-1} A_{12} B \\
- B A_{21} A_{11}^{-1} & B
\end{pmatrix},
\quad \text{where } B = \left(A_{22} - A_{21} A_{11}^{-1} A_{12}\right)^{-1} .
\]

It can be shown that

\[
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix}
A_{11}^{-1}\left(I + A_{12} B A_{21} A_{11}^{-1}\right) & -A_{11}^{-1} A_{12} B \\
- B A_{21} A_{11}^{-1} & B
\end{pmatrix}
= \begin{pmatrix}
I + A_{12} B A_{21} A_{11}^{-1} - A_{12} B A_{21} A_{11}^{-1} & -A_{12} B + A_{12} B \\
A_{21} A_{11}^{-1} + \left(A_{21} A_{11}^{-1} A_{12} - A_{22}\right) B A_{21} A_{11}^{-1} & \left(A_{22} - A_{21} A_{11}^{-1} A_{12}\right) B
\end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} = I .
\]

This facilitates finding the inverse of a large M using smaller matrices.
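A quick numerical check of the partitioned-inverse formula with NumPy is sketched below; the block values are arbitrary assumptions chosen only so that the required inverses exist:

```python
import numpy as np

A11 = np.array([[2.0, 1.0], [1.0, 3.0]])
A12 = np.array([[0.0, 1.0], [2.0, 0.0]])
A21 = np.array([[1.0, 0.0], [0.0, 1.0]])
A22 = np.array([[4.0, 1.0], [1.0, 2.0]])

M = np.block([[A11, A12], [A21, A22]])
A11inv = np.linalg.inv(A11)
B = np.linalg.inv(A22 - A21 @ A11inv @ A12)          # inverse of the Schur complement
M_inv = np.block([
    [A11inv @ (np.eye(2) + A12 @ B @ A21 @ A11inv), -A11inv @ A12 @ B],
    [-B @ A21 @ A11inv, B],
])
print(np.allclose(M @ M_inv, np.eye(4)))             # True
```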
A matrix M_{3×2} can be partitioned into row or column vectors. A row partition is

\[
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}
\]

where each X_i is a 1×2 vector. For a general M_{r×c}, if the number of rows r > the number of columns c, it is possible to find a 1×r non-zero vector [λ_j] so that [λ_j] M = 0. For example,

\[
M = \begin{pmatrix} 1 & 3 \\ 1 & 2 \\ 3 & 7 \end{pmatrix} .
\]

We can find a vector (1  2  −1) so that

\[
\begin{pmatrix} 1 & 2 & -1 \end{pmatrix}\begin{pmatrix} 1 & 3 \\ 1 & 2 \\ 3 & 7 \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix} .
\]

This is akin to finding a linear combination of the rows, viz. 1[1 3] + 2[1 2] − 1[3 7] = [0 0]. Thus at least one of the rows is linearly dependent on the others.

For M_{r×c} where r ≤ c, it may or may not be possible to find a 1×r non-zero vector [λ_j] so that [λ_j] M = 0. For example, for

\[
M = \begin{pmatrix} 1 & 1 & 3 \\ 3 & 2 & 7 \end{pmatrix},
\]

if

\[
\begin{pmatrix} a & b \end{pmatrix}\begin{pmatrix} 1 & 1 & 3 \\ 3 & 2 & 7 \end{pmatrix}
= a\begin{pmatrix} 1 & 1 & 3 \end{pmatrix} + b\begin{pmatrix} 3 & 2 & 7 \end{pmatrix}
= \begin{pmatrix} 0 & 0 & 0 \end{pmatrix},
\]

then [a b] must be zero. Thus the two rows are linearly independent, and the rank of M is r. The rank of M is the maximum number of linearly independent rows in M.

What is the rank of \(\begin{pmatrix} 1 & 3 \\ 1 & 2 \\ 3 & 7 \end{pmatrix}\)? As shown earlier, we can find a linear combination [1 2 −1] of the rows such that [0 0] obtains. This means that out of the 3 rows in the matrix, one is linearly dependent on the other two. If we remove one row, e.g. the first row, and we cannot find a non-zero [a b] so that

\[
\begin{pmatrix} a & b \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 3 & 7 \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix},
\]

then it has a rank of 2.

For a square matrix M_{n×n}, if |M| ≠ 0, its rank is n.


A.2 LINEAR EQUATIONS

We can solve a system of linear equations using row operations as follows.

X₁ + 3X₂ = −2   (A.1)
X₁ + 2X₂ = −1   (A.2)
3X₁ + 7X₂ = −4   (A.3)

[Row (A.1) + 2 × Row (A.2)] − Row (A.3): equation (A.1) becomes

0X₁ + 0X₂ = 0   (A.4)

When the number of equations exceeds the number of unknowns X_i, the excess equations are either redundant or else lead to no (feasible) solution.

Matrices are useful for representing a linear system of equations such as the above:

\[
\begin{pmatrix} 1 & 3 \\ 1 & 2 \\ 3 & 7 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} -2 \\ -1 \\ -4 \end{pmatrix} .
\]

We perform the same row operations as above by pre-multiplying:

\[
\begin{pmatrix} 1 & 2 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 3 \\ 1 & 2 \\ 3 & 7 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} -2 \\ -1 \\ -4 \end{pmatrix} .
\]

Thus,

\[
\begin{pmatrix} 0 & 0 \\ 1 & 2 \\ 3 & 7 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 0 \\ -1 \\ -4 \end{pmatrix} .
\]

In the above, a linear combination of rows,

\[
1\begin{pmatrix} 1 & 3 \end{pmatrix} + 2\begin{pmatrix} 1 & 2 \end{pmatrix} - 1\begin{pmatrix} 3 & 7 \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix},
\]

leads to the zero vector 0ᵀ. The matrix

\[
\begin{pmatrix} 1 & 3 \\ 1 & 2 \\ 3 & 7 \end{pmatrix}
\]

is then said to be linearly dependent. The reduced system of equations after deleting redundant equations is:

\[
\begin{pmatrix} 1 & 2 \\ 3 & 7 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} -1 \\ -4 \end{pmatrix} .
\]

If the system cannot be reduced further, then the matrix

\[
\begin{pmatrix} 1 & 2 \\ 3 & 7 \end{pmatrix}
\]

is said to be linearly independent: each row is linearly independent. The solution to the above non-reducible system of equations is

\[
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 1 & 2 \\ 3 & 7 \end{pmatrix}^{-1}\begin{pmatrix} -1 \\ -4 \end{pmatrix}
= \begin{pmatrix} 7 & -2 \\ -3 & 1 \end{pmatrix}\begin{pmatrix} -1 \\ -4 \end{pmatrix}
= \begin{pmatrix} 1 \\ -1 \end{pmatrix} .
\]

The solution X to a linear system A_{n×n} X_{n×1} = B_{n×1} requires finding the inverse A⁻¹:

\[
X_{n\times 1} = A^{-1}_{n\times n} B_{n\times 1} = \frac{\mathrm{adj}(A)_{n\times n}\, B_{n\times 1}}{|A|} .
\]

If X_{n×1} = [x_k]_{k=1,2,...,n}, then the kth row of X is

\[
x_k = \frac{(k\text{th row of } \mathrm{adj}\, A)\, B}{|A|}
= \frac{\begin{pmatrix} c_{1k} & c_{2k} & \cdots & c_{nk} \end{pmatrix}\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}}{|A|}
= \frac{\sum_{i=1}^{n} b_i (-1)^{i+k}|M_{ik}|}{|A|} .
\]

The numerator looks like a determinant. In fact, it is the determinant of a new matrix A_K formed by replacing the kth column of A with B_{n×1}. Then

\[
x_K = \frac{|A_K|}{|A|} .
\]

This result is called Cramer's Rule.
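A NumPy check of the solution and of Cramer's Rule for the reduced system is sketched below, using the same numbers as above:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 7.0]])
b = np.array([-1.0, -4.0])

print(np.linalg.solve(A, b))    # [ 1. -1.]

# Cramer's Rule: replace the kth column of A by b and take the ratio of determinants.
x = [np.linalg.det(np.column_stack([b if j == k else A[:, j] for j in range(2)]))
     / np.linalg.det(A)
     for k in range(2)]
print(x)                        # [1.0, -1.0] up to rounding
```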

A.3 EIGENVALUES

Given a matrix M_{n×n} and a vector V_{n×1} ≠ 0, suppose we can find a scalar λ so that MV = λV. Then λ is called an eigenvalue or characteristic root of M, and V is called an eigenvector or characteristic vector of M. Now

(M − λI_n)V = 0 .

If (M − λI_n)⁻¹ exists, then V = 0, which is a trivial solution. For general V ≠ 0, the system of linear equations holds only if (M − λI_n)⁻¹ does not exist, i.e. (M − λI_n) is singular. The equation

|M − λI_n| = 0

is called the characteristic equation of M.

To find the eigenvalues and eigenvectors of \(\begin{pmatrix} 2 & 2 \\ 2 & -1 \end{pmatrix}\), solve

\[
\begin{vmatrix} 2-\lambda & 2 \\ 2 & -1-\lambda \end{vmatrix} = 0 .
\]

We obtain (2−λ)(−1−λ) − 4 = 0, or λ² − λ − 6 = 0. At this point, use the usual quadratic formula to find the roots:

\[
\lambda = \frac{-(-1) \pm \sqrt{(-1)^2 - 4(1)(-6)}}{2(1)} = \frac{1 \pm \sqrt{25}}{2} = 3 \text{ or } -2 .
\]

Given the root λ = 3, a particular eigenvector [v_i] can be found using

\[
\begin{pmatrix} 2-3 & 2 \\ 2 & -1-3 \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = 0 ,
\]

or −v₁ + 2v₂ = 0, thus v₁ = 2v₂. Hence \(\begin{pmatrix} 2 \\ 1 \end{pmatrix}\), \(\begin{pmatrix} 4 \\ 2 \end{pmatrix}\), and so on are eigenvectors.
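The eigenvalues and the unit-length (normalized) eigenvectors of this example are easy to verify with NumPy, as in the minimal sketch below:

```python
import numpy as np

M = np.array([[2.0, 2.0],
              [2.0, -1.0]])
values, vectors = np.linalg.eig(M)
print(values)    # 3 and -2 (in some order)
print(vectors)   # columns are unit-length eigenvectors; the one for 3 is proportional to (2, 1)
```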


A normalized eigenvector is one with a vector length equal to one unit, i.e. ||v|| = 1 or vᵀv = 1. The operation vᵀv is also called the inner product, or a dot product. So

\[
v^T v = \begin{pmatrix} 2v_2 & v_2 \end{pmatrix}\begin{pmatrix} 2v_2 \\ v_2 \end{pmatrix} = 5v_2^2 = 1 .
\]

Then v₂ = 1/√5, and thus the normalized eigenvector is

\[
v = \frac{1}{\sqrt{5}}\begin{pmatrix} 2 \\ 1 \end{pmatrix} = \begin{pmatrix} 2\sqrt{5}/5 \\ \sqrt{5}/5 \end{pmatrix} .
\]

A.4 ORTHOGONALITY

If the scalar product of two vectors u and v is 0, i.e. uᵀv = 0, then u and v are orthogonal. Unit vectors are orthogonal:

\[
\begin{pmatrix} 1 & 0 & 0 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = 0 .
\]

Orthogonality has a geometric interpretation.

[Figure: the three unit vectors u, v, w along the x-, y-, and z-axes in R³.]

Vectors u and v are perpendicular to each other. If u, v are also normalized to unit length, they are called orthonormal vectors. In the 3-dimensional Euclidean space R³ above, u, v, w are the 3 unit vectors shown on the x-, y-, and z-axes respectively.
[Figure: a point P in R³ with position vector (x, y, z)ᵀ.]

In the diagram above, point P is described as having position vector (x, y, z)ᵀ.

In the following diagram, let two vectors a, b in R³ be represented by (x_a, y_a, z_a)ᵀ and (x_b, y_b, z_b)ᵀ respectively.

[Figure: the plane spanned by vectors a and b, with a specific vector w* lying on it.]

A vector w that lies on the plane (shaded area) formed by linear combinations of a and b is represented by

\[
w = \alpha\begin{pmatrix} x_a \\ y_a \\ z_a \end{pmatrix} + \beta\begin{pmatrix} x_b \\ y_b \\ z_b \end{pmatrix}
= \begin{pmatrix} x_a & x_b \\ y_a & y_b \\ z_a & z_b \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
\equiv X\theta .
\]

A specific w* vector on this plane, with particular weights α and β = 2/3, is shown.

When (x_a, y_a, z_a)ᵀ and (x_b, y_b, z_b)ᵀ are linearly independent, their linear combinations form a subspace of a lower dimension (dim = number of weights in the combination). The (2-dim) subspace is said to be spanned by the 2 columns of X above.

The 2-dim subspace (geometrically, a 2-dim plane in R³ space) above passes through the origin O. To be more general, when a 2-dim subspace does not pass through O, it is represented by Xθ − C, where C is a constant 3×1 vector. Thus we can express the equation of a plane (2-dim) in 3-dim space as

V_{3×1} = Xθ − C

where X and C are given, and V = (x, y, z)ᵀ represents a vector in 3-dim with general coordinates on the X-, Y-, and Z-axes. Since we can certainly solve α and β in terms of X, C, and also x, y, the plane equation can also be written as a linear equation f(x, y, z; X, C) = 0. An example is 3x + 4y − 2z = 5.

There are interesting geometric interpretations of operations on vector spaces such as the above. The length of vector a is

\[
\|a\| = \left(x_a^2 + y_a^2 + z_a^2\right)^{1/2} = \left(a^T a\right)^{1/2} .
\]

Note that the analytical formula for length in Euclidean space comes from the basic Pythagoras theorem. ||a|| is sometimes called the norm of a. To convert a vector of length ||a|| into a similar vector (i.e. same direction) but of unit length, transform it into a/||a||.
Now consider a simpler 2-dim space, as shown in the diagram below.

[Figure: two vectors (x_a, y_a) and (x_b, y_b) drawn from the origin, with angle θ between them.]

From the basic trigonometric identity,

\[
\cos\theta = \cos(\alpha - \beta) = \cos\alpha\cos\beta + \sin\alpha\sin\beta
= \frac{x_a x_b + y_a y_b}{\|a\|\,\|b\|}
= \frac{a^T b}{\|a\|\,\|b\|} .
\]

We can thus relate trigonometry to vector inner products. The inner or scalar product of the normalized vectors a/||a|| and b/||b|| equals the cosine of the angle between them. When uᵀv = 0, the cosine of the angle between u and v is 0; since cos(π/2) = 0, the angle between u and v is π/2, so u and v are perpendicular to each other.
Let e = b/||b|| be a normalized vector of unit length.

[Figure: vector a = OA, unit vector e = (x_b, y_b)/||b||, and the projection point P of a onto e, with angle θ = α − β.]

Thus cos θ = aᵀe/||a||, or ⟨a, e⟩ = aᵀe = ||a|| cos θ is the length of the projection of a onto e. The projected vector is denoted by OP, or P:

P = (||a|| cos θ) e = (aᵀe) e .

The projection vector AP, or OP − OA in this case, has the minimum possible length since it forms a right angle with the projected vector on e.

For an important inequality result, consider bᵀa = aᵀb = cos θ ||a|| ||b||. Since |cos θ| ≤ 1,

|aᵀb| ≤ ||a|| ||b||, or (aᵀb)² ≤ ||a||²||b||² = (aᵀa)(bᵀb).

The above is called the Cauchy-Schwarz inequality. We can also write it as

\[
\left(\sum_{i=1}^{n} a_i b_i\right)^2 \le \left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) .
\]

Another inequality, the triangle inequality, states that

||c − a|| ≤ ||b − a|| + ||c − b||,

or |AC| ≤ |AB| + |BC|.
A.5 CONVEX SET AND HYPERPLANE THEOREM

Let scalar t ∈ [0, 1], or 0 ≤ t ≤ 1. Consider a set A and take any two elements x and y of A. If, for any such x and y, z = tx + (1−t)y is also an element of A, then A is a convex set. An example of a 2-dim convex set is shown below.

[Figure: a 2-dim convex set containing points x and y and the points z on the segment between them.]

Note that a convex set can have linear segments, e.g. a sliced disc.

Let A and B be convex sets in Rⁿ space that are disjoint, i.e. A ∩ B = ∅. A linear functional p is an n×1 vector of constants; the set of points x in Rⁿ satisfying pᵀx = k, a constant, represents a hyperplane in Rⁿ space. In R², or 2-dimensional Euclidean space, p = (a, b)ᵀ, so pᵀx = ax₁ + bx₂ = k is a straight line. In R³, ax₁ + bx₂ + cx₃ = k is a 2-dim plane.

[Figure: two disjoint convex sets A and B in the (u₁, u₂) plane, separated by the line u₂ − u₁ = −3.]

The diagram above shows two disjoint convex sets A and B in two dimensions, and a line u₂ = u₁ − 3 that passes between them. The Separating Hyperplane Theorem states that if A and B are two disjoint nonempty convex sets in Rⁿ, then there exists a linear functional p such that pᵀx ≥ pᵀy for any x in A and y in B. This is illustrated in the R², or 2-dim, diagram where pᵀ = (−1, 1): (−1, 1)x > −3 > (−1, 1)y, where xᵀ = (u₁, u₂).
A.6 QUADRATIC FORM

A quadratic form in 2 variables x₁ and x₂ can be written as

\[
\begin{pmatrix} x_1 & x_2 \end{pmatrix} M_{2\times 2}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
\]

where M_{2×2} = [σ_ij]. This is σ₁₁x₁² + σ₁₂x₁x₂ + σ₂₁x₂x₁ + σ₂₂x₂². An n-variable quadratic form is xᵀMx, where

\[
x_{n\times 1} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
\quad \text{and} \quad
M_{n\times n} = [\sigma_{ij}] .
\]

An important application of the quadratic form is the computation of variance.


Let σ_ij = cov(r̃_i, r̃_j) be the covariance of security i's return rate and security j's return rate, and σ_ii, or σ_i², = var(r̃_i) be the variance of security i's return rate, where r̃_i is the random return rate of security i. Let w_i be the fraction of wealth invested in security i, with

wᵀ = (w₁, w₂, ..., w_n) and wᵀ1 = 1.

Then

\[
\mathrm{var}\!\left(\sum_{i=1}^{n} w_i \tilde r_i\right) = w^T V w, \quad \text{where } V_{n\times n} = [\sigma_{ij}] .
\]

Consider the covariance matrix

\[
V = \begin{pmatrix} 0.16 & -0.3 \\ -0.3 & 0.25 \end{pmatrix} .
\]

V is symmetric: Vᵀ = V. Suppose we find var(0.3 r̃₁ + 0.7 r̃₂). The variance is

\[
\begin{pmatrix} 0.3 & 0.7 \end{pmatrix}
\begin{pmatrix} 0.16 & -0.3 \\ -0.3 & 0.25 \end{pmatrix}
\begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix}
= 0.0109 .
\]

Suppose Portfolio A's return is r̃_A = 0.1 r̃₁ + 0.9 r̃₂, and Portfolio B's return is r̃_B = 0.3 r̃₁ + 0.7 r̃₂. Then

cov(r̃_A, r̃_B) = 0.1×0.3 σ₁₁ + 0.1×0.7 σ₁₂ + 0.9×0.3 σ₂₁ + 0.9×0.7 σ₂₂,

or

\[
\sigma_{AB} = \begin{pmatrix} 0.1 & 0.9 \end{pmatrix}
\begin{pmatrix} 0.16 & -0.3 \\ -0.3 & 0.25 \end{pmatrix}
\begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix}
= 0.0603 .
\]
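The two quadratic-form computations above can be reproduced in a few lines of NumPy, as in this sketch:

```python
import numpy as np

V = np.array([[0.16, -0.30],
              [-0.30, 0.25]])
w_A = np.array([0.1, 0.9])
w_B = np.array([0.3, 0.7])

var_B = w_B @ V @ w_B      # 0.0109
cov_AB = w_A @ V @ w_B     # 0.0603
print(var_B, cov_AB)
```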

A matrix M is positive (negative) definite if for any vector x ≠ 0, xᵀMx > (<) 0. M is positive (negative) semidefinite if for any vector x, xᵀMx ≥ (≤) 0. Thus a covariance matrix V is always positive semidefinite, and it is positive definite unless some non-trivial portfolio of the securities has zero variance.

Now, let

\[
\tilde R_{2\times 1} = \begin{pmatrix} \tilde r_1 \\ \tilde r_2 \end{pmatrix},
\]

where the matrix elements can be random variables or functions. The mean of r̃_i is its expected value E(r̃_i), or μ_i. Let

\[
M = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} .
\]

The covariance matrix for r̃₁, r̃₂ is

\[
V = E\!\left\{(\tilde R - M)(\tilde R - M)^T\right\}
= E\!\left\{\begin{pmatrix} \tilde r_1 - \mu_1 \\ \tilde r_2 - \mu_2 \end{pmatrix}
\begin{pmatrix} \tilde r_1 - \mu_1 & \tilde r_2 - \mu_2 \end{pmatrix}\right\}
= \begin{pmatrix}
E(\tilde r_1 - \mu_1)^2 & E(\tilde r_1 - \mu_1)(\tilde r_2 - \mu_2) \\
E(\tilde r_2 - \mu_2)(\tilde r_1 - \mu_1) & E(\tilde r_2 - \mu_2)^2
\end{pmatrix}
= \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix} .
\]

Note that r̃_i can be interpreted as the return rate of the ith security, or as the return rate in time period i of the same security.
A.7 MATRIX CALCULUS

Suppose

\[
m = \begin{pmatrix} x^2 \\ x^3 \end{pmatrix}, \quad \text{then} \quad
\frac{dm}{dx} = \begin{pmatrix} 2x \\ 3x^2 \end{pmatrix} .
\]

Note that m is a vector and x is a scalar variable. The variance of a portfolio return is wᵀVw. For 2 securities,

σ_p² = w₁²σ₁² + 2w₁w₂σ₁₂ + w₂²σ₂²,

\[
\frac{\partial \sigma_p^2}{\partial w_1} = 2\left(\sigma_1^2 w_1 + \sigma_{12} w_2\right), \qquad
\frac{\partial \sigma_p^2}{\partial w_2} = 2\left(\sigma_{12} w_1 + \sigma_2^2 w_2\right).
\]

We can write

\[
\begin{pmatrix} \dfrac{\partial \sigma_p^2}{\partial w_1} \\[2mm] \dfrac{\partial \sigma_p^2}{\partial w_2} \end{pmatrix}
= \begin{pmatrix} 2(\sigma_1^2 w_1 + \sigma_{12} w_2) \\ 2(\sigma_{21} w_1 + \sigma_2^2 w_2) \end{pmatrix},
\quad \text{or} \quad
\frac{\partial}{\partial w}\left(w^T V w\right) = 2 V w ,
\]

where w is a vector.
If we take second-order derivatives:

\[
\frac{\partial}{\partial w_1}\!\left(\frac{\partial \sigma_p^2}{\partial w_1}\right) = 2\sigma_1^2, \quad
\frac{\partial}{\partial w_2}\!\left(\frac{\partial \sigma_p^2}{\partial w_1}\right) = 2\sigma_{12}, \quad
\frac{\partial}{\partial w_1}\!\left(\frac{\partial \sigma_p^2}{\partial w_2}\right) = 2\sigma_{21}, \quad
\frac{\partial}{\partial w_2}\!\left(\frac{\partial \sigma_p^2}{\partial w_2}\right) = 2\sigma_2^2 .
\]

We can write

\[
\begin{pmatrix}
\dfrac{\partial}{\partial w_1}\!\left(\dfrac{\partial \sigma_p^2}{\partial w_1}\right) & \dfrac{\partial}{\partial w_2}\!\left(\dfrac{\partial \sigma_p^2}{\partial w_1}\right) \\[2mm]
\dfrac{\partial}{\partial w_1}\!\left(\dfrac{\partial \sigma_p^2}{\partial w_2}\right) & \dfrac{\partial}{\partial w_2}\!\left(\dfrac{\partial \sigma_p^2}{\partial w_2}\right)
\end{pmatrix}
= \begin{pmatrix} 2\sigma_1^2 & 2\sigma_{12} \\ 2\sigma_{21} & 2\sigma_2^2 \end{pmatrix}
= 2V .
\]

Let x_{n×1} be a vector of random variables, each of which has a normal distribution: X_i ∼ N(μ_i, σ_i²), with cov(X_i, X_j) = σ_ij. More generally, the vector x_{n×1} itself has a multivariate normal distribution, x_{n×1} ∼ N(μ_{n×1}, V_{n×n}), where μ_{n×1} = [μ_i], V_{n×n} = [σ_ij], and

x_{n×1} = (X₁, X₂, ..., X_n)ᵀ.

If D_{m×n} has rank m (m ≤ n), and the vector y = Dx, then

y_{m×1} ∼ N(Dμ, DVDᵀ)

and dim(DVDᵀ) is m × m.
A sum of n independent squared standard normal random variables is distributed as χ² with n degrees of freedom. For univariate random variables X_i ∼ N(μ_i, σ_i²),

\[
Z_i = \frac{X_i - \mu_i}{\sigma_i} \sim N(0,1), \qquad
\sum_{i=1}^{n} Z_i^2 \sim \chi^2_n .
\]

For a multivariate random variable z_{n×1} ∼ N(0, I_n),

\[
\sum_{i=1}^{n} Z_i^2 = z^T z \sim \chi^2_n .
\]

In general, if x_{n×1} ∼ N(M, V), then

\[
(x - M)^T V^{-1}(x - M) \sim \chi^2_n .
\]

A special case is

\[
V = \mathrm{diag}\left[\sigma_i^2\right]
= \begin{pmatrix}
\sigma_1^2 & & & 0 \\
& \sigma_2^2 & & \\
& & \ddots & \\
0 & & & \sigma_n^2
\end{pmatrix}.
\]

Then the above becomes

\[
\sum_{i=1}^{n}\left(\frac{X_i - \mu_i}{\sigma_i}\right)^2 \sim \chi^2_n .
\]

FURTHER RECOMMENDED READINGS


There are numerous good books on linear and matrix algebra. For more
materials, you may refer to the following.
[1] James W. Demmel, Applied Numerical Linear Algebra, Society for
Industrial & Applied Mathematics, September 1997.
[2] Gilbert Strang, Linear algebra and its applications, 3rd ed., International
Thomson Publishing, 1988.
[3] Dhrymes, P.J., Mathematics for Econometrics, Springer-Verlag, 1978.


Appendix B
EVIEWS GUIDE
EVIEWS is a statistical software package that works in Window environment
and is used mainly for econometrics and regression analyses. It has built-in
functions for many econometric procedures and easy data handling tools.
Different versions of EVIEWS may have slightly different bells and whistles
though the generic framework of the statistical software should be similar.
To start EVIEWS, click the EVIEWS program icon and open into the
following blank Workfile. All works and outputs are done through a loaded
workfile.

In order to bring data from an Excel spreadsheet or other format into the
EVIEWS program for econometrics processing, a new loaded Workfile
has to be first created. Do this by
(1)

clicking on the main menu File , then (choose and click) New . Then
Workfile. The following submenu will appear to prompt where and
how to locate data in the PC in order to load them into the new workfile.

396

(2) Suppose our Excel datafile is OCBC daily return data.xls with 5
columns of data and 1306 rows (first row is non-data and can be
excluded during loading). One way is to specify undated or irregular
data, and enter start observation 1 and end observation 1305. Then
press OK.
(3) EVIEWS will then open the following worksheet displayed as
follows. Note that by default, two time series c and resid will be
initialized and appear, but they basically do not contain information at
this point. The program file below contains the upper space:
programming window for writing programs, and the worksheet
below to contain objects such as the two time series.
(4) We can save the work file for future work, or re-save (save) the latest
update as work progresses in EVIEWS. This is done by clicking File
(on the main menu) Save As (and proceed to name the file to be
saved in the appropriate PC directory). This can be recalled later for
use.

[Screenshot: the EVIEWS workfile, with the programming window at the top and the worksheet of objects below.]

(5) Next, load the external data into EVIEWS program itself. This is done
by clicking PROCS (procedure) in the submenu Import Read
Text-Lotus-Excel , and then pointing to the location of the Excel
datafile, e.g.: OCBC daily return data.xls. The Excel Spreadsheet
Import menu will appear see next diagram (note that the external
Excel datafile has to be closed before this works). At this point it is
instructive to know the structure of the Excel datafile. It looks as
follows. The 5 columns of time series data are
Row    A: Calendar Date    B: Date Value    C: Day Number    D: Day        E: OCBC daily returns
2      10/27/1997          35730            1                Monday        0
3      10/28/1997          35731            2                Tuesday       -0.053708912
4      10/29/1997          35732            3                Wednesday     0.020498102
...
1305   10/24/2002          37553            1824             Thursday      -0.008888947
1306   10/25/2002          37554            1825             Friday        -0.00896867

(6) Next specify the number of observations and enter the start Excel cell
name of the first observation to be loaded onto EVIEWS workfile.
Enter A2 accordingly. Enter the Excel sheet name if there is more
than one sheet in the Excel datafile. Enter the variable names of the
5 variables (each occupying a column) and separate each name by a
space. Then press OK. (Note that the length of time series for each
variable, 1 to 1305, has been entered earlier in (2), and now appears in
the Import Sample space.)

(7) By now, the 5 variable time series would be loaded. They appear as the
additional objects cadate, datev, day, dayno, and return in the
worksheet. We can think of objects as sheets of paper with data
series. To look at a particular object e.g. return (or to look at a sheet
of paper), we click on the object = return, and then under the submenu
of series: RETURN Workfile , click View Spreadsheet to see the
data series of the variable return. This is shown in diagram
SPREADSHEET below.
(8) On the same submenu, click View Graph Line to see the line
graph of the variable return. This is shown in diagram LINEGRAPH
below.
(9) On the same submenu, click View Descriptive Statistics
Histogram and Stats to see the histogram of the variable return. This
is shown in diagram HISTOGRAM below.

[Screenshot SPREADSHEET: the spreadsheet view of the return series.]

[Screenshot LINEGRAPH: the line graph of the return series.]

[Screenshot HISTOGRAM: the histogram and descriptive statistics of the return series.]

(10) We can place the mouse pointer on either Range or Sample in the
worksheet space just below the submenu and then double-click to
change the range or sample inputs, e.g. from 1 to 200 (instead of the
present 1 to 1305) if we wish to work with a smaller sub-sample.
(11) An important resource in the EVIEWS program is to click on the main
menu Help Eviews Help Topics Index in order to learn the
details of certain commands etc.
(12) Now we will learn how to generate additional time series of variables.
We can either type directly on the programming window series
s=@sum(return), then hit carriage return, or we can click the Genr
(generate) in the workfile submenu, and type s=@sum(return) , then
click OK. Both methods will yield the same output series s which now
appears as an object in the worksheet. It is a series sample size 1305
with each number equal to the sum of the 1305 return rates. Note that if
we use the programming window to output new series, we have to add
the prefix series to initialize s. See the diagram below.

[Screenshot: generating a new series via Genr.]

(13) We can also generate independent new time series as follows. Typing in
the programming window:
series x=0.03+@sqr(0.5)*nrnd
generates a series named x which is a random draw from a normal
distribution with mean 0.03 and variance 0.5.
Then generate another series y that is correlated with x by typing in
the programming window:

series y=0.005+0.5*x+@sqr(0.125)*nrnd
The last term above @sqr(0.125)*nrnd is really an independent
normal r.v. e or normal white noise with mean 0 and variance 0.125.
Mean of y is 0.005+0.5*E(x)+E(e) = 0.005+0.5*0.03+0 = 0.02. The
variance of y is (0.5)2*var(x)+var(e) = 0.25*0.5+0.125 = 0.25.
The covariance between x and y is cov(x,y) = 0.5*var(x) = 0.25.
Indeed, y=0.005+0.5*x+e itself forms a linear regression model
associating x with y. If we write y = a + bx + e, then a linear
regression estimation should yield estimate of a close to 0.005, and

estimate of b close to 0.5. We will use EVIEWS to examine the
plausibility of this construction next.
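For readers who prefer to verify the same construction outside EVIEWS, a rough Python equivalent of the simulation in step (13) and the regression in step (18) is sketched below (the seed and sample size are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1305
x = 0.03 + np.sqrt(0.5) * rng.standard_normal(n)                 # x ~ N(0.03, 0.5)
y = 0.005 + 0.5 * x + np.sqrt(0.125) * rng.standard_normal(n)    # y = 0.005 + 0.5 x + e

# OLS of y on a constant and x
X = np.column_stack([np.ones(n), x])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(a_hat, b_hat)         # close to 0.005 and 0.5
print(x.mean(), y.mean())   # close to 0.03 and 0.02
print(np.cov(x, y)[0, 1])   # close to 0.25
```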
(14) Once the x and y series are generated and appear in the worksheet, we
can create a new object called Group in order to perform some crossvariable analysis. To create the Group, first highlight the variables on
the worksheet that are to be put into the Group we choose series x
and y that have just been formed hold down Ctrl button and select
the variables to highlight.
Then on workfile submenu click Objects New Object Group
(at this stage, leave the Name for Object box as it is, i.e. Untitled)
OK. Then a Series List box will appear, with x y in the box
confirming the chosen series x and y to be grouped. Click OK. Then the
Group XY: workfile will appear showing the spreadsheet containing
only the columns of x and y. Click Name on the submenu of the Group
XY: workfile and proceed to give a name to this new Group object as
xy, upon which xy will appear in the worksheet as the new object.
We can also perform a shortcut to the above by highlighting the variables x and y to be grouped, then right-clicking and choosing Open as Group.

(15) Once the Group object xy is created, we can click on it. The Group: XY
workfile will be displayed. Click on its submenu View Descriptive
Stats Common Sample, and obtain the statistics as shown in the
diagram below. The sample mean of Y and X are highlighted, and are
shown to be close to 0.02 and 0.03 as expected. Moreover, the sample
standard deviations of Y and X are 0.496 and 0.721 respectively, giving
variances of about 0.25 and 0.5 as expected.

View

(16) We produce scatterplots using group objects. On Group: XY workfile


submenu, click View GraphScatterSimple Scatter to see the
following graph SCATTERPLOT.
(17) If we click View Covariances Common Sample, we obtain output
of the following covariance matrix and we see that cov(x,y) is about 0.25
as expected.

        Y                 X
Y       0.246229917878    0.248612205356
X       0.248612205356    0.519961054164

[Screenshot SCATTERPLOT: simple scatterplot of Y against X.]

(18) Finally we want to perform a simple linear regression to verify if


indeed the construction y = 0.005 + 0.5*x + e, e ~ N(0, 0.125), is done
correctly. At the worksheet submenu, click Objects New Object
Equation (at this stage, leave the Name for Object box as it is, i.e.
Untitled) OK , and the Equation Specification box will appear.
Type y c x to represent r.v. y regressing on constant c, and
explanatory r.v. x. Note the space between the variables. Click OK and
obtain the Equation output file. Click Name to name this output file so it
becomes an object in the worksheet that can be called up later for
review. From the Equation output file , we see that the regression
constant C = 0.0065 is about 0.005, and the slope coefficient estimate
= 0.478 is about 0.5 as expected.
(19)
As a further step of understanding linear regression, we call up the
stored Equation object in (18) that was named regress. Then under
the Equation: REGRESS Workfile, click View Actual, Fitted,
Residual Residual Graph, and observe the residual of the fitted
regression. We can also perform other tests e.g. click View Residual
Tests Correlogram - Q-statistics.

[Screenshot REGRESSION: the Equation (REGRESS) output file.]

[Screenshot RESIDUAL TEST: the residual graph and residual tests of the fitted regression.]

Appendix C
LINEAR REGRESSION IN EXCEL
C.1

INTRODUCTION

Many of the regressions done in this book can be performed using the Data
Analysis subroutine in EXCEL. To enable the Data Analysis subroutine, click
on the main menu Tools in EXCEL. Then click Add-Ins Next tick the
relevant toolboxes required. In this case, tick Analysis ToolPak and also
Solver Add-in. In the Microsoft Office 2007 version, go to the Office button,
click and choose Excel Options and then find the Add-Ins.
There are two ways of performing linear least square regression using
Excel.
(a) From main menu of Excel, click Tools Data Analysis Regression
(b) From main menu of Excel, click Insert (choose and click) Functions
Statistical Linest
The two methods are applicable for both simple and multiple linear regression analysis. We demonstrate here the procedures in (a) for a linear regression on a dataset.
C.2

DATA ANALYSIS REGRESSION TOOLS

In Excel, open whatever dataset to be examined. From the Tools menu,


choose Data Analysis. In the Data Analysis dialog box, scroll the list box,
select Regression.

Click OK. Excel provides several options that determine how the regression output is displayed. We will highlight some of these.

In the Regression input box, enter


(1) Input Y Range: the dependent variable values.
(2) Input X Range: the explanatory variable values.
(3) Labels: Tick or Select if the first row or column of your input range or
ranges contains labels. Otherwise leave it blank as in this case.
(4) Constant is Zero: Select to force the regression line to pass through the
origin (i.e. zero regression intercept), or else leave it blank.
(5) Confidence Level: Select to include an additional level in the summary
output table. In the box, enter the confidence level you want in addition to
the default 95 percent level. As an illustration, we choose here additional
99 percent confidence level.
(6) Residuals: Select to include residuals in the residuals output table.
(7) Standardized Residuals: Select to include standardized residuals in the
residuals output table.
(8) Residual Plots: Select to generate a chart for each independent variable
versus the residual.
(9) Line Fit Plots: Select to generate a chart for predicted values versus the
observed values.


Before we proceed to the regression output file, Excel is used to show the line
graph of the two time series of oil price and rig counts.

In the New Worksheet Ply: Select and enter the new sheet name regression
output so that the output is stored in the new sheet that will be created. Click
OK. The output file in the regression output sheet is shown below.

Other outputs in the regression output sheet include:


In this regression example, the dependent variable is the number of oil rigs
produced by a company, and the explanatory variable is the price of oil. We
would expect that if the price of oil increases, then the company will find it
profitable to produce more oil rigs as there would be a higher demand for oil
rigs to harvest more oil.
C.3

WORKING WITH EXCEL VBA

EXCEL is also packaged with VBA, or Visual Basic for Applications, which is essentially a BASIC language that, when coupled with the Microsoft EXCEL spreadsheet program, enhances the latter's functionality and applicability for more complex computational problems beyond just spreadsheet calculations. VBA is largely about trying to automate repeated calculations using short-cut programs. These programs are either user-defined VBA programs or they employ Macros (or subroutines) in VBA. VBA is an Object-Oriented programming language. Each Excel object, e.g. worksheets, charts, ranges, etc., represents a functionality. The objects are usually arranged in a hierarchy. For example, the range A1:C10 in EXCEL can be represented as an object named "data". This is referenced hierarchically using Application.Workbooks("Distributions.xls").Sheets("lognormal").Range("data"), i.e. we have to first point to the Excel file, then the correct worksheet within that file, then the correct data or range.


Appendix D
MULTIPLE CHOICE QUESTION TESTS
D.1

TEST ONE

Please circle the alphabet associated with the most appropriate answer.
1. If we run an Ordinary Least Squares regression of Yi = a + bXi + ei where
ei is a white noise that is not correlated with Xi, and suppose sample
averages for Y and X are 3 and 4 respectively. Moreover, a is found to be
1. What is the value of b ?
(a)
0.5
(b)
1.0
(c)
4/3
(d)
Indeterminate from the given information
2. In the CAPM model, the SML line refers to the regression of:(a)
(b)
(c)
(d)

capital asset prices over model


security expected return over standard deviation
security expected return over beta
security expected return over market excess return

3. The CAPM cannot be derived from the following assumptions:(a)


(b)
(c)
(d)

market model
joint multivariate normal returns
quadratic utility
positive alphas

4. Suppose a firm generates a perpetual after tax annual earnings per share of
$2 which are distributed as dividends on a stock with required return of
10% p.a. by the market. A new technology allows the firm to retain 50%
earnings each year to reinvest in the new technology with a higher return
of 20%. Would the new share price be:
(a)
(b)
(c)
(d)

<$10
$10
$20
>$20

5. If we wish to estimate beta j according to the CAPM model, we should
strictly apply linear regression of
(a)
(b)
(c)
(d)

stock j return on constant and market return


stock j return on market return
stock j excess return on constant and excess market return
stock j excess return on excess market return

6. Since CAPM deals with real and not nominal return rates, i.e. beta is
covariance of real returns divided by variance of real market return, then
using excess returns for regression is suitable in this case since it
(a)
(b)
(c)
(d)

removes the expected inflation component


removes the market component
removes the need for a constant
removes an additional degree of freedom

7. To estimate λ in r_mt − r_ft = λσ_mt + e_mt, we can perform an OLS of
(a) regressing r_mt on r_ft and σ_mt
(b) regressing r_mt on r_ft and a constant
(c) regressing r_mt − r_ft on a constant and σ_mt
(d) regressing r_mt − r_ft on σ_mt

8. In general, when we apply OLS regression to Y_t = bX_t + u_t (no constant), and u_t satisfies the classical conditions, then

\[
\min_{b}\sum_t\left(Y_t - bX_t\right)^2 \;\Rightarrow\; \sum_t X_t\left(Y_t - \hat b X_t\right) = 0 \;\Rightarrow\; \hat b = \frac{\sum_t X_t Y_t}{\sum_t X_t^2} .
\]

If we consider the OLS estimator random variable

\[
\hat b = b + \frac{\sum_t X_t u_t}{\sum_t X_t^2} ,
\]

then the OLS estimator is
(a) incorrect
(b) biased
(c) unbiased
(d) degenerate
9. The purpose of computing cost of capital is not
(a)
(b)
(c)
(d)

to set a hurdle rate to evaluate or appraise firms project feasibility


to provide a fair valuation of firms total capital
to set fair utility rates in the case of regulated utility firms
to test the CAPM

10. The stock price according to the dividend growth model can be very
volatile not because of the following reason:
(a)
(b)
(c)
(d)

The expected dividends change over time because of new


information
The discount rate changes over time because of new information
The expected earnings change over time because of new
information
The expected model changes over time because of new
information

Answer Key:
1a, 2c, 3d, 4d, 5d, 6a, 7d, 8c, 9d, 10d

D2.

TEST TWO

Please circle the alphabet associated with the most appropriate answer.
1. For a random walk process on price, the forecast (or conditional
expectation) of return rate next time period t+1, given all information up
to time t, is
(a)
(b)
(c)
(d)

zero
constant
uncertain, dependent on information at t
white noise

2. The general implication of random walk stock price processes is that


(a)
(b)
(c)
(d)

the market is inefficient


stock prices do not have probability distributions
one can never make positive profits every period
one can never consistently outperform the market

3. The predictability of stock returns would appear most applicable in the


context of
(a)
(b)
(c)
(d)

forecasting the daily time trend of the stock return


forecasting the dividend growth of the stock
forecasting the short-term price variation of stock
forecasting the long-term return variation of stock

4. The predictability of long-run stock returns is not due to


(a)
(b)
(c)
(d)

the persistence of long-run returns


the persistence of dividend yield
the availability of superior information
serial correlation in dividend growth

5. In a regression of T-year future excess return on current dividend/price


variable,
(rt+1+rt+2+..rt+T) = a + b (Dt/ Pt) + et
where residual noise et is independent of (Dt/Pt), supposing the variance of
the T-year future excess return increases with T,

(a)
(b)
(c)
(d)

OLS estimates a, b remain constant regardless of T


OLS estimate b increases
OLS estimate a may increase depending on whether var(et)
remains constant as T increases
OLS estimates a, b should change as T increases.

6. In a regression of 10-year future excess return on current dividend/price


variable, the estimated coefficient of 0.3 is significantly different from
null of zero at 5% significance level. In a separate regression, the 10-year
future excess return is also significantly explained by the interest term
premium variable with a coefficient of 0.2. Suppose another regression of
10-year future excess return is now performed on both these variables, the
resulting estimated coefficients may now not be significant because of
(a)
(b)
(c)
(d)

wrong specification
asymptotic errors
multi-collinearity
errors-in-variables

7. In a regression of future excess return on current dividend/price variable,


suppose the errors or disturbances are not contemporaneously correlated
with dividend yields, but are serially correlated, then the OLS estimates
will be
(a)
(b)
(c)
(d)

biased and consistent


unbiased but inconsistent
unbiased and consistent
wrong because of wrong t-statistic

8. If the variance ratio statistic VR(q) is less than 1, then it is quite likely that
when one regresses 5-year returns on its lag, the coefficient of the slope is
(a)
(b)
(c)
(d)

zero
positive
negative
infinite

9. The following statement makes the most consistent sense


(a)
(b)

trough of business cycle, high risk premium, high future returns


peak of business cycle, high risk premium, low future returns

(c)
(d)

trough of business cycle, high dividend yield, high future dividend


growth
peak of business cycle, low risk premium, low future dividend
growth

10. Which of the following is not true about holding stocks over a longer
horizon
(a)
(b)
(c)
(d)

more risky because uncertainty in the longer horizon is greater


less risky because of mean reversion in the business cycles
more predictable direction because of mean reversion in the
business cycles
less predictable because the variance of long-run return grows
linearly or approximately so

Answer Key:
1b, 2d, 3d, 4c, 5b, 6c, 7c, 8c, 9a, 10c

D3.

TEST THREE

Please circle the alphabet associated with the most appropriate answer.
1. In a financial event study, the time line is usually broken up into 2
adjacent blocks called the
(a)
(b)
(c)
(d)

measurement (estimation) period | event period (window)


pre-event period | post-event period
measurement (estimation) period | computation period
uneventful window | eventful window

2. The event period is made up of 3 components:


(a) measurement period | event window | post-event window
(b)
pre-announcement window | announcement (event) date | post
announcement window
(c) estimation period | event period | post-announcement window
(d) AR period | AAR period | CAAR period
3. One of the following is not suitable as characterizing deviation from
expected or normal return
(a)
(b)
(c)
(d)

CAPM model expected return


market-adjusted excess return
mean-adjusted excess return
market model abnormal return

4. The market model parameters and for each stock can be estimated
consistently (and also BLUE in finite sample) using OLS
(a)
(b)
(c)
(d)

Yes
No
Not sure
Sometimes

5. In event study, the event day +3 refers to


(a)
(b)
(c)
(d)

a fixed calendar date e.g. 5 March 2000


3 days into the sampling period
3 days after announcement date of any stock with the same event
3 days after announcement date of only one specific stock

6. In selecting different stocks (or sometimes same stock but at different
times e.g. different years in a bonus shares event) for a common event
study e.g. bonus shares issue, it is important to ensure as far as possible
that their calendar dates do not cluster together. The following is not a
reason for this non-clustering.
(a)
(b)
(c)
(d)

This will avoid impact of confounding systematic events such as


911
This will help the ARits across the different is to be independent
so that variance of AAR is simplified
This will avoid the market movement influencing all stocks at the
same time
This will enable more data points to be available

7. ARit is the abnormal return of stock i at event time t, and suppose
   ARit ~ N(0, σi²). If AARt = (1/N) Σ_{i=1}^{N} ARit is the average (or aggregated)
   abnormal return of stocks i = 1, 2, …, N, all at event time t, then AARt has
   distribution (assuming independence of AR across stocks)
   (a) N(0, σi²/N)
   (b) N(0, σi²/N²)
   (c) N(0, (1/N) Σ_{i=1}^{N} σi²)
   (d) N(0, (1/N²) Σ_{i=1}^{N} σi²)

8. If we test the null hypothesis H0 using
       CAAR(τ1, τ2) / √var(CAAR(τ1, τ2)) ~ N(0, 1)
   for each τ2 = τ1, τ1+1, τ1+2, …, within the event period, what is the
   null hypothesis?
   (a) that the given event has no impact on the returns process (or more
       specifically the abnormal returns) up to event day τ2
   (b) that the given event has significant impact on the returns process
       (or more specifically the abnormal returns) up to event day τ2
   (c) that the given event has no impact on the returns process (or more
       specifically the abnormal returns) every day within (τ1, τ2)
   (d) that the given event has significant impact on the returns process
       (or more specifically the abnormal returns) every day within (τ1, τ2)
9. Given the following diagram of CAAR versus event day, what would you
   infer about the event if the standard error of CAAR at (-15, 0) is 2%? (The
   event is a type of financial announcement. Assume the model you used to
   compute abnormal returns is correct.)

   [Figure: CAAR plotted against event day t from -15 to +15; levels of 5% and 2%
    and a 2 s.d. band are marked.]

   (a) The event does not carry information and the market is efficient
   (b) The event carries insignificant information and the market is efficient
   (c) The event carries significant information and the market is efficient
   (d) The event carries significant information and the market is inefficient

10. Given whatever information that is available in the Figure in Question 9,
    it would appear the information effect there is:
    (a) permanent because there is a permanent change in the values of the stocks
    (b) permanent because there is a permanent change in the daily abnormal
        returns of the stocks
    (c) transitory or temporary because of a temporary change in the values of
        the stocks (e.g. price pressure which is followed by reversion)
    (d) transitory or temporary because there is a temporary change in the daily
        abnormal returns of the stocks

Answer Key:
1a, 2b, 3a, 4a, 5c, 6c, 7d, 8a, 9c, 10a

D.4

TEST FOUR

Please circle the alphabet associated with the most appropriate answer.
1. An analyst suspected that for some strange reasons, the daily return rates
   to some stocks are usually lower on Fridays and higher on other days of
   the week. To test this hypothesis, he collected the daily return data rt of a
   particular stock, and performed the following linear regression using
   ordinary least squares.
       rt = c1 + c2 I1 + c3 I2 + c4 I3 + c5 I4 + et
   where the ci's are the regression constant coefficients, et is the disturbance
   that is assumed to be i.i.d., and
       I1 = 1 if it is Monday, 0 otherwise
       I2 = 1 if it is Tuesday, 0 otherwise
       I3 = 1 if it is Wednesday, 0 otherwise
       I4 = 1 if it is Thursday, 0 otherwise
   It is noted that trading takes place only on week days. If he had included
   another dummy variable
       I5 = 1 if it is Friday, 0 otherwise
   in the above regression, the outcome of the OLS would be
   (a) BLUE and consistent
   (b) misspecified with missing dummy variables
   (c) stable
   (d) no solution because of singular matrix

2. The p-value of an estimated coefficient is
   (a) The probability of observing the coefficient
   (b) The probability of observing values more extreme than the estimated coefficient
   (c) The probability of rejecting the null hypothesis
   (d) The probability of accepting the null hypothesis

3. The reported p-values for the t-statistics of c1, c2, c3, c4, c5 are 0.03, 0.11,
   0.08, 0.15, and 0.02 respectively. Based on test at 5% significance level
   for a 2-tail test, we can
   (a) reject H0: c1=c2=c3=c4=c5=0
   (b) reject H0: c2=0, H0: c3=0, H0: c4=0
   (c) reject H0: c1=c5=0
   (d) reject H0: c1=0, H0: c5=0

4. In the regression rt = c1 + c2 I1 + c3 I2 + c4 I3 + c5 I4 + et of Q1, suppose the
   estimated equation is:
       rt = -0.0001 + 0.0005 I1 + 0.0002 I2 + 0.0001 I3 + 0.0003 I4,
   what is the expected return of the stock on a Thursday?
   (a) 0.02%
   (b) 0.03%
   (c) -0.01%
   (d) none of the above

5. A currency speculator was trying to understand the following spot-forward
   relationship of C$ (versus US$). According to the unbiased
   expectations hypothesis
       Ft,t+3 = Et(St+3) + ρt,t+3
   where Ft,t+3 is the forward 3-month C$ per US$ as at time t, St+3 is the
   future spot rate at t+3 months, Et(.) denotes conditional expectation given
   all market information current at t, and risk premium ρt,t+3 = 0.
   Which of the following is a random variable at t?
   (a) ρt,t+3
   (b) Ft,t+3
   (c) St+3
   (d) Et(St+3)

6. If we restate the unbiased expectations hypothesis (UEH) as
       E(St+3 | Ft,t+3, St, Ft,t+2, Ft,t+1, Ft-1,t+2, St-1, …) = Ft,t+3,
   which of the following specifications is a linear regression consistent with
   the hypothesis?
   (a) St+3 = c0 + c1 St+2 + c2 St+1 + c3 St + c4 Ft,t+3 + et+3
   (b) St+3 = c0 + c1 St+2 + c2 St+1 + c3 St + c4 Ft,t+3 + et
   (c) St+3 = c0 + c1 St + c2 St-1 + c3 St-2 + c4 Ft,t+3 + et+3
   (d) St+3 = c0 + c1 St + c2 St-1 + c3 St-2 + c4 Ft,t+3 + et

7. There are typically many specifications that are consistent. The following
   is one. What restrictions on the regression coefficients and disturbance are
   implied by the UEH?
       St = c0 + c1 Ft-k,t + et ,  k > 0
   (a) c0  0, c1  0, E(et | Ft-k,t)  0
   (b) c0  0, c1  0, E(et | Ft-k,t)  0
   (c) c0  0, c1  0, E(et | Ft-k,t)  0
   (d) c0 = 0, c1 = 1, E(et | Ft-k,t) = 0

8. Suppose we run the linear regression
       Ft,t+3 = c0 + c1 Êt(St+3) + et
   where Ft,t+3 is a forward 3-month rate as in Q5, and Êt(St+3) is an
   unbiased estimator at t of Et(St+3). What are the problems in such an
   OLS regression?
   (a) measurement error, contemporaneous correlation, bias but consistent
   (b) measurement error, non-contemporaneous correlation, bias but consistent
   (c) measurement error, contemporaneous correlation, bias, inconsistency
   (d) measurement error, non-contemporaneous correlation, bias, inconsistency

9. Suppose we run a time series regression Yt = a + b Xt + et , and we suspect
   that the disturbances {et} have a diagonal covariance matrix with the j-th
   diagonal element equal to j²σ². To obtain BLUE estimates of a and b, we could
   instead run the following corrected OLS regression:
   (a) Yt/j = a + b (Xt/j) + et/j
   (b) Yt/j = a/j + b (Xt/j) + et/j
   (c) Yt/j² = a + b (Xt/j²) + et/j²
   (d) Yt/j² = a/j² + b (Xt/j²) + et/j²

10. Suppose you ran a regression Yt = a + b Xt + c Zt + et using OLS, and then
    estimated the residuals as êt = Yt − â − b̂ Xt − ĉ Zt. What patterns of the
    estimated residuals would lead you to suspect a heteroskedasticity
    problem?
    (a) large variations in êt² centering around zero
    (b) large persistent increases in êt² away from zero
    (c) large variations in êt centering around zero
    (d) large persistent increases in êt away from zero

Answer Key:
1d, 2b, 3d, 4a, 5c, 6c, 7d, 8c, 9b, 10b
Answers to Q5, Q6, and Q8 are explained in more detail as follows.
A5.  At t, the forward rate Ft,t+3 is already known, so it is not a random
     variable. At t, Et(St+3) is a conditional expectation based on
     information at t, so this expected value is going to be in terms of
     variables observed at t, i.e. already known at t, so it is not a random
     variable at t. ρt,t+3 would have been a random variable in a more
     general model, but specifically for UEH, this is restricted to zero at all
     times, so at t, it is zero and not a random variable. (c) is clearly a
     random variable since at t, St+3 is the future spot rate and has a
     probability distribution.

A6.

(a) is incorrect because the UEH here is about forward being an


unbiased estimator of future spot. Here Ft,t+3 is 3-month forward
trying to predict future St+3. If we take conditional expectation with
respect to information at t, E(St+3|Ft,t+3, St, St-1, etc.) = c0+c1E(St+2|info
at t)+c2E(St+1|info at t) +c3St+c4Ft,t+3+0, since E(et+3| info up to t) = 0,
i.e. et+3 is not known at t, and cannot be predicted at t, has mean 0. But
c1E(St+2|info at t) and c2E(St+1|info at t) are according to UEH equal to
Ft,t+2 and Ft,t+1 respectively. Though we may restrict c1=c2=c3=0, the
Ft,t+2 and Ft,t+1 will correlate highly with Ft,t+3 and thus the regressors
St+2 , St+1 will create unnecessarily high multi-collinearity with Ft,t+3
(since Ft,t+3 is forward looking and overlaps with future St+2 and St+1).
(b) is incorrect given (a) and also the et will appear in E(St+3|Ft,t+3, St,
St-1, etc.) = c0+c1E(St+2|info at t)+c2E(St+1|info at t) +c3St+c4Ft,t+3+et .
(d) is incorrect because if we take conditional expectation of St+3 with
respect to all information up to time t, then E(St+3|Ft,t+3, St, St-1, etc.) =
c0+c1St+c2St-1+c3St-2+c4Ft,t+3+et (including et which becomes known at
time t), and clearly this is not UEH given in the statement even if we
put c0=c1=c2=c3=0 and c4=1, because of et.
This leaves us with (c), E(St+3|Ft,t+3, St, St-1, etc.) = c0+c1St+c2St-1+c3St-2+c4Ft,t+3+0, since E(et+3| info up to t) = 0, i.e. et+3 is not known at t, and
cannot be predicted at t, has mean 0. We can test c0=c1=c2=c3=0 and
c4=1. If so, UEH is established. This is therefore a testable hypothesis
leading to rejection or else evidence of UEH.

A8.  Using Êt(St+3), which is just an estimator of Et(St+3), means that the
     explanatory variable is measured with measurement error (or has an
     errors-in-variable problem). This induces contemporaneous
     correlation between Êt(St+3) and the disturbance et, so that the OLS
     estimators in the regression are biased and also not consistent, hence
     (c).
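A small simulation makes the errors-in-variables point in A8 concrete. The sketch below is a minimal illustration and not part of the original text; the sample size, noise levels and true coefficient values are arbitrary assumptions. It regresses a variable on a noisily measured regressor and shows the OLS slope biased toward zero even in a large sample.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000                                      # large sample: the problem is not a small-sample one
    true_x = rng.normal(0.0, 1.0, n)                 # the "true" regressor, e.g. Et(St+3)
    y = 0.0 + 1.0 * true_x + rng.normal(0.0, 0.5, n) # true relation: c0 = 0, c1 = 1

    measured_x = true_x + rng.normal(0.0, 1.0, n)    # regressor observed with measurement error

    slope = np.polyfit(measured_x, y, 1)[0]          # OLS slope of y on the mismeasured regressor
    print(slope)   # about 0.5 = var(x)/(var(x)+var(noise)), not the true value 1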

D.5

TEST FIVE

Please circle the alphabet associated with the most appropriate answer.
1. If Yi = c0 + c1 Xi + ei ,   i = 1, 2, …, N,
   Xi and the zero mean ei are stochastically independent, and ei is not
   homoskedastic (i.e. ei has variance that is different for different i, or is
   serially correlated or autocorrelated, or both), but is instead
   heteroskedastic, then the OLS estimator is
   (a) BLUE
   (b) biased but consistent
   (c) GLS
   (d) unbiased but not efficient

2. If Yt = c0 + c1 Xt + c2 Zt + et ,   t = 1, 2, …, T,
   Xt and the zero mean et are stochastically independent, and et = ρ et-1 + ut where
   ρ ≠ 0, ut is mean zero i.i.d., then GLS can be performed using
   (a) repeated OLS
   (b) estimating ut and transforming Yt, Xt, Zt using this
   (c) estimating ρ and transforming Yt, Xt, Zt using this
   (d) generalizing the covariance matrix of ut and then applying OLS

3. If you test H0: ρ = 0 in Q2 above, the Durbin-Watson d-statistic gives 2.65,
   and at 5% significance level, T = 90, k = 3, the critical values DL = 1.589,
   DU = 1.726, how do you conclude?
   (a) Accept H0
   (b) Inconclusive on H0
   (c) Reject H0, accept negative autocorrelation
   (d) Reject H0, accept positive autocorrelation

4. If you test H0: ρ = 0 in Q2 above, the Durbin-Watson d-statistic gives 2.35,
   and at 5% significance level, T = 90, k = 3, the critical values DL = 1.589,
   DU = 1.726, how do you conclude?
   (a) Accept H0
   (b) Inconclusive on H0
   (c) Reject H0, accept negative autocorrelation
   (d) Reject H0, accept positive autocorrelation

5. Besides the DW d-statistic, what other test statistics do you use to help
   detect serial or autocorrelations?
   (a) Jarque-Bera
   (b) Shapiro-Wilk
   (c) Box-Pierce-Ljung
   (d) Johnson

6. Suppose Yt = c0 + c1 Xt + c2 Zt + et ,  t = 1, 2, …, T,
   and et satisfies the classical conditions. However, in a regression, Zt was
   omitted. If Zt = φ Zt-1 + ut where φ ≠ 0, and ut is i.i.d., is the DW d-statistic in
   the above likely to be
   (a) close to 0
   (b) close to 2
   (c) different from 2
   (d) cannot be computed

7. Is the DW d-statistic appropriate when the explanatory variables contain a
   lagged endogenous or lagged dependent variable? (in which case the OLS
   estimator will be biased and inconsistent if indeed the error is AR(1))
   (a) Yes
   (b) No
   (c) Uncertain
   (d) Sometimes

8. If in Q2, Zt = Yt-1, i.e. a lagged dependent (or lagged endogenous) variable,
   the OLS estimator will be
   (a) BLUE
   (b) unbiased but not consistent
   (c) biased and not consistent
   (d) none of the above

9. How do you get around the problem in Q8 if any?
   (a) OLS
   (b) GLS
   (c) IV
   (d) DW

10. In the test of the UEH, St+k = c0 + c1 Ft,t+k + et+k, based on the joint hypothesis
    H0: c0 = 0 and c1 = 1, which test statistic is used? Note that N is the sample
    size.
    (a) tN-2
    (b) Fk,N-2
    (c) Fk-1,N-2
    (d) F2,N-2

Answer Key:
1d, 2c, 3c, 4b, 5c, 6c, 7b, 8c, 9c, 10d
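The Durbin-Watson conclusions in Questions 3 and 4 follow mechanically once d, DL and DU are known. The short sketch below is an illustrative helper only (the function name and interface are my own, not from the text); it applies the standard decision rule to the two d-values used above.

    def dw_decision(d, dl, du):
        """Durbin-Watson decision rule for H0: rho = 0 (no autocorrelation)."""
        if d < dl:
            return "reject H0: positive autocorrelation"
        if d < du:
            return "inconclusive"
        if d <= 4 - du:
            return "do not reject H0"
        if d <= 4 - dl:
            return "inconclusive"
        return "reject H0: negative autocorrelation"

    # Question 3: d = 2.65, so 4 - d = 1.35 < DL: reject H0, negative autocorrelation (answer c)
    print(dw_decision(2.65, 1.589, 1.726))
    # Question 4: d = 2.35, so 4 - d = 1.65 lies between DL and DU: inconclusive (answer b)
    print(dw_decision(2.35, 1.589, 1.726))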

D6.

TEST SIX

Please circle the alphabet associated with the most appropriate answer.
1. When a stochastic process {Yt} is covariance-stationary, the following
   statement is not true:
   (a) mean is constant at every point in time
   (b) variance is constant at every point in time
   (c) all moments are constant at every point in time
   (d) autocorrelation at lag k is a function of only the variable k

2. Suppose Yt and Zt are I(1) and an OLS regression is run as follows: Yt = c
   + dZt + wt where wt is added as a noise term in the specification. It may
   not be appropriate to perform this OLS of Yt on Zt for all of the following
   reasons except
   (a) it cannot provide unambiguous statistical inference
   (b) it leads to biases in the estimators
   (c) the regression result is spurious
   (d) Yt and Zt may be cointegrated

3. For the covariance-stationary processes such as AR(p), MA(q), or
   ARMA(p,q) [p, q being any reasonable integers],
   (a) conditional mean and conditional variance are constant at each t
   (b) conditional mean and conditional variance change at each t
   (c) conditional mean is constant and conditional variance changes at each t
   (d) conditional mean changes and conditional variance is constant at each t

4. While both a trend stationary process and a unit root process may display
   similar looking trends, their difference is shown by
   (a) only the trend stationary process has a deterministic trend
   (b) only the unit root process has a difference series that is stationary
   (c) unit root process displays increasing volatility over time
   (d) unit root process displays deterministic trend over time

5. If Xt, Yt and Zt are all unit root processes, and we perform OLS regression of
   Xt = a + bYt + cZt + et where et is a disturbance that is independent of all the
   other variables, then
   (a) the estimators of a, b, and c will always be spurious
   (b) the estimators of a, b, and c will always be consistent
   (c) it is not possible for b or c to take any values other than 0
   (d) none of the above

6. We may think of a random walk as a special case of a unit root process
   when the disturbance term is white noise, in which case
   (a) the process variable must have a time trend that increases linearly with time
   (b) the process variable must have a variance that increases linearly with time
   (c) the process variable is unpredictable
   (d) the process variable is trend stationary

7. Suppose we are testing if Yt is a unit root or I(1) process, and we perform
   the following OLS regression ΔYt = α + βt + γYt-1 + εt , where εt is a
   stationary random variable.
   Suppose the critical ADF statistic at 1% significance level is -2.65, and
   the computed γ̂ < 0 is 2.33 standard deviations away from 0, then
   (a) reject null of no unit root
   (b) reject null of unit root
   (c) accept [or cannot reject] null of no unit root
   (d) accept [or cannot reject] null of unit root

8. If a stochastic trend exists in a price process Qt with i.i.d. increments, then
   this is likely to show up as the following with the exception of
   (a) process divergence from zero
   (b) mean reversion
   (c) an increasing variance as time increases
   (d) a correlogram (graph of correlation function against time lags) that
       decays very slowly

9. In the long-run, if PPP holds, then
   (a) exchange rate and the two price indices are integrated processes
   (b) real exchange rate is stationary
   (c) variance of real exchange rate must converge to zero
   (d) exchange rate follows a deterministic trend

10. We can check for cointegration of Yt and Zt in the regression Yt = a + bZt
    + et (where et is independent with zero mean) by employing estimates â,
    b̂, and
    (a) testing if Yt − Zt is stationary
    (b) testing if the t-values of estimates â, b̂ are higher than ADF critical values
    (c) testing using estimates â, b̂ if Yt − â − b̂Zt has a unit root
    (d) testing using estimates â, b̂ if Yt − â − b̂Zt has higher than ADF critical value

Answer Key:
1c, 2d, 3d, 4c, 5d, 6b, 7d, 8b, 9b, 10c
*In Q10, (d) does not necessarily have the ADF distribution used for t-values.

D7.

TEST SEVEN

Please circle the alphabet associated with the most appropriate answer.
1. In the CAPM model, beta is
   (a) Y aX
   (b) cov(Ri, RM)/var(RM)
   (c) corr(Ri, RM)/σM
   (d) (XᵀX)⁻¹(XᵀY)

2. In an OLS linear regression of excess stock i return on excess market
   return, with a sample size of N = 60, if the intercept estimate is 0.025, then
   given prob(t-stat d.f. 58 > 2.002) = 2.5%, there is evidence of positive
   abnormal return on stock i at 2-tailed 5% significance level provided the
   intercept standard error is
   (a) < 0.01
   (b) > 0.01
   (c) < 0.02
   (d) > 0.02

3. In evaluating a project proposal, a firm should use the following as the
   risk adjusted discount rate for the projected cashflows
   (a) the project's required rate of return
   (b) the project's internal rate of return
   (c) the firm's average borrowing cost
   (d) the firm's marginal cost of capital

4. The random walk hypothesis is best described by the postulation that one
   cannot make
   (a) Positive expected profit all the time
   (b) Positive unexpected profit all the time
   (c) Positive expected profit most of the time
   (d) Positive unexpected profit most of the time

5. If a researcher uses Nikkei 225 index futures closing price data at Chicago
   Mercantile Exchange on date YYY, compares those with Nikkei 225
   index futures closing price data at Singapore Exchange on the date YYY,
   and finds significant difference in the notional price, what is the most
   likely reason to explain the difference?
   (a) Existence of arbitrage profits
   (b) Non-synchronous price data from both Exchanges
   (c) Transactions costs
   (d) Differences in Futures regulations

6. The predictability of long-run stock returns is due to
   (a) availability of superior market information
   (b) persistence of dividend yields
   (c) the size effect
   (d) long-run market risk premium

7. What is not a good reason for why a financial event study should avoid
calendar-time clustering of sample firm events?
(a) There may be a systematic event impacting the market, unrelated to
the financial event, that occurred at a calendar time
(b) Correlations across different stocks at the same calendar time may
introduce more sampling errors
(c) The study implications may become conditional on the general
business condition or regime at that calendar time
(d) The market may impact different stocks differently at the same
calendar time
8. In a day-of-the-week test of significant returns, the reported p-values for
   the t-statistics of c1, c2, c3, c4, c5 are 0.06, 0.01, 0.03, 0.09, and 0.04
   respectively. Based on test at 5% significance level for a 2-tail test, we
   can
   (a) reject H0: c2=c3=c5=0
   (b) reject H0: c2=0, H0: c3=0, H0: c5=0
   (c) reject H0: c1=c4=0
   (d) reject H0: c1=0, H0: c4=0

9. In the day-of-the-week regression rt = c1 + c2 I1 + c3 I2 + c4 I3 + c5 I4 + et ,
   where I1, I2, I3, and I4 are dummies for Tuesday, Wednesday, Thursday,
   and Friday respectively, suppose the estimated equation is:
   (a) 0.02%
   (b) 0.03%
   (c) -0.01%
   (d) none of the above

10. If the disturbances are heteroskedastic, we prefer GLS to OLS, wherever
    feasible, because
    (a) OLS is biased
    (b) OLS is inconsistent
    (c) OLS is inefficient
    (d) OLS cannot provide for a test

Answer Key:
1b, 2a, 3a, 4d, 5b, 6b, 7d, 8b, 9b, 10c
* For Q10, OLS can provide the HCCME for test, so it is not d.

D8.
TEST EIGHT
Please circle the alphabet associated with the most appropriate answer.
1. The difference between a unit root process and a trend stationary process
   is
   (a) One has stochastic trend and the other has not
   (b) One has drift and the other has not
   (c) One has deterministic trend and the other has not
   (d) One has coefficient of unit for lagged Y and the other has not

2. If Xt and Yt are unit root processes, and we perform OLS regression of
   Yt = a + bXt + et where et is a disturbance term that is independent of all the
   other variables, then
   (a) unbiased estimate of b should = 0
   (b) estimators of a, b are consistent
   (c) estimators of a, b are spurious
   (d) there is insufficient information to choose any of the above

3. What is not an appropriate finance concept you learn from this course?
   (a) higher (expected) return necessitates taking higher risks
   (b) markets are predictable over longer term than short-term
   (c) returns must be stationary and normally distributed
   (d) risk models help to explain systematic time variations

4. Which is not an appropriate econometrics concept you learn from this


course?
(a) Over-fitting a model may have high R2 but poor forecasting
(b) data must be stationary to be useful for it does not make sense to
have changing probability distributions over time
(c) avoiding estimator bias and inefficiency is a desirable objective
(d) an estimate number by itself does not give much meaning unless we
know its statistical properties
5. In a linear regression where Yt is regressed on its lagged Yt-1 to improve
R2, a key concern with DW d-value close to zero is
(a) heteroskedasticity in the error
(b) contemporaneous correlation
(c) serial correlation in Yt
(d) spurious regression

6. Multi-factor models are:
   (a) related to cross-sectional regressions
   (b) good risk models
   (c) useful for prediction if the factors can be a priori estimated
   (d) all of the above

7. White's HCCME estimator is used for:
   (a) GLS estimation to obtain efficient estimators
   (b) OLS estimation to obtain BLUE estimators
   (c) Covariance matrix estimation to obtain heteroskedasticity
   (d) Covariance matrix estimation to obtain test statistic

8. Which of the following statements is false? A passive equity portfolio
   management strategy is to:
   (a) buy-and-hold
   (b) track an index
   (c) outperform by taking risk
   (d) match market performance

9. Which of the following is not an investment strategy?
   (a) stock-picking
   (b) levered shortsales
   (c) market timing
   (d) sector rotation

10. Index tracking error is caused by:


(a) stock volatility in the market
(b) changes in correlations amongst tracking portfolio stocks
(c) index arbitrage
(d) computing mistakes
Answer Key:
1a, 2d, 3c, 4b, 5b, 6d, 7d, 8c, 9b, 10b
* Note that for Q10, the error is var(portfolio return − index return).
Correlation changes affect portfolio return variances directly, hence the
tracking error. Unless the index is also the market index, which is not necessarily
so, the market volatility need not affect the error.


Appendix E
SOLUTIONS TO PROBLEM SETS
Chapter 1
1.1 Show E(X+Y+Z) = E(X) + E(Y) + E(Z).
    E(X+Y+Z) = ∫∫∫ (x + y + z) f(x, y, z) dx dy dz
             = ∫∫∫ x f(x, y, z) dx dy dz + ∫∫∫ y f(x, y, z) dx dy dz + ∫∫∫ z f(x, y, z) dx dy dz
             = ∫ x [∫∫ f(x, y, z) dy dz] dx + ∫ y [∫∫ f(x, y, z) dx dz] dy + ∫ z [∫∫ f(x, y, z) dx dy] dz
             = ∫ x fX(x) dx + ∫ y fY(y) dy + ∫ z fZ(z) dz
             = E(X) + E(Y) + E(Z)

1.2 cov(Σ_{i=1}^{N} Xi , Σ_{j=1}^{N} Xj)
      = E[ Σ_{i=1}^{N} (Xi − E(Xi)) × Σ_{j=1}^{N} (Xj − E(Xj)) ]
      = Σ_{i=1}^{N} Σ_{j=1}^{N} E[(Xi − E(Xi))(Xj − E(Xj))]
      = Σ_{i=1}^{N} Σ_{j=1}^{N} cov(Xi, Xj)
      = 1ᵀ cov(X, X)_{N×N} 1
    where 1 = (1 1 … 1)ᵀ is N×1.

1.3 Find the bivariate probability distribution P(U1, U2).
    P(U1, U2) = Σ_{U3} P(U1, U2, U3), holding U1, U2 constant.
    P(-1, 2)  = P(-1, 2, 3)  + P(-1, 2, -3)  = 0.25
    P(-1, -2) = P(-1, -2, 3) + P(-1, -2, -3) = 0.25
    P(1, 2)   = P(1, 2, 3)   + P(1, 2, -3)   = 0.25
    P(1, -2)  = P(1, -2, 3)  + P(1, -2, -3)  = 0.25

    Find the marginal P(U3).
    P(U3) = Σ_{U1} Σ_{U2} P(U1, U2, U3), holding U3 constant.
    P(-3) = P(-1, 2, -3) + P(-1, -2, -3) + P(1, 2, -3) + P(1, -2, -3) = 0.5
    P(3)  = P(-1, 2, 3)  + P(-1, -2, 3)  + P(1, 2, 3)  + P(1, -2, 3)  = 0.5

1.4 (i) Find the E(Ui)'s, and cov(U1, U2).
        P(U1 = -1) = Σ_{U2} P(U1, U2) = P(-1, 2) + P(-1, -2) = 0.5
        P(U1 = 1)  = P(1, 2) + P(1, -2) = 0.5
        P(U2 = 2)  = P(-1, 2) + P(1, 2) = 0.5
        P(U2 = -2) = P(-1, -2) + P(1, -2) = 0.5
        E(U1) = (-1)(0.5) + (1)(0.5) = 0
        E(U2) = (2)(0.5) + (-2)(0.5) = 0
        cov(U1, U2) = Σ_{U1} Σ_{U2} [U1 − E(U1)][U2 − E(U2)] P(U1, U2)
                    = (-1)(2)(0.25) + (1)(2)(0.25) + (-1)(-2)(0.25) + (1)(-2)(0.25) = 0
    (ii) Find the probability distribution of estimator b̂.
        Given X1 = 1, X2 = 2,
        Ỹ1 = b + U1, i.e. Ỹ1 = b + 1 with probability 0.5, Ỹ1 = b − 1 with probability 0.5
        Ỹ2 = 2b + U2, i.e. Ỹ2 = 2b + 2 with probability 0.5, Ỹ2 = 2b − 2 with probability 0.5
        b̂ = (X1Ỹ1 + X2Ỹ2)/(X1² + X2²) = (Ỹ1 + 2Ỹ2)/5
        i.e. b̂ = [(b+1) + 2(2b+2)]/5 = (5b+5)/5 with probability 0.25
             b̂ = [(b+1) + 2(2b−2)]/5 = (5b−3)/5 with probability 0.25
             b̂ = [(b−1) + 2(2b+2)]/5 = (5b+3)/5 with probability 0.25
             b̂ = [(b−1) + 2(2b−2)]/5 = (5b−5)/5 with probability 0.25
    (iii) Find the mean and variance of b̂.
        E(b̂) = (0.25/5)[(5b+5) + (5b−3) + (5b+3) + (5b−5)] = (0.25/5)(20b) = b
        Var(b̂) = E(b̂²) − b²
               = (0.25/25)[25b²+50b+25 + 25b²−30b+9 + 25b²+30b+9 + 25b²−50b+25] − b²
               = (1/100)(100b² + 68) − b² = 0.68
1.5 Find the marginal pdfs of X and Y.

    f(x, y) = e^(−x−y) for 0 < x, y < ∞, and 0 otherwise.
    fX(x) = ∫0^∞ e^(−x−y) dy = e^(−x) ∫0^∞ e^(−y) dy = e^(−x),  0 < x < ∞
    fY(y) = e^(−y),  0 < y < ∞
    Since f(x, y) = fX(x) fY(y), X and Y are stochastically independent.
1.6 The joint density is f(x, y) = 1 on the region 0 < x < 2, 0 < y < x/2, and 0
    otherwise. [Figure: the line y = x/2 over 0 ≤ x ≤ 2.]
    (i) Find the marginal distribution of X and Y.
        fX(x) = ∫0^{x/2} 1 dy = x/2 for 0 < x < 2 (0 otherwise)
        fY(y) = ∫_{2y}^{2} 1 dx = 2 − 2y = 2(1 − y) for 0 < y < 1 (0 otherwise)
    (ii) Find the means, variances of X and Y, and covariance of X and Y.
        E(X) = ∫0^2 x (x/2) dx = 4/3
        E(X²) = ∫0^2 x² (x/2) dx = 2
        Var(X) = E(X²) − [E(X)]² = 2 − (4/3)² = 2/9
        E(Y) = ∫0^1 2y(1 − y) dy = 1/3
        E(Y²) = ∫0^1 2y²(1 − y) dy = 1/6
        Var(Y) = E(Y²) − [E(Y)]² = 1/6 − 1/9 = 1/18
        E(XY) = ∫0^2 ∫0^{x/2} xy (1) dy dx = ∫0^2 (x³/8) dx = 1/2
        Cov(X, Y) = E(XY) − E(X)E(Y) = 1/2 − (4/3)(1/3) = 1/18
    (iii) Find the conditional means E(X|Y), E(Y|X), and conditional variances
        var(X|Y), var(Y|X).
        f(x|y) = f(x, y)/fY(y) = 1/[2(1 − y)] for 2y < x < 2
        E(X|Y) = ∫_{2y}^{2} x/[2(1 − y)] dx = (4 − 4y²)/[4(1 − y)] = 1 + y
        E(X²|Y) = ∫_{2y}^{2} x²/[2(1 − y)] dx = (8 − 8y³)/[6(1 − y)] = (4/3)(1 + y + y²)
        Var(X|Y) = E(X²|Y) − [E(X|Y)]² = (4/3)(1 + y + y²) − (1 + y)² = (1 − y)²/3
        f(y|x) = f(x, y)/fX(x) = 2/x for 0 < y < x/2
        E(Y|X) = ∫0^{x/2} y (2/x) dy = x/4
        E(Y²|X) = ∫0^{x/2} y² (2/x) dy = x²/12
        Var(Y|X) = E(Y²|X) − [E(Y|X)]² = x²/12 − x²/16 = x²/48
        (We use 2y instead of 0 for the lower limit of integration in f(x|y) to ensure
        that ∫ f(x|y) dx = 1.)
1.7 Xit is distributed as univariate normal, N(0,1) for i=1,2,3, and
    t=1,2,…,60. Yt = 0.5X1t + 0.3X2t + 0.2X3t. Thus, E(Yt) = 0 since
    E(Xit) = 0. E(Yt²) = E(Yt − E(Yt))² = var(Yt) = var(0.5X1t + 0.3X2t + 0.2X3t)
    = 0.25 + 0.09 + 0.04 = 0.38.
    Standard deviation of Yt = √0.38 = 0.6164. Variance of (1/K) Σ_{i=1}^{K} Wi is 0.38/K.

1.8 AX i ~ N 0 ,

A2
. Hence A = 60. If random vector Y = (X1, X2,
60

, XK), the distribution of YYT is 602/60.


1.9 cov(a, b+2c+3d) = cov(a,b) + 2 cov(a,c) + 3 cov(a,d)
= 0.1+0.4+0.9 = 1.4
1.10 Cov(X,Y) = E(XY) = 0.5(-1) + 0 = -0.5 since E(X) = 0.
Cov(X,Z) = E(XZ) = 0.5(-1) + 0 = -0.5.
Cov(Y,Z) = E(YZ) E(Y)E(Z) = 0 (-0.5)(0.5) = +0.25.
Chapter 2
2.1 Find the numbers a, b. Var(Xi) = 0.24, E(Xi) = μ, and
    X̄ = (1/60) Σ_{i=1}^{60} Xi = 0.5.
    Apply the Central Limit Theorem:
        X̄ ~ N(μ, 0.24/60) approximately, so −1.96 ≤ (X̄ − μ)/√(0.24/60) ≤ 1.96
    with 95% probability. Hence
        a = 0.5 − 1.96 √(0.24/60) = 0.376
        b = 0.5 + 1.96 √(0.24/60) = 0.624
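A quick numerical check of the interval above (a minimal sketch; only the numbers already used in the solution appear):

    import math

    var_x, n, x_bar, z = 0.24, 60, 0.5, 1.96
    half_width = z * math.sqrt(var_x / n)     # 1.96 standard errors of the sample mean
    a, b = x_bar - half_width, x_bar + half_width
    print(round(a, 3), round(b, 3))           # 0.376 0.624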

2.2 (i) The monthly continuously compounded rate of return is
        ln(Pt+N/Pt) = ln(Pt+N/Pt+N-1 × Pt+N-1/Pt+N-2 × … × Pt+2/Pt+1 × Pt+1/Pt)
                    = Rt+N-1,t+N + Rt+N-2,t+N-1 + Rt+N-3,t+N-2 + … + Rt,t+1
                    = N × (1/N){Rt+N-1,t+N + Rt+N-2,t+N-1 + … + Rt,t+1}
        The average in braces converges to normal(μ, σ²/N) as N increases. Hence the
        monthly return is distributed as Normal(Nμ, σ²N).
    (ii) Use the Jarque-Bera test statistic
        JB = n [ ŝ²/6 + (κ̂ − 3)²/24 ]  distributed as χ²(2)
        where σ̂² = (1/59) Σ_{t=1}^{60} (Rt − R̄)²,
        ŝ = [(1/59) Σ_{t=1}^{60} (Rt − R̄)³] / σ̂³, and
        κ̂ = [(1/59) Σ_{t=1}^{60} (Rt − R̄)⁴] / σ̂⁴.
    (iii) The return distributions may be characterized as follows. Each return over
        an interval takes one of the following distributions randomly.
        [Figure: probability density functions of returns; the bold pdf has negative
        skew and fat tails.]
        On average, the monthly returns will display negative skew and fatter
        tails, as seen in the bold pdf.
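For part (ii), the statistic can be computed directly from a return series. The sketch below is a minimal illustration using simulated data (the data and seed are arbitrary assumptions, not from the text); it uses the 1/(n−1) moment estimators written in the solution.

    import numpy as np

    def jarque_bera(returns):
        """Jarque-Bera statistic n*(skew^2/6 + (kurtosis-3)^2/24), compared to chi-square(2)."""
        r = np.asarray(returns, dtype=float)
        n = len(r)
        dev = r - r.mean()
        sigma2 = np.sum(dev**2) / (n - 1)
        skew = (np.sum(dev**3) / (n - 1)) / sigma2**1.5
        kurt = (np.sum(dev**4) / (n - 1)) / sigma2**2
        return n * (skew**2 / 6.0 + (kurt - 3.0)**2 / 24.0)

    rng = np.random.default_rng(1)
    print(jarque_bera(rng.normal(0.01, 0.05, 60)))   # small value: normality not rejected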
2.3 Let X = [X1 X2 X3], n 3 matrix. Average return vector
M3 x 1 = XT L /n.
E(M) = 3 x 1. Var-covariance matrix of M, or

Var(M) = diag(1/n2 LT n x n L), where diag(K) means a diagonal matrix
with diagonal elements = K. Note that M is MVN.
2.4 Using the Law of Iterated Expectations, take unconditional expectation
over Et(Rt+1| t) = XtQ, to obtain E(Rt+1) = E(XtQ) or E(Rt+1-XtQ) = 0. Since
Rt+1 and XtQ are stationary, assume ergodicity, and test for the sample
mean of the time series {Rt+1 - XtQ ) to be zero.
2.5 E[E(et+1|Pt)] = E(et+1) = 0 by Law of Iterated Expectations. Also,
Pt E(et+1|Pt) = E( Ptet+1|Pt) = 0 implies E( Ptet+1) = 0. Thus,
cov(et+1,Pt) = E( Ptet+1) E(Pt)E(et+1) = 0-0 = 0. Hence implies zero
correlation.
2.6 X = e^Y where Y ~ N(μ, σ²). E(X) = exp(μ + σ²/2).
    E(X²) = E(e^(2Y)) = exp(2μ + (4)σ²/2) = exp(2μ + 2σ²).
    So var(X) = E(X²) − [E(X)]² = exp(2μ + 2σ²) − exp(2μ + σ²).
Chapter 3
3.1 (i) b̂ = [ΣXY − 60 X̄Ȳ] / [ΣX² − 60X̄²] = 1.64
        â = Ȳ − b̂X̄ = −0.0052
        σ̂e² = SSR/58 = 10⁻⁶
        var(b̂) = σ̂e² × 1/[ΣX² − 60X̄²] = 0.0004
        var(â) = σ̂e² × ΣX²/[60(ΣX² − 60X̄²)] = 2.667×10⁻⁸
        Under H0: a = 0, t58 = −0.0052/√(2.667×10⁻⁸) = −31.84
        Under H0: b = 1, t58 = (1.64 − 1)/√0.0004 = 32
        Hence, both hypotheses are rejected at very low significance levels.
(ii) No, market premium could still be important. One possibility to
explain both the results is that X and Z are highly correlated. For
example, if Xt = p + qZt + ut where ut is i.i.d. noise, then Yt =
(a+bp) + bq Zt + (but+et) which produces the regression results of
the second regression.

3.2 s² = 1/(n−1) Σ_{k=1}^{n} (Xk − X̄)² = 1/(n−1) [Σ_{k=1}^{n} Xk² − nX̄²].
    Var(Xk) = E(Xk²) − μ² = σ². So E(Xk²) = σ² + μ².
    Also, Var(X̄) = E(X̄²) − μ² = σ²/n. So E(X̄²) = μ² + σ²/n.
    E(s²) = 1/(n−1) [Σ_{k=1}^{n} E(Xk²) − nE(X̄²)]
          = 1/(n−1) [n(σ² + μ²) − n(μ² + σ²/n)]
          = 1/(n−1) (n − 1)σ² = σ².

3.3 No, Bt is not a stationary process. This is because BT becomes zero with
zero variance.
3.4 R2
3.5 (i) X 4.5 , Y 6.5

(ii) a 1.6265 ; b 1.083

X2
; one just needs 2 points (X1 , Y1) ,
3.6 Strictly speaking, if Y
X

(X2 , Y2) to infer and . But requiring OLS implies random error is
involved. A suitable assumption of the data structure is:
Random error , Y = Y

X2
X

and X is exogenous, is iid. Thus,

the random error Y is independent of X, but proportionate to Y. So,


transformation gives

X2
X + u
Y

provided Y0 , and u = -

(X+) . u is statistically independent of X since X is exogenous, though


u is linearly related to X. If X is not exogenous, then clearly u is not
statistically independent of X.

X2
Then use OLS to regress
on constant and X.
Y

3.7 Let the variance of the random error be u2 and the income variable be ai.
Then, the variance of the estimator of the coefficient of the income
variable is


u2

. For a more accurate estimator, the variance is to be

minimized by sampling from a wide range of income groups so that the


denominator above is large.
3.8 If S falls by 1%, F falls by the same 1% and the portfolio value falls by 1.22%.
    Fall in portfolio = $40m × 1.22%
    Gain in futures = N × 1500 × $500 × 1%
    A perfect hedge implies N × 1500 × $500 × 1% = $40m × 1.22%, so
    N = 65.
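The hedge-ratio arithmetic can be checked in a few lines (a minimal sketch using only the figures in the solution):

    portfolio_value = 40_000_000
    portfolio_move = 0.0122                           # portfolio falls 1.22% when the index falls 1%
    futures_gain_per_contract = 1500 * 500 * 0.01     # index level x contract multiplier x 1% move

    n_contracts = portfolio_value * portfolio_move / futures_gain_per_contract
    print(round(n_contracts))                         # 65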

e2
3.9 (i)

1 T 2
1 T
e it

rit i i rmt
T 2 t 1
T 2 t 1

(i) same as (a) since var(rit|rmt) = var(eit)


(iii) var(rit) = i2var(rmt) + var(eit)
So estimate

1 T
1 T
2

rit i i rmt

mt
m
T 1 t 1
T 2 t 1

2 rit i2

Chapter 4
4.1 (i) u = sqrt(0.00245/98) = 0.005. sd ( a ) = 0.005*0.316 = 0.00158,
sd( b ) = 0.005*0.25 = 0.00125; ta = 2.53, tb = 3.2

(ii) Y = a + b * X = 0.004 + 0.004*120 = 0.0088 or 0.88%


(iii) a is estimate of riskfree rate; b is estimate of exp excess market
return or market risk premium
4.2 (i) cov(eit , rit) = cov(rit a b rmt , rit) = var(rit) b cov(rmt , rit)
which is in general not 0. Thus eit is generally correlated with the
dependent variable rit even if it is not correlated with the
explanatory variable.
(ii) b may be an estimate of stock i beta since it is an estimate of
cov(rit, rmt)/var(rmt). But a rf (1- b ) according to CAPM. This is
not the Jensen alpha measure.
(iii) Since a rf (1- b ), and rf > 0, positive a means that the stocks

have b < 1. Thus the portfolio beta bp < 1. E(rp) = rf + bp [E(rm) rf] <
rf + [E(rm) rf] = E(rm). Thus rp is likely to fall below rm on average.
4.3 We need to make some assumption about initial outlay. Assume he has
$Pt to start with. Instead of buying stock, he puts $Pt into riskfree bond

yielding interest rate r. He short-sells $Pt and puts this into r, but has to
pay r for borrowing the scripts. At t+1, he buys in at $Pt+1. Final payoff
at t+1 is $ Pt (1+r) + [Pt - Pt+1] . Initial outlay at t is $Pt . Return factor is $
[Pt (1+r) + Pt - Pt+1] / $Pt = (1+r) (Pt+1/Pt 1) .
Return rate is r (Pt+1/Pt 1). This is r - (return in a long position).
So if Pt+1 = Pt , then return rate is just r. If Pt+1 = 0, then return rate is
100% + r.
4.4 Yes, if the assets beta is negative and the market risk premium is
positive.
4.5 Golds return rate is low and negatively or lowly correlated with market
return. Thus gold has a beta close to zero, if not negative.
Chapter 5
5.1 Utility companies are monopolies or oligopolies and hold strategic
resource that belong in part to the country. They should not overcharge and build excess profits. There is no competition of service
providers unlike private goods. Thus the rates must be commensurate with
keeping the firm ongoing but without profiting from the captive
consumers.
5.2 According to DGM, if all earnings are issued as dividends, then
P/E = 1/(R-g) where P is current stock price, E is expected next period
Earnings, R is the risky discount rate, and g is the earnings growth. Hence
a high P/E would imply a high growth rate, provided R > g (hence risk
also would be higher), and thus higher future earning prospects.
5.3 Earnings are $1 per share a year forever. Share price =
$(1/1.05 + 1/1.052 + 1/1.053 + ) = 1/0.05 = $20.
5.4 The generated dividends each year may be shown as the sum of the
    entries in each row.

    $/share  |       | generated from first   | generated from second  | generated from third
             |       | plough-back of         | plough-back of         | plough-back of
             |       | retained earnings      | retained earnings      | retained earnings
    2003     | 0.4   |                        |                        |
    2004     | 0.4   | (0.6*1.05)*0.4         |                        |
    2005     | 0.4   | (0.6*1.05)*0.4         | (0.6²*1.05²)*0.4       |
    2006     | 0.4   | (0.6*1.05)*0.4         | (0.6²*1.05²)*0.4       | (0.6³*1.05³)*0.4
    2007     | 0.4   | (0.6*1.05)*0.4         | (0.6²*1.05²)*0.4       | (0.6³*1.05³)*0.4

    In general, in the nth plough-back of retained earnings, the additional dividend
    issue due to that portion is (0.6ⁿ × 1.05ⁿ) × 0.4.
    The present value of the dividend stream (summing all diagonals) per share is:
    0.4 {1/1.05 + 0.6×1.05/1.05² + 0.6²×1.05²/1.05³ + 0.6³×1.05³/1.05⁴ + …}
    + 0.4/1.05 {1/1.05 + 0.6×1.05/1.05² + 0.6²×1.05²/1.05³ + 0.6³×1.05³/1.05⁴ + …}
    + 0.4/1.05² {1/1.05 + 0.6×1.05/1.05² + 0.6²×1.05²/1.05³ + 0.6³×1.05³/1.05⁴ + …} + …
    = 0.4 [1 + 1/1.05 + 1/1.05² + …] {1/1.05 (1 + 0.6 + 0.6² + 0.6³ + …)}
    = 0.4 (1.05/0.05)(1/1.05)(1/0.4)
    = 1/0.05
    = $20
    Therefore, the price per share is unchanged at $20. This is unchanged by
    the dividend policy as the retained earnings do not yield additional returns
    over the original share returns if all dividends were issued.
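The $20 result can also be confirmed numerically. The sketch below is a minimal check, not part of the original solution; the 400-year truncation of the infinite dividend stream is an arbitrary choice.

    r = 0.05
    growth = 0.6 * 1.05              # each plough-back of 60% of earnings grows at 5%
    payout = 0.4

    pv, earnings = 0.0, 0.0
    for t in range(1, 401):          # truncate the infinite dividend stream far out
        earnings = 1.0 + growth * earnings          # earnings per share in year t
        pv += payout * earnings / (1 + r) ** t      # discounted dividend = 40% of earnings
    print(round(pv, 2))              # approximately 20.0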
5.5 P = D1/(R-g) = .5*1.03/(0.05) = $10.30
5.6 There is no inherent inconsistency between the SML and DGM. The SML
is a single period model providing the risk-adjusted required rate of return,
while the DGM is on pricing a stock given the required rate of return and
all future expected earnings or dividends. The DGM further imposes some
restrictions such as constant expected return and constant growth in
earnings or dividends. The latter may be inconsistent with empirical
versions of CAPM where over time the required rates of return are
allowed to vary from period to period.
Chapter 6
6.1 Find the mean and variance of rt, where rt = α + φ rt-1 + εt.
    Let E(rt) = m for all t. Then m = α + φm, so m = α/(1 − φ).
    Let Var(rt) = σr² for all t. Then σr² = φ²σr² + σε², so σr² = σε²/(1 − φ²).
6.2 (i) MA(1)


(ii) No, not random walk in price

(iii) rt = 1.5% + ut − 0.1ut-1
      The MA(1) coefficient −0.1 is obtained from solving θ/(1 + θ²) = −0.099. The MA
      is invertible since the root of (1 − 0.1B) = 0 lies outside the unit circle,
      i.e. B > 1, so the AR(∞) representation is appropriate:
      (1 − 0.1B)⁻¹ rt = (10/9) × 1.5% + ut, or
      rt = 1.667% − 0.1rt-1 − 0.1²rt-2 − 0.1³rt-3 − … + ut
      The forecast is 1.667% − 0.2% − 0.01% − 0.0012%
      = 1.4558%, or approximately 1.46%.
6.3 (i) E(Yt) = 5;
(ii) var(Yt) = 1.16 var(ut) ; autocovariances: (1) = -0.4 var(ut) ; (k) = 0
for |k| > 1. ACF of Yt is 1 for k=0, -0.4/1.16 = -0.345 for k=1, 0 for
|k|>1.
(iii) Mean and ACF are independent of Yt , hence it is covariance
stationary.
(iv) (Yt 5) = (1-0.4B)ut . So, 1/(1-0.4B) * (Yt 5) = ut . The root or zero
of the equation (1-0.4B) = 0 is B = 1/0.4 = 2.5 outside the unit circle,
so the MA process is stationary, and thus it is invertible. The
invertible AR is (1+0.4B+0.16B2+0.064B3+..)(Yt 5) = ut or
Yt = 25/3 0.4Yt-1 0.16Yt-2 0.064Yt-3 - + ut
6.4 (i) (1 − 0.5B − 0.4B²)Yt = 2 + ut. The roots of the characteristic equation
        (1 − 0.5B − 0.4B²) = 0 are B = −2.33 or 1.075. Both are outside the unit
        circle, so the AR process is stationary.
    (ii) E(Yt) = 2/(1 − 0.5 − 0.4) = 20
    (iii) var(Yt) = 0.25 var(Yt) + 0.16 var(Yt-1) + 2(0.5)(0.4)γ(1) + var(ut)
        implies 0.59 var(Yt) = 0.4 γ(1) + var(ut).
        Next multiply the equation by Yt-1 and take expectations: γ(1) = 0.5 var(Yt)
        + 0.4 γ(1), so var(Yt) = 1.2 γ(1).
        Thus, 0.308 γ(1) = var(ut), or γ(1) = 3.247 var(ut),
        and var(Yt) = 3.896 var(ut).
        Multiply by Yt-2 and take expectations:
        γ(2) = 0.5 γ(1) + 0.4 var(Yt).
        Therefore ρ(0) = 1; ρ(1) = 3.247/3.896 = 0.833; ρ(2) = 0.817.
        In general for higher k, ρ(k) = 0.5 ρ(k−1) + 0.4 ρ(k−2).
    (iv) φ11 = ρ(1) = 0.833; φ22 = (ρ(2) − ρ(1)²)/(1 − ρ(1)²) = 0.402;
        φkk = 0, k > 2
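A quick numerical check of 6.4 (a minimal sketch; it only re-derives the numbers already shown above):

    import numpy as np

    # Roots of the characteristic equation 1 - 0.5B - 0.4B^2 = 0
    roots = np.roots([-0.4, -0.5, 1.0])        # coefficients of -0.4B^2 - 0.5B + 1
    print(roots)                               # about -2.33 and 1.075, both outside the unit circle

    mean_y = 2 / (1 - 0.5 - 0.4)               # unconditional mean = 20
    rho1 = 0.5 / (1 - 0.4)                     # Yule-Walker: rho1 = 0.5 + 0.4*rho1
    rho2 = 0.5 * rho1 + 0.4
    phi22 = (rho2 - rho1**2) / (1 - rho1**2)   # second partial autocorrelation
    print(mean_y, round(rho1, 3), round(rho2, 3), round(phi22, 3))   # 20.0 0.833 0.817 0.4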
6.5 (i) Since uts are uncorrelated,
var(Yt) = var(ut) + A2[var(ut-1) + var(ut-2 ) +].

The second term on the right sums to infinity, hence var(Yt) is not
finite. Hence, Yt is not covariance-stationary.
(ii) Yt Yt-1 = ut + (A-1)*ut-1 . Thus it is MA(1), or Yt is ARIMA(0,1,1),
and is stationary. (Remember all finite MA processes are stationary.)
ACF of the first-differenced series is (0) = 1
(k) = (A-1)/[1+(A-1)2] for k = +1 or -1
(k) = 0 for |k| > 1
6.6 The variance is
    var((1/N*) Σ_{i=1}^{N*} ri) = (1/N*²)[ Σi Σj bi bj σm² + Σi Σj cov(ei, ej) ]
                               = bp² σm² + N*⁻² Σi Σj cov(ei, ej)
                               = bp² σm² + N*⁻¹ c̄.
    As N* goes to infinity, the portfolio variance becomes bp² σm².


Chapter 7
7.1 No, B could still be true. This is because if B is true and A is false, then
the joint hypothesis that both A, B are true is still rejected.
7.2 (i)

(ii)
(iii)

7.3

Pr{S1|}= Pr{S1 }/Pr{} = Pr{S1}/ Pr{}= 0.1/09 = 1/9.


Pr{S2|}= 0.4/0.9 = 4/9. Pr{S3|}= 4/9. E [P2| ] = $15 x 1/9 + 12
x 4/9+10x 4/9 = 11.44. Informed price of X = E [P2 |]/1.1 =
11.44/1.1 = $10.40.
E [P2] = (0.1x.15 + 0.4x12 + 0.4x10 + 0.1x7) = 11.
Abnormal expected return is [11.44/10.20 1] required 10%
= 12.20% - 10% = 2.20%.
Assuming market has used all its available information, so et+1 is
random. But if the trader is able to pick up significantly positive c 1,
i.e. buy when et+1 > 0, and sell when et+1 <0, then there is some
evidence that the trader has private or inside information more than
market available information. There is implication about strong-form
market efficiency.

7.4 (i) E(Pt+3|B) = (2/3×1/2)×$10 + [2/3×1/2 + 1/3×1/2]×$8 + (1/3×1/2)×$6
        = $10/3 + 4 + 1 = $8 1/3
        E(Pt+3|X) = (2/3×1/2)×$5 + (2/3×1/2)×$4 + (1/3×1/2)×$2 + (1/3×1/2)×$1
        = $5/3 + 4/3 + 1/3 + 1/6 = $3 1/2
    (ii) E(Pt+3) = 1/5 × E(Pt+3|B) + 4/5 × E(Pt+3|X) = 5/3 + 14/5 = $4 7/15
    (iii) The per period discount factor is (4/3)^(1/2). Discount over 2 periods from
        t=1 to t=3 is 4/3.
        P1|B = E(P3|B)/[4/3] = $(25/3)/(4/3) = $6 1/4
        P1|X = E(P3|X)/[4/3] = $(7/2)/(4/3) = $2 5/8
        P1 = E(P3)/[4/3] = $(67/15)/(4/3) = $3 7/20 or $3.35
    (iv) The observed market price of $3.45 is approximately P1 without
        information. Hence the market is not efficient with respect to
        information B or X at time t=1.
7.5 To show the variance ratio VR(q) = 1 + 2 Σ_{τ=1}^{q-1} (1 − τ/q) ρτ, where ρτ
    is the autocorrelation of returns at lag τ.
    Now rt(q) = rt + rt-1 + rt-2 + … + rt-q+1. The covariance matrix of
    (rt, rt-1, …, rt-q+1) has Var(rt) in every diagonal entry and Var(rt) ρ|i−j| in
    the (i, j) off-diagonal entry. Hence
        Var(rt(q)) = Var(rt) [ q + 2(q−1)ρ1 + 2(q−2)ρ2 + 2(q−3)ρ3 + … + 2ρq-1 ].
    Therefore
        VR(q) = Var(rt(q)) / [q Var(rt)]
              = 1 + (2/q) [ (q−1)ρ1 + (q−2)ρ2 + … + ρq-1 ]
              = 1 + 2 Σ_{τ=1}^{q-1} (1 − τ/q) ρτ.
    If price is a random walk and there is no memory or correlation in returns,
    then VR(q) = 1 for any q. Statistically significant deviation from 1
    indicates long term memory.
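The statistic is simple to compute from data. Below is a minimal sketch (illustrative only; the simulated return series and the choice q = 5 are arbitrary assumptions) of the ratio of the variance of q-period returns to q times the variance of 1-period returns:

    import numpy as np

    def variance_ratio(returns, q):
        """VR(q) = Var(sum of q consecutive returns) / (q * Var(single-period return))."""
        r = np.asarray(returns, dtype=float)
        rq = np.array([r[i:i + q].sum() for i in range(len(r) - q + 1)])  # overlapping q-period returns
        return rq.var(ddof=1) / (q * r.var(ddof=1))

    rng = np.random.default_rng(2)
    iid_returns = rng.normal(0.0, 0.02, 5000)     # no memory: VR should be close to 1
    print(round(variance_ratio(iid_returns, 5), 3))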
7.6 If return after transaction cost is positive, then the market is not
informationally efficient even in the weak-form since following price
patterns allows one to make profit. Testing for profitability in this case is a
type of weak-form market efficiency test used in the 1960s and 1970s, and
is called a Filter Test.
Chapter 8
8.1 BJ test are for H0: 1= 2 = 3 = = k = 0, for large k, so for short
term (short-lag) correlations that are zero (because prices are close to
random walks), they tend to weigh test toward non-correlation finding in
A. Bs VR for large k, i.e. [var(rt, rt+k)/k*var(rt)] < 1, indicates negative
correlation for returns over longer horizons. This is consistent with Cs
finding where low D/P leads to long-run low return low price, hence high
D/P which in turn leads to long-run high return high P. Thus C would
show long-run low return followed by next long-run high return and so
on. This is a mean-reversion in long-run returns. This would show up as
negative correlation as in B. Thus As short-term non-predictability and B,
Cs long-term predictability are consistent as B, C could only get longterm predictability (not short-term) because the short-term correlations are
too small to be statistically significant.
8.2 The continuously compounded excess 5-yr return rate is 0.56 + 0.2 ×
    ln(0.1) = 0.099483. The nominal 5-yr return rate is therefore
    0.099483 + riskfree return over 5 yrs = 0.099483 + [1.01⁵ − 1] = 0.1505, or
    15.05% over 5 yrs. This is also (1.1505^0.2 − 1) = 0.0284 or 2.84% p.a.
Chapter 9
9.1 (i) N(0, 21 × 0.01) = N(0, 0.21)
    (ii) CAAR(τ1, τ2) = Σ_{t=τ1}^{τ2} AARt = (1/N) Σ_{i=1}^{N} CARi(τ1, τ2)
        has a normal distribution with mean 0 and variance
        (1/N²) (τ2 − τ1 + 1) Σ_{i=1}^{N} σi².
        CAAR(-10, -1) ~ N(0, 1/25 × 10 × 0.06 = 0.024)
        CAAR(-10, +10) ~ N(0, 1/25 × 21 × 0.06 = 0.0504)
    (iii) z = 0.17/(0.024)^(1/2) = 1.097, not significant at 2-tail 5%
        z = 0.45/(0.0504)^(1/2) = 2.004, significant at 2-tail 5%
        (critical values +/-1.96; 2.5% each tail)
        Thus, no impact up to a day before the event date. After the event takes
        place, there is significant abnormal return up to +10 days. Thus the event
        has significant information impact on returns. No sign of information
        leakage beforehand.
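The z-statistics in (iii) are straightforward to reproduce (a minimal sketch; the inputs N = 5 and Σσi² = 0.06 are those implied by the 1/25 factor in the solution):

    import math

    def caar_z(caar, n_stocks, n_days, sum_sigma2):
        """z = CAAR / sqrt(var), var = (1/N^2) * (number of event days) * sum of sigma_i^2."""
        var = (1.0 / n_stocks**2) * n_days * sum_sigma2
        return caar / math.sqrt(var)

    print(round(caar_z(0.17, 5, 10, 0.06), 3))   # about 1.097: not significant at 2-tail 5%
    print(round(caar_z(0.45, 5, 21, 0.06), 3))   # about 2.004: significant at 2-tail 5%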
9.2 Paying too much for B's shares, or overestimation of the synergistic
    gain through merger or takeover, could be a reason why the market is
    reacting negatively to A's own shares. If the takeover fails, the prices
    of the firms should revert back to their original levels. In this case, the
    CAR of A will return back to zero.
9.3 Yes, rights issues may sometimes show up as negative abnormal returns on
    some firms, for firms that are short of credit and on the verge of
    bankruptcy. In such cases, the market knows the intended raising of fresh
    capital is to rescue the troubled firm rather than to deploy for fresh opportunities.
9.4 B's share price will increase during the announcement period if the market
    is efficient. B's shareholders could sell at the increased price or the tender
    price, and not worry about how the acquired firm B would then be badly
    managed by A.
Chapter 10
10.1 (i) SST = 0.065² × 99 = 0.418275;
         R² = 1 − SSE/SST = 1 − 0.2928/0.418275 = 0.30
         Adj R² = 0.2928
     (ii) F1,98 = [R²/1]/[(1 − R²)/98] = 41.996; t98 = −√41.996 = −6.480
     (iii) t-statistic of SIZE = −6.480;
         standard error = −0.003/−6.480 = 0.000463
     (iv) The smaller size of target has significantly larger increased return at
         1% significance level.
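A compact numerical check of 10.1 (a minimal sketch; only the figures quoted in the solution are used):

    import math

    sst = 0.065**2 * 99             # total sum of squares
    r2 = 1 - 0.2928 / sst           # about 0.30
    f_stat = (r2 / 1) / ((1 - r2) / 98)
    t_stat = -math.sqrt(f_stat)     # negative because the SIZE coefficient is negative
    std_err = -0.003 / t_stat       # coefficient divided by its t-statistic
    print(round(r2, 3), round(f_stat, 2), round(t_stat, 2), round(std_err, 6))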
10.2 t-statistic = (0.867-1)/0.113 = -1.177. We need to know the sample
size in order to determine the degrees of freedom of the t-statistic. But for
typical n > 30, the test will not reject H0 at 5% level.
10.3

Y1
Y
2
Y
:

Yn
Y

Y
Y
:

Y

1
X 21
1
X
22

L=
X=
:
:


X 2n
1

X 31
X 32
:
X 3n

X2

X
X 2
:

X2

and

..... X k 1
..... X k 2

.....
:

..... X kn
X 3 ..... X k

X 3 ..... X k
;
: ..... :

X 3 ..... X k

b2
b
3
B =
:


bk

d2
d
3
D= .
:

dk
a and c are constants.
Then in general, the regression of Y=aL + XB will yield OLS
estimates a, B that are different from OLS estimates c, D in the
regression Y Y cL ( X X )D .
In the special case where k=2,
N

b 2

X
i 1
N

2i

X
i 1

X 2 Yi

2i

X2

for the first regression.


2

Now sample mean of X 2i X 2 is 0. Therefore for the second


regression using demeaned data,

N

b 2

[ X

2i

i 1
N

[ X
i 1

X 2 0]Yi
X 2 0]

which is the same as b 2 in the first

2i

regression.
For

first

regression,

a Y b 2X2 . For second regression,

c Y Y b 2 [X 2 X 2 ] 0 0 0 .
Thus OLS estimate B=D, but c=0 whereas a is not necessarily 0.
10.4 (i)

Zt = ln St = 6.593 , where Stis sales level in units. dZ = dln S =


dS/S is % change in sales level. Since Et(Zt+1) = 6.648, Et(Zt+1
Zt) = 6.648 6.593 = 0.055 is the forecast of % change in
sales level.
(ii)
vart (Zt+1 Zt) = vart (Zt+1) . Hence c.i. is 0.055 0.06.
(iii)
Et [dS/S] = 0.055, so Et[St+1] = St x (1.055).
Et(St+1) = exp(6.593) x 1.055 =770.12
(iv)
Vart[dlnS] =Var[lnSt+1]. But Vart[dlnS] = Vart[St+1/St ] . Therefore
Vart(St+1) = St2 Var[ln St+1] . Confidence interval is
exp(6.593) x 1.055 [exp(6.593) x 1.055] x 0.06 = 770.12
46.21
10.5(i) Dependence on other time t variables such as Yt and Zt, suggests
that Pt alone does not incorporate all relevant information at t,
and so there is evidence against market efficiency of the semistrong form.
(ii) Since Pt reflects all information, and there is no dependence on other
time t variables such as Yt and Zt, there is evidence in support of
market efficiency of the semi-strong form. Coefficient a needs not be
zero for market efficiency as there could be positive equilibrium rate
of return for the portfolio.
10.6

Use OLS regression on lnQ = ln + lnL + lnK + ln. OLS estimate

4.86 0.7236 and 0.2520 . The last


of ln is 1.5812, so
two numbers are the elasticities of output with respect to labor and
capital respectively. Forecast of next period ln($GDP) is ln$GDP =
1.5812 + 0.7236 (30) + 0.252 (25) = 29.589. Therefore forecast of
$GDP is exp(29.589) = $7.08 x 1012.
10.7 (i) X: 52x3 B: 3x1

131.72

(ii) B 19.423

26.844
(iii) 36.84 I52x52

42.03 19.86 0.15

(iv) 19.86 10.17 0.04

0.15 0.04 0.04


10.8 (i) det |XTX| = 372.75

X X
T

1 3 1.5
372.75 1.5 125

0.85
0.006910
123
5.56137 105 2.78068 105
1
2u X T X

5
0.002317237
2.78068 10
0.3
t a
40.23
0.00746
b 1
0.09
t b

1.869
0.0481 0.0481

2u

(ii)

a b 1 X Xb a 1 /2
T

F2,123

U
/123
U
125 1.5 0.3
0.3 0.09
/2
3 0.09
1.5
T

0.085/123

809.87

(iii) Not BLUE since it is misspecified. Should add a dummy as a


regressor in this case.
(iv) Use GLS
10.9

Taking natural logs, run regression


lnYt = ln a + b lnCt + d lnLt + lnet.

10.10

Theoretically we can show that if we regress tracking portfolio return


S on index portfolio return M in S = a + b M + e (e a disturbance
term), then coefficient of determination R2 = b2 var(M)/var(S) = 2
where is correlation coefficient between S and M.
S2 = W*TVW* , M2 = WTVW, and SM = W*TVW.
and SM2 = SM2/(S2M2).
Therefore 1-R2 = 1 - SM2 = (W*TVW)2/[( W*TVW*)(WTVW)].

10.11(i)
(ii)
(iii)
(iv)

4.17398, 0.067134, 0.05105


H0: coeff (rm) = 0, F1,58 =4.17398, p-value 1-tail = 4.56%; one can
reject H0 at 5%
0.0032, 0.0032
ESS = R2 x TSS = 0.067134 x (0.00307962 x 59) = 3.7565 x 10-5
(rmt rm )2 = ESS/ 2 = 1.6124 x 10-5
unbiased estimate of var(rmt) is (rmt rm )2/[59] = 2.7329 x 10-7
unconditional variance is 1.5263522.732910-7+0.0032 =
9.636710-6

Chapter 11
11.1

r = aD1 [1 if not 1-5 days before holiday and in January; 0 o.w.] +


bD2 [1 if not 1-5 days before holiday and not in January; 0 o.w. ] +
cD3 [1 if 1-5 days before holiday and in January; 0 o.w.] + dD4 [1 if 15 days before holiday and not in January; 0 o.w.] + et
Test if H0: a + b = c + d, H0 : c = d.

11.2

Friday = constant, -0.0001


Monday = 0.0005 + (-0.0001) = 0.0004
Tuesday = 0.0001
Wednesday = 0
Thursday = 0.0002

Chapter 12
12.1

from [1] = sum(Y- Y )(Z- Z )/sum(Z- Z )2


No. B
from [2] = sum(Y- Y )(Z- Z )/sum(Y- Z )2. D
is not inverse of
But D
. The problem of course is that in [2], t and Yt are perfectly
B
correlated from [1]. In [2], cov(Y,Z)
=D*var(Y)+cov(Y,t)=D*var(Y)-B-1cov(Y,e)=D*var(Y)- e2/B,
Or B*var(Z)= D*var(Y)- e2/B

    Or, var(Y) = (B/D) var(Z) + σe²/[BD] = B² var(Z) + σe²,
    which is consistent with [1].
    Hence OLS in [2] violates the classical assumption of zero correlation
    between Y and εt.

12.2 Run ln(Yt/[1 − Yt]) = a + b(Xt + 2Zt) + εt.
    OLS estimate of c = 2 × OLS estimate of b.
    Find OLS estimates of a, b first: â = 0.544585, b̂ = 0.005619.
    Remember to add ĉ = 2b̂ = 0.011238.
    Test with restrictions R = [1 0; 0 1], r = (1, 0)ᵀ:
        F2,2 = [(Rβ̂ − r)ᵀ [R(XᵀX)⁻¹Rᵀ]⁻¹ (Rβ̂ − r)/2] / [ûᵀû/2] = 0.44786
    p-value = 0.6907
    Hence we cannot reject H0 at the conventional 10% significance level.
12.3

Since the disturbances are all classical, this is clearly a multicollinearity problem where Y and Z are highly correlated, both not
correlated with the noises. Given A and B show significance in Y and
Z, it is incorrect to drop both as regressing with just X will be
underspecification with missing variable and will result in biased
coefficient for X. He could pick either A or B with the higher R2. Or
use some theoretical justification to settle for A or B. Or if he could
increase the sample size till c2 and c3 estimates become significant,
then continue with C.

12.4

X and Z are probably highly correlated. This is a data problem


inducing multi-collinearity. Either increase the sample size
significantly to get back small p-values or else restrict to using only
either X or Z.

Chapter 13
13.1

w*Tp = 0, w*T (p+q) = 1, hence w*Tq = w*T(p+q) w*Tp = 1.

13.2

It is easy to show w*T E(R) = k from (13.7) since w* is derived from


the restriction w*T E(R) = k in (13.2). It is also correct that
k = pT E(R) +qT E(R) k. However, since pT E(R) = 0, and qT E(R) =1,
we really have an identity relationship in k = pT E(R) +qT E(R) k k.
It is incorrect to write k = pT E(R) / [1 qT E(R)] which is 0/0!


13.3

No. The alpha and beta should explain all the cross-sectional expected
returns. However, if there are severe measurement errors in j , this
may cause some variations to be explained by the irrelevant j .

13.4

13.5

13.6

Yes, if the market proxy is not the true market return, then it could be
that the true market return is Rmt = a1rmt + a2x1t + a3x2t. Regression of
rjt = a + b Rmt + ejt that produces significant b is supportive of
CAPM.
There is contemporaneous correlation in the explanatory variables that
could cause serious bias in the interpretation of the t-statistics.
Removing all or some of the xjts may make the estimate a 1 become
significantly positive, which would accord with CAPM or at least the
interpretation that the market proxy is MV efficient.
L log f(Y1,Y2,.,YT | X1,X2,.,XT)

NT
T
1 T
T
ln2 ln Yt A BX t 1 Yt A BX t
2
2
2 t 1

where A (a1, a2, .., aN)T and B (b1, b2, .., bN)T. Note that Xt is a
scalar.
Finding FOCs:

L A 1 Yt A BX t 0 N1
t 1

L B 1 Yt A BX t X t 0 N1
t 1

T 1 1 1 T
T
Yt A BX t Yt A BX t 1
2
2
t 1

0 N N
T


Hence B

Yt X t X
t 1

X t X

B

and A
X

t 1

t 1

t 1

where N1 T 1 Yt and X T 1 X t

. Moreover,


1 T
B
B
X Y A
X
Yt A
t
t
t
T t 1

The estimators are the same as OLS estimators under joint normality.
Chapter 14
14.1(i) cov(r1t,r2t) = -0.12m2+0.036>0, therefore m2<0.3 or 30%.
(ii) Key is to note that r1t and r2t can also be represented by, so
0.2 It + 1t =e1t and
0.3 It + 2t = e2t , and so cov(0.2 It + 1t , 0.3 It + 2t) = 0.036.
0.2x0.3 var(It) = 0.036, so var(It) = 0.6, or 60%.
14.2(i) No, we do not expect the coefficient to be significantly different
from zero as unsystematic risk is not priced in an economy where
all investors are fully diversified.
(ii) In this case where investors are not fully diversified, K 1 may be
significantly positive and represents risk premium as
compensation to investors for idiosyncratic risk of their holdings.
Chapter 15
15.1 (i) Measurement error depresses estimate in the disposable case
(ii) Forecast is 13.025
(iii) Yes, will be affected because it multiplies the estimators errors. Thus,
the further this value is from the sample mean, the larger is the
forecast error.
(iv) No, simultaneity bias.
15.2

r1,5 = a + b(d1 p1) + u1,5


r5,9 = a + b(d5 p5) + u5,9
r9,13 = a + b(d9 p9) + u9,13
..
rT-4,T = a + b(dT-4 pT-4) + uT-4,T

15.3 (i) Yes. p-values approximately 0
     (ii) DW low: positive serial correlation of the disturbance.
     (iii) Apply the serial correlation adjustment as follows
         Yn* = Yn − ρ̂ Yn-1
         ...
         Y2* = Y2 − ρ̂ Y1
         Y1* = √(1 − ρ̂²) Y1
         Perform similar transforms on the X's. Then regress Y* on c and
         X*. If the disturbance is not correlated with X or PRICE, then the
         estimators are unbiased and consistent. The adjustment depends
         on how accurate the estimate of ρ above is. If it is not accurate, then
         estimates may become biased and worse off.
     (iv) k (inclusive of constant) = 2, so k − 1 = 1
         tN-k (of PRICE) = 31.20888 = t751-2 and t749² = F1,749
         so Fk-1,N-k = 31.20888² = 973.99
         (R²/[k−1]) / ((1 − R²)/[N−k]) = Fk-1,N-k = 973.99,
         so R² = 973.99/[749 + 973.99] = 0.5653
         adjusted R², R̄² = 1 − [RSS/(N−k)] / [TSS/(N−1)] = (1−k)/(N−k) + [(N−1)/(N−k)] R²
                         = −1/749 + (750/749) × 0.5653 = 0.5647
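The back-out of R² from the reported t-statistic in (iv) can be verified quickly (a minimal sketch using only the numbers above):

    t_price = 31.20888
    n, k = 751, 2
    f_stat = t_price**2                          # with one slope, F(1, N-k) = t^2
    r2 = f_stat / (n - k + f_stat)               # from F = (R^2/(k-1)) / ((1-R^2)/(N-k))
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    print(round(r2, 4), round(adj_r2, 4))        # 0.5653 0.5647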
15.4

Sometimes, St and Ft-k,t are not stationary. For OLS, we need


dependent as well as explanatory variables to be stationary (weak- or
covariance stationary). Taking difference with respect to another
lagged spot transforms the variables into difference variable, and is
a common way of making a non-stationary process stationary. Of
course, there are other methods such as co-integration regressions.

Chapter 16
16.1

Yes, unit root process allowing for drift and trend as suggested by the
3 tests. To avoid negative price, we can use ln Pt = a + ln Pt-1 + t by
taking exponentials. The price process would be Pt = ea . Pt-1 . et .

16.2

Yes, this is the use of ECM.

16.3

Yes, Yt is a unit root process. Yt is ARIMA (p,1,q) for general p and


q.

Chapter 17
17.1 ht-1^(1/2) et-1 = 0.05 − 0.03 − 0.5×0.10 = −0.03
     0.1 et-1 = −0.03, and so et-1 = −0.3
     ht = 0.2 ht-1 + 0.2 et-1² = 0.2×(0.01) + 0.2×0.3² = 0.02
     var(rt) = 0.5²×0.02 + ht = 0.025
     standard deviation = √0.025 = 0.1581

17.2 E(rt) = 0.03 + 0.5×0.12 = 0.09
     At the 99% confidence level, the r critical value is −2.33×0.1581
     = −0.3684.
     The r critical value with mean 0.09 is 0.09 − 2.33×0.1581 = −0.2784.
     Hence absolute VaR is $2.784 million.
     Relative VaR includes the loss from the expected positive return and is
     $10m × 36.84% = $3.684 million.
     (There is usually some confusion on the relation between absolute and
     relative VaR.)
17.3 exp(1 + 2/e²) × 0.001 = 0.003563
     The total variance of the portfolio is 0.002 + 0.003563 = 0.005563.
     VaR = 2.33 × √0.005563 × $100m = $17.3784m
17.4 exp(0.5 ht-1) et-1 = 0.03 − 0.02 − 0.0005×10 = 0.005
     so exp(0.5×(−6.5)) et-1 = 0.005
     so et-1 = 0.129
     ht = 0.9 ht-1 + 3.9 et-1² = 0.9×(−6.5) + 3.9×0.129² = −5.785
     var(rt) = 0.0005²×2 + exp(ht) = 0.00307
     std dev = √0.00307 = 0.05544
     at 99% c.i., the r critical value is −2.33×0.05544 = −0.12918
     Therefore VaR = $100m × 12.918% = $12.918 million
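The VaR arithmetic in 17.2 can be checked directly (a minimal numerical sketch; portfolio size $10m, mean 9% and standard deviation 15.81% as used in the solution):

    portfolio = 10_000_000
    mu, sigma, z99 = 0.09, 0.1581, 2.33

    relative_var = z99 * sigma * portfolio            # loss measured from the expected value
    absolute_var = (z99 * sigma - mu) * portfolio     # loss measured from a zero return
    print(round(relative_var), round(absolute_var))   # about 3,684,000 and 2,784,000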

Chapter 18
18.1

18.2

R: event of recession
I: event of inverted Yield Curve the year earlier
Prob(R|I) = Prob(RI)/Prob(I) = [7/50]/[14/50]=50% only
E(Yt+1)=0.01-1.2*0.02= - 0.014. Yt+1 | Xt ~ N(-0.014, 0.0004).
So Prob(Yt+1<0) = Prob({Yt+1-[-0.014]}/0.02<0.014/0.02)
= Prob(z<0.7) = 0.76
The latter is parametric, and the distribution contains more
information than the original non-parametric. It could be that this
periods negative yield slope is especially large, being larger than
average.
Run the OLS regression lnPt = lnA B rt + et where et is residual
error. B>0, so increase in rt leads to fall in bond price Pt. There is the
usual inverse price-yield relationship for bonds.
= 10.13 (% change in 5-yr T-bond
OLS estimate of ln A = 4.714, B
price per unit change in 3-m T-bill rate)

exp( 4.714) 111.5 (upper bound of 5-yr T-bond price when


A
r=0)

Chapter 19
19.1

Possible reasons for different implied volatilities are Black-Scholes


model is inadequate, one could be a European call while the other an
American call, and there could be minor liquidity differences facing
the two options with different strikes.

19.2

The uncertainty will typically show up as an increase in implied
volatility at the point of announcement and afterward. For event studies,
if the volatility of returns increases post-announcement, then it may be
necessary to measure the standard deviation of returns post-announcement
differently from that pre-announcement, and to apply the appropriate
standard deviation to abnormal returns in a t-test. In this case, it may
be harder to accept that a significant abnormal return has occurred,
given the larger standard deviation surrounding it.

Chapter 20
20.1

CRRA is 1-γ. For positive risk aversion, 1-γ > 0, so γ < 1. Using
L'Hôpital's rule, as γ → 0 it converges to ln(Ct).
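A quick symbolic check of the limit, assuming the power-utility normalization (Ct^γ - 1)/γ (an assumption here; it is one standard form with CRRA 1-γ):

```python
# L'Hopital limit: (C**g - 1)/g -> ln(C) as g -> 0.
import sympy as sp

C, g = sp.symbols('C gamma', positive=True)
print(sp.limit((C**g - 1) / g, g, 0))   # log(C)
```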

20.2

The variable in the argument should be stationary. The choice is whether
Ht/Gt = Mt+1 Pt+1/Pt or Ht/Gt = Mt+1 Pt+1/Pt - 1; either is stationary.

20.3

Et-1(Yt - a - bXt) = 0. We need at least one more restriction, say
Et-1[(Yt - a - bXt) Xt-1] = 0. Take iterated expectations. Then
a = E(Yt) - b E(Xt) and
b = [E(Yt Xt-1) - E(Xt) E(Yt)] / [E(Xt Xt-1) - E(Xt)²].
We can then use the sample averages to substitute for the population
moments. The estimators are consistent but need not be unbiased or
efficient.
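A minimal sketch of the resulting estimators on simulated data, with the population moments replaced by sample averages; the AR(1) regressor below is hypothetical and is used only so that E(Xt Xt-1) - E(Xt)² is non-zero:

```python
# Method-of-moments estimates of a and b from the two moment conditions above.
import numpy as np

rng = np.random.default_rng(3)
T = 5000
x = np.zeros(T)
for t in range(1, T):                              # AR(1) regressor: cov(Xt, Xt-1) != 0
    x[t] = 0.5 + 0.7 * x[t - 1] + rng.normal()
y = 0.5 + 2.0 * x + rng.normal(0.0, 0.5, T)        # true a = 0.5, b = 2.0

x_t, x_lag, y_t = x[1:], x[:-1], y[1:]
b_hat = (np.mean(y_t * x_lag) - np.mean(x_t) * np.mean(y_t)) / \
        (np.mean(x_t * x_lag) - np.mean(x_t) ** 2)
a_hat = np.mean(y_t) - b_hat * np.mean(x_t)
print(a_hat, b_hat)                                # close to 0.5 and 2.0
```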
20.4

Let et+Δ = rt+Δ - rt - κ(μ - rt)Δ. The moment conditions are: E(et+Δ) = 0,
E(et+Δ rt) = 0, E(et+Δ² - σ² rt² Δ) = 0, and E([et+Δ² - σ² rt² Δ] rt) =
0.
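A minimal sketch that evaluates the four sample moments on data simulated from the assumed discretization rt+Δ = rt + κ(μ - rt)Δ + σ rt sqrt(Δ) z; the specification, the parameter names, and the numerical values here are illustrative assumptions, not the book's:

```python
# Stack the four GMM moment conditions and check they are near zero at the
# true parameters of the simulated mean-reverting process.
import numpy as np

rng = np.random.default_rng(4)
kappa, mu, sigma, dt, T = 2.0, 0.05, 0.3, 1 / 12, 20000
r = np.empty(T)
r[0] = mu
for t in range(T - 1):
    r[t + 1] = r[t] + kappa * (mu - r[t]) * dt + sigma * r[t] * np.sqrt(dt) * rng.normal()

def sample_moments(theta, r, dt):
    k, m, s = theta
    e = r[1:] - r[:-1] - k * (m - r[:-1]) * dt       # e_{t+dt}
    u = e**2 - s**2 * r[:-1]**2 * dt                 # variance moment
    g = np.column_stack([e, e * r[:-1], u, u * r[:-1]])
    return g.mean(axis=0)

print(sample_moments((kappa, mu, sigma), r, dt))     # all four close to zero
```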

20.5

The advantage of GMM is that it does not require the underlying variables
to have a known probability distribution, unlike the ML method. It
can handle serial correlation in the underlying variables that enter
the moment conditions in a straightforward manner, as long as they
are stationary. The ML method requires the serial correlation to be
fully specified in order to compute the likelihood function of the
sample. However, if the distribution is in fact known, then the ML
method will in general yield more efficient estimators, since GMM does
not make use of distributional assumptions. Also, in finite samples, the
ML method in principle allows the finite-sample distribution of the
estimators to be computed for exact inference. GMM can only provide
asymptotic inference, and tends to have biases in finite-sample
inference and testing.


INDEX
abnormal return, 77, 82, 165, 169,
170, 171, 172, 173, 177, 182, 210,
440, 441, 455, 479, 489
ACF, 115, 118, 122, 123, 125, 126,
127, 129, 131, 132, 313, 475, 476
adjusted R2, 183
aggregate consumption, 2
Akaike information criterion, 183
alpha, 73, 77
American call, 369, 370, 373, 375,
489
American put, 368, 371, 375
annual volatility, 38, 40, 385
ANOVA (analysis of variance), 206,
216
Appraisal ratio, 73
APT, 271, 272, 275, 276, 282
AR(1), 108, 109, 110, 113, 114, 116,
117, 119, 120, 122, 126, 130, 149,
152, 155, 223, 228, 238, 243, 290,
292, 450
arbitrage, 44, 269, 271, 274, 286, 380
arbitrageur, 63, 64, 66, 67
ARCH, 322, 323, 327, 328, 329, 330,
334, 339, 341, 342
ARCH-in-mean, 322
ARIMA, 104, 107, 127, 128, 129,
305, 475, 488
ARMA, 107, 108, 112, 115, 120, 127,
136, 327, 328, 452
asymptotic efficiency, 183, 322
asymptotic test, 36, 119
asymptotic variance, 146, 232, 244,
245, 246
Augmented Dickey-Fuller statistic,
302
autocorrelation function, See ACF
autocorrelogram, 107, 119, 124
autocovariance function, See ACF
autoregressive process, See AR(1)
average abnormal return, 165
backward shift operator, 107
bankruptcy, 106, 166, 176, 327, 479
BASEL II, 323
BE/ME, 277, 279, 280, 281, 284

benchmark model, 133, 165


best forecast, 34, 137
best linear unbiased estimators, See
BLUE
beta, 73, 256, 257, 278
bivariate normality, 32
bivariate probability, 5, 7, 24, 461
Black-Scholes formula, 367, 368
BLUE, 55, 60, 76, 168, 188, 190, 191,
194, 204, 219, 221, 223, 226, 228,
229, 231, 233, 234, 239, 242, 243,
247, 290, 296, 300, 440, 443, 446,
449, 450, 459, 482
bond equivalent yield, 344
bond pricing, 345
bond total return, 345
bonds, 101, 160, 344, 345, 346, 347,
350
book equity value, 105, 277
book-to-market equity, 277, 283, 284
bootstrapping, 345
Box-Pierce Q-statistic, 107
Box-Jenkins approach, 107, 129
Brennan-Schwarz, 359, 361
Breusch-Pagan & Godfrey, 219, 237,
240, 242
business cycle, 154, 163, 345
capital asset pricing model, See
CAPM
capital budgeting, 90
capital market line, 90
CAPM, 73, 74, 75, 76, 77, 84, 85, 88,
90, 97, 98, 100, 101, 102, 104, 105,
140, 141, 151, 168, 169, 210, 249,
250, 252, 253, 254, 256, 260, 264,
265, 266, 267, 268, 270, 271, 275,
276, 277, 283, 284, 433, 434, 435,
440, 455, 472, 474, 485
CDOs, 176
central limit theorem, 26
central moment, 16, 393
chi-square, 1, 214
CML, 101
Cobb-Douglas production function,
202, 204

Cochrane-Orcutt procedure, 219, 292
coefficient of determination, 44
cointegration, 248, 302
conditional mean, 27
conditional probability, 1
conditional variance, 27, 322
confidence level, 324, 325, 430
consistency, 183, 244
constrained regression, 90
consumption beta, 381
contemporaneous correlation, 219,
230
continuous probability (pdf), 5, 8, 9,
10, 14, 18
continuous time stochastic process,
345, 354
correlation (correlation coefficient), 1,
11, 12, 56, 58, 247, 248, 291, 293,
460, 483
cost of carry, 44
cost of equity, 90, 95, 97, 98, 99, 104,
106
covariance, 1, 10, 204, 243, 269, 282,
390, 413, 459
Cox-Ingersoll-Ross model, 345
Cramer-Rao lower bound, 322
credit ratings, 344
credit spread, 345
critical region, 21, 22, 23, 129, 215,
240
cross-sectional Regression, 249, 257
cumulative abnormal return, 165
cumulative average abnormal return,
165
daily returns, 37, 38, 39, 41, 212, 217,
326, 419
data types, 1
day-of-the-week effect, 206
decomposition of squares, 44
defaults, 346
de-seasonalization, 107
deterministic trend, 302
DGM, 99, 100, 106, 473, 474
diagnostic, 341
disposable income, 2, 3, 4, 299, 300
dividend growth model. See DGM

dividend-price ratio, 154


Dothan, 355, 359
dummy variables, 206
Durbin-Watson d-statistic, 239, 286,
291, 449
earnings announcement, 166, 175
earnings forecast, 90
earnings to price ratio, 277
earnings yield, 1
efficient forecast, 367
equity premium puzzle, 381
ergodic process, 29
error correction model, 302
errors-in-variable, 286, 368, See
measurement error
estimation efficiency, 44
estimation period, 44, 62, 73, 140,
165, 167, 235, 236, 312, 363, 392,
395
Euler condition, 133, 140, 143, 381,
382, 383, 391
event date, 165
event window, 165
excess market return, 81, 82
exchange rate, 163, 286, 287, 299,
302, 314, 316, 317, 319, 323, 454
exchange rate forecasting, 302
expected value, 1, 9, 10, 28, 32, 35,
71, 98, 170, 226, 326, 414, 447
explained sum of squares, 58
exponential GARCH, 267
exponential-weighted moving
average, 107
fair rates of return, 90
Fama-French three factor model, 269
Fama-MacBeth Procedure, 249
F-distribution, 1, 19, 217, 240, 262
fear gauge, 206, 216
firm size, 84, 258, 270, 279
Fisher effect, 107, 130, 347
Fisher's information matrix, 322, 337
forecasting, 44, 132, 183
forward discount, 286
forward risk premium, 286
forward shift operator, 154
Fubini theorem, 10

futures hedging, 44
futures margin, 322
GARCH, 322, 323, 328, 330, 332,
333, 334, 335, 338, 339, 341, 342
Gauss-Markov theorem, 44
GDP, 3, 11, 124, 125, 127, 166, 202,
273, 303, 364, 365, 481
generalized least squares, 219
generalized method of moments, 381,
See GMM
geometric random walk, 34
GMM, 140, 381, 385, 386, 388, 390,
391, 392, 393, 394, 395, 490
Goldfeld-Quandt test, 219, 237, 242
gross domestic product, 3, 124
Hannan-Quinn criteria, 183
Hansen-Jagannathan bound, 381, 385
HCCME, 235, 236, 282, 458, 459
hedge ratio, 44
hedging, 44, 69, 369
heteroskedasticity, 219, 242, 243, 269,
281, 282, 323, 390
historical approach, 322
historical volatility, 367
HML, 284
homoskedastic, 50, 88, 190, 204, 220,
221, 237, 239, 243, 255, 292, 335,
449
Hotelling's T2 Statistic, 249
hypothesis, 1, 21, 142, 148, 153, 214,
288, 301, 312
ICAPM, 269, 271, 272, 285
idiosyncratic risk, 80, 84, 249, 266,
486
implied volatility, 367
inflation, 48, 101, 107, 132
information effect, 165
information leakage, 165
information set, 33, 34, 35, 142, 288,
388
informational efficiency, 133
instrumental variables, 219
interest rate parity, 286
internal rate of return, 90
Intertemporal Capital Asset Pricing
Model, See ICAPM

invertibility, 107, 120


irrelevant inclusion, 219
January Effect, 206
Jarque-Bera test, 27, 249, 330, 468
Jensen measure, 73, 77, 82, 83
joint probability, 1
joint test, 133
kurtosis, 16, 17, 36, 37, 38, 40, 85,
329
lagged endogenous variable, 219
law of iterated expectations, 27
law of large numbers, 26, 113
leverage, 97, 98, 277, 368, 379
levered beta, 90
life-cycle, 3
linear estimators, 53, 188
liquidity, 347, 350
Ljung-Box test, 107, 119
log-likelihood function, 268, 336, 340
lognormal distribution, 19, 27, 30, 31
long term memory, 133
MA(1), 108, 111, 113, 114, 118, 120,
129, 131, 223, 474, 475
marginal probability, 1
market crash, 375, 376
market efficiency, 131, 133, 135, 136,
137, 139, 140, 141, 143, 144, 150,
151, 153, 155, 165, 170, 202, 476,
478, 481
market equity, 104, 277, 279, 283, 284
market model, 73, 75
market timing, 73, 86
Markov process, 133, 142
martingale, 133, 141, 142, 357, 366
maximum likelihood, 322
mean reversion, 154, 345
measurement error problem, 228, 295,
See errors-in-variable
measurement period, 165, 167, 219,
228, 244, 486
missing variable problem, 286
misspecification, 220, 223, 226, 232
moment restrictions, 381
momentum, 154, 163
moving average process, 107
multi-collinearity, 219, 225, 247, 269

multi-factor asset pricing, 269, 272,
276, 277
multiple linear regression, 87, 183,
191, 201, 212, 219, 277, 278, 428
multivariate distribution, 43, 191
N225 index, 63, 64, 69, 70, 327
net present value, 90, 94
Newey-West covariance estimator,
381
news, 138, 139, 166, 168, 169, 170,
175, 176, 178, 179, 214, 273, 380
nominal data, 24
non-stationary process, 302
OLS, 49, 51, 54, 55, 56, 57, 59, 60,
61, 69, 70, 71, 72, 76, 80, 81, 82,
88, 91, 92, 101, 103, 113, 125, 151,
160, 168, 184, 185, 186, 188, 190,
191, 193, 194, 196, 197, 198, 201,
202, 204, 207, 208, 209, 210, 211,
213, 215, 218, 219, 220, 221, 224,
225, 226, 227, 228, 229, 230, 231,
233, 234, 235, 237, 238, 239, 240,
241, 242, 243, 244, 245, 246, 247,
248, 249, 255, 256, 257, 264, 265,
268, 276, 279, 280, 282, 290, 291,
293, 295, 296, 297, 298, 299, 300,
307, 308, 309, 310, 312, 315, 334,
338, 342, 365, 366, 434, 435, 438,
440, 443, 445, 446, 448, 449, 450,
451, 452, 453, 455, 457, 458, 459,
471, 480, 481, 484, 486, 487, 489
option premium, 367
ordinal data, 24
ordinary least squares, See OLS
Ornstein-Uhlenbeck process, 345, 355
orthogonal projection, 183
orthogonality conditions, 381
out-of-sample forecast, 107, 154
overlapping data problem, 286, 287,
290, 292, 293
overlapping forecast errors, 290
P/E ratios, 2, 210
PACF, 107, 122, 123, 125, 126, 127,
131
parametric approach, 322
partial autocorrelation function, See PACF
Phillips curve, 48, 49
portfolios, 62, 83, 84, 85, 101, 210,
251, 253, 255, 258, 259, 260, 276,
279, 284, 285
post-event period, 165
PPP, 302, 314, 315, 316, 317, 454
pre-determined, 49, 50, 184, 191, 368
price pressure effect, 165
price-to-earnings, 1, See P/E ratios
pricing anomalies, 206, 210
probability limits, 219, 229
purchasing power parity, 302, 314,
See PPP
put-call parity theorem, 373
random coefficient, 232, 248
random walk, 27, 33, 34, 130, 133,
134, 135, 136, 141, 144, 145, 147,
148, 150, 155, 156, 290, 310, 375,
437, 453, 456, 474, 478
rates of return, 27
rational expectation, 133, 137, 138
regressand, 44, 248
regression through the origin, 90, 103
regressor, 44
relative risk aversion parameter, 391,
394
relevance exclusion, 219
representative agent, 381
residual income, 90
risk factors, 269
risk management, 322, 323
Roll's critique, 249, 252, 268
RSS, 51, 58, 193
S&P 500 Index, 1, 2, 376
sample mean, 19, 28, 29, 42, 92, 216,
262, 425, 469, 480, 486
sample variance, 19, 56, 71, 174, 340
sampling, 1, 85, 167
Schwarz criterion, 70, 183, 200, 205,
213, 215, 257, 301, 360, 362, 363,
365
seasonal Effect, 206
security characteristic line, 73, 81
security market line, 90, See SML
semi-log, 48

serial correlation. See autocorrelation
Sharpe measure, 73, 83, 84
Sharpe ratio, 83, 102, 385
short rate, 345, 356, 357
significance level, 22, 23, 36, 38, 39,
40, 41, 68, 80, 84, 118, 129, 148,
178, 182, 201, 204, 215, 270, 282,
292, 293, 295, 313, 390, 392, 438,
444, 449, 453, 455, 457, 480, 484
Simultaneous equations bias, 219
single index, 75
skewness, 16, 17, 36, 37, 38, 40, 85,
393
Slutsky theorem, 231
SML, 81, 100, 101, 106, 252, 253,
254, 256, 433, 474
spot rate(s), 294, 345, 351
spot yield curve, 345
spurious regression, 302
standard deviation, 10, 79
standard normal variable, 13
statistical test, 1
stochastic regressor, 183
stochastic trend, 248, 302
stock index futures, 44
stock return, 2, 27, 30, 33, 36, 40, 70,
78, 84, 101, 119, 131, 150, 155,
156, 158, 163, 166, 170, 259, 277,
380, 437
Student-t, 1, 18, 20, 172
substitution effect, 165
sum of squares, 51, 57, 58, 193, 216,
See also RSS
super-consistency, 302
switching regression model, 232
systematic risk, 73
term structure, 99, 145, 352, 353, 355
testing of coefficient, 44
tests of restrictions, 183
three-moment asset pricing, 393
transaction costs, 64, 65, 66, 137, 153,
210, 373
treasury slope, 345, 364, 365
trend stationary process, 302
Treynor measure, 73, 82, 83
two-fund separation theorem, 251

Type I error, 21, 22, 23


Type II error, 21, 22, 23
unbiased expectations hypothesis,
286, 288, 444, 445
unbiased linear estimator, 44
unconditional expectation, 27
uniformly most powerful, 23
unit root process, 303, 304, 305, 306,
307, 308, 309, 310, 311, 312, 313,
315, 317, 321, 452, 453, 458, 487,
488
unsystematic risk, 73, 80
utility-based asset pricing model, 381
Value-at-risk, 322, 338, See VaR
VaR, 324, 325, 326, 327, 339, 488
variance, 1, 10, 133, 148, 150, 216,
242, 249, 329, 467
variance ratio test, 144, 147, 148
Vasicek model, 345, 355, 356, 359,
360
volatility cluster, 322
volume, 153, 154, 269
Wald test, 206, 263, 265
warrants, 367
weighted average cost of capital, 90
weighted least squares, 219
white noise, 33, 108, 119, 128, 190,
341, 379, 423, 433, 437, 453
White's heteroskedasticity-consistent
covariance matrix estimator, 219,
See HCCME
Wishart distribution, 249
yield curve, 351, 352, 353, 363, 364,
365
yield-to-maturity, 344, 348, See YTM
YTM, 348, 350, 351, 353, See yield-to-maturity
Yule-Walker equations, 107, 121,
122, 123
zero coupon bond, 345
zero yield, 345

