
Lecture Notes

Advanced Econometrics I

Fall Semester 2013 1


University of Mannheim

Anne Leucht, Enno Mammen

1 Slightly modified
Literature

[1] Ash, R. B. and Doléans-Dade, C. (1999). Probability & Measure Theory. Academic Press.
[2] Bauer, H. (1990). Measure and Integration Theory. de Gruyter.
[3] Bauer, H. (1991). Wahrscheinlichkeitstheorie. de Gruyter.
[4] Billingsley, P. (1994). Probability and Measure. Wiley.
[5] Breiman, L. (2007). Probability. SIAM.
[6] Davidson, R. and MacKinnon, J. G. (2004). Econometric Theory and Methods. Oxford University Press.
[7] Dehling, H. and Haupt, B. (2004). Einführung in die Wahrscheinlichkeitstheorie und Statistik. Springer.
[8] Georgii, H.-O. (2007). Stochastik: Einführung in die Wahrscheinlichkeitstheorie und Statistik. de Gruyter.
[9] Hayashi, F. (2009). Econometrics. Princeton University Press.
[10] Jacod, J. and Protter, P. (2000). Probability Essentials. Springer.
[11] Pollard, D. (1984). Convergence of Stochastic Processes. Springer.
[12] Seber, G. A. F. (2008). A Matrix Handbook for Statisticians. Wiley.
[13] Van der Vaart, A. W. and Wellner, J. A. (2000). Weak Convergence and Empirical Processes. With Applications to Statistics. New York: Springer.
[14] Wooldridge, J. M. (2004). Introductory Econometrics: A Modern Approach. Thomson/Southwestern.

Contents

1 Elementary probability theory
1.1 Probability measures
1.1.1 Countable sample spaces
1.1.2 Arbitrary sample spaces
1.2 Probability measures on $\mathbb{R}$
1.2.1 Some general facts
1.2.2 Discrete and absolutely continuous probability measures
1.2.3 Extensions to $\mathbb{R}^k$
1.3 Random Variables
1.3.1 Definition and Independence
1.3.2 Expectations, moments

2 Asymptotic theory
2.1 Convergence of expectations
2.2 Modes of convergence
2.2.1 Definitions
2.2.2 Relation between different modes of convergence
2.2.3 Discussion of convergence in distribution
2.2.4 Discussion of stochastic boundedness
2.3 Strong law of large numbers and central limit theorem

3 Conditional expectations, probabilities and variances
3.1 Conditional expectations and conditional probabilities
3.1.1 Special case: X, Y discrete
3.1.2 Special case: Continuous distributions
3.1.3 Special case: X discrete, Y continuous
3.1.4 Important properties of conditional expectations
3.2 Conditional Variances

4 Linear regression
4.1 The classical model
4.2 Parameter estimation - the OLS approach
4.2.1 Estimation of $\beta$
4.2.2 Estimation of $\sigma^2$
4.3 Hypothesis tests in the classical linear regression model
4.3.1 Introduction to statistical testing
4.3.2 Wald tests to test linear restrictions on the regression coefficients
4.3.3 Hypothesis tests in the classical linear regression model under normality

Introduction

Motivation:
Study relationships between variables, e.g.

- consumption and income: How does a rise in income affect consumption behaviour?

- evaluation of the effectiveness of job market training (treatment effects)

- ...

Econometrics (Wooldridge [14]):

- development of statistical methods for estimating economic relationships

- testing economic theories

- evaluation of government and business policies

Classical model in econometrics: linear regression

Figure 1: http://en.wikipedia.org/wiki/File:Linear regression.svg

$Y = \beta_0 + \beta_1 X + \varepsilon$,
e.g. $Y$ consumption, $X$ wage, $\varepsilon$ error term

- typically, the data are not generated by experiments,

- the error term collects all other effects on consumption besides wage.

Hence the variables are in some sense random.

How do we formalize randomness?

Aims of this course:

(1) provide the basic probabilistic framework and statistical tools for econometric theory

(2) apply these tools to the classical multiple linear regression model

These results are applied to economic problems in Advanced Econometrics II/III and follow-up elective courses.

1 Elementary probability theory

1.1 Probability measures

Aim: Formal description of probability measures

Setup:
$\Omega$: set of the possible outcomes of an experiment ("sample space"),
e.g. $\Omega = \mathbb{N} = \{1, 2, \dots\}$.
Unless otherwise stated, $\Omega$ is non-empty.

$A \subseteq \Omega$: "event",
e.g. $A = \{2, 4, 6, 8, \dots\}$.

$\omega \in \Omega$: "outcome".
We intend to define $P(A)$, the "probability" of the event $A$.

1.1.1 Countable sample spaces


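Consider first the case that $\Omega$ is a countable set, i.e.

$\Omega = \{\omega_1, \omega_2, \omega_3, \dots\}$

(e.g. $\Omega = \mathbb{N}$, $\Omega = \mathbb{Z}$).
Definition 1.1. A probability measure $P$ on a countable set $\Omega$ is a set function that maps subsets of $\Omega$ to $[0, 1]$ and has the following properties:
(i) $P(\emptyset) = 0$ (here $\emptyset$ denotes the empty set),
(ii) $P(\Omega) = 1$,
(iii) $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$ for $A_i \subseteq \Omega$, $i \in \mathbb{N}$, pairwise disjoint (i.e. $A_i \cap A_j = \emptyset$ for $i \neq j$).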

Remark:
- Interpretation of (i): probability that nothing happens = 0.

- Interpretation of (ii): probability that something happens = 1.

- Interpretation of (iii): $P\left(\bigcup_{i=1}^{k} A_i\right) = \sum_{i=1}^{k} P(A_i)$ is reasonable for $k < \infty$, but questionable for $k = \infty$. Nevertheless, one cannot proceed without making this assumption.

Lemma 1.2. For a countable sample space $\Omega = \{\omega_i\}_{i \in I}$ (with countable $I$) a probability measure $P$ is specified by
$p_i = P(\{\omega_i\})$ for $i \in I$.
For every $A \subseteq \Omega$, it holds that
\[ P(A) = \sum_{i:\, \omega_i \in A} p_i. \]

Proof. Application of Definition 1.1(iii); see problem set 2.

An event $\{\omega\}$ ($\omega \in \Omega$) that contains only one element is also called an elementary event.
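
To make Lemma 1.2 concrete, the following minimal Python sketch computes the probability of an event by summing the weights of the outcomes it contains. The fair die (six outcomes with weight 1/6 each) and the names `weights`, `prob`, `even` and `low` are illustrative assumptions, not part of the notes.

```python
from fractions import Fraction

# Illustrative assumption: a fair die, Omega = {1, ..., 6} with p_i = 1/6.
weights = {omega: Fraction(1, 6) for omega in range(1, 7)}


def prob(event):
    """P(A) as the sum of p_i over all outcomes omega_i in A (Lemma 1.2)."""
    return sum((weights[omega] for omega in event), Fraction(0))


sample_space = set(weights)
even = {2, 4, 6}
low = {1, 2}

print(prob(even))          # probability of an even number
print(prob(even | low))    # union of two events that are not disjoint
print(prob(sample_space))  # the whole sample space
```

Exact fractions are used so that additivity can be checked without rounding error.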

1.1.2 Arbitrary sample spaces
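Typical examples: $\Omega = \mathbb{R}$ (or $\mathbb{R}^k$ or $[0, \infty)$)
Problem: Mathematics (general measure theory) shows that it is often impossible to define $P$ appropriately for all subsets $A \subseteq \Omega$ such that Definition 1.1(i)-(iii) hold; see e.g. Billingsley [4, p. 45f.].

Solution: Define $P$ on a class $\mathcal{A}$ of subsets of $\Omega$.

Minimal requirements on $\mathcal{A}$:
(S i) $\emptyset \in \mathcal{A}$
(S ii) $A \in \mathcal{A} \Rightarrow A^C = \Omega \setminus A \in \mathcal{A}$
(S iii) $A_1, A_2, \dots \in \mathcal{A} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$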

Motivation: If we naively carried over the axioms of Definition 1.1(i)-(iii) to the present setting, then we would obtain:

Regarding (S i): $P(\emptyset) = 0$.
Regarding (S ii): If I know the value of $P(A)$, I will also know the value of $P(A^C) = 1 - P(A)$.
Regarding (S iii): If I know the values of $P(A_1), P(A_2), \dots$, then I will know the value of $P\left(\bigcup_{i=1}^{\infty} A_i\right)$. This is clear for pairwise disjoint $A_1, A_2, \dots$. For $A_1, A_2, \dots$ not pairwise disjoint, a slightly more complicated argument is needed (see Theorem 1.7(vi) below).

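Definition 1.3. A class $\mathcal{A}$ of subsets of $\Omega$ with (S i)-(S iii) is called a $\sigma$-field (also: $\sigma$-algebra).

Example 1.4. For a set $\Omega$,

- the class $\{\emptyset, \Omega\}$ is the smallest $\sigma$-algebra on $\Omega$ and is called the trivial $\sigma$-algebra,

- the so-called power set $\mathcal{P}(\Omega) := \{A \mid A \subseteq \Omega\}$ is the largest $\sigma$-algebra on $\Omega$,

- and if $B \subseteq \Omega$, the class $\{\emptyset, \Omega, B, B^C\}$ is the smallest $\sigma$-algebra on $\Omega$ that contains $B$.

Definition 1.5. Suppose that $\mathcal{A}$ is a $\sigma$-field on a set $\Omega$. Then the tuple $(\Omega, \mathcal{A})$ is called a measurable space.
A set function $P : \mathcal{A} \to [0, 1]$ is a probability measure on $(\Omega, \mathcal{A})$ if
(a) $P(\Omega) = 1$ and
(b) for pairwise disjoint $A_1, A_2, \dots \in \mathcal{A}$ it holds that
\[ P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \qquad (\sigma\text{-additivity}). \]

The triple $(\Omega, \mathcal{A}, P)$ is a probability space.

Example 1.6 (Dirac measure). Let $(\Omega, \mathcal{A})$ be a measurable space and $\omega_0 \in \Omega$. The Dirac measure $\delta_{\omega_0}$ is then defined by
\[ \delta_{\omega_0}(A) := 1_A(\omega_0) \quad \text{for all } A \in \mathcal{A}. \]
The Dirac measure is indeed a probability measure.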

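Theorem 1.7 (Properties of probability measures). Suppose that $(\Omega, \mathcal{A}, P)$ is a probability space. Then the following hold true:
(i) $P(\emptyset) = 0$.
(ii) Finite additivity: $A_1, \dots, A_n \in \mathcal{A}$ pairwise disjoint imply $P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$.
(iii) $P(A^C) = 1 - P(A)$.
(iv) Monotonicity: $A, B \in \mathcal{A}$, $A \subseteq B$ implies $P(A) \le P(B)$.
(v) Subtractivity: $A, B \in \mathcal{A}$, $A \subseteq B$ implies $P(B \setminus A) = P(B) - P(A)$.
(vi) Poincaré-Sylvester: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
(vii) Continuity from below: $A_1, A_2, \dots \in \mathcal{A}$, $A_n \subseteq A_{n+1}$, $n \in \mathbb{N}$, implies $P(A_n) \to P\left(\bigcup_{k=1}^{\infty} A_k\right)$ as $n \to \infty$.
(viii) Continuity from above: $A_1, A_2, \dots \in \mathcal{A}$, $A_{n+1} \subseteq A_n$, $n \in \mathbb{N}$, implies $P(A_n) \to P\left(\bigcap_{k=1}^{\infty} A_k\right)$ as $n \to \infty$.
(ix) Sub-$\sigma$-additivity: $A_1, A_2, \dots \in \mathcal{A}$ implies $P\left(\bigcup_{n=1}^{\infty} A_n\right) \le \sum_{n=1}^{\infty} P(A_n)$.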

Proof. Exercise.
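
As a sanity check (not a proof), several items of Theorem 1.7 can be verified exhaustively on a small finite probability space. The sketch below assumes a four-point sample space with unequal weights and uses its power set as the $\sigma$-field; all names and numbers are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Illustrative assumption: Omega = {0, 1, 2, 3} with unequal weights summing to 1.
weights = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}
omega = frozenset(weights)


def prob(event):
    """P(A) as the sum of the weights of the outcomes in A."""
    return sum((weights[w] for w in event), Fraction(0))


# The power set of Omega serves as the sigma-field.
events = [
    frozenset(c)
    for r in range(len(omega) + 1)
    for c in combinations(sorted(omega), r)
]

# (iii) complement rule, (iv) monotonicity, (v) subtractivity, (vi) Poincare-Sylvester.
complement_ok = all(prob(omega - a) == 1 - prob(a) for a in events)
monotone_ok = all(prob(a) <= prob(b) for a in events for b in events if a <= b)
subtractive_ok = all(
    prob(b - a) == prob(b) - prob(a) for a in events for b in events if a <= b
)
poincare_ok = all(
    prob(a | b) == prob(a) + prob(b) - prob(a & b) for a in events for b in events
)

print(complement_ok, monotone_ok, subtractive_ok, poincare_ok)
```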

1.2 Probability measures on R


1.2.1 Some general facts
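In this section we consider the special case $\Omega = \mathbb{R}$.
Questions:
1. What is a natural $\sigma$-field on $\mathbb{R}$?
2. How can we construct $P$ on a large class $\mathcal{A}$?
Solution to question 1:
Definition 1.8. The smallest $\sigma$-field $\mathcal{B}$ that contains all open intervals $(a, b)$ ($-\infty \le a \le b \le \infty$) is called the Borel $\sigma$-field. A set $A \in \mathcal{B}$ is also called a Borel set.

Theorem 1.9. Put
\[
\begin{aligned}
\mathcal{A}_1 &= \{(a, b] : -\infty \le a \le b < +\infty\},\\
\mathcal{A}_2 &= \{[a, b) : -\infty < a < b \le +\infty\},\\
\mathcal{A}_3 &= \{[a, b] : -\infty < a \le b < +\infty\},\\
\mathcal{A}_4 &= \{(-\infty, b] : -\infty < b < +\infty\}.
\end{aligned}
\]
Then it follows for $j = 1, \dots, 4$: $\mathcal{B}$ is the smallest $\sigma$-field that contains $\mathcal{A}_j$.

Proof. See exercise for $j = 1$. For $j \neq 1$ the result follows similarly; use e.g. $\bigcup_{n=1}^{\infty} \left[a + \tfrac{1}{n}, b - \tfrac{1}{n}\right] = (a, b)$.
Remark: Note that there are subsets of $\mathbb{R}$ which are not contained in $\mathcal{B}$, cf. Theorem 8.6 in Bauer [2].
It is very difficult to understand which sets are contained in $\mathcal{B}$. But it is not really necessary. We define a probability measure $P$ on $\{(a, b) \mid -\infty \le a \le b \le \infty\}$, $\mathcal{A}_1$ or ... or $\mathcal{A}_4$. Then $\mathcal{B}$ is just the class of sets for which $P$ is automatically defined. Why is this true?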

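Solution to question 2: Define $P$ on a smaller treatable class $\tilde{\mathcal{A}} \subseteq \mathcal{A}$.

Hope: this defines $P$ on $\mathcal{A}$. This works as follows:

Definition 1.10. A class $\tilde{\mathcal{A}}$ of subsets of $\Omega$ is a field if
(i) $\emptyset \in \tilde{\mathcal{A}}$,
(ii) $A \in \tilde{\mathcal{A}} \Rightarrow A^C \in \tilde{\mathcal{A}}$,
(iii) $A_1, A_2 \in \tilde{\mathcal{A}} \Rightarrow A_1 \cup A_2 \in \tilde{\mathcal{A}}$.

Suppose that $\tilde{\mathcal{A}}$ is a field and define $\mathcal{A}$ as the smallest $\sigma$-field with $\tilde{\mathcal{A}} \subseteq \mathcal{A}$ (notation: $\mathcal{A} = \sigma(\tilde{\mathcal{A}})$). Now choose $\tilde{P} : \tilde{\mathcal{A}} \to [0, 1]$ with
(a) $\tilde{P}(\emptyset) = 0$,
(b) $\tilde{P}(\Omega) = 1$,
(c) for pairwise disjoint $A_1, A_2, \dots \in \tilde{\mathcal{A}}$ with $\bigcup_{i=1}^{\infty} A_i \in \tilde{\mathcal{A}}$ it holds that
\[ \tilde{P}\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \tilde{P}(A_i). \]

(A set function with these properties is called a pre-measure.)

Theorem 1.11 (Carathéodory). For $\tilde{\mathcal{A}}$, $\mathcal{A}$, $\tilde{P}$ as above there exists a unique probability measure $P$ on $\mathcal{A}$ with
\[ P(A) = \tilde{P}(A) \quad \text{for } A \in \tilde{\mathcal{A}}. \]

Proof. See Theorem 1.3.10 in Ash and Doléans-Dade [1].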

This method can be applied to characterize probability measures on $\mathbb{R}$ endowed with the $\sigma$-algebra $\mathcal{B}$. To this end, we first introduce another function.

Definition 1.12. For a probability measure $P$ on $(\mathbb{R}, \mathcal{B})$ the function $F : \mathbb{R} \to [0, 1]$ given by
\[ F(b) = P((-\infty, b]), \quad b \in \mathbb{R}, \]
is called a (cumulative) distribution function (CDF).

Theorem 1.13 (Properties of the CDF). Suppose that $F$ is the distribution function of a probability measure $P$ on $(\mathbb{R}, \mathcal{B})$. Then
(i) $P((a, b]) = F(b) - F(a)$, $b > a$,
(ii) $F$ is non-decreasing (i.e. $F(b') \le F(b)$ for $b' \le b$),
(iii) $F$ is continuous from the right (i.e. $F(b_n) \to F(b)$ for $b_n \to b$, $b_n \ge b$ (or for $b_n \downarrow b$)),
(iv) 1. $\lim_{x \to -\infty} F(x) = 0$,
2. $\lim_{x \to +\infty} F(x) = 1$.

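Proof. (i) is an immediate consequence of Theorem 1.7(v), since $(a, b] = (-\infty, b] \setminus (-\infty, a]$.

(ii) Exercise.

(iii) Suppose that $(b_n)_n$ is an arbitrary monotonically decreasing sequence in $\mathbb{R}$ with $b_n \to b$. Then we have to show that
\[ F(b_n) \to F(b) \quad \text{as } n \to \infty. \]
By continuity of probability measures from above,
\[ F(b_n) = P((-\infty, b_n]) \xrightarrow[n \to \infty]{} P\left(\bigcap_{k=1}^{\infty} (-\infty, b_k]\right) = F(b). \]

(iv) 1. Exercise.

2. Suppose that $(b_n)_n$ is an arbitrary monotonically increasing sequence in $\mathbb{R}$ with $b_n \to \infty$. We have to show
\[ F(b_n) \to 1. \]
Invoking continuity of probability measures from below gives
\[ \lim_{n \to \infty} F(b_n) = \lim_{n \to \infty} P((-\infty, b_n]) = P\left(\bigcup_{k=1}^{\infty} (-\infty, b_k]\right) = P((-\infty, +\infty)) = 1. \]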

Now the idea is to choose $F$ with the properties (ii) to (iv) in the previous theorem and to define $P$ on $\mathcal{A}_4$:
\[ P((-\infty, b]) = F(b) \]
and on $\mathcal{A}_1$:
\[ P((a, b]) = F(b) - F(a), \quad b > a. \]
Can this function be uniquely extended to a set function on $\mathcal{B}$?

Theorem 1.14. Consider a function $F : \mathbb{R} \to \mathbb{R}$ satisfying (ii) to (iv) of Theorem 1.13. Then $F$ is a distribution function (i.e. there exists a unique probability measure $P$ on $(\mathbb{R}, \mathcal{B})$ with $F(b) = P((-\infty, b])$ for all $b \in \mathbb{R}$).

Proof. (Sketch) First, we define a set function $\tilde{P} : \mathcal{A}_1 \to [0, 1]$ as
\[ \tilde{P}((a, b]) = F(b) - F(a). \tag{1} \]
Now we extend this function as follows: $\tilde{P} : \tilde{\mathcal{A}} \to [0, 1]$, where $\tilde{\mathcal{A}}$ consists of the empty set and all finite unions of sets of $\mathcal{A}_1$ and their complements, and for disjoint intervals
\[ \tilde{P}\left(\bigcup_{i=1}^{n} (a_i, b_i]\right) = \sum_{i=1}^{n} \tilde{P}((a_i, b_i]) \quad \text{with the notation } (c, \infty] = (c, \infty). \]

Now, it can be shown that $\tilde{\mathcal{A}}$ is a field and $\tilde{P}$ is a pre-measure on $\tilde{\mathcal{A}}$. Finally, Carathéodory's extension theorem tells us that there is a unique probability measure $P$ on $\mathcal{B}$ satisfying (1) and hence $P((-\infty, b]) = F(b)$.

1.2.2 Discrete and absolutely continuous probability measures


Discrete probability measures

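Definition 1.15. A probability measure $P$ on the measurable space $(\mathbb{R}, \mathcal{B})$ is discrete if there is an at most countable set $A = \{a_i \in \mathbb{R} \mid (-\infty < \dots < a_{-1} < a_0 <)\ a_1 < a_2 < \dots\}$ such that $P(A) = 1$.

Remark: If $P$ is a discrete probability measure with
\[ P(\{a_i\}) = p_i > 0, \]
then $F$ has jumps at $(-\infty < \dots < a_{-1} < a_0 <)\ a_1 < a_2 < \dots$ with jump heights $(\dots, p_{-1}, p_0,)\ p_1, p_2, \dots$.

Example 1.16. 1. Binomial distribution
\[ P(\{i\}) = \binom{n}{i} \theta^i (1 - \theta)^{n - i} \quad \text{for } i = 0, 1, \dots, n. \]
Parameters: $0 \le \theta \le 1$, $n \ge 1$.
2. Geometric distribution
\[ P(\{i\}) = \theta (1 - \theta)^{i - 1} \quad \text{for } i = 1, 2, \dots \]
Parameter: $0 < \theta \le 1$ (for $\theta = 0$ the weights would not sum to 1).
3. Poisson distribution
\[ P(\{i\}) = e^{-\lambda} \frac{\lambda^i}{i!}, \quad i = 0, 1, \dots \]
Parameter: $\lambda > 0$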
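
The three families of Example 1.16 can be checked numerically: each set of point masses should sum to one. The sketch below is only an illustration; the parameter names `theta` and `lam`, the chosen parameter values, and the truncation points of the two infinite sums are assumptions made here.

```python
import math


def binomial_pmf(i, n, theta):
    """P({i}) for the binomial distribution with parameters n and theta."""
    return math.comb(n, i) * theta ** i * (1 - theta) ** (n - i)


def geometric_pmf(i, theta):
    """P({i}) for the geometric distribution on {1, 2, ...}."""
    return theta * (1 - theta) ** (i - 1)


def poisson_pmf(i, lam):
    """P({i}) for the Poisson distribution on {0, 1, ...}."""
    return math.exp(-lam) * lam ** i / math.factorial(i)


# Finite sum for the binomial case, truncated sums for the two infinite supports.
binomial_total = sum(binomial_pmf(i, 10, 0.3) for i in range(11))
geometric_total = sum(geometric_pmf(i, 0.3) for i in range(1, 200))
poisson_total = sum(poisson_pmf(i, 2.5) for i in range(100))

print(binomial_total, geometric_total, poisson_total)
```

The truncation error of the geometric and Poisson sums is far below floating-point precision for the parameter values used here.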

Absolutely continuous probability measures
If there is a (Riemann) integrable function $f : \mathbb{R} \to [0, \infty)$ such that
\[ F(x) = \int_{-\infty}^{x} f(t)\,dt, \]
then $f$ is called Riemann density (probability density) and the corresponding probability measure (and the CDF) is called absolutely continuous.

There is a more general definition of absolute continuity. However, this needs deeper mathematics. Therefore we will stick to the one above, which suffices for many applications. Moreover, note that if $F$ is absolutely continuous and continuously differentiable at $x$, then $f(x) = F'(x)$.

Lemma 1.17. Suppose that $f : \mathbb{R} \to [0, \infty)$ is a bounded function with at most finitely many points of discontinuity and such that $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Then there exists a unique probability measure $P$ on $(\mathbb{R}, \mathcal{B})$ such that
\[ P((a, b]) = \int_{a}^{b} f(x)\,dx. \]

Proof. It suffices to show that $x \mapsto \int_{-\infty}^{x} f(t)\,dt$ defines a function satisfying (ii) to (iv) of Theorem 1.13 and to use Theorem 1.14. This in turn is straightforward.

Example 1.18. 1. Normal distribution
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\,\frac{(x - \mu)^2}{\sigma^2}\right). \]
Parameters: $\mu \in \mathbb{R}$, $\sigma > 0$
2. Uniform distribution
\[ f(x) = \frac{1}{b - a}\, 1_{[a, b]}(x) \]
Parameters: $-\infty < a < b < \infty$

3. Exponential distribution
\[ f(x) = \lambda e^{-\lambda x}\, 1_{[0, \infty)}(x) \]
Parameter: $\lambda > 0$
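
As a numerical illustration of Example 1.18, the densities can be integrated with a simple midpoint rule to confirm that they integrate to (approximately) one, and that integrating the density up to $x$ gives the CDF. The parameter values, the integration ranges and the helper `midpoint_integral` are assumptions made for this sketch.

```python
import math


def normal_density(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * (x - mu) ** 2 / sigma ** 2) / math.sqrt(2 * math.pi * sigma ** 2)


def uniform_density(x, a=1.0, b=3.0):
    return 1.0 / (b - a) if a <= x <= b else 0.0


def exponential_density(x, lam=2.0):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def midpoint_integral(f, lower, upper, steps=20000):
    """Approximate the Riemann integral of f over [lower, upper] by the midpoint rule."""
    h = (upper - lower) / steps
    return h * sum(f(lower + (k + 0.5) * h) for k in range(steps))


# Total mass of each density (the ranges cover all but a negligible part of the mass).
normal_total = midpoint_integral(normal_density, -10.0, 10.0)
uniform_total = midpoint_integral(uniform_density, 0.0, 4.0)
exponential_total = midpoint_integral(exponential_density, 0.0, 40.0)

# F(1) for the exponential density with lam = 2; the closed form is 1 - exp(-2).
exponential_cdf_at_1 = midpoint_integral(exponential_density, 0.0, 1.0)

print(normal_total, uniform_total, exponential_total, exponential_cdf_at_1)
```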

1.2.3 Extensions to Rk
The Borel $\sigma$-field $\mathcal{B}^k$ is the $\sigma$-field generated by the open intervals $(a_1, b_1) \times \dots \times (a_k, b_k)$. As in the real-valued case, probability measures on $(\mathbb{R}^k, \mathcal{B}^k)$ are uniquely defined via the multivariate distribution function:
\[ F(b_1, \dots, b_k) = P(\{(x_1, \dots, x_k) : x_1 \le b_1, \dots, x_k \le b_k\}). \]
$F$ is called absolutely continuous if
\[ F(b_1, \dots, b_k) = \int_{-\infty}^{b_1} \cdots \int_{-\infty}^{b_k} f(x_1, \dots, x_k)\,dx_k \cdots dx_1 \]
for all $b_1, \dots, b_k \in \mathbb{R}$. Here, if $f$ is continuous at $(x_1, \dots, x_k)$, then
\[ \frac{\partial^k F}{\partial x_1 \cdots \partial x_k}(x_1, \dots, x_k) = f(x_1, \dots, x_k). \]
For a more detailed discussion, we refer the reader to Billingsley [4].

1.3 Random Variables

1.3.1 Definition and Independence


Intuitively: An $\mathbb{R}^k$-valued random variable $X$ is a random element of $\mathbb{R}^k$, e.g. to describe age, education and wage of a randomly chosen person in Mannheim.

Definition 1.19. An $\mathbb{R}^k$-valued random variable is a function $X : \Omega \to \mathbb{R}^k$, where $(\Omega, \mathcal{A})$ is a measurable space and $X$ fulfils
\[ X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{A} \quad \text{for all } B \in \mathcal{B}^k. \]

The property in Definition 1.19 is called measurability. It allows us to define the so-called distribution of $X$:

Definition 1.20. Suppose that $X$ is an $\mathbb{R}^k$-valued random variable on a probability space $(\Omega, \mathcal{A}, P)$. Then
\[ P^X(B) := P(X \in B) := P(X^{-1}(B)), \quad B \in \mathcal{B}^k, \]
is called the distribution of $X$.
Remark:
(i) Show as an exercise that $P^X$ is a probability measure on $(\mathbb{R}^k, \mathcal{B}^k)$.
(ii) Interpretation: $(\Omega, \mathcal{A}, P)$ is the machinery of randomness in the background.

(iii) Notation: Capital letters $X, Y, Z, \dots$ are used for random variables (exception: Greek letters, e.g. $\varepsilon$), lower case letters $x, y, z, \dots$ are used for elements of $\mathbb{R}^k$ and denote realisations (possible outcomes of $X, Y, Z, \dots$). This is the typical convention in statistics. In econometrics, however, one often uses $x, y, z, \dots$ for random variables and their realisations at the same time.

(iv) Suppose that $(\Omega, \mathcal{A}) = (\mathbb{R}, \mathcal{B})$. Then (real-valued) indicator functions, monotone functions, continuous functions and functions with only finitely many discontinuities are (Borel) measurable (see Jacod and Protter [10, Theorems 8.3, 8.4]).

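For the indicator function it works as follows: Let $A \in \mathcal{A}$ and $X = 1_A$. Then we get

Case 1: $B \in \mathcal{B}$ and $B \cap \{0, 1\} = \{0, 1\}$:
\[ X^{-1}(B) = 1_A^{-1}(B) = \Omega = \mathbb{R} \in \mathcal{A}. \]
Case 2: $B \in \mathcal{B}$ and $B \cap \{0, 1\} = \emptyset$:
\[ X^{-1}(B) = 1_A^{-1}(B) = \emptyset \in \mathcal{A}. \]
Case 3: $B \in \mathcal{B}$ and $B \cap \{0, 1\} = \{0\}$:
\[ X^{-1}(B) = 1_A^{-1}(B) = A^c \in \mathcal{A}. \]
Case 4: $B \in \mathcal{B}$ and $B \cap \{0, 1\} = \{1\}$:
\[ X^{-1}(B) = 1_A^{-1}(B) = A \in \mathcal{A}. \]

Thus, the indicator function is (Borel) measurable.

(v) Suppose that $(\Omega, \mathcal{A}) = (\mathbb{R}, \mathcal{B})$ and that $(X_n)_n$ is a sequence of random variables on this space. Then $X_1 + X_2$, $X_1 \cdot X_2$, $\sup_n X_n$, $\inf_n X_n$, and $\lim_n X_n$ (provided it exists) are random variables (see Jacod and Protter [10, Corollary 8.1, Theorem 8.4]).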

For two random variables $X_1 : \Omega \to \mathbb{R}^{k_1}$, $X_2 : \Omega \to \mathbb{R}^{k_2}$ defined on the same probability space, one defines the joint distribution $P^{(X_1', X_2')'}$ by
\[ P((X_1', X_2')' \in B) = P(\{\omega : (X_1'(\omega), X_2'(\omega))' \in B\}), \quad B \in \mathcal{B}^{k_1 + k_2}. \]

(Interpret $(X_1', X_2')'$ as a new random variable $Y$ in $\mathbb{R}^{k_1 + k_2}$.)
Definition 1.21. (i) Random variables $X_1, \dots, X_l$ on a probability space $(\Omega, \mathcal{A}, P)$ are independent if
\[ P(X_1 \in A_1, \dots, X_l \in A_l) = P\left(X_1^{-1}(A_1) \cap \dots \cap X_l^{-1}(A_l)\right) = P(X_1 \in A_1) \cdots P(X_l \in A_l) \]
for all Borel sets $A_1, \dots, A_l$.

(ii) Suppose that $(X_t)_{t \in T}$ with some nonempty index set $T$ is a family of $\mathbb{R}^k$-valued random variables on $(\Omega, \mathcal{A}, P)$. These random variables are independent if for any finite, nonempty $I_0 \subseteq T$ and any $Q_t \in \mathcal{B}^k$, $t \in I_0$,
\[ P\left(\bigcap_{t \in I_0} X_t^{-1}(Q_t)\right) = \prod_{t \in I_0} P\left(X_t^{-1}(Q_t)\right). \]

Independence is one of the most important tools to construct complex probabilistic models.

Lemma 1.22. If the distribution of an $\mathbb{R}^k$-valued random variable $X = (X_1, \dots, X_k)'$ has a bounded, piecewise continuous density $f_X$, then
\[ f_{X_1}(x_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, \dots, x_k)\,dx_2 \cdots dx_k \]
is the density of $X_1$ (marginal density).

Proof. The assertion follows from
\[ F_{X_1}(z) = P(X_1 \in (-\infty, z], X_2 \in \mathbb{R}, \dots, X_k \in \mathbb{R}) = \int_{-\infty}^{z} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, \dots, x_k)\,dx_2 \cdots dx_k\,dx_1. \]

Furthermore, it can be shown that $X_1, \dots, X_k$ are independent if and only if
\[ f_X(x) = f_{X_1}(x_1) \cdots f_{X_k}(x_k); \]
see Chapter 20 in Billingsley [4].

1.3.2 Expectations, moments


Motivation: Relate unknown parameters of a random variable to its moments.
Question: When do moments exist and how do we calculate them?
We now consider $\mathbb{R}$-valued random variables. We call a random variable $X$ discrete if its distribution is discrete.

Definition 1.23. Suppose that $X : \Omega \to \mathcal{X}$ with countable $\mathcal{X} \subseteq \mathbb{R}$ is a discrete random variable on a probability space $(\Omega, \mathcal{A}, P)$ with $P(X = a_i) = p_i$ for $i = 1, 2, \dots$ and $\sum_i p_i = 1$.
(i) If $\mathcal{X} = \{a_1, \dots, a_N\}$ is finite, then the mean (expectation) of $X$ is given by
\[ EX = \sum_{i=1}^{N} a_i P(\{\omega \mid X(\omega) = a_i\}) = \sum_{i=1}^{N} a_i P^X(\{a_i\}). \]

(ii) If $\mathcal{X} \subseteq [0, \infty)$ is countable, then
\[ EX = \sum_{i} a_i P(\{\omega \mid X(\omega) = a_i\}) = \sum_{i} a_i P^X(\{a_i\}). \]

(iii) If $\mathcal{X}$ is countable and either $EX^+ < \infty$ or $EX^- < \infty$ (with $X^+ = \max\{0, X\}$ and $X^- = \max\{-X, 0\}$), then
\[ EX = EX^+ - EX^- = \sum_{i} a_i P^X(\{a_i\}). \]

Note that the expectation of a random variable only depends on its distribution.

Example 1.24. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Then for $A \in \mathcal{A}$
\[ E\,1_A = P(A). \]

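Definition 1.25 (General definition of expectation). (i) For a real-valued random variable $X \ge 0$ define
\[ X_n(\omega) = \frac{k}{n} \quad \text{if } \frac{k}{n} \le X(\omega) < \frac{k+1}{n} \text{ for a } k \in \mathbb{N}_0. \]
Define
\[ EX = \lim_{n \to \infty} EX_n. \]

(ii) For a real-valued random variable $X$ we put
\[ EX = EX^+ - EX^- \quad \text{if } EX^+ < \infty \text{ or } EX^- < \infty. \]

Remark:
- $X_n = \sum_{k=0}^{\infty} \frac{k}{n}\, 1_{X \in [k/n, (k+1)/n)}$ and hence $EX_n = \sum_{k=0}^{\infty} \frac{k}{n}\, P(X \in [k/n, (k+1)/n))$.
- $\|X_n - X_m\|_\infty \le \max\{n^{-1}, m^{-1}\}$.
- $\lim_{n \to \infty} EX_n$ always exists but might be infinite.
- $EX$ is finite $\Leftrightarrow$ $-\infty < EX < +\infty$ $\Leftrightarrow$ $E|X| < +\infty$.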
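
The discretisation in Definition 1.25 can be tried out numerically. The sketch below assumes, purely for illustration, that $X$ is exponentially distributed with rate 1 (so that the expectation equals 1 and the cell probabilities have a closed form); the truncation point `k_max` is also an assumption. It evaluates the expectation of the grid approximation for several grid widths $1/n$.

```python
import math

LAM = 1.0  # illustrative assumption: X is exponential with rate 1, so EX = 1


def expectation_of_xn(n, lam=LAM):
    """EX_n = sum_k (k/n) * P(k/n <= X < (k+1)/n), truncated far in the tail."""
    k_max = int(60 * n / lam)  # the neglected tail is of order exp(-60)
    total = 0.0
    for k in range(k_max):
        cell_prob = math.exp(-lam * k / n) - math.exp(-lam * (k + 1) / n)
        total += (k / n) * cell_prob
    return total


for n in (1, 10, 100, 1000):
    print(n, expectation_of_xn(n))
```

Since the approximation never exceeds $X$ and differs from it by less than $1/n$, the printed values approach the expectation from below, with a gap of at most $1/n$.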

Excursus on integration theory:

- Riemann integral: arises as a limit of sums. It is based on a horizontal grid.

- The definition of the expectation above can also be interpreted as an integral, the so-called Lebesgue integral, which is based on a vertical grid.

Riemann integral versus Lebesgue integral,
http://commons.wikimedia.org/wiki/File:Riemannvslebesgue.svg

Alternative notations:
\[ EX = \int x\,dP^X(x) = \int x\,P^X(dx) = \int x\,dF^X(x) = \int x\,F^X(dx) = \int X(\omega)\,P(d\omega) = \int X(\omega)\,dP(\omega) \]

Theorem 1.26 (Properties of expectations). For real-valued random variables $X_1, X_2$ with finite expectations and $\alpha_1, \alpha_2 \in \mathbb{R}$ it holds that
(i) $|E[X_1]| \le \sup_{\omega} |X_1(\omega)|$,
(ii) $E[\alpha_1 X_1 + \alpha_2 X_2] = \alpha_1 E[X_1] + \alpha_2 E[X_2]$ (linearity),
(iii) $E[X_1] \le E[X_2]$ if $X_1 \le X_2$ (monotonicity),
(iv) $E[X_1 X_2] = E[X_1] E[X_2]$ if $X_1, X_2$ are independent.

Remark: One can show that additivity of the expectation of positive random variables also holds if the corresponding expectations are infinite; in particular $E|X| = EX^+ + EX^-$.

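Proof. The proofs of the first three items are deferred to the exercises.
(iv):

(1) Discrete case: Suppose that $X_1, X_2 \ge 0$ and discrete, say with values in $\mathcal{X}_1 = \{a_1, a_2, \dots\}$ and $\mathcal{X}_2 = \{b_1, b_2, \dots\}$. For $Y := X_1 X_2 \ge 0$ with values in $\mathcal{Y}$ it follows that
\begin{align*}
E(X_1 X_2) &= EY \\
&= \sum_{y_i \in \mathcal{Y}} y_i P(Y = y_i) \\
&= \sum_{a_j \in \mathcal{X}_1,\, b_k \in \mathcal{X}_2} a_j b_k P(X_1 = a_j, X_2 = b_k) \\
&= \sum_{a_j \in \mathcal{X}_1,\, b_k \in \mathcal{X}_2} a_j b_k P(X_1 = a_j) P(X_2 = b_k) && \text{(by independence)} \\
&= \sum_{a_j \in \mathcal{X}_1} a_j P(X_1 = a_j) \sum_{b_k \in \mathcal{X}_2} b_k P(X_2 = b_k) \\
&= EX_1\, EX_2.
\end{align*}

(2) Suppose that $X_1, X_2 \ge 0$. Note that
\[ X_{1,n} X_{2,n} \le (X_1 X_2)_{n^2} \le \left(X_{1,n} + \frac{1}{n}\right)\left(X_{2,n} + \frac{1}{n}\right). \]
(Justification:
\begin{align*}
X_{1,n} X_{2,n} &= \sum_{k,l=0}^{\infty} \frac{kl}{n^2}\, 1_{X_1 \in [k/n, (k+1)/n),\, X_2 \in [l/n, (l+1)/n)} \\
&= \sum_{m=0}^{\infty} \frac{m}{n^2} \sum_{k,l:\, kl = m} 1_{X_1 \in [k/n, (k+1)/n),\, X_2 \in [l/n, (l+1)/n)} \\
&\le \sum_{m=0}^{\infty} \frac{m}{n^2}\, 1_{X_1 X_2 \in [m/n^2, (m+1)/n^2)} \\
&= (X_1 X_2)_{n^2}
\end{align*}
and similarly for the second inequality.)
Here, $X_{j,n}$ is the discrete approximation of $X_j$ on a $\frac{1}{n}$-grid, and $(X_1 X_2)_{n^2}$ is the discrete approximation of $X_1 X_2$ on a $\frac{1}{n^2}$-grid. For a discrete random variable one can easily check that (iv) holds. Thus, we have that $E[(X_{1,n} + \frac{1}{n})(X_{2,n} + \frac{1}{n})] = E[X_{1,n} + \frac{1}{n}]\, E[X_{2,n} + \frac{1}{n}]$ and $E[X_{1,n} X_{2,n}] = E[X_{1,n}] E[X_{2,n}]$. By application of (iii) this gives
\[ E[X_{1,n}] E[X_{2,n}] \le E[(X_1 X_2)_{n^2}] \le E\left[X_{1,n} + \frac{1}{n}\right] E\left[X_{2,n} + \frac{1}{n}\right]. \]

This implies the result because for $n \to \infty$ the lower and the upper bound converge to $E[X_1] E[X_2]$.
(3) General case:
\begin{align*}
E(X_1 X_2) &= E[(X_1^+ - X_1^-)(X_2^+ - X_2^-)] \\
&= E[X_1^+ X_2^+ - X_1^+ X_2^- - X_2^+ X_1^- + X_2^- X_1^-] \\
&= E(X_1^+ X_2^+) - E(X_1^+ X_2^-) - E(X_2^+ X_1^-) + E(X_2^- X_1^-) && \text{(by linearity)} \\
&= E(X_1^+) E(X_2^+) - E(X_1^+) E(X_2^-) - E(X_2^+) E(X_1^-) + E(X_2^-) E(X_1^-) && \text{(by (2))} \\
&= E(X_1^+ - X_1^-)\, E(X_2^+ - X_2^-) \\
&= EX_1\, EX_2.
\end{align*}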

Theorem 1.27. If the distribution function $F_X$ of $X$, given by $F_X(x) = P(X \le x)$, $x \in \mathbb{R}$, has a bounded and piecewise continuous density $f_X$ and if $E[X]$ is finite, then
\[ E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx. \tag{2} \]

Proof. See exercises for $X \ge 0$. An extension to the general case is then straightforward.

Extensions of Theorem 1.27: Assume as above that $f_X$ is bounded and piecewise continuous and, moreover, that $g$ is a piecewise continuous function that is either bounded or non-negative. Then
\[ E g(X) = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx. \tag{3} \]

(See e.g. Georgii [8] for a proof.)

Definition 1.28. Suppose that $X$ is a real-valued random variable. Then

(i) the $s$-th moment of $X$ is defined as $E[X^s]$ (if well-defined),
(ii) the $s$-th absolute moment as $E[|X|^s]$,
(iii) and the $s$-th central moment as $E[(X - \mu)^s]$ with $\mu = E[X]$ (if well-defined).
(iv) The 2nd central moment is also called variance: $Var[X] = E[(X - \mu)^2]$.

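Example 1.29. $X \sim N(\mu, \sigma^2)$: $EX = \mu$, $Var[X] = \sigma^2$. We only give the calculations for the expectation here. First note that
\begin{align*}
EX^+ &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_0^{\infty} x\, e^{-(x - \mu)^2/(2\sigma^2)}\,dx \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \left( \sigma^2 \int_0^{\infty} \frac{x - \mu}{\sigma^2}\, e^{-(x - \mu)^2/(2\sigma^2)}\,dx + \mu \int_0^{\infty} e^{-(x - \mu)^2/(2\sigma^2)}\,dx \right) < \infty
\end{align*}
and similarly $EX^- < \infty$. Hence,
\[ EX = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} x\, e^{-(x - \mu)^2/(2\sigma^2)}\,dx = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} y\, e^{-y^2/(2\sigma^2)}\,dy + \mu\, \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-y^2/(2\sigma^2)}\,dy = \mu. \]
In particular, we see that there are random variables with the same expectation but different distributions.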

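Theorem 1.30. If $X_1, X_2, \dots, X_n$ are independent with finite variance, then
\[ Var(X_1 + \dots + X_n) = Var(X_1) + \dots + Var(X_n). \]

Proof. This assertion follows from Theorem 1.26(iv):
\begin{align*}
Var\left(\sum_{i=1}^{n} X_i\right) &= E\left(\sum_{i=1}^{n} X_i - E\left(\sum_{i=1}^{n} X_i\right)\right)^2 \\
&= E\left(\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} EX_i\right)^2 && \text{(by linearity)} \\
&= E\left(\sum_{i=1}^{n} (X_i - EX_i)\right)^2 \\
&= E\left(\sum_{i=1}^{n} (X_i - EX_i) \sum_{j=1}^{n} (X_j - EX_j)\right) \\
&= \sum_{i=1, j=1}^{n} E(X_i - EX_i)(X_j - EX_j) && \text{(by linearity)} \\
&= \sum_{\substack{i=1, j=1 \\ i \neq j}}^{n} E(X_i - EX_i)(X_j - EX_j) + \sum_{i=1}^{n} \underbrace{E(X_i - EX_i)^2}_{Var(X_i)}
\end{align*}

By the fact that functions of independent random variables are also independent it follows that
\[ Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{\substack{i=1, j=1 \\ i \neq j}}^{n} E(X_i - EX_i)\, E(X_j - EX_j) + \sum_{i=1}^{n} Var(X_i). \]

Moreover, we have for all $i = 1, \dots, n$
\begin{align*}
E(X_i - EX_i) &= EX_i - E(EX_i) && \text{(by linearity)} \\
&= EX_i - EX_i\, E(1) && \text{(by linearity, since } EX_i \text{ is a constant)} \\
&= EX_i - EX_i\, E\,1_{X_i \in \mathbb{R}} \\
&= EX_i - EX_i\, P(X_i \in \mathbb{R}) \\
&= EX_i - EX_i \\
&= 0.
\end{align*}

Then the desired result follows:
\[ Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} Var(X_i). \]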
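
Theorem 1.30 can be checked exactly for small discrete examples. The sketch below assumes two independent discrete random variables with hypothetical point masses (`pmf_x1`, `pmf_x2`), builds the distribution of their sum from the product of the marginals, and compares the variances using exact fractions.

```python
from fractions import Fraction
from itertools import product

# Illustrative assumption: two independent discrete random variables given by their pmfs.
pmf_x1 = {0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 4)}
pmf_x2 = {-1: Fraction(1, 3), 2: Fraction(2, 3)}


def mean(pmf):
    """Expectation of a discrete random variable: sum of value times probability."""
    return sum(value * p for value, p in pmf.items())


def variance(pmf):
    """Second central moment of a discrete random variable."""
    mu = mean(pmf)
    return sum((value - mu) ** 2 * p for value, p in pmf.items())


# Under independence the joint pmf is the product of the marginal pmfs.
pmf_sum = {}
for (a, pa), (b, pb) in product(pmf_x1.items(), pmf_x2.items()):
    pmf_sum[a + b] = pmf_sum.get(a + b, Fraction(0)) + pa * pb

print(variance(pmf_sum), variance(pmf_x1) + variance(pmf_x2))
```

Without independence the product construction of the joint distribution is not available, and the two printed values can differ.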

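For an $\mathbb{R}^k$-valued random variable $X = (X_1, \dots, X_k)'$ such that $EX_j$ exists for all $j$, we define the expectation (vector) as
\[ \mu = E[X] = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_k] \end{pmatrix}. \]
If $E|X_j|^2 < \infty$ for all $j$, the covariance matrix is defined by
\begin{align*}
\Sigma &= E[(X - \mu)(X - \mu)'] \\
&= E \begin{pmatrix} (X_1 - \mu_1)^2 & \dots & (X_1 - \mu_1)(X_k - \mu_k) \\ \vdots & \ddots & \vdots \\ (X_k - \mu_k)(X_1 - \mu_1) & \dots & (X_k - \mu_k)^2 \end{pmatrix} \\
&= \begin{pmatrix} E[(X_1 - \mu_1)^2] & \dots & E[(X_1 - \mu_1)(X_k - \mu_k)] \\ \vdots & \ddots & \vdots \\ E[(X_k - \mu_k)(X_1 - \mu_1)] & \dots & E[(X_k - \mu_k)^2] \end{pmatrix}.
\end{align*}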

2 Asymptotic theory

In applications, we are typically interested in the behaviour of $X_{n_0}$ for fixed $n_0$. Via asymptotic theory we get an approximation by embedding $X_{n_0}$ into a sequence $(X_n)_n$.

2.1 Convergence of expectations

Suppose that for a sequence of random variables $(X_n)_n$ and a random variable $X$ on a probability space $(\Omega, \mathcal{A}, P)$
\[ X_n(\omega) \xrightarrow[n \to \infty]{} X(\omega) \quad \text{for all } \omega \in \Omega. \tag{4} \]
In general this is not sufficient for
\[ EX_n \xrightarrow[n \to \infty]{} EX \]
(see exercises). Our aim is now to set up additional conditions that ensure convergence of the corresponding expectations. There are two main tools, the monotone convergence theorem and Lebesgue's dominated convergence theorem.
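
One standard counterexample (not necessarily the one intended in the exercises) takes the uniform distribution on $[0, 1]$ and lets $X_n$ equal $n$ on the interval $(0, 1/n)$ and $0$ elsewhere. Every path converges to $0$, yet each $X_n$ has expectation $n \cdot (1/n) = 1$. The Python sketch below illustrates this; the function names and the fixed outcome `omega` are assumptions of the example.

```python
from fractions import Fraction


def x_n(n, omega):
    """X_n(omega) = n if 0 < omega < 1/n, and 0 otherwise."""
    return n if 0 < omega < Fraction(1, n) else 0


def expectation_x_n(n):
    """EX_n = n * P((0, 1/n)) under the uniform distribution on [0, 1]."""
    return n * Fraction(1, n)


omega = Fraction(1, 7)  # an arbitrary fixed outcome
path = [x_n(n, omega) for n in range(1, 21)]

print(path)                                        # zero from n = 7 onwards
print([expectation_x_n(n) for n in (1, 10, 100)])  # constant in n
```

No integrable random variable dominates all $X_n$ here, and the sequence is not monotone, so neither of the two convergence theorems applies.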

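Theorem 2.1 (Monotone convergence theorem). Assume (4) and
\[ 0 \le X_n(\omega) \le X_{n+1}(\omega) \quad \text{for } n \ge 1,\ \omega \in \Omega. \tag{5} \]

Then
\[ EX_n \xrightarrow[n \to \infty]{} EX. \]

Proof. See Jacod and Protter [10].

If the random variables do not converge, one can obtain a weaker result.

Lemma 2.2 (Fatou's Lemma). Consider a sequence of nonnegative random variables $(X_n)_n$ (defined on the same probability space $(\Omega, \mathcal{A}, P)$) and define
\[ X(\omega) = \liminf_{n \to \infty} X_n(\omega). \]
Then
\[ EX \le \liminf_{n \to \infty} EX_n. \]
Proof. Recall that $\liminf_{n \to \infty} X_n(\omega) = \lim_{n \to \infty} \left(\inf_{k \ge n} X_k(\omega)\right)$. First note that the sequence $(Y_n)_n$, given by $Y_n = \inf_{k \ge n} X_k$, satisfies $0 \le Y_n(\omega) \le Y_{n+1}(\omega)$ for all $n$, $\omega$, and $Y_n(\omega) \to X(\omega)$ as $n \to \infty$. By the monotone convergence theorem and the fact that $Y_n \le X_k$ for all $k \ge n$,
\[ EX = \lim_{n \to \infty} EY_n \le \lim_{n \to \infty} \inf_{k \ge n} EX_k = \liminf_{n \to \infty} EX_n. \]

Using the latter result, Lebesgue's dominated (majorized) convergence theorem can be proven.

Theorem 2.3 (Dominated Convergence Theorem). Assume (4) and
\[ |X_n(\omega)| \le Y(\omega) \quad \text{for } n \ge 1,\ \omega \in \Omega \tag{6} \]
for a random variable $Y$ with $E[Y] < +\infty$. Then
\[ E[X_n] \xrightarrow[n \to \infty]{} E[X]. \]

Proof. See Jacod and Protter [10].

Remark: Both theorems also hold if (4), (5), (6) only hold a.s. (almost surely), i.e.
\[ P(X_n \to X) = P(\{\omega : X_n(\omega) \to X(\omega)\}) = 1. \tag{7} \]
\[ P(\{\omega : 0 \le X_n(\omega) \le X_{n+1}(\omega)\}) = 1 \text{ for } n \ge 1, \text{ or, equivalently, } P(\{\omega : 0 \le X_n(\omega) \le X_{n+1}(\omega) \text{ for } n \ge 1\}) = 1. \tag{8} \]
For a random variable $Y$ with $E[Y] < +\infty$ it holds that
\[ P(\{\omega : |X_n(\omega)| \le Y(\omega)\}) = 1 \text{ for } n \ge 1, \text{ or, equivalently, } P(\{\omega : |X_n(\omega)| \le Y(\omega) \text{ for } n \ge 1\}) = 1. \tag{9} \]

The proofs can be found in Jacod and Protter [10] and will be skipped here.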

2.2 Modes of convergence

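2.2.1 Definitions
Let $\|\cdot\|$ denote a norm on $\mathbb{R}^k$, e.g. $\|x\| = \|x\|_1 = \sum_{j=1}^{k} |x_j|$ or $\|x\| = \|x\|_2 = \sqrt{\sum_{j=1}^{k} |x_j|^2}$.

Definition 2.4. Suppose that $(X_n)_n$ and $X$ are random variables on a probability space $(\Omega, \mathcal{A}, P)$ and with values in $(\mathbb{R}^k, \mathcal{B}^k)$.
(i) (Convergence in probability)
The sequence $(X_n)_n$ converges in probability to $X$ if
\[ P(\{\omega \mid \|X_n(\omega) - X(\omega)\| > \varepsilon\}) = P(\|X_n - X\| > \varepsilon) \xrightarrow[n \to \infty]{} 0 \quad \text{for all } \varepsilon > 0. \]

Notation: $X_n \xrightarrow{P} X$, $\text{p-}\lim_{n \to \infty} X_n = X$.

(ii) (Almost sure convergence)
The sequence $(X_n)_n$ converges almost surely to $X$ if
\[ P(\{\omega \mid \lim_{n \to \infty} X_n(\omega) = X(\omega)\}) = P(X_n \to X) = 1. \]

Notation: $X_n \to X$ $P$-a.s., $X_n \to X$ a.s., $X_n \xrightarrow{a.s.} X$.

(iii) (Convergence in $p$-th mean ($L_p$ convergence))
The sequence $(X_n)_n$ converges in $p$-th mean to $X$ if
\[ E\|X_n - X\|^p \to 0. \]

Notation: $X_n \xrightarrow{L_p} X$, $L_p\text{-}\lim_{n \to \infty} X_n = X$.

Now we drop the assumption that all random variables live on the same probability space.

Definition 2.5. Suppose that $(X_n)_n$ and $X$ are random variables with values in $(\mathbb{R}^k, \mathcal{B}^k)$.
(i) (Convergence in distribution)
The sequence $(X_n)_n$ converges to $X$ in distribution if
\[ E f(X_n) \xrightarrow[n \to \infty]{} E f(X) \]
for all functions $f : \mathbb{R}^k \to \mathbb{R}$ that are continuous and bounded.

Notation: $X_n \xrightarrow{\mathcal{L}} X$, $X_n \xrightarrow{\mathcal{D}} X$, $X_n \xrightarrow{d} X$

($d$, $\mathcal{D}$, and $\mathcal{L}$ for "in distribution", "in law").

(ii) (Stochastic boundedness, tightness)
The sequence $(X_n)_n$ is stochastically bounded if
\[ \forall\, \varepsilon > 0\ \exists\, C, n_0 : \quad P(\|X_n\| \le C) \ge 1 - \varepsilon \quad \forall\, n \ge n_0. \]

Notation: $X_n = O_P(1)$.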

Remark:
(i) Rough interpretation of convergence in probability: $X_n$ is approximately equal to $X$.
(ii) Rough interpretation of convergence in distribution: $X_n$ has approximately the same distribution as $X$.

There is another characterization of convergence in distribution based on the corresponding distribution functions. In applications it is sometimes easier to work with one or the other definition.

Theorem 2.6. Suppose that $(X_n)_n$ and $X$ are real-valued with distribution functions $(F_n)_n$ and $F$, respectively. Then
\[ X_n \xrightarrow{d} X \quad \Longleftrightarrow \quad F_n(x) \xrightarrow[n \to \infty]{} F(x) \text{ at all continuity points } x \text{ of } F. \]

Proof. See proof of Theorem 18.4 in Jacod and Protter [10].

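2.2.2 Relation between different modes of convergence

Suppose that $(X_n)_n$ and $X$ are random variables on a probability space $(\Omega, \mathcal{A}, P)$. Then the following scheme holds true:
\[
\left.
\begin{array}{r}
X_n \xrightarrow{L_p} X \ \Rightarrow\ X_n \xrightarrow{L_{p-1}} X \ \Rightarrow\ X_n \xrightarrow{L_1} X \\[4pt]
X_n \xrightarrow{a.s.} X
\end{array}
\right\}
\ \Rightarrow\ X_n \xrightarrow{P} X \ \Rightarrow\ X_n \xrightarrow{d} X
\]

$L_p$-convergence ($p \ge 1$)
\[ X_n \xrightarrow{L_p} X \ \Rightarrow\ X_n \xrightarrow{L_{p-1}} X \ \Rightarrow\ X_n \xrightarrow{L_1} X \]
This follows from Jensen's inequality:
\[ (E\|Y\|)^p \le E(\|Y\|^p) \quad \text{for } p \ge 1, \]
see Billingsley [4].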

Lp -convergence and convergence in probability


We begin this paragraph with a very powerful inequality and will derive the desired result on the relation between $L_p$-convergence and convergence in probability as a corollary.

Theorem 2.7 (Markov's inequality). For a random variable $X$ and a (strictly) monotonically increasing function $g : [0, \infty) \to [0, \infty)$ it holds for $\varepsilon > 0$ that
\[ P(\|X\| \ge \varepsilon) \le \frac{E g(\|X\|)}{g(\varepsilon)}. \]

Proof. It holds that $1(x \ge \varepsilon) \le g(x)/g(\varepsilon)$. We apply this inequality with $x$ equal to $\|X(\omega)\|$. This gives
\[ P(\|X\| \ge \varepsilon) = E[1(\|X\| \ge \varepsilon)] \le E\left[\frac{g(\|X\|)}{g(\varepsilon)}\right] = \frac{E[g(\|X\|)]}{g(\varepsilon)}. \]

Corollary 2.8. Suppose that $(X_n)_n$ and $X$ are random variables on some probability space $(\Omega, \mathcal{A}, P)$. Then
\[ X_n \to X \text{ in } p\text{-th mean} \quad \Longrightarrow \quad X_n \xrightarrow{P} X. \]

Proof. Here, we apply Markov's inequality with $g(u) = u^p$:
\[ P(\|X_n - X\| \ge \varepsilon) \le \frac{E\|X_n - X\|^p}{\varepsilon^p} \to 0. \]

Before we discuss further relations between modes of convergence, two other important applications of Markov's inequality are presented.

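Application 1. For a real-valued random variable $X$ with finite second moment and $\varepsilon > 0$ it holds that
\[ P(|X - EX| \ge \varepsilon) \le \frac{Var(X)}{\varepsilon^2} \quad \text{(Chebyshev's inequality)}. \]
Proof. The application of Markov's inequality with $g(u) = u^2$ and $Y = X - EX$ yields the assertion.

Application 2.
Theorem 2.9 (Weak law of large numbers 1). Suppose that $X_1, X_2, \dots$ are uncorrelated random variables (i.e. $cov(X_i, X_j) = 0$, $i \neq j$) with $EX_1 = EX_2 = \dots = \mu \in \mathbb{R}$ and $Var X_1, Var X_2, \dots \le \sigma^2 < \infty$. Then
\[ \bar{X}_n = \frac{1}{n}(X_1 + \dots + X_n) \xrightarrow{P} \mu. \]

Proof. As
\[ E[(\bar{X}_n - \mu)^2] = Var(\bar{X}_n) = \frac{1}{n^2}[Var(X_1) + \dots + Var(X_n)] \le \frac{\sigma^2}{n}, \]
we obtain from Chebyshev's inequality that
\[ P(|\bar{X}_n - \mu| \ge \varepsilon) \le \frac{1}{\varepsilon^2}\, \frac{\sigma^2}{n} \xrightarrow[n \to \infty]{} 0. \]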
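
To see how the Chebyshev bound in this proof compares with the actual deviation probability, one can compute both exactly for coin-flip data. The sketch below assumes i.i.d. Bernoulli variables with success probability 3/10 and tolerance 1/10 (both illustrative choices), so that the sum is binomially distributed and no simulation is needed.

```python
import math
from fractions import Fraction

THETA = Fraction(3, 10)  # illustrative assumption: Bernoulli success probability
EPS = Fraction(1, 10)    # illustrative assumption: tolerance epsilon


def binomial_pmf(k, n, theta):
    """P(S_n = k) for S_n ~ Binomial(n, theta), computed on the log scale."""
    log_p = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(theta) + (n - k) * math.log(1 - theta)
    )
    return math.exp(log_p)


def deviation_probability(n):
    """Exact P(|mean_n - theta| >= eps) for the mean of n i.i.d. Bernoulli(theta) variables."""
    return sum(
        binomial_pmf(k, n, float(THETA))
        for k in range(n + 1)
        if abs(Fraction(k, n) - THETA) >= EPS
    )


def chebyshev_bound(n):
    """sigma^2 / (eps^2 * n) with sigma^2 = theta * (1 - theta)."""
    return float(THETA * (1 - THETA) / (EPS ** 2 * n))


for n in (10, 100, 1000):
    print(n, deviation_probability(n), chebyshev_bound(n))
```

The comparison with the tolerance is done in exact fractions so that boundary cases are not lost to rounding. The bound is valid for every $n$ but typically far from tight.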

One can drop the assumption of finite second moments.

Theorem 2.10 (Weak law of large numbers 2). For a sequence of i.i.d. random variables $X_1, X_2, \dots$ with finite mean $\mu = E[X_j]$ it holds that $\bar{X}_n \xrightarrow{P} \mu$.

Theorem 2.10 can be proved along the lines of Theorem 2.9, although the argument is more involved. Here, Chebyshev's inequality is applied to an average of truncated versions of $X_1, X_2, \dots$ instead of $\bar{X}_n$.

Remark: In general, convergence in probability does not imply convergence in $L_p$, see exercises.

Almost sure convergence and convergence in probability


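Theorem 2.11. Suppose that $(X_n)_n$ and $X$ are random variables on some probability space $(\Omega, \mathcal{A}, P)$ such that $X_n \xrightarrow{a.s.} X$. Then
\[ X_n \xrightarrow{P} X. \]

Proof. Choose $\varepsilon > 0$ arbitrary but fixed and put $Z_n = 1(|X_n - X| \ge \varepsilon)$. Then by dominated convergence
\[ |Z_n| \le 1, \quad Z_n \to 0 \text{ a.s.} \;\Longrightarrow\; EZ_n \xrightarrow[n \to \infty]{} 0. \]
Thus,
\[ P(|X_n - X| \ge \varepsilon) = P(Z_n = 1) = EZ_n \xrightarrow[n \to \infty]{} 0. \]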

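Convergence in probability does not imply a.s. convergence, as the following counterexample shows.

Example 2.12. Suppose that $(\Omega, \mathcal{A}, P) = ([0, 1], \mathcal{B}, Uniform[0, 1])$ and define
\[ X_{2^k + j}(\omega) = 1_{[j 2^{-k}, (j+1) 2^{-k}]}(\omega), \quad k \in \mathbb{N}_0, \; j = 0, \dots, 2^k - 1. \]
Then with $X \equiv 0$, $P(|X_n - X| > \varepsilon) \xrightarrow[n \to \infty]{} 0$ but $P(\lim_{n \to \infty} X_n = X) = P(\emptyset) = 0$.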
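The sequence of Example 2.12 can be evaluated directly to see both effects: the probability of a one shrinks with the block index k, yet every fixed sample point is hit at least once in each block. This is only an illustrative sketch; the function names and the sample point 0.3 are assumptions, not part of the notes.

```python
def typewriter(n: int, omega: float) -> int:
    """X_n(omega) for n = 2**k + j with 0 <= j < 2**k (Example 2.12)."""
    k = n.bit_length() - 1          # largest k with 2**k <= n
    j = n - 2 ** k
    left, right = j / 2 ** k, (j + 1) / 2 ** k
    return 1 if left <= omega <= right else 0

def prob_one(n: int) -> float:
    """P(X_n = 1): length of the interval [j 2^-k, (j+1) 2^-k]."""
    return 2.0 ** -(n.bit_length() - 1)

omega = 0.3  # hypothetical sample point
hits = [n for n in range(1, 2 ** 10) if typewriter(n, omega) == 1]
print(prob_one(1), prob_one(512))   # interval lengths shrink to zero
print(hits)                         # indices n with X_n(omega) = 1, spread over all blocks
```

Because the list of hits never ends as more blocks are added, the sequence $X_n(\omega)$ keeps returning to 1 and cannot converge to 0.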

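Convergence in probability and convergence in distribution
Theorem 2.13. For $\mathbb{R}^k$-valued random variables $(X_n)_n$ and $X$ it holds:
\[ X_n \xrightarrow{P} X \;\Longrightarrow\; X_n \xrightarrow{d} X. \]

Proof. Choose $\varepsilon > 0$ and a bounded continuous function $f : \mathbb{R}^k \to \mathbb{R}$. We will show that there exists an $n_0 > 0$ such that
(*) $|E[f(X_n)] - E[f(X)]| \le \varepsilon$
for $n \ge n_0$. W.l.o.g. we assume that $|f(x)| \le 1$ for all $x$.
First, we choose $C > 0$ such that $P(\|X\| \ge C) \le \varepsilon/6$. Such a $C$ exists because $P(\|X\| \ge m) \to 0$ for $m \to \infty$ by monotone convergence (or dominated convergence).
Second, choose $0 < \delta < 1$ such that $|f(x) - f(z)| \le \varepsilon/3$ for all $x, z$ with $\|x\| \le C + 1$, $\|z\| \le C + 1$ and $\|x - z\| \le \delta$. Such a $\delta$ exists because $f$ is uniformly continuous on the compact set $\{x : \|x\| \le C + 1\}$.
Third, choose $n_0$ such that $P(\|X_n - X\| \ge \delta) \le \varepsilon/6$ for $n \ge n_0$.
Now define the events $A_{n,1} = \{\|X\| \ge C\}$, $A_{n,2} = \{\|X\| < C, \|X_n - X\| < \delta\}$, and $A_{n,3} = \{\|X\| < C, \|X_n - X\| \ge \delta\}$. Denote the indicator functions of these events by $1_{A_{n,1}}$, $1_{A_{n,2}}$ and $1_{A_{n,3}}$. Define also $A^*_{n,3} = \{\|X_n - X\| \ge \delta\}$. Note that $A_{n,3} \subseteq A^*_{n,3}$ and $A_{n,1} \cup A_{n,2} \cup A_{n,3} = \Omega$. On $A_{n,2}$ both $\|X\| \le C + 1$ and $\|X_n\| \le C + 1$ hold, so the second step applies there. We now have all parts of our argument prepared to show (*):
\[
\begin{aligned}
|E[f(X_n)] - E[f(X)]| &\le E[|f(X_n) - f(X)|] \\
&= E[1_{A_{n,1}} |f(X_n) - f(X)|] + E[1_{A_{n,2}} |f(X_n) - f(X)|] + E[1_{A_{n,3}} |f(X_n) - f(X)|] \\
&\le E[1_{A_{n,1}} \cdot 2] + E[1_{A_{n,2}} \, \varepsilon/3] + E[1_{A_{n,3}} \cdot 2] \\
&\le 2P[A_{n,1}] + P[A_{n,2}] \, \varepsilon/3 + 2P[A^*_{n,3}] \\
&\le 2\varepsilon/6 + \varepsilon/3 + 2\varepsilon/6 \\
&= \varepsilon.
\end{aligned}
\]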

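Another useful relation between both modes of convergence is the following:

Theorem 2.14. For $\mathbb{R}^k$-valued random variables $(X_n)_n$ on a probability space $(\Omega, \mathcal{A}, P)$ and deterministic $a \in \mathbb{R}^k$ it holds:
\[ X_n \xrightarrow{P} a \;\Longleftrightarrow\; X_n \xrightarrow{d} a. \]

Proof. We have to prove "$\Longleftarrow$" only. We define a non-negative function $f$ by $f(x) = \|x - a\|/(1 + \|x - a\|)$, which is continuous and bounded by one. Therefore, $Ef(X_n) \xrightarrow[n \to \infty]{} Ef(a) = 0$. On the other hand, the function $g(z) = z/(1 + z)$ is increasing, which in turn implies
\[ Ef(X_n) \ge E\left[\frac{\varepsilon}{1 + \varepsilon} 1_{\|X_n - a\| \ge \varepsilon}\right] = \frac{\varepsilon}{1 + \varepsilon} P(\|X_n - a\| \ge \varepsilon). \]
This finally implies $P(\|X_n - a\| \ge \varepsilon) \xrightarrow[n \to \infty]{} 0$.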

2.2.3 Discussion of convergence in distribution

The general definition of convergence in distribution can be invoked to deduce the following theorem, which plays an outstanding role in mathematical statistics.

Theorem 2.15 (Continuous mapping theorem (CMT)). Suppose that $(X_n)_n$, $X$ are random variables with values in a joint state space $\widetilde{\Omega} \subseteq \mathbb{R}^k$ such that $X_n \xrightarrow{d} X$. Moreover, let $f : \widetilde{\Omega} \to \mathbb{R}^l$ be a continuous function. Then $f(X_n) \xrightarrow{d} f(X)$.

Proof. Exercise.

Example 2.16 (Plug-in principle). Suppose that we observe realizations of the $\mathbb{R}^k$-valued random variables $X_1, \dots, X_n$ and we want to estimate an unknown parameter $\theta \in \mathbb{R}^p$. Then
\[ \hat{\theta}_n = T(X_1, \dots, X_n) \]
with some measurable function $T : \mathbb{R}^{kn} \to \mathbb{R}^p$ is called parameter estimator. A sequence of estimators $(\hat{\theta}_n)_n$ for an unknown parameter $\theta$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$. If $(\hat{\theta}_n)_n$ is a consistent sequence of estimators for a parameter $\theta$ and if $g$ is a continuous function, then $g(\hat{\theta}_n) \xrightarrow{P} g(\theta)$. E.g. suppose that $X_1, \dots, X_n$ are i.i.d. (independent and identically distributed) with $P^{X_1} = Exp(\lambda)$, $\lambda > 0$, and values in $(0, \infty)$. We search for a consistent estimator of $\lambda$. Since by the WLLN $\bar{X}_n = \frac{1}{n} \sum_{k=1}^n X_k \xrightarrow{P} EX_1 = \frac{1}{\lambda}$, the CMT gives that the definition $\hat{\lambda}_n = \frac{1}{\bar{X}_n}$ yields a consistent sequence of estimators for $\lambda$.
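The plug-in estimator of Example 2.16 can be tried out on simulated data. The sketch below is only an illustration: the rate value 2.0, the seed and the sample sizes are arbitrary assumptions, and `rate_estimator` is a hypothetical helper name.

```python
import random

def rate_estimator(sample: list[float]) -> float:
    """Plug-in estimator 1 / (sample mean) for the rate of Exp(lambda)."""
    return len(sample) / sum(sample)

rng = random.Random(1)
true_rate = 2.0  # hypothetical value, for illustration only
for n in (100, 10_000, 200_000):
    sample = [rng.expovariate(true_rate) for _ in range(n)]
    print(n, rate_estimator(sample))  # estimates should settle near true_rate
```

The map $x \mapsto 1/x$ is continuous on $(0, \infty)$, which is exactly what the CMT needs in order to carry consistency of $\bar{X}_n$ over to $\hat{\lambda}_n$.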

Even though our definition of distributional convergence is technically suitable, it is often inconvenient for verifying distributional convergence directly. Next, we state a useful tool to deduce multivariate distributional convergence from the univariate case.

Theorem 2.17 (Cramér-Wold device). Suppose that $(X_n)_n$ and $X$ are $\mathbb{R}^k$-valued random variables. Then
\[ X_n \xrightarrow{d} X \;\Longleftrightarrow\; a'X_n \xrightarrow{d} a'X \quad \forall a \in \mathbb{R}^k. \]

Proof. For the proof, we refer the reader to Theorem 29.4 in Billingsley [4].

Remark. In particular, the theorem implies that the distribution of a random vector $X$ is uniquely determined by the distributions of $a'X$ for all $a \in \mathbb{R}^k$. This is used in computer tomography to recover images. It is also useful to deduce a multivariate CLT from a univariate one, see below.

In different situations, one or another characterization of distributional convergence is preferable.

Theorem 2.18. Suppose that $(X_n)_n$, $X$ are $\mathbb{R}^k$-valued random variables. Choose $l \in \mathbb{N}$. The following statements are equivalent:
(i) $X_n \xrightarrow{d} X$.

(ii) $Ef(X_n) \xrightarrow[n \to \infty]{} Ef(X)$ for all functions $f : \mathbb{R}^k \to \mathbb{R}$ that are continuous and have bounded support.

(iii) $Ef(X_n) \xrightarrow[n \to \infty]{} Ef(X)$ for all functions $f : \mathbb{R}^k \to \mathbb{R}$ that are $l$-times differentiable and bounded.

(iv) $E[\exp(i a'X_n)] \xrightarrow[n \to \infty]{} E[\exp(i a'X)]$ for all $a \in \mathbb{R}^k$. Here $i = \sqrt{-1}$. The function $a \mapsto E[\exp(i a'X)]$ is also called the characteristic function or Fourier transform of $X$.

Proof. Clearly, (i) $\Longrightarrow$ (ii). The converse can be proved using the ideas of the proof of Theorem 2.11. (i) $\Longleftrightarrow$ (iii) can be found in Pollard [11, Theorem III.3.12]. The proof of (i) $\Longleftrightarrow$ (iv) can be found in Section 29 in Billingsley [4].

An important application is Slutsky's lemma.

Lemma 2.19 (Slutsky's Lemma). For $\mathbb{R}^k$-valued random variables $(X_n)_n$, $(Z_n)_n$ and $X$ it holds:
\[ \|X_n - Z_n\| \xrightarrow{P} 0, \quad Z_n \xrightarrow{d} X \;\Longrightarrow\; X_n \xrightarrow{d} X. \]

Proof. Note that the functions considered in Theorem 2.18(ii) are uniformly continuous. Using this particular characterization of convergence in distribution, we get for any $\delta > 0$
\[ |Ef(X_n) - Ef(X)| \le |Ef(Z_n) - Ef(X)| + E\big[|f(Z_n) - f(X_n)| 1_{\|Z_n - X_n\| \le \delta}\big] + 2\|f\|_\infty P(\|Z_n - X_n\| > \delta). \]
The right-hand side is less than any prescribed $\varepsilon > 0$ whenever $\delta > 0$ is chosen sufficiently small and $n \ge n_0(\delta)$, where $n_0$ has to be chosen sufficiently large.
2.2.4 Discussion of stochastic boundedness
Recall: $(X_n)_n$ is stochastically bounded if
\[ \forall \varepsilon > 0 \;\; \exists\, C, n_0 \text{ with } P(\|X_n\| \le C) \ge 1 - \varepsilon \quad \forall n \ge n_0. \]
Notation: $X_n = O_p(1)$.
Why do we not use
\[ P(\|X_n\| \le C) = 1 \text{ for some } C? \]
In this sense, not even constant sequences $(X_n)_n$ would be bounded, i.e. sequences with $X_n \equiv X$ (the whole sequence $X_n$ is identical to a fixed random variable $X$). Note that in general it does not hold that
\[ P(\|X\| \le C) = 1 \text{ for some } C. \]
Consider e.g. $X \sim N(0, 1)$.


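Theorem 2.20. Suppose that $(X_n)_n$ and $X$ are $\mathbb{R}^k$-valued random variables. Then
\[ X_n \xrightarrow{d} X \;\Longrightarrow\; X_n = O_p(1). \]

Proof. We show the statement for real-valued $X_n$ and $X$ with continuous CDF $F_X$. The general proof follows from Theorem 29.1 in Billingsley [4].
For $\varepsilon > 0$ choose $x_\varepsilon$, $y_\varepsilon$ with $F_X(x_\varepsilon) < \varepsilon/2$, $F_X(y_\varepsilon) > 1 - \varepsilon/2$. Then:
\[ F_{X_n}(y_\varepsilon) - F_{X_n}(x_\varepsilon) = P(x_\varepsilon < X_n \le y_\varepsilon) \xrightarrow[n \to \infty]{} \underbrace{F_X(y_\varepsilon)}_{> 1 - \varepsilon/2} - \underbrace{F_X(x_\varepsilon)}_{< \varepsilon/2}. \]
This implies that $P(x_\varepsilon < X_n \le y_\varepsilon) > 1 - \varepsilon$ for $n \ge n_0$ if $n_0$ is chosen large enough.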

We write $X_n = o_P(1)$ for $\mathbb{R}^k$-valued random variables $(X_n)_n$ if $X_n \xrightarrow{P} 0_k$. This notation is typically used together with the following calculation rules.

Theorem 2.21.
(i) $X_n = X + o_p(1) \;\Longrightarrow\; X_n = O_p(1)$ ($X_n$, $X$ scalar or vector or matrix).
(ii) For $X_n = o_p(1)$, $Y_n = o_p(1)$, $U_n = O_p(1)$, $W_n = O_p(1)$ it holds
(a) $X_n + Y_n = o_p(1)$,
(b) $U_n + W_n = O_p(1)$,
(c) $U_n W_n = O_p(1)$,
(d) $X_n U_n = o_p(1)$.
(iii) Let $g : \mathbb{R}^k \to \mathbb{R}^l$ be continuous at $x_0$. Then
\[ X_n = x_0 + o_p(1) \;\Longrightarrow\; g(X_n) = g(x_0) + o_p(1). \]

Proof. (i) Apply Theorems 2.13 and 2.20: We have that $X_n - X = o_P(1)$ and therefore $X_n \xrightarrow{P} X$, which in turn implies $X_n \xrightarrow{d} X$ and hence $X_n = O_P(1)$.
(ii) (a) See exercises.
(b) Can be shown analogously to (a).
(c) We only state the real-valued case. The general case can be treated similarly invoking sub-multiplicativity of certain matrix norms and equivalence of matrix norms.
\[ P(|U_n W_n| \ge K_\varepsilon) \le P(|U_n| K_W \ge K_\varepsilon) + P(|W_n| \ge K_W) \]
The assertion follows from choosing $K_W$ such that $P(|W_n| \ge K_W) \le \varepsilon/2$ and then choosing $K_\varepsilon$ such that $P(|U_n| \ge K_\varepsilon / K_W) \le \varepsilon/2$.
(d) Can be verified analogously to (c).
(iii) This is a corollary of the CMT in the case where $g$ is continuous everywhere: Similar to (i), we get that $X_n \xrightarrow{d} x_0$ and therefore also $g(X_n) \xrightarrow{d} g(x_0)$. Finally, apply Theorem 2.14 to deduce the assertion.

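Notation: Suppose that $(c_n)_n$ is a sequence of real numbers. If $c_n^{-1} X_n = O_p(1)$, one writes also
\[ X_n = O_p(c_n). \]
If $c_n^{-1} X_n = o_p(1)$, one writes also
\[ X_n = o_p(c_n). \]
For example,
\[ \sqrt{n}(\hat{\theta}_n - \theta) = O_p(1) \;\Longleftrightarrow\; \hat{\theta}_n = \theta + O_p\left(\frac{1}{\sqrt{n}}\right). \]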

2.3 Strong law of large numbers and central limit theorem

The following theorems provide very powerful tools for asymptotic statistics. Their proofs are lengthy, however. Therefore, we skip them and only give the corresponding references.

Theorem 2.22 (Strong law of large numbers (SLLN)). For a sequence of i.i.d. random variables $X_1, X_2, \dots$ on some probability space $(\Omega, \mathcal{A}, P)$ with finite mean $\mu = E[X_j]$ it holds that $\bar{X}_n \to \mu$ almost surely.

Proof. See Theorem 20.1 in Jacod and Protter [10].

Theorem 2.23 (Central limit theorem for i.i.d. sequences). Let $X_1, X_2, X_3, \dots$ be i.i.d. real-valued random variables with $EX_i = \mu$, $Var(X_i) = \sigma^2 \in (0, \infty)$. Then
\[ \frac{X_1 + \dots + X_n - n\mu}{\sigma \sqrt{n}} \xrightarrow{d} Z \sim N(0, 1). \]

Proof. See Theorem 21.1 in Jacod and Protter [10].
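The statement of Theorem 2.23 can be illustrated by comparing the empirical distribution of standardized sums with the standard normal CDF. This is a rough sketch under assumed settings (Uniform[0, 1] summands, n = 30, 5000 replications); the function names are hypothetical.

```python
import math
import random

def standardized_sum(n: int, rng: random.Random) -> float:
    """(X_1 + ... + X_n - n*mu) / (sigma * sqrt(n)) for Uniform[0, 1] draws."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    total = sum(rng.random() for _ in range(n))
    return (total - n * mu) / (sigma * math.sqrt(n))

def empirical_cdf_at(x: float, n: int, reps: int, seed: int = 0) -> float:
    """Fraction of simulated standardized sums that are <= x."""
    rng = random.Random(seed)
    return sum(standardized_sum(n, rng) <= x for _ in range(reps)) / reps

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The two printed numbers should be close to each other.
print(empirical_cdf_at(1.0, n=30, reps=5000), normal_cdf(1.0))
```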

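Theorem 2.24 (Lyapounov CLT). Let $X_1, X_2, \dots$ be independent real-valued random variables with $\mu_t = EX_t$, $\sigma_t^2 = Var(X_t)$ and $m_{3,t} = E|X_t - \mu_t|^3 < \infty$. Assume
\[ \frac{\left[\sum_{t=1}^n m_{3,t}\right]^{1/3}}{\left[\sum_{t=1}^n \sigma_t^2\right]^{1/2}} \xrightarrow[n \to \infty]{} 0. \]
Then
\[ \frac{X_1 + \dots + X_n - \mu_1 - \dots - \mu_n}{\sqrt{\sigma_1^2 + \dots + \sigma_n^2}} \xrightarrow{d} Z \sim N(0, 1). \]

Proof. See Theorem 27.3 in Billingsley [4].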

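Excursus to multivariate normal distribution: A vector $X$ with density
\[ \varphi(x) = \frac{1}{\sqrt{(2\pi)^k \det \Sigma}} \exp\left\{-0.5 (x - \mu)' \Sigma^{-1} (x - \mu)\right\}, \quad x \in \mathbb{R}^k, \]
has a multivariate normal distribution with mean $\mu \in \mathbb{R}^k$ and covariance matrix $\Sigma \in \mathbb{R}^{k \times k}$, which is assumed to be positive definite. One can show that $a'X \sim N(a'\mu, a'\Sigma a)$ for any $a \in \mathbb{R}^k \setminus \{0_k\}$.

Theorem 2.25 (Multivariate CLT). Suppose that $X_1, X_2, \dots$ are i.i.d. $\mathbb{R}^k$-valued random variables with mean vector $\mu$ and finite, positive definite covariance matrix $\Sigma$. Then
\[ \frac{1}{\sqrt{n}} (X_1 + \dots + X_n - n\mu) \xrightarrow{d} \widetilde{Z} \sim N(0_k, \Sigma). \]

Proof. This proof is an application of the Cramér-Wold device and a one-dimensional CLT.
Choose $a \in \mathbb{R}^k \setminus \{0_k\}$: $a'X_1, \dots, a'X_n$ are one-dimensional i.i.d. random variables with mean $a'\mu$ and variance $a'\Sigma a$. Thus
\[
\begin{aligned}
& \frac{a'X_1 + \dots + a'X_n - n a'\mu}{\sqrt{n} (a'\Sigma a)^{1/2}} \xrightarrow{d} Z \sim N(0, 1) \\
\Longrightarrow \quad & a' \frac{1}{\sqrt{n}} (X_1 + \dots + X_n - n\mu) \xrightarrow{d} a'\widetilde{Z} \sim N(0, a'\Sigma a) \\
\Longrightarrow \quad & \frac{1}{\sqrt{n}} (X_1 + \dots + X_n - n\mu) \xrightarrow{d} \widetilde{Z} \sim N(0_k, \Sigma).
\end{aligned}
\]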

3 Conditional expectations, probabilities and variances

3.1 Conditional expectations and conditional probabilities

Starting point: $(X', Y)'$, where $Y$ is an $\mathbb{R}$-valued random variable with $E[Y^2] < \infty$ and $X$ is an $\mathbb{R}^k$-valued random variable.

Regression problem: How much of the random fluctuations of $Y$ can be explained by $X$?

Find a (measurable) function $g : \mathbb{R}^k \to \mathbb{R}$ that minimizes
\[ E[\{Y - g(X)\}^2]. \qquad (\star) \]

Definition 3.1. Each (measurable) function $g$ that minimizes $(\star)$ is called conditional expectation of $Y$ given $X$.
Notation:
\[ E[Y|X] = g(X), \quad E[Y|X = x] = g(x). \]

Remark:
One can show that $E[Y|X]$ exists, i.e. there is a measurable function $g$ that minimizes $(\star)$; see Chapters 22 and 23 in Jacod and Protter [10].

Note that $E[Y|X] = g(X)$ is a random variable, while $E[Y|X = x]$ is a real number.

Conditional expectations can also be defined if only $E|Y| < \infty$ or $Y \ge 0$, but this definition is less intuitive. Therefore, we stick to this version and discuss the general variant very briefly later on.

$(\star)$ generalizes the definition of $E[Y]$: $\mu = E[Y]$ minimizes $E[(Y - \mu)^2]$.

Recall the relation between expectation and probability, $E 1_A = P(A)$. Now, we define conditional distributions via conditional expectations.

Definition 3.2. Suppose that $X$ and $Z$ are random variables with values in $(\mathbb{R}^k, \mathcal{B}^k)$ and $(\mathbb{R}^l, \mathcal{B}^l)$, respectively. Then a conditional distribution of $Z$ given $X$ is defined as
\[ P^{Z|X}(B) = P(Z \in B \mid X) = E(1_{Z \in B} \mid X), \quad B \in \mathcal{B}^l. \]
Moreover, $P^{Z|X=x}(B) = E(1_{Z \in B} \mid X = x)$ is called conditional distribution of $Z$ given $X = x$.
Remark: Note that again, $P^{Z|X}$ is random. One can show that the corresponding minimizers can be chosen such that $P^{Z|X}$ and $P^{Z|X=x}$ are probability measures (a.s.) (these are called regular conditional distributions). This also holds for random variables with other state spaces (under certain assumptions). In this case $E(Y \mid X = x) = \int y \, dP^{Y|X=x}$; see Theorem 34.5 in Billingsley [4].

3.1.1 Special case: $X$, $Y$ discrete

We minimize
\[ E(Y - g(X))^2 = \sum_{y,x} [y - g(x)]^2 P(Y = y, X = x) \]
if we minimize, for every $x$ with $P(X = x) > 0$ separately, the following expression
\[ \sum_y [y - g(x)]^2 P(Y = y, X = x) \quad \text{or equivalently} \quad \sum_y [y - g(x)]^2 \frac{P(Y = y, X = x)}{P(X = x)}. \]
Note that the latter quotient gives a probability measure on $\mathcal{Y}$, the discrete state space of $Y$. Since we know that the mean of a square-integrable random variable $Z$ minimizes $E(Z - a)^2$, we obtain
\[ E(Y \mid X = x) = g(x) = \sum_y y \, \frac{P(Y = y, X = x)}{P(X = x)}. \]

More generally speaking, we re-obtain the definition of conditional distribution for discrete random variables as
\[ P^{Y|X=x}(A) = \frac{P^{X,Y}(\{x\} \times A)}{P^X(\{x\})} \quad \text{if } P^X(\{x\}) > 0. \qquad (+) \]
For other values of $x$ define $P^{Y|X=x}(A)$ as you like (the set of such $x$ is a $P^X$ null set).

Example 3.3. Suppose that $X$ and $Y$ describe two independent dice experiments and define $Z = X + Y$. Then, for $x = 1, \dots, 6$,
\[ E(Z \mid X = x) = \sum_{z=2}^{12} 6 z P(Z = z, X = x) = \sum_{z=2}^{12} 6 z P(Y = z - x, X = x) = \sum_{z=x+1}^{x+6} \frac{z}{6} = x + 3.5 \]
and hence $E(Z \mid X) = X + 3.5$.
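The computation in Example 3.3 can be reproduced by exact enumeration with rational arithmetic. The function name below is a hypothetical choice for this sketch.

```python
from fractions import Fraction

def cond_exp_sum_given_first(x: int) -> Fraction:
    """E(Z | X = x) for Z = X + Y with independent fair dice X and Y."""
    p_x = Fraction(1, 6)                      # P(X = x)
    total = Fraction(0)
    for z in range(2, 13):
        y = z - x
        # P(Z = z, X = x) = P(Y = z - x) P(X = x) by independence
        p_joint = Fraction(1, 36) if 1 <= y <= 6 else Fraction(0)
        total += z * p_joint / p_x
    return total

for x in range(1, 7):
    print(x, cond_exp_sum_given_first(x))     # each value equals x + 7/2
```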

Example 3.4. Suppose that $X_1, \dots, X_n$ are i.i.d. random variables with $X_1 \sim Bin(1, \theta)$, where $\theta \in (0, 1)$ is an unknown parameter. Then
\[ P(X_1 = x_1, \dots, X_n = x_n) = \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i}. \]
We consider the statistic $T(X) = \sum_{i=1}^n X_i$ with $X := (X_1, \dots, X_n)'$, i.e. $T \sim Bin(n, \theta)$. Then
\[ P(X = (x_1, \dots, x_n)' \mid T(X) = k) = \begin{cases} 1 / \binom{n}{k}, & \text{if } T(x) = k \\ 0 & \text{else} \end{cases} \]
is independent of $\theta$, i.e. $T$ already contains the whole information of the observations $X_1, \dots, X_n$ regarding $\theta$. A statistic with this property is called sufficient. These kinds of statistics can be used to construct so-called UMVU estimators for the unknown parameter (uniformly minimum variance unbiased).
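That the conditional probability in Example 3.4 does not depend on $\theta$ can be checked by brute-force enumeration for a small $n$. The sketch below uses $n = 4$ and two arbitrary parameter values; these choices and the function name are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def cond_prob(x: tuple[int, ...], k: int, theta: Fraction) -> Fraction:
    """P(X = x | T(X) = k) for i.i.d. Bin(1, theta) coordinates, by enumeration."""
    n = len(x)

    def joint(v):  # P(X = v) = theta^sum(v) * (1 - theta)^(n - sum(v))
        return theta ** sum(v) * (1 - theta) ** (n - sum(v))

    p_t = sum(joint(v) for v in product((0, 1), repeat=n) if sum(v) == k)
    return joint(x) / p_t if sum(x) == k else Fraction(0)

x = (1, 0, 1, 0)
for theta in (Fraction(1, 4), Fraction(2, 3)):
    # the conditional probability agrees with 1 / binom(4, 2) for both parameter values
    print(theta, cond_prob(x, 2, theta), Fraction(1, comb(4, 2)))
```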

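3.1.2 Special case: Continuous distributions

Suppose that $X$, $Y$ have densities $f_X$, $f_Y$, respectively, and a joint density $f_{X,Y}$. Suppose for simplicity that $f_X(x) > 0$ for all $x$. Then
\[
\begin{aligned}
& E[\{Y - g(X)\}^2] = \iint \{y - g(x)\}^2 f_{X,Y}(x, y) \, dx \, dy = \min! \\
\Longleftarrow \quad & \int \{y - g(x)\}^2 f_{X,Y}(x, y) \, dy = \min! \quad \text{for every } x \\
\Longleftrightarrow \quad & g(x) = \frac{\int y f_{X,Y}(x, y) \, dy}{\int f_{X,Y}(x, y) \, dy} = \frac{\int y f_{X,Y}(x, y) \, dy}{f_X(x)} \quad \text{a.s.} \\
\Longrightarrow \quad & E[Y|X] = \frac{\int y f_{X,Y}(X, y) \, dy}{f_X(X)} \text{ a.s.} \quad \text{and} \quad E[Y|X = x] = \frac{\int y f_{X,Y}(x, y) \, dy}{f_X(x)} \text{ a.s.}
\end{aligned}
\]

More generally, for arbitrary $f_X$ and for $x$ fixed, $E[Y|X = x]$ is the mean of the distribution with density
\[ f_{Y|x}(y) = \begin{cases} \dfrac{f_{X,Y}(x, y)}{f_X(x)} & \text{if } f_X(x) > 0 \\ \text{any density} & \text{else,} \end{cases} \]
see e.g. Example 33.5 in Billingsley [4]. This distribution is the conditional distribution of $Y$ given $X = x$. The (random) distribution with density $f_{Y|X}$ is the conditional distribution of $Y$ given $X$.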

3.1.3 Special case: X discrete, Y continuous


Define $P^{Y|X=x}(A)$ as in (+).

3.1.4 Important properties of conditional expectations


Intuitive interpretation: $E[Y|X = x]$ is a mean in a stochastic model where $X$ is nonrandom and equal to $x \in \mathbb{R}^k$.

Theorem 3.5. For an $\mathbb{R}^k$-valued random variable $X$ and a real-valued random variable $Y$ assume that $EY^2 < \infty$. Then the following are equivalent (TFAE):
(i) $g(X) = E[Y|X]$ a.s.
(ii) $E[\{Y - g(X)\} h(X)] = 0$ for all measurable functions $h$ with $E[h^2(X)] < \infty$.
(iii) $E[\{Y - g(X)\} h(X)] = 0$ for all measurable functions $h : \mathbb{R}^k \to \{0, 1\}$.

Remark:
(iii) can be rewritten as $E[Y 1(X \in B)] = E[g(X) 1(X \in B)]$ for all (Borel-) sets $B \subseteq \mathbb{R}^k$.
Property (iii) is often used as definition of a conditional expectation. Note that it does not require that $E[Y^2] < \infty$. It suffices to assume that $E|Y| < \infty$ or $Y \ge 0$. Conditional expectations are (more generally) typically defined under these conditions.

There exists an even more general notion of conditional expectations. Suppose that $Y$ is a random variable defined on a probability space $(\Omega, \mathcal{A}, P)$ with $E|Y| < \infty$. Suppose that $\mathcal{A}_0 \subseteq \mathcal{A}$ is a sub-$\sigma$-field of $\mathcal{A}$. Then the random variable $Y_0 = E[Y|\mathcal{A}_0]$ is defined as an $\mathcal{A}_0$-measurable random variable that fulfills:
\[ E[Y_0 1_B] = E[Y 1_B] \quad \text{for all sets } B \in \mathcal{A}_0. \]
Under the additional assumption of $E[Y^2] < \infty$ this is equivalent to:

$Y^* = Y_0$ minimizes $E(Y - Y^*)^2$ among all $\mathcal{A}_0$-measurable random variables $Y^*$;

see Satz 15.8 in Bauer [3]. This notion of conditional expectations generalizes conditional expectations of the form $E[Y|X]$ because $E[Y|X] = E[Y|\mathcal{A}_0]$ if $\mathcal{A}_0$ is equal to the $\sigma$-field generated by $X$, i.e. $\mathcal{A}_0 = \{X^{-1}(C) : C \text{ measurable}\}$. Examples of such conditional expectations arise in time series, where $\mathcal{A}_0$ denotes the $\sigma$-field of events of the past.

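Proof of Theorem 3.5. (i) $\Longrightarrow$ (ii):
\[
\begin{aligned}
& G = g \text{ minimizes } E[\{Y - G(X)\}^2] \\
\Longrightarrow \quad & \frac{\partial}{\partial a} E[\{Y - g(X) - a h(X)\}^2] \Big|_{a=0} = 0 \text{ for all measurable functions } h \text{ with } E[h^2(X)] < \infty \\
\Longrightarrow \quad & \frac{\partial}{\partial a} \Big( E[\{Y - g(X)\}^2] - 2a E[\{Y - g(X)\} h(X)] + a^2 E h^2(X) \Big) \Big|_{a=0} = 0 \\
& \text{for all measurable functions } h \text{ with } E[h^2(X)] < \infty \\
\Longrightarrow \quad & E[\{Y - g(X)\} h(X)] = 0 \text{ for all measurable functions } h \text{ with } E h^2(X) < \infty.
\end{aligned}
\]

(i) $\Longleftarrow$ (ii):
Assume that
\[ E[\{Y - g(X)\} h(X)] = 0 \text{ for all measurable functions } h \text{ with } E h^2(X) < \infty \]
but that $g$ does not minimize $E[\{Y - g(X)\}^2]$. Then there exists $g^*$ with
\[ E[\{Y - g^*(X)\}^2] < E[\{Y - g(X)\}^2]. \]
This implies, using the assumption with $h = g^* - g$ in the third step,
\[
\begin{aligned}
0 &> E[\{Y - g^*(X)\}^2 - \{Y - g(X)\}^2] \\
&= E[-2(g^*(X) - g(X)) Y + g^*(X)^2 - g(X)^2] \\
&= -2E[(g^*(X) - g(X)) g(X)] + E g^*(X)^2 - E g(X)^2 \\
&= E[(g^*(X) - g(X))^2].
\end{aligned}
\]
As the last term is greater than or equal to 0, this leads to a contradiction.

(ii) $\Longrightarrow$ (iii) is obvious.
(ii) $\Longleftarrow$ (iii):
This proof is omitted but can be carried out by approximating functions $h$ with $E[h^2(X)] < \infty$ by functions $h$ that only take a finite number of values; see also Lemma 15.4 in Bauer [3].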

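Theorem 3.6 (Uniqueness of conditional expectation). For two minimizers $g_1$, $g_2$ of $(\star)$ it holds
\[ g_1(X) = g_2(X) \text{ a.s.} \]

Proof. Suppose that there exists an $\varepsilon > 0$ such that $P(g_1(X) - g_2(X) \ge \varepsilon) > 0$. From Theorem 3.5(iii) we get with $h(x) = 1_{g_1(x) - g_2(x) \ge \varepsilon}$
\[ 0 = E([g_1(X) - g_2(X)] h(X)) \ge \varepsilon P(g_1(X) - g_2(X) \ge \varepsilon) > 0, \]
which yields a contradiction. The case $P(g_2(X) - g_1(X) \ge \varepsilon) > 0$ is excluded in the same way by interchanging the roles of $g_1$ and $g_2$.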

Theorem 3.7 (Iterated expectations). Suppose that $Y$ is a square-integrable, real-valued random variable, $X$ is an $\mathbb{R}^k$-valued random variable and $Z$ is an $\mathbb{R}^l$-valued random variable on a probability space $(\Omega, \mathcal{A}, P)$. Then
(i) $E[E[Y|X]] = EY$,
(ii) $E[E[Y|X, Z]|Z] = E[Y|Z]$ a.s.,
(iii) $E[E[Y|X]|X, Z] = E[Y|X]$ a.s.,
(iv) $E[Y f(X)|X] = f(X) E[Y|X]$ a.s., where $f$ is an $\mathbb{R}$-valued function such that $E f^2(X) + E[Y f(X)]^2 < \infty$,
(v) $E[E[Y|X]|f(X)] = E[Y|f(X)]$ a.s.,
(vi) $E[Y|X, Z, X^2, XZ] = E[Y|X, Z]$ a.s., where $k = l = 1$.

Remark: (i)-(iii) are also called Law of Iterated Expectations (LIE) or tower property.

Proof. (i) follows from Theorem 3.5(ii) with $h \equiv 1$.

(ii) We apply Theorem 3.5(ii) again and put
\[ g(X, Z) = E[Y|X, Z], \qquad f(Z) = E[g(X, Z)|Z]. \]
We want to show
\[ f(Z) = E[Y|Z] \text{ a.s.} \]
This is equivalent to
\[ E[f(Z) h(Z)] = E[Y h(Z)] \quad \forall h. \]
The latter follows from
\[ E[f(Z) h(Z)] = E[g(X, Z) h(Z)] = E[Y h(Z)] \quad \forall h. \]

(iii) Exercise.

(iv) Suppose that $g(X)$ is a version of $E[Y|X]$, then $E(h(X)[Y - g(X)]) = 0$ for any square-integrable function $h$. In particular $E(h(X) f(X)[Y - g(X)]) = 0$ for any function $h$ that takes values 0 or 1 only. The application of Theorem 3.5(iii) finally yields the assertion.

(vi) More generally, one can show that
\[ E[Y|W, g(W)] = E[Y|W] \text{ a.s.} \qquad (10) \]
This follows directly by the application of the definitions of conditional expectations. Now put $W = (X, Z)'$ and $g(W) = (X^2, XZ)'$.

(v) Due to (10) we obtain
\[ E(E[Y \mid X] \mid f(X)) = E(E[Y \mid X, f(X)] \mid f(X)) = E(Y \mid f(X)) \text{ a.s.} \]
applying part (ii).

Example 3.8 (Application of (vi)).
\[
\begin{aligned}
E[Wage \mid education, experience] &= \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 educ \cdot exper + \beta_4 educ^2 \\
&= E[Wage \mid educ, exper, educ^2, educ \cdot exper] \text{ a.s.}
\end{aligned}
\]
Thus, it is redundant to also condition on $educ^2$ and $educ \cdot exper$.

Conditional expectations have properties similar to those of unconditional expectations:

Theorem 3.9 (Properties of conditional expectation). Suppose that $Y_1$, $Y_2$ are square-integrable real-valued random variables, $X$ is an $\mathbb{R}^k$-valued random variable on a probability space $(\Omega, \mathcal{A}, P)$ and $a_1$, $a_2$ are scalars. Then
(i) $E[a_1 Y_1 + a_2 Y_2|X] = a_1 E[Y_1|X] + a_2 E[Y_2|X]$ a.s.,
(ii) $Y_1 \le Y_2 \;\Longrightarrow\; E[Y_1|X] \le E[Y_2|X]$ a.s.,
(iii) $(E[XY|Z])^2 \le E[X^2|Z] E[Y^2|Z]$ a.s. for real-valued, square-integrable $X$ and another random variable $Z$ on the same space (Cauchy-Schwarz inequality),
(iv) $\varphi : \mathbb{R} \to \mathbb{R}$ convex, $E[\varphi(Y)]^2 < +\infty \;\Longrightarrow\; \varphi(E[Y|X]) \le E[\varphi(Y)|X]$ a.s. (Jensen inequality),
(v) $0 \le Y_n \uparrow Y \;\Longrightarrow\; E[Y_n|X] \uparrow E[Y|X]$ a.s. (monotone convergence),
(vi) $P(|Y| \ge \varepsilon \mid X) \le \varepsilon^{-2} E[Y^2|X]$ a.s. for any $\varepsilon > 0$.

Proof. (i) Apply Theorem 3.5(iii) and linearity of the ordinary expectation.

(ii) Apply (i) and Exercise 29(ii).

(iii) See Theorem 23.10 in Jacod and Protter [10].

(iv) See Theorem 23.9 in Jacod and Protter [10].

(v) See Theorem 23.8 in Jacod and Protter [10].

(vi) Proceed as in the case of ordinary expectation.

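The next theorem describes the relation between independence and conditional distributions:

Theorem 3.10. Suppose that $Y$ is a square-integrable real-valued random variable and $X$ is an $\mathbb{R}^k$-valued random variable on a probability space $(\Omega, \mathcal{A}, P)$. Then
\[
\begin{aligned}
X, Y \text{ independent} \;&\Longleftrightarrow\; P^{Y|X} = P^Y \text{ a.s.} \\
&\Longrightarrow\; E[Y|X] = E[Y] \text{ a.s.}
\end{aligned}
\]

Proof. Obviously the second part follows from the first one in conjunction with the remark below Definition 3.2. Moreover,
\[
\begin{aligned}
X, Y \text{ independent} \;&\Longleftrightarrow\; E[1_{Y \in B} 1_{X \in A}] = E[1_{Y \in B}] E[1_{X \in A}] \text{ for any Borel sets } A, B \\
&\Longleftrightarrow\; E[1_{Y \in B} 1_{X \in A}] = E[E[1_{Y \in B}] 1_{X \in A}] \text{ for any Borel sets } A, B \\
&\Longleftrightarrow\; E[1_{Y \in B}|X] = P(Y \in B) \text{ a.s. } \forall B \\
&\Longleftrightarrow\; P^{Y|X} = P^Y \text{ a.s.}
\end{aligned}
\]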
3.2 Conditional Variances

Definition 3.11. For a real-valued random variable $Y$ (with $EY^4 < \infty$) and an $\mathbb{R}^k$-valued random variable $X$ on a probability space $(\Omega, \mathcal{A}, P)$ a conditional variance of $Y$ given $X$ is defined as
\[ Var[Y|X] = E[(Y - E[Y|X])^2 \mid X]. \]

Theorem 3.12. Under the conditions of Definition 3.11
(i) $Var[a(X) Y + b(X)|X] = a^2(X) Var[Y|X]$ a.s., where $a$ and $b$ are measurable functions such that $a(X) Y + b(X)$ satisfies the assumptions of Definition 3.11,
(ii) $Var(Y) = E[Var(Y|X)] + Var(E[Y|X])$,
(iii) $E[Var(Y|X)] \ge E[Var(Y|X, Z)]$ with an additional $\mathbb{R}^l$-valued random variable $Z$ on the same space.
Proof. (i) By Theorem 3.7(iv)
\[ Var[a(X) Y + b(X)|X] = E(a^2(X)(Y - E(Y \mid X))^2 \mid X) = a^2(X) Var[Y|X] \text{ a.s.} \]

(ii)
\[ E[Var(Y|X)] + Var(E[Y|X]) = E\big(E(Y^2|X) - [E(Y \mid X)]^2\big) + E\big([E(Y \mid X)]^2\big) - (EY)^2 = \mathrm{var}(Y). \]

(iii) Exercise.
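The variance decomposition in Theorem 3.12(ii) can be checked exactly on a small example. The sketch below reuses the setting of two independent fair dice with $Z = X + Y$; the helper names are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product

def mean(values):
    """Expectation of a list of equally likely values."""
    return sum(values, Fraction(0)) / len(values)

def var(values):
    """Variance of a list of equally likely values."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])

outcomes = list(product(range(1, 7), repeat=2))   # (x, y), each with probability 1/36
z_all = [Fraction(x + y) for x, y in outcomes]

# Conditional mean and variance of Z given X = x; X is uniform on {1, ..., 6}.
cond_mean = {x: mean([Fraction(x + y) for y in range(1, 7)]) for x in range(1, 7)}
cond_var = {x: var([Fraction(x + y) for y in range(1, 7)]) for x in range(1, 7)}

expected_cond_var = mean(list(cond_var.values()))   # E[Var(Z | X)]
var_of_cond_mean = var(list(cond_mean.values()))    # Var(E[Z | X])
print(var(z_all), expected_cond_var, var_of_cond_mean)
```

The first printed value equals the sum of the other two, as the theorem states.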

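4 Linear regression

4.1 The classical model

Definition 4.1. A (multiple) linear regression model based on $n$ observations $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, \dots, n$, with (unknown) regression coefficients $\beta_0, \dots, \beta_K$ is given by
\[ Y_i = \beta_0 + \beta_1 X_{i,1} + \dots + \beta_K X_{i,K} + \varepsilon_i, \quad i = 1, \dots, n, \]
or in matrix notation $Y = X\beta + \varepsilon$ with
\[
Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_K \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & X_{1,1} & \dots & X_{1,K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \dots & X_{n,K} \end{pmatrix} \quad \text{and} \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.
\]
Here, $X$ is called design matrix and assumed to satisfy $P(\mathrm{rank}(X) = K + 1) = 1$ (no multicollinearity).
The unobserved error $\varepsilon$ is a random vector with
(i) strict exogeneity: $E(\varepsilon_i \mid X) = 0$ a.s. for $i = 1, \dots, n$,
(ii) spherical error variance: $E(\varepsilon \varepsilon' \mid X) = \sigma^2 I_n$ a.s. for some $\sigma^2 > 0$ and $I_n$ denoting the $n$-dimensional identity matrix.

Remark:
$Y$: dependent variable, regressand

$X_i = (1, X_{i,1}, \dots, X_{i,K})'$: $i$-th observation of $K$ regressors, covariates

log linear model:
\[ \log(Y_i) = \beta_0 + \beta_1 \log(X_{i,1}) + \dots + \beta_K \log(X_{i,K}) + \varepsilon_i \]
is equivalent to
\[ Y_i = \exp(\beta_0) X_{i,1}^{\beta_1} \cdots X_{i,K}^{\beta_K} \exp(\varepsilon_i) \]

strict exogeneity implies $E\varepsilon_i = 0$ (tower property of conditional expectation), which is not restrictive (existence of $\beta_0$)

homoskedasticity: $E(\varepsilon_i^2 \mid X) = \sigma^2$ a.s., and hence $\mathrm{var}(\varepsilon_i \mid X) = \sigma^2$ a.s.

no serial correlation: $E(\varepsilon_i \varepsilon_j \mid X) = 0$ a.s. for all $i \ne j$

note that the conditional expectation of a matrix is understood componentwise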

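4.2 Parameter estimation - the OLS approach

4.2.1 Estimation of $\beta$
We aim to establish an estimator for the unknown parameter $\beta$. The ordinary least squares estimator (OLS estimator) is the minimizer of $\sum_{i=1}^n (Y_i - X_i'\beta)^2$. In matrix notation we get the following definition.

Definition 4.2. In the classical linear regression model an OLS estimator is given by
\[ \hat{\beta}_{OLS} = \arg\min_{\beta \in \mathbb{R}^{K+1}} (Y - X\beta)'(Y - X\beta). \]

Theorem 4.3. In the classical linear regression model the OLS estimator is unique a.s. and
\[ \hat{\beta}_{OLS} = (X'X)^{-1} X'Y \text{ a.s.} \]

Proof. We only work on a set $\Omega_0 \subseteq \Omega$ with $P(\Omega_0) = 1$ and such that for each $\omega \in \Omega_0$ the matrix $X(\omega)$ has full rank. Set $Q(\beta) = (Y - X\beta)'(Y - X\beta)$. Then
\[ \frac{\partial Q(\beta)}{\partial \beta} = -2X'Y + 2X'X\beta \]
and, therefore, $\frac{\partial Q(\beta)}{\partial \beta} = 0_{K+1}$ if and only if $\beta = (X'X)^{-1} X'Y$. We found the only candidate for a minimizer. It remains to show that $\hat{\beta}_{OLS}$ indeed minimizes (not maximizes) the function $Q$. Therefore, suppose that $\tilde{\beta} \ne \hat{\beta}_{OLS}$ and obtain (with probability 1)
\[
\begin{aligned}
Q(\tilde{\beta}) &= Q(\hat{\beta}_{OLS}) + (Y - X\hat{\beta}_{OLS})' X (\hat{\beta}_{OLS} - \tilde{\beta}) + (\hat{\beta}_{OLS} - \tilde{\beta})' X' (Y - X\hat{\beta}_{OLS}) \\
&\quad + (\hat{\beta}_{OLS} - \tilde{\beta})' X'X (\hat{\beta}_{OLS} - \tilde{\beta}) \\
&> Q(\hat{\beta}_{OLS}),
\end{aligned}
\]
since the two middle terms vanish because $X'(Y - X\hat{\beta}_{OLS}) = 0_{K+1}$ and $X'X$ is positive definite.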

Remark:
Note that if a matrix $A$ has full column rank, then $A'A$ is positive definite. See Exercise.

The $(K + 1)$ equations given by $X'Y = X'X \hat{\beta}_{OLS}$ are called normal equations. If we further denote by $e = Y - X\hat{\beta}_{OLS}$ the OLS residuals, the normal equations can be interpreted as the sample analogue to the orthogonality condition $E X_i \varepsilon_i = 0_{K+1}$:
\[ \frac{1}{n} \sum_{i=1}^n X_i e_i = 0_{K+1}. \]

Moreover note that the OLS estimator can be interpreted as follows:
\[ \hat{\beta}_{OLS} = S_{XX}^{-1} s_{XY} \text{ a.s.} \]
with
\[ S_{XX} = \frac{1}{n} \sum_{i=1}^n X_i X_i' \;\; (\text{sample mean of } X_i X_i') \quad \text{and} \quad s_{XY} = \frac{1}{n} \sum_{i=1}^n X_i Y_i \;\; (\text{sample mean of } X_i Y_i). \]
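For a single regressor plus constant ($K = 1$), the normal equations form a $2 \times 2$ linear system that can be solved in closed form. The sketch below does this in plain Python and checks the sample orthogonality conditions; the data values and the function name are hypothetical and serve only as an illustration.

```python
def ols_simple(x: list[float], y: list[float]) -> tuple[float, float]:
    """Solve the normal equations X'X b = X'Y for rows X_i = (1, x_i)'."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    det = n * sxx - sx * sx          # det(X'X); nonzero iff X has full column rank
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0]        # hypothetical data, for illustration only
y = [1.1, 2.9, 5.2, 7.1, 8.8]
b0, b1 = ols_simple(x, y)
residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
# Sample orthogonality conditions: both sums vanish up to rounding error.
print(b0, b1, sum(residuals), sum(xi * ei for xi, ei in zip(x, residuals)))
```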

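The following result summarizes the finite-sample properties of the OLS estimator.

Theorem 4.4 (Gauss-Markov Theorem). In the classical linear regression model the OLS estimator is BLUE (best linear unbiased estimator), that is, for any (conditionally) unbiased estimator $\tilde{\beta}$ of $\beta$ that is linear in $Y$,
\[ COV(\tilde{\beta} \mid X) \ge COV(\hat{\beta}_{OLS} \mid X) \text{ a.s.} \]

Remark: $COV(\tilde{\beta} \mid X) \ge COV(\hat{\beta}_{OLS} \mid X)$ means $x'(COV(\tilde{\beta} \mid X) - COV(\hat{\beta}_{OLS} \mid X)) x \ge 0$ for all $x \in \mathbb{R}^{K+1}$. Taking $x$ as the $k$-th unit vector, this implies in particular that $\mathrm{var}(\tilde{\beta}_k \mid X) \ge \mathrm{var}(\hat{\beta}_{k,OLS} \mid X)$.

Proof. Linearity and unbiasedness of the OLS estimator follow immediately from Theorem 4.3 and it remains to prove optimality. To this end, suppose that $\tilde{\beta} = AY$ is another linear, conditionally unbiased estimator of $\beta$, i.e.
\[ E(\tilde{\beta} \mid X) = E(AY \mid X) = E(AX\beta \mid X) = \beta \text{ a.s.} \;\Longleftrightarrow\; AX = I_{K+1} \text{ a.s.} \]
Then the conditional covariance matrix of $\tilde{\beta}$ given $X$ is given by
\[ COV(\tilde{\beta} \mid X) = A \, COV(Y \mid X) A' = A \, COV(\varepsilon \mid X) A' = \sigma^2 AA' \text{ a.s.} \]
and similarly $COV(\hat{\beta}_{OLS} \mid X) = \sigma^2 (X'X)^{-1}$; see also Theorem 3.12(i).

This implies
\[
\begin{aligned}
COV(\tilde{\beta} \mid X) - COV(\hat{\beta}_{OLS} \mid X) &= \sigma^2 (AA' - (X'X)^{-1}) \\
&= \sigma^2 [(A - (X'X)^{-1} X')(A - (X'X)^{-1} X')'],
\end{aligned}
\]
which is positive semidefinite. The second equality uses $AX = I_{K+1}$.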
How much of the variation of the dependent variable can be explained by the variability of the regressors? We could use $\|Y - \bar{Y}_n 1_n\|_2^2 - \|Y - X\hat{\beta}_{OLS}\|_2^2$. However, this would lead to a scale-dependent measure. Instead let us consider
\[ R^2 = 1 - \frac{\|Y - X\hat{\beta}_{OLS}\|_2^2}{\|Y - \bar{Y}_n 1_n\|_2^2}, \]
which is referred to as coefficient of determination in the literature. This is indeed the fraction of variability of $Y$ that can be explained by $X$ since
\[ \underbrace{\sum_{i=1}^n (Y_i - \bar{Y}_n)^2}_{\text{total variability of } Y} = \underbrace{\sum_{i=1}^n (\hat{Y}_i - \bar{Y}_n)^2}_{\text{variability of regression}} + \underbrace{\sum_{i=1}^n e_i^2}_{\text{variability of residuals}} + S_n. \]
Here, $\hat{Y}_i = X_i' \hat{\beta}_{OLS}$, $e_i = Y_i - \hat{Y}_i$ denotes the $i$-th residual, and we use $X'e = 0_{K+1}$ (normal equations) to see that
\[ S_n = 2 \sum_{i=1}^n (\hat{Y}_i - \bar{Y}_n) e_i = 2 \left( \hat{\beta}_{OLS}' X'e - \bar{Y}_n \sum_{i=1}^n e_i \right) = 0 + 0 = 0. \]
Note that $R^2 \in [0, 1]$ if the regressors include a constant.
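The decomposition behind $R^2$ can be verified numerically for a regression on a constant and one regressor. This is a minimal sketch with hypothetical data; the function name and its return format are assumptions for illustration.

```python
def r_squared(x: list[float], y: list[float]) -> tuple[float, float, float, float]:
    """Return (R^2, total, explained, residual) sums of squares for y on (1, x)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # OLS slope and intercept for a single regressor plus constant.
    b1 = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y)) / sum((u - x_bar) ** 2 for u in x)
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * u for u in x]
    total = sum((v - y_bar) ** 2 for v in y)
    explained = sum((f - y_bar) ** 2 for f in fitted)
    residual = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - residual / total, total, explained, residual

r2, total, explained, residual = r_squared([0.0, 1.0, 2.0, 3.0, 4.0], [1.1, 2.9, 5.2, 7.1, 8.8])
# total and explained + residual agree up to rounding because the cross term S_n vanishes
print(r2, total, explained + residual)
```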

For the rest of this paragraph, we assume the observations $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, \dots, n$, to be i.i.d. and study the asymptotics of the OLS estimator for $\beta$. To this end, we have to deal with asymptotics for matrices. We can apply our Definition 2.4, just using a matrix norm, e.g. the 2-norm of a $p \times q$ matrix,
\[ \|A\| = \sqrt{\sum_{i=1}^p \sum_{j=1}^q |a_{ij}|^2}. \]

Theorem 4.5 (Consistency of the OLS estimator). In the classical linear regression model with i.i.d. $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, 2, \dots$, we assume that $EX_1 X_1'$ is finite and invertible. Then
\[ \hat{\beta}_{OLS,n} \xrightarrow{P} \beta. \]

Proof. (Sketch). We use the representation
\[ \hat{\beta}_{OLS} = S_{XX}^{-1} s_{XY} \text{ a.s.} \]
and note that $S_{XX} \xrightarrow{P} EX_1 X_1'$ by the WLLN. Using that the inversion of a matrix is a continuous transformation and the CMT for convergence in probability (see Theorem 1.9.5 in van der Vaart and Wellner [13]), we get
\[ S_{XX}^{-1} \xrightarrow{P} (EX_1 X_1')^{-1}. \]
Again by the WLLN, $s_{XY}$ converges in probability to $EX_1 Y_1 = (EX_1 X_1')\beta + EX_1 \varepsilon_1 = (EX_1 X_1')\beta$. Therefore,
\[
\begin{aligned}
\hat{\beta}_{OLS} &= (S_{XX}^{-1} - (EX_1 X_1')^{-1}) s_{XY} + (EX_1 X_1')^{-1} s_{XY} \\
&= o_P(1) O_P(1) + (EX_1 X_1')^{-1} (EX_1 X_1') \beta + o_P(1) \\
&= \beta + o_P(1).
\end{aligned}
\]

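In a next step we establish asymptotic normality of the OLS estimator.

Theorem 4.6 (Asymptotic normality of $\hat{\beta}_{OLS}$). In the classical linear regression model with i.i.d. $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, 2, \dots$, we assume that $EX_1 X_1'$ is finite and invertible. Then
\[ \sqrt{n} (\hat{\beta}_{OLS} - \beta) \xrightarrow{d} Z \sim N(0_{K+1}, \sigma^2 (E(X_1 X_1'))^{-1}). \]

Proof. First note that $\hat{\beta}_{OLS} - \beta = (X'X)^{-1} X'\varepsilon$ almost surely. Moreover, the multivariate CLT gives
\[ \frac{1}{\sqrt{n}} X'\varepsilon = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \varepsilon_i \xrightarrow{d} \widetilde{Z} \sim N(0_{K+1}, \sigma^2 E(X_1 X_1')). \]
To sum up, we have
\[ \sqrt{n} (\hat{\beta}_{OLS} - \beta) = (n^{-1} X'X)^{-1} \frac{1}{\sqrt{n}} X'\varepsilon = o_P(1) \, O_P(1) + E(X_1 X_1')^{-1} \frac{1}{\sqrt{n}} X'\varepsilon = o_P(1) + Z_n, \]
where $Z_n = E(X_1 X_1')^{-1} \frac{1}{\sqrt{n}} X'\varepsilon \xrightarrow{d} Z$ by the CMT. Finally, Slutsky's Lemma yields the result.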

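Remark: (Statistical implication)

Suppose that $\sigma^2$ is known. Then the latter result can be used for the construction of confidence intervals for $\beta_k$, $k = 0, \dots, K$. Note that
\[ \sqrt{n} (\hat{\beta}_{k,OLS} - \beta_k) \Big/ \sqrt{\sigma^2 \left[ \left( \tfrac{1}{n} X'X \right)^{-1} \right]_{k,k}} \xrightarrow{d} N(0, 1). \]
Define
\[ I_n^{(k)} = \left[ \hat{\beta}_{k,OLS} - z_{1-\alpha/2} \, \sigma \sqrt{\left[ (X'X)^{-1} \right]_{k,k}}, \;\; \hat{\beta}_{k,OLS} + z_{1-\alpha/2} \, \sigma \sqrt{\left[ (X'X)^{-1} \right]_{k,k}} \right], \]
where $z_q$ denotes the $q$-quantile of the standard normal distribution. Hence, $P(\beta_k \in I_n^{(k)}) \xrightarrow[n \to \infty]{} 1 - \alpha$, i.e. $I_n^{(k)}$ is an asymptotic $(1 - \alpha)$-confidence interval for $\beta_k$. It is desirable to provide confidence intervals in the more realistic setting that $\sigma^2$ is unknown.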
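For one regressor plus constant, the interval $I_n^{(k)}$ for the slope can be computed with the standard library alone. The following is a sketch under stated assumptions: the data, the value of $\sigma$ and the function name are hypothetical, and the closed-form entry of $(X'X)^{-1}$ is specific to the $2 \times 2$ case.

```python
import math
from statistics import NormalDist

def slope_confidence_interval(x, y, sigma, alpha=0.05):
    """Asymptotic (1 - alpha) interval for the slope in y = b0 + b1*x + error, sigma known."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    det = n * sxx - sx * sx
    b1 = (n * sxy - sx * sy) / det
    inv_11 = n / det                      # entry of (X'X)^{-1} belonging to the slope
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sigma * math.sqrt(inv_11)
    return b1 - half_width, b1 + half_width

# Hypothetical data and error standard deviation, for illustration only.
lo, hi = slope_confidence_interval([0.0, 1.0, 2.0, 3.0, 4.0], [1.1, 2.9, 5.2, 7.1, 8.8], sigma=0.2)
print(lo, hi)
```

Since the coverage statement is asymptotic, an interval computed from five observations as above only illustrates the formula, not its accuracy.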

4.2.2 Estimation of $\sigma^2$
The aim of this paragraph is to establish an estimator for the variance of the error terms based on the OLS estimator for the regression coefficients.

Definition 4.7. If $n > K + 1$, the OLS estimate of the variance $\sigma^2 > 0$ is given by
\[ \hat{\sigma}^2_{OLS} = \frac{e'e}{n - K - 1} \]
and $\sqrt{\hat{\sigma}^2_{OLS}}$ is called standard error of regression (SER).

Theorem 4.8. In the classical linear regression model $\hat{\sigma}^2_{OLS}$ is a (conditionally) unbiased estimator for $\sigma^2$ provided that $n > K + 1$.
Proof. We have to show that $E(e'e \mid X) = (n - K - 1) \sigma^2$ almost surely. To this end, first note
\[
\begin{aligned}
E(e'e \mid X) &= E((X(\beta - \hat{\beta}_{OLS}) + \varepsilon)'(X(\beta - \hat{\beta}_{OLS}) + \varepsilon) \mid X) \\
&= E(\varepsilon'(I_n - X(X'X)^{-1} X')'(I_n - X(X'X)^{-1} X') \varepsilon \mid X) \\
&= E(\varepsilon'(I_n - X(X'X)^{-1} X') \varepsilon \mid X).
\end{aligned}
\]
Due to the spherical error variance assumption the latter term reduces to
\[ E(e'e \mid X) = \sigma^2 \sum_{i=1}^n (I_n - X(X'X)^{-1} X')_{i,i} = \sigma^2 \left( n - \sum_{i=1}^n (X(X'X)^{-1} X')_{i,i} \right), \]
and it remains to show that the sum on the r.h.s. is equal to $K + 1$ almost surely. This in turn follows from
\[ \sum_{i=1}^{K+1} (X'X(X'X)^{-1})_{i,i} = \sum_{i=1}^{K+1} (I_{K+1})_{i,i} = K + 1 \]
if we can show that $\sum_{i=1}^n (X(X'X)^{-1} X')_{i,i} = \sum_{i=1}^{K+1} (X'X(X'X)^{-1})_{i,i}$. To see this, let us consider a $p \times q$ matrix $A$ and a $q \times p$ matrix $B$. The trace of a square matrix $C$ is defined as $\mathrm{tr}(C) = \sum_i C_{i,i}$. Then it holds
\[ \mathrm{tr}(AB) = \sum_{i=1}^p \sum_{j=1}^q A_{i,j} B_{j,i} = \sum_{j=1}^q \sum_{i=1}^p B_{j,i} A_{i,j} = \mathrm{tr}(BA). \]
Finally, put $A = X(X'X)^{-1}$ and $B = X'$.
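The trace identity at the end of the proof can be checked numerically for $K = 1$, where $(X'X)^{-1}$ has a closed form. The sketch below computes the diagonal of $X(X'X)^{-1}X'$ for a design with rows $(1, x_i)$; the regressor values and the function name are hypothetical.

```python
def hat_matrix_diagonal(x: list[float]) -> list[float]:
    """Diagonal of X (X'X)^{-1} X' for the design with rows X_i = (1, x_i)'."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    # (X'X)^{-1} = (1/det) * [[sxx, -sx], [-sx, n]], so the i-th diagonal entry is
    # (1, x_i) (X'X)^{-1} (1, x_i)' = (sxx - 2*sx*x_i + n*x_i^2) / det.
    return [(sxx - 2.0 * sx * v + n * v * v) / det for v in x]

diag = hat_matrix_diagonal([0.0, 1.0, 2.0, 3.0, 4.0])  # hypothetical regressor values
print(sum(diag))   # trace of the projection matrix, K + 1 = 2 up to rounding
```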

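Theorem 4.9. In the classical linear regression model with i.i.d. $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, 2, \dots$, we assume that $EX_1 X_1'$ is finite and invertible. Then $\hat{\sigma}^2_{OLS} \xrightarrow{P} \sigma^2$ as $n \to \infty$.

Proof. It suffices to show that $\frac{1}{n} \sum_{i=1}^n e_i^2 \xrightarrow{P} \sigma^2$. By the WLLN, consistency of $\hat{\beta}_{OLS}$ and since $S_{XX} = O_P(1)$ we have
\[
\begin{aligned}
\frac{1}{n} \sum_{i=1}^n e_i^2 &= \frac{1}{n} \sum_{i=1}^n (\varepsilon_i - X_i'(\hat{\beta}_{OLS} - \beta))^2 \\
&= \frac{1}{n} \sum_{i=1}^n \varepsilon_i^2 - 2(\hat{\beta}_{OLS} - \beta)' \frac{1}{n} \sum_{i=1}^n X_i \varepsilon_i + (\hat{\beta}_{OLS} - \beta)' \frac{1}{n} \sum_{i=1}^n X_i X_i' (\hat{\beta}_{OLS} - \beta) \\
&= \sigma^2 + o_P(1).
\end{aligned}
\]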

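Remark: (Statistical implication)

The latter result can be used for the construction of confidence intervals for $\beta_k$, $k = 0, \dots, K$, if $\sigma^2$ is unknown. It is left to the reader to show that
\[ \hat{I}_n^{(k)} = \left[ \hat{\beta}_{k,OLS} - z_{1-\alpha/2} \, \hat{\sigma}_{OLS} \sqrt{\left[ (X'X)^{-1} \right]_{k,k}}, \;\; \hat{\beta}_{k,OLS} + z_{1-\alpha/2} \, \hat{\sigma}_{OLS} \sqrt{\left[ (X'X)^{-1} \right]_{k,k}} \right] \]
is an asymptotic $(1 - \alpha)$-confidence interval for $\beta_k$.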

4.3 Hypothesis tests in the classical linear regression model

4.3.1 Introduction to statistical testing


Example 4.10. A company delivers packages of pasta to the canteen of the University of Mannheim
and claims that the weight of a randomly chosen package is N (5, 0.5)-distributed. Based on a sample of
size n we intend to decide whether

(a) the expected weight is at least 5.

(b) the assumption of normality is justied.

Question: How can we decide these problems properly ?


Remark:
tests to decide (a) are called parameter tests

tests to decide (b) are called goodness-of-fit tests

Starting point: statistical experiment

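Definition 4.11. A statistical experiment is a tuple $(X, \Omega, \mathcal{A}, \{P_\theta \mid \theta \in \Theta\})$, where $X$ denotes a random variable on the measurable space $(\Omega, \mathcal{A})$ and $\{P_\theta \mid \theta \in \Theta\}$ is a family of probability measures on this space. We call $\Theta$ the parameter space.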

Typically, $X$ models the vector of observations and $\theta$ is an unknown parameter lying in the parameter space $\Theta$. Based on our data $X$ we aim to decide a test problem of the following form

\[
H_0 : \theta \in \Theta_0 \quad \text{vs.} \quad H_1 : \theta \in \Theta_1 = \Theta \setminus \Theta_0. \qquad (11)
\]

Here, H0 is called null hypothesis and H1 is referred to as alternative or alternative hypothesis.


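Definition 4.12. Let $(X, \Omega, \mathcal{A}, \{P_\theta \mid \theta \in \Theta\})$ be a statistical experiment and consider the test problem (11). Then a function $\varphi \colon \mathcal{X} \to \{0, 1\}$, where $\mathcal{X}$ denotes the range of $X$, is a (non-randomized) test if
\[
\varphi(x) = \begin{cases} 0 & \text{if } X = x \text{ implies acceptance of } H_0, \\ 1 & \text{if } X = x \text{ implies rejection of } H_0. \end{cases}
\]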

In doing so, two different kinds of errors may occur.

Definition 4.13. Suppose that $(X, \Omega, \mathcal{A}, \{P_\theta \mid \theta \in \Theta\})$ is a statistical experiment and $\varphi$ is a test to decide the problem (11).
(i) A type I error (error of first kind) occurs when $H_0$ is true but rejected.
(ii) A type II error (error of second kind) occurs when $H_1$ is true but $H_0$ is not rejected.

Example 4.14 (Example 4.10 cont'd). Typically, both errors occur, see sketch in the lecture.

Hence, the idea is to minimize the probability of a type II error under the condition that the probability of a type I error does not exceed a prescribed level $\alpha$.
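Definition 4.15. In the set-up of Definition 4.13
(i) a test $\varphi$ is an $\alpha$-test if $E_\theta \varphi(X) = P_\theta(\varphi(X) = 1) \le \alpha$ for all $\theta \in \Theta_0$,
(ii) a sequence of tests $(\varphi_n)_n$ is consistent if $P_\theta(\varphi_n(X) = 1) \to 1$ as $n \to \infty$ for all $\theta \in \Theta_1$.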
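Both properties can be illustrated with the pasta example. The sketch below is not a test from the notes: it assumes that 0.5 is the known variance and uses a one-sided z-test that rejects $H_0$ (expected weight at least 5) when the standardized sample mean falls below the $\alpha$ quantile of $N(0,1)$. For this test the rejection probability is available in closed form; the function name `rejection_probability` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rejection_probability(mu, n, mu0=5.0, sigma2=0.5, alpha=0.05):
    """P_mu(reject H0: mu >= mu0) for the one-sided z-test with known variance.

    The test rejects when sqrt(n) * (sample mean - mu0) / sigma < z_alpha.
    Under mean mu the standardized sample mean is N(shift, 1) distributed.
    """
    z_alpha = STD_NORMAL.inv_cdf(alpha)
    shift = sqrt(n) * (mu - mu0) / sqrt(sigma2)
    return STD_NORMAL.cdf(z_alpha - shift)


# Type I error probability: at most alpha everywhere on the null hypothesis.
print(rejection_probability(5.0, n=20))   # boundary of H0
print(rejection_probability(5.2, n=20))   # interior of H0

# Power against mu = 4.8 increases towards 1 as n grows (consistency).
print(rejection_probability(4.8, n=20))
print(rejection_probability(4.8, n=400))
```

On the boundary $\mu = 5$ the rejection probability equals $\alpha$, inside the null hypothesis it is smaller, and for a fixed alternative it tends to one as $n$ grows.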

4.3.2 Wald tests to test linear restrictions on the regression coefficients


We want to test the following null hypothesis

\[
H_0 : R\beta = \gamma
\]

for some prescribed $(r \times (K+1))$ matrix $R$ of rank $r$ and a prescribed $r$-dimensional vector $\gamma$.
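Example 4.16. This general hypothesis covers several interesting special cases.

1. $H_0 : \beta_k = 0$
with $R = (0, \dots, 0, \underbrace{1}_{(k+1)\text{th}}, 0, \dots, 0)$ and $\gamma = 0$

2. $H_0 : \beta_0 = \beta_1$
with $R = (1, -1, 0, \dots, 0)$ and $\gamma = 0$
3. $H_0 : \beta_0 + \beta_1 + \beta_2 = 1$
with $R = (1, 1, 1, 0, \dots, 0)$ and $\gamma = 1$
4. $H_0 : \beta_0 = \beta_1 = \beta_2 = 0$
with
\[
R = \begin{pmatrix} 1 & 0 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & 0 & \dots & 0 \\ 0 & 0 & 1 & 0 & \dots & 0 \end{pmatrix} \quad \text{and} \quad \gamma = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.
\]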
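The four cases can be written down explicitly for a small model. The following sketch assumes $K = 3$ (so that $\beta$ has four entries), takes $k = 2$ in the first case, and calls the right-hand side of the restriction `gamma`; all names and the coefficient vectors are hypothetical.

```python
K = 3  # hypothetical number of regressors, so beta has K + 1 = 4 entries


def unit_row(position, length):
    """Row vector with a 1 at the given (0-based) position and 0 elsewhere."""
    return [1 if j == position else 0 for j in range(length)]


def apply_restriction(R, beta):
    """Compute the matrix-vector product R beta for R given as a list of rows."""
    return [sum(r_ij * b_j for r_ij, b_j in zip(row, beta)) for row in R]


# Each entry maps a hypothesis to its pair (R, gamma).
restrictions = {
    "beta_2 = 0": ([unit_row(2, K + 1)], [0]),
    "beta_0 = beta_1": ([[1, -1, 0, 0]], [0]),
    "beta_0 + beta_1 + beta_2 = 1": ([[1, 1, 1, 0]], [1]),
    "beta_0 = beta_1 = beta_2 = 0": (
        [unit_row(0, K + 1), unit_row(1, K + 1), unit_row(2, K + 1)],
        [0, 0, 0],
    ),
}


def satisfies(name, beta):
    """Check whether a coefficient vector fulfils the named restriction."""
    R, gamma = restrictions[name]
    return apply_restriction(R, beta) == gamma


print(satisfies("beta_0 = beta_1", [2, 2, 5, 7]))
print(satisfies("beta_2 = 0", [1, 1, 1, 1]))
```

In every case the number of rows of `R` equals the number $r$ of restrictions, and the rows are linearly independent, as required by the rank condition.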

Under the conditions of Theorem 4.6, we know that $\sqrt{n}(\widehat\beta_{OLS} - \beta)$ is asymptotically normal. This property carries over to $\sqrt{n}(R\widehat\beta_{OLS} - R\beta)$ by the CMT and the fact that linear transformations of normal random variables are again normally distributed. This relation is now invoked to construct a so-called Wald test. For its definition we require knowledge of a certain distribution.

Definition 4.17. The $\chi^2$-distribution with $l$ degrees of freedom, $l \in \mathbb{N}$, is characterized by a Riemann density of the form
\[
f_{\chi^2_l}(x) = \frac{1}{2^{l/2}\, \Gamma(l/2)}\, x^{l/2 - 1} \exp(-x/2)\, 1_{[0,\infty)}(x),
\]
where $\Gamma$ denotes the so-called Gamma function, i.e. $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\, dx$, $a > 0$.

It can be shown that the sum of the squares of $l$ independent standard normal variables is $\chi^2_l$ distributed; see Jacod and Protter [10, Example 6, Chapter 15].
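As a plausibility check of the density formula, the following sketch evaluates it with `math.gamma` and integrates numerically with a simple midpoint rule. The choice $l = 3$, the integration range and the helper names are assumptions made for illustration; the expected value $l$ follows from the representation as a sum of $l$ squared standard normal variables.

```python
from math import exp, gamma


def chi2_density(x, l):
    """Density of the chi-square distribution with l degrees of freedom."""
    if x < 0:
        return 0.0
    return x ** (l / 2 - 1) * exp(-x / 2) / (2 ** (l / 2) * gamma(l / 2))


def midpoint_integral(f, lower, upper, steps):
    """Approximate the integral of f over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))


# The mass beyond 80 is negligible for l = 3, so [0, 80] is used as range.
total_mass = midpoint_integral(lambda x: chi2_density(x, 3), 0.0, 80.0, 80_000)
mean = midpoint_integral(lambda x: x * chi2_density(x, 3), 0.0, 80.0, 80_000)
print(total_mass, mean)
```

The total mass is close to one and the mean is close to $l = 3$. The midpoint rule never evaluates the density at $x = 0$, which avoids the singularity that occurs there for $l = 1$.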

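Definition 4.18. A Wald test of level $\alpha \in (0, 1)$ for $H_0 : R\beta = \gamma$ based on $n$ observations $Z_i = (Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, 2, \dots, n$, is given by
\[
\varphi_n((Z_1, \dots, Z_n)) = \begin{cases} 0 & \text{if } n\, \widehat\sigma^{-2}_{OLS} (R\widehat\beta_{OLS} - \gamma)' (R(S_{XX})^{-1}R')^{-1} (R\widehat\beta_{OLS} - \gamma) \in [\chi^2_{r,\alpha/2},\ \chi^2_{r,1-\alpha/2}], \\ 1 & \text{else,} \end{cases}
\]
where $\chi^2_{r,q}$ denotes the $q$ quantile of a $\chi^2_r$ distribution.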
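The definition can be traced through in the simplest case of a single restriction on the slope of a regression with one regressor, i.e. $K = 1$, $r = 1$ and $R = (0, 1)$. The sketch below makes several assumptions that go beyond the notes: $S_{XX}$ is taken to be $X'X/n$, the variance estimate is $e'e/(n-K-1)$, and the quantiles of $\chi^2_1$ are obtained from standard normal quantiles, which works because the square of a standard normal variable is $\chi^2_1$ distributed. The function names and the data are hypothetical.

```python
from statistics import NormalDist


def chi2_1_quantile(q):
    """q quantile of chi-square(1): P(Z^2 <= c) = 2 * Phi(sqrt(c)) - 1."""
    return NormalDist().inv_cdf((1 + q) / 2) ** 2


def wald_test_slope(x, y, gamma, alpha=0.05):
    """Wald statistic and decision (1 = reject) for H0: beta_1 = gamma."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx  # determinant of X'X

    slope = (n * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / n
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    sigma2_hat = sum(e * e for e in residuals) / (n - 2)

    # With R = (0, 1), R (X'X)^{-1} R' is the lower-right entry n / det, and
    # n * (R b - gamma)^2 / (sigma2_hat * R S_XX^{-1} R') simplifies to:
    statistic = (slope - gamma) ** 2 / (sigma2_hat * n / det)

    lower = chi2_1_quantile(alpha / 2)
    upper = chi2_1_quantile(1 - alpha / 2)
    decision = 0 if lower <= statistic <= upper else 1
    return statistic, decision


x = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up regressor values
y = [1.1, 2.9, 5.2, 7.1, 8.8]  # made-up responses
print(wald_test_slope(x, y, gamma=0.0))
print(wald_test_slope(x, y, gamma=2.0))
```

For these data the OLS slope is 1.96, so the hypothesis of a zero slope is rejected while a slope of 2 is not. Because the acceptance region has a positive lower bound, a statistic extremely close to zero would also lead to rejection under this definition.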

Theorem 4.19. Suppose that $n > K + 1$ and that the conditions of Theorem 4.6 hold. Then a sequence of Wald tests $(\varphi_n)_n$ is an asymptotic $\alpha$ test and consistent.

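Proof. Step 1: asymptotic $\alpha$ test.

We have to show that, under $H_0$,
\[
n\, \widehat\sigma^{-2}_{OLS} (R\widehat\beta_{OLS} - \gamma)' (R(S_{XX})^{-1}R')^{-1} (R\widehat\beta_{OLS} - \gamma) \xrightarrow{d} \chi^2_r.
\]
First, note that
\[
\sqrt{n}\, \widehat\sigma^{-1}_{OLS} (R\widehat\beta_{OLS} - \gamma) \xrightarrow{d} N\big(0_r,\ R(EX_1X_1')^{-1}R'\big).
\]
This gives
\begin{align*}
T_n &= n\, \widehat\sigma^{-2}_{OLS} (R\widehat\beta_{OLS} - \gamma)' (R(S_{XX})^{-1}R')^{-1} (R\widehat\beta_{OLS} - \gamma) \\
&= n\, \widehat\sigma^{-2}_{OLS} (R\widehat\beta_{OLS} - \gamma)' (R(EX_1X_1')^{-1}R')^{-1} (R\widehat\beta_{OLS} - \gamma) + o_P(1)
\end{align*}
and by Slutsky's lemma it remains to show that
\[
n\, \widehat\sigma^{-2}_{OLS} (R\widehat\beta_{OLS} - \gamma)' (R(EX_1X_1')^{-1}R')^{-1} (R\widehat\beta_{OLS} - \gamma) \xrightarrow{d} \chi^2_r.
\]
Since $(EX_1X_1')^{-1}$ is symmetric and positive definite and since $R$ has full rank, it can be shown similarly to Exercise 35(i) that $R(EX_1X_1')^{-1}R'$ is symmetric and positive definite and, hence, the same properties hold for its inverse. By Seber [12, 10.8] and references therein we can decompose
\[
R(EX_1X_1')^{-1}R' = [R(EX_1X_1')^{-1}R']^{1/2}\, [R(EX_1X_1')^{-1}R']^{1/2}
\]
and
\[
[R(EX_1X_1')^{-1}R']^{-1} = \big([R(EX_1X_1')^{-1}R']^{-1}\big)^{1/2} \big([R(EX_1X_1')^{-1}R']^{-1}\big)^{1/2}
\]
such that
\[
I_r = \big([R(EX_1X_1')^{-1}R']^{-1}\big)^{1/2}\, [R(EX_1X_1')^{-1}R']^{1/2}.
\]
Hence
\[
\sqrt{n}\, \widehat\sigma^{-1}_{OLS} \big[(R(EX_1X_1')^{-1}R')^{-1}\big]^{1/2} (R\widehat\beta_{OLS} - \gamma) \xrightarrow{d} e \sim N(0_r, I_r)
\]
and the CMT gives
\[
n\, \widehat\sigma^{-2}_{OLS} (R\widehat\beta_{OLS} - \gamma)' (R(EX_1X_1')^{-1}R')^{-1} (R\widehat\beta_{OLS} - \gamma) \xrightarrow{d} e'e \sim \chi^2_r,
\]
which then gives the first assertion of the theorem.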


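Step 2: consistency.
We show that for each $\epsilon \in (0, 1)$ and each $\beta$ with $R\beta \neq \gamma$ there exists an $n_0 \in \mathbb{N}$ such that

\[
P(\varphi_n(Z) = 1) \ge 1 - \epsilon \quad \forall\, n \ge n_0.
\]

Applying the triangle inequality for norms we get, for any $K > 0$,

\begin{align*}
P(\varphi_n(Z) = 1) &\ge P(T_n > \chi^2_{r,1-\alpha/2}) \\
&\ge P\Big( \big\| \sqrt{n}\, \widehat\sigma^{-1}_{OLS} [(R(S_{X,X})^{-1}R')^{-1}]^{1/2} (R\beta - \gamma) \big\|_2 \\
&\qquad\quad - \big\| \sqrt{n}\, \widehat\sigma^{-1}_{OLS} [(R(S_{X,X})^{-1}R')^{-1}]^{1/2} (R\widehat\beta_{OLS} - R\beta) \big\|_2 > \sqrt{\chi^2_{r,1-\alpha/2}} \Big) \\
&\ge P\Big( \big\| \sqrt{n}\, \widehat\sigma^{-1}_{OLS} [(R(S_{X,X})^{-1}R')^{-1}]^{1/2} (R\beta - \gamma) \big\|_2 > \sqrt{\chi^2_{r,1-\alpha/2}} + K \Big) \\
&\qquad - P\Big( \big\| \sqrt{n}\, \widehat\sigma^{-1}_{OLS} [(R(S_{X,X})^{-1}R')^{-1}]^{1/2} (R\widehat\beta_{OLS} - R\beta) \big\|_2 > K \Big).
\end{align*}

By step 1 of this proof the second term can be bounded from above by $\epsilon/4$ for $n \ge n_0$ if we choose $K$ sufficiently large. Due to consistency of $\widehat\sigma^2_{OLS}$ and $S_{X,X}$ we have for large $n$

\begin{align*}
P(\varphi_n(Z) = 1) &\ge P\Big( \big\| \sqrt{n}\, [(R(S_{X,X})^{-1}R')^{-1}]^{1/2} (R\beta - \gamma) \big\|_2 > 2\sigma \big( \sqrt{\chi^2_{r,1-\alpha/2}} + K \big) \Big) - \frac{\epsilon}{2} \\
&\ge P\Big( \sqrt{n}\, \big\| [(R(EX_1X_1')^{-1}R')^{-1}]^{1/2} (R\beta - \gamma) \big\|_2 > 4\sigma \big( \sqrt{\chi^2_{r,1-\alpha/2}} + K \big) \Big) - \epsilon \\
&= 1 - \epsilon.
\end{align*}

The last equality holds for sufficiently large $n$ because the event inside the probability is deterministic and the norm on its left-hand side is strictly positive for $R\beta \neq \gamma$.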

Note that the theorem above states asymptotic results only. They might be very unreliable in small
samples. Can we do better?

4.3.3 Hypothesis tests in the classical linear regression model under normality
From now on, we assume that $P^{\varepsilon \mid X} = N(0_n, \sigma^2 I_n)$. This implies that the conditional distribution of $\varepsilon$ does not depend on $X$ and therefore $X$ and $\varepsilon$ are independent by Theorem 3.10. Hence, the marginal distribution of the errors is also given by $P^{\varepsilon} = N(0_n, \sigma^2 I_n)$. Moreover, this assumption allows us to construct exact tests.

t test for an individual parameter value

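We consider the test problem $H_0 : \beta_k = \gamma$ for a fixed $k \in \{0, \dots, K\}$. First note that under $H_0$

\[
\frac{\widehat\beta_{OLS,k} - \gamma}{\sqrt{\sigma^2 [(X'X)^{-1}]_{k,k}}} \sim N(0, 1).
\]

However, this quantity cannot be used as a test statistic if $\sigma^2$ is unknown. Therefore we substitute the unknown term by its estimate $\widehat\sigma^2_{OLS}$ and consider the $t$ statistic

\[
T_n = \frac{\widehat\beta_{OLS,k} - \gamma}{\sqrt{\widehat\sigma^2_{OLS} [(X'X)^{-1}]_{k,k}}}.
\]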

Definition 4.20. Student's $t$ distribution with $l$ degrees of freedom is characterized by the Riemann density
\[
f_l(z) = \frac{\Gamma((l+1)/2)}{\sqrt{l\pi}\, \Gamma(l/2)} \Big( 1 + \frac{z^2}{l} \Big)^{-(l+1)/2}.
\]

It can be shown that
\[
\frac{Z_1}{\sqrt{Z_2 / l}} \sim t_l,
\]
where $Z_1 \sim N(0, 1)$ and $Z_2 \sim \chi^2_l$ are independent, see Jacod and Protter [10, Example 12.4].
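The density formula can be sanity-checked numerically. The sketch below uses `math.gamma`, compares the case $l = 1$ with the standard Cauchy density value $1/\pi$ at zero, and integrates the density for $l = 5$ with a midpoint rule; the degrees of freedom, the integration range and the helper names are choices made for illustration only.

```python
from math import gamma, pi, sqrt


def t_density(z, l):
    """Density of Student's t distribution with l degrees of freedom."""
    constant = gamma((l + 1) / 2) / (sqrt(l * pi) * gamma(l / 2))
    return constant * (1 + z * z / l) ** (-(l + 1) / 2)


def midpoint_integral(f, lower, upper, steps):
    """Approximate the integral of f over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))


# For l = 1 the t distribution is the standard Cauchy distribution.
cauchy_at_zero = t_density(0.0, 1)

# For l = 5 the tails beyond +/- 100 carry negligible mass.
total_mass = midpoint_integral(lambda z: t_density(z, 5), -100.0, 100.0, 200_000)
print(cauchy_at_zero, total_mass)
```

Note that `math.gamma` overflows for arguments above roughly 171, so this direct formula is only suitable for moderate degrees of freedom.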

Theorem 4.21. Suppose that $n > K + 1$ and that $P^{\varepsilon \mid X} = N(0_n, \sigma^2 I_n)$. Then $T_n \sim t_{n-K-1}$ under the null hypothesis.
Proof. (Sketch) We use the representation of the $t$ distribution below Definition 4.20.

1. Modified numerator.
\[
\frac{\widehat\beta_{OLS,k} - \gamma}{\sqrt{\sigma^2 [(X'X)^{-1}]_{k,k}}} \sim N(0, 1).
\]

2. Modified denominator. Noting that the OLS residuals are distributed as $\sigma (I_n - X(X'X)^{-1}X')Z$, where $Z \sim N(0_n, I_n)$ is independent of $X$, we have
\[
(n - K - 1)\, \frac{\widehat\sigma^2_{OLS}}{\sigma^2} = \frac{e'e}{\sigma^2} \stackrel{d}{=} Z'(I_n - X(X'X)^{-1}X')Z.
\]
It can be shown that the latter is indeed $\chi^2_{n-K-1}$ distributed. First note that by Davidson and MacKinnon [6, Theorem 4.1(b)]: if $Z \sim N(0_n, I_n)$ and $P$ is an $n$-dimensional projection matrix (i.e. symmetric and $P^2 = P$) of rank $p$, then $Z'PZ \sim \chi^2_p$. Denoting $A(X_1, \dots, X_n) = (I_n - X(X'X)^{-1}X')$ and writing $F_{\chi^2_{n-K-1}}$ for the distribution function of the $\chi^2_{n-K-1}$ distribution, we get by the Tonelli-Fubini theorem (Jacod and Protter [10, Theorem 10.3])
\begin{align*}
P\big(Z'(I_n - X(X'X)^{-1}X')Z \le y\big) &= \int \int 1_{\{z' A(x_1, \dots, x_n) z \le y\}}\, P^{Z}(dz)\, P^{(X_1', \dots, X_n')'}(dx_1, \dots, dx_n) \\
&= \int F_{\chi^2_{n-K-1}}(y)\, P^{(X_1', \dots, X_n')'}(dx_1, \dots, dx_n) \\
&= F_{\chi^2_{n-K-1}}(y)
\end{align*}
since $A(x_1, \dots, x_n)$ is a projection matrix of rank $n - K - 1$ by Seber [12, 4.11] for almost all $(x_1, \dots, x_n)$.

3. Independence. A sketch of the proof can be found in Hayashi [9, Proof of Proposition 1.3].

The two-sided $t$ test for the problem above is then given by

\[
\varphi(Y_1, X_{1,1}, \dots, X_{1,K}, \dots, Y_n, X_{n,1}, \dots, X_{n,K}) = \begin{cases} 1 & \text{if } T_n \notin [t_{n-K-1,\alpha/2},\ t_{n-K-1,1-\alpha/2}], \\ 0 & \text{else.} \end{cases}
\]

It follows immediately from the previous theorem that this test is an $\alpha$ test.

Lemma 4.22. Under the conditions of the previous theorem and if $(Y_i, X_{i,1}, \dots, X_{i,K})'$, $i = 1, \dots, n$, are i.i.d. such that $EX_1X_1'$ is finite and invertible, the $t$ test is consistent.

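Proof. The previous theorem gives

\begin{align*}
T_n &= \frac{\widehat\beta_{OLS,k} - \beta_k}{\sqrt{\widehat\sigma^2_{OLS} [(X'X)^{-1}]_{k,k}}} + \frac{\beta_k - \gamma}{\sqrt{\widehat\sigma^2_{OLS} [(X'X)^{-1}]_{k,k}}} \\
&= O_P(1) + \frac{\beta_k - \gamma}{\sqrt{\widehat\sigma^2_{OLS} [(X'X)^{-1}]_{k,k}}}.
\end{align*}

Hence, for any fixed $\epsilon > 0$ it holds that, with $R_n$ denoting the $O_P(1)$ term in the previous line,
\begin{align*}
P(|T_n| \ge t_{n-K-1,1-\alpha/2}) &\ge P\bigg( \frac{|\beta_k - \gamma|}{\sqrt{\widehat\sigma^2_{OLS} [(X'X)^{-1}]_{k,k}}} \ge t_{n-K-1,1-\alpha/2} + K \bigg) - P(|R_n| > K) \\
&\ge P\bigg( \frac{|\beta_k - \gamma|}{\sqrt{[(X'X)^{-1}]_{k,k}}} \ge 2\sigma (t_{n-K-1,1-\alpha/2} + K) \bigg) - \epsilon \\
&\ge P\Big( \sqrt{n}\, |\beta_k - \gamma| \ge 4\sigma \sqrt{[(EX_1X_1')^{-1}]_{k,k}}\ (t_{n-K-1,1-\alpha/2} + K) \Big) - 2\epsilon
\end{align*}
for sufficiently large $K$ and $n$. We can deduce the assertion of the lemma if $(t_{n-K-1,1-\alpha/2})_n$ is uniformly bounded in $n$. To this end, recall that any $t_n$ distributed random variable can be written as
\[
\frac{Z_0}{\sqrt{\frac{1}{n} \sum_{k=1}^{n} Z_k^2}} \xrightarrow{d} Z \sim N(0, 1)
\]
for i.i.d. standard normal $Z_0, Z_1, \dots$, so that the quantiles $t_{n-K-1,1-\alpha/2}$ converge to the standard normal quantile $z_{1-\alpha/2}$.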

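Remark:
More generally it can be shown that under the conditions above

\[
T_n^{(c)} = \frac{c'(\widehat\beta_{OLS} - \beta)}{\sqrt{c'(X'X)^{-1}c}\ \widehat\sigma_{OLS}} \sim t_{n-(K+1)} \quad \forall\, c \in \mathbb{R}^{K+1} \setminus \{0\}.
\]
Hence, we can establish a $t$ test for Problems 2 and 3 described in Example 4.16.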

Under the normality assumption we can also provide a test for the general hypothesis of paragraph 4.3.2. We can show that the finite sample distribution of the corresponding test statistic divided by $r$ has a so-called $F$ distribution with $r$ and $n - K - 1$ degrees of freedom, which is defined as the distribution of $(Z_1/r)/(Z_2/(n - K - 1))$ with independent $Z_1 \sim \chi^2_r$ and $Z_2 \sim \chi^2_{n-K-1}$. Hence we can again construct an $\alpha$ test.
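The definition of the $F$ distribution as a ratio of independent scaled $\chi^2$ variables translates directly into a simulation. The following sketch draws $\chi^2$ variables as sums of squared standard normals; the degrees of freedom $r = 3$ and $m = 20$ (playing the role of $n - K - 1$), the seed and the function names are arbitrary illustrative choices. The comparison value $m/(m-2)$ is the known mean of an $F$ distribution with denominator degrees of freedom $m > 2$.

```python
import random


def draw_chi2(rng, l):
    """Draw a chi-square(l) variable as a sum of l squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(l))


def draw_f(rng, r, m):
    """Draw (Z1 / r) / (Z2 / m) with independent Z1 ~ chi2(r), Z2 ~ chi2(m)."""
    return (draw_chi2(rng, r) / r) / (draw_chi2(rng, m) / m)


rng = random.Random(2013)  # fixed seed so that the run is reproducible
draws = [draw_f(rng, 3, 20) for _ in range(20_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean, 20 / 18)
```

The simulated mean should lie close to $20/18 \approx 1.11$; in practice one would use tabulated or library quantiles of the $F$ distribution rather than simulation to carry out the test.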
