Econometrics

Lecture Notes Fall Term 2013

© All Rights Reserved

Advanced Econometrics I

University of Mannheim

1 Slightly modified

Literature

[1] Ash, R. B. and Doléans-Dade, C. (1999). Probability & Measure Theory. Academic Press.

[2] Bauer, H. (1990). Measure and Integration Theory. de Gruyter.

[3] Bauer, H. (1991). Wahrscheinlichkeitstheorie. de Gruyter.

[4] Billingsley, P. (1994). Probability and Measure. Wiley.

[5] Breiman, L. (2007). Probability. SIAM.

[6] Davidson, R. and MacKinnon, J. G. (2004). Econometric Theory and Methods. Oxford University Press.

[7] Dehling, H. and Haupt, B. (2004). Einführung in die Wahrscheinlichkeitstheorie und Statistik. Springer.

[8] Georgii, H.-O. (2007). Stochastik: Einführung in die Wahrscheinlichkeitstheorie und Statistik. de Gruyter.

[10] Jacod, J. and Protter, P. (2000). Probability Essentials. Springer.

[11] Pollard, D. (1984). Convergence of Stochastic Processes. Springer.

[12] Seber, G. A. F. (2008). A Matrix Handbook for Statisticians. Wiley.

[13] Van der Vaart, A. W. and Wellner, J. A. (2000). Weak Convergence and Empirical Processes. With Applications to Statistics. Springer.

son/Southwestern.


Contents

1 Elementary probability theory 5
1.1 Probability measures . . . . . . . . . . 5
1.1.1 Countable sample spaces . . . . . . . . . . 5
1.1.2 Arbitrary sample spaces . . . . . . . . . . 6
1.2 Probability measures on R . . . . . . . . . . 7
1.2.1 Some general facts . . . . . . . . . . 7
1.2.2 Discrete and absolutely continuous probability measures . . . . . . . . . . 9
1.2.3 Extensions to R^k . . . . . . . . . . 10
1.3 Random Variables . . . . . . . . . . 11
1.3.1 Definition and Independence . . . . . . . . . . 11
1.3.2 Expectations, moments . . . . . . . . . . 12

2 Asymptotic theory 17
2.1 Convergence of expectations . . . . . . . . . . 17
2.2 Modes of convergence . . . . . . . . . . 18
2.2.1 Definitions . . . . . . . . . . 18
2.2.2 Relation between different modes of convergence . . . . . . . . . . 19
2.2.3 Discussion of convergence in distribution . . . . . . . . . . 22
2.2.4 Discussion of stochastic boundedness . . . . . . . . . . 23
2.3 Strong law of large numbers and central limit theorem . . . . . . . . . . 24

3.1 Conditional expectations and conditional probabilities . . . . . . . . . . 26
3.1.1 Special case: X, Y discrete . . . . . . . . . . 26
3.1.2 Special case: Continuous distributions . . . . . . . . . . 27
3.1.3 Special case: X discrete, Y continuous . . . . . . . . . . 27
3.1.4 Important properties of conditional expectations . . . . . . . . . . 28
3.2 Conditional Variances . . . . . . . . . . 31

4 Linear regression 32
4.1 The classical model . . . . . . . . . . 32
4.2 Parameter estimation - the OLS approach . . . . . . . . . . 32
4.2.1 Estimation of β . . . . . . . . . . 32
4.2.2 Estimation of σ² . . . . . . . . . . 35
4.3 Hypothesis tests in the classical linear regression model . . . . . . . . . . 36
4.3.1 Introduction to statistical testing . . . . . . . . . . 36
4.3.2 Wald tests to test linear restrictions on the regression coefficients . . . . . . . . . . 37
4.3.3 Hypothesis tests in the classical linear regression model under normality . . . . . . . . . . 39

Introduction

Motivation:

- study relationships between variables, e.g. consumption and income: How does raising income affect consumption behaviour?
- ...

Y = β_0 + β_1 X + ε,

e.g. Y consumption, X wage, ε error term;

- the error term ε collects all other effects on consumption besides the wage.

Goals:

(1) provide the basic probabilistic framework and statistical tools for econometric theory,

(2) application of these tools to the classical multiple linear regression model;

- the application of these results to economic problems follows in Advanced Econometrics II/III and follow-up elective courses.
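To fix ideas, the model Y = β_0 + β_1 X + ε can be simulated and the coefficients recovered from data; a minimal sketch anticipating the OLS estimator of Chapter 4 (the coefficient values, the wage range and the seed are illustrative assumptions, not from the text):

```python
import random

random.seed(1)

beta0, beta1 = 2.0, 0.5   # illustrative "true" coefficients
n = 10_000

# X: wage, eps: error term collecting all other effects on consumption
X = [random.uniform(10, 50) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
Y = [beta0 + beta1 * x + e for x, e in zip(X, eps)]

# OLS estimates: slope = cov(X, Y) / var(X), intercept = mean(Y) - slope * mean(X)
mx = sum(X) / n
my = sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
var = sum((x - mx) ** 2 for x in X) / n
b1 = cov / var
b0 = my - b1 * mx
print(b0, b1)  # close to the true values (2.0, 0.5)
```

With 10,000 observations the estimates recover the true coefficients closely, illustrating why the probabilistic tools of Chapters 1-2 (expectations, laws of large numbers) are the foundation for the regression theory of Chapter 4.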


1 Elementary probability theory

1.1 Probability measures

1.1.1 Countable sample spaces

Setup:

- Ω: set of the possible outcomes of an experiment ("sample space"), e.g. Ω = N = {1, 2, ...}. Unless otherwise stated, Ω is non-empty.
- A ⊆ Ω: "event", e.g. A = {2, 4, 6, 8, ...}.
- ω ∈ Ω: "outcome".

We intend to define P(A), the probability of the event A.

Consider first the case that Ω is a countable set, i.e.

Ω = {ω_1, ω_2, ω_3, ...}

(e.g. Ω = N, Ω = Z).

Definition 1.1. A probability measure P on a countable set Ω is a set function that maps subsets of Ω to [0, 1] and has the following properties:

(i) P(∅) = 0 (here ∅ denotes the empty set),

(ii) P(Ω) = 1,

(iii) P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) for A_i ⊆ Ω, i ∈ N, pairwise disjoint (i.e. A_i ∩ A_j = ∅ for i ≠ j).

Remark:

- Interpretation of (i): the probability that nothing happens is 0.
- Interpretation of (iii): P(⋃_{i=1}^k A_i) = Σ_{i=1}^k P(A_i) is reasonable for k < ∞, but questionable for k = ∞. Anyway, one cannot proceed without making this assumption.

Lemma 1.2. For a countable sample space Ω = {ω_i}_{i∈I} (with countable I) a probability measure P is specified by

p_i = P({ω_i}) for i ∈ I.

For every A ⊆ Ω it holds that

P(A) = Σ_{i: ω_i ∈ A} p_i.

An event {ω} (ω ∈ Ω) that only contains one element is also called an elementary event.
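Lemma 1.2 can be traced numerically on a countable sample space; a small sketch with Ω = N and the illustrative choice p_i = 2^(-i) (truncated for computation), where the event of even outcomes has probability Σ_k 4^(-k) = 1/3:

```python
from fractions import Fraction

# Countable sample space Omega = {1, 2, 3, ...}, truncated at N for computation;
# elementary probabilities p_i = 2^-i (an illustrative choice, total mass 1 - 2^-60).
N = 60
p = {i: Fraction(1, 2 ** i) for i in range(1, N + 1)}

def prob(A):
    # P(A) = sum of p_i over outcomes in A (Lemma 1.2)
    return sum(p[i] for i in A if i in p)

total = prob(range(1, N + 1))        # whole space: mass 1 up to truncation
evens = prob(range(2, N + 1, 2))     # even outcomes: sum_k 4^-k = (1 - 4^-30)/3
print(float(total), float(evens))
```

Exact rational arithmetic via `Fraction` keeps the additivity of Definition 1.1(iii) visible without rounding error; only the final conversion to `float` is approximate.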


1.1.2 Arbitrary sample spaces

Typical examples: Ω = R (or R^k or [0, ∞)).

Problem (general measure theory): It is often impossible to define P appropriately for all subsets A ⊆ Ω such that Definition 1.1(i)-(iii) holds; see e.g. Billingsley [4, p. 45f].

Minimal requirements on a class A of subsets of Ω:

(S i) ∅ ∈ A (equivalently, by (S ii): Ω ∈ A),

(S ii) A ∈ A ⟹ A^C := Ω \ A ∈ A,

(S iii) A_1, A_2, ... ∈ A ⟹ ⋃_{i=1}^∞ A_i ∈ A.

Motivation: If we naively carried over the axioms of Definition 1.1(i)-(iii) to the present setting, then we would obtain:

- Regarding (S i): P(∅) = 0.
- Regarding (S ii): If I know the value of P(A), I will also know the value of P(A^C) = 1 − P(A).
- Regarding (S iii): If I know the values of P(A_1), P(A_2), ..., then I will know the value of P(⋃_{i=1}^∞ A_i). This is clear for pairwise disjoint A_1, A_2, .... For A_1, A_2, ... not pairwise disjoint, a slightly more complicated argument is needed (see Theorem 1.7(vi) below).

Definition 1.3. A class A of subsets of Ω with (S i)-(S iii) is called a σ-field (also: σ-algebra).

Examples: the so-called power set P(Ω) := {A | A ⊆ Ω} is the largest σ-algebra on Ω, and, for B ⊆ Ω, the class {∅, Ω, B, B^C} is the smallest σ-algebra on Ω that contains B.

Definition 1.5. Suppose that A is a σ-field on a set Ω. Then the tuple (Ω, A) is called a measurable space. A set function P : A → [0, 1] is a probability measure on (Ω, A) if

(a) P(Ω) = 1 and

(b) (σ-additivity) for A_1, A_2, ... ∈ A pairwise disjoint it holds that

P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

The triple (Ω, A, P) is then called a probability space.

Example 1.6 (Dirac measure). Let (Ω, A) be a measurable space and ω_0 ∈ Ω. The Dirac measure δ_{ω_0} is then defined by

δ_{ω_0}(A) := 1_A(ω_0), A ∈ A.

The Dirac measure is indeed a probability measure.

Theorem 1.7 (Properties of probability measures). Suppose that (Ω, A, P) is a probability space. Then the following hold true:

(i) P(∅) = 0.

(ii) Finite additivity: A_1, ..., A_n ∈ A pairwise disjoint imply P(⋃_{i=1}^n A_i) = Σ_{i=1}^n P(A_i).

(iii) P(A^C) = 1 − P(A).

(iv) Monotonicity: A, B ∈ A, A ⊆ B implies P(A) ≤ P(B).

(v) Subtractivity: A, B ∈ A, A ⊆ B implies P(B \ A) = P(B) − P(A).

(vi) Poincaré-Sylvester: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

(vii) Continuity from below: A_1, A_2, ... ∈ A, A_n ⊆ A_{n+1}, n ∈ N, implies P(A_n) → P(⋃_{k=1}^∞ A_k).

(viii) Continuity from above: A_1, A_2, ... ∈ A, A_n ⊇ A_{n+1}, n ∈ N, implies P(A_n) → P(⋂_{k=1}^∞ A_k).

(ix) σ-subadditivity: P(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ P(A_n).

Proof. Exercise.

1.2 Probability measures on R

1.2.1 Some general facts

In this section we consider the special case Ω = R.

Questions:

1. What is a natural σ-field on R?
2. How can we construct P on a large class A?

Solution to question 1:

Definition 1.8. The smallest σ-field B that contains all open intervals (a, b) (−∞ ≤ a ≤ b ≤ ∞) is called the Borel σ-field. A set A ∈ B is also called a Borel set.

B is also generated by each of the following classes A_j of intervals:

A_1 = {(a, b] : −∞ ≤ a ≤ b < +∞},
A_2 = {[a, b) : −∞ < a < b ≤ +∞},
A_3 = {[a, b] : −∞ < a ≤ b < +∞},
A_4 = {(−∞, b] : −∞ < b < +∞}.

Proof. See exercise for j = 1. For j ≠ 1 the result follows similarly; use e.g. ⋃_{n=1}^∞ [a + 1/n, b − 1/n] = (a, b).

Remark: Note that there are subsets of R which are not contained in B; cf. Theorem 8.6 in Bauer [2]. It is very difficult to understand which sets are contained in B, but this is not really necessary: we define a probability measure P on {(a, b) | −∞ ≤ a ≤ b ≤ ∞}, on A_1, ..., or on A_4; then B is just the class of sets on which P is automatically defined. Why is this true?

Solution to question 2: Start with a small class A* on which P can be defined directly, and extend. Requirements on A*:

(i) Ω ∈ A*,

(ii) A ∈ A* ⟹ A^C ∈ A*,

(iii) A_1, A_2 ∈ A* ⟹ A_1 ∪ A_2 ∈ A*.

A class A* with (i)-(iii) is called a field. Suppose that A* is a field and define A as the smallest σ-field with A* ⊆ A (notation: A = σ(A*)). Now choose P* : A* → [0, 1] with

(a) P*(∅) = 0,

(b) P*(Ω) = 1,

(c) for A_1, A_2, ... ∈ A* pairwise disjoint with ⋃_{i=1}^∞ A_i ∈ A* it holds that

P*(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P*(A_i).

Theorem 1.11 (Carathéodory). For A*, A, P* as above there exists a unique probability measure P on A with

P(A) = P*(A) for A ∈ A*.

This method can be applied to characterize probability measures on R endowed with the σ-algebra B. To this end, we first introduce another function.

Definition 1.12. For a probability measure P on (R, B) the function F : R → [0, 1] given by

F(b) = P((−∞, b]), b ∈ R,

is called the (cumulative) distribution function (CDF) of P.

Theorem 1.13 (Properties of the CDF). Suppose that F is the distribution function of a probability measure P on (R, B). Then

(i) P((a, b]) = F(b) − F(a), b > a,

(ii) F is non-decreasing (i.e. F(b′) ≥ F(b) for b′ ≥ b),

(iii) F is continuous from the right (i.e. F(b_n) → F(b) for b_n ≥ b, b_n → b, or equivalently for b_n ↓ b),

(iv) 1. lim_{x→−∞} F(x) = 0,
     2. lim_{x→+∞} F(x) = 1.

Proof. (i) Since (a, b] = (−∞, b] \ (−∞, a] with (−∞, a] ⊆ (−∞, b], subtractivity (Theorem 1.7(v)) gives P((a, b]) = F(b) − F(a).

(ii) Exercise.

(iii) Suppose that (b_n)_n is an arbitrary monotonically decreasing sequence in R with b_n ↓ b. Then we have to show that

F(b_n) → F(b) as n → ∞.

By continuity of probability measures from above,

F(b_n) = P((−∞, b_n]) → P(⋂_{k=1}^∞ (−∞, b_k]) = P((−∞, b]) = F(b).

(iv) 1. Exercise.

2. Suppose that (b_n)_n is an arbitrary monotonically increasing sequence with b_n ↑ +∞; we have to show

F(b_n) → 1.

Invoking continuity of probability measures from below gives

lim_{n→∞} F(b_n) = lim_{n→∞} P((−∞, b_n]) = P(⋃_{k=1}^∞ (−∞, b_k]) = P((−∞, +∞)) = 1.

Now the idea is to choose F with the properties (ii) to (iv) in the previous theorem and to define P on A_4:

P((−∞, b]) = F(b),

and on A_1:

P((a, b]) = F(b) − F(a), b > a.

Can this function be uniquely extended to a set function on B?

Theorem 1.14. Consider a function F : R → R satisfying (ii) to (iv) of Theorem 1.13. Then F is a distribution function (i.e. there exists a unique probability measure P on (R, B) with F(b) = P((−∞, b]) for all b ∈ R).

Sketch of the construction: We extend this function as follows: P* : A* → [0, 1], where A* consists of the empty set and all finite unions of sets of A_1 and their complements, and for disjoint intervals

P*(⋃_{i=1}^n (a_i, b_i]) = Σ_{i=1}^n P((a_i, b_i]), with the notation (c, ∞] = (c, ∞).   (1)

Carathéodory's extension theorem tells us that there is a unique probability measure P on B satisfying (1) and hence P((−∞, b]) = F(b).

1.2.2 Discrete and absolutely continuous probability measures

Discrete probability measures

Definition 1.15. A probability measure P on the measurable space (R, B) is discrete if there is an at most countable set A = {a_i ∈ R | (· · · < a_{−1} < a_0 <) a_1 < a_2 < · · · } such that P(A) = 1.

If

P({a_i}) = p_i > 0,

then F has jumps at (· · · < a_{−1} < a_0 <) a_1 < a_2 < · · · with jump heights (..., p_{−1}, p_0,) p_1, p_2, ....

Examples:

1. Binomial distribution

P({i}) = (n choose i) θ^i (1 − θ)^{n−i} for i = 0, 1, ..., n.

Parameters: 0 ≤ θ ≤ 1, n ≥ 1.

2. Geometric distribution

P({i}) = (1 − θ)^{i−1} θ for i = 1, 2, ...

Parameter: 0 ≤ θ ≤ 1.

3. Poisson distribution

P({i}) = e^{−λ} λ^i / i!, i = 0, 1, ...

Parameter: λ > 0.


Absolutely continuous probability measures

If there is a (Riemann) integrable function f : R → [0, ∞) such that

F(x) = ∫_{−∞}^x f(t) dt,

then f is called a Riemann density (probability density) and the corresponding probability measure (and the CDF) is called absolutely continuous.

There is a more general definition of absolute continuity; however, it needs deeper mathematics. Therefore we will stick to the one above, which suffices for many applications. Moreover, note that if F is absolutely continuous and continuously differentiable at x, then f(x) = F′(x).

Lemma 1.17. Suppose that f : R → [0, ∞) is a bounded function with at most finitely many points of discontinuity and such that ∫_{−∞}^∞ f(x) dx = 1. Then there exists a unique probability measure P on (R, B) such that

P((a, b]) = ∫_a^b f(x) dx.

Proof. It suffices to show that x ↦ ∫_{−∞}^x f(t) dt defines a function satisfying (ii) to (iv) of Theorem 1.13 and to use Theorem 1.14. This in turn is straightforward.

Examples:

1. Normal distribution N(μ, σ²)

f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)).

Parameters: μ ∈ R, σ > 0.

2. Uniform distribution

f(x) = (1/(b − a)) 1_{[a,b]}(x).

Parameters: −∞ < a < b < ∞.

3. Exponential distribution

f(x) = λ e^{−λx} 1_{[0,∞)}(x).

Parameter: λ > 0.
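Analogously, the three densities integrate to one; a sketch using a simple midpoint rule (the parameter values and integration ranges are illustrative, chosen so that the truncated tails are negligible):

```python
import math

def normal_pdf(x, mu=1.0, sigma=2.0):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def uniform_pdf(x, a=-1.0, b=3.0):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def expo_pdf(x, lam=0.7):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, lo, hi, m=200_000):
    # midpoint rule on m subintervals
    h = (hi - lo) / m
    return sum(f(lo + (k + 0.5) * h) for k in range(m)) * h

I1 = integrate(normal_pdf, -20, 22)
I2 = integrate(uniform_pdf, -2, 4)
I3 = integrate(expo_pdf, 0, 40)
print(I1, I2, I3)  # each approximately 1
```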

1.2.3 Extensions to R^k

The Borel σ-field B^k is the σ-field generated by the open rectangles (a_1, b_1) × · · · × (a_k, b_k). As in the real-valued case, probability measures on (R^k, B^k) are uniquely defined via the multivariate distribution function

F(b_1, ..., b_k) = P({(x_1, ..., x_k) : x_1 ≤ b_1, ..., x_k ≤ b_k}).

F is called absolutely continuous with density f if

F(b_1, ..., b_k) = ∫_{−∞}^{b_1} · · · ∫_{−∞}^{b_k} f(x_1, ..., x_k) dx_k · · · dx_1;

at continuity points of f one then has

∂^k F / (∂x_1 · · · ∂x_k) (x_1, ..., x_k) = f(x_1, ..., x_k).

For a more detailed discussion, we refer the reader to Billingsley [4].

1.3 Random Variables

1.3.1 Definition and Independence

Intuitively: An R^k-valued random variable X is a random element of R^k, e.g. describing age, education and wage of a randomly chosen person in Mannheim.

Definition 1.19. Suppose that (Ω, A, P) is a probability space. A map X : Ω → R^k is an R^k-valued random variable if it is measurable, i.e. if X fulfills

X^{−1}(B) = {ω ∈ Ω : X(ω) ∈ B} ∈ A for all B ∈ B^k.

Definition 1.20. Suppose that X is an R^k-valued random variable on a probability space (Ω, A, P). Then

P^X(B) := P(X ∈ B) := P(X^{−1}(B)), B ∈ B^k,

is called the distribution of X.

Remark:

(i) Show as an exercise that P^X is a probability measure on (R^k, B^k).

(ii) Interpretation: (Ω, A, P) is the machinery of randomness in the background; what we observe are the values of X and their distribution P^X.

(iii) Notation: Capital letters X, Y, Z, ... are used for random variables (exception: Greek letters, e.g. ε), lower case letters x, y, z, ... are used for elements of R^k and denote realisations (possible outcomes of X, Y, Z, ...). This is the typical convention in statistics. However, in econometrics one often uses x, y, z, ... for random variables and their realisations at the same time.

(iv) Suppose that (Ω, A) = (R, B). Then (real-valued) indicator functions, monotone functions, continuous functions and functions with only finitely many discontinuities are (Borel) measurable (see Jacod and Protter [10, Theorems 8.3, 8.4]).

For the indicator function it works as follows: Let A ∈ A and X = 1_A. Then we get:

Case 1: B ∈ B with B ∩ {0, 1} = {0, 1} ⟹ X^{−1}(B) = 1_A^{−1}(B) = Ω ∈ A.

Case 2: B ∈ B with B ∩ {0, 1} = ∅ ⟹ X^{−1}(B) = 1_A^{−1}(B) = ∅ ∈ A.

Case 3: B ∈ B with B ∩ {0, 1} = {0} ⟹ X^{−1}(B) = 1_A^{−1}(B) = A^C ∈ A.

Case 4: B ∈ B with B ∩ {0, 1} = {1} ⟹ X^{−1}(B) = 1_A^{−1}(B) = A ∈ A.

(v) Suppose that (Ω, A) = (R, B) and that (X_n)_n is a sequence of random variables on this space. Then X_1 + X_2, X_1 X_2, sup_n X_n, inf_n X_n, and lim_n X_n (provided it exists) are random variables (see Jacod and Protter [10, Corollary 8.1, Theorem 8.4]).

For two random variables X_1 : Ω → R^{k_1}, X_2 : Ω → R^{k_2} defined on the same probability space, one defines the joint distribution P^{(X_1′, X_2′)′} by interpreting (X_1′, X_2′)′ as a new random variable Y in R^{k_1 + k_2}.

Definition 1.21. (i) Random variables X_1, ..., X_l on a probability space (Ω, A, P) are independent if

P(X_1 ∈ A_1, ..., X_l ∈ A_l) = P(X_1^{−1}(A_1) ∩ · · · ∩ X_l^{−1}(A_l)) = P(X_1 ∈ A_1) · · · P(X_l ∈ A_l)

for all Borel sets A_1, ..., A_l.

(ii) Suppose that (X_t)_{t∈T} with some nonempty index set T is a family of R^k-valued random variables on (Ω, A, P). These random variables are independent if for any finite, nonempty I_0 ⊆ T and any Q_t ∈ B^k, t ∈ I_0,

P(⋂_{t∈I_0} X_t^{−1}(Q_t)) = ∏_{t∈I_0} P(X_t^{−1}(Q_t)).

Independence is one of the most important tools to construct complex probabilistic models.

Lemma 1.22. If the distribution of an R^k-valued random variable X = (X_1, ..., X_k)′ has a bounded, piecewise continuous density f_X, then X_1 has the marginal density

f_{X_1}(x_1) = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ f_X(x_1, ..., x_k) dx_2 · · · dx_k,

since

F_{X_1}(z) = P(X_1 ∈ (−∞, z], X_2 ∈ R, ..., X_k ∈ R) = ∫_{−∞}^z ∫_{−∞}^∞ · · · ∫_{−∞}^∞ f_X(x_1, ..., x_k) dx_2 · · · dx_k dx_1.

1.3.2 Expectations, moments

Motivation: Relate unknown parameters of the distribution of a random variable to its moments.

Question: When do moments exist and how do we calculate them?

We now consider R-valued random variables. We call a random variable X discrete if its distribution is discrete.

Definition 1.23. Suppose that X is a discrete random variable on a probability space (Ω, A, P) with values in X = {a_1, a_2, ...}, P(X = a_i) = p_i for i = 1, 2, ... and Σ_i p_i = 1.

(i) If X = {a_1, ..., a_N} is finite, then the mean (expectation) of X is given by

EX = Σ_{i=1}^N a_i P({ω | X(ω) = a_i}) = Σ_{i=1}^N a_i P^X({a_i}).

(ii) If X is countably infinite and Σ_i |a_i| P^X({a_i}) < ∞, then the mean of X is given by

EX = Σ_i a_i P({ω | X(ω) = a_i}) = Σ_i a_i P^X({a_i}).

(iii) In general, write X^+ = max{X, 0} and X^- = max{−X, 0}. If EX^+ < ∞ or EX^- < ∞, then

EX = EX^+ − EX^- = Σ_i a_i P^X({a_i}).

Note that the expectation of a random variable only depends on its distribution.

For A ∈ A and X = 1_A we obtain in particular

E1_A = P(A).

Definition 1.25 (General definition of expectation). (i) For a real-valued random variable X ≥ 0 define

X_n(ω) = k/n if k/n ≤ X(ω) < (k+1)/n for a k ∈ N_0,

and define

EX = lim_{n→∞} EX_n.

(ii) For a general real-valued random variable X define

EX = EX^+ − EX^- if EX^+ < ∞ or EX^- < ∞.

Remark:

- X_n = Σ_{k=0}^∞ (k/n) 1_{X ∈ [k/n, (k+1)/n)} and hence EX_n = Σ_{k=0}^∞ (k/n) P(X ∈ [k/n, (k+1)/n)).
- ‖X_n − X_m‖_∞ ≤ max{n^{−1}, m^{−1}}.
- lim_{n→∞} EX_n always exists but might be infinite.
- The Riemann integral arises as a limit of sums based on a horizontal grid (a partition of the x-axis). The definition of the expectation above can also be interpreted as an integral, the so-called Lebesgue integral, which is based on a vertical grid (a partition of the range of X): http://commons.wikimedia.org/wiki/File:Riemannvslebesgue.svg

Alternative notations:

EX = ∫ x dP^X(x) = ∫ x P^X(dx) = ∫ x dF^X(x) = ∫ x F^X(dx) = ∫ X(ω) P(dω) = ∫ X(ω) dP(ω)
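The vertical-grid construction of Definition 1.25 can be traced numerically; a sketch for X ~ Exp(1) (an illustrative choice with known EX = 1), using that P(X ∈ [a, b)) = e^{−a} − e^{−b}:

```python
import math

def EX_n(n):
    # E X_n = sum_k (k/n) * P(X in [k/n, (k+1)/n)) for X ~ Exp(1),
    # truncated at X = 50 (the omitted tail mass is below 1e-20)
    return sum((k / n) * (math.exp(-k / n) - math.exp(-(k + 1) / n))
               for k in range(50 * n))

# refining grids: each n divides the next, so X_n increases pointwise
approx = [EX_n(n) for n in (1, 2, 10, 100)]
print(approx)  # increases towards EX = 1; the error is at most 1/n
```

Since 0 ≤ X − X_n < 1/n, the approximation error of EX_n is at most 1/n, which the output reflects.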

Theorem 1.26 (Properties of expectations). For real-valued random variables X_1, X_2 with finite expectations and α_1, α_2 ∈ R it holds that

(i) |E[X_1]| ≤ sup_ω |X_1(ω)|,

(ii) E[α_1 X_1 + α_2 X_2] = α_1 E[X_1] + α_2 E[X_2] (linearity),

(iii) E[X_1] ≤ E[X_2] if X_1 ≤ X_2 (monotonicity),

(iv) E[X_1 X_2] = E[X_1] E[X_2] if X_1, X_2 are independent.

Remark: One can show that additivity of the expectation of non-negative random variables also holds if the corresponding expectations are infinite; in particular E|X| = EX^+ + EX^-.

Proof. The proofs of the first three items are deferred to the exercises.

(iv):

(1) Discrete case: Suppose that X_1, X_2 ≥ 0 are discrete, say with value sets X_1 = {a_1, a_2, ...} and X_2 = {b_1, b_2, ...}. For Y := X_1 X_2 ≥ 0 it follows that

E(X_1 X_2) = EY
= Σ_{y_i} y_i P(Y = y_i)
= Σ_{a_j ∈ X_1, b_k ∈ X_2} a_j b_k P(X_1 = a_j, X_2 = b_k)
= Σ_{a_j ∈ X_1, b_k ∈ X_2} a_j b_k P(X_1 = a_j) P(X_2 = b_k)   (by independence)
= (Σ_{a_j ∈ X_1} a_j P(X_1 = a_j)) (Σ_{b_k ∈ X_2} b_k P(X_2 = b_k))
= EX_1 · EX_2.

(2) Case X_1, X_2 ≥ 0: Let X_{j,n} be the discrete approximation of X_j on a 1/n-grid (as in Definition 1.25) and let (X_1 X_2)_{n²} be the discrete approximation of X_1 X_2 on a 1/n²-grid. Note that

X_{1,n} X_{2,n} ≤ (X_1 X_2)_{n²} ≤ (X_{1,n} + 1/n)(X_{2,n} + 1/n).

(Justification of the first inequality:

X_{1,n} X_{2,n} = Σ_{k,l=0}^∞ (kl/n²) 1_{X_1 ∈ [k/n,(k+1)/n), X_2 ∈ [l/n,(l+1)/n)}
= Σ_{m=0}^∞ (m/n²) Σ_{k,l: kl=m} 1_{X_1 ∈ [k/n,(k+1)/n), X_2 ∈ [l/n,(l+1)/n)}
≤ Σ_{m=0}^∞ (m/n²) 1_{X_1 X_2 ∈ [m/n², (m+1)/n²)}
= (X_1 X_2)_{n²},

and similarly for the second inequality.)

For discrete random variables one can easily check that (iv) holds (by step (1)); moreover, X_{1,n} and X_{2,n} are functions of X_1 and X_2 and hence independent. Thus we have E[(X_{1,n} + 1/n)(X_{2,n} + 1/n)] = E[X_{1,n} + 1/n] E[X_{2,n} + 1/n] and E[X_{1,n} X_{2,n}] = E[X_{1,n}] E[X_{2,n}]. By application of (iii) this gives

E[X_{1,n}] E[X_{2,n}] ≤ E[(X_1 X_2)_{n²}] ≤ E[X_{1,n} + 1/n] E[X_{2,n} + 1/n].

This implies the result because for n → ∞ the lower and the upper bound both converge to E[X_1] E[X_2], while E[(X_1 X_2)_{n²}] → E[X_1 X_2] by Definition 1.25.

(3) General case: Writing X_j = X_j^+ − X_j^-,

E[X_1 X_2] = E[X_1^+ X_2^+ − X_1^+ X_2^- − X_1^- X_2^+ + X_1^- X_2^-]
= E(X_1^+ X_2^+) − E(X_1^+ X_2^-) − E(X_1^- X_2^+) + E(X_1^- X_2^-)   (by linearity)
= E(X_1^+) E(X_2^+) − E(X_1^+) E(X_2^-) − E(X_1^-) E(X_2^+) + E(X_1^-) E(X_2^-)   (by (2))
= E(X_1^+ − X_1^-) E(X_2^+ − X_2^-)
= EX_1 · EX_2.

Theorem 1.27. If the distribution function F_X of X, given by F_X(x) = P(X ≤ x), x ∈ R, has a bounded and piecewise continuous density f_X and if E[X] is finite, then

E[X] = ∫_{−∞}^∞ x f_X(x) dx.   (2)

Proof. See exercises for X ≥ 0. An extension to the general case is then straightforward.

Extension of Theorem 1.27: Assume as above that f_X is bounded and piecewise continuous and, moreover, that g is a piecewise continuous function that is either bounded or non-negative. Then

E[g(X)] = ∫_{−∞}^∞ g(x) f_X(x) dx.   (3)

Definition (moments):

(i) The s-th moment of X is defined as E[X^s] (if well-defined),

(ii) the s-th absolute moment as E[|X|^s],

(iii) and the s-th central moment as E[(X − μ)^s] with μ = E[X] (if well-defined).

(iv) The 2nd central moment is also called the variance: Var[X] = E[(X − μ)²].

Example (normal distribution): Suppose that X ~ N(μ, σ²); we compute the expectation here. First note that

EX^+ = (1/√(2πσ²)) ∫_0^∞ x e^{−(x−μ)²/(2σ²)} dx
     = (1/√(2πσ²)) [∫_0^∞ (x − μ) e^{−(x−μ)²/(2σ²)} dx + μ ∫_0^∞ e^{−(x−μ)²/(2σ²)} dx] < ∞,

and similarly EX^- < ∞. Hence, substituting y = x − μ,

EX = (1/√(2πσ²)) ∫_{−∞}^∞ x e^{−(x−μ)²/(2σ²)} dx = (1/√(2πσ²)) [∫_{−∞}^∞ y e^{−y²/(2σ²)} dy + μ ∫_{−∞}^∞ e^{−y²/(2σ²)} dy] = μ,

because the first integral vanishes by symmetry and the second equals √(2πσ²).

In particular, we see that there are random variables with the same expectation but different distributions.

For independent random variables X_1, ..., X_n with finite second moments it holds that

Var(X_1 + · · · + X_n) = Var(X_1) + · · · + Var(X_n).

Proof:

Var(Σ_{i=1}^n X_i) = E[(Σ_{i=1}^n X_i − E(Σ_{i=1}^n X_i))²]
= E[(Σ_{i=1}^n X_i − Σ_{i=1}^n EX_i)²]   (by linearity)
= E[(Σ_{i=1}^n (X_i − EX_i))²]
= E[(Σ_{i=1}^n (X_i − EX_i)) (Σ_{j=1}^n (X_j − EX_j))]
= Σ_{i,j=1}^n E[(X_i − EX_i)(X_j − EX_j)]   (by linearity)
= Σ_{i≠j} E[(X_i − EX_i)(X_j − EX_j)] + Σ_{i=1}^n E[(X_i − EX_i)²],

where the last sum equals Σ_{i=1}^n Var(X_i). By the fact that functions of independent random variables are also independent, it follows from Theorem 1.26(iv) that

Var(Σ_{i=1}^n X_i) = Σ_{i≠j} E[X_i − EX_i] E[X_j − EX_j] + Σ_{i=1}^n Var(X_i).

For each i,

E[X_i − EX_i] = EX_i − EX_i · E(1)   (by linearity, since EX_i is a constant)
= EX_i − EX_i · E(1_Ω)
= EX_i − EX_i · P(Ω)
= EX_i − EX_i
= 0.

Hence

Var(Σ_{i=1}^n X_i) = Σ_{i=1}^n Var(X_i).
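The additivity of the variance for independent summands can be checked by Monte Carlo; a sketch with three illustrative distributions (uniform, normal, exponential), whose theoretical variances are 1/12, 4 and 1:

```python
import random

random.seed(0)

def var(xs):
    # population-style sample variance
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

R = 100_000
# independent draws of (X1, X2, X3): uniform, normal, exponential
draws = [(random.uniform(0, 1), random.gauss(0, 2), random.expovariate(1.0))
         for _ in range(R)]

sums = [a + b + c for a, b, c in draws]
individual = [var([d[j] for d in draws]) for j in range(3)]
print(var(sums), sum(individual))  # both close to 1/12 + 4 + 1
```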

For an R^k-valued random variable X = (X_1, ..., X_k)′ such that EX_j exists for all j, we define the expectation (vector) as

μ = E[X] = (E[X_1], ..., E[X_k])′.

If E|X_j|² < ∞ for all j, the covariance matrix is defined by

Σ = E[(X − μ)(X − μ)′],

i.e. Σ is the k × k matrix with entries

Σ_{ij} = E[(X_i − μ_i)(X_j − μ_j)],

so that the diagonal entries are the variances E[(X_j − μ_j)²] and the off-diagonal entries are the covariances.

2 Asymptotic theory

In applications, we are typically interested in the behaviour of X_{n_0} for a fixed n_0. Via asymptotic theory we get an approximation by embedding X_{n_0} into a sequence (X_n)_n.

2.1 Convergence of expectations

Suppose that for a sequence of random variables (X_n)_n and a random variable X on a probability space (Ω, A, P)

X_n(ω) → X(ω) for all ω ∈ Ω.   (4)

In general this is not sufficient for

EX_n → EX

(see exercises). Our aim is now to set up additional conditions that assure convergence of the corresponding expectations. There are two main tools, the monotone convergence theorem and Lebesgue's dominated convergence theorem.

Theorem 2.1 (Monotone convergence theorem). Suppose that (4) holds and that

0 ≤ X_n(ω) ≤ X_{n+1}(ω) for n ≥ 1, ω ∈ Ω.   (5)

Then

EX_n → EX.

If the random variables do not converge, one can obtain a weaker result.

Lemma 2.2 (Fatou's Lemma). Consider a sequence of non-negative random variables (X_n)_n (defined on the same probability space (Ω, A, P)) and define

X(ω) = lim inf_{n→∞} X_n(ω).

Then

EX ≤ lim inf_{n→∞} EX_n.

Proof. Recall that lim inf_{n→∞} X_n(ω) = lim_{n→∞} (inf_{k≥n} X_k(ω)). First note that the sequence (Y_n)_n given by Y_n = inf_{k≥n} X_k satisfies 0 ≤ Y_n(ω) ≤ Y_{n+1}(ω) for all n, ω, and Y_n(ω) → X(ω). By the monotone convergence theorem and the fact that Y_n ≤ X_k for all k ≥ n,

EX = lim_{n→∞} EY_n ≤ lim_{n→∞} inf_{k≥n} EX_k = lim inf_{n→∞} EX_n.

Using the latter result, Lebesgue's dominated convergence theorem can be proven.

Theorem 2.3 (Dominated convergence theorem). Suppose that (4) holds and that there is a random variable Y with E[Y] < +∞ such that

|X_n(ω)| ≤ Y(ω) for n ≥ 1, ω ∈ Ω.   (6)

Then

E[X_n] → E[X].

Remark: All the results above also hold if (4), (5), (6) only hold a.s. (almost surely), i.e. if

P({ω : X_n(ω) → X(ω)}) = 1,   (7)

P({ω : 0 ≤ X_n(ω) ≤ X_{n+1}(ω)}) = 1 for n ≥ 1,   (8)

P({ω : |X_n(ω)| ≤ Y(ω)}) = 1 for n ≥ 1, with Y as above,   (9)

respectively. The proofs can be found in Jacod and Protter [10] and will be skipped here.
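The role of the dominating variable Y in (6) can be seen in a standard counterexample (not from the text): on ([0, 1], B, Uniform[0, 1]) let X_n = n · 1_{[0, 1/n]}; then X_n(ω) → 0 for every ω > 0, but EX_n = 1 for all n, and sup_n X_n is not integrable, so no dominating Y exists. A numerical sketch:

```python
def X(n, omega):
    # X_n = n on [0, 1/n], 0 elsewhere
    return n if omega <= 1.0 / n else 0.0

def EX(n, m=100_000):
    # midpoint Riemann approximation of E X_n over omega in [0, 1]
    return sum(X(n, (k + 0.5) / m) for k in range(m)) / m

vals = [EX(n) for n in (1, 10, 100)]
print(vals)                                # expectations stay at 1
print([X(n, 0.3) for n in (1, 10, 100)])   # pointwise values reach 0
```

The expectations do not converge to E[lim X_n] = 0, which is exactly the failure mode that the domination condition rules out.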


2.2 Modes of convergence

2.2.1 Definitions

Let ‖·‖ denote a norm on R^k, e.g. ‖x‖ = ‖x‖_1 = Σ_{j=1}^k |x_j| or ‖x‖ = ‖x‖_2 = (Σ_{j=1}^k |x_j|²)^{1/2}.

Definition 2.4. Suppose that (X_n)_n and X are random variables on a probability space (Ω, A, P) and with values in (R^k, B^k).

(i) (Convergence in probability) The sequence (X_n)_n converges in probability to X if

P({ω | ‖X_n(ω) − X(ω)‖ > ε}) = P(‖X_n − X‖ > ε) → 0 for all ε > 0.

Notation: X_n →^P X, p-lim_n X_n = X.

(ii) (Almost sure convergence) The sequence (X_n)_n converges almost surely to X if

P({ω | lim_n X_n(ω) = X(ω)}) = P(X_n → X) = 1.

Notation: X_n → X P-a.s., X_n → X a.s., X_n →^{a.s.} X.

(iii) (Convergence in p-th mean) For p ≥ 1, the sequence (X_n)_n converges in p-th mean to X if

E‖X_n − X‖^p → 0.

Notation: X_n →^{Lp} X, Lp-lim_n X_n = X.

Now we drop the assumption that all random variables live on the same probability space.

Definition 2.5. Suppose that (X_n)_n and X are random variables with values in (R^k, B^k).

(i) (Convergence in distribution) The sequence (X_n)_n converges to X in distribution if

Ef(X_n) → Ef(X) for all bounded continuous functions f : R^k → R.

Notation: X_n →^L X, X_n →^D X, X_n →^d X.

(ii) (Stochastic boundedness, tightness) The sequence (X_n)_n is stochastically bounded if

for all ε > 0 there exist C and n_0 such that P(‖X_n‖ ≤ C) ≥ 1 − ε for all n ≥ n_0.

Notation: X_n = O_P(1).

Remark:

(i) Rough interpretation of convergence in probability: X_n is approximately equal to X.

(ii) Rough interpretation of convergence in distribution: X_n has approximately the same distribution as X.

(iii) For real-valued random variables, convergence in distribution can equivalently be characterized via distribution functions. In applications it is sometimes easier to work with one or the other definition.

Theorem 2.6. Suppose that (X_n)_n and X are real-valued with distribution functions (F_n)_n and F, respectively. Then

X_n →^d X ⟺ F_n(x) → F(x) at all continuity points x of F.

2.2.2 Relation between different modes of convergence

Suppose that (X_n)_n and X are random variables on a probability space (Ω, A, P). Then the following scheme holds true for p ≥ 1:

X_n →^{Lp} X ⟹ X_n →^{Lp−1} X ⟹ · · · ⟹ X_n →^{L1} X ⟹ X_n →^P X ⟹ X_n →^d X,

and

X_n →^{a.s.} X ⟹ X_n →^P X.

Lp-convergence (p ≥ 1): The implications X_n →^{Lp} X ⟹ X_n →^{Lp−1} X ⟹ · · · ⟹ X_n →^{L1} X follow from Jensen's inequality:

(E‖Y‖)^p ≤ E(‖Y‖^p) for p ≥ 1,

see Billingsley [4].

We begin this paragraph with a very powerful inequality and will derive the desired result on the relation between Lp-convergence and convergence in probability as a corollary.

Theorem 2.7 (Markov's inequality). For a random variable X and a (strictly) monotonically increasing function g : [0, ∞) → [0, ∞) it holds for ε > 0 that

P(‖X‖ ≥ ε) ≤ E[g(‖X‖)] / g(ε).

Proof. It holds that 1(x ≥ ε) ≤ g(x)/g(ε). We apply this inequality with x equal to ‖X(ω)‖. This gives

P(‖X‖ ≥ ε) = E[1(‖X‖ ≥ ε)] ≤ E[g(‖X‖)/g(ε)] = E[g(‖X‖)] / g(ε).

Corollary 2.8. Suppose that (X_n)_n and X are random variables on some probability space (Ω, A, P). Then

X_n → X in p-th mean ⟹ X_n →^P X.

Proof. By Markov's inequality with g(u) = u^p,

P(‖X_n − X‖ ≥ ε) ≤ E‖X_n − X‖^p / ε^p → 0.

Before we discuss further relations between modes of convergence, two other important applications of Markov's inequality are presented.

Application 1. For a real-valued random variable X with finite second moment and ε > 0 it holds that

P(|X − EX| ≥ ε) ≤ Var(X) / ε²   (Chebychev's inequality).

Proof. The application of Markov's inequality with g(u) = u² and Y = X − EX yields the assertion.
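Chebychev's inequality can be verified empirically; a sketch for X ~ Exp(1) (an illustrative choice with EX = Var X = 1):

```python
import random

random.seed(7)

# draw a large sample from Exp(1) and compare empirical tail probabilities
# P(|X - EX| >= eps) with the Chebychev bound Var(X)/eps^2
R = 100_000
xs = [random.expovariate(1.0) for _ in range(R)]
mean = sum(xs) / R
varx = sum((x - mean) ** 2 for x in xs) / R

for eps in (1.0, 2.0, 3.0):
    freq = sum(abs(x - mean) >= eps for x in xs) / R
    bound = varx / eps ** 2
    print(eps, freq, bound)  # empirical tail probability vs Chebychev bound
```

As is typical, the bound is valid but far from tight: for the exponential distribution the actual tail decays exponentially while the bound decays only quadratically in ε.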

Application 2.

Theorem 2.9 (Weak law of large numbers 1). Suppose that X_1, X_2, ... are uncorrelated random variables (i.e. cov(X_i, X_j) = 0, i ≠ j) with EX_1 = EX_2 = · · · = μ ∈ R and Var X_1, Var X_2, ... ≤ σ². Then

X̄_n = (1/n)(X_1 + · · · + X_n) →^P μ.

Proof. As

E[(X̄_n − μ)²] = Var(X̄_n) = (1/n²)[Var(X_1) + · · · + Var(X_n)] ≤ σ²/n,

we obtain from Chebychev's inequality that

P(|X̄_n − μ| ≥ ε) ≤ σ²/(ε²n) → 0.

Theorem 2.10 (Weak law of large numbers 2). For a sequence of i.i.d. random variables X_1, X_2, ... with finite mean μ = E[X_j] it holds that X̄_n →^P μ.

Theorem 2.10 can be proved similarly to Theorem 2.9, but in a more complex manner: Chebychev's inequality is applied to an average of truncated versions of X_1, X_2, ... instead of X̄_n.

Remark: In general, convergence in probability does not imply convergence in Lp; see exercises.

Theorem 2.11. Suppose that (X_n)_n and X are random variables on some probability space (Ω, A, P) such that X_n →^{a.s.} X. Then

X_n →^P X.

Proof. Fix ε > 0 and set Z_n = 1(‖X_n − X‖ > ε). Then Z_n → 0 a.s. and |Z_n| ≤ 1. Then by dominated convergence

EZ_n = P(‖X_n − X‖ > ε) → 0.

Thus, X_n →^P X.

Convergence in probability does not imply a.s. convergence. To see this, a counterexample is provided.

Example 2.12. Suppose that (Ω, A, P) = ([0, 1], B, Uniform[0, 1]); a standard construction is the "typewriter" sequence: write n = 2^m + k with m ≥ 0, 0 ≤ k < 2^m, and define

X_n = 1_{[k/2^m, (k+1)/2^m]}.

Then P(|X_n| > ε) ≤ 2^{−m} → 0 for every ε ∈ (0, 1), so X_n →^P 0; but for every ω ∈ [0, 1] we have X_n(ω) = 1 for infinitely many n, so (X_n(ω))_n does not converge, and in particular X_n does not converge almost surely.
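The counterexample can be simulated; the sketch below assumes the concrete "typewriter" construction X_n = 1 on [k/2^m, (k+1)/2^m] for n = 2^m + k, which is one standard choice:

```python
# Typewriter sequence on Omega = [0, 1]: for n = 2^m + k (0 <= k < 2^m),
# X_n = 1 on [k/2^m, (k+1)/2^m] and 0 elsewhere.
def X(n, omega):
    m = n.bit_length() - 1        # n = 2^m + k
    k = n - (1 << m)
    return 1.0 if k / 2**m <= omega <= (k + 1) / 2**m else 0.0

omega = 0.37                       # an arbitrary sample point
hits = [n for n in range(1, 2**12) if X(n, omega) == 1.0]

# P(X_n != 0) = 2^-m -> 0, so X_n converges to 0 in probability; but X_n(omega)
# equals 1 once per dyadic level m = 0, ..., 11, so it is 1 infinitely often
# and the sequence X_n(omega) does not converge pointwise.
lengths = [1.0 / 2 ** (n.bit_length() - 1) for n in range(1, 2**12)]
print(len(hits), lengths[-1])  # 12 hits; interval length shrinks to 2^-11
```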

Convergence in probability and convergence in distribution

Theorem 2.13. For R^k-valued random variables (X_n)_n and X (defined on a common probability space) it holds:

X_n →^P X ⟹ X_n →^d X.

Proof. Choose ε > 0 and a bounded continuous function f : R^k → R. We will show that there exists an n_0 > 0 such that

(*) |E[f(X_n)] − E[f(X)]| ≤ ε

for n ≥ n_0. W.l.o.g. we assume that |f(x)| ≤ 1 for all x.

First, we choose C > 0 such that P(‖X‖ ≥ C) ≤ ε/6. Such a C exists because P(‖X‖ > m) → 0 for m → ∞ by monotone convergence (or dominated convergence).

Second, choose 0 < δ ≤ 1 such that |f(x) − f(z)| ≤ ε/3 for all x, z with ‖x‖ ≤ C + 1, ‖z‖ ≤ C + 1 and ‖x − z‖ ≤ δ. Such a δ exists because f is uniformly continuous on the compact set {x : ‖x‖ ≤ C + 1}.

Third, choose n_0 such that P(‖X_n − X‖ ≥ δ) ≤ ε/6 for n ≥ n_0.

Now define the events A_{n,1} = {‖X‖ ≥ C}, A_{n,2} = {‖X‖ < C, ‖X_n − X‖ < δ}, and A_{n,3} = {‖X‖ < C, ‖X_n − X‖ ≥ δ}. Denote the indicator functions of these events by 1_{A_{n,1}}, 1_{A_{n,2}} and 1_{A_{n,3}}. Define also Ã_{n,3} = {‖X_n − X‖ ≥ δ}. Note that A_{n,3} ⊆ Ã_{n,3} and A_{n,1} ∪ A_{n,2} ∪ A_{n,3} = Ω. We have now all parts of our argument prepared to show (*): for n ≥ n_0,

|E[f(X_n)] − E[f(X)]| ≤ E[1_{A_{n,1}} |f(X_n) − f(X)|] + E[1_{A_{n,2}} |f(X_n) − f(X)|] + E[1_{A_{n,3}} |f(X_n) − f(X)|]
≤ E[1_{A_{n,1}} · 2] + E[1_{A_{n,2}} · ε/3] + E[1_{A_{n,3}} · 2]
≤ 2P[A_{n,1}] + ε/3 + 2P[Ã_{n,3}]
≤ 2ε/6 + ε/3 + 2ε/6
= ε.

Theorem 2.14. For R^k-valued random variables (X_n)_n on a probability space (Ω, A, P) and deterministic a ∈ R^k it holds:

X_n →^P a ⟺ X_n →^d a.

Proof. "⟹" follows from Theorem 2.13. For "⟸", consider f(z) = ‖z − a‖/(1 + ‖z − a‖), which is continuous and bounded by one. Therefore Ef(X_n) → Ef(a) = 0. On the other hand, the function g(z) = z/(1 + z) is increasing, which in turn implies, for ε > 0,

Ef(X_n) ≥ (ε/(1 + ε)) E[1_{‖X_n − a‖ ≥ ε}] = (ε/(1 + ε)) P(‖X_n − a‖ ≥ ε),

so P(‖X_n − a‖ ≥ ε) → 0.

The general definition of convergence in distribution can be invoked to deduce the following theorem, which plays an outstanding role in mathematical statistics.

Theorem 2.15 (Continuous mapping theorem (CMT)). Suppose that (X_n)_n, X are R^k-valued random variables with X_n →^d X. Moreover, let f : R^k → R^l be a continuous function. Then f(X_n) →^d f(X).

Proof. Exercise.

Example 2.16 (Plug-in principle). Suppose that we observe realizations of the R^k-valued random variables X_1, ..., X_n and we want to estimate an unknown parameter θ ∈ R^p. Then

θ̂_n = T(X_1, ..., X_n)

with some measurable function T : R^{kn} → R^p is called a parameter estimator. A sequence of estimators (θ̂_n)_n for an unknown parameter θ is consistent if θ̂_n →^P θ. If (θ̂_n)_n is a consistent sequence of estimators for a parameter θ and if g is a continuous function, then g(θ̂_n) →^P g(θ). E.g. suppose that X_1, ..., X_n are i.i.d. (independent and identically distributed) with P^{X_1} = Exp(λ), λ > 0, and values in (0, ∞). We search for a consistent estimator of λ. Since by the WLLN X̄_n = (1/n) Σ_{k=1}^n X_k →^P EX_1 = 1/λ, the CMT gives that the definition λ̂_n = 1/X̄_n yields a consistent sequence of estimators for λ.
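The plug-in estimator λ̂_n = 1/X̄_n of Example 2.16 can be tried out by simulation; a sketch (the true value λ = 2.5 and the sample sizes are illustrative):

```python
import random

random.seed(42)

lam = 2.5   # illustrative "true" rate parameter

def lam_hat(n):
    # plug-in estimator: 1 / sample mean of n i.i.d. Exp(lam) draws
    xs = [random.expovariate(lam) for _ in range(n)]
    return len(xs) / sum(xs)

estimates = [lam_hat(n) for n in (100, 10_000, 1_000_000)]
print(estimates)  # approaches 2.5 as n grows
```

The consistency follows exactly the argument in the text: the sample mean converges in probability to 1/λ by the WLLN, and the continuous map x ↦ 1/x (continuous on (0, ∞)) carries this over to λ̂_n.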

Even though our definition of distributional convergence is technically suitable, it is sometimes not very convenient to apply when checking distributional convergence. Next, we state a useful tool to deduce multivariate distributional convergence from univariate distributional convergence.

Theorem 2.17 (Cramér-Wold device). Suppose that (X_n)_n and X are R^k-valued random variables. Then

X_n →^d X ⟺ a′X_n →^d a′X for all a ∈ R^k.

Proof. For the proof, we refer the reader to Theorem 29.4 in Billingsley [4].

Remark. In particular, the theorem implies that the distribution of a random vector X is uniquely determined by the distributions of a′X for all a ∈ R^k. This is used in computer tomography to recover images. It is also useful to deduce a multivariate CLT from a univariate one; see below.

Theorem 2.18. Suppose that $(X_n)_n$, $X$ are $\mathbb{R}^k$-valued random variables. Choose $l \in \mathbb{N}$. The following statements are equivalent:

(i) $X_n \xrightarrow{d} X$.

(ii) $Ef(X_n) \to Ef(X)$ for all functions $f: \mathbb{R}^k \to \mathbb{R}$ that are continuous and have bounded support.

(iii) $Ef(X_n) \to Ef(X)$ for all functions $f: \mathbb{R}^k \to \mathbb{R}$ that are $l$-times differentiable and bounded.

(iv) $E[\exp(ia'X_n)] \to E[\exp(ia'X)]$ for all $a \in \mathbb{R}^k$. Here $i = \sqrt{-1}$. The function $a \mapsto E[\exp(ia'X)]$ is also called the characteristic function or Fourier transform of $X$.

Proof. Clearly, (i) $\Rightarrow$ (ii). The converse can be proved using the ideas of the proof of Theorem 2.11. (i) $\Leftrightarrow$ (iii) can be found in Pollard [11, Theorem III.3.12]. The proof of (i) $\Leftrightarrow$ (iv) can be found in Section 29 in Billingsley [4].

Lemma 2.19 (Slutsky's Lemma). For $\mathbb{R}^k$-valued random variables $(X_n)_n$, $(Z_n)_n$ and $X$ it holds:
$$\|X_n - Z_n\| \xrightarrow{P} 0, \quad Z_n \xrightarrow{d} X \quad \Longrightarrow \quad X_n \xrightarrow{d} X.$$
Proof. Note that the functions considered in Theorem 2.18(ii) are uniformly continuous. Using this particular characterization of convergence in distribution, we get for any $\delta > 0$
$$|Ef(X_n) - Ef(X)| \le |Ef(Z_n) - Ef(X)| + E\big[|f(Z_n) - f(X_n)|\, 1_{\{\|Z_n - X_n\| \le \delta\}}\big] + 2\|f\|_\infty\, P(\|Z_n - X_n\| > \delta).$$
The right-hand side is less than any prescribed $\varepsilon > 0$ whenever $\delta > 0$ is chosen sufficiently small and $n \ge n_0(\varepsilon)$, where $n_0$ has to be chosen sufficiently large.


2.2.4 Discussion of stochastic boundedness

Recall: $(X_n)_n$ is stochastically bounded if for every $\varepsilon > 0$ there is a $C_\varepsilon > 0$ such that $\sup_n P(\|X_n\| > C_\varepsilon) \le \varepsilon$. Notation: $X_n = O_p(1)$.

Why do we not use the stronger requirement
$$P(\|X_n\| \le C) = 1 \quad \text{for all } n \text{ and some } C?$$
In this sense, even constant sequences $(X_n)_n$ would not be bounded, i.e. sequences with $X_n \equiv X$ (the whole sequence $X_n$ is identical to a fixed random variable $X$). Note that in general it does not hold that $P(\|X\| \le C) = 1$ for some $C$.

Theorem 2.20. Suppose that $(X_n)_n$ and $X$ are $\mathbb{R}^k$-valued random variables. Then
$$X_n \xrightarrow{d} X \implies X_n = O_p(1).$$
Proof. We show the statement for real-valued $X_n$ and $X$ with continuous CDF $F_X$. The general proof follows from Theorem 29.1 in Billingsley [4]. For $\varepsilon > 0$ choose $x_\varepsilon$, $y_\varepsilon$ with $F_X(x_\varepsilon) < \varepsilon/2$ and $F_X(y_\varepsilon) > 1 - \varepsilon/2$. Then
$$F_X(y_\varepsilon) - F_X(x_\varepsilon) > 1 - \varepsilon.$$
This implies that $P(x_\varepsilon < X_n \le y_\varepsilon) > 1 - \varepsilon$ for $n \ge n_0$ if $n_0$ is chosen large enough.

P

We write Xn = oP (1) for Rk -valued random variables (Xn )n if Xn 0k . This relation is often

successfully applied to make use of the following relations.

Theorem 2.21.

(i) $X_n = X + o_p(1) \implies X_n = O_p(1)$ ($X_n$, $X$ scalar or vector or matrix).

(ii) For $X_n = o_p(1)$, $Y_n = o_p(1)$, $U_n = O_p(1)$, $W_n = O_p(1)$ it holds

(a) $X_n + Y_n = o_p(1)$,

(b) $U_n + W_n = O_p(1)$,

(c) $U_n W_n = O_p(1)$,

(d) $X_n U_n = o_p(1)$.

(iii) Let $g: \mathbb{R}^k \to \mathbb{R}^l$ be continuous at $x_0$. Then
$$X_n = x_0 + o_p(1) \implies g(X_n) = g(x_0) + o_p(1).$$

Proof. (i) Apply Theorems 2.13 and 2.20: We have that $X_n - X = o_P(1)$ and therefore $X_n \xrightarrow{P} X$, which in turn implies $X_n \xrightarrow{d} X$ and hence $X_n = O_P(1)$.

(ii) (a) See exercises.

(b) Can be shown analogously to (a).

(c) We only state the real-valued case. The general case can be treated similarly invoking submultiplicativity of certain matrix norms and equivalence of matrix norms. The assertion follows from choosing $K_W$ such that $\sup_n P(|W_n| > K_W) \le \varepsilon/2$ and then choosing $K_\varepsilon$ such that $\sup_n P(|U_n| > K_\varepsilon/K_W) \le \varepsilon/2$.

(d) Can be verified analogously to (c).


(iii) This is a corollary of the CMT in the case where $g$ is continuous everywhere: Similar to (i), we get that $X_n \xrightarrow{d} x_0$ and therefore also $g(X_n) \xrightarrow{d} g(x_0)$. Finally, apply Theorem 2.14 to deduce the assertion.

If $c_n^{-1} X_n = O_p(1)$, one also writes $X_n = O_p(c_n)$. If $c_n^{-1} X_n = o_p(1)$, one also writes $X_n = o_p(c_n)$. For example,
$$\sqrt{n}\,(\hat\theta_n - \theta) = O_p(1) \iff \hat\theta_n = \theta + O_p\Big(\frac{1}{\sqrt{n}}\Big).$$

The following theorems provide very powerful tools for asymptotic statistics. Still, their proofs are lengthy. Therefore, we skip them and only give the corresponding references.

Theorem 2.22 (Strong law of large numbers (SLLN)). For a sequence of i.i.d. random variables $X_1, X_2, \dots$ on some probability space $(\Omega, \mathcal{A}, P)$ with finite mean $\mu = E[X_j]$ it holds that $\bar X_n \to \mu$ almost surely.

Theorem 2.23 (Central limit theorem for i.i.d. sequences). Let $X_1, X_2, X_3, \dots$ be i.i.d. real-valued random variables with $EX_i = \mu$, $\mathrm{Var}(X_i) = \sigma^2 \in (0, \infty)$. Then
$$\frac{X_1 + \dots + X_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} Z \sim N(0, 1).$$
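The theorem can be illustrated by simulation (a sketch of ours; uniform summands are an arbitrary choice): across many replications, the standardized sums should have approximately mean 0 and standard deviation 1.

```python
import random
import statistics

def standardized_sum(n, rng):
    # X_i ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12
    mu, sigma = 0.5, (1 / 12) ** 0.5
    s = sum(rng.random() for _ in range(n))
    return (s - n * mu) / (sigma * n ** 0.5)

rng = random.Random(42)
draws = [standardized_sum(100, rng) for _ in range(5000)]
m = statistics.fmean(draws)   # should be close to 0
s = statistics.pstdev(draws)  # should be close to 1
```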

Theorem 2.24 (Lyapounov CLT). Let $X_1, X_2, \dots$ be independent real-valued random variables with $\mu_t = EX_t$, $\sigma_t^2 = \mathrm{Var}(X_t)$ and $m_{3,t} = E|X_t - \mu_t|^3 < \infty$. Assume
$$\frac{\big[\sum_{t=1}^n m_{3,t}\big]^{1/3}}{\big[\sum_{t=1}^n \sigma_t^2\big]^{1/2}} \xrightarrow{n\to\infty} 0.$$
Then
$$\frac{X_1 + \dots + X_n - \mu_1 - \dots - \mu_n}{\sqrt{\sigma_1^2 + \dots + \sigma_n^2}} \xrightarrow{d} Z \sim N(0, 1).$$

One says that a random vector $X$ with Riemann density
$$\varphi(x) = \frac{1}{\sqrt{(2\pi)^k \det \Sigma}} \exp\big(-0.5\,(x-\mu)'\Sigma^{-1}(x-\mu)\big), \quad x \in \mathbb{R}^k,$$
has a multivariate normal distribution with mean $\mu \in \mathbb{R}^k$ and covariance matrix $\Sigma \in \mathbb{R}^{k \times k}$, which is assumed to be positive definite. One can show that $a'X \sim N(a'\mu, a'\Sigma a)$ for any $a \in \mathbb{R}^k \setminus \{0_k\}$.


Theorem 2.25 (Multivariate CLT). Suppose that $X_1, X_2, \dots$ are i.i.d. $\mathbb{R}^k$-valued random variables with mean vector $\mu$ and finite, positive definite covariance matrix $\Sigma$. Then
$$\frac{1}{\sqrt{n}}\,(X_1 + \dots + X_n - n\mu) \xrightarrow{d} \tilde Z \sim N(0_k, \Sigma).$$
Proof. This proof is an application of the Cramér-Wold device and a one-dimensional CLT. Choose $a \in \mathbb{R}^k$: $a'X_1, \dots, a'X_n$ are one-dimensional i.i.d. random variables with mean $a'\mu$ and variance $a'\Sigma a$. Thus
$$\frac{a'X_1 + \dots + a'X_n - na'\mu}{\sqrt{n}\,(a'\Sigma a)^{1/2}} \xrightarrow{d} Z \sim N(0, 1)$$
$$\implies \frac{1}{\sqrt{n}}\, a'(X_1 + \dots + X_n - n\mu) \xrightarrow{d} \tilde Z \sim N(0, a'\Sigma a)$$
$$\implies \frac{1}{\sqrt{n}}\,(X_1 + \dots + X_n - n\mu) \xrightarrow{d} \tilde Z \sim N(0_k, \Sigma)$$
by the Cramér-Wold device.


3 Conditional expectations, probabilities and variances

Suppose that $Y$ is a real-valued random variable with $E[Y^2] < \infty$ and $X$ is an $\mathbb{R}^k$-valued random variable.

Aim: Find a (measurable) function $g: \mathbb{R}^k \to \mathbb{R}$ that minimizes
$$E[\{Y - g(X)\}^2]. \qquad (*)$$
Definition 3.1. Each (measurable) function $g$ that minimizes $(*)$ is called conditional expectation of $Y$ given $X$.

Notation:
$$E[Y|X] = g(X), \qquad E[Y|X=x] = g(x).$$

Remark:

One can show that $E[Y|X]$ exists, i.e. there is a measurable function $g$ that minimizes $(*)$; see Chapters 22 and 23 in Jacod and Protter [10].

Note that $E[Y|X] = g(X)$ is a random variable, while $E[Y|X=x]$ is a real number.

Conditional expectations can also be defined if only $E|Y| < \infty$ or $Y \ge 0$, but this definition is less intuitive. Therefore, we stick to this version and discuss the general variant very briefly later on.

Recall the relation between expectation and probability: $E 1_A = P(A)$. Now, we define conditional distributions via conditional expectations.

Definition 3.2. Suppose that $X$ and $Z$ are random variables with values in $(\mathbb{R}^k, \mathcal{B}^k)$ and $(\mathbb{R}^l, \mathcal{B}^l)$, respectively. Then a conditional distribution of $Z$ given $X$ is defined as
$$P^{Z|X}(B) = P(Z \in B \mid X) = E(1_{\{Z \in B\}} \mid X), \qquad B \in \mathcal{B}^l.$$
Moreover, $P^{Z|X=x}(B) = E(1_{\{Z \in B\}} \mid X = x)$ is called conditional distribution of $Z$ given $X = x$.

Remark: Note that again, $P^{Z|X}$ is random. One can show that the corresponding minimizers can be chosen such that $P^{Z|X}$ and $P^{Z|X=x}$ are probability measures (a.s.) (these are called regular conditional distributions). This also holds for random variables with other state spaces (under certain assumptions). In this case $E(Y \mid X = x) = \int y \, dP^{Y|X=x}$; see Theorem 34.5 in Billingsley [4].

In the discrete case we minimize
$$E(Y - g(X))^2 = \sum_{y,x} [y - g(x)]^2\, P(Y = y, X = x).$$
For each fixed $x$ with $P(X = x) > 0$ we minimize
$$\sum_y [y - g(x)]^2\, P(Y = y, X = x) \quad \text{or equivalently} \quad \sum_y [y - g(x)]^2\, \frac{P(Y = y, X = x)}{P(X = x)}.$$
Note that the latter quotient gives a probability measure on $\mathcal{Y}$, the discrete state space of $Y$. Since we know that the mean of a square-integrable random variable $Z$ minimizes $E(Z - a)^2$, we obtain
$$E(Y \mid X = x) = g(x) = \sum_y y\, \frac{P(Y = y, X = x)}{P(X = x)}.$$
More generally speaking, we re-obtain the definition of conditional distribution for discrete random variables as
$$P^{Y|X=x}(A) = \frac{P^{X,Y}(\{x\} \times A)}{P^X(\{x\})} \quad \text{if } P^X(\{x\}) > 0. \qquad (+)$$
For other values of $x$, define $P^{Y|X=x}(A)$ as you like (the set of such $x$ is a $P^X$-null set).


Example 3.3. Suppose that $X$ and $Y$ describe two independent dice experiments and define $Z = X + Y$. Then, for $x = 1, \dots, 6$,
$$E(Z \mid X = x) = \sum_{z=2}^{12} 6z\, P(Z = z, X = x) = \sum_{z=2}^{12} 6z\, P(Y = z - x, X = x) = \sum_{z=x+1}^{x+6} \frac{z}{6} = x + 3.5.$$
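The value $x + 3.5$ can be verified by brute-force enumeration of the joint distribution (an illustration of ours, using exact rational arithmetic):

```python
from fractions import Fraction

def cond_exp_sum_given_first(x):
    """E(Z | X = x) for Z = X + Y with two independent fair dice, by enumeration."""
    num = Fraction(0)
    den = Fraction(0)
    for y in range(1, 7):
        p = Fraction(1, 36)   # P(X = x, Y = y)
        num += (x + y) * p
        den += p              # P(X = x)
    return num / den          # = sum_z z P(Z = z | X = x)

values = [cond_exp_sum_given_first(x) for x in range(1, 7)]
```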

Example 3.4. Suppose that $X_1, \dots, X_n$ are i.i.d. random variables with $X_1 \sim \mathrm{Bin}(1, \theta)$, where $\theta \in (0,1)$ is an unknown parameter. Then
$$P_\theta(X_1 = x_1, \dots, X_n = x_n) = \theta^{\sum_{i=1}^n x_i}\, (1 - \theta)^{n - \sum_{i=1}^n x_i}.$$
We consider the statistic $T(X := (X_1, \dots, X_n)') = \sum_{i=1}^n X_i$, i.e. $T \sim \mathrm{Bin}(n, \theta)$. Then
$$P_\theta(X = (x_1, \dots, x_n)' \mid T(X) = k) = \begin{cases} 1/\binom{n}{k}, & \text{if } T(x) = k \\ 0, & \text{else} \end{cases}$$
is independent of $\theta$, i.e. $T$ already contains the whole information of the observations $X_1, \dots, X_n$ regarding $\theta$. A statistic with this property is called sufficient. These kinds of statistics can be used to construct so-called UMVU estimators for the unknown parameter (uniformly minimal variance unbiased).

Suppose that $X$, $Y$ have densities $f_X$, $f_Y$, respectively, and a joint density $f_{X,Y}$. Suppose for simplicity that $f_X(x) > 0$ for all $x$. Then
$$E[\{Y - g(X)\}^2] = \iint \{y - g(x)\}^2 f_{X,Y}(x, y)\, dx\, dy = \min!$$
It suffices to minimize, for each $x$,
$$\int \{y - g(x)\}^2 f_{X,Y}(x, y)\, dy = \min!$$
$$\implies g(x) = \frac{\int y f_{X,Y}(x, y)\, dy}{\int f_{X,Y}(x, y)\, dy} = \frac{\int y f_{X,Y}(x, y)\, dy}{f_X(x)} \quad \text{a.s.}$$
$$\implies E[Y|X] = \frac{\int y f_{X,Y}(X, y)\, dy}{f_X(X)} \ \text{a.s.} \quad \text{and} \quad E[Y|X=x] = \frac{\int y f_{X,Y}(x, y)\, dy}{f_X(x)} \ \text{a.s.}$$
More generally, for arbitrary $f_X$ and for fixed $x$, $E[Y|X=x]$ is the mean of the distribution with density
$$f_{Y|x}(y) = \begin{cases} \frac{f_{X,Y}(x, y)}{f_X(x)}, & \text{if } f_X(x) > 0 \\ \text{any density}, & \text{else,} \end{cases}$$
see e.g. Example 33.5 in Billingsley [4]. This distribution is the conditional distribution of $Y$ given $X = x$. The (random) distribution with density $f_{Y|X}$ is the conditional distribution of $Y$ given $X$. Define $P^{Y|X=x}(A)$ as in $(+)$.

Intuitive interpretation: $E[Y|X=x]$ is a mean in a stochastic model where $X$ is nonrandom and equal to $x \in \mathbb{R}^k$.

Theorem 3.5. For an $\mathbb{R}^k$-valued random variable $X$ and a real-valued random variable $Y$ assume that $EY^2 < \infty$. Then the following are equivalent (TFAE):

(i) $g(X) = E[Y|X]$ a.s.

(ii) $E[\{Y - g(X)\}h(X)] = 0$ for all measurable functions $h$ with $E[h^2(X)] < \infty$.


(iii) $E[\{Y - g(X)\}h(X)] = 0$ for all measurable functions $h: \mathbb{R}^k \to \{0, 1\}$.

Remark:

(iii) can be rewritten as $E[Y 1_{\{X \in B\}}] = E[g(X) 1_{\{X \in B\}}]$ for all (Borel-) sets $B \subseteq \mathbb{R}^k$.

Property (iii) is often used as the definition of a conditional expectation. Note that it does not require that $E[Y^2] < \infty$. It suffices to assume that $E|Y| < \infty$ or $Y \ge 0$. Conditional expectations are (more generally) typically defined under these conditions.

There exists an even more general notion of conditional expectations. Suppose that $Y$ is a random variable defined on a probability space $(\Omega, \mathcal{A}, P)$ with $E|Y| < \infty$. Suppose that $\mathcal{A}_0 \subseteq \mathcal{A}$ is a sub-$\sigma$-field of $\mathcal{A}$. Then the random variable $Y_0 = E[Y|\mathcal{A}_0]$ is defined as an $\mathcal{A}_0$-measurable random variable that fulfills:
$$E[Y_0 1_B] = E[Y 1_B] \quad \text{for all sets } B \in \mathcal{A}_0.$$
Under the additional assumption $E[Y^2] < \infty$ this is equivalent to the minimization property above; see Satz 15.8 in Bauer [3]. This notion of conditional expectations generalizes conditional expectations of the form $E[Y|X]$ because $E[Y|X] = E[Y|\mathcal{A}_0]$ if $\mathcal{A}_0$ is equal to the $\sigma$-field generated by $X$, i.e. $\mathcal{A}_0 = \{X^{-1}(C) : C \text{ measurable}\}$. An example for such conditional expectations are cases of time series where $\mathcal{A}_0$ denotes the $\sigma$-field of events of the past.

Proof. (i) $\Rightarrow$ (ii): $G = g$ minimizes $E[\{Y - G(X)\}^2]$
$$\implies \frac{\partial}{\partial a} E[\{Y - g(X) - a h(X)\}^2]\Big|_{a=0} = 0 \quad \text{for all measurable } h \text{ with } E[h^2(X)] < \infty$$
$$\implies \frac{\partial}{\partial a}\Big( E[\{Y - g(X)\}^2] - 2a\,E[\{Y - g(X)\}h(X)] + a^2\,E h^2(X) \Big)\Big|_{a=0} = 0$$
for all measurable functions $h$ with $E[h^2(X)] < \infty$
$$\implies E[\{Y - g(X)\}h(X)] = 0 \quad \text{for all measurable } h \text{ with } E h^2(X) < \infty.$$
(ii) $\Rightarrow$ (i): Assume that
$$E[\{Y - g(X)\}h(X)] = 0 \quad \text{for all measurable } h \text{ with } E h^2(X) < \infty.$$
This implies, for any competitor $g^*$ with $E[g^*(X)^2] < \infty$,
$$E[\{Y - g^*(X)\}^2] - E[\{Y - g(X)\}^2] = E[-2(g^*(X) - g(X))Y + g^*(X)^2 - g(X)^2]$$
$$= -2E[(g^*(X) - g(X))g(X)] + E g^*(X)^2 - E g(X)^2 = E[(g^*(X) - g(X))^2] \ge 0,$$
i.e. $g$ minimizes $(*)$.

(ii) $\Rightarrow$ (iii) is obvious.

(iii) $\Rightarrow$ (ii): This proof is omitted but can be carried out by approximating functions $h$ with $E[h^2(X)] < \infty$ by functions $h$ that only take a finite number of values; see also Lemma 15.4 in Bauer [3].

Corollary 3.6 (Uniqueness). If $g_1(X)$ and $g_2(X)$ are both conditional expectations of $Y$ given $X$, then
$$g_1(X) = g_2(X) \quad \text{a.s.}$$


Proof. Suppose that there exists an $\varepsilon > 0$ such that $P(g_1(X) - g_2(X) \ge \varepsilon) > 0$. From Theorem 3.5(iii) we get with $h(x) = 1_{\{g_1(x) - g_2(x) \ge \varepsilon\}}$ that
$$0 = E[\{g_1(X) - g_2(X)\}h(X)] \ge \varepsilon\, P(g_1(X) - g_2(X) \ge \varepsilon) > 0,$$
a contradiction. Exchanging the roles of $g_1$ and $g_2$ yields the assertion.

Theorem 3.7 (Iterated expectations). Suppose that $Y$ is a square-integrable, real-valued random variable, $X$ is an $\mathbb{R}^k$-valued random variable and $Z$ is an $\mathbb{R}^l$-valued random variable on a probability space $(\Omega, \mathcal{A}, P)$. Then

(i) $E[E[Y|X]] = EY$,

(ii) $E[E[Y|X,Z]|Z] = E[Y|Z]$ a.s.,

(iii) $E[E[Y|X]|X,Z] = E[Y|X]$ a.s.,

(iv) $E[Y f(X)|X] = f(X)\,E[Y|X]$ a.s., where $f$ is an $\mathbb{R}$-valued function such that $E f^2(X) + E[Y f(X)]^2 < \infty$,

(v) $E[E[Y|X]|f(X)] = E[Y|f(X)]$ a.s.,

(vi) $E[Y|X, Z, X^2, XZ] = E[Y|X,Z]$ a.s., where $k = l = 1$.

Remark: (i)-(iii) are also called Law of Iterated Expectations (LIE) or tower property.

Proof. (ii) We apply Theorem 3.5(ii) again and put $g(X, Z) = E[Y|X,Z]$ as well as
$$f(Z) = E[g(X, Z)|Z].$$
We want to show
$$f(Z) = E[Y|Z] \quad \text{a.s.}$$
This is equivalent to
$$E[f(Z)h(Z)] = E[Y h(Z)] \quad \text{for all measurable } h \text{ with } E h^2(Z) < \infty.$$
The latter follows from
$$E[f(Z)h(Z)] = E[g(X, Z)h(Z)] = E[Y h(Z)],$$
where both equalities hold by Theorem 3.5(ii), applied to $f(Z)$ and to $g(X, Z)$, respectively.

(iii) Exercise.

(iv) Suppose that $g(X)$ is a version of $E[Y|X]$; then $E(h(X)[Y - g(X)]) = 0$ for any square-integrable function $h$. In particular, $E(h(X)f(X)[Y - g(X)]) = 0$ for any function $h$ that takes values 0 or 1 only. The application of Theorem 3.5(iii) finally yields the assertion.

(vi) This follows directly by the application of the definitions of conditional expectations. Now put $W = (X, Z)'$ and $g(W) = (X^2, XZ)'$.

(v) Due to (10) we obtain


Example 3.8 (Application of (vi)).
$$E[Wage \mid education, experience] = \beta_0 + \beta_1\, educ + \beta_2\, exper + \beta_3\, educ \cdot exper + \beta_4\, educ^2$$
$$= E[Wage \mid educ, exper, educ^2, educ \cdot exper] \quad \text{a.s.}$$

Theorem 3.9 (Properties of conditional expectation). Suppose that $Y_1$, $Y_2$ are square-integrable real-valued random variables, $X$ is an $\mathbb{R}^k$-valued random variable on a probability space $(\Omega, \mathcal{A}, P)$ and $a_1$, $a_2$ are scalars. Then

(i) $E[a_1 Y_1 + a_2 Y_2 | X] = a_1 E[Y_1|X] + a_2 E[Y_2|X]$ a.s.,

(ii) $Y_1 \le Y_2 \implies E[Y_1|X] \le E[Y_2|X]$ a.s.,

(iii) $(E[XY|Z])^2 \le E[X^2|Z]\, E[Y^2|Z]$ a.s. for real-valued, square-integrable $X$, $Y$ and another random variable $Z$ on the same space (Cauchy-Schwarz inequality),

(iv) for $\varphi: \mathbb{R} \to \mathbb{R}$ convex with $E[\varphi(Y)^2] < \infty$:
$$\varphi(E[Y|X]) \le E[\varphi(Y)|X] \quad \text{a.s.}$$
(Jensen inequality),

(v) $0 \le Y_n \uparrow Y \implies E[Y_n|X] \uparrow E[Y|X]$ a.s. (monotone convergence),

(vi) $P(|Y| \ge \varepsilon \mid X) \le \varepsilon^{-2}\, E[Y^2|X]$ a.s. for any $\varepsilon > 0$.

Proof. (i) Apply Theorem 3.5(iii) and linearity of the ordinary expectation.

The next theorem describes the relation between independence and conditional distributions.

Theorem 3.10. Suppose that $X$ is an $\mathbb{R}^k$-valued and $Y$ an $\mathbb{R}^l$-valued random variable on a probability space $(\Omega, \mathcal{A}, P)$. Then
$$X, Y \text{ independent} \iff P^{Y|X} = P^Y \text{ a.s.} \implies E[Y|X] = E[Y] \text{ a.s.}$$
Proof. Obviously the second implication follows from the first one in conjunction with the remark below Definition 3.2. Moreover,
$$X, Y \text{ independent} \iff E[1_{\{Y \in B\}} 1_{\{X \in A\}}] = E\big[E[1_{\{Y \in B\}}]\, 1_{\{X \in A\}}\big] \quad \text{for any Borel sets } A, B$$
$$\iff E[1_{\{Y \in B\}}|X] = P(Y \in B) \text{ a.s.} \quad \text{for all Borel sets } B$$
$$\iff P^{Y|X} = P^Y \text{ a.s.}$$


3.2 Conditional Variances

Definition 3.11. For a real-valued random variable $Y$ (with $EY^4 < \infty$) and an $\mathbb{R}^k$-valued random variable $X$ on a probability space $(\Omega, \mathcal{A}, P)$ a conditional variance of $Y$ given $X$ is defined as
$$Var[Y|X] = E[(Y - E[Y|X])^2 \mid X].$$
Theorem 3.12. Under the assumptions of Definition 3.11 it holds:

(i) $Var[a(X)Y + b(X)|X] = a^2(X)\, Var[Y|X]$ a.s., where $a$ and $b$ are measurable functions such that $a(X)Y + b(X)$ satisfies the assumptions of Definition 3.11,

(ii) $Var(Y) = E[Var(Y|X)] + Var(E[Y|X])$,

(iii) $E[Var(Y|X)] \ge E[Var(Y|X,Z)]$ with an additional $\mathbb{R}^l$-valued random variable $Z$ on the same space.

Proof. (i) By Theorem 3.7(iv),
$$Var[a(X)Y + b(X)|X] = E\big(a^2(X)(Y - E(Y \mid X))^2 \mid X\big) = a^2(X)\, Var[Y|X] \quad \text{a.s.}$$
(ii)
$$E[Var(Y|X)] + Var(E[Y|X]) = E\big(E(Y^2|X) - [E(Y \mid X)]^2\big) + E[E(Y \mid X)]^2 - (EY)^2 = Var(Y).$$
(iii) Exercise.
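Identity (ii), the law of total variance, can be checked numerically on any small joint pmf (the particular pmf below is an arbitrary example of ours):

```python
# A small joint pmf for (X, Y): p[(x, y)] = P(X = x, Y = y)
p = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marg_x(x):                       # P(X = x)
    return sum(q for (a, _), q in p.items() if a == x)

def cond_exp_y(x):                   # E[Y | X = x]
    return sum(y * q for (a, y), q in p.items() if a == x) / marg_x(x)

def cond_var_y(x):                   # Var[Y | X = x]
    m = cond_exp_y(x)
    return sum((y - m) ** 2 * q for (a, y), q in p.items() if a == x) / marg_x(x)

ey = sum(y * q for (_, y), q in p.items())
var_y = sum((y - ey) ** 2 * q for (_, y), q in p.items())

# Var(Y) should equal E[Var(Y|X)] + Var(E[Y|X])
e_cond_var = sum(cond_var_y(x) * marg_x(x) for x in (0, 1))
var_cond_exp = sum((cond_exp_y(x) - ey) ** 2 * marg_x(x) for x in (0, 1))
```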


4 Linear regression

Definition 4.1. The classical linear regression model for observations $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, \dots, n$, with (unknown) regression coefficients $\beta_0, \dots, \beta_K$ is given by
$$Y_i = \beta_0 + \beta_1 X_{i,1} + \dots + \beta_K X_{i,K} + \varepsilon_i, \quad i = 1, \dots, n, \quad \text{or in matrix notation} \quad Y = X\beta + \varepsilon$$
with
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_K \end{pmatrix}, \quad X = \begin{pmatrix} 1 & X_{1,1} & \dots & X_{1,K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \dots & X_{n,K} \end{pmatrix} \quad \text{and} \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
Here, $X$ is called design matrix and is assumed to satisfy $P(\mathrm{rank}(X) = K+1) = 1$ (no multicollinearity). The unobserved error $\varepsilon$ is a random vector with

(i) strict exogeneity: $E(\varepsilon_i \mid X) = 0$ a.s. for $i = 1, \dots, n$,

(ii) spherical error variance: $E(\varepsilon\varepsilon' \mid X) = \sigma^2 I_n$ a.s. for some $\sigma^2 > 0$ and $I_n$ denoting the $n$-dimensional identity matrix.

Remark:

$Y$: dependent variable, regressand.

The log-linear model
$$\log(Y_i) = \beta_0 + \beta_1 \log(X_{i,1}) + \dots + \beta_K \log(X_{i,K}) + \varepsilon_i$$
is equivalent to
$$Y_i = \exp(\beta_0)\, X_{i,1}^{\beta_1} \cdots X_{i,K}^{\beta_K} \exp(\varepsilon_i).$$
Strict exogeneity implies $E\varepsilon_i = 0$ (tower property of conditional expectation), which is not restrictive (existence of $\beta_0$).

Homoskedasticity: $E(\varepsilon_i^2 \mid X) = \sigma^2$ a.s., i.e. $var(\varepsilon_i \mid X) = \sigma^2$ a.s.

Note that the conditional expectation of a matrix is understood componentwise.

4.2.1 Estimation of $\beta$

We aim to establish an estimator for the unknown parameter $\beta$. The ordinary least squares estimator (OLS estimator) is the minimizer of $\sum_{i=1}^n (Y_i - X_i'\beta)^2$. In matrix notation we get the following definition.

Definition 4.2. In the classical linear regression model an OLS estimator is given by
$$\hat\beta_{OLS} = \arg\min_{\beta \in \mathbb{R}^{K+1}} (Y - X\beta)'(Y - X\beta).$$
Theorem 4.3. In the classical linear regression model the OLS estimator is unique a.s. and
$$\hat\beta_{OLS} = (X'X)^{-1}X'Y \quad \text{a.s.}$$

Proof. We only work on a set $\Omega_0$ with $P(\Omega_0) = 1$ and such that for each $\omega \in \Omega_0$, $X(\omega)$ has full rank. Set $Q(\beta) = (Y - X\beta)'(Y - X\beta)$. Then
$$\frac{\partial Q(\beta)}{\partial \beta} = -2X'Y + 2X'X\beta$$
and, therefore, $\frac{\partial Q(\beta)}{\partial \beta} = 0_{K+1}$ iff $\beta = (X'X)^{-1}X'Y$. We found the only candidate for a minimizer. It remains to show that $\hat\beta_{OLS}$ indeed minimizes (not maximizes) the function $Q$. Therefore, suppose that $\tilde\beta \ne \hat\beta_{OLS}$ and obtain (with probability 1)
$$Q(\tilde\beta) = Q(\hat\beta_{OLS}) + (\hat\beta_{OLS} - \tilde\beta)'X'X(\hat\beta_{OLS} - \tilde\beta) > Q(\hat\beta_{OLS}).$$
Remark:

Note that if a matrix $A$ has full column rank, then $A'A$ is positive definite. See Exercise.

If we denote by $e = Y - X\hat\beta_{OLS}$ the OLS residuals, the normal equations $X'e = 0_{K+1}$ can be interpreted as the sample analogue of the orthogonality condition $E X_i\varepsilon_i = 0_{K+1}$:
$$\frac{1}{n}\sum_{i=1}^n X_i e_i = 0_{K+1}.$$
Moreover, with $X_i = (1, X_{i,1}, \dots, X_{i,K})'$ we can write
$$\hat\beta_{OLS} = S_{XX}^{-1}\, s_{XY} \quad \text{a.s.}$$
with
$$S_{XX} = \frac{1}{n}\sum_{i=1}^n X_i X_i' \ \text{(sample mean of } X_i X_i'\text{)} \quad \text{and} \quad s_{XY} = \frac{1}{n}\sum_{i=1}^n X_i Y_i \ \text{(sample mean of } X_i Y_i\text{)}.$$
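A minimal numerical sketch (ours, not from the notes) of the OLS estimator via the normal equations, together with the sample orthogonality condition $\frac{1}{n}\sum_i X_i e_i = 0_{K+1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])  # design with intercept
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(scale=0.3, size=n)
Y = X @ beta + eps

# OLS via the normal equations X'X b = X'Y
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ b_ols          # residuals

# sample analogue of the orthogonality condition: (1/n) sum_i X_i e_i = 0
orth = X.T @ e / n
```

The residuals are exactly orthogonal to the columns of the design matrix (up to floating-point error), and the estimate is close to the true coefficient vector.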

The following result summarizes the finite-sample properties of the OLS estimator.

Theorem 4.4 (Gauss-Markov theorem). In the classical linear regression model the OLS estimator is BLUE (best linear unbiased estimator), that is, for any (conditionally) unbiased estimator $\tilde\beta$ of $\beta$ that is linear in $Y$,
$$COV(\tilde\beta \mid X) \ge COV(\hat\beta_{OLS} \mid X) \quad \text{a.s.}$$
Remark: $COV(\tilde\beta \mid X) \ge COV(\hat\beta_{OLS} \mid X)$ means $x'(COV(\tilde\beta \mid X) - COV(\hat\beta_{OLS} \mid X))x \ge 0$ for all $x \in \mathbb{R}^{K+1}$. Taking $x$ as the $k$th unit vector, this implies in particular that $var(\tilde\beta_k \mid X) \ge var(\hat\beta_{k,OLS} \mid X)$.

Proof. Linearity and unbiasedness of the OLS estimator follow immediately from Theorem 4.3 and it remains to prove optimality. To this end, suppose that $\tilde\beta = AY$ is another linear, conditionally unbiased estimator of $\beta$, i.e. $E(AY \mid X) = \beta$ a.s. for all $\beta$, which forces $AX = I_{K+1}$ a.s. This implies
$$COV(\tilde\beta \mid X) - COV(\hat\beta_{OLS} \mid X) = \sigma^2\big[(A - (X'X)^{-1}X')(A - (X'X)^{-1}X')'\big] \ge 0 \quad \text{a.s.}$$


How much of the variation of the dependent variable can be explained by the variability of the regressors? We could use $\|Y - \bar Y_n 1_n\|_2^2 - \|Y - X\hat\beta_{OLS}\|_2^2$. However, this would lead to a scale-dependent measure. Instead let us consider
$$R^2 = 1 - \frac{\|Y - X\hat\beta_{OLS}\|_2^2}{\|Y - \bar Y_n 1_n\|_2^2},$$
which is referred to as coefficient of determination in the literature. This is indeed the fraction of variability of $Y$ that can be explained by $X$ since
$$\underbrace{\sum_{i=1}^n (Y_i - \bar Y_n)^2}_{\text{total variability of } Y} = \underbrace{\sum_{i=1}^n (\hat Y_i - \bar Y_n)^2}_{\text{variability of regression}} + \underbrace{\sum_{i=1}^n e_i^2}_{\text{variability of residuals}} + S_n.$$
Here, $\hat Y_i = X_i'\hat\beta_{OLS}$ and $e_i = Y_i - \hat Y_i$ denotes the $i$th residual, and we use $X'e = 0_{K+1}$ (normal equations) to see that
$$S_n = 2\sum_{i=1}^n (\hat Y_i - \bar Y_n)e_i = 2\Big(\hat\beta_{OLS}'X'e - \bar Y_n \sum_{i=1}^n e_i\Big) = 0 + 0 = 0.$$
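The variance decomposition behind $R^2$ can be verified numerically (an illustration of ours; the simulated design is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b
e = Y - Y_hat
ybar = Y.mean()

tss = ((Y - ybar) ** 2).sum()      # total variability of Y
ess = ((Y_hat - ybar) ** 2).sum()  # variability of regression
rss = (e ** 2).sum()               # variability of residuals
r2 = 1 - rss / tss
```

With an intercept in the model the decomposition tss = ess + rss holds exactly, so $R^2$ lies in $[0, 1]$.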

For the rest of this paragraph, we assume the observations $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, \dots, n$, to be i.i.d. and study the asymptotics of the OLS estimator for $\beta$. To this end, we have to deal with asymptotics for matrices. We can apply our Definition 2.4, just using a matrix norm, e.g. the 2-norm of a $p \times q$ matrix,
$$\|A\| = \sqrt{\textstyle\sum_{i=1}^p \sum_{j=1}^q |a_{ij}|^2}.$$

Theorem 4.5 (Consistency of the OLS estimator). In the classical linear regression model with i.i.d. $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, 2, \dots$, we assume that $EX_1X_1'$ is finite and invertible. Then
$$\hat\beta_{OLS,n} \xrightarrow{P} \beta.$$
Proof. Write
$$\hat\beta_{OLS} = S_{XX}^{-1}\, s_{XY} \quad \text{a.s.}$$
and note that $S_{XX} \xrightarrow{P} EX_1X_1'$ by the WLLN. Using that the inversion of a matrix is a continuous transformation and the CMT for convergence in probability (see Theorem 1.9.5 in van der Vaart and Wellner [13]), we get
$$S_{XX}^{-1} \xrightarrow{P} (EX_1X_1')^{-1}.$$
Again by the WLLN, $s_{XY}$ converges in probability to $EX_1Y_1 = (EX_1X_1')\beta + EX_1\varepsilon_1 = (EX_1X_1')\beta$. Therefore,
$$\hat\beta_{OLS} = \big(S_{XX}^{-1} - (EX_1X_1')^{-1}\big)s_{XY} + (EX_1X_1')^{-1}s_{XY} = o_P(1)\,O_P(1) + (EX_1X_1')^{-1}\big((EX_1X_1')\beta + o_P(1)\big) = \beta + o_P(1).$$

Theorem 4.6 (Asymptotic normality of $\hat\beta_{OLS}$). In the classical linear regression model with i.i.d. $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, 2, \dots$, we assume that $EX_1X_1'$ is finite and invertible. Then
$$\sqrt{n}\,(\hat\beta_{OLS} - \beta) \xrightarrow{d} Z \sim N\big(0_{K+1},\, \sigma^2 (E(X_1X_1'))^{-1}\big).$$
Proof. First note that $\hat\beta_{OLS} - \beta = (X'X)^{-1}X'\varepsilon$ almost surely. Moreover, the multivariate CLT gives
$$\frac{1}{\sqrt{n}}\,X'\varepsilon = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i\varepsilon_i \xrightarrow{d} \tilde Z \sim N\big(0_{K+1},\, \sigma^2\, E(X_1X_1')\big).$$
To sum up, we have
$$\sqrt{n}\,(\hat\beta_{OLS} - \beta) = (n^{-1}X'X)^{-1}\,\frac{1}{\sqrt{n}}\,X'\varepsilon = o_P(1)\,O_P(1) + E(X_1X_1')^{-1}\,\frac{1}{\sqrt{n}}\,X'\varepsilon = o_P(1) + Z_n,$$
where $Z_n = E(X_1X_1')^{-1}\,\frac{1}{\sqrt{n}}\,X'\varepsilon \xrightarrow{d} Z$ by the CMT. Finally, Slutsky's Lemma yields the result.

Suppose that $\sigma^2$ is known. Then the latter result can be used for the construction of confidence intervals for $\beta_k$, $k = 0, \dots, K$. Note that
$$\sqrt{n}\,(\hat\beta_{k,OLS} - \beta_k)\Big/\sqrt{\sigma^2\big[(n^{-1}X'X)^{-1}\big]_{k,k}} \xrightarrow{d} N(0, 1).$$
Define
$$I_n^{(k)} = \Big[\hat\beta_{k,OLS} - z_{1-\alpha/2}\,\sigma\sqrt{\big[(X'X)^{-1}\big]_{k,k}},\ \hat\beta_{k,OLS} + z_{1-\alpha/2}\,\sigma\sqrt{\big[(X'X)^{-1}\big]_{k,k}}\Big],$$
where $z_q$ denotes the $q$-quantile of the standard normal distribution. Hence, $P(\beta_k \in I_n^{(k)}) \xrightarrow{n\to\infty} 1 - \alpha$, i.e. $I_n^{(k)}$ is an asymptotic $(1-\alpha)$-confidence interval for $\beta_k$. It is desirable to provide confidence intervals in the more realistic setting that $\sigma^2$ is unknown.
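A coverage sketch for this interval with known $\sigma^2$ (our own illustration; $z_{0.975} \approx 1.96$ is hard-coded): over many replications the interval should contain the true coefficient in roughly $95\%$ of the cases.

```python
import numpy as np

def ci_slope(X, Y, sigma, z=1.959964):
    """Asymptotic 95%-CI for the slope beta_1 with sigma^2 known; z = 0.975 normal quantile."""
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    half = z * sigma * np.sqrt(np.linalg.inv(X.T @ X)[1, 1])
    return b[1] - half, b[1] + half

rng = np.random.default_rng(7)
n, sigma, beta1 = 500, 1.0, 2.0
reps, hits = 400, 0
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ np.array([1.0, beta1]) + rng.normal(scale=sigma, size=n)
    lo, hi = ci_slope(X, Y, sigma)
    hits += lo <= beta1 <= hi
coverage = hits / reps
```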

4.2.2 Estimation of $\sigma^2$

The aim of this paragraph is to establish an estimator for the variance of the error terms based on the OLS estimator for the regression coefficients.

Definition 4.7. If $n > K+1$, the OLS estimator of the variance $\sigma^2 > 0$ is given by
$$\hat\sigma^2_{OLS} = \frac{e'e}{n - K - 1}.$$
Theorem 4.8. $\hat\sigma^2_{OLS}$ is a (conditionally) unbiased estimator for $\sigma^2$ provided that $n > K+1$.

Proof. We have to show that $E(e'e \mid X) = (n - K - 1)\sigma^2$ almost surely. To this end, first note that $e = (I_n - X(X'X)^{-1}X')\varepsilon$ and hence
$$E(e'e \mid X) = E\big(\varepsilon'(I_n - X(X'X)^{-1}X')'(I_n - X(X'X)^{-1}X')\varepsilon \mid X\big) = E\big(\varepsilon'(I_n - X(X'X)^{-1}X')\varepsilon \mid X\big),$$
since $I_n - X(X'X)^{-1}X'$ is symmetric and idempotent. Due to the spherical error variance assumption the latter term reduces to
$$E(e'e \mid X) = \sigma^2 \sum_{i=1}^n \big(I_n - X(X'X)^{-1}X'\big)_{i,i} = \sigma^2\Big(n - \sum_{i=1}^n \big(X(X'X)^{-1}X'\big)_{i,i}\Big)$$
and it remains to show that the sum on the r.h.s. is equal to $K+1$ almost surely. This in turn follows from
$$\sum_{i=1}^{K+1} \big(X'X(X'X)^{-1}\big)_{i,i} = \sum_{i=1}^{K+1} (I_{K+1})_{i,i} = K+1$$
if we can show that $\sum_{i=1}^n (X(X'X)^{-1}X')_{i,i} = \sum_{i=1}^{K+1} (X'X(X'X)^{-1})_{i,i}$. To see this, let us consider a $p \times q$ matrix $A$ and a $q \times p$ matrix $B$. The trace of a square matrix $C$ is defined as $tr(C) = \sum_i C_{i,i}$. Then it holds
$$tr(AB) = \sum_{i=1}^p \sum_{j=1}^q A_{i,j}B_{j,i} = \sum_{j=1}^q \sum_{i=1}^p B_{j,i}A_{i,j} = tr(BA).$$
Finally, put $A = X(X'X)^{-1}$ and $B = X'$.
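The trace argument at the end of the proof can be checked directly: the hat matrix $X(X'X)^{-1}X'$ has trace $K+1$ for any full-rank design (a numerical sketch of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat/projection matrix
trace_H = np.trace(H)                  # should equal K + 1 = tr(I_{K+1})
trace_M = np.trace(np.eye(n) - H)      # n - K - 1 degrees of freedom
```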

Theorem 4.9. In the classical linear regression model with i.i.d. $(Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, 2, \dots$, we assume that $EX_1X_1'$ is finite and invertible. Then $\hat\sigma^2_{OLS} \xrightarrow{P} \sigma^2$ as $n \to \infty$.

Proof. It suffices to show that $\frac{1}{n}\sum_{i=1}^n e_i^2 \xrightarrow{P} \sigma^2$. By the WLLN, consistency of $\hat\beta_{OLS}$ and since $S_{XX} = O_P(1)$, we have
$$\frac{1}{n}\sum_{i=1}^n e_i^2 = \frac{1}{n}\sum_{i=1}^n \big(\varepsilon_i - X_i'(\hat\beta_{OLS} - \beta)\big)^2 = \frac{1}{n}\sum_{i=1}^n \varepsilon_i^2 - 2(\hat\beta_{OLS} - \beta)'\,\frac{1}{n}\sum_{i=1}^n X_i\varepsilon_i + (\hat\beta_{OLS} - \beta)'\,\frac{1}{n}\sum_{i=1}^n X_iX_i'\,(\hat\beta_{OLS} - \beta) = \sigma^2 + o_P(1).$$

Then the latter result can be used for the construction of confidence intervals for $\beta_k$, $k = 0, \dots, K$, if $\sigma^2$ is unknown. It is left to the reader to show that
$$\hat I_n^{(k)} = \Big[\hat\beta_{k,OLS} - z_{1-\alpha/2}\,\hat\sigma_{OLS}\sqrt{\big[(X'X)^{-1}\big]_{k,k}},\ \hat\beta_{k,OLS} + z_{1-\alpha/2}\,\hat\sigma_{OLS}\sqrt{\big[(X'X)^{-1}\big]_{k,k}}\Big]$$
is an asymptotic $(1-\alpha)$-confidence interval for $\beta_k$.

Example 4.10. A company delivers packages of pasta to the canteen of the University of Mannheim and claims that the weight of a randomly chosen package is $N(5, 0.5)$-distributed. Based on a sample of size $n$ we intend to decide whether this claim is compatible with the data.

Remark:

Tests to decide (a) are called parameter tests.

Definition 4.11. A statistical experiment is a tuple $(X, \mathcal{X}, \mathcal{A}, \{P_\theta \mid \theta \in \Theta\})$, where $X$ is a random variable on the measurable space $(\mathcal{X}, \mathcal{A})$ and $\{P_\theta \mid \theta \in \Theta\}$ is a family of probability measures on this space. We call $\Theta$ the parameter space.

Typically, $X$ is modelling the vector of observations and $\theta$ is an unknown parameter lying in the parameter space $\Theta$. Based on our data $X$ we aim to decide a test problem of the following form:
$$H_0: \theta \in \Theta_0 \quad \text{vs.} \quad H_1: \theta \in \Theta_1 = \Theta \setminus \Theta_0. \qquad (11)$$
Definition 4.12. Let $(X, \mathcal{X}, \mathcal{A}, \{P_\theta \mid \theta \in \Theta\})$ be a statistical experiment and consider the test problem (11). Then a function $\varphi: \mathcal{X} \to \{0, 1\}$ is a (non-randomized) test if
$$\varphi(x) = \begin{cases} 0 & \text{if } X = x \text{ implies acceptance of } H_0 \\ 1 & \text{if } X = x \text{ implies rejection of } H_0. \end{cases}$$


Definition 4.13. Suppose that $(X, \mathcal{X}, \mathcal{A}, \{P_\theta \mid \theta \in \Theta\})$ is a statistical experiment and $\varphi$ is a test to decide the problem (11).

(i) A type I error (error of first kind) occurs when $H_0$ is true but rejected.

(ii) A type II error (error of second kind) occurs when $H_1$ is true but $H_0$ is not rejected.

Example 4.14 (Example 4.10 cont'd). Typically, both errors can occur; see sketch in the lecture. Hence, the idea is to minimize the type II error under the condition that the type I error probability is at most a prescribed level $\alpha$.

Definition 4.15. In the set-up of Definition 4.13,

(i) a test $\varphi$ is an $\alpha$-test if $E_\theta \varphi(X) = P_\theta(\varphi(X) = 1) \le \alpha$ for all $\theta \in \Theta_0$,

(ii) a sequence of tests $(\varphi_n)_n$ is consistent if $P_\theta(\varphi_n(X) = 1) \xrightarrow{n\to\infty} 1$ for all $\theta \in \Theta_1$.

We want to test the following null hypothesis:
$$H_0: R\beta = q$$
for some prescribed $(r \times (K+1))$ matrix $R$ of rank $r$ and a prescribed $r$-dimensional vector $q$.

Example 4.16. This general hypothesis covers several interesting special cases.

1. $H_0: \beta_k = 0$ with $R = (0, \dots, 0, \underbrace{1}_{(k+1)\text{th}}, 0, \dots, 0)$ and $q = 0$.

2. $H_0: \beta_0 = \beta_1$ with $R = (1, -1, 0, \dots, 0)$ and $q = 0$.

3. $H_0: \beta_0 + \beta_1 + \beta_2 = 1$ with $R = (1, 1, 1, 0, \dots, 0)$ and $q = 1$.

4. $H_0: \beta_0 = \beta_1 = \beta_2 = 0$ with
$$R = \begin{pmatrix} 1 & 0 & 0 & 0 & \dots & 0 \\ 0 & 1 & 0 & 0 & \dots & 0 \\ 0 & 0 & 1 & 0 & \dots & 0 \end{pmatrix} \quad \text{and} \quad q = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.$$

Under the conditions of Theorem 4.6, we know that $\sqrt{n}\,(\hat\beta_{OLS} - \beta)$ is asymptotically normal. This property carries over to $\sqrt{n}\,(R\hat\beta_{OLS} - R\beta)$ by the CMT and the fact that linear transformations of normal random variables are normally distributed again. This relation is now invoked to construct a so-called Wald test. For its definition we require knowledge of a certain distribution.

Definition 4.17. The $\chi^2$ distribution with $l$ degrees of freedom, denoted $\chi^2_l$, is the distribution with Riemann density of the form
$$f_{\chi^2_l}(x) = \frac{1}{2^{l/2}\,\Gamma(l/2)}\, x^{l/2 - 1}\exp(-x/2)\,1_{[0,\infty)}(x),$$
where $\Gamma$ denotes the so-called Gamma function, i.e. $\Gamma(a) = \int_0^\infty x^{a-1}e^{-x}\,dx$, $a > 0$. It can be shown that the sum of the squares of $l$ independent standard normal variables is $\chi^2_l$ distributed; see Jacod and Protter [10, Example 6, Chapter 15].


Definition 4.18. A Wald test of level $\alpha \in (0,1)$ for $H_0: R\beta = q$ based on $n$ observations $Z_i = (Y_i, X_{i,1}, \dots, X_{i,K})$, $i = 1, 2, \dots, n$, is given by
$$\varphi_n((Z_1, \dots, Z_n)) = \begin{cases} 0 & \text{if } n\,\hat\sigma_{OLS}^{-2}\,(R\hat\beta_{OLS} - q)'\big(R\,S_{XX}^{-1}\,R'\big)^{-1}(R\hat\beta_{OLS} - q) \in [\chi^2_{r,\alpha/2},\, \chi^2_{r,1-\alpha/2}] \\ 1 & \text{else,} \end{cases}$$
where $\chi^2_{r,p}$ denotes the $p$-quantile of the $\chi^2_r$ distribution.
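A sketch of the Wald statistic (our own implementation of the formula above; the simulated design is arbitrary). Under $H_0$ the statistic is approximately $\chi^2_r$, so for $r = 1$ it should rarely exceed $\chi^2_{1,0.975} \approx 5.02$, while under a violated hypothesis it diverges:

```python
import numpy as np

def wald_stat(X, Y, R, q):
    """n * sigma_hat^{-2} (R b - q)' [R (S_XX)^{-1} R']^{-1} (R b - q)."""
    n = len(Y)
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b
    s2 = e @ e / (n - X.shape[1])        # sigma_hat^2 with n - K - 1 df
    S_inv = np.linalg.inv(X.T @ X / n)   # (S_XX)^{-1}
    d = R @ b - q
    return n * d @ np.linalg.solve(s2 * R @ S_inv @ R.T, d)

rng = np.random.default_rng(11)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.0]) + rng.normal(size=n)  # true beta_1 = 0

R = np.array([[0.0, 1.0]])
W0 = wald_stat(X, Y, R, np.array([0.0]))  # H0: beta_1 = 0 (true)
W1 = wald_stat(X, Y, R, np.array([5.0]))  # H0: beta_1 = 5 (false)
```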

Theorem 4.19. Suppose that $n > K+1$ and that the conditions of Theorem 4.6 hold. Then a sequence of Wald tests $(\varphi_n)_n$ is an asymptotic $\alpha$-test and consistent.

Proof. Step 1: asymptotic level. Under $H_0$, we have to show that
$$T_n = n\,\hat\sigma_{OLS}^{-2}\,(R\hat\beta_{OLS} - q)'\big(R\,S_{XX}^{-1}\,R'\big)^{-1}(R\hat\beta_{OLS} - q) \xrightarrow{d} \chi^2_r.$$
By Theorem 4.6, Theorem 4.9 and the CMT,
$$\sqrt{n}\,\hat\sigma_{OLS}^{-1}\,(R\hat\beta_{OLS} - q) \xrightarrow{d} N\big(0_r,\, R\,(EX_1X_1')^{-1}R'\big).$$
This gives
$$T_n = n\,\hat\sigma_{OLS}^{-2}\,(R\hat\beta_{OLS} - q)'\big(R\,(EX_1X_1')^{-1}R'\big)^{-1}(R\hat\beta_{OLS} - q) + o_P(1)$$
and by Slutsky's lemma it remains to show that
$$n\,\hat\sigma_{OLS}^{-2}\,(R\hat\beta_{OLS} - q)'\big(R\,(EX_1X_1')^{-1}R'\big)^{-1}(R\hat\beta_{OLS} - q) \xrightarrow{d} \chi^2_r.$$
Since $(EX_1X_1')^{-1}$ is symmetric and positive definite and since $R$ has full rank, it can be shown similarly to Exercise 35(i) that $R(EX_1X_1')^{-1}R'$ is symmetric and positive definite and, hence, the same properties hold for its inverse. By Seber [12, 10.8] and references therein we can decompose $R(EX_1X_1')^{-1}R' = [R(EX_1X_1')^{-1}R']^{1/2}[R(EX_1X_1')^{-1}R']^{1/2}$ and $[R(EX_1X_1')^{-1}R']^{-1} = ([R(EX_1X_1')^{-1}R']^{-1})^{1/2}([R(EX_1X_1')^{-1}R']^{-1})^{1/2}$ such that
$$\sqrt{n}\,\hat\sigma_{OLS}^{-1}\,\big[(R(EX_1X_1')^{-1}R')^{-1}\big]^{1/2}(R\hat\beta_{OLS} - q) \xrightarrow{d} \tilde Z \sim N(0_r, I_r).$$
Hence, the CMT gives
$$n\,\hat\sigma_{OLS}^{-2}\,(R\hat\beta_{OLS} - q)'\big(R(EX_1X_1')^{-1}R'\big)^{-1}(R\hat\beta_{OLS} - q) \xrightarrow{d} \tilde Z'\tilde Z \sim \chi^2_r.$$

Step 2: consistency. We show that for each $\varepsilon \in (0,1)$ and each $\beta$ with $R\beta \ne q$ there exists an $n_0 \in \mathbb{N}$ such that
$$P(\varphi_n(Z) = 1) \ge 1 - \varepsilon \quad \text{for all } n \ge n_0.$$
By the triangle inequality,
$$P(\varphi_n(Z) = 1) \ge P\Big(\big\|\sqrt{n}\,\hat\sigma_{OLS}^{-1}\big[(R\,S_{XX}^{-1}R')^{-1}\big]^{1/2}(R\beta - q)\big\|_2 - \big\|\sqrt{n}\,\hat\sigma_{OLS}^{-1}\big[(R\,S_{XX}^{-1}R')^{-1}\big]^{1/2}(R\hat\beta_{OLS} - R\beta)\big\|_2 > \sqrt{\chi^2_{r,1-\alpha/2}}\Big)$$
$$\ge P\Big(\big\|\sqrt{n}\,\hat\sigma_{OLS}^{-1}\big[(R\,S_{XX}^{-1}R')^{-1}\big]^{1/2}(R\beta - q)\big\|_2 > \sqrt{\chi^2_{r,1-\alpha/2}} + K\Big) - P\Big(\big\|\sqrt{n}\,\hat\sigma_{OLS}^{-1}\big[(R\,S_{XX}^{-1}R')^{-1}\big]^{1/2}(R\hat\beta_{OLS} - R\beta)\big\|_2 > K\Big).$$
By step 1 of this proof, the second term can be bounded from above by $\varepsilon/4$ for $n \ge n_0$ if we choose $K$ sufficiently large. Due to consistency of $\hat\sigma^2_{OLS}$ and $S_{XX}$ we have for large $n$
$$P(\varphi_n(Z) = 1) \ge P\Big(\big\|\sqrt{n}\,\big[(R\,S_{XX}^{-1}R')^{-1}\big]^{1/2}(R\beta - q)\big\|_2 > 2\sigma\big(\sqrt{\chi^2_{r,1-\alpha/2}} + K\big)\Big) - \varepsilon/2$$
$$\ge P\Big(\sqrt{n}\,\big\|\big[(R(EX_1X_1')^{-1}R')^{-1}\big]^{1/2}(R\beta - q)\big\|_2 > 4\sigma\big(\sqrt{\chi^2_{r,1-\alpha/2}} + K\big)\Big) - 3\varepsilon/4 \ge 1 - \varepsilon$$
for $n$ sufficiently large, since the left-hand side inside the last probability grows of order $\sqrt{n}$.

Note that the theorem above states asymptotic results only. They might be very unreliable in small

samples. Can we do better?

4.3.3 Hypothesis tests in the classical linear regression model under normality

From now on, we assume that $P^{\varepsilon|X} = N(0_n, \sigma^2 I_n)$. This implies that the conditional distribution of $\varepsilon$ does not depend on $X$ and therefore $X$ and $\varepsilon$ are independent by Theorem 3.10. Hence, the marginal distribution of the errors is also given by $P^\varepsilon = N(0_n, \sigma^2 I_n)$. Moreover, this assumption allows us to construct exact tests.

We consider the test problem $H_0: \beta_k = \beta_k^0$ for a fixed $k \in \{0, \dots, K\}$. First note that under $H_0$
$$\frac{\hat\beta_{OLS,k} - \beta_k^0}{\sigma\sqrt{[(X'X)^{-1}]_{k,k}}} \sim N(0, 1).$$
However, this quantity cannot be used as a test statistic if $\sigma^2$ is unknown. Therefore we substitute the unknown $\sigma$ by its estimate $\hat\sigma_{OLS}$ and consider the t statistic
$$T_n = \frac{\hat\beta_{OLS,k} - \beta_k^0}{\hat\sigma_{OLS}\sqrt{[(X'X)^{-1}]_{k,k}}}.$$

Definition 4.20. The t distribution with $l$ degrees of freedom, denoted $t_l$, is the distribution with Riemann density
$$f_{t_l}(z) = \frac{\Gamma((l+1)/2)}{\sqrt{l\pi}\,\Gamma(l/2)}\Big(1 + \frac{z^2}{l}\Big)^{-(l+1)/2}.$$
It can be shown that
$$\frac{Z_1}{\sqrt{Z_2/l}} \sim t_l,$$
where $Z_1 \sim N(0,1)$ and $Z_2 \sim \chi^2_l$ are independent, see Jacod and Protter [10, Example 12.4].

Theorem 4.21. Suppose that $n > K+1$ and that $P^{\varepsilon|X} = N(0_n, \sigma^2 I_n)$. Then $T_n \sim t_{n-K-1}$ under the null hypothesis.

Proof. (Sketch) We use the representation of the t distribution below Definition 4.20.

1. Modified numerator. As seen above,
$$\frac{\hat\beta_{OLS,k} - \beta_k^0}{\sigma\sqrt{[(X'X)^{-1}]_{k,k}}} \sim N(0, 1).$$
2. Modified denominator. Noting that the OLS residuals are distributed as $\sigma(I_n - X(X'X)^{-1}X')Z$, where $Z \sim N(0_n, I_n)$ is independent of $X$, we have
$$(n - K - 1)\,\frac{\hat\sigma^2_{OLS}}{\sigma^2} = \frac{e'e}{\sigma^2} \stackrel{d}{=} Z'(I_n - X(X'X)^{-1}X')Z.$$
It can be shown that the latter is indeed $\chi^2_{n-K-1}$ distributed. First note that by Davidson and MacKinnon [6, Theorem 4.1(b)]: if $Z \sim N(0_n, I_n)$ and $P$ is an $n$-dimensional projection matrix (i.e. symmetric and $P^2 = P$) of rank $p$, then $Z'PZ \sim \chi^2_p$. Denoting $A(X_1, \dots, X_n) = I_n - X(X'X)^{-1}X'$, we get by the Tonelli-Fubini theorem (Jacod and Protter [10, Theorem 10.3])
$$P\big(Z'(I_n - X(X'X)^{-1}X')Z \le y\big) = \iint 1_{\{z'A(x_1,\dots,x_n)z \le y\}}\, P^Z(dz)\, P^{(X_1,\dots,X_n)}(dx_1, \dots, dx_n) = \int F_{\chi^2_{n-K-1}}(y)\, P^{(X_1,\dots,X_n)}(dx_1, \dots, dx_n) = F_{\chi^2_{n-K-1}}(y),$$
since $A(x_1, \dots, x_n)$ is a projection matrix of rank $n-K-1$ by Seber [12, 4.11] for almost all $(x_1, \dots, x_n)$.

3. Independence of numerator and denominator. A sketch of the proof can be found in Hayashi [9, Proof of Proposition 1.3].

The corresponding t test is given by
$$\varphi((Y_1, X_{1,1}, \dots, X_{1,K}, \dots, Y_n, X_{n,1}, \dots, X_{n,K})) = \begin{cases} 1 & \text{if } T_n \notin [t_{n-K-1,\alpha/2},\, t_{n-K-1,1-\alpha/2}] \\ 0 & \text{else,} \end{cases}$$
where $t_{l,p}$ denotes the $p$-quantile of the $t_l$ distribution. It follows immediately from the previous theorem that this test is an $\alpha$-test.
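A sketch of the t statistic and the resulting test decision (our own helper `t_statistic`; the simulated data are arbitrary):

```python
import numpy as np

def t_statistic(X, Y, k, beta_k0):
    """T_n = (b_k - beta_k0) / (sigma_hat * sqrt([(X'X)^{-1}]_{kk}))."""
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ b
    s = np.sqrt(e @ e / (n - p))          # sigma_hat_OLS, p = K + 1
    se = s * np.sqrt(np.linalg.inv(X.T @ X)[k, k])
    return (b[k] - beta_k0) / se

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

t_true = t_statistic(X, Y, 1, 0.5)   # H0: beta_1 = 0.5 (true)
t_false = t_statistic(X, Y, 1, 0.0)  # H0: beta_1 = 0   (false)
```

One would reject $H_0: \beta_1 = 0$ here, since $|T_n|$ far exceeds the $t_{n-K-1,0.975}$ quantile ($\approx 1.97$ for $198$ degrees of freedom), while $H_0: \beta_1 = 0.5$ is typically retained.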

Lemma 4.22. Under the conditions of the previous theorem and if $(Y_1, X_{1,1}, \dots, X_{1,K})', \dots, (Y_n, X_{n,1}, \dots, X_{n,K})'$ are i.i.d. such that $EX_1X_1'$ is finite and invertible, the t test is consistent.

Proof. Under the alternative $\beta_k \ne \beta_k^0$ we can decompose
$$T_n = \frac{\hat\beta_{OLS,k} - \beta_k}{\hat\sigma_{OLS}\sqrt{[(X'X)^{-1}]_{k,k}}} + \frac{\beta_k - \beta_k^0}{\hat\sigma_{OLS}\sqrt{[(X'X)^{-1}]_{k,k}}} = O_P(1) + \frac{\beta_k - \beta_k^0}{\hat\sigma_{OLS}\sqrt{[(X'X)^{-1}]_{k,k}}}.$$
Hence, for any fixed $\varepsilon > 0$ it holds, with $R_n$ denoting the $O_P(1)$ term in the previous line,
$$P(|T_n| \ge t_{n-K-1,1-\alpha/2}) \ge P\bigg(\frac{|\beta_k - \beta_k^0|}{\hat\sigma_{OLS}\sqrt{[(X'X)^{-1}]_{k,k}}} \ge t_{n-K-1,1-\alpha/2} + K\bigg) - P(|R_n| > K)$$
$$\ge P\bigg(\frac{|\beta_k - \beta_k^0|}{\sqrt{[(X'X)^{-1}]_{k,k}}} \ge 2\sigma\,(t_{n-K-1,1-\alpha/2} + K)\bigg) - \varepsilon/2$$
$$\ge P\Big(\sqrt{n}\,|\beta_k - \beta_k^0| \ge 4\sigma\sqrt{E\big[n(X'X)^{-1}\big]_{k,k}}\,(t_{n-K-1,1-\alpha/2} + K)\Big) - 3\varepsilon/4$$
for sufficiently large $K$ and $n$. We can deduce the assertion of the lemma if $(t_{n-K-1,1-\alpha/2})_n$ is uniformly bounded in $n$. To this end, recall that any $t_n$-distributed random variable can be written as
$$Z_0\Big/\sqrt{\frac{1}{n}\sum_{k=1}^n Z_k^2} \xrightarrow{d} Z \sim N(0, 1)$$
for i.i.d. standard normal $Z_0, Z_1, \dots$

Remark:

More generally it can be shown that under the conditions above
$$T_n^{(c)} = \frac{c'(\hat\beta_{OLS} - \beta)}{\hat\sigma_{OLS}\sqrt{c'(X'X)^{-1}c}} \sim t_{n-(K+1)} \quad \text{for all } c \in \mathbb{R}^{K+1}\setminus\{0\}.$$
Hence, we can establish a t test for Problems 2 and 3 described in Example 4.16.

Under the normality assumption we can also provide a test for the general hypothesis of paragraph 4.3.2. One can show that the finite sample distribution of the corresponding test statistic divided by $r$ has a so-called F distribution with $r$ and $n-K-1$ degrees of freedom, which is defined as the distribution of $(Z_1/r)/(Z_2/(n-K-1))$ with independent $Z_1 \sim \chi^2_r$ and $Z_2 \sim \chi^2_{n-K-1}$. Hence we can again construct an $\alpha$-test.
