
ECON509 Probability and Statistics

Slides 5

Bilkent

This Version: 25 November 2013

(Bilkent) ECON509 This Version: 25 November 2013 1 / 94


Introduction

In this part of the lecture notes, we will focus on properties of random samples and
consider some important statistics and their distributions, which you will encounter
in your future econometrics courses.
In addition, we will introduce key convergence concepts, which are at the
foundation of asymptotic theory. You might find these concepts a bit too abstract,
but they are used in econometrics pretty frequently.

(Bilkent) ECON509 This Version: 25 November 2013 2 / 94


Basic Concepts of Random Samples

We start with a definition.


Definition (5.1.1): The random variables X₁, ..., Xₙ are called a random sample of
size n from the population f(x) if X₁, ..., Xₙ are mutually independent random
variables and the marginal pdf or pmf of each Xᵢ is the same function f(x).
Alternatively, X₁, ..., Xₙ are called independent and identically distributed random
variables with pdf or pmf f(x). This is commonly abbreviated to iid random
variables.
In many experiments there are n > 1 repeated observations made on the variable,
where X₁ is the first observation, X₂ is the second observation, etc.
Under the above definition, each Xᵢ is an observation on the same variable and each
Xᵢ has a marginal distribution given by f(x).
In addition, the value of one observation has no effect on or relationship with any of
the other observations.
X₁, ..., Xₙ are mutually independent.

(Bilkent) ECON509 This Version: 25 November 2013 3 / 94


Basic Concepts of Random Samples

Then, the joint pdf or pmf is given by


f(x₁, ..., xₙ | θ) = f(x₁|θ) f(x₂|θ) ⋯ f(xₙ|θ) = ∏_{i=1}^n f(xᵢ|θ),

where we assume that the population pdf or pmf is a member of a parametric family,
and θ is the vector of parameters.
The random sampling model in Definition 5.1.1 is sometimes called sampling from
an infinite population.
Suppose we obtain the values of X₁, ..., Xₙ sequentially.
First, the experiment is performed and X₁ = x₁ is observed.
Then, the experiment is repeated and X₂ = x₂ is observed.
The assumption of independence implies that the probability distribution for X₂ is
unaffected by the fact that X₁ = x₁ was observed first.

(Bilkent) ECON509 This Version: 25 November 2013 4 / 94


Sums of Random Variables from a Random Sample

Suppose we have drawn a sample (X1 , ..., Xn ) from the population. We might want
to obtain some summary statistics.
This summary might be defined as a function

T (x1 , ..., xn ),

which might be real or vector-valued.


So,

Y = T(X₁, ..., Xₙ)

is a random variable or a random vector.
We can use techniques similar to those introduced for functions of random variables
to investigate the distributional properties of Y.
Thanks to (X₁, ..., Xₙ) possessing the iid property, the distribution of Y will be
tractable.
This distribution is usually derived from the distribution of the variables in the
random sample. Hence, it is called the sampling distribution of Y .

(Bilkent) ECON509 This Version: 25 November 2013 5 / 94


Sums of Random Variables from a Random Sample
Definition (5.2.1): Let X₁, ..., Xₙ be a random sample of size n from a population
and let T(x₁, ..., xₙ) be a real-valued or vector-valued function whose domain
includes the sample space of (X₁, ..., Xₙ). Then, the random variable or random
vector Y = T(X₁, ..., Xₙ) is called a statistic. The probability distribution of a
statistic Y is called the sampling distribution of Y.
Let’s consider some commonly used statistics.
Definition (5.2.2): The sample mean is the arithmetic average of the values in a
random sample. It is usually denoted by

X̄ = (1/n) ∑_{i=1}^n Xᵢ.

Definition (5.2.3): The sample variance is the statistic defined by

S² = (1/(n−1)) ∑_{i=1}^n (Xᵢ − X̄)².

The sample standard deviation is the statistic defined by

S = √(S²).
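As a small illustration (added here, not part of the original notes), the following Python sketch computes the three statistics above for a simulated sample; the sample size n = 500 and the N(2, 9) population are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(509)                 # seed chosen arbitrarily
    x = rng.normal(loc=2.0, scale=3.0, size=500)     # random sample from N(2, 9)

    n = x.size
    x_bar = x.sum() / n                              # sample mean, Definition (5.2.2)
    s2 = ((x - x_bar) ** 2).sum() / (n - 1)          # sample variance, Definition (5.2.3)
    s = np.sqrt(s2)                                  # sample standard deviation

    # numpy's ddof=1 option uses the same (n - 1) divisor
    assert np.isclose(s2, x.var(ddof=1))
    print(x_bar, s2, s)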

(Bilkent) ECON509 This Version: 25 November 2013 6 / 94


Sums of Random Variables from a Random Sample

Theorem (5.2.4): Let x₁, ..., xₙ be any numbers and x̄ = (x₁ + ... + xₙ)/n. Then,
(a) min_a ∑_{i=1}^n (xᵢ − a)² = ∑_{i=1}^n (xᵢ − x̄)²,
(b) (n−1)s² = ∑_{i=1}^n (xᵢ − x̄)² = ∑_{i=1}^n xᵢ² − n x̄².
Proof: To prove part (a), consider

∑_{i=1}^n (xᵢ − a)² = ∑_{i=1}^n (xᵢ − x̄ + x̄ − a)²
                    = ∑_{i=1}^n (xᵢ − x̄)² + 2 ∑_{i=1}^n (xᵢ − x̄)(x̄ − a) + ∑_{i=1}^n (x̄ − a)²
                    = ∑_{i=1}^n (xᵢ − x̄)² + ∑_{i=1}^n (x̄ − a)²,

where the cross term vanishes because ∑_{i=1}^n (xᵢ − x̄) = 0.

Clearly, the value of a that minimises ∑_{i=1}^n (xᵢ − a)² is given by

a = x̄.

Part (b) can easily be proved by expanding the binomial and taking the sum.
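As a quick numerical check (an added illustration, not from the slides), the sketch below verifies both parts of Theorem (5.2.4) on an arbitrary simulated data set; the grid search over a is only meant to make part (a) visible.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=20)                     # any numbers will do
    x_bar = x.mean()

    def sum_sq(a):
        return ((x - a) ** 2).sum()

    # part (a): the criterion is never below its value at a = x_bar
    grid = np.linspace(x_bar - 2, x_bar + 2, 2001)
    assert all(sum_sq(a) >= sum_sq(x_bar) - 1e-12 for a in grid)

    # part (b): (n - 1) s^2 = sum (x_i - x_bar)^2 = sum x_i^2 - n x_bar^2
    lhs = (len(x) - 1) * x.var(ddof=1)
    assert np.isclose(lhs, sum_sq(x_bar))
    assert np.isclose(lhs, (x ** 2).sum() - len(x) * x_bar ** 2)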

(Bilkent) ECON509 This Version: 25 November 2013 7 / 94


Sums of Random Variables from a Random Sample

Lemma (5.2.5): Let X₁, ..., Xₙ be a random sample from a population and let g(x)
be a function such that E[g(X₁)] and Var(g(X₁)) exist. Then

E[ ∑_{i=1}^n g(Xᵢ) ] = n E[g(X₁)]

and

Var( ∑_{i=1}^n g(Xᵢ) ) = n Var(g(X₁)).

Proof: This is straightforward. First,

E[ ∑_{i=1}^n g(Xᵢ) ] = ∑_{i=1}^n E[g(Xᵢ)] = ∑_{i=1}^n E[g(X₁)] = n E[g(X₁)],

since the Xᵢ are identically distributed. Note that independence is not required here.

(Bilkent) ECON509 This Version: 25 November 2013 8 / 94


Sums of Random Variables from a Random Sample

Then,

Var( ∑_{i=1}^n g(Xᵢ) ) = ∑_{i=1}^n Var(g(Xᵢ)) + ∑∑_{i≠j} Cov(g(Xᵢ), g(Xⱼ))
                      = ∑_{i=1}^n Var(g(Xᵢ)) + 0 = ∑_{i=1}^n Var(g(X₁)) = n Var(g(X₁)),

where we have used the fact that

Cov(g(Xᵢ), g(Xⱼ)) = 0 for all i ≠ j,

due to independence, and that Var(g(Xᵢ)) is the same for all Xᵢ, due to their
distributions being identical.

(Bilkent) ECON509 This Version: 25 November 2013 9 / 94


Sums of Random Variables from a Random Sample

Theorem (5.2.6): Let X₁, ..., Xₙ be a random sample from a population with mean
µ and variance σ² < ∞. Then,
1. E[X̄] = µ,
2. Var(X̄) = σ²/n,
3. E[S²] = σ².
Proof: Now,

E[X̄] = E[ (1/n) ∑_{i=1}^n Xᵢ ] = (1/n) ∑_{i=1}^n E[Xᵢ] = (1/n) nµ = µ.

Then,

Var(X̄) = Var( (1/n) ∑_{i=1}^n Xᵢ ) = (1/n²) ∑_{i=1}^n Var(Xᵢ) = (1/n²) nσ² = σ²/n.

(Bilkent) ECON509 This Version: 25 November 2013 10 / 94


Sums of Random Variables from a Random Sample

Finally,

E[S²] = E[ (1/(n−1)) ∑_{i=1}^n (Xᵢ − X̄)² ] = (1/(n−1)) E[ ∑_{i=1}^n Xᵢ² − n X̄² ]
      = (1/(n−1)) ( n E[X₁²] − n E[X̄²] )
      = (1/(n−1)) ( n(σ² + µ²) − n(σ²/n + µ²) )
      = (1/(n−1)) (n−1) σ² = σ².

An aside: You might already know that, since E[X̄] = µ and E[S²] = σ², one
would refer to X̄ and S² as unbiased estimators for µ and σ², respectively.
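A small Monte Carlo sketch (an added illustration; the N(1.5, 4) population, n = 10 and the number of replications are arbitrary choices) shows Theorem (5.2.6) at work:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma2, n, reps = 1.5, 4.0, 10, 200_000

    samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    x_bar = samples.mean(axis=1)
    s2 = samples.var(axis=1, ddof=1)

    print(x_bar.mean())    # close to mu          (E[X_bar] = mu)
    print(x_bar.var())     # close to sigma2 / n  (Var(X_bar) = sigma^2 / n)
    print(s2.mean())       # close to sigma2      (E[S^2] = sigma^2)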

(Bilkent) ECON509 This Version: 25 November 2013 11 / 94


Sums of Random Variables from a Random Sample

Now, let Y = X₁ + ... + Xₙ and observe that

M_X̄(t) = E[e^{tX̄}] = E[e^{t(X₁+...+Xₙ)/n}] = E[e^{(t/n)Y}] = M_Y(t/n).

Of course, since X₁, ..., Xₙ are identically distributed, M_{Xᵢ}(t) is the same function
for each i. Therefore, we have the following result.
Theorem (5.2.7): Let X1 , ..., Xn be a random sample from a population with mgf
MX (t ). Then, the mgf of the sample mean is

MX̄ (t ) = [MX (t/n )]n .

The proof is an application of Theorem (4.6.7).


Proof: Remember that Theorem (4.6.7) says that if X1 , ..., Xn are mutually
independent random variables with mgfs MX 1 (t ), ..., MX n (t ) and if

Z = X1 + ... + Xn ,

then, the mgf of Z is


MZ (t ) = MX 1 (t ) ... MX n (t ).

(Bilkent) ECON509 This Version: 25 November 2013 12 / 94


Sums of Random Variables from a Random Sample

In particular, if X1 , ..., Xn all have the same distribution with mgf MX (t ), then

MZ (t ) = [MX (t )]n .

Now the sequence we have is actually X1 /n, ..., Xn /n. Observe that if

E [e tX ] = MX (t ),

then,

E[e^{t(X/n)}] = E[e^{(t/n)X}] = M_X(t/n).
Then, for Z = X̄ , Theorem (4.6.7) gives

MZ (t ) = MX̄ (t ) = [MX (t/n )]n .

(Bilkent) ECON509 This Version: 25 November 2013 13 / 94


Sums of Random Variables from a Random Sample
Example (5.2.8) (Distribution of the mean): Let X₁, ..., Xₙ be a random sample
from a N(µ, σ²) population. Then the mgf of the sample mean is

M_X̄(t) = [ exp( µ(t/n) + σ²(t/n)²/2 ) ]ⁿ
        = exp( n[ µ(t/n) + σ²(t/n)²/2 ] ) = exp( µt + (σ²/n)t²/2 ).

Thus, X̄ ∼ N(µ, σ²/n).
In this example, it was helpful to use Theorem (5.2.7) because the expression for
M_X̄(t) turned out to be a familiar mgf. It cannot, of course, be guaranteed that this
will always be the case for any X̄. However, when this is the case, this result makes
derivation of the distribution of X̄ very easy.
Another example is the sample mean for a gamma(α, β) sample. The mgf of X̄ in
this case is given by

M_X̄(t) = [ ( 1/(1 − β(t/n)) )^α ]ⁿ = ( 1/(1 − (β/n)t) )^{nα},

which reveals that X̄ ∼ gamma(nα, β/n).
Of course, there are cases where this method is not useful, due to mgf of X̄ not
being recognisable. Remedies for such cases are considered in Casella and Berger
(2001, pp. 215-217). We will not cover these.
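To illustrate the gamma result (an added sketch, not from the slides; the parameter values are arbitrary), one can simulate sample means and compare their quantiles with those of the claimed gamma(nα, β/n) distribution:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    alpha, beta, n, reps = 2.0, 3.0, 5, 100_000      # gamma(alpha, beta), beta = scale

    means = rng.gamma(shape=alpha, scale=beta, size=(reps, n)).mean(axis=1)

    target = stats.gamma(a=n * alpha, scale=beta / n)
    for q in (0.1, 0.5, 0.9):
        print(np.quantile(means, q), target.ppf(q))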
(Bilkent) ECON509 This Version: 25 November 2013 14 / 94
Sampling from the Normal Distribution

Now, we focus on the case where the population distribution is the normal
distribution.
The results we introduce in this section will be very useful when you deal with linear
regression models.
We have already talked about distributional properties of the sample mean and
variance. Given the extra assumption of normality, we are now in a position to
determine their full distributions.
The chi squared distribution will come up frequently within this context, so we start
by introducing some important properties related to this distribution.
Remember that the chi squared pdf is a special case of the gamma pdf and is given
by

f(x) = ( 1/(Γ(p/2) 2^{p/2}) ) x^{(p/2)−1} e^{−x/2},   0 < x < ∞,

where p is called the degrees of freedom.

(Bilkent) ECON509 This Version: 25 November 2013 15 / 94


Sampling from the Normal Distribution

Lemma (5.3.2) (Facts about chi squared random variables): We use the notation
χ²_p to denote a chi squared random variable with p degrees of freedom.
a. If Z is a N(0, 1) random variable, then
Z² ∼ χ²₁.
In other words, the square of a standard normal random variable is a chi squared
random variable.
b. If X₁, ..., Xₙ are independent and Xᵢ ∼ χ²_{pᵢ}, then

X₁ + ... + Xₙ ∼ χ²_{p₁+...+pₙ};


that is, independent chi squared variables add to a chi squared variable, and the
degrees of freedom also add.
To prove the second part, remember that a χ²_p random variable is a gamma(p/2, 2)
random variable. By the result obtained in Example (4.6.8), the sum of independent
gamma(αᵢ, β) random variables is a gamma(α₁ + ... + αₙ, β) random variable.
Then, X₁ + ... + Xₙ above is a gamma((p₁ + ... + pₙ)/2, 2) random variable, which
is a χ²_{p₁+...+pₙ} random variable.
Part (a) can be proved by obtaining the pdf of the transformation Y = Z² and then
confirming that this is the pdf of a χ²₁ random variable.
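Both facts can be checked by simulation. The sketch below (an added illustration; the degrees of freedom and replication count are arbitrary) compares empirical quantiles of Z² and of a sum of independent chi squared draws with the corresponding theoretical quantiles:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    reps = 200_000

    z2 = rng.standard_normal(reps) ** 2                     # part (a): Z^2 should be chi^2_1
    s = rng.chisquare(3, reps) + rng.chisquare(5, reps)     # part (b): chi^2_3 + chi^2_5 should be chi^2_8

    for q in (0.5, 0.9):
        print(np.quantile(z2, q), stats.chi2(df=1).ppf(q))
        print(np.quantile(s, q), stats.chi2(df=8).ppf(q))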

(Bilkent) ECON509 This Version: 25 November 2013 16 / 94


Sampling from the Normal Distribution

Theorem (5.3.1): Let X₁, ..., Xₙ be a random sample from a N(µ, σ²) distribution
and let

X̄ = (1/n) ∑_{i=1}^n Xᵢ  and  S² = (1/(n−1)) ∑_{i=1}^n (Xᵢ − X̄)².

Then,
a. X̄ and S² are independent random variables,
b. X̄ has a N(µ, σ²/n) distribution,
c. (n−1)S²/σ² has a chi squared distribution with n−1 degrees of freedom.
Proof: In what follows, we will focus on the case where µ = 0 and σ² = 1, without
loss of generality, as N(0, 1) belongs to the location–scale family given by N(µ, σ²).
Note that we have already shown the second result in Example (5.2.8). This leaves
the first and third results. Start with

S² = (1/(n−1)) ∑_{i=1}^n (Xᵢ − X̄)² = (1/(n−1)) [ (X₁ − X̄)² + ∑_{i=2}^n (Xᵢ − X̄)² ].

(Bilkent) ECON509 This Version: 25 November 2013 17 / 94


Sampling from the Normal Distribution

Now, observe that

∑_{i=1}^n (Xᵢ − X̄) = 0,

by definition. So,

(X₁ − X̄) + ∑_{i=2}^n (Xᵢ − X̄) = 0  ⟹  (X₁ − X̄)² = [ ∑_{i=2}^n (Xᵢ − X̄) ]².

Therefore,

S² = (1/(n−1)) ∑_{i=1}^n (Xᵢ − X̄)² = (1/(n−1)) { [ ∑_{i=2}^n (Xᵢ − X̄) ]² + ∑_{i=2}^n (Xᵢ − X̄)² }.

This implies that S² can be written as a function of (X₂ − X̄, ..., Xₙ − X̄).

(Bilkent) ECON509 This Version: 25 November 2013 18 / 94


Sampling from the Normal Distribution
The joint pdf of X₁, ..., Xₙ is given by

f(x₁, ..., xₙ) = (1/(2π)^{n/2}) exp( −(1/2) ∑_{i=1}^n xᵢ² ),   −∞ < xᵢ < ∞.

Consider the transformation

y₁ = x̄,
y₂ = x₂ − x̄,
⋮
yₙ = xₙ − x̄.

This yields

x₁ = y₁ − y₂ − y₃ − ... − yₙ,
x₂ = y₁ + y₂,
x₃ = y₁ + y₃,
⋮
xₙ = y₁ + yₙ.

One can show that the Jacobian for this linear transformation is equal to n.
(Bilkent) ECON509 This Version: 25 November 2013 19 / 94
Sampling from the Normal Distribution

Then,

f(y₁, ..., yₙ) = (n/(2π)^{n/2}) exp( −(1/2)[ (y₁ − ∑_{i=2}^n yᵢ)² + ∑_{i=2}^n (y₁ + yᵢ)² ] )
              = [ (n/(2π))^{1/2} e^{−n y₁²/2} ] [ (n^{1/2}/(2π)^{(n−1)/2}) e^{−(1/2)[ ∑_{i=2}^n yᵢ² + (∑_{i=2}^n yᵢ)² ]} ],

where −∞ < yᵢ < ∞.


Clearly, the joint pdf factors into a function of Y1 and a function of Y2 , ..., Yn .
Therefore, by Theorem 4.6.11, Y1 is independent of Y2 , ..., Yn . By Theorem 4.6.12,
this shows that X̄ is independent of S 2 .
Now, on to (c)!

(Bilkent) ECON509 This Version: 25 November 2013 20 / 94


Sampling from the Normal Distribution

Let X̄ₖ and Sₖ² denote the sample mean and sample variance based on the first k
observations.
You will be asked to show in a homework assignment that

(n−1)Sₙ² = (n−2)S²_{n−1} + ((n−1)/n)(Xₙ − X̄_{n−1})².   (1)

Note that X̄₁ = X₁. Then, for n = 2 we have

S₂² = 0·S₁² + (1/2)(X₂ − X₁)².

Now, 2^{−1/2}(X₂ − X₁) ∼ N(0, 1), so, by part (a) of Lemma (5.3.2), S₂² ∼ χ²₁.
We then propose the induction hypothesis that for n = k,

(k−1)Sₖ² ∼ χ²_{k−1}.

(Bilkent) ECON509 This Version: 25 November 2013 21 / 94


Sampling from the Normal Distribution

For n = k + 1, we would have, by (1),

kS²_{k+1} = (k−1)Sₖ² + (k/(k+1))(X_{k+1} − X̄ₖ)².

If the induction hypothesis is correct, (k−1)Sₖ² ∼ χ²_{k−1}.
What if we could prove that

(k/(k+1))(X_{k+1} − X̄ₖ)² ∼ χ²₁,   (2)

independent of Sₖ²? Then, by part (b) of Lemma (5.3.2), kS²_{k+1} will be χ²ₖ! So,
proving (2) will finally prove part (c) of the Theorem.
First, notice that the vector (X_{k+1}, X̄ₖ) is independent of Sₖ², since we have already
shown that X̄ₖ is independent of Sₖ². Moreover, Sₖ² is a function of X₁, ..., Xₖ and
so is independent of X_{k+1}.

(Bilkent) ECON509 This Version: 25 November 2013 22 / 94


Sampling from the Normal Distribution

Now, (X_{k+1} − X̄ₖ) is a linear function of normal random variables, so it is also
normal. In addition,

E[X_{k+1} − X̄ₖ] = E[X_{k+1}] − E[(X₁ + ... + Xₖ)/k] = µ − kµ/k = 0,
Var(X_{k+1} − X̄ₖ) = Var(X_{k+1}) + Var(X̄ₖ) = 1 + 1/k = (k+1)/k.

Therefore,

(X_{k+1} − X̄ₖ) ∼ N(0, (k+1)/k)  and  √(k/(k+1)) (X_{k+1} − X̄ₖ) ∼ N(0, 1).

But then, this shows that

(k/(k+1))(X_{k+1} − X̄ₖ)² ∼ χ²₁,

which is the result we were looking for. This completes the proof.
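A simulation sketch of Theorem (5.3.1) (added for illustration; µ = 0, σ = 2 and n = 8 are arbitrary choices): the scaled sample variance should behave like a χ²_{n−1} draw and should be uncorrelated with the sample mean.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    mu, sigma, n, reps = 0.0, 2.0, 8, 100_000

    x = rng.normal(mu, sigma, size=(reps, n))
    x_bar = x.mean(axis=1)
    s2 = x.var(axis=1, ddof=1)

    q = (n - 1) * s2 / sigma ** 2                    # part (c): should be chi^2_{n-1}
    print(np.quantile(q, [0.25, 0.5, 0.75]))
    print(stats.chi2(df=n - 1).ppf([0.25, 0.5, 0.75]))

    print(np.corrcoef(x_bar, s2)[0, 1])              # part (a): correlation near zero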

(Bilkent) ECON509 This Version: 25 November 2013 23 / 94


Sampling from the Normal Distribution

A nice result about jointly normally distributed random variables is that zero covariance
implies independence, which does not necessarily hold for other distribution
functions.
Lemma (5.3.3): Let Xⱼ ∼ N(µⱼ, σⱼ²), j = 1, ..., n, be independent. For constants aᵢⱼ and
bᵣⱼ (j = 1, ..., n; i = 1, ..., k; r = 1, ..., m), where k + m ≤ n, define

Uᵢ = ∑_{j=1}^n aᵢⱼ Xⱼ,   i = 1, ..., k,
Vᵣ = ∑_{j=1}^n bᵣⱼ Xⱼ,   r = 1, ..., m.

a. The random variables Uᵢ and Vᵣ are independent if and only if Cov(Uᵢ, Vᵣ) = 0.
Furthermore, Cov(Uᵢ, Vᵣ) = ∑_{j=1}^n aᵢⱼ bᵣⱼ σⱼ².
b. The random vectors (U₁, ..., Uₖ) and (V₁, ..., Vₘ) are independent if and only if Uᵢ is
independent of Vᵣ for all pairs i, r (i = 1, ..., k; r = 1, ..., m).

(Bilkent) ECON509 This Version: 25 November 2013 24 / 94


Sampling from the Normal Distribution

Again, we consider the case where Xⱼ ∼ N(0, 1), for simplicity. Then,

Cov(Uᵢ, Vᵣ) = E[Uᵢ Vᵣ] = E[ (∑_{j=1}^n aᵢⱼ Xⱼ)(∑_{j=1}^n bᵣⱼ Xⱼ) ] = E[ ∑_{j=1}^n aᵢⱼ bᵣⱼ Xⱼ² ] = ∑_{j=1}^n aᵢⱼ bᵣⱼ,

due to independence (the cross terms have zero expectation). The implication from
independence to zero covariance is also immediate (Theorem (4.5.5)). In addition, since
Uᵢ and Vᵣ are linear combinations of normal random variables, they are also normally
distributed (Corollary 4.6.10).
Now, it is a bit more involved to show that zero covariance indeed implies independence.
Consider the case where n = 2, as the more general proof will be similar in spirit but more
complicated.
The joint pdf is given by

f_{X₁,X₂}(x₁, x₂) = (1/(2π)) e^{−(1/2)(x₁² + x₂²)},   −∞ < x₁, x₂ < ∞.
(Bilkent) ECON509 This Version: 25 November 2013 25 / 94


Sampling from the Normal Distribution

Consider the transformation given by

u = a₁x₁ + a₂x₂  and  v = b₁x₁ + b₂x₂,

which implies

x₁ = (b₂u − a₂v)/(a₁b₂ − b₁a₂)  and  x₂ = (a₁v − b₁u)/(a₁b₂ − b₁a₂).

The Jacobian of the transformation is

det [ b₂/(a₁b₂ − b₁a₂)   −a₂/(a₁b₂ − b₁a₂) ;  −b₁/(a₁b₂ − b₁a₂)   a₁/(a₁b₂ − b₁a₂) ]
  = (b₂a₁ − b₁a₂)/(a₁b₂ − b₁a₂)² = 1/(a₁b₂ − b₁a₂).

Then, the new joint pdf is

f_{U,V}(u, v) = f_{X₁,X₂}( (b₂u − a₂v)/(a₁b₂ − b₁a₂), (a₁v − b₁u)/(a₁b₂ − b₁a₂) ) · |1/(a₁b₂ − b₁a₂)|.

(Bilkent) ECON509 This Version: 25 November 2013 26 / 94


Sampling from the Normal Distribution

Therefore,

f_{U,V}(u, v) = (1/(2π)) exp{ −(1/(2(a₁b₂ − b₁a₂)²)) [ (b₂u − a₂v)² + (a₁v − b₁u)² ] } · |1/(a₁b₂ − b₁a₂)|,

where −∞ < u, v < ∞.
Now,

(b₂u − a₂v)² + (a₁v − b₁u)² = (b₁² + b₂²)u² + (a₁² + a₂²)v² − 2(a₁b₁ + a₂b₂)uv.

But under the zero-covariance assumption we have a₁b₁ + a₂b₂ = 0. Hence, this shows
that the joint pdf factorises into a function of u and a function of v. Therefore, U and V
are independent!
A similar technique can be utilised to prove part (b). Specifically, by using a
transformation argument, it can be shown that the joint pdf of the vectors (U₁, ..., Uₖ)
and (V₁, ..., Vₘ) will factorise, which will prove independence. We skip this.

(Bilkent) ECON509 This Version: 25 November 2013 27 / 94


Sampling from the Normal Distribution

The main message here is that, if we start with independent normal random
variables, zero covariance and independence are equivalent for linear functions of
these random variables. Therefore, checking independence comes down to checking
covariances.
Part (b), on the other hand, makes it possible to infer overall independence of
normal vectors by just checking pairwise independence, which is not valid for general
random variables.
Thinking about part (a), I find this intuitively easier to accept once one remembers that
the normal distribution is completely determined by its mean and variance, while other
moments do not matter. Hence, ensuring zero covariance is equivalent to ensuring
independence, since we do not have to care about the remaining moments of the
distribution.

(Bilkent) ECON509 This Version: 25 November 2013 28 / 94


Sampling from the Normal Distribution

We can show the usefulness of Lemma 5.3.3 by trying to prove the independence of
X̄ and S 2 when X1 , ..., Xn is sampled from a normal population distribution.
Remember that we can write S² as a function of (X₂ − X̄, ..., Xₙ − X̄). Now, if we
can show that these random variables are not correlated with X̄, then by the
normality assumption and Lemma 5.3.3, we can conclude independence.
We have

X̄ = (1/n) ∑_{i=1}^n Xᵢ  and  Xⱼ − X̄ = ∑_{i=1}^n (δᵢⱼ − 1/n) Xᵢ,

where

δᵢⱼ = 1 if i = j and δᵢⱼ = 0 otherwise.

(Bilkent) ECON509 This Version: 25 November 2013 29 / 94


Sampling from the Normal Distribution

Again, for simplicity consider Xⱼ ∼ N(0, 1) for all j. Then,

Cov(X̄, Xⱼ − X̄) = E[ ( (1/n) ∑_{i=1}^n Xᵢ ) ( ∑_{i=1}^n (δᵢⱼ − 1/n) Xᵢ ) ]
               = E[ (1/n) ∑_{i=1}^n (δᵢⱼ − 1/n) Xᵢ² ]
               = (σ²/n) ∑_{i=1}^n (δᵢⱼ − 1/n).

Now,

∑_{i=1}^n (δᵢⱼ − 1/n) = 1 − ∑_{i=1}^n (1/n) = 1 − 1 = 0.

Hence,

Cov(X̄, Xⱼ − X̄) = 0,

and so, X̄ and Xⱼ − X̄ are independent for all j, which yields the desired result.
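The zero covariance can also be checked numerically. In the sketch below (an added illustration; n, the number of replications and the index j are arbitrary), the empirical covariance between X̄ and Xⱼ − X̄ should be very close to zero:

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps, j = 6, 200_000, 2                 # j picks one deviation X_j - X_bar

    x = rng.standard_normal(size=(reps, n))
    x_bar = x.mean(axis=1)
    dev_j = x[:, j] - x_bar

    print(np.cov(x_bar, dev_j)[0, 1])          # should be very close to zero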

(Bilkent) ECON509 This Version: 25 November 2013 30 / 94


The Derived Distributions: Student’s t and Snedecor’s F

You will see (probably next term) that the results derived so far are useful when the
variance of the distribution is known.
When σ2 is unknown, as is the case in most practical cases, it usually is replaced by
some unbiased estimator, say, σ̂2 of σ2 .
However, then the above results lose their relevance.
We will now talk about two important distributions that are relevant for such cases.

(Bilkent) ECON509 This Version: 25 November 2013 31 / 94


The Derived Distributions: Student’s t and Snedecor’s F

If X₁, ..., Xₙ are a random sample from a N(µ, σ²), we know that

(X̄ − µ)/(σ/√n) ∼ N(0, 1).

However, as we just argued, σ is usually unknown. What to do then?
W. S. Gosset (widely known as “Student”) studied the distribution of

(X̄ − µ)/(S/√n).

(Bilkent) ECON509 This Version: 25 November 2013 32 / 94


The Derived Distributions: Student’s t and Snedecor’s F
How to find the distribution of this new random variable? First multiply by σ/σ to
get

(X̄ − µ)/(S/√n) = [ (X̄ − µ)/(σ/√n) ] / √(S²/σ²).

We know that

(X̄ − µ)/(σ/√n) ∼ N(0, 1)  and  (n−1)S²/σ² ∼ χ²_{n−1},

where the latter result follows from part (c) of Theorem 5.3.1. Then, Z = S²/σ² can be
considered a χ²_{n−1} random variable divided by its degrees of freedom.
Importantly, remember that S² and X̄ are independent, so the numerator and the
denominator are independent here.
The question, then, is to find the distribution of

U/√(Z/p)  where U ∼ N(0, 1), Z ∼ χ²_p and U ⊥ Z.
This is Student’s t distribution.

(Bilkent) ECON509 This Version: 25 November 2013 33 / 94


The Derived Distributions: Student’s t and Snedecor’s F

Definition (5.3.4): Let X₁, ..., Xₙ be a random sample from a N(µ, σ²)
distribution. The quantity

(X̄ − µ)/(S/√n)

has Student’s t distribution with n−1 degrees of freedom. Equivalently, a random
variable T has Student’s t distribution with p degrees of freedom, and we write
T ∼ t_p, if it has pdf

f_T(t) = ( Γ((p+1)/2) / (√(pπ) Γ(p/2)) ) · 1/(1 + t²/p)^{(p+1)/2},   −∞ < t < ∞.

Remember that Student’s t does not have moments of all orders. In particular, if
X ∼ t_p, then

E[|X|^q] < ∞ for q < p.

Therefore, t₁ has no mean, t₂ has no variance, etc.
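As an added illustration (the parameter values are arbitrary), one can simulate the quantity (X̄ − µ)/(S/√n) from normal samples and compare it with the t distribution with n − 1 degrees of freedom:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    mu, sigma, n, reps = 1.0, 3.0, 5, 200_000

    x = rng.normal(mu, sigma, size=(reps, n))
    t_stat = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

    for q in (0.05, 0.5, 0.95):                # compare with t_{n-1} quantiles
        print(np.quantile(t_stat, q), stats.t(df=n - 1).ppf(q))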

(Bilkent) ECON509 This Version: 25 November 2013 34 / 94


The Derived Distributions: Student’s t and Snedecor’s F

Let’s try to derive the pdf. Since U and V (the χ²_p variable) are independent, we have

f_{U,V}(u, v) = (1/√(2π)) e^{−u²/2} · (1/(Γ(p/2) 2^{p/2})) v^{(p/2)−1} e^{−v/2},

where −∞ < u < ∞ and 0 < v < ∞.
We have the following transformation functions:

t = u/√(v/p)  and  w = v,

implying that

v = w  and  u = t √(w/p).

Then, the Jacobian of the transformation is given by

det [ √(w/p)   t(2p)^{−1}(w/p)^{−1/2} ;  0   1 ] = √(w/p).
(Bilkent) ECON509 This Version: 25 November 2013 35 / 94


The Derived Distributions: Student’s t and Snedecor’s F
Hence,

f_T(t) = ∫₀^∞ f_{U,V}( t√(w/p), w ) √(w/p) dw
       = ∫₀^∞ (1/√(2π)) e^{−(1/2)t²w/p} (1/(Γ(p/2) 2^{p/2})) w^{(p/2)−1} e^{−w/2} √(w/p) dw
       = ( 1/(√(2π) Γ(p/2) 2^{p/2} √p) ) ∫₀^∞ e^{−(1/2)w(t²/p + 1)} w^{((p+1)/2)−1} dw.

But the integrand is the kernel of a gamma( (p+1)/2, 2/(1 + t²/p) ) pdf. Therefore,

f_T(t) = ( 1/(√(2π) Γ(p/2) 2^{p/2} √p) ) Γ((p+1)/2) ( 2/(1 + t²/p) )^{(p+1)/2}
       = ( Γ((p+1)/2) / (√(pπ) Γ(p/2)) ) ( 1/(1 + t²/p) )^{(p+1)/2},

which gives the pdf of the t distribution.

(Bilkent) ECON509 This Version: 25 November 2013 36 / 94


The Derived Distributions: Student’s t and Snedecor’s F

Now, consider another important derived distribution, Snedecor’s F , named after Sir
Ronald Fisher. It arises naturally as the distribution of a ratio of variances.
Example (5.3.5): Let X₁, ..., Xₙ be a random sample from a N(µ_X, σ²_X)
population, and let Y₁, ..., Yₘ be a random sample from an independent N(µ_Y, σ²_Y)
population. If we were comparing the variability of the populations, one quantity of
interest would be the ratio σ²_X/σ²_Y. Information about this ratio is contained in

S²_X / S²_Y,

the ratio of sample variances. The F distribution allows us to compare these
quantities by giving us the distribution of

(S²_X/S²_Y) / (σ²_X/σ²_Y) = (S²_X/σ²_X) / (S²_Y/σ²_Y).   (3)

Examination of (3) shows us how the F distribution is derived. The ratios S²_X/σ²_X
and S²_Y/σ²_Y are each scaled chi squared variates, and they are independent.

(Bilkent) ECON509 This Version: 25 November 2013 37 / 94


The Derived Distributions: Student’s t and Snedecor’s F
Definition (5.3.6): Let X₁, ..., Xₙ be a random sample from a N(µ_X, σ²_X)
population and let Y₁, ..., Yₘ be a random sample from an independent N(µ_Y, σ²_Y)
population. The random variable

F = (S²_X/σ²_X) / (S²_Y/σ²_Y)

has Snedecor’s F distribution with n−1 and m−1 degrees of freedom.
Equivalently, the random variable F has the F distribution with p and q degrees of
freedom if it has pdf

f_F(x) = ( Γ((p+q)/2) / (Γ(p/2) Γ(q/2)) ) (p/q)^{p/2} x^{(p/2)−1} / [1 + (p/q)x]^{(p+q)/2},   0 < x < ∞.

The F distribution has some interesting properties and is related to a number of
other distributions.
Theorem (5.3.8):
a. If X ∼ F_{p,q}, then 1/X ∼ F_{q,p}. In other words, the reciprocal of an F random variable
is again an F random variable.
b. If X ∼ t_q, then X² ∼ F_{1,q}.
c. If X ∼ F_{p,q}, then

(p/q)X / (1 + (p/q)X) ∼ beta(p/2, q/2).
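The following sketch (an added illustration; all parameter values are arbitrary) builds the F statistic from two independent normal samples as in Definition (5.3.6) and also checks part (b) of Theorem (5.3.8) numerically:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n, m, reps = 8, 12, 100_000

    # F statistic from Definition (5.3.6): ratio of scaled sample variances
    x = rng.normal(0.0, 2.0, size=(reps, n))           # sigma_X^2 = 4
    y = rng.normal(5.0, 3.0, size=(reps, m))           # sigma_Y^2 = 9
    f_stat = (x.var(axis=1, ddof=1) / 4.0) / (y.var(axis=1, ddof=1) / 9.0)
    print(np.quantile(f_stat, 0.9), stats.f(dfn=n - 1, dfd=m - 1).ppf(0.9))

    # part (b) of Theorem (5.3.8): the square of a t_q variable is F_{1,q}
    q = 6
    t_draws = rng.standard_t(q, reps)
    print(np.quantile(t_draws ** 2, 0.9), stats.f(dfn=1, dfd=q).ppf(0.9))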

(Bilkent) ECON509 This Version: 25 November 2013 38 / 94


Convergence Concepts

The underlying idea in this section is to understand what happens to sequences of
random variables, or summary statistics, when we let the sample size go to
infinity.
This is, of course, an idealised concept, because the sample size never
goes to infinity. However, the idea is to attain a grasp of what happens when the
sample size becomes large enough.
This is important because, although the key results are stated for the case where
the sample size approaches ∞, they are actually relevant for finite sample sizes that
are large enough.
How large is “large enough” is very much related to the data, its dependence
structure, the econometric model at hand, etc., so it is difficult to give a proper rule
of thumb.

(Bilkent) ECON509 This Version: 25 November 2013 39 / 94


Convergence Concepts

The tools you will learn in this section are the fundamental building blocks of what
is known as “asymptotic theory” or “large sample theory.” Although a bit abstract
at first sight, these results are at the core of many proofs you will encounter in
econometrics articles.
The three important concepts we will consider in what follows are
1. almost sure convergence,
2. convergence in probability,
3. convergence in distribution.
In the first instance we will consider the case where a sequence of random variables
X₁, ..., Xₙ exhibits the iid property. This is one (and the simplest) of many possible
dependence settings.

(Bilkent) ECON509 This Version: 25 November 2013 40 / 94


Convergence Concepts
Almost Sure Convergence

Remember that random variables are functions defined on the sample space, e.g.
X(ω). Our interest will be in a sequence of random variables, indexed by sample
size, i.e. Xₙ(ω).
To motivate the following discussion, consider pointwise convergence:

lim_{n→∞} Xₙ(ω) = X(ω) for all ω ∈ Ω,

where Ω is, as before, the sample space.
Notice that convergence occurs for all ω! There is not much of a probabilistic
statement here.
This is the strongest form of convergence we can have on the sample space. But it
is not relevant for probabilistic statements.

(Bilkent) ECON509 This Version: 25 November 2013 41 / 94


Convergence Concepts
Almost Sure Convergence

A slightly restricted version of pointwise convergence is almost sure convergence.
Definition (Almost Sure Convergence): A sequence of random variables X₁, X₂, ...
defined on a probability space (Ω, F, P) converges almost surely to a random
variable X if

lim_{n→∞} Xₙ(ω) = X(ω),

for each ω, except for ω ∈ E, where P(E) = 0.
The idea is this: pointwise convergence fails for some points in Ω. However, the
number of such points is so small that we can safely assign zero measure (or zero
probability) to the set of these points.
Other ways of expressing this definition are

P( lim_{n→∞} |Xₙ − X| > ε ) = 0 for every ε > 0,

or

P( { ω : lim_{n→∞} Xₙ(ω) = X(ω) } ) = 1.

(Bilkent) ECON509 This Version: 25 November 2013 42 / 94


Convergence Concepts
Almost Sure Convergence

This time, we have a difference. Convergence fails on a very small set E such that
P(E) = 0.
That P(E) = 0 is due to the set being so small that we can safely assign zero
probability to it.
Remember that in earlier lectures we stated that for a continuously distributed
random variable, the probability of a single point is always equal to zero. This is
similar, in spirit, to the situation at hand.
This type of convergence is also called convergence almost everywhere and
convergence with probability 1.
The following notation is common:

Xₙ →^{a.s.} X,
Xₙ →^{wp1} X,
lim_{n→∞} Xₙ = X a.s.

Note that, at the cost of sloppy notation, the argument of Xn (ω ) is usually dropped.
Also, Xn (ω ) need not converge to a function. It can also simply converge to some
constant, say, a.
(Bilkent) ECON509 This Version: 25 November 2013 43 / 94
Convergence Concepts
Almost Sure Convergence

Usually, Xₙ(ω) is some sample average. For example, let Zᵢ(ω), i = 1, ..., n, be
some random variables and let

Xₙ(ω) = (1/n) ∑_{i=1}^n Zᵢ(ω).

Then, almost sure convergence is a statement on the joint distribution of the entire
sequence {Zᵢ(ω)}.
White (2001, p.19): “The probability measure P determines the joint distribution
of the entire sequence {Zᵢ(ω)}. A sequence Xₙ(ω) converges almost surely if the
probability of obtaining a realisation of the sequence {Zᵢ(ω)} for which convergence
to X(ω) occurs is unity.”

(Bilkent) ECON509 This Version: 25 November 2013 44 / 94


Convergence Concepts
Almost Sure Convergence

We can extend this concept to vectors. Let Xₙ = (X_{1,n}, ..., X_{D,n})′ and
X = (X₁, ..., X_D)′. Then

Xₙ →^{a.s.} X

if and only if

X_{d,n} →^{a.s.} X_d for all d = 1, ..., D.

So almost sure convergence has to occur component by component.
A more compact notation is available. Define the Euclidean norm

||X|| = √(X₁² + ... + X_D²)

and observe that

|X_d| ≤ √(X₁² + ... + X_D²).

Then, Xₙ →^{a.s.} X if

lim_{n→∞} ||Xₙ(ω) − X(ω)|| = 0, except for ω ∈ E, where P(E) = 0.

Whenever you see the terms “almost sure,” “with probability one,” or “almost
everywhere,” you should remember that the relationship that is referred to holds
everywhere except for some set with zero probability.
(Bilkent) ECON509 This Version: 25 November 2013 45 / 94
Convergence Concepts
Almost Sure Convergence

Now that we have a convergence concept in our arsenal, the next question is: under
which conditions can we use this result for statistics of interest?
One important result is Kolmogorov’s Strong Law of Large Numbers (SLLN).
Theorem (Kolmogorov’s Strong Law of Large Numbers): If X₁, X₂, ... are
independent and identically distributed random variables with a finite common mean,
i.e. E[|X₁|] < ∞ and E[X₁] = µ, then

lim_{n→∞} (1/n) ∑_{i=1}^n Xᵢ = µ a.s.,  or  (1/n) ∑_{i=1}^n Xᵢ →^{a.s.} µ.

Note that the sample size will never be equal to ∞. However, when the sample size
is large enough, the sample mean will be very close to the population mean, µ.
What is remarkable about this result is that, as long as we have iid random
variables, the only condition is that the random variables have finite mean.
For the time being, suffice it to say that one can relax the iid assumption, but that
means that we will require more than just finite means. There is a trade-off between
the amount of dependence and the strength of the moment assumptions you have to
make in order to attain convergence.
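As an added illustration of the SLLN (the Uniform(0, 1) population and the sample sizes are arbitrary choices), the running sample mean of a long simulated sequence settles down around the population mean µ = 0.5:

    import numpy as np

    rng = np.random.default_rng(8)
    mu = 0.5                                   # mean of a Uniform(0, 1) population

    x = rng.uniform(0.0, 1.0, size=1_000_000)
    running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

    for n in (10, 1_000, 100_000, 1_000_000):
        print(n, running_mean[n - 1])          # settles down around mu as n grows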

(Bilkent) ECON509 This Version: 25 November 2013 46 / 94


Convergence Concepts
Convergence in Probability

The next convergence type is convergence in probability. Its definition is similar to
that of almost sure convergence but in essence it is a much weaker convergence
concept.
Definition (Convergence in Probability): A sequence of random variables
X₁, X₂, ... converges in probability to a random variable X if, for every ε > 0,

lim_{n→∞} P( |Xₙ − X| > ε ) = 0.

An equivalent statement is that given ε > 0 and δ > 0 there exists an N, which
depends on both δ and ε, such that P(|Xₙ − X| > ε) < δ for all n > N. This
merely is a restatement using the formal definition of a limit.
One could also write

lim_{n→∞} P( { ω : |Xₙ(ω) − X(ω)| > ε } ) = 0.

(Bilkent) ECON509 This Version: 25 November 2013 47 / 94


Convergence Concepts
Convergence in Probability

It is not easy to see the difference between the two modes of convergence. However,
let’s give it a try!
Almost sure convergence states that we have pointwise convergence for all ω ∈ Ω
except for a small, zero-measure set E. Importantly, this set is independent of n.
Convergence in probability states that as the sample size goes towards ∞, the
probability that Xₙ will deviate from X by more than ε decreases towards zero.
However, for any sample size, there may be a positive probability that Xₙ deviates by
more than ε. In other words, for some ω ∈ Eₙ ⊂ Ω, with P(Eₙ) > 0, |Xₙ − X|
will be larger than ε.
Importantly, there is nothing that restricts Eₙ to be the same for all n. Hence, the
set on which Xₙ deviates from X may change as n increases.
The good news is that the deviation probability goes to zero, hence P(Eₙ) → 0.

(Bilkent) ECON509 This Version: 25 November 2013 48 / 94


Convergence Concepts
Convergence in Probability

Now, as before, let Zᵢ, i = 1, ..., n, be some random variables and let

Xₙ = (1/n) ∑_{i=1}^n Zᵢ.

White (2001, p.24): “With almost sure convergence, the probability measure P
takes into account the joint distribution of the entire sequence {Zᵢ}, but with
convergence in probability, we only need concern ourselves sequentially with the joint
distribution of the elements of {Zᵢ} that actually appear in Xₙ, typically the first n.”

(Bilkent) ECON509 This Version: 25 November 2013 49 / 94


Convergence Concepts
Convergence in Probability

Example (5.5.8) (Convergence in probability but not almost surely): Let
Ω = [0, 1] with the uniform probability distribution.
Define the following sequence:

X₁(ω) = ω + I_{[0,1]}(ω),      X₂(ω) = ω + I_{[0,1/2]}(ω),
X₃(ω) = ω + I_{[1/2,1]}(ω),    X₄(ω) = ω + I_{[0,1/3]}(ω),
X₅(ω) = ω + I_{[1/3,2/3]}(ω),  X₆(ω) = ω + I_{[2/3,1]}(ω),

etc. Let X(ω) = ω.
Do we have convergence in probability? Observe that

X₁(ω) − X(ω) = I_{[0,1]}(ω),      X₂(ω) − X(ω) = I_{[0,1/2]}(ω),
X₃(ω) − X(ω) = I_{[1/2,1]}(ω),    X₄(ω) − X(ω) = I_{[0,1/3]}(ω),
X₅(ω) − X(ω) = I_{[1/3,2/3]}(ω),  X₆(ω) − X(ω) = I_{[2/3,1]}(ω),

etc.

(Bilkent) ECON509 This Version: 25 November 2013 50 / 94


Convergence Concepts
Convergence in Probability

Now, due to the uniform probability distribution assumption, P(ω ∈ [a, b]) = b − a
where 0 ≤ a ≤ b ≤ 1. Then, as n → ∞, P( ω : I_{[a,b]}(ω) > ε ) → 0, since the length
b − a of the relevant interval shrinks to 0 as n → ∞.
Do we have almost sure convergence? No. There is no value of ω ∈ Ω for which
Xₙ(ω) → ω. For every ω, the value Xₙ(ω) alternates between the values ω and
ω + 1 infinitely often.
For example, if ω = 3/8, then X₁(ω) = 11/8, X₂(ω) = 11/8, X₃(ω) = 3/8,
X₄(ω) = 3/8, X₅(ω) = 11/8, X₆(ω) = 3/8, etc.
No pointwise convergence occurs for this sequence.

(Bilkent) ECON509 This Version: 25 November 2013 51 / 94


Convergence Concepts
Convergence in Probability

Example (5.5.7): Let the sample space Ω be the closed interval [0, 1] with the
uniform probability distribution. Define random variables

Xₙ(ω) = ω + ωⁿ  and  X(ω) = ω.

For every ω ∈ [0, 1), ωⁿ → 0 as n → ∞ and Xₙ(ω) → ω = X(ω).
However, Xₙ(1) = 2 for every n, so Xₙ(1) does not converge to 1 = X(1).
But, since the convergence occurs on the set [0, 1) and P([0, 1)) = 1 (remember
that when we have a continuous random variable, the probability of a single point is
equal to zero),

Xₙ →^{a.s.} X.

(Bilkent) ECON509 This Version: 25 November 2013 52 / 94


Convergence Concepts
Convergence in Probability

One would usually write

Xₙ →^{p} X  or  plim_{n→∞} Xₙ = X,

as shorthand.
Associated with convergence in probability is the Weak Law of Large Numbers
(WLLN).
Theorem (Weak Law of Large Numbers): If X₁, X₂, ... are iid random variables
with common mean µ < ∞ and variance σ² < ∞, then

(1/n) ∑_{i=1}^n Xᵢ →^{p} µ,

as n → ∞.

(Bilkent) ECON509 This Version: 25 November 2013 53 / 94


Convergence Concepts
Convergence in Probability

Proof: The proof uses Chebychev’s Inequality. Remember this says that if X is a
random variable and if g(x) is a nonnegative function, then, for any r > 0,

P( g(X) ≥ r ) ≤ E[g(X)]/r.

Now, consider

P( |X̄ₙ − µ| ≥ ε ) = P( (X̄ₙ − µ)² ≥ ε² ) ≤ E[(X̄ₙ − µ)²]/ε² = Var(X̄ₙ)/ε² = σ²/(nε²),

where we use r = ε² and g(X̄ₙ) = (X̄ₙ − µ)².
The above result implies that

P( |X̄ₙ − µ| < ε ) = 1 − P( |X̄ₙ − µ| ≥ ε ) ≥ 1 − σ²/(nε²),

and

lim_{n→∞} ( 1 − σ²/(nε²) ) = 1.

Hence,

lim_{n→∞} P( |X̄ₙ − µ| < ε ) = 1.
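The proof can be illustrated numerically (an added sketch; the N(0, 1) population, ε = 0.1 and the sample sizes are arbitrary): both the simulated deviation probability P(|X̄ₙ − µ| ≥ ε) and the Chebychev bound σ²/(nε²) shrink as n grows.

    import numpy as np

    rng = np.random.default_rng(9)
    mu, sigma2, eps, reps = 0.0, 1.0, 0.1, 50_000

    for n in (10, 100, 1000):
        x_bar = rng.normal(mu, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)
        prob = np.mean(np.abs(x_bar - mu) >= eps)     # P(|X_bar - mu| >= eps)
        bound = sigma2 / (n * eps ** 2)               # Chebychev bound from the proof
        print(n, prob, min(bound, 1.0))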

(Bilkent) ECON509 This Version: 25 November 2013 54 / 94


Convergence Concepts
Convergence in Probability

This is perhaps a good time to stop and reflect a little bit on these new concepts.
What the Weak and Strong LLNs are saying is that, under certain conditions, the
sample mean converges to the population mean as n → ∞. This is known as
consistency: one would say that the sample mean is a consistent estimator of the
population mean.
In actual applications, this means that if the sample size is large enough, then the
sample mean is close to the population mean. So n does not have to be that close
to infinity. On the other hand, as mentioned at the beginning, “how large” the
sample size should be in order to be considered “large enough” is a
different question in its own right. We will not deal with this here.
Sometimes, consistency is compared to unbiasedness.
An estimator β̂ of a population value β is an unbiased estimator if and only if

E[β̂] = β.

What are the things we might want to estimate? One example would be the parameters
of a distribution family. For example, we might know that the data are distributed
as N(µ, σ²), but we may not know the particular values of µ and σ². In this case,
we would estimate these parameters.

(Bilkent) ECON509 This Version: 25 November 2013 55 / 94


Convergence Concepts
Convergence in Probability

Analytically, the plim operator is much more convenient to deal with than
the expectation operator. For example, for two sequences X₁, X₂, ... and Y₁, Y₂, ...,

plim_{n→∞} (Xₙ/Yₙ) = (plim_{n→∞} Xₙ)/(plim_{n→∞} Yₙ)   (provided plim_{n→∞} Yₙ ≠ 0),

while we usually have

E[Xₙ/Yₙ] ≠ E[Xₙ]/E[Yₙ].

However, one concept does not imply the other. In other words, a consistent
estimator can be biased, while an unbiased estimator can be inconsistent.
Suppose we are trying to estimate the population parameter β. Consider the
following estimators (a small simulation sketch follows below).
1. β̂ = β + 20/n : consistent but biased.
2. β̂ = X where P(X = β + 100) = P(X = β − 100) = .50 : unbiased but inconsistent.
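A rough simulation sketch of these two estimators (added for illustration; β = 5 and the sample sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(10)
    beta, reps = 5.0, 100_000

    for n in (10, 100, 10_000):
        est1 = beta + 20.0 / n                      # estimator 1: deterministic here
        # estimator 2: beta + 100 or beta - 100, each with probability 1/2 (no n anywhere)
        est2 = beta + rng.choice([100.0, -100.0], size=reps)
        print(n, est1, est2.mean(), np.mean(np.abs(est2 - beta) < 1.0))
        # est1 approaches beta as n grows (consistent, though biased for every finite n);
        # est2 has mean beta (unbiased) but is never within 1 of beta (inconsistent)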

(Bilkent) ECON509 This Version: 25 November 2013 56 / 94


Convergence Concepts
Convergence in Probability

Returning to the discussion at hand, it is important to acknowledge that neither
almost sure convergence nor convergence in probability (nor any other convergence
type) says anything about the distribution of the sequence X₁, X₂, ... . For example,
it might be that the distribution of Xᵢ changes as i varies. This is fine.
So far, we have only considered LLNs that work when the sequence is drawn from an
iid population. If this assumption is violated, we can probably still have convergence
of the sample mean to the population mean, but we will have to find an appropriate
LLN that works for the particular population distribution we have.
A useful result relating almost sure convergence and convergence in probability is
that

Xₙ →^{a.s.} X  ⟹  Xₙ →^{p} X.

Convergence in probability, however, does not imply almost sure convergence.
Obviously, for some constant K,

K →^{a.s.} K  and  K →^{p} K.

(Bilkent) ECON509 This Version: 25 November 2013 57 / 94


Convergence Concepts
Convergence in Probability

As far as economists and most econometricians are concerned, one would not
care too much about whether convergence is achieved almost surely or in probability.
As long as convergence is achieved, the rest is not important.
However, in some cases it might be easier to prove the LLN for one of the two
convergence types. This is no problem, as →^{a.s.} implies →^{p} anyway.
In addition, convergence almost surely might be slower than convergence in
probability in the sense that it might require a larger sample size before the sample
mean is close enough to the population mean.

(Bilkent) ECON509 This Version: 25 November 2013 58 / 94


Convergence Concepts
Convergence in Probability

Example (5.5.3) (Consistency of S²): Suppose we have a sequence X₁, X₂, ... of
iid random variables with E[Xᵢ] = µ and Var(Xᵢ) = σ² < ∞. If we define

Sₙ² = (1/(n−1)) ∑_{i=1}^n (Xᵢ − X̄ₙ)²,

can we prove a Weak Law of Large Numbers (WLLN) for Sₙ²?
Again, use Chebychev’s Inequality to obtain

P( |Sₙ² − σ²| ≥ ε ) = P( (Sₙ² − σ²)² ≥ ε² ) ≤ E[(Sₙ² − σ²)²]/ε² = Var(Sₙ²)/ε².

Therefore, a sufficient condition for weak convergence of Sₙ² to σ² is that
lim_{n→∞} Var(Sₙ²) = 0.

(Bilkent) ECON509 This Version: 25 November 2013 59 / 94


Convergence Concepts
Convergence in Probability

Although we might get convergence results for some sample means Xₙ and Yₙ, we
might actually be interested in the convergence properties of a function of these, say
g(Xₙ, Yₙ).
Fortunately, we have the following useful result.
If (Xₙ, Yₙ) converges almost surely (in probability) to (X, Y), if g(x, y) is a
continuous function over some set D, and if the images of Ω under
[Xₙ(ω), Yₙ(ω)] and [X(ω), Y(ω)] are in D, then g(Xₙ, Yₙ) converges almost
surely (in probability) to g(X, Y).
More pragmatically, for continuous g,

Xₙ →^{a.s.} X  ⟹  g(Xₙ) →^{a.s.} g(X),
Xₙ →^{p} X  ⟹  g(Xₙ) →^{p} g(X).

(Bilkent) ECON509 This Version: 25 November 2013 60 / 94


Convergence Concepts
Convergence in Probability

The next interesting question is this: now that we have found convergence results
for sequences of random variables X₁, X₂, ..., can we find similar results for some
reasonably well-behaved function of them?
Example (5.5.5) (Consistency of S): If Sₙ² is a consistent estimator of σ², then
the sample standard deviation Sₙ = √(Sₙ²) = h(Sₙ²) is a consistent estimator of σ.
Interestingly, it can be shown that Sn , in fact, is a biased estimator of σ! However,
the bias disappears asymptotically.

(Bilkent) ECON509 This Version: 25 November 2013 61 / 94


Convergence Concepts
Convergence in Probability

Before we move on to a different type of convergence, let us, for the sake of
completeness, introduce one more convergence concept.
Definition (Lp Convergence): Let 0 < p < ∞, let X₁, X₂, ... be a sequence of
random variables with E[|Xₙ|^p] < ∞ and let X be a random variable with
E[|X|^p] < ∞. Then, Xₙ converges in Lp to X if

lim_{n→∞} E[ |Xₙ − X|^p ] = 0.

Just as a reference, Lp convergence does not imply almost sure convergence and nor
does almost sure convergence imply Lp convergence. However, Lp convergence
implies convergence in probability.

(Bilkent) ECON509 This Version: 25 November 2013 62 / 94


Convergence Concepts
Convergence in Probability

Finally, we present another type of LLN before we move on to convergence in
distribution.
Theorem (Uniform Strong LLN): If X₁, X₂, ... are iid random variables, if g(x, θ) is
continuous over X × Θ, where X is the range of X₁ and Θ is a closed and bounded
set, and if

E[ sup_{θ∈Θ} |g(X₁, θ)| ] < ∞,

then

lim_{n→∞} sup_{θ∈Θ} | (1/n) ∑_{i=1}^n g(Xᵢ, θ) − E[g(X₁, θ)] | = 0 a.s.

Moreover, E[g(X₁, θ)] is a continuous function of θ.
Therefore, the worst deviation of the sample average from the population average
(E[g(X₁, θ)]) that one can find over all θ ∈ Θ converges to zero almost surely.
This is useful because, in many cases, you will deal with functions of data
(X₁, X₂, ...) and parameters (θ). Now, one can either check that convergence occurs
for each possible θ, or simply use a Uniform SLLN and ensure that convergence will
be attained for all θ in the parameter space (Θ).

(Bilkent) ECON509 This Version: 25 November 2013 63 / 94


Convergence Concepts
Convergence in Distribution

So far, we have dealt with results concerning the sample mean.


The main theme has been the convergence of the sample mean to the population
mean.
This is useful, but we can get much more.
For example, we can get convergence in distribution, as well.

(Bilkent) ECON509 This Version: 25 November 2013 64 / 94


Convergence Concepts
Convergence in Distribution

Definition (Convergence in Distribution): A sequence of random variables
X₁, X₂, ... converges in distribution to a random variable X if

lim_{n→∞} F_{Xₙ}(x) = F_X(x),

at every x where F_X(x) is continuous.
This is also called convergence in law. The following shorthand notation is used to
denote convergence in distribution:

Xₙ →^{d} X,
Xₙ →^{d} F_X,
Xₙ →^{L} F_X.

It is important to underline that it is not Xn that converges to a distribution.


Instead, it is the distribution of Xn that converges to the distribution of X .

(Bilkent) ECON509 This Version: 25 November 2013 65 / 94


Convergence Concepts
Convergence in Distribution

As far as sequences of random vectors are concerned, a sequence of random vectors
Xₙ = (X_{1,n}, ..., X_{d,n}) converges in distribution to a random vector X if

lim_{n→∞} F_{Xₙ}(x₁, ..., x_d) = F_X(x₁, ..., x_d),

at every x = (x₁, ..., x_d) where F_X(x₁, ..., x_d) is continuous.
Importantly, convergence in probability implies convergence in distribution.
Theorem (5.5.12): If the sequence of random variables X₁, X₂, ... converges in
probability to a random variable X, the sequence also converges in distribution to X.
Consequently, almost sure convergence implies convergence in distribution, as well.
Theorem (5.5.13): The sequence of random variables X₁, X₂, ... converges in
probability to a constant a if and only if the sequence also converges in distribution
to a. Equivalently, the statement

P(|Xₙ − a| > ε) → 0 for every ε > 0

is equivalent to

F_{Xₙ}(x) = P(Xₙ ≤ x) → 0 if x < a, and → 1 if x > a.
(Bilkent) ECON509 This Version: 25 November 2013 66 / 94


Convergence Concepts
Convergence in Distribution

We now introduce one of the most useful theorems we have considered so far.
Theorem (5.5.15) (Central Limit Theorem): Let X₁, X₂, ... be a sequence of iid
random variables with E[Xᵢ] = µ < ∞ and 0 < Var(Xᵢ) = σ² < ∞. Define

X̄ₙ = (1/n) ∑_{i=1}^n Xᵢ.

Let Gₙ(x) denote the cdf of √n (X̄ₙ − µ)/σ. Then, for any x, −∞ < x < ∞,

lim_{n→∞} Gₙ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−y²/2} dy.

In other words,

(1/√n) ∑_{i=1}^n (Xᵢ − µ)/σ →^{d} N(0, 1).
(Bilkent) ECON509 This Version: 25 November 2013 67 / 94


Convergence Concepts
Convergence in Distribution

This is a powerful result! We start with the iid and finite mean and variance
assumptions. In return, the Central Limit Theorem (CLT) promises us that the
distribution of a properly standardised version of the sample mean, given by

(1/√n) ∑_{i=1}^n (Xᵢ − µ)/σ,

will converge to the standard normal distribution as the sample size tends to infinity.
As before, the sample size will never be equal to ∞. BUT, for large enough samples,
(1/√n) ∑_{i=1}^n (Xᵢ − µ)/σ will be approximately standard normal. As n becomes
larger, this approximate result will become more accurate.
As with LLNs, it is possible to obtain CLTs for non-iid data. However, this will
require one to make stronger assumptions regarding the moments of the sequence of
random variables. The trade-off between dependence and moment assumptions is
always there.

(Bilkent) ECON509 This Version: 25 November 2013 68 / 94


Convergence Concepts
Convergence in Distribution

Let’s prove the CLT. However, before we do that, we have to revisit Taylor
expansions.
Definition (5.5.20): If a function g(x) has derivatives of order r, that is,
g^{(r)}(x) = (dʳ/dxʳ) g(x) exists, then for any constant a, the Taylor polynomial of order r
about a is

T_r(x) = ∑_{i=0}^{r} ( g^{(i)}(a)/i! ) (x − a)^i.

This polynomial is used in order to obtain a Taylor expansion of order r about
x = a. This is given by

g(x) = T_r(x) + R,

where R = g(x) − T_r(x) is the remainder for the approximation.

(Bilkent) ECON509 This Version: 25 November 2013 69 / 94


Convergence Concepts
Convergence in Distribution

Now, a useful major result is as follows.
Theorem (5.5.21): If g^{(r)}(a) = (dʳ/dxʳ) g(x) |_{x=a} exists, then

lim_{x→a} [ g(x) − T_r(x) ] / (x − a)^r = 0.

This says that the remainder, g(x) − T_r(x), always tends to zero faster than the
highest-order term of the approximation.
Importantly, this also means that as x tends to a, the remainder term approaches 0.
(Bilkent) ECON509 This Version: 25 November 2013 70 / 94


Convergence Concepts
Convergence in Distribution

We can now prove the CLT.
Proof: We will do the proof for the case where the mgf exists for |t| < h for some
positive h. The CLT can be proved without assuming existence of the mgf by using
characteristic functions instead. However, this would be much more complicated.
Let E[Xᵢ] = µ and Var(Xᵢ) = σ². The aim is to show that the mgf of

√n (X̄ₙ − µ)/σ

converges to the mgf of a N(0, 1) random variable, which will prove that the
distribution of √n (X̄ₙ − µ)/σ converges to the standard normal distribution.
Now, let Yᵢ = (Xᵢ − µ)/σ. Then,

M_{Yᵢ}(t) = E[e^{tYᵢ}] = E[e^{t(Xᵢ−µ)/σ}] = e^{−tµ/σ} E[e^{tXᵢ/σ}] = e^{−tµ/σ} M_{Xᵢ}(t/σ).
(Bilkent) ECON509 This Version: 25 November 2013 71 / 94


Convergence Concepts
Convergence in Distribution
Let Mₙ(t) be the mgf of √n (X̄ₙ − µ)/σ = √n Ȳ.
Since the Xᵢ are iid, the Yᵢ are also iid. In addition, E[Yᵢ] = 0 and Var(Yᵢ) = 1.
Now, due to the independence and identical distribution assumptions,

Mₙ(t) = E[ e^{(1/√n) t (Y₁+...+Yₙ)} ] = E[ e^{(t/√n)Y₁} ⋯ e^{(t/√n)Yₙ} ]
      = E[ e^{(t/√n)Y₁} ] ⋯ E[ e^{(t/√n)Yₙ} ] = [ M_{Y₁}(t/√n) ]ⁿ.

Let’s expand M_{Y₁}(t/√n) around t = 0:

M_{Y₁}(t/√n) = M_{Y₁}(0) + M′_{Y₁}(0) (t/√n) + (1/2) M″_{Y₁}(0) (t/√n)² + R_{Y₁}(t/√n),

where

R_{Y₁}(t/√n) = ∑_{k=3}^{∞} M^{(k)}_{Y₁}(0) (t/√n)^k / k!.
(Bilkent) ECON509 This Version: 25 November 2013 72 / 94


Convergence Concepts
Convergence in Distribution

These expansions exist because the mgf of Xᵢ exists in a neighbourhood of 0
(|t| < h for some h), which implies that M_{Y₁}(t/√n) exists for |t| < √n σh.
We know from Theorem (5.5.21) that, for fixed t,

lim_{t/√n → 0} R_{Y₁}(t/√n) / (t/√n)² = lim_{n→∞} R_{Y₁}(t/√n) / (t/√n)² = 0,

where we need to have t ≠ 0.
But, as will become clear in a moment, we are interested in the behaviour of

R_{Y₁}(t/√n) / (1/√n)².

Since the above result is based on fixed t, we have

lim_{n→∞} R_{Y₁}(t/√n) / (1/√n)² = lim_{n→∞} n R_{Y₁}(t/√n) = 0,   (4)

as well, for all t, including t = 0, since R_{Y₁}(0/√n) = 0.
(Bilkent) ECON509 This Version: 25 November 2013 73 / 94
Convergence Concepts
Convergence in Distribution

Now, notice that

M′_{Y₁}(0) = E[Y₁] = 0,   M″_{Y₁}(0) = E[Y₁²] = Var(Y₁) = 1,

and M_{Y₁}(0) = 1 by definition.
Therefore,

[ M_{Y₁}(t/√n) ]ⁿ = [ 1 + 0·(t/√n) + (1/2)(t/√n)² + R_{Y₁}(t/√n) ]ⁿ
                 = [ 1 + (1/n)( t²/2 + n R_{Y₁}(t/√n) ) ]ⁿ.

(Bilkent) ECON509 This Version: 25 November 2013 74 / 94


Convergence Concepts
Convergence in Distribution

Remember that for any sequence aₙ, if lim_{n→∞} aₙ = a, then

lim_{n→∞} (1 + aₙ/n)ⁿ = eᵃ.

Let aₙ = t²/2 + n R_{Y₁}(t/√n) and observe that, by (4), lim_{n→∞} aₙ = t²/2. Then,

lim_{n→∞} [ M_{Y₁}(t/√n) ]ⁿ = lim_{n→∞} [ 1 + (1/n)( t²/2 + n R_{Y₁}(t/√n) ) ]ⁿ = e^{t²/2}.

But this is the mgf of the N(0, 1) distribution! Hence, the CLT is proved.

(Bilkent) ECON509 This Version: 25 November 2013 75 / 94


Convergence Concepts
Convergence in Distribution

Example (5.5.16): Suppose (X₁, ..., Xₙ) are a random sample from a negative
binomial(r, p) distribution. For this distribution, one can show that

E[Xᵢ] = r(1−p)/p  and  Var(Xᵢ) = r(1−p)/p²,

for all i.
Then, the CLT tells us that

√n ( X̄ − r(1−p)/p ) / √( r(1−p)/p² ) →^{d} N(0, 1).

Hence, in a large sample, this quantity should be approximately standard normally
distributed.

(Bilkent) ECON509 This Version: 25 November 2013 76 / 94


Convergence Concepts
Convergence in Distribution

One can also do exact calculations, but these would be difficult. Take r = 10, p = .5
and n = 30. Now, consider

P(X̄ ≤ 11) = P( ∑_{i=1}^{30} Xᵢ ≤ 330 ) = ∑_{x=0}^{330} (300 + x − 1 choose x) (1/2)^{300} (1/2)^x = .8916,

which follows from the fact that ∑_{i=1}^{30} Xᵢ is negative binomial(nr, p) (you do not
have to prove this!).
Such calculations would be tough, even using a computer, as we are considering
factorials of very large numbers.
We could also use the CLT to obtain the following approximation:

P(X̄ ≤ 11) = P( √30 (X̄ − 10)/√20 ≤ √30 (11 − 10)/√20 ) ≈ P(Z ≤ 1.2247) = .8888,

where Z ∼ N(0, 1).
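The calculation from this slide can be reproduced with SciPy (an added illustration): the exact probability via the negative binomial cdf, which sidesteps the large factorials, and the CLT approximation.

    import numpy as np
    from scipy import stats

    r, p, n = 10, 0.5, 30
    mean_x, var_x = r * (1 - p) / p, r * (1 - p) / p ** 2     # 10 and 20

    # exact: the sum of the X_i is negative binomial(n*r, p) (failures before n*r successes)
    exact = stats.nbinom(n * r, p).cdf(330)

    # CLT approximation from the slide
    z = np.sqrt(n) * (11 - mean_x) / np.sqrt(var_x)
    approx = stats.norm.cdf(z)

    print(exact, approx)                                      # roughly .8916 and .8888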

(Bilkent) ECON509 This Version: 25 November 2013 77 / 94


Convergence Concepts
Convergence in Distribution

Two useful results are given next.
Theorem: If Xₙ is a sequence of random vectors each with support X, g(x) is
continuous on X and

Xₙ →^{d} X,

then

g(Xₙ) →^{d} g(X).

Theorem (5.5.17) (Slutsky’s Theorem): If Xₙ →^{d} X and Yₙ →^{p} k, where k is a
constant, then
1. Yₙ Xₙ →^{d} kX,
2. Xₙ + Yₙ →^{d} X + k.

(Bilkent) ECON509 This Version: 25 November 2013 78 / 94


Convergence Concepts
Convergence in Distribution
Example (5.5.18): Suppose that

√n (X̄ₙ − µ)/σ →^{d} N(0, 1),

however the value of σ is unknown. What to do?
In Example (5.5.3), we have seen that if lim_{n→∞} Var(Sₙ²) = 0, then Sₙ² →^{p} σ². One
can show that this implies that Sₙ →^{p} σ.
Then, by Slutsky’s Theorem,

√n (X̄ₙ − µ)/Sₙ = (σ/Sₙ) · √n (X̄ₙ − µ)/σ →^{d} N(0, 1),

since σ/Sₙ →^{p} 1 and √n (X̄ₙ − µ)/σ →^{d} N(0, 1).
If you find this confusing, try to see this as

√n (X̄ₙ − µ)/Sₙ →^{d} X, where X ∼ 1 · N(0, 1).

The mean of X is equal to 1·0 while Var(X) = 1²·1. Moreover, clearly, X is
normally distributed. Hence,

√n (X̄ₙ − µ)/Sₙ →^{d} N(0, 1).
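A simulation sketch of this use of Slutsky's Theorem (added for illustration; µ, σ, n and the number of replications are arbitrary): with σ replaced by Sₙ, the studentised sample mean is still approximately N(0, 1).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(11)
    mu, sigma, n, reps = 2.0, 5.0, 200, 100_000

    x = rng.normal(mu, sigma, size=(reps, n))
    stat = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

    for q in (0.05, 0.5, 0.95):                # quantiles close to N(0, 1) quantiles
        print(np.quantile(stat, q), stats.norm.ppf(q))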

(Bilkent) ECON509 This Version: 25 November 2013 79 / 94


Convergence Concepts
The Delta Method

When talking about the CLT, our focus has been on the limiting distribution of
some standardised random variable.
There are many instances, however, when we are not speci…cally interested in the
distribution of the standardised random variable itself, but rather of some function of
it.
The delta method comes in handy in such cases. This method utilises our knowledge
of the limiting distribution of a random variable in order to find the limiting
distribution of a function of this random variable.
In essence, this method is a combination of Slutsky’s Theorem and Taylor’s
approximation.

(Bilkent) ECON509 This Version: 25 November 2013 80 / 94


Convergence Concepts
The Delta Method
Theorem (5.5.24) (Delta Method): Let Yₙ be a sequence of random variables
that satisfies

√n (Yₙ − θ) →^{d} N(0, σ²).

For a given function g(·) and a specific value of θ, suppose that g′(θ) exists and
g′(θ) ≠ 0. Then,

√n [ g(Yₙ) − g(θ) ] →^{d} N( 0, σ² [g′(θ)]² ).

Proof: The first-order Taylor expansion of g(Yₙ) about Yₙ = θ is

g(Yₙ) = g(θ) + g′(θ)(Yₙ − θ) + R,

where R is the remainder and R →^{p} 0 as Yₙ →^{p} θ.
Then,

√n [ g(Yₙ) − g(θ) ] ≈ g′(θ) √n (Yₙ − θ)  as n → ∞,  where √n (Yₙ − θ) →^{d} N(0, σ²),

and therefore,

√n [ g(Yₙ) − g(θ) ] →^{d} g′(θ) N(0, σ²) = N( 0, [g′(θ)]² σ² ).

Implicitly, what we need here is that Yₙ →^{p} θ, as this makes sure that R →^{p} 0.
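An added simulation sketch of the delta method, taking Yₙ to be a sample mean and g(y) = log(y) (both choices are arbitrary illustrations, not from the slides):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(12)
    theta, sigma, n, reps = 2.0, 1.0, 400, 100_000

    # Y_n is a sample mean here, so sqrt(n)(Y_n - theta) is approximately N(0, sigma^2)
    y_n = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)

    g = np.log                                  # g(y) = log y, so g'(theta) = 1/theta
    stat = np.sqrt(n) * (g(y_n) - g(theta))

    limit_sd = sigma / theta                    # sd of the N(0, sigma^2 g'(theta)^2) limit
    for q in (0.05, 0.5, 0.95):
        print(np.quantile(stat, q), stats.norm(scale=limit_sd).ppf(q))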
(Bilkent) ECON509 This Version: 25 November 2013 81 / 94
Convergence Concepts
The Delta Method

In some cases, one might have

g′(θ) = 0.

In this case, the delta method as proposed above will not work.
However, this problem can be solved by using a second-order delta method.
Consider the second-order expansion

g(Yₙ) = g(θ) + g′(θ)(Yₙ − θ) + (g″(θ)/2)(Yₙ − θ)² + R.

As before, R →^{p} 0 as Yₙ →^{p} θ. However, this time g′(θ) = 0.
So,

g(Yₙ) − g(θ) ≈ (g″(θ)/2)(Yₙ − θ)²,  as n → ∞.

(Bilkent) ECON509 This Version: 25 November 2013 82 / 94


Convergence Concepts
The Delta Method

Now,

√n (Yₙ − θ)/σ →^{d} N(0, 1),

which implies that

n [ (Yₙ − θ)/σ ]² →^{d} χ²₁.

Hence,

n [ g(Yₙ) − g(θ) ] ≈ (g″(θ)/2) n (Yₙ − θ)²  as n → ∞,  where n (Yₙ − θ)² →^{d} σ² χ²₁,

and, therefore,

n [ g(Yₙ) − g(θ) ] →^{d} (g″(θ)/2) σ² χ²₁.

(Bilkent) ECON509 This Version: 25 November 2013 83 / 94


Convergence Concepts
The Delta Method

The next Theorem follows.
Theorem (5.5.26) (Second-Order Delta Method): Let Yₙ be a sequence of
random variables that satisfies

√n (Yₙ − θ) →^{d} N(0, σ²).

For a given function g and a specific value of θ, suppose that g′(θ) = 0, g″(θ)
exists and g″(θ) ≠ 0. Then,

n [ g(Yₙ) − g(θ) ] →^{d} (g″(θ)/2) σ² χ²₁.

(Bilkent) ECON509 This Version: 25 November 2013 84 / 94


Convergence Concepts
Some More Large Sample Results

So far we have worked with iid random sequences only. We now introduce large
sample results for a different class of random sequences. The following is largely based
on White (2001).
Let’s start with independent, heterogeneously distributed random variables.
The failure of the identical distribution assumption results from stratifying
(grouping) the population in some way. The independence assumption remains valid
provided that sampling within and across the strata is random.
Theorem (Markov’s Law of Large Numbers): Let X₁, ..., Xₙ be a sequence of
independent random variables, with E[Xᵢ] = µᵢ < ∞, for all i. If for some δ > 0,

∑_{i=1}^{∞} ( E[ |Xᵢ − µᵢ|^{1+δ} ] / i^{1+δ} ) < ∞,

then

(1/n) ∑_{i=1}^n Xᵢ − (1/n) ∑_{i=1}^n µᵢ →^{a.s.} 0.

(Bilkent) ECON509 This Version: 25 November 2013 85 / 94


Convergence Concepts
Some More Large Sample Results

Notice some important differences.
First, note that we have relaxed the iid assumption: the sequence is not identically
distributed. Instead, we have heterogeneously distributed random variables. This
comes at a cost: we now have to ensure that moments of order higher than one
are also bounded. The following Corollary makes this easier to see.
Corollary 3.9 (White (2001)): Let X₁, ..., Xₙ be a sequence of independent
random variables such that E[ |Xᵢ|^{1+δ} ] < ∞ for some δ > 0 and all i. Then,

(1/n) ∑_{i=1}^n Xᵢ − (1/n) ∑_{i=1}^n µᵢ →^{a.s.} 0.

Remember that Kolmogorov’s strong law only requires the existence of the first-order
moment.
So, moving away from the iid assumption usually comes at the cost of stronger
moment assumptions.
The Xᵢ are heterogeneous, so their expected values are not identical. So, now we have
n⁻¹ ∑_{i=1}^n µᵢ lying around. Compare this with µ in the iid case, where

(1/n) ∑_{i=1}^n E[Xᵢ] = (1/n) ∑_{i=1}^n µ = µ.

(Bilkent) ECON509 This Version: 25 November 2013 86 / 94


Convergence Concepts
Some More Large Sample Results

There is also a CLT associated with independent, heterogeneously distributed random
sequences. The following is based on Theorem 5.6 from White (2001). For
conciseness, define

µ̄ = (1/n) ∑_{i=1}^n E[Xᵢ]  and  σ̄² = (1/n) ∑_{i=1}^n Var(Xᵢ),

i.e. the average mean and the average variance, respectively.
Theorem (Lindeberg-Feller): Let X₁, ..., Xₙ be a sequence of independent random
scalars with E[Xᵢ] = µᵢ < ∞, Var(Xᵢ) = σᵢ², where 0 < σᵢ² < ∞, and distribution
functions Fᵢ, i = 1, 2, ... . Then,

√n (X̄ − µ̄)/σ̄ →^{d} N(0, 1)

if and only if for every ε > 0,

lim_{n→∞} (1/σ̄²) (1/n) ∑_{i=1}^n ∫_{(x−µᵢ)² > ε n σ̄²} (x − µᵢ)² dFᵢ(x) = 0.   (5)

Now, this looks a lot more complicated!

(Bilkent) ECON509 This Version: 25 November 2013 87 / 94


Convergence Concepts
Some More Large Sample Results

Again, let’s proceed step by step.
First of all, the normalised term that converges in distribution is almost the same as
the one under the iid assumption. The only difference is that we now use the
average mean and the average variance. Not surprising, since we are dealing with
heterogeneous random variables.
Second, the conditions of the Theorem seem to be very similar to the CLT for iid
processes, except for (5).
This is known as the Lindeberg condition and it requires the average contribution of
the extreme tails to the variance of Xᵢ to be zero in the limit.
The integral is actually the contribution to the variance of Xᵢ from the region where

(x − µᵢ)²/(n σ̄²) = (x − µᵢ)²/∑_{i=1}^n Var(Xᵢ) > ε.

(Bilkent) ECON509 This Version: 25 November 2013 88 / 94


Convergence Concepts
Some More Large Sample Results

There is a closely related result with conditions that are easier to check. This is given
in Theorem 5.10 (White, 2001).
Theorem (Liapounov): Let X₁, ..., Xₙ be a sequence of independent random scalars
with E[Xᵢ] = µᵢ < ∞, Var(Xᵢ) = σᵢ² < ∞ and

E[ |Xᵢ − µᵢ|^{2+δ} ] < ∞, for some δ > 0 and all i.

If

σ̄² > 0

for all n sufficiently large, then

√n (X̄ − µ̄)/σ̄ →^{d} N(0, 1).

The conditions of this result are easier to follow. We now need the existence of even
higher-order moments (more than the second order).

(Bilkent) ECON509 This Version: 25 November 2013 89 / 94


Convergence Concepts
Order Notation

We finish this part by introducing the order notation.
This so-called “big-O, little-o” notation is used to describe the speed at which random
variables approach some bounded limit.
Put differently, the notation is related to the order of magnitude of terms, as n → ∞.
Let f(x) and g(x) be two functions.
If

f(x)/g(x) → 0  as x → ∞,

then f is of smaller order than g and we write

f(x) = o{g(x)}.

If

lim_{x→∞} |f(x)/g(x)| ≤ constant,

then we will write

f(x) = O{g(x)}.

(Bilkent) ECON509 This Version: 25 November 2013 90 / 94


Convergence Concepts
Order Notation

There is also a corresponding notation for random variables.
Let X₁, X₂, ... be a sequence of random variables and f be a real function.
If

Xₙ/f(n) →^{p} 0,

then we write

Xₙ = o_p{f(n)},

while if

Xₙ/f(n) →^{p} X, where X is a constant,

we write

Xₙ = O_p{f(n)}.

The same applies to almost sure convergence.

(Bilkent) ECON509 This Version: 25 November 2013 91 / 94


Convergence Concepts
Order Notation

Consider the following two important observations.
If

(1/n) ∑_{i=1}^n Xᵢ →^{p} µ,

then

(1/n) ∑_{i=1}^n Xᵢ = µ + o_p(1).

Moreover, if

(1/√n) ∑_{i=1}^n (Xᵢ − µ)/σ →^{d} N(0, 1),

then

(1/√n) ∑_{i=1}^n (Xᵢ − µ)/σ = O_p(1).
(Bilkent) ECON509 This Version: 25 November 2013 92 / 94


Convergence Concepts
Order Notation

You might find it not so clear at first, but if

Xₙ = o(n^λ),

for some λ, then

Xₙ = O(n^λ).

To see this, notice that

Xₙ/n^λ → 0 = O(1).

So,

Xₙ/n^λ = O(1)  ⟺  Xₙ = O(n^λ).

(Bilkent) ECON509 This Version: 25 November 2013 93 / 94


Convergence Concepts
Order Notation

Now, consider

Xₙ = O(n^λ)  ⟹  Xₙ/n^λ = O(1).

Then, for any δ > 0,

Xₙ/n^{λ+δ} = o(1)  ⟹  Xₙ = o(n^{λ+δ}).

It might now be obvious to you that, if

Xₙ → 0,

then

Xₙ = o(1).

(Bilkent) ECON509 This Version: 25 November 2013 94 / 94
