Slides 5
Bilkent
In this part of the lecture notes, we will focus on properties of random samples and consider some important statistics and their distributions, which you will encounter in your future econometrics courses.
In addition, we will also introduce key convergence concepts, which are at the foundation of asymptotic theory. You might find these concepts a bit abstract at first, but they are used in econometrics quite frequently.
where we assume that the population pdf or pmf is a member of a parametric family, and θ is the vector of parameters.
The random sampling model in Definition 5.1.1 is sometimes called sampling from an infinite population.
Suppose we obtain the values of X1, ..., Xn sequentially.
First, the experiment is performed and X1 = x1 is observed.
Then, the experiment is repeated and X2 = x2 is observed.
The assumption of independence implies that the probability distribution of X2 is unaffected by the fact that X1 = x1 was observed first.
Suppose we have drawn a sample (X1, ..., Xn) from the population. We might want to obtain some summary statistics.
This summary might be defined as a function
T(x1, ..., xn),
such as the sample mean
X̄ = (1/n) ∑_{i=1}^n Xi.
Theorem (5.2.4): Let x1, ..., xn be any numbers and x̄ = (x1 + ... + xn)/n. Then,
1 min_a ∑_{i=1}^n (xi - a)² = ∑_{i=1}^n (xi - x̄)²,
2 (n - 1) s² = ∑_{i=1}^n (xi - x̄)² = ∑_{i=1}^n xi² - n x̄².
Proof: To prove part (a), consider
∑_{i=1}^n (xi - a)² = ∑_{i=1}^n (xi - x̄ + x̄ - a)²
= ∑_{i=1}^n (xi - x̄)² + 2 ∑_{i=1}^n (xi - x̄)(x̄ - a) + ∑_{i=1}^n (x̄ - a)²
= ∑_{i=1}^n (xi - x̄)² + ∑_{i=1}^n (x̄ - a)²,
since the cross term vanishes: ∑_{i=1}^n (xi - x̄) = 0. The second term is nonnegative and equals zero exactly when
a = x̄,
so the minimum is attained at a = x̄.
Part (b) can easily be proved by expanding the binomial and taking the sum.
Lemma (5.2.5): Let X1, ..., Xn be a random sample from a population and let g(x) be a function such that E[g(X1)] and Var(g(X1)) exist. Then
E[∑_{i=1}^n g(Xi)] = n E[g(X1)]
and
Var(∑_{i=1}^n g(Xi)) = n Var(g(X1)).
Proof: This is straightforward. First,
E[∑_{i=1}^n g(Xi)] = ∑_{i=1}^n E[g(Xi)] = ∑_{i=1}^n E[g(X1)] = n E[g(X1)],
since the Xi are identically distributed. Note that independence is not required here.
Then,
Var(∑_{i=1}^n g(Xi)) = ∑_{i=1}^n Var(g(Xi)) + ∑∑_{i≠j} Cov(g(Xi), g(Xj))
= ∑_{i=1}^n Var(g(Xi)) + 0 = ∑_{i=1}^n Var(g(X1)) = n Var(g(X1)),
due to independence and the fact that Var(g(Xi)) is the same for all i, their distributions being identical.
Theorem (5.2.6): Let X1, ..., Xn be a random sample from a population with mean µ and variance σ² < ∞. Then,
1 E[X̄] = µ,
2 Var(X̄) = σ²/n,
3 E[S²] = σ².
Proof: Now,
E[X̄] = E[(1/n) ∑_{i=1}^n Xi] = (1/n) ∑_{i=1}^n E[Xi] = (1/n) nµ = µ.
Then,
Var(X̄) = Var((1/n) ∑_{i=1}^n Xi) = (1/n²) ∑_{i=1}^n Var(Xi) = (1/n²) nσ² = σ²/n.
Finally,
E[S²] = E[(1/(n - 1)) ∑_{i=1}^n (Xi - X̄)²] = (1/(n - 1)) E[∑_{i=1}^n Xi² - n X̄²]
= (1/(n - 1)) {∑_{i=1}^n E[Xi²] - n E[X̄²]}
= (1/(n - 1)) {n (σ² + µ²) - n (σ²/n + µ²)}
= (1/(n - 1)) (nσ² + nµ² - σ² - nµ²) = (n - 1) σ²/(n - 1) = σ².
An aside: You might already know that, since E [X̄ ] = µ and E [S 2 ] = σ2 , one
would refer to X̄ and S 2 as the unbiased estimators for µ and σ2 , respectively.
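All three results in Theorem (5.2.6) are easy to check by simulation. Below is a minimal sketch, not part of the original notes; the normal population and all parameter values are arbitrary illustrative choices:

```python
import random
import statistics

# Monte Carlo check of Theorem (5.2.6): E[Xbar] = mu, Var(Xbar) = sigma^2/n,
# E[S^2] = sigma^2. Population: N(mu, sigma^2), with arbitrary settings.
random.seed(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 20000

xbars, s2s = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbars.append(statistics.fmean(sample))
    s2s.append(statistics.variance(sample))  # divides by n - 1, i.e. S^2

print(statistics.fmean(xbars))     # close to mu = 2
print(statistics.variance(xbars))  # close to sigma^2/n = 0.9
print(statistics.fmean(s2s))       # close to sigma^2 = 9
```

Note that `statistics.variance` already uses the n - 1 divisor, matching the definition of S².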
Z = X1 + ... + Xn ,
In particular, if X1 , ..., Xn all have the same distribution with mgf MX (t ), then
MZ (t ) = [MX (t )]n .
Now the sequence we have is actually X1/n, ..., Xn/n. Observe that if
E[e^{tX}] = M_X(t),
then
E[e^{t(X/n)}] = E[e^{(t/n)X}] = M_X(t/n).
Then, for Z = X̄, Theorem (4.6.7) gives
M_X̄(t) = [M_X(t/n)]^n = [e^{µ(t/n) + σ²(t/n)²/2}]^n = e^{µt + (σ²/n)t²/2}.
Thus, X̄ ∼ N(µ, σ²/n).
In this example, it was helpful to use Theorem (5.2.7) because the expression for
MX̄ (t ) turned out to be a familiar mgf. It cannot, of course, be guaranteed that this
will always be the case for any X̄ . However, when this is the case, this result makes
derivation of the distribution of X̄ very easy.
Another example is the sample mean for a gamma(α, β) sample. The mgf of X̄ in this case is given by
M_X̄(t) = [(1/(1 - β(t/n)))^α]^n = (1/(1 - (β/n)t))^{nα},
which reveals that X̄ ∼ gamma(nα, β/n).
Of course, there are cases where this method is not useful, because the mgf of X̄ is not recognisable. Remedies for such cases are considered in Casella and Berger (2001, pp. 215-217). We will not cover these.
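The gamma result can be checked by matching the first two moments of simulated sample means against those of gamma(nα, β/n). A sketch with arbitrary parameter values, using Python's shape/scale parametrisation (mean αβ, variance αβ²), which matches the parametrisation above:

```python
import random
import statistics

# Check X̄ ~ gamma(n*alpha, beta/n) via its first two moments:
# E[X̄] = (n*alpha)(beta/n) = alpha*beta, Var(X̄) = (n*alpha)(beta/n)^2 = alpha*beta^2/n.
# All parameter values are arbitrary illustrative choices.
random.seed(1)
alpha, beta, n, reps = 2.0, 1.5, 8, 40000

xbars = [
    statistics.fmean(random.gammavariate(alpha, beta) for _ in range(n))
    for _ in range(reps)
]

print(statistics.fmean(xbars))     # close to alpha*beta = 3.0
print(statistics.variance(xbars))  # close to alpha*beta^2/n = 0.5625
```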
(Bilkent) ECON509 This Version: 25 November 2013 14 / 94
Sampling from the Normal Distribution
Now, we focus on the case where the population distribution is the normal
distribution.
Results we introduce in this section will be very useful when you deal with linear regression models.
We have already talked about distributional properties of the sample mean and
variance. Given the extra assumption of normality, we are now in a position to
determine their full distributions.
The chi squared distribution will come up frequently within this context, so we start
by introducing some important properties related to this distribution.
Remember that the chi squared pdf is a special case of the gamma pdf and is given by
f(x) = [1/(Γ(p/2) 2^{p/2})] x^{(p/2)-1} e^{-x/2}, 0 < x < ∞,
where p is called the degrees of freedom.
Lemma (5.3.2) (Facts about chi squared random variables): We use the notation χ²_p to denote a chi squared random variable with p degrees of freedom.
a. If Z is a N(0, 1) random variable, then
Z² ∼ χ²_1.
In other words, the square of a standard normal random variable is a chi squared random variable.
b. If X1, ..., Xn are independent and Xi ∼ χ²_{pi}, then X1 + ... + Xn ∼ χ²_{p1+...+pn}; that is, independent chi squared variables add to a chi squared variable, and the degrees of freedom also add.
Therefore,
S² = (1/(n - 1)) ∑_{i=1}^n (Xi - X̄)² = (1/(n - 1)) {[∑_{i=2}^n (Xi - X̄)]² + ∑_{i=2}^n (Xi - X̄)²},
using X1 - X̄ = -∑_{i=2}^n (Xi - X̄).
Then,
f(y1, ..., yn) = [n/(2π)^{n/2}] e^{-(1/2)[(y1 - ∑_{i=2}^n yi)² + ∑_{i=2}^n (y1 + yi)²]}
= [n/(2π)^{n/2}] e^{-(1/2)(y1 - ∑_{i=2}^n yi)²} e^{-(1/2) ∑_{i=2}^n (y1 + yi)²}
= {[n/(2π)]^{1/2} e^{-n y1²/2}} {[n^{1/2}/(2π)^{(n-1)/2}] e^{-(1/2)[∑_{i=2}^n yi² + (∑_{i=2}^n yi)²]}},
which factorises into a function of y1 alone and a function of (y2, ..., yn) alone.
Let X̄k and Sk² denote the sample mean and sample variance based on the first k observations.
You will be asked to show in a homework assignment that
(n - 1) Sn² = (n - 2) S²_{n-1} + [(n - 1)/n] (Xn - X̄_{n-1})².  (1)
Suppose that, for n = k,
(k - 1) Sk² ∼ χ²_{k-1}.
Then, by (1),
k S²_{k+1} = (k - 1) Sk² + [k/(k + 1)] (X_{k+1} - X̄k)².
What if [k/(k + 1)] (X_{k+1} - X̄k)² were distributed χ²_1, independent of Sk²? Then, by part (b) of Lemma (5.3.2), k S²_{k+1} will be χ²_k! So, proving (2) will finally prove part (c) of the Theorem.
First, notice that the vector (X_{k+1}, X̄k) is independent of Sk², since we have already shown that X̄k is independent of Sk². Moreover, Sk² is a function of X1, ..., Xk only and so is independent of X_{k+1}.
A nice result about normally distributed random variables is that zero covariance implies independence, which does not necessarily hold for other distributions.
Lemma (5.3.3): Let Xj ∼ N(µj, σj²), j = 1, ..., n, be independent. For constants aij and brj (j = 1, ..., n; i = 1, ..., k; r = 1, ..., m), where k + m ≤ n, define
Ui = ∑_{j=1}^n aij Xj, i = 1, ..., k,
Vr = ∑_{j=1}^n brj Xj, r = 1, ..., m.
Again, we consider the case where Xj ∼ N(0, 1), for simplicity. Then,
Cov(Ui, Vr) = E[Ui Vr] = E[(∑_{j=1}^n aij Xj)(∑_{j=1}^n brj Xj)] = E[∑_{j=1}^n aij brj Xj²] = ∑_{j=1}^n aij brj,
since E[Xj Xl] = 0 for j ≠ l and E[Xj²] = 1.
u = a1 x1 + a2 x2 and v = b1 x1 + b2 x2 ,
which imply
x1 = (b2 u - a2 v)/(a1 b2 - b1 a2) and x2 = (a1 v - b1 u)/(a1 b2 - b1 a2).
The Jacobian of the transformation is
J = det [ b2/(a1 b2 - b1 a2)   -a2/(a1 b2 - b1 a2) ; -b1/(a1 b2 - b1 a2)   a1/(a1 b2 - b1 a2) ] = (b2 a1 - b1 a2)/(a1 b2 - b1 a2)² = 1/(a1 b2 - b1 a2).
Therefore,
f_{U,V}(u, v) = (1/(2π)) exp{-[1/(2 (a1 b2 - b1 a2)²)] [(b2 u - a2 v)² + (a1 v - b1 u)²]} · |1/(a1 b2 - b1 a2)|,
where -∞ < u, v < ∞.
Now, expand the exponent:
(b2 u - a2 v)² + (a1 v - b1 u)² = (b1² + b2²) u² - 2 (a1 b1 + a2 b2) uv + (a1² + a2²) v².
But we know that a1 b1 + a2 b2 = 0, so the cross term in uv vanishes. Hence, this shows that the joint pdf factorises into a function of u and a function of v. Therefore, U and V are independent!
A similar technique can be utilised to prove part (b). Specifically, by using a transformation argument, it can be shown that the joint pdf of the vectors (U1, ..., Uk) and (V1, ..., Vm) factorises, which proves independence. We skip this.
The main message here is that, if we start with independent normal random variables, then zero covariance and independence are equivalent for linear functions of these random variables. Therefore, checking independence comes down to checking covariances.
Part (b), on the other hand, makes it possible to infer overall independence of
normal vectors by just checking pairwise independence, which is not valid for general
random variables.
Thinking about part (a), I find it intuitively easy to follow: the normal distribution is completely determined by its mean and variance, while other moments do not matter. Hence, ensuring zero covariance is equivalent to ensuring independence, since we do not have to care about the remaining moments of the distribution.
We can show the usefulness of Lemma 5.3.3 by trying to prove the independence of
X̄ and S 2 when X1 , ..., Xn is sampled from a normal population distribution.
Remember that we can write S 2 as a function of (X2 X̄ , ..., Xn X̄ ) . Now, if we
can show that these random variables are not correlated with X̄ , then by the
normality assumption and Lemma 5.3.3, we can conclude independence.
We have
X̄ = (1/n) ∑_{i=1}^n Xi and Xj - X̄ = ∑_{i=1}^n (δij - 1/n) Xi,
where
δij = 1 if i = j, and δij = 0 otherwise.
Now,
∑_{i=1}^n (1/n)(δij - 1/n) = 1/n - n (1/n²) = 0.
Hence,
Cov(X̄, Xj - X̄) = 0,
and so, X̄ and Xj - X̄ are independent for all j, which yields the desired result.
You will see (probably next term) that the results derived so far are useful when the
variance of the distribution is known.
When σ² is unknown, as is the case in most practical situations, it is usually replaced by some unbiased estimator, say σ̂², of σ².
However, the above results then lose their relevance.
We will now talk about two important distributions that are relevant for such cases.
(X̄ - µ)/(σ/√n) ∼ N(0, 1).
However, as we just argued, σ is usually unknown. What to do then?
W. S. Gosset (widely known as “Student”) studied the distribution of
(X̄ - µ)/(S/√n).
f_T(t) = [Γ((p + 1)/2)/(√(pπ) Γ(p/2))] · 1/(1 + t²/p)^{(p+1)/2}, -∞ < t < ∞.
Remember that Student’s t does not have moments of all orders. In particular, if
X tp , then
E [jX jq ] < ∞ for q < p.
Therefore, t1 has no mean, t2 has no variance, etc.
Let’s try to derive the pdf. Since U and V are independent, we have
1 u 2 /2 1
fU ,V (u, v ) = p e v (p/2 ) 1
e v /2
,
2π Γ (p/2) 2p/2
where ∞ < u < ∞ and 0 < v < ∞.
We have the following transformation functions:
t = u/√(v/p) and w = v,
implying that
v = w and u = t √(w/p).
Then, the Jacobian of the transformation is given by
J = det [ √(w/p)   t (2p)^{-1} (w/p)^{-1/2} ; 0   1 ] = √(w/p).
But the integrand is the kernel of a gamma((p + 1)/2, 2/(1 + t²/p)) pdf. Therefore,
f_T(t) = [1/√(2π)] [1/(Γ(p/2) 2^{p/2})] (1/√p) Γ((p + 1)/2) [2/(1 + t²/p)]^{(p+1)/2}
= [Γ((p + 1)/2)/(√(pπ) Γ(p/2))] [1/(1 + t²/p)]^{(p+1)/2},
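The construction behind this derivation, T = U/√(V/p) with U standard normal and V an independent χ²_p, can be checked by simulation: the variance of the simulated T should be close to p/(p - 2), the known variance of t_p for p > 2. A sketch with arbitrary settings:

```python
import math
import random

# Build T = U / sqrt(V/p) from U ~ N(0,1) and an independent V ~ chi^2_p
# (V generated as a sum of p squared standard normals), then check that
# Var(T) ≈ p/(p - 2). The value of p and the replication count are arbitrary.
random.seed(3)
p, reps = 8, 60000

ts = []
for _ in range(reps):
    u = random.gauss(0, 1)
    v = sum(random.gauss(0, 1) ** 2 for _ in range(p))  # chi^2 with p df
    ts.append(u / math.sqrt(v / p))

var_t = sum(t * t for t in ts) / reps  # E[T] = 0, so Var(T) ≈ E[T^2]
print(var_t)  # close to p/(p - 2) = 8/6 ≈ 1.333
```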
Now, consider another important derived distribution, Snedecor’s F , named after Sir
Ronald Fisher. It arises naturally as the distribution of a ratio of variances.
Example (5.3.5): Let X1, ..., Xn be a random sample from a N(µX, σ²X) population, and let Y1, ..., Ym be a random sample from an independent N(µY, σ²Y) population. If we were comparing the variability of the two populations, one quantity of interest would be the ratio σ²X/σ²Y. Information about this ratio is contained in the ratio of sample variances,
S²X/S²Y,
Examination of (3) shows us how the F distribution is derived. The ratios SX2 /σ2X
and SY2 /σ2Y are each scaled chi squared variates, and they are independent.
The tools you will learn in this section are the fundamental building blocks of what
is known as “asymptotic theory” or “large sample theory.” Although a bit abstract
at …rst sight, these results are at the core of many proofs you will encounter in
econometrics articles.
The three important concepts we will consider in what follows are
1 almost sure convergence,
2 convergence in probability,
3 convergence in distribution.
In the first instance we will consider the case where the sequence of random variables X1, ..., Xn exhibits the iid property. This is one (and the simplest) of many possible dependence settings.
Remember that random variables are functions defined on the sample space, e.g. X(ω). Our interest will be in a sequence of random variables indexed by sample size, i.e. Xn(ω).
To motivate the following discussion, consider pointwise convergence:
This time, we have a difference. Convergence fails on a very small set E such that P(E) = 0.
That P (E ) = 0 is due to the set being so small that we can safely assign zero
probability to the set.
Remember that in earlier lectures we have stated that for a continuously distributed
random variable, the probability of a single point is always equal to zero. This is
similar, in spirit, to the situation at hand.
This type of convergence is also called convergence almost everywhere and
convergence with probability 1.
The following notation is common:
Xn →^{a.s.} X, Xn →^{wp1} X, or lim_{n→∞} Xn = X a.s.
Note that, at the cost of sloppy notation, the argument of Xn (ω ) is usually dropped.
Also, Xn (ω ) need not converge to a function. It can also simply converge to some
constant, say, a.
Convergence Concepts
Almost Sure Convergence
Then, almost sure convergence is a statement on the joint distribution of the entire sequence {Zi(ω)}.
White (2001, p.19): "The probability measure P determines the joint distribution of the entire sequence {Zi(ω)}. A sequence Xn(ω) converges almost surely if the probability of obtaining a realisation of the sequence {Zi(ω)} for which convergence to X(ω) occurs is unity."
Whenever you see the terms "almost sure," "with probability one," or "almost everywhere," you should remember that the relationship referred to holds everywhere except on some set with zero probability.
Now that we have a convergence concept in our arsenal, the next question is: under
which conditions can we use this result for statistics of interest?
One important result is Kolmogorov’s Strong Law of Large Numbers (SLLN).
Theorem (Kolmogorov’s Strong Law of Large Numbers): If X1 , X2 , ... are
independent and identically distributed random variables and the common mean is
…nite, i.e. E [jX1 j] = µ < ∞, then
1 n 1 n a.s .
lim
n !∞ n
∑ Xi = µ a.s. or
n i∑
Xi ! µ.
i =1 =1
Note that the sample size will never be equal to ∞. However, when the sample size
is large enough, the sample mean will be very close to the population mean, µ.
What is remarkable about this result is that, as long as we have iid random
variables, the only condition is that the random variables have …nite mean.
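The SLLN in action can be sketched with a few lines of simulation; the Bernoulli(0.3) population below is an arbitrary example with mean 0.3:

```python
import random

# Sketch of the SLLN: sample means of iid draws approach the population mean
# as n grows. Bernoulli(0.3) is an arbitrary choice with E[X] = 0.3.
random.seed(4)
mu = 0.3

def sample_mean(n):
    return sum(1.0 if random.random() < mu else 0.0 for _ in range(n)) / n

for n in (100, 10000, 1000000):
    print(n, sample_mean(n))  # deviations from 0.3 typically shrink with n
```

Of course, no finite simulation can verify the "almost sure" part itself; the code only illustrates that large-n sample means sit very close to µ.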
For the time being, suffice it to say that one can relax the iid assumption, but that means we will require more than just finite means. There is a trade-off between the amount of dependence and the strength of the moment assumptions you have to make in order to attain convergence.
An equivalent statement is that given e > 0 and δ > 0 there exists an N, which depends on both δ and e, such that P(|Xn - X| > e) < δ for all n > N. This is merely a restatement using the formal definition of a limit.
One could also write
It is not easy to see the difference between the two modes of convergence. However, let's give it a try!
Almost sure convergence states that we have pointwise convergence for all ω ∈ Ω except for a small, zero-measure set E. Importantly, this set is independent of n.
Convergence in probability states that as the sample size goes to ∞, the probability that Xn deviates from X by more than e decreases towards zero. However, for any sample size, there is a positive probability that Xn will deviate by more than e. In other words, for some ω ∈ En ⊂ Ω with P(En) > 0, |Xn - X| will be larger than e.
Importantly, there is nothing that restricts En to be the same for all n. Hence, the set on which Xn deviates from X may change as n increases.
The good news is that the deviation probability goes to zero, so P(En) → 0.
Xn = (1/n) ∑_{i=1}^n Zi.
White (2001, p.24): "With almost sure convergence, the probability measure P takes into account the joint distribution of the entire sequence {Zi}, but with convergence in probability, we only need concern ourselves sequentially with the joint distribution of the elements of {Zi} that actually appear in Xn, typically the first n."
X1(ω) = ω + I_{[0,1]}(ω), X2(ω) = ω + I_{[0,1/2]}(ω),
X3(ω) = ω + I_{[1/2,1]}(ω), X4(ω) = ω + I_{[0,1/3]}(ω),
X5(ω) = ω + I_{[1/3,2/3]}(ω), X6(ω) = ω + I_{[2/3,1]}(ω),
etc. Let X(ω) = ω.
Do we have convergence in probability? Observe that
X1(ω) - X(ω) = I_{[0,1]}(ω), X2(ω) - X(ω) = I_{[0,1/2]}(ω),
X3(ω) - X(ω) = I_{[1/2,1]}(ω), X4(ω) - X(ω) = I_{[0,1/3]}(ω),
X5(ω) - X(ω) = I_{[1/3,2/3]}(ω), X6(ω) - X(ω) = I_{[2/3,1]}(ω),
etc.
Example (5.5.7): Let the sample space Ω be the closed interval [0, 1] with the uniform probability distribution. Define random variables
Xn(ω) = ω + ω^n and X(ω) = ω.
as short hand.
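For Example (5.5.7), the deviation can be handled analytically: |Xn(ω) - X(ω)| = ω^n, so P(|Xn - X| ≥ e) = P(ω ≥ e^{1/n}) = 1 - e^{1/n}, which goes to 0 as n → ∞. A small script confirms both the formula and a Monte Carlo estimate; the value of e and the grid of n are arbitrary illustrative choices:

```python
import random

# |X_n(w) - X(w)| = w**n with w ~ Uniform[0,1], so
# P(|X_n - X| >= eps) = P(w >= eps**(1/n)) = 1 - eps**(1/n) -> 0 as n -> inf.
eps = 0.05

for n in (1, 10, 100, 1000):
    print(n, 1 - eps ** (1 / n))  # decreases towards 0

# Monte Carlo cross-check at n = 100
random.seed(5)
hits = sum(random.random() ** 100 >= eps for _ in range(100000))
print(hits / 100000)  # close to 1 - 0.05**(1/100)
```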
Associated with convergence in probability is the Weak Law of Large Numbers
(WLLN)
Theorem (Weak Law of Large Numbers): If X1, X2, ... are iid random variables with common mean µ < ∞ and variance σ² < ∞, then
(1/n) ∑_{i=1}^n Xi →^p µ,
as n → ∞.
Proof: The proof uses Chebychev's Inequality. Remember this says that if X is a random variable and if g(x) is a nonnegative function, then, for any r > 0,
P(g(X) ≥ r) ≤ E[g(X)]/r.
Now, consider
P(|X̄n - µ| ≥ e) = P((X̄n - µ)² ≥ e²) ≤ E[(X̄n - µ)²]/e² = Var(X̄n)/e² = σ²/(ne²),
where we use r = e² and g(X̄n) = (X̄n - µ)².
The above result implies that
P(|X̄n - µ| < e) = 1 - P(|X̄n - µ| ≥ e) ≥ 1 - σ²/(ne²),
and
lim_{n→∞} [1 - σ²/(ne²)] = 1.
Hence,
lim_{n→∞} P(|X̄n - µ| < e) = 1.
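The Chebychev bound used in this proof can be checked empirically; the N(0, 1) population, n, and e below are arbitrary choices:

```python
import random

# Check the bound from the WLLN proof:
# P(|Xbar_n - mu| >= eps) <= sigma^2 / (n * eps^2).
# Population N(0, 1); n, eps and the replication count are arbitrary.
random.seed(6)
n, eps, reps = 50, 0.3, 20000
mu, sigma2 = 0.0, 1.0

exceed = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    if abs(xbar - mu) >= eps:
        exceed += 1

empirical = exceed / reps
bound = sigma2 / (n * eps * eps)
print(empirical, bound)  # empirical probability sits below the bound
```

As is typical of Chebychev, the bound is quite loose: the empirical deviation probability is far smaller than σ²/(ne²).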
This is perhaps a good time to stop and reflect a little bit on these new concepts.
What the Weak and Strong LLNs are saying is that under certain conditions, the
sample mean converges to the population mean as n ! ∞. This is known as
consistency: one would say that the sample mean is a consistent estimator of the
population mean.
In actual applications, this means that if the sample size is large enough, then the sample mean is close to the population mean. So n does not have to be that close to infinity. On the other hand, as mentioned at the beginning, "how large" the sample size should be in order to be considered a "large enough" sample is a different question in its own right. We will not deal with this here.
Sometimes, consistency is compared to unbiasedness.
An estimator β̂ of a population value β is an unbiased estimator if and only if
E [ β̂] = β.
What are the things we might want to estimate? One example would be parameters
of a distribution family. For example, we might know that the data are distributed
with N µ, σ2 , but we may not know the particular values of µ and σ2 . In this case,
we would estimate these parameters.
Analytically, the plim operator is much more convenient to deal with than the expectation operator. For example, for two sequences X1, X2, ... and Y1, Y2, ...,
plim_{n→∞} (Xn/Yn) = (plim_{n→∞} Xn)/(plim_{n→∞} Yn),
while we usually have
E[Xn/Yn] ≠ E[Xn]/E[Yn].
However, one concept does not imply the other. In other words, a consistent estimator can be biased, while an unbiased estimator can be inconsistent.
Suppose we are trying to estimate the population parameter β. Consider the
following estimators.
1 β̂ = β + 20/n: consistent but biased.
2 β̂ = X where P(X = β + 100) = P(X = β - 100) = .50: unbiased but inconsistent.
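These two toy estimators are easy to simulate; β = 1 below is an arbitrary choice of the truth:

```python
import random

# Two toy estimators of beta (here beta = 1, an arbitrary truth):
# (1) beta + 20/n: biased for every finite n, but consistent (bias -> 0);
# (2) beta +/- 100 with probability 1/2 each: unbiased, but never
#     concentrates around beta, so it is inconsistent.
random.seed(7)
beta = 1.0

def est1(n):
    return beta + 20.0 / n

def est2():
    return beta + 100.0 if random.random() < 0.5 else beta - 100.0

print([est1(n) for n in (10, 1000, 100000)])  # approaches beta = 1
draws = [est2() for _ in range(10000)]
print(sum(draws) / len(draws))  # near beta on average (unbiased)
print(draws[0])                 # but any single draw is 100 away from beta
```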
As far as economists and most econometricians are concerned, one would not care too much about whether convergence is achieved almost surely or in probability. As long as convergence is achieved, the rest is not important.
However, in some cases it might be easier to prove the LLN for one of the two convergence types. This is no problem, as →^{a.s.} implies →^p anyway.
In addition, convergence almost surely might be slower than convergence in
probability in the sense that it might require a larger sample size before the sample
mean is close enough to the population mean.
Although we might get convergence results for some sample mean Xn and Yn , we
might actually be interested in the convergence properties of a function of these, say
g (X n , Y n ) .
Fortunately, we have the following useful result.
If (Xn , Yn ) converges almost surely (in probability) to (X , Y ) , if g (x , y ) is a
continuous function over some set D , and if the images of Ω under
[Xn (ω ) , Yn (ω )] and [X (ω ) , Y (ω )] are in D , then g (Xn , Yn ) converges almost
surely (in probability) to g (X , Y ) .
More pragmatically,
Xn →^{a.s.} X ⟹ g(Xn) →^{a.s.} g(X),
Xn →^p X ⟹ g(Xn) →^p g(X).
The next interesting question is this: now that we have found convergence results for sequences of random variables, X1, X2, ..., can we find similar results for some reasonably well-behaved function of them?
Example (5.5.5) (Consistency of S): If Sn² is a consistent estimator of σ², then the sample standard deviation Sn = √(Sn²) = h(Sn²) is a consistent estimator of σ.
Interestingly, it can be shown that Sn is, in fact, a biased estimator of σ! However, the bias disappears asymptotically.
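A quick simulation makes both points at once: E[Sn] sits below σ for small n, and the gap closes as n grows. A sketch with a N(0, 1) population (so σ = 1) and arbitrary sample sizes:

```python
import math
import random
import statistics

# S_n = sqrt(S_n^2) is biased downward for sigma, but the bias vanishes as
# n grows. Population N(0, 1), so sigma = 1; sample sizes and the replication
# count are arbitrary illustrative choices.
random.seed(8)
sigma, reps = 1.0, 20000

def mean_sn(n):
    total = 0.0
    for _ in range(reps):
        sample = [random.gauss(0, sigma) for _ in range(n)]
        total += math.sqrt(statistics.variance(sample))
    return total / reps

for n in (3, 10, 50):
    print(n, mean_sn(n))  # below sigma = 1, moving up towards 1 as n grows
```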
lim_{n→∞} E[|Xn - X|^p] = 0.
Just as a reference, Lp convergence does not imply almost sure convergence, nor does almost sure convergence imply Lp convergence. However, Lp convergence implies convergence in probability.
lim_{n→∞} F_{Xn}(x) = F_X(x),
is equivalent to
F_{Xn}(x) = P(Xn ≤ x) → 0 if x < a, and F_{Xn}(x) → 1 if x > a.
We now introduce one of the most useful theorems we have considered so far.
Theorem (5.5.15) (Central Limit Theorem): Let X1, X2, ... be a sequence of iid random variables with E[Xi] = µ < ∞ and 0 < Var(Xi) = σ² < ∞. Define
X̄n = (1/n) ∑_{i=1}^n Xi.
Let Gn(x) denote the cdf of √n (X̄n - µ)/σ. Then, for any x, -∞ < x < ∞,
lim_{n→∞} Gn(x) = ∫_{-∞}^x [1/√(2π)] e^{-y²/2} dy.
In other words,
(1/√n) ∑_{i=1}^n (Xi - µ)/σ →^d N(0, 1).
This is a powerful result! We start with the iid and finite mean and variance assumptions. In return, the Central Limit Theorem (CLT) promises us that the distribution of the properly standardised sample mean,
(1/√n) ∑_{i=1}^n (Xi - µ)/σ,
will converge to the standard normal distribution as the sample size tends to infinity.
As before, the sample size will never be equal to ∞. BUT, for large enough samples, (1/√n) ∑_{i=1}^n (Xi - µ)/σ will be approximately standard normal. As n becomes larger, this approximate result becomes more accurate.
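This approximation is easy to see in simulation: even for a skewed population, roughly 95% of the standardised sample means should fall inside ±1.96. A sketch with an exponential population (µ = σ = 1) and arbitrary n:

```python
import math
import random

# Sketch of the CLT: standardised means of a skewed population (exponential
# with mu = sigma = 1) are approximately N(0, 1), so about 95% of them should
# fall inside +/- 1.96. n and the replication count are arbitrary.
random.seed(9)
n, reps = 200, 20000
mu, sigma = 1.0, 1.0

inside = 0
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - mu) / sigma
    if abs(z) <= 1.96:
        inside += 1

print(inside / reps)  # close to 0.95
```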
As with LLNs, it is possible to obtain CLTs for non-iid data. However, this will require one to make stronger assumptions regarding the moments of the sequence of random variables. The trade-off between dependence and moment assumptions is always there.
Let’s prove the CLT. However, before we do that, we have to revisit Taylor
Expansions.
Definition (5.5.20): If a function g(x) has derivatives of order r, that is, g^(r)(x) = (d^r/dx^r) g(x) exists, then for any constant a, the Taylor polynomial of order r about a is
T_r(x) = ∑_{i=0}^r [g^(i)(a)/i!] (x - a)^i.
This polynomial is used to obtain a Taylor expansion of order r about x = a. This is given by
g(x) = T_r(x) + R,
where R = g(x) - T_r(x) is the remainder of the approximation.
Theorem (5.5.21): If g^(r)(a) = (d^r/dx^r) g(x)|_{x=a} exists, then
lim_{x→a} [g(x) - T_r(x)]/(x - a)^r = 0.
This says that the remainder, g(x) - T_r(x), always tends to zero faster than the highest-order term of the approximation.
Importantly, this also means that as x tends to a, the remainder term approaches 0.
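Theorem (5.5.21) can be seen numerically. Take g(x) = e^x about a = 0 (an arbitrary choice; all derivatives of e^x at 0 equal 1, so T_r(x) = ∑_{i=0}^r x^i/i!), and watch the scaled remainder vanish as x → 0:

```python
import math

# Taylor polynomial of order r for exp(x) about a = 0: sum_{i=0}^{r} x^i / i!.
# Theorem (5.5.21): (g(x) - T_r(x)) / x^r -> 0 as x -> 0.
def taylor_exp(x, r):
    return sum(x ** i / math.factorial(i) for i in range(r + 1))

r = 3
for x in (1.0, 0.1, 0.01):
    scaled_remainder = (math.exp(x) - taylor_exp(x, r)) / x ** r
    print(x, scaled_remainder)  # shrinks towards 0 as x approaches 0
```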
converges to the mgf of a N(0, 1) random variable, which will prove that the distribution of √n (X̄n - µ)/σ converges to the standard normal distribution.
Now, let Yi = (Xi - µ)/σ. Then,
M_{Yi}(t) = E[e^{tYi}] = E[e^{t(Xi - µ)/σ}] = e^{-tµ/σ} E[e^{tXi/σ}] = e^{-tµ/σ} M_{Xi}(t/σ).
M_{Yi}(t/√n) = M_{Yi}(0) + [d/dt M_{Yi}(t)]|_{t=0} (t/√n) + (1/2) [d²/dt² M_{Yi}(t)]|_{t=0} (t/√n)² + R_{Yi}(t/√n),
where
R_{Yi}(t/√n) = ∑_{k=3}^∞ [d^k/dt^k M_{Yi}(t)]|_{t=0} (t/√n)^k / k!.
[d/dt M_{Yi}(t)]|_{t=0} = E[Yi] = 0,
[d²/dt² M_{Yi}(t)]|_{t=0} = E[Yi²] = Var(Yi) = 1.
Since M_{Yi}(0) = 1, substituting these into the expansion gives
lim_{n→∞} [M_{Yi}(t/√n)]^n = lim_{n→∞} [1 + (1/n)(t²/2 + n R_{Yi}(t/√n))]^n = e^{t²/2},
because n R_{Yi}(t/√n) → 0 as n → ∞.
But this is the mgf of the N (0, 1) distribution! Hence, the CLT is proved.
Example (5.5.16): Suppose (X1, ..., Xn) is a random sample from a negative binomial(r, p) distribution. For this distribution, one can show that
E[Xi] = r(1 - p)/p and Var(Xi) = r(1 - p)/p²,
for all i.
Then, the CLT tells us that
√n [X̄ - r(1 - p)/p] / √(r(1 - p)/p²) →^d N(0, 1).
Hence, in a large sample, this quantity should be approximately standard normally
distributed.
One can also do exact calculations, but these would be difficult. Take r = 10, p = .5 and n = 30. Now, consider
P(X̄ ≤ 11) = P(∑_{i=1}^{30} Xi ≤ 330) = ∑_{x=0}^{330} C(300 + x - 1, x) (1/2)^{300} (1/2)^x = .8916,
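With a computer the "difficult" exact sum is a few lines: the sum of 30 iid negative binomial(10, .5) variables is negative binomial(300, .5), and the pmf can be evaluated stably via log-gamma. A sketch, with the CLT approximation computed alongside for comparison:

```python
import math

# Exact calculation from Example (5.5.16):
# P(Xbar <= 11) = P(sum <= 330) = sum_{x=0}^{330} C(300+x-1, x) (1/2)^300 (1/2)^x.
def log_nb_pmf(x, r, p):
    # log of C(r + x - 1, x) * p^r * (1 - p)^x, via log-gamma for stability
    return (math.lgamma(r + x) - math.lgamma(x + 1) - math.lgamma(r)
            + r * math.log(p) + x * math.log(1 - p))

exact = sum(math.exp(log_nb_pmf(x, 300, 0.5)) for x in range(331))
print(round(exact, 4))  # .8916, matching the text

# CLT approximation: the sum has mean 300 and variance 600
z = (330 - 300) / math.sqrt(600)
clt_approx = 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(round(clt_approx, 4))  # close to the exact value
```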
When talking about the CLT, our focus has been on the limiting distribution of
some standardised random variable.
There are many instances, however, when we are not speci…cally interested in the
distribution of the standardised random variable itself, but rather of some function of
it.
The delta method comes in handy in such cases. This method utilises our knowledge of the limiting distribution of a random variable in order to find the limiting distribution of a function of this random variable.
In essence, this method is a combination of Slutsky's Theorem and a Taylor approximation.
g(Yn) = g(θ) + g'(θ)(Yn - θ) + [g''(θ)/2](Yn - θ)² + R.
As before, R →^p 0 as Yn →^p θ. However, this time g'(θ) = 0.
So,
g(Yn) ≈ g(θ) + [g''(θ)/2](Yn - θ)², as n → ∞.
Now,
√n (Yn - θ)/σ →^d N(0, 1),
which implies that
n [(Yn - θ)/σ]² →^d χ²_1.
Hence,
n [g(Yn) - g(θ)] ≈ [g''(θ)/2] n (Yn - θ)² as n → ∞, where n (Yn - θ)² →^d σ² χ²_1,
and, therefore,
n [g(Yn) - g(θ)] →^d [g''(θ)/2] σ² χ²_1.
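A concrete check of this second-order delta method: take g(y) = y² and θ = 0 (arbitrary choices satisfying g'(θ) = 0, with g''(θ) = 2), and let Yn be the mean of n iid N(0, σ²) draws. Then n[g(Yn) - g(θ)] = n Yn² should behave like σ²χ²_1, whose mean is σ²:

```python
import random

# Second-order delta method sketch with g(y) = y^2, theta = 0, so that
# n * Yn^2 should be (approximately) sigma^2 * chi^2_1. For a normal
# population this holds exactly; all settings below are arbitrary.
random.seed(10)
n, reps, sigma = 200, 10000, 2.0

vals = []
for _ in range(reps):
    yn = sum(random.gauss(0, sigma) for _ in range(n)) / n
    vals.append(n * yn ** 2)

print(sum(vals) / reps)  # close to sigma^2 = 4, the mean of sigma^2 * chi^2_1
```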
So far we have worked with iid random sequences only. We now introduce large sample results for a different type of distribution. The following is largely based on White (2001).
Let’s start with independent heterogeneously distributed random variables.
The failure of the identical distribution assumption results from stratifying
(grouping) the population in some way. The independence assumption remains valid
provided that sampling within and across the strata is random.
Theorem (Markov’s Law of Large Numbers): Let X1 , ..., Xn be a sequence of
independent random variables, with E [Xi ] = µi < ∞, for all i . If for some δ > 0,
0 h i1
∞ E jX i µi j1 + δ
∑@ i 1 +δ
A < ∞,
i =1
then
1 n 1 n a.s .
n i∑ n i∑
Xi µi ! 0.
=1 =1
(1/n) ∑_{i=1}^n E[Xi] = (1/n) ∑_{i=1}^n µ = µ.
Define σ̄² = (1/n) ∑_{i=1}^n Var(Xi). If
σ̄² > ε > 0
for all n sufficiently large, then
√n (X̄ - µ̄)/σ̄ →^d N(0, 1).
The conditions of this result are easier to follow. We now need the existence of even higher-order moments (more than second order).
f (x ) = o fg (x )g .
If
lim_{x→∞} |f(x)/g(x)| ≤ constant,
then we will write
f(x) = O{g(x)}.
Xn = o(n^λ),
Now, consider
Xn = O(n^λ) ⟹ Xn/n^λ = O(1).
Then, for any δ > 0,
Xn/n^{λ+δ} = o(1) ⟹ Xn = o(n^{λ+δ}).
It might now be obvious to you that, if
Xn ! 0,
then
X n = o (1 ).