Slides 5
Bilkent
In this part of the lecture notes, we will focus on properties of random samples and consider some important statistics and their distributions, which you will encounter in your future econometrics courses.
In addition, we will also introduce key convergence concepts, which are at the foundation of asymptotic theory. You might find these concepts a bit abstract at first, but they are used in econometrics quite frequently.
where we assume that the population pdf or pmf is a member of a parametric family, and θ is the vector of parameters.
The random sampling model in Definition 5.1.1 is sometimes called sampling from an infinite population.
Suppose we obtain the values of X1, ..., Xn sequentially.
First, the experiment is performed and X1 = x1 is observed.
Then, the experiment is repeated and X2 = x2 is observed.
The assumption of independence implies that the probability distribution of X2 is unaffected by the fact that X1 = x1 was observed first.
Suppose we have drawn a sample (X1, ..., Xn) from the population. We might want to obtain some summary statistics.
This summary might be defined as a function
T(x1, ..., xn),
such as the sample mean
X̄ = (1/n) ∑_{i=1}^n Xi.
Theorem (5.2.4): Let x1, ..., xn be any numbers and x̄ = (x1 + ... + xn)/n. Then,
1 min_a ∑_{i=1}^n (xi - a)² = ∑_{i=1}^n (xi - x̄)²,
2 (n - 1) s² = ∑_{i=1}^n (xi - x̄)² = ∑_{i=1}^n xi² - n x̄².
Proof: To prove part (a), consider
∑_{i=1}^n (xi - a)² = ∑_{i=1}^n (xi - x̄ + x̄ - a)²
= ∑_{i=1}^n (xi - x̄)² + 2 ∑_{i=1}^n (xi - x̄)(x̄ - a) + ∑_{i=1}^n (x̄ - a)²
= ∑_{i=1}^n (xi - x̄)² + ∑_{i=1}^n (x̄ - a)²,
since the cross term vanishes: ∑_{i=1}^n (xi - x̄) = 0. The second term is nonnegative and equals zero exactly when
a = x̄,
so the minimum is attained at a = x̄.
Part (b) can easily be proved by expanding the binomial and taking the sum.
Lemma (5.2.5): Let X1, ..., Xn be a random sample from a population and let g(x) be a function such that E[g(X1)] and Var(g(X1)) exist. Then
E[∑_{i=1}^n g(Xi)] = n E[g(X1)]
and
Var(∑_{i=1}^n g(Xi)) = n Var(g(X1)).
Proof: This is straightforward. First,
E[∑_{i=1}^n g(Xi)] = ∑_{i=1}^n E[g(Xi)] = ∑_{i=1}^n E[g(X1)] = n E[g(X1)],
since the Xi are identically distributed. Note that independence is not required here.
Then,
Var(∑_{i=1}^n g(Xi)) = ∑_{i=1}^n Var(g(Xi)) + ∑∑_{i≠j} Cov(g(Xi), g(Xj))
= ∑_{i=1}^n Var(g(Xi)) + 0 = ∑_{i=1}^n Var(g(X1)) = n Var(g(X1)),
due to independence and the fact that Var(g(Xi)) is the same for all i, their distributions being identical.
Theorem (5.2.6): Let X1, ..., Xn be a random sample from a population with mean µ and variance σ² < ∞. Then,
1 E[X̄] = µ,
2 Var(X̄) = σ²/n,
3 E[S²] = σ².
Proof: Now,
E[X̄] = E[(1/n) ∑_{i=1}^n Xi] = (1/n) ∑_{i=1}^n E[Xi] = (1/n) nµ = µ.
Then,
Var(X̄) = Var((1/n) ∑_{i=1}^n Xi) = (1/n²) ∑_{i=1}^n Var(Xi) = (1/n²) nσ² = σ²/n.
Finally,
E[S²] = E[(1/(n - 1)) ∑_{i=1}^n (Xi - X̄)²] = (1/(n - 1)) E[∑_{i=1}^n Xi² - n X̄²]
= (1/(n - 1)) {∑_{i=1}^n E[Xi²] - n E[X̄²]}
= (1/(n - 1)) {n (σ² + µ²) - n (σ²/n + µ²)}
= (1/(n - 1)) (nσ² + nµ² - σ² - nµ²) = (n - 1) σ²/(n - 1) = σ².
An aside: You might already know that, since E [X̄ ] = µ and E [S 2 ] = σ2 , one
would refer to X̄ and S 2 as the unbiased estimators for µ and σ2 , respectively.
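All three results in Theorem (5.2.6) are easy to check by simulation. Below is a minimal sketch, not part of the original notes; the normal population and all parameter values are arbitrary illustrative choices:

```python
import random
import statistics

# Monte Carlo check of Theorem (5.2.6): E[Xbar] = mu, Var(Xbar) = sigma^2/n,
# E[S^2] = sigma^2. Population: N(mu, sigma^2), with arbitrary settings.
random.seed(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 20000

xbars, s2s = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbars.append(statistics.fmean(sample))
    s2s.append(statistics.variance(sample))  # divides by n - 1, i.e. S^2

print(statistics.fmean(xbars))     # close to mu = 2
print(statistics.variance(xbars))  # close to sigma^2/n = 0.9
print(statistics.fmean(s2s))       # close to sigma^2 = 9
```

Note that `statistics.variance` already uses the n - 1 divisor, matching the definition of S².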
Z = X1 + ... + Xn ,
In particular, if X1 , ..., Xn all have the same distribution with mgf MX (t ), then
MZ (t ) = [MX (t )]n .
Now the sequence we have is actually X1/n, ..., Xn/n. Observe that if
E[e^{tX}] = M_X(t),
then
E[e^{t(X/n)}] = E[e^{(t/n)X}] = M_X(t/n).
Then, for Z = X̄, Theorem (4.6.7) gives
M_X̄(t) = [M_X(t/n)]^n = [e^{µ(t/n) + σ²(t/n)²/2}]^n = e^{µt + (σ²/n)t²/2}.
Thus, X̄ ∼ N(µ, σ²/n).
In this example, it was helpful to use Theorem (5.2.7) because the expression for
MX̄ (t ) turned out to be a familiar mgf. It cannot, of course, be guaranteed that this
will always be the case for any X̄ . However, when this is the case, this result makes
derivation of the distribution of X̄ very easy.
Another example is the sample mean for a gamma(α, β) sample. The mgf of X̄ in this case is given by
M_X̄(t) = [(1/(1 - β(t/n)))^α]^n = (1/(1 - (β/n)t))^{nα},
which reveals that X̄ ∼ gamma(nα, β/n).
Of course, there are cases where this method is not useful, because the mgf of X̄ is not recognisable. Remedies for such cases are considered in Casella and Berger (2001, pp. 215-217). We will not cover these.
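The gamma result can be checked by matching the first two moments of simulated sample means against those of gamma(nα, β/n). A sketch with arbitrary parameter values, using Python's shape/scale parametrisation (mean αβ, variance αβ²), which matches the parametrisation above:

```python
import random
import statistics

# Check X̄ ~ gamma(n*alpha, beta/n) via its first two moments:
# E[X̄] = (n*alpha)(beta/n) = alpha*beta, Var(X̄) = (n*alpha)(beta/n)^2 = alpha*beta^2/n.
# All parameter values are arbitrary illustrative choices.
random.seed(1)
alpha, beta, n, reps = 2.0, 1.5, 8, 40000

xbars = [
    statistics.fmean(random.gammavariate(alpha, beta) for _ in range(n))
    for _ in range(reps)
]

print(statistics.fmean(xbars))     # close to alpha*beta = 3.0
print(statistics.variance(xbars))  # close to alpha*beta^2/n = 0.5625
```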
(Bilkent) ECON509 This Version: 25 November 2013 14 / 94
Sampling from the Normal Distribution
Now, we focus on the case where the population distribution is the normal
distribution.
Results we introduce in this section will be very useful when you deal with linear regression models.
We have already talked about distributional properties of the sample mean and
variance. Given the extra assumption of normality, we are now in a position to
determine their full distributions.
The chi squared distribution will come up frequently within this context, so we start
by introducing some important properties related to this distribution.
Remember that the chi squared pdf is a special case of the gamma pdf and is given by
f(x) = [1/(Γ(p/2) 2^{p/2})] x^{(p/2)-1} e^{-x/2}, 0 < x < ∞,
where p is called the degrees of freedom.
Lemma (5.3.2) (Facts about chi squared random variables): We use the notation χ²_p to denote a chi squared random variable with p degrees of freedom.
a. If Z is a N(0, 1) random variable, then
Z² ∼ χ²_1.
In other words, the square of a standard normal random variable is a chi squared random variable.
b. If X1, ..., Xn are independent and Xi ∼ χ²_{pi}, then X1 + ... + Xn ∼ χ²_{p1+...+pn}; that is, independent chi squared variables add to a chi squared variable, and the degrees of freedom also add.
Therefore,
S² = (1/(n - 1)) ∑_{i=1}^n (Xi - X̄)² = (1/(n - 1)) {[∑_{i=2}^n (Xi - X̄)]² + ∑_{i=2}^n (Xi - X̄)²},
using X1 - X̄ = -∑_{i=2}^n (Xi - X̄).
Then,
f(y1, ..., yn) = [n/(2π)^{n/2}] e^{-(1/2)[(y1 - ∑_{i=2}^n yi)² + ∑_{i=2}^n (y1 + yi)²]}
= [n/(2π)^{n/2}] e^{-(1/2)(y1 - ∑_{i=2}^n yi)²} e^{-(1/2) ∑_{i=2}^n (y1 + yi)²}
= {[n/(2π)]^{1/2} e^{-n y1²/2}} {[n^{1/2}/(2π)^{(n-1)/2}] e^{-(1/2)[∑_{i=2}^n yi² + (∑_{i=2}^n yi)²]}},
which factorises into a function of y1 alone and a function of (y2, ..., yn) alone.
Let X̄k and Sk² denote the sample mean and sample variance based on the first k observations.
You will be asked to show in a homework assignment that
(n - 1) Sn² = (n - 2) S²_{n-1} + [(n - 1)/n] (Xn - X̄_{n-1})².  (1)
Suppose that, for n = k,
(k - 1) Sk² ∼ χ²_{k-1}.
Then, by (1),
k S²_{k+1} = (k - 1) Sk² + [k/(k + 1)] (X_{k+1} - X̄k)².
What if [k/(k + 1)] (X_{k+1} - X̄k)² were distributed χ²_1, independent of Sk²? Then, by part (b) of Lemma (5.3.2), k S²_{k+1} will be χ²_k! So, proving (2) will finally prove part (c) of the Theorem.
First, notice that the vector (X_{k+1}, X̄k) is independent of Sk², since we have already shown that X̄k is independent of Sk². Moreover, Sk² is a function of X1, ..., Xk only and so is independent of X_{k+1}.
A nice result about normally distributed random variables is that zero covariance implies independence, which does not necessarily hold for other distributions.
Lemma (5.3.3): Let Xj ∼ N(µj, σj²), j = 1, ..., n, be independent. For constants aij and brj (j = 1, ..., n; i = 1, ..., k; r = 1, ..., m), where k + m ≤ n, define
Ui = ∑_{j=1}^n aij Xj, i = 1, ..., k,
Vr = ∑_{j=1}^n brj Xj, r = 1, ..., m.
Again, we consider the case where Xj ∼ N(0, 1), for simplicity. Then,
Cov(Ui, Vr) = E[Ui Vr] = E[(∑_{j=1}^n aij Xj)(∑_{j=1}^n brj Xj)] = E[∑_{j=1}^n aij brj Xj²] = ∑_{j=1}^n aij brj,
since E[Xj Xl] = 0 for j ≠ l and E[Xj²] = 1.
u = a1 x1 + a2 x2 and v = b1 x1 + b2 x2 ,
which imply
x1 = (b2 u - a2 v)/(a1 b2 - b1 a2) and x2 = (a1 v - b1 u)/(a1 b2 - b1 a2).
The Jacobian of the transformation is
J = det [ b2/(a1 b2 - b1 a2)   -a2/(a1 b2 - b1 a2) ; -b1/(a1 b2 - b1 a2)   a1/(a1 b2 - b1 a2) ] = (b2 a1 - b1 a2)/(a1 b2 - b1 a2)² = 1/(a1 b2 - b1 a2).
Therefore,
f_{U,V}(u, v) = (1/(2π)) exp{-[1/(2 (a1 b2 - b1 a2)²)] [(b2 u - a2 v)² + (a1 v - b1 u)²]} · |1/(a1 b2 - b1 a2)|,
where -∞ < u, v < ∞.
Now, expand the exponent:
(b2 u - a2 v)² + (a1 v - b1 u)² = (b1² + b2²) u² - 2 (a1 b1 + a2 b2) uv + (a1² + a2²) v².
But we know that a1 b1 + a2 b2 = 0, so the cross term in uv vanishes. Hence, this shows that the joint pdf factorises into a function of u and a function of v. Therefore, U and V are independent!
A similar technique can be utilised to prove part (b). Specifically, by using a transformation argument, it can be shown that the joint pdf of the vectors (U1, ..., Uk) and (V1, ..., Vm) factorises, which proves independence. We skip this.
The main message here is that, if we start with independent normal random variables, then zero covariance and independence are equivalent for linear functions of these random variables. Therefore, checking independence comes down to checking covariances.
Part (b), on the other hand, makes it possible to infer overall independence of
normal vectors by just checking pairwise independence, which is not valid for general
random variables.
Thinking about part (a), I find it intuitively easy to follow: the normal distribution is completely determined by its mean and variance, while other moments do not matter. Hence, ensuring zero covariance is equivalent to ensuring independence, since we do not have to care about the remaining moments of the distribution.
We can show the usefulness of Lemma 5.3.3 by trying to prove the independence of
X̄ and S 2 when X1 , ..., Xn is sampled from a normal population distribution.
Remember that we can write S 2 as a function of (X2 X̄ , ..., Xn X̄ ) . Now, if we
can show that these random variables are not correlated with X̄ , then by the
normality assumption and Lemma 5.3.3, we can conclude independence.
We have
X̄ = (1/n) ∑_{i=1}^n Xi and Xj - X̄ = ∑_{i=1}^n (δij - 1/n) Xi,
where
δij = 1 if i = j, and δij = 0 otherwise.
Now,
∑_{i=1}^n (1/n)(δij - 1/n) = 1/n - n (1/n²) = 0.
Hence,
Cov(X̄, Xj - X̄) = 0,
and so, X̄ and Xj - X̄ are independent for all j, which yields the desired result.
You will see (probably next term) that the results derived so far are useful when the
variance of the distribution is known.
When σ² is unknown, as is the case in most practical situations, it is usually replaced by some unbiased estimator, say σ̂², of σ².
However, the above results then lose their relevance.
We will now talk about two important distributions that are relevant for such cases.
(X̄ - µ)/(σ/√n) ∼ N(0, 1).
However, as we just argued, σ is usually unknown. What to do then?
W. S. Gosset (widely known as “Student”) studied the distribution of
(X̄ - µ)/(S/√n).
f_T(t) = [Γ((p + 1)/2)/(√(pπ) Γ(p/2))] · 1/(1 + t²/p)^{(p+1)/2}, -∞ < t < ∞.
Remember that Student’s t does not have moments of all orders. In particular, if
X tp , then
E [jX jq ] < ∞ for q < p.
Therefore, t1 has no mean, t2 has no variance, etc.
Let’s try to derive the pdf. Since U and V are independent, we have
1 u 2 /2 1
fU ,V (u, v ) = p e v (p/2 ) 1
e v /2
,
2π Γ (p/2) 2p/2
where ∞ < u < ∞ and 0 < v < ∞.
We have the following transformation functions:
t = u/√(v/p) and w = v,
implying that
v = w and u = t √(w/p).
Then, the Jacobian of the transformation is given by
J = det [ √(w/p)   t (2p)^{-1} (w/p)^{-1/2} ; 0   1 ] = √(w/p).
But the integrand is the kernel of a gamma((p + 1)/2, 2/(1 + t²/p)) pdf. Therefore,
f_T(t) = [1/√(2π)] [1/(Γ(p/2) 2^{p/2})] (1/√p) Γ((p + 1)/2) [2/(1 + t²/p)]^{(p+1)/2}
= [Γ((p + 1)/2)/(√(pπ) Γ(p/2))] [1/(1 + t²/p)]^{(p+1)/2},
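The construction behind this derivation, T = U/√(V/p) with U standard normal and V an independent χ²_p, can be checked by simulation: the variance of the simulated T should be close to p/(p - 2), the known variance of t_p for p > 2. A sketch with arbitrary settings:

```python
import math
import random

# Build T = U / sqrt(V/p) from U ~ N(0,1) and an independent V ~ chi^2_p
# (V generated as a sum of p squared standard normals), then check that
# Var(T) ≈ p/(p - 2). The value of p and the replication count are arbitrary.
random.seed(3)
p, reps = 8, 60000

ts = []
for _ in range(reps):
    u = random.gauss(0, 1)
    v = sum(random.gauss(0, 1) ** 2 for _ in range(p))  # chi^2 with p df
    ts.append(u / math.sqrt(v / p))

var_t = sum(t * t for t in ts) / reps  # E[T] = 0, so Var(T) ≈ E[T^2]
print(var_t)  # close to p/(p - 2) = 8/6 ≈ 1.333
```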
Now, consider another important derived distribution, Snedecor’s F , named after Sir
Ronald Fisher. It arises naturally as the distribution of a ratio of variances.
Example (5.3.5): Let X1, ..., Xn be a random sample from a N(µX, σ²X) population, and let Y1, ..., Ym be a random sample from an independent N(µY, σ²Y) population. If we were comparing the variability of the two populations, one quantity of interest would be the ratio σ²X/σ²Y. Information about this ratio is contained in the ratio of sample variances,
S²X/S²Y,
Examination of (3) shows us how the F distribution is derived. The ratios SX2 /σ2X
and SY2 /σ2Y are each scaled chi squared variates, and they are independent.
The tools you will learn in this section are the fundamental building blocks of what
is known as “asymptotic theory” or “large sample theory.” Although a bit abstract
at …rst sight, these results are at the core of many proofs you will encounter in
econometrics articles.
The three important concepts we will consider in what follows are
1 almost sure convergence,
2 convergence in probability,
3 convergence in distribution.
In the first instance we will consider the case where the sequence of random variables X1, ..., Xn exhibits the iid property. This is one (and the simplest) of many possible dependence settings.
Remember that random variables are functions defined on the sample space, e.g. X(ω). Our interest will be in a sequence of random variables indexed by sample size, i.e. Xn(ω).
To motivate the following discussion, consider pointwise convergence:
This time, we have a difference. Convergence fails on a very small set E such that P(E) = 0.
That P (E ) = 0 is due to the set being so small that we can safely assign zero
probability to the set.
Remember that in earlier lectures we have stated that for a continuously distributed
random variable, the probability of a single point is always equal to zero. This is
similar, in spirit, to the situation at hand.
This type of convergence is also called convergence almost everywhere and
convergence with probability 1.
The following notation is common:
Xn →^{a.s.} X, Xn →^{wp1} X, or lim_{n→∞} Xn = X a.s.
Note that, at the cost of sloppy notation, the argument of Xn (ω ) is usually dropped.
Also, Xn (ω ) need not converge to a function. It can also simply converge to some
constant, say, a.
Convergence Concepts
Almost Sure Convergence
Then, almost sure convergence is a statement on the joint distribution of the entire sequence {Zi(ω)}.
White (2001, p.19): "The probability measure P determines the joint distribution of the entire sequence {Zi(ω)}. A sequence Xn(ω) converges almost surely if the probability of obtaining a realisation of the sequence {Zi(ω)} for which convergence to X(ω) occurs is unity."
Whenever you see the terms "almost sure," "with probability one," or "almost everywhere," you should remember that the relationship referred to holds everywhere except on some set with zero probability.
Now that we have a convergence concept in our arsenal, the next question is: under
which conditions can we use this result for statistics of interest?
One important result is Kolmogorov’s Strong Law of Large Numbers (SLLN).
Theorem (Kolmogorov’s Strong Law of Large Numbers): If X1 , X2 , ... are
independent and identically distributed random variables and the common mean is
…nite, i.e. E [jX1 j] = µ < ∞, then
1 n 1 n a.s .
lim
n !∞ n
∑ Xi = µ a.s. or
n i∑
Xi ! µ.
i =1 =1
Note that the sample size will never be equal to ∞. However, when the sample size
is large enough, the sample mean will be very close to the population mean, µ.
What is remarkable about this result is that, as long as we have iid random
variables, the only condition is that the random variables have …nite mean.
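The SLLN in action can be sketched with a few lines of simulation; the Bernoulli(0.3) population below is an arbitrary example with mean 0.3:

```python
import random

# Sketch of the SLLN: sample means of iid draws approach the population mean
# as n grows. Bernoulli(0.3) is an arbitrary choice with E[X] = 0.3.
random.seed(4)
mu = 0.3

def sample_mean(n):
    return sum(1.0 if random.random() < mu else 0.0 for _ in range(n)) / n

for n in (100, 10000, 1000000):
    print(n, sample_mean(n))  # deviations from 0.3 typically shrink with n
```

Of course, no finite simulation can verify the "almost sure" part itself; the code only illustrates that large-n sample means sit very close to µ.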
For the time being, suffice it to say that one can relax the iid assumption, but that means we will require more than just finite means. There is a trade-off between the amount of dependence and the strength of the moment assumptions you have to make in order to attain convergence.
An equivalent statement is that given e > 0 and δ > 0 there exists an N, which depends on both δ and e, such that P(|Xn - X| > e) < δ for all n > N. This is merely a restatement using the formal definition of a limit.
One could also write
It is not easy to see the difference between the two modes of convergence. However, let's give it a try!
Almost sure convergence states that we have pointwise convergence for all ω ∈ Ω except for a small, zero-measure set E. Importantly, this set is independent of n.
Convergence in probability states that as the sample size goes to ∞, the probability that Xn deviates from X by more than e decreases towards zero. However, for any sample size, there is a positive probability that Xn will deviate by more than e. In other words, for some ω ∈ En ⊂ Ω with P(En) > 0, |Xn - X| will be larger than e.
Importantly, there is nothing that restricts En to be the same for all n. Hence, the set on which Xn deviates from X may change as n increases.
The good news is that the deviation probability goes to zero, so P(En) → 0.
Xn = (1/n) ∑_{i=1}^n Zi.
White (2001, p.24): "With almost sure convergence, the probability measure P takes into account the joint distribution of the entire sequence {Zi}, but with convergence in probability, we only need concern ourselves sequentially with the joint distribution of the elements of {Zi} that actually appear in Xn, typically the first n."
X1(ω) = ω + I_{[0,1]}(ω), X2(ω) = ω + I_{[0,1/2]}(ω),
X3(ω) = ω + I_{[1/2,1]}(ω), X4(ω) = ω + I_{[0,1/3]}(ω),
X5(ω) = ω + I_{[1/3,2/3]}(ω), X6(ω) = ω + I_{[2/3,1]}(ω),
etc. Let X(ω) = ω.
Do we have convergence in probability? Observe that
X1(ω) - X(ω) = I_{[0,1]}(ω), X2(ω) - X(ω) = I_{[0,1/2]}(ω),
X3(ω) - X(ω) = I_{[1/2,1]}(ω), X4(ω) - X(ω) = I_{[0,1/3]}(ω),
X5(ω) - X(ω) = I_{[1/3,2/3]}(ω), X6(ω) - X(ω) = I_{[2/3,1]}(ω),
etc.
Example (5.5.7): Let the sample space Ω be the closed interval [0, 1] with the uniform probability distribution. Define random variables
Xn(ω) = ω + ω^n and X(ω) = ω.
as short hand.
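For Example (5.5.7), the deviation can be handled analytically: |Xn(ω) - X(ω)| = ω^n, so P(|Xn - X| ≥ e) = P(ω ≥ e^{1/n}) = 1 - e^{1/n}, which goes to 0 as n → ∞. A small script confirms both the formula and a Monte Carlo estimate; the value of e and the grid of n are arbitrary illustrative choices:

```python
import random

# |X_n(w) - X(w)| = w**n with w ~ Uniform[0,1], so
# P(|X_n - X| >= eps) = P(w >= eps**(1/n)) = 1 - eps**(1/n) -> 0 as n -> inf.
eps = 0.05

for n in (1, 10, 100, 1000):
    print(n, 1 - eps ** (1 / n))  # decreases towards 0

# Monte Carlo cross-check at n = 100
random.seed(5)
hits = sum(random.random() ** 100 >= eps for _ in range(100000))
print(hits / 100000)  # close to 1 - 0.05**(1/100)
```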
Associated with convergence in probability is the Weak Law of Large Numbers
(WLLN)
Theorem (Weak Law of Large Numbers): If X1, X2, ... are iid random variables with common mean µ < ∞ and variance σ² < ∞, then
(1/n) ∑_{i=1}^n Xi →^p µ,
as n → ∞.
Proof: The proof uses Chebychev's Inequality. Remember this says that if X is a random variable and if g(x) is a nonnegative function, then, for any r > 0,
P(g(X) ≥ r) ≤ E[g(X)]/r.
Now, consider
P(|X̄n - µ| ≥ e) = P((X̄n - µ)² ≥ e²) ≤ E[(X̄n - µ)²]/e² = Var(X̄n)/e² = σ²/(ne²),
where we use r = e² and g(X̄n) = (X̄n - µ)².
The above result implies that
P(|X̄n - µ| < e) = 1 - P(|X̄n - µ| ≥ e) ≥ 1 - σ²/(ne²),
and
lim_{n→∞} [1 - σ²/(ne²)] = 1.
Hence,
lim_{n→∞} P(|X̄n - µ| < e) = 1.
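The Chebychev bound used in this proof can be checked empirically; the N(0, 1) population, n, and e below are arbitrary choices:

```python
import random

# Check the bound from the WLLN proof:
# P(|Xbar_n - mu| >= eps) <= sigma^2 / (n * eps^2).
# Population N(0, 1); n, eps and the replication count are arbitrary.
random.seed(6)
n, eps, reps = 50, 0.3, 20000
mu, sigma2 = 0.0, 1.0

exceed = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    if abs(xbar - mu) >= eps:
        exceed += 1

empirical = exceed / reps
bound = sigma2 / (n * eps * eps)
print(empirical, bound)  # empirical probability sits below the bound
```

As is typical of Chebychev, the bound is quite loose: the empirical deviation probability is far smaller than σ²/(ne²).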
This is perhaps a good time to stop and reflect a little bit on these new concepts.
What the Weak and Strong LLNs are saying is that under certain conditions, the
sample mean converges to the population mean as n ! ∞. This is known as
consistency: one would say that the sample mean is a consistent estimator of the
population mean.
In actual applications, this means that if the sample size is large enough, then the sample mean is close to the population mean. So n does not have to be that close to infinity. On the other hand, as mentioned at the beginning, "how large" the sample size should be in order to be considered a "large enough" sample is a different question in its own right. We will not deal with this here.
Sometimes, consistency is compared to unbiasedness.
An estimator β̂ of a population value β is an unbiased estimator if and only if
E [ β̂] = β.
What are the things we might want to estimate? One example would be parameters
of a distribution family. For example, we might know that the data are distributed
with N µ, σ2 , but we may not know the particular values of µ and σ2 . In this case,
we would estimate these parameters.
Analytically, the plim operator is much more convenient to deal with than the expectation operator. For example, for two sequences X1, X2, ... and Y1, Y2, ...,
plim_{n→∞} (Xn/Yn) = (plim_{n→∞} Xn)/(plim_{n→∞} Yn),
while we usually have
E[Xn/Yn] ≠ E[Xn]/E[Yn].
However, one concept does not imply the other. In other words, a consistent estimator can be biased, while an unbiased estimator can be inconsistent.
Suppose we are trying to estimate the population parameter β. Consider the
following estimators.
1 β̂ = β + 20/n: consistent but biased.
2 β̂ = X where P(X = β + 100) = P(X = β - 100) = .50: unbiased but inconsistent.
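These two toy estimators are easy to simulate; β = 1 below is an arbitrary choice of the truth:

```python
import random

# Two toy estimators of beta (here beta = 1, an arbitrary truth):
# (1) beta + 20/n: biased for every finite n, but consistent (bias -> 0);
# (2) beta +/- 100 with probability 1/2 each: unbiased, but never
#     concentrates around beta, so it is inconsistent.
random.seed(7)
beta = 1.0

def est1(n):
    return beta + 20.0 / n

def est2():
    return beta + 100.0 if random.random() < 0.5 else beta - 100.0

print([est1(n) for n in (10, 1000, 100000)])  # approaches beta = 1
draws = [est2() for _ in range(10000)]
print(sum(draws) / len(draws))  # near beta on average (unbiased)
print(draws[0])                 # but any single draw is 100 away from beta
```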
As far as economists and most econometricians are concerned, one would not care too much about whether convergence is achieved almost surely or in probability. As long as convergence is achieved, the rest is not important.
However, in some cases it might be easier to prove the LLN for one of the two convergence types. This is no problem, as →^{a.s.} implies →^p anyway.
In addition, convergence almost surely might be slower than convergence in
probability in the sense that it might require a larger sample size before the sample
mean is close enough to the population mean.
Although we might get convergence results for some sample mean Xn and Yn , we
might actually be interested in the convergence properties of a function of these, say
g (X n , Y n ) .
Fortunately, we have the following useful result.
If (Xn , Yn ) converges almost surely (in probability) to (X , Y ) , if g (x , y ) is a
continuous function over some set D , and if the images of Ω under
[Xn (ω ) , Yn (ω )] and [X (ω ) , Y (ω )] are in D , then g (Xn , Yn ) converges almost
surely (in probability) to g (X , Y ) .
More pragmatically,
Xn →^{a.s.} X ⟹ g(Xn) →^{a.s.} g(X),
Xn →^p X ⟹ g(Xn) →^p g(X).
The next interesting question is this: now that we have found convergence results for sequences of random variables, X1, X2, ..., can we find similar results for some reasonably well-behaved function of them?
Example (5.5.5) (Consistency of S): If Sn² is a consistent estimator of σ², then the sample standard deviation Sn = √(Sn²) = h(Sn²) is a consistent estimator of σ.
Interestingly, it can be shown that Sn is, in fact, a biased estimator of σ! However, the bias disappears asymptotically.
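A quick simulation makes both points at once: E[Sn] sits below σ for small n, and the gap closes as n grows. A sketch with a N(0, 1) population (so σ = 1) and arbitrary sample sizes:

```python
import math
import random
import statistics

# S_n = sqrt(S_n^2) is biased downward for sigma, but the bias vanishes as
# n grows. Population N(0, 1), so sigma = 1; sample sizes and the replication
# count are arbitrary illustrative choices.
random.seed(8)
sigma, reps = 1.0, 20000

def mean_sn(n):
    total = 0.0
    for _ in range(reps):
        sample = [random.gauss(0, sigma) for _ in range(n)]
        total += math.sqrt(statistics.variance(sample))
    return total / reps

for n in (3, 10, 50):
    print(n, mean_sn(n))  # below sigma = 1, moving up towards 1 as n grows
```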
lim_{n→∞} E[|Xn - X|^p] = 0.
Just as a reference, Lp convergence does not imply almost sure convergence, nor does almost sure convergence imply Lp convergence. However, Lp convergence implies convergence in probability.
lim_{n→∞} F_{Xn}(x) = F_X(x),
is equivalent to
F_{Xn}(x) = P(Xn ≤ x) → 0 if x < a, and F_{Xn}(x) → 1 if x > a.
We now introduce one of the most useful theorems we have considered so far.
Theorem (5.5.15) (Central Limit Theorem): Let X1, X2, ... be a sequence of iid random variables with E[Xi] = µ < ∞ and 0 < Var(Xi) = σ² < ∞. Define
X̄n = (1/n) ∑_{i=1}^n Xi.
Let Gn(x) denote the cdf of √n (X̄n - µ)/σ. Then, for any x, -∞ < x < ∞,
lim_{n→∞} Gn(x) = ∫_{-∞}^x [1/√(2π)] e^{-y²/2} dy.
In other words,
(1/√n) ∑_{i=1}^n (Xi - µ)/σ →^d N(0, 1).
This is a powerful result! We start with the iid and finite mean and variance assumptions. In return, the Central Limit Theorem (CLT) promises us that the distribution of the properly standardised sample mean,
(1/√n) ∑_{i=1}^n (Xi - µ)/σ,
will converge to the standard normal distribution as the sample size tends to infinity.
As before, the sample size will never be equal to ∞. BUT, for large enough samples, (1/√n) ∑_{i=1}^n (Xi - µ)/σ will be approximately standard normal. As n becomes larger, this approximate result becomes more accurate.
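This approximation is easy to see in simulation: even for a skewed population, roughly 95% of the standardised sample means should fall inside ±1.96. A sketch with an exponential population (µ = σ = 1) and arbitrary n:

```python
import math
import random

# Sketch of the CLT: standardised means of a skewed population (exponential
# with mu = sigma = 1) are approximately N(0, 1), so about 95% of them should
# fall inside +/- 1.96. n and the replication count are arbitrary.
random.seed(9)
n, reps = 200, 20000
mu, sigma = 1.0, 1.0

inside = 0
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - mu) / sigma
    if abs(z) <= 1.96:
        inside += 1

print(inside / reps)  # close to 0.95
```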
As with LLNs, it is possible to obtain CLTs for non-iid data. However, this will require one to make stronger assumptions regarding the moments of the sequence of random variables. The trade-off between dependence and moment assumptions is always there.
Let’s prove the CLT. However, before we do that, we have to revisit Taylor
Expansions.
Definition (5.5.20): If a function g(x) has derivatives of order r, that is, g^(r)(x) = (d^r/dx^r) g(x) exists, then for any constant a, the Taylor polynomial of order r about a is
T_r(x) = ∑_{i=0}^r [g^(i)(a)/i!] (x - a)^i.
This polynomial is used to obtain a Taylor expansion of order r about x = a. This is given by
g(x) = T_r(x) + R,
where R = g(x) - T_r(x) is the remainder of the approximation.
Theorem (5.5.21): If g^(r)(a) = (d^r/dx^r) g(x)|_{x=a} exists, then
lim_{x→a} [g(x) - T_r(x)]/(x - a)^r = 0.
This says that the remainder, g(x) - T_r(x), always tends to zero faster than the highest-order term of the approximation.
Importantly, this also means that as x tends to a, the remainder term approaches 0.
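Theorem (5.5.21) can be seen numerically. Take g(x) = e^x about a = 0 (an arbitrary choice; all derivatives of e^x at 0 equal 1, so T_r(x) = ∑_{i=0}^r x^i/i!), and watch the scaled remainder vanish as x → 0:

```python
import math

# Taylor polynomial of order r for exp(x) about a = 0: sum_{i=0}^{r} x^i / i!.
# Theorem (5.5.21): (g(x) - T_r(x)) / x^r -> 0 as x -> 0.
def taylor_exp(x, r):
    return sum(x ** i / math.factorial(i) for i in range(r + 1))

r = 3
for x in (1.0, 0.1, 0.01):
    scaled_remainder = (math.exp(x) - taylor_exp(x, r)) / x ** r
    print(x, scaled_remainder)  # shrinks towards 0 as x approaches 0
```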
converges to the mgf of a N(0, 1) random variable, which will prove that the distribution of √n (X̄n - µ)/σ converges to the standard normal distribution.
Now, let Yi = (Xi - µ)/σ. Then,
M_{Yi}(t) = E[e^{tYi}] = E[e^{t(Xi - µ)/σ}] = e^{-tµ/σ} E[e^{tXi/σ}] = e^{-tµ/σ} M_{Xi}(t/σ).
M_{Yi}(t/√n) = M_{Yi}(0) + [d/dt M_{Yi}(t)]|_{t=0} (t/√n) + (1/2) [d²/dt² M_{Yi}(t)]|_{t=0} (t/√n)² + R_{Yi}(t/√n),
where
R_{Yi}(t/√n) = ∑_{k=3}^∞ [d^k/dt^k M_{Yi}(t)]|_{t=0} (t/√n)^k / k!.
[d/dt M_{Yi}(t)]|_{t=0} = E[Yi] = 0,
[d²/dt² M_{Yi}(t)]|_{t=0} = E[Yi²] = Var(Yi) = 1.
Since M_{Yi}(0) = 1, substituting these into the expansion gives
lim_{n→∞} [M_{Yi}(t/√n)]^n = lim_{n→∞} [1 + (1/n)(t²/2 + n R_{Yi}(t/√n))]^n = e^{t²/2},
because n R_{Yi}(t/√n) → 0 as n → ∞.
But this is the mgf of the N (0, 1) distribution! Hence, the CLT is proved.
Example (5.5.16): Suppose (X1, ..., Xn) is a random sample from a negative binomial(r, p) distribution. For this distribution, one can show that
E[Xi] = r(1 - p)/p and Var(Xi) = r(1 - p)/p²,
for all i.
Then, the CLT tells us that
√n [X̄ - r(1 - p)/p] / √(r(1 - p)/p²) →^d N(0, 1).
Hence, in a large sample, this quantity should be approximately standard normally
distributed.
One can also do exact calculations, but these would be difficult. Take r = 10, p = .5 and n = 30. Now, consider
P(X̄ ≤ 11) = P(∑_{i=1}^{30} Xi ≤ 330) = ∑_{x=0}^{330} C(300 + x - 1, x) (1/2)^{300} (1/2)^x = .8916,
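With a computer the "difficult" exact sum is a few lines: the sum of 30 iid negative binomial(10, .5) variables is negative binomial(300, .5), and the pmf can be evaluated stably via log-gamma. A sketch, with the CLT approximation computed alongside for comparison:

```python
import math

# Exact calculation from Example (5.5.16):
# P(Xbar <= 11) = P(sum <= 330) = sum_{x=0}^{330} C(300+x-1, x) (1/2)^300 (1/2)^x.
def log_nb_pmf(x, r, p):
    # log of C(r + x - 1, x) * p^r * (1 - p)^x, via log-gamma for stability
    return (math.lgamma(r + x) - math.lgamma(x + 1) - math.lgamma(r)
            + r * math.log(p) + x * math.log(1 - p))

exact = sum(math.exp(log_nb_pmf(x, 300, 0.5)) for x in range(331))
print(round(exact, 4))  # .8916, matching the text

# CLT approximation: the sum has mean 300 and variance 600
z = (330 - 300) / math.sqrt(600)
clt_approx = 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(round(clt_approx, 4))  # close to the exact value
```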
When talking about the CLT, our focus has been on the limiting distribution of
some standardised random variable.
There are many instances, however, when we are not speci…cally interested in the
distribution of the standardised random variable itself, but rather of some function of
it.
The delta method comes in handy in such cases. This method utilises our knowledge of the limiting distribution of a random variable in order to find the limiting distribution of a function of this random variable.
In essence, this method is a combination of Slutsky's Theorem and a Taylor approximation.
g(Yn) = g(θ) + g'(θ)(Yn - θ) + [g''(θ)/2](Yn - θ)² + R.
As before, R →^p 0 as Yn →^p θ. However, this time g'(θ) = 0.
So,
g(Yn) ≈ g(θ) + [g''(θ)/2](Yn - θ)², as n → ∞.
Now,
√n (Yn - θ)/σ →^d N(0, 1),
which implies that
n [(Yn - θ)/σ]² →^d χ²_1.
Hence,
n [g(Yn) - g(θ)] ≈ [g''(θ)/2] n (Yn - θ)² as n → ∞, where n (Yn - θ)² →^d σ² χ²_1,
and, therefore,
n [g(Yn) - g(θ)] →^d [g''(θ)/2] σ² χ²_1.
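A concrete check of this second-order delta method: take g(y) = y² and θ = 0 (arbitrary choices satisfying g'(θ) = 0, with g''(θ) = 2), and let Yn be the mean of n iid N(0, σ²) draws. Then n[g(Yn) - g(θ)] = n Yn² should behave like σ²χ²_1, whose mean is σ²:

```python
import random

# Second-order delta method sketch with g(y) = y^2, theta = 0, so that
# n * Yn^2 should be (approximately) sigma^2 * chi^2_1. For a normal
# population this holds exactly; all settings below are arbitrary.
random.seed(10)
n, reps, sigma = 200, 10000, 2.0

vals = []
for _ in range(reps):
    yn = sum(random.gauss(0, sigma) for _ in range(n)) / n
    vals.append(n * yn ** 2)

print(sum(vals) / reps)  # close to sigma^2 = 4, the mean of sigma^2 * chi^2_1
```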
So far we have worked with iid random sequences only. We now introduce large sample results for a different type of distribution. The following is largely based on White (2001).
Let’s start with independent heterogeneously distributed random variables.
The failure of the identical distribution assumption results from stratifying
(grouping) the population in some way. The independence assumption remains valid
provided that sampling within and across the strata is random.
Theorem (Markov’s Law of Large Numbers): Let X1 , ..., Xn be a sequence of
independent random variables, with E [Xi ] = µi < ∞, for all i . If for some δ > 0,
0 h i1
∞ E jX i µi j1 + δ
∑@ i 1 +δ
A < ∞,
i =1
then
1 n 1 n a.s .
n i∑ n i∑
Xi µi ! 0.
=1 =1
(1/n) ∑_{i=1}^n E[Xi] = (1/n) ∑_{i=1}^n µ = µ.
Define σ̄² = (1/n) ∑_{i=1}^n Var(Xi). If
σ̄² > ε > 0
for all n sufficiently large, then
√n (X̄ - µ̄)/σ̄ →^d N(0, 1).
The conditions of this result are easier to follow. We now need the existence of even higher-order moments (more than second order).
f (x ) = o fg (x )g .
If
lim_{x→∞} |f(x)/g(x)| ≤ constant,
then we will write
f(x) = O{g(x)}.
Xn = o(n^λ),
Now, consider
Xn = O(n^λ) ⟹ Xn/n^λ = O(1).
Then, for any δ > 0,
Xn/n^{λ+δ} = o(1) ⟹ Xn = o(n^{λ+δ}).
It might now be obvious to you that, if
Xn ! 0,
then
X n = o (1 ).