
# STAT7005 Multivariate Methods

Chapter 2

## 2 Multivariate Normal and Related Distributions

In this course, all methods are based on the assumption that the underlying multivariate distribution
is multivariate normal. In this chapter, we shall summarize some properties of this distribution, show
how the maximum likelihood estimators of unknown parameters are derived and record the sampling
distributions of the estimators for later reference. In this course, students are not required to be on
top of the derivations or the mathematical details, but just to be aware of the results as a background
for the forthcoming methods. We also consider assessment of the assumption of multivariate normality
and the possible remedial transformations when the assumption is violated.

## 2.1 Multivariate Normal Distribution

Definition
A random vector $x$ is said to have a multivariate normal distribution (multinormal distribution) if every linear combination of its components has a univariate normal distribution.

Suppose $a = [a_1\ a_2]'$ and $x = [x_1\ x_2]'$. The multinormality of $x$ requires that $a'x = a_1x_1 + a_2x_2$ be univariate normal for all $a_1$ and $a_2$. Graphically:

[Figure: the joint density $f(x_1, x_2)$ over $(x_1, x_2)$, and the univariate density $f(a_1x_1 + a_2x_2)$ centered at $a_1\mu_1 + a_2\mu_2$.]

Properties
1. If $x$ is multinormal, then for any constant vector $a$,
$$a'x \sim N(a'\mu,\ a'\Sigma a).$$
Proof: Since $\mathrm{E}(a'x) = a'\mu$ and $\mathrm{Var}(a'x) = a'\Sigma a$, the result follows by knowing that $a'x$ is univariate normal.

2. The m.g.f. of a multinormal random vector $x$ with mean vector $\mu$ and covariance matrix $\Sigma$ is given by
$$M_x(t) = \exp\left(t'\mu + \tfrac{1}{2}t'\Sigma t\right).$$
Thus, a multinormal distribution is identified by its means and covariances. We use the notation $x \sim N_p(\mu, \Sigma)$.
Hint: $M_x(t) = \mathrm{E}(e^{t'x}) = \mathrm{E}(e^y)$, where $t = [t_1 \cdots t_p]'$ and $y = t'x \sim N(t'\mu,\ t'\Sigma t)$, since $y$ is a linear combination of the components of $x$.
Recall that the moment generating function (m.g.f.) of a univariate normal $x \sim N(\mu, \sigma^2)$ is $M_x(t) = \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)$, and the $k$th moment is generated by $\mathrm{E}(x^k) = \left.\dfrac{d^k M_x(t)}{dt^k}\right|_{t=0}$.


3. Given $x \sim N_p(\mu_1, \Sigma_1)$ and $y \sim N_p(\mu_2, \Sigma_2)$. If $x$ and $y$ are independent,
$$x + y \sim N_p(\mu_1 + \mu_2,\ \Sigma_1 + \Sigma_2).$$
4. If $x \sim N_p(\mu, \Sigma)$, then for any constant $m \times p$ matrix $A$ and constant vector $d$,
$$Ax + d \sim N_m(A\mu + d,\ A\Sigma A').$$
5. Given a positive definite (i.e. non-singular square, or invertible) matrix $\Sigma$. Then $x \sim N_p(\mu, \Sigma)$ if and only if there exist a non-singular matrix $B$ and $z \sim N_p(0, I)$ such that
$$x = \mu + Bz.$$
In this case, $\Sigma = BB'$.
6. The pdf of $x \sim N_p(\mu, \Sigma)$ is given by
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right\} \qquad (\Sigma \text{ is p.d.}).$$


 


7. Let
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \qquad \text{and} \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$
where $x_1$ consists of the first $q$ components and $x_2$ consists of the last $(p-q)$ components.
(a) $x_1$ and $x_2$ are independent if and only if $\mathrm{Cov}(x_1, x_2) = \Sigma_{12} = 0$.
(b) $x_1 \sim N_q(\mu_1, \Sigma_{11})$ and $x_2 \sim N_{p-q}(\mu_2, \Sigma_{22})$.
(c) $(x_1 - \Sigma_{12}\Sigma_{22}^{-1}x_2)$ is independent of $x_2$ and is distributed as $N_q(\mu_1 - \Sigma_{12}\Sigma_{22}^{-1}\mu_2,\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$.
(d) Given $x_2$, $x_1 \mid x_2 \sim N_q(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$.
Property 7(d) implies $\mathrm{E}(x_1 \mid x_2) = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\mathrm{Var}(x_1 \mid x_2) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, which does not change with the value of $x_2$. Indeed, the result of Property 7(d) is related to the multivariate linear regression model. To summarize, the marginal pdf and conditional pdf of a multivariate normal distribution are still multivariate normal.

8. Given $\mathrm{E}(x) = \mu$, $\mathrm{Var}(x) = \Sigma$ and a $p \times p$ symmetric matrix $A$. Then
$$\mathrm{E}(x'Ax) = \mu'A\mu + \mathrm{tr}(A\Sigma).$$
(This statement is also true for any random $x$, not only the multinormal.)
9. Given $x \sim N_p(\mu, \Sigma)$ with $\Sigma$ p.d. Then
$$(x - \mu)'\Sigma^{-1}(x - \mu) \sim \chi^2(p).$$
10. Let $x \sim N_p(\mu, \Sigma)$ with $\Sigma$ p.d. Then, for any $m \times p$ matrix $A$ and $n \times p$ matrix $B$,
(a) $Ax$ is independent of $Bx$ iff $A\Sigma B' = 0$.
(b) $x'Ax$ ($A$ symmetric) is independent of $Bx$ iff $B\Sigma A = 0$.
(c) $x'Ax$ and $x'Bx$ ($A$ and $B$ both symmetric) are independent iff $A\Sigma B = 0$.
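The partition results in Property 7 can be checked numerically: by 7(b) and 7(d), the joint pdf must equal the marginal pdf of $x_2$ times the conditional pdf of $x_1$ given $x_2$. A minimal sketch with numpy/scipy, using illustrative parameter values that are not from the notes:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative bivariate example: partition x = (x1, x2) with q = 1.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, -1.0])          # an arbitrary evaluation point

# Conditional parameters from Property 7(d): x1 | x2.
s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
cond_mean = mu[0] + s12 / s22 * (x[1] - mu[1])
cond_var = s11 - s12 * s12 / s22   # does not depend on the value of x2

# The joint pdf factors as (marginal of x2) * (conditional of x1 | x2).
joint = multivariate_normal(mu, Sigma).pdf(x)
factored = (norm.pdf(x[1], loc=mu[1], scale=np.sqrt(s22))
            * norm.pdf(x[0], loc=cond_mean, scale=np.sqrt(cond_var)))
```

Note also that `cond_var` is strictly smaller than $\Sigma_{11}$ here: conditioning on a correlated component reduces variance.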
HKU STA7005 (2016-17, Semester 1)

## 2.2 Estimation of $\mu$ and $\Sigma$

Suppose $x_1, \ldots, x_n$ are i.i.d. $N_p(\mu, \Sigma)$. The sample mean vector $\bar{x}$ and the sample covariance matrix $S$ defined in Section 1.4 are respectively the Method of Moments Estimators (MME). By the Law of Large Numbers, these sample quantities approach $\mu$ and $\Sigma$.

As shown in what follows, the method of Maximum Likelihood Estimation (MLE) also gives $\bar{x}$ as the MLE of $\mu$, but the MLE of $\Sigma$ is slightly different from $S$, namely $(n-1)S/n$, which is very close to $S$ when $n$ is large.
The likelihood function is
$$L(\mu, \Sigma) = f(x_1, x_2, \ldots, x_n) = f(x_1)f(x_2)\cdots f(x_n) \qquad (x_1, x_2, \ldots, x_n \text{ are independent})$$
$$= \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)'\Sigma^{-1}(x_i - \mu)\right\} \qquad \text{by Property (6)},$$
and thus the log-likelihood function is
$$\ell(\mu, \Sigma) = \log L(\mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\mathrm{tr}\left[\Sigma^{-1}\sum_{i=1}^{n}(x_i - \mu)(x_i - \mu)'\right].$$
Maximization of the log-likelihood function yields the MLEs of $\mu$ and $\Sigma$, respectively
$$\hat{\mu} = \bar{x} \qquad \text{and} \qquad \hat{\Sigma} = \frac{W}{n} = \frac{(n-1)S}{n}. \tag{2.2.1}$$
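As a sanity check on (2.2.1), one can evaluate the log-likelihood $\ell(\mu, \Sigma)$ above on data and confirm that $(\bar{x}, (n-1)S/n)$ attains a value no smaller than nearby alternatives such as $(\bar{x}, S)$. A sketch with simulated data (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.standard_normal((n, p))        # any data set works for this check

def loglik(mu, Sigma):
    """Multinormal log-likelihood l(mu, Sigma) from the display above."""
    D = X - mu
    W = D.T @ D                        # sum of (x_i - mu)(x_i - mu)'
    _, logdet = np.linalg.slogdet(Sigma)
    return (-n * p / 2 * np.log(2 * np.pi) - n / 2 * logdet
            - 0.5 * np.trace(np.linalg.solve(Sigma, W)))

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)            # unbiased S (divisor n - 1)
Sigma_hat = (n - 1) * S / n            # MLE from (2.2.1)
l_mle = loglik(xbar, Sigma_hat)
```

Since the MLE is the joint maximizer, `l_mle` dominates the log-likelihood at any other $(\mu, \Sigma)$.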

Properties
1. $\bar{x}$ and $S$ are sufficient statistics for $N_p(\mu, \Sigma)$, i.e., the conditional distribution of the sample $(x_1, \ldots, x_n)$ given $\bar{x}$ and $S$ does not depend on $\mu$ and $\Sigma$.
2. $\bar{x} \sim N_p(\mu, \tfrac{1}{n}\Sigma)$ and $(n-1)S \sim W_p(n-1, \Sigma)$, a central Wishart distribution, which is defined in the next section.
3. $\hat{\mu}$ is unbiased but $\hat{\Sigma}$ is biased. However, $S$ is unbiased for $\Sigma$.
4. The MLEs possess an invariance property: if the MLE of $\theta_j$ is denoted $\hat{\theta}_j$ for $j = 1, 2, \ldots, p$, then the MLE of $\eta_i = h_i(\theta_1, \theta_2, \ldots, \theta_p)$ is $\hat{\eta}_i = h_i(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_p)$ for $i = 1, 2, \ldots, r$ where $1 \le r \le p$, provided the $h_i(\cdot)$s are one-to-one functions.
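The bias statement in Property 3 is exactly the divisor-$n$ versus divisor-$(n-1)$ distinction; in numpy the two estimators are `np.cov` with `ddof=0` and `ddof=1`. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))             # n = 30 observations, p = 3
n = X.shape[0]

S = np.cov(X, rowvar=False, ddof=1)          # unbiased sample covariance S
Sigma_hat = np.cov(X, rowvar=False, ddof=0)  # MLE, divisor n
```

`Sigma_hat` equals $(n-1)S/n$ exactly, so on average it slightly understates $\Sigma$.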

## 2.3 Wishart Distribution

Definition
Suppose $x_i$ ($i = 1, \ldots, k$) are independent $N_p(\mu_i, \Sigma)$. Define a symmetric $p \times p$ matrix $V$ as
$$V = \sum_{i=1}^{k} x_i x_i' = X'X.$$
Then $V$ is said to follow a Wishart distribution, denoted by $W_p(k, \Sigma, \Delta)$, where $\Sigma$ is called the scaling matrix, $k$ the degrees of freedom, and $\Delta = \sum_{i=1}^{k} \mu_i \mu_i'$ the ($p \times p$ symmetric) noncentrality matrix.
i=1

Indeed, the Wishart distribution can be considered a multivariate extension of chi-squared distribution.

When $\Delta = 0$, the Wishart distribution is called the central Wishart distribution, denoted simply by $W_p(k, \Sigma)$.
Note that when $p = 1$ and $\Sigma = \sigma^2$, the Wishart distribution $W_p(k, \Sigma, \Delta)$ reduces to a scaled non-central chi-squared distribution $\sigma^2 \chi^2\!\left(k, \sum_{i=1}^{k} \mu_i^2/\sigma^2\right)$.

In the univariate case, the pdf of a central Wishart distribution reduces to that of a central chi-squared distribution.
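This univariate reduction can be confirmed against scipy, which implements both densities: a $W_1(k, \sigma^2)$ density should coincide with the density of $\sigma^2 \chi^2(k)$ in the central case. A sketch with illustrative values:

```python
import numpy as np
from scipy.stats import wishart, chi2

k, sigma2 = 5, 2.0                 # degrees of freedom and scalar "Sigma"
x = 3.7                            # an arbitrary positive evaluation point

w_pdf = wishart(df=k, scale=sigma2).pdf(x)      # W_1(k, sigma^2) density
c_pdf = chi2.pdf(x / sigma2, df=k) / sigma2     # density of sigma^2 * chi2(k)
```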

Properties
1. If $V \sim W_p(k, \Sigma)$, the pdf of $V$ is given by
$$f(V) = c(p, k)\, |\Sigma|^{-k/2}\, |V|^{(k-p-1)/2} \exp\left\{-\frac{1}{2}\mathrm{tr}\left(\Sigma^{-1}V\right)\right\}$$
where $c(p, k) = \left[2^{kp/2}\, \pi^{p(p-1)/4}\, \Gamma\!\left(\tfrac{k}{2}\right)\Gamma\!\left(\tfrac{k-1}{2}\right)\cdots\Gamma\!\left(\tfrac{k-p+1}{2}\right)\right]^{-1}$.

2. If $V_1 \sim W_p(k_1, \Sigma, \Delta_1)$ and $V_2 \sim W_p(k_2, \Sigma, \Delta_2)$ are independent,
$$V_1 + V_2 \sim W_p(k_1 + k_2,\ \Sigma,\ \Delta_1 + \Delta_2).$$
This implies that the sum of two independent Wishart random matrices with the same scaling matrix is again a Wishart random matrix.
3. If $V \sim W_p(k, \Sigma, \Delta)$,
$$AVA' \sim W_q(k,\ A\Sigma A',\ A\Delta A')$$
for any given $q \times p$ matrix $A$. In addition, it is central if $A\Delta A' = 0$.
When $q = 1$, the Wishart random matrix $V$ can be reduced to a chi-squared random variable.
4. If $V \sim W_p(k, \Sigma, \Delta)$ and $a$ is a constant vector,
$$\frac{a'Va}{a'\Sigma a} \sim \chi^2\!\left(k,\ \frac{a'\Delta a}{a'\Sigma a}\right).$$
In particular, for the $i$th diagonal element of $V$,
$$\frac{v_{ii}}{\sigma_{ii}} \sim \chi^2\!\left(k,\ \frac{\delta_{ii}}{\sigma_{ii}}\right)$$
where $v_{ii}$, $\delta_{ii}$ and $\sigma_{ii}$ are the $i$th diagonal elements of $V$, $\Delta$ and $\Sigma$ respectively.

5. If $y$ is any random vector independent of $V \sim W_p(k, \Sigma)$, then
$$\frac{y'Vy}{y'\Sigma y} \sim \chi^2(k)$$
and is independent of $y$.
6. Let $X$ be the $k \times p$ data matrix given in the Definition and $M = [\mu_1 \cdots \mu_k]'$. Then:
(a) For a symmetric $A$, $X'AX \sim W_p(r, \Sigma, \Delta)$ iff $A^2 = A$ (i.e. $A$ is idempotent), in which case $r = \mathrm{rank}(A) = \mathrm{tr}(A)$ and $\Delta = M'AM$.
(b) For symmetric idempotent matrices $A$ and $B$, $X'AX$ and $X'BX$ are independent iff $AB = 0$.
(c) For symmetric idempotent $A$, $X'AX$ and $X'B$ are independent iff $AB = 0$.
7. Let $x_1, \ldots, x_n$ be a random sample from $N_p(\mu, \Sigma)$.
(a) $\bar{x} \sim N_p(\mu, \tfrac{1}{n}\Sigma)$.
(b) $(n-1)S$ ($= W$, the CSSP matrix) $\sim W_p(n-1, \Sigma)$.
(c) $\bar{x}$ and $S$ are independent.
Compare these results with the univariate case.
8. In the Cholesky decomposition $V = TT'$ of $V \sim W_p(k, I)$, where $T = (t_{ij})$ is lower triangular (i.e. $t_{ij} = 0$ if $i < j$), all non-zero elements are mutually independent, with $t_{ij} \sim N(0, 1)$ for $i > j$ and $t_{ii} \sim \chi(k - i + 1)$.
9. If $y$ is any random vector independent of $V \sim W_p(k, \Sigma)$, $k > p$, then the ratio
$$\frac{y'\Sigma^{-1}y}{y'V^{-1}y} \sim \chi^2(k - p + 1)$$
and is independent of $y$.
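The density formula in Property 1, including the normalizing constant $c(p, k)$, can be checked against scipy's Wishart implementation. A sketch using `gammaln` for the product of gamma functions (illustrative $\Sigma$ and $V$, not from the notes):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import wishart

p, k = 2, 6
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
V = np.array([[8.0, 1.0],
              [1.0, 5.0]])          # a positive definite evaluation point

# log c(p, k): gammaln terms are Gamma(k/2), Gamma((k-1)/2), ..., Gamma((k-p+1)/2)
log_c = -(k * p / 2 * np.log(2.0) + p * (p - 1) / 4 * np.log(np.pi)
          + sum(gammaln((k - j) / 2) for j in range(p)))

_, logdet_S = np.linalg.slogdet(Sigma)
_, logdet_V = np.linalg.slogdet(V)
log_f = (log_c - k / 2 * logdet_S + (k - p - 1) / 2 * logdet_V
         - 0.5 * np.trace(np.linalg.solve(Sigma, V)))
```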

## 2.4 Assessing Normality Assumption

Before any statistical modeling, it is crucial to verify whether the data at hand satisfy the underlying distributional assumptions. For most multivariate analyses, it is important that the data indeed follow the multivariate normal distribution, at least approximately if not exactly. Here are some commonly used methods.
1. Check each variable for univariate normality (necessary for multinormality but not sufficient). [Use either the SAS procedure proc univariate or SAS/INSIGHT. To invoke the latter, select buttons in the following sequence: Solution → Analysis → Interactive Data Analysis.]
- Q-Q plot (quantile against quantile plot) for the normal distribution
  - Sample quantiles are plotted against the theoretical quantiles of a standard normal distribution.
  - A straight line indicates univariate normality.
  - A non-linear transformation of the variable may help to achieve normality.
HKU STA7005 (2016-17, Semester 1)


[Figure: normal Q-Q plots for several distributions (normal quantile on one axis); panel (f): Bimodal Distribution.]

- An alternative device is the P-P plot (probability against probability plot), which plots the sample cdf against the theoretical cdf.
- Shapiro-Wilk W test (small sample)
  The test statistic of the W test is a modified version of the squared sample correlation between the sample quantiles and the expected quantiles.
- Kolmogorov-Smirnov-Lilliefors (KSL) test (large sample, n > 2000)
  The test statistic of this test is the maximum difference between the empirical cdf and the normal cdf.
- Test for zero skewness
  For a symmetric distribution, the sample skewness
  $$\frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{s^3}$$
  is close to 0. [See Doane, D.P., Seward, L.E. (2011), Measuring Skewness: A Forgotten Statistic? Journal of Statistics Education, 19(2).]
- Test for zero excess kurtosis
  For a normal distribution, the sample excess kurtosis
  $$\frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{s^4} - \frac{3(n-1)^2}{(n-2)(n-3)}$$
  is close to 0. [See Sheskin, D.J. (2000), Handbook of Parametric and Nonparametric Statistical Procedures, Second Edition. Boca Raton, Florida: Chapman & Hall/CRC.]
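The two statistics above coincide with scipy's bias-corrected versions (`scipy.stats.skew` and `scipy.stats.kurtosis` with `bias=False`), which gives a quick cross-check. A sketch on a simulated skewed sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(size=200)      # a deliberately right-skewed sample
n = x.size
xbar, s = x.mean(), x.std(ddof=1)  # s is the usual sample std (divisor n-1)

# Sample skewness, exactly as displayed above
skew_stat = n / ((n - 1) * (n - 2)) * np.sum((x - xbar) ** 3) / s ** 3

# Sample excess kurtosis, exactly as displayed above
kurt_stat = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
             * np.sum((x - xbar) ** 4) / s ** 4
             - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
```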

2. When $n - p$ is large enough, we make use of Property (9) of Section 2.1: check whether the squared generalized distance defined below follows a chi-square distribution, by a Q-Q plot (a necessary and sufficient condition for very large sample sizes).
- Define the squared Mahalanobis distance as $d_i^2 = (x_i - \bar{x})'S^{-1}(x_i - \bar{x})$, $i = 1, \ldots, n$.
- Order $d_1^2, d_2^2, \ldots, d_n^2$ as $d_{(1)}^2 \le d_{(2)}^2 \le \cdots \le d_{(n)}^2$.
- Plot $\chi^2_{(i)}$ vs $d_{(i)}^2$, where $\chi^2_{(i)}$ is the $100(i - 0.375)/(n + 0.25)$ (as in proc univariate) percentile of the $\chi^2(p)$ distribution.
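The construction can be scripted directly; a useful by-product check is the exact identity $\sum_i d_i^2 = (n-1)p$, which follows from $\sum_i (x_i - \bar{x})(x_i - \bar{x})' = (n-1)S$. A sketch with simulated data (plotting itself omitted):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.standard_normal((n, p))

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
D = X - xbar
d2 = np.sum(D @ np.linalg.inv(S) * D, axis=1)   # squared Mahalanobis distances

d2_sorted = np.sort(d2)
probs = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
chi2_q = chi2.ppf(probs, df=p)     # plot chi2_q against d2_sorted
```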

[Figure: (b) Bivariate Normal Distribution, showing the density $f(x_1, x_2)$ over $(x_1, x_2)$; (d) Chi-square Q-Q Plot of (b), with the chi-square quantile on one axis.]

In this course, a SAS macro code to produce this Q-Q plot will be provided.
3. Check each Principal Component (PC) for univariate normality (a necessary condition; and if the sample size $n$ is large enough, a sufficient condition).
The PCs are readily available and their univariate normality is easily checked in SAS/INSIGHT; otherwise, the procedure proc princomp is required before we can use proc univariate to check normality.
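Outside SAS, the PCs are easily obtained from the eigenvectors of $S$; each PC score column can then be fed to any univariate normality check. A sketch with simulated data, using an eigendecomposition in place of proc princomp:

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[1.0, 0.3, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
X = rng.standard_normal((60, 3)) @ A      # correlated sample, n = 60, p = 3

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)      # spectral decomposition of S
PC = (X - X.mean(axis=0)) @ eigvecs       # principal component scores

S_pc = np.cov(PC, rowvar=False)           # sample covariance of the PCs
```

`S_pc` is diagonal with the eigenvalues of $S$ on the diagonal, so the PCs are uncorrelated; each column of `PC` is then checked for univariate normality.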

## 2.5 Transformations to Near Normality

To achieve multinormality of the data, a univariate transformation is applied to each variable individually. Afterwards, the multinormality of the transformed variables is checked again. The following transformations are commonly used in practice:

1. Use the most common transformation: $\log x$ or $\log(x + 1)$.

2. Choose a transformation based on theory; some examples are given below:
- count data: $\sqrt{x}$
- percentages or proportions: $\sin^{-1}\sqrt{x/100}$ or $\sin^{-1}\sqrt{x}$

3. Use the univariate Box-Cox transformation.
The transformed $x$ is denoted $x^{(\lambda)}$, where
$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{for } \lambda \neq 0 \\[4pt] \log x & \text{for } \lambda = 0 \end{cases}$$
and $\lambda$ is an unknown parameter. Typically, $\lambda$ can be chosen by
(a) a priori information about $\lambda$:

| Power $\lambda$ | 3 | 2 | 1 | 0.5 | 0 | $-0.5$ | $-1$ | $-2$ | $-3$ |
|---|---|---|---|---|---|---|---|---|---|
| Transformation | $x^3$ | $x^2$ | $x$ | $\sqrt{x}$ | $\log x$ | $1/\sqrt{x}$ | $1/x$ | $1/x^2$ | $1/x^3$ |

(b) the minimum sample variance of $x^{(\lambda)}$, i.e.,
$$\hat{\lambda} = \arg\min_{\lambda} \sum_{i=1}^{n} \left(x_i^{(\lambda)} - \bar{x}^{(\lambda)}\right)^2.$$
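The transformation itself is a one-liner; a minimal sketch (the function name `box_cox` is ours, not from the notes):

```python
import numpy as np

def box_cox(x, lam):
    """Univariate Box-Cox transform x^(lambda) as defined above (x > 0)."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1) / lam

x = np.array([0.5, 1.0, 2.0, 4.0])
```

Note that $(x^{\lambda} - 1)/\lambda \to \log x$ as $\lambda \to 0$, so the two-case definition is continuous in $\lambda$.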

[Figure: normal Q-Q plots of (a) a Log-normal Distribution and (b) its Log-transformation ($\lambda = 0$), i.e. $\log(x)$; normal quantile on one axis.]

4. Use the multivariate Box-Cox transformation.
Each variable is transformed by a univariate Box-Cox transformation with a different parameter $\lambda$.
The parameters $\lambda$ are estimated jointly by maximum likelihood estimation.

Note: The transformations above require $x \ge 0$, and some of them require $x > 0$. For more general transformation methods, see Sakia (1992, The Statistician, 169-178).

## 2.6 Summary of Chapter 2

1. Multivariate normal distribution, $x \sim N_p(\mu, \Sigma)$:
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right\}.$$
2. Wishart distribution, $V \sim W_p(k, \Sigma)$:
$$f(V) = c(p, k)\, |\Sigma|^{-k/2}\, |V|^{(k-p-1)/2} \exp\left\{-\frac{1}{2}\mathrm{tr}\left(\Sigma^{-1}V\right)\right\}$$
where $c(p, k) = \left[2^{kp/2}\, \pi^{p(p-1)/4}\, \Gamma\!\left(\tfrac{k}{2}\right)\Gamma\!\left(\tfrac{k-1}{2}\right)\cdots\Gamma\!\left(\tfrac{k-p+1}{2}\right)\right]^{-1}$.
3. MLEs:
(a) $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
(b) $\hat{\Sigma} = \frac{1}{n}(X'X - n\bar{x}\bar{x}') = \frac{n-1}{n}S$.
4. Sampling distributions of $\bar{x}$ and $S$:
(a) $\bar{x} \sim N_p(\mu, \tfrac{1}{n}\Sigma)$.
(b) $(n-1)S \sim W_p(n-1, \Sigma)$, where $(n-1)S = \sum_{i=1}^{n-1} z_i z_i'$ with $z_i \sim N_p(0, \Sigma)$.

5. Normality checking
(a) Univariate
Q-Q plot, Shapiro-Wilk test, Kolmogorov-Smirnov (KS) test, tests for skewness and kurtosis
(b) Multivariate
Chi-square plot, principal components method
6. Transformation to near normality
Box-Cox transformation