
Chapter 4.

Multivariate Models 97

August 31, 2011

Chapter 4. Multivariate Models

The primary purpose of this chapter is to introduce some basic ideas from multivariate
statistical analysis. Quite often, experiments produce data where measurements
are obtained on more than one variable, hence the name multivariate. In the Swiss
head dimension example (Flury, 1997), in order to determine well-fitting masks, sev-
eral different head-dimension measurements were obtained on the soldiers. In the
next chapter on regression analysis, we will examine models that are defined in terms
of several parameters. In order to properly understand the estimation of these model
parameters, a foundation in multivariate statistics is needed. In particular, we need
to understand concepts such as covariances and correlations between variables and
estimators. An advantage of the multivariate approach is to allow for designs of ex-
periments where the resulting parameter estimators will be uncorrelated, thus making
it easier to interpret results.

1 Multivariate Probability Density Functions


The probabilistic background for multivariate statistics requires multiple integration
ideas as seen in some of the formulas below. However, this chapter does not require
multiple integration computations. We shall be concerned instead with statistical
estimation computations which require simple (but tedious) arithmetic and some
elementary matrix algebra. Fortunately, these computations can be done very easily
on the computer. Data, particularly multivariate data, comes in the form of arrays of
numbers and hence matrix algebra techniques are the natural way of handling such
data. The appendix to this chapter contains a short review of some matrix algebra
in case the reader needs to brush up on these ideas.
Suppose we are interested in two variables. For instance, in the Swiss head dimension
data, let Y1 = MFB (Minimal frontal breadth or forehead width) and let Y2 = BAM,
(Breadth of angulus mandibulae or chin width). Data that consists of measurements
on two different variables is called bivariate data (similarly, data collected on three
variables is called trivariate and so on). We can define a joint probability density
function $f(y_1, y_2)$ that satisfies the following properties, which mirror the properties
satisfied by the (univariate) pdf:

1. $f(y_1, y_2) \ge 0$.
2. The total volume under the pdf must be 1:
   $$\int\!\!\int f(y_1, y_2)\, dy_1\, dy_2 = 1.$$
3. Let $A \subset \mathbb{R}^2$; then
   $$P((Y_1, Y_2) \in A) = \int\!\!\int_A f(y_1, y_2)\, dy_1\, dy_2.$$

Definition. The marginal pdf of $Y_1$, denoted $f_1(y_1)$, is just the pdf of the random
variable $Y_1$ considered alone. To determine the marginal pdf, we integrate out $y_2$ in
the joint pdf:
$$f_1(y_1) = \int f(y_1, y_2)\, dy_2.$$

The marginal pdf of Y2 is defined similarly.


Our focus here is not so much on computing probabilities using multiple integration.
Instead, we will focus on statistical measures of association between variables.

2 Covariance
Let $Y_1$ and $Y_2$ be two jointly distributed random variables with means $\mu_1$ and $\mu_2$
respectively and variances $\sigma_1^2$ and $\sigma_2^2$. A common measure of association between $Y_1$
and $Y_2$ is the covariance, denoted $\sigma_{12}$:

Covariance: $\sigma_{12} = \mathrm{cov}(Y_1, Y_2) = E[(Y_1 - \mu_1)(Y_2 - \mu_2)]$.

The population covariance can be computed by

$$\sigma_{12} = \int\!\!\int (y_1 - \mu_1)(y_2 - \mu_2) f(y_1, y_2)\, dy_1\, dy_2.$$

A positive covariance indicates that if $Y_1$ is above its average ($Y_1 - \mu_1 > 0$), then
$Y_2$ tends to be above its average ($Y_2 - \mu_2 > 0$), so that $(Y_1 - \mu_1)(Y_2 - \mu_2)$ tends to
be positive; also if $Y_1$ is below average, then $Y_2$ tends to be below average as well,
whereby $(Y_1 - \mu_1)(Y_2 - \mu_2)$ is a negative times a negative, resulting in a positive value.
Conversely, a negative covariance indicates that if $Y_1$ tends to be small, then $Y_2$ tends
to be large, and vice versa.
To illustrate, if Y1 is a measure of a person's height and Y2 is a measure of their weight,
then these two variables tend to be associated. In particular, the covariance between
them is usually positive since taller people tend to weigh more and shorter people
tend to weigh less. On the other hand, if Y1 is the hours of training a technician
receives for learning to operate a new machine and Y2 represents the number of
errors the technician makes using the machine, then we would expect to see fewer
errors corresponding with more training and hence Y1 and Y2 would have a negative
covariance.
The covariance plays an important role when considering differences $Y_1 - Y_2$
of jointly distributed random variables. For instance, we will discuss later
experiments looking at paired differences in situations where we may want to compare
two different experimental conditions. The statistical analysis requires that we know
the variance of the difference, $\mathrm{var}(Y_1 - Y_2)$. There are two extreme cases:

$Y_1 = Y_2$: $\mathrm{var}(Y_1 - Y_2) = \mathrm{var}(0) = 0$
$\sigma_{12} = 0$: $\mathrm{var}(Y_1 - Y_2) = \mathrm{var}(Y_1) + \mathrm{var}(Y_2)$

These two extremes are special cases of the following formula, which holds in all cases:

$$\mathrm{var}(Y_1 - Y_2) = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12}. \qquad (1)$$

Exercise. Derive (1) using the definition of variance.
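
Formula (1) also holds exactly when the population quantities are replaced by sample variances and the sample covariance, so it can be checked numerically before attempting the derivation. The following Matlab sketch does this; the vectors y1 and y2 are made-up values chosen only for illustration and are not data from this chapter.

```matlab
% Made-up paired measurements (hypothetical values for illustration only)
y1 = [2; 4; 6; 8; 10];
y2 = [1; 3; 2; 5; 4];

S   = cov([y1 y2]);                    % 2-by-2 sample covariance matrix
lhs = var(y1 - y2);                    % sample variance of the differences
rhs = S(1,1) + S(2,2) - 2*S(1,2);      % right-hand side of formula (1)

fprintf('var(y1 - y2) = %.4f, s1^2 + s2^2 - 2*s12 = %.4f\n', lhs, rhs);
```

The two printed numbers should agree up to rounding error; a numerical agreement of this kind is a sanity check, not a substitute for the derivation asked for in the exercise.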

3 Correlation
We can transform the covariance to obtain a well-known measure of association known
as the correlation, which is denoted by the Greek letter $\rho$ (rho).

Correlation: $\rho = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2}$, where $\sigma_1$ and $\sigma_2$ are the standard deviations of $Y_1$ and $Y_2$
respectively.

Here are a couple of properties of $\rho$:

1. $-1 \le \rho \le 1$.

2. If $\rho = \pm 1$, then $Y_1$ and $Y_2$ are perfectly related by a linear transformation, that
   is, there exist constants $a$ and $b$ so that $Y_2 = a + bY_1$.

Property (1) highlights the fact that the correlation is a unitless quantity. Property
(2) highlights the fact that the correlation is a measure of the strength of the linear
relation between $Y_1$ and $Y_2$. A perfect linear relation produces a correlation of 1
or $-1$. A correlation of zero indicates no linear relation between the two random
variables. Figure 1 shows scatterplots of data obtained from bivariate distributions
with different correlations. The distribution for the top-left panel had a correlation
of $\rho = 0.95$. The plot shows a strong positive relation between $Y_1$ and $Y_2$ with the
points tightly clustered together in a linear pattern. The correlation for the top-right
panel is also positive with $\rho = 0.50$ and again we see a positive relation between the
two variables, but not as strong as in the top-left panel. The bottom-left panel
corresponds to a correlation of $\rho = 0$ and consequently, we see no relationship evident
between $Y_1$ and $Y_2$ in this plot. Finally, the bottom-right panel shows a negative
linear relation with a correlation of $\rho = -0.50$.
Figure 1: Scatterplots of data obtained from bivariate distributions with different
correlations.

A note of caution is in order: two variables Y1 and Y2 can be strongly related, but
the relation may be nonlinear, in which case the correlation may not be a reasonable
measure of association. Figure 2 shows a scatterplot of data from a bivariate
distribution. There is clearly a very strong relation between y1 and y2, but the relation is
nonlinear. The correlation is not an appropriate measure of association for this data.
In fact, the correlation is nearly zero. To say y1 and y2 are unrelated because they
are uncorrelated can be misleading if the relation is nonlinear. This is an error that
is quite commonly made in everyday usage of the term correlation.
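
A small numerical illustration of this point, as a sketch: below, y2 is an exact function of y1, yet the sample correlation is essentially zero because the relation is not linear. The grid of y1 values is a hypothetical choice (symmetric about zero) and is not the data plotted in Figure 2.

```matlab
y1 = (-3:0.1:3)';          % hypothetical values, symmetric about zero
y2 = y1.^2;                % y2 is completely determined by y1, but not linearly

R   = corrcoef(y1, y2);    % 2-by-2 matrix of sample correlations
r12 = R(1,2);              % sample correlation between y1 and y2

fprintf('sample correlation between y1 and y2: %.2e\n', r12);
```

The printed value differs from zero only by rounding error, even though knowing y1 determines y2 exactly.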
Caution: Another very common error made in practice is to assume that because
two variables are highly correlated, one causes the other. Sometimes this will indeed
be the case (e.g. more fertilizer leads to taller plants and hence a positive correlation.)
In other cases, the causation conclusion is silly. For example, do a survey of fires in
a large city and note Y1 , the dollar amount of fire damage, and also Y2 , the number
of fire-fighters called in to fight the fire. Will Y1 and Y2 be positively or negatively
correlated? Does sending more fire fighters to a fire cause more fire damage? Or,
could the association be due to something else?
Figure 2: A scatterplot showing a very strong but nonlinear relationship between y1
and y2. The correlation is nearly zero.

Below is some Matlab code for obtaining plots and statistics for the multivariate Swiss
head dimension data:

% Measurements on 200 Swiss soldiers, obtained to design new
% gas masks. 6 measurements were taken on each soldier (facial height,
% width, etc.)

% Put the correct path to the data swiss.dat:
load swiss.dat;

mfb = swiss(:,1);   % Minimal frontal breadth (forehead width)
bam = swiss(:,2);   % Breadth of angulus mandibulae (chin width)
tfh = swiss(:,3);   % True facial height
lgan = swiss(:,4);  % Length from glabella to apex nasi (tip of nose to top of forehead)
ltn = swiss(:,5);   % Length from tragion to nasion (top of nose to ear)
ltg = swiss(:,6);   % Length from tragion to gnathion (bottom of chin to ear)

plot(mfb, bam, '*')          % scatterplot of chin width against forehead width
title('Swiss Head Data')
xlabel('Forehead Width')
ylabel('Chin Width')

cov(swiss)    % Compute the sample covariance matrix
corr(swiss)   % Compute the sample correlation matrix (corrcoef(swiss) gives the same matrix)

Note that to access a particular variable (i.e. a column of the data set called swiss), we
write swiss(:,1) for column 1, and so on.

4 Higher Dimensional Distributions


For higher dimensional data, it is helpful to employ matrix notation. Suppose we
have $p$ jointly distributed random variables $Y_1, Y_2, \dots, Y_p$ with means $\mu_1, \mu_2, \dots, \mu_p$
and variances $\sigma_1^2, \sigma_2^2, \dots, \sigma_p^2$. For instance, in the Swiss head dimension example,
there were $p = 6$ head dimension variables recorded for each soldier. We can let the
boldfaced $\mathbf{Y}$ denote the column vector of random variables:

$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_p \end{pmatrix}$$

and let the boldfaced $\boldsymbol{\mu}$ denote the corresponding vector of means:

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}.$$

When we have more than two variables, we can compute covariances between each
pair of variables. These covariances are collected together in a $p \times p$ matrix called
the covariance matrix. The diagonal elements of a covariance matrix correspond to
the variances of the random variables. The $ij$th element of the covariance matrix
is the covariance between $Y_i$ and $Y_j$. The covariance matrix is a symmetric matrix
because the covariance between $Y_i$ and $Y_j$ is the same as the covariance between $Y_j$
and $Y_i$. To illustrate, suppose we have a tri-variate distribution for $Y_1$, $Y_2$ and $Y_3$.
Let $\sigma_{12} = \mathrm{cov}(Y_1, Y_2)$, $\sigma_{13} = \mathrm{cov}(Y_1, Y_3)$, and $\sigma_{23} = \mathrm{cov}(Y_2, Y_3)$. Then the covariance
matrix, denoted by $\boldsymbol{\Sigma}$, is

$$\text{Covariance Matrix:}\quad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{pmatrix}.$$

A convenient way of defining the covariance matrix in terms of expectations is

$$\boldsymbol{\Sigma} = E[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'].$$

When we take the expected value of a random vector or a random matrix, we compute
the expected value of each term individually. For example,

$$E[\mathbf{Y}] = \begin{pmatrix} E[Y_1] \\ E[Y_2] \\ \vdots \\ E[Y_p] \end{pmatrix}.$$

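For a bivariate random vector $(Y_1, Y_2)'$ with mean $(\mu_1, \mu_2)'$, we have

$$(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})' = \begin{pmatrix} Y_1 - \mu_1 \\ Y_2 - \mu_2 \end{pmatrix} \begin{pmatrix} Y_1 - \mu_1 & Y_2 - \mu_2 \end{pmatrix} = \begin{pmatrix} (Y_1 - \mu_1)^2 & (Y_1 - \mu_1)(Y_2 - \mu_2) \\ (Y_1 - \mu_1)(Y_2 - \mu_2) & (Y_2 - \mu_2)^2 \end{pmatrix}.$$

Therefore,

$$E[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'] = \begin{pmatrix} E[(Y_1 - \mu_1)^2] & E[(Y_1 - \mu_1)(Y_2 - \mu_2)] \\ E[(Y_1 - \mu_1)(Y_2 - \mu_2)] & E[(Y_2 - \mu_2)^2] \end{pmatrix},$$

which is the covariance matrix $\boldsymbol{\Sigma}$.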



Of course, the population covariances (e.g. $\sigma_{12}$) and the population correlations are
typically unknown population parameters which must be estimated from the data.
Generally, multivariate data sets are organized so that each row corresponds to a new
$p$-dimensional observation and each column corresponds to the measurement on one
of the $p$ variables. In other words, the data usually comes in the form of $n$ rows for the
sample size and $p$ columns for the $p$ measured variables. For a $p$-dimensional data set,
let $y_{i1}$ equal the $i$th observation on the first variable, and $y_{i2}$ equal the $i$th observation
on the second variable, and so on for $i = 1, 2, \dots, n$. The sample covariance between
variables 1 and 2, denoted $s_{12}$, is

$$s_{12} = \sum_{i=1}^{n} (y_{i1} - \bar{y}_1)(y_{i2} - \bar{y}_2)/(n - 1) \qquad (2)$$

where $\bar{y}_1$ and $\bar{y}_2$ are the sample means of the first and second variables respectively.
We can estimate the covariance matrix by replacing the population variances and
covariances by their respective estimators; this will be called the sample covariance
matrix and is generally denoted by $\mathbf{S}$.
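
As a sketch of how formula (2) extends to the whole matrix, the Matlab lines below center each column of a small data matrix and form the cross-products, then compare the result with the built-in cov. The five rows are the first five (MFB, BAM) observations from the Swiss head data listing given later in this chapter; they are used here only to keep the example short, so the numbers will not match the covariance matrix computed from all observations. The variable names are illustrative choices.

```matlab
% First five (MFB, BAM) observations from the Swiss head data listing
Y = [113.2 111.7;
     117.6 117.3;
     112.3 124.7;
     116.2 110.5;
     112.9 111.3];

n    = size(Y, 1);                     % number of observations (rows)
ybar = mean(Y, 1);                     % 1-by-p row vector of sample means
Yc   = Y - repmat(ybar, n, 1);         % subtract each column's mean

S_manual  = (Yc' * Yc) / (n - 1);      % entry (1,2) is formula (2); diagonal holds variances
S_builtin = cov(Y);                    % Matlab's sample covariance matrix

fprintf('largest difference from cov(Y): %g\n', max(abs(S_manual(:) - S_builtin(:))));
```

The cross-product form Yc' * Yc computes every pairwise sum in (2) at once, which is why matrix notation is convenient for multivariate data.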

Example. Consider once again the Swiss head dimension data consisting of p = 6
head measurements. Denote these measurements by

Y1 = MFB = Minimal frontal breadth (forehead width)


Y2 = BAM = Breadth of angulus mandibulae (chin width)
Y3 = TFH = True facial height
Y4 = LGAN = Length from glabella to apex nasi (tip of nose to top of forehead)
Y5 = LTN = length from tragion to nasion (top of nose to ear)
Y6 = LTG = Length from tragion to gnathion (bottom of chin to ear).

To give an indication of what the data looks like, below is a list of the first 20 observations:

MFB BAM TFH LGAN LTN LTG


113.2 111.7 119.6 53.9 127.4 143.6
117.6 117.3 121.2 47.7 124.7 143.9
112.3 124.7 131.6 56.7 123.4 149.3
116.2 110.5 114.2 57.9 121.6 140.9
112.9 111.3 114.3 51.5 119.9 133.5
104.2 114.3 116.5 49.9 122.9 136.7
110.7 116.9 128.5 56.8 118.1 134.7
105.0 119.2 121.1 52.2 117.3 131.4
115.9 118.5 120.4 60.2 123.0 146.8
96.8 108.4 109.5 51.9 120.1 132.2
110.7 117.5 115.4 55.2 125.0 140.6
108.4 113.7 122.2 56.2 124.5 146.3
104.1 116.0 124.3 49.8 121.8 138.1
107.9 115.2 129.4 62.2 121.6 137.9
106.4 109.0 114.9 56.8 120.1 129.5
112.7 118.0 117.4 53.0 128.3 141.6
109.9 105.2 122.2 56.6 122.2 137.8
116.6 119.5 130.6 53.0 124.0 135.3
109.9 113.5 125.7 62.8 122.7 139.5
107.1 110.7 121.7 52.1 118.6 141.6

To get a better feel for the data, Figure 3 shows scatterplots of each pair of variables.

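The sample mean vector for the entire data set is given by

$$\bar{\mathbf{y}} = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \vdots \\ \bar{y}_6 \end{pmatrix} = \begin{pmatrix} 114.7245 \\ 115.9140 \\ 123.0550 \\ 57.9885 \\ 122.2340 \\ 138.8335 \end{pmatrix}$$

and the sample covariance matrix $\mathbf{S}$ is equal to

$$\mathbf{S} = \begin{pmatrix}
26.9012 & 12.6229 & 5.3834 & 2.9313 & 8.1767 & 12.1073 \\
12.6229 & 27.2522 & 2.8805 & 2.0575 & 7.1255 & 11.4412 \\
5.3834 & 2.8805 & 35.2300 & 10.3692 & 6.0275 & 7.9725 \\
2.9313 & 2.0575 & 10.3692 & 17.8453 & 2.9194 & 4.9936 \\
8.1767 & 7.1255 & 6.0275 & 2.9194 & 15.3702 & 14.5213 \\
12.1073 & 11.4412 & 7.9725 & 4.9936 & 14.5213 & 31.8369
\end{pmatrix}.$$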

Matlab can compute these statistics easily; use the cov command to get the sample
covariance matrix. Note that the covariances between the six head measurements are all
positive. It is quite common to see all positive covariances on data of this sort. For
example, people with larger than average forehead widths will tend to also have
larger than average chin widths, and so on.
Figure 3: Scatterplot matrix (titled "Swiss Head Dimension Data") of each pair of the
variables MFB, BAM, TFH, LGAN, LTN and LTG in the Swiss head data. Note
that most pairs of variables are positively correlated.

The sample correlations, typically denoted by $r$, are the sample counterpart to the
population correlation. For instance,

$$r_{12} = \frac{s_{12}}{s_1 s_2}. \qquad (3)$$

We can collect the sample correlations together into a correlation matrix, denoted
by $\mathbf{R}$, where the $ij$th element of the matrix is $r_{ij}$, the sample correlation between
the $i$th and the $j$th variables. Note that the correlation of a random variable
with itself is always 1 (the same goes for sample correlations). Therefore, correlation
matrices always have ones down the diagonal. For the Swiss head dimension data,
the sample correlation matrix is

$$\mathbf{R} = \begin{pmatrix}
1.0000 & 0.4662 & 0.1749 & 0.1338 & 0.4021 & 0.4137 \\
0.4662 & 1.0000 & 0.0930 & 0.0933 & 0.3482 & 0.3884 \\
0.1749 & 0.0930 & 1.0000 & 0.4135 & 0.2590 & 0.2381 \\
0.1338 & 0.0933 & 0.4135 & 1.0000 & 0.1763 & 0.2095 \\
0.4021 & 0.3482 & 0.2590 & 0.1763 & 1.0000 & 0.6564 \\
0.4137 & 0.3884 & 0.2381 & 0.2095 & 0.6564 & 1.0000
\end{pmatrix}.$$
Note that the highest correlation, $r_{56} = 0.6564$, is between LTN and LTG, the distance
from the top of the nose to the ear and the distance from the bottom of the chin to the
ear. The correlation between chin width (BAM) and facial height (TFH) is relatively
quite small ($r_{23} = 0.0930$). The correlation between chin width (BAM) and the length
from the glabella to the apex nasi (LGAN) is also relatively quite small ($r_{24} = 0.0933$). Looking
at Figure 3, one can see a weak association between BAM and TFH and a strong
association between LTN and LTG.
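
Formula (3) can be applied to every pair of variables at once. The Matlab sketch below starts from the sample covariance matrix of the Swiss head data as printed in this chapter and rescales it by the sample standard deviations to obtain the correlation matrix; the variable names S, s and R are illustrative choices.

```matlab
% Sample covariance matrix of the six Swiss head measurements (copied from this chapter)
S = [26.9012 12.6229  5.3834  2.9313  8.1767 12.1073;
     12.6229 27.2522  2.8805  2.0575  7.1255 11.4412;
      5.3834  2.8805 35.2300 10.3692  6.0275  7.9725;
      2.9313  2.0575 10.3692 17.8453  2.9194  4.9936;
      8.1767  7.1255  6.0275  2.9194 15.3702 14.5213;
     12.1073 11.4412  7.9725  4.9936 14.5213 31.8369];

s = sqrt(diag(S));            % sample standard deviations s_1, ..., s_6 (column vector)
R = S ./ (s * s');            % r_ij = s_ij / (s_i * s_j), formula (3) applied entrywise

disp(round(R * 1e4) / 1e4);   % rounded to four decimals for comparison with the printed R
```

Because the printed covariances are themselves rounded to four decimals, the last digit of a computed correlation may occasionally differ from the printed correlation matrix.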

4.4 The Multivariate Normal Density Function.

Recall that a normal random variable $Y$ with mean $\mu$ and variance $\sigma^2$ has a probability
density function (pdf) of

$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y - \mu)^2 \right\},$$

for $-\infty < y < \infty$. It is easy to generalize this univariate pdf to a multivariate
normal pdf. Let $\mathbf{Y} = (Y_1, Y_2, \dots, Y_p)'$ denote a multivariate normal random vector with
mean vector $\boldsymbol{\mu} = (\mu_1, \mu_2, \dots, \mu_p)'$ and covariance matrix $\boldsymbol{\Sigma}$. To obtain the multivariate
density function, we replace $(y - \mu)^2/\sigma^2$ in the exponent by

$$(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu}),$$

and we replace the $1/\sigma$ scalar by the determinant of $\boldsymbol{\Sigma}$ raised to the $-1/2$ power:
$|\boldsymbol{\Sigma}|^{-1/2}$. The $p$-dimensional normal pdf can be written

$$f(y_1, y_2, \dots, y_p) = \left( \frac{1}{2\pi} \right)^{p/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left\{ -\frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu}) \right\}, \qquad (4)$$

for $\mathbf{y} \in \mathbb{R}^p$.
It is informative to note that if we set the expression $(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})$ in the
exponent of the multivariate normal density equal to a constant, the resulting equation
describes an ellipsoid in $p$-dimensional space centered at the mean $\boldsymbol{\mu}$. These ellipsoid
patterns are used for forming multivariate confidence regions and multivariate critical
regions for hypothesis testing.

Figure 4: The bivariate normal pdf (4) for the Swiss head dimension variables LTN
and LTG.
Introductory textbooks typically refrain from using matrix notation when expressing
the multivariate normal pdf given in (4). However, (4) is fairly easy to write down
in matrix notation compared to what one would get writing it down without matrix
notation. For example, for $p = 2$ dimensions, we can write out the bivariate normal
pdf as

$$f(y_1, y_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left( \frac{y_1 - \mu_1}{\sigma_1} \right)^2 - 2\rho \left( \frac{y_1 - \mu_1}{\sigma_1} \right) \left( \frac{y_2 - \mu_2}{\sigma_2} \right) + \left( \frac{y_2 - \mu_2}{\sigma_2} \right)^2 \right] \right\},$$

for $-\infty < y_1 < \infty$ and $-\infty < y_2 < \infty$. This looks quite complicated. The expression
for $p = 3$ or more dimensions becomes even more of a mess to write out without
matrix notation, but (4) stays the same regardless of the dimension.
To get an idea of what a multivariate normal pdf looks like, Figure 4 shows a bivariate
normal pdf for the LTN and LTG variables from the Swiss head dimension data. The
bivariate normal pdf looks like a mountain centered over the mean of the distribution.
In order to compute probabilities using the pdf, one needs to compute the volume
under the pdf surface corresponding to the region of interest.
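
To make the link between the matrix form (4) and the written-out bivariate formula concrete, the following Matlab sketch evaluates both at one point and compares them. It plugs in the sample means and the sample covariances of LTN and LTG reported in this chapter as if they were the population parameters (an assumption made for illustration), and the evaluation point y is a hypothetical choice.

```matlab
mu    = [122.2340; 138.8335];        % sample means of LTN and LTG from this chapter
Sigma = [15.3702 14.5213;
         14.5213 31.8369];           % LTN/LTG block of the sample covariance matrix
y     = [120; 140];                  % hypothetical point at which to evaluate the pdf

% Matrix form, equation (4)
p = numel(mu);
d = y - mu;
q = d' * (Sigma \ d);                % quadratic form (y - mu)' * inv(Sigma) * (y - mu)
f_matrix = (2*pi)^(-p/2) * det(Sigma)^(-1/2) * exp(-q/2);

% Written-out bivariate form
s1  = sqrt(Sigma(1,1));
s2  = sqrt(Sigma(2,2));
rho = Sigma(1,2) / (s1*s2);
z1  = d(1)/s1;
z2  = d(2)/s2;
f_scalar = exp(-(z1^2 - 2*rho*z1*z2 + z2^2) / (2*(1 - rho^2))) / ...
           (2*pi*s1*s2*sqrt(1 - rho^2));

fprintf('matrix form: %.6g, written-out form: %.6g\n', f_matrix, f_scalar);
```

The backslash operator solves the linear system instead of forming the inverse explicitly, which is the usual numerical practice in Matlab; the two printed densities should agree up to rounding.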

5 Confidence Regions
Confidence intervals were introduced for estimating a single parameter such as the
mean of a distribution. In the multivariate setting, we can similarly define confidence
regions for vectors of parameters, such as the mean vector $\boldsymbol{\mu}$. To illustrate matters, we
shall consider two of the Swiss head dimension variables: LTN and LTG. Since they
correspond to the 5th and 6th variables, the mean vector of interest is $\boldsymbol{\mu} = (\mu_5, \mu_6)'$.
There are two approaches. One method is to simply compute two univariate confidence
intervals separately for $\mu_5$ and $\mu_6$ and form the Cartesian product of the two
intervals to obtain a confidence rectangle. However, if we compute, say, 95% confidence
intervals for $\mu_5$ and $\mu_6$, then the joint confidence region (the rectangle) has a lower
confidence level. To understand why, consider an analogy: suppose there is a 5% chance I'll
get a speeding ticket on a given day when I drive to work. Then the probability I
get at least one ticket during the year is certainly higher than 5%. Similarly, if there
is a 5% probability that each random interval does not contain its respective mean,
then the probability that at least one of the intervals does not contain its respective
mean is higher than 5%. A simple (but not always efficient) fix to this problem is to
use what is known as the Bonferroni adjustment. If you want $p$ confidence intervals
for $p$ parameters with a confidence level of $1 - \alpha$ for the family of parameters, then you
can compute a confidence interval for each parameter separately using a confidence
level of $1 - \alpha/p$; this guarantees that the confidence level is at least $1 - \alpha$ for all $p$ intervals
considered jointly.
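
The arithmetic behind the Bonferroni adjustment is short enough to spell out. The Matlab sketch below uses the Bonferroni inequality (joint coverage is at least one minus the sum of the individual error rates) with the values alpha = 0.05 and p = 2 from the LTN and LTG example; the variable names are illustrative.

```matlab
alpha = 0.05;                           % error rate allowed for the family of intervals
p     = 2;                              % number of parameters (means of LTN and LTG)

% Unadjusted: each interval at level 1 - alpha. The Bonferroni inequality
% only guarantees joint coverage of at least 1 - p*alpha.
joint_lower_unadjusted = 1 - p*alpha;

% Bonferroni-adjusted: each interval at level 1 - alpha/p, which guarantees
% joint coverage of at least 1 - p*(alpha/p) = 1 - alpha.
level_each           = 1 - alpha/p;
joint_lower_adjusted = 1 - p*(alpha/p);

fprintf('unadjusted intervals: joint level at least %.3f\n', joint_lower_unadjusted);
fprintf('adjusted intervals: each at level %.3f, joint level at least %.3f\n', ...
        level_each, joint_lower_adjusted);
```

These are lower bounds; the actual joint coverage depends on how the estimators are correlated, which is why the adjustment can be inefficient.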
A more efficient approach for estimating a mean vector is to incorporate the correlations
between the estimated parameters. For multivariate normal data, the resulting
confidence regions have ellipsoidal shapes. Instead of determining a random interval
that covers a mean $\mu$ with high probability, we want to determine a region that covers
the vector $\boldsymbol{\mu}$ with high probability.
The solution to this problem requires introducing another probability distribution
known as the $F$-distribution, which results when we look at statistics formed by ratios
of variance estimates. The $F$-distribution is used extensively in analysis of variance
(ANOVA) applications where we want to compare several means. Because an $F$
random variable is defined in terms of a ratio of variance estimators, and variance
estimators depend on degrees of freedom, the $F$-distribution is specified by a
numerator and a denominator degrees of freedom. The $F$-distribution is skewed to the
right and takes only positive values. Critical values for the $F$-distribution can be
found beginning on page 202 in the Appendix. Let $F_{p,n-p}(\alpha)$ denote the critical
value of an $F$-distribution on $p$ numerator degrees of freedom and $n - p$ denominator
degrees of freedom.
Returning to the confidence region problem, one can show (e.g., Johnson and Wichern,
1998, page 179) that for a sample of size $n$ from a $p$-dimensional normal distribution

$$P\left( n(\bar{\mathbf{Y}} - \boldsymbol{\mu})' \mathbf{S}^{-1} (\bar{\mathbf{Y}} - \boldsymbol{\mu}) \le \frac{(n - 1)p}{(n - p)} F_{p,n-p}(\alpha) \right) = 1 - \alpha.$$

This statement shows that a $(1 - \alpha)100\%$ confidence region for the mean $\boldsymbol{\mu}$ of a
$p$-dimensional normal distribution is given by the set of $\boldsymbol{\mu} \in \mathbb{R}^p$ that satisfy the
inequality:

$$n(\bar{\mathbf{y}} - \boldsymbol{\mu})' \mathbf{S}^{-1} (\bar{\mathbf{y}} - \boldsymbol{\mu}) \le \frac{(n - 1)p}{(n - p)} F_{p,n-p}(\alpha).$$

The inequality defines a $p$-dimensional ellipsoid centered at $\bar{\mathbf{y}}$. To determine if a
hypothesized value of $\boldsymbol{\mu}$ lies in this region, simply plug it into the expression and see
if the inequality is satisfied or not. Figure 5 shows a 95% confidence ellipse for the
Swiss head data for variables $Y_5$ = LTN and $Y_6$ = LTG.

Figure 5: A 95% confidence ellipse for the Swiss head dimension data using only the
variables LTN and LTG.
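
The plug-in check described above can be sketched in Matlab as follows. The sample size n = 200, the sample means and the sample covariances for LTN and LTG are taken from this chapter; the hypothesized mean mu0 is a made-up value for illustration. The sketch assumes that the critical value $F_{p,n-p}(\alpha)$ is the upper-$\alpha$ point of the $F$-distribution, obtained here with finv, which requires the Statistics Toolbox.

```matlab
n     = 200;                           % number of soldiers in the Swiss head data
p     = 2;                             % two variables: LTN and LTG
ybar  = [122.2340; 138.8335];          % sample means of LTN and LTG
S     = [15.3702 14.5213;
         14.5213 31.8369];             % LTN/LTG block of the sample covariance matrix
mu0   = [122; 139];                    % hypothetical mean vector to check
alpha = 0.05;

d     = ybar - mu0;
T2    = n * (d' * (S \ d));            % left-hand side of the inequality
Fcrit = finv(1 - alpha, p, n - p);     % upper-alpha critical value (Statistics Toolbox)
bound = (n - 1)*p / (n - p) * Fcrit;   % right-hand side of the inequality

inRegion = (T2 <= bound);              % true if mu0 lies inside the 95% confidence ellipse
fprintf('n*d''*inv(S)*d = %.4f, bound = %.4f, inside region: %d\n', T2, bound, inRegion);
```

Changing mu0 and rerunning shows how quickly points leave the region along the short axis of the ellipse compared with the long axis.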

Multivariate statistics is a broad field of statistics and we have only introduced some
of the most basic ideas. Additional topics in multivariate analysis (such as principal
component analysis, discriminant analysis, cluster analysis, canonical correlations,
MANOVA) take the correlations between variables into consideration to solve various
problems.

Problems

1. Data on felled black cherry trees was collected (Ryan et al., 1976). The measured
variables were the diameter (in inches, measured 4.5 feet above the ground),
the height (measured in feet) and the volume (measured in cubic feet). The full
data set appears in the following table:

x1 x2 x3
Diameter Height Volume $(x_{i1} - \bar{x}_1)$ $(x_{i2} - \bar{x}_2)$ $(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$
11.0 75 18.2
11.1 80 22.6
11.2 75 19.9
11.3 79 24.2
11.4 76 21.0
11.4 76 21.4
11.7 69 21.3
12.0 75 19.1
12.9 74 22.2
12.9 85 33.8
13.3 86 27.4
13.7 71 25.7
13.8 64 24.9
14.0 78 34.5
14.2 80 31.7
14.5 74 36.3
16.0 72 38.3
16.3 77 42.6
17.3 81 55.4
17.5 82 55.7
17.9 80 58.3
18.0 80 51.5
18.0 80 51.0
20.6 87 77.0

a) Before analyzing the data, do you expect the correlations between these
three variables to be negative, positive or zero? A scatterplot matrix of
the data is plotted in Figure 6.

b) Instead of carrying out the computation of the covariance, which is rather
tedious, we shall attempt to get a feel for the covariance between x1, the
diameter, and x2, the height. The sample means for the three variables are
$\bar{y}_1 = 13.248$, $\bar{y}_2 = 76.00$, and $\bar{y}_3 = 30.17$. In the table above, put a + in the
column $(x_{i1} - \bar{x}_1)$ if the $i$th diameter is higher than the average diameter
and put a − if the $i$th diameter is lower than the mean value. Do the
same thing for the heights in the column labelled $(x_{i2} - \bar{x}_2)$. If both these
differences are positive, or they are both negative, put a + in the column
labelled $(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$ and a − otherwise. To illustrate, here is how
to do this for the first row:

x1 x2 x3
Diameter Height Volume $(x_{i1} - \bar{x}_1)$ $(x_{i2} - \bar{x}_2)$ $(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$
11.0 75 18.2 − − +

The sample covariance is basically the average of the products $(x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$.
From the list of + and − signs, does it appear the covariance will be positive
or negative?
c) The sample covariance matrix for the entire data set is given by

$$\mathbf{S} = \begin{pmatrix} 9.85 & 10.38 & 49.89 \\ 10.38 & 40.60 & 62.66 \\ 49.89 & 62.66 & 270.20 \end{pmatrix}.$$

Compute the sample correlation matrix (using (3)) from the covariance
matrix.
d) The purpose of this study was to predict the volume of wood of the tree
using the diameter and/or height. If you had to choose one of the variables
(height or diameter) for predicting the volume of the tree, which would you
choose from a purely statistical point of view? (Note that for trees that have
not been cut, it would be much more difficult to measure the height than
the diameter.) What was the basis for your choice?
e) If we convert the diameter measurements from units of inches to feet,
then we would need to divide each diameter measurement by 12. Let
$x_{i1} = y_{i1}/12$ denote the diameter measurements in units of feet. Compute
the sample variance of the $x_{i1}$ measurements. Also, compute the sample
correlation between the diameter (in feet) and the height of the cherry
trees.

Appendix: Matrix Algebra

In this appendix we give a brief review of some of the basics of matrix algebra. A
matrix is simply an array of numbers. Let $n$ denote the number of rows and $p$ denote
the number of columns in an array. Matrices are denoted by boldface letters. For
example, let $\mathbf{A}$ denote a matrix with $n = 3$ rows and $p = 2$ columns. Then we say
that $\mathbf{A}$ is an $n \times p$ matrix; in this case, $\mathbf{A}$ is a $3 \times 2$ matrix. A special case of
a matrix is a vector, which is simply a matrix with a single column (a column vector)
or a single row (a row vector). By convention, whenever we denote a vector, we shall
assume it is a column vector. One can regard an $n \times 1$ column vector as a point in
$n$-dimensional Euclidean space.
To illustrate matters, let $\mathbf{x}$ denote a $3 \times 1$ column vector and $\mathbf{A}$ denote a $3 \times 2$ matrix
defined as follows:

$$\mathbf{x} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \qquad \mathbf{A} = \begin{pmatrix} 2 & 3 \\ 4 & 5 \\ 1 & 6 \end{pmatrix}.$$

We can perform operations on vectors and matrices such as addition, subtraction, and
multiplication.
The transpose of a matrix is obtained by simply changing its columns into rows and is
denoted by a prime: $\mathbf{A}'$ is the transpose of $\mathbf{A}$. Thus,

$$\mathbf{x}' = (1, 2, 3)$$

and

$$\mathbf{A}' = \begin{pmatrix} 2 & 4 & 1 \\ 3 & 5 & 6 \end{pmatrix}.$$

Figure 6: Scatterplots (titled "Black Cherry Tree Data") of the black cherry tree data
for the variables Girth, Height and Volume. Here, Girth = diameter.

To multiply a matrix by a number (i.e. a scalar), one just multiplies each element of
the matrix by the scalar. For instance, if $c = 2$ then

$$c\mathbf{A} = 2 \begin{pmatrix} 2 & 3 \\ 4 & 5 \\ 1 & 6 \end{pmatrix} = \begin{pmatrix} 4 & 6 \\ 8 & 10 \\ 2 & 12 \end{pmatrix}.$$

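In order to add two matrices together, they must both be of the same dimensions, in
which case you just add the corresponding components together (or subtract if you
are subtracting matrices). We cannot add the vector $\mathbf{x}$ to the matrix $\mathbf{A}$ because they
are not of the same dimension. However, if

$$\mathbf{y} = \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix}$$

then

$$\mathbf{x} + \mathbf{y} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} + \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix} = \begin{pmatrix} 5 \\ 7 \\ 9 \end{pmatrix}.$$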

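Let

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{pmatrix},$$

then

$$\mathbf{A} + \mathbf{B} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{pmatrix} = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \\ a_{31} + b_{31} & a_{32} + b_{32} \end{pmatrix}.$$

Note that the $ij$th entry in the matrix $\mathbf{A}$ for the $i$th row and the $j$th column is denoted
$a_{ij}$. Thus, the first index specifies the row number and the second index specifies the
column number.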

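Matrix Multiplication. One needs to be a little careful when multiplying two
matrices together. First of all, matrix multiplication is not commutative, as we shall
see. If $\mathbf{A}$ and $\mathbf{B}$ are matrices and we want to form the product $\mathbf{AB}$, then the number
of columns of $\mathbf{A}$ must match the number of rows of $\mathbf{B}$. Suppose $\mathbf{A}$ has dimension
$n \times p$ and $\mathbf{B}$ has dimension $p \times q$; then the product $\mathbf{AB}$ will have dimension $n \times q$.
To illustrate, let us first compute the product of two vectors $\mathbf{a}$ and $\mathbf{b}$, say, where
$\mathbf{a} = (a_{11}, a_{12}, a_{13})$ and

$$\mathbf{b} = \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix}.$$

Since $\mathbf{a}$ is a $1 \times 3$ row vector and $\mathbf{b}$ is a $3 \times 1$ column vector, we can form the product
$\mathbf{ab}$ since the number of columns of $\mathbf{a}$ equals the number of rows of $\mathbf{b}$. The product
$\mathbf{ab}$ is defined as

$$\mathbf{ab} = (a_{11}, a_{12}, a_{13}) \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix} = a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31}.$$

Now consider the product of two matrices $\mathbf{A}$ and $\mathbf{B}$. Think of each row of $\mathbf{A}$ as a
row vector and each column of $\mathbf{B}$ as a column vector. The $ij$th element of the
product $\mathbf{AB}$ is defined to be the product of the $i$th row of $\mathbf{A}$ times the $j$th column
of $\mathbf{B}$. To illustrate, let $\mathbf{A}$ denote a $3 \times 2$ matrix and $\mathbf{B}$ denote a $2 \times 4$ matrix:

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & b_{13} & b_{14} \\ b_{21} & b_{22} & b_{23} & b_{24} \end{pmatrix},$$

then $\mathbf{AB}$ is a $3 \times 4$ matrix computed as

$$\mathbf{AB} = \begin{pmatrix}
a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} & a_{11}b_{13} + a_{12}b_{23} & a_{11}b_{14} + a_{12}b_{24} \\
a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} & a_{21}b_{13} + a_{22}b_{23} & a_{21}b_{14} + a_{22}b_{24} \\
a_{31}b_{11} + a_{32}b_{21} & a_{31}b_{12} + a_{32}b_{22} & a_{31}b_{13} + a_{32}b_{23} & a_{31}b_{14} + a_{32}b_{24}
\end{pmatrix}.$$

In this example, we cannot form the product $\mathbf{BA}$ since the number of columns of $\mathbf{B}$
does not match the number of rows of $\mathbf{A}$.
Consider again the multiplication of two vectors $\mathbf{a} = (a_{11}, a_{12}, a_{13})$ and

$$\mathbf{b} = \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix}.$$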
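We saw how to compute the product $\mathbf{ab}$. Note that we can also form the product
$\mathbf{ba}$ since $\mathbf{b}$ is a $3 \times 1$ column vector and $\mathbf{a}$ is a $1 \times 3$ row vector, i.e. the number of
columns of $\mathbf{b}$ matches the number of rows of $\mathbf{a}$, and the product will be a matrix of
dimension $3 \times 3$:

$$\mathbf{ba} = \begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \end{pmatrix} (a_{11}, a_{12}, a_{13}) = \begin{pmatrix} b_{11}a_{11} & b_{11}a_{12} & b_{11}a_{13} \\ b_{21}a_{11} & b_{21}a_{12} & b_{21}a_{13} \\ b_{31}a_{11} & b_{31}a_{12} & b_{31}a_{13} \end{pmatrix}.$$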
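
The dimension rules above are easy to try out in Matlab. In this sketch A is the 3-by-2 matrix from this appendix, a and b are the transpose of x and the vector y from this appendix, and B is a made-up 2-by-4 matrix chosen only so that the product AB is defined.

```matlab
A = [2 3; 4 5; 1 6];          % the 3-by-2 matrix A from this appendix
B = [1 0 2 1; 0 1 1 3];       % made-up 2-by-4 matrix (hypothetical values)

C = A * B;                    % defined: (3-by-2) times (2-by-4) gives 3-by-4
disp(size(C));

try
    D = B * A;                % not defined: B has 4 columns but A has 3 rows
catch err
    disp(err.message);        % Matlab reports the dimension mismatch
end

a = [1 2 3];                  % 1-by-3 row vector (x' from this appendix)
b = [4; 5; 6];                % 3-by-1 column vector (y from this appendix)
ab = a * b;                   % scalar: 1*4 + 2*5 + 3*6
ba = b * a;                   % 3-by-3 matrix
disp(ab);
disp(ba);
```

The products ab and ba do not even have the same dimensions, which is a simple way to see that matrix multiplication is not commutative.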

Here are a few definitions:


A square matrix is a matrix with the same number of rows as columns.
A diagonal matrix is a matrix with all zeros except along the main diagonal.
An important special case of a square diagonal matrix is the identity matrix, denoted
by $\mathbf{I}$. The identity matrix is a square matrix whose diagonal elements are ones and
whose off-diagonal elements are all zero. For instance, the $3 \times 3$ identity matrix is

$$\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
The reason we call this the identity matrix is that it acts as the multiplicative identity
element: for any matrix A, we have
AI = A
and
IA = A
provided the matrix multiplications are defined (one can easily verify these relations).
A symmetric matrix is any matrix $\mathbf{A}$ such that $\mathbf{A} = \mathbf{A}'$, that is, $\mathbf{A}$ is equal to its
transpose. For example

$$\mathbf{A} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$

is symmetric. Covariance matrices are always symmetric.
Two (column) vectors $\mathbf{a}$ and $\mathbf{b}$ of the same dimension are orthogonal if $\mathbf{a}'\mathbf{b} = 0$.
Geometrically speaking, if we think of a vector as an arrow extending from the origin
to the point represented by the vector, then orthogonal vectors are perpendicular to
each other. For example, if $\mathbf{a} = (1, 1)'$ and $\mathbf{b} = (1, -1)'$, then

$$\mathbf{a}'\mathbf{b} = (1, 1) \begin{pmatrix} 1 \\ -1 \end{pmatrix} = 1 \times 1 - 1 \times 1 = 0.$$

Figure 7 illustrates the geometric property of the orthogonal vectors.

Figure 7: Two orthogonal vectors in $\mathbb{R}^2$.

Inverses. For a scalar such as 5, its inverse is simply $1/5 = 5^{-1}$, and $5(1/5) = 1$. All
real numbers have an inverse except zero. Let $\mathbf{A}$ denote a square $p \times p$ matrix. The
inverse of $\mathbf{A}$, if it exists, is denoted $\mathbf{A}^{-1}$ and is the $p \times p$ matrix such that

$$\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I} \quad \text{(the identity matrix)}.$$

In order for a matrix $\mathbf{A}$ to have an inverse, its columns must be linearly independent,
which means that no column of $\mathbf{A}$ can be expressed as a linear combination of the
other columns of $\mathbf{A}$. Such matrices are called nonsingular. Thus a singular matrix does
not have an inverse. Finding the inverse of a matrix is somewhat tedious for higher
dimensional matrices. However, for a $2 \times 2$ matrix, there is a simple formula. If

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},$$

then

$$\mathbf{A}^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}.$$
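Inverse matrices are needed in order to define the multivariate normal density function
and to understand multivariate distance. The inverses of diagonal matrices are easy
to compute:

$$\begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{pp} \end{pmatrix}^{-1} = \begin{pmatrix} 1/a_{11} & 0 & \cdots & 0 \\ 0 & 1/a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/a_{pp} \end{pmatrix}.$$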

In order for the inverse of a diagonal matrix to exist, all the diagonal elements must
be nonzero. For higher dimensional non-diagonal matrices, computer software such
as Matlab can be used to compute inverses of matrices.
Another matrix operation needed for the multivariate normal density is the
determinant of a matrix, denoted by $|\mathbf{A}|$ (also denoted by $\det(\mathbf{A})$). The computation of
the determinant is rather tedious for square matrices of dimension higher than $3 \times 3$
and again, software such as Matlab can be used to compute determinants. In the case
of a $2 \times 2$ matrix $\mathbf{A}$ where

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},$$

the formula is quite simple:

$$|\mathbf{A}| = a_{11}a_{22} - a_{12}a_{21}.$$

Thus, if

$$\mathbf{A} = \begin{pmatrix} 3 & 4 \\ 2 & 5 \end{pmatrix},$$

then $|\mathbf{A}| = 3 \times 5 - 4 \times 2 = 15 - 8 = 7$. One way to think of the determinant of a
matrix is to look at the two column vectors of $\mathbf{A}$. If we plot these two vectors, they
form two edges of a parallelogram as seen in Figure 8. The determinant of $\mathbf{A}$ is the
(signed) area of the parallelogram. For higher dimensional matrices, the columns of
the matrix form the edges of a parallelepiped and the determinant is equal to the
(signed) volume of the parallelepiped.

Figure 8: The determinant is the area of the parallelogram formed from the column
vectors that make up the matrix.
The determinant of a singular matrix is zero. For instance, suppose

$$\mathbf{A} = \begin{pmatrix} 1 & 4 \\ 2 & 8 \end{pmatrix}.$$

Then, the second column of $\mathbf{A}$ is just 4 times the first column of $\mathbf{A}$ and therefore $\mathbf{A}$
is singular. The determinant of $\mathbf{A}$ is $|\mathbf{A}| = 8 - 8 = 0$. Since the second column of
$\mathbf{A}$ is 4 times the first column of $\mathbf{A}$, the two columns point in the same direction, the
parallelogram they form collapses onto a line, and hence its area is zero.
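
The $2 \times 2$ formulas for the determinant and the inverse can be checked directly in Matlab. The sketch below uses the two example matrices from this appendix, the nonsingular matrix with determinant 7 and the singular matrix whose columns are proportional (named B here only to keep the two apart); the other variable names are illustrative.

```matlab
A = [3 4; 2 5];                                  % nonsingular example from this appendix
detA = A(1,1)*A(2,2) - A(1,2)*A(2,1);            % 2-by-2 determinant formula
Ainv = (1/detA) * [ A(2,2) -A(1,2);
                   -A(2,1)  A(1,1)];             % 2-by-2 inverse formula
check = A * Ainv;                                % should equal the identity up to rounding

B = [1 4; 2 8];                                  % singular example from this appendix
detB = B(1,1)*B(2,2) - B(1,2)*B(2,1);            % columns are proportional

fprintf('|A| = %g (built-in det gives %g)\n', detA, det(A));
fprintf('largest deviation of A*Ainv from I: %g\n', max(max(abs(check - eye(2)))));
fprintf('determinant of the singular example: %g\n', detB);
```

For larger matrices the built-in functions det and inv (or the backslash operator for solving linear systems) take the place of the hand formulas.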
Note that in the formula for the multivariate normal distribution, we divide by the
square root of the determinant of the covariance matrix. If the determinant is zero, then the distribution
does not have a density. To understand what this means, consider a bivariate normal
random vector $(Y_1, Y_2)'$. If the covariance matrix has determinant zero, then that
means that $Y_2$ is a linear function of $Y_1$ and the two random variables are perfectly
correlated (i.e. correlation equal to $\pm 1$). In a scatterplot such as Figure 5, if $Y_1$
and $Y_2$ were perfectly correlated, then the points would lie exactly in a line and the
confidence ellipse would shrink to a line. The bivariate density assigns probability by
computing the volume under the density surface (as shown in Figure 4). However, if
the entire distribution is concentrated on a line in the plane, then the volume under
the density, if it existed, would be zero. In other words, the distribution is degenerate.

Problems

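1. Let

$$\mathbf{A} = \begin{pmatrix} 3 & 4 \\ 4 & 7 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}.$$

Find the following:
a) $\mathbf{A} + \mathbf{B}$.
b) $\mathbf{AB}$.
c) $\mathbf{BA}$. (Is $\mathbf{AB} = \mathbf{BA}$?)
d) $\mathbf{A}^{-1}$. Verify that your answer is correct by confirming that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$.
e) $|\mathbf{A}|$.
2. Let

$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \end{pmatrix}.$$

Find the following:
a) $\mathbf{X}'\mathbf{X}$.
b) $(\mathbf{X}'\mathbf{X})^{-1}$.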

References
Flury, B. (1997). A First Course in Multivariate Statistics. Springer, New York.
Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis.
Prentice Hall, New Jersey.
Ryan, T. A., Joiner, B. L., and Ryan, B. F. (1976). The Minitab Student Handbook.
Duxbury Press, California.