
Chapter 25

Multivariate Analysis

1 Introduction

In the preceding chapters only data analyses with a single outcome variable
(y-variable) have been addressed. Linear, logistic, and Cox regressions are
examples. When these methods include multiple predictor variables (otherwise
called exposure variables or x-variables), they are sometimes called multivariate
methods. This is not correct, because the term multivariate analysis refers to the
simultaneous analysis of more than one outcome variable. A more adequate term
for the analysis of multiple predictor variables is “multivariable analysis” or
“multiple variables analysis”.
In clinical research multiple outcome variables are often assessed. For
example, in a study of the efficacy of a novel laxative an important outcome may be
the frequency of stools. However, an improved quality of life score may be
considered another and maybe even more important outcome. In order to assess two
outcomes, two separate ANOVAs can simply be performed, but this approach does not
account or adjust for the possible relationship between the two outcomes. Also, the
type I error is inflated. A further advantage of multivariate analysis is that a weak
predictor may not be able to significantly predict a single outcome, but may
significantly predict two outcomes that point in the same direction.
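The inflation of the type I error mentioned above can be made concrete with a small calculation. The sketch below is only an illustration: it assumes independent tests and a nominal significance level of 0.05, neither of which is stated in the text, and the function name is chosen here for convenience. With correlated outcomes the inflation is smaller, but it does not vanish.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive among n_tests independent tests,
    each performed at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Assumed nominal level of 0.05; two tests correspond to two separate ANOVAs.
for n_tests in (1, 2, 5):
    print(n_tests, round(familywise_error(0.05, n_tests), 4))
```

For two separate ANOVAs the chance of at least one false positive finding is thus close to twice the nominal level.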
In order to assess the two outcome variables simultaneously, an analysis with
two outcome variables rather than a single one is an attractive option.
For continuous outcome variables both path analysis and MANOVA
(multiple ANOVA) are adequate; for binary outcome variables probit analysis is
adequate.
In the current chapter the following multivariate methods are reviewed:
1. path analysis (see also Chap. 20),
2. multiple analysis of variance (MANOVA),
3. probit regression modeling.

T.J. Cleophas and A.H. Zwinderman, Statistics Applied to Clinical Studies, 289
DOI 10.1007/978-94-007-2863-9_25, © Springer Science+Business Media B.V. 2012

2 Multivariate Regression Analysis Using Path Analysis

In a self-controlled study in 35 patients with constitutional constipation the outcome
variables were improvements of frequency of stools and quality of life scores. The
predictor variables were compliance with counselling and compliance with drug
treatment (Var = variable). The data file is given underneath.

Var 1 (y1) = improvement of frequency of stools
Var 2 (y2) = improved quality of life score
Var 3 (x1) = compliance with drug treatment
Var 4 (x2) = compliance with counselling

Var 1 (y1) Var 2 (y2) Var 3 (x1) Var 4 (x2)
24,00 69,00 25,00 8,00
30,00 110,00 30,00 13,00
25,00 78,00 25,00 15,00
35,00 103,00 31,00 14,00
39,00 103,00 36,00 9,00
30,00 102,00 33,00 10,00
27,00 76,00 22,00 8,00
14,00 75,00 18,00 5,00
39,00 99,00 14,00 13,00
42,00 107,00 30,00 15,00
41,00 112,00 36,00 11,00
38,00 99,00 30,00 11,00
39,00 86,00 27,00 12,00
37,00 107,00 38,00 10,00
47,00 108,00 40,00 15,00
30,00 95,00 31,00 13,00
36,00 88,00 25,00 12,00
12,00 67,00 24,00 4,00
26,00 112,00 27,00 10,00
20,00 87,00 20,00 8,00
43,00 115,00 35,00 16,00
31,00 93,00 29,00 15,00
40,00 92,00 32,00 14,00
31,00 78,00 30,00 7,00
36,00 112,00 40,00 12,00
21,00 69,00 31,00 6,00
44,00 66,00 41,00 19,00
11,00 75,00 26,00 5,00
27,00 85,00 24,00 8,00
24,00 87,00 30,00 9,00
40,00 89,00 20,00 15,00
32,00 89,00 31,00 7,00
10,00 65,00 29,00 6,00
37,00 121,00 43,00 14,00
19,00 74,00 30,00 7,00

Fig. 25.1 Decomposition of correlation between efficacy laxative and quality of life
score, all of the correlations in the above example were statistically significant at
p < 0.10. [Path diagram: compliance with counselling and compliance with drug
treatment (mutual correlation 0.38) point to frequency of stools and quality of life
score (path coefficients 0.72, 0.31, 0.18, and 0.32); residual efficacy (0.58) and
residual quality of life (0.83) are linked by 1.00.]

A pleasant thing about path analysis is that it can be used as an alternative
approach to multivariate regression, with a result similar to that of the more
complex mathematical approach. An example is given above. We assume that
compliance with counselling and compliance with drug treatment affect not only the
efficacy of the drug but also the patients’ quality of life. First, we have to check
that the relationship of each of the two predictors with the outcome quality of life
is significant in the usual linear regression model: both were, with p-values of 0.03
and 0.02. Then, a path diagram with standardized regression coefficients is
constructed (Fig. 25.1). The standardized regression coefficients of the residual
effects are obtained as $\sqrt{1 - r^2}$, where $r^2$ is the proportion of the
variance of the outcome explained by the predictors. The standardized regression
coefficient of one residual effect versus another can be assumed to equal 1.00.
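The residual coefficients can be checked with a few lines of code. The sketch below is an illustration only: the function names are chosen here, and the explained variances are back-calculated from the residual coefficients 0.58 and 0.83 of Fig. 25.1 rather than taken from the regression output, which is not given.

```python
import math


def residual_coefficient(r_squared: float) -> float:
    """Standardized coefficient of the residual effect, sqrt(1 - r^2)."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie between 0 and 1")
    return math.sqrt(1.0 - r_squared)


def implied_r_squared(residual: float) -> float:
    """Explained variance implied by a given residual coefficient."""
    return 1.0 - residual ** 2


# Residual coefficients as printed in Fig. 25.1.
for outcome, residual in (("frequency of stools", 0.58),
                          ("quality of life score", 0.83)):
    r_squared = implied_r_squared(residual)
    print(outcome, round(r_squared, 2), round(residual_coefficient(r_squared), 2))
```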
We now find the overall correlation between the two outcome variables as
follows:

1. Direct effect of counselling
   0.72 × 0.31 = 0.22
2. Direct effect of drug compliance
   0.32 × 0.18 = 0.06
3. Indirect effect of counselling and drug compliance
   0.72 × 0.38 × 0.32 + 0.18 × 0.38 × 0.31 = 0.11
4. Residual effects
   1.00 × 0.58 × 0.83 = 0.48 +
Total 0.85

A path statistic of 0.85 is considerably larger than that of the single outcome
model: 0.85 versus 0.45 (Chap. 20), so that the single outcome value is 47.1%
smaller. Obviously, two outcome variables make better use of the predictors in our
data than does a single one. An advantage of this nonmathematical approach to
multivariate regression is that it nicely summarizes all relationships in the model,
and it does so in a quantitative way (Fig. 25.1).
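The decomposition of the correlation between the two outcomes can be retraced as follows. This is a sketch under assumptions: the variable names are chosen here, and the assignment of the four path coefficients to the arrows is inferred from the worked products in the text. Note that the components as printed add up to about 0.87 rather than the reported total of 0.85, so the printed coefficients or the printed total may be affected by rounding.

```python
# Standardized coefficients as printed in Fig. 25.1; the assignment to the
# individual arrows is inferred from the worked decomposition in the text.
counselling_to_stools = 0.72
counselling_to_qol = 0.31
drug_to_qol = 0.32
drug_to_stools = 0.18
predictor_correlation = 0.38   # counselling versus drug compliance
residual_stools = 0.58
residual_qol = 0.83
residual_correlation = 1.00    # assumed to equal 1.00, as in the text

components = {
    "direct effect of counselling": counselling_to_stools * counselling_to_qol,
    "direct effect of drug compliance": drug_to_qol * drug_to_stools,
    "indirect effect": (counselling_to_stools * predictor_correlation * drug_to_qol
                        + drug_to_stools * predictor_correlation * counselling_to_qol),
    "residual effects": residual_correlation * residual_stools * residual_qol,
}

for name, value in components.items():
    print(f"{name}: {value:.2f}")

total = sum(components.values())
print(f"total: {total:.2f}")
```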

3 Multiple Analysis of Variance, First Example

The example from the above section will be used once more. We will first assess
whether compliance with counselling is a significant predictor of both improvement
of frequency of stools and improved quality of life. We will use SPSS 17.0
(SPSS 2011) for data analysis. Command: Analyze…General Linear Model…
Multivariate…in dialog box multivariate transfer y1 and y2 to dependent variables
and the counselling variable (Var 4, listed as VAR00004 in the output) to the fixed
factors, and…ok.
Table 25.1 shows that even MANOVA can be considered a regression model
with intercepts and regression coefficients. Just like ANOVA it assumes normal
distributions and homogeneity of variances. SPSS has checked the assumptions,
and the results as given indicate that the model is adequate for the data. Generally
Pillai’s method is the most robust, and Roy’s gives the smallest p-values. We can
conclude that counselling is a strong predictor of both improvement of stools and
improved quality of life.
In order to find out which of the two outcomes is the more important one, two
ANOVAs, one for each of the outcomes separately, must be performed. We command:
Analyze…General Linear Model…Univariate…in dialog box univariate transfer y1 to
dependent variables and the counselling variable (VAR00004) to the fixed factors,
and…ok. Do the same for variable y2.
Compliance with counselling is an important predictor of not only improved
frequency of stools but also improved quality of life (Table 25.2).
In order to find out whether compliance with drug treatment is a contributory
predicting factor in this multivariate model, a MANOVA with two predictors and two
outcomes is performed. Instead of the counselling variable alone, both predictor
variables (x1 and x2) are transferred to fixed factors. Table 25.3 shows the results.
Table 25.3 shows that after including a second predictor variable the
MANOVA is not significant anymore. Probably, the second predictor is a
confounder of the first one. The analysis of this model stops here.

Table 25.1 MANOVA test statistics of the above data. All of the test statistics show that compli-
ance with counselling is a strong predictor of both improvement of frequency of stools and
improved quality of life
Multivariate testsa
Effect Value F Hypothesis df Error df Sig.
Intercept Pillai’s trace ,992 1185,131b 2,000 19,000 ,000
Wilks’ lambda ,008 1185,131b 2,000 19,000 ,000
Hotelling’s trace 124,751 1185,131b 2,000 19,000 ,000
Roy’s largest root 124,751 1185,131b 2,000 19,000 ,000
VAR00004 Pillai’s trace 1,426 3,547 28,000 40,000 ,000
Wilks’ lambda ,067 3,894b 28,000 38,000 ,000
Hotelling’s trace 6,598 4,242 28,000 36,000 ,000
Roy’s largest root 5,172 7,389c 14,000 20,000 ,000
a Design: Intercept + VAR00004
b Exact statistic
c The statistic is an upper bound on F that yields a lower bound on the significance level

Table 25.2 ANOVAs of the data from Table 25.1. Also in ANOVA compliance with counselling
is a strong predictor of not only improvement of frequency of stools but also improved quality
of life
Tests of between-subjects effects
Dependent variable: improv freq stool
Source Type III sum of squares df Mean square F Sig.
Corrected model 2733,005a 14 195,215 6,033 ,000
Intercept 26985,054 1 26985,054 833,944 ,000
VAR00004 2733,005 14 195,215 6,033 ,000
Error 647,167 20 32,358
Total 36521,000 35
Corrected total 3380,171 34
Tests of between-subjects effects
Dependent variable: improv qol
Source Type III sum of squares df Mean square F Sig.
Corrected model 6833,671b 14 488,119 4,875 ,001
Intercept 223864,364 1 223864,364 2235,849 ,000
VAR00004 6833,671 14 488,119 4,875 ,001
Error 2002,500 20 100,125
Total 300129,000 35
Corrected total 8836,171 34
improv freq stool = improvement of frequency of stools, improv qol = improved quality
of life scores
a R Squared = ,809 (Adjusted R Squared = ,675)
b R Squared = ,733 (Adjusted R Squared = ,615)

Table 25.3 MANOVA of the above data with two predictor (x1 and x2) and two outcome variables
(y1 and y2)
Multivariate testsa
Effect Value F Hypothesis df Error df Sig.
Intercept Pillai’s trace 1,000 29052,980b 1,000 1,000 ,004
Wilks’ lambda ,000 29052,980b 1,000 1,000 ,004
Hotelling’s trace 29052,980 29052,980b 1,000 1,000 ,004
Roy’s largest root 29052,980 29052,980b 1,000 1,000 ,004
VAR00004 Pillai’s trace ,996 27,121b 10,000 1,000 ,148
Wilks’ lambda ,004 27,121b 10,000 1,000 ,148
Hotelling’s trace 271,209 27,121b 10,000 1,000 ,148
Roy’s largest root 271,209 27,121b 10,000 1,000 ,148
VAR00003 Pillai’s trace ,995 13,514b 14,000 1,000 ,210
Wilks’ lambda ,005 13,514b 14,000 1,000 ,210
Hotelling’s trace 189,198 13,514b 14,000 1,000 ,210
Roy’s largest root 189,198 13,514b 14,000 1,000 ,210
VAR00004 * VAR00003 Pillai’s trace ,985 12,884b 5,000 1,000 ,208
Wilks’ lambda ,015 12,884b 5,000 1,000 ,208
Hotelling’s trace 64,418 12,884b 5,000 1,000 ,208
Roy’s largest root 64,418 12,884b 5,000 1,000 ,208
a Design: Intercept + VAR00004 + VAR00003 + VAR00004 * VAR00003
b Exact statistic

4 Multiple Analysis of Variance, Second Example

As a second example we use the data from Field (2005) on the effect of three
treatment modalities on compulsive behaviour disorder estimated by two scores, a
thought-score and an action-score (Var = variable).

Var 1 (y1) Var 2 (x) Var 3 (y2)
Action Treatment Thought
5,00 1,00 14,00
5,00 1,00 11,00
4,00 1,00 16,00
4,00 1,00 13,00
5,00 1,00 12,00
3,00 1,00 14,00
7,00 1,00 12,00
6,00 1,00 15,00
6,00 1,00 16,00
4,00 2,00 14,00
4,00 2,00 15,00
1,00 2,00 13,00
1,00 2,00 14,00
4,00 2,00 15,00
6,00 2,00 19,00
5,00 2,00 13,00
5,00 2,00 18,00
2,00 2,00 14,00
5,00 2,00 17,00
4,00 3,00 13,00
5,00 3,00 15,00
5,00 3,00 14,00
4,00 3,00 14,00
6,00 3,00 13,00
4,00 3,00 20,00
7,00 3,00 13,00
4,00 3,00 16,00
6,00 3,00 14,00
5,00 3,00 18,00

Command: Analyze…General Linear Model…Multivariate…in dialog box
multivariate transfer y1 and y2 to dependent variables and x to the fixed factors,
and…ok.
The Pillai test shows that the predictor (treatment modality) has a significant
effect on both thoughts and actions at p = 0.049. Roy’s test, being less robust,
gives an even smaller p-value of 0.020 (Table 25.4).
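To show what lies behind the multivariate test statistics, the sketch below computes Pillai’s trace and Wilks’ lambda directly from the between-groups (H) and within-groups (E) sums of squares and cross-products. It is a minimal illustration in plain Python, not the SPSS procedure, and all function and variable names are chosen here. The data listing above contains 29 rows, whereas the degrees of freedom in Tables 25.4 and 25.5 imply 30 observations, so the values obtained from the listing as printed need not reproduce the tables exactly.

```python
from statistics import mean

# (action, thought) scores per treatment modality, as listed in the data file.
groups = {
    1: [(5, 14), (5, 11), (4, 16), (4, 13), (5, 12), (3, 14), (7, 12), (6, 15),
        (6, 16)],
    2: [(4, 14), (4, 15), (1, 13), (1, 14), (4, 15), (6, 19), (5, 13), (5, 18),
        (2, 14), (5, 17)],
    3: [(4, 13), (5, 15), (5, 14), (4, 14), (6, 13), (4, 20), (7, 13), (4, 16),
        (6, 14), (5, 18)],
}


def sscp(rows, centre):
    """2 x 2 matrix of sums of squares and cross-products around centre."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for row in rows:
        d = (row[0] - centre[0], row[1] - centre[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inverse(m):
    d = det(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


all_rows = [row for rows in groups.values() for row in rows]
grand = (mean(r[0] for r in all_rows), mean(r[1] for r in all_rows))

E = [[0.0, 0.0], [0.0, 0.0]]  # within-groups (error) SSCP
H = [[0.0, 0.0], [0.0, 0.0]]  # between-groups (hypothesis) SSCP
for rows in groups.values():
    centre = (mean(r[0] for r in rows), mean(r[1] for r in rows))
    E = add(E, sscp(rows, centre))
    # n copies of the group mean around the grand mean give n * outer product.
    H = add(H, sscp([centre] * len(rows), grand))

T = add(H, E)                              # total SSCP
pillai = sum(matmul(H, inverse(T))[i][i] for i in range(2))
wilks = det(E) / det(T)
print(round(pillai, 3), round(wilks, 3))
```

Pillai’s trace sums the proportions of variance explained along the discriminant dimensions, while Wilks’ lambda is the proportion of generalized variance left unexplained, so a large Pillai’s trace and a small Wilks’ lambda both point to a group effect.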
We will use ANOVAs to find out which of the two outcomes is more
important.

Table 25.4 MANOVA test statistics of the above data. The Pillai test shows that the predictor
(treatment modality) has a significant effect on both thoughts and actions at p = 0.049. Roy’s test
being less robust gives an even better p-value of 0.020
Multivariate testsa
Effect Value F Hypothesis df Error df Sig.
Intercept Pillai’s trace ,983 745,230b 2,000 26,000 ,000
Wilks’ lambda ,017 745,230b 2,000 26,000 ,000
Hotelling’s trace 57,325 745,230b 2,000 26,000 ,000
Roy’s largest root 57,325 745,230b 2,000 26,000 ,000
VAR00002 Pillai’s trace ,318 2,557 4,000 54,000 ,049
Wilks’ lambda ,699 2,555b 4,000 52,000 ,050
Hotelling’s trace ,407 2,546 4,000 50,000 ,051
Roy’s largest root ,335 4,520c 2,000 27,000 ,020
a Design: Intercept + VAR00002
b Exact statistic
c The statistic is an upper bound on F that yields a lower bound on the significance level

Table 25.5 ANOVAs of the data from Table 25.4. In the ANOVAs neither actions nor
thoughts are significant outcomes of the predictor treatment modality
ANOVAb
Model Sum of squares df Mean square F Sig.
1 Regression ,050 1 ,050 ,023 ,881a
Residual 61,417 28 2,193
Total 61,467 29
a Predictors: (Constant), cog/beh/notreat
b Dependent Variable: actions

ANOVAb
Model Sum of squares df Mean square F Sig.
1 Regression 12,800 1 12,800 2,785 ,106a
Residual 128,667 28 4,595
Total 141,467 29
a Predictors: (Constant), cog/beh/notreat
b Dependent Variable: thoughts

Command: Analyze…General Linear Model…Univariate…in dialog box univariate
transfer y1 to dependent variables and x to the fixed factors, and…ok. Do the same
for variable y2.
Table 25.5 shows that in the ANOVAs neither thoughts nor actions are significant
outcomes of treatment modality anymore. This would mean that the treatment
modality is a rather weak predictor of either of the outcomes, and that it is not able
to significantly predict a single outcome, but that it significantly predicts two
outcomes pointing in a similar direction.

5 Multivariate Probit Regression

For univariate analyses with binary outcome variables logistic regression is
adequate. A problem with logistic regression with multiple outcome variables is
that the iterations (the computer procedure for finding the largest log likelihood
(see Chap. 4) for fitting the data) often do not converge, i.e., a best log
likelihood is not established. This is due to insufficient data size, inadequate
data, or non-quadratic data patterns. A better alternative for that purpose is probit
modeling. The dependent variable of logistic regression (the log odds of responding)
is closely related to the probit (the probit is the z-value of the standard normal
distribution that corresponds to a given area under the curve).
It can be shown that log odds of responding = logit ≈ $(\pi/\sqrt{3})$ × probit.
Multivariate probit analysis is not available in SPSS, and we will use the
statistical software program Stata (STATA 10) (Stata 2011). An example is given of
the effect of the physicians’ age (x) on their inclination to prescribe two life
style treatments: (1) non-smoking advice (y; 0 = no, 1 = yes) and (2) weight
reduction advice (z; 0 = no, 1 = yes) (Var = variable).

Var (x) Var (y) Var (z)
42.7 0 0
47.6 0 0
36.4 0 0
49.0 0 0
49.0 0 1
55.3 0 1
57.4 0 1
63.0 0 1
27.3 0 1
53.2 1 0
54.6 1 0
32.9 1 0

We can quickly input the data with the commands:
Input x y z…input values…end…List…Statistics…binary outcomes…Bivariate
probit regression…dependent variable 1 y…dependent variable 2 z…independent
variables x…ok.
Table 25.6 shows that physicians’ age is a significant predictor of prescribing
both non-smoking and weight reduction advice. In order to find out which is the
more significant outcome, simple logistic regression can be performed using
physicians’ age as predictor and the non-drug treatments as separate outcomes
(see Chap. 17).
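The approximate relationship between logit and probit quoted earlier in this section can be checked numerically. The sketch below is an illustration with function names chosen here; the scaling factor π/√3 follows from equating the standard deviation of the standard logistic distribution to that of the standard normal distribution, and the approximation is rough rather than exact.

```python
import math
from statistics import NormalDist

SCALE = math.pi / math.sqrt(3)  # about 1.81


def logit(p: float) -> float:
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


def probit(p: float) -> float:
    """z-value whose cumulative standard normal probability equals p."""
    return NormalDist().inv_cdf(p)


# Compare the logit with the rescaled probit over a range of probabilities.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(logit(p), 3), round(SCALE * probit(p), 3))
```

Both functions are zero at p = 0.5 and antisymmetric around it, so the two columns agree in sign everywhere and differ only moderately in size.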

Table 25.6 According to the analysis underneath, the probit regression shows that
physicians’ age is indeed a significant predictor of prescribing both non-smoking
and weight reduction advice
STATA
Probit var 3 var 2 var 1
Fitting comparison equation 1:
Iteration 0: log likelihood = −8.3177662
Fitting comparison equation 2:
Iteration 0: log likelihood = −8.3177662
Comparison: log likelihood = −16.635532
Fitting full model:
Iteration 0: log likelihood = −16.635532
Iteration 1: log likelihood = −15.9573
Iteration 2: log likelihood = −15.955936
Iteration 3: log likelihood = −15.955936
Bivariate probit regression number of observations = 12
Wald chi2 (2 ) = 0.00
Log likelihood = −15.955936 Prob > chi2 = 1.0000
2 log likelihood ratio = 31.911872
P < 0.001

6 Discussion

A number of advantages of multivariate analysis instead of multiple univariate
analyses are summarized:
1. It prevents the type I error from being inflated.
2. It looks at interactions between dependent variables.
3. It can detect subgroup properties and includes them in the analysis.
4. It can demonstrate otherwise underpowered effects.
Multivariate analysis should not be used for explorative purposes and data dredg-
ing, but should be based on sound clinical arguments. In this chapter three methods
are reviewed and explained with examples.
A pleasant thing about path analysis (see also Chap. 20) is that it can be used as
a nonmathematical approach to multivariate regression. We should emphasize once
more that the term multivariate regression is often erroneously applied when
multiple independent variables and just a single dependent variable are in the data.
Strictly, multivariate regression regards models with more than a single dependent
variable (y-variable). The main aim is to quantify reasons for the correlation
between two or more dependent variables. In the example given the multivariate
model of our data with two outcome variables instead of one made even better use of
the predictors than did the single outcome model.

If you read articles you will find that it is not uncommon for researchers to
perform multiple ANOVAs instead of a single MANOVA. There are problems with
this approach. First, you have to perform multiple tests, which means that the risk
of type I errors is increased (see also Chaps. 8 and 9). In the first MANOVA
example it would therefore be appropriate to perform a Bonferroni adjustment of the
two ANOVAs, meaning that the p-values should be doubled. Another problem is that of
weak outcomes. The second MANOVA example is an example of weak outcomes:
the MANOVA was statistically significant, while the ANOVAs of the same data
were not. Another method for post hoc analysis of a positive MANOVA is so-called
discriminant analysis, which uses normally distributed eigenvectors to assess the
correlation of the outcome variables, displayed as ellipses in scatterplots. The
advantage of this method, which is readily provided by SPSS, is that it does not need
Bonferroni adjustment and gives somewhat more quantitative results about
underlying mechanisms than ANOVA does.
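The Bonferroni adjustment mentioned above amounts to multiplying each p-value by the number of tests, with a maximum of 1. A minimal sketch follows; the function name and the p-values are hypothetical and are not taken from the tables of this chapter.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical unadjusted p-values of two post hoc ANOVAs.
unadjusted = [0.001, 0.030]
print(bonferroni(unadjusted))
```

With two post hoc ANOVAs the adjustment simply doubles each p-value, as stated in the text.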
Multivariate probit regression is a safer alternative to multivariate logistic
regression, and it is available in Stata and other software programs. In case of a
significant multivariate probit regression, post hoc analysis can be performed in the
usual way with binary logistic models to find out which of the outcomes is more
important.

7 Conclusions

The term multivariate analysis refers to the simultaneous analysis of more than one
outcome variable.
This chapter reviews multivariate methods suitable to analyze data files with
multiple outcome variables.
For the analysis of continuous outcome variables path analysis and multiple
analysis of variance (MANOVA) are suitable, for the analysis of binary outcome
variables probit analysis is recommended.
We conclude that multivariate methods have multiple advantages compared to
univariate methods.
1. They prevent the type I error from being inflated.
2. They look at interactions between dependent variables.
3. They can detect subgroup properties and include them in the analysis.
4. They can demonstrate otherwise underpowered effects.
Multivariate analysis should not be used for explorative purposes and data dredg-
ing, but should be based on sound clinical arguments. In this chapter three methods
are reviewed and explained with examples.

References

Field A (2005) Discovering statistics using SPSS, 2nd edn. Sage Publications, London, pp 571–618
SPSS. www.spss.com. Accessed 15 Dec 2011
Stata. www.stata.com. Accessed 15 Dec 2011