
Canonical Correlation Analysis

Ryan R. Cuenca BS Statistics-IV


Lovelle Belaca-ol BS Statistics-IV
Fritzie A. Cacas BS Statistics-IV

I. Introduction

Canonical correlation analysis seeks to develop a weighted linear composite for each
variate. Here a variate is a set of variables, either the set of dependent variables or the set of
independent variables. The weighted linear composites (linear combinations) are chosen to
maximize the overlap between the distributions of the two sets. If we think of each individual
variable as coming from a specific distribution, then where those distributions overlap there is
some dependence between the variables, and therefore some degree of correlation.

The goal of canonical correlation analysis is to find the linear combinations of the X
variables and of the Y variables that maximize their correlation. If we have two vectors
X = (X1, ..., Xn) and Y = (Y1, ..., Ym) of random variables, and there are correlations among the
variables, then canonical correlation analysis finds linear combinations of the Xi and Yj
that have maximum correlation with each other.

For example, suppose we have variables related to exercise and health. The variables
associated with exercise are the climbing rate on a stair stepper, running speed, the amount of
weight lifted on the bench press, and the number of push-ups per minute. The variables associated
with health are blood pressure, cholesterol level, glucose level, and body mass index.

Again, canonical correlation analysis describes the relationship between the exercise set
of variables and the health set of variables by finding linear combinations of the Xi and Yj which
have maximum correlation with each other.
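The idea of searching for the weights that maximize correlation can be illustrated with a minimal Python sketch. The toy data, the variable names, and the brute-force grid search are all assumptions made for illustration. They are not part of the original analysis, and real software solves the problem algebraically rather than by search. With a single Y variable only the X weights need to be chosen.

```python
import math

def pearson(u, v):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical toy data: two exercise variables and one health variable.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.5, 1.0, 3.8, 3.1, 5.9, 5.2]

def best_weights(x1, x2, y, steps=180):
    """Grid search over directions (cos t, sin t) for the weights of x1 and x2."""
    best = (0.0, (1.0, 0.0))
    for k in range(steps):
        theta = math.pi * k / steps
        a1, a2 = math.cos(theta), math.sin(theta)
        u = [a1 * p + a2 * q for p, q in zip(x1, x2)]  # linear combination of the X set
        r = abs(pearson(u, y))
        if r > best[0]:
            best = (r, (a1, a2))
    return best

best_r, (w1, w2) = best_weights(x1, x2, y)
print(best_r, w1, w2)
```

The best combination correlates with y at least as strongly as either x1 or x2 does on its own, because the grid includes the directions that use only one of the two variables.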

As with all other multivariate techniques, canonical correlation analysis rests on some
assumptions:
• There should be multiple continuous variables in both the dependent and the
independent set. Categorical variables can be included in the analysis, but they
have to be dummy coded, that is, coded as 1 or 0 or some other binary
representation.
• The relationship between any two variables, and between the variates, is assumed to be linear.
• Multivariate normality is necessary to perform statistical tests.
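A minimal sketch of the dummy coding mentioned in the first assumption is given below. The function name `dummy_code`, the `region` variable, and the choice of reference level are hypothetical and only illustrate the 0/1 representation.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator list per level, leaving out the reference level."""
    levels = sorted(set(values) - {reference})
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

# Hypothetical categorical variable with three levels.
region = ["north", "south", "east", "north"]
codes = dummy_code(region, reference="north")
print(codes)
```

A variable with k levels becomes k - 1 binary columns. The reference level is the one coded as all zeros.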

Advantages

• Canonical correlation analysis can reduce or synthesize a large amount of information.
A large amount of information can get out of control and be complicated to handle. With
canonical correlation we can capture that information in far fewer values.

The method was first introduced by Harold Hotelling in 1936.

II. Motivation of Canonical Correlation Analysis

It is possible to create pairwise scatter plots of the variables in the first set (e.g., exercise
variables) against the variables in the second set (e.g., health variables). But if the dimension of
the first set is p and that of the second set is q, there will be pq such scatter plots. It may be
difficult, if not outright impossible, to look at all of these graphs together and interpret the results.

Canonical correlation analysis allows us to summarize the relationships in a smaller
number of statistics while preserving the main facets of the relationships. In a way, the motivation
for canonical correlation is very similar to that for principal component analysis. It is another
dimension reduction technique.

III. Notations and Formulas

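Suppose we have two sets of variables, X and Y:

$$
\text{Set 1: } X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix},
\qquad
\text{Set 2: } Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_q \end{pmatrix}
$$

We assign the labels X and Y based on the number of variables in each set so that p ≤ q.
This is done for computational convenience.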

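Just as in principal components analysis, we look at linear combinations of the data.
We define two sets of linear combinations, named U and V. U corresponds to the linear
combinations of the first set of variables, X, and V corresponds to the second set of
variables, Y. Each member of U is paired with a member of V. For example, U1 below is a
linear combination of the p X variables and V1 is the corresponding linear combination of the
q Y variables. Similarly, U2 is a linear combination of the p X variables, and V2 is the
corresponding linear combination of the q Y variables. And so on:

$$
\begin{aligned}
U_1 &= a_{11} X_1 + a_{12} X_2 + \dots + a_{1p} X_p \\
U_2 &= a_{21} X_1 + a_{22} X_2 + \dots + a_{2p} X_p \\
&\;\;\vdots \\
U_p &= a_{p1} X_1 + a_{p2} X_2 + \dots + a_{pp} X_p \\[6pt]
V_1 &= b_{11} Y_1 + b_{12} Y_2 + \dots + b_{1q} Y_q \\
V_2 &= b_{21} Y_1 + b_{22} Y_2 + \dots + b_{2q} Y_q \\
&\;\;\vdots \\
V_p &= b_{p1} Y_1 + b_{p2} Y_2 + \dots + b_{pq} Y_q
\end{aligned}
$$

Thus we define

$$
(U_i, V_i)
$$

as the i-th canonical variate pair.

Note: (U1, V1) is the first canonical variate pair, (U2, V2) is the second canonical variate
pair, and so on. With p ≤ q there are p canonical variate pairs.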

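From these definitions we can derive the formula for the correlation between the two sets of
variables.

We can compute the variances of the Ui and Vj variables using the following expressions:

$$
\operatorname{var}(U_i) = \sum_{k=1}^{p} \sum_{l=1}^{p} a_{ik} a_{il} \operatorname{cov}(X_k, X_l)
$$

and

$$
\operatorname{var}(V_j) = \sum_{k=1}^{q} \sum_{l=1}^{q} b_{jk} b_{jl} \operatorname{cov}(Y_k, Y_l)
$$

Then we calculate the covariance between Ui and Vj as:

$$
\operatorname{cov}(U_i, V_j) = \sum_{k=1}^{p} \sum_{l=1}^{q} a_{ik} b_{jl} \operatorname{cov}(X_k, Y_l)
$$

The correlation between Ui and Vj is calculated using the usual formula. We take the
covariance between those two variables and divide it by the square root of the product of the
variances:

$$
\operatorname{corr}(U_i, V_j) = \frac{\operatorname{cov}(U_i, V_j)}{\sqrt{\operatorname{var}(U_i)}\,\sqrt{\operatorname{var}(V_j)}}
$$

For a canonical variate pair (i = j) this quantity is the canonical correlation, and it is the
quantity to be maximized. We want to find linear combinations of the X variables and
linear combinations of the Y variables that maximize the above correlation.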
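The double sums for the variance, covariance, and correlation of two linear combinations can be evaluated directly. The sketch below uses hypothetical covariance values and coefficient vectors for p = 2 and q = 2. None of these numbers come from the original example.

```python
import math

def bilinear(a, S, b):
    """Double sum: sum over k and l of a[k] * b[l] * S[k][l]."""
    return sum(a[k] * b[l] * S[k][l]
               for k in range(len(a)) for l in range(len(b)))

# Hypothetical covariance blocks (assumed values, for illustration only).
S_xx = [[1.0, 0.4], [0.4, 1.0]]   # S_xx[k][l] = cov(X_k, X_l)
S_yy = [[1.0, 0.2], [0.2, 1.0]]   # S_yy[k][l] = cov(Y_k, Y_l)
S_xy = [[0.5, 0.3], [0.3, 0.2]]   # S_xy[k][l] = cov(X_k, Y_l)

a = [0.8, 0.3]   # coefficients of U = a[0]*X1 + a[1]*X2
b = [0.7, 0.4]   # coefficients of V = b[0]*Y1 + b[1]*Y2

var_u = bilinear(a, S_xx, a)
var_v = bilinear(b, S_yy, b)
cov_uv = bilinear(a, S_xy, b)
corr_uv = cov_uv / (math.sqrt(var_u) * math.sqrt(var_v))
print(var_u, var_v, cov_uv, corr_uv)
```

Canonical correlation analysis then looks for the coefficient vectors `a` and `b` that make `corr_uv` as large as possible.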

IV. Canonical Variates Defined

1. First canonical variate pair: (U1, V1)

The coefficients a11, a12, ..., a1p and b11, b12, ..., b1q are selected so as to maximize
the canonical correlation of the first canonical variate pair. This is subject to the constraint
that the variances of the two canonical variates in that pair are equal to one:

$$
\operatorname{var}(U_1) = \operatorname{var}(V_1) = 1
$$

This is required so that unique values for the coefficients are obtained.
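The reason a variance constraint is needed can be seen numerically: rescaling the coefficients changes the variance of a linear combination but not its correlation with another one. The sketch below uses assumed covariance values and hypothetical coefficient vectors, and rescales them to unit variance.

```python
import math

def quad(a, S, b):
    """Double sum: sum over k and l of a[k] * b[l] * S[k][l]."""
    return sum(a[k] * b[l] * S[k][l]
               for k in range(len(a)) for l in range(len(b)))

def unit_variance(a, S):
    """Rescale the coefficients a so that the combination has variance 1."""
    scale = math.sqrt(quad(a, S, a))
    return [w / scale for w in a]

# Hypothetical covariance blocks and raw coefficients (illustration only).
S_xx = [[1.0, 0.4], [0.4, 1.0]]
S_yy = [[1.0, 0.2], [0.2, 1.0]]
S_xy = [[0.5, 0.3], [0.3, 0.2]]
a_raw, b_raw = [0.8, 0.3], [0.7, 0.4]

a1 = unit_variance(a_raw, S_xx)
b1 = unit_variance(b_raw, S_yy)

corr_raw = quad(a_raw, S_xy, b_raw) / math.sqrt(
    quad(a_raw, S_xx, a_raw) * quad(b_raw, S_yy, b_raw))
# With unit variances the covariance of the pair is already its correlation.
corr_scaled = quad(a1, S_xy, b1)
print(corr_raw, corr_scaled)
```

Both correlations agree, so without the unit-variance constraint any positive multiple of a coefficient vector would be an equally good solution.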

2. Second canonical variate pair: (U2, V2)

Similarly, we want to find the coefficients a21, a22, ..., a2p and b21, b22, ..., b2q that
maximize the canonical correlation of the second canonical variate pair, (U2, V2). Again, we
maximize this canonical correlation subject to the constraint that the variances of the
individual canonical variates are both equal to one. Furthermore, we require the additional
constraints that (U1, U2) and (V1, V2) be uncorrelated. In addition, the
combinations (U1, V2) and (U2, V1) must be uncorrelated. In summary, our constraints are:

$$
\begin{aligned}
\operatorname{var}(U_2) &= \operatorname{var}(V_2) = 1 \\
\operatorname{cov}(U_1, U_2) &= \operatorname{cov}(V_1, V_2) = 0 \\
\operatorname{cov}(U_1, V_2) &= \operatorname{cov}(U_2, V_1) = 0
\end{aligned}
$$

Basically, we require that all of the remaining correlations equal zero.

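3. i-th canonical variate pair: (Ui, Vi)

We want to find the coefficients ai1, ai2, ..., aip and bi1, bi2, ..., biq that maximize the
canonical correlation of the i-th pair subject to the analogous constraints:

$$
\begin{aligned}
\operatorname{var}(U_i) &= \operatorname{var}(V_i) = 1 \\
\operatorname{cov}(U_1, U_i) &= \operatorname{cov}(V_1, V_i) = 0 \\
\operatorname{cov}(U_2, U_i) &= \operatorname{cov}(V_2, V_i) = 0 \\
&\;\;\vdots \\
\operatorname{cov}(U_{i-1}, U_i) &= \operatorname{cov}(V_{i-1}, V_i) = 0 \\
\operatorname{cov}(U_1, V_i) &= \operatorname{cov}(U_i, V_1) = 0 \\
\operatorname{cov}(U_2, V_i) &= \operatorname{cov}(U_i, V_2) = 0 \\
&\;\;\vdots \\
\operatorname{cov}(U_{i-1}, V_i) &= \operatorname{cov}(U_i, V_{i-1}) = 0
\end{aligned}
$$

Again, all of the remaining correlations are required to equal zero.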
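The same constraints can be written more compactly in matrix notation. The notation below is an assumption introduced for this sketch and does not appear in the original: a_i and b_i are the coefficient vectors of the i-th pair, Σ_XX and Σ_YY are the covariance matrices of X and Y, and Σ_XY is the matrix of covariances between the X and Y variables.

```latex
\begin{aligned}
U_i &= a_i^{\top} X, \qquad V_i = b_i^{\top} Y, \\
\operatorname{var}(U_i) &= a_i^{\top} \Sigma_{XX}\, a_i = 1, \qquad
\operatorname{var}(V_i) = b_i^{\top} \Sigma_{YY}\, b_i = 1, \\
\operatorname{cov}(U_k, U_i) &= a_k^{\top} \Sigma_{XX}\, a_i = 0, \qquad
\operatorname{cov}(V_k, V_i) = b_k^{\top} \Sigma_{YY}\, b_i = 0, \qquad k < i, \\
\operatorname{cov}(U_k, V_i) &= a_k^{\top} \Sigma_{XY}\, b_i = 0, \qquad
\operatorname{cov}(U_i, V_k) = a_i^{\top} \Sigma_{XY}\, b_k = 0, \qquad k < i.
\end{aligned}
```

Under these constraints the canonical correlation of the i-th pair is simply cov(Ui, Vi), because both variances equal one.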

V. Example using SAS Software.

The data to be analyzed comes from a firm that surveyed a random sample of n = 50 of
its employees in an attempt to determine what factors influence sales performance. Two
collections of variables were measured:

Sales Performance:
1. Sales Growth
2. Sales Profitability
3. New Account Sales

Test Scores as a Measure of Intelligence:
1. Creativity
2. Mechanical Reasoning
3. Abstract Reasoning
4. Mathematics

There are p = 3 variables in the first group, relating to Sales Performance, and q = 4
variables in the second group, relating to the Test Scores.

Here’s the SAS command:

VI. Test for Relationship Between Canonical Variate Pairs

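To test for independence between the Sales Performance and the Test Score
variables, first consider a multivariate multiple regression model in which we predict, in this
case, the Sales Performance variables from the Test Score variables. In the general case, we
have p multiple regressions, each predicting one of the variables in
the first group (X variables) from the q variables in the second group (Y variables):

$$
\begin{aligned}
X_1 &= \beta_{10} + \beta_{11} Y_1 + \beta_{12} Y_2 + \dots + \beta_{1q} Y_q + \epsilon_1 \\
X_2 &= \beta_{20} + \beta_{21} Y_1 + \beta_{22} Y_2 + \dots + \beta_{2q} Y_q + \epsilon_2 \\
&\;\;\vdots \\
X_p &= \beta_{p0} + \beta_{p1} Y_1 + \beta_{p2} Y_2 + \dots + \beta_{pq} Y_q + \epsilon_p
\end{aligned}
$$

In our example, we have multiple regressions predicting the p = 3 sales variables from
the q = 4 test score variables.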

NULL HYPOTHESIS: We wish to test the null hypothesis that these regression coefficients
(except for the intercepts) are all equal to zero. This is equivalent to the null hypothesis
that the first set of variables is independent of the second set of variables.

This is carried out using Wilks' lambda. The results are found on page 1 of the
output of the SAS program.

SAS reports Wilks' lambda Λ = 0.00215; F = 87.39; d.f. = 12, 114; p < 0.0001. Wilks'
lambda is a ratio of the determinants of two variance-covariance matrices (raised to a certain
power). A small value of Wilks' lambda, or equivalently a large F statistic with a small p-value,
indicates rejection of the null hypothesis. Here we reject the null hypothesis that there is no
relationship between the two sets of variables, and conclude that the two sets of variables are
dependent.
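The reported F statistic and degrees of freedom are consistent with Rao's F approximation to the distribution of Wilks' lambda. The sketch below assumes that this approximation is what produced the reported values, which the source does not state. The function name `rao_f` is hypothetical. The inputs are the rounded Λ = 0.00215, n = 50, p = 3 and q = 4 from the text.

```python
import math

def rao_f(wilks, n, p, q):
    """Rao's F approximation for Wilks' lambda (assumes p**2 + q**2 != 5)."""
    t = math.sqrt((p ** 2 * q ** 2 - 4) / (p ** 2 + q ** 2 - 5))
    w = n - 1 - (p + q + 1) / 2
    df1 = p * q                      # numerator degrees of freedom
    df2 = w * t - p * q / 2 + 1      # denominator degrees of freedom
    root = wilks ** (1 / t)
    f_stat = (1 - root) / root * df2 / df1
    return f_stat, df1, df2

f_stat, df1, df2 = rao_f(0.00215, n=50, p=3, q=4)
print(round(f_stat, 2), df1, round(df2))
```

The numerator degrees of freedom equal pq = 12 and the denominator degrees of freedom round to 114. The F value comes out close to the reported 87.39, and the small gap is what one would expect from using the rounded value of Λ.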

Since Wilks' lambda is significant, and since the canonical correlations are ordered from
largest to smallest, we can conclude that at least the first canonical correlation is nonzero,
that is, there is a relationship between the two sets of variables.

[Figure: scatter plot for the first canonical variate pair]

[Figure: scatter plot for the second canonical variate pair]

VII. Obtain Estimates of Canonical Correlation

Having rejected the hypothesis of independence, the next step is to obtain estimates of
the canonical correlations. The squared canonical correlations can be interpreted in the same
way as r² values are interpreted.

We see that 98.9% of the variation in U1 is explained by the variation in V1, and 77.11%
of the variation in U2 is explained by V2, but only 14.72% of the variation in U3 is explained
by V3. The first two canonical correlations are very high, which suggests that only the first two
canonical correlations are important.
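As a consistency check, Wilks' lambda can be recovered from the squared canonical correlations as the product of the terms (1 - r²). The short sketch below uses the three rounded percentages quoted above. The variable names are chosen for illustration.

```python
import math

# Squared canonical correlations quoted in the text (rounded values).
squared_corr = [0.989, 0.7711, 0.1472]

# The canonical correlations themselves are the square roots.
canonical_corr = [math.sqrt(r2) for r2 in squared_corr]

# Wilks' lambda is the product of (1 - r^2) over all canonical variate pairs.
wilks = math.prod(1 - r2 for r2 in squared_corr)
print(canonical_corr, wilks)
```

The product agrees with the Wilks' lambda of 0.00215 reported by SAS, up to rounding of the inputs.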

VIII. Obtain the Canonical Coefficients

Thus the first canonical variable for sales is:

And for the test score is:

IX. Correlations Between Variable and Canonical Variate


Looking at the first canonical variable for sales (sales1), we see that all correlations are
large compared with those for sales2 and sales3. Therefore, this canonical variate can be thought
of as an overall measure of Sales Performance. The second canonical variable for Sales
Performance yields little information about the data, since none of its
correlations is large.

As with sales, the correlations for the test scores are large for the first canonical variable,
which can likewise be thought of as an overall measure of test performance. It is, however, most
strongly correlated with the mathematics test score (correlation 0.9441). Comparing this result
with that for sales, we see that the mathematics test score is the best predictor of sales
performance, as this indicator stands out most.

Now let's compare the correlations between each variable and its own
canonical variate with the correlations between each variable and the canonical variates of the
opposite group.

We can see that both the sales variables and the test score variables have strong correlations
with the opposite group's canonical variates, and that they show a pattern similar to that with
their own canonical variates. The reason is straightforward: the first canonical correlation is
very high.
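The similarity of the two patterns follows from a standard identity. When the canonical variates have unit variance, the correlation of a variable with the opposite group's canonical variate is its correlation with its own canonical variate multiplied by the canonical correlation. The symbol ρ*_i for the i-th canonical correlation is notation assumed for this sketch.

```latex
\begin{aligned}
\operatorname{corr}(X_k, V_i) &= \rho_i^{*}\, \operatorname{corr}(X_k, U_i), \\
\operatorname{corr}(Y_l, U_i) &= \rho_i^{*}\, \operatorname{corr}(Y_l, V_i).
\end{aligned}
```

With a first canonical correlation of about 0.994 (the square root of 0.989), the two sets of correlations for the first pair are almost identical.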

X. Other Statistical Tools

The software used in the discussion above is SAS, which is not free. Here
are some other statistical tools you can use instead.

1. R

R is a widely used statistical tool because it is free. It can be hard to
learn if you are not used to writing code, but because so many people use it,
help is easy to find.

Before running the code, install the packages needed for the canonical
correlation analysis.

install.packages("yacca")
install.packages("GGally")

After installing the packages, import the data into R and run the code below.

# datacca is the imported data frame: columns 1-3 hold the sales variables,
# columns 4-7 hold the test scores
sales <- datacca[, 1:3]
score <- datacca[, 4:7]
yacca::cca(sales, score)   # canonical correlation analysis
GGally::ggpairs(sales)     # pairwise plots within the sales set
GGally::ggpairs(score)     # pairwise plots within the test score set

2. SPSS

SPSS is one of the easiest statistical tools to use. It is very user friendly and does not
require much code. All you have to do is input the data and then click File > New > Syntax.

Then enter this syntax:


The list of variables in the MANOVA command contains the first set of variables, followed
by the second set of variables. The subcommand /DISCRIM produces a canonical correlation
analysis for all covariates. ALPHA specifies the significance level required before a
canonical variable is extracted.
