
Correlation and covariance

Covariance
In probability theory and statistics, covariance is a measure of how much two random variables change
together.

The covariance of two variables x and y in a data sample measures how the two are linearly
related. A positive covariance would indicate a positive linear relationship between the variables,
and a negative covariance would indicate the opposite.

Covariance Symbol
The covariance of two variables x and y is denoted by Cov(x, y).

Covariance Equation
When we observe two random variables x and y as paired data (x1, y1), (x2, y2), ..., (xn, yn),
the covariance between the two variables X and Y, denoted by Cov(x, y), can be measured by
using the covariance equation:
Cov(X, Y) = E {[X - E(X)] [Y - E(Y)]}
where E(X) and E(Y) are the means of X and Y, respectively.
So the covariance is the expected value of the product of the two deviations from the mean:
X - E(X), the deviation of X from its expected mean, and Y - E(Y), the deviation of Y from
its expected mean.
The product [X - E(X)][Y - E(Y)] is positive in two cases:

Both variables are above their respective means.

Both variables are below their respective means.

In either case the two deviations from the mean have the same sign.

The product is negative in one case:

One variable is above its mean while the other is below its mean.
In this case the two deviations from the mean have opposite signs.

Hence the covariance between any two variables X and Y provides a measure of the degree to which X
and Y tend to move together.

If,
1. Cov(X, Y) > 0 => X tends to be high when Y is high, and low when Y is low.
2. Cov(X, Y) < 0 => X tends to be high when Y is low, and low when Y is high.
3. Cov(X, Y) = 0 => X and Y show neither of the above tendencies.
By the formula, for each pair of x and y values, the differences from the mean values are calculated
and the two differences are multiplied together. If both differences in a pair have the same sign, the
product is positive, and for that pair x and y vary together in the same direction.
If the differences in a pair have opposite signs, the product is negative, and for that pair x and y
vary in opposite directions. As the magnitude of the covariance increases, the strength of the
relationship also increases. The covariance can also come out to be zero. This happens when the
pairs that produced positive products are cancelled by those that produced negative products, so
that no overall linear relationship exists between the two random variables.

Covariance Formula
The above equation can be rewritten as an equivalent covariance formula, which is often easier
to use:
Cov(x, y) = E[xy] - E[x]E[y]

When using sample data, the covariance formula can be written as

Cov(x, y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)

This can be written using the shortcut method as

Cov(x, y) = [Σ xiyi - (Σ xi)(Σ yi) / n] / (n - 1)

The sample covariance is thus defined in terms of the sample means x̄ and ȳ,

Where
n = number of sample pairs

Similarly, the population covariance is defined in terms of the population means μx, μy as:

Cov(x, y) = Σ (xi - μx)(yi - μy) / N

where N is the population size.

If the greater values of one variable correspond with the greater values of the other variable (and
likewise for the smaller values), the variables show similar behavior and the covariance is positive.
If the greater values of one variable correspond to the smaller values of the other, the variables tend
to show opposite behavior and the covariance is negative.
If greater values of one variable are paired about equally often with greater and with smaller values
of the other, the covariance will be near zero.
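As a quick sanity check, the equivalence of the definition form and the shortcut form can be verified numerically; the data values below are made up purely for illustration:

```python
# Verify numerically that E{[X - E(X)][Y - E(Y)]} equals E[xy] - E[x]E[y]
# for population covariance. The data here is illustrative, not from the text.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Definition form: mean of the product of deviations
cov_def = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Shortcut form: E[xy] - E[x]E[y]
cov_short = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y

print(abs(cov_def - cov_short) < 1e-9)  # True: the two forms agree
```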

Example

Find the covariance of the following data:

x: 1, 3, 2, 5, 8, 7, 12, 2, 4
y: 8, 6, 9, 4, 3, 3, 2, 7, 7

Solution

Here n = 9.

x       y       xy
1       8       8
3       6       18
2       9       18
5       4       20
8       3       24
7       3       21
12      2       24
2       7       14
4       7       28
Σx = 44  Σy = 49  Σxy = 175

Using the shortcut formula:

Cov(x, y) = [Σxy - (Σx)(Σy)/n] / (n - 1)
          = [175 - (44)(49)/9] / 8
          = [175 - 239.56] / 8
          = -64.56 / 8
          ≈ -8.07

Therefore the covariance is approximately -8.07.
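The computation above can be checked in a few lines of Python; with the (n - 1) divisor the shortcut formula gives approximately -8.07:

```python
# Recompute the worked example with the shortcut sample-covariance formula:
# Cov(x, y) = [sum(xy) - sum(x)*sum(y)/n] / (n - 1)
x = [1, 3, 2, 5, 8, 7, 12, 2, 4]
y = [8, 6, 9, 4, 3, 3, 2, 7, 7]
n = len(x)  # 9

sum_x = sum(x)                                 # 44
sum_y = sum(y)                                 # 49
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 175

cov = (sum_xy - sum_x * sum_y / n) / (n - 1)
print(round(cov, 2))  # -8.07
```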

Correlation
Correlation makes no a priori assumption as to whether one variable is dependent on the
other(s), and it is not concerned with the form of the relationship between variables; instead
it gives an estimate of the degree of association between the variables. In fact, correlation
analysis tests for interdependence of the variables.
Correlation coefficient:
The numerical measure that assesses the strength of a linear relationship is called the correlation
coefficient, and is denoted by r.
Definition of correlation coefficient:
The correlation coefficient (r) is a numerical measure that measures the strength and direction of
a linear relationship between two quantitative variables.
The quantity r, called the linear correlation coefficient, measures the strength and the direction of
a linear relationship between two variables. The linear correlation coefficient is sometimes
referred to as the Pearson product moment correlation coefficient in honor of its developer Karl
Pearson.

Calculation of r:
The mathematical formula for computing r is:

r = [n Σxy - (Σx)(Σy)] / √{[n Σx² - (Σx)²] [n Σy² - (Σy)²]}

where n is the number of pairs of data.


(Aren't you glad you have a graphing calculator that computes this formula?)
The value of r is such that -1 ≤ r ≤ +1. The + and − signs are used for positive
linear correlations and negative linear correlations, respectively.
Positive correlation: If x and y have a strong positive linear correlation, r is close
to +1. An r value of exactly +1 indicates a perfect positive fit. Positive values
indicate a relationship between x and y variables such that as values for x increases,
values for y also increase.
Negative correlation: If x and y have a strong negative linear correlation, r is close
to -1. An r value of exactly -1 indicates a perfect negative fit. Negative values
indicate a relationship between x and y such that as values for x increase, values
for y decrease.
No correlation: If there is no linear correlation or a weak linear correlation, r is
close to 0. A value near zero means that there is no linear relationship between the
two variables, though a nonlinear relationship may still exist.
Note that r is a dimensionless quantity; that is, it does not depend on the units
employed.
A perfect correlation of 1 occurs only when the data points all lie exactly on a
straight line. If r = +1, the slope of this line is positive. If r = -1, the slope of this
line is negative.
A correlation greater than 0.8 is generally described as strong, whereas a correlation
less than 0.5 is generally described as weak. These values can vary based upon the
"type" of data being examined. A study utilizing scientific data may require a stronger
correlation than a study using social science data.
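The computational formula for r can be sketched directly in Python; this is a minimal illustration, not a library implementation:

```python
import math

# Compute Pearson's r from the raw-score (computational) formula:
# r = [n*sum(xy) - sum(x)*sum(y)] /
#     sqrt([n*sum(x^2) - sum(x)^2] * [n*sum(y^2) - sum(y)^2])
def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# A perfectly linear, increasing data set (y = 2x + 1) gives r = +1
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # 1.0
```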

Properties of r:
1. The correlation does not change when the units of measurement of either one of
the variables change. In other words, if we change the units of measurement of the
explanatory variable and/or the response variable, this has no effect on the
correlation (r).
To illustrate this, consider two versions of the scatter plot of the relationship between sign
legibility distance and driver's age:

The top scatter plot displays the original data, where the maximum distance is measured in feet.
The bottom scatter plot displays the same relationship, but with maximum distances converted to
meters.
Notice that the Y-values have changed, but the correlation is the same. This is an example of
how changing the units of measurement of the response variable has no effect on r; as we
indicated above, the same is true for changing the units of the explanatory variable, or of both
variables. This is a good place to note that the correlation (r) is unitless. It is just a
number.
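Property 1 can be illustrated numerically. The age and distance values below are hypothetical stand-ins for the legibility data, not the study's actual numbers:

```python
# Changing units (feet -> meters) leaves r unchanged, because r is unitless.
# These data points are made up for illustration.
ages = [20, 30, 40, 50, 60, 70]
dist_feet = [590, 560, 480, 460, 420, 370]
dist_meters = [d * 0.3048 for d in dist_feet]  # convert feet to meters

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r_feet = pearson_r(ages, dist_feet)
r_meters = pearson_r(ages, dist_meters)
print(abs(r_feet - r_meters) < 1e-9)  # True: same r in either unit
```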
2. The correlation only measures the strength of a linear relationship between two
variables. It ignores any other type of relationship, no matter how strong it is.
For example, consider the relationship between the average fuel usage of driving
a fixed distance in a car, and the speed at which the car drives:

Our data describe a fairly simple curvilinear relationship: the amount of fuel consumed
decreases rapidly to a minimum for a car driving 60 kilometers per hour, and then
increases gradually for speeds exceeding 60 kilometers per hour. The relationship is very
strong, as the observations seem to perfectly fit the curve. Although the relationship is
strong, the correlation r = -0.172 indicates a weak linear relationship. This makes sense
considering that the data fails to adhere closely to a linear form:

The correlation is useless for assessing the strength of any type of relationship that is not
linear (including curvilinear relationships, such as the one in our example).
Beware, then, of interpreting the fact that "r is close to 0" as an indicator of a "weak
relationship" rather than a "weak linear relationship." This example also illustrates how
important it is to always "look at" the data in the scatter plot, since, as in our example,
there might be a strong nonlinear relationship that r does not indicate. Since the correlation
was nearly zero when the form of the relationship was not linear, we might ask whether the
correlation can be used to determine whether or not a relationship is linear.
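Property 2 can be reproduced with a small sketch: a parabola symmetric about its minimum is a perfect (but nonlinear) relationship, yet r comes out near zero. The speed and fuel values are illustrative, not the data behind the plot above:

```python
# A strong but symmetric curvilinear relationship gives r near 0.
# y depends exactly on x through a parabola centered at x = 60 (illustrative data).
speeds = [20, 30, 40, 50, 60, 70, 80, 90, 100]
fuel = [(s - 60) ** 2 for s in speeds]  # perfect dependence, but nonlinear

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x)
                  * sum((b - my) ** 2 for b in y)) ** 0.5

# r is essentially zero even though the relationship is perfectly strong
print(abs(pearson_r(speeds, fuel)) < 1e-6)  # True
```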
3. The correlation by itself is not enough to determine whether or not a relationship
is linear. To see this, let's consider the study that examined the effect of monetary

incentives on the return rate of questionnaires. Below is the scatterplot relating the
percentage of participants who completed a survey to the monetary incentive that
researchers promised to participants, in which we find a strong curvilinear
relationship:

The relationship is curvilinear, yet the correlation r = 0.876 is quite close to 1.


4. The correlation is heavily influenced by outliers. As you will learn in the next two
activities, the way in which an outlier influences the correlation depends upon whether or
not the outlier is consistent with the pattern of the linear relationship.

Covariance
Covariance indicates how two variables are related. A positive covariance means the variables
are positively related, while a negative covariance means the variables are inversely related. The
formula for calculating the covariance of sample data is shown below:

Cov(x, y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)
Where

x = the independent variable
y = the dependent variable
n = number of data points in the sample
x̄ = the mean of the independent variable x
ȳ = the mean of the dependent variable y
Result Interpretation
Covariance and correlation always have the same sign (positive, negative, or 0). When the sign is
positive, the variables are said to be positively correlated. When the sign is negative, the
variables are said to be negatively correlated and when the sign is 0, the variables are said to be
uncorrelated. A positive covariance indicates a positive linear relationship, and a negative
covariance indicates a negative linear relationship between the variables.

There can be mainly three interpretations in terms of graph, when the points are plotted.

Positive correlation - The correlation is positive when the pattern rises. If the pattern in the
graph slopes from lower left to upper right, that is, an upward-sloping line, there is a
positive correlation between the variables. In simple terms, if the data lie along a straight
line running from low values of x and y up to high values of x and y, the variables have a
positive correlation.

Negative correlation - The correlation is negative when the pattern falls. If the pattern in
the graph slopes from upper left to lower right, that is, a downward-sloping line, there is
a negative correlation between the variables. In simple terms, if the data lie along a straight
line running from high values of y down to high values of x, the variables have a negative
correlation.

Zero correlation - There can also be a null (no) correlation, where no straight line passes
through most of the data points. This does not mean the variables are independent: a
nonlinear relationship may still exist between them.

Hence both of them measure, to an extent, a certain type of dependence between the variables.
Example:
Consider the table below, containing the values of the variables x and y.

X: 2.1, 2.5, 4, 3.6
Y: 8, 12, 14, 10

In which direction are the variables moving?

Solution:
Step 1: Find the mean of both variables, X and Y.

x̄ = (2.1 + 2.5 + 4 + 3.6) / 4 = 12.2 / 4 = 3.05
ȳ = (8 + 12 + 14 + 10) / 4 = 44 / 4 = 11

Step 2:
Here N = 4, x̄ = 3.05 and ȳ = 11.

X       X - x̄    Y       Y - ȳ    (X - x̄)(Y - ȳ)
2.1     -0.95    8       -3       2.85
2.5     -0.55    12      1        -0.55
4       0.95     14      3        2.85
3.6     0.55     10      -1       -0.55

Σ (X - x̄)(Y - ȳ) = 4.6

Now

Cov(X, Y) = Σ (X - x̄)(Y - ȳ) / (N - 1) = 4.6 / 3 ≈ 1.53

Since the covariance is positive, the variables are positively related. So they move together in the same
direction.
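The arithmetic above can be verified in a couple of lines, using the exact means rather than rounded ones:

```python
# Check that the products of deviations sum to 4.6 and the sample
# covariance is positive, confirming the variables move together.
x = [2.1, 2.5, 4, 3.6]
y = [8, 12, 14, 10]
n = len(x)
mx, my = sum(x) / n, sum(y) / n  # 3.05 and 11.0
s = sum((a - mx) * (b - my) for a, b in zip(x, y))
print(round(s, 2), round(s / (n - 1), 2))  # 4.6 1.53
```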

5. Cov(X,Y) = Cov(Y,X)
6. Cov(X,X) = Var(X)
7. Cov(aX+bY+c,Z) =aCov(X,Z) +bCov(Y,Z) where X,Y,Z are RVs and a,b,c are constants.

8. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

Covariance Rules
Some rules for the Covariance:
1. The covariance of two constants, a and b, is zero.
=> COV(a, b) = E[(a - E(a))(b - E(b))] = E[(0)(0)] = 0
2. The covariance of two independent random variables is zero.
=> If x and y are independent, COV(x, y) = 0 (the converse does not hold in general).
3. The covariance is commutative (symmetric), as is obvious from the definition.
=> COV(x, y) = COV(y, x)
4. Adding a constant to either or both random variables does not change their covariances.
=> COV(x + a, y + b) = COV(x, y)
5. The additive law of covariance holds that the covariance of a random variable with a sum of
random variables is just the sum of the covariances with each of the random variables.
=> COV(x + y, z) = COV(x, z) + COV(y, z)
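Rules 3 to 5 can be spot-checked numerically on randomly generated data. This is a sketch using the population covariance; the identities hold for any divisor as long as it is used consistently:

```python
import random

# Spot-check covariance rules 3-5 on random data (illustrative sketch).
random.seed(0)
x = [random.random() for _ in range(100)]
y = [random.random() for _ in range(100)]
z = [random.random() for _ in range(100)]

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n

# Rule 3: symmetry, COV(x, y) = COV(y, x)
assert abs(cov(x, y) - cov(y, x)) < 1e-9
# Rule 4: adding constants changes nothing, COV(x + a, y + b) = COV(x, y)
assert abs(cov([xi + 5 for xi in x], [yi - 2 for yi in y]) - cov(x, y)) < 1e-9
# Rule 5: additivity, COV(x + y, z) = COV(x, z) + COV(y, z)
xy = [xi + yi for xi, yi in zip(x, y)]
assert abs(cov(xy, z) - (cov(x, z) + cov(y, z))) < 1e-9
print("rules 3-5 hold")
```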

Limitations of correlation
(1) r is a measure of linear relationship only. There may be an exact connection between the two
variables but if it is not a straight line r is no help. It is well worth studying the scatter diagram
carefully to see if a non-linear relationship may exist. Perhaps studying x and ln y may provide
an answer but this is only one possibility.
(2) Correlation does not imply causality. A survey of pupils in a primary school may well show
that there is a strong correlation between those with the biggest left feet and those who are best at
mental arithmetic. However, it is unlikely that a policy of 'left foot stretching' will lead to
improved scores. It is possible that the oldest children have the biggest left feet and are also best
at mental arithmetic.
(3) An unusual or freak result may have a strong effect on the value of r. What value of r would
you expect if point P were omitted in the scatter diagram opposite?
