
CORRELATION AND REGRESSION

INTRODUCTION
Correlation and regression (linear) are the most commonly used techniques for
investigating the relationship between two quantitative variables.

Many hydrologic variables are related to each other through cause and effect:
changes in the values of one or more variables cause changes in some other
variable.
WHAT IS CORRELATION?
The goal of a correlation analysis is to see whether two measurement variables
co-vary, and to quantify the strength of the relationship between them. If a
change in one variable brings about a change in the other variable, they are
said to be correlated. Correlation is concerned only with the strength of the
relationship; no causal effect is implied.
A set of variables may be related for two reasons:

• If one variable drives the other, they may be correlated, as with rainfall
and runoff.
• The variables may also be correlated if they share the same cause. Examples
include dependent variables such as river discharge, concentration or
transport rates of sediment, and concentration or transport rates of
substances that travel in association with suspended sediment.
WHAT IS REGRESSION?
Regression expresses the relationship, identified in the correlation analysis,
in the form of an equation. In regression analysis, the problem of interest is
the nature of the relationship itself between the dependent (response)
variable and the independent (explanatory) variable.
Regression analysis is used to detect a relation between the values of two or
more variables, of which at least one is subject to random variation, and to
test whether such a relation, either assumed or calculated, is statistically
significant. It is a tool for detecting relations between hydrologic
parameters in different places, between the parameters of a hydrologic model,
between hydraulic parameters and soil parameters, between crop growth and
water table depth, and so on.
REGRESSION ANALYSIS IS USED TO:
• Predict the value of a dependent variable based on the value of at least one
independent variable
• Explain the impact of changes in an independent variable on the dependent
variable
• Dependent variable: the variable we wish to predict or explain (e.g. runoff)
• Independent variable: the variable used to explain the dependent variable
(e.g. rainfall)
In simple linear regression:
• There is only one independent variable, X
• The relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X
WHAT IS THE SCATTER DIAGRAM?
A scatter diagram can be used to show the relationship between two variables.
The starting point is to draw a scatter of points on a graph, with one
variable on the X-axis and the other on the Y-axis, to get a feel for the
relationship (if any) between the variables as suggested by the data. The
closer the points lie to a straight line, the stronger the linear relationship
between the two variables. A scatter diagram of the data provides an initial
check of the assumptions for regression.
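As an illustrative sketch (with made-up rainfall/runoff numbers, not data from
this text), such a diagram can be drawn in Python with matplotlib:

    import matplotlib.pyplot as plt

    # Hypothetical rainfall (X) and runoff (Y) observations, illustration only
    rainfall = [31, 35, 40, 42, 46, 51, 55, 60]
    runoff = [8, 10, 14, 15, 17, 19, 22, 25]

    plt.scatter(rainfall, runoff)        # one point per (x, y) observation pair
    plt.xlabel("Rainfall (X)")
    plt.ylabel("Runoff (Y)")
    plt.title("Scatter diagram")
    plt.show()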
ASSUMPTIONS
Some underlying assumptions governing the use of correlation and regression
are as follows. The observations are assumed to be independent. For
correlation, both variables should be random variables, but for regression
only the dependent variable Y must be random. In carrying out hypothesis
tests, the response variable should follow a normal distribution, and the
variability of Y should be the same for each value of the predictor variable.
THREE MAIN USES OF CORRELATION AND REGRESSION
• One is to test hypotheses about cause-and-effect relationships. In this case, the
experimenter determines the values of the X-variable and sees whether variation in
X causes variation in Y. For example, giving people different amounts of a drug
and measuring their blood pressure.

• The second main use for correlation and regression is to see whether two variables
are associated, without necessarily inferring a cause-and-effect relationship. In this
case, neither variable is determined by the experimenter; both are naturally
variable. If an association is found, the inference is that variation in X may cause
variation in Y, or variation in Y may cause variation in X, or variation in some
other factor may affect both X and Y.

• The third common use of regression (linear) is estimating the value of one variable
corresponding to a particular value of the other variable.
CORRELATION
CORRELATION COEFFICIENT:
A) Pearson Product-Moment Correlation is one of the measures of correlation;
it quantifies both the strength and the direction of the relationship. It is
usually denoted by the Greek letter ρ.
CONDITIONS
This coefficient is used if two conditions are satisfied:
• the variables are in the interval or ratio scale of measurement
• a linear relationship between them is suspected
POSITIVE AND NEGATIVE CORRELATION
The coefficient (ρ) is computed as the ratio of the covariance between the
variables to the product of their standard deviations:

    ρ(X, Y) = Cov(X, Y) / (σX σY)

This formulation is advantageous for two reasons.
• First, it tells us the direction of the relationship. Once the coefficient
is computed, ρ > 0 indicates a positive relationship, ρ < 0 indicates a
negative relationship, while ρ = 0 indicates the absence of any linear
relationship.
• Second, it ensures (mathematically) that the numerical value of ρ ranges
from -1.0 to +1.0. This gives an idea of the strength of the linear
relationship between the variables: the closer the coefficient is to +1.0 or
-1.0, the stronger the linear relationship.
In terms of sample data, the coefficient is computed as:

    ρ(X, Y) = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² Σ(yi - ȳ)²]
Properties of ρ
This measure of correlation has interesting properties, some of which are
enunciated below:
• It is independent of the units of measurement; it is in fact unit-free.
• It is symmetric: ρ between X and Y is exactly the same as ρ between Y and X.
• Pearson's correlation coefficient is independent of changes in origin and
scale.
• If the variables are independent of each other, then ρ = 0. However, the
converse is not true: ρ = 0 does not imply that the variables are independent;
it only indicates the absence of a linear relationship (a non-linear
relationship may still exist).
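A minimal sketch of computing ρ directly from its definition (covariance over
the product of standard deviations), on hypothetical data; the symmetry
property can be checked by swapping the arguments:

    import numpy as np

    def pearson_r(x, y):
        # Sample Pearson correlation: covariance / (std_x * std_y)
        x, y = np.asarray(x, float), np.asarray(y, float)
        dx, dy = x - x.mean(), y - y.mean()
        return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

    x = [31, 35, 40, 42, 46, 51, 55, 60]   # hypothetical rainfall
    y = [8, 10, 14, 15, 17, 19, 22, 25]    # hypothetical runoff
    print(pearson_r(x, y))   # close to +1: strong positive linear relationship
    print(pearson_r(y, x))   # identical, since the coefficient is symmetric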
B) Spearman Rank Correlation Coefficient is a non-parametric measure of
correlation that uses ranks to calculate the correlation. It is usually
denoted by rs and, in the absence of ties, is computed as:

    rs = 1 - 6 Σdi² / [n(n² - 1)]

where di is the difference between the ranks of the ith pair of observations
and n is the number of pairs.
In general,
• rs > 0 implies positive agreement among ranks
• rs < 0 implies negative agreement (or agreement in the reverse direction)
• rs = 0 implies no agreement
The closer rs is to +1, the better the agreement, while rs close to -1
indicates strong agreement in the reverse direction.
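A short sketch (hypothetical data) comparing the rank-based formula above with
SciPy's built-in version, which also handles ties:

    import numpy as np
    from scipy import stats

    x = [31, 35, 40, 42, 46, 51, 55, 60]
    y = [8, 10, 14, 15, 19, 17, 22, 25]   # one swapped pair spoils agreement

    # Manual computation from ranks: rs = 1 - 6*sum(d^2)/(n*(n^2 - 1)), no ties
    rx, ry = stats.rankdata(x), stats.rankdata(y)
    d = rx - ry
    n = len(x)
    rs_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

    rs_scipy, p_value = stats.spearmanr(x, y)
    print(rs_manual, rs_scipy)   # the two values agree when there are no ties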
SIGNIFICANCE OF CORRELATION

Look up r in a table of correlation coefficients (ignoring the + or - sign).
The number of degrees of freedom is two less than the number of points on the
graph. If the calculated r value exceeds the tabulated value at p = 0.05, the
correlation is significant.
Degrees of Freedom    p = 0.05    p = 0.01    p = 0.001
        1              0.997       1.000       1.000
        2              0.950       0.990       0.999
        3              0.878       0.959       0.991
        4              0.811       0.917       0.974
        5              0.755       0.875       0.951
        6              0.707       0.834       0.925
        7              0.666       0.798       0.898
        8              0.632       0.765       0.872
        9              0.602       0.735       0.847
       10              0.576       0.708       0.823
       11              0.553       0.684       0.801
       12              0.532       0.661       0.780
       13              0.514       0.641       0.760
       14              0.497       0.623       0.742
       15              0.482       0.606       0.725
       16              0.468       0.590       0.708
       17              0.456       0.575       0.693
       18              0.444       0.561       0.679
       19              0.433       0.549       0.665
       20              0.423       0.537       0.652
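Instead of the table, the same test can be carried out with the t
transformation of r, t = r·√(df/(1 - r²)), compared against the Student's t
critical value; a sketch:

    import math
    from scipy import stats

    def r_is_significant(r, n, alpha=0.05):
        # Two-sided test of H0: rho = 0; df = number of points - 2
        df = n - 2
        t = abs(r) * math.sqrt(df / (1 - r ** 2))
        return t > stats.t.ppf(1 - alpha / 2, df)

    print(r_is_significant(0.695, 18))   # r, n from the worked example: True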


PARTIAL CORRELATION ANALYSIS involves
studying the linear relationship between two variables after
excluding the effect of one or more independent factors. In
order to get a correct picture of the relationship between two
variables, we should first eliminate the influence of other
variables. For example, study of partial correlation between
price and demand would involve studying the relationship
between price and demand excluding the effect of money
supply, exports, etc.
The partial correlation analysis assumes
great significance in cases where the phenomena
under consideration have multiple factors
influencing them, especially in physical and
experimental sciences, where it is possible to
control the variables and the effect of each
variable can be studied separately.
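For a single controlled variable Z, the first-order partial correlation can be
computed from the three pairwise coefficients; a sketch with hypothetical
values for the price/demand/money-supply example:

    import math

    def partial_r(r_xy, r_xz, r_yz):
        # Correlation of X and Y after excluding the effect of Z
        return (r_xy - r_xz * r_yz) / math.sqrt(
            (1 - r_xz ** 2) * (1 - r_yz ** 2))

    # Hypothetical pairwise correlations: price (X), demand (Y), money supply (Z)
    print(partial_r(r_xy=0.8, r_xz=0.6, r_yz=0.5))   # about 0.72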
DISADVANTAGE OF SIMPLE CORRELATION:
In simple correlation, we measure the
strength of the linear relationship between two
variables, without taking into consideration the
fact that both these variables may be influenced
by a third variable.
MULTIPLE CORRELATION
Another technique used to overcome the drawbacks of simple correlation is
multiple correlation analysis. Here, we study the effects of all the
independent variables simultaneously on a dependent variable. For example,
the correlation coefficient between the yield of paddy (X1) and the other
variables, viz. type of seedlings (X2), manure (X3), rainfall (X4) and
humidity (X5), is the multiple correlation coefficient R1.2345. This
coefficient takes values between 0 and +1 (a sketch of its computation is
given below).
The limitations of multiple correlation are similar to those of partial
correlation.
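A sketch of computing a multiple correlation coefficient on hypothetical data
with two regressors (the variable names are illustrative): fit the multiple
regression by least squares and take R as the square root of the resulting R².

    import numpy as np

    # Hypothetical data: paddy yield (y) against two regressors
    y = np.array([20.0, 24.0, 27.0, 30.0, 34.0, 38.0])
    X = np.column_stack([
        np.ones(6),                             # intercept column
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],         # regressor X2 (e.g. manure)
        [30.0, 35.0, 33.0, 40.0, 42.0, 45.0],   # regressor X3 (e.g. rainfall)
    ])

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
    y_hat = X @ beta
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    R = np.sqrt(r2)        # multiple correlation coefficient, in [0, +1]
    print(R)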
Coaxial graphical correlations of runoff with rainfall and other parameters,
such as time of the year, storm duration and antecedent moisture conditions,
are also used in hydrology.
CORRELATION COEFFICIENT
For rainfall-runoff data, the correlation coefficient is computed as:

    r = [N ΣPR - (ΣP)(ΣR)] / √{[N ΣP² - (ΣP)²][N ΣR² - (ΣR)²]}

where
P - rainfall
R - runoff
N - number of observations
REGRESSION
ASSUMPTION OF LINEARITY
Linear regression does not test whether the data are linear. It finds the
slope and the intercept assuming that the relationship between the
independent and dependent variables can best be explained by a straight line.
One can construct a scatter plot to confirm this assumption. If the scatter
plot reveals a non-linear relationship, a suitable transformation (e.g.
logarithmic) can often be used to attain linearity.
Depending on the number of independent variables, regression analysis can be
classified as:
a) Simple linear regression: the most commonly used model in hydrology, where
the dependent variable is regressed on only one independent variable:

    yi = a + b xi + εi,    i = 1, 2, …, n

where yi is the ith value of the dependent (regressed) variable, xi is the ith
value of the independent (regressor or predictor) variable, and a, b are the
regression coefficients. The regression line crosses the y-axis at point a
(the intercept) and has slope b; εi is the random error or residual term for
the ith data point. Since the actual (observed) values of variable Y will not
exactly match the values estimated by the regression equation, there will be
residuals.
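A minimal sketch of estimating a and b by least squares (the Sxy/Sxx formulas
used in the worked example at the end of this section), on hypothetical data:

    import numpy as np

    def fit_line(x, y):
        # Least-squares estimates of intercept a and slope b in y = a + b*x
        x, y = np.asarray(x, float), np.asarray(y, float)
        sxy = np.sum((x - x.mean()) * (y - y.mean()))   # Sxy
        sxx = np.sum((x - x.mean()) ** 2)               # Sxx
        b = sxy / sxx
        a = y.mean() - b * x.mean()
        return a, b

    a, b = fit_line([31, 35, 40, 42, 46, 51, 55, 60],   # hypothetical x
                    [8, 10, 14, 15, 17, 19, 22, 25])    # hypothetical y
    print(f"y = {a:.3f} + {b:.3f} x")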
b) Multiple regression: where the dependent variable is regressed on more
than one independent variable

DEPENDENT AND INDEPENDENT VARIABLES
By simple linear regression, we mean models with just one independent and one
dependent variable. The variable whose value is to be predicted is known as
the dependent variable, and the one whose known value is used for prediction
is known as the independent variable.
CHOICE OF REGRESSION LINE
For example, consider two variables crop yield (Y) and
rainfall (X). Here construction of regression line of Y on X
would make sense and would be able to demonstrate the
dependence of crop yield on rainfall. We would then be able to
estimate crop yield given rainfall.
Careless use of linear regression analysis could mean construction of the
regression line of X on Y, which would demonstrate the laughable scenario
that rainfall is dependent on crop yield; this would suggest that if you grow
really big crops you will be guaranteed a heavy rainfall.
MAKING THE REGRESSION LINE
The way to draw the line is to take three values of x: one on the left side of
the scatter diagram, one in the middle, and one on the right. Use the
regression equation y = a + bx to get the corresponding values of y. Although
two points are enough to define the line, three are better as a check.
REGRESSION COEFFICIENT
The coefficient of X in the line of regression of Y on X is called the
regression coefficient of Y on X. It represents the change in the value of the
dependent variable (Y) corresponding to a unit change in the value of the
independent variable (X).
For instance, if the regression coefficient of Y on X is 0.53, it would
indicate that Y will increase by 0.53 units if X increases by 1 unit. A
similar interpretation can be given for the regression coefficient of X on Y.
Once a line of regression has been constructed, one can check how good it is
(in terms of predictive ability) by examining the coefficient of
determination (R²), which always lies between 0 and 1:

    R² = 1 - Sse/Syy

where Sse is the sum of squared residuals and Syy is the total sum of squares
of y about its mean.
Non-linear Regression: In a non-linear regression equation, the dependent and
independent variable(s) are related through a non-linear relationship:

    Y = a · X1^b1 · X2^b2 · … · Xn^bn

Note that by logarithmic transformation, the above non-linear equation can be
written as a linear equation, and the coefficients can be estimated in the
same manner as for linear regression.
Transforming Non-Linear Relations
The relationship between some variables may be non-linear but can be
transformed to linear form so that the technique of linear regression can be
applied. For example, consider that two variables X and Y are non-linearly
related as follows:

    Y = α X^β

This non-linear relation can be linearized by logarithmic transformation of
the equation:

    Ln Y = Ln α + β Ln X,   or   A = a + b·B

where A = Ln Y, a = Ln α, b = β, and B = Ln X. Now, one can use the regression
technique to estimate the parameters a and b and thereby α and β. In this
procedure, two important points are worth noting.
a) The values of a and b are estimated by minimizing Σ(A - Areg)², not by
minimizing Σ(Y - Yreg)². Here Areg and Yreg are the values of A and Y
estimated by the regression equation.
b) In the log-transformed equation, the error term is additive (A = a + b·B +
c), which means that it is multiplicative in the original equation:

    Y = α X^β ε                                  (11.60)

The errors are related as c = Ln ε. Hence, the assumptions in hypothesis
testing and confidence intervals should be valid for c.
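A sketch of the whole procedure on hypothetical (X, Y) data assumed to follow
Y = αX^β: transform to logs, fit the straight line, then back-transform the
intercept.

    import numpy as np

    # Hypothetical data assumed generated by Y = alpha * X**beta (plus noise)
    X = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
    Y = np.array([2.1, 3.9, 8.2, 15.7, 32.5])

    A, B = np.log(Y), np.log(X)   # A = Ln Y, B = Ln X
    b, a = np.polyfit(B, A, 1)    # fit A = a + b*B (polyfit returns slope first)
    alpha, beta = np.exp(a), b    # back-transform: alpha = e^a, beta = b
    print(alpha, beta)            # squared errors were minimized in log space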
MAKING A REGRESSION EQUATION
Example: Using the 18 pairs of precipitation (x) and runoff (y) observations
listed in Table 11.1, (a) develop the regression equation of runoff on
precipitation, and (b) find the percentage of variation in y accounted for by
the regression.
Solution: In regression analysis, it is always helpful to first plot the data
and note the variations in the dependent and independent variables. Fig. 11.5
gives a plot of the precipitation and runoff data, which shows that there is
not much scatter around the line of best fit.
(a) The values of the various quantities required to calculate a and b are
computed in Table 11.1. Here, x̄ = 763.67/18 = 42.43 and ȳ = 272.75/18 = 15.15.
The regression coefficients are:
    b = Sxy/Sxx = 321.443/678.979 = 0.473
and a = ȳ - b·x̄ = 15.15 - 0.473 × 42.43 = -4.933.
Hence, the regression equation is: y = -4.933 + 0.473 x.

(b) The percentage of variation in y that is accounted for by the regression
is the coefficient of determination (r²) multiplied by 100. The value of Sse
has been computed in Table 11.1.
Coefficient of determination R² = 1 - Sse/Syy = 1 - 163.073/315.251 = 0.483.
The coefficient of correlation r = √R² = (0.483)^(1/2) = 0.695.
Thus, nearly 48 percent of the variation in y is explained by the regression
equation. The remaining 52 percent of the variation is due to unexplained
causes.
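Since Table 11.1 itself is not reproduced here, the computation can be checked
from the quoted summary statistics alone; a sketch:

    import math

    # Summary statistics quoted in the text (raw data pairs are in Table 11.1)
    n, sum_x, sum_y = 18, 763.67, 272.75
    Sxy, Sxx, Syy, Sse = 321.443, 678.979, 315.251, 163.073

    x_bar, y_bar = sum_x / n, sum_y / n
    b = Sxy / Sxx                  # slope
    a = y_bar - b * x_bar          # intercept
    r2 = 1 - Sse / Syy             # coefficient of determination
    r = math.sqrt(r2)

    print(f"y = {a:.3f} + {b:.3f} x")       # y = -4.933 + 0.473 x
    print(f"R^2 = {r2:.3f}, r = {r:.3f}")   # R^2 = 0.483, r = 0.695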
