
Correlation and Regression Analysis

• Many engineering design and analysis problems involve factors that are
interrelated and dependent. E.g., (1) runoff volume, rainfall; (2) evaporation,
temperature, wind speed; (3) peak discharge, drainage area, rainfall intensity;
(4) crop yield, irrigated water, fertilizer.
• Due to the inherent complexity of system behavior and the lack of full understanding
of the processes involved, the relationships among the relevant factors
or variables are established empirically or semi-empirically.
• Regression analysis is a useful and widely used statistical tool dealing with
investigation of the relationship between two or more variables related in a
non-deterministic fashion.
• Suppose a variable Y is related to several variables X1, X2, …, XK, and that the
relationship can be expressed, in general, as
Y = g(X1, X2, …, XK)
where g(.) = general expression for a function;
Y = dependent (or response) variable;
X1, X2, …, XK = independent (or explanatory) variables.
Correlation
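• When a problem involves two dependent random variables, the degree of
linear dependence between the two can be measured by the correlation
coefficient $\rho(X,Y)$, which is defined as
$$\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \, \sigma_Y}$$
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, and Cov(X,Y) is the
covariance between random variables X and Y, defined as
$$\mathrm{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E(XY) - \mu_X \mu_Y$$
where $-\infty < \mathrm{Cov}(X,Y) < \infty$ and $-1 \le \rho(X,Y) \le 1$.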

• Various correlation coefficients have been developed in statistics for measuring the
degree of association between random variables. The one defined above is
called the Pearson product-moment correlation coefficient, or simply the correlation
coefficient.

• If the two random variables X and Y are independent, then
$\rho(X,Y) = \mathrm{Cov}(X,Y) = 0$. However, the reverse statement is not necessarily true.
Cases of Correlation
[Figure: four scatter plots illustrating cases of correlation]
– Perfectly linearly correlated in opposite direction
– Strongly and positively correlated in linear fashion
– Uncorrelated in linear fashion
– Perfectly correlated in nonlinear fashion, but uncorrelated linearly
Calculation of Correlation Coefficient
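• Given a set of n paired sample observations of two random variables,
(x_i, y_i), i = 1, 2, …, n, the sample correlation coefficient (r) can be calculated as
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$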
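The calculation can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `sample_correlation` and the small data sets are hypothetical, chosen so that the results mirror three of the cases of correlation listed earlier.

```python
import math

def sample_correlation(x, y):
    """Pearson sample correlation coefficient r for paired observations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-deviations and squared deviations about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data (not from the slides)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]     # exact increasing line
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]   # exact decreasing line
y_parab = [(xi - 3.0) ** 2 for xi in x]  # exact parabola, symmetric about x = 3

r_up = sample_correlation(x, y_up)
r_down = sample_correlation(x, y_down)
r_parab = sample_correlation(x, y_parab)
print(r_up, r_down, r_parab)
```

The parabola case shows why r measures only linear dependence: y is an exact function of x, yet the cross-deviations cancel.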
Auto-correlation
• Consider the following daily stream flows (in 1000 m^3) in June 2001 at Chung Mei
Upper Station (610 ha), located upstream of a river feeding Plover Cove
Reservoir. Determine the 1-day auto-correlation coefficient, i.e., $\rho(Q_t, Q_{t+1})$.

Day (t)   Flow Q(t)    Day (t)   Flow Q(t)    Day (t)   Flow Q(t)
   1         8.35        11       313.89        21        20.06
   2         6.78        12       480.88        22        17.52
   3         6.32        13       151.28        23       116.13
   4        17.36        14        83.92        24        68.25
   5       191.62        15        44.58        25       280.22
   6        82.33        16        36.58        26       347.53
   7       524.45        17        33.65        27       771.30
   8       196.77        18        26.39        28       124.20
   9       785.09        19        22.98        29        58.00
  10       562.05        20        21.92        30        44.08

• 29 pairs: $\{(Q_t, Q_{t+1})\} = \{(Q_1, Q_2), (Q_2, Q_3), \ldots, (Q_{29}, Q_{30})\}$;

Relevant sample statistics: n = 29

$\bar{Q}_t = 186.22$; $S_{Q_t} = 230.06$; $\bar{Q}_{t+1} = 187.45$; $S_{Q_{t+1}} = 229.17$


The 1-day auto-correlation is 0.439
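A minimal Python sketch of this calculation, using the 30 daily flows tabulated above. The function name `lag_autocorrelation` is illustrative and not from the slides. It applies the Pearson formula to the lagged pairs, so the result can differ slightly from the quoted value depending on whether n or n-1 is used consistently in the covariance and the standard deviations.

```python
import math

# Daily flows Q(1)..Q(30) in 1000 m^3, June 2001, Chung Mei Upper Station
flows = [
    8.35, 6.78, 6.32, 17.36, 191.62, 82.33, 524.45, 196.77, 785.09, 562.05,
    313.89, 480.88, 151.28, 83.92, 44.58, 36.58, 33.65, 26.39, 22.98, 21.92,
    20.06, 17.52, 116.13, 68.25, 280.22, 347.53, 771.30, 124.20, 58.00, 44.08,
]

def lag_autocorrelation(series, lag=1):
    """Pearson correlation between series[t] and series[t + lag]."""
    if lag < 1 or lag >= len(series) - 1:
        raise ValueError("lag must be at least 1 and leave two or more pairs")
    x = series[:-lag]   # Q(t)
    y = series[lag:]    # Q(t + lag)
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

q_t = flows[:-1]     # 29 values Q(1)..Q(29)
q_next = flows[1:]   # 29 values Q(2)..Q(30)
mean_q_t = sum(q_t) / len(q_t)
mean_q_next = sum(q_next) / len(q_next)
r1 = lag_autocorrelation(flows, lag=1)
print(round(mean_q_t, 2), round(mean_q_next, 2), round(r1, 3))
```

Calling the same function with `lag=2`, `lag=3`, and so on gives the values needed for an autocorrelation plot against time lag.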
[Figure: Chung Mei Upper Daily Flow. Left panel: flow (1000 cubic meters) versus day. Right panel: scatter plot of Q(t+1) versus Q(t), both in 1000 m^3.]

[Figure: Autocorrelation for June 2001 Daily Flows at Chung Mei Upper, HK. Autocorrelation (from -1.0 to 1.0) versus time lag (days), for lags 1 to 5.]
Regression Models
• Due to the presence of uncertainties, a deterministic functional
relationship generally is not very appropriate or realistic.
• The deterministic model form can be modified to account for
uncertainties in the model as
$$Y = g(X_1, X_2, \ldots, X_K) + \varepsilon$$
where $\varepsilon$ = model error term with $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$.

• In engineering applications, functional forms commonly used for
establishing empirical relationships are
– Additive: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K + \varepsilon$
– Multiplicative: $Y = \beta_0 \, X_1^{\beta_1} X_2^{\beta_2} \cdots X_K^{\beta_K} \, \varepsilon$
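The slides do not show how the multiplicative form is fitted. One common approach, sketched here under the assumption that Y, every X_k, and the error term are strictly positive, is to take logarithms so that the model becomes additive in the transformed variables:

```latex
\ln Y = \ln\beta_0 + \beta_1 \ln X_1 + \beta_2 \ln X_2 + \cdots + \beta_K \ln X_K + \ln\varepsilon
```

In this form $\ln\beta_0$ plays the role of the intercept and $\ln\varepsilon$ the role of the additive error, so methods for the additive model can be applied to $(\ln X_1, \ldots, \ln X_K, \ln Y)$.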
Least Squares Method
Suppose that there are n pairs of data, {(x_i, y_i)}, i = 1, 2, …, n, and that a plot of
these data appears as

[Figure: scatter plot of y versus x]

What is a plausible mathematical model describing the relation between x and y?


Least Squares Method

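Consider an arbitrary straight line, $y = \beta_0 + \beta_1 x$, to be fitted through these
data points. The question is: “Which line is the most representative?”

[Figure: data points with a fitted line $\hat{y} = \beta_0 + \beta_1 x$ (intercept $\beta_0$, slope $\beta_1$). At each $x_i$, the vertical distance between the observed $y_i$ and the fitted $\hat{y}_i$ is the error (residual) $e_i = y_i - \hat{y}_i$.]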
Least Squares Criterion
• What are the values of $\beta_0$ and $\beta_1$ such that the resulting line “best” fits
the data points?

• But wait! Which goodness-of-fit criterion should be used to choose among
all possible combinations of $\beta_0$ and $\beta_1$?

• The least squares (LS) criterion states that the sum of the squares of
errors (or residuals, deviations) is minimum. Mathematically, the LS
criterion can be written as:
$$\min_{\beta_0,\,\beta_1} D = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

• Can any other criteria be used?


Normal Equations for LS Criterion
• The necessary conditions for the minimum value of D are:
$$\frac{\partial D}{\partial \beta_0} = 0 \quad \text{and} \quad \frac{\partial D}{\partial \beta_1} = 0$$

$$\frac{\partial D}{\partial \beta_0} = 2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-1) = 0 \;\Rightarrow\; \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$$

$$\frac{\partial D}{\partial \beta_1} = 2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0$$

• Expanding the above equations:
$$\sum_{i=1}^{n} y_i - n \beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0$$
$$\sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0$$

• Normal equations:
$$n \beta_0 + \left( \sum_{i=1}^{n} x_i \right) \beta_1 = \sum_{i=1}^{n} y_i$$
$$\left( \sum_{i=1}^{n} x_i \right) \beta_0 + \left( \sum_{i=1}^{n} x_i^2 \right) \beta_1 = \sum_{i=1}^{n} x_i y_i$$
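LS Solution (2 Unknowns)

$$\hat{\beta}_0 = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i}{n} = \bar{y} - \hat{\beta}_1 \bar{x}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$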
Fitting a Polynomial Eq. By LS Method
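$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

LS criterion:
$$\min_{\beta_0, \beta_1, \ldots, \beta_k} D = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2 - \cdots - \beta_k x_i^k \right)^2$$

Set $\dfrac{\partial D}{\partial \beta_j} = 0$, for $j = 0, 1, 2, \ldots, k$.

Normal equations are:
$$n \beta_0 + \left( \sum_{i=1}^{n} x_i \right) \beta_1 + \cdots + \left( \sum_{i=1}^{n} x_i^k \right) \beta_k = \sum_{i=1}^{n} y_i$$
$$\left( \sum_{i=1}^{n} x_i \right) \beta_0 + \left( \sum_{i=1}^{n} x_i^2 \right) \beta_1 + \cdots + \left( \sum_{i=1}^{n} x_i^{k+1} \right) \beta_k = \sum_{i=1}^{n} y_i x_i$$
$$\vdots$$
$$\left( \sum_{i=1}^{n} x_i^k \right) \beta_0 + \left( \sum_{i=1}^{n} x_i^{k+1} \right) \beta_1 + \cdots + \left( \sum_{i=1}^{n} x_i^{2k} \right) \beta_k = \sum_{i=1}^{n} y_i x_i^k$$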
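The polynomial normal equations can be assembled from power sums of x and solved directly. The sketch below is illustrative only: the function names and the data (an exact quadratic) are hypothetical, and a small Gaussian elimination routine stands in for whatever solver would be used in practice.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def fit_polynomial(x, y, k):
    """LS coefficients [b0, ..., bk] of a degree-k polynomial."""
    # Entry (j, l) of the normal matrix is sum(x_i ** (j + l))
    power_sums = [sum(xi ** p for xi in x) for p in range(2 * k + 1)]
    a = [[power_sums[j + l] for l in range(k + 1)] for j in range(k + 1)]
    # Right-hand side j is sum(y_i * x_i ** j)
    b = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(k + 1)]
    return solve_linear_system(a, b)

# Hypothetical data generated from y = 1 + 2x + 3x^2 with no error term
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [1.0 + 2.0 * xi + 3.0 * xi ** 2 for xi in x]

coeffs = fit_polynomial(x, y, 2)
print(coeffs)
```

For high polynomial degrees the normal matrix becomes badly conditioned, so this direct approach is best kept to small k.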


Fitting a Linear Function of Several Variables
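$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

LS criterion:
$$\min_{\beta_0, \beta_1, \ldots, \beta_k} D = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik} \right)^2$$

Set $\dfrac{\partial D}{\partial \beta_j} = 0$, for $j = 0, 1, 2, \ldots, k$.

Normal equations:
$$n \beta_0 + \left( \sum_{i=1}^{n} x_{i1} \right) \beta_1 + \cdots + \left( \sum_{i=1}^{n} x_{ik} \right) \beta_k = \sum_{i=1}^{n} y_i$$
$$\left( \sum_{i=1}^{n} x_{i1} \right) \beta_0 + \left( \sum_{i=1}^{n} x_{i1}^2 \right) \beta_1 + \cdots + \left( \sum_{i=1}^{n} x_{i1} x_{ik} \right) \beta_k = \sum_{i=1}^{n} y_i x_{i1}$$
$$\vdots$$
$$\left( \sum_{i=1}^{n} x_{ik} \right) \beta_0 + \left( \sum_{i=1}^{n} x_{ik} x_{i1} \right) \beta_1 + \cdots + \left( \sum_{i=1}^{n} x_{ik}^2 \right) \beta_k = \sum_{i=1}^{n} y_i x_{ik}$$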
Matrix Form of Multiple Regression by LS

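$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1k} \\
1 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$
(Note: $x_{ij}$ = the i-th observation of the j-th independent variable)

or $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$ in short.

LS criterion is:
$$\min D = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}' \boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})' (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})$$

Set $\dfrac{\partial D}{\partial \boldsymbol{\beta}} = \mathbf{0}$, and the result is: $\mathbf{X}' (\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}) = \mathbf{0}$

The LS solutions are: $\hat{\boldsymbol{\beta}} = (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}$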
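The matrix solution maps directly onto array code. The sketch below assumes NumPy is available and uses hypothetical data with two independent variables; none of the names come from the slides. It solves the normal equations X'X b = X'y rather than forming the inverse explicitly, and compares the result with NumPy's own least squares routine.

```python
import numpy as np

# Hypothetical observations: n = 5 rows, k = 2 independent variables
x_vars = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
# Response generated exactly from y = 2 + 3*x1 - 1*x2 (no error term)
y = 2.0 + 3.0 * x_vars[:, 0] - 1.0 * x_vars[:, 1]

# Design matrix X: a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones(len(y)), x_vars])

# Solve the normal equations (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the library least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The normal equations require X'(y - X beta_hat) = 0
normal_eq_residual = X.T @ (y - X @ beta_hat)
print(beta_hat, beta_lstsq, normal_eq_residual)
```

Solving the linear system is usually preferred over computing the inverse of X'X, and `np.linalg.lstsq` is more robust still when the columns of X are nearly collinear.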
Measure of Goodness-of-Fit
$R^2$ = Coefficient of Determination
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \varepsilon_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
= 1 - % of variation in the dependent variable, y, unexplained by the regression equation;
= % of variation in the dependent variable, y, explained by the regression equation.
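A short Python sketch of the coefficient of determination follows. The function name `r_squared` and the data are hypothetical, and the coefficients 1.06 and 1.97 are assumed to come from a least squares line fitted to that same hypothetical data.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    if len(y) != len(y_hat) or len(y) < 2:
        raise ValueError("y and y_hat must have equal length of at least 2")
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y has no variation; R^2 is undefined")
    return 1.0 - ss_res / ss_tot

# Hypothetical data and assumed LS coefficients (not from the slides)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
b0, b1 = 1.06, 1.97

y_hat = [b0 + b1 * xi for xi in x]
r2 = r_squared(y, y_hat)

# Reference cases: predicting the mean everywhere, and a perfect fit
y_mean = sum(y) / len(y)
r2_mean_only = r_squared(y, [y_mean] * len(y))
r2_perfect = r_squared(y, y)
print(r2, r2_mean_only, r2_perfect)
```

The two reference cases bracket the usual range: a model no better than the mean gives 0, and a perfect fit gives 1.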
Example 1 (LS Method)
LS Example
LS Example (Matrix Approach)
LS Example (by Minitab w/ 0)
LS Example (by Minitab w/o 0)
LS Example (Output Plots)
