TYPES OF RELATIONSHIP
Regression
Regression analysis is a statistical technique for
investigating and modeling the relationship
between variables. It is used to investigate the
dependence of one variable, called the dependent
variable and denoted by Y, on one or more
variables, called independent variables and
denoted by X's. It provides an equation for
estimating or predicting the average value of the
dependent variable from known values of the
independent variables.
Regression Analysis
► Regression Analysis is used to estimate a function f( )
that describes the relationship between a continuous
dependent variable and one or more independent
variables.
Y = f(X1, X2, X3,…, Xn) + ε
Note:
• f( ) describes systematic variation in the relationship.
• ε represents the unsystematic variation (or random error) in the
relationship
where Y = dependent variable (also called response, predictand, or regressand)
X = independent variable (also called stimulus, predictor, or regressor)
Examples
► Sales = f(Advertising expenditure) + ε
► Fiber = f(Weight of jute plant) + ε
► Consumption expenditure = f(Income) + ε
► Yield = f(Fertilizer, Seed rate, Rainfall) + ε
► Marks = f(Study hours, IQ level) + ε
► Demand = f(Price, Price of related commodities,
Consumer income, Consumer taste, Advertising expenses
for creation of demand) + ε
Example [1]: The following data represent the money spent
on research and development and the firm's annual profit,
where X = expenditure for R&D and Y = annual profit.
Fit an appropriate model to the data.

X ($ million)    Y ($ million)
 5               31
11               40
 4               30
 5               34
 3               25
 2               20
Scatter plot
[Figure: scatter plot of profit (Y) versus expenditure (X), both in $ million]
The observed data points do not all fall on a straight line but
cluster about one. Many lines can be drawn through the data points; the
problem is to select among them. The method of LEAST SQUARES
gives the line that minimizes the sum of squared vertical distances
from the observed data points to the line (i.e., the random errors).
Any other line has a larger sum.
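A minimal Python sketch of this idea, using the six (X, Y) pairs from Example [1] and the line Y = 20 + 2X fitted to them. The two alternative lines are arbitrary choices made here for comparison, and the helper name `sum_squared_errors` is hypothetical.

```python
# R&D expenditure (X) and annual profit (Y), both in $ million, from Example [1].
x = [5, 11, 4, 5, 3, 2]
y = [31, 40, 30, 34, 25, 20]


def sum_squared_errors(a, b):
    """Sum of squared vertical distances from the points to the line Y = a + bX."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# The least squares line for these data, plus two arbitrary alternatives (assumptions).
candidates = {
    "Y = 20 + 2X (least squares)": (20, 2),
    "Y = 18 + 2.5X": (18, 2.5),
    "Y = 22 + 1.5X": (22, 1.5),
}

sse = {label: sum_squared_errors(a, b) for label, (a, b) in candidates.items()}
for label, value in sse.items():
    print(f"{label}: SSE = {value}")
```

Both alternative lines pass close to the points, yet each leaves a larger sum of squared errors than the least squares line.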
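Best fit line to the data
LEAST SQUARES LINE
A least squares line is described in terms of its Y-intercept
(the height at which it intercepts the Y-axis) and its
slope (the angle of the line). The line can be expressed
by the following relation:
$$Y = a + bX \quad \text{(estimated regression of Y on X)}$$
where
► b = slope of the line
$$b = \frac{S_{XY}}{S_X^2}$$
► a = intercept of the line
$$a = \bar{Y} - b\bar{X}$$
For the R&D data (n = 6):
$$S_{XY} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{n-1} = \frac{1}{n-1}\left[\sum XY - \frac{\sum X \sum Y}{n}\right] = \frac{1}{5}\left[1000 - \frac{(30)(180)}{6}\right] = 20$$
$$S_X^2 = \frac{\sum (X-\bar{X})^2}{n-1} = \frac{1}{n-1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right] = \frac{1}{5}\left[200 - \frac{(30)^2}{6}\right] = 10$$
$$\bar{X} = 5, \qquad \bar{Y} = 30$$
$$b = \frac{20}{10} = 2, \qquad a = 30 - 2(5) = 20$$
so the fitted line is Y = 20 + 2X.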
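The slope and intercept formulas can be checked numerically. The following is a small sketch in plain Python that applies the shortcut forms of S_XY and S²_X to the Example [1] data; the variable names are chosen here for illustration.

```python
# Example [1] data: R&D expenditure (X) and annual profit (Y), in $ million.
x = [5, 11, 4, 5, 3, 2]
y = [31, 40, 30, 34, 25, 20]
n = len(x)

# Shortcut forms of the sample covariance and the sample variance of X.
s_xy = (sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)
s_xx = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

b = s_xy / s_xx                    # slope
a = sum(y) / n - b * sum(x) / n    # intercept: Y-bar minus b times X-bar

print(f"S_xy = {s_xy}, S_x^2 = {s_xx}")
print(f"Y = {a} + {b}X")
```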
Y=20+2X
► The value b = 2 indicates that the annual
profit is expected to increase by $2 million, on
average, with each $1 million increase in
R&D expenditure.
► The value a = 20 indicates that the annual
profit is $20 million when X = 0, i.e., without R&D
expenditure, but this interpretation is not
always valid.
Regression results are valid only within the scope of the
data, i.e., the experimental region.
Measuring the reliability of
the estimating equation
[Figure: expenditure vs. annual profit, showing actual and fitted values of profit against expenditure, with the fitted line y = 2x + 20 and R² = 0.8264]
The observed values of (X, Y) do not all fall on the regression line but scatter
about it. The degree of scatter of the observed values about the regression
line is measured by the standard error of estimate, also called the standard
error of regression, denoted by Se.
To measure the accuracy or reliability of the estimated regression, we
compute the standard error of estimate. It measures the variability of the
observed points about the regression line. A small variation indicates that
the estimated regression is adequate.
Total variation = Explained variation + Unexplained variation
(unexplained variation is variation due to unknown factors)
$$\text{Total variation} = (n-1)\,S_Y^2 = 242$$
$$\text{Explained variation} = b\,(n-1)\,S_{XY} = 2(5)(20) = 200$$
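As a worked step, the unexplained variation follows by subtraction, and the standard error of estimate can be computed from it. The divisor n − 2 is the usual convention for simple linear regression and is an assumption here, since the slides do not state the formula for Se.

$$\text{Unexplained variation} = 242 - 200 = 42$$
$$S_e = \sqrt{\frac{\text{Unexplained variation}}{n-2}} = \sqrt{\frac{42}{4}} \approx 3.24$$

Under that convention, observed profits scatter about the fitted line by roughly $3.24 million.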
Goodness of Fit
A commonly used measure of the goodness of fit of a
linear model is R², called the coefficient of determination.
If all the observations fall on the regression line, R² is
1. If there is no linear relationship between Y and X, R² is 0.
The coefficient of determination tells us the proportion
of variation in the dependent variable explained by the
independent variable.
Coefficient of determination (R²) = (Explained variation / Total variation) × 100 = (200/242) × 100 ≈ 83%
The higher the coefficient of determination, the better the
regression function explains the observed values. The value of
R² indicates that about 83% of the variation in the dependent
variable is explained by the linear relationship with X; the
remainder is due to other, unknown factors.
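The same figure can be reproduced directly from the data. The sketch below assumes the fitted line Y = 20 + 2X from Example [1] and computes the explained variation as total minus unexplained; the variable names are illustrative.

```python
# Example [1] data: R&D expenditure (X) and annual profit (Y), in $ million.
x = [5, 11, 4, 5, 3, 2]
y = [31, 40, 30, 34, 25, 20]

y_bar = sum(y) / len(y)
fitted = [20 + 2 * xi for xi in x]    # fitted values from Y = 20 + 2X

# Total variation of Y about its mean, and the part left over after the fit.
total = sum((yi - y_bar) ** 2 for yi in y)
unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
explained = total - unexplained

r_squared = explained / total
print(f"total = {total}, explained = {explained}, unexplained = {unexplained}")
print(f"R^2 = {r_squared:.4f}")
```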
Estimation in regression
(Predicting unknown value of Y
from known value of X)
► Estimate the profit of a firm whose research and
development expenditure is $8 million.
Put X = 8 in the estimated equation:
Y = 20 + 2(8) = 36, i.e., $36 million
Example [2]:
Find the least squares regression line for the data on
incomes (in hundreds of dollars) and food expenditures of
seven households.
Scatter Diagram
A plot of paired observations is called a scatter diagram.
[Figure: scatter diagram of food expenditure versus income]
Scatter diagram and straight lines.
[Figure: scatter diagram of food expenditure versus income with candidate straight lines]
[Figure: regression line of food expenditure on income, with an error e marked]
Regression Analysis
$$SSE = \sum e^2 = \sum (y - \hat{y})^2$$
The values of a and b that give the minimum SSE
are called the least squares estimates of A and
B, and the regression line obtained with these
estimates is called the least squares line.
The Least Squares Line
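For the least squares regression line ŷ = a + bx,
$$b = \frac{S_{xy}}{S_x^2} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$
where
$$S_{xy} = \frac{1}{n-1}\left[\sum xy - \frac{(\sum x)(\sum y)}{n}\right] \quad \text{and} \quad S_x^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]$$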
Solution
$$\sum x = 212, \qquad \sum y = 64$$
$$\bar{x} = \sum x / n = 212/7 = 30.2857$$
$$\bar{y} = \sum y / n = 64/7 = 9.1429$$
$$S_{xy} = \frac{1}{n-1}\left[\sum xy - \frac{(\sum x)(\sum y)}{n}\right] = \frac{1}{6}\left[2150 - \frac{(212)(64)}{7}\right] = 35.286$$
$$S_x^2 = \frac{1}{n-1}\left[\sum x^2 - \frac{(\sum x)^2}{n}\right] = \frac{1}{6}\left[7222 - \frac{(212)^2}{7}\right] = 133.571$$
$$b = \frac{S_{xy}}{S_x^2} = \frac{35.286}{133.571} = 0.2642$$
$$a = \bar{y} - b\bar{x} = 9.1429 - (0.2642)(30.2857) = 1.1414$$
Fitted line: ŷ = 1.1414 + 0.2642x
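The solution needs only the summary sums, so it can be sketched in a few lines of Python. The sums below are the ones quoted in the solution; the variable names are illustrative. Because the code carries the slope at full precision, its intercept comes out near 1.1422 rather than the 1.1414 obtained by hand with b rounded to 0.2642.

```python
# Summary statistics quoted in the solution (seven households).
n = 7
sum_x, sum_y = 212, 64
sum_xy, sum_x2 = 2150, 7222

# Sample covariance of x and y, and sample variance of x (shortcut forms).
s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
s_xx = (sum_x2 - sum_x ** 2 / n) / (n - 1)

b = s_xy / s_xx                  # slope
a = sum_y / n - b * sum_x / n    # intercept

print(f"S_xy = {s_xy:.4f}, S_x^2 = {s_xx:.4f}")
print(f"y-hat = {a:.4f} + {b:.4f}x")
```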
Error of prediction.
ŷ = 1.1414 + 0.2642x
[Figure: regression line of food expenditure on income; for one household, Predicted = $1038.84, Actual = $900, Error e = −$138.84]
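A worked version of this error, assuming the household in the figure has an income of x = 35 (that is, $3,500). The slides do not state the income; it is inferred here because the fitted line gives the quoted prediction at that value.

$$\hat{y} = 1.1414 + 0.2642(35) = 10.3884 \text{ hundred dollars} = \$1038.84$$
$$e = y - \hat{y} = 900 - 1038.84 = -138.84$$

The negative sign means the line overestimates this household's food expenditure.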
Interpretation of a and b
ŷ = 1.1414 + .2642 X
Interpretation of a
Consider a household with zero income:
ŷ = 1.1414 + 0.2642(0) = 1.1414 hundred dollars
Thus, a household with no income is expected
to spend $114.14 per month on food.
Goodness of Fit
R² = 92%
The value of R² indicates that about 92% of the
variation in the dependent variable is
explained by the linear relationship
with X; the remainder is due to
other, unknown factors.
[Figure: two regression lines, one with negative slope (b < 0) and one with positive slope (b > 0)]
Example [3]:
A random sample of eight drivers insured with a company and
having similar auto insurance policies was selected. The following
table lists their driving experience (in years) and monthly auto
insurance premiums.

Driving experience (years)    Monthly auto insurance premium ($)
 5                            64
 2                            87
12                            50
 9                            71
15                            44
 6                            56
25                            42
16                            60
b) Plot the scatter diagram and identify the
nature and strength of relationship.
[Figure: scatter diagram with driving experience on the horizontal axis]
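c)
$$\bar{x} = \sum x / n = 90/8 = 11.25$$
$$\bar{y} = \sum y / n = 474/8 = 59.25$$
$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{(90)(474)}{8} = -593.5$$
$$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{(90)^2}{8} = 383.5$$
$$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 29{,}642 - \frac{(474)^2}{8} = 1557.5$$
$$b = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5}{383.5} = -1.5476$$
$$a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605$$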
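These sums of squares and the resulting coefficients can be reproduced from the table in a short Python sketch. The list names are illustrative; the last decimal of the intercept may differ slightly from the hand calculation, which uses b rounded to four places.

```python
# Example [3] data: driving experience (years) and monthly premium ($).
experience = [5, 2, 12, 9, 15, 6, 25, 16]
premium = [64, 87, 50, 71, 44, 56, 42, 60]
n = len(experience)

sum_x, sum_y = sum(experience), sum(premium)

# Corrected sums of squares and cross products.
ss_xy = sum(x * y for x, y in zip(experience, premium)) - sum_x * sum_y / n
ss_xx = sum(x ** 2 for x in experience) - sum_x ** 2 / n
ss_yy = sum(y ** 2 for y in premium) - sum_y ** 2 / n

b = ss_xy / ss_xx                # slope
a = sum_y / n - b * sum_x / n    # intercept

print(f"SS_xy = {ss_xy}, SS_xx = {ss_xx}, SS_yy = {ss_yy}")
print(f"y-hat = {a:.4f} + ({b:.4f})x")
```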
d) Interpret the meaning of the values
of a and b calculated
f) Calculate the coefficient of
determination
R² = 59%
59% of the total variation in insurance
premiums is explained by years of driving
experience and 41% is due to other
unknown factors.
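One way to obtain this figure from the sums of squares computed in part c), using the identity that holds for simple linear regression:

$$R^2 = \frac{SS_{xy}^2}{SS_{xx}\,SS_{yy}} = \frac{(-593.5)^2}{(383.5)(1557.5)} = \frac{352242.25}{597301.25} \approx 0.59$$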
Predict the monthly auto insurance premium for a
driver with 10 years of driving experience.
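Substituting x = 10 into the fitted line ŷ = 76.6605 − 1.5476x from part c) gives a worked answer:

$$\hat{y} = 76.6605 - 1.5476(10) = 61.1845$$

so the predicted premium is about $61.18 per month. Since 10 years lies within the observed range of experience (2 to 25 years), the prediction stays inside the experimental region.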
Regression with more than one independent variable
Example: The following information has been gathered
from a random sample of apartment renters in a city. We
are trying to predict rent (in dollars per month) based on
the size of the apartment (number of rooms) and the
distance from downtown (in miles).

Rent ($) [Y]    Number of rooms [X1]    Distance (miles) [X2]
 360            2                        1
1000            6                        1
 450            3                        2
 525            4                        3
 350            2                       10
 300            1                        4
Regression equation:
RENT = 96.458 +136.485 NUM_ROOM –2.403 DISTANCE
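The slides quote the fitted equation without showing how it was obtained. The sketch below is one possible route, not necessarily the one used for the slides: it solves the two normal equations for the slopes in plain Python using centered sums and Cramer's rule. The helper names `mean` and `centered_cross_product` are hypothetical.

```python
# Apartment data from the example.
rent = [360, 1000, 450, 525, 350, 300]    # Y, dollars per month
rooms = [2, 6, 3, 4, 2, 1]                # X1, number of rooms
distance = [1, 1, 2, 3, 10, 4]            # X2, miles from downtown


def mean(values):
    return sum(values) / len(values)


def centered_cross_product(u, v):
    """Sum of (u - mean(u)) * (v - mean(v))."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


s11 = centered_cross_product(rooms, rooms)
s22 = centered_cross_product(distance, distance)
s12 = centered_cross_product(rooms, distance)
s1y = centered_cross_product(rooms, rent)
s2y = centered_cross_product(distance, rent)
syy = centered_cross_product(rent, rent)

# Solve the two normal equations for the slopes by Cramer's rule.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(rent) - b1 * mean(rooms) - b2 * mean(distance)

# Explained variation divided by total variation.
r_squared = (b1 * s1y + b2 * s2y) / syy

print(f"RENT = {b0:.3f} + {b1:.3f} NUM_ROOM + ({b2:.3f}) DISTANCE")
print(f"R^2 = {r_squared:.4f}")
```

With only two regressors the normal equations reduce to a 2×2 system, which is why Cramer's rule is enough here; a larger model would normally be handed to a linear algebra routine.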
Goodness of fit
► R² = 92%
A high value of R² indicates that much of the
variation in rent is explained by the
regressors: number of rooms and distance
from downtown.
CORRELATION
Correlation can be defined as the degree
of association (relationship) between two or
more variables. For example:
Marks of students in physics are associated with their
marks in mathematics.
The cost of a commodity in the market is related to
the quantity of the commodity available for sale in the
market.
Types of correlation
(number of variables)
► Simple Correlation
The degree of relationship between two variables is
called simple correlation.
► Multiple Correlation
The degree of relationship connecting three or more
variables is called multiple correlation.
Scatter plot
► A plot of one variable against another is called a
scatter plot
► The scatter plot indicates
►The nature of the relationship (positive, negative, or none)
►The strength of the relationship (strong, moderate, or weak)
Types of correlation
(direction of change)
► Positive Correlation
When variables tend to change together in
the same direction, e.g., the quantity of a
commodity supplied and its price
► Negative Correlation
When variables tend to change in opposite
directions,
e.g., the quantity demanded and the price of a
normal good are negatively correlated
► Uncorrelated
Two variables are uncorrelated when they
tend to change with no connection to
each other, e.g., height of people and
steel production.
Possible patterns in Scatter plot
Perfect positive linear correlation: r = 1
Strong positive linear correlation: r is close to 1
Weak positive linear correlation: r is close to 0
No correlation: r = 0
Measurement of correlation
► We can determine the kind of correlation between two variables by
direct observation of the scatter plot
► If the points lie close to the line, the correlation is strong.
► The inspection of scatter diagram gives only a rough idea about the
relationship between the two variables
• For a precise quantitative measurement of the degree of correlation
between two variables we use a quantity which is called correlation
coefficient.
• Simple linear correlation coefficient: It is used to
measure the strength of the linear relationship between two variables.
$$r = \frac{S_{XY}}{\sqrt{S_X^2\,S_Y^2}}, \qquad -1 \le r \le +1$$
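A small Python sketch of the simple linear correlation coefficient, applied to the R&D data of Example [1]. The function name `pearson_r` is chosen here for illustration. Its square should agree with the R² of about 0.83 reported for that example.

```python
from math import sqrt


def pearson_r(x, y):
    """Simple linear correlation coefficient r = S_xy / sqrt(S_x^2 * S_y^2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    return s_xy / sqrt(s_xx * s_yy)


# R&D expenditure and annual profit ($ million) from Example [1].
expenditure = [5, 11, 4, 5, 3, 2]
profit = [31, 40, 30, 34, 25, 20]

r = pearson_r(expenditure, profit)
print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")
```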
Limitations
► The simple correlation coefficient measures only the linear
relationship between variables, i.e., even if r = 0 the variables may be
related nonlinearly
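A made-up illustration of this limitation (the data are not from the slides): Y is determined exactly by X through Y = X², yet the linear correlation coefficient is zero because the X values are symmetric about zero.

```python
from math import sqrt

# Hypothetical data: an exact nonlinear relationship, y = x^2.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# The common 1/(n-1) factors cancel in r, so plain sums are used here.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / sqrt(s_xx * s_yy)
print(f"r = {r}")
```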
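$$S_{XY} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{n-1} = \frac{1}{n-1}\left[\sum XY - \frac{\sum X \sum Y}{n}\right] = \frac{1}{9}\left[8520 - \frac{(110)(610)}{10}\right] = 201.11$$
$$S_X^2 = \frac{\sum (X-\bar{X})^2}{n-1} = \frac{1}{n-1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right] = \frac{1}{9}\left[1540 - \frac{(110)^2}{10}\right] = 36.67$$
$$S_Y^2 = \frac{\sum (Y-\bar{Y})^2}{n-1} = \frac{1}{n-1}\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right] = \frac{1}{9}\left[47700 - \frac{(610)^2}{10}\right] = 1165.56$$
For the R&D data of Example [1] (n = 6):
$$S_Y^2 = \frac{1}{n-1}\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right] = \frac{1}{5}\left[5642 - \frac{(180)^2}{6}\right] = 48.4$$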
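The three quantities for the n = 10 data can be combined into the correlation coefficient r = S_XY / sqrt(S²_X · S²_Y). The value of r itself is not in the recovered slide text, so the sketch below simply applies that formula to the quoted sums; variable names are illustrative.

```python
from math import sqrt

# Summary sums quoted in the calculation above (n = 10 pairs).
n = 10
sum_x, sum_y = 110, 610
sum_xy, sum_x2, sum_y2 = 8520, 1540, 47700

s_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
s_xx = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s_yy = (sum_y2 - sum_y ** 2 / n) / (n - 1)

r = s_xy / sqrt(s_xx * s_yy)
print(f"S_xy = {s_xy:.2f}, S_x^2 = {s_xx:.2f}, S_y^2 = {s_yy:.2f}")
print(f"r = {r:.4f}")
```

The result is close to +1, which would indicate a strong positive linear relationship for these data.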
Amazing correlations
High positive correlation between
► Temperature in Faisalabad and employment rate
► Import of banana and Divorce rate
► Strength of police force and number of crimes
The following situations may bring
about a high correlation:
X is the cause of Y
Y is the cause of X
There is a third factor Z that affects both X and Y such
that they show a close relation
The correlation between X and Y may be due to
chance.
Example: Calculate the rank correlation coefficient between
interview grade and test score.
Student | Interview grade | Test score | Rank (interview grade) | Rank (test score) | D | D²
$$r' = 1 - \frac{6\sum D^2}{n(n^2-1)} = 1 - \frac{6(36.5)}{5(5^2-1)} = -0.825$$
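The rank correlation formula is easy to sketch as a small Python function. It takes the sum of squared rank differences and the number of pairs as inputs, using the two values quoted in the calculation (ΣD² = 36.5, n = 5); the function name `rank_correlation` is hypothetical.

```python
def rank_correlation(sum_d_squared, n):
    """Rank correlation r' = 1 - 6 * sum(D^2) / (n * (n^2 - 1))."""
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Values quoted in the example: sum of squared rank differences and sample size.
r_rank = rank_correlation(36.5, 5)
print(f"r' = {r_rank:.3f}")
```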
The negative value indicates that there is no
agreement between two methods so these