Introduction to Linear Regression
• Regression is a tool to examine the association between a dependent variable (Y) and one or more independent variables (X1, X2, …, Xn) using a mathematical equation.
• The motivation for using the technique:
  • Forecast the value of the dependent variable (Y) from the values of the independent variables (X1, X2, …, Xk).
  • Analyze the specific relationships between the independent variables and the dependent variable, to identify variables that may be used in model building.
• The relationship can be linear or non-linear.
Introduction to Linear Regression
• A dependent variable (response variable, also called an outcome variable) measures an outcome of a study.
Regression vs. Correlation
Introduction to Linear Regression
Dependent and Independent Variables
Regression Nomenclature
Ref: F. Galton, "Regression towards mediocrity in hereditary stature", Journal of the Anthropological Institute, Vol. 15, 246–263, 1886
Types of Regression Models

Regression models can be classified by the number of explanatory variables and by the functional form:

Regression Models
• Simple (1 explanatory variable)
  • Linear
  • Non-Linear
• Multiple (2+ explanatory variables)
  • Linear
  • Non-Linear
Which is Linear?
1) Yi = β0 + β1X1i² + εi
2) Yi = β0 e^(β1X1i) + εi
3) Yi = β0 + β1X1 + β2X1X2 + β3X2² + εi
4) Yi = β0 + X1/(1 + β1) + εi
5) Yi = β0 + β1 ln X1 + εi
(Models 1, 3 and 5 are linear in the parameters; models 2 and 4 are not.)
Types of Regression
• Simple linear regression – a regression model between two variables:
  Y = β0 + β1X1 + ε
• Multiple linear regression – a regression model with more than one independent variable:
  Y = β0 + β1X1 + β2X2 + … + βkXk + ε
• Nonlinear regression, e.g.:
  Y = β0 + (β1 + β3X2)/(1 + β2X1) + ε
Linear Regression (linear in the parameters)
Y = β0 + β1X1²
Y = β0 + β1 ln X1
Y = β0 + β1X1 + β2X1X2 + β3X2²
Non-Linear Regression (non-linear in the parameters)
Y = β0 + X1/(1 + β1)
Y = β0 e^(β1X1)
Define the Functional Form of the Relationship
For better predictive ability (model accuracy) it is important to specify the correct functional form between the dependent variable and the independent variable. Scatter plots may assist the modeller in choosing the right functional form.
• Linear relationship between X1 and Y1: Y1 = β0 + β1X1
• Log-linear relationship between X2 and Y2: ln Y2 = β0 + β1X2
The Model
y = β0 + β1x + ε
where
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (= Rise/Run)
ε = error variable
β0 and β1 are unknown population parameters and are therefore estimated from the data.
[Figure: a straight line with intercept β0 and slope β1 = Rise/Run]
Estimating the Coefficients
• The estimates are determined by
  • drawing a sample from the population of interest,
  • calculating sample statistics,
  • producing a straight line that cuts into the data.
[Figure: scatter of data points. Question: which line should be considered a good line?]
The Least Squares (Regression) Line
Let us compare two lines fitted to the points (1, 2), (2, 4), (3, 1.5) and (4, 3.2):
• Line 1 (predictions 1, 2, 3, 4, i.e. ŷ = x):
  Sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
• Line 2 (the horizontal line ŷ = 2.5):
  Sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
The smaller the sum of squared differences, the better the fit of the line to the data.
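These two sums of squared differences can be recomputed in a few lines of plain Python, a minimal sketch using the four data points and the two candidate prediction rules from the comparison above:

```python
# Data points (x, y) from the line-comparison example
points = [(1, 2.0), (2, 4.0), (3, 1.5), (4, 3.2)]

def sse(predict, pts):
    """Sum of squared differences between actual and predicted y."""
    return sum((y - predict(x)) ** 2 for x, y in pts)

# Line 1 predicts y = x (predictions 1, 2, 3, 4)
sse_line1 = sse(lambda x: x, points)
# Line 2 is the horizontal line y = 2.5
sse_line2 = sse(lambda x: 2.5, points)

print(round(sse_line1, 2))  # 7.89
print(round(sse_line2, 2))  # 3.99
```

The horizontal line wins here, which is exactly why "eyeballing" a line is unreliable and a formal minimization criterion is needed.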
Least Squares
• 'Best fit' means the differences between the actual Y values and the predicted Y values are a minimum.
• But positive differences offset negative ones, so we square the errors:
  Σi=1..n (Yi − Ŷi)² = Σi=1..n ε̂i²
• Least squares minimizes the sum of the squared differences (errors), the SSE.
Estimate the Regression Parameters
• The method of Ordinary Least Squares (OLS) is used to estimate the regression parameters.
• OLS fits the regression line through a set of data points such that the sum of the squared distances between the actual observations in the sample and the regression line is minimized (i.e., Σi (Yi − Ŷi)² is minimized).
The Estimated Coefficients
Setting the derivative of the SSE with respect to β1 to zero (after substituting β0 = Ȳ − β1X̄):
−2 Σi=1..n (XiYi − Xi(Ȳ − β1X̄) − β1Xi²) = 0
Σi XiYi − Ȳ Σi Xi + β1X̄ Σi Xi − β1 Σi Xi² = 0
β̂1 = Σi Xi(Yi − Ȳ) / Σi Xi(Xi − X̄)
The Estimated Coefficients
To calculate the estimates of the slope and intercept of the least squares line, use the formulas:
β̂0 = Ȳ − β̂1X̄
β̂1 = Σi Xi(Yi − Ȳ) / Σi Xi(Xi − X̄)
Least Squares Graphically
LS minimizes Σi=1..4 ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²
[Figure: for observation 2, Y2 = β̂0 + β̂1X2 + ε̂2; the residuals ε̂1 … ε̂4 are the vertical distances from the data points to the fitted line]
Coefficient Equations
• Prediction equation: ŷi = β̂0 + β̂1xi
• Sample slope: β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• Sample Y-intercept: β̂0 = ȳ − β̂1x̄
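These formulas are easy to check by hand. A minimal sketch applying them to the four points used in the earlier line-comparison example, (1, 2), (2, 4), (3, 1.5) and (4, 3.2):

```python
xs = [1, 2, 3, 4]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample slope: b1 = SSxy / SSxx
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = ss_xy / ss_xx

# Sample Y-intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(round(b1, 4), round(b0, 4))  # 0.11 2.4
```

So the least squares line for those four points is ŷ = 2.4 + 0.11x, and by construction no other straight line has a smaller sum of squared errors on them.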
Simple Linear Regression Model
Yi = β0 + β1Xi + εi
where
Yi = dependent variable
β0 = population Y-intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term
β0 + β1Xi is the linear component; εi is the random error component.
Simple Linear Regression Model
[Figure: the line Yi = β0 + β1Xi + εi, with intercept β0, slope β1, and εi the vertical distance between the observed value of Y at Xi and the line]
Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line:
Ŷi = b0 + b1Xi
where
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i
Assumptions
The method of least squares gives the best equation under the assumptions stated below:
• The regression model is linear in the regression parameters.
• The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic).
• The conditional expected value of the residuals, E(εi|Xi), is zero.
• In the case of time series data, residuals are uncorrelated, that is, Cov(εi, εj) = 0 for all i ≠ j.
• The residuals, εi, follow a normal distribution.
• The variance of the residuals, Var(εi|Xi), is constant for all values of Xi. When the variance of the residuals is constant for different values of Xi, it is called homoscedasticity. A non-constant variance of residuals is called heteroscedasticity.
Simple Linear Regression Example
Simple Linear Regression Example: Data

House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Simple Linear Regression Example: Scatter Plot
[Figure: scatter plot of house price ($1000s) against square feet]
Simple Linear Regression Example: Using the Excel Data Analysis Function
[Screenshots: Data → Data Analysis → choose Regression, then specify the Y and X input ranges]
Dataset 5_Sheet Real estate
• SLOPE function
• INTERCEPT function
• RSQ function
Simple Linear Regression Example: Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression   1   18934.9348  18934.9348  11.0848  0.01039
Residual     8   13665.5652  1708.1957
Total        9   32600.5000

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
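The same estimates can be reproduced without Excel. A minimal sketch in plain Python, applying the least squares formulas to the ten house-price observations and recovering the coefficients and ANOVA sums of squares:

```python
# House data from the example: price ($1000s) and square feet
Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Slope and intercept by least squares
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
     sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar

# Sums of squares for the ANOVA table
sst = sum((y - y_bar) ** 2 for y in Y)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
ssr = sst - sse

print(round(b0, 5), round(b1, 5))  # 98.24833 0.10977
print(round(ssr, 4), round(sse, 4), round(sst, 4))
```

The printed values match the Excel output above (SSR ≈ 18934.9348, SSE ≈ 13665.5652, SST = 32600.5).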
Simple Linear Regression Example: Graphical Representation
[Figure: scatter plot of the data with the fitted line house price = 98.24833 + 0.10977 (square feet)]
Coefficient of Determination, r²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
• The coefficient of determination is also called r-squared and is denoted r².
  r² = SSR/SST = regression sum of squares / total sum of squares
  note: 0 ≤ r² ≤ 1
Simple Linear Regression Example: Coefficient of Determination, r², in Excel
From the ANOVA table:
r² = SSR/SST = 18934.9348/32600.5000 = 0.58082
58.08% of the variation in house prices is explained by variation in square feet.
Inferences About the Slope
The standard error of the slope is estimated by
Sb1 = SYX/√SSX = SYX/√Σ(Xi − X̄)²
where
Sb1 = estimate of the standard error of the slope
SYX = standard error of the estimate
d.f. = n − 2
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0
From the Excel output, b1 = 0.10977 and Sb1 = 0.03297, so
tSTAT = (b1 − β1)/Sb1 = (0.10977 − 0)/0.03297 = 3.32938
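The standard error of the slope and the t statistic can be computed directly from the data; a minimal sketch, reusing the ten house-price observations:

```python
import math

Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

ss_xx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / ss_xx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate: S_YX = sqrt(SSE / (n - 2))
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
s_yx = math.sqrt(sse / (n - 2))

# Standard error of the slope, then the t statistic for H0: beta1 = 0
s_b1 = s_yx / math.sqrt(ss_xx)
t_stat = (b1 - 0) / s_b1

print(round(s_b1, 5), round(t_stat, 4))  # approximately 0.03297 and 3.3294
```

These agree with the Excel output (Sb1 = 0.03297, tSTAT = 3.32938) up to rounding.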
Inferences About the Slope: t Test Example
H0: β1 = 0; H1: β1 ≠ 0
Test statistic: tSTAT = 3.329, d.f. = 10 − 2 = 8
With α/2 = .025 in each tail, the critical values are ±2.306.
Decision: Reject H0, since tSTAT = 3.329 > 2.306.
Inferences About the Slope: t Test Example
H0: β1 = 0; H1: β1 ≠ 0
From the Excel output, the p-value is 0.01039.
Decision: Reject H0, since p-value < α.
There is sufficient evidence that square footage affects house price.
F Test for Significance
• F test statistic: FSTAT = MSR/MSE
  where MSR = SSR/k and MSE = SSE/(n − k − 1)
  (k = number of independent variables in the regression model)
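Plugging in the ANOVA sums of squares from the house-price example (k = 1, n = 10) gives the F statistic directly; a minimal sketch:

```python
# ANOVA quantities from the house price example
ssr = 18934.9348
sse = 13665.5652
k, n = 1, 10

msr = ssr / k            # mean square regression
mse = sse / (n - k - 1)  # mean square error, 8 degrees of freedom
f_stat = msr / mse

print(round(mse, 4), round(f_stat, 4))  # approximately 1708.1957 and 11.0848
```

With one independent variable, FSTAT equals tSTAT² (11.0848 ≈ 3.32938²), so the two tests of β1 = 0 are equivalent in simple linear regression.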
F Test for Significance: Excel Output
FSTAT = MSR/MSE = 18934.9348/1708.1957 = 11.0848, with 1 and 8 degrees of freedom.
The p-value for the F test is the 'Significance F' value, 0.01039.

ANOVA
            df   SS          MS          F        Significance F
Regression   1   18934.9348  18934.9348  11.0848  0.01039
Residual     8   13665.5652  1708.1957
Total        9   32600.5000
F Test for Significance
H0: β1 = 0
H1: β1 ≠ 0
α = .05, df1 = 1, df2 = 8
Test statistic: FSTAT = MSR/MSE = 11.08
Critical value: F.05 = 5.32
Decision: Reject H0 at α = 0.05, since FSTAT = 11.08 > 5.32.
Conclusion: There is sufficient evidence that house size affects selling price.
Confidence Interval Estimate for the Slope

             Coefficients  Standard Error  t Stat   P-value  Lower 95%   Upper 95%
Intercept    98.24833      58.03348        1.69296  0.12892  -35.57720   232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374     0.18580

Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.74 and $185.80 per square foot of house size.
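The confidence limits in the table follow from b1 ± t(0.025, n−2) · Sb1. A minimal sketch, with the critical value t(0.025, 8) = 2.306 hardcoded from a t table rather than computed:

```python
# 95% CI for the slope: b1 +/- t(0.025, n-2) * S_b1
b1 = 0.10977
s_b1 = 0.03297
t_crit = 2.306  # t(0.025, 8), from a t table

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 5), round(upper, 5))  # approximately 0.03374 and 0.18580
```

Because the interval excludes 0, this is consistent with rejecting H0: β1 = 0 at the 5% level.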
t Test for a Correlation Coefficient
• Hypotheses:
  H0: ρ = 0 (no correlation between X and Y)
  H1: ρ ≠ 0 (correlation exists)
• Test statistic (with n − 2 degrees of freedom):
  tSTAT = (r − ρ)/√((1 − r²)/(n − 2))
  where r = +√r² if b1 > 0 and r = −√r² if b1 < 0
t Test for a Correlation Coefficient
tSTAT = (r − ρ)/√((1 − r²)/(n − 2)) = (.762 − 0)/√((1 − .762²)/(10 − 2)) = 3.329
d.f. = 10 − 2 = 8; with α/2 = .025 in each tail, the critical values are ±2.306.
Decision: Reject H0.
Conclusion: There is evidence of a linear association at the 5% level of significance.

(Aside: the mean squared error of the predictions is MSE = (1/n) Σj=1..n (yj − ŷj)².)
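The tSTAT above can be verified in a few lines, using the full-precision correlation r = 0.76211 (the Multiple R from the Excel output):

```python
import math

r = 0.76211  # sample correlation; positive because b1 > 0
n = 10

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t_stat, 3))  # approximately 3.329
```

Note this is the same value as the t statistic for the slope, which is no coincidence: in simple linear regression the two tests are algebraically identical.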
Dataset 5_Sheet Iris data
• Is there statistical evidence to support that petal width and petal length are related?
• Based upon the linear models you generated, which pair of features appears to be most predictive for one another?
Descriptive Analytics in Python
• # Pandas is an open-source, BSD-licensed library
• # It provides data analysis tools for the Python programming language
• # Using the import keyword, pandas is imported under the alias pd
import pandas as pd
• # Use the pd.read_csv method to read the file and load it into a DataFrame
data = pd.read_csv('Sa_Data.csv')
• # To display all the column names
list(data.columns)
• # To display the top rows of the dataset
data.head()
Descriptive Analytics in Python
• # To display the size of the DataFrame
data.shape
• # For a more detailed summary of the dataset, use info()
data.info()
• # The row and column indexes always start with the value 0
• # To display the first 5 rows of the DataFrame
data[0:5]
• # Negative indexing is an excellent feature in Python
• # To display the last 6 rows of the DataFrame
data[-6:]
Descriptive Analytics in Python
• # To find the frequency/occurrences of each unique value in the column YearsExperience
data.YearsExperience.value_counts()
• # Let us take another dataset, IPL Data 2015 CSV. To find the frequency/occurrences of players from each country,
• # use value_counts as shown below
data_IPL = pd.read_csv('IPL Data 2015 CSV.csv')
data_IPL.Country.value_counts()
• # Cross-tabulation helps find occurrences of combinations of values in two columns
pd.crosstab(data_IPL.L25, data_IPL.Country)
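The value_counts/crosstab pattern can be demonstrated without the course CSV files. A minimal sketch on a hypothetical mini-roster (the column names Country and L25 mirror the real dataset; the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-roster standing in for 'IPL Data 2015 CSV.csv'
df = pd.DataFrame({
    'Country': ['India', 'India', 'Australia', 'India', 'Australia'],
    'L25':     ['Y', 'N', 'N', 'Y', 'N'],  # e.g. an under-25 indicator
})

# Frequency of each unique value in one column
counts = df.Country.value_counts()

# Cross-tabulation of two columns
ct = pd.crosstab(df.L25, df.Country)

print(counts['India'])           # 3
print(ct.loc['N', 'Australia'])  # 2
```

Each cell of the crosstab is the count of rows with that (L25, Country) combination.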
Descriptive Analytics in Python
• # Read the IPL file with Age
data_AgeIPL = pd.read_csv('IPL AgeData 2015 CSV.csv')
pd.crosstab(data_AgeIPL.AGE, data_AgeIPL.Country)
• # Drawing plots: import matplotlib, a Python 2D plotting library
• # Seaborn is a library for making elegant charts in Python
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
• # To draw a bar chart of Age against Sold Price
sn.barplot(x='AGE', y='Sold Price', data=data_AgeIPL);
Descriptive Analytics in Python
• # A third variable, hue, is also introduced. Hue is the colour code w.r.t. the third variable
sn.barplot(x='AGE', y='Sold Price', hue='PLAYING ROLE', data=data_AgeIPL);
• # To draw a box plot, call boxplot() of the seaborn library
box = sn.boxplot(data_AgeIPL['Sold Price']);
• # To compare Sold Price across the different Playing Roles
sn.boxplot(x='PLAYING ROLE', y='Sold Price', data=data_AgeIPL);
• # Scatter plot between Sixers and Sold Price
IPL_Batsman = data_AgeIPL[data_AgeIPL['PLAYING ROLE'] == 'Batsman']
plt.scatter(x=IPL_Batsman.SIXERS, y=IPL_Batsman['Sold Price']);
Descriptive Analytics in Python
• # To fit a line representing the direction of the relationship, use regplot() of seaborn
sn.regplot(x='SIXERS', y='Sold Price', data=IPL_Batsman);
• # Let us assume that the influential features may be Strike Rate, Average, Sixers and Sold Price
• # Correlation values for the features can be computed using corr()
influential_features = ['SR -B', 'AVE', 'SIXERS', 'Sold Price']
data_AgeIPL[influential_features].corr()
• # Heatmap of the correlation values
sn.heatmap(data_AgeIPL[influential_features].corr(), annot=True);
Descriptive Analytics in Python
data = pd.read_excel('Real_Estate.xls')
• # Correlation between variables
data.corr(method='pearson')
Influence During Regression – Outlier Analysis
• An outlier is an observation point that is distant from other observations.
• Outliers are data with an extreme value of the response variable (Y).
• The residual (Yi − Ŷi) shows a large deviation.
• The presence of an outlier will have a significant influence on the values of the regression coefficients.
• Leverage points are data with an extreme value of the predictor variable (X).
• Some combination of extreme Y (outlier) and extreme X (leverage) makes a data point influential.
• An influential data point: removing the data point substantially changes the regression results. How do we define "substantial"?
Distance measures
• Z score
• Mahalanobis distance
• Cook’s distance
• Leverage values
• DFBeta and DFFit Values
Z-score
The standardized distance of an observation from its mean value:
Zi = (Yi − Ȳ)/σY
Ref: Tabachnick, B.G., & Fidell, L.S. (2007). Using Multivariate Statistics (5th ed.). Boston: Pearson, p. 74.
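A minimal sketch of the z-score computation on a small sample (population standard deviation, invented numbers chosen so the arithmetic is exact):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)  # 5.0

# Population standard deviation
sigma = math.sqrt(sum((y - mean) ** 2 for y in data) / len(data))  # 2.0

z_scores = [(y - mean) / sigma for y in data]
print(z_scores[-1])  # 2.0 -- the value 9 lies two SDs above the mean
```

A common screening rule flags observations with |Z| greater than about 3 as potential outliers.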
Cook's Distance
A measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients. Cook's distance is given by
Di = Σj=1..n (Ŷj − Ŷj(i))² / ((k + 1) MSE)
where Ŷj(i) is the fitted value for observation j when observation i is excluded from the regression.
• A value of more than 4/n (or, more conservatively, 1) is commonly treated as indicating a highly influential observation.
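The definition can be implemented directly by refitting the regression with each observation left out in turn. A minimal sketch for simple linear regression (k = 1), on an invented dataset whose last point departs sharply from the linear trend of the rest:

```python
def ols(xs, ys):
    """Simple linear regression coefficients (b0, b1) by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b1 * x_bar, b1

def cooks_distances(xs, ys):
    n, k = len(xs), 1
    b0, b1 = ols(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - k - 1)
    ds = []
    for i in range(n):
        # Refit with observation i removed
        a0, a1 = ols(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # D_i = sum_j (Yhat_j - Yhat_j(i))^2 / ((k + 1) * MSE)
        ds.append(sum((f - (a0 + a1 * x)) ** 2 for x, f in zip(xs, fitted))
                  / ((k + 1) * mse))
    return ds

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 20.0]  # last point breaks the y = 2x trend
d = cooks_distances(xs, ys)
print(d.index(max(d)))  # 4 -- the outlying point is the most influential
```

In practice statsmodels exposes the same quantity via its influence diagnostics, but the leave-one-out loop makes the definition concrete.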
Dataset 7_Sheet
• Is there statistical evidence to support that the bill amount depends on the weight of the patient?
• What will be the average difference in the bill amount between patients with weights 55 and 60?
• It is claimed that a patient weighing 60 kg is likely to incur at least INR 500 more than a patient weighing 55 kg. Do you believe this statement? Justify your argument statistically at the 5% level of significance.
Linear Regression in Python
• # statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration
import statsmodels.api as sm
import pandas as pd
data = pd.read_csv('IPL AgeData 2015 CSV.csv')
Linear Regression in Python
• # Develop a simple linear regression between the variables 'SIXERS' and 'Sold Price'
X = sm.add_constant(data['SIXERS'])
Y = data['Sold Price']
• # To split the dataset randomly into training and test datasets, use the train_test_split() function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split
• # 80% of the data is used for training and the remaining 20% is used for testing
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, train_size=0.8, random_state=100)
• # The fit() method on OLS() estimates the parameters and returns the model information to the variable model2
model2 = sm.OLS(train_Y, train_X).fit()
print(model2.params)
• # The summary2() function prints the model summary for diagnosing a regression model
model2.summary2()
Linear Regression in Python
Output of print(model2.params):
const     366848.000678
SIXERS     11003.343655
dtype: float64
Linear Regression in Python
• Omnibus/Prob(Omnibus) –
  • A test of the skewness and kurtosis of the residuals.
  • A value close to zero would indicate normality.
  • Prob(Omnibus) gives the probability that the residuals are normally distributed; a value close to 1 is expected.
  • In this case the residuals are not normal.
• Skew – a measure of data symmetry. Expect a value close to zero, indicating the residual distribution is normal.
• Kurtosis – a measure of "peakiness", or curvature, of the data. Higher peaks lead to greater kurtosis. Greater kurtosis can be interpreted as a tighter clustering of residuals around zero, implying a better model with few outliers.
Linear Regression in Python
• Durbin-Watson – tests for autocorrelation in the residuals. The statistic ranges from 0 to 4; a value close to 2 indicates no autocorrelation, so expect a value between roughly 1 and 3.
• Jarque-Bera (JB)/Prob(JB) – like the Omnibus test in that it tests both skew and kurtosis. This test serves as confirmation of the Omnibus test.
• Condition Number – measures the sensitivity of a function's output to changes in its input. When we have multicollinearity, we can expect much larger fluctuations in response to small changes in the data; hence we hope to see a relatively small number, something below 30.
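The Durbin-Watson statistic itself is simple to compute from a residual series: it is the sum of squared successive differences divided by the sum of squared residuals. A minimal sketch on two invented residual patterns:

```python
def durbin_watson(residuals):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2; values near 2 mean no autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating signs (negative autocorrelation) push DW toward 4
dw_alt = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Identical residuals (strong positive autocorrelation) push DW toward 0
dw_pos = durbin_watson([1.0, 1.0, 1.0, 1.0])

print(dw_alt, dw_pos)  # 3.0 0.0
```

statsmodels reports this value automatically in the regression summary; the hand computation just shows what the number measures.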
Linear Regression in Python
• # The normality of the residuals can be checked using a P-P plot
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
model2_residual = model2.resid
• # The ProbPlot() method of statsmodels draws the P-P plot
probplot = sm.ProbPlot(model2_residual)
plt.figure(figsize=(8, 6))
probplot.ppplot(line='45')
plt.show()