
Linear Regression

1
Introduction to Linear Regression
• Regression is a tool for examining the relationship between quantitative variables using a mathematical
equation between a dependent variable (Y) and one or more independent variables (X1, X2, …, Xn).
• The motivation for using the technique:
• Forecast the value of the dependent variable (Y) from the values of the
independent variables (X1, X2, …, Xk).
• Analyze the specific relationships between the independent variables
and the dependent variable, to identify variables that may be used in
model building.
• The relationship can be linear or non-linear.

2
Introduction to Linear Regression
• A dependent variable (response variable) "measures an outcome of a
study" (also called the outcome variable).

• An independent variable (explanatory variable) "explains changes in a
response variable".

• Regression often sets values of the explanatory variable to understand how it
affects the response variable (i.e., to predict the response variable).

• Regression is a type of supervised learning algorithm in Machine
Learning terminology.

• An important tool in Predictive Analytics.

3
Regression vs Correlation

• Regression is the study of the "existence of a relationship" between
two variables. The main objective is to estimate the change in the mean
value of the dependent variable for a given change in the independent variable.

• Correlation is the study of the "strength of the relationship" between two
variables.

4
Introduction to Linear Regression
Dependent and Independent Variables

• The terms dependent and independent do not necessarily imply a causal
relationship between the two variables.

• Regression is not designed to capture causality.

• The purpose of regression is to predict the value of the dependent variable given
the value(s) of the independent variable(s).

5
Regression Nomenclature

Dependent Variable          Independent Variable
Explained Variable          Explanatory Variable
Regressand                  Regressor
Predictand                  Predictor
Endogenous Variable         Exogenous Variable
Controlled Variable         Control Variable
Response Variable           Stimulus Variable
Target / Outcome Variable   Feature
6
Regression History
• Francis Galton was the first to apply regression.

• Claimed that the heights of children of tall parents
"regress towards the mean of that generation".

• Modern regression analysis was developed by R. A.
Fisher.
[Photo: Francis Galton]

Ref: F. Galton, "Regression towards mediocrity in hereditary stature", Journal of the Anthropological
Institute of Great Britain and Ireland, Vol. 15, 246–263, 1886.

7
Types of Regression Models

Regression Models
• 1 Explanatory Variable → Simple
    – Linear
    – Non-Linear
• 2+ Explanatory Variables → Multiple
    – Linear
    – Non-Linear

8–14
Which is Linear?

1) Yi = β0 + β1X1² + εi
2) Yi = β0 + e^(β1X1) + εi
3) Yi = β0 + β1X1 + β2X1X2 + β3X2² + εi
4) Yi = β0 + X1 / (1 + β1) + εi
5) Yi = β0 + β1 ln X1 + εi

15
Types of Regression
• Simple linear regression – a regression model between two variables:

    Y = β0 + β1X1 + ε

• Multiple linear regression – a regression model with more than one
independent variable:

    Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

• Non-linear regression, e.g.:

    Y = β0 + β1 / (1 + β2X1) + β3X2 + ε
16
Linear Regression

• The linearity condition in linear regression is defined with respect to the
regression coefficients (β0, β1) and not with respect to the
explanatory variables in the model.

• The following equations can therefore be treated as linear as far as the
regression coefficients are concerned:

    Y = β0 + β1X1² + ε
    Y = β0 + β1 ln X1 + ε
    Y = β0 + β1X1 + β2X1X2 + β3X2² + ε

17
Non-Linear Regression

• The following equations can be treated as examples of non-linear
regression models:

    Y = β0 + X1 / (1 + β1) + ε
    Y = β0 + e^(β1X1) + ε

• The relationship between the dependent variable Y and the
regression coefficient β1 can be converted into a linear regression
model by suitable transformations.
18
Framework for SLR model development

19
Define the Functional Form of Relationship

For better predictive ability (model accuracy) it is important to specify the correct
functional form between the dependent variable and the independent variable. Scatter
plots may assist the modeller to define the right functional form.

[Figures: scatter plot of a linear relationship between X1 and Y1, and of a log-linear relationship between X2 and Y2]

    Y1 = β0 + β1X1                ln Y2 = β0 + β1X2
20
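A minimal sketch (not from the original deck) of fitting the two functional forms in Python with statsmodels; the data here are synthetic, generated purely for illustration:

# Sketch: fitting a linear and a log-linear functional form on synthetic data
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = np.linspace(1, 10, 50)

# Linear relationship: Y1 = b0 + b1*X1 + noise
y1 = 3 + 2 * x + rng.normal(0, 1, size=x.size)
linear_fit = sm.OLS(y1, sm.add_constant(x)).fit()
print(linear_fit.params)            # estimates of b0 and b1

# Log-linear relationship: ln(Y2) = b0 + b1*X2 + noise, so regress ln(Y2) on X2
y2 = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.1, size=x.size))
loglinear_fit = sm.OLS(np.log(y2), sm.add_constant(x)).fit()
print(loglinear_fit.params)         # estimates of b0 and b1 on the log scale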
The Model

• The first-order linear model:   y = β0 + β1x + ε

    y  = dependent variable
    x  = independent variable
    β0 = y-intercept
    β1 = slope of the line (Rise/Run)
    ε  = error variable

β0 and β1 are unknown population parameters and are therefore estimated from the data.

[Figure: a straight line with intercept β0 and slope β1 = Rise/Run]
21
Estimating the Coefficients
• The estimates are determined by
• drawing a sample from the population of interest,
• calculating sample statistics,
• producing a straight line that runs through the data.

[Figure: scatter of data points]
Question: Which line is to be considered a good line?
22
The Least Squares (Regression) Line

A good line is one that minimizes the sum of squared differences


between the points and the line.

23
The Least Squares (Regression) Line
Let us compare two lines through the points (1, 2), (2, 4), (3, 1.5) and (4, 3.2).
The first line predicts the values 1, 2, 3 and 4; the second line is the horizontal line y = 2.5.

Sum of squared differences (line 1) = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Sum of squared differences (line 2) = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99

[Figure: the four data points with the two candidate lines]

The smaller the sum of squared differences, the better the fit of the
line to the data.

24
Least Squares
• 'Best Fit' means the differences between the actual Y values and the predicted
Y values are a minimum.
• But positive differences offset negative ones, so we square the errors:

    Σi (Yi − Ŷi)² = Σi ε̂i²      (i = 1, …, n)

• LS minimizes the Sum of the Squared Differences (errors) (SSE).

25–27
Estimate the Regression Parameters
• The method of Ordinary Least Squares (OLS) is used to estimate the
regression parameters.
• OLS fits the regression line through a set of data points such that the sum of the
squared distances between the actual observations in the sample and
the regression line is minimized (i.e., Σi (Yi − Ŷi)² is minimized).

• OLS provides the Best Linear Unbiased Estimate (BLUE).

• That is, E(β̂ − β) = 0, where β is the population parameter and β̂ is the estimated
parameter value from the sample.

28
The Estimated Coefficients

    SSE = Σi ε̂i² = Σi (Yi − β0 − β1Xi)²

    ∂SSE/∂β0 = ∂/∂β0 [ Σi (Yi − β0 − β1Xi)² ] = 0
    ⇒ −2 [ Σi Yi − nβ0 − β1 Σi Xi ] = 0
    ⇒ β0 = Ȳ − β1X̄

29
The Estimated Coefficients

    SSE = Σi ε̂i² = Σi (Yi − β0 − β1Xi)²

    ∂SSE/∂β1 = ∂/∂β1 [ Σi (Yi − β0 − β1Xi)² ] = 0
    ⇒ −2 Σi Xi (Yi − β0 − β1Xi) = 0

30
The Estimated Coefficients

Substituting β0 = Ȳ − β1X̄:

    −2 Σi ( XiYi − Xi(Ȳ − β1X̄) − β1Xi² ) = 0
    Σi XiYi − Ȳ Σi Xi + β1X̄ Σi Xi − β1 Σi Xi² = 0

    ⇒ β1 = Σi Xi(Yi − Ȳ) / Σi Xi(Xi − X̄)

31
The Estimated Coefficients
To calculate the estimates of the slope and intercept of the least
squares line, use the formulas:

    β̂0 = Ȳ − β̂1X̄

    β̂1 = Σ Xi(Yi − Ȳ) / Σ Xi(Xi − X̄)

32
Least Squares Graphically

LS minimizes Σi ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²

[Figure: scatter of four points with the fitted line Ŷi = β̂0 + β̂1Xi plotted against X;
the vertical distances ε̂1, ε̂2, ε̂3, ε̂4 are the residuals, e.g. Y2 = β̂0 + β̂1X2 + ε̂2]

33
Coefficient Equations
• Prediction equation:
    ŷi = β̂0 + β̂1xi
• Sample slope:
    β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• Sample Y-intercept:
    β̂0 = ȳ − β̂1x̄
34
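A minimal NumPy sketch (not part of the original slides) of these closed-form estimates, applied to the 10-house price/square-feet sample used later in this deck; the results should match the Excel output shown there (intercept ≈ 98.248, slope ≈ 0.10977):

# Sketch: closed-form OLS estimates b1 = SSxy/SSxx and b0 = ybar - b1*xbar
import numpy as np

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])  # square feet
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])            # price in $1000s

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # sample slope
b0 = y.mean() - b1 * x.mean()                                               # sample intercept
print(b0, b1)   # approximately 98.248 and 0.10977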
Simple Linear Regression Model

    Yi = β0 + β1Xi + εi

    Yi = dependent variable
    β0 = population Y-intercept
    β1 = population slope coefficient
    Xi = independent variable
    εi = random error term

Linear component: β0 + β1Xi        Random error component: εi

35
Simple Linear Regression Model

[Figure: Yi = β0 + β1Xi + εi plotted against X, showing the observed value of Y for Xi,
the predicted value of Y for Xi, the random error εi for this Xi value,
the slope β1 and the intercept β0]

36
Simple Linear Regression
Equation (Prediction Line)
The simple linear regression equation provides an estimate of
the population regression line:

    Ŷi = b0 + b1Xi

where Ŷi is the estimated (or predicted) Y value for observation i, b0 is the estimate of the
regression intercept, b1 is the estimate of the regression slope, and Xi is the value of X for
observation i.

37
Assumptions
The method of least squares gives the best equation under the assumptions
stated below:
• The regression model is linear in the regression parameters.
• The explanatory variable, X, is assumed to be non-stochastic (i.e., X is
deterministic).
• The conditional expected value of the residuals, E(εi|Xi), is zero.
• In case of time series data, residuals are uncorrelated, that is, Cov(εi, εj) = 0
for all i ≠ j.
• The residuals, εi, follow a normal distribution.
• The variance of the residuals, Var(εi|Xi), is constant for all values of Xi. When
the variance of the residuals is constant for different values of Xi, it is called
homoscedasticity. A non-constant variance of residuals is called
heteroscedasticity.
Simple Linear Regression Example

• A real estate agent wishes to examine the relationship between the
selling price of a home and its size (measured in square feet).

• A random sample of 10 houses is selected.
– Dependent variable (Y) = house price in $1000s
– Independent variable (X) = square feet

40
Simple Linear Regression Example: Data
House Price in $1000s (Y)    Square Feet (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700

41
Simple Linear Regression Example: Scatter Plot
House price model: Scatter Plot

[Figure: scatter plot of house price ($1000s) against square feet for the 10 sampled houses]

42
Simple Linear Regression Example:
Using Excel Data Analysis Function

1. Choose Data 2. Choose Data Analysis

3. Choose Regression

43
Simple Linear Regression Example:
Using Excel Data Analysis Function

Enter Y’s and X’s and desired options

44
Dataset 5_Sheet Real estate
• SLOPE function
• INTERCEPT function
• RSQ function

45
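A rough Python counterpart of these worksheet functions (a sketch assuming scipy is available; values taken from the house-price data above):

# Sketch: Python equivalents of Excel's SLOPE, INTERCEPT and RSQ
from scipy import stats

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

result = stats.linregress(square_feet, price)
print(result.slope)         # ~0.10977  (SLOPE)
print(result.intercept)     # ~98.248   (INTERCEPT)
print(result.rvalue ** 2)   # ~0.58082  (RSQ)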
Simple Linear Regression Example:
Excel Output

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

The regression equation is:
    house price = 98.24833 + 0.10977 (square feet)

ANOVA
              df    SS           MS           F         Significance F
Regression     1    18934.9348   18934.9348   11.0848   0.01039
Residual       8    13665.5652    1708.1957
Total          9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%    Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720    232.07386
Square Feet    0.10977        0.03297         3.32938   0.01039     0.03374      0.18580

46
Simple Linear Regression Example:
Graphical Representation

House price model: Scatter Plot and Prediction Line

[Figure: scatter plot of house price ($1000s) against square feet with the fitted
prediction line; intercept = 98.248, slope = 0.10977]

    house price = 98.24833 + 0.10977 (square feet)

47
Measures of Variation

• Total variation is made up of two parts:

    SST = SSR + SSE

    Total Sum of Squares:       SST = Σ(Yi − Ȳ)²
    Regression Sum of Squares:  SSR = Σ(Ŷi − Ȳ)²
    Error Sum of Squares:       SSE = Σ(Yi − Ŷi)²

where:
    Ȳ  = mean value of the dependent variable
    Yi = observed value of the dependent variable
    Ŷi = predicted value of Y for the given Xi value
48
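A small NumPy sketch (illustrative, not from the deck) that verifies SST = SSR + SSE for the house-price line; the numbers should agree with the ANOVA table shown earlier (SSR ≈ 18934.93, SSE ≈ 13665.57, SST ≈ 32600.5):

# Sketch: computing SST, SSR and SSE for the fitted house-price line
import numpy as np

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                      # predicted values

sst = np.sum((y - y.mean()) ** 2)        # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)    # explained variation
sse = np.sum((y - y_hat) ** 2)           # unexplained variation
print(sst, ssr + sse)                    # the two should be equal (up to rounding)
print(ssr / sst)                         # r-squared, ~0.58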
Measures of Variation
• SST = total sum of squares (Total Variation)
• Measures the variation of the Yi values around their mean Ȳ
• SSR = regression sum of squares (Explained Variation)
• Variation attributable to the relationship between X and Y
• SSE = error sum of squares (Unexplained Variation)
• Variation in Y attributable to factors other than X

49
Coefficient of Determination, r²
• The coefficient of determination is the portion of the total variation in
the dependent variable that is explained by variation in the
independent variable.
• The coefficient of determination is also called r-squared and is
denoted as r².

    r² = SSR / SST = regression sum of squares / total sum of squares

note:  0 ≤ r² ≤ 1

50
Simple Linear Regression Example:
Coefficient of Determination, r² in Excel

    r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

    (from the Excel output: R Square = 0.58082)

58.08% of the variation in house prices is explained by variation in
square feet.

51
Inferences About the Slope

• The standard error of the regression slope coefficient (b1) is estimated by

    Sb1 = SYX / √SSX = SYX / √Σ(Xi − X̄)²

where:
    Sb1 = estimate of the standard error of the slope
    SYX = √( SSE / (n − 2) ) = standard error of the estimate

52
Inferences About the Slope: t Test
Is there a linear relationship between X and Y?
• Null and alternative hypotheses
• H0: β1 = 0 (no linear relationship)
• H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic

    tSTAT = (b1 − β1) / Sb1 ,    d.f. = n − 2

where:
    b1  = regression slope coefficient
    β1  = hypothesized slope
    Sb1 = standard error of the slope
53
Inferences About the Slope: t Test Example
H0: β1 = 0      H1: β1 ≠ 0

From the Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet    0.10977        0.03297         3.32938   0.01039

    tSTAT = (b1 − β1) / Sb1 = (0.10977 − 0) / 0.03297 = 3.32938

54
Inferences About the Slope: t Test Example
H0: β1 = 0      H1: β1 ≠ 0

Test statistic: tSTAT = 3.329
d.f. = 10 − 2 = 8,  α/2 = .025,  critical values ±tα/2 = ±2.3060

[Figure: t distribution with rejection regions beyond ±2.3060; tSTAT = 3.329 falls in the upper rejection region]

Decision: Reject H0.
There is sufficient evidence that square footage affects house price.

55
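The same test can be sketched in Python with scipy (the slope and standard error are taken from the Excel output above); the p-value should come out near 0.0104:

# Sketch: t test for the slope, H0: beta1 = 0, d.f. = n - 2
from scipy import stats

b1, se_b1, n = 0.10977, 0.03297, 10
t_stat = (b1 - 0) / se_b1                         # hypothesized slope is 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
print(t_stat, p_value)                            # ~3.329 and ~0.0104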
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0
From Excel output:

Coefficients Standard Error t Stat P-value


Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039

p-value
Decision: Reject H0, since p-value < α
There is sufficient evidence that square footage
affects house price.
56
F Test for Significance

• F test statistic:

    FSTAT = MSR / MSE
    where  MSR = SSR / k   and   MSE = SSE / (n − k − 1)

• FSTAT follows an F distribution with k numerator and (n − k − 1)
denominator degrees of freedom
(k = the number of independent variables in the regression model)

57
F Test for Significance: Excel Output

    FSTAT = MSR / MSE = 18934.9348 / 1708.1957 = 11.0848
    with 1 and 8 degrees of freedom; p-value for the F test (Significance F) = 0.01039

ANOVA
              df    SS           MS           F         Significance F
Regression     1    18934.9348   18934.9348   11.0848   0.01039
Residual       8    13665.5652    1708.1957
Total          9    32600.5000

58
F Test for Significance
H0: β1 = 0      H1: β1 ≠ 0
α = .05,  df1 = 1,  df2 = 8

Test statistic: FSTAT = MSR / MSE = 11.08
Critical value: F.05 = 5.32

[Figure: F distribution with rejection region beyond F.05 = 5.32; FSTAT = 11.08 falls in the rejection region]

Decision: Reject H0 at α = 0.05.
Conclusion: There is sufficient evidence that house size affects selling price.

59
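A scipy sketch of the F test (for simple regression FSTAT is just the square of the slope's t statistic, 3.329² ≈ 11.08):

# Sketch: F test for overall significance, d.f. = (k, n - k - 1) = (1, 8)
from scipy import stats

msr, mse = 18934.9348, 1708.1957
f_stat = msr / mse                              # 11.0848
f_crit = stats.f.ppf(0.95, dfn=1, dfd=8)        # critical value at alpha = 0.05, ~5.32
p_value = stats.f.sf(f_stat, dfn=1, dfd=8)      # Significance F, ~0.0104
print(f_stat, f_crit, p_value)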
Confidence Interval Estimate
for the Slope
              Coefficients   Standard Error   t Stat    P-value   Lower 95%    Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720    232.07386
Square Feet    0.10977        0.03297         3.32938   0.01039     0.03374      0.18580

Since the units of the house price variable are $1000s, we are 95%
confident that the average impact on selling price is between $33.74 and
$185.80 per square foot of house size.

This 95% confidence interval does not include 0.

Conclusion: There is a significant relationship between house
price and square feet at the .05 level of significance.

60
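A sketch of the interval computation itself, b1 ± t(α/2, n−2)·Sb1, which reproduces the Lower/Upper 95% columns above:

# Sketch: 95% confidence interval for the slope
from scipy import stats

b1, se_b1, n = 0.10977, 0.03297, 10
t_crit = stats.t.ppf(0.975, df=n - 2)                     # ~2.3060 for 8 d.f.
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(lower, upper)                                       # ~0.0337 and ~0.1858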
t Test for a Correlation Coefficient
• Hypotheses
    H0: ρ = 0 (no correlation between X and Y)
    H1: ρ ≠ 0 (correlation exists)

• Test statistic (with n − 2 degrees of freedom)

    tSTAT = (r − ρ) / √( (1 − r²) / (n − 2) )

    where  r = +√r² if b1 > 0,   r = −√r² if b1 < 0
61
t-test for a Correlation Coefficient

Is there evidence of a linear relationship between square feet
and house price at the .05 level of significance?

H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = .05,  d.f. = 10 − 2 = 8

    tSTAT = (r − ρ) / √( (1 − r²) / (n − 2) ) = (.762 − 0) / √( (1 − .762²) / (10 − 2) ) = 3.329
62
t-test for a Correlation Coefficient

    tSTAT = (.762 − 0) / √( (1 − .762²) / (10 − 2) ) = 3.329

d.f. = 10 − 2 = 8,  α/2 = .025,  critical values ±2.3060

[Figure: t distribution with rejection regions beyond ±2.3060; tSTAT = 3.329 falls in the upper rejection region]

Decision: Reject H0.
Conclusion: There is evidence of a linear association at the 5% level of significance.

63
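The same test can be run directly with scipy's pearsonr; a sketch on the house-price data (the p-value matches the one from the slope test, ~0.0104):

# Sketch: correlation test between square feet and house price
from scipy import stats

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

r, p_value = stats.pearsonr(square_feet, price)
print(r, p_value)    # r ~0.762, p ~0.0104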
Evaluation metrics
• Mean Absolute Error (MAE) is the mean of the absolute value
of the errors. It is calculated as:

    MAE = (1/n) Σj |yj − ŷj|

• Mean Squared Error (MSE) is the mean of the squared errors
and is calculated as:

    MSE = (1/n) Σj (yj − ŷj)²
64
Evaluation metrics
• Root Mean Squared Error (RMSE) is the square root of the
mean of the squared errors:

    RMSE = √( (1/n) Σj (yj − ŷj)² )

65
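A minimal NumPy sketch of the three metrics (scikit-learn's mean_absolute_error and mean_squared_error would work equally well); the predicted values below are, approximately, the fitted values of the house-price model for the first five houses:

# Sketch: MAE, MSE and RMSE for a set of actual vs predicted values
import numpy as np

y_true = np.array([245, 312, 279, 308, 199])              # actual house prices ($1000s)
y_pred = np.array([251.9, 273.9, 284.9, 304.1, 219.0])    # fitted values (approximate)

errors = y_true - y_pred
mae = np.mean(np.abs(errors))     # mean absolute error
mse = np.mean(errors ** 2)        # mean squared error
rmse = np.sqrt(mse)               # root mean squared error
print(mae, mse, rmse)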
Dataset 5_Sheet Iris data
• Is there statistical evidence to support that the petal width and
petal length are related?
• Based upon the linear models you generated, which pair of
features appear to be most predictive for one another?

66
Descriptive Analytics in Python
• #Pandas is an open source, Berkeley Software Distribution licensed
library
• #This provides data analysis tools for Python programming language
• #we are using import keyword, pandas is imported as alias pd
import pandas as pd
• #we will use pd.read_csv method to read and load it into a DataFrame
data=pd.read_csv('Sa_Data.csv')
• #to display all the column names
list(data.columns)
• #to display the top rows of the dataset
data.head()
67
Descriptive Analytics in Python
• #to display the size of the data frame
data.shape
• #For more detailed summary about the dataset use info()
data.info()
• #the row and column indexes always start with the value 0
• #We want to display the first 5 rows of the DataFrame
data[0:5]
• #Negative indexing is an excellent feature in Python
• #We want to display the last 6 rows of the DataFrame
data[-6:]

68
Descriptive Analytics in Python
• #If we want to know the frequency/occurrences of each unique value
in a column named YearsExperience
data.YearsExperience.value_counts()
• #Let us take another data set IPL Data 2015 CSV. Here if we want to
know the frequency/occurrences of players from each country
• #Then we need to use value_counts as shown below
data_IPL=pd.read_csv('IPL Data 2015 CSV.csv')
data_IPL.Country.value_counts()
• #Cross-tabulation features help find occurrences for the
combination of values of two columns
pd.crosstab(data_IPL.L25, data_IPL.Country)

69
Descriptive Analytics in Python
• #Read IPL file with Age
data_AgeIPL=pd.read_csv('IPL AgeData 2015 CSV.csv')
pd.crosstab(data_AgeIPL.AGE, data_AgeIPL.Country)
• #Drawing plots- import matplotlib - a Python 2D plotting library
• #Seaborn is a library for making elegant charts in Python
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
• #To draw a bar chart with Age and sold price
sn.barplot(x='AGE', y='Sold Price', data=data_AgeIPL);

70
Descriptive Analytics in Python
• #A third variable hue is also introduced. Hue is the colour code w.r.t third
variable
sn.barplot(x='AGE', y='Sold Price', hue='PLAYING ROLE',
data=data_AgeIPL);
• #To draw the box plot call boxplot()of seaborn library
box=sn.boxplot(data_AgeIPL['Sold Price']);
• #To compare the Sold Price with different Playing Role
sn.boxplot(x='PLAYING ROLE', y='Sold Price', data=data_AgeIPL);
• #Scatterplot between Sixers and Sold Price
IPL_Batsman=data_AgeIPL[data_AgeIPL['PLAYING ROLE']=='Batsman']
plt.scatter(x=IPL_Batsman.SIXERS, y=IPL_Batsman['Sold Price']);
71
Descriptive Analytics in Python
• #To fit a line to represent direction of relationship use regplot() of seaborn
sn.regplot(x='SIXERS', y='Sold Price', data=IPL_Batsman);
• #Let us assume that the influential features may be Strike rate, Average,
Sixers and Sold Price
• #Correlation values for features can be computed using corr()
influential_features=['SR -B', 'AVE','SIXERS', 'Sold Price']
data_AgeIPL[influential_features].corr()
• #Heatmap of correlation values
sn.heatmap(data_AgeIPL[influential_features].corr(), annot=True);

72
Descriptive Analytics in Python
data=pd.read_excel('Real_Estate.xls')
• #Correlation between variables
data.corr(method ='pearson')

73
Influence During Regression – Outlier
Analysis
• An outlier is an observation point that is distant from other
observations.
• Outliers are data with an extreme value of the response variable (Y):
the value (Yi − Ŷi) shows a large deviation.
• The presence of an outlier will have a significant influence on the values of the
regression coefficients.
• Leverage points are data with an extreme value of the predictor
variable (X).
• Some combination of extreme Y (outlier) and extreme X (leverage)
makes a data point influential.
• An influential data point: removing the data point substantially changes
the regression results – how do we define "substantial"?

74
Distance measures
• Z score
• Mahalanobis distance
• Cook’s distance
• Leverage values
• DFBeta and DFFit Values

75
Z-score
The Z-score is the standardized distance of an observation from its mean value:

    Z = (Yi − Ȳ) / σY

where Ȳ and σY are the mean and standard deviation of the dependent
variable estimated from the sample data.
Mahalanobis distance

• The distance between specific values of the independent variables and the
centroid of all observations of the explanatory variables.
• An observation whose Mahalanobis distance exceeds the chi-square critical value (with df
equal to the number of explanatory variables) is classified as an outlier.
• A large Mahalanobis distance identifies a case as having extreme values on
one or more of the independent variables.
• The threshold probability of 0.001 was suggested by Tabachnick & Fidell
(2007), who state that a very conservative probability estimate for outlier
identification is appropriate for the Mahalanobis distance.

Tabachnick, B.G., & Fidell, L.S. (2007). Using Multivariate Statistics (5th Ed.).
Boston: Pearson. (p. 74).
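A sketch of this check (the data matrix here is a random placeholder, not one of the course datasets):

# Sketch: flag observations whose squared Mahalanobis distance exceeds the chi-square critical value
import numpy as np
from scipy.stats import chi2

X = np.random.default_rng(0).normal(size=(100, 3))    # placeholder: 100 rows, 3 explanatory variables

center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)    # squared Mahalanobis distance of each row

cutoff = chi2.ppf(1 - 0.001, df=X.shape[1])           # conservative p < .001 threshold
print(np.where(d2 > cutoff)[0])                       # indices of flagged observations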
Cook’s Distance

A measure of how much the residuals of all cases would change if a particular
case were excluded from the calculation of the regression coefficients.

A Cook’s D value of more than 1 indicates that excluding a case from
computation of the regression statistics changes the coefficients substantially.

Cook’s D is given by

    Di = Σj ( Ŷj − Ŷj(i) )² / ( (k + 1) MSE )

where Ŷj(i) is the predicted value of the j-th observation after excluding the i-th observation
from the sample.
Cook’s Distance

Hair et al. (1998) consider 4/(N − k − 1), for k predictors and N points, as a
threshold for Cook’s distance, which usually gives a lower threshold
than 1.

Bollen and Jackman (1990) state another threshold of 4/N.

References
Bollen, K. A. and Jackman, R. W. (1990). Regression diagnostics: An expository treatment of outliers and influential
cases. In Fox, John; and Long, J. Scott (eds.), Modern Methods of Data Analysis (pp. 257–291). Newbury Park, CA: Sage.
Hair, J., Anderson, R., Tatham, R. and Black, W. (1998). Multivariate Data Analysis (fifth edition). Englewood Cliffs, NJ:
Prentice-Hall.
Leverage Value
• The leverage value of an observation measures the influence of that data point (or
observation) on the fit of the regression model.
• The value is given by

    hi = 1/n + (Xi − X̄)² / Σi (Xi − X̄)²

• A value of more than 2/n or 3/n is treated as a highly influential
observation.

DFFit and DFBeta


• DFFit is the change in the predicted value of Yi, when case i is removed from
the data set.
• DFBeta is the change in the regression coefficient values when an observation
i is removed from the data.
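statsmodels exposes all of these diagnostics through get_influence() on a fitted OLS model; a sketch using the house-price data from the earlier example:

# Sketch: leverage, Cook's distance, DFFITS and DFBETAS from a fitted OLS model
import numpy as np
import statsmodels.api as sm

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag        # h_i for each observation
cooks_d, _ = influence.cooks_distance       # Cook's distance per observation
dffits, _ = influence.dffits                # change in fitted value when obs i is dropped
dfbetas = influence.dfbetas                 # change in each coefficient when obs i is dropped

print(np.where(cooks_d > 4 / len(y))[0])    # flag points above the 4/N threshold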
Dataset 6_Sheet Crime Data
• Does a relationship exist between crime rate and police
expenditure?
• If the crime rate and police expenditure are related, develop a
simple linear regression model with these variables.
• Comment on the relationship between crime rate and the number of
families below half wage.
• Is the number of males per 1000 females related to the region?
• Can the ratio of males to females predict the crime rate?

81
Dataset 7_Sheet
• Is there statistical evidence to support that the bill amount depends on the weight of
the patient?

• Are the assumptions of normality and homoscedasticity satisfied? If not, what
is the remedial measure?

• Comment on the value of R-square. What does a low value of R-square indicate?

• What will be the average difference in the bill amount for someone with weight 55 kg
versus 60 kg?

• It is learnt that a patient weighing 60 kg is likely to incur at least INR 500 more
than a patient weighing 55 kg. Do you believe this statement? Justify your argument
statistically at the 5% level of significance.
82
Linear Regression in Python
• #statsmodels is a Python module that provides classes and
functions for the estimation of many different statistical models, as
well as for conducting statistical tests, and statistical data
exploration.
import statsmodels.api as sm
import pandas as pd
data=pd.read_csv('IPL AgeData 2015 CSV.csv')

83
Linear Regression in Python
• #develop a simple linear regression between the variables 'SIXERS' and 'Sold Price'
X = sm.add_constant(data['SIXERS'])
Y=data['Sold Price']
• #for splitting the dataset randomly into training and test datasets use the train_test_split()
function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split
• #80% of the data is used for training and the remaining 20% is used for testing
train_X, test_X, train_Y, test_Y= train_test_split(X,Y,train_size=0.8,
random_state=100)
• #the fit() method on OLS() estimates the parameters and returns the model information to
the variable model2
model2=sm.OLS(train_Y,train_X).fit()
print(model2.params)
• #the summary2() function prints the model summary for diagnosing a regression
model
model2.summary2()
84
Linear Regression in Python
const 366848.000678
SIXERS 11003.343655
dtype: float64

85
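A natural next step (a sketch, assuming the train/test split defined above) is to score the held-out 20% with the metrics from the evaluation-metrics slides:

# Sketch: predict on the test set and compute RMSE
import numpy as np

pred_Y = model2.predict(test_X)                   # predictions from the fitted OLS model
rmse = np.sqrt(np.mean((test_Y - pred_Y) ** 2))   # root mean squared error on the test set
print(rmse)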
Linear Regression in Python
• Omnibus/Prob(Omnibus) –
• A test of the skewness and kurtosis of the residuals.
• A value close to zero would indicate normality.
• Prob(Omnibus) gives the probability that the
residuals are normally distributed. A value close to 1 is expected.
• In this case the residuals are not normally distributed.
• Skew – a measure of data symmetry. Expect a value close to zero,
indicating that the residual distribution is normal.
• Kurtosis – a measure of "peakiness", or curvature, of the data. Higher
peaks lead to greater kurtosis. Greater kurtosis can be interpreted as a
tighter clustering of residuals around zero, implying a better model with
few outliers.
86
Linear Regression in Python
• Durbin-Watson – tests for autocorrelation in the residuals. A value close
to 2 indicates no autocorrelation; expect a value between 1 and 2.
• Jarque-Bera (JB)/Prob(JB) – like the Omnibus test, it tests both
skew and kurtosis. This test serves as confirmation of the Omnibus test.
• Condition Number – measures the sensitivity of a function's
output as compared to its input. When we have multicollinearity, we can
expect much higher fluctuations for small changes in the data; hence, we
hope to see a relatively small number, something below 30.

87
Linear Regression in Python
• #Normality of residuals can be checked using a P-P plot
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
model2_residual=model2.resid
• #the ProbPlot() method of statsmodels draws the P-P plot
probplot=sm.ProbPlot(model2_residual)
plt.figure(figsize=(8,6))
probplot.ppplot(line='45')
plt.show()
88
