Introduction to Linear Regression
• Regression is a tool to examine the association between a dependent variable (Y) and one or more independent variables (X1, X2, …, Xn) using a mathematical equation.
• The motivation for using the technique:
  • Forecast the value of the dependent variable (Y) from the values of the independent variables (X1, X2, …, Xk).
  • Analyze the specific relationships between the independent variables and the dependent variable, to identify variables that may be used in model building.
• The relationship can be linear or non-linear.
Introduction to Linear Regression
• A dependent variable (response variable, also called an outcome variable) measures an outcome of a study.
Regression vs. Correlation
Introduction to Linear Regression
Dependent and Independent Variables
Regression Nomenclature
Ref: F. Galton, "Regression towards mediocrity in hereditary stature", Journal of the Anthropological Institute, Vol. 15, 246–263, 1886
Types of Regression Models

Regression models can be classified by the number of explanatory variables and by the functional form:

Regression Models
• Simple (1 explanatory variable)
  • Linear
  • Non-Linear
• Multiple (2+ explanatory variables)
  • Linear
  • Non-Linear
Which is Linear?
1) Yi = β0 + β1X1i² + εi
2) Yi = β0 e^(β1X1i) + εi
3) Yi = β0 + β1X1 + β2X1X2 + β3X2² + εi
4) Yi = β0 + X1/(1 + β1) + εi
5) Yi = β0 + β1 ln X1 + εi
(Models 1, 3 and 5 are linear in the parameters; models 2 and 4 are not.)
Types of Regression
• Simple linear regression – a regression model between two variables:
  Y = β0 + β1X1 + ε
• Multiple linear regression – a regression model with more than one independent variable:
  Y = β0 + β1X1 + β2X2 + … + βkXk + ε
• Nonlinear regression, e.g.:
  Y = β0 + (β1 + β3X2)/(1 + β2X1) + ε
Linear Regression (linear in the parameters)
Y = β0 + β1X1²
Y = β0 + β1 ln X1
Y = β0 + β1X1 + β2X1X2 + β3X2²
Non-Linear Regression (non-linear in the parameters)
Y = β0 + X1/(1 + β1)
Y = β0 e^(β1X1)
Define the Functional Form of the Relationship
For better predictive ability (model accuracy) it is important to specify the correct functional form between the dependent variable and the independent variable. Scatter plots may assist the modeller in choosing the right functional form.
• Linear relationship between X1 and Y1: Y1 = β0 + β1X1
• Log-linear relationship between X2 and Y2: ln Y2 = β0 + β1X2
The Model
y = β0 + β1x + ε
where
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (= Rise/Run)
ε = error variable
β0 and β1 are unknown population parameters and are therefore estimated from the data.
[Figure: a straight line with intercept β0 and slope β1 = Rise/Run]
Estimating the Coefficients
• The estimates are determined by
  • drawing a sample from the population of interest,
  • calculating sample statistics,
  • producing a straight line that cuts into the data.
[Figure: scatter of data points. Question: which line should be considered a good line?]
The Least Squares (Regression) Line
Let us compare two lines fitted to the points (1, 2), (2, 4), (3, 1.5) and (4, 3.2):
• Line 1 (predictions 1, 2, 3, 4, i.e. ŷ = x):
  Sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
• Line 2 (the horizontal line ŷ = 2.5):
  Sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
The smaller the sum of squared differences, the better the fit of the line to the data.
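These two sums of squared differences can be recomputed in a few lines of plain Python, a minimal sketch using the four data points and the two candidate prediction rules from the comparison above:

```python
# Data points (x, y) from the line-comparison example
points = [(1, 2.0), (2, 4.0), (3, 1.5), (4, 3.2)]

def sse(predict, pts):
    """Sum of squared differences between actual and predicted y."""
    return sum((y - predict(x)) ** 2 for x, y in pts)

# Line 1 predicts y = x (predictions 1, 2, 3, 4)
sse_line1 = sse(lambda x: x, points)
# Line 2 is the horizontal line y = 2.5
sse_line2 = sse(lambda x: 2.5, points)

print(round(sse_line1, 2))  # 7.89
print(round(sse_line2, 2))  # 3.99
```

The horizontal line wins here, which is exactly why "eyeballing" a line is unreliable and a formal minimization criterion is needed.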
Least Squares
• 'Best fit' means the differences between the actual Y values and the predicted Y values are a minimum.
• But positive differences offset negative ones, so we square the errors:
  Σi=1..n (Yi − Ŷi)² = Σi=1..n ε̂i²
• Least squares minimizes the sum of the squared differences (errors), the SSE.
Estimate the Regression Parameters
• The method of Ordinary Least Squares (OLS) is used to estimate the regression parameters.
• OLS fits the regression line through a set of data points such that the sum of the squared distances between the actual observations in the sample and the regression line is minimized (i.e., Σi (Yi − Ŷi)² is minimized).
The Estimated Coefficients
Setting the derivative of the SSE with respect to β1 to zero (after substituting β0 = Ȳ − β1X̄):
−2 Σi=1..n (XiYi − Xi(Ȳ − β1X̄) − β1Xi²) = 0
Σi XiYi − Ȳ Σi Xi + β1X̄ Σi Xi − β1 Σi Xi² = 0
β̂1 = Σi Xi(Yi − Ȳ) / Σi Xi(Xi − X̄)
The Estimated Coefficients
To calculate the estimates of the slope and intercept of the least squares line, use the formulas:
β̂0 = Ȳ − β̂1X̄
β̂1 = Σi Xi(Yi − Ȳ) / Σi Xi(Xi − X̄)
Least Squares Graphically
LS minimizes Σi=1..4 ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²
[Figure: for observation 2, Y2 = β̂0 + β̂1X2 + ε̂2; the residuals ε̂1 … ε̂4 are the vertical distances from the data points to the fitted line]
Coefficient Equations
• Prediction equation: ŷi = β̂0 + β̂1xi
• Sample slope: β̂1 = SSxy / SSxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• Sample Y-intercept: β̂0 = ȳ − β̂1x̄
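These formulas are easy to check by hand. A minimal sketch applying them to the four points used in the earlier line-comparison example, (1, 2), (2, 4), (3, 1.5) and (4, 3.2):

```python
xs = [1, 2, 3, 4]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample slope: b1 = SSxy / SSxx
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = ss_xy / ss_xx

# Sample Y-intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(round(b1, 4), round(b0, 4))  # 0.11 2.4
```

So the least squares line for those four points is ŷ = 2.4 + 0.11x, and by construction no other straight line has a smaller sum of squared errors on them.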
Simple Linear Regression Model
Yi = β0 + β1Xi + εi
where
Yi = dependent variable
β0 = population Y-intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term
β0 + β1Xi is the linear component; εi is the random error component.
Simple Linear Regression Model
[Figure: the line Yi = β0 + β1Xi + εi, with intercept β0, slope β1, and εi the vertical distance between the observed value of Y at Xi and the line]
Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line:
Ŷi = b0 + b1Xi
where
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i
Assumptions
The method of least squares gives the best equation under the assumptions stated below:
• The regression model is linear in the regression parameters.
• The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic).
• The conditional expected value of the residuals, E(εi|Xi), is zero.
• In the case of time series data, residuals are uncorrelated, that is, Cov(εi, εj) = 0 for all i ≠ j.
• The residuals, εi, follow a normal distribution.
• The variance of the residuals, Var(εi|Xi), is constant for all values of Xi. When the variance of the residuals is constant for different values of Xi, it is called homoscedasticity. A non-constant variance of residuals is called heteroscedasticity.
Simple Linear Regression Example
Simple Linear Regression Example: Data

House Price in $1000s (Y)   Square Feet (X)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Simple Linear Regression Example: Scatter Plot
[Figure: scatter plot of house price ($1000s) against square feet]
Simple Linear Regression Example: Using the Excel Data Analysis Function
[Screenshots: Data → Data Analysis → choose Regression, then specify the Y and X input ranges]
Dataset 5_Sheet Real estate
• SLOPE function
• INTERCEPT function
• RSQ function
Simple Linear Regression Example: Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS          MS          F        Significance F
Regression   1   18934.9348  18934.9348  11.0848  0.01039
Residual     8   13665.5652  1708.1957
Total        9   32600.5000

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
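The same estimates can be reproduced without Excel. A minimal sketch in plain Python, applying the least squares formulas to the ten house-price observations and recovering the coefficients and ANOVA sums of squares:

```python
# House data from the example: price ($1000s) and square feet
Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Slope and intercept by least squares
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
     sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar

# Sums of squares for the ANOVA table
sst = sum((y - y_bar) ** 2 for y in Y)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
ssr = sst - sse

print(round(b0, 5), round(b1, 5))  # 98.24833 0.10977
print(round(ssr, 4), round(sse, 4), round(sst, 4))
```

The printed values match the Excel output above (SSR ≈ 18934.9348, SSE ≈ 13665.5652, SST = 32600.5).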
Simple Linear Regression Example: Graphical Representation
[Figure: scatter plot of the data with the fitted line house price = 98.24833 + 0.10977 (square feet)]
Coefficient of Determination, r²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
• The coefficient of determination is also called r-squared and is denoted r².
  r² = SSR/SST = regression sum of squares / total sum of squares
  note: 0 ≤ r² ≤ 1
Simple Linear Regression Example: Coefficient of Determination, r², in Excel
From the ANOVA table:
r² = SSR/SST = 18934.9348/32600.5000 = 0.58082
58.08% of the variation in house prices is explained by variation in square feet.
Inferences About the Slope
The standard error of the slope is estimated by
Sb1 = SYX/√SSX = SYX/√Σ(Xi − X̄)²
where
Sb1 = estimate of the standard error of the slope
SYX = standard error of the estimate
d.f. = n − 2
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0
From the Excel output, b1 = 0.10977 and Sb1 = 0.03297, so
tSTAT = (b1 − β1)/Sb1 = (0.10977 − 0)/0.03297 = 3.32938
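The standard error of the slope and the t statistic can be computed directly from the data; a minimal sketch, reusing the ten house-price observations:

```python
import math

Y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
X = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

ss_xx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / ss_xx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate: S_YX = sqrt(SSE / (n - 2))
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
s_yx = math.sqrt(sse / (n - 2))

# Standard error of the slope, then the t statistic for H0: beta1 = 0
s_b1 = s_yx / math.sqrt(ss_xx)
t_stat = (b1 - 0) / s_b1

print(round(s_b1, 5), round(t_stat, 4))  # approximately 0.03297 and 3.3294
```

These agree with the Excel output (Sb1 = 0.03297, tSTAT = 3.32938) up to rounding.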
Inferences About the Slope: t Test Example
H0: β1 = 0; H1: β1 ≠ 0
Test statistic: tSTAT = 3.329, d.f. = 10 − 2 = 8
With α/2 = .025 in each tail, the critical values are ±2.306.
Decision: Reject H0, since tSTAT = 3.329 > 2.306.
Inferences About the Slope: t Test Example
H0: β1 = 0; H1: β1 ≠ 0
From the Excel output, the p-value is 0.01039.
Decision: Reject H0, since p-value < α.
There is sufficient evidence that square footage affects house price.
F Test for Significance
• F test statistic: FSTAT = MSR/MSE
  where MSR = SSR/k and MSE = SSE/(n − k − 1)
  (k = number of independent variables in the regression model)
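Plugging in the ANOVA sums of squares from the house-price example (k = 1, n = 10) gives the F statistic directly; a minimal sketch:

```python
# ANOVA quantities from the house price example
ssr = 18934.9348
sse = 13665.5652
k, n = 1, 10

msr = ssr / k            # mean square regression
mse = sse / (n - k - 1)  # mean square error, 8 degrees of freedom
f_stat = msr / mse

print(round(mse, 4), round(f_stat, 4))  # approximately 1708.1957 and 11.0848
```

With one independent variable, FSTAT equals tSTAT² (11.0848 ≈ 3.32938²), so the two tests of β1 = 0 are equivalent in simple linear regression.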
F Test for Significance: Excel Output
FSTAT = MSR/MSE = 18934.9348/1708.1957 = 11.0848, with 1 and 8 degrees of freedom.
The p-value for the F test is the 'Significance F' value, 0.01039.

ANOVA
            df   SS          MS          F        Significance F
Regression   1   18934.9348  18934.9348  11.0848  0.01039
Residual     8   13665.5652  1708.1957
Total        9   32600.5000
F Test for Significance
H0: β1 = 0
H1: β1 ≠ 0
α = .05, df1 = 1, df2 = 8
Test statistic: FSTAT = MSR/MSE = 11.08
Critical value: F.05 = 5.32
Decision: Reject H0 at α = 0.05, since FSTAT = 11.08 > 5.32.
Conclusion: There is sufficient evidence that house size affects selling price.
Confidence Interval Estimate for the Slope

             Coefficients  Standard Error  t Stat   P-value  Lower 95%   Upper 95%
Intercept    98.24833      58.03348        1.69296  0.12892  -35.57720   232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374     0.18580

Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.74 and $185.80 per square foot of house size.
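The confidence limits in the table follow from b1 ± t(0.025, n−2) · Sb1. A minimal sketch, with the critical value t(0.025, 8) = 2.306 hardcoded from a t table rather than computed:

```python
# 95% CI for the slope: b1 +/- t(0.025, n-2) * S_b1
b1 = 0.10977
s_b1 = 0.03297
t_crit = 2.306  # t(0.025, 8), from a t table

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 5), round(upper, 5))  # approximately 0.03374 and 0.18580
```

Because the interval excludes 0, this is consistent with rejecting H0: β1 = 0 at the 5% level.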
t Test for a Correlation Coefficient
• Hypotheses:
  H0: ρ = 0 (no correlation between X and Y)
  H1: ρ ≠ 0 (correlation exists)
• Test statistic (with n − 2 degrees of freedom):
  tSTAT = (r − ρ)/√((1 − r²)/(n − 2))
  where r = +√r² if b1 > 0 and r = −√r² if b1 < 0
t Test for a Correlation Coefficient
tSTAT = (r − ρ)/√((1 − r²)/(n − 2)) = (.762 − 0)/√((1 − .762²)/(10 − 2)) = 3.329
d.f. = 10 − 2 = 8; with α/2 = .025 in each tail, the critical values are ±2.306.
Decision: Reject H0.
Conclusion: There is evidence of a linear association at the 5% level of significance.

(Aside: the mean squared error of the predictions is MSE = (1/n) Σj=1..n (yj − ŷj)².)
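The tSTAT above can be verified in a few lines, using the full-precision correlation r = 0.76211 (the Multiple R from the Excel output):

```python
import math

r = 0.76211  # sample correlation; positive because b1 > 0
n = 10

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t_stat, 3))  # approximately 3.329
```

Note this is the same value as the t statistic for the slope, which is no coincidence: in simple linear regression the two tests are algebraically identical.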
Dataset 5_Sheet Iris data
• Is there statistical evidence to support that petal width and petal length are related?
• Based upon the linear models you generated, which pair of features appears to be most predictive for one another?
Descriptive Analytics in Python
• # Pandas is an open-source, BSD-licensed library
• # It provides data analysis tools for the Python programming language
• # Using the import keyword, pandas is imported under the alias pd
import pandas as pd
• # Use the pd.read_csv method to read the file and load it into a DataFrame
data = pd.read_csv('Sa_Data.csv')
• # To display all the column names
list(data.columns)
• # To display the top rows of the dataset
data.head()
Descriptive Analytics in Python
• # To display the size of the DataFrame
data.shape
• # For a more detailed summary of the dataset, use info()
data.info()
• # The row and column indexes always start with the value 0
• # To display the first 5 rows of the DataFrame
data[0:5]
• # Negative indexing is an excellent feature in Python
• # To display the last 6 rows of the DataFrame
data[-6:]
Descriptive Analytics in Python
• # To find the frequency/occurrences of each unique value in the column YearsExperience
data.YearsExperience.value_counts()
• # Let us take another dataset, IPL Data 2015 CSV. To find the frequency/occurrences of players from each country,
• # use value_counts as shown below
data_IPL = pd.read_csv('IPL Data 2015 CSV.csv')
data_IPL.Country.value_counts()
• # Cross-tabulation helps find occurrences of combinations of values in two columns
pd.crosstab(data_IPL.L25, data_IPL.Country)
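The value_counts/crosstab pattern can be demonstrated without the course CSV files. A minimal sketch on a hypothetical mini-roster (the column names Country and L25 mirror the real dataset; the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-roster standing in for 'IPL Data 2015 CSV.csv'
df = pd.DataFrame({
    'Country': ['India', 'India', 'Australia', 'India', 'Australia'],
    'L25':     ['Y', 'N', 'N', 'Y', 'N'],  # e.g. an under-25 indicator
})

# Frequency of each unique value in one column
counts = df.Country.value_counts()

# Cross-tabulation of two columns
ct = pd.crosstab(df.L25, df.Country)

print(counts['India'])           # 3
print(ct.loc['N', 'Australia'])  # 2
```

Each cell of the crosstab is the count of rows with that (L25, Country) combination.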
Descriptive Analytics in Python
• # Read the IPL file with Age
data_AgeIPL = pd.read_csv('IPL AgeData 2015 CSV.csv')
pd.crosstab(data_AgeIPL.AGE, data_AgeIPL.Country)
• # Drawing plots: import matplotlib, a Python 2D plotting library
• # Seaborn is a library for making elegant charts in Python
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
• # To draw a bar chart of Age against Sold Price
sn.barplot(x='AGE', y='Sold Price', data=data_AgeIPL);
Descriptive Analytics in Python
• # A third variable, hue, is also introduced. Hue is the colour code w.r.t. the third variable
sn.barplot(x='AGE', y='Sold Price', hue='PLAYING ROLE', data=data_AgeIPL);
• # To draw a box plot, call boxplot() of the seaborn library
box = sn.boxplot(data_AgeIPL['Sold Price']);
• # To compare Sold Price across the different Playing Roles
sn.boxplot(x='PLAYING ROLE', y='Sold Price', data=data_AgeIPL);
• # Scatter plot between Sixers and Sold Price
IPL_Batsman = data_AgeIPL[data_AgeIPL['PLAYING ROLE'] == 'Batsman']
plt.scatter(x=IPL_Batsman.SIXERS, y=IPL_Batsman['Sold Price']);
Descriptive Analytics in Python
• # To fit a line representing the direction of the relationship, use regplot() of seaborn
sn.regplot(x='SIXERS', y='Sold Price', data=IPL_Batsman);
• # Let us assume that the influential features may be Strike Rate, Average, Sixers and Sold Price
• # Correlation values for the features can be computed using corr()
influential_features = ['SR -B', 'AVE', 'SIXERS', 'Sold Price']
data_AgeIPL[influential_features].corr()
• # Heatmap of the correlation values
sn.heatmap(data_AgeIPL[influential_features].corr(), annot=True);
Descriptive Analytics in Python
data = pd.read_excel('Real_Estate.xls')
• # Correlation between variables
data.corr(method='pearson')
Influence During Regression – Outlier Analysis
• An outlier is an observation point that is distant from other observations.
• Outliers are data with an extreme value of the response variable (Y).
• The residual (Yi − Ŷi) shows a large deviation.
• The presence of an outlier will have a significant influence on the values of the regression coefficients.
• Leverage points are data with an extreme value of the predictor variable (X).
• Some combination of extreme Y (outlier) and extreme X (leverage) makes a data point influential.
• An influential data point: removing the data point substantially changes the regression results. How do we define "substantial"?
Distance measures
• Z score
• Mahalanobis distance
• Cook’s distance
• Leverage values
• DFBeta and DFFit Values
Z-score
The standardized distance of an observation from its mean value:
Zi = (Yi − Ȳ)/σY
Ref: Tabachnick, B.G., & Fidell, L.S. (2007). Using Multivariate Statistics (5th ed.). Boston: Pearson, p. 74.
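A minimal sketch of the z-score computation on a small sample (population standard deviation, invented numbers chosen so the arithmetic is exact):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)  # 5.0

# Population standard deviation
sigma = math.sqrt(sum((y - mean) ** 2 for y in data) / len(data))  # 2.0

z_scores = [(y - mean) / sigma for y in data]
print(z_scores[-1])  # 2.0 -- the value 9 lies two SDs above the mean
```

A common screening rule flags observations with |Z| greater than about 3 as potential outliers.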
Cook's Distance
A measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients. Cook's distance is given by
Di = Σj=1..n (Ŷj − Ŷj(i))² / ((k + 1) MSE)
where Ŷj(i) is the fitted value for observation j when observation i is excluded from the regression.
• A value of more than 4/n (or, more conservatively, 1) is commonly treated as indicating a highly influential observation.
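The definition can be implemented directly by refitting the regression with each observation left out in turn. A minimal sketch for simple linear regression (k = 1), on an invented dataset whose last point departs sharply from the linear trend of the rest:

```python
def ols(xs, ys):
    """Simple linear regression coefficients (b0, b1) by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
         sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b1 * x_bar, b1

def cooks_distances(xs, ys):
    n, k = len(xs), 1
    b0, b1 = ols(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - k - 1)
    ds = []
    for i in range(n):
        # Refit with observation i removed
        a0, a1 = ols(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # D_i = sum_j (Yhat_j - Yhat_j(i))^2 / ((k + 1) * MSE)
        ds.append(sum((f - (a0 + a1 * x)) ** 2 for x, f in zip(xs, fitted))
                  / ((k + 1) * mse))
    return ds

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 20.0]  # last point breaks the y = 2x trend
d = cooks_distances(xs, ys)
print(d.index(max(d)))  # 4 -- the outlying point is the most influential
```

In practice statsmodels exposes the same quantity via its influence diagnostics, but the leave-one-out loop makes the definition concrete.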
Dataset 7_Sheet
• Is there statistical evidence to support that the bill amount depends on the weight of the patient?
• What will be the average difference in the bill amount between patients with weights 55 and 60?
• It is claimed that a patient weighing 60 kg is likely to incur at least INR 500 more than a patient weighing 55 kg. Do you believe this statement? Justify your argument statistically at the 5% level of significance.
Linear Regression in Python
• # statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration
import statsmodels.api as sm
import pandas as pd
data = pd.read_csv('IPL AgeData 2015 CSV.csv')
Linear Regression in Python
• # Develop a simple linear regression between the variables 'SIXERS' and 'Sold Price'
X = sm.add_constant(data['SIXERS'])
Y = data['Sold Price']
• # To split the dataset randomly into training and test datasets, use the train_test_split() function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split
• # 80% of the data is used for training and the remaining 20% is used for testing
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, train_size=0.8, random_state=100)
• # The fit() method on OLS() estimates the parameters and returns the model information to the variable model2
model2 = sm.OLS(train_Y, train_X).fit()
print(model2.params)
• # The summary2() function prints the model summary for diagnosing a regression model
model2.summary2()
Linear Regression in Python
Output of print(model2.params):
const     366848.000678
SIXERS     11003.343655
dtype: float64
Linear Regression in Python
• Omnibus/Prob(Omnibus) –
  • A test of the skewness and kurtosis of the residuals.
  • A value close to zero would indicate normality.
  • Prob(Omnibus) gives the probability that the residuals are normally distributed; a value close to 1 is expected.
  • In this case the residuals are not normal.
• Skew – a measure of data symmetry. Expect a value close to zero, indicating the residual distribution is normal.
• Kurtosis – a measure of "peakiness", or curvature, of the data. Higher peaks lead to greater kurtosis. Greater kurtosis can be interpreted as a tighter clustering of residuals around zero, implying a better model with few outliers.
Linear Regression in Python
• Durbin-Watson – tests for autocorrelation in the residuals. The statistic ranges from 0 to 4; a value close to 2 indicates no autocorrelation, so expect a value between roughly 1 and 3.
• Jarque-Bera (JB)/Prob(JB) – like the Omnibus test in that it tests both skew and kurtosis. This test serves as confirmation of the Omnibus test.
• Condition Number – measures the sensitivity of a function's output to changes in its input. When we have multicollinearity, we can expect much larger fluctuations in response to small changes in the data; hence we hope to see a relatively small number, something below 30.
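The Durbin-Watson statistic itself is simple to compute from a residual series: it is the sum of squared successive differences divided by the sum of squared residuals. A minimal sketch on two invented residual patterns:

```python
def durbin_watson(residuals):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2; values near 2 mean no autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating signs (negative autocorrelation) push DW toward 4
dw_alt = durbin_watson([1.0, -1.0, 1.0, -1.0])
# Identical residuals (strong positive autocorrelation) push DW toward 0
dw_pos = durbin_watson([1.0, 1.0, 1.0, 1.0])

print(dw_alt, dw_pos)  # 3.0 0.0
```

statsmodels reports this value automatically in the regression summary; the hand computation just shows what the number measures.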
Linear Regression in Python
• # The normality of the residuals can be checked using a P-P plot
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
model2_residual = model2.resid
• # The ProbPlot() method of statsmodels draws the P-P plot
probplot = sm.ProbPlot(model2_residual)
plt.figure(figsize=(8, 6))
probplot.ppplot(line='45')
plt.show()