
Linear Regression

Disclaimer: This material is protected under the Copyright Act, AnalytixLabs ©, 2011. Unauthorized use and/or duplication of this material or any part of this material, including data, in any
form without explicit and written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action.

Learn to Evolve
Introduction to Linear Regression
Business Problem

I am the CEO of a hypermarket chain “Safegroceries” and I want to open a
new store that will give me the best sales. I am hiring “ALabs” to
help me figure out the location where the new store should be opened.

What should ALabs do?

• Safegroceries has more than 5,000 stores across the world
• It is an upscale hypermarket chain catering to high-end products
• There are more than 100 candidate locations to choose from

What could impact sales?

Population density in the area
Disposable income
Demographics of the region
Parking size of the location
Number of other grocery stores within 3 km
Credit card usage
Internet penetration/usage
Average number of cars per household
Average family size per household
Number of working people per household
…………………..
Relationship between Sales and Variables

Sales = f(X1, X2, X3, X4, X5, X6, …)

Sales = 10X1 + 20X2 + 0.5X3 + 8X4 + …

If the function is linear, we call it linear regression.

• This was a case of prediction. How about doing root-cause analysis?

Now the CEO wants to improve the performance of the existing stores and increase their sales.

Decision – Prediction vs. Inference (root cause)
Regression
Regression Analysis
“Regression analysis is a statistical tool for the
investigation of relationships between variables.
Usually, the investigator seeks to ascertain the causal
effect of one variable upon another”

Regression modeling
Establishing a functional relationship between a set of Explanatory or
Independent variables X1, X2, …, Xp with the Response or Dependent variable Y.

Y = f (X1, X2,.., Xp)


Example
In a credit card business, applications have come in for a new card, and we must decide whether to approve the credit or not.

Question
- Should we grant him/her the card?

Non-deterministic information (Y)
- Chances that the customer will default on his/her payments
- The maximum amount ($) that we may approve

Known information (X)
- Information on credit history, past transactions, and the financial status of the customer

A functional relationship between X and Y helps in deciding whether to approve the credit request.
More Business Examples
1. Does the income of an individual depend on demographics (age and years of education) and other factors?
2. Which of the retail image levers drives footfalls or conversions?
3. Which of the following levers impact the price of a house: the size of the house (in square feet), the number of bedrooms, the average
income in the respective neighborhood according to census data, and a subjective rating of the appeal of the house?
4. What drives satisfaction among branch users?
5. What causes high performance of a bank branch on the basis of financial parameters?

Nature of Explanatory & Dependent variables
An Explanatory variable could be

 Numerical
   Discrete: e.g. Number of satisfactory trades
   Continuous: e.g. Highest credit line
 Categorical
   Ordinal: e.g. Income group (High/Medium/Low)
   Nominal: e.g. Gender (Male/Female)

A Dependent variable could be

 Continuous: e.g. The total amount ($) that we may approve
 Discrete: e.g. Number of equipment items that may be funded
 Binary: e.g. Whether the customer would default on payment or not (1/0)
Types of Regression Models

Depending on the nature of the dependent variable Y:

• Continuous Y → OLS regression
• Binary (0/1) Y → Logistic model
• Y > 0 → Exponential model
• Count Y {0, 1, 2, 3, …} → Poisson model

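For reference, a minimal R sketch of how these model types are typically fitted. The data frame mydata and the columns y, y01, ycount, ypos, x1, x2 are hypothetical, and the Gamma log-link call is only one common way to handle a strictly positive response (the deck's "exponential model" may refer to a different specification):

# Hypothetical data frame 'mydata' with predictors x1, x2 and responses of each type
fit_ols      <- lm(y ~ x1 + x2, data = mydata)                          # continuous Y: OLS regression
fit_logit    <- glm(y01 ~ x1 + x2, family = binomial, data = mydata)    # binary (0/1) Y: logistic model
fit_poisson  <- glm(ycount ~ x1 + x2, family = poisson, data = mydata)  # count Y {0,1,2,...}: Poisson model
fit_positive <- glm(ypos ~ x1 + x2, family = Gamma(link = "log"),
                    data = mydata)                                      # Y > 0: one common choice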

Ordinary Least Square Linear
Regression Model
Curve Fitting
Linear regression is a special case of quantitative approaches known as curve fitting. Curve fitting attempts to
find the mathematical relationship which best describes a given set of data.

DIFFERENT UNDERLYING RELATIONSHIPS CAN BE MODELED WITH CURVE FITTING


[Scatter-plot sketches of four underlying relationships: Linear, Exponential, Logarithmic, and Polynomial]
What is OLS REGRESSION ANALYSIS?
OLS regression basically tries to draw the best-fit regression line: a line such that the sum of the squared vertical distances (residuals) of all the points from the line is minimized.

[Scatter plot: dependent variable (y) vs. independent variable (x), with the “best fit” line y = a + βx, intercept = a, slope = β, and the error (residual) term shown as the vertical distance of a point from the line]
Ordinary Least Squares (OLS) linear regression assumes that the underlying relationship between two variables can
best be described by a line.
Regression-Step-0
Step-0:

Identification of Dependent Variable


Example: Expected revenue from telecom license

Step-1:

Once we have selected the dependent variable we wish to predict, the first step before running a
regression is to identify what independent variables might influence the magnitude of the dependent
variable and why.
Regression-Step-1
COLLECTING AND GRAPHING THE DATA
The first step is to collect the necessary information and to enter it in a format that allows the user to graph
and later "regress" the data.

Company revenue (Y)   Customer income (X)
180                   10
100                   10
200                   20
290                   23
350                   30
240                   38
460                   44
300                   60
425                   62
400                   70

[Scatter plot: Company revenue (0 to 500) vs. Customer income (0 to 80)]

Plotting the data allows us to get a “first look” at the strength of our relationship.
Regression-Step-2
The way linear regression "works" is to start by naively fitting a horizontal, no-slope (slope = 0) line to the data. The
y-intercept of this line is simply the arithmetic average of the collected values of the dependent variable.

[Scatter plot: Company revenue vs. Customer income with the horizontal line Y = 295 (the average Y value) and the residuals e1 = -115, e2 = -195, e3 = -95, e4 = -5, …, e10 = 106 shown as vertical distances from the line]

The sum of the squared residuals, Sno-slope, gives us a measure of how well the horizontal line fits the data.

Sno-slope = (-115)² + (-195)² + (-95)² + (-5)² + … + (106)² = 121,523


Regression-Step-3
If we allow the line to vary in slope and intercept, we should be able to find that line which minimizes
the sum of squared residuals.

[Scatter plot: the “best fit” sloped line Revenue = 144 + 4.1 × (income), with intercept = 144, slope = 4.1, and the new residuals e1 = -5, e2 = -85, e3 = -26, e4 = 52, …, e10 = -31]

The new sum of squared residuals, Sslope, should be lower than Sno-slope if the new line provides a better fit to the data.

Sslope = (-5)² + (-85)² + (-26)² + … + (-31)² = 49,230


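A minimal R sketch reproducing the toy example above. The ten revenue/income pairs are read off the slide, so the computed values match the rounded figures (intercept ≈ 144, slope ≈ 4.1, Sno-slope ≈ 121,523, Sslope ≈ 49,230):

# Toy data from the slides: company revenue (Y) and customer income (X)
income  <- c(10, 10, 20, 23, 30, 38, 44, 60, 62, 70)
revenue <- c(180, 100, 200, 290, 350, 240, 460, 300, 425, 400)

# Step 2: horizontal (no-slope) line at the mean of Y
s_no_slope <- sum((revenue - mean(revenue))^2)   # about 121,523

# Step 3: let slope and intercept vary; lm() finds the least-squares line
fit <- lm(revenue ~ income)
coef(fit)                                        # intercept about 144, slope about 4.1
s_slope <- sum(residuals(fit)^2)                 # about 49,230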
3 CRITICAL ELEMENTS OF REGRESSION RESULTS
Since software packages like Excel will regress any stream of data regardless of its integrity, it is critical that we
review the regression results first to determine if a meaningful relationship exists between the two variables
before drawing any conclusions.

•Sign and magnitude of coefficients

•T-statistics

•R2-statistics
INTERPRETING THE COEFFICIENT – SIGN TEST
The coefficient of the independent variable represents our best estimate for the change in the
dependent variable given a one-unit change in the independent variable.

[Plot: Revenue vs. Income with the fitted line; coefficient, i.e. best estimate of the slope, = +4.1]

[Distribution of the estimate: best estimate = 4.1, standard error = 1.2]

If the sign of the resulting coefficient does not match the anticipated change in the dependent variable:
• Data may be corrupt (or incomplete), preventing the true relationship from appearing
• The true relationship between the variables may not be as strong as initially thought
• A counter-intuitive relationship might exist between the variables
INTERPRETING THE INTERCEPT
Similarly, the intercept represents our best estimate for the value of the dependent variable when
the value of the independent variable is zero.

[Distribution of the estimate: best estimate (intercept) = 144, standard error = 50]

• If the sign of the intercept does not match your expectation, the data may be corrupt or incomplete
• In some cases it is appropriate to force the regression to have an intercept of 0, if, for instance, no meaningful value exists when the independent variable is 0


T-STATISTICS
If the regression has passed the sign test, the single most important indicator of how strongly the
data support an underlying linear relationship between the dependent and independent
variables is the t-statistic.
Coefficient: best estimate = 4.1, standard error = 1.2, t-stat = 4.1 / 1.2 ≈ 3.4
Intercept:   best estimate = 144, standard error = 50,  t-stat = 144 / 50 ≈ 2.8

In general, a t-statistic of magnitude equal to or greater than 2 suggests a statistically significant relationship between the two variables.

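Continuing the same toy-data sketch, summary() reports each coefficient's estimate, standard error, and t value, where t = estimate / standard error (the rounded figures below are the ones quoted on the slides):

# t-statistics for the toy example: t value = Estimate / Std. Error
summary(fit)$coefficients
#             Estimate Std. Error t value
# (Intercept)    ~144       ~50     ~2.8
# income         ~4.1       ~1.2    ~3.4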

INTERPRETING THE R2-STATISTIC
If we are comfortable with the sign and magnitude of the coefficient and intercept, and our t-statistic is sufficiently
large to suggest a statistically significant relationship, then we can look at the R2-statistic.

The R²-statistic is the percent reduction in the sum of squared residuals from using our best-fit sloped line vs. a horizontal line:

R² = (Sno-slope − Sslope) / Sno-slope

With Sno-slope = 121,523 and Sslope = 49,230:

R² = (121,523 − 49,230) / 121,523 = 0.59

If the independent variable does not drive (or is not correlated with) the dependent variable in any way, we would expect no consistent change in "y" with consistently changing "x". This is true when the slope is zero, i.e. Sslope = Sno-slope, which makes R² = 0.
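And the R²-statistic, computed from the two residual sums in the sketch above and as reported by lm():

# R-squared as the percent reduction in squared residuals
r2 <- (s_no_slope - s_slope) / s_no_slope   # about 0.59
summary(fit)$r.squared                      # lm() reports the same quantity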
MULTIPLE REGRESSION
Multiple regression allows you to determine the estimated effect of multiple independent variables on the
dependent variable.

Dependent variable: Y
Independent variables: X1, X2, X3, …, Xn
Relationship: Y = a0 + a1X1 + a2X2 + a3X3 + … + anXn

Multiple regression programs will calculate the values of all the coefficients (a0 to an) and give measures of variability for each coefficient (i.e., R² and t-statistic) to determine fit.

Tests for multiple regressions
• Sign test – check the signs of the coefficients against the hypothesized change in the dependent variable
• T-statistic – check the t-stat for each coefficient to establish if t > 2 (for a “good fit”)
• R², adjusted R² – R² values increase with the number of variables; therefore check the adjusted R² value to establish a good fit (adjusted R² close to 1)
MULTIPLE REGRESSION
If you can dream up multiple independent variables or "drivers" of a dependent variable, you may want to use
multiple regression.

Dependent variable: y
Independent variables: x1, x2, …, xi
Slopes: a1, a2, …, ai
Intercept: b

y = a1x1 + a2x2 + … + aixi + b = b + Σ aixi

Multiple regression notes
• Having more independent variables always makes the fit better – even if it is not a statistically significant improvement. So:
  1. Do the sign check for all slopes and the intercept
  2. Check the t-stats (should be > 2) for all slopes and the intercept
  3. Use the adjusted R², which takes into account the false improvement due to multiple variables
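A short sketch of the last point, using simulated data (all names and values here are made up for illustration): adding a pure-noise predictor never lowers R², but adjusted R² penalizes it.

set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 5 + 2 * x1 + 3 * x2 + rnorm(n)
noise <- rnorm(n)                        # a predictor unrelated to y

fit_small <- lm(y ~ x1 + x2)
fit_big   <- lm(y ~ x1 + x2 + noise)     # extra, useless variable

summary(fit_small)$r.squared             # baseline R-squared
summary(fit_big)$r.squared               # never lower, usually slightly higher
summary(fit_small)$adj.r.squared         # adjusted R-squared penalizes the extra variable
summary(fit_big)$adj.r.squared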
Multiple regression – 4 primary issues
Multicollinearity
• What is it? High correlation among two or more of the independent variables
• Effect: Distorts the standard errors of the coefficients. This leads to a greater probability of incorrectly concluding that a variable is not statistically significant (Type II error)
• Detection: R-square is high and the F test is statistically significant, but t-tests indicate that none of the individual coefficients is significantly different from zero; VIF and condition index are very high
• Correction: Run a correlation matrix and drop one of the correlated variables

Serial correlation / Autocorrelation
• What is it? Residual terms are correlated with one another. It occurs most often with time series data
• Effect: Coefficient standard errors too large or too small, leading to erroneous t-statistics
• Detection: Scatter plot of residuals, or run the Durbin-Watson statistic
• Correction: Adjust coefficient standard errors using the Hansen method (available in SAS and SPSS). This will help in correct hypothesis testing of the regression coefficients

Heteroscedasticity
• What is it? Variance of the residual term increases as the value of the independent variable increases
• Effect: Standard errors will be different for different sets of the independent variable
• Detection: Examine a scatter plot of residuals, or run the Breusch-Pagan test
• Correction: Calculate robust standard errors (also called White-corrected standard errors) to recalculate the t-statistics

Outliers
• What is it? Some values are markedly different from the majority of the values
• Effect: The prediction line gets pulled up/down in the presence of outlier(s) and R-sq dips
• Detection: Examine a scatter plot, or do a univariate analysis and look at the 5, 10, 90, 95, 98, 99, 100 percentiles, or check a box plot
• Correction: Either drop the values, cap them at the closest observation, or replace them by the mean
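A quick sketch of how each of these four issues can be checked in R on a fitted lm object; fit and mydata are placeholders, and the helper functions are assumed to come from the car and lmtest packages:

library(car)     # vif(), durbinWatsonTest()
library(lmtest)  # bptest() - Breusch-Pagan test

fit <- lm(y ~ x1 + x2 + x3, data = mydata)   # 'mydata' is a placeholder

vif(fit)               # multicollinearity: large VIFs are a warning sign
durbinWatsonTest(fit)  # autocorrelation of residuals (statistic far from 2 is suspect)
bptest(fit)            # heteroscedasticity: Breusch-Pagan test
boxplot(mydata$y)      # outliers: quick univariate check of the response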
Regression-Best practices
1. Check for collinearity (by finding the correlation between all the variables and keeping only one of any pair of variables that are highly correlated)
2. Transform data as applicable – e.g., income is often transformed by taking its log (see the sketch after this list)
3. Do not run regression directly on categorical variables; recode them into dummy variables
4. Check the directionality (expected sign) of the variables
5. The following methods should be used in different situations:

▪ Enter method: To get the coefficient of each and every variable in the regression
▪ Backward method: When the model is exploratory, we start with all the variables and then remove the
insignificant ones
▪ Forward method: Sequentially add variables one at a time based on the strength of their squared semi-partial
correlations (or the simple bivariate correlation in the case of the first variable entered into the equation)
▪ Stepwise method: A combination of forward and backward; at each step one variable can be entered (on the basis of the greatest
improvement in R²), but one may also be removed if the change (reduction) in R² is not significant. (In the Bordens and
Abbott text this term appears to be used to mean forward regression.)
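A small sketch of points 2 and 3; the data frame df and the columns sales, income, and city are hypothetical:

# Hypothetical data with a skewed income column and a categorical city column
df <- data.frame(
  sales  = c(120, 150, 90, 200, 170),
  income = c(20000, 45000, 15000, 90000, 60000),
  city   = c("A", "B", "A", "C", "B")
)

df$log_income <- log(df$income)   # point 2: log-transform a skewed predictor

# Point 3: lm() expands factors into dummy (0/1) variables automatically;
# model.matrix() shows the dummy coding that will be used
df$city <- factor(df$city)
model.matrix(~ log_income + city, data = df)

fit <- lm(sales ~ log_income + city, data = df)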
Steps in Regression Model building

Formulate the Business problem as a statistical problem

Prepare the data required for model building

Develop the model

Validate the model

Identify the limitations of the model


Development of the model
Identify Explanatory and Response variables

Decide on the type of model – here the type of model is OLS

Variable selection – Forward, Backward, Stepwise

Check multicollinearity – VIF, Condition index, Variance proportions

Run the model – OLS

Diagnostics – for OLS

Model unsatisfactory? Try transformations – log, sqrt, inverse, Box-Cox, etc.
Diagnostics for OLS Model
Is the model satisfactory ?
 R2 = proportion of variation in the response variable explained by the model
- check R2 >50%

 Plots of Standardized Residual (= (Actual – Predicted)/SD)


- vs predicted values
- vs X variables
- check if there is no pattern
- check for homoscedasticity

 Significance of parameter estimates


- check if p-value<0.01

 Stability of parameter estimates:


- Take a random subsample from the development sample
- Obtain a new set of parameter estimates from the subsample
- Check if the parameter estimates obtained from the development sample and the subsample differ by less than 3 standard deviations

 Rank ordering (see the sketch below):
- Order the data in descending order of predicted values
- Break into 10 groups (deciles)
- Check if the average of the actuals is in the same order as the average of the predicted values
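A minimal sketch of the rank-ordering check, assuming a fitted model fit and a development data frame dev with the actual response in a column y:

# Sort by predicted value, split into 10 groups (deciles),
# then compare average actual vs. average predicted within each group
dev$pred   <- predict(fit, newdata = dev)
dev        <- dev[order(-dev$pred), ]                              # descending by prediction
dev$decile <- cut(seq_len(nrow(dev)), breaks = 10, labels = FALSE)

aggregate(cbind(y, pred) ~ decile, data = dev, FUN = mean)
# The average actuals should fall in (roughly) the same order as the average predictions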
Validation

On the validation sample

Stability of parameter estimates:
- Obtain a new set of parameter estimates from the validation sample
- Check if the new parameter estimates differ from those obtained from the development sample by less than 3 standard deviations

Compare predicted vs. actual values
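A sketch of this validation step, assuming development and validation data frames dev and val with the same columns (y, x1, x2 are placeholders):

# Refit the same specification on the validation sample and compare coefficients
fit_dev <- lm(y ~ x1 + x2, data = dev)
fit_val <- lm(y ~ x1 + x2, data = val)

# Difference in estimates, expressed in development-sample standard errors
(coef(fit_val) - coef(fit_dev)) / summary(fit_dev)$coefficients[, "Std. Error"]
# Values within +/- 3 suggest stable parameter estimates

# Compare predicted vs. actual values on the validation sample
plot(predict(fit_dev, newdata = val), val$y, xlab = "Predicted", ylab = "Actual")
abline(0, 1)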
Q&A
Sample ‘R’ Codes
Linear models

> myModel <- lm(Learning ~ Pre1 + Pre2 + Pre3 + Pre4)   # fit the linear model

> par(mfrow=c(2,2))   # arrange the four diagnostic plots in a 2x2 grid
> plot(myModel)       # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
Linear models

> summary(myModel)

Call:
lm(formula = Learning ~ Pre1 + Pre2 + Pre3 + Pre4)

Residuals:
Min 1Q Median 3Q Max
-0.40518 -0.08460 0.01707 0.09170 0.29074

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.22037 0.11536 -1.910 0.061055 .
Pre1 1.05299 0.12636 8.333 1.70e-11 ***
Pre2 0.41298 0.10926 3.780 0.000373 ***
Pre3 0.07339 0.07653 0.959 0.341541
Pre4 -0.18457 0.11318 -1.631 0.108369
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1447 on 58 degrees of freedom


Multiple R-squared: 0.6677, Adjusted R-squared: 0.6448
F-statistic: 29.14 on 4 and 58 DF, p-value: 2.710e-13
Linear models

> step(myModel, direction="backward")
Start:  AIC=-238.8
Learning ~ Pre1 + Pre2 + Pre3 + Pre4

       Df Sum of Sq    RSS     AIC
- Pre3  1   0.01925 1.2332 -239.81
<none>              1.2140 -238.80
- Pre4  1   0.05566 1.2696 -237.98
- Pre2  1   0.29902 1.5130 -226.93
- Pre1  1   1.45347 2.6675 -191.21

Step:  AIC=-239.81
Learning ~ Pre1 + Pre2 + Pre4

       Df Sum of Sq    RSS     AIC
- Pre4  1   0.03810 1.2713 -239.89
<none>              1.2332 -239.81
- Pre2  1   0.28225 1.5155 -228.83
- Pre1  1   1.54780 2.7810 -190.58

Step:  AIC=-239.89
Learning ~ Pre1 + Pre2

       Df Sum of Sq    RSS     AIC
<none>              1.2713 -239.89
- Pre2  1   0.24997 1.5213 -230.59
- Pre1  1   1.52516 2.7965 -192.23

Call:
lm(formula = Learning ~ Pre1 + Pre2)

Coefficients:
(Intercept)         Pre1         Pre2
    -0.2864       1.0629       0.3627
Linear models
• R provides comprehensive support for multiple linear regression. The topics below are provided in order of increasing complexity
• Fitting the Model

# Multiple Linear Regression Example


> fit <- lm(y ~ x1 + x2 + x3, data=mydata)
> summary(fit) # show results

# Other useful functions


> coefficients(fit) # model coefficients
> confint(fit, level=0.95) # CIs for model parameters
> fitted(fit) # predicted values
> residuals(fit) # residuals
> anova(fit) # anova table
> vcov(fit) # covariance matrix for model parameters
> influence(fit) # regression diagnostics
Regression Diagnostics: Code
# Assessing Outliers (functions from the car package; current versions use
# the camelCase names shown here in place of the older dot names)
> library(car)
> outlierTest(fit)  # Bonferroni p-value for most extreme obs
> qqPlot(fit, main="QQ Plot")  # qq plot for studentized resid
> layout(matrix(c(1,2,3,4,5,6),2,3))  # optional layout
> leveragePlots(fit, ask=FALSE)  # leverage plots

# Influential Observations
# added variable plots
> avPlots(fit, ask=FALSE)
# Cook's D plot
# identify D values > 4/(n-k-1)
> cutoff <- 4/(nrow(mydata)-length(fit$coefficients)-2)
> plot(fit, which=4, cook.levels=cutoff)
# Influence Plot
> influencePlot(fit, main="Influence Plot",
+   sub="Circle size is proportional to Cook's Distance")

# Evaluate Nonlinearity
# component + residual plot
> crPlots(fit, ask=FALSE)
# Ceres plots
> ceresPlots(fit, ask=FALSE)
Regression Diagnostics: Code
# Normality of Residuals
# qq plot for studentized resid (car package)
> library(car)
> qqPlot(fit, main="QQ Plot")
# distribution of studentized residuals
> library(MASS)
> sresid <- studres(fit)
> hist(sresid, freq=FALSE,
+   main="Distribution of Studentized Residuals")
> xfit <- seq(min(sresid), max(sresid), length=40)
> yfit <- dnorm(xfit)
> lines(xfit, yfit)

# Evaluate homoscedasticity
# non-constant error variance test
> ncvTest(fit)
# plot studentized residuals vs. fitted values
> spreadLevelPlot(fit)

# Evaluate Collinearity
> vif(fit)  # variance inflation factors
> sqrt(vif(fit)) > 2  # problem?
Non-independence of Errors
# Test for Autocorrelated Errors
> durbinWatsonTest(fit)   # from the car package

Additional Diagnostic Help


The gvlma( ) function in the gvlma package performs a global validation of linear model assumptions as well as separate evaluations of skewness,
kurtosis, and heteroscedasticity.

# Global test of model assumptions


> library(gvlma)
> gvmodel <- gvlma(fit)
> summary(gvmodel)
Comparing Models
• You can compare nested models with the anova( ) function. The following code provides a simultaneous test that x3 and x4 add to linear prediction
above and beyond x1 and x2.

# compare models
> fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
> fit2 <- lm(y ~ x1 + x2, data=mydata)   # reduced model, fit on the same data
> anova(fit1, fit2)
Cross Validation
• You can do K-Fold cross-validation using the cv.lm( ) function in the DAAG package
# K-fold cross-validation
> library(DAAG)
> cv.lm(df=mydata, fit, m=3) # 3 fold cross-validation
• Sum the MSE for each fold, divide by the number of observations, and take the square root to get the cross-validated standard error of estimate
• You can assess R2 shrinkage via K-fold cross-validation. Using the crossval() function from the bootstrap package, do the following:
# Assessing R2 shrinkage using 10-Fold Cross-Validation
> fit <- lm(y~x1+x2+x3,data=mydata)

> library(bootstrap)
# define functions
> theta.fit <- function(x,y){lsfit(x,y)}
> theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}

# matrix of predictors
> X <- as.matrix(mydata[c("x1","x2","x3")])
# vector of predicted values
> y <- as.matrix(mydata[c("y")])

> results <- crossval(X,y,theta.fit,theta.predict,ngroup=10)


> cor(y, fit$fitted.values)**2 # raw R2
> cor(y,results$cv.fit)**2 # cross-validated R2
Variable Selection
• Selecting a subset of predictor variables from a larger set (e.g., stepwise selection) is a controversial topic
• You can perform stepwise selection (forward, backward, both) using the stepAIC( ) function from the MASS package. stepAIC( ) performs stepwise
model selection by exact AIC.

# Stepwise Regression
> library(MASS)
> fit <- lm(y~x1+x2+x3,data=mydata)
> step <- stepAIC(fit, direction="both")
> step$anova # display results
Variable Selection
• Alternatively, you can perform all-subsets regression using the regsubsets( ) function from the leaps package
• In the following code nbest indicates the number of subsets of each size to report
• Here, the ten best models will be reported for each subset size (1 predictor, 2 predictors, etc.)

# All Subsets Regression


> library(leaps)
> attach(mydata)
> leaps<-regsubsets(y~x1+x2+x3+x4,data=mydata,nbest=10)
# view results
> summary(leaps)
# plot a table of models showing variables in each model.
# models are ordered by the selection statistic.
> plot(leaps,scale="r2")
# plot statistic by subset size
> library(car)
> subsets(leaps, statistic="rsq")
Variable Selection
• Other options for plot( ) are bic, Cp, and adjr2. Other options for plotting with
subsets( ) are bic, cp, adjr2, and rss.
• Relative Importance
• The relaimpo package provides measures of relative importance for each of the predictors in the model. See help(calc.relimp) for details on the
four measures of relative importance provided

# Calculate Relative Importance for Each Predictor


> library(relaimpo)
> calc.relimp(fit,type=c("lmg","last","first","pratt"),
+ rela=TRUE)

# Bootstrap Measures of Relative Importance (1000 samples)


> boot <- boot.relimp(fit, b = 1000, type = c("lmg",
+"last", "first", "pratt"), rank = TRUE,
+ diff = TRUE, rela = TRUE)
> booteval.relimp(boot) # print result
> plot(booteval.relimp(boot,sort=TRUE)) # plot result
Contact us
Visit us on: http://www.analytixlabs.in/

For course registration, please visit: http://www.analytixlabs.co.in/course-registration/

For more information, please contact us: http://www.analytixlabs.co.in/contact-us/


Or email: info@analytixlabs.co.in
Call us we would love to speak with you: (+91) 88021-73069

Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/
