Linear Regression in R
Danilo Bianchi
Introduction
Learning Objectives
Problem Definition
Exploratory Data Analysis (EDA)
Univariate Model
Residuals Diagnostics
Residuals vs Fitted
Normal Q-Q Plot
Scale-Location
Residuals vs Leverage
Multi-variate Model
Collinearity Test
Model Search
Exercises

Introduction
Learning Objectives
1. Apply a linear regression in R
2. Interpret the linear model outputs in R
3. Assess the linear model validity (perform diagnostics tests)
4. Choose significant variables for the model
5. Compare and choose the best alternative from a selection of models

We assume you already know what a linear regression is.

Problem Definition
We have the credit data for several clients of a bank.

We’d like to know the characteristics of clients who bring in high revenue, so we know whom to target.

It would also be nice to understand what kinds of clients bring in low revenue, so we can develop strategies to
increase it.

Exploratory Data Analysis (EDA)


The main objectives of an EDA are to check and understand our data and to detect any interesting patterns.

We first load the data and get a high-level summary (remember to set the right working directory with setwd).

The summary command gives an overview of all the variables and their distribution.

For numeric variables, summary will give the minimum and maximum values, as well as the mean, the median, and the first and third quartiles.


For categorical variables, summary will give the number of occurrences in each category.

# Load data
credit<-read.csv("Credit_Data.csv")
# Get summary for each variable
summary(credit)

## StatExCheAcc Duration CredHist Purpose CredAmt
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972
## Max. :72.0 A46 : 50 Max. :18424
## (Other): 55
## SavBal EmpTen PerStat Othdebt PresRes Property
## A61:603 A71: 62 A91: 50 A101:907 Min. :1.000 A121:282
## A62:103 A72:172 A92:310 A102: 41 1st Qu.:2.000 A122:232
## A63: 63 A73:339 A93:548 A103: 52 Median :3.000 A123:332
## A64: 48 A74:174 A94: 92 Mean :2.845 A124:154
## A65:183 A75:253 3rd Qu.:4.000
## Max. :4.000
##
## Age House NumbCred Job Default
## Min. :19.00 A151:179 Min. :1.000 A171: 22 Min. :1.0
## 1st Qu.:27.00 A152:713 1st Qu.:1.000 A172:200 1st Qu.:1.0
## Median :33.00 A153:108 Median :1.000 A173:630 Median :1.0
## Mean :35.55 Mean :1.407 A174:148 Mean :1.3
## 3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:2.0
## Max. :75.00 Max. :4.000 Max. :2.0
##
## Revenue
## Min. : -6.532
## 1st Qu.: 55.022
## Median : 86.719
## Mean :112.213
## 3rd Qu.:138.224
## Max. :523.219
##

We see that the variable Default is numeric although it should be categorical (Yes/No). Let’s fix this.

# Rename "Default" values for legibility
credit$Default[credit$Default == 1] <- 'Yes'
credit$Default[credit$Default == 2] <- 'No'
#Change "Default" variable to categorical (factor)
credit$Default <- as.factor(credit$Default)
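
As a quick check that the conversion worked (a small addition to the original walkthrough), a frequency table shows the new factor levels and their counts:

# Count the occurrences of each Default level after the recode
table(credit$Default)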

While the summary helped us understand our data, it won’t get us far in understanding what drives
revenue. For this, it’s helpful to plot every variable against revenue to see if we can find any visual pattern.
Moreover, because a linear model is sensitive to multicollinearity (high correlation between explanatory variables),
it would also be valuable to see the correlation between each variable and all the others. This situation is so common
that there’s already a function in R to do that for us.

# Create a scatterplot matrix of all the selected variables
pairs(~ Revenue + CredHist + CredAmt + EmpTen + Age + Default, data = credit)

The credit amount seems to have a fairly strong correlation with Revenue. Let’s look at both more closely and build a
linear regression model between them.

Univariate Model
Let’s create a linear regression using “Revenue” as the dependent variable and “Credit Amount” as the
explanatory (independent) variable.

The R function lm lets us do this by specifying the dependent variable, followed by a tilde (~), and then the
independent variables. To simplify, the function lets us specify the data frame we want to use as a source, so we
don’t need to repeat it for every variable.

Lastly, we can save the model in an R variable (“fit1” in this case).

# Create a linear regression model
fit1 <- lm(Revenue ~ CredAmt, data = credit)
# Provide a summary for the model
summary(fit1)


##
## Call:
## lm(formula = Revenue ~ CredAmt, data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.530 -12.946 0.615 13.730 39.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.490e+01 8.623e-01 17.28 <2e-16 ***
## CredAmt 2.975e-02 1.996e-04 149.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.81 on 998 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9569
## F-statistic: 2.221e+04 on 1 and 998 DF, p-value: < 2.2e-16

Now that we have the model, we need to know how to interpret the output.

Call shows the formula used to create the model. This is very useful when dealing with several models.

Residuals (y - yhat) are the differences between the real values of the dependent variable (y) and the modeled
(aka predicted) values produced using the independent variables (yhat). Because we have one residual for each data point, the
output shows the min, max, and quartiles of the residuals’ distribution. This should give us a notion of whether the residuals
are normally distributed. Note that the mean of the residuals should be zero < mean(resid(fit1)) =
1.9850688 × 10^-16 >!

Coefficients gives information on the value (Estimate), error (Std. Error), and significance (Pr(>|t|)) of each of
the coefficients considered in the model.

Signif. codes is the legend for the markup next to the significance of the coefficients.

Residual standard error (RSE) is an estimate of the standard deviation of the residuals. That is, in a linear
regression, the residuals are assumed to be normally distributed with mean zero (see Residuals above) and
some standard deviation. The RSE is an estimate of that standard deviation < sd(resid(fit1)) = 17.8006252 >.
The smaller the RSE, the lower the variance between the real and the modeled values, i.e. the better the
model.
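
Since the RSE divides by the residual degrees of freedom rather than by n - 1, it differs slightly from sd(resid(fit1)). As a small check (a sketch we add here, not part of the original text), we can reconstruct the reported RSE directly from the residuals:

# RSE = sqrt(RSS / residual degrees of freedom)
e <- resid(fit1)
sqrt(sum(e^2) / df.residual(fit1)) # matches the 17.81 reported by summary(fit1)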

Multiple R-squared is a measure of how much variance is explained by the model. If R-squared is 1, all the
variance is explained by the model. The multiple R-squared always increases as we add more variables to the
model.

Adjusted R-squared is the R-squared adjusted for the complexity of the model; it takes the number of
variables in the model into account. We should use this measure to compare models with different numbers of variables. It may
decrease if an additional variable does not contribute to explaining the variance.
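
As an illustration (a sketch we add here, not part of the original text), the adjusted R-squared can be reconstructed from the multiple R-squared, the number of observations, and the number of predictors:

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- summary(fit1)$r.squared
n <- nobs(fit1)
p <- length(coef(fit1)) - 1 # number of predictors, excluding the intercept
1 - (1 - r2) * (n - 1) / (n - p - 1) # matches the 0.9569 reported above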

F-statistic gives the significance of the model as a whole rather than of a single variable. It tests the null hypothesis that all the
model coefficients are zero. The F-statistic is the ratio of the explained to the unexplained variance (see < anova(fit1) >).
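
The decomposition behind this ratio can be inspected directly, and the F-statistic rebuilt from it (a short sketch added here for illustration):

# The ANOVA table splits the variance into explained (CredAmt) and unexplained (Residuals)
a <- anova(fit1)
# Ratio of the mean squares; matches the F-statistic of 2.221e+04 above
a["CredAmt", "Mean Sq"] / a["Residuals", "Mean Sq"]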

(For more help interpreting the output, check the Stack Exchange discussion on R’s lm output
(http://stats.stackexchange.com/questions/5135/interpretation-of-rs-lm-output) and UCLA’s Annotated
Regression Output in SAS (http://www.ats.ucla.edu/stat/sas/output/reg.htm).)


Perhaps it’s easier to understand these concepts in a plot. We can use the lm object (fit1) we created to retrieve
relevant information about the model and create the plot (for more on plot formats, click here
(http://www.statmethods.net/advgraphs/parameters.html)).

## Creating a linear fit for Revenue vs. Credit Amount

# Get the predicted values
yhat <- predict(fit1)

# Get the residuals (y - yhat)
e <- resid(fit1)

# assign axes
y <- credit$Revenue
x <- credit$CredAmt

# Generate scatterplot
plot(credit$CredAmt, credit$Revenue, # x vs y axes
     xlab = "Independent (Credit Amount)", # x label
     ylab = "Dependent (Revenue)", # y label
     pch = 21, # point type (21 = filled circle)
     col = "black", # circle color
     bg = "gray", # fill color
     cex = 1, # size (x default)
     frame = TRUE)

# Plot residuals
n <- length(y)
for (i in 1 : n)
lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "blue" , lwd = 0.5, lty=1)

# Plot the linear fit
abline(fit1, lwd = 2, lty = 1, col = "red")


From the chart above, we can see some of the relevant elements in a linear model:

the real values for revenue (y), plotted as gray points
the predicted values for revenue (yhat), plotted as a solid red line (with intercept and slope given by the
model)
the residuals (y - yhat), plotted as vertical blue lines

Now that we can understand and interpret what our model is telling us, we need to assess whether it is a valid model.

Residuals Diagnostics
In a model diagnosis, we check the assumptions underlying the linear regression. We’d also like to verify whether
there are any influential points in our dataset.

There’s a quick command that will generate four diagnostic plots to let us do that.

## Create diagnostics plots for this model
# Divide the plotting area in four
par(mfrow=c(2,2))
# Generate diagnostics plots
plot(fit1)


Let us look at each of those individually.

Residuals vs Fitted
We assumed that the residuals were homogeneously distributed (homoscedasticity). That is to say, there
shouldn’t be any pattern or dependence of the residuals on the predicted values. We can verify that by plotting the
residuals against the predicted values.

In this graph, we see that the residuals are fairly evenly distributed around zero, and there’s no pattern in how
they’re distributed.

## Plot residuals against the predicted values

# An easier way is to write plot(fit1, which = 1)
# Get the residuals
e = resid(fit1)
# Get the predicted values
x = fitted(fit1)
# Generate the plot
plot(x, e,
     main = "Residuals vs Fitted",
     xlab = "Fitted values",
     ylab = "Residuals",
     col = "black", bg = "gray", cex = 1, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2, lty = 2)
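
As a formal complement to this visual check, we could also run a Breusch-Pagan test (a sketch; it assumes the lmtest package is installed, which the original walkthrough does not use):

# Breusch-Pagan test: a small p-value would suggest heteroscedasticity
library(lmtest)
bptest(fit1)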


Normal Q-Q Plot


Q-Q plots are useful for comparing two distributions. In a linear regression, the Q-Q plot compares the distribution of the residuals
against a theoretical normal distribution (with mean zero and standard deviation given by the RSE). With it, we can
check whether the residuals are normally distributed.

# generate qq-plot
plot(fit1,which=2)


In our model, the Q-Q plot lies fairly well on a straight line, deviating at the extremes. This indicates the residuals are approximately normally
distributed, with slightly long tails. We can compare the Q-Q plot with a histogram of the residuals.

## Determine if residuals are normally distributed
m = mean(resid(fit1))
std = sd(resid(fit1))
# plot histogram
hist(resid(fit1),breaks=15,main="Histogram of Residuals", prob=TRUE, xlab="Residuals")
curve( dnorm(x, mean = m, sd = std),
col="darkblue",
lwd=2,
add=TRUE, yaxt="n")


If the normality test fails, the R-squared and coefficient estimates would still be right. However, we wouldn’t be able to
trust the standard errors or the p-values.
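
For a formal complement to these visual checks, we could also run a normality test on the residuals; base R’s Shapiro-Wilk test is one option (a sketch added here):

# Shapiro-Wilk test: a small p-value would indicate departure from normality
shapiro.test(resid(fit1))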

Scale-Location
This graph is similar to Residuals vs. Fitted. The difference is that the residuals are standardized: the y-axis
is always positive and shows the square root of the absolute standardized residuals, i.e. how many standard deviations a residual falls from zero.

With it, we can detect outliers, which would show high standardized residuals (far from zero). It also
lets us check again for heteroscedasticity by looking for any pattern. In this case, the red line seems fairly
well-behaved.

# Generate scale-location plot
plot(fit1, which = 3)


Residuals vs Leverage
With this plot, we have a much better view of the most influential points in our dataset.

To understand what an influential point is, we first need to understand leverage and outliers.

Leverage is a measure of how much weight a point has on the linear fit. In other words, it
measures how much the fitted value at that point would change with a one-unit change in its y-value. High-leverage
points are points that fall far from the mean of the predictor (i.e. very high/low x compared to the rest of the dataset).

Outliers are data points with large residuals (i.e. very high/low y compared to similar x’s).

Influence is the measure of how much the coefficients would change if a certain point were removed. Highly
influential points are outliers with high leverage.

Now that we understand these concepts, we can interpret the last plot.

# Plot Residuals vs Leverage
plot(fit1, which = 5)


The Residuals vs Leverage plot helps us identify influential points by showing their leverage and residuals.
Outliers are points with a high standardized residual (measured in standard deviations); these appear with a y
value far from zero. Points with high leverage appear with an x value far from zero. Thus, influential points are
those with both high x and y values, located in the top/bottom right corners of the graph.

To help us identify influential points, this plot also shows the Cook’s distance (a measure of influence) for those
points as dashed red lines. Points falling outside the boundaries marked by those lines can be considered highly
influential and deserve a closer look and special treatment. In our case, observation 916 appears quite
influential, but it still falls within the boundary of an acceptable Cook’s distance (the boundary lines are not shown because no point comes near them).
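
We can also inspect these quantities numerically (a sketch we add here): base R exposes the leverage and the Cook’s distance of every observation:

# Observations with the largest Cook's distances (most influential)
head(sort(cooks.distance(fit1), decreasing = TRUE))
# Observations with the highest leverage
head(sort(hatvalues(fit1), decreasing = TRUE))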

Multi-variate Model
When building a model, we’re not constrained to using only one variable. We can try to enhance our predictive
power by adding more explanatory variables to the model.

Let us build a model that considers both Credit Amount and Credit history.

# Build a multivariate model
fit2 <- lm(Revenue ~ CredAmt + CredHist, data = credit)
# Get its summary
summary(fit2)


##
## Call:
## lm(formula = Revenue ~ CredAmt + CredHist, data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.213 -12.169 -0.693 12.684 31.688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1700383 2.6588002 -3.449 0.000586 ***
## CredAmt 0.0299477 0.0001794 166.957 < 2e-16 ***
## CredHistA31 7.5728079 3.3643153 2.251 0.024608 *
## CredHistA32 20.1922327 2.6064696 7.747 2.32e-14 ***
## CredHistA33 28.9940478 2.9995753 9.666 < 2e-16 ***
## CredHistA34 33.4208933 2.6764075 12.487 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.7 on 994 degrees of freedom
## Multiple R-squared: 0.9667, Adjusted R-squared: 0.9665
## F-statistic: 5772 on 5 and 994 DF, p-value: < 2.2e-16

It looks like this model is better than the previous one. That is, Credit History helps us explain a little more
variance than using Credit Amount alone (higher adjusted R-squared).

Collinearity Test
However, we should now check another assumption: the explanatory variables should be uncorrelated. To
check that, let us create a correlation table.

# Check the correlations between revenue, credit amount, and credit history
cor(cbind(credit$Revenue, credit$CredAmt, credit$CredHist))

## [,1] [,2] [,3]
## [1,] 1.00000000 0.97825914 0.03793717
## [2,] 0.97825914 1.00000000 -0.05990549
## [3,] 0.03793717 -0.05990549 1.00000000

# plot
plot(credit$CredHist,credit$CredAmt,
xlab = "Credit History",
ylab = "Credit Amount")


It looks like the explanatory variables are fairly uncorrelated.
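
A more direct collinearity check is the variance inflation factor (a sketch; it assumes the car package is installed, which this tutorial does not otherwise use). Values near 1 indicate little collinearity, while values above roughly 5-10 are a warning sign:

# VIF for each explanatory variable in the multivariate model
library(car)
vif(fit2)

For a categorical variable like CredHist, vif reports a generalized VIF (GVIF) rather than a plain VIF.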

But do we have the best model? Adding one variable increased the predictive power of the model. We still have
several other variables to test. How can we select a better model?

For that, we’d have to compare all the bi-variate models using all the variables, then all the multi-variate models with 3
variables, 4 variables, and so on…

Model Search
As you guessed, there’s an automatic way to select good models. We will touch on two of them:

Forward selection: in this method, we test all univariate models and select the best one. Then, we test all the
remaining variables to build a bi-variate model, and so on. We continue until adding another variable no longer
improves the model by a certain threshold.
Backward elimination: the idea is similar to forward selection, but we start with a model that has all the
variables and remove the worst one. We test the model again and again, stopping once all the remaining
variables meet a certain significance threshold.

In R, the step function implements both methods, using the AIC as its selection criterion.


## Perform a model selection

# Define a model with no variables
null <- lm(Revenue ~ 1, data = credit)

# Define a model with all variables
full <- lm(Revenue ~ ., data = credit)

# Perform a forward selection (trace = 1 shows results at each step)
fit_for <- step(null, scope = list(lower = null, upper = full), trace = 1, direction = "forward")


## Start: AIC=8905.81
## Revenue ~ 1
##
## Df Sum of Sq RSS AIC
## + CredAmt 1 7043426 316545 5761.5
## + Duration 1 2772613 4587359 8435.1
## + Purpose 9 953875 6406097 8785.0
## + Job 3 775594 6584378 8800.5
## + Property 3 657661 6702311 8818.2
## + House 2 221244 7138728 8879.3
## + CredHist 4 242672 7117300 8880.3
## + PerStat 3 217711 7142261 8881.8
## + Default 1 164672 7195300 8885.2
## + StatExCheAcc 3 131926 7228046 8893.7
## + SavBal 4 117499 7242473 8897.7
## + Othdebt 2 76497 7283475 8899.4
## + EmpTen 4 102600 7257372 8899.8
## + NumbCred 1 25337 7334635 8904.4
## <none> 7359972 8905.8
## + Age 1 3432 7356540 8907.3
## + PresRes 1 2181 7357790 8907.5
##
## Step: AIC=5761.47
## Revenue ~ CredAmt
##
## Df Sum of Sq RSS AIC
## + CredHist 4 71485 245060 5513.5
## + Age 1 21166 295379 5694.3
## + NumbCred 1 10818 305727 5728.7
## + House 2 4513 312033 5751.1
## + EmpTen 4 4579 311967 5754.9
## + Property 3 3673 312873 5755.8
## + PresRes 1 905 315641 5760.6
## <none> 316545 5761.5
## + StatExCheAcc 3 1426 315119 5763.0
## + Duration 1 68 316477 5763.3
## + Default 1 24 316521 5763.4
## + PerStat 3 1259 315286 5763.5
## + Othdebt 2 380 316166 5764.3
## + Job 3 366 316180 5766.3
## + SavBal 4 751 315794 5767.1
## + Purpose 9 2107 314439 5772.8
##
## Step: AIC=5513.5
## Revenue ~ CredAmt + CredHist
##
## Df Sum of Sq RSS AIC
## + Age 1 33482 211579 5368.6
## + House 2 5111 239949 5496.4
## + EmpTen 4 5981 239080 5496.8
## + Default 1 3365 241695 5501.7
## + Property 3 3503 241557 5505.1
## + PerStat 3 3458 241602 5505.3
## + PresRes 1 1897 243164 5507.7
## + SavBal 4 1960 243101 5513.5
## <none> 245060 5513.5
## + Duration 1 453 244608 5513.7
## + Othdebt 2 676 244384 5514.7
## + NumbCred 1 12 245048 5515.5
## + StatExCheAcc 3 775 244285 5516.3
## + Job 3 584 244477 5517.1
## + Purpose 9 3092 241968 5518.8
##
## Step: AIC=5368.6
## Revenue ~ CredAmt + CredHist + Age
##
## Df Sum of Sq RSS AIC
## + Default 1 1983.50 209595 5361.2
## <none> 211579 5368.6
## + NumbCred 1 255.34 211324 5369.4
## + EmpTen 4 1505.60 210073 5369.5
## + StatExCheAcc 3 982.11 210597 5369.9
## + Duration 1 76.31 211503 5370.2
## + Othdebt 2 456.57 211122 5370.4
## + PresRes 1 9.89 211569 5370.6
## + House 2 326.80 211252 5371.1
## + PerStat 3 657.98 210921 5371.5
## + Property 3 522.20 211057 5372.1
## + Job 3 388.57 211190 5372.8
## + SavBal 4 673.54 210905 5373.4
## + Purpose 9 1293.16 210286 5380.5
##
## Step: AIC=5361.18
## Revenue ~ CredAmt + CredHist + Age + Default
##
## Df Sum of Sq RSS AIC
## + EmpTen 4 1892.20 207703 5360.1
## <none> 209595 5361.2
## + NumbCred 1 201.53 209394 5362.2
## + Othdebt 2 505.83 209090 5362.8
## + StatExCheAcc 3 878.58 208717 5363.0
## + Duration 1 6.94 209588 5363.1
## + PresRes 1 3.45 209592 5363.2
## + House 2 421.52 209174 5363.2
## + Property 3 709.75 208886 5363.8
## + PerStat 3 483.49 209112 5364.9
## + Job 3 370.40 209225 5365.4
## + SavBal 4 422.37 209173 5367.2
## + Purpose 9 1582.91 208013 5371.6
##
## Step: AIC=5360.11
## Revenue ~ CredAmt + CredHist + Age + Default + EmpTen
##
## Df Sum of Sq RSS AIC
## <none> 207703 5360.1
## + StatExCheAcc 3 980.46 206723 5361.4
## + NumbCred 1 133.11 207570 5361.5
## + Othdebt 2 495.20 207208 5361.7
## + Duration 1 3.17 207700 5362.1
## + PresRes 1 2.95 207700 5362.1
## + House 2 406.37 207297 5362.2
## + Job 3 796.93 206906 5362.3
## + PerStat 3 759.79 206943 5362.4
## + Property 3 638.32 207065 5363.0
## + SavBal 4 394.72 207309 5366.2
## + Purpose 9 1448.26 206255 5371.1

summary(fit_for)

##
## Call:
## lm(formula = Revenue ~ CredAmt + CredHist + Age + Default + EmpTen,
## data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.0489 -11.9743 0.6923 12.0077 27.9325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3603090 3.6153166 2.589 0.009765 **
## CredAmt 0.0299103 0.0001686 177.383 < 2e-16 ***
## CredHistA31 8.2618879 3.1197936 2.648 0.008221 **
## CredHistA32 20.6640166 2.4282225 8.510 < 2e-16 ***
## CredHistA33 30.5193670 2.7925681 10.929 < 2e-16 ***
## CredHistA34 36.4678345 2.5235479 14.451 < 2e-16 ***
## Age -0.5137518 0.0446453 -11.507 < 2e-16 ***
## DefaultYes -3.5336624 1.0524095 -3.358 0.000816 ***
## EmpTenA72 -0.4102247 2.2082353 -0.186 0.852663
## EmpTenA73 0.2801253 2.0419229 0.137 0.890911
## EmpTenA74 3.7774582 2.1747553 1.737 0.082706 .
## EmpTenA75 1.3238294 2.0697135 0.640 0.522568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.5 on 988 degrees of freedom
## Multiple R-squared: 0.9718, Adjusted R-squared: 0.9715
## F-statistic: 3093 on 11 and 988 DF, p-value: < 2.2e-16

# Perform a backward elimination (trace = 0 hides the step-by-step results)
fit_back <- step(full, trace = 0, direction = "backward")
summary(fit_back)


##
## Call:
## lm(formula = Revenue ~ CredHist + CredAmt + EmpTen + Age + Default,
## data = credit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.0489 -11.9743 0.6923 12.0077 27.9325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3603090 3.6153166 2.589 0.009765 **
## CredHistA31 8.2618879 3.1197936 2.648 0.008221 **
## CredHistA32 20.6640166 2.4282225 8.510 < 2e-16 ***
## CredHistA33 30.5193670 2.7925681 10.929 < 2e-16 ***
## CredHistA34 36.4678345 2.5235479 14.451 < 2e-16 ***
## CredAmt 0.0299103 0.0001686 177.383 < 2e-16 ***
## EmpTenA72 -0.4102247 2.2082353 -0.186 0.852663
## EmpTenA73 0.2801253 2.0419229 0.137 0.890911
## EmpTenA74 3.7774582 2.1747553 1.737 0.082706 .
## EmpTenA75 1.3238294 2.0697135 0.640 0.522568
## Age -0.5137518 0.0446453 -11.507 < 2e-16 ***
## DefaultYes -3.5336624 1.0524095 -3.358 0.000816 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.5 on 988 degrees of freedom
## Multiple R-squared: 0.9718, Adjusted R-squared: 0.9715
## F-statistic: 3093 on 11 and 988 DF, p-value: < 2.2e-16

For this dataset, both methods returned the same model. However, the two methods can sometimes return
different results.
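
A third option worth knowing (a sketch added here, reusing the null and full models defined above) is stepwise selection in both directions, which considers dropping as well as adding a variable at every step:

# Stepwise selection in both directions (trace = 0 hides the steps)
fit_both <- step(null, scope = list(lower = null, upper = full),
                 direction = "both", trace = 0)
summary(fit_both)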

Exercises
1. Select 3 variables to build a model to explain the Revenue (fit3).
2. At first glance, is it better or worse than our previous models (fit1 and fit2)? Why?
3. Is it a valid model? Why?
