# Regression With Stata

## Interpreting regression output

Suppose I am investigating the relationship between types of cars and their miles per gallon.
My hypothesis is that luxury models are gas guzzlers. I am testing this hypothesis using 1978
auto data. I use weight as a proxy for luxury models, as I expect luxury cars are heavier. It
also seems to make sense that heavier cars would use more gas. At the command window,
type:

sysuse auto

regress mpg weight

Stata outputs analysis of variance (ANOVA) results along with the regression results. The ANOVA table is at the top left, and the regression results are at the bottom. The dependent variable here is miles per gallon (mpg), and its name is shown at the top left of the regression results table. Weight is measured in pounds. The coefficients for weight and the constant are shown in the Coef. column. Std. Err. is the standard error, t is the t test statistic, P>|t| is the p value, and the last two columns give the 95% confidence interval. The results can be written in regression equation form as:

predicted MPG = 39.44 - 0.006WEIGHT

For each one-pound increase in auto weight, predicted miles per gallon decreases by 0.006, and the effect is statistically significant at least at the 99% level (a p value shown as 0.000 is less than 0.0005). You can see that the standard error is very small relative to the coefficient, so the absolute value of the t test statistic is large. You can tell the statistical significance from the p value: when it is less than 0.05, the coefficient is significant at the 95% level, and when it is less than 0.01, it is significant at the 99% level. The constant (_cons) is the intercept of the regression line, or the starting point: predicted mpg would be about 39 for a car with no weight. It does not make sense as such, because no car weighs zero pounds; the intercept simply fixes the height of the regression line.

The top right corner lists information associated with the ANOVA and the regression output. The total number of observations used for the analysis is 74. The F test statistic with 1 numerator degree of freedom and 72 denominator degrees of freedom is 134, and it is statistically significant at the 99% level, because the p value is 0.000. I will come back to the R-squared and adjusted R-squared in the next model. Root MSE is the square root of the mean squared error (MS Residual in the ANOVA table); it estimates the standard deviation of the error term, the part of mpg that is not explained by the model.
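As a minimal sketch, the numbers read off the output above can also be retrieved from Stata's saved results after regress. The 3,000-pound car is a hypothetical example, not a value from the text.

```stata
* Refit the simple regression on the 1978 auto data
sysuse auto, clear
regress mpg weight

* Coefficients are available as _b[varname] after estimation
display "slope     = " _b[weight]
display "intercept = " _b[_cons]

* Predicted mpg for a hypothetical 3,000-pound car
display "predicted mpg at 3000 lb = " _b[_cons] + _b[weight]*3000

* Summary statistics from the top right corner of the output
display "N = " e(N) ", F = " e(F) ", Root MSE = " e(rmse)
```

Working from the saved results avoids retyping rounded coefficients when you compute predictions.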

What I did earlier is a simple regression with just one predictor variable. Now, I want to control for whether the cars are U.S. models or non-U.S. models, in addition to weight, in predicting miles per gallon. This is an example of a multiple regression. Variables that have a binary outcome, like U.S. vs. non-U.S. models, are called dummy variables. The interpretation of such a variable is easier if you code it as 0 or 1. Here, the variable foreign is coded 0 for U.S. (domestic) cars and 1 for non-U.S. (foreign) cars.

predicted MPG = 41.68 - 0.0066WEIGHT - 1.65FOREIGN

You can plug 0 into foreign to estimate the MPG for domestic cars, and 1 for foreign cars: at the same weight, predicted MPG is 1.65 lower for foreign cars than for domestic cars. Controlling for foreign, heavier models still use more gas: each one-pound increase in weight results in 0.0066 less mpg. Notice that foreign is not statistically significant at any conventional level of significance in this model. So can we say foreign, after all, is not important in estimating mpg?
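The multiple regression behind this equation can be reproduced with the sketch below. The weight of 3,000 pounds used for the two predictions is an assumption chosen only for illustration.

```stata
sysuse auto, clear
regress mpg weight foreign

* Predicted mpg at a hypothetical weight of 3,000 pounds
* foreign = 0 for domestic cars, foreign = 1 for foreign cars
display "domestic: " _b[_cons] + _b[weight]*3000
display "foreign:  " _b[_cons] + _b[weight]*3000 + _b[foreign]
```

The two predictions differ by exactly the coefficient of foreign, whatever weight you plug in.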

Here, it is very important that you distinguish statistical and substantive significance. Statistical significance is based on the p value: the probability of getting a sample estimate at least as far from zero as the one observed, assuming the null hypothesis of no relationship is true. It does not measure how large or important the effect is. In addition, statistical significance can change with more observations, or with a better-fitting regression line. Later you can see the statistical significance of foreign change after an adjustment to the model.

In the earlier model, R-squared was 0.65, meaning about 65% of the variance in mpg is explained by the model. In this regression I got an R-squared of 0.6627, so by adding one variable I am explaining about one percentage point more of the variance in mpg. R-squared never decreases when you add variables, so the adjusted R-squared penalizes it for the number of predictors relative to the sample size. It can be useful when you have many variables and a small sample size. The formula for the adjusted R-squared is 1 - (1 - R-squared) * (n - 1) / (n - k - 1), where n is the number of observations and k is the number of predictors. In the earlier model, adjusted R-squared was 0.6467, and in the current model it is 0.6532. So this model still explains mpg slightly better.
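The adjusted R-squared formula can be checked by hand against the value Stata reports. This is a small sketch that assumes the saved results e(N), e(df_m) (the number of predictors, excluding the constant), e(r2), and e(r2_a) left behind by regress.

```stata
sysuse auto, clear
quietly regress mpg weight foreign

* n = e(N) observations, k = e(df_m) predictors
display "by hand: " 1 - (1 - e(r2)) * (e(N) - 1) / (e(N) - e(df_m) - 1)

* Value reported in the regression header
display "Stata:   " e(r2_a)
```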

We jumped right in to regression, but there is a whole series of assumptions we are making
in running regression analyses. In your study, you need to check the data to see if the
regression assumptions are met. UCLA has very good sites where they discuss regression
diagnostics.

## Using dummy variables

I mentioned dummy variables earlier when I included foreign in the model. I have another categorical variable, repair rating, whose effect on mpg I am interested in seeing. The repair rating, called rep78, ranges from 1 to 5, with 1 meaning more repairs and 5 meaning fewer repairs. Here, the repair rating could be treated as a continuous variable, but since it has only five values and I consider it a categorical variable, I will make each of the values into a dummy variable. This kind of situation is more common with variables like ethnicity or occupation, where the assignment of numbers is rather arbitrary and the quantity does not have a meaning. An easy way to create dummy variables from a multiple-category variable like rep78 is to use the tabulate command.

tab rep78, gen(repair)

This creates five dummies, repair1 through repair5, one for each value of rep78. You can see the new variables Stata created by scrolling the variable window to the bottom.

Notice that the tabulation shows the total as 69, when the total number of records is 74. It turns out that five cars have their repair ratings missing. Stata drops cases with missing values altogether when running regressions. So in the next model you can see that the total number of cases used in the analysis is 69.
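A quick way to confirm the missing ratings, sketched with the count command and the missing() function:

```stata
sysuse auto, clear
tabulate rep78, generate(repair)

* Count the cars with a missing repair rating
count if missing(rep78)

* The dummies are missing for the same cars, so regress will drop them
count if missing(repair1)
```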

Of the five categories, I can include four in the model, one fewer than the total number of categories, because one of them serves as the reference category. The coefficients are interpreted in reference to the excluded category.

predicted MPG = 27.36 - 6.36REPAIR1 - 8.24REPAIR2 - 7.93REPAIR3 - 5.70REPAIR4

The coefficients of the repair dummies are in reference to repair rating 5. So cars with repair rating 1 get about 6.36 less mpg than cars with repair rating 5, cars with repair rating 2 get about 8.24 less mpg than cars with repair rating 5, and so on. It kind of makes sense that cars with better repair ratings use less gas: they must be constructed to be more efficient. Each dummy is 0 or 1, so to compute the predicted mpg, you can plug in 1 for the rating you are looking at, and 0 for the others. When a car has a repair rating of 5, the predicted mpg is 27.36. When a car has a repair rating of 1, the predicted mpg is 27.36 - 6.36 = 21.
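The model with rating 5 as the reference category can be fit as sketched below. The second regress line is an alternative that assumes a Stata version with factor-variable notation (Stata 11 or later); it is not part of the original text.

```stata
sysuse auto, clear
quietly tabulate rep78, generate(repair)

* Leave out repair5 so that rating 5 is the reference category
regress mpg repair1 repair2 repair3 repair4

* Predicted mpg for rating 1: intercept plus the repair1 coefficient
display _b[_cons] + _b[repair1]

* Same model without creating dummies by hand; ib5. sets rating 5 as the base
regress mpg ib5.rep78
```

With factor-variable notation you can change the reference category by changing the number after ib, without generating new variables.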

Some people are confused when I tell them to exclude a category to make it into a reference group. If you have only one set of dummies and want to include them all, you can fit a model with all the dummies but tell Stata that there already is a constant. I do not recommend this if you have multiple sets of dummy variables, such as marital status (single, married, divorced, etc.) and ethnicity (white, black, hispanic, asian, etc.), because it can get confusing which categories the intercept has been absorbed into.

This time, the coefficients are predicted mpg for each repair rating instead of difference in
reference to the excluded category. The results are the same either way.
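The all-dummies model described above can be sketched with the hascons option of regress, which I assume is the option meant by "tell Stata that there already is a constant". The tabulate line is added only as a cross-check.

```stata
sysuse auto, clear
quietly tabulate rep78, generate(repair)

* All five dummies; hascons tells regress the constant is already among them
regress mpg repair1 repair2 repair3 repair4 repair5, hascons

* The coefficients should match the mean mpg within each repair rating
tabulate rep78, summarize(mpg)
```

The noconstant option gives the same coefficients, but Stata then computes the R-squared differently, so it is not comparable with the reference-category model.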

In this data, I happen to know that the relationship between mpg and weight is quadratic, and therefore the square of weight is necessary to improve the fit. How would I know that the quadratic term is necessary? One way is to plot the residuals from the model without the quadratic term against the suspected predictor variable. Here I suspect weight is quadratic, so I plot the residuals of the model without the square term against weight.
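A minimal sketch of that residual plot follows. The variable name resid1 is my own choice, and the reference line at zero is added only to make the curve easier to see.

```stata
sysuse auto, clear

* Model without the square term
quietly regress mpg weight foreign

* Save the residuals and plot them against weight
predict resid1, residuals
graph twoway scatter resid1 weight, yline(0)
```

The postestimation command rvpplot weight draws the same kind of residual-versus-predictor plot without saving the residuals first.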

I see that the plots show a curve, a sign that the error term is correlated with weight
quadratically. I can also examine linear fit and quadratic fit between mpg and weight. In the
following graphs, though, I am not controlling for foreign.

graph twoway (scatter mpg weight) (lfit mpg weight)

graph twoway (scatter mpg weight) (qfit mpg weight)

The quadratic seems to be a better fit in these graphs, so I add the square of weight to the model along with weight and foreign. Now the coefficient of foreign is significant at the 95% level, and the absolute value of its effect is larger.

With a coefficient of -0.017 on weight and 0.00000159 on its square, the effect of weight is -0.017 + 2(0.00000159)weight, or -0.017 + 0.00000318weight. If I evaluate the effect at the mean weight (3019), it is -0.017 + 0.0000032(3019) = -0.007: at the mean, mpg decreases by about 7 for an additional 1000 pounds.
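The quadratic model and the effect of weight at the mean can be reproduced with the sketch below. The name weight2 follows the later multicollinearity section; the calculation uses the unrounded coefficients, so it may differ slightly from the hand arithmetic.

```stata
sysuse auto, clear
generate weight2 = weight^2
regress mpg weight weight2 foreign

* Effect of weight = b_weight + 2 * b_weight2 * weight, evaluated at the mean
quietly summarize weight
display "mean weight    = " r(mean)
display "effect at mean = " _b[weight] + 2*_b[weight2]*r(mean)
```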

## Log transformations

If the distribution of a variable has a positive skew, taking the natural logarithm of the variable sometimes helps fit the variable into a model. Log transformations make a positively skewed distribution more normal. Also, when a change in the dependent variable is related to a percentage change in an independent variable, or vice versa, the relationship is better modeled by taking the natural log of either or both of the variables.

For example, I estimate a person's wage based on education, experience, and region of residence using Stata's sample data nlsw88, an extract from the 1988 National Longitudinal Study of Young Women.

sysuse nlsw88

reg wage grade tenure south

The regression looks ok, but when I look at the distribution of tenure, it looks somewhat skewed.

histogram tenure

So I compute a natural log of tenure.

gen lntenure=ln(tenure)
histogram lntenure

The transformation seems to have overshot a little, but the distribution looks somewhat normal. I try a regression with the logged tenure.

The R-squared has gotten a little higher, so taking the natural log seems to have helped to fit
it in the model better.

When the independent variable but not the dependent variable is logged, a one percent change in the independent variable is associated with a change in the dependent variable of 1/100 times the coefficient.

So a one percent increase in tenure is associated with an increase in the wage of 0.01 x 0.774, or about 0.0077.
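The regression with logged tenure can be sketched as follows. One caveat that the text does not mention: ln(0) is missing in Stata, so any observation with a tenure of 0 gets a missing lntenure and drops out of the regression.

```stata
sysuse nlsw88, clear
generate lntenure = ln(tenure)

* Observations that lose their tenure value through the log (tenure of 0)
count if missing(lntenure) & !missing(tenure)

regress wage grade lntenure south

* Change in wage associated with a one percent increase in tenure
display 0.01 * _b[lntenure]
```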

Now I examine the wage, and find that it is very skewed.

histogram wage

So I take a natural log of wage, and look at the distribution of logged wage.

gen lnwage=ln(wage)
histogram lnwage

The distribution looks much more normal. Now I run the same regression with the logged
wage as the dependent variable.

reg lnwage grade tenure south

When the dependent variable but not an independent variable is logged, a one-unit change in the independent variable is associated with a percent change in the dependent variable of about 100 times the coefficient.

In this data, tenure is measured in years, so a one-year increase in tenure increases the wage by 100 x 0.026%, or about 2.6%.
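The "100 times the coefficient" rule is an approximation that works well for small coefficients. As a sketch, it can be compared with the exact percent change, 100 x (exp(b) - 1), which is my addition and not part of the original text.

```stata
sysuse nlsw88, clear
generate lnwage = ln(wage)
regress lnwage grade tenure south

* Approximate percent change in wage for one more year of tenure
display "approximate: " 100 * _b[tenure]

* Exact percent change implied by the log model
display "exact:       " 100 * (exp(_b[tenure]) - 1)
```

For a coefficient this small the two numbers are close, which is why the shortcut is commonly used.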

If we log both the dependent variable and an independent variable, then we are looking at an elasticity: a one percent change in X is associated with a percent change in Y equal to the coefficient.

predicted lnwage = 0.659 + 0.084GRADE + 0.136LNTENURE - 0.151SOUTH

A one percent increase in tenure is estimated to result in about a 0.136% increase in wage.
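A sketch of the log-log model behind this equation, using the same variable names as the commands above:

```stata
sysuse nlsw88, clear
generate lnwage = ln(wage)
generate lntenure = ln(tenure)

* Both wage and tenure are logged, so the lntenure coefficient is an elasticity
regress lnwage grade lntenure south
display "elasticity of wage with respect to tenure = " _b[lntenure]
```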

## Interactions

## Between a Dummy and a Continuous Variable

When I included foreign in the gas model earlier, I was examining the effect of weight controlling for foreign (or foreign controlling for weight). There, domestic and foreign cars shared one slope for weight, and the effect of foreign was reflected as a different intercept: the constant is the intercept for domestic cars, and the coefficient of foreign shifts it for foreign cars. Now suppose I think that the effect of weight on mpg is different for foreign and domestic cars. So I am thinking that foreign and domestic cars not only have different intercepts but also have different slopes. So I compute the interaction between foreign and weight by multiplying them, and include it in the model.
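The interaction model can be sketched as below. The variable name foreignweight is taken from the equation that follows; lincom is used here, as my own addition, to get the foreign-car slope together with its standard error.

```stata
sysuse auto, clear

* Interaction term: equals weight for foreign cars and 0 for domestic cars
generate foreignweight = foreign * weight
regress mpg weight foreign foreignweight

* Intercept and slope for foreign cars
display "foreign intercept = " _b[_cons] + _b[foreign]
display "foreign slope     = " _b[weight] + _b[foreignweight]

* Slope for foreign cars with a standard error and confidence interval
lincom weight + foreignweight
```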

predicted MPG = 39.65 - 0.006WEIGHT + 9.27FOREIGN - 0.004FOREIGNWEIGHT

Here, 39.65 is the intercept and -0.006 is the slope for domestic cars, and 39.65 + 9.27 or 48.92 is the intercept and -0.006 - 0.004 or -0.01 is the slope for foreign cars. Predicted mpg for domestic cars evaluated at the mean weight is 39.65 - 0.006(3019) = 21.54, and for foreign cars it is 48.92 - 0.01(3019) = 18.73. The difference may be easier to see in a graph. You can save the predicted values with the predict command. I call the predicted values predicted2 for the model that includes the interaction and predicted1 for the model that excludes it.

predict predicted2

Then I plotted observed values in dots and predicted values in lines, separately for domestic
and foreign when a model includes an interaction term. You can see that the intercepts and
the slopes are a bit different between the two.

graph twoway (scatter mpg weight) (line predicted2 weight), by(foreign)

Here I plotted the same without the interaction term. You can see that the slopes are the same for the two groups, as they must be in a model without the interaction; only the intercepts differ.
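The graph without the interaction term can be drawn along these lines. The sort option is my addition, so that the predicted line is drawn from the lightest to the heaviest car.

```stata
sysuse auto, clear

* Model without the interaction term
quietly regress mpg weight foreign
predict predicted1

* Observed values as dots and predicted values as a line, by origin
graph twoway (scatter mpg weight) (line predicted1 weight, sort), by(foreign)
```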

## Between Two Continuous Variables

Suppose I suspect that the effect of weight on mpg differs depending on the value of length. So I compute a weight-length interaction and include it in the model.

predicted MPG = 67.45 - 0.01WEIGHT - 0.18LENGTH + 0.0000376WEIGHTLENGTH

A change in predicted MPG given a one-unit change in weight is -0.01 + 0.0000376(length), and a change in predicted MPG given a one-unit change in length is -0.18 + 0.0000376(weight).
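A sketch of this model, with the effect of weight evaluated at the mean length. The variable name weightlength follows the equation above; evaluating at the mean is my choice of example.

```stata
sysuse auto, clear
generate weightlength = weight * length
regress mpg weight length weightlength

* Effect of weight depends on length; evaluate it at the mean length
quietly summarize length
display "effect of weight at mean length = " _b[weight] + _b[weightlength]*r(mean)
```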

## Multicollinearity

Multicollinearity is a condition where independent variables are strongly correlated with each other. When multicollinearity exists in your model, you may see very high standard errors and low t statistics, unexpected changes in coefficient magnitudes or signs, or non-significant coefficients despite a high R-squared.

Stata drops perfectly collinear independent variables with warnings. If the collinearity is high but not perfect, you may want to examine the model for multicollinearity. You can check for multicollinearity by running a regression with each predictor variable as the dependent variable against all the other predictors, and then examining how much of the variable's variation is independent of the other predictors (1 minus the R-squared of that regression).

Using the same auto data, let's check whether we observe multicollinearity. Putting quietly at the beginning of the regress command suppresses the output. After each regression I display 1 minus the R-squared, which regress saves as e(r2); to learn more about saved estimation results, type help ereturn in the Command window.
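The check described above can be sketched as three auxiliary regressions, one per predictor of the quadratic model:

```stata
sysuse auto, clear
generate weight2 = weight^2

* Regress each predictor on the others and display 1 - R-squared
quietly regress weight weight2 foreign
display "weight:  " 1 - e(r2)

quietly regress weight2 weight foreign
display "weight2: " 1 - e(r2)

quietly regress foreign weight weight2
display "foreign: " 1 - e(r2)
```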

The variable foreign seems to be ok, having about 62% of the effect independent of other
predictors. But less than 2% of weight and weight2 are independent of other predictors.
Weight2 is computed from weight, so it is understandable. The same values can be
computed by using a regress postestimation command, estat vif. This time, you run the
whole model including the dependent variable.
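A sketch of the estat vif route, run after the full model:

```stata
sysuse auto, clear
generate weight2 = weight^2

* Fit the whole model, then ask for variance inflation factors
quietly regress mpg weight weight2 foreign
estat vif
```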

1/VIF gives the same values as the 1 - R-squared we computed earlier. The VIF (variance inflation factor) column shows by how much the variance of each predictor's coefficient is inflated because of its correlation with the other predictors; the standard error is inflated by the square root of the VIF. We see that the variance for foreign is inflated only a little, but those for weight and weight2 are inflated substantially. What can we do to address this problem? We may be able to reduce the multicollinearity by centering, which is subtracting the mean from the predictor values before generating the square term. Again, here, I execute the summarize command to get the mean, which is saved as r(mean). Type help return in the Command window.
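Centering can be sketched as follows, using the variable names centered_weight and centered_weight2 that appear in the discussion below:

```stata
sysuse auto, clear

* summarize saves the mean of weight in r(mean)
quietly summarize weight
generate centered_weight = weight - r(mean)

* Square the centered variable, not the original one
generate centered_weight2 = centered_weight^2
summarize centered_weight centered_weight2
```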

Let's check to see if the centered weights have corrected the multicollinearity we observed in the weights.
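One way to run this check, sketched with correlate and estat vif:

```stata
sysuse auto, clear
generate weight2 = weight^2
quietly summarize weight
generate centered_weight = weight - r(mean)
generate centered_weight2 = centered_weight^2

* Correlation between each variable and its square, before and after centering
correlate weight weight2
correlate centered_weight centered_weight2

* Variance inflation factors for the model with centered weights
quietly regress mpg centered_weight centered_weight2 foreign
estat vif
```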
The correlation between weight and weight2 is 0.99, but the correlation between centered_weight and centered_weight2 is 0.14. Now 1/VIF shows that 61% of centered_weight's and 93% of centered_weight2's variances are independent of the other variables. I used centering to show an example of how to correct for multicollinearity, but in this case, it may not really have been necessary. If you compare regression results using weights and centered weights, you see that the overall R-squared and the p-values for the weights are not so different between the two models. So you do not always have to center when you include a square term in the model. Multicollinearity may be more of an issue when two supposedly different but very closely related variables are included and show the conditions described earlier: standard errors are substantially high, coefficients' magnitudes and signs are unexpected, or coefficients are not significant while the R-squared is high.
