May 10, 2011

© Attribution Non-Commercial (BY-NC)

Suppose I am investigating the relationship between types of cars and their miles per gallon.

My hypothesis is that luxury models are gas guzzlers. I am testing this hypothesis using 1978

auto data. I use weight as a proxy for luxury models, as I expect luxury cars are heavier. It

also seems to make sense that heavier cars would use more gas. At the command window,

type:

sysuse auto
regress mpg weight

Stata outputs analysis of variance (anova) results along with the regression results. The anova table is at the top left, and the regression results are at the bottom. The dependent variable here is miles per gallon (mpg), and its name is shown at the top left of the regression results table. Weight is measured in pounds. The coefficients for weight and the constant are shown in the Coef. column. Std. Err. is the standard error, t the t test statistic, P>|t| the p value, and the last two columns give the 95% confidence interval. The results can be written in regression equation form as:

predicted MPG = 39.44 - 0.006WEIGHT

For each pound increase in auto weight, miles per gallon decrease by 0.006, and the effect is statistically significant at least at the 99% level (when the p value is shown as 0.000, it is less than 0.0005). You can see that the standard error is very small, showing little variation, and that the absolute value of the t test statistic is relatively large. You can tell the statistical significance from the p value: when it is less than 0.05, the coefficient is significant at the 95% level, and when it is less than 0.01, it is significant at the 99% level. The constant (_cons) is the intercept of the regression line, or the starting point: mpg would be about 39 for a car with no weight. That may not make sense as a real car, but it is the predicted mpg when weight is set to zero.
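The plug-in arithmetic for this fitted line can be sketched as follows. The digits are the rounded coefficients from the standard auto-data output (intercept about 39.44, slope about -0.006 per pound); treat them as approximate if your output differs slightly.

```python
# Predicted mpg from the simple regression of mpg on weight,
# using the rounded coefficients reported in the output above.
INTERCEPT = 39.44
SLOPE = -0.006  # mpg change per additional pound


def predicted_mpg(weight_lbs):
    """Predicted miles per gallon for a car of the given weight."""
    return INTERCEPT + SLOPE * weight_lbs


print(round(predicted_mpg(0), 2))     # 39.44 -- the intercept (a weightless car)
print(round(predicted_mpg(2000), 2))  # 27.44 -- a 2,000 lb car
```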

The top right corner lists information associated with the anova and the regression output. The total number of observations used for the analysis is 74; the F test statistic, with 1 numerator degree of freedom and 72 denominator degrees of freedom, is 134, and it is statistically significant at the 99% level because the p value is 0.000. I will come back to the R-squared and adjusted R-squared in the next model. Root MSE is the square root of the mean squared error (MS Residual in the anova table), and is the standard deviation of the error term, the part not explained by the model.

What I did earlier is a simple regression with just one predictor variable. Now I want to control for whether the cars are U.S. models or non-U.S. models, in addition to weight, in predicting miles per gallon. This is an example of a multiple regression. Variables with a binary outcome, like this U.S. vs. non-U.S. distinction, are called dummy variables. The interpretation of such a variable is easier if you code it as 0 or 1. Here, the variable foreign is coded 0 for US (domestic) cars and 1 for non-US (foreign) cars.

predicted MPG = 41.68 - 0.0066WEIGHT - 1.65FOREIGN

You can plug 0 into foreign to estimate the MPG for domestic cars, and 1 for foreign cars: MPG is 1.65 less for foreign cars than for domestic cars. Controlling for foreign, heavier models still use more gas: each one pound increase in weight results in 0.0066 less mpg. Notice that foreign is not statistically significant at any conventional level of significance in this model. So can we say that foreign, after all, is not important in estimating mpg?

Here, it is very important to distinguish statistical from substantive significance. Statistical significance gives the probability of observing a sample value as extreme as the one obtained, assuming the null hypothesis of no relationship is true. In addition, statistical significance can change by getting more observations, or by fitting the regression line better. Later you will see the statistical significance of foreign change when an adjustment is made to the model.

In the earlier model, R-squared was 0.65, meaning about 65% of the variance in mpg is explained by the model. In this regression I got an R-squared of 0.6627, so by adding one variable I am explaining about 1% more of the variance in mpg. Adjusted R-squared adjusts the R-squared for the number of variables relative to the sample size. R-squared will naturally be larger if you have more variables, but the adjusted R-squared takes the number of variables into account. It is especially useful when you have many variables and a small sample size. The formula for the adjusted R-squared is 1 - (1 - R-squared)((n-1)/(n-k-1)), where n is the number of observations and k the number of predictors. In the earlier model, the adjusted R-squared was 0.6467, and in the current model it is 0.6532, so this model still explains mpg better.
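The adjusted R-squared arithmetic above can be checked directly from the formula, using the n, k, and R-squared values reported for the two models:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the number of observations and k the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Simple regression: n = 74 observations, k = 1 predictor (weight)
print(round(adjusted_r2(0.6515, 74, 1), 4))  # 0.6467

# Multiple regression: k = 2 predictors (weight, foreign)
print(round(adjusted_r2(0.6627, 74, 2), 4))  # 0.6532
```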

We jumped right into regression, but there is a whole series of assumptions we are making in running regression analyses. In your study, you need to check the data to see whether the regression assumptions are met. UCLA has very good sites that discuss regression diagnostics.

I mentioned a dummy variable earlier when including foreign in the model. I have another categorical variable, repair rating, whose effect on mpg I am interested in. The repair rating, called rep78, ranges from 1 to 5, 1 meaning more repairs and 5 meaning fewer repairs. The repair rating could be treated as a continuous variable, but since it takes only five values and I consider it categorical, I will make each value into a dummy variable. This kind of situation is more common with variables like ethnicity or occupation, where the assignment of numbers is arbitrary and the quantity has no meaning. An easy way to create dummy variables from a multiple category variable like rep78 is to use the tabulate command:

tabulate rep78, generate(repair)

This creates five dummies, repair1 through repair5, one for each value of rep78. You can see the new variables Stata created by scrolling the Variables window to the bottom.

Notice that the tabulation shows a total of 69, while the total number of records is 74. It turns out that five cars have missing repair ratings. Stata drops cases with missing values altogether when running regressions, so in the next model you can see that the total number of cases used in the analysis is 69.

Of the five categories, I include four, one fewer than the total number of categories, in the model; the omitted one serves as the reference category. The coefficients are interpreted in reference to the excluded category.

predicted MPG = 27.36 - 6.36REPAIR1 - 8.24REPAIR2 - 7.93REPAIR3 - 5.70REPAIR4

The coefficients of the repair dummies are relative to repair rating 5. So cars with repair rating 1 yield about 6.36 less mpg than cars with repair rating 5, repair rating 2 yields about 8.24 less mpg than repair rating 5, and so on. It makes some sense that cars with a better repair rating use less gas: they must be constructed to be more efficient. Each dummy is 0 or 1, so to compute the predicted mpg, you plug in 1 for the rating you are looking at and 0 for the others. When a car has a repair rating of 5, the predicted mpg is 27.36. When a car has a repair rating of 1, the predicted mpg is 27.36 - 6.36 = 21.
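The plug-in arithmetic for the repair-rating dummies can be sketched as follows, using the coefficients from the equation above (rating 5 is the reference category, so its coefficient is 0):

```python
# Coefficients from the fitted equation; rating 5 is the reference
# category, so its dummy coefficient is 0.
INTERCEPT = 27.36
COEFS = {1: -6.36, 2: -8.24, 3: -7.93, 4: -5.70, 5: 0.0}


def predicted_mpg(rating):
    """Plug 1 into the dummy for the given rating and 0 into the others."""
    return INTERCEPT + COEFS[rating]


print(round(predicted_mpg(5), 2))  # 27.36 -- the reference category
print(round(predicted_mpg(1), 2))  # 21.0
```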

Some people are confused when I tell them to exclude a category to make it the reference group. If you have only one set of dummies and want to include them all, you can fit a model with all the dummies but tell Stata that the set of dummies already contains a constant. I do not recommend this if you have multiple sets of dummy variables, such as marital status (single, married, divorced, etc.) AND ethnicity (white, black, hispanic, asian, etc.), as it can get confusing where the intercept went.

This time, the coefficients are the predicted mpg for each repair rating, instead of differences relative to the excluded category. The results are the same either way.

In this data, I happen to know that the relationship between mpg and weight is quadratic, and therefore the square of weight is needed to improve the fit. How would I know that the quadratic term is necessary? One way is to examine the residuals of the model without the quadratic term, plotted against the suspected predictor variable. Here I suspect weight is quadratic, so I plot the residuals of the model without the square term against weight.

The plot shows a curve, a sign that the error term is correlated with weight quadratically. I can also examine a linear fit and a quadratic fit between mpg and weight. In the following graphs, though, I am not controlling for foreign.

graph twoway (scatter mpg weight) (lfit mpg weight)
graph twoway (scatter mpg weight) (qfit mpg weight)

The quadratic seems to be a better fit in these graphs, so I include the square term in the model.

Now the coefficient of foreign is significant at the 95% level, and the absolute value of its effect is larger. With the quadratic term, the effect of weight is no longer constant: the change in predicted MPG for a one pound change in weight is the weight coefficient plus twice the weight-squared coefficient times weight. Evaluated at the mean weight (3,019), it is -0.017 + 0.0000032(3019) = -0.007: mpg decreases by about 7 for each additional 1,000 pounds.
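The marginal-effect arithmetic can be checked with a short script; the values below are the rounded coefficients from the text (-0.017 for weight, 0.0000032 for twice the weight-squared coefficient), so treat the result as approximate.

```python
# Effect of weight on predicted mpg in the quadratic model:
#   d(mpg)/d(weight) = b_weight + 2 * b_weight2 * weight
B_WEIGHT = -0.017
TWO_B_WEIGHT2 = 0.0000032  # 2 * coefficient on weight squared


def weight_effect(weight_lbs):
    """Marginal effect of one more pound, evaluated at a given weight."""
    return B_WEIGHT + TWO_B_WEIGHT2 * weight_lbs


me = weight_effect(3019)    # evaluated at the mean weight
print(round(me, 3))         # -0.007 mpg per pound
print(round(me * 1000, 1))  # -7.3 mpg per 1,000 pounds
```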

Log transformations

If the distribution of a variable has a positive skew, taking the natural logarithm of the variable sometimes helps fit it into a model. Log transformations make a positively skewed distribution more normal. Also, when a change in the dependent variable is related to a percentage change in an independent variable, or vice versa, the relationship is better modeled by taking the natural log of either or both variables.

For example, I estimate a person's wage based on her education, experience, and region of residence, using Stata's sample data nlsw88, an extract from the 1988 National Longitudinal Survey of Young Women.

sysuse nlsw88
reg wage grade tenure south

It looks ok, but when I look at the distribution of tenure, it looks somewhat skewed.

histogram tenure

gen lntenure=ln(tenure)

histogram lntenure

It seems to have overshot a little, but it looks roughly normal. I try a regression with the logged tenure.

The R-squared has gotten a little higher, so taking the natural log seems to have helped fit tenure into the model better.

When the independent variable but not the dependent variable is logged, a one percent change in the independent variable is associated with 1/100 of the coefficient change in the dependent variable. So a one percent increase in tenure is associated with an increase in wage of 0.01 x 0.774, or about $0.0077.

The distribution of wage also looks skewed.

histogram wage

So I take the natural log of wage, and look at the distribution of the logged wage.

gen lnwage=ln(wage)
histogram lnwage

The distribution looks much more normal. Now I run the same regression with the logged wage as the dependent variable.

When the dependent variable but not an independent variable is logged, a one-unit change in the independent variable is associated with a change in the dependent variable of 100 times the coefficient, in percent.

predicted lnwage = 0.666 + 0.085GRADE + 0.026TENURE - 0.150SOUTH

In this data, tenure is measured in years: so a one year increase in tenure increases the wage by 100 x 0.026 %, or about 2.6%.

If we log both the dependent and an independent variable, then we are looking at elasticity: a percentage change in X results in a percentage change in Y. A one percent increase in tenure is estimated to result in about a 0.136% increase in wage.
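The three interpretation rules can be summarized with the coefficients from the wage models above (0.774 for logged tenure, 0.026 for tenure with logged wage, 0.136 with both logged). This is just the interpretation arithmetic, not a re-estimation of the models:

```python
# x logged, y not: a 1% change in x ~ (coef / 100) unit change in y.
wage_change = 0.774 / 100
print(round(wage_change, 4))  # 0.0077 dollars per 1% more tenure

# y logged, x not: a 1 unit change in x ~ (100 * coef) percent change in y.
pct_wage_change = 100 * 0.026
print(round(pct_wage_change, 1))  # 2.6 percent per extra year of tenure

# Both logged: the coefficient is an elasticity.
elasticity = 0.136  # percent change in wage per 1% change in tenure
print(elasticity)
```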

Interactions

When I included foreign in the gas model earlier, I was examining the effect of weight controlling for foreign (or of foreign controlling for weight). There was a single slope for weight, and the effect of foreign was reflected as a different intercept from domestic cars. Now suppose I think that the effect of weight on mpg is different for foreign and domestic cars, so that foreign and domestic cars have not only different intercepts but also different slopes. So I compute the interaction between foreign and weight by multiplying them, and include it in the model.

Here, 39.65 is the intercept and -0.006 the slope for domestic cars, while 39.65 + 9.27 = 48.92 is the intercept and -0.006 - 0.004 = -0.01 the slope for foreign cars. Predicted mpg for domestic cars evaluated at the mean weight is 39.65 - 0.006(3019) = 21.54, and for foreign cars it is 48.92 - 0.01(3019) = 18.73. The difference may be easier to see in a graph. You can save the predicted values with the predict command. I call the predicted value predicted2 for the model that includes the interaction, and predicted1 for the model that excludes it.

predict predicted2

Then I plotted the observed values as dots and the predicted values as lines, separately for domestic and foreign cars, for the model that includes the interaction term. You can see that the intercepts and the slopes are a bit different between the two.

Here I plotted the same without the interaction term. You can see that the slopes are about the same.
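The intercept and slope arithmetic for the interaction model can be reproduced as follows, with the coefficients rounded as in the text:

```python
# Interaction model: mpg = b0 + b1*weight + b2*foreign + b3*(weight*foreign)
B0, B1 = 39.65, -0.006  # intercept and weight slope for domestic cars
B2, B3 = 9.27, -0.004   # intercept and slope shifts for foreign cars

intercept_foreign = B0 + B2  # foreign intercept
slope_foreign = B1 + B3      # foreign slope

MEAN_WEIGHT = 3019
pred_domestic = B0 + B1 * MEAN_WEIGHT
pred_foreign = intercept_foreign + slope_foreign * MEAN_WEIGHT

print(round(intercept_foreign, 2))  # 48.92
print(round(slope_foreign, 3))      # -0.01
print(round(pred_domestic, 2))      # 21.54
print(round(pred_foreign, 2))       # 18.73
```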

Suppose I suspect that the effect of weight on mpg differs by the value of length. So I compute the weight-length interaction and include it in the model.

predicted MPG = 67.45 - 0.01WEIGHT - 0.18LENGTH + 0.0000376WEIGHTLENGTH

The change in predicted MPG for a one-unit change in length is -0.18 + 0.0000376(weight).
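This weight-dependent effect of length can be evaluated numerically. Using the mean weight of 3,019 carried over from the earlier discussion (an assumption on my part; the text does not evaluate this interaction at a particular weight):

```python
# In the weight-length interaction model, the effect of a one-unit
# change in length depends on weight:
#   d(mpg)/d(length) = -0.18 + 0.0000376 * weight
def length_effect(weight_lbs):
    return -0.18 + 0.0000376 * weight_lbs


print(round(length_effect(3019), 3))  # about -0.066 at the mean weight

# The effect crosses zero for very heavy cars:
print(round(0.18 / 0.0000376))        # about 4787 pounds, where the effect is zero
```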

Multicollinearity

Multicollinearity arises when the independent variables are highly correlated with each other. When multicollinearity exists in your model, you may see very high standard errors and low t statistics, unexpected changes in the magnitudes or signs of coefficients, or non-significant coefficients despite a high R-squared.

Stata drops perfectly collinear independent variables with a warning. If the collinearity is high but not perfect, you may want to examine the model for multicollinearity. You can check by running a regression with each predictor variable as the dependent variable, against all the other predictors, and then examining how much of each variable's variation is independent of the other predictors.

Using the same auto data, let's check whether we observe multicollinearity. quietly at the beginning of the regress command suppresses the output. I executed the command to get the R-squared, which is saved in Stata's internal memory as e(r2). To learn more about saved results, type help ereturn in the Command window.

The variable foreign seems to be OK, with about 62% of its variation independent of the other predictors. But less than 2% of weight and weight2 is independent of the other predictors. Since weight2 is computed from weight, that is understandable. The same values can be computed using a regress postestimation command, estat vif. This time, you run the whole model, including the dependent variable.

1/VIF gives the same values as the 1 - R2 we computed earlier. The VIF column shows by how much a coefficient's variance (and standard error) is inflated due to that predictor's correlation with the other predictors. We see that foreign's variance is barely affected, but the variances of weight and weight2 are inflated substantially. What can we do to address this problem? We may be able to reduce the multicollinearity by centering, which is subtracting the mean from the predictor values before generating the square term. Again, I execute the summarize command to get the mean, which is saved as r(mean); type help return in the Command window to learn more about rclass results. Then I generate centered_weight by subtracting the mean from weight, and centered_weight2 as its square.

The correlation between weight and weight2 is 0.99, but the correlation between centered_weight and centered_weight2 is 0.14. Now 1/VIF shows that 61% of centered_weight's and 93% of centered_weight2's variance is independent of the other variables. I used centering to show an example of how to correct for multicollinearity, but in this case it may not really have been necessary. If you compare the regression results using weight and centered_weight, you see that the overall R-squared and the p-values are not so different between the two models. So you do not always have to center when you include a square term in the model. It is more of an issue when two supposedly different but very closely related variables are included and show the conditions described earlier: substantially high standard errors, unexpected coefficient magnitudes or signs, or non-significant coefficients while the R-squared is high.
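The decorrelating effect of centering can be demonstrated with simulated weights. This is only an illustration, not the auto data itself: I draw 74 weights uniformly over roughly the auto data's range, so the exact correlations will differ from the 0.99 and 0.14 reported above, but the pattern is the same.

```python
import numpy as np

# Simulated car weights, roughly spanning the auto data's range of weights.
rng = np.random.default_rng(12345)
weight = rng.uniform(1760, 4840, size=74)

# Raw weight vs. its square: nearly collinear.
weight2 = weight ** 2
r_raw = np.corrcoef(weight, weight2)[0, 1]

# Center before squaring: subtract the mean, then square.
centered = weight - weight.mean()
centered2 = centered ** 2
r_centered = np.corrcoef(centered, centered2)[0, 1]

print(round(r_raw, 2))       # close to 1
print(round(r_centered, 2))  # much smaller after centering
```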

