
09/09/2011

Source: http://www.princeton.edu/~otorres/Stata/

Regression

Technically, linear regression estimates how much Y changes when X changes by one unit.

In Stata, use the command regress:

regress [dependent variable] [independent variable(s)]

regress y x

With several predictors (multiple regression) we type:

regress y x1 x2 x3 …
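For instance, a minimal runnable sketch using auto, an example dataset shipped with Stata (price as the outcome; mpg, weight, and foreign as predictors):

sysuse auto, clear
regress price mpg weight foreign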

Before running a regression it is recommended to have a clear idea of what you are trying to estimate (i.e. which variables are your outcome and which are your predictors).

A regression makes sense only if there is a sound hypothesis behind it.

Regression: example

Example: Do older people report lower life satisfaction controlling for other factors?*

– Outcome (Y) variable – life satisfaction (cp08a011 in the sample dataset)

– Predictor (X) variables

• Age of household member (leeftijd)

• Nationality (cr08a043)

• Gender (geslacht)

• Level of education (oplcat)

• Personal monthly income in categories (nettocat)

• Civil status (burgstat)

Assuming the sample dataset is saved on the desktop, type:

use "C:\Documents and Settings\Administrator\Desktop\sample dataset.dta"

Regression: variables

It is recommended to first examine the variables in the model to check for possible errors. (The raw survey variables listed above are assumed here to have already been renamed and recoded, e.g. cp08a011 into lifesatisfaction.) Type:

describe lifesatisfaction age dutch female married nevermarried netincome educ

summarize lifesatisfaction age dutch female married nevermarried netincome educ


Let's run the regression:
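Based on the variables listed in describe and summarize above, the command would be:

regress lifesatisfaction age dutch female married nevermarried netincome educ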

Regression: what to look for

[Annotated regress output: the outcome variable (Y) is lifesatisfaction, followed by the predictor variables (X); the numbered callouts 1–5 are explained below.]

1. The p-value of the model. It tests whether R-squared is different from 0. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between Y and the Xs.

2. R-squared shows the amount of variance of Y explained by the Xs. In this case the model explains 4% of the variance in life satisfaction.

3. Adjusted R-squared shows the same as R-squared but adjusted by the number of cases and number of variables.

4. The t-values test the hypothesis that each coefficient is different from 0. To reject this, you need a t-value greater than 1.96 (for 95% confidence). You can get the t-value by dividing the coefficient by its standard error. The t-values also show the importance of a variable in the model.

5. Two-tail p-values test the hypothesis that each coefficient is different from 0. To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, age is not statistically significant in explaining life satisfaction.
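For example, to reproduce the t-value for age by hand (point 4: the coefficient divided by its standard error), immediately after running the regression type:

display _b[age]/_se[age]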

Regression: with dummies

Region (sted) is entered here as a dummy variable. The easy way to add dummy variables to a regression is using "xi" and the prefix "i." (interpretation is the same as before). The first category is always the reference:

xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted

NOTE: By default xi omits the first value. To select a different reference value, before running the regression type:

char sted[omit] 4
xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted

This will select category 4 as the reference category for the dummy variables.

NOTE: Another way to create dummy variables is to type:

tab sted, gen(urban)

This will create 5 new variables (or as many as there are categories in the variable), one for each region in this case.
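A sketch of using these generated dummies directly in the regression (assuming sted has 5 categories, so urban1 through urban5 are created; urban1 is left out here as the reference):

tab sted, gen(urban)
regress lifesatisfaction age female netincome educ dutch married nevermarried urban2-urban5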

Regression: ANOVA table

When you run the regression, at the top of the output you get the ANOVA table:

xi: regress csat expense percent income high college i.region

(Note: this example uses a different dataset, in which csat is the outcome.)

A = Model Sum of Squares (MSS). The closer MSS is to TSS, the better the fit.

B = Residual Sum of Squares (RSS)

C = Total Sum of Squares (TSS)
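After regress, Stata stores these quantities in e(), so the identities TSS = MSS + RSS and R-squared = MSS/TSS can be checked directly; the second line should reproduce e(r2):

display e(mss) + e(rss)
display e(mss)/(e(mss) + e(rss))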

Regression: eststo/esttab

To show the models side-by-side you can use the commands eststo and esttab:

regress lifesatisfaction age female
eststo model1

regress lifesatisfaction age female netincome educ dutch married nevermarried
eststo model2

xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted
eststo model3

esttab, r2 ar2
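NOTE: eststo and esttab are not built into Stata; they come with the user-written estout package, which can be installed by typing:

ssc install estout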

Regression: exploring relationships

scatter lifesatisfaction age

There might be a curvilinear relationship between lifesatisfaction and age. To explore it, we add a squared version of the variable:

gen age2=age*age

scatter lifesatisfaction age2
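A sketch of entering the squared term alongside age in the earlier model (a significant coefficient on age2 would support the curvilinear pattern):

xi: regress lifesatisfaction age age2 female netincome educ dutch married nevermarried i.sted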

Regression: getting predicted values

How good the model is will depend on how well it predicts Y, the linearity of the model and the behavior of the residuals.

To generate the predicted values of Y (usually called Yhat) given the model, use predict immediately after running the regression:

xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted
predict lifesathat
label variable lifesathat "predicted life satisfaction"

For a quick assessment of the model, run a scatter plot of observed against predicted values:

scatter lifesatisfaction lifesathat

Regression: observed vs. predicted values

We should expect a 45-degree pattern in the data. The y-axis shows the observed data and the x-axis the predicted data (Yhat).

In this case the model does not seem to be doing a good job of predicting lifesatisfaction.

Regression: joint test (F-test)

To test whether two coefficients are jointly different from 0, use the command test:

xi: quietly regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted

Note: 'quietly' suppresses the regression output.

To test the null hypothesis that both coefficients have no effect on lifesatisfaction, type:

test age female

The p-value is 0.0023, so we reject the null and conclude that the two variables jointly have a significant effect on lifesatisfaction.

Some other possible tests are:

test netincome = 1
test netincome = educ

The first tests whether the netincome coefficient equals 1; the second tests whether the netincome and educ coefficients are equal.

Regression: saving regression coefficients

Stata temporarily stores the coefficients as _b[varname], so if you type:

gen age_b = _b[age]

gen constant_b = _b[_cons]

You can also save the standard errors of the variables using _se[varname]:

gen age_se = _se[age]
gen constant_se = _se[_cons]

Regression: saving regression coefficients/getting predicted values

Type help return for more details.
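As a sketch combining the two ideas (using a simple model with age and female only; the variable names yhat_manual and yhat_check are illustrative), predicted values can be rebuilt by hand from the stored coefficients and compared with predict:

quietly regress lifesatisfaction age female
gen yhat_manual = _b[_cons] + _b[age]*age + _b[female]*female
predict yhat_check

The two new variables should be identical up to rounding.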

Regression: interaction between dummies

Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of another independent variable. We will explore here the interaction between two dummy (binary) variables. In the example below, it could be the case that the effect of type of dwelling on lifesatisfaction depends on the gender of the respondent.

Dependent variable (Y) – Lifesatisfaction

Independent variables (X)

• Binary: selfowneddwelling – 1 if type of dwelling (woning) is self-owned.

• Binary: rentaldwelling – 1 if type of dwelling (woning) is rental.

Interaction term, in Stata:

gen selfownd_f = female*selfowneddwelling

xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted selfowneddwelling selfownd_f

The effect of female on lifesatisfaction is 0.8, but given the interaction term (and assuming all coefficients are significant), the net effect of female is 0.8 + 0.4*selfowneddwelling. If selfowneddwelling is 0, the effect is 0.8 (the female coefficient alone); if selfowneddwelling is 1, the effect is 0.8 + 0.4 = 1.2. In this case, the effect of being female on lifesatisfaction is more positive for women who own their houses.
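This net effect can be verified from the stored coefficients right after the regression (here, the effect of female for home owners, 0.8 + 0.4):

display _b[female] + _b[selfownd_f]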

Regression: interaction between a dummy and a continuous variable

Let's explore a similar interaction, this time between a continuous variable and a binary variable. The question remains the same*.

Dependent variable (Y) – Lifesatisfaction

Independent variables (X)

• Continuous: netincome

• Binary: female

Interaction term, in Stata:

gen income_f = female*netincome

xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted income_f

The effect of income on lifesatisfaction is lower for females:

If female = 0, the effect of income is 0.06.
If female = 1, the effect of income is 0.06 - 0.03 = 0.03.

Increasing the income category by 1 unit will increase life satisfaction by 0.06 units for males, but it will have a lower impact for females.
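The combined slope, with its standard error and confidence interval, can also be obtained with lincom right after the regression above:

lincom netincome + income_f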

Regression: interaction between two continuous variables

Let's now keep both variables continuous. The question remains the same*.

Dependent variable (Y) – Lifesatisfaction

Independent variables (X)

• Continuous: netincome

• Continuous: age

Interaction term, in Stata:

gen income_age = age*netincome

xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted income_age

The effect of the interaction term is very small. The effect of a one-unit rise in income category is 0.02 + 0.0003*age. So:

If age = 50, the slope of income is 0.042.
If age = 70, the slope of income is 0.05.

In the continuous case there is a very small effect (and not significant).
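A sketch of evaluating the income slope at a given age from the stored coefficients (here age 50):

display _b[netincome] + _b[income_age]*50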