
Hints and codes for the final paper [1]

• Please note that this handout is meant to help you with the data analysis for your final
project. You do not have to follow this process. This handout does not repeat the steps
from the homeworks, so make sure you review those too. As you can see, there are many
things you can do with this project; make sure you focus on a single slice of all the
possibilities. Do not forget that your paper should be in essay format, with an
introduction (in which you can already mention your findings), a body (evidence) and a
conclusion. You may include the tables and figures in the text or in an appendix after
the text.

• Sample Question: What is the evidence that trade reduces political instability? Analyze
the data to answer the question and include region and ethnic fractionalization as
control variables.

• Step 1: Write down the hypothesis with a plausible mechanism. You
may use sources if you wish, but you have to cite them properly.

Examination of the Data


• Set your working directory, load the dataset and "attach" the dataset (this will
allow you to work with the data without the $ sign).

setwd("/Users/Zsuzsa/Documents/Teaching/Spring2017")
newdata<-read.csv(file="newdata2.csv", header=TRUE)
attach(newdata)
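
As a small illustration of what attach() does, the sketch below uses a made-up data frame (toy_data and its values are hypothetical and are not part of newdata2.csv). It shows that the same column can be reached with the $ sign, with with(), or by its bare name after attach().

```r
# Hypothetical toy data frame; values are invented for illustration only.
toy_data <- data.frame(instability = c(-1.2, 0.3, 2.1),
                       trade09     = c(150, 80, 30))

# Without attach(): refer to a column with the $ sign.
mean_dollar <- mean(toy_data$instability)

# with() evaluates an expression inside the data frame, no attach() needed.
mean_with <- with(toy_data, mean(instability))

# After attach(), the bare column name works.
attach(toy_data)
mean_attached <- mean(instability)
detach(toy_data)  # detach when done so the names do not linger

c(mean_dollar, mean_with, mean_attached)  # all three are the same number
```

If you attach a dataset, it is usually a good idea to detach it at the end of your session, so that variable names from one dataset do not get mixed up with another.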

• Examine the variables in the dataset.

names(newdata)
summary(newdata)

• What measure of political instability is included in the dataset?

• How is this measure constructed?

> summary(instability)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
-1.46000 -0.71500 -0.08000 0.07358 0.67500 3.31000 2

• Visualize the variable

hist(instability)
boxplot(instability)

This variable does not look exactly like a normal distribution, but it is probably not skewed
enough to merit a logarithmic transformation. So I will not transform the variable this
time. We need to look at the distribution of every variable we will use in order to make
the best judgment about when to transform the variables and also to get a sense of the
data we are using.
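
To see what a variable that does merit a logarithmic transformation might look like, here is a minimal sketch on simulated data (the variable skewed is hypothetical and is not in the course dataset). It compares the gap between the mean and the median before and after taking logs, and draws both histograms.

```r
set.seed(42)
# Simulated right-skewed variable (hypothetical, strictly positive).
skewed <- rlnorm(500, meanlog = 0, sdlog = 1)

# In a right-skewed variable the mean sits above the median.
gap_raw <- mean(skewed) - median(skewed)

# The logarithm is only defined for positive values, so check before transforming.
stopifnot(all(skewed > 0))
logged  <- log(skewed)
gap_log <- mean(logged) - median(logged)

# Compare the two histograms side by side.
par(mfrow = c(1, 2))
hist(skewed, main = "Raw variable")
hist(logged, main = "Log-transformed")
par(mfrow = c(1, 1))

c(raw = gap_raw, logged = gap_log)
```

Note that instability takes negative values (its minimum is -1.46), so log(instability) would not be defined without first shifting the variable. This is one more reason to leave it untransformed here.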

• Now review the other variables that you will include in your analysis in identical
fashion. What measure of region is included in the dataset? How many regions are
there? Are the variables continuous or categorical?

• What measure of ethnic fractionalization is included in the dataset?

• How are these variables measured? What years do they reflect? How is the data
distributed? Is the distribution close enough to normal or do you need to transform
either of them? How are all the variables ordered? What do higher and lower values
mean?

• Step 2: Decide how you want to operationalize your concepts (which
actual variables you are going to use to measure the concepts). Examine
all the variables that you want to use in your regressions. Note that you
always need a theoretical reason for using those particular variables
(which means that it has to make sense to control for them).
[1] This handout is based on the handout we created with Professor Miriam Golden and
Chao-yo Cheng in 2016.

Initial Data Analysis


Let us begin the analysis by examining the correlation of instability and trade.

> cor(instability, trade09,use="complete.obs")


[1] -0.3915887

• Why is the sign negative?


To answer that question, we have to think back to how our variables were measured.
Higher values of trade mean more trade. Higher values of instability mean more
instability. So a negative correlation means that as trade increases, instability falls.
Is this what we expected?
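
The use="complete.obs" argument matters because both variables have missing values. The following sketch, with invented numbers (trade_toy and instability_toy are hypothetical), shows what cor() does with and without it.

```r
# Hypothetical values; NA marks a missing observation.
trade_toy       <- c(20, 50, 80, 110, NA, 200)
instability_toy <- c(1.5, 0.8, NA, -0.2, 0.3, -1.0)

# With missing values and no 'use' argument, cor() returns NA.
r_default <- cor(trade_toy, instability_toy)

# use = "complete.obs" drops every pair in which either value is missing.
r_complete <- cor(trade_toy, instability_toy, use = "complete.obs")

# The same result computed by hand on the complete pairs only.
keep     <- complete.cases(trade_toy, instability_toy)
r_manual <- cor(trade_toy[keep], instability_toy[keep])

c(r_default, r_complete, r_manual)
```

In this toy example higher trade goes with lower instability, so the correlation on the complete pairs is negative, just like the sign we found in the real data.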

• Now let us generate a scatterplot of the relationship. In a scatterplot, the order of
the variables is important. The dependent variable appears on the y-axis and the
independent variable appears on the x-axis.

plot(trade09, instability, main="Scatterplot of Trade and Instability",
     xlab="Trade in 2009", ylab="Instability")

• There is a visible relationship in the data: we can picture a downward-sloping
line running from high instability/low trade to low instability/high trade. But
the relationship does not appear very strong, and there is clearly a lot of variance
in political instability that is not connected to trade. We know this because the
country points are widely scattered.

• Step 3: Make sure you explore the variables carefully. Make sure you
know how the variables are measured (categorical, continuous etc.). You
are welcome to use histograms, boxplots etc. to explore them. Go back
to HW2 and HW3 to review some of the analysis you did there. Feel free
to use any of those methods if you think they fit the data, for instance
side-by-side boxplots if one of your variables is a dummy variable.
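
As a reminder of what side-by-side boxplots look like in R, here is a minimal sketch on simulated data. The variables africa_sim and instability_sim are hypothetical stand-ins, and the 100/44 split is invented.

```r
set.seed(1)
# Simulated data (hypothetical): a 0/1 dummy and a continuous outcome.
africa_sim      <- rep(c(0, 1), times = c(100, 44))
instability_sim <- rnorm(144, mean = 0.5 * africa_sim, sd = 1)

# The formula 'outcome ~ group' draws one box per value of the dummy.
bp <- boxplot(instability_sim ~ africa_sim,
              names = c("Not Africa", "Africa"),
              ylab = "Instability (simulated)",
              main = "Instability by region dummy")

# Group medians, the thick line inside each box.
tapply(instability_sim, africa_sim, median)
```

With the attached course data, the same idea would presumably be boxplot(instability ~ africa), assuming africa is coded 0/1 as described in the variable description.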

Bivariate Regression Analysis


Following our preliminary data analysis, we move to formal regression analysis. Let us
begin with a bivariate regression that recapitulates the scatterplot but in a more precise
way.
> model1<-lm(instability~trade09)
> summary(model1)

Call:
lm(formula = instability ~ trade09)

Residuals:
     Min       1Q   Median       3Q      Max
-1.56440 -0.69430 -0.06981  0.58468  2.25917

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.752934   0.149527   5.035 1.42e-06 ***
trade09     -0.007589   0.001496  -5.071 1.21e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8586 on 142 degrees of freedom
  (45 observations deleted due to missingness)
Multiple R-squared: 0.1533, Adjusted R-squared: 0.1474
F-statistic: 25.72 on 1 and 142 DF, p-value: 1.214e-06

A bivariate regression estimates the linear relationship between two variables: one cause
and one effect. The estimated slope is the slope of the regression line (here it is negative,
about -0.0076), and the estimated intercept (here about 0.75) is the predicted value of y
where x = 0. The intercept is reported in the regression results in the row labeled
(Intercept); it is also called the constant.
Let us examine the regression line, which is the best fit line between values of x (in this
case trade) and values of y (in this case, political instability) for our dataset.

• As you can see, the graphic is identical to the scatterplot already produced, except
that we have added the line of best fit. You can see that, although the line is
clearly sloping downwards, lots of country-dots are far from the line. This suggests
a lot of unexplained variance in the dependent variable.
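
One way to draw such a graphic is to call abline() on the fitted model right after plot(). The sketch below does this on simulated data. The variables trade_sim and instability_sim are hypothetical, generated to loosely mimic the fit reported above, so the numbers will not match the real output.

```r
set.seed(2017)
# Simulated data (hypothetical), loosely mimicking the reported fit.
trade_sim       <- runif(144, min = 22, max = 380)
instability_sim <- 0.75 - 0.0076 * trade_sim + rnorm(144, sd = 0.86)

fit_sim <- lm(instability_sim ~ trade_sim)

# Scatterplot first, then the fitted line drawn on top of it.
plot(trade_sim, instability_sim,
     main = "Scatterplot with regression line (simulated)",
     xlab = "Trade", ylab = "Instability")
abline(fit_sim, col = "red", lwd = 2)

coef(fit_sim)  # intercept and slope of the line just drawn
```

With the course data, the equivalent would presumably be the plot() call from the scatterplot above followed by abline(model1).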

• Now let us interpret the bivariate results. In this model, the trade volume in 2009
is negatively related to political instability, and the relationship is statistically
significant at the p < 0.01 level (the p-value is 1.21e-06).

• Let us add more substance to our interpretation. A one-percentage-point increase in
trade as a percent of GDP is associated with a 0.0076 unit reduction in instability, on
average. We talk about trade in percentage points because trade is measured as a
percent of GDP (you can figure this out from the variable description PDF).
We talk about instability in "units" because there is no natural interpretation to the
units used to measure this variable. In general, we have to talk about each variable
in whatever metric is used to measure it.

• Is this substantively significant? To answer this question we have to know how the
variables are measured, so we look at their ranges. The range of trade is
380.5 - 22.30 = 358.2. This means that the largest change in instability that the
model associates with trade is (-0.0076)*358.2 = -2.7. Instability in our sample
ranges from -1.46 to 3.31, so its range is 4.77 (the distance from the minimum to
the third quartile, 0.675, is 2.135). Moving from the lowest to the highest observed
level of trade is therefore associated with a drop in instability of more than half
of its full range, which is a large effect. (Remember that the slope tells you how
much Y (instability) changes when you change X (trade).)

> summary(trade09)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
22.30 57.39 78.08 87.74 107.80 380.50 45

> summary(instability)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
-1.46000 -0.71500 -0.08000 0.07358 0.67500 3.31000 2
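
The range calculation can also be done in R. This is only a sketch that retypes the numbers from the summary() and regression output above; the object names are made up for the illustration.

```r
# Numbers copied from the summary() and regression output in this handout.
slope      <- -0.007589   # coefficient on trade09 in model 1
trade_min  <- 22.30
trade_max  <- 380.50
instab_min <- -1.46
instab_max <- 3.31

trade_range  <- trade_max - trade_min     # 358.2
instab_range <- instab_max - instab_min   # 4.77

# Predicted change in instability when trade moves across its whole range.
max_change <- slope * trade_range

# The same change expressed as a share of the observed range of instability.
share_of_range <- abs(max_change) / instab_range

round(c(max_change = max_change, share_of_range = share_of_range), 2)
```

The result is a change of about -2.7 units, roughly 57 percent of the observed range of instability. With the attached data you could get the two ranges directly from range(trade09, na.rm=TRUE) and range(instability, na.rm=TRUE).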

• Step 4: Establish the basic relationship between your two main variables.
Explain the findings. Explain whether this was what you expected to find.

Multivariate Regression Analysis

• Usually we need to include control variables in a regression equation. As you know,
if you leave out important confounding variables from your regression you may end
up drawing inaccurate conclusions.

• Normally, as with a bivariate regression, you will have a theoretical reason to choose
your control variables. In this example, we are going to re-estimate the effect of
trade on instability and include "africa", a dummy variable, in the regression. After
that I will run another regression where I also include the interaction term between
"africa" and trade. This is because I believe that trade may affect instability in
a different way in Africa than in the rest of the world, because of the high volume of
trade in natural resources and minerals. I believe that in Africa trade may actually
increase instability. To show how a non-dummy control variable can work, I will
also run a regression where I control for ethnic fractionalization. This is because I
believe that ethnic fractionalization can also be a confounding variable: it can decrease
trade and increase instability.

• In R you can run a regression by calling lm(Y ~ X). You can run a multiple regression
just by adding variables to the formula: lm(Y ~ X1 + X2 + X3). It is good
practice to assign your regression to an object (with the little arrow, <-). Your results
will be stored there from now on. You can get the results in a table by asking R to
"summarize" the results. You can just put your variables into the following code.

model1<-lm(instability~trade09)
summary(model1)

model2<-lm(instability~trade09+ethfrac)
summary(model2)

model3<-lm(instability~trade09+africa)
summary(model3)

model4<-lm(instability~trade09+africa+africa*trade09)
summary(model4)

The output of summary(model2) is:

Call:
lm(formula = instability ~ trade09 + ethfrac)

Residuals:
     Min       1Q   Median       3Q      Max
-1.74282 -0.63422 -0.01885  0.65651  1.88109

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.028227   0.212312   0.133 0.894460
trade09     -0.006022   0.001549  -3.889 0.000167 ***
ethfrac      1.479904   0.296156   4.997 2.04e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8108 on 118 degrees of freedom
  (68 observations deleted due to missingness)
Multiple R-squared: 0.2945, Adjusted R-squared: 0.2826
F-statistic: 24.63 on 2 and 118 DF, p-value: 1.148e-09

• Here I will only interpret one of these multivariate models: model 2. Please find
guidance on how to interpret models 3 and 4 in the lecture slides and in the
homework.
Our interpretation proceeds as in the bivariate example, except that now the coefficient
of each of the two independent variables is estimated holding the value of the other
constant. Let’s think of it this way: ethnic fractionalization is a confounding variable
that may affect both trade and instability. We want to know the true impact of trade on
instability after we purge both variables of any impact of ethnic fractionalization.

• We begin by examining the coefficient of trade. It is still negative, it is now
-0.006, and it is still statistically significant. This means that, all else equal, a
one-percentage-point increase in trade is associated with a decrease in instability of
0.006 units on average. We also find that, all else equal, a one-unit increase in ethnic
fractionalization is associated with an increase in instability of 1.48 units on average.

• Is the ethnic fractionalization coefficient bigger (better) than the trade coefficient?
Not necessarily! Remember you always have to know the scale of these variables!

> summary(trade09)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
22.30 57.39 78.08 87.74 107.80 380.50 45
> summary(ethfrac)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
0.0100 0.2200 0.4750 0.4497 0.6625 0.9300 49
> summary(instability)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
-1.46000 -0.71500 -0.08000 0.07358 0.67500 3.31000 2

• As you can see here, the range of ethnic fractionalization is 0.93 - 0.01 = 0.92.
This means that the maximum change we can see in instability (keeping trade
constant) is 0.92*1.48 = 1.36 (here I multiply the observed range of X by the beta
coefficient). Instability (our Y variable) ranges from -1.46 to 3.31, so its range is
4.77, and 1.36/4.77 = 0.29, a bit more than a quarter of that range. This is not
trivial by any means: 1.36 is about the distance between the first quartile (-0.715)
and the third quartile (0.675) of instability, so ethnic homogeneity vs. ethnic
fractionalization can be the difference between a country at the boundary of the 25%
least unstable countries and one at the boundary of the 25% most unstable countries.

• The range of trade is 380.5 - 22.30 = 358.2. This means that the maximum change we
could see in instability (keeping ethnic fractionalization constant) is
(-0.006)*358.2 = -2.15. The range of instability is 4.77 (from -1.46 to 3.31), so in
the extreme trade could move a country across about 45% of the full range of
instability, which is more than the gap between the first and third quartiles.
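
To put the two regressors on a common footing, the sketch below computes the maximum predicted change for each one and compares it with two yardsticks for instability: its full range and its interquartile range. The helper max_effect is hypothetical and written only for this illustration. The numbers are retyped from the model 2 output and the summary() output above.

```r
# Hypothetical helper: predicted change in Y when X moves from x_min to x_max.
max_effect <- function(beta, x_min, x_max) beta * (x_max - x_min)

# Coefficients from model 2 and minima/maxima from the summary() output.
effect_trade   <- max_effect(-0.006022, 22.30, 380.50)
effect_ethfrac <- max_effect(1.479904, 0.01, 0.93)

# Two yardsticks for instability: full range and interquartile range.
instab_range <- 3.31 - (-1.46)     # max minus min
instab_iqr   <- 0.675 - (-0.715)   # 3rd quartile minus 1st quartile

effects <- c(trade09 = effect_trade, ethfrac = effect_ethfrac)
round(rbind(effect         = effects,
            share_of_range = abs(effects) / instab_range,
            share_of_iqr   = abs(effects) / instab_iqr), 2)
```

Read this way, the much larger ethfrac coefficient does not mean a larger possible effect: over the observed ranges, trade is associated with the bigger change in instability.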

• Our interpretation of the constant is now slightly changed. The constant is now the
predicted value of y when both regressors, x1 and x2, are equal to 0 (which means
that predicted instability would be 0.03 if there was no trade and no ethnic
fractionalization).
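
To make the role of the constant concrete, here is a minimal sketch that plugs values into the estimated equation of model 2. The function predict_instability is a hypothetical helper; the coefficients and the sample means are retyped from the output in this handout.

```r
# Coefficients copied from the model 2 output in this handout.
b0        <- 0.028227    # (Intercept)
b_trade   <- -0.006022   # trade09
b_ethfrac <- 1.479904    # ethfrac

# Hypothetical helper returning the fitted value for given regressor values.
predict_instability <- function(trade, ethfrac) {
  b0 + b_trade * trade + b_ethfrac * ethfrac
}

# With both regressors at 0 the prediction is exactly the constant.
at_zero <- predict_instability(trade = 0, ethfrac = 0)

# A country at the sample means of trade09 (87.74) and ethfrac (0.4497).
at_means <- predict_instability(trade = 87.74, ethfrac = 0.4497)

round(c(at_zero = at_zero, at_means = at_means), 3)
```

Keep in mind that no country in the sample has zero trade (the minimum is 22.30), so the constant is an extrapolation beyond the observed data. With a fitted model object you would normally use predict() instead of typing the coefficients by hand.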

• The overall amount of variance explained by trade and ethnic fractionalization is
about 29 percent (R2 = 0.29; adjusted R2 = 0.28), up from 15 percent in the bivariate
model. We have improved the R2 with the addition of ethnic fractionalization to
the regression. Adding more variables never lowers the R2 and almost always raises it,
just because the model has more ways of fitting the outcome. Sometimes this does
not represent a genuine improvement in the model. In this case, we have theoretical
reasons to believe that ethnic fractionalization probably affects trade and political
instability, so the addition of the variable can be justified on those grounds.
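
The point about R2 can be checked on simulated data. In the sketch below all variables are hypothetical: y depends only on x, and noise is a regressor that has nothing to do with y. Adding noise still cannot lower the R2, while the adjusted R2 is allowed to fall.

```r
set.seed(123)
n <- 120
# Simulated data (hypothetical): y depends on x only; 'noise' is unrelated to y.
x     <- rnorm(n)
noise <- rnorm(n)
y     <- 1 - 0.5 * x + rnorm(n)

small <- lm(y ~ x)
big   <- lm(y ~ x + noise)

# R-squared cannot fall when a regressor is added; adjusted R-squared can.
r2 <- c(small = summary(small)$r.squared,
        big   = summary(big)$r.squared)
adj_r2 <- c(small = summary(small)$adj.r.squared,
            big   = summary(big)$adj.r.squared)

round(rbind(r2, adj_r2), 4)
```

This is why a higher R2 alone is not a reason to keep a variable; the theoretical justification has to come first.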

You may want to organize your results in a table if you present more than
one regression. See a simpler example in the Homeworks. Here I put the standard errors
under the coefficients, but you may use other formats as well to show the significance of
your results (p-values, confidence intervals etc.).

Tab. 1: Dependent variable: Instability

                   (Model 1)   (Model 2)   (Model 3)   (Model 4)
trade09            -0.008***   -0.006***   -0.007***   -0.007***
                   (0.001)     (0.002)     (0.001)     (0.002)

ethfrac                         1.480***
                               (0.296)

africa                                      0.476***    0.705**
                                           (0.162)     (0.344)

trade09*africa                                         -0.003
                                                       (0.004)

Constant            0.753***    0.028       0.600***    0.551***
                   (0.150)     (0.212)     (0.155)     (0.168)

Observations        144         121         144         144
R2                  0.153       0.295       0.202       0.206

Note: *p<0.1; **p<0.05; ***p<0.01

• What you can see here is that the coefficient of the variable trade is pretty stable
across the models. This means that we can be fairly confident that we recovered the
real effect of trade on instability: it does not seem to be confounded by "africa" or
"ethnic fractionalization" (although please note that these variables still seem to
influence instability independently).

• Another interesting thing in this table is that the interaction term does not seem
to matter in the regression. The dummy variable shows that instability is higher
in Africa than elsewhere. However, the interaction term is not statistically
significant (so we cannot say that it is different from 0). This means that we find no
evidence that trade affects instability differently in Africa than anywhere else: it
decreases instability.

• If you use dummy variables make sure that you go back to HW5 and use the template
there to explain your findings (remember that you will have two equations- here one
for Africa and one for not Africa). Remember what the interaction term means.
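
As a reminder of how the two equations come out of an interaction model, the sketch below builds them from the rounded model 4 coefficients reported in Tab. 1. The object names are made up for the illustration; this is not a replacement for the HW5 template.

```r
# Rounded coefficients of model 4 as reported in Tab. 1.
b0         <- 0.551    # Constant
b_trade    <- -0.007   # trade09
b_africa   <- 0.705    # africa
b_interact <- -0.003   # trade09*africa

# africa = 0: the dummy and the interaction term drop out.
line_not_africa <- c(intercept = b0, slope = b_trade)

# africa = 1: the dummy shifts the intercept, the interaction shifts the slope.
line_africa <- c(intercept = b0 + b_africa, slope = b_trade + b_interact)

rbind(not_africa = line_not_africa, africa = line_africa)
```

So the estimated line for Africa starts higher (intercept 1.256 instead of 0.551) and is slightly steeper (slope -0.010 instead of -0.007), but because the interaction term is not statistically significant we cannot conclude that the two slopes really differ.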

• Step 5: Multiple Regression: As you can see, you can explore quite a
bit with multiple regressions. Make sure you choose a few models and
analyze them carefully, as opposed to running too many regressions and
trying to explain them all (I know it is fun, but we do not have much
space here). If you have explored many different models, you may mention
what you explored in a footnote or in a few quick sentences. If you
have an interesting finding, you can just frame your paper around it. In
the conclusion, revisit your results; you can also mention what else
you would have done, what other variables you would use if you had them,
etc.
