These notes cover a partial solution to the last problem from Homework 5, and
also one other problem where we are concerned with testing for interactions of two
factors. The point here is not to provide a complete homework solution, but to
illustrate (1) the writing style of an acceptable answer in a write-up where your
explanations should stand alone, and (2) correct and incorrect ways of phrasing the
explanation of hypothesis tests.
First, we load the data:
Ericksen =
read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Ericksen.txt",
header = TRUE)
attach(Ericksen)
minority = log(minority)
crime = log(crime)
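We then fit the linear model
Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε,    (1)
where Y is the log crime rate and X1, X2, X3 are poverty, log minority, and
highschool, respectively.
CODE:
model = lm(crime ~ poverty + minority + highschool)
summary(model)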
OUTPUT:
Residuals:
     Min       1Q   Median       3Q      Max
-0.76675 -0.14504 -0.04639  0.16388  0.82748

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.057495   0.129794  31.261  < 2e-16 ***
poverty      0.024410   0.011596   2.105   0.0393 *
minority     0.217060   0.032441   6.691 7.35e-09 ***
highschool  -0.024877   0.005674  -4.384 4.59e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Based on the p-values in the table of t-statistics, we conclude that each of the
explanatory variables has a statistically significant association with crime at the
five-percent level.
This could also be phrased: We have statistically significant evidence, at the
five-percent level, that each of the βj (tested individually) is nonzero in the model.
This can also be phrased: We have statistically significant evidence that
eliminating any one variable Xj from the model leads to a model with less accurate
predictions.
This can also be phrased: We have statistically significant evidence that the model
(1) fits the data better than any of the corresponding two-variable models.
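These phrasings describe the same test, which can be checked numerically: for a single coefficient, the t-test in the summary table coincides with the F-test that compares the full model against the model with that one variable removed (F equals t squared, and the p-values match). The sketch below uses simulated data rather than the Ericksen data; the names x1, x2, x3, y and the coefficients are made up for illustration.

```r
# Simulated stand-in data (hypothetical; NOT the Ericksen data set)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.8 * x2 - 0.3 * x3 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3)   # analogue of model (1)
reduced <- lm(y ~ x2 + x3)        # same model with x1 removed

# t-statistic and p-value for x1 from the coefficient table
t_x1 <- coef(summary(full))["x1", "t value"]
p_t  <- coef(summary(full))["x1", "Pr(>|t|)"]

# F-test comparing the two nested models
cmp  <- anova(reduced, full)
F_x1 <- cmp[2, "F"]
p_F  <- cmp[2, "Pr(>F)"]

# The two tests agree: F equals t^2 and the p-values match
c(t_squared = t_x1^2, F = F_x1)
c(p_from_t = p_t, p_from_F = p_F)
```

Calling drop1(full, test = "F") runs the same comparison for every variable in the model at once.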
Note: It is actually better not to use one sentence to describe all three tests,
because technically there is a multiple testing problem. I won’t worry too much
about that though. The other advantage of writing a sentence for each test is that
then you can refer to the variables by name. I’m too lazy to type out each of the
phrasings above for each of the three tests, so I’ll do a sample of this better answer:
(Ideal solution, with some variation in the phrasing.)
• The t-test of the hypothesis H0 : β2 = 0 against the full model in (1) results
in a very small p-value, indicating statistically significant evidence that
removing minority from the model results in a worse model fit.
• The t-test of the hypothesis H0 : β1 = 0 against the full model in (1) results
in a small p-value of about 0.04, indicating that the crime rate and the
poverty rate have a statistically significant association after adjusting for
the effects of minority percentage and highschool education demographics.
(Note: the word ‘effects’ is dangerous, but above it was less inappropriate
because I did not talk about the statistically significant effect of poverty; it
is okay to use the word effect because it is already in the statistical jargon,
i.e. “main effects,” but don’t use it as the main object of a sentence because
then you sound like you are saying something causal.)
• The t-test of the hypothesis H0 : β3 = 0 against the full model in (1)
results in a very small p-value, indicating strong statistical evidence that
highschool graduation demographics are useful for predicting the crime rate,
even after accounting for poverty and minority demographics.
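To make the multiple-testing caveat from the note above concrete, here is a minimal sketch that adjusts the three reported p-values. The Bonferroni correction is my choice for illustration only; these notes do not prescribe any particular adjustment.

```r
# p-values copied from the coefficient table above
p <- c(poverty = 0.0393, minority = 7.35e-09, highschool = 4.59e-05)

# Bonferroni adjustment: multiply each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p, method = "bonferroni")
p_bonf

# Which associations remain significant at the five-percent level?
p_bonf < 0.05
```

Under this particular correction the adjusted p-value for poverty is 3 × 0.0393 ≈ 0.12, so only minority and highschool stay below the five-percent level. This is one reason to report each test separately and by name.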
More notes on this:
It is even better to talk about linear association and linear predictive power.
However, once you have told the reader that you fit a linear model, this is
understood to be what you mean, and at some point it is okay to save words.
It is not appropriate to say that any test provides evidence that variable X has
an effect on the crime rate. This phrasing suggests causation, and the only way to
establish causation is to know that your model is a valid causal model. This is only
possible if you make stronger assumptions than we like to make, or if you run an
experiment, which this is not.
It is also undesirable to talk about the association between variable X and the
crime rate without mentioning that you are accounting for the other variables in (1)
because the association depends on the model. As you saw in the lab, for instance,
highschool does not have a statistically significant association with the crime rate
in the marginal model Y = α + β3 X3 + ε, but it does have a statistically significant
association after accounting for the other variables X1 and X2.
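The dependence of an association on the model can be seen in a small simulation. The sketch below is an illustration under made-up assumptions, not a reanalysis of the lab data: x1 and x3 are hypothetical predictors, and the coefficients are chosen so that the marginal slope of x3 is zero in the population even though x3 matters once x1 is accounted for.

```r
# Simulated data (hypothetical; NOT the Ericksen data set)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x3 <- x1 + rnorm(n)              # x3 is correlated with x1
y  <- 2 * x1 - x3 + rnorm(n)     # chosen so the marginal slope of x3 is zero in the population

marginal <- lm(y ~ x3)           # Y = alpha + beta3 * X3 + error
adjusted <- lm(y ~ x1 + x3)      # also accounts for x1

marg_row <- coef(summary(marginal))["x3", ]  # association ignoring x1
adj_row  <- coef(summary(adjusted))["x3", ]  # association after accounting for x1
rbind(marginal = marg_row, adjusted = adj_row)
```

Comparing the two rows shows how the estimate, t-statistic, and p-value for the same variable change with the set of other variables in the model, which is why a write-up should always say what was adjusted for.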
Finally, avoid describing a model that omits some variables with a phrase
like “we are assuming that other variables don’t matter.” This is not true if we
take the empirical point of view rather than the structural point of view, and the
empirical point of view is the only one that can be supported by pure statistics
(except in an experimental context).
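interlockDat =
read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Ornstein.txt",
header = TRUE)
attach(interlockDat)
# two-way ANOVA with interactions (produces the table below)
anova(lm(interlocks ~ sector * nation))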
OUTPUT:
Response: interlocks
               Df Sum Sq Mean Sq F value    Pr(>F)
sector          9  20263    2251 13.0872 < 2.2e-16 ***
nation          3   4125    1375  7.9917 4.439e-05 ***
sector:nation  16   1823     114  0.6624     0.829
Residuals     219  37675     172
Recall that in the interlock data set, we have the economic sector, assets, and
nation of ownership for various firms operating in Canada. As in the preceding
dataset, it is debatable whether this data should be treated as a random sample,
but again even a wrong model can tell us something.
The ANOVA table above gives sequential F-tests: first for a model with just an
intercept against one accounting for sector, then for a model accounting for just
sector against one accounting for both sector and nation, and finally for that
additive model against the full two-way model with interactions.
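The F values in the table can be recomputed by hand, which helps when explaining what each row tests: each F statistic is the row's mean square divided by the residual mean square of the full model. The sketch below uses only the rounded numbers printed in the table, so the results agree with the table up to rounding; the variable names are my own.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss <- c(sector = 20263, nation = 4125, interaction = 1823)
df <- c(sector = 9,     nation = 3,    interaction = 16)
ss_resid <- 37675
df_resid <- 219

ms       <- ss / df                 # mean square for each term
ms_resid <- ss_resid / df_resid     # residual mean square
F_stat   <- ms / ms_resid           # F statistic for each term
p_val    <- pf(F_stat, df, df_resid, lower.tail = FALSE)  # upper-tail p-values

round(cbind(Df = df, MeanSq = ms, F = F_stat, p = p_val), 4)
```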
The F-test of interest is the one in the last row, testing whether the interactions
are statistically significant. The p-value is quite high, so we do not reject the null
hypothesis. In other words, we do not have enough evidence to conclude that the
model with interactions fits the data any better than the model without interactions.
Hence, we conclude that it is appropriate to model the effects of sector and nation as
additive. (Here again I used the word effects, but again I was using it in a situation
where it is statistical jargon and will be understood by informed readers. I did NOT
say that some variable has an effect, which seems to suggest causation; I simply
referred to ‘the effects’ of two variables. The distinction is subtle, but important if
you want to avoid misleading statements.)
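The interaction test can also be run as an explicit comparison of two nested models, which matches the wording "the model with interactions versus the model without interactions." The sketch below uses R's built-in warpbreaks data as a stand-in (it is not the interlock data), so its conclusion need not match the one above.

```r
# Built-in example data: yarn breaks by wool type and tension level
# (a stand-in; NOT the Ornstein interlock data)
additive    <- lm(breaks ~ wool + tension, data = warpbreaks)
interacting <- lm(breaks ~ wool * tension, data = warpbreaks)

# F-test of the additive model against the model with interactions
cmp <- anova(additive, interacting)
cmp

# The same test appears as the interaction row of the sequential ANOVA table
seq_tab <- anova(interacting)
seq_tab["wool:tension", c("Df", "F value", "Pr(>F)")]
```

Both calls report the same F statistic and p-value for the interaction, so either one can back the sentence you write in the report.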