
Graded Assignment Advanced Statistics

Jessica A. Eddy and Dashari Colon-Maldonado

September 27, 2016

Student Information
Jessica A. Eddy S179533 Dashari Colon-Maldonado S1785494

Experiment Description
Fun description We are a company named JaDa Bio Inc. that intends to achieve complete world
domination, but first we want to know in which of the 4 seasons (spring, summer, fall, and
winter) the highest number of microorganism species can be found, and in which of our two
chosen countries. The locations to start our takeover of the world (and thus where we need to
know the number of microbial species) are the United States of America and China, because
literature reviews suggest that microbial communities differ most between these two countries.
There will be 5 locations within each country, determined randomly by a number generator
constrained to the latitude and longitude limits of each country. We send our trusted, unpaid
henchman Napoleon to take 5 samples in each country once per season using an
idiot-proof, step-by-step guide (for cost effectiveness, reduction of technician differences,
bias reduction, efficiency, and standardization).

Dry description We are a research laboratory interested in determining the number of
microbial species found in 2 different countries in each of the four seasons of the year. The
two countries are the United States and China, chosen because a literature review
indicates that microbial communities differ most between them. The experiment will be done
once per season at 5 different locations within each country. The locations will be determined
before each experiment by a random generator for latitude and longitude (constrained to the
geographical limits of each country). We eliminate technician and
methodology bias by sending the same technician to perform the experiment every single
time, with the very same standard operating procedures (SOPs) and the very same calibrated
instruments.

Simulation and Analysis


We are trying to simulate the possible results of our research question. Therefore, we need to
create the data for the two explanatory variables Country and Season, and the response
variable Species. Country is a factor with two levels: USA and China. Season is a factor with
four levels: Spring, Summer, Fall, and Winter.

Four-level factor Seasons and Two-level factor Country
First, we create our Season and Country factors. For Season, we ask gl() for 4 levels with 5
replications each and repeat the result twice (once per country); for Country, we ask for 2
levels with 20 replications each. Both get the appropriate labels. Then, we check that the
values reflect our commands.

Season<-rep(gl(4, 5, labels=c("Spring", "Summer", "Fall", "Winter")),
            times=2)
str(Season)
## Factor w/ 4 levels "Spring","Summer",..: 1 1 1 1 1 2 2 2 2 2 ...
Season
## [1] Spring Spring Spring Spring Spring Summer Summer Summer Summer Summer
## [11] Fall Fall Fall Fall Fall Winter Winter Winter Winter Winter
## [21] Spring Spring Spring Spring Spring Summer Summer Summer Summer Summer
## [31] Fall Fall Fall Fall Fall Winter Winter Winter Winter Winter
## Levels: Spring Summer Fall Winter
Country<-gl(2, 20, labels=c("USA", "China"))
str(Country)
## Factor w/ 2 levels "USA","China": 1 1 1 1 1 1 1 1 1 1 ...
Country
## [1] USA USA USA USA USA USA USA USA USA USA USA
## [12] USA USA USA USA USA USA USA USA USA China China
## [23] China China China China China China China China China China China
## [34] China China China China China China China
## Levels: USA China
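
A quick way to confirm that the two factors describe a balanced design is to cross-tabulate them. The sketch below rebuilds both factors as above; `design_table` is a name introduced here for illustration only.

```r
# Rebuild the two factors exactly as in the text.
Season <- rep(gl(4, 5, labels = c("Spring", "Summer", "Fall", "Winter")),
              times = 2)
Country <- gl(2, 20, labels = c("USA", "China"))

# Cross-tabulate: a balanced 2 x 4 design should hold 5 observations per cell.
# 'design_table' is a hypothetical helper name, not part of the original script.
design_table <- table(Country, Season)
print(design_table)
```

If any cell differed from 5, the `rep()` and `gl()` arguments would need to be revisited before simulating the response.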

Second, we must create numerical versions of each variable to represent the value
expectations for the number of microbial species. This way, we end up with two vectors.
Then, we visually check our empirical models.

Season_N<-rep(c(4231.6, 2189.3, 1695.1, 1002.8), each=5, times=2)
Country_N<-rep(c(6839.1,2279.7), each=20)

plot(Country_N~Country)

plot(Season_N~Season)

Statistical Model
Now we add the variance to our working models and check each visually.

set.seed(362)
Residuals_S<-rnorm(40, 0, 256)
Species_S<-Season_N+Residuals_S
plot(Species_S~Season)
points(Species_S~Season)
set.seed(362)
Residuals_C<-rnorm(40, 0, 256)
Species_C<-Country_N+Residuals_C
plot(Species_C~Country)
points(Species_C~Country)

The plots visually look indistinguishable from a possible real and legitimate experiment, so
our models seem to be working properly. Next we will check if we can get our means back
for each variable.
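
One detail of the simulation is worth checking: resetting the seed to the same value before each call to rnorm() reproduces the same random draws. The sketch below (the names `same_noise`, `Residuals_S2`, and `Residuals_C2` are introduced here for illustration) compares this with drawing both vectors after a single call to set.seed().

```r
# Same seed before each draw: both residual vectors come out identical.
set.seed(362)
Residuals_S <- rnorm(40, 0, 256)
set.seed(362)
Residuals_C <- rnorm(40, 0, 256)
same_noise <- identical(Residuals_S, Residuals_C)
print(same_noise)

# One seed, two consecutive draws: the vectors are different but still
# reproducible from run to run.
set.seed(362)
Residuals_S2 <- rnorm(40, 0, 256)
Residuals_C2 <- rnorm(40, 0, 256)
print(identical(Residuals_S2, Residuals_C2))
```

This does not matter while Species_S and Species_C are analysed separately, but it would matter if both residual vectors were ever combined in one model.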

T-Test for Country


Since Country is the only factor with two levels, we can run a t-test to estimate its
population means. Season cannot be analysed with a t-test because it has more than two levels.

t.test(Species_C~Country)
##
## Welch Two Sample t-test
##
## data: Species_C by Country
## t = 66.969, df = 37.881, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4423.568 4699.374
## sample estimates:
## mean in group USA mean in group China
## 6872.182 2310.711

Upon inspection, the t-test output gives estimates (6872.2 for USA and 2310.7 for China) that
are very close to the population means we specified (6839.1 and 2279.7).
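
The group estimates reported by t.test() are simply the sample means per country. The sketch below re-simulates the response as in the text and checks this; `group_means`, `welch`, `true_diff`, and `ci_covers_truth` are names introduced here for illustration.

```r
# Re-create the country-only simulation from the text.
Country <- gl(2, 20, labels = c("USA", "China"))
Country_N <- rep(c(6839.1, 2279.7), each = 20)
set.seed(362)
Species_C <- Country_N + rnorm(40, 0, 256)

# Sample means per country, computed directly.
group_means <- tapply(Species_C, Country, mean)

# Welch two-sample t-test (the default in t.test).
welch <- t.test(Species_C ~ Country)
print(group_means)
print(welch$estimate)

# Does the confidence interval cover the difference we simulated?
true_diff <- 6839.1 - 2279.7
ci_covers_truth <- welch$conf.int[1] <= true_diff && true_diff <= welch$conf.int[2]
print(ci_covers_truth)
```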

ANOVA for Seasons


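Now we perform a one-way ANOVA for Season, which is appropriate because Season has four
levels. Running a one-way ANOVA for Country would be equivalent to the t-test we already ran
(apart from the Welch correction for unequal variances). We remove the intercept from the
model (Season-1) so that each coefficient is the estimated mean of one season rather than a
difference from a reference level, which lets us compare them directly with our original
population means.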

one_anova<-lm(Species_S~Season-1)
summary(one_anova)
##
## Call:
## lm(formula = Species_S ~ Season - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -325.37 -163.47 3.07 139.94 521.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## SeasonSpring 4196.1 67.2 62.44 <2e-16 ***
## SeasonSummer 2180.7 67.2 32.45 <2e-16 ***
## SeasonFall 1810.0 67.2 26.93 <2e-16 ***
## SeasonWinter 1060.2 67.2 15.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 212.5 on 36 degrees of freedom
## Multiple R-squared: 0.994, Adjusted R-squared: 0.9933
## F-statistic: 1482 on 4 and 36 DF, p-value: < 2.2e-16

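Upon inspection, we see that the estimated means are close to our original means (4231.6,
2189.3, 1695.1, and 1002.8). The largest discrepancy is for Fall (1810.0 against 1695.1),
which is still within two standard errors of the specified value.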
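
To see why the intercept is removed, compare the coefficients of the two parameterisations. In this sketch (the names `cell_means`, `coef_no_intercept`, and `coef_with_intercept` are introduced here for illustration), the model without an intercept returns the season means, while the default model returns the Spring mean plus differences from Spring.

```r
# Re-create the season-only simulation from the text.
Season <- rep(gl(4, 5, labels = c("Spring", "Summer", "Fall", "Winter")),
              times = 2)
Season_N <- rep(c(4231.6, 2189.3, 1695.1, 1002.8), each = 5, times = 2)
set.seed(362)
Species_S <- Season_N + rnorm(40, 0, 256)

# Sample mean per season.
cell_means <- tapply(Species_S, Season, mean)

# Without intercept: one coefficient per season, equal to the season mean.
coef_no_intercept <- coef(lm(Species_S ~ Season - 1))

# With intercept: Spring mean, then differences from Spring.
coef_with_intercept <- coef(lm(Species_S ~ Season))

print(cell_means)
print(coef_no_intercept)
print(coef_with_intercept)
```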

Building our model


We cannot perform a linear regression analysis or an ANCOVA due to the nature of our
explanatory variables. They are both categorical factors, and we lack the numerical explanatory
variable needed for these two methods. We will therefore focus on building a two-way
(factorial) ANOVA for our model. Then, we need to check the associated
assumptions, which are independence of observations, homogeneity of variance, and normal
distribution of the residuals.

Model
Here we define our explanatory variables.

set.seed(362)
Country_Int<-rep(c(1,2), each=20)
Seasons_Int<-rep(c(1,2,3,4), each=5, times=2)

Now we make a dataframe to easily check if the countries and the seasons are distributed
properly.

Data.Seasons<-data.frame(Country, Country_N, Season, Seasons_Int,
                         Country_Int)

Then, we plot our model in order to determine the season in which the highest number of
species can be found, and in which country this occurs.

library(lattice)
Species<-Country_N+Country_Int*Season_N
# shows the season with the highest number of species per country
xyplot(Species~Season|Country, type=c("p", "r"))

# shows the country with the highest number of species per season
xyplot(Species~Country|Season, type=c("p", "r"))

Now, we need to add the residuals to our model in order to make our statistical model. We
will also look at different plots that best show our data.

set.seed(362)
Residuals_1<-rnorm(length(Season), 0, 300)
Species_1<-Country_N+Country_Int*Season_N+Residuals_1
# nicely shows the relationship between the means of each season per country
xyplot(Species_1~Season|Country, type=c("p", "r"), xlab="Seasons",
       ylab="Species",
       main="Microbial Species in USA and China Across the Seasons")

# good representation of the distribution of each of the levels, with an
# indication of a possible outlier for Spring in China
bwplot(Species_1~Season|Country, xlab="Seasons", ylab="Species",
       main="Microbial Species in USA and China Across the Seasons")

interaction.plot(Season, Country, Species_1,
                 main="Microbial Species in USA and China Across the Seasons")

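The last plot, the interaction plot, was run to check visually for an interaction between our
variables. Note that the simulation multiplies Season_N by Country_Int, so the seasonal effect
is built to be twice as large in China as in the USA; non-parallel lines in this plot therefore
reflect an interaction, which the ANOVA tests formally.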
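
The expected cell means implied by the simulation formula Country_N+Country_Int*Season_N can be tabulated directly, which helps when reading the plots. This is a sketch using the values from the text; `season_mean`, `country_mean`, `country_slope`, `expected`, and `seasonal_drop` are names introduced here for illustration.

```r
# Values taken from the simulation set-up in the text.
season_mean   <- c(Spring = 4231.6, Summer = 2189.3, Fall = 1695.1, Winter = 1002.8)
country_mean  <- c(USA = 6839.1, China = 2279.7)
country_slope <- c(USA = 1, China = 2)   # this is Country_Int

# Expected value per cell: country mean + slope * season mean.
# outer() gives a 2 x 4 matrix; adding the length-2 vector recycles it by row.
expected <- country_mean + outer(country_slope, season_mean)
print(expected)

# Drop from Spring to Winter within each country: it is twice as large in
# China, which is the interaction built into the simulation.
seasonal_drop <- expected[, "Spring"] - expected[, "Winter"]
print(seasonal_drop)
```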

Now we can finally make our full ANOVA model. Then we will proceed with data inspection
and model selection. Afterwards, we will run diagnostics on our chosen model.

anova<-aov(Species_1~Season+Country+Season:Country)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Season 3 123123873 41041291 641.80 < 2e-16 ***
## Country 1 52081052 52081052 814.44 < 2e-16 ***
## Season:Country 3 13265645 4421882 69.15 4.47e-14 ***
## Residuals 32 2046309 63947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(anova)

Before we run the diagnostics, we will make a model without the interaction term, even
though the summary of the anova model indicates significance of the interaction between
Country and Season.

anova1<-aov(Species_1~Season+Country)
summary(anova1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Season 3 123123873 41041291 93.81 < 2e-16 ***
## Country 1 52081052 52081052 119.05 8.25e-13 ***
## Residuals 35 15311955 437484
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(anova1)

The plots show a worse distribution of the residuals in comparison to the prior model, because
the variation due to the interaction is now absorbed into the residuals (the residual mean
square rises from 63947 to 437484). Nevertheless, we will compare the models next to justify
keeping the interaction term. In this second model the p-value for Country (8.25e-13) is no
longer below the smallest value R reports, while the p-value for Season still is. This should
not be over-interpreted: both effects remain highly significant, and a p-value is not a measure
of how important a factor is.

Model comparison using RSS and AIC values and model selection:

anova(anova, anova1)
## Analysis of Variance Table
##
## Model 1: Species_1 ~ Season + Country + Season:Country
## Model 2: Species_1 ~ Season + Country
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 32 2046309
## 2 35 15311955 -3 -13265645 69.149 4.47e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(anova, anova1)
## df AIC
## anova 9 565.2218
## anova1 6 639.7257

The output of the anova comparison shows that the RSS for our first model is much smaller
than for the second, and the p-value indicates that this difference is highly significant.
The negative Df in the second row is not a problem with the model: it only indicates that the
second model has 3 fewer parameters than the first (the three interaction terms). The output
of the AIC comparison also indicates that the first model is the better model for our data,
since its AIC is lower even though it uses more parameters (9 against 6).
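
The F statistic and the two AIC values in the output above can be reproduced by hand from the residual sums of squares. The sketch below uses only the numbers printed in the tables; `aic_from_rss` is a helper written here for illustration, based on the Gaussian log-likelihood of a linear model, where k counts the coefficients plus the residual variance.

```r
n <- 40                                # number of observations
rss_full <- 2046309;  df_full <- 32    # model with interaction
rss_add  <- 15311955; df_add  <- 35    # model without interaction

# F test for nested models: extra RSS per dropped parameter, divided by the
# residual mean square of the larger model.
f_stat <- ((rss_add - rss_full) / (df_add - df_full)) / (rss_full / df_full)
print(f_stat)

# AIC of a Gaussian linear model written in terms of its RSS.
aic_from_rss <- function(rss, n, k) n * log(2 * pi * rss / n) + n + 2 * k
aic_full <- aic_from_rss(rss_full, n, k = 9)
aic_add  <- aic_from_rss(rss_add,  n, k = 6)
print(c(aic_full, aic_add))
```

The values agree with the anova() and AIC() output up to rounding of the printed RSS.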

Diagnostics on our best model


Our experimental description indicates that all observations are independent, since each
location is measured only once per season. Homogeneity of variance will be checked visually
with the residual plots and numerically with the non-constant variance test (ncvTest).
Normality of the residuals will also be checked visually and numerically, with a qqPlot and
the Shapiro-Wilk test, respectively. In the diagnostics, we will also check for outliers in
our data. There shouldn't be any, because we simulated our data from the start with normally
distributed residuals.

In order to check the assumptions, we have to generate a full linear model with the interaction
term:

anova.model<-lm(Species_1~Season+Country+Season:Country)

Homogeneity of variance:

library(car)
## Warning: package 'car' was built under R version 3.2.5
residualPlots(anova.model)
## Warning in residualPlots.default(model, ...): No possible lack-of-fit tests

ncvTest(anova.model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 3.679056 Df = 1 p = 0.05510005

The residual plots show that the variance is quite homogeneous along the line, and the
p-value from the ncvTest (0.055) is above 0.05, so we cannot reject the null hypothesis of
constant variance, although only marginally. Together, these indicate that the assumption of
homogeneity of variance is reasonably met.

Normal distribution of residuals:

qqPlot(anova.model$residuals)

shapiro.test(anova.model$residuals)
##
## Shapiro-Wilk normality test
##
## data: anova.model$residuals
## W = 0.95667, p-value = 0.1287

The qqPlot shows that the residuals follow a normal distribution, with no visual
indication of extreme points or divergence from normality. The Shapiro-Wilk test gives a
p-value of 0.13, so we cannot reject the null hypothesis that the residuals are normally
distributed. Together, these indicate that the assumption of normally distributed residuals
has been met.

Testing for outliers:

influenceIndexPlot(anova.model, col=Season, pch=19)


## Warning in plot.window(...): relative range of values = 27 * EPS, is small
## (axis 2)

outlierTest(anova.model)
##
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 22 3.182989 0.0033078 0.13231

The influence index plots show that only one of the points gets close to, but does not reach,
0.5 in Cook's distance, which makes it slightly suspicious even if not clearly an outlier.
The Studentized residuals panel shows an overall even distribution of the residuals along the
line. The same point is the only one whose Bonferroni p-value approaches 0, so it again looks
like a possible outlier visually, though not numerically. The hat values show that all the
points have essentially the same leverage, as expected for a balanced design (the plot warning
about a very small relative range on axis 2 most likely reflects this), so no single point has
undue leverage on the final results. The outlierTest gives a Bonferroni p-value of 0.13 for the
largest Studentized residual (observation 22), which is over 0.05, so there is no numerical
indication of outliers.
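
Leverage and Cook's distance can also be read off numerically with base R. The sketch below re-simulates Species_1 as in the text and fits the same factorial model; `fit`, `hat_vals`, and `cook_vals` are names introduced here for illustration. In a balanced factorial design with 5 observations per cell, every hat value equals 1/5.

```r
# Re-create the factorial simulation from the text.
Season <- rep(gl(4, 5, labels = c("Spring", "Summer", "Fall", "Winter")),
              times = 2)
Country <- gl(2, 20, labels = c("USA", "China"))
Season_N <- rep(c(4231.6, 2189.3, 1695.1, 1002.8), each = 5, times = 2)
Country_N <- rep(c(6839.1, 2279.7), each = 20)
Country_Int <- rep(c(1, 2), each = 20)
set.seed(362)
Species_1 <- Country_N + Country_Int * Season_N + rnorm(40, 0, 300)

fit <- lm(Species_1 ~ Season + Country + Season:Country)

# Leverage of each observation: constant in a balanced design.
hat_vals <- hatvalues(fit)
print(range(hat_vals))

# Cook's distance: which observation is the most influential?
cook_vals <- cooks.distance(fit)
print(which.max(cook_vals))
```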
Now, for the final inspection of the model:

summary.aov(anova.model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Season 3 123123873 41041291 641.80 < 2e-16 ***
## Country 1 52081052 52081052 814.44 < 2e-16 ***
## Season:Country 3 13265645 4421882 69.15 4.47e-14 ***
## Residuals 32 2046309 63947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output shows that both variables and their interaction term are significant for
explaining the results of our data.

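Next, we run a Tukey post-hoc test to see which pairwise differences have confidence
intervals that exclude 0. An interval that excludes 0 means that the difference between those
two group means is significant.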

(Post.Hoc<-TukeyHSD(aov(Species_1~Season+Country+Season:Country)))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Species_1 ~ Season + Country + Season:Country)
##
## $Season
## diff lwr upr p adj
## Summer-Spring -3031.9720 -3338.3747 -2725.5693 0.0e+00
## Fall-Spring -3628.4751 -3934.8778 -3322.0723 0.0e+00
## Winter-Spring -4734.3452 -5040.7479 -4427.9424 0.0e+00
## Fall-Summer -596.5031 -902.9058 -290.1003 5.1e-05
## Winter-Summer -1702.3732 -2008.7759 -1395.9704 0.0e+00
## Winter-Fall -1105.8701 -1412.2729 -799.4674 0.0e+00
##
## $Country
## diff lwr upr p adj
## China-USA -2282.127 -2445.015 -2119.24 0
##
## $`Season:Country`
## diff lwr upr p adj
## Summer:USA-Spring:USA -2021.6009 -2539.6747 -1503.5271 0.0000000
## Fall:USA-Spring:USA -2306.8172 -2824.8910 -1788.7434 0.0000000
## Winter:USA-Spring:USA -3252.9931 -3771.0669 -2734.9193 0.0000000
## Spring:China-Spring:USA -375.4368 -893.5106 142.6370 0.3006021
## Summer:China-Spring:USA -4417.7799 -4935.8537 -3899.7061 0.0000000
## Fall:China-Spring:USA -5325.5698 -5843.6436 -4807.4960 0.0000000
## Winter:China-Spring:USA -6591.1341 -7109.2079 -6073.0603 0.0000000
## Fall:USA-Summer:USA -285.2163 -803.2900 232.8575 0.6351262
## Winter:USA-Summer:USA -1231.3922 -1749.4659 -713.3184 0.0000002
## Spring:China-Summer:USA 1646.1641 1128.0903 2164.2379 0.0000000
## Summer:China-Summer:USA -2396.1790 -2914.2527 -1878.1052 0.0000000
## Fall:China-Summer:USA -3303.9688 -3822.0426 -2785.8951 0.0000000
## Winter:China-Summer:USA -4569.5332 -5087.6070 -4051.4594 0.0000000
## Winter:USA-Fall:USA -946.1759 -1464.2497 -428.1021 0.0000350
## Spring:China-Fall:USA 1931.3804 1413.3066 2449.4542 0.0000000
## Summer:China-Fall:USA -2110.9627 -2629.0365 -1592.8889 0.0000000
## Fall:China-Fall:USA -3018.7526 -3536.8264 -2500.6788 0.0000000
## Winter:China-Fall:USA -4284.3169 -4802.3907 -3766.2431 0.0000000
## Spring:China-Winter:USA 2877.5563 2359.4825 3395.6301 0.0000000
## Summer:China-Winter:USA -1164.7868 -1682.8606 -646.7130 0.0000007
## Fall:China-Winter:USA -2072.5767 -2590.6505 -1554.5029 0.0000000
## Winter:China-Winter:USA -3338.1410 -3856.2148 -2820.0672 0.0000000
## Summer:China-Spring:China -4042.3431 -4560.4169 -3524.2693 0.0000000
## Fall:China-Spring:China -4950.1330 -5468.2068 -4432.0592 0.0000000
## Winter:China-Spring:China -6215.6973 -6733.7711 -5697.6235 0.0000000
## Fall:China-Summer:China -907.7899 -1425.8637 -389.7161 0.0000694
## Winter:China-Summer:China -2173.3542 -2691.4280 -1655.2804 0.0000000
## Winter:China-Fall:China -1265.5643 -1783.6381 -747.4905 0.0000001
plot(Post.Hoc)

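The output shows that in most cases the difference in means is significant and the
confidence interval does not contain 0. The two exceptions are Spring:China-Spring:USA
(p = 0.30) and Fall:USA-Summer:USA (p = 0.64). Together with the model comparison above, this
supports keeping the model with the interaction term for explaining the data.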
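
With 28 pairwise comparisons in the interaction table, it is easier to let R list the intervals that contain 0 than to scan for them by eye. The sketch below re-simulates the data as in the text; `tukey`, `cells`, and `contains_zero` are names introduced here for illustration.

```r
# Re-create the factorial simulation from the text.
Season <- rep(gl(4, 5, labels = c("Spring", "Summer", "Fall", "Winter")),
              times = 2)
Country <- gl(2, 20, labels = c("USA", "China"))
Season_N <- rep(c(4231.6, 2189.3, 1695.1, 1002.8), each = 5, times = 2)
Country_N <- rep(c(6839.1, 2279.7), each = 20)
Country_Int <- rep(c(1, 2), each = 20)
set.seed(362)
Species_1 <- Country_N + Country_Int * Season_N + rnorm(40, 0, 300)

tukey <- TukeyHSD(aov(Species_1 ~ Season + Country + Season:Country))

# Matrix of pairwise comparisons between the 8 season-by-country cells.
cells <- tukey[["Season:Country"]]

# Flag the comparisons whose 95% interval includes 0 (not significant).
contains_zero <- cells[, "lwr"] < 0 & cells[, "upr"] > 0
print(rownames(cells)[contains_zero])
```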

Building a mixed model for our ANOVA


Now, we would like to introduce a random effect factor into our model. Since the 5
locations within each country are chosen at random, we can treat Locations as a random effect
factor. This factor could affect the intercept, the slope, or both.

To introduce the random effect factor:

LETTERS[1:5]
## [1] "A" "B" "C" "D" "E"
# Locations for the random effect factor, labeled by letters
Locations<-factor(rep(c(LETTERS[1:10]), times=4))
Locations
## [1] A B C D E F G H I J A B C D E F G H I J A B C D E F G H I J A B C D E
## [36] F G H I J
## Levels: A B C D E F G H I J
set.seed(362)
Random_EF<-rep(rnorm(10, 0, 256), each=4) #Random effect factor
Random_EF
## [1] 144.42256 144.42256 144.42256 144.42256 -38.69105 -38.69105
## [7] -38.69105 -38.69105 -171.77653 -171.77653 -171.77653 -171.77653
## [13] -216.21228 -216.21228 -216.21228 -216.21228 206.39975 206.39975
## [19] 206.39975 206.39975 -25.44847 -25.44847 -25.44847 -25.44847
## [25] -146.71013 -146.71013 -146.71013 -146.71013 -262.36423 -262.36423
## [31] -262.36423 -262.36423 182.05818 182.05818 182.05818 182.05818
## [37] 264.92312 264.92312 264.92312 264.92312
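
In the output above, Locations cycles through A to J, whereas Random_EF repeats each draw four times in a row, so consecutive observations share a draw but not a label. One possible way to keep labels and draws aligned is sketched below. It assumes, as one reading of the design, that locations A to E belong to the USA and F to J to China; all names ending in `_aligned`, and `location_effect`, are introduced here for illustration and are not part of the original script.

```r
Country <- gl(2, 20, labels = c("USA", "China"))

# Assumption: 5 locations per country, each visited once in every season.
# Within each country the data are ordered season by season, 5 rows each.
Locations_aligned <- factor(c(rep(LETTERS[1:5],  times = 4),
                              rep(LETTERS[6:10], times = 4)))

# One random deviation per location, looked up by the location label.
set.seed(362)
location_effect <- setNames(rnorm(10, 0, 256), LETTERS[1:10])
Random_EF_aligned <- unname(location_effect[as.character(Locations_aligned)])

# Checks: one distinct value per location, one country per location.
values_per_location <- tapply(Random_EF_aligned, Locations_aligned,
                              function(x) length(unique(x)))
countries_per_location <- rowSums(table(Locations_aligned, Country) > 0)
print(values_per_location)
print(countries_per_location)
```

Using this layout would change the simulated data, so the outputs printed in this report would not be reproduced exactly.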

First, we let the random effect act on the intercept. Like the residuals, a random effect
has a mean of 0 and some standard deviation, so every location gets its own deviation from
the country mean. We plot the data to visually check that the random effect factor has been
incorporated correctly.

Intercept<-(Country_N+Random_EF)
Slope<-(Country_Int)
Residuals_2<-rnorm(length(Season_N), 0, 256)
Species_2<-Intercept+Slope*Season_N+Residuals_2
xyplot(Species_2~Season_N|Country, type=c("p", "r"), groups=Locations,
main="Microbial Species in Different Seasons")
The plot output was as expected, where the slopes seem to be quite unaffected while the
intercept of each location is different.

Now we use a mixed model to analyse the data.

library(lme4)
## Warning: package 'lme4' was built under R version 3.2.5
## Loading required package: Matrix
MModel1<-lmer(Species_2~Country*Season_N+(1+Season_N|Locations))
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 1.11185
## (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
summary(MModel1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Species_2 ~ Country * Season_N + (1 + Season_N | Locations)
##
## REML criterion at convergence: 562.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9904 -0.7119 -0.1293 0.6226 2.0829
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Locations (Intercept) 1.225e+05 349.9772
## Season_N 1.258e-02 0.1122 -1.00
## Residual 9.716e+04 311.7090
## Number of obs: 40, groups: Locations, 10
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 6.584e+03 1.865e+02 35.30
## CountryChina -4.068e+03 2.112e+02 -19.26
## Season_N 1.101e+00 6.793e-02 16.21
## CountryChina:Season_N 8.202e-01 8.193e-02 10.01
##
## Correlation of Fixed Effects:
## (Intr) CntryC Sesn_N
## CountryChin -0.566
## Season_N -0.912 0.533
## CntryCh:S_N 0.501 -0.884 -0.603
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## convergence code: 0
## Model failed to converge with max|grad| = 1.11185 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
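In the output examination, we see that the random effect factor does affect the intercept
(a standard deviation of about 350 between locations). The fixed-effect intercept (6584)
resembles the original mean for USA (6839.1), while the China offset (-4068, against an
expected 2279.7 - 6839.1 = -4559.4) and the slopes are estimated less precisely. Because the
model reported convergence warnings and a random-effect correlation of -1.00, these estimates
should be read with caution.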
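
The warnings suggest rescaling the predictor. A minimal sketch of that idea follows, under two assumptions that differ from the fit above: Season_N is standardised, and only a random intercept is fitted, since the simulation adds the location effect to the intercept only. The location layout (A to E in the USA, F to J in China) is also an assumption, and `Season_Z`, `location_effect`, and `MModel_Z` are names introduced here for illustration. The model is fitted only if lme4 is installed.

```r
Country <- gl(2, 20, labels = c("USA", "China"))
Country_N <- rep(c(6839.1, 2279.7), each = 20)
Country_Int <- rep(c(1, 2), each = 20)
Season_N <- rep(c(4231.6, 2189.3, 1695.1, 1002.8), each = 5, times = 2)

# Assumed layout: 5 locations per country, each seen once per season.
Locations <- factor(c(rep(LETTERS[1:5], times = 4),
                      rep(LETTERS[6:10], times = 4)))

set.seed(362)
location_effect <- rnorm(10, 0, 256)          # one deviation per location
Species_2 <- Country_N + location_effect[as.integer(Locations)] +
  Country_Int * Season_N + rnorm(40, 0, 256)

# Standardise the numeric predictor: mean 0, standard deviation 1.
Season_Z <- as.numeric(scale(Season_N))

if (requireNamespace("lme4", quietly = TRUE)) {
  # Random intercept per location only.
  MModel_Z <- lme4::lmer(Species_2 ~ Country * Season_Z + (1 | Locations))
  print(lme4::fixef(MModel_Z))
}
```

After standardising, the slope coefficients are expressed per standard deviation of Season_N rather than per species, so they are not directly comparable with the estimates printed above.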

We can run the diagnostics on this model. First, we check homogeneity of variance:

plot(residuals(MModel1)~fitted(MModel1))
abline(h=0)

Here we see that variance is visually homogeneous along the line.

Second, we check that the residuals follow a normal distribution:

library(car)
qqPlot(residuals(MModel1))

Here we see that the residuals follow a normal distribution.

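Third, we inspect the estimated random effects per location (a caterpillar plot, in which
the ordered estimates typically trace a sigmoid curve):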

library(lattice)
dotplot(ranef(MModel1, condVar = TRUE))
## $Locations

We can see that the Locations only had an effect on the Intercept, as we wanted.

We can now choose our best model between the ANOVA and the mixed model by running an
AIC test.

AIC(anova.model, MModel1)
## df AIC
## anova.model 9 565.2218
## MModel1 8 578.5030

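The output shows that the ANOVA model has the lower AIC score (565.2 against 578.5), so we
select it as our best model. Note that the df column counts estimated parameters (9 against
8), so the ANOVA model is not the more parsimonious one. The comparison should also be read
with caution, because the two models were fitted to different simulated responses (Species_1
and Species_2) and the mixed model was fitted by REML, so their AIC values are not strictly
comparable.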

Conclusions
Our best model is the ANOVA including the interaction term. It looks like we need to start
taking over the world from the United States during Spring, because Spring has the highest
overall number of species and the United States has a higher number of species than China.
