
Which test when

Floris van Beers


October 2015

1   Exploratory Tests

1.1 One variable

Variable      Representation     R command
Binomial      Frequency table    table(var)
Categorical   Frequency table    table(var)
Continuous    Boxplot            boxplot(var)

Use frequency tables to determine ratios. R-command: table(var)/length(var)


Use boxplots to check the assumption of normality. A box with equal sides and
equal whiskers suggests a roughly normal (symmetric) distribution. A longer top
whisker means the high values are more spread out, i.e. the data are skewed to
the right; a longer bottom whisker means the low values are more spread out,
i.e. the data are skewed to the left.

1.2 Multiple variables

Variable 1    Variable 2    Representation    R command
Binomial      Binomial      table             table(var1, var2)
Binomial      Categorical   table             table(var1, var2)
Binomial      Continuous    boxplot           boxplot(continuous ~ binomial)
Categorical   Categorical   table             table(var1, var2)
Categorical   Continuous    boxplot           boxplot(continuous ~ categorical)
Continuous    Continuous    scatterplot       plot(var1, var2)

2   Formal Tests

2.1 Single Variable

True proportion
To calculate a confidence interval for the true proportion within a population,
use the prop.test command as follows:
prop.test(x, n, conf.level=), where x is the number of cases, n is the size of
the population and conf.level is the desired confidence level.
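
A minimal sketch with made-up numbers (40 cases out of a population of 160; both values are hypothetical):

    # 95% confidence interval for the true proportion, given 40 cases in 160
    prop.test(40, 160, conf.level = 0.95)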


Significance test for a population proportion
To test whether the proportion in a population is significantly different from a
given proportion, use the prop.test command as follows:
prop.test(x, n, p=), where x is the number of cases, n is the size of the
population and p is the proportion under the null hypothesis. If the p-value
given by this test is below 0.05, the null hypothesis can be rejected, and the
given proportion is significantly different from the actual proportion.
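
A minimal sketch with the same made-up counts, testing against a hypothesized proportion of 0.3 (all values hypothetical):

    # is the observed proportion 40/160 = 0.25 significantly different from 0.3?
    prop.test(40, 160, p = 0.3)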

2.2 Multiple Variables

Difference in means (t.test)

To calculate the true difference in means between two sets, use the t.test
command as follows:
t.test(x, y, var.equal=T/F, conf.level=), where x and y are vectors of numbers,
T/F is either TRUE or FALSE and conf.level is the confidence level.
This test estimates the difference in means within the given confidence
interval. The parameter var.equal can be set to either TRUE or FALSE based on
whether the variance/distribution in the two sets is equal (or at least
comparable). The default is FALSE.
If the values of both sets are connected in some logical way, such as the left
foot and right foot of the same person, use the parameter paired=TRUE to
perform a paired t.test.
If the distribution is non-normal or there are a number of outliers in one or
both of the sets, use a Box-Cox-style transformation: transform the data with a
square root or logarithm, e.g. x <- sqrt(x) or x <- log(x).
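
A minimal sketch with two made-up samples (all numbers hypothetical):

    x <- c(5.1, 4.9, 6.2, 5.8, 5.5)
    y <- c(4.2, 4.8, 5.0, 4.4, 4.6)
    t.test(x, y, var.equal = FALSE, conf.level = 0.95)  # Welch two-sample t-test
    t.test(x, y, paired = TRUE)                         # if the values are paired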
Two-sample test of proportion
To see if two proportions are truly different, use the prop.test command as
follows:
prop.test(x, n), where x is a vector of case counts and n is the vector of
corresponding population sizes. The vectors x and n have to be the same length.
The null hypothesis is that the proportions are equal, and it is rejected with
a p-value lower than 0.05. If the null hypothesis is rejected, the proportions
are non-equal.
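
A minimal sketch with made-up counts for two groups (all numbers hypothetical):

    # 40 cases out of 160 in group A versus 30 out of 150 in group B
    prop.test(c(40, 30), c(160, 150))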
Wilcoxon Rank-Sum test (Lecture 3ab)
If you want to use a t-test to compare means, but the data is nowhere near
normally distributed, use the wilcox.test command as follows:
wilcox.test(x, y), where x and y are the sets to be compared. The null
hypothesis is that the sets, having the same distribution, are shifted by
µ = 0, in other words not shifted. If the p-value is lower than 0.05, the null
hypothesis is rejected, which means µ is non-zero and there is a difference in
location between the sets.
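
A minimal sketch with two made-up, clearly non-normal samples:

    x <- c(1, 2, 3, 4, 50)    # heavy outlier, far from normal
    y <- c(5, 6, 7, 8, 90)
    wilcox.test(x, y)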
Simple linear regression
To see if a single covariate and a response variable have a linear correlation,
use a simple linear model. Some useful functions:

What to do                   Function
Make the model               model.lm <- lm(response ~ covariate)
Model parameters/p-values    summary(model.lm)
Check assumptions            plot(model.lm)

If a p-value given by the summary command is above 0.05, the null hypothesis
that the corresponding coefficient is 0 cannot be rejected, and thus that
coefficient is irrelevant.
The adjusted R-squared determines what percentage of the variance is explained
by the model, so it indicates how well the model fits.
The assumptions of a linear model and how to check them are:
1. There is linearity in the data. This assumption holds if there is a roughly
straight line through the residuals vs. fitted values plot, the first plot
given by the plot(model.lm) function.
2. The errors are normally distributed. This assumption holds if the points lie
on a straight line in the Q-Q (quantiles) plot, the second plot given by the
plot(model.lm) function.
3. There is equal variance in the error. This assumption holds if there is a
roughly horizontal line through the standardized residuals vs. fitted values
plot, the third plot given by the plot(model.lm) function.
NOTES:
If you do not want to include the intercept, change the function to:
model.lm <- lm(response ~ -1 + covariate)
If you do not want to attach the dataset, add the data argument:
model.lm <- lm(response ~ covariate, data=yourdata)
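
A minimal worked sketch with simulated data (all numbers are made up for illustration):

    set.seed(1)
    covariate <- runif(50, 0, 10)
    response <- 2 + 3 * covariate + rnorm(50)   # true line plus noise
    model.lm <- lm(response ~ covariate)
    summary(model.lm)      # coefficients, p-values, adjusted R-squared
    par(mfrow = c(2, 2))
    plot(model.lm)         # diagnostic plots for the three assumptions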
Multiple regression
To see if a set of n covariates and a response variable have a certain
correlation, use a multiple regression model. Note that the covariates can also
enter as quadratic, cubic, etc. terms. Some useful functions:
What to do                   Function
Make the model               model.lm <- lm(response ~ covariate1 + covariate2)
Model parameters/p-values    summary(model.lm)
Check assumptions            plot(model.lm)
The interpretation of the p-values, the adjusted R-squared and the three model
assumptions, as well as the notes on dropping the intercept and on the data=
argument, are the same as for simple linear regression above; simply list all
covariates in the formula.
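
A minimal sketch, assuming a data frame yourdata with hypothetical columns response, covariate1 and covariate2; note that polynomial terms must be wrapped in I() inside an R formula:

    model.lm <- lm(response ~ covariate1 + I(covariate1^2) + covariate2,
                   data = yourdata)
    summary(model.lm)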
ANOVA
If you need to find the difference in means over a certain categorical
variable, use the one-way ANOVA method. Some useful functions:
What to do                          Function
Different categories into 1 list    d <- stack(list(name1=cat1, name2=cat2, name3=cat3))
Anova through oneway test           oneway.test(values ~ ind, data = d)   (optional: var.equal)
Anova through aov()                 model.aov <- aov(values ~ ind, data = d)
Anova through modelling             model.lm <- lm(values ~ ind, data = d)
Model parameters/p-values           summary(model.lm)
Differences between groups          TukeyHSD(model.aov)
The aov() function makes an aov object, which is noted as model.aov above.
This object can be used for Tukey's Honest Significant Differences test, or
TukeyHSD, which determines the true differences between each combination of
groups.
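
A minimal sketch with three made-up groups (all numbers hypothetical):

    cat1 <- c(5.1, 5.5, 6.0, 5.8)
    cat2 <- c(4.2, 4.6, 4.9, 4.4)
    cat3 <- c(6.5, 6.8, 7.1, 6.9)
    d <- stack(list(low = cat1, mid = cat2, high = cat3))
    oneway.test(values ~ ind, data = d)       # one-way ANOVA (Welch by default)
    model.aov <- aov(values ~ ind, data = d)
    TukeyHSD(model.aov)                       # pairwise group differences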
ANCOVA
If you need to know the impact of a categorical variable while modelling over a
continuous variable, use the ANCOVA method. Some useful functions:

What to do                     Function
Make the model                 model.lm <- lm(response ~ categorical.var + continuous.var)
Model parameters/p-values      summary(model.lm)
Check assumptions              plot(model.lm)
Make a variable categorical    as.factor()
The output of the ANCOVA method is virtually the same as that of a linear model, except that one of the explanatory variables is categorical.
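
A minimal sketch, assuming a data frame yourdata with hypothetical columns response, group and continuous.var:

    yourdata$group <- as.factor(yourdata$group)   # make sure the variable is categorical
    model.lm <- lm(response ~ group + continuous.var, data = yourdata)
    summary(model.lm)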
Interaction
If two variables do not simply have an additive effect, then interaction may be
applicable. To include interaction in the model use the following commands:
What to do                                              Function
Add var1, var2 and var3 and all possible interactions   lm(response ~ var1 * var2 * var3)
Add var1, var2, var3 and the var1:var2 interaction      lm(response ~ var1 + var2 + var3 + var1:var2)
Check assumptions                                       plot(model.lm)
Treat a model with interactions the same way as a model of the same kind
without interactions.
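
A minimal sketch, assuming hypothetical columns var1 and var2 in yourdata:

    model.lm <- lm(response ~ var1 + var2 + var1:var2, data = yourdata)
    # equivalent shorthand: lm(response ~ var1 * var2, data = yourdata)
    summary(model.lm)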
Logistic regression
To see if a set of n covariates and a binary response variable have a certain
correlation, use a logistic regression model. Note that the covariates can also
enter as quadratic, cubic, etc. terms. Some useful functions:

What to do                   Function
Make the model               model.glm <- glm(binary.response ~ covariate1 + covariate2, family=binomial)
Model parameters/p-values    summary(model.glm)
Check assumptions            plot(model.glm)

If a p-value given by the summary command is above 0.05, the null hypothesis
that the corresponding coefficient is 0 cannot be rejected, and thus that
coefficient is irrelevant.
The assumptions for a logistic regression model are:
1. The response variable is binary.
2. The response variable is coded such that 1 describes an event and 0
describes a non-event.
3. There should be no over-fitting or under-fitting.
4. The variables should be independent, or the model should account for
interaction.
The assumptions of a linear model are not applicable to a logistic regression
model.
NOTES:
If you do not want to include the intercept, change the function to:
model.glm <- glm(binary.response ~ -1 + covariate1 + covariate2, family=binomial)
If you do not want to attach the dataset, add the data argument:
model.glm <- glm(binary.response ~ covariate1 + covariate2, family=binomial, data=yourdata)
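
A minimal worked sketch with simulated data (everything made up for illustration):

    set.seed(1)
    covariate1 <- rnorm(100)
    covariate2 <- rnorm(100)
    p <- plogis(-0.5 + 1.2 * covariate1)     # true event probability, simulation only
    binary.response <- rbinom(100, 1, p)     # 1 = event, 0 = non-event
    model.glm <- glm(binary.response ~ covariate1 + covariate2, family = binomial)
    summary(model.glm)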
