1 Exploratory Tests

1.1 One variable

Variable      Representation      R command
Binomial      frequency table     table(var)
Categorical   frequency table     table(var)
Continuous    boxplot             boxplot(var)
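The table above can be sketched with some made-up data (the variable names below are illustrative, not from the original notes):

```r
# One-variable summaries: frequency tables for discrete variables,
# a boxplot for a continuous one.
smoker <- c("yes", "no", "no", "yes", "no")        # binomial variable
colour <- c("red", "blue", "red", "green")         # categorical variable
height <- c(1.62, 1.75, 1.80, 1.68, 1.71)          # continuous variable

table(smoker)       # counts per level of a binomial variable
table(colour)       # counts per level of a categorical variable
boxplot(height)     # distribution of a continuous variable
```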
1.2 Multiple variables

Variable 1    Variable 2    Representation    R command
Binomial      Binomial      table             table(var1, var2)
Binomial      Categorical   table             table(var1, var2)
Binomial      Continuous    boxplot           boxplot(continuous ~ binomial)
Categorical   Categorical   table             table(var1, var2)
Categorical   Continuous    boxplot           boxplot(continuous ~ categorical)
Continuous    Continuous    scatterplot       plot(var1, var2)
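A sketch of the two-variable summaries above, again with made-up, illustrative data:

```r
# Two-variable summaries: a cross table for two discrete variables,
# a boxplot for continuous-by-discrete, a scatterplot for two
# continuous variables.
smoker <- factor(c("yes", "no", "no", "yes", "no", "yes"))
sex    <- factor(c("m", "f", "m", "f", "m", "f"))
height <- c(1.62, 1.75, 1.80, 1.68, 1.71, 1.66)
weight <- c(55, 70, 82, 61, 67, 58)

table(smoker, sex)          # cross table of two discrete variables
boxplot(height ~ smoker)    # continuous variable split by a binomial one
plot(height, weight)        # scatterplot of two continuous variables
```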
2 Formal Tests

2.1 Single Variable
True proportion
To calculate a confidence interval for the true proportion within a population, use the prop.test command: prop.test(x, n, conf.level = 0.95), where x is the number of cases, n is the size of the sample, and conf.level sets the desired confidence level (0.95 by default).
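For example (the counts below are made up), a 95% confidence interval for a proportion of 40 cases in a sample of 100:

```r
# Confidence interval for a true proportion with prop.test
res <- prop.test(x = 40, n = 100, conf.level = 0.95)
res$estimate   # the sample proportion, 0.4
res$conf.int   # the 95% confidence interval for the true proportion
```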
2.2 Multiple Variables

Linear regression
To see if a single covariate and a response variable are related, use a linear model. Some useful functions:

What to do                    Function
Make the model                model.lm <- lm(response ~ covariate)
Model parameters/p-values     summary(model.lm)
Check assumptions             plot(model.lm)
If the p-value that the summary command reports for a coefficient is above 0.05, the null hypothesis that the coefficient is 0 cannot be rejected, so there is no evidence that the covariate contributes to the model.
The adjusted R-squared gives the proportion of the variance that is explained by the model, so it measures how well the model fits.
The assumptions of a linear model, and how to check them, are:
1. The relationship is linear. This holds if the line through the residuals vs. fitted values plot (the first plot given by plot(model.lm)) is roughly flat, with no clear pattern in the residuals.
2. The errors are normally distributed. This holds if the points lie on a straight line in the normal Q-Q (quantile) plot, the second plot given by plot(model.lm).
3. The errors have equal variance. This holds if the line through the scale-location plot (standardized residuals vs. fitted values, the third plot given by plot(model.lm)) is roughly horizontal and the points are spread evenly around it.
NOTES:
If you do not want to include the intercept, change the call to:
model.lm <- lm(response ~ -1 + covariate)
If you do not want to attach the dataset, add the data argument:
model.lm <- lm(response ~ covariate, data = yourdata)
Multiple regression
To see if a set of n covariates and a response variable are related, use a multiple regression model. Note that the covariates can also enter as quadratic, cubic, etc. terms. Some useful functions:

What to do                    Function
Make the model                model.lm <- lm(response ~ covariate1 + covariate2)
Model parameters/p-values     summary(model.lm)
Check assumptions             plot(model.lm)
The p-values, the adjusted R-squared, and the three model assumptions are interpreted and checked exactly as for the linear model with a single covariate above.
NOTES:
If you do not want to include the intercept, change the call to:
model.lm <- lm(response ~ -1 + covariate1 + covariate2)
If you do not want to attach the dataset, add the data argument:
model.lm <- lm(response ~ covariate1 + covariate2, data = yourdata)
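A sketch of a multiple regression on simulated data, including a quadratic term via I() (the variable names and true coefficients are made up):

```r
# Multiple regression with two covariates, one entering quadratically
set.seed(2)
covariate1 <- runif(50, 0, 10)
covariate2 <- runif(50, 0, 10)
response   <- 1 + 2 * covariate1 - 0.3 * covariate2^2 + rnorm(50)

model.lm <- lm(response ~ covariate1 + I(covariate2^2))
summary(model.lm)   # both coefficients should be clearly significant
```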
ANOVA
If you need to test for differences in means across the levels of a categorical variable, use the one-way ANOVA method.
Some useful functions:

What to do                        Function
Stack categories into one frame   d <- stack(list(name1 = cat1, name2 = cat2, name3 = cat3))
ANOVA through oneway test         oneway.test(values ~ ind, data = d)   (optional: var.equal)
ANOVA through aov()               model.aov <- aov(values ~ ind, data = d)
ANOVA through modelling           model.lm <- lm(values ~ ind, data = d)
Model parameters/p-values         summary(model.lm)
Differences between groups        TukeyHSD(model.aov)
The aov() function creates an aov object, denoted model.aov above. This object can be passed to Tukey's Honest Significant Differences test, TukeyHSD(), which estimates the difference in means between each pair of groups.
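A sketch of the workflow above on simulated groups (the group names and means are made up; the third group is deliberately shifted):

```r
# One-way ANOVA: three groups, the third with a different mean
set.seed(3)
cat1 <- rnorm(20, mean = 5)
cat2 <- rnorm(20, mean = 5)
cat3 <- rnorm(20, mean = 8)

d <- stack(list(name1 = cat1, name2 = cat2, name3 = cat3))
oneway.test(values ~ ind, data = d)      # Welch's one-way test
model.aov <- aov(values ~ ind, data = d)
summary(model.aov)                       # overall F-test
TukeyHSD(model.aov)                      # pairwise group differences
```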
ANCOVA
If you need to know the impact of a categorical variable while also modelling a continuous covariate, use the ANCOVA method.
Some useful functions:

What to do                      Function
Make the model                  model.lm <- lm(response ~ categorical.var + continuous.var)
Model parameters/p-values       summary(model.lm)
Check assumptions               plot(model.lm)
Make a variable categorical     as.factor()

The output of the ANCOVA method is virtually the same as that of a linear model, except that one of the explanatory variables is categorical.
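A sketch of an ANCOVA on simulated data: a continuous covariate plus a two-level categorical variable with a fixed group offset (all names and values below are made up):

```r
# ANCOVA: continuous covariate plus a categorical group effect
set.seed(4)
continuous.var  <- runif(40, 0, 10)
categorical.var <- as.factor(rep(c("A", "B"), each = 20))
response <- 2 * continuous.var +
  ifelse(categorical.var == "B", 3, 0) + rnorm(40)

model.lm <- lm(response ~ categorical.var + continuous.var)
summary(model.lm)   # the categorical.varB row estimates the group offset
```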
Interaction
If two variables do not simply have an additive effect, then interaction may be
applicable. To include interaction in the model use the following commands:
What to do                                            Function
Add var1, var2, var3 and all possible interactions    lm(response ~ var1 * var2 * var3)
Add var1, var2, var3 and the var1:var2 interaction    lm(response ~ var1 + var2 + var3 + var1:var2)
Check assumptions                                     plot(model.lm)
Treat a model with interactions the same way as a model of the same kind
without interactions.
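A sketch of a model with a single interaction term on simulated data (names and true coefficients are made up):

```r
# Linear model with an interaction between two covariates
set.seed(5)
var1 <- runif(60, 0, 5)
var2 <- runif(60, 0, 5)
response <- var1 + var2 + 2 * var1 * var2 + rnorm(60)

model.lm <- lm(response ~ var1 + var2 + var1:var2)
summary(model.lm)   # the var1:var2 row tests the interaction effect
```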
Logistic regression
To see if a set of n covariates and a binary response variable are related, use a logistic regression model. Note that the covariates can also enter as quadratic, cubic, etc. terms. Some useful functions:

What to do                    Function
Make the model                model.glm <- glm(binary.response ~ covariate1 + covariate2, family = binomial)
Model parameters/p-values     summary(model.glm)
Check assumptions             plot(model.glm)
If the p-value that the summary command reports for a coefficient is above 0.05, the null hypothesis that the coefficient is 0 cannot be rejected, so there is no evidence that the covariate contributes to the model.
The assumptions for a logistic regression model are:
1. The response variable is binary.
2. The response variable is coded such that 1 describes an event and 0 describes a non-event.
3. There should be no over-fitting or under-fitting.
4. The variables should be independent, or the model should account for interaction.
The assumptions of a linear model are not applicable to a logistic regression model.
NOTES:
If you do not want to include the intercept, change the call to:
model.glm <- glm(binary.response ~ -1 + covariate1 + covariate2, family = binomial)
If you do not want to attach the dataset, add the data argument:
model.glm <- glm(binary.response ~ covariate1 + covariate2, data = yourdata, family = binomial)
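A sketch of a logistic regression on simulated data, with the binary response generated from known log-odds (all names and coefficients below are made up):

```r
# Logistic regression on a simulated binary response
set.seed(6)
covariate1 <- runif(200, -2, 2)
covariate2 <- runif(200, -2, 2)
p <- plogis(-0.5 + 1.5 * covariate1 + 0.8 * covariate2)  # true probabilities
binary.response <- rbinom(200, size = 1, prob = p)       # 1 = event, 0 = non-event

model.glm <- glm(binary.response ~ covariate1 + covariate2,
                 family = binomial)
summary(model.glm)   # coefficients are on the log-odds scale
```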