Using R
Statistics course for first-year PhD students
Session 2
[Diagram: Population --sampling--> Sample --estimation--> Statistical Model --testing--> Population (Uncertainty!!!)]
Key concepts: Session 1
Key quantities
residual = yᵢ − mean

deviance (SS) = Σ(yᵢ − mean)²

var = Σ(yᵢ − mean)² / (n − 1)
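These quantities map directly onto a few lines of R; a minimal sketch with a made-up vector y (the data are hypothetical):

```r
# Hypothetical sample
y <- c(4, 7, 5, 9, 6)
n <- length(y)

m   <- mean(y)        # mean
res <- y - m          # residuals: y_i - mean
SS  <- sum(res^2)     # deviance (sum of squares)
v   <- SS / (n - 1)   # variance

stopifnot(isTRUE(all.equal(v, var(y))))  # matches R's built-in var()
```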
Hypothesis testing
Correlation
In probability theory and statistics, correlation (often measured as a
correlation coefficient) indicates the strength and direction of a linear
relationship between two random variables. In general statistical usage,
correlation refers to the departure of two variables from independence.
The t test
Assumptions
• Independence of cases (work with true replications!!!) - this is a requirement
of the design.
• Normality - the distributions in each of the groups are normal
• Homogeneity of variances - the variance of the data in each group should be the
same (use Fisher's F test or the Fligner–Killeen test for homogeneity of variances).
• These together form the common assumption that the errors are
independently, identically, and normally distributed
Normality
Before we can carry out a test assuming normality of the data we
need to test our distribution (not always before!!!)

Graphical analysis:
• histogram with a density curve: hist(y); lines(density(y))
• normal Q-Q plot (observed vs. norm quantiles): qqnorm(y), or qq.plot(y) from library(car)

In many cases we must check this assumption after fitting the model
(e.g. regression or multifactorial ANOVA): RESIDUALS MUST BE NORMAL

Tests for normality:
• Shapiro-Wilk Normality Test: shapiro.test()
• Skew + kurtosis (t test)

[Figure: histogram of mass with density curve; normal Q-Q plot of observed vs. norm quantiles]
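The whole check in R, on simulated data (the dataset and seed are illustrative assumptions):

```r
set.seed(1)                       # illustrative data, drawn from a normal
y <- rnorm(100, mean = 5, sd = 2)

hist(y, freq = FALSE)             # graphical check: histogram...
lines(density(y))                 # ...with a density curve
qqnorm(y); qqline(y)              # normal Q-Q plot (base R)
# library(car); qq.plot(y)        # car's version (named qqPlot in recent releases)

shapiro.test(y)                   # Shapiro-Wilk normality test
```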
Normality: Histogram and Q-Q Plot
[Figure: histograms and Q-Q plots of fishes$mass and log(fishes$mass); the raw masses are right-skewed, while the log-transformed masses are approximately normal.]
# Galton board (quincunx): 2000 balls accumulate into a bell-shaped histogram
library(animation)
ani.options(nmax = 2000 + 15 - 2, interval = 0.003)
freq <- quincunx(balls = 2000, col.balls = rainbow(1))
# frequency table
barplot(freq, space = 0)
Normality: Quantile-Quantile Plot

[Figure: Q-Q plots of mass and fishes$logmass]

2. Data transformation
• Logarithmic (skewed data)
• Square-root
• Arcsin (percentage)
• Probit (proportion)
• Box-Cox transformation
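A sketch of the most common case, log-transforming right-skewed data; the fish-mass-like data are simulated, and MASS::boxcox is one standard way to choose a Box-Cox power:

```r
set.seed(2)
mass <- rlnorm(100, meanlog = 1, sdlog = 0.6)  # hypothetical skewed masses

shapiro.test(mass)$p.value        # raw scale: normality rejected
shapiro.test(log(mass))$p.value   # log scale: approximately normal

library(MASS)
boxcox(lm(mass ~ 1))              # profile likelihood for the Box-Cox power
```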
Homogeneity of variance: two samples
F <- var(A) / var(B)             # F calculated
qf(0.975, nA - 1, nB - 1)        # F critical
bartlett.test(response ~ factor)
fligner.test(response ~ factor)
There are differences between the tests: Fisher and Bartlett are
very sensitive to outliers, whereas Fligner–Killeen is not
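The two-sample comparison in practice; A and B are hypothetical samples, and var.test() packages the same F ratio in one call:

```r
A <- c(5.1, 4.8, 6.0, 5.5, 5.9, 4.7)   # hypothetical samples
B <- c(5.0, 6.5, 3.9, 7.1, 4.2, 6.8)

F <- var(A) / var(B)                        # F calculated
qf(0.975, length(A) - 1, length(B) - 1)     # F critical
var.test(A, B)                              # Fisher's F test in one call

# With a grouping factor (also works for > 2 groups):
response <- c(A, B)
group    <- factor(rep(c("A", "B"), each = 6))
bartlett.test(response ~ group)
fligner.test(response ~ group)              # robust to outliers
```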
Mean comparison
mean_a = Σyᵢ / n_a        mean_b = Σyᵢ / n_b

SE_diff = √(var_a / n_a + var_b / n_b)

t = (mean_a − mean_b) / SE_diff

The test can be carried out with the t.test() function
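The formula and the function give the same statistic; a check on hypothetical data (by default t.test() computes this Welch version, which does not pool the variances):

```r
a <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)   # hypothetical samples
b <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)

SEdiff <- sqrt(var(a) / length(a) + var(b) / length(b))
t_calc <- (mean(a) - mean(b)) / SEdiff

out <- t.test(a, b)
stopifnot(isTRUE(all.equal(unname(out$statistic), t_calc)))
```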
Mean comparison: t test for paired samples
Time 1, a: 1, 2, 3, 2, 3, 2, 2
Time 2, b: 1, 2, 1, 1, 5, 1, 2

t = (Σ(aᵢ − bᵢ) / n) / (SD_diff / √n)

If we have information about dependence,
we have to use it!!!
We can deal with dependence: the test can be carried
out with the t.test() function (paired = TRUE)
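With the data from the slide, the hand formula and t.test(paired = TRUE) agree:

```r
a <- c(1, 2, 3, 2, 3, 2, 2)   # Time 1
b <- c(1, 2, 1, 1, 5, 1, 2)   # Time 2

d <- a - b
t_calc <- mean(d) / (sd(d) / sqrt(length(d)))

out <- t.test(a, b, paired = TRUE)
stopifnot(isTRUE(all.equal(unname(out$statistic), t_calc)))
```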
Mean comparison: Wilcoxon
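The slide gives only the name; a minimal sketch of the nonparametric alternative with wilcox.test(), reusing the paired-sample data above (with ties, R falls back from exact to approximate p-values and warns):

```r
a <- c(1, 2, 3, 2, 3, 2, 2)
b <- c(1, 2, 1, 1, 5, 1, 2)

wilcox.test(a, b)                  # two-sample (Mann-Whitney)
wilcox.test(a, b, paired = TRUE)   # signed-rank test for paired data
```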
Which is the response variable in a correlation analysis? NONE.

Sampling unit | Bird species richness | Plant species richness
1             | x1                    | l1
2             | x2                    | l2
3             | x3                    | l3
4             | x4                    | l4
…             | …                     | …
458           | x458                  | l458
Correlation
Plot the two variables in a Cartesian space
[Scatterplot: r = 0.816]
Parametric correlation: when is it significant?

Correlation coefficient (x and y as deviations from their means):

cor = Σ(xy) / √(Σx² · Σy²)

SE_cor = √((1 − cor²) / (n − 2))

cor.spearman = Σ(rank_x · rank_y) / √(Σrank_x² · Σrank_y²)   (ranks as deviations from their means)

cor.kendall = 1 − 4P / (n(n − 1)),  where P is the number of discordant pairs
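All three coefficients are available through cor.test(); the site data below are hypothetical (ties make R fall back to approximate p-values for the rank methods):

```r
x <- c(2, 4, 5, 7, 9, 10, 12)   # hypothetical richness at 7 sites
y <- c(1, 3, 4, 8, 8, 11, 13)

cor(x, y)                            # Pearson coefficient
cor.test(x, y)                       # significance via a t test on n - 2 df
cor.test(x, y, method = "spearman")  # rank-based
cor.test(x, y, method = "kendall")
```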
Issues related to correlation

1. Temporal autocorrelation: values in close years are more similar,
i.e. dependence of the data. Handled with autoregressive models
(not covered!)

2. Spatial autocorrelation: values in close sites are more similar,
i.e. dependence of the data. Moran's I or Geary's C:
measures of global spatial autocorrelation

[Figure: spatial patterns with Moran's I = 0 vs. Moran's I = 1]
Three issues related to correlation
Why bootstrap?
[Diagram: from the sample, draw N samples with replacement; the statistic computed on each resample forms the bootstrap distribution, whose quantiles give the uncertainty.]
Estimate correlation with bootstrap
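A base-R sketch of the scheme above: resample (x, y) pairs with replacement, recompute the correlation each time, and read a confidence interval off the quantiles. The data, seed, and N are illustrative:

```r
set.seed(3)
x <- c(2, 4, 5, 7, 9, 10, 12)   # hypothetical paired observations
y <- c(1, 3, 4, 8, 8, 11, 13)

N <- 2000
boot.cor <- replicate(N, {
  i <- sample(seq_along(x), replace = TRUE)  # resample pairs, not x and y separately
  cor(x[i], y[i])
})

# 95% bootstrap confidence interval (na.rm guards against degenerate resamples)
quantile(boot.cor, c(0.025, 0.975), na.rm = TRUE)
```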
Count data:
data where we count how many times something
happened, but we have no way of knowing how often it did
not happen (e.g. the number of students coming to the first
lesson)
Contingency table (observed counts, Oi):

                  | Oak | Beech | Row total (Ri)
With ants         |  22 |    30 |  52
Without ants      |  31 |    18 |  49
Column total (Ci) |  53 |    48 | 101 (G)
Count data: contingency tables
- G test
1. We need a model to define the expected frequencies (E)
(many possibilities), e.g. perfect independence:

Ei = (Ri × Ci) / G

G = 2 Σ Oi ln(Oi / Ei),  compared against the χ² distribution
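The ant-tree table worked through in base R, with the symbols following the slide:

```r
O <- matrix(c(22, 31,    # Oak:  with ants, without ants
              30, 18),   # Beech: with ants, without ants
            nrow = 2,
            dimnames = list(c("with ants", "without ants"),
                            c("Oak", "Beech")))

Ri    <- rowSums(O)           # row totals
Ci    <- colSums(O)           # column totals
G.tot <- sum(O)               # grand total (G = 101)

E <- outer(Ri, Ci) / G.tot    # expected counts under independence
G <- 2 * sum(O * log(O / E))  # G statistic
pchisq(G, df = 1, lower.tail = FALSE)  # compare against chi-squared, 1 df
```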
Proportion data have three important properties that affect the way
the data should be analyzed:
• the data are strictly bounded (0-1);
• the variance is non-constant (it depends on the mean);
• errors are non-normal.
Probit transformation
The probit transformation takes care of the non-linearity
2 approaches
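Both routes in R, on hypothetical dose-response counts: the classical probit transformation (qnorm() of the proportions) followed by a linear model, and the binomial GLM with a probit link that is usually preferred today:

```r
dose <- c(1, 2, 3, 4, 5, 6)      # hypothetical dose levels
germ <- c(2, 5, 9, 14, 18, 19)   # germinated seeds out of n sown
n    <- rep(20, 6)

# 1. Transform, then ordinary linear model
p <- germ / n
lm(qnorm(p) ~ dose)              # qnorm() is the probit transformation

# 2. Model the proportions directly
glm(cbind(germ, n - germ) ~ dose, family = binomial(link = "probit"))
```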
The best model is the model that produces the least unexplained
variation (the minimal residual deviance), subject to the
constraint that all the parameters in the model should be
statistically significant (many ways to reach this!)
deviance (SS) = Σ(yᵢ − mean)²
Statistical modelling
1. Multicollinearity
Correlation between predictors in non-orthogonal multiple linear
models
Confounding effects difficult to separate
Occam’s Razor
Statistical modelling
Occam’s Razor
• Models should have as few parameters as possible;
• linear models should be preferred to non-linear models;
• experiments relying on few assumptions should be preferred to those
relying on many;
• models should be pared down until they are minimal adequate;
• simple explanations should be preferred to complex explanations.
MODEL SIMPLIFICATION
Model simplification
• remove non-significant interaction terms;
• remove non-significant quadratic or other non-linear terms;
• remove non-significant explanatory variables;
• group together factor levels that do not differ from one another;
• in ANCOVA, set non-significant slopes of continuous
explanatory variables to zero.
Statistical modelling: model simplification
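A minimal sketch of the simplification loop with update(); the data are simulated so that x2 and the interaction really are noise, which is an assumption of the example, not something the procedure knows in advance:

```r
set.seed(4)
x1 <- runif(50)
x2 <- runif(50)
y  <- 2 + 3 * x1 + rnorm(50)     # x2 and x1:x2 contribute nothing

m1 <- lm(y ~ x1 * x2)            # maximal model
m2 <- update(m1, . ~ . - x1:x2)  # remove the non-significant interaction
m3 <- update(m2, . ~ . - x2)     # remove the non-significant main effect

anova(m3, m1)                    # simplification should not be significantly worse
summary(m3)                      # the minimal adequate model
```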