
Introduction to Biostatistical Analysis

Using R
Statistics course for first-year PhD students

Session 2

Lecture: Introduction to statistical hypothesis testing

Null and alternative hypotheses. Types of error.


Two-sample hypotheses. Correlation. Analysis of frequency data.
Model simplification

Lecturer: Lorenzo Marini, PhD


Department of Environmental Agronomy and Crop Production,
University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
E-mail: lorenzo.marini@unipd.it
Tel.: +39 0498272807
http://www.biodiversity-lorenzomarini.eu/
Inference

A statistical hypothesis test is a method of making statistical
decisions from and about experimental data. Null-hypothesis testing
answers the question: "How well do the findings fit the possibility
that chance factors alone might be responsible?"

[Diagram: Population → (sampling) → Sample → (estimation) →
Statistical Model → (testing); uncertainty enters at every step!]
Key concepts: Session 1

Statistical testing in five steps:


1. Construct a null hypothesis (H0)
2. Choose a statistical analysis (assumptions!!!) Remember the order!!!
3. Collect the data (sampling)
4. Calculate the test statistic and P-value
5. Reject/retain (H0) if P is small/large

Concept of replication vs. pseudoreplication


1. Spatial dependence (e.g. spatial autocorrelation)
2. Temporal dependence (e.g. repeated measures)
3. Biological dependence (e.g. siblings)

[Figure: n = 6 observations yi scattered around their mean; the
vertical distances from the mean line are the residuals]

Key quantities

mean = Σ yi / n
residual = yi − mean
deviance = SS = Σ (yi − mean)²
var = Σ (yi − mean)² / (n − 1)
Hypothesis testing

• 1 – Hypothesis formulation (null hypothesis H0 vs.
alternative hypothesis H1)

• 2 – Compute the probability P of obtaining data at least as
extreme as those observed, assuming H0 is true (the P-value)

• 3 – If this probability is lower than a defined threshold, we
can reject the null hypothesis
Hypothesis testing: Types of error
Wrong conclusions: Type 1 and Type 2 errors

Actual situation   Reject H0                      Retain H0
Effect exists      Correct: effect detected       Type 2 error (β):
                   (probability = POWER)          effect not detected
No effect          Type 1 error (α): effect       Correct: no effect
                   detected, none exists          detected, none exists
                   (cf. P-value)

As power increases, the chance of a Type 2 error decreases


Statistical power depends on:
the statistical significance criterion used in the test
the size of the difference or the strength of the similarity (effect size) in
the population
the sensitivity of the data.
Statistical analyses

Mean comparisons for 2 populations

Test the difference between the means of two samples

Correlation
In probability theory and statistics, correlation (often measured as
a correlation coefficient) indicates the strength and direction of a
linear relationship between two random variables. In general
statistical usage, correlation refers to the departure of two
variables from independence.

Analysis of count or proportion data


Whole (integer) numbers (not continuous, with different
distributional properties) or proportions
Mean comparisons for 2 samples

The t test

H0: means do not differ


H1: means differ

Assumptions
• Independence of cases (work with true replications!!!) - this is a requirement
of the design.
• Normality - the distributions in each of the groups are normal
• Homogeneity of variances - the variance of data in groups should be the
same (use Fisher test or Fligner's test for homogeneity of variances).
• These together form the common assumption that the errors are
independently, identically, and normally distributed
Normality
Before we can carry out a test that assumes normality, we need to
check the distribution of our data (not always before!!!)

Graphical analysis

[Figure: histogram of mass with density curve; normal Q-Q plot of
observed quantiles vs. norm quantiles]

hist(y)
lines(density(y))
library(car)
qq.plot(y)   # or qqnorm(y)

In many cases we must check this assumption after having fitted the
model (e.g. regression or multifactorial ANOVA):
RESIDUALS MUST BE NORMAL

Tests for normality:
Shapiro-Wilk normality test: shapiro.test()
Skew + kurtosis (t test)
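A minimal sketch of these checks in R, with simulated skewed data
standing in for the fish masses used in the slides:

set.seed(42)
y <- rlnorm(100, meanlog = 1, sdlog = 0.5)   # right-skewed data

# Graphical checks: histogram with density curve, and a normal Q-Q plot
hist(y, freq = FALSE)
lines(density(y))
qqnorm(y)
qqline(y)

# Formal test: Shapiro-Wilk (H0: the data come from a normal distribution)
shapiro.test(y)         # typically rejects normality for these data
shapiro.test(log(y))    # the log-transformed data should pass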
Normality: Histogram and Q-Q Plot
[Figure: histograms of fishes$mass (right-skewed) and of
log(fishes$mass) (roughly symmetric), with the corresponding normal
Q-Q plots of fishes$mass and log(fishes$mass) against norm quantiles]

Normality: Histogram

A normal distribution must be symmetrical around the mean.

[Figure: symmetric histogram of frequencies produced by the quincunx
(Galton board) simulation below]

library(animation)
# Galton board animation: 2000 balls falling through 15 rows of pegs
ani.options(nmax = 2000 + 15 - 2, interval = 0.003)
freq = quincunx(balls = 2000, col.balls = rainbow(1))
# frequency table of where the balls land
barplot(freq, space = 0)
Normality: Quantile-Quantile Plot

Quantiles are points taken at


regular intervals from the
cumulative distribution function
(CDF) of a random variable.
The quantiles are the data
values marking the boundaries
between consecutive subsets
Normality

In case of non-normality: 2 possible approaches

1. Change the distribution (use GLMs) Advanced statistics


E.g. Poisson (count data)
E.g. Binomial (proportion)

2. Data transformation

Logarithmic (skewed data)
Square-root
Arcsin (percentage)
Probit (proportion)
Box-Cox transformation

[Figure: histogram of mass (right-skewed) vs. histogram of
fishes$logmass (roughly normal)]
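As an illustration of the last item, the boxcox() function in the
MASS package profiles the likelihood of a power transformation; a
minimal sketch, assuming a strictly positive response vector y
(simulated here):

library(MASS)

set.seed(1)
y <- rlnorm(100)   # skewed, strictly positive data

# Profile log-likelihood for the Box-Cox parameter lambda;
# lambda near 0 suggests a log transformation
bc <- boxcox(lm(y ~ 1), lambda = seq(-2, 2, 0.1))
bc$x[which.max(bc$y)]   # lambda with the highest likelihood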
Homogeneity of variance: two samples

Before we can carry out a test to compare two sample means, we need
to test whether the sample variances are significantly different.
The test could not be simpler. It is called Fisher's F test: to
compare two variances, all you do is divide the larger variance by
the smaller variance.

E.g. Students from TESAF vs. Students from DAAPV

F <- var(A)/var(B)        # F calculated
qf(0.975, nA-1, nB-1)     # F critical

If the calculated F is larger than the critical value, we reject the
null hypothesis. The test can be carried out with the var.test()
function.
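A minimal sketch with simulated scores for the two hypothetical
groups (names and values are illustrative only):

set.seed(7)
A <- rnorm(30, mean = 25, sd = 2)   # e.g. students from TESAF
B <- rnorm(25, mean = 25, sd = 4)   # e.g. students from DAAPV

# Hand-rolled F ratio (larger variance on top) and its critical value
var(B) / var(A)
qf(0.975, df1 = length(B) - 1, df2 = length(A) - 1)

# The same test in one call: H0 is that the variance ratio equals 1
var.test(B, A)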
Homogeneity of variance: > two samples

It is important to know whether variance differs significantly


from sample to sample. Constancy of variance
(homoscedasticity) is the most important assumption
underlying regression and analysis of variance. For multiple
samples you can choose between the
Bartlett test and the Fligner–Killeen test.

bartlett.test(response, factor)

fligner.test(response, factor)

There are differences between the tests: Fisher and Bartlett are
very sensitive to outliers, whereas Fligner–Killeen is not
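A quick sketch with three hypothetical groups of unequal spread:

set.seed(3)
d <- data.frame(
  response = c(rnorm(20, sd = 1), rnorm(20, sd = 1.2), rnorm(20, sd = 3)),
  group    = factor(rep(c("a", "b", "c"), each = 20))
)

bartlett.test(response ~ group, data = d)   # sensitive to outliers
fligner.test(response ~ group, data = d)    # robust alternative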
Mean comparison

In many cases, a researcher is interested in gathering information
about two populations in order to compare them. As in statistical
inference for one population parameter, confidence intervals and
tests of significance are useful statistical tools for the difference
between two population parameters.

H0: the two means are the same

H1: the two means differ

- All assumptions met? Parametric: t.test()
  t test with independent or paired samples

- Some assumptions not met? Non-parametric: wilcox.test()
  The Wilcoxon signed-rank test is a non-parametric alternative to
  Student's t-test for the case of two samples.
Mean comparison: 2 independent samples
Two independent samples
Students on the left Students on the right

The two samples are statistically independent.

mean_a = Σ yi / n_a        mean_b = Σ yi / n_b

SEdiff = √(var_a / n_a + var_b / n_b)

t = (mean_a − mean_b) / SEdiff

The test can be carried out with the t.test() function.
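A minimal sketch with simulated scores for the two groups of
students (values are illustrative):

set.seed(11)
left  <- rnorm(20, mean = 26, sd = 2)   # students on the left
right <- rnorm(20, mean = 24, sd = 2)   # students on the right

# Welch's t test is R's default; var.equal = TRUE gives the classical
# test once homogeneity of variances has been checked (var.test)
t.test(left, right, var.equal = TRUE)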
Mean comparison: t test for paired samples

Paired sampling in time or in space

E.g. Test your performance before or after


the course. I measure twice on the same
student

Time 1, a: 1, 2, 3, 2, 3, 2, 2
Time 2, b: 1, 2, 1, 1, 5, 1, 2

t = (Σ(ai − bi) / n) / (SDdiff / √n)

If we have information about dependence, we have to use it!!!
We can deal with dependence: the test can be carried out with the
t.test() function (paired = TRUE).
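Using the slide's own numbers:

a <- c(1, 2, 3, 2, 3, 2, 2)   # time 1
b <- c(1, 2, 1, 1, 5, 1, 2)   # time 2

t.test(a, b, paired = TRUE)   # paired t test
t.test(a - b)                 # equivalent one-sample test on the differences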
Mean comparison: Wilcoxon

Two samples of ozone measurements:

A: 3, 4, 4, 3, 2, 3, 1, 3, 5, 2
B: 5, 5, 6, 7, 4, 4, 3, 5, 6, 5

Rank procedure: all 20 values are ranked jointly, tied values
receiving the average of the ranks they span:

ozone  label  rank       ozone  label  rank
1      A      1          4      B      10.5
2      A      2.5        4      B      10.5
2      A      2.5        5      A      15
3      A      6          5      B      15
3      A      6          5      B      15
3      A      6          5      B      15
3      A      6          5      B      15
3      B      6          6      B      18.5
4      A      10.5       6      B      18.5
4      A      10.5       7      B      20

U = n1·n2 + n1(n1 + 1)/2 − R1

n1 and n2 are the numbers of observations
R1 is the sum of the ranks in sample 1

The test can be carried out with the wilcox.test() function.
NB: tied ranks require a correction.
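The same test on the slide's data:

A <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
B <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)

# Wilcoxon rank-sum (Mann-Whitney) test; R warns that the P-value is
# approximate because of the tied ranks
wilcox.test(A, B)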
Correlation
Correlation, (often measured as a correlation coefficient),
indicates the strength and direction of a linear relationship
between two random variables
Sampling unit   Bird species richness   Plant species richness
1               x1                      l1
2               x2                      l2
3               x3                      l3
4               x4                      l4
…               …                       …
458             x458                    l458
Three alternative approaches
1. Parametric - cor()
2. Nonparametric - cor()
3. Bootstrapping - replicate(), boot()
Correlation: causal relationship?

Which is the response variable in a correlation analysis? NONE!

Correlation measures association between two variables; it does not
distinguish a response from an explanatory variable, and by itself it
says nothing about causation.

[Table as above: 458 sampling units with bird species richness (xi)
and plant species richness (li)]
Correlation
Plot the two variables in a Cartesian space

A correlation of +1 means that there is a perfect positive LINEAR


relationship between variables.
A correlation of -1 means that there is a perfect negative LINEAR
relationship between variables.
A correlation of 0 means there is no LINEAR relationship between the two
variables.
Correlation
Same correlation coefficient!

[Figure: scatterplots of very different data patterns, all yielding
r = 0.816]
Parametric correlation: when is it significant?

Pearson product-moment correlation coefficient

Correlation coefficient (x and y as deviations from their means):

cor = Σ(xy) / √(Σx² · Σy²)

SEcor = √((1 − cor²) / (n − 2))

Hypothesis testing using the t distribution:

H0: cor = 0
H1: cor ≠ 0

t = cor / SEcor     compared with the critical t value for d.f. = n − 2

Assumptions
- Two random variables from random populations
- cor() detects ONLY linear relationships
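A minimal sketch with simulated variables:

set.seed(5)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)

cor(x, y)        # Pearson correlation coefficient
cor.test(x, y)   # t test of H0: cor = 0, with d.f. = n - 2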


Nonparametric correlation

Rank procedures: distribution-free, but with less power.

Spearman correlation index (a Pearson correlation computed on the
ranks of x and y):

cor.spearman = Σ(rank_x · rank_y) / √(Σrank_x² · Σrank_y²)

The Kendall tau rank correlation coefficient:

cor.kendall = 4P / (n(n − 1)) − 1

P is the number of concordant pairs
n is the total number of pairs
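Both are available through the method argument of cor() and
cor.test(), reusing the simulated x and y from above:

cor(x, y, method = "spearman")
cor.test(x, y, method = "spearman")
cor.test(x, y, method = "kendall")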
Scale-dependent correlation
NB Don’t use grouped data to compute overall correlation!!!

7 sites
Issues related to correlation

2. Temporal autocorrelation
Values in close years are more similar
Dependence of the data

3. Spatial autocorrelation
Values in close sites are more similar
Dependence of the data

Moran's I or Geary's C: measures of global spatial autocorrelation
[Figure: spatial patterns illustrating Moran's I = 0 vs. Moran's I = 1]
Three issues related to correlation

2. Temporal autocorrelation
Values in close years are more similar
Dependence of the data

When working with time series, there is likely to be a temporal
pattern in the data.

E.g. Ring width series

Autoregressive models
(not covered!)
Three issues related to correlation

3. Spatial autocorrelation
Values in close sites are more similar
Dependence of the data

ISSUE: can we explain the spatial autocorrelation with our models?

Moran's I or Geary's C (univariate response): measures of global
spatial autocorrelation

[Figure: spatial autocorrelation of the raw response vs. of the
residuals after model fitting]

Hint: If you find spatial autocorrelation in your residuals, you
should start worrying.
Estimate correlation with bootstrap
BOOTSTRAP
Bootstrapping is a statistical method for estimating the sampling
distribution of an estimator by sampling with replacement from the
original sample, most often with the purpose of deriving robust
estimates of SEs and CIs of a population parameter

Sampling with replacement


> a <- c(1:5)
> a                                        # 1 original sample
[1] 1 2 3 4 5
> replicate(10, sample(a, replace=TRUE))   # 10 bootstrapped samples
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    2    3    2    1    4    2    1    2    1     3
[2,]    1    5    2    3    5    3    1    1    3     2
[3,]    4    4    4    5    4    4    5    1    1     5
[4,]    4    1    1    3    3    2    3    1    5     2
[5,]    5    5    1    3    5    2    4    1    5     4
Estimate correlation with bootstrap

Why bootstrap?

It doesn't depend on the assumption of a normal distribution

It allows the computation of unbiased SEs and CIs

[Diagram: Sample → N samples with replacement → bootstrap
distribution of the statistic → quantiles give the CIs]

Estimate correlation with bootstrap

CIs are asymmetric because the bootstrap distribution reflects the
structure of the data and not a predefined probability distribution.

If we repeated the sampling n times, about 0.95·n of the estimates
would fall inside the CIs.
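A minimal sketch of bootstrapping a correlation coefficient with the
boot package, reusing the simulated x and y from the correlation
examples:

library(boot)

set.seed(5)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)
d <- data.frame(x, y)

# The statistic: the correlation computed on a resampled set of rows
cor.boot <- function(data, i) cor(data$x[i], data$y[i])

b <- boot(d, cor.boot, R = 2000)   # 2000 bootstrap samples
boot.ci(b, type = "perc")          # percentile CI (often asymmetric)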
Frequency data

Properties of frequency data:


-Count data
-Proportion data

Count data:
data where we count how many times something happened, but we have
no way of knowing how often it did not happen (e.g. the number of
students coming to the first lesson)

Proportion data: where we know the number doing a


particular thing, but also the number not doing that thing (e.g.
‘mortality’ of the students who attend the first lesson, but not the
second)
Count data

Straightforward linear methods (assuming constant variance,


normal errors) are not appropriate for count data for four main
reasons:

• The linear model might lead to the prediction of negative counts.


• The variance of the response variable is likely to increase with
the mean.
• The errors will not be normally distributed.
• Many zeros are difficult to handle in transformations.

- Classical test with contingency tables


- Generalized linear models with Poisson distribution
and log-link function (extremely powerful and flexible!!!)
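A minimal sketch of the GLM route, assuming a hypothetical data
frame with a count response and one continuous predictor:

set.seed(9)
d <- data.frame(x = runif(100, 0, 3))
d$count <- rpois(100, lambda = exp(0.5 + 0.8 * d$x))

m <- glm(count ~ x, data = d, family = poisson(link = "log"))
summary(m)   # compare residual deviance with its d.f. for overdispersion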
Count data: contingency tables
We can assess the significance of the differences between
observed and expected frequencies in a variety of ways:

- Pearson’s chi-squared (χ2)


- G test
- Fisher’s exact test

              Group 1   Group 2   Row total
Trait 1       a         b         a + b
Trait 2       c         d         c + d
Column total  a + c     b + d     a + b + c + d

H0: frequencies found in rows are independent from frequencies


in columns
Count data: contingency tables
- Pearson’s chi-squared (χ2)
We need a model to define the expected frequencies (E)
(many possibilities) – E.g. perfect independence
Ei = (Ri × Ci) / G          df = (r − 1) × (c − 1)

χ² = Σ (Oi − Ei)² / Ei      compare with the critical value

                    Oak   Beech   Row total (Ri)
With ants           22    30      52
Without ants        31    18      49
Column total (Ci)   53    48      101 (G)
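The same example in R:

ants <- matrix(c(22, 30,
                 31, 18),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("with ants", "without ants"),
                               c("oak", "beech")))

chisq.test(ants)            # Pearson's chi-squared test
chisq.test(ants)$expected   # the expected frequencies Ei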
Count data: contingency tables

- G test
1. We need a model to define the expected frequencies (E)
(many possibilities) – e.g. perfect independence

Ei = (Ri × Ci) / G

G = 2 Σ Oi ln(Oi / Ei)     compared against the χ² distribution

- Fisher's exact test: fisher.test()

Use it if expected values are less than 4 or 5.
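Reusing the ants matrix from the chi-squared sketch above:

fisher.test(ants)   # exact test, appropriate when expected counts are small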
Proportion data

Proportion data have three important properties that affect the way
the data should be analyzed:
• the data are strictly bounded (0-1);
• the variance is non-constant (it depends on the mean);
• errors are non-normal.

- Classical test with probit or arcsin transformation


- Generalized linear models with binomial distribution
and logit-link function (extremely powerful and flexible!!!)
Proportion data: traditional approach

Transform the data!


Arcsine transformation
The arcsine transformation takes care of the error distribution:

p' = arcsin √p     where p is a proportion (divide percentages by 100 first)

Probit transformation
The probit transformation takes care of the non-linearity:

p' = Φ⁻¹(p), the inverse of the standard normal CDF, with p a
proportion (0–1)

Proportion data: modern analysis

An important class of problems involves data on proportions such as:


• studies on percentage mortality (LD50),
• infection rates of diseases,
• proportion responding to clinical treatment (bioassay),
• sex ratios, or in general
• data on proportional response to an experimental treatment

2 approaches

1. Transform both the response and explanatory variables as needed, or
2. Use Generalized Linear Models (GLMs) with appropriate error
distributions
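A minimal sketch of the GLM route, assuming hypothetical
dose-response data with 'infected' successes out of n trials per
unit:

set.seed(13)
d <- data.frame(dose = runif(40, 0, 5), n = 20)
d$infected <- rbinom(40, size = d$n, prob = plogis(-2 + 0.9 * d$dose))

m <- glm(cbind(infected, n - infected) ~ dose,
         data = d, family = binomial(link = "logit"))
summary(m)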
Statistical modelling
MODEL
Generally speaking, a statistical model is a function of your
explanatory variables to explain the variation in your response
variable (y)
E.g. Y = a + b·x1 + c·x2 + d·x3
Y = response variable (performance of the students)
xi = explanatory variables (ability of the teacher, background, age)
The object is to determine the values of the parameters (a, b, c and
d) in a specific model that lead to the best fit of the model to the
data.

The best model is the model that produces the least unexplained
variation (the minimal residual deviance), subject to the
constraint that all the parameters in the model should be
statistically significant (many ways to reach this!)

deviance = SS = Σ (yi − mean)²
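A minimal sketch of fitting such a model and reading off the
parameter estimates and the unexplained variation, with simulated
stand-in variables:

set.seed(21)
x1 <- rnorm(60); x2 <- rnorm(60); x3 <- rnorm(60)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(60)   # x3 has no real effect

m <- lm(y ~ x1 + x2 + x3)
coef(m)       # estimates of a, b, c and d
deviance(m)   # residual sum of squares (unexplained variation)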
Statistical modelling

Getting started with complex statistical modeling

It is essential, that you can answer the following questions:


• Which of your variables is the response variable?

• Which are the explanatory variables?

• Are the explanatory variables continuous or categorical, or a


mixture of both?

• What kind of response variable do you have: is it a


continuous measurement, a count, a proportion, a time at
death, or a category?
Statistical modelling: multicollinearity

1. Multicollinearity
Correlation between predictors in a non-orthogonal multiple linear
model
Confounding effects are difficult to separate

Variables are not independent

This makes an important difference to our statistical modelling


because, in orthogonal designs, the variation that is attributed to a
given factor is constant, and does not depend upon the order in which
factors are removed from the model.

In contrast, with non-orthogonal data, we find that the variation


attributable to a given factor does depend upon the order in which
factors are removed from the model

The order of variable selection makes a huge difference


(please wait for session 4!!!)
Statistical modelling

Getting started with complex statistical modeling

The explanatory variables


(a) All explanatory variables continuous - Regression
(b) All explanatory variables categorical - Analysis of variance
(ANOVA)
(c) Explanatory variables both continuous and categorical -
Analysis of covariance (ANCOVA)

The response variable


(a) Continuous - Normal regression, ANOVA or ANCOVA
(b) Proportion - Logistic regression, GLM logit-linear models
(c) Count - GLM Log-linear models
(d) Binary - GLM binary logistic analysis
(e) Time at death - Survival analysis
Statistical modelling

Each analysis estimates a MODEL

You want the model to be minimal (parsimony), and adequate


(must describe a significant fraction of the variation in the data)

It is very important to understand that there is not just one


model.

Maximum likelihood estimation asks:
• given the data,
• and given our choice of model,
• what values of the parameters of that model make the observed
data most likely?
Model building: estimation of the parameters (slopes and levels of
factors)
Statistical modelling

Occam’s Razor
• Models should have as few parameters as possible;
• linear models should be preferred to non-linear models;
• experiments relying on few assumptions should be preferred to those
relying on many;
• models should be pared down until they are minimal adequate;
• simple explanations should be preferred to complex explanations.

MODEL SIMPLIFICATION

The process of model simplification is an integral part of


hypothesis testing in R. In general, a variable is retained
in the model only if it causes a significant increase in
deviance when it is removed from the current model.
Statistical modelling: model simplification

Parsimony requires that the model should be as simple as


possible. This means that the model should not contain
any redundant parameters or factor levels.

Model simplification
• remove non-significant interaction terms;
• remove non-significant quadratic or other non-linear terms;
• remove non-significant explanatory variables;
• group together factor levels that do not differ from one another;
• in ANCOVA, set non-significant slopes of continuous
explanatory variables to zero.
Statistical modelling: model simplification

Step  Procedure                      Interpretation

1     Fit the maximal model          Fit all the factors, interactions and
                                     covariates of interest. Note the
                                     residual deviance. If you are using
                                     Poisson or binomial errors, check for
                                     overdispersion and rescale if necessary.

2     Begin model simplification     Inspect the parameter estimates (e.g.
                                     using the R function summary()). Remove
                                     the least significant terms first (using
                                     update -), starting with the
                                     highest-order interactions.

3     If the deletion causes an      Leave that term out of the model.
      insignificant increase in      Inspect the parameter values again.
      deviance                       Remove the least significant term
                                     remaining.

4     If the deletion causes a       Put the term back in the model (using
      significant increase in        update +). These are the statistically
      deviance                       significant terms as assessed by
                                     deletion from the maximal model.

5     Keep removing terms            Repeat steps 3 and 4 until the model
      from the model                 contains nothing but significant terms.
                                     This is the minimal adequate model
                                     (MAM). If none of the parameters is
                                     significant, then the minimal adequate
                                     model is the null model.
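A minimal sketch of one simplification step with update(), assuming a
hypothetical data frame d with a response y, a covariate x and a
factor f:

set.seed(2)
d <- data.frame(x = rnorm(80), f = gl(2, 40))
d$y <- 1 + 0.7 * d$x + rnorm(80)

m1 <- lm(y ~ x * f, data = d)   # maximal model with the interaction
m2 <- update(m1, . ~ . - x:f)   # remove the interaction term
anova(m2, m1)                   # does deletion significantly increase deviance?
summary(m2)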
