
Common Mistakes of Statistical Analysis in Ecobiological Data, Modeling and DOE
Dr. Tapan Kr. Dutta


Teachers Training Department, Panskura Banamali College
Purba Medinipur, West Bengal, 721152
dr.t.dutta@gmail.com
Please set all mobile phones to silent mode during this session. Thank you.
The problem with statistics
We are all familiar with the disparaging quotes about statistics:
"There are three kinds of lies: lies, damned lies, and statistics."
The Power of statistics!!!

1. Independent assortment
2. Segregation of Mendelian Traits
Mendelian Traits

Inheritance of Mendelian Traits

Inheritance of Mendelian Traits

Values of Data
The most valuable data become worthless through wrong data representation, improper experimental design, faulty analysis, and misinterpretation.
The worst data become valuable through proper data representation, proper data analysis, and sound interpretation.
Error
Error is the collective noun for any departure of the result from the "true" value.
Statistical Errors
Known degrees of imprecision in the procedures used to gather and process information.
Four main sources of statistical error:
(1) Sampling error
(2) Measurement error
(3) Analytical error
(4) Interpretation error
Pre-analytical error

Analytical error

Post-analytical error

The Standard Deviation

For a value that is sampled with an unbiased, normally distributed error, the figure depicts the proportion of samples that would fall between 0, 1, 2, and 3 standard deviations above and below the actual value.
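The coverage implied by that picture (the familiar 68-95-99.7 rule) can be checked directly from the normal CDF; a minimal sketch in Python, assuming SciPy is available:

```python
# Proportion of normally distributed samples that fall within
# 1, 2, and 3 standard deviations of the true value.
from scipy.stats import norm

coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} SD: {p:.4f}")  # 0.6827, 0.9545, 0.9973
```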
Sample Size and
the truth

Standard
deviation (SD)

Steps at which errors can arise
Step 1: Are there outliers in Y and X? Unacceptable data (outliers)
Step 2: Do we have homogeneity of variance? ANOVA, MANOVA, ANCOVA, CCA, DCA, etc.
Step 3: Are the data normally distributed? t-test, linear regression, GLM, etc.
Step 4: Are there lots of zeros in the data? GLM, multivariate analyses such as CCA, DCA, RDA (ecological data)
Step 5: Is there collinearity among the covariates? If ignored, it confuses statistical interpretation (significant vs. insignificant) in multiple linear regression, GLM, RDA, CCA, etc.
Step 6: What are the relationships between the Y and X variables? Univariate analysis: response variable vs. covariate
Step 7: Should we consider interactions? Continuous vs. categorical variables (e.g., wing length vs. sex and month)
Step 8: Are observations of the response variable independent? Treating dependent observations as independent is misleading
Pre-analytical error
Hypothesis (H0 & HA)
Assumption
Modeling of experiment
Sampling
Error due to Design of
Experiment (DOE)
Sampling Error
Inaccuracy in predictions about a population that results from the fact that we do not observe every subject in the population.
Errors in Sampling
Nonsampling error
Poor sample design
Sampling (statistical) error
Biased sample
Depends on sample size
Tradeoff between the cost of sampling and the accuracy of estimates obtained by sampling
Statistical Errors

We can never know if we have made a statistical error, but we can quantify the chance of making such errors. What are the consequences of errors?
The probability of a Type I error = α
The probability of a Type II error = β
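That chance can be estimated by simulation. A minimal sketch with hypothetical data, assuming SciPy is available: draw both samples from the same population, so the null hypothesis is true by construction, and count how often a t-test at alpha = 0.05 rejects.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, n_sim = 0.05, 10_000

false_positives = 0
for _ in range(n_sim):
    a = rng.normal(0, 1, 20)  # both groups come from the SAME population,
    b = rng.normal(0, 1, 20)  # so H0 is true and every rejection is a Type I error
    if ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

type1_rate = false_positives / n_sim
print(type1_rate)  # close to alpha = 0.05
```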
Mistake #1
Failing to investigate data for data
entry or recording errors.
Failing to graph data and calculate
basic descriptive statistics before
analyzing data.

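A minimal sketch of the habit this mistake calls for, with hypothetical values (250 mimics a decimal-point entry error for 25.0): summarize and screen the data before running any test.

```python
import numpy as np

x = np.array([24.1, 25.3, 26.0, 23.8, 25.7, 250, 24.9])

# Basic descriptive statistics: the single bad entry inflates both.
print("mean:", x.mean(), "sd:", x.std(ddof=1))

# A simple 1.5*IQR screen flags the suspect value for checking.
q1, q3 = np.percentile(x, [25, 75])
fence = 1.5 * (q3 - q1)
outliers = x[(x < q1 - fence) | (x > q3 + fence)]
print("flagged:", outliers)  # [250.]
```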
Example:
Wrong Decision Due to Error

Analytical error
Sample No.
Homogeneity of variances
Normality/ homogeneity of
variance test
Ad Hoc or Post Hoc Test
Post-analytical error
Misinterpretation due to lack of proper application of theoretical knowledge (PCA, CCA, DCA, etc.)
Wrong Interpretation due to Error

CCA
Graph

DCA
Graph

                                           Axis-1  Axis-2  Axis-3  Axis-4  Total inertia
Eigenvalues                                 0.384   0.044   0.008   0.003
Length of gradient                          1.859   0.955   0.867   0.860
Species-environment correlations            0.986   0.901   0.865   0.000
Cumulative % variance of species data        31.2    34.8    35.4    35.7  1.231
Cumulative % variance of
species-environment relation                 49.5    62.2     0.0     0.0
Monte Carlo test of significance of all canonical axes: F-ratio = 1.886, P-value = 0.044
Example:
Wrong Decision Due to Error

Test of mu = 26.000 vs mu not = 26.000

Variable   N    Mean  StDev  SE Mean      T      P
With      16  25.625  3.964    0.991  -0.38   0.71
Without   15  24.733  1.792    0.463  -2.74  0.016

Variable   N    Mean  StDev  SE Mean  95.0 % CI
With      16  25.625  3.964    0.991  (23.513, 27.737)
Without   15  24.733  1.792    0.463  (23.741, 25.725)
Analytical errors
1. Random or unpredictable deviations
between replicates, quantified with the
"standard deviation".
2. Systematic or predictable regular
deviation from the "true" value,
quantified as "mean difference" (i.e. the
difference between the true value and
the mean of replicate determinations).
Analytical errors
3. Constant, unrelated to the
concentration of the substance
analyzed (the analyte).
4. Proportional, i.e. related to the
concentration of the analyte.

The sources of error
Using the same set of data both to formulate hypotheses
and to test them.
Taking samples from the wrong population or failing to
specify the population(s) about which inferences are to
be made in advance.
Failing to draw random, representative samples.
Measuring the wrong variables or failing to measure what you'd hoped to measure.
Using inappropriate or inefficient statistical methods.
Failing to validate models.
DETERMINING SAMPLE SIZE

Determining optimal sample size is simplicity itself once we specify all of the following:
Desired power and significance level.
Distributions of the observables.
Statistical test(s) that will be employed.
Anticipated losses due to nonresponders,
noncompliant participants, and dropouts.

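Once those ingredients are fixed, the arithmetic is mechanical. A hedged sketch using the usual normal approximation for a two-sided, two-sample comparison; the effect size, alpha, and power below are illustrative choices, not values from this talk:

```python
import math
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample test."""
    z_a = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_b = norm.ppf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# Detecting a half-SD difference with 80% power at alpha = 0.05:
print(n_per_group(delta=0.5, sigma=1.0))  # 63 per group
```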
Mistake #2
Using the wrong statistical procedure in
analyzing data.
Includes failing to check that necessary
assumptions are met.

Example:
Wrong Decision Due to Wrong
Analysis
Paired T for AFTER - BEFORE

            N   Mean  StDev  SE Mean
AFTER       4  82.00  12.96     6.48
BEFORE      4  71.00  15.87     7.94
Difference  4  11.00   5.03     2.52

95% CI for mean difference: (2.99, 19.01)
T-Test of mean difference = 0 (vs not = 0): T-Value = 4.37, P-Value = 0.02

Conclude mean pulse rate after is greater than mean pulse rate before.
Example:
Wrong Decision Due to Wrong Analysis

Two sample T for AFTER vs BEFORE

         N  Mean  StDev  SE Mean
AFTER    4  82.0   13.0      6.5
BEFORE   4  71.0   15.9      7.9

95% CI for mu AFTER - mu BEFORE: (-15.3, 37.3)
T-Test mu AFTER = mu BEFORE (vs not =): T = 1.07, P = 0.33, DF = 5

Conclude no difference in mean pulse rates before and after marching.
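The contrast can be reproduced with hypothetical pulse-rate data, assuming SciPy is available: the same eight numbers give opposite conclusions depending on whether the pairing is respected, because pairing removes between-subject variation.

```python
from scipy.stats import ttest_rel, ttest_ind

before = [62, 70, 74, 78]
after = [70, 84, 82, 92]  # the same four subjects re-measured after marching

p_paired = ttest_rel(after, before).pvalue  # correct: observations are paired
p_indep = ttest_ind(after, before).pvalue   # wrong model for these data
print(f"paired p = {p_paired:.4f}, independent p = {p_indep:.2f}")
```

With the paired analysis the difference is significant; treated as two independent samples it is not.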


Mistake #3
Failing to design our study so that it has
high enough power to call meaningful
differences significantly different.
Includes concluding that the null hypothesis is true; the correct conclusion is that there is not enough evidence to say the null is false.
Example: Low Power
Success = Yes, I recycle.

Gender   X   N  Sample p
Male    33  59  0.559322
Female  54  79  0.683544

Estimate for p(1) - p(2): -0.124222
95% CI for p(1) - p(2): (-0.287215, 0.0387704)
Test for p(1) - p(2) = 0 (vs not = 0): Z = -1.49, P-Value = 0.135

A number of students said that they were surprised that the hypothesis test said no difference in percentages.
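The printed Z and P can be reproduced by hand; a sketch using the unpooled standard error, which matches the output above:

```python
import math
from scipy.stats import norm

x1, n1 = 33, 59  # males who recycle
x2, n2 = 54, 79  # females who recycle
p1, p2 = x1 / n1, x2 / n2

se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled SE
z = (p1 - p2) / se
p_value = 2 * norm.cdf(-abs(z))
print(f"Z = {z:.2f}, P-Value = {p_value:.3f}")  # Z = -1.49, P-Value = 0.135
```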
Example: Low Power
Power and Sample Size
Test for Two Proportions
Testing proportion 1 = proportion 2 (versus not =)
Calculating power for: proportion 1 = 0.55 and proportion 2 = 0.70
Alpha = 0.05  Difference = -0.15

Sample Size  Power
         60  0.4366
         70  0.4911
         80  0.5421

*Sample size = # in EACH group
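A hedged sketch of the calculation behind the table: the normal approximation for a two-sided two-proportion test. It only roughly tracks the figures above, since Minitab uses a somewhat different method, but it shows the same trend of power rising with sample size.

```python
import math
from scipy.stats import norm

def power_two_prop(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion test, n per group."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(abs(p1 - p2) / se - z_crit)

for n in (60, 70, 80):
    print(n, round(power_two_prop(0.55, 0.70, n), 4))
```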
Mistake #4
Failing to report a confidence interval
as well as the P-value.
P-value tells you if statistically
significant.
Confidence interval tells you what the
population value might be.
Example: A Significant, but Potentially
Meaningless Difference

Two sample T for Phone

Gender   N  Mean  StDev  SE Mean
Male    59    79    162       21
Female  80   153    247       28

95% CI for mu (1) - mu (2): (-142, -5)
T-Test mu (1) = mu (2) (vs not =): T = -2.11, P = 0.036, DF = 135

The P-value tells us there is a significant difference, but the confidence interval tells us that the difference in the averages could be as small as 5 minutes.
Incidentally: Removing Outliers
Two sample T for Phone

Gender   N  Mean  StDev  SE Mean
Male    58  59.9   66.5      8.7
Female  79   129    133       15

95% CI for mu (1) - mu (2): (-103.7, -35)
T-Test mu (1) = mu (2) (vs not =): T = -4.02, P = 0.0001, DF = 121

The difference in male and female phone usage becomes even more
significant. We are 95% confident that the difference in the
averages is now more than 35 minutes.
Mistake #5
Fishing for significant results. That is,
performing several hypothesis tests on a
data set, and reporting only those results
that are significant.

If α = P(Type I) = 0.05 and we perform 20 tests on the same data set, we can expect to make 1 Type I error (0.05 × 20 = 1).
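That arithmetic, plus the standard guard against it, in a minimal sketch (Bonferroni is one common correction, not one named in this talk):

```python
alpha, m = 0.05, 20

expected_false_pos = alpha * m   # 0.05 * 20 = 1 expected Type I error
fwer = 1 - (1 - alpha) ** m      # chance of at least one Type I error: ~0.64
bonferroni = alpha / m           # per-test level keeping the family-wise rate <= 0.05

print(expected_false_pos, round(fwer, 2), bonferroni)
```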
Example:
Results Obtained from Fishing
Primary driver of $10,000 vehicle and going
away for Spring Break are related (P=0.01).
Virginity and supporting self through school
are related (P = 0.045).
Virginity and graduating in four years are
related (P = 0.041).
Virginity and attending non-football PSU
sports events are related (P = 0.016).
Mistake #6
Overstating the results of an observational
study.
That is, suggesting that one variable caused
the differences in the other variable.
As opposed to correctly saying that the two
variables are associated or correlated.
Don't forget that a significant result may be
spurious.
Example: Misleading Headlines

Virgins don't support themselves through school.
Non-virgins too busy to go to non-football PSU sporting events.
Non-virgins also too busy to graduate in four years.
Mistake #7
Using a non-random or
unrepresentative sample.

Includes extending the results of an unrepresentative sample to the population.
Example: Unrepresentative sample

Shere Hite wrote a book in 1987 called Women and Love.
100,000 questionnaires about love, sex, and relationships sent to women's groups. Only 4,500 questionnaires returned.
Entire book devoted to results of survey.
Examples: 91% of divorcees initiated the divorce; 70% of women married 5 years committed adultery.
Mistake #8
Failing to use all of the basic
principles of experiments, including
randomization, blinding, and
controlling.