
Randomization Distributions

Methods of Data Analysis I

September 27, 2017


Hypothesis Testing

Remember the logic of hypothesis testing:

1. Assume the null hypothesis is true (the null hypothesis is usually a hypothesis of no effect or no difference).

2. Determine how unusual the results you observed (usually in the form of a test statistic) are relative to the probability distribution that's induced by assuming the null hypothesis to be true.

3. If the results you observed are quite unusual, this provides evidence against the null hypothesis.

Example: Motivation Case Study

In the Intrinsic/Extrinsic motivation case study, the researcher was interested in whether creativity depended upon the type of questionnaire.

- Let Y* denote the creativity score of a hypothetical subject if they took the Intrinsic Motivation Questionnaire.

- Let Y denote the creativity score of that same subject if instead they took the Extrinsic Motivation Questionnaire.

- Then consider the model: Y* = Y + δ.
  (a) δ = 0 implies there's no effect of the questionnaires
  (b) δ > 0 implies higher creativity under Intrinsic Motivation
  (c) δ < 0 implies lower creativity under Intrinsic Motivation

Example: Motivation Case Study
In the Intrinsic/Extrinsic Motivation case study, where subjects were randomly assigned to treatment groups, you can understand the distribution induced by the null hypothesis as a randomization distribution:

1. The null hypothesis is H0: δ = 0.

2. Under this assumption, the Treatment variable just contains meaningless labels.

3. We create a randomization distribution for the test statistic

   Ȳ_I − Ȳ_E

   (the difference in sample mean creativity scores between the Intrinsic and Extrinsic groups) by randomly re-ordering the Treatment variable and re-calculating this test statistic a large number of times.
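
A minimal R sketch of this procedure, using illustrative stand-in vectors score and treatment (hypothetical names and values, not the actual case-study data):

  set.seed(1)
  # Illustrative stand-in data (not the real case-study values)
  score     <- c(rnorm(24, mean = 20, sd = 4), rnorm(23, mean = 16, sd = 4))
  treatment <- rep(c("Intrinsic", "Extrinsic"), times = c(24, 23))

  obs_stat <- mean(score[treatment == "Intrinsic"]) -
    mean(score[treatment == "Extrinsic"])

  # Randomly re-order the Treatment labels many times, re-computing the statistic
  rand_stats <- replicate(10000, {
    shuffled <- sample(treatment)
    mean(score[shuffled == "Intrinsic"]) - mean(score[shuffled == "Extrinsic"])
  })

  hist(rand_stats)  # approximate randomization distribution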
The Randomization Distribution

The histogram of these re-calculated test statistics is the randomization distribution that we'll use to evaluate the null hypothesis.

- If the test statistic calculated on the scores using the ACTUAL randomization (the observed test statistic) is unusual when we compare it to this randomization distribution, we will have evidence against the null hypothesis.

- If the observed test statistic is not that unusual relative to the randomization distribution, then we will not have any evidence to refute the null hypothesis.

Comparing Two Groups
Methods of Data Analysis I

September 29, 2017


Randomization Distribution

On Wednesday, we ended class looking at this plot:


Conclusions
At least from the plot, it appears that the observed value, Ȳ_I − Ȳ_E = 4.14, is pretty unusual relative to the null distribution.

- We can quantify just how unusual it is by calculating a p-value:
  The p-value is the probability, assuming the null hypothesis is true, of observing a result as extreme or more extreme than what we did observe.
  "...as extreme or more extreme" is relative to the null hypothesis—in this case, as far from zero, or farther.

- From our histogram, this is the proportion of values that are larger in absolute value than 4.14.

- I calculated this to be: p = 0.0059.
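
Continuing the R sketch above (with rand_stats and obs_stat from that code), this proportion is a one-line calculation:

  # Two-sided p-value: proportion of shuffled statistics at least as far
  # from zero as the observed difference
  mean(abs(rand_stats) >= abs(obs_stat))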


Conclusions

p = 0.0059 is pretty small.

- That is, the probability of seeing something as unusual as, or more unusual than, a value of 4.14 when we assume the null hypothesis to be true is small.

- This provides us with fairly strong evidence against the null hypothesis—it's likely that there actually is a difference in mean creativity scores between the two motivation types.

- And, because the treatments were randomly assigned, we can conclude that the difference was caused by the difference in motivation type.

Comparing Two Groups
Methods of Data Analysis I

October 2, 2017
Identical Twins Case Study

- 15 sets of identical twins; one member of each set has schizophrenia, the other does not.

- These are paired data—there is statistical dependence between measurements within a set of twins.

- Response measurement: volume (cm³) of the left hippocampus.

- Scientific Question: Is there a difference in left hippocampus volume between schizophrenic and non-schizophrenic twins in sets of monozygotic twins?
Twins Data
Paired Samples

The twins case study is an example of paired data.

- Because of this, we can treat the differences between volume measurements for twins in each set as a single sample.

- That is, consider the population of identical twins with the property that one of the pair has schizophrenia and the other does not. We have obtained a sample of size 15 from this population.

- A two-sample problem with paired samples reduces to a one-sample problem.
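
A minimal R sketch of this reduction, using stand-in vectors unaffected and affected (hypothetical names, illustrative values), one entry per twin set:

  set.seed(2)
  # Stand-in paired volumes (cm^3), one entry per twin set; illustrative only
  unaffected <- rnorm(15, mean = 1.8, sd = 0.2)
  affected   <- unaffected - rnorm(15, mean = 0.2, sd = 0.2)

  diffs <- unaffected - affected   # reduce the paired problem to one sample
  t.test(diffs)                    # one-sample t-test on the differences
  # Equivalent shortcut: t.test(unaffected, affected, paired = TRUE)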
Twins Data
Twins Data
The One-Sample Problem

Outline:

1. What's the population of interest?

2. How do we relate the scientific question to the population of interest?

3. How do we relate the observations we have (the sample) to the population of interest?

4. How do we account for the problem that we just have information about a sample, not the whole population?
Checking Assumptions
Methods of Data Analysis I

October 9, 2017
Assumptions

The underlying assumptions of the two-sample t-test and t-confidence interval:

1. Independence. The two samples are statistically independent, and the observations within each sample are statistically independent.

2. Normality. The two populations from which the samples are taken are Normally shaped.

3. Equal Variance. The two population variances (equivalently, standard deviations) are equal.
Outliers

An outlier is an observation that is far from the majority of the other observations in the sample.

- The presence of outliers may signify long-tailed distributions or skewed distributions.

- The t-tools may not be robust to this particular departure from Normality, especially if the sample sizes are different.

- Explanations for outliers? Data entry error? A different population?
The Problem with Outliers

A statistical procedure is resistant if it doesn't change very much when a small portion of the data changes.

Since the t-tools are based on computing sample means, they are not resistant to outliers.

Distinguishing between a distribution that is skewed or that has heavy tails and one that has outliers can be tricky.

In general, possible outliers should be evaluated on a case-by-case basis (Display 3.6).
Transformations

Data transformations are sometimes used to bring samples into line with the assumptions of a particular statistical method. In particular, they can sometimes be useful for

1. reducing skewness

2. reducing differences in spread

3. reducing the presence and/or magnitude of outliers

The idea: if our samples don’t meet the t-tool assumptions on the
original scale of measurement, maybe they will on a transformed
scale.
The Log Transformation

We will typically use the natural logarithm (log base e = 2.71828...).

This transformation is a good one because it tends to make small numbers a little bit smaller, but it makes large numbers a lot smaller (Display 3.8).

It's also nice because, after performing the two-sample t-procedures on the log-transformed data, we can sometimes take a back-transformation and interpret results on the original scale of measurement.
Other Transformations


- Square root transformation, √Y
  - only for positive data
  - good for counts; measurements of area

- Reciprocal transformation, 1/Y
  - good for waiting times (e.g., to recurrence, to arrival)

- Arcsine square root, arcsin(√Y), and logit, log(Y / (1 − Y)), transformations
  - good for proportion data (between 0 and 1)
Multiplicative Treatment Effect

For randomized experiments, we talked about an Additive Treatment Effect Model: Y* = Y + δ.

Suppose that after some exploratory data analysis, we decide that transforming the two samples using the log transformation will bring the samples more in line with the two-sample t-test assumptions.

If the log transformation is a good one, then we're probably prepared to believe that the additive treatment effect model holds for the transformed data:

  log(Y*) = log(Y) + γ
Multiplicative Treatment Effect

But if log(Y*) = log(Y) + γ, then equivalently, by exponentiating both sides:

  Y* = e^(log(Y) + γ) = e^(log(Y)) e^γ = Y e^γ

So, on the original scale of measurement, we have a multiplicative treatment effect model:

  Y* = Y e^γ.
Multiplicative Treatment Effect

If the multiplicative treatment effect model is appropriate, then the effect is e^γ, and we can estimate it using e^γ̂. Interpreting the multiplicative treatment effect model is straightforward:

  Outcomes under one treatment are estimated to be, on average, e^γ̂ times as large as those under the other treatment.

(Display 3.9)
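
A minimal R sketch of this estimate, assuming the (positive-valued) responses for the two groups sit in hypothetical vectors y_trt and y_ctl; γ̂ is the difference in sample means on the log scale:

  set.seed(3)
  # Illustrative stand-in responses (positive values) for two treatment groups
  y_trt <- rlnorm(20, meanlog = 2.3, sdlog = 0.5)
  y_ctl <- rlnorm(20, meanlog = 2.0, sdlog = 0.5)

  fit <- t.test(log(y_trt), log(y_ctl), var.equal = TRUE)

  gamma_hat <- mean(log(y_trt)) - mean(log(y_ctl))  # estimate of gamma
  exp(gamma_hat)     # estimated multiplicative effect, e^gamma-hat
  exp(fit$conf.int)  # back-transformed confidence interval (a ratio)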
Some Caution

The additive treatment effect model is slightly more complicated than I've represented.

- I've written Y* = Y + δ.

- But how does that actually relate to our situation, where X1, ..., Xn1 are independent responses for subjects (experimental units) exposed to one treatment and Y1, ..., Yn2 are independent responses for subjects exposed to a different treatment?

Some hand-written notes.


Checking Assumptions
Methods of Data Analysis I

October 11, 2017
Log Transformations

I recommend using any transformation, including the log transformation, sparingly.

1. The two-sample t-test is robust to departures from Normality, especially for large sample sizes.

2. The back-transformation interpretation of log-transformed data in a randomized experiment—i.e., that the estimated treatment effect is multiplicative—can be misleading.

3. The back-transformation interpretation of log-transformed data in observational studies—i.e., that the estimated effect is a ratio of population medians—rests on strong and unverifiable assumptions of symmetry of the population distributions on the log scale.
Transformations

Sometimes, transformations are called for and/or they are justified.

- For example: there may be literature to suggest that rainfall measurements follow exponential distributions (these are right-skewed distributions).

If you do need to use a log transformation, then it may be best simply to interpret your findings on the transformed scale.
Assumptions

Some important things to remember in the two-sample setting:

(1) Observations within samples must be statistically independent to use any of the methods we'll talk about in this course.

(2) If the two samples are paired, then you should use the paired t-test.
  - You should still examine a plot of the two samples, with lines connecting the pairs, to make sure that any difference in the pairs can be adequately explained by an additive effect.
  - Starting on Friday we'll look at some alternatives to the paired t-test.
Assumptions

(3) If the two underlying populations from which the samples are obtained are Normally shaped (and have equal variances), then the t-distribution is the exact distribution for evaluating differences in population means, regardless of the sample sizes.

(4) For underlying population distributions that are different from Normal, results using the t-distribution are fairly robust, and deviations from Normality can be overcome by increasing the sample sizes.

(5) There may be some situations where results using the t-distribution are not valid, and so we'll explore some alternatives.
Two Sample Alternatives
Methods of Data Analysis I

October 16, 2017


Cognitive Load Experiment

A randomized experiment to compare conventional teaching materials to modified materials.

- Treatments: conventional textbook solutions and modified worked examples; self-study.

- Response: time to solution of a "moderately difficult problem."

- Question: Is there a difference between the learning methods?

Cognitive Load Data
The Cognitive Load Experiment

- Both distributions are rather heavy-tailed.

- 5 observations are censored in the conventional-method distribution. An observation is censored if we do not know its true value, only that it is larger (or smaller) than a certain value.

We shouldn't use the methods based on the t-distribution for these data because of the censoring. Instead, we'll use the rank-sum test to assess the difference between the groups.
The Rank-Sum Test

- Also known as the Wilcoxon rank-sum test.

- This is an alternative to the methods based on the t-distribution that is resistant to outliers.

- It's based on the ranks (i.e., the order) of the data rather than the data themselves, making it useful when there are censored observations.

- It is almost as good as the t-distribution methods if the populations are Normal.

- It is better than the t-distribution methods if there are extreme outliers.
The Basic Idea

The essential idea: to determine differences between groups, we don't use the data directly; we use the ranks of the data.

- Rank all of the data from 1 to n1 + n2 (note: we'll deal with ties in a minute).

- Determine whether, on average, the lower (higher) ranks appear in one of the treatment groups or the other.
The Rank-Sum Statistic

1. List all observations from both samples in increasing order.

2. Identify which sample each observation came from.

3. Assign each observation a rank according to its order in the list. Tied observations are assigned the average of the ranks they would otherwise occupy.

4. The rank-sum statistic, T, is the sum of the ranks in one of the groups.

Note: censored observations are treated as ties (with the highest [lowest] ranks); see Display 4.5.
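
A minimal R sketch of this calculation on small made-up samples (not the case-study data):

  # Made-up example data
  group1 <- c(12.1, 15.3, 9.8, 20.0)
  group2 <- c(14.2, 18.7, 22.5, 16.0, 19.9)

  values <- c(group1, group2)
  labels <- rep(1:2, times = c(length(group1), length(group2)))

  r <- rank(values)              # rank() assigns tied values the average rank
  T_stat <- sum(r[labels == 1])  # rank-sum statistic: sum of group-1 ranks
  T_stat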
The Rank-Sum Test

The null hypothesis for randomized experiments is written in terms of the additive treatment effect:

  H0: δ = 0

The null hypothesis for observational studies is a little more vague:

  H0: the two population distributions are identical

In either case, if the null hypothesis is true, then the n1 ranks in group 1 are a random sample from the n1 + n2 possible ranks. This is the basis for our test.
The Sampling Distribution of T

The sampling distribution of the rank-sum statistic, T, is the randomization or permutation distribution of T.

- Recall that the randomization distribution applies in the case of a randomized experiment—it's the histogram of T under all possible re-randomizations.

- The permutation distribution applies to observational studies—it's the histogram of T under all possible allocations of the responses to the two groups.
The Sampling Distribution of T

In either case (randomized experiment or observational study), we can approximate this sampling distribution using a Normal distribution.

It turns out that this approximation works well except when the sample sizes are small (e.g., 5 or so) or there are a lot of ties (i.e., many of the ranks are the same).
Rank-Sum in R

You’re going to see in lab this week a couple versions of the


rank-sum test:

1. wilcox.test is based on the version of the rank-sum procedure


that I just described.

2. wilcox.exact makes an exact p-value calculation, rather than


relying on the Normality of the test statistic.

The two approaches give virtually identical results when there are
not a lot of ties, and when the sample sizes are relatively large.
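
For example, with the made-up group1 and group2 vectors from the earlier sketch (wilcox.exact lives in the exactRankTests package, assumed installed here):

  wilcox.test(group1, group2)    # rank-sum test as described above

  library(exactRankTests)        # provides wilcox.exact (assumed installed)
  wilcox.exact(group1, group2)   # exact p-value calculation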
The Challenger Data

Launch Temperature    Number of O-ring incidents
Below 65°F            1 1 1 3
Above 65°F            0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2

The Challenger Data

- We can't reasonably say that these data are from Normal populations.

- The sample sizes are small (n1 = 4, n2 = 20) and not the same.

- There are lots of zeros, so a log transformation is certainly not recommended.

- The methods based on t-distributions are just not appropriate here.

- Instead, we'll calculate a one-sided p-value from a permutation test on a test statistic and see that there's evidence of an association between O-ring failure and temperature.
Permutation Test

Suppose we want to use a statistic, T, calculated from two samples—one of size n1 and one of size n2—to compare two populations.

The null hypothesis for the permutation test is:

  H0: the two population distributions are the same.

Assuming the null hypothesis to be true, we can create the permutation distribution of T—this is just the histogram of T's recalculated from all possible permutations of the observations into the two groups (maintaining the observed sample sizes).
Permutation Tests

The p-value for this test is determined as follows:

1. Denote by T_obs the value of the test statistic for the observed data.

2. Determine how many permutations of the two samples there are; call this N.

3. Count up the number of those permutations that result in values of T that are as extreme or more extreme (relative to the null hypothesis) as T_obs; call this m.

4. The p-value is just the proportion m/N.
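
The next few slides carry out this count exactly for the Challenger data. When N is too large to enumerate, the same p-value can be approximated by sampling random permutations; a minimal R sketch, using the Challenger counts from the table above and the group-1 sum as the statistic:

  set.seed(4)
  x1 <- c(1, 1, 1, 3)              # below 65°F (n1 = 4)
  x2 <- c(rep(0, 17), 1, 1, 2)     # above 65°F (n2 = 20)
  values <- c(x1, x2)

  T_obs <- sum(x1)                 # observed statistic: 6 incidents

  # Sample many random re-allocations of 4 observations to group 1
  perm_T <- replicate(100000, sum(sample(values, length(x1))))

  mean(perm_T >= T_obs)            # approximate one-sided p-value (close to 0.0099)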


The Number of Permutations

To calculate a permutation test p-value, we need to know the number of ways it is possible to permute n1 + n2 observations into two groups, one of size n1 and the other of size n2.

This is a well-known problem, and its solution is

  (n1 + n2 choose n1) = (n1 + n2)! / (n1! n2!)

This is read "n1 + n2 choose n1", and recall that n! = n × (n − 1) × (n − 2) × · · · × 1.
The Number of Permutations

For the Challenger data, n1 + n2 = 24 and n1 = 4, so

  (24 choose 4) = 24! / (4! 20!) = (24 · 23 · 22 · 21) / (4 · 3 · 2 · 1) = 10,626

This number is the denominator of our p-value calculation.
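
As a quick check in R:

  choose(24, 4)                                   # 10626
  factorial(24) / (factorial(4) * factorial(20))  # same value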


Calculating the P-Value

Now, we need to count up those permutations that lead to a value of the chosen test statistic that is as extreme or more extreme than the one we actually observed.

- In the Challenger data, there are a total of 6 O-ring incidents observed for launch temperatures less than 65°F, so let's take T_obs = 6, the sum of the number of failures in the cold-temperature group.

- Now we need to count up the number of permutations that would yield a total of 6 or more O-ring incidents in the cold-temperature group.
Counting Permutations

First, for the Challenger data, let's write down the specific permutations that result in a sum of 6 or more in the cold-temperature group:

  permutation    sum in the group
  1, 1, 1, 3     6
  1, 1, 2, 3     7
  0, 1, 2, 3     6

Now we have to ask: how many WAYS can each of these permutations occur?
Counting Permutations

In the Challenger data, there are 17 zeros, 5 ones, 1 two, and 1 three. Let's first focus on the number of ways to get the permutation 1, 1, 1, 3.

For this, we need to calculate the number of ways to choose 3 ones out of the 5 available AND to choose 1 three out of the 1 available. This is another common problem in combinatorics, and there is a known solution.
Counting Permutations

The number of ways to choose 3 ones out of the 5 available AND to choose 1 three out of the 1 available is given by:

  (5 choose 3) × (1 choose 1) = 5! / (3! 2!) × 1
                              = (5 × 4 × 3 × 2 × 1) / (3 × 2 × 1 × 2 × 1)
                              = 10.
Counting Permutations

Now, let’s focus on the number of ways to get the permutation


1,1,2,3. This is just:
   
5 1 1
= 10
2 1 1

Finally, the number of ways to get the permutation 0,1,2,3 is


    
17 5 1 1
= 17 × 5 = 85.
1 1 1 1
Calculating the P-value

Therefore, altogether, there are 10 + 10 + 85 = 105 ways to obtain a sum in the cold-temperature group that's at least as large as 6.

A one-sided p-value for a test of whether the two populations are the same is then:

  105 / 10626 = 0.00988.

That is, we have strong evidence of a difference in the number of O-ring failures between the two temperature groups.
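
The same counts, and the p-value, can be reproduced directly in R:

  ways <- choose(5, 3) * choose(1, 1) +                          # 1, 1, 1, 3
    choose(5, 2) * choose(1, 1) * choose(1, 1) +                 # 1, 1, 2, 3
    choose(17, 1) * choose(5, 1) * choose(1, 1) * choose(1, 1)   # 0, 1, 2, 3
  ways                   # 105
  ways / choose(24, 4)   # one-sided p-value: 0.00988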
Two Sample Alternatives
Methods of Data Analysis I

October 18, 2017


The Sign Test

This is a test that is designed for paired data (e.g., the identical twins data).

- It's another distribution-free test; resistant to outliers.

- The idea:
  - count the number of pairs in which the observation in group 1 exceeds that in group 2
  - under the null hypothesis of no difference, this count should be roughly one half of the pairs

In R, you can perform the sign test using the binom.test function.
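
A minimal sketch with binom.test, reusing the stand-in unaffected and affected vectors from the earlier twins sketch:

  d <- unaffected - affected
  k <- sum(d > 0)            # pairs in which group 1 exceeds group 2
  n <- sum(d != 0)           # pairs with exact ties are dropped
  binom.test(k, n, p = 0.5)  # sign test: is k consistent with half the pairs?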
The Wilcoxon Signed-Rank Test

Like the Sign Test, this is a distribution-free and resistant test for paired samples.

In this test, we use the ranks of the observations, in addition to their signs:

1. Rank the absolute values of the paired differences.

2. Add up the ranks of the pairs for which the difference is positive.

3. Compare this test statistic to what we would expect if there were no difference in the pairs.
The Wilcoxon Signed-Rank Test

Essentially, the signed-rank test is the rank-sum test for paired samples.

- The idea is that we create two groups—one in which the paired differences are positive and the other in which the paired differences are negative.

- Then, we perform the Wilcoxon rank-sum test using these two "groups."

Use wilcox.exact with paired = TRUE.
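
For example, with the stand-in twins vectors from earlier (base R's wilcox.test accepts the same paired argument):

  wilcox.exact(unaffected, affected, paired = TRUE)  # exactRankTests package
  wilcox.test(unaffected, affected, paired = TRUE)   # base-R version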


Levene’s Test for Variances

Sometimes, we want to compare the variances of two populations. Levene's test provides a method for this.

Suppose there are n1 observations, Y1i, from population 1, with standard deviation σ1, and n2 observations, Y2i, from population 2, with standard deviation σ2. Let

  Z1i = (Y1i − Ȳ1)²   and   Z2i = (Y2i − Ȳ2)²

Then the Z1's and Z2's are samples from populations with means (approximately) σ1² and σ2², respectively.

So, perform a two-sample t-test for means using the Z's.

The leveneTest function in the car package implements a version of this test.
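
A minimal sketch of the by-hand version described above, on made-up samples; leveneTest from the car package (assumed installed) is the packaged alternative:

  set.seed(5)
  y1 <- rnorm(30, mean = 10, sd = 2)   # made-up sample 1
  y2 <- rnorm(25, mean = 12, sd = 3)   # made-up sample 2

  z1 <- (y1 - mean(y1))^2              # squared deviations, sample 1
  z2 <- (y2 - mean(y2))^2              # squared deviations, sample 2
  t.test(z1, z2)                       # two-sample t-test on the Z's

  library(car)                         # provides leveneTest (assumed installed)
  leveneTest(c(y1, y2), group = factor(rep(1:2, times = c(30, 25))))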


F-Test for Equal Variance
Another test for equal variances is called the F-test for equal variances, and it's based on the test statistic

  F = s1² / s2²,

where s1² and s2² are the sample variances from samples 1 and 2, respectively.

Provided that the two samples are drawn from Normal populations, the sampling distribution of F is an F-distribution with n1 − 1 and n2 − 1 df.

This test is not robust to departures from the Normality assumption.

var.test in R.
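
For example, with the made-up y1 and y2 from the Levene sketch above:

  var(y1) / var(y2)   # the F statistic, s1^2 / s2^2
  var.test(y1, y2)    # F-test for equal variances (assumes Normal populations)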
