
Randomization Distributions

Methods of Data Analysis I

September 27, 2017


Hypothesis Testing

Remember the logic of hypothesis testing:

1. Assume the null hypothesis is true (the null hypothesis is usually a hypothesis of no effect or no difference).

2. Determine how unusual the results you observed (usually in the form of a test statistic) are relative to the probability distribution that's induced by assuming the null hypothesis to be true.

3. If the results you observed are quite unusual, this provides evidence against the null hypothesis.

Example: Motivation Case Study

In the Intrinsic/Extrinsic motivation case study, the researcher was interested in whether creativity depended upon the type of questionnaire.

- Let Y* denote the creativity score of a hypothetical subject if they took the Intrinsic Motivation Questionnaire.

- Let Y denote the creativity score of that same subject if instead they took the Extrinsic Motivation Questionnaire.

- Then consider the model: Y* = Y + δ.
  (a) δ = 0 implies there's no effect of the questionnaires
  (b) δ > 0 implies higher creativity under Intrinsic Motivation
  (c) δ < 0 implies lower creativity under Intrinsic Motivation

Example: Motivation Case Study
In the Intrinsic/Extrinsic Motivation case study, where subjects were randomly assigned to treatment groups, you can understand the distribution induced by the null hypothesis as a randomization distribution:

1. The null hypothesis is H0: δ = 0.

2. Under this assumption, the Treatment variable just contains meaningless labels.

3. We create a randomization distribution for the test statistic

   Ȳ_I − Ȳ_E

   (the difference in sample mean creativity scores between the Intrinsic and Extrinsic groups) by randomly re-ordering the Treatment variable and re-calculating this test statistic a large number of times.
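
A minimal R sketch of this procedure, using illustrative stand-in vectors score and treatment (hypothetical names and values, not the actual case-study data):

  set.seed(1)
  # Illustrative stand-in data (not the real case-study values)
  score     <- c(rnorm(24, mean = 20, sd = 4), rnorm(23, mean = 16, sd = 4))
  treatment <- rep(c("Intrinsic", "Extrinsic"), times = c(24, 23))

  obs_stat <- mean(score[treatment == "Intrinsic"]) -
    mean(score[treatment == "Extrinsic"])

  # Randomly re-order the Treatment labels many times, re-computing the statistic
  rand_stats <- replicate(10000, {
    shuffled <- sample(treatment)
    mean(score[shuffled == "Intrinsic"]) - mean(score[shuffled == "Extrinsic"])
  })

  hist(rand_stats)  # approximate randomization distribution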
The Randomization Distribution

The histogram of these re-calculated test statistics is the randomization distribution that we'll use to evaluate the null hypothesis.

- If the test statistic calculated on the scores using the ACTUAL randomization (the observed test statistic) is unusual when we compare it to this randomization distribution, we will have evidence against the null hypothesis.

- If the observed test statistic is not that unusual relative to the randomization distribution, then we will not have any evidence to refute the null hypothesis.

Comparing Two Groups
Methods of Data Analysis I

September 29, 2017


Randomization Distribution

On Wednesday, we ended class looking at this plot:


Conclusions
At least from the plot, it appears that the observed value, Ȳ_I − Ȳ_E = 4.14, is pretty unusual relative to the null distribution.

- We can quantify just how unusual it is by calculating a p-value:
  The p-value is the probability, assuming the null hypothesis is true, of observing a result as extreme or more extreme than what we did observe.
  "...as extreme or more extreme" is relative to the null hypothesis—in this case, as far from zero, or farther.

- From our histogram, this is the proportion of values that are larger in absolute value than 4.14.

- I calculated this to be: p = 0.0059.
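
Continuing the R sketch above (with rand_stats and obs_stat from that code), this proportion is a one-line calculation:

  # Two-sided p-value: proportion of shuffled statistics at least as far
  # from zero as the observed difference
  mean(abs(rand_stats) >= abs(obs_stat))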


Conclusions

p = 0.0059 is pretty small.

- That is, the probability of seeing something as unusual as, or more unusual than, a value of 4.14 when we assume the null hypothesis to be true is small.

- This provides us with fairly strong evidence against the null hypothesis—it's likely that there actually is a difference in mean creativity scores between the two motivation types.

- And, because the treatments were randomly assigned, we can conclude that the difference was caused by the difference in motivation type.

Comparing Two Groups
Methods of Data Analysis I

October 2, 2017
Identical Twins Case Study

- 15 sets of identical twins; one member of each set has schizophrenia, the other does not.

- These are paired data—there is statistical dependence between measurements within a set of twins.

- Response measurement: volume (cm³) of the left hippocampus.

- Scientific Question: Is there a difference in left hippocampus volume between schizophrenic and non-schizophrenic twins in sets of monozygotic twins?
Twins Data
Paired Samples

The twins case study is an example of paired data.

- Because of this, we can treat the differences between volume measurements for twins in each set as a single sample.

- That is, consider the population of identical twins with the property that one of the pair has schizophrenia and the other does not. We have obtained a sample of size 15 from this population.

- A two-sample problem with paired samples reduces to a one-sample problem.
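
A minimal R sketch of this reduction, using stand-in vectors unaffected and affected (hypothetical names, illustrative values), one entry per twin set:

  set.seed(2)
  # Stand-in paired volumes (cm^3), one entry per twin set; illustrative only
  unaffected <- rnorm(15, mean = 1.8, sd = 0.2)
  affected   <- unaffected - rnorm(15, mean = 0.2, sd = 0.2)

  diffs <- unaffected - affected   # reduce the paired problem to one sample
  t.test(diffs)                    # one-sample t-test on the differences
  # Equivalent shortcut: t.test(unaffected, affected, paired = TRUE)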
Twins Data
Twins Data
The One-Sample Problem

Outline:

1. What's the population of interest?

2. How do we relate the scientific question to the population of interest?

3. How do we relate the observations we have (the sample) to the population of interest?

4. How do we account for the problem that we just have information about a sample, not the whole population?
Checking Assumptions
Methods of Data Analysis I

October 9, 2017
Assumptions

The underlying assumptions of the two-sample t-test and t-confidence interval:

1. Independence. The two samples are statistically independent, and the observations within each sample are statistically independent.

2. Normality. The two populations from which the samples are taken are Normally shaped.

3. Equal Variance. The two population variances (equivalently, standard deviations) are equal.
Outliers

An outlier is an observation that is far from the majority of the other observations in the sample.

- The presence of outliers may signify long-tailed distributions or skewed distributions.

- The t-tools may not be robust to this particular departure from Normality, especially if the sample sizes are different.

- Explanations for outliers? Data entry error? A different population?
The Problem with Outliers

A statistical procedure is resistant if it doesn't change very much when a small portion of the data changes.

Since the t-tools are based on computing sample means, they are not resistant to outliers.

Distinguishing between a distribution that is skewed or that has heavy tails and one that has outliers can be tricky.

In general, possible outliers should be evaluated on a case-by-case basis (Display 3.6).
Transformations

Data transformations are sometimes used to bring samples into line with the assumptions of a particular statistical method. In particular, they can sometimes be useful for

1. reducing skewness

2. reducing differences in spread

3. reducing the presence and/or magnitude of outliers

The idea: if our samples don’t meet the t-tool assumptions on the
original scale of measurement, maybe they will on a transformed
scale.
The Log Transformation

We will typically use the natural logarithm (log base e = 2.71828...).

This transformation is a good one because it tends to make small numbers a little bit smaller, but it makes large numbers a lot smaller (Display 3.8).

It's also nice because, after performing the two-sample t-procedures on the log-transformed data, we can sometimes take a back-transformation and interpret results on the original scale of measurement.
Other Transformations


- Square root transformation, √Y
  - only for positive data
  - good for counts; measurements of area

- Reciprocal transformation, 1/Y
  - good for waiting times (e.g., to recurrence, to arrival)

- Arcsine square root, arcsin(√Y), and logit, log(Y / (1 − Y)), transformations
  - good for proportion data (between 0 and 1)
Multiplicative Treatment Effect

For randomized experiments, we talked about an Additive Treatment Effect Model: Y* = Y + δ.

Suppose that after some exploratory data analysis, we decide that transforming the two samples using the log transformation will bring the samples more in line with the two-sample t-test assumptions.

If the log transformation is a good one, then we're probably prepared to believe that the additive treatment effect model holds for the transformed data:

  log(Y*) = log(Y) + γ
Multiplicative Treatment Effect

But if log(Y*) = log(Y) + γ, then equivalently, by exponentiating both sides:

  Y* = e^(log(Y) + γ) = e^(log(Y)) e^γ = Y e^γ

So, on the original scale of measurement, we have a multiplicative treatment effect model:

  Y* = Y e^γ.
Multiplicative Treatment Effect

If the multiplicative treatment effect model is appropriate, then the effect is e^γ, and we can estimate it using e^γ̂. Interpreting the multiplicative treatment effect model is straightforward:

  Outcomes under one treatment are estimated to be, on average, e^γ̂ times as large as those under the other treatment.

(Display 3.9)
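
A minimal R sketch of this estimate, assuming the (positive-valued) responses for the two groups sit in hypothetical vectors y_trt and y_ctl; γ̂ is the difference in sample means on the log scale:

  set.seed(3)
  # Illustrative stand-in responses (positive values) for two treatment groups
  y_trt <- rlnorm(20, meanlog = 2.3, sdlog = 0.5)
  y_ctl <- rlnorm(20, meanlog = 2.0, sdlog = 0.5)

  fit <- t.test(log(y_trt), log(y_ctl), var.equal = TRUE)

  gamma_hat <- mean(log(y_trt)) - mean(log(y_ctl))  # estimate of gamma
  exp(gamma_hat)     # estimated multiplicative effect, e^gamma-hat
  exp(fit$conf.int)  # back-transformed confidence interval (a ratio)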
Some Caution

The additive treatment effect model is slightly more complicated than I've represented.

- I've written Y* = Y + δ.

- But how does that actually relate to our situation, where X1, ..., Xn1 are independent responses for subjects (experimental units) exposed to one treatment and Y1, ..., Yn2 are independent responses for subjects exposed to a different treatment?

Some hand-written notes.


Checking Assumptions
Methods of Data Analysis I

October 11, 2017
Log Transformations

I recommend using any transformation, including the log transformation, sparingly.

1. The two-sample t-test is robust to departures from Normality, especially for large sample sizes.

2. The back-transformation interpretation of log-transformed data in a randomized experiment—i.e., that the estimated treatment effect is multiplicative—can be misleading.

3. The back-transformation interpretation of log-transformed data in observational studies—i.e., that the estimated effect is a ratio of population medians—rests on strong and unverifiable assumptions of symmetry of the population distributions on the log scale.
Transformations

Sometimes, transformations are called for and/or they are justified.

- For example: there may be literature to suggest that rainfall measurements follow exponential distributions (these are right-skewed distributions).

If you do need to use a log transformation, then it may be best simply to interpret your findings on the transformed scale.
Assumptions

Some important things to remember in the two-sample setting:

(1) Observations within samples must be statistically independent to use any of the methods we'll talk about in this course.

(2) If the two samples are paired, then you should use the paired t-test.
  - You should still examine a plot of the two samples, with lines connecting the pairs, to make sure that any difference in the pairs can be adequately explained by an additive effect.
  - Starting on Friday we'll look at some alternatives to the paired t-test.
Assumptions

(3) If the two underlying populations from which the samples are obtained are Normally shaped (and have equal variances), then the t-distribution is the exact distribution for evaluating differences in population means, regardless of the sample sizes.

(4) For underlying population distributions that are different from Normal, results using the t-distribution are fairly robust, and deviations from Normality can be overcome by increasing the sample sizes.

(5) There may be some situations where results using the t-distribution are not valid, and so we'll explore some alternatives.
Two Sample Alternatives
Methods of Data Analysis I

October 16, 2017


Cognitive Load Experiment

A randomized experiment to compare conventional teaching materials to modified materials.

- Treatments: conventional textbook solutions and modified worked examples; self-study.

- Response: time to solution of a "moderately difficult problem."

- Question: Is there a difference between the learning methods?

Cognitive Load Data
The Cognitive Load Experiment

- Both distributions are rather heavy-tailed.

- 5 observations are censored in the conventional-method distribution. An observation is censored if we do not know its true value, only that it is larger (or smaller) than a certain value.

We shouldn't use the methods based on the t-distribution for these data because of the censoring. Instead, we'll use the rank-sum test to assess the difference between the groups.
The Rank-Sum Test

- Also known as the Wilcoxon rank-sum test.

- This is an alternative to the methods based on the t-distribution that is resistant to outliers.

- It's based on the ranks (i.e., the order) of the data rather than the data themselves, making it useful when there are censored observations.

- It is almost as good as the t-distribution methods if the populations are Normal.

- It is better than the t-distribution methods if there are extreme outliers.
The Basic Idea

The essential idea: to determine differences between groups, we don't use the data directly; we use the ranks of the data.

- Rank all of the data from 1 to n1 + n2 (note: we'll deal with ties in a minute).

- Determine whether, on average, the lower (higher) ranks appear in one of the treatment groups or the other.
The Rank-Sum Statistic

1. List all observations from both samples in increasing order.

2. Identify which sample each observation came from.

3. Assign each observation a rank according to its order in the list. Tied observations are assigned the average of the ranks they would otherwise occupy.

4. The rank-sum statistic, T, is the sum of the ranks in one of the groups.

Note: censored observations are treated as ties (with the highest [lowest] ranks); see Display 4.5.
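
A minimal R sketch of this calculation on small made-up samples (not the case-study data):

  # Made-up example data
  group1 <- c(12.1, 15.3, 9.8, 20.0)
  group2 <- c(14.2, 18.7, 22.5, 16.0, 19.9)

  values <- c(group1, group2)
  labels <- rep(1:2, times = c(length(group1), length(group2)))

  r <- rank(values)              # rank() assigns tied values the average rank
  T_stat <- sum(r[labels == 1])  # rank-sum statistic: sum of group-1 ranks
  T_stat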
The Rank-Sum Test

The null hypothesis for randomized experiments is written in terms of the additive treatment effect:

  H0: δ = 0

The null hypothesis for observational studies is a little more vague:

  H0: the two population distributions are identical

In either case, if the null hypothesis is true, then the n1 ranks in group 1 are a random sample from the n1 + n2 possible ranks. This is the basis for our test.
The Sampling Distribution of T

The sampling distribution of the rank-sum statistic, T, is the randomization or permutation distribution of T.

- Recall that the randomization distribution applies in the case of a randomized experiment—it's the histogram of T under all possible re-randomizations.

- The permutation distribution applies to observational studies—it's the histogram of T under all possible allocations of the responses to the two groups.
The Sampling Distribution of T

In either case (randomized experiment or observational study), we can approximate this sampling distribution using a Normal distribution.

It turns out that this approximation works well except when the sample sizes are small (e.g., 5 or so) or there are a lot of ties (i.e., many of the ranks are the same).
Rank-Sum in R

You’re going to see in lab this week a couple versions of the


rank-sum test:

1. wilcox.test is based on the version of the rank-sum procedure


that I just described.

2. wilcox.exact makes an exact p-value calculation, rather than


relying on the Normality of the test statistic.

The two approaches give virtually identical results when there are
not a lot of ties, and when the sample sizes are relatively large.
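
For example, with the made-up group1 and group2 vectors from the earlier sketch (wilcox.exact lives in the exactRankTests package, assumed installed here):

  wilcox.test(group1, group2)    # rank-sum test as described above

  library(exactRankTests)        # provides wilcox.exact (assumed installed)
  wilcox.exact(group1, group2)   # exact p-value calculation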
The Challenger Data

Launch Temperature    Number of O-ring incidents
Below 65°F            1 1 1 3
Above 65°F            0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2

The Challenger Data

- We can't reasonably say that these data are from Normal populations.

- The sample sizes are small (n1 = 4, n2 = 20) and not the same.

- There are lots of zeros, so a log transformation is certainly not recommended.

- The methods based on t-distributions are just not appropriate here.

- Instead, we'll calculate a one-sided p-value from a permutation test on a test statistic and see that there's evidence of an association between O-ring failure and temperature.
Permutation Test

Suppose we want to use a statistic, T, calculated from two samples—one of size n1 and one of size n2—to compare two populations.

The null hypothesis for the permutation test is:

  H0: the two population distributions are the same.

Assuming the null hypothesis to be true, we can create the permutation distribution of T—this is just the histogram of T's recalculated from all possible permutations of the observations into the two groups (maintaining the observed sample sizes).
Permutation Tests

The p-value for this test is determined as follows:

1. Denote by T_obs the value of the test statistic for the observed data.

2. Determine how many permutations of the two samples there are; call this N.

3. Count up the number of those permutations that result in values of T that are as extreme or more extreme (relative to the null hypothesis) as T_obs; call this m.

4. The p-value is just the proportion m/N.
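
The next few slides carry out this count exactly for the Challenger data. When N is too large to enumerate, the same p-value can be approximated by sampling random permutations; a minimal R sketch, using the Challenger counts from the table above and the group-1 sum as the statistic:

  set.seed(4)
  x1 <- c(1, 1, 1, 3)              # below 65°F (n1 = 4)
  x2 <- c(rep(0, 17), 1, 1, 2)     # above 65°F (n2 = 20)
  values <- c(x1, x2)

  T_obs <- sum(x1)                 # observed statistic: 6 incidents

  # Sample many random re-allocations of 4 observations to group 1
  perm_T <- replicate(100000, sum(sample(values, length(x1))))

  mean(perm_T >= T_obs)            # approximate one-sided p-value (close to 0.0099)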


The Number of Permutations

To calculate a permutation test p-value, we need to know the number of ways it is possible to permute n1 + n2 observations into two groups, one of size n1 and the other of size n2.

This is a well-known problem, and its solution is

  (n1 + n2 choose n1) = (n1 + n2)! / (n1! n2!)

This is read "n1 + n2 choose n1", and recall that n! = n × (n − 1) × (n − 2) × · · · × 1.
The Number of Permutations

For the Challenger data, n1 + n2 = 24 and n1 = 4, so

  (24 choose 4) = 24! / (4! 20!) = (24 · 23 · 22 · 21) / (4 · 3 · 2 · 1) = 10,626

This number is the denominator of our p-value calculation.
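
As a quick check in R:

  choose(24, 4)                                   # 10626
  factorial(24) / (factorial(4) * factorial(20))  # same value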


Calculating the P-Value

Now, we need to count up those permutations that lead to a value of the chosen test statistic that is as extreme or more extreme than the one we actually observed.

- In the Challenger data, there are a total of 6 O-ring incidents observed for launch temperatures less than 65°F, so let's take T_obs = 6, the sum of the number of failures in the cold-temperature group.

- Now we need to count up the number of permutations that would yield a total of 6 or more O-ring incidents in the cold-temperature group.
Counting Permutations

First, for the Challenger data, let's write down the specific permutations that result in a sum of 6 or more in the cold-temperature group:

  permutation    sum in the group
  1, 1, 1, 3     6
  1, 1, 2, 3     7
  0, 1, 2, 3     6

Now we have to ask: how many WAYS can each of these permutations occur?
Counting Permutations

In the Challenger data, there are 17 zeros, 5 ones, 1 two, and 1 three. Let's first focus on the number of ways to get the permutation 1, 1, 1, 3.

For this, we need to calculate the number of ways to choose 3 ones out of the 5 available AND to choose 1 three out of the 1 available. This is another common problem in combinatorics, and there is a known solution.
Counting Permutations

The number of ways to choose 3 ones out of the 5 available AND to choose 1 three out of the 1 available is given by:

  (5 choose 3) × (1 choose 1) = 5! / (3! 2!) × 1
                              = (5 × 4 × 3 × 2 × 1) / (3 × 2 × 1 × 2 × 1)
                              = 10.
Counting Permutations

Now, let’s focus on the number of ways to get the permutation


1,1,2,3. This is just:
   
5 1 1
= 10
2 1 1

Finally, the number of ways to get the permutation 0,1,2,3 is


    
17 5 1 1
= 17 × 5 = 85.
1 1 1 1
Calculating the P-value

Therefore, altogether, there are 10 + 10 + 85 = 105 ways to obtain a sum in the cold-temperature group that's at least as large as 6.

A one-sided p-value for a test of whether the two populations are the same is then:

  105 / 10626 = 0.00988.

That is, we have strong evidence of a difference in the number of O-ring failures between the two temperature groups.
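
The same counts, and the p-value, can be reproduced directly in R:

  ways <- choose(5, 3) * choose(1, 1) +                          # 1, 1, 1, 3
    choose(5, 2) * choose(1, 1) * choose(1, 1) +                 # 1, 1, 2, 3
    choose(17, 1) * choose(5, 1) * choose(1, 1) * choose(1, 1)   # 0, 1, 2, 3
  ways                   # 105
  ways / choose(24, 4)   # one-sided p-value: 0.00988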
Two Sample Alternatives
Methods of Data Analysis I

October 18, 2017


The Sign Test

This is a test that is designed for paired data (e.g., the identical twins data).

- It's another distribution-free test; resistant to outliers.

- The idea:
  - count the number of pairs in which the observation in group 1 exceeds that in group 2
  - under the null hypothesis of no difference, this count should be roughly one half of the pairs

In R, you can perform the sign test using the binom.test function.
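
A minimal sketch with binom.test, reusing the stand-in unaffected and affected vectors from the earlier twins sketch:

  d <- unaffected - affected
  k <- sum(d > 0)            # pairs in which group 1 exceeds group 2
  n <- sum(d != 0)           # pairs with exact ties are dropped
  binom.test(k, n, p = 0.5)  # sign test: is k consistent with half the pairs?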
The Wilcoxon Signed-Rank Test

Like the Sign Test, this is a distribution-free and resistant test for paired samples.

In this test, we use the ranks of the observations, in addition to their signs:

1. Rank the absolute values of the paired differences.

2. Add up the ranks of the pairs for which the difference is positive.

3. Compare this test statistic to what we would expect if there were no difference in the pairs.
The Wilcoxon Signed-Rank Test

Essentially, the signed-rank test is the rank-sum test for paired samples.

- The idea is that we create two groups—one in which the paired differences are positive and the other in which the paired differences are negative.

- Then, we perform the Wilcoxon rank-sum test using these two "groups."

Use wilcox.exact with paired = TRUE.
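
For example, with the stand-in twins vectors from earlier (base R's wilcox.test accepts the same paired argument):

  wilcox.exact(unaffected, affected, paired = TRUE)  # exactRankTests package
  wilcox.test(unaffected, affected, paired = TRUE)   # base-R version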


Levene’s Test for Variances

Sometimes, we want to compare the variances of two populations. Levene's test provides a method for this.

Suppose there are n1 observations, Y1i, from population 1, with standard deviation σ1, and n2 observations, Y2i, from population 2, with standard deviation σ2. Let

  Z1i = (Y1i − Ȳ1)²   and   Z2i = (Y2i − Ȳ2)²

Then the Z1's and Z2's are samples from populations with means (approximately) σ1² and σ2², respectively.

So, perform a two-sample t-test for means using the Z's.

The leveneTest function in the car package implements a version of this test.
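
A minimal sketch of the by-hand version described above, on made-up samples; leveneTest from the car package (assumed installed) is the packaged alternative:

  set.seed(5)
  y1 <- rnorm(30, mean = 10, sd = 2)   # made-up sample 1
  y2 <- rnorm(25, mean = 12, sd = 3)   # made-up sample 2

  z1 <- (y1 - mean(y1))^2              # squared deviations, sample 1
  z2 <- (y2 - mean(y2))^2              # squared deviations, sample 2
  t.test(z1, z2)                       # two-sample t-test on the Z's

  library(car)                         # provides leveneTest (assumed installed)
  leveneTest(c(y1, y2), group = factor(rep(1:2, times = c(30, 25))))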


F-Test for Equal Variance
Another test for equal variances is called the F-test for equal variances, and it's based on the test statistic

  F = s1² / s2²,

where s1² and s2² are the sample variances from samples 1 and 2, respectively.

Provided that the two samples are drawn from Normal populations, the sampling distribution of F is an F-distribution with n1 − 1 and n2 − 1 df.

This test is not robust to departures from the Normality assumption.

var.test in R.
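
For example, with the made-up y1 and y2 from the Levene sketch above:

  var(y1) / var(y2)   # the F statistic, s1^2 / s2^2
  var.test(y1, y2)    # F-test for equal variances (assumes Normal populations)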
