
Hypothesis Testing

Reject or Fail to Reject? That is the question!

Objectives

- Sample vs. Population: Is there a difference?
- Null and Research Hypotheses: What are they, what do they look like, and what do they mean?
- A Good Hypothesis: What are the criteria?
- Testing the Hypothesis: The six-step program

Samples and Populations

How do we select?

[Diagram: a sample is drawn from the population. Sample statistics (X̄ of the sample) are used to make inferences about population parameters (μ of the population).]

Samples and Populations Contd.

- Samples must match the characteristics of the population.
- Similarity results in generalizability.
- The type of sample affects research quality:
  - Systematic sample
  - Random sample
  - Sample of convenience
  - Volunteerism

Pitfalls of Samples

Sample Bias
- Some subgroups are over-represented.
- The sample is not representative of the population.

Sampling Error
- The sample group does not give an accurate picture of the population.
- Reduce it by enlarging the sample: the larger the sample, the less error.

Standard Error
- A measure of how much sampling error is likely to occur when a sample is extracted from a population.
- It is the standard deviation of the values of the sampling distribution.
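
The link between sample size and standard error can be sketched numerically. The snippet below is a minimal illustration, assuming the usual estimate of the standard error of the mean (sample standard deviation divided by the square root of n); the scores are hypothetical and invented only for this example.

```python
import math
import statistics


def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical scores, invented purely for illustration.
small_sample = [4, 8, 6, 5, 7]
# The same values repeated, so the spread is similar but n is four times larger.
large_sample = small_sample * 4

se_small = standard_error(small_sample)
se_large = standard_error(large_sample)

print(f"n = {len(small_sample)}: standard error = {se_small:.3f}")
print(f"n = {len(large_sample)}: standard error = {se_large:.3f}")
```

The larger sample gives the smaller standard error, which is the "larger sample, less error" point made above.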

Null and Research Hypotheses

Hypothesis
- Educated guess
- Reflects the research problem being investigated
- Determines the techniques for testing the research questions
- Should be grounded in theory

Purposes of the Null Hypothesis

Acts as a starting point
- It is the state of affairs accepted as true in the absence of any other information.
- Until a systematic difference is shown, assume that any difference observed is due to chance.
- The researcher's job is to eliminate chance factors and evaluate other factors that may contribute to group differences.

Null Hypothesis Purpose #2

Provides a benchmark to measure actual outcomes
- How likely is it that outcomes are due to some other factor?
- Helps define the range within which observed differences can reasonably be attributed to chance or to something other than chance.

Null Hypotheses

Usually a statement of no differences or no associations: an equality.

Sentence: There will be no difference between at-home and pre-school program children on pre-social and pre-literacy tests.

Symbols:
H0: μ at-home = μ day-care
H0: μ at-home − μ day-care = 0

Research/Alternative Hypotheses

A statement of a relationship between the variables: an inequality.
- May be nondirectional (two-tailed).
- May be directional (one-tailed), which is more powerful when the effect lies in the predicted direction, because the one-tailed p-value is half the two-tailed value.

Nondirectional Alternative Hyp.

Reflects a difference between groups, but the direction of the difference is not specified.

Nondirectional Sentence: There is a difference between at-home and pre-school program children on pre-social and pre-literacy tests.

Nondirectional Symbols:
Ha: μ at-home ≠ μ day-care

Directional Alternative Hyp.

Reflects a difference between groups, and the direction of the difference is specified.

Directional Sentence: Children in pre-school programs will have higher pre-social and pre-literacy scores than children who stay at home.

Directional Symbols:
Ha: μ at-home < μ day-care

What Makes a Good Hypothesis?

A good hypothesis:
- is stated in declarative form and not as a question.
- posits an expected relationship between variables.
- reflects the theory or literature on which it is based.
- is brief and to the point.
- is testable, which means that one can carry out the intent of the question reflected by the hypothesis.

Six Steps of Hypothesis Testing

1. State the null hypothesis.
2. State the alternative hypothesis.
3. Select a level of significance.
4. Collect and summarize the sample data.
5. Refer to a criterion for evaluating the sample evidence.
6. Make a decision to keep/reject the null.

Step 1: State the Null Hypothesis

- States that there is no relationship between the variables.
- Refers to the population.

Examples of the Null Hypothesis

Written: There are no differences in the pre, mid, and post test scores of students who are enrolled in Headstart, daycare, or homecare.
Symbols: H0: μ pre = μ mid = μ post

Written: There are no differences in the literacy scores between students who are in Headstart, daycare, or homecare.
Symbols: H0: μ Headstart = μ daycare = μ homecare

Written: The pattern of differences among the cell means (μs) of pre, mid, and post test literacy scores in the first column (or row) accurately describes the pattern of differences among the cell means in at least one other column (row).
Symbols: H0: μjk − μj'k = μjk' − μj'k', for all rows and columns, for all combinations of both j and j', k and k'.

Step 2: State the Alternative Hypothesis

- Symbolically referred to as Ha.
- States the opposite of the H0.

Examples of Alternative Hypothesis

Written: There are differences within the pre, mid, and post test scores of students who are in Headstart, daycare, or homecare.
Symbols: Ha: μ pre ≠ μ mid ≠ μ post, for at least one pair.

Written: There are differences in the scores between the students in Headstart, daycare, or homecare who took the pre, mid, or post literacy skills test.
Symbols: Ha: μ Headstart ≠ μ daycare ≠ μ homecare, for at least one pair.

Written: The pattern of differences among the cell means (μs) of pre, mid, and post literacy skills scores in the first column (or first row) fails to describe accurately the pattern of differences among the cell means in at least one other column (row).
Symbols: Ha: μjk − μj'k ≠ μjk' − μj'k', for all rows and columns, for all combinations of both j and j', k and k'.

Step 3: Select a Level of Significance

- Most researchers select a small number such as 0.001, 0.01, or 0.05.
- The most common choice is 0.05.
- Otherwise known as the alpha level: p = 0.05, α = 0.05.
- The significance level serves as a scientific cutoff point that determines what decision will be made concerning the null hypothesis.

Type I and Type II Errors

Mistakes can occur:
- Type I Error designates the mistake of rejecting the H0 when the null is actually true.
- When the level of significance is set at 0.05, this means the chance of a Type I error becomes equal to 1 out of 20.

Type II Errors

- Designates the mistake made if H0 is not rejected when the null is actually false.

Step 4: Collection and Analysis of Sample Data

- The summary of the sample data will always lead to a single numerical value, which is referred to as the calculated value (r, t, or F).
- The computer calculates the probability of the above value in the form of p = ____.

Step 5: The Criterion for Evaluating the Sample Evidence

Two methods:
- Compare the calculated and critical values.
- Compare the data-based p-value against a preset point on the 0-1 scale (the level of significance) below which p must fall.

Step 6: Make a Decision!

Reject the Null if the p-value is less than the established level of significance.
- A statistically significant difference was obtained.
- p < 0.05

Fail to Reject the Null

Retain the Null if the p-value is greater than the established level of significance.
- H0 was tenable.
- The null was retained.
- No significant difference was found.
- The result was not statistically significant.
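
The decision rule in Step 6 can be written as a very small function. This is only a sketch of the rule as stated above; the function name `decide` and the example p-values are hypothetical.

```python
def decide(p_value, alpha=0.05):
    """Return the Step 6 decision for a p-value and a preset significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


# Hypothetical p-values, for illustration only.
print(decide(0.03))              # p is below the default alpha of 0.05
print(decide(0.20))              # p is above alpha
print(decide(0.03, alpha=0.01))  # the same p under a stricter alpha
```

Note that the same p-value can lead to different decisions under different significance levels, which is why the level is chosen in Step 3, before the data are examined.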

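Rejection Region

Lower-tailed test: H0: μ ≥ 0, H1: μ < 0. Reject H0 in the lower tail (area α); the sample statistic must be significantly below μ = 0.

Upper-tailed test: H0: μ ≤ 0, H1: μ > 0. Reject H0 in the upper tail (area α); small values don't contradict H0, so don't reject H0!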

t-Test: σ Unknown

Assumptions
- Population is normally distributed.
- If not normal, only slightly skewed and a large sample taken (central limit theorem applies).

Parametric test procedure

t test statistic, with n − 1 degrees of freedom:

t = (X̄ − μ) / (S / √n)

Degrees of Freedom

# in sample − number of parameters that must be estimated before the test statistic can be computed. For a single-sample t-test, we must first estimate the mean before we can estimate the standard deviation. Once the mean is estimated, only n − 1 of the values are left free to vary, since we know that the nth value is equal to

n·x̄ − Σ(i = 1 to n − 1) x_i

Example: One Tail t-Test

Does an average box of cereal contain more than 368 grams of cereal? A random sample of 36 boxes showed X̄ = 372.5 and a sample standard deviation S = 15. Test at the α = 0.01 level.

σ is not given, so the t test is used.

H0: μ ≤ 368
H1: μ > 368
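
The cereal example can be worked through with the t statistic defined earlier. The sketch below assumes that 15 is the sample standard deviation (σ is stated to be unknown) and takes the upper-tail critical value for 35 degrees of freedom at α = 0.01 from a standard t table rather than computing it; the function name is hypothetical.

```python
import math


def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t = (X-bar - mu) / (S / sqrt(n)), with n - 1 degrees of freedom."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


t_stat = one_sample_t(sample_mean=372.5, hypothesized_mean=368, sample_sd=15, n=36)
degrees_of_freedom = 36 - 1

# Assumption: one-tailed critical value for df = 35, alpha = 0.01, read from a t table.
T_CRITICAL = 2.4377

reject_h0 = t_stat > T_CRITICAL
print(f"t = {t_stat:.2f}, df = {degrees_of_freedom}, reject H0: {reject_h0}")
```

Under these assumptions the calculated t does not exceed the critical value, so H0 is not rejected: there is no evidence that the average box holds more than 368 grams.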

Z test for proportions

The null hypothesis for the proportion also implies we know the variance, since the variance is just P times (1-P). This is a good approximation when the sample size is large. If the sample size is small, we could use the binomial distribution to compute the exact p value that a sample of size n would yield a sample proportion ps given the population proportion P. Using the normal approximation is much easier.

Example: Z Test for Proportion

Problem: A marketing company claims that it receives 4% responses from its mailing.
Approach: To test this claim, a random sample of 500 were surveyed, with 25 responses.
Solution: Test at the α = .05 significance level.
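
The mailing example can be checked with the normal approximation described above. This is a minimal sketch, assuming a two-tailed test of H0: P = 0.04 and using the variance P(1 − P) implied by the null hypothesis; the function name is hypothetical.

```python
import math


def z_test_proportion(successes, n, p0):
    """Z statistic and two-sided p-value for a sample proportion under H0: P = p0."""
    p_sample = successes / n
    # Under H0 the variance of a single response is p0 * (1 - p0).
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_sample - p0) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


z, p_value = z_test_proportion(successes=25, n=500, p0=0.04)
print(f"Z = {z:.2f}, two-sided p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

With 25 responses out of 500 the sample proportion is 0.05, and the resulting Z is well inside ±1.96, so the 4% claim is not rejected at the .05 level.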

The Mann-Whitney U test

Suppose that you have acquired a set of measurements from 2 different sites.
- Maybe one is alleged to be polluted, the other clean, and you measure residues in the soil.
- Maybe these are questionnaire returns from students identified as M or F.

You want to know whether these 2 sets of measurements genuinely differ. The issue here is that you need to rule out the possibility of the results being random noise.

The formal procedure:

Involves the creation of two competing explanations for the data recorded.
- Idea 1: These are pattern-less random data. Any observed patterns are due to chance. This is the null hypothesis H0.
- Idea 2: There is a defined pattern in the data. This is the alternative hypothesis H1.

Without the statement of the competing hypotheses, no meaningful test can be run.

Example

You conduct a questionnaire survey of homes in the Heathrow flight path, and also a control population of homes in South West London. Responses to the question "How intrusive is plane noise in your daily life?" are tabulated:

Noise complaints (1 = no complaint, 5 = very unhappy)

Homes near airport    Control site
5                     3
4                     2
4                     4
3                     1
5                     2
4                     1
5

Stage 1: Eyeball the data!

- These data are ordinal, but not normally distributed (allowable scores are 1, 2, 3, 4 or 5).
- Use non-parametric statistics.
- It does look as though people are less happy under the flightpath, but recall that we must state our hypotheses H0, H1.

H0: There is no difference in attitudes to plane noise between the two areas; any observed differences are due to chance.
H1: Responses to the question differed between the two areas.

Now we assess how likely it is that this pattern could occur by chance:
- This is done by performing a calculation. Don't worry yet about what the calculation entails.
- What matters is that the calculation gives an answer (a test statistic) whose likelihood can be looked up in tables. Thus by means of this tool (the test statistic) we can work out an estimate of the probability that the

One philosophical hurdle to go:

- The test statistic generates a probability p, a number from 0 to 1: the probability of obtaining data at least this extreme if H0 is true (often loosely read as the probability of H0 being true).
- If p = 0, H0 is certainly false. (Actually this is over-simple, but a good approximation.)
- If p is large, say p = 0.8, H0 cannot be rejected.
- But how about p = 0.1, p = 0.01?

Significance

We have to define a threshold, a boundary, and say that if p is below this threshold H0 is rejected; otherwise H0 is retained. This boundary is called the significance level. By convention it is set at p = 0.05 (1:20), but you can choose any other number, as long as you specify it in the write-up of your analyses.

WARNING!! This means that if you analyse 100 sets of random data, the expectation (long-term average) is that 5 will generate a significant test.
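
The warning about random data can be illustrated with a small simulation. The sketch below is not part of the original procedure: it assumes a simple two-sided Z test with known σ applied to pure noise, and the trial count, sample size and seed are arbitrary choices made for illustration.

```python
import math
import random


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: mean = mu0 when sigma is known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(trials=4000, n=20, alpha=0.05, seed=1):
    """Share of pure-noise data sets that come out 'significant' at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # H0 is true by construction: the data are standard normal noise.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample, mu0=0.0, sigma=1.0) < alpha:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"Share of random data sets flagged as significant: {rate:.3f}")
```

The printed share should sit close to the chosen significance level, which is the "5 in 100" expectation described above.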

The procedure:

1. Set up H0, H1. Decide significance level p = 0.05.
2. Data: homes near airport 5 4 4 3 5 4 5; control site 3 2 4 1 2 1.
3. Test statistic: U = 15.5.
4. Probability under H0: p = 0.03.
5. Is p above critical level? Yes: accept H0. No: reject H0.

This particular test:

- The Mann-Whitney U test is a non-parametric test which examines whether 2 columns of data could have come from the same population (i.e. should be the same).
- It generates a test statistic called U (no idea why it's U). By hand we look U up in tables; PCs give you an exact probability.
- It requires 2 sets of data. These need not be paired, nor need they be normally distributed, nor need there be equal numbers in each set.

How to do it

1: Rank all data into ascending order, then re-code the data set, replacing raw data with ranks.
2: Harmonize ranks where the same value occurs more than once.

Homes near airport                        Control site
Data   Rank order   Harmonized rank       Data   Rank order   Harmonized rank
5      #13          12                    3      #5           5.5
4      #10          8.5                   2      #4           3.5
4      #9           8.5                   4      #7           8.5
3      #6           5.5                   1      #2           1.5
5      #12          12                    2      #3           3.5
4      #8           8.5                   1      #1           1.5
5      #11          12

Once data are ranked:

- Add up ranks for each column; call these rx and ry.
- (Optional but a good check: rx + ry = n²/2 + n/2, where n is the total number of observations, or you have an error.)

Calculate
Ux = NxNy + Nx(Nx+1)/2 - Rx
Uy = NxNy + Ny(Ny+1)/2 - Ry

Take the SMALLER of these 2 values and look it up in tables. If U is LESS than the critical value, reject H0. NB This test is unique in one feature: Here low values of the

In this case:

Homes near airport (ranks): 12, 8.5, 8.5, 5.5, 12, 8.5, 12; rx = 67
Control site (ranks): 5.5, 3.5, 8.5, 1.5, 3.5, 1.5; ry = 24

Check: rx + ry = 91; 13*13/2 + 13/2 = 91. CHECK.

Ux = 6*7 + 7*8/2 - 67 = 3
Uy = 6*7 + 6*7/2 - 24 = 39

Lowest U value is 3. Critical value of U (7,6) = 4 at p = 0.01. Calculated U is < tabulated U, so reject H0. At p = 0.01 these two sets of data differ.
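
The hand calculation can be reproduced in a few lines. The sketch below follows the steps given above (rank, harmonize ties, sum ranks, apply the Ux and Uy formulas) using the airport survey data; the function names are hypothetical, and the critical value still has to be looked up in tables.

```python
def average_ranks(values):
    """Map each distinct value to its harmonized (tie-averaged) rank, starting at 1."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return {value: sum(p) / len(p) for value, p in positions.items()}


def mann_whitney_u(x, y):
    """Return (rx, ry, ux, uy) using the formulas from the slides."""
    rank_of = average_ranks(x + y)
    rx = sum(rank_of[v] for v in x)
    ry = sum(rank_of[v] for v in y)
    nx, ny = len(x), len(y)
    ux = nx * ny + nx * (nx + 1) / 2 - rx
    uy = nx * ny + ny * (ny + 1) / 2 - ry
    return rx, ry, ux, uy


airport = [5, 4, 4, 3, 5, 4, 5]
control = [3, 2, 4, 1, 2, 1]

rx, ry, ux, uy = mann_whitney_u(airport, control)
n = len(airport) + len(control)

print(f"rx = {rx}, ry = {ry}, check total = {n * (n + 1) / 2}")
print(f"Ux = {ux}, Uy = {uy}, U = {min(ux, uy)}")
```

The smaller of Ux and Uy is the value compared against the tabulated critical U.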

Tails: generally use 2-tailed tests

- 2-tailed test: These populations DIFFER (rejection regions in both the lower and the upper tail of the distribution).
- 1-tailed test: Population X is greater than Y (or less than Y) (rejection region in one tail of the distribution only).

Kruskal-Wallis: The U test's big cousin

When we have 2 groups to compare (M/F, site 1/site 2, etc.) the U test is applicable and safe. How do we handle cases with 3 or more groups? The simple answer is to run the Kruskal-Wallis test. This is run on a PC, but behaves very much like the M-W U. It will give one significance value, which simply tells you whether at least one group differs from one other.

- Males vs. Females (2 groups, U test): Do males differ from females?
- Site 1 vs. Site 2 vs. Site 3 (3 groups, Kruskal-Wallis): Do results differ between these sites?

Wilcoxon Rank Sum Test

The Z test and the t test are parametric tests; that is, they answer a question about the difference between populations by comparing sample statistics (e.g., X̄1 and X̄2) and making an inference to the population parameters (μ1 and μ2). The Wilcoxon, in contrast, allows inferences about whole populations.

[Figure: Distribution A and Distribution B plotted against X. Note that distribution B is shifted to the right of distribution A.]

1b. Small samples, independent groups

Wilcoxon Rank Sum Test
- First, combine the two samples and rank order all the observations.
- The smallest number has rank 1; the largest number has rank N (= sum of n1 and n2).
- Separate the samples and add up the ranks for the smaller sample. (If n1 = n2, choose either one.)
- Test statistic: rank sum T for the smaller sample.

One-tailed Hypotheses
H0: Prob. distributions for the 2 sampled populations are identical.
HA: Prob. distribution for Population A is shifted to the right of the distribution for Population B. (Note: could be to the left, but must be one or the other, not both.)

Two-tailed Hypotheses
H0: Prob. distributions for the 2 sampled populations are identical.
HA: Prob. distribution for Population A is shifted to the right or left of the distribution for Population B.

Rejection region:
(With the sample taken from Population A being smaller than the sample for Population B) reject H0 if TA ≥ TU or TA ≤ TL.

Wilcoxon Test for n1 ≥ 10 and n2 ≥ 10

Test statistic:

Z = [TA − n1(n1 + n2 + 1)/2] / √[n1n2(n1 + n2 + 1)/12]

Rejection region:
- One-tailed: Z > Zα
- Two-tailed: |Z| > Zα/2

Note: use this only when n1 ≥ 10 and n2 ≥ 10.

Example 1

These are small samples, and they are independent (random samples of Cajun and Creole dishes). Therefore, we have to begin with the test of equality of variances.

Test of hypothesis of equal variances

H0: σ1² = σ2²
HA: σ1² ≠ σ2²

Test statistic: F = S1² / S2²
Rej. region: F > Fα/2 = F(6,6,.025) = 5.82 or F < (1/5.82) = .172

S²Cajun = (385.27)² = 148432.14
S²Creole = (1027.54)² = 1055833.33

Fobt = 1055833.33 / 148432.14 = 7.11 (larger variance in the numerator)

Reject H0: variances are not equal, so we do the Wilcoxon.
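
The variance ratio can be recomputed from the raw heat values listed in the Example 1 data table. This is a sketch under the assumption that those values are read correctly from the table; the critical value 5.82 is the one quoted above, not computed here.

```python
import statistics

# Heat values as listed in the Example 1 data table.
cajun = [3500, 4200, 4100, 4700, 4200, 3705, 4100]
creole = [3100, 4700, 2700, 3500, 2000, 3100, 1550]

var_cajun = statistics.variance(cajun)    # sample variance, n - 1 in the denominator
var_creole = statistics.variance(creole)

# Put the larger variance in the numerator so only the upper critical value is needed.
f_obt = max(var_cajun, var_creole) / min(var_cajun, var_creole)

F_UPPER = 5.82  # F(6, 6, .025), as quoted in the example
variances_differ = f_obt > F_UPPER

print(f"S2 Cajun = {var_cajun:.2f}, S2 Creole = {var_creole:.2f}")
print(f"F = {f_obt:.2f}, reject equal variances: {variances_differ}")
```

Dividing the smaller variance by the larger gives the reciprocal, about 0.14, which falls below the lower bound of .172 and leads to the same decision.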


Example 1 Wilcoxon Rank Sum Test

H0: Prob. distributions for Cajun and Creole populations are identical.
HA: Prob. distribution for Cajun is shifted to right of distribution for Creole.

Statistical test: Wilcoxon rank sum T
Rejection region: Reject H0 if TCajun > 66 (or if TCreole < 39)
(Note: We shall give lower heat values lower rank values.)

Cajun            Creole
Value   Rank     Value   Rank
3500    6.5      3100    4.5
4200    11.5     4700    13.5
4100    9.5      2700    3
4700    13.5     3500    6.5
4200    11.5     2000    2
3705    8        3100    4.5
4100    9.5      1550    1
Sum     70       Sum     35

Calculation check: the sum of the ranks should = (n)(n+1)/2
70 + 35 = 105 = (14)(15)/2

TCajun = 70 > 66 (and TCreole = 35 < 39)

Therefore, reject H0: Cajun dishes are significantly hotter than Creole dishes.
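
The rank sums for Example 1 can be verified with a short script. This is a minimal sketch of the ranking procedure described earlier (combine, rank with ties averaged, sum per sample); the helper names are hypothetical and the cutoff of 66 is the tabled value quoted in the example.

```python
def average_ranks(values):
    """Map each distinct value to its tie-averaged rank; the smallest value gets rank 1."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return {value: sum(p) / len(p) for value, p in positions.items()}


def rank_sum(sample, other):
    """Sum of the ranks of `sample` within the combined data."""
    rank_of = average_ranks(sample + other)
    return sum(rank_of[v] for v in sample)


cajun = [3500, 4200, 4100, 4700, 4200, 3705, 4100]
creole = [3100, 4700, 2700, 3500, 2000, 3100, 1550]

t_cajun = rank_sum(cajun, creole)
t_creole = rank_sum(creole, cajun)
n_total = len(cajun) + len(creole)

T_UPPER = 66  # upper critical value quoted in the example
reject_h0 = t_cajun > T_UPPER

print(f"T Cajun = {t_cajun}, T Creole = {t_creole}")
print(f"Check: {t_cajun + t_creole} should equal {n_total * (n_total + 1) / 2}")
print(f"Reject H0: {reject_h0}")
```

The same two helpers can be reused for Examples 2 and 3 by swapping in their data and critical values.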

Example 2 Wilcoxon Rank Sum Test

H0: σ1² = σ2²
HA: σ1² ≠ σ2²

Test statistic: F = S1² / S2²
Rej. region: F > Fα/2 = F(7,8,.025) = 4.53 or F < (1/4.90) = .204

Fobt = 4.316 / .46 = 9.38

Reject H0: do Wilcoxon.

H0: Prob. distributions for females and males populations are identical.
HA: Prob. distribution for females is shifted to left of distribution for males.

Statistical test: Wilcoxon rank sum T
Rejection region: T > TU = 90 (or T < TL = 54)

Females          Males
Value   Rank     Value   Rank
6.4     16       2.7     3
1.7     1        3.9     10
3.2     5        4.6     12
5.9     15       3.0     4
2.0     2        3.4     6.5
3.6     8        4.1     11
5.4     14       3.4     6.5
7.2     17       4.7     13
                 3.8     9
Sum     78       Sum     75

T = 78 < TU = 90 (and T = 78 > TL = 54)

Therefore, do not reject H0: no evidence that mean distance in females is less than that in males.

Example 3 Wilcoxon Rank Sum Test

H0: σ1² = σ2²
HA: σ1² ≠ σ2²

Test statistic: F = S1² / S2²
Rej. region: F > Fα/2 = F(5,5,.025) = 7.15 or F < (1/7.15) = .140

Fobt = (7.563)² / (2.04)² = 57.20 / 4.16 = 13.74

Reject H0: do Wilcoxon.

H0: Prob. distributions for Hoodoo and Mukluk populations are identical.
HA: Prob. distribution for Hoodoos is shifted to right or left of distribution for Mukluks.

Statistical test: Wilcoxon rank sum T
Rejection region: TH > 52 or TH < 26

Hoodoo           Mukluk
Value   Rank     Value   Rank
2       1        6       5
6       5        8       9.5
4       2.5      7       7.5
23      12       10      11
7       7.5      8       9.5
6       5        4       2.5
Sum     33       Sum     45

Check: TH + TM = 78; (12)(13)/2 = 78

TH = 33 > 26 and < 52

Do not reject H0: no evidence for a significant difference between teams.

The Kruskal-Wallis (KW) Test for Comparing Populations with Unknown Distributions

- A nonparametric test for comparing population medians, by Kruskal and Wallis.
- The KW procedure tests the null hypothesis that k samples from possibly different populations actually originate from similar populations, at least as far as their central tendencies, or medians, are concerned. The test assumes that the variables under consideration have underlying continuous distributions.
- In what follows assume we have k samples, and the sample size of the i-th sample is ni, i = 1, 2, . . ., k.

Test based on ranks of combined data

- In the computation of the KW statistic, each observation is replaced by its rank in an ordered combination of all the k samples.
- By this we mean that the data from the k samples combined are ranked in a single series.
- The minimum observation is replaced by a rank of 1, the next-to-the-smallest by a rank of 2, and the largest or maximum observation is replaced by the rank of N, where N is the total number of observations in all the samples (N is the sum of the ni).


Compute the sum of the ranks for each sample

- The next step is to compute the sum of the ranks for each of the original samples.
- The KW test determines whether these sums of ranks are so different by sample that they are not likely to have all come from the same population.

Test statistic follows χ²

It can be shown that if the k samples come from the same population, that is, if the null hypothesis is true, then the test statistic, H, used in the KW procedure is distributed approximately as a chi-square statistic with df = k - 1, provided that the sample sizes of the k samples are not too small (say, ni > 4, for all i). H is defined as follows:

H = [12 / (N(N + 1))] × Σ(i = 1 to k) (Ri² / ni) − 3(N + 1)

where
k = number of samples (groups)
ni = number of observations for the i-th sample or group
N = total number of observations (sum of all the ni)
Ri = sum of ranks for group i
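
The KW computation can be sketched in a few lines using the standard form H = [12 / (N(N + 1))] Σ Ri²/ni − 3(N + 1), with k, ni, N and Ri as defined above. The three groups below are hypothetical values invented for illustration, no tie correction is applied, and the chi-square critical value for df = 2 at .05 (5.991) is taken from a standard table.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H from rank sums of the combined data (no tie correction)."""
    combined = sorted(value for group in groups for value in group)
    positions = {}
    for position, value in enumerate(combined, start=1):
        positions.setdefault(value, []).append(position)
    # Tied observations share the average of the ranks they occupy.
    rank_of = {value: sum(p) / len(p) for value, p in positions.items()}

    n_total = len(combined)
    weighted = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical measurements for three sites, invented purely for illustration.
site_1 = [1, 2, 3, 4, 5]
site_2 = [6, 7, 8, 9, 10]
site_3 = [11, 12, 13, 14, 15]

h = kruskal_wallis_h(site_1, site_2, site_3)
df = 3 - 1
CHI_SQUARE_CRITICAL = 5.991  # chi-square table value for df = 2, alpha = .05

print(f"H = {h:.2f}, df = {df}, reject H0: {h > CHI_SQUARE_CRITICAL}")
```

In this deliberately extreme example the groups do not overlap at all, so H is far above the critical value.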


Non-parametric Tests

- Parametric vs. Non-parametric
- Chi-Square: 1 way, 2 way

Parametric Tests

- Data approximately normally distributed.
- Dependent variables at interval level.
- Sampling random.
- Examples: t-tests, ANOVA.

Non-parametric Tests

- Do not require normality.
- Do not require an interval level of measurement.
- Less powerful: the probability of correctly rejecting the null hypothesis is lower. So use parametric tests if the data meet those requirements.

One-Way Chi Square Test

- Compares observed frequencies within groups to their expected frequencies.
- H0: observed frequencies are not different from the expected frequencies.
- Research hypothesis: They are different.

Chi Square Statistic

χ² = Σ (fo − fe)² / fe

where fo = observed frequency and fe = expected frequency.

One-way Chi Square Interpretation

- If our calculated value of chi square is less than the table value, accept or retain H0.
- If our calculated chi square is greater than the table value, reject H0.
- As with t-tests and ANOVA, all work on the same principle for acceptance and rejection of the null hypothesis.

Two-Way Chi Square

- Review cross-tabulations (= contingency tables) from Chapter 2.
- Are the responses of two groups significantly different, statistically?
- One-way = observed vs. expected.
- Two-way = one set of observed frequencies vs. another set.
- Comparisons are between frequencies (rather than scores as in t or F tests).
- So the null hypothesis is that the two or more populations do not differ with respect to frequency of occurrence, rather than working with the means as in the t test, etc.

Two-way Chi Square Example

- Null hypothesis: The relative frequency [or percentage] of liberals who are permissive is the same as the relative frequency of conservatives who are permissive.
- Categories (independent variable) are liberals and conservatives. The dependent variable being measured is permissiveness.

Observed frequencies

                         Child-rearing Practices
Political Orientation    Permissive    Non-permissive    Total
Liberals                 13            7                 20
Conservatives            7             13                20
Total                    20            20                40

- Because we had 20 respondents in each column and each row, our expected values in this cross-tabulation would be 10 cases per cell.
- Note that both rows and columns are nominal data, which could not be handled by a t test or ANOVA. Here the numbers are frequencies, not an interval variable.

Expected frequencies

                         Child-rearing Practices
Political Orientation    Permissive    Non-permissive    Total
Liberals                 10            10                20
Conservatives            10            10                20
Total                    20            20                40

Two-Way Chi Square Example: unequal marginal totals

- Unfortunately, most examples do not have equal row and column totals, so it is harder to figure out the expected frequencies.
- What frequencies would we see if there were no difference between groups (if the null hypothesis were true)?
- If 25 out of 40 respondents (62.5%) were permissive, and there were no difference between liberals and conservatives, 62.5% of each would be permissive.
- We get the expected frequencies for each cell by multiplying the row marginal total by the column marginal total and dividing the result by N.
- We'll put the expected values in parentheses.

Political Orientation    Permissive    Not Permissive    Total
Liberals                 15 (12.5)     5 (7.5)           20
Conservatives            10 (12.5)     10 (7.5)          20
Total                    25            15                40

Two-Way Chi-Square Example: calculation

- So the chi square statistic from this data is (15 − 12.5)² / 12.5 PLUS the same calculation for all the other cells
  = .5 + .5 + .83 + .83 = 2.66
- df = (r − 1)(c − 1), where r = rows, c = columns, so df = (2 − 1)(2 − 1) = 1.
- From Table C, α = .05, chi-sq = 3.84.
- Compare: the calculated 2.66 is less than the table value, so we retain the null hypothesis.
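
The expected frequencies and the chi square statistic for this table can be reproduced with a short script. This is a minimal sketch of the row-total times column-total over N rule described above; the function name is hypothetical, and the table value 3.84 is the one quoted from Table C.

```python
def chi_square_two_way(observed):
    """Return (expected, chi2, df) for a table of observed frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count per cell: row marginal * column marginal / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    chi2 = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, chi2, df


# Rows: liberals, conservatives. Columns: permissive, not permissive.
observed = [[15, 5], [10, 10]]

expected, chi2, df = chi_square_two_way(observed)
TABLE_VALUE = 3.84  # df = 1, alpha = .05, as quoted from Table C

print(f"expected = {expected}")
print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {chi2 > TABLE_VALUE}")
```

Without intermediate rounding the statistic is about 2.67; summing the rounded cell contributions gives the 2.66 shown above, and either way it is below the table value.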
