Example: Consider the following data on how much students paid for their last haircut (including tips) in dollars:
18, 55, 23, 75, 36.
A common measure of data variability is the standard deviation (SD). The standard deviation is approximately
equal to the average deviation of the values from the mean.
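2. Sample SD:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} \]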
For the haircut data, the average deviation of the costs from the sample mean ($41.4) was approximately $23.61, for this sample of 5 people.
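As a quick check of the haircut example, the sample mean and sample SD can be computed directly. This is a minimal Python sketch; the variable names are illustrative and not part of the original notes.

```python
import math

costs = [18, 55, 23, 75, 36]  # haircut costs in dollars, from the example

n = len(costs)
mean = sum(costs) / n
# The sample SD divides the sum of squared deviations by n - 1, not n.
sample_sd = math.sqrt(sum((x - mean) ** 2 for x in costs) / (n - 1))

print(f"mean = {mean:.2f}, sample SD = {sample_sd:.2f}")
# mean = 41.40, sample SD = 23.61
```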
5. Median = middle value in a sorted dataset, when there is an odd number of observations. When there is an even
number of observations, the median is the average of the two middle values in the sorted dataset.
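The two cases of the median rule can be sketched as a small Python function. The function name `median` is illustrative, and the four-value list is only the haircut data with its last value dropped to show the even case.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([18, 55, 23, 75, 36]))  # odd count: 36
print(median([18, 55, 23, 75]))      # even count (illustrative subset): 39.0
```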
Then, observed lower quartile = (190+190)/2 = 190lbs, and observed upper quartile = (235+237)/2 = 236lbs.
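2. In general, a z-statistic is
\[ z = \frac{\text{statistic} - \text{hypothesized value of the parameter}}{\text{standard deviation of the statistic under the null hypothesis}} \]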
3. Let the null hypothesis be H0: π = π0.
Then the z-statistic for the one-proportion scenario is given by
\[ z = \frac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} \]
Example:
Suppose that in 32 attempts, the dolphin Buzz pushed the correct button 30 times. Does this provide evidence
that Buzz was doing better than guessing which button to push?
Then, the
Null hypothesis, H0: Buzz is just guessing, so his long-run probability of choosing the correct button is 0.5.
Alternative hypothesis, Ha: Buzz is doing better than guessing, so his long-run probability of choosing the correct
button is greater than 0.5.
Let, π represent Buzz’s long-run probability of pushing the correct button.
Then, H0: π = 0.5 versus Ha: π > 0.50
where π0 = 0.5.
From the sample data, n = 32 and \(\hat{p}\) = 30/32 = 0.9375.
Then,
\[ z = \frac{0.9375 - 0.5}{\sqrt{0.5 (1 - 0.5) / 32}} = 4.95 \]
The observed proportion of successes (0.9375) that Buzz had is 4.95 SDs above the hypothesized
proportion 0.5, which is what Buzz’s long-run probability of success would have been if he had been
guessing.
Note: Using the z-statistic along with the theory-based approach to find a p-value is valid if the sample
size is large enough, that is, at least 10 “successes” and at least 10 “failures.”
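The Buzz calculation can be reproduced with a short Python sketch. The function name `one_proportion_z` and its parameter names are illustrative assumptions, not part of the original notes.

```python
import math

def one_proportion_z(successes, n, pi0):
    """z-statistic for H0: pi = pi0, using the SD of the sample proportion under H0."""
    p_hat = successes / n
    sd_null = math.sqrt(pi0 * (1 - pi0) / n)
    return (p_hat - pi0) / sd_null

z = one_proportion_z(successes=30, n=32, pi0=0.5)
print(f"z = {z:.2f}")  # z = 4.95
```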
Notation
π represents the underlying process probability or population proportion (of “successes”)
\(\hat{p}\) represents the sample proportion of “successes”
n represents the sample size
1. Sample proportion = \(\hat{p}\) = (number of “successes” in the sample) / n
2. In general, a confidence interval is given by
statistic ± margin of error
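3. Theory-based confidence intervals for π, the underlying process probability or population proportion, are
given by
Confidence level    Margin of error                                  Confidence interval
90%                 \(1.645 \times \sqrt{\hat{p}(1-\hat{p})/n}\)     \(\hat{p} \pm 1.645 \times \sqrt{\hat{p}(1-\hat{p})/n}\)
95%                 \(1.96 \times \sqrt{\hat{p}(1-\hat{p})/n}\)      \(\hat{p} \pm 1.96 \times \sqrt{\hat{p}(1-\hat{p})/n}\)
99%                 \(2.576 \times \sqrt{\hat{p}(1-\hat{p})/n}\)     \(\hat{p} \pm 2.576 \times \sqrt{\hat{p}(1-\hat{p})/n}\)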
For a different confidence level, change the multiplier used in the margin of error formula.
These theory-based confidence intervals are valid when the sample size is large enough, that is, there are
at least 10 “successes” and at least 10 “failures” in the sample data.
Example:
In a July 2012 Gallup survey of 1,014 randomly selected U.S. adults, 5% said that they consider themselves to be
vegetarians. Find a 95% confidence interval of the proportion of all U.S. adults who consider themselves to be
vegetarians.
Let π represent the proportion of all U.S. adults who consider themselves to be vegetarians.
From the sample data, n = 1014 and \(\hat{p}\) = 0.05. Thus, there are (1014) × (0.05) = 50.7 ≈ 51 “successes”
and (1014) × (1 − 0.05) = 963.3 ≈ 963 “failures” in the sample.
Then, the theory-based 95% confidence interval is given by
\[ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.05 \pm 1.96 \sqrt{\frac{0.05 (1 - 0.05)}{1014}} = 0.05 \pm 1.96 (0.0068) = 0.05 \pm 0.0134 = (0.037, 0.063) \]
So, we are 95% confident that the proportion of U.S. adults who consider themselves to be vegetarians is
somewhere between 0.037 and 0.063.
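The vegetarian interval can be checked with a few lines of Python. This is a sketch under the same large-sample assumption as the text; the function name `proportion_ci` and the `multiplier` parameter are illustrative.

```python
import math

def proportion_ci(p_hat, n, multiplier=1.96):
    """Theory-based interval: p_hat +/- multiplier * sqrt(p_hat * (1 - p_hat) / n)."""
    margin = multiplier * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(0.05, 1014)  # Gallup example, 95% level
print(f"({low:.3f}, {high:.3f})")      # (0.037, 0.063)
```

Passing `multiplier=1.645` or `multiplier=2.576` gives the 90% or 99% interval instead.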
CHAPTER 5: Comparing Two Proportions
Notation
π1 represents the underlying probability for the 1st process or population proportion for 1st population (of
“successes”)
π2 represents the underlying probability for the 2nd process or population proportion for 2nd population (of
“successes”)
\(\hat{p}_1\) represents the sample proportion of successes in the 1st sample
\(\hat{p}_2\) represents the sample proportion of successes in the 2nd sample
\(\hat{p}\) represents the combined proportion of successes in the two samples =
(total number of successes in both samples) / (total number of observations in both samples)
Note: Using the z-statistic along with the theory-based approach to find a p-value is valid if the sample size is
large enough, that is, there are at least 10 observations in each of the four cells, when the data are written out as a
2×2 table.
Thus, the observed difference in the sample proportions of 0.097 is 4.3 SDs above the hypothesized
difference of 0. Recall that the null hypothesis says π1 – π2 is equal to 0.
Because the sample sizes are large enough, the theory-based 95% confidence interval for π1 – π2 can be
computed as
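\[ 0.548 - 0.451 \pm 1.96 \sqrt{\frac{0.548 (1 - 0.548)}{n_1} + \frac{0.451 (1 - 0.451)}{n_2}} = (0.053, 0.141) \]
where n1 and n2 are the two sample sizes.
We are 95% confident that π1 is higher than π2 by somewhere between 0.053 and 0.141.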
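The structure of this interval can be sketched in Python. The sample sizes behind the interval are not shown here, so the values of `n1` and `n2` below are hypothetical placeholders and the output will not reproduce (0.053, 0.141). The function name `two_proportion_ci` is also illustrative.

```python
import math

def two_proportion_ci(p1, n1, p2, n2, multiplier=1.96):
    """Interval for pi1 - pi2: (p1 - p2) +/- multiplier * SE of the difference."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - multiplier * se, diff + multiplier * se

# HYPOTHETICAL sample sizes (500 each), used for illustration only.
low, high = two_proportion_ci(0.548, 500, 0.451, 500)
print(f"({low:.3f}, {high:.3f})")
```

The interval is always centered on the observed difference (here 0.097) and narrows as the sample sizes grow.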
Notation
µ1 represents the “population” average for population 1
µ2 represents the “population” average for population 2
\(\bar{x}_1\) represents the sample mean for sample 1
\(\bar{x}_2\) represents the sample mean for sample 2
s1 represents the sample SD for sample 1
s2 represents the sample SD for sample 2
n1 represents the sample size for sample 1
n2 represents the sample size for sample 2
t represents the standardized statistic
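Let H0: µ1 – µ2 = 0.
The formula used to calculate the t-statistic to compare two groups on a quantitative response is
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]
An approximate theory-based 95% confidence interval for µ1 – µ2 can be written as follows:
\[ \bar{x}_1 - \bar{x}_2 \pm 2 \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]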
Note: Using the t-statistic along with the theory-based approach to find a p-value, and/ or finding the theory-based
confidence interval is valid provided
(i) the sample sizes are large enough. That is, each of the samples has at least 20 observations.
OR (ii) each sample comes from a normally distributed population.
Example: Consider data on BMI of women participating in a randomized experiment of lifestyle change
programs.
Group           Sample size      Sample mean                   Sample SD
Intervention    \(n_1\) = 60     \(\bar{x}_1\) = 30 kg/m²      \(s_1\) = 2.1 kg/m²
Control         \(n_2\) = 60     \(\bar{x}_2\) = 34 kg/m²      \(s_2\) = 2.4 kg/m²
We can define our parameters of interest to be,
µ1 = Average BMI after 2 years of being enrolled in an individualized lifestyle change
program, for (obese) women like those in the study, and
µ2 = Average BMI after 2 years of being enrolled in a “one size fits all” lifestyle change program,
for (obese) women like those in the study
And using the symbols µ1 and µ2 we can restate our hypotheses to be,
Null hypothesis, H0: µ1 – µ2 = 0
Alternative hypothesis, Ha: µ1 – µ2 ≠ 0
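The t-statistic is
\[ t = \frac{30 - 34}{\sqrt{\frac{2.1^2}{60} + \frac{2.4^2}{60}}} = \frac{-4}{0.4117} = -9.72 \]
This tells us that the observed difference (−4) is 9.72 SDs below the hypothesized difference of 0.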
Using the data from the study, we can see that each sample had 60 observational units, which is bigger
than 20. Thus, the theory-based approximate 95% confidence interval for µ1 – µ2 is:
\[ 30 - 34 \pm 2 \sqrt{\frac{2.1^2}{60} + \frac{2.4^2}{60}} = -4 \pm 2 (0.4117) = -4 \pm 0.8234 = (-4.8234, -3.1766) \]
Thus, we are 95% confident that enrollment in individualized lifestyle change programs (compared to
“one size fits all” programs) decreases the average BMI of women like the obese Italian women in our
study by somewhere between 3.18 kg/m² and 4.82 kg/m².
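The BMI example can be reproduced from the summary statistics alone. The following Python sketch uses the multiplier 2 from the approximate 95% interval; the function name `two_sample_summary` is illustrative.

```python
import math

def two_sample_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Return the t-statistic and the approximate 95% CI (multiplier 2) for mu1 - mu2."""
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t_stat = diff / se
    return t_stat, (diff - 2 * se, diff + 2 * se)

# Intervention: mean 30, SD 2.1, n 60. Control: mean 34, SD 2.4, n 60.
t_stat, (low, high) = two_sample_summary(30, 2.1, 60, 34, 2.4, 60)
print(f"t = {t_stat:.2f}, CI = ({low:.2f}, {high:.2f})")
# t = -9.72, CI = (-4.82, -3.18)
```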
Notation
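Let H0: µd = 0.
The t-statistic for paired data on a quantitative response is
\[ t = \frac{\bar{x}_d}{s_d / \sqrt{n}} \]
where \(\bar{x}_d\) is the mean of the sample differences, \(s_d\) is the SD of the sample differences, and n is the
number of pairs.
When the sample size is large enough, we can also find an approximate theory-based 95% confidence
interval for µd, as follows:
\[ \bar{x}_d \pm 2 \times \frac{s_d}{\sqrt{n}} \]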
Note: Using the t-statistic along with the theory-based approach to find a p-value, and/or finding the theory-
based confidence interval is valid if
(i) The sample size (that is, the number of pairs observed) is large enough (at least 20).
OR (ii) the sample of differences comes from a normally distributed population.
Example: Suppose that we have the following data:
                                   Sample size    Mean of sample differences (kcal)    SD of sample differences (kcal)
Difference = Estimated – Actual    n = 80         \(\bar{x}_d\) = −435                   \(s_d\) = 527.7
Let µd represent the average difference in estimated and actual energy intake, by men like those in the
study.
Then, H0: µd = 0, versus Ha: µd ≠ 0.
The t-statistic can be found as
\[ t = \frac{-435}{527.7 / \sqrt{80}} = -7.37 \]
Thus, the observed average difference between estimated and actual energy intake, −435 kcal, is 7.37 SDs
below 0.
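The paired t-statistic depends only on the mean and SD of the differences and the number of pairs, so it can be checked with a short Python sketch. The function name `paired_t` is illustrative.

```python
import math

def paired_t(mean_diff, sd_diff, n_pairs):
    """t-statistic for H0: mu_d = 0, i.e. mean_diff / (sd_diff / sqrt(n_pairs))."""
    standard_error = sd_diff / math.sqrt(n_pairs)
    return mean_diff / standard_error

t_stat = paired_t(mean_diff=-435, sd_diff=527.7, n_pairs=80)
print(f"t = {t_stat:.2f}")  # t = -7.37
```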
Using the data from the study, because there are 80 observational units (that is, 80 pairs of responses), the
approximate theory-based 95% confidence interval for µd is:
\[ -435 \pm 2 \times \frac{527.7}{\sqrt{80}} = -435 \pm 2 (59.00) = -435 \pm 118.00 = (-553, -317) \]
Thus, we are 95% confident that, on average, men such as the ones in this study underestimate their
energy intake by somewhere between 317 kcal and 553 kcal.
Notation
πi represents the underlying probability for the ith process or population proportion for the ith population.
\(\hat{p}_i\) represents the sample proportion of successes in the ith sample
ni represents the sample size of the ith sample
\(\hat{p}\) represents the overall proportion of successes in the entire study
\(\chi^2\) represents the chi-square statistic
Σ represents summation
The chi-square statistic for comparing multiple groups on a binary response variable is given by
\[ \chi^2 = \frac{\sum_i n_i (\hat{p}_i - \hat{p})^2}{\hat{p} (1 - \hat{p})} \]
H0: There is no association between type of treatment and whether or not a person experiences substantial
reduction in pain,
Ha: There is an association between type of treatment and whether or not a person experiences substantial
reduction in pain.
\(\hat{p}_1\) = sample proportion of subjects experiencing substantial reduction in pain among those who received
the real acupuncture = 184/387 = 0.475
Similarly, \(\hat{p}_2\) = 171/387 = 0.442, and \(\hat{p}_3\) = 106/388 = 0.273
n1 = 387, n2 = 387, n3 = 388
\(\hat{p}\) = overall proportion of successes in the entire study = (184 + 171 + 106)/(1162) = 0.397
Then, the observed value of the chi-square statistic can be calculated as
\[ \chi^2_{\text{observed}} = \frac{1}{0.397 (1 - 0.397)} \left[ 387 (0.475 - 0.397)^2 + 387 (0.442 - 0.397)^2 + 388 (0.273 - 0.397)^2 \right] = 38.05 \]
(computed from the unrounded proportions).
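The chi-square calculation can be checked directly from the success counts and group sizes, which avoids rounding the proportions first. This is a minimal Python sketch; the list names are illustrative.

```python
successes = [184, 171, 106]  # subjects with substantial pain reduction, per group
sizes = [387, 387, 388]      # n1, n2, n3

p_overall = sum(successes) / sum(sizes)

# Sum of n_i * (p_hat_i - p_hat)^2, divided by p_hat * (1 - p_hat).
numerator = sum(n * (s / n - p_overall) ** 2 for s, n in zip(successes, sizes))
chi_square = numerator / (p_overall * (1 - p_overall))

print(f"overall proportion = {p_overall:.3f}, chi-square = {chi_square:.2f}")
# overall proportion = 0.397, chi-square = 38.05
```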
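The chi-square statistic can also be written in terms of cell counts as
\[ \chi^2 = \sum_i \frac{(\text{observed}_i - E_i)^2}{E_i} \]
where \(\text{observed}_i\) is the observed count and \(E_i\) is the expected count in the ith cell of the two-way table.
To calculate the expected counts, we determine the overall proportion in each response category, and then apply
this proportion to each explanatory variable group. One way of implementing this is by calculating the expected
cell counts (Ei) using the following formula
\[ E_i = \frac{\text{row total} \times \text{column total}}{\text{total sample size}} \]
where the row and column totals correspond to the row and column in which the ith cell appears.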
Note: Using the chi-square statistic along with the theory-based approach to find a p-value is valid provided the
sample sizes are large enough. That is, there are at least 10 observations in each of the cells, when the data are
written out as a two-way table.
Example: Let us use the data from Chapter 8 again, where the observed counts are as follows
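Each calculated value of \((\text{observed}_i - E_i)^2 / E_i\) is called a chi-square cell contribution. For example, the chi-square cell
contribution for the cell corresponding to “real acupuncture” and “substantial reduction in pain”, whose expected
count is 387 × 461 / 1162 = 153.53, is given by
\[ \frac{(184 - 153.53)^2}{153.53} = 6.05 \]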
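The expected count and cell contribution for the “real acupuncture” and “substantial reduction in pain” cell can be sketched in Python. The totals used (387 in the real acupuncture group, 184 + 171 + 106 = 461 successes overall, 1162 subjects) come from the counts given earlier; the function names are illustrative.

```python
def expected_count(group_total, response_total, grand_total):
    """Expected cell count: (row total * column total) / total sample size."""
    return group_total * response_total / grand_total

def cell_contribution(observed, expected):
    """Chi-square cell contribution: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected

expected = expected_count(group_total=387, response_total=461, grand_total=1162)
contribution = cell_contribution(observed=184, expected=expected)

print(f"expected = {expected:.2f}, contribution = {contribution:.2f}")
# expected = 153.53, contribution = 6.05
```

Summing the contributions over all cells of the two-way table gives the chi-square statistic.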
Notation
Or as,
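Then, the F-statistic for comparing more than two groups on a quantitative response is given by
\[ F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\sum_{i=1}^{I} n_i (\bar{x}_i - \bar{x})^2 / (I - 1)}{\sum_{i=1}^{I} (n_i - 1) s_i^2 / (N - I)} \]
where I is the number of groups; \(n_i\), \(\bar{x}_i\), and \(s_i\) are the sample size, sample mean, and sample SD of the ith
group; \(\bar{x}\) is the overall mean; and N is the total sample size.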
Example: Consider the following data from Chapter 9 on the comprehension scores of an ambiguous prose
passage:
To find the observed value of the F-statistic, let us first calculate the numerator:
\[ \text{between-group variability} = \frac{\sum n_i (\bar{x}_i - \bar{x})^2}{I - 1} = 17.57 \]
and then the denominator:
\[ \text{within-group variability} = \frac{\sum (n_i - 1) s_i^2}{N - I} = 1.75 \]
Thus, the observed value of the F-statistic = 17.57 /1.75 = 10.02.
Note: Using the F-statistic along with the theory-based approach to find a p-value is valid as long as
(i) The sample sizes are large enough. That is, there are at least 20 observations in each sample,
OR, the samples each come from normally distributed populations.
(ii) The variability (SD) in each of the populations is the same.
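The F-statistic formula can be sketched in Python as follows. The comprehension-score data are not reproduced here, so the groups below are HYPOTHETICAL values chosen only to make the arithmetic easy to follow; the function name `f_statistic` is also illustrative.

```python
from statistics import mean, variance

def f_statistic(groups):
    """Between-group variability divided by within-group variability."""
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    total_n = sum(sizes)
    num_groups = len(groups)
    overall_mean = sum(sum(g) for g in groups) / total_n

    between = sum(n * (m - overall_mean) ** 2
                  for n, m in zip(sizes, means)) / (num_groups - 1)
    # statistics.variance is the sample variance s_i^2 (divides by n_i - 1).
    within = sum((n - 1) * variance(g)
                 for n, g in zip(sizes, groups)) / (total_n - num_groups)
    return between / within

# HYPOTHETICAL data: three groups of three scores each.
hypothetical_scores = [[2, 3, 4], [4, 5, 6], [6, 7, 8]]
print(f_statistic(hypothetical_scores))
```

For these made-up groups the between-group variability is 12 and the within-group variability is 1, so F is 12.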
CHAPTER 10: Paired Data: Two Different Quantitative Variables
Notation
The sample correlation coefficient for data on a quantitative explanatory variable and a quantitative response
variable is given by
\[ r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
The least squares regression line for data on a quantitative explanatory variable and a quantitative response is
given by
\[ \hat{y} = a + b x \]
where the sample slope for the least squares regression line is given by
\[ b = r \times \frac{s_y}{s_x} \]
and the sample intercept for the least squares regression line is given by
\[ a = \bar{y} - b \bar{x} \]
Let H0: ρ = 0 versus Ha: ρ ≠ 0.
Then, the corresponding t-statistic for the hypotheses about the population correlation coefficient (ρ) is given
as
\[ t = \frac{r}{\sqrt{(1 - r^2) / (n - 2)}} \]
Let H0: β = 0 versus Ha: β ≠ 0.
Then, the corresponding t-statistic for the hypotheses about the population slope (β) is given as
\[ t = \frac{b}{SE_b} \]
and an approximate theory-based 95% confidence interval for β is given by
\[ b \pm 2 \times SE_b \]
Example: Let us use the data on height (inches) and handspan (cm) for a sample of 10 college students.
                                                                        Mean     SD
Height (inches), yi     64   67   65     72   71   70   66   62   73   65     67.50    3.75
Handspan (cm), xi       17   21   20.3   26   24   22   21   19   20   19     20.93    2.59
Then,
\(\bar{x}\) = 20.93 cm
\(\bar{y}\) = 67.50 inches
sx = 2.59 cm
sy = 3.75 inches
n = 10
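Thus, the calculated least squares regression line for the sample data is found to be \(\hat{y} = 46.12 + 1.02 x\),
where x is handspan (cm) and \(\hat{y}\) is the predicted height (inches).
With r = 0.71, the t-statistic for the correlation coefficient is
\[ t = \frac{0.71}{\sqrt{(1 - 0.71^2) / (10 - 2)}} = 2.85 \]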
Thus, the observed (sample) correlation coefficient 0.71 between height and handspan is about 2.85 SDs above
the hypothesized value 0 of the population correlation coefficient, the value ρ would be if the null hypothesis were
true.
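The correlation, regression line, and t-statistic for the height and handspan data can be recomputed from the ten data pairs. This is a Python sketch with illustrative variable names. The value 2.85 uses r rounded to 0.71; with the unrounded r the same formula gives about 2.82.

```python
import math

heights = [64, 67, 65, 72, 71, 70, 66, 62, 73, 65]      # y, inches
handspans = [17, 21, 20.3, 26, 24, 22, 21, 19, 20, 19]  # x, cm

n = len(heights)
x_bar = sum(handspans) / n
y_bar = sum(heights) / n
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in handspans) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in heights) / (n - 1))

# Correlation: average product of standardized x and y values (dividing by n - 1).
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(handspans, heights)) / (n - 1)

b = r * s_y / s_x      # sample slope
a = y_bar - b * x_bar  # sample intercept

def correlation_t(r_value, sample_size):
    """t-statistic for H0: rho = 0."""
    return r_value / math.sqrt((1 - r_value ** 2) / (sample_size - 2))

t_rounded_r = correlation_t(round(r, 2), n)  # uses r = 0.71, as in the text
t_exact_r = correlation_t(r, n)              # uses the unrounded r

print(f"r = {r:.2f}, line: y-hat = {a:.2f} + {b:.2f} x")
print(f"t (r rounded) = {t_rounded_r:.2f}, t (r unrounded) = {t_exact_r:.2f}")
```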
Here is statistical software output with b, SEb, and the corresponding value of the t-statistic circled.
Notice that the t-statistic for H0: β = 0 versus Ha: β ≠ 0 can be calculated to be
\[ t = \frac{b}{SE_b} = 2.82 \]
And, the approximate theory-based 95% confidence interval for β can be calculated to be
\[ b \pm 2 \times SE_b = (0.30, 1.74) \]
We are 95% confident that the increase in average height associated with an increase of 1cm in hand span is
somewhere between 0.30 inches and 1.74 inches.
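The slope, its standard error, the t-statistic, and the interval can be recomputed from the raw data. The text reads SE_b off software output and does not give a formula for it, so the sketch below assumes the usual least-squares standard error, sqrt(SSE / (n − 2)) / sqrt(Σ(x_i − x̄)²); all variable names are illustrative. With unrounded values the upper endpoint comes out near 1.75; rounding b and SE_b to 1.02 and 0.36 first gives the 1.74 quoted above.

```python
import math

heights = [64, 67, 65, 72, 71, 70, 66, 62, 73, 65]      # y, inches
handspans = [17, 21, 20.3, 26, 24, 22, 21, 19, 20, 19]  # x, cm

n = len(heights)
x_bar = sum(handspans) / n
y_bar = sum(heights) / n

sxx = sum((x - x_bar) ** 2 for x in handspans)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(handspans, heights))

b = sxy / sxx          # least squares slope (equals r * s_y / s_x)
a = y_bar - b * x_bar  # least squares intercept

# ASSUMED formula: SE_b = sqrt(SSE / (n - 2)) / sqrt(Sxx).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(handspans, heights))
se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t_stat = b / se_b
low, high = b - 2 * se_b, b + 2 * se_b

print(f"b = {b:.2f}, SE_b = {se_b:.2f}, t = {t_stat:.2f}")
print(f"approximate 95% CI for the slope: ({low:.2f}, {high:.2f})")
```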