
STAT 3660

Introduction to Statistics

Course pack prepared by Dr. Gerald Sievers, Summer 2013

Acknowledgement:
Robert McNutt, Judson Fonger, Barbie Bugna, Jacqueline Andan, Nichole
Andrews, Sanduni Palliyage, and Ida Alcantara assisted with this material.
Copyright ©2017 by The Department of Statistics at Western Michigan University.

All rights reserved.

Reproductions or translation of any part of this work beyond that permitted by Sections 107
and 108 of the 1976 United States Copyright Act without permission of the copyright owner is
unlawful.
Contents

1 Introduction
2 Controlled Experiments
3 Observational Studies
4 Issues Related To Statistical Studies
5 Plots
6 Statistics For Center and Spread of Data
7 Normal Distribution
8 Correlation and Association
9 Regression
10 Box Models For a Population
11 Sampling Distribution of a Sample Proportion
12 Sampling Distribution of a Sample Mean
13 Tests of Significance
14 The Two Sample Problem
15 One Categorical Variable / The Chi-Square Test
16 Two-Dimensional Tables
17 Judgment Tables
Workshop
Answer Key

Chapter 1

Introduction

This is a first course in Statistics. No special background is assumed for the reader, but
high school math experience is helpful to understand symbols, formulas, and graphs.
Statistics is about obtaining and analyzing data. It plays a central role for quantitative
aspects of any field of endeavor.
Statistics is the art and science of making conclusions (decisions, judgments)
from data (numbers, information).
In a real sense, all of us use our senses to obtain information and our brain processes
this information to learn about the world around us and to build our views and knowledge
base. We have all been doing this very ordinary “statistical” activity continually since
birth and have become experts at it in our own way. You all are statisticians in this casual
sense. Some seem to have better sense and skill at doing this. We would all hope to
improve our abilities.
The “science” part of Statistics deals with methods and procedures for getting information
(data) and analyzing it. This involves concepts, calculations, and graphs for reaching
conclusions. This material has been developed over many decades in many different
fields. Good ideas that proved to be useful and effective have been extracted from such
sources over time and assembled into the modern field of Statistics.
The “art” part of Statistics comes in when we realize that some elements of this process cannot
be made objective, so we have to rely on intuition and judgment to some degree. As
Charlie the Tuna said, it’s a matter of “good taste”.
The subject of Statistics is concerned with all aspects of the process of drawing
conclusions (decisions, predictions) from data (numbers, information, experience).
Speaking informally, this is a very ordinary activity that all of us have been doing and
continue to do in all fields of endeavor. In this loose sense, you have all encountered
statistical ideas in most of your previous courses. You are all experienced in doing this.
It is clearly important to do this business in efficient and effective ways.


Statistics, then, is concerned with


• how to obtain data,
• what to measure,
• how to extract useful, focused information from data, and
• what conclusions are warranted (strengths, weaknesses).
All this is taking place in an environment where our information is incomplete (often
woefully so), is subject to errors, involves random variability, and is affected by many
unsure factors. There is always some level of uncertainty. Statistics takes on the tough
job of assessing how far off our conclusions might be from the truth and how likely we
are to make mistakes of various kinds.
Statistics is utilized in various disciplines. Some examples are presented below.
Sociology: Statistics is used to study the factors that affect the social lives of people,
groups, or societies. Data on crime, unemployment, income, births, mortality,
divorces, etc. are gathered yearly. A sociologist may be interested in determining
the relationship between the divorce rate and the domestic violence rate in Michigan.
Business: Data on sales, advertising or marketing expenses, exports, production, number
of employees and their characteristics, etc. are examined. Statistics can support performance
management within a company. For example, a manager wants to find ways to improve
his employees’ performance, that is, for them to achieve maximum productivity. So
he needs to regularly monitor data on his employees’ productivity, such as the
number of tasks completed or the number of units produced in a day. If the number of
tasks done in a day by an employee drops by 30%, he might want to talk to the employee
and address the issue.
Agriculture: The USDA Agricultural Research Service is one of the world’s leading scientific
organizations, and it aids in solving agricultural problems that affect Americans
daily. Researchers may collect data such as counts and characteristics of livestock on
farms, the number of acres allotted to corn or strawberries, the amount of minerals in farm
soils, and water levels in rice fields. For example, a study may be done to determine which
fertilizer gives the highest crop yield.
Health Science: Statistics is used in the health sciences in numerous ways, including
studies of public health, medical care, the occurrence of diseases, and the development of new drugs
and medical procedures. An example is the ongoing study of stem cell therapy as a treatment
for multiple sclerosis. To implement this treatment safely and effectively, a
careful clinical trial must be done with statistical considerations.

Chapter 2

Controlled Experiments

As a first case, consider a situation where a researcher needs to collect appropriate data
to assess whether or not an improvement will result from a new method. Examples might
be
• testing a new surgical procedure.
• deciding if a new packaging will improve sales of a product.
• proving that Azomite (a clay from the Nevada desert) greatly enhances plant growth.
• deciding if a new textbook would increase student learning.
In collecting data to decide on issues such as these, there are four principles of good
practice that will be considered for a Controlled Experiment:
1. Make a comparison (set up a treatment group and a control group).
2. Randomize the group membership of the subjects.
3. Make the experiment blind (subjects don’t know which group they’re in).
4. Make the experiment double blind (those assessing the results don’t know group
identity).
The above conditions require that we are able to control such experimental details.
In comparing two groups, the goal is to have all things equal between the two groups
except the presence of the treatment. Then any difference in the outcomes between the
two groups can only be attributed to the treatment.
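Principle 2 (randomization) can be sketched in a few lines of code. The example below is only an illustration, not a procedure prescribed by this course pack: the function name `assign_groups`, the subject ID numbers, and the group size of 200 are all hypothetical choices. It shuffles the list of subjects with a random number generator and splits it in half, so that chance alone decides who lands in the treatment group.

```python
import random


def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into a treatment and a control group.

    Hypothetical helper for illustration: shuffles the IDs and cuts the
    shuffled list in half, so every subject is equally likely to end up
    in either group.
    """
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical ID numbers 1..200 for the subjects in a study.
subjects = range(1, 201)
treatment, control = assign_groups(subjects, seed=3660)

print(len(treatment), len(control))   # sizes of the two groups
print(sorted(treatment)[:10])         # a few of the IDs assigned to treatment
```

Because the researcher does not choose who goes where, conscious or unconscious preferences cannot tilt the groups in favor of or against the treatment.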


New Drug Example

We want to test a new drug that has been developed to control a skin rash. Suppose the
drug is applied in a topical cream. We decide to involve 200 people who have a skin rash.
To have a comparison, we divide these people into 2 groups:
• one group of 100 people will receive the new drug (treatment group),
• the other 100 people will use an old drug (control group).
We set up a randomization scheme so that as people become available for the study they
are randomly assigned to be in the treatment or control group. In this way, we avoid any
potential bias in the group membership. To keep the people blind, they are not told which
group they are in and both groups use a topical cream. After a period of time, we record
either “success” or “failure” for each person. The doctor who makes this determination
is not informed about group membership so that the experiment is double blind. The data
may look like this:
• Treatment group: 75% success
• Control group: 60% success
This is very simple data. We see there is a higher success rate for the treatment group,
indicating that the treatment is more effective. Yet there are some concerns to consider.
Would these percentages be essentially the same in another study? Surely they would
change somewhat. Would a larger study with more people, maybe thousands, be better?
Is this a strong conclusion – should we recommend the new drug for general use?
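The question of how much the percentages would change in another study can be explored with a small simulation. The sketch below rests on a strong assumption made purely for illustration: that the true success rates are exactly the observed 75% and 60%. The function name `simulate_success_rate` and the number of repeated studies are hypothetical choices. It repeats the 100-versus-100 study many times on the computer and records how the difference between the two success rates varies.

```python
import random


def simulate_success_rate(true_rate, n, rng):
    """Observed proportion of successes among n subjects,
    each succeeding independently with probability true_rate."""
    successes = sum(1 for _ in range(n) if rng.random() < true_rate)
    return successes / n


rng = random.Random(2013)   # fixed seed so the run is reproducible
n_studies = 2000            # number of imaginary repeat studies (assumption)

diffs = []
for _ in range(n_studies):
    treat = simulate_success_rate(0.75, 100, rng)   # assumed true rate: 75%
    ctrl = simulate_success_rate(0.60, 100, rng)    # assumed true rate: 60%
    diffs.append(treat - ctrl)

mean_diff = sum(diffs) / n_studies
share_reversed = sum(1 for d in diffs if d <= 0) / n_studies

print("smallest and largest difference:", min(diffs), max(diffs))
print("average difference:", round(mean_diff, 3))
print("share of studies where treatment did no better:", share_reversed)
```

The differences scatter noticeably around 15 percentage points from one simulated study to the next, which is why a single study of this size leaves some doubt and why larger studies give firmer conclusions.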

Salk Polio Vaccine

The polio virus attacks nerves controlling muscles causing paralysis, breathing difficulties,
and even death. Through the earlier years of the last century, there were about 50,000
cases per year (in 1952 there were about 57,000 cases). Children were especially susceptible
and this caused great concern and fear in families. Not much was known about how polio
was spread; it might be contagious. There could be hygiene issues and socio-economic
factors. There was yearly variation and regional variation in the occurrence of polio.


By the 1950s Dr. Jonas Salk had developed an injectable vaccine, based on killed polio
virus, and a large field trial was needed to study this vaccine. Funds were provided by the
March of Dimes (founded by President Roosevelt in 1938), and the National Foundation for
Infantile Paralysis (NFIP) was also involved. There was much debate and disagreement
as to how this field trial should be carried out. It would focus on children and then parental
consent would be needed. The idea of injecting children with live polio virus caused great
concern. An extensive educational campaign was done to inform the public about the
issues and the importance of the trial in an effort to gain participation. Since volunteer
families would be used for the vaccine, there was concern that such families could differ
in some important ways from families who would not volunteer and, thus, the study could
be biased either for or against the vaccine. In the end, several studies were carried out.
The field trials for the polio vaccine began in April 1954. On April 12, 1955, Dr. Thomas
Francis, who oversaw the trial, announced at the University of Michigan that “the vaccine
works; it’s safe, effective and potent”. The following table summarizes some of the main
results (American Journal of Public Health, vol 45, 1955, p. 1–63). Some numbers are
rounded.
Randomized Experiment
                                Size      Rate
Treatment                       200,000   28
Control                         200,000   71
No Consent                      350,000   46

NFIP Study
                                Size      Rate
Grade 2 (consent, vaccine)      225,000   25
Grades 1 and 3 (control)        725,000   54
Grade 2 (no consent)            125,000   44
The table shows the size of the group and the rate per 100,000 of polio cases. In the
randomized experiment, both the treatment (vaccine) and the control groups have parental
consent. It was also blind (the control group children received a neutral injection) and
double blind.
Note that the NFIP study shows a smaller beneficial effect of the vaccine. It may be biased
against the vaccine. This could be due to the confounding effect of “consent”, since its
control group (grades 1 and 3) contains both children whose parents would have consented
and children whose parents would not. If we had done only the NFIP study, the conclusion
about the benefit of the vaccine would be suspect for this reason.
On careful statistical analysis, the difference between rates of 28 and 71 for the randomized
experiment was found to be huge; it could not reasonably be due to chance. This is the
basis for the strong conclusion that the vaccine was effective.
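To get a feeling for why a difference between 28 and 71 per 100,000 is considered huge, one common approach is a two-proportion z statistic. The sketch below is an illustration only and is not necessarily the analysis used in the original report; the function name `two_proportion_z` is hypothetical, and the case counts are approximations reconstructed from the rounded rates and group sizes in the table.

```python
from math import sqrt


def two_proportion_z(cases_a, n_a, cases_b, n_b):
    """z statistic for the difference between two proportions,
    using the pooled proportion to estimate the standard error."""
    p_a = cases_a / n_a
    p_b = cases_b / n_b
    pooled = (cases_a + cases_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / standard_error


group_size = 200_000

# Approximate case counts: rate per 100,000 times group size (rounded data).
treatment_cases = round(28 * group_size / 100_000)
control_cases = round(71 * group_size / 100_000)

z = two_proportion_z(treatment_cases, group_size, control_cases, group_size)
print("approximate cases:", treatment_cases, control_cases)
print("z statistic:", round(z, 2))
```

Under these assumptions the statistic comes out at roughly 6, meaning the observed difference is about six standard errors away from zero. Chance differences that large are extraordinarily rare, which is consistent with the strong conclusion stated above.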


Pre-eclampsia and Vitamins C and E

A government-sponsored study of more than 10,000 women failed to find that large doses
of vitamins C and E cut the risk of complications from pregnancy-induced high blood
pressure. Pre-eclampsia, a condition characterized by high blood pressure and protein
in the urine, occurs in up to 8% of pregnancies. A leading cause of illness and death
in pregnant women and infants, pre-eclampsia can be cured only by delivering the baby.
An earlier study of fewer than 300 pregnant women found that taking vitamins C and E
lowered the risk of pre-eclampsia. But scientists have repeatedly failed to replicate that
finding.
In the current study, women had to be pregnant for the first time and at low risk for
pre-eclampsia. They began taking their pills between the ninth and sixteenth week of
pregnancy and continued up to delivery. Participants were randomly assigned to a treatment
group or a control group. The control group took placebo pills to keep the study blind.
“The study found no evidence of benefit to either the mother or the baby”, says lead
author James Roberts, an obstetrician/gynecologist at the University of Pittsburgh. In
fact, Roberts and his collaborators at fifteen other medical schools found that women
randomly assigned to take vitamins were slightly more likely to develop high blood pressure
than those assigned to take placebo pills, although the difference was not significant.
A similar study of nearly 2,000 women published in 2006 reached a similar conclusion. It
seems that large doses of vitamin C and E during pregnancy have no benefit in reducing
the risk of pre-eclampsia; they clearly cannot be recommended.

Stitches

The following information was reported in the Wall Street Journal, Dec 1, 2009.
Short stitches sewn close together led to fewer infections after abdominal surgery, compared
to longer stitches, according to a study in the Archives of Surgery. This differs from
current guidelines, which recommend that surgeons close abdominal incisions using stitches
that begin and end at least ten millimeters from the wound edge. The study
randomized 737 patients into two groups. Among patients receiving standard stitches,
10.2% of the wounds became infected. The second group, which received stitches placed
five to eight millimeters from the wound edge, had a 5.2% infection rate. It appears that longer
stitches increase the risk of infection for abdominal incisions.

Note: Exercise questions over this material have been combined with Chapter 3.

Chapter 3

Observational Studies

Often we have data that was collected in a more haphazard manner. We observe results
that are happening in the world without any explicit control on our part. Such studies are
called observational studies. There are many potential troubles in analyzing observational
data. Other factors that we have not controlled may be affecting the main response.
Biases may be built into the data. Seemingly obvious conclusions can be wrong. Sometimes
we can partially correct for such potential biases with various adjustments or by stratification.
This can be quite tricky and requires considerable talent and experience.

Tall Men

This information was reported in The Week, July 31, 2009.


Tall people get more respect. In an Australian study, researchers found that for men
an additional 2 inches of height brings an additional $1,000 in annual income. This
“height premium” is slightly less for women in Australia, while for men in the U.K. and
the U.S.A. it is higher. Many theories have been proposed to explain this phenomenon.
This is observational data. The researchers selected people in some unspecified fashion and
recorded their height and income. Their calculations showed this height effect. Perhaps
if you do stretching exercises to raise your height you can increase your income.

SAT Math Scores and Grade Inflation

This information was taken from The College Board (research.collegeboard.org), which
produces yearly reports on the SAT.


High school grades have been increasing over time. Does this mean that students are
learning more, say in math? The table below shows that at each level of high school
grades the SAT math score average decreased by 3 to 13 points between 1996 and 2014.
Apparently, the higher grades are not deserved. Yet the table also shows that SAT math
scores averaged 510 in 1996 and 515 in 2014, an increase of 5 points. This apparently
shows that students are improving over this time period. This seems rather strange. How
can scores decrease at each grade level and yet overall the scores increase?
                    % of Students          Average SAT
HS Grade            Getting Grades         Math Scores
Average             1996     2014          1996    2014    Change
A+                   6.1      6.8          632     619     −13
A                   14.4     21.0          583     580      −3
A-                  15.2     20.0          554     543     −11
B                   49.5     43.3          485     474     −11
C                   14.7      8.9          426     413     −13
Overall Average                            510     515

Note that in 2014 the grades were higher than in 1996 (these were different students). In computing the
overall average we need to use a weighted average to account for the differing percentages
of students in the grade levels (do not just average the 5 averages). In 2014 there is
a greater percentage of students in the higher grade levels, where the SAT scores are
higher, and this pulls the overall average up.
This is an example of Simpson’s paradox. From one perspective the SAT scores decrease,
while from another perspective the SAT scores increase.
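The weighted average described above can be checked directly from the table. The sketch below is an illustration with hypothetical variable and function names; the numbers are copied from the table. It weights each grade level's SAT average by the percentage of students at that level and, for contrast, also computes the plain average of the five averages.

```python
# Data copied from the table (grade levels A+, A, A-, B, C).
pct_1996 = [6.1, 14.4, 15.2, 49.5, 14.7]   # % of students at each grade level
pct_2014 = [6.8, 21.0, 20.0, 43.3, 8.9]
sat_1996 = [632, 583, 554, 485, 426]       # average SAT math score per level
sat_2014 = [619, 580, 543, 474, 413]


def weighted_average(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


overall_1996 = weighted_average(sat_1996, pct_1996)
overall_2014 = weighted_average(sat_2014, pct_2014)

# The wrong way: treating the five grade levels as equally common.
simple_1996 = sum(sat_1996) / len(sat_1996)
simple_2014 = sum(sat_2014) / len(sat_2014)

print("weighted:", round(overall_1996, 1), round(overall_2014, 1))
print("simple:  ", round(simple_1996, 1), round(simple_2014, 1))
```

The weighted averages land close to the reported 510 and 515 (small discrepancies come from rounding in the table), and they rise from 1996 to 2014 even though every grade level's score falls. The plain averages of the five rows ignore how many students are at each level and would suggest a decline instead.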

Baseball Batting Averages

Data on Major League Baseball can be found on ESPN (espn.go.com).


In 2009 Robinson Cano and Jason Barlett had batting averages of .320 so we conclude
that they are equally good hitters. This is observational data. Was it a fair comparison?
Let’s stratify the data by looking at two subgroups. Choose type of playing field since the
game is somewhat different due to this factor. The hitting data for these two players on
the two types of playing fields is given in the following table. Now who is the better hitter?

                    Robinson Cano                      Jason Barlett
                    At Bats   Hits   Batting Ave       At Bats   Hits   Batting Ave
Grass field         560       181    .323              228       74     .325
Artificial Turf     77        23     .299              272       86     .316
Overall             637       204    .320              500       160    .320


When the data is combined, these batters are considered equally good (with a slight
advantage to Cano). However, Barlett has a better batting average on both grass
and turf.
I think the subgroup view is the correct one and conclude Barlett is the better hitter. At
any rate, the point here is that the initial data, over the whole season, could be misleading
since it is not a fair comparison. We spot a “hidden factor” by considering subgroups of
the data. Cano batted mostly on grass, while Barlett batted more often on turf.
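The stratified comparison can be reproduced from the at-bat and hit counts in the table. The sketch below is an illustration with hypothetical names (`at_bats_and_hits`, `batting_average`, `averages`); the counts are copied from the table. It computes each player's batting average on each surface and overall.

```python
# (at bats, hits) for each player on each type of field, from the table.
at_bats_and_hits = {
    "Cano":    {"grass": (560, 181), "turf": (77, 23)},
    "Barlett": {"grass": (228, 74),  "turf": (272, 86)},
}


def batting_average(at_bats, hits):
    """Batting average = hits divided by at bats."""
    return hits / at_bats


averages = {}
for player, fields in at_bats_and_hits.items():
    total_at_bats = sum(ab for ab, _ in fields.values())
    total_hits = sum(h for _, h in fields.values())
    averages[player] = {
        field: batting_average(ab, h) for field, (ab, h) in fields.items()
    }
    # The overall average pools all at bats, so the surface a player
    # bats on most often dominates the result.
    averages[player]["overall"] = batting_average(total_at_bats, total_hits)

for player, result in averages.items():
    print(player, {key: round(value, 3) for key, value in result.items()})
```

Pooling the at bats weights each player's two surface averages differently: Cano's overall figure is driven almost entirely by his grass at bats, while Barlett's is a nearly even mix of grass and turf. That difference in mix is what lets the subgroup comparison and the overall comparison disagree.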

Expenditures by Ethnicity

This data set was taken from the Journal of Statistics Education, Volume 22, 2014
(http://www.amstat.org/publications/jse/v22n1/mickel.pdf). It concerns a real-life scenario in which
discrimination based on ethnicity was claimed and this question was asked: “Is the
typical Hispanic receiving fewer funds (i.e., expenditures) than the typical White non-Hispanic?”

                Percent                             Average Expenditures
Age Cohort      Hispanic   White Non-Hispanic       Hispanic   White Non-Hispanic
0 - 5           12         5                        $1,393     $1,367
6 - 12          24         12                       $2,312     $2,052
13 - 17         27         17                       $3,955     $3,904
18 - 21         21         17                       $9,960     $10,133
22 - 50         11         33                       $40,924    $40,188
51 - above      5          16                       $55,585    $52,670
All Consumers                                       $11,066    $24,698

Notice in the table above that there is a big difference in the overall average amount of
expenditures between Hispanic ($11,066) and White Non-Hispanic ($24,698) consumers. It really
seems as if discrimination is present. But look at the average expenditures per
age cohort. The average expenditure for Hispanic consumers is greater than that for White
Non-Hispanic consumers in every cohort except ages 18-21. So it seems strange that the overall
average expenditure for White Non-Hispanic consumers is more than twice that for Hispanic consumers.
Now, consider the percentages. A larger share of Hispanic consumers falls in the first four
(younger) age cohorts, where expenditures are low, while a larger share of White Non-Hispanic
consumers falls in the last two (older) age cohorts, where expenditures are high. This tells us
that Simpson’s paradox is present.
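The role of the age distribution can be made concrete with weighted averages. The sketch below is an illustration with hypothetical names; the numbers are copied from the table. It recomputes each group's overall average from its own age percentages, and then, as one simple form of adjustment that is our own addition rather than part of the cited article, recomputes the Hispanic average as if Hispanic consumers had the White Non-Hispanic age distribution.

```python
cohorts = ["0-5", "6-12", "13-17", "18-21", "22-50", "51+"]

# Percent of each group's consumers in each age cohort (from the table).
pct_hispanic = [12, 24, 27, 21, 11, 5]
pct_white = [5, 12, 17, 17, 33, 16]

# Average expenditure in dollars per cohort (from the table).
avg_hispanic = [1393, 2312, 3955, 9960, 40924, 55585]
avg_white = [1367, 2052, 3904, 10133, 40188, 52670]


def weighted_average(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


overall_hispanic = weighted_average(avg_hispanic, pct_hispanic)
overall_white = weighted_average(avg_white, pct_white)

# Adjustment (an assumption for illustration): give Hispanic consumers
# the White Non-Hispanic age distribution and recompute their average.
adjusted_hispanic = weighted_average(avg_hispanic, pct_white)

cohorts_hispanic_higher = [
    c for c, h, w in zip(cohorts, avg_hispanic, avg_white) if h > w
]

print("overall Hispanic:          ", round(overall_hispanic))
print("overall White Non-Hispanic:", round(overall_white))
print("age-adjusted Hispanic:     ", round(adjusted_hispanic))
print("cohorts where Hispanic average is higher:", cohorts_hispanic_higher)
```

The recomputed overall averages come out close to, but not exactly equal to, the table's $11,066 and $24,698, presumably because the percentages in the table are rounded. Once the two groups are given the same age distribution, the large gap in favor of White Non-Hispanic consumers disappears, which points to age rather than ethnicity as the source of the overall difference.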


Additional Examples of Situations With Observational Data

• opinion polls on the Web on any issue
• value of the UNIV 1010 course at WMU
• whether using seatbelts is safer
• private vs public schools, which is better
• effects of second-hand smoke on health

EXERCISES
1. A town has two high schools; one is public and the other is private. A statewide assessment test
was given to all eleventh graders and the results showed that the private school students scored
considerably higher on the test compared to the public school students. This has been the case for
several years. People concluded from this that the private school provides a better education for its
students (they do a better job of teaching) and it is worth the extra tuition expense to obtain this good
result. Do you think this conclusion is warranted from the data? Discuss the issues.
2. Scurvy was a problem for sailors since ancient times and it became more serious by the eighteenth
century when long voyages of discovery were happening. It was not unusual for 60% of the sailors to
die from scurvy on long voyages. James Lind is credited with one of the first controlled experiments
in his attempt to find a cure and preventative. Lind thought that scurvy was due to putrefaction
of the body which could be prevented by acids; that is why he chose to experiment with dietary
supplements of acidic quality. In his experiment he divided twelve scorbutic sailors into six groups.
They all received the same diet, and in addition group one was given a quart of cider daily, group two
twenty-five drops of elixir of vitriol (sulfuric acid), group three six spoonfuls of vinegar, group four
half a pint of seawater, group five received two oranges and one lemon and the last group a spicy
paste plus a drink of barley water. The treatment of group five stopped after six days when they ran
out of fruit, but by that time one sailor was fit for duty and the other had almost recovered. Apart from
that, only group one showed some effect of its treatment. Note that through trial and error
over fifty years or so, it was finally concluded that lemon and other citrus juices in the diet could
prevent scurvy. Vitamins, such as vitamin C, were unknown at this time. Comment on good
and bad features of Lind’s experiment.
3. Western Michigan University offers a 1 credit course for freshmen, UNIV 1010, which teaches about
university resources and study habits for success in college. It is an elective course and about half of
the freshmen take it. WMU has studied the results by comparing the retention and GPAs of students
who took this class against those who did not take this class. It was found that retention and GPAs
were generally higher for those who took UNIV 1010. This evidence was put forth as proof that the
course was successful and that it should be continued. What do you think of the conclusion based
on such data?
4. Various studies have noted that children who study music and learn to play a musical instrument
do better in school in many aspects. Based on this, it is concluded that the mental skills from
this music experience also promote development in other aspects of the brain to promote higher
capability. Thus, parents and schools are encouraged to provide musical training to children to help
them succeed in other classes. Comment on this conclusion.


5. Studies have shown that older people who work crossword puzzles daily retain better mental abilities
as they age compared to those who do not. They are less likely to develop dementia or Alzheimer’s
disease. Other mental activities such as Sudoku or games seem to have the same effects. As a result
of such data, older people are encouraged to do these activities daily to avoid mental decline. It’s
the “use it or lose it” message. Is this a strong conclusion to make from such data or could there be
other forces affecting the data?
6. In a study on obesity involving thousands of women it was found that those who regularly drank diet
sodas were more likely to be obese. It was concluded that drinking diet sodas does not help women
lose weight. In fact, it may promote weight gain. Comment on this conclusion.
7. A chemistry class is offered in two sections, a regular class and an online class. Students could
choose to register for the section of their choice. Both sections took the same final exam at the end
of the term. The scores on the final exam were similar for both sections. Since student learning
seemed the same for either delivery method, the department decided to offer only online sections in
the future (lower costs). Do you think this conclusion was appropriate based on the data?
8. (Hypothetical) A recent article in a small county newspaper states that men tend to participate more
in elections than women. To support the claim, the article cites the first three columns of the table
below (District, Male Vote %, Female Vote %) for recent elections in surrounding districts. Further
investigation on your part reveals the rest of the columns.
Would you agree with the article’s assessment? Explain.

District  Male     Female   % Male of    % Female of  % participation  % participation
          Vote %   Vote %   total reg.   total reg.   of Male          of Female
                            voters       voters       reg. voters      reg. voters
1         53%      47%      60%          40%          54%              60%
2         50.1%    49.9%    57%          43%          50%              66%
3         54%      46%      56%          44%          54%              59%
4         55.1%    44.9%    54%          46%          57%              59%
5         50.8%    49.2%    58%          42%          48%              64%

REVIEW EXERCISES
9. An article in the Kalamazoo Gazette 8/26/08 entitled “Program Helps Save Knees of Female Athletes”
pertains to knee injuries of female athletes playing sports like soccer and basketball. It cited a study
published in the American Journal of Sports Medicine on the PEP program of training exercises
which promotes strength, agility and balance. The study followed 1435 players on 61 Division I soccer
teams. 26 teams were randomly assigned to use the PEP program and 35 teams did not. As a result,
the PEP teams had 41% fewer ACL injuries per team than the other teams during the season. In the
second half of the season there were no such injuries on the PEP teams, while the other teams had 5.

A. Was this a controlled experiment or an observational study?


B. Did it give solid evidence for the benefit of the PEP program?
C. Suppose the study had been different in that it just followed some teams that used the PEP
program and others that did not, with the same results (PEP teams had 41% fewer ACL injuries per
team than non-PEP teams). Would this weaken the conclusion?
10. A study was done on the benefit of two drugs. The results were:
Drug A: 300 out of 400 men were cured, 120 out of 200 women were cured.
Drug B: 270 out of 300 men were cured, 420 out of 700 women were cured.
A. For men, which drug is more effective? For women, which drug gives better results?
B. For people, overall, which is the better drug?
C. How can you explain the puzzling difference?


11. Numerous medical studies have shown that vitamin D can reduce the incidence of cancers and heart
problems. One of the strongest studies to date (1/10/08) followed 1739 members of the Framingham
Offspring Study for more than five years. They found that the rates of cardiovascular events such as
strokes and heart attacks were from 53% to 80% higher in people with low levels of vitamin D
compared to those with sufficient levels. Was this a controlled experiment? How might the study
have been improved to provide stronger evidence for its conclusion?
12. Many reports have appeared where people have found that by wearing an ionized bracelet their
muscle and joint pain was reduced. Sales are booming. In a study by the Mayo Clinic, 305 people
with such pain wore an ionized bracelet for a month while 305 others wore a placebo bracelet. The
bracelets appeared identical. Neither the participants nor the researchers knew who wore the ionized
bracelets. In evaluating pain levels, the researchers found that both groups reported significant
reduction in pain at the end of the study. They found no difference in pain reduction between the
two groups. Was this a good study? What might explain the success for those with the placebo
bracelets?
13. Here are numbers of flights on time and delayed for two airlines at five airports in June 1991. The
table shows that Alaska Airlines outperforms America West at all five cities.

                Alaska Airlines                 America West Airlines
                On time   Delayed   Delay%      On time   Delayed   Delay%
LA              497       62        11.1%       694       117       14.4%
Phoenix         221       12        5.4%        4840      415       7.9%
San Diego       212       20        8.6%        383       65        14.5%
San Fran.       503       102       16.9%       320       129       28.7%
Seattle         1841      305       14.2%       201       61        23.3%
Total           3274      501       13.3%       6438      787       10.9%

Which airline would you say is doing better regarding “on time” flights? Why?
14. The local newspaper examined the town’s two hospitals and found that over the last six months at
Hope Hospital, 79% of the patients survived, while at City Hospital 90% survived. The table below
summarizes the findings.

                 Lived   Died   Total   % who lived
HOPE HOSPITAL    790     210    1000    79.0%
CITY HOSPITAL    900     100    1000    90.0%

On closer investigation it was observed that the patients were categorized upon admission as being
in fair (or better) condition or in poor (or worse) condition. When the survival rates were examined
for these groups, the following tables emerged:

Patients admitted in fair condition or better:

                 Lived   Died   Total
HOPE HOSPITAL    580     10     590
CITY HOSPITAL    860     30     890

Patients admitted in poor condition or worse:

                 Lived   Died   Total
HOPE HOSPITAL    210     200    410
CITY HOSPITAL    40      70     110

Which hospital appears to be doing better? Why?

Chapter 4

Issues Related To Statistical Studies

Key Stages of a Study


• Assignment/choice of individuals who will participate
– Select them in some way from a larger population of individuals.
– Individuals may be people, plants, animals, cars, stores, farms, seeds, batteries, TVs, etc.; that is, any
objects we care to study.
• Carry out the mechanics/details of the study
– Procedures are carefully specified.
• Assess/measure the results, get the data
– This includes the choice of what to measure and how to do it well.
• Analyze the data by some appropriate methods
– This may be simple calculations or highly complex computer work.
• Draw conclusions/interpretations from the analysis results
– Determine what the data say and assess the strength of the conclusions.
• Extrapolate the study results to individuals not in the study
– This is subjective but highly important; errors are frequent here.

An important consideration in a study of an issue is to make a comparison. Set up two groups and compare
their responses in some way. The conclusion is a statement comparing the two groups.

Retrospective (Case-Control) Studies


The treatment group: a group of individuals is chosen who have the response we are interested in.
The control group: another group of individuals is chosen who do not have the response of interest. Ideally,
this group is similar in relevant characteristics to the treatment group.

The point is that the response we focus on has already occurred for some individuals and now we will
look back in time to identify some characteristic that may have had an influence or effect on this outcome.
We seek some characteristic present earlier in time for the treatment group but not present for the control
group.


Example
• Select a group of women who have uterine cancer.
• Select another group of women who do not have uterine cancer but are otherwise similar (this is not
easy to do well). For all these women, determine if they previously took estrogen (and, if so, how
much).
• Calculate the percent of estrogen users among each group. Compare these two percentages and
interpret any difference observed. For example, if the treatment group had a larger percentage of
estrogen users than the control, this may be evidence of a causal relationship.
• Consider whether the results seen for the women in the study apply to all women or at least a large
group of women who did not take part in the study.

Example
• Select a group of 400 small businesses that failed in 1998 and another group of 400 small businesses
of similar types that did not go out of business in 1998. The response here is “out of business”.
• Then look back in time at characteristics of these businesses at startup, perhaps several measures
of their financial stability.
• Compare these measures for one group against the other. Differences found may identify causes of
business failure.
• Generalizing to all startup small businesses in the future, we formulate rules to follow to avoid this
problem.

Prospective (Cohort) Studies


This type of study begins before the individuals have developed the response of interest. We look forward
in time. We select two groups to follow over time; one group has a characteristic of interest (perhaps a
treatment) and the other does not (controls).
Prospective studies can carry on over some period of time, often several years. This may require considerable
effort and expense to monitor subjects and adhere to protocols.

Example
• Paralleling the earlier example, suppose we select a group of women who are using estrogen and
another group of women who are not.
• The two groups are monitored over time to record who develops uterine cancer. Calculations of the
percentages developing uterine cancer are made for each group.
• These percentages are compared to see the difference in this regard. We are looking to see if the
estrogen group has a higher percentage, which would be evidence that estrogen use may cause
uterine cancer.
• Generalize to all women who may use estrogen.


Example
• Select a group of 1000 low income families whose preschool children are in a government sponsored
Head Start program.
• Select another group of 1000 families whose preschool children are not enrolled in such a program.
• Measure the success in school for all these children, say at the second and fourth grades. Calculate
various averages to compare the school success for the two groups.
• Differences found would be attributed to the effect of the Head Start program.
• Generalize to wider populations of students.

Designed Experiments (Clinical Studies)


Like the prospective studies, individuals are followed forward in time. There are two groups, a treatment
group and a control group. A special feature here is in the method by which individuals are chosen for the
two groups. It is done by some randomization device. We start with a group of individuals and assign them
to the treatment and control groups at random. The purpose of this is to avoid any source of bias in forming
the two groups. It is possible that after the randomization the two groups could differ on some important
characteristics, but the chance of this happening is low. An additional feature of the plan is to make the
experiment “blind”. This means that individuals do not know which group they are in. The control group is
treated the same as the treatment group in all outward appearances. Among other things, we are trying to
avoid bias due to the placebo effect. When the response is measured by a person, for example a doctor who
makes a diagnosis, it is a good practice to make the experiment “double-blind”. This means that the doctor
does not know the group membership of any individual. We want both groups to be formed and handled in
the same way so that they differ only in the respect that one has the treatment and the other does not.

Example
• In a study to explore the effect of estrogen on the development of uterine cancer we may begin with
a group of 200 women.
• By some randomization device they are divided into two groups of size 100. One group (TMT) will
take a pill containing estrogen while the other group (CTL) will take a dummy pill with no estrogen.
• The groups will be followed for say 10–20 years. The occurrence of uterine cancer over the course of
the study can be determined and the percentages can be compared for the two groups. A higher rate
of uterine cancer for the treatment group would be interpreted as evidence that estrogen use may
cause uterine cancer.
• Of course, we want to generalize the conclusions to a wide population of women. Note the difficulties
that arise in continuing the experimental protocol over such a long time period.
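The random division into groups described above can be sketched in a few lines of Python. This is only an illustration: the participant ID numbers, the seed value, and the variable names are assumptions made for the sketch, not part of the study description.

```python
import random

# Hypothetical ID numbers 1..200 for the 200 women in the study.
participants = list(range(1, 201))

# A fixed seed (an arbitrary choice) makes the split reproducible.
rng = random.Random(3660)
shuffled = participants[:]
rng.shuffle(shuffled)

# The first 100 shuffled IDs form the treatment group, the rest the control group.
treatment = sorted(shuffled[:100])  # TMT: takes the estrogen pill
control = sorted(shuffled[100:])    # CTL: takes the dummy pill

print("TMT size:", len(treatment))
print("CTL size:", len(control))
print("First few TMT IDs:", treatment[:5])
```

Because the shuffle, not the investigator, decides who lands in which group, no personal judgment can bias the composition of the two groups.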


Example
• Consider an experiment to study the effect of a new package for a cereal on sales. Say 100 grocery
stores are selected for the study.
• We randomly divide these into two groups of size 50. One of the groups will stock the cereal with the
new package while the other group will stock the cereal with the old package.
• Sales of the cereal will be recorded over a three month period. Average sales for the two groups will
be compared. If the new package group has significantly higher average sales we would conclude
that the new package is effective in increasing sales.
• Generalizing to all stores, the company may decide to switch to the new package, incurring the costs
and troubles associated with such a move. The decision is sensitive in that the company would want
solid evidence of the benefits before risking the change.

Other Related Issues


What outcome should we measure? There are often several possibilities and the choice made could affect
the results. Are we measuring the right thing?
How is the response to be measured? Care needs to be taken on this matter. For example, self reports can
be inaccurate, recall of past events is unsure, subjective judgments are erratic, questionnaire results are
influenced by the format of the question, measurement errors may be present, there may be investigator
bias, missing data problems can arise, individuals may drop out of the study, etc.
The two groups may differ on some important characteristic that is related to the response (selection bias).
If present, such “confounding variables” can erroneously affect the conclusions reached. There are ways
to handle (or adjust for) confounding variables, such as stratification, multiple regression methods, or
analysis of covariance methods.
The possibility of using matched pairs, one-to-one matching of an individual in the treatment group with
an individual in the control group, is sometimes useful. Individuals are matched so that they are similar on
characteristics that are judged important.

Chapter 5

Plots

Variables
Before we discuss different types of plots, we must first discuss variables. A variable is a characteristic
we seek to gain information on from the subjects under study. The values of a variable change from subject
to subject. Examples are hair color and age.
There are two types of variables to consider: categorical and numeric. Numeric variables are ones in which
the response is a number, such as age. Categorical variables are ones in which the response falls in a
category, such as hair color.
Once we have identified which type of variable we have, we can make a plot. The two plots for
categorical variables are the pie chart and the bar graph. The plots for numeric variables covered here are
the dotplot, the histogram, and the boxplot.

Pie Charts
A pie chart is a circular graph which is divided into slices. The slices represent the percentage distribution
or the relative contribution that different categories contribute to an overall total.
The table below summarizes the number of breast cancer cases for women by age and race in Michigan
for the year 2011. Source: Michigan Resident Cancer Incidence File. Division for Vital Records & Health
Statistics, Michigan Department of Health & Human Services.
Age Group Race Cases
below 50 White 1065
50-64 White 2222
65-79 White 2079
above 80 White 769
below 50 Black 231
50-64 Black 372
65-79 Black 233
above 80 Black 98


This tabular data can be visualized using a pie chart. Consider only the number of cases per age group.
Age Group Cases Relative Frequency Percentage
50-64 2594 0.37 37
65-79 2312 0.33 33
above 80 867 0.12 12
below 50 1296 0.18 18
Grand Total 7069 1 100
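The relative frequencies and percentages in this table can be reproduced from the counts in the earlier table. The sketch below is one way to do it; the variable names are illustrative choices.

```python
# Case counts by age group and race, copied from the table of breast cancer cases.
cases = [
    ("below 50", "White", 1065),
    ("50-64", "White", 2222),
    ("65-79", "White", 2079),
    ("above 80", "White", 769),
    ("below 50", "Black", 231),
    ("50-64", "Black", 372),
    ("65-79", "Black", 233),
    ("above 80", "Black", 98),
]

# Add the two races together within each age group.
totals = {}
for age_group, _race, count in cases:
    totals[age_group] = totals.get(age_group, 0) + count

grand_total = sum(totals.values())

# Relative frequency = group count / grand total; percentage = 100 * relative frequency.
rel_freq = {group: count / grand_total for group, count in totals.items()}

for group, count in totals.items():
    print(f"{group:10s} {count:5d} {rel_freq[group]:.2f} {round(100 * rel_freq[group]):3d}")
print(f"{'Grand Total':10s} {grand_total}")
```

Each relative frequency is also the fraction of the circle given to that age group's slice.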

Now we will draw a pie chart for this data using age groups as slices of our pie. A slice of the circle
represents each age group’s contribution to the overall number of breast cancer cases in Michigan for year
2011. Note that the counts can also be used instead of percentages.

Bar Graphs
Bar graphs can be used instead of pie charts. Instead of slices in a circle, we use bars. For
illustration, let us use the same data as above. Now we draw a bar graph. Each bar represents an age
group, and the height of each bar is that group's percentage.


A bar graph of the number of cases per age group for the two races can also be done (see the graph below).
Now you can see that the heights of the bars can also be the number of cases.


Dotplot
The dotplot is a useful graph of numeric data. It is produced by placing a dot at the location of each data
value on a number scale. It is simple in nature, easily understandable, and quick to produce by hand. It is
best when there is only a modest amount of data, say no more than 100 observations.
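As a rough illustration of how a dotplot is built, the sketch below stacks one "o" per data value above a number scale. The quiz scores are hypothetical data invented for the illustration, the function name is an arbitrary choice, and the sketch assumes whole-number values.

```python
from collections import Counter

def dotplot_lines(values):
    """Return the rows of a text dotplot (top row first), ending with the axis row."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    scale = range(lo, hi + 1)                       # one column per whole number
    width = max(len(str(lo)), len(str(hi))) + 1     # room for the widest label
    rows = []
    # Build from the tallest stack down to the first level of dots.
    for level in range(max(counts.values()), 0, -1):
        rows.append("".join(("o" if counts[v] >= level else "").rjust(width) for v in scale))
    rows.append("".join(str(v).rjust(width) for v in scale))
    return rows

# Hypothetical quiz scores, used only to show the idea.
scores = [3, 5, 5, 6, 6, 6, 7, 7, 8, 10]
for row in dotplot_lines(scores):
    print(row)
```

The stacks of dots show at a glance where the values cluster and where gaps occur.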

Histograms
Given a set of data (imagine a list of numbers, either large or small), we can draw a figure called a “histogram”
to display the data. The figure provides a graphical view of the data. It is essentially a sequence of
rectangles where the bottom axis records the values of the data and the height of the rectangles represents
the frequency or intensity of occurrence. The histogram helps us to understand the data, since we quickly
see the lower values, the higher values, the middle values, and the frequency of occurrence of the values.
The process of creating a relative frequency histogram includes:

1. Choose the number of intervals


2. Choose the location of the intervals (goal is to choose nice numbers for the boundaries)
3. Tally frequencies for the intervals (frequency of an interval is the number of data values in it)
4. For each interval, draw a rectangle over it with height equal to its frequency (or relative frequency)

Sometimes the relative frequencies are expressed as a percentage (multiply the relative frequency by 100).
The choice on the number of intervals is tricky. The histogram is too crude and uninformative if there are
too few intervals. It is too erratic and hard to interpret if there are too many intervals. On reviewing the
first attempt, if it appears unsatisfactory for some reason we may repeat the process with fewer or more
intervals.

Example 1
Ages of employees
30 47 48 51 40 49 62 55 45 57 54 45
60 54 48 59 37 43 52 57 53 32 42 53
53 50 51 53 43 56 53 50 38 55 52 55
56 63 59 56 63 58 57 61 61 60 49 61
54 58


Histogram: Using “nice” boundaries of 30, 35, 40, 45, 50, 55, 60, 65, we can calculate relative frequencies
and draw the rectangles. The result is the following:

Interval Frequency Relative Frequency
30–35 2 0.04
35–40 2 0.04
40–45 4 0.08
45–50 7 0.14
50–55 14 0.28
55–60 13 0.26
60–65 8 0.16
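The tallying step for Example 1 can be sketched as follows. The sketch assumes that a value falling on a boundary is counted in the interval to its right, the convention adopted in this chapter; the function name tally is an illustrative choice.

```python
# Ages of employees from Example 1.
ages = [
    30, 47, 48, 51, 40, 49, 62, 55, 45, 57, 54, 45,
    60, 54, 48, 59, 37, 43, 52, 57, 53, 32, 42, 53,
    53, 50, 51, 53, 43, 56, 53, 50, 38, 55, 52, 55,
    56, 63, 59, 56, 63, 58, 57, 61, 61, 60, 49, 61,
    54, 58,
]
boundaries = [30, 35, 40, 45, 50, 55, 60, 65]

def tally(values, cuts):
    """Count the values in each interval [cuts[i], cuts[i+1]).

    A value equal to a boundary is counted in the interval to its right.
    """
    freqs = [0] * (len(cuts) - 1)
    for v in values:
        for i in range(len(freqs)):
            if cuts[i] <= v < cuts[i + 1]:
                freqs[i] += 1
                break
    return freqs

freqs = tally(ages, boundaries)
n = len(ages)
for i, f in enumerate(freqs):
    # Relative frequency = frequency / number of data values.
    print(f"{boundaries[i]}-{boundaries[i + 1]}  {f:2d}  {f / n:.2f}")
```

Changing the list of boundaries and rerunning the tally is a quick way to try fewer or more intervals.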


Example 2
The following data is the white blood cell count for 107 people involved in a study:
Begin WBC
6.7 3.6 7.1 7.7 6.2 9.1 5.3 5.4 6.8 7.6
4.0 7.2 6.9 7.3 6.2 5.6 6.4 5.0 7.6 9.7
5.7 8.9 4.7 5.1 9.3 6.0 7.3 6.9 7.7 6.8
6.3 7.8 5.2 6.2 6.2 4.5 6.6 5.3 4.2 11.2
5.1 6.4 6.6 5.8 7.1 6.9 4.6 4.5 5.7 10.1
8.4 7.8 7.6 5.5 5.1 4.3 5.0 7.0 9.9 5.6
4.5 6.0 8.3 9.9 9.8 9.5 4.5 3.6 7.5 4.7
6.0 8.1 8.7 5.5 6.7 6.5 5.5 6.4 3.9 8.1
5.6 7.3 5.7 5.1 9.1 5.8 6.0 5.2 10.2 6.5
4.8 7.4 7.7 5.1 5.3 6.4 5.5 8.7 9.9 10.4
5.7 6.7 7.1 4.2 5.7 8.1 4.5

First, we will draw a histogram for this data using intervals defined by the cutoff points 3, 4, 5, 6, 7, 8, 9, 10,
11, 12. We tally the frequencies of these intervals into a table (adopt the convention that if a value falls on
the boundary between two intervals it is counted in the interval to the right):
Interval Frequency Relative Frequency
3 to 4 3 0.028
4 to 5 13 0.121
5 to 6 27 0.252
6 to 7 25 0.234
7 to 8 18 0.168
8 to 9 8 0.075
9 to 10 9 0.084
10 to 11 3 0.028
11 to 12 1 0.009

Now the rectangles for the histogram are drawn over the intervals so that height is the relative frequency
(percent). See the result below:


Example 3
Suppose we have data on the final grades (on a 100 point scale) for a class of students. A histogram is drawn
for these scores below. Note the information that we can see from the histogram on the percentage for the
intervals. The general shape of the data is also apparent. Note the longer tail on the left.

We can roughly answer questions such as

• what percent score over 90?


• what percent score less than 60?


Example 4: Driving times


The following data was collected on the time it took to drive from point A to point B. Time is measured in
seconds.

Time
218 164 260 168 194 262 188 289 237 281
223 255 200 213 225 232 204 300 256 206
171 216 261 304 314 265 195 204 199 175
192 177 232 261 273 266 214 197 177 183
199 212 217 212 253 216 176 216 206 230
310 267 193 251 194 230 198 279 180 183
251

Summary Statistics
n = 61
mean = 225
median = 216
SD = 38.69
Q1 = 194.5
Q3 = 258
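Some of these summary statistics can be checked with Python's standard statistics module. In this sketch the SD is computed with pstdev, which divides by n; that choice is an assumption, made because it appears to match the SD reported above.

```python
import statistics

# Driving times (seconds) from Example 4.
times = [
    218, 164, 260, 168, 194, 262, 188, 289, 237, 281,
    223, 255, 200, 213, 225, 232, 204, 300, 256, 206,
    171, 216, 261, 304, 314, 265, 195, 204, 199, 175,
    192, 177, 232, 261, 273, 266, 214, 197, 177, 183,
    199, 212, 217, 212, 253, 216, 176, 216, 206, 230,
    310, 267, 193, 251, 194, 230, 198, 279, 180, 183,
    251,
]

n = len(times)
mean = statistics.mean(times)
median = statistics.median(times)
sd = statistics.pstdev(times)  # divides by n (assumed to be the form used in the summary)

print("n =", n)
print("mean =", round(mean))
print("median =", median)
print("SD =", round(sd, 2))
```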

A histogram of this data is as follows. The erratic shape indicates we should redo the histogram using
fewer intervals to get a smoother shape.

Note the horizontal axis labels the midpoint of the intervals rather than the cutoff points. Sometimes this is
done.


Example 5: Final Exam Scores


The following shows two histograms of final exam scores for 63 students in a statistics class. The first
uses the relative frequency scale and the second uses the density scale.

Boxplot
The boxplot or box-and-whisker plot is another useful descriptive plot that can be used to visualize numeric
variables. The five-number summary, which consists of MIN, Q1, MED, Q3, and MAX, is used to produce a boxplot.


The process of creating a boxplot.

• Order the data set from the smallest to the largest.


• Find the MIN, which is the smallest value and the MAX, which is the largest value.
• Find the first quartile (Q1), which is the upper boundary of the first quarter (25% of the data is below
Q1). It is the 0.25(n + 1)st ordered observation.
• Find the median (MED), which is the middle value of the data set (50% of the data is below the median).
It is the ((n + 1)/2)th ordered observation.
• Find the third quartile (Q3), which is the upper boundary of the third quarter (75% of the data is below
Q3). It is the 0.75(n + 1)st ordered observation.
• If any of these positions (for Q1, MED, or Q3) is not an integer, average the two adjacent ordered
values to get the respective quartile.
• Draw a horizontal axis covering data range.
• Draw a box with edges at Q1 and Q3 .
• Draw within the box, a line located at median (MED).
• Draw “fences” (lines) at the MIN and MAX.
• Draw “whiskers” extending from the edges of the box to the MIN and MAX.

Example 6: STAT 3660 midterm I score


The following data set contains STAT 3660 midterm I scores for 31 students.
88 76 100 60 68 68 56 80 60 72
100 64 80 76 96 68 72 88 88 92
64 80 96 64 84 100 92 68 92 96
92
Ordered data set:
56 60 60 64 64 64 68 68 68 68
72 72 76 76 80 80 80 84 88 88
88 92 92 92 92 96 96 96 100 100
100
MIN = 56 and MAX = 100.
Q1 is the 0.25(n + 1)st ordered observation. 0.25(n + 1) = 0.25(31 + 1) = 0.25(32) = 8. Therefore, Q1 is the 8th ordered
observation. Q1 = 68.
MED is the ((n + 1)/2)th ordered observation. (n + 1)/2 = (31 + 1)/2 = 32/2 = 16. The median is 80, which is the 16th
ordered observation.
Q3 is the 0.75(n + 1)st ordered observation. 0.75(n + 1) = 0.75(31 + 1) = 0.75(32) = 24. Therefore, Q3 is the 24th
ordered observation. Q3 = 92.
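The five-number summary for this example can be checked with a short Python sketch of the (n + 1) position rule used in this chapter. The helper names ordered_value and five_number_summary are hypothetical, and the requirement of at least three data values is an assumption added so that every position falls inside the data.

```python
import math

def ordered_value(sorted_data, position):
    """Value at a 1-based position in sorted data.

    A non-integer position averages the two adjacent ordered values.
    """
    lower = math.floor(position)
    upper = math.ceil(position)
    return (sorted_data[lower - 1] + sorted_data[upper - 1]) / 2

def five_number_summary(data):
    """MIN, Q1, MED, Q3, MAX using the 0.25(n+1), (n+1)/2, 0.75(n+1) positions."""
    s = sorted(data)
    n = len(s)
    if n < 3:
        raise ValueError("this sketch assumes at least three data values")
    return {
        "MIN": s[0],
        "Q1": ordered_value(s, 0.25 * (n + 1)),
        "MED": ordered_value(s, (n + 1) / 2),
        "Q3": ordered_value(s, 0.75 * (n + 1)),
        "MAX": s[-1],
    }

# STAT 3660 midterm I scores from Example 6.
scores = [
    88, 76, 100, 60, 68, 68, 56, 80, 60, 72,
    100, 64, 80, 76, 96, 68, 72, 88, 88, 92,
    64, 80, 96, 64, 84, 100, 92, 68, 92, 96,
    92,
]

summary = five_number_summary(scores)
for name, value in summary.items():
    print(name, "=", value)
```

Other software may use slightly different quartile rules, so its quartiles can differ a little from the ones computed by hand here.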


An outlier is an observation point that is distant from other observations. Observations below Q1 − 1.5(Q3 −
Q1 ) or above Q3 + 1.5(Q3 − Q1 ) are considered as outliers.

Example 7: Weight of newborn babies


This data set contains weights of 40 newborn babies.
8.2 6.7 8.88 7.68 8.37 8.74 8.11 8.24 6.98 8.78
7.99 7.96 8.86 8.62 8.91 8.85 8.69 7.84 8.65 7.82
8.27 8.77 8.77 8.18 7.98 7.80 8.68 7.97 6.83 7.63
8.87 8.49 8.89 8.5 7.35 8.49 7.24 8.52 7.01 5.3
Ordered data set:
5.3 6.7 6.83 6.98 7.01 7.24 7.35 7.63 7.68 7.8
7.82 7.84 7.96 7.97 7.98 7.99 8.11 8.18 8.2 8.24
8.27 8.37 8.49 8.49 8.5 8.52 8.62 8.65 8.68 8.69
8.74 8.77 8.77 8.78 8.85 8.86 8.87 8.88 8.89 8.91
Q1 is the 0.25(n + 1)st ordered observation. 0.25(n + 1) = 0.25(40 + 1) = 0.25(41) = 10.25. Therefore, Q1 is the average
of the 10th and 11th ordered observations. Q1 = (7.8 + 7.82)/2 = 7.81.
MED is the ((n + 1)/2)th ordered observation. (n + 1)/2 = (40 + 1)/2 = 41/2 = 20.5. The median is the average of the 20th
and 21st ordered observations. MED = (8.24 + 8.27)/2 = 8.255.
Q3 is the 0.75(n + 1)st ordered observation. 0.75(n + 1) = 0.75(40 + 1) = 0.75(41) = 30.75. Therefore, Q3 is the average
of the 30th and 31st ordered observations. Q3 = (8.69 + 8.74)/2 = 8.715.
Outliers:
Q1 − 1.5(Q3 − Q1) = 7.81 − 1.5(8.715 − 7.81) = 6.4525.
Q3 + 1.5(Q3 − Q1) = 8.715 + 1.5(8.715 − 7.81) = 10.0725.
Any value below 6.4525 or above 10.0725 is considered an outlier. Therefore, 5.3 is an outlier.
When we find the minimum and maximum, we omit outliers. MIN = 6.7 and MAX = 8.91.
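The outlier check in this example can be sketched in Python as follows. The data are the ordered values as printed in the example; the helper name ordered_value and the variable names are illustrative choices.

```python
import math

# Ordered weights of the 40 newborn babies, as listed in Example 7.
weights = [
    5.3, 6.7, 6.83, 6.98, 7.01, 7.24, 7.35, 7.63, 7.68, 7.8,
    7.82, 7.84, 7.96, 7.97, 7.98, 7.99, 8.11, 8.18, 8.2, 8.24,
    8.27, 8.37, 8.49, 8.49, 8.5, 8.52, 8.62, 8.65, 8.68, 8.69,
    8.74, 8.77, 8.77, 8.78, 8.85, 8.86, 8.87, 8.88, 8.89, 8.91,
]

def ordered_value(sorted_data, position):
    """Value at a 1-based position; a non-integer position averages the two neighbors."""
    lower = math.floor(position)
    upper = math.ceil(position)
    return (sorted_data[lower - 1] + sorted_data[upper - 1]) / 2

s = sorted(weights)
n = len(s)
q1 = ordered_value(s, 0.25 * (n + 1))
q3 = ordered_value(s, 0.75 * (n + 1))

# Cutoffs: 1.5 times the distance between the quartiles, measured from each quartile.
low_cutoff = q1 - 1.5 * (q3 - q1)
high_cutoff = q3 + 1.5 * (q3 - q1)

outliers = [x for x in s if x < low_cutoff or x > high_cutoff]
kept = [x for x in s if low_cutoff <= x <= high_cutoff]

print("Q1 =", round(q1, 3), " Q3 =", round(q3, 3))
print("cutoffs:", round(low_cutoff, 4), round(high_cutoff, 4))
print("outliers:", outliers)
print("MIN =", min(kept), " MAX =", max(kept))
```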


EXERCISES
1. Big 3 Jobs in Michigan, 1995 (From the Detroit Free Press, 2-26-96).
Draw both histograms for the age distributions of Big 3 workers.

Age Hourly Workers Salaried Workers


18-21 0.5% 0.1%
21-31 6.1% 13.3%
31-41 20.4% 26.9%
41-51 43.3% 33.6%
51-61 25.4% 23.7%
61-65 3.5% 2.1%
65-75 0.8% 0.3%


2. Customers at a new car dealership had been complaining about brake problems with their new cars
in the first three years. The dealership randomly selected 50 records on brake work from its files and
found the following classification.

Cause Frequency
master cylinder 2
wheel cylinder 5
brake lines 11
brake plate 18
brake pads 6
springs 5
other 3

Make a Pareto plot (a histogram-like figure where the rectangles are arranged in decreasing size).
The Pareto principle says that a few causes will generally be causing the majority of the problems.
To fix problems one should identify these few causes and concentrate the “fixing” efforts on these.
What would you conclude in this case?

3. The fuel cost (cents per mile) for trucks carrying milk from farms to dairy plants is given below for
36 trucks. Construct a histogram of the data using three intervals: 0–10, 10–15, 15–30. Make sure to
label your axes. Data:

16.44 9.92 11.20 13.50 29.11 7.19 4.24 14.25 7.51 10.25 12.17 10.18
12.34 26.16 16.93 10.32 9.70 9.49 13.70 15.86 12.49 13.32 12.68 9.90
11.11 10.24 8.88 8.51 12.95 14.70 8.98 12.72 8.22 8.21 9.18 17.32

4. Draw a histogram of the following data. Make sure to use the relative frequency and label your axes
properly. Height (inches) of 12-year olds:

48 48 49 49 50 51 51 51 53 53
54 54 55 56 57 57 57 59 59 59
59 60 60 60 60 60 61 61 62 62
62 62 63 63 64 64 66 66 67 68

Use the following intervals: 48–50, 51–53, 54–56, 57–58, 59–60, 61–62, 63–64, 65–68.

5. Data was collected on the cost of a year of college for 800 freshmen. A data summary is shown in the
table below:
Cost ($) Percent
$0 - 5000 10%
$5000 - 7500 50%
$7500 - 10000 20%
$10000 - 15000 10%
$15000 - 25000 10%

Draw a histogram for the data and carefully label both axes.


6. For a sample of students at a college, their GPAs were tallied in the following table. Draw a histogram
for this data. Axes should be labeled properly.

GPA Frequency
0 – 1.99 100
2.00 – 2.99 550
3.00 – 3.49 250
3.50 – 4.00 100

7. Below is a histogram of the number of fries that come in a small order at a fast food restaurant.

(a) What percent of small orders has 30 to 32 fries in them (% in interval 26-28)?
(b) What percent of small orders has at least 28 fries in them?


8. (Hypothetical) The following is a histogram of the number of times per day freshman college students
check their email.

(a) What percent of students check their email 4 to 6 times per day (% in interval 4–6)?
(b) What percent of students check their email at most 6 times per day?

9. Data on the number of times a toddler visits a pediatrician in a year is summarized in the boxplot
below:

(a) What is the shape of the distribution of the number of visits per year based on the boxplot?
(b) What is the value below which 50% of the numbers of visits fall?


REVIEW EXERCISES
10. An age distribution of people in the U.S. is shown below. Draw the histogram.

Age % of Population Age % of Population


0-5 7 35-45 15
5-15 14 45-55 14
15-20 7 55-65 10
20-25 7 65-75 6
25-30 7 75-95 6
30-35 7 95 and over 0

Answer the following questions based upon your histogram.


(a) Are there more children age 0 to 5 or elders age 65 to 75?
(b) Are there more 20 to 25-year-olds or 55 to 75-year-olds?
(c) Is the percentage of people age 35 and over around 24%, 51%, or 75%?
(d) Why is the percentage of people over the age of 95 zero?

11. A group of 20 applicants for a professional fire fighter position ran an obstacle course. The individual
times, in seconds, were as follows:

37, 39, 39, 40, 40, 41, 42, 44, 44, 45, 45, 46, 49, 50, 51, 52, 52, 55, 56, 56

Make a histogram of the data with the following cutoffs: 37, 40, 45, 50, and 57. Use the relative
frequencies as your vertical axis.

12. The following data summarizes the ages of 500 cars in a city.

Age of Car (years) Frequency


0–2 100
2–4 200
4–6 100
6 – 10 50
10 – 20 50

Draw a histogram using the relative frequency scale. Label each axis carefully.


13. (Hypothetical) Below is a histogram of money (in dollars) spent per week for food and beverages
between meals for college students.

What percent of students spend


• $20–25/week?
• $10–15/week?
• more than $30/week?

14. (Hypothetical) Below is a histogram of the number of meals eaten with at least one parent, per week,
by high school seniors attending public schools.

How many seniors eat


• at least 12 meals per week with at least one parent?
• no more than 10?
• less than 4?


15. A result of an occupied housing survey for the US is shown below. Draw a histogram for each of the
two distributions.

# of Rooms in Unit Owner-occupied (%) Renter-occupied (%)


1 0.0 1.0
2 0.1 3.0
3 1.6 22.5
4 9.5 33.8
5 22.5 23.3
6 27.2 10.8
7 17.0 3.2
8 10.9 1.0
9 5.4 0.7
10 4.3 0.6
11 1.5 0.1
12 0.2 0.0
Total % 100.2% 100.0%
Total Number 72.2 million 33.6 million
(a) The owner-occupied percents add up to 100.2% while the renter-occupied percents add up to
100.0%. How could this happen?
(b) The percentage of one-room units is much smaller for owner-occupied housing. Is that because
there are so many more owner-occupied units in total? Briefly explain your answer.
(c) Which are larger, on the whole: the owner-occupied units or the renter-occupied units?

16. The data below shows the ages at which elderly patients first experienced symptoms of Alzheimer’s disease.

65 66 67 67 68 68 68 69 69 70 70 70 73 75

(a) Calculate the 5-number summary for the dataset.


(b) Draw the boxplot using the values obtained in (a).
(c) Describe the shape of the distribution of age based on the boxplot.
(d) What is the value above which 75% of the ages fall?
(e) What is the value below which 75% of the ages fall?

Chapter 6

Statistics For Center and Spread of Data

At some point in a study we have data, a list of numbers on a variable of interest. To make sense of the data
and gain some understanding on its behavior, we need some summary measures that capture important
features of the data. We will concentrate on numerical measures of center and spread of the data.

Mean or Average
Given a set of data, say it’s represented by

x1 , x2 , ..., xn

or just generically by the symbol x, we measure the center (location, middle) by the statistic

x̄ = (∑x)/n, the mean or average of the data

Example
Consider the small set of data 1, 2, 3, 4, 5, 6 . The mean is
x̄ = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5
Clearly, this number measures the “center” of the data.
Consider modifying the data above to 1, 2, 3, 4, 5, 21. The last value is now considerably above the others.
The mean now is
x̄ = (1 + 2 + 3 + 4 + 5 + 21)/6 = 6
Changing the one value to be an “outlier” had a big effect on the mean. In general, the mean measures the
center of the data, but it is sensitive to and influenced by the extreme values on either end.
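The two calculations above are easy to reproduce. The short sketch below computes the mean of both data sets; the function name mean and the variable names are illustrative choices.

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

original = [1, 2, 3, 4, 5, 6]
with_outlier = [1, 2, 3, 4, 5, 21]   # the last value changed to an outlier

print("mean of original data:", mean(original))
print("mean with the outlier:", mean(with_outlier))
```

Only one of the six values changed, yet the mean moved from 3.5 to 6, above five of the six data values.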
There is a useful physical interpretation of the mean as a balance point. Imagine taking a thin rod with a
number scale marked on it and for each number in the data set place a weight on the rod at its location. All
weights are the same. Now ask where this entity, rod with weights, will balance on an edge.


--O--O-O--O-O-O--O--------O--O--
That the balance point will be at the mean is a simple principle from Physics. This interpretation of the
mean as a balance point is helpful in understanding how the mean is at the “center”. It also helps in
understanding why the mean is heavily affected by outliers, since a weight near an end on the rod exerts
more force than one in the middle.
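The balance point idea can be checked numerically: the distances of the values below the mean add up to the same total as the distances of the values above it. The sketch below uses the data 1, 2, 3, 4, 5, 21 from the earlier example; the variable names are illustrative.

```python
data = [1, 2, 3, 4, 5, 21]
x_bar = sum(data) / len(data)

# Signed distance of each weight from the balance point.
deviations = [x - x_bar for x in data]

# Total distance on each side of the mean (equal weights, so distance measures the pull).
left_pull = sum(-d for d in deviations if d < 0)
right_pull = sum(d for d in deviations if d > 0)

print("mean:", x_bar)
print("total distance on the left:", left_pull)
print("total distance on the right:", right_pull)
```

The single value 21 sits far from the mean and balances the other five values by itself, which is why an outlier has so much influence on the mean.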
While discussing the mean, it is important to note the difference between a population mean (µ) and a
sample mean (x̄). A population is the entire set of individuals you are interested in. The sample is a subset
of the population. For example, suppose you are interested in the average height of all men in the United States;
these men are your population. You take a sample of men in the US, measure their heights, and compute the
average. The average height computed from your sample is the sample mean, x̄, whereas
the average height computed from the whole population is the population mean, µ.

Median
Another measure of the center, location, middle of the data is the median, defined as the middle value when
the data is arranged in increasing order. The median is the 50th percentile. Loosely speaking, the median
measures the middle in the sense that half of the data is below and half of the data is above the median (or
at least as close to this idea as is possible).
When the data set contains an odd number of values in increasing order, the median is the middle one.
When the data set contains an even number of values in increasing order, there will be two middle values
and the median is taken to be the average of these two middle ones.

Example
Consider the small set of data 1, 2, 3, 4, 5, 6. The median is the average of the two middle values,

$$\text{Median} = \frac{3 + 4}{2} = 3.5$$

Clearly, this number measures the “center” of the data.
Consider modifying the data above to 1, 2, 3, 4, 5, 21. The last value is now considerably above the others.
The median now is

$$\text{Median} = \frac{3 + 4}{2} = 3.5,$$

the same as above.
The outlier in this second data set had no effect on the median. In general, the median is not influenced by the
presence or location of outliers. This is a desirable property. We say the median is robust, meaning that it
is not affected by outliers.
When the data is right-skewed, with a longer tail on the right, the mean tends to be greater than the median
since the median is not affected by the location of values in the tails while the mean is pulled towards the
outlying values in the right tail. Similarly, when the data is left-skewed, with a longer tail on the left, the
mean tends to be less than the median.
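As a quick check of these ideas, the short Python sketch below recomputes the mean and median for the two small data sets used in the median example. It relies only on Python's standard `statistics` module; the variable names are our own choices for illustration.

```python
from statistics import mean, median

original = [1, 2, 3, 4, 5, 6]
with_outlier = [1, 2, 3, 4, 5, 21]  # last value replaced by an outlier

for label, data in [("original", original), ("with outlier", with_outlier)]:
    # The mean moves toward the outlier; the median stays put.
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

The mean moves from 3.5 to 6 when the outlier is introduced, while the median stays at 3.5 in both cases.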


Example
The Kalamazoo Gazette reported on housing sales in the area for a two year period.

Year    Number of Sales    Mean Value                  Median Value
1997    3639               $114,800                    $108,820
1998    4310               $119,755 (4.3% increase)    $103,900 (4.5% drop)
Note that the means are higher than the medians. This is because there are usually a few very expensive
houses sold and these outliers on the right pull the mean up towards them, but do not affect the median.
Note that the year-to-year comparison made in the article painted a rosy picture; the number of sales is
up and the average value increased. Times are good in Kalamazoo. Yet, when the median values are
considered a different picture emerges.
Should one use the mean values or the median values in reporting the story? There is no solid answer to
such a question; it depends on one’s purpose and on the nature of the variable. The question is important
though, since, as seen here, the message can change depending on the choice made. Perhaps it would be
best to report both the mean and the median to provide the most information to the reader. When you get
information from others, demand both so you can see a fuller picture.
Note that the mode is the data value appearing most often. It is sometimes used as a measure of center but
can be inappropriate.

Example
The newspaper reported the following statistics on NBA salaries for 1997-1998. This was at a time of
bitter negotiations for a new contract between the players’ union and the team owners. In fact, they could
not reach an agreement; there was a lockout, and a shortened season was played the following year.

Mean      $2.6 million
Median    $1.3 million
Mode      $272,000 (47 players)
The mode value is in fact the smallest salary for this data. Note that the mean is so much higher than the
median due to the very high salaries of a few superstars in the league.

Some Measures of Spread


The mean or median gives a single number that measures the middle of the data, but how far do the data
values stray from this middle? We seek ways to measure this issue of variability in the data.
Measures of center do not provide all the necessary information. Consider the two histograms below. Both
have the same mean and median; however, histogram A has equal frequencies while histogram B has most
of its measurements clustered around its center.


If we were to simply report a mean, we would not know which data set we were talking about.
The Range of the data is a simple statistic to measure the spread of data. It is defined by

Range = Max(x) − Min(x)

the difference between the maximum and minimum values in the data. It directly indicates the spread of the
data. One serious drawback to the range is its sensitivity to outliers. In fact, it is directly computed from
the two most extreme values and if they are wrong/unusual/inappropriate then this “error” is picked up by
the range.

Standard Deviation
The population standard deviation σ is perhaps the most widely used and useful measure of spread,
variation and dispersion. In general, the standard deviation measures the spread from the center (mean) of
the data. It is defined by

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}} = \sqrt{\frac{SS}{n}}$$

Read this formula from the inside out: start with the data $x_i$, subtract µ, then square these values, average them,
and finally take the square root. Here $SS = \sum (x_i - \mu)^2$.
The units for σ are the same as for the original data. One can intuitively see that when the deviations $x_i - \mu$
are “small” (“large”), σ will be “small” (“large”). Experience will help to interpret and understand the
meaning of a σ value.
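The "inside out" reading of the formula can be sketched in a few lines of Python. This is only an illustration: the function name `population_sd` and the small data set are our own (they do not come from the text), and the data were chosen so the arithmetic is easy to follow by hand.

```python
import math

def population_sd(data):
    """Population standard deviation, computed step by step."""
    n = len(data)
    mu = sum(data) / n                      # the mean of the data
    deviations = [x - mu for x in data]     # subtract the mean
    squares = [d ** 2 for d in deviations]  # square each deviation
    ss = sum(squares)                       # SS, the sum of squares
    return math.sqrt(ss / n)                # average, then square root

# Illustrative data (not from the text): mean 5, SS = 32, n = 8.
example = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(example))
```

For this data SS/n = 32/8 = 4, so the function returns 2.0.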

Examples
Let’s first consider the data from histogram A. This has the following dataset: 1, 1, 2, 2, 3, 3, 4, 4, 5, 5. We
will find the standard deviation using the following layout:


x_i    x_i − µ        (x_i − µ)^2
1      1 − 3 = −2     (−2)^2 = 4
1      1 − 3 = −2     (−2)^2 = 4
2      2 − 3 = −1     (−1)^2 = 1
2      2 − 3 = −1     (−1)^2 = 1
3      3 − 3 = 0      (0)^2 = 0
3      3 − 3 = 0      (0)^2 = 0
4      4 − 3 = 1      (1)^2 = 1
4      4 − 3 = 1      (1)^2 = 1
5      5 − 3 = 2      (2)^2 = 4
5      5 − 3 = 2      (2)^2 = 4
∑x = 30               SS = 20

The mean is $\mu = \frac{30}{10} = 3$. Then the standard deviation is

$$\sigma = \sqrt{\frac{20}{10}} = \sqrt{2} = 1.41.$$

Let’s now consider the data from histogram B. This has the following dataset: 1, 2, 2, 3, 3, 3, 3, 4, 4, 5. We
will find the standard deviation using the following layout:
x_i    x_i − µ        (x_i − µ)^2
1      1 − 3 = −2     (−2)^2 = 4
2      2 − 3 = −1     (−1)^2 = 1
2      2 − 3 = −1     (−1)^2 = 1
3      3 − 3 = 0      (0)^2 = 0
3      3 − 3 = 0      (0)^2 = 0
3      3 − 3 = 0      (0)^2 = 0
3      3 − 3 = 0      (0)^2 = 0
4      4 − 3 = 1      (1)^2 = 1
4      4 − 3 = 1      (1)^2 = 1
5      5 − 3 = 2      (2)^2 = 4
∑x = 30               SS_x = 12

The mean is $\mu = \frac{30}{10} = 3$. Then the standard deviation is

$$\sigma = \sqrt{\frac{12}{10}} = \sqrt{1.2} = 1.10.$$

Note: σ defined above is referred to as a population standard deviation. There is another formula for a
standard deviation called a sample standard deviation, denoted by s. This is the same except for using
(n − 1) rather than n in the denominator for the average part of the formula. Specifically,

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{SS_x}{n - 1}}$$

Be alert when reading standard deviation in other material or computer output – ask about which formula
is being used.
So for histogram A above, the sample standard deviation is computed as

$$s = \sqrt{\frac{20}{10 - 1}} = \sqrt{\frac{20}{9}} = \sqrt{2.22} = 1.49.$$

Likewise, the sample standard deviation of histogram B is computed as

$$s = \sqrt{\frac{12}{10 - 1}} = \sqrt{\frac{12}{9}} = \sqrt{1.33} = 1.15.$$
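Statistical software usually offers both formulas, which is one reason to ask which one is in use. As a sketch, Python's standard `statistics` module provides `pstdev` (dividing by n) and `stdev` (dividing by n − 1). The code below applies both to the data sets for histograms A and B; the variable names are our own.

```python
from statistics import pstdev, stdev

hist_a = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
hist_b = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]

for name, data in [("A", hist_a), ("B", hist_b)]:
    sigma = pstdev(data)  # population formula: divides SS by n
    s = stdev(data)       # sample formula: divides SS by n - 1
    print(f"Histogram {name}: sigma = {sigma:.2f}, s = {s:.2f}")
```

Rounded to two decimals, the results agree with the hand calculations: 1.41 and 1.49 for histogram A, 1.10 and 1.15 for histogram B. The sample value is always slightly larger because it divides by a smaller number.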


Look at the standard deviations, then the histograms. See why the standard deviation of histogram A
is higher than that of histogram B? Histogram A has more points further from the mean, resulting in a
higher standard deviation. Likewise, histogram B has points more clustered around the mean, resulting in
a smaller standard deviation.
Note that most of the time, the population mean µ and population standard deviation σ are unknown. This
is for the following reasons: (1) it is difficult to collect data on all units of the population in a given
period, (2) collecting and processing data for an entire population takes a long time, and (3) the process is very expensive. So
instead of gathering data from the whole population, we use the data from a sample to draw inferences about
the population.
The empirical rule is a useful tool that uses the standard deviation. It is stated as follows:
When the data has a “mound-shaped” histogram, highest in the middle and tapering off in a similar fashion
on each side, then

• the interval µ ± 1σ contains about 68% of the data.


• the interval µ ± 2σ contains about 95% of the data.
• the interval µ ± 3σ contains about 99.7% of the data.

Example
Suppose we have data on the city gas mileage of a fleet of Chevys and we know the mean and standard deviation are
µ = 22 and σ = 3.
Then approximately 68% of the mileages are in the range 22 ± 3, that is, between 19 and 25. Likewise, approximately 95% of the
mileages are in the range 22 ± 6, that is, between 16 and 28.
Note how the standard deviation is providing information on how the mileages vary about the mean in this
sense.

Example
The sand lance is a small fish in the north Atlantic. Data was collected on them to study how fast they grow.
The lengths of the sampled fish averaged 225 mm with a standard deviation of 30 mm. Using the empirical
rule, approximately 68% of the fish have length in the range 225 ± 30. Likewise, approximately 95% of the fish have length
in the range 225 ± 60.
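The intervals in the two examples above can be generated with a small helper. This is a sketch only: the function name `empirical_intervals` is our own, and the percentages are the approximate ones stated in the empirical rule, which assumes a mound-shaped histogram.

```python
def empirical_intervals(mean, sd):
    """Return the 1-, 2-, and 3-SD intervals keyed by approximate coverage."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {
        pct: (mean - k * sd, mean + k * sd)
        for k, pct in coverage.items()
    }

print(empirical_intervals(22, 3))    # Chevy city mileages
print(empirical_intervals(225, 30))  # sand lance lengths (mm)
```

For the mileages this gives 19 to 25 for about 68% of the data and 16 to 28 for about 95%; for the fish lengths it gives 195 to 255 and 165 to 285.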
The empirical rule can be used in a “backwards” way. Given a histogram of a data set one can quickly, by
eye, specify an approximate value for the mean and standard deviation. This works best by visually marking
the middle 95% of the area of the histogram. Note the boundaries of this range on the bottom axis. Divide
the length of this range by 4 to get the standard deviation.


Example
Suppose the following histogram represents the highway gas mileages for a sample of compact cars. Give
a rough estimate for the mean and standard deviation.

The mean (middle) seems to be about 35. Draw the two vertical lines so they cut off a small percent of area
on each end, about 2.5% each. This means there is about 95% area in between them. This range 30 to 40
should be about 4 standard deviations in length. So

$$\sigma \approx \frac{40 - 30}{4} = \frac{10}{4} = 2.5$$

Note that the actual mean is 34.7 and the actual standard deviation is 2.9.

Example
This is the histogram for the data on white blood cell counts. Estimate the mean and standard deviation.


The mean is at about 7. To cut off about 2.5% area on the left, cut at 4. To cut off about 2.5% area on the
right, cut at 10. Then 4 to 10 should be 4 standard deviations. So

$$\sigma \approx \frac{10 - 4}{4} = \frac{6}{4} = 1.5$$

Note that the actual mean is 6.6 and the actual standard deviation is 1.7.
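The "backwards" use of the empirical rule in the last two examples amounts to simple arithmetic on the two cut points. The sketch below is an illustration under one added assumption: it takes the mean to be the midpoint of the middle-95% range, which is reasonable only for a roughly symmetric histogram. The function name is our own.

```python
def estimate_from_middle_95(low, high):
    """Rough mean and SD from the eyeballed middle-95% range of a histogram."""
    mean = (low + high) / 2  # assumed: center of the range (symmetric shape)
    sd = (high - low) / 4    # the range spans about 4 standard deviations
    return mean, sd

print(estimate_from_middle_95(30, 40))  # compact-car highway mileages
print(estimate_from_middle_95(4, 10))   # white blood cell counts
```

These calls return (35.0, 2.5) and (7.0, 1.5), matching the by-eye estimates. They are rough values, as the comparison with the actual means and standard deviations shows.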

EXERCISES
1. Calculate the mean, median, and population standard deviation (σ) for the data:

23, 25, 28, 29, 30

2. Calculate the mean, median, and sample standard deviation (s) for the data:

18, 5, 15, 22, 11, 19

3. What is the value of the SD if the data values all equal the same number?
4. If the same constant is added to the data values, what change would we see in the mean? Median?
SD? (Think about how the equations work to answer)
5. Repeat the previous exercise, but change so that all the data values are multiplied by the same
constant.
6. Given two sets of data
A: n = 25, x̄ = 8, SD = 2
B: n = 50, x̄ = 8, SD = 2
Suppose we add one more observation, 26, to each. If we now calculate a new mean for each set,
x̄new , which one will change the most, A or B? Why?

7. Verify your answer in 6 by calculating x̄new for data set A and B.


8. If, instead of adding the observation 26 to the data sets A and B, the highest observation in each data
set (x = 13) was replaced by 26, for which set would the median change the most, A or B?


9. Roughly estimate the standard deviation of the data set displayed in the histogram below.

10. An accountant at a company is making a budget for the following year. The company has a total of 125
employees, with an average salary of $40,000/year. Without knowing the individual salaries of each
employee, a fair estimate of a budget allowance for employee salaries (not including any bonuses or
overtime) is
A. not calculable from the information given.
B. $3.25 million.
C. $5 million.
D. $7.5 million.

11. In the season of 2001-2002 for NFL and NBA salaries, statistics were as follows:

       Mean          Median        Standard Deviation
NFL    $1,175,500    $521,660      $1,595,995
NBA    $3,470,940    $2,400,000    $3,798,535

A friend of yours says, “Whoa!! I know this info is correct, but how can the standard deviation be
bigger than the mean (x̄)? That means that the range of salaries x̄ ± SD would include negative
numbers! The empirical rule for ‘mound-shaped’ data is that this range is supposed to contain
about 68% of the data, but I sure don’t know of any players with negative salaries.” Briefly explain his
error.


12. A sports enthusiast is interested in comparing the salaries of professional athletes from the four major
sports in the US. He randomly selected 20 players from each league and calculated the following statistics:

                      NBA            MLB            NHL            NFL
Average               3.2 Million    2.5 Million    1.2 Million    5.2 Million
Median                1.5 Million    1.2 Million    1.1 Million    1.3 Million
Standard Deviation    675,000        375,000        220,000        500,000

A. Based on the data gathered, which league pays the most? Justify your answer.
B. Which league has the largest spread in salaries?
C. When comparing the salaries of the four professional leagues, which should be used, the average or
the median? Why?

13. Suppose that the final exam scores were normally distributed with mean 75 and standard deviation
8. Using the empirical rule, find the approximate percentage of students whose final exam score is
between 83 and 91.

REVIEW EXERCISES
14. Calculate the x̄ and SD for the following data sets
Dataset A: 7, 13, 19, 10, 16
Dataset B: 11, 17, 23, 14, 20
Dataset C: 1.1, 2.3, 2, 1.4, 1.7
Dataset D: 2, 8, 11, 7, 5, 3
15. (Hypothetical) Below is a histogram taken from 150 adult male patients.

Assuming the data is approximately mound shaped, an estimate of the standard deviation would be:

(a) 0.25 gm/dl (b) 0.50 gm/dl (c) 1.00 gm/dl (d) 1.50 gm/dl


16. Which data set has a larger standard deviation?


A: thirty “30’s”, one “60”, and thirty “90’s”; that’s 61 data points in all with mean = 60.
B: 30, 31, 32, 33, . . . , 88, 89, 90; again, that’s 61 data points in all with mean = 60.
(Hint: Don’t try to calculate the actual standard deviations. What does standard deviation measure?)

17. (Hypothetical) In a particular school district in the U.S., a sample of teacher salaries for those with
ten years or more of experience resulted in the following:

w/BS/BA       w/MS or Higher
n = 50        n = 50
x̄ = $50k      x̄ = $59k
SD = $4k      SD = $4k

If the two groups above were combined into a single data set, would the SD of the combined data set
increase, decrease or stay about the same? Briefly explain your answer.

18. For each of the following, determine which of these should apply
(i) average = median
(ii) average ≤ median, or
(iii) average ≥ median.
(Hint: “sketch” a histogram based on how you think the values will be distributed and possibly
skewed.)
A: Age at which a person graduates from High School. (Hint: There are some people who graduate
much earlier than 18 yrs old, the average, but not many who graduate later than 19 yrs old.)
B: Age at which a person graduates from College. (Hint: This is the reverse of High School, i.e.
many graduate much later than 22 yrs old, but very few younger than this.)
C: Measured weight of cereal from a box of a particular type of cereal at the Kellogg’s plant. (We
expect the variation in box weight to be due entirely to random variation, equally likely
to be greater or smaller than the target weight.)
19. Estimate, roughly, the mean and SD for the histogram below on test scores.


20. Looking at the set of six numbers below,

0, 2, 4, 8, 10, 12

A friend says:
“The mean is obviously 6 and the SD is 4, since the SD measures, on average, the ‘distance’ to the
mean and

$$\sqrt{\frac{(6-0)^2 + (6-2)^2 + (6-4)^2 + (8-6)^2 + (10-6)^2 + (12-6)^2}{6}} = \frac{6 + 4 + 2 + 2 + 4 + 6}{6} = \frac{24}{6} = 4.$$”

A Do you agree? Explain.


B Calculate the standard deviation for this set of numbers.

21. The scores for the final exam last semester averaged to 75 with a standard deviation of 10. A closer
look at the actual scores revealed that 10% of the students who took the exam got a score below 45.
Using the empirical rule, can we say that the final exam scores do not follow the normal distribution?

Chapter 7

Normal Distribution

Begin with a set of data from a sample or a population. To visualize the data, draw a histogram. Use the
density scale, so that the total area is 1.0 or 100%.
Draw a smooth curve to approximate the top of the histogram. Then erase the bars leaving only the smooth
curve. The curve represents the data. Its total area is set to be 1.0 or 100%. An area of one or more
rectangles for the histogram is the proportion of the data in the interval at the bottom. The corresponding
area under the smooth curve over the interval represents the same proportion. In the lower figure, the
shaded area is the proportion of people with ages less than 40.

We turn our attention to smooth curves to represent the distribution of a population of numbers.
The ages of the employees in a company have such a curve, the batting averages of all major league
players have such a curve, the base diameters of all the trees in a forest have such a curve, and on and on.
Quantities in any field can be represented by curves and they can be studied, compared, and discussed in
this way. Many different shapes arise for the curves.


One special shape is that of the normal or bell-shaped curve. See the figure below. It has a definite shape
given by a mathematical formula (we have no need to express this formula). Two key features describe the
normal curve:

• the mean: gives the center of the curve (like the mean of the population of numbers)
• the standard deviation (SD): gives the spread of the curve (like the SD of the population)

The figure below at left represents the lifetimes of alternators on Chevy Blazers. They have a mean of 90
and a standard deviation of 20 (in 1000 mile units). The curve shows this information visually in a simple
way. Areas under this curve correspond to population proportions – we may want to know proportions
corresponding to

• the area under the normal curve above 110, (see figure at right), or
• the area under the curve below 75, or
• the area under the curve between 75 and 125.

Figure 7.1: miles

How do you determine the area under a portion of a normal curve? The basic method is to convert the value
of interest to the standard normal scale, the so-called Z scale. A given number X is converted to the Z scale
by the formula

$$Z = \frac{X - \mu}{\sigma}$$

and then we find the area for Z using the standard normal table (the Z table). The Z table gives areas to the
left of a value Z for a standard normal distribution, a normal distribution with mean = 0 and SD = 1. It does
this for positive values of Z. All other types of areas that we may want, can be found from such areas by
using the symmetry of the Z curve about 0 and the fact that the total area is 1. The Z curve is given in the
figure below and the area that is available in the Z table is shown in the figure. See the Z table (it’s at the
end of this section).
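To make the conversion concrete, the sketch below standardizes the three alternator lifetimes mentioned above (110, 75 and 125, with mean 90 and SD 20) and then finds the corresponding areas. In place of the printed Z table it uses `NormalDist` from Python's standard `statistics` module; the helper name `z_score` and the variable names are our own.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Convert a value x to the standard normal (Z) scale."""
    return (x - mean) / sd

MEAN, SD = 90, 20               # alternator lifetimes, in 1000-mile units
standard_normal = NormalDist()  # mean 0, SD 1, plays the role of the Z table

z_110 = z_score(110, MEAN, SD)
z_75 = z_score(75, MEAN, SD)
z_125 = z_score(125, MEAN, SD)

above_110 = 1 - standard_normal.cdf(z_110)  # area above 110
below_75 = standard_normal.cdf(z_75)        # area below 75
between = standard_normal.cdf(z_125) - standard_normal.cdf(z_75)

print(f"Z values: {z_110}, {z_75}, {z_125}")
print(f"above 110: {above_110:.3f}")
print(f"below 75: {below_75:.3f}")
print(f"between 75 and 125: {between:.3f}")
```

The Z values are 1.0, −0.75 and 1.75. Unlike the printed table, `cdf` accepts negative Z values directly, so no flipping by symmetry is needed here.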


Reading the Z Table


The z table gives areas to the left of a value z for a standard normal distribution, a normal distribution with
mean = 0 and SD = 1. It does this for positive values of z.
For example, the shaded area to the left of z = 0.67 is 0.749 (see Figure 7.2). In the z table, look for 0.67 in
column Z; the corresponding area is 0.749.

Figure 7.2: Area to the left of 0.67

All other types of areas that we may want, can be found from such areas by using the symmetry of the z
curve about 0 and the fact that the total area is 1.
Now, let us answer the following questions.

(a) What is the area to the right of z = −0.67? To answer this question, see Figure 7.3.
By symmetry, the area to the right of z = −0.67 is equal to the area to the left of z = 0.67. Using the z
table (look for 0.67 on column Z), the area is 0.749.

Figure 7.3: Area to the right of -0.67


(b) What is the area above z = 1?


We know that the z table always gives us the area to the left of the z value. Looking at the table, the area
to the left of z = 1 is 0.841 (see left graph in Figure 7.4). But we also know that the total area under
the z curve is 1. Therefore, the area above z = 1 can be computed by subtracting 0.841 from 1. That
is, area = 1 − 0.841 = 0.159 (see right graph in Figure 7.4).

Figure 7.4: Area below z = 1 (left) and Area above z = 1 (right)

(c) What is the area between z = 1 and z = 2?

Figure 7.5: Illustration 1 (left) and Illustration 2 (right)

From the z table, we can look at the areas below any positive z value. So in this case, we know
the areas below z = 1 (area shaded in red on left Figure 7.5) and z = 2 (area shaded in blue on left
Figure 7.5). If we subtract the area below z = 1 from the area below z = 2, we get the area between
z = 1 and z = 2 (area shaded in blue on right Figure 7.5). That is, the area between z = 1 and z = 2 is
0.977 − 0.841 = 0.136.
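The three table-reading rules in (a), (b) and (c) can be mimicked in code. In the sketch below, the hypothetical helper `table_area` stands in for the printed table: it accepts only non-negative z values and rounds to three decimals, so the other areas must be built from symmetry, complements and differences, just as by hand.

```python
from statistics import NormalDist

def table_area(z):
    """Area to the left of z (z >= 0 only), rounded like a printed Z table."""
    if z < 0:
        raise ValueError("the table lists only non-negative z values")
    return round(NormalDist().cdf(z), 3)

# (a) Area to the right of z = -0.67: by symmetry, the area left of +0.67.
area_a = table_area(0.67)

# (b) Area above z = 1: total area 1 minus the area to the left.
area_b = round(1 - table_area(1.0), 3)

# (c) Area between z = 1 and z = 2: difference of two left areas.
area_c = round(table_area(2.0) - table_area(1.0), 3)

print(area_a, area_b, area_c)
```

The three results are 0.749, 0.159 and 0.136, the same values obtained from the table.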


EXAMPLE 1
Suppose the gas mileages of Ford Broncos are normally distributed with mean of 20 and a standard
deviation of 2 (miles/gal). What proportion of these vehicles get more than 23 m/g? Drawing the figure
below at left, we need the shaded area. Convert to the Z scale by

$$Z = \frac{23 - 20}{2} = \frac{3}{2} = 1.50$$

Now draw a new figure, at right, to express the area on the Z scale. Find the area to the left of 1.50 – from
the Z table in row Z = 1.50, Area = 0.933. To get the area to the right of 1.50, subtract from 1 to get
1 − 0.933 = 0.067.
State the result: the proportion of vehicles getting more than 23 m/g is 0.067 or 6.7%.

Figure 7.6: X curve (left) and Z curve (right)

EXAMPLE 2
In a study of the amount spent on books for a population of students, the histogram of the data showed the
shape of a normal distribution. The mean was $250 and the standard deviation was $50. What proportion
of the students spent less than $165? Drawing the figure below at left, we need the shaded area. Convert
to the Z scale by

$$Z = \frac{165 - 250}{50} = \frac{-85}{50} = -1.70 \quad \text{(Note the minus sign.)}$$

Now draw a new figure, in the middle, to express the area (on left) on the Z scale. With this negative Z value
we can’t find any area directly so the area is flipped over to an equal area on the positive side. See the
figure on the right side. The area to the left of 1.70 from the Z table is Area = 0.955. To get the area we want
to the right of 1.70, subtract from 1 to get 1 − 0.955 = 0.045.
State the result: the proportion of students spending less than $165 is 0.045 or 4.5%.
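Examples 1 and 2 can be checked without converting to the Z scale by hand. As a sketch, `NormalDist` from Python's standard `statistics` module accepts the mean and SD directly and does the standardizing internally. The variable names are our own.

```python
from statistics import NormalDist

# Example 1: Ford Bronco mileages, mean 20 and SD 2.
broncos = NormalDist(mu=20, sigma=2)
more_than_23 = 1 - broncos.cdf(23)  # area to the right of 23

# Example 2: amount spent on books, mean $250 and SD $50.
books = NormalDist(mu=250, sigma=50)
less_than_165 = books.cdf(165)      # area to the left of 165

print(f"Proportion above 23 m/g: {more_than_23:.3f}")
print(f"Proportion below $165:   {less_than_165:.3f}")
```

Rounded to three decimals, these agree with the table-based answers of 0.067 and 0.045.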


Figure 7.7: Amount spent (left) and Z curve (right)

Backwards Normal Calculations


In the above examples, we started with an X value of interest and found the area it cuts off. Now we will
reverse the order of things – start with an area of interest and find the corresponding X value. The method
is to reverse the steps learned above – think it through.

EXAMPLE 3
Test scores for 800 students in a Biology class followed a normal distribution with a mean of 75 and a
standard deviation of 9. What is the 80th percentile of the test scores (80% of the scores are less than this
one, 20% are higher)? See the figure below at left representing the distribution of the scores. We seek a
score X cutting off an area of 0.80 to its left as indicated in the figure.
To begin, consider the corresponding question for the standard normal curve. The drawing is below at
right. Find a Z value which cuts off an area of 0.80 to its left. In the Z table, find the value 0.80 in the
area column; the corresponding Z value is Z = 0.84. This value is written in the figure. Thus we
know that the standardized X should be 0.84. Write this as an equation in the form

$$\frac{X - 75}{9} = Z = 0.84$$

Now the solution to this equation is

$$X = 75 + 0.84(9) = 75 + 7.56 = 82.56, \text{ or rounding, } 83$$

Note the template here: X = mean + Z (SD)
State the result: the 80th percentile of the test scores is approximately 83.


Figure 7.8: Test scores (left) and Z curve (right)

EXAMPLE 4
A company makes a line of oak panels. Suppose the moisture content of their oak panels has approximately
a normal distribution with mean of 0.15 and a standard deviation of 0.02. Find a moisture value, so that,
75% of the oak panels have moisture content above this value (that is, we seek the 25th percentile of the
moisture contents).
Begin with a normal graph of the moisture contents with the area desired above a value X (see the figure
below at left). Draw the corresponding graph for the standard normal curve (see the figure in the middle).
We need to find the value Z having 75% area to its right. Clearly, Z will be a negative number, so imagine
flipping it over to the positive side to find the positive Z number with 75% area to its left (see the figure
below at right). Consulting the Z table, we find that Z = 0.67 works. So our original question is answered by
Z = −0.67 and this is indicated in the figure. Thus, the moisture content value X should have a standardized
value of −0.67, that is

$$\frac{X - 0.15}{0.02} = Z = -0.67$$

Solving this equation,

$$X = 0.15 + 0.02(-0.67) = 0.15 - 0.0134 = 0.1366$$

State the result: 75% of the moisture values are above 0.1366.
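The backwards calculation follows the template X = mean + Z (SD), with Z found from the desired area. The sketch below uses `NormalDist().inv_cdf` from Python's standard `statistics` module in place of a reverse table lookup; the function name `percentile_value` is our own. Because the software does not round Z to two decimals, its answers differ slightly from the hand calculations in Examples 3 and 4.

```python
from statistics import NormalDist

def percentile_value(p, mean, sd):
    """Value X with proportion p of a normal distribution below it."""
    z = NormalDist().inv_cdf(p)  # Z value cutting off area p to its left
    return mean + z * sd         # template: X = mean + Z (SD)

score_80th = percentile_value(0.80, 75, 9)          # Example 3
moisture_25th = percentile_value(0.25, 0.15, 0.02)  # Example 4

print(f"80th percentile of test scores: {score_80th:.2f}")
print(f"25th percentile of moisture:    {moisture_25th:.4f}")
```

Both results agree with the worked examples after rounding: the test score rounds to 83 and the moisture value to 0.137.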


Figure 7.9: Moisture content (left) and Z curve (right)

EXERCISES
1. For a normal distribution with a mean of 5 and a SD of 0.8,
(A) what percent of the data lies above the value X = 6.2?
(B) what percent of the data lies below the value X = 3.8?

2. For the same distribution as in 1,


(A) what is the value of the 90th percentile?
(B) what is the value of the 15th percentile?

3. Let X be the number of miles a long-range truck driver drives each week. If X is normally distributed
with average = X̄, standard deviation = SD_X, and X_0 = X̄ + 1.28(SD_X), then the percent of weeks in which the
driver drives more than X_0 is:
(A) 30%
(B) 20%
(C) 10%
(D) not determinable with the information given.

4. With the information as in 3 above, let X_67th be the 67th percentile. What is the chance that, for any
random week, the driver will drive more than X_67th?
(A) 67%
(B) 44%
(C) 33%
(D) 25%


5. (Hypothetical) A study was done in Michigan on the homeless rate among small to large cities
(population greater than 20,000). The mean rate (homeless person/1000 in population) was 0.54 with
SD = 0.13. Assuming the data is normally distributed, what homeless rate would a city need to be
located in the top 10% (90th percentile or better)?

6. (Hypothetical) In 2006, starting salaries for elementary school teachers in Michigan had an average
of $28,000 with a SD = $2000. What percent of salaries were
(A) below $25,000?
(B) between $27,500 and $30,500?

7. Below are graphs of two normal distributions, A (the solid line) and B (the dotted line).

Which of the following best describes the relation between A and B?


(A) X̄A < X̄B , SDA < SDB
(B) X̄A < X̄B , SDA > SDB
(C) X̄A > X̄B , SDA < SDB
(D) X̄A > X̄B , SDA > SDB

8. Using the distributions indicated above, suppose the value X1 is exactly one standard deviation above
X̄A in the A distribution (i.e. X1 = X̄A + SDA ) and the value of X2 is exactly one standard deviation above
X̄B in the B distribution (i.e. X2 = X̄B + SDB ). If you pick a number XA at random from distribution A and
a number XB at random from distribution B, which of the following is true?
(A) [Chance of (XA < X1 )] < [Chance of (XB < X2 )]
(B) [Chance of (XA < X1 )] > [Chance of (XB < X2 )]
(C) [Chance of (XA < X1 )] = [Chance of (XB < X2 )]
(D) None of the above.


9. Suppose now that you have normal distributions C and D, with X̄C = X̄D = 0 and SDC < SDD . If you
pick a number XC at random from distribution C and a number XD from distribution D, which of the
following is true?
(A) [Chance of (XC < 1)] < [Chance of (XD < 1)]
(B) [Chance of (XC < 1)] > [Chance of (XD < 1)]
(C) [Chance of (XC < 1)] = [Chance of (XD < 1)]
(D) None of the above.

10. Suppose the PCB levels in the fish in a river are at an average of 30 ppm with SD of 9 ppm. The shape
of the distribution of levels is approximately normal.
(A) What percent of the fish have PCB level above 45 ppm?
(B) What are the 25th and 75th percentiles of the PCB levels?

11. (Hypothetical) The weight of cereal in boxes of corn flakes is measured at the Kellogg plant in Battle
Creek. The weight is normally distributed with a mean of 18.2 ounces and SD = 0.9 ounces.
(A) What is the chance of a box having more than 20 ounces of cereal?
(B) 30% of the boxes will have a weight less than what value?
12. Suppose the daily wage of employers of a manufacturing company is normally distributed with mean
$250 and a standard deviation σ .
(A) (TRUE or FALSE) If the company has 40 employers in total, we are expecting around 20 workers
would have daily wage less $250 and 20 would have daily wage above $250.
(B) Which of the following statement(s) is(are) TRUE?
I. The chance an employee earns less than $400 is greater than the chance an employee earns
less than $300.
II. The chance an employee earns more than $350 is less than the chance an employee earns
more than $450.

a. I only b. II only c. Both I and II d.Neither I nor II

REVIEW EXERCISES
13. Suppose the average height of males in a particular city is 72 inches with SD = 1.5 inches. Assuming that
the heights of males in this city are normally distributed (show your work):
(A) What percent of males fall within the height range of 69 inches and 73.5 inches ?
(B) Only 10% of the males in the city have heights greater than what value?
(C) What is the 90th percentile of the heights of men in this city?
(D) What is the chance that a randomly selected male from this city will be shorter than 71 inches?


14. Suppose you are a professor looking at the scores on the mass freshman chemistry midterm
exam, which have an approximately normal distribution with µ = 70 (out of 100) and SD = 11. You know that
scores in the range of 59 to 81 should account for about 68% of the data. True or
False? (Briefly explain your answers.)
(A) If 5 is added to every observation (midterm score) to obtain a new data set (curved), then it must
follow that scores in the range of 64 to 86 should also account for 68% of the data in
the new data set.
(B) In order to assign 10% A’s for the midterm, you would need to assign an A to a score of 90 or
higher.
(C) Suppose an unexpected blizzard cancels the second exam, so you consider just counting the
midterm twice. If every midterm score is multiplied by 2 to obtain a new data set, i.e., a set of
scores out of 200, then it must follow that scores in the range of 118 to 162 should also account
for 68% of the data in the new data set.

15. (Hypothetical) The time for completion of a design project has a normal distribution with mean of 100
hours and a standard deviation of 25 hours.
(A) How many hours should be allowed in planning a design project if we want to be 90% certain
that it is finished before the deadline?
(B) If we set the deadline at 80 hours (two standard work weeks), what is the chance that the job
will be finished by the deadline?

16. In poorer countries, growth of children can be an important indicator of general levels of health and
nutrition. In an article from an anthropological journal, it was suggested that the population of 5 year
olds have heights approximately normally distributed with mean = 100 cm and SD = 6 cm. Answer
the following using the normal curve approximation.
(A) What percent of the population of 5 year olds should have a height less than 96 cm?
(B) What height is the 70th percentile in this population?
(C) If a child were picked at random from this population, what is the % chance that the child’s
height would be between 100 cm and 102 cm ?

17. Suppose the final exam score last semester is approximately normal with mean 75, and a student is
randomly selected. Which of the following statements is incorrect?
(A) The probability that the student’s final exam score is less than 70 is the same as the probability
that the student’s final exam score is more than 80.
(B) The probability that the student’s final exam score is greater than 70 is 1-probability that the
student’s final exam score is less than 70.
(C) The probability that the student’s final exam score is between 70 and 75 is the same as the
probability that the student’s final exam score is between 80 and 85.
(D) The probability that the student’s final exam score is between 70 and 75 is half of the probability
that the student’s final exam score is between 70 and 80.

18. (TRUE or FALSE) Suppose the weight of a 12-pack of bottled water is normally distributed with a
mean of 16.9 ounces, and a standard deviation of 1.5 ounces. The median of this normal distribution
will be twice the mean (that is 33.8 ounces).


REVIEW ACTIVITY (CHAPTERS 1 TO 7)


Name: Section:

1. A new cough medicine was formulated with the intent to reduce the number of days patients suffer
from coughing. To test if this drug is better than the standard drug, a sample of 50 patients in a
certain hospital was selected and split equally at random into two groups. Group 1 will receive
the new drug, while group 2 will receive the standard drug. Neither the patients nor the doctor who
will examine them know what drug each patient received. Is this an example of a good controlled
experiment? Justify your answer using the characteristics discussed in Chapter 2.

2. A student wants to know whether more freshman students than upperclass students go to Waldo
Library. On a random day, he interviewed each student who walked into the library and
asked them what year level they were in (Freshman, Sophomore, Junior, Senior). Is this a controlled
experiment or an observational study? Explain your answer.

3. The Canadian Longitudinal Study on Aging is a large, national, long-term study of more than 50,000
men and women who were between the ages of 45 and 85 when recruited. These participants will
be followed until 2033 or death, with the aim of finding ways to help people live long and well, and of
understanding why some people age in a healthy fashion while others do not. Is this a retrospective or a
prospective study? Explain your answer.

4. A survey was taken on Maple Avenue, Georgetown, Ontario among 9 randomly selected homes. In
each home, people were asked how many cars were registered to their households, and the results
were recorded as follows:

3 3 2 2 2 1 1 1 0
(a) Construct a Histogram for the data.


(b) Calculate the mean of the data set.

(c) Calculate the median of the data set.

(d) Calculate the standard deviation of the data set.

5. Consider the boxplot for the ages of children when they started attending specialized class to improve
their reading skills.

At what age did more than 75% of children start attending the specialized class?

6. Assume the speed of vehicles along a stretch of I-81 has an approximately normal distribution with a
mean of 65 mph and a standard deviation of 8 mph.

(a) What is the probability that a vehicle is running at a speed greater than 68 mph? Include the
graph and shade the region of interest in your solution.

(b) What is the probability that a vehicle is running at a speed less than 64 mph? Include the graph
and shade the region of interest in your solution.

Z-table
Z Area Z Area Z Area Z Area Z Area
0.00 0.500 0.65 0.742 1.32 0.907 1.99 0.977 2.66 0.996
0.01 0.504 0.66 0.745 1.33 0.908 2.00 0.977 2.67 0.996
0.02 0.508 0.67 0.749 1.34 0.910 2.01 0.978 2.68 0.996
0.03 0.512 0.68 0.752 1.35 0.911 2.02 0.978 2.69 0.996
0.04 0.516 0.69 0.755 1.36 0.913 2.03 0.979 2.70 0.997
0.05 0.520 0.70 0.758 1.37 0.915 2.04 0.979 2.71 0.997
0.06 0.524 0.71 0.761 1.38 0.916 2.05 0.980 2.72 0.997
0.07 0.528 0.72 0.764 1.39 0.918 2.06 0.980 2.73 0.997
0.08 0.532 0.73 0.767 1.40 0.919 2.07 0.981 2.74 0.997
0.09 0.536 0.74 0.770 1.41 0.921 2.08 0.981 2.75 0.997
0.10 0.540 0.75 0.773 1.42 0.922 2.09 0.982 2.76 0.997
0.11 0.544 0.76 0.776 1.43 0.924 2.10 0.982 2.77 0.997
0.12 0.548 0.77 0.779 1.44 0.925 2.11 0.983 2.78 0.997
0.13 0.552 0.78 0.782 1.45 0.926 2.12 0.983 2.79 0.997
0.14 0.556 0.79 0.785 1.46 0.928 2.13 0.983 2.80 0.997
0.15 0.560 0.80 0.788 1.47 0.929 2.14 0.984 2.81 0.998
0.16 0.564 0.81 0.791 1.48 0.931 2.15 0.984 2.82 0.998
0.17 0.567 0.82 0.794 1.49 0.932 2.16 0.985 2.83 0.998
0.18 0.571 0.83 0.797 1.50 0.933 2.17 0.985 2.84 0.998
0.19 0.575 0.84 0.800 1.51 0.934 2.18 0.985 2.85 0.998
0.20 0.579 0.85 0.802 1.52 0.936 2.19 0.986 2.86 0.998
0.21 0.583 0.86 0.805 1.53 0.937 2.20 0.986 2.87 0.998
0.22 0.587 0.87 0.808 1.54 0.938 2.21 0.986 2.88 0.998
0.23 0.591 0.88 0.811 1.55 0.939 2.22 0.987 2.89 0.998
0.24 0.595 0.89 0.813 1.56 0.941 2.23 0.987 2.90 0.998
0.25 0.599 0.90 0.816 1.57 0.942 2.24 0.987 2.91 0.998
0.26 0.603 0.91 0.819 1.58 0.943 2.25 0.988 2.92 0.998
0.27 0.606 0.92 0.821 1.59 0.944 2.26 0.988 2.93 0.998
0.28 0.610 0.93 0.824 1.60 0.945 2.27 0.988 2.94 0.998
0.29 0.614 0.94 0.826 1.61 0.946 2.28 0.989 2.95 0.998
0.30 0.618 0.95 0.829 1.62 0.947 2.29 0.989 2.96 0.998
0.31 0.622 0.96 0.831 1.63 0.948 2.30 0.989 2.97 0.999
0.32 0.626 0.97 0.834 1.64 0.949 2.31 0.990 2.98 0.999
0.33 0.629 0.98 0.836 1.65 0.951 2.32 0.990 2.99 0.999
0.34 0.633 0.99 0.839 1.66 0.952 2.33 0.990 3.00 0.999
0.35 0.637 1.00 0.841 1.67 0.953 2.34 0.990 3.01 0.999
0.36 0.641 1.01 0.844 1.68 0.954 2.35 0.991 3.02 0.999
0.37 0.644 1.02 0.846 1.69 0.954 2.36 0.991 3.03 0.999
0.38 0.648 1.03 0.848 1.70 0.955 2.37 0.991 3.04 0.999
0.39 0.652 1.04 0.851 1.71 0.956 2.38 0.991 3.05 0.999
0.40 0.655 1.05 0.853 1.72 0.957 2.39 0.992 3.06 0.999
0.41 0.659 1.06 0.855 1.73 0.958 2.40 0.992 3.07 0.999
0.42 0.663 1.07 0.858 1.74 0.959 2.41 0.992 3.08 0.999
0.43 0.666 1.08 0.860 1.75 0.960 2.42 0.992 3.09 0.999
0.44 0.670 1.09 0.862 1.76 0.961 2.43 0.992 3.10 0.999
0.45 0.674 1.10 0.864 1.77 0.962 2.44 0.993 3.11 0.999
0.46 0.677 1.11 0.867 1.78 0.962 2.45 0.993 3.12 0.999
0.47 0.681 1.12 0.869 1.79 0.963 2.46 0.993 3.13 0.999
0.48 0.684 1.13 0.871 1.80 0.964 2.47 0.993 3.14 0.999
0.49 0.688 1.14 0.873 1.81 0.965 2.48 0.993 3.15 0.999
0.50 0.691 1.15 0.875 1.82 0.966 2.49 0.994 3.16 0.999
0.51 0.695 1.16 0.877 1.83 0.966 2.50 0.994 3.17 0.999
0.52 0.698 1.17 0.879 1.84 0.967 2.51 0.994 3.18 0.999
0.53 0.702 1.18 0.881 1.85 0.968 2.52 0.994 3.19 0.999
0.54 0.705 1.19 0.883 1.86 0.969 2.53 0.994 3.20 0.999
0.55 0.709 1.20 0.885 1.87 0.969 2.54 0.994 3.21 0.999
0.56 0.712 1.21 0.887 1.88 0.970 2.55 0.995 3.22 0.999
0.57 0.716 1.22 0.889 1.89 0.971 2.56 0.995 3.23 0.999
0.58 0.719 1.23 0.891 1.90 0.971 2.57 0.995 3.24 0.999
0.59 0.722 1.24 0.893 1.91 0.972 2.58 0.995 3.25 0.999
0.60 0.726 1.25 0.894 1.92 0.973 2.59 0.995 3.26 0.999
0.61 0.729 1.26 0.896 1.93 0.973 2.60 0.995 3.27 0.999
0.62 0.732 1.27 0.898 1.94 0.974 2.61 0.995 3.28 0.999
0.63 0.736 1.28 0.900 1.95 0.974 2.62 0.996 3.29 0.999
0.64 0.739 1.29 0.901 1.96 0.975 2.63 0.996 3.30 1.00
           1.30 0.903 1.97 0.976 2.64 0.996
           1.31 0.905 1.98 0.976 2.65 0.996
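
The entries of the Z-table can be reproduced numerically. The following is a minimal Python sketch, assuming the Area column is the area under the standard normal curve to the left of Z (consistent with Area = 0.500 at Z = 0.00). The function name area_left_of is our own choice for illustration and is not part of the course pack.

```python
import math

def area_left_of(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Spot-check a few table entries, rounded to three decimals as in the table.
for z in (0.00, 1.00, 1.96):
    print(f"{z:.2f}  {area_left_of(z):.3f}")
```

The same function accepts negative values of Z, which is equivalent to using the symmetry of the curve: the area to the left of −Z equals 1 minus the area to the left of Z.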
Chapter 8

Correlation and Association

Suppose there are two variables under consideration, say X and Y. That is, the units under study, such as
people, plants, cars, stores, etc., have two numbers associated with them. For example,

people: X = height and Y = weight


people: X = age and Y = blood pressure
students: X = score on a midterm exam and Y = score on the final exam
cars: X = city mpg and Y = highway mpg
cars: X = weight and Y = engine size
farms: X = corn yield in 1998 and Y = corn yield in 1999
stores: X = sales in first quarter and Y = sales in second quarter.

In such situations, suppose we are interested in the relationship between the two variables. If the value of X
is known, does this give us information about the value of Y ? How closely are the values of X and Y related
to each other?
One generic type of relationship finds that the two variables go together, in the sense that if X is “high” then
Y is also “high” and if X is “low” then Y is also “low”. This happens in situations like the following:

• size of a house vs. monthly heating cost


• SAT score vs. college GPA
• age of husband vs. age of wife
• advertising expenditure vs. sales

Another type of relationship finds that the two variables go in opposite directions in the sense that if X is
“high” then Y is “low” and when X is “low” then Y is “high”. This occurs with the following variables:

• weight of a car vs. gas mileage


• in archery, distance to target vs. accuracy
• age of a car vs. price


To explore the relationship between two variables X and Y , we need paired data. For each individual,
suppose we have a pair of values (x, y). To visualize the data, we make a scatterplot by plotting each point
(x, y) on appropriate X– and Y – axes. The general pattern of the relationship will be seen.
To measure/quantify the degree of relationship between X and Y , a number called the correlation coefficient
R is calculated. The value of R is always between −1 and +1, that is
−1 ≤ R ≤ 1
As the graphs will illustrate,

when R is “near” 1 there is a strong positive relationship


when R is “near” 0 there is little or no relationship
when R is “near” −1 there is a strong negative relationship

The value R = 1 occurs when the points in the scatterplot fall exactly on a straight line with positive slope.
The value R = −1 occurs when the points in the scatterplot fall exactly on a straight line with negative slope.
We need some experience with scatterplots of data, “seeing” the degree of relationship in the plot and the
value of the correlation coefficient R to measure this. The following examples will illustrate this.

Example 1
The figure below is a scatterplot of the height and weight of 500 men (artificial data). We see the general type
of relationship: for higher X (height) there is higher Y (weight) and for lower X there is lower Y . Summary
statistics are:
Height: µX = 69 and σX = 3 (inches)
Weight: µY = 180 and σY = 20 (pounds)
The correlation coefficient is
R = 0.60
Note the value of R indicates a moderately strong, positive relationship.


Example 2
Data was available on 489 types of cars a few years ago. This data included multiple entries for a given brand
of car considering different engine sizes and transmissions available. The figures below show scatterplots
of city MPG against highway MPG and size of engine. The points were “jittered” a bit so that several points
would not fall at the same location. Note the fairly strong relationships. In the top graph, the points cluster
along a line, while in the second graph they cluster along a curve.
The directions of the relationships are as expected. The top graph has R = 0.95 and the bottom graph has
R = −0.72, indicating fairly strong relationships, positive for the top and negative for the bottom.


Example 3
For the 1997 Masters Tournament in golf, consider data on round 1 and round 2 scores for the 86 golfers
who finished all four rounds (this is the next major tournament after Tiger Woods made his first splash;
Tiger did not win this one, E. Els did). A scatterplot for this data appears below. The points were slightly
“jittered”. There is some evidence of an upward, positive trend. Those who had high scores in the first round
tended to also have high scores in the second round, although the tendency is moderate.
The correlation coefficient is R = 0.45.

Example 4
For 54 students in a statistics class, data was considered for scores on midterm exams, final
exams, and homework. How is the final exam score related to the midterm score and to the homework
score? The two scatterplots are below, slightly “jittered”. We see an upward tendency; those who scored
higher on the midterm tended to score higher on the final, with the same relationship for homework vs. final.
Examine a few points for those who scored higher (to the right) on the midterm and note their final score
(up-down). Do the same in the bottom graph. The correlation coefficients are:

• top graph: R = 0.48


• bottom graph: R = 0.482


So we have moderately high correlations in both cases. Summary statistics for the 3 variables:
         Midterm   Final   Homework
means    36.36     53.17   102.22
SDs      11.23     19.34   27.11


Example 5
Consider the 1998 nationwide data on the SAT test scores for each state and DC. The data is broken down
by verbal score and math score. The percentage of high school graduates who took the SAT test is included
too. The data appears below.
State Verbal Math Percentage
Ala. 562 558 8
Alaska 521 520 52
Ariz. 525 528 32
Ark. 568 555 6
Calif. 497 516 47
Colo. 537 542 31
Conn. 510 509 80
Del. 501 493 70
D.C. 488 476 83
Fla. 500 501 52
Ga. 486 482 64
Hawaii 483 513 55
Idaho 545 544 16
Ill. 564 581 13
Ind. 497 500 59
Iowa 593 601 5
Kan. 582 585 9
Ky. 547 550 13
La. 562 558 8
Maine 504 501 68
Md. 506 508 65
Mass. 508 508 77
Mich. 558 569 11
Minn. 585 598 9
Miss. 562 549 4
Mo. 570 573 8
Mont. 543 546 24
Neb. 565 571 8
Nev. 510 513 33
N.H. 523 520 74
N.J. 497 508 79
N.M. 554 551 12
N.Y. 495 503 76
N.C. 490 492 62
N.D. 590 599 5
Ohio 526 540 24
Okla. 568 564 8
Ore. 528 528 53
Pa. 497 495 71
R.I. 501 495 72
S.C. 478 473 61
S.D. 584 581 5
Tenn. 564 557 13
Texas 494 501 51
Utah 572 570 4
Vt. 508 504 71
Va. 507 499 66
Wash. 524 526 53
W.Va. 525 513 18
Wis. 581 594 7
Wyo. 548 546 10
U.S. 505 512 43


Scatterplots are below for two comparisons.

The correlation coefficients are:

top graph R = 0.97 (very high, positive)


bottom graph R = −0.89 (very high, negative)

The high negative R between Total score and Percent Take has important implications for comparing states
– the SAT scores depend highly on the percentage of students taking it – the higher the percentage, the
lower the score. The summary statistics are:
Variable N Mean Median StDev
Verbal 51 532.02 525.0 33.86
Math 51 533.47 528.0 35.13
Percent 51 37.35 32.0 28.03
Total Sc 51 1065.50 1053.0 68.50


Calculation of R
Given paired data on two variables X and Y , the correlation coefficient, R, is calculated as follows:

1. Calculate the Z scores for the X data.
\[ Z_X = \frac{X - \mu_X}{\sigma_X} \]
2. Calculate the Z scores for the Y data.
\[ Z_Y = \frac{Y - \mu_Y}{\sigma_Y} \]
3. Multiply ZX and ZY and calculate the average of these products.
\[ R = \frac{\sum (Z_X Z_Y)}{n} \]
The figure below displays how these steps work.

In the lower left quadrant of the scatterplot, points are below the mean for both X and Y , so both Z scores
are negative. Keep in mind that the product of two negative values is positive. Likewise, in the upper
right quadrant of the scatterplot, points are above the mean for both X and Y , so both Z scores are positive
and their product is again positive. When most points fall in these two quadrants, the average of the products
is positive, resulting in a positive correlation coefficient, R. Similar logic works for a negative correlation
coefficient: points in the upper left and lower right quadrants have Z scores of opposite sign, so their
products are negative.
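
The three steps above can be sketched in Python. This is a minimal illustration, assuming the standard deviations use divisor n (as in the worked examples of this chapter). The function name correlation_z and the small data sets are hypothetical and chosen only for illustration.

```python
import math

def correlation_z(xs, ys):
    """Correlation R as the average product of Z scores (divisor n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # SDs with divisor n, matching the formulas in the text.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    zx = [(x - mean_x) / sd_x for x in xs]   # step 1
    zy = [(y - mean_y) / sd_y for y in ys]   # step 2
    return sum(a * b for a, b in zip(zx, zy)) / n   # step 3

# Hypothetical data: points exactly on a line with positive or negative slope.
xs = [1, 2, 3, 4]
print(correlation_z(xs, [2 * x + 1 for x in xs]))   # very close to 1
print(correlation_z(xs, [10 - 3 * x for x in xs]))  # very close to -1
```

The two printed values illustrate the earlier statement that R = 1 or R = −1 when the points fall exactly on a straight line; tiny differences from ±1 can appear because of floating-point rounding.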

Alternate Calculation of R
Alternately, the correlation coefficient, R, can be calculated from the covariance. The covariance is
\[ \mathrm{COV} = \frac{\sum (xy)}{n} - \mu_X \mu_Y = \frac{\sum (x - \mu_X)(y - \mu_Y)}{n} \quad \text{(alternate formula)} \]
The first formula is the average of the products minus the product of the averages.
Then dividing by the two standard deviations,
\[ R = \frac{\mathrm{COV}}{\sigma_X \cdot \sigma_Y} \]
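
The covariance route can be sketched in Python as follows. This is an illustration under the same assumption as the text (all averages and SDs use divisor n). The helper names and the four data pairs are hypothetical, not taken from the course pack.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Average of the products minus the product of the averages."""
    products = [x * y for x, y in zip(xs, ys)]
    return mean(products) - mean(xs) * mean(ys)

def covariance_alt(xs, ys):
    """Alternate formula: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def sd(values):
    """Standard deviation with divisor n, as used in this chapter."""
    m = mean(values)
    return math.sqrt(mean([(v - m) ** 2 for v in values]))

def correlation_cov(xs, ys):
    """R = COV divided by the product of the two SDs."""
    return covariance(xs, ys) / (sd(xs) * sd(ys))

# Hypothetical paired data, for illustration only.
xs = [2, 4, 6, 8]
ys = [1, 3, 2, 5]
print(covariance(xs, ys), covariance_alt(xs, ys))
print(correlation_cov(xs, ys))
```

Both covariance functions should print the same value (up to floating-point rounding), which is the point of giving two formulas: the first is usually easier by hand, the second shows more directly why points far from both means contribute most.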


Example
Take a small data set with 5 points as given in the X and Y columns below. The means of X and Y are 5 and
10, respectively. The standard deviation of X is 2.83 and the standard deviation of Y is 2.61. Then we use
these means and standard deviations to calculate the Z scores in the following table:
X    Y    X·Y   ZX = (X − µX)/σX          ZY = (Y − µY)/σY           ZX·ZY
1    7    7     (1 − 5)/2.83 = −1.413     (7 − 10)/2.61 = −1.149     1.624
3    8    24    (3 − 5)/2.83 = −0.707     (8 − 10)/2.61 = −0.766     0.542
5    9    45    (5 − 5)/2.83 = 0          (9 − 10)/2.61 = −0.383     0
7    12   84    (7 − 5)/2.83 = 0.707      (12 − 10)/2.61 = 0.766     0.542
9    14   126   (9 − 5)/2.83 = 1.413      (14 − 10)/2.61 = 1.533     2.166
Sum       286                                                        4.874

For the first method, calculating the average of the last column,
\[ R = \frac{4.874}{5} = 0.9748 \]
For the alternative method, we calculate the means and standard deviations as above and use the X · Y column
also. Then,
\[ \frac{\sum (X \cdot Y)}{n} = \frac{286}{5} = 57.2 \]
\[ \mathrm{COV} = 57.2 - 5(10) = 57.2 - 50 = 7.2 \]
Then
\[ R = \frac{7.2}{2.83\,(2.61)} = 0.9748, \]
which is the same as the above.
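
As a check on the hand calculation, the same five data points can be run through a short Python sketch. It assumes divisor n for the SDs, as in the text. The variable names are ours. The sketch computes R twice: once with the SDs at full precision and once with the SDs rounded to two decimals as in the hand calculation.

```python
import math

xs = [1, 3, 5, 7, 9]
ys = [7, 8, 9, 12, 14]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

# Covariance: average of the products minus the product of the averages.
cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y

r_full = cov / (sd_x * sd_y)                          # SDs at full precision
r_rounded = cov / (round(sd_x, 2) * round(sd_y, 2))   # SDs rounded as in the text

print(round(cov, 4), round(r_full, 4), round(r_rounded, 4))
```

With rounded SDs the sketch reproduces the hand value of about 0.9748, while the full-precision SDs give a slightly larger value of about 0.976. The small gap comes only from rounding the SDs to 2.83 and 2.61 before dividing.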
Scatterplot of the above data set is as follows.


Example
Take a small data set with 6 points as given in the X and Y columns below. The adjacent columns are
produced to organize the calculations. The mean of X is 8 and the standard deviation is 3.416. The mean of
Y is 7 and the standard deviation is 2.944. Then we use these means and standard deviations to calculate
the Z scores in the following table:
X    Y    X·Y   ZX = (X − µX)/σX           ZY = (Y − µY)/σY            ZX·ZY
3    2    6     (3 − 8)/3.416 = −1.464     (2 − 7)/2.944 = −1.698      2.486
5    6    30    (5 − 8)/3.416 = −0.878     (6 − 7)/2.944 = −0.340      0.299
7    10   70    (7 − 8)/3.416 = −0.293     (10 − 7)/2.944 = 1.019      −0.298
9    6    54    (9 − 8)/3.416 = 0.293      (6 − 7)/2.944 = −0.340      −0.099
11   7    77    (11 − 8)/3.416 = 0.878     (7 − 7)/2.944 = 0           0
13   11   143   (13 − 8)/3.416 = 1.464     (11 − 7)/2.944 = 1.359      1.990
Sum       380                                                          4.378

Then, calculating the average of the last column,
\[ R = \frac{4.378}{6} = 0.730 \]
For the alternative method, we calculate the means and standard deviations as above and use the X · Y column
also. Then,
\[ \frac{\sum (X \cdot Y)}{n} = \frac{380}{6} = 63.333 \]
\[ \mathrm{COV} = 63.333 - (8 \cdot 7) = 63.333 - 56 = 7.333 \]
Then,
\[ R = \frac{7.333}{3.416\,(2.944)} = 0.729, \]
which agrees with the value above up to rounding.
Scatterplot of the above data set is as follows.


Comments
1. Interpretation of a correlation R depends to some degree on the field of application. For physical
quantities studied in science and engineering, we often find R values of 0.80 or higher, while in social
science areas correlations are seldom this large; R values of 0.30 may be notable.
2. The correlation coefficient, R, measures linear relationships, that is, the degree to which the points
concentrate along a straight line. It is most appropriate when the scatterplot is generally an oval
(ellipse or circle), either thin or fat in appearance.
3. The correlation coefficient, R, is less effective, even misleading, in measuring relationship when the
scatterplot of points does not concentrate along a straight line. Trouble is caused by
• outliers (data points that are extreme, outside of the cloud of the majority of the points)
• curved patterns
• non-oval shape
4. The correlation coefficient, R, is not affected by the choice of units used for the measurement scale.
We say that R is location and scale invariant. Specifically, the value of R is not changed if
• we add (or subtract) the same number to all the X values or to all the Y values
• we multiply or divide all X values (or all Y values) by the same positive number
This means that the choice of measuring temperature in degrees Celsius or Fahrenheit, length
in inches or feet, weight in ounces or grams, etc. will have no effect on correlations that may be
calculated.
5. Correlation does not imply causation. Two variables being correlated does not imply that one variable
is causing the other variable.
6. Sometimes the data could be divided into two or more subgroups in a natural way. For instance, data
on people could be divided into two sets, one for the men and the other for the women. Alternately,
the people could be divided into say five subgroups according to their age. In such cases we could
calculate correlation coefficients separately for each subgroup. It is important to know that the
correlations for the subgroups need not be the same or even close to the correlation coefficient
for the whole group.
The following graph, at top, shows scatterplots (really ovals) for men and women separately, each
having correlation of about R = 0.70. Together, just considering people, the combined plot is longer
and narrower having a correlation of about R = 0.90.
The graph below, at bottom, shows scatterplots for four groups, each having correlation of about R =
0. All together, the whole plot is longer and narrower and has correlation of about R = 0.80.
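
Comments 4 and 6 can be checked numerically. The Python sketch below is an illustration only: the function name correlation, the temperature and sales values, and the four square-shaped subgroups are all hypothetical, constructed so that each subgroup has R = 0 while the subgroup centers rise along a line.

```python
import math

def correlation(xs, ys):
    """Correlation R (divisor n); equivalent to the average product of Z scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

# Comment 4: changing units (Celsius to Fahrenheit) leaves R unchanged.
celsius = [10, 15, 20, 25, 30]          # hypothetical temperatures
sales = [12, 18, 17, 25, 31]            # hypothetical paired values
fahrenheit = [1.8 * c + 32 for c in celsius]
r_c = correlation(celsius, sales)
r_f = correlation(fahrenheit, sales)

# Comment 6: four subgroups, each a small square of points with R = 0,
# centered along a rising line so the combined cloud is long and narrow.
groups = []
for center in (0, 5, 10, 15):
    gx = [center - 1, center - 1, center + 1, center + 1]
    gy = [center - 1, center + 1, center - 1, center + 1]
    groups.append((gx, gy))

group_rs = [correlation(gx, gy) for gx, gy in groups]
all_x = [x for gx, _ in groups for x in gx]
all_y = [y for _, gy in groups for y in gy]
r_combined = correlation(all_x, all_y)

print(r_c, r_f)
print(group_rs, r_combined)
```

The first line of output should show two values that agree up to floating-point rounding. The second should show zero correlation inside every subgroup but a combined correlation well above 0.9, because most of the variation in the combined data comes from the differences between the subgroup centers.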


EXERCISES
1. Calculate R, the correlation coefficient, for the following sets of paired data:
A:
X = 3 8 10
Y = 6 4 2

B:
X = 4 7 5 9 5
Y = 2 4 3 7 4

2. In the scatterplots below, match the correlation coefficient from the selections: (0.15 – 0.25), (0.55 –
0.65), (0.75 – 0.85), (0.85 – 0.95).

3. What does it mean when two variables have a positive relationship? What about if the relationship is
negative? Explain.
4. Suppose you have the following two sets of data pairs, where X = (number of hours worked per week)
and Y = (number of hours of watching TV per week).

MEN (n = 100) WOMEN (n = 100)


x̄ = 38 SDx = 4.2 x̄ = 38 SDx = 4.2
ȳ = 15 SDy = 1.5 ȳ = 12 SDy = 1.5
R = 0.75 R = 0.75

If you combined both data sets, would the overall correlation coefficient of the new combined data
set likely increase, stay about the same, or decrease? (Hint: drawing a picture of the “data clouds”,
or ovals, sometimes helps)


5. (Hypothetical) Consider three data sets from three different branches of a company:

                                                    Branch A   Branch B   Branch C
Mean hours employees work per week                  45         43         41
SD of hours worked per week                         1          1          1
Mean distance employees live from work (in miles)   4          7          10
SD of distance (in miles)                           1.5        1.5        1.5
Correlation                                         ≈0         ≈0         ≈0

If the data from the three branches are combined, do you think the new correlation will be
(a) greater than 0
(b) less than 0
(c) approximately 0
6. Below are the scores for two games of bowling for a team of eight amateur bowlers. Calculate the
correlation coefficient between the two games. What can you conclude from this result?
Bowler 1 Bowler 2 Bowler 3 Bowler 4 Bowler 5 Bowler 6 Bowler 7 Bowler 8
Game 1 185 100 126 137 190 156 147 128
Game 2 162 157 138 167 210 169 157 168
7. Give two examples of two variables that are positively correlated and two that are negatively correlated.
8. For the two variables below, indicate whether you would “expect” a positive, negative, or close to
zero correlation, and briefly explain why:
(a) The number of animals at the SPCA (Society for Prevention of Cruelty to Animals) and the
number of SUVs on the road for a given county.
(b) National unemployment percentage and federal tax revenues.
(c) The number of hurricanes and the number of insurance claims in a state.
(d) Illegal drug use and graduation rate at a high school.
9. Look at the two scatterplots of data sets. For which one is the correlation coefficient a better measure
of the degree of association? Explain what correlation measures.

10. Studies have shown that eating breakfast is strongly associated with getting a high score on an
exam. (TRUE or FALSE) Due to the strong association between the two, we can conclude that in order to
get a high score on the exam, you should eat breakfast.


11. To complete their training, law enforcement officers make two laps around a driving obstacle course
where they drive a patrol car through a pylon-lined course as fast as possible while avoiding hitting
any pylons. Instructors riding in the car with them discuss their performance rating after each lap.
Program analysis shows that officer candidates who perform poorly on the first lap do much better on
the second, while those performing well on the first lap tend to do worse on the second. A conclusion
was reached that criticism helps candidates and praise makes them perform worse. As a result,
instructors were ordered to criticize all laps by candidates, regardless of performance ratings. Is this
policy warranted? Explain.

REVIEW EXERCISES
12. (Hypothetical) In a random sample of salaries for electrical engineers, the following data
was collected:
Holding B.S. Degree (n = 120) Holding M.S. Degree (n = 120)
Sample mean salary(x) $64k $73k
Sample SD salary $4.5k $4.5k
Sample mean experience(y) 9 yrs 9 yrs
Sample SD experience 4 yrs 4 yrs
R 0.67 0.67

If the data sets were combined, do you think the correlation coefficient would
(a) stay about the same.
(b) increase.
(c) decrease.
Briefly explain your answer. (Hint: sketching both “data clouds” can help.)
13. A teaching assistant gives a quiz with ten questions and no partial credit. After grading the papers,
the TA writes down for each student the number of questions that student got right and the number
wrong. The average number of right answers is 6.4 with a SD of 2.0; the average number of wrong
answers is 3.6 with the same SD of 2.0. The correlation between the number of right answers and
number of wrong answers is:

0     −0.50     0.50     −1     1     can’t tell without the data

Explain.
14. Three data sets are collected and the correlation coefficient is computed in each case. The variables
are:
(i) grade point average in freshman year and in sophomore year
(ii) grade point average in freshman year and in senior year
(iii) length and weight of two-by-four boards
Possible values for correlation coefficients are

−0.50 0.0 0.30 0.60 0.95

Match the correlations with the data sets (two will be left over). Explain your choices.
15. In a study of children, a researcher found a high negative correlation between hours of TV per week
and reading ability test score. He concluded from this that TV watching makes children poor readers.
Briefly evaluate this conclusion.


16. The scatterplot below shows the heights of 56 boys at ages 4 and 18.

(a) The average height at age 4 is around:


39 inches 41 inches 43 inches

(b) The SD of height at age 18 is around:


0.5 inches 1.0 inches 2.0 inches

(c) The correlation coefficient is around:


0.50 0.80 0.95

Explain your answers.


17. For 3rd grade children, the correlation between height and weight is about 0.20. 4th grade children
have the same correlation of 0.2, but are taller by about 3” and heavier by about 15 lbs. If we consider
the combined group of 3rd and 4th graders, the correlation between height and weight would be (circle
one):
less than 0.20 approximately equal to 0.20 or greater than 0.20
18. Mark, who lives in a one-bedroom apartment, wants to know if there is an association between
the size of an apartment and its monthly heating cost. Reading through some published articles, he found
the correlation between the two to be 0.48. Is he right to say that if he moves to an apartment double
the size of his current one, the correlation would double to 0.96 as well? Explain your answer.
19. Upon the completion of a study which investigated a new technique for teaching a reading course,
investigators submitted the following statement: “While the education program produced no overall
increase in reading ability, those with low initial reading scores subsequently increased their scores
while those with higher initial reading scores subsequently showed no appreciable change or decreased
slightly. We recommend that the education program be continued because of its demonstrated
benefit to those with low scores. However, it should not be offered to those whose scores are
adequate to begin with.” Is this a justified statement?
20. Students in a large lecture class take a midterm test and they have the option of taking a makeup
test a week later. The data show that those who make this choice tend to do better on the makeup.
The instructor believes this is evidence that these students study harder and learn the material better
and thus concludes that the option of a second test promotes learning. Is there another plausible
explanation?


21-23. Find the correlation coefficient for each data set.


(21) (22) (23)

X Y X Y X Y
2 9 1 1 1 2
3 7 1 3 1 2
7 3 3 3 1 2
8 1 5 4 1 2
9 5 2 4
11 8 2 4
2 4
3 6
3 6
4 8

Chapter 9

Regression

When we have two variables, X and Y , measured on the individuals under study, the correlation coefficient, R, gives
us a quantitative measure of the degree of the relationship between them. The value of R is useful as a
descriptive statistic, but now we turn to a further detail, namely that of prediction.

If we have summary information

• on Y via µY and σY ,
• on X via µX and σX , and
• on R (the degree of relationship),

how can we predict the Y value for an individual with a known X value? Or, how can we estimate the average
Y value for those individuals with a specified X value? If we make such a prediction or estimate, how far off
will we be?

As an example, suppose we consider data on X = height and Y = weight for a sample of 500 men in a
community. The summary statistics for the data are:
Height: µX = 69 and σX = 3 (inches)
Weight: µY = 180 and σY = 20 (pounds)
R = 0.6

A scatterplot of the 500 points is given below.


We see considerable scatter, but there is a general tendency for weight to increase as height increases. If
we consider the height of an arbitrary man in this group, how should we predict his weight? Suppose this
man has a height that is one standard deviation above the mean, that is x = 69 + 3 = 72. Should we predict
that his weight will also be one standard deviation above the mean at y = 180 + 20 = 200?

The answer is no, as we can see by the following consideration. In the scatterplot, visualize the points that
have x = 72. They are in the vertical strip as drawn in the figure below at left. For these men with height x =
72, inspect their weights, y, in the vertical direction (up-down) and make a mark (the large X) at the middle
(mean) as was done. This weight y, being at the middle of the group, is our best prediction for the weight
of a single individual of height x = 72. The mark X is at about y = 192, so we will use this as our predicted
weight. Note that this is less than one standard deviation above the mean.

We can repeat the above thinking for any x value. Let’s do it for a choice of several x values, say for x = 63,
66, 69, 72, and 75. Visualize the vertical strips at each of these values and mark the middle for each. See the
figure below at right. Again, these X marks at the center of the vertical strips would be the best estimates of
weight y at these heights. We see that the estimate of weight differs as height x changes, as is appropriate,
since weight depends on height to some degree.


The Regression Line


In an overall view, the X marks fall approximately along a straight line that catches our eye. Draw such a
line as is done in the figure below at left. This line is called the regression line and it is important as a
summary of the above thinking. The regression line passes through the middles of the vertical strips at
fixed X values. We can use it to predict the weight Y for individuals at arbitrary heights X.
As a final view, let’s look at the scatterplot of the data with the regression line as in the figure on the next
page at right. The standard deviation line, the dashed line, is also drawn for reference.

To visually use the graph to predict Y from X:

• start with some x value on the bottom axis


• move straight up to the regression line
• then move horizontally to the left side axis and stop
• the y value on this axis is the prediction

The standard deviation line in the figure below-right naturally catches our attention, since it passes through
the ends of the football-shaped region of the scatterplot. This is fine for reference purposes, but it is
important to note that the standard deviation line does not pass through the middles of vertical strips and
so is not the regression line.


A Calculation Method For Prediction


The regression method of predicting y from x used in the height-weight example can be summarized by the
following general algorithm:

1. Choose the X value x.

2. Express this X value as a Z score: $Z_X = \dfrac{x - \mu_X}{\sigma_X}$

3. Calculate the Z score of Y: $Z_Y = R \cdot Z_X$

4. Convert this to its Y value: $y = \mu_Y + \sigma_Y \cdot Z_Y$

To illustrate this approach, suppose we want to predict the weight for an individual of height x = 75. We
proceed as follows. Recall that we had

Height: µX = 69 and σX = 3 (inches)


Weight: µY = 180 and σY = 20 (pounds)
R = 0.6

$$
\begin{aligned}
x &= 75 \\
Z_X &= \frac{x - \mu_X}{\sigma_X} = \frac{75 - 69}{3} = 2 \\
Z_Y &= R \cdot Z_X = (0.6)(2) = 1.2 \\
y &= \mu_Y + \sigma_Y \cdot Z_Y = 180 + (20)(1.2) = 180 + 24 = 204
\end{aligned}
$$

Thus, we predict an individual of height 75 inches will weigh 204 pounds. This result has another use
and interpretation in terms of an average: the group of individuals whose height is 75 inches is estimated
to have average weight of 204 pounds. This second interpretation should seem natural from the earlier
graphical discussion with the vertical strips.
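
The four-step calculation can also be sketched in a few lines of Python. This is only an illustrative sketch and not part of the original course material; the function name `predict_y` and its parameter names are hypothetical choices made for this example.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Regression prediction of Y from X via Z scores (Z_Y = R * Z_X)."""
    z_x = (x - mean_x) / sd_x      # step 2: Z score of the chosen X value
    z_y = r * z_x                  # step 3: Z score of the predicted Y
    return mean_y + sd_y * z_y     # step 4: convert back to Y units


# Height-weight example from the text: a man 75 inches tall.
weight_75 = predict_y(75, mean_x=69, sd_x=3, mean_y=180, sd_y=20, r=0.6)
print(round(weight_75, 1))  # 204.0
```

The same function can be reused for any of the later examples by passing in a different set of five summary statistics.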


It is useful to restate the above calculations in words. An individual with a height of 75 inches has a
Z score of 2, meaning that he is 2 standard deviations above the mean height. We convert his height
Z score to his weight Z score by multiplying by the correlation R, getting 1.2. This means that his
predicted weight is 1.2 standard deviations, that is 24 pounds, above the mean weight. Converting this to an
actual weight, we get 180 + 24 = 204 pounds.
The central notion in the above calculations is the equation about the Z scores

ZY = R · ZX

which says that an individual’s weight Z score is R times his height Z score. This is the central regression
idea. A rigorous, mathematical proof of this is beyond our scope here.
Let’s try our method for one last prediction. What is the predicted weight for someone who is 64 inches
tall? The calculation is

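$$
\begin{aligned}
x &= 64 \\
Z_X &= \frac{x - \mu_X}{\sigma_X} = \frac{64 - 69}{3} = -\frac{5}{3} = -1.67 \\
Z_Y &= R \cdot Z_X = 0.6(-1.67) = -1.002 \\
y &= \mu_Y + \sigma_Y \cdot Z_Y = 180 + 20(-1.002) = 180 - 20.04 = 159.96
\end{aligned}
$$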

which we will round off at the end to y = 160. Thus x = 64 goes with y = 160; we predict that an individual
who is 64 inches tall will weigh 160 pounds. Or, for the second interpretation, the average weight is 160
pounds for the group that is 64 inches tall.
Here, an individual with a height of 64 inches is 1.67 SDs below the mean height. Converting the height Z
score to a weight Z score gives −1.002, meaning that the predicted weight is 1.002 SDs
below the mean weight.

Regression With Percentiles


Sometimes we want to express our information in terms of percentiles rather than actual values. For
example, with the height-weight data, if an individual’s height is at the 80th percentile of the heights, what
would we predict for the percentile rank of his weight? We deal with such a question in the same manner
as before except we take care to use percentiles. We will assume that the normal distribution applies to a
satisfactory approximation for the distribution of the heights and of the weights. Then proceed as follows:
X at the 80th percentile means its Z score is ZX = 0.84 (see the Normal table: an area of 0.80 goes with Z =
0.84). Then, using ZY = R · ZX , the Z score on Y is ZY = 0.6 (0.84) = 0.504 ≈ 0.50, and this Z score is at the
69th percentile (in the Normal table, Z = 0.50 goes with area 0.69). Thus, a person whose height is at the
80th percentile is predicted to have weight at the 69th percentile.


In using the Normal table above, remember that the percentile rank of a Z score is the area under the
Normal curve to its left. In the reverse direction, given a percentile rank, the corresponding Z score is the value
that cuts off this area to the left.
As another example, suppose we begin with an individual whose height is at the 10th percentile. This is in
the lower end of the distribution. What is our prediction for the percentile rank of his weight? The method
is as follows:
X at the 10th percentile means its Z score is ZX = −1.28 (Normal table, left side of the curve), then using
(ZY = R · ZX ) the Z score on Y is ZY = (0.6) (−1.28) = −0.768 ≈ −0.77 and this Z score, on consulting the
Normal table again on the left side, is at the percentile 1 − 0.779 = 0.221 ≈ 0.22. Thus, a person whose
height is at the 10th percentile is predicted to have weight at the 22nd percentile.
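
The two percentile calculations can be reproduced without a printed table. The sketch below is an illustration only; it uses `statistics.NormalDist` from the Python standard library in place of the Normal table, and the function name `predict_percentile` is a hypothetical choice. Like the text, it assumes both variables are approximately normal.

```python
from statistics import NormalDist


def predict_percentile(pct_x, r):
    """Predicted percentile of Y (as a fraction) when X sits at percentile pct_x."""
    std_normal = NormalDist()          # standard normal curve: mean 0, SD 1
    z_x = std_normal.inv_cdf(pct_x)    # percentile rank -> Z score
    z_y = r * z_x                      # regression step: Z_Y = R * Z_X
    return std_normal.cdf(z_y)         # Z score -> percentile rank


print(round(predict_percentile(0.80, 0.6), 2))  # 0.69
print(round(predict_percentile(0.10, 0.6), 2))  # 0.22
```

Because the code does not round the Z scores to two decimals, its unrounded results can differ slightly in the third decimal place from the table-based values.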

A curious and important point emerges from the previous two percentile calculations. When height was at
the 80th percentile (this is above the mean) the weight is predicted to be at the 69th percentile (still above
the mean, but closer to the mean). And when height was at the 10th percentile (this is below the mean) the
weight is predicted to be at the 22nd percentile (still below the mean, but closer to the mean). In both cases
the predicted weight Y was closer to the mean than the starting height X. This is a generally occurring
phenomenon, called the “Regression Effect”, that has some important implications in practice.

The Regression Effect


In regression, for a given x value we make a prediction of the y value for an individual. The predicted y value
is closer to the Y -mean than the x value was to the X-mean. This is the regression effect. The “closeness”
is measured on a standard deviation or percentile scale. The regression effect is caused by the pivotal
calculation ZY = R · ZX . Since R lies between −1 and 1, ZY will be closer to zero than ZX whenever the
correlation is not perfect.
The regression effect is a phenomenon that is of special interest when the two variables X and Y measure
the same or similar outcomes but at two different points in time (a test-retest situation). For example,

• students take a midterm and a final exam


• students have an entering SAT score and a freshman college GPA
• social indicators on low income families measured before and after an intervention program is started
• blood pressure of patients in a medical study is measured before (baseline) and six months after they
begin treatment with a new drug
• baseball players have batting averages for two consecutive years
• golfers in a tournament have first round and second round scores
• football players have a performance record in college and another as professionals
• employees have an annual performance rating on various aspects of their job for a total of 100 points
and this data is considered for two consecutive years

The term “regression effect” was first coined by Galton when he was studying data on the heights of fathers
and their sons. His insight was to notice that tall fathers had sons who were tall on average, but not as tall
as they were and that short fathers had sons who were short on average, but taller than they were. In both
cases the sons had heights that on average were closer to the average height for all men than the height of
their father. In other words, the sons’ heights “regressed” toward the average height of men. After this first
discovery, others noticed this same force in a wide variety of situations of repeated measurements.

Example
Consider data on test scores for 104 students in three statistics classes in 1997. The midterm exam score
percent is X and the final exam score percent is Y with the following summary statistics:

µX = 68.7 and σX = 15.2


µY = 69.2 and σY = 16.3
R = 0.67

The means and standard deviations are similar for both tests. A scatterplot of the data is shown below.


Consider a student who scores quite well on the midterm, say he has x = 90. The regression method shows,
as you can check, that his predicted final score is 84 (rounded from 84.5). Thus, we predict that he will score
less on the final; he will regress towards the mean.
Take another case, say a student who scores low at x = 50 on the midterm. The regression method shows
that his predicted score on the final is 56 (rounded from 55.8). Thus we predict that his score will go up,
again closer to the mean.
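
To check these two predictions, and to see the regression effect across a range of midterm scores, the calculation can be sketched in Python. This is an illustrative sketch only; the function name `predict_final` is hypothetical, and the summary statistics are the ones given for this example.

```python
def predict_final(midterm, mean_x=68.7, sd_x=15.2, mean_y=69.2, sd_y=16.3, r=0.67):
    """Predicted final exam score for a given midterm score."""
    z_x = (midterm - mean_x) / sd_x    # midterm score in standard units
    return mean_y + sd_y * r * z_x     # predicted final, back in score units


for midterm in (50, 60, 70, 80, 90):
    z_x = (midterm - 68.7) / 15.2
    final = predict_final(midterm)
    z_y = (final - 69.2) / 16.3
    # In every row, Z_Y is closer to zero than Z_X: the regression effect.
    print(f"midterm {midterm}: Z_X = {z_x:+.2f}, predicted final = {final:.1f}, Z_Y = {z_y:+.2f}")
```

In each printed row the predicted final score, measured in standard units, sits closer to the mean than the midterm score did.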
Instead of the perspective of one student, consider the group who got the top ten scores on the midterm
(in the graph they are the ten points farthest to the right). From the graph, we can inspect their final scores
(up-down height) and ask if they are the ten highest points. The answer we see is no, they have generally
high final scores, but not the highest. In this sense they have “fallen back towards the mean”. Similarly,
look for the twelve students who had the lowest midterm scores (left-right direction). Then ask if they have
the lowest final scores (up-down direction) and again answer no. On average, these twelve students have
lower final scores, but not the lowest. They have “moved up towards the mean”.
The same result is noticed repeatedly in sports. Consider the drafting of football players from college to the
NFL. Speaking a bit loosely, if we consider the top twenty candidates based on their college performance
do we find them to be the top twenty performers among the first year NFL players? Usually, this is not so.
On average they do well, but not at the very top. Some others, the “surprises”, emerge to be among the top
players in the pro league.
The same sort of thing happens in the “drafting”, via scholarships, of high school athletes to play on
college teams. The coaches spend a lot of time on their recruiting choices; their whole success depends
on good choices and this business gives them many headaches, I’m sure. Sometimes their choices prove
to be good, other times not so good and there are always “surprises”, both good and bad. A big part of the
problem may very well be the regression effect governing human performance from one time period to the
next. On average, the best fall back and the worst improve.


As another sports example, consider the golf data from the correlation chapter. Viewing the scatterplot of
the scores on the first two rounds (see figure above), consider the five leftmost points, that is, the
five golfers who had the lowest first round scores. Now examine these five points in the up-down direction
(their second round scores). These points, or their average, are not the lowest in the up-down direction,
indicating that these “best” golfers on round one did worse on round two; they regressed to the mean,
tending towards average.
One way to understand the regression effect is to think of a conceptual model as follows:

OUTCOME = SKILL + LUCK

where SKILL = ability, effort, experience (unique to a person),


LUCK = chance occurrence, noise (varies time to time).
A high outcome the first time is likely to mean that both the person’s SKILL and LUCK are high (others
with about the same SKILL may have less LUCK). The next time the SKILL is the same, but the LUCK
changes, and it tends to go down since it is not likely to be high twice in a row. In this way a person who
scores high at first tends to fall back and regress to the mean.
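
A small simulation can make the SKILL + LUCK model concrete. The sketch below is purely illustrative: it assumes, for the sake of the example, that SKILL and LUCK are independent draws from a standard normal distribution, and the group size and random seed are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Assumed setup: SKILL and LUCK are independent standard normal draws.
skill = [rng.gauss(0, 1) for _ in range(n)]
first = [s + rng.gauss(0, 1) for s in skill]    # OUTCOME at time 1
second = [s + rng.gauss(0, 1) for s in skill]   # same SKILL, fresh LUCK

# Follow the top 10% of performers at time 1 to time 2.
top = sorted(range(n), key=lambda i: first[i], reverse=True)[: n // 10]
mean_first = sum(first[i] for i in top) / len(top)
mean_second = sum(second[i] for i in top) / len(top)

print(f"top group, time 1 average: {mean_first:.2f}")
print(f"top group, time 2 average: {mean_second:.2f}")
```

Under these assumptions the top group still averages above the overall mean of 0 at time 2, but noticeably lower than at time 1, because its unusually good LUCK is not repeated.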


Example: Father–Son heights (Galton)

Father Height: mean = 68 SD = 2.7


Son Height: mean = 69 SD = 2.7
R = 0.5
n = 300 points
A father who is 72 inches tall is predicted to have a son about 71 inches tall (ZX = 4/2.7 ≈ 1.48,
ZY = 0.5 · 1.48 ≈ 0.74, and 69 + 0.74 · 2.7 ≈ 71). A father who is 64 inches tall
is predicted to have a son about 67 inches tall.

The Root Mean Square


With paired data on two variables X and Y the scatterplot shows a nice view of the relationship between
them. The regression line simply summarizes the main trend and is used to predict Y from X. We now ask
the question “how accurate is such a prediction”? How closely can we predict the Y value for an individual
when we know the X value? For a golfer, if we know his first round score, how accurately can we predict
his second round score? For a student, if we know her first test score, how well can we predict her second
test score?
The matter of the accuracy of such a prediction can be seen nicely in the scatterplot. It’s clear that when the
points cluster closely along the regression line, the prediction will be quite close (knowing the X value, we
“almost” know the Y value). On the other hand, when the points scatter widely around the regression line,
knowledge of X doesn’t tie down the corresponding Y value very well. That is, we are not able to predict Y
very well. So the issue of accuracy of prediction rests on the degree to which the points concentrate along
the regression line. We measure this feature of the data with a quantity called Root Mean Square (RMS).
The Root Mean Square (RMS) is defined in terms of residuals. Each point in the scatterplot has a residual
defined by
$$\text{Residual} = \text{vertical distance of point to regression line} = Y_{\text{actual}} - Y_{\text{predicted}}$$
The next two figures show graphs of the residuals for two small data sets having just five points. Residuals
are the vertical distances from the points to the regression line as labeled on the figures. In the right figure,
the points concentrate more closely along the line and we can see that the residuals are smaller.


An overall measure of the sizes of the residuals is made with the Root Mean Square defined by
$$\text{RMS} = \sqrt{\frac{\sum \text{residual}^2}{n}}$$

where we square the residuals, then average these and finally take a square root.
For example, for the data of the figure at left the following table gives the calculations.


X    Y     Y Predicted    Residual    Residual²

1    1     4              −3          9
2    11    5.5            5.5         30.25
3    3     7              −4          16
4    12    8.5            3.5         12.25
5    8     10             −2          4
                          Sum:        71.5

Then $\text{RMS} = \sqrt{71.5 / 5} = 3.78$.
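
The table calculation (square the residuals, average them, take the square root) can be sketched in Python as follows. This is only an illustration; the variable names are arbitrary, and the actual and predicted Y values are copied from the table.

```python
from math import sqrt

y_actual = [1, 11, 3, 12, 8]
y_predicted = [4, 5.5, 7, 8.5, 10]

# Residual = actual Y minus predicted Y, one per point.
residuals = [a - p for a, p in zip(y_actual, y_predicted)]
sum_sq = sum(res ** 2 for res in residuals)   # sum of squared residuals
rms = sqrt(sum_sq / len(residuals))           # root of the mean square

print(residuals)      # [-3, 5.5, -4, 3.5, -2]
print(sum_sq)         # 71.5
print(round(rms, 2))  # 3.78
```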
For the figure above at right, the RMS can be calculated in a similar way, or it’s quicker to use the residuals
as labeled on the graph. The calculations are
$$(-0.5)^2 + (1.5)^2 + (1)^2 + (1)^2 + (1)^2 = 5.5$$
$$\text{RMS} = \sqrt{\frac{5.5}{5}} = 1.05$$
We see the points are closer to the line in the rightmost figure and its RMS is smaller than that of the
leftmost figure. This is the point of RMS, to measure the amount of spread of the points around the line.
The calculation here of RMS should remind you of our earlier calculations of SD. In fact, it is the same
calculation as before, but based on residuals. Residuals visually are vertical distances within narrow
vertical strips of points to the “middle” of the strip. So, in this sense, the RMS measures variability in
vertical strips through the data in the same way that SD measures variability in a set of data. This means
that the interpretation of RMS is the same as that of SD with an adjustment for the context. For instance,
if we draw two lines parallel to the regression line, one up RMS units and the other down RMS units, this
band through the plot would contain about 68% of the points.
Alternately, if we draw the band up and down 2 RMS units, it would contain about 95% of the points. This
is a nice view in seeing how the value of RMS is measuring the amount of variability of the data along the
regression line.
A technical point is of interest here: the regression line is uniquely defined from the data as the line that
has the smallest RMS among all potential lines. Sometimes it is called the “least-squares” regression line
to emphasize this fact.
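
The least-squares property can be checked numerically on the five-point data set used for the first RMS calculation (X = 1, …, 5 and Y = 1, 11, 3, 12, 8). In the sketch below, which is an illustration only, the line y = 2.5 + 1.5x is the regression line for those points, since it reproduces the predicted values 4, 5.5, 7, 8.5, 10. The competing lines are hypothetical alternatives picked just for comparison.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5]
ys = [1, 11, 3, 12, 8]


def rms_for_line(intercept, slope):
    """RMS of the vertical distances from the points to a given line."""
    squares = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squares) / len(squares))


best = rms_for_line(2.5, 1.5)   # the regression line for these points

# A few other lines, chosen arbitrarily for comparison.
others = [rms_for_line(2.5, 1.6), rms_for_line(3.0, 1.5), rms_for_line(7.0, 0.0)]

print(round(best, 2))                  # 3.78
print([round(v, 2) for v in others])   # each value is larger than 3.78
```

Trying a handful of lines does not prove the claim, but it shows the pattern: moving the slope or intercept away from the regression line makes the RMS larger.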
The calculations for RMS above illustrate the concept, but with larger data sets the work is time consuming.
Fortunately, there is a shortcut, a quick way if we have the basic five summary statistics for bivariate data,
that is

$$\text{RMS} = \sqrt{1 - R^2} \cdot \sigma_Y$$

Thus, RMS is calculated from the correlation R and the standard deviation σY . From this formula we
can see that RMS, the variation of y for a given x, is a fraction of the overall variation of Y . The fraction,
$\sqrt{1 - R^2}$, depends on R.
The larger the R in magnitude (either R near 1 or R near −1), the smaller the RMS.
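
As a check, the shortcut can be compared with the direct residual calculation on the five-point data set from the earlier table (X = 1, …, 5 and Y = 1, 11, 3, 12, 8), where the direct RMS was 3.78. The sketch below is illustrative only. It assumes the standard deviations are computed with divisor n (`statistics.pstdev`), which is the convention under which the shortcut matches the direct calculation exactly.

```python
from math import sqrt
from statistics import mean, pstdev

xs = [1, 2, 3, 4, 5]
ys = [1, 11, 3, 12, 8]
n = len(xs)

mu_x, mu_y = mean(xs), mean(ys)
sd_x, sd_y = pstdev(xs), pstdev(ys)   # SDs with divisor n (an assumption)

# Correlation as the average product of the Z scores.
r = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

rms_shortcut = sqrt(1 - r ** 2) * sd_y
print(round(r, 3))             # 0.489
print(round(rms_shortcut, 2))  # 3.78
```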


Example
The sand lance is a small fish in the north Atlantic. Data was collected on them to study how fast they grow.
With X = age (years) and Y = length (mm) the following summary statistics were calculated:

µX = 5 and σX = 2
µY = 225 and σY = 30
and R = 0.80

Predict the length of an 8 year old fish:


$$
\begin{aligned}
x &= 8 \\
Z_X &= \frac{8 - 5}{2} = 1.5 \\
Z_Y &= 0.80 \cdot 1.5 = 1.2 \\
y &= 225 + (1.2) \cdot 30 = 225 + 36 = 261
\end{aligned}
$$
We predict an 8 year old fish would have length 261 mm. How accurate is this prediction?
$$\text{RMS} = \sqrt{1 - 0.80^2} \cdot 30 = 0.6 \cdot 30 = 18$$

So the length is predicted as 261 mm, give or take 18 mm (about a 68% chance).

About 95% of the 8 year old sand lances have length in the range 261 ± 2 · 18, that is, 261 ± 36 mm.
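
The prediction and its give-or-take value can be produced together from the five summary statistics. The following Python sketch is only an illustration; the function name `predict_with_rms` is a hypothetical choice.

```python
from math import sqrt


def predict_with_rms(x, mean_x, sd_x, mean_y, sd_y, r):
    """Return (predicted y, RMS) from the five summary statistics."""
    y_hat = mean_y + sd_y * r * (x - mean_x) / sd_x   # regression prediction
    rms = sqrt(1 - r ** 2) * sd_y                     # spread within the vertical strip
    return y_hat, rms


# Sand lance example: predicted length of an 8 year old fish.
y_hat, rms = predict_with_rms(8, mean_x=5, sd_x=2, mean_y=225, sd_y=30, r=0.80)
print(round(y_hat), round(rms))                          # 261 18
print(round(y_hat - 2 * rms), round(y_hat + 2 * rms))    # 225 297
```

The second line of output is the 261 ± 36 range, written as its two endpoints.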

Example
Consider earlier data on test scores for 104 students in three statistics classes in 1997. The midterm exam
score percent is X and the final exam score percent is Y with the following summary statistics:

µX = 68.7 and σX = 15.2


µY = 69.2 and σY = 16.3
and R = 0.67

Earlier we had calculated that for a student with midterm score x = 90, his final score is predicted as 84.
How close might this be? To answer this, we need RMS, which is calculated as
$$\text{RMS} = \sqrt{1 - 0.67^2} \cdot 16.3 = 12.1 \approx 12$$

So we can state conclusions such as:

• A student with midterm score of 90 is predicted to have final score of 84, give or take 12.
• 68% of the students with midterm score of 90 have final scores in the range 84 ± 12.
• There is a 68% chance that a student with midterm score of 90 will have final score in the range 84 ±
12.
• 95% of the students with midterm score of 90 have final scores in the range 84 ± 24.
• There is a 95% chance that a student with midterm score of 90 will have final score in the range 84 ±
24 (to get 95% we use ± 2 RMS = 24 following our earlier empirical rule).

These statements are based on the idea that the prediction of 84 and the RMS of 12 are to be interpreted
like the average and standard deviation of a group of students (those with midterm 90).


Using the Normal Distribution


If we can make the additional assumption that the final scores in a vertical strip follow a normal distribution
to a sufficient degree of accuracy, then we can use the normal distribution to make more detailed conclusions.
Consider a question such as “what percent of the students who score 90 on the midterm will have final
score above 89”, or equivalently, “what is the chance that someone who scores 90 on the midterm will
score above 89 on the final?”
The approach to answer these questions is to infer, from our earlier calculations, that final scores for this
group (those with midterm 90) have a normal distribution with mean 84 (Predicted Y value) and standard
deviation 12 (RMS). Then, with our usual procedure for normal distributions, we have

$$Z = \frac{89 - 84}{12} = \frac{5}{12} = 0.42$$
and we want the area for the Z curve above this value. From the normal table, the area less than 0.42 is
0.663. Thus, the area above 0.42 is 1 − 0.663 = 0.337. We state then, that, there is a 33.7% chance that a
student with 90 on the midterm will score above 89 on the final.
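
The same area can be found in Python with `statistics.NormalDist` from the standard library instead of the printed table. This is an illustrative sketch that takes the mean of 84 and the SD of 12 from the text and, like the text, assumes the final scores in the vertical strip are approximately normal.

```python
from statistics import NormalDist

# Final scores for students with midterm 90:
# mean = predicted value (84), SD = RMS (12).
strip = NormalDist(mu=84, sigma=12)

chance_above_89 = 1 - strip.cdf(89)   # area under the curve to the right of 89
print(round(chance_above_89, 2))      # 0.34
```

The code does not round the Z score to 0.42 first, so its unrounded answer differs from the table-based 0.337 only in the third decimal place.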

Example
Suppose from a large survey of mechanical engineers, we focus on the income issue with two of the
responses: X = years of experience and Y = salary. From the data we find

µX = 15 years and σX = 5 years


µY = $50,000 and σY = $5,000
and R = 0.70


Consider an engineer with considerable experience, say x = 25. To predict his salary by our regression
method, use
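$$
\begin{aligned}
x &= 25 \\
Z_X &= \frac{x - \mu_X}{\sigma_X} = \frac{25 - 15}{5} = 2 \\
Z_Y &= R \cdot Z_X = (0.7)(2) = 1.4 \\
y &= \mu_Y + \sigma_Y \cdot Z_Y = 50{,}000 + (5{,}000)(1.4) = 57{,}000
\end{aligned}
$$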

To assess variation,

$$\text{RMS} = \sqrt{1 - 0.7^2} \cdot 5{,}000 = 3570.71 \approx 3571$$

Thus we conclude that,

• an engineer with 25 years of experience is predicted to have salary of $57,000, give or take $3,571.
• 68% of the engineers with 25 years of experience have salaries in the range $57,000 ± $3,571.
• there is a 95% chance an engineer with 25 years of experience has salary in the range $57,000 ±
$7,142 (note we have used ± 2 RMS = ± 7,142 to get this).

A more detailed question is “what percent of the engineers with 25 years of experience have salaries below
$50,000?” If we assume the normal distribution applies, we convert to Z score as

$$Z = \frac{50{,}000 - 57{,}000}{3571} = -1.96$$

and the area under the normal curve to the left of −1.96 is 1 − 0.975 = 0.025. So about 2.5% of the engineers
with 25 years of experience have salaries below $50,000.


Example
Return to the earlier example for X = height and Y = weight. The summary statistics for the data are:

• Height: µX = 69 and σX = 3 inches


• Weight: µY = 180 and σY = 20 pounds
• R = 0.6

Earlier we had calculated that for an individual with height of x = 75, his predicted weight is y = 204. The
variation for the regression prediction is

$$\text{RMS} = \sqrt{1 - 0.6^2} \cdot 20 = 16$$

What is the chance an individual with height of x = 75 has weight above 230? To answer, visualize a
normal curve with mean 204 and standard deviation of 16 and find the area above 230. The Z score is
$Z = \frac{230 - 204}{16} = \frac{26}{16} \approx 1.63$. The area above a Z score of 1.63, from the table, is
1 − 0.948 = 0.052. This is about a 5% chance.

The Regression Line Equation


Up until now, we have been using the regression line conceptually as a measure of the centers of data
in vertical strips. To be more concrete, the algebraic equation of this straight line can be specified. The
equation of a line can be expressed in terms of its slope and intercept as
Ŷ = intercept + slope ·X


For the regression line the results are


$$\text{slope} = R \cdot \frac{\sigma_Y}{\sigma_X}$$
$$\text{intercept} = \mu_Y - \text{slope} \cdot \mu_X$$

Example
Consider the previous data on salary and years of experience for mechanical engineers where the summary
statistics were

µX = 15 years and σX = 5 years


µY = $50,000 and σY = $5,000
and R = 0.70

Then we have for the equation of the regression line

$$\text{slope} = R \cdot \frac{\sigma_Y}{\sigma_X} = 0.70 \cdot \frac{5{,}000}{5} = 700$$
$$\text{intercept} = \mu_Y - \text{slope} \cdot \mu_X = 50{,}000 - 700 \cdot 15 = 39{,}500$$
and the equation of the regression line is
Ŷ = 39,500 + 700 ·X
We use the equation of the regression line to make predictions by substituting the desired X in the equation.
For example, if x = 25 then ŷ = 39,500 + 700 (25) = 57,000. This is the same predicted salary that we obtained
before by our first method.
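
The slope and intercept formulas can be wrapped in a small helper. The sketch below is one possible way to do it in Python, using the summary statistics of the salary example; the function name `regression_line` is hypothetical and chosen only for this illustration.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (intercept, slope) of the regression line from five summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Salary example: X = years of experience, Y = salary in dollars
intercept, slope = regression_line(15, 5, 50_000, 5_000, 0.70)

# Predict by substituting the desired x into the equation
predicted = intercept + slope * 25
print(round(slope), round(intercept), round(predicted))
```

The same helper can be reused for any of the exercises that supply the two means, the two standard deviations, and R.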

Example
Consider a clinical trial which focused, among other responses, on white blood cell count (WBC) measured
in billions/liter. The 104 patients in the control group had WBC measured at the beginning of the study (X)
and after 3 months (Y ) at the end of the study. The following summary statistics were obtained:

X̄ = 6.595 and SDX = 1.711


Ȳ = 6.569 and SDY = 1.729
and R = 0.579

Note that the means and standard deviations are about the same for X and Y indicating that WBC is stable
from one time to the next. The scatterplot of the data is shown in the figure below along with the regression
line which (as you can check) has equation
ŷ = 2.7103 + 0.5851 ·x

Also, the vertical strip over x = 8 is drawn to focus attention on this case.


When x = 8, the predicted y, using the regression equation, is


ŷ = 2.7103 + 0.5851 (8) = 7.3911 ≈ 7.4
To assess variability (in the vertical strip), calculate RMS as
$$\text{RMS} = \sqrt{1 - 0.579^2} \cdot 1.729 = 1.4097 \approx 1.4$$

So a person with WBC of 8 at the beginning is predicted to have WBC of 7.4 at the end give or take 1.4.
There is a regression to the mean effect in evidence. If we use the up-and-down 2 RMS rule, we can state
our estimate: a person with WBC of 8 at the beginning will have WBC at the end in the range 7.4 ± 2.8 with
95% chance. Visualize this interval in the vertical strip of the figure.
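
As a rough numerical check of this example, the following Python sketch rebuilds the prediction and the 2 RMS range directly from the five summary statistics. It is an added illustration, not part of the original study; the function name `predict_with_range` is our own, and the 95% interpretation rests on the same normal-curve assumption used in the text.

```python
import math

def predict_with_range(x, mean_x, sd_x, mean_y, sd_y, r, n_rms=2):
    """Return (prediction, low, high) from the regression line and n_rms RMS errors."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    prediction = intercept + slope * x
    rms = math.sqrt(1 - r**2) * sd_y
    return prediction, prediction - n_rms * rms, prediction + n_rms * rms

# WBC example: x = 8 at the beginning of the study
pred, low, high = predict_with_range(8, 6.595, 1.711, 6.569, 1.729, 0.579)
print(f"predicted {pred:.1f}, range {low:.1f} to {high:.1f}")
```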

Example: Grades in 3660


x = first five quizzes: mean = 73.52 SD = 13.67

y = final grade: mean = 79.65 SD = 11.33

R = 0.83

The regression equation is calculated as:


$$\text{slope} = 0.83 \cdot \frac{11.33}{13.67} = 0.68792 \approx 0.69$$
$$\text{intercept} = 79.65 - (0.68792 \cdot 73.52) = 29.07394 \approx 29.07$$
$$\hat{y} = 29.07 + 0.69 \cdot x$$

$$\text{RMS} = \sqrt{1 - 0.83^2} \cdot 11.33 = 6.31946 \approx 6.32$$

If x = 87 (lowest A), predict ŷ = 29.07 + 0.69 (87) = 89.1


If x = 70 (lowest CB), predict ŷ = 29.07 + 0.69 (70) = 77.4
Question: If a student has x = 87, what is the chance he has more than 87 for a final grade?


Prediction
The regression line equation is used to model the relationship between a dependent variable y and independent
variable x. This equation can be used to predict the value of y given an observed value of x.

Example

Studies show that body mass index (BMI) and abdominal circumference are good indicators of whether you
are within a healthy weight range. Both are used in assessing your risk of cardiovascular
disease and diabetes.

Data on BMI and abdominal circumference were collected for 108 women. For this data set, the equation
of the regression line is
BMI = −1.46 + 0.35 · Abdominal Circumference

Now, we want to predict body mass index (BMI) from an observed abdominal circumference. For example,
if a woman’s abdominal circumference is 89 cm, what is her predicted BMI? Using the regression line equation, we
can predict her BMI as shown below.

BMI = −1.46 + 0.35 · AC


= −1.46 + 0.35 · 89
= 29.69

Based on this result, a woman who has an abdominal circumference of 89 cm has a predicted BMI of 29.69.


EXERCISES
1. (Hypothetical) A statewide study of all high school track members yields the following data about
their 400 meter and 800 meter race times:
                    Average        Standard Deviation
400 Meter Race      78 seconds     8 seconds
800 Meter Race      165 seconds    20 seconds

Also, an unknown but positive correlation exists between the two times.
(A) For a runner with a 400 meter time of 86 seconds, which statement best describes the regression
prediction of his 800 meter time?
(a) time > 185 seconds
(b) 165 seconds < time < 185 seconds
(c) time = 185 seconds
(d) time < 165 seconds
(e) None of the above
(B) For a runner with a 400 meter time of 66 seconds, which statement best describes the regression
prediction of her 800 meter time?
(a) 165 seconds < time < 195 seconds
(b) time < 135 seconds
(c) 135 seconds < time < 165 seconds
(d) time = 135 seconds
(e) None of the above
2. (Hypothetical) A heavy construction equipment company wants to conduct a study to get some idea
on the expected yearly sales volume for a salesperson based on the number of years of experience
the salesperson has. The separate individual data they collect yields the following:

                       Average      Standard Deviation
Years of Experience    9.5 years    4.1 years
Yearly Sales Volume    $412K        $100K

From previous studies, they are quite confident that the correlation between the two numbers is
positive, although less than 1 (i.e. 0 < R < 1), and they want a regression analysis prediction of
average yearly sales, for a given number of years of experience, to have an RMS error (standard
deviation of the prediction) less than $100K. Will they get what they want?
(a) Yes, definitely
(b) Maybe
(c) Absolutely not
3. (Hypothetical) For a group of college freshmen the correlation R between x = high school GPA and
y = college GPA is 0.3. A student with x (high school GPA) at the 70th percentile is predicted to have y
(college GPA) at the ____ percentile. (Choose the best answer to fill in the blank.)
(a) 70th
(b) 62nd
(c) 56th
(d) 54th
(e) 48th


4. (Hypothetical) A sample of U.S. farm regions surveyed during the summer of 2003 produced the following
statistics:
                                          Average    Standard Deviation
Temperature during growing season (X)     81         3
Corn Yield per acre (Y)                   131        5

If the correlation coefficient R between the two is 0.30, then:


(A) For a region having a growing season with an average temperature of 75, what is the predicted
average corn yield per acre (in bushels)?
(B) For a region having an average temperature during growing season that is in the 67th percentile,
what percentile do we predict the average corn yield per acre for the region to be in?
5. For the data set from Exercise 1 on the previous page, if both the 400 meter and 800 meter race time
data above are “mound shaped” about the average, consider all the high school runners from the
state who have a 400 meter time of 70 seconds (1 SD below the mean of 400m race time). What
proportion of these runners will have predicted 800 meter race times below 145 seconds (1 standard
deviation below the mean of 800m race time)? (Hint: Predicted 800m race times are approximately
normal with mean equal to the predicted 800m race time given the 400m race time and standard
deviation equal to RMS.)
(a) more than 1/2
(b) approximately 1/2
(c) less than 1/2
6. Under the same conditions as in Exercise 5, consider all the runners whose 400 meter time gives them a predicted
800 meter time of 165 seconds (right at the mean of the distribution of predicted 800m race times).
What proportion of these runners will have predicted 800 meter race times between 145 seconds (1
standard deviation below the mean of 800m race times) and 185 seconds (1 standard deviation above
the mean of 800m race time)? (Hint: Predicted 800m race times are approximately normal with mean
equal to the predicted 800m race time given the 400m race time and standard deviation equal to
RMS.)
(a) more than 2/3
(b) approximately 2/3
(c) less than 2/3
7. (Hypothetical) A national office supply company, which employs a large sales staff that answers and
takes phone orders to be filled, has done a study yielding a positive correlation between the number
of phone calls per week a salesperson answers and their total sales that week. From the study the
following statistics were found:
Calls Answered: X̄ = 49 and SDX = 8
Total Sales: Ȳ = $5600 and SDY = $500
and R = 0.8
(A) If a sales person answers 41 calls in one week, what is their predicted total sales for that week?
(B) A sales person answering 57 calls in a week is predicted to have $6000 in total sales for the
week. What is the probability that they will sell more than $6300?


8. (Hypothetical) A random sample of small businesses in a state yields the following statistics:
Average advertising expense per month ($1k): X̄ = 5.2 and SDX = 1.2
Total sales per month ($1k): Ȳ = 102 and SDY = 14
Correlation: R = 0.45
(Assume average monthly advertising expense and total sales per month are both normally distributed.)
(A) What is the mean predicted total sales in a month when the month’s advertising expenditure is
$6000?
(B) What are the 30th and 75th percentiles in monthly advertising expenditure and to what percentiles
in monthly sales do they correspond?
(C) What is the RMS for the total sales per month by regression?
(D) Of the total monthly sales data for all small businesses in the state with an average expenditure
of $6k per month on advertising, what percentage of this do you predict to be less than $102k
in total sales per month?
9. (Hypothetical) The following information was obtained from a study of 65 female patients:
Body Temperature (F): X̄ = 98.5 and SDX = 0.75
Heart Rate: Ȳ = 74 and SDY = 9
Correlation R = 0.5
(A) For a body temperature of 98 F, what heart rate is predicted by regression?
(B) What temperature is the 85th percentile in body temperature?
(C) What percentile in heart rate will the 85th percentile in body temperature predict?
(D) What is the heart rate which corresponds to the percentile you found in (C)?
(E) What is the RMS of the regression predictions?
(F) If a female patient has a body temperature of 100 F, and hence a predicted heart rate of 83, what
is the chance she will have a heart rate less than 87?
(G) What is the equation of the regression line for predicting heart rate from body temperature?
10. Recall the data set from exercise 1 above. Find the equation for the regression line that predicts
the time for the 800 meter race given the time of the 400 meter race. Assume correlation coefficient
between 400m race times and 800m race times is 0.8.
11. Suppose for a paired data set (X, Y ) the regression equation for predicting Y given X is

Y = 15 + 6 ·X with RMS = 2.5

What is the chance of having a y value greater than 50 for x = 5?

REVIEW EXERCISES
12. (True or False?) If the slope of the regression line for predicting Y for a given X is 4, then the slope of
the regression line for predicting X for a given Y will be 1/4. Explain.
13. If the correlation between X and Y is zero, then the equation for the regression line will give the same
predicted value of Y for every X. (True or False?) Explain.
14. Suppose a paired data set (X, Y ) yields the following results:
X̄ = 6 SDX = 2
Ȳ = 11 SDY = 3
R = −0.4
Predict the value of Y , using regression, for x = 4.5.


15. For a paired data set (X, Y ) with R = 0.7, what percentile in Y is predicted for x in 30th percentile? (Note
that you have all the necessary information.)
16. Suppose we have a data set with X̄ = 20, SDX = 5, Ȳ = 20, SDY = 5, but we didn’t know the exact
correlation between X and Y , only that it was positive i.e. 0 < R < 1. For x = 14, in what range will the
regression prediction of Y be? Briefly explain your answer.
(a) 14 < y predicted < 20
(b) y predicted > 20
(c) y predicted < 14
(d) 13 < y predicted < 14
17. (Hypothetical) At the Super Bowl in New Orleans, LA, a huge Mardi Gras party was thrown. A study
was made where people were given a breathalyzer and reaction time test before entering and upon
leaving. The difference in the two blood alcohol levels, before and after, and % decrease in reaction
times for each person was recorded with the following results:
Blood Alcohol Content (BAC) increase (in mgm/100ml): X̄ = 30 SDX = 11
% Decrease in reaction time: Ȳ = 40 SDY = 10
Correlation R = 0.8
(A) For a BAC increase of 41 mgm/100ml, what is the regression prediction for the percent decrease
in reaction time?
(B) If a person has the 67th percentile in the BAC increase, what percentile are they predicted to
have in percent decrease in reaction time? (You are only asked to find the predicted percentile,
not the reaction time decrease percent to which it corresponds.)
(C) For a person having a BAC increase of 41 (i.e. X-value), what is the chance of them having a
percent reaction time decrease (i.e. Y -value) of 55 or better?
18. (Hypothetical) A study of yearly salary and monthly rent/mortgage payments in a mid-sized city in
the Midwest yields the following results:
Salary: X̄ = $50k and SDX = $10k
Rent/Mortgage Payment: Ȳ = $750 and SDY = $100
Correlation: R = 0.8
(A) For a person with a salary = $65k, what monthly rent/mortgage payment would we predict using
regression?
(B) For a person having a salary of $45k, thus having a predicted monthly rent/mortgage payment of
$710, what is the chance that their actual monthly rent/mortgage payment is greater than $750?
(C) What is the regression equation for predicting the rent/mortgage payment given salary?
19. (Hypothetical) The following information was obtained from a study of car maintenance:
Age of a car (in years): X̄ = 8 and SDX = 3
Cost in annual maintenance/repairs($): Ȳ = 190 and SDY = 60
Correlation: R = 0.8
Assume X and Y are approximately normally distributed.
(A) For a car that is nine (9) years old, what annual maintenance/repair cost is predicted?
(B) For a car that is five (5) years old, thus having a predicted annual maintenance/repair cost of
$142, what is the chance that it will have an actual annual maintenance/repair cost of less than
$100?
(C) What is the regression equation for predicting the annual maintenance/repair cost given age of
the car?


20. In a study in 1998, the following equation was obtained for predicting baby birth weight in grams (Y ) given mother’s
age in years (X):

Y = -1163.45 + 245.15 ·X

If RMS = 589.3 grams, what is the chance that a 19 year old mother will give birth to a baby weighing
more than 4000 grams?
21. In a set of pairs of data, the value of Y for a certain pair is recorded as 100 instead of 10. The
correlation coefficient was found to be positive. Due to this error,
(a) the slope of the regression line will increase
(b) the slope of the regression line will decrease
(c) the slope of the regression line will stay the same.
22. Which of the following statement(s) about regression is(are) TRUE?
I. A regression equation with a negative slope results in a lower predicted value of Y compared to
a positive slope.
II. The closer the actual value is to the regression line, the higher the residual would be.
(a.) I only
(b.) II only
(c.) Both I and II
(d.) Neither I nor II
23. Suppose we consider a data set for X and Y. If the standard deviations of X and Y are equal, and the
correlation coefficient between X and Y is 0.8, which of the following statements is incorrect?
(a.) We can use the given information to predict the value of Y for a given value of X.
(b.) Using the given information, we can write the equation of the regression line.
(c.) We can find the percentile rank for Y given the percentile rank of X using the given information.
(d.) All of the above statements are incorrect since X̄ and Ȳ are unknown.

Chapter 10

Box Models For a Population

Statistics is concerned with populations of numbers, either real or hypothetical. Any finite population of
numbers in the real world can be represented by a box of numbers. This can be done as follows. For each
number in the population, a chip with this value on it is placed in the box. The same number could appear
more than once. Clearly, the distribution of numbers in the box is the same as the distribution of numbers
in the population.
We’ll start with numeric variables. For a class of 10 students, consider their ages. Suppose that 5 are 18, 3
are 19, 1 is 20, and 1 is 22. The following box of numbers describes these ages.
[ 18 18 18 18 18 19 19 19 20 22 ]
Drawing one number at random from this box is the same as encountering one of the students at random
and asking their age.
A box model can be used to represent the outcome of a physical experiment. Roll a die and record the
number of dots. Equivalently, draw one number from the following box.
[ 1 2 3 4 5 6 ]
Drawing 50 numbers at random from the box, with replacement, is equivalent to 50 rolls of the die.

0 – 1 Box Models
We can also use a box model to represent a categorical variable. Changing our view, suppose we are
interested in the number of “aces” in 5 rolls of the die. In terms of a box model, this is like the sum of 5
numbers drawn from the following box.
[ 1 0 0 0 0 0 ]
A box model can be used to describe a classification interest. In a class of 10 students, suppose there are
3 males and 7 females. The following box model describes this situation (0 = male, 1 = female).
[ 0 0 0 1 1 1 1 1 1 1 ]
The sex of one student selected at random can be modeled by one draw from the box.


Chance Events
Boxes can be used to model the outcomes of chance events. Suppose there are three traffic lights on your
way to work so that the number of times you have to stop for a light is one of the values 0, 1, 2, 3. Based on
your experience, you believe these values occur in frequencies of 10%, 30%, 50%, and 10%, respectively.
The following box describes the outcome for one trip to work. Note the duplication to reflect the differing
chances.
[ 0 1 1 1 2 2 2 2 2 3 ]
To model a week’s experience we could draw 5 numbers at random from the box with replacement.
As another example, let’s consider insurance. It costs you $60 to buy an insurance policy for your bike. The
policy pays $500 if your bike is lost, an event which you estimate could occur with chance 1 in 100. A box
model can be used to represent this situation. If your bike is lost you get $500 − $60 = $440. This happens
with chance 1 in 100. With chance 99 in 100, your bike is not lost and you lose $60 (the policy premium).
One draw from the following box of 100 chips models the situation from a financial point of view.
[ one +$440, ninety-nine −$60 ]
From the point of view of the insurance company, the overall return from one such bike policy is like one
draw at random from the box.
[ one −$440, ninety-nine +$60 ]
If the company sells 1000 such policies for a given year, its earnings will be like the SUM of 1000 draws at
random, with replacement, from the box.
The bottom line for these considerations and examples, is that real events in the world around us have
chance influences that can be artificially modeled by draws from a box of numbers. If we can understand
the behavior of random draws from a box, this knowledge can be transferred to the real situation.

Sampling Distribution of a Sum


A box of numbers can be described graphically by a histogram (use a density scale for the vertical axis so
that the total area is 1). It is also described numerically by parameters such as its mean, µ, and standard
deviation, σ (Greek symbols are commonly used in statistics for these parameters, but “mean” and “SD”
can be used too).

EXAMPLE
Consider the box of 10 numbers.
[ 1 1 2 2 2 2 2 3 3 3 ]
Its histogram is below and its mean and standard deviation are
µ = 2.1 and σ = 0.7 (check the calculations)


Suppose a random sample of numbers, say n of them, are drawn with replacement from the box. Consider
the sum of the numbers drawn. We cannot know this sum ahead of time; the result is due to chance. We
say the result is a random variable. Still, it is possible to list the possible values for the sum and for each
of these values determine the chance that it might arise. This information, possible values and chance of
occurrence, is called the chance distribution or the sampling distribution of the sum. This information may
be determined by a brute force listing of possibilities or by clever mathematical techniques.
A conceptual view of our interest here:

• Start with some box [ * * * ]


• Draw n times, with replacement, from the box and calculate the sum
• Mark this value of sum on a number axis
• Do another n draws, get the sum and mark this value on the axis
• Continue in this manner, marking each sum on the axis, do it many times
• The collection of sums obtained is a record of how the sum behaves
• Describe this collection of sums with a smooth curve (the distribution of sum)

See the figures next that illustrate this process.


EXAMPLE
Consider drawing a sample of size 2, with replacement, from the following box.
[ 1 2 4 5 ]
Taking a listing approach, we will list all 16 possible samples and the corresponding value of the sum.
1+1=2 1+2=3 1+4=5 1+5=6
2+1=3 2+2=4 2+4=6 2+5=7
4+1=5 4+2=6 4+4=8 4+5=9
5+1=6 5+2=7 5+4=9 5 + 5 = 10
Collecting the 16 values of the sum into a new box for the sum result, the contents are
[ 2 3 3 4 5 5 6 6 6 6 7 7 8 9 9 10 ] values for sum
and this box is represented as follows.
Sum value     2     3     4     5     6     7     8     9     10
Proportion   1/16  2/16  1/16  2/16  4/16  2/16  1/16  2/16  1/16
This information on values and proportions is called the sampling distribution of the sum. The behavior
of the sum is modeled by one draw from this box. The following figure is a probability histogram of the
distribution of this sum.
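
The listing approach is easy to automate for small boxes. The Python sketch below is an added illustration (the function name `sum_distribution` is our own): it enumerates every possible sample with `itertools.product`, counts how often each sum occurs, and reports the proportions as exact fractions. Fractions are printed in lowest terms, so 4/16 appears as 1/4.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(box, n):
    """Exact sampling distribution of the sum of n draws, with replacement, from box."""
    counts = Counter(sum(sample) for sample in product(box, repeat=n))
    total = len(box) ** n  # number of equally likely samples
    return {value: Fraction(count, total) for value, count in sorted(counts.items())}

dist = sum_distribution([1, 2, 4, 5], 2)
for value, proportion in dist.items():
    print(value, proportion)
```

Brute-force listing grows quickly (a box of 6 numbers drawn 10 times already gives over 60 million samples), which is one reason shortcut formulas are useful.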


Definition of EV
The expected value (EV) of the sum is the mean of the sampling distribution of the sum. There is a shortcut
formula for the result:
EV = nµ
where n is the sample size and µ is the mean of the box.

In our example above, the box for the SUM of two draws has mean
$$EV = \frac{2 + 3 + 3 + 4 + \cdots}{16} = 6$$
(there are 16 terms in the sum)

Or, with the shortcut,
$$EV = n\mu = 2 \cdot 3 = 6$$
The expected value measures the center of the sampling distribution of the sum. The variability or spread
of this sampling distribution is measured by the standard error (SE).

Definition of SE
The standard error (SE) of the sum is the standard deviation of the sampling distribution of the sum. Thus,
it measures variability for the sum. A shortcut formula is

$$SE = \sqrt{n} \cdot \sigma$$

where n is the sample size and σ is the standard deviation of the box. In our example, the box for the sum
of two draws has standard deviation of
$$SE = \sqrt{\frac{(2 - 6)^2 + (3 - 6)^2 + \cdots}{16}} = 2.236$$
(there are 16 terms in the sum)
Using the shortcut formula, first we compute
$$\sigma = \sqrt{\frac{(1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2}{4}} = 1.581$$
then
$$SE = \sqrt{2} \cdot 1.581 = 2.236$$

To summarize, we start with a box.


[ * * * * ] mean µ and standard deviation σ
and specify n draws with replacement. Then the sum has a sampling distribution with

$$EV = n\mu$$

$$SE = \sqrt{n} \cdot \sigma$$
The box is centered at its mean µ with a spread σ that measures how far the numbers are from the mean.
In a similar manner, the sampling distribution of the sum is centered at the expected value with a spread,
standard error, that measures how far the sum values are from the expected value. The standard error
is interpreted just like a “regular” standard deviation. For example, it is rare for a SUM to differ from its
expected value by more than 2 or 3 standard errors.
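
The two shortcut formulas can be compared against a brute-force calculation. The following Python sketch is an illustration under the assumptions of this chapter (draws with replacement, and the box SD computed by dividing by the number of chips, which is what `statistics.pstdev` does); the function name `ev_se_of_sum` is our own.

```python
import math
from itertools import product
from statistics import fmean, pstdev

def ev_se_of_sum(box, n):
    """Shortcut formulas: EV = n * mean of box, SE = sqrt(n) * SD of box."""
    mu = fmean(box)
    sigma = pstdev(box)  # population SD (divide by the number of chips)
    return n * mu, math.sqrt(n) * sigma

box = [1, 2, 4, 5]
ev, se = ev_se_of_sum(box, 2)

# Brute force: list all 16 possible sums of two draws and summarize them directly
all_sums = [a + b for a, b in product(box, repeat=2)]
print(round(ev, 3), round(se, 3))
print(round(fmean(all_sums), 3), round(pstdev(all_sums), 3))
```

Both lines of output should agree, since the shortcut and the full listing describe the same sampling distribution.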


EXAMPLE
Consider tossing a die 10 times. This is modeled by 10 draws from the box
[ 1 2 3 4 5 6 ]
This box has a mean of µ = 3.5 and a standard deviation of σ = 1.708. The sum of the 10 draws has an
expected value of
$$EV = 10 \cdot 3.5 = 35$$
and a standard error of
$$SE = \sqrt{10} \cdot 1.708 = 5.401$$
We use the language “the sum will be 35, give or take 5.4”.

EXAMPLE
Consider the sum of n = 100 draws from the box [ 1 3 3 3 ].
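The box has mean
$$\mu = \frac{1 + 3 + 3 + 3}{4} = 2.5$$
and
$$SD = \sigma = \sqrt{\frac{(1 - 2.5)^2 + 3 \cdot (3 - 2.5)^2}{4}} = \sqrt{\frac{2.25 + 3 \cdot 0.25}{4}} = \sqrt{\frac{3}{4}} = 0.866$$
So the sum has
$$EV = 100 \cdot 2.5 = 250$$
$$SE = \sqrt{100} \cdot 0.866 = 8.66$$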
The sum will be 250, give or take about 9.

EXAMPLE
Consider a simulation study on the sum of 10 draws from the box
[ 1 1 2 2 2 2 2 3 3 3 ]
As noted earlier, the mean and standard deviation of this box are µ = 2.1 and σ = 0.7. The sum of 10 draws
from the box would have
$$EV = 10 \cdot 2.1 = 21$$
$$SE = \sqrt{10} \cdot 0.7 = 2.21$$
The following figures show the results of 100 replications of the sum of 10 draws. Below is a plot of the sum
values and below that a histogram of the sum values. Note that the theoretical values for expected value
and standard error seem about right for the histogram.


EXAMPLE
Consider a simulation study on the sum of 25 draws from the box
[ 1 2 3 4 5 6 ]
As noted earlier, the box has mean µ = 3.5 and a standard deviation σ = 1.708. The sum of 25 draws from
the box would have
$$EV = 25 \cdot 3.5 = 87.5$$
$$SE = \sqrt{25} \cdot 1.708 = 8.54$$
Let’s repeat the sum of 25 draws for 200 replications. The 200 replications of the sum of 25 draws are listed
as:
91 80 100 93 84 90 84 88 84 83 88 102 97 88 78
85 98 104 68 84 103 89 99 93 86 86 99 99 100 89
83 73 94 87 81 105 70 78 95 87 89 90 78 99 89
90 84 97 99 81 89 76 88 87 92 96 82 96 79 85
90 76 84 76 100 77 87 99 99 80 93 76 93 102
84 92 88 92 80 100 84 86 88 88 82 107 86 85
84 85 80 97 91 85 81 80 89 88 88 84 99 83
95 94 91 101 86 78 91 100 99 102 78 89 87 88
81 83 77 93 81 86 76 97 94 101 91 90 75 108
94 78 84 88 87 91 97 91 92 99 79 99 82 94
91 103 100 90 86 85 84 85 76 98 102 91 77 87
87 98 92 82 93 91 97 74 90 73 79 94 80 75
71 89 86 94 90 89 89 88 86 90 94 91 71 79
71 92 87 83 95 95 92 83 86 96 81 77 78 101

The figures below show graphs of the results on the sum of 25 draws. At left is a plot of the sum values
and at right is a histogram of the sum values. Note that the theoretical values for the expected value and
standard error seem about right for the histogram.
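
A simulation of this kind can be reproduced with a few lines of Python. The sketch below is an assumption-laden illustration, not the program used to produce the numbers listed above: the function name `simulate_sums` and the seed are our own choices, so the individual sums will differ from the listing, but their average and SD should land near the theoretical EV and SE.

```python
import random
from statistics import fmean, pstdev

def simulate_sums(box, n_draws, replications, seed=0):
    """Repeat `replications` times: draw n_draws chips with replacement and sum them."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.choices(box, k=n_draws)) for _ in range(replications)]

sums = simulate_sums([1, 2, 3, 4, 5, 6], n_draws=25, replications=200)

# Compare with the theory: EV = 87.5 and SE = 8.54
print(round(fmean(sums), 1), round(pstdev(sums), 1))
```

Changing `box`, `n_draws`, and `replications` gives the corresponding simulation for the other examples in this chapter.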


EXAMPLE
Let’s revisit an earlier example. It costs you $60 to buy an insurance policy for your bike. The policy pays
$500 if your bike is lost, an event which you estimate could occur with chance 1 in 100. A box model can be
used to represent this situation. If your bike is lost you get $500 − $60 = $440. This happens with chance
1 in 100. With chance 99 in 100 your bike is not lost and you lose $60 (the policy premium). One draw from
the following box of 100 chips models the situation from a financial point of view.
[ one +$440, ninety-nine −$60 ]
The expected value of one draw from the box is the mean of the box, which is
$$\frac{440 + 99 \cdot (-60)}{100} = -\$55$$
So on the average you expect to lose $55 with this policy. This doesn’t seem too wise, but remember this
is the price you pay for protection from an even larger loss.
From the company’s point of view, they expect to gain $55 from this one policy. Suppose they sell 10,000
such policies. The result of one policy is like the result of one draw from the box
[ one −$440, ninety-nine +$60 ]
This box has mean µ = $55 and standard deviation of
$$\sigma = \sqrt{\frac{(-440 - 55)^2 + 99 \cdot (60 - 55)^2}{100}} = \$49.75$$
The income for the company from these 10,000 policies is random, but it is like the sum of 10,000 random
draws with replacement from the box. This sum has
$$EV = 10000 \cdot 55 = \$550{,}000$$
and
$$SE = \sqrt{10000} \cdot 49.75 = \$4975$$
Thus, they would expect to gain $550,000 in the year, give or take $4975. Notice how little variation there is
relative to the size of the expected value.
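
For a box described by a few distinct values and their counts, the mean and SD can be computed without writing out all 100 chips. The Python sketch below illustrates this for the company's box; the helper name `box_mean_sd` and the (value, count) representation are our own conventions.

```python
import math

def box_mean_sd(values_with_counts):
    """Mean and SD of a box given as (value, count) pairs."""
    total = sum(count for _, count in values_with_counts)
    mean = sum(value * count for value, count in values_with_counts) / total
    variance = sum(count * (value - mean) ** 2 for value, count in values_with_counts) / total
    return mean, math.sqrt(variance)

# Company's box: one chip of -$440 and ninety-nine chips of +$60
mu, sigma = box_mean_sd([(-440, 1), (60, 99)])

n = 10_000  # number of policies sold
ev = n * mu
se = math.sqrt(n) * sigma
print(round(mu, 2), round(sigma, 2), round(ev), round(se))
```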


EXAMPLE
At a roulette wheel, suppose you bet $1 on red each time. There are 18 red slots out of 38 slots. An
appropriate box for your winnings on one play is
[ eighteen +$1, twenty −$1 ]
That is, there are 18 outcomes where we win $1 and 20 outcomes where we lose $1. One spin of the wheel
is like one draw from the box. This box has mean of µ = -0.0526 and standard deviation of σ = 0.9986. If
you played 100 times your winnings (sum) have an expected value of -$5.26 and standard error $9.986. So
you expect to lose $5.26, give or take $9.99. With 1000 plays of the game, your winnings have expected
value −$52.6 and standard error $31.58. The following figure shows the results of a simulation study on the
winnings on this bet for 100 plays. It is a histogram for the sum of 100 draws from the box (your winnings
in 100 plays at this bet). There were 500 replications. Note the agreement of the simulation results with our
theory about center (expected value) and spread (standard error).
What is the chance of coming out ahead with this bet, that is, winnings greater than 0? Check this out by
the areas of the rectangles above 0.

0 – 1 Boxes
A special case is worth noting. This is the case of a 0 – 1 box, where the numbers in the box are either 0 or
1. Suppose the proportion of 1’s in the box is p. A view of the box is:
[ 0 – 1 values, p = proportion of 1’s ]
Then it is easy to show that the box has mean µ = p and standard deviation $\sigma = \sqrt{p \cdot (1 - p)}$. From our
previous results it follows that the sum of n draws from a 0 – 1 box has

$$EV = n \cdot p$$
and
$$SE = \sqrt{n \cdot p \cdot (1 - p)}$$
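
The claim that a 0 – 1 box has standard deviation equal to the square root of p · (1 − p) can be spot-checked on a concrete box. The Python sketch below is an added illustration using the “aces” box from earlier in the chapter; the function name `zero_one_ev_se` is our own.

```python
import math
from statistics import pstdev

def zero_one_ev_se(n, p):
    """EV and SE for the sum of n draws from a 0-1 box with proportion p of 1's."""
    return n * p, math.sqrt(n * p * (1 - p))

box = [1, 0, 0, 0, 0, 0]      # one "ace" among the six faces of a die
p = sum(box) / len(box)       # proportion of 1's in the box

shortcut_sd = math.sqrt(p * (1 - p))
direct_sd = pstdev(box)       # SD computed directly from the chips
print(round(shortcut_sd, 4), round(direct_sd, 4))

ev, se = zero_one_ev_se(5, p)  # number of aces in 5 rolls
print(round(ev, 3), round(se, 3))
```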


Further, by the Central Limit Theorem, the sum has a sampling distribution that is normal in shape for larger
sample sizes. These three results enable us to predict quite well the number of 1’s that may be drawn from
a 0 – 1 box using our previous methods.

EXAMPLE
Toss a coin 1000 times. How many heads will occur?
The toss of one coin can be modeled with the box
[ 0 1 ]
where 1 means heads, 0 means tails, and thus p = 0.5. Our model is completed by considering the sum of
1000 draws from the box (the number of heads in 1000 tosses). The box has mean µ = 0.5 and standard
deviation σ = √(0.5 · 0.5) = 0.5. The sum of 1000 draws has expected value EV = 1000 · 0.5 = 500 and standard
error SE = √(1000 · 0.5 · 0.5) = 15.8. The sum will be 500, give or take 15.8 (about a 68% chance). Or, using 2 standard
errors, there is about a 95% chance the sum is in 500 ± 31.6.
What is the chance the number of heads is more than 520? For this we will use the normal approximation.
Converting to the Z scale,
Z = (520 − 500) / 15.8 = 1.27
The area below 1.27 for the Z curve from the table is 0.898. So the area above 1.27 is 1 − 0.898 = 0.102. In
summary: the chance of more than 520 heads in 1000 tosses of the coin is 0.102 or 10.2%.

EXAMPLE
Suppose that 80% of the transistors made at a plant are high quality. For an order of 500 such transistors,
what is the chance that fewer than 375 of them are of high quality? To answer this, first make a box to model
one transistor:
[ eight 1’s and two 0’s ] with p = 0.8
Now the sum of 500 draws (modeling the 500 transistors in the order) has expected value EV = 500 · 0.8 =
400 and standard error SE = √(500 · 0.8 · 0.2) = 8.94. For the chance that the sum is less than 375, convert to the Z
scale:
Z = (375 − 400) / 8.94 = −2.80
The Z curve area below −2.80 equals the area above 2.80, which is 1 − 0.997 = 0.003. In summary: the chance
of fewer than 375 high-quality transistors is 0.003, or 0.3%, a very small chance.
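Both examples follow the same recipe, which can be sketched in Python with the standard library's `statistics.NormalDist` standing in for the Z table. The function name is our own. Because the code does not round Z to two decimals before finding the area, its answers may differ slightly in the third decimal place from the table-based values above.

```python
import math
from statistics import NormalDist

def chance_sum_below(value, n, p):
    """Normal-approximation chance that the number of 1's in n draws
    from a 0-1 box with proportion p falls below `value`."""
    ev = n * p
    se = math.sqrt(n * p * (1 - p))
    z = (value - ev) / se
    return NormalDist().cdf(z)   # area under the Z curve below z

# Coin example: chance of more than 520 heads in 1000 tosses.
coin_above_520 = 1 - chance_sum_below(520, n=1000, p=0.5)

# Transistor example: chance of fewer than 375 high-quality parts out of 500.
transistors_below_375 = chance_sum_below(375, n=500, p=0.8)

print(f"{coin_above_520:.3f}")
print(f"{transistors_below_375:.4f}")
```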

EXERCISES
1. In the following two problems, determine the sampling distribution of the sum when sample size is
n = 10, 100, 1000. Give the expected value and standard error in each case.
(A.) Roll a die. The box is [ 1 2 3 4 5 6 ]. The mean is 3.5 and the standard deviation is 1.71.
(B.) Use the box which is skewed [ 1 1 1 1 2 2 2 3 3 4 ]. The mean is 2 and the standard deviation is
1.
2. History shows that in a certain hospital, 55% of births are boys. What is the chance
of having fewer than 60 boys in a sample of 100 births?


3. A 50 question multiple choice test has 4 choices for each question. What is an approximate 95%
range for the score of someone who randomly guesses at each question?
4. A medical procedure has a 70% success rate. How many times will the procedure succeed out of 100
patients? Give a 95% range.
5. For the 2016-2017 Regular season of the NBA, Stephen Curry made 90% of his free throws per game.
Using this number to describe his performance in a game next season,
(A) What is the expected number of free throws Curry will make in 15 free throw attempts?
(B) What is the chance that he will make more than 14 free throws in 15 attempts?
6. In a particular neighborhood in Kalamazoo, when trick-or-treating on Halloween, the number of pieces
of candy received at each house is distributed as follows: 0 pieces: 10%, 1 piece: 10%, 2 pieces: 50%,
3 pieces: 30%.
(A) Make a box model for number of pieces of candy received per house in this neighborhood.
(B) Find the average and standard deviation of the box model in part (A).
(C) If you were to trick-or-treat at 100 houses in this neighborhood, what would be the expected
value (EV) of your total candy received, and what would be the standard error (SE) of this total?
(D) Assume the sum of the number of pieces of candy received is normally distributed. What is the
probability that you will receive less than 185 pieces of candy if you were to trick-or-treat at 100
houses?
7. Suppose the manager of a local auto service center gets a bonus of a weekend at a resort hotel for
having half-year (26 weeks) sales of $395k or better. He estimates from the previous couple of
years’ records that his average weekly sales are $14.5k with a standard deviation of $2.25k. What is
his chance of getting to take his wife to the resort? For the sake of this problem (though probably not terribly
realistic), assume that a half year (26 weeks) can be considered as any 26 weeks drawn at random
from the distribution of weekly sales (i.e., 26 draws from “the box”). Use the normal approximation.

REVIEW EXERCISES
8. Consider a sample (n = 50) which can be modeled as 50 draws from the box [ 1 1 2 2 2 4 ]. The mean
is 2 and the standard deviation is 1.
(A) Find the expected value and standard error of the sum.
(B) Using a normal approximation of the sum, what is the chance that the sum will be less than 90?
9. You are a nurse at a community health clinic, and your clinic sees, on average, about 25 patients per
day. On any given day, the number of patients needing stitches is distributed as follows: 0 patients
10% of the time, 1 patient 10% of the time, 2 patients 20% of the time, 3 patients 20% of the time, 4
patients 20% of the time, 5 patients 10% of the time, and 6 patients 10% of the time. Each of these
patients will need a special “laceration” package used for sterilizing and stitching a wound.
(A) Make a box model for the distribution of how many laceration packs you will need each day.
What are the mean and SD of the box?
(B) In a month’s time (30 days), how many “laceration” packages do you expect you’ll need? What
about in two months (60 days)? Four months (120 days)? (Your answer should be of the form
EV, “give or take” SE.)
(C) The percent your SE is of your EV is found by looking at the ratio SE / EV. Find this ratio for what
you found above for one, two, and four months. Compare these results. What do you notice?
(D) Using a normal approximation to the sums in (A) and (B), what is the chance of using more than
100 laceration packs in a month?


10. (Hypothetical) Suppose it is known that 80% of all cyclists who use the Kal-Haven trail wear helmets.
(A) What is the box model for the population proportion of riders on the Kal-Haven Trail who wear
helmets? What is the average of the Box and the SD of the Box?
(B) If a random sample of 64 riders was taken on a given day, what is the chance that more than 55
will be wearing a helmet?
11. Consider a true or false test with 100 questions where your score is 1 point for each correct answer.
Since you have studied, you have a 70% chance of getting the right answer on each question.
(A) Make a box model for your “score gain” for each question. Find the mean and SD of the box.
(B) What do you expect your score to be? Again, your answer should be EV, “give or take” SE.
(C) What is the chance of getting less than 75 correct answers on the true and false exam?
12. Suppose you flipped two “fair” coins and wanted to measure the number of heads. Clearly this will
be a random number. Which of the following would be a model for this measurement?
(A) One draw from the box [ 0 1 2 ]
(B) Two draws from the box [ 0 1 ]
(C) One draw from the box [ 0 0 1 1 2 2 ]
13. (TRUE or FALSE) At a certain university, the number of days a teaching assistant takes a leave of
absence is at most 2. A researcher observed a random sample of TA’s and found the distribution of
the number of days of leave as follows: 0 days: 40%, 1 day: 30%, 2 days: 30%. If we take two random samples of 60
TA’s each and note the number of days of leave they take, we expect the sum of the number of days of leave for
each group to be the same.
14. Suppose the proportion of female students at WMU is 0.52. If it is of interest to determine the number
of MALE students in the sample, which of these statements is incorrect?
(a) EV(Male) = EV(Female), SE(Male) = SE(Female)
(b) EV(Male) ≠ EV(Female), SE(Male) = SE(Female)
(c) EV(Male) = EV(Female), SE(Male) ≠ SE(Female)
(d) EV(Male) ≠ EV(Female), SE(Male) ≠ SE(Female)

Chapter 11

Sampling Distribution of a Sample Proportion

Start with a 0 – 1 box. Let p be the proportion of 1’s in the box. The box is
[ 1 1 1 . . . 0 0 0 . . . p = proportion of 1’s ]
For this box, the mean is µ = p and the standard deviation is σ = √(p · (1 − p)). Now consider drawing n
numbers at random, with replacement, from the box. Let p̂ be the proportion of 1’s in the sample.

p̂ = (number of 1’s drawn) / n (note: this is SUM / n)
The sample proportion p̂ is a random variable and its behavior can be described by a sampling distribution.
When a sample is actually drawn, the sample proportion p̂ will differ from the population proportion p , but,
by how much? The difference p̂ − p is a chance error and it is important to know its size. The sampling
distribution of p̂ can readily be deduced from that of a sum and the result is that p̂ has
EV = p
SE = √(p · (1 − p) / n)
Further, this sampling distribution is approximately normal in shape (provided the sample size is not too
small, say n greater than 30, and the box proportion is not too near the extremes of 0 or 1).

Example
As a very simple starter example, consider tossing a coin. This is represented by the following box model.
[ 1 0 ]
Here, p = 0.5. If you toss a coin 100 times, what proportion of the time would you expect to get a head? If
we calculate the expected value using the formula above, we have

EV = p = 0.5
or we expect to get a head 50% of the time. But even though we expect this to happen, will there be exactly
this proportion of heads every time a coin is tossed 100 times? Not necessarily, so we need a measure
of how far we are likely to miss the expected value, which is the standard error. Here SE = √(0.5 · 0.5 / 100) = 0.05,
so we expect the proportion of heads to be 0.5, give or take 0.05.


Example
A production process makes parts with a defect rate of 10%. Suppose we sample 100 of these parts for
inspection. This is modeled by 100 draws from the box
[ 1 0 0 0 0 0 0 0 0 0 ]
The box has p = 0.10, mean µ = 0.10 and SD σ = √(0.1 · (1 − 0.1)) = 0.3.

Then, using the above formulas, the sampling distribution of p̂ is approximately normal with

EV = p = 0.1
SE = √(0.10 · 0.90 / 100) = 0.03
That is, if we draw a random sample of size 100, we expect that the sample proportion p̂ will be 0.10, give
or take 0.03. By the empirical rule, we are 95% sure that p̂ will be in the interval 0.10 ± 0.06 (up and down 2
standard errors). Note however, that these results would apply for the proportion of defective parts found
in the sample of 100 parts.
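A minimal Python sketch of this calculation follows; the function name `proportion_ev_se` is an assumption made for illustration.

```python
import math

def proportion_ev_se(p, n):
    """EV and SE of the sample proportion for n draws from a 0-1 box."""
    return p, math.sqrt(p * (1 - p) / n)

ev, se = proportion_ev_se(p=0.10, n=100)

# Rough 95% range using the empirical rule (2 standard errors each way).
low, high = ev - 2 * se, ev + 2 * se

print(f"EV = {ev:.2f}, SE = {se:.2f}")
print(f"about 95% of samples give a proportion between {low:.2f} and {high:.2f}")
```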

Example
Suppose an eye disease occurs in an elderly population of people at a rate of 1 in 1000. This is a proportion
of 1/1000 = 0.001. Say we are focusing on a specific population of 50,000 elderly people and want to assess
how many of them may have this disease. A simple proportion calculation gives us 50,000 · 0.001 = 50.
So as a first guess we would say that 50 of these people would have the disease. But, we realize that the
occurrence of the disease is random and the actual count may be more or less than 50 to some degree.
How much variation should we expect?
A box model can help us assess the situation. Consider a 0 – 1 box (with 1 corresponding to “disease”)
having proportion of 1’s p = 0.001. Consider 50,000 draws from this box and the proportion of 1’s drawn:
p̂ = (number of 1’s) / 50,000
We seek to predict the value of p̂ with high confidence. The answer we get translates directly to a statement
about the prevalence of the disease in our population of 50,000 people. Using our theory for a sample
proportion, p̂ has approximately normal distribution with
EV = p = 0.001
SE = √(0.001 · 0.999 / 50,000) = 0.000141
This information can be used in several ways. For instance, there is about a 95% chance that p̂ is in EV
± 2 · SE, which is 0.001 ± 0.000282. These values are too small to interpret easily, so it may be helpful to
express them in terms of actual numbers of people by multiplying by 50,000. Then the number of people in
the population having the disease will be in 50 ± 14.1, which is 36 to 64 (still with 95% chance).
What is the chance the population has less than 40 people with the disease? To answer this, note that
40/50,000 is a proportion of 0.0008, so we want the chance that, when drawing from the box, p̂ is less than
0.0008. Convert this value to the Z scale using the formula
Z = (value − EV) / SE = (0.0008 − 0.001) / 0.000141 = −1.42


By the symmetry of the Z curve, the area below −1.42 is the same as the area above 1.42. To get this
area, note that from the Z table, the area below 1.42 is 0.922. Subtracting this from 1, we get 1 − 0.922 =
0.078 or 7.8%. In summary, there is about a 7.8% chance there will be less than 40 people with the disease.
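The disease calculation can be retraced in Python. The variable names are ours, and `NormalDist` replaces the Z table, so the tail area is computed from the unrounded Z value and may differ from the table answer in the third decimal place.

```python
import math
from statistics import NormalDist

p = 0.001          # disease rate (proportion of 1's in the box)
n = 50_000         # number of people, modeled as draws from the box

ev = p
se = math.sqrt(p * (1 - p) / n)

# 95% range on the proportion scale, then converted to a count of people.
low_count = (ev - 2 * se) * n
high_count = (ev + 2 * se) * n

# Chance that fewer than 40 people (a proportion of 40/50,000) have the disease.
z = (40 / n - ev) / se
chance_below_40 = NormalDist().cdf(z)

print(f"SE = {se:.6f}")
print(f"95% range for the count: {low_count:.1f} to {high_count:.1f}")
print(f"Z = {z:.2f}, chance = {chance_below_40:.3f}")
```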

Example
For a simulation, take a 0 – 1 box with p = 0.2. Draw a sample of n = 100 and calculate p̂. Repeat this 500 times,
getting 500 values of p̂. What does this list of p̂ values look like? It has average of 0.2017 and SD of 0.04040.
Our theory says p̂ has EV = 0.2 and SE = √(0.2 · 0.8 / 100) = 0.04, so our list of values agrees well with the
theory. The histogram of the 500 p̂ values is the following.


Example
For a simulation, take a 0 – 1 box with p = 0.5. Draw a sample of n = 50 and calculate p̂. Repeat this 500 times,
getting 500 values of p̂. What does this list of p̂ values look like? It has average of 0.499 and standard
deviation of 0.071. Our theory says p̂ has EV = 0.5 and SE = √(0.5 · 0.5 / 50) = 0.0707, so our list of values
agrees well with the theory. The histogram of the 500 p̂ values is the following.
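Simulations like the two above are easy to rerun. The sketch below uses Python's `random` module with an arbitrary seed; it is our own illustration, not the program used to produce the figures in this chapter, so its averages and SDs will be close to the quoted values without matching them exactly.

```python
import math
import random
import statistics

def simulate_p_hat(p, n, reps, rng):
    """Draw n tickets from a 0-1 box `reps` times; return the list of p-hat values."""
    values = []
    for _ in range(reps):
        ones = sum(1 for _ in range(n) if rng.random() < p)
        values.append(ones / n)
    return values

rng = random.Random(3660)   # arbitrary seed so the run is repeatable

for p, n in [(0.2, 100), (0.5, 50)]:
    p_hats = simulate_p_hat(p, n, reps=500, rng=rng)
    theory_se = math.sqrt(p * (1 - p) / n)
    print(f"p = {p}, n = {n}: "
          f"average = {statistics.fmean(p_hats):.4f}, "
          f"SD = {statistics.pstdev(p_hats):.4f}, "
          f"theory SE = {theory_se:.4f}")
```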


Confidence Interval For a Proportion


Suppose we have a 0 – 1 box and we don’t know p , the proportion of 1’s .
[ 0 0 0 . . . 1 1 1 . . . p = proportion of 1’s]
We are allowed to sample, with replacement, from the box and use the sample information to estimate p .
The key statistic is p̂, the sample proportion of 1’s. p̂ is the natural estimate of p. For instance, if we draw
20 chips from the box and find 15 1’s, then p̂ = 15/20 = 0.75. Our estimate of the proportion of 1’s in the box is
0.75. But how accurate is this estimate? What is the size of the chance error? If we sampled 20 chips again
from the box our estimate would likely change due to the randomness of the drawing. How can we express
this “looseness” of our estimate?
This issue of the accuracy of our estimate can be addressed directly with a confidence interval. The
confidence interval formula, based on the sampling distribution of p̂ , is

p̂ ± Z · SE

where SE = √(p̂ · (1 − p̂) / n) and Z is a constant from the Normal table. The Z is chosen to give a specified
confidence level to the interval:
Z value      confidence level
Z = 1        68%
Z = 2        95% (see below for more accurate 95% case)
Z = 3        99.7%
Z = 1.645    90%
Z = 1.96     95%
Z = 2.58     99%
Note that the standard error is really an “estimated” standard error based on the data (the p̂ value is used).
The empirical rule gives only a rough description of the spread of a normal distribution. From the
Normal table, the area between −1.96 and +1.96 under the Z curve is 0.95, or 95%. The
empirical rule rounds 1.96 up to 2. Therefore, we use Z = 1.96 to obtain a
more accurate 95% confidence interval.
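The Z values in the table can be recovered from the normal curve itself. In this sketch (our own, using `statistics.NormalDist` from the Python standard library), `z_for_level` finds the Z that leaves the stated confidence level in the middle of the curve, and `level_for_z` goes the other way.

```python
from statistics import NormalDist

Z_CURVE = NormalDist()   # standard normal: mean 0, SD 1

def z_for_level(level):
    """Z such that the area between -Z and +Z equals `level`."""
    return Z_CURVE.inv_cdf((1 + level) / 2)

def level_for_z(z):
    """Area under the Z curve between -z and +z."""
    return Z_CURVE.cdf(z) - Z_CURVE.cdf(-z)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> Z = {z_for_level(level):.3f}")

for z in (1, 2, 3):
    print(f"Z = {z} -> {level_for_z(z):.1%} confidence")
```

Running it shows that Z = 2 covers slightly more than 95% of the curve, which is why 1.96 is used for the more accurate 95% interval.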

Example
A random sample of 1000 voters was taken in a district of 25,000 voters. It was found that 550 of the sampled
voters were in favor of a new bond issue. From this, p̂ = 550/1000 = 0.55 is the estimate of p, the proportion of
voters in the whole district in favor of the bond. A 95% confidence interval for p is

0.55 ± 1.96 · √(0.55 · 0.45 / 1000)
0.55 ± 0.03
(0.52, 0.58)
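The same interval can be computed in a few lines of Python. The function below is a sketch of the formula p̂ ± Z · SE; its name and arguments are our own.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Confidence interval for a proportion using the estimated SE."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    margin = z * se
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(550, 1000)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```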


It is important to understand the interpretation of a confidence interval properly. The confidence or probability
is on the method, not on the particular resulting interval. If we do this method, we have a 95% chance that
our interval will contain the unknown p . When we calculate the interval for our particular data, we can’t
know if it’s correct or not, that is, if the interval really contains the unknown p . Our faith in the interval
is based on our faith in the method in general. Therefore, we are 95% confident that the proportion of all
voters in a district that are in favor of a new bond issue is between 0.52 and 0.58.
Empirically, if we could repeat this method over and over again with new samples, we would find that about
95% of the intervals would contain p . The 95% is on the sampling, not on our particular result 0.52 to 0.58.
Specifically, we do not say that there is a 95% chance that the box p is in the interval 0.52 to 0.58. This is a
very subtle point.
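The idea of repeating the method over and over can be tried by simulation. The sketch below assumes a box with p = 0.55 (a value we pick so that the program knows it, unlike in real sampling), draws many samples of 1000, and counts how often the 95% interval contains p. The seed and the number of repetitions are arbitrary choices.

```python
import math
import random

def interval_covers(p, n, rng, z=1.96):
    """Draw one sample of size n from a 0-1 box and report whether
    the confidence interval p_hat +/- z * SE contains the true p."""
    ones = sum(1 for _ in range(n) if rng.random() < p)
    p_hat = ones / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se <= p <= p_hat + z * se

rng = random.Random(2017)     # arbitrary seed for repeatability
true_p, n, reps = 0.55, 1000, 2000

hits = sum(interval_covers(true_p, n, rng) for _ in range(reps))
coverage = hits / reps
print(f"{coverage:.1%} of the {reps} intervals contained p")
```

Each individual interval either contains p or it does not; the 95% describes the long-run share of intervals that do.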

Example
Suppose we are interested in the proportion of grass seeds that will germinate when planted. The seeds
have been coated with a preservative that may inhibit germination. To obtain data, we select 500 coated
seeds and plant them in an appropriate bed. We subsequently find that 350 of them germinate. This is a
sample proportion of p̂ = 350/500 = 0.70. A 95% confidence interval for the true proportion p for all such seeds
is
0.70 ± 1.96 · √(0.70 · (1 − 0.70) / 500)
0.70 ± (1.96 · 0.0205)
0.70 ± 0.04
(0.66, 0.74)
Note that the formula for the standard error of p̂ was used. See the above table for the Z value (Z = 1.96 for a 95%
confidence level).
We can write that 0.66 < p < 0.74 is a 95% confidence interval for the population p. The interpretation of our
resulting confidence interval is, “we are 95% confident that the population proportion of grass seeds that
will germinate when planted is between 0.66 and 0.74.”
The confidence level that we use depends on the field of application or the researcher. The most commonly
used confidence levels are 90%, 95%, and 99%. For a 90% confidence level, Z = 1.645 is used, and the 90%
confidence interval is narrower than the 95% and 99% confidence intervals. For a 99% confidence interval,
Z = 2.58 is used, and this interval is wider than both the 90% and 95% confidence intervals. In general, as we
increase the confidence level, the Z value increases and the confidence interval gets wider. See the graphs
below for illustration.
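To see the widening numerically, the sketch below recomputes the grass seed interval (350 of 500 germinating) at the three common confidence levels. The Z values are those from the table earlier in this chapter; the code layout is our own.

```python
import math

p_hat, n = 350 / 500, 500
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Z values for the common confidence levels.
z_table = {"90%": 1.645, "95%": 1.96, "99%": 2.58}

widths = {}
for level, z in z_table.items():
    low, high = p_hat - z * se, p_hat + z * se
    widths[level] = high - low
    print(f"{level}: ({low:.3f}, {high:.3f}), width = {widths[level]:.3f}")
```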


EXERCISES
1. 30% of the seals in a colony have a certain genetic trait. If you sample 50 of them, what is the chance
the sample will contain 20 (40%) or more seals with this trait?
2. One poll used a sample of 1000 people in a city of population 50,000. Another poll used a sample of
1000 people in a city of population 500,000. True or false: the 2nd poll is likely to be more accurate
than the 1st. Explain your reasoning.


3. In sampling from a 0 – 1 box, we would like our sample proportion to be within 0.03 of the true
proportion, that is, | p̂ − p| < 0.03. If the true p = 0.20 , what is the minimum sample size to give this
with a 95% chance? Answer this by a trial and error approach, pick a sample size n and make the
calculation to get the chance. Choose another sample size n and repeat. Do this until you are satisfied
your n does the job. Repeat if the true p = 0.50.
Note: this type of calculation to determine a sample size to give specified results (within 0.03 with
95% chance) is very important in planning data collection. Also, your work shows that the sample
size needed for the sample proportion to be within 0.03 of the population proportion is not absolute,
rather it depends on the value of the population proportion.
4. Data from previous semesters show that the proportion of students who pass STAT 3660 is 80%. Using
this information, it is of interest to find the sampling distribution of the proportion of students who
will pass the class, from a random sample of 55 students.
(A) Find the expected value of the proportion.
(B) Calculate the standard error of the proportion.
5. A box contains marbles of which 40% are blue and the rest are red. A marble is drawn from the box
and the color of the marble drawn is noted.
(A) (TRUE or FALSE) The sampling distribution of the proportion of blue marbles drawn is approximately
normal no matter what the sample size is.
(B) Suppose the experiment is done 200 times. We expect the proportion of blue marbles drawn to
be ______, give or take ______.
6. A random sample of 500 students at WMU was taken and 105 said they attended at least one football
game last fall. Construct and interpret a 95% confidence interval for the corresponding population
proportion.
7. A researcher wants to conduct a study to determine the success rate (the proportion of patients
cured) of a new drug when it is applied to a suitable population of patients. A random sample of 60
patients was selected from the population and treated with the new drug. If 42 were cured, calculate
a 95% confidence interval for the proportion of patients cured with the new drug.
8. If the population proportion is unknown, and hence the box model is unknown, then for every random
sample from the population (pick the best answer)
(a) there is an associated sample proportion, which is expected to be equal to the true population
proportion 95% of the time.
(b) there is an associated 95% confidence interval, which should, 95% of the time, have the true
population proportion within its range.
(c) there is an associated true population proportion, which should land within one SE of the
sample proportion 95% of the time.
(d) there is an associated 68% confidence interval, which should, 68% of the time, have the associated
sample proportion within its range.

REVIEW EXERCISES
9. The Center for Disease Control and Prevention reported that 29% of American adults have high blood
pressure. If a random sample of 500 American adults is obtained, what is the chance that more than
33% of the sample have high blood pressure?
10. In an article published by Inside Higher Education in 2008, the proportion of traditional-aged college
students in the US that are uninsured is 20%. If a random sample of traditional-aged college students
of size 300 would be obtained, what is the chance the sample proportion of uninsured students would
be less than 17%?


11. There is a 0-1 box model where the population proportion is unknown. Two samples were taken
from this population: sample one of size 30 and sample two of size 50. Both of these
samples have the same sample proportion. If we construct a 95% confidence interval for the proportion using
each sample,
(a) The confidence interval from sample one is wider than that from sample two.
(b) The confidence interval from sample one is narrower than that from sample two.
(c) The confidence interval from sample one has the same width as that from sample two.
12. (TRUE or FALSE) To reduce the width of the confidence interval by a factor of 2, we must quadruple
the sample size when considering a 95% confidence interval for the population proportion. Explain
your answer.
13. The students in a random sample were asked if they had visited the recreation center at least once
per week. Out of the random sample of 81 students, 27 said they visited the rec. center at least once
per week. Find a 95% confidence interval for the proportion of all WMU students who visit the rec.
center at least once per week.
14. Suppose a random sample of 64 cyclists on the Fred Meijer Heartland Trail was taken, of which 48
were wearing helmets. Find a 95% confidence interval for the proportion of all cyclists on the Fred
Meijer Heartland Trail who wore helmets.
15. A medical study was carried out to investigate a new drug. For the 140 subjects in the study the new
drug was successful for 98 of these people. Give a 95% confidence interval for the proportion of
successes if the new drug was adopted for general use.
16. Suppose a random sample of 100 dorm residents at WMU was taken and 28 reported that the food
service was poor. Give a 95% confidence interval for the proportion of dorm residents overall who
feel that the food is poor.
17. Suppose that 73% of all American households have internet access. A market survey group repeated
this study in a certain town with 70,000 households. Using a simple random sample of 700 households,
it found that 507 of them have internet access. The percentage of households in the town with internet
access is estimated as ______; this estimate is likely to be off by ______ or so. Find a 95%
confidence interval for the percentage of all 70,000 households with internet access.
18. Suppose a simple random sample of 5000 high school seniors was taken. Of those, only 17.1% knew
that Cervantes wrote Don Quixote, but 95.7% knew that Shakespeare wrote Hamlet.
(A) Find a 95% confidence interval for the percentage of all high school seniors who knew that
Cervantes wrote Don Quixote.
(B) Find a 95% confidence interval for the percentage of all high school seniors who knew that
Shakespeare wrote Hamlet.
19. Three hundred random draws are made from a box of 200 1’s and 600 0’s. True, or false, and explain
briefly.
(A) The expected value for the percentage of 1’s among the draws is exactly 25%.
(B) The expected value for the percentage of 1’s among the draws is around 25%, give or take 2%
or so.
(C) The percentage of 1’s among the draws will be around 25%, give or take 2% or so.
(D) The percentage of 1’s in the box is exactly 25%.
(E) The percentage of 1’s in the box is around 25%, give or take 2% or so.


20. Some studies show that increasing calcium intake decreases blood pressure, and that the effect is strongest in
African-American men. A group of medical student researchers conducted a 12-week clinical trial to
see if this is true. A random sample of 250 African-American teenage men with high blood pressure
participated. 150 of them took a calcium supplement for 12 weeks and the rest took a placebo.
Suppose 108 of the 150 African-American teenagers who took calcium supplements for 12 weeks
experienced a reduced blood pressure.
(A) Calculate p̂. Is this value a statistic or a parameter?
(B) Calculate the standard error of the sample proportion.
(C) Calculate a 95% confidence interval for the proportion of all African-American teenage men
with lower blood pressure after taking the supplements.
(D) Interpret your result.
21. Suppose the proportion of all STAT 3660 students who have not used their cell phones in the last 3
months is 15%. For 90 randomly selected students, what is the probability that less than 18% of them
did not use their cell phone in the last 3 months?

Chapter 12

Sampling Distribution of a Sample Mean

Start with a box of numbers.


[ box of numbers, mean µ and SD σ ]

Consider the process of taking a random sample of size n from the box and calculating the sample mean X̄ .
The value we would get for X̄ is random, depending on chance. Different samples yield different values for
X̄ .
To understand this random behavior, we turn to the notion of a sampling distribution. It is convenient to
recall the work already done for the sampling distribution of a sum, since X̄ and sum are closely related.
Recall that the sample mean X̄ is calculated as
X̄ = sum / n
Recall from Chapter 10 that the sum has expected value
EV = n · µ
so X̄ = sum / n has expected value

EV = (n · µ) / n = µ
Also the sum has standard error

SE = √n · σ

so X̄ = sum / n has standard error

SE = (√n · σ) / n = σ / √n
Dividing by n has only a scaling effect and doesn’t affect the normal approximation idea.

To summarize, the sampling distribution of the sample mean is approximately normal (recommend n > 30),
with
EV = µ and
SE = σ / √n


Example
Consider the experiment of rolling a die. The box model is
[ 1 2 3 4 5 6 ]
The box has mean µ = 3.5 and standard deviation of σ = 1.708.

Suppose we roll the die 30 times and observe the value on the upturned face each time. This can be viewed
as drawing 30 numbers from the box and taking their average. Applying the concept of the sampling
distribution of the sample mean, this average X̄ has

EV = 3.5 and SE = 1.708 / √30 = 0.312.
We can say that we expect X̄ to be 3.5, give or take 0.312. Further, there is about a 95% chance that X̄ will
be in the interval 3.5 ± (1.96 · 0.312), which is 3.5 ± 0.612.

If we made 100 rolls, then the average X̄ has EV = 3.5 and SE = 1.708 / √100 = 0.1708. Then we expect X̄ to be 3.5,
give or take 0.17. There is about a 95% chance that X̄ will be in the interval 3.5 ± (1.96 · 0.17), which is 3.5 ±
0.3332.
Note how the interval for X̄ is narrower for n = 100 than for n = 30; this shows it has less variability for the
larger sample size.
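The die calculations can be reproduced as follows. This is a sketch with names of our choosing; the box mean and SD are computed from the box rather than typed in, and nothing is rounded along the way, so the margins differ slightly in the third decimal place from those worked by hand above.

```python
import math
import statistics

box = [1, 2, 3, 4, 5, 6]
mu = statistics.fmean(box)       # box mean
sigma = statistics.pstdev(box)   # box SD (population form)

def mean_ev_se(mu, sigma, n):
    """EV and SE of the sample mean for n draws from the box."""
    return mu, sigma / math.sqrt(n)

for n in (30, 100):
    ev, se = mean_ev_se(mu, sigma, n)
    margin = 1.96 * se
    print(f"n = {n}: EV = {ev}, SE = {se:.3f}, 95% range = {ev} ± {margin:.3f}")
```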

Example
Suppose employees at a large hospital average 7 sick days per year with a standard deviation of 2 days. If a
random sample of 30 employees is to be selected, what is the probability the sample mean is less than 7.5?
The sample mean has EV = 7 and SE = 2 / √30 = 0.365. Since the distribution of the sample mean is approximately
normal, we can use the Z score to calculate the probability. The Z score for 7.5 is Z = (7.5 − 7) / 0.365 = 1.37, and
the area below this value from the Z table is 0.9147. The chance the sample mean is less than 7.5 is
91.47%.
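As a check on the table lookup, here is the same calculation in Python, with `statistics.NormalDist` in place of the Z table. The variable names are ours, and because Z is not rounded to 1.37 first, the result can differ from the table value in the fourth decimal place.

```python
import math
from statistics import NormalDist

mu, sigma, n = 7, 2, 30          # sick days: box mean, box SD, sample size
se = sigma / math.sqrt(n)        # SE of the sample mean

z = (7.5 - mu) / se
chance = NormalDist().cdf(z)     # area below z under the Z curve

print(f"SE = {se:.3f}, Z = {z:.2f}, chance = {chance:.4f}")
```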


Example
Consider the box [ 1 1 2 2 2 2 3 4 5 9 ]. It has a mean of 3.1 and standard deviation of 2.3. Now draw n = 10
times and calculate the sample mean. Repeat this 500 times to produce 500 values for the sample mean.
What does this list of sample mean values look like? It has an average of 3.093 and a standard deviation of
0.734.
Our theory says the sample mean has EV = 3.1 and SE = 2.3/√10 = 0.7273, so our results agree quite well with
the theory.
The histogram of the 500 values of sample mean is the following. Note, with the small sample size here, the
shape of the curve is not too close to the Normal shape.

Example
Again, consider the box [ 1 1 2 2 2 2 3 4 5 9 ]. It has a mean of 3.1 and a standard deviation of 2.3. Now draw
n = 40 times and calculate the sample mean. Repeat this 500 times to produce 500 values for the sample
mean. What does this list of sample mean values look like? It has an average of 3.114 and a standard
deviation of 0.3833.
Our theory says the sample mean has EV = 3.1 and SE = 2.3/√40 = 0.3637, so our results agree quite well with
the theory.
The histogram of the 500 values of sample mean is next. Note, with the bigger sample size here, the shape
of the curve is quite close to the Normal shape.
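The two simulations above can be imitated with the sketch below. It assumes the draws are made at random with replacement; the seed and the function name `simulate_means` are arbitrary choices for illustration. A given run will not reproduce the exact figures quoted above (3.093, 0.734, 3.114, 0.3833), but it should land near the theoretical EV and SE.

```python
import random
import statistics

box = [1, 1, 2, 2, 2, 2, 3, 4, 5, 9]
sigma = statistics.pstdev(box)           # box SD (divides by N), 2.3 here

def simulate_means(n, reps=500, seed=1):
    """Draw n tickets with replacement, reps times, and return the sample means."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    return [statistics.mean(rng.choices(box, k=n)) for _ in range(reps)]

for n in (10, 40):
    means = simulate_means(n)
    print(f"n={n}: average of means={statistics.mean(means):.3f}, "
          f"SD of means={statistics.stdev(means):.3f}, "
          f"theory SE={sigma / n ** 0.5:.4f}")
```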


Confidence Interval For a Population Mean


Consider a box model and suppose we don’t know enough about its composition to know the box mean
µ. That is, the box is unknown. Taking a statistical approach, we will estimate the box mean from the
information in a random sample drawn from the box. The sample mean X̄ is the natural estimate of the box
mean µ. The accuracy of this estimate is measured by SE = σ/√n. A further problem arises in that σ is
usually unknown.
To proceed, we use the sample standard deviation s to estimate σ and then have an “estimated” SE = s/√n.

Note: Although the box contents are unknown, we assume we have the sample data, so X̄ and s are
known.
A formal confidence interval formula for the box mean µ is

X̄ ± (Z · SE), where SE = s/√n

and Z is a constant from the normal table (see the table in chapter 11). This result is approximate, being
based on the approximate normality of X̄ for larger sample sizes (n > 30).


Example
Suppose we have identified the population of 5000 rental apartments in a city. Let µ be the average rent paid
last month for this population. Note the box model here. To estimate the average, a random sample of 100
apartments is selected from the population and the sample had a mean of $450 and standard deviation of
$200. A 95% confidence interval for the population mean (the average rent paid over the 5000 apartments)
is
 
450 ± 1.96 · (200/√100) = 450 ± 39.2 = (410.8, 489.2)

This interval means we are 95% confident that the true average rent paid over the 5000 apartments is
between $410.80 and $489.20.

A 99% confidence interval would use Z = 2.58 in place of 1.96, and a 90% confidence interval would use
Z = 1.645.
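The rent example can be sketched in Python as follows. The function name `mean_confidence_interval` is hypothetical, and the Z multiplier is computed with `statistics.NormalDist` instead of being read from the table, so the 99% value is about 2.576 where the table rounds to 2.58.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(xbar, s, n, level=0.95):
    """Approximate large-sample confidence interval for a population mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # e.g. about 1.96 for 95%
    margin = z * s / sqrt(n)                        # Z times the estimated SE
    return xbar - margin, xbar + margin

for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(450, 200, 100, level)
    print(f"{level:.0%} interval: ({low:.1f}, {high:.1f})")
```

Higher confidence levels give wider intervals, since a larger Z multiplies the same SE.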

Example
On average, how much did students at a university pay for their September cell phone bill? A random
sample of 100 students was selected and their September cell phone bills averaged $60 with a standard
deviation of $20. So we estimate the average September cell phone bill is $60. How accurate is this
estimate? To answer this, we calculate a 95% confidence interval for the population average by the usual
formula:
 
60 ± 1.96 · (20/√100) = 60 ± 3.92 = (56.08, 63.92)

Thus, we can say we are 95% confident that the true average bill students at a university paid for September
is between $56.08 and $63.92.

EXERCISES
1. Previous studies show that the average weight of a newborn is 7.5 lbs, with a standard deviation of
0.7 lbs. Suppose a random sample of 70 births in February is taken.
(A) What is the expected average weight of the sample?
(B) Calculate the standard error of the average weight of newborns in the sample.
(C) What is the chance the average weight of the newborns in the sample is more than 7.25 lbs.?
2. Test scores of a large population of students on a standard test have mean = 500 and a standard
deviation = 100.
(A) For a random sample of 50 students from the population, what is the chance their mean score
exceeds 520?
(B) If a single student is selected at random from the population, what is the chance his score
exceeds 520?
3. A sample of 30 adult coho salmon was taken from Lake Michigan and they were tested for the PCB
level in their flesh. The sample had a mean of 22 ppm and a standard deviation of 8 ppm. Give a 95%
confidence interval for the average PCB level of the coho salmon in the Lake.


4. What was the average credit card debt for students at WMU last month? To gain some information,
suppose a random sample of 300 students is selected and asked about this debt. The debts reported
had a mean of $225 with a standard deviation of $90.
(A) Calculate a 90% confidence interval for the average credit card debt of WMU students last
month.
(B) Interpret the confidence interval obtained in (A).
5. Suppose that WMU wants to know the average number of years its student body has been out of high
school. Upon taking a random sample of 400 students, the sample average is 3.75 years out of high
school with a sample standard deviation of 1.6 years.
(A) Find a 99% confidence interval for the average number of years out of high school for the entire
student body.
(B) If we take a second sample of 400, what is the chance that the average of that second sample
will be in the 99% confidence interval we construct from that second sample?
6. (Hypothetical) Suppose that the average number of bus rides per week (including transfers) a Kalamazoo
Metro Transit Bus Rider takes is 10 with a standard deviation of 3.5.
(A) If a random sample of 49 riders was taken, what is the chance that the average number of bus
rides per week of this sample of riders is less than 9.5?
(B) (TRUE or FALSE) 95% of the sample (about 47 of the riders) taken in (A) ride the bus between 9
and 11 times per week.
7. (Hypothetical) In a particular year, there were 600,000 faculty members at institutions of higher learning
in the U.S. In a simple random sample of 3600 of these members, the average number of regular office
hours held per week, during the semester, was 6.8 with a standard deviation of 2.2. Construct and
interpret a 95% confidence interval for the true average of regular office hours held by the population
of 600,000 faculty members.
8. (Hypothetical) A study was done at WMU to determine the number of times (perhaps numerous times
per day) a student visits a campus computer lab per week. A random sample of 81 students was
taken, and the sample average was found to be 4 with a standard deviation of 2.7. For the following
statements, determine if they are true or false.
(A) The average number of visits to a campus computer lab per week for all students at WMU is 4,
give or take 2.7.
(B) About 50% of the students at WMU visit a campus computer lab 4 or more times per week.
(C) A 95% confidence interval for the average number of visits to a campus computer lab per week
for all students at WMU is calculated as 4 ± 0.588.
(D) If another random sample of size 81 is taken, there is a 95% chance that it will also have a mean
of 4.
9. If the population mean and SD are unknown, and hence the box model is unknown, then for every
random sample from the population (pick the best answer)
(a) there is an associated sample mean, which is expected to be equal to the true population mean
95% of the time.
(b) there is an associated 95% confidence interval, which should, 95% of the time, have the true
population mean within its range.
(c) there is an associated true population mean, which should land within one SE of the sample
mean 95% of the time.
(d) there is an associated 68% confidence interval, which should, 68% of the time, have the associated
sample mean within its range.


REVIEW EXERCISES
10. (TRUE or FALSE) A confidence interval provides a range of possible values for the value of the
statistic being estimated (sample mean or sample proportion).
11. Suppose a random sample of 30 students in a STAT 3660 class was obtained, and their midterm exam
scores were noted. Explain what a 95% confidence interval for the average midterm score would look like if
all the scores were the same.
12. Two sets of samples were taken from a population. Both samples have the same sample size and the
same standard deviation, however, the sample means are different. If we construct a 95% confidence
interval for the population mean using both samples, which of the following statements is TRUE?
(a) The confidence interval obtained using the sample with a larger sample mean is wider than that
obtained using the sample with smaller mean.
(b) Both confidence intervals have the same width.
(c) The confidence interval obtained using the sample with larger sample mean is narrower than
that obtained using the sample with smaller sample mean.
(d) The lower and upper bounds of the two confidence intervals would be the same.
13. (Hypothetical) A study was done to determine how often students check their email. In a survey of
100 students randomly selected at WMU, each student was asked how many times they normally
check their email per day. The sample average was 5.3 with a sample standard deviation of 4. For the
following, indicate whether the statement is TRUE or FALSE:
(A) The 95% confidence interval for the average number of times per day a student at WMU normally
checks their email is 5.3 ± 0.8 or (4.5, 6.1).
(B) 95% of the sample values lie between 4.5 and 6.1.
(C) 95% of the WMU students check their email between 4.5 and 6.1 times per day.
(D) 19 out of 20 times, when a random sample of 100 is taken, the interval

{sample mean − 2 · (sample SD / 10), sample mean + 2 · (sample SD / 10)}

should contain the actual average number of times a student normally checks their email per
day for all students at WMU.
14. (Hypothetical) A study was made investigating student enrollment persistence with respect to hours
worked per week. Students classified as Persevering (those who returned after the first semester of
college) and non-persevering (those who did not return after the first semester of college) were asked
about their work schedules during the first semester of college. From these students, the following
data on the number of hours worked per week in the first semester was collected:

Persevering Non-persevering
Mean 17.41 28.64
SD 15.2 10.3
n 144 49

If each group is considered a random sample, give a 95% confidence interval for average number of
hours worked per week during the first semester of college for each group.


15. (Hypothetical) In a town with 10,000 occupied rental units (house, apartments, etc.), a random sample
of 400 of the units was taken. The mean rental price was $500 with a standard deviation of $180.
(A) What is the chance that the average rental price of the sample is below $380?
(B) What is the chance that the average rental price of the sample exceeds $430?
(C) Construct and interpret a 95% confidence interval for the average rent paid in the previous
month on all 10,000 occupied rental units in this town.
(D) (TRUE or FALSE): If one of the occupied rental units was picked at random, there is a 95%
chance that the rent paid last month for that unit is in the range of $491 to $509.
16. A surveyor wants to obtain the distance between two points, A and B, in a large field. He makes a
measurement with his “method” 30 times independently. The average of his 30 measurements is 800
feet with a standard deviation of 10 feet. Fill in the blanks.
(A) One measurement differs from the true distance by ______ or so.
(B) The average of these 30 measurements differs from the true distance by ______ or so.
(C) Give a 95% confidence interval for the true distance based on the surveyor’s data.
17. It is known from library records in a particular county that the patrons check out, on average, µ =
8 movies per month with a σ = 6 movies (obviously not normally distributed). Suppose a random
sample of size 100 was taken from the patrons and the number of movies each checked out in the
last month was recorded. What is the chance that the sample average would be less than 7?
18. (Hypothetical) A medium size university knows that among its undergraduate population, students in
the current semester are enrolled, on average, for µ = 13.5 credit hours, with a σ = 3 hours. Suppose
a random sample of 40 students was taken.
(A) Construct and interpret a 90% confidence interval for the true average credit hours students are
enrolled in.
(B) What is the probability that the sample would have an average credit hours more than 15?
19. (TRUE or FALSE) A clinic knows that among its patient population, those with Type II diabetes eat
vegetables on average, 5 days in a week, with a standard deviation of 0.82 days. If a random sample
of 50 patients was taken, we expect the average number of days per week they eat vegetables is more
than 5 days.

Chapter 13

Tests of Significance

Start with a box model that represents the population. Suppose at least some feature of the box is unknown,
like the mean µ .

Box or population is [ * * * , mean µ unknown ]


Again, the box represents a population of values that we are interested in. It could be, for example:

• the diameters of all oak trees in a forest


• the amount of a drug in the blood 1 hour after an injection for a group of 100,000 people who use it
• the lifetimes (in miles) of the alternators in all Honda Civics, year 2009 models, sold in the USA
• the household income in year 2000 for all households in the city of Portage
• the high school GPA of all high school graduates in Michigan in the year 2009
• the level of PCB in the flesh of all the salmon in Lake Michigan

Each situation like the above is a population of numbers. How can the mean of such populations be
determined if we do not have the values in the box?
The statistical approach is to take a random sample from the box and to use the sample information to
make an inference about the box. For instance, we may use the sample mean to estimate the box mean
or, as described earlier, we may produce a confidence interval for the box mean. Now we will consider yet
another kind of inference motivated by a different need.

Example: A Jury Trial


A trial is a perfect non-statistical example of a hypothesis test. In any jury trial, the jury has two options:
guilty or not guilty. View this as
H0 : innocent
HA : guilty
We assume the defendant is innocent (H0 ) unless there is enough evidence to conclude they are guilty (HA ).
This is the same with a hypothesis test. We will assume H0 to be true unless we have enough evidence to
show HA .


The basic procedure for a test of significance:

(1) Form hypotheses H0 and HA to reflect the question of interest.


• H0 is called the null hypothesis. This is what we consider to be true until we provide enough
evidence to show otherwise.
• HA is called the alternative hypothesis. This will be considered true only if we have enough
evidence to show it.
• The symbols in our hypotheses are always opposites.
(2) Calculate an appropriate test statistic, say Z or t .
• A test statistic is calculated from the sample data. It measures how far the data diverge from
what we would expect if the null hypothesis is true.
• In general, the test statistic is calculated as:
test statistic = (estimate − hypothesized value) / (standard error of the estimate)

(3) Determine the p -value for the test statistic Z .


• A p-value is a probability: the probability that the test statistic would take a value as extreme
as or more extreme than the one actually observed, computed assuming H0 is true. In simpler
words, it is the chance of getting sample results at least as extreme as ours if the null
hypothesis holds.
(4) Reject or fail to reject H0 based on the p-value.
• If p-value ≤ 0.05, we declare the result is significant and reject H0.
• If p-value > 0.05, we fail to reject H0. The data did not provide significant evidence to reject H0.
(5) Conclusion in terms of the problem.
• If we reject H0 , there is significant evidence to show HA .
• If we do not reject H0 , there is not significant evidence to show HA .

To summarize steps 3–5, see the following diagram.

p-value ≤ 0.05 p-value > 0.05


↓ ↓
Reject H0 Do not reject H0
↓ ↓
There is significant evidence to show HA . There is not significant evidence to show HA .

One Population Mean: Z-test


For a test of significance on one population mean, we are interested in testing the population mean µ against
a hypothesized value µ0. For sample sizes greater than 30 (n > 30), the Z-test is used. In general, there are
three sets of hypotheses that we can consider:

H0 : µ = µ0 vs. HA : µ ≠ µ0
H0 : µ ≤ µ0 vs. HA : µ > µ0
H0 : µ ≥ µ0 vs. HA : µ < µ0

The sets of hypotheses where the alternative (HA) is > or < are called one-tailed or one-sided hypotheses,
while the case where HA is ≠ is called a two-tailed or two-sided hypothesis.


The Z test statistic for one population mean is given by the formula:
Z = (estimate − hypothesized value) / (standard error of the estimate) = (x̄ − µ0) / (s/√n)

where s is the standard deviation of the sample.

The steps for testing any set of hypotheses stated above are the same except for calculating
the p-value. The p-value depends on the alternative hypothesis of the test. For the Z-test, the p-value is the
area under the Normal curve related to the test statistic Z. The formulas for the p-value under each
alternative hypothesis are:

HA : µ ≠ µ0    p-value = 2 × tail area = 2 × P(Z > |z|)    two-tailed
HA : µ > µ0    p-value = 1 − area below z                  right-tail
HA : µ < µ0    p-value = area below z                      left-tail

Here z is the observed value of the test statistic, and “area below z” is the Z-table area to the left of z.
The areas under the curve corresponding to these p-values are shown in the graphs below.
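The three p-value rules can be collected into one small Python function. This is only a sketch: the function name `z_p_value` and the labels for the alternatives are made up here, and `statistics.NormalDist` plays the role of the Z table.

```python
from statistics import NormalDist

def z_p_value(z, alternative):
    """p-value for an observed Z statistic.

    alternative is 'two-sided', 'greater', or 'less' (labels chosen for this sketch).
    """
    below = NormalDist().cdf(z)                      # area to the left of z
    if alternative == "less":
        return below                                 # left tail
    if alternative == "greater":
        return 1 - below                             # right tail
    if alternative == "two-sided":
        return 2 * (1 - NormalDist().cdf(abs(z)))    # both tails
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

# Hypothetical test statistics, just to exercise each rule
print(round(z_p_value(2.0, "greater"), 3))
print(round(z_p_value(-2.0, "less"), 3))
print(round(z_p_value(2.0, "two-sided"), 3))
```

For the same |z|, the two-sided p-value is twice the one-sided value, which is why a result can be significant in a one-tailed test and not in a two-tailed test.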

Example A
A plastic manufacturing company has a machine that can make a certain molded part. In a run, they typically
make 1000–2000 of these parts. One dimension of this part should be 10 mm in length. Sometimes they
have shrinkage troubles for various reasons and the parts tend to be too short. As a quality control check,
a random sample of 36 parts is selected from the lot and measured (they measure a sample, not the whole
lot). Suppose the sample mean is x̄ = 9.7 mm and the standard deviation is s = 0.6. Is this evidence enough
to conclude that there has been a shrinkage problem with this lot?


Clearly 9.7 is less than 10, but our doubt is due to the chance variation possible in a sample (a new sample of
36 parts would have a different x̄, maybe one bigger than 10).
Continuing, we start to formulate an idea by asking the question:
Is it likely that the whole lot has a mean of 10 (so it is satisfactory) and yet our sample mean is found to be
9.7? Or, as an alternative, when the lot has mean of 10 is it very unlikely for the sample mean to come out
at 9.7? Notice our interest here is to decide yes or no, that is

• yes, the lot is satisfactory having a mean of 10, (sell it to our customers), or
• no, the lot is not satisfactory having a mean that is less than 10 (shrinkage has occurred, scrap the
lot).

We have to make this decision between the two choices based on our data: sample mean x̄ = 9.7. How to
proceed? The steps in conducting a test of significance would be employed using α = 5%.

(1) Form hypotheses to reflect the question of interest:

H0 : µ ≥ 10 (no shrinkage)
HA : µ < 10 (shrinkage)

(2) The test statistic is


Z = (x̄ − µ0) / (s/√n) = (9.7 − 10) / (0.6/√36) = −3.0

(3) The alternative hypothesis is “less than”, hence the p -value is

p-value = area for (Z < −3.0) = 0.001

The graph for the p-value is shown below:

(4) Since the p -value of 0.001 is less than 0.05, we reject H0 . If H0 is true, a very rare event has
occurred; we think it more likely that H0 is false.

(5) There is significant evidence to show shrinkage has occurred.
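The five steps of Example A can be traced in a few lines of Python. This is a sketch with illustrative variable names; the normal curve area comes from `statistics.NormalDist` where the text uses the Z table.

```python
from math import sqrt
from statistics import NormalDist

xbar, s, n, mu0 = 9.7, 0.6, 36, 10       # sample results and hypothesized mean
z = (xbar - mu0) / (s / sqrt(n))         # test statistic
p_value = NormalDist().cdf(z)            # left-tail area, since HA is "less than"

print(f"Z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= 0.05 else "fail to reject H0")
```

The computed p-value is slightly above the 0.001 quoted in the text because the table value is rounded to three decimals; the decision to reject H0 is the same.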


Example B
In a study of household income in Kalamazoo, consider µ = mean household income over this population.
We may want to test
H0 : µ ≤ $32,000
HA : µ > $32,000
where $32,000 is a national figure of interest. To proceed, we would collect data in Kalamazoo on household
income and use it to make the decision whether or not to reject H0 .
Suppose a random sample of 100 households in Kalamazoo is selected and income is assessed for each
one. This sample of incomes has a mean of x̄ = $35,000 and a standard deviation of s = $15,000. The steps
for a test of significance at α = 5% are as follows:

(1) Step 1 was done.

(2) The test statistic is


Z = (35000 − 32000) / (15000/√100) = 2.00

(3) The alternative hypothesis is “greater than”, hence the p -value is

p -value = 1 - area for (Z < 2.00 ) = 1-0.977 = 0.023

The graph for the p-value is shown below:

(4) Since this p-value is quite small, we reject H0. Officially, by our rule above, the p-value is less than
0.05 so we reject H0 and say the result is significant. The logic is as follows: either H0 is true and we
have observed an extreme event (p = 0.023) or H0 is false and the event we observed (Z = 2.00) is not
unusual. We judge the latter is more reasonable.

(5) There is significant evidence to show the mean household income in Kalamazoo is greater than
$32,000.


One Population Proportion: Z-test


Hypothesis testing can also be done for a proportion, p, with a 0-1 box model. We start with a box with a
combination of 0’s and 1’s, and the proportion p is unknown.
Box or population is [ * * * , mean p unknown ]
Recall that proportions are part of a whole, and we can see them in either decimal (e.g. 0.35) or percentage
(e.g. 35%) form. We are interested in testing the population proportion p against a hypothesized value p0.

Similar to the mean, there are three sets of hypotheses about the proportion that we can consider:

H0 : p = p0 vs. HA : p ≠ p0
H0 : p ≤ p0 vs. HA : p > p0
H0 : p ≥ p0 vs. HA : p < p0

The Z test statistic for one population proportion is given by the formula:
Z = (estimate − hypothesized value) / (standard error of the estimate) = (p̂ − p0) / √(p0(1 − p0)/n)

where p̂ is the sample proportion.

Since the test on proportions is also a Z-test, the formulas for calculating p-values are the same as
those used for the Z-test for one population mean.

Example C
Consider the population of students at WMU this semester and say we are interested in the proportion p of
the students who use the Rec Center regularly. Last year’s survey showed that 10% of the students did so.
We suspect the rate is higher this semester and want to “prove” this. Note that there is a proportion p for
the current population and a “benchmark” value of 10%. We want to establish that p > 0.10. Our method
will be formal – to express the issue in the hypothesis testing way and base the decision on a sample of
data that is collected. The steps are as follows:

(1) The hypotheses of interest are


H0 : p ≤ 0.10
HA : p > 0.10
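(2) The test statistic is
Z = (p̂ − 0.10) / SE
where the SE for p̂ is SE = √(p(1 − p)/n) = √(0.10 · 0.90/100) = 0.03. Note the H0 value of p was used. With
a sample of n = 100 students and a sample proportion of p̂ = 0.12 (the values used in this calculation),
Z = (0.12 − 0.10) / 0.03 = 0.67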
(3) The alternative hypothesis is “greater than” hence the p-value is
p -value = 1 - area for (Z < 0.67 ) = 1-0.749 = 0.251

The graph for the p-value is shown below:


(4) Since the p -value is not small, not less than 0.05, we do not reject H0 . The data is not extreme
enough to lead us to reject H0 .
(5) There is not significant evidence to show the proportion of students who use the Rec Center
regularly is greater than 0.10.
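Example C can be checked with the following Python sketch. It takes n = 100 and p̂ = 0.12 from the worked calculation in step (2); the function name `proportion_z_test` is hypothetical, and the function handles only the right-tailed case used in this example.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(p_hat, p0, n):
    """Right-tailed Z-test of H0: p <= p0 against HA: p > p0."""
    se = sqrt(p0 * (1 - p0) / n)         # SE uses the H0 value p0
    z = (p_hat - p0) / se
    return z, 1 - NormalDist().cdf(z)    # right-tail area

z, p_value = proportion_z_test(0.12, 0.10, 100)
print(f"Z = {z:.2f}, p-value = {p_value:.3f}")
```

The p-value differs slightly from the 0.251 in the text because the table lookup rounds Z to 0.67 before finding the area; either way it is well above 0.05.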

Again, the methodology of significance testing is basically an argument by contradiction. We assume H0 is
true and show this leads to an unusual/untenable result (a small p-value), so we conclude H0 is false.
Note: A good research strategy is to organize things to put what we hope to dispute/reject in H0 . Put what
we want to show/prove in HA . In our situation, it’s easier in general to reject a hypothesis by showing it
leads to a false or at least rare conclusion than it is to establish that the hypothesis is true.

Some Comments on hypothesis testing


The situation for testing hypotheses can broadly be viewed in this two-way table.
                        Decision is to
True Hypothesis     Reject H0         Fail to Reject H0
H0                  Type I Error      No Error
HA                  No Error          Type II Error
The table shows that there are two ways of making an error, not just one. These two potential errors are
quite different and have different consequences.

Example: A Jury Trial


In any jury trial a decision must be made as to guilt or innocence. View this as
H0 : innocent
HA : guilty
The testimony at the trial, the data, will be used to choose H0 or HA . A Type I error is to convict an innocent
person. A Type II error is to set free a guilty person. These two potential errors are very different.


Example: Test a New Drug


Suppose the issue focuses on a mean µ and there will be data to test

H0 : µ ≤ 0 (the new drug is not effective)
HA : µ > 0 (the new drug is effective)
A Type I error is to declare the new drug to be effective when it is not (and then start to use it on patients
when it will not help). A Type II error is to declare the new drug to be ineffective when it really is
effective (and then miss the opportunity of using a good drug on patients).

On the Type I error


A Type I error is the event that H0 is rejected when it is true. The chance of a Type I error is denoted by α .
The chance α is called the significance level of the test. The testing procedure requires that the researcher
set the value of α at some small, known value. In our class, we are using α = 0.05, that is, we have a
0.05 = 1/20 chance of making a Type I error when H0 is true. This is a commonly accepted value in research
communities. Sometimes a researcher may choose a different value for α, like 0.10 (larger) or 0.01 (smaller).

On the Type II error


A Type II error is the event that H0 is not rejected when it is false. The chance of a Type II error is denoted by
β. Typically, the alternative hypothesis HA specifies a range of parameter values (e.g. HA : µ > 10) and there
is a β value for each parameter value in HA.

More Comments
It would be desirable to have both error probabilities, α and β, be small. Unfortunately, for a fixed sample
size they work against each other:

• If we set α to be smaller, then β gets larger.
• If we set β to be smaller, then α gets larger.

Research communities generally accept α = 0.05 as an acceptable level. Then they accept whatever values
result for β. For a fixed significance level α, as sample size increases the β error decreases. For a fixed
significance level α and a small sample size, the β error can be quite high.
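The trade-off between α and β can be illustrated numerically. The sketch below computes β for a right-tailed Z-test at one particular value of the true mean. It rests on assumptions that are not from the text: the numbers (µ0 = 100, true mean 103, σ = 15) are hypothetical, σ is treated as known, and the function name `type_ii_error` is made up.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu1, sigma, n, alpha=0.05):
    """beta for the right-tailed Z-test of H0: mu <= mu0 when the true mean is mu1 > mu0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)       # reject H0 when Z > z_crit
    shift = (mu1 - mu0) / (sigma / sqrt(n))      # distance from mu0 to mu1 in SE units
    return std_normal.cdf(z_crit - shift)        # chance Z stays below z_crit

# Hypothetical numbers: mu0 = 100, true mean 103, sigma = 15
for alpha in (0.10, 0.05, 0.01):
    for n in (25, 100):
        beta = type_ii_error(100, 103, 15, n, alpha)
        print(f"alpha={alpha:.2f}, n={n:3d}: beta={beta:.3f}")
```

Reading down the output for a fixed n, β grows as α shrinks; reading across for a fixed α, β falls as n increases.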

Example D
For a certain type of fluorescent light bulbs, it is claimed that their lifetimes average more than 1000 hours.
A study was done to test this claim. A random sample of 50 of the bulbs was selected and their lifetimes
had an average of 1050 hours with a standard deviation of 120 hours. What conclusion should be made?


(1) The hypotheses are
H0 : µ ≤ 1000
HA : µ > 1000
(2) The test statistic is the standardized sample mean, which has SE = 120/√50 = 16.97. The test statistic is
Z = (1050 − 1000) / 16.97 = 2.95

(3) The p-value is the chance a Z variable is above 2.95. From the normal table this is 1 − 0.998 = 0.002.
(4) Since the p-value is less than 0.05, we reject H0.
(5) There is significant evidence that the average lifetime is above 1000 hours.

Example E
A drug to treat a certain illness causes side effects in 20% of the patients. A study was done to test a new
drug for this illness. In this study of 200 patients there were 30 with side effects, a proportion of
30/200 = 0.15. Is this evidence of fewer side effects with the new drug?

(1)
H0 : p ≥ 0.20
HA : p < 0.20
where p is the proportion who get side effects.
(2) The test statistic is the standardized sample proportion, which has standard error
SE = √(0.20 · (1 − 0.20)/200) = 0.028
The test statistic is
Z = (0.15 − 0.20) / 0.028 = −1.786.
(3) The p-value is the chance a Z variable is less than −1.79, which is 1 − 0.963 = 0.037.
(4) Since the p-value is less than 0.05, we reject H0.
(5) There is significant evidence that the new drug causes fewer side effects.

One Population Mean: t-test


The test procedure that has been discussed for the case of a population mean µ is based on a Z test
statistic. The procedure requires a large sample size (n > 30), since it is based on the result that the sample
mean X̄ has an approximate normal sampling distribution for larger sample sizes. What can be done for
this problem when the sample size is small? In general this is a difficult problem. The answer, it turns out,
depends in a complicated way on the distribution of the numbers in the population. Suppose we make the
additional assumption that the population (the box) has a normal distribution, at least approximately. Then
for this case a test procedure, now called the t-test or Student’s t-test, was developed in 1908 by
W. S. Gosset, an employee of the Guinness Brewery in Dublin, Ireland. Gosset wrote a scientific paper on
his method under the pen name of “Student” to disguise his affiliation with the brewery.


The t-test procedure is essentially the same as the previous Z-test procedure, but with one change. We
switch to a new table, the t-table, to determine critical values and p-values for the test. The t-table is based
on a sampling distribution called the t-distribution. The t-distribution is similar in shape to the normal
distribution: it is bell-shaped and symmetric about zero, but it has higher (heavier) tails.

There is not one, but rather a whole family of t-distributions, indexed by a parameter called the
“degrees of freedom”, or df for short. There is a different t-distribution for each df. The t-table provides
a few selected upper tail areas under a t-distribution curve. The rows of the t-table index the degrees of
freedom. As the degrees of freedom get larger, the t-distribution gets closer to the normal distribution, and at
some point the areas under a t-distribution curve are the same as the corresponding areas under a standard
Normal curve to a satisfactory approximation. The t-table we use has degrees of freedom up to df = 50;
past this we will switch to the normal distribution. The t-table is provided at the end of this chapter.
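
To see the family of t-distributions approach the Normal curve, the following Python sketch computes the area above 2.0 for several degrees of freedom. It is only an illustration: the helper names `t_pdf` and `t_upper_tail` are our own, and the tail area is found by numerically integrating the standard t density with Simpson's rule rather than by any table.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area above t (t >= 0) under the t curve, by Simpson's rule on [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3         # half of the total area lies above zero

normal_tail = 1 - NormalDist().cdf(2.0)
tails = {df: t_upper_tail(2.0, df) for df in (1, 4, 9, 30, 50)}

for df, area in tails.items():
    print(f"df = {df:2d}: area above 2.0 = {area:.4f}")
print(f"Normal : area above 2.0 = {normal_tail:.4f}")
```

The printed areas shrink toward the Normal value as df grows, which is the sense in which the t-distribution has higher tails for small df.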

Example F
In manufacturing, when printing ink, the raw materials are placed in a large vat and mixed for several hours.
One crucial quantity is the viscosity of the ink. A testing instrument is used to measure the viscosity of the
ink by recording the time it takes for a ball to drop through a vertical tube of the ink. This “drop test” result
should be 20 for satisfactory viscosity, but sometimes the measurement is higher, indicating that ink is too
thick. Suppose that five 16oz samples of ink are selected from random locations in the vat and the drop test
is carried out on each. The sample viscosity values average 22.3 with a sample standard deviation of 2.5.
Should we conclude that the vat of ink is too thick (average viscosity above 20)? The steps of the testing
are as follows:

(1) The hypotheses to be tested are


H0 : µ ≤ 20
HA : µ > 20


(2) The test statistic is

$$t = \frac{22.3 - 20}{2.5/\sqrt{5}} = 2.057$$

(3) The p-value for the test is the area under the t-distribution curve above t = 2.057. There are
df = 5 − 1 = 4 degrees of freedom. Consulting the t-table in the row with df = 4, we see that our value of t =
2.057 is between the t-values of 1.533 and 2.132, which have tail areas of 0.10 and 0.05, respectively.
Thus, our p-value is

0.05 < p-value < 0.10

(4) Since the p-value is not less than 0.05, we fail to reject H0 .
(5) There is not significant evidence to show the average viscosity measurement is greater than 20.

Although our sample mean is above 20, it is not far enough above 20 to be significant. This data could be
due to chance error.
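
The same decision can be reached by comparing the test statistic with a critical value from the t-table. The Python sketch below is a minimal illustration; the function name `one_sample_t` is our own, and the critical value 2.132 is copied from the df = 4 row, 0.05 column of the t-table at the end of this chapter.

```python
from math import sqrt

def one_sample_t(sample_mean, mu0, sample_sd, n):
    """Return the one-sample t statistic and its degrees of freedom."""
    se = sample_sd / sqrt(n)
    return (sample_mean - mu0) / se, n - 1

# Example F: five ink samples, mean 22.3, SD 2.5, null value 20.
t_stat, df = one_sample_t(22.3, 20, 2.5, 5)

# Table value for df = 4 and an upper tail area of 0.05.
critical_value = 2.132

# HA is mu > 20, so we reject H0 only if t exceeds the critical value.
reject = t_stat > critical_value

print(round(t_stat, 3), df, reject)
```

Here the statistic falls short of the critical value, which matches the conclusion that the p-value is above 0.05.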

Example G
A car rental company has a fleet of cars, all of the same type, that were believed to have gas mileages that
average 30 mpg. An employee believed that this figure was too high and to get data to check his suspicion,
he selected 10 of the cars, at random, and carefully determined their mileage over a three week period. The
sample average was 27.5 with a standard deviation of 3.1. Does this data confirm his suspicion? The steps
of the testing are as follows.

(1) The hypotheses to be tested are


H0 : µ ≥ 30
HA : µ < 30

(2) The test statistic is

$$t = \frac{27.5 - 30}{3.1/\sqrt{10}} = -2.550$$

(3) Because HA is µ < 30, the p-value of the test is the area under the t-distribution curve below t = −2.55,
which by symmetry equals the area above 2.55. There are df = 10 − 1 = 9 degrees of freedom. Consulting the
t-table in the row with df = 9, we see that 2.55 is between the t-values of 2.262 and 2.821, which have tail
areas of 0.025 and 0.01, respectively. Thus our p-value is 0.01 < p-value < 0.025.
(4) Since the p-value is less than 0.05, we reject H0; the result is statistically significant. The data have
substantiated the suspicion of the employee. Note: if the p-value had been less than 0.01 we would
have stated that the result is statistically “highly significant”.
(5) There is significant evidence that the average miles per gallon is less than 30.
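
The table lookup in step (3) can be written as a small routine. The Python sketch below is illustrative only: `bracket_p_value` is a name we made up, the row of values is copied from the df = 9 row of the t-table at the end of this chapter, and the statistic is recomputed from the summary values of Example G.

```python
from math import sqrt

# Upper tail areas (column headings) and the df = 9 row of the t-table.
TAIL_AREAS = [0.2, 0.1, 0.05, 0.025, 0.01, 0.005, 0.001]
ROW_DF_9 = [0.883, 1.383, 1.833, 2.262, 2.821, 3.25, 4.297]

def bracket_p_value(t_abs, row, areas=TAIL_AREAS):
    """Return (low, high) bounds on the one-tail area beyond t_abs.

    None marks a bound that the table cannot supply.
    """
    low, high = None, None
    for t_crit, area in zip(row, areas):
        if t_abs >= t_crit:
            high = area      # the tail area is at most this column's area
        else:
            low = area       # first larger table value gives the lower bound
            break
    return low, high

# Example G: sample mean 27.5, null value 30, SD 3.1, n = 10.
t_stat = (27.5 - 30) / (3.1 / sqrt(10))

# HA is mu < 30, so the p-value is a lower tail; by symmetry we look up |t|.
low, high = bracket_p_value(abs(t_stat), ROW_DF_9)
print(f"{low} < p-value < {high}")
```

For a two-sided alternative, both bounds would be doubled.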


EXERCISES
1. If for the hypothesis test H0 : µ ≤ 0 vs HA : µ > 0, the test statistic, obtained from a sample, is
Z = (X̄ − 0)/SE = −0.5, then the p-value will be the shaded area from which of the following graphs?

2. (Hypothetical) A company considers moving its headquarters to one of a short list of cities. In one
of these cities, the mayor, much interested in having the company’s headquarters relocate to his city,
claims that his city has an above average high school graduation level for adults aged 18 – 45. The
national average is reported to be 77%. In a random sample of 64 city residents ages 18 – 45, 79%
had graduated from high school. Is there enough evidence in the sample to support the mayor’s
claim? State the null and alternative hypothesis, test statistic, p-value (or range, if using a t-test), and
conclusion. (Hint: recall that sample data is used as evidence against / to reject the null hypothesis.)
3. If true, state it. If false, say why:
(A) Using the same positive test statistic from a given data sample, and a level of significance (0.05),
it is possible that you could reject the two sided hypothesis (H0 : µ = k vs. HA : µ ≠ k) and yet fail
to have enough evidence to reject the one sided hypothesis (H0 : µ ≤ k vs. HA : µ > k).
(B) While it is difficult to find the exact p-value for a particular t-statistic given our tables, we know
that, for a fixed sample size, the larger the t-statistic and hypothesis H0 : µ ≤ 0 vs. HA : µ > 0 the
smaller the p-value.
(C) In a hypothesis test, a low p-value means the observed sample is fairly likely to occur assuming
the null hypothesis is true.
(D) For a null hypothesis H0 : µ ≤ 0 vs. HA : µ > 0 a sample size of 14 and an associated t-statistic of
2.15 (population is normally distributed), the p-value is between 0.025 and 0.01.
4. A cigarette industry spokesperson remarks that the current levels of tar are no more than 5 milligrams
per cigarette. A reporter does a quick check on 15 cigarettes representing a cross section of the
market and finds a sample mean of 5.63 milligrams with a standard deviation of 1.61 milligrams (this
is SD, not s). Assuming that the actual industry wide distribution of tar in cigarettes is approximately
normally distributed, what do you make of the spokesperson’s claim? State the null and alternative
hypothesis, test statistic, p-value (or range if using a t-test), and conclusion.


5. Seventh grade students in a school district scored an average of 75 on a standardized math test. In
the following year, a special class of 30 randomly selected seventh graders used a new textbook and
class method for math and on the same standard test they scored an average of 78 with an SD of 8.
Is this good evidence that the new method is successful in increasing math test scores? Carry out a
test of appropriate hypotheses to reach your conclusion. Note: this is not a very good experimental
plan to study this issue.
6. Suppose that for a random sample of size 25, taken from a normal population, the sample average is
0.75 and s = 2. Test the following hypothesis, H0 : µ = 0 vs. Ha : µ ≠ 0. Include a test statistic, p-value
and conclusion for a significance level of 0.05. What conclusion do you draw?
7. (Hypothetical) A sociologist is studying the effect of having children in the first two years of marriage
on the divorce rate. Using hospital birth records, a random sample of 76 couples who had a child in
their first two years of marriage was obtained. Following these couples, it was found that 38 divorced
in the first five years. If the national proportion of all couples that divorce in the first five years is
40%, perform a test of significance to see if the proportion among those having children in first two
years is different. Include your hypotheses, test statistic, p-value (or range and df) and conclusion.
8. A business is interested in knowing what proportion of the city in which they are located would be
interested in “Zipcars”, a form of car rental of “shared cars”. They would seriously consider investing
in this technology if the proportion of seriously interested city residents is greater than 30%. From a
random sample of 100 city residents, 33 responded that they would be seriously interested. Should
the company seriously consider investing? Perform a significance test at significance level 0.05,
stating the hypotheses, test statistic, p-value and conclusion.
9. (Hypothetical) It is known, from early studies, that 40% of the patients undergoing a certain common
surgical procedure experience a negative side effect. Post-surgical treatment A was introduced which
reduced the proportion of patients experiencing the side effect to 35%. A pharmaceutical company
proposes new treatment B, and claims that it performs better than treatment A. A random sample of
120 patients undergoing the surgical treatment was given treatment B and 35 of them experienced
the side effect. Perform a test of significance at 5% level to determine if the claim is valid. Include
hypotheses, test statistic, p-value and conclusion.
10. A company that manufactures breakfast cereals claims that the average weight of their cereal boxes
is at least 26.8oz. A quality control officer drew a random sample of 15 boxes for inspection to verify
the company’s claim. The hypotheses are as follows: H0 : µ ≥ 26.8 vs. HA : µ < 26.8. If the sample mean
is 25.3, and the null hypothesis is not rejected, what kind of error (if any) occurred?

REVIEW EXERCISES
11. Which of the following is (are) TRUE about a hypothesis test for one population mean?
I. The Z-test statistic is used when the sample size is more than 30.
II. A one-tailed test is performed when the alternative hypothesis is “not equal to”.
(a) I only
(b) II only
(c) Both I and II
(d) Neither I nor II


12. Suppose we are conducting a hypothesis test for one population mean. We use two samples with the
same mean and same standard deviation, however, the sample size of sample 1 is larger than that of
sample 2. Which of the following statements is correct?
(a) The test statistic obtained using sample 1 is less than that obtained using sample 2.
(b) The test statistic obtained using sample 1 is greater than that obtained using sample 2.
(c) The test statistic obtained using sample 1 is the same as that of sample 2.
(d) We do not have enough information to answer this question.
13. (TRUE or FALSE) When the null hypothesis is not rejected, then we can assume the null hypothesis
is certainly true. Explain your answer.
14. (Hypothetical) A company produces circuit boards for a particular computer part with a claim that the
width is 2.9 inches. It is important that the width be this measurement, not bigger or smaller, so that
integrated parts can properly be installed. A random sample of 16 of the circuit boards was taken
and measured to have an average = 2.9425 inches and s = 0.08 inches. Let µ = the true average width
of this circuit board. Assuming the actual widths of the circuit boards are approximately normally
distributed, test the hypothesis H0 : µ = 2.9 vs. Ha : µ ≠ 2.9. Include a test statistic (hint: don’t round
off), df, p-value and conclusion.
15. A chemist is interested in the amount of impurities in a liquid that she could produce by a chemical
process. She was concerned that the liquid would not be satisfactory for use if the impurity level
went above 12. To test her process of making the liquid, she made 9 samples of the liquid and found
the sample impurities had an average of 17 with a SD of 6. Carry out a test of the hypotheses H0 : µ ≤
12 vs. Ha : µ > 12, where µ is the average impurity level for her process.
16. (Hypothetical) A city spokesperson says the mean response time for arrival of a fire rescue team in
response to a 911 call is 11 minutes. A newspaper reporter suspects that the response time is longer
and runs a test of 64 fire rescue situations. She calculates the sample mean to be 12.2 minutes
with a standard deviation of 5 minutes. What should she conclude? Indicate the null and alternative
hypothesis, calculate a test statistic and determine a p-value (or range if using the t-test), and state
your conclusion.
17. For commercial airplanes a “repair ticket” is written when a problem is noticed and the work is
actually done at some convenient time when the plane is being serviced. Of course, serious problems
affecting the safety or ability to fly are handled immediately if not sooner. To study the delay time until
a repair is made, a random sample of 50 repair tickets was selected from the January records on all
planes of a company. The time to complete the repairs had a mean of 4.65 days and a SD of 2.0 days.
Carry out a test of significance for testing H0 : µ ≥ 5 vs. Ha : µ < 5. Here µ is the population mean
repair time.
18. A student leader believes that 60% of the students oppose a development at Asylum Lake. To check
this claim, a survey of 200 students is carried out, and it was found that 100 students in the sample
opposed Asylum Lake development. Is there enough evidence to reject the student leader’s belief?
Include your hypotheses, test statistic, p-value and conclusion.
19. Suppose a researcher wants to conduct a two-sided hypothesis test for population proportion. She
randomly selects two samples with equal sample sizes. Looking at the sample proportions of the
two groups, we found the sample proportion of sample 1 is closer to the hypothesized proportion (P0 )
compared to the sample proportion of sample 2. Select the correct statement from the following:
(a) p-value obtained using sample 1 is greater than the p-value obtained using sample 2.
(b) Absolute value of the test statistic obtained using sample 1 is greater than the absolute value
of the test statistic obtained using sample 2.
(c) p-value obtained using sample 1 is smaller than the p-value obtained using sample 2.
(d) Not enough information to answer this question.


20. (TRUE or FALSE) A Type II error is committed when we fail to reject a false null hypothesis.
21. A new diet pill was formulated to reduce weight among obese adults. The manufacturer of the said
pill claims that their product was able to reduce the weights of at least 80% of the participants in their
clinical trial. A second trial was conducted to verify this claim using different sets of participants, and
it was found that the pill was effective in 83% of the participants. If the null hypothesis was rejected,
what type of error (if any) occurred?

t-table

t distribution table (entries are t-values; column headings are upper tail areas)

df 0.2 0.1 0.05 0.025 0.01 0.005 0.001


1 1.376 3.078 6.314 12.706 31.821 63.657 318.309
2 1.061 1.886 2.92 4.303 6.965 9.925 22.327
3 0.978 1.638 2.353 3.182 4.541 5.841 10.215
4 0.941 1.533 2.132 2.776 3.747 4.604 7.173
5 0.92 1.476 2.015 2.571 3.365 4.032 5.893
6 0.906 1.44 1.943 2.447 3.143 3.707 5.208
7 0.896 1.415 1.895 2.365 2.998 3.499 4.785
8 0.889 1.397 1.86 2.306 2.896 3.355 4.501
9 0.883 1.383 1.833 2.262 2.821 3.25 4.297
10 0.879 1.372 1.812 2.228 2.764 3.169 4.144
11 0.876 1.363 1.796 2.201 2.718 3.106 4.025
12 0.873 1.356 1.782 2.179 2.681 3.055 3.93
13 0.87 1.35 1.771 2.16 2.65 3.012 3.852
14 0.868 1.345 1.761 2.145 2.624 2.977 3.787
15 0.866 1.341 1.753 2.131 2.602 2.947 3.733
16 0.865 1.337 1.746 2.12 2.583 2.921 3.686
17 0.863 1.333 1.74 2.11 2.567 2.898 3.646
18 0.862 1.33 1.734 2.101 2.552 2.878 3.61
19 0.861 1.328 1.729 2.093 2.539 2.861 3.579
20 0.86 1.325 1.725 2.086 2.528 2.845 3.552
21 0.859 1.323 1.721 2.08 2.518 2.831 3.527
22 0.858 1.321 1.717 2.074 2.508 2.819 3.505
23 0.858 1.319 1.714 2.069 2.5 2.807 3.485
24 0.857 1.318 1.711 2.064 2.492 2.797 3.467
25 0.856 1.316 1.708 2.06 2.485 2.787 3.45
26 0.856 1.315 1.706 2.056 2.479 2.779 3.435
27 0.855 1.314 1.703 2.052 2.473 2.771 3.421
28 0.855 1.313 1.701 2.048 2.467 2.763 3.408
29 0.854 1.311 1.699 2.045 2.462 2.756 3.396
30 0.854 1.31 1.697 2.042 2.457 2.75 3.385
31 0.853 1.309 1.696 2.04 2.453 2.744 3.375
32 0.853 1.309 1.694 2.037 2.449 2.738 3.365
33 0.853 1.308 1.692 2.035 2.445 2.733 3.356
34 0.852 1.307 1.691 2.032 2.441 2.728 3.348
35 0.852 1.306 1.69 2.03 2.438 2.724 3.34
36 0.852 1.306 1.688 2.028 2.434 2.719 3.333
37 0.851 1.305 1.687 2.026 2.431 2.715 3.326
38 0.851 1.304 1.686 2.024 2.429 2.712 3.319
39 0.851 1.304 1.685 2.023 2.426 2.708 3.313
40 0.851 1.303 1.684 2.021 2.423 2.704 3.307
41 0.85 1.303 1.683 2.02 2.421 2.701 3.301
42 0.85 1.302 1.682 2.018 2.418 2.698 3.296
43 0.85 1.302 1.681 2.017 2.416 2.695 3.291
44 0.85 1.301 1.68 2.015 2.414 2.692 3.286
45 0.85 1.301 1.679 2.014 2.412 2.69 3.281
46 0.85 1.3 1.679 2.013 2.41 2.687 3.277
47 0.849 1.3 1.678 2.012 2.408 2.685 3.273
48 0.849 1.299 1.677 2.011 2.407 2.682 3.269
49 0.849 1.299 1.677 2.01 2.405 2.68 3.265
50 0.849 1.299 1.676 2.009 2.403 2.678 3.261
Chapter 14

The Two Sample Problem

One of the most important problems in statistics is the comparison of two populations. Recall the earlier
discussion in the first chapter on the value of making a comparison when studying an issue.

Two Proportions
As a first case, we will focus on the comparison of two proportions. Suppose we have data on the
proportions from two populations; that is, we have two sample proportions and we want to make a conclusion
comparing the two population proportions. To simplify, we consider their difference as our focus. The
following are some examples.

• In a clinical trial, we have a treatment group and a control group. For each group, we note the
proportion of individuals who experienced side effects. The difference between these two proportions
is our main statistic in assessing if the population proportions differ. Note the distinction made
between the proportions in the two populations and the proportions observed in the sample data.
• Suppose in studying the smoking habits of high school students, we are interested in any difference
between males and females. A survey was conducted involving sampling 200 males and 200 females,
at random, and asking them if they had smoked in the last month. The difference between the
sample proportions who had smoked is our primary statistic in assessing any difference between
the proportions who smoked in the full populations.
• Do drivers in Kalamazoo County wear seat belts at the same rate in city driving vs. in highway
driving? To gain some data on this issue, a scheme was devised to randomly observe 500 cars
driving in the city and another 500 cars driving on county highways. It was easy to observe if the
driver was wearing a seat belt or not. Suppose it was found that 60% of the city drivers and 75% of
the highway drivers were wearing seat belts when observed. What then is our assessment for the
original question? Note again the distinction between the population proportions and the sample
proportions in our data.

In general, we can model situations like these with two boxes representing the populations and two sampling
operations in drawing from each box. Looking at one box and its sample, we know a lot about how this
works from our earlier study of the sample proportion from a 0 – 1 box. Now we have two such items and
we need to put the information together in a proper way to focus on the difference in proportions.


                     POPULATION A                              POPULATION B
Box                  0 – 1 Box A                               0 – 1 Box B
Proportion of 1’s    $p_A$                                     $p_B$
Mean                 $p_A$                                     $p_B$
SD                   $\sqrt{p_A (1 - p_A)}$                    $\sqrt{p_B (1 - p_B)}$

                     SAMPLE A                                  SAMPLE B
Sample size          $n_A$                                     $n_B$
Sample proportion    $\hat{p}_A$                               $\hat{p}_B$
Standard error       $\sqrt{\hat{p}_A (1 - \hat{p}_A)/n_A}$    $\sqrt{\hat{p}_B (1 - \hat{p}_B)/n_B}$

We are assuming that the two samples are independent. That is, the two separate populations are randomly
sampled by independent, unrelated processes. Note how we are carefully distinguishing in the notation
used between the population quantities like pA and pB and the corresponding sample quantities like p̂A and
p̂B .
Our hypotheses are
H0 : pA − pB = 0 vs. Ha : pA − pB ≠ 0
or equivalently
H0 : pA = pB vs. Ha : pA ≠ pB

We are using equal and not equal signs, since we are looking for differences. We use inequalities if it is
specified that one is bigger or smaller than the other.
To compare the two population proportions, we focus on their difference. It is natural to use the
sample difference as an estimate of the population difference. The sampling distribution of the
sample difference p̂A − p̂B has
$$EV = p_A - p_B$$
$$SE = \sqrt{SE_{\hat{p}_A}^2 + SE_{\hat{p}_B}^2}$$
where
$$SE_{\hat{p}_A} = \sqrt{\frac{\hat{p}_A (1 - \hat{p}_A)}{n_A}} \quad \text{and} \quad SE_{\hat{p}_B} = \sqrt{\frac{\hat{p}_B (1 - \hat{p}_B)}{n_B}}$$
(use the earlier result for the standard error of the sample proportions here) and the shape is approximately
Normal for larger sample sizes (nA, nB > 30). This layout summarizes the main results to use for our
applications.
Our test statistic is formed in the same way as in the previous chapter:

$$Z = \frac{(\hat{p}_A - \hat{p}_B) - 0}{SE}$$

The process for hypothesis testing itself remains the same as in the previous chapter.

Example A
In a traffic safety study, data was collected by observing cars driving in a city and, among other things, they
noted whether or not the driver was wearing a seat belt. Among the male drivers, they noted that 99 out of
150 (99/150 = 0.66) were wearing a seat belt and they found that 146 out of 200 (146/200 = 0.73) female drivers
were wearing a seat belt. Is this evidence of a difference in seat belt usage between males and females in
the overall population of drivers?


To examine this issue, let M represent the male group and F represent the female group. We have

p̂M = 0.66 and p̂F = 0.73

with a difference of p̂F − p̂M = 0.73 - 0.66 = 0.07 or 7%


The two standard errors are estimated from the data as
$$SE \text{ of } \hat{p}_F = \sqrt{\frac{0.73 \cdot (1 - 0.73)}{200}} = \sqrt{\frac{0.73 \cdot 0.27}{200}} = 0.0314$$
$$SE \text{ of } \hat{p}_M = \sqrt{\frac{0.66 \cdot (1 - 0.66)}{150}} = \sqrt{\frac{0.66 \cdot 0.34}{150}} = 0.0387$$
Then the difference p̂F − p̂M has SE given by
$$\sqrt{(0.0314)^2 + (0.0387)^2} = 0.0498$$

To carry out a test of the hypothesis H0 : pF − pM = 0 vs. HA : pF − pM ≠ 0, the test statistic is the standardized
difference in sample proportions
$$Z = \frac{(\hat{p}_F - \hat{p}_M) - 0}{SE} = \frac{0.07}{0.0498} = 1.41$$
The p-value for this Z statistic is the area under the standard Normal curve above 1.41, which is 1 − 0.921
= 0.079. However, this is a two-sided test, so we need to multiply this value by 2, meaning the p-value is
0.158. A p-value of 0.158 is not significant (it is not less than 0.05) and so we fail to reject the hypothesis
H0 : pF − pM = 0, that is, we have no reason to conclude the population proportions differ.
Suppose we want a 95% confidence interval for the population difference pF − pM. The usual format for a
95% confidence interval is

estimate ± (1.96 · SE)

which in our case here is

0.07 ± (1.96 · 0.0498)
0.07 ± 0.10
(−0.03, 0.17)
So with 95% confidence, we say that the true difference pF − pM is between -0.03 and 0.17. Note that
this interval contains zero, that is, the case pF = pM , hence based on the confidence interval, there is no
significant difference between the two proportions.
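
The calculations of Example A can be reproduced with the short Python sketch below. It follows the formulas of this section; the function name `two_proportion_z` is our own, and the Normal areas come from the standard library rather than the printed table, so the two-sided p-value comes out near 0.160 instead of the table-based 0.158.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Return (difference, SE, Z) for comparing two independent proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    se_a = sqrt(p_a * (1 - p_a) / n_a)
    se_b = sqrt(p_b * (1 - p_b) / n_b)
    se = sqrt(se_a ** 2 + se_b ** 2)   # combine the two SEs as in the text
    diff = p_a - p_b
    return diff, se, diff / se

# Females (146 of 200) minus males (99 of 150), as in Example A.
diff, se, z = two_proportion_z(146, 200, 99, 150)

# Two-sided alternative: double the area beyond |Z|.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval: estimate plus or minus 1.96 standard errors.
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(round(diff, 2), round(se, 4), round(z, 2), round(p_two_sided, 3))
print(tuple(round(v, 2) for v in ci))
```

The interval covers zero exactly when the two-sided test at the 0.05 level fails to reject, which is why the two methods agree here.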

Example B
In a health study, a treatment group of 200 people received diet information and performed moderate
exercise every day. At the end of six months, it was found that 168 had lowered their cholesterol level.
A parallel control group of 200 people received diet information only and at the end of the six month period,
150 had lowered their cholesterol level. Participants in the study were volunteers over age 65. They were
assigned to the treatment (TMT) or control (CTL) group at random. Is the difference observed statistically
significant, thus indicating that the exercise is effective in lowering cholesterol, or could such a difference
be just chance variation?
The sample proportions are
$$\hat{p}_{TMT} = \frac{168}{200} = 0.84, \qquad \hat{p}_{CTL} = \frac{150}{200} = 0.75,$$
with a difference of 0.09 or 9%.


The two standard errors are estimated from the data as

$$SE \text{ of } \hat{p}_{TMT} = \sqrt{\frac{0.84 \cdot 0.16}{200}} = 0.026$$
$$SE \text{ of } \hat{p}_{CTL} = \sqrt{\frac{0.75 \cdot 0.25}{200}} = 0.031$$
The difference p̂TMT − p̂CTL has a standard error given by
$$\sqrt{(0.026)^2 + (0.031)^2} = 0.040$$

To test whether the treatment group has the higher population proportion, H0 : pTMT − pCTL ≤ 0 vs.
HA : pTMT − pCTL > 0, the test statistic is the standardized difference in sample proportions
$$Z = \frac{(\hat{p}_{TMT} - \hat{p}_{CTL}) - 0}{SE} = \frac{0.09}{0.04} = 2.25.$$
The p-value for this Z statistic is the area under the standard Normal curve above 2.25, which is 1 − 0.988
= 0.012. A p-value of 0.012 is statistically significant (it is less than 0.05) and so we reject the hypothesis
H0 : pTMT − pCTL ≤ 0 and conclude that moderate exercise helps in reducing cholesterol level.
Suppose we want a 95% confidence interval for the population difference pTMT − pCTL. The usual format for
a 95% confidence interval is
estimate ± (1.96 · SE)
which in our case here is
0.09 ± (1.96 · 0.04)
0.09 ± 0.08
(0.01, 0.17)
Thus, we are 95% confident that the true difference in proportions is between 0.01 and 0.17. Note that this
interval does not contain 0, hence indicating a significant difference in the proportions between the two
groups.

Note: p̂TMT and p̂CTL refer to the sample data as opposed to the populations.
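
The rule for combining the two standard errors can be checked by simulation. The Python sketch below is only an illustration under stated assumptions: it treats the observed proportions 0.84 and 0.75 as if they were the true box proportions, and the function name `simulate_differences`, the number of repetitions, and the random seed are arbitrary choices of ours.

```python
import random
from math import sqrt
from statistics import pstdev

def simulate_differences(p_a, n_a, p_b, n_b, reps=2000, seed=1):
    """Draw two independent samples reps times; return the differences in sample proportions."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        hits_a = sum(rng.random() < p_a for _ in range(n_a))  # draws from 0-1 box A
        hits_b = sum(rng.random() < p_b for _ in range(n_b))  # draws from 0-1 box B
        diffs.append(hits_a / n_a - hits_b / n_b)
    return diffs

# Assumption: use the sample proportions of Example B as the box proportions.
diffs = simulate_differences(0.84, 200, 0.75, 200)

simulated_se = pstdev(diffs)   # spread of the simulated differences
formula_se = sqrt(0.84 * 0.16 / 200 + 0.75 * 0.25 / 200)

print(round(simulated_se, 3), round(formula_se, 3))
```

The spread of the simulated differences should come out close to the 0.040 obtained from the formula, which is what the SE is meant to describe.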

Two Means
With a quantitative variable, a problem of considerable interest is to compare the means of the two populations.
Are the population means equal?
                    POPULATION A        POPULATION B
Box                 Box A: [ *** ]      Box B: [ **** ]
Mean                µA                  µB
SD                  σA                  σB

                    SAMPLE A            SAMPLE B
Sample size         nA                  nB
Sample mean         X̄A                  X̄B
Sample std. dev.    sA                  sB


The two samples are assumed to be independent. The sampling distribution of the sample difference X̄A − X̄B
has
$$EV = \mu_A - \mu_B$$
$$SE = \sqrt{(SE \text{ of } \bar{X}_A)^2 + (SE \text{ of } \bar{X}_B)^2}$$
and the shape is approximately Normal for larger sample sizes (nA, nB > 30). Recall from our one sample
work that
$$SE \text{ of } \bar{X}_A = \frac{s_A}{\sqrt{n_A}}, \qquad SE \text{ of } \bar{X}_B = \frac{s_B}{\sqrt{n_B}}$$
These two quantities use the sample standard deviations as “estimates” of the corresponding population
standard deviations. Ideally, we would like to use the population standard deviations in these SE formulas,
but they are unknown.
A standardized sample mean difference can be expressed as

$$Z = \frac{(\bar{X}_A - \bar{X}_B) - (\mu_A - \mu_B)}{SE}$$
which has an approximate Normal distribution for larger sample sizes.

Example C
In a clinical trial to study the efficacy of a new drug, the researchers used a control group of size 104 and a
treatment group of size 80. At baseline, they measured many physical parameters and made comparisons
between the two groups to see if there were differences that would be significant. One of the variables
measured was white blood cell count (WBC) in ppt. The summary statistics were
                Treatment Group (A)    Control Group (B)
Sample size     nA = 80                nB = 104
Sample mean     X̄A = 5.6               X̄B = 5.1
Sample SD       sA = 1.7               sB = 2.2
We see a difference in the sample means – is this difference statistically significant? We are viewing the
data as two samples from a population of patients eligible to be treated. The hypotheses being tested are
H0 : µA − µB = 0 vs. Ha : µA − µB ≠ 0. The difference of sample means is

X̄A − X̄B = 5.6 − 5.1 = 0.5 with

$$\begin{aligned}
SE &= \sqrt{(SE \text{ of } \bar{X}_A)^2 + (SE \text{ of } \bar{X}_B)^2} \\
   &= \sqrt{\left(\frac{1.7}{\sqrt{80}}\right)^2 + \left(\frac{2.2}{\sqrt{104}}\right)^2} \\
   &= \sqrt{0.190^2 + 0.216^2} \\
   &= 0.288
\end{aligned}$$
Note this is an “estimated” standard error. The test statistic is
$$Z = \frac{(\bar{X}_A - \bar{X}_B) - 0}{SE} = \frac{0.5}{0.288} = 1.74$$


The area above Z = 1.74 for the Normal curve is 1 − 0.959 = 0.041. Since we have a two-sided alternative
hypothesis, the p-value is twice this area, p-value = 2 · 0.041 = 0.082. This is not significant at the 0.05 level
and so we do not reject the null hypothesis. We conclude that there is no significant evidence that the two
populations have different average WBC.
A 95% confidence interval for the difference between the population means is

(X̄A − X̄B) ± (1.96 · SE)
0.5 ± (1.96 · 0.288)
0.5 ± 0.564
(−0.064, 1.064)

Thus, we are 95% confident that the true difference in the average WBC between the two groups is between
-0.064 and 1.064. This confidence interval contains 0, hence there is no significant difference in the
averages.
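
Example C can be recomputed from its summary statistics with the Python sketch below. The function name `two_mean_z` is our own label, and the Normal areas come from the standard library rather than the printed table, so the last digit of the p-value may differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def two_mean_z(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (difference, estimated SE, Z) for two independent large samples."""
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)   # (s_A/sqrt(n_A))^2 + (s_B/sqrt(n_B))^2
    diff = mean_a - mean_b
    return diff, se, diff / se

# Example C: treatment group (A) and control group (B) white blood cell counts.
diff, se, z = two_mean_z(5.6, 1.7, 80, 5.1, 2.2, 104)

# Two-sided alternative: double the area beyond |Z|.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval for the difference in population means.
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(round(se, 3), round(z, 2), round(p_two_sided, 3))
print(tuple(round(v, 3) for v in ci))
```

Note that the null value of the difference (zero here) is subtracted in the numerator; a different hypothesized difference would replace it.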

The method covered above, based on the Z variable, is suitable when both sample sizes nA and nB are
large. If at least one of the sample sizes is less than 30, a
modified procedure has been proposed. Use the same basic steps, but
• denote the test statistic by t rather than Z
• use a t distribution in finding the p-value, rather than the Normal distribution
• use degrees of freedom df = smaller of {nA − 1 and nB − 1}
This procedure is approximate and conservative. For the confidence interval method, use the t distribution
rather than the normal distribution in setting the constant. To get the constant from the t-table we follow
these steps:
• Calculate the level of significance α as 1 − confidence level (e.g., for a 95% confidence interval,
α = 1 − 0.95 = 0.05)
• On the t-table, look at the α/2 column, at the df row
• The constant is the table value where the α/2 column and the df row intersect.
Other modified methods based on the t distribution have been developed for this problem, but we will not
go into this in detail. Consult other statistics texts for reference to these methods. The modified method for
small sample sizes still requires an assumption that the population distributions be normally distributed, or
at least approximately so. When this assumption is not met the two sample problem becomes very complex
and specialized methods are required.
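
The steps for finding the confidence interval constant can be sketched in Python as follows. This is a hypothetical illustration: the names `T_TABLE`, `conservative_df` and `ci_constant` are ours, only three rows and three columns of the chapter's t-table are copied in, and the sample sizes 12 and 20 are made up for the example.

```python
# A few entries copied from the t-table: df -> {upper tail area: t-value}.
T_TABLE = {
    9:  {0.05: 1.833, 0.025: 2.262, 0.005: 3.250},
    11: {0.05: 1.796, 0.025: 2.201, 0.005: 3.106},
    26: {0.05: 1.706, 0.025: 2.056, 0.005: 2.779},
}

def conservative_df(n_a, n_b):
    """Degrees of freedom: the smaller of n_a - 1 and n_b - 1."""
    return min(n_a - 1, n_b - 1)

def ci_constant(confidence, n_a, n_b, table=T_TABLE):
    """Return (df, constant) for a two-sample confidence interval."""
    alpha = 1 - confidence
    df = conservative_df(n_a, n_b)
    column = round(alpha / 2, 3)   # rounding removes floating-point noise
    return df, table[df][column]

# Hypothetical sample sizes 12 and 20 with a 95% confidence level.
df, constant = ci_constant(0.95, 12, 20)
print(df, constant)
```

A sample-size pair whose df is not among the copied rows would raise a `KeyError`; in practice the full table would be consulted.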

Example D
A study was done on the mercury level in pan fish in a local lake. There was a concern that this level has
been increasing over time due to increased power plant emissions far upwind. They randomly sampled 50
fish and samples of their flesh were analyzed in a laboratory for mercury level. The summary statistics were
Sample size nA = 50
Sample mean X̄A = 15.1 ppt
Sample SD sA = 6.5 ppt
They found historical data from a similar study 5 years ago which had the results

Sample size nB = 27
Sample mean X̄B = 10.3 ppt
Sample SD sB = 5.2 ppt
A concern was expressed that the laboratory techniques had changed over the five year interim and
this could cause differences in the measurements, but it was decided to ignore this potential bias. The
hypotheses H0 : µA − µB = 0 vs. Ha : µA − µB > 0 would be appropriate to test for an increase in average mercury
levels over time.

The difference of sample means is

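\[
\bar{X}_A - \bar{X}_B = 15.1 - 10.3 = 4.8
\]
with
\[
\begin{aligned}
SE &= \sqrt{(\text{SE of } \bar{X}_A)^2 + (\text{SE of } \bar{X}_B)^2} \\
   &= \sqrt{\left(\frac{6.5}{\sqrt{50}}\right)^2 + \left(\frac{5.2}{\sqrt{27}}\right)^2} \\
   &= \sqrt{0.919^2 + 1.00^2} \\
   &= 1.358
\end{aligned}
\]
The test statistic is
\[
t = \frac{4.80}{1.358} = 3.53
\]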
To get the p-value, we need to use 26 degrees of freedom (the smaller of 50 − 1 and 27 − 1). Hence, the p-value is
less than 0.001. With this small p-value we reject H0 strongly and conclude there has been an increase in average
mercury level in the lake over the five year span.
To construct a 95% confidence interval, we calculate α = 1 − 0.95 = 0.05, so α/2 = 0.025. This means that we
look at the 0.025 column of the t-table. Recall that the degrees of freedom for this test is 26; hence the
constant at which the 0.025 column and the df = 26 row intersect is 2.056.

Using this value, a 95% confidence interval for the difference in average mercury level is calculated as

\[
\begin{aligned}
(\bar{X}_A - \bar{X}_B) \pm (2.056 \cdot SE) &= 4.8 \pm (2.056 \cdot 1.358) \\
&= 4.8 \pm 2.792 \\
&= (2.008,\ 7.592)
\end{aligned}
\]

Thus, we are 95% confident that the true difference in the averages between the two groups is between
2.008 and 7.592. This interval does not contain 0, hence indicating a significant difference between the two
groups.
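
The steps of Example D can be retraced in code. The following is a sketch under stated assumptions: the variable names are illustrative, and the constant 2.056 is simply copied from the t-table value quoted above rather than computed. Because the code carries full precision instead of rounding the two SEs first, the last digit may differ slightly from the hand calculation.

```python
import math

# Summary statistics from Example D (mercury level, ppt)
n_a, mean_a, sd_a = 50, 15.1, 6.5   # current sample
n_b, mean_b, sd_b = 27, 10.3, 5.2   # historical sample

diff = mean_a - mean_b
# SE of the difference: square root of the sum of the squared SEs
se = math.sqrt((sd_a / math.sqrt(n_a)) ** 2 + (sd_b / math.sqrt(n_b)) ** 2)
t_stat = diff / se
df = min(n_a, n_b) - 1              # conservative degrees of freedom

t_star = 2.056                      # t-table constant, 0.025 column, df = 26 (from the text)
margin = t_star * se
ci = (diff - margin, diff + margin)

print(f"difference = {diff:.1f}, SE = {se:.3f}, t = {t_stat:.2f}, df = {df}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```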

Dependent Groups
This is another case involving two means. The data may come from the same individuals measured at two
different times, or from two groups whose members have a logical pairing.
Since each observation in one group is paired with an observation in the other, the difference within each
pair is computed. The sampling distribution of the mean of the differences (x̄d) then has

\[
EV = \mu_d, \qquad SE = \frac{s_d}{\sqrt{n}}
\]

where n is the number of pairs and sd is the standard deviation of the differences.
Assuming that the number of pairs is greater than 30, we standardize and compute the Z test
statistic using the formula
\[
Z = \frac{\bar{X}_d - \mu_d}{SE}
\]
It is important to note that a dependent groups problem is really the same as a one-mean problem, where
the one population of interest is the population of differences.

Example E
Will eating oatmeal promote healthy levels of cholesterol? A consumer reports analyst took a sample of
44 people with high cholesterol and asked them to eat oatmeal once a day for 3 months. Measurements
were taken of their cholesterol levels before and after the 3 months in mg/dL. The analyst is testing whether
the cholesterol levels after the diet are different from the cholesterol levels before the diet. If the analyst
calculated the mean difference in cholesterol levels (after − before) to be 0.48 mg/dL with a standard deviation
of 6.35 mg/dL, can we conclude a significant difference?
For this test, our hypotheses are
H0 : µd = 0 vs. Ha : µd ≠ 0

Since we have a large sample size, we will have a Z test statistic.

\[
Z = \frac{0.48 - 0}{6.35 / \sqrt{44}} = 0.5
\]

The area under the normal curve above Z = 0.5 is 1 − 0.691 = 0.309. But because we have a two-sided test,
we need to double our probability, resulting in a p-value of 0.618. Since this is a large p-value, we fail to
reject the null hypothesis. We cannot conclude there is a difference in cholesterol levels before and after
the diet.
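
As a check on Example E, here is a small sketch of the paired Z test using only the standard library. The names are illustrative. The code keeps Z unrounded, so its p-value (about 0.616) differs slightly from the 0.618 obtained above with Z rounded to 0.5.

```python
import math
from statistics import NormalDist

n = 44            # number of pairs
mean_d = 0.48     # mean of the differences (after - before), mg/dL
sd_d = 6.35       # standard deviation of the differences, mg/dL
mu_0 = 0.0        # value of mu_d under H0

se = sd_d / math.sqrt(n)
z = (mean_d - mu_0) / se

# Two-sided test: double the area beyond |z| under the standard normal curve.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"SE = {se:.3f}, Z = {z:.2f}, p-value = {p_value:.3f}")
```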

Example F
We want to test if a certain weight loss pill is effective. This study used 5 obese men having the same type
of diet and activity. Their weight (in lbs.) before starting the program and after 2 weeks were recorded as
follows:
Subject   Before   After
1         250      240
2         380      350
3         277      262
4         250      245
5         360      340
How do we know if the pill was effective? We first compute the difference for each pair of observations, the
before weight minus the after weight.

Subject   Before   After   Difference
1         250      240     10
2         380      350     30
3         277      262     15
4         250      245     5
5         360      340     20

This gives us X̄d = 16 and sd = 9.618. To see if the pill is effective, the hypotheses are

H0 : µd ≤ 0 vs. Ha : µd > 0

Because we have a small sample size, we have a test statistic

\[
t = \frac{16 - 0}{9.618 / \sqrt{5}} = 3.72
\]

Based on this, we can conclude our p-value is between 0.01 and 0.025 with 4 degrees of freedom. This is a
small p-value so we reject the null hypothesis. There is significant evidence to show the weight loss pill is
effective.
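
The summary values in Example F can be recomputed directly from the five pairs. This is a minimal sketch with illustrative variable names; it stops at the t statistic and degrees of freedom, since the p-value range is read from the t-table as described above.

```python
import math
from statistics import mean, stdev

before = [250, 380, 277, 250, 360]
after = [240, 350, 262, 245, 340]

# Differences are taken as before - after, so a positive value is a weight loss.
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
mean_d = mean(diffs)
sd_d = stdev(diffs)            # sample SD (divides by n - 1)
se = sd_d / math.sqrt(n)
t_stat = (mean_d - 0) / se     # H0 value of mu_d is 0
df = n - 1

print(diffs)
print(f"mean = {mean_d}, sd = {sd_d:.3f}, t = {t_stat:.2f}, df = {df}")
```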

EXERCISES
1. (Hypothetical) Suppose that a drug company has a new drug for epilepsy and claims that it reduces
seizures in epileptics. A group of 162 patients suffering from 1–2 seizures per month (under non-medicated
conditions) is selected and randomly divided into two groups of 81 each. In a double-blind
study, group A is given the new drug and group B is given the old drug. At the end of three
months, group A had 54 patients who had no seizures, while group B had 42. Does the drug reduce
seizures? Let p = proportion of people who suffered no seizures during the three month study.
(A) Which of the following hypothesis tests should be used to test the company’s claim (and hence
answer the question)?
(a) H0 : pNEW > pOLD vs. Ha : pNEW < pOLD
(b) H0 : pNEW ≤ pOLD vs. Ha : pNEW > pOLD
(c) H0 : pNEW ≥ pOLD vs. Ha : pNEW < pOLD
(d) H0 : pNEW < pOLD vs. Ha : pNEW ≥ pOLD
(B) The standard error for the difference of the proportions, pNEW − pOLD , in problem 1 is
(a) \(0.54^2 + 0.42^2\)
(b) \(\sqrt{\dfrac{0.54 \cdot 0.46}{81} + \dfrac{0.42 \cdot 0.58}{81}}\)
(c) \(\sqrt{\dfrac{\frac{54}{81} \cdot \frac{27}{81}}{81} + \dfrac{\frac{42}{81} \cdot \frac{39}{81}}{81}}\)
(d) \(\sqrt{\dfrac{(54-48)^2}{81} + \dfrac{(42-48)^2}{81}}\)
(C) Calculate a test statistic, p-value and conclusion with respect to the company’s claim in problem 1.
2. A medical researcher wishes to see whether the pulse rates of smokers are higher than the pulse
rates of nonsmokers. Samples of 100 smokers and 100 nonsmokers are selected. The results are
shown below. Can the researcher conclude that smokers have higher pulse rates than nonsmokers?

Smokers Nonsmokers
mean = 90 mean = 88
SD = 5 SD = 6
n = 100 n = 100

3. (Hypothetical) A company claims that a gas additive it produces increases gas mileage. To test this
claim, 36 identical model cars were filled with 10 gallons of gas. These cars were then run on a test
course under identical driving conditions until they had only one half of a gallon of gas remaining
(indicated by a sensor in the gas tank), at which time they recorded the mileage. The cars then
had their gas tanks drained and refilled with 10 gallons of gas which contained one can of the gas
additive. Again, the cars ran on the test course as before and recorded their mileage when only one
half of a gallon remained in the tank. In the first run, with gas alone, the cars had an average mileage
of 224 with a SD = 13. In the second run, the average mileage was 230 with a SD = 10. Is there
enough evidence to support the company’s claim? Test the hypothesis H0 : µAdditive ≤ µNoAdditive vs. Ha :
µAdditive > µNoAdditive . (You must include a test statistic, p-value and conclusion.)

4. Two groups of students are given a problem-solving test, and the results are compared. Find the 95%
confidence interval of the true difference in means.
Nursing majors Business majors
mean = 83.6 mean = 79.2
SD = 4.3 SD = 3.8
n = 36 n = 36

5. A comparison of two groups, A and B, was conducted using two samples. Sample one, from A, had
a sample size of 20. Sample two, from B, had a sample size of 26. Given H0 : µA = µB vs. Ha : µA ≠ µB , a
significance level of 0.05, and a test statistic for the difference t = 2.0, find a p-value and state whether
you reject or do not reject H0 .
6. In a sample of 80 workers from a factory in city A, it was found that 5% were unable to read, while
in a sample of 50 workers in city B, 8% were unable to read. Can it be concluded that there is a
difference in the proportions of nonreaders in the two cities? Answer this question by constructing
a 95% confidence interval for the difference of the two proportions.
7. (Hypothetical) A pharmaceutical company has a new drug for lowering total cholesterol levels. While
efficacy was determined in previous clinical trials, the company wants to know if it works the same
for females and males. Two random samples of patients suffering from high cholesterol levels were
taken, 81 males and 64 females, and the new drug was administered to each participant. After six
months, 39 of the males and 22 of the females showed significant reduction (as predetermined by
the experimenters) in total cholesterol levels. Does the drug work differently between males and
females? Perform a significance test at significance level = 0.05. Include hypotheses, test statistic,
p-value, and conclusion.
8. (Hypothetical) Researchers compared two groups of competitive rowers: a group of skilled rowers
and a group of novices. The researchers measured the angular velocity of each subject’s right knee,
which describes the rate at which the knee joint opens as the legs push the body back on the sliding
seat. The data is given below:

SKILLED NOVICE
Average 4.6 3.8
Standard Deviation 0.65 0.85
n 25 25

Perform a test of significance to determine if the average knee velocity of skilled rowers is greater
than that of novice rowers. Include your hypotheses, test statistic, p-value (or range thereof and df)
and conclusion.
9. A gym instructor wants to determine if the program he designed for his clients is effective in reducing
their weight. He compared the weights of 10 randomly selected clients before the program, and
measured it again 6 weeks after. Which test should the instructor perform to determine if his program
is effective?
(a) t-test on two population means, independent samples
(b) z-test on two population means, independent samples
(c) t-test on two population means, dependent samples
(d) z-test on two population means, dependent samples
10. Refer to problem 9. Suppose the average difference in weights before and after the program was
found to be 10.5 with a standard deviation of 7.75. Is this enough evidence to conclude that the
program is effective in reducing weight? Note that the difference was calculated as before weight -
after weight. Complete the five steps in hypothesis testing to determine if the program is effective at
5% level of significance.

REVIEW EXERCISES
11. (TRUE or FALSE) When we compare two means, for both independent and dependent samples, the
sample sizes of the two groups can be different from each other.
12. Select the incorrect statement about the confidence intervals for the difference of two proportions
(P1 − P2 ) and (P2 − P1 )
(a) Both confidence intervals have the same width.
(b) Both confidence intervals have the same margin of error.
(c) Both confidence intervals have the same upper and lower bounds.
13. (TRUE or FALSE) If the difference between two sample means is significant, then this is evidence that
the two samples come from populations with equal means.
14. (TRUE or FALSE) The two assumptions necessary to test for a difference in two population proportions
are: the sample sizes are large enough, and the samples are independent random samples from the
two populations.
15. A large computer service company which employs thousands of phone technicians (who assist customers
over the phone) did a study in which researchers interviewed technicians to assess if they were
experiencing back pain at their monitor. The 128 people found to be suffering from back pain while sitting at
their monitors took part in the study; each was given a “new” chair identical in every visible aspect to
their old chair except in color. Secretly, 64 of the recipients were randomly chosen to receive a chair
with the seat cushion containing a “gel” center, while the others received the same “old” chair. After
three weeks, 20 of the people with the “gel” seats reported having reduced back pain compared to
only 12 of those having the “regular” seats. Is there enough evidence to support the claim that the
gel seats reduce back pain? Perform a test of significance at significance level 0.05, showing H0 , Ha ,
test statistic, p-value, and conclusion.
16. A test is run to determine whether auto A has better gas mileage than auto B. Independent simple
random samples of size 30 autos of type A and 50 of type B were taken with the following results:
x̄A = 28.0 SDA = 4.7
x̄B = 25.3 SDB = 1.5
State the populations in question, the null and alternative hypothesis, test statistic, p-value (or range
if using a t-test), and conclusion.
17. (Hypothetical) In one study, 1500 people with Lynch syndrome – an inherited condition that predisposes
a person to a range of cancers, particularly of the colon – were randomly divided into groups of 750,
one group given aspirin and the other placebo. Follow-up tests after ten years, done on a double-blind
basis, showed a difference with fewer people in the aspirin group developing colon cancer. At
the time of testing, there were only six colon cancers in the aspirin group and 16 in the placebo group.
Is there a statistically significant difference between the treatment (aspirin) and the control (placebo)
group? State your hypotheses, perform the test, find a p-value, and state your conclusion.
18. A survey of 1000 students nationwide showed a mean ACT score of 21.4 with a standard deviation of
4.85. A survey of 500 Ohio scores showed a mean of 20.8 with a standard deviation of 3.68. Can we
conclude that Ohio is below the national average?
19. A recent survey of 200 households showed that 8 had a single male as the head of household. Forty
years ago, a survey of 200 households showed that 6 had a single male as the head of household. Can
it be concluded that the proportion has changed? Find the 95% confidence interval of the difference
of the two proportions. Does the confidence interval contain zero? Why is this important to know?
20. The average price of a sample of 12 bottles of diet salad dressing taken from different stores is $1.43
with a standard deviation of $0.09. The average price of a sample of 16 low-calorie frozen desserts
is $1.03 with a standard deviation of $0.10. Find the 95% confidence interval for the difference in the
means. Based on the confidence interval, is there a significant difference in price?

21. An investigator for NASA examines the effect of cabin temperature on reaction time. A random
sample of 10 astronauts was selected, and their reaction time to an emergency light is measured in a
simulator where the cabin temperature is maintained at 70 degrees Fahrenheit, and again the next day
at 95 degrees Fahrenheit. The average difference in reaction times was found to be 215 milliseconds,
with a standard deviation of 98.2 milliseconds (Note: difference is calculated as reaction time at 70
degrees - reaction time at 95 degrees). Does the data provide enough evidence to conclude that
astronauts react faster at higher temperature? Test at 1% level of significance.
22. A legislative committee wants to determine if there is a significant difference in tax revenue
between the proposed new tax law and the existing tax law. The average difference in tax revenue of
100 representative tax returns was found to be -219, with a standard deviation of 725. Is this enough evidence
to conclude a significant difference in the average between the proposed and the existing law? Test
at 5% level of significance.

Chapter 15

One Categorical Variable / The Chi-Square Test

Suppose our experiment or data collection process gives a categorical variable for its outcome. A categorical
variable has a finite list of possible outcomes, not necessarily numerical, which we will label as 1, 2, 3, ... ,
k (for 1st , 2nd , 3rd ,. . . , kth values), where k is the number of categories.

Examples
An opinion poll of registered voters on a bond issue – outcomes are

• favor = 1
• opposed = 2
• no opinion = 3
k = 3 categories, the numbers are just labels for the opinion cases

Class standing of a sample of university students – outcomes are

• freshman = 1
• sophomore = 2
• junior = 3
• senior = 4
k = 4 categories

Assessed result of a medical procedure – outcomes are

• improved = 1
• worsened = 2
• no change = 3
k = 3 categories

A box model can be used to represent an outcome for the experiment or to represent a population of
individuals classified by a categorical variable. For instance, in a population of voters say that 40% of the
voters favor an issue, 52% of them are opposed to the issue and 8% have no opinion. Then, let 1 = favor, 2
= opposed, and 3 = no opinion. The following box of numbers would represent this population
[ 40% 1’s 52% 2’s 8% 3’s ]
In situations such as this, the box of numbers would look like this in general:
[ values 1, 2, 3, . . . , k with proportions p1 , p2 , p3 , . . . , pk ]
where
p1 is the proportion of 1’s in the box
p2 is the proportion of 2’s in the box
...
pk is the proportion of k’s in the box
Suppose, to represent a sample from the population, we consider making n draws at random, with replacement,
from the above box of numbers. The data can be summarized by recording the frequencies of occurrences.
Observed Data:

Values       1    2    3    ...   k
Frequency    O1   O2   O3   ...   Ok     (∑Oj = n)
Inference for a single proportion:
If we focus on a single category, say the jth, the population proportion pj would be estimated with the usual
sample proportion Oj/n, and a confidence interval or test of hypotheses about pj could be produced using our
earlier formulas for the case of a single proportion.

The chi-square test


Suppose individuals in a population are classified into k categories labeled 1, 2, . . . , k. A box model can be
used to represent the population:
[ Values        1    2    3    . . .   k  ]
[ Proportions   p1   p2   p3   . . .   pk ]
where
p1 is the proportion of 1’s in the box
p2 is the proportion of 2’s in the box
...
pk is the proportion of k’s in the box
Now suppose we take a random sample of size n and record the frequencies of the categories in the sample:
Observed Data:

Values       1    2    3    ...   k
Frequency    O1   O2   O3   ...   Ok     (∑Oj = n)
From our experience with a single proportion, we know a sample frequency (count/sum) has an expected
value of n · p. So the expected values, EV, of our list of sample frequencies are

Ej = n · pj for j = 1, 2, . . . , k

Consider a hypothesis that the proportions in the population (box) equal some specified values p1∗, p2∗, p3∗, . . . , pk∗:

H0 : p1 = p1∗, p2 = p2∗, p3 = p3∗, . . . , pk = pk∗
The corresponding alternative hypothesis is
Ha : at least one proportion differs from its specified value.
Under the hypothesis H0 , the expected frequencies are

E1 = n · p1∗, E2 = n · p2∗, . . . , Ek = n · pk∗

We summarize our information by the table

Values 1 2 3 ... k
Observed Frequency O1 O2 O3 ... Ok ∑Oj = n
Expected Frequency E1 E2 E3 ... Ek ∑Ej = n
The observed and expected frequencies are compared by the chi-square test statistic

\[
\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}
\]

Under H0 , it has approximately a chi-square distribution with degrees of freedom df = k − 1, and this distribution
is used to determine the p-value of the test. We reject H0 if χ2 is “large”.
The chi-square distribution is new for us. There is a family of chi-square distributions indexed by a parameter
called the degrees of freedom, d f for short. A chi-square distribution curve begins at 0 and has a right-skewed
shape. The graphs below show two chi-square curves. The graph at right shows the p -value as the area
above a chi-square value.
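
For readers who want to go beyond the printed chi-square table, the statistic and its upper-tail area can be sketched in Python's standard library. The function names below are hypothetical (not from the text), and the counts in the usage lines are made up purely for illustration. The p-value routine starts from a df of 1 or 2, where the area has a simple closed form, and adds one term for each further step of 2 in the degrees of freedom; it is a sketch for whole-number df, not a replacement for a statistics package.

```python
import math

def chi_square_statistic(observed, proportions):
    """Return (chi-square statistic, degrees of freedom) for the H0 proportions."""
    n = sum(observed)
    expected = [n * p for p in proportions]          # E_j = n * p_j
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1                   # df = k - 1

def chi_square_p_value(chi2, df):
    """Area under the chi-square curve with `df` degrees of freedom above `chi2`."""
    y = chi2 / 2.0
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)                  # exact for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))       # exact for df = 1
    # Each pass raises the degrees of freedom by 2.
    while a < df / 2.0:
        area += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return area

# Made-up counts for three equally likely categories (illustration only).
chi2, df = chi_square_statistic([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])
print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {chi_square_p_value(chi2, df):.3f}")
```

A large p-value, as in this made-up case, means the observed counts are well within chance variation of the expected ones.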

Example: A Political Poll


A candidate for a city office is interested in the support he has among the voting population. The population
is categorized by

               favor   opposed   unsure
Category         1        2         3
Proportions      p1       p2        p3
A sample of 500 voters is selected at random from the population and the following data was observed
               favor   opposed   unsure
Category         1        2         3
Frequency       230      190        80      Sum = 500
Prior to this survey, the belief was that the proportions for the voting population were favor = 40%, opposed
= 40%, and unsure = 20%. Since that time, the candidate had engaged in extensive efforts to make
himself and his views on issues known to the public. Has this effort made any change in the voters’
opinions?
The hypothesis at the time of the poll is

H0 : p1 = 0.40, p2 = 0.40, p3 = 0.20

If H0 is true, the expected frequencies for a sample of size 500 are

E1 = 500 · 0.40 = 200,
E2 = 500 · 0.40 = 200,
E3 = 500 · 0.20 = 100.
Our data is then summarized by
                      favor   opposed   unsure
Category                1        2         3
Observed Frequency     230      190        80
Expected Frequency     200      200       100
The chi-square test statistic is
\[
\chi^2 = \frac{(230-200)^2}{200} + \frac{(190-200)^2}{200} + \frac{(80-100)^2}{100}
       = 4.5 + 0.5 + 4.0 = 9.0
\]

With df = 3 − 1 = 2, the chi-square p-value is between 0.01 and 0.025. This would lead us to reject H0 ; there
has been a change in voter opinion.
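
The poll calculation can be reproduced in a short sketch. The variable names are illustrative. The closed form used for the p-value is specific to 2 degrees of freedom, where the area above a chi-square value x is exp(−x/2); for other df the table (or a general routine) is needed.

```python
import math

observed = [230, 190, 80]                 # favor, opposed, unsure
h0_proportions = [0.40, 0.40, 0.20]

n = sum(observed)
expected = [n * p for p in h0_proportions]
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(terms)
df = len(observed) - 1

# With df = 2 the area above chi2 has the closed form exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2)

print("terms:", [round(t, 2) for t in terms])
print(f"chi-square = {chi2:.1f}, df = {df}, p-value = {p_value:.4f}")
```

The computed p-value falls between 0.01 and 0.025, in agreement with the table lookup.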

Example: Is a die fair?


Suppose, a die is tossed 100 times and the following frequencies are observed for the six faces.

Outcome 1 2 3 4 5 6
Observed frequencies 10 22 15 19 20 14
For a fair die, the six outcomes are equally likely, yet here we see a considerable difference. One of the
outcomes occurred more than twice as often as another. Is this evidence that the die is not fair? To check
this, we formally test the hypothesis
H0 : p1 = 1/6, p2 = 1/6, p3 = 1/6, p4 = 1/6, p5 = 1/6, p6 = 1/6
where pj is the probability of outcome j.
The expected frequencies, if H0 is true, are each

100 · (1/6) = 16.67
Thus, we have the observed and expected frequencies

Outcome 1 2 3 4 5 6
Observed frequencies 10 22 15 19 20 14
Expected frequencies 16.67 16.67 16.67 16.67 16.67 16.67
The chi-square test statistic is a comparison of the observed and expected frequencies (one term for each
cell)

\[
\begin{aligned}
\chi^2 &= \frac{(10-16.67)^2}{16.67} + \frac{(22-16.67)^2}{16.67} + \frac{(15-16.67)^2}{16.67}
        + \frac{(19-16.67)^2}{16.67} + \frac{(20-16.67)^2}{16.67} + \frac{(14-16.67)^2}{16.67} \\
       &= 2.669 + 1.704 + 0.167 + 0.326 + 0.665 + 0.428 \\
       &= 5.958
\end{aligned}
\]

The degrees of freedom are df = 6 − 1 = 5. The p-value is the area under a chi-square curve with 5 df above
5.958. Consulting the chi-square table, we see the p-value is above 0.20. Since the p-value is not small (it is not
under 0.05), we fail to reject H0 . That is, the data does not deviate enough from the H0 expectations to allow
us to reject H0 . We cannot conclude that the die is unbalanced.
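
A sketch of the die calculation follows, with illustrative variable names. It uses the unrounded expected frequency 100/6, so the statistic comes out as 5.96 rather than the 5.958 obtained above with 16.67. The p-value lines are an assumption-laden shortcut that applies only to df = 5: they start from the df = 1 area (an `erfc` value) and add one term for each step of 2 in the degrees of freedom.

```python
import math

observed = [10, 22, 15, 19, 20, 14]       # counts for faces 1 through 6
n = sum(observed)
expected = n / 6                          # same expected count for every face
chi2 = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1

# Upper-tail area for df = 5, built up from the df = 1 case.
y = chi2 / 2
p_value = math.erfc(math.sqrt(y))
for a in (0.5, 1.5):
    p_value += y ** a * math.exp(-y) / math.gamma(a + 1)

print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.2f}")
```

The result is consistent with the table reading that the p-value is above 0.20.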

Example: Clerical Errors


A company studied a problem it was having with clerical errors in its invoices. Records were examined
and they found the following proportions for the number of clerical invoice errors per day:
Errors per day 0 1 2 3 4
Proportion 0.10 0.20 0.25 0.20 0.25
Following this, the company instituted changes in its procedures and extensive training of its staff. Then,
after a break-in period, they checked the number of clerical errors occurring for 70 days. The following data
was observed:
Errors per day          0    1    2    3    4
Observed frequencies    8   30   20    5    7
Note that the frequencies sum to n = 70.
Is this evidence that the proportions for the number of errors per day have changed following the efforts
made to improve? The formal hypothesis is about the proportions for the 5 categories, that they have
remained the same

H0 : p1 = 0.10 , p2 = 0.20 , p3 = 0.25 , p4 = 0.20 , p5 = 0.25


The expected frequencies under the assumption that H0 is true are found by multiplying the hypothesized
proportions by n = 70 (the rule is Ej = n · pj ). The results are

Errors per day          0    1    2      3    4
Observed frequencies    8   30   20      5    7
Expected frequencies    7   14   17.5   14   17.5
The chi-square test statistic compares the observed to the expected frequencies (one term for each cell)

\[
\begin{aligned}
\chi^2 &= \frac{(8-7)^2}{7} + \frac{(30-14)^2}{14} + \frac{(20-17.5)^2}{17.5} + \frac{(5-14)^2}{14} + \frac{(7-17.5)^2}{17.5} \\
       &= 0.143 + 18.286 + 0.357 + 5.786 + 6.300 \\
       &= 30.872
\end{aligned}
\]

The degrees of freedom are df = 5 − 1 = 4. The p-value is the area under a chi-square curve with 4 df above
30.872. Consulting the chi-square table, we see the p-value is less than 0.001. Since the p-value is under
0.01, we reject H0 . The result is highly significant. We conclude that there has been a change in
the frequencies of errors per day. Examining the observed frequencies, there are higher frequencies at the
lower values. Thus, it seems the error rate has been reduced as a result of the changes that were instituted.
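
To see which categories drive the result, the per-cell contributions can be listed alongside the statistic. This is a sketch with illustrative names; the closed form for the p-value holds only for df = 4, where the area above x is exp(−y)(1 + y) with y = x/2.

```python
import math

h0_proportions = [0.10, 0.20, 0.25, 0.20, 0.25]   # errors per day: 0, 1, 2, 3, 4
observed = [8, 30, 20, 5, 7]

n = sum(observed)
expected = [n * p for p in h0_proportions]
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(terms)
df = len(observed) - 1

# With df = 4 the area above chi2 is exp(-y) * (1 + y), where y = chi2 / 2.
y = chi2 / 2
p_value = math.exp(-y) * (1 + y)

# The largest contributions show where observed and expected counts disagree most.
for errors, (o, e, t) in enumerate(zip(observed, expected, terms)):
    print(f"{errors} errors/day: observed {o:2d}, expected {e:5.1f}, contribution {t:6.3f}")
print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.2e}")
```

The cell for 1 error per day contributes by far the most, which matches the shift toward lower error counts noted above.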

EXERCISES
1. Data was collected on the racial composition of a jury panel that was selected in a county in California.
The following numbers were reported:

Race County-wide Percentage Number of Jurors on the Panel


White 40 % 34
Black 30 % 12
Hispanic 20 % 9
Other 10 % 5
100% 60

Can the jury be judged as a random sample from the population?


(A) State an appropriate null hypothesis H0 .
(B) What are the expected frequencies for the race categories under H0 ?
(C) How many degrees of freedom does the χ 2 statistic have?
(D) Calculate the value of the χ 2 statistic, find a p-value, and state your conclusion (α = 0.05).
2. At a university, the grading policy specifies that the grading methods of its courses should attain the
following percentages, on average:

Grade Category Percentage


High-A, B/A, B 30%
Middle-C/B, C 50%
Low-D/C, D, E 20%

A class of 50 students has the following distribution: High = 10, Middle = 27, Low = 13. While this
deviates from the university standard, the professor argues this is just chance variation. Is he right?
Carry out a test of significance at 5% significance level.
3. A restaurant assesses the distribution of alcoholic beverage orders to be as follows: beer: 30%, wine:
55%, mixed drink: 10%, liqueur: 5%. From a random sample of 120 alcoholic drink orders, they obtain
the following drink counts: beer = 30, wine = 75, mixed drink = 13, liqueur = 2. Is this sample within
chance variation, or should the restaurant consider a different distribution?
(A) State an appropriate null hypothesis H0 .
(B) What are the expected frequencies for the drink categories under H0 ?
(C) How many degrees of freedom does the χ 2 statistic have?
(D) Calculate the value of the χ 2 statistic, find a p-value, and state your conclusion (α = 0.05).
4. A university collected data on home football game attendance to see if it is equally distributed among
undergraduate class rank (freshman, sophomore, junior, senior). In a random sample of 100 students
attending home games, the following was measured:

Class Rank # of Undergraduate Home Game Attendance


Freshman 30
Sophomore 19
Junior 22
Senior 29

Is there evidence to show attendance in a home game is equally distributed among class rank? (α =
0.05).
5. Suppose that a population is hypothesized to be 50% male, and that from a random sample of size
50, 30 males were observed (and hence 20 females).
(A) Calculate a χ2 statistic and p-value based upon the hypothesis H0 : pmale = 0.5, pfemale = 0.5.
(B) Calculate a Z statistic and p-value based upon the hypothesis H0 : pmale = 0.5 vs. Ha : pmale ≠ 0.5.
(C) Do you notice any relationship between the statistics and the p-values?
6. A researcher wants to determine whether the distribution of purchases among cereal brands has
changed during the past few years. According to a study done a few years ago, the distribution of
purchases among cereal brands was as follows:

Brand Purchasing percentage


A 35%
B 10%
C 20%
D 15%
E 20%

According to the most recent study, the number of purchases for brand A is 22, B is 13, C is 15, D is 26,
and E is 24. Perform a test of significance. (α = 0.05)
7. According to past year records of patients suffering from depression, the distribution according to
their social class is as follows:
Social class Lower Middle Upper
Percentage 18% 59% 23%

Number of patients for each social class at the end of this year is as follows:

Social class Lower Middle Upper


Number of Patients 102 219 111
(A) State an appropriate null and alternative hypothesis.
(B) What are the expected numbers of patients for social class categories under H0 ?
(C) Calculate the χ2 test statistic.
(D) How many degrees of freedom does the χ 2 statistic have? Find the p-value.
(E) State your decision and conclusion at 0.05 level of significance.
8. In a standard 52-card deck, there are 13 ranks in each suit. The suits are Clubs, Spades, Hearts and
Diamonds. A card is drawn at random from the deck, its suit is noted, and the card is returned. This process
is repeated 50 times. There were 14 Clubs, 12 Spades, 10 Hearts and 14 Diamonds. Is there enough
evidence to show that the suits are not equally likely to be drawn? (α = 0.05)

REVIEW EXERCISES:
9. Which of the following statement(s) is (are) TRUE about the chi-square test for one categorical variable?


I. If the null hypothesis is rejected, we have evidence to show the proportions are the same as
what we expected.
II. The degrees of freedom of the test is calculated as n − 1.
a. I only b. II only c. Both I and II d. Neither I nor II
10. A local sports merchant carries the particular golf shirt that Tiger Woods wears at the Master’s
tournament, and believes that the color of the shirt actually worn by Tiger in the tournament will
influence which color(s) sells more in the month following the tournament. Prior to the tournament,
the sales of this shirt were approximately equally distributed among the colors red, white, aqua and
yellow. During the tournament, Tiger wore only the red and aqua colors, and the merchant saw, from
its string of stores, the following sales of the shirt for the month following the tournament: red =
34, white = 20, aqua = 30, and yellow = 16. Is there enough evidence to support the belief of the
merchant?
(A) State an appropriate null hypothesis H0 .
(B) What are the expected frequencies for the color categories under H0 ?
(C) How many degrees of freedom does the χ 2 statistic have?
(D) Calculate the value of the χ 2 statistic, find a p-value, and state your conclusion (α = 0.05).
11. (Hypothetical) According to an online magazine study approximately in 2009, for every $100 the
average American consumer spends, the approximate distribution of after tax expenditures is as
follows: groceries – $12, shelter – $34, transportation – $18, insurance & health care – $17 (seems
low, but for sake of this exercise), and other (clothing, entertainment, misc., savings) – $19. In a study
done in a small mid-western city, it was found that for every $100 of after-tax monthly income, it was
spent in the following way: groceries – $18, shelter – $28, transportation – $13, insurance & health
care – $25, and other – $16. Would this town seem “out of the ordinary”, or are the differences within
the expectations due to random variations? Use a χ 2 test with 0.05 level of significance.
12. A study of local blood types, involving a group of 200 people, revealed the following data:

Blood Type Estimated U.S. distr. Observed


O+ 37% 85
A+ 36% 78
Other(B+,AB+, O-, A-,B-,AB-) 27% 37

Would the observed sample qualify as a random sample, based on the estimated distribution? (α =
0.05).
13. (Hypothetical) From a previous study, the registered voters in a particular district were 48% Republican,
41% Democrat and 11% independent. Prior to the most recent election, a random sample of 150 voters
showed 64 Republicans, 58 Democrats, and 28 independents. Is there enough evidence to suggest
that the distribution among registered voters has changed since the previous study? Use a 0.05 level
of significance.
14. A box contains 2 red balls, 2 green balls, 2 blue balls, and 2 yellow balls. A ball is taken at random
from the box, its color is noted, and the ball is returned to the box. This process is repeated 30 times. If we want
to test the claim that the distribution of the color of the ball drawn is not equal among the 4 colors,
which of the following statements is correct?
(a) H0 : The proportion of each color drawn is 0.25.
Ha : The proportion of each color drawn is not 0.25.
(b) The degrees of freedom of the test is 29.
(c) The expected values of the color of the balls drawn are the same.
(d) H0 : The proportion of each color drawn is 0.125
Ha : The proportion of each color drawn is not 0.125


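Chi-square table

Each entry is a chi-square value; the column heading is the area under the chi-square curve above that value.

df 0.2 0.1 0.05 0.025 0.01 0.005 0.001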


1 1.642 2.706 3.841 5.024 6.635 7.879 10.828
2 3.219 4.605 5.991 7.378 9.21 10.597 13.816
3 4.642 6.251 7.815 9.348 11.345 12.838 16.266
4 5.989 7.779 9.488 11.143 13.277 14.86 18.467
5 7.289 9.236 11.07 12.833 15.086 16.75 20.515
6 8.558 10.645 12.592 14.449 16.812 18.548 22.458
7 9.803 12.017 14.067 16.013 18.475 20.278 24.322
8 11.03 13.362 15.507 17.535 20.09 21.955 26.124
9 12.242 14.684 16.919 19.023 21.666 23.589 27.877
10 13.442 15.987 18.307 20.483 23.209 25.188 29.588
11 14.631 17.275 19.675 21.92 24.725 26.757 31.264
12 15.812 18.549 21.026 23.337 26.217 28.3 32.909
13 16.985 19.812 22.362 24.736 27.688 29.819 34.528
14 18.151 21.064 23.685 26.119 29.141 31.319 36.123
15 19.311 22.307 24.996 27.488 30.578 32.801 37.697
16 20.465 23.542 26.296 28.845 32 34.267 39.252
17 21.615 24.769 27.587 30.191 33.409 35.718 40.79
18 22.76 25.989 28.869 31.526 34.805 37.156 42.312
19 23.9 27.204 30.144 32.852 36.191 38.582 43.82
20 25.038 28.412 31.41 34.17 37.566 39.997 45.315
21 26.171 29.615 32.671 35.479 38.932 41.401 46.797
22 27.301 30.813 33.924 36.781 40.289 42.796 48.268
23 28.429 32.007 35.172 38.076 41.638 44.181 49.728
24 29.553 33.196 36.415 39.364 42.98 45.559 51.179
25 30.675 34.382 37.652 40.646 44.314 46.928 52.62
26 31.795 35.563 38.885 41.923 45.642 48.29 54.052
27 32.912 36.741 40.113 43.195 46.963 49.645 55.476
28 34.027 37.916 41.337 44.461 48.278 50.993 56.892
29 35.139 39.087 42.557 45.722 49.588 52.336 58.301
30 36.25 40.256 43.773 46.979 50.892 53.672 59.703
Chapter 16

Two-Dimensional Tables

Case : One Population With Two Categorical Variables


The focus is on a population of objects (people, animals, cars, restaurants, schools, cities, counties, etc).
Each object is classified by two categorical variables. Examples of categorical variables are

• gender: male or female


• color: red or white or green or . . .
• education: less than high school or high school degree or bachelors degree or graduate degree
• opinion: favor or opposed or undecided
• outcome: success or failure
• quality: good or defective
• grade: A or B or C or D or E
• disease: present or absent
• age: 0 – 9 or 10 – 19 or 20 – 29 or . . .
• dose level: low or medium or high
• group: treatment or control

Visualize a population box of chips where each chip has 2 values on it, one for each variable. A chip
corresponds to an object in the population. We describe the population box by a 2-way table of proportions.
The two categorical variables define the rows and columns of the table. The table below illustrates this
layout with the row variable called Group (A or B) and the column variable called Response (1 or 2).

Example
300 patients in a group were classified as to their result for a medical procedure. There were 100 men with
60 successes (60/100 = 0.60) and 200 women with 150 successes (150/200 = 0.75). Summarize this data with
the following two-way table.


Proportions for a Population Box


                  Response
Group             Success   Failure   row sums
Men               0.60      0.40      1
Women             0.75      0.25      1
Ignore Group      0.70      0.30      1
(not column sum)
 
Note, overall there were 210 successes for the 300 patients (210/300 = 0.70).
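The proportions in this table can be checked with a few lines of Python. This is only an illustrative sketch: the names `counts` and `row_proportions` are ours and are not part of the course material.

```python
# Observed counts from the example: rows are Men and Women,
# columns are Success and Failure.
counts = {"Men": [60, 40], "Women": [150, 50]}


def row_proportions(row):
    """Divide each count in a row by the row total."""
    total = sum(row)
    return [count / total for count in row]


# Proportions within each group (each row sums to 1).
table = {group: row_proportions(row) for group, row in counts.items()}

# "Ignore Group" row: add the counts down each column, then divide by 300.
pooled_counts = [sum(col) for col in zip(*counts.values())]
table["Ignore Group"] = row_proportions(pooled_counts)

for group, props in table.items():
    print(group, [round(p, 2) for p in props])
```

Note that the bottom row is computed from the pooled counts (210 and 90), not by adding the proportions in the rows above it.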
In general, visualize the following table for such a population.
Proportions for a Population Box
Response
1 2 row sums
Group
A PA1 PA2 1
B PB1 PB2 1
Ignore Group P1 P2 1
(not column sum)
Interpretation: In the row “Group = A” record the proportion of chips where Response = 1 or = 2. Repeat
this in the row “Group = B”. The bottom row ignores the Group value, just record the proportion of chips
where Response = 1 or = 2. Note the double subscript on P signals the value of each variable “Group” and
“Response”. A single subscript on P signals the value of the variable “Response”.

The Standard Hypothesis


H0 : Independence of “Group” and “Response”
No association between “Group” and “Response”
“Group” and “Response” are not related
Proportions of Response = 1 or = 2 are the same for each Group
(and then the same as if we ignore Group)
If H0 is true, the population proportions above would look like this (P1 and P2 unspecified):
Response
1 2 row sums
Group
A P1 P2 1
B P1 P2 1
Ignore Group P1 P2 1
(not column sum)

DATA
Now suppose, we take a random sample from the population box. Each object will fall in one of the four
cells of the table layout determined by its “Group” and “Response”. We count up how many objects fall
in each of the four cells and this table of counts is our main information. See the layout below, called the
“Observed Count”. The sampling can be done in one of two ways:

1. Sample n objects at random from the whole population and classify them by “Group” and “Response”.
2. Sample objects from “Group A” and classify them on “Response” and then sample objects from
“Group B” and classify them on “Response”. There are two separate samples here.


Example
In the above example, suppose the 300 patients are a sample from a bigger population. Express the data as
Response
Success Failure row sums
Group
Men 60 40 100
Women 150 50 200
Ignore Group 210 90 300
(column sum)
In general, we view the data as follows.
OBSERVED COUNTS
Response
1 2 row sums
Group
A OA1 OA2 nA
B OB1 OB2 nB
Ignore Group O1 O2 n
(column sum)
To test the hypothesis, we will build a table of expected counts if the hypothesis is true. It will look like this
in general form.
EXPECTED COUNTS if the hypothesis is true
                  Response
Group             1         2         row sums
A                 ___       ___       nA
B                 ___       ___       nB
Ignore Group      O1        O2        n
(column sum)
We need to fill in the entries of the 4 cells so that the table has the same marginal totals as the observed
count table (as already indicated here) and, moreover, we need to use the “same row proportions” rule of
the hypothesis (see the table “Proportions for a Population Box if the hypothesis is true” above). This will
be done in two steps.

1. We estimate P1 by O1/n, its corresponding sample proportion, and
we estimate P2 by O2/n, its corresponding sample proportion.
2. We use the rule of expected count = sample size · proportion.
The sample size for row A is nA and the sample size for row B is nB .

Putting this together we have the following table.


EXPECTED COUNTS if the hypothesis is true
                  Response
Group             1            2            row sums
A                 nA · O1/n    nA · O2/n    nA
B                 nB · O1/n    nB · O2/n    nB
Ignore Group      O1           O2           n
(column sum)
Note: a bit of algebra shows that the four cells do have the marginal totals as indicated.


Let’s introduce the notation “E” for “expected” in the four cells in the table above to have:
EXPECTED COUNTS if the hypothesis is true
Response
1 2 row sums
Group
A EA1 EA2 nA
B EB1 EB2 nB
Ignore Group E1 E2 n
(column sum)
In other words, row sums and column sums can be used to calculate expected counts. Consider the
following table with observed counts. First calculate the row sums and column sums: C1 = OA1 + OB1 and
C2 = OA2 + OB2; R1 = OA1 + OA2 and R2 = OB1 + OB2.
OBSERVED COUNTS
Response
1 2 row sums
Group
A OA1 OA2 R1
B OB1 OB2 R2
Ignore Group C1 C2 n
(column sum)
n = R1 + R2 or n = C1 + C2.

Expected count = (Row sum · Column sum) / Total
EXPECTED COUNTS if the hypothesis is true
                  Response
Group             1                    2                    row sums
A                 EA1 = R1 · C1 / n    EA2 = R1 · C2 / n    R1
B                 EB1 = R2 · C1 / n    EB2 = R2 · C2 / n    R2
Ignore Group      C1                   C2                   n
(column sum)
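As a minimal sketch of the rule "expected count = row sum · column sum / total", the following Python applies it to the patient counts from the earlier example. The function name `expected_counts` is our own choice for illustration.

```python
def expected_counts(observed):
    """Expected counts under H0: (row sum * column sum) / total for each cell."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]


# Observed counts: rows are Men and Women, columns are Success and Failure.
observed = [[60, 40], [150, 50]]
expected = expected_counts(observed)

for row in expected:
    print(row)

# The expected table keeps the same row sums as the observed table.
print([sum(row) for row in expected])
```

Both rows of the expected table have the same proportions (0.70 and 0.30), which is exactly what the hypothesis of independence requires.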

The Chi-Square Test Statistic


The test statistic compares the observed counts (entries = O) with the expected counts (entries = E) using
the formula:

χ² = Σ (O − E)² / E

with four terms in this sum, one for each cell of the table.
The O’s are the entries in the “observed counts table” and the E’s are the entries in the “expected counts
table”. Under H0 this statistic has a chi-square distribution approximately, with degrees of freedom df = 1.
In general, when our table has r rows and c columns, df = (r − 1) · (c − 1).
The p -value must be calculated to finish the test of the hypothesis. The p -value is the area under the
chi-square curve with proper degrees of freedom above the chi-square test statistic value. Use the chi-square
table for this.
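For readers who want to check a hand calculation, here is a hedged Python sketch of the statistic and its p-value, applied to the patient counts from the earlier example. The function names are ours. The upper-tail area is computed with a standard closed form that holds for whole-number degrees of freedom; it is an assumption of this sketch, not the method used in this course, where the chi-square table (or software such as Minitab) is used instead.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over every cell of the table."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


def chi_square_upper_tail(x, df):
    """Area above x (x >= 0) under the chi-square curve, whole-number df."""
    y = x / 2
    if df % 2 == 0:
        a, area = 1.0, math.exp(-y)             # exact area for df = 2
    else:
        a, area = 0.5, math.erfc(math.sqrt(y))  # exact area for df = 1
    # Step up two degrees of freedom at a time until df is reached.
    while a < df / 2:
        area += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return area


observed = [[60, 40], [150, 50]]   # Men, Women by Success, Failure
expected = [[70, 30], [140, 60]]   # row sum * column sum / total

chi2 = chi_square_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi_square_upper_tail(chi2, df)
print(round(chi2, 3), df, round(p_value, 4))
```

The tail function can be sanity-checked against the chi-square table: for example, the area above 3.841 with df = 1 should come out very close to 0.05.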


Example: Age Discrimination


A company had been downsizing its workforce over the last two years. It was charged with age discrimination
by some of its employees who were fired. A subsequent investigation gave the following 2x2 table.
Outcome
Fired Not Fired
Group
Age < 50 7 8
Age ≥ 50 16 7
It appears that there is a higher rate of firing for the older workers. Is this data sufficient evidence to show
that the company discriminated against older workers in its cutbacks? Test the hypothesis of independence
between age group and firing outcome using a significance level of 0.05. The table of observed frequencies
with the marginal totals is
                  Outcome
Group             Fired   Not Fired   Total
Age < 50          7       8           15
Age ≥ 50          16      7           23
Total             23      15          38
The table of expected frequencies under the hypothesis H0 is
Outcome
Fired Not Fired
Group
Age < 50 9.08 5.92
Age ≥ 50 13.92 9.08
To compute these entries use:

cell row 1, col 1: 15 · 23 / 38 = 9.08
cell row 1, col 2: 15 · 15 / 38 = 5.92
cell row 2, col 1: 23 · 23 / 38 = 13.92
cell row 2, col 2: 23 · 15 / 38 = 9.08
The chi-square test statistic is

χ² = (7 − 9.08)²/9.08 + (8 − 5.92)²/5.92 + (16 − 13.92)²/13.92 + (7 − 9.08)²/9.08
   = 0.476 + 0.731 + 0.311 + 0.476
   = 1.994

The degrees of freedom is d f = 1 and the p -value is the area under the curve above a chi-square value of
1.994. This is found to be 0.158.
Actually, this p -value was produced by the Minitab software. To use the chi-square table, with d f = 1 and a
chi-square value of 1.994, the area above is found to be between 0.10 and 0.20. In either case, the p -value is
not less than a significance level of 0.05 and so, we fail to reject the hypothesis H0 . The data is not extreme
enough to let us reject H0 ; we do not have evidence to refute the independence of age group and firing
outcome.
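The table lookup described above can be written out as a short Python sketch. The row of values is copied from the df = 1 line of the chi-square table; the function name `bracket_p_value` is hypothetical and only for illustration.

```python
# Row df = 1 of the chi-square table: (area above, chi-square value).
DF1_ROW = [(0.2, 1.642), (0.1, 2.706), (0.05, 3.841), (0.025, 5.024),
           (0.01, 6.635), (0.005, 7.879), (0.001, 10.828)]


def bracket_p_value(statistic, table_row):
    """Return (lower, upper) bounds on the p-value from one table row."""
    lower, upper = 0.0, 1.0
    for area, value in table_row:
        if statistic >= value:
            upper = area   # statistic is at or beyond this column
        else:
            lower = area   # first column the statistic falls short of
            break
    return lower, upper


low, high = bracket_p_value(1.994, DF1_ROW)
alpha = 0.05
if high <= alpha:
    print("p-value is between", low, "and", high, "so reject H0")
elif low >= alpha:
    print("p-value is between", low, "and", high, "so fail to reject H0")
else:
    print("table bounds straddle alpha; software is needed to decide")
```

The table only brackets the p-value, but that is enough whenever both bounds fall on the same side of the significance level.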


Example: Opinion on a Bond Issue


A survey was conducted in a community to assess the residents’ opinions on a bond issue. The residents
were also asked a number of demographic questions, one of which involved home ownership. The following
table resulted.
Opinion on Bond Issue
Home Owner Favor Opposed Undecided
Yes 56 68 25
No 44 21 15
Test the hypothesis of independence between home ownership and opinion on the bond issue. Use a
significance level of 0.05. The table of observed frequencies with the marginal totals is
Opinion on Bond Issue
Home Owner Favor Opposed Undecided
Yes 56 68 25 149
No 44 21 15 80
100 89 40 229
The table of expected frequencies under H0 is
Opinion on Bond Issue
Home Owner Favor Opposed Undecided
Yes 65.07 57.91 26.03
No 34.93 31.09 13.97
To compute these entries, use the observed table margins and the rule

E = (row sum · column sum) / n

For example, the first row, first column entry is E11 = 149 · 100 / 229 = 65.07. The other five entries are similar.
The chi-square test statistic is

χ² = 6 terms = (56 − 65.07)²/65.07 + ... + (15 − 13.97)²/13.97
   = 1.264 + 1.758 + 0.041 + 2.355 + 3.275 + 0.076
   = 8.769

The degrees of freedom is d f = (2 − 1) · (3 − 1) = 2 and the p -value is the area under the curve above a
chi-square value of 8.769. This is found to be 0.012.
Actually, this p -value was produced by the Minitab software. To use the chi-square table, with d f = 2 and a
chi-square value of 8.769, the area above is found to be between 0.01 and 0.025. In either case, the p -value
is less than a significance level of 0.05 and so we reject the hypothesis H0 . The data is extreme enough to
let us reject H0 ; we have evidence to refute the independence of opinion and home ownership.
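The same steps work for a table with more than two columns. The sketch below redoes the bond issue example in Python, keeping the expected counts unrounded, so the statistic may differ from the hand calculation in the last decimal place. The shortcut used for the p-value is valid only when df = 2; it is our assumption for this illustration and not a general formula.

```python
import math

observed = [[56, 68, 25],   # home owners: favor, opposed, undecided
            [44, 21, 15]]   # non-owners:  favor, opposed, undecided

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
n = sum(row_sums)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / n   # expected count for this cell
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

# With df = 2 (and only then), the area above x is exactly exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(round(chi2, 2), df, round(p_value, 3))
```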

Example: A Medical Study


In a study of a new surgical procedure, a group of 100 patients was divided, at random, into two groups;
one group received the standard procedure and the other group received the new procedure. The surgical
results were evaluated as success or failure. The experiment was double blind. The following table
summarizes the results.


Result
Procedure Success Failure
Standard 30 20
New 38 12
Is this strong evidence that the new procedure is better than the old? Test for significance. The null
hypothesis will be that the two procedures have the same chance of success, that is the hypothesis of
homogeneity. The table of observed frequencies with marginal totals is
Result
Procedure Success Failure
Standard 30 20 50
New 38 12 50
68 32 100
The table of expected frequencies under H0 is
Result
Procedure Success Failure
Standard 34 16 50
New 34 16 50
68 32 100

(for example, the row 1, col 1 entry is E11 = 50 · 68 / 100 = 34 using the row/col sums). The chi-square test statistic is

χ² = 4 terms = (30 − 34)²/34 + ... + (12 − 16)²/16
   = 0.471 + 1.000 + 0.471 + 1.000
   = 2.941

The degrees of freedom is d f = 1 and the p -value is the area under the curve above a chi-square value of
2.941. This is found to be 0.086.
Actually, this p -value was produced by the Minitab software. To use the chi-square table, with d f = 1 and a
chi-square value of 2.941, the area above is found to be between 0.05 and 0.10. In either case, the p -value
is not less than a significance level of 0.05 and so we fail to reject the hypothesis H0 . The data is not extreme
enough to let us reject H0 ; we do not have evidence to show there is a difference in the success probabilities
between the two surgical methods.
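Since this example compares two groups on a success/failure response, it can also be viewed as a two-sample problem. The Python sketch below assumes the pooled form of the two-sample Z statistic for proportions and checks numerically that, for this 2 x 2 table, its square matches the chi-square statistic. The variable names are ours.

```python
import math

# Observed results from the example: (successes, group size).
standard = (30, 50)
new = (38, 50)

p_standard = standard[0] / standard[1]
p_new = new[0] / new[1]

# Pooled success proportion, the common value estimated under H0.
p_pooled = (standard[0] + new[0]) / (standard[1] + new[1])
standard_error = math.sqrt(
    p_pooled * (1 - p_pooled) * (1 / standard[1] + 1 / new[1]))
z = (p_new - p_standard) / standard_error

# Chi-square statistic from the four (O - E)^2 / E terms in the text.
chi2 = ((30 - 34) ** 2 / 34 + (20 - 16) ** 2 / 16
        + (38 - 34) ** 2 / 34 + (12 - 16) ** 2 / 16)

# Two-sided p-value from the normal curve.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 3), round(z ** 2, 3), round(chi2, 3), round(p_two_sided, 3))
```

The chi-square test does not look at the direction of the difference, so its p-value corresponds to the two-sided Z test.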

EXERCISES
1. A company buys a certain part from three suppliers. In a study of the quality of these parts, it tests
200 parts from each supplier selected at random from their inventory. The results are given in the
following table.

Outcome
Supplier Good Defective
A 180 20
B 175 25
C 190 10

Does this show dependence of quality level on suppliers? Test with significance level of 0.05.


2. A large computer service company, which employs thousands of phone technicians (who assist customers
over the phone), did a study in which researchers interviewed technicians to assess whether they were
experiencing back pain while sitting at their monitors. Each of the 128 people with back pain who took part
was given a “new” chair identical in every visible aspect to
their old chair except in color. Secretly, 64 of the recipients were randomly chosen to receive a chair
with the seat cushion containing a “gel” center, while the others received the same “old” chair. After
three weeks, 20 of the people with the “gel” seats reported having reduced back pain compared to
only 12 of those having the “regular” seats.
                  Back Pain
                  Reduced           Not Reduced
Gel Seat          Observed = 20     Observed = 44
                  Expected =        Expected =
Regular Seat      Observed = 12     Observed = 52
                  Expected = 16     Expected =
(A) Complete the expected values under the hypothesis of no difference for the two types of chairs.
(B) Write down (but don’t crunch) the terms in the χ 2 statistic used to test if the back pain response
is independent of the type of seat, and determine the degree(s) of freedom.
(C) If the value of the χ 2 statistic is 6.95, give a p-value for the test of independence.
(D) Write the conclusion of the test. (α = 0.05)
3. (Hypothetical) A survey of 240 randomly selected students was taken at WMU regarding their position
on the following proposition: “Would you be in favor of extending the semester in December by two
additional days if a 2-day fall break were added in October?”. The following responses were
recorded:
Response
Residency Status Yes No
Out of State 73 32
In State 80 55

Use a χ 2 test to test the hypothesis:


H0 : response and residency status are independent (no relation exists between the two)
Ha : response and residency status are dependent (a relation between the two exists)
(Include expected values, test statistic, df, p-value, and conclusion at 0.05 level of significance.)
4. In testing a new drug, 75 people were selected from the patient population to receive the new drug.
The following data resulted.
Response
No Change Got Better
Men 15 25
Women 17 18

Test the hypothesis that the effectiveness of the drug is independent of the sex of the patient. (α =
0.05)
5. A car company buys a certain casting from two suppliers, say A and B. In a study of the quality of this
part, the company selected 400 castings at random from those purchased last month and carefully
tested them. The following table summarizes their findings:
Quality
Supplier Poor Good
A 10 70
B 20 300


(A) What is the expected number of poor castings for the two suppliers?
(B) What is the expected number of good castings for the two suppliers?
(C) Is there a statistically significant difference in the quality of castings between the two suppliers?
(α = 0.05)
6. Consider the table, GSS data on party ID and race.
Party Identification
Race Democrat Independent Republican
Black 192 75 8
White 459 586 471
Source: 2008 General Social Survey, National Opinion Research Center

Perform a test of independence between party ID and Race. (Include null and alternative hypothesis,
expected counts, test statistic, df, p-value, decision and conclusion at 0.05 level of significance).
7. A study was conducted to determine whether the final grade category and prior knowledge in Statistics
are related. The following table contains final grade category and the response to the question “Do
you have any prior knowledge in Statistics?”.
Response
Grade Category Yes No
High-A,BA,B 3 25
Middle-CB,C 9 21
Low-DC,D,E 1 8
(a) Write the null and alternative hypotheses.
(b) Calculate the expected counts.
(c) Calculate the test statistic.
(d) Find the degrees of freedom and the p-value.
(e) State your decision and conclusion.
8. A retrospective study about lung cancers and smoking gives the following data.
Lung cancers
Smoker Treatment Control
Yes 97 88
No 3 12

Is there enough evidence of an association between smoking and lung cancers? (α = 0.05)

REVIEW EXERCISES
9. Which of the following is (are) TRUE about the chi-square test for two categorical variables?
I. The null hypothesis is a statement about the two variables being independent.
II. The degrees of freedom of the test is calculated as k − 1.

a. I only b. II only c. Both I and II d. Neither I nor II


10. (TRUE or FALSE) The chi-square test of independence is essentially the occurrences of frequencies
across any pair of variables that are correlated. Can we say that it is simply a means of comparing
categorical correlations? Explain your answer.


11. There are 4 different treatment methods to treat a certain medical condition. The result can be “got
better” or “no change”. By mistake, the observed count for the first treatment and “got better” was
recorded as 20 instead of 200. Which of the following statements is correct?
(a) The degrees of freedom will change because the sample size is wrong.
(b) All the expected values are wrong.
(c) Only the expected values in the first row and first column will change due to wrong column and
row totals.
12. Three treatment methods for a serious medical condition were tested in a randomized experiment
involving 210 patients. The following table summarizes the results:
Patient Outcome
Treatment Method Live Death
A 40 10
B 30 10
C 100 20
Carry out a test of independence between outcome and treatment at 0.05 level of significance.
13. According to a recent survey, 64% of Americans between the ages of 6 and 17 cannot pass a basic
fitness test. A physical education instructor wishes to determine if the percentages of such students
in different schools in this school district are the same. He administers a basic fitness test to 120
students in each of four schools.
Southside West End East Hills Jefferson
Passed 49 38 46 34
Failed 71 82 74 86
Test the claim that fitness test result and school district are independent. (α = 0.05)
14. To test the effectiveness of a new drug, a researcher gives one group of individuals the new drug and
another group a placebo. Using a χ² test at a 0.1 level of significance, determine if the researcher can
conclude that the drug is effective.
Medication Effective Not effective
Drug 32 9
Placebo 12 18
15. An advertising firm has decided to ask 92 customers at each of three local shopping malls if they are
willing to take part in a market research survey. According to previous studies, 38% of Americans
refuse to take part in such surveys.
Mall A Mall B Mall C
Will participate 52 45 36
Will not participate 40 47 56
Test, at the 0.01 significance level, the claim that the proportions of those who are willing to participate
are equal.
16. A study is conducted as to whether there is a relationship between joggers and the consumption of
nutritional supplements. A random sample of 210 subjects is selected, and they are classified as
shown. Test the claim that jogging and the consumption of supplements are not related. Why might
supplement manufacturers use the results of this study?
Jogging status Daily Weekly As needed
Joggers 34 52 23
Non-joggers 18 65 18
Test with significance level of 0.05.

Chapter 17

Judgment Tables

Consider situations where a judgment (decision) is made with the possibility of being wrong. Some information
is available to help make the decision. The following examples on judgments are situations where there are
two choices.

Decide on the admission of a student to college (yes or no)


Evaluate a medical test ( + or − )
Should a submitted article be published in a journal (yes or no)
Decide to attend a movie (or not)
Decide to give someone a loan (or not)
Inspection decision on a product (good or bad)
Give a student a grade for a class (pass or fail)
Merit pay for an employee (yes or no)
A jury trial (innocent or guilty)
Flirting (is he/she interested or not)

You can easily think of many more such situations around us where a judgment decision is made. There
is the feature that a correct decision exists and the decision made may agree with the correct one or not.
Our decision may be right or wrong. The following layout (the Spatial Layout) is a useful way to organize
and visualize the information and calculations. The entries may be frequencies (counts), proportions or
percentages (chances). The same type of layout works for tables larger in size than 2 x 2.
The heart of the layout is the 4 cells in the upper-left 2 x 2 table. Its 2 rows indicate the true states and its
2 columns indicate the decisions. Adding the entries of this table across rows gives the row margin to its
right. Adding the entries of this table down columns gives the column margin below. The two tables at the
far right are conditional tables. The two tables at the very bottom are also conditional tables. The conditional
tables will be discussed and interpreted in the examples to follow. Note there are 7 tables in this layout.


The True State - Decision labels (above) for rows and columns may be modified to fit other situations, such
as

Skill - Performance
Group - Response
Factor - Outcome

The convention is to put the prior condition at left in the row position and the subsequent outcome on top
in the column position. Watch out to distinguish whether the entries in the above tables refer to

1. data (from the past) or


2. chance or probabilities (for the future).

Be consistent. The matter is further confused when we use data from the past as estimates of chances of
future happenings, but more on that later.

Example
Let’s do an example to illustrate the relationship between the entries of the 7 tables in the layout. This
example will not have a decision context – these will come later.
Suppose a group of 100 students are classified by two characteristics, hair color and eye color, resulting in
the 2 x 2 table in the upper left of the layout (blue eyes: 10 with light hair and 20 with dark hair; brown eyes:
30 with light hair and 40 with dark hair). From this beginning 2 x 2 table we add across
the rows to get the right marginal distribution and we add down the columns to get the bottom marginal
distribution. On the far right we have the conditional distributions for each row. In the first row the values
are calculated as 10/30 = 0.33 and 20/30 = 0.67.
In the second row the values are calculated as 30/70 = 0.43 and 40/70 = 0.57.
Note the conditional nature of these numbers. They’re of the form
”if eye = blue then hair = light in proportion 0.33”
”if eye = blue then hair = dark in proportion 0.67”
In smoother language, ”among those with blue eyes, 33% have light hair and 67% have dark hair”.
The second distribution at right gives the conditional hair information given eye = brown. These numbers
are about hair color given a specific eye color. At the bottom we have the conditional distributions for each
column. They are calculated in the same way as above, but move down the columns. For example, in the
first column hair = light and in this column the proportion with eye = blue is 10/40 = 0.25 and the proportion
with eye = brown is 30/40 = 0.75. Note the conditional nature of this information: ”among those with light hair,
25% have blue eyes and 75% have brown eyes”. Repeat these calculations for the second column below.
Note there are 7 tables here; each has a different interpretation:

• this 2 x 2 table has joint information on both eye color and hair color
• these two marginal tables have info on eye color alone (at right) or hair color alone (below)
• the two tables at right have info on hair color for a given eye color (given an eye color, how does hair
color break down)
• the two tables below have info on eye color for a given hair color (given a hair color, how does eye
color break down)
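The tables in the layout can all be computed from the four counts. The following Python sketch does this for the hair and eye color example; the variable names are ours, chosen to match the interpretations listed above.

```python
# Joint counts: rows are eye color, columns are hair color.
counts = [[10, 20],   # eye = blue:  hair light, hair dark
          [30, 40]]   # eye = brown: hair light, hair dark

n = sum(sum(row) for row in counts)

# Joint table as proportions of all 100 students.
joint = [[c / n for c in row] for row in counts]

# Marginal tables: eye color alone (add across rows) and
# hair color alone (add down columns).
eye_margin = [sum(row) / n for row in counts]
hair_margin = [sum(col) / n for col in zip(*counts)]

# Conditional tables at right: hair color for a given eye color.
hair_given_eye = [[c / sum(row) for c in row] for row in counts]

# Conditional tables below: eye color for a given hair color.
eye_given_hair = [[c / sum(col) for c in col] for col in zip(*counts)]

print("eye margin:", eye_margin)
print("hair margin:", hair_margin)
print("hair given eye:", [[round(p, 2) for p in t] for t in hair_given_eye])
print("eye given hair:", [[round(p, 2) for p in t] for t in eye_given_hair])
```

Each conditional table divides by a different total (a row sum or a column sum), which is why "hair given eye" and "eye given hair" give different numbers.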


The beginning 2 x 2 table was expressed with ”counts” in each cell. Alternately, we could express the
information as proportions or percentages. It’s only a matter of preference.

Example Continued
Suppose the initial information is given about the marginal and conditional distributions on the right as
follows:

• 33% of the blue-eyed people have light hair
• 43% of the brown-eyed people have light hair
• 30% of the people have blue eyes

Note, the first two pieces are conditional, the last is marginal (only about eyes). Begin by drawing the layout
and filling in these three numbers in their proper place at right. See the layout at right. Now fill in these
three tables using the fact that the entries of a table sum to 1. We get the layout at right. Now we want to
move to the left and complete the 2 x 2 main table in the upper left. We do this by multiplying - remember
the meanings of the entries we have so far. The conditional entries at the far right are obtained by dividing,
so to go backwards we need to multiply:

• row 1, col 1 = 0.30 · 0.33 = 0.099 ≈ 0.10
• row 1, col 2 = 0.30 · 0.67 = 0.201 ≈ 0.20
• row 2, col 1 = 0.70 · 0.43 = 0.301 ≈ 0.30
• row 2, col 2 = 0.70 · 0.57 = 0.399 ≈ 0.40

To finish the layout, we need to fill in the three tables below. This is done exactly as before by summing
down the columns to get the hair marginal distribution. Then taking ratios down the columns we compute
the two conditional tables at the bottom. The final layout is the same as the one displayed earlier. Review
what happened here. We started with 3 pieces of information. From them, we filled in completely all tables.
In general, 3 pieces of information (suitably placed) determine the entries of all 7 tables of the layout.
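
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the course material: the function name `complete_layout` and its argument names are made up here, and it assumes the three given numbers are the marginal proportion for row 1 and the column-1 conditional proportion within each row, as in the eye/hair example.

```python
def complete_layout(row_marginal, cond_given_row):
    """Build all seven tables of a 2 x 2 layout from three numbers.

    row_marginal   -- proportion in row 1 (e.g. eye = blue)
    cond_given_row -- column-1 proportion within row 1 and within row 2
                      (e.g. light hair among blue eyes, among brown eyes)
    """
    # Marginal and conditional tables at right: entries of a table sum to 1.
    rows = [row_marginal, 1 - row_marginal]
    given_row = [[c, 1 - c] for c in cond_given_row]

    # Main 2 x 2 table: multiply each row marginal by its conditional entries.
    joint = [[rows[i] * given_row[i][j] for j in range(2)] for i in range(2)]

    # Column marginal: sum down the columns.
    cols = [joint[0][j] + joint[1][j] for j in range(2)]

    # Conditional tables at the bottom: divide each column by its total.
    # given_col[j] holds the row proportions within column j.
    given_col = [[joint[i][j] / cols[j] for i in range(2)] for j in range(2)]

    return {"joint": joint, "rows": rows, "cols": cols,
            "given_row": given_row, "given_col": given_col}


layout = complete_layout(0.30, (0.33, 0.43))
for name, table in layout.items():
    print(name, table)
```

With the inputs 0.30, 0.33 and 0.43 the main table comes out as 0.099, 0.201, 0.301 and 0.399, which round to the entries computed above.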

Example
Let’s look at a situation that has a decision/judgment context. It is a situation of great practical importance
that arises quite often. The outcome is extremely surprising and difficult for people to understand. Yet, it
follows easily from the type of calculation we did above. Viewing the situation with the complete layout of
7 tables helps in seeing why the result occurs.
A diagnostic medical test for a disease is 95% accurate. That is, if a person has the disease the test will
indicate so with a 95% chance and it has the same chance of correctly diagnosing a person who does not
have the disease. Suppose there is a 1 in 200 chance that a person being tested really has the disease. If
the test result is positive, indicating that the person has the disease, what is the chance the person has the
disease?
Note, most people would quickly say that the answer is 95%, since the test is correct 95% of the time. This
is wrong, in fact very wrong, as we will see.
Begin by drawing a layout with the usual 7 tables. Label rows by the true state of a patient (truth) and the
columns by the decision made by the medical test (judgment). Then fill in the 3 pieces of information that
we have been given. Make sure they are put in the proper position.


Note that 1/200 = 0.005 was used. Also note carefully the cells in which the 3 pieces of information were
placed. It is critical to get this right.
As before, we fill out the 3 tables on the right using the fact the table entries must sum to 1. This gives the
next layout below. Now we need to move to the left and fill in the upper left 2 x 2 main table. This is done
by multiplying:

• row 1, col 1 = 0.005 · 0.95 = 0.00475
• row 1, col 2 = 0.005 · 0.05 = 0.00025
• row 2, col 1 = 0.995 · 0.05 = 0.04975
• row 2, col 2 = 0.995 · 0.95 = 0.94525

Then we have the next layout below.


Continuing, we now fill in the tables below. Add down the two columns to get the decision distribution.
Then fill in the two conditional distributions below by dividing. In the left one:

• first entry is 0.00475/0.05450 = 0.087
• second entry is 0.04975/0.05450 = 0.913

In the right one:

• first entry is 0.00025/0.94550 = 0.00026
• second entry is 0.94525/0.94550 = 0.99974

This is the completed layout. Now we can find the answer to the question: given that the medical test
says “has disease”, what is the chance that the person “has disease”? This asks: if we are in column “has
disease”, what is the chance we are in row “has disease”? The answer is below in the conditional distribution
and is 0.087 or 8.7%.
To emphasize that the statement here is conditional, say it in the if-then form: if test says “has disease”
then truth is “has disease” with 8.7% chance. The answer of 8.7% is surprisingly low. How can it be that
the test is accurate 95% of the time (a very high rate really) and yet its credibility is only 8.7% when it says
the person has the disease? (We could get higher credibility by tossing a coin; heads you have it, tails you
don’t.) Review the calculations above again to convince yourself they are right.
One point brought out by this example is that for a conditional statement, the order of the pieces matters.
Contrast these:

• IF “person has disease” THEN “test says has disease” (chance is 95%)
• IF “test says has disease” THEN “person has disease” (chance is 8.7%)

Thinking carefully, one can see these are two different statements (having different answers), even though
their similarity often leads people to think of them as the same.
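
A short Python sketch can make the role of the 1 in 200 figure visible. It is an illustration only: the function name and the argument names `prevalence`, `sensitivity` and `specificity` are labels chosen here (the text simply calls the last two "95% accurate"), and the prevalence values 0.05 and 0.5 are hypothetical comparison points that do not come from the example.

```python
def chance_of_disease_given_positive(prevalence, sensitivity, specificity):
    """Chance that a person who tests positive really has the disease."""
    # Main table, column "test says has disease":
    true_pos = prevalence * sensitivity                # row 1, col 1
    false_pos = (1 - prevalence) * (1 - specificity)   # row 2, col 1
    # Conditional entry: divide by the column total.
    return true_pos / (true_pos + false_pos)


# 0.005 is the 1 in 200 chance from the example; the other two values are
# hypothetical, included only to show how the answer depends on this input.
for prevalence in (0.005, 0.05, 0.5):
    p = chance_of_disease_given_positive(prevalence, 0.95, 0.95)
    print(f"prevalence {prevalence:.3f} -> chance given positive {p:.3f}")
```

The first line of output reproduces the 8.7% found above. The other lines suggest why the answer is so low here: with the same 95% accuracy, the chance given a positive test rises quickly as the disease becomes more common among those tested.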


Example: Genetic trait and disease


Is a genetic trait related to (or a cause of) a certain disease? To study this issue, data were obtained as follows:

• among people with the disease, 60% have the trait
• among people without the disease, 30% have the trait
• in the general population, 10% get the disease

Make the judgment table below with these entries made.


                   Disease                     Disease
                   yes     no                  yes     no
Have Trait   yes
             no
                   0.1
Have Trait   yes   0.6     0.3
             no
Finish the table to obtain the results below.
                   Disease                     Disease
                   yes     no                  yes     no
Have Trait   yes   0.06    0.27    0.33        0.18
             no    0.04    0.63    0.67        0.06
                   0.1     0.9
Have Trait   yes   0.6     0.3
             no    0.4     0.7
If one has the trait the chance of the disease is 18%, otherwise it is 6%. Ratio = 3.
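
In this example the three given numbers sit on the column side of the layout (the disease marginal and the trait proportions within each disease column), so the multiplying goes down the columns and the final dividing goes across the rows. The sketch below retraces that arithmetic; the variable names are chosen here for illustration only.

```python
p_disease = 0.10                  # in the general population, 10% get the disease
p_trait_given_disease = 0.60      # among people with the disease
p_trait_given_no_disease = 0.30   # among people without the disease

# Main table: multiply the column marginal by the column conditionals.
trait_and_disease = p_disease * p_trait_given_disease                 # 0.06
trait_and_no_disease = (1 - p_disease) * p_trait_given_no_disease     # 0.27
no_trait_and_disease = p_disease * (1 - p_trait_given_disease)        # 0.04
no_trait_and_no_disease = (1 - p_disease) * (1 - p_trait_given_no_disease)  # 0.63

# Row marginals: sum across each row.
p_trait = trait_and_disease + trait_and_no_disease
p_no_trait = no_trait_and_disease + no_trait_and_no_disease

# Chance of disease within each row: divide by the row total.
risk_with_trait = trait_and_disease / p_trait
risk_without_trait = no_trait_and_disease / p_no_trait

print(round(risk_with_trait, 2), round(risk_without_trait, 2),
      round(risk_with_trait / risk_without_trait, 1))
```

The printed values agree with the 18%, 6% and ratio of about 3 quoted above.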


EXERCISES
1. (Hypothetical) There is a little known condition called Teaching Assistants Disorder, or TAD, in which
college teaching assistants become so enamored with the subject they are teaching that they sometimes
talk so passionately and rapidly about their subject in lecture that they literally forget there is a class
of students in the room with them. New graduate students are given a test for this condition, and
the test is accurate 80% of the time when the condition is present, and 70% of the time when the
condition is absent. It is known that 10% of all graduate students have this condition. Complete the
judgment table below.

                  Test       Test                  Test       Test
                  Positive   Negative              Positive   Negative
Condition   Yes              0.020      0.100      0.800
            No    0.270
                             0.650
Condition   Yes   0.229
            No               0.969

2. From the information and table in 1:
(a) What percentage of grad. students test positive and don’t suffer from the condition?
(b) If a grad. student tests positive, what is the percentage chance that the student does not have the
condition?
(c) Out of all tests administered, what percentage will be negative?
3. A manufacturer tests each item made. If an item tests ”good” it is sold, if it tests ”bad” it is scrapped.
Suppose that 80% of the items made are really good (this implies the other 20% are bad). The testing
procedure is correct 70% of the time on good items; that is, a good item has a 70% chance of testing
good. The testing procedure is correct 80% of the time on bad items.
(a) What percent of the items sold are really good?
(b) What percent of the scrapped items are really good?
4. In the above problem, suppose the testing procedure is much better. It identifies correctly a good item
with 99% chance and it identifies correctly a bad item with 99% chance. Answer the same questions.
5. A diagnostic medical test for a disease is 90% accurate. If a person has the disease the test will
indicate so with a 90% chance and it has the same chance of correctly diagnosing a person who does
not have the disease. Suppose there is a 1 in 100 chance that a person being tested really has the
disease.
(a) If the test result is positive, indicating that the person has the disease, what is the chance the
person has the disease?
(b) Answer part (a) if there is a 1 in 500 chance that a person being tested has the disease.
6. In the business of a professional journal publishing articles that are submitted to it by authors,
the papers are reviewed/refereed to decide whether or not they are good enough to be published.
Assume there is a sufficiently clear notion of what constitutes a ”good” paper, perhaps through
editorial policy, professional standards and practices. If this is so, why are there so many bad
papers published in journals? To get a handle on this situation we need to make some reasonable
assumptions. (See Rousseeuw, Chance (1991).) Suppose we focus on a high quality journal which
wants to publish the top 20% of the papers it receives befitting its high standards and reputation. (A
submitted paper is ”good” if it’s in the top 20% of those submitted.) Suppose referees can recognize
a good paper 70% of the time and, likewise, they recognize a bad paper 70% of the time. It is
really quite optimistic to assume these people have such good discrimination powers. Under these
conditions
(a) what percent of the papers published in the journal are really good?


(b) what percent of the papers rejected by the journal are really good?
7. The Journal of the American Medical Association reported in July 1995 on a study of 146 patients,
aged 2 – 21, who were referred to the University of Connecticut Health Center with possible Lyme
disease. The patients were tested with the usual test for Lyme disease and diagnosed as either + or
− . After some time passed and further evaluations were made they were able to determine the truth
as to the presence of Lyme disease for the patients. Of the 87 patients who truly had Lyme disease,
75 were correctly diagnosed. Also, 56 patients were incorrectly diagnosed as having Lyme disease.
Evaluate the effectiveness of the diagnostic testing:
(a) what percent of patients diagnosed as having Lyme disease really have it?
(b) what percent of patients diagnosed as not having Lyme disease really do not have it?
8. Consider a situation where students get grades in a class. Suppose it makes sense to think of ”true
worth” of the work and learning of a student. Students could be ranked on such a scale. Suppose we
define an A grade to be the correct grade for the top 30% of the students on this scale. The other 70%
should not get an A grade. This scale is unknown to the teacher who can obtain information about
this scale by some testing/grading procedure. Let’s suppose the grading procedure is 90% accurate
in the sense that there is a 90% chance that a true A student will get an A and a 90% chance that a
true non-A student will get a non-A grade.
(a) What percents of A and non-A grades will be given?
(b) Among the students who get an A grade, what percent are true A students?
(c) Among students who get a non-A grade, what percent are true non-A students?
9. A review of the book ”The New Academic Generation: A Profession in Transformation” in Academe
(Sept-Oct 1999) reported that of the 514,976 full-time faculty members at colleges and universities in
the U.S.A. in 1992, there were 172,319 ”new generation” faculty (they were in years 1 to 7 of full-time
employment). It was reported that 40.8% of this new cohort was female compared to 28.5% female in
the senior cohort.
(a) Finish a full spatial layout of this data and comment on a few of the entries.
(b) It was also reported that 66.8% of this new cohort hold tenure-track positions compared to
83.5% in the senior cohort. Finish a spatial layout for this issue.
10. A lie detector test gives the reading ”truth” or ”lying” when administered to a person. Suppose the
lie detector test can detect that a person is lying 80% of the time and when a person is telling the truth
it erroneously indicates “lying” 20% of the time. Suppose that 10% of the people coming to be
tested are actually lying. If the lie detector indicates a person is lying, what is the chance the person
is actually lying?

Workshop

Workshop 1A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

The study Transcatheter or Surgical Aortic-Valve Replacement in Intermediate-Risk Patients was published
in the New England Journal of Medicine in 2016. This study compared the efficacy of two treatments for
cardiovascular malfunction. Over 1,000 low-risk patients were randomly assigned to a standard treatment:
a type of invasive heart surgery. Another 1,000 low-risk patients were randomly assigned to a nonstandard
treatment: the use of a less invasive non-surgical catheter. Researchers compared the two treatment
groups, and concluded that the less invasive catheter was (for the most part) not inferior to heart surgery
in the treatment of cardiovascular malfunction.

a. Is the study a randomized controlled experiment or observational? Why?


b. What were the treatment and control groups in this study?

c. Suppose we perform the study again, but we change patient recruitment such that any patient who
comes into a hospital (out of the blue) for treatment just has their prescribed treatment recorded by
a presiding doctor. Could the researchers’ conclusion of noninferiority have been ruined by significant
selection bias? Why or why not?

d. Suppose we perform the study again, but we change patient recruitment such that any patient who
comes into a hospital (out of the blue) for treatment just has their prescribed treatment recorded
by a presiding doctor. Could the researchers’ conclusion of noninferiority have been ruined by
confounding? If so, what variables might act as confounders in this study?

e. If you were helping to design the new study, how would you suggest correcting for the potential
selection bias and confounding you have identified in parts c. and d. above?


Workshop 1B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

A study published in a 2003 volume of the journal Clinical Infectious Diseases was conducted between
1995-1997 across many hospitals. The study investigated a dangerous fungal blood infection called candidemia
that had developed in regular patients at these hospitals. It was observed that about 1,600 adults and
children who developed the infection over the two years were given one of two antifungal treatments, a
combination of both antifungal treatments, or none of these. Investigators looked forward in time and
made a comparison between the rates of mortality three months after treatment to see 1) whether adults
or children had significantly higher mortality rates, and 2) whether any specific antifungal drug was better
than the others. It was found that adults had a significantly higher mortality rate than children three months
after treatment. It was also found that adults appeared to better respond to one of the antifungal treatments
than any of the others.

a. Is this study a randomized controlled experiment or observational? Why?

b. Is this study retrospective or prospective? Why?


c. Suppose that it is noticed that children in this study are of a much higher average age than those
who usually develop candidemia in the population at large. Could either of the study’s conclusions
be ruined by selection bias? If so, which conclusion, and how?

d. The investigators noted that adults who received the more effective drug were on average healthier
than those who received the other drugs. Could the study suffer from confounding? If so, what
variables might act as confounders in this study?

e. If you were helping to design the study, how would you suggest correcting for the potential selection
bias and confounding you’ve identified in parts c. and d. above?


Workshop 2A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

A. For the next set of questions, consider the histogram below:

1. The above histogram is skewed ______.

2. You would expect the mode to be between ______ and ______ (pick a 10-point interval).

3. You would expect the mean to be ______ than the median.

4. The variable Final Grade is ______ (select a variable type learned in class).


5. A ______ would have also been an appropriate choice for data presentation
(select a specific plot or graph learned in class).

6. The final grades of students from two statistics classes are presented as separate histograms
with the same scale. Class 1 has a histogram that shows longer tails and a shorter peak, while
Class 2 has a histogram that shows shorter tails and a taller peak. You infer that ______
has a larger standard deviation than ______.

7. If Class 1 grades are approximately normally distributed, then ______ is between
plus and minus one standard deviation from the mean.

8. If Class 1 grades are approximately normally distributed, then ______ is between
plus and minus two standard deviations from the mean.

9. If Class 1 grades are approximately normally distributed, then ______ is between
plus and minus three standard deviations from the mean.

10. If Class 2 grades are not normal, then we should not apply the ______ to the
dataset.

B. The duration of time from first exposure to HIV infection to AIDS diagnosis is called the incubation
period. The incubation period of a random sample of 5 HIV infected individuals is given below (in
years):

12.0 9.5 13.5 7.2 10.5

Complete the table below and calculate the standard deviation of incubation period.

Xi     (Xi − X̄)     (Xi − X̄)²

12.0

9.5

13.5

7.2

10.5

X̄ = ΣXi / n =          SSx = Σ(Xi − X̄)² =

The value of the standard deviation is:


Workshop 2B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

A. For the next set of questions, consider the dataset below:

Dataset: 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9

1. Using the above dataset, draw a histogram in the space provided below. Set tickmarks at
intervals of 1 on both axes.

2. Find the mean of the above dataset.

3. Calculate the 5-number summary (Min, Q1, Median, Q3, Max) for the data.


4. Using the values obtained in 3, draw the boxplot for the data.

5. Based on the values calculated, and the plots drawn for the data, is the dataset skewed? If so,
in which direction?

B. A random sample of five 12-year-old boys was obtained, and their body weights (in pounds) were recorded
as follows:

116 168 124 132 110

Complete the table below and calculate the standard deviation of body weights for the random
sample.

Xi     (Xi − X̄)     (Xi − X̄)²

116

168

124

132

110

X̄ = ΣXi / n =          SSx = Σ(Xi − X̄)² =

The value of the standard deviation is:


Workshop 3A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

According to the American Cancer Society, lung cancer is the leading cause of cancer death in both men
and women. The ACS also states that the average age of a lung cancer patient at the time of diagnosis is 70
years. Assume a standard deviation of 8 years. Assume the population of recently diagnosed lung cancer
patients is normally distributed.

1. Suppose we are interested in the probability that a patient is 59 years old or younger. Draw a graph of the
distribution that clearly indicates the above-described mean and x value in the space provided below
and shade the area below the vertical line that you marked at 59 years.

2. In the space provided below, calculate the Z-scale value for a patient who is diagnosed with lung
cancer at 59 years old.


3. Draw the graph below of the standard normal curve with the z-scale value you obtained above and
with the area shaded (remember we want the area below this value).

4. What is the proportion of the area below/less than your z-score?

5. What is the probability (percent) that a patient is 59 years old or younger when diagnosed with lung cancer?

6. Calculate the probability that someone who is recently diagnosed with lung cancer is more than 59
years old (you do not have to mark this on your graph).

7. What is the probability that a patient is 70 years old or older when diagnosed with lung cancer?

a. Calculate the Z-score.

b. Provide the graph with shading for the z-score.

c. What is the probability (percentage) that a patient is in this age range when diagnosed with lung cancer?

8. Calculate the approximate age of someone in the fifth percentile (5%) of recently diagnosed lung
cancer patients (provide a graph). Hint: you'll need to work backwards.


Workshop 3B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

According to the Prostate Cancer Foundation, prostate cancer is the second most common cancer in men
behind skin cancer. The PCF also states that the average age of a male prostate cancer patient at the time
of diagnosis is 69 years. Assume the population of recently diagnosed prostate cancer patients is normally
distributed.

1. Draw a graph of the distribution that clearly indicates the above-described mean in the space provided
below.

2. Clearly mark 65 years with a vertical line on your above graph. In the space provided below, calculate
the Z-scale value for a man diagnosed with prostate cancer at 65 years old assuming the standard
deviation of the distribution is 7 years.


3. Shade the area below the vertical line that you marked at 65 years. In the space provided below, find
the area of the distribution that is less than the Z-scale value you calculated in number 2.

4. Calculate the probability that a man who is recently diagnosed with prostate cancer is more than 65
years old (you do not have to mark this on your graph).

5. Calculate the approximate age of a man in the seventh percentile of recently diagnosed prostate
cancer patients (you do not have to mark this on your graph).


Workshop 4A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. As age increases, the risk of developing various diseases tends to increase as well. This implies a ______
correlation between age and risk of disease, with a correlation coefficient
value (r-value) between ______ and ______ (hint: between what two numbers?)

2. A researcher records the following sample data relating age and risk of a certain cancer. Calculate
the correlation coefficient r using the following data and formulas provided below:

Xi     (Xi − X̄)     (Xi − X̄)²     Yi     (Yi − Ȳ)     (Yi − Ȳ)²

79 0.15

65 0.11

52 0.06

40 0.05

30 0.02

X̄= SSx = Ȳ = SSy =


Sx = √(SSx / (n − 1)) =          Sy = √(SSy / (n − 1)) =

Zx     Zy     Zx ∗ Zy

Sum =

r = Σ(Zx ∗ Zy) / (n − 1) =

3. Produce a scatterplot below complete with axis labels and best fit line.

4. Interpret your findings. Was your correlation positive or negative?

5. Was there an association between these two variables? If so, describe it based on your results.


Workshop 4B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. The correlation coefficient r measures the strength of linear association between two categorical
variables. Circle only one response.
a. TRUE
b. FALSE

2. As the temperature outside decreases, heating bills increase. This implies a ______
correlation between temperature and heating bill, with a correlation coefficient value (r-value) between
______ and ______ (hint: between what two numbers?)

3. A study was conducted to determine if the number of hours spent studying for an exam is associated
with the exam score. Calculate the correlation coefficient r using the following data and formulas
provided below:

Xi     (Xi − X̄)     (Xi − X̄)²     Yi     (Yi − Ȳ)     (Yi − Ȳ)²

8 98

2 74

6 87

4 82

2 72

X̄= SSx = Ȳ = SSy =


Sx = √(SSx / (n − 1)) =          Sy = √(SSy / (n − 1)) =

Zx     Zy     Zx ∗ Zy

Sum =

r = Σ(Zx ∗ Zy) / (n − 1) =

4. Interpret your findings. Was your correlation positive or negative?

5. Was there an association between these two variables? If so, describe it based on your results.


Workshop 5A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. For the following pairs of variables, identify whether the association is positive or negative, or the
two variables are not associated at all.
A. Number of Absences and Final Grade
B. Age in Years and Person’s Agility (ability to move quickly)
C. Time Spent Running on a Treadmill and Calories Burned
D. Amount of Moisture in an Environment and Growth of Mold Spores
E. Stress Level and Blood Pressure Readings

2. Match each correlation coefficient with the appropriate scatterplot.

Correlation Values: A. 0.882 B. 0.314 C. -0.849 D. -0.373


3. Studies show that the height of a person and pulse rate are somehow associated. To verify this claim,
information on 5 randomly selected persons was obtained. Using the covariance approach, calculate
the correlation coefficient r using the following data and formulas below:

Xi     Yi     (Xi − X̄)     (Xi − X̄)²     (Yi − Ȳ)     (Yi − Ȳ)²     (Xi − X̄)(Yi − Ȳ)

68 90

72 85

70 100

64 65

75 98

X̄ = Ȳ = SSx = SSy = Sum =

Sx = √(SSx / (n − 1)) =          Sy = √(SSy / (n − 1)) =

COV = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1) =          r = COV / (Sx ∗ Sy) =

4. Interpret the correlation coefficient calculated above.


Workshop 5B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. The following dataset contains observations on weight and height of five 10-year-old boys.
Calculate the correlation coefficient between the two variables.

Weight (X)     Height (Y)     X − µX     Y − µY     (X − µX)²     (Y − µY)²     XY

35 60

33 55

30 52

37 55

31 48

SSx = SSy = ΣXY =

Formulas:

σx = √(SSx / N)          COV = ΣXY / N − µX µY

σy = √(SSy / N)          R = COV / (σx σy)


a. Calculate the means for weight and height variables.

b. Calculate the standard deviations for weight and height. Fill in columns of the table to assist in
the computations.

c. Calculate the covariance using the formula given on the previous page.

d. Calculate the correlation coefficient between weight and height.

e. Interpret the value of the correlation coefficient.


2. Match the scatterplots with the range of values of the correlation coefficients:
(-1 to -0.8), (-0.8 to -0.5), (0 to 0.4), (0.7 to 0.9)


Workshop 6A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. Predict Y from X based solely on looking at the scatterplot and regression line below.
A. What do you predict your Y value to be for an X value of 550? Y = .
B. What Y value is associated with X = 200? Y = .


For the next set of questions, refer to the following problem:

According to the National Cancer Institute, breast cancer is the second leading cause of cancer death
in women. The NCI’s annual SEER reports state that a US woman’s lifetime risk of developing breast
cancer between 2002 and 2006 is approximately as follows:

Year    Lifetime Breast Cancer Risk
2002    12.7%
2003    12.3%
2004    12.0%
2005    12.1%
2006    12.4%

x̄ = 2004     ȳ = 12.3%
Sx = 1.58    Sy = 0.27%

2. Given that the correlation coefficient which relates year and lifetime breast cancer risk is r = -0.462,
use Z-scores to find the predicted value of lifetime breast cancer risk for the year 2007. I need to see
your Zx value, Zy value and your predicted Y.

3. Using the information above, what is the slope of the regression equation that predicts lifetime breast
cancer risk using year and the following formula? Round your slope to four places past the decimal!

Slope = r ∗ (Sy / Sx)

4. Using the information above, what is the intercept of the regression equation that predicts lifetime
breast cancer risk using year and the formula below?

Intercept = Ȳ − slope ∗ X̄


5. Write the full regression equation below using your above-calculated slope and intercept.

Ŷ = intercept + slope ∗ X

6. Verify that your equation predicts approximately the same value for the year 2007 as you calculated
using r and Z-scores in problem 2 above.
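
The check in problem 6 can be sketched in Python. This is an illustration, not the official solution: it uses the summary statistics from the table (x̄, ȳ, Sx, Sy, r), and the function names are made up for this example. The z-score route (Zy = r ∗ Zx) and the regression equation should give the same prediction up to rounding.

```python
# Summary statistics from the workshop table.
x_bar, y_bar = 2004, 12.3
s_x, s_y = 1.58, 0.27
r = -0.462

def predict_with_z_scores(x):
    z_x = (x - x_bar) / s_x      # z-score of the year
    z_y = r * z_x                # predicted z-score of the risk
    return y_bar + z_y * s_y     # convert back to percent

# Slope = r * (Sy / Sx), Intercept = y-bar - slope * x-bar
slope = r * s_y / s_x
intercept = y_bar - slope * x_bar

def predict_with_equation(x):
    return intercept + slope * x

pred_z = predict_with_z_scores(2007)
pred_eq = predict_with_equation(2007)
print(round(slope, 4), round(intercept, 2), round(pred_z, 3), round(pred_eq, 3))
```

The two predictions agree because the regression equation is the z-score rule rewritten in the original units.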


Workshop 6B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

According to the National Cancer Institute (NCI), breast cancer is the second leading cause of cancer death in
women. The NCI's annual SEER reports state that a US woman's lifetime risk of developing breast cancer
between 2002 and 2006 was approximately as follows:

Year    Lifetime Breast Cancer Risk
2002    12.7%
2003    12.3%
2004    12.0%
2005    12.1%
2006    12.4%

x̄ = 2004     ȳ = 12.3%
Sx = 1.58    Sy = 0.27%

1. Given that the correlation coefficient which relates year and lifetime breast cancer risk is r = -0.462,
what is the slope of the regression equation that predicts lifetime breast cancer risk using year?

2. Using the information above, what is the intercept of the regression equation that predicts lifetime
breast cancer risk using year?


3. What is the predicted value of a woman's lifetime breast cancer risk for the year 2018?

4. Suppose the actual lifetime breast cancer risk for women in 2018 is 11.9%. What is the residual for
the year 2018?

5. Using the predicted value and the RMS of the regression, there is a 99.7% chance that the actual
lifetime breast cancer risk for 2018 falls between which two values?
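
A rough sketch of problems 3 through 5 in Python follows. Two assumptions are made here and should be checked against the Regression chapter: the RMS error of the regression is taken to be sqrt(1 − r²) ∗ Sy, and the 99.7% range is taken to be the prediction plus or minus 3 RMS errors.

```python
from math import sqrt

# Summary statistics from the workshop table.
x_bar, y_bar = 2004, 12.3
s_x, s_y = 1.58, 0.27
r = -0.462

slope = r * s_y / s_x
intercept = y_bar - slope * x_bar

predicted_2018 = intercept + slope * 2018
residual_2018 = 11.9 - predicted_2018      # residual = actual - predicted

# Assumed formula for the RMS error of the regression line.
rms_error = sqrt(1 - r ** 2) * s_y

# Assumed 99.7% range: prediction plus or minus 3 RMS errors.
lower = predicted_2018 - 3 * rms_error
upper = predicted_2018 + 3 * rms_error
print(round(predicted_2018, 3), round(residual_2018, 3), round(lower, 3), round(upper, 3))
```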


Workshop 7A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

A. A bag of Skittles is designed so that 20% of its candies are Orange. Let 1 represent the event
that the color of a candy is Orange, and 0 otherwise.

1. Make a Box Model for the color distribution of the candies in a bag.

2. Find the average and standard deviation of the box in A.

3. Suppose you bought a 48-ounce bag and randomly draw 100 pieces of candy. If the bag follows
the intended color distribution,

a. How many Orange candies do you expect to find in the sample?

b. What is the standard error of the total number of Orange candies in the sample?


c. Assume that the color distribution of the candy follows a normal distribution. What is the
probability that the bag will contain more than 25 Orange pieces?

B. In a neighborhood in a large city, the number of kids in a family can be modeled by the following box
model: [0,0,1,1,1,1,1,2,2,2].

1. Calculate the average of the Box Model.

2. Suppose the standard deviation of the box model is 0.70. If you were to select 50 families from
this neighborhood,

a. How many kids do you expect to find?

b. What is the standard error of the number of kids from this sample?

c. What is the probability that the total number of kids in the sample is below 53?
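
Both parts of this workshop use the same recipe for the sum of draws from a box: the expected value is n ∗ (box average) and the standard error is sqrt(n) ∗ (box SD). The Python sketch below illustrates that recipe. It assumes draws with replacement, a normal approximation for the sum, and no continuity correction; the helper name sum_of_draws is made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev

def sum_of_draws(box, n_draws):
    """Return (expected value, standard error) of the sum of n_draws draws with replacement."""
    return n_draws * mean(box), sqrt(n_draws) * pstdev(box)

# Part A: one Orange ticket (1) for every four non-Orange tickets (0), i.e. 20% Orange.
candy_box = [1, 0, 0, 0, 0]
ev_orange, se_orange = sum_of_draws(candy_box, 100)
p_more_than_25 = 1 - NormalDist(ev_orange, se_orange).cdf(25)

# Part B: number of kids per family.
kids_box = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
ev_kids, se_kids = sum_of_draws(kids_box, 50)
p_below_53 = NormalDist(ev_kids, se_kids).cdf(53)

print(round(ev_orange, 2), round(se_orange, 2), round(p_more_than_25, 4))
print(round(ev_kids, 2), round(se_kids, 2), round(p_below_53, 4))
```

Here pstdev divides by the number of tickets, which matches treating the box as the whole population.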


Workshop 7B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

A. According to the U.S. Department of Transportation, the nationwide seat belt use rate was 90% in 2016.

1. Make a box model for the population proportion of drivers in the USA who wear a seat belt.

2. Find the mean and the standard deviation of the box model in part 1.

3. If you observe 50 drivers, what are the expected value and standard error of the number of drivers
who wear a seat belt?

4. Assume the number of drivers who wear a seat belt is normally distributed. What is the probability
that more than 48 drivers wear a seat belt?


B. In a grocery store, there is a game that you can play if your purchases exceed $75. There is a box
containing 5 chips numbered from 1 to 5. You draw a chip 5 times with replacement. If the sum
of the numbers on the chips exceeds 20, you get $20 cash back. If the sum of the numbers
on the chips is between 10 and 20, you get $10 cash back. If the sum of the numbers on the
chips is below 10, you get no cash back.

The box model for one draw of this game is [1 2 3 4 5], with µ = 3 and σ =1.414.

1. Find the expected value and standard error of the sum, if you play this game.

2. What is the probability that you will get a $20 cash back?

3. What is the probability that you will not get any cash back?


Workshop 8A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION

1. The above chart shows the percentage of adults who list different items as the most frustrating to
hear. The results are based on a survey of 1004 adults conducted by Kelton Research. As an example,
according to this poll, 39% of these adults said that a car alarm is the worst noise to hear, 28%
mentioned a jackhammer, 21% said a baby crying and 13% reported a dog barking as most annoying.
Note that the percentages in the chart add up to 101% because of rounding.

(a.) What is the expected value for the proportion of all adults who consider a car alarm to be the
worst noise to hear?


(b.) What is the standard error for the proportion of all adults who consider a car alarm to be the worst
noise to hear?

(c.) Calculate a 95% confidence interval for the proportion of all adults who consider a car alarm to
be the worst noise to hear.

(d.) Interpret the confidence interval calculated in (c).

(e.) What is the probability that the proportion of all adults who consider a car alarm to be the worst
noise to hear is greater than 0.42?

SAMPLING DISTRIBUTION OF THE SAMPLE MEAN


2. The engines made by Ford for speedboats had an average power of 220 horsepower (HP) and standard
deviation of 15 HP. A potential buyer intends to take a random sample of 30 engines for quality control
check.

(a.) What is the expected value and standard error of the sample mean?

(b.) Suppose the buyer will not place an order if the sample mean is below 215 HP. What is the
probability that the buyer will not place an order?
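
For the speedboat engine question, the sampling distribution of the sample mean has expected value µ and standard error σ / sqrt(n). A minimal Python sketch follows, assuming the normal approximation for the sample mean is acceptable with n = 30; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 220, 15, 30      # population mean, population SD, sample size

expected_mean = mu              # expected value of the sample mean
se_mean = sigma / sqrt(n)       # standard error of the sample mean

# Probability that the sample mean falls below 215 HP (no order is placed).
p_no_order = NormalDist(expected_mean, se_mean).cdf(215)
print(round(se_mean, 3), round(p_no_order, 4))
```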


Workshop 8B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

SAMPLING DISTRIBUTION OF THE SAMPLE MEAN

1. The above chart, based on the U.S. Department of Agriculture's Center for Nutrition Policy and Promotion
study, gives the average cost of raising a child born in 2008 for three different income groups.
For example, families with annual incomes less than $56,870 are expected to spend an average of
$159,870 on their child born in 2008 through age 17. The corresponding average expenditures for
families with annual incomes of $56,870 to $98,470 and families with annual incomes of more than
$98,470 will be $221,190 and $366,660, respectively.

Assume that for the income group, $56,870 to $98,470, the mean expenditure µ = $221,190 as provided
in the chart above is based on population data with a standard deviation of $30,000. Suppose we take
a random sample of 36 such families.


(a.) What is the expected value and standard error of the average expenditures for the random
sample?

(b.) What is the probability that the sample mean expenditure is less than $220,190?

(c.) What is the probability that the sample mean expenditure is greater than $224,190?

SAMPLING DISTRIBUTION OF THE SAMPLE PROPORTION


2. According to the Centers for Disease Control and Prevention (CDC), as of 2015, the percentage of
adults with arthritis in Michigan is 30%. Suppose a random sample of 100 adults who live in Michigan
was selected.
(a.) What is the expected value and standard error of the sample proportion of adults with arthritis
in Michigan?

(b.) What is the chance that the sample proportion of adults with arthritis in Michigan exceeds 32%?

3. A researcher wishes to estimate the proportion of X-ray machines that malfunction and produce
excess radiation. He randomly selected 40 machines and found that 12 machines malfunctioned.
(a.) Calculate a 90% confidence interval for the proportion of X-ray machines that malfunction.

(b.) Interpret the confidence interval constructed in (a).
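
The confidence interval in question 3 can be sketched as sample proportion ± z* ∗ SE, with SE = sqrt(p̂(1 − p̂)/n). The Python function below is an illustration only; its name and signature are made up for this example, and it assumes the large-sample normal interval taught in this course.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    # z* cuts off (1 - confidence) / 2 in each tail of the standard normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z_star * se, p_hat + z_star * se

low, high = proportion_ci(12, 40, 0.90)
print(round(low, 3), round(high, 3))
```

With only 40 machines the interval is wide, which is worth mentioning when interpreting it in part (b).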


Workshop 9A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. A researcher thinks the average time to run a mile for a non-competitive, in-shape runner is less than 8.5
minutes. A sample of 50 runners was taken with a mean (X̄) of 8.3 minutes and a standard deviation
s = 0.8 mins. Researchers are investigating whether this group of runners differs from the population
of runners. Carry out a left-tailed z-test of significance by answering the questions below in as much
detail as possible. Let α =0.10.

(a.) What are the null and alternative hypotheses for the test?

(b.) What is the critical value?

(c.) Compute the Z-test statistic for the mean.

(d.) Draw the critical region in (b) with your z-test statistic in (c) for the standard normal curve. Is
your Z-test statistic lower than your critical value? What does that tell you?


(e.) Calculate the p-value of the test.

(f.) What is your decision and why?

(g.) What is your conclusion in the context of the problem?

2. In a study conducted by Richard Lynn, a British Professor of Psychology, as of 2006, the average IQ
in the United States is 98. To determine if there has been an increase in the average IQ in the US, a
random sample of 70 Americans was selected, and the average was found to be 99, with a standard
deviation of 6. Is this enough evidence to conclude an increase in the average IQ? Test at 5% level of
significance.

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the z-test statistic for the mean.

(c.) Calculate the p-value of the test.

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?
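
The computational steps of both z-tests in this workshop can be sketched in Python. The sketch assumes the sample standard deviation is used in place of σ, as the problems imply, and the helper name z_statistic is made up for this example. It covers only the arithmetic; the hypotheses and the conclusion in context still have to be written out.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, null_mean, sd, n):
    # z = (sample mean - hypothesized mean) / (sd / sqrt(n))
    return (sample_mean - null_mean) / (sd / sqrt(n))

std_normal = NormalDist()

# Problem 1: left-tailed test, alpha = 0.10.
z_runners = z_statistic(8.3, 8.5, 0.8, 50)
critical_value = std_normal.inv_cdf(0.10)   # cuts off 10% in the left tail
p_runners = std_normal.cdf(z_runners)       # area to the left of z

# Problem 2: right-tailed test, alpha = 0.05.
z_iq = z_statistic(99, 98, 6, 70)
p_iq = 1 - std_normal.cdf(z_iq)             # area to the right of z

print(round(z_runners, 3), round(critical_value, 3), round(p_runners, 4))
print(round(z_iq, 3), round(p_iq, 4))
```

In each case the decision comes from comparing the p-value with α: reject the null hypothesis when the p-value is smaller.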


Workshop 9B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. The average height of 7-year-old boys is 48 inches. To verify if there was a change in this value,
a random sample of 40 7-year-old boys in a large daycare was selected, and their heights were
measured. The average height from the sample was found to be 48.8 inches, with a standard deviation
of 2.61 inches. Is this enough evidence to conclude a change in the average height from 48 inches?
Perform a hypothesis test using 5% level of significance.

(a.) What are the null and alternative hypotheses for the test?

(b.) What is the critical value?

(c.) Compute the Z-test statistic for the mean.

(d.) Draw the critical region in (b) with your z-test statistic in (c) for the standard normal curve. Is
your Z-test statistic lower than your critical value? What does that tell you?


(e.) Calculate the p-value of the test.

(f.) What is your decision and why?

(g.) What is your conclusion in the context of the problem?

2. In a study conducted by Quadrant Information Services in 2014, the average annual cost of car
insurance in the United States is $907.38. A market analyst believes this value is underestimated, and
claims the annual cost is higher than $907.38. He randomly selected 35 car owners in Michigan, and
he calculated the average annual cost of car insurance for the sample to be $954.45, with a standard
deviation of $139.75. Does he have evidence to support his claim? Test at 5% level of significance.

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the Z-test statistic for the mean.

(c.) Calculate the p-value of the test.

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?


Workshop 10A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. Suppose a female patient has taken 18 laboratory blood tests during the past two years. In this
sample, the mean hemoglobin count (HC) (in grams per 100 milliliters of whole blood) is 15.1 and
the standard deviation is 2.514. Does this information indicate that the population mean HC for this
patient is higher than 14? Carry out a test of significance by answering the questions below in as
much detail as possible. Use 5% level of significance.

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the test statistic and degrees of freedom (if applicable).

(c.) What is the p-value of the test?

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?


2. Olestra is a fat substitute approved by the FDA for use in snack foods. The article “Gastrointestinal
Symptoms Following Consumption of Olestra or Regular Triglyceride Potato Chips” (Journal of the
American Medical Association, 1998: 150-152) reported that in an experiment, 17.6% of the 529
individuals in the TG control group experienced an adverse GI event, whereas 15.8% of the 563 individuals
in the olestra treatment group experienced such an event. Does it appear that the incidence rate of
GI problems for those who consume olestra chips according to the experimental regimen differs
from the incidence rate for the TG control group? Using α = 0.01, carry out a test of significance by
answering the questions below in as much detail as possible. [Hint: This is a two-sample proportion
problem.]

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the test statistic and degrees of freedom (if applicable).

(c.) What is the p-value of the test?

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?

3. Using the sample data in question (2), construct a 99% confidence interval for the difference between
the incidence rate of GI problems for those who consume olestra chips according to the experimental
regimen and the incidence rate for those in the TG control group.

4. Would the conclusion based on the confidence interval in (3) agree with the conclusion you gave for
the test in (2)?
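
The olestra questions (2 through 4) can be sketched in Python as follows. Two choices here are assumptions that should be checked against the Two Sample Problem chapter: the test statistic uses a pooled proportion for its standard error, while the confidence interval uses the unpooled standard error. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for the difference of two proportions (pooled SE)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def difference_ci(p1, n1, p2, n2, confidence):
    """Confidence interval for p1 - p2 (unpooled SE)."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

# TG control group: 17.6% of 529; olestra group: 15.8% of 563.
z, p_value = two_proportion_z(0.176, 529, 0.158, 563)
ci_low, ci_high = difference_ci(0.176, 529, 0.158, 563, 0.99)
print(round(z, 3), round(p_value, 3), round(ci_low, 4), round(ci_high, 4))
```

For question 4, the test and the interval agree when the interval contains 0 exactly in the cases where the p-value exceeds α.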


Workshop 10B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. According to the World Health Organization, the average incubation period of hepatitis A is about 28
days. A specialist who wants to determine if there has been an increase in this value, took a random
sample of 23 patients with hepatitis A, and found the average incubation period to be 29.5 days, with
a standard deviation of 6.6 days. Is this enough evidence to conclude an increase in the average
incubation period for the disease? Test at 5% level of significance.

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the test statistic and degrees of freedom (if applicable).

(c.) What is the p-value of the test?

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?


2. The World Health Organization defines mental health as “a state of well-being in which the individual realizes
his or her own abilities, can cope with the normal stresses of life, can work productively and fruitfully,
and is able to make a contribution to his or her community”. Several measures are used to
assess an individual’s mental health. In Kalamazoo, the CDC reported that 14.9% of 200 individuals
aged 18 years and above, who were randomly sampled and monitored for a period of time, considered
their mental health to be “not good”, while in Lansing, the sample proportion among 200
individuals was 14.2%. Is this enough evidence to conclude a significant difference in the
proportion of individuals with “not good” mental health between the two cities? Test at the 5% level of
significance.

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the test statistic and degrees of freedom (if applicable).

(c.) What is the p-value of the test?

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?

3. Using the sample data in question (2), construct a 95% confidence interval for the difference in the
proportion of individuals with “not good” mental health between the two cities.

4. Would the conclusion based on the confidence interval in (3) agree with the conclusion you gave for
the test in (2)?


Workshop 11A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. A study was conducted to determine if there is a difference in the average systolic blood pressure
between men and women. A random sample of 35 men and 40 women were selected, and their blood
pressures were taken. Does the data provide evidence to show a difference in the average systolic
blood pressure between the two groups? Test at 1% level of significance.

          n     Mean      Standard Deviation
Men       35    124.71    14.53
Women     40    125.18    10.35

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the test statistic and degrees of freedom (if applicable).

(c.) What is the p-value of the test?

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?


2. Using the sample data in question (1), construct a 99% confidence interval for the difference in mean
level of systolic blood pressure (SBP) (mm Hg) between Men and Women.

3. Would the conclusion based on the confidence interval in (2) agree with the conclusion you gave for
the test in (1)?

4. A random sample of 16 counties in the Midwest gave the following information about birth rate and
death rate per 1000 resident population. Do the data indicate a difference between population average
birth rate and death rate in this region? Proceed by answering the questions below in as much detail
as possible. Use 5% level of significance.

                      n     Mean     Standard Deviation
A: Birth Rate         16    12.91    1.618
B: Death Rate         16    11.81    2.712
Difference = A - B    16    1.10     3.745

Source: Brase & Brase. Understandable Statistics (8th edition), p. 537.

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the test statistic and degrees of freedom (if applicable).

(c.) What is the p-value of the test?

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?
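
The two test statistics in this workshop can be sketched in Python. Question 1 compares two independent samples; question 4 is treated here as a paired problem because the table supplies the mean and standard deviation of the per-county differences. These readings, the unpooled standard error, and the function names are assumptions of this sketch. The p-values are left to the tables or software used in class.

```python
from math import sqrt

def two_sample_statistic(mean1, sd1, n1, mean2, sd2, n2):
    # Independent samples: SE of the difference combines both sample variances.
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

def paired_t(mean_diff, sd_diff, n):
    # Paired samples: a one-sample t statistic on the differences, df = n - 1.
    return mean_diff / (sd_diff / sqrt(n)), n - 1

# Question 1: systolic blood pressure, men versus women.
stat_bp = two_sample_statistic(124.71, 14.53, 35, 125.18, 10.35, 40)

# Question 4: birth rate minus death rate across 16 counties.
t_rates, df_rates = paired_t(1.10, 3.745, 16)

print(round(stat_bp, 3), round(t_rates, 3), df_rates)
```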


Workshop 11B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack
faster on average than the machine currently being used. To test this claim, the times (in seconds)
it takes each machine to pack 35 cartons are recorded. Is this enough evidence to conclude that the
average packing time of the old machine is greater than that of the new machine? Let µ1 be the average
packing time for the old machine, and µ2 be the average packing time for the new machine. Test at the
1% level of significance.

               n     Mean     Standard Deviation
Old Machine    35    43.23    1.75
New Machine    35    42.14    1.683

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the test statistic and degrees of freedom (if applicable).

(c.) What is the p-value of the test?

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?


2. Using the sample data in question (1), construct a 99% confidence interval for the difference in mean
packing time between the old machine and the new machine.

3. Would the conclusion based on the confidence interval in (2) agree with the conclusion you gave for
the test in (1)?

4. Trace metals in drinking water affect the flavor, and an unusually high concentration can pose a
health hazard. Ten locations were randomly selected, and the zinc concentrations in bottom water
and surface water were measured at each. Do the data suggest that the true average concentration in the
bottom water exceeds that of surface water? Test at the 5% level of significance.

                           n     Mean      Standard Deviation
Bottom Water               10    0.5649    0.1468
Surface Water              10    0.4845    0.1312
Diff (Bottom - Surface)    10    0.0804    0.1523

(a.) What are the null and alternative hypotheses for the test?

(b.) Compute the test statistic and degrees of freedom (if applicable).

(c.) What is the p-value of the test?

(d.) What is your decision and why?

(e.) What is your conclusion in the context of the problem?


Workshop 12A
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. A researcher wants to test whether the distribution of BMI ranges has changed over the past few years.
He selected a random sample of 75 individuals and measured their BMI. Proceed by answering the
questions below in as much detail as possible. Use the 5% level of significance.

BMI Range                     Percentages in 2005    Observed Counts    Expected Counts
1: Underweight (<18.5)        3%                     5                  ______
2: Normal (18.5-24.9)         67%                    52                 ______
3: Overweight (25.0-29.9)     23%                    8                  ______
4: Obese (>30)                7%                     10                 ______
Total                         100%                   75                 ______

(a) What are the null and alternative hypotheses for the test?

(b) If the null hypothesis is true, what are the expected cell counts? Fill in the table above.


(c) Compute the test statistic. How many degrees of freedom does it have?

(d) What is the p-value of the test?

(e) What is your decision?

(f) What is your conclusion in the context of the problem?
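
Parts (b) through (d) of the BMI question can be sketched in Python. The expected counts are the sample size times the 2005 percentages, and the statistic is the sum of (observed − expected)² / expected with df = (number of categories) − 1. Because the standard library has no chi-square distribution, the p-value uses a closed-form upper-tail formula that is valid only for exactly 3 degrees of freedom; this shortcut and the function names are choices made for this sketch, not part of the course material.

```python
from math import erfc, exp, pi, sqrt

def chi_square_statistic(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

observed = [5, 52, 8, 10]
null_shares = [0.03, 0.67, 0.23, 0.07]            # 2005 percentages

n = sum(observed)
expected = [n * share for share in null_shares]   # expected counts under H0
chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1
p_value = chi_square_sf_df3(chi2)

print([round(e, 2) for e in expected], round(chi2, 3), df, round(p_value, 4))
```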

2. Use the following data to perform a test of independence between quality of life of patients (evaluated
a year after a heart attack) and their nationality. Proceed by answering the questions below in as much
detail as possible. Use 5% level of significance.

Quality of Life    Canada    United States    Total
Better             146       1039             1185
About the Same     96        779              875
Worse              69        347              416
Total              311       2165             2476

(a) What are the null and alternative hypotheses for the test?


(b) If the null hypothesis is true, what are the expected cell counts? Fill in the table below.

Quality of Life    Canada    United States    Total
Better             ______    ______           1185
About the Same     ______    ______           875
Worse              ______    ______           416
Total              311       2165             2476

(c) Compute the test statistic. How many degrees of freedom does it have?

(d) What is the p-value of the test?

(e) What is your decision?

(f) What is your conclusion in the context of the problem?
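
For the test of independence, each expected count is (row total ∗ column total) / grand total, and df = (rows − 1) ∗ (columns − 1). The Python sketch below illustrates this for the quality-of-life table. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the chi-square upper-tail probability is exp(−χ²/2); that shortcut and the function names are choices made for this sketch.

```python
from math import exp

def expected_counts(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    expected = expected_counts(table)
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: Better, About the Same, Worse. Columns: Canada, United States.
quality_of_life = [[146, 1039], [96, 779], [69, 347]]
chi2, df = chi_square_independence(quality_of_life)

# Upper-tail probability; this closed form holds only when df == 2.
p_value = exp(-chi2 / 2)
print(round(chi2, 3), df, round(p_value, 4))
```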


Workshop 12B
Submitted by:

Name: Signature:

Name: Signature:

Name: Signature:

Name: Signature:

1. A company sells tulip bulbs and claims that it has equal numbers of white, pink, yellow, and
purple bulbs, and that when customers place an order, it randomly selects the bulbs from its
stock. You ordered 30 bulbs and received 14 white, 6 pink, 7 yellow, and 3 purple bulbs. Is
there evidence that the bulbs you received were not randomly selected from a population with equal
proportions of white, pink, yellow, and purple bulbs? Test at the 5% level of significance.

(a) What are the hypotheses of the test?

(b) Assuming the null hypothesis is true, what are the expected counts?

(c) Compute the test statistic. How many degrees of freedom does it have?

(d) What is the p-value of the test?


(e) What is your decision?

(f) What is your conclusion in the context of the problem?

2. The operations manager of a company that manufactures tires wants to determine whether there are
any differences in the quality of workmanship between the two daily shifts. She randomly selected 496
tires and carefully inspected them. Each tire was classified as perfect, satisfactory, or defective, and the
shift that produced it was also recorded. Do the data provide evidence at the 5% level of significance
that the quality of a tire depends on the shift that produced it?

Quality of Tire    Shift 1    Shift 2    Total
Perfect            106        104        210
Satisfactory       124        157        281
Defective          1          4          5
Total              231        265        496

(a) What are the null and alternative hypotheses for the test?

(b) If the null hypothesis is true, what are the expected cell counts? Fill in the table below.

Quality of Tire    Shift 1    Shift 2    Total
Perfect            ______     ______     210
Satisfactory       ______     ______     281
Defective          ______     ______     5
Total              231        265        496


(c) Compute the test statistic. How many degrees of freedom does it have?

(d) What is the p-value of the test?

(e) What is your decision?

(f) What is your conclusion in the context of the problem?

Answer Key

CHAPTER 3
1. This conclusion is not supported by the data. The comparison is not fair because other
factors influence the students’ performance on the assessment test. For example, the private
school would likely have more students from families of higher socio-economic status, who tend to have
more time to spend studying and the means to hire private tutors to help them
prepare for the test, to name a few. Even if the schools provided comparable education,
we would expect to see this type of result. This is an observational study.
2. Considering the characteristics of a good controlled experiment, the description does not indicate
whether participants were randomized into the different treatment groups. The sailors
who participated appear to have known which treatment they were receiving, hence the study
is not blind. No details are given on whether the person who assessed the sailors knew which
treatment each had received, hence it is not clear whether the experiment is double-blind. As
for comparison groups, the experiment includes six treatment groups, so this aspect
is satisfied. Also, the sailors received the same diet, which reduces the effect of confounders on the
results. However, the length of treatment differed, since one
of the groups ran out of fruit after six days, so the results are not directly comparable.
3. This is an observational study. The comparison is not really fair since students who elect to take
UNIV 1010 are probably more concerned about their education, more motivated and more likely to
work hard, hence concluding that the course was successful is not really warranted by the data.
4. This is an observational study. Parents who provide music training for their kids probably have money
to pay for it, are more concerned about their child’s growth, encourage and support them, etc. It is
no wonder that such children do well in other areas. Their success could be due to their parents’
influence and not caused by the training.
5. This is an observational study. This is not a strong conclusion to make based on the data available.
It seems reasonable, but perhaps people who do puzzles are different in important ways from those
who do no puzzles. Maybe their success with puzzles (so they continue to do them) indicates higher
mental abilities from the start.
6. There seems to be no causal connection between drinking diet sodas and obesity. Surely those who
are overweight would tend to choose diet, sugar-free drinks, but weight gain could also be due to
other factors not covered in this study, which could have led to weight gain while drinking diet soda.
7. This conclusion is not appropriate based on the data. Students who chose to take the online version
are probably different from those electing the regular class. They may be more independent and
confident in their abilities. If the classes do about the same on the test, this may indicate that students
in the regular section need that structure to do well. They may do less well in an online class. A more
carefully designed experiment should be done to get better information and stronger evidence to
support such a conclusion.


8. The article’s assessment appears inaccurate, as it failed to take into account the proportions
of male and female registered voters in each district. With this additional information, it turns out
that the percent participation of female registered voters in each district is actually higher than
that of the males.

CHAPTER 5
1. Recall the data table:
Age Hourly Workers Salaried Workers
18-21 0.5% 0.1%
21-31 6.1% 13.3%
31-41 20.4% 26.9%
41-51 43.3% 33.6%
51-61 25.4% 23.7%
61-65 3.5% 2.1%
65-75 0.8% 0.3%

The histograms for the two groups are:


2. The Pareto chart for causes of car problems is:

Based on the Pareto chart, the majority of the car problems are caused by brake plates and brake lines.

3. The Frequency Table for fuel cost is:

Fuel Cost Frequency


0-10 13
10-15 17
15-30 6

Based on the table, the corresponding histogram is given as:


4. The Frequency Distribution Table is given as:

Height Frequency Relative Frequency


48-50 5 0.1250
51-53 5 0.1250
54-56 4 0.1000
57-58 3 0.0750
59-60 9 0.2250
61-62 6 0.1500
63-64 4 0.1000
65-68 4 0.1000
The histogram using the relative frequency is:

5. The histogram for the cost of a year of college for 800 freshmen is given as:


6. The histogram of the GPA of students at a college is given as:

7. (a) % of data in an interval = frequency of the interval / total frequency
= 5/25 = 0.20 or 20%
(b) % with at least 28 fries = sum of frequencies from 28 up / total frequency
= (7 + 5 + 4 + 2 + 1)/25
= 19/25 = 0.76 or 76%
8. (a) % of data in an interval = frequency of the interval / total frequency
= 3/10 = 0.30 or 30%
(b) % with at most 6 times = sum of frequencies from 6 down / total frequency
= (1 + 4 + 3)/10
= 8/10 = 0.80 or 80%
9. (a) Based on the boxplot, the distribution of the number of visits per year is symmetric.
(b) The value where 50% of the number of visits are less is the median, which is 5.

CHAPTER 6
1. \(\text{mean} = \dfrac{23 + 25 + 28 + 29 + 30}{5} = 27\).
median is 28 (the middle number of the ordered data set).

xi     xi − µ    (xi − µ)^2
23     -4        16
25     -2        4
28     1         1
29     2         4
30     3         9
µ = 27           SS = 34


\(\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{n}} = \sqrt{\dfrac{SS}{n}} = \sqrt{\dfrac{34}{5}} = 2.61\)
2. \(\text{mean } \bar{x} = \dfrac{18 + 5 + 15 + 22 + 11 + 19}{6} = 15\).
ordered data set: 5, 11, 15, 18, 19, 22
Since there is an even number of observations, the median is the average of the two middle numbers.
\(\text{Median} = \dfrac{15 + 18}{2} = 16.5\)

xi     xi − x̄    (xi − x̄)^2
18     3         9
5      -10       100
15     0         0
22     7         49
11     -4        16
19     4         16
x̄ = 15           SS = 190

\(s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{SS}{n - 1}} = \sqrt{\dfrac{190}{6 - 1}} = 6.16\)
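The arithmetic in problem 2 can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are illustrative and are not part of the course pack.

```python
import math
import statistics

data = [18, 5, 15, 22, 11, 19]

mean = statistics.mean(data)              # arithmetic mean
median = statistics.median(data)          # average of the two middle values
ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations (SS)

sample_sd = math.sqrt(ss / (len(data) - 1))   # sample SD: divide SS by n - 1
population_sd = math.sqrt(ss / len(data))     # population SD: divide SS by n

print(mean, median, ss, round(sample_sd, 2), round(population_sd, 2))
```

The sample SD divides by n − 1, as in problem 2, while the population SD divides by n, as in problem 1.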

3. 0: If all the data values are the same, then x̄ equals that common value. Therefore, xi − x̄ = 0 for
all xi and SS = 0. That is, SD = 0.
4. \(\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}\). Let the constant be denoted by c.
\(\text{mean}_{New} = \dfrac{(x_1 + c) + (x_2 + c) + \cdots + (x_n + c)}{n} = \dfrac{x_1 + x_2 + \cdots + x_n + n c}{n} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} + \dfrac{n c}{n} = \bar{x} + c = \text{mean}_{Old} + \text{constant}\).
Median is the middle number of the ordered data set. Since we add a constant to all the numbers,
medianNew = medianOld + constant.
SDNew = SDOld, because xNew − meanNew = (x + c) − (x̄ + c) = x − x̄.
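5. \(\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}\). Let the constant be denoted by c.
\(\text{mean}_{New} = \dfrac{c \cdot x_1 + c \cdot x_2 + \cdots + c \cdot x_n}{n} = \dfrac{c (x_1 + x_2 + \cdots + x_n)}{n} = c \cdot \bar{x} = \text{constant} \cdot \text{mean}_{Old}\).
Median is the middle number of the ordered data set. Since we multiply all the numbers by a constant,
medianNew = constant · medianOld.
SDNew = constant · SDOld (for a positive constant), because \(x_{New} - \text{mean}_{New} = c x - c \bar{x} = c (x - \bar{x})\)
and \((x_{New} - \text{mean}_{New})^2 = c^2 (x - \bar{x})^2\).
Therefore, \(SS_{New} = c^2 \cdot SS_{Old}\) and \(SD_{New} = \sqrt{\dfrac{SS_{New}}{n}} = \sqrt{\dfrac{c^2 \cdot SS_{Old}}{n}} = c \cdot \sqrt{\dfrac{SS_{Old}}{n}}\) for the population SD.
Replace n with n − 1 for the sample SD. For a negative constant, the factor is |c|, since an SD is never negative.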
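The results of problems 4 and 5 can be illustrated numerically. The sketch below reuses the data from problem 1 and a hypothetical constant c = 3 chosen only for this illustration; the helper name `summary` is likewise an assumption.

```python
import statistics

data = [23, 25, 28, 29, 30]
c = 3  # hypothetical constant, chosen only for illustration

shifted = [x + c for x in data]   # add the constant to every value
scaled = [c * x for x in data]    # multiply every value by the constant

def summary(values):
    """Return (mean, median, population SD) of a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.pstdev(values))

mean_d, median_d, sd_d = summary(data)
mean_shift, median_shift, sd_shift = summary(shifted)
mean_scale, median_scale, sd_scale = summary(scaled)

print(summary(data))
print(summary(shifted))   # mean and median move by c, SD is unchanged
print(summary(scaled))    # mean, median and SD are all multiplied by c
```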
6. A: the “balance” point of the histogram is closer to the original mean because of all of the “weight”
of the data around the mean.

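7. \(\bar{x} = \dfrac{\sum x}{n}\). That is, \(\sum x = n \cdot \bar{x}\).
\(\bar{x}_{NEW,A} = \dfrac{\sum x}{n} = \dfrac{25 \cdot 8 + 26}{26} = \dfrac{113}{13} = 8.69\)
\(\bar{x}_{NEW,B} = \dfrac{\sum x}{n} = \dfrac{50 \cdot 8 + 26}{51} = \dfrac{426}{51} = 8.35\)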


8. Neither: Replacing the maximum with a value greater than the maximum will not affect the middle
value.
9. Cutting approx. 2.5% off either end, you make cuts at 7.5 and 13, so the estimated \(SD = \dfrac{13 - 7.5}{4} = 1.375\)
10. C: ∑ xi = n · x̄ = 125 · 40000 = 5 million
11. The distribution of salaries is skewed to the right (not mound shaped) because of a few huge salaries,
so the empirical rule doesn’t apply.

12. A. NBA, since it has a higher average and the highest median salary.
B. NBA, since it has the highest standard deviation
C. Median, since the salary varies greatly among players. Average could easily be affected when a
star player is randomly selected, since his salary is higher compared to the other players.
13. By the empirical rule, 75 + 8 = 83 is 1 SD above the mean, so about 68/2 = 34% of the data lie between
75 and 83, while 75 + 2(8) = 91 is 2 SD above the mean, so about 95/2 = 47.5% of the data lie between
75 and 91. Subtracting the two percentages, about 47.5% − 34% = 13.5% of the data lie between 83 and 91.

CHAPTER 7
1.
(a)


\(Z = \dfrac{x - \text{mean}}{\text{sd}} = \dfrac{6.2 - 5}{0.8} = 1.5\)
Area below 1.5 is Area(Z < 1.5) = 0.9332. Area above 1.5 is Area(Z > 1.5) = 1 − 0.9332 = 0.0668.
About 6.7% of the data lie above the value X = 6.2.

(b)

\(Z = \dfrac{x - \text{mean}}{\text{sd}} = \dfrac{3.8 - 5}{0.8} = -1.5\)
Area below −1.5 is Area(Z < −1.5) = 0.0668.
About 6.7% of the data lie below the value X = 3.8.
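The table lookups in problem 1 can be cross-checked in Python. This is a sketch that assumes the standard library's `statistics.NormalDist` in place of the printed Z table; it computes exact normal areas, so later digits can differ slightly from table values.

```python
from statistics import NormalDist

mean, sd = 5, 0.8
dist = NormalDist(mu=mean, sigma=sd)

z_high = (6.2 - mean) / sd        # standard units for X = 6.2
area_above = 1 - dist.cdf(6.2)    # area to the right of 6.2
area_below = dist.cdf(3.8)        # area to the left of 3.8

print(round(z_high, 2), round(area_above, 4), round(area_below, 4))
```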
2.
(a)

90th percentile means area to the left is 0.9000.


In the Z table, 0.8997 is the closest area for 0.9000.
Corresponding z value for area 0.8997 is 1.28. x = mean + z · sd = 5 + 1.28(0.8) = 6.024.
90th percentile is at 6.024.


(b)

15th percentile means area to the left is 0.1500. In the Z table, 0.1492 is the closest area for
0.1500. Corresponding z value for area 0.1492 is -1.04. x = mean + z · sd = 5 + (−1.04)(0.8) = 4.168.
15th percentile is at 4.168.
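Percentiles can be found without a table by inverting the normal curve. The sketch below assumes `statistics.NormalDist.inv_cdf`; because it uses exact z values (about 1.2816 and −1.0364) rather than the rounded table values 1.28 and −1.04, its answers differ from the ones above in the third decimal place.

```python
from statistics import NormalDist

dist = NormalDist(mu=5, sigma=0.8)

p90 = dist.inv_cdf(0.90)   # value with 90% of the area to its left
p15 = dist.inv_cdf(0.15)   # value with 15% of the area to its left

print(round(p90, 3), round(p15, 3))
```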
3. c): X0 = X̄ + 1.28(SDX ). That is, corresponding Z value for X0 is 1.28. Percent of weeks the driver drives
more than X0 is area above X0 which is similar to area above Z = 1.28. Using Z table, area below 1.28
is 0.8997. Area above 1.28 is 1 − .8997 = .1003 or 10%

4. c): by definition, the area to the left of X67th is 0.67. The chance that the driver will drive more than X67th
is area above x67th , which is 1-0.67 = 0.33 or 33%.

5. (rate ≥ 0.7064 homeless persons per 1,000 population): Top 10% means area to the right is 0.1000. Then area
to the left is 90%, so X is at 90th percentile. In the Z table, 0.8997 is the closest area for 0.9000.
Corresponding z value for area 0.8997 is 1.28. x = mean + z · sd = 0.54 + 1.28(0.13) = 0.7064.

6.
(a)

\(Z = \dfrac{x - \text{mean}}{\text{sd}} = \dfrac{25000 - 28000}{2000} = -1.5\). Area below −1.5 is Area(Z < −1.5) = 0.0668.
About 6.7% of salaries were below $25000.


(b)

\(Z_1 = \dfrac{x - \text{mean}}{\text{sd}} = \dfrac{27500 - 28000}{2000} = -0.25\).
\(Z_2 = \dfrac{x - \text{mean}}{\text{sd}} = \dfrac{30500 - 28000}{2000} = 1.25\).
Area below 1.25 is Area(Z < 1.25) = 0.8944. Area below −0.25 is Area(Z < −0.25) = 0.4013. Therefore,
area between −0.25 and 1.25 is Area(−0.25 < Z < 1.25) = Area(Z < 1.25) − Area(Z < −0.25) = 0.8944 −
0.4013 = 0.4931.
About 49.3% of salaries were between $27500 and $30500.
7. c) Center of distribution A is above 0 and center of distribution B is around 0. That is X̄A > X̄B .
Distribution B is more spread out compared to A. Therefore, SDA < SDB .
8. c): Zx1 = Zx2 = 1. Z score corresponding to X1 and X2 is 1. Therefore, area below X1 is the same as area
below X2 .
9. b): Z score for distribution C is \(Z_C = \dfrac{1 - 0}{SD_C} = \dfrac{1}{SD_C}\).
Z score for distribution D is \(Z_D = \dfrac{1 - 0}{SD_D} = \dfrac{1}{SD_D}\).
Since \(SD_C < SD_D\), \(\dfrac{1}{SD_C} > \dfrac{1}{SD_D}\). That is, \(Z_C > Z_D\). Therefore, area below \(Z_C\) is greater than
area below \(Z_D\).
10.
(a)

\(Z = \dfrac{x - \text{mean}}{\text{sd}} = \dfrac{45 - 30}{9} = 1.6667 \approx 1.67\).
Area below 1.67 is Area(Z < 1.67) = 0.9525. Area above 1.67 is Area(Z > 1.67) = 1 − 0.9525 = 0.0475.
About 4.75% of fish have a PBC level above 45 ppm.


(b)

25th percentile means area to the left is 0.2500. In the Z table, 0.2514 is the closest area for 0.2500.
Corresponding z value for area 0.2514 is -0.67. x = mean + z · sd = 30 + (−0.67)(9) = 23.97ppm.
25th percentile of the PBC levels is 24 ppm.

75th percentile means area to the left is 0.7500. In the Z table, 0.7486 is the closest area for
0.7500. Corresponding z value for area 0.7486 is 0.67. x = mean + z · sd = 30 + (0.67)(9) = 36.03ppm.
75th percentile of the PBC levels is 36 ppm.
11.
(a)


\(Z = \dfrac{x - \text{mean}}{\text{sd}} = \dfrac{20 - 18.2}{0.9} = 2\).
Area below 2 is Area(Z < 2) = 0.9772. Area above 2 is Area(Z > 2) = 1 − 0.9772 = 0.0228.
There is a 0.0228 chance of a box having more than 20 ounces of cereal.
(b)

Area to the left of some weight X is 0.3000. In the Z table, 0.3015 is the closest area for 0.3000.
Corresponding z value for area 0.3015 is -0.52. x = mean + z · sd = 18.2 + (−0.52)(0.9) = 17.732.
30% of the boxes will have a weight less than 17.73 ounces.
12.
(A) TRUE: $250 is the center of the distribution.
(B) a. I only:
Z score corresponding to $400 is greater than z score corresponding to $300. Therefore, area
below $400 is greater than area below $300. Therefore, I is correct.
Z score corresponding to $450 is greater than z score corresponding to $350. Therefore, area
below $450 is greater than area below $350 and area above $450 is less than area above $350
(total area is 1). Therefore, II is incorrect.

CHAPTER 8
1.
(a)

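X     Y     X − µX    Y − µY    (X − µX)^2    (Y − µY)^2
3     6     -4        2         16            4
8     4     1         0         1             0
10    2     3         -2        9             4
µX = 21/3 = 7    µY = 12/3 = 4    SSX = 26    SSY = 8

\(\sigma_X = \sqrt{\dfrac{SS_X}{n}} = \sqrt{\dfrac{26}{3}} = 2.94\)
\(\sigma_Y = \sqrt{\dfrac{SS_Y}{n}} = \sqrt{\dfrac{8}{3}} = 1.63\)
\(R = \dfrac{\sum Z_X \cdot Z_Y}{n} = \dfrac{-2.9214}{3} = -0.97\)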


X Y ZX ZY ZX · ZY
3 6 -1.3605 1.2270 -1.6694
8 4 0.3401 0.0000 0.0000
10 2 1.0204 -1.2270 -1.2520
µX = 7 µY = 4 -2.9214

(b)

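X     Y     X − µX    Y − µY    (X − µX)^2    (Y − µY)^2    X·Y
4     2     -2        -2        4             4             8
7     4     1         0         1             0             28
5     3     -1        -1        1             1             15
9     7     3         3         9             9             63
5     4     -1        0         1             0             20
µX = 30/5 = 6    µY = 20/5 = 4    SSX = 16    SSY = 14    ∑ X·Y = 134

\(\sigma_X = \sqrt{\dfrac{SS_X}{n}} = \sqrt{\dfrac{16}{5}} = 1.79\)
\(\sigma_Y = \sqrt{\dfrac{SS_Y}{n}} = \sqrt{\dfrac{14}{5}} = 1.67\)
\(COV = \dfrac{\sum X \cdot Y}{n} - \mu_X \cdot \mu_Y = \dfrac{134}{5} - (6 \cdot 4) = 2.8\)
\(R = \dfrac{COV}{\sigma_X \cdot \sigma_Y} = \dfrac{2.8}{1.79 \cdot 1.67} = 0.94\)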
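The covariance method used in part (b) translates directly into code. The following is a minimal sketch with illustrative variable names; it divides by n throughout, matching the population formulas in this answer.

```python
import math

xs = [4, 7, 5, 9, 5]
ys = [2, 4, 3, 7, 4]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Population SDs: square root of SS / n
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

# Covariance: average of the products minus the product of the averages
cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y

r = cov / (sd_x * sd_y)
print(round(cov, 2), round(r, 2))
```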

2. Clockwise from top left: (0.15 − 0.25), (0.75 − 0.85), (0.85 − 0.95), and (0.55 − 0.65)
3. Positive means when one increases, so does the other. Negative means when one increases, the
other decreases, and vice-versa.
4.


The data clouds are identical in shape, but displaced vertically. When combined the new overall
shape will be a “stouter” oval, thus decreasing the correlation.
5. b) less than 0: The correlation for the subgroups need not be the same as, or even close to, the correlation
coefficient for the whole group. A picture of the data clouds helps to answer this question. This is like the
second graph under comments, except the slope of the line of circles is downward (negative).

6.

Game 1 (X) Game 2 (Y) X − µX Y − µY (X − µX )2 (Y − µY )2 X ·Y


185 162 38.875 -4 1511.2656 16 29970
100 157 -46.125 -9 2127.5156 81 15700
126 138 -20.125 -28 405.0156 784 17388
137 167 -9.125 1 83.2656 1 22879
190 210 43.875 44 1925.0156 1936 39900
156 169 9.875 3 97.5156 9 26364
147 157 0.875 -9 0.7656 81 23079
128 168 -18.125 2 328.5156 4 21504
µX = 146.125 µY = 166 SSX = 6478.8750 SSY = 2912 ∑ X ·Y = 196784

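\(\mu_X = 1169/8 = 146.125\), \(\mu_Y = 1328/8 = 166\)
\(\sigma_X = \sqrt{\dfrac{SS_X}{n}} = \sqrt{\dfrac{6478.875}{8}} = 28.458\)
\(\sigma_Y = \sqrt{\dfrac{SS_Y}{n}} = \sqrt{\dfrac{2912}{8}} = 19.079\)
\(COV = \dfrac{\sum X \cdot Y}{n} - \mu_X \cdot \mu_Y = \dfrac{196784}{8} - (146.125 \cdot 166) = 341.25\)
\(R = \dfrac{COV}{\sigma_X \cdot \sigma_Y} = \dfrac{341.25}{28.458 \cdot 19.079} = 0.629\)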


R = 0.629; There is a strong positive relationship between the scores in the two games.
7. Positive: grade point average, hours spent studying. Negative: grade point average, hours spent
partying.
8.
(a) zero, as we don’t expect to see any relationship here
(b) negative (in general): if people aren’t working, they aren’t paying taxes
(c) positive: hurricanes cause damages to property for which people file insurance claims
(d) negative: high school students using illegal drugs tend to perform poorly, are sometimes
expelled and/or arrested, both which can affect the graduation rates.
9. Scatter plot B: correlation measures how close data fits to a straight line. The actual correlations
here are:
A: 0.687; B: 0.928.
10. FALSE. Correlation does not imply causation. Thus, even though there is an association between the
two, we cannot conclude that eating breakfast causes a high score on the exam.
11. No, it is not warranted, since criticism would not necessarily result in higher performance ratings
even though an association may seem to exist between the two.

CHAPTER 9
1.
(A) Answer (b) 165 seconds < time < 185 seconds.
When X is known, \(Z_X = \dfrac{X - \bar{X}}{SD_X}\). Then \(Z_Y = R \cdot Z_X\) and the predicted Y value is \(Y_{predicted} = \bar{Y} + Z_Y \cdot SD_Y\).
Here, R is unknown but positive (0 < R < 1). \(Z_X = \dfrac{86 - 78}{8} = 1\). Therefore, \(0 < Z_Y = R \cdot Z_X < 1\).
The predicted Y value is \(\bar{Y} = 165\) when \(Z_Y = 0\). When \(Z_Y = 1\), the predicted Y value is \(\bar{Y} + 1 \cdot SD_Y = 165 + 20 = 185\).
Therefore \(\bar{Y} < Y_{predicted} < \bar{Y} + SD_Y\). That is, \(165 < Y_{predicted} < 185\).
(B) Answer (c) 135 seconds < time < 165 seconds.
\(Z_X = \dfrac{66 - 78}{8} = -1.5\). Therefore, \(-1.5 < Z_Y = R \cdot Z_X < 0\). The predicted Y value is \(\bar{Y} = 165\) when \(Z_Y = 0\).
When \(Z_Y = -1.5\), the predicted Y value is \(\bar{Y} - 1.5 \cdot SD_Y = 165 - 1.5 \cdot 20 = 135\). Therefore
\(\bar{Y} - 1.5 \cdot SD_Y < Y_{predicted} < \bar{Y}\). That is, \(135 < Y_{predicted} < 165\).
2. (a) \(RMS = \sqrt{1 - R^2} \cdot SD_Y\). Since 0 < R < 1, \(0 < \sqrt{1 - R^2} < 1\). Therefore, \(0 < RMS = \sqrt{1 - R^2} \cdot SD_Y < SD_Y\),
meaning that 0 < RMS < $100k.


3. (c)

70th percentile means area to the left is 0.7000. In the Z table, 0.6985 is the closest area for 0.7000.
Corresponding Z value for area 0.6985 is 0.52. Therefore, ZX = 0.52.
ZY = R · ZX = 0.3 · 0.52 = 0.156 ≈ 0.16. Using Z table, corresponding area for z value 0.16 is 0.5636.
Therefore, predicted Y value is at 56th percentile.
4.
(a) Predicted Y value is 128.
x = 75, then \(Z_X = \dfrac{75 - 81}{3} = -2\).
\(Z_Y = R \cdot Z_X = 0.3 \cdot (-2) = -0.6\).
\(Y_{Predicted} = \bar{y} + Z_Y \cdot SD_Y = 131 + (-0.6) \cdot 5 = 128\).

(b) 55th percentile:

67th percentile means area to the left is 0.6700. In the Z table, corresponding Z value for area
0.6700 is 0.44. Therefore, ZX = 0.44.
ZY = R · ZX = 0.3 · 0.44 = 0.132 ≈ 0.13. Using Z table, corresponding area for z value 0.13 is 0.5517.
Therefore, predicted Y value is at 55th percentile.


5. Less than 1/2: For X = 70 seconds, the Y values have an approximately normal distribution with
mean equal to the predicted Y, Ypredict = 149 seconds, and standard deviation equal to the RMS. Since 145 is
below this mean, the proportion of runners with 800m race times below 145 seconds is less than 1/2.
6. More than 2/3: The Y values have an approximately normal distribution with mean equal to the
predicted Y, Ypredict = 165 seconds, and standard deviation equal to the RMS. The proportion of runners
with 800m race times between 145 seconds and 185 seconds is more than 2/3, because the RMS
is less than 20 and, by the mound shape, about 68% of the data lie within one RMS of the mean.
The range 145 to 185 is wider than the range 165 − RMS to 165 + RMS.
7.
(a) X = 41, \(Z_X = \dfrac{41 - 49}{8} = -1\)
\(Z_Y = R \cdot Z_X = 0.8 \cdot (-1) = -0.8\)
\(Y_{predict} = \bar{Y} + Z_Y \cdot SD_Y = 5600 + (-0.8) \cdot 500 = 5200\)

(b) When a salesperson answers 57 calls, total sales has a normal distribution with mean equal to
the predicted Y value when x = 57 and standard deviation equal to the RMS.
For x = 57, \(Z_X = \dfrac{57 - 49}{8} = 1\), so \(Z_Y = 0.8 \cdot 1 = 0.8\) and \(Y_{predict} = 5600 + 0.8 \cdot 500 = 6000\).
\(RMS = \sqrt{1 - 0.8^2} \cdot 500 = 300\)
Therefore, total sales has a normal distribution with mean $6000 and SD = $300.
\(Z_Y = \dfrac{6300 - 6000}{300} = 1\), so the area to the left of Z = 1 is 0.8413. Area above Z = 1 is 1 − 0.8413 = 0.1587
or 15.9%.
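The two-step pattern in problem 7 (predict Y from X, then treat the RMS error as the SD around that prediction) can be sketched in Python. The function names `predict` and `rms_error` are hypothetical helpers introduced here, and `statistics.NormalDist` stands in for the Z table.

```python
import math
from statistics import NormalDist

def predict(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predicted Y: convert X to standard units, multiply by R, convert back."""
    z_x = (x - mean_x) / sd_x
    return mean_y + r * z_x * sd_y

def rms_error(sd_y, r):
    """RMS error of the regression predictions."""
    return math.sqrt(1 - r ** 2) * sd_y

# Summary statistics taken from the worked answer to problem 7
mean_x, sd_x, mean_y, sd_y, r = 49, 8, 5600, 500, 0.8

predicted_41 = predict(41, mean_x, sd_x, mean_y, sd_y, r)
predicted_57 = predict(57, mean_x, sd_x, mean_y, sd_y, r)
rms = rms_error(sd_y, r)

# Chance that sales exceed 6300 for a salesperson who answers 57 calls
chance_above = 1 - NormalDist(mu=predicted_57, sigma=rms).cdf(6300)

print(round(predicted_41), round(predicted_57), round(rms), round(chance_above, 4))
```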

8.
(a) \(Z_X = \dfrac{6 - 5.2}{1.2} = 0.67\).
\(Z_Y = R \cdot Z_X = 0.45 \cdot 0.67 = 0.3015 \approx 0.30\).
\(Y_{predict} = \bar{Y} + Z_Y \cdot SD_Y = 102 + 0.30 \cdot 14 = 106.2\), that is, $106.2k.


(b)

30th percentile means area to the left is 0.3000. In the Z table, 0.3015 is the closest to 0.3000.
Corresponding Z value for area 0.3015 is -0.52. Therefore, ZX = −0.52.
Monthly advertising expenditure is X = X̄ + ZX · SDX = 5.2 + (−0.52) · 1.2 = $4.58k
ZY = R · ZX = 0.45 · (−0.52) = −0.234 ≈ −0.23. Using Z table, corresponding area for z value -0.23 is
0.4090. Therefore, predicted Y value is at 41st percentile.

75th percentile means area to the left is 0.7500. In the Z table, 0.7486 is the closest to 0.7500.
Corresponding Z value for area 0.7486 is 0.67. Therefore, ZX = 0.67.
Monthly advertising expenditure is X = X̄ + ZX · SDX = 5.2 + (0.67) · 1.2 = $6.00k
ZY = R · ZX = 0.45 · 0.67 = 0.3015 ≈ 0.30. Using Z table, corresponding area for z value 0.3 is 0.6179.
Therefore, predicted Y value is at 62nd percentile.
(c) \(RMS = \sqrt{1 - R^2} \cdot SD_Y = \sqrt{1 - 0.45^2} \cdot 14 = 12.5\), that is, $12.5k


(d) Part (a) gives the predicted Y value when X = $6k as Ypredict = $106.2k. Part (c) gives an RMS of
$12.5k. Then total sales has an approximately normal distribution with mean 106.2 and standard
deviation 12.5.

\(Z_{Y=102} = \dfrac{102 - 106.2}{12.5} = -0.336 \approx -0.34\). Area below −0.34 is 0.3669 or 36.7%.
9.
(a) \(Z_X = \dfrac{98 - 98.5}{0.75} = -0.67\).
\(Z_Y = R \cdot Z_X = 0.5 \cdot (-0.67) = -0.335 \approx -0.34\).
\(Y_{predict} = \bar{Y} + Z_Y \cdot SD_Y = 74 + (-0.34) \cdot 9 = 70.94\).

(b)

85th percentile means area to the left is 0.8500. In the Z table, 0.8508 is the closest to 0.8500.
Corresponding Z value for area 0.8508 is 1.04. Therefore, ZX = 1.04.
Body temperature is X = X̄ + ZX · SDX = 98.5 + 1.04 · 0.75 = 99.28

(c) ZY = R·ZX = 0.5·1.04 = 0.52. Using Z table, corresponding area for z value 0.52 is 0.6985. Therefore,
predicted Y value is at 70th percentile.

(d) ZY = 0.52, then YPredicted = Ȳ + ZY · SDY = 74 + .52 · 9 = 78.68


(e) \(RMS = \sqrt{1 - R^2} \cdot SD_Y = \sqrt{1 - (0.5)^2} \cdot 9 = 7.79\)


(f) When a female patient has a body temperature of 100F, the predicted heart rate is 83. From part (e),
RMS = 7.79.
Therefore, heart rate has a normal distribution with mean 83 and SD = 7.79.
\(Z_Y = \dfrac{87 - 83}{7.79} = 0.51\), so the area to the left of Z = 0.51 is 0.6950. The chance she will have a heart
rate less than 87 is 0.6950 or 69.5%.

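(g) \(\text{slope} = R \cdot \dfrac{SD_Y}{SD_X} = 0.5 \cdot \dfrac{9}{0.75} = 6\), \(\text{intercept} = \bar{Y} - \text{slope} \cdot \bar{X} = 74 - 6 \cdot 98.5 = -517\).
Regression equation: y = 6x − 517.
10. \(\text{slope} = R \cdot \dfrac{SD_Y}{SD_X} = 0.8 \cdot \dfrac{20}{8} = 2\), \(\text{intercept} = \bar{Y} - \text{slope} \cdot \bar{X} = 165 - 2 \cdot 78 = 9\).
Regression equation: y = 2x + 9.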
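The slope and intercept formulas used in problems 9(g) and 10 can be wrapped in one small function. This is a sketch; `regression_line` is a hypothetical helper name, and the inputs are the five summary statistics quoted in the answers.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the regression line of Y on X."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Problem 10: mean X = 78, SD X = 8, mean Y = 165, SD Y = 20, R = 0.8
slope_10, intercept_10 = regression_line(78, 8, 165, 20, 0.8)

# Problem 9(g): mean X = 98.5, SD X = 0.75, mean Y = 74, SD Y = 9, R = 0.5
slope_9, intercept_9 = regression_line(98.5, 0.75, 74, 9, 0.5)

print(slope_10, intercept_10)
print(slope_9, intercept_9)
```

The line always passes through the point of averages, which is why the intercept is found by plugging the two means into the equation.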
11. Predicted Y value when x = 5 is y pred = 15 + 6x = 15 + 6 · 5 = 45.
Therefore, y has a normal distribution with mean 45 and SD = 2.5.

\(Z_{y=50} = \dfrac{50 - 45}{2.5} = \dfrac{5}{2.5} = 2\); area below Z = 2 is 0.9772. Area above Z = 2 is 1 − 0.9772 = 0.0228, or 2.3%.


CHAPTER 10
1.
(A) The average of the box is 3.5, with a standard deviation of 1.708.
When n = 10, \(EV = n \cdot \mu = 10 \cdot 3.5 = 35\) and \(SE = \sqrt{n} \cdot \sigma = \sqrt{10} \cdot 1.708 = 5.40\).
When n = 100, \(EV = n \cdot \mu = 100 \cdot 3.5 = 350\) and \(SE = \sqrt{n} \cdot \sigma = \sqrt{100} \cdot 1.708 = 17.08\).
When n = 1000, \(EV = n \cdot \mu = 1000 \cdot 3.5 = 3500\) and \(SE = \sqrt{n} \cdot \sigma = \sqrt{1000} \cdot 1.708 = 54.01\).

(B) The average of the box is 2, with a standard deviation of 1.
When n = 10, \(EV = n \cdot \mu = 10 \cdot 2 = 20\) and \(SE = \sqrt{n} \cdot \sigma = \sqrt{10} \cdot 1 = 3.16\).
When n = 100, \(EV = n \cdot \mu = 100 \cdot 2 = 200\) and \(SE = \sqrt{n} \cdot \sigma = \sqrt{100} \cdot 1 = 10\).
When n = 1000, \(EV = n \cdot \mu = 1000 \cdot 2 = 2000\) and \(SE = \sqrt{n} \cdot \sigma = \sqrt{1000} \cdot 1 = 31.62\).
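The expected value and standard error of a sum of draws can be computed from the box itself. The sketch below assumes that the box in part (A) holds the tickets 1 through 6, which is consistent with the stated average 3.5 and SD 1.708 but is not spelled out in this answer; `sum_ev_se` is a hypothetical helper name.

```python
import math
import statistics

def sum_ev_se(box, n):
    """EV and SE for the sum of n draws made with replacement from the box."""
    mu = statistics.mean(box)         # average of the box
    sigma = statistics.pstdev(box)    # SD of the box (divide by the number of tickets)
    return n * mu, math.sqrt(n) * sigma

box = [1, 2, 3, 4, 5, 6]   # assumed contents of the box in part (A)

for n in (10, 100, 1000):
    ev, se = sum_ev_se(box, n)
    print(n, ev, round(se, 2))

ev_10, se_10 = sum_ev_se(box, 10)
```

The EV grows in proportion to n, while the SE grows only in proportion to the square root of n.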

2. Box model [0 1], with p = 0.55 = proportion of births that are boys.
n = 100, p = 0.55
\(EV = n \cdot p = 100 \cdot 0.55 = 55\) and \(SE = \sqrt{n p (1 - p)} = \sqrt{100 \cdot 0.55 (1 - 0.55)} = 4.9749\)
The number of births of boys has an approximately normal distribution with mean equal to EV = 55 and
standard deviation equal to SE = 4.9749.

\(Z = \dfrac{\text{value} - EV}{SE} = \dfrac{60 - 55}{4.9749} = 1.01\). Thus the area below Z = 1.01 is P(Z < 1.01) = 0.8438 or 84.38%.

3. Box model [0 1], with p = 0.25 = proportion of getting the correct answer.
n = 50, p = 0.25
\(EV = n \cdot p = 50 \cdot 0.25 = 12.5\) and \(SE = \sqrt{n p (1 - p)} = \sqrt{50 \cdot 0.25 (1 - 0.25)} = 3.0619\).
The 95% range is 2 SE from EV, thus: 12.5 ± 2(3.0619) = (12.5 − 6.1238, 12.5 + 6.1238) = (6.3762, 18.6238)
4. Box model [0 1], with p = 0.7 = proportion of success.
n = 100, p = 0.70
\(EV = n \cdot p = 100 \cdot 0.7 = 70\) and \(SE = \sqrt{n p (1 - p)} = \sqrt{100 \cdot 0.7 (1 - 0.7)} = \sqrt{21} = 4.5826\).
The procedure will succeed 70 times, give or take 4.5826.
The 95% range is 2 SE from EV, thus: 70 ± 2(4.5826) = (70 − 9.1652, 70 + 9.1652) = (60.8348, 79.1652)
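For a 0-1 box the same calculation needs only n and p. A minimal sketch for problem 4 follows; `count_ev_se` is a hypothetical helper name. Note that the SE is the square root of n·p·(1 − p), not n·p·(1 − p) itself.

```python
import math

def count_ev_se(n, p):
    """EV and SE for the number of 1s in n draws from a 0-1 box with fraction p of 1s."""
    ev = n * p
    se = math.sqrt(n * p * (1 - p))
    return ev, se

ev, se = count_ev_se(100, 0.7)
low, high = ev - 2 * se, ev + 2 * se   # EV plus or minus 2 SE, the approximate 95% range

print(round(ev, 1), round(se, 4), round(low, 2), round(high, 2))
```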


5.
(A) Box model [0 1], with p = 0.90 = proportion of free throws made.
n = 15, p = 0.90
\(EV = n \cdot p = 15 \cdot 0.9 = 13.5\)
(B) \(SE = \sqrt{n p (1 - p)} = \sqrt{15 \cdot 0.9 (1 - 0.9)} = 1.1619\)
\(Z = \dfrac{\text{value} - EV}{SE} = \dfrac{14 - 13.5}{1.1619} = 0.43\). Area below Z = 0.43 is P(Z < 0.43) = 0.6664. Therefore, area above
Z = 0.43 is P(Z > 0.43) = 1 − 0.6664 = 0.3336. There is a 33.36% chance that he will make more than
14 free throws in 15 attempts.
6.
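(A) The box model is [0 1 2 2 2 2 2 3 3 3]
(B) \(\mu = \dfrac{0 + 1 + 2 \cdot 5 + 3 \cdot 3}{10} = 2\) and
\(\sigma = \sqrt{\dfrac{(0 - 2)^2 + (1 - 2)^2 + 5(2 - 2)^2 + 3(3 - 2)^2}{10}} = \sqrt{\dfrac{8}{10}} = 0.8944\)
(C) n = 100, \(EV = n \cdot \mu = 100 \cdot 2 = 200\) and \(SE = \sqrt{n} \cdot \sigma = \sqrt{100} \cdot 0.8944 = 8.944\).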
(D) The sum of the number of pieces of candy received is approximately normally distributed with
mean equals to EV = 200 and standard deviation equals to SE = 8.944.

\(Z = \dfrac{\text{value} - EV}{SE} = \dfrac{185 - 200}{8.944} = -1.68\). Area below Z = −1.68 is P(Z < −1.68) = 0.0465. Therefore, the
probability that you will receive fewer than 185 pieces of candy if you were to trick-or-treat at 100
houses is 0.0465.
7. n = 26, µ = 14.5, σ = 2.25
\(EV = n \cdot \mu = 26 \cdot 14.5 = 377\), \(SE = \sqrt{n} \cdot \sigma = \sqrt{26} \cdot 2.25 = 11.4728\)


Thus, the chance he would take his wife to a resort is computed as:
\(Z = \dfrac{\text{value} - EV}{SE} = \dfrac{395 - 377}{11.4728} = 1.57\).
Area below Z = 1.57 is P(Z < 1.57) = 0.9418. Area above Z = 1.57 is P(Z > 1.57) = 1 − 0.9418 = 0.0582.
The chance he would take his wife to a resort is 5.82%.

CHAPTER 11
1. The probability the sample will contain 40% or more seals with the trait is 0.0618.
Given: p = 0.30, n = 50. The expected value and standard error for p̂ are calculated as: EV = p = 0.30,
\(SE = \sqrt{\dfrac{0.30(1 - 0.30)}{50}} = 0.0648\). To calculate the probability, first calculate the Z value for 0.40 as
\(Z = \dfrac{\hat{p} - EV}{SE} = \dfrac{0.40 - 0.30}{0.0648} = 1.54\). The chance the sample contains 40% or more seals with the trait is
equivalent to the area above 1.54. Now area above 1.54 = 1 − area below 1.54 = 1 − 0.9382 = 0.0618
2. False; The second poll has a larger population size than the first poll, hence getting a sample of the
same size as what was obtained from the first poll will not result in a more accurate estimate since
it is less representative of the population.
3. For a 95% confidence interval with a margin of error of 0.03: for p = 0.2, sample size = 683; for p = 0.5, sample size = 1068
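The two sample sizes in problem 3 follow from solving z · sqrt(p(1 − p)/n) ≤ margin for n and rounding up. The sketch below assumes z = 1.96 for 95% confidence and reads the 0.03 in the answer as the margin of error; `sample_size` is a hypothetical helper name.

```python
import math

def sample_size(p, margin, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) is at most the margin."""
    return math.ceil((z / margin) ** 2 * p * (1 - p))

n_02 = sample_size(0.2, 0.03)   # planning value p = 0.2
n_05 = sample_size(0.5, 0.03)   # planning value p = 0.5 (the most conservative choice)

print(n_02, n_05)
```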
4.
(A) EV = p = 0.80
(B) \(SE = \sqrt{\dfrac{0.80(1 - 0.80)}{55}} = 0.0539\)
5.
(A) False. The sample size must be greater than 30 for the sampling distribution of p̂ to be approximately
normal.
(B) EV = p = 0.40, \(SE = \sqrt{\dfrac{0.40(1 - 0.40)}{200}} = 0.0346\)


6. n = 500, x = 105; \(\hat{p} = \dfrac{105}{500} = 0.21\)

\(SE = \sqrt{\dfrac{0.21(1 - 0.21)}{500}} = 0.0182\)

95% CI = 0.21 ± (1.96 · 0.0182) = 0.21 ± 0.0357 = (0.17, 0.25)

7. n = 60; x = 42; \(\hat{p} = \dfrac{42}{60} = 0.7\)

\(SE = \sqrt{\dfrac{0.7(1 - 0.7)}{60}} = 0.0592\)

95% CI = 0.7 ± (1.96 · 0.0592) = 0.7 ± 0.1160 = (0.58, 0.82)
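The confidence intervals in problems 6 and 7 use the same recipe: estimate p̂, compute its SE from the sample, and go z SEs either way. A minimal sketch, assuming z = 1.96 for 95% confidence; `proportion_ci` is a hypothetical helper name.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Approximate confidence interval for a population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # SE estimated from the sample
    return p_hat - z * se, p_hat + z * se

low_6, high_6 = proportion_ci(105, 500)   # problem 6
low_7, high_7 = proportion_ci(42, 60)     # problem 7

print(round(low_6, 2), round(high_6, 2))
print(round(low_7, 2), round(high_7, 2))
```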


8. (b)

CHAPTER 12
1.
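(A) EV = µ = 7.5
(B) \(SE = \dfrac{\sigma}{\sqrt{n}} = \dfrac{0.7}{\sqrt{70}} = 0.0837\)
(C) \(Z = \dfrac{\bar{X} - EV}{SE} = \dfrac{7.25 - 7.5}{0.7 / \sqrt{70}} = -2.99\)
Thus, the probability is the area below Z = −2.99, which by symmetry equals 1 − (area below 2.99) = 1 − 0.9986 = 0.0014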
2.
(A) \(Z_{520} = \dfrac{520 - 500}{100 / \sqrt{50}} = 1.41\). Thus the probability is equal to 1 − area below 1.41 = 1 − 0.9207 = 0.0793
(B) If we can assume the population is normally distributed, then \(Z_{520} = \dfrac{520 - 500}{100} = 0.2\). Thus, the probability
is equal to 1 − area below 0.2 = 1 − 0.5793 = 0.4207
3. n = 30, x̄ = 22, S = 8
The 95% CI is given as: \(22 \pm 1.96 \cdot \dfrac{8}{\sqrt{30}} = 22 \pm 2.8628 = (19.14, 24.86)\)
4.
(A) n = 300, x̄ = 225, S = 90
The 90% CI is given as: \(225 \pm 1.645 \cdot \dfrac{90}{\sqrt{300}} = 225 \pm 8.5477 = (216.45, 233.55)\)
(B) We are 90% confident that the true average credit card debt for students at WMU last month is
between $216.45 and $233.55.
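The interval in problem 4 can be reproduced with a short function. This sketch takes the z multiplier as an argument (1.645 for 90%, 1.96 for 95%, 2.58 for 99%, as used in this chapter's answers); `mean_ci` is a hypothetical helper name.

```python
import math

def mean_ci(x_bar, s, n, z):
    """Large-sample confidence interval for a population mean."""
    margin = z * s / math.sqrt(n)   # z times the SE of the sample mean
    return x_bar - margin, x_bar + margin

low, high = mean_ci(225, 90, 300, z=1.645)   # problem 4(A)
print(round(low, 2), round(high, 2))
```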
5.
(A) n = 400, x̄ = 3.75, S = 1.6
The 99% CI is given as: \(3.75 \pm 2.58 \cdot \dfrac{1.6}{\sqrt{400}} = 3.75 \pm 0.2064 = (3.54, 3.96)\)
(B) 100%: the sample average is always in the region “sample AVG ± (Z*SE)”.
6.
(A) \(Z = \dfrac{9.5 - 10}{3.5 / \sqrt{49}} = -1\). Thus the probability is equal to the area below Z = −1 on the Z-table, which is 0.1587.

(B) False: The interval applies to the population mean, not the sample values.


7. n = 3600, x̄ = 6.8, S = 2.2

Thus, the 95% CI is: \(6.8 \pm 1.96 \cdot \dfrac{2.2}{\sqrt{3600}} = 6.8 \pm 0.0719 = (6.73, 6.87)\)
We are 95% confident that the true average number of regular office hours faculty members held per
week during the semester is between 6.73 hours and 6.87 hours.
8.
(A) False: 4 is the correct center, but \(SE = \dfrac{2.7}{\sqrt{81}} = 0.3\).
(B) False: The population is not necessarily symmetric, only the distribution of the sample mean
can be considered approximately normal since n is large.
(C) True.
(D) False: a random number, like the mean, with a continuous set of possible value, has no guarantee
of being any exact value within its range, even if it is the mode (most likely).
9. (b)

CHAPTER 13
1. B: “to the right” of −.5 is “away” from H0 and “more extreme” from the sample value
2. H0 : p ≤ .77, HA : p > .77
\(Z = \dfrac{0.79 - 0.77}{\sqrt{\dfrac{(0.77)(1 - 0.77)}{64}}} = 0.38\) ⇒ p-value of 1 − .648 = .352 > .05
Decision: Do not reject H0
Conclusion: There is not enough evidence to support the mayor’s claim.
3.
(a) False: \(p\text{-value}_{\text{two-sided}} = 2 \cdot p\text{-value}_{\text{one-sided}}\)
(b) True
(c) False. A low p-value means the observed sample is very unlikely to occur assuming H0 is true.
(d) False. The p-value is between 0.05 and 0.10
4. H0 : µ ≤ 5 mg; HA : µ > 5 mg
SD = 1.61; t = (5.63 − 5)/(1.61/√15) = 1.516; df = 15 − 1 = 14; 0.05 < p-value < 0.10
Decision: Do not reject H0
Conclusion: We do not have enough evidence to show the average is more than 5 mg; hence we must
conclude the reporter is incorrect.
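The t statistic in problem 4 can be reproduced with a few lines of code. This is a minimal sketch; the helper name `t_statistic` is hypothetical, and the p-value bracket is still read from the t-table as in the key, since the Python standard library has no t distribution.

```python
import math

def t_statistic(xbar, mu0, sd, n):
    """One-sample t statistic and its degrees of freedom (n - 1)."""
    se = sd / math.sqrt(n)  # standard error of the sample mean
    return (xbar - mu0) / se, n - 1

# Problem 4: sample mean 5.63, hypothesized mean 5, SD 1.61, n = 15
t, df = t_statistic(5.63, 5, 1.61, 15)
print(round(t, 3), df)
```

The same helper covers problem 6 of this chapter (mean .75, SD 2, n = 25) and paired designs where the inputs are the mean and SD of the differences.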
5. YES (for method employed) H0 : µNEW METHOD ≤ 75 vs HA : µNEW METHOD > 75
Z = (78 − 75)/(8/√30) = 2.053 → p-value = 0.02.
Decision: Reject H0 .
Conclusion: We have evidence to show the method improved the math test scores.
6. t = (.75 − 0)/(2/√25) = 1.875; df = 24 ⇒ One-sided p-value: 0.025 < p-value < 0.05 ⇒ Two-sided p-value: 0.05 < p-value < 0.10
Decision: Do not Reject H0
Conclusion: We do not have enough evidence to show the average is significantly different from zero.
7. H0 : p = .4 vs HA : p ≠ .4
p̂ = 38/76 = 0.5; SE = √(.4 · .6/76) = .0562
Z = (p̂ − EV)/SE = (.5 − .4)/.0562 = 1.78 ⇒ Probability of Z > 1.78 = 1 − .962 = .038 → p-value = 2 · .038 = .076 > .05
Decision: Do not reject H0

Conclusion: We do not have enough evidence to show that the proportion of divorces in the first five
years for couples that have a child in the first two years of marriage is significantly different from
40%.
8. H0 : P ≤ 0.3 vs HA : P > 0.3
p̂ = 33/100 = .33; SE = √(0.3(1 − 0.3)/100) = .046
Z = (0.33 − 0.3)/0.046 = 0.65 → Probability of Z > 0.65: 1 − 0.742 = .258 > .05
Decision: Do not reject H0 .
Conclusion: We do not have evidence to show the proportion of the city that is interested in zipcars
is more than 30%, hence the company shouldn’t invest.
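The one-proportion Z test in problem 8 can be sketched in code as follows. This is an illustration under the same assumptions as the key (SE computed from the hypothesized proportion, upper-tail alternative); the function name `one_proportion_z` is hypothetical, and `statistics.NormalDist` stands in for the Z-table.

```python
import math
from statistics import NormalDist

def one_proportion_z(p_hat, p0, n):
    """Z statistic and upper-tail p-value for H0: P <= p0 vs HA: P > p0."""
    se = math.sqrt(p0 * (1 - p0) / n)  # SE under the null hypothesis
    z = (p_hat - p0) / se
    upper_tail = 1 - NormalDist().cdf(z)
    return z, upper_tail

# Problem 8: 33 of 100 respondents interested, hypothesized proportion 0.30
z, p_value = one_proportion_z(0.33, 0.30, 100)
print(round(z, 2), round(p_value, 3))
```

The computed p-value is about .256 rather than .258 because the key rounds SE and Z before the table lookup; the decision is unchanged.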
9. H0 : P ≥ 0.35 vs HA : P < 0.35
p̂ = 35/120 = .29; SE = √(0.35(1 − 0.35)/120) = .0435
Then Z = (.29 − .35)/.0435 = −1.38 → Probability of Z < −1.38 = 1 − (chance of Z < 1.38) = 1 − .916 = .084
Decision: Do not reject H0 .
Conclusion: We do not have evidence to show TMT B outperforms TMT A.
10. Based on the sample, H0 is FALSE, but the decision was to not reject H0 , hence a Type II error
occurred.

CHAPTER 14
1.
(A) (b)
(B) (c)
(C) Z = (0.67 − 0.52)/0.0762 = 1.97 → p-value = 1 − 0.976 = 0.024
Decision: Reject H0
Conclusion: We have evidence to show the new drug reduces seizures.
2. H0 : µsmokers ≤ µnonsmokers vs Ha : µsmokers > µnonsmokers
Z = (90 − 88)/√(5²/100 + 6²/100) = 2.56 → p-value = 1 − 0.995 = 0.005
Decision: Reject H0
Conclusion: We have evidence to show that smokers have higher pulse compared to nonsmokers.
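The two-sample Z statistic in problem 2 can be reproduced as below. This is a minimal sketch; the function name `two_sample_z` is hypothetical, and the assignment of SD 5 to smokers and SD 6 to nonsmokers is an assumption read off the order of terms in the formula (the result is the same either way because both samples have n = 100).

```python
import math
from statistics import NormalDist

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic and upper-tail p-value for H0: mu1 <= mu2 vs Ha: mu1 > mu2."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # SE of the difference
    z = (mean1 - mean2) / se
    return z, 1 - NormalDist().cdf(z)

# Problem 2: smokers (mean 90, n 100) vs nonsmokers (mean 88, n 100)
z, p_value = two_sample_z(90, 5, 100, 88, 6, 100)
print(round(z, 2), round(p_value, 3))
```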
3. Z = ((230 − 224) − 0)/√([10/√36]² + [13/√36]²) = 2.19 → p-value of 1 − .986 = .014
Decision: Reject H0
Conclusion: We have evidence to show that the average mileage of the tested cars increases with the
addition of the fuel additive.
4. EV = 83.6 − 79.2 = 4.4, SENM = 4.3/√36 = 0.7167, SEBM = 3.8/√36 = 0.6333, SE = √((0.7167)² + (0.6333)²) = 0.9564
95% CI: 4.4 ± 1.96 · 0.9564 = 4.4 ± 1.8745 = (2.53, 6.27)
5. We use a t-test as per the rule nA or nB < 30; df = the smaller of nA − 1 and nB − 1 = 19; 0.05 < p-value < 0.10
Decision: Do not reject H0.
Conclusion: We do not have evidence to show a significant difference in the averages between the
two groups.
6. EV = 0.03, SE = 0.0454
95% CI: 0.03 ± 1.96 · 0.0454 = (−0.06, 0.12)
Since 0 is in the interval, we don't have evidence to show the proportions are different.

7. H0 : Pmale − Pfemale = 0 vs Ha : Pmale − Pfemale ≠ 0
p̂male = 39/81 = .481; p̂female = 22/64 = .344
SEmale = √(.481 · (1 − .481)/81) = .0556
SEfemale = √(.344 · (1 − .344)/64) = .0594
SEmale−female = √(.0556² + .0594²) = .081
Z = (p̂male − p̂female − 0)/SEmale−female = (.481 − .344)/.081 = 1.69 → P(Z > 1.69) = 1 − .954 = .046
P-value (2-sided) = 2 · .046 = .092
Decision: Do not reject H0
Conclusion: We do not have evidence to show a significant difference in the proportion between
males and females.
8. H0 : µS − µN ≤ 0 vs Ha : µS − µN > 0
SES = SDS/√nS = 0.65/5 = 0.13; SEN = 0.85/5 = 0.17
SE[S−N] = √(.13² + .17²) = 0.214
t = ((X̄S − X̄N) − EV[X̄S − X̄N])/SE[X̄S − X̄N] = ((4.6 − 3.8) − 0)/.214 = 3.74 (df = 25 − 1 = 24)
From the t-table, p-value < .001
Decision: Reject H0
Conclusion: We have enough evidence to show the average knee velocity of skilled rowers is higher
than that of novice rowers.
9. (c) Since this is a before-and-after experiment, we use dependent samples, and a t-test since the sample
size, 10, is less than 30.
10. H0 : µd ≤ 0 vs. HA : µd > 0
Test Statistic: t = (10.5 − 0)/(7.75/√10) = 4.284
df = 9, 0.001 < p-value < 0.005
Decision: Reject H0 since p-value < 0.05.
Conclusion: We have significant evidence to show that the weights before the program are greater
than the weights after the program, hence the program is effective.

CHAPTER 15
1.
(A) H0 : pWhite = 0.4, pBlack = 0.3, pHispanic = 0.2, pOther = 0.1
(B) EWhite = n · pWhite = 60 · 0.4 = 24,
EBlack = n · pBlack = 60 · 0.3 = 18,
EHispanic = n · pHispanic = 60 · 0.2 = 12 and
EOther = n · pOther = 60 · 0.1 = 6
(C) df = k − 1 = 4 − 1 = 3, where k is the number of categories.
(D) χ² = ∑ (Observed − Expected)²/Expected = (34−24)²/24 + (12−18)²/18 + (9−12)²/12 + (5−6)²/6 = 7.083
0.05 < p-value < 0.1 (When df = 3, 6.251 < χ² = 7.083 < 7.815)
Decision: Do not Reject H0 . (p-value > 0.05)
Conclusion: We do not have significant evidence to show that at least one proportion differs
from what is expected.
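The expected counts and χ² statistic in problem 1 follow a fixed recipe that can be sketched in code. The function name `chi_square_gof` is hypothetical; the observed counts (34, 12, 9, 5) are taken from the terms of the sum in part (D).

```python
def chi_square_gof(observed, probs):
    """Expected counts, chi-square statistic, and df for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in probs]  # E = n * p for each category
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2, len(observed) - 1

# Problem 1: White, Black, Hispanic, Other
expected, chi2, df = chi_square_gof([34, 12, 9, 5], [0.4, 0.3, 0.2, 0.1])
print(expected, round(chi2, 3), df)
```

The statistic is then compared with the χ² table row for the returned df, exactly as in the key.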
2. H0 : pHigh = 0.3, pMiddle = 0.5, pLow = 0.2
Ha : At least one proportion differs from what is expected.
Expected Frequencies:

EHigh = n · pHigh = 50 · 0.3 = 15, EMiddle = n · pMiddle = 50 · 0.5 = 25, ELow = n · pLow = 50 · 0.2 = 10
Test Statistic: χ² = ∑ (Observed − Expected)²/Expected = (10−15)²/15 + (27−25)²/25 + (13−10)²/10 = 2.727
P-value: df = k − 1 = 3 − 1 = 2, where k is the number of categories.
p-value > 0.20 (When df = 2, χ² = 2.727 < 3.219)
Decision: Do not reject H0 . (p-value > 0.05)
Conclusion: We do not have significant evidence to show that at least one proportion differs from
what is expected. Professor is right.
3.
(A) H0 : pBeer = 0.3, pWine = 0.55, pMixed = 0.1, pLiqueur = 0.05
(B) EBeer = n · pBeer = 120 · 0.3 = 36, EWine = n · pWine = 120 · 0.55 = 66, EMixed = n · pMixed = 120 · 0.1 = 12 and
ELiqueur = n · pLiqueur = 120 · 0.05 = 6
(C) df = k − 1 = 4 − 1 = 3, where k is the number of categories.
(D) χ² = ∑ (Observed − Expected)²/Expected = (30−36)²/36 + (75−66)²/66 + (13−12)²/12 + (2−6)²/6 = 4.977
0.1 < p-value < 0.2 (When df = 3, 4.642 < χ² = 4.977 < 6.251)

Decision: Do not reject H0 . (p-value > 0.05)


Conclusion: We do not have significant evidence to show that at least one proportion differs
from what is expected. The restaurant should not consider a different distribution.
4. H0 : pFreshman = pSophomore = pJunior = pSenior = 1/k = 1/4 = 0.25, where k is the number of categories.
Ha : At least one proportion differs from what is expected.
Expected Frequencies: EFreshman = ESophomore = EJunior = ESenior = n · 0.25 = 100 · 0.25 = 25
Test Statistic: χ² = ∑ (Observed − Expected)²/Expected = (30−25)²/25 + (19−25)²/25 + (22−25)²/25 + (29−25)²/25 = 3.44
df = k − 1 = 4 − 1 = 3
p-value > 0.20 (When df = 3, χ² = 3.44 < 4.642)
Decision: Do not reject H0 . (p-value > 0.05)
Conclusion: We do not have evidence to show at least one proportion differs from what is expected.
The restaurant should not consider a different distribution.
5.
(A) χ² = ∑ (Observed − Expected)²/Expected = (30−25)²/25 + (20−25)²/25 = 2
df = k − 1 = 2 − 1 = 1, where k is the number of categories.
0.1 < p-value (= .1573) < 0.2 (When df = 1, 1.642 < χ² = 2 < 2.706)

(B) Sample proportion of males = p̂ = 30/50 = 0.6
Z = (p̂ − p0)/√(p0(1 − p0)/n) = (0.6 − 0.5)/√(0.5(1 − 0.5)/50) = 1.4142
p-value (2-sided) = 2 · tail area = 2 · (1 − 0.9207) = 2 · 0.0793 = 0.1586.
(C) Z = √χ²; p-values are very close.
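The relationship in problem 5(C) can be verified numerically. The sketch below recomputes both statistics from the counts 30 and 20 and uses the fact that, for df = 1, the χ² upper-tail area equals the two-sided normal tail area at Z = √χ²; `statistics.NormalDist` stands in for the Z-table.

```python
import math
from statistics import NormalDist

observed = [30, 20]          # males, females
n = sum(observed)
expected = [n * 0.5, n * 0.5]

# Part (A): chi-square goodness-of-fit statistic
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Part (B): one-proportion Z statistic for H0: p = 0.5
p_hat = observed[0] / n
z = (p_hat - 0.5) / math.sqrt(0.5 * (1 - 0.5) / n)

# Two-sided normal p-value; matches the df = 1 chi-square p-value
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(chi2, 4), round(z, 4), round(p_two_sided, 4))
```

The small gap between .1573 and .1586 in the key comes only from rounding Z to two decimals for the table lookup.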
6. H0 : pA = 0.35, pB = 0.1, pC = 0.2, pD = 0.15, pE = 0.2
Ha : At least one proportion differs from what is expected.
Expected Frequencies:
EA = n · pA = 100 · 0.35 = 35, EB = n · pB = 100 · 0.1 = 10, EC = n · pC = 100 · 0.2 = 20, ED = n · pD = 100 · 0.15 = 15,
EE = n · pE = 100 · 0.2 = 20

Test Statistic: χ² = ∑ (Observed − Expected)²/Expected = (22−35)²/35 + (13−10)²/10 + (15−20)²/20 + (26−15)²/15 + (24−20)²/20 = 15.845
P-value: df = k − 1 = 5 − 1 = 4, where k is the number of categories.
0.001 < p-value < 0.005 (When df = 4, 14.86 < χ² = 15.845 < 18.467)
Decision: Reject H0 . (p-value < 0.05)
Conclusion: We have significant evidence to show that at least one proportion differs from what is
expected. The distribution of purchasing different cereal brands has changed during the past few
years.
7.
(A) H0 : pLower = 0.18, pMiddle = 0.59, pUpper = 0.23
Ha : At least one proportion differs from what is expected.
(B) ELower = n · pLower = 432 · 0.18 = 77.76, EMiddle = n · pMiddle = 432 · 0.59 = 254.88, EUpper = n · pUpper = 432 · 0.23 = 99.36
(C) χ² = ∑ (Observed − Expected)²/Expected = (102−77.76)²/77.76 + (219−254.88)²/254.88 + (111−99.36)²/99.36 = 13.971
(D) df = k − 1 = 3 − 1 = 2, where k is the number of categories.
p-value < 0.001 (When df = 2, 13.816 < χ² = 13.971)
(E) Reject H0. (p-value < 0.05)
We have significant evidence to show that at least one proportion differs from what is expected.

CHAPTER 16
1. H0 : Quality level and Suppliers are independent.
Ha : Quality level and Suppliers are dependent
Expected counts:

              Outcome
Supplier      Good                     Defective              row sums
A             200·545/600 = 181.67     200·55/600 = 18.33     200
B             200·545/600 = 181.67     200·55/600 = 18.33     200
C             200·545/600 = 181.67     200·55/600 = 18.33     200
column sum    545                      55                     600

χ² = ∑ (Observed − Expected)²/Expected
χ² = (180−181.67)²/181.67 + (175−181.67)²/181.67 + (190−181.67)²/181.67 + (20−18.33)²/18.33 + (25−18.33)²/18.33 + (10−18.33)²/18.33 = 7.01;
df = (r − 1) · (c − 1) = (3 − 1) · (2 − 1) = 2
0.025 < p-value < 0.05 (When df = 2, 5.991 < χ² = 7.01 < 7.378)
Decision: We reject H0 . (p-value < 0.05)
Conclusion: We have significant evidence to show quality level and suppliers are dependent.
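The expected counts (row sum · column sum / total) and the χ² statistic for problem 1 can be sketched as follows. The function name `chi_square_independence` is hypothetical, and the observed table is reconstructed from the terms of the sum above, assuming they are listed in the order A, B, C.

```python
def chi_square_independence(table):
    """Expected counts, chi-square statistic, and df for an r x c table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # Expected count for each cell: row sum * column sum / grand total
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return expected, chi2, df

# Rows: suppliers A, B, C; columns: Good, Defective
observed = [[180, 20], [175, 25], [190, 10]]
expected, chi2, df = chi_square_independence(observed)
print(round(expected[0][0], 2), round(chi2, 2), df)
```

The same function applies to the 2 × 2 and 2 × 3 tables in the remaining problems of this chapter.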

2.
(A)
                Back Pain
                Reduced                     Not Reduced                 Row sums
Gel Seat        Observed = 20               Observed = 44               64
                Expected = 32·64/128 = 16   Expected = 96·64/128 = 48
Regular Seat    Observed = 12               Observed = 52               64
                Expected = 32·64/128 = 16   Expected = 96·64/128 = 48
Column sums     32                          96                          128
(B) χ² = ∑ (Observed − Expected)²/Expected = (20−16)²/16 + (12−16)²/16 + (44−48)²/48 + (52−48)²/48 = 2.67;
df = (r − 1) · (c − 1) = (2 − 1) · (2 − 1) = 1
(C) 0.10 < p-value < 0.20, (When df = 1, 1.642 < χ² = 2.67 < 2.706)
(D) Do not reject H0 (P-value > 0.05). Thus, we do not have significant evidence to show that back pain
response and seat type are dependent.
3. Expected counts:
                    Response
Residency Status    Yes                   No
Out of State        105·153/240 = 67      105·87/240 = 38
In State            135·153/240 = 86      135·87/240 = 49

Test Statistic: χ² = ∑ (Observed − Expected)²/Expected = (73−67)²/67 + (80−86)²/86 + (32−38)²/38 + (55−49)²/49 = 2.638;
df = (r − 1) · (c − 1) = (2 − 1)(2 − 1) = 1
0.10 < p-value < 0.20, (When df = 1, 1.642 < χ² = 2.638 < 2.706)
Decision: Do not reject H0 . (P-value > 0.05)
Conclusion: We do not have significant evidence to show response and residency status are dependent.
4. H0 : Gender and Response are independent.
Ha : Gender and Response are dependent.
Expected counts:

               Response
               No change            Got better           Row sums
Men            40·32/75 = 17.07     40·43/75 = 22.93     40
Women          35·32/75 = 14.93     35·43/75 = 20.07     35
Column sums    32                   43                   75

Test Statistic: χ² = ∑ (Observed − Expected)²/Expected = (15−17.07)²/17.07 + (17−14.93)²/14.93 + (25−22.93)²/22.93 + (18−20.07)²/20.07 = .9384;
df = (r − 1) · (c − 1) = (2 − 1)(2 − 1) = 1
p-value > 0.2, (When df = 1, χ² = 0.9384 < 1.642)

Decision: Do not reject H0 . (P-value > 0.05)


Conclusion: We do not have significant evidence to show gender and response are dependent.
5.
(A) Supplier A: 80·30/400 = 6
Supplier B: 320·30/400 = 24;

(B) Supplier A: 80·370/400 = 74
Supplier B: 320·370/400 = 296;
(C) H0 :Quality of Casting and Supplier are independent.
Ha :Quality of Casting and Supplier are dependent.
Test Statistic: χ² = ∑ (Observed − Expected)²/Expected = (10−6)²/6 + (70−74)²/74 + (20−24)²/24 + (300−296)²/296 = 3.604
df = (r − 1) · (c − 1) = (2 − 1)(2 − 1) = 1
0.05 < p-value < 0.10, (When df = 1, 2.706 < χ² = 3.604 < 3.841)
Decision: Do not reject H0 . (p-value > 0.05)
Conclusion: We do not have significant evidence to show quality of casting and supplier are
dependent.
6. H0 : Party ID and race are independent.
Ha : Party ID and race are dependent.
Expected counts:

               Party Identification
Race           Democrat                   Independent                Republican                 Row Sums
Black          275·651/1791 = 99.96       275·661/1791 = 101.49      275·479/1791 = 73.55       275
White          1516·651/1791 = 551.04     1516·661/1791 = 559.51     1516·479/1791 = 405.45     1516
Column sums    651                        661                        479                        1791

χ² = ∑ (Observed − Expected)²/Expected
χ² = (192−99.96)²/99.96 + (75−101.49)²/101.49 + (8−73.55)²/73.55 + (459−551.04)²/551.04 + (586−559.51)²/559.51 + (471−405.45)²/405.45 = 177.307
df = (r − 1) · (c − 1) = (2 − 1)(3 − 1) = 2
p-value < 0.001, (When df = 2, χ² = 177.307 > 13.816)
Decision: Reject H0 . (p-value < 0.05)
Conclusion: We have significant evidence to show that party ID and race are dependent.
7.
(a) H0 : Grade category and response are not related.
Ha : Grade category and response are related.

(b) Expected counts:


                  Response
Grade Category    Yes                  No                    Row sum
High-A,BA,B       28·13/67 = 5.43      28·54/67 = 22.57      28
Middle-CB,C       30·13/67 = 5.82      30·54/67 = 24.18      30
Low-DC,D,E        9·13/67 = 1.75       9·54/67 = 7.25        9
Column sum        13                   54                    67
(c) χ² = ∑ (Observed − Expected)²/Expected
χ² = (3−5.43)²/5.43 + (25−22.57)²/22.57 + (9−5.82)²/5.82 + (21−24.18)²/24.18 + (1−1.75)²/1.75 + (8−7.25)²/7.25 = 3.904
(d) df = (r − 1) · (c − 1) = (3 − 1)(2 − 1) = 2
0.1 < p-value < 0.2, (When df = 2, 3.219 < χ² = 3.904 < 4.605)

(e) Do not reject H0 . (p-value > 0.05)


We do not have significant evidence to show that final grade category and response are related.
8. H0 : Smoking and lung cancers are not associated.
Ha : Smoking and lung cancers are associated.
Expected counts:

               Lung cancers
Smoker         Treatment               Control                 row sums
Yes            185·100/200 = 92.5      185·100/200 = 92.5      185
No             15·100/200 = 7.5        15·100/200 = 7.5        15
column sums    100                     100                     200

χ² = ∑ (Observed − Expected)²/Expected
χ² = (97−92.5)²/92.5 + (88−92.5)²/92.5 + (3−7.5)²/7.5 + (12−7.5)²/7.5 = 5.838
df = (r − 1) · (c − 1) = (2 − 1)(2 − 1) = 1
0.01 < p-value < 0.025, (When df = 1, 5.024 < χ² = 5.838 < 6.635)
Reject H0 . (p-value < 0.05)
We have significant evidence to show that smoking and lung cancers are associated.

CHAPTER 17
1.
                    Joint probability                Given condition
                    Test        Test                 Test        Test
                    Positive    Negative    Total    Positive    Negative
Condition    Yes    0.080       0.020       0.100    0.800       0.200
             No     0.270       0.630       0.900    0.300       0.700
Total               0.350       0.650
                    Given test result
Condition    Yes    0.229       0.031
             No     0.771       0.969

2.
(a) 27%
(b) 77.1%
(c) 65%
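The judgment table in problem 1, and the answers read from it in problem 2, can be rebuilt from three inputs. This is a minimal sketch; the function name `judgment_table` and the parameter names `prevalence`, `sensitivity`, and `specificity` are labels chosen here for P(condition), P(test positive | condition), and P(test negative | no condition), and do not appear in the course pack.

```python
def judgment_table(prevalence, sensitivity, specificity):
    """Joint probabilities, test-result totals, and P(condition | test result)."""
    joint = {
        ("yes", "pos"): prevalence * sensitivity,
        ("yes", "neg"): prevalence * (1 - sensitivity),
        ("no", "pos"): (1 - prevalence) * (1 - specificity),
        ("no", "neg"): (1 - prevalence) * specificity,
    }
    totals = {
        "pos": joint[("yes", "pos")] + joint[("no", "pos")],
        "neg": joint[("yes", "neg")] + joint[("no", "neg")],
    }
    # Divide each joint probability by its column (test result) total
    given_test = {key: value / totals[key[1]] for key, value in joint.items()}
    return joint, totals, given_test

# Problem 1 inputs: P(condition) = 0.10, P(pos | yes) = 0.80, P(neg | no) = 0.70
joint, totals, given_test = judgment_table(0.10, 0.80, 0.70)
print(round(joint[("no", "pos")], 3))       # problem 2(a)
print(round(given_test[("no", "pos")], 3))  # problem 2(b)
print(round(totals["neg"], 3))              # problem 2(c)
```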
3.
                     Joint probability                        Given item condition
                     Test Result    Test Result               Test Result    Test Result
                     Good(sold)     Bad(Scrapped)    Total    Good(sold)     Bad(Scrapped)
Item cond.    good   0.560          0.240            0.800    0.700          0.300
              bad    0.040          0.160            0.200    0.200          0.800
Total                0.600          0.400
                     Given test result
Item cond.    good   0.930          0.600
              bad    0.067          0.400
(a) 93%
(b) 60%

4.
                     Joint probability                        Given item condition
                     Test Result    Test Result               Test Result    Test Result
                     Good(sold)     Bad(Scrapped)    Total    Good(sold)     Bad(Scrapped)
Item cond.    good   0.792          0.008            0.800    0.99           0.01
              bad    0.002          0.198            0.200    0.01           0.99
Total                0.794          0.206
                     Given test result
Item cond.    good   0.997          0.039
              bad    0.003          0.961
(a) 99.7%
(b) 3.9%
5.
(a)
                     Joint probability           Given true condition
                     Test      Test              Test      Test
                     Pos.      Neg.      Total   Pos.      Neg.
True Cond.    Pos.   0.009     0.001     0.01    0.900     0.100
              Neg.   0.099     0.891     0.99    0.100     0.900
Total                0.108     0.892
                     Given test result
True Cond.    Pos.   0.083     0.0011
              Neg.   0.916     0.9989
6.
(a) 8.3%
(b) (.9 · .002)/(.9 · .002 + .998 · .1) = .0177 ≈ 2%
7.
(a) 36.8%
(b) 93.7%
8.
(a) 57.3%
(b) 16%
