
Mathematics 343 Class Notes

1 Introduction
Goal: To define probability and statistics and explain the relationship between the
two

1.1 What are Probability and Statistics?

1. Probability
Definition 1.1. A probability is a number meant to measure the likelihood of the occur-
rence of some uncertain event (in the future).
Definition 1.2. Probability (or the theory of probability) is the mathematical discipline
that
(a) constructs mathematical models for “real-world” situations that enable the compu-
tation of probabilities (“applied” probability)
(b) develops the theoretical structure that undergirds these models (“theoretical” or
“pure” probability).

2. Statistics
Definition 1.3. Statistics is the scientific discipline concerned with collecting, analyzing
and making inferences from data.

3. The relation between probability theory and statistics


Often, we arrange to collect the data by a process for which we have a probabilistic model.
Then probability theory informs our data analysis. The relationship between statistics
and probability theory is much like the relationship between mechanics and calculus.

1.2 An example of the relationship

Kellogg’s sells boxes of Raisin Bran labeled “Net Wt. 20 oz.”

1. What is meant by this claim?


Important Observation 1.4. Some variation is to be expected and should be allowed.
A probabilistic model is an appropriate “description” of this variation.

2. How does NIST recommend checking this claim?


Important Observation 1.5. We cannot check every box that Kellogg’s produces. We
must limit ourselves to a sample of such boxes and make inferences about the whole “pop-
ulation” of boxes.

1.3 Populations and Samples

1. Populations

Definition 1.6. A population is a well-defined collection of individuals.

We distinguish between actual (concrete) populations and conceptual (hypothetical) populations.

Definition 1.7. A parameter is a numerical characteristic of a population.

2. Samples

Definition 1.8. A sample is a subcollection of a given population.

Definition 1.9. A simple random sample of a given size n is a sample chosen from a
finite population in a manner such that each possible sample of size n is equally likely to
occur.

Homework.

1. Read the syllabus.

2. Read Section 1.1 of Devore and Berk.

3. Read Section 2 of the notes SimpleR. These notes and the R package are available at the
R section of the course webpage.

4. Download R to the computer that you use regularly. If you do not have easy access to
high-speed internet, ask your instructor for an installation CD.

5. Do problems 2.1,2,5,6 of SimpleR.

6. Do problems 1.4,6,9 of Devore and Berk.



2 Random Experiments

Goal: To develop the language for describing random (probabilistic) experiments

2.1 Experiments

1. Experiment (or random experiment) is an undefined term.

2. Experiments have three key characteristics

(a) future, not past


(b) could have any one of a number of outcomes, and which outcome will obtain is uncertain
(c) could be performed repeatedly (under essentially the same circumstances)

3. Examples of experiments.

2.2 The sample space

1. The sample space of an experiment is the set of all possible outcomes of that experiment.
(we usually use S for the name of the sample space)

2. Examples of sample spaces.

2.3 Events

1. An event is a subset of the set of outcomes.

2. The fundamental goal of a probability model:

We want to assign to each event E a number P (E) such that P (E) is the
likelihood that event E will happen if the experiment corresponding to the
sample space S is performed.

2.4 Language of Set Theory

Definition 2.1. Suppose that E and F are events.

1. The union of events E and F , denoted E ∪ F , is the set of outcomes that are in either E
or F

2. The intersection of events E and F , denoted E ∩ F , is the set of outcomes that are in
both E and F
3. The complement of an event E, denoted E′, is the set of outcomes that are in S but not
in E

• Two special events are ∅ (nothing happens!) and S (something happens)

2.5 Using R to generate random events

To construct a simple random sample of size 12 from a lot of size 250, we could use the following
R code.

> x=c(1:250)
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
....................
[217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
[235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
> sample(x,12,replace=F)
[1] 82 145 19 129 198 27 237 25 106 83 34 170
> sample(x,12,replace=F)
[1] 222 240 34 30 239 109 27 141 112 248 69 243
>

Homework.

1. Read Section 2.1 of Devore and Berk.


2. Do problems 2.2, 4, 5, 10.
3. There are (obviously) 100 positive integers in the range 1–100.
(a) How many of these are even? are prime?
(b) In a random sample of size 10, how many even numbers would one expect to find?
prime numbers?
(c) Use R to construct a random sample of size 10 from these integers. Record the
integers in your sample. How many elements of your sample are even? how many are
prime?

3 Probability Functions

Goal: To assign to each event A a number P (A), the probability of A

3.1 The Meaning of Probability Statements

1. The frequentist interpretation: the probability of an event A is the limit of the relative
frequency that A occurs in repeated trials of the experiment as the number of trials goes
to infinity.

2. The subjectivist interpretation: the probability of an event A is an expression of how confident the assignor is that the event will happen.

3.2 Assigning Probabilities - Theory

1. Axioms.

Axiom 3.1. For all events A, P (A) ≥ 0.

Axiom 3.2. P (S) = 1.

Axiom 3.3. If A1, A2, A3, . . . is a sequence of disjoint events, then

P(A1 ∪ A2 ∪ A3 ∪ · · ·) = ∑_{i=1}^{∞} P(Ai)

2. Consequences.

Theorem 3.4. P (∅) = 0.

Theorem 3.5. If A and B are disjoint sets, then P (A ∪ B) = P (A) + P (B).

Theorem 3.6. For every event A, P(A′) = 1 − P(A).

Theorem 3.7. For all events A and B, P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

3.3 Assigning Probabilities - Practice

1. If we can list the outcomes (simple events) E1, E2, E3, . . . and assign a probability P(Ei) to each, then the probability of any event A is determined by

P(A) = ∑_{Ei in A} P(Ei)

2. Special case: there are N possible outcomes and we judge that each outcome is equally likely to occur. Then each outcome has probability 1/N. And if an event A consists of k outcomes, then P(A) = k/N.
Example: toss two fair dice. Will a sum of 7 occur? (An R sketch of this computation appears after this list.)
Example: choose a random sample of 12 boxes of Raisin Bran from a lot of 250. Will there be an unacceptably underweight box in the sample?

3. Special case: if we have data on previous trials of the experiment, we may estimate the
probability of each outcome by the relative frequency with which that outcome occurred
in the previous trials.
Example: Jim Thome bats. Will a homerun occur?
Example: a 54 year old male buys a life insurance policy. Will he die in the next 10 years?
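The dice example in item 2 can be checked with a short R sketch (not from the text) that simply enumerates the 36 equally likely outcomes; the simulation at the end is an optional sanity check.

> dice = expand.grid(first=1:6, second=1:6)   # the 36 equally likely outcomes
> sums = dice$first + dice$second
> sum(sums == 7)/length(sums)                 # k/N = 6/36
[1] 0.1666667
> mean(replicate(10000, sum(sample(1:6,2,replace=TRUE)) == 7))   # relative frequency in 10000 simulated tosses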

3.4 Using R to Simulate Random Experiments

> outcomes=c('Out','Single','Double','Homerun','Walk')
> outcomes
[1] "Out" "Single" "Double" "Homerun" "Walk"
> relfreq=c(299,65,23,39,86)/512
> sum(relfreq)
[1] 1

> sample(outcomes,1,prob=relfreq)
[1] "Out"

> sample(outcomes,4,prob=relfreq,replace=T)
[1] "Out" "Out" "Out" "Single"

Homework.

1. Read Devore and Berk, Section 2.2.

2. Do problems 2.22,24,26,28,30 of Devore and Berk.



4 Counting

Goal: To develop methods for counting equally likely outcomes

4.1 Two Problems

Example 4.1. Suppose that there are 10 underweight boxes of Raisin Bran in a shipment of
250. What is the probability that there will be an underweight box in a random sample of 12
such boxes?

Example 4.2. Suppose that a fair coin is tossed 100 times. Should we be surprised if it comes
up heads more than 60 times?

4.2 The Fundamental Theorem of Counting

Proposition 4.3. Suppose that a set consists of ordered pairs such that there are n1 possible
choices for the first element of the ordered pair and for each such element there are n2 choices
for the second element. Then there are n1 n2 ordered pairs in the set.

Proposition 4.4. Suppose that a set consists of ordered k-tuples such that there are n1 choices
for the first element, for each choice of the first element there are n2 choices for the second
element, for each choice of the first two elements there are n3 choices for the third element, etc.
Then there are n1 n2 · · · nk−1 nk ordered tuples in the set.

4.3 Permutations and Combinations

Definition 4.5. An ordered sequence of k objects chosen from a set of n ≥ k objects is called
a permutation of size k. The number of such permutations is denoted Pk,n and is computed by

Pk,n = n(n − 1)(n − 2) · · · (n − k + 1).

Note that Pn,n = n!

Definition 4.6. A subset of k objects chosen from a set of n objects is called a combination of size k. The number of such combinations is denoted \binom{n}{k} (or sometimes Ck,n) and is computed by

\binom{n}{k} = P_{k,n}/P_{k,k} = n!/(k!(n − k)!)

The number \binom{n}{k} is usually called a binomial coefficient and is often read as “n choose k.”


Note that in each of these situations, we are selecting k objects without replacement. If order
“matters”, we are dealing with permutations. If order does not matter, we are dealing with
combinations.
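These counts are easy to compute with R's built-in factorial and choose functions; a small sketch (not from the text, values easy to verify by hand):

> factorial(5)      # P_{5,5} = 5! = 120
[1] 120
> prod(10:8)        # P_{3,10} = 10*9*8
[1] 720
> choose(10,3)      # "10 choose 3" = P_{3,10}/3!
[1] 120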

4.4 Solution to the two basic problems

> 1-choose(240,12)/choose(250,12)
[1] 0.3942094
> heads=c(0:100)
> prob=choose(100,heads)/(2^100)
> sum(prob)
[1] 1
> prob
[1] 7.888609e-31 7.888609e-29 3.904861e-27 1.275588e-25 3.093301e-24
[6] 5.939138e-23 9.403635e-22 1.262774e-20 1.467975e-19 1.500596e-18
................................
[91] 1.365543e-17 1.500596e-18 1.467975e-19 1.262774e-20 9.403635e-22
[96] 5.939138e-23 3.093301e-24 1.275588e-25 3.904861e-27 7.888609e-29
[101] 7.888609e-31
> sum(prob[61:100])
[1] 0.02844397

Homework.

1. Read Section 2.3 of Devore and Berk.


2. Do problems 2.35,40,42 of Devore and Berk.
3. A company receives a shipment of 100 computer chips. Inevitably, there will be defective
chips in the shipment. However the company is willing to accept the shipment if there
are no more than 5 defective chips in the shipment. Unfortunately, testing the chips for
defects is expensive and destructive. Suppose that the company decides to test 10 chips
and decides to reject the shipment if there is at least one defective chip in the sample.
(a) If there are 6 defective chips in the shipment, what is the probability that the com-
pany will reject the shipment?
(b) If there are only 5 defective chips in the shipment, what is the probability that the
company will reject the shipment?
(c) If the company is limited to testing just 10 chips, do you think that it is employing
the right decision rule?

5 Conditional Probability

Goal: To compute the probability of an event A given knowledge as to whether another event B has occurred

5.1 Examples

In each of the following examples, it appears that event A and event B “depend” on each other.

1. Experiment: Choose a Calvin senior at random. Event A: Student has a GPA greater
than 3.5. Event B: Student has an ACT score greater than 30.

2. Experiment: Choose 12 Raisin Bran boxes from a shipment of 250. Event A: There is a
box weighing less than 19.23 ounces. Event B: The average weight of the 12 boxes is less
than 19.9 ounces.

3. Experiment: Throw two dice. Event A: The sum of the two dice is twelve. Event B: A
six occurs on at least one of the two dice.

5.2 Conditional Probability Defined

Definition 5.1. Suppose that A and B are events such that P (B) > 0. The conditional
probability of A given B, denoted P (A|B), is

P(A|B) = P(A ∩ B) / P(B)

Example 5.2. In example 3 above, we have P(A) = 1/36, P(B) = 11/36, and P(A ∩ B) = 1/36, so P(A|B) = 1/11 and P(B|A) = 1.
Proposition 5.3. Given a sample space S and an event B with P(B) > 0, the function P′(A) = P(A|B) is a probability function defined on the new sample space S′ = B.

5.3 Using Conditional Probability to Compute Unconditional Probabilities

Since P(A ∩ B) = P(B)P(A|B), we can use conditional probabilities (such as P(A|B)) to compute unconditional probabilities such as P(A ∩ B).
Sampling without replacement is a typical example.
Example 5.4. A class of 31 calculus students has 14 females in the class. If a random sample of size 2 is chosen from the class, what is the probability that both are female?
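A quick R check of Example 5.4, computed two ways (the numbers are those in the example):

> (14/31)*(13/30)             # P(first female)*P(second female | first female)
[1] 0.1956989
> choose(14,2)/choose(31,2)   # counting equally likely samples of size 2
[1] 0.1956989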

Example 5.5. A certain rare disease has an incidence of 0.1% in the general population. There is a test for this disease but the test can be in error. It is estimated that the test indicates false positives 1% of the time (that is, a person who doesn't have the disease tests positive) and false negatives 5% of the time (that is, a person with the disease tests negative). What is the probability that a randomly chosen person receives a positive test result?
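A sketch of the computation for Example 5.5 in R, using P(B) = P(B|D)P(D) + P(B|D′)P(D′), where B is a positive test and D is having the disease:

> 0.95*0.001 + 0.01*0.999   # P(pos|disease)P(disease) + P(pos|no disease)P(no disease)
[1] 0.01094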

5.4 Independence

Informally, A and B are independent if the fact that B occurs does not affect the probability
that A occurs. Formally,

Definition 5.6. Events A and B are independent if P (A ∩ B) = P (A)P (B).

Note that if P (B) = 0 or P (A) = 0 then events A and B are automatically independent. The
following proposition gives a more intuitive characterization of independence in the case that
P (B) > 0.

Proposition 5.7. Suppose that P (B) > 0. Then events A and B are independent iff P (A) =
P (A|B).

Sampling with replacement is a typical situation where independence is applied.

Homework.

1. Read pages 73–78 and also 83, 84

2. Do problems 2.45, 46, 48, 49, 55, 56



6 Bayes’ Theorem

Goal: To compute conditional probabilities of the form P (B|A) from P (A|B)

6.1 Simple Statement of the Theorem

Theorem 6.1 (Bayes’ Theorem). Suppose that A and B are events such that P (A) > 0 and
P (B) > 0. Then

P(B|A) = P(A|B)P(B) / P(A)

Proof. By the definition of conditional probability

P (A ∩ B) = P (A|B)P (B) and P (A ∩ B) = P (B|A)P (A)

The result follows immediately by equating the two expressions for P (A ∩ B).

6.2 Examples

Example 6.2. Medical testing for a rare disease.

T a person tests positive for a certain disease


D the person has the disease

Mammograms have, on some reports, a 30% false negative rate: P(T|D) = 0.7. The false positive rate is lower, perhaps 10%: P(T|D′) = 0.1. In typical situations, P(D) = 0.005.
Obviously the important question is what is P (D|T )?
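A sketch of this computation in R, combining Bayes' Theorem with the total-probability calculation of Example 5.5 (the numbers are those given above):

> pD = 0.005; pTgivenD = 0.7; pTgivenDc = 0.1
> pT = pTgivenD*pD + pTgivenDc*(1-pD)   # P(T)
> pTgivenD*pD/pT                        # P(D|T)
[1] 0.03398058

So a positive mammogram raises the probability of disease from 0.005 to only about 0.034.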

Example 6.3. The dependability of the judicial system.

G the defendant is guilty


J the jury finds the defendant guilty

A scholarly study estimated that in capital cases in Illinois, P(G|J) = .93. How accurate are juries? That is, what are P(J|G) and P(J|G′)?

6.3 Aside - Information Markets

Estimates of probabilities from past events (e.g., mammogram errors) can often be made ac-
curately by computing relative frequencies. What about probabilities of possible future events
that may turn out one way or another?

Example 6.4. What is the probability that the Democratic Party will win the 2008 Presidential Election if Hillary Clinton is nominated?

Event Probability
Clinton is Nominated
Clinton is Elected
The Democratic Candidate Wins
McCain is Nominated
McCain is Elected
The Republican Candidate Wins

Homework.

1. Read Section 2.4.

2. Do problems 2.59, 60, 106, 109.



7 Random Variables and The Binomial Distribution

Goal: To introduce the concept of random variables by way of an extraordinarily important example

7.1 Random Variables

Definition 7.1. Given an experiment with sample space S, a random variable is a function X
defined on S that has real number values. For a given outcome o ∈ S, X(o) is the value of the
random variable on outcome o.

We generally use uppercase letters near the end of the alphabet for random variables (X, Y ,
etc.). Examples:

1. Choose a random sample of size 12 from 250 boxes of Raisin Bran. Let X be the random
variable that counts the number of underweight boxes and let Y be the random variable
that is the average weight of the 12 boxes.

2. Choose a Calvin senior at random. Let Z be the GPA of that student and let U be the
composite ACT score of that student.

3. Choose a football player in the National Football League at random. Let W be the weight
(in pounds) of that player.

4. Throw a fair die until all six numbers have appeared. Let T be the number of throws
necessary.

For most purposes, we can consider a random variable as an experiment with outcomes that
are numbers. Random variables can have finitely many or infinitely many different values.

Definition 7.2. A random variable is discrete if it has only finitely many different values or
infinitely many values that can be listed in a list v1 , v2 , v3 , . . . .

Notice that X, U and T above are discrete random variables while the others are not.

7.2 Binomial Random Variables

A binomial experiment is a random experiment characterized by the following conditions:

1. The experiment consists of a sequence of finitely many (n) trials of some simpler experi-
ment.

2. Each trial results in one of two possible outcomes, usually called success (S) and failure
(F ).

3. The probability of success on each trial is a constant denoted by p.

4. The trials are independent one from another - that is the outcome of one trial does not
affect the outcome of any other.

Thus a binomial experiment is characterized by two parameters, n and p.

Definition 7.3. Given a binomial experiment, the binomial random variable X associated with
this experiment is defined by X(o) is the number of successes in the n trials of the experiment.

Examples:

1. A fair coin is tossed n = 10 times with the probability of a HEAD (success) being p = .5.
X is the number of heads.

2. A basketball player shoots n = 25 freethrows with the probability of making each freethrow
being p = .70. Y is the number of made freethrows.

3. A quality control inspector tests the next n = 12 widgets off the assembly line each of
which has a probability of 0.10 of being defective. Z is the number of defective widgets.

7.3 The pmf of a discrete random variable

Definition 7.4. Suppose that X is a discrete random variable. The probability mass function
(pmf) of X is the function
pX (x) = P r({o : X(o) = x})

(We will write the right hand side of this equation as P (X = x).)

We will drop the subscript X in naming pX if the random variable in question is clear. And
sometimes, as in the case of a binomial random variable, we will give the pmf a more suggestive
name. The pmf is also sometimes called the distribution of X.

Theorem 7.5 (The Binomial Distribution). Suppose that X is a binomial random variable
with parameters n and p. The pmf of X, denoted by b(x; n, p), is given by

b(x; n, p) = \binom{n}{x} p^x (1 − p)^{n−x}

7.4 R Computes the pmf of any Binomial Random Variable

> help(dbinom)
> dbinom(x=7,size=10,prob=.7)
[1] 0.2668279
> x=c(0:10)
> dbinom(x=x,size=10,prob=.7)
[1] 0.0000059049 0.0001377810 0.0014467005 0.0090016920 0.0367569090
[6] 0.1029193452 0.2001209490 0.2668279320 0.2334744405 0.1210608210
[11] 0.0282475249
> y=dbinom(x=x,size=10,prob=.7)
> plot(x,y)
>

The Binomial Distribution

Description:

Density, distribution function, quantile function and random


generation for the binomial distribution with parameters ’size’
and ’prob’.

Usage:

dbinom(x, size, prob, log = FALSE)


pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
rbinom(n, size, prob)

Arguments:

x, q: vector of quantiles.

p: vector of probabilities.

n: number of observations. If ’length(n) > 1’, the length is


taken to be the number required.

size: number of trials.

prob: probability of success on each trial.



..............

Value:

’dbinom’ gives the density, ’pbinom’ gives the distribution


function, ’qbinom’ gives the quantile function and ’rbinom’
generates random deviates.

If ’size’ is not an integer, ’NaN’ is returned.


[Plot omitted: the pmf values y plotted against x (horizontal axis 0 to 10, vertical axis 0.00 to 0.25), produced by plot(x,y) above.]

Figure 1: b(x; 10, .7)

Homework.

1. It’s not easy to give a reading assignment as we have just jumped all around in Chapter
3. However this material is covered on pages 95, 99, and 125–128.

2. For each of the following binomial random experiments, use R to compute the indicated
probability. In each, identify clearly the parameters of the experiment and the outcome
of a trial that you are considering a “success.”

(a) 100 boxes of Raisin Bran coming off the production line are inspected. The probabil-
ity that any one of these is significantly underweight is .01. What is the probability
that none of the 100 boxes are underweight? one is underweight? two or more are
underweight?
(b) Jermain Dye hits a homerun in 7.48% of his plate appearances. Suppose that he
comes to the plate 4 times in tomorrow night’s game. What is the probability that
he will get at least one homerun in that game?
(c) A section of SAT problems has 12 multiple choice problems, each with five possible
options. What is the probability that a guesser will be so unlucky as to get every
one of the problems wrong?
(d) Normally, the random variable associated with the roll of a die has 6 outcomes.
But suppose that we are concerned only with whether a six appears on the die or
not. What is the probability that a six occurs at least three times in five rolls (an
important question in the game of Yahtzee!)? What is the probability that one of
the six numbers occurs at least three times in five rolls (justify your answer)?

8 Binomial Distribution Continued and The Hypergeometric Distribution
Goal: To introduce further properties of random variables using the example of the
binomial distribution and to introduce the hypergeometric distribution

8.1 The Cumulative Distribution Function

The probability mass function (pmf) of a discrete random variable X is sufficient to answer any
probability question about X. However answering such questions as

What is the probability of between 40 and 60 heads in 100 tosses of a fair coin?

requires a considerable amount of addition.


Definition 8.1. The cumulative distribution function (cdf) of a discrete random variable X with pmf p is the function F defined by

F(x) = P(X ≤ x) = ∑_{y≤x} p(y)

We will write F_X for the cdf of X when the relevant random variable is in doubt.
Proposition 8.2. For any discrete random variable X, the cdf F of X satisfies

1. lim_{x→−∞} F(x) = 0

2. lim_{x→∞} F(x) = 1

3. if x1 < x2 then F(x1) ≤ F(x2)

4. p(b) = F(b) − F(b−) where F(b−) = lim_{x→b−} F(x).

> pbinom(q=c(0:10),size=10,prob=.7) # pbinom computes the cdf


[1] 0.0000059049 0.0001436859 0.0015903864 0.0105920784 0.0473489874
[6] 0.1502683326 0.3503892816 0.6172172136 0.8506916541 0.9717524751
[11] 1.0000000000
> pbinom(11,10,.7)
[1] 1
> pbinom(-3,10,.7)
[1] 0
> pbinom(8,10,.7)-pbinom(5,10,.7)
[1] 0.7004233

8.2 Simulations

In order to get a better understanding of a random process, it is helpful to be able to perform the
underlying random experiment many times. While it is usually impractical or even impossible
to do that with a real experiment, if the pdf is known, the computer can be used to simulate
the random variable.
Example 8.3. Suppose that a manufacturing process produces widgets in lots of 1,000 and
the probability that any particular widget is defective is 0.1%. In R, the command rbinom
simulates any particular binomial distribution. The following code simulates 100 lots of 1,000
widgets.

> sim=rbinom(100,size=1000,prob=.001)
> table(sim)
sim
0 1 2 3 4
43 32 18 5 2
> dbinom(c(0:4),1000,.001)
[1] 0.36769542 0.36806349 0.18403174 0.06128251 0.01528996
> 1-pbinom(4,1000,.001)
[1] 0.003636878

8.3 The Hypergeometric Distribution

An hypergeometric experiment is characterized by the following assumptions.

1. there is a population of N individuals,


2. the individuals are divided into two groups, one of M individuals (called the “success”
group) and one of N − M individuals (called the “failure” group)
3. a sample of n individuals is selected without replacement in such a way that each subset
of size n is equally likely to be chosen.

Given a hypergeometric experiment with parameters n, N , and M , the random variable X that
counts the number of successes in the sample is said to have a hypergeometric distribution.
Proposition 8.4. A random variable X that has a hypergeometric distribution with parameters n, N and M has pmf given by

h(x; n, M, N) = \binom{M}{x}\binom{N−M}{n−x} / \binom{N}{n}    max{0, n − N + M} ≤ x ≤ min{M, n}

In R, the functions dhyper and phyper compute the pmf and cdf of a hypergeometric random
variable. Unfortunately, R uses completely different notation than above. In R, m is used for
M , but n is used for N − M . In other words, m is the number of successes and n is the
number of failures in the population which therefore has size m+n. The size of the sample is
called k in R. Therefore the following R code references a hypergeometric distribution where
the population has size 10, there are 7 successes and 3 failures, and we are choosing 5 objects
without replacement.

> x=c(0:5)
> dhyper(x,m=7,n=3,k=5)
[1] 0.00000000 0.00000000 0.08333333 0.41666667 0.41666667 0.08333333
> phyper(x,m=7,n=3,k=5)
[1] 0.00000000 0.00000000 0.08333333 0.50000000 0.91666667 1.00000000
> table(rhyper(100,m=7,n=3,k=5))

2 3 4 5
6 39 46 9

Homework.

1. Read Sections 3.1 and 3.2 and pages 134–136 of Devore and Berk. Note the notation
X ∼ Bin(n, p) to signify that X is a random variable that has a binomial distribution
with parameters n and p.

2. Do problems 3.12, 3.23 (of Section 3.2 of Devore and Berk).

3. Do problem 3.36 (in Section 3.5).

4. Do problem 3.82(a,b,c) (in Section 3.6).



9 Expected Values

Goal: To define the expected value of a random variable and to compute expected
values for binomial and hypergeometric random variables

9.1 Expected Value Defined

Definition 9.1. Suppose that X is a discrete random variable with pmf p. The expected value
of X, denoted E(X) or µX , is the following sum, provided that it exists:
E(X) = ∑_x x p(x)

where the sum is taken over all possible values of X.


Example 9.2. Suppose that a fair six-sided die is tossed once and that X is the number that occurs. Then p(x) = 1/6 for each x = 1, 2, 3, 4, 5, 6 and

E(X) = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 21/6 = 3.5

Intuitively, if the experiment yielding the random variable X is performed many times, the
average of the values of X that actually occur should be approximately E(X). Note also that
E(X) is the center of mass of a system of points with mass p(x) located at x.

9.2 The Expected Value of Binomial and Hypergeometric Random Variables

Theorem 9.3. Suppose that X ∼ Bin(n, p). Then E(X) = np.


Theorem 9.4. If X is a hypergeometric random variable with parameters n, M , and N , then
E(X) = nM/N .

The proof of Theorem 9.4 will be deferred until we have more technology making the proof
easy.
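In the meantime, Theorem 9.4 is easy to check numerically in R with dhyper from Section 8.3. Here N = 250, M = 10, and n = 12, matching the Raisin Bran example (recall that R's arguments are m = M, n = N − M, k = n):

> x = 0:10
> sum(x*dhyper(x, m=10, n=240, k=12))   # E(X) computed directly from the pmf
[1] 0.48
> 12*10/250                             # nM/N
[1] 0.48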

9.3 Functions of Random Variables

Suppose that h is a function that is defined on all values of a random variable X. Then we can
define a new random variable Y by Y = h(X).
Example 9.5. Let h(x) = x2 . Then if X is the random variable that is the numerical result
of rolling a fair die, Y = h(X) is simply the square of that value. The pmf of Y is
p_Y(y) = 1/6    y = 1, 4, 9, 16, 25, 36

If X is a random variable and Y = h(X), we can find E(Y) in two steps: first find the pmf p_Y of Y from p_X and then compute ∑_y y p_Y(y). The next theorem gives us a shortcut.

Theorem 9.6. If X is a random variable and Y = h(X) then

E(Y) = E(h(X)) = ∑_x h(x) p_X(x)

provided that the sum ∑_x |h(x)| p_X(x) exists.

> x=(0:10)
> px=dbinom(x,10,.65)
> sum(x*px)
[1] 6.5
> y=x^2
> sum(y*px)
[1] 44.525
> 6.5^2
[1] 42.25
> r=rbinom(100,10,.65)
> mean(r)
[1] 6.48
> mean(r^2)
[1] 44.04

An important property of expectation is linearity.


Proposition 9.7 (Linearity of Expectation). Suppose that X is a discrete random variable and
a and b are constants. Then E(aX + b) = aE(X) + b.

9.4 The Negative Binomial Random Variable

A negative binomial distribution results from an experiment similar to a binomial distribution.


The distribution is characterized by two parameters: r a positive integer and p a probability.
The conditions for the distribution are the following:

1. The experiment consists of a sequence of independent trials.

2. Each trial can result in either a success or a failure.

3. The probability of a success on each trial is p.

4. The experiment continues until a total of r successes have occurred.



The random variable X that results from a negative binomial experiment is the number of
failures that precede the rth success. (Some books count the total number of trials rather than
the number of failures.) Notice this is the first example of a random variable that can take on
infinitely many values – X can be 0, 1, 2, . . . .
Proposition 9.8. The pmf of a random variable X that results from a negative binomial experiment with parameters p and r is

nb(x; r, p) = P(X = x) = \binom{x+r−1}{r−1} p^r (1 − p)^x    x = 0, 1, 2, . . .
Theorem 9.9. The expected value of a negative binomial random variable X with parameters
r and p is r(1 − p)/p.

Obviously, R knows the negative binomial distribution. In dnbinom, the parameters r and p
are named size and prob respectively.

> dnbinom(x=(0:30),size=3,p=1/6)
[1] 0.004629630 0.011574074 0.019290123 0.026791838 0.033489798 0.039071431
[7] 0.043412701 0.046513608 0.048451675 0.049348928 0.049348928 0.048601217
[13] 0.047251183 0.045433830 0.043270314 0.040866408 0.038312257 0.035682985
[19] 0.033039801 0.030431396 0.027895446 0.025460129 0.023145572 0.020965192
[25] 0.018926909 0.017034219 0.015287119 0.013682915 0.012216889 0.010882861
[31] 0.009673654
> pnbinom(30,size=3,p=1/6)
[1] 0.929983
> sim=rnbinom(1000,size=3,p=1/6)
> table(sim)
sim
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
4 14 12 27 25 47 44 35 44 46 48 53 55 40 44 40 49 31 30 33 37 19 28 17 12 21
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 48 49 50 51 52
21 12 17 8 9 10 8 14 7 3 6 6 1 5 2 2 2 1 1 1 1 1 1 1 1 1
53 57 59
1 1 1

Homework.

1. Read Devore and Berk, pages 109–111 and page 138.

2. Do problems 3.29, 3.33, 3.34, 3.37, and 3.87 of Devore and Berk.

3. Consider the random variable D obtained by adding the faces of two fair dice when they are thrown.

(a) Write the pmf of D.


(b) Compute the expected value of D.

4. A basketball player makes 90% of her free throws. Suppose that in practice she shoots free
throws until she misses one. What is the probability that she takes at least 20 throws?

5. One of the first problems in probability theory was proposed to Pascal by the Chevalier
de Mere. The gambling game in question was this. The player throws two dice repeatedly
until a double-six occurs. The gambler wins the game if this happens in 24 or fewer
throws. What is the probability that the gambler wins?

10 Variance

Goal: To define the variance of a discrete random variable

10.1 The Variance

Definition 10.1. Let X be a discrete random variable with pmf p and expected value µ. Then the variance of X, denoted by V(X) or σ²_X, is

V(X) = E[(X − µ)²] = ∑_x (x − µ)² p(x)

The standard deviation of X, denoted SD(X) or σ_X, is

σ_X = √(σ²_X)

The variance of a random variable X is a measure of the spread or range of possible values of
X. The larger the variance of X, the more likely it is that X will take on values far away from
the mean. The following table summarizes the properties of the discrete distributions that we
have met so far.

Distribution          Parameters   pmf                                            Mean         Variance

Binomial              n, p         \binom{n}{x} p^x (1 − p)^{n−x}                 np           np(1 − p)

Hypergeometric        n, N, M      \binom{M}{x}\binom{N−M}{n−x} / \binom{N}{n}    nM/N         n(M/N)(1 − M/N)(N − n)/(N − 1)

Negative Binomial     r, p         \binom{x+r−1}{r−1} p^r (1 − p)^x               r(1 − p)/p   r(1 − p)/p²

10.2 Formulas for Variance

It is often easier to compute the variance using the following fact:


Proposition 10.2. For any discrete random variable X with expected value µ,

V(X) = E(X²) − (E(X))² = E(X²) − µ²
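A quick numerical illustration of Proposition 10.2 in R, using the Bin(10, .65) distribution from the expected-value example of Section 9.3:

> x = 0:10; px = dbinom(x, 10, .65)
> mu = sum(x*px)            # E(X) = 6.5
> sum((x - mu)^2*px)        # V(X) from the definition
[1] 2.275
> sum(x^2*px) - mu^2        # E(X^2) - mu^2
[1] 2.275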

The variance of a linear function of X is easily computed from that of X.


Proposition 10.3. Suppose that X is a random variable and Y = aX + b for constants a, b. Then

σ²_Y = a² σ²_X    and    σ_Y = |a| σ_X.

10.3 Chebyshev’s Inequality

Chebyshev’s Inequality says that it is likely that the value of X will be within several standard deviations of the mean of X.

Theorem 10.4 (Chebyshev’s Inequality). Suppose that X is a random variable with mean µ and variance σ² and let k ≥ 1. Then

P(|X − µ| ≥ kσ) ≤ 1/k²

Proof. Let A = {x : |x − µ| ≥ kσ} and let D be the set of all possible values of X. We have

σ² = ∑_{x∈D} (x − µ)² p(x) ≥ ∑_{x∈A} (x − µ)² p(x) ≥ ∑_{x∈A} k²σ² p(x)

Therefore

1/k² ≥ ∑_{x∈A} p(x)

This is the desired inequality. (Notice that the inequality is true for k < 1 as well however it
isn’t very interesting!)

The following R example shows that the Chebyshev bound is rather conservative, at least in the case of this particular random variable. The example being illustrated there is a binomial random variable with n = 100, p = .5, and so µ = 50 and σ = √(100/4) = 5.

> pbinom(40,100,.5)+(1-pbinom(59,100,.5)) # k=2


[1] 0.05688793
> pbinom(35,100,.5)+(1-pbinom(64,100,.5)) # k=3
[1] 0.003517642
> pbinom(30,100,.5)+(1-pbinom(69,100,.5)) # k=4
[1] 7.85014e-05

Homework.

1. Read Section 3.3 of Devore and Berk.

2. Suppose that a multiple choice test has 30 questions with 5 choices each. Suppose that
the test-taker guesses purely randomly on each question. Let X be the random variable
that counts the number of correct guesses.

(a) Compute the expected value of X.


(b) Compute the variance of X.
(c) Use Chebyshev’s inequality to compute the probability that the test taker gets at
least 15 problems right.
(d) Use R to find the exact probability that the test taker gets at least 15 problems
right.

3. Do problem 28 on p. 116 and problem 72 on p. 133.



11 Hypothesis Testing

Goal: To introduce hypothesis testing through the example of the binomial distribu-
tion

11.1 Setting

Suppose that a real-world process is modeled by a binomial distribution for which we know n
but do not know p. Examples abound.

Example 11.1. A factory produces the ubiquitous widget. It claims that the probability that
any widget is defective is less than 0.1%. We receive a shipment of 100 widgets. We wonder
whether the claim about the defective rate is really true. The shipment is an example of a
binomial experiment with n = 100 and p unknown.

Example 11.2. A National Football League team is trying to decide whether to replace its
field goal kicker with a new one. The team estimates that the current kicker makes about 30%
of his kicks from 45 yards out. They want to try out the new kicker by asking him to try 20
kicks from 45 yards out. This might be modeled by a binomial distribution with n = 20 and p
unknown. The team is hoping that p > .3.

Example 11.3. A standard test for ESP works as follows. A card with one of five printed
symbols is selected without the person claiming to have ESP being able to see it. The purported
psychic is asked to name what symbol is on the card while the experimenter looks at it and
“thinks” about it. A typical experiment consists of 25 trials. This is an example of a binomial
experiment with n = 25 and unknown p. The experimenter usually believes that p = .2.

In each of these examples, we have a hypothesis about p that we can consider ourselves to be testing. In Example 11.1, we are testing whether p = .001. In Example 11.2, we are testing whether p ≤ .3. Finally, in Example 11.3, we are obviously testing the hypothesis that p = .2.

11.2 Hypotheses

A hypothesis proposes a possible state of affairs with respect to a probability distribution


governing an experiment that we are about to perform. Examples:

1. A hypothesis stating a fixed value of a parameter: p = .5.

2. A hypothesis stating a range of values of a parameter: p ≥ .7.

3. A hypothesis about the nature of the distribution itself; X has a binomial distribution.

In a typical hypothesis test, we pit two hypotheses against each other:

1. Null Hypothesis. The null hypothesis, usually denoted H0 , is generally a hypothesis


that the data analysis is intended to investigate. It is usually thought of as the “default” or
“status quo” hypothesis that we will accept unless the data gives us substantial evidence
against it.

2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha , is the


hypothesis that we are wanting to put forward as true if we have sufficient evidence
against the null hypothesis.

3. Possible Decisions. On the basis of the data we will either reject H0 or fail to reject
H0 (in favor of Ha ).

4. Asymmetry. Note that H0 and Ha are not treated equally. The idea is that H0 is the
default and only if we are reasonably sure that H0 is false do we reject it in favor of Ha .
H0 is “innocent until proven guilty” and this metaphor from the criminal justice system
is good to keep in mind.

In the examples above, the pairs of hypotheses that we are probably wanting to test are:
H0 p = .001 H0 p = .3 H0 p = .2
Ha p > .001 Ha p > .3 Ha p > .2
Example 11.1 Example 11.2 Example 11.3

11.3 Decisions and Errors

How do we decide to reject H0 ? Obviously, we perform the experiment and decide whether the
result is in greater accord with H0 or Ha . There are two types of errors that we could make.

Definition 11.4. A Type I error is the error of rejecting H0 even though it is true. The
probability of a type I error is denoted by α.
A Type II error is the error of not rejecting H0 even though it is false. The probability of a
Type II error is denoted by β.

In all of the examples above, a reasonable strategy to follow is to perform the experiment and
reject H0 if the resulting number of successes is too large.

Definition 11.5. A test statistic is a random variable on which the decision is to be based.
A rejection region is the set of all possible values of the test statistic that would lead us to
reject H0 .

In all these examples, our test statistic will simply be the number of successes of the binomial
experiment X. Our rejection region will be sets of the form R = {x : x ≥ x0 } for some constant
x0 . The number x0 will be chosen based on our choice of α.
Convention. Random variables are denoted by uppercase letters such as X. Random variables
are functions to be applied to the outcomes of experiments. The value of a random variable X
computed on a particular trial of the experiment will be denoted by the corresponding lowercase
letter, in this instance x. After the experiment, x is data and is simply a number.
The following R output provides relevant information for Example 11.3.

> pbinom(c(5:10),25,.2)
[1] 0.6166894 0.7800353 0.8908772 0.9532258 0.9826681 0.9944451

Consider, in this example, the decision rule:

Reject H0 if and only if x ≥ 10.

This decision procedure has a probability of less than 2% of making a type I error. (If the null
hypothesis is true, the probability is 98.2% that we will not get a value this extreme.)
One approach to setting up a test procedure is to choose a rejection region in advance of the
experiment based on some desired limit on the value of α. Typical choices are .05 and .01. For
the above example, if we decide that α should be no more than .05 we would reject H0 if x ≥ 9.
If we are more conservative and desire α ≤ .01, then we would reject H0 if x ≥ 11.
Another approach to testing is to avoid defining the rejection region altogether and to report
the result of the experiment in such a way that those interpreting the results can choose their
own α.

Definition 11.6. The p-value of the value x of a test statistic X is the probability that, if H0
is true, X would have a value at least as extreme (in favor of the alternate hypothesis) as x.

In the above example, the p-value of an outcome of 10 is 1 − .982 = .018. The p-value tells us
exactly for which α would we reject H0 .

Note the extraordinarily confusing feature of the name of the p-value. There are
two completely different p’s in this story. There is the p that is the value of the
parameter of the underlying binomial distribution. And there is the p that results
from analyzing the data. We’ll just have to deal with this. By the way, some authors
call the parameter in the binomial distribution π. That would help.

11.4 In the absence of normality

If X and Y are not necessarily normal but the sample sizes are large, the above results are approximately true. This is not very interesting.

Homework.

1. Devore and Berk pages 418–422 cover this material. They might be somewhat difficult to
read however as they use language developed over the course of Chapters 7 and 8.

2. Do problems 3.68 and 3.69 of Devore and Berk.

3. Do problem 9.9 of Devore and Berk (yes, Chapter 9). This introduces a new wrinkle in
that the alternate hypothesis is two-sided.

4. In Example 11.2, the team decides to fire the old kicker and hire the new one if the new kicker makes 8 or more kicks (out of the 20 trial kicks).

(a) What is the p-value of this test?


(b) Suppose that the new kicker actually has a 35% probability of making such kicks.
What is the probability that the team will hire the new kicker based on this test?

12 Continuous Random Variables

Goal: To introduce the concept of continuous random variable

12.1 Provisional Definition

Definition 12.1. A random variable X is continuous if its possible values consist of a union
of intervals of real numbers and the probability that X is any single real number is 0.

We will have a better definition later. This one will suffice for now. The basic idea is that
continuous random variables are those which represent measurements that are taken on a real
number scale.

Example 12.2. Suppose a student is selected at random from the Calvin population. Continuous random variables that we might report are weight, height, GPA, cholesterol level, etc.

12.2 Probability as Area or Mass


[Two histograms of the senior GPA data omitted: GPAs from 2.0 to 4.0 on the horizontal axis; the left panel shows counts (0 to 250), the right panel shows density (0.0 to 0.8).]

Figure 2: Senior GPAs

The two histograms are of the GPAs of all seniors at a certain college located somewhere in Western Michigan. The difference between the two histograms is in the units used on the y-axis. In the first histogram, one can read off the counts (out of the 1333 seniors) of the number of seniors in each bin. The second histogram is a density histogram. The y-axis units are called density units and are scaled so that the total area of all bins in the histogram is 1 (where area is simply width in x units times height in these density units).
Important Observation 12.3. Given a density histogram of a population, the probability
that a randomly chosen individual from the population comes from any particular bar of the
histogram is equal to the area of that bar.

12.3 Density as Limit of Histograms

[Two density histograms of the GPA data omitted, drawn with different bin widths: GPAs from 2.0 to 4.0 on the horizontal axis, density (0.0 to 0.8) on the vertical axis.]

Figure 3: GPA Data, Different Bin Widths

The histogram on the right, with its smaller bin widths, gives a better picture of the distribution of senior GPAs. With a finite population, there is a limit to the narrowness of the bins. With very small bin widths, many bins would have no individuals while some might have two or three. This would lead to a very spiky appearance and a loss of information. The histogram in Figure 4 has a smooth curve superimposed and this curve seems to approximate the distribution of the GPAs well. We will defer the discussion of exactly how the curve is drawn until much later in the course, but a key property of the histograms that it preserves is that the area under the curve is 1.
There are two ways to view this curve. On the one hand we might consider it a smooth approximation of a finite population. On the other hand, we might think of the curve as modelling a conceptual population of something like “all possible seniors at the unnamed college.” In general, of course, such a curve can only fit exactly an infinite, conceptual population.

[Density histogram of the GPA data with a smooth density curve superimposed omitted: GPAs from 2.0 to 4.0 on the horizontal axis, density (0.0 to 0.8) on the vertical axis.]

Figure 4: GPA Data with Smooth Density

12.4 Probability Density Functions

The above discussion motivates the following definition.


Definition 12.4. A probability density function (pdf) is a function f(x) such that

1. f(x) ≥ 0 for all x, and

2. ∫_{−∞}^{∞} f(x) dx = 1.

A pdf is the pdf for a random variable X if for every a, b

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

Example 12.5. A model for the waiting time X (in minutes) to the next radioactive event
recorded on a Geiger counter is given by the following pdf:

f(x) = 100e^{−100x}    x ≥ 0



The probability that we will wait at most 0.01 minutes is given by

∫_0^{0.01} f(x) dx = [−e^{−100x}]_0^{0.01} = 1 − e^{−1} = .632
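This value can be checked in R either with the built-in exponential cdf (pexp uses the rate parameterization, so rate = 100 here) or by numerical integration:

> pexp(0.01, rate=100)                              # P(X <= 0.01) = 1 - exp(-1)
[1] 0.6321206
> integrate(function(x) 100*exp(-100*x), 0, 0.01)   # the same integral, done numerically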

12.5 R Code for Drawing Histograms

> sr=read.csv('sr.csv')
> sr
SATM SATV ACT GPA
1 700 670 30 3.992
2 NA NA 25 3.376
3 NA NA 20 3.020
4 NA NA 24 3.509
5 NA NA 32 3.970
10 710 680 32 4.000
......................
1330 670 700 27 3.830
1331 NA NA 25 2.234
1332 NA NA 27 3.163
1333 NA NA 24 2.886
> hist(sr$GPA)
> hist(sr$GPA,br=seq(1.7,4.0,.230))
> layout(matrix(c(1,2),1,2))
> layout.show(2)
> hist(sr$GPA,br=seq(1.7,4.0,.230))
> hist(sr$GPA,br=seq(1.7,4.0,.230),prob=T)
> layout(1)
> hist(sr$GPA,br=seq(1.7,4.0,.115),prob=T)
> d=density(sr$GPA)
> lines(d)

Homework.

1. Read Devore and Berk, pages 155–159

2. Do problems 4.1,3,4,10

13 Continuous Random Variables - II

Goal: To define the cdf of a continuous random variable and to introduce two im-
portant families of random variables

13.1 The cdf of a Random Variable

Definition 13.1. The cumulative distribution function (cdf ) of a continuous random variable
X is the function F (x) = P (X ≤ x).

The following proposition is almost the same as Proposition 8.2.

Proposition 13.2. For any continuous random variable X, the cdf F of X satisfies

1. lim_{x→−∞} F(x) = 0

2. lim_{x→∞} F(x) = 1

3. if x1 < x2 then F (x1 ) ≤ F (x2 )

4. for all a, F (a) = P (X ≤ a) = P (X < a)

5. for all a, b, P (a ≤ X ≤ b) = F (b) − F (a).

We now provide the correct definition of a continuous random variable.

Definition 13.3. A random variable X is continuous if the cdf F of X is a continuous function.

Part (4) of the last proposition is not true for the cdf of a discrete random variable but is the
new feature of continuous random variables.
If X is a continuous random variable, X does not necessarily have a pdf but it always has a
cdf. However if X has a pdf, we have

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(y) dy

More importantly,

Theorem 13.4. Suppose that X is a continuous random variable with continuous pdf f. Then the cdf F of X satisfies F′(x) = f(x) for all x.

13.2 Uniform Random Variables

Definition 13.5. A random variable X has a uniform distribution with parameters A < B if the pdf of X is

f(x) = 1/(B − A)    A ≤ x ≤ B

(Convention: a definition of a pdf such as that of f above implies that f (x) = 0 for all x not in
the specified domain.)
R commands dunif, punif, and runif compute the pdf, cdf, and random numbers from a
uniform distribution.

> punif(.3,0,1)
[1] 0.3
> punif(13,10,20)
[1] 0.3
> s=runif(100,0,1)
> hist(s,prob=T)

13.3 Exponential Random Variables

Definition 13.6. Random variable X has an exponential distribution with parameter λ if X has pdf

f(x) = λe^{−λx}    x ≥ 0

Besides the commands dexp, pexp and rexp, the command qexp computes quantiles (or per-
centiles) of the exponential distribution.

Definition 13.7. If p is a real number such that 0 < p < 1, the 100pth -percentile of X is any
number q such that P (X ≤ q) = p. If p = .5, a 100pth -percentile is called a median of X.

For a continuous random variable X, F (q) = p if q is the 100pth -percentile. However this is not
usually the case for discrete random variables.

> x=seq(0,10,.01)
> p=dexp(x,.5)
> plot(x,p)
> pexp(1,.5)
[1] 0.3934693

> s=rexp(100,.5)
> hist(s)
> qexp(.25,.5)
[1] 0.5753641
> qexp(.5,.5)
[1] 1.386294
> qexp(.75,.5)
[1] 2.772589
> qexp(1,.5)
[1] Inf
>

The exponential distribution has an important property that characterizes it and dictates when
it should be used in an application. Suppose that X is an exponential random variable with
parameter λ that is used to model the waiting time in minutes for something to occur. Then
the probability that the waiting time exceeds t minutes is

P(X ≥ t) = ∫_t^{∞} f(x) dx = e^{−λt}

Now suppose that we have an exponential random variable and we have already waited t0 units of time. The probability that we must wait at least t more units of time is

P(X ≥ t + t0 | X ≥ t0) = e^{−λ(t+t0)} / e^{−λt0} = e^{−λt}
In other words, the conditional probability that we must wait at least t more minutes does not
depend on how long we have already waited. This property of the exponential distribution is
referred to as the memoryless property.
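A small R check of the memoryless property, with λ = .5 and the (arbitrary) choices t0 = 2 and t = 1:

> lambda = .5; t0 = 2; t = 1
> (1 - pexp(t + t0, lambda))/(1 - pexp(t0, lambda))   # P(X >= t + t0 | X >= t0)
[1] 0.6065307
> 1 - pexp(t, lambda)                                 # P(X >= t)
[1] 0.6065307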

Homework.

1. Read Section 4.1.

2. Do problems 4.11, 4.12, 4.14, 4.16



14 Expected Values of Continuous Random Variables

Goal: To define the expected value of continuous random variables and introduce
an important theoretical tool - the moment generating function

14.1 Expected Value

Definition 14.1. The expected value of a random variable X with pdf f is

E(X) = µ_X = ∫_{−∞}^{∞} x f(x) dx

Note that it is possible that the expected value of X does not exist due to the divergence of the
improper integral. However if the range of X is a finite interval, the expected value will always
exist.
Example 14.2. The expected value of a uniform random variable with parameters A and B
is (A + B)/2. The expected value of an exponential random variable with parameter λ is 1/λ.

14.2 Functions of Random Variables

Just as in the discrete case, given a function h and a random variable X we can define a new
random variable Y = h(X). To compute E(Y ) by the definition we would have to find the pdf
of Y and then integrate. Techniques for finding the pdf of Y will be discussed in a later section
but it is not always easy to do. However the following theorem saves us.
Theorem 14.3 (The Law of the Lazy Statistician). Suppose that X is a continuous random variable with pdf f and that Y = h(X). Then

E(Y) = E(h(X)) = ∫_{−∞}^{∞} h(x) f(x) dx

Example 14.4. Suppose that a random variable has a uniform distribution with parameters A = 0 and B = 1. Then Y = X² has expected value

µ_Y = ∫_0^1 x² dx = 1/3.
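A numerical check of Example 14.4 in R, once by integrating x²·f(x) as in Theorem 14.3 and once by simulation (the simulated value will only be close to 1/3):

> integrate(function(x) x^2*dunif(x, 0, 1), 0, 1)$value
[1] 0.3333333
> mean(runif(10000)^2)   # approximately 1/3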

Note that for all constants a, b,

E(aX + b) = aE(X) + b

On the basis of this property, we say that E is a linear operator.



14.3 Variance

Definition 14.5. The variance of a continuous random variable with pdf f and mean µ is

V(X) = σ²_X = E[(X − µ)²] = ∫_{−∞}^{∞} (x − µ)² f(x) dx

The standard deviation of X, σ_X, is given by σ_X = √(σ²_X).

Just as in the case of a discrete random variable, we have the following computational formula for the variance:

σ²_X = E(X²) − µ²_X

The variance of a uniform random variable with parameters A and B is (B − A)²/12. The variance of an exponential random variable with parameter λ is 1/λ².

14.4 The Moment Generating Function

Definition 14.6. The moment generating function of a random variable X is the function M
defined by
M(t) = E(e^{tX})
provided this expectation exists for all t in some open interval containing 0.

Note that the definition makes sense for discrete as well as continuous random variables. Note
also that M (0) = 1 but that M (t) may not exist for any t ≠ 0. That is the issue with the
provision at the end of the definition.
Example 14.7. Suppose that X is an exponential random variable with parameter λ. Then

M(t) = ∫_0^{∞} e^{tx} λe^{−λx} dx = λ/(λ − t)

which exists for all t < λ.

Let M_X^{(r)}(t) denote the rth derivative of M. Then

Proposition 14.8. Suppose that X is a random variable and M_X(t) exists. Then

E(X^r) = M_X^{(r)}(0)
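A numerical illustration of Proposition 14.8 for the exponential distribution of Example 14.7: differentiating M(t) = λ/(λ − t) gives M′(0) = 1/λ and M″(0) = 2/λ², and these agree with the moments computed directly. A sketch with λ = 2 (both values should be 0.5):

> lambda = 2
> integrate(function(x) x*dexp(x, lambda), 0, Inf)$value     # E(X) = 1/lambda
[1] 0.5
> integrate(function(x) x^2*dexp(x, lambda), 0, Inf)$value   # E(X^2) = 2/lambda^2
[1] 0.5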

An important property of the moment generating function is the following.



Theorem 14.9. Suppose that X and Y are random variables that have moment generating
functions. Then X and Y are identically distributed (have the same cdf ) if and only if MX (t) =
MY (t) for all t in some open interval containing 0.

Homework.

1. Read Section 4.2.

2. Do problems 4.18,21,25,28,31

15 Normal Distribution

Goal: To define the normal distribution and introduce its properties

15.1 The Normal Distribution

Definition 15.1. A continuous random variable X has the normal distribution with parameters µ and σ (−∞ < µ < ∞ and 0 < σ) if the pdf of X is

f(x; µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}    −∞ < x < ∞

We write X ∼ N(µ, σ) if X is a normal random variable with parameters µ and σ. Annoyingly, some books will use σ² as the parameter so one has to check carefully. R uses σ. Note that the pdf of a normal random variable is unimodal and symmetric about x = µ.
Proposition 15.2.

∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} dx = 1

The parameters µ and σ are aptly named.


Proposition 15.3. If X is a normal random variable with parameters µ and σ, then
µX = µ σX = σ
Proposition 15.4. The moment generating function of a normal random variable with parameters µ and σ is

M_X(t) = e^{µt + σ²t²/2}

15.2 The Standard Normal Distribution

The normal distribution with mean µ = 0 and standard deviation σ = 1 is called the standard
normal distribution. Such a random variable is often named Z and the cdf of Z is often called
Φ. In other words,

Φ(z) = P(Z ≤ z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx
The standard normal distribution occurs so frequently, that a standard terminology has devel-
oped to name certain of its properties. For example
Definition 15.5. If Z is a standard normal random variable and 0 < α ≤ .5, then zα denotes the number such that

P(Z ≥ zα) = α

It is helpful to commit to memory certain important values of zα .

z.05 = 1.645 z.025 = 1.96 z.005 = 2.58

Due to the symmetry of the normal pdf, we can use these values to write probability statements
such as
P (−1.96 < Z < 1.96) = 1 − 2(.025) = .95
That is, approximately 95% of the probability is within 2σ of µ.
We can directly translate results about the standard normal distribution to any normal distri-
bution using the following.

Proposition 15.6. If X is a normal random variable with parameters µ and σ, then

Z = (X − µ)/σ

is a standard normal random variable. In other words

P(X ≤ x) = Φ((x − µ)/σ)

15.3 The Normal Approximation of the Binomial

We will eventually have a whole theory of approximating various distributions with a normal
distribution. One important example is the following

Proposition 15.7. Suppose that X is a binomial random variable with parameters n and p and such that n is large and p is not too close to the extreme values of 0 and 1. Then the distribution of X is approximately normal with µ = np and σ = √(np(1 − p)). In other words

P(X ≤ x) ≈ Φ((x + .5 − np)/√(np(1 − p)))    (15.1)

(The .5 in the numerator is called the “continuity correction” and it attempts to account for the
fact that X is discrete while the standard normal distribution is continuous.)
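A quick R comparison of the two sides of Equation 15.1 for X ∼ Bin(100, .5) and x = 45 (the homework below examines the accuracy more systematically):

> n = 100; p = .5; x = 45
> pbinom(x, n, p)                          # the exact probability P(X <= 45)
> pnorm((x + .5 - n*p)/sqrt(n*p*(1-p)))    # the approximation from Equation 15.1; about 0.184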

15.4 R examples

> x=seq(0,10,.1)
> p=dnorm(x,5,2)
> plot(x,p)
> pnorm(7,5,2)
[1] 0.8413447
> pnorm(1.64,0,1)
[1] 0.9494974
> qnorm(.95,0,1)
[1] 1.644854
> sim=rnorm(1000,0,1)
> hist(sim)

Homework.

1. Read Section 4.3 of Devore and Berk.

2. Do Problems 4.40,42,55,60

3. In this problem we investigate the accuracy of the approximation in Equation 15.1.

(a) Suppose that X ∼ Bin(100, .5). What percentage error does Equation 15.1 make in
approximating P (X ≤ 20)? P (X ≤ 30)? P (X ≤ 40)? P (X ≤ 45)?
(b) Suppose that X ∼ Bin(100, .1). What percentage error does Equation 15.1 make in
approximating P (X ≤ 5)? P (X ≤ 10)? P (X ≤ 15)? P (X ≤ 20)?
(c) Suppose that X ∼ Bin(30, .3). What percentage error does Equation 15.1 make in
approximating P (X ≤ 4)? P (X ≤ 6)? P (X ≤ 8)? P (X ≤ 12)?
Mathematics 343 Class Notes 45

16 The Gamma Function and Distribution

Goal: To define the gamma distribution

16.1 The Gamma Function

Definition 16.1. The gamma function Γ is defined by


Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx,   α > 0.

Proposition 16.2. The gamma function satisfies the following properties

1. Γ(1) = 1
2. For all α > 1, Γ(α) = (α − 1)Γ(α − 1)
3. For all natural numbers n > 1, Γ(n) = (n − 1)!

4. Γ(1/2) = √π.

16.2 The Gamma Distribution

Definition 16.3. A continuous random variable X has the gamma distribution with parameters
α and β (X ∼ Gamma(α, β)) if the pdf of X is
f(x; α, β) = (1/(β^α Γ(α))) x^(α−1) e^(−x/β),   x ≥ 0

It is easy to see that f (x; α, β) is a pdf by using the properties of the gamma function. The
parameter α is usually called a “shape” parameter and β is called a “scale” parameter.
Proposition 16.4. If X ∼ Gamma(α, β), then
µX = αβ      σ²X = αβ²

16.3 Special Cases of the Gamma Distribution

The exponential distribution with parameter λ is a special case of the gamma distribution with
α = 1 and β = 1/λ.
The chi-squared distribution has one parameter ν called the degrees of freedom of the distribu-
tion and is the special case of the gamma distribution with α = ν/2 and β = 2. Note that this
implies that a chi-squared random variable has µ = ν and σ 2 = 2ν.

R knows both the gamma function and the gamma distribution.

> gamma(4)
[1] 6
> gamma(.5)
[1] 1.772454
> x=seq(0.1,5.0,.01)
> y=dgamma(x,shape=2,scale=3)
> y=pgamma(x,shape=2,scale=3)
> y=dgamma(x,shape=1,scale=2)-dexp(x,1/2) # should be zeros
> y=dgamma(x,shape=2,scale=2)-dchisq(x,4) # should be zeros
>

16.4 The Poisson Distribution

The Poisson Distribution is a discrete distribution related to the exponential distribution.

Definition 16.5. A Poisson random variable X with parameter λ is a discrete random variable
with pmf
λx
p(x; λ) = e−λ x = 0, 1, 2, . . .
x!

A Poisson random variable with parameter λ has mean λ and variance λ. The relation to the
exponential distribution is this.

Proposition 16.6. Let P be a process that generates events such that the distribution of elapsed
time between the occurrence of successive events is exponential with parameter λ. Then the
distribution of the number of occurrences in a time interval of length 1 is Poisson with parameter
λ.

We will see a proof of this later in the course.

Homework.

1. Read Section 4.4.

2. Do problems 4.71, 76, 77.


Mathematics 343 Class Notes 47

17 Transformations of Random Variables

Goal: To find the pdf of Y = g(X) given that of X

17.1 The CDF Method

Suppose that X is a continuous random variable and Y = g(X). We will first suppose that g
is one-to-one on the range of X.

Definition 17.1. The range of a random variable X is the set of all x such that X(o) = x for
some outcome o of the experiment.

We first illustrate the cdf method with a simple X and an increasing g.

Example 17.2. Suppose that X is uniform on [0, 1] and Y = X 2 so that g(x) = x2 . (Note
that g is not one-to-one on every interval but it is in this case.) The range of Y is [0, 1] as well.
We have
FY(y) = P(Y ≤ y) = P(X² ≤ y) = P(X ≤ √y) = √y,   0 ≤ y ≤ 1
Therefore we have
fY(y) = F′Y(y) = 1/(2√y),   0 ≤ y ≤ 1

In this next example, the only thing that changes is that g is decreasing on the range of X.

Example 17.3. Suppose that X is uniform on [0, 1] and Y = 1/X so that g(x) = 1/x. The
range of Y is [1, ∞). We have

FY(y) = P(Y ≤ y) = P(1/X ≤ y) = P(X ≥ 1/y) = 1 − 1/y,   1 ≤ y < ∞

Therefore we have
fY(y) = F′Y(y) = 1/y²,   1 ≤ y < ∞

Here is the general result.

Proposition 17.4. Suppose that X is a continuous random variable and Y = g(X). Suppose
further that g is one-to-one and differentiable on the range of X so that g has a differentiable
inverse h. Then
fY (y) = fX (h(y))|h0 (y)|
for all y in the range of Y .

Proof. Suppose that g is an increasing function on the range of X so that h is increasing on


the range of Y . The proof for a decreasing g is similar. Then

FY (y) = P (Y ≤ y) = P (g(X) ≤ y) = P (X ≤ h(y)) = FX (h(y))

Thus
fY(y) = (d/dy) FY(y) = (d/dy) FX(h(y)) = F′X(h(y)) h′(y) = fX(h(y)) h′(y)

17.2 g is not Monotonic

The cdf method still works if g is not monotonic. One simply has to be careful in finding the
appropriate regions on which to evaluate the cdf of X.
Example 17.5. Let Z be a standard normal random variable (i.e., µ = 0 and σ = 1). Recall
that the cdf of Z is denoted Φ and the range of Z is (−∞, ∞). Let Y = Z 2 . To find the pdf of
Y we again use the cdf method.
FY(y) = P(Y ≤ y) = P(Z² ≤ y) = P(−√y ≤ Z ≤ √y) = Φ(√y) − Φ(−√y),   0 ≤ y     (17.2)

Therefore
fY(y) = (1/(2√y)) fZ(√y) + (1/(2√y)) fZ(−√y),   0 ≤ y
By the symmetry of fZ , we thus have
fY(y) = (1/√y) (1/√(2π)) e^(−y/2) = (1/√(2π)) y^(−1/2) e^(−y/2),   0 ≤ y
This density is an example of a gamma distribution. In fact it is a chi-squared distribution with
one degree of freedom.

Notice in the preceding example that equation 17.2 does not depend on the fact that we were
using the standard normal distribution. Whenever the transformation Y = X 2 is used we have
that
FY(y) = FX(√y) − FX(−√y),   y ≥ 0

Homework.

1. Read Section 4.7.

2. Do problems 4.109, 110, 114, 118.


Mathematics 343 Class Notes 49

18 Jointly Distributed Random Variables

Goal: To consider experiments in which several random variables are observed

18.1 Many Random Variables

It is often the case that a random experiment results in several random variables that are of
interest.
Example 18.1. A senior at a fixed but unnamed college is selected and both her GPA and
ACT score are recorded. The result is an ordered pair: (GPA, ACT). We really want to record
this data as an ordered pair — recording the GPA and ACT values separately loses the fact
that these measurements may be related.
Example 18.2. A mediocre golfer hits a drive. For the drive both the distance travelled D
and the deviation from the center line M are recorded.
Example 18.3. A voter is selected for an exit poll and is asked a dozen yes-no questions. The
answers are recorded as Q1 , . . . , Q12 (using a convention such as 1 means yes and 0 means no).

Formally,
Definition 18.4. If n random variables X1 , . . . , Xn are defined on the sample space of an
experiment, the random vector (X1 , . . . , Xn ) is the function that assigns to each outcome o the
vector (X1 (o), . . . , Xn (o)).

Though many of our examples will be random pairs which we will usually denote (X, Y ) it is
not uncommon in “real” applications to consider random vectors that have a large number of
components. One very important special case of this is to repeat a single random experiment
k times for some large k, and think of the k-different values of a random variable X defined
on that experiment as one k-tuple of values. Essentially, we treat the k replications of an
experiment as one replication of a larger, compound experiment.

18.2 Discrete Random Vectors

If each of the k random variables that form a random vector (X1 , . . . , Xk ) are discrete, we call
the random vector a discrete random vector. The natural extension of the pmf to this situation
is given by the following definition (which we give for the case k = 2).
Definition 18.5. Given random variables X and Y defined on the sample space of a random
experiment, the joint probability mass function p of X and Y is defined by
p(x, y) = P (X = x and Y = y)

Obviously, for random variables X1, . . . , Xn, the pmf is a function of n variables, usually written
p(x1 , . . . , xn ).

Example 18.6. Suppose that a random experiment consists of tossing two fair dice. Two
random variables associated with this experiment are

S = the smaller number of the two numbers that appear


L = the larger number of the two numbers that appear

The joint pmf of S and L is given by the following table:

                 L
        1      2      3      4      5      6
    1  1/36   1/18   1/18   1/18   1/18   1/18
    2         1/36   1/18   1/18   1/18   1/18
S   3                1/36   1/18   1/18   1/18
    4                       1/36   1/18   1/18
    5                              1/36   1/18
    6                                     1/36

Notice that if the pmf p for such a random vector is given, we can compute the probability of
any event by adding the appropriate values of p. In particular, we can recover the pmf of each
of the random variables that make up the random vector. For example, if (X, Y ) is a random
pair, we compute pX (x) and pY (y) by
pX(x) = Σ_y p(x, y)      pY(y) = Σ_x p(x, y)

In the case of a random pair (X, Y ) the functions pX (x) and pY (y) are called the marginal
probability mass functions of X and Y respectively.
While pX and pY can always be recovered from the joint pmf p, the converse is not true in
general. One important special case in which the joint pmf is determined by the marginals is
the case that X and Y are independent.

Definition 18.7. Discrete random variables X and Y are independent if for every x and y,

p(x, y) = pX (x)pY (y)

This definition really just says that the events X = x and Y = y are independent. The random
variables S and L of Example 18.6 are obviously not independent.
The definitions extend to discrete random vectors consisting of n random variables. Rather
than give a general definition with an unreasonable amount of notation, we give an example.

Proposition 18.8. Let (X1 , . . . , Xn ) be a random vector with pmf p. The joint pmf of (X1 , X2 )
is the function pX1,X2 defined by

pX1,X2(x1, x2) = Σ_{x3,...,xn} p(x1, x2, x3, . . . , xn)

There is of course a joint pmf for any subvector (Xi1 , . . . , Xik ) of the vector (X1 , . . . , Xn ).

Definition 18.9. The random variables X1 , . . . , Xn are independent if for every subvector
(Xi1 , . . . , Xik ) of the vector (X1 , . . . , Xn ), the joint pmf of (Xi1 , . . . , Xik ) is equal to the product
of the marginal pmfs of the Xij .

18.3 The Multinomial Distribution

The exact generalization of the binomial distribution to an experiment with r > 2 possible
outcomes is the multinomial distribution. Thus the conditions on a multinomial experiment
are:

1. there are n independent trials of a simpler experiment

2. each trial results on one of r possible outcomes

3. the ith outcome happens with probability pi in any trial

Definition 18.10. A random vector (X1 , . . . , Xr ) has the multinomial distribution with pa-
rameters n, p1 , . . . , pr if the joint pmf is

p(x1, . . . , xr) = (n!/(x1! · · · xr!)) p1^(x1) · · · pr^(xr),   x1 + · · · + xr = n

> dmultinom(c(1,2,2,2,1,2),10,prob=c(1/6,1/6,1/6,1/6,1/6,1/6))
[1] 0.003750857
> rmultinom(1,10,prob=c(1/6,1/6,1/6,1/6,1/6,1/6))
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 1
[5,] 2
[6,] 1
>

Homework.

1. Read Devore and Berk pages 230, 231.

2. Do problems 5.3, 5.4, 5.8.


Mathematics 343 Class Notes 53

19 Continuous Random Vectors

Goal: To extend the concept of random vector to continuous random variables

19.1 Joint Probability Density Functions

Suppose that (X, Y ) is a random vector and that both X and Y are continuous random vari-
ables. The notion of pdf extends to this situation in a natural way.
Definition 19.1. Suppose that X and Y are continuous random variables associated with some
experiment. Then a function f (x, y) is a joint probability density function for X and Y if for
every (reasonable) set A in the Cartesian plane
P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy

In the special case that A is the rectangle {(x, y) : a ≤ x ≤ b, c ≤ y ≤ d},

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx

The physical interpretation of probability is the same as in the one variable case. The plane
has mass (probability) 1 and the density function f (x, y) is the mass per unit area at the point
(x, y). A joint density function must satisfy the properties familiar from the one-variable case:

1. For all x, y, we have that f(x, y) ≥ 0,

2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1

The definition of pdf is easily extended to random vectors (X1, . . . , Xn); however, multiple
integration with n-fold integrals quickly becomes difficult.

19.2 Marginal pdf and Independence

Just as in the single variable case, the pdf of each individual random variable is recoverable
from the joint pdf.
Proposition 19.2. Given a random vector (X, Y ) with pdf f , the marginal pdf of X and Y
can be computed as
fX(x) = ∫_{−∞}^{∞} f(x, y) dy      fY(y) = ∫_{−∞}^{∞} f(x, y) dx

And the natural definition of independence is

Definition 19.3. Random variables X and Y with joint pdf f are independent if for every
pair (x, y)
f (x, y) = fX (x)fY (y)

The definitions of marginal pdf and independence extend to n random variables in the obvious
way.

Homework.

1. Read Devore and Berk Section 5.1.

2. Do problems 5.12, 13, 15


Mathematics 343 Class Notes 55

20 Expected Values, the Multivariate Case

Goal: To compute expected values of functions of a random vector

20.1 Computing Expected Values

For ease of notation, we consider random pairs (X, Y ) but we could easily extend to random
n-vectors.
Suppose that Z = h(X, Y ) is a function of the random pair (X, Y ). Since Z is a random
variable, we could compute its expected value if we knew its pmf (pdf). But as in the single
variable case, there is a shortcut. We state the theorem only for the continuous case but the
discrete case is the same but with sums replacing integrals.

Theorem 20.1. Suppose that Z = h(X, Y ) and the pair (X, Y ) has pdf f (x, y). Then
E(Z) = E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy

Example 20.2. Suppose that the space shuttle has a part and a spare, and that the times to
failure X and Y of these parts are exponential with parameter λ. Presumably, the expected value
of X + Y is relevant to determining the expected lifetime of the system on which the part depends.

It is often the case (as in the previous example) that we are dealing with independent X and Y .
Depending on the form of h(X, Y ), the work involved in computing E[h(X, Y )] can be greatly
simplified.

Example 20.3. Suppose that X and Y are independent random variables with uniform dis-
tributions on [0, 1]. What is E(XY )?

The general fact is

Proposition 20.4. Suppose that X and Y are independent and that h(X, Y ) = f (X)g(Y ) for
some functions f, g. Then
E[h(X, Y )] = E[f (X)]E[g(Y )].

It’s important to realize that E(XY ) is not necessarily equal to E(X)E(Y ) in the general case.

Example 20.5. Suppose that (X, Y ) has the joint pdf

f (x, y) = x + y 0 ≤ x, y ≤ 1

Then E(XY) = 1/3 but E(X)E(Y) = (7/12)².



20.2 Covariance and Correlation

Definition 20.6. The covariance between X and Y is

Cov(X, Y ) = E[(X − µX )(Y − µY )]

The covariance between X and Y measures the relationship between X and Y . There is a
shortcut formula for covariance.

Proposition 20.7. Cov(X, Y ) = E(XY ) − µX µY .

Proposition 20.8. For every X, Y and constants a, b,

1. Cov(X, Y ) = Cov(Y, X)

2. Cov(aX, Y ) = a Cov(X, Y )

3. Cov(X + b, Y ) = Cov(X, Y )

4. Cov(X, X) = Var(X).

5. If X and Y are independent then Cov(X, Y ) = 0.

It is important to note that the converse of Proposition 20.8(5) is not true.

Example 20.9. There are random variables X and Y such that Cov(X, Y ) = 0 but X and
Y are not independent. There is a straightforward discrete example in the text. For a contin-
uous example let X be uniform on [−1, 1] and let Y = |X|. It is obvious that Y and X are
not independent – indeed the value of Y is completely determined from that of X. However
Cov(X, Y ) = Cov(X, |X|) = E(X|X|) − E(X)E(|X|) = 0 − 0(1/2) = 0.

Covariance has units – it is convenient to make covariance a dimensionless quantity by normalizing.

Definition 20.10. The correlation between X and Y is given by

Corr(X, Y) = Cov(X, Y)/(σX σY)

Corr(X, Y ) is also denoted ρX,Y or simply ρ if X and Y are understood.

Proposition 20.11. For all random variables X and Y

1. −1 ≤ Corr(X, Y ) ≤ 1

2. Corr(aX + b, X) = 1 if a > 0 and Corr(aX + b, X) = −1 if a < 0.

3. If | Corr(X, Y )| = 1 then Y = aX + b for all but a set of x-values of probability 0.

Proof. To prove (1), define random variables V and W by

W = X − µX V = Y − µY

Now for any real number t

E[(tW + V )2 ] = t2 E(W 2 ) + 2tE(V W ) + E(V 2 ) (20.3)

Since E[(tW + V)²] ≥ 0 for every t, the right-hand side of 20.3, which is a quadratic
polynomial in t, must be nonnegative for every t. This in turn means that

4E(V W )2 − 4E(W 2 )E(V 2 ) ≤ 0

Now the result follows since E(V W ) = Cov(X, Y ), E(W 2 ) = Var(X), and E(V 2 ) = Var(Y ).
To prove (3), note that | Corr(X, Y)| = 1 only if the quadratic polynomial in 20.3 has the value 0
for some t0 . For this t0 , we have E[(t0 W + V )2 ] = 0. This must mean that t0 W + V is the zero
random variable, at least on a set of probability 1. In this case we have V = −t0 W which is
equivalent to (3).

From the above proof, we can say more. If | Corr(X, Y )| = 1 then, letting ρ = Corr(X, Y ),

(Y − µY)/σY = ρ (X − µX)/σX
Example 20.12. Suppose that (X1, . . . , Xr) has a multinomial distribution with parameters
n, p1, . . . , pr. Then we have

Xi ∼ Bin(n, pi)      E(Xi) = npi      Var(Xi) = npi(1 − pi)      Cov(Xi, Xj) = −npipj  (i ≠ j)

Homework.

1. Read Devore and Berk, Section 5.2.

2. Do problems 5.18, 5.23, 5.24, 5.30 of Devore and Berk.


Mathematics 343 Class Notes 58

21 Conditional Distributions

Goal: To define conditional distributions

21.1 Conditional Mass and Density Functions

Definition 21.1. Given a random pair (X, Y ), the random variable Y |x is defined by the pmf
(pdf)
pY|x(y) = p(x, y)/pX(x)      fY|x(y) = f(x, y)/fX(x)
The random variable X|y is defined similarly. Often pY |x (y) is written pY |x (y|x).
Example 21.2. Recall the experiment in example 18.6 in which two dice are tossed and the
smaller and larger numbers are reported as S and L. Then
pS|2(1) = p(1, 2)/pL(2) = (2/36)/(3/36) = 2/3
Example 21.3. Let (X, Y ) be defined by the joint pdf
f (x, y) = x + y 0 ≤ x, y ≤ 1
This joint distribution was considered in Example 20.5. Then
fY|x(y) = f(x, y)/fX(x) = (x + y)/(x + 1/2)

Since Y |x is a random variable, we can compute E(Y |x) and Var(Y |x) in the usual way. In
general, of course, E(Y |x) depends on x.
Note also that if X and Y are independent, then fY |x (y) = fY (y) as we would expect from the
informal meaning of independence.

21.2 Using Conditional Probabilities to Generate Joint Distributions

Example 21.4. Suppose that two random numbers X and Y are generated as follows. X has
a uniform distribution on [0, 1]. Given X = x, Y is chosen to have a uniform distribution on
[x, 1]. In other words Y ≥ X but no particular values of Y > X are “favored.” Then
f(x, y) = fX(x) fY|x(y) = 1 · (1/(1 − x)) = 1/(1 − x),   0 ≤ x ≤ y ≤ 1
We have, for example,
E(Y) = ∫_0^1 ∫_0^y y f(x, y) dx dy = ∫_0^1 ∫_0^y (y/(1 − x)) dx dy = 3/4

21.3 The Bivariate Normal Distribution

The most important joint distribution is the bivariate normal distribution. It is a five parameter
distribution: µX , µY , σX , σY , and ρ. The density of this joint distribution is the amazing
expression

f(x, y) = (1/(2π σX σY √(1 − ρ²))) Exp[ −(1/(2(1 − ρ²))) ( ((x − µX)/σX)² − 2ρ (x − µX)(y − µY)/(σX σY) + ((y − µY)/σY)² ) ]

which has domain −∞ < x, y < ∞.


It is characterized by the following properties:
Proposition 21.5. If the random pair (X, Y) has the bivariate normal distribution, then

1. X has a normal distribution with mean µX and variance σ²X.
2. Y has a normal distribution with mean µY and variance σ²Y.
3. the correlation of X and Y is ρ.
4. the conditional distribution of Y|x is a normal distribution with

   µY|x = µY + ρσY (x − µX)/σX      σ²Y|x = σ²Y(1 − ρ²)

Conversely, any random pair with these five properties has the bivariate normal distribution.

R makes the multivariate normal distribution available via a package that must be installed
and loaded: mvtnorm. The functions dmvnorm, pmvnorm and rmvnorm return what you would
expect. Since we haven’t defined the cdf of a random pair, we have that definition here:
Definition 21.6. If (X, Y ) is a random pair, the cdf of (X, Y ) is the function
F (x, y) = P (X ≤ x and Y ≤ y)

The function pmvnorm computes the cdf of the multivariate normal distribution. One needs to
supply these functions with a vector of means (µX, µY) and the covariance matrix

   [ Var(X)       Cov(X, Y) ]
   [ Cov(X, Y)    Var(Y)    ]
Recall that Cov(X, Y ) = ρσX σY . Suppose that SAT Math and Verbal scores have means of
500, standard deviations of 100 and a correlation of 0.8. Then the following computes the
probability that both math and verbal scores of a randomly chosen individual are below 500.

> library(mvtnorm)     # the package must be loaded first
> m=c(500,500)
> cv=matrix(c(10000,8000,8000,10000),nrow=2,ncol=2)
> pmvnorm(upper=c(500,500),sigma=cv,mean=m)
[1] 0.3975836
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"

Homework.

1. Read Devore and Berk, Section 5.3.

2. Do problems 5.36, 5.45, 5.46, 5.53, 5.57


Mathematics 343 Class Notes 61

22 Statistics

Goal: To define statistics and study them by simulation

22.1 Random Samples

Definition 22.1. The random variables X1 , . . . , Xn are a random sample if

1. The Xi ’s are independent,


2. The Xi ’s have identical distributions.

We also say that the Xi ’s are i.i.d. (independent and identically distributed).

The R commands rnorm, runif, rbinom, etc. are intended to simulate a random sample.
Not all data can be considered to have arisen from a random sample. There are often depen-
dencies to worry about. For example, sampling from a finite population almost never produces
independent random variables. However we will often try to construct experiments so that the
data can be analyzed as having come from a random sample.
Note that in any experiment, x1 , . . . , xn denote the result of the random variables X1 , . . . , Xn
for this particular trial of the experiment. Uppercase letters are random variables and the
corresponding lowercase letter is a number.

22.2 Statistics

To analyze data, we often compute certain summary values. For example


Definition 22.2. Given x1, . . . , xn, the (sample) mean of these numbers, x̄, is

x̄ = (x1 + · · · + xn)/n

The process of producing the sample mean is the result of applying a function to the values of
the random variables X1 , . . . , Xn . The sample mean could be considered the result of a random
variable.
Definition 22.3. Given a random sample X1 , . . . , Xn any function Y = h(X1 , . . . , Xn ) is called
a statistic.
Definition 22.4. Given a random sample X1 , . . . , Xn , the sample mean is the random variable
X̄ = (X1 + · · · + Xn)/n

We sometimes write X̄n to signify the random variable that is the sample mean of a sample of
size n.

22.3 The Sampling Distribution of the Statistic

If X1 , . . . , Xn is a random sample and Y = g(X1 , . . . , Xn ) is a statistic, then Y has a distribution


that depends on that of the Xi ’s (which all have the same distribution). Obviously, it might
not be easy to figure out what the pdf of Y is, even if we know the pdf of the Xi ’s.

22.4 Simulation

Simulations sometimes help to understand and estimate the sampling distribution of Y . The
following R code simulates samples of size 100 from an exponential distribution with λ = 1.
The experiment of generating random samples of size 100 is replicated 1,000 times. The sample
mean is computed for each replicate. Therefore we get 1,000 different values for the sample
mean. R accomplishes the simulation in the following steps. First, 100,000 random numbers are
generated from the exponential distribution with parameter λ = 1. Presumably, these numbers
act like they are being generated independently. These 100,000 numbers comprise the vector x.
The vector y contains 100 copies of the numbers from 1 to 1,000. The idea is that the numbers
in y are being used to label the numbers in x with 100 numbers in x to receive each of the
numbers from 1 to 1,000. Numbers in x with the same label from y are treated as if they are
in the same sample of size 100. The tapply command takes three arguments. The first two
are vectors of the same length, in this case x and y. The third is an R function. The command
applies the function to each group of elements in x with the groups determined by the labels in
y. Therefore, in the example below xbar consists of 1,000 sample means of 100 numbers from
an exponential distribution with λ = 1.

> y=rep(1:1000,100)
> x=rexp(100000,1)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.181e-05 2.891e-01 6.968e-01 1.003e+00 1.395e+00 1.050e+01
> xbar=tapply(x,y,mean)
> length(xbar)
[1] 1000
> summary(xbar)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.7259 0.9326 1.0020 1.0030 1.0670 1.3740
>

Homework.

1. Read Section 6.1 of Devore and Berk.

2. Using the R example in the notes as a model, study the sampling distribution of the
sample mean X̄ of random samples from a gamma distribution with parameters α = 2 and
β = 1. Do four different simulations with the sample size being simulated being n = 4,
n = 10, n = 30 and n = 100. Do 500 replications of each sample.

(a) Comment on the differences in shape of the sampling distribution as n varies. In
other words, as n increases how does the shape of the sampling distribution change?
(b) Comment on the differences in spread of the sampling distribution as n varies.
Mathematics 343 Class Notes 64

23 Distributions of Sums of Random Variables

Goal: To develop methods for determining the distribution of a sum of random


variables

23.1 Linear Combinations of Random Variables

Theorem 23.1. If Y = a1 X1 + · · · + an Xn for some constants a1 , . . . , an , then

E(Y ) = a1 E(X1 ) + · · · + an E(Xn )

Corollary 23.2. If each of X1, . . . , Xn has mean µ, then E(X̄) = µ.

Notes:

1. A random variable Y of the form Y = a1X1 + · · · + anXn is called a linear combination of the


random variables X1 , . . . , Xn .

2. Note that Theorem 23.1 does not require independence of the random variables X1 , . . . , Xn .
Similarly Corollary 23.2 does not require that the random variables are identically dis-
tributed but only that they have the same mean.

The situation for the variance is more complicated.


Theorem 23.3. If Y = a1 X1 + · · · + an Xn for some constants a1 , . . . , an , then
Var(Y) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj)

Corollary 23.4. For any two random variables X1 and X2 ,

Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) + 2 Cov(X1 , X2 )


Var(X1 − X2 ) = Var(X1 ) + Var(X2 ) − 2 Cov(X1 , X2 )

Corollary 23.5. If Y = a1X1 + · · · + anXn for some constants a1, . . . , an, and the random
variables X1, . . . , Xn are independent, then

Var(Y) = Σ_{i=1}^{n} ai² Var(Xi)

The following theorem is a direct corollary of the Corollaries 23.2 and 23.5.

Theorem 23.6. Suppose that random variables X1 , . . . , Xn are a random sample each with
mean µ and variance σ 2 . Then

E(X̄) = µ      Var(X̄) = σ²/n

23.2 The Distribution of a Linear Combination - The mgf Method

Theorem 23.7. Suppose that X1 , . . . , Xn are independent random variables such that the mo-
ment generating function MXi(t) of each rv Xi exists. Let Y = a1X1 + · · · + anXn. Then

MY (t) = MX1 (a1 t)MX2 (a2 t) · · · MXn (an t)

Example 23.8. Suppose that X1 and X2 are independent exponential random variables with
λ = 1. Then Y = X1 + X2 is a gamma random variable with α = 2 and β = 1.

Corollary 23.9. Suppose that the random variables X1 , . . . , Xn are independent and that each
is normally distributed. Then Y = a1 X1 + · · · + an Xn is normally distributed.

Homework.

1. Read Devore and Berk, Section 6.3.

2. Do problems 6.28, 6.36 of Devore and Berk.

3. Suppose that X1, . . . , Xn are independent random variables such that each Xi has a gamma
distribution with parameters α and β. What is the distribution of Y = X̄? (Hint: use
the moment generating function method.) From this, verify that Theorem 23.6 holds in
this special case.

4. Suppose that X1 , . . . , Xn are independent chi-squared random variables such that Xi has
νi “degrees of freedom.” What is the distribution of Y = X1 + · · · + Xn ?
Mathematics 343 Class Notes 66

24 The Central Limit Theorem

Goal: To develop further the properties of the distribution of the sample mean

24.1 The Distribution of the Sample Mean

Summarizing the preceding section we have

Theorem 24.1. Suppose that X1 , . . . , Xn are iid with mean µ and variance σ 2 . Then

E(X̄) = µ      Var(X̄) = σ²/n
Theorem 24.2. Suppose that X1 , . . . , Xn are independent and normally distributed with mean
µ and standard deviation σ. Then X̄ is normally distributed with mean µ and standard deviation
σ/√n.

24.2 The Central Limit Theorem

If the underlying distribution is not normal, we often do not know the distribution of the sample
mean. The following theorem is an important remedy.

Theorem 24.3 (The Central Limit Theorem). Suppose that X1 , . . . , Xn are iid with mean µ
and variance σ 2 . Then
 
lim_{n→∞} P( (X̄ − µ)/(σ/√n) ≤ z ) = P(Z ≤ z) = Φ(z)

Corollary 24.4. Suppose that X has a binomial distribution with parameters n and p. Then
if n is large, X is approximately normally distributed with mean np and standard deviation
√(np(1 − p)).

Homework.

1. Read Section 6.2. pages 291–296.

2. No problems.
Mathematics 343 Class Notes 67

25 Investigating the CLT

Goal: To investigate the accuracy of approximating the distribution of the sample


mean by the normal distribution

25.1 Corollaries and Variants of the CLT

Theorem 25.1 (The Central Limit Theorem). Suppose that X1 , . . . , Xn are iid with mean µ
and variance σ 2 . Then
 
lim_{n→∞} P( (X̄ − µ)/(σ/√n) ≤ z ) = P(Z ≤ z) = Φ(z)

Corollary 25.2 (Informal Statement). For n large, X̄n has a distribution that is approximately
normal with mean µ and variance σ 2 /n. Additionally, T = X1 + · · · + Xn has a distribution
that is approximately normal with mean nµ and variance nσ 2 .

Corollary 25.3. Suppose that X has a binomial distribution with parameters n and p. Then
if n is large, X is approximately normally distributed with mean np and standard deviation
√(np(1 − p)).

25.2 Using R To Investigate CLT

In the following example we superimpose the density of the appropriate normal distribution
over the histogram of random variables that should be approximately normal by the CLT.

Example 25.4. The Uniform Distribution. If X1, . . . , Xn is a random sample from a distribution
that is uniform on [0, 1], then µ = 1/2 and σ² = 1/12. The function superhist
produces the histogram of sample means and superimposes the approximating normal distri-
bution as predicted by the CLT. The arguments to superhist include

sm    a vector of sample means
n     sample size
mean  mean of population random variable
sd    standard deviation of population random variable

> superhist
function (sm,mean,sd,n){
hist(sm,prob=T);
normcurve(mean,sd,n)}
> normcurve
function (m,s,n) {
x= seq(m-3*s/sqrt(n),m+3*s/sqrt(n),length=100);
y=dnorm(x,m,s/sqrt(n));
return(lines(x,y))
}
> superhist(replicate(1000,mean(runif(2))),1/2,1/sqrt(12),2)
> superhist(replicate(1000,mean(runif(10))),1/2,1/sqrt(12),10)
> superhist(replicate(1000,mean(rexp(2))),1,1,2)
> superhist(replicate(1000,mean(rexp(10))),1,1,10)

The histograms below are the result of simulating sample sizes of 2 and 10 respectively from a
uniform distribution on [0, 1].

[Figure: two histograms of the simulated sample means sm (Density on the vertical axis) with the CLT normal curve superimposed.]

The uniform distribution is symmetric. The exponential distribution is highly skewed. The
result of simulating sample sizes of 2 and 10 respectively from an exponential distribution with
λ = 1 is shown in the next pair of histograms.
[Figure: two histograms of the simulated sample means sm for the exponential case (Density on the vertical axis) with the CLT normal curve superimposed.]

25.3 Numerical Comparisons

The function normquant compares the simulated distribution of the sample mean to what is
predicted by the normal cdf by computing the number of simulated values in the intervals
(−∞, −3), (−3, −2), (−2, −1), (−1, 0), (0, 1), (1, 2), (2, 3), (3, ∞) (in standardized units).

> normquant
function (sm,mean,sd,n){
devs=c(); reps=length(sm);
pct=0;normpct=0;
for (i in -3:3){ br= mean+i*sd/sqrt(n);
counts=length(sm[sm<br]);
devs=c(devs,counts/reps-pct-pnorm(i,0,1)+normpct);
normpct=pnorm(i,0,1);
pct=counts/reps};
devs=c(devs,(1-pct)-(1-normpct));
return(devs)}
> normquant(replicate(1000,mean(runif(2))),1/2,1/sqrt(12),2)
[1] -0.001349898 -0.002400234  0.001094878 -0.011344746  0.002655254
[6]  0.019094878 -0.006400234 -0.001349898
> normquant(replicate(1000,mean(runif(10))),1/2,1/sqrt(12),10)
[1] -0.000349898 0.007599766 -0.018905122 0.019655254 -0.008344746
[6] 0.007094878 -0.005400234 -0.001349898
> normquant(replicate(1000,mean(rexp(2))),1,1,2)
[1] -0.001349898 -0.021400234 -0.014905122 0.129655254 -0.100344746
[6] -0.015905122 0.006599766 0.017650102

Homework.

1. Investigate the accuracy of the approximation of the CLT for the gamma distribution.
Specifically, consider the gamma distribution with parameters α = 5 and β = 2.

(a) From Exercise 3 of Section 23, what is the distribution of the sample mean of a
sample of size n from this distribution? (No new question here.) What is the mean
and the standard deviation of X̄n?
(b) Fix n = 2. Using the distribution in part (a), find the probability that the sample
mean X is in each of the intervals (−∞, −3), (−3, −2), (−2, −1), (−1, 0), (0, 1), (1, 2),
(2, 3), (3, ∞) where the intervals are expressed in terms of the standardization of X.
Compare these probabilities with those predicted by the Central Limit Theorem.
(c) Repeat the previous part for n = 10.
Mathematics 343 Class Notes 71

26 “Proof ” of the Central Limit Theorem

Goal: To prove the Central Limit Theorem and also to introduce a graphical tech-
nique for testing normality

26.1 Proof of the CLT

Theorem 26.1 (The Central Limit Theorem). Suppose that X1 , . . . , Xn are iid with mean µ
and variance σ 2 . Then
 
lim_{n→∞} P( (X̄ − µ)/(σ/√n) ≤ z ) = P(Z ≤ z) = Φ(z)

Proof. Appendix, Chapter 5. Note that the proof relies on some facts about moment generating
functions that are beyond the scope of the course. But it is a good example of the moment
generating function method of determining the distribution of a function of random variables.

26.2 Normal Probability Plots and qqnorm

The book's version of normal probability plots varies slightly from the version implemented
in R (qqnorm) but the differences are small and the idea is the same. A normal probability
plot is a way of plotting a vector of data so that if the data comes from a distribution that is
normal, the plot should be (almost) a straight line.
Definition 26.2. If x1 , . . . , xn is a sequence of n (not necessarily distinct) numbers, the ith
smallest such is denoted x(i) . (In other words, x(1) ≤ x(2) ≤ · · · ≤ x(n) .)

The next definition has some variants in different books. This one is most common (and is
reasonable).
Definition 26.3. If x1, . . . , xn is a sequence of n numbers, then x(i) is called the 100(i − .5)/nth
sample percentile.

Earlier, we defined the 100pth percentile of a distribution. Namely, given a continuous random
variable X and p such that 0 ≤ p ≤ 1, the 100pth percentile of X is the unique number qp such
that P (X ≤ qp ) = p.
Definition 26.4. A normal probability plot for a sequence x1 , . . . , xn of numbers is the plot
of the pairs
(x(i) , q(i−.5)/n )
where qp is the 100pth percentile of the standard normal distribution.

If the sequence of numbers x1 , . . . , xn arises from a random sample from a normal distribution,
then its normal probability plot should be (approximately) a straight line. In R, the function
qqnorm produces a normal probability plot. If n > 10, the plot it produces is exactly the plot
described in Definition 26.4. If n < 10, the quantiles of the normal distribution are modified
somewhat for technical reasons that we will consider later.

> qqnorm(runif(100),main="Uniform")
> qqnorm(rexp(100),main="Exponential")
> qqnorm(rnorm(100),main="Normal")

[Figure: three normal probability plots produced by qqnorm, titled Uniform, Exponential, and Normal, each plotting Sample Quantiles against Theoretical Quantiles.]

Homework.

1. Read the Appendix to Chapter 5 in Devore and Berk and also Section 4.6, pages 206–212.

2. No further problems are assigned.


Mathematics 343 Class Notes 73

27 Estimation of Parameters

Goal: To introduce the concept of estimation of a parameter

27.1 Parameter of a Population

Recall the notions of population and parameter from Section 1.3.


Definition 27.1. A population is a well-defined collection of individuals.
Definition 27.2. A parameter is a numerical characteristic of a population.

The framework for the next several sections is this: while the population is well-defined, the
parameter is unknown. We will collect data (a sample of the population) to estimate the
parameter. To cast this as a problem in statistics, we will assume that sampling from the
population is represented by a random variable X called the population random variable.
Definition 27.3. A random sample from a population X is an iid sequence of random variables
X1 , . . . , Xn all of which have the same distribution as X.
Problem 27.4 (Parameter Estimation). Given a random variable X and a parameter θ as-
sociated with X the value of which is unknown, compute an estimate of θ from the result of a
random sample X1 , . . . , Xn from X.

27.2 Example I - Estimating p in the Binomial Distribution

Example 27.5. The manufacturing process of a part results in a certain probability of a


defective part. Estimate that rate.
Example 27.6. Not every seed in a bag of seeds will germinate. Estimate the germination
percentage.
Example 27.7. Some free-throw shooters are better than others. Estimate the probability
that a given free-throw shooter will make a free-throw.

Each of these examples is a (theoretical) population characterized by a parameter p and the


value of p is unknown. (Insert diatribe about Greek or roman letters here.) The relevant
random variable is the one that results from the experiment of testing a part (planting a seed,
shooting a free-throw) and recording whether the result of an experiment is a success or not.
The random sample is to repeat the experiment n times independently and record the results.
The obvious candidate for an estimator is to compute X/n where X is the number of suc-
cesses, since we know the distribution of X is binomial with parameters n (known) and p
(unknown).

Definition 27.8. An estimator of θ is a statistic θ̂ used to estimate θ. The number that results
from evaluating the estimator on the result of the experiment is called the estimate of the
parameter.

Summary: If X is a binomial random variable, p̂ = X/n is an estimator of p.

Example 27.9. (Example 27.5 continued) We test n = 100 parts. Record the number of
defectives X. Then X/100 is an estimator of the defective rate. If we observe 5 defectives, .05
is the estimate of the defective rate.

27.3 Competing Estimators

Instead of X/n, Laplace proposed (X + 1)/(n + 2). Wilson (and Devore and Berk) propose
(X + 2)/(n + 4).
Which should we choose?
Principle I - Lack of Bias
Note that if X ∼ Bin(n, p),

E(X/n) = p      E[(X + 1)/(n + 2)] = (np + 1)/(n + 2)      E[(X + 2)/(n + 4)] = (np + 2)/(n + 4)

Note that the Laplace and Wilson estimators are “biased” in the sense that they tend to
overestimate small p and underestimate large p. The estimator X/n has no such bias.

Definition 27.10. An estimator θ̂ of a parameter θ is unbiased if E(θ̂) = θ. The bias of an


estimator θ̂ is E(θ̂) − θ.

(Note that we don’t generally know the bias since we don’t know θ.)
Principle II – Close to θ on average

Definition 27.11. The mean square error of an estimator θ̂ is

MSE(θ̂) = E[(θ̂ − θ)2 ]

(Again, we do not generally know MSE(θ̂) since it depends on the unknown θ.)

Proposition 27.12. MSE(θ̂) = Var(θ̂) + Bias(θ̂)2

Let p̂, p̂L , and p̂W be the three estimators for p above (L for Laplace and W for Wilson).
Estimator    Bias                        Variance

p̂            0                           p(1 − p)/n

p̂L           (1/n − 2p/n)/(1 + 2/n)      p(1 − p)/(n + 4 + 4/n)

p̂W           (2/n − 4p/n)/(1 + 4/n)      p(1 − p)/(n + 8 + 16/n)

A plot of the mean square error of each of these three estimators for n = 10 and n = 30 shows
that which estimator has least mean square error depends on the true (unknown) value of p
and the (known) value of n.

[Figure: MSE of the three estimators plotted against p, for n = 10 and n = 30.]

Homework.

1. Read Devore and Berk, pages 326–331.

2. Do problems 7.7,8,14
Mathematics 343 Class Notes 76

28 Estimating µ

Goal: To estimate µ from a random sample

Proposition 28.1. If X is a population random variable and X1, . . . , Xn is a random sample
from X, then µ̂ = X̄ is an unbiased estimator of µX.

Note that we also know the variance of the estimator X̄ even if we do not know the distribution
of X, provided X itself has a variance. Namely
Var(X̄) = σ²X / n

Note also that Var(X̄) = MSE(X̄).
In the special case that X has a normal distribution, we have something more.

Proposition 28.2. If X is normally distributed, then X̄ is an unbiased estimator of µ and
among all unbiased estimators it has the least variance. (We call X̄ the MVUE – Minimum
Variance Unbiased Estimator.)

Important Notes:

1. If X is not normal, X̄ may not be the MVUE.

2. An estimator that is MVUE need not have minimum MSE.

Homework.

1. Read Devore and Berk, pages 326–335.

2. Do problems 7.15,16,19.

3. Extra Credit: Problem 7.20 looks interesting although it took me quite a while to figure
out what was interesting in my answer to 7.20a. Try it if you like.
Mathematics 343 Class Notes 77

29 Estimating σ 2

Goal: To find an unbiased estimator for σ 2

29.1 S²

Setting: X1 , . . . , Xn is a random sample from a distribution that has mean µ and variance σ 2 .
Goal: Estimate σ 2 .
The natural estimator is Σ_{i=1}^{n} (Xi − µ)²/n, which is obviously an unbiased estimator. But µ is
usually not known so this is not helpful.

Definition 29.1. The sample variance is the statistic


S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)

Proposition 29.2. If X1 , . . . , Xn is a random sample from a distribution with variance σ 2 ,


then S 2 is an unbiased estimator of σ 2 .

Note that the Proposition implies that using n in the denominator gives a biased estimator (one
that on average underestimates σ²). Some insight as to why comes from

Proposition 29.3. Let x1, . . . , xn be real numbers with mean x̄. Then x̄ is the unique number
c that minimizes

f(c) = Σ_{i=1}^{n} (xi − c)²

In other words, by approximating µ by X̄, we make the expression for computing variance
smaller.

29.2 The Normal Case

In Example 17.5, we showed that if Z is a standard normal random variable then Z 2 has a
chi-squared distribution with one degree of freedom. Also in Exercise 4 of Section 23, we found
that the sum of n independent chi-squared random variables with degrees of freedom ν1, . . . , νn
respectively is also chi-squared with degrees of freedom ν1 + · · · + νn. The following is
immediate.

Proposition 29.4. Suppose that X1, . . . , Xn is a random sample from a normal distribution
with mean µ and variance σ². Then

Σ_{i=1}^{n} ((Xi − µ)/σ)²

has a chi-squared distribution with n degrees of freedom and

((X̄ − µ)/(σ/√n))²

has a chi-squared distribution with 1 degree of freedom.

Lemma 29.5.

Σ_{i=1}^{n} ((Xi − X̄)/σ)² = Σ_{i=1}^{n} ((Xi − µ)/σ)² − ((X̄ − µ)/(σ/√n))²

Proposition 29.6. If X1, . . . , Xn is a random sample from a normal distribution then S² and
X̄ are independent random variables.

Corollary 29.7. If X1 , . . . , Xn is a random sample from a normal distribution, then the


random variable (n − 1)S 2 /σ 2 has a chi-squared distribution with n − 1 degrees of freedom.
Thus S 2 has mean σ 2 and variance 2σ 4 /(n − 1).

Important Notes:

1. Proposition 29.6 is true of normal distributions but is not true in general.

2. Though S 2 is always an unbiased estimator of σ 2 , S is never an unbiased estimator of σ.

Homework.

1. Read Devore and Berk pages 309–312.

2. Do problem 7.10 of Devore and Berk.


Mathematics 343 Class Notes 79

30 Summary of Properties of Estimators

Goal: To review the main properties of estimators.

30.1 Summary

1. X is a random variable and θ is a parameter (unknown) of X

2. A random sample X1 , . . . , Xn is generated from X

3. A statistic θ̂ = θ̂(X1 , . . . , Xn ) is computed

4. We would like θ̂ to have small MSE(θ̂) = E[(θ̂ − θ)2 ]

5. θ̂ is unbiased if E(θ̂) = θ. Unbiasedness is a good thing.

6. MSE(θ̂) = Bias2 (θ̂) + Var(θ̂) so small variance is a good thing.


7. For any X, X̄ is an unbiased estimator of µX with variance σ²X/n.

8. For any X, S² is an unbiased estimator of σ².

9. For normal X, S² and X̄ are independent and (n − 1)S²/σ² is chi-squared with (n − 1) degrees
of freedom.

10. It’s often much easier to find an unbiased estimator than to find the estimator with least
MSE. Furthermore, there may not be an estimator with least MSE over the full range of
parameter values.

The Standard Error of the Estimator

Definition 30.1. If θ̂ is an estimator for θ, the standard error of θ̂ is


σθ̂ = √(Var(θ̂))

The standard error of θ̂ is a measure of how precise the estimator is. Unfortunately, we usually
don’t know it since it depends on unknown parameters. If we can estimate σθ̂ , we write sθ̂ for
this estimate.

Example 30.2. If X ∼ Bin(n, p) with n known, we have that p̂ = X/n is an unbiased estimator of p
with Var(p̂) = p(1 − p)/n. Thus σp̂ = √(p(1 − p)/n). Since p̂ is an estimator of p, it is reasonable
to use sp̂ = √(p̂(1 − p̂)/n).

Homework.

1. Read pages 338-339 of Devore and Berk.

2. Do problems 7.11, 7.13 of Devore and Berk.

Information concerning Test 2

The test is Monday, November 6. The test is in-class and closed-book. Calculators are allowed
and our favorite distribution chart will be available. The test covers Sections 16–26 of the
notes. This includes the following sections of the book: 4.4,4.7,5.1,5.2,5.3,6.1,6.2,6.3 (and a few
miscellaneous other pages).
The test has five problems. They are approximately as follows.

1. If X has such and such a distribution and Y = g(X), what is the pdf of Y ? (see 4.7)

2. If X has such and such a distribution and Y has a distribution that is conditional on X,
tell me a lot about Y . (This is similar to 5.45 or the problem gone over in class on the
day this one was handed back.)

3. A question about the CLT. You must be able to state it accurately including defining all
notation used in its statement. You should know exactly what it is about and how it is
applied.

4. A certain joint density is given where X and Y are both discrete. Compute miscellaneous
things concerning it.

5. Some linear combination of random variables is described. You might have to compute
some or all of the mean, variance, and moment generating function of said random vari-
ables.
Mathematics 343 Class Notes 81

31 Method of Moments

Goal: To estimate parameters using the method of moments

31.1 Moments

Definition 31.1. Suppose that X1 , . . . , Xn are a random sample from a population random
variable X.

1. The kth population moment is µ′k = E(X^k).

2. The kth central population moment is µk = E[(X − µ)^k] for k ≥ 2.

3. The kth sample moment is Mk = (1/n) Σ_{i=1}^{n} Xi^k.

Obviously, E(Mk) = µ′k. Also note that µ′1 = µX, µ′2 = σ²X + µ²X, and M1 = X̄.

31.2 Method of Moments - One parameter

Example 31.2. Suppose that X is exponential with parameter λ. Then µ = 1/λ. So E(X) =
1/λ. Therefore to estimate λ we might use 1/X. (Of course this estimator is not unbiased.)

In general, the method of moments to estimate a parameter θ works like this. Suppose µ is
some known function of θ. Then to estimate θ we use

X̄ = µ

and solve this equation for θ in terms of X̄.


Example 31.3. Suppose that X1 , . . . , Xn is a random sample from a distribution that has pdf

f (x; θ) = θ(x − 1/2) + 1 0≤x≤1 −2≤θ ≤2

One can show that µX = 1/2 + θ/12. Therefore the method of moments estimator comes from
solving
X̄ = 1/2 + θ̂/12

So

θ̂ = 12X̄ − 6
Note something peculiar about this estimator. It is certainly possible that θ̂ gives a value of θ
that is impossible.

31.3 Two parameters

Example 31.4. The gamma distribution has two parameters α and β. Suppose that we wish
to estimate both of these given a sample.

> alpha=10; beta=2    # true parameter values (identified below)
> x=rgamma(30,shape=alpha,scale=beta)
> x
[1] 24.62216 29.25221 35.77913 39.77655 27.73842 20.47076 68.53810 29.17282
[9] 32.55297 23.14998 29.32863 22.15848 47.13237 35.77347 25.08374 25.61091
[17] 28.38134 20.18289 47.29603 22.40046 32.42016 58.93412 19.39311 38.03705
[25] 26.45068 39.11045 21.36603 17.03293 26.85245 35.56088
> hist(x,prob=T)

We have µ = αβ and σ² = αβ² so µ′2 = σ² + µ² = αβ² + α²β². So we have the two equations

M1 = α̂β̂
M2 = α̂β̂² + α̂²β̂²

In our particular example we have

> m1=mean(x)
> m2=mean(x^2)

> betahat=(m2-m1*m1)/m1
> alphahat=m1/betahat
> alphahat
[1] 7.51535
> betahat
[1] 4.211644

The plot of the true density (α = 10 and β = 2) is shown on the next page with the density
computed using estimated parameters and also with the histogram of the data.
Example 31.5. The method of moments estimator can always be used to estimate µ and σ 2 .
We must solve
M1 = µ̂
M2 = µ̂2 + σ̂ 2

This obviously yields our usual µ̂ = X̄ and the estimate Σ_{i=1}^{n} (Xi − X̄)²/n for σ².
[Figure: histogram of the data with the true gamma density (α = 10, β = 2) and the density using the estimated parameters superimposed; Density on the vertical axis.]

Homework.

1. Read Devore and Berk pages 344-346.

2. Do problem 7.23a.

3. A Rayleigh distribution has one parameter θ and pdf

f(x; θ) = (x/θ²) e^(−x²/(2θ²)),   x ≥ 0
Find the method of moments estimator of θ.
Mathematics 343 Class Notes 84

32 Maximum Likelihood Estimation

Goal: To estimate parameters using the method of maximum likelihood

32.1 An Example

Example 32.1. Suppose that X ∼ Bin(n, p) with p unknown. The density is

f(x; p) = (n choose x) p^x (1 − p)^(n−x)     (32.4)
If x is known, we can think of f(x; p) as a function of p. It is reasonable to choose p̂ to maximize
this function. That is, for fixed x we will maximize 32.4. Note that (n choose x) is a positive constant
so we can omit it, and also it is enough to find p̂ that maximizes the natural logarithm of f.
Namely our problem is to maximize
L(p) = x ln p + (n − x) ln(1 − p)
Solving L′(p) = 0 gives p = x/n. In other words we should use p̂ = X/n.

32.2 The General Method

Suppose that X1 , . . . , Xn have joint density function f (x1 , . . . , xn ; θ1 , . . . , θm ) where θ1 , . . . , θm


are unknown parameters. If x1 , . . . , xn are observed sample values, the function
L(θ1 , . . . , θm ) = f (x1 , . . . , xn ; θ1 , . . . , θm )
is called the likelihood function of the sample.
Definition 32.2. Given the n values x1 , . . . , xn of random variables X1 , . . . , Xn with likelihood
function L(θ1 , . . . , θm ), the maximum likelihood estimates of θ1 , . . . , θm are those values that
maximize the function L. If the random variables Xi are substituted in place of their values xi ,
the resultant random variables are the maximum likelihood estimators θ̂1 , . . . , θ̂m of θ1 , . . . , θm .
Example 32.3. Suppose X1 , . . . , Xn is a random sample from an exponential distribution with
parameter λ. Then the joint density function is
f(x1, . . . , xn; λ) = (λe^(−λx1)) · · · (λe^(−λxn)) = λ^n e^(−λ Σ xi)

The log of the likelihood function is therefore

n ln λ − λ Σ xi

Maximizing this function we find that

λ̂ = 1/X̄.

Example 32.4. Suppose that X1, . . . , Xn is a random sample from a normal distribution with
parameters µ and σ². Then the maximum likelihood estimators of µ and σ² are

µ̂ = X̄      σ̂² = Σ (Xi − X̄)²/n

Unfortunately, it is often difficult to find the values of the parameters that maximize the like-
lihood function.

Example 32.5. Let X1 , . . . , Xn be a random sample from a gamma distribution with param-
eters α and β. The likelihood function is
L(α, β) = β^(−nα) (Γ(α))^(−n) (x1 · · · xn)^(α−1) e^(−(Σ xi)/β)

Homework.

1. Read Devore and Berk, pages 346–348.

2. Do problems 7.21, 7.23b, 7.25.


Mathematics 343 Class Notes 86

33 Nonlinear Optimization Using R

Goal: To use R to find maximum likelihood estimates for the gamma distribution

33.1 Finding Minima of Nonlinear Functions

Example 33.1. The function h(x) = (x − 2)² has a minimum of 0 at x = 2. R finds minima


numerically. The command nlm takes two arguments: a function to be minimized and a starting
value for the iterative procedure.

> h
function(x){ (x-2)^2 }
> nlm(h,3)
$minimum
[1] 0
$estimate
[1] 2
$gradient
[1] 0
$code
[1] 1
$iterations
[1] 2
Example 33.2. The function k(x) = x4 − 8x3 + 16x + 1 has two local minima. Which one of
these is found depends on the starting point of the iteration.

> k
function (x) {x^4-8*x^3+16*x+1}
> nlm(k,3)
$minimum
[1] -335.912
$estimate
[1] 5.884481
$gradient
[1] 2.414972e-07
$code
[1] 1
$iterations
[1] 6
> nlm(k,-1)
$minimum
[1] -7.316241
$estimate
[1] -0.7687347
$gradient
[1] 3.939959e-06
$code
[1] 1
$iterations
[1] 5

Example 33.3. The function m(x, y) = x² + xy + y² − 3x − 3y + 4 has a minimum at (1, 1).
Note that the argument to m in R is a vector of length two that carries the two variables of
the function.

> m
function (w) {x=w[1]; y=w[2] ; x^2 + x*y+y^2-3*x-3*y+4 }
> nlm(m,c(2,3))
$minimum
[1] 1
$estimate
[1] 0.9999997 0.9999997
$gradient
[1] -4.440892e-10 -1.332268e-09
$code
[1] 1
$iterations
[1] 3

Example 33.4. Let’s get more interesting and solve a problem from a recent Mathematics 162
test.

A piece of sheet metal of length 10 inches is bent to form a trapezoidal trough as in the picture.
Find the length of the sides x and the angle of the bend θ so that the area of the cross-section
of the trough is maximized.

[Figure: cross-section of the trough — a trapezoid formed by bending up two sides of length x at angle θ.]

The maximum occurs at x = 10/3, θ = π/6. Note that R gets close but is dumb enough not
to know that it should find a reasonable value of θ. Note also that the function trap returns
the negative of the area of the trapezoid. This is because a minimum of trap corresponds to a
maximum value for the area.

> trap
function (x) { s=x[1]; t=x[2]; return(-s*cos(t)*(s*sin(t)+(10-2*s)))}
> n=nlm(trap,c(3,1))
> n
$minimum
[1] -14.43376
$estimate
[1] 3.333313 -12.042781
$gradient
[1] 1.971768e-08 1.389486e-07
$code
[1] 1
$iterations
[1] 9
> n$estimate[2]+4*pi
[1] 0.5235893
> pi/6
[1] 0.5235988

33.2 Maximum Likelihood Estimation for the Gamma Function

Example 33.5. Let X1 , . . . , Xn be a random sample from a gamma distribution with param-
eters α and β. The likelihood function is
L(α, β) = β^(−nα) (Γ(α))^(−n) (x1 · · · xn)^(α−1) e^(−(Σ xi)/β)

Rather than maximizing L, we minimize − log(L(α, β)). Notice that dgamma can be used to
return the logarithm of the density function. Also, note that we need to pass the values of the
array x through nlm to the function f. This is the role of the third argument to nlm.

> f
function (a,x) {alpha=a[1];beta=a[2]; -sum(dgamma(x,shape=alpha,scale=beta,log=T)) }
> x=rgamma(30,shape=10,scale=4)
> n=nlm(f,c(8,2),x=x)
> n
$minimum
[1] 115.6313
$estimate
[1] 10.788300 3.588843
$gradient
[1] -1.654462e-06 8.604496e-06
$code
[1] 1
$iterations
[1] 18

Homework.

1. Do problem 7.28a.
Mathematics 343 Class Notes 90

34 Beta Distribution Project

Goal: To keep students occupied while the instructor is out of town

34.1 The (Standard) Beta Distribution

The beta distribution is a distribution with parameters α, β > 0. (Refer to pages 203–204 of
Devore and Berk where this distribution is called the standard beta distribution.) The pdf is

(Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1),   0 ≤ x ≤ 1

The beta distribution has mean and variance given by

µ = α/(α + β)      σ² = αβ/((α + β)²(α + β + 1))

R knows the beta distribution. The commands dbeta, rbeta, pbeta work as expected. The
parameters α and β are called shape1 and shape2 in R.

> x=seq(0,1,.01)
> y=dbeta(x,shape1=2,shape2=4)
> plot(x,y,type="l")
[Figure: plot of the Beta(2, 4) density, y = dbeta(x, 2, 4), for x in [0, 1].]

34.2 Collect Some Data

Given the domain of the beta distribution (0 ≤ x ≤ 1), it is often used to model data that
are percentages. For example, we could use the beta distribution to model the distribution of
free-throw shooting percentages of basketball players in the National Basketball Association.
Collect some data that is (or at least approximates) a random sample from what might be
a beta distribution. For example, you might find a dataset that represents a population of
percentages and choose your own random sample from it. Be creative here; I don’t want to see
eight collections of free-throw shooting percentages of National Basketball Association players.

1. Clearly describe your data and the manner in which you collected it.

2. Describe exactly the population (whether actual or theoretical) that the data is a sample
from.

3. Is your data a true random sample from the population? If not, what biases might have
been introduced in the sampling?

4. Present your data numerically and also in some useful graphical way.

34.3 Estimate Some Parameters

1. Using the data you have collected, compute the method of moments estimators for the
parameters α and β in the beta distribution.

2. Using the data you have collected, compute the maximum likelihood estimators for the
parameters α and β.

3. Assess the fit of each of your two sets of estimators in some useful (qualitative) way.
Perhaps you might want to plot the respective densities obtained from these estimators.
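
A minimal R sketch of both computations, assuming your sample is stored in a vector x (all of the variable and function names below are our own choices):

# Method of moments: solve xbar = a/(a+b) and s^2 = ab/((a+b)^2 (a+b+1)) for a and b
xbar = mean(x)
v = var(x)
k = xbar*(1 - xbar)/v - 1          # this is (a+b) in the method of moments solution
alpha.mom = xbar*k
beta.mom = (1 - xbar)*k

# Maximum likelihood: minimize the negative log-likelihood with nlm,
# starting from the method of moments estimates (as in Section 33.2)
negloglik = function(a, x) { -sum(dbeta(x, shape1=a[1], shape2=a[2], log=TRUE)) }
nlm(negloglik, c(alpha.mom, beta.mom), x=x)$estimate

The maximum likelihood step is exactly the gamma computation of Section 33.2 with dgamma replaced by dbeta.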

35 Maximum Likelihood Estimation Yet Again

Goal: To deal with some “issues” related to maximum likelihood estimation

35.1 Maximization Is Not Always Through Differentiation

We know this already. The maximum of a function may occur at an endpoint of the domain or
at a point where the function is not differentiable rather than at a point where the derivative
is 0.

Example 35.1. Suppose that x1 , . . . , xn is a sample from a distribution that is uniform on [0, θ], where θ is unknown. Then we know that θ ≥ xi for every i, and for all such θ, L(θ) = 1/θⁿ. This is maximized when θ is as small as possible, that is, θ̂ = max{x1 , . . . , xn }.

Example 35.2. Suppose that X is hypergeometric with parameters n, M , and N and that N
is the only unknown parameter. Then the likelihood function is
L(N) = (M choose x)(N − M choose n − x) / (N choose n)

This is a function of an integer variable and so it is not differentiable. We analyze this situation
as follows. Consider L(N)/L(N − 1). We have

L(N)/L(N − 1) = (N − M)(N − n) / [N(N − M − n + x)]

This ratio is larger than 1 if and only if N < M n/x. Thus the value of N that maximizes L(N )
is the greatest integer N such that N < Mn/x, which is denoted ⌊Mn/x⌋.
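
A quick numerical check of this claim in R, using dhyper; the particular values of M, n, and x below are arbitrary choices:

M = 50; n = 20; x = 7
N = seq(M, 500)               # candidate values of the unknown population size
L = dhyper(x, M, N - M, n)    # L(N) in R's parametrization of the hypergeometric
N[which.max(L)]               # N maximizing the likelihood numerically
floor(M*n/x)                  # the value claimed above; the two should agree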

35.2 The Invariance Principle

Proposition 35.3. Suppose that θ̂ is the maximum likelihood estimator of θ. Then h(θ̂) is the
maximum likelihood estimator of h(θ). (This result is true for m-tuples of parameters as well.)

Example 35.4. Suppose that α̂ and β̂ are the maximum likelihood estimators of α and β in
the gamma distribution. Then α̂β̂ is the maximum likelihood estimator of αβ = µ.

Example 35.5. The invariance principle is very useful as it applies to any function h. But in
a sense it is bad news. For example we know that the maximum likelihood estimator of µ in
the normal distribution is X̄ and is unbiased. Thus X̄ 2 is the maximum likelihood estimator of
µ2 but we know that it cannot be an unbiased estimator.

35.3 Large Sample Behavior of the MLE

Although the MLE is sometimes difficult to compute and is not necessarily unbiased, it can be
proved that for large samples the MLE has some desirable properties. Suppose that we have a
random sample from a population with a pdf f (x; θ) that depends on one unknown parameter.
One technical assumption that we must make is that the possible values of x do not depend on
θ (unlike Example 35.1). Then we have the following theorem, stated informally. (It is really a
limit theorem like the Central Limit Theorem.)

Theorem 35.6. For large n, the distribution of the Maximum Likelihood Estimator θ̂ of θ
approaches a normal distribution, its mean approaches θ and its variance approaches 0. Fur-
thermore, for large n, its variance is nearly as small as that of any unbiased estimator of θ.

But beware, small samples are not large!

Homework.

1. Read Devore and Berk, pages 351–353.

2. Do problem 7.31 of Devore and Berk.



36 Confidence Intervals

Goal: To develop the theory of confidence intervals in the context of a simple (un-
realistic) example.

36.1 Example

Suppose that X1 , . . . , Xn is a random sample from a distribution that is normal with unknown
mean µ and known variance σ 2 . (Knowing σ 2 without knowing µ is unrealistic.) Then we
know

1. X is an unbiased estimator of µ

2. (X̄ − µ)/(σ/√n) has a standard normal distribution.

Therefore

 
P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) = .95

and so by algebra

P(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) = .95

The interval

(X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n)

is a random interval.

Definition 36.1. Suppose that X1 , . . . , Xn is a random sample from a distribution that is normal with mean µ and known variance σ². Suppose that x1 , . . . , xn is the observed sample. The interval

(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n)

is called a 95% confidence interval for µ.

Example 36.2. A machine creates rods that are to have a diameter of 23 millimeters. It is
known that the standard deviation of the actual diameters of parts created over time is 0.1
mm. A random sample of 40 parts is measured precisely to determine if the machine is still
producing rods of diameter 23 mm. The data and 95% confidence interval are given by
> x
[1] 22.958 23.179 23.049 22.863 23.098 23.011 22.958 23.186 23.015 23.089
[11] 23.166 22.883 22.926 23.051 23.146 23.080 22.957 23.054 22.995 22.894
[21] 23.040 23.057 22.985 22.827 23.172 23.039 23.029 22.889 23.019 23.073
[31] 22.837 23.045 22.957 23.212 23.092 22.886 23.018 23.031 23.059 23.117
> mean(x)
[1] 23.024
> c(mean(x)-(1.96)*.1/sqrt(40),mean(x)+(1.96)*.1/sqrt(40))
[1] 22.993 23.055

It appears that the process could still be producing rods of diameter 23 mm.

36.2 Interpreting Confidence Intervals

In Example 36.2, we can say something like “we are 95% confident that the true mean is in the
interval (22.993, 23.055).” But beware:

This is not a probability statement! That is, we do not say that the probability
that the true mean is in the interval (22.993, 23.055) is 95%. There is no probability
after the experiment is done, only before.

The correct probability statement is one that we make before the experiment.

If we are to generate a 95% confidence interval for the mean from a random sample of
size 40 from a normal distribution with standard deviation 0.1, then the probability
is 95% that the resulting confidence interval will contain the mean.

Another way of saying this using the relative frequency interpretation of probability is

If we generate many 95% confidence intervals by this procedure, approximately 95% of them will contain the mean of the population.

After the experiment, a good way of saying what confidence means is this

Either the population mean is in (22.993, 23.055) or something very surprising hap-
pened.

Below we generate 1000 random samples of size 40 from a normal distribution with mean 20
and standard deviation 1. Since we know µ for this simulation, we can check whether the 1000
corresponding confidence intervals contain µ. We would expect about 950 of them to contain
µ.

> zint
function (x,sigma,alpha) { z = -qnorm(alpha/2,0,1);
c(mean(x)-z*sigma/sqrt(length(x)),mean(x)+z*sigma/sqrt(length(x)))}
> cints=replicate(1000,zint(rnorm(40,20,1),1,.05))
> cints[,1]
[1] 19.504 20.124
> sum((cints[1,]<20)&(cints[2,]>20))
[1] 964
> cints=replicate(1000,zint(rnorm(40,20,1),1,.05))
> sum((cints[1,]<20)&(cints[2,]>20))
[1] 952

36.3 Confidence Levels and Sample Size

Nothing is magic about 95%. To generate a confidence interval at a different level of confidence,
we simply change the 1.96. Let zβ denote the number such that Φ(zβ ) = 1 − β where Φ is the
cdf of a standard normal random variable. Then a 100(1 − α)% confidence interval for µ is
given by  
σ σ
x − zα/2 √ , x + zα/2 √
n n

Note that the width of the confidence interval is determined by α and n. Higher levels of confidence require wider confidence intervals, and larger sample sizes result in narrower confidence intervals. Both facts are intuitively obvious.

Homework.

1. Read Section 8.1 of Devore and Berk.

2. Do problems 8.2, 8.3, 8.5, 8.8, 8.9 of Devore and Berk.



37 An Important (but Overused) Confidence Interval

Goal: To construct confidence intervals for the mean of a normal distribution with-
out the silly assumption that the variance is known

37.1 The Problem

Given a random sample X1 , . . . , Xn from a normal distribution with unknown mean we want
to construct a confidence interval for µ without assuming that σ is known. The steps are:

1. Recall that (X̄ − µ)/(σ/√n) has a standard normal distribution.
2. Since σ is unknown, it seems advisable to approximate σ by S, the sample standard
deviation.
3. Now we need to know the distribution of (X̄ − µ)/(S/√n).

37.2 The t Distribution

Definition 37.1. A random variable T has a t distribution (with parameter ν ≥ 1, called the
degrees of freedom of the distribution) if it has pdf
f(t) = (1/√(πν)) · (Γ((ν + 1)/2)/Γ(ν/2)) · 1/(1 + t²/ν)^((ν+1)/2)        −∞ < t < ∞

Some properties of the t distribution include

1. f is symmetric about t = 0 and unimodal. In fact f looks bell-shaped.


2. Indeed the mean of T is 0 if ν > 1 and does not exist if ν = 1.
3. The variance of T is ν/(ν − 2) if ν > 2.
4. For large ν, T is approximately standard normal.
Theorem 37.2. If X1 , . . . , Xn is a random sample from a normal distribution with mean µ
and variance σ 2 , then the random variable
(X̄ − µ)/(S/√n)
has a t distribution with n − 1 degrees of freedom.

37.3 The Confidence Interval

Analogous to zβ , define tβ,ν to be the unique number such that

P (T > tβ,ν ) = β

where T is random variable that has a t distribution with ν degrees of freedom. We have the
following:

Proposition 37.3. If x1 , . . . , xn are the observed values of a random sample from a normal
distribution with unknown mean µ and t∗ = tα/2,n−1 , the interval
 
(x̄ − t∗ s/√n, x̄ + t∗ s/√n)

is a 100(1 − α)% confidence interval for µ.

37.4 R and t

As usual, dt, pt, qt, and rt compute the usual functions associated with the t distribution.
The relevant parameter is the degrees of freedom.

> x=seq(-3,3,.01)
> y=dt(x,5)
> z=dt(x,10)
> w=dnorm(x)
> plot(x,y,type="l")
> lines(x,z)
> lines(x,w)
> -qt(.025,5)
[1] 2.570582
> -qt(.025,10)
[1] 2.228139
> -qt(.025,30)
[1] 2.042272
> -qt(.025,100)
[1] 1.983972

[Plot: t densities with 5 and 10 degrees of freedom together with the standard normal density]

Confidence intervals are produced using t.test. Below we construct 95% and 90% confidence
intervals for the sepal width of the species virginica irises in an historically important dataset
(that is built into R).

> data(iris)
> sw=iris$Sepal.Width[iris$Species=="virginica"]
> hist(sw)
> t.test(sw)

One Sample t-test

data: sw
t = 65.208, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2.882347 3.065653
sample estimates:
mean of x
2.974
> t.test(sw,conf.level=.9)

One Sample t-test

data: sw
t = 65.208, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
2.897536 3.050464
sample estimates:
mean of x
2.974

37.5 What if X is not Normal?

Suppose that X1 , . . . , Xn is a random sample from a distribution that is not known to be normal
but we still want to construct a confidence interval for µ. If the sample size is large and the
distribution is known to be reasonably symmetric, it is common to use the same confidence
interval as if it were normal. This probably should be a last resort.
Example 37.4. The GPAs of the 1333 seniors at a college somewhere in the Midwest do not
have a normal distribution. (The distribution is decidedly skewed to the right.) A sample of
30 seniors is taken and a 95% confidence interval for the population GPA is reported using a
t confidence interval. In an R simulation of 1000 such samples, we find that 949 of the 1000
confidence intervals contain the true mean of the population!

> g$GPA
[1] 3.992 3.376 3.020 3.509 3.970 3.917 3.243 3.547 3.416 4.000 3.448 3.908
..........................................
[1321] 3.312 2.621 3.494 2.507 3.222 2.892 3.344 3.417 3.656 3.830 2.234 3.163
[1333] 2.886
> f
function (){x=sample(g$GPA,30,replace=F) ;
t=t.test(x);
t$conf.int}
> cints=replicate(1000,f())
> sum( (cints[1,]<mean(g$GPA))&(cints[2,]>mean(g$GPA)) )
[1] 949

Homework.

1. Read Devore and Berk pages 383–385.

2. Do problem 8.34.

38 Confidence Intervals Using the CLT

Goal: To construct approximate confidence intervals for parameters of distributions that are not normal

38.1 Setting

We suppose that we have a random sample X1 , . . . , Xn from a distribution that is not necessarily
normal and an unknown parameter θ. Suppose that θ̂ is an estimator for θ that has the following
properties:

1. the distribution of θ̂ is approximately normal

2. it is approximately unbiased

3. an expression for σθ̂ is available (in terms of quantities that can be estimated)

Then

P(−zα/2 < (θ̂ − θ)/σθ̂ < zα/2) ≈ 1 − α

We can use this to generate an approximate confidence interval. If sθ̂ is the estimate for σθ̂, then the interval is

(θ̂ − zα/2 sθ̂, θ̂ + zα/2 sθ̂)

38.2 Example 1 - Estimating µ

By the CLT, if X1 , . . . , Xn is a random sample from any distribution and n is large, then X̄ is an unbiased estimator for µ and (X̄ − µ)/(σ/√n) has a distribution that is approximately normal. Since σ is unknown, we need to estimate σ to construct a confidence interval. If we estimate σ by S, we have the following approximate confidence interval for µ

P(X̄ − zα/2 S/√n < µ < X̄ + zα/2 S/√n) ≈ 1 − α

Note that this is the same interval as generated in the last section except that there the t
distribution is used.

38.3 Example 2 - The Binomial Distribution

Suppose that X is binomial with parameters n and p and that p is unknown. The estimator p̂ = X/n is an unbiased estimator of p. The CLT allows us to approximate the binomial distribution
by a normal distribution and so we can write
P(−zα/2 < (p̂ − p)/√(p(1 − p)/n) < zα/2) ≈ 1 − α        (38.5)

Equation 38.5 is the starting point for several different approximate confidence intervals.

38.3.1 The Wald interval.

If we estimate the standard error by substituting p̂ for p, we get the “Wald interval”

(p̂ − zα/2 √(p̂(1 − p̂)/n), p̂ + zα/2 √(p̂(1 − p̂)/n))

Until about the year 2000, this was the standard confidence interval suggested in most ele-
mentary statistics textbooks if the sample size is large enough. Books varied as to what large
enough meant. A typical piece of advice is to only use this interval if np̂(1 − p̂) ≥ 10. However,
you should never use this interval.

Definition 38.1. Suppose that I is a random interval used as a confidence interval for θ. The
coverage probability of I is P (θ ∈ I). (In other words, the coverage probability is the true
confidence level of the confidence intervals produced by I.)

The coverage probability of the (approximately) 95% Wald confidence intervals is almost always
less than 95% and could be quite a bit less depending on p and the sample size. For example,
if p = .2, it takes a sample size of 118 to guarantee that the coverage probability of the
Wald confidence interval is at least 93%. For very small probabilities, it takes thousands of
observations to ensure that the coverage probability of the Wald interval approaches 95%.

38.3.2 The Wilson Interval.

Since 1927, a much better interval than the Wald interval has been known although it wasn’t
always appreciated how much better the Wilson interval is. The Wilson interval is derived by
solving the inequality in 38.5 so that p is isolated in the middle. We get the following (impressive
looking) approximate confidence interval statement:
P( [p̂ + zα/2²/(2n) − zα/2 √(p̂(1 − p̂)/n + zα/2²/(4n²))] / [1 + zα/2²/n]  <  p  <  [p̂ + zα/2²/(2n) + zα/2 √(p̂(1 − p̂)/n + zα/2²/(4n²))] / [1 + zα/2²/n] ) ≈ 1 − α

The following R code computes this interval in the case that x = 7, n = 10. (The option
correct=F will be considered later.)

> prop.test(7,10,correct=F)

1-sample proportions test without continuity correction

data: 7 out of 10, null probability 0.5


X-squared = 1.6, df = 1, p-value = 0.2059
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3967781 0.8922087
sample estimates:
p
0.7

The Wilson interval performs much better than the Wald interval. If np̂(1 − p̂) ≥ 10, you can
be reasonably certain that the coverage probability of the 95% Wilson interval is at least 93%.
Notice that the center of the Wilson interval is not p̂. It is

(p̂ + zα/2²/(2n)) / (1 + zα/2²/n) = (x + zα/2²/2) / (n + zα/2²)

A way to think about this is that the center of the interval comes from adding zα/2² trials and zα/2²/2 successes to the observed data. For a 95% confidence interval, this is very close to adding 2 successes and 4 trials (which gives a point estimator for p that we studied earlier).
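
A sketch of the Wilson interval computed directly from the formula above (the function name is our own); for x = 7 and n = 10 it should reproduce the interval that prop.test reported.

wilson = function(x, n, alpha) {
  z = -qnorm(alpha/2); phat = x/n
  center = (phat + z^2/(2*n)) / (1 + z^2/n)
  halfwidth = z*sqrt(phat*(1-phat)/n + z^2/(4*n^2)) / (1 + z^2/n)
  c(center - halfwidth, center + halfwidth)
}
wilson(7, 10, .05)    # compare with prop.test(7,10,correct=F) above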

38.3.3 The Agresti-Coull Interval.

Agresti and Coull (1998) suggest combining the biased point estimator of p that is used as the center of the Wilson interval with the simpler estimate of the standard error that comes from the Wald interval. In particular, if we are looking for a 100(1 − α)% confidence interval and x is the number of successes observed in n trials, define

x̃ = x + zα/2²/2        ñ = n + zα/2²        p̃ = x̃/ñ


Then the Agresti-Coull interval is

(p̃ − zα/2 √(p̃(1 − p̃)/ñ), p̃ + zα/2 √(p̃(1 − p̃)/ñ))

In practice, this estimator is even better than the Wilson estimator and is now almost universally
the recommended one, even in basic statistics textbooks. For the particular example of x = 7
and n = 10, the Wilson and Agresti-Coull intervals are compared below.

> agco
function (x,n,alpha) { z= -qnorm(alpha/2); ntilde=n+z^2;
ptilde = (x+z^2/2)/ntilde ;
se = sqrt( ptilde*(1-ptilde)/ntilde);
c(ptilde-z*se,ptilde+z*se)}
> agco(7,10,.05)
[1] 0.3923253 0.8966616
> prop.test(7,10,correct=F)

1-sample proportions test without continuity correction

data: 7 out of 10, null probability 0.5


X-squared = 1.6, df = 1, p-value = 0.2059
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3967781 0.8922087
sample estimates:
p
0.7

Homework.

1. Read Devore and Berk, Section 8.2.

2. Do problems 8.20 and 8.22 of Devore and Berk. You may use either Wilson or Agresti-
Coull confidence intervals. Problem 8.22 asks for an upper confidence bound, i.e., a
one-sided confidence interval. In R, a useful option to prop.test is alternative=. Read
the help document to learn more. These homework problems are due Friday, December
8.

39 The Bootstrap

Goal: To generate “bootstrap” confidence intervals

39.1 Approximations

We constructed exact 95% confidence intervals for the mean from random samples from a normal
distribution. There are other instances where exact confidence intervals can be constructed.
But usually, the 95% confidence intervals that we construct are approximate (i.e., their coverage
probability is only approximately 95%). The approximation is usually because we do not know
the distribution of our estimator θ̂ and use something like the CLT to approximate it.
The bootstrap is a method of constructing approximate confidence intervals that relies on two
pieces of intuition.

1. Rather than make a distributional assumption about the population (e.g., the data comes
from a gamma distribution), we use the data itself to give us an approximation to the
distribution of the population.

2. Simulation can be used to construct approximate confidence intervals if the distribution is known.

39.2 The Empirical Density Function

Definition 39.1. Suppose that x1 , . . . , xn are n numbers that are the result of a random sample
from a population. The empirical density function of the sample is the function

p(xi ) = 1/n 1≤i≤n

The first principle used in the bootstrap is that the empirical density function is a good ap-
proximation to the actual density function of the unknown random variable. More precisely,
the cumulative distribution function corresponding to this density function is a good approx-
imation to the population cdf. The next example graphs the empirical cdf of a sample (from a known distribution) against the actual cdf.

> x=rnorm(30,10,1)
> x=sort(x)
> y=c(1:30)/30
> plot(x,y,type="s")
> y=pnorm(x,10,1)
> lines(x,y)

[Plot: empirical cdf of the sample (step function) with the N(10, 1) cdf overlaid]

39.3 The Bootstrap

Suppose we want to find a confidence interval for θ and we have a reasonable estimator θ̂ for θ.
Let x1 , . . . , xn be the result of a random sample of size n. The bootstrap simulates an estimate
of the sampling distribution for θ̂ as follows.

Definition 39.2. A bootstrap sample from x1 , . . . , xn is a random sample of size n from {x1 , . . . , xn } with replacement. That is, a bootstrap sample is a random sample from a population with pdf equal to the empirical pdf of the original sample.

To compute a bootstrap confidence interval, choose many (e.g., 1000) bootstrap samples and
compute the values of θ̂ for each. Let θ̂i be the value of θ̂ in the ith bootstrap sample. The
values θ̂i give an estimate for the distribution of θ̂.

Definition 39.3. A 100(1 − α)% bootstrap (percentile) confidence interval for θ is (a, b), where a is the 100(α/2)th percentile of the set of bootstrap values θ̂i and b is the 100(1 − α/2)th percentile of these values.

> x
[1] 1.903 3.901 3.027 2.391 3.813 3.364 3.108 3.596 2.709 3.873 2.728 3.435
[13] 3.167 3.820 3.719 3.212 3.763 3.773 3.611 3.764 2.544 3.619 3.527 2.768
[25] 3.058 2.960 3.612 3.774 2.979 3.041
> hattheta=replicate(1000,mean(sample(x,30,replace=T)))
> hist(hattheta)
> s=sort(hattheta)
> s[25]
[1] 3.093667
> s[975]
[1] 3.460367
> t.test(x)

One Sample t-test


data: x
t = 35.3424, df = 29, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
3.095183 3.475417
sample estimates:
mean of x
3.2853

[Histogram of the 1000 bootstrap means hattheta]

There are more sophisticated ways to generate confidence intervals from the bootstrap values θ̂i . Devore and Berk describe the BCa (bias-corrected, accelerated) interval. An R package, simpleboot (which must be loaded), computes these intervals (as well as others).

> x
[1] 3.267 2.496 2.534 3.393 3.439 3.737 3.833 3.336 3.319 2.630 2.382 1.989
[13] 3.330 3.061 2.452 3.258 3.054 3.918 3.788 3.821 3.162 2.794 3.919 3.623
[25] 3.319 3.651 3.764 2.607 3.745 2.946
> b=one.boot(x,mean,1000)
> b
$t0
[1] 3.2189

$t
[,1]
[1,] 3.273233
[2,] 3.170400
[3,] 3.211500
................
[997,] 3.293700
[998,] 3.318433
[999,] 3.405000
[1000,] 3.072833

$R
[1] 1000
> boot.ci(b)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = b)

Intervals :
Level Normal Basic
95% ( 3.031, 3.404 ) ( 3.038, 3.417 )

Level Percentile BCa


95% ( 3.020, 3.400 ) ( 3.011, 3.397 )
Calculations and Intervals on Original Scale
Warning message:
bootstrap variances needed for studentized intervals in: boot.ci(b)

Homework.

1. Read Devore and Berk, Section 8.5.



40 Testing Hypotheses About the Mean

Goal: To review hypothesis testing in the context of hypotheses about the mean of
a (normal) distribution

40.1 Setting

X1 , . . . , Xn is a random sample from a normal distribution with unknown µ. We want to test a hypothesis about µ.

Example 40.1. Kellogg’s makes Raisin Bran and fills boxes that are labelled 11 oz. NIST
mandates testing protocols to ensure that this claim is accurate. Suppose that a shipment
of 250 boxes, called the inspection lot, is to be tested. The mandated procedure is to take a
random sample of 12 boxes from this shipment. If any box is more than 1/2 ounce underweight,
then the lot is declared defective. Else, the sample mean x̄ and the sample standard deviation
s are computed. The shipment is rejected if (x̄ − 11)/s ≤ −0.635.

40.2 The Hypothesis Testing Technology

We need two hypotheses:

1. Null Hypothesis. The null hypothesis, denoted H0 , is a hypothesis that the data
analysis is intended to investigate. It is usually thought of as the “default” or “status
quo” hypothesis that we will accept unless the data gives us substantial evidence against
it.

2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha , is the hypothesis that we want to put forward as true if we have sufficient evidence against the null hypothesis.

In the context of Example 40.1, our hypotheses are

H0 : µ = 11
Ha : µ < 11

We will make one of two decisions. Either we will reject H0 (in favor of Ha ) or we will not
reject H0 .

There are two possible errors:

1. A Type I error is the error of rejecting H0 even though it is true. The probability of a
type I error is denoted by α.

2. A Type II error is the error of not rejecting H0 even though it is false. The probability
of a Type II error is denoted by β.

We test the hypothesis by computing a test statistic.


Definition 40.2. A test statistic is a random variable on which the decision is to be based.
A rejection region is the set of all possible values of the test statistic that would lead us to
reject H0 .

40.3 Testing Hypotheses About the Mean of the Normal Distribution

Suppose that X1 , . . . , Xn is a random sample from a normal distribution with unknown mean
µ. Suppose that we have the following null and alternate hypotheses:

H0 µ = µ0
Ha µ < µ0

We will use the following test statistic:

T = (X̄ − µ0)/(S/√n)

The important fact about this statistic is that if H0 is true then the distribution of T is known.
(It is a t distribution with n − 1 degrees of freedom.) Thus we can construct a rejection region
based on our desired α. In this case our test is

Reject H0 if and only if T < −tα,n−1 .

Note that in Example 40.1, n = 12. If we let α = .025, the test becomes

Reject H0 if and only if (X̄ − 11)/S < −t.025,11/√12 = −0.635.

This means that the NIST test really is a hypothesis test with α = .025. Indeed the NIST
manual says that “this method gives acceptable lots a 97.5% chance of passing.” Of course
the NIST method is implicitly relying on the assumption that the distribution of the lot is normal. Is this unwise? And the sample size is only 12, so we should be cautious about using
the t-distribution for a non-normal population.
The following R session shows the hypothesis test of a possible sample of size 12 of 11 oz boxes
of Raisin Bran. Note that even though this sample seems to suggest underfilling, the lot passes
the test.

> x
[1] 10.74900 11.04724 10.86442 10.98675 10.80881 11.33170 10.73323 10.69521
[9] 10.83790 10.90010 10.99387 10.88968
> t.test(x,alternative="less",mu=11)
One Sample t-test
data: x
t = -1.9358, df = 11, p-value = 0.0395
alternative hypothesis: true mean is less than 11
95 percent confidence interval:
-Inf 10.99300
sample estimates:
mean of x
10.90316
> (mean(x)-11)/sd(x)
[1] -0.5588134

In the use of t.test above, we specified the null hypothesis, mu=11, the alternate hypothe-
sis alternative="less", and (of course) the data. The possible alternative hypotheses are
two.sided, less, and greater.

40.4 p-Value of a Hypothesis Test

An important number that R reports is the p-value of the statistic.

Definition 40.3. The p-value of a statistical test is the probability that, if H0 is true, the test
statistic T would have a value at least as extreme (in favor of the alternate hypothesis) as the
value t that actually occurred.

The p-value is generally taken to be a measure of the strength of the evidence against H0 .
A small p-value is taken to be strong evidence against H0 . There are various incorrect ways
of saying what the p-value means. It is not the probability that the null hypothesis is true.
Indeed, it is a probability that only makes sense if the null hypothesis is true. Similarly, 1 − p
is not the probability that the alternative hypothesis is true.

Rather than reporting the results of hypothesis tests as “accept” or “reject” at the such-and-
such level, it is much more standard to report the p-value of the test. If a p-value is reported,
the reader can determine precisely those α for which the null hypothesis would be rejected at
significance level α. If the p-value of the test is less than α, we would reject the null hypothesis.
In this case, it is also sometimes said that the test is “significant at the α level.”
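
For the Raisin Bran sample of Section 40.3, the p-value can be computed directly from the t distribution; a sketch, using the data vector x from that session:

tstat = (mean(x) - 11)/(sd(x)/sqrt(12))   # observed value of T; about -1.936
pt(tstat, 11)                             # P(T <= tstat) if H0 is true; about 0.0395, as t.test reported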

40.5 Power

How likely is it that the NIST test will identify a bad lot of Raisin Bran as being bad? This
is a question about β, the probability of a Type II error. The answer depends on how bad the
lot actually is. In order to ask this question, we need to know the distribution of
T = (X̄ − 11)/(S/√12)

if µ ≠ 11. This distribution depends on the true mean µ, the standard deviation σ (which we
do not know), and the sample size. Suppose for example that the true mean is 10.9 and the
standard deviation is 0.1. The following R code determines the power of the test. (Power is
1 − β.) Note that the arguments are

delta the deviation of the true mean from the null hypothesis mean
sd the true standard deviation
n the sample size
sig.level α
type this t-test is called a one.sample test
alternative we tested a one.sided alternative

> power.t.test(n=12,delta=.1,sd=.1,sig.level=.025,
+ type="one.sample",alternative="one.sided")

One-sample t test power calculation

n = 12
delta = 0.1
sd = 0.1
sig.level = 0.025
power = 0.8828915
alternative = one.sided

With these hypothesized values, the power of the test is 88%. This means that the hypothesis test would detect underfilling by an average of 0.1 oz 88% of the time if the true standard deviation is 0.1 oz. Obviously power goes up as n goes up and goes down as σ goes up.

> sd=c(.05,.1,.15)
> power.t.test(n=12,delta=.1,sd=sd,
+ sig.level=.025,type="one.sample",alternative="one.sided")

One-sample t test power calculation

n = 12
delta = 0.1
sd = 0.05, 0.10, 0.15
sig.level = 0.025
power = 0.9999909, 0.8828915, 0.5580037
alternative = one.sided

The power of the test would also go up as the true deviation from the mean goes up.

> diff=seq(0,.1,.01)
> power.t.test(n=12,delta=diff,sd=.1,
+ sig.level=.025,type="one.sample",alternative="one.sided")

One-sample t test power calculation

n = 12
delta = 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10
sd = 0.1
sig.level = 0.025
power = 0.02500000, 0.05024502, 0.09249152, 0.15643493, 0.24401839,
0.35263574, 0.47466264, 0.59891866, 0.71365697, 0.80978484, 0.88289152
alternative = one.sided

Homework.

1. Read pages 418–421 and 435–436 of Devore and Berk.

2. Do problem 9.35.

41 The Poisson Distribution

Goal: To develop the properties of the Poisson distribution, reviewing relevant con-
cepts from Mathematics 343 along the way.

41.1 The Poisson distribution

1. Random variables: discrete and continuous

2. The Poisson random variable.


p(x; λ) = e^(−λ) λ^x / x!        x = 0, 1, 2, 3, . . .

3. Mean, variance, moment generating function.

4. R commands: dpois, rpois, ppois

41.2 Estimating parameters

1. Example: Deaths from kicks by horses in the Prussian army.

2. Method of moments estimator of λ.

3. Maximum likelihood estimator of λ.

41.3 Applications of the Poisson Distribution

1. Relation to binomial distribution.

2. Poisson processes. (occurrences in time)

3. Occurrences in space.

Homework.

1. Read Section 3.7.

2. Do problems 3.94, 3.98, 3.102, 3.106.



42 Experimental Design

Goal: To introduce the features of well-designed experiments

42.1 Variables

1. dependent variable (also, response variable): a measurable characteristic of the process or population we are studying.

2. independent variable: a variable that (we think) affects the value of the dependent
variable

3. factor: a categorical independent variable. The categories of a factor are called levels.

4. Our goal in an experiment is to “explain” (some of) the variation in the response variable
by the variation in the independent variable.

42.2 Experimental Design

1. Observational studies (both retrospective and prospective) versus experiments.

2. Confounding or lurking variables.

3. In an experiment, we assume that we can control the independent variable. The exper-
imental design is a procedure that describes exactly how the independent variables are
varied to obtain information about the dependent variable.

4. The simplest experiment is one in which there is one factor (categorical independent
variable) with two levels.

Example 42.1. A clinical trial of a drug for a particular medical condition usually has
two groups of subjects. One group receives the drug and the other group (called the
control group) receives either the traditional drug or no drug at all. The factor is the type
of drug received and there are two levels.

5. The basic problem: there is variation in the dependent variables across individuals that
is due to factors other than those being controlled.

6. Two solutions:

(a) control other factors


(b) randomize assignment of subjects to the levels of the factor

42.3 Example of a Simple Experiment

1. Random dot stereograms


2. Boxplots
> r
Time Treatment
1 47.20001 NV
2 21.99998 NV
3 20.39999 NV
4 19.70001 NV
......................
77 1.10000 VV
78 1.00000 VV
> boxplot(r$Time~r$Treatment)


[Boxplots of viewing Time for the NV and VV treatment groups]

3. Is there a difference?

Homework.

1. Read pages 37–41 and the handout from De Veaux, Velleman, and Bock.
2. Do problem 1.56. Recall that all text data is available in a workspace on the course
webpage.

43 Two Independent Samples

Goal: To develop a test for determining whether there is a difference in the means of two populations

43.1 The (quite unrealistic) setting

1. X1 , . . . , Xm is a random sample from a population with mean µ1 and known variance σ1².

2. Y1 , . . . , Yn is a random sample from a population with mean µ2 and known variance σ2².

3. The two samples are independent one from another.

Question: What is the difference between µ1 and µ2 ?

Obviously, we should use X − Y to estimate µ1 − µ2 .


Proposition 43.1. X̄ − Ȳ has mean µ1 − µ2 and variance σ1²/m + σ2²/n.

43.2 If X and Y are normal

To develop confidence intervals and hypothesis tests we need a distributional assumption about
X and Y . We will assume that the distributions of the X and Y are each normal.

43.2.1 Confidence Intervals

Proposition 43.2. A 100(1 − α)% confidence interval for µ1 − µ2 is

((X̄ − Ȳ) − zα/2 √(σ1²/m + σ2²/n), (X̄ − Ȳ) + zα/2 √(σ1²/m + σ2²/n))

The situation is so unusual that there is no R command that computes this confidence interval.

> data(iris)
> i=iris
> xbarset=mean(i$Sepal.Length[i$Species=="setosa"])
> xbarvir=mean(i$Sepal.Length[i$Species=="virginica"])
> xbardiff=xbarset-xbarvir
> s=sqrt(.35^2/50+.65^2/50)
> z=qnorm(.975)
> ci=c(xbardiff-z*s,xbardiff+z*s)
> ci
[1] -1.786626 -1.377374

43.2.2 Hypothesis Test

We generally want to test the hypothesis that there is no difference in the means of the two
populations.

H0 : µ1 − µ2 = 0
Ha : µ1 − µ2 ≠ 0

Proposition 43.3. If the null hypothesis is true, then the test statistic

Z = (X̄ − Ȳ)/√(σ1²/m + σ2²/n)

has a standard normal distribution.

So the test at the α level of significance is

reject H0 if the test statistic Z satisfies |Z| > zα/2

> xbardiff/s
[1] -15.15281
> pnorm(xbardiff/s)
[1] 3.629657e-52

43.2.3 If X and/or Y are not normal

If X and Y are not normally distributed but the sample sizes are large, the Central Limit
Theorem suggests that the above results are approximately true.

Homework.

1. Read pages 473–477 and 481–483.


2. Do problems 10.2, 4a, 6.

44 Distributions Related to the Normal

Goal: To review the distributions related to a sample from a normal distribution.

Assume throughout this section that

X1 , . . . , Xn is a random sample from a normal distribution with mean µ and variance σ².

Recall that

X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ

S² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

S = √(S²)

The important distributional facts are these

1. X̄ has a normal distribution with mean µ and variance σ²/n.
2. (n − 1)S 2 /σ 2 has a chi-squared distribution with ν = n − 1 degrees of freedom.

3. X and S 2 are independent.

4. T has a t-distribution with n − 1 degrees of freedom, where T is defined by

T = (X̄ − µ)/(S/√n)

If the underlying distribution of the Xi ’s is not necessarily normal but the sample size is large, then facts (1) and (4) above are still approximately true and can be used for inference. (This remark about (1) is a way of stating the Central Limit Theorem.) However, (2) and (3) above can fail badly in this case.

Example 44.1. Suppose that a random sample X1 , . . . , X30 is chosen from an exponential
distribution with parameter λ = 1. The results of a simulation of 1,000 such samples are shown
below. The relevant t and chisquare distributions (the curves) are compared to the result of
1,000 random samples from this distribution (the histograms).

> samp=replicate(1000,rexp(30,1))
> m=apply(samp,2,mean)
> v=apply(samp,2,var)
> t=(m-1)/(sqrt(v)/sqrt(30))
> hist(t,freq=F)
> x=seq(-3,2,.01)
> y=dt(x,29)
> lines(x,y)
> hist(29*v,freq=F)
> x=seq(0,100,.01)
> y=dchisq(x,29)
> lines(x,y)

[Histograms of the simulated values of t and of 29·v, with the t(29) and chi-squared(29) densities overlaid]

45 Two Independent Samples - t

Goal: To construct confidence intervals and hypothesis tests for the difference in
means between two populations

45.1 The Setting

1. X1 , . . . , Xm is a random sample from a population with mean µ1 and variance σ12

2. Y1 , . . . , Yn is a random sample from a population with mean µ2 and variance σ22 .

3. The two samples are independent one from another.

4. The samples come from normal distributions.

In Section 43, we noted that



(X̄ − Ȳ − (µ1 − µ2))/√(σ1²/m + σ2²/n)

has a standard normal distribution. Since σ1² and σ2² are unknown, analogously to the one-sample case it makes sense to use S1² and S2² to estimate these. Thus the problem is to determine the
distribution of

(X̄ − Ȳ − (µ1 − µ2))/√(S1²/m + S2²/n)        (45.6)

Alas, the distribution of this random variable is known only approximately and there isn’t a
different statistic that is more mathematically satisfying. However, what is known is that the
distribution of the variable in (45.6) is approximately a t-distribution with degrees of freedom
ν that can be estimated by a formula
ν = (s1²/m + s2²/n)² / [ (s1²/m)²/(m − 1) + (s2²/n)²/(n − 1) ]        (45.7)

Fortunately, a human does not need to compute ν or find the appropriate t-table (ν is in general
not an integer).
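
Still, if you want to see the formula at work, here is a sketch of (45.7) in R, assuming the two samples are stored in vectors x and y (the function name is our own):

welch.df = function(x, y) {
  m = length(x); n = length(y)
  a = var(x)/m; b = var(y)/n
  (a + b)^2 / (a^2/(m - 1) + b^2/(n - 1))
}
# welch.df(x, y) is the (non-integer) df that t.test reports for the Welch two-sample test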

45.2 Confidence Intervals and Hypothesis Tests

Using the distributional result of (45.6), a 100(1 − α)% confidence interval for µ1 − µ2 is

X̄ − Ȳ ± t∗ √(S1²/m + S2²/n)


where t∗ is the appropriate critical value tα/2,ν from the t-distribution with ν degrees of freedom
given by (45.7).
The usual hypothesis test is a test of whether there is a difference between the two means.

H0 : µ1 − µ2 = 0
Ha : µ1 − µ2 ≠ 0

The decision rule in this case is

Reject H0 if |X̄ − Ȳ| / √(S1²/m + S2²/n) > tα/2,ν

where α is the desired level of significance.

45.3 The Random Dot Stereogram Example

A random dot stereogram is shown to two groups of subjects and the time it takes for the subject
to see the image is recorded. Subjects in one group (VV) are told what they are looking for but
subjects in the other group (NV) are not. The quantity of interest is the difference in average
times. The relevant R code is below.

> r=read.csv(’randomdot.csv’)
> r
Time Treatment
1 47.20001 NV
2 21.99998 NV
3 20.39999 NV
4 19.70001 NV
.....................
75 1.40000 VV
76 1.20000 VV
77 1.10000 VV
78 1.00000 VV
> t.test(Time~Treatment,data=r)

Welch Two Sample t-test

data: Time by Treatment


t = 2.0384, df = 70.039, p-value = 0.04529
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.06493122 5.95314037
sample estimates:
mean in group NV mean in group VV
8.560465 5.551429

>

Homework.

1. Read Section 10.2

2. Do problems 10.23, 24, 30, 35.



46 Analysis of Paired Data

Goal: To make inferences about the difference of means using paired data

46.1 Setting

We consider the situation where we have two random samples X1 , . . . , Xn and Y1 , . . . , Yn but,
unlike the last section, where Yi is dependent on Xi . There are two slightly different situations
that produce such data.

1. The variables Xi and Yi are measurements on the same subject but with two different treatments. For example, we might give two different tests to each subject. This kind of design is often used in a pretest-posttest situation.
2. The variables Xi and Yi result from two different treatments given to subjects that are
“matched” in some way. For example, if we are measuring the germination rate of two
different brands of seeds, each of the n observations might be on a different set of growing
conditions so that in the ith observation, the seeds are given exactly the same conditions.

The formal assumptions that we will use for this situation are

1. The pairs (X1 , Y1 ), . . . , (Xn , Yn ) are a random sample.


2. The n differences Di = Xi − Yi are independent with mean µD and variance σD².
3. The distribution of the differences Xi − Yi is normal.

Note that if µ1 = E(Xi ) and µ2 = E(Yi ) then µD = µ1 − µ2 , the same parameter that we were
making inferences about in the last section.
Note 46.1. In the last section, we made use of the fact that V(X − Y) = V(X) + V(Y) if X and Y are independent. However, we no longer have this independence. In fact

V(D) = V(X − Y) = σ1² + σ2² − 2ρσ1σ2

where ρ is the correlation between X and Y . If ρ is positive, this means that pairing reduces
the variance.

46.2 Paired t Test

Since D1 , . . . , Dn is considered a random sample from a normal distribution, we can make inferences about µD using the methods of Sections 37 and 40.

46.2.1 Confidence Intervals

A 100(1 − α)% confidence interval for µD = µ1 − µ2 is given by

(X̄ − Ȳ) ± tα/2,n−1 sD/√n

where sD is the sample standard deviation of the differences Di .

46.2.2 Hypothesis Tests

The decision rule in testing the hypotheses

H0 : µD = 0
Ha : µD ≠ 0

is to reject H0 if

|T| = |X̄ − Ȳ| / (sD/√n) > tα/2,n−1
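
In R there is no separate command needed for the paired test. Assuming the paired measurements are in vectors x and y of the same length, either of the following should give the same t statistic, degrees of freedom, and p-value:

t.test(x - y)                  # one-sample t test on the differences
t.test(x, y, paired = TRUE)    # the paired= option of t.test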

46.3 An Example with R

Experts try to pick stocks that will increase in price. Perhaps choosing stocks at random would
do as well. The R analysis below compares the average percentage gain in the portfolio chosen
by experts with the average percentage gain of a portfolio chosen by dart throwing. The data
are paired by controlling for time. It is obvious that we should control for this as the stocks in
general increased in price more at some times than others. That is, much of the variation in
the performance of the darts or the experts is due to the variation in the market.

> dart
CONTEST.PERIOD PROS DARTS DJIA
1 1January-June1990 12.7 0.0 2.5
2 2February-July1990 26.4 1.8 11.5
3 3March-August1990 2.5 -14.3 -2.3
4 4April-September1990 -20.0 -7.2 -9.2
5 5May-October1990 -37.8 -16.3 -8.5
...
97 97January-June1998 24.4 3.2 15.0
98 98February-July1998 39.3 -10.1 7.1


99 99March-August1998 -18.8 -20.4 -13.1
100 100April-September1998 -20.1 -34.2 -11.8
> y=dart$PROS-dart$DART
> y
[1] 12.7 24.6 16.8 -12.8 -21.5 -5.9 12.3 17.0 41.4 9.0 -22.3 50.3
[13] -21.2 -27.3 -31.7 12.8 -25.6 26.2 8.2 53.3 39.5 24.1 -0.2 9.1
[25] -28.7 -28.4 -6.1 -25.3 12.0 -12.0 12.3 29.5 -42.3 6.5 -10.3 21.3
[37] -11.3 72.1 5.9 20.3 9.9 14.0 -40.0 9.3 2.2 13.9 15.9 -10.9
[49] -4.4 -26.9 32.7 -10.0 16.9 43.8 -21.1 17.6 -6.5 -18.6 10.7 1.8
[61] 60.9 14.0 15.3 71.7 -5.0 7.2 -29.7 30.5 -10.3 -49.0 -13.6 -9.5
[73] -3.9 -11.2 37.4 0.9 5.0 -1.1 34.3 3.0 23.0 6.9 -3.9 31.9
[85] 2.2 43.2 20.6 14.7 17.7 -10.0 -20.6 23.3 14.2 -7.8 27.9 -28.5
[97] 21.2 49.4 1.6 14.1
> t.test(y)

One Sample t-test

data: y
t = 2.6426, df = 99, p-value = 0.009563
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1.601077 11.250923
sample estimates:
mean of x
6.426

Homework.

1. Read Section 10.3.

2. Do problems 10.39, 42, 47.



47 Two Samples - Nonnormal Distributions

Goal: To develop two tests appropriate for the two independent sample setting where
the normality assumption is suspect.

47.1 Non-normality

Recall the two independent sample setting.

1. X1 , . . . , Xm is a random sample from a population with mean µ1 and variance σ12 .

2. Y1 , . . . , Yn is a random sample from a population with mean µ2 and variance σ22 .

3. The two samples are independent one from another.

In the case that the X and Y distributions are normal, the two-sample t gives an (approximate) solution to inference problems about µ1 − µ2 . If the distributions are not normal, this solution might still be (approximately) correct since the statistics that we are using, X̄ and Ȳ, are approximately normally distributed for large sample sizes (CLT). But with smaller
samples or distributions that are highly non-normal, we should be cautious. Two approaches
are presented in the next two sections.

47.2 Bootstrap

We proceed as in Section 39. We want to simulate the sampling distribution of X̄ − Ȳ. We do this by taking many (R) pairs of bootstrap samples, of size m from X1 , . . . , Xm and of size n from Y1 , . . . , Yn respectively, and finding the difference of the sample means for each pair. These R numbers give us an empirical estimate of the sampling distribution of X̄ − Ȳ. Various methods
can be used to generate a confidence interval from these.
Recall the notation from Section 39.
θ the parameter to be estimated
θ̂ the estimate of θ from the sample
θ̂i the estimate of θ from the ith bootstrap sample
The bootstrap percentile interval is given in Definition 39.3, reproduced here.

Definition 47.1. A 100(1 − α)% bootstrap (percentile) confidence interval for θ is (a, b), where a is the 100(α/2)th percentile of the set of bootstrap values θ̂i and b is the 100(1 − α/2)th percentile of these values.

Other intervals can be constructed. The book suggests the following methodology. Let sB be
the sample standard deviation of the bootstrap estimates of θ. Then we might use an interval of the form
θ̂ ± zα/2 sB or θ̂ ± tα/2,ν sB

The R bootstrap package gives four different confidence intervals. Besides the percentile inter-
val, it gives something called the basic interval. The basic interval is constructed as follows.
First, the intuition is that θ̂i − θ̂ is an estimate for θ̂ − θ. Let a and b be the percentiles
calculated from the definition above. Then the interval (a − θ̂, b − θ̂) is a confidence interval for
θ̂ − θ. Thus (2θ̂ − b, 2θ̂ − a) is a confidence interval for θ.
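
In this notation, a sketch of the basic interval: assuming thetahat is the estimate computed from the original sample and s is the sorted vector of 1000 bootstrap estimates (as in the session below), then

# basic bootstrap interval (2*thetahat - b, 2*thetahat - a)
basicci = c(2*thetahat - s[975], 2*thetahat - s[25])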

> rd=read.csv(’randomdot.csv’)
> rdvv=rd$Time[rd$Treatment=="VV"]
> rdnv=rd$Time[rd$Treatment=="NV"]
> m=length(rdvv)
> n=length(rdnv)
> t.test(rdvv,rdnv)

Welch Two Sample t-test

data: rdvv and rdnv


t = -2.0384, df = 70.039, p-value = 0.04529
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.95314037 -0.06493122
sample estimates:
mean of x mean of y
5.551429 8.560465

> b=replicate(1000,mean(sample(rdvv,m,replace=T))-mean(sample(rdnv,n,replace=T)))
> s=sort(b)
> diff=mean(rdvv)-mean(rdnv)
> serror=sd(b)
> tstar=qt(.975,70.035)
> zstar=qnorm(.975)
> normci=c(diff-zstar*serror,diff+zstar*serror)
> tci=c(diff-tstar*serror,diff+tstar*serror)
> percentci=c(s[25],s[975])
> normci
[1] -5.9088806 -0.1091910
> tci
[1] -5.95985906 -0.05821253


> percentci
[1] -5.9867762 -0.2838531
> hist(b)
> b=two.boot(rdvv,rdnv,mean,1000)
> boot.ci(b)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = b)

Intervals :
Level Normal Basic
95% (-5.834, -0.159 ) (-5.711, -0.047 )

Level Percentile BCa


95% (-5.972, -0.307 ) (-6.327, -0.498 )
Calculations and Intervals on Original Scale

[Histogram of the 1000 bootstrap differences b]

47.3 Permutation Tests

The permutation test gives a hypothesis test of the hypothesis that there is no difference in
the distributions. Suppose that there is no difference in the distributions. Then we could think
of the two groups as coming from one collection of n + m many individuals with m of these
arbitrarily labeled X and n arbitrarily labeled Y . Now given m + n many objects, there are

R = (n + m)!/(n! m!)
many ways of assigning the label X to m of these and Y to n of these. Now consider the
data x1 , . . . , xm , y1 , . . . , yn and the value of our test statistic t = x̄ − ȳ. This value of the
test statistic is only one of the R equally likely possible values that could have resulted after
labeling the individuals. The distribution of all R such values gives us an empirical estimate
of the distribution of the statistic under the null hypothesis. So our decision rule should be
to reject the null hypothesis if the actual value t that we observed is one of the 100α% most
extreme values of this distribution.
While we don’t need any distributional assumption to apply this test and achieve the desired
level of significance (i.e., guarantee a fixed probability of a Type I error), we do need some
assumptions about the distributions of X and Y in the case that H0 is false so that a sensible
test results. For example, we might suppose for an alternate hypothesis that the distributions
of X and Y have the same shape but that µX > µY .
The R package DAAG does permutation tests. The reported p-value is a two-sided p-value.

> twotPermutation(rdvv,rdnv,1000)
[1] 0.056
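
For comparison, here is a minimal sketch of a Monte Carlo version of the permutation test written directly in R; it samples 1000 of the R possible relabelings rather than enumerating all of them (the function name is our own):

perm.test = function(x, y, nperm=1000) {
  pooled = c(x, y); m = length(x)
  tobs = mean(x) - mean(y)                    # observed value of the test statistic
  tperm = replicate(nperm, {
    lab = sample(length(pooled), m)           # which observations get the label X
    mean(pooled[lab]) - mean(pooled[-lab])
  })
  mean(abs(tperm) >= abs(tobs))               # estimated two-sided p-value
}
# perm.test(rdvv, rdnv) should give a p-value comparable to the DAAG result above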

Homework.

1. Read pages 520–524.

2. Do problems 10.78,79,80,83,84

48 The Difference of Two Proportions

Goal: To make inferences about the difference of two proportions

48.1 The Setting

1. Suppose X ∼ Bin(m, p1 ) and Y ∼ Bin(n, p2 ) and that X and Y are independent.

2. Let p̂1 = X/m and p̂2 = Y /n.


3. Then p̂1 − p̂2 is approximately normal with mean p1 − p2 and variance p1(1 − p1)/m + p2(1 − p2)/n.
4. So the following rv is approximately standard normal.
[(p̂1 − p̂2) − (p1 − p2)] / √(p1(1 − p1)/m + p2(1 − p2)/n)        (48.8)

48.2 Confidence Intervals

As usual, we find the confidence interval by estimating the standard error. If we use p̂1 and p̂2
in the denominator of (48.8), we get the following interval
(p̂1 − p̂2) ± zα/2 √(p̂1(1 − p̂1)/m + p̂2(1 − p̂2)/n)
Example 48.1. Two free throw shooters for a certain Michigan college’s basketball team have
made 67 out of 77 and 59 out of 79 free throws respectively for percentages of p̂1 = .87 and
p̂2 = .747. A confidence interval for the difference of percentages is

> prop.test(c(67,59),c(77,79),correct=F)

2-sample test for equality of proportions without continuity correction

data: c(67, 59) out of c(77, 79)


X-squared = 3.8163, df = 1, p-value = 0.05076
alternative hypothesis: two.sided
95 percent confidence interval:
0.001509787 0.245079067
sample estimates:
prop 1 prop 2
0.8701299 0.7468354

> p1hat=67/77
> p2hat=59/79
> d=p1hat-p2hat
> s=sqrt( (p1hat*(1-p1hat)/77 + p2hat*(1-p2hat)/79 ))
> z=qnorm(.975)
> d-z*s
[1] 0.001509787
> d+z*s
[1] 0.2450791

As in the one sample case, the confidence interval can be improved by adding four pseudo-observations. Instead of using p̂1 and p̂2 in the confidence interval above, we use p̃1 = (x + 1)/(m + 2) and p̃2 = (y + 1)/(n + 2).

> prop.test(c(68,60),c(79,81),correct=F)

2-sample test for equality of proportions without continuity correction

data: c(68, 60) out of c(79, 81)


X-squared = 3.6006, df = 1, p-value = 0.05776
alternative hypothesis: two.sided
95 percent confidence interval:
-0.002193035 0.242230541
sample estimates:
prop 1 prop 2
0.8607595 0.7407407

48.3 Hypothesis Tests

We can use the normal approximation to develop a hypothesis test as well. Here the hypotheses
are

H0 : p1 − p2 = 0
Ha : p1 − p2 ≠ 0

If H0 is true, then p1 = p2 = p and p̂1 − p̂2 has variance p(1 − p)(1/m + 1/n). Therefore, we estimate the variance by using both x and y to give an estimate for p: p̂ = (x + y)/(n + m). Then the test statistic is

Z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/m + 1/n))

and the test is to

Reject H0 if |Z| > zα/2
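
A sketch of this computation for the free-throw data of Example 48.1; note that Z² is the X-squared value that prop.test reported there:

p1hat = 67/77; p2hat = 59/79
phat = (67 + 59)/(77 + 79)                            # pooled estimate of p under H0
Z = (p1hat - p2hat)/sqrt(phat*(1 - phat)*(1/77 + 1/79))
Z^2                                                   # about 3.816, the X-squared from prop.test
2*(1 - pnorm(abs(Z)))                                 # two-sided p-value, about 0.05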

As with confidence intervals, the test is improved if we use p̃1 , p̃2 , and p̃ = (x + y + 2)/(n + m + 4).

Homework.

1. Read Section 10.4.

2. Do 10.54, 10.56.

49 Fitting Functional Models

Goal: To describe bivariate data by a functional relationship, especially a linear one

49.1 Setting

1. Given: bivariate data: (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ).

2. Problem: fit a function y = f (x) to the data. The function f usually comes from some
fixed family of functions (e.g., polynomials). This general problem is called regression.

3. Method: choose the function to minimize the error in approximation.

4. Fit: the residuals are ei = yi − f (xi ). We measure the (lack of) fit by some increasing
function of the absolute value of the residuals.

49.2 Least Squares, Linear Regression

1. Problem: fit a line y = a + bx to the data.

2. Least Squares Solution: Choose a and b to minimize the sums of squares of the residuals
(denoted SSE where E is for error)
SSE = Σᵢ₌₁ⁿ (yi − (a + bxi))² = Σᵢ₌₁ⁿ eᵢ²

3. Notation:
Sxx = Σᵢ₌₁ⁿ (xi − x̄)²        sx² = Sxx/(n − 1)

SST = Syy = Σᵢ₌₁ⁿ (yi − ȳ)²        sy² = Syy/(n − 1)

Sxy = Σᵢ₌₁ⁿ (xi − x̄)(yi − ȳ)

4. Solution: the following choices for a and b minimize SSE

b = Sxy/Sxx        a = ȳ − bx̄
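
A sketch of these formulas in R, assuming the data are in vectors x and y; the result should agree with coef(lm(y ~ x)).

Sxy = sum((x - mean(x))*(y - mean(y)))
Sxx = sum((x - mean(x))^2)
b = Sxy/Sxx               # least squares slope
a = mean(y) - b*mean(x)   # least squares intercept
c(a, b)                   # compare with coef(lm(y ~ x))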

49.3 Linear Regression in R

Example 49.1. A certain class took three tests and a final exam. There were 32 students.
The scores on the first test are used to predict the scores on the final exam.

> m
Test1 Test2 Test3 Exam
1 98 100 98 181
2 93 91 89 168
3 100 99 99 193
.........................
30 100 96 97 193
31 74 87 72 109
32 85 100 95 173
> l=lm(Exam~Test1,data=m)
> l

Call:
lm(formula = Exam ~ Test1, data = m)

Coefficients:
(Intercept) Test1
23.965 1.604

> plot(m$Test1,m$Exam)
> abline(l)


[Scatterplot of Exam score against Test1 score with the fitted regression line]

49.4 Interpreting the Regression Line

1. The correlation coefficient r is defined by

r = Sxy / √(Sxx Syy)

2. The regression equation can be rewritten as


   
(y − ȳ)/sy = r (x − x̄)/sx

3. Intercept: the regression line passes through the point (x̄, ȳ).

4. Slope: the slope of the regression line is r if the variables are expressed in standardized
units.

49.5 Analysis of Variance

1. For a given xi , we write ŷi = a + bxi .

2. Then
SSE = Σᵢ₌₁ⁿ (yi − ŷi)²

3. Define the regression sum of squares SSR by


SSR = Σᵢ₌₁ⁿ (ŷi − ȳ)²

4. Then
SST = SSR + SSE

5. Define the coefficient of determination, R2 , by

R² = 1 − SSE/SST = SSR/SST

6. Note that R2 = r2 .

7. R² is usually read as a percentage and is called the percentage of variation in y “explained” by x.

> anova(l)
Analysis of Variance Table

Response: Exam
Df Sum Sq Mean Sq F value Pr(>F)
Test1 1 12303.4 12303.4 34.571 1.952e-06 ***
Residuals 30 10676.6 355.9
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
> cor(m$Test1,m$Exam)^2
[1] 0.5353956

Homework.

1. A dataset containing some statistics on baseball teams competing in the 2003 American
League baseball season can be found on the data page of the course website. Suppose
that you want to predict the number of runs scored (R) by a team just from knowing how
many home runs (HR) the team has.
(a) Write the linear regression of R on HR.
(b) Compute the predicted values for each of the teams. (Use predict(l) in R.) Make
some comments on the fit. (For example, are there any values not particularly
well-fit? Do you have any explanations for that?)
2. Suppose that we wish to fit a linear function without a constant: i.e., y = bx. Write down
an expression for the value of b that minimizes the sums of squares of residuals in this
case. (Hint: there is only one variable here, b, so this is a straightforward Mathematics 161
max-min problem.) R will compute b in this case as well with the command lm(y∼x-1).
In this expression, 1 stands for the constant term and -1 therefore means leave it out.
Alternatively we can write lm(y∼x+0).
3. Refer to the same baseball dataset as the first problem.
(a) Write the least squares line for W (wins) as a function of R (runs).
(b) A better model takes into account the runs that a team’s opponent has scored as
well. Write W − L as a function of R − OR (here L is losses and OR is opponents
runs scored).
(c) Why might it make sense from the meaning of the variables W−L and R−OR to
use a linear model without a constant term as in problem 2? Write W−L as a linear
function of R−OR without a constant term and compare the fit to part (b).

50 The Linear Algebra Approach

Goal: To extend least squares regression to multiple predictors using linear algebra

50.1 Setting

1. Suppose that we have k explanatory variables and 1 response variable.

2. The data points are denoted (xi1 , . . . , xik , yi ) for 1 ≤ i ≤ n.

3. We wish to fit a linear function y = b0 + b1 x1 + · · · + bk xk to the data.

4. Let ei = yi − (b0 + b1 xi1 + · · · + bk xik ). These are the residuals.

5. The least squares solution is to choose b0 , . . . , bk to minimize

         SSE = Σ_{i=1}^n ei²

50.2 Linear Algebra

We could solve this least squares problem by noting that SSE is a function of the k + 1 variables
b0 , . . . , bk and finding the minimum using calculus. Instead we use a geometric approach.
Define the vector y = (y1 , . . . , yn )′, the vector b = (b0 , . . . , bk )′, and let X be the n × (k + 1)
matrix whose ith row is (1, xi1 , . . . , xik ).
The vector of residuals is then e = y − Xb and our goal is to minimize the square of the length
of this vector:
||e||² = ||y − Xb||² = (y − Xb)′(y − Xb)
Geometrically, this amounts to the following set of observations.

1. The set of all possible vectors Xb as b ranges over all vectors in R^(k+1) is a (k + 1)-dimensional
   subspace V of R^n .

2. The vector b that we seek is such that Xb is the vector in this subspace that is closest
   to the vector y.

3. To find this vector, we need to find the projection of y onto the subspace V .

4. That projection is given by finding the unique b that satisfies

         X′y = X′Xb

5. Assuming that X′X has an inverse, we get that

         b = (X′X)⁻¹X′y

6. For this choice of b the vector ŷ = Xb of predicted values is

         ŷ = X(X′X)⁻¹X′y

7. The matrix on the right hand side of this equation is called the “hat matrix” H. Thus

         H = X(X′X)⁻¹X′

We can do the sum of squares analysis as in the last section. In vector notation, if we let ȳ be
the vector with n identical components ȳ, we have

SST = ||y − ȳ||2 SSE = ||y − ŷ||2 SSR = ||ŷ − ȳ||2

Then
SST = SSR + SSE
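Here is a short R sketch of the formula b = (X′X)⁻¹X′y, assuming the test-score data frame m used in Section 49 is available; it reproduces the coefficients that lm computes in the example below and also forms the hat matrix.

y = m$Exam
X = cbind(1, m$Test1, m$Test2, m$Test3)    # the n x (k+1) matrix X
b = solve(t(X) %*% X, t(X) %*% y)          # solves the normal equations X'X b = X'y
b                                          # same as coef(lm(Exam~Test1+Test2+Test3, data=m))
H = X %*% solve(t(X) %*% X) %*% t(X)       # the hat matrix
yhat = H %*% y                             # the vector of predicted values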

50.3 An example in R

Example 50.1. Suppose that we wish to predict the final exam score for a class of 32 students
from the three tests.

> m
Test1 Test2 Test3 Exam
1 98 100 98 181
2 93 91 89 168
.........................
31 74 87 72 109
32 85 100 95 173
> l=lm(Exam~Test1+Test2+Test3,data=m)
> l

Call:
lm(formula = Exam ~ Test1 + Test2 + Test3, data = m)

Coefficients:
(Intercept) Test1 Test2 Test3
-28.6004 0.4996 0.5901 1.0643

> anova(l)
Analysis of Variance Table

Response: Exam
Df Sum Sq Mean Sq F value Pr(>F)
Test1 1 12303.4 12303.4 56.9280 3.221e-08 ***
Test2 1 1209.4 1209.4 5.5958 0.0251634 *
Test3 1 3415.8 3415.8 15.8051 0.0004487 ***
Residuals 28 6051.4 216.1

Homework.

1. Use the 2003 American League baseball data to write runs (R) as a linear function of hits
(H), doubles (2B), triples (3B), and homeruns (HR). From what you know about scoring
in baseball, do the coefficients make sense?

2. The website also has data on baseball teams for the five seasons 1994-1998. Repeat the
analysis of the preceding problem for these data. Are your results consistent?

3. For some linear algebra exercise, write the hat matrix for the data (1, 2), (3, 4) and (5, 5).

51 Nonlinear Curve Fitting

Goal: To extend the least squares approach to nonlinear functions

51.1 Setting

1. Suppose that we have k explanatory variables and 1 response variable.

2. The data points are denoted (xi1 , . . . , xik , yi ) for 1 ≤ i ≤ n.

3. We wish to fit a possibly nonlinear function y = f (x; b1 , . . . , bl ) which depends on l
   parameters b1 , . . . , bl to the data.

4. Let ei = yi − f (xi1 , . . . , xik ; b1 , . . . , bl ). These are the residuals.

51.2 Approach 1 - Linearize

Example 51.1. Suppose we wish to fit a function y = ae^(bx) to the data. This equation
transforms to
ln y = ln a + bx

We then can use standard linear regression with the data (xi , ln yi ). This returns ln a and b.
Note however that this choice of a, b does not minimize the sums of the squares of the residuals
ei = yi − ae^(bxi). Rather, it minimizes the sums of squares of ln yi − (ln a + bxi ). In a given
application, it might not be so clear that this is desirable.

Note that in the above example, though ln y is nonlinear in y, linear regression finds the co-
efficients a and b. Generalizing this example, suppose that f is a possibly nonlinear function
of one variable x that depends on two unknown parameters a and b. The goal is to transform
the data (x, y) to (g(x), h(y)) so that the equation y = f (x) is equivalent to h(y) = a0 + b0 g(x)
where a0 , b0 are known functions of a and b.
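For instance, here is a small sketch (with simulated, hypothetical data) of the linearizing approach for y = ae^(bx): fit ln y = ln a + bx with lm and exponentiate the intercept to recover a.

set.seed(1)
x = seq(0, 5, by = 0.25)
y = 2 * exp(0.7 * x) * exp(rnorm(length(x), sd = 0.1))   # true a = 2, b = 0.7
fit = lm(log(y) ~ x)
exp(coef(fit)[1])    # estimate of a
coef(fit)[2]         # estimate of b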

51.3 Approach 2 - Nonlinear Least Squares

Example 51.2. Continuing Example 51.1, suppose we wish to fit y = ae^(bx) to data by mini-
mizing the sums of the squares of the residuals ei = yi − ae^(bxi). This is a problem in minimizing
a nonlinear function of two variables. Usually this requires an iterative method to approximate
the solution.

51.4 The Approaches Compared in R

Example 51.3. The R dataset trees contains the height, girth, and volume of 31 cherry trees.
It is useful to use the girth of a tree to predict the volume of the tree. But the relationship is
nonlinear. Suppose that we assume the relationship has the form

V = aG^b

This is not unreasonable as we might expect volume varies as the square of the girth. Linearizing
gives
ln V = ln a + b ln G
Regression yields ln a = −2.353 (a = .095) and b = 2.20. On the other hand, minimizing the
sums of squares of residuals directly gives a = .087 and b = 2.24. SSE = 313.75 when minimized
and SSE = 317.85 when linearized.

> trees
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
......................
30 18.0 80 51.0
31 20.6 87 77.0
> attach(trees)
> lg=log(Girth)
> lv=log(Volume)
> l=lm(lv~lg)
> l

Call:
lm(formula = lv ~ lg)

Coefficients:
(Intercept) lg
-2.353 2.200

> nls(Volume~A*Girth^B,start=list(A=.2, B=2.2))


Nonlinear regression model
model: Volume ~ A * Girth^B
data: parent.frame()
A B

0.08661001 2.23638558
residual sum-of-squares: 313.7535
> p=predict(l)
> sum((Volume-exp(p))^2)
[1] 317.8461
>

Homework.

1. Find a transformation that transforms the following nonlinear equations y = f (x) (that
depend on parameters a and b) to linear equations g(y) = a0 + b0 h(x).
(a) y = a/(b + x)

(b) y = x/(a + bx)

(c) y = 1/(1 + ae^(bx))
2. The R dataset Puromycin gives the rate of reaction (in counts/min/min) as a function
of the concentration of an enzyme (in ppm) for two different substrates - one treated with
Puromycin and one not treated. The biochemistry suggests that these two variables are
related by
         rate = a · conc/(b + conc)
Find the least squares estimates of a and b for the treated condition by both of the
methods suggested in this section and compare the sums of squares of residuals.

52 The Linear Model

Goal: To provide a statistical model for the linear regression setting

52.1 The Linear Model

1. Given data (x1 , y1 ), . . . , (xn , yn ) with x an explanatory variable and y a response variable.

2. A statistical model.

         Y = β0 + β1 x + ε

   where

   (a) ε is a random variable with mean 0 and variance σ²
   (b) β0 , β1 , σ² are (unknown) parameters
   (c) normality assumption: when we wish to write confidence intervals or hypothesis tests
       we sometimes assume that ε is normal

3. The data (x1 , y1 ), . . . , (xn , yn ) are assumed to arise by choosing a random sample ε1 , . . . , εn .
   The random variables ε1 , . . . , εn are thus assumed to be independent and all have the same
   distribution as ε. Therefore Yi is a random variable (and xi is not) and

         Yi = β0 + β1 xi + εi

4. Note that
         µY|x = β0 + β1 x        σ²Y|x = σ²

52.2 Estimating model parameters

The least squares estimates provide reasonable estimates for β0 and β1 . In particular, let
         β̂1 = Σ_{i=1}^n (xi − x̄)Yi / Σ_{i=1}^n (xi − x̄)²        β̂0 = Ȳ − β̂1 x̄

Proposition 52.1. Assuming only that E(εi ) = 0 for all i, β̂0 and β̂1 are unbiased estimators
of β0 and β1 respectively.

Note that the proposition implies that Ŷi = β̂0 + β̂1 xi is an unbiased estimator of β0 + β1 xi .

Proposition 52.2. Assuming that E(εi ) = 0, Var(εi ) = σ², and that the random variables εi
are independent,

   1. Var(β̂1 ) = σ² / Σ(xi − x̄)²

   2. Var(β̂0 ) = σ² Σxi² / (n Σ(xi − x̄)²) = σ² (1/n + x̄²/Σ(xi − x̄)²)
Theorem 52.3 (Gauss-Markov Theorem). Assume that E(εi ) = 0, Var(εi ) = σ², and the
random variables εi are independent. Then the estimators β̂0 and β̂1 are the unbiased estimators
of minimum variance among all unbiased estimators that are linear in the random variables Yi .
(We say that these estimators are BLUE.)

In the case that the errors εi are normally distributed, we have

Theorem 52.4. Assume that E(εi ) = 0, Var(εi ) = σ², and the random variables εi are indepen-
dent and normally distributed. Then β̂0 and β̂1 are normally distributed and are the maximum
likelihood estimators of β0 and β1 .
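A simulation sketch (with hypothetical true values β0 = 1, β1 = 2, σ = 0.5) illustrating Propositions 52.1 and 52.2: the average of many simulated values of β̂1 is close to 2, and their variance is close to σ²/Σ(xi − x̄)².

set.seed(1)
x = seq(0, 1, length = 10)
beta1.hat = replicate(2000, {
  y = 1 + 2 * x + rnorm(length(x), sd = 0.5)
  sum((x - mean(x)) * y) / sum((x - mean(x))^2)   # the estimator beta1-hat
})
mean(beta1.hat)                  # approximately 2 (unbiasedness)
var(beta1.hat)                   # compare with the next line
0.5^2 / sum((x - mean(x))^2)     # the variance from Proposition 52.2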

Homework.

1. Read pages 600-606 and 611-615.


2. Obviously, we would love for Var(β̂0 ) and Var(β̂1 ) to be small since we would like to
estimate the parameters accurately. We have no control over σ 2 but sometimes the ex-
perimenter can choose the points xi .
(a) Suppose that the experimenter must choose n = 10 values for x and these values
must lie in the interval [−1, 1]. How should these be chosen to minimize Var(β̂1 )?
(b) Though your strategy in (a) minimizes the variance of the estimate for the slope, it
might not be the best choice of values for x. Give a reason that you might not want
to choose the values for x in this way.
3. Do problem 12.24a.
4. If y is approximately linearly related to x then x is approximately linearly related to y.
For this problem use the builtin dataframe state.x77. (For some reason, this dataframe
is loaded and defined by data(state) instead of what you might think.)
(a) Write Life.Exp (life expectancy) as a linear function of Murder (murder rate).
(b) Write Murder as a linear function of Life.Exp.
(c) Plot the data and the two lines on one plot. (It is easy to get R to plot the data
and one of the lines. You’ll have to work a bit to get R to plot the second line on
the same plot. Else, plot it by hand!)
(d) Why exactly aren’t the two lines the same?

53 Inferences Concerning the Betas

Goal: To construct confidence intervals and tests for the parameters β0 , β1

53.1 Estimating σ²

1. SSE = Σ (Yi − Ŷi )² = Σ (Yi − (β̂0 + β̂1 xi ))²

2. Note that Yi − Ŷi is an estimator for εi = Yi − (β0 + β1 xi ).

Proposition 53.1. Under the same assumptions as the Gauss-Markov Theorem, the estimator

         MSE = SSE/(n − 2)

is an unbiased estimator of σ².

Proposition 53.2. With the additional assumption of normality, the estimator SSE/n is the
maximum likelihood estimator for σ². Also SSE/σ² has a chi-squared distribution with (n − 2)
degrees of freedom.

In R, the square root of MSE is referred to as the residual standard error.

Call:
lm(formula = Exam ~ Test1, data = m)

Residuals:
Min 1Q Median 3Q Max
-33.6930 -10.1574 -0.9462 8.5918 44.0759

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.9652 22.7916 1.051 0.301
Test1 1.6044 0.2729 5.880 1.95e-06 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 18.86 on 30 degrees of freedom


Multiple R-Squared: 0.5354, Adjusted R-squared: 0.5199
F-statistic: 34.57 on 1 and 30 DF, p-value: 1.952e-06

We need one more fact about SSE (without proof): each of β̂0 and β̂1 is independent of SSE.

53.2 Inferences About the Slope

Recall that with the normality assumption

β̂1 ∼ N (β1 , σ 2 /Sxx )

The following proposition is the key to inference. Since σ² is unknown, we replace σ² by its
unbiased estimate MSE, which we abbreviate as S². We then estimate the variance σ²β̂1 of β̂1
by s²β̂1 = S²/Sxx .

Proposition 53.3. With the normality assumption, the random variable

         T = (β̂1 − β1 )/(S/√Sxx ) = (β̂1 − β1 )/sβ̂1

has a t distribution with n − 2 degrees of freedom.

Thus, a 100(1 − α)% confidence interval for β1 is

         β̂1 ± tα/2,n−2 · sβ̂1

In R, confint returns the confidence intervals if you are too lazy to do the multiplication.

> confint(l,c("Test1"),level=.90)
5 % 95 %
Test1 1.141289 2.067571
> confint(l)
2.5 % 97.5 %
(Intercept) -22.581545 70.511925
Test1 1.047144 2.161716
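The Test1 interval can be verified from the summary output earlier in this section: the estimate is 1.6044 with standard error 0.2729 on 30 degrees of freedom, so

1.6044 + c(-1, 1) * qt(0.975, 30) * 0.2729   # about (1.047, 2.162), as above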

We can also test hypotheses about β1 . The natural hypothesis to check is that β1 = 0.

         H0 : β1 = 0        Ha : β1 ≠ 0

Using the proposition, the obvious test statistic is T = β̂1 /sβ̂1 and the decision rule is

         Reject H0 if |T | > tα/2,n−2 where α is the desired level of significance.

R computes the t value and the p value for this very test.

53.3 Inferences about β0

Similarly.

Homework.

1. Read pages 626–635.

2. Do problems 12.32, 36.



54 More Inferences in Regression

Goal: To make inferences about quantities related to regression

54.1 Inferences About the Mean of Y given x

Given a particular x value, x = x∗ , the model says that E(Y ) = β0 + β1 x∗ . Obviously Ŷ =


β̂0 + β̂1 x∗ is an unbiased estimator of E(Y ). We have

         Var(Ŷ ) = σ² (1/n + (x∗ − x̄)²/Σ(xi − x̄)²)

Then, if we assume that the errors are normally distributed, Ŷ is normally distributed. As
before, if we let s be the residual standard error (√MSE), we have the following confidence
interval.

Proposition 54.1. With the normality assumption, a 100(1 − α)% confidence interval for
β0 + β1 x∗ is

         β̂0 + β̂1 x∗ ± tα/2,n−2 · s √(1/n + (x∗ − x̄)²/Sxx )

54.2 Prediction Intervals

Given a value of x∗ for which we want to predict a future value of Y , it is natural to predict Y
to be Ŷ = β̂0 + β̂1 x∗ . How close is Y likely to be to Ŷ ?
We have Y − Ŷ = β0 + β1 x∗ + ε − (β̂0 + β̂1 x∗ ). Then, since Y and the variables Yi are independent,

         Var(Y − Ŷ ) = Var(Y ) + Var(Ŷ ) = σ² + σ² (1/n + (x∗ − x̄)²/Sxx )

Thus we have that

         T = (Y − Ŷ ) / (S √(1 + 1/n + (x∗ − x̄)²/Sxx ))

has a T distribution with n − 2 degrees of freedom. This leads to the notion of prediction
interval.
Proposition 54.2. A 100(1 − α)% prediction interval for a future value of Y given x = x∗ is
         β̂0 + β̂1 x∗ ± tα/2,n−2 · s √(1 + 1/n + (x∗ − x̄)²/Sxx )

Example 54.3. Almost every senior at a certain college in Western Michigan took the ACT
test at some point. The ACT test might be helpful in predicting their eventual GPA. The
relevant analysis is as follows.

> sr=read.csv(’sr.csv’)
> l=lm(GPA~ACT,data=sr)
> summary(l)

Call:
lm(formula = GPA ~ ACT, data = sr)

Residuals:
Min 1Q Median 3Q Max
-1.46828 -0.20360 0.07051 0.28510 0.87986

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.358432 0.081947 16.58 <2e-16 ***
ACT 0.071558 0.003087 23.18 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.4032 on 1102 degrees of freedom


(229 observations deleted due to missingness)
Multiple R-Squared: 0.3278, Adjusted R-squared: 0.3272
F-statistic: 537.5 on 1 and 1102 DF, p-value: < 2.2e-16

> anova(l)
Analysis of Variance Table

Response: GPA
Df Sum Sq Mean Sq F value Pr(>F)
ACT 1 87.391 87.391 537.49 < 2.2e-16 ***
Residuals 1102 179.175 0.163
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
> x=data.frame(ACT=c(14:36))
> ci=predict(l,x,interval="confidence")
> ci
fit lwr upr
1 2.360240 2.282283 2.438197

2 2.431797 2.359583 2.504011


.............................
22 3.862951 3.804896 3.921006
23 3.934508 3.870881 3.998135
> pi=predict(l,x,interval="prediction")
> pi
fit lwr upr
1 2.360240 1.565233 3.155247
2 2.431797 1.637333 3.226262
.............................
22 3.862951 3.069648 4.656253
23 3.934508 3.140778 4.728238
> plot(sr$ACT,sr$GPA)
> abline(l)
> lines(x$ACT,ci[,2])
> lines(x$ACT,ci[,3])
> lines(x$ACT,pi[,3])
> lines(x$ACT,pi[,2])
[Scatterplot of sr$GPA against sr$ACT showing the fitted regression line together with the confidence bands (ci) and the wider prediction bands (pi).]

54.3 The Problem of Multiple Confidence Intervals

Suppose that we wish to announce 95% confidence intervals Iβ0 for β0 and Iβ1 for β1 . What is
the probability that both confidence intervals will contain the respective parameters? It is not
95%. In fact, we know that

.90 ≤ P (β0 ∈ Iβ0 and β1 ∈ Iβ1 ) ≤ .95

but we cannot say more unless we know the joint distribution of β̂0 and β̂1 . It is important to
note that β̂0 and β̂1 are not independent.
In the presence of the normality assumption however, the distribution of β̂0 and β̂1 is known.
It is a bivariate normal distribution. (We will establish this later.) Therefore we can use the
same techniques that are used to find individual confidence intervals to find a joint confidence
region for the pair (β0 , β1 ). In general, this region will be an ellipse. As always, R will do the
computation. An additional package, ellipse, is needed.
The command plot(ellipse(l),type="l") plots the 95% confidence ellipse.
[Plot of the 95% joint confidence ellipse for the pair (Intercept, ACT).]

Homework.

1. Read Section 12.4.

2. Do problems 12.48, 50, 52.



55 Regression Diagnostics

Goal: To investigate how to determine whether the regression model is appropriate

55.1 Residuals

The residuals ei = yi − ŷi are estimates of the values of the random variables εi . That is, they
are estimates of quantities that are supposed to be a random sample from a distribution with
mean 0 and variance σ². So the behavior of the ei might give us information concerning
the aptness of the model for the εi . Define the residual random variable Ei = Yi − Ŷi .

Proposition 55.1. The random variables Ei satisfy

         E(Ei ) = 0        Var(Ei ) = σ² (1 − 1/n − (xi − x̄)²/Sxx )

For ease of notation, define hi = 1/n + (xi − x̄)²/Sxx . Then Var(Ei ) = σ²(1 − hi ). The number hi
is called the leverage of the data point (xi , yi ) (and depends only on xi ). If the leverage hi is
large, this forces Var(Ei ) to be small. That is, it forces the line to be close to the data point.

Definition 55.2. The standardized (Studentized) residuals are given by

         e∗i = (yi − ŷi ) / (s √(1 − 1/n − (xi − x̄)²/Sxx )) = (yi − ŷi ) / (s √(1 − hi ))

Proposition 55.3. With the normality assumption, e∗i has a t distribution with n − 2 degrees
of freedom.

55.2 Outliers and Influential Observations

An influential observation is one that has a large effect on the fit. We let a subscript of (i)
denote the value we get from a fit that omits the point (xi , yi ). Thus β̂0(i) denotes the value
of β̂0 when the point (xi , yi ) is removed. Also ŷj(i) denotes the predicted yj when this point is
removed. Here are some measures of influence.

1. Changes in the coefficients β̂ − β̂(i) .

2. Changes in the fit ŷj(i) − ŷj



Since these depend on the scale of the observations, a popular alternative is Cook's distance.
Define Di = Σj (ŷj − ŷj(i) )²/(2s²). A large Di means that removing (xi , yi ) changes the values
of the fitted ŷj . It can be shown that

         Di = (e∗i²/2) (hi /(1 − hi ))
Thus the point (xi , yi ) has a large influence on the regression if it has a large residual and/or a
large leverage.
R computes changes in the coefficients with dfbeta and Cook’s distances with cooks.distance.
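The standardized residuals and leverages themselves are available as rstandard(l) and hatvalues(l), so the formula above is easy to check for a simple regression (two coefficients), for example the rust regression fit below:

estar = rstandard(l)
h = hatvalues(l)
all.equal(as.numeric(cooks.distance(l)),
          as.numeric((estar^2 / 2) * (h / (1 - h))))   # should be TRUE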

55.3 Plots

In order to test whether the linear model is appropriate, it is helpful to consider various plots
of ei , e∗i , and related quantities.

> r=read.csv(’rust.csv’)
> r
X Fe loss
1 1 0.01 127.6
2 2 0.48 124.0
.................
12 12 1.44 91.4
13 13 1.96 86.2
> l=lm(loss~Fe,data=r)
> dfbeta(l)
(Intercept) Fe
1 -0.513041678 0.37126101
2 0.878336310 -0.44131427
3 -0.214333224 0.06010562
4 -0.216789023 -0.04483002
5 0.009335614 0.01821324
6 0.145920493 -0.10559491
7 0.572452966 -0.28762521
8 0.032692365 -0.33201424
9 0.040762667 -0.01143110
10 -0.123753265 0.26383717
11 -0.407607731 0.29496406
12 0.042845440 -0.43512593
13 -0.435508297 0.92848684
> cooks.distance(l)

1 2 3 4 5 6
0.0668861786 0.2284452990 0.0193150251 0.0461479881 0.0005440446 0.0054108272
7 8 9 10 11 12
0.0970376280 0.0796340512 0.0006986207 0.0291503547 0.0422197749 0.1367777095
13
0.3610134495
> par(mfcol=c(2,3))
> plot.lm(l,which=c(1:6))

[The six diagnostic plots for the fit l: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance, Residuals vs Leverage, and Cook's distance vs Leverage.]

The six plots that R produces need some explanation.

1. Residuals vs Fitted. The residuals ei are plotted against ŷi . (Since the ŷi are linearly
related to the xi , we could as well have plotted ei against xi .) The resulting plot could
uncover such problems as nonlinearity, dependence, and heteroskedasticity.

2. Scale - Location. The square root of the absolute value of the standardized residuals,
   √|e∗i|, is plotted against ŷi . Standardized residuals have the same standard deviation if the
   model is correct. Absolute values are used to help detect heteroskedasticity. Square roots
   are used since square roots of absolute values of normal data tend to be less skewed than
   the data itself.

3. Normal Q-Q. This is a normal quantile-quantile plot of the standardized residuals. This
plot should be roughly linear if the normality assumption is correct.

4. Cook’s Distance. This is a plot of Cook’s distance against the index of the data point.
Cases with large Cook’s distances should be examined as they have large influence on the
regression.

5. Residuals versus Leverage. The standardized residuals e∗i are plotted against the lever-
ages hi . This contains the information needed to compute Cook’s distances and curves of
constant Cook’s distance are superimposed on the plot.

6. Cook’s Distance versus Leverage. The Cook’s distances are plotted against the
leverages and the lines of constant standardized residual are drawn.

Homework.

1. Read Section 12.6.

2. Do problems 12.72, 74.

3. Use the builtin states dataset again. (Recall that you retrieve the data by data(state)
and then s=data.frame(state.x77). The variable s will now be a dataframe with the
desired data.)

(a) Write life expectancy as a linear function of illiteracy rate.


(b) Plot the residuals. Are there any features of the residual plot that indicate a violation
of the linear model assumptions?
(c) Which observation has the most influence over the coefficients in the regression
equation?

56 The Multivariate Linear Regression Model

Goal: To investigate a statistical model in the case of several independent variables

56.1 The Setting

1. k independent variables x1 , . . . , xk . One dependent variable y.

2. n data points: (x11 , . . . , x1k , y1 ), . . . , (xi1 , . . . , xik , yi ), . . . , (xn1 , . . . , xnk , yn ).

3. The standard linear statistical model (lm in R)

   (a) Yi = β0 + β1 xi1 + · · · + βk xik + εi (So β0 + β1 x∗1 + · · · + βk x∗k is the mean value of Y
       for a fixed tuple (x∗1 , . . . , x∗k ) of independent variables.)
       Recall that we can write this model as Y = Xβ + ε.
   (b) The errors εi have mean 0, variance σ², and are independent.
   (c) And, to make inferences, we sometimes assume that the random variables εi have
       normal distributions.
normal distributions.

The Gauss-Markov Theorem is true in this case as well.

Theorem 56.1 (Gauss-Markov). The least squares estimator β̂ of β is the minimum variance
unbiased estimator of β among all linear estimators of β.

If the normality assumption holds, β̂ is also the maximum likelihood estimator of β. In the con-
text of the normality assumption, we can generate confidence intervals and prediction intervals
as in the single variable case. We have

Proposition 56.2. Given the standard linear model (not necessarily with the normality as-
sumption)

         MSE = SSE/(n − (k + 1))

is an unbiased estimator of σ².

Note that we lose a “degree of freedom” for each predictor including the constant term.

56.2 The Stackloss Data

The stackloss dataset built into R consists of 21 observations on the operation of a plant that
oxidizes ammonia to nitric acid. The response variable is stack.loss which is in screwy units –

it is 10 times the percentage of the incoming ammonia that is lost to the environment. Thus it
is an inverse measure of the plant efficiency. The three predictor variables are Air.Flow which
measures the rate of operation of the plant, Water.Temp which measures the temperature of
the cooling water, and Acid.Conc. which is the concentration of the acid circulating (re-expressed
by subtracting 50 and multiplying by 10). The result of a linear regression of stack.loss on
the other three variables is given by the following R output.

> data(stackloss)
> stackloss
Air.Flow Water.Temp Acid.Conc. stack.loss
1 80 27 89 42
2 80 27 88 37
............................................
20 56 20 82 15
21 70 20 91 15
> l=lm(stack.loss~.,data=stackloss)
> summary(l)

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
Min 1Q Median 3Q Max
-7.2377 -1.7117 -0.4551 2.3614 5.6978

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.9197 11.8960 -3.356 0.00375 **
Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
Water.Temp 1.2953 0.3680 3.520 0.00263 **
Acid.Conc. -0.1521 0.1563 -0.973 0.34405
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom


Multiple R-Squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09
> confint(l)
2.5 % 97.5 %
(Intercept) -63.8252504 -23.5828115
Air.Flow 0.6371624 1.1410539

Water.Temp 0.1275882 1.5056515


Acid.Conc. -0.3711573 0.1568746
> anova(l)
Analysis of Variance Table

Response: stack.loss
Df Sum Sq Mean Sq F value Pr(>F)
Air.Flow 1 1750.12 1750.12 166.3707 3.309e-10 ***
Water.Temp 1 130.32 130.32 12.3886 0.002629 **
Acid.Conc. 1 9.97 9.97 0.9473 0.344046
Residuals 17 178.83 10.52
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

> x=data.frame(Air.Flow=60,Water.Temp=20,Acid.Conc.=80)
> predict(l,x,interval="confidence")
fit lwr upr
[1,] 16.75466 14.12012 19.38920

Just as in the single predictor case, we have estimates β̂i for each βi and also estimates sβ̂i of
the standard deviation of β̂i so that the statistic
         T = (β̂i − βi ) / sβ̂i
has a T distribution with n − (k + 1) degrees of freedom. Thus we can write confidence intervals
for βi as in the above output. Note that the above output also gives p-values for the tests of
the hypotheses that βi = 0. The above analysis suggests that the variable Acid.Conc. might
not be necessary in the model. Residual plots are available for multiple regression.

56.3 Interaction Terms

Suppose that we have two predictor variables x1 and x2 . The linear model
Y = β0 + β1 x1 + β2 x2 + ε
assumes that the coefficient β1 does not depend on the value of x2 . In some cases however it
is reasonable to assume that there is some “interaction” between x1 and x2 so that different
values of x2 result in different values for β1 . The following model is often used to represent that
interaction:
Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε
R provides an easy way to include an interaction term in the model.

[Diagnostic plots (plot.lm) for the stack.loss regression: Residuals vs Fitted, Normal Q-Q, Scale-Location, Cook's distance, Residuals vs Leverage, and Cook's distance vs Leverage.]

Example 56.3. Devore and Berk (Example 12.12) give data on the graduation rate and average
SAT scores of 20 colleges. Some of the colleges are private and some state. We can use the kind
of college as a variable in regression by coding it as 0 or 1 (and being sure that we do not
interpret a value of .5 as meaning anything!). If Y is GradRate, x1 is SAT, and x2 is PS, we
have the model

         Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε

The following regression shows that there is not compelling evidence to keep an interaction term
in the model but without the interaction term it is useful to keep the private-state variable.

> exmp1212
C1 Rank Univ GradRate SAT PrivorSt PS
1 1 2 Princetn 98 1465.00 P 1
2 2 13 Brown 96 1395.00 P 1
.................................................
19 19 243 Toledo 44 877.77 S 0

20 20 245 WayneSt 31 833.32 S 0


> l=lm(GradRate~SAT+PS+SAT:PS,data=exmp1212)
> summary(l)

Call:
lm(formula = GradRate ~ SAT + PS + SAT:PS, data = exmp1212)

Residuals:
Min 1Q Median 3Q Max
-19.177 -3.531 1.686 5.902 9.846

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.52145 18.16644 -0.029 0.9775
SAT 0.04822 0.01840 2.620 0.0186 *
PS -7.86223 29.39747 -0.267 0.7925
SAT:PS 0.02240 0.02617 0.856 0.4047
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 8.231 on 16 degrees of freedom


Multiple R-Squared: 0.8541, Adjusted R-squared: 0.8267
F-statistic: 31.21 on 3 and 16 DF, p-value: 6.411e-07

> l=lm(GradRate~SAT+PS,data=exmp1212)
> summary(l)

Call:
lm(formula = GradRate ~ SAT + PS, data = exmp1212)

Residuals:
Min 1Q Median 3Q Max
-20.051 -4.182 1.265 5.767 9.409

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.35960 12.92137 -0.879 0.391587
SAT 0.05929 0.01298 4.567 0.000273 ***
PS 16.92772 4.97206 3.405 0.003374 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 8.166 on 17 degrees of freedom


Multiple R-Squared: 0.8474, Adjusted R-squared: 0.8294
F-statistic: 47.19 on 2 and 17 DF, p-value: 1.150e-07

Homework.

1. Read Section 12.7.

2. Do problem 12.78.

3.

4. In the data section of the course website is a datafile consisting of the scores of 32 students
of Mathematics 222 on three tests and a final exam. In this question we investigate using
the test scores to predict the final exam score. (After all, if the test scores do a good
enough job, I wouldn’t have to grade the final exam!)

(a) Write a linear function Exam = β̂0 + β̂1 Test1 + β̂2 Test2 + β̂3 Test3 that can be used
to predict the final exam score from the three test scores.
(b) Write a 95% confidence interval for the parameter β1 in the model.
(c) If a student scores 85 on each test, what is the predicted exam score? What is a 90%
confidence interval for that prediction?
(d) Do the p-values for the coefficients lead you to suspect that one or more of the βi
are not very useful in the model? Explain.
(e) For each of the three independent variables, fit a linear function that does not include
that variable. Compare the values of R2 for each those models to each other and to
the full model. Which model would you use to predict exam scores and why?

57 The Linear Algebra of Multivariate Linear Regression

Goal: To derive estimators for multivariate linear regression using linear algebra

57.1 Random Matrices

Definition 57.1. A random matrix is a matrix such that each element of the matrix is a
random variable.

Definition 57.2. The expected value of a random matrix M is the matrix of expected values
of the elements of M .

The fact that expected value is a linear operator is summarized in the following proposition.

Proposition 57.3. If X and Y are random matrices and A is a matrix of constants then

1. E(X + Y ) = E(X) + E(Y )

2. E(AX) = AE(X).

Definition 57.4. If Y is a random vector, then the covariance matrix, Cov(Y ), of Y is the
matrix C such that Ci,j = Cov(Yi , Yj ).

Proposition 57.5. If Y is a random vector with E(Y) = µ, then

         Cov(Y ) = E((Y − µ)(Y − µ)′)

Proposition 57.6. If Y is a random vector and A is a matrix of constants, then Cov(AY ) =
A Cov(Y )A′.

We can now state the standard linear model in terms of matrices. Recall that if the data
consists of n tuples (xi,1 , . . . , xi,k , yi ), we let X be the n × (k + 1) matrix whose ith row is
(1, xi1 , . . . , xik ).

Let β = (β0 , . . . , βk )′. Then we write the standard linear model as

         Y = Xβ + ε

where ε is a random vector with mean 0 and covariance matrix σ²I.



57.2 The Least Squares Estimates

Recall that the least squares estimates of β in the model above are β̂ given by

         β̂ = (X′X)⁻¹X′Y

Proposition 57.7. The estimator β̂ satisfies

1. E(β̂) = β

2. Cov(β̂) = σ²(X′X)⁻¹

Recall that the “hat matrix” H is defined by H = X(X′X)⁻¹X′ and is used to generate Ŷ
by
         Ŷ = HY
It is easy to show that H satisfies H′ = H and H² = H. We then have

Proposition 57.8.

1. Cov(Ŷ ) = σ 2 H

2. Cov(Y − Ŷ ) = σ 2 (I − H).

Notice that we have Var(Yi − Ŷi ) = σ 2 (1 − hii ). Thus Hii is exactly the number that we called
the leverage of the ith observation.
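A quick numerical check of this in R, for any fitted model l produced by lm (for instance the one in the example below): the diagonal of H computed from the design matrix agrees with the leverages reported by hatvalues.

X = model.matrix(l)
H = X %*% solve(t(X) %*% X) %*% t(X)    # the hat matrix
all.equal(as.numeric(diag(H)), as.numeric(hatvalues(l)))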

57.3 Life Cycle Savings Example

The R builtin dataset LifeCycleSavings contains data on the savings rate in 50 different
countries between 1960–1970. See the help document for the names of the variables.

> data(LifeCycleSavings)
> L=LifeCycleSavings
> L
sr pop15 pop75 dpi ddpi
Australia 11.43 29.35 2.87 2329.68 2.87
Austria 12.07 23.32 4.41 1507.99 3.93
..............................................
Libya 8.89 43.69 2.07 123.58 16.71
Malaysia 4.71 47.20 0.66 242.69 5.08
> l=lm(sr~.,data=L)

> x=cbind(1,L[,c(2:5)])
> x=as.matrix(x)
> solve(t(x)%*%x) # %*% gives matrix multiplication, solve finds an inverse
# t(x) returns the transpose

1 pop15 pop75 dpi ddpi


1 3.740514e+00 -7.240019e-02 -4.459169e-01 -7.855506e-05 -1.878625e-02
pop15 -7.240019e-02 1.446816e-03 8.295645e-03 1.675591e-06 2.010896e-04
pop75 -4.459169e-01 8.295645e-03 8.120077e-02 -2.561013e-05 -8.045429e-04
dpi -7.855506e-05 1.675591e-06 -2.561013e-05 5.995458e-08 3.227600e-06
ddpi -1.878625e-02 2.010896e-04 -8.045429e-04 3.227600e-06 2.662002e-03

> summary(l)

Call:
lm(formula = sr ~ ., data = L)

Residuals:
Min 1Q Median 3Q Max
-8.2422 -2.6857 -0.2488 2.4280 9.7509

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
pop15 -0.4611931 0.1446422 -3.189 0.002603 **
pop75 -1.6914977 1.0835989 -1.561 0.125530
dpi -0.0003369 0.0009311 -0.362 0.719173
ddpi 0.4096949 0.1961971 2.088 0.042471 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.803 on 45 degrees of freedom


Multiple R-Squared: 0.3385, Adjusted R-squared: 0.2797
F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
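The standard errors in this summary can be recovered from Proposition 57.7: Cov(β̂) = σ²(X′X)⁻¹ is estimated by replacing σ with the residual standard error 3.803. For pop15, the diagonal entry of (X′X)⁻¹ computed above is 1.446816e-03, and

3.803 * sqrt(1.446816e-03)   # about 0.1446, the Std. Error reported for pop15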

Homework.

1. Read Section 12.8.



58 Competing Models

Goal: To develop metrics for deciding among competing linear models

58.1 Adjusted R2

Suppose that we have several independent variables x1 , . . . , xk that we might possibly include
in the linear model y = β0 + β1 x1 + · · · + βk xk . One measure of the utility of including all these
variables is
         R² = SSR/SST = 1 − SSE/SST
We read R2 as the percentage of variation of y that is explained by the independent variables. A
problem with R2 however is that R2 will always increase as we increase the number of variables
in the model. So simply choosing the model with greatest R2 will lead to choosing the model
with all the predictor variables. An adjustment to R2 which takes into account the number of
predictors is

         Ra² = 1 − MSE/MST = 1 − [SSE/(n − (k + 1))]/[SST/(n − 1)] = 1 − [(n − 1)/(n − (k + 1))] · SSE/SST

It is clear that Ra2 < R2 and, if k is large relative to n, then Ra2 can be much smaller than R2 .
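For a concrete check, the anova table for the full LifeCycleSavings model in the next subsection gives SSE = 650.71 and SST = 204.12 + 78.96 + 47.95 + 1.89 + 650.71, with n = 50 and k = 4:

SSE = 650.71
SST = 204.12 + 78.96 + 47.95 + 1.89 + 650.71
1 - SSE/SST              # multiple R-squared, about 0.3385
1 - (49/45) * SSE/SST    # adjusted R-squared, about 0.2797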

58.2 Comparing two Models

Suppose that we are trying to decide between two competing models, one a refinement of the
other.

Example 58.1. In the LifeCycleSavings data from R, we could explain sr by any combina-
tion of the four independent variables pop15, pop75, dpi and ddpi. Suppose that we wish to
compare the two models
Model 1: sr = β0 + β1 pop15 + β2 ddpi
Model 2: sr = β0 + β1 pop15 + β2 ddpi + β3 pop75 + β4 dpi

> summary(l)

Call:
lm(formula = sr ~ pop15 + ddpi, data = sav)

Residuals:
Min 1Q Median 3Q Max

-7.58314 -2.86323 0.04535 2.22734 10.47530

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.59958 2.33439 6.682 2.48e-08 ***
pop15 -0.21638 0.06033 -3.586 0.000796 ***
ddpi 0.44283 0.19240 2.302 0.025837 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.861 on 47 degrees of freedom


Multiple R-Squared: 0.2878, Adjusted R-squared: 0.2575
F-statistic: 9.496 on 2 and 47 DF, p-value: 0.0003438

> anova(l)
Analysis of Variance Table

Response: sr
Df Sum Sq Mean Sq F value Pr(>F)
pop15 1 204.12 204.12 13.6942 0.0005633 ***
ddpi 1 78.96 78.96 5.2973 0.0258374 *
Residuals 47 700.55 14.91
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
> summary(ll)

Call:
lm(formula = sr ~ pop15 + ddpi + pop75 + dpi, data = sav)

Residuals:
Min 1Q Median 3Q Max
-8.2422 -2.6857 -0.2488 2.4280 9.7509

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
pop15 -0.4611931 0.1446422 -3.189 0.002603 **
ddpi 0.4096949 0.1961971 2.088 0.042471 *
pop75 -1.6914977 1.0835989 -1.561 0.125530
dpi -0.0003369 0.0009311 -0.362 0.719173
---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.803 on 45 degrees of freedom


Multiple R-Squared: 0.3385, Adjusted R-squared: 0.2797
F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904

> anova(ll)
Analysis of Variance Table

Response: sr
Df Sum Sq Mean Sq F value Pr(>F)
pop15 1 204.12 204.12 14.1157 0.0004922 ***
ddpi 1 78.96 78.96 5.4604 0.0239634 *
pop75 1 47.95 47.95 3.3157 0.0752748 .
dpi 1 1.89 1.89 0.1309 0.7191732
Residuals 45 650.71 14.46
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
>

Note that in terms of adjusted-R2 , the second model does better than the first although not by
much. We now look at another way to compare these two models. We think of comparing the
models as a hypothesis testing situation. The null hypothesis in the above case is that the first
model, Model 1, is the true model. In other words, the null hypothesis is that β3 = β4 = 0.
The alternate hypothesis is of course that at least one of β3 and β4 is nonzero.
More generally, we assume that we have two models ω and Ω with the smaller model ω contained
in the larger Ω. (In our example, Model 1 is ω and Model 2 is Ω.) The null hypothesis that we
wish to check is that ω is the “true” model, that is that the extra parameters in Ω are all zero.
Each of the two models has a residual sum of squares, SSEω and SSEΩ .
If Ω is doing a much better job than ω, we should have that SSEω − SSEΩ is large. (In the
above example, this quantity is about 50.)
This suggests that we use the size of SSEω − SSEΩ to test the hypothesis that the larger model
Ω is really necessary. Similarly, it seems reasonable to normalize this and use a statistic such
as
         (SSEω − SSEΩ )/SSEΩ
to measure the necessity of Ω. In fact, this is almost the statistic we need.

Proposition 58.2. Suppose that the dimension (number of parameters including the constant)
of ω is p and the dimension of Ω is q. Then if the null hypothesis (ω) is true the statistic

         F = [(SSEω − SSEΩ )/(q − p)] / [SSEΩ /(n − q)]

has an F distribution with parameters q − p and n − q.

R provides a convenient way to do the hypothesis test. With the two models of the example
we have

> anova(l,ll)
Analysis of Variance Table

Model 1: sr ~ pop15 + ddpi


Model 2: sr ~ pop15 + ddpi + pop75 + dpi
Res.Df RSS Df Sum of Sq F Pr(>F)
1 47 700.55
2 45 650.71 2 49.84 1.7233 0.1900

Based on this test, we do not reject the null hypothesis that β3 = β4 = 0.
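The value of F can be checked directly from the residual sums of squares in the table (here p = 3, q = 5, n = 50):

((700.55 - 650.71) / (5 - 3)) / (650.71 / (50 - 5))   # 1.7233, as reported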


It is instructive to consider this proposition in the case that ω is the model that only contains a
constant term and Ω is the full model. Then the null hypothesis in question is that β1 = · · · =
βk = 0. The F statistic in this case is

         F = [(SSEω − SSEΩ )/k] / [SSEΩ /(n − (k + 1))]

This F statistic is reported on the bottom of the summary of the linear regression.

58.3 Stepwise Regression and AIC

Other things being equal, models with small SSE are preferred to those with larger SSE and
those with small k are preferred to those with larger k. The following measure, called the
Akaike Information Criterion, AIC, combines these two considerations into one measure
         AIC = n ln(SSE/n) + 2(k + 1)
Models with smaller AIC (for a fixed data set and so fixed n) are better.
Given a set x1 , . . . , xk of k possible predictors (k + 1 with the constant term), we could evaluate
AIC for each subset of possible predictors and choose the model with smallest AIC. However

it is more common to start with the full model and choose variables to eliminate one at a time
according to what AIC would result if that variable is eliminated.
The stepwise regression for the model of savings rate data starting with the four variables given
is done in R as follows.

> step(ll)
Start: AIC= 138.3
sr ~ pop15 + ddpi + pop75 + dpi

Df Sum of Sq RSS AIC


- dpi 1 1.89 652.61 136.45
<none> 650.71 138.30
- pop75 1 35.24 685.95 138.94
- ddpi 1 63.05 713.77 140.93
- pop15 1 147.01 797.72 146.49

Step: AIC= 136.45


sr ~ pop15 + ddpi + pop75

Df Sum of Sq RSS AIC


<none> 652.61 136.45
- pop75 1 47.95 700.55 137.99
- ddpi 1 73.56 726.17 139.79
- pop15 1 145.79 798.40 144.53

Call:
lm(formula = sr ~ pop15 + ddpi + pop75, data = sav)

Coefficients:
(Intercept) pop15 ddpi pop75
28.1247 -0.4518 0.4278 -1.8354

Evidently, the full model has AIC = 138.30. The only variable that could be removed to reduce
AIC is dpi. Removing that variable results in AIC of 136.45. The next step shows that no
other single variable can be removed to reduce AIC.
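These AIC values can be reproduced from the formula. For the full model, n = 50, k = 4, and SSE = 650.71 (from the anova table in the last section), so

50 * log(650.71 / 50) + 2 * (4 + 1)   # about 138.3, the starting AIC above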

Homework.

1. Read pages 672-675.



2. The data page at the course website has a dataset that includes statistics on five years
(1994-1998) of major-league baseball. We desire to take certain statistics computed on a
per-game basis and use these to predict the number of runs-per-game (RG). The version
of the dataset that is useful in this regard is the “per game” version. The variables are
RG (runs), X1BG (singles), X2bG (doubles), X3BG (triples), HRG (home runs), SOG
(strikeouts), SBG (stolen bases), CSG (caught stealing).

(a) Use all of the other variables in a linear function to predict RG.
(b) Based on the t-values in the preceding analysis, which variables are obvious candi-
dates for removal from the model? Refit the model without these variables. Compute
adjusted R2 for the two models.
(c) Employ stepwise regression in R (step). Does step give the same result as the
analysis in part (b)?
(d) Using the model that resulted from the stepwise regression in part (c), what variable
should be the next to remove? Refit the model without that variable and compare
adjusted R2 of this model to the models of (b) and (c).
(e) If you know something about baseball, do the coefficients in the model of (c) “make
sense?”

59 Logistic Regression

Goal: To do regression in the case that the response variable is a dichotomous


categorical variable

59.1 The Logistic Regression Model

If y is a dichotomous categorical variable we can code the two outcomes as 0 or 1 (often, failure
and success). We will suppose that we wish to predict y from a single independent variable x.
The data are pairs (xi , yi ) where yi = 0, 1.
Consider the standard linear model y = β0 + β1 x + ε. At one time, researchers simply fit this
model to the data. Note that if the model is true, we have E(Y |x) = β0 + β1 x. But E(Y |x) is
the expected number of successes given a fixed value of x. Call this number p(x). Thus p(x) is
the probability of success for a fixed value of x. This model is inappropriate for a number of
reasons.

1. The number p(x) should be between 0 and 1 but it is entirely possible for the regression
to give numbers β̂0 + β̂1 x that are outside this range.

2. The model assumes that the y values have the same variance for each x value. But this
is false. The variance of y for a fixed x depends on p(x). Namely, it is p(x)(1 − p(x)).

3. The model assumes that the y values for a fixed x are normally distributed. This is silly
since the y values are all 0 or 1.

Logistic regression models p(x) by a nonlinear function of β0 + β1 x.

         p(x) = e^(β0 +β1 x) / (1 + e^(β0 +β1 x))

Note that the right hand side of this equation is always strictly between 0 and 1. It is increasing
if β1 > 0 and decreasing if β1 < 0.
The above model is often written in one of the following two forms:

 
         p(x)/(1 − p(x)) = e^(β0 +β1 x)        ln[ p(x)/(1 − p(x)) ] = β0 + β1 x

The expression p(x)/(1 − p(x)) is called the odds ratio.
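In R the function e^t/(1 + e^t) is available as plogis(), so for given (hypothetical) values of β0 and β1 the model probability is easy to evaluate:

p = function(x, beta0, beta1) plogis(beta0 + beta1 * x)
p(0, beta0 = -1, beta1 = 2)   # about 0.269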

59.2 Estimating the Parameters

In the above model, there is a question as to how to estimate the parameters β0 and β1 . The
least squares estimates are difficult to find and no longer have the desirable properties as they
did in the linear regression case. Instead, the model is usually fit using the maximum likelihood
estimators.
To show how this is done, suppose that we have four data points (x1 , 0), (x2 , 1), (x3 , 1), (x4 , 1).
Then the likelihood function is

         (1 − p(x1 ))p(x2 )p(x3 )p(x4 ) = [1/(1 + e^(β0 +β1 x1 ))] · [e^(β0 +β1 x2 )/(1 + e^(β0 +β1 x2 ))]
                                          · [e^(β0 +β1 x3 )/(1 + e^(β0 +β1 x3 ))] · [e^(β0 +β1 x4 )/(1 + e^(β0 +β1 x4 ))]

Finding the values of β̂0 and β̂1 to maximize this function is left to software. Additionally,
the exact distribution of β̂i is not known but it is known that it is approximately normal.
Expressions for sβ̂i are also known.

59.3 The Challenger Disaster

The shuttle Challenger was destroyed on launch in January 1986 due to the failure of an O-ring.
Engineers for the manufacturer had recommended that no shuttle be launched when the tem-
perature at launch time is below 53 degrees, but the temperature at launch time was 31 degrees.
Data on previous shuttle launches is presented below (in a simplified manner). The temperature
at launch and the failure of the O-ring are recorded. We wish to use logistic regression to predict
the probability of failure as a function of temperature. Although 31 degrees is outside the range
of the data, extrapolation predicts almost certain failure of the O-ring.

> exmp1214
Temperat Failure Fail1
1 53 Y 1
2 56 Y 1
3 57 Y 1
4 63 N 0
5 66 N 0
.........................
20 76 N 0
21 78 N 0
22 79 N 0
23 80 N 0
24 81 N 0
> l=glm(Fail1~Temperat,data=exmp1214,family="binomial")

> predict(l,data.frame(Temperat=31),type="response")
[1] 0.9961828
[Plot of the fitted probability of O-ring failure p against launch temperature (40 to 80 degrees).]

59.4 Baseball

Obviously the winning percentage of a baseball team is related to its runs scored. We try to
predict winning percentage from the difference of runs per game and opponents runs per game.
We use an alternate form of entry to glm. Rather than use a response variable that codes the
individual games, we use a two-column variable that has a column for successes and a column
for failures. This is a more convenient format when there are many trials for each value of x.

> r=read.csv(’runswins04.csv’)
> r
TEAM LG W L G R OR RDIFF AB H DBL TPL HR BB SO
1 Anaheim AL 92 70 162 836 734 102 5675 1603 272 37 162 450 942
2 Boston AL 98 64 162 949 768 181 5720 1613 373 25 222 659 1189
.................................................................
30 Milwaukee NL 67 94 161 634 757 -123 5483 1358 295 32 135 540 1312

> wl=cbind(r$W,r$L)

> ror=(r$R-r$OR)/r$G
> l=glm(wl~ror,family=binomial)
> l

Call: glm(formula = wl ~ ror, family = binomial)

Coefficients:
(Intercept) ror
-0.001175 0.447539

Degrees of Freedom: 29 Total (i.e. Null); 28 Residual


Null Deviance: 131.7
Residual Deviance: 12.65 AIC: 182.1
> x=data.frame(ror=seq(-3,3,.1))
> p=predict(l,x,type="response")
> plot(x$ror,p,type="l")
>
[Plot of the fitted winning probability p against runs scored minus opponents' runs per game (x$ror).]

Homework.

1. Read pages 607–608 and 636–637.

2. A professional golfer practices putts from three different lengths. Out of 50 putts from
distance of 3 feet, he makes 42. Out of 50 putts from a distance of 6 feet, he makes 31.

And out of 50 putts from a distance of 15 feet he makes 8. Let x be the length of the
putt and p(x) be the probability of making the putt.

(a) Use logistic regression to write p(x) as a function of x.


(b) Use your model from part (a) to predict the percentage of 10 foot putts that this
golfer will make.
(c) Use your model to predict the number of 0 foot (!) putts that this golfer will make.
(d) Fit a model of the form p(x) = β0 + β1 x using simple linear regression and compare
the results to your model in (a). In particular, compare the predictions of the two
models at various distances.

60 Goodness of Fit Tests

Goal: To test whether the data fit the model

60.1 Test of a Multinomial

Definition 60.1. The multinomial distribution has k + 1 parameters n, p1 , . . . , pk (such that
p1 + · · · + pk = 1) and pdf

         p(n1 , . . . , nk ) = [n!/(n1 ! · · · nk !)] p1^n1 · · · pk^nk        (n1 + · · · + nk = n)

The multinomial distribution is a model for an experiment consisting of n independent trials,
each with k possible outcomes, such that the probability of outcome i on any trial is pi . The
random variables N1 , . . . , Nk count the number of times each of the outcomes occurs.
The simplest hypothesis testing situation is a test for particular values of pi . The null and
alternate hypotheses are as follows:

         H0 : p1 = p10 , p2 = p20 , . . . , pk = pk0
         Ha : pi ≠ pi0 for some i

Example 60.2. Suppose that a supposedly fair die is tossed 100 times and the number of times
each face appears is recorded.

Number 1 2 3 4 5 6
Occurrences 16 11 23 15 15 20

We wish to test the hypothesis that p1 = · · · = pk = 1/6.

The test statistic that we use is

         χ² = Σ (observed − expected)²/expected = Σ (ni − npi0 )²/(npi0 )

There are two different methods that use this test statistic to generate a p-value. One way is an
approximate distributional result that relies on the Central Limit Theorem. The second way is
a computer intensive method that approximates the exact distribution of χ2 by simulation.
Theorem 60.3. If the null hypothesis H0 is true, the test statistic χ² has a distribution which
is approximately chi-squared with k − 1 degrees of freedom. The approximation is good when n
is large and each “expected” value npi0 is not too small (usually we ask that npi0 ≥ 5).

To test the null hypothesis then, we compute the value of χ2 and reject the null hypothesis if
this value exceeds the appropriate critical value from the chi-squared distribution. Specifically,
define χ2α,k−1 to be the value such that if the null hypothesis is true

P (χ2 > χ2α,k−1 ) = α

For our desired level of significance α, we reject the null hypothesis if χ2 > χ2α,k−1 . The p-value
of the test statistic is the probability that a chi-squared random variable with k − 1 degrees of
freedom exceeds the sample value of χ2 .
In the case of the dice example above, we do not reject the null hypothesis.

> x=c(16,11,23,15,15,20)
> p=c(1/6,1/6,1/6,1/6,1/6,1/6)
> p0=c(1/6,1/6,1/6,1/6,1/6,1/6)
> chisq.test(x,p=p0)

Chi-squared test for given probabilities

data: x
X-squared = 5.36, df = 5, p-value = 0.3735
> sum( (x-100*p0)^2/(100*p0))
[1] 5.36
>

The second method of testing the null hypothesis is a simulation method. Assuming that the
null hypothesis is true, we can simulate many different trials of the experiment and compute
χ2 for each one. Then we can compare the value of χ2 for this experiment to those and get a
p-value. This method is entirely analogous to the percentile version of the bootstrap method.

> chisq.test(x,p=p0,simulate.p.value=T)

Chi-squared test for given probabilities with simulated p-value (based


on 2000 replicates)

data: x
X-squared = 5.36, df = NA, p-value = 0.3698

The p-value in the above example indicates that in 37% of the simulated samples, the value of
χ2 was more extreme than the value actually observed in this sample. The null hypothesis is
not rejected.

60.2 Using Chi-squared to Test Other Distributional Hypotheses

Example 60.4. Suppose that 100 observations are made of a random variable that is sup-
posedly Poisson distributed with parameter λ = 1. The following table gives an example of
simulated data that might come from that distribution.

Observation 0 1 2 3 4
Count 34 39 18 7 2

If we combine the data to form just four classes (0, 1, 2, ≥ 3), we can view this as a multinomial
experiment with n = 100 and hypothesized probabilities given by the poisson density. (The
classes for x ≥ 3 are combined so that we ensure that the expected size of each class is at least
5.)

> p0=c(dpois(c(0,1,2),1),1-ppois(2,1))
> n=c(34,39,18,9)
> chisq.test(n,p=p0,simulate.p.value=T)

Chi-squared test for given probabilities with simulated p-value (based


on 2000 replicates)

data: n
X-squared = 0.4699, df = NA, p-value = 0.933

We can do the same thing to test goodness of fit to a continuous distribution. We first divide
the data into classes (much as we do with a histogram). We then treat the experiment as a
multinomial experiment that produces outcomes in the given classes. For example, suppose
that we are testing whether data comes from a standard normal distribution. Below we divide
the standard normal curve into 10 bins of equal probability and use hist to count the data in
each bin. Note that chisq.test with no argument of hypothesized probabilities uses as default
the hypothesis that all bins have equal probability.

> x
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> q=qnorm(x)
> q
[1] -Inf -1.2815516 -0.8416212 -0.5244005 -0.2533471 0.0000000
[7] 0.2533471 0.5244005 0.8416212 1.2815516 Inf
> z=rnorm(100,.1,1)
> q[1]=min(z)

> q[11]=max(z)
> h=hist(z,breaks=q)
> n=h$counts
> chisq.test(n)

Chi-squared test for given probabilities

data: n
X-squared = 7.8, df = 9, p-value = 0.5544
> q
[1] -2.5567418 -1.2815516 -0.8416212 -0.5244005 -0.2533471 0.0000000
[7] 0.2533471 0.5244005 0.8416212 1.2815516 1.9250591
>

Though there is nothing wrong theoretically with this test for normality, it really is a lousy
test. The problem is that it has very low power. Intuitively one can see that this test does not
use most of the information in the data. For example, it does not even use the fact that the
data values have an ordering. There are tests for normality that have greater power which we
shall consider later.

60.3 Goodness of Fit with Unknown Parameters

It is often the case that we have a model for the probabilities pi0 in our null hypothesis in the
sense that we assume that they depend on unknown parameters.
Example 60.5. If gene frequencies are in equilibrium in a population, the genotypes AA, Aa,
and aa occur in the population with frequencies θ2 , 2θ(1 − θ) and (1 − θ)2 respectively. Suppose
that the genotypes correspond to three phenotypes and we observe the following data in a
random sample of the population
Genotype AA Aa aa
Count 342 500 187
Suppose we want to test the hypothesis that this simple genetic model fits the data. To perform
a chi-squared goodness-of-fit test we need first to estimate θ. We estimate θ to maximize the
likelihood of the count data.
In this case, the likelihood of the data is

L(θ) = (1029!/(342! 500! 187!)) (θ^2)^342 (2θ(1 − θ))^500 ((1 − θ)^2)^187

The maximum likelihood estimator θ̂ of θ is .5753. (The R code below finds the minimum of
the negative log likelihood.)

> like=function (theta){-( 684*log(theta) + 500*
+   log(2*theta*(1-theta)) + 187*log((1-theta)^2))}
> nlm(like,.3)
$minimum
[1] 1056.486

$estimate
[1] 0.5753153

$gradient
[1] -0.0006591563

$code
[1] 1

$iterations
[1] 5
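
As a check on nlm, the maximum likelihood estimate has a closed form in this example: each AA individual contributes two A alleles and each Aa individual contributes one, so θ̂ is the sample proportion of A alleles.

> (2*342+500)/(2*(342+500+187))     # = 0.5753..., agreeing with the nlm output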

Now we do a chi-squared test using that value of θ̂ to generate our null hypothesis.

> theta=.5753
> p0=c(theta^2,2*theta*(1-theta),(1-theta)^2)
> n=c(342,500,187)
> chisq.test(n,p=p0)

Chi-squared test for given probabilities

data: n
X-squared = 0.0325, df = 2, p-value = 0.9839

The above analysis is not quite right yet. The fact that we estimated a parameter means that
we have produced a chi-squared value that might be somewhat smaller than what it would have
been had we used the true value of θ to perform the chi-squared test. The correction is given in
the following proposition.

Proposition 60.6. Suppose that the random variables N1 , . . . , Nk have a multinomial distribu-
tion with parameters n, p1 , . . . , pk . Suppose that the null hypothesis states that each probability
pi is a function π(θ1 , . . . , θm ) of unknown parameters θ1 , . . . , θm . Let θ̂1 , . . . , θ̂m be maximum
likelihood estimates of the parameters. Then if H0 is true, the following statistic has an
approximately chi-squared distribution with k − 1 − m degrees of freedom:

χ2 = Σ (observed − est. expected)^2 / (est. expected) = Σi (ni − nπi(θ̂1, . . . , θ̂m))^2 / (nπi(θ̂1, . . . , θ̂m))

In the above example, this means that we should have compared the value of the test statistic,
0.0325, against the chi-squared distribution with one degree of freedom rather than 2. The
result is a p-value of 0.857 rather than 0.984.
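
The corrected p-value can be computed directly in R:

> 1-pchisq(0.0325,1)     # approximately 0.857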

Homework.

1. Read sections 13.1 and 13.2 through page 720.

2. Do problems 13.4,9,10,13.

61 Goodness of Fit Continued

Goal: To extend the chi-squared goodness of fit analysis to continuous distributions
for which parameters must be estimated

61.1 An extended example - in memory of Section 34

As of April 6, 2007 there were 137 players in the National Basketball Association who had
attempted enough free throws to be considered for the league lead. We want to check the
hypothesis that the beta distribution is a good model for free-throw shooting percentage. We
find the following distribution of percentages, using classes determined from basketball
knowledge (and not from the data).

pct    0-.650   .650-.700   .700-.750   .750-.800   .800-.850   .850-1.000
count  11       14          19          33          37          23

In the R session below, we use the method of maximum likelihood to estimate the parameters
of the beta distribution that best fits the percentage data and then apply the chi-squared test.
Notice that since there are 6 classes and we estimated 2 parameters, there are just 3 degrees of
freedom in the chi-squared statistic. The conclusion of the analysis is that we do not reject the
hypothesis that the beta distribution is a good model for these count data.

> f
Name.Team GP PPG FTMG FTAG FTM FTA
1 Kyle Korver , PHI 72 14.5 2.6 2.9 189 207
2 Earl Boykins , MIL-DEN 58 14.6 3.4 3.8 197 218
3 Ray Allen , SEA 55 26.4 5.1 5.6 279 309
.........................................................
136 Emeka Okafor , CHA 61 14.8 2.6 4.4 158 269
137 Shaquille O’Neal , MIA 36 17.2 3.2 7.5 116 270
> br=c(0,.650,.700,.750,.800,.850,1)
> pct=f$FTM/f$FTA
> h=hist(pct,breaks=br)
> counts=h$counts
> counts
[1] 11 14 19 33 37 23
> loglik = function (a) { -sum(dbeta(pct,shape1=a[1],shape2=a[2],log=T)) }
> estparam=nlm(loglik,c(8,2))
> estparam
$minimum
[1] -158.4951

$estimate
[1] 21.450598 6.194976

$gradient
[1] -5.692133e-06 8.937160e-06

$code
[1] 1

$iterations
[1] 16
> hist(pct,freq=F)
> x=seq(0,1,.01)
> lines(x,dbeta(x,shape1=estparam$estimate[1],shape2=estparam$estimate[2]))
> p=pbeta(br,estparam$estimate[1],estparam$estimate[2])
> p0=p[-1]-p
Warning message:
longer object length
is not a multiple of shorter object length in: p[-1] - p
> p0=p0[1:6]
> chisq.test(h$counts,p=p0)

Chi-squared test for given probabilities

data: h$counts
X-squared = 2.388, df = 5, p-value = 0.7933

> 1-pchisq(2.388,3)
[1] 0.4958723
>

61.2 A technical point

In the analysis of the example above, we used the maximum likelihood estimators for the
parameters of the beta distribution derived from all the data. However, the theory demands
that we use the maximum likelihood estimators derived from the count data. That is, we should
choose the values of the parameters α and β in the beta distribution to maximize the likelihood
of the particular table of counts that we obtained, not to maximize the likelihood of the
unaggregated data. It is intuitively clear that the value of χ2 that we achieved is larger
than we would have obtained had we found the correct maximum likelihood estimates. Therefore
we are still safe in the above instance in not rejecting the null hypothesis. But the above test
will, in general, cause us to reject the null hypothesis more often than the advertised level of
significance.
The fit is pretty good as the following plot suggests.

[Figure: histogram of pct with the fitted beta density curve overlaid. The x-axis is pct (roughly 0.4 to 0.9) and the y-axis is Density (0 to 5).]
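
If one wanted the grouped-data maximum likelihood estimates that the theory calls for, a sketch (reusing the br and counts objects from the session above) is to minimize the negative multinomial log likelihood of the table of counts, with cell probabilities given by differences of the beta cdf:

> groupll=function(a) -sum(counts*log(diff(pbeta(br,shape1=a[1],shape2=a[2]))))
> nlm(groupll,c(21,6))$estimate     # grouped-data estimates of the two shape parameters

These estimates could then replace the ungrouped-data estimates in the chi-squared test above.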

61.3 The special case of normality

The above procedure is entirely justified in testing the hypothesis of normality. But as was
mentioned in the previous section, this is a test of very low power. There are tests of normality
of considerably greater power. We consider one here.
The Ryan-Joiner test is based on the same idea as that of normal quantile plots (qqnorm in R).
If the data come from a normal distribution, the normal quantile plot should be approximately
a straight line. The correlation coefficient corresponding to this line is therefore a good test
statistic for normality. If there are n data points, let yi be the (i−3/8)/(n+1/4) quantile of the
standard normal distribution. (Intuitively one might use (i − 1/2)/n but these yi give better
results.) If x(1) , x(2) , . . . , x(n) is the ordered list of observations, the Ryan-Joiner statistic is the
correlation coefficient of this list and the y vector. We need to know the distribution of this
statistic under the null hypothesis that the data xi come from a normal distribution. Tables
have been constructed of the appropriate critical values. For example, if n = 30 and α = .05
we reject the normality hypothesis at the α level of significance if the Ryan-Joiner statistic is
less than 0.9639. In the simulated examples below, we do not reject the normality hypothesis
for the data that come from a normal distribution but we do reject the hypothesis in the case
that the data come from an exponential distribution.

> expsamp=rexp(30,1)
> normsamp=rnorm(30,1)
> y=qnorm((c(1:30)-3/8)/(30+1/4))
> cor(sort(normsamp),y)
[1] 0.9841894
> cor(sort(expsamp),y)
[1] 0.750169
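
The tabled critical value quoted above can be checked approximately by simulating the Ryan-Joiner statistic under the null hypothesis. The sketch below (the function rj is our own) estimates the 5th percentile of the statistic for samples of size 30:

> rj=function(x){n=length(x); cor(sort(x),qnorm((1:n-3/8)/(n+1/4)))}
> sims=replicate(10000,rj(rnorm(30)))
> quantile(sims,.05)     # should be close to the tabled value 0.9639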

Homework.

1. Read Chapter 13.2.

2. Do problems 13.19,22.

62 Two-Way Contingency Tables

Goal: To test for homogeneity or independence given two-way categorical data

62.1 Contingency Tables

Notation.

1. I rows and J columns.


2. nij is the entry in row i column j. It is a count (nij is a nonnegative integer).
3. ni· = Σj nij (the sum over j = 1, . . . , J)

4. n·j = Σi nij (the sum over i = 1, . . . , I)

5. n = Σi,j nij

Two Sampling Situations.

1. The I rows are I different populations and the J columns are J mutually exclusive and
exhaustive categories. For each row i a random sample of size ni is taken from the
corresponding population and nij is the number of sampled individuals that are in category
j.
2. The I rows are I different levels of one factor (categorical variable) and the J columns are
J different levels of a second factor. A random sample of n individuals is chosen and nij
is the number of individuals at level i of the first factor and level j of the second factor.

Sometimes it is difficult to determine from the problem description which of the two situations
produced the two-way table.
Research Question.
There is a research question corresponding to each of the above sampling situations.

1. Homogeneity. In the case of independent populations, we want to test the hypothesis
that the proportion of individuals in the various (column) categories is the same for all
populations.

2. Independence. In the case of one population but two categorical variables, we want to
test the hypothesis that the two factors are independent of each other in the population.

62.2 Homogeneity

Let pij denote the proportion of individuals of population i that fall in category j. The null
hypothesis is

H0 : for every j, p1j = p2j = · · · = pIj

If H0 is true, we use pj to denote the common value of the pij. We estimate pj by p̂j = n·j /n.
Now we form a chi-squared statistic based on the cells of the table. To find the estimated count
in cell ij, we use

êij = ni· p̂j = ni· n·j /n

The chi-squared statistic that we use is

χ2 = Σ (observed − est. expected)^2 / (est. expected) = Σi Σj (nij − êij)^2 / êij

If the null hypothesis is true, then this statistic has an approximate chi-squared distribution
with (I − 1)(J − 1) degrees of freedom.
To understand the degrees of freedom computation we use the following heuristic argument. In
each row, we are computing a chi-squared statistic with J − 1 degrees of freedom, so the overall
chi-squared statistic has I(J − 1) degrees of freedom. (The sum of independent chi-squared
random variables is chi-squared, and the degrees of freedom add.) However, the cell probabilities
are estimated, and only J − 1 parameter estimates are needed to produce the êij (not J, since
the p̂j sum to 1). Thus the number of degrees of freedom is

I(J − 1) − (J − 1) = (I − 1)(J − 1)

Example 62.1. The R dataset UCBAdmissions is a table that classifies all applicants to the
six largest graduate programs at the University of California at Berkeley in 1973 according to
their gender and admission status. The question is whether the University treats males and
females differently in the admission process.

> UCBAdmissions
, , Dept = A

Gender
Admit Male Female
Admitted 512 89
Rejected 313 19

, , Dept = B

Gender
Admit Male Female
Admitted 353 17
Rejected 207 8

, , Dept = C

Gender
Admit Male Female
Admitted 120 202
Rejected 205 391

, , Dept = D

Gender
Admit Male Female
Admitted 138 131
Rejected 279 244

, , Dept = E

Gender
Admit Male Female
Admitted 53 94
Rejected 138 299

, , Dept = F

Gender
Admit Male Female
Admitted 22 24
Rejected 351 317

> f=ftable(UCBAdmissions,row.vars=1,col.vars=2)
> f
Gender Male Female
Admit
Admitted 1198 557
Rejected 1493 1278
> chisq.test(f)

Pearson’s Chi-squared test with Yates’ continuity correction

data: f
X-squared = 91.6096, df = 1, p-value < 2.2e-16

Apparently, we reject the hypothesis that the two populations (male and female) are homogeneous
with respect to admission rate. In the above example, the chi-squared test is computed
with a continuity correction. This correction is applied only in the case of a 2 × 2 table. The
correction is to subtract 0.5 from each |observed − expected| and “corrects” for the fact that
the chi-squared distribution is not a very good approximation to the true sampling distribution
of χ2 in the case of one degree of freedom.
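
The estimated expected counts êij and the uncorrected statistic can also be computed by hand from the totals in the ftable output above (a sketch; m and e are our own variable names):

> m=matrix(c(1198,557,1493,1278),nrow=2,byrow=T)
> e=outer(rowSums(m),colSums(m))/sum(m)     # e[i,j] = ni.*n.j/n
> sum((m-e)^2/e)     # uncorrected chi-squared statistic; chisq.test(f,correct=F) agrees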

62.3 Independence

Let pi· denote the proportion of the population belonging to category i of the first factor and
p·j be the proportion of the population belonging to category j of the second factor. Let pij be
the proportion of individuals belonging to both category i of the first factor and category j of
the second factor. Then the null hypothesis of independence is

H0 : For all i, j pij = pi· p·j

In this case, to get the estimated cell counts under the null hypothesis, we need to estimate pi·
and p·j . The obvious estimates are
p̂i· = ni· /n        p̂·j = n·j /n

Then the appropriate chi-squared statistic is

χ2 = Σ (observed − est. expected)^2 / (est. expected) = Σi Σj (nij − n p̂i· p̂·j)^2 / (n p̂i· p̂·j)

This is exactly the same statistic as in the homogeneity case, and it has the same number of
degrees of freedom, (I − 1)(J − 1), although a different argument is needed to count the degrees
of freedom.

To count the degrees of freedom, note that there are IJ cells and we are performing a
chi-squared test on the IJ cell counts. Without other restrictions, this would give IJ − 1
degrees of freedom. But we are estimating the parameters pij via our estimates of the pi· and
p·j . There are (I − 1) + (J − 1) parameters being estimated. (The subtraction of 1 in each case
occurs because pI· and p·J are determined by the other parameters, since the row proportions
and the column proportions each sum to 1.) So the number of degrees of freedom used
in the chi-squared test is

IJ − 1 − (I − 1) − (J − 1) = IJ − I − J + 1 = (I − 1)(J − 1)

Example 62.2. A company classifies a sample of 309 defective parts according to whether the
parts were produced on the first, second, or third shift and whether the defects were of type
A, B, C, or D. The question of whether the shift and the type of defect are independent is
investigated by the chi-squared test below (shown both in its classic form and with a simulated
p-value).

> x=matrix(c(15,21,45,13,26,31,34,5,33,17,49,20),nrow=3,ncol=4,byrow=T)
> x
[,1] [,2] [,3] [,4]
[1,] 15 21 45 13
[2,] 26 31 34 5
[3,] 33 17 49 20
> chisq.test(x)

Pearson’s Chi-squared test

data: x
X-squared = 19.178, df = 6, p-value = 0.003873

> chisq.test(x,simulate.p.value=T,B=10000)

Pearson’s Chi-squared test with simulated p-value (based on 10000 replicates)

data: x
X-squared = 19.178, df = NA, p-value = 0.0035
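
To see which cells contribute most to the rejection, the estimated expected counts and the Pearson residuals returned by chisq.test can be inspected:

> out=chisq.test(x)
> out$expected      # n * phat_i. * phat_.j for each cell
> out$residuals     # (observed - expected)/sqrt(expected), cell by cell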

Homework.

1. Read pages 729-734.

2. Do problems 13.24, 28.
