Applied Statistics - MIT

Dr. Elizabeth Newton
Slides prepared by Elizabeth Newton (MIT) with some slides by
Roy Welsch (MIT) and Gordon Kaufman (MIT).
1
15.075, Applied Statistics
Lecture: M,W 10-11:30
Recitation: R 4-5
Text: Statistics and Data Analysis by Tamhane and Dunlop
Computing: S-Plus
Exams: Mid-term (in class) and Final during exam week
Prerequisites: Calculus, Probability, Linear Algebra
2
15.075, Applied Statistics, Course Outline
Collecting Data
Summarizing and Exploring Data
Review of Probability
Sampling Distributions of Statistics
Inference
Point and CI Estimation, Hypothesis Testing
Linear Regression
Analysis of Variance
Nonparametric Methods
Special Topics (Data Mining?)
3
Statistics
The science of collecting and
analyzing data for the purpose of
drawing conclusions and
making decisions.
- From Tamhane, Ajit C., and Dorothy D. Dunlop. Statistics and Data Analysis: From
Elementary to Intermediate. Prentice Hall, 2000, p. 1.
Statistics are no substitute for judgment.
- Henry Clay
4
How is the meter defined?
One ten-millionth of a quarter meridian
(distance from pole to equator).
BUT it isn't exactly.
Why?
5
The Measure of All Things, by Ken Alder,
describes the attempt of 2 French astronomers,
Delambre and Mechain, to determine the
circumference of the earth during the time of the
French Revolution.
Determined the distance between Barcelona and
Dunkirk by triangulation.
Needed to know latitude at each end (by measuring
heights of stars).
Seven months stretched to seven years.
Mechain obtained conflicting information and
suppressed some of his data.
6
Page 214 (Measure of All Things):
What counts as an error? Who is to say when you have
made a mistake? How close is close enough? Neither
Mechain nor his colleagues could have answered these
questions with any degree of confidence. They were
completely innocent of statistical method.
- Quote from Alder, Ken. The Measure of All Things: The Seven-Year
Odyssey and Hidden Error that Transformed the World. Free Press, 2003.
7
Data: A Set of measurements
Character
Nominal, e.g. color: red, green, blue
Binary, e.g. (M,F), (H,T), (0,1)
Ordinal, e.g. attitude to war: agree, neutral, disagree
Numeric
Discrete, e.g. number of children
Continuous, e.g. distance, time, temperature
also:
Interval, e.g. Fahrenheit temperature
Ratio (real zero), e.g. distance, number of children
8
S-Plus Data Set: cu.summary
9
Concepts
Population:
The set of all units of interest (finite or infinite).
E.g. all students at MIT
Sample:
A subset of the population actually observed.
E.g. students in this room.
Variable:
A property or attribute of each unit, e.g. age, height
Observation:
Values of all variables for an individual unit
A dataset is often organized as a matrix with rows
corresponding to observations and columns to variables.
10
Concepts (continued)
Parameter:
Numerical characteristic of population, defined
for each variable, e.g. proportion opposed to war
Statistic:
Numerical function of sample used to estimate
population parameter.
Precision:
Spread of estimator of a parameter
Accuracy:
How close estimator is to true value - opposite of
Bias:
Systematic deviation of estimate from true value
11
Accuracy and Precision
accurate and
precise
accurate,
not precise
precise,
not accurate
not accurate,
not precise
12
Diagram courtesy of MIT OpenCourseWare
Steps in Study Design and Implementation
1. Background research and literature review.
2. Define the goals and specific hypotheses of the study.
3. Determine what variables should be measured and how.
4. Develop a plan to collect the data
Sampling design
Sample size
Inclusions and exclusions
5. Train Personnel
6. Gather Data
7. Analyze Data
8. Report Results
13
Ethical Issues
For human subjects:
For animal subjects:
(See Hulley & Cummings, Designing Clinical Research.)
14
Statistical Studies
Descriptive:
One group, e.g. survey, poll
Comparative:
2 or more groups, e.g. compare effectiveness of different
teaching methods.
Experimental:
Investigator actively intervenes to control study conditions
Look at relationship between predictor (explanatory) and
response (outcome) variables
Establish causation, e.g. drug trial
Observational:
Investigator records data without intervening
Difficult to distinguish effects of predictors and confounding
variables (lurking variables)
Establish association, e.g. Framingham Heart Study
15
Observational Studies:
Cross-sectional
Look at sample at a single point in time
E.g. Census, Sample survey
Prospective (expensive!)
Follow sample (cohort) forward in time.
E.g. Framingham Heart Study, Nurses' Health
Study
Retrospective (case-control)
Look back in time
16
Sources of Error in Observational Studies
Sampling Error: sample differs from population
Measurement Bias: poorly worded questions
Self-Selection Bias: refusal to participate
Response Bias: incorrect or untruthful responses
17
Types of Samples
Probability Sample (every element in population has
known non-zero probability of inclusion)
Simple Random Sample (SRS)
Stratified Random Sample
Multi-Stage Cluster Sample
Systematic Sample
Non-Probability Sample (estimates may be biased, but
frequently used as only feasible method)
Convenience Sample, e.g. supermarket survey
Judgment Sample, chosen by investigator
18
Simple Random Sample (SRS)
Requires a Sampling Frame, a list of all the units in a finite
population
Sample of size n is drawn without replacement from
population of size N, such that each sample (there are
$\binom{N}{n}$ of them) has same chance of being chosen.
Each unit in population has same chance of being chosen:
n/N (the sampling fraction).
Generate random numbers to select from sampling frame.
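A minimal Python sketch of drawing a simple random sample from a sampling frame. The frame of 20 unit IDs, the seed, and the function name `simple_random_sample` are illustrative assumptions, not part of the course material.

```python
import math
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every size-n subset is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

frame = list(range(1, 21))   # hypothetical sampling frame: unit IDs 1..20
N, n = len(frame), 5
sample = simple_random_sample(frame, n, seed=1)

num_possible_samples = math.comb(N, n)   # "N choose n" distinct samples
sampling_fraction = n / N                # inclusion probability of each unit
```

Here `math.comb(N, n)` counts the equally likely samples and `n / N` is the sampling fraction from the slide.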
19
Stratified Random Sample
Divide a diverse population into homogeneous
subpopulations (strata).
Draw simple random sample from each one.
Advantages:
Separate estimates for strata obtained in addition to
overall estimates.
Precision of estimates higher than for simple random
sample
Disadvantage: Requires sampling frame
20
Multistage Cluster Sampling
Used to survey large populations when sampling
frame not available, e.g. USA
For instance, in an educational survey, draw a
sample of states, then towns within states, then
schools within towns.
Prepare a sampling frame of students from
selected schools and use SRS.
21
Systematic Sampling
Useful when list of units exists or when units
arrive sequentially (cars through a toll booth).
Select first unit at random, then every kth unit.
In finite population, each unit has same
probability of selection (n/N)
(however not all samples are equally likely).
Must avoid choosing k to coincide with regular
cyclic variations in the data
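The selection rule can be sketched in Python as follows. The frame of 100 units, the interval k = 10, and the function name are assumptions made for illustration.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Pick a random start among the first k units, then take every kth unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # random start: 0, 1, ..., k-1
    return frame[start::k]

frame = list(range(100))          # hypothetical list of N = 100 units
k = 10
sample = systematic_sample(frame, k, seed=0)
```

Only k distinct samples are possible here, which is why each unit has selection probability 1/k = n/N but not all size-n subsets are equally likely.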
22
Questionnaire Design
Structured questions: responses should be mutually
exclusive and collectively exhaustive.
E.g. How many glasses of water do you drink per day?
-------------- 0 to 2
--------------- 3 to 5
--------------- 6 or more
Non-structured:
E.g. How many glasses of water do you drink per day?
Allow more individualized response, but more prone to
data entry errors.
23
Attitude questions
1. The homework load in this course is reasonable.
Strongly Disagree | Disagree | Neither Agree nor Disagree | Agree | Strongly Agree
Usually 5 to 9 categories.
(Should we assign numbers to these categories?)
(High to low or low to high?)
24
Problems with Question Wording
Double-barreled question
Leading question
One-sided question
Ambiguous question
Pretest! Pretest! Pretest!
(For more information, see Johnson & Wichern, Business Statistics)
25
26
Sensitive Questions
E.g. Have you ever used heroin?
Randomized Response may elicit more accurate responses.
Interviewer does not know what question respondent is answering.
E.g. Roll a die. If less than 3 then say whether statement 1 is true or false.
Otherwise say whether statement 2 is true or false.
Statement 1: I have used heroin.
Statement 2: I have not used heroin.
Let p = proportion of people who have used heroin
q = proportion of people answering statement 1 (can't be 0.5).
P(True) = P(True|1)P(1) + P(True|2)P(2) = pq + (1-p)(1-q)
Solve for p: p = [P(True) - (1-q)] / (2q - 1).
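A small Python sketch of solving the randomized-response equation for p. The value q = 2/6 follows from the die rule on the slide (a roll of 1 or 2); the true proportion 0.2 and the function name `estimate_p` are hypothetical, used only to check the algebra.

```python
def estimate_p(prop_true, q):
    """Solve P(True) = p*q + (1-p)*(1-q) for p."""
    if abs(2 * q - 1) < 1e-12:
        raise ValueError("q = 0.5 makes P(True) independent of p")
    return (prop_true - (1 - q)) / (2 * q - 1)

q = 2 / 6                 # die shows 1 or 2: respondent answers statement 1
p_true = 0.2              # hypothetical true proportion, for checking only
prop_true = p_true * q + (1 - p_true) * (1 - q)   # expected share of "True"
p_hat = estimate_p(prop_true, q)
```

In practice `prop_true` would be the observed fraction of "True" answers, so the estimate inherits its sampling variability.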
Question Sequencing
1. Demographics at end
2. Sensitive questions nearer to end
3. Same topic questions appear together
4. Go from general to specific
5. Avoid skipping around.
27
28
Experimental Studies
Purpose: Evaluate how a set of predictor variables (factors) affect a
response variable.
Treatment Factors are of primary interest. Values (Levels) are
controlled.
Nuisance Factors also affect response.
Treatment: particular combination of levels of treatment factors.
Experimental units (EUs): subjects to which treatments applied.
Treatment group: all EUs receiving same treatment
Run: observation on an EU under particular treatment condition.
Replicate: another independent run.
Sources of Error in Experimental Studies
Systematic Error: differences among EUs caused by
Confounding Factors
Random Error: inherent variability in responses of EUs.
Measurement Error: due to imprecision of measuring instruments.
29
Strategies to Control Error in Experimental Studies
Blocking: Divide sample into groups of similar EUs (same value for
nuisance factors).
E.g. In agricultural trials effect of nutrient and moisture gradients
can be controlled for by blocking on agricultural plots
Matching: EUs can be matched on nuisance factors, then each member
of match can be randomly assigned to different treatment (each
match is a block).
Regression Analysis: If value of nuisance factor is known can include as
covariate in final model.
Randomization: Randomly assign EUs to treatments.
Basic Idea: Block over those nuisance factors that can be easily
controlled and randomize over the rest
30
Basic Experimental Designs
Completely Randomized Design (CRD)
EUs assigned at random to treatments
Randomized Block Design (RBD)
EUs divided into homogeneous blocks
Treatments assigned randomly within blocks.
Randomized Complete Block Design (RCBD):
Blocks contain all treatments.
Randomized Incomplete Block Design (RIBD)
Blocks do not contain all treatments.
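As an illustration of a randomized complete block design, the Python sketch below shuffles the treatments independently within each block. The block names, unit labels, and treatments A, B, C are hypothetical.

```python
import random

def randomized_complete_block(blocks, treatments, seed=None):
    """Assign every treatment once per block, in random order within the block."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("each block needs one EU per treatment")
        order = list(treatments)
        rng.shuffle(order)                 # randomize within the block
        for unit, treatment in zip(units, order):
            assignment[unit] = (block, treatment)
    return assignment

blocks = {"plot1": ["u1", "u2", "u3"], "plot2": ["u4", "u5", "u6"]}
treatments = ["A", "B", "C"]
assignment = randomized_complete_block(blocks, treatments, seed=42)
```

A completely randomized design would instead shuffle all experimental units once, ignoring the blocks.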
31
Chapter 4: Summarizing & Exploring Data
(Descriptive Statistics)
Graphics! Graphics! Graphics!
(and some numbers)
Slides prepared by Elizabeth Newton (MIT) with some slides by
Jacqueline Telford (Johns Hopkins University) and Roy Welsch (MIT).
1
Graphical Excellence
Complex ideas communicated with
clarity, precision, and efficiency
Shows the data
Makes you think about substance rather than
method, graphic design, or something else
Many numbers in a small space
Makes large data sets coherent
Encourages the eye to compare different
pieces of the data
2
Charles Joseph Minard
Graphic Depicting Exports of Wine
from France (1864)
Available at
http://www.math.yorku.ca/SCS/Gallery/
Source: Minard, C. J. Carte figurative et approximative des quantités de vin français exportés
par mer en 1864. 1865. ENPC (École Nationale des Ponts et Chaussées), 1865.
Also available in: Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT:
Graphics Press, 2001.
3
Summarizing Categorical Data
A frequency table shows the number
of occurrences of each category.
Relative frequency is the proportion
of the total in each category.
Bar charts and Pie Charts are used
to graph categorical data. A Pareto
chart is a bar chart with categories
arranged from the highest to lowest
(QC: vital few from the trivial many).
Attraction          Frequency   Relative Frequency (%)
Vertical Drop          101            15.1
Roller Coaster A        54             8.1
Roller Coaster B        77            11.5
Water Park             155            23.1
Spinners                35             5.2
Tea Cups                81            12.1
Haunted House           79            11.8
Log Drop                88            13.1
Total                  670           100.0
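The relative frequencies and the Pareto ordering for this table can be reproduced with a few lines of Python. This is a sketch using the counts from the table; the variable names are illustrative.

```python
counts = {
    "Vertical Drop": 101, "Roller Coaster A": 54, "Roller Coaster B": 77,
    "Water Park": 155, "Spinners": 35, "Tea Cups": 81,
    "Haunted House": 79, "Log Drop": 88,
}

total = sum(counts.values())
# relative frequency (%) = 100 * count / total, rounded to one decimal place
rel_freq = {name: round(100 * c / total, 1) for name, c in counts.items()}
# Pareto chart order: categories from highest to lowest frequency
pareto_order = sorted(counts, key=counts.get, reverse=True)
```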
[Bar chart: Popularity of attractions at an amusement park; y-axis: Relative Frequency (%), 0.0 to 25.0; x-axis: attraction]
4
Pie Chart and Bar Chart of Attraction
Popularity at an Amusement Park
5
[Figure: pie chart and bar chart of relative frequency (%), 0.0 to 25.0, for the eight attractions: Vertical Drop, Roller Coaster A, Roller Coaster B, Water Park, Spinners, Tea Cups, Haunted House, Log Drop]
Charles Joseph Minard
Graph showing quantities of meat sent from various
regions of France to Paris using pie charts overlaid on a
map of France (1864)
Available at
Source: Minard, C. J. Carte figurative et approximative des quantités de viande de
boucherie envoyées sur pied par les départements et consommées à Paris. ENPC (École
Nationale des Ponts et Chaussées), 1858, pp. 44.
6
Plots for Numerical Univariate Data
Scatter plot (vs. observation number)
Histogram
Stem and Leaf
Box Plot (Box and Whiskers)
QQ Plot (Normal probability plot)
7
Scatter Plot of Iris Data
[Scatter plot: iris[, "Sepal W", "Setosa"] (2.5 to 4.0) vs. observation number (0 to 50)]
8
This graph was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
9
Scatter Plot of Iris Data with Observation
Number Indicated
[Scatter plot: iris21 (2.5 to 4.0) vs. observation number (0 to 50), with each point labeled by its observation number, 1 to 50]
plot(iris21)
text(iris21)
Plot of data using jitter function in S-Plus
[Two panels: x vs. observation number, and jitter(x) vs. observation number; x-axis 0 to 500, y-axis 0.0 to 3.0]
10
Run Chart
For time series data, it is often useful to plot the data in time
sequence. A run chart graphs the data against time.
[Run chart: Compression vs. Production Order (0 to 30)]
11
Always Plot Your Data Appropriately - Try Several Ways!
Histogram
Data: n=24 Gas Mileage
{31,13,20,21,24,25,25,27,28,
40,29,30,31,23,31,32,35,28,
36,37,38,40,50,17}
Gives a picture of the distribution of data.
Area under the histogram represents
sample proportion.
Use approx. sqrt(n) bins: if too many, the histogram is
too jagged; if too few, too smooth (no detail)
Shows if the distribution is:
Symmetric or skewed
Unimodal or bimodal
Gaps in the data may indicate a problem
with the measurement process.
Many quality control applications
Are there two processes?
Detection of rework or cheating
Tells if process meets the
specifications
[Histogram: Count (2.5 to 7.5) vs. Miles per gallon (10 to 55)]
Note: Bars touch for
continuous data, but do NOT
touch for discrete data.
12
Histogram of Iris Data
[Histogram of iris21: x-axis 2.5 to 4.0, counts 0 to 10]
13
Histogram of Iris Data with Density Curve
[Histogram of siris21 with density curve: x-axis 2.0 to 4.5, y-axis 0.0 to 1.0]
14
Stem and Leaf Diagram and Cumulative Distribution Function
Data: Gas Mileage
Stem | Leaf   | Count
5    | 0      | 1
4    |        |
4    | 00     | 2
3    | 5678   | 4
3    | 01112  | 5
2    | 557889 | 6
2    | 0134   | 4
1    | 7      | 1
1    | 3      | 1
[CDF plot: Cum Prob (0.1 to 1.0) vs. Miles per gallon (10 to 55)]
Stem and leaf diagram: shows distribution of data
similar to a histogram but
preserves the actual data.
Can see numerical patterns
in the data (like 40s and 50).
CDF plot: step occurs at each data value
(higher for more values at the
same data point).
15
Stem and Leaf Diagram for Iris Data
Decimal point is 1 place to the left of the colon
23 : 0
24 :
25 :
26 :
27 :
28 :
29 : 0
30 : 000000
31 : 0000
32 : 00000
33 : 00
34 : 000000000
35 : 000000
36 : 000
37 : 000
38 : 0000
39 : 00
40 : 0
41 : 0
42 : 0
43 :
44 : 0
16
Summary Statistics for Numerical Data
Measures of Location:
Mean (average): $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: middle of the ordered sample (like $x_{.5}$ for a distribution)
Order statistics: $x_{\min} = x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} = x_{\max}$
median $= x_{((n+1)/2)}$ if n is odd
median $= \frac{1}{2}\left[x_{(n/2)} + x_{(n/2+1)}\right]$ if n is even
Median of {0,1,2} is 1: n=3, so n+1=4 and (n+1)/2=2 (2nd value)
Median of {0,1,2,3} is 1.5 (assumes data is continuous): n=4, so average the 2nd and 3rd values
Mode: The most common value
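The odd/even rule for the median translates directly into code. The sketch below is illustrative (the function name `sample_median` is an assumption) and is checked against Python's `statistics.median` on the gas mileage data from the histogram slide.

```python
import statistics

def sample_median(values):
    """Middle order statistic if n is odd, mean of the two middle ones if n is even."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("empty sample")
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]            # x_((n+1)/2), 1-based index
    return (x[n // 2 - 1] + x[n // 2]) / 2    # average of x_(n/2) and x_(n/2+1)

mileage = [31, 13, 20, 21, 24, 25, 25, 27, 28, 40, 29, 30,
           31, 23, 31, 32, 35, 28, 36, 37, 38, 40, 50, 17]
median_mileage = sample_median(mileage)
```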
17
Mean or Median?
Appropriate summary of the center of the data?
Mean if the data has a symmetric distribution with light tails
(i.e. a relatively small proportion of the observations lie
away from the center of the data).
Median if the distribution has heavy tails or is asymmetric.
Extreme values that are far removed from the main body of the
data are called outliers.
Large influence on the mean but not on the median.
Right and left skewness (asymmetry)
RIGHT skewed: mode (high point) < median < mean (reverse alphabetic order)
LEFT skewed: mean < median < mode (alphabetic order)
18
Quantiles, Fractiles, Percentiles
For a theoretical distribution:
The pth quantile is the value $x_p$ of a random variable X
such that $P(X \le x_p) = p$. For the normal distribution:
In S-Plus: qnorm(p), 0<p<1, gives the quantile.
In S-Plus: pnorm(q) gives the probability.
For a sample:
The order statistics are the sample values in
ascending order. Denoted $X_{(1)}, \ldots, X_{(n)}$.
The pth quantile is the data value in the sorted
sample, such that a fraction p of the data is less than
or equal to that value.
19
20
Normal CDF
[Plot: pnorm(x) (0.0 to 1.0) vs. x (-3 to 3)]
qnorm(0.8)=0.8416212
pnorm(0.8416212)=0.8
An algorithm for finding sample quantiles:
1) Arrange observations from smallest to largest.
2) For a given proportion p, compute np
(the sample size n times p).
3) If np is NOT an integer, round up to the next
integer (ceiling(np)) and set the
corresponding observation = $x_p$.
4) If np IS an integer k, average the kth and
(k+1)st ordered values. This average is then $x_p$.
The text uses a different algorithm.
21
Quantiles, continued
(the pth quantile is the 100pth percentile)
Example:
Data: {0, 1, 2, 3, 4, 5, 6}
= {$x_{(1)}, x_{(2)}, x_{(3)}, x_{(4)}, x_{(5)}, x_{(6)}, x_{(7)}$}
n=7
Q1: ceiling(0.25*7) = 2, so Q1 = $x_{(2)}$ = 1 = 25th percentile
Q2: ceiling(0.50*7) = 4, so Q2 = $x_{(4)}$ = 3 = median (50th percentile)
Q3: ceiling(0.75*7) = 6, so Q3 = $x_{(6)}$ = 5 = 75th percentile
S-Plus gives different answers!
Different methods for calculating quantiles.
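The sample-quantile algorithm from these slides can be sketched in Python as below. The function name and the tolerance used to decide whether np is an integer are implementation assumptions; statistical packages generally use other conventions and may return different values.

```python
import math

def sample_quantile(values, p):
    """Slide algorithm: ceiling(np) if np is not an integer,
    otherwise the average of the kth and (k+1)st ordered values."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    x = sorted(values)
    np_ = len(x) * p
    k = round(np_)
    if abs(np_ - k) < 1e-9:                 # np is (numerically) an integer k
        return (x[k - 1] + x[k]) / 2
    return x[math.ceil(np_) - 1]            # 1-based position ceiling(np)

data = [0, 1, 2, 3, 4, 5, 6]
q1, q2, q3 = (sample_quantile(data, p) for p in (0.25, 0.50, 0.75))
```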
22
Measures of Dispersion (Spread, Variability):
Two data sets may have the same center but quite
different dispersions around it.
Two ways to summarize variability:
1. Give the values that divide the data into equal parts.
Median is the 50th percentile
The 25th, 50th, and 75th percentiles are called
quartiles (Q1, Q2, Q3) and divide the data into four
equal parts.
The minimum, maximum, and three quartiles are
called the five number summary of the data.
2. Compute a single number, e.g., range, interquartile
range, variance, and standard deviation.
23
Measures of Dispersion, continued
Range = maximum - minimum
Interquartile range (IQR) = Q3 - Q1
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$
Sample standard deviation: $s = \sqrt{s^2}$
Sample mean, variance, and standard deviations are sample analogs
of the population mean, variance, and standard deviation ($\mu$, $\sigma^2$, $\sigma$)
24
Other Measures of Dispersion
Sample Average of Absolute Deviations from the Mean:
$\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
Sample Median of Absolute Deviations from the Median:
Median of $\{|x_i - x_{.5}|,\ i = 1, \ldots, n\}$
25
Computations for Measures of Dispersion
Example:
Data: {0, 1, 2, 3, 4, 5, 6}
= {$x_{(1)}, x_{(2)}, x_{(3)}, x_{(4)}, x_{(5)}, x_{(6)}, x_{(7)}$}
mean = (0+1+2+3+4+5+6)/7 = 21/7 = 3
min = 0, max = 6
Q1 = $x_{(2)}$ = 1 = 25th percentile
Q2 = $x_{(4)}$ = 3 = median (50th percentile)
Q3 = $x_{(6)}$ = 5 = 75th percentile
Range = max - min = 6 - 0 = 6
IQR = Q3 - Q1 = 5 - 1 = 4
$s^2 = [(0^2+1^2+2^2+3^2+4^2+5^2+6^2) - 7(3^2)]/(7-1) = [91-63]/6 = 4.67$
s = sqrt(4.67) = 2.16
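These hand computations can be checked with a short Python sketch. The variable names are illustrative; the two variance expressions correspond to the definitional and shortcut forms of the sample variance, and the last two lines compute the absolute-deviation measures from the previous slide.

```python
import math
import statistics

data = [0, 1, 2, 3, 4, 5, 6]
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)
# definitional form: sum of squared deviations over n - 1
s2_def = sum((x - mean) ** 2 for x in data) / (n - 1)
# shortcut form: (sum of squares - n * mean^2) over n - 1
s2_short = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)
s = math.sqrt(s2_def)

mean_abs_dev = sum(abs(x - mean) for x in data) / n
med = statistics.median(data)
median_abs_dev = statistics.median(abs(x - med) for x in data)
```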
26
Sample Variance and Standard Deviation
$s^2$ and s should only be used to summarize dispersion with
symmetric distributions.
For asymmetric distributions, a more detailed breakup of the
dispersion must be given in terms of quartiles.
For normal data and large samples:
50% of the data values fall between mean ± 0.67s
68% of the data values fall between mean ± 1s
95% of the data values fall between mean ± 2s
99.7% of the data values fall between mean ± 3s
For normally distributed data:
IQR = (mean + 0.67s) - (mean - 0.67s) = 1.34s
27
28
Standard Normal Density
[Plot: dnorm(x) (0.0 to 0.4) vs. x (-4 to 4), with the central 68% region marked]
Box (and Whiskers) Plots
Visual display of summary of data (more than five numbers)
Outlier Box Plot and Quantile Box Plot
Data: Gas Mileage
[Figure: outlier box plot and quantile box plot of gas mileage; labels: Rectangle, median, Q1, Q3, 10th percentile, 90th percentile]
IQR = Q3 - Q1
Upper Fence = Q3 + 1.5 x IQR
Lower Fence = Q1 - 1.5 x IQR
Two lines are called whiskers
and extend to the most extreme
data values that are still inside
the fences.
Observations outside the
fences are regarded as
possible outliers and are
denoted by dots and circles or
asterisks.
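A Python sketch of the fence computation for the gas mileage data. It assumes quartiles are found with the ceiling(np)/average rule from the quantile slides; software that uses a different quantile method may place the fences slightly differently.

```python
import math

def sample_quantile(values, p):
    """ceiling(np) rule; average the kth and (k+1)st values if np is an integer k."""
    x = sorted(values)
    np_ = len(x) * p
    k = round(np_)
    if abs(np_ - k) < 1e-9:
        return (x[k - 1] + x[k]) / 2
    return x[math.ceil(np_) - 1]

mileage = [31, 13, 20, 21, 24, 25, 25, 27, 28, 40, 29, 30,
           31, 23, 31, 32, 35, 28, 36, 37, 38, 40, 50, 17]

q1, q3 = sample_quantile(mileage, 0.25), sample_quantile(mileage, 0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# whiskers reach the most extreme observations still inside the fences
inside = [v for v in mileage if lower_fence <= v <= upper_fence]
whiskers = (min(inside), max(inside))
outliers = [v for v in mileage if v < lower_fence or v > upper_fence]
```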
29
Box Plot for Iris Data
[Box plot of iris21: axis 2.5 to 4.0]
30
QQ Plots
Compare Sample to Theoretical
Distribution
Order the data. The ith ordered data value is the pth quantile,
where p = (i - 0.5)/n, 0<p<1.
Text uses i/(n+1).
(Why can't we just say i/n?)
Obtain quantiles from theoretical distribution corresponding
to the values for p.
E.g. qnorm(p), in S-Plus for normal distribution.
Plot theoretical quantiles vs. empirical quantiles (sorted data).
S-Plus: n <- length(y); plot(qnorm((1:n - 0.5)/n), sort(y))
Fit line through first and third quartiles of each distribution.
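The same construction can be sketched in Python, with `statistics.NormalDist().inv_cdf` standing in for S-Plus's qnorm. The five data values and the function name are hypothetical; plotting the resulting pairs is left to whatever graphics tool is at hand.

```python
from statistics import NormalDist

def normal_qq_points(y):
    """Pair each sorted value with the normal quantile at p = (i - 0.5)/n."""
    n = len(y)
    std_normal = NormalDist()                      # mean 0, sd 1
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [std_normal.inv_cdf(p) for p in probs]
    return list(zip(theoretical, sorted(y)))

y = [3.1, 2.9, 3.5, 3.0, 3.4]                      # hypothetical sample
points = normal_qq_points(y)
```

Using (i - 0.5)/n keeps every p strictly between 0 and 1; with i/n the largest observation would get p = 1, whose normal quantile is infinite.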
31
32
QQ (Normal) Plot for Iris Data
[Normal QQ plot: iris21 (2.5 to 4.0) vs. Quantiles of Standard Normal (-2 to 2)]
Normalizing Transformations
Data can be non-normal in a number of ways, e.g., the
distribution may not be bell shaped or may be heavier tailed
than the normal distribution or may not be symmetric.
Only the departure from symmetry can be easily corrected by
transforming the data.
If the distribution is positively skewed, then the right tail needs
to be shrunk inward. The most common transformation used
for this purpose is the log transformation: x → log x (e.g.,
decibels, Richter, and Beaufort (?) scales); see Figure 4.11.
The square-root ($\sqrt{x}$) transformation provides a weaker
shrinking effect; it is frequently used for (Poisson) count data.
For negatively skewed data, use the exponential ($e^x$) or
squared ($x^2$) transformations.
33
34
Normal Probability Plot of data generated
from a certain distribution
[Normal probability plot: x (0 to 10) vs. normal quantiles (-2 to 2)]
35
Normal probability plot of log of same data
[Normal probability plot: log(x) (-1 to 2) vs. normal quantiles (-2 to 2)]
Histogram of the same data
36
[Histogram of x: x-axis 0 to 10, counts 0 to 40]
Summarizing Multivariate Data
37
When two or more variables are measured on each
sampling unit, the result is multivariate data.
If only two variables are measured the result is bivariate
data. One variable may be called the x variable and the
other the y variable.
We can analyze the x and y variable separately with the
methods we have learned so far, but these methods would
NOT answer questions about the relationship between x
and y.
What is the nature of the relationship between x and y (if
any)?
How strong is the relationship?
How well can one variable be predicted from the other?
Summarizing Bivariate Categorical Data
Two-way Table
Overall Job Satisfaction
Annual Salary      | Very Dissatisfied | Slightly Dissatisfied | Slightly Satisfied | Very Satisfied | Row Sum
Less than $10,000  |        81         |          64           |         29         |       10       |   184
$10,000-25,000     |        73         |          79           |         35         |       24       |   211
$25,000-50,000     |        47         |          59           |         75         |       58       |   239
More than $50,000  |        14         |          23           |         84         |       69       |   190
Column Sum         |       215         |         225           |        223         |      161       |   824
38
The numbers in the cells are the frequencies of each possible
combination of categories.
Cell, row and column percentages can be computed to assess
distribution.
Column Percentages for Income and Job
Satisfaction Table
Overall Job Satisfaction
Annual Salary      | Very Dissatisfied | Slightly Dissatisfied | Slightly Satisfied | Very Satisfied
Less than $10,000  |       37.7        |         28.4          |        13.0        |      6.2
$10,000-25,000     |       34.0        |         35.1          |        15.7        |     14.9
$25,000-50,000     |       21.9        |         26.2          |        33.6        |     36.0
More than $50,000  |        6.5        |         10.2          |        37.7        |     42.9
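Column percentages divide each cell by its column total. The Python sketch below recomputes them from the two-way frequency table of salary by job satisfaction; the dictionary layout is an illustrative choice.

```python
# rows: annual salary; columns: very dissatisfied, slightly dissatisfied,
# slightly satisfied, very satisfied
table = {
    "Less than $10,000": [81, 64, 29, 10],
    "$10,000-25,000":    [73, 79, 35, 24],
    "$25,000-50,000":    [47, 59, 75, 58],
    "More than $50,000": [14, 23, 84, 69],
}

col_sums = [sum(col) for col in zip(*table.values())]
col_pct = {
    row: [round(100 * cell / total, 1) for cell, total in zip(cells, col_sums)]
    for row, cells in table.items()
}
```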
39
Simpson's Paradox
Lurking variables [excluded from
consideration] can change or
reverse a relation between two
categorical variables!
40
Doctors' Salaries
The interpreter of a survey of doctors'
salaries in 1980 and again in 1990
concluded that their average income
actually declined from $90,000 in 1980
to $88,000 in 1990.
Income is measured here in nominal
(not adjusted for inflation) dollars.
41
What about the Rest of the
Story ?
What deductive piece of logic might
clarify the real meaning of this particular
pair of statistics?
Look more deeply: Is there a piece
missing?
Here is a very simple breakdown of the
numbers that may help.
42
Doctors' Salaries by Age
        |          1980           |          1990
Age     | fraction, f1 | Income   | fraction, f2 | Income
<=45    |     0.5      | $60,000  |     0.7      | $70,000
>45     |     0.5      | $120,000 |     0.3      | $130,000
Mean    |              | $90,000  |              | $88,000
43
Conclusion
If MD salaries are broken into two
categories by age:
Doctors younger than 45 constituted 50%
of the MD population in 1980 and 70% in
1990
Younger doctors tend to earn less than
older, more experienced doctors
Parsed by age, MD salaries increased in
both age categories!
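The reversal can be verified numerically: the overall mean is a weighted average of the group means, so a shift in the age mix can lower it even when both groups gain. A minimal Python sketch using the figures from the salary-by-age table (the data structure is illustrative):

```python
# (fraction of doctors, mean income) by age group and year
groups = {
    "<=45": {"1980": (0.5, 60_000), "1990": (0.7, 70_000)},
    ">45":  {"1980": (0.5, 120_000), "1990": (0.3, 130_000)},
}

def overall_mean(year):
    """Weighted average of group incomes, weights = group fractions."""
    return sum(frac * income for frac, income in (g[year] for g in groups.values()))

mean_1980 = overall_mean("1980")
mean_1990 = overall_mean("1990")
every_group_rose = all(g["1990"][1] > g["1980"][1] for g in groups.values())
```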
44
Gender Bias in Graduate Admissions
For this example, see Johnson and Wichern, Business Statistics:
Decision Making with Data. Wiley, First Edition, 1997.
45
Statistical ideal: randomized study
Gender should be randomly assigned to applicants!
This would automatically balance out the departmental factor,
which is not controlled for in the original plaintiff (observational)
study.
Practical reality
Gender cannot be assigned randomly.
Control for department factor by comparing admission within
department, i.e. controlling for the confounding factor after
completion of the study.
46
There are lies, damn lies and
then there are statistics!
Benjamin Disraeli
47
Summarizing Bivariate Numerical Data
No. | Method 1 ($x_i$) | Method 2 ($y_i$)
1   |       88        |       86
2   |       78        |       81
3   |       90        |       87
4   |       91        |       90
5   |       89        |       89
6   |       79        |       80
7   |       76        |       74
8   |       80        |       78
9   |       78        |       76
10  |       90        |       86
[Scatter plot: Method 2 (0 to 100) vs. Method 1 (75 to 95)]
Is it easier to grasp the relationship in the
data between Method 1 and Method 2 from
the Table or from the Figure (scatter plot)?
48
Labeled Scatter Plot
Year | Country A | Country B | Country C | Country D
1965 |   64.7    |   64.8    |   61.1    |   86.2
1970 |   65.0    |   65.2    |   61.2    |   86.5
1975 |   66.8    |   66.3    |   63.0    |   87.4
1980 |   66.9    |   67.4    |   62.8    |   87.0
1985 |   67.9    |   68.5    |   63.1    |   89.2
1990 |   68.3    |   69.1    |   63.5    |   89.4
1995 |   70.8    |   69.4    |   64.3    |   90.1
2000 |   71.7    |   70.0    |   65.1    |   90.5
Can you see the
improvements in
the literacy rates
for these four
countries more
easily in the
Table or in the
Figure?
[Labeled scatter plot: Literacy Rate (0.0 to 100.0) vs. Year (1960 to 2005), one series each for Country A, Country B, Country C, Country D]
49
Sample Correlation Coefficient
A single numerical summary statistic which measures
the strength of a linear relationship between x and y.
r = covar(x,y)/(stddev(x)*stddev(y))
$r = \frac{s_{xy}}{s_x s_y}$, where $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
Properties similar to the population correlation coefficient
Unitless quantity
Takes values between -1 and 1
The extreme values are attained if and only if the points ($x_i$, $y_i$)
fall exactly on a straight line (r = -1 for a line with negative slope
and r = +1 for a line with positive slope.)
Takes values close to zero if there is no linear relationship
between x and y.
See Figures 4.15, 4.16, 4.17 (a) and (b)
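A Python sketch of the sample correlation coefficient, computed as the sample covariance divided by the product of the sample standard deviations. It is applied to the Method 1 / Method 2 measurements from the bivariate numerical data slide; the function name is an assumption.

```python
import math

def sample_correlation(x, y):
    """r = s_xy / (s_x * s_y), using n - 1 in each denominator."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

method1 = [88, 78, 90, 91, 89, 79, 76, 80, 78, 90]
method2 = [86, 81, 87, 90, 89, 80, 74, 78, 76, 86]
r = sample_correlation(method1, method2)
```

Because the n - 1 factors cancel, r does not depend on whether n or n - 1 is used, as long as the choice is consistent.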
50
What is the correlation?
[Scatter plot: y (0 to 120) vs. x (0 to 100)]
51
[Scatter plot: y (0 to 15) vs. x (-4 to 4)]
52
[Scatter plot: y (-1.0 to 1.0) vs. x (-4 to 4)]
53
Correlation and Causation
High correlation is frequently mistaken for a cause and effect
relationship. Such a conclusion may not be valid in
observational studies, where the variables are not controlled.
A lurking variable may be affecting both variables.
One can only claim association, not causation.
Countries with high fat diets tend to have higher incidences of
cancer. Can we conclude causation?
A common lurking variable in many studies is time order.
Wealth and health problems go up with age.
Does wealth cause health problems?
Sometimes correlations can be found without any plausible
explanation, e.g., sun spots and economic cycles.
54
Plots for Multivariate Data
Side by Side Box Plots
Scatter plot matrix
Three dimensional plots
Brush and Spin plots add motion
Maps for spatial data
55
56
Box Plots of Auto Data
widths indicate number of each type
[Side-by-side box plots: fuel.frame[, "Mileage"] (20 to 35) by fuel.frame[, "Type"]: Compact, Large, Medium, Small, Sporty, Van]
57
Scatter plot matrix Iris (Versicolor)
[Scatter plot matrix of Sepal.L., Sepal.W., Petal.L., Petal.W.]
58
Galaxy S-PLUS Language Reference
Radial Velocity of Galaxy NGC7531
SUMMARY:
The galaxy data frame records the radial velocity of a spiral galaxy
measured at 323 points in the area of sky which it covers. All the
measurements lie within seven slots crossing at the origin. The positions
of the measurements are given by four variables (columns).
ARGUMENTS:
east.west
the east-west coordinate. The origin, (0,0), is near the center of the
galaxy, east is negative, west is positive.
north.south
the north-south coordinate. The origin, (0,0), is near the center of the
galaxy, south is negative, north is positive.
angle
degrees of counter-clockwise rotation from the horizontal of the slot
within which the observation lies.
radial.position
signed distance from origin; negative if east-west coordinate is
negative.
velocity
radial velocity measured in km/sec.
This output was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Galaxy Data
[Scatter plot matrix of east.west, north.south, radial.position, velocity]
59
Galaxy 3D
60
61
Earthquake Data
[Scatter plot matrix of longitude, latitude, magnitude]
62
Earthquake 3D
Narrative Graphics of Space and Time
Adding spatial dimensions to a graph so that
the data are moving over space and time can
enhance the explanatory power of time series
displays.
The Classic of Charles Joseph Minard (1781-
1870) shows the terrible fate of Napoleon's
army during his Russian campaign of 1812.
A copy of the map is available at
63
Map Source: Minard, C. J. Carte figurative des pertes successives en hommes de l'armée
qu'Annibal conduisit d'Espagne en Italie en traversant les Gaules (selon Polybe). Carte figurative
des pertes successives en hommes de l'armée française dans la campagne de Russie, 1812-
1813. École Nationale des Ponts et Chaussées (ENPC), 1869. Also available in: Tufte, Edward
R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 2001.
Beginning at the left on the Polish-Russian border near
the Niemen River, the thick band shows the size of the
army (422,000) as it invaded Russia in June 1812.
The width of the band indicates the size of the
army.
The army reached a sacked and deserted Moscow
with 100,000 men.
Napoleon's retreat path from Moscow is depicted by
a dark, lower band, linked to a temperature scale
and dates at the bottom.
The men struggled into Poland with only 10,000
troops remaining.
64
Minard's graphic tells a rich, coherent
story with its multivariate data, far more
enlightening than just a single number.
SIX variables are plotted:
The size of the army
Its location on a two-dimensional
surface
Direction of army's movement
Temperature as a function of time
during the retreat
"It may well be the best statistical graphic
ever drawn." - Edward Tufte (The Visual Display of
Quantitative Information. Cheshire, CT: Graphics Press, 2001, p. 40)
65
Scatter plot matrix of air data set in S-Plus
66
[Scatter plot matrix of ozone, radiation, temperature, wind]
67
plot(temperature,ozone)
[Scatter plot: ozone (1 to 5) vs. temperature (60 to 90)]
Fitting Lines
We often try to fit a straight line to bivariate data as a way to summarize
it:
y = data = fit + residual
fit = a + bx
The parameters (coefficients) a and b can be found in many ways. Least squares
is commonly used:
$\min_{a,b} \sum_{i=1}^{n} (y_i - a - b x_i)^2$
$b = S_{xy} / S_x^2, \quad a = \bar{y} - b\bar{x}$
The fit is often denoted by $\hat{y}_i = a + b x_i$. The residuals are $y_i - \hat{y}_i$.
What about curvature and outliers?
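The least-squares fit can be sketched in a few lines. This is an illustration only: it uses Python's standard library rather than S-Plus, the data are made up, and the function name `least_squares_line` is hypothetical. The slope is computed as a ratio of sums of cross-products and squares, which equals the ratio of sample covariance to sample variance.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (a, b) minimizing sum((y_i - a - b*x_i)**2)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sums of cross-products and squares about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Made-up data for illustration (not the ozone data set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(x, y)

# Residuals: data minus fit
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(a, b)
```

With an intercept in the model, the residuals from a least-squares line always sum to zero (up to rounding).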
68
Resistant Line
Divide the x data into thirds. Find the median of x in each third, and the
median of the y's that correspond to the x's in each third.
Call these three pairs $(x_a, y_a)$, $(x_b, y_b)$, $(x_c, y_c)$. Fit a least-squares
line to these three points.
Or consider other metrics:
$\min_{a,b} \sum_{i=1}^{n} |y_i - a - b x_i|$
$\min_{a,b} \operatorname{median}_i |y_i - a - b x_i|$
These are alternatives to least-squares.
69
70
abline(lm(ozone~temperature))
[Figure: scatter plot of ozone against temperature with the fitted least-squares line]
Prediction and Residuals
Fitted lines can be used to predict. If we go too far beyond the
range of the x-data, we can expect poor results. Consider the problems
of interpolation and extrapolation.
Examination of residuals helps tell us how well our model (a line)
fits the data.
We also compute
$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
and call s the standard deviation of the residuals. Note the use of n - 2
because two degrees of freedom are used to find a and b.
71
Residual Plots
72
1. against fitted values
2. against explanatory variable
3. against other possible explanatory variables
4. against time, if applicable.
We want these pictures to look random, with no pattern.
Outliers and Influence
Values of x far away from the line have a lot of leverage on the line.
Values of y with large residuals at high leverage points will
usually be quite influential on the fitted line.
We can check by setting influential points aside and comparing
fits and residuals.
$\hat{y}_{(i)}$
73
Plot of residuals vs. observation
number for ozone data
[Figure: resid(lmfit) plotted against observation number (0 to 100)]
74
Residuals vs. Fitted Values for ozone data
[Figure: resid(lmfit) plotted against fitted(lmfit)]
Smoothing
Fitting curves to data
Separate signal from noise
Fitted values, $\hat{y}$, are a weighted average of the
response y.
Weights are a function of predictor x.
Degrees of freedom indicate roughness
Simple linear regression, df=2
75
76
lines(smooth.spline(temperature,ozone,df=16.5))
[Figure: ozone against temperature with smoothing spline, df = 16.5]
77
lines(smooth.spline(temperature,ozone,df=6))
[Figure: ozone against temperature with smoothing spline, df = 6]
Time-Series/Runs Chart
Plot of Compression vs. Time (Order of Production)
This is an example of a process not in statistical control, as
seen from the downward drift.
[Figure: runs chart of Compression (0 to 100) against Production Order (0 to 30)]
The usual statistics procedures (such as means, standard
deviation, confidence interval, hypothesis testing) should
NOT be applied until the process has been stabilized.
78
Time-Series Data
Data obtained at successive time points for the same
sampling unit(s).
A time series typically consists of the following components.
1. Stable component
2. Trend component
3. Seasonal component
4. Random component
5. Cyclic (long term) component
Univariate time series $\{x_t,\ t = 1, 2, \ldots, T\}$
Time-series plot: $x_t$ vs. time
79
Data Smoothing and Forecasting
Two types of averages for time-series data:
1. Moving averages
2. Exponentially weighted averages
These should be used only if mean is constant (process
is in statistical control or is stationary) or mean varies
slowly.
Regression techniques can be used to model trends.
More advanced methods are needed to model
seasonality and dependence between successive
observations (autocorrelation).
80
(Arithmetic) Moving Averages (MA)
The average of a set of w successive data values (called a
window); the oldest data is successively dropped off.
$MA_t = \frac{x_{t-w+1} + \cdots + x_t}{w}$, for $t = w, w+1, \ldots, T$
The bigger the window (w), the more the smoothing.
MA forecast: $\hat{x}_t = MA_{t-1}$
Forecast error: $e_t = x_t - \hat{x}_t = x_t - MA_{t-1}$, $t = 2, \ldots, T$
Mean Absolute Percent Error: $MAPE = \left( \frac{1}{T-1} \sum_{t=2}^{T} \left| \frac{e_t}{x_t} \right| \right) \times 100\%$
(error in eqn 4.12 in textbook,
x not y in the denominator)
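A small sketch of the moving average, its one-step forecast, and MAPE, written in Python as an assumption (the course uses S-Plus). The series and the helper names `moving_average` and `mape` are hypothetical. One design choice to note: the MAPE here averages over the periods that actually have a forecast (t = w+1, ..., T), which matches the slide's T-1 terms only when w = 1.

```python
def moving_average(x, w):
    """MA_t for t = w, ..., T, keyed by the 1-based time index t."""
    # x[t - w:t] holds x_{t-w+1}, ..., x_t because the list is 0-based
    return {t: sum(x[t - w:t]) / w for t in range(w, len(x) + 1)}

def mape(x, forecasts):
    """Mean absolute percent error over the periods that have a forecast."""
    terms = [abs((x[t - 1] - f) / x[t - 1]) for t, f in forecasts.items()]
    return 100.0 * sum(terms) / len(terms)

x = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]   # made-up series
w = 3
ma = moving_average(x, w)

# Forecast for period t is the moving average at t - 1
forecasts = {t + 1: m for t, m in ma.items() if t + 1 <= len(x)}
error_pct = mape(x, forecasts)
print(ma, forecasts, error_pct)
```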
81
Exponentially Weighted Moving Averages
Uses all data, but the most recent data is weighted the heaviest.
$EWMA_t = w x_t + (1 - w) EWMA_{t-1}$
where 0 < w < 1 is the smoothing constant (usually 0.2 to 0.3).
EWMA forecast: $\hat{x}_t = EWMA_{t-1}$
Forecast error: $e_t = x_t - \hat{x}_t = x_t - EWMA_{t-1}$
Alternative formula: $EWMA_t = EWMA_{t-1} + w e_t$
Interpretation: If the forecast error is positive (forecast
underestimated the actual value), the next period's forecast
is adjusted upward by a fraction of the forecast error.
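The two forms of the EWMA update can be checked against each other numerically. The sketch below is in Python (an assumption, since the course uses S-Plus), the data are made up, and starting the recursion at the first observation is a common convention assumed here rather than something the slide specifies.

```python
def ewma_weighted(x, w):
    """EWMA_t = w*x_t + (1 - w)*EWMA_{t-1}, started at x[0] (assumed)."""
    level = x[0]
    out = [level]
    for xt in x[1:]:
        level = w * xt + (1 - w) * level
        out.append(level)
    return out

def ewma_error_form(x, w):
    """EWMA_t = EWMA_{t-1} + w*e_t, with e_t = x_t - EWMA_{t-1}."""
    level = x[0]
    out = [level]
    for xt in x[1:]:
        e = xt - level           # forecast error for this period
        level = level + w * e    # move the forecast a fraction w toward x_t
        out.append(level)
    return out

x = [10.0, 12.0, 11.0]   # made-up series
a = ewma_weighted(x, 0.2)
b = ewma_error_form(x, 0.2)
print(a, b)
```

Both functions return the same values up to rounding, which is the algebraic identity behind the alternative formula.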
82
Autocorrelation Coefficient
For time-series data, observations separated by a
specified time period (called a lag) are said to be lagged.
First-order autocorrelation or the serial correlation
coefficient between observations with lag = 1:
$r_1 = \frac{\sum_{t=2}^{T} (x_t - \bar{x})(x_{t-1} - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}$
The k-th order autocorrelation coefficient:
$r_k = \frac{\sum_{t=k+1}^{T} (x_t - \bar{x})(x_{t-k} - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}$
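As a sketch, the k-th order autocorrelation coefficient can be computed directly from its definition. Python is used here as an assumption (the course uses S-Plus); the function name `autocorr` and the short series are hypothetical.

```python
from statistics import mean

def autocorr(x, k=1):
    """k-th order autocorrelation: lagged cross-products over total sum of squares."""
    x_bar = mean(x)
    # Sum over t = k+1, ..., T in 1-based terms (indices k, ..., T-1 here)
    num = sum((x[t] - x_bar) * (x[t - k] - x_bar) for t in range(k, len(x)))
    den = sum((xt - x_bar) ** 2 for xt in x)
    return num / den

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up, steadily increasing series
r1 = autocorr(x, 1)
r2 = autocorr(x, 2)
print(r1, r2)
```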
83
84
Lag Plots in S-Plus
lag.plot(x) or plot(x[1:(n-i)],x[(i+1):n])
[Figure: lagged scatterplots of Series 1 at lags 1 through 6]
Housing starts 1966:1974, lagged scatterplots
These graphs were created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
John W. Tukey (1915 - 2000)
Statistician at Princeton Univ. and Bell
Labs
Co-developer of the Fast Fourier Transform
Coined the terms "bit" (binary digit) and
"software"
"An approximate answer to the right
problem is worth a great deal more than
a precise answer to the wrong problem."
Developed new graphical displays
(stem-and-leaf and box plots) to examine
the data, as a reaction to the
"mathematization" of statistics.
85
Review of
Probability
Corresponds to Chapter 2 of
Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT),
with some slides by Jacqueline Telford
(Johns Hopkins University)
1
Concepts (Review)
A population is a collection of all units of interest.
A sample is a subset of a population that is actually observed.
A measurable property or attribute associated with each unit of a population
is called a variable.
A parameter is a numerical characteristic of a population.
A statistic is a numerical characteristic of a sample.
Statistics are used to infer the values of parameters.
A random sample gives a non-zero chance to every unit of the population to
enter the sample.
In probability, we assume that the population and its parameters are known
and compute the probability of drawing a particular sample.
In statistics, we assume that the population and its parameters are unknown
and the sample is used to infer the values of the parameters.
Different samples give different estimates of population parameters (called
sampling variability).
Sampling variability leads to sampling error.
Probability is deductive (general -> particular)
Statistics is inductive (particular -> general)
2
Difference between Statistics and Probability
Statistics: Given the information
in your hand, what is in the box?
Probability: Given the information
in the box, what is in your hand?
Based on: Statistics, Norma Gilbert, W.B. Saunders Co., 1976.
3
Probability Concepts
Random experiment: a procedure whose outcome cannot be
predicted in advance. E.g. toss a coin twice
Sample Space (S): "The finest grain, mutually exclusive,
collectively exhaustive listing of all possible outcomes"
(Drake, Fundamentals of Applied Probability Theory)
S = {(H,H), (H,T), (T,H), (T,T)}
Event (A): a set of outcomes (subset of S). E.g. No heads
A = {(T,T)}
Union (or): E.g. A = heads on first, B = heads on second
$A \cup B$ = {(H,T), (H,H), (T,H)}
Intersection (and): E.g. A = heads on first, B = heads on second
$A \cap B$ = {(H,H)}
Complement of Event A: set of all outcomes not in A. E.g.
A = {(T,T)}, $A^c$ = {(H,H), (H,T), (T,H)}
4
5
Venn Diagram
[Figure: Venn diagram of events A and B]
Axioms of Probability
Associated with each event A in S is the probability of A, P(A)
Axioms:
1. $P(A) \geq 0$
2. $P(S) = 1$ where S is the sample space
3. $P(A \cup B) = P(A) + P(B)$ if A and B are mutually exclusive
E.g. P(ace or king) = P(ace) + P(king) = 1/13 + 1/13 = 2/13.
Theorems about probability can be proved using these axioms
and these theorems can be used in probability calculations.
$P(A) = 1 - P(A^c)$ (see birthday problem on p. 13)
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
E.g. P(ace or black) = P(ace) + P(black) - P(ace and black)
= 4/52 + 26/52 - 2/52 = 28/52 = 7/13
6
Conditional Probability:
$P(A|B) = P(A \cap B)/P(B)$
$P(A \cap B) = P(A|B)P(B)$
E.g. Drawing a card from a deck of 52 cards,
P(Heart) = 1/4.
However, if it is known that the card is red,
P(Heart | Red) = 1/2.
Sample space has been reduced to the 26 red cards.
(See page 16)
7
Independence
P(A|B) = P(A)
There are situations in which knowing that event B occurred gives
no information about event A. E.g. knowing that a card is black
gives no information about whether it is an ace.
P(ace | black) = 2/26 = 4/52 = P(ace).
If two events are independent then $P(A \cap B) = P(A)P(B)$:
$P(A \cap B) = P(A|B)P(B) = P(A)P(B)$
E.g. P(ace of hearts) = P(ace) * P(hearts) = 4/52 * 13/52 = 1/52
Independent events are not the same as disjoint events.
There is strong dependence between disjoint events.
E.g. card is red means it can't be black. P(A|B) = 0.
8
Summary
If A and B are disjoint:
$P(A \cup B) = P(A) + P(B)$
$P(A \cap B) = 0$
If A and B are independent:
$P(A \cap B) = P(A) P(B)$
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
9
Bayes' Theorem
$P(A \cap B) = P(A|B) P(B) = P(B|A) P(A)$
$P(B|A) = P(A|B) P(B) / P(A)$
P(B) = prior probability
P(B|A) = posterior probability
E.g. P(heart | red) = P(red | heart) * P(heart) / P(red) =
1 * 0.25 / 0.5 = 0.5
Monty Hall problem (page 20)
10
Sensor Problem
Assume that there are two chemical hazard sensors: A and B.
Let P(A falsely detecting a hazardous chemical)=0.05 and the
same for B.
What is the probability of both sensors falsely detecting a
hazardous chemical?
$P(A \cap B) = P(A|B)P(B) = P(A)P(B) = 0.05 \times 0.05 = 0.0025$
only if A and B are independent (use different detection methods).
If A and B are both fooled by the same chemical substance,
then $P(A \cap B) = P(A|B)P(B) = 1 \times 0.05 = 0.05$,
which is 20 times the rate of false alarms (same type of sensor).
DON'T assume independence without good reason!
11
HIV Testing Example (made-up data)

                     HIV +    HIV -    Total
Test positive (+)       95      495      590
Test negative (-)        5     9405     9410
Total                  100     9900    10000

P(HIV +) = 100/10000 = 0.01 (prevalence)
P(Test + | HIV +) = 95/100 = 0.95 (sensitivity; want this to be high)
P(Test - | HIV -) = 9405/9900 = 0.95 (specificity; want this to be high)
P(Test - | HIV +) = 5/100 = 0.05 (false negatives; want this to be low)
P(Test + | HIV -) = 495/9900 = 0.05 (false positives; want this to be low)
P(HIV + | Test +) = 95/590 = 0.16
This is one reason why we don't have mass HIV screening.
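The same posterior probability follows from Bayes' theorem using only prevalence, sensitivity, and specificity. The sketch below uses Python as an assumption (the course uses S-Plus); the variable names are illustrative.

```python
prevalence = 0.01    # P(HIV +)
sensitivity = 0.95   # P(Test + | HIV +)
specificity = 0.95   # P(Test - | HIV -)

# Total probability of a positive test: true positives plus false positives
p_test_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(HIV + | Test +)
p_hiv_given_pos = sensitivity * prevalence / p_test_pos
print(round(p_hiv_given_pos, 2))
```

Even with a test that is 95% sensitive and 95% specific, most positives are false positives when the condition is rare.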
Suggestions for Solving Probability Problems
Draw a picture
Venn diagram
Tree or event diagram (Probabilistic Risk Assessment)
Sketch
Write out all possible combinations if feasible
Do a smaller scale problem first
Figure out the algorithm for the solution
Increment the size of the problem by one and check
algorithm for correctness
Generalize algorithm (mathematical induction)
13
Counting rules
Number of Possible Arrangements of Size r from n Objects:

             Without Replacement      With Replacement
Ordered:     $\frac{n!}{(n-r)!}$      $n^r$
Unordered:   $\binom{n}{r}$           $\binom{n+r-1}{r}$

Source: Casella, George, and Roger L. Berger. Statistical Inference. Belmont, CA: Duxbury Press, 1990, page 16.
14
Counting rules (from Casella & Berger)
For these examples, see pages 15-16 of: Casella, George, and Roger L.
Berger. Statistical Inference. Belmont, CA: Duxbury Press, 1990.
15
Birthday Problem
At a gathering of s randomly chosen students, what is the probability
that at least 2 will have the same birthday?
P(at least 2 have same birthday) =
1 - P(all s students have different birthdays).
Assume 365 days in a year. Think of the students' birthdays as a sample
of these 365 days.
The total number of possible outcomes is:
N = 365^s (ordered, with replacement)
The number of ways that s students can have different birthdays is
M = 365!/(365-s)! (ordered, without replacement)
P(all s students have different birthdays) is M / N.
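The ratio M / N can be evaluated exactly with integer arithmetic. This is a sketch in Python (an assumption; the slides use S-Plus), with hypothetical function names; `math.perm` requires Python 3.8 or later.

```python
from math import perm

def p_all_different(s, days=365):
    """M / N: ordered samples without replacement over ordered samples with replacement."""
    return perm(days, s) / days ** s

def p_shared_birthday(s, days=365):
    """Probability that at least 2 of s students share a birthday."""
    return 1 - p_all_different(s, days)

for s in (10, 23, 50):
    print(s, p_shared_birthday(s))
```

With 23 students the probability of a shared birthday already exceeds one half.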
16
Probability that all students have
different birthdays
[Figure: probability (0 to 1) that all students have different birthdays against number of students (0 to 80)]
17
See Harry Potter and the
Sorcerer's Stone by J.K.
Rowling.
18
Another Counting Rule
The number of ways of classifying n items into k
groups with $r_i$ in group i, $r_1 + r_2 + \cdots + r_k = n$, is:
$\frac{n!}{r_1!\, r_2!\, r_3! \cdots r_k!}$
For example: How many ways are there to assign
100 incoming students to the 4 houses at
Hogwarts?
(1.6 * 10^57)
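The quoted figure can be reproduced if one assumes 25 students per house; that group size is an assumption made here, since the slide does not state it. The sketch is in Python (the course uses S-Plus) and the function name `multinomial` is hypothetical.

```python
from math import factorial

def multinomial(*group_sizes):
    """n! / (r_1! r_2! ... r_k!) for groups that together contain n items."""
    count = factorial(sum(group_sizes))
    for r in group_sizes:
        count //= factorial(r)   # each division is exact
    return count

# Assumed: 100 students split evenly, 25 to each of 4 houses
ways = multinomial(25, 25, 25, 25)
print(f"{ways:.2e}")
```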
19
Random Variables
A random variable (r.v.) associates a unique numerical value with
each outcome in the sample space
Example:
X = 1 if coin toss results in a head
X = 0 if coin toss results in a tail
Discrete random variables: number of possible values is finite
or countably infinite: $x_1, x_2, x_3, x_4, x_5, x_6, \ldots$
Probability mass function (p.m.f.)
f(x) = P(X = x) (Sum over all possible values = 1 always)
Cumulative distribution function (c.d.f.)
$F(x) = P(X \leq x) = \sum_{k \leq x} f(k)$
See Table 2.1 on p. 21 (p.m.f. and c.d.f. for sum of two dice)
See Figure 2.5 on p. 22 (p.m.f. and c.d.f. graphs for two dice)
20
Continuous Random Variables
An r.v. is continuous if it can assume any value from
one or more intervals of real numbers
Probability density function (p.d.f.) f(x)
$f(x) \geq 0$
$\int_{-\infty}^{\infty} f(x)\, dx = 1$ (Area under the curve = 1 always)
$P(a \leq X \leq b) = \int_a^b f(x)\, dx$ for any $a \leq b$
21
P(0<X<1) for standard normal= area
under curve between 0 and 1
[Figure: standard normal density dnorm(x) for x from -4 to 4]
22
Cumulative Distribution Function
The cumulative distribution function (c.d.f.), denoted
F(x), for a continuous random variable is given by:
$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(y)\, dy$
$f(x) = \frac{dF(x)}{dx}$
23
P(0<Z<1) for standard normal= F(1)-F(0)
=0.8413-0.5 = 0.3413 (table page 674)
[Figure: standard normal c.d.f. pnorm(z) for z from -4 to 4]
24
Expected Value
The expected value or mean of a discrete r.v. X, denoted
by E(X), $\mu_X$, or simply $\mu$, is defined as:
$E(X) = \mu = \sum_x x f(x) = x_1 f(x_1) + x_2 f(x_2) + \cdots$
This is essentially a weighted average of the possible
values the r.v. can assume, weights = f(x)
The expected value of a continuous r.v. X is defined as:
$E(X) = \mu = \int x f(x)\, dx$
25
Variance and Standard Deviation
The variance of an r.v. X, denoted by Var(X), $\sigma_X^2$, or
simply $\sigma^2$, is defined as:
$Var(X) = \sigma^2 = E[(X - \mu)^2]$
$Var(X) = E[(X - \mu)^2] = E(X^2 - 2\mu X + \mu^2)$
$= E(X^2) - 2\mu E(X) + E(\mu^2)$
$= E(X^2) - 2\mu^2 + \mu^2$
$= E(X^2) - \mu^2$
$= E(X^2) - [E(X)]^2$
The standard deviation (SD) is the square root of the
variance. Note that the variance is in the square of
the original units, while the SD is in the original units.
See Example 2.17 on p. 26 (mean and variance of
two dice)
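The mean and variance of the sum of two fair dice can be computed exactly from the p.m.f. The sketch below uses Python with exact fractions as an assumption (the course uses S-Plus); it is not a reproduction of the textbook's worked example.

```python
from fractions import Fraction
from itertools import product

# p.m.f. of the sum of two fair dice: each of the 36 outcomes has probability 1/36
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, 0) + Fraction(1, 36)

mean = sum(x * p for x, p in pmf.items())                # E(X)
var = sum(x**2 * p for x, p in pmf.items()) - mean**2    # E(X^2) - [E(X)]^2
sd = float(var) ** 0.5
print(mean, var, sd)
```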
26
Quantiles and Percentiles
For $0 \leq p \leq 1$ the $p$th quantile (or the $100p$th percentile),
denoted by $\theta_p$, of a continuous r.v. X is defined by the
following equation:
$P(X \leq \theta_p) = F(\theta_p) = p$
$\theta_{.5}$ is called the median
See Example 2.20 on p. 30 (exponential distribution)
Jointly distributed random variables and independent
random variables
See pp. 30-33
27
Joint Distributions
For a discrete distribution:
f(x,y) = P(X=x,Y=y)
$f(x,y) \geq 0$ for all x and y
$\sum_x \sum_y f(x,y) = 1$
28
Marginal Distributions
$g(x) = P(X=x) = \sum_y f(x,y)$
$h(y) = P(Y=y) = \sum_x f(x,y)$
Independent if joint distribution factors
into product of marginal distributions
f(x,y) = g(x) h(y)
29
Conditional Distributions
f(y|x) = f(x,y) / g(x)
If X and Y are independent:
f(y|x) = g(x) h(y) / g(x) = h(y)
Conditional distribution is just a probability
distribution defined on a reduced
sample space.
For every x, $\sum_y f(y|x) = 1$
30
Covariance and Correlation
$Cov(X,Y) = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y)$
$= E(XY) - \mu_X \mu_Y$
If X and Y are independent, then E(XY) = E(X)E(Y) so the
covariance is zero. The other direction is not true.
Note that:
$E(XY) = \int \int x y f(x,y)\, dx\, dy$
$\rho_{XY} = corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$
See Examples 2.26 and 2.27 on pp. 37-38 (prob vs.
stat grades)
31
Example 2.25 in text
y = x with probability 0.5 and y = -x with probability 0.5
y is not independent of x, yet covariance is zero
[Figure: scatter plot of y (-40 to 40) against x (0 to 50)]
32
Two Famous Theorems
Chebyshev's Inequality: Let c > 0 be a constant. Then,
irrespective of the distribution of X,
$P(|X - \mu| \geq c) \leq \frac{\sigma^2}{c^2}$
See Example 2.29 on p. 41 (exact vs. Cheb. for two dice)
Weak Law of Large Numbers: Let $\bar{X}$ be the sample mean
of n i.i.d. observations from a population with finite mean $\mu$
and variance $\sigma^2$. Then, for any fixed c > 0,
$P(|\bar{X} - \mu| \geq c) \to 0$ as $n \to \infty$
33
Selected Discrete Distributions
Bernoulli trials: (single coin flip)
$f(x) = P(X = x) = p$ if x = 1 (success), $1 - p$ if x = 0 (failure)
E(X) = p and Var(X) = p(1-p)
Binomial distribution: (multiple coin flips)
X successes out of n trials
$f(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x = 0, 1, \ldots, n$
E(X) = np and Var(X) = np(1-p)
See Example 2.30 on p. 43 (teeth)
34
Selected Discrete Distributions (cont)
Hypergeometric: drawing balls from the box without
replacing the balls (as in the hand with the question
mark)
Poisson: number of occurrences of a rare event
Geometric: number of failures before the first success
Multinomial: more than two outcomes
Negative Binomial: number of trials to get r successes
Uniform: N equally likely events (1, 2, 3, ..., N)
See Table 2.5, p. 59 for properties of these distributions
35
Selected Continuous Distributions
Uniform: equally likely over an interval
Exponential: lifetimes of devices with no
wear-out (memoryless), interarrival times
when the arrivals are at random
Gamma: used to model lifetimes,
related to many other distributions
Lognormal: lifetimes (similar shape to
Gamma but with longer tail)
Beta: not equally likely over an interval
See Table 2.5, p. 59 for properties of these distributions
36
Normal Distribution
First discovered by de Moivre (1667-1754) in
1733.
Rediscovered by Laplace (1749-1827) and
also by Gauss (1777-1855) in their studies of errors
in astronomical measurements.
Often referred to as the Gaussian
distribution.
37
Carl Friedrich Gauss (1777 - 1855)
Photograph courtesy of John L. Telford, John Telford Photography. Used with permission. Currency from 1991.
38
Karl Pearson (1857 - 1936)
"Many years ago I called the Laplace-Gauss curve the
NORMAL curve, which name, while it avoids an international
question of priority, has the disadvantage of leading people to
believe that all other distributions of frequency are in one
sense or another ABNORMAL.
That belief is, of course, not justifiable."
Karl Pearson, 1920
39
Normal Distribution (Bell-curve, Gaussian)
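A continuous r.v. X has a normal distribution with parameters $\mu$ and
$\sigma^2$ if its probability density function is given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -(x - \mu)^2 / (2\sigma^2) \right]$ for $-\infty < x < \infty$
$E(X) = \mu$ and $Var(X) = \sigma^2$ (see Figure 2.12, p. 53)
Standard normal distribution: $Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$
See Table A.3 on p. 673: $\Phi(z) = P(Z \leq z)$
$P(X \leq x) = P\left( Z = \frac{X - \mu}{\sigma} \leq \frac{x - \mu}{\sigma} \right) = \Phi\left( \frac{x - \mu}{\sigma} \right)$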
See Examples 2.37 and 2.38 on pp. 54-55 (computations)
40
Percentiles of the Normal Distribution
Suppose that the scores on a standardized test are normally
distributed with mean 500 and standard deviation of 100. What
is the 75th percentile score of this test?
$P(X \leq x) = P\left( \frac{X - 500}{100} \leq \frac{x - 500}{100} \right) = \Phi\left( \frac{x - 500}{100} \right) = 0.75$
From Table A.3, $\Phi(0.675) = 0.75$
$\frac{x - 500}{100} = 0.675 \Rightarrow x = 500 + (100)(0.675) = 567.5$
Useful Information about the Normal Distribution:
~68% of a normal population is within $\pm 1\sigma$ of $\mu$
~95% of a normal population is within $\pm 2\sigma$ of $\mu$
~99.7% of a normal population is within $\pm 3\sigma$ of $\mu$
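The percentile and the 68-95-99.7 rule can be checked numerically. This sketch uses `statistics.NormalDist` from Python's standard library (Python 3.8+), an assumption in place of S-Plus; the small difference from 567.5 comes from rounding the table value to 0.675.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)

# 75th percentile: the x with P(X <= x) = 0.75
x75 = scores.inv_cdf(0.75)

# Fraction of the population within k standard deviations of the mean
within = {k: scores.cdf(500 + 100 * k) - scores.cdf(500 - 100 * k)
          for k in (1, 2, 3)}
print(round(x75, 1), within)
```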
41
75th percentile for a test with scores which are normally
distributed, mean=500, standard deviation=100
[Figure: pnorm(x, 500, 100) against x, for x from 200 to 800]
qnorm(0.75, 500, 100)=567.5
pnorm(567.5, 500, 100)=0.75
42
Linear Combinations of r.v.'s
$X_i \sim N(\mu_i, \sigma_i^2)$ for $i = 1, \ldots, n$ and $Cov(X_i, X_j) = \sigma_{ij}$ for $i \neq j$
Let $X = a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$ where the $a_i$ are constants.
Then X has a normal distribution with mean and variance:
$E(X) = E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_n \mu_n = \sum_{i=1}^{n} a_i \mu_i$
$Var(X) = Var(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2 \sum_{i<j} a_i a_j \sigma_{ij}$
$\bar{X} = (X_1 + X_2 + \cdots + X_n)/n$, so $a_i = 1/n$
Therefore, $\bar{X}$ from n i.i.d. $N(\mu, \sigma^2)$ observations $\sim N(\mu, \sigma^2/n)$,
since the covariances ($\sigma_{ij}$) are zero (by independence).
43
Sampling Distributions of
Statistics
Tamhane and Dunlop
(Johns Hopkins University)
1
Sampling Distributions
2
Definitions and Key Concepts
A sample statistic used to estimate an unknown population
parameter is called an estimate.
The discrepancy between the estimate and the true
parameter value is known as sampling error.
A statistic is a random variable with a probability distribution,
called the sampling distribution, which is generated by
repeated sampling.
We use the sampling distribution of a statistic to assess the
sampling error in an estimate.
Random Sample
Definition 5.11, page 201, Casella and Berger.
How is this different from a simple random sample?
For mutual independence, population must be very
large or must sample with replacement.
3
Sample Mean and Variance
Sample Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Sample Variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
How do the sample mean and variance vary in repeated
samples of size n drawn from the population?
In general, difficult to find exact sampling distribution. However,
see example of deriving distribution when all possible samples
can be enumerated (rolling 2 dice) in sections 5.1 and 5.2.
Note errors on page 168.
4
Properties of a sample mean and variance
See Theorem 5.2.2, page 268, Casella & Berger.
5
Distribution of Sample Means
If the i.i.d. r.v.'s are
Bernoulli
Normal
Exponential
the distributions of the sample means can be derived:
Sum of n i.i.d. Bernoulli(p) r.v.'s is Binomial(n, p)
Sum of n i.i.d. Normal($\mu$, $\sigma^2$) r.v.'s is Normal($n\mu$, $n\sigma^2$)
Sum of n i.i.d. Exponential($\lambda$) r.v.'s is Gamma($\lambda$, n)
6
Distribution of Sample Means
Generally, the exact distribution is difficult to
calculate.
What can be said about the distribution of the
sample mean when the sample is drawn from
an arbitrary population?
In many cases we can approximate the
distribution of the sample mean when n is
large by a normal distribution.
The famous Central Limit Theorem
7
Central Limit Theorem
Let $X_1, X_2, \ldots, X_n$ be a random sample drawn from an
arbitrary distribution with a finite mean $\mu$ and variance $\sigma^2$.
As n goes to infinity, the sampling distribution of
$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
converges to the N(0,1) distribution.
Sometimes this theorem is given in terms of the sums:
$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \approx N(0,1)$
8
Central Limit Theorem
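Let $X_1, \ldots, X_n$ be a random sample from an arbitrary distribution
with finite mean $\mu$ and variance $\sigma^2$.
As n increases
$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0,1)$
$\bar{X} \approx N(\mu, \frac{\sigma^2}{n})$ ?
$\sum_{i=1}^{n} X_i \approx N(n\mu, n\sigma^2)$ ?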
What happens as n goes to infinity?
9
10
Variance of means from uniform distribution
sample size=10 to 10^6
number of samples=100
[Figure: log10(variance) of the sample means (-7 to -2) against log10(sample.size) (1 to 6)]
Example: Uniform Distribution
$f(x | a, b) = 1/(b-a)$, $a \leq x \leq b$
E X = (b+a)/2
Var X = $(b-a)^2/12$
[Figure: histogram of runif(500, min = 0, max = 10)]
11
12
Standardized Means, Uniform Distribution
[Figure: histogram of standardized means, number of samples=500, n=1]
13
[Figure: histogram of standardized means, number of samples=500, n=2]
14
[Figure: histogram of standardized means, number of samples=500, n=100]
15
QQ (Normal) plot of means of 500 samples of
size 100 from uniform distribution
[Figure: normal QQ plot of the 500 sample means]
Bootstrap sampling from the
sample
Previous slides have shown results for means
of 500 samples (of size 100) from uniform
distribution.
Bootstrap takes just one sample of size 100
and then takes 500 samples (of size 100)
with replacement from the sample.
x<-runif(100)
y<- mean(sample(x,100,replace=T))
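The S-Plus lines above draw one bootstrap mean; repeating the resampling 500 times gives the bootstrap distribution of the mean. A minimal sketch of that loop follows, in Python as an assumption (the slides use S-Plus); the seed is arbitrary and only there to make the run repeatable.

```python
import random
from statistics import mean

random.seed(0)   # arbitrary seed so the sketch is repeatable

# One observed sample of size 100 from a uniform(0, 1) distribution
x = [random.random() for _ in range(100)]

# 500 bootstrap samples, each of size 100, drawn with replacement from x
boot_means = [mean(random.choices(x, k=len(x))) for _ in range(500)]

print(mean(x), mean(boot_means))
```

The bootstrap means cluster around the mean of the one observed sample, not around the unknown population mean.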
16
17
Normal probability plot of sample of
size 100 from exponential distribution
[Figure: normal probability plot of x]
18
Normal probability plot of means of 500
bootstrap samples from sample of size 100
from exponential distribution
[Figure: normal probability plot of the bootstrap means y (about 1.0 to 1.5)]
Law of Large Numbers and Central Limit Theorem
Both are asymptotic results about the sample mean:
Law of Large Numbers (LLN) says that as $n \to \infty$, the
sample mean converges to the population mean, i.e.,
$\bar{X} - \mu \to 0$ as $n \to \infty$
Central Limit Theorem (CLT) says that as $n \to \infty$,
also the distribution converges to Normal, i.e.,
$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ converges to N(0,1) as $n \to \infty$
19
Normal Approximation to the Binomial
A binomial r.v. is the sum of i.i.d. Bernoulli r.v.'s so the CLT can be
used to approximate its distribution.
Suppose that X is B(n, p). Then the mean of X is np and the variance
of X is np(1 - p).
By the CLT, we have:
$\frac{X - np}{\sqrt{np(1-p)}} \approx N(0,1)$
General formula: $\frac{r.v. - E(r.v.)}{SD(r.v.)}$
How large a sample, n, do we need for the approximation to be good?
Rule of Thumb: $np \geq 10$ and $n(1-p) \geq 10$
For p=0.5, np = n(1-p) = n(0.5) = 10, so n should be at least 20. (symmetrical)
For p=0.1 or 0.9, np or n(1-p) = n(0.1) = 10, so n should be at least 100. (skewed)
See Figures 5.2 and 5.3 and Example 5.3, pp.172-174
20
Continuity Correction
See Figure 5.4 for motivation.
$P(X \leq x) \approx \Phi\left( \frac{x + 0.5 - np}{\sqrt{np(1-p)}} \right)$
$P(X \geq x) \approx 1 - \Phi\left( \frac{x - 0.5 - np}{\sqrt{np(1-p)}} \right)$
Exact Binomial Probability:
$P(X \leq 8) = 0.2517$
Normal approximation without Continuity Correction:
$P(X \leq 8) = 0.1867$
Normal approximation with Continuity Correction:
$P(X \leq 8.5) = 0.2514$ (much better agreement with exact calculation)
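The three numbers can be reproduced under the assumption that the example is X ~ B(20, 0.5); the slide does not state n and p, and these values are chosen here because they match the exact probability 0.2517. The sketch is in Python (the course uses S-Plus). Software values differ slightly from the slide's in the third or fourth decimal because the slide rounds z to two places for the table lookup.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, x = 20, 0.5, 8   # assumed example: X ~ B(20, 0.5), P(X <= 8)

# Exact binomial probability
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

mu, sd = n * p, sqrt(n * p * (1 - p))
z = NormalDist()   # standard normal

plain = z.cdf((x - mu) / sd)            # no continuity correction
corrected = z.cdf((x + 0.5 - mu) / sd)  # with continuity correction

print(round(exact, 4), round(plain, 4), round(corrected, 4))
```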
21
Sampling Distribution of the Sample Variance
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \sim\ ?$
There is no analog to the CLT for $S^2$ which
gives an approximation for large
samples for an arbitrary distribution.
The exact distribution for $S^2$ can be derived for X ~ i.i.d. Normal.
Chi-square distribution: For $\nu \geq 1$, let $Z_1, Z_2, \ldots, Z_\nu$ be i.i.d. N(0,1)
and let $Y = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$.
The p.d.f. of Y can be shown to be
$f(y) = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, y^{\nu/2 - 1} e^{-y/2}$
This is known as the $\chi^2$ distribution with $\nu$ degrees of freedom
(d.f.) or $Y \sim \chi^2_\nu$.
See Figures 5.5 and 5.6, pp. 176-177 and Table A.5, p.676
22
Distribution of the Sample Variance in the Normal Case
If Z ~ N(0,1), then $Z^2 \sim \chi^2_1$
It can be shown that
$\frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}$
or equivalently
$S^2 \sim \frac{\sigma^2 \chi^2_{n-1}}{n-1}$, a scaled $\chi^2$
$E(S^2) = \sigma^2$ (is an unbiased estimator)
$Var(S^2) = \frac{2\sigma^4}{n-1}$
See Result 2 (p.179)
23
Chi-square distribution
24
[Figure: chi-square density for df = 5, 10, 20, 30, plotted for x from 0 to 50]
Chi-Square Distribution
Interesting Facts
E X = $\nu$ (degrees of freedom)
Var X = $2\nu$
Special case of the gamma distribution
with scale parameter = 2, shape
parameter = $\nu/2$.
Chi-square variate with $\nu$ d.f. is equal to
the sum of the squares of $\nu$ independent
unit normal variates.
25
Student's t-Distribution
Consider a random sample $X_1, X_2, \ldots, X_n$ drawn from $N(\mu, \sigma^2)$.
It is known that $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is exactly distributed as N(0,1).
$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ is NOT distributed as N(0,1).
A different distribution for each $\nu = n-1$ degrees of freedom (d.f.).
T is the ratio of a N(0,1) r.v. and sq.rt.(independent $\chi^2$ divided by its d.f.)
- for derivation, see eqn 5.13, p.180, and its messy p.d.f., eqn 5.14
See Figure 5.7, Student's t p.d.f.'s for $\nu$ = 2, 10, and $\infty$, p.180
See Table A.4, t-distribution table, p. 675
See Example 5.6, milk cartons, p. 181
26
27
Student's t densities for df=1,100
[Figure: Student's t p.d.f. for df=1 and df=100, plotted for x from -4 to 4]
Student's t Distribution
Interesting Facts
E X = 0, for $\nu > 1$
Var X = $\nu/(\nu-2)$ for $\nu > 2$
Related to F distribution ($F_{1,\nu} = t_\nu^2$)
As $\nu$ tends to infinity t variate tends to
unit normal
If $\nu = 1$ then t variate is standard Cauchy
28
29
Cauchy Distribution
for center=0, scale=1
and center=1, scale=2
[Figure: Cauchy p.d.f. for center=0, scale=1 and center=1, scale=2, plotted for x from -4 to 4]
Cauchy Distribution
Interesting Facts
$f(x | a, b) = \left\{ \pi b \left[ 1 + \left( \frac{x-a}{b} \right)^2 \right] \right\}^{-1}$
Parameters, a=center, b=scale
Mean and Variance do not exist (how could this be?)
a=median
Quartiles=a +/- b
Special case of Student's t with 1 d.f.
Ratio of 2 independent unit normal variates is standard
Cauchy variate
Should not be thought of as "only a pathological case"
(Casella & Berger), as we frequently (when?) calculate
ratios of random variables.
Snedecor-Fisher's F-Distribution
Consider two independent random samples:
$X_1, X_2, \ldots, X_{n_1}$ from $N(\mu_1, \sigma_1^2)$, $Y_1, Y_2, \ldots, Y_{n_2}$ from $N(\mu_2, \sigma_2^2)$.
Then
$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\frac{(n_1-1) S_1^2}{\sigma_1^2} / (n_1-1)}{\frac{(n_2-1) S_2^2}{\sigma_2^2} / (n_2-1)}$
has an F-distribution with $n_1 - 1$ d.f.
in the numerator and $n_2 - 1$ d.f.
in the denominator.
F is the ratio of two independent $\chi^2$'s divided by their respective d.f.'s
Used to compare sample variances.
See Table A.6, F-distribution, pp. 677-679
31
32
Snedecor's F Distribution
[Figure: F p.d.f. for df2=40 with df1=4, 10, 40, plotted for x from 0 to 3]
Snedecor's F Distribution
Interesting Facts
Parameters, v, w, referred to as degrees of freedom (df).
Mean = w/(w-2), for w>2
Variance = $2w^2(v+w-2) / (v(w-2)^2(w-4))$, for w>4
As d.f., v and w increase, F variate tends to normal
Related also to Chi-square, Student's t, Beta and Binomial
Reference for distributions:
Statistical Distributions, 3rd ed., by Evans, Hastings
and Peacock, Wiley, 2000
33
Sampling Distributions - Summary
For random sample from any distribution, standardized sample mean
converges to N(0,1) as n increases (CLT).
In the normal case, standardized sample mean with S instead of sigma in the
denominator ~ Student's t(n-1).
Sum of n squared unit normal variates ~ Chi-square (n)
In the normal case, sample variance has scaled Chi-square distribution.
In the normal case, ratio of sample variances from two different samples
divided by their respective d.f. has F distribution.
34
Sir Ronald A. Fisher (1890-1962)
Wrote the first books on statistical
methods (1926 & 1936):
"A student should not be made
to read Fisher's books
unless he has read them before."
George W. Snedecor (1882-1974)
Taught at Iowa State Univ. where
wrote a college textbook (1937)
(named the distribution for Fisher):
"Thank God for Snedecor;
now we can understand Fisher."
35
Sampling Distributions for Order Statistics
Most sampling distribution results (except for CLT) apply to samples from
normal populations.
If data does not come from a normal (or at least approximately normal) population,
then statistical methods called distribution-free or non-parametric
methods can be used (Chapter 14).
Non-parametric methods are often based on ordered data (called order
statistics: $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$) or just their ranks.
If $X_1, \ldots, X_n$ are from a continuous population with cdf F(x) and pdf f(x) then the
pdf of $X_{(j)}$ is:
$f_{(j)}(x) = \frac{n!}{(j-1)!\,(n-j)!}\, f(x)\, [F(x)]^{j-1}\, [1 - F(x)]^{n-j}$
The confidence intervals for percentiles can be derived using the order
statistics and the binomial distribution.
36
Basic Concepts of Inference
Tamhane and Dunlop
Slides prepared by Elizabeth Newton (MIT)
(Johns Hopkins University) and Roy Welsch (MIT).
1
Statistical thinking will one day be as necessary for efficient citizenship
as the ability to read and write. H. G. Wells
Statistical Inference
Deals with methods for making statements about a population based on
a sample drawn from the population
Point Estimation: Estimate an unknown population parameter
Confidence Interval Estimation: Find an interval that contains the
parameter with preassigned probability.
Hypothesis testing: Testing hypothesis about an unknown population
parameter
2
Examples
Point Estimation: estimate the mean package weight
of a cereal box filled during a production shift
Confidence Interval Estimation: Find an interval
[L,U] based on the data that includes the mean
weight of the cereal box with a specified
probability
Hypothesis testing: Do the cereal boxes meet the
minimum mean weight specification of 16 oz?
3
Two Levels of Statistical
Inference
Informal, using summary statistics (may
only be descriptive statistics)
Formal, which uses methods of
probability and sampling distributions to
develop measures of statistical
accuracy
4
Estimation Problems
Point estimation: estimation of an unknown population
parameter by a single statistic calculated from the
sample data.
Confidence interval estimation: calculation of an
interval from sample data that includes the unknown
population parameter with a pre-assigned probability.
5
Point Estimation Terminology
Estimator = the random variable (r.v.) $\hat{\theta}$, a function of the $X_i$'s
(the general formula of the rule to be computed from the data)
Estimate = the numerical value of $\hat{\theta}$ calculated from the
observed sample data $X_1 = x_1, \ldots, X_n = x_n$
(the specific value calculated from the data)
Example: $X_i \sim N(\mu, \sigma^2)$
Estimator = $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is an estimator of $\mu$
Estimate = $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (= 10.2) is an estimate of $\mu$
Other estimators of $\mu$?
6
Methods of Evaluating Estimators
Bias and Variance
$Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$
- The bias measures the accuracy of an estimator.
- An estimator whose bias is zero is called unbiased.
- An unbiased estimator may, nevertheless, fluctuate greatly from
sample to sample.
$Var(\hat{\theta}) = E\{[\hat{\theta} - E(\hat{\theta})]^2\}$
- The lower the variance, the more precise the estimator.
- A low-variance estimator may be biased.
- Among unbiased estimators, the one with the lowest variance
should be chosen. Best = minimum variance.
7
Accuracy and Precision
[Figure: four targets labeled "accurate and precise", "accurate, not precise", "precise, not accurate", and "not accurate, not precise"]
8
Diagram courtesy of MIT OpenCourseWare
Mean Squared Error
- To choose among all estimators (biased and unbiased),
minimize a measure that combines both bias and variance.
- A good estimator should have low bias (accurate) AND
low variance (precise).
$MSE(\hat{\theta}) = E\{[\hat{\theta} - \theta]^2\} = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2$ (eqn 6.2)
MSE = expected squared error loss function
where $Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$ and $Var(\hat{\theta}) = E\{[\hat{\theta} - E(\hat{\theta})]^2\}$
9
Example: estimators of variance
Two estimators of variance:
$S_1^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)$ is unbiased (Example 6.3)
$S_2^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / n$ is biased but has smaller MSE (Example 6.4)
In spite of its larger MSE, we almost always use $S_1^2$
10
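The MSE comparison of the two variance estimators can be checked with a few lines of arithmetic. The sketch below assumes normal data, for which $Var(S_1^2) = 2\sigma^4/(n-1)$; the function name and the choice n = 10, $\sigma^2$ = 1 are illustrative and do not come from the slides.

```python
def mse_variance_estimators(n, sigma2=1.0):
    """Return (MSE of S1^2, MSE of S2^2) for a normal sample of size n.

    Assumption: normal data, so Var(S1^2) = 2*sigma^4/(n-1).
    """
    var_s1 = 2.0 * sigma2 ** 2 / (n - 1)
    mse_s1 = var_s1                      # S1^2 is unbiased, so MSE = variance
    shrink = (n - 1) / n                 # S2^2 = shrink * S1^2
    bias_s2 = -sigma2 / n                # E(S2^2) - sigma^2
    mse_s2 = shrink ** 2 * var_s1 + bias_s2 ** 2   # Var + Bias^2 (eqn 6.2)
    return mse_s1, mse_s2


mse1, mse2 = mse_variance_estimators(n=10)
print("MSE(S1^2) =", mse1, " MSE(S2^2) =", mse2)
```

Under these assumptions the divisor-n estimator has the smaller MSE for every n, while only the divisor-(n-1) estimator is unbiased.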
Example - Poisson
(See example in Casella & Berger, page 308)
11
Standard Error (SE)
- The standard deviation of an estimator is called the standard
error of the estimator (SE).
- The estimated standard error is also called standard error (se).
- The precision of an estimator is measured by the SE.
Examples for the normal and binomial distributions:
1. $\bar{X}$ is an unbiased estimator of $\mu$
$SE(\bar{X}) = \sigma/\sqrt{n}$ and $se(\bar{X}) = s/\sqrt{n}$
are called the standard error of the mean
2. $\hat{p}$ is an unbiased estimator of p
$se(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$
12
Precision and Standard Error
A precise estimate has a small standard error, but exactly
how are the precision and standard error related?
If the sampling distribution of an estimator is normal with
mean equal to the true parameter value (i.e., unbiased), then
we know that about 95% of the time the estimator will be within
two SEs of the true parameter value.
13
Methods of Point Estimation
Method of Moments (Chapter 6)
Maximum Likelihood Estimation (Chapter 15)
Least Squares (Chapter 10 and 11)
14
Method of Moments
Equate sample moments to population moments (as we did with
Poisson).
Example: for the continuous uniform distribution, $f(x|a,b) = 1/(b-a)$, $a \le x \le b$
$E(X) = (b+a)/2$, $Var(X) = (b-a)^2/12$
Set $\bar{X} = (b+a)/2$
$S^2 = (b-a)^2/12$
Solve for a and b (can be a bit messy).
15
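Solving the two moment equations for the uniform example gives $b - a = \sqrt{12}\,S$, so $\hat{a} = \bar{X} - \sqrt{3}\,S$ and $\hat{b} = \bar{X} + \sqrt{3}\,S$. A minimal sketch follows; the function name and the data are made up for illustration, and $S^2$ is assumed to be the usual sample variance with divisor n-1.

```python
import math
import statistics


def uniform_method_of_moments(data):
    """Method-of-moments estimates (a_hat, b_hat) for Uniform(a, b)."""
    xbar = statistics.mean(data)
    s = statistics.stdev(data)         # assumption: S uses the n-1 divisor
    half_width = math.sqrt(3.0) * s    # (b - a)/2 = sqrt(12)*S/2
    return xbar - half_width, xbar + half_width


data = [2.1, 3.7, 4.4, 5.0, 6.8, 7.9]   # hypothetical observations
a_hat, b_hat = uniform_method_of_moments(data)
print(a_hat, b_hat)
```

Nothing forces these estimates to bracket every observation, which is one known weakness of the method of moments for this model.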
Maximum Likelihood Parameter
Estimation
By far the most popular estimation method! (Casella &
Berger).
MLE is the parameter point for which observed data is most
likely under the assumed probability model.
Likelihood function: $L(\theta|x) = f(x|\theta)$, where x is the vector of
sample values, $\theta$ also a vector possibly.
When we consider $f(x|\theta)$, we consider $\theta$ as fixed and x as
the variable.
When we consider $L(\theta|x)$, we are considering x to be the
fixed observed sample point and $\theta$ to be varying over all
possible parameter values.
16
MLE (continued)
If $X_1, \ldots, X_n$ are iid then
$L(\theta|x) = f(x_1, \ldots, x_n|\theta) = \prod_{i=1}^{n} f(x_i|\theta)$
The MLE of $\theta$ is the value which maximizes the
likelihood function (assuming it has a global maximum).
Found by differentiating when possible.
Usually work with log of likelihood function (the product becomes a sum).
Equations obtained by setting partial derivatives of
$\ln L(\theta)$ = 0 are called the likelihood equations.
See text page 616 for an example with the normal distribution.
17
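For the normal model the likelihood equations have the well-known closed-form solution $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \sum (x_i - \bar{x})^2 / n$. The sketch below computes these and evaluates the log-likelihood so the maximum can be checked numerically; the function names and the sample are illustrative assumptions, not taken from the textbook example.

```python
import math


def normal_log_likelihood(mu, sigma2, xs):
    """ln L(mu, sigma2 | xs) for an iid N(mu, sigma2) sample."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)


def normal_mle(xs):
    """Closed-form solution of the normal likelihood equations."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # divisor n, not n-1
    return mu_hat, sigma2_hat


sample = [9.8, 10.4, 10.1, 10.9, 9.6]    # made-up data
mu_hat, sigma2_hat = normal_mle(sample)
best = normal_log_likelihood(mu_hat, sigma2_hat, sample)
print(mu_hat, sigma2_hat, best)
```

Note that the MLE of the variance is the biased divisor-n estimator.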
Confidence Interval Estimation
We want an interval [L, U] where L and U are two
statistics calculated from $X_1, X_2, \ldots, X_n$ such that
$P[L \le \theta \le U] = 1 - \alpha$
regardless of the true value of $\theta$.
Note: L and U are random and $\theta$ is fixed but unknown.
[L, U] is called a $100(1-\alpha)\%$ confidence interval (CI).
$1-\alpha$ is called the confidence level of the interval.
After the data are observed, $X_1 = x_1, \ldots, X_n = x_n$, the
confidence limits L = l and U = u can be calculated.
18
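95% Confidence Interval: Normal, $\sigma^2$ known
Consider a random sample $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$ where
$\sigma^2$ is assumed to be known and $\mu$ is an unknown parameter
to be estimated. Then
$P\left[-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right] = 0.95$
By the CLT, even if the sample is not normal, this result is
approximately correct.
$P\left[L = \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le U = \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right] = 0.95$
$l = \bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le u = \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$
is a 95% CI for $\mu$ (two-sided)
See Example 6.7, Airline Revenues, p. 204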
19
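The known-variance interval can be sketched in a few lines. The function name and the numbers passed to it (a sample mean of 10.2, $\sigma$ = 2, n = 25) are illustrative assumptions; only the formula $\bar{x} \pm z\,\sigma/\sqrt{n}$ comes from the slide.

```python
import math
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Two-sided CI for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


lower, upper = z_confidence_interval(xbar=10.2, sigma=2.0, n=25)
lower99, upper99 = z_confidence_interval(xbar=10.2, sigma=2.0, n=25, confidence=0.99)
print(lower, upper)
print(lower99, upper99)
```

Raising the confidence level to 99% widens the interval, since the multiplier grows from about 1.96 to about 2.576.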
Normal Distribution, 95% of area under
curve is between -1.96 and 1.96
[Figure: standard normal density dnorm(x) for x from -3 to 3 (vertical axis 0.0 to 0.4), with the central 95% of the area marked]
20
Frequentist Interpretation of CIs
In an infinitely long series of trials in which repeated
samples of size n are drawn from the same population
and 95% CIs for $\mu$ are calculated using the same method,
the proportion of intervals that actually include $\mu$ will be
95% (coverage probability).
However, for any particular CI, it is not known whether or
not the CI includes $\mu$, but the probability that it includes $\mu$
is either 0 or 1, that is, either it does or it doesn't.
It is incorrect to say that the probability is 0.95 that the true
$\mu$ is in a particular CI.
See Figure 6.2, p. 205
21
22
95% CI, 50 samples from unit normal
distribution
[Figure: 95% confidence intervals (vertical axis, -1.0 to 1.0) plotted for each of 50 samples (horizontal axis, 0 to 50)]
Arbitrary Confidence Level for CI: $\sigma^2$ known
$100(1-\alpha)\%$ two-sided CI for $\mu$ based on the observed sample mean $\bar{x}$:
$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
For 99% confidence, $z_{\alpha/2} = 2.576$
The price paid for higher confidence level is a wider interval.
For large samples, these CIs can be used for data from any
distribution, since by the CLT $\bar{x} \approx N(\mu, \sigma^2/n)$.
23
One-sided Confidence Intervals
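Lower one-sided CI: $\mu \ge \bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}}$
Upper one-sided CI: $\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$
For 95% confidence, $z_{\alpha} = 1.645$ vs. $z_{\alpha/2} = 1.96$
One-sided CIs are tighter for the same confidence level.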
24
Hypothesis Testing
The objective of hypothesis testing is to assess the validity of a
claim against a counterclaim using sample data.
The claim to be proved is the alternative hypothesis ($H_1$).
The competing claim is called the null hypothesis ($H_0$).
One begins by assuming that $H_0$ is true. If the data fail to
contradict $H_0$ beyond a reasonable doubt, then $H_0$ is not
rejected. However, failing to reject $H_0$ does not mean that we
accept it as true. It simply means that $H_0$ cannot be ruled out as
a possible explanation for the observed data. A proof by
insufficient data is not a proof at all.
25
Testing Hypotheses
The process by which we use data to answer questions about parameters
is very similar to how juries evaluate evidence about a defendant. from
Geoffrey Vining, Statistical Methods for Engineers, Duxbury, 1st edition,
1998. For more information, see that textbook.
26
Hypothesis Tests
A hypothesis test is a data-based rule to decide between $H_0$ and $H_1$.
A test statistic calculated from the data is used to make this
decision.
The values of the test statistic for which the test rejects $H_0$
comprise the rejection region of the test.
The complement of the rejection region is called the
acceptance region.
The boundaries of the rejection region are defined by one or
more critical constants (critical values).
See Examples 6.13 (acceptance sampling) and 6.14 (SAT coaching),
pp. 210-211.
27
Hypothesis Testing as a Two-Decision Problem
28
Framework developed by Neyman and Pearson in 1933.
When a hypothesis test is viewed as a decision procedure,
two types of errors are possible:

                              Decision
Reality        Do not reject H_0                  Reject H_0                        Row total
H_0 true       Correct decision                   Type I error                      = 1
               ("confidence"), 1 - alpha          (significance level), alpha
H_0 false      Type II error                      Correct decision                  = 1
               (failure to detect), beta          (prob. of detection), 1 - beta

Each row sums to 1; the column totals need not equal 1.
Probabilities of Type I and II Errors
$\alpha$ = P{Type I error} = P{Reject $H_0$ when $H_0$ is true} = P{Reject $H_0$ | $H_0$}
also called $\alpha$-risk or producer's risk or false alarm rate
$\beta$ = P{Type II error} = P{Fail to reject $H_0$ when $H_1$ is true} = P{Fail to reject $H_0$ | $H_1$}
also called $\beta$-risk or consumer's risk or prob. of not detecting
$\pi = 1 - \beta$ = P{Reject $H_0$ | $H_1$} is prob. of detection or power of the test
We would like to have low $\alpha$ and low $\beta$ (or equivalently, high power).
$\alpha$ and $1-\beta$ are directly related; we can increase power by increasing $\alpha$.
These probabilities are calculated using the sampling distributions from
either the null hypothesis (for $\alpha$) or alternative hypothesis (for $\beta$).
29
Example 6.17 (SAT Coaching)
See Example 6.17, SAT Coaching, in the course textbook.
30
Power Function and OC Curve
The operating characteristic function of a test is the probability
that the test fails to reject $H_0$ as a function of $\theta$, where $\theta$ is the
test parameter.
$OC(\theta)$ = P{test fails to reject $H_0$ | $\theta$}
For $\theta$ values included in $H_1$ the OC function is the $\beta$-risk.
The power function is:
$\pi(\theta)$ = P{Test rejects $H_0$ | $\theta$} = $1 - OC(\theta)$
Example: In SAT coaching, for the test that rejects the null
hypothesis when mean change is 25 or greater, the power
= 1-pnorm(25,mean=0:50,sd=40/sqrt(20))
31
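The S-Plus expression in the SAT coaching example can be mirrored with the Python standard library. This is a sketch of the same calculation; the function name and the coarser grid of true means are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def power_at(mu, critical_value=25.0, sigma=40.0, n=20):
    """P(sample mean >= critical_value) when the true mean change is mu."""
    se = sigma / math.sqrt(n)                  # standard error of the mean
    return 1.0 - NormalDist(mu, se).cdf(critical_value)


# Power at a few true mean changes (the slide uses the grid 0:50)
power_curve = [(mu, power_at(mu)) for mu in range(0, 51, 10)]
for mu, power in power_curve:
    print(mu, round(power, 4))
```

The operating characteristic at the same $\mu$ is simply one minus the printed power.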
Level of Significance
In hypothesis testing, the practice is to put an upper bound on
P(Type I error) and, subject to that constraint, find a test with the lowest
possible P(Type II error).
The upper bound on P(Type I error) is called the level of significance of
the test and is denoted by $\alpha$ (usually some small number such as 0.01,
0.05, or 0.10).
The test is required to satisfy:
P{Type I error} = P{Test rejects $H_0$ | $H_0$} $\le \alpha$
Note that $\alpha$ is now used to denote an upper bound on P(Type I error).
Motivated by the fact that the Type I error is usually the more serious.
A hypothesis test with a significance level $\alpha$ is called an $\alpha$-level test.
32
Choice of Significance Level
What $\alpha$ level should one use?
Recall that as P(Type I error) decreases P(Type II error)
increases.
A proper choice of $\alpha$ should take into account the relative costs
of Type I and Type II errors. (These costs may be difficult to
determine in practice, but must be considered!)
Fisher said: $\alpha$ = 0.05
Today $\alpha$ = 0.10, 0.05, 0.01 depending on how much proof
against the null hypothesis we want to have before rejecting it.
P-values have become popular with the advent of computer
programs.
33
Observed Level of Significance or P-value
Simply rejecting or not rejecting $H_0$ at a specified $\alpha$ level does
not fully convey the information in the data.
Example: $H_0: \mu = 15$ vs $H_1: \mu > 15$ is rejected at the $\alpha$ = 0.05
level when
$\bar{x} > 15 + 1.645\,\frac{40}{\sqrt{20}} = 29.71$
Is a sample with a mean of 30 equivalent to a sample with a mean
of 50? (Note that both lead to rejection at the $\alpha$-level of 0.05.)
More useful to report the smallest $\alpha$-level for which the data
would reject (this is called the observed level of significance or
P-value).
Reject $H_0$ if P-value < $\alpha$
34
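To see how much more information the P-value carries, the two sample means mentioned above can be compared directly. The sketch below uses the same setting ($\mu_0$ = 15, $\sigma$ = 40, n = 20, upper one-sided test); the function name is an assumption.

```python
import math
from statistics import NormalDist


def upper_one_sided_p_value(xbar, mu0=15.0, sigma=40.0, n=20):
    """P-value of the upper one-sided z-test of H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return 1.0 - NormalDist().cdf(z)


p30 = upper_one_sided_p_value(30.0)
p50 = upper_one_sided_p_value(50.0)
print("mean 30:", p30)
print("mean 50:", p50)
```

Both samples reject at the 0.05 level, but the P-value for a mean of 50 is smaller by several orders of magnitude.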
Example 6.23 (SAT Coaching: P-Value)
See Example 6.23, SAT Coaching, on page 220 of the
course textbook.
35
One-sided and Two-sided Tests
$H_0: \mu = 15$ can have three possible alternative hypotheses:
$H_1: \mu > 15$ (upper one-sided), $H_1: \mu < 15$ (lower one-sided), or $H_1: \mu \ne 15$ (two-sided)
Example 6.27 (SAT Coaching: Two-sided testing)
See Example 6.27 in the course textbook.
36
Example 6.27 continued
See Example 6.27, SAT Coaching, on page 223 of the
course textbook.
37
Relationship Between Confidence Intervals
and Hypothesis Tests
An $\alpha$-level two-sided test rejects a hypothesis $H_0: \mu = \mu_0$ if and
only if the $(1-\alpha)100\%$ confidence interval does not contain $\mu_0$.
Example 6.7 (Airline Revenues)
See Example 6.7, Airline Revenues, on page 207 of the
course textbook.
38
Use/Misuse of Hypothesis Tests in Practice
Difficulties of Interpreting Tests on Non-random samples
and observational data
Statistical significance versus Practical significance
Statistical significance is a function of sample size
Perils of searching for significance
Ignoring lack of significance
Confusing confidence ($1 - \alpha$) with probability of detecting a
difference ($1 - \beta$)
39
Jerzy Neyman Egon Pearson
(1894-1981) (1895-1980)
Carried on a decades-long feud with Fisher over the
foundations of statistics (hypothesis testing and confidence
limits)
- Fisher never recognized Type II error & developed fiducial
limits
40
Inference for Single Samples
Corresponds to
Chapter 7 of
Tamhane and Dunlop
with some slides by Ramón V. León (University of Tennessee)
1
Inference About the Mean and Variance of a
Normal Population
Applications:
Monitor the mean of a manufacturing process to determine
if the process is under control
Evaluate the precision of a laboratory instrument measured
by the variance of its readings
Prediction intervals and tolerance intervals which are
methods for estimating future observations from a
population.
By using the central limit theorem (CLT), inference procedures
for the mean of a normal population can be extended to the
mean of a non-normal population when a large sample is available
2
Inferences on Mean (Large Samples)
Inferences on $\mu$ will be based on the sample mean $\bar{X}$,
which is an unbiased estimator of $\mu$ with variance $\sigma^2/n$.
For large sample size n, the CLT tells us that $\bar{X}$ is
approximately $N(\mu, \sigma^2/n)$ distributed, even if the population
is not normal.
Also for large n, the sample variance $s^2$ may be
taken as an accurate estimator of $\sigma^2$ with negligible sampling error.
If $n \ge 30$, we may assume that $\sigma \approx s$ in the formulas.

3
Pivots
Definition: Casella & Berger, p. 413
E.g. $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$
Pivots allow us to construct confidence intervals on
parameters.
4
Confidence Intervals on the Mean:
Large Samples
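$P\left[-z_{\alpha/2} \le Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right] = 1 - \alpha$
Note: $z_{\alpha/2}$ = -qnorm($\alpha$/2)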
(See Figure 2.15 on page 56 of the
course textbook.)
5
Confidence Intervals on the Mean
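$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ (Two-Sided CI)
$\mu \ge \bar{x} - z_{\alpha}\frac{\sigma}{\sqrt{n}}$ (Lower One-Sided CI)
$\mu \le \bar{x} + z_{\alpha}\frac{\sigma}{\sqrt{n}}$ (Upper One-Sided CI)
$\frac{\sigma}{\sqrt{n}}$ is the standard error of the mean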
6
Confidence Intervals in S-Plus
t.test(lottery.payoff)
One-sample t-Test
data: lottery.payoff
t = 35.9035, df = 253, p-value = 0
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
274.4315 306.2850
sample estimates:
mean of x
290.3583
7
This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Sample Size Determination for a z-interval
Suppose that we require a $(1-\alpha)$-level two-sided CI for $\mu$
of the form $[\bar{x} - E, \bar{x} + E]$, with a margin of error E.
Set $E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ and solve for n, obtaining $n = \left[\frac{z_{\alpha/2}\,\sigma}{E}\right]^2$
Calculation is done at the design stage, so a sample
estimate of $\sigma$ is not available.
An estimate for $\sigma$ can be obtained by anticipating
the range of the observations and dividing by 4.
This is based on assuming normality, since then 95% of the
observations are expected to fall in $[\mu - 2\sigma, \mu + 2\sigma]$
8
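The sample-size formula and the range/4 planning rule can be combined in a short sketch. The function name and the planning numbers (an anticipated range of 4 units and a margin of error of 0.1) are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_z_interval(sigma, margin, confidence=0.95):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)


anticipated_range = 4.0                     # hypothetical planning value
sigma_guess = anticipated_range / 4.0       # range/4 rule from the slide
n_needed = sample_size_for_z_interval(sigma_guess, margin=0.1)
print(n_needed)
```

The result is rounded up because n must be an integer and rounding down would give a margin slightly larger than E.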
Example 7.1 (Airline Revenue)
See Example 7.1, Airline Revenue, on page 239 of the
course textbook.
9
Example 7.2 Strength of Steel Beams
See Example 7.2 on page 240 of the course textbook.
10
Power Calculation for One-sided Z-tests
$\pi(\mu) = P[\text{Test rejects } H_0 \mid \mu]$
Testing $H_0: \mu \le \mu_0$ vs. $H_1: \mu > \mu_0$
For the derivation of the power function of the $\alpha$-level
upper one-sided z-test, see
Equation 7.7 in the course textbook.
Illustration of calculation on next page
Note: $\Phi(-z) = 1 - \Phi(z)$
11
Power Calculation for One-sided Z-tests
p.d.f. curves of $\bar{X} \sim N(\mu, \sigma^2/n)$
(See Figure 7.1 on page 243 of the course textbook.)
12
Power Functions Curves
See Figure 7.2 on page 243 of the course
textbook.
Notice how it is easier to detect a big
difference from $\mu_0$.
13
Example 7.3 (SAT Coaching: Power Calculation)
$\pi(\mu) = \Phi\left(-z_{\alpha} + \frac{(\mu - \mu_0)\sqrt{n}}{\sigma}\right)$

14
Power
Calculation
Two-Sided
Test
(See Figure 7.3 on page 245
of the course textbook.)
15
Power Curve for Two-sided Test
It is easier to
detect large
differences
from the null
hypothesis
(See Figure 7.4 on
page 246 of the
course textbook.)
Larger samples
lead to more
powerful tests
16
17
Power as a function of $\mu$ and n, $\mu_0 = 0$, $\sigma = 1$
Uses function persp in S-Plus
[Figure: perspective plot of power (0 to 1) against mu (-1 to 1) and n (20 to 100)]
Sample Size Determination for a One-Sided z-Test
Determine the sample size so that a study will have
sufficient power to detect an effect of practically important
magnitude
If the goal of the study is to show that the mean response $\mu$
under a treatment is higher than the mean response $\mu_0$
without the treatment, then $\mu - \mu_0$ is called the treatment
effect
Let $\delta > 0$ denote a practically important treatment effect
and let $1-\beta$ denote the minimum power required to detect
it. The goal is to find the minimum sample size n which
would guarantee that an $\alpha$-level test of $H_0$ has at least $1-\beta$
power to reject $H_0$ when the treatment effect is at least $\delta$.
18
Sample Size Determination for a One-sided Z-test
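Because power is an increasing function of $\mu - \mu_0$, it is only
necessary to find n that makes the power $1-\beta$ at $\mu = \mu_0 + \delta$.
$\pi(\mu_0 + \delta) = \Phi\left(-z_{\alpha} + \frac{\delta\sqrt{n}}{\sigma}\right) = 1 - \beta$ [See Equation (7.7), Slide 11]
Since $\Phi(z_{\beta}) = 1 - \beta$ we have $-z_{\alpha} + \frac{\delta\sqrt{n}}{\sigma} = z_{\beta}$.
Solving for n, we obtain
$n = \left[\frac{(z_{\alpha} + z_{\beta})\,\sigma}{\delta}\right]^2$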
19
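The formula $n = [(z_\alpha + z_\beta)\sigma/\delta]^2$ is easy to evaluate directly. The sketch below is one way to do it; the function name and the `two_sided` switch (which swaps in $z_{\alpha/2}$) are assumptions, and the inputs $\delta$ = 0.3, $\sigma$ = 1, $\alpha$ = 0.05, power 0.8 are chosen only for illustration.

```python
import math
from statistics import NormalDist


def z_test_sample_size(delta, sigma, alpha=0.05, power=0.80, two_sided=False):
    """Approximate n for a z-test to detect a shift of delta with given power."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)        # z_beta with beta = 1 - power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


n_one_sided = z_test_sample_size(delta=0.3, sigma=1.0)
n_two_sided = z_test_sample_size(delta=0.3, sigma=1.0, two_sided=True)
print(n_one_sided, n_two_sided)
```

With these inputs the two-sided value is 88, which agrees with n1 = 88 in the S-Plus normal.sample.size output shown later in these slides, presumably because that call uses a two-sided alternative.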
Example 7.5 (SAT Coaching: Sample Size
Determination)
20
Sample Size Determination for a
Two-Sided z-Test
$n \approx \left[\frac{(z_{\alpha/2} + z_{\beta})\,\sigma}{\delta}\right]^2$
Read on your own the derivation on pages 248-249
See Example 7.6 on page 249 of the
course textbook.
Read on your own Example 7.4 (page
246)
21
Power and Sample Size in S-Plus
normal.sample.size(mean.alt = 0.3)
mean.null sd1 mean.alt delta alpha power n1
0 1 0.3 0.3 0.05 0.8 88
> normal.sample.size(mean.alt = 0.3,n1=100)
mean.null sd1 mean.alt delta alpha power n1
0 1 0.3 0.3 0.05 0.8508 100
22
Inference on Mean (Small Samples)
The sampling variability of $s^2$ may be sizable if the sample is small
(less than 30). Inference methods must take this variability into
account when $\sigma^2$ is unknown.
Assume that $X_1, \ldots, X_n$ is a random sample from an
$N(\mu, \sigma^2)$ distribution. Then $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ has a
t-distribution with n-1 degrees of freedom (d.f.)
(Note that T is a pivot)
23
Confidence Intervals on Mean
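$P\left[-t_{n-1,\alpha/2} \le T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{n-1,\alpha/2}\right] = 1 - \alpha$
$P\left[\bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}\right] = 1 - \alpha$
$\bar{X} - t_{n-1,\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1,\alpha/2}\frac{S}{\sqrt{n}}$ [Two-Sided $100(1-\alpha)\%$ CI]
$t_{n-1,\alpha/2} > z_{\alpha/2}$, so the t-interval is wider on the average than the z-interval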
24
Example 7.7, 7.8, and 7.9
See Examples 7.7, 7.8, and 7.9 from the course textbook.
25
Inference on Variance
Assume that $X_1, \ldots, X_n$ is a random sample from an $N(\mu, \sigma^2)$ distribution
$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$ has a Chi-square distribution with n-1 d.f.
(See Figure 7.8 on page 255 of
the course textbook)
$P\left[\chi^2_{n-1,1-\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{n-1,\alpha/2}\right] = 1 - \alpha$

26
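CI for $\sigma^2$ and $\sigma$
The $100(1-\alpha)\%$ two-sided CI for $\sigma^2$ (Equation 7.17 in course textbook):
$\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}$
The $100(1-\alpha)\%$ two-sided CI for $\sigma$ (Equation 7.18 in course textbook):
$s\sqrt{\frac{n-1}{\chi^2_{n-1,\alpha/2}}} \le \sigma \le s\sqrt{\frac{n-1}{\chi^2_{n-1,1-\alpha/2}}}$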

27
Hypothesis Test on Variance
See Equation 7.21 on page 256 of the course textbook for an
explanation of the chi-square statistic:
$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$
28
Prediction Intervals
Many practical applications call for an interval estimate of
an individual (future) observation sampled from a population
rather than of the mean of the population.
An interval estimate for an individual observation is called
a prediction interval
Prediction Interval Formula:
$\bar{x} - t_{n-1,\alpha/2}\,s\sqrt{1 + \frac{1}{n}} \le X \le \bar{x} + t_{n-1,\alpha/2}\,s\sqrt{1 + \frac{1}{n}}$
29
Confidence vs. Prediction Interval
Prediction interval of a single future observation:
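$\bar{x} - t_{n-1,\alpha/2}\,s\sqrt{1 + \frac{1}{n}} \le X \le \bar{x} + t_{n-1,\alpha/2}\,s\sqrt{1 + \frac{1}{n}}$
As $n \to \infty$ the interval converges to $[\mu - z_{\alpha/2}\sigma, \mu + z_{\alpha/2}\sigma]$
Confidence interval for $\mu$:
$\bar{x} - t_{n-1,\alpha/2}\,s\sqrt{\frac{1}{n}} \le \mu \le \bar{x} + t_{n-1,\alpha/2}\,s\sqrt{\frac{1}{n}}$
As $n \to \infty$ the interval converges to the single point $\mu$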

30
Example 7.12: Tear Strength of Rubber
Run chart shows process is predictable.
31
Tolerance Intervals
Suppose we want an interval which will contain at least
.90 = $1-\gamma$ of the strengths of the future batches (observations) with
95% = $1-\alpha$ confidence
Using Table A.12 in the course textbook:
$1-\alpha$ = 0.95
$1-\gamma$ = 0.90
n = 14
So, the critical value we want is K = 2.529.
$[\bar{x} - Ks, \bar{x} + Ks] = 33.712 \pm 2.529 \times 0.798 = [31.694, 35.730]$
Note that this statistical interval is even wider than the prediction interval
32
Inferences for Two Samples
Tamhane and Dunlop
with some slides by Ramón V. León
(University of Tennessee)
1
Introductory Remarks
A majority of statistical studies, whether experimental or
observational, are comparative
Simplest type of comparative study compares two
populations
Two principal designs for comparative studies
Using independent samples
Using matched pairs
Graphical methods for informal comparisons
Formal comparisons of means and variances of normal
populations
Confidence intervals
Hypothesis tests
2
Independent Samples Design
Example: Compare Control Group to Treatment Group
See page 270 in course textbook.
Independent samples design:
Sample 1: $x_1, x_2, \ldots, x_{n_1}$
Sample 2: $y_1, y_2, \ldots, y_{n_2}$
(the sample sizes $n_1$ and $n_2$ may be different numbers)
The two samples are independent
Independent sample design relies on random assignment to make
the two groups equal (on the average) on all attributes except for
the treatment used (treatment factor).
3
Graphical Methods for Comparing Two Independent Samples
See Table 8.1 and Figure 8.1, which is a
Q-Q Plot. Plot suggests that treatment group
costs are less than control group costs.
But is it true?
Plot of the order statistics as ordered pairs $(x_{(i)}, y_{(i)})$,
which are the $\frac{i}{n+1}$ quantiles of the respective samples
Book discusses how to prepare this graph when the
two samples are not of the same size (interpolation).
4
5
Box plots of hospitalization cost data
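[Figure: side-by-side box plots of hospitalization costs (axis 0 to 30000) for groups hcc and hct]
Box plots of logs of hospitalization cost data
[Figure: side-by-side box plots of log hospitalization costs (axis 6 to 10) for groups lhcc and lhct]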
Graphical Displays of Data from
Matched Pairs
7
Plot the pairs $(x_i, y_i)$ in a scatter plot. Using the
45° line as a reference, one can judge whether the
two sets of values are similar or whether one set
tends to be larger than the other
Plots of the differences or the ratios of the pairs
may prove to be useful
A Q-Q plot is meaningless for paired data because
the same quantiles based on the ordered
observations do not, in general, come from the
same pair.
Comparing Means of Two Populations:
(Large Samples Case)
Suppose that the observations $x_1, x_2, \ldots, x_{n_1}$ and $y_1, y_2, \ldots, y_{n_2}$
are random samples from two populations with means
$\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. Both means and variances
are assumed to be unknown.
The goal is to compare $\mu_1$ and $\mu_2$
in terms of their difference $\mu_1 - \mu_2$. We assume that $n_1$ and
$n_2$ are large (say > 30).
8
Comparing Means of Two Populations:
$E(\bar{X} - \bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_1 - \mu_2$
$Var(\bar{X} - \bar{Y}) = Var(\bar{X}) + Var(\bar{Y}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
Therefore the standardized r.v.
$Z = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$
has mean = 0 and variance = 1
If $n_1$ and $n_2$ are large, then Z is approximately N(0,1) by
the Central Limit Theorem though we did not assume the samples
came from normal populations. (We also use the fact that the
difference of independent normal r.v.'s is also normal.)
9
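Large Sample (Approximate) $100(1-\alpha)\%$ CI for $\mu_1 - \mu_2$
$\bar{x} - \bar{y} - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Note $s_i^2$ has been substituted for $\sigma_i^2$ because samples are
large, i.e., bigger than 30.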
Example 8.2: See Example 8.2 in course textbook.
10
Large Sample (Approximate) Test of Hypothesis
$H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \ne \delta_0$ (Typically $\delta_0 = 0$)
Test statistic:
$z = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
11
Inference for Small Samples
Case 1: Variances $\sigma_1^2$ and $\sigma_2^2$ assumed equal.
Assumption of normal populations is important
since we cannot invoke the CLT
Pooled estimate of the common variance:
$S^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1-1) + (n_2-1)} = \frac{\sum (X_i - \bar{X})^2 + \sum (Y_i - \bar{Y})^2}{n_1 + n_2 - 2}$
Note: $S^2 = (S_1^2 + S_2^2)/2$ if sample sizes are equal
$T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{S\sqrt{1/n_1 + 1/n_2}}$ has a t-distribution with $n_1 + n_2 - 2$ d.f.
12
Inference for Small Sample:
Confidence Intervals and Hypothesis Tests
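Case 1: Variances $\sigma_1^2$ and $\sigma_2^2$ assumed equal.
Two-sided $100(1-\alpha)\%$ CI is given by:
$\bar{x} - \bar{y} - t_{n_1+n_2-2,\alpha/2}\,s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{n_1+n_2-2,\alpha/2}\,s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
Test of Hypothesis: $H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \ne \delta_0$
Test statistic:
$t = \frac{\bar{x} - \bar{y} - \delta_0}{s\sqrt{1/n_1 + 1/n_2}}$
Reject $H_0$ if $|t| > t_{n_1+n_2-2,\alpha/2}$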
13
Hospitalization Cost Example
See Example 8.2 on page 276 of course textbook.
Contrast this conclusion with apparent difference seen on the Q-Q plot in Figure 8.1
14
t.test in S-Plus to test difference in
means of logs of hospitalization cost data
t.test(lhcc,lhct)
Standard Two-Sample t-Test
data: lhcc and lhct
t = 0.6181, df = 58, p-value = 0.5389
alternative hypothesis: true difference in means is not equal to 0
-0.3731277 0.7064981
sample estimates:
mean of x mean of y
8.250925 8.08424
15
Interpretation of Difference in
Means on the Log Scale
Mean(log Cost) = Median(log Cost) = log(Median Cost)
(the first equality holds because the distribution of log cost is symmetric;
the second because the log preserves ordering)
$-0.373 \le \text{Mean}(\log \text{Cost}_C) - \text{Mean}(\log \text{Cost}_T) \le 0.707$
$-0.373 \le \log(\text{Median Cost}_C) - \log(\text{Median Cost}_T) \le 0.707$
$-0.373 \le \log\left(\frac{\text{Median Cost}_C}{\text{Median Cost}_T}\right) \le 0.707$
$0.689 = \exp(-0.373) \le \frac{\text{Median Cost}_C}{\text{Median Cost}_T} \le \exp(0.707) = 2.028$
This is a 95% confidence interval for the ratio of median costs.
This interpretation is not in your textbook.
16
Case 2: Variances $\sigma_1^2$ and $\sigma_2^2$ unequal.
$T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$ does not have a Student t-distribution
It can be shown that distribution of T depends on the ratio of unknown
variances, hence T is not a pivotal quantity. However, when
n1 and n2 are large T has an approximate N(0,1) distribution
17
For small samples
$T = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$ has an approximately t-distribution
with $\nu$ degrees of freedom, where
$\nu = \frac{(w_1 + w_2)^2}{w_1^2/(n_1 - 1) + w_2^2/(n_2 - 1)}$
and $w_1 = [SEM(\bar{x})]^2 = \frac{s_1^2}{n_1}$, $w_2 = [SEM(\bar{y})]^2 = \frac{s_2^2}{n_2}$
Note: d.f. are estimated from the data and are not a function of the sample sizes alone
Note: $\nu$ is not usually an integer but is rounded down to the nearest integer
18
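The approximate degrees of freedom are simple arithmetic once $w_1$ and $w_2$ are known. The sketch below follows the formula above; the function name and the sample variances passed to it are made-up illustrations, not the hospitalization data.

```python
import math


def welch_degrees_of_freedom(s1_sq, n1, s2_sq, n2):
    """Approximate d.f. for the two-sample t statistic with unequal variances."""
    w1 = s1_sq / n1                  # squared SEM of the first sample mean
    w2 = s2_sq / n2                  # squared SEM of the second sample mean
    return (w1 + w2) ** 2 / (w1 ** 2 / (n1 - 1) + w2 ** 2 / (n2 - 1))


nu_equal = welch_degrees_of_freedom(4.0, 10, 4.0, 10)      # balanced, equal variances
nu_unequal = welch_degrees_of_freedom(1.0, 10, 16.0, 10)   # very different variances
print(nu_equal, math.floor(nu_equal))
print(nu_unequal, math.floor(nu_unequal))
```

With equal sample sizes and equal sample variances the formula returns $n_1 + n_2 - 2$; as the variances diverge it drops toward the d.f. of the noisier sample alone.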
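Approximate $100(1-\alpha)\%$ two-sided CI for $\mu_1 - \mu_2$:
$\bar{x} - \bar{y} - t_{\nu,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{x} - \bar{y} + t_{\nu,\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Test statistic for $H_0: \mu_1 - \mu_2 = \delta_0$ vs. $H_1: \mu_1 - \mu_2 \ne \delta_0$ is
$t = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
Reject $H_0$ if $|t| > t_{\nu,\alpha/2}$.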
19
Hospitalization Costs: Inference Using
Separate Variances
See Example 8.4 on page 280 of course textbook.
20
t.test in S-Plus to test differences in means of
hospitalization data, unequal variances
t.test(lhcc,lhct,var.equal=F)
Welch Modified Two-Sample t-Test
data: lhcc and lhct
t = 0.6181, df = 54.61, p-value = 0.5391
-0.3738420 0.7072124
sample estimates:
mean of x mean of y
8.250925 8.08424
21
Testing for the Equality of Variances
Section 8.4 covers the classical F test for the equality of two variances and
associated confidence intervals. However, this method is not robust against
departures from normality. For example, p-values can be off by a factor of 10 if
the distributions have shorter or longer tails than the normal.
A robust alternative is Levene's Test. This test applies the two-sample t-test to the
absolute value of the difference of each observation and the group mean
$|Y_{1i} - \bar{Y}_1|,\ i = 1, 2, \ldots, n_1$
$|Y_{2i} - \bar{Y}_2|,\ i = 1, 2, \ldots, n_2$
This method works well even though these absolute deviations are not
independent.
In the Brown-Forsythe test the response is the absolute value of the difference of
each observation and the group median.
22
Independent Sample Design: Sample Size Determination
Assuming Equal Variances
$H_0: \mu_1 - \mu_2 = 0$ vs. $H_1: \mu_1 - \mu_2 \ne 0$
$n_1 = n_2 = n = 2\left[\frac{(z_{\alpha/2} + z_{\beta})\,\sigma}{\delta}\right]^2$
Because we assume a known
variance this n is a slight
underestimate of sample size
$\delta$ = smallest difference of practical
importance that we want to detect
23
Using S-Plus to compute sample size
normal.sample.size(mean2=.693,power=0.9)
mean1 sd1 mean2 sd2 delta alpha power n1 n2 prop.n2
0 1 0.693 1 0.693 0.05 0.9 44 44 1
24
Matched Pairs Design
Example:
See Section 8.3.2, page 283 in course textbook.
25
Statistical Justification of Matched Pairs Design
See Section 8.3.2, page 283 in course textbook.
26
Sample Size Determination
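$n = \left[\frac{(z_{\alpha} + z_{\beta})\,\sigma_D}{\delta}\right]^2$ (One-Sided Test)
$n = \left[\frac{(z_{\alpha/2} + z_{\beta})\,\sigma_D}{\delta}\right]^2$ (Two-Sided Test)
One needs a planning value for $\sigma_D$
These formulas come from the one-sample formulas applied
to the differences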
27
Comparing Variances of Two Populations
Application arises when comparing instrument precision or
uniformities of products.
The methods discussed in the book are applicable only under the
assumption of normality of the data. They are highly sensitive
to even modest departures from normality
In case of nonnormal data there are nonparametric and other
robust methods for comparing data dispersion.
28
Comparing Variances of Two Populations
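Independent sample design:
Sample 1: $x_1, x_2, \ldots, x_{n_1}$ is a random sample from $N(\mu_1, \sigma_1^2)$
Sample 2: $y_1, y_2, \ldots, y_{n_2}$ is a random sample from $N(\mu_2, \sigma_2^2)$
$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$ has an F distribution with $n_1 - 1$ and $n_2 - 1$ d.f. respectively
$P\left[f_{n_1-1,n_2-1,1-\alpha/2} \le \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \le f_{n_1-1,n_2-1,\alpha/2}\right] = 1 - \alpha$
$P\left[\frac{1}{f_{n_1-1,n_2-1,\alpha/2}}\frac{S_1^2}{S_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{1}{f_{n_1-1,n_2-1,1-\alpha/2}}\frac{S_1^2}{S_2^2}\right] = 1 - \alpha$
$(1-\alpha)$-level CI (two-sided):
$\frac{1}{f_{n_1-1,n_2-1,\alpha/2}}\frac{S_1^2}{S_2^2} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{1}{f_{n_1-1,n_2-1,1-\alpha/2}}\frac{S_1^2}{S_2^2}$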

29
An Important Industrial Application:
Example 8.8
(See Table 8.8 in course textbook.)
Do the two labs have equal measurement precision?
30
Inferences for Proportions and
Count Data
Tamhane and Dunlop
with some slides by Ramón V. León
(University of Tennessee)
1
Inference for Proportions
Data = {0, 1, 1, 1, 0, 0, ..., 1, 0}, Bernoulli(p)
Goal: estimate p, probability of success (or
proportion of population with a certain attribute)
$\hat{p} = x/n$, where x = number of successes in n trials
$Var(\hat{p}) = p(1-p)/n = pq/n$
Variance depends on the mean.
2
Large Sample Confidence Interval for Proportion

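Recall that $\frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}/n}} \approx N(0,1)$ if n is large
($\hat{q} = 1 - \hat{p}$, $n\hat{p} \ge 10$ and $n(1-\hat{p}) \ge 10$)
It follows that:
$P\left[-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}/n}} \le z_{\alpha/2}\right] \approx 1 - \alpha$
Confidence interval for p:
$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$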
3
A Better Confidence Interval for Proportion
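Use this probability statement
$P\left[-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{pq/n}} \le z_{\alpha/2}\right] \approx 1 - \alpha$
Solve for p using quadratic equation
CI for p:
$\frac{\hat{p} + \frac{z^2}{2n} - \sqrt{\frac{\hat{p}\hat{q}z^2}{n} + \frac{z^4}{4n^2}}}{1 + \frac{z^2}{n}} \le p \le \frac{\hat{p} + \frac{z^2}{2n} + \sqrt{\frac{\hat{p}\hat{q}z^2}{n} + \frac{z^4}{4n^2}}}{1 + \frac{z^2}{n}}$
where $z = z_{\alpha/2}$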
4
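The interval obtained from the quadratic is commonly called the Wilson (score) interval. The sketch below evaluates it next to the simpler large-sample interval; the function names are assumptions, and the counts (360 successes in 800 trials, i.e. $\hat{p}$ = 0.45) are illustrative rather than taken from the textbook.

```python
import math
from statistics import NormalDist


def wald_interval(x, n, confidence=0.95):
    """Simple large-sample interval: p_hat +/- z * sqrt(p_hat*q_hat/n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def score_interval(x, n, confidence=0.95):
    """Interval from solving the quadratic in p (Wilson score interval)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = x / n
    center = p_hat + z * z / (2 * n)
    half = math.sqrt(p_hat * (1 - p_hat) * z * z / n + z ** 4 / (4 * n * n))
    denom = 1 + z * z / n
    return (center - half) / denom, (center + half) / denom


wald = wald_interval(360, 800)
score = score_interval(360, 800)
edge = score_interval(0, 20)      # no successes: Wald collapses to [0, 0]
print(wald, score, edge)
```

For large n the two intervals nearly coincide; the score interval is mainly an improvement for small n or when $\hat{p}$ is near 0 or 1.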
Example
5
Binomial CI
In S-Plus:
>qbinom(.975,800,0.45)
[1] 388
> qbinom(.025,800,0.45)
[1] 332
95% CI for proportion of gun owners is:
$332/800 \le p \le 388/800$
$0.415 \le p \le 0.485$
6
Sample Size Determination for a Confidence
Interval for Proportion
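Want $(1-\alpha)$-level two-sided CI:
$\hat{p} \pm E$ where E is the margin of error. Then $E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$.
Solving for n gives $n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \hat{p}\hat{q}$
Largest value of $pq = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$, so the conservative sample size is:
$n = \left(\frac{z_{\alpha/2}}{E}\right)^2 \frac{1}{4}$ (Formula 9.5)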
7
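Formula 9.5 is easy to tabulate for a few margins of error. In the sketch below the function name and the margins 0.03 and 0.01 are illustrative choices; a planning value for p can be supplied in place of the conservative 1/2.

```python
import math
from statistics import NormalDist


def proportion_sample_size(margin, confidence=0.95, p_guess=0.5):
    """n for a CI of half-width `margin`; p_guess = 0.5 gives the conservative n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z / margin) ** 2 * p_guess * (1 - p_guess))


n_3pct = proportion_sample_size(0.03)
n_1pct = proportion_sample_size(0.01)
print(n_3pct, n_1pct)
```

Cutting the margin of error to a third multiplies the required sample size by about nine.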
Example 9.2: Presidential Poll
Threefold increase in precision requires ninefold increase in sample size
8
Large Sample Hypothesis Test on
Proportion
$H_0: p = p_0$ vs. $H_1: p \ne p_0$
Best test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0/n}}$
Acceptance Region: $p_0 \pm cd$, where $c = z_{\alpha/2}$ and $d = (p_0 q_0/n)^{0.5}$
9
Basketball Problem: z-test
P-value
2.182
10
Exact Binomial Test in S-Plus
1-pbinom(299,400,.7)
0.01553209
[Figure: binomial probabilities dbinom(x, 400, 0.7) (vertical axis 0.0 to 0.04) for x from 240 to 320]
11
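The S-Plus call 1-pbinom(299,400,.7) is the upper-tail probability P(X >= 300) for X ~ Binomial(400, 0.7). It can be reproduced with the standard library; the function name below is an assumption.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_value = binomial_upper_tail(300, 400, 0.7)
print(p_value)
```

This should agree with the S-Plus value shown on the slide up to floating-point rounding.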
Sample Size for Z-Test of Proportion
$H_0: p \le p_0$ vs. $H_1: p > p_0$
Suppose that the power for rejecting $H_0$ must be at
least $1-\beta$ when the true proportion is $p = p_1 > p_0$.
Let $\delta = p_1 - p_0$. Then
$n = \left[\frac{z_{\alpha}\sqrt{p_0 q_0} + z_{\beta}\sqrt{p_1 q_1}}{\delta}\right]^2$
Test based on: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0/n}}$
Replace $z_{\alpha}$ by $z_{\alpha/2}$ for two-sided test sample size.
12
Example 9.4: Pizza Testing
$n = \left[\frac{z_{\alpha/2}\sqrt{p_0 q_0} + z_{\beta}\sqrt{p_1 q_1}}{\delta}\right]^2$

13
Comparing Two Proportions:
Independent Sample Design
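If $n_1\hat{p}_1, n_1\hat{q}_1, n_2\hat{p}_2, n_2\hat{q}_2 \ge 10$, then
$Z = \frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}} \approx N(0,1)$
Confidence Interval:
$\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \le p_1 - p_2 \le \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$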
14
Test for Equality of Proportions (Large n)
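Independent Sample Design
$H_0: p_1 = p_2$ vs. $H_1: p_1 \ne p_2$
Test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
where $\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{x + y}{n_1 + n_2}$ is the pooled estimate of p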
15
Example 9.6
Comparing Two Leukemia Therapies
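A minimal Python sketch of the pooled z-test, using the leukemia counts that appear in the S-Plus crosstabs output in these slides (14 of 21 in row 1 and 38 of 42 in row 2 fall in column 1). Which column counts as "success" is an assumption here, and some cell counts are small by the rule of thumb, so this is for checking the arithmetic only.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x, n1, y, n2):
    """Pooled z statistic and two-sided p-value for H0: p1 = p2."""
    p1, p2 = x / n1, y / n2
    pooled = (x + y) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z(14, 21, 38, 42)
print(round(z, 4), round(z**2, 4), round(p_value, 5))
```

For a 2x2 table, z² equals the chi-square statistic for independence without Yates' correction, so it can be compared against the crosstabs output.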
16
Fisher's Exact Test
Calculates the probability of obtaining the observed 2x2 table or any more extreme, with margins fixed.
Uses the hypergeometric distribution:
$$P(X = x \mid K, M, N) = \frac{\dbinom{M}{x}\dbinom{N-M}{K-x}}{\dbinom{N}{K}}$$
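The hypergeometric probabilities can be enumerated directly. The sketch below applies them to the 2x2 leukemia table from the S-Plus output in these slides (rows 14, 7 and 38, 4). Taking M as the row-1 total and K as the column-1 total is an assumption about how the symbols map onto the table, and the two-sided rule used here (sum all tables no more probable than the observed one) is one common convention, not necessarily the one the slides intend.

```python
from math import comb

def hypergeom_pmf(x, K, M, N):
    """P(X = x | K, M, N) with all margins fixed."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

# Assumed mapping: N = grand total, M = row-1 total, K = column-1 total, x = cell (1,1).
N, M, K, x_obs = 63, 21, 52, 14
support = range(max(0, K - (N - M)), min(M, K) + 1)
probs = {x: hypergeom_pmf(x, K, M, N) for x in support}

p_lower = sum(p for x, p in probs.items() if x <= x_obs)          # one-sided tail
p_two_sided = sum(p for p in probs.values()
                  if p <= probs[x_obs] * (1 + 1e-9))              # tables as or less likely
print(p_lower, p_two_sided)
```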
17
Inference for Count Data
Data = cell counts = number of observations in each of several (>2) categories, $n_i$, i = 1..c, $\sum n_i = n$
Joint distribution of corresponding r.v.'s is multinomial.
Goal: determine if the probabilities of belonging to each of the categories are equal to hypothesized values, $p_{i0}$.
Test statistic: $\chi^2 = \sum (\text{observed} - \text{expected})^2/\text{expected}$, where observed $= n_i$, expected $= np_{i0}$
$\chi^2$ has a chi-square distribution when the sample size is large
18
Multinomial Test of Proportions
19
Inferences for Two-Way Count Data
                     y: Job Satisfaction
x: Annual Salary     Very          Slightly      Slightly   Very       Row Sum
                     Dissatisfied  Dissatisfied  Satisfied  Satisfied
Less than $10,000         81            64           29         10       184
$10,000-25,000            73            79           35         24       211
$25,000-50,000            47            59           75         58       239
More than $50,000         14            23           84         69       190
Column Sum               215           225          223        161       824
Sampling Model 1: Multinomial Model (Total Sample Size Fixed)
Sample of 824 from a single population that is then cross-classified
The null hypothesis is that X and Y are independent:
$$H_0: p_{ij} = P(X = i, Y = j) = P(X = i)P(Y = j) = p_{i\cdot}\,p_{\cdot j} \quad \text{for all } i, j$$
20
Sampling Model 1 (Total Sample Size Fixed)
Based on Table 9.10 in the course textbook
                     y: Job Satisfaction
x: Annual Salary     Very          Slightly      Slightly   Very       Row Sum
                     Dissatisfied  Dissatisfied  Satisfied  Satisfied
Less than $10,000         81            64           29         10       184
$10,000-25,000            73            79           35         24       211
$25,000-50,000            47            59           75         58       239
More than $50,000         14            23           84         69       190
Column Sum               215           225          223        161       824
Estimated Expected Frequency (Cell 1,1):
$$\hat e_{11} = n\,\hat p_{1\cdot}\,\hat p_{\cdot 1} = 824\left(\frac{184}{824}\right)\left(\frac{215}{824}\right) = \frac{215 \times 184}{824} = 48.01$$
21
Chi-Square Statistics
See Example 9.13, page 324 for instructions on
calculating the chi-square statistic.
$$\chi^2 = \sum_{i=1}^{c} \frac{(n_i - \hat e_i)^2}{\hat e_i}$$
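For a two-way table the same sum runs over every cell, with expected counts computed from the margins. A small Python sketch, applied to the 824-observation salary by job satisfaction table shown in these slides (the function name is illustrative):

```python
def chi_square_independence(table):
    """Pearson chi-square statistic, d.f. and expected counts for a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    # Estimated expected frequency: (row total)(column total)/n.
    expected = [[r * c / n for c in col_tot] for r in row_tot]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df, expected

salary = [[81, 64, 29, 10],
          [73, 79, 35, 24],
          [47, 59, 75, 58],
          [14, 23, 84, 69]]
stat, df, expected = chi_square_independence(salary)
print(round(expected[0][0], 2), df, round(stat, 2))
```

The statistic is then compared with the chi-square critical value for the computed degrees of freedom.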
22
Chi-Square Test Critical Value
Based on Table A.5, critical values $\chi^2_{\nu,\alpha}$ for the Chi-square Distribution, in the course textbook:
The d.f. for this statistic is (4-1)(4-1) = 9. Since $\chi^2_{9,.05} = 16.919$, the calculated $\chi^2 = 11.989$ is not sufficiently large to reject the hypothesis of independence at the α = .05 level.
[Table excerpt: rows ν = 1, ..., 11; columns α = .995, .99, .975, .95, .90, .10, .05; the only entry shown is 16.919 for ν = 9, α = .05.]
23
S-Plus job satisfaction example
Call:
crosstabs(formula = c(jobsat) ~ c(row(jobsat)) + c(col(jobsat)))
901 cases in table
+----------+
|N |
|N/RowTotal|
|N/ColTotal|
|N/Total |
+----------+
c(row(jobsat))|c(col(jobsat))
|1 |2 |3 |4 |RowTotl|
-------+-------+-------+-------+-------+-------+
1 | 20 | 24 | 80 | 82 |206 |
|0.097 |0.12 |0.39 |0.4 |0.23 |
|0.32 |0.22 |0.25 |0.2 | |
|0.022 |0.027 |0.089 |0.091 | |
-------+-------+-------+-------+-------+-------+
2 | 22 | 38 |104 |125 |289 |
|0.076 |0.13 |0.36 |0.43 |0.32 |
|0.35 |0.35 |0.33 |0.3 | |
|0.024 |0.042 |0.12 |0.14 | |
-------+-------+-------+-------+-------+-------+
3 | 13 | 28 | 81 |113 |235 |
|0.055 |0.12 |0.34 |0.48 |0.26 |
|0.21 |0.26 |0.25 |0.27 | |
|0.014 |0.031 |0.09 |0.13 | |
-------+-------+-------+-------+-------+-------+
4 | 7 | 18 | 54 | 92 |171 |
|0.041 |0.11 |0.32 |0.54 |0.19 |
|0.11 |0.17 |0.17 |0.22 | |
|0.0078 |0.02 |0.06 |0.1 | |
-------+-------+-------+-------+-------+-------+
ColTotl|62 |108 |319 |412 |901 |
|0.069 |0.12 |0.35 |0.46 | |
-------+-------+-------+-------+-------+-------+
Test for independence of all factors
Chi^2 = 11.98857 d.f.= 9 (p=0.2139542)
Yates' correction not used
24
>
Product Multinomial Model:
Row Totals Fixed
(See Table 9.2 in the course textbook.)
Sampling Model 2: Product Multinomial
Total number of patients in each drug group is fixed.
The null hypothesis is that the probability of column response (success or failure) is the same, regardless of the row population:
$$H_0: P(Y = j \mid X = i) = p_j$$
25
S-Plus leukemia trial
Call:
crosstabs(formula = c(leuk) ~ c(row(leuk)) + c(col(leuk)))
63 cases in table
+----------+
|N |
|N/RowTotal|
|N/ColTotal|
|N/Total |
+----------+
c(row(leuk))|c(col(leuk))
|1 |2 |RowTotl|
-------+-------+-------+-------+
1 |14 | 7 |21 |
|0.67 |0.33 |0.33 |
|0.27 |0.64 | |
|0.22 |0.11 | |
-------+-------+-------+-------+
2 |38 | 4 |42 |
|0.9 |0.095 |0.67 |
|0.73 |0.36 | |
|0.6 |0.063 | |
-------+-------+-------+-------+
ColTotl|52 |11 |63 |
|0.83 |0.17 | |
-------+-------+-------+-------+
Test for independence of all factors
Chi^2 = 5.506993 d.f.= 1 (p=0.01894058)
Yates' correction not used
Some expected values are less than 5, don't trust stated p-value
>
26
Remarks About Chi-Square Test
The distribution of the chi-square statistic under the null hypothesis is approximately chi-square only when the sample sizes are large.
The rule of thumb is that all expected cell counts should be greater than 1 and no more than 1/5th of the expected cell counts should be less than 5.
Combine sparse cells (those having small expected cell counts) with adjacent cells. Unfortunately, this has the drawback of losing some information.
Never stop with the chi-square test. Look at cells with large values of (O-E), as in the job satisfaction example.
27
Odds Ratio as a Measure of
Association for a 2x2 Table
Sampling Model I: Multinomial
$$\psi = \frac{p_{11}/p_{12}}{p_{21}/p_{22}}$$
The numerator is the odds of the column 1 outcome vs. the column 2 outcome for row 1, and the denominator is the same odds for row 2, hence the name "odds ratio".
28
Odds Ratio as a Measure of
Association for a 2x2 Table
Sampling Model II: Product Multinomial
$$\psi = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}$$
If the two column outcomes are labeled as "success" and "failure", then ψ is the odds of success for the row 1 population vs. the odds of success for the row 2 population.
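As a quick numerical illustration, the sample odds ratio for the 2x2 leukemia table in the S-Plus output (rows 14, 7 and 38, 4) can be computed as below. Treating column 1 as "success" is an assumption made for the example.

```python
def odds_ratio(table):
    """Sample odds ratio for a 2x2 table [[a, b], [c, d]]: (a/b) / (c/d)."""
    (a, b), (c, d) = table
    return (a / b) / (c / d)

psi = odds_ratio([[14, 7], [38, 4]])
print(round(psi, 4))   # odds of column 1 in row 1 relative to row 2
```

A value below 1 means the odds of the column 1 outcome are lower in row 1 than in row 2.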
29
Inference in a Nutshell
Corresponds to Chapters 6-9 of
Tamhane and Dunlop
1
Outline
Chapter 6: Basic Concepts of Inference
Mean Square Error
Confidence Interval
Hypothesis Test
Chapter 7: Inference for Single Samples
Mean - Large Sample - z
Mean - Small Sample - t
Variance - Chi-square
Prediction and Tolerance Intervals
2
Outline (continued)
Chapter 8: Inference for Two Samples
Comparing Means, Independent, Large Sample - z
Comparing Means, Independent, Small Sample
Variances equal - t
Variances not equal - t with df from SEM
Matched Pairs - test differences - t
Comparing Variances - F
3
Outline (continued)
Chapter 9: Inferences for Proportions and Count Data
Proportion, Large sample - z
Proportion, Small sample - binomial
Comparing 2 Proportions, large - z or Chi-square
Comparing 2 Proportions, small - Fisher's Exact
Matched Pairs - McNemar's Test
One-way Count - Chi-square
Two-way Count - Chi-square
Goodness of Fit - Chi-square
Odds ratio - z
4
Confidence Interval on the Mean
$\hat\mu \pm cd$ is a two-sided CI for the mean μ, where:
$\hat\mu$ = estimator of μ = sample mean
d = standard deviation of $\hat\mu$.
c = critical constant, for instance, $z_{\alpha/2}$ or $t_{n-1,\alpha/2}$.
$z_{\alpha/2}$ is such that $P(Z > z_{\alpha/2}) = \alpha/2$.
$z_{\alpha/2} = \Phi^{-1}(1-\alpha/2)$ = qnorm(1-α/2) = -qnorm(α/2)
If α = 0.05 then $z_{\alpha/2}$ = 1.96.
If we draw many samples and construct 95% CIs from them, about 95% would contain the true value of μ.
5
Confidence Intervals
(See Figure 6.2 on page 205 of the course
textbook.)
6
Hypothesis Tests
$H_0$: null hypothesis, no change, no effect, for instance $\mu = \mu_0$
$H_1$: alternative hypothesis, $\mu \ne \mu_0$
α = P(Type I error) = P(reject $H_0$ | $H_0$ true)
β = P(Type II error) = P(accept $H_0$ | $H_0$ false)
Power = function of μ = P(reject $H_0$ | μ)
A two-sided hypothesis test rejects $H_0$ when
$|\hat\mu - \mu_0|/d > c \iff |\hat\mu - \mu_0| > cd \iff \hat\mu < \mu_0 - cd$ or $\hat\mu > \mu_0 + cd$
7
Level α Tests
(See Table 7.1 on page 240 of the course
textbook.)
8
P-Values
P-Value is the probability of obtaining the observed result or one more extreme.
Two-sided P-Value
$= P(|Z| > |\hat\mu - \mu_0|/d)$
$= 2[1 - \Phi(|\hat\mu - \mu_0|/d)]$
= 2*(1-pnorm(abs($\hat\mu - \mu_0$)/d)) in S-Plus
9
P-Values
textbook.)
10
Power Function
Power is the probability of rejecting $H_0$ for a given value of μ.
$$\pi(\mu) = P(\hat\mu < \mu_0 - cd \mid \mu) + P(\hat\mu > \mu_0 + cd \mid \mu)$$
$$= \Phi[-c + (\mu_0 - \mu)/d] + \Phi[-c + (\mu - \mu_0)/d]$$
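The power function and the two-sided P-value are both simple expressions in Φ. A minimal Python sketch for the z case (c = z_{α/2}); the function names are illustrative, and the numerical inputs in the usage lines are hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf            # standard normal c.d.f.

def two_sided_p_value(mu_hat, mu0, d):
    """2[1 - Phi(|mu_hat - mu0| / d)], d = standard deviation of mu_hat."""
    return 2 * (1 - PHI(abs(mu_hat - mu0) / d))

def power(mu, mu0, d, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu."""
    c = NormalDist().inv_cdf(1 - alpha / 2)
    return PHI(-c + (mu0 - mu) / d) + PHI(-c + (mu - mu0) / d)

print(power(0.0, 0.0, 1.0))               # at mu = mu0 the power equals alpha
print(power(3.0, 0.0, 1.0))               # power grows as mu moves away from mu0
print(two_sided_p_value(1.96, 0.0, 1.0))  # close to 0.05
```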
11
Power
(See Figure 7.3 on page 245 of the course
textbook.)
12
Reject $H_0$
(1) if $\mu_0$ falls outside the interval $\hat\mu \pm cd$.
(2) if $\hat\mu$ falls outside the interval $\mu_0 \pm cd$.
(3) if the p-value is small.
13
Simple Linear Regression
and Correlation.
Corresponds to
Chapter 10
Tamhane and Dunlop
with some slides by
Jacqueline Telford (Johns Hopkins University)
1
Simple linear regression analysis estimates the relationship
between two variables.
One of the variables is regarded as a response or outcome
variable (y).
The other variable is regarded as predictor or explanatory
variable (x).
Sometimes it is not clear which of two variables should be the
response (e.g. height and weight). In this case, correlation
analysis may be used.
Simple linear regression estimates relationships of the form
y = a + bx.
2
Scatter plot of ozone concentration by temperature
[Figure: air$ozone (y-axis, 1 to 5) plotted against air$temperature (x-axis, 60 to 90).]
3
A Probabilistic Model for Simple Linear Regression
Let $x_1, x_2, \dots, x_n$ be specific settings of the predictor variable.
Let $y_1, y_2, \dots, y_n$ be the corresponding values of the response variable.
Assume that $y_i$ is the observed value of a random variable (r.v.) $Y_i$, which depends on $x_i$ according to the following model:
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad (i = 1, 2, \dots, n)$$
Here $\varepsilon_i$ is the random error with $E(\varepsilon_i) = 0$ and $Var(\varepsilon_i) = \sigma^2$.
Thus, $E(Y_i) = \mu_i = \beta_0 + \beta_1 x_i$ (true regression line).
The $x_i$'s usually are assumed to be fixed (not random variables).
4
A Probabilistic Model for Simple Linear Regression
See Figure 10.1, p. 348 and also see page 348 for the four
assumptions of a simple linear regression model.
5
Least Square Line Mathematics (invented by Gauss)
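Find the line, i.e., values of $\beta_0$ and $\beta_1$, that minimizes the sum of the squared deviations:
$$Q = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$$
How?
Solve for values of $\beta_0$ and $\beta_1$ for which
$$\frac{\partial Q}{\partial \beta_0} = 0 \quad\text{and}\quad \frac{\partial Q}{\partial \beta_1} = 0$$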
6
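Finding Regression Coefficients
$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]$$
$$\frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i [y_i - (\beta_0 + \beta_1 x_i)]$$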
7
Normal Equations
$$\beta_0 n + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$
$$\beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$
8
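Solution to Normal Equations
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{S_{xy}}{S_{xx}}$$
$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
Note that the least squares line goes through $(\bar x, \bar y)$.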
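The closed-form solution can be sketched in a few lines of Python. The five (temperature, ozone) pairs are the rows listed in the fitted-values table of these slides; because only five of the observations are used, the coefficients will differ from the full ozone fit, so this is a check of the formulas rather than a reproduction of the results.

```python
temperature = [67, 72, 74, 62, 65]
ozone = [3.45, 3.30, 2.29, 2.62, 2.84]

def least_squares(x, y):
    """Return (b0, b1) from b1 = Sxy / Sxx and b0 = ybar - b1 * xbar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = least_squares(temperature, ozone)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(temperature, ozone)]

# Both normal equations hold at the solution: residuals sum to zero
# and are orthogonal to x (up to rounding error).
print(sum(residuals), sum(r * xi for r, xi in zip(residuals, temperature)))
```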
9
Fitted regression line
[Figure: air$ozone (y-axis, 1 to 5) against air$temperature (x-axis, 60 to 90) with the fitted regression line.]
10
Fitted values of $y_i$: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$, i = 1, 2, ..., n
Residuals: $e_i = y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$, i = 1, 2, ..., n
temperature  ozone  fitted  resid
         67   3.45    2.49   0.96
         72   3.30    2.84   0.46
         74   2.29    2.98  -0.69
         62   2.62    2.14   0.48
         65   2.84    2.35   0.50
11
Matrix Approach to Simple Linear Regression
(what your regression package is really doing)
The model: $y = X\beta + \varepsilon$
y is n by 1
X is n by 2
β is 2 by 1
ε is n by 1
12
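$Y = X\beta + \varepsilon$
$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \end{bmatrix}$$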
13
Solution of linear equations
In linear algebra:
Find x which solves Ax=b.
In regression analysis:
Find β which solves Xβ = y
Why can't we do this?
14
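Least Squares
$Q = (y - X\beta)'(y - X\beta)$
$= y'y - \beta'X'y - y'X\beta + \beta'X'X\beta$
$= y'y - 2\beta'X'y + \beta'X'X\beta$
$\partial Q/\partial\beta = -2X'y + 2X'X\beta$
$\partial Q/\partial\beta = 0 \Rightarrow X'y = X'Xb$, where $b = \hat\beta$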
15
Least Squares continued
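For simple linear regression:
$$X'X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}, \qquad X'y = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$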
16
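X'Xb = X'y
$$\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} b = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}$$
The Normal Equations as before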
17
X'Xb = X'y
$b = (X'X)^{-1}X'y$ (if X has linearly independent columns)
Solution by QR decomposition:
X = QR, Q orthonormal, R upper triangular and invertible
$b = (X'X)^{-1}X'y = (R'Q'QR)^{-1}R'Q'y = (R'R)^{-1}R'Q'y = R^{-1}Q'y$
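To make the QR route concrete, here is a small pure-Python sketch for the two-column design matrix X = [1, x]. It builds Q and R by Gram-Schmidt and solves Rb = Q'y by back substitution. The data are the five ozone rows listed in these slides, used purely for illustration; production code would normally call a linear algebra library instead.

```python
from math import sqrt

def qr_simple_regression(x, y):
    """Least squares for X = [1, x] via Gram-Schmidt QR and back substitution."""
    n = len(x)
    # First column of X is all ones: q1 = 1/sqrt(n), r11 = sqrt(n).
    r11 = sqrt(n)
    q1 = [1 / r11] * n
    # Orthogonalize the x column against q1.
    r12 = sum(q * xi for q, xi in zip(q1, x))
    v = [xi - r12 * q for q, xi in zip(q1, x)]
    r22 = sqrt(sum(vi * vi for vi in v))
    q2 = [vi / r22 for vi in v]
    # c = Q'y, then solve the upper triangular system R b = c.
    c1 = sum(q * yi for q, yi in zip(q1, y))
    c2 = sum(q * yi for q, yi in zip(q2, y))
    b1 = c2 / r22
    b0 = (c1 - r12 * b1) / r11
    return b0, b1

x = [67, 72, 74, 62, 65]
y = [3.45, 3.30, 2.29, 2.62, 2.84]
b0_qr, b1_qr = qr_simple_regression(x, y)
print(b0_qr, b1_qr)
```

The result should agree with the closed-form Sxy/Sxx solution, while avoiding the explicit inverse of X'X.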
18
The Hat Matrix
$b = (X'X)^{-1}X'y$
$\hat y = Xb = X(X'X)^{-1}X'y = Hy$
H (n by n) is the Hat matrix
Takes y to $\hat y$
H is symmetric and idempotent: HH = H
Diagonal elements of the hat matrix are useful in detecting influential observations.
19
Expected value of b
$E(b) = E[(X'X)^{-1}X'y]$
$= E[(X'X)^{-1}X'(X\beta + \varepsilon)]$
$= E[(X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon]$
$= \beta$
Hence b is an unbiased estimator of β.
20
Covariance of b
The covariance matrix of y is $\sigma^2 I$
$b = (X'X)^{-1}X'y = Ay$ (where A is k by n)
$Cov(b) = A\,Var(y)\,A' = A\,\sigma^2 I\,A' = \sigma^2 AA'$
$= \sigma^2 (X'X)^{-1}X'X(X'X)^{-1}$
$= \sigma^2 (X'X)^{-1}$
21
Covariance of b
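For simple linear regression,
$$\sigma^2 (X'X)^{-1} = \frac{\sigma^2}{n\sum x_i^2 - \left(\sum x_i\right)^2} \begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix}$$
$$SD(b_0) = \sigma\sqrt{\frac{\sum x_i^2}{n S_{xx}}}; \qquad SD(b_1) = \frac{\sigma}{\sqrt{S_{xx}}}$$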
22
Estimation of σ²
$$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n-2}$$
Note: The denominator is n - 2 since two parameters are being estimated ($\beta_0$ and $\beta_1$).
$E[S^2] = \sigma^2$ (See proof in Seber, Linear Regression Analysis)
23
Statistical Inference for β0 and β1
$$SE(\hat\beta_0) = s\sqrt{\frac{\sum x_i^2}{n S_{xx}}} \quad\text{and}\quad SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}$$
For ozone example:
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -2.2260 0.4614 -4.8243 0.0000
temperature 0.0704 0.0059 11.9511 0.0000
24
Sums of Squares
Total Sum of Squares (SST): $\sum_{i=1}^{n} (y_i - \bar y)^2$
Sum of Squares for Error (SSE): $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2$
Sum of Squares for Regression (SSR): $\sum_{i=1}^{n} (\hat y_i - \bar y)^2$
25
Geometry of the Sums of Squares
$$(y_i - \bar y) = (\hat y_i - \bar y) + (y_i - \hat y_i)$$
SST = SSR + SSE, see derivation on p. 354
26
J. Telford
Coefficient of Determination (R-squared)
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
= proportion of the variance in y that is accounted for by the regression on x
= square of the correlation between y and $\hat y$
For ozone example:

Multiple R-Squared: 0.5672
27
Analysis of Variance (ANOVA)
$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$
$$F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} = t^2$$
For ozone example:
summary.aov(tmp)
Df Sum of Sq Mean Sq F Value Pr(F)
temperature 1 49.46178 49.46178 142.8282 0
Residuals 109 37.74698 0.34630
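The quantities on the last few slides tie together arithmetically, which gives a useful sanity check on any regression printout. The Python sketch below takes the sums of squares and the temperature t value from the ozone output above and recomputes R-squared and F.

```python
# Values copied from the ozone ANOVA table and coefficient table.
ssr, sse, df_error = 49.46178, 37.74698, 109
t_value = 11.9511                       # t value for temperature

sst = ssr + sse                         # SST = SSR + SSE
r_squared = ssr / sst                   # SSR / SST
mse = sse / df_error                    # SSE / (n - 2)
f_value = (ssr / 1) / mse               # MSR / MSE

print(round(r_squared, 4), round(f_value, 2), round(t_value ** 2, 2))
```

The F value and the squared t value agree up to the rounding in the printed t value.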
28
Regression Diagnostics
Residual vs. observation number
[Figure: resid(ozone.lm) (y-axis, -1 to 2) against observation number (x-axis, 0 to 100).]
29
Residual vs. fitted value
[Figure: resid(ozone.lm) (y-axis, -1 to 2) against fitted(ozone.lm) (x-axis, 2.0 to 4.5).]
30
Regression Diagnostics
Residual vs. x
[Figure: resid(ozone.lm) (y-axis, -1 to 2) against air$temperature (x-axis, 60 to 90).]
31
QQ plot of residuals
[Figure: normal quantile plot of resid(ozone.lm); x-axis from -2 to 2, y-axis from -1 to 2.]
32
Hat Matrix Diagonals
[Figure: hat(model.matrix(ozone.lm)) (y-axis, 0.01 to 0.05) against observation number (x-axis, 0 to 100).]
33
Some useful S-Plus commands
my.lm <- lm(y~x, data=mydata, na.action=na.omit)
includes intercept term by default
summary(my.lm)
gives coefficients, correlation of coefficients, R-square, F-statistic, residual standard error
summary.aov(my.lm)
gives ANOVA table
resid(my.lm)
gives residuals
fitted(my.lm)
gives fitted values
model.matrix(my.lm)
gives model matrix
34
Multiple Linear Regression
Tamhane & Dunlop
with some slides by Roy Welsch (MIT).
Linear Regression
Review:
Linear Model: $y = X\beta + \varepsilon$
$y \sim N(X\beta, \sigma^2 I)$
Least squares: $\hat\beta = (X'X)^{-1}X'y$
$\hat y$ = fitted value of y = $X\hat\beta = X(X'X)^{-1}X'y = Hy$
e = error = residuals = $y - \hat y = y - Hy = (I-H)y$
2
Properties of the Hat matrix
Symmetric: H' = H
Idempotent: HH = H
Trace(H) = sum(diag(H)) = k+1 = number of columns in the X matrix
1'H = vector of 1s (hence y and $\hat y$ have the same mean)
1'(I-H) = vector of 0s (hence the mean of the residuals is 0).
What is H when X is only a column of 1s?
3
Variance-Covariance Matrices
$Cov(\hat\beta) = \sigma^2 (X'X)^{-1}$ (as we saw last time)
$Cov(\hat y) = Cov(Hy) = H\,Cov(y)\,H' = \sigma^2 HIH' = \sigma^2 H$
$Cov(e) = Cov((I-H)y) = (I-H)\,Cov(y)\,(I-H)' = \sigma^2 (I-H)I(I-H) = \sigma^2 (I-H)$
4
Confidence and Prediction Intervals
Variance of the mean response at $x_0$:
$$Var(\hat y_0) = Var(x_0'\hat\beta) = \sigma^2 x_0'(X'X)^{-1}x_0 = \sigma^2 v_0$$
Variance of a new observation at $x_0$, $y_0 = \hat y_0 + \varepsilon_0$:
$$Var(y_0) = Var(\hat y_0) + Var(\varepsilon_0) = \sigma^2 x_0'(X'X)^{-1}x_0 + \sigma^2 = \sigma^2 (x_0'(X'X)^{-1}x_0 + 1) = \sigma^2 (v_0 + 1)$$
An estimate of σ² is $s^2 = MSE = y'(I-H)y/(n-k-1)$
5
Confidence and Prediction Intervals
(1-α) Confidence Interval on Mean Response at $x_0$:
$\hat y_0 \pm cd$, where $c = t_{n-(k+1),\,\alpha/2}$ and $d = s\sqrt{v_0}$
(1-α) Prediction Interval on New Observation at $x_0$:
$\hat y_0 \pm cd$, where $c = t_{n-(k+1),\,\alpha/2}$ and $d = s\sqrt{v_0 + 1}$
6
Sums of Squares
Total Sum of Squares (SST): $\sum_{i=1}^{n} (y_i - \bar y)^2$
Sum of Squares for Error (SSE): $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2$
Sum of Squares for Regression (SSR): $\sum_{i=1}^{n} (\hat y_i - \bar y)^2$
SSR = SST - SSE
7
Overall Significance Test
To see if there is any linear relationship we test:
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_1: \beta_j \ne 0$ for some j.
Compute
$$SST = \sum (y_i - \bar y)^2, \quad SSE = \sum (y_i - \hat y_i)^2, \quad SSR = SST - SSE$$
The F statistic is:
$$F = \frac{SSR/k}{SSE/(n-k-1)} = \frac{MSR}{MSE}$$
with F based on k and (n - k - 1) degrees of freedom.
Reject $H_0$ when F exceeds $F_{k,\,n-k-1}(\alpha)$.
8
Sequential Sums of Squares
SSR(x1) = SST - SSE(x1)
SSR(x2|x1) = SSR(x1,x2) - SSR(x1) =
SSE(x1) - SSE(x1,x2)
SSR(x3|x1 x2) = SSE(x1,x2) - SSE(x1,x2,x3)
9
ANOVA Table
Type 1 (sequential) sums of squares
Source of Variation   SS               df
Regression            SSR(x1,x2,x3)    3
  x1                  SSR(x1)          1
  x2|x1               SSR(x2|x1)       1
  x3|x2,x1            SSR(x3|x2,x1)    1
Error                 SSE(x1,x2,x3)    n-4
Total                 SST              n-1
10
ANOVA Table
Type 3 (partial) sums of squares
Source of Variation   SS               df
Regression            SSR(x1,x2,x3)    3
  x1|x2,x3            SSR(x1|x2,x3)    1
  x2|x1,x3            SSR(x2|x1,x3)    1
  x3|x1,x2            SSR(x3|x1,x2)    1
Error                 SSE(x1,x2,x3)    n-4
Total                 SST              n-1
11
12
Scatter plot Matrix of the Air Data Set in S-Plus
pairs(air)
[Figure: scatter plot matrix of ozone, radiation, temperature, and wind.]
air.lm<-lm(y~x1+x2+x3)
> summary(air.lm)$coef
(Intercept) -0.297329634 0.5552138923 -0.5355227 5.933998e-001
x1 0.002205541 0.0005584658 3.9492854 1.407070e-004
x2 0.050044325 0.0061061612 8.1957098 5.848655e-013
x3 -0.076021950 0.0157548357 -4.8253090 4.665124e-006
> summary.aov(air.lm)
x1 1 15.53144 15.53144 59.6761 6.000000e-012
x2 1 37.76939 37.76939 145.1204 0.000000e+000
x3 1 6.05985 6.05985 23.2836 4.665124e-006
Residuals 107 27.84808 0.26026
> summary.aov(air.lm,ssType=3)
Type III Sum of Squares
x1 1 4.05928 4.05928 15.59685 0.0001407070
x2 1 17.48174 17.48174 67.16966 0.0000000000
x3 1 6.05985 6.05985 23.28361 0.0000046651
Residuals 107 27.84808 0.26026
>
13
Polynomial Models
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k$$
Problems:
Powers of x tend to be large in magnitude
Powers of x tend to be highly correlated
Solutions:
Centering and scaling of x variables
Orthogonal polynomials (poly(x,k) in S-Plus,
see Seber for methods of generating)
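The effect of centering and scaling on the correlation between x and x² can be seen with a tiny Python sketch. The six weights below are hypothetical and evenly spaced, so centering removes the linear correlation with the square completely; with real, asymmetric data centering reduces the correlation but does not eliminate it.

```python
from math import sqrt

def corr(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

wt = [2000, 2500, 3000, 3500, 4000, 4500]        # hypothetical car weights
mean_wt = sum(wt) / len(wt)
sd_wt = sqrt(sum((w - mean_wt) ** 2 for w in wt) / (len(wt) - 1))
wts = [(w - mean_wt) / sd_wt for w in wt]        # centered and scaled, as in S-Plus

raw = corr(wt, [w ** 2 for w in wt])             # x vs x^2 on the raw scale
centred = corr(wts, [w ** 2 for w in wts])       # same after centering and scaling
print(round(raw, 4), round(centred, 4))
```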
14
15
Plot of mpg vs. weight for 74 autos (S-Plus dataset auto.stats)
[Figure: mpg (y-axis, 15 to 40) against wt (x-axis, 2000 to 4500).]
summary(lm(mpg~wt+wt^2+wt^3))
Call: lm(formula = mpg ~ wt + wt^2 + wt^3)
Residuals:
Min 1Q Median 3Q Max
-6.415 -1.556 -0.2815 1.265 13.06
Coefficients:
(Intercept) 68.1797 21.4515 3.1783 0.0022
wt -0.0309 0.0214 -1.4430 0.1535
I(wt^2) 0.0000 0.0000 0.9586 0.3410
I(wt^3) 0.0000 0.0000 -0.7449 0.4588
Residual standard error: 3.209 on 70 degrees of freedom
F-statistic: 55.76 on 3 and 70 degrees of freedom, the p-value is 0
Correlation of Coefficients:
(Intercept) wt I(wt^2)
wt -0.9958
I(wt^2) 0.9841 -0.9961
I(wt^3) -0.9659 0.9846 -0.9961
16
wts<-(wt-mean(wt))/sqrt(var(wt))
summary(lm(mpg~wts+wts^2+wts^3))
Call: lm(formula = mpg ~ wts + wts^2 + wts^3)
Residuals:
-6.415 -1.556 -0.2815 1.265 13.06
Coefficients:
(Intercept) 20.2331 0.5676 35.6470 0.0000
wts -4.4466 0.7465 -5.9567 0.0000
I(wts^2) 1.1241 0.4682 2.4007 0.0190
I(wts^3) -0.2521 0.3385 -0.7449 0.4588
(Intercept) wts I(wts^2)
wts -0.2800
I(wts^2) -0.7490 0.4558
I(wts^3) 0.3925 -0.8596 -0.6123
17
Orthogonal Polynomials
Generation is similar to Gram-Schmidt
orthogonalization (see Strang, Linear Algebra)
Resulting vectors are orthonormal: X'X = I
Hence $(X'X)^{-1} = I$ and coefficients $= (X'X)^{-1}X'y = X'y$
Addition of higher degree term does not affect
coefficients for lower degree terms
Correlation of coefficients = I
SE of coefficients = s = sqrt(MSE)
18
summary(lm(mpg~poly(wt,3)))
Call: lm(formula = mpg ~ poly(wt, 3))
Residuals:
-6.415 -1.556 -0.2815 1.265 13.06
Coefficients:
(Intercept) 21.2973 0.3730 57.0912 0.0000
poly(wt, 3)1 -40.6769 3.2090 -12.6758 0.0000
poly(wt, 3)2 7.8926 3.2090 2.4595 0.0164
poly(wt, 3)3 -2.3904 3.2090 -0.7449 0.4588
(Intercept) poly(wt, 3)1 poly(wt, 3)2
poly(wt, 3)1 0
poly(wt, 3)2 0 0
poly(wt, 3)3 0 0 0
19
20
Plot of mpg by weight with fitted regression line
[Figure: mpg (y-axis, 15 to 40) against wt (x-axis, 2000 to 4500) with the fitted curve.]
Indicator Variables
Sometimes we might want to fit a model with a
categorical variable as a predictor. For instance,
automobile price as a function of where the car
is made (Germany, Japan, USA).
If there are c categories, we need c-1 indicator
(0,1) variables as predictors. For instance j=1 if
car is made in Japan, 0 otherwise, u=1 if car is
made in USA, 0 otherwise.
If there are just 2 categories and no other
predictors, we could just do a t-test for difference
in means.
21
22
Boxplots of price by country for S-Plus dataset cu.summary
[Figure: boxplots of price (y-axis, 10000 to 40000) by cntry (Germany, Japan, USA).]
23
Histogram of automobile prices for S-Plus dataset cu.summary
[Figure: histogram of price (x-axis, 10000 to 40000; counts 0 to 40).]
24
Histogram of log of automobile prices for S-Plus dataset cu.summary
[Figure: histogram of log(price) (x-axis, 9.0 to 10.5; counts 0 to 20).]
summary(lm(price~u+j))
Call: lm(formula = price ~ u + j)
Residuals:
-15746 -4586 -2071 2374 22495
Coefficients:
(Intercept) 25741.3636 2282.2729 11.2788 0.0000
u -10520.5473 2525.4871 -4.1657 0.0001
j -10236.0088 2656.5095 -3.8532 0.0002
Residual standard error: 7569 on 88 degrees of freedom
F-statistic: 9.159 on 2 and 88 degrees of freedom, the p-value is
0.0002435
(Intercept) u
u -0.9037
j -0.8591 0.7764
25
summary(lm(price~u+g))
Call: lm(formula = price ~ u + g)
Residuals:
-15746 -4586 -2071 2374 22495
Coefficients:
(Intercept) 15505.3548 1359.5121 11.4051 0.0000
u -284.5385 1737.1208 -0.1638 0.8703
g 10236.0088 2656.5095 3.8532 0.0002
Residual standard error: 7569 on 88 degrees of freedom
F-statistic: 9.159 on 2 and 88 degrees of freedom, the p-value is
0.0002435
(Intercept) u
u -0.7826
g -0.5118 0.4005
26
Goal: identify remarkable observations and unremarkable
predictors.
Problems with observations:
Outliers
Influential observations
Problems with predictors:
A predictor may not add much to model.
A predictor may be too similar to another predictor
(collinearity).
Predictors may have been left out.
27
28
Plot of standardized residuals vs. fitted values for air dataset
[Figure: standardized residual (y-axis, -2 to 3) against fitted value (x-axis, 2 to 4); points are labeled with observation numbers 1 to 111.]
Plot of residual vs. fit for air data set with all interaction terms
[Figure: resid(tmp) (y-axis, -1.0 to 1.5) against fitted(tmp) (x-axis, 2.0 to 5.0).]
29
30
Plot of residual vs. fit for air model with x3*x4 interaction
[Figure: resid(tmp) (y-axis, -1.0 to 1.5) against fitted(tmp) (x-axis, 2 to 5).]
Call: lm(formula = air[, 1] ~ air[, 2] + air[, 3] + air[, 4] + air[, 3] * air[, 4])
Residuals:
-1.088 -0.3542 -0.07242 0.3436 1.47
Coefficients:
(Intercept) -3.6465 1.1684 -3.1209 0.0023
air[, 2] 0.0023 0.0005 4.3223 0.0000
air[, 3] 0.0920 0.0143 6.4435 0.0000
air[, 4] 0.2523 0.1031 2.4478 0.0160
air[, 3]:air[, 4] -0.0042 0.0013 -3.2201 0.0017
(Intercept) air[, 2] air[, 3] air[, 4]
air[, 2] -0.0361
air[, 3] -0.9880 -0.0495
air[, 4] -0.9268 0.0620 0.9313
air[, 3]:air[, 4] 0.8902 -0.0661 -0.9119 -0.9892
>
31
Remarkable Observations
Residuals are the key
Standardized residuals:
$$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$
Outlier if $|e_i^*| > 2$
Hat matrix diagonals, $h_{ii}$
Influential if $h_{ii} > 2(k+1)/n$
Cook's Distance
$$d_i = \left(\frac{e_i^{*2}}{k+1}\right)\left(\frac{h_{ii}}{1 - h_{ii}}\right)$$
Influential if $d_i > 1$
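These diagnostics are straightforward to compute for a simple linear regression (k = 1), where the hat diagonals have the closed form $h_{ii} = 1/n + (x_i - \bar x)^2/S_{xx}$. The Python sketch below uses hypothetical data with one point far out in x; the function name and data are illustrative only.

```python
from math import sqrt

def influence_measures(x, y):
    """Hat diagonals, standardized residuals and Cook's distance for y ~ x."""
    n, k = len(x), 1
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(ei ** 2 for ei in e) / (n - k - 1))          # sqrt(MSE)
    h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]        # hat matrix diagonals
    e_star = [ei / (s * sqrt(1 - hi)) for ei, hi in zip(e, h)]
    cook = [es ** 2 / (k + 1) * hi / (1 - hi) for es, hi in zip(e_star, h)]
    return h, e_star, cook

x = [1, 2, 3, 4, 5, 6, 7, 20]                 # hypothetical; last point has high leverage
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 12.0]
h, e_star, cook = influence_measures(x, y)

cutoff = 2 * (1 + 1) / len(x)                 # 2(k+1)/n
for i, (hi, es, di) in enumerate(zip(h, e_star, cook), start=1):
    flags = [name for name, hit in
             (("leverage", hi > cutoff), ("outlier", abs(es) > 2), ("cook", di > 1)) if hit]
    print(i, round(hi, 3), round(es, 2), round(di, 3), flags)
```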
32
33
Plot of standardized residual vs. observation number for air dataset
[Figure: standardized residual (y-axis, -2 to 3) against observation number (x-axis, 0 to 100); points are labeled with observation numbers 1 to 111.]
34
Hat matrix diagonals
[Figure: hat matrix diagonals (y-axis, 0.02 to 0.12) against observation number (x-axis, 0 to 100); points are labeled with observation numbers 1 to 111.]
35
Plot of wind vs. ozone
[Figure: ozone (y-axis, 1 to 5) against wind (x-axis, 5 to 20); points are labeled with observation numbers 1 to 111.]
36
Cook's Distance
[Figure: Cook's Distance (y-axis, 0.0 to 0.14) against observation number (x-axis, 0 to 100); observations 17, 77, and 30 are labeled.]
37
Plot of ozone vs. wind including fitted regression lines with and without observation 30 (simple linear regression)
[Figure: ozone (y-axis, 1 to 5) against wind (x-axis, 5 to 20); points are labeled with observation numbers 1 to 111.]
Remedies for Outliers
Nothing?
Data Transformation?
Remove outliers?
Robust Regression: weighted least squares, $b = (X'WX)^{-1}X'Wy$
Minimize median absolute deviation
38
Collinearity
High correlation among the predictors can cause problems with least squares estimates (wrong signs, low t-values, unexpected results).
If predictors are centered and scaled to unit length, then X'X is the correlation matrix.
Diagonal elements of the inverse of the correlation matrix are called VIFs (variance inflation factors):
$$VIF_j = \frac{1}{1 - R_j^2},$$
where $R_j^2$ is the coefficient of determination for the regression of the jth predictor on the remaining predictors.
39
When $R_j^2 = .90$, VIF is about 10 and caution is advised. (Some authors say VIF = 5.) A large VIF indicates there is redundant information in the explanatory variables.
Why is this called the variance inflation factor?
We can show that
$$Var(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ij} - \bar x_j)^2}\left(\frac{1}{1 - R_j^2}\right) = VIF_j \times Var(\hat\beta_j \text{ in simple regression})$$
Thus $VIF_j$ represents the variance inflation caused by adding all the variables other than $x_j$ to the model.
R Welsch 40
Remedies for collinearity
1. Identify and eliminate redundant variables (large literature
on this).
2. Modified regression techniques
a. ridge regression, $b = (X'X + cI)^{-1}X'y$
3. Regress on orthogonal linear combinations of the
explanatory variables
a. principal components regression
4. Careful variable selection
R Welsch 41
Correlation and inverse of correlation
matrix for air data set.
r<-cor(model.matrix(air.lm)[,-1])
> r
x1 x2 x3
x1 1.0000000 0.2940876 -0.1273656
x2 0.2940876 1.0000000 -0.4971459
X3 -0.1273656 -0.4971459 1.0000000
> solve(r)
x1 x2 x3
x1 1.09524102 -0.3357220 -0.02740677
x2 -0.33572201 1.4312012 0.66875638
x3 -0.02740677 0.6687564 1.32897882
>
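The VIFs are the diagonal of solve(r), and each one can be converted back to the corresponding $R_j^2$. A pure-Python sketch using the air data correlation matrix printed above; the small Gauss-Jordan routine stands in for S-Plus solve() and is an illustrative helper, not a library function.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting, for small dense matrices."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Correlation matrix of x1, x2, x3 from the air data output.
r = [[1.0000000, 0.2940876, -0.1273656],
     [0.2940876, 1.0000000, -0.4971459],
     [-0.1273656, -0.4971459, 1.0000000]]

r_inv = invert(r)
vif = [r_inv[j][j] for j in range(3)]          # variance inflation factors
r_squared = [1 - 1 / v for v in vif]           # R_j^2 = 1 - 1/VIF_j
print([round(v, 4) for v in vif], [round(v, 3) for v in r_squared])
```

All three VIFs are close to 1 here, in contrast to the polynomial mpg model, where they are in the thousands.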
42
Correlation and inverse of correlation
matrix for mpg data set
r<-cor(model.matrix(auto1.lm)[,-1])
> r
wt I(wt^2) I(wt^3)
wt 1.0000000 0.9917756 0.9677228
I(wt^2) 0.9917756 1.0000000 0.9918939
I(wt^3) 0.9677228 0.9918939 1.0000000
solve(r)
wt I(wt^2) I(wt^3)
wt 2000.377 -3951.728 1983.884
I(wt^2) -3951.728 7868.535 -3980.575
I(wt^3) 1983.884 -3980.575 2029.459
43
Variable Selection
We want a parsimonious model: as few variables as possible while still providing reasonable accuracy in predicting y.
Some variables may not contribute much to the model.
SSE never increases when more variables are added to the model; however, MSE = SSE/(n-k-1) may.
Minimum MSE is one possible optimality criterion. However, we must fit all possible subsets ($2^k$ of them) and find the one with minimum MSE.
44
Backward Elimination
1. Fit the full model (with all candidate
predictors).
2. If P-values for all coefficients < α then stop.
3. Delete predictor with highest P-value
4. Refit the model
5. Go to Step 2.
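The loop in steps 1 to 5 can be sketched independently of any particular regression routine. In the Python sketch below, `fit_p_values` is a hypothetical callable that refits the model on the given predictors and returns their P-values; the stub `fake_fit` returns fixed made-up numbers purely so the loop can be run, whereas real P-values change at every refit.

```python
def backward_elimination(predictors, fit_p_values, alpha=0.05):
    """Drop the predictor with the largest P-value until all are below alpha."""
    current = list(predictors)
    while current:
        p_values = fit_p_values(current)                 # steps 1 and 4: (re)fit
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] < alpha:                      # step 2: all significant, stop
            break
        current.remove(worst)                            # step 3: delete highest P-value
    return current

def fake_fit(names):
    """Stand-in for a real model fit; the P-values are invented for illustration."""
    table = {"x1": 0.001, "x2": 0.30, "x3": 0.04, "x4": 0.60}
    return {name: table[name] for name in names}

kept = backward_elimination(["x1", "x2", "x3", "x4"], fake_fit)
print(kept)
```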
45
Logistic Regression
References:
Applied Linear Statistical Models, Neter et al.
Categorical Data Analysis, Agresti
Logistic Regression
Nonlinear regression model when response
variable is qualitative.
2 possible outcomes, success or failure,
diseased or not diseased, present or absent
Examples: CAD (y/n) as a function of age,
weight, gender, smoking history, blood pressure
Smoker or non-smoker as a function of family
history, peer group behavior, income, age
Purchase an auto this year as a function of
income, age of current car, age
E Newton 2
Response Function for Binary Outcome
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$E\{Y_i\} = \beta_0 + \beta_1 X_i$
$P(Y_i = 1) = \pi_i$
$P(Y_i = 0) = 1 - \pi_i$
$E\{Y_i\} = 1(\pi_i) + 0(1 - \pi_i) = \pi_i$
$E\{Y_i\} = \beta_0 + \beta_1 X_i = \pi_i$
E Newton 3
Special Problems when Response is Binary
Constraints on Response Function
$0 \le E\{Y\} = \pi \le 1$
Non-normal Error Terms
When $Y_i = 1$: $\varepsilon_i = 1 - \beta_0 - \beta_1 X_i$
When $Y_i = 0$: $\varepsilon_i = -\beta_0 - \beta_1 X_i$
Non-constant error variance
$Var\{Y_i\} = Var\{\varepsilon_i\} = \pi_i(1 - \pi_i)$
E Newton 4
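Logistic Response Function
$$E\{Y\} = \pi = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$$
$$\pi(1 + \exp(\beta_0 + \beta_1 X)) = \exp(\beta_0 + \beta_1 X)$$
$$\pi + \pi\exp(\beta_0 + \beta_1 X) = \exp(\beta_0 + \beta_1 X)$$
$$\pi = \exp(\beta_0 + \beta_1 X) - \pi\exp(\beta_0 + \beta_1 X)$$
$$\pi = (1 - \pi)\exp(\beta_0 + \beta_1 X)$$
$$\frac{\pi}{1 - \pi} = \exp(\beta_0 + \beta_1 X)$$
$$\log\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 X$$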
E Newton 5
Example of Logistic Response Function
[Figure: probability of CAD (y-axis, 0.0 to 1.0) against Age (x-axis, 0 to 100).]
E Newton 6
Properties of Logistic Response Function
log(π/(1-π)) = logit transformation, log odds
π/(1-π) = odds
Logit ranges from -∞ to ∞ as x varies from -∞ to ∞
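The logistic response function and the logit are inverses of each other, which a short Python sketch makes explicit. The function names are illustrative.

```python
from math import exp, log

def logistic(eta):
    """pi = exp(eta) / (1 + exp(eta)), written in the equivalent form 1 / (1 + exp(-eta))."""
    return 1 / (1 + exp(-eta))

def logit(pi):
    """Log odds: log(pi / (1 - pi)), defined for 0 < pi < 1."""
    return log(pi / (1 - pi))

for eta in (-4, -1, 0, 1, 4):
    pi = logistic(eta)
    print(eta, round(pi, 4), round(pi / (1 - pi), 4), round(logit(pi), 4))
```

Each printed row shows the linear predictor, the probability, the odds, and the logit, which recovers the linear predictor.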
E Newton 7
Likelihood Function
$P(Y_i = 1) = \pi_i$
$P(Y_i = 0) = 1 - \pi_i$
pdf: $f_i(Y_i) = \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i}$, $Y_i = 0, 1$; i = 1, 2, ..., n
Since the $Y_i$ are independent, the joint pdf is:
$$g(Y_1 \dots Y_n) = \prod_{i=1}^{n} f_i(Y_i) = \prod_{i=1}^{n} \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i}$$
$$\log g(Y_1 \dots Y_n) = \sum_{i=1}^{n} \left[Y_i \log\left(\frac{\pi_i}{1 - \pi_i}\right)\right] + \sum_{i=1}^{n} \log(1 - \pi_i)$$
E Newton 8
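Likelihood Function (continued)
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i$$
$$1 - \pi_i = \frac{1}{1 + \exp(\beta_0 + \beta_1 X_i)}$$
$$\log L(\beta_0, \beta_1) = \sum_{i=1}^{n} Y_i(\beta_0 + \beta_1 X_i) - \sum_{i=1}^{n} \log[1 + \exp(\beta_0 + \beta_1 X_i)]$$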
E Newton 9
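Likelihood for Multiple Logistic Regression
$$\log L(\beta) = \sum_i y_i \Big(\sum_j \beta_j x_{ij}\Big) - \sum_i \log\Big[1 + \exp\Big(\sum_j \beta_j x_{ij}\Big)\Big]$$
$$\frac{\partial \log L}{\partial \beta_k} = \sum_i y_i x_{ik} - \sum_i x_{ik}\left[\frac{\exp(\sum_j \beta_j x_{ij})}{1 + \exp(\sum_j \beta_j x_{ij})}\right]$$
Likelihood Equations:
$$\sum_i y_i x_{ik} = \sum_i x_{ik}\left[\frac{\exp(\sum_j \hat\beta_j x_{ij})}{1 + \exp(\sum_j \hat\beta_j x_{ij})}\right], \quad\text{i.e.}\quad X'y = X'\hat y$$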
E Newton 10
Solution of Likelihood Equations
No closed form solution
Use Newton-Raphson algorithm
Iteratively reweighted least squares (IRLS)
Start with OLS solution for β at iteration t = 0, $\beta^0$
$\pi_i^t = 1/(1 + \exp(-X_i'\beta^t))$
$\beta^{(t+1)} = \beta^t + (X'VX)^{-1}X'(y - \pi^t)$
where $V = diag(\pi_i^t(1 - \pi_i^t))$
Usually only takes a few iterations
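A minimal Python sketch of this update for a single predictor (X = [1, x]), so that X'VX is 2 by 2 and can be inverted by hand. Two things are assumptions here: the iteration starts from zero rather than from the OLS solution, and the eight data points are hypothetical.

```python
from math import exp

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson / IRLS for logit(pi) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0                      # assumption: start at zero, not at OLS
    for _ in range(iterations):
        pi = [1 / (1 + exp(-(b0 + b1 * xi))) for xi in x]
        v = [p * (1 - p) for p in pi]      # diagonal of V
        # Elements of X'VX = [[a, b], [b, d]].
        a = sum(v)
        b = sum(vi * xi for vi, xi in zip(v, x))
        d = sum(vi * xi * xi for vi, xi in zip(v, x))
        # Score vector X'(y - pi).
        g0 = sum(yi - p for yi, p in zip(y, pi))
        g1 = sum(xi * (yi - p) for xi, yi, p in zip(x, y, pi))
        # beta <- beta + (X'VX)^{-1} X'(y - pi), using the 2x2 inverse.
        det = a * d - b * b
        b0 += (d * g0 - b * g1) / det
        b1 += (-b * g0 + a * g1) / det
    return b0, b1

x = [1, 2, 3, 4, 5, 6, 7, 8]               # hypothetical predictor values
y = [0, 0, 1, 0, 1, 0, 1, 1]               # hypothetical binary responses (not separable)
b0_hat, b1_hat = fit_logistic(x, y)
print(b0_hat, b1_hat)
```

At convergence the score vector X'(y - π) is zero, which is exactly the set of likelihood equations.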
E Newton 11
Interpretation of logistic regression coefficients
log(π/(1-π)) = Xβ
So each $\beta_j$ is the effect of a unit increase in $X_j$ on the log odds of success, with the values of the other variables held constant.
Odds Ratio = $\exp(\beta_j)$
E Newton 12
Example: Spinal Disease in Children Data
SUMMARY:
The kyphosis data frame has 81 rows representing data on 81 children
who have had corrective spinal surgery. The outcome Kyphosis is a
binary variable, the other three variables (columns) are numeric.
ARGUMENTS:
Kyphosis
a factor telling whether a postoperative deformity (kyphosis) is
"present" or "absent" .
Age
the age of the child in months.
Number
the number of vertebrae involved in the operation.
Start
the beginning of the range of vertebrae involved in the operation.
SOURCE:
John M. Chambers and Trevor J. Hastie, Statistical Models in S,
Wadsworth and Brooks, Pacific Grove, CA 1992, pg. 200.
E Newton 13
Observations 1:16 of kyphosis data set
kyphosis[1:16, ]
   Kyphosis Age Number Start
 1   absent  71      3     5
 2   absent 158      3    14
 3  present 128      4     5
 4   absent   2      5     1
 5   absent   1      4    15
 6   absent   1      2    16
 7   absent  61      2    17
 8   absent  37      3    16
 9   absent 113      2    16
10  present  59      6    12
11  present  82      5    14
12   absent 148      3    16
13   absent  18      5     2
14   absent   1      4    12
16   absent 168      3    18
E Newton 14
Variables in kyphosis
summary(kyphosis)
    Kyphosis         Age              Number            Start
  absent:64    Min.:   1.00     Min.:  2.000     Min.:  1.00
 present:17  1st Qu.:  26.00  1st Qu.:  3.000  1st Qu.:  9.00
              Median:  87.00   Median:  4.000   Median: 13.00
                Mean:  83.65     Mean:  4.049     Mean: 11.49
             3rd Qu.: 130.00  3rd Qu.:  5.000  3rd Qu.: 16.00
                Max.: 206.00     Max.: 10.000     Max.: 18.00
E Newton 15
Scatter plot matrix kyphosis data set
[Figure: scatter plot matrix of Kyphosis (absent/present), Age, Number, and Start.]
E Newton 16
Boxplots of predictors vs. kyphosis
[Figure: three panels showing Age, Number, and Start by Kyphosis (absent, present).]
E Newton 17
Smoothing spline fits, df=3
[Figure: three panels plotting kyp against jitter(age), jitter(num), and jitter(sta), each with a smoothing spline fit.]
E Newton 18
Summary of glm fit
Call: glm(formula = Kyphosis ~ Age + Number + Start,
    family = binomial, data = kyphosis)
Deviance Residuals:
       Min         1Q     Median         3Q     Max
 -2.312363 -0.5484308 -0.3631876 -0.1658653 2.16133
Coefficients:
                  Value Std. Error   t value
(Intercept) -2.03693225 1.44918287 -1.405573
        Age  0.01093048 0.00644419  1.696175
     Number  0.41060098 0.22478659  1.826626
      Start -0.20651000 0.06768504 -3.051043
E Newton 19
Summary of glm fit
Null Deviance: 83.23447 on 80 degrees of freedom
Residual Deviance: 61.37993 on 77 degrees of freedom
Number of Fisher Scoring Iterations: 5
Correlation of Coefficients:
       (Intercept)        Age    Number
   Age  -0.4633715
Number  -0.8480574  0.2321004
 Start  -0.3784028 -0.2849547 0.1107516
E Newton 20
This code was created using S-PLUS(R) Software. S-PLUS(R) is a registered trademark of Insightful Corporation.
Residuals
Response residuals: $y_i - \hat{\pi}_i$
Pearson residuals: $(y_i - \hat{\pi}_i)/\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}$
Deviance residuals: $\mathrm{sign}(y_i - \hat{\pi}_i)\sqrt{-2\log(|1 - y_i - \hat{\pi}_i|)}$
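The three residual types can be checked by hand for a single observation. The sketch below uses a hypothetical helper `binary_residuals` and the first row of the fit-and-residuals table shown later in these slides (Kyphosis absent, so y = 0, with fitted probability 0.257); the deviance residual carries the sign of y minus the fit.

```python
import math

def binary_residuals(y, p):
    """Response, Pearson, and deviance residuals for a 0/1 response y with fit p."""
    rr = y - p
    rp = rr / math.sqrt(p * (1.0 - p))
    # |1 - y - p| equals p when y = 1 and 1 - p when y = 0.
    rd = math.copysign(math.sqrt(-2.0 * math.log(abs(1.0 - y - p))), rr)
    return rr, rp, rd

rr, rp, rd = binary_residuals(0, 0.257)
print(round(rr, 3), round(rp, 3), round(rd, 3))
```

The rounded values agree with the rr, rp, and rd columns for observation 1 in that table.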
E Newton 21
Model Deviance
Deviance of fitted model compares log-likelihood
of fitted model to that of saturated model.
Log likelihood of saturated model=0
$DEV = -2 \sum_{i=1}^{n} [Y_i \log(\hat{\pi}_i) + (1 - Y_i)\log(1 - \hat{\pi}_i)]$
$d_i = \mathrm{sign}(Y_i - \hat{\pi}_i)\,\{-2[Y_i \log(\hat{\pi}_i) + (1 - Y_i)\log(1 - \hat{\pi}_i)]\}^{1/2}$
$DEV = \sum_{i=1}^{n} d_i^2$
E Newton 22
Covariance Matrix
> x <- model.matrix(kyph.glm)
> xvx <- t(x) %*% diag(fi * (1 - fi)) %*% x
> xvx
            (Intercept)         Age     Number      Start
(Intercept)    9.620342    907.8887   43.67401   86.49845
        Age  907.888726 114049.8308 3904.31350 9013.14464
     Number   43.674014   3904.3135  219.95353  378.82849
      Start   86.498450   9013.1446  378.82849 1024.07328
> xvxi <- solve(xvx)
> xvxi
             (Intercept)            Age        Number         Start
(Intercept)  2.101402986 -0.00433216784 -0.2764670205 -0.0370950612
        Age -0.004332168  0.00004155736  0.0003368969 -0.0001244665
     Number -0.276467020  0.00033689690  0.0505664221  0.0016809996
      Start -0.037095061 -0.00012446655  0.0016809996  0.0045833534
> sqrt(diag(xvxi))
[1] 1.44962167 0.00644650 0.22486979 0.06770047
E Newton 23
Change in Deviance resulting from adding
terms to model
> anova(kyph.glm)
Analysis of Deviance Table
Binomial model
Response: Kyphosis
Terms added sequentially (first to last)
       Df Deviance Resid. Df Resid. Dev
  NULL                    80   83.23447
   Age  1  1.30198        79   81.93249
Number  1 10.30593        78   71.62656
 Start  1 10.24663        77   61.37993
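One common way to read such a table is to compare each change in deviance with a chi-squared distribution on the corresponding degrees of freedom (an asymptotic approximation, and not something the S-Plus output itself reports here). For 1 degree of freedom the upper tail probability has a closed form via the complementary error function. The function name `chisq1_upper` is an assumption for this sketch.

```python
import math

def chisq1_upper(x):
    """P(chi-squared with 1 df > x), using P(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Changes in deviance copied from the sequential analysis of deviance table.
deviance_drop = {"Age": 1.30198, "Number": 10.30593, "Start": 10.24663}
p_value = {term: chisq1_upper(d) for term, d in deviance_drop.items()}

for term, p in p_value.items():
    print(f"{term}: approximate p-value = {p:.4f}")
```

Because the terms are added sequentially, each p-value refers to adding that term to the model containing the terms listed above it.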
E Newton 24
Summary for kyphosis model with
age^2 added
Call: glm(formula = Kyphosis ~ poly(Age, 2) + Number
    + Start, family = binomial, data = kyphosis)
Deviance Residuals:
 -2.235654 -0.5124374 -0.245114 -0.06111367 2.354818
Coefficients:
                    Value Std. Error   t value
  (Intercept)  -1.6502939 1.40171048 -1.177343
poly(Age, 2)1   7.3182325 4.66933068  1.567298
poly(Age, 2)2 -10.6509151 5.05858692 -2.105512
       Number   0.4268172 0.23531689  1.813798
        Start  -0.2038329 0.07047967 -2.892080
E Newton 25
Summary of fit with age^2 added
Null Deviance: 83.23447 on 80 degrees of freedom
Residual Deviance: 54.42776 on 76 degrees of freedom
Number of Fisher Scoring Iterations: 5
              (Intercept) poly(Age, 2)1 poly(Age, 2)2    Number
poly(Age, 2)1  -0.2107783
poly(Age, 2)2   0.2497127    -0.0924834
       Number  -0.8403856     0.3070957    -0.0988896
        Start  -0.4918747    -0.2208804     0.0911896 0.0721616
E Newton 26
Analysis of Deviance
> anova(kyph.glm2)
Analysis of Deviance Table
Binomial model
Response: Kyphosis
Terms added sequentially (first to last)
             Df Deviance Resid. Df Resid. Dev
        NULL                    80   83.23447
poly(Age, 2)  2 10.49589        78   72.73858
      Number  1  8.87597        77   63.86261
       Start  1  9.43485        76   54.42776
E Newton 27
Kyphosis data, 16 obs, with fit and residuals
cbind(kyphosis, round(p, 3), round(rr, 3), round(rp, 3), round(rd, 3))[1:16,]
   Kyphosis Age Number Start   fit     rr     rp     rd
 1   absent  71      3     5 0.257 -0.257 -0.588 -0.771
 2   absent 158      3    14 0.122 -0.122 -0.374 -0.511
 3  present 128      4     5 0.493  0.507  1.014  1.189
 4   absent   2      5     1 0.458 -0.458 -0.919 -1.107
 5   absent   1      4    15 0.030 -0.030 -0.175 -0.246
 6   absent   1      2    16 0.011 -0.011 -0.105 -0.148
 7   absent  61      2    17 0.017 -0.017 -0.131 -0.185
 8   absent  37      3    16 0.024 -0.024 -0.157 -0.220
 9   absent 113      2    16 0.036 -0.036 -0.193 -0.271
10  present  59      6    12 0.197  0.803  2.020  1.803
11  present  82      5    14 0.121  0.879  2.689  2.053
12   absent 148      3    16 0.076 -0.076 -0.288 -0.399
13   absent  18      5     2 0.450 -0.450 -0.905 -1.094
14   absent   1      4    12 0.054 -0.054 -0.239 -0.333
16   absent 168      3    18 0.064 -0.064 -0.261 -0.363
17   absent   1      3    16 0.016 -0.016 -0.129 -0.181
E Newton 28
Plot of response residual vs. fit
[Figure: response residual (y - fit) plotted against the fitted value.]
E Newton 29
Plot of deviance residual vs. index
[Figure: deviance residuals of kyph.glm plotted against observation index (0 to 80).]
E Newton 30
Plot of deviance residuals vs. fitted value
[Figure: deviance residuals of kyph.glm2 plotted against fitted(kyph.glm2).]
E Newton 31
Summary of bootstrap for kyphosis model
E Newton 32
Call:
bootstrap(data = kyphosis, statistic = coef(glm(Kyphosis ~
    poly(Age, 2) + Number + Start, family = binomial,
    data = kyphosis)), trace = F)
Number of Replications: 1000
Summary Statistics:
              Observed     Bias     Mean      SE
  (Intercept)  -1.6503 -0.85600  -2.5063  5.1675
poly(Age, 2)1   7.3182  4.33814  11.6564 22.0166
poly(Age, 2)2 -10.6509 -7.48557 -18.1365 37.6780
       Number   0.4268  0.17785   0.6047  0.6823
        Start  -0.2038 -0.07825  -0.2821  0.4593
Empirical Percentiles:
                   2.5%         5%     95%    97.5%
  (Intercept)  -8.52922  -7.247145  1.1760  2.27636
poly(Age, 2)1  -6.13910  -1.352143 27.1515 34.64701
poly(Age, 2)2 -48.86864 -38.993192 -4.9585 -4.13232
       Number  -0.07539  -0.003433  1.4756  1.82754
        Start  -0.58795  -0.470139 -0.1159 -0.08919
Summary of bootstrap (continued)
BCa Confidence Limits:
                  2.5%       5%      95%    97.5%
  (Intercept)  -6.4394  -5.3043  2.39707  3.56856
poly(Age, 2)1 -18.2205 -10.1003 18.34192 21.56654
poly(Age, 2)2 -24.2382 -20.3911 -1.75701 -0.19269
       Number  -0.7653  -0.1694  1.14036  1.27858
        Start  -0.3521  -0.3167 -0.03478  0.01461
Correlation of Replicates:
              (Intercept) poly(Age, 2)1 poly(Age, 2)2  Number   Start
  (Intercept)      1.0000       -0.4204        0.5082 -0.5676 -0.1839
poly(Age, 2)1     -0.4204        1.0000       -0.8475  0.4368 -0.6478
poly(Age, 2)2      0.5082       -0.8475        1.0000 -0.3739  0.5983
       Number     -0.5676        0.4368       -0.3739  1.0000 -0.4174
        Start     -0.1839       -0.6478        0.5983 -0.4174  1.0000
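The columns of the bootstrap summary (Observed, Bias, Mean, SE, and empirical percentiles) can be illustrated with a minimal nonparametric bootstrap. This sketch is plain Python rather than the S-Plus `bootstrap` function, uses a much simpler statistic (the sample mean of the Age values from the first rows of the kyphosis listing), and the helper name `bootstrap_summary` and the seed are assumptions. It does not attempt the BCa limits.

```python
import random

def bootstrap_summary(data, statistic, n_rep=1000, seed=0):
    """Resample rows with replacement and summarize the replicated statistic."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_rep):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(statistic(resample))
    observed = statistic(data)
    mean = sum(reps) / n_rep
    se = (sum((r - mean) ** 2 for r in reps) / (n_rep - 1)) ** 0.5
    reps.sort()
    return {
        "observed": observed,
        "mean": mean,
        "bias": mean - observed,          # Bias = Mean - Observed
        "se": se,
        "p2.5": reps[int(0.025 * n_rep)],   # simple empirical percentiles
        "p97.5": reps[int(0.975 * n_rep)],
    }

ages = [71, 158, 128, 2, 1, 1, 61, 37, 113, 59, 82, 148, 18, 1, 168]
summary = bootstrap_summary(ages, lambda v: sum(v) / len(v))
print(summary)
```

The same relation Bias = Mean - Observed can be read off the S-Plus output, for example -2.5063 - (-1.6503) = -0.8560 for the intercept.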
E Newton 33
Histograms of coefficient estimates
[Figure: density histograms of the bootstrap replicates, one panel each for (Intercept), poly(Age, 2)1, poly(Age, 2)2, Number, and Start.]
E Newton 34
QQ Plots of coefficient estimates
[Figure: QQ plots of the quantiles of replicates, one panel each for (Intercept), poly(Age, 2)1, poly(Age, 2)2, Number, and Start.]
E Newton 35
Regression Review
and Robust Regression
S-Plus Oil City Data Frame
Monthly Excess Returns of Oil City Petroleum, Inc.
Stocks and the Market
SUMMARY:
The oilcity data frame has 129 rows and 2 columns. The
sample runs from April 1979 to December 1989. This
data frame contains the following columns:
VALUE:
Oil
monthly excess returns of Oil City Petroleum, Inc. stocks.
Market
monthly excess returns of the market.
E Newton 2
Oil City Data (continued)
Returns = relative change in the stock price over a one
month interval
Excess returns are computed relative to the monthly
return of a 90-day US Treasury bill at the risk-free rate
Financial economists use least squares to fit a straight
line predicting a particular stock return from the market
return.
Beta= estimated coefficient of the market return.
Measures the riskiness of the stock in terms of standard
deviation and expected returns.
Large beta -> stock is risky compared to market, but also
expected returns from the stock are large.
E Newton 3
Plot of Market returns vs. month
[Figure: oilcity$Market plotted against Month.]
E Newton 4
Plot of Oil City Petroleum return vs. month
[Figure: Oil plotted against month.]
E Newton 5
Histogram of Market Returns
[Figure: histogram of Market.]
E Newton 6
Histogram of Oil City Returns
[Figure: histogram of Oil.]
E Newton 7
Plot of Oil City vs. Market Returns
[Figure: Oil City plotted against Market, with points labeled by observation number (1 to 129).]
E Newton 8
Plot of Oil City vs. Market Returns without
observation 94
[Figure: Oil City plotted against Market with observation 94 removed; points labeled by observation number.]
E Newton 9
> summary(oilcity)
        Oil                   Market
 Min.   :-0.55667260   Min.   :-0.27857020
 1st Qu.:-0.23968330   1st Qu.:-0.10557534
 Median :-0.10049000   Median :-0.07277544
 Mean   :-0.07221215   Mean   :-0.07689209
 3rd Qu.:-0.05821000   3rd Qu.:-0.03973828
 Max.   : 5.19292000   Max.   : 0.07131940
E Newton 10
Summary oil.lm
Call: lm(formula = Oil ~ Market, data = oilcity)
Residuals:
 -0.6952 -0.1732 -0.05444 0.08407 4.842
Coefficients:
             Value Std. Error t value Pr(>|t|)
(Intercept) 0.1474     0.0707  2.0849   0.0391
     Market 2.8567     0.7318  3.9040   0.0002
Residual standard error: 0.4867 on 127 degrees of freedom
Multiple R-Squared: 0.1071
F-statistic: 15.24 on 1 and 127 degrees of freedom, the p-value is 0.0001528
(Intercept)
Market 0.7956
E Newton 11
Plot of residual vs. fit for oil.lm
[Figure: residuals plotted against fitted values (Fitted : Market); observations 65, 79, and 94 are labeled.]
E Newton 12
E Newton 13
Plot of Cook's Distance vs. Index
[Figure: Cook's distance plotted against observation index; observations 43, 65, and 94 are labeled.]
Plot of hat matrix diagonals for oil.lm
[Figure: hat(model.matrix(oil.lm)) plotted against month, with points labeled by observation number.]
E Newton 14
Summary of model without observation 94
Call: lm(formula = Oil ~ Market, data = oilcity94)
Residuals:
 -0.5169 -0.1174 -0.01959 0.06864 0.859
Coefficients:
(Intercept) -0.0247 0.0304 -0.8139 0.4173
     Market  1.1355 0.3137  3.6202 0.0004
is 0.0004249
(Intercept)
Market 0.8061
E Newton 15
Plot of residual vs fit for model without
observation 94
[Figure: residuals plotted against fitted values (Fitted : Market) for the model without observation 94; observations 8, 79, and 105 are labeled.]
E Newton 16
Weighted Least Squares
Used when observations, $y_i$, have unequal variances
$y = X\beta + \epsilon$
$E(\epsilon) = 0$, $Var(\epsilon) = \sigma^2 V$
V is non-singular, positive definite
V is diagonal if errors are uncorrelated
V is always symmetric
There is an nxn non-singular symmetric matrix R such that $R'R = RR = V$
R is sometimes called the square root of V
E Newton 17
Weighted least squares (continued)
Define new variables:
$y_* = R^{-1}y$, $X_* = R^{-1}X$, $\epsilon_* = R^{-1}\epsilon$
$y = X\beta + \epsilon$ becomes
$R^{-1}y = R^{-1}X\beta + R^{-1}\epsilon$, or
$y_* = X_*\beta + \epsilon_*$
$E(\epsilon_*) = R^{-1}E(\epsilon) = 0$
E Newton 18
Weighted least squares (continued)
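$Var(\epsilon_*) = E\{[\epsilon_* - E(\epsilon_*)][\epsilon_* - E(\epsilon_*)]'\}$
$= E(\epsilon_*\epsilon_*')$
$= E(R^{-1}\epsilon\epsilon'R^{-1})$
$= R^{-1}E(\epsilon\epsilon')R^{-1}$
$= \sigma^2 R^{-1}VR^{-1}$
$= \sigma^2 R^{-1}RRR^{-1}$
$= \sigma^2 I$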
E Newton 19
Weighted Least Squares (continued)
$Q(\beta) = \epsilon_*'\epsilon_* = (y - X\beta)'W(y - X\beta)$, $W = V^{-1}$ = weights
Least squares normal equations are $(X'WX)\hat{\beta} = X'Wy$
The solution is: $\hat{\beta} = (X'WX)^{-1}X'Wy$
$Var(\hat{\beta}) = (X'WX)^{-1}X'W\,Var(y)\,WX(X'WX)^{-1}$
$= \sigma^2 (X'WX)^{-1}X'WW^{-1}WX(X'WX)^{-1}$
$= \sigma^2 (X'WX)^{-1}$
E Newton 20
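The weighted least squares derivation can be checked numerically for a straight-line fit with a diagonal V. The sketch below (plain Python, hypothetical data and helper names) computes the coefficients twice: once from the normal equations (X'WX)b = X'Wy with W = V^{-1}, and once by ordinary least squares on the transformed variables obtained by multiplying by R^{-1}, where R is the diagonal square root of V. The two answers should agree.

```python
import math

def solve2(a00, a01, a11, g0, g1):
    """Solve the symmetric 2x2 system [[a00, a01], [a01, a11]] b = [g0, g1]."""
    det = a00 * a11 - a01 * a01
    return (a11 * g0 - a01 * g1) / det, (a00 * g1 - a01 * g0) / det

def wls_direct(x, y, w):
    """Coefficients from (X'WX) b = X'Wy with X = [1, x] and diagonal W."""
    a00 = sum(w)
    a01 = sum(wi * xi for wi, xi in zip(w, x))
    a11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    g0 = sum(wi * yi for wi, yi in zip(w, y))
    g1 = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    return solve2(a00, a01, a11, g0, g1)

def ols_on_transformed(x, y, v):
    """OLS of y* = R^{-1} y on X* = R^{-1} [1, x], where R = diag(sqrt(v))."""
    r_inv = [1.0 / math.sqrt(vi) for vi in v]
    c0 = r_inv[:]                                   # R^{-1} times the column of ones
    c1 = [ri * xi for ri, xi in zip(r_inv, x)]      # R^{-1} times x
    ys = [ri * yi for ri, yi in zip(r_inv, y)]      # R^{-1} times y
    a00 = sum(c * c for c in c0)
    a01 = sum(p * q for p, q in zip(c0, c1))
    a11 = sum(c * c for c in c1)
    g0 = sum(p * q for p, q in zip(c0, ys))
    g1 = sum(p * q for p, q in zip(c1, ys))
    return solve2(a00, a01, a11, g0, g1)

# Hypothetical data; v holds the diagonal of V (relative variances).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.3]
v = [1.0, 1.0, 4.0, 4.0, 9.0]
b_direct = wls_direct(x, y, [1.0 / vi for vi in v])
b_transformed = ols_on_transformed(x, y, v)
```

Observations with larger variance receive smaller weight 1/v, so they pull the fitted line less than they would under ordinary least squares.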
Robust Regression
Used to reduce influence of outliers
LAR Regression: minimize $L_1 = \sum_{i=1}^{n} |y_i - x_i'\beta| = \sum_{i=1}^{n} |e_i|$
LMS Regression: minimize $\mathrm{median}\{[y_i - x_i'\beta]^2\} = \mathrm{median}\{e_i^2\}$
M estimators: minimize $\sum_{i=1}^{n} g(y_i - x_i'\beta) = \sum_{i=1}^{n} g(e_i)$, g a function of residuals
E Newton 21
Robust Regression (continued)
IRLS, iteratively reweighted least squares
Minimize $e'We$
W is a diagonal matrix of weights, inversely proportional to the magnitude of the scaled residuals, $u_i$
$u_i = e_i/s$, $s = \mathrm{MAD} = \mathrm{median}\{|e_i - \mathrm{median}(e_i)|\}$
Procedure:
1. Obtain initial coefficient estimates from OLS
2. Obtain weights from scaled residuals
3. Obtain coefficient estimates from WLS
4. Return to 2.
Convergence usually rapid.
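The procedure above can be sketched for a straight-line fit. Two choices in this sketch are assumptions rather than anything stated on the slide: the weight function is the Huber form (weight 1 when |u| is at most c, otherwise c/|u|, with c = 1.345), and the MAD is multiplied by 1.4826 as in the coefficient-table code for oil.rreg elsewhere in these slides. The data and function names are hypothetical.

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

def wls(x, y, w):
    """Weighted least squares for y = b0 + b1 * x with diagonal weights w."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx
    return (swxx * swy - swx * swxy) / det, (sw * swxy - swx * swy) / det

def robust_irls(x, y, c=1.345, max_iter=50, tol=1e-8):
    w = [1.0] * len(x)
    b0, b1 = wls(x, y, w)                      # 1. initial estimates from OLS
    for _ in range(max_iter):
        e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        med = median(e)
        s = 1.4826 * median([abs(ei - med) for ei in e])
        if s == 0.0:
            break
        # 2. weights from scaled residuals u = e / s (assumed Huber form)
        w = [1.0 if abs(ei / s) <= c else c / abs(ei / s) for ei in e]
        new_b0, new_b1 = wls(x, y, w)          # 3. coefficients from WLS
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:                          # 4. otherwise return to step 2
            break
    return b0, b1, w

# Hypothetical data: a line with slope 0.5 plus small noise and one outlier.
x = [float(i) for i in range(10)]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, 10.0, -0.1, 0.05, -0.05, 0.1]
y = [2.0 + 0.5 * xi + ni for xi, ni in zip(x, noise)]
ols_b0, ols_b1 = wls(x, y, [1.0] * len(x))
rob_b0, rob_b1, weights = robust_irls(x, y)
```

In this example the outlying observation (index 5) ends up with by far the smallest weight, and the robust slope sits closer to 0.5 than the OLS slope does.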
E Newton 22
(See Figure 10.4, and Equations 10.44 and 10.45 in Neter et al.
Applied Linear Statistical Models.)
Plot of residuals in oil.rreg
[Figure: oil.rreg$resid plotted against observation index.]
E Newton 24
Plot of weights in robust regression for oil city data set
[Figure: weights plotted against Month, with points labeled by observation number.]
E Newton 25
Plot of sqrt(weights)*resid/s in oil.rreg
[Figure: sqrt(weights)*resid/s for oil.rreg plotted against observation index.]
E Newton 26
Coefficient table for oil.rreg
> x <- cbind(1, Market)
> beta <- solve(t(x) %*% diag(w) %*% x) %*% t(x) %*% diag(w) %*% Oil
> r <- Oil - x %*% beta
> s <- median(abs(r - median(r))) * 1.4826
> covm <- solve(t(x) %*% diag(w) %*% x) * s^2
> se <- sqrt(diag(covm))
> tvalue <- beta/se
> prob <- 2 * (1 - pt(abs(tvalue), 127))
> cbind(beta, se, tvalue, prob)
                   beta         se    tvalue         prob
(Intercept) -0.06779903 0.02451469 -2.765649 0.0065285939
          x  0.89895511 0.24902845  3.609849 0.0004394276
Covariance matrix is approximate.
E Newton 27
Plots of fitted regression lines for oil city data
[Figure: Oil plotted against Market, with points labeled by observation number and the fitted lines from oil.lm, oil.lm94, and oil.rreg.]
E Newton 28
Least Trimmed Squares Regression
Minimizes: $\sum_{i=1}^{q} e_{(i)}^2$, the sum of the q smallest squared residuals, where q is chosen to be between n/2 and n
Based on a genetic algorithm for finding a
subset of data with minimum SSE.
High breakdown point: fits the bulk of the
data well, even if bulk is only a little more
than half the data.
Resulting weights are 1 or 0
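The least trimmed squares criterion itself is easy to evaluate for a candidate line, even though finding its minimizer requires a search (the slide mentions a genetic algorithm, which is not reproduced here). The sketch below uses hypothetical data lying exactly on a line except for one gross outlier, and hypothetical function names.

```python
def lts_objective(x, y, b0, b1, q):
    """Sum of the q smallest squared residuals for the line y = b0 + b1 * x."""
    squared = sorted((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return sum(squared[:q])

def ols_line(x, y):
    """Ordinary least squares intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1 + 2 * xi for xi in x]
y[3] = 40                      # one gross outlier
q = 6                          # between n/2 = 4 and n = 8

ols_b0, ols_b1 = ols_line(x, y)
obj_at_true_line = lts_objective(x, y, 1, 2, q)
obj_at_ols_line = lts_objective(x, y, ols_b0, ols_b1, q)
```

The line through the bulk of the data has a trimmed objective of zero because the outlier is among the residuals that are dropped, whereas the OLS line, pulled toward the outlier, does worse on the same criterion.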
E Newton 29
E Newton 30
> summary(oil.lts)
Method:
[1] "Least Trimmed Squares Robust Regression."
Call:
ltsreg(formula = Oil ~ Market)
Coefficients:
Intercept  Market
  -0.0864  0.7907
Scale estimate of residuals: 0.1468
Robust Multiple R-Squared: 0.09863
Total number of observations: 129
Number of observations that determine the LTS estimate: 116
Residuals:
   Min. 1st Qu. Median 3rd Qu.  Max.
 -0.454  -0.088  0.032   0.097 5.223
Weights:
  0   1
 10 119
Single Factor ANOVA Models
Tamhane and Dunlop
(Johns Hopkins University).
1
Chapter 8: How to compare two treatments
Chapter 12: How to compare more than two
treatments (or just two).
Example: yields of several varieties of barley.
Variety is the treatment factor (predictor)
Yield is the response
2
Experimental Designs
3
S-Plus barley data set (observations 13:30)
> barley.small
yield variety year site
13 35.13333 Svansota 1931 University Farm
14 47.33333 Svansota 1931 Waseca
15 25.76667 Svansota 1931 Morris
16 40.46667 Svansota 1931 Crookston
17 29.66667 Svansota 1931 Grand Rapids
18 25.70000 Svansota 1931 Duluth
19 39.90000 Velvet 1931 University Farm
20 50.23333 Velvet 1931 Waseca
21 26.13333 Velvet 1931 Morris
22 41.33333 Velvet 1931 Crookston
23 23.03333 Velvet 1931 Grand Rapids
24 26.30000 Velvet 1931 Duluth
25 36.56666 Trebi 1931 University Farm
26 63.83330 Trebi 1931 Waseca
27 43.76667 Trebi 1931 Morris
28 46.93333 Trebi 1931 Crookston
29 29.76667 Trebi 1931 Grand Rapids
30 33.93333 Trebi 1931 Duluth
4
Completely Randomized Design Notation
If the sample sizes are equal, the design is balanced; otherwise the design is unbalanced.
See Table 12.1, page 458 in the course textbook.
$N = \sum_{i=1}^{a} n_i$
5
S-Plus barley dataset
(observations 13:30)
Variety Svansota Velvet Trebi
35.13333 39.90000 36.56666
47.33333 50.23333 63.83330
25.76667 26.13333 43.76667
40.46667 41.33333 46.93333
29.66667 23.03333 29.76667
25.70000 26.30000 33.93333
Variety Mean 34.01111 34.48889 42.46666
6
Plot of yield by variety for S-Plus
barley data set
[Figure: barley.small$yield plotted by barley.small$variety (Svansota, Velvet, Trebi).]
7
8
S-plus plot.design function
[Figure: two plot.design panels showing the mean of yield and the median of yield for each level of variety (Svansota, Velvet, Trebi).]
CRD: Model and Estimation (cell means model)
See Section 12.1.1 and Figure 12.2 on
page 460 of the course textbook.
9
CRD: Treatment Effects Model
Alternative Formulation of the Model:
Formula from 12.1.1, page 460 in the course textbook.
$Y_{ij} = \mu + \tau_i + \epsilon_{ij}$ $(i = 1, 2, \ldots, a;\ j = 1, 2, \ldots, n_i)$
10
CRD parameter estimates
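$\mu$ = grand mean, estimated by $\bar{y} = (1'y)/n$, where n here is the total number of observations
$\mu_i$ = mean of $i^{th}$ treatment, estimated by $\bar{y}_i = (1'y_i)/n_i$
$\hat{y}$ = vector of fitted values = treatment means
e = error = $y - \hat{y}$
$\sigma^2$ estimated by $s^2 = e'e/(n - a)$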
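These estimates can be reproduced by hand for the three-variety barley subset (observations 13:30) listed in these slides. The sketch below is plain Python with illustrative variable names; it computes the grand mean, the treatment means (which are the fitted values), the residuals, and s^2.

```python
# Yields for observations 13:30 of the barley data, grouped by variety.
yields = {
    "Svansota": [35.13333, 47.33333, 25.76667, 40.46667, 29.66667, 25.70000],
    "Velvet":   [39.90000, 50.23333, 26.13333, 41.33333, 23.03333, 26.30000],
    "Trebi":    [36.56666, 63.83330, 43.76667, 46.93333, 29.76667, 33.93333],
}

n = sum(len(v) for v in yields.values())      # total number of observations
a = len(yields)                               # number of treatments
grand_mean = sum(sum(v) for v in yields.values()) / n
treatment_mean = {k: sum(v) / len(v) for k, v in yields.items()}

# Fitted values are the treatment means, so residuals are y minus its group mean.
residuals = {k: [yi - treatment_mean[k] for yi in v] for k, v in yields.items()}
sse = sum(e * e for v in residuals.values() for e in v)   # e'e
s2 = sse / (n - a)                                        # estimate of sigma^2

print(round(grand_mean, 4), {k: round(m, 5) for k, m in treatment_mean.items()})
print(round(sse, 3), round(s2, 4))
```

The treatment means match the Variety Mean row of the barley table, and s^2 is the residual mean square that appears in the ANOVA output for this subset.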
11
Fitted values and residuals for barley example
> cbind(barley.small[,1:2],fitted(tmp),resid(tmp))
yield variety fitted resid
13 35.13333 Svansota 34.01111 1.122218
14 47.33333 Svansota 34.01111 13.322218
15 25.76667 Svansota 34.01111 -8.244442
16 40.46667 Svansota 34.01111 6.455558
17 29.66667 Svansota 34.01111 -4.344442
18 25.70000 Svansota 34.01111 -8.311112
19 39.90000 Velvet 34.48889 5.411113
20 50.23333 Velvet 34.48889 15.744443
21 26.13333 Velvet 34.48889 -8.355557
22 41.33333 Velvet 34.48889 6.844443
23 23.03333 Velvet 34.48889 -11.455557
24 26.30000 Velvet 34.48889 -8.188887
25 36.56666 Trebi 42.46666 -5.900000
26 63.83330 Trebi 42.46666 21.366640
27 43.76667 Trebi 42.46666 1.300010
28 46.93333 Trebi 42.46666 4.466670
29 29.76667 Trebi 42.46666 -12.699990
30 33.93333 Trebi 42.46666 -8.533330
12
X matrix?
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 1 0 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 1 0
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
1 0 0 1
13
Model.matrix in S-Plus
> round(model.matrix(barley.small.aov),3)
(Intercept) variety.L variety.Q
13 1 -0.707 0.408
14 1 -0.707 0.408
15 1 -0.707 0.408
16 1 -0.707 0.408
17 1 -0.707 0.408
18 1 -0.707 0.408
19 1 0.000 -0.816
20 1 0.000 -0.816
21 1 0.000 -0.816
22 1 0.000 -0.816
23 1 0.000 -0.816
24 1 0.000 -0.816
25 1 0.707 0.408
26 1 0.707 0.408
27 1 0.707 0.408
28 1 0.707 0.408
29 1 0.707 0.408
30 1 0.707 0.408
14
Model Coefficients
15
> summary.lm(barley.small.aov)
Call: aov(formula = yield ~ variety, data = barley.small)
Residuals:
-12.7 -8.294 -1.611 6.194 21.37
Coefficients:
(Intercept) 36.9889 2.5207 14.6741 0.0000
variety.L 5.9790 4.3660 1.3695 0.1910
variety.Q 3.0619 4.3660 0.7013 0.4939
F-statistic: 1.184 on 2 and 15 degrees of freedom, the p-value is 0.3332
(Intercept) variety.L
variety.L 0
variety.Q 0 0
S-plus model.tables command gives
treatment means or effects
> model.tables(barley.small.aov,type="mean")
Warning messages:
Model was refit to allow projection in: model.tables(tmp, type =
"mean")
Tables of means
Grand mean
36.989
variety
34.011 34.489 42.467
16
S-plus model.tables command gives
treatment means or effects
> model.tables(barley.small.aov)
Warning messages:
Model was refit to allow projection in:
model.tables(barley.small.aov)
Tables of effects
variety
-2.9778 -2.5000 5.4778
17
Analysis of Variance (ANOVA)
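Homogeneity Hypothesis:
$H_0: \mu_1 = \mu_2 = \ldots = \mu_a$ vs. $H_1$: Not all the $\mu_i$ are equal.
$H_0: \tau_1 = \tau_2 = \ldots = \tau_a = 0$ vs. $H_1$: At least some $\tau_i \neq 0$.
Note SSR = SSA = treatment sum of squares
Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F
Treatments (A) | $SSA = \sum_i n_i(\bar{y}_i - \bar{y})^2$ | $a - 1$ | $MSA = SSA/(a-1)$ | $MSA/MSE$
Error (E) | $SSE = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$ | $N - a$ | $MSE = SSE/(N-a)$ |
Total (T) | $SST = \sum_i \sum_j (y_{ij} - \bar{y})^2$ | $N - 1$ | |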
18
ANOVA table for model with 3 varieties of barley, year 1
> summary(aov(yield~variety,barley.small))
variety 2 270.739 135.3694 1.183614 0.3332005
Residuals 15 1715.544 114.3696
ANOVA table for model with all 10 varieties of barley,
year 1
> summary(aov(yield~variety,barley1))
variety 9 646.262 71.8069 0.5963671 0.793823
Residuals 50 6020.357 120.4071
>
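As a quick arithmetic check on the three-variety table, the mean squares and F value follow directly from the sums of squares and degrees of freedom (a = 3 varieties, N = 18 observations):

```latex
MSA = \frac{SSA}{a-1} = \frac{270.739}{2} \approx 135.37, \qquad
MSE = \frac{SSE}{N-a} = \frac{1715.544}{15} \approx 114.37, \qquad
F = \frac{MSA}{MSE} \approx \frac{135.37}{114.37} \approx 1.184
```

This agrees with the F value of 1.183614 printed in the S-Plus output.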
19
F-statistic for One-way ANOVA
$F = \frac{MSA}{MSE} \sim F_{a-1,\,N-a}$ under $H_0$
$E(MSE) = \sigma^2$
$E(MSA) = \sigma^2 + \frac{\sum_{i=1}^{a} n_i \tau_i^2}{a-1}$
20
Fitting model with continuous vs.
character predictor
> summary(aov(barley.small$yield~varnum))
varnum 1 214.489 214.4889 1.93692 0.1830502
Residuals 16 1771.794 110.7371
> summary(aov(barley.small$yield~as.factor(varnum)))
as.factor(varnum) 2 270.739 135.3694 1.183614 0.3332005
Residuals 15 1715.544 114.3696
21
Equivalence of T test and ANOVA for
model with single factor with 2 levels
> t.test(y[1:6],y[7:12])
data: y[1:6] and y[7:12]
t = -1.194, df = 10, p-value = 0.26
-22.864726 6.909179
sample estimates:
mean of x mean of y
34.48889 42.46666
> summary(aov(yield~variety,barley.vsmall))
variety 1 190.935 190.9346 1.425727 0.2600178
Residuals 10 1339.209 133.9209
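The equivalence can be seen numerically in the output above: with a single two-level factor, the square of the two-sample t statistic equals the ANOVA F statistic, and the two p-values coincide.

```latex
t^2 = (-1.194)^2 \approx 1.4256 \approx F = 1.425727, \qquad
p_t = 0.26 \approx p_F = 0.2600178
```

The small discrepancy in the fourth decimal place comes only from the rounding of t in the printed output.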
22
23
Model Diagnostics, residual vs. fitted value
(all 10 varieties, year 1)
[Figure: resid(barley1.aov) plotted against fitted(barley1.aov).]
24
Model Diagnostics, residual vs. observation number
[Figure: resid(barley1.aov) plotted against observation number (0 to 60).]
Model Diagnostics, normal plot of residuals
[Figure: normal plot of resid(barley1.aov).]
26
Model Diagnostics, histogram of residuals
[Figure: histogram of resid(barley1.aov).]
Random Effects Model for a One-way Layout
When the treatment levels are determined by the experimenter
(or those are the only levels of interest), the design is a fixed
effects model.
Goal is to measure the treatment effects or means (pick
the winner).
When the treatment levels are a random sample from a
population of possible treatment levels (e.g. workers in a factory)
and the particular levels used in the experiment are not of any
interest, the design is a random effects model.
Goal is to measure the treatment variability (estimate the
expected variability among workers).
27
Random Effects Model for a One-way Layout
Model: $Y_{ij} = \mu_i + \epsilon_{ij} = \mu + \tau_i + \epsilon_{ij}$ (looks similar to the fixed effects model), where
$\epsilon_{ij} \sim N(0, \sigma^2)$
$\mu_i \sim N(\mu, \sigma_A^2)$ or $\tau_i \sim N(0, \sigma_A^2)$ (constants in fixed effects model)
$Var(Y_{ij}) = Var(\tau_i) + Var(\epsilon_{ij}) = \sigma_A^2 + \sigma^2$
$\sigma_A^2$ = variance among, $\sigma^2$ = variance within
With balanced one-way layout, n observations per treatment:
$E(MSE) = \sigma^2$
$E(MSA) = \sigma^2 + n\sigma_A^2$
Can estimate $\sigma_A^2$ as (MSA-MSE)/n (if you are lucky!)
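The "if you are lucky" caveat is that this estimator can come out negative. Purely as an arithmetic illustration, and assuming for the sake of the example that the 10 barley varieties were treated as a random sample of varieties (the slides analyze them as fixed), the one-way ANOVA output for all 10 varieties gives MSA = 71.8069 and MSE = 120.4071 with n = 6 observations per variety:

```latex
\hat{\sigma}_A^2 = \frac{MSA - MSE}{n} = \frac{71.8069 - 120.4071}{6} \approx -8.10
```

A negative value is not a valid variance; it arises here because MSA is smaller than MSE, and in practice such an estimate is usually reported as zero.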
28
Randomized Block Design
See Figure 3.2 on page 99 of the course textbook.
29
Barley Example
10 varieties, 6 sites
> ym
University Farm Waseca Morris Crookston Grand Rapids Duluth Variety Mean
Manchuria 27.00000 48.86667 27.43334 39.93333 32.96667 28.96667 34.19445
Glabron 43.06666 55.20000 28.76667 38.13333 29.13333 29.66667 37.32778
Svansota 35.13333 47.33333 25.76667 40.46667 29.66667 25.70000 34.01111
Velvet 39.90000 50.23333 26.13333 41.33333 23.03333 26.30000 34.48889
Trebi 36.56666 63.83330 43.76667 46.93333 29.76667 33.93333 42.46666
No. 457 43.26667 58.10000 28.70000 45.66667 32.16667 33.60000 40.25000
No. 462 36.60000 65.76670 30.36667 48.56666 24.93334 28.10000 39.05556
Peatland 32.76667 48.56666 29.86667 41.60000 34.70000 32.00000 36.58333
No. 475 24.66667 46.76667 22.60000 44.10000 19.70000 33.06666 31.81667
Wisconsin No. 38 39.30000 58.80000 29.46667 49.86667 34.46667 31.60000 40.58333
Site Mean 35.82667 54.34667 29.28667 43.66000 29.05334 30.29333 37.07778
30
Randomized Block Design (RBD) Method
$Y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij}$ $(i = 1, \ldots, a;\ j = 1, \ldots, b)$
$\sum_{i=1}^{a} \tau_i = 0$: a-1 independent treatment effects
$\sum_{j=1}^{b} \beta_j = 0$: b-1 independent block effects
For more information, see 12.4, page 482 in course
textbook.
31
No Interactions Between Treatments and Blocks
$\mu_{ij} - \mu_{i'j} = (\mu + \tau_i + \beta_j) - (\mu + \tau_{i'} + \beta_j) = \tau_i - \tau_{i'}$
Formula from page 483 in the course textbook.
32
RBD: Sums of Squares
See formulas 12.17, 12.18,
and 12.19 on pages 484-5
in the course textbook.
33
ANOVA tables for models for barley
data set
> summary(aov(yield~variety,barley1))
variety 9 646.262 71.8069 0.5963671 0.793823
Residuals 50 6020.357 120.4071
> summary(aov(yield~variety+site,barley1))
variety 9 646.262 71.807 3.67995 0.001612103
site 5 5142.272 1028.454 52.70610 0.000000000
Residuals 45 878.085 19.513
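The two tables can be reconciled by arithmetic. Adding site as a blocking factor leaves the variety sum of squares unchanged and splits the one-way residual sum of squares into a site part and a smaller residual:

```latex
6020.357 = 5142.272 + 878.085, \qquad 50 = 5 + 45
```

The variety F statistic therefore changes only through its denominator:

```latex
F_{\text{one-way}} = \frac{71.8069}{120.4071} \approx 0.596, \qquad
F_{\text{RBD}} = \frac{71.807}{19.513} \approx 3.680
```

Removing the large site-to-site variation from the error term is what makes the variety differences detectable.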
34
Type 1 and Type 3 Sums of Squares
for barley example (balanced design)
> summary(barley12.aov)
variety 9 646.262 71.807 3.67995 0.001612103
site 5 5142.272 1028.454 52.70610 0.000000000
Residuals 45 878.085 19.513
> summary(barley12.aov,ssType=3)
Type III Sum of Squares
variety 9 646.262 71.807 3.67995 0.001612103
site 5 5142.272 1028.454 52.70610 0.000000000
Residuals 45 878.085 19.513
35
Degrees of Freedom
36
Effects in barley model
> model.tables(barley12.aov,type="effects")
Warning messages:
Model was refit to allow projection in: model.tables(barley12.aov, type = "effects")
Tables of effects
variety
 Svanso No. 462   Manch No. 475  Velvet  Peatla Glabron No. 457 Wisc No. 38  Trebi
-3.0667  1.9778 -2.8833 -5.2611 -2.5889 -0.4944  0.2500  3.1722      3.5056 5.3889
site
Grand Rapids Duluth University Farm Morris Crookston Waseca
      -8.024 -6.784          -1.251 -7.791     6.582 17.269
37
Analysis of Multifactor
Experiments
Corresponds to Chapter 13
of Tamhane and Dunlop
with some slides by Jacqueline Telford
(Johns Hopkins University)
1
Analysis of Multifactor Experiments
textbook.)
2
Model and estimates
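$y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \epsilon_{ijk}$
$\hat{\mu} = \bar{y}_{...}$
$\hat{\tau}_i = \bar{y}_{i..} - \bar{y}_{...}$
$\hat{\beta}_j = \bar{y}_{.j.} - \bar{y}_{...}$
$\widehat{(\tau\beta)}_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}$
$\hat{y}_{ijk} = \bar{y}_{ij.}$
$e_{ijk} = y_{ijk} - \hat{y}_{ijk} = y_{ijk} - \bar{y}_{ij.}$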
3
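The two-factor estimates can be worked out directly for the balanced puromycin data tabulated later in these slides (two states, six concentrations, two replicates per cell). This is a plain Python sketch with illustrative names, not the S-Plus aov fit.

```python
# Velocities copied from the balanced puromycin table (two replicates per cell).
vel = {
    "treated": {0.02: [76, 47], 0.06: [97, 107], 0.11: [123, 139],
                0.22: [159, 152], 0.56: [191, 201], 1.10: [207, 200]},
    "untreated": {0.02: [67, 51], 0.06: [84, 86], 0.11: [98, 115],
                  0.22: [131, 124], 0.56: [144, 158], 1.10: [160, 162]},
}

def mean(values):
    return sum(values) / len(values)

states = list(vel)
concs = list(vel["treated"])

grand = mean([y for s in states for c in concs for y in vel[s][c]])
state_mean = {s: mean([y for c in concs for y in vel[s][c]]) for s in states}
conc_mean = {c: mean([y for s in states for y in vel[s][c]]) for c in concs}
cell_mean = {(s, c): mean(vel[s][c]) for s in states for c in concs}

# Effect estimates following the formulas on the slide.
tau = {s: state_mean[s] - grand for s in states}
beta = {c: conc_mean[c] - grand for c in concs}
taubeta = {(s, c): cell_mean[(s, c)] - state_mean[s] - conc_mean[c] + grand
           for s in states for c in concs}

# Fitted values are the cell means; residuals are deviations from them.
resid = {(s, c): [y - cell_mean[(s, c)] for y in vel[s][c]]
         for s in states for c in concs}
```

The grand mean, marginal means, and cell means computed this way can be compared with the model.tables output for the puromycin model.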
For any model
y = vector of observed response values
$\hat{y}$ = vector of fitted values
$\bar{y}$ = vector of grand mean
$SST = SSTotal = (y - \bar{y})'(y - \bar{y})$
$SSM = SSModel = (\hat{y} - \bar{y})'(\hat{y} - \bar{y})$
$SSE = SSError = (y - \hat{y})'(y - \hat{y})$
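For a least squares fit that includes an intercept, these quantities satisfy SST = SSM + SSE. As an arithmetic illustration using the puromycin ANOVA output in these slides (model with interaction), the model sum of squares is the sum of the state, conc, and state:conc lines:

```latex
SSM = 4240.04 + 44243.71 + 1270.71 = 49754.46, \qquad
SSE = 1096.50, \qquad
SST = SSM + SSE = 50850.96
```

The overall F statistic reported for that model follows from the same numbers:

```latex
F = \frac{SSM/11}{SSE/12} = \frac{4523.13}{91.375} \approx 49.5
```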
4
Biochemical Reactions of Cells Treated with
Puromycin
SUMMARY:
The Balanced Puromycin data frame has 24 rows
representing the measurement of initial velocity of a
biochemical reaction for 6 different concentrations of
substrate and two different cell treatments. This data
frame contains the following variables (columns):
ARGUMENTS:
conc
the concentration of the substrate.
vel
the initial velocity of the reaction.
state
a factor telling whether the cells involved were treated or
untreated.
5
6
Scatterplot matrix for puromycin data set
[Figure: scatterplot matrix of conc, state (untreated/treated), and vel.]
7
plot.factor(conc,vel)
[Figure: vel plotted for each level of f(conc): 0.02, 0.06, 0.11, 0.22, 0.56, 1.1.]
8
plot.factor(state,vel)
[Figure: vel plotted for each level of state: untreated, treated.]
Velocity in Balanced puromycin data set
conc   treated     untreated
0.02    76   47     67   51
0.06    97  107     84   86
0.11   123  139     98  115
0.22   159  152    131  124
0.56   191  201    144  158
1.10   207  200    160  162
9
Histogram of velocity
[Figure: histogram of vel.]
10
interaction.plot(pyb$state,pyb$conc,pyb$vel)
[Figure: interaction plot of the mean of pyb$vel against pyb$state (untreated, treated), with one line per level of pyb$conc (1.1, 0.56, 0.22, 0.11, 0.06, 0.02).]
12
interaction.plot(pyb$conc,pyb$state,pyb$vel)
[Figure: interaction plot of the mean of pyb$vel against pyb$conc (0.02, 0.06, 0.11, 0.22, 0.56, 1.1), with one line per level of pyb$state (treated, untreated).]
Summaries of puromycin model
Residuals:
 -14.5 -5 -4.441e-016 5 14.5
F-statistic: 49.5 on 11 and 12 degrees of freedom, the p-value is 2.919e-008
           Df Sum of Sq  Mean Sq  F Value      Pr(F)
     state  1   4240.04 4240.042 46.40264 0.00001871
      conc  5  44243.71 8848.742 96.83985 0.00000000
state:conc  5   1270.71  254.142  2.78130 0.06803651
 Residuals 12   1096.50   91.375
13
Observed velocity and fitted values for
puromycin model with interaction
        Observed               Fitted Values
conc   treated   untreated    treated        untreated
0.02    76   47    67   51     61.5   61.5    59.0   59.0
0.06    97  107    84   86    102.0  102.0    85.0   85.0
0.11   123  139    98  115    131.0  131.0   106.5  106.5
0.22   159  152   131  124    155.5  155.5   127.5  127.5
0.56   191  201   144  158    196.0  196.0   151.0  151.0
1.10   207  200   160  162    203.5  203.5   161.0  161.0
14
model.tables
Tables of means
Grand mean
 128.29
state
 untreated treated
    115.00  141.58
conc
   0.02   0.06   0.11   0.22   0.56    1.1
  60.25  93.50 118.75 141.50 173.50 182.25
state:conc
Dim 1 : state
Dim 2 : conc
           0.02  0.06  0.11  0.22  0.56   1.1
untreated  59.0  85.0 106.5 127.5 151.0 161.0
  treated  61.5 102.0 131.0 155.5 196.0 203.5
15
multicomp(pyb.aov,focus=concf)
95 % simultaneous confidence intervals for specified
linear combinations, by the Tukey method
critical point: 3.3595
response variable: vel
intervals excluding 0 are flagged by '****'
          Estimate Std.Error Lower Bound Upper Bound
0.02-0.06   -33.20      6.76       -56.0    -10.5000 ****
0.02-0.11   -58.50      6.76       -81.2    -35.8000 ****
0.02-0.22   -81.20      6.76      -104.0    -58.5000 ****
0.02-0.56  -113.00      6.76      -136.0    -90.5000 ****
 0.02-1.1  -122.00      6.76      -145.0    -99.3000 ****
0.06-0.11   -25.30      6.76       -48.0     -2.5400 ****
0.06-0.22   -48.00      6.76       -70.7    -25.3000 ****
0.06-0.56   -80.00      6.76      -103.0    -57.3000 ****
 0.06-1.1   -88.70      6.76      -111.0    -66.0000 ****
0.11-0.22   -22.70      6.76       -45.5     -0.0425 ****
0.11-0.56   -54.70      6.76       -77.5    -32.0000 ****
 0.11-1.1   -63.50      6.76       -86.2    -40.8000 ****
0.22-0.56   -32.00      6.76       -54.7     -9.2900 ****
 0.22-1.1   -40.70      6.76       -63.5    -18.0000 ****
 0.56-1.1    -8.75      6.76       -31.5     14.0000
16
17
Residual vs. fit for puromycin model
[Figure: resid(pyb.aov) (-15 to 15) plotted against fitted(pyb.aov)]
18
qqplot of residuals for puromycin model
[Figure: normal quantile plot of resid(pyb.aov) (-15 to 15)]
Summaries of puromycin model without interaction
Residuals:
    Min     1Q Median    3Q   Max
 -26.54 -7.083  2.625 4.792 20.04
F-statistic: 58.03 on 6 and 17 degrees of freedom, the p-value is 2.18e-010
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
conc       5  44243.71 8848.742 63.54684 0.00000000021
state      1   4240.04 4240.042 30.44967 0.00003762498
Residuals 17   2367.21  139.248
19
Observed velocity and fitted values for puromycin model without interaction
        Observed                Fitted
conc    treated    untreated    treated             untreated
0.02     76   47    67   51      73.542   73.542     46.958   46.958
0.06     97  107    84   86     106.792  106.792     80.208   80.208
0.11    123  139    98  115     132.042  132.042    105.458  105.458
0.22    159  152   131  124     154.792  154.792    128.208  128.208
0.56    191  201   144  158     186.792  186.792    160.208  160.208
1.10    207  200   160  162     195.542  195.542    168.958  168.958
20
21
Plot of residual vs. fit for puromycin model without interaction
[Figure: Residuals (-20 to 20) plotted against Fitted : conc + state; points 2, 13, and 21 are labeled]
22
Plot of velocity vs. concentration
[Figure: vel (50 to 200) plotted against conc]
Call: aov(formula = vel ~ conc + conc^2 + state)
Residuals:
   Min    1Q Median    3Q   Max
 -45.4 -6.93  4.227 7.902 23.94
Coefficients:
                Value Std. Error t value Pr(>|t|)
(Intercept)   73.0885     6.0136 12.1539   0.0000
conc         304.9581    37.3027  8.1752   0.0000
I(conc^2)   -188.9327    32.5953 -5.7963   0.0000
state         13.2917     3.4172  3.8897   0.0009
F-statistic: 53.82 on 3 and 20 degrees of freedom, the p-value is 9.291e-010
> summary(pyb2.aov)
          Df Sum of Sq  Mean Sq  F Value        Pr(F)
conc       1  31590.27 31590.27 112.7215 0.0000000011
I(conc^2)  1   9415.64  9415.64  33.5972 0.0000113551
state      1   4240.04  4240.04  15.1295 0.0009104989
Residuals 20   5605.01   280.25
23
24
Plot of residual vs. fit for pyb2.aov
[Figure: Residuals (-40 to 20) plotted against Fitted : conc + conc^2 + state; points 2, 18, and 21 are labeled]
25
qqplot of residuals for pyb2.aov
[Figure: normal quantile plot of Residuals (-40 to 20); points 2, 18, and 21 are labeled]
26
Call: aov(formula = vel ~ conc + conc^2 + conc^3 + conc^4 + conc^5 + state)
Residuals:
    Min     1Q Median    3Q   Max
 -26.54 -7.083  2.625 4.792 20.04
Coefficients:
F-statistic: 58.03 on 6 and 17 degrees of freedom, the p-value is 2.18e-010
> summary(pyb5.aov)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
conc       1  31590.27 31590.27 226.8641 0.0000000
I(conc^2)  1   9415.64  9415.64  67.6180 0.0000003
I(conc^3)  1   2603.71  2603.71  18.6984 0.0004604
I(conc^4)  1    631.13   631.13   4.5324 0.0481759
I(conc^5)  1      2.96     2.96   0.0213 0.8857934
state      1   4240.04  4240.04  30.4497 0.0000376
Residuals 17   2367.21   139.25
27
Plot of residual vs. fit for pyb5.aov
[Figure: Residuals (-20 to 20) plotted against Fitted : conc + conc^2 + conc^3 + conc^4 + conc^5 + state; points 2, 13, and 21 are labeled]
Guayule data set
Rate of Germination of Treated Guayule Seeds
SUMMARY:
The guayule data frame, a design object, has 96 rows and 5
columns. The guayule is a Mexican plant from which rubber is
manufactured. Batches of 100 seeds of eight varieties ( variety ) of
guayule were given one of four treatments ( treatment ), and
planted; the number of plants that came up in each batch ( plants )
was recorded.
ARGUMENTS:
variety
factor with levels V1 through V8 labeling the variety of guayule.
treatment
factor with levels T1 through T4 labeling the treatment given to the
seeds.
plants
numeric vector giving the number of seeds out of a batch of 100 that
germinated.
28
pairs(gy)
[Figure: scatterplot matrix of variety (V1 to V8), treatment (T1 to T4), and plants (20 to 80)]
plot.factor(gy$variety,gy$plants)
[Figure: plot of gy$plants (20 to 80) by gy$variety (V1 to V8)]
plot.factor(gy$treatment,gy$plants)
[Figure: plot of gy$plants (20 to 80) by gy$treatment (T1 to T4)]
interaction.plot(gy$variety,gy$treatment,gy$plants)
[Figure: interaction plot of mean of gy$plants (10 to 60) against gy$variety (V1 to V8), one line per gy$treatment (T1, T3, T2, T4)]
interaction.plot(gy$treatment,gy$variety,gy$plants)
[Figure: interaction plot of mean of gy$plants (10 to 60) against gy$treatment (T1 to T4), one line per gy$variety (V6, V8, V5, V3, V2, V7, V4, V1)]
hist(gy$plants)
[Figure: histogram of gy$plants, counts from 0 to 30]
Summaries of gy.aov
Call: aov(formula = plants ~ variety * treatment, data = gy)
Residuals:
    Min     1Q     Median   3Q Max
 -16.33 -2.667 1.494e-015 2.75  16
is 0
> summary(gy.aov)
                  Df Sum of Sq  Mean Sq  F Value      Pr(F)
variety            7    763.16   109.02   2.7058 0.01604076
treatment          3  30774.28 10258.09 254.5959 0.00000000
variety:treatment 21   2620.14   124.77   3.0966 0.00026666
Residuals         64   2578.67    40.29
35
Plot of residual vs. fit for gy data set
[Figure: Residuals (-15 to 15) plotted against Fitted : variety * treatment; points 3, 34, and 35 are labeled]
model.tables(gy.aov,type="mean")
Tables of means
Grand mean
25.302
variety
     V1     V2     V3     V4     V5     V6     V7     V8
 24.667 26.833 28.833 21.000 21.917 28.167 23.250 27.750
treatment
     T1     T2     T3     T4
 55.833 13.917 20.042 11.417
37
model.tables(gy.aov,type="mean") (continued)
variety:treatment
Dim 1 : variety
Dim 2 : treatment
       T1     T2     T3     T4
V1 66.333 11.667 12.333  8.333
V2 63.333 18.333 14.333 11.333
V3 65.000 12.667 26.333 11.333
V4 50.333 10.000 14.000  9.667
V5 49.333 16.333 10.333 11.667
V6 58.000  8.000 29.667 17.000
V7 46.333 14.667 22.000 10.000
V8 48.000 19.667 31.333 12.000
38
multicomp(gy.aov, focus="treatment")
95 % simultaneous confidence intervals for specified
linear combinations, by the Tukey method
critical point: 2.6378
response variable: plants
intervals excluding 0 are flagged by '****'
      Estimate Std.Error Lower Bound Upper Bound
T1-T2    41.90      1.83       37.10       46.80 ****
T1-T3    35.80      1.83       31.00       40.60 ****
T1-T4    44.40      1.83       39.60       49.30 ****
T2-T3    -6.12      1.83      -11.00       -1.29 ****
T2-T4     2.50      1.83       -2.33        7.33
T3-T4     8.62      1.83        3.79       13.50 ****
39
Guayule ANOVA with variety random
> gyr.tab
                  Df Sum of Sq  Mean Sq  F Value     Pr(F)
treatment          3  30774.28 10258.09 82.21711 0.0000000
variety            7    763.16   109.02  0.87380 0.5428964
treatment:variety 21   2620.14   124.77  3.09663 0.0002667
Residuals         64   2578.67    40.29
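The two guayule tables differ only in the denominators of the F ratios. The sketch below, in Python rather than S-Plus, recomputes the ratios from the rounded mean squares printed above. The choice of denominators for the random-variety case is inferred from the printed F values, not taken from the S-Plus source, so treat it as an illustration only.

```python
# Mean squares copied (rounded) from the guayule ANOVA tables above.
ms = {
    "treatment": 10258.09,
    "variety": 109.02,
    "treatment:variety": 124.77,
    "residual": 40.29,
}


def f_ratio(numerator, denominator):
    """Return the F statistic MS(numerator) / MS(denominator)."""
    return ms[numerator] / ms[denominator]


# Fixed-effects analysis: every term is tested against the residual mean square.
fixed = {
    term: f_ratio(term, "residual")
    for term in ("variety", "treatment", "treatment:variety")
}

# Variety random: the main effects are tested against the interaction mean
# square (assumption inferred from the printed table), the interaction
# against the residual mean square.
random_variety = {
    "treatment": f_ratio("treatment", "treatment:variety"),
    "variety": f_ratio("variety", "treatment:variety"),
    "treatment:variety": f_ratio("treatment:variety", "residual"),
}

for term, value in random_variety.items():
    print(f"{term:18s} F = {value:.3f}")
```

Small differences from the printed F values come from using rounded mean squares.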
40
Random if:
Not interested in those particular factor
levels (e.g. batches)
Levels of factor are randomly chosen from
a larger population of factor levels (e.g. 10
universities selected from all universities in
country).
Want to generalize to a larger population
of factor levels.
41
EMS for 2-factor models
(See Table 24.5 on page 981 of Neter et al. Applied Linear
Statistical Models.)
Nested vs. Crossed Design
(See Figure 28.1 in Neter et al. Applied
Linear Statistical Models.)
Nested Fixed Factors
(See Table 28.3 on page 1129 of Neter et al. Applied
42
Nested Mixed Factors
Cross-Nested Models
43
Images of book covers:
Patrick O'Brian, The Commodore.
Patrick O'Brian, The Fortune of War.
44
Nested Factors
Speed of Firing Naval Guns
SUMMARY:
The gun data frame, a design object, has 36 rows representing runs
of a team of 3 men loading and firing naval guns attempting to get
off as many rounds per minute as possible. The three predictor
variables (columns) specify the team and the physique of the men
on it and the loading method used; the outcome variable is the
rounds fired per minute.
ARGUMENTS:
Method
factor giving one of two methods for loading rounds into Naval guns.
Levels are M1 and M2 .
Physique
an ordered factor giving the physique of the men: S for slight, A for
average, and H for heavy.
Team
factor with levels T1 , T2 or T3 . In fact there are nine teams, three of
each physique, i.e. a slight T1 , an average T1 , and a heavy T1 , etc.
Rounds
numeric vector giving the number of rounds per minute fired by a team.
45
gun
   Method Physique Team Rounds
 1     M1        S   T1   20.2
 2     M2        S   T1   14.2
 3     M1        A   T1   22.0
 4     M2        A   T1   14.1
 5     M1        H   T1   23.1
 6     M2        H   T1   14.1
 7     M1        S   T2   26.2
 8     M2        S   T2   18.0
 9     M1        A   T2   22.6
10     M2        A   T2   14.0
11     M1        H   T2   22.9
12     M2        H   T2   12.2
13     M1        S   T3   23.8
14     M2        S   T3   12.5
15     M1        A   T3   22.9
16     M2        A   T3   13.7
17     M1        H   T3   21.8
18     M2        H   T3   12.7
19     M1        S   T1   24.1
20     M2        S   T1   16.2
46
gun (teams renumbered T1 to T9)
   Method Physique Team Rounds
 1     M1        S   T1   20.2
 2     M2        S   T1   14.2
 3     M1        A   T2   22.0
 4     M2        A   T2   14.1
 5     M1        H   T3   23.1
 6     M2        H   T3   14.1
 7     M1        S   T4   26.2
 8     M2        S   T4   18.0
 9     M1        A   T5   22.6
10     M2        A   T5   14.0
11     M1        H   T6   22.9
12     M2        H   T6   12.2
13     M1        S   T7   23.8
14     M2        S   T7   12.5
15     M1        A   T8   22.9
16     M2        A   T8   13.7
17     M1        H   T9   21.8
18     M2        H   T9   12.7
19     M1        S   T1   24.1
20     M2        S   T1   16.2
47
Speed of firing of naval guns
          Slight            Average           Heavy
Method 1  T1: 20.2, 24.1    T2: 22.0, 23.5    T3: 23.1, 22.9
          T4: 26.2, 26.9    T5: 22.6, 24.6    T6: 22.9, 23.7
          T7: 23.8, 24.9    T8: 22.9, 25.0    T9: 21.8, 23.5
Method 2  T1: 14.2, 16.2    T2: 14.1, 16.1    T3: 14.1, 16.1
          T4: 18.0, 19.1    T5: 14.0, 18.1    T6: 12.2, 13.8
          T7: 12.5, 15.4    T8: 13.7, 16.0    T9: 12.7, 15.1
48
pairs(gun2)
[Figure: scatterplot matrix of method (1 to 2), physique (1 to 3), team (1 to 9), and rounds (15 to 25)]
50
Method Effect
[Figure: mean of rounds (16 to 22) by method (M1, M2)]
51
Physique Effect
[Figure: mean of rounds (18.5 to 20.0) by physique (S, A, H)]
52
Team Effect
[Figure: mean of rounds (18 to 22) by team (1 to 9)]
53
Method-Physique Interaction
[Figure: interaction plot of mean of rounds (14 to 24) against method (M1, M2), one line per physique (S, A, H)]
ANOVA tables for firing of naval guns example (with teams numbered 1-9)
> summary(aov(rounds~phys*meth*team))
           Df Sum of Sq  Mean Sq  F Value     Pr(F)
phys        2   16.0517   8.0258   3.4736 0.0529995
meth        1  651.9511 651.9511 282.1621 0.0000000
team        6   39.2583   6.5431   2.8318 0.0403140
phys:meth   2    1.1872   0.5936   0.2569 0.7762240
meth:team   6   10.7217   1.7869   0.7734 0.6009376
Residuals  18   41.5900   2.3106
> summary(aov(rounds~phys*meth*team%in%phys))
                     Df Sum of Sq  Mean Sq  F Value     Pr(F)
phys                  2   16.0517   8.0258   3.4736 0.0529995
meth                  1  651.9511 651.9511 282.1621 0.0000000
phys:meth             2    1.1872   0.5936   0.2569 0.7762240
team%in%phys          6   39.2583   6.5431   2.8318 0.0403140
meth:(team%in%phys)   6   10.7217   1.7869   0.7734 0.6009376
Residuals            18   41.5900   2.3106
> model.tables(gunaov,type="mean")
Tables of means
Grand mean
19.333
Method
     M1     M2
 23.589 15.078
Physique
      S      A      H
 20.125 19.383 18.492
Team%in%Physique
Dim 1 : Physique
Dim 2 : Team
      T1     T2     T3
S 18.675 22.550 19.150
A 18.925 19.825 19.400
H 19.050 18.150 18.275
55
56
Tables of means
Grand mean
19.333
method
        M1     M2
    23.589 15.078
rep 18.000 18.000
physique
         S      A      H
    20.125 19.383 18.492
rep 12.000 12.000 12.000
team%in%physique
Dim 1 : physique
Dim 2 : team
         1      2      3      4      5      6      7      8      9
S   18.675               22.550               19.150
rep  4.000  0.000  0.000  4.000  0.000  0.000  4.000  0.000  0.000
A          18.925               19.825               19.400
rep  0.000  4.000  0.000  0.000  4.000  0.000  0.000  4.000  0.000
H                 19.050               18.150               18.275
rep  0.000  0.000  4.000  0.000  0.000  4.000  0.000  0.000  4.000
Summaries of firing of naval guns example (without interaction)
Call: aov(formula = Rounds ~ Method + Physique/Team, data = gun)
Residuals:
    Min      1Q     Median     3Q   Max
 -2.731 -0.7368 2.498e-016 0.9972 2.531
F-statistic: 38.19 on 9 and 26 degrees of freedom, the p-value is 9.602e-013
> summary(gunaov)
                 Df Sum of Sq  Mean Sq  F Value      Pr(F)
Method            1  651.9511 651.9511 316.8426 0.00000000
Physique          2   16.0517   8.0258   3.9005 0.03300457
Team%in%Physique  6   39.2583   6.5431   3.1799 0.01782181
Residuals        26   53.4989   2.0576
Plot of residual vs fit for gun.aov
[Figure: Residuals (-2 to 2) plotted against Fitted : Method + Physique/Team (14 to 26); points 1, 14, and 28 are labeled]
2^k Factorial Designs
Exploratory experimental studies.
Multifactor experiment in which each factor is studied at two levels.
Used to screen a large number of factors to identify the most important.
Sometimes 2 levels occur naturally, e.g. present or absent, smoker or non-smoker.
k factors => 2^k treatment combinations
59
2^k Factorial Design Example
Example: 13.19, page 553 of the course textbook.
60
61
pairs(nw.df)
[Figure: scatterplot matrix of y (50 to 300) and the coded factors a, b, c (each -1.0 to 1.0)]
62
hist(y)
[Figure: histogram of y, counts from 0 to 6]
Effect of a
[Figure: mean of y (160 to 190) at a = -1 and a = 1]
64
Effect of b
[Figure: mean of y (100 to 250) at b = -1 and b = 1]
65
Effect of c
[Figure: mean of y (160 to 185) at c = -1 and c = 1]
66
interaction.plot(a,b,y)
[Figure: interaction plot of mean of y (100 to 250) against a (-1, 1), one line per level of b (-1, 1)]
67
interaction.plot(a,c,y)
[Figure: interaction plot of mean of y (140 to 180) against a (-1, 1), one line per level of c (1, -1)]
68
interaction.plot(b,c,y)
[Figure: interaction plot of mean of y (100 to 250) against b (-1, 1), one line per level of c (1, -1)]
summary.lm(nw.aov)
Call: aov(formula = y ~ a * b * c, data = nw.df)
Residuals:
    Min     1Q Median    3Q   Max
 -37.67 -6.861  2.388 12.67 28.67
Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) 171.1942     4.6675  36.6780   0.0000
a           -17.6942     4.6675  -3.7909   0.0016
b           -76.5833     4.6675 -16.4078   0.0000
c            13.3333     4.6675   2.8566   0.0114
a:b         -14.8050     4.6675  -3.1719   0.0059
a:c          16.6667     4.6675   3.5708   0.0026
b:c           4.9442     4.6675   1.0593   0.3052
a:b:c       -25.0558     4.6675  -5.3682   0.0001
F-statistic: 49.21 on 7 and 16 degrees of freedom, the p-value is 1.209e-009
Effect (of going from low to high level) is 2*regression coefficient
69
model.matrix(nw.aov)
   (Intercept)  a  b  c a:b a:c b:c a:b:c
 1           1 -1 -1 -1   1   1   1    -1
 2           1  1 -1 -1  -1  -1   1     1
 3           1 -1  1 -1  -1   1  -1     1
 4           1 -1 -1  1   1  -1  -1     1
 5           1  1  1 -1   1  -1  -1    -1
 6           1  1 -1  1  -1   1  -1    -1
 7           1 -1  1  1  -1  -1   1    -1
 8           1  1  1  1   1   1   1     1
 9           1 -1 -1 -1   1   1   1    -1
10           1  1 -1 -1  -1  -1   1     1
11           1 -1  1 -1  -1   1  -1     1
12           1 -1 -1  1   1  -1  -1     1
13           1  1  1 -1   1  -1  -1    -1
14           1  1 -1  1  -1   1  -1    -1
15           1 -1  1  1  -1  -1   1    -1
16           1  1  1  1   1   1   1     1
17           1 -1 -1 -1   1   1   1    -1
18           1  1 -1 -1  -1  -1   1     1
19           1 -1  1 -1  -1   1  -1     1
20           1 -1 -1  1   1  -1  -1     1
21           1  1  1 -1   1  -1  -1    -1
22           1  1 -1  1  -1   1  -1    -1
23           1 -1  1  1  -1  -1   1    -1
24           1  1  1  1   1   1   1     1
70
X'X Matrix
t(X)%*%X
            (Intercept)  a  b  c a:b a:c b:c a:b:c
(Intercept)          24  0  0  0   0   0   0     0
a                     0 24  0  0   0   0   0     0
b                     0  0 24  0   0   0   0     0
c                     0  0  0 24   0   0   0     0
a:b                   0  0  0  0  24   0   0     0
a:c                   0  0  0  0   0  24   0     0
b:c                   0  0  0  0   0   0  24     0
a:b:c                 0  0  0  0   0   0   0    24
71
n*(X'X)^-1 X'
> solve(t(X)%*%X)%*%t(X)*24
             1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
(Intercept)  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
a           -1  1 -1 -1  1  1 -1  1 -1  1 -1 -1  1  1 -1  1 -1  1 -1 -1  1  1 -1  1
b           -1 -1  1 -1  1 -1  1  1 -1 -1  1 -1  1 -1  1  1 -1 -1  1 -1  1 -1  1  1
c           -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1
a:b          1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1  1 -1 -1  1
a:c          1 -1  1 -1 -1  1 -1  1  1 -1  1 -1 -1  1 -1  1  1 -1  1 -1 -1  1 -1  1
b:c          1  1 -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1  1  1
a:b:c       -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1 -1 -1  1
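The matrices above show why a 2^k design with -1/+1 coding is easy to analyze: X'X = nI, so each least-squares coefficient is just a signed average of the responses. The Python sketch below builds a replicated 2^3 design and checks this. The responses are hypothetical, generated from made-up coefficients (they are not the textbook data), and the run order differs from the slide.

```python
from itertools import product

# Full 2^3 design in -1/+1 coding, replicated 3 times (24 runs).
runs = [levels for _ in range(3) for levels in product((-1, 1), repeat=3)]


def model_row(a, b, c):
    """One design-matrix row: intercept, main effects, all interactions."""
    return [1, a, b, c, a * b, a * c, b * c, a * b * c]


X = [model_row(*levels) for levels in runs]
n, p = len(X), len(X[0])

# X'X: off-diagonal entries vanish because the coded columns are orthogonal.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]

# Hypothetical noise-free responses built from made-up coefficients.
true_coef = [170.0, -18.0, -77.0, 13.0, -15.0, 17.0, 5.0, -25.0]
y = [sum(t * x for t, x in zip(true_coef, row)) for row in X]

# Because X'X = n I, least squares reduces to coef = X'y / n.
coef = [sum(row[j] * yi for row, yi in zip(X, y)) / n for j in range(p)]

# Effect of moving a term from its low to its high level = 2 * coefficient.
effects = [2 * c for c in coef[1:]]
print(coef)
print(effects)
```

With real data the same arithmetic applies; only the responses y change.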
72
summary(nw.aov)
> summary(nw.aov)
          Df Sum of Sq  Mean Sq  F Value     Pr(F)
a          1    7514.0   7514.0  14.3712 0.0016031
b          1  140760.2 140760.2 269.2166 0.0000000
c          1    4266.7   4266.7   8.1604 0.0114229
a:b        1    5260.5   5260.5  10.0612 0.0059164
a:c        1    6666.7   6666.7  12.7506 0.0025519
b:c        1     586.7    586.7   1.1221 0.3052037
a:b:c      1   15067.1  15067.1  28.8171 0.0000628
Residuals 16    8365.6    522.9
73
Plot of residual vs. fit for nw.aov
[Figure: Residuals (-40 to 20) plotted against Fitted : a * b * c; points 6, 22, and 23 are labeled]
Nonparametric Statistical
Methods
Tamhane and Dunlop
1
Nonparametric Methods
Most NP methods are based on ranks instead of original
data
Reference: Hollander & Wolfe, Nonparametric Statistical
Methods
E Newton 2
E Newton 3
Histogram of 100 gamma(1,1) r.v.s
[Figure: histogram of g (0 to 4), counts from 0 to 30]
Histogram of ranks of 100 r.v.s
[Figure: histogram of rank(g) (0 to 100), counts from 0 to 10]
E Newton 4
Parametric and Nonparametric Tests
Type of test                 Parametric     Nonparametric
Single sample                z and t tests  Sign test; Wilcoxon signed rank test
Two independent samples      z and t tests  Wilcoxon rank sum test; Mann-Whitney U test
Several independent samples  ANOVA CRD      Kruskal-Wallis test
Several matched samples      ANOVA RBD      Friedman test
Correlation                  Pearson        Spearman rank correlation; Kendall's rank correlation
Sign Test
Inference on median (u) for a single sample, size n
$H_0: u = u_0$ vs. $H_1: u \ne u_0$
Count the number of $x_i$'s that are greater than $u_0$ and denote this s+
The number of $x_i$'s less than $u_0$ is s- = n - s+
Reject $H_0$ if s+ is large or if s- is small.
Under $H_0$, s+ (and s-) has a binomial(n, 1/2) distribution
Large sample z test
E Newton 7
Histogram of thermostat data
[Figure: histogram of x (198 to 208), counts from 0 to 4]
E Newton 8
Sign Test in S-Plus
> thermostat
[1] 202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8
201.3 199.0
> thermostat<200
[1] F F F F F T F F F T
> sum(thermostat<200)
[1] 2
> 2*pbinom(sum(thermostat<200),10,0.5)
[1] 0.109375
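The same two-sided sign test can be sketched in Python using only the standard library. The helper name binom_cdf is ours (it plays the role of pbinom); the data and the null median of 200 are taken from the S-Plus session above.

```python
from math import comb

thermostat = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]
u0 = 200.0  # hypothesized median


def binom_cdf(k, n, p=0.5):
    """P(S <= k) for S ~ binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


s_plus = sum(x > u0 for x in thermostat)   # observations above u0
s_minus = sum(x < u0 for x in thermostat)  # observations below u0
n = s_plus + s_minus                       # values equal to u0 would be dropped

# Two-sided p-value: double the smaller tail, capped at 1.
p_value = min(1.0, 2 * binom_cdf(min(s_plus, s_minus), n))
print(s_plus, s_minus, p_value)
```

This reproduces the value 0.109375 printed by 2*pbinom(2, 10, 0.5).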
E Newton 9
Wilcoxon Signed Rank Test
Inference on median (u), single sample, size n
Assumes population distribution is symmetric
$H_0: u = u_0$ vs. $H_1: u \ne u_0$
$d_i = x_i - u_0$
Rank order the $|d_i|$
W+ = sum of ranks of positive differences
W- = sum of ranks of negative differences
$W_{max}$ = maximum(W+, W-)
Reject $H_0$ if $W_{max}$ is large.
Null distribution: see text
Large sample z test
E Newton 10
S-Plus wilcox.test for thermostat data
E Newton 11
> thermostat
[1] 202.2 203.4 200.5 202.5 206.3 198.0 203.7 200.8
201.3 199.0
> sum(rank(abs(thermostat-200))[-c(6,10)])
[1] 47
> wilcox.test(thermostat, mu=200)
Exact Wilcoxon signed-rank test
data: thermostat
signed-rank statistic V = 47, n = 10, p-value = 0.0488
alternative hypothesis: true mu is not equal to 200
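The exact null distribution of the signed-rank statistic can be enumerated directly for a sample this small: under the null hypothesis each rank is equally likely to carry a plus or a minus sign. The Python sketch below does that for the thermostat data. The function average_ranks is our own helper, and enumerating all 2^n sign patterns is only practical for small n.

```python
from itertools import product

thermostat = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]
u0 = 200.0


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


d = [x - u0 for x in thermostat]
ranks = average_ranks([abs(v) for v in d])
w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
w_max = max(w_plus, w_minus)

# Exact null distribution: every pattern of signs on the ranks is equally likely.
null = [
    sum(r for r, s in zip(ranks, signs) if s)
    for signs in product((0, 1), repeat=len(ranks))
]
p_value = min(1.0, 2 * sum(w >= w_max for w in null) / len(null))
print(w_plus, w_minus, round(p_value, 4))
```

The result agrees with the S-Plus output: V = 47 and a two-sided p-value of about 0.0488.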
S-Plus parametric t-test for thermostat data
> t.test(thermostat, mu=200)
One-sample t-Test
data: thermostat
t = 2.3223, df = 9, p-value = 0.0453
alternative hypothesis: true mean is not equal to 200
200.0459 203.4941
sample estimates:
mean of x
201.77
E Newton 12
Location-Scale Families
See course textbook, page 575.
E Newton 13
2 normal pdfs with location parameters = -1 and 1, scale parameter = 1
[Figure: dnorm(x, 1, 1) (0.0 to 0.4) plotted against x (-4 to 4)]
E Newton 14
Wilcoxon Rank Sum Test
Inference on location of distribution of 2 independent random samples X and Y (e.g. from control and treatment population).
Assume $X \sim Y + \Delta$
$H_0: \Delta = 0$ vs. $H_1: \Delta \ne 0$
Rank all $N = n_1 + n_2$ observations
W = sum of ranks assigned to the Y's (or X's, whichever has smaller sample size)
Reject $H_0$ if W is extreme
E Newton 15
Mann-Whitney U test
Equivalent to Wilcoxon rank sum test
Compare each $x_i$ with each $y_j$.
There are $n_x n_y$ such comparisons
U = number of pairs in which $x_i < y_j$.
It can be shown that $W = U + n_y(n_y+1)/2$ when there are no ties, where W is the rank sum of the y's
Reject $H_0$ if U is extreme.
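The relation between W and U can be checked numerically. The Python sketch below uses small hypothetical samples without ties (not the capacitor data), with W taken as the rank sum of the y's and U as the number of pairs with x below y.

```python
# Hypothetical samples without ties (illustration only).
x = [1.2, 3.4, 2.2, 5.0]
y = [2.9, 6.1, 4.4, 7.3, 0.8]

# Rank the pooled observations; this simple lookup is valid only without ties.
pooled = sorted(x + y)
rank = {value: position + 1 for position, value in enumerate(pooled)}

w = sum(rank[value] for value in y)           # Wilcoxon rank sum of the y's
u = sum(xi < yj for xi in x for yj in y)      # Mann-Whitney count of x_i < y_j
n_y = len(y)

print(w, u, u + n_y * (n_y + 1) // 2)
```

Each y's rank equals the number of x's below it plus its rank among the y's, which is where the n_y(n_y+1)/2 term comes from.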
E Newton 16
Boxplots of times to failure for control and stressed capacitors
[Figure: boxplots of time to failure (0 to 30) for cg and sg]
E Newton 17
S-Plus wilcox.test
> wilcox.test(cg, sg)
Exact Wilcoxon rank-sum test
data: cg and sg
rank-sum statistic W = 95, n = 8, m = 10, p-value =
0.1011
alternative hypothesis: true mu is not equal to 0
E Newton 18
S-Plus parametric t-test
> t.test(cg,sg)
data: cg and sg
t = 1.8105, df = 16, p-value = 0.089
-1.103506 14.018506
sample estimates:
mean of x mean of y
15.5375 9.08
E Newton 19
Kolmogorov-Smirnov Tests
The Kolmogorov-Smirnov test detects any difference between two distributions (location, scale, skewness, or otherwise) by comparing their two empirical cumulative distribution functions (step functions).
There is also a one-sample version for testing the distance between some observed data and a specified (ideal) distribution.
[Figure: Two-sample Test. Cumulative frequency (0.00 to 1.00) for Distribution 1 and Distribution 2, with the maximum gap marked]
[Figure: One-sample Test. Cumulative frequency (0.00 to 1.00) for the observed distribution and the ideal distribution, with the maximum gap marked]
The test statistic is the maximum gap between the observed distribution and the hypothesized distribution; its significance depends on the sample size (tables or p-values).
J Telford 20
E Newton 21
Histograms of 100 random normal (2,1) deviates and 100 random gamma(4,2) deviates
[Figure: density histogram of x (-1 to 5, density 0.0 to 0.4) and density histogram of y (0 to 6, density 0.0 to 0.5)]
Kolmogorov-Smirnov Tests
> ks.gof(x, y)
Two-Sample Kolmogorov-Smirnov Test
data: x and y
ks = 0.15, p-value = 0.2112
alternative hypothesis: cdf of x does not equal the cdf of y for at least one sample point.
> ks.gof(y)
One sample Kolmogorov-Smirnov Test of Composite Normality
data: y
ks = 0.0969, p-value = 0.0216
alternative hypothesis: True cdf is not the normal distn. with estimated parameters
sample estimates:
 mean of x standard deviation of x
  1.865857               0.9421928
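The ks value reported above is the largest vertical gap between two empirical cumulative distribution functions. A minimal Python sketch of that computation follows. The function name ks_two_sample and the data are ours (hypothetical), and the sketch returns only the statistic, not a p-value.

```python
from bisect import bisect_right


def ks_two_sample(x, y):
    """Largest absolute gap between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    # The step functions only change at observed values, so checking
    # the gap at every observation is enough.
    for t in xs + ys:
        fx = bisect_right(xs, t) / len(xs)  # share of x values <= t
        fy = bisect_right(ys, t) / len(ys)  # share of y values <= t
        gap = max(gap, abs(fx - fy))
    return gap


# Hypothetical samples (not the simulated normal and gamma data above).
x = [0.1, 0.4, 0.7, 1.5]
y = [0.9, 1.1, 1.8, 2.6]
d = ks_two_sample(x, y)
print(d)
```

For these samples the largest gap occurs at t = 0.7, where three of the four x values but none of the y values have been reached.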
E Newton 22
Kruskal-Wallis Test
Inference for several independent samples
Assume distributions of each of the samples differ only possibly in location.
$X_{ij} = \theta + \tau_j + e_{ij}$.
$H_0: \tau_1 = \tau_2 = \dots = \tau_a$ vs. $H_1: \tau_i \ne \tau_j$ for some $i \ne j$
Rank all $N = n_1 + n_2 + \dots + n_a$ observations.
Calculate rank sums and averages in each group
Calculate KW test statistic = kw (see text)
Reject $H_0$ for large values of kw
For large $n_i$'s, the null distribution of kw is approximately $\chi^2_{a-1}$
E Newton 23
Test scores for four different teaching methods
(page 582)
scm<-matrix(score,7,4)
> scm
      [,1]  [,2]  [,3]  [,4]
[1,] 14.06 14.71 23.32 26.93
[2,] 14.26 19.49 23.42 29.76
[3,] 14.59 20.20 24.92 30.43
[4,] 18.15 20.27 27.82 33.16
[5,] 20.82 22.34 28.68 33.88
[6,] 23.44 24.92 32.85 36.43
[7,] 25.43 26.84 33.90 37.04
E Newton 24
Plot.factor(f(grp),score)
[Figure: test scores for each teaching method (15 to 35) by f(grp) (1 to 4)]
E Newton 25
Ranks of Test Scores
E Newton 26
> scmr<-matrix(rank(score),7,4)
> scmr
     [,1] [,2] [,3] [,4]
[1,]    1  4.0 11.0   18
[2,]    2  6.0 12.0   21
[3,]    3  7.0 14.5   22
[4,]    5  8.0 19.0   24
[5,]    9 10.0 20.0   25
[6,]   13 14.5 23.0   27
[7,]   16 17.0 26.0   28
> tmp<-apply(scmr,2,sum)
> tmp
[1] 49.0 66.5 125.5 165.0
> (12/(28*29))*sum((tmp^2)/7)-3*29
[1] 18.13406
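The hand calculation above can be written out in Python as a check. The sketch uses the formula kw = 12/(N(N+1)) * sum(R_j^2/n_j) - 3(N+1), which is what the S-Plus expression evaluates, with no correction for ties. The helper average_ranks is ours.

```python
# Test scores by teaching method (columns of scm above).
groups = [
    [14.06, 14.26, 14.59, 18.15, 20.82, 23.44, 25.43],
    [14.71, 19.49, 20.20, 20.27, 22.34, 24.92, 26.84],
    [23.32, 23.42, 24.92, 27.82, 28.68, 32.85, 33.90],
    [26.93, 29.76, 30.43, 33.16, 33.88, 36.43, 37.04],
]


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


pooled = [score for group in groups for score in group]
ranks = average_ranks(pooled)
N = len(pooled)

# Split the pooled ranks back into groups and sum them.
rank_sums, start = [], 0
for group in groups:
    rank_sums.append(sum(ranks[start:start + len(group)]))
    start += len(group)

kw = 12 / (N * (N + 1)) * sum(
    r**2 / len(g) for r, g in zip(rank_sums, groups)
) - 3 * (N + 1)
print(rank_sums, round(kw, 5))
```

The small gap between this value and the 18.139 reported by kruskal.test is presumably the tie correction for the two scores of 24.92.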
Kruskal-Wallis test in S-Plus
> kruskal.test(scm, col(scm))
Kruskal-Wallis rank sum test
data: scm and col(scm)
Kruskal-Wallis chi-square = 18.139, df = 3,
p-value = 0.0004
alternative hypothesis: two.sided
E Newton 27
ANOVA for test scores
summary(aov(score~f(grp)))
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
f(grp)     3  830.1914 276.7305 15.93607 6.509182e-006
Residuals 24  416.7609  17.3650
E Newton 28
Friedman Test
Inference for several matched samples
a treatments, b blocks
$H_0: \tau_1 = \tau_2 = \dots = \tau_a$ vs. $H_1: \tau_i \ne \tau_j$ for some $i \ne j$
Rank observations separately within each block
Calculate rank sums
Calculate the Friedman statistic, fr (see text)
Reject $H_0$ for large values of fr
For b large, fr is approximately $\chi^2_{a-1}$
E Newton 29
Ranks within Blocks (rows)
> scmrb<-t(apply(scm,1,rank))
> scmrb
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    1    2    3    4
[3,]    1    2    3    4
[4,]    1    2    3    4
[5,]    1    2    3    4
[6,]    1    2    3    4
[7,]    1    2    3    4
> tmp<-apply(scmrb,2,sum)
[1] 7 14 21 28
> (12/(4*7*5))*sum(tmp^2)-3*7*5
[1] 21
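A Python version of the same calculation is sketched below, treating the rows of scm as blocks and the columns as treatments. It uses fr = 12/(a b (a+1)) * sum(R_j^2) - 3 b (a+1), matching the S-Plus expression, and assumes there are no ties within a block.

```python
# Rows are blocks, columns are the four teaching methods (scm above).
scm = [
    [14.06, 14.71, 23.32, 26.93],
    [14.26, 19.49, 23.42, 29.76],
    [14.59, 20.20, 24.92, 30.43],
    [18.15, 20.27, 27.82, 33.16],
    [20.82, 22.34, 28.68, 33.88],
    [23.44, 24.92, 32.85, 36.43],
    [25.43, 26.84, 33.90, 37.04],
]


def ranks_within(block):
    """Ranks 1..a within one block (assumes no tied values in the block)."""
    ordered = sorted(block)
    return [ordered.index(value) + 1 for value in block]


b = len(scm)      # number of blocks
a = len(scm[0])   # number of treatments

block_ranks = [ranks_within(block) for block in scm]
rank_sums = [sum(row[j] for row in block_ranks) for j in range(a)]

fr = 12 / (a * b * (a + 1)) * sum(r**2 for r in rank_sums) - 3 * b * (a + 1)
print(rank_sums, fr)
```

Because every block orders the four methods identically, the statistic reaches its maximum possible value b(a-1) = 21.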
E Newton 30
Friedman test in S-Plus
> friedman.test(scm, col(scm), row(scm))
Friedman rank sum test
data: scm and col(scm) and row(scm)
Friedman chi-square = 21, df = 3, p-value
= 0.0001
alternative hypothesis: two.sided
E Newton 31
ANOVA test score data with blocks
> summary(aov(score~f(grp)+f(blk)))
          Df Sum of Sq  Mean Sq  F Value         Pr(F)
f(grp)     3  830.1914 276.7305 260.4768 5.220000e-015
f(blk)     6  397.6377  66.2729  62.3804 4.558276e-011
Residuals 18   19.1232   1.0624
E Newton 32
Correlation Methods
Pearson Correlation: measures only linear
association.
Spearman Correlation: correlation of the
ranks
Kendall's Tau: based on number of
concordant and discordant pairs.
E Newton 33
Kendall's Tau
Assume: the n bivariate observations $(X_1,Y_1), \dots, (X_n,Y_n)$ are a random sample from a continuous bivariate population.
$H_0$: $X_i$, $Y_i$ are independent
$H_0: F(x,y) = F(x)F(y)$
Measure dependence by finding the number of concordant and discordant pairs.
Population correlation coefficient:
$\tau = 2 P\{(X_2 - X_1)(Y_2 - Y_1) > 0\} - 1$
E Newton 34
Kendall's Tau
For $1 \le i < j \le n$:
$Q((X_i,Y_i),(X_j,Y_j)) = 1$ if $(X_j - X_i)(Y_j - Y_i) > 0$
$Q((X_i,Y_i),(X_j,Y_j)) = 0$ if $(X_j - X_i)(Y_j - Y_i) = 0$
$Q((X_i,Y_i),(X_j,Y_j)) = -1$ if $(X_j - X_i)(Y_j - Y_i) < 0$
$K = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Q((X_i,Y_i),(X_j,Y_j))$
$\hat{\tau} = \frac{2K}{n(n-1)}$
E Newton 35
Kendall's Tau example
> m
   1  3  2  4
1 NA  1  1  1
2 NA NA -1  1
3 NA NA NA  1
4 NA NA NA NA
> 2*sum(m,na.rm=T)/12
[1] 0.6666667
> cor.test(c(1,2,3,4),c(1,3,2,4),method="k")
Kendall's rank correlation tau
data: c(1, 2, 3, 4) and c(1, 3, 2, 4)
normal-z = 1.3587, p-value = 0.1742
alternative hypothesis: true tau is not equal to 0
tau
0.6666667
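The pair-counting definition translates directly into code. The Python sketch below computes K by summing the sign of (x_j - x_i)(y_j - y_i) over all pairs i < j and then forms 2K/(n(n-1)) for the four-point example above. The function names are ours.

```python
def sign(value):
    """Return 1, 0, or -1 according to the sign of value."""
    return (value > 0) - (value < 0)


def kendall_tau(x, y):
    """Sample Kendall's tau: 2K / (n(n-1)), K = concordant minus discordant pairs."""
    n = len(x)
    k = sum(
        sign((x[j] - x[i]) * (y[j] - y[i]))
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    return 2 * k / (n * (n - 1))


tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

Of the six pairs, five are concordant and one is discordant, so K = 4 and tau = 8/12.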
x=1:10
y=exp(x)
[Figure: plot of y (0 to 20000) against x (2 to 10)]
E Newton 37
Pearson Correlation
> cor.test(x, y, method="p")
Pearson's product-moment correlation
data: x and y
t = 2.9082, df = 8, p-value = 0.0196
alternative hypothesis: true coef is not equal to 0
cor
0.7168704
E Newton 38
Spearman Correlation
> cor.test(x, y, method="s")
Spearman's rank correlation
data: x and y
normal-z = 2.9818, p-value = 0.0029
alternative hypothesis: true rho is not equal to 0
rho
1
E Newton 39
Kendall Correlation
> cor.test(x, y, method="k")
Kendall's rank correlation tau
data: x and y
normal-z = 4.0249, p-value = 0.0001
alternative hypothesis: true tau is not equal to 0
tau
1
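The contrast between the three outputs above (Pearson about 0.72, Spearman and Kendall both 1) comes from y = exp(x) being monotone but far from linear. A short Python sketch of the Pearson and Spearman calculations for the same x and y follows. The function names are ours, and the rank step assumes no ties, which holds here.

```python
from math import exp, sqrt


def pearson(u, v):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def simple_ranks(values):
    """Ranks 1..n (assumes all values are distinct)."""
    ordered = sorted(values)
    return [ordered.index(value) + 1 for value in values]


x = list(range(1, 11))
y = [exp(value) for value in x]

r = pearson(x, y)                                  # linear association only
rho = pearson(simple_ranks(x), simple_ranks(y))    # Spearman: Pearson on ranks
print(round(r, 4), rho)
```

Spearman's coefficient is exactly 1 because the ranks of x and y coincide, while Pearson's coefficient is pulled down by the curvature.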
E Newton 40
E Newton 41
Example - Environmental Data
Censored below LOD
[Figure: two histograms, g and h, each over 0 to 14 with counts from 0 to 50]
Resampling Methods
Parametric methods: inference based on assumed population distribution
Resampling methods: no assumption about functional form of population distribution.
Permutation tests: 2 sample problem
Jackknife: delete one observation at a time
Bootstrap: resample with replacement
E Newton 42
Permutation Tests
Goal: test for a difference in means (2 sample problem)
$(x_1, x_2, \dots, x_{n1})$ and $(y_1, y_2, \dots, y_{n2})$ are independent samples drawn from $F_1$ and $F_2$.
$H_0: F_1 = F_2$ => all assignments of labels x and y equally likely.
Choose SRS of size n1 from n1+n2 observations and label as x, label rest as y.
Calculate value of test statistic (e.g. difference in means) for each assignment -> permutation distribution.
There are (n1+n2) choose (n1) possible distinct assignments (capacitor data set Ex14.7, n1=8, n2=10, number of assignments=43,758)
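The steps above can be sketched in Python. The first line checks the count of assignments quoted for the capacitor example; the rest runs an exact permutation test on small hypothetical samples (not the capacitor data), using the difference in means as the test statistic.

```python
from itertools import combinations
from math import comb

# Number of distinct label assignments for n1 = 8, n2 = 10.
n_assignments = comb(8 + 10, 8)

# Hypothetical small samples (illustration only).
x = [12.0, 15.5, 9.8]
y = [7.1, 8.4, 10.2, 6.9]
pooled = x + y


def mean(values):
    return sum(values) / len(values)


observed = mean(x) - mean(y)

# Permutation distribution: relabel every subset of size n1 as "x".
diffs = []
for chosen in combinations(range(len(pooled)), len(x)):
    group_x = [pooled[i] for i in chosen]
    group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    diffs.append(mean(group_x) - mean(group_y))

# Two-sided p-value; the small tolerance guards against rounding error.
p_value = sum(abs(d) >= abs(observed) - 1e-12 for d in diffs) / len(diffs)
print(n_assignments, len(diffs), p_value)
```

For larger samples, where full enumeration is too expensive, one would typically draw a random subset of the assignments instead.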
E Newton 43
Jackknife
Goal: estimate distribution and standard error of statistic (e.g. median or mean)
Draw n samples of size n-1 from original sample, by deleting one observation at a time.
Calculate $m_j^*$ = mean (median) from each sample
$JSE(m) = \sqrt{\frac{n-1}{n} \sum_{j=1}^{n} (m_j^* - \bar{m}^*)^2}$, where $\bar{m}^*$ is the average of the $m_j^*$
JSE is exact for mean, not necessarily very good for median
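The claim that the jackknife standard error is exact for the mean can be checked numerically. The Python sketch below implements the JSE formula and compares it with s/sqrt(n). It reuses the thermostat data from earlier slides as a convenient sample; the function name jackknife_se is ours.

```python
from math import sqrt
from statistics import mean, median, stdev

data = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]


def jackknife_se(sample, statistic):
    """Jackknife standard error from the n leave-one-out replicates."""
    n = len(sample)
    replicates = [statistic(sample[:j] + sample[j + 1:]) for j in range(n)]
    center = sum(replicates) / n
    return sqrt((n - 1) / n * sum((m - center) ** 2 for m in replicates))


jse_mean = jackknife_se(data, mean)
classical = stdev(data) / sqrt(len(data))   # usual standard error of the mean
jse_median = jackknife_se(data, median)     # computable, but less reliable

print(jse_mean, classical, jse_median)
```

The first two numbers agree up to rounding error; the third is the jackknife estimate for the median, which the slide warns is not necessarily very good.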
E Newton 44
Bootstrap
Goal: estimate distribution, standard error, confidence interval of statistic (e.g. mean, median, correlation)
Draw B samples of size n, with replacement, from original sample
Calculate the statistic $m_j^*$ from each sample
$BSE(m) = \sqrt{\frac{\sum_{j=1}^{B} (m_j^* - \bar{m}^*)^2}{B-1}}$, where $\bar{m}^*$ is the average of the $m_j^*$
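A minimal Python sketch of the bootstrap standard error follows. It is not the S-Plus bootstrap function: the name bootstrap_se, the default B = 1000, and the fixed seed are our choices for illustration, and the thermostat data from earlier slides serves as the sample.

```python
import random
from math import sqrt
from statistics import mean, stdev

data = [202.2, 203.4, 200.5, 202.5, 206.3, 198.0, 203.7, 200.8, 201.3, 199.0]


def bootstrap_se(sample, statistic, B=1000, seed=0):
    """Return (BSE, replicates) from B resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    replicates = [
        statistic([rng.choice(sample) for _ in range(n)]) for _ in range(B)
    ]
    center = sum(replicates) / B
    bse = sqrt(sum((m - center) ** 2 for m in replicates) / (B - 1))
    return bse, replicates


bse, replicates = bootstrap_se(data, mean)
classical = stdev(data) / sqrt(len(data))

# Percentile interval from the sorted replicates (rough 2.5% and 97.5% points).
ordered = sorted(replicates)
lower, upper = ordered[24], ordered[974]
print(bse, classical, (lower, upper))
```

For the mean, the bootstrap standard error should land close to the classical one; for statistics such as the variance or a correlation there is no such simple formula, which is where the bootstrap is most useful.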
E Newton 45
Swiss Data Set in S-Plus
Fertility Data for Switzerland in 1888
SUMMARY:
The swiss.fertility and swiss.x data sets contain fertility data for Switzerland in 1888.
ARGUMENTS:
swiss.fertility
standardized fertility measure I[g] for each of 47 French-speaking provinces of
Switzerland in approximately 1888.
swiss.x
matrix with 5 columns that contain socioeconomic indicators for the provinces:
1) percent of population involved in agriculture as an occupation; 2) percent
of "draftees" receiving highest mark on army examination; 3) percent of
population whose education is beyond primary school; 4) percent of
population who are Catholic; and, 5) percent of live births who live less than
1 year (infant mortality).
SOURCE:
Mosteller and Tukey (1977). Data Analysis and Regression. Addison-Wesley.
Unpublished data used by permission of Francine van de Walle. Population Study
Center, University of Pennsylvania, Philadelphia, PA.
E Newton 46
Bootstrap estimates and CI for variance of
education
> educ<-swiss.x[,3]
> var(educ)
[1] 92.45606
> educ.boot<-bootstrap(educ,var,trace=F)
> summary(educ.boot)
Call:
bootstrap(data = educ, statistic = var, trace = F)
    Observed    Bias  Mean    SE
var    92.46 -0.5972 91.86 39.14
     2.5%    5%   95% 97.5%
var 29.98 36.26 165.3   175
E Newton 47
Histogram of variance estimates obtained from 1000 bootstrap samples
[Figure: density histogram of var, Value from 50 to 200, Density from 0.0 to 0.010]
E Newton 48
QQ plot of variance estimates
[Figure: Quantiles of Replicates (50 to 200) against normal quantiles (-2 to 2) for var]
E Newton 49
Plot of LSAT scores by GPA for a sample of 15 schools
[Figure: lsat (560 to 660) plotted against gpa (2.8 to 3.4)]
E Newton 50
Bootstrap estimates and CI for correlation
between LSAT and GPA
> law.boot<-bootstrap(law.data,cor(lsat,gpa),trace=F)
> summary(law.boot)
Call:
bootstrap(data = law.data, statistic = cor(lsat, gpa), trace = F)
      Observed     Bias   Mean     SE
Param   0.7764 -0.00506 0.7713 0.1368
        2.5%     5%    95%  97.5%
Param  0.449 0.5133  0.947 0.9623
        2.5%     5%    95%  97.5%
Param 0.2623 0.4138 0.9232 0.9413
E Newton 51
Histogram of correlation estimates obtained from 1000 bootstrap samples
[Figure: density histogram of Param, Value from 0.2 to 1.0, Density from 0.0 to 3.0]
E Newton 52
QQ Plot of correlation estimates
[Figure: Quantiles of Replicates (0.2 to 1.0) against normal quantiles (-2 to 2) for Param]
E Newton 53
S-Plus Stack-loss data set
Stack-loss Data
SUMMARY:
The stack.loss and stack.x data sets are from the operation of a plant for the
oxidation of ammonia to nitric acid, measured on 21 consecutive days.
ARGUMENTS:
stack.loss
percent of ammonia lost (times 10).
stack.x
matrix with 21 rows and 3 columns representing air flow to the plant,
cooling water inlet temperature, and acid concentration as a percentage
(coded by subtracting 50 and then multiplying by 10).
SOURCE:
Brownlee, K.A. (1965). Statistical Theory and Methodology in Science and
Engineering. New York: John Wiley & Sons, Inc.
Draper and Smith (1966). Applied Regression Analysis. New York: John
Wiley & Sons, Inc.
Daniel and Wood (1971). Fitting Equations to Data. New York: John Wiley &
Sons, Inc.
E Newton 54
S-Plus stack loss data set
[Figure: scatterplot matrix of stack.loss (10 to 40), Air.Flow (50 to 80), Water.Temp (18 to 26), and Acid.Conc. (75 to 90)]
E Newton 55
Summary of stack loss regression
> summary(tmp)
Call: lm(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stack)
Residuals:
    Min     1Q  Median    3Q   Max
 -7.238 -1.712 -0.4551 2.361 5.698
Coefficients:
               Value Std. Error t value Pr(>|t|)
(Intercept) -39.9197    11.8960 -3.3557   0.0038
Air.Flow      0.7156     0.1349  5.3066   0.0001
Water.Temp    1.2953     0.3680  3.5196   0.0026
Acid.Conc.   -0.1521     0.1563 -0.9733   0.3440
F-statistic: 59.9 on 3 and 17 degrees of freedom, the p-value is 3.016e-009
           (Intercept) Air.Flow Water.Temp
Air.Flow        0.1793
Water.Temp     -0.1489  -0.7356
Acid.Conc.     -0.9016  -0.3389     0.0002
E Newton 56
Summary of stack loss bootstrap output
summary(stack.boot)
Call:
bootstrap(data = stack, statistic = coef(lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., stack)), trace = F)
            Observed       Bias     Mean     SE
(Intercept) -39.9197  0.5691396 -39.3505 9.3731
Air.Flow      0.7156  0.0016734   0.7173 0.1777
Water.Temp    1.2953 -0.0264873   1.2688 0.4798
Acid.Conc.   -0.1521 -0.0006978  -0.1528 0.1261
                2.5%       5%       95%     97.5%
(Intercept) -56.0109 -53.4216 -21.92994 -18.75262
Air.Flow      0.3903   0.4366   1.00261   1.04605
Water.Temp    0.4004   0.5131   2.07381   2.23633
Acid.Conc.   -0.4285  -0.3740   0.03282   0.05912
Summary of stack loss bootstrap output
> summary(stack.boot)
BCa Percentiles:
                2.5%       5%        95%     97.5%
(Intercept) -55.6465 -52.6606 -21.451125 -18.55810
   Air.Flow   0.3266   0.4120   0.992007   1.01855
 Water.Temp   0.5244   0.6193   2.264165   2.40956
 Acid.Conc.  -0.4629  -0.4101  -0.007724   0.04459
Correlation of Replicates:
            (Intercept) Air.Flow Water.Temp Acid.Conc.
(Intercept)     1.00000 -0.17636    0.09902   -0.80236
   Air.Flow    -0.17636  1.00000   -0.78822   -0.07635
 Water.Temp     0.09902 -0.78822    1.00000   -0.24463
 Acid.Conc.    -0.80236 -0.07635   -0.24463    1.00000
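The bootstrap output can be imitated outside S-Plus. The following is a minimal Python sketch, assuming that the S-Plus call resamples whole rows of the data with replacement and refits the regression each time. The 21 data rows are typed in from the standard stack-loss data set and should be checked against your own copy. The function names (`solve`, `ols`, `bootstrap`, `percentile`, `correlation`) are illustrative. Only the simple empirical percentiles are computed, not the BCa adjustment, and the results will not match the slide digit for digit because the random numbers differ.

```python
import random
import statistics

# Stack-loss data, 21 days: (Air.Flow, Water.Temp, Acid.Conc., stack.loss).
# Assumption: values typed in from the standard data set; verify before use.
STACK = [
    (80, 27, 89, 42), (80, 27, 88, 37), (75, 25, 90, 37),
    (62, 24, 87, 28), (62, 22, 87, 18), (62, 23, 87, 18),
    (62, 24, 93, 19), (62, 24, 93, 20), (58, 23, 87, 15),
    (58, 18, 80, 14), (58, 18, 89, 14), (58, 17, 88, 13),
    (58, 18, 82, 11), (58, 19, 93, 12), (50, 18, 89, 8),
    (50, 18, 86, 7), (50, 19, 72, 8), (50, 19, 79, 8),
    (50, 20, 80, 9), (56, 20, 82, 15), (70, 20, 91, 15),
]
NAMES = ["(Intercept)", "Air.Flow", "Water.Temp", "Acid.Conc."]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows):
    """Least-squares coefficients (intercept first) from the normal equations."""
    xs = [[1.0, r[0], r[1], r[2]] for r in rows]
    ys = [r[3] for r in rows]
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(4)] for i in range(4)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(4)]
    return solve(xtx, xty)


def bootstrap(rows, n_boot=1000, seed=0):
    """Refit the regression on n_boot resamples of the rows (with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    return [ols([rows[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)]


def percentile(values, q):
    """Crude empirical quantile: the order statistic at position q * len."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


observed = ols(STACK)
reps = bootstrap(STACK)
for j, name in enumerate(NAMES):
    col = [r[j] for r in reps]
    bias = statistics.mean(col) - observed[j]  # Mean of replicates minus Observed
    se = statistics.stdev(col)                 # bootstrap standard error
    print(f"{name:12s} observed={observed[j]:9.4f} bias={bias:8.4f} SE={se:7.4f} "
          f"2.5%={percentile(col, 0.025):8.4f} 97.5%={percentile(col, 0.975):8.4f}")
```

In the slide's Summary Statistics table, Bias is Mean minus Observed, and SE is the standard deviation of the replicates. The bootstrap SE of the intercept (9.3731) is smaller than the normal-theory standard error from the regression summary (11.8960), while those for Air.Flow and Water.Temp are larger.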
Histograms of regression coefficients
[Figure: four density histograms of the replicates, one panel each for (Intercept), Air.Flow, Water.Temp, and Acid.Conc.; x-axis: Value, y-axis: Density]
QQ Plots of regression coefficients
[Figure: four QQ plots of the replicates, one panel each for (Intercept), Air.Flow, Water.Temp, and Acid.Conc.; y-axis: Quantiles of Replicates]
