
Some definitions

◦ Individual: each object described by a set of data


◦ Variable: any characteristic of an individual
⋄ Categorical variable: places an individual into one of several
groups or categories.
⋄ Quantitative variable: takes numerical values on which we can
do arithmetic.
◦ Distribution of a variable: tells what values it takes and how often
it takes these values.

Example:
The following data set consists of five variables about 20 individuals.
ID Age Education Sex Total income Job class
1 43 4 1 18526 5
2 35 3 2 5400 7
3 43 2 1 3900 7
4 33 3 1 28003 5
5 38 3 2 43900 7
6 53 4 1 53000 5
7 64 6 1 51100 6
8 27 4 2 44000 5
9 34 4 1 31200 5
10 27 3 2 26030 5
11 47 6 1 6000 6
12 48 3 1 8145 5
13 39 2 1 37032 5
14 30 3 2 30000 5
15 35 3 2 17874 5
16 47 4 2 400 5
17 51 4 2 22216 5
18 56 5 1 26000 6
19 57 6 1 100267 7
20 34 1 1 15000 5

Age: age in years


Education: 1=no high school, 2=some high school, 3=high school diploma,
4=some college, 5=bachelor’s degree, 6=postgraduate degree
Sex: 1=male, 2=female
Total income: income from all sources
Job class: 5=private sector, 6=government, 7=self employed
Variables Age and Total income are quantitative; variables Education, Sex,
and Job class are categorical.
Graphical Description of Data, Jan 5, 2004 -1-
Categorical variable analysis

Questions to ask about a categorical variable:


◦ How many categories are there?
◦ In each category, how many observations are there?

Bar graphs and pie charts


Categorical data can be displayed by bar graphs or pie charts.
◦ In a bar graph, the horizontal axis lists the categories, in any order.
The height of the bars can be either counts or percentages.
◦ For better comparison of the frequencies, the categories can be ordered
from most frequent to least frequent.
◦ In a pie chart, the area of each slice is proportional to the percentage
of individuals who fall into that category.

Example: Education of people aged 25 to 34


[Figures: two bar graphs of the percent of people aged 25 to 34 at each education level (in category order and sorted by frequency), and a pie chart of the same distribution with labelled percentages; categories: no HS, some HS, HS diploma, some college, Bachelor's, postgrad.]

Graphical Description of Data, Jan 5, 2004 -2-


Categorical variable analysis

Example: Education of people aged 25 to 34


STATA commands:

. infile ID AGE EDUC SEX EARN JOB using individuals.txt, clear


. drop if AGE<25 | AGE>34
. label values EDUC Education
. label define Education 1 "no HS" 2 "some HS" 3 "HS diploma" 4 "Bachelor’s"
> 5 "some college" 6 "postgrad"
. set scheme s1mono
. gen COUNT=100/_N
. graph bar (sum) COUNT, over(EDUC) ytitle("Percent of people aged 25 to 34")
> b1title("Education level")
. translate @Graph bar1.eps, translator(Graph2eps) replace
. graph bar (sum) COUNT, over(EDUC, sort(1) descending)
> ytitle("Percent of people aged 25 to 34") b1title("Education level")
. translate @Graph bar2.eps, translator(Graph2eps) replace
. set scheme s1color
. graph pie COUNT, over(EDUC) plabel(_all perc, format(%4.1f) gap(-5))
. translate @Graph pie.eps, translator(Graph2eps) replace

Graphical Description of Data, Jan 5, 2004 -3-


Quantitative variables: stemplots

Example: Sammy Sosa home runs

Year Home runs
1989 4
1990 15
1991 10
1992 8
1993 33
1994 25
1995 36
1996 40
1997 36
1998 66
1999 63
2000 50
2001 64
2002 49
2003 40

Producing stemplots in STATA:

. infile YEAR HR using sosa.dat
. stem HR

Stem-and-leaf plot for HR

0* | 48
1* | 05
2* | 5
3* | 366
4* | 009
5* | 0
6* | 346

How to make a stemplot

1. Separate each observation into a stem and a leaf.

e.g. 15 → stem 1, leaf 5, and 4 → stem 0, leaf 4

2. Write the stems in a vertical column in increasing order.

3. Write each leaf next to its stem, in increasing order out from the stem.

How to choose the stem


◦ Rounding: each leaf should have exactly one digit, so rounding long
numbers before producing the stemplot can help produce a more com-
pact and informative plot.
◦ Splitting: if each stem (or many stems) has a large number of leaves,
all stems can be split, with leaves 0-4 going to the first stem and 5-9
going to the second.
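
As an illustration (a minimal Python sketch, not part of the original notes), the stemplot of the Sosa home-run counts above can be built by splitting each value into a tens-digit stem and a ones-digit leaf:

# Stem-and-leaf sketch for the Sosa home-run data.
hr = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]
stems = {}
for x in sorted(hr):
    stems.setdefault(x // 10, []).append(x % 10)   # split into stem and leaf
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(d) for d in stems.get(stem, []))
    print(f"{stem}* | {leaves}")

This reproduces the rows "0* | 48", "1* | 05", ... of the STATA output shown above.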

Graphical Description of Data, Jan 5, 2004 -4-


Quantitative variables: histograms

How to make a histogram

1. Group the observations into “bins” according to their value. Choose
the bins carefully: too few bins hide detail, too many fragment the pattern.

2. Count the individuals in each bin.

3. Draw the histogram


◦ Leave no space between bars.
◦ Label the axes with units of measurement.
◦ The y-axis can be counts or percentages (per unit width).

Example: Sammy Sosa home runs


Year Home runs
1989 4
1990 15
1991 10
1992 8
1993 33
1994 25
1995 36
1996 40
1997 36
1998 66
1999 63
2000 50
2001 64
2002 49
2003 40

[Figure: density histogram of home runs (bins of width 10 from 0 to 70, density 0 to .04).]

The area of each bar is proportional to the percentage of data in that range.
We care about the area, not the height, but when the bars have equal width,
the area is determined by the height.
For simplicity, use equally spaced bins.

Graphical Description of Data, Jan 5, 2004 -5-


Quantitative variables: histograms

Example: Sammy Sosa home runs


Histograms with different bin widths:
[Figure: four histograms of the Sosa home runs with different bin widths; vertical axis shows percentage (0.00 to 0.07), horizontal axis shows home runs (0 to 70).]

Producing histograms in STATA:

. infile YEAR HR using sosa.dat


. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs)
. translate @Graph hist1.eps, translator(Graph2eps) replace
. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs) freq
. translate @Graph hist2.eps, translator(Graph2eps) replace
[Figure: the resulting density histogram (left) and frequency histogram (right) of home runs.]

Why is a histogram not a bar graph?


◦ Frequencies are represented by area, not height.
◦ There is no space between the bars.
◦ The horizontal axis represents a numerical quantity, with an inherent
order.

Graphical Description of Data, Jan 5, 2004 -6-


Interpreting histograms

◦ Describe the overall pattern and any significant deviations from that
pattern.
◦ Shape: Is the distribution (approximately) symmetric or skewed?
[Figure: a right-skewed histogram of x (frequency against x, 0.0 to 2.0).]

This distribution is skewed right because it has a long right-hand tail.

◦ Center: Where is the “middle” of the distribution?


◦ Spread: What are the smallest and largest values?
◦ Outliers: Are there any observations that lie outside the overall pat-
tern? They could be unusual observations, or they could be mistakes.
Check them!

Example: Newcomb’s measurements of the passage time of light (IPS Table 1.1)

[Figure: histogram of Newcomb’s measurements (frequency against time, −60 to 60).]

Graphical Description of Data, Jan 5, 2004 -7-


Time plots

Example: Average retail price of gasoline from Jan 1988 to Apr 2001

[Figure: time plot of the retail gasoline price (0.9 to 1.8) against year, 1988 to 2000.]

Note: Whenever data are collected over time, it is a good idea to have
a time plot. Stemplots and histograms ignore time order, which can be
misleading when systematic change over time exists.

Producing a time plot in STATA:

. infile PRICE using gasoline.txt, clear


. graph twoway line PRICE T, ylabel(0.9(0.1)1.8, format(%3.1f)) xtick(0(12)159)
> xlabel(0 "1988" 24 "1990" 48 "1992" 72 "1994" 96 "1996" 120 "1998" 144 "2000")
> xtitle(Year) ytitle(Retail gasoline price)

Graphical Description of Data, Jan 5, 2004 -8-


Measures of center

The mean
The mean of a distribution is the arithmetic average of the obser-
vations:
x̄ = (x1 + · · · + xn)/n = (1/n) Σ_{i=1}^n xi

The median
The median is the midpoint of a distribution: the number M
such that
◦ half the observations are smaller and
◦ half are larger.

How to find the median


Suppose the observations are x1, x2, . . . , xn.

1. Arrange the data in increasing order and let x(i) denote the ith
smallest observation.
2. If the number of observations n is odd, the median is the center
observation in the ordered list:

M = x((n+1)/2)

3. If the number of observations n is even, the median is the average
of the two center observations in the ordered list:

M = (x(n/2) + x(n/2+1))/2
Numerical Description of Data, Jan 7, 2004 -1-
Measures of center

Examples:
Data set 1:
x1 x2 x3 x4 x5 x6 x7 x8 x9
2 4 3 4 6 5 4 -6 5

Arrange in increasing order:


x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)
-6 2 3 4 4 4 5 5 6
There is an odd number of observations, so the median is

M = x((n+1)/2) = x(5) = 4.

The mean is given by


x̄ = (2 + 4 + 3 + 4 + 6 + 5 + 4 + (−6) + 5)/9 = 27/9 = 3.

Data set 2:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
2.3 8.8 3.9 4.1 6.4 5.9 4.2 2.9 1.3 5.1

Arrange in increasing order:


x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10)
1.3 2.3 2.9 3.9 4.1 4.2 5.1 5.9 6.4 8.8
There is an even number of observations, so the median is
M = (x(n/2) + x(n/2+1))/2 = (x(5) + x(6))/2 = (4.1 + 4.2)/2 = 4.15.

The mean is given by

x̄ = (2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1)/10 = 44.9/10 = 4.49.
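
As a quick check (a hypothetical Python snippet, not part of the original notes), the standard library reproduces both worked examples:

import statistics

data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]
# Odd n: the median is the middle ordered value; even n: the average of the two middle values.
print(statistics.mean(data1), statistics.median(data1))   # 3  4
print(statistics.mean(data2), statistics.median(data2))   # 4.49  4.15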

Numerical Description of Data, Jan 7, 2004 -2-


Mean versus median

◦ The mean is easy to work with algebraically, while the median


is not.
◦ The mean is sensitive to extreme observations, while the median
is more robust.

Example:

Observations 0, 1, 2; the largest observation is then changed to 10.

[Figure: dotplots of the original and modified observations on a 0-10 axis.]

The original mean and median are

x̄ = (0 + 1 + 2)/3 = 1 and M = x((n+1)/2) = 1

The modified mean and median are

x̄ = (0 + 1 + 10)/3 = 11/3 = 3 2/3 and M = x((n+1)/2) = 1
◦ If the distribution is exactly symmetric, then mean=median.
◦ In a skewed distribution, the mean is further out in the longer
tail than the median.
◦ The median is preferable for strongly skewed distributions, or
when outliers are present.

Numerical Description of Data, Jan 7, 2004 -3-


Measures of spread

Example: Monthly returns on two stocks


[Figure: histograms of the daily returns (in %) of Stock A and Stock B, −10 to 20.]

Stock A Stock B
Mean 4.95 4.82
Median 4.99 4.68
The distributions of the two stocks have approximately the same
mean and median, but stock B is more volatile and thus more risky.

◦ Measures of center alone are an insufficient description of a


distribution and can be misleading
◦ The simplest useful numerical description of a distribution con-
sists of both a measure of center and a measure of spread.

Common measures of spread are


◦ the quartiles and the interquartile range
◦ the standard deviation

Numerical Description of Data, Jan 7, 2004 -4-


Quartiles

Quartiles divide data into 4 even parts


◦ Lower (or first) quartile QL :
median of all observations less than the median M
◦ Middle (or second) quartile M = QM :
median of all observations
◦ Upper (or third) quartile QU :
median of all observations greater than the median M
◦ Interquartile range: IQR = QU − QL
distance between upper and lower quartile

How to find the quartiles

1. Arrange the data in increasing order and find the median M

2. Find the median of the observations to the left of M; this is the lower
quartile, QL

3. Find the median of the observations to the right of M; this is the
upper quartile, QU

Examples:
Data set:
x1 x2 x3 x4 x5 x6 x7 x8 x9
2 4 3 4 6 5 4 -6 5

Arrange in increasing order:


x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)
-6 2 3 4 4 4 5 5 6
◦ QL is the median of {−6, 2, 3, 4}: QL = 2.5
◦ QU is the median of {4, 5, 5, 6}: QU = 5
◦ IQR = 5 − 2.5 = 2.5
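
The quartile rule above (split the ordered list at the median, leaving the median itself out when n is odd) can be sketched in Python; this illustrates the convention used in these notes and is not part of the original slides:

import statistics

def quartiles(data):
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]            # observations to the left of the median
    upper = xs[(n + 1) // 2 :]      # observations to the right of the median
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

ql, m, qu = quartiles([2, 4, 3, 4, 6, 5, 4, -6, 5])
print(ql, m, qu, qu - ql)           # 2.5  4  5  2.5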
Numerical Description of Data, Jan 7, 2004 -5-
Percentiles

More generally we might be interested in the value which is ex-


ceeded only by a certain percentage of observations:

The pth percentile of a set of observations is the value such that


◦ p% of the observations are less than or equal to it and
◦ (100 − p)% of the observations are greater than or equal to it.

How to find the percentiles

1. Arrange the data into increasing order.


2. If np/100 is not an integer, then x(k+1) is the pth percentile,
where k is the largest integer less than np/100.
3. If np/100 is an integer, the pth percentile is the average of
x(np/100) and x(np/100+1).
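
A small Python sketch of the percentile rule just stated (illustrative, not part of the original notes); note that software packages use a variety of slightly different percentile conventions:

def percentile(data, p):
    xs = sorted(data)
    k = len(xs) * p / 100
    if k != int(k):
        return xs[int(k)]                 # x_(k+1), with 1-based indexing
    k = int(k)
    return (xs[k - 1] + xs[k]) / 2        # average of x_(np/100) and x_(np/100 + 1)

data = [-6, 2, 3, 4, 4, 4, 5, 5, 6]
print(percentile(data, 50))               # 4, the median
print(percentile(data, 20))               # 2 (np/100 = 1.8, so take the 2nd smallest value)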

Five-number summary

A numerical summary of a distribution {x1, . . . , xn} is given by

x(1) QL M QU x(n)

A simple boxplot is a graph of the five-number summary.

Numerical Description of Data, Jan 7, 2004 -6-


Boxplots

A common “rule” for discovering outliers is the 1.5 × IQR rule:


An observation is a suspected outlier if it falls more
than 1.5 × IQR below QL or above QU .

How to draw a boxplot (box-and-whisker plot)

1. A box is drawn from the lower to the upper quartile (QL and QU ).

2. The median of the data is shown by a line in the box.

3. Lines (the whiskers) are drawn from the ends of the box to the most
extreme observations within a distance of 1.5 IQR (interquartile range).

4. Measurements falling outside 1.5 IQR from the ends of the box are
potential outliers and are marked by ◦ or ∗.

[Figure: side-by-side boxplots of the Stock A and Stock B returns, −10 to 20.]

Plotting a boxplot with STATA:
. infile A B using stocks.txt, clear
. label var A "Stock A"
. label var B "Stock B"
. graph box A B, xsize(2) ysize(5)

Numerical Description of Data, Jan 7, 2004 -7-


Boxplots

Interpretation of Box Plots


◦ The IQR is a measure of the sample’s variability.
◦ If the whiskers differ in length, the distribution of the data is
probably skewed in the direction of the longer whisker.
◦ Very extreme observations (more than 3 IQR away from the
lower or upper quartile, respectively) are outliers, with one of the following
explanations:
a) The measurement is incorrect (error in measurement process
or data processing).
b) The measurement belongs to a different population.
c) The measurement is correct, but represents a rare (chance)
event.
We accept the last explanation only after carefully ruling out
all others.

Numerical Description of Data, Jan 7, 2004 -8-


Variance and standard deviation

Suppose there are n observations x1, x2, . . . , xn.

The variance of the n observations is:

s² = [(x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)²] / (n − 1)
   = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²

This is (approximately) the average of the squared distances of the
observations from the mean.

The standard deviation is:

s = √s² = √[ (1/(n − 1)) Σ_{i=1}^n (xi − x̄)² ]

Why n − 1?
Division by n − 1 instead of n in the variance calculation is a
common cause of confusion. Why n − 1? Note that
Σ_{i=1}^n (xi − x̄) = 0

Thus, if you know any n − 1 of the differences, the last difference


can be determined from the others. The number of “freely varying”
observations, n − 1 in this case, is called the “degrees of freedom”.
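
For example, a short Python check of the n − 1 convention (a hypothetical snippet, not part of the original notes), using the data set from the median example:

import statistics

x = [2, 4, 3, 4, 6, 5, 4, -6, 5]
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)    # divide by n - 1, not n
print(s2, s2 ** 0.5)                                # 12.75 and about 3.57
print(statistics.variance(x), statistics.stdev(x))  # the library uses the same n - 1 divisor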

Numerical Description of Data, Jan 7, 2004 -9-


Properties of s

◦ Measures spread around the mean =⇒ use only if the mean


is used as a measure of center.
◦ s = 0 ⇔ all observations are the same
◦ s is in the same units as the measurements, while s2 is in the
square of these units.
◦ s, like x̄, is not resistant to outliers.

Five-number summary versus standard deviation

◦ The 5-number summary is better for describing skewed distri-


butions, since each side has a different spread.
◦ x̄ and s are preferred for symmetric distributions with no out-
liers.

Numerical Description of Data, Jan 7, 2004 - 10 -


Histograms and density curves

What’s in our toolkit so far?


◦ Plot the data: histogram (or stemplot)
◦ Look for the overall pattern and identify deviations and outliers
◦ Numerical summary to briefly describe center and spread

A new idea:
If the pattern is sufficiently regular, approximate it with a
smooth curve.

Any curve that is always on or above the horizontal axis and has
total area underneath equal to one is a density curve.
◦ Area under the curve in a range of values indicates the propor-
tion of values in that range.
◦ Density curves come in a variety of shapes, but the “normal” family of familiar
bell-shaped densities is commonly used.
◦ Remember the density is only an approximation, but it sim-
plifies analysis and is generally accurate enough for practical
use.

The Normal Distribution, Jan 9, 2004 -1-


Examples

[Figures: density histogram of sulfur oxide emissions (in tons, 0 to 40); the same histogram with a shaded region of area 0.29; a fitted density curve with the corresponding shaded area under the curve equal to 0.30; and a density histogram of the waiting time between eruptions (40 to 100 min).]

The Normal Distribution, Jan 9, 2004 -2-


Median and mean of a density curve

Median:
The equal-areas point with 50% of the “mass” on either side.

Mean:
The balancing point of the curve, if it were a solid mass.

Note:
◦ The mean and median of a symmetric density curve are equal.
◦ The mean of a skewed curve is pulled away from the median in
the direction of the long tail.

The mean and standard deviation of a density are denoted µ and


σ, rather than x̄ and s, to indicate that they refer to an idealized
model, and not actual data.

The Normal Distribution, Jan 9, 2004 -3-


Normal distributions: N (µ, σ)

The normal distribution is


◦ symmetric,
◦ single-peaked,
◦ bell-shaped.

The density curve is given by


 
f(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)).

It is determined by two parameters µ and σ:


◦ µ is the mean (also the median)
◦ σ is the standard deviation

Note: The point where the curve changes from concave to convex
is σ units from µ in either direction.

The Normal Distribution, Jan 9, 2004 -4-


The 68-95-99.7 rule

◦ About 68% of the data fall inside (µ − σ, µ + σ).


◦ About 95% of the data fall inside (µ − 2σ, µ + 2σ).
◦ About 99.7% of the data fall inside (µ − 3σ, µ + 3σ).

The Normal Distribution, Jan 9, 2004 -5-


Example

Scores on the Wechsler Adult Intelligence Scale (WAIS) for the 20


to 34 age group are approximately N (110, 25).

◦ About what percent of people in this age group have scores


above 110?

◦ About what percent have scores above 160?

◦ In what range do the middle 95% of all scores lie?

The Normal Distribution, Jan 9, 2004 -6-


Standardization and z-scores

Linear transformation of normal distributions:

X ∼ N(µ, σ) ⇒ aX + b ∼ N(aµ + b, |a|σ)

In particular it follows that

(X − µ)/σ ∼ N(0, 1).

N(0, 1) is called the standard normal distribution.

For a real number x the standardized value or z-score


z = (x − µ)/σ
tells how many standard deviations x is from µ, and in what di-
rection.
Standardization enables us to use a standard normal table to find
probabilities for any normal variable.

For example:
◦ What is the proportion of N (0, 1) observations less than 1.2?
◦ What is the proportion of N (3, 1.5) observations greater than 5?
◦ What is the proportion of N (10, 5) observations between 3 and 9?
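
These proportions can be checked numerically; the sketch below uses Python’s statistics.NormalDist and is illustrative only, not part of the original notes:

from statistics import NormalDist

std = NormalDist(0, 1)
print(std.cdf(1.2))                                   # N(0, 1) below 1.2: about 0.88
print(1 - std.cdf((5 - 3) / 1.5))                     # N(3, 1.5) above 5: about 0.09
print(std.cdf((9 - 10) / 5) - std.cdf((3 - 10) / 5))  # N(10, 5) between 3 and 9: about 0.34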

The Normal Distribution, Jan 9, 2004 -7-


Normal calculations

Standard normal calculations

1. State the problem in terms of x.


2. Standardize: z = (x − µ)/σ.

3. Look up the required value(s) on the standard normal table.


4. Reality check: Does the answer make sense?

Backward normal calculations


We can also calculate the values, given the probabilities:
If MPG ∼ N (25.7, 5.88), what is the minimum MPG required to be in the
top 10%?

“Backward” normal calculations


1. State the problem in terms of the probability of being less
than some number.
2. Look up the required value(s) on the standard normal table.
3. “Unstandardize,” i.e. solve z = (x − µ)/σ for x.
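
For the MPG question above, a hypothetical Python sketch of the backward calculation (not part of the original notes):

from statistics import NormalDist

# Top 10% means the 90th percentile of N(25.7, 5.88).
z = NormalDist().inv_cdf(0.90)          # about 1.28
print(25.7 + z * 5.88)                  # "unstandardize": about 33.2 MPG

The same value comes directly from NormalDist(25.7, 5.88).inv_cdf(0.90).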

The Normal Distribution, Jan 9, 2004 -8-


Example

Suppose X ∼ N (0, 1).


◦ P(X ≤ 2) = ?
◦ P(X > 2) = ?
◦ P(−1 ≤ X ≤ 2) = ?
◦ Find the value z such that
⋄ P(X ≤ z) = 0.95
⋄ P(X > z) = 0.99
⋄ P(−z ≤ X < z) = 0.68
⋄ P(−z ≤ X < z) = 0.95
⋄ P(−z ≤ X < z) = 0.997
Suppose X ∼ N (10, 5).
◦ P(X < 5) = ?
◦ P(−3 < X < 5) = ?
◦ P(−x < X < x) = 0.95

The Normal Distribution, Jan 9, 2004 -9-


Assessing Normality

How to make a normal quantile plot


1. Arrange the data in increasing order.
2. Record the percentiles (1/n, 2/n, . . . , n/n).
3. Find the z-scores for these percentiles.
4. Plot x on the vertical axis against z on the horizontal axis.
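
A minimal Python sketch of these steps (illustrative, not part of the original notes); the (i − 0.5)/n offset is a common adjustment that avoids an infinite z-score for the largest observation:

from statistics import NormalDist

data = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]   # Sosa home runs
xs = sorted(data)
n = len(xs)
zs = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
for z, x in zip(zs, xs):
    print(f"{z:6.2f}  {x}")     # plot x (vertical) against z (horizontal)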

Use of normal quantile plots


◦ If the data are (approximately) normal, the plot will be close
to a straight line.
◦ Systematic deviations from a straight line indicate a nonnormal
distribution.
◦ Outliers appear as points that are far away from the overall
pattern of the plot.
[Figure: normal quantile plots (sample quantiles against theoretical quantiles, −3 to 3) for samples from N(0, 1), Exp(1), and U(0, 1).]

The Normal Distribution, Jan 9, 2004 - 10 -


Density Estimation

The normal density is just one possible density curve. There are
many others, some with compact mathematical formulas and many
without.

Density estimation software fits an arbitrary density to data to give


a smooth summary of the overall pattern.

[Figure: estimated density of the velocity of galaxies (in 1000 km/s), 0 to 40.]

The Normal Distribution, Jan 9, 2004 - 11 -


Histogram

How to scale a histogram?

◦ Easiest way to draw a histogram:
⋄ equally spaced bins
⋄ counts on the vertical axis

[Figure: frequency histogram of the Sosa home runs (counts 0 to 5).]

Disadvantage: Scaling depends on the number of observations and
the bin width.

◦ Scale the histogram so that the area of each bar corresponds to the
proportion of data:

height = counts / (width · total number)

[Figure: density histogram of the Sosa home runs (density 0.00 to 0.04).]

Proportion of data in interval (0, 10]:

height · width = 0.02 · 10 = 0.2 = 20%

Since n = 15 this corresponds to 3 observations.
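
For example, the density heights of all bins can be computed directly (a hypothetical Python sketch, not part of the original notes):

hr = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]
width, total = 10, len(hr)
for left in range(0, 70, width):
    count = sum(left < x <= left + width for x in hr)
    # height = counts / (width * total number), so that area = proportion of data
    print(f"({left},{left + width}]: count={count}, height={count / (width * total):.3f}")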

The Normal Distribution, Jan 9, 2004 - 12 -


Density curves

[Figure: density histograms of samples of size n = 250, n = 2,500, and n = 250,000 from the same distribution (x from −4 to 4), together with the limiting density curve as n → ∞.]

Proportion of data in (1, 2]:

#{xi : 1 < xi ≤ 2} / n  →  ∫_1^2 f(x) dx   as n → ∞

Probability that a new observation X falls into [a, b]:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx = lim_{n→∞} #{xi : a < xi ≤ b} / n

The Normal Distribution, Jan 9, 2004 - 13 -


Relationships between data

Example: Smoking and mortality


◦ Data from 25 occupational groups
(condensed from data on thousands of individual men)
◦ Smoking (100 = average number of cigarettes per day)
◦ Mortality ratio for deaths from lung cancer
(100 = average ratio for all English men)

Scatter plot of the data:

[Figure: scatter plot of the mortality index (60 to 140) against the smoking index (70 to 130) for the 25 occupational groups.]

In STATA:
. insheet using smoking.txt
. graph twoway scatter mortality smoking

Scatterplots and correlation, Jan 12, 2004 -1-


Relationship between data

Assessing a scatter plot:

◦ What is the overall pattern?


⋄ form of the relationship?
⋄ direction of the relationship?
⋄ strength of the relationship?
◦ Are there any deviations (e.g. outliers) from these patterns?

Direction of relationship/association:

◦ positive association: above-average values of both variables


tend to occur together, and the same for below-average values
◦ negative association: above-average values of one variable
tend to occur with below-average values of the other, and vice
versa.

Strength of relationship/association:

◦ determined by how closely the points follow the overall pattern
◦ difficult to assess by eye, so a numerical measure is helpful

Scatterplots and correlation, Jan 12, 2004 -2-


Correlation

Correlation is a numerical measure of the direction and strength


of the linear relationship between two quantitative variables.

The sample correlation r is defined as


rxy = sxy / √(sx sy),

where

sx = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²,
sy = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)²,
sxy = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)(yi − ȳ).

Properties:
◦ dimensionless quantity
◦ not affected by linear transformations:
for x′i = a xi + b and y′i = c yi + d with a, c > 0,

rx′y′ = rxy

◦ −1 ≤ rxy ≤ 1
◦ rxy = 1 if and only if yi = a xi + b for some a > 0 and b
◦ measures linear association between xi and yi
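
The definition translates directly into code; the following Python sketch (illustrative, not part of the original notes) computes r for the food expenditure data used in the regression example later on:

def correlation(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    sy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    return sxy / (sx * sy) ** 0.5

income = [28, 26, 32, 24, 54, 59, 44, 30, 40, 82, 42, 58, 28, 20, 42, 47, 112, 85, 31, 26]
food = [5.2, 5.1, 5.6, 4.6, 11.3, 8.1, 7.8, 5.8, 5.1, 18.0,
        4.9, 11.8, 5.2, 4.8, 7.9, 6.4, 20.0, 13.7, 5.1, 2.9]
print(correlation(income, food))   # about 0.95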

Scatterplots and correlation, Jan 12, 2004 -3-


Correlation

[Figure: eight scatter plots of y against x illustrating correlations ρ = −0.9, −0.6, −0.3, 0, 0.3, 0.6, 0.9, and 0.99.]

Scatterplots and correlation, Jan 12, 2004 -4-


Introduction to regression

Regression describes how one variable (response) depends on


another variable (explanatory variable).
◦ Response variable: variable of interest, measures the out-
come of a study
◦ Explanatory variable: explains (or even causes) changes in
response variable

Examples:
◦ Hearing difficulties:
response - sound level (decibels), explanatory - age (years)
◦ Real estate market:
response - listing price ($), explanatory - house size (sq. ft.)
◦ Salaries:
response - salary ($), explanatory - experience (years), educa-
tion, sex

Least squares regression, Jan 14, 2004 -1-


Introduction to regression

Example: Food expenditures and income


Data: Sample of 20 households

[Figure: scatter plot of food expenditure against income for the 20 households.]

Questions:
◦ How does food expenditure (Y ) depend on income (X)?
◦ Suppose we know that X = x0, what can we tell about Y ?

Linear regression:
If the response Y depends linearly on the explanatory variable
X, we can use a straight line (regression line) to predict Y
from X.

Least squares regression, Jan 14, 2004 -2-


Least squares regression

How to find the regression line

[Figure: scatter plot of food expenditure against income with the fitted regression line; a zoomed-in panel shows an observed y, the predicted ŷ on the line, and the difference y − ŷ.]

Since we intend to predict Y from X, the errors of interest are


mispredictions of Y for fixed X.

The least squares regression line of Y on X is the line that


minimizes the sum of squared errors.

For observations (x1, y1), . . . , (xn, yn ), the regression line is given


by

Ŷ = a + b X

where

b = r sy/sx and a = ȳ − b x̄

(r correlation coefficient, sx, sy standard deviations, x̄, ȳ means)

Least squares regression, Jan 14, 2004 -3-


Least squares regression

Example: Food expenditure and income


X 28 26 32 24 54 59 44 30 40 82
Y 5.2 5.1 5.6 4.6 11.3 8.1 7.8 5.8 5.1 18.0

X 42 58 28 20 42 47 112 85 31 26
Y 4.9 11.8 5.2 4.8 7.9 6.4 20.0 13.7 5.1 2.9

The summary statistics are:


◦ x̄ = 45.50 ◦ sx = 23.96 ◦ r = 0.946
◦ ȳ = 7.97 ◦ sy = 4.66

The regression coefficients are:


b = r sy/sx = 0.946 · 4.66 / 23.96 = 0.184

a = ȳ − b x̄ = 7.97 − 0.184 · 45.5 = −0.402
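
The same numbers can be reproduced from the summary statistics with a couple of lines of Python (a hypothetical sketch, not part of the original notes):

r, sx, sy = 0.946, 23.96, 4.66     # summary statistics from this slide
xbar, ybar = 45.50, 7.97
b = r * sy / sx                    # slope
a = ybar - b * xbar                # intercept
print(round(b, 3), round(a, 3))    # 0.184  -0.402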

[Figure: scatter plot of food expenditure against income with the fitted regression line.]

Least squares regression, Jan 14, 2004 -4-


Interpreting the regression model

◦ The response in the model is denoted Ŷ to indicate that these


are predicted Y values, not the true Y values. The “hat” de-
notes prediction.
◦ The slope of the line indicates how much Ŷ changes for a unit
change in X.
◦ The intercept is the value of Ŷ for X = 0. It may or may not have
a physical interpretation, depending on whether or not X can
take values near 0.
◦ To make a prediction for an unobserved X, just plug it in and
calculate Ŷ .
◦ Note that the line need not pass through the observed data
points. In fact, it often will not pass through any of them.

Least squares regression, Jan 14, 2004 -5-


Regression and correlation

Correlation analysis:
We are interested in the joint distribution of two (or more)
quantitative variables.

Example: Heights of 1,078 fathers and sons

[Figure: scatter plot of son’s height against father’s height (both 58 to 80 inches) for the 1,078 father-son pairs.]

Points are scattered around the SD line:


◦ (y − ȳ) = (sy/sx)(x − x̄)
◦ goes through the center (x̄, ȳ)
◦ has slope sy/sx
The correlation r measures how much the points spread around
the SD line.

Least squares regression, Jan 14, 2004 -6-


Regression and correlation

Regression analysis:
We are interested how the distribution of one response variable
depends on one (or more) explanatory variables.

Example: Heights of 1,078 fathers and sons


[Figure: scatter plot of son’s height against father’s height, together with histograms of son’s height for fathers of height 64, 68, and 72 inches.]

In each vertical strip, the points are distributed around the regression line.

Least squares regression, Jan 14, 2004 -7-


Properties of least squares regression

◦ The distinction between explanatory and response variables is


essential. Looking at vertical deviations means that changing
the axes would change the regression line.
[Figure: heights scatter plot with the two regression lines ŷ = a + bx and x̂ = a′ + b′y, which are different lines.]

◦ A change of 1 sd in X corresponds to a change of r sds in Y .


◦ The least squares regression line always passes through the
point (x̄, ȳ).
◦ r2 (the square of the correlation) is the fraction of the variation
in the values of y that is explained by the least squares regres-
sion on x.
When reporting the results of a linear regression,
you should report r2.
These properties depend on the least-squares fitting criterion and
are one reason why that criterion is used.

Least squares regression, Jan 14, 2004 -8-


The regression effect

Regression effect
In virtually all test-retest situations, the bottom group on the
first test will on average show some improvement on the sec-
ond test - and the top group will on average fall back. This is
the regression effect. The statistician and geneticist Sir Fran-
cis Galton (1822-1911) called this effect “regression to medi-
ocrity”.

[Figure: scatter plot of son’s height against father’s height illustrating the regression effect.]

Regression fallacy
Thinking that the regression effect must be due to something
important, not just the spread around the line, is the regression
fallacy.

Least squares regression, Jan 14, 2004 -9-


Regression in STATA
. infile food income size using food.txt
. graph twoway scatter food income || lfit food income, legend(off)
> ytitle(food)
. regress food income
Source | SS df MS Number of obs = 20
------------+------------------------------ F( 1, 18) = 151.97
Model | 369.572965 1 369.572965 Prob > F = 0.0000
Residual | 43.7725361 18 2.43180756 R-squared = 0.8941
------------+------------------------------ Adj R-squared = 0.8882
Total | 413.345502 19 21.7550264 Root MSE = 1.5594
---------------------------------------------------------------------------
food | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+--------------------------------------------------------------
income | .1841099 .0149345 12.33 0.000 .1527336 .2154862
_cons | -.4119994 .7637666 -0.54 0.596 -2.016613 1.192615
---------------------------------------------------------------------------

[Figure: scatter plot of food expenditure against income with the fitted line.]

This graph has been generated using the graphical user interface of STATA.
The complete command is:
. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
> (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
> ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
> labsize(medlarge)) xtitle(Income, size(large)) xscale(range(0 120))
> xlabel(0(20)120, labsize(medlarge)) legend(off) ysize(2) xsize(3)

Least squares regression, Jan 14, 2004 - 10 -


Residual plots

Residuals: difference of observed and predicted values

ei = observed y − predicted y
= yi − ŷi
= yi − (a + b xi)

For a least squares regression, the residuals always have mean zero.
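
For instance, a hypothetical Python sketch (not part of the original notes) computing the residuals of the food expenditure fit for the first few households:

a, b = -0.402, 0.184                      # coefficients from the earlier slide
income = [28, 26, 32, 24, 54]
food = [5.2, 5.1, 5.6, 4.6, 11.3]
residuals = [y - (a + b * x) for x, y in zip(income, food)]
print([round(e, 2) for e in residuals])   # these would be plotted against income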

Residual plot
A residual plot is a scatterplot of the residuals against the
explanatory variable. It is a diagnostic tool to assess the fit of
the regression line.
Patterns to look for:
◦ Curvature indicates that the relationship is not linear.
◦ Increasing or decreasing spread indicates that the prediction
will be less accurate in the range of explanatory variables where
the spread is larger.
◦ Points with large residuals are outliers in the vertical direc-
tion.
◦ Points that are extreme in the x direction are potential high
influence points.
Influential observations are individuals with extreme x values
that exert a strong influence on the position of the regression line.
Removing them would significantly change the regression line.

Least squares regression, Jan 14, 2004 - 11 -


Regression Diagnostics

Example: First data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

The residuals are evenly scattered, with no systematic pattern.

Least squares regression, Jan 14, 2004 - 12 -


Regression Diagnostics

Example: Second data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

The residual plots show curvature: the functional relationship is something other than linear.

Least squares regression, Jan 14, 2004 - 13 -


Regression Diagnostics

Example: Third data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

An outlier: the regression line misfits the majority of the data.

Least squares regression, Jan 14, 2004 - 14 -


Regression Diagnostics

Example: Fourth data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

Heteroscedasticity: the spread of the residuals is not constant.

Least squares regression, Jan 14, 2004 - 15 -


Regression Diagnostics

Example: Fifth data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

One point lies far out in the x direction and is highly influential.

Least squares regression, Jan 14, 2004 - 16 -


The Question of Causation

Example: Are babies brought by the stork?


◦ Data from 54 countries
◦ Variables:
⋄ Birth rate (newborns per 1000 women)
⋄ Number of storks (per 1000 women)

[Figure: scatter plot of the birth rate against the number of storks (per 1000 women) for the 54 countries.]

Model: Birth rate (Y) is proportional to the number of storks (X)

Y = bX + ε

Least squares regression yields for the slope of the regression line

b̂ = 4.3 ± 0.2.

Can we conclude that babies are brought by the stork?

Causation, Jan 16, 2004 -1-


The Question of Causation

A more serious example:


Variables:
◦ Income Y - response
◦ level of education X - explanatory variable
There is a positive association between income and education.
Question: Does better education increase income?

[Diagram: (a) X → Y, a direct causal effect; (b), (c) X and Y both related to a third variable Z (confounding).]

Possible alternative explanation: Confounding


◦ People from prosperous homes are likely to receive many years of edu-
cation and are more likely to have high earnings.
◦ Education and income might both be affected by personal attributes
such as self assurance. On the other hand the level of education could
have an impact on e.g. self assurance. The effects of education and self
assurance can not be separated.

Confounding:
Response and explanatory variable both depend on a third
(hidden) variable.

Causation, Jan 16, 2004 -2-


Establishing Causal Relationships

Controlled experiments:
A cause-effect relationship between two variables X and Y can be
established by conducting an experiment where
◦ the values of X are manipulated and
◦ the effect on Y is observed.

Problem: Often such experiments are not possible.

If we cannot establish a causal relationship by a controlled experi-


ment, we can still collect evidence from observational studies:
◦ The association is strong.
◦ The association is consistent across multiple studies.
◦ Higher doses are associated with stronger responses.
◦ The alleged cause precedes the effect in time.
◦ The alleged cause is plausible.

Example: Smoking and lung cancer

Causation, Jan 16, 2004 -3-


Caution about Causation

Association is not causation


Two variables may be correlated because both are affected
by some other (measured or unmeasured) variable.
Unmeasured confounding variables can influence the in-
terpretation of relationships among the measured vari-
ables. They
◦ may suggest a relationship where there is none or
◦ may mask a real relationship.
No causation in - no causation out
Causation is - unlike association - not a statistical concept.
For inference on cause-effect relationships, we need some
knowledge about the causal relationships between the vari-
ables in the study.
Randomized experiments guarantee the absence of any
confounding variables. Any relationship between the ma-
nipulated variable and the response must be due to a
cause-effect relationship.

Causation, Jan 16, 2004 -4-


Experiments and Observational Studies

Two major types of statistical studies


◦ Observational study - observes individuals/objects and mea-
sures variables of interest but does not attempt to interfere with
the natural process.
◦ Designed experiment - deliberately imposes some treatment
on individuals to observe their responses.
Remarks:
◦ Sample surveys are an example of an observational study.
◦ In economics, most studies are observational.
◦ Clinical studies are often designed experiments.
◦ Designed experiments allow statements about causal relation-
ship between treatment and response.
◦ Observational studies have no control over variables. Thus the
effect of the explanatory variable on the response variable might
be confounded (mixed up) with the effect of some other vari-
ables. Such variables are called confounders and are a major source
of bias.

Experiments and Observational Studies, Jan 16, 2004 -5-


Designed Experiments

• In controlled experiments, the subjects are assigned to one of


two groups,
◦ treatment group and
◦ control group (which does not receive treatment).
• A controlled experiment is randomized if the subjects are ran-
domly assigned to one of the two groups.
• One precaution in designed experiments is the use of a placebo,
which is made of a completely neutral substance. The sub-
jects do not know whether they receive the treatment or a
placebo; any difference in the response thus cannot be attri-
buted to psychological and psychosomatic effects.
• In a double blind experiment, neither the subjects nor the
treatment administrators know who is assigned to the two
groups.

Example: The Salk polio vaccine field trial


◦ Randomized controlled double-blind experiment in 11 states
◦ 200,000 children in treatment group
◦ 200,000 children in control group treated with placebo
The difference between the responses of the two groups shows that
the vaccine reduces the risk of polio infection.

Experiments and Observational Studies, Jan 16, 2004 -6-


Confounding

Confounding means a difference between the treatment and con-


trol groups—other than the treatment—which affects the responses
being studied. A confounder is a third variable. associated with
exposure and with disease.

Example: Lanarkshire Milk Experiment


The purpose of the experiment was to study the effect of pasteur-
ized milk on the health of children.
◦ The subjects of the experiment were school children.
◦ The children in the treatment group got a daily portion of pas-
teurized milk.
◦ The children in the control group did not receive any extra milk.
◦ The teachers assigned poorer children to the treatment group so
that they got extra milk.
The effect of pasteurized milk on the health of children is con-
founded with the effect of wealth: Poorer children are more exposed
to diseases.

Experiments and Observational Studies, Jan 16, 2004 -7-


Observational Studies

Confounding is a major problem in observational studies.

Association is NOT Causation

Example: Does smoking cause cancer?


• Designed experiment not possible (cannot make people
smoke).
• Observation: Smokers have higher cancer rates
• Tobacco industry: There might be a gene which
◦ makes people smoke and
◦ causes cancer
In that case stopping smoking would not prevent cancer since
it is caused by the gene. The observed high association could
be attributed to the confounding effect of such a gene.
• However: Studies with identical twins—one smoker and one
nonsmoker—put serious doubt on the gene theory.

Experiments and Observational Studies, Jan 16, 2004 -8-


Example

Do screening programs speed up detection of breast cancer?


◦ Large-scale trial run by the Health Insurance Plan of Greater
New York, starting in 1963
◦ 62,000 women age 40 to 64 (all members of the plan)
◦ Randomly assigned to two equal groups
◦ Treatment group:
⋄ women were encouraged to come in for annual screening
⋄ 20,200 women did come in for screening
⋄ 10,800 refused.
◦ Control group:
⋄ was offered usual health care
◦ All the women were followed for many years.

Epidemiologists who worked on the study found that


◦ screening had little impact on diseases other than breast cancer;
◦ poorer women were less likely to accept screening than richer
ones; and
◦ most diseases fall more heavily on the poor than the rich.

Experiments and Observational Studies, Jan 16, 2004 -9-


Example

Deaths in the first five years of the screening trial, by cause. Rates per
1,000 women.

Cause of Death
Breast cancer All other
Number of persons Number Rates Number Rates
Treatment group 31,000 39 1.3 837 27
Examined 20,200 23 1.1 428 21
Refused 10,800 16 1.5 409 38
Control group 31,000 63 2.0 879 28

Questions:
◦ Does screening save lives?
◦ Why is the death rate from all other causes in the whole treatment
group (“examined” and “refused” combined) about the same as the
rate in the control group?
◦ Why is the death rate from all other causes higher for the “refused”
group than the “examined” group?
◦ Breast cancer (like polio, but unlike most other diseases) affects the
rich more than the poor. Which numbers in the table confirm this
association between breast cancer and income?
◦ The death rate (from all causes) among women who accepted screening
is about half the death rate among women who refused. Did screening
cut the death rate in half? If not, what explains the difference in death
rates?
◦ To show that screening reduces the risk from breast cancer, someone
wants to compare 1.1 and 1.5. Is this a good comparison? Is it biased
against screening? For screening?

Experiments and Observational Studies, Jan 16, 2004 - 10 -


Survey Sampling

Situation:
Population of N individuals (or items)
e.g. ◦ students at this university
◦ light bulbs produced by a company on one day

Seek information about population


e.g. ◦ amount of money students spent on books this quarter
◦ percentage of students who bought more than 10 books
in this quarter
◦ lifetime of light bulbs

Full data collection is often not possible because it is e.g.


◦ too expensive
◦ too time consuming
◦ not sensible (e.g. testing every produced light bulb for its lifetime)

Statistical approach:
◦ collect information from part of the population (sample)
◦ use information on sample to draw conclusions on whole pop-
ulation
Questions:
◦ How to choose a sample?
◦ What conclusions can be drawn?

Survey Sampling, Jan 19, 2004 -1-


Survey Sampling

Objective of a sample survey:


Gather information on some variable for population of N individ-
uals:

x̃i value of interest for ith individual


x̃1, . . . , x̃N values for population

Sample of size n:

x1 , . . . , xn values obtained from sampling

Parameter - number that describes the population, e.g.


µpop = (1/N) Σ_{j=1}^N x̃j        population mean

σ²pop = (1/N) Σ_{j=1}^N (x̃j − µpop)²        population variance

Estimate the population parameters from the sampled values:

µ̂pop = x̄ = (1/n) Σ_{i=1}^n xi        sample mean

σ̂²pop = s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²        sample variance

A function of the sample x1, . . . , xn is called a statistic.

Survey Sampling, Jan 19, 2004 -2-


Sampling Distribution

Suppose we are interested in the amount of money students at this


university have spent on books this quarter.

Idea: Ask 20 students about the amount they have spent and take
the average.
The value we obtain will vary from sample to sample, that is, if we
asked another 20 students we would get a different answer.

Sampling distribution
The sampling distribution of a statistic is the distribution of
all values taken by the statistic if evaluated for all possible
samples of size n taken from the same population.

In our example, the sampling distribution of the average amount


obtained from the sample depends on the way we choose the sample
from the population:
◦ Ask 20 students in this class.
◦ Ask 20 students in your department.
◦ Ask 20 students in the University bookshop.
◦ Select randomly 20 students from the register of the university.
The design of a sample refers to the method used to choose the
sample from the population.

Survey Sampling, Jan 19, 2004 -3-


Sampling Distribution

Example:
Consider a population of 20 students who spent the following
amounts on books:
x̃1 x̃2 x̃3 x̃4 x̃5 x̃6 x̃7 x̃8 x̃9 x̃10 x̃11 x̃12 x̃13 x̃14 x̃15
100 120 150 180 200 220 220 240 260 280 290 300 310 350 400

[Figure: sampling distributions (frequency in %, against x̄ from 0 to 400) of the sample mean x̄ = (1/n) Σ_{i=1}^n xi for sample sizes (a) n = 2 (σ = 55.42), (b) n = 3 (σ = 43.38), and (c) n = 4 (σ = 35.97).]
Survey Sampling, Jan 19, 2004 -4-


Bias

Example:
Suppose we are interested in the amount of money students at this
university have spent on books last quarter.

Sample: 20 students in the University bookshop


Do we get a good estimate for the average amount spent on books
last quarter by UofC students?
◦ Students who buy more books and spend more money on books
are more likely to be found in bookshops than students who buy
fewer books.
◦ The sample mean might overestimate the true amount spent
on books.
◦ The sample is not representative for the population of all stu-
dents.
Careful: A poor sample design can produce misleading conclu-
sions.

The design of a study is biased if it systematically favors some


parts of the population over others.
A statistic is unbiased if the mean of its sampling distribution
is equal to the parameter being estimated. Otherwise we say the
statistic is biased.

Survey Sampling, Jan 19, 2004 -5-


Bias

Examples: Biased Sampling


◦ Midway Airlines Ads in the New York Times and the Wall Street Jour-
nal stated that “84 percent of frequent business travelers to Chicago
prefer Midway Metrolink to American, United, and TWA.”
The survey was “conducted among Midway Metrolink passengers be-
tween New York and Chicago.”
◦ A 1992 Roper poll asked “Does it seem possible or does it seem im-
possible to you that the Nazi extermination of Jews never happened?”
22% of the American respondents said “seems possible.”
A reworded 1994 poll asked “Does it seem possible to you that the Nazi
extermination of Jews never happened, or do you feel certain that it
happened?” This time only 1% of the respondents said it was “possible
it never happened.”
◦ ABC network program Nightline once asked whether the United Na-
tions should continue to have its headquarters in the United States.
More than 186,000 callers responded, and 67% said “No.”
A properly designed sample survey showed that 72% of adults want the
UN to stay.
◦ A call-in poll conducted by USA Today concluded that Americans love
Donald Trump.
USA Today later reported that 5,640 of the 7,800 calls for the poll came
from the offices owned by one man, Cincinnati financier Carl Lindner.

Survey Sampling, Jan 19, 2004 -6-


Caution about Sample Surveys

• Undercoverage
◦ occurs when some groups in the population are left out of
the process of choosing the sample
◦ can happen when there is no accurate list of the population
◦ results in bias if this group differs from the rest of the
population
• Nonresponse
◦ occurs when a chosen individual cannot be contacted or
does not cooperate
◦ results in bias if this group differs from the rest of the
population
• Response bias
◦ subjects may not want to admit illegal or unpopular be-
haviour
◦ subjects may be affected by the interviewer’s appearance or
tone
◦ subjects may not remember correctly
• Question wording
◦ confusing or leading questions can introduce strong bias
◦ do not trust sample survey results unless you have read the
exact questions posed

Survey Sampling, Jan 19, 2004 -7-


Simple Random Sampling

A simple random sample (SRS) of size n consists of n indi-


viduals chosen from the population in such a way that every set of
n individuals is equally likely to be selected.
◦ Every possible sample has an equal chance of being selected.
◦ Every individual has an equal chance of being selected.
◦ Random selection eliminates bias in sampling.

SRS or Not?
Is each of the following samples an SRS or not?
◦ A deck of cards is shuffled, and the top five dealt.
◦ A sample of Illinois residents is drawn by choosing all the resi-
dents in each of 100 census blocks (in such a way that each set
of 100 blocks is equally likely to be chosen)
◦ A telephone survey is conducted by dialing telephone numbers
at random (i.e. each valid phone number is equally likely).
◦ A sample of 10% of all students at the University of Chicago is
chosen by numbering the students 1, . . . , N , drawing a random
integer i from 1 to 10, and drawing every tenth student begin-
ning with i.
(E.g. if i = 5, students 5, 15, 25, . . . are chosen.)

Survey Sampling, Jan 19, 2004 -8-


Stratified Sampling

Example:
◦ Population: Students at this university
◦ Objective: Amount of money spent on books this quarter
◦ Knowledge: Students in e.g. humanities spend more money on
books

Use knowledge to build sample:


◦ divide the population into groups of similar individuals, called strata
◦ choose a simple random sample within each group
◦ make the sample size in each group, e.g., proportional to the size of the group

Can reduce variability of estimate significantly.

Survey Sampling, Jan 19, 2004 -9-


Summary

◦ A number which describes a population is a parameter.


◦ A number computed from the data is a statistic.
◦ Use statistics to make inferences about unknown population
parameters.
◦ A Simple random sample (SRS) of size n consists of n in-
dividuals from the population sampled without replacement,
that is, every set of n individuals has an equal chance to be the
sample actually selected.
◦ A statistic from a random sample has a sampling distribution
that describes how the statistic varies in repeated data produc-
tion.
◦ A statistic as an estimator of a parameter may suffer from bias
or from high variability. Bias means that the mean of the
sampling distribution is not equal to the true value of the pa-
rameter. The variability of the statistic is described by the
spread of its sampling distribution.

Survey Sampling, Jan 19, 2004 - 10 -


First Step Towards Probability

Experiment:
Toss a die and observe the number on the face up.

What is the chance


◦ of getting a six?
Event of interest: 6
All possible events: 1 2 3 4 5 6
⇒ 1/6 (one out of six)
◦ of getting an even number?
Event of interest: 2 4 6
All possible events: 1 2 3 4 5 6
⇒ 1/2 (three out of six)

The classical probability concept:


If there are N equally likely possibilities, of which one must occur
and s are regarded as favorable, or as a “success”, then the probability
of a “success” is

s/N.

Counting, Jan 21, 2003 -1-


First Step Towards Probability

Example:
Suppose that of 100 applicants for a job 50 were women and 50
were men, all equally qualified. Further suppose that the company
hired 2 women and 8 men.

How likely is this outcome under the assumption that


the company does not discriminate?

How many ways are there to choose


◦ 10 out of 100 applicants? ( ⇒ N )
◦ 2 out of 50 female applicants and 8 out of 50 male applicants?
( ⇒ s)

To compute such probabilities we need a way to count the num-


ber of possibilities (favorable and total).

Counting, Jan 21, 2003 -2-


The Multiplicative Rule

Suppose you have k choices to make, with N1, . . . , Nk possibilities,
respectively. Then the total number of possibilities is the product

N1 · · · Nk .

Sampling in order with replacement

If you sample n times in order with replacement from a set of N


elements, then the total number of possible sequences (x1, . . . , xn)
is Nⁿ.

Example:
If you toss a die 5 times, the number of possible results is 6⁵ = 7776.

Sampling in order without replacement

If you sample n times in order without replacement from a set of N


elements, then the total number of possible sequences (x1, . . . , xn)
is
N(N − 1) · · · (N − n + 1) = N!/(N − n)!.

Example:
If you select 5 cards in order from a card deck of 64, the number
of possible results is 64 · 63 · 62 · 61 · 60 = 914, 941, 440.
Counting, Jan 21, 2003 -3-
Permutations and Combinations

Example:
If you select 5 cards from a card deck of 64, you are typically only
interested in the cards you have, not in the order in which you
received them. How many different combinations of 5 cards out
of 64 are there?

To answer this question we first address the question of how many


different sequences of the same 5 cards exist.

Permutation:
Let (x1, . . . , xn) be a sequence. A permutation of this sequence is
any rearrangement of the elements without losing or adding any
elements, that is, any new sequence

(xi1 , . . . , xin )

with permuted indices {i1, . . . , in } = {1, . . . , n}. The trivial per-


mutation does not change the order, i.e. ij = j.

How many permutations of n distinct elements are there? The


multiplicative rule yields

n · (n − 1) · · · 1 = n!.

Example (contd):
The number of different sequences of 5 fixed cards is 5! = 5 · 4 · 3 ·
2 · 1 = 120.

Counting, Jan 21, 2003 -4-


Permutations and Combinations

How many different combinations of n elements chosen from


N distinct elements are there?

Recall that
◦ The number of different sequences of length n that can be chosen
from N distinct elements is

N!/(N − n)!.

◦ The number of permutations of any sequence of length n is n!.

Thus the number of combinations of n elements chosen from N


distinct elements is
   
N! / (n! (N − n)!) = (N choose n) = (N choose N − n).

(N choose n) is referred to as a binomial coefficient.

Since two permuted (ordered) sequences (x1, . . . , xn) lead to the same (un-
ordered) combination {x1, . . . , xn} we divide the number of ordered se-
quences by the number of permutations.

Counting, Jan 21, 2003 -5-


Examples

Example:
If you select 5 cards from a card deck of 64, you are typically only
interested in the cards you have, not in the order in which you
received them. How many different combinations of 5 cards out
of 64 are there?

The answer is

(64 choose 5) = (64 · 63 · 62 · 61 · 60)/(5 · 4 · 3 · 2 · 1) = 914,941,440/120 = 7,624,512.
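
These counts are easy to verify in Python (a hypothetical sketch, not part of the original notes):

import math

print(math.perm(64, 5))     # 914,941,440 ordered sequences of 5 cards from 64
print(math.factorial(5))    # 120 orderings of any fixed set of 5 cards
print(math.comb(64, 5))     # 7,624,512 distinct 5-card combinations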

Example:
Recall the example with the 100 applicants for a job. The number
of ways to choose

◦ 2 women out of 50 is (50 choose 2).
◦ 8 men out of 50 is (50 choose 8).
◦ 10 applicants out of 100 is (100 choose 10).

Thus the chance of this event is

(50 choose 2)(50 choose 8) / (100 choose 10) ≈ 0.038

Moreover, the chance of this or a more extreme event (only one or


no woman is hired) is 0.046.
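
As a check (a hypothetical Python sketch, not part of the original notes), the same probabilities follow from math.comb:

import math

total = math.comb(100, 10)
p2 = math.comb(50, 2) * math.comb(50, 8) / total           # exactly 2 women hired
p_tail = sum(math.comb(50, k) * math.comb(50, 10 - k) / total for k in range(3))
print(round(p2, 3), round(p_tail, 3))                      # about 0.038 and 0.046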

Counting, Jan 21, 2003 -6-


Summary

The number of ways to sample n elements from a set of N distinct
elements, with or without replacement and in order or without order,
is summarized in the following table:

Sampling                 in order          without order
without replacement      N!/(N − n)!       (N choose n)
with replacement         Nⁿ                (N + n − 1 choose n)

Counting, Jan 21, 2003 -7-


Introduction to Probability

Classical Concept:
◦ requires finitely many and equally likely outcomes
◦ probability of event defined as number of favorable outcomes
(s) divided by number of total outcomes (N):
Probability of event = s/N
◦ can be determined by counting outcomes

In many practical situations the different outcomes are not equally


likely:
◦ Success of treatment
◦ Chance to die of a heart attack
◦ Chance of snowfall tomorrow
It is not immediately clear how to measure chance in each of these
cases.

Three Concepts of Probability


◦ Frequency interpretation
◦ Subjective probabilities
◦ Mathematical probability concept

Elements of Probability, Jan 23, 2003 -1-


The Frequentist Approach

In the long run, we are all dead.


John Maynard Keynes (1883-1946)

The Frequency Interpretation of Probability


The probability of an event is the proportion of time that events
of the same kind (repeated independently and under the same
conditions) will occur in the long run.

Example:
Suppose we collect data on the weather in Chicago on Jan 21 and
we note that in the past 124 years it snowed in 34 years on Jan 21,
that is, 34/124 · 100% ≈ 27.4% of the time.
Thus we would estimate the probability of snowfall on Jan 21 in
Chicago as 0.274.

The frequency interpretation of probability is based on the follow-


ing theorem:

The Law of Large Numbers


If a situation, trial, or experiment is repeated again and again, the
proportion of successes will converge to the probability of any one
outcome being a success.

Elements of Probability, Jan 23, 2003 -2-


The Frequentist Approach
[Figure: relative frequency of heads in repeated coin tosses, shown for
tosses 1-1,000, 1,000-100,000, and 100,000-1,000,000. The relative frequency
settles down near 0.5 as the number of tosses grows.]
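The behaviour in the figure can be reproduced with a short simulation. The sketch below is an added Python illustration assuming a fair coin; it prints the relative frequency of heads at a few checkpoints.

    import random

    random.seed(1)                      # reproducible example
    heads, checkpoints = 0, {100, 1_000, 10_000, 100_000}
    for n in range(1, 100_001):
        heads += random.random() < 0.5  # one toss of a fair coin
        if n in checkpoints:
            print(n, heads / n)         # relative frequency settles near 0.5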

Elements of Probability, Jan 23, 2003 -3-


The Subjectivist (Bayesian) Approach

Not all events are repeatable:


◦ Will it snow tomorrow?
◦ Will Mr Jones, 42, live to 65?
◦ Will the Dow Jones rise tomorrow?
◦ Does Iraq have weapons of mass destruction?

To all these questions the answer is either “yes” or “no”, but we


are uncertain about the right answer.

Need to quantify our uncertainty about an event A:

Game with two players:


◦ 1st player determines p such that he will “win” $c · (1 − p) if
event A occurs and otherwise he will “lose” $c · p.
◦ 2nd player chooses c which can be positive or negative.

The Bayesian interpretation of probability is that probability


measures the personal (subjective) uncertainty of an event.

Example: Weather forecast


Meteorologist says that the probability of snowfall tomorrow is
90%.
He should be willing to bet $90 against $10 that it snows tomorrow
and $10 against $90 that it does not snow.

Elements of Probability, Jan 23, 2003 -4-


The Elements of Probability

A (statistical) experiment is a process of observation or mea-


surement. For a mathematical treatment we need:

Sample Space S - set of possible outcomes


Example: An urn contains five balls, numbered from 1 through
5. We choose two at random and at the same time. What is the
sample space?

S = { {1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5},
      {3, 4}, {3, 5}, {4, 5} }.

Events A ⊆ S - an event is a subset of the sample space S


Example: In the example above the event A that two balls with
odd numbers are chosen is

A = { {1, 3}, {1, 5}, {3, 5} }.

Probability Function P - assigns each A a value in [0, 1]


Example: Assuming that all events are equally likely we obtain

P(A) = 3/10.

Elements of Probability, Jan 23, 2003 -5-


The Elements of Probability

Why not assign probabilities to outcomes?


Example: Spinner labeled from 0 to 1.
◦ Suppose that all outcomes s ∈ S = [0, 1) are equally likely.
◦ Assign probabilities uniformly on S.
◦ P({s}) = c > 0 ⇒ P(S) = ∞
◦ P({s}) = 0 ⇒ P(S) = 0

Solution: Assign to each subset of S a probability equal to the


“length” of that subset:
◦ Probability that the spinner lands in [0, 1/4) is 1/4.
◦ Probability that the spinner lands in [1/2, 3/4) is 1/4.
◦ Probability that the spinner lands on 1/2 is 0.
In integral notation we have
P(spinner lands in [a, b]) = ∫_a^b dx = b − a.

Remark:
Strictly speaking, we can define the above probability only on a collection A of subsets A ⊆ S;
this collection, however, contains all subsets that are important and relevant for this class.
In the case of finite or countably infinite sample spaces S there are no such exceptions
and A covers all subsets of S.

Elements of Probability, Jan 23, 2003 -6-


A Set Theory Primer

A set is “a collection of definite, well distinguished objects of our perception


or of our thought”. (Georg Cantor, 1845-1918)

Some important sets:


◦ N = {1, 2, 3, . . .}, the set of natural numbers
◦ Z = {. . . , −2, −1, 0, 1, 2, . . .}, the set of integers
◦ R = (−∞, ∞), the set of real numbers
Intervals are denoted as follows:
[0, 1] the interval from 0 to 1 including 0 and 1
[0, 1) the interval from 0 to 1 including 0 but not 1
(0, 1) the interval from 0 to 1 not including 0 and 1

If a is an element of the set A then we write a ∈ A.


If a is not an element of the set A then we write a ∉ A.
Suppose that A and B are subsets of S (denoted as A, B ⊆ S).

The empty set is denoted by ∅ (Note: ∅ ⊆ A for all subsets A of S).


Difference of A and B (A\B): Set of all elements in A which are not in B.
Intersection of A and B (A ∩ B): Set of all elements in S which are both
in A and in B.
Union of A and B (A ∪ B): Set of all elements in S that are in A or in B.
Complement of A (A∁ or A′): Set of all elements in S that are not in A.
Note that A ∩ A∁ = ∅ and A ∪ A∁ = S
A and B are disjoint if A and B have no common elements, that is A∩B =
∅. Two events A and B with this property are said to be mutually
exclusive.

Elements of Probability, Jan 23, 2003 -7-


The Postulates of Probability

A probability on a sample space S (and a set A of events) is a


function which assigns each subset A a value in [0, 1] and satisfies
the following rules:

Axiom 1: All probabilities are nonnegative:

P(A) ≥ 0 for all events A.

Axiom 2: The probability of the whole sample space is 1:

P(S) = 1.
Axiom 3 (Addition Rule): If two events A and B are mutu-
ally exclusive then

P(A ∪ B) = P(A) + P(B),


that is the probability that one or the other occurs is the sum
of their probabilities.
More generally, if countably many events Ai, i ∈ N are mutu-
ally exclusive (i.e. Ai ∩ Aj = ∅ whenever i ≠ j) then

P( ⋃_{i=1}^∞ Ai ) = Σ_{i=1}^∞ P(Ai).

Elements of Probability, Jan 23, 2003 -8-


The Postulates of Probability

Classical Concept of Probability

The probability of an event A is defined as

P(A) = #A / #S,

where #A denotes the number of elements (outcomes) in A.


It satisfies
◦ P(A) ≥ 0
◦ P(S) = #S/#S = 1
◦ If A and B mutually exclusive then

P(A ∪ B) = #(A ∪ B) / #S = #A/#S + #B/#S = P(A) + P(B).

Elements of Probability, Jan 23, 2003 -9-


The Postulates of Probability

Frequency Interpretation of Probability

The probability of an event A is defined as


P(A) = lim_{n→∞} n(A) / n,

where n(A) is the number of times event A occurred in n repeti-


tions.
It satisfies
◦ P(A) ≥ 0
◦ P(S) = lim_{n→∞} n/n = 1
◦ If A and B mutually exclusive then n(A ∪ B) = n(A) + n(B).
Hence
P(A ∪ B) = lim_{n→∞} n(A ∪ B)/n
         = lim_{n→∞} ( n(A)/n + n(B)/n )
         = lim_{n→∞} n(A)/n + lim_{n→∞} n(B)/n = P(A) + P(B).

Elements of Probability, Jan 23, 2003 - 10 -


The Postulates of Probability

Example: Toss of one die


The events A = {1} and B = {4, 5} are mutually exclusive.
Since all outcomes are equiprobable we obtain

P(A) = 1/6   and   P(B) = 2/6 = 1/3.
The addition rule yields

P(A ∪ B) = 1/6 + 1/3 = 3/6 = 1/2.
On the other hand we get for C = A ∪ B = {1, 4, 5}

P(C) = 3/6 = 1/2.

The first two axioms can be summarized by the

Cardinal Rule: For any subset A of S

0 ≤ P(A) ≤ 1.

In particular
◦ P(∅) = 0
◦ P(S) = 1

Elements of Probability, Jan 23, 2003 - 11 -


The Calculus of Probability

Let A and B be events in a sample space S.

Partition rule:
P(A) = P(A ∩ B) + P(A ∩ B ∁)
Example: Roll a pair of fair dice

P(Total of 10)
= P(Total of 10 and double) + P(Total of 10 and no double)
= 1/36 + 2/36 = 3/36 = 1/12

Complementation rule:
P(A∁) = 1 − P(A)
Example: Often useful for events of the type “at least one”:

P(At least one even number)


= 1 − P(No even number) = 1 − 9/36 = 3/4

Containment rule
P(A) ≤ P(B) for all A ⊆ B
Example: Compare two aces with doubles,

1/36 = P(Two aces) ≤ P(Doubles) = 6/36 = 1/6

Calculus of Probability, Jan 26, 2003 -1-


The Calculus of Probability

Inclusion and exclusion formula


P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Example: Roll a pair of fair dice

P(Total of 10 or double)
= P(Total of 10) + P(Double) − P(Total of 10 and double)
= 3/36 + 6/36 − 1/36 = 8/36 = 2/9

The two events are

Total of 10 = { 46, 55, 64}


and

Double = { 11,22,33,44,55,66}
The intersection is

Total of 10 and double = { 55}.


Adding the probabilities for the two events, the probability for the
event 55 is added twice.

Calculus of Probability, Jan 26, 2003 -2-


Conditional Probability

Probability gives chances for events in sample space S.


Often: Have partial information about event of interest.

Example: Number of Deaths in the U.S. in 1996


Cause All ages 1-4 5-14 15-24 25-44 45-64 ≥ 65
Heart 733,125 207 341 920 16,261 102,510 612,886
Cancer 544,161 440 1,035 1,642 22,147 132,805 386,092
HIV 32,003 149 174 420 22,795 8,443 22
Accidents1 92,998 2,155 3,521 13,872 26,554 16,332 30,564
Homicide2 24,486 395 513 6,548 9,261 7,717 52
All causes 2,171,935 5,947 8,465 32,699 148,904 380,396 1,717,218
1 Accidents and adverse effects;  2 Homicide and legal intervention

measure probability with respect to a subset of S

Conditional probability of A given B

P(A|B) = P(A ∩ B) / P(B),   if P(B) > 0

If P(B) = 0 then P(A|B) is undefined.


Conditional probabilities for causes of death:
◦ P(accident) = 0.04282
◦ P(age=10) = 0.00390
◦ P(accident|age=10) = 0.42423
◦ P(accident|age=40) = 0.17832

Calculus of Probability, Jan 26, 2003 -3-


Conditional Probability

Example: Select two cards from 32 cards


◦ What is the probability that the second card is an ace?

P(2nd card is an ace) = 1/8


◦ What is the probability that the second card is an ace if the
first was an ace?

P(2nd card is an ace|1st card was an ace) = 3/31

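Both probabilities can be checked by enumerating ordered pairs of distinct cards from the 32-card deck. The Python sketch below is my own illustration; labelling the four aces as cards 0-3 is an arbitrary coding, not from the notes.

    from itertools import permutations

    deck = range(32)                      # cards 0..31; let 0, 1, 2, 3 be the aces
    pairs = list(permutations(deck, 2))   # ordered draws of two distinct cards

    p_second_ace = sum(second < 4 for first, second in pairs) / len(pairs)
    first_ace = [(f, s) for f, s in pairs if f < 4]
    p_second_given_first = sum(s < 4 for f, s in first_ace) / len(first_ace)

    print(p_second_ace, 1/8)              # both equal 0.125
    print(p_second_given_first, 3/31)     # both equal 0.0967...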
Calculus of Probability, Jan 26, 2003 -4-


Multiplication rules

Example: Death Rates (per 100,000 people)

All Ages 1-4 5-14 15-24 25-44 45-64 ≥ 65


872.5 38.3 22.0 90.3 177.8 708.0 5071.4

Can we combine these rates with the table on causes of death?


◦ What is the probability to die from an accident (HIV)?
◦ What is the probability to die from an accident at age 10 (40)?

Know P(accident|die) = P(die from accident)/P(die)


⇒ P(die from accident) = P(accident|die)P(die)

Calculate probabilities:
◦ P(die from accident) = 0.04281 · 0.00873 = 0.00037
◦ P(die from accident|age = 10) = 0.42423 · 0.00090 = 0.00038
◦ P(die from accident|age = 40) = 0.17832 · 0.00178 = 0.00031
◦ P(die from HIV) = 0.01473 · 0.00873 = 0.00013
◦ P(die from HIV|age = 10) = 0.02055 · 0.00090 = 0.00002
◦ P(die from HIV|age = 40) = 0.15308 · 0.00178 = 0.00027
General multiplication rule

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

Calculus of Probability, Jan 26, 2003 -5-


Independence

Example: Roll two dice


◦ What is the probability that the second die shows 1?

  P(2nd die = 1) = 1/6
◦ What is the probability that the second die shows 1 if the first
  die already shows 1?

  P(2nd die = 1|1st die = 1) = 1/6
◦ What is the probability that the second die shows 1 if the first
  does not show 1?

  P(2nd die = 1|1st die ≠ 1) = 1/6


The chances of getting 1 with the second die are the same, no
matter what the first die shows. Such events are called indepen-
dent:

The event A is independent of the event B if its chances are


not affected by the occurrence of B,

P(A|B) = P(A).
Equivalently, A and B are independent if

P(A ∩ B) = P(A)P(B)
Otherwise we say A and B are dependent.

Calculus of Probability, Jan 26, 2003 -6-


Let’s Make a Deal

The Rules:
◦ Three doors - one prize, two blanks
◦ Candidate selects one door
◦ Showmaster reveals one losing door
◦ Candidate may switch doors

1 2 3

Would YOU change?

Can probability theory help you?


◦ What is the probability of winning if candidate switches doors?
◦ What is the probability of winning if candidate does not switch
doors?

Calculus of Probability, Jan 26, 2003 -7-


The Rule of Total Probability

Events of interest:
◦ A - choose winning door at the beginning
◦ W - win the prize

Strategy: Switch doors (S)


Know: ◦ PS(W|A) = 0      ◦ PS(A) = 1/3
      ◦ PS(W|A∁) = 1     ◦ PS(A∁) = 2/3
Probability of interest: PS(W):

PS(W) = PS(W ∩ A) + PS(W ∩ A∁)
      = PS(W|A)PS(A) + PS(W|A∁)PS(A∁)
      = 0 · 1/3 + 1 · 2/3 = 2/3

Strategy: Do not switch doors (N )


Know: ◦ PN(W|A) = 1      ◦ PN(A) = 1/3
      ◦ PN(W|A∁) = 0     ◦ PN(A∁) = 2/3
Probability of interest: PN(W):

PN(W) = PN(W ∩ A) + PN(W ∩ A∁)
      = PN(W|A)PN(A) + PN(W|A∁)PN(A∁)
      = 1 · 1/3 + 0 · 2/3 = 1/3

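A simulation makes the 2/3 versus 1/3 comparison tangible. The following Python sketch is my own illustration (not part of the notes); it plays the game many times under both strategies.

    import random

    def play(switch, trials=100_000):
        wins = 0
        for _ in range(trials):
            prize = random.randrange(3)
            pick = random.randrange(3)
            # showmaster opens a losing door that is neither the pick nor the prize
            opened = next(d for d in range(3) if d != pick and d != prize)
            if switch:
                pick = next(d for d in range(3) if d != pick and d != opened)
            wins += (pick == prize)
        return wins / trials

    random.seed(1)
    print("switch:", play(True))      # close to 2/3
    print("stay:  ", play(False))     # close to 1/3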
Calculus of Probability, Jan 26, 2003 -8-


The Rule of Total Probability

Rule of Total Probability


If B1, . . . , Bk mutually exclusive and B1 ∪ . . . ∪ Bk = S, then

P(A) = P(A|B1)P(B1) + . . . + P(A|Bk )P(Bk )

Example:
Suppose an applicant for a job has been invited for an interview.
The chance that
◦ he is nervous is P(N ) = 0.7,
◦ the interview is successful if he is nervous is P(S|N ) = 0.2,
◦ the interview is successful if he is not nervous is P(S|N ∁) = 0.9.

What is the probability that the interview is successful?

P(S) = P(S|N)P(N) + P(S|N∁)P(N∁)
     = 0.2 · 0.7 + 0.9 · 0.3
     = 0.14 + 0.27 = 0.41

Calculus of Probability, Jan 26, 2003 -9-


The Rule of Total Probability

Example:
Suppose we have two unfair coins:
◦ Coin 1 comes up heads with probability 0.8
◦ Coin 2 comes up heads with probability 0.35
Choose a coin at random and flip it. What is the probability of its
being a head?
Events: H=“heads comes up”, C1=“1st coin”, C2=“2nd coin”

P(H) = P(H|C1)P(C1) + P(H|C2)P(C2)


= (1/2) · (0.8 + 0.35) = 0.575

Calculus of Probability, Jan 26, 2003 - 10 -


Bayes’ Theorem

Example: O.J. Simpson


“Only about 1/10 of one percent of wife-batterers actually murder their wives”
Lawyer of O.J. Simpson on TV

Fact: Simpson pleaded no contest to beating his wife in 1988.


So he murdered his wife with probability 0.001?
◦ Sample space S - married couples in U.S. in which the husband
beat his wife in 1988
◦ Event H - all couples in S in which the husband has since
murdered his wife
◦ Event M - all couples in S in which the wife has been murdered
since 1988
We have ◦ P(H) = 0.001
◦ P(M |H) = 1 since H ⊆ M
◦ P(M |H ∁) = 0.0001 at most in the U.S.
Then

P(H|M) = P(M|H) P(H) / P(M)
       = P(M|H)P(H) / ( P(M|H)P(H) + P(M|H∁)P(H∁) )
       = 0.001 / (0.001 + 0.0001 · 0.999)
       = 0.91

Calculus of Probability, Jan 26, 2003 - 11 -


Bayes’ Theorem

Reversal of conditioning (general multiplication rule)

P(B|A)P(A) = P(A|B)P(B)

Rewriting P(A) using the rule of total probability we obtain

Bayes’ Theorem

P(B|A) = P(A|B)P(B) / ( P(A|B)P(B) + P(A|B∁)P(B∁) )

If B1, . . . , Bk are mutually exclusive and B1 ∪ . . . ∪ Bk = S, then

P(Bi|A) = P(A|Bi)P(Bi) / ( P(A|B1)P(B1) + . . . + P(A|Bk)P(Bk) )

(General form of Bayes’ Theorem)

Calculus of Probability, Jan 26, 2003 - 12 -


Bayes’ Theorem

Example: Testing for AIDS

Enzyme immunoassay test for HIV:


◦ P(T+|I+) = 0.98 (sensitivity - positive for infected)
◦ P(T-|I-) = 0.995 (specificity - negative for noninfected)
◦ P(I+) = 0.0003 (prevalence)
What is the probability that the tested person is infected if the
test was positive?

P(I+|T+) = P(T+|I+)P(I+) / ( P(T+|I+)P(I+) + P(T+|I-)P(I-) )
         = 0.98 · 0.0003 / (0.98 · 0.0003 + 0.005 · 0.9997)
         = 0.05556

Consider different population with P(I+) = 0.1 (greater risk)


P(I+|T+) = 0.98 · 0.1 / (0.98 · 0.1 + 0.005 · 0.9) = 0.956

Testing on a large scale in the low-prevalence population is not sensible (too many false positives).

Repeat test (Bayesian updating):


◦ P(I+|T++) = 0.92 in 1st population
◦ P(I+|T++) = 0.9998 in 2nd population

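The posterior probabilities above, including the effect of repeating the test, follow from applying Bayes' theorem twice. The Python sketch below is a plain restatement of the formula that I added for illustration; it is not an original method of the notes.

    def posterior(prior, sensitivity=0.98, specificity=0.995):
        """P(infected | positive test) by Bayes' theorem."""
        p_pos_given_inf = sensitivity
        p_pos_given_not = 1 - specificity
        return (p_pos_given_inf * prior) / (
            p_pos_given_inf * prior + p_pos_given_not * (1 - prior)
        )

    for prevalence in (0.0003, 0.1):
        first = posterior(prevalence)     # after one positive test
        second = posterior(first)         # Bayesian updating: test a second time
        print(prevalence, round(first, 4), round(second, 4))

For the two prevalences this reproduces roughly 0.056 and 0.956 after one positive test, and roughly 0.92 and 0.9998 after a second positive test, matching the values quoted above.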
Calculus of Probability, Jan 26, 2003 - 13 -


Random Variables

Aim: ◦ Learn about population


◦ Available information: observed data x1, . . . , xn

Problem: ◦ Data affected by chance variation


◦ New set of data would look different

Suppose we observe/measure some characteristic (variable) of n


individuals. The actual observed values x1, . . . , xn are the outcome
of a random phenomenon.

Random variable: a variable whose value is a numerical out-


come of a random phenomenon

Remark: Mathematically, a random variable is a real-valued func-


tion on the sample space S:

X : S → R,   ω ↦ x = X(ω)
◦ SX = X(S) is the sample space of the random variable.
◦ The outcome x = X(ω) is called realisation of X.
◦ X induces a probability P (B) = P(X ∈ B) on SX , the prob-
ability distribution of X
Example: Roll one die

Outcome ω 1 2 3 4 5 6
Realization X(ω) 1 2 3 4 5 6

Random Variables, Jan 28, 2003 -1-


Random Variables

Example: Roll two dice

◦ X1 - number on the first die


◦ X2 - number on the second die
◦ Y = X1 + X2 - total number of points
(a function of random variables is again a random variable)

Table of outcomes:

Outcome (X1, X2 ) Y Outcome (X1, X2 ) Y


11 (1,1) 2 41 (4,1) 5
12 (1,2) 3 42 (4,2) 6
13 (1,3) 4 43 (4,3) 7
14 (1,4) 5 44 (4,4) 8
15 (1,5) 6 45 (4,5) 9
16 (1,6) 7 46 (4,6) 10
21 (2,1) 3 51 (5,1) 6
22 (2,2) 4 52 (5,2) 7
23 (2,3) 5 53 (5,3) 8
24 (2,4) 6 54 (5,4) 9
25 (2,5) 7 55 (5,5) 10
26 (2,6) 8 56 (5,6) 11
31 (3,1) 4 61 (6,1) 7
32 (3,2) 5 62 (6,2) 8
33 (3,3) 6 63 (6,3) 9
34 (3,4) 7 64 (6,4) 10
35 (3,5) 8 65 (6,5) 11
36 (3,6) 9 66 (6,6) 12

Random Variables, Jan 28, 2003 -2-


Random Variables

Two important types of random variables:

• Discrete random variable


◦ takes values in a finite or countable set
• Continuous random variable
◦ takes values in a continuum, or uncountable set
◦ probability of any particular outcome x is zero

P(X = x) = 0 for all x ∈ SX

Example: Ten tosses of a coin


Suppose we toss a coin ten times. Let
◦ X be the number of heads in ten tosses of a coin
◦ Y be the time it takes to toss ten times

Random Variables, Jan 28, 2003 -3-


Discrete Random Variables

Suppose X is a discrete random variable with values x1, x2, . . ..

Example: Roll two dice


Y = X1 + X2 total number of points
y 2 3 4 5 6 7 8 9 10 11 12
P(Y = y) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Frequency function: The function

p(x) = P(X = x) = P({ω ∈ S|X(ω) = x})

is called the frequency function or probability mass function.

Note: p defines a probability on SX = {x1, x2, . . .}:


P(B) = Σ_{x∈B} p(x) = P(X ∈ B).

We call P the (probability) distribution of X.

Properties of a discrete probability distribution


◦ p(x) ≥ 0 for all values of X
◦ Σ_i p(xi) = 1

Random Variables, Jan 28, 2003 -4-


Discrete Random Variables

Example: Roll one die


Let X denote the number of points on the face turned up. Since
all numbers are equally likely we obtain
p(x) = P(X = x) = 1/6  if x ∈ {1, . . . , 6},  and 0 otherwise.

Example: Roll two dice


The probability mass function of the total number of points

Y = X1 + X2

can be written as:


p(y) = P(Y = y) = (6 − |y − 7|)/36  if y ∈ {2, . . . , 12},  and 0 otherwise.

Example: Three tosses of a coin


Let X be the number of heads in three tosses of a coin. There are
C(3, x) outcomes with x heads and 3 − x tails, thus

p(x) = C(3, x) · 1/8.

Random Variables, Jan 28, 2003 -5-


Continuous Random Variables

For a continuous random variable X, the probability that X falls


in the interval (a, b] is given by

P(a < X ≤ b) = ∫_a^b f(x) dx,

where f is the density function of X.

Note: The density defines a probability on R:


P([a, b]) = ∫_a^b f(x) dx = P(X ∈ [a, b])

We call P the (probability) distribution of X.


Remark: The definition of P can be extended to (almost) all B ⊆ R.
Example: Spinner
Consider a spinner that turns freely on its axis and slowly comes to a stop.
◦ X is the stopping point on the circle marked from 0 to 1.
◦ X can take any value in SX = [0, 1).
◦ The outcomes of X are uniformly distributed over the interval [0, 1).
Then the density function of X is

f(x) = 1 if 0 ≤ x < 1,  and 0 otherwise.

Consequently

P(X ∈ [a, b]) = b − a.

Note that for all possible outcomes x ∈ [0, 1) we have



P(X ∈ [x, x]) = x − x = 0.

Random Variables, Jan 28, 2003 -6-


Independence of Random Variables

Recall: Two events A and B are independent if

P(A ∩ B) = P(A)P(B)
Independence of Random Variables
Two discrete random variables X and Y are independent if

P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)

for all A ⊆ SX and B ⊆ SY .

Remark: It is sufficient to show that

P(X = x, Y = y) = pX (x) pY (y) = P(X = x) P(Y = y)

for all x ∈ SX and y ∈ SY .


More generally, X1, X2 , . . . are independent if for all n ∈ N

P(X1 ∈ A1, . . . , Xn ∈ An) = P(X1 ∈ A1) · · · P(Xn ∈ An).


for all Ai ⊆ SXi.

Example: Toss coin three times


Consider

Xi = 1 if head in ith toss of coin,  0 otherwise

X1 , X2 , and X3 are independent:

P(X1 = x1, . . . , X3 = x3) = 1/8 = P(X1 = x1)P(X2 = x2)P(X3 = x3)

Random Variables, Jan 28, 2003 -7-


Multivariate Distributions: Discrete Case
Discrete Case

Let X and Y be discrete random variables.

Joint frequency function of X and Y

pXY (x, y) = P(X = x, Y = y) = P({X = x} ∩ {Y = y})

Marginal frequency function of X


pX(x) = Σ_i pXY(x, yi)

Marginal frequency function of Y


pY(y) = Σ_i pXY(xi, y)

The random variables X and Y are independent if and only if

pXY (x, y) = pX (x) pY (y)

for all possible values x ∈ SX and y ∈ SY .

Conditional probability of X = x given Y = y

P(X = x|Y = y) = pX|Y(x|y) = pXY(x, y) / pY(y) = P(X = x, Y = y) / P(Y = y),
where pX|Y (x|y) is the conditional frequency function.

Random Variables, Jan 28, 2003 -8-


Multivariate Distributions
Discrete Case

Example: Three Tosses of a Coin

◦ X - number of heads on the first toss (values in {0, 1})


◦ Y - total number of heads (values in {0, 1, 2, 3})
The joint frequency function pXY (x, y) is given by the following
table

x\y      0     1     2     3   | pX(x)
0       1/8   2/8   1/8    0   |  1/2
1        0    1/8   2/8   1/8  |  1/2
pY(y)   1/8   3/8   3/8   1/8  |   1

Marginal frequency function of Y

pY(0) = P(Y = 0)
      = P(Y = 0, X = 0) + P(Y = 0, X = 1)
      = 1/8 + 0 = 1/8
pY(1) = P(Y = 1)
      = P(Y = 1, X = 0) + P(Y = 1, X = 1)
      = 2/8 + 1/8 = 3/8
...

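The joint table and the marginals can be generated by enumerating all eight equally likely outcomes. The Python sketch below is an added illustration; it uses a Counter over outcome triples and exact fractions.

    from itertools import product
    from collections import Counter
    from fractions import Fraction

    joint = Counter()
    for tosses in product((0, 1), repeat=3):    # 1 = head; all 8 outcomes equally likely
        x = tosses[0]                           # heads on the first toss
        y = sum(tosses)                         # total number of heads
        joint[(x, y)] += Fraction(1, 8)

    p_Y = Counter()
    for (x, y), p in joint.items():
        p_Y[y] += p

    print(dict(joint))   # matches the joint table, e.g. pXY(0, 1) = 1/4
    print(dict(p_Y))     # marginals 1/8, 3/8, 3/8, 1/8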
Random Variables, Jan 28, 2003 -9-


Multivariate Distributions
Continuous Case

Let X and Y be continuous random variables.

Joint density function of X and Y : fXY such that


∫_A ∫_B fXY(x, y) dy dx = P(X ∈ A, Y ∈ B)

Marginal density function of X:


fX(x) = ∫ fXY(x, y) dy

Marginal density function of Y

fY(y) = ∫ fXY(x, y) dx

The random variables X and Y are independent if and only if

fXY (x, y) = fX (x) fY (y)

for all possible values x ∈ SX and y ∈ SY .

Conditional density function of X given Y = y


fX|Y(x|y) = fXY(x, y) / fY(y)

Conditional probability of X ∈ A given Y = y

P(X ∈ A|Y = y) = ∫_A fX|Y(x|y) dx

Random Variables, Jan 28, 2003 - 10 -


Bernoulli Distribution

Example: Toss of coin


Define X = 1 if head comes up and
X = 0 if tail comes up.
Both realizations are equally likely: P(X = 1) = P(X = 0) = 1/2

Examples:
Often: Two outcomes which are not equally likely:
◦ Success of medical treatment
◦ Interviewed person is female
◦ Student passes exam
◦ Transmittance of a disease

Bernoulli distribution (with parameter θ)

◦ X takes two values, 0 and 1, with probabilities 1 − θ and θ


◦ Frequency function of X
p(x) = θ^x (1 − θ)^(1−x)  for x ∈ {0, 1},  and 0 otherwise
◦ Often:

X = 1 if event A has occurred,  0 otherwise
Example: A = blood pressure above 140/90 mm HG.

Distributions, Jan 30, 2003 -1-


Bernoulli Distribution

Let X1, . . . , Xn be independent Bernoulli random variables with


same parameter θ.

Frequency function of X1, . . . , Xn

p(x1, . . . , xn) = p(x1) · · · p(xn) = θ^(x1+...+xn) (1 − θ)^(n−x1−...−xn)

for xi ∈ {0, 1} and i = 1, . . . , n

Example: Paired-Sample Sign Test


◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11

Define for the ith plant



Xi = 1 if the first value is greater than the second,  0 otherwise

Result: 1 1 1 1 0 1 1 1 1 1

The Xi’s are independently Bernoulli distributed with unknown


parameter θ.

Distributions, Jan 30, 2003 -2-


Binomial Distribution

Let X1, . . . , Xn be independent Bernoulli random variables


◦ Often only interested in number of successes

Y = X1 + . . . + Xn

Example: Paired Sample Sign Test (contd)


Define for the ith plant

Xi = 1 if the first value is greater than the second,  0 otherwise

Y = Σ_{i=1}^n Xi

Y is the number of plants for which the number of lost hours has
decreased after the installation of the safety program

We know:
◦ Xi is Bernoulli distributed with parameter θ
◦ Xi’s are independent

What is the distribution of Y ?

◦ Probability of realization x1, . . . , xn with y successes:

p(x1, . . . , xn) = θy (1 − θ)n−y


◦ Number of different realizations with y successes: C(n, y)

Distributions, Jan 30, 2003 -3-


Binomial Distribution

Binomial distribution (with parameters n and θ)


Let X1, . . . , Xn be independent and Bernoulli distributed with pa-
rameter θ and
Y = Σ_{i=1}^n Xi.

Y has frequency function


 
n
p(y) = θy (1 − θ)n−y for y ∈ {0, . . . , n}
y
Y is binomially distributed with parameters n and θ. We write

Y ∼ Bin(n, θ).

Note that
◦ the number of trials is fixed,
◦ the probability of success is the same for each trial, and
◦ the trials are independent.

Example: Paired Sample Sign Test (contd)


Let Y be the number of plants for which the number of lost hours
has decreased after the installation of the safety program. Then

Y ∼ Bin(n, θ)

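A binomial frequency function is easy to evaluate directly. The Python sketch below is an added illustration; it computes p(y) for Y ∼ Bin(10, θ), e.g. the probability of the observed nine "decreases" if losses were equally likely to go up or down (θ = 1/2 is an illustrative choice, anticipating the test discussed later).

    from math import comb

    def binom_pmf(y, n, theta):
        """Frequency function of Bin(n, theta)."""
        return comb(n, y) * theta**y * (1 - theta)**(n - y)

    n, theta = 10, 0.5
    pmf = [binom_pmf(y, n, theta) for y in range(n + 1)]
    print(sum(pmf))                                        # the probabilities add up to 1
    print(binom_pmf(9, n, theta))                          # P(Y = 9) = 10/1024, about 0.0098
    print(sum(binom_pmf(y, n, theta) for y in (9, 10)))    # P(Y >= 9), about 0.0107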
Distributions, Jan 30, 2003 -4-


Binomial Distribution

Binomial distribution for n = 10

[Figure: frequency functions p(x) of Bin(10, θ) for θ = 0.1, 0.3, 0.5, and 0.8.]

Distributions, Jan 30, 2003 -5-


Geometric Distribution

Consider a sequence of independent Bernoulli trials.


◦ On each trial, a success occurs with probability θ.
◦ Let X be the number of trials up to the first success.

What is the distribution of X?


◦ Probability of no success in x − 1 trials: (1 − θ)x−1
◦ Probability of one success in the xth trial: θ
The frequency function of X is

p(x) = θ(1 − θ)x−1 , x = 1, 2, 3, . . .

X is geometrically distributed with parameter θ.

Example:
Suppose a batter has probability 1/3 to hit the ball. What is the chance that
he misses the ball less than 3 times?

The number X of balls up to the first success is geometrically distributed
with parameter 1/3. Thus

P(X ≤ 3) = 1/3 + (2/3) · 1/3 + (2/3)^2 · 1/3 = 19/27 = 0.7037.

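The same probability can be obtained by summing the geometric frequency function; a short Python check that I added for illustration:

    theta = 1/3
    p_hit_within_3 = sum(theta * (1 - theta) ** (x - 1) for x in range(1, 4))
    print(p_hit_within_3)      # 19/27 = 0.7037...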
Distributions, Jan 30, 2003 -6-


Hypergeometric Distribution

Example: Quality Control


Quality control - sample and examine fraction of produced units
◦ N produced units
◦ M defective units
◦ n sampled units
What is the probability that the sample contains x defective units?

The frequency function of X is



p(x) = C(M, x) C(N − M, n − x) / C(N, n),   x = 0, 1, . . . , n.

X is a hypergeometric random variable with parameters N , M ,


and n.

Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men,
all equally qualified. If we select 10 applicants at random what is the
probability that x of them are female?
The number of chosen female applicants is hypergeometrically distributed
with parameters 100, 50, and 10. The frequency function is
p(x) = C(50, x) C(50, 10 − x) / C(100, 10)   for x = 0, 1, . . . , 10.

Distributions, Jan 30, 2003 -7-


Poisson Distribution

Often we are interested in the number of events which occur in a


specific period of time or in a specific area of volume:
◦ Number of alpha particles emitted from a radioactive source during a
given period of time
◦ Number of telephone calls coming into an exchange during one unit of
time
◦ Number of diseased trees per acre of a certain woodland
◦ Number of death claims received per day by an insurance company

Characteristics
Let X be the number of times a certain event occurs during a given
unit of time (or in a given area, etc).
◦ The probability that the event occurs in a given unit of time is
the same for all the units.
◦ The number of events that occur in one unit of time is inde-
pendent of the number of events in other units.
◦ The mean (or expected) rate is λ.

Then X is a Poisson random variable with parameter λ and


frequency function
p(x) = (λ^x / x!) e^(−λ),   x = 0, 1, 2, . . .

Distributions, Jan 30, 2003 -8-


Poisson Approximation

The Poisson distribution is often used as an approximation for


binomial probabilities when n is large and θ is small:
 
p(x) = C(n, x) θ^x (1 − θ)^(n−x) ≈ (λ^x / x!) e^(−λ)
with λ = n θ.

Example: Fatalities in Prussian cavalry


Classical example from von Bortkiewicz (1898).
◦ Number of fatalities resulting from being kicked by a horse
◦ 200 observations (10 corps over a period of 20 years)

Statistical model:
◦ Each soldier is kicked to death by a horse with probability θ.
◦ Let Y be the number of such fatalities in one corps. Then

Y ∼ Bin(n, θ)

where n is the number of soldiers in one corps.

Observation: The data are well approximated by a Poisson distribution


with λ = 0.61

Deaths per Year Observed Rel. Frequency Poisson Prob.


0 109 0.545 0.543
1 65 0.325 0.331
2 22 0.110 0.101
3 3 0.015 0.021
4 1 0.005 0.003

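The Poisson column of the table comes from evaluating p(x) = λ^x e^(−λ)/x! at λ = 0.61. The Python sketch below is an added illustration that recomputes it next to the observed relative frequencies (copied from the table above).

    from math import exp, factorial

    lam = 0.61
    observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}    # deaths per corps-year, 200 observations
    total = sum(observed.values())

    for x, count in observed.items():
        poisson = lam**x * exp(-lam) / factorial(x)
        print(x, round(count / total, 3), round(poisson, 3))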
Distributions, Jan 30, 2003 -9-


Poisson Approximation

Poisson approximation of Bin(40, θ)

[Figure: frequency functions of Bin(40, θ) for θ = 1/400, 1/40, 1/8, 1/4
(left column) next to Poisson frequency functions with λ = nθ = 1/10, 1, 5, 10
(right column).]

Distributions, Jan 30, 2003 - 10 -


Continuous Distributions

Uniform distribution U(0, θ)
  Range: (0, θ)
  f(x) = (1/θ) · 1_(0,θ)(x)
  E(X) = θ/2
  var(X) = θ^2/12

Exponential distribution Exp(λ)
  Range: [0, ∞)
  f(x) = λ exp(−λx) · 1_[0,∞)(x)
  E(X) = 1/λ
  var(X) = 1/λ^2

Normal distribution N(µ, σ^2)
  Range: R
  f(x) = (1/√(2πσ^2)) exp( −(x − µ)^2 / (2σ^2) )
  E(X) = µ
  var(X) = σ^2

[Figure: histograms and boxplots of samples from U(0, θ), Exp(λ), and N(µ, σ^2).]

Distributions, Jan 30, 2003 - 11 -


Expected Value

Let X be a discrete random variable which takes values in SX =


{x1, x2, . . . , xn}

Expected Value or Mean of X:


E(X) = Σ_{i=1}^n xi p(xi)

Example: Roll one die


Let X be outcome of rolling one die. The frequency function is
p(x) = 1/6,   x = 1, . . . , 6,
and hence

E(X) = Σ_{x=1}^6 x/6 = 7/2 = 3.5

Example: Bernoulli random variable


Let X ∼ Bin(1, θ).

p(x) = θx (1 − θ)1−x

Thus the mean of X is

E(X) = 0 · (1 − θ) + 1 · θ = θ.

Expected Value and Variance, Feb 2, 2003 -1-


Expected Value

Linearity of the expected value


Let X and Y be two discrete random variables. Then

E(a X + b Y ) = aE(X) + bE(Y )


for any constants a, b ∈ R
Note: No independence is required.

Proof:
E(a X + b Y) = Σ_{x,y} (a x + b y) p(x, y)
             = a Σ_{x,y} x p(x, y) + b Σ_{x,y} y p(x, y)
             = a Σ_x x pX(x) + b Σ_y y pY(y)      (since Σ_y p(x, y) = pX(x))
             = a E(X) + b E(Y)

Example: Binomial distribution


Let X ∼ Bin(n, θ). Then X = X1 +. . .+Xn with Xi ∼ Bin(1, θ):
E(X) = Σ_{i=1}^n E(Xi) = Σ_{i=1}^n θ = nθ

Expected Value and Variance, Feb 2, 2003 -2-


Expected Value

Example: Poisson distribution

Let X be a Poisson random variable with parameter λ.

E(X) = Σ_{x=0}^∞ x (λ^x / x!) e^(−λ)
     = λ e^(−λ) Σ_{x=1}^∞ λ^(x−1) / (x − 1)!
     = λ e^(−λ) e^λ = λ

Remarks:
◦ For most distributions some “advanced” knowledge of calculus
is required to find the mean.
◦ Use tables for means of commonly used distribution.

Expected Value and Variance, Feb 2, 2003 -3-


Expected Value

Example: European Call Options


Agreement that gives an investor the right (but not the obliga-
tion) to buy a stock, bond, commodity, or other instruments at
a specific time at a specific price.

What is a fair price P for European call options?

If ST is the price of the stock at time T , the profit will be

Profit = (ST − K)+ − P,   where (x)+ = max(x, 0) and K is the strike price.

Profit is a random variable.


[Figure: profit (ST − K)+ − P of the option plotted against the stock price ST.]

Fair price P for this option is expected value

P = E(ST − K)+.

Expected Value and Variance, Feb 2, 2003 -4-


Expected Value

Example: European Call Options (contd)

Consider the following simple model:


◦ St = St−1 + εt, t = 1, . . . , T
◦ P(εt = 1) = p and P(εt = −1) = 1 − p.
St is also called a random walk.
The distribution of ST is given by (s0 known at time 0)

ST = s0 + 2 Y − T, with Y ∼ Bin(T, p)

Therefore the price P is (assuming s0 = 0 without loss of generality)


P = E(ST − K)+ = Σ_{y=0}^T (2y − T − K) p(y) · 1{y > (K+T)/2},
where p(y) is the frequency function of Bin(T, p).

Let T = 20, K = 10, p = 0.6


P = 2.75

[Figure: frequency function of the profit (ST − K)+ − P; the possible profits
range from −2.75 to 19.25 in steps of 2.]

Expected Value and Variance, Feb 2, 2003 -5-


Expected Value
Example: Group testing
Suppose that a large number of blood samples are to be screened for a rare
disease with prevalence 1 − p.

• If each sample is assayed individually, n tests will be required.


• Alternative scheme:
◦ n samples, m groups with k samples
◦ Split each sample in half and pool all samples in one group
◦ Test pooled sample for each group
◦ If test positive test all samples in group separately
What is the expected number of tests under this alternative scheme?
Let Xi be the number of tests in group i. The frequency function of Xi is
p(x) = p^k        if x = 1
       1 − p^k    if x = k + 1

The expected number of tests in each group is

E(Xi) = p^k + (k + 1)(1 − p^k) = k + 1 − k p^k


Hence

E(N) = Σ_{i=1}^m E(Xi) = n (1 + 1/k − p^k)

[Figure: plot of E(N)/n against the group size k.]

The mean is minimized for groups of size 11.

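The trade-off can be explored numerically. The sketch below (an added Python illustration) evaluates the expected number of tests per sample, 1 + 1/k − p^k, over a range of group sizes; the value p = 0.99 (prevalence 1%) is an assumption chosen for illustration, since the notes do not fix p.

    def expected_tests_per_sample(k, p):
        """E(N)/n = 1 + 1/k - p**k for groups of size k."""
        return 1 + 1 / k - p ** k

    p = 0.99                                   # assumed probability of a negative sample
    values = {k: expected_tests_per_sample(k, p) for k in range(2, 17)}
    best = min(values, key=values.get)
    print(best, round(values[best], 3))        # k = 11 minimizes E(N)/n for this p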
Expected Value and Variance, Feb 2, 2003 -6-


Variance

Let X be a random variable.

Variance of X:
var(X) = E( (X − E(X))^2 ).

The variance of X is the expected squared distance of X from its


mean.

Suppose X is a discrete random variable with SX = {x1, . . . , xn}.


Then the variance of X can be written as
var(X) = Σ_{i=1}^n ( xi − Σ_{j=1}^n xj p(xj) )^2 p(xi)

Example: Roll one die


X takes values in {1, 2, 3, 4, 5, 6} with frequency function p(x) = 1/6.

E(X) = Σ_{x=1}^6 x · 1/6 = 7/2

var(X) = Σ_{x=1}^6 (x − 7/2)^2 · 1/6
       = (1/6)(25/4 + 9/4 + 1/4 + 1/4 + 9/4 + 25/4) = 35/12

We often denote the variance of a random variable X by σX^2,

σX^2 = var(X),

and its standard deviation by σX .

Expected Value and Variance, Feb 2, 2003 -7-


Properties of the Variance

The variance can also be written as


var(X) = E(X^2) − E(X)^2

To see this (using linearity of the mean):

var(X) = E( (X − E(X))^2 )
       = E( X^2 − 2X E(X) + E(X)^2 )
       = E(X^2) − 2E(X)E(X) + E(X)^2 = E(X^2) − E(X)^2

Example: Let X ∼ Bin(1, θ). Then


var(X) = E(X^2) − E(X)^2
       = E(X) − E(X)^2 = θ − θ^2 = θ (1 − θ)

Rules for the variance:


◦ For constants a and b

var(aX + b) = a2var(X).

◦ For independent random variables X and Y

var(X + Y ) = var(X) + var(Y ).

Example: Let X ∼ Bin(n, θ). Then

var(X) = n θ (1 − θ)

Expected Value and Variance, Feb 2, 2003 -8-


Covariance

For independent random variables X and Y we have

var(X + Y ) = var(X) + var(Y ).

Question: What about dependent random variables?


It can be shown that

var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y )

where
 
cov(X, Y) = E( (X − E(X))(Y − E(Y)) )

is the covariance of X and Y .

Properties of the covariance


◦ cov(X, Y ) = E(XY ) − E(X) E(Y )
◦ cov(X, X) = var(X)
◦ cov(X, 1) = 0
◦ cov(X, Y ) = cov(Y, X)
◦ cov(a X1 + b X2, Y ) = a cov(X1, Y ) + b cov(X2, Y )

Expected Value and Variance, Feb 2, 2003 -9-


Covariance

Important:
cov(X, Y ) = 0 does NOT imply that X and Y are independent.

Example:
Suppose X ∈ {−1, 0, 1} with probabilities P(X = x) = 1/3 for
x = −1, 0, 1. Then E(X) = 0 and

cov(X, X^2) = E(X^3) = E(X) = 0.

On the other hand

P(X = 1, X^2 = 0) = 0 ≠ 1/9 = P(X = 1) P(X^2 = 0),

that is, X and X^2 are not independent!

Note: The covariance of X and Y measures only linear depen-


dence.

Expected Value and Variance, Feb 2, 2003 - 10 -


Correlation

The correlation coefficient ρ is defined as


ρXY = corr(X, Y) = cov(X, Y) / √( var(X) var(Y) ).

Properties:
◦ dimensionless quantity
◦ not affected by linear transformations, i.e.

corr(a X + b, c Y + d) = corr(X, Y )

◦ −1 ≤ ρXY ≤ 1
◦ ρXY = 1 if and only if P(Y = a + b X) = 1 for some a and b
◦ measures linear association between X and Y

Example: Three boxes: pp, pd, and dd (Ex 3.6)


Let Xi = 1{penny on ith draw}. Then Xi ∼ Bin(1, p) with p = 1/2 and
joint frequency function p(x1, x2):

x1\x2    0     1
0       1/3   1/6
1       1/6   1/3

Thus:

cov(X1, X2) = E[(X1 − p)(X2 − p)]
            = (1/4)·(1/3) + (1/4)·(1/3) − 2·(1/4)·(1/6) = 1/12

corr(X1, X2) = (1/12) / (1/4) = 1/3

Expected Value and Variance, Feb 2, 2003 - 11 -


Prediction

An instructor standardizes his midterm and final so the class aver-


age is µ = 75 and the SD is σ = 10 on both tests. The correlation
between the tests is always around ρ = 0.50.
◦ X - score of student on the first examination
◦ Y - score of student on the second examination
Since X and Y are dependent we should be able to predict the
score in the final from the midterm score.
Approach:
◦ Predict Y from linear function a + b X
◦ Minimize mean squared error
MSE = E( (Y − a − b X)^2 )
    = var(Y − b X) + ( E(Y − a − b X) )^2

Solution:
a = µ − b µ   and   b = σXY / σX^2 = ρ
Thus the best linear predictor is

Ŷ = µ + ρ (X − µ)

Note:
We expect the student’s score on the final to differ from the mean
only by half the difference observed in the midterm (regression to
the mean).

Expected Value and Variance, Feb 2, 2003 - 12 -


Summary

Bernoulli distribution - Bin(1, θ)
  p(x) = θ^x (1 − θ)^(1−x)                       E(X) = θ
                                                 var(X) = θ(1 − θ)

Binomial distribution - Bin(n, θ)
  p(x) = C(n, x) θ^x (1 − θ)^(n−x)               E(X) = nθ
                                                 var(X) = nθ(1 − θ)

Poisson distribution - Poiss(λ)
  p(x) = (λ^x / x!) e^(−λ)                       E(X) = λ
                                                 var(X) = λ

Geometric distribution
  p(x) = θ(1 − θ)^(x−1)                          E(X) = 1/θ
                                                 var(X) = (1 − θ)/θ^2

Hypergeometric distribution - H(N, M, n)
  p(x) = C(M, x) C(N − M, n − x) / C(N, n)       E(X) = nM/N

Expected Value and Variance, Feb 2, 2003 - 13 -


Properties of the Sample Mean

Consider X1, . . . , Xn independent and identically distributed (iid)


with mean µ and variance σ 2.
X̄ = (1/n) Σ_{i=1}^n Xi   (sample mean)

Then

E(X̄) = (1/n) Σ_{i=1}^n µ = µ

var(X̄) = (1/n^2) Σ_{i=1}^n σ^2 = σ^2/n
Remarks:
◦ The sample mean is an unbiased estimate of the true mean.
◦ The variance of the sample mean decreases as the sample size
increases.
◦ Law of Large Numbers: It can be shown that for n → ∞
X̄ = (1/n) Σ_{i=1}^n Xi → µ.

Question:
◦ How close to µ is the sample mean for finite n?
◦ Can we answer this without knowing the distribution of X?

Central Limit Theorem, Feb 4, 2004 -1-


Properties of the Sample Mean

Chebyshev’s inequality
Let X be a random variable with mean µ and variance σ 2.
Then for any ε > 0
P( |X − µ| > ε ) ≤ σ^2 / ε^2.

Proof: Let 1{|xi − µ| > ε} = 1 if |xi − µ| > ε, and 0 otherwise. Then

P( |X − µ| > ε ) = Σ_{i=1}^n 1{|xi − µ| > ε} p(xi)
                 = Σ_{i=1}^n 1{ (xi − µ)^2/ε^2 > 1 } p(xi)
                 ≤ Σ_{i=1}^n (xi − µ)^2/ε^2 · p(xi) = σ^2/ε^2

Application to the sample mean:


 
P( µ − 3σ/√n ≤ X̄ ≤ µ + 3σ/√n ) ≥ 1 − 1/9 ≈ 0.889

However: Chebyshev's bound is known to be not very precise.

Example: Xi iid N(0, 1)

X̄ = (1/n) Σ_{i=1}^n Xi ∼ N(0, 1/n)

Therefore

P( −3/√n ≤ X̄ ≤ 3/√n ) = 0.997

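The two numbers can be compared with a few lines of Python; Φ is evaluated via math.erf, so no external libraries are needed (my own sketch, not part of the notes).

    from math import erf, sqrt

    def Phi(z):
        """Standard normal cumulative distribution function."""
        return 0.5 * (1 + erf(z / sqrt(2)))

    chebyshev_bound = 1 - 1 / 9              # lower bound for P(|Xbar - mu| <= 3 sigma/sqrt(n))
    normal_exact = Phi(3) - Phi(-3)          # exact value when the Xi are N(0, 1)
    print(round(chebyshev_bound, 3), round(normal_exact, 3))   # 0.889 vs 0.997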
Central Limit Theorem, Feb 4, 2004 -2-


Central Limit Theorem

Let X1, X2, . . . be a sequence of random variables


◦ independent and identically distributed
◦ with mean µ and variance σ 2.
For n ∈ N define
Zn = √n (X̄ − µ)/σ = (1/√n) Σ_{i=1}^n (Xi − µ)/σ.

Zn has mean 0 and variance 1.

Central Limit Theorem


For large n, the distribution of Zn can be approximated by the
standard normal distribution N (0, 1). More precisely,
lim_{n→∞} P( a ≤ √n (X̄ − µ)/σ ≤ b ) = Φ(b) − Φ(a),

where Φ(z) is the standard normal probability

Φ(z) = ∫_{−∞}^z f(x) dx,

that is, the area under the standard normal curve to left of z.

Example:
◦ U1, . . . , U12 uniformly distributed on [0, 12).
◦ What is the probability that the sample mean exceeds 9?

P(Ū > 9) = P( √12 (Ū − 6)/√12 > 3 ) ≈ 1 − Φ(3) = 0.0013

Central Limit Theorem, Feb 4, 2004 -3-


Central Limit Theorem
[Figure: densities of the standardized sample mean Zn for samples from
U[0,1] (left column) and Exp(1) (right column), for n = 1, 2, 6, 12, 100,
compared with the standard normal density.]
Central Limit Theorem, Feb 4, 2004 -4-


Central Limit Theorem

Example: Shipping packages


Suppose a company ships packages that vary in weight:
◦ Packages have mean 15 lb and standard deviation 10 lb.
◦ They come from a large number of customers, i.e. packages are
independent.
Question: What is the probability that 100 packages will have a
total weight exceeding 1700 lb?
Let Xi be the weight of the ith package and
T = Σ_{i=1}^{100} Xi.

Then
 
P(T > 1700 lb) = P( (T − 1500 lb)/(√100 · 10 lb) > (1700 lb − 1500 lb)/(√100 · 10 lb) )
               = P( (T − 1500 lb)/(√100 · 10 lb) > 2 )
               ≈ 1 − Φ(2) = 0.023

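The same normal approximation in a small Python sketch (my own illustration; Φ is again computed from math.erf):

    from math import erf, sqrt

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    mu, sigma, n = 15, 10, 100
    mean_T, sd_T = n * mu, sqrt(n) * sigma      # T has mean 1500 lb and sd 100 lb
    z = (1700 - mean_T) / sd_T
    print(round(1 - Phi(z), 3))                 # about 0.023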
Central Limit Theorem, Feb 4, 2004 -5-


Central Limit Theorem

Remarks
• How fast approximation becomes good depends on distribution
of Xi’s:
◦ If it is symmetric and has tails that die off rapidly, n can
be relatively small.
iid
Example: If Xi ∼ U [0, 1], the approximation is good for
n = 12.
◦ If it is very skewed or if its tails die down very slowly, a
larger value of n is needed.
Example: Exponential distribution.
• Central limit theorems are very important in statistics.
• There are many central limit theorems covering many situa-
tions, e.g.
◦ for not identically distributed random variables or
◦ for dependent, but not “too” dependent random variables.

Central Limit Theorem, Feb 4, 2004 -6-


The Normal Approximation to the Binomial

Let X be binomially distributed with parameters n and p.

Recall that X is the sum of n iid Bernoulli random variables,


P
n
iid
X= Xi , Xi ∼ Bin(1, p).
i=1

Therefore we can apply the Central Limit Theorem:

Normal Approximation to the Binomial Distribution



For n large enough, X is approximately N(np, np(1 − p)) distributed:

P(a ≤ X ≤ b) ≈ P(a − 1/2 ≤ Z ≤ b + 1/2)

where Z ∼ N(np, np(1 − p)).

Rule of thumb for n: np > 5 and n(1 − p) > 5.

In terms of the standard normal distribution we get

P(a ≤ X ≤ b) ≈ P( (a − 1/2 − np)/√(np(1 − p)) ≤ Z′ ≤ (b + 1/2 − np)/√(np(1 − p)) )
            = Φ( (b + 1/2 − np)/√(np(1 − p)) ) − Φ( (a − 1/2 − np)/√(np(1 − p)) )

where Z′ ∼ N(0, 1).

Central Limit Theorem, Feb 4, 2004 -7-


The Normal Approximation to the Binomial
[Figure: frequency functions of Bin(n, 0.5) for n = 1, 2, 5, 10, 20 (left column)
and Bin(n, 0.1) for n = 1, 5, 10, 20, 50 (right column), illustrating how the
binomial distribution approaches a normal shape as n grows.]

Central Limit Theorem, Feb 4, 2004 -8-


The Normal Approximation to the Binomial

Example: The random walk of a drunkard


Suppose a drunkard executes a “random” walk in the following
way:
◦ Each minute he takes a step north or south, with probability 1/2 each.
◦ His successive step directions are independent.
◦ His step length is 50 cm.

How likely is he to have advanced 10 m north after one hour?


◦ Position after one hour: X · 1 m − 30 m
◦ X binomially distributed with parameters n = 60 and p = 1/2

◦ X is approximately normal with mean 30 and variance 15:

P(X · 1 m − 30 m ≥ 10 m)
= P(X ≥ 40)
≈ P(Z > 39.5),   Z ∼ N(30, 15)
= P( (Z − 30)/√15 > 9.5/√15 )
= 1 − Φ(2.452) = 0.007

How does the probability change if he has some idea of where he
wants to go and steps north with probability p = 2/3 and south with
probability 1/3?

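Exact binomial tails put the approximation in perspective and also answer the follow-up question. The Python sketch below is an added illustration; it computes P(X ≥ 40) for Bin(60, 1/2) and for Bin(60, 2/3), the biased walk asked about above.

    from math import comb, erf, sqrt

    def binom_tail(k, n, p):
        """P(X >= k) for X ~ Bin(n, p)."""
        return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    n = 60
    print(binom_tail(40, n, 1/2))              # exact tail, close to 0.007
    print(1 - Phi((39.5 - 30) / sqrt(15)))     # normal approximation, about 0.007
    print(binom_tail(40, n, 2/3))              # chance for the biased walker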
Central Limit Theorem, Feb 4, 2004 -9-


Estimation

Example: Cholesterol levels of heart-attack patients


Data: Observational study at a Pennsylvania medical center
◦ blood cholesterol levels of patients treated for heart attacks
◦ measurements 2, 4, and 14 days after the attack

Id Y1 Y2 Y3 Id Y1 Y2 Y3
1 270 218 156 15 294 240 264
2 236 234 193 16 282 294 220
3 210 214 242 17 234 220 264
4 142 116 120 18 224 200 213
5 280 200 181 19 276 220 188
6 272 276 256 20 282 186 182
7 160 146 142 21 360 352 294
8 220 182 216 22 310 202 214
9 226 238 248 23 280 218 170
10 242 288 298 24 278 248 198
11 186 190 168 25 288 278 236
12 266 236 236 26 288 248 256
13 206 244 238 27 244 270 280
14 318 258 200 28 236 242 204

Aim: Make inference on distribution of


◦ cholesterol level 14 days after the attack: Y3
◦ decrease in cholesterol level: D = Y1 − Y3
◦ relative decrease in cholesterol level: R = (Y1 − Y3)/Y3

Confidence intervals I, Feb 11, 2004 -1-


Estimation

Data:

d1, . . . , d28 observed decrease in cholesterol level

In this example, parameters of interest might be

µD = E(D) the mean decrease in cholesterol level,


2
σD = var(D) the variation of the cholesterol level,
pD = P(D ≤ 0) probability of no decrease in cholesterol level

These parameters are naturally estimated by the following sample


statistics:
µ̂D = (1/n) Σ_{i=1}^n di                   (sample mean)

σ̂D^2 = (1/n) Σ_{i=1}^n (di − d̄)^2          (sample variance)

p̂D = #{di | di ≤ 0} / n                    (sample proportion)
Such statistics are point estimators since they estimate the corre-
sponding parameter by a single numerical value.
◦ Point estimates provide no information about their chance vari-
ation.
◦ Estimates without an indication of their variability are of lim-
ited value.

Confidence intervals I, Feb 11, 2004 -2-


Confidence Intervals for the Mean

Recall:
◦ CLT for the sample mean: For large n we have

  X̄ ≈ N(µ, σ^2/n)

◦ 68-95-99.7 rule: With 95% probability the sample mean differs from
  its mean µ by less than two standard deviations.

More precisely, we have

P( µ − 1.96 σ/√n ≤ X̄ ≤ µ + 1.96 σ/√n ) = 0.95,

or equivalently, after rearranging the terms,

P( X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n ) = 0.95.

Interpretation: There is 95% probability that the random in-


terval
[ X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n ]
will cover the mean µ.

Example: Cholesterol levels

d¯ = 36.89, σ = 51.00, n = 28.

Therefore, the 95% confidence interval for µ is

[18.00, 55.78].

Confidence intervals I, Feb 11, 2004 -3-


Confidence Intervals for the Mean

Assumption: The population standard deviation σ is known.

◦ In the next lecture, we will drop this unrealistic assumption.


◦ Assumption is approximately satisfied for large sample sizes,
since then σ̂ ≈ σ by the law of large numbers.

Definition: Confidence interval for µ (σ known)


The interval
[ X̄ − zα/2 σ/√n , X̄ + zα/2 σ/√n ]

is called a (1 − α) confidence interval for the population mean
µ. (1 − α) is the confidence level.
For large sample sizes n, an approximate (1 − α) confidence
interval for µ is given by

[ X̄ − zα/2 σ̂/√n , X̄ + zα/2 σ̂/√n ].

Here, zα is the α-critical value of the standard normal distribution:


◦ zα has area α to its right
◦ Φ(zα) = 1 − α

[Figure: standard normal density with the area α to the right of zα shaded.]

Confidence intervals I, Feb 11, 2004 -4-


Confidence Interval for the Mean

Example: Community banks


◦ Community banks are banks with less than a billion dollars of assets.
◦ Approximately 7500 such banks in the United States.

Annual survey of the Community Bankers Council of the American Bankers


Association (ABA)
◦ Population: Community banks in the United States.
◦ Variable of interest: Total assets of community banks.
◦ Sample size: n = 110
◦ Sample mean: X̄ = 220 millions of dollars
◦ Sample standard deviation: SD = 161 millions of dollars
◦ Histogram of sampled values:
[Figure: "Assets of Community Banks in the U.S." - histogram of the assets
(in millions of dollars) for the sample of 110 community banks.]

Suppose we want to give a 95% confidence interval for the mean total assets
of all community banks in the United States.
◦ α = 0.05, zα/2 = 1.96
A 95% confidence interval for the mean assets (in millions of dollars) is
 
161 161 
220 − 1.96 · √ , 220 + 1.96 · √ ≈ 190, 250].
110 110

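The interval can be reproduced in a couple of lines; a Python sketch I added for illustration, with zα/2 = 1.96 taken from the slide:

    from math import sqrt

    xbar, s, n, z = 220, 161, 110, 1.96
    half_width = z * s / sqrt(n)
    print(round(xbar - half_width), round(xbar + half_width))   # about (190, 250)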
Confidence intervals I, Feb 11, 2004 -5-


Sample Size

Example: Cholesterol levels


Suppose we want a 99% confidence interval for the decrease in
cholesterol level:
◦ α = 0.01, z0.005 = 2.58
◦ The 99% confidence interval for µD is
[ 36.89 − 2.58 · 50.93/√28 , 36.89 + 2.58 · 50.93/√28 ] ≈ [12.06, 61.72].

Note: If we raise the confidence level, the confidence interval


becomes wider.
Suppose we want to increase the confidence level without
increasing the error of estimation (indicated by the half-width of
the confidence interval). For this we have to increase the sample
size n.

Question: What sample size n is needed to estimate the mean


decrease in cholesterol with error e = 20 and confidence level 99%?
The error (half-width of the confidence interval) is

e = zα/2 σ/√n

Therefore the sample size ne needed is given by

ne ≥ ( zα/2 σ / e )^2 = ( 2.58 · 50.93 / 20 )^2 = 43.16,
that is, a sample of 44 patients is needed to estimate µD with error
e = 20 and 99% confidence.

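The sample-size formula is just the half-width equation solved for n; in Python (my own sketch, using ceil because n must be an integer):

    from math import ceil

    z, sigma, e = 2.58, 50.93, 20
    n_needed = (z * sigma / e) ** 2
    print(round(n_needed, 2), ceil(n_needed))    # 43.16 -> sample 44 patients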
Confidence intervals I, Feb 11, 2004 -6-


Estimation of the Mean

Example: Banks’ loan-to-deposit ratio


The ABA survey of community banks also asked about the loan-to-deposit
ratio (LTDR), a bank’s total loans as a percent of its total deposits.

Sample statistics:
◦ n = 110
◦ µ̂LTDR = 76.7
◦ σ̂LTDR = 12.3

[Figure: "Loan-To-Deposit Ratio of Community Banks" - histogram of the LTDR
(in %) for the sample of 110 community banks.]

Construction of 95% confidence interval:


◦ α = 0.05, zα/2 = 1.96
◦ Standard error σX̄ = σLTDR/√n = 1.17
◦ 95% confidence interval for µLTDR:

  [ X̄ − zα/2 σLTDR/√n , X̄ + zα/2 σLTDR/√n ] = [74.4, 79.0]

◦ To get an estimate with error e = 3.0 (half-width of the confidence
  interval) it suffices to sample ne banks,

  ne ≥ ( zα/2 σLTDR / e )^2 = ( 1.96 · 12.3 / 3.0 )^2 = 64.6.

  Thus a sample of ne = 65 banks is sufficient.

Confidence intervals I, Feb 11, 2004 -7-


Confidence intervals

Definition: Confidence interval


A (1 − α) confidence interval for a parameter is an interval that
◦ depends only on sample statistics and
◦ covers the parameter with probability (1 − α)

Note:
◦ Confidence intervals are random while the estimated parameter
is fixed.
◦ For repeated samples, only (1 − α) · 100% of the confidence intervals will
cover the true parameter.

Confidence intervals II, Feb 13, 2004 -1-


Confidence Intervals for the Mean
Suppose that X1, . . . , Xn iid N(µ, σ^2). Then

(X̄ − µ)/(σ/√n) ∼ N(0, 1)                     (*)

Assuming that σ is known, we obtain

[ X̄ − zα/2 · σ/√n , X̄ + zα/2 · σ/√n ]

as (1 − α) confidence interval for µ.

More realistic situation: σ is unknown.

Approach: Replace σ by the estimate σ̂ = s.

This approach leads to the t statistic

T = (X̄ − µ)/(s/√n) ∼ tn−1.

It is t distributed with n − 1 degrees of freedom.

[Figure: densities of the t1, t3, and t10 distributions compared with N(0, 1).]

Confidence interval for the mean µ (σ unknown)


The interval
[ X̄ − tn−1,α/2 · s/√n , X̄ + tn−1,α/2 · s/√n ]

is a (1 − α) confidence interval for the mean µ.

Notation: Critical values of distributions

zα standard normal distribution


tn,α t distribution with n degrees of freedom

Confidence intervals II, Feb 13, 2004 -2-


Confidence Intervals for the Mean

Example: Cholesterol levels


In the study on cholesterol levels, the standard deviation of the decrease
of cholesterol level was unknown.
◦ µ̂D = 36.89, σ̂D = 50.94
◦ t27,0.025 = 2.05
◦ Then
[ 36.89 − 2.05 · 50.94/√27 , 36.89 + 2.05 · 50.94/√27 ] = [16.78, 57.01]
is a 95% confidence interval for µD
◦ The large sample confidence interval based on (*) was [18.00,55.78].

Example: Level of vitamin C


The following data are the amounts of vitamin C, measured in milligrams
per 100 grams (mg/100 g) of corn soy blend, for a random sample of size 8
from a production run:

26 31 23 22 11 22 14 31

What is the 95% confidence interval for µ, the mean vitamin C content of
the CSB produced during this run?
◦ µ̂ = 22.5, σ̂ = 7.2, t7,0.025 = 2.36
◦ The 95% confidence interval for µ is
 
[ 22.5 − 2.36 · 7.2/√8 , 22.5 + 2.36 · 7.2/√8 ] = [16.5, 28.5].
◦ The large sample CI would be [17.5, 27.5].

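With the critical value t7,0.025 = 2.36 quoted above, the interval follows directly from the data. A Python sketch I added for illustration, computing both the t interval and the large-sample z interval:

    from math import sqrt

    data = [26, 31, 23, 22, 11, 22, 14, 31]
    n = len(data)
    xbar = sum(data) / n
    s = sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))      # sample standard deviation

    for crit in (2.36, 1.96):                                   # t(7) and normal critical values
        half = crit * s / sqrt(n)
        print(round(xbar - half, 1), round(xbar + half, 1))     # [16.5, 28.5] and [17.5, 27.5]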
Confidence intervals II, Feb 13, 2004 -3-


Confidence Intervals for the Variance
For normally distributed data X1, . . . , Xn iid N(µ, σ^2), the ratio

(n − 1) · s^2 / σ^2

has a χ^2 distribution with n − 1 degrees of freedom.

The (1 − α) confidence interval for σ^2 is

[ (n − 1) · s^2 / χ^2_{n−1,α/2} , (n − 1) · s^2 / χ^2_{n−1,1−α/2} ],

where χ^2_{n−1,α} is the critical value of the χ^2_{n−1} distribution with area α to its right.


Caution: This confidence interval is not robust against depar-
tures from normality regardless of the sample size.
Example: Cholesterol levels
Suppose we are interested in the variance of Y3, the cholesterol level 14
days after the attack.
◦ Normal probability plot:
[Figure: normal probability plot of the cholesterol levels Y3 against normal quantiles.]

Data seem to be normally distributed.


◦ s^2 = 2030.55, χ^2_{27,0.975} = 14.57, χ^2_{27,0.025} = 43.19
◦ The 95% confidence interval for σ^2 is

  [ 27 · 2030.55 / 43.19 , 27 · 2030.55 / 14.57 ] = [1269.26, 3761.99]

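Given the two χ² critical values quoted above, the variance interval is a one-line calculation; a Python sketch I added for illustration (small differences from the slide's endpoints come only from rounding):

    s2, df = 2030.55, 27
    chi2_lower_crit, chi2_upper_crit = 43.19, 14.57     # chi2(27) critical values from the slide

    ci = (df * s2 / chi2_lower_crit, df * s2 / chi2_upper_crit)
    print(tuple(round(v, 1) for v in ci))               # about (1269.4, 3762.9)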
Confidence intervals II, Feb 13, 2004 -4-


Statistical Tests

Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men,
all equally qualified. Further suppose that the company hired 2 women
and 8 men.

Question:
◦ Does the company discriminate against female job applicants?
◦ How likely is this outcome under the assumption that the company
does not discriminate?

Example:

◦ Study success of new elaborate safety program


◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11

Question:
◦ Does the safety program have an effect on the loss of labor due to accidents?
◦ In 9 out of 10 plants the average weekly losses have decreased after
implementation of the safety program. How likely is this (or a more
extreme) outcome under the assumption that there is no difference
before and after implementation of the safety program?

Testing Hypotheses I, Feb 16, 2004 -1-


Statistical Tests

Example: Fair coin


Suppose we have a coin. We suspect it might be unfair. We devise a
statistical experiment:
◦ Toss coin 100 times
◦ Conclude that coin is fair if we see between 40 and 60 heads
◦ Otherwise decide that the coin is not fair

Let θ be the probability that the coin lands heads, that is,

P(Xi = 1) = θ and P(Xi = 0) = 1 − θ.


Our suspicion (“coin not fair”) is a hypothesis about the population pa-
rameter θ (θ ≠ 1/2) and thus about P. We emphasize this dependence of P
on θ by writing Pθ .

Decision problem:
Null hypothesis H0 :         X ∼ Bin(100, 1/2)
Alternative hypothesis Ha :  X ∼ Bin(100, θ), θ ≠ 1/2

The null hypothesis represents the default belief (here: the coin is fair).
The alternative is the hypothesis we accept in view of evidence against the
null hypothesis.
The data-based decision rule

reject H0           if X ∉ [40, 60]
do not reject H0    if X ∈ [40, 60]

is called a statistical test for the test problem H0 vs. Ha .

Testing Hypotheses I, Feb 16, 2004 -2-


Statistical Tests

Example: Fair coin (contd)


Note: It is possible to obtain e.g. X = 55 (or X = 65)
◦ with probability 0.048 (resp. 0.0009) if p = 0.5
◦ with probability 0.048 (resp. 0.049) if p = 0.6
◦ with probability 0.0005 (resp. 0.047) if p = 0.7

[Figure: frequency functions of Bin(100, 0.5), Bin(100, 0.6), and Bin(100, 0.7),
with the acceptance region [40, 60] and the rejection region of the test
of H0 marked.]

Testing Hypotheses I, Feb 16, 2004 -3-


Types of errors

Example: Fair coin (contd)


It is possible that the test (decision rule) gives a wrong answer:
◦ If θ = 0.7 and x = 55, we do not reject the null hypothesis that the
coin is fair although the coin in fact is not fair.
◦ If θ = 0.5 and x = 65, we reject the null hypothesis that the coin is fair
although the coin in fact is fair.

The following table lists the possibilities:

Decision H0 true H0 false


Reject H0 type I error correct decision
Accept H0 correct decision type II error

Definition (Types of error)


◦ If we reject H0 when in fact H0 is true, this is a Type I error.
◦ If we do not reject H0 when in fact H0 is false, this is a Type II error.

Testing Hypotheses I, Feb 16, 2004 -4-


Types of errors

Question: How good is our decision rule?


For a good decision rule, the probability of committing an error of either
type should be small.

Probability of type I error: α


If the null hypothesis is true, i.e. θ = 1/2, then

  Pθ(reject H0) = Pθ(X ∉ [40, 60])
               = 1 − Pθ(X ∈ [40, 60])
               = 1 − Σ_{x=40}^{60} C(100, x) (1/2)^100 = 0.035.

Thus the probability of a type I error, denoted as α, is 3.5%.
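In STATA this probability can be obtained from the cumulative binomial distribution (a sketch; binomial(n, k, p) returns P(X ≤ k)):

. display 1 - binomial(100, 60, 0.5) + binomial(100, 39, 0.5)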


Probability of type II error: β(θ)
If the null hypothesis is false and the true probability of observing “head”
is θ with θ ≠ 1/2, then

  Pθ(accept H0) = Pθ(X ∈ [40, 60]) = Σ_{x=40}^{60} C(100, x) θ^x (1 − θ)^{100−x}

Thus, the probability of an error of type II depends on θ. It will be denoted as β(θ).

Testing Hypotheses I, Feb 16, 2004 -5-


Power of Tests

Question: How good is our test in detecting the alternative?


Consider the probability of rejecting H0

  Pθ(reject H0) = Pθ(X ∉ [40, 60]) = 1 − Pθ(accept H0) = 1 − β(θ).

Note:
◦ If θ = 1/2, this is the probability of committing an error of type I:
  1 − β(1/2) = α
◦ If θ ≠ 1/2, this is the probability of correctly rejecting H0.

Definition (Power of a test)


We call 1 − β(θ) the power of the test as it measures the ability to
detect that the null hypothesis is false.

(Figure: power curve 1 − β(θ) over θ ∈ [0, 1] for the test that rejects if X ∉ [40, 60].)
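Under the same assumption about the binomial() function, the power at individual values of θ can be sketched the same way, e.g.

. display 1 - binomial(100, 60, 0.6) + binomial(100, 39, 0.6)
. display 1 - binomial(100, 60, 0.7) + binomial(100, 39, 0.7)

which give roughly 0.46 and 0.98, consistent with the shape of the power curve.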

Testing Hypotheses I, Feb 16, 2004 -6-


Significance Tests

Idea: minimize probability of committing an error of type I and II


Different probabilities of type I error

(Figure: power curves 1 − β(θ) for the tests that reject if X ∉ [40, 60], X ∉ [38, 62], and X ∉ [42, 58].)

Note: If we decrease the probability of a type I error,


◦ the power of the test, 1 − β(θ), decreases as well and
◦ the probability of a type II error increases.
Problem: cannot minimize both errors simultaneously

Solution:
◦ choose fixed level α for probability of a type I error
◦ under this restriction find test with small probability of a type II error
Remark:
◦ you do not have to do this minimization yourself.
◦ all tests taught in this course are of this kind.
Definition
A test of this kind is called a significance test with significance level α.

Testing Hypotheses I, Feb 16, 2004 -7-


Statistical Hypotheses

A statistical hypothesis is an assertion or conjecture about a population,


which may be expressed in terms of
◦ some parameter: mean is zero;
◦ some parameters: mean and median are identical; or
◦ some sampling distribution: this sample is normally distributed.

Test problem - decide between two hypotheses


◦ the null hypothesis H0 and
◦ the alternative hypothesis Ha .

Popperian approach to scientific theories


◦ Scientific theories are subject to falsification.
◦ It is impossible to verify a scientific theory.

Null hypothesis H0
default (current) theory which we try to falsify

Alternative hypothesis Ha
alternative to adopt if null hypothesis is rejected

Examples:

◦ Clinical study of new drug - H0 : drug has no effect


◦ Criminal case - H0 : suspect is not guilty
◦ Safety test of nuclear power station - H0 : power station is not safe
◦ Chances of new investment - H0 : project not profitable
◦ Testing for independence - H0 : random variables are independent

Testing Hypotheses II, Feb 18, 2004 -1-


Statistical Tests

Example: Testing for pesticide in discharge water


Suppose the Environmental Protection Agency takes 10 readings on the
amount of pesticide in the discharge water of a chemical company.
Question: Does the concentration cP of pesticide in the water exceed the
allowed maximum concentration c0 ?
◦ Before taking action against the company, the agency must have some
evidence that the concentration cP exceeds the allowed level.
◦ Without evidence the agency assumes that the pesticide concentration
cP is within the limits of the law.
Consequently, the null hypothesis of the agency is that the pesticide con-
centration cP does not exceed c0 . Thus the question corresponds to the
test problem

H0 : cP ≤ c0 vs Ha : cP > c0 .

Suppose that the company regularly also runs tests on the amount of pes-
ticide in the discharge water.
Question: Does the concentration cP of pesticide in the water exceed the
allowed maximum concentration c0 ?
◦ The aim of the company is to avoid fines for exceeding the allowed
level. Thus the company wants to make sure that the concentration
stays within the allowed limits.
Thus, the null hypothesis of the company should be that the pesticide
concentration cP exceeds c0 . The question now corresponds to the test
problem

H0 : cP ≥ c0 vs Ha : cP < c0 .

Testing Hypotheses II, Feb 18, 2004 -2-


Six Steps of Conducting a Test

Steps of a significance test

1. Determine null hypothesis H0 and alternative Ha .

2. Decide on probability of type I error, the significance level α.

3. Find an appropriate test statistic T .

4. Based on the sampling distribution of T , formulate a criterion for


testing H0 against Ha .

5. Calculate value of the test statistic T .

6. Decide whether or not to reject the null hypothesis H0 .

Example: Fair coin (contd)


We want to decide from 100 tosses of a coin whether it is fair or not. Let
θ be the probability of heads.

1. Test problem:
   H0 : θ = 1/2 vs Ha : θ ≠ 1/2

2. Significance level:
α = 0.05 (most commonly used significance level)

3. Test statistic:
T =X (number of heads in 100 tosses of the coin)

4. Rejection criterion:
reject H0 if T ∉ [40, 60]

5. Observed value of test statistic: Suppose after 100 tosses we obtain


t = 55

6. Decision: Since 55 does not lie in the rejection region, we


do not reject H0.
Testing Hypotheses II, Feb 18, 2004 -3-
One and Two-sided Hypotheses

Example: Blood cholesterol after a heart attack


Suppose we are interested in whether the blood cholesterol level two days
after a heart attack differs from the average cholesterol level in the (general)
population (µ0 = 193).
Two cases:
◦ We are interested in any difference from the population mean µ0 . Then
we have a two-sided test problem

H0 : µY1 = µ0 vs Ha : µY1 ≠ µ0.

◦ We suspect that the cholesterol level after a heart attack might be


higher than in the general population. In this case, we have a one-sided
test problem

H0 : µY1 = µ0 vs Ha : µY1 > µ0.

Remark:
◦ More generally, we might be interested in one-sided test problems of
the form

H0 : µY1 ≤ µ0 vs Ha : µY1 > µ0,

which accounts for the possibility that µ might be smaller than µ0 .


◦ For all common test situations (in particular those discussed in this
course), the form of the test does not depend on the form of H0 , but
only on the parameter value in H0 that is closest to Ha , that is µ0 .

Testing Hypotheses II, Feb 18, 2004 -4-


Test Statistic

Let θ be the parameter of interest.


Two-sided test problem

H0 : θ = θ0 against Ha : θ ≠ θ0

One-sided test problem

H0 : θ = θ0 against Ha : θ > θ0 (or Ha : θ < θ0)

Suppose that θ̂ is an estimate for θ.


◦ If θ = θ0 (null hypothesis), we expect the estimate θ̂ to take a value
near θ0.
◦ Large deviations from θ0 are evidence against H0 .
This suggests the following decision rules:
◦ Ha : θ > θ0: reject H0 if θ̂ − θ0 is much larger than zero
◦ Ha : θ < θ0: reject H0 if θ̂ − θ0 is much smaller than zero
◦ Ha : θ ≠ θ0: reject H0 if |θ̂ − θ0| is much larger than zero
Problem: Often the sampling distribution of the estimate θ̂ depends on the
unknown parameter θ.
Definition (Test statistic)
A test statistic is a random variable
◦ that measures the compatibility between the null hypothesis and the
data and
◦ has a sampling distribution which we know (under H0 ).

Testing Hypotheses II, Feb 18, 2004 -5-


Test Statistic

Example: Blood cholesterol after a heart attack


Data: X1, . . . , X28
◦ blood cholesterol level of 28 patients two days after a heart attack
◦ assumed to be normally distributed with mean µX and variance σ²_X
The parameter µX can be estimated by the sample mean
  X̄ = (1/28) Σ_{i=1}^{28} Xi ∼ N(µX , σ²_X/28).

This suggests using the standardized sample mean as a test statistic,

  (X̄ − µ0)/(σ/√28) ∼ N(0, 1) (under H0).

Test H0 : µ ≤ 193 vs Ha : µ > 193 at significance level α = 0.05


◦ Test statistic: Assume σ = 47.7 to be known.
  T = (X̄ − µ0)/(σ/√28)

◦ Rejection criterion: Reject H0 if T > z0.05 = 1.645


◦ Outcome of test: Since the observed value of T is
  t = (253.9 − 193)/(47.7/√28) = 6.76,

we reject the null hypothesis that µ = 193.
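The observed value can be reproduced with display (a sketch):

. display (253.9 - 193)/(47.7/sqrt(28))

which gives 6.76, well above the critical value 1.645.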

Testing Hypotheses II, Feb 18, 2004 -6-


Tests for the Mean

Tests for the mean µ (σ 2 known):


◦ Test statistic:
  T = (X̄ − µ0)/(σ/√n)
◦ Two sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > zα/2
◦ One sided tests:
H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0 )
reject H0 if T > zα (T < −zα )

Tests for the mean µ (σ 2 unknown):


◦ Test statistic:
  T = (X̄ − µ0)/(s/√n)
◦ Two sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > tn−1,α/2
◦ One sided tests:
H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0 )
reject H0 if T > tn−1,α (T < −tn−1,α )

Example: Blood cholesterol after a heart attack


Estimating the standard deviation from the data, we obtain the test statis-
tic
  T = (X̄ − µ0)/(s/√28) ∼ t_27.

Noting that t27,0.05 = 1.703 and t = 6.76, we still reject H0 .
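With the summary statistics the same t test can be run via the immediate command ttesti (a sketch; 253.9 and 47.7 are the sample mean and standard deviation used above):

. ttesti 28 253.9 47.7 193

The one-sided P-value appears in the output under Ha: mean > 193.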

Testing Hypotheses II, Feb 18, 2004 -7-


Tests and Confidence Intervals

Consider a level α significance test for the two-sided test problem

  H0 : θ = θ0 vs Ha : θ ≠ θ0.

Let
◦ T = Tθ0 (X) be the test statistic of the test (depends on θ0)
◦ R be the critical region of the test

Then

  C(X) = {θ : Tθ(X) ∉ R}

is a (1 − α) confidence interval for θ: If θ is the true parameter, then

  Pθ(θ ∈ C(X)) = Pθ(Tθ(X) ∉ R) = 1 − Pθ(Tθ(X) ∈ R) = 1 − α.

We have

  θ0 ∈ C(X) ⇔ Tθ0(X) ∉ R ⇔ H0 is not rejected

Result A level α two-sided significance test rejects the null hypothesis


H0 : θ = θ0 if and only if the parameter θ0 falls outside a (1 − α)
confidence interval for θ.

Example: Normal distribution


Let X1, . . . , Xn iid∼ N(µ, σ²). We reject H0 : µ = µ0 if

  |X̄ − µ0| / (s/√n) > t_{n−1,α/2}

or equivalently

  |X̄ − µ0| > t_{n−1,α/2} · s/√n.

Rearranging terms, we find that we reject if

  µ0 ∉ [ X̄ − t_{n−1,α/2} · s/√n , X̄ + t_{n−1,α/2} · s/√n ].
Testing Hypotheses II, Feb 18, 2004 -8-
The P -value

Definition (P -value)
The probability that under the null hypothesis H0 the test statistic
would take a value as extreme as or more extreme than that actually
observed is called the P -value of the test.

The P -value is often interpreted as a measure of the strength of evidence


against the null hypothesis: the smaller the P -value, the stronger the evi-
dence.
However:
◦ The P -value is a random variable (under H0 uniformly distr. on [0, 1]).
◦ Without a measure of its variability it is not safe to interpret the actu-
ally observed P -value.
◦ If the P -value is smaller than the chosen significance level α, we reject
the null hypothesis H0 .

Three approaches to deciding on test problem:


◦ reject if θ0 ∉ C(X)
/ C(X)
◦ reject if T (X) ∈ R
◦ reject if P -value p ≤ α

Example: Blood cholesterol after a heart attack


The observed value of the test statistic

  T = (X̄ − µ0)/(s/√28) ∼ t_27

is t = 6.76. The corresponding P -value is

  P(T > 6.76) = 1.47 · 10^−7.


We thus reject the null hypothesis.
Equivalently, the confidence interval for µ is [235.43, 272.42]. Since it does
not contain µ0 = 193 we reject H0 (for the third and last time!).
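The P -value can be checked with the ttail() function (a sketch):

. display ttail(27, 6.76)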
Testing Hypotheses II, Feb 18, 2004 -9-
Example

Data: Banks’ net income

◦ percent change in net income between first half of last year and first
half of this year
◦ sample mean x̄ = 8.1%
◦ sample standard deviation s = 26.4%

Test problem: H0 : µ = 0 against Ha : µ 6= 0


. ttesti 110 8.1 26.4 0
One-sample t test
------------------------------------------------------------------
| Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
----+-------------------------------------------------------------
x | 110 8.1 2.517141 26.4 3.111108 13.08889
------------------------------------------------------------------
Degrees of freedom: 109
Ho: mean(x) = 0
Ha: mean < 0 Ha: mean != 0 Ha: mean > 0
t = 3.2179 t = 3.2179 t = 3.2179
P < t = 0.9991 P > |t| = 0.0017 P > t = 0.0009

Critical value of t distribution with 109 degrees of freedom:

t109,0.025 = 1.982

Result:
◦ |t| > t109,0.025, therefore the test rejects H0 at significance level α = 0.05.
◦ Equivalently, µ0 = 0 ∉ [3.11, 13.09] and thus the test rejects H0.
◦ Equivalently, P -value is less than α = 0.05 and thus the test rejects H0 .

Testing Hypotheses II, Feb 18, 2004 - 10 -


Exact Binomial Test

Example: Fair coin


Data: 100 tosses of a coin which we suspect might be unfair.
Modelling:
◦ θ is the probability that the coin lands heads up
◦ X is the number of heads in 100 tosses of the coin
◦ X is binomially distributed with parameters n and θ.
Decision problem:
◦ Null hypothesis H0 : coin is fair
◦ Alternative hypothesis Ha : coin is unfair

Test problem:
  H0 : θ = 1/2 vs Ha : θ ≠ 1/2.
Under the null hypothesis H0, the distribution of X is known,
  X ∼ Bin(100, 1/2).
Reject null hypothesis if

  X ∉ [ b_{100,0.5,0.975} , b_{100,0.5,0.025} ] = [40, 60],

where b_{n,θ,α} is the α fractile of Bin(n, θ).

Note:
◦ Exact binomial tests typically have an actual significance level smaller
than the nominal α due to the discreteness of the distribution.
◦ In the above example, the probability of a type I error is

P(reject H0) = α = 0.035.
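In STATA the exact binomial test is available as an immediate command (a sketch):

. bitesti 100 55 0.5

bitesti reports exact one- and two-sided P-values for the observed count (here 55 heads in 100 tosses).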

Testing Hypotheses III, Feb 20, 2004 -1-


Sign Test

Example: Safety program


◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11

Question:
◦ Does the safety program have an effect on the loss of labour due to accidents?
The Sign Test for matched pairs

◦ Ignore pairs with difference 0


◦ Number of trials n is the count of the remaining pairs
◦ The test statistic is the count X of pairs with positive difference
◦ X is binomially distributed with parameters n and θ.
◦ Null hypothesis H0 : θ = 1/2 (i.e. the median of the differences is zero)

Example:
For the safety program data, we find
◦ n = 10, X = 9
◦ Test H0 : θ = 1/2 against Ha : θ > 1/2
◦ The P -value of the observed count X is

  P(X ≥ 9) = C(10, 9) (1/2)^10 + (1/2)^10 = 0.0107
Since the P -value is smaller than α = 0.05 we reject the null hypothesis H0
that the safety program has no effect on the loss of labour due to accidents.
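The same P -value can be obtained with the immediate binomial test (a sketch):

. bitesti 10 9 0.5

The one-sided probability Pr(k ≥ 9) in the output corresponds to the 0.0107 computed above.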

Testing Hypotheses III, Feb 20, 2004 -2-


Tests for Proportions

Example: Blood cholesterol after a heart attack


Suppose we are interested in the proportion p of patients who show a
decrease of cholesterol level between the second and the 14th day after a
heart attack.
The proportion p can be estimated by the sample proportion
  p̂ = X/n
where X is the number of patients whose cholesterol level decreased.
Question: Does a decrease occur more often than an increase?
Test problem: H0 : p = 1/2 vs Ha : p > 1/2

Exact tests:
Since X is binomially distributed, we can use exact binomial tests.

Large sample approximations:


Facts: ◦ E(p̂) = p
       ◦ var(p̂) = p(1 − p)/n
       ◦ (p̂ − p)/√(p(1 − p)/n) ≈ N(0, 1) (for large n)

Under the null hypothesis H0 , we get


  T = (p̂ − p0)/√(p0(1 − p0)/n) ≈ N(0, 1).

Hence, we reject H0 if T > zα .

Example: Blood cholesterol after a heart attack

◦ n = 28, x = 22, p̂ ≈ 0.79, α = 0.05, z_0.05 = 1.645
◦ t = (0.79 − 0.5)/√(0.79 · 0.21/28) = 3.7675
◦ P-value: P(T > t) = 8.24 · 10^−5.
Testing Hypotheses III, Feb 20, 2004 -3-
Confidence Intervals for Proportions

Exact binomial confidence intervals


◦ difficult to compute
◦ use statistics software

Example: Blood cholesterol after a heart attack


◦ 28 patients in the study
◦ 22 showed a decrease in cholesterol level between second and 14th day
after the attack

Computation of an exact binomial confidence interval in STATA:


. cii 28 22
-- Binomial Exact --
Variable | Obs Mean Std. Err. [95% Conf. Interval]
---------+-----------------------------------------------------------
| 28 .7857143 .0775443 .590469 .9170394

Testing Hypotheses III, Feb 20, 2004 -4-


Confidence Intervals for Proportions

Large sample approximations


The CLT states that for large n, p̂ is approximately normally distributed,

  p̂ ≈ N( p, p(1 − p)/n )
Problems:
◦ variance is unknown
◦ estimate p̂(1 − p̂)/n is zero if p̂ = 0 or p̂ = 1
Example: What is the proportion of HIV+ students at the UofC?
◦ Random sample of 100 students
◦ None test positive for HIV
Are you absolutely sure that there are no HIV+ students at the UofC?
Idea: Estimate p by
  p̃ = (X + 2)/(n + 4)   (Wilson estimate)
and use
  [ p̃ − z_{α/2} √(p̃(1 − p̃)/(n + 4)) , p̃ + z_{α/2} √(p̃(1 − p̃)/(n + 4)) ]

as a (1 − α) confidence interval for p
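For the HIV example above (X = 0 successes in n = 100), the Wilson estimate and interval can be sketched with display (using z_{0.025} ≈ 1.96; the lower endpoint is truncated at 0):

. display "p_tilde = " 2/104
. display "lower   = " 2/104 - 1.96*sqrt((2/104)*(1 - 2/104)/104)
. display "upper   = " 2/104 + 1.96*sqrt((2/104)*(1 - 2/104)/104)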

Example: Blood cholesterol after a heart attack


. cii 28 22, wilson
------ Wilson ------
Variable | Obs Mean Std. Err. [95% Conf. Interval]
---------+-----------------------------------------------------------
| 28 .7857143 .0775443 .6046141 .8978754

Testing Hypotheses III, Feb 20, 2004 -5-


Paired Samples

Example: Safety program

◦ Study success of new elaborate safety program


◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11

Question: Does the safety program have a positive effect?


Approach:
◦ Consider differences before and after implementation of the program:
  Di = Xi(before) − Xi(after)   (decrease in losses of work)

◦ The Di ’s are approximately normal,
  Di iid∼ N(µ, σ²)
◦ H0 : µ = 0 against Ha : µ > 0
◦ Significance level α = 0.01
◦ One-sample t test:
  T = D̄/(s/√n)
  Reject if T > t_{n−1,α}

(Figure: normal quantile plot of the decrease in losses of work against normal quantiles.)

Result:
◦ d̄ = 10.27, s = 7.98, n = 10
◦ t = 4.07 and t9,0.01 = 2.82, P -value: 0.0014
◦ Test rejects H0 at significance level α = 0.01
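The same test can be reproduced from the summary statistics with ttesti (a sketch; the one-sided P-value appears under Ha: mean > 0):

. ttesti 10 10.27 7.98 0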

Testing Hypotheses III, Feb 20, 2004 -6-


Paired Sample t Test

Data: (X1 , Y1), . . . , (Xn, Yn)


Assumptions:
◦ Pairs are independent
◦ Di = Xi − Yi iid∼ N(µ, σ²)
◦ Apply one-sample t test

Paired sample t test


◦ Test statistic
  T = (D̄ − µ0)/(s/√n)
◦ Two-sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > tn−1,α/2
◦ One-sided test:
H0 : µ = µ0 against Ha : µ > µ0
reject H0 if T > tn−1,α

Power of the paired sample t test and the paired sign test:

(Figure: power curves 1 − β(δ) of the paired t test and the paired sign test as functions of the shift δ.)

Testing Hypotheses III, Feb 20, 2004 -7-


Sign and t Test

t test:
◦ based on Central Limit Theorem
◦ reasonably robust against departures from normality
◦ do not use if n is small and
⋄ data are strongly skewed or
⋄ data have clear outliers
Sign test:
◦ uses much less information than t test
◦ for normal data less powerful than t test
◦ makes no assumption on the distribution, so it keeps its significance level
regardless of the distribution
◦ preferable for very small data sets

Remark:
◦ The two-step procedure
1. assess normality by normal quantile plot
2. conduct either t test or sign test depending on result in step 1
does not attain the chosen significance level α (two tests!).
◦ The sign test is rarely used since there are more powerful distribution-
free tests.

Testing Hypotheses III, Feb 20, 2004 -8-


Two Sample Problems

Two sample problems

◦ The goal of inference is to compare the responses in two groups.


◦ Each group is a sample from a different population.
◦ The responses in each group are independent of those in the other
group.

Example: Effects of ozone


Study the effects of ozone by controlled randomized experiment
◦ 55 70-day-old rats were randomly assigned to treatment or control groups
◦ Treatment group: 22 rats were kept in an environment containing ozone.
◦ Control group: 23 rats were kept in an ozone-free environment
◦ Data: Weight gains after 7 days

We are interested in the difference in weight gain between the treatment and control group.

Question: Do the weight gains differ between groups?

◦ x1, . . . , x22 - weight gains for treatment group
◦ y1, . . . , y23 - weight gains for control group
◦ Test problem:
  H0 : µX = µY vs Ha : µX ≠ µY
◦ Idea: Reject the null hypothesis if |x̄ − ȳ| is large.

(Figure: weight gains in grams for the treatment and control groups.)

Two Sample Tests, Feb 23, 2004 -1-


Comparing Means

Let X1 , . . . , Xm and Y1 , . . . , Yn be two independent normally distributed


samples. Then
  X̄ − Ȳ ∼ N( µX − µY , σ²_X/m + σ²_Y/n )

Two-sample t test
◦ Two-sample t statistic
  T = (X̄ − Ȳ)/√(s²_X/m + s²_Y/n)

Distribution of T can be approximated by t distribution


◦ Two-sided test:
H0 : µX = µY against Ha : µX ≠ µY
reject H0 if |T | > tdf,α/2
◦ One-sided test:
H0 : µX = µY against Ha : µX > µY
reject H0 if T > tdf,α
◦ Degrees of freedom:
  ◦ Approximations for df provided by statistical software
  ◦ Satterthwaite approximation
      df = (s²_X/m + s²_Y/n)² / [ (1/(m−1))(s²_X/m)² + (1/(n−1))(s²_Y/n)² ]
    commonly used, conservative approximation
  ◦ Otherwise: use df = min(m − 1, n − 1)

Two Sample Tests, Feb 23, 2004 -2-


Comparing Means

Example: Effects of ozone


Data:
◦ Treatment group: x̄ = 11.01, sX = 19.02, m = 22
◦ Control group: ȳ = 22.43, sY = 10.78, n = 23
Test problem:
◦ H0 : µX = µY vs Ha : µX ≠ µY
◦ α = 0.05, df = min(m − 1, n − 1) = 21, t21,0.025 = 2.08
The value of the test statistic is
  t = (x̄ − ȳ)/√(s²_X/m + s²_Y/n) = −2.46

The corresponding P-value is

P(|T | ≥ |t|) = P(|T | ≥ 2.46) = 0.023


Thus we reject the hypothesis that ozone has no effect on weight gain.
Two-sample t test with STATA:
. ttest weight, by(group) unequal
Two-sample t test with unequal variances
----------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+------------------------------------------------------------------
0 | 23 22.42609 2.247108 10.77675 17.76587 27.0863
1 | 22 11.00909 4.054461 19.01711 2.577378 19.4408
---------+------------------------------------------------------------------
combined | 45 16.84444 2.422057 16.24765 11.96311 21.72578
---------+------------------------------------------------------------------
diff | 11.417 4.635531 1.985043 20.84895
----------------------------------------------------------------------------
Satterthwaite’s degrees of freedom: 32.9179
Ho: mean(0) - mean(1) = diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
t = 2.4629 t = 2.4629 t = 2.4629
P < t = 0.9904 P > |t| = 0.0192 P > t = 0.0096
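From the group summaries alone, the same test and the Satterthwaite degrees of freedom can be reproduced (a sketch using the immediate command ttesti and display):

. ttesti 22 11.01 19.02 23 22.43 10.78, unequal
. display ((19.02^2/22 + 10.78^2/23)^2)/((19.02^2/22)^2/21 + (10.78^2/23)^2/22)

The last line returns approximately 32.9, matching Satterthwaite’s degrees of freedom in the output.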

Two Sample Tests, Feb 23, 2004 -3-


Comparing Means
Suppose that σ²_X = σ²_Y = σ². Then
  σ²/m + σ²/n = σ²(1/m + 1/n).
Estimate σ² by the pooled sample variance
  s²_p = [ (m − 1)s²_X + (n − 1)s²_Y ] / (m + n − 2).

Pooled two-sample t test


◦ Two-sample t statistic
  T = (X̄ − Ȳ)/( s_p √(1/m + 1/n) )

T is t distributed with m + n − 2 degrees of freedom.


◦ Two-sided test:
H0 : µX = µY against Ha : µX ≠ µY
reject H0 if |T | > tm+n−2,α/2
◦ One-sided test:
H0 : µX = µY against Ha : µX > µY
reject H0 if T > tm+n−2,α
Remarks:
◦ If m ≈ n, the test is reasonably robust against
◦ nonnormality and
◦ unequal variances.
◦ If sample sizes differ a lot, test is very sensitive to unequal variances.
◦ Tests for differences in variances are sensitive to nonnormality.

Two Sample Tests, Feb 23, 2004 -4-


Comparing Means

Example: Parkinson’s disease


Study on Parkinson’s disease

◦ Parkinson’s disease, among other things, affects a person’s ability to speak
◦ Overall condition can be improved by an operation
◦ How does the operation affect the ability to speak?
◦ Treatment group: Eight patients received the operation
◦ Control group: Fourteen patients
◦ Data:
  ⋄ scores on several tests
  ⋄ high scores indicate problems with speaking

(Figure: speaking-ability scores for the treatment and control groups.)

Pooled two-sample t test with STATA:

. infile ability group using parkinson.txt


. ttest ability, by(group)
Two-sample t test with equal variances
---------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+-----------------------------------------------------------------
0 | 14 1.821429 .148686 .5563322 1.500212 2.142645
1 | 8 2.45 .14516 .4105745 2.106751 2.793249
---------+-----------------------------------------------------------------
combined | 22 2.05 .1249675 .5861497 1.790116 2.309884
---------+-----------------------------------------------------------------
diff | -.6285714 .2260675 -1.10014 -.1570029
---------------------------------------------------------------------------
Degrees of freedom: 20
Ho: mean(0) - mean(1) = diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
t = -2.7805 t = -2.7805 t = -2.7805
P < t = 0.0058 P > |t| = 0.0115 P > t = 0.9942

Two Sample Tests, Feb 23, 2004 -5-


Comparing Variances

Example: Parkinson’s disease


In order to apply the pooled two-sample t test, the variances of the two
groups have to be equal. Are the data compatible with this assumption?

F test for equality of variances


The F test statistic
  F = s²_X / s²_Y
is, under H0 : σ²_X = σ²_Y, F distributed with m − 1 and n − 1 degrees of freedom.
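For the Parkinson data the observed ratio and its upper-tail probability can be computed directly (a sketch using the standard deviations reported in the output below and the Ftail() function):

. display (.5563322/.4105745)^2
. display Ftail(13, 7, (.5563322/.4105745)^2)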

. sdtest ability, by(group)


Variance ratio test
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
0 | 14 1.821429 .148686 .5563322 1.500212 2.142645
1 | 8 2.45 .14516 .4105745 2.106751 2.793249
---------+--------------------------------------------------------------------
combined | 22 2.05 .1249675 .5861497 1.790116 2.309884
------------------------------------------------------------------------------
Ho: sd(0) = sd(1)
F(13,7) observed = F_obs = 1.836
F(13,7) lower tail = F_L = 1/F_obs = 0.545
F(13,7) upper tail = F_U = F_obs = 1.836
Ha: sd(0) < sd(1) Ha: sd(0) != sd(1) Ha: sd(0) > sd(1)
P < F_obs = 0.7865 P < F_L + P > F_U = 0.3767 P > F_obs = 0.2135

Result: We cannot reject the null hypothesis that the variances are equal.
Problem: Are the data normally distributed?

(Figure: normal quantile plots of speaking ability for the treatment group and the control group against theoretical quantiles.)

Two Sample Tests, Feb 23, 2004 -6-


Comparing Proportions

Suppose we have two populations with unknown proportions p1 and p2.


◦ Random samples of size n1 and n2 are drawn from the two population
◦ p̂1 is the sample proportion for the first population
◦ p̂2 is the sample proportion for the second population

Question: Are the two proportions p1 and p2 different?

Test problem:

H0 : p1 = p2 vs Ha : p1 ≠ p2

Idea: Reject H0 if |p̂1 − p̂2| is large.


Note that
 
  p̂1 − p̂2 ≈ N( p1 − p2 , p1(1 − p1)/n1 + p2(1 − p2)/n2 )

This suggests the test statistic


  T = (p̂1 − p̂2)/√( p̂(1 − p̂)(1/n1 + 1/n2) )

where p̂ is the combined proportion of successes in both samples


  p̂ = (X1 + X2)/(n1 + n2) = (n1 p̂1 + n2 p̂2)/(n1 + n2)

with X1 and X2 denoting the number of successes in each sample.


Under H0 , the test statistic is approximately standard normally dis-
tributed.

Two Sample Tests, Feb 23, 2004 -7-


Comparing Proportions

Example: Question wording


The ability of question wording to affect the outcome of a survey can be a
serious issue. Consider the following two questions:

1. Would you favor or oppose a law that would require a person to obtain
a police permit before purchasing a gun?

2. Would you favor or oppose a law that would require a person to obtain
a police permit before purchasing a gun, or do you think such a law
would interfere too much with the right of citizens to own guns?

In two surveys, the following results were obtained:


Question Yes No Total
1 463 152 615
2 403 182 585

Question: Is the true proportion of people favoring the permit law the
same in both groups or not?

. prtesti 615 0.753 585 0.689


Two-sample test of proportion x: Number of obs = 615
y: Number of obs = 585
--------------------------------------------------------------------------
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval]
---------+----------------------------------------------------------------
x | .753 .0173904 .7189155 .7870845
y | .689 .0191387 .6514889 .7265111
---------+----------------------------------------------------------------
diff | .064 .0258595 .0133163 .1146837
| under Ho: .0258799 2.47 0.013
--------------------------------------------------------------------------
Ho: proportion(x) - proportion(y) = diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
z = 2.473 z = 2.473 z = 2.473
P < z = 0.9933 P > |z| = 0.0134 P > z = 0.0067
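The z statistic can be reproduced by hand from the counts (a sketch; 463/615 ≈ 0.753 and 403/585 ≈ 0.689):

. display (463 + 403)/(615 + 585)
. display (463/615 - 403/585)/sqrt(((463 + 403)/1200)*(1 - (463 + 403)/1200)*(1/615 + 1/585))

The second line gives approximately 2.47, in agreement with the output above.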

Two Sample Tests, Feb 23, 2004 -8-


Final Remarks

Statistical theory focuses on the significance level, the probability of a type


I error.
In practice, a discussion of the power of the test is also important:
Example: Efficient Market Hypothesis
“Efficient market hypothesis” for stock prices:
◦ future stock prices show only random variation
◦ market incorporates all information available now in present prices
◦ no information available now will help to predict future stock prices

Testing of the efficient market hypothesis:


◦ Many studies tested
H0: Market is efficient
Ha : Prediction is possible
◦ Almost all studies failed to find good evidence against H0.
◦ Consequently the efficient market hypothesis became quite popular.

Problem:
◦ Power was generally low in the significance tests employed in the stud-
ies.
◦ Failure to reject H0 is no evidence that H0 is true.
◦ More careful studies showed that the size of a company and measures
of value such as ratio of stock price to earnings do help predict future
stock prices.

Two Sample Tests, Feb 23, 2004 -9-


Final Remarks

Example

◦ IQ of 1000 women and 1000 men


◦ µ̂w = 100.68, σw = 14.91
◦ µ̂m = 98.90, σm = 14.68
◦ Pooled two-sample t test: T = −2.7009
◦ Reject H0 : µw = µm since |T | > t1998,0.005 = 2.58.
◦ The difference in the IQ is statistically significant at the 0.01 level.
◦ However we might conclude that the difference is scientifically irrele-
vant.

Note: Statistical significance at a small level does not mean there is a large difference,
but only that there is strong evidence that there is some difference.

Two Sample Tests, Feb 23, 2004 - 10 -


Final Remarks

Example: Is radiation from cell phones harmful?

◦ Observational study
◦ Comparison of brain cancer patients and similar group without brain
cancer
◦ No statistically significant association between cell phone use and a
group of brain cancers known as gliomas.
◦ Separate analyses for 20 types of gliomas found an association between
phone use and one rare form.
◦ Risk seemed to decrease with greater mobile phone use.
Think for a moment:
◦ Suppose all 20 null hypotheses are true.
◦ Each test has 5% chance of being significant - the outcome is Bernoulli
distributed with parameter 0.05.
◦ The number of false positive tests is binomially distributed:

N ∼ Bin(20, 0.05)

◦ The probability of getting one or more positive results is

P(N ≥ 1) = 1 − P(N = 0) = 1 − 0.9520 = 0.64.


We therefore might have expected at least one significant association.
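The calculation is a one-liner (a sketch):

. display 1 - 0.95^20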

Beware of searching for significance

Two Sample Tests, Feb 23, 2004 - 11 -


Final Remarks

Problem: If several tests are performed, the probability of a type I error


increases.

Idea: Adjust significance level of each single test.


Bonferroni procedure:
◦ Perform k tests
◦ Use significance level α/k for each of the k tests
◦ If all null hypotheses are true, the probability that any of the tests
rejects its null hypothesis is at most α.

Example
Suppose we perform k = 6 tests and obtain the following P -values:

P -values: 0.476, 0.032, 0.241, 0.008*, 0.010, 0.001*    (α/k = 0.05/6 ≈ 0.0083)

Only the two tests marked (*) are significant at the 0.05 level (after the Bonferroni adjustment).

Two Sample Tests, Feb 23, 2004 - 12 -


Two-Way Tables

Example: Depression and marital status


Question: Does severity of depression depend on marital status?
◦ Study of 159 depression patients
◦ Patients were categorized by
⋄ severity of depression (severe, normal, mild)
⋄ marital status (single, married, widowed/divorced)

The following two-way table summarizes the data:

Depression Marital Status Total


Single Married Wid/Div
Severe 16 22 19 57
Normal 29 33 14 76
Mild 9 14 3 26
Total 54 69 36 159

◦ Each combination of values defines a cell.


◦ The severity of depression is a row variable.
◦ The marital status is a column variable.

Inference for Two-Way Tables, Feb 25, 2004 -1-


Two-Way Tables

From this table of counts, the sample distribution can be obtained


by dividing each cell by the total sample size n = 159:

Depression Marital Status Total


Single Married Wid/Div
Severe 0.101 0.138 0.119 0.358
Normal 0.182 0.208 0.088 0.478
Mild 0.057 0.088 0.019 0.164
Total 0.340 0.434 0.226 1.000

◦ Joint distribution: proportion for each combination of values


◦ Marginal distribution: distribution of the row and column
variables separately.
◦ Conditional distribution: distribution of one variable at a
given level of the other variable

Inference for Two-Way Tables, Feb 25, 2004 -2-


Test for Independence

Example: Depression and marital status


Conditional distributions of severity of depression given marital
status:
(Figure: bar chart of the conditional sample proportions of severe, normal, and mild depression for single, married, and widowed/divorced patients.)

Question: Is there a relationship between the row variable (depression)


and the column variable (marital status)?

◦ The distribution for widowed/divorced patients seems to differ


from the distributions for single or married patients.
◦ Are these differences significant or can they be attributed to
chance variation?
◦ How likely are differences as large or larger than those observed
if the two variables were indeed independent (and thus the con-
ditional distribution were the same)?

A statistical test will be required to answer these questions.

Inference for Two-Way Tables, Feb 25, 2004 -3-


Test for Independence

Test problem:
H0 : the row and the column variables are independent
Ha : the row and the column variables are dependent
How can we measure evidence against the null hypothesis?
◦ What counts would we expect to observe if the null hypothesis
were true?
  Expected Cell Count = (row total × column total) / total count
Recall: For two independent events A and B, P(A ∩ B) = P(A) P(B).
If the null hypothesis H0 is true, then the table of expected
counts should be “close” to the observed table of counts.
◦ We need a statistic that measures the difference between the
tables.
◦ And we need to know what is the distribution of the statistic
to make statistical inference.

Inference for Two-Way Tables, Feb 25, 2004 -4-


Test for Independence

Idea of the test:


◦ construct table of expected counts
◦ compare expected with observed counts
◦ if the null hypothesis is true, the difference between the tables
should be “small”
The χ2 (Chi-Squared) Statistic
To measure how far the expected table is from the observed table,
we use the following test statistic:
  X = Σ_{all cells} (Observed − Expected)² / Expected
◦ Under the null hypothesis, X is approximately χ² distributed
with (r − 1)(c − 1) degrees of freedom.
Why (r − 1)(c − 1)?
Recall that our “expected” table is based on some quantities estimated
from the data: namely the row and column totals.
Once these totals are known, filling in any (r − 1)(c − 1) undetermined
table entries actually gives us the whole table. Thus, there are only
(r − 1)(c − 1) freely varying quantities in the table.

◦ We reject H0 if observed and expected counts are very different


and hence X is large. Consequently we reject H0 at significance
level α if

X ≥ χ²_{(r−1)(c−1),α}.

Inference for Two-Way Tables, Feb 25, 2004 -5-


The χ2 Distribution

What does the χ2 distribution look like?

(Figure: χ² densities for 1, 5, 10, 20, and 30 degrees of freedom.)

◦ Unlike the Normal or t distributions, the χ2 distribution takes


values in (0, ∞).
◦ As with the t distribution, the exact shape of the χ2 distribution
depends on its degrees of freedom.

Recall that X has only an approximate χ2(r−1)(c−1) distribution.


When is the approximation valid?

◦ For any two-way table larger than 2 × 2, we require that the


average expected cell count is at least 5 and each expected count
is at least one.
◦ For 2×2 tables, we require that each expected count be at least
5.

Inference for Two-Way Tables, Feb 25, 2004 -6-


Test for Independence

Example: Depression and marital status


The following table show the observed counts and expected counts
(in brackets):

Depression Marital Status Total


Single Married Wid/Div
Severe 16 22 19 57
(19.36) (24.74) (12.90)
Normal 29 33 14 76
(25.81) (32.98) (17.21)
Mild 9 14 3 26
(8.83) (11.28) (5.89)
Total 54 69 36 159

◦ The table is 3 × 3, so there are (r − 1)(c − 1) = 2 × 2 = 4


degrees of freedom.
◦ The critical value (significance level α = 0.05) is χ24,0.05 = 9.49.
◦ The observed value of the χ2 test statistic is
  x = (16 − 19.36)²/19.36 + (22 − 24.74)²/24.74 + . . . + (3 − 5.89)²/5.89
    = 6.83 ≤ χ²_{4,0.05}

Thus we do not reject the null hypothesis of independence.


◦ The corresponding P-value is

P(X ≥ x) = P(X ≥ 6.83) = 0.145 ≥ α


Again we do not reject H0
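The P-value can be computed directly (a sketch using the chi2tail() function):

. display chi2tail(4, 6.83)

which returns 0.145, as in the STATA output on the next page.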

Inference for Two-Way Tables, Feb 25, 2004 -7-


Test for Independence

The χ2 test in STATA:


. insheet using depression.txt, clear
(3 vars, 159 obs)
. tabulate depression marital, chi2
| Marital
Depression | Married Single Wid/Div | Total
-----------+---------------------------------+----------
Mild | 14 9 3 | 26
Normal | 33 29 14 | 76
Severe | 22 16 19 | 57
-----------+---------------------------------+----------
Total | 69 54 36 | 159
Pearson chi2(4) = 6.8281 Pr = 0.145

The same result can be obtained by the command


. tabi 16 22 19 \ 29 33 14 \ 9 14 3, chi2
| col
row | 1 2 3 | Total
-----------+---------------------------------+----------
1 | 16 22 19 | 57
2 | 29 33 14 | 76
3 | 9 14 3 | 26
-----------+---------------------------------+----------
Total | 54 69 36 | 159
Pearson chi2(4) = 6.8281 Pr = 0.145

Inference for Two-Way Tables, Feb 25, 2004 -8-


Models for Two-Way Tables

The χ2 -test for the presence of a relationship between two distributions


in a two-way table is valid for data produced by several different study
designs, although the exact null hypothesis varies.
◦ Examining independence between variables
⋄ Select random sample of size n from a population.
⋄ Classify each individual according to two categorical variables.
Question: Is there a relationship between the two variables?
Test problem:
H0: The two variables are independent
Ha : The two variables are not independent
Example: Suppose we collect an SRS of 114 college students, and cate-
gorize each by major and GPA (e.g. (0, 0.5], . . . , (3.5, 4]). Then, we can
use the χ2 -test to ascertain whether grades and major are independent.
◦ Comparing several populations
⋄ Select independent random samples from each of c populations, of
sizes n1 , . . . , nc .
⋄ Classify each individual according to a categorical response variable
with r possible values (the same across populations),
⋄ This yields an r × c table.
Question: Does the distribution of the response variable differ be-
tween populations?
Test problem:
H0: The distribution is the same in all populations.
Ha : The distribution is not the same.
Example: Suppose we select independent SRSs of Psychology, Biology
and Math majors, of sizes 40, 39, 35, and classify each individual by
GPA range. Then, we can use a χ2 -test to ascertain whether or not the
distribution of grades is the same in all three populations.
Inference for Two-Way Tables, Feb 25, 2004 -9-
Models for Two-Way Tables

Example: Literary Analysis (Rice, 1995)


When Jane Austen died, she left the novel Sanditon only partially com-
pleted, but she left a summary of the remainder. A highly literate admirer
finished the novel, attempting to emulate Austen’s style, and the hybrid
was published. Someone counted the occurrences of various words in sev-
eral chapters from various works.

Austen Imitator
Sense and Emma Sanditon I Sanditon II
Word Sensibility
a 147 186 101 83
an 25 26 11 29
this 32 39 15 15
that 94 105 37 22
with 59 74 28 43
without 18 10 10 4
TOTAL 375 440 202 196

Questions:

◦ Is there consistency in Austen’s work (do the frequencies with which


Austen used these words change from work to work)?
Answer X = 12.27, df=?, P-value=?
◦ Was the imitator successful (are the frequencies of the words the same
in Austen’s work and the imitator’s work)?

Inference for Two-Way Tables, Feb 25, 2004 - 10 -


Simpson’s Paradox

Example: Medical study


◦ contact randomly chosen people in a district in England
◦ data on 1314 women contacted
◦ each woman was either a current smoker or had never smoked
Question: Survival rate after 20 years?

Smoker Not
Dead 139 230
Alive 438 502

Result: A higher percent of smokers stayed alive!

Here are the same data classified by their age at time of the survey:
Age 18 to 44 Age 45 to 64 Age 65+
Smoker Not Smoker Not Smoker Not
Dead 19 13 Dead 78 52 Dead 42 165
Alive 269 327 Alive 162 147 Alive 7 28

Age at time of the study is a confounding variable: in each age


group a higher percent of nonsmokers survive.

Simpson’s Paradox
An association/comparison that holds for all of several groups can
reverse direction when the data are combined to form a single
group.

Inference for Two-Way Tables, Feb 25, 2004 - 11 -


Simple Linear Regression

Example: Body density


Aim: Measure body density (weight per unit volume of the body)
(Body density indicates the fat content of the human body.)
Problem:
◦ Body density is difficult to measure directly.
◦ Research suggests that skinfold thickness can accurately predict body
density.
◦ Skinfold thickness is measured by pinching a fold of skin between
calipers.

(Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm).)

Questions:
◦ Are body density and skinfold thickness related?
◦ How accurately can we predict body density from skinfold thickness?

Regression: predict response variable for fixed value of explanatory variable


◦ describe linear relationship in data by regression line
◦ fitted regression line is affected by chance variation in observed data

Statistical inference: accounts for chance variation in data

Simple Linear Regression, Feb 27, 2004 -1-


Population Regression Line

Simple linear regression studies the relationship between


◦ a response variable Y and
◦ a single explanatory variable X.
We expect that different values of X will produce different mean responses
of Y .
For given X = x, we consider the subpopulation with X = x:
◦ this subpopulation has mean

µY |X=x = E(Y |X = x) (cond. mean of Y given X = x)

◦ and variance

σY2 |X=x = var(Y |X = x) (cond. variance of Y given X = x)

Linear regression model with constant variance:

E(Y |X = x) = µY |X=x = a + b x (population regression line)


var(Y |X = x) = σY2 |X=x = σ 2

◦ The population regression line connects the conditional means of the


response variable for fixed values of the explanatory variable.
◦ This population regression line tells how the mean response of Y varies
with X.
◦ The variance (and standard deviation) does not depend on x.

Simple Linear Regression, Feb 27, 2004 -2-


Conditional Mean

Sample (x1, y1), . . . , (xn, yn)

(Figure: the sampling probability f(x, y) over the sample, the slice f(x0, y) at a fixed x = x0, and its rescaling by fX(x0).)

Conditional probability:
  f(y|x0) = fXY(x0, y)/fX(x0)

Conditional mean:
  E(Y |X = x0) = ∫ y fY|X(y|x0) dy

Simple Linear Regression, Feb 27, 2004 -3-


The Linear Regression Model

Simple linear regression

Yi = a + b x i + ε i , i = 1, . . . , n

where

Yi response (also dependent variable)


xi predictor (also independent variable)
εi error

Assumptions:
◦ Predictor xi is deterministic (fixed values, not random).
◦ Errors have zero mean, E(εi) = 0.
◦ Variation about mean does not depend on xi, i.e. var(εi) = σ 2 .
◦ Errors εi are independent.
Often we additionally assume:
◦ The errors are normally distributed,
  εi iid∼ N(0, σ²).

For fixed x the response Y is normally distributed with

Y ∼ N (a + b x, σ 2).

Simple Linear Regression, Feb 27, 2004 -4-


Least Squares Estimation

Data: (Y1 , x1), . . . , (Yn, xn)

Aim: Find straight line which fits data best:

Ŷi = a + b xi fitted values for coefficients a and b

a - intercept
b - slope
Least Squares Approach:
Minimize squared distance between observed Yi and fitted Ŷi :
  L(a, b) = Σ_{i=1}^{n} (Yi − Ŷi)² = Σ_{i=1}^{n} (Yi − a − b xi)²

Set partial derivatives to zero (normal equations):


  ∂L/∂a = 0 ⇔ Σ_{i=1}^{n} (Yi − a − b xi) = 0
  ∂L/∂b = 0 ⇔ Σ_{i=1}^{n} (Yi − a − b xi) · xi = 0

Solution: Least squares estimators


  â = Ȳ − (SXY/SXX) · x̄
  b̂ = SXY/SXX
where
  SXY = Σ_{i=1}^{n} (Yi − Ȳ)(xi − x̄)   (sums of squares)
  SXX = Σ_{i=1}^{n} (xi − x̄)²

Simple Linear Regression, Feb 27, 2004 -5-


Least Squares Estimation

Least squares predictor Ŷ

Ŷi = â + b̂ xi

Residuals ε̂i:

ε̂i = Yi − Ŷi
= Yi − â − b̂ xi

Residual sum of squares (SS_Residual)
  SS_Residual = Σ_{i=1}^{n} ε̂_i² = Σ_{i=1}^{n} (Yi − Ŷi)²

Estimation of σ²
  σ̂² = (1/(n − 2)) Σ_{i=1}^{n} (Yi − Ŷi)² = SS_Residual/(n − 2)

Regression standard error
  s_e = σ̂ = √(SS_Residual/(n − 2))

Variation accounting:
  SS_Total = Σ_{i=1}^{n} (Yi − Ȳ)²   total variation
  SS_Model = Σ_{i=1}^{n} (Ŷi − Ȳ)²   variation explained by linear model
  SS_Residual = Σ_{i=1}^{n} (Yi − Ŷi)²   remaining variation

Simple Linear Regression, Feb 27, 2004 -6-


Least Squares Estimation

Example: Body density


Scatter plot with least squares regression line:

(Figure: scatter plot of body density against skinfold thickness with the least squares regression line.)

Calculation of least squares estimates:


  x̄ = 1.064, ȳ = 1.568, SXX = 0.0235, SXY = −0.2679, SYY = 4.244, SS_Residual = 1.187

  b̂ = SXY/SXX = −0.2679/0.0235 = −11.40
  â = ȳ − b̂ x̄ = 1.568 + 11.40 · 1.064 = 13.70
  σ̂² = SS_Residual/(n − 2) = 1.187/90 = 0.0132
  s_e = √(σ̂²) = √0.0132 = 0.1149
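These hand calculations can be verified with display (a sketch; the inputs are the summary values above):

. display "slope     b = " (-0.2679/0.0235)
. display "intercept a = " 1.568 - (-0.2679/0.0235)*1.064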

Simple Linear Regression, Feb 27, 2004 -7-


Least Squares Estimation

Example: Body density


Using STATA:
. infile ID BODYD SKINT using bodydens.txt, clear
(92 observations read)
. regress BODYD SKINT
Source | SS df MS Number of obs = 92
-------------+------------------------------ F( 1, 90) = 231.89
Model | 3.05747739 1 3.05747739 Prob > F = 0.0000
Residual | 1.18663025 90 .013184781 R-squared = 0.7204
-------------+------------------------------ Adj R-squared = 0.7173
Total | 4.24410764 91 .046638546 Root MSE = .11482
------------------------------------------------------------------------------
BODYD | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
SKINT | -11.41345 .7494999 -15.23 0.000 -12.90246 -9.924433
_cons | 13.71221 .7975822 17.19 0.000 12.12768 15.29675
------------------------------------------------------------------------------
. twoway (lfitci BODYD SKINT, range(1 1.1)) (scatter BODYD SKINT), xtitle(Skin thickn
> ess) ytitle(Body density) scheme(s1color) legend(off)
(Figure: body density against skin thickness with the fitted regression line and confidence band, produced by the command above.)

Simple Linear Regression, Feb 27, 2004 -8-


Properties of Estimators

Statistical properties of â and b̂


Mean and variance of b̂
  E(b̂) = b
  var(b̂) = σ²/SXX,   where SXX = Σ_{i=1}^{n} (xi − x̄)²

Distribution of b̂
  b̂ ∼ N( b, σ²/SXX )

Mean and variance of â
  E(â) = a
  var(â) = (1/n + x̄²/SXX) σ²

Distribution of â
  â ∼ N( a, (1/n + x̄²/SXX) σ² )

Inference for Regression, Mar 1, 2004 -1-


Confidence Intervals
 
Note that b̂ ∼ N(b, σ²/SXX). Thus
  (b̂ − b)/(σ/√SXX) ∼ N(0, 1)
Substituting s_e for σ, we obtain
  (b̂ − b)/(s_e/√SXX) ∼ t_{n−2}

(1 − α) confidence interval for b:
  b̂ ± t_{n−2,α/2} · s_e/√SXX

Similarly
  (â − a)/( σ √(1/n + x̄²/SXX) ) ∼ N(0, 1)
Substituting s_e for σ, we obtain
  (â − a)/( s_e √(1/n + x̄²/SXX) ) ∼ t_{n−2}

(1 − α) confidence interval for a:
  â ± t_{n−2,α/2} · s_e · √(1/n + x̄²/SXX)

Inference for Regression, Mar 1, 2004 -2-


Tests on the Coefficients

Question: Is b equal to some value b0 ?


The corresponding test problem is

  H0 : b = b0 versus Ha : b ≠ b0.

The test statistic is given by
  Tb = (b̂ − b0)/(s_e/√SXX) ∼ t_{n−2}
The null hypothesis H0 : b = b0 is rejected if
  |Tb| > t_{n−2,α/2}

Question: Is a equal to some value a0 ?


The corresponding test problem is

  H0 : a = a0 versus Ha : a ≠ a0.

The test statistic is given by
  Ta = (â − a0)/( s_e √(1/n + x̄²/SXX) ) ∼ t_{n−2}
The null hypothesis H0 : a = a0 is rejected if
  |Ta| > t_{n−2,α/2}

Inference for Regression, Mar 1, 2004 -3-


Inference for the Coefficients

Example: Body density


The confidence interval for b is given by
  b̂ ± t_{n−2,α/2} · s_e/√SXX
  = −11.41 ± 1.99 · √(0.0132/0.023) = [−12.92, −9.90]

The confidence interval for a is given by
  â ± t_{n−2,α/2} · s_e · √(1/n + x̄²/SXX)
  = 13.71 ± 1.99 · √0.0132 · √(1/92 + 1.06²/0.023) = [12.11, 15.30]

Furthermore we find
  Tb = b̂/(s_e/√SXX) = −15.22,   |Tb| > t_{90,0.025} = 1.99
Thus we reject H0 : b = 0 at significance level 0.05: The coefficient b is
statistically significantly different from zero.
Similarly
  Ta = â/( s_e √(1/n + x̄²/SXX) ) = 17.26 > t_{90,0.025} = 1.99
Thus we reject H0 : a = 0 at significance level 0.05: The coefficient a is
statistically significantly different from zero.

The corresponding P -values are
◦ P(|Tb| ≥ 15.22) ≈ 0
◦ P(|Ta| ≥ 17.26) ≈ 0
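The interval for b can be reproduced with display (a sketch, using the rounded quantities above):

. display -11.41 - 1.99*sqrt(0.0132/0.023)
. display -11.41 + 1.99*sqrt(0.0132/0.023)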

Inference for Regression, Mar 1, 2004 -4-


Estimating the Mean

In the linear regression model, the mean of Y at x = x0 is given by

E(Y ) = a + b x0
Our estimate for the mean of Y at X = x0 is

Ŷx0 = â + b̂ x0 .

Question: How precise is this estimate?


Note that
  Ŷ_{x0} = â + b̂ x0 = Ȳ + b̂(x0 − x̄).
Hence we obtain
  E(Ŷ_{x0}) = a + b x0
  var(Ŷ_{x0}) = (1/n + (x0 − x̄)²/SXX) σ²

(1 − α) confidence interval for E(Y_{x0}):
  (â + b̂ x0) ± t_{n−2,α/2} · s_e · √(1/n + (x0 − x̄)²/SXX)

Inference for Regression, Mar 1, 2004 -5-


Estimating the Mean

Example: Body density


Suppose the measured skin thickness is x0 = 1.1 mm.
What is the mean body density for this value of skin thickness?
◦ Point estimate:

  Ŷ_{x0} = â + b̂ x0 = 13.71 − 11.41 · 1.1 = 1.159

  The estimated mean body density is 1.159 · 10³ kg/m³.

◦ Confidence interval:
  (â + b̂ x0) ± t_{n−2,α/2} · s_e · √(1/n + (x0 − x̄)²/SXX)
  = (13.71 − 11.41 · 1.1) ± 1.99 · √0.0132 · √(1/92 + (1.1 − 1.06)²/0.023)
  = [1.09, 1.22]

In STATA, the standard error for estimating the mean of Y is calculated


by passing the option stdp to predict:
. predict BDH
. predict SE, stdp
. generate low=BDH-invttail(90,.025)*SE
. generate high=BDH+invttail(90,.025)*SE
. sort SKINT
. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black bla
> ck black) || scatter BODYD SKINT, legend(off) scheme(s1color)
(Figure: pointwise confidence band for the mean response, plotted with the data over SKINT.)

Inference for Regression, Mar 1, 2004 -6-


Prediction

Suppose we want to predict Y at x = x0 .


Aim: (1 − α) confidence interval for Y
Note that
  
  â + b̂ x0 − Y ∼ N( 0, σ²(1 + 1/n + (x0 − x̄)²/SXX) )
Thus the desired (1 − α) confidence interval for Y_{x0} is given by
  â + b̂ x0 ± t_{n−2,α/2} · s_e · √(1 + 1/n + (x0 − x̄)²/SXX)

Inference for Regression, Mar 1, 2004 -7-


Prediction

Example: Body density


Suppose the measured skin thickness is x0 = 1.1 mm.
What is the predicted body density for this value of skin thickness?
◦ Point estimate: Ŷ_{x0} = â + b̂ x0 = 13.71 − 11.41 · 1.1 = 1.159
  The predicted body density is 1.159 · 10³ kg/m³.
◦ Confidence interval:
  (â + b̂ x0) ± t_{n−2,α/2} · s_e · √(1 + 1/n + (x0 − x̄)²/SXX)
  = (13.71 − 11.41 · 1.1) ± 1.99 · √0.0132 · √(1 + 1/92 + (1.1 − 1.06)²/0.023)
  = [0.92, 1.40]
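Again this can be checked with display (a sketch, using the same rounded quantities):

. display (13.71 - 11.41*1.1) - 1.99*sqrt(0.0132)*sqrt(1 + 1/92 + (1.1 - 1.06)^2/0.023)
. display (13.71 - 11.41*1.1) + 1.99*sqrt(0.0132)*sqrt(1 + 1/92 + (1.1 - 1.06)^2/0.023)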

In STATA, the standard error for predicting Y is calculated by passing the


option stdf to predict:
. drop SE low high
. predict SE, stdf
. generate low=BDH-invttail(90,.025)*SE
. generate high=BDH+invttail(90,.025)*SE
. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black bla
> ck black) || scatter BODYD SKINT, legend(off) scheme(s1color)

Alternatively, we can use the following command:


. twoway (lfitci BODYD SKINT, range(1 1.1) stdf) (scatter BODYD SKINT),
> xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)
(Figures: prediction bands for body density over skin thickness, produced by the two commands above.)

Inference for Regression, Mar 1, 2004 -8-


Multiple Regression

Example: Food expenditure and family income


Data: ◦ Sample of 20 households
◦ Food expenditure (response variable)
◦ Family income and family size

. regress food income


-------------------------------------------------------------------------
food | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
income | .1841099 .0149345 12.33 0.000 .1527336 .2154862
_cons | -.4119994 .7637666 -0.54 0.596 -2.016613 1.192615
-------------------------------------------------------------------------
. regress food number
-------------------------------------------------------------------------
food | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
number | 2.287334 .4224493 5.41 0.000 1.399801 3.174867
_cons | 1.217365 1.410627 0.86 0.399 -1.746252 4.180981
-------------------------------------------------------------------------

(Figures: food expenditure against income and against family size.)

Multiple Regression, Mar 3, 2004 -1-


Multiple Regression

Multiple regression model

Yi = b0 + b1 x1,i + b2 x2,i + . . . + bp xp,i + εi i = 1, . . . , n

where
◦ Yi response variable
◦ x1,i, . . . , xp,i predictor variables (fixed, nonrandom)
◦ b0, . . . , bp regression coefficients
◦ εi iid∼ N(0, σ²) error variable

Example: Food expenditure and family income


Fitting multiple regression models in STATA:

. regress food income number


Source | SS df MS Number of obs = 20
--------+------------------------------ F( 2, 17) = 121.47
Model | 386.312865 2 193.156433 Prob > F = 0.0000
Resid. | 27.0326365 17 1.59015509 R-squared = 0.9346
--------+------------------------------ Adj R-squared = 0.9269
Total | 413.345502 19 21.7550264 Root MSE = 1.261
-------------------------------------------------------------------------
food | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
income | .1482117 .0163786 9.05 0.000 .1136558 .1827676
number | .7931055 .2444411 3.24 0.005 .2773798 1.308831
_cons | -1.118295 .6548524 -1.71 0.106 -2.499913 .2633232
-------------------------------------------------------------------------

Multiple Regression, Mar 3, 2004 -2-


Multiple Regression

Example: Food expenditure and family income


Data: (Foodi , Incomei , Numberi ), i = 1, . . . , 20
Fitted regression model:
  fitted Food = b̂0 + b̂1 · Income + b̂2 · Number

(Figure: observed responses Yi and fitted values Ŷi plotted over income and family size.)

The fitted model is a two-dimensional plane, which is difficult to visualize.

Multiple Regression, Mar 3, 2004 -3-


Inference for Multiple Regression

Multiple regression model (matrix notation)

Y =Xb+ε

where
Y n dimensional vector
X n × (1 + p) dimensional matrix
b 1 + p dimensional vector
ε n dimensional vector
Thus the model can be written as
      
  (Y1, . . . , Yn)ᵀ = X (b0, . . . , bp)ᵀ + (ε1, . . . , εn)ᵀ,
where row i of X is (1, x1,i, . . . , xp,i).

Least squares approach: Minimize


  ‖Y − Ŷ‖² = Σ_{i=1}^{n} (Yi − Ŷi)²

Results:
  b̂ = (XᵀX)⁻¹XᵀY ∼ N( b, σ²(XᵀX)⁻¹ )
  Ŷ = X(XᵀX)⁻¹XᵀY ∼ N( Xb, σ²X(XᵀX)⁻¹Xᵀ )
  ε̂ = Y − Ŷ = (I − X(XᵀX)⁻¹Xᵀ)Y ∼ N( 0, σ²(I − X(XᵀX)⁻¹Xᵀ) )
  σ̂² = s²_e = ‖Y − Ŷ‖²/(n − p) = (1/(n − p)) Σ_{i=1}^{n} (Yi − Ŷi)²

Details: see a course in regression analysis (STAT 22200) or econometrics

Multiple Regression, Mar 3, 2004 -4-


Inference for Multiple Regression

Example: Food expenditure and family income


Interpretation of regression coefficients

. quietly regress food income


. predict e_food1, residuals
. quietly regress number income
. predict e_num, residuals
. regress e_food1 e_num
------------------------------------------------------------------------
e_food1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------
e_num | .7931055 .2375541 3.34 0.004 .2940229 1.292188
------------------------------------------------------------------------
. quietly regress food number
. predict e_food2, residuals
. quietly regress income number
. predict e_inc, residuals
. regress e_food2 e_inc
------------------------------------------------------------------------
e_food2 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------
e_inc | .1482117 .0159172 9.31 0.000 .114771 .1816525
------------------------------------------------------------------------

Result:
◦ bj measures the dependence of Y on xj after removing the linear effects of all other predictors xk, k ≠ j.
◦ bj = 0 if xj provides no information for predicting Y beyond the information already given by the other predictor variables.
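This residual-on-residual interpretation can be checked numerically. The following Python sketch uses simulated stand-ins for food, income and number (not the course data) and verifies that the coefficient from the residual regression equals the one from the full multiple regression:

import numpy as np

rng = np.random.default_rng(1)
n = 20
income = rng.uniform(10, 100, n)
number = rng.integers(1, 6, n).astype(float)
food = -1.1 + 0.15 * income + 0.8 * number + rng.normal(scale=1.2, size=n)

def ols(y, *cols):
    # least squares coefficients for a regression of y on a constant and the given columns
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(food, income, number)[2]        # coefficient on number in the full model

# remove the linear effect of income from both food and number, then regress residuals
e_food = food - np.column_stack([np.ones(n), income]) @ ols(food, income)
e_num = number - np.column_stack([np.ones(n), income]) @ ols(number, income)
b_resid = ols(e_food, e_num)[1]

print(b_full, b_resid)                       # the two coefficients agree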

Multiple Regression, Mar 3, 2004 -5-


Multiple Regression

Example: Heart catheterization


Description: A Teflon tube (catheter) 3 mm in diameter is passed into a major vein or
artery at the femoral region and pushed up into the heart to obtain information about
the heart’s physiology and functional ability. The length of the catheter is typically
determined by a physician’s educated guess.

Data:
◦ Study with 12 children with congenital heart defects
◦ Exact required catheter length was measured using a fluoroscope
◦ Patients’ height and weight were recorded

Question: How accurately can catheter length be determined by height and weight?
[Scatter plots: distance (cm) vs. height (in), and distance (cm) vs. weight (lb)]

Multiple Regression, Mar 3, 2004 -6-


Multiple Regression

Example: Heart catheterization (contd)


Regression model:

Y = b0 + b1 x 1 + b2 x 2 + ε

where ◦ Y - distance to pulmonary artery


◦ x1 - height
◦ x2 - weight

STATA regression output:


. regress distance height weight
Source | SS df MS Number of obs = 12
-------------+------------------------------ F( 2, 9) = 18.62
Model | 578.81613 2 289.408065 Prob > F = 0.0006
Residual | 139.913037 9 15.545893 R-squared = 0.8053
-------------+------------------------------ Adj R-squared = 0.7621
Total | 718.729167 11 65.3390152 Root MSE = 3.9428
------------------------------------------------------------------------------
distance | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
height | .1963566 .3605845 0.54 0.599 -.6193422 1.012056
weight | .1908278 .165164 1.16 0.278 -.1827991 .5644547
_cons | 21.0084 8.751156 2.40 0.040 1.211907 40.80489
------------------------------------------------------------------------------

Note:
◦ Neither height nor weight seems to be significant for predicting the distance to the pulmonary artery.
◦ The regression on both variables nevertheless explains 80% of the variation of the response (catheter length).

Multiple Regression, Mar 3, 2004 -7-


Multiple Regression

Example: Heart catheterization (contd)


Consider predicting the length by height alone and by weight alone:
. regress distance height
R-squared = 0.7765
------------------------------------------------------------------------------
distance | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
height | .5967612 .1012558 5.89 0.000 .3711492 .8223732
_cons | 12.12405 4.247174 2.85 0.017 2.660752 21.58734
------------------------------------------------------------------------------
. regress distance weight
R-squared = 0.7989
------------------------------------------------------------------------------
distance | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | .2772687 .0439881 6.30 0.000 .1792571 .3752804
_cons | 25.63746 2.004207 12.79 0.000 21.17181 30.10311
------------------------------------------------------------------------------

Note:
◦ In a simple regression of Y on either height or weight, the explanatory
variable is highly significant for predicting Y .
◦ In a multiple regression of Y on height and weight, the coefficients for
both height and weight are not significantly different from zero.

Problem: The explanatory variables are strongly linearly related (collinear).

[Scatter plot: weight (lb) vs. height (in), showing a strong linear relationship]
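One way to quantify this collinearity is the correlation between the two predictors, or equivalently the variance inflation factor (VIF). A small Python sketch with hypothetical height/weight values (the patient data are not reproduced here):

import numpy as np

rng = np.random.default_rng(2)
height = rng.uniform(20, 65, 12)                      # hypothetical heights (in)
weight = 1.5 * height - 15 + rng.normal(0, 6, 12)     # hypothetical weights (lb), tied to height

r = np.corrcoef(height, weight)[0, 1]
vif = 1 / (1 - r**2)       # variance inflation factor for either predictor (two-predictor case)
print(f"correlation = {r:.2f}, VIF = {vif:.1f}")

A VIF well above 1 signals that the standard errors of the individual coefficients are inflated, which is exactly what the wide confidence intervals in the multiple regression above show.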

Multiple Regression, Mar 3, 2004 -8-


Analysis of Variance

Decomposition of variation:
◦ SS Total = Σ_i (Yi − Ȳ)² - total variation
◦ SS Residual = Σ_i (Yi − Ŷi)² - residual variation left unexplained by the regression model
◦ SS Model = SS Total − SS Residual = Σ_i (Ŷi − Ȳ)² - variation explained by the regression

Coefficient of determination: The ratio

R² = SS Model / SS Total

indicates how well the regression model predicts the response. R² is also the squared multiple correlation coefficient - in a simple linear regression we have

R² = ρ²XY (the squared correlation between X and Y).

Example: Heart catheterization

Source | SS df MS Number of obs = 12


-------------+------------------------------ F( 2, 9) = 18.62
Model | 578.81613 2 289.408065 Prob > F = 0.0006
Residual | 139.913037 9 15.545893 R-squared = 0.8053
-------------+------------------------------ Adj R-squared = 0.7621
Total | 718.729167 11 65.3390152 Root MSE = 3.9428

The coefficient of determination for these data is


R² = 578.82/718.73 = 0.81.
Regression on height and weight explains 81% of the variation of distance.
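A quick arithmetic check of this ratio in Python, using the SS values from the table above:

ss_model, ss_total = 578.81613, 718.729167
print(round(ss_model / ss_total, 4))   # 0.8053, i.e. about 81%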

Multiple Regression, Mar 3, 2004 -9-


Analysis of Variance

Question: Is improvement in prediction (decrease in variation) significant?

Our null hypothesis is that none of the explanatory variables helps to predict the response, that is,

H0 : b1 = . . . = bp = 0    versus    Ha : bj ≠ 0 for at least one j ∈ {1, . . . , p}.

Under the null hypothesis H0 the F statistic


F = (n − p − 1)/p · SS Model/SS Residual = (n − p − 1)/p · (SS Total − SS Residual)/SS Residual
is F distributed with p and n − p − 1 degrees of freedom.
The null hypothesis H0 is rejected at level α if F > Fp,n−p−1,α.

Example: Heart catheterization

Source | SS df MS Number of obs = 12


-------------+------------------------------ F( 2, 9) = 18.62
Model | 578.81613 2 289.408065 Prob > F = 0.0006
Residual | 139.913037 9 15.545893 R-squared = 0.8053
-------------+------------------------------ Adj R-squared = 0.7621
Total | 718.729167 11 65.3390152 Root MSE = 3.9428

The value of the F statistic is


F = (9/2) · (578.82/139.91) = 18.62.
The critical value for rejecting H0 : b1 = b2 = 0 is F2,9,0.05 = 4.26. Thus
the null hypothesis H0 that both coefficients b1 and b2 are zero is rejected
at significance level α = 0.05.
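The F statistic, the critical value F2,9,0.05 and the p-value can be reproduced with scipy, taking the SS values from the table above:

from scipy.stats import f

n, p = 12, 2
ss_model, ss_resid = 578.81613, 139.913037
F = (n - p - 1) / p * ss_model / ss_resid       # (9/2) * SS Model / SS Residual
crit = f.ppf(0.95, p, n - p - 1)                # F_{2,9,0.05}
p_value = f.sf(F, p, n - p - 1)
print(round(F, 2), round(crit, 2), round(p_value, 4))   # 18.62, 4.26, 0.0006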

Multiple Regression, Mar 3, 2004 - 10 -


Comparing Models

Example: Cobb-Douglas production function

Y = t · K^a · L^b · M^c

where ◦ Y - output ◦ L - labour


◦ K - capital ◦ M - materials
Regression model:

log Y = log t + a log K + b log L + c log M

[Scatter plots: Y vs. K, Y vs. L, and Y vs. M]

Multiple Regression, Mar 3, 2004 - 11 -


Comparing Models

Example: Cobb-Douglas production function (contd)


Regression model M0 for Cobb-Douglas function:

log Y = log t + a log K + b log L + c log M


. regress LY LK LM LL
Source | SS df MS Number of obs = 25
---------+----------------------------- F( 3, 21) = 138.98
Model | 1.35136742 3 .450455808 Prob > F = 0.0000
Residual | .068065609 21 .003241219 R-squared = 0.9520
---------+----------------------------- Adj R-squared = 0.9452
Total | 1.41943303 24 .059143043 Root MSE = .05693
-------------------------------------------------------------------------
LY | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+---------------------------------------------------------------
LK | .0718626 .1543912 0.47 0.646 -.2492114 .3929366
LM | .7072231 .3004146 2.35 0.028 .0824768 1.331969
LL | .2117778 .4248755 0.50 0.623 -.6717991 1.095355
_cons | .0347117 .0374354 0.93 0.364 -.0431395 .1125629

Two variables, log K and log L, do not improve prediction of log Y .


Alternative model M1:

log Y = log t + c log M


. regress LY LM
Source | SS df MS Number of obs = 25
---------+----------------------------- F( 1, 23) = 445.69
Model | 1.34977753 1 1.34977753 Prob > F = 0.0000
Residual | .069655501 23 .0030285 R-squared = 0.9509
---------+----------------------------- Adj R-squared = 0.9488
Total | 1.41943303 24 .059143043 Root MSE = .05503
-------------------------------------------------------------------------
LY | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+---------------------------------------------------------------
LM | .9086794 .0430421 21.11 0.000 .81964 .9977188
_cons | .0512244 .0189767 2.70 0.013 .011968 .0904808

Question: Is model M0 significantly better than model M1 ?

Multiple Regression, Mar 3, 2004 - 12 -


Comparing Models

Consider the multiple regression model with p explanatory variables

Yi = b0 + b1 x1,i + . . . + bp xp,i + εi .

Problem:
Test the null hypothesis
H0 : q specific explanatory variables all have zero coefficients
versus
Ha : at least one of these q explanatory variables has a nonzero coefficient.

Solution:
◦ Regress Y on all p explanatory variables and read SS Residual^(1) from the output.
◦ Regress Y on just the p − q explanatory variables that remain after you remove the q variables from the model. Read SS Residual^(2) from the output.
◦ The test statistic (see the sketch after this list) is

F = (n − p − 1)/q · (SS Residual^(2) − SS Residual^(1)) / SS Residual^(1) .

Under the null hypothesis, F is F distributed with q and n − p − 1 degrees of freedom.
◦ Reject H0 if F > Fq,n−p−1,α.
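A small Python helper that packages this nested-model test (the function name nested_f_test is ours, not a STATA or scipy name):

from scipy.stats import f

def nested_f_test(ss_resid_full, ss_resid_reduced, n, p, q, alpha=0.05):
    """F test of H0: the q dropped predictors all have zero coefficients.

    ss_resid_full    -- SS Residual of the model with all p predictors
    ss_resid_reduced -- SS Residual of the model with the q predictors removed
    """
    F = (n - p - 1) / q * (ss_resid_reduced - ss_resid_full) / ss_resid_full
    crit = f.ppf(1 - alpha, q, n - p - 1)
    return F, crit, F > crit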

Multiple Regression, Mar 3, 2004 - 13 -


Comparing Models

Example: Cobb-Douglas production function


Comparison of models M0 and M1 :
◦ M0: SS Residual^(0) = .06807 and n − p − 1 = 21.
◦ M1: SS Residual^(1) = .06966 and q = 2.

F = (21/2) · (.06966 − .06807)/.06807 = 0.2453
◦ Since F < F2,21,0.05 = 3.47 we cannot reject H0 : a = b = 0.

Using STATA:
. test LK LL
( 1) LK = 0
( 2) LL = 0
F( 2, 21) = 0.25
Prob > F = 0.7847
. test LK LL _cons
( 1) LK = 0
( 2) LL = 0
( 3) _cons = 0
F( 3, 21) = 2.43
Prob > F = 0.0934
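Feeding the Cobb-Douglas numbers into the nested_f_test sketch from the previous page reproduces both the hand computation and STATA's conclusion:

F, crit, reject = nested_f_test(0.068065609, 0.069655501, n=25, p=3, q=2)
print(round(F, 3), round(crit, 2), reject)   # 0.245, 3.47, False: H0 is not rejected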

Multiple Regression, Mar 3, 2004 - 14 -


Case Study

Example: Headaches and pain reliever

◦ 24 patients with a common type of headache were treated with a new


pain reliever
◦ Medication was given to each patient at one of four dosage levels: 2, 5, 7, or 10 grams
◦ Response variable: time until noticeable relief (in minutes)
◦ Other explanatory variables:
⋄ sex (0=female, 1=male)
⋄ blood pressure (0.25=low, 0.50=medium, 0.75=high)

Box plots

[Box plots: time to relief (in minutes) by sex, within each dosage level: 2, 5, 7, and 10 grams]

Multiple Regression II, Mar 5, 2004 -1-


Case Study
. regress time dose bp if sex==0
R-squared = 0.8861
--------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+----------------------------------------------------------------
dose | -5.519608 .6608907 -8.35 0.000 -7.014646 -4.024569
bp | -5 9.439407 -0.53 0.609 -26.35342 16.35342
_cons | 61.11765 6.458495 9.46 0.000 46.50752 75.72778
--------------------------------------------------------------------------
. predict YHf
(option xb assumed; fitted values)
. twoway line YHf dose if bp==0.25||line YHf dose if bp==0.5||
> line YHf dose if bp==0.75||scatter time dose if(sex==0), saving(a, replace)
(file a.gph saved)
. regress time dose bp if sex==1
R-squared = 0.5765
--------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+----------------------------------------------------------------
dose | -3.343137 .9564492 -3.50 0.007 -5.506776 -1.179499
bp | -2.5 13.66083 -0.18 0.859 -33.40294 28.40294
_cons | 51.39216 9.346814 5.50 0.000 30.2482 72.53612
--------------------------------------------------------------------------
. predict YHm
(option xb assumed; fitted values)
. twoway line YHm dose if bp==0.25||line YHm dose if bp==0.5||
> line YHm dose if bp==0.75||scatter time dose if(sex==1), saving(b, replace)
(file b.gph saved)
. graph combine a.gph b.gph
[Combined graphs: fitted time vs. dose, one line per blood pressure level, with observed times overlaid; left panel females, right panel males]

Multiple Regression II, Mar 5, 2004 -2-


Case Study

Model:

Time = Dose + Sex + Sex · Dose + BP + ε


. infile time dose sex bp using headache.dat
(24 observations read)
. generate sexdose=sex*dose
. regress time dose sex sexdose bp
Source | SS df MS Number of obs = 24
----------+------------------------------ F( 4, 19) = 16.78
Model | 4387.65319 4 1096.9133 Prob > F = 0.0000
Residual | 1242.30515 19 65.3844814 R-squared = 0.7793
----------+------------------------------ Adj R-squared = 0.7329
Total | 5629.95833 23 244.780797 Root MSE = 8.0861
---------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+----------------------------------------------------------------
dose | -5.519608 .8006399 -6.89 0.000 -7.195367 -3.843849
sex | -8.47549 7.553222 -1.12 0.276 -24.28457 7.333585
sexdose | 2.176471 1.132276 1.92 0.070 -.19341 4.546351
bp | -3.75 8.086067 -0.46 0.648 -20.67433 13.17433
_cons | 60.49265 6.698634 9.03 0.000 46.47224 74.51305
---------------------------------------------------------------------------
. predict YH
(option xb assumed; fitted values)
. predict E, residuals
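For comparison, a hedged Python sketch of the same interaction model; the whitespace-separated layout and column order of headache.dat are assumptions based on the infile command above:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("headache.dat", sep=r"\s+", names=["time", "dose", "sex", "bp"])

# sex:dose builds the interaction term, mirroring ". generate sexdose=sex*dose"
fit = smf.ols("time ~ dose + sex + sex:dose + bp", data=df).fit()
print(fit.summary())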

Residual plots: residuals vs. Dose, and normal quantile plot of the residuals


[Left panel: residuals (in minutes) vs. dose (in grams). Right panel: sample quantiles of the residuals vs. theoretical normal quantiles]

Multiple Regression II, Mar 5, 2004 -3-


Case Study

Model:

Time = Dose + Dose² + Sex + Sex · Dose + BP + ε


. drop YH E
. generate dosesq=dose^2
. regress time dose sex sexdose dosesq bp
Source | SS df MS Number of obs = 24
----------+------------------------------ F( 5, 18) = 24.20
Model | 4901.02819 5 980.205637 Prob > F = 0.0000
Residual | 728.930147 18 40.4961193 R-squared = 0.8705
----------+------------------------------ Adj R-squared = 0.8346
Total | 5629.95833 23 244.780797 Root MSE = 6.3637
---------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+----------------------------------------------------------------
dose | -12.91961 2.171775 -5.95 0.000 -17.48234 -8.356878
sex | -8.47549 5.944312 -1.43 0.171 -20.96403 4.013047
sexdose | 2.176471 .8910901 2.44 0.025 .3043598 4.048581
dosesq | .6166667 .1731968 3.56 0.002 .2527937 .9805396
bp | -3.75 6.363656 -0.59 0.563 -17.11955 9.619545
_cons | 77.45098 7.104701 10.90 0.000 62.52456 92.3774
---------------------------------------------------------------------------
. predict E, residuals

[Left panel: residuals (in minutes) vs. dose (in grams). Right panel: normal quantile plot of the residuals]

. test sex bp
( 1) sex = 0
( 2) bp = 0
F( 2, 18) = 1.19
Prob > F = 0.3270

Multiple Regression II, Mar 5, 2004 -4-


Case Study

Model:

Time = Dose + Dose² + Sex · Dose + ε


. regress time dose sexdose dosesq
Source | SS df MS Number of obs = 24
----------+------------------------------ F( 3, 20) = 38.81
Model | 4804.63916 3 1601.54639 Prob > F = 0.0000
Residual | 825.319178 20 41.2659589 R-squared = 0.8534
----------+------------------------------ Adj R-squared = 0.8314
Total | 5629.95833 23 244.780797 Root MSE = 6.4239
---------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+----------------------------------------------------------------
dose | -12.34823 2.154675 -5.73 0.000 -16.8428 -7.853653
sexdose | 1.033708 .3931338 2.63 0.016 .2136452 1.853771
dosesq | .6166667 .1748353 3.53 0.002 .2519667 .9813667
_cons | 71.33824 5.667294 12.59 0.000 59.51647 83.16
---------------------------------------------------------------------------
. predict YH
(option xb assumed; fitted values)
. twoway line YH dose if sex==0|| line YH dose if sex==1,
> legend(label(1 "female") label(2 "male"))

[Line plot: fitted time (in minutes) vs. dose (in grams), one curve for females and one for males]

Multiple Regression II, Mar 5, 2004 -5-


Comparing Several Means

Example: Comparison of laboratories

◦ Task: Measure amount of chlorpheniramine maleate in tablets


◦ Seven laboratories were asked to make 10 determinations of one tablet
◦ Study consistency between labs and variability of measurements
Box plot

[Box plots: amount of chlorpheniramine (in mg) by laboratory, Lab 1 to Lab 7]

One-Way Analysis of Variance, Mar 8, 2004 -1-


Comparing Several Means

Example: Comparison of drugs

◦ Experimental study of drugs to relieve itching


◦ Five drugs were compared to a placebo and no drug
◦ Ten volunteer male subjects
◦ Each subject underwent one treatment per day (randomized order)
◦ Drug or placebo were given intravenously
◦ Itching was induced on forearms with cowage
◦ Subjects recorded duration of itching
Box plot

[Box plots: duration of itching (sec) for No drug, Placebo, Papaverine, Morphine, Aminophylline, Pentobarbital, and Tripelennamine]

One-Way Analysis of Variance, Mar 8, 2004 -2-


Comparing Several Means

. infile amount lab using labs.txt


(70 observations read)
. graph box amount, over(lab)
. oneway amount lab, bonferroni tabulate
| Summary of amount
lab | Mean Std. Dev. Freq.
------------+------------------------------------
1 | 4.062 .03259178 10
2 | 3.997 .08969706 10
3 | 4.003 .02311808 10
4 | 3.920 .03333330 10
5 | 3.957 .05716445 10
6 | 3.955 .06704064 10
7 | 3.998 .08482662 10
------------+------------------------------------
Total | 3.9845715 .07184294 70
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups .1247371 6 .020789517 5.66 0.0001
Within groups .231400073 63 .003673017
------------------------------------------------------------------------
Total .356137173 69 .005161408
Bartlett’s test for equal variances: chi2(6) = 24.3697 Prob>chi2 = 0.000
Comparison of amount by lab
(Bonferroni)
Row Mean-|
Col Mean | 1 2 3 4 5 6
---------+------------------------------------------------------------------
2 | -.065
| 0.408
|
3 | -.059 .006
| 0.698 1.000
|
4 | -.142 -.077 -.083
| 0.000 0.127 0.068
|
5 | -.105 -.04 -.046 .037
| 0.005 1.000 1.000 1.000
|
6 | -.107 -.042 -.048 .035 -.002
| 0.004 1.000 1.000 1.000 1.000
|
7 | -.064 .001 -.005 .078 .041 .043
| 0.448 1.000 1.000 0.115 1.000 1.000
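A rough Python analogue of ". oneway amount lab, bonferroni" is sketched below. The file layout is assumed from the infile command above, and scipy's pairwise t tests use per-pair variances rather than the pooled within-group MS used by STATA, so the adjusted p-values will differ slightly:

import pandas as pd
from scipy.stats import f_oneway, ttest_ind

df = pd.read_csv("labs.txt", sep=r"\s+", names=["amount", "lab"])
groups = [g["amount"].to_numpy() for _, g in df.groupby("lab")]

F, p = f_oneway(*groups)                 # between-groups / within-groups F test
print(F, p)

# Bonferroni-adjusted pairwise comparisons (21 pairs for 7 labs)
k = len(groups)
n_pairs = k * (k - 1) // 2
for i in range(k):
    for j in range(i + 1, k):
        t, p_ij = ttest_ind(groups[i], groups[j])
        print(i + 1, j + 1, min(1.0, p_ij * n_pairs))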

One-Way Analysis of Variance, Mar 8, 2004 -3-


Comparing Several Means
. oneway duration drug, bonferroni tabulate
| Summary of duration
drug | Mean Std. Dev. Freq.
------------+------------------------------------
1 | 191.0 54.861442 10
2 | 204.8 105.723750 10
3 | 118.2 52.809511 10
4 | 148.0 44.738748 10
5 | 144.3 42.076782 10
6 | 176.5 68.856130 10
7 | 167.2 67.499465 10
------------+------------------------------------
Total | 164.28571 68.463709 70
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 53012.8857 6 8835.48095 2.06 0.0708
Within groups 270409.4 63 4292.2127
------------------------------------------------------------------------
Total 323422.286 69 4687.2795
Bartlett’s test for equal variances: chi2(6) = 11.3828 Prob>chi2 = 0.077
Comparison of duration by drug
(Bonferroni)
Row Mean-|
Col Mean | 1 2 3 4 5 6
---------+------------------------------------------------------------------
2 | 13.8
| 1.000
|
3 | -72.8 -86.6
| 0.328 0.092
|
4 | -43 -56.8 29.8
| 1.000 1.000 1.000
|
5 | -46.7 -60.5 26.1 -3.7
| 1.000 0.904 1.000 1.000
|
6 | -14.5 -28.3 58.3 28.5 32.2
| 1.000 1.000 1.000 1.000 1.000
|
7 | -23.8 -37.6 49 19.2 22.9 -9.3
| 1.000 1.000 1.000 1.000 1.000 1.000

One-Way Analysis of Variance, Mar 8, 2004 -4-
