
Some definitions

◦ Individual: each object described by a set of data


◦ Variable: any characteristic of an individual
⋄ Categorical variable: places an individual into one of several
groups or categories.
⋄ Quantitative variable: takes numerical values on which we can
do arithmetic.
◦ Distribution of a variable: tells what values it takes and how often
it takes these values.

Example:
The following data set consists of five variables about 20 individuals.
ID Age Education Sex Total income Job class
1 43 4 1 18526 5
2 35 3 2 5400 7
3 43 2 1 3900 7
4 33 3 1 28003 5
5 38 3 2 43900 7
6 53 4 1 53000 5
7 64 6 1 51100 6
8 27 4 2 44000 5
9 34 4 1 31200 5
10 27 3 2 26030 5
11 47 6 1 6000 6
12 48 3 1 8145 5
13 39 2 1 37032 5
14 30 3 2 30000 5
15 35 3 2 17874 5
16 47 4 2 400 5
17 51 4 2 22216 5
18 56 5 1 26000 6
19 57 6 1 100267 7
20 34 1 1 15000 5

Age: age in years


Education: 1=no high school, 2=some high school, 3=high school diploma,
4=some college, 5=bachelor’s degree, 6=postgraduate degree
Sex: 1=male, 2=female
Total income: income from all sources
Job class: 5=private sector, 6=government, 7=self employed
Variables Age and Total income are quantitative; variables Education, Sex,
and Job class are categorical.
Graphical Description of Data, Jan 5, 2004 -1-
Categorical variable analysis

Questions to ask about a categorical variable:


◦ How many categories are there?
◦ In each category, how many observations are there?

Bar graphs and pie charts


Categorical data can be displayed by bar graphs or pie charts.
◦ In a bar graph, the horizontal axis lists the categories, in any order.
The height of the bars can be either counts or percentages.
◦ For better comparison of the frequencies, the categories can be ordered
from most frequent to least frequent.
◦ In a pie chart, the area of each slice is proportional to the percentage
of individuals who fall into that category.

Example: Education of people aged 25 to 34


[Figures: two bar graphs of the percent of people aged 25 to 34 at each education level (in category order and sorted by frequency), and a pie chart of the same distribution with labelled percentages; categories: no HS, some HS, HS diploma, some college, Bachelor's, postgrad.]

Graphical Description of Data, Jan 5, 2004 -2-


Categorical variable analysis

Example: Education of people aged 25 to 34


STATA commands:

. infile ID AGE EDUC SEX EARN JOB using individuals.txt, clear


. drop if AGE<25 | AGE>34
. label values EDUC Education
. label define Education 1 "no HS" 2 "some HS" 3 "HS diploma" 4 "Bachelor’s"
> 5 "some college" 6 "postgrad"
. set scheme s1mono
. gen COUNT=100/_N
. graph bar (sum) COUNT, over(EDUC) ytitle("Percent of people aged 25 to 34")
> b1title("Education level")
. translate @Graph bar1.eps, translator(Graph2eps) replace
. graph bar (sum) COUNT, over(EDUC, sort(1) descending)
> ytitle("Percent of people aged 25 to 34") b1title("Education level")
. translate @Graph bar2.eps, translator(Graph2eps) replace
. set scheme s1color
. graph pie COUNT, over(EDUC) plabel(_all perc, format(%4.1f) gap(-5))
. translate @Graph pie.eps, translator(Graph2eps) replace

Graphical Description of Data, Jan 5, 2004 -3-


Quantitative variables: stemplots

Example: Sammy Sosa home runs

Year Home runs
1989 4
1990 15
1991 10
1992 8
1993 33
1994 25
1995 36
1996 40
1997 36
1998 66
1999 63
2000 50
2001 64
2002 49
2003 40

Producing stemplots in STATA:

. infile YEAR HR using sosa.dat
. stem HR

Stem-and-leaf plot for HR

0* | 48
1* | 05
2* | 5
3* | 366
4* | 009
5* | 0
6* | 346

How to make a stemplot

1. Separate each observation into a stem and a leaf.

e.g. 15 → stem 1, leaf 5, and 4 → stem 0, leaf 4

2. Write the stems in a vertical column in increasing order.

3. Write each leaf next to its stem, in increasing order out from the stem.

How to choose the stem


◦ Rounding: each leaf should have exactly one digit, so rounding long
numbers before producing the stemplot can help produce a more com-
pact and informative plot.
◦ Splitting: if each stem (or many stems) has a large number of leaves,
all stems can be split, with leaves 0-4 going to the first stem and 5-9
going to the second.
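
As an illustration (a minimal Python sketch, not part of the original notes), the stemplot of the Sosa home-run counts above can be built by splitting each value into a tens-digit stem and a ones-digit leaf:

# Stem-and-leaf sketch for the Sosa home-run data.
hr = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]
stems = {}
for x in sorted(hr):
    stems.setdefault(x // 10, []).append(x % 10)   # split into stem and leaf
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(d) for d in stems.get(stem, []))
    print(f"{stem}* | {leaves}")

This reproduces the rows "0* | 48", "1* | 05", ... of the STATA output shown above.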

Graphical Description of Data, Jan 5, 2004 -4-


Quantitative variables: histograms

How to make a histogram

1. Group the observations into “bins” according to their value. Choose
the bins carefully: too few bins hide detail, too many fragment the pattern.

2. Count the individuals in each bin.

3. Draw the histogram


◦ Leave no space between bars.
◦ Label the axes with units of measurement.
◦ The y-axis can be counts or percentages (per unit width).

Example: Sammy Sosa home runs


Year Home runs
1989 4
1990 15
1991 10
1992 8
1993 33
1994 25
1995 36
1996 40
1997 36
1998 66
1999 63
2000 50
2001 64
2002 49
2003 40

[Figure: density histogram of home runs (bins of width 10 from 0 to 70, density 0 to .04).]

The area of each bar is proportional to the percentage of data in that range.
We care about the area, not the height, but when the bars have equal width,
the area is determined by the height.
For simplicity, use equally spaced bins.

Graphical Description of Data, Jan 5, 2004 -5-


Quantitative variables: histograms

Example: Sammy Sosa home runs


Histograms with different bin widths:
[Figure: four histograms of the Sosa home runs with different bin widths; vertical axis shows percentage (0.00 to 0.07), horizontal axis shows home runs (0 to 70).]

Producing histograms in STATA:

. infile YEAR HR using sosa.dat


. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs)
. translate @Graph hist1.eps, translator(Graph2eps) replace
. hist HR, start(0.1) width(10) xlabel(0(10)70) xtitle(Home runs) freq
. translate @Graph hist2.eps, translator(Graph2eps) replace
[Figure: the resulting density histogram (left) and frequency histogram (right) of home runs.]

Why is a histogram not a bar graph?


◦ Frequencies are represented by area, not height.
◦ There is no space between the bars.
◦ The horizontal axis represents a numerical quantity, with an inherent
order.

Graphical Description of Data, Jan 5, 2004 -6-


Interpreting histograms

◦ Describe the overall pattern and any significant deviations from that
pattern.
◦ Shape: Is the distribution (approximately) symmetric or skewed?
[Figure: a right-skewed histogram of x (frequency against x, 0.0 to 2.0).]

This distribution is skewed right because it has a long right-hand tail.

◦ Center: Where is the “middle” of the distribution?


◦ Spread: What are the smallest and largest values?
◦ Outliers: Are there any observations that lie outside the overall pat-
tern? They could be unusual observations, or they could be mistakes.
Check them!

Example: Newcomb’s measurements of the passage time of light (IPS Table 1.1)

[Figure: histogram of Newcomb’s measurements (frequency against time, −60 to 60).]

Graphical Description of Data, Jan 5, 2004 -7-


Time plots

Example: Average retail price of gasoline from Jan 1988 to Apr 2001

[Figure: time plot of the retail gasoline price (0.9 to 1.8) against year, 1988 to 2000.]

Note: Whenever data are collected over time, it is a good idea to have
a time plot. Stemplots and histograms ignore time order, which can be
misleading when systematic change over time exists.

Producing a time plot in STATA:

. infile PRICE using gasoline.txt, clear


. graph twoway line PRICE T, ylabel(0.9(0.1)1.8, format(%3.1f)) xtick(0(12)159)
> xlabel(0 "1988" 24 "1990" 48 "1992" 72 "1994" 96 "1996" 120 "1998" 144 "2000")
> xtitle(Year) ytitle(Retail gasoline price)

Graphical Description of Data, Jan 5, 2004 -8-


Measures of center

The mean
The mean of a distribution is the arithmetic average of the obser-
vations:
x̄ = (x1 + · · · + xn)/n = (1/n) Σ_{i=1}^n xi

The median
The median is the midpoint of a distribution: the number M
such that
◦ half the observations are smaller and
◦ half are larger.

How to find the median


Suppose the observations are x1, x2, . . . , xn.

1. Arrange the data in increasing order and let x(i) denote the ith
smallest observation.
2. If the number of observations n is odd, the median is the center
observation in the ordered list:

M = x((n+1)/2)

3. If the number of observations n is even, the median is the average
of the two center observations in the ordered list:

M = (x(n/2) + x(n/2+1))/2
Numerical Description of Data, Jan 7, 2004 -1-
Measures of center

Examples:
Data set 1:
x1 x2 x3 x4 x5 x6 x7 x8 x9
2 4 3 4 6 5 4 -6 5

Arrange in increasing order:


x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)
-6 2 3 4 4 4 5 5 6
There is an odd number of observations, so the median is

M = x((n+1)/2) = x(5) = 4.

The mean is given by


x̄ = (2 + 4 + 3 + 4 + 6 + 5 + 4 + (−6) + 5)/9 = 27/9 = 3.

Data set 2:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
2.3 8.8 3.9 4.1 6.4 5.9 4.2 2.9 1.3 5.1

Arrange in increasing order:


x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10)
1.3 2.3 2.9 3.9 4.1 4.2 5.1 5.9 6.4 8.8
There is an even number of observations, so the median is
M = (x(n/2) + x(n/2+1))/2 = (x(5) + x(6))/2 = (4.1 + 4.2)/2 = 4.15.

The mean is given by

x̄ = (2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1)/10 = 44.9/10 = 4.49.
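
As a quick check (a hypothetical Python snippet, not part of the original notes), the standard library reproduces both worked examples:

import statistics

data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]
# Odd n: the median is the middle ordered value; even n: the average of the two middle values.
print(statistics.mean(data1), statistics.median(data1))   # 3  4
print(statistics.mean(data2), statistics.median(data2))   # 4.49  4.15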

Numerical Description of Data, Jan 7, 2004 -2-


Mean versus median

◦ The mean is easy to work with algebraically, while the median


is not.
◦ The mean is sensitive to extreme observations, while the median
is more robust.

Example:

Observations 0, 1, 2; the largest observation is then changed to 10.

[Figure: dotplots of the original and modified observations on a 0-10 axis.]

The original mean and median are

x̄ = (0 + 1 + 2)/3 = 1 and M = x((n+1)/2) = 1

The modified mean and median are

x̄ = (0 + 1 + 10)/3 = 11/3 = 3 2/3 and M = x((n+1)/2) = 1
◦ If the distribution is exactly symmetric, then mean=median.
◦ In a skewed distribution, the mean is further out in the longer
tail than the median.
◦ The median is preferable for strongly skewed distributions, or
when outliers are present.

Numerical Description of Data, Jan 7, 2004 -3-


Measures of spread

Example: Monthly returns on two stocks


[Figure: histograms of the daily returns (in %) of Stock A and Stock B, −10 to 20.]

Stock A Stock B
Mean 4.95 4.82
Median 4.99 4.68
The distributions of the two stocks have approximately the same
mean and median, but stock B is more volatile and thus more risky.

◦ Measures of center alone are an insufficient description of a


distribution and can be misleading
◦ The simplest useful numerical description of a distribution con-
sists of both a measure of center and a measure of spread.

Common measures of spread are


◦ the quartiles and the interquartile range
◦ the standard deviation

Numerical Description of Data, Jan 7, 2004 -4-


Quartiles

Quartiles divide data into 4 even parts


◦ Lower (or first) quartile QL :
median of all observations less than the median M
◦ Middle (or second) quartile M = QM :
median of all observations
◦ Upper (or third) quartile QU :
median of all observations greater than the median M
◦ Interquartile range: IQR = QU − QL
distance between upper and lower quartile

How to find the quartiles

1. Arrange the data in increasing order and find the median M

2. Find the median of the observations to the left of M; this is the lower
quartile, QL

3. Find the median of the observations to the right of M; this is the
upper quartile, QU

Examples:
Data set:
x1 x2 x3 x4 x5 x6 x7 x8 x9
2 4 3 4 6 5 4 -6 5

Arrange in increasing order:


x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)
-6 2 3 4 4 4 5 5 6
◦ QL is the median of {−6, 2, 3, 4}: QL = 2.5
◦ QU is the median of {4, 5, 5, 6}: QU = 5
◦ IQR = 5 − 2.5 = 2.5
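
The quartile rule above (split the ordered list at the median, leaving the median itself out when n is odd) can be sketched in Python; this illustrates the convention used in these notes and is not part of the original slides:

import statistics

def quartiles(data):
    xs = sorted(data)
    n = len(xs)
    lower = xs[: n // 2]            # observations to the left of the median
    upper = xs[(n + 1) // 2 :]      # observations to the right of the median
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

ql, m, qu = quartiles([2, 4, 3, 4, 6, 5, 4, -6, 5])
print(ql, m, qu, qu - ql)           # 2.5  4  5  2.5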
Numerical Description of Data, Jan 7, 2004 -5-
Percentiles

More generally we might be interested in the value which is ex-


ceeded only by a certain percentage of observations:

The pth percentile of a set of observations is the value such that


◦ p% of the observations are less than or equal to it and
◦ (100 − p)% of the observations are greater than or equal to it.

How to find the percentiles

1. Arrange the data into increasing order.


2. If np/100 is not an integer, then x(k+1) is the pth percentile,
where k is the largest integer less than np/100.
3. If np/100 is an integer, the pth percentile is the average of
x(np/100) and x(np/100+1).
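
A small Python sketch of the percentile rule just stated (illustrative, not part of the original notes); note that software packages use a variety of slightly different percentile conventions:

def percentile(data, p):
    xs = sorted(data)
    k = len(xs) * p / 100
    if k != int(k):
        return xs[int(k)]                 # x_(k+1), with 1-based indexing
    k = int(k)
    return (xs[k - 1] + xs[k]) / 2        # average of x_(np/100) and x_(np/100 + 1)

data = [-6, 2, 3, 4, 4, 4, 5, 5, 6]
print(percentile(data, 50))               # 4, the median
print(percentile(data, 20))               # 2 (np/100 = 1.8, so take the 2nd smallest value)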

Five-number summary

A numerical summary of a distribution {x1, . . . , xn} is given by

x(1) QL M QU x(n)

A simple boxplot is a graph of the five-number summary.

Numerical Description of Data, Jan 7, 2004 -6-


Boxplots

A common “rule” for discovering outliers is the 1.5 × IQR rule:


An observation is a suspected outlier if it falls more
than 1.5 × IQR below QL or above QU .

How to draw a boxplot (box-and-whisker plot)

1. A box is drawn from the lower to the upper quartile (QL and QU ).

2. The median of the data is shown by a line in the box.

3. Lines (the whiskers) are drawn from the ends of the box to the most
extreme observations within a distance of 1.5 IQR (interquartile range).

4. Measurements falling outside 1.5 IQR from the ends of the box are
potential outliers and are marked by ◦ or ∗.

[Figure: side-by-side boxplots of the Stock A and Stock B returns, −10 to 20.]

Plotting a boxplot with STATA:
. infile A B using stocks.txt, clear
. label var A "Stock A"
. label var B "Stock B"
. graph box A B, xsize(2) ysize(5)

Numerical Description of Data, Jan 7, 2004 -7-


Boxplots

Interpretation of Box Plots


◦ The IQR is a measure of the sample’s variability.
◦ If the whiskers differ in length, the distribution of the data is
probably skewed in the direction of the longer whisker.
◦ Very extreme observations (more than 3 IQR away from the
lower or upper quartile, respectively) are outliers, with one of the following
explanations:
a) The measurement is incorrect (error in measurement process
or data processing).
b) The measurement belongs to a different population.
c) The measurement is correct, but represents a rare (chance)
event.
We accept the last explanation only after carefully ruling out
all others.

Numerical Description of Data, Jan 7, 2004 -8-


Variance and standard deviation

Suppose there are n observations x1, x2, . . . , xn.

The variance of the n observations is:

s² = [(x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)²] / (n − 1)
   = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²

This is (approximately) the average of the squared distances of the
observations from the mean.

The standard deviation is:

s = √s² = √[ (1/(n − 1)) Σ_{i=1}^n (xi − x̄)² ]

Why n − 1?
Division by n − 1 instead of n in the variance calculation is a
common cause of confusion. Why n − 1? Note that
Σ_{i=1}^n (xi − x̄) = 0

Thus, if you know any n − 1 of the differences, the last difference


can be determined from the others. The number of “freely varying”
observations, n − 1 in this case, is called the “degrees of freedom”.
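
For example, a short Python check of the n − 1 convention (a hypothetical snippet, not part of the original notes), using the data set from the median example:

import statistics

x = [2, 4, 3, 4, 6, 5, 4, -6, 5]
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)    # divide by n - 1, not n
print(s2, s2 ** 0.5)                                # 12.75 and about 3.57
print(statistics.variance(x), statistics.stdev(x))  # the library uses the same n - 1 divisor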

Numerical Description of Data, Jan 7, 2004 -9-


Properties of s

◦ Measures spread around the mean =⇒ use only if the mean


is used as a measure of center.
◦ s = 0 ⇔ all observations are the same
◦ s is in the same units as the measurements, while s2 is in the
square of these units.
◦ s, like x̄, is not resistant to outliers.

Five-number summary versus standard deviation

◦ The 5-number summary is better for describing skewed distri-


butions, since each side has a different spread.
◦ x̄ and s are preferred for symmetric distributions with no out-
liers.

Numerical Description of Data, Jan 7, 2004 - 10 -


Histograms and density curves

What’s in our toolkit so far?


◦ Plot the data: histogram (or stemplot)
◦ Look for the overall pattern and identify deviations and outliers
◦ Numerical summary to briefly describe center and spread

A new idea:
If the pattern is sufficiently regular, approximate it with a
smooth curve.

Any curve that is always on or above the horizontal axis and has
total area underneath equal to one is a density curve.
◦ Area under the curve in a range of values indicates the propor-
tion of values in that range.
◦ Density curves come in a variety of shapes, but the “normal” family of familiar
bell-shaped densities is commonly used.
◦ Remember the density is only an approximation, but it sim-
plifies analysis and is generally accurate enough for practical
use.

The Normal Distribution, Jan 9, 2004 -1-


Examples

[Figures: density histogram of sulfur oxide emissions (in tons, 0 to 40); the same histogram with a shaded region of area 0.29; a fitted density curve with the corresponding shaded area under the curve equal to 0.30; and a density histogram of the waiting time between eruptions (40 to 100 min).]

The Normal Distribution, Jan 9, 2004 -2-


Median and mean of a density curve

Median:
The equal-areas point with 50% of the “mass” on either side.

Mean:
The balancing point of the curve, if it were a solid mass.

Note:
◦ The mean and median of a symmetric density curve are equal.
◦ The mean of a skewed curve is pulled away from the median in
the direction of the long tail.

The mean and standard deviation of a density are denoted µ and


σ, rather than x̄ and s, to indicate that they refer to an idealized
model, and not actual data.

The Normal Distribution, Jan 9, 2004 -3-


Normal distributions: N (µ, σ)

The normal distribution is


◦ symmetric,
◦ single-peaked,
◦ bell-shaped.

The density curve is given by


 
f(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)).

It is determined by two parameters µ and σ:


◦ µ is the mean (also the median)
◦ σ is the standard deviation

Note: The point where the curve changes from concave to convex
is σ units from µ in either direction.

The Normal Distribution, Jan 9, 2004 -4-


The 68-95-99.7 rule

◦ About 68% of the data fall inside (µ − σ, µ + σ).


◦ About 95% of the data fall inside (µ − 2σ, µ + 2σ).
◦ About 99.7% of the data fall inside (µ − 3σ, µ + 3σ).

The Normal Distribution, Jan 9, 2004 -5-


Example

Scores on the Wechsler Adult Intelligence Scale (WAIS) for the 20


to 34 age group are approximately N (110, 25).

◦ About what percent of people in this age group have scores


above 110?

◦ About what percent have scores above 160?

◦ In what range do the middle 95% of all scores lie?

The Normal Distribution, Jan 9, 2004 -6-


Standardization and z-scores

Linear transformation of normal distributions:

X ∼ N(µ, σ) ⇒ aX + b ∼ N(aµ + b, |a|σ)

In particular it follows that

(X − µ)/σ ∼ N(0, 1).

N(0, 1) is called the standard normal distribution.

For a real number x the standardized value or z-score


z = (x − µ)/σ
tells how many standard deviations x is from µ, and in what di-
rection.
Standardization enables us to use a standard normal table to find
probabilities for any normal variable.

For example:
◦ What is the proportion of N (0, 1) observations less than 1.2?
◦ What is the proportion of N (3, 1.5) observations greater than 5?
◦ What is the proportion of N (10, 5) observations between 3 and 9?
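
These proportions can be checked numerically; the sketch below uses Python’s statistics.NormalDist and is illustrative only, not part of the original notes:

from statistics import NormalDist

std = NormalDist(0, 1)
print(std.cdf(1.2))                                   # N(0, 1) below 1.2: about 0.88
print(1 - std.cdf((5 - 3) / 1.5))                     # N(3, 1.5) above 5: about 0.09
print(std.cdf((9 - 10) / 5) - std.cdf((3 - 10) / 5))  # N(10, 5) between 3 and 9: about 0.34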

The Normal Distribution, Jan 9, 2004 -7-


Normal calculations

Standard normal calculations

1. State the problem in terms of x.


2. Standardize: z = (x − µ)/σ.

3. Look up the required value(s) on the standard normal table.


4. Reality check: Does the answer make sense?

Backward normal calculations


We can also calculate the values, given the probabilities:
If MPG ∼ N (25.7, 5.88), what is the minimum MPG required to be in the
top 10%?

“Backward” normal calculations


1. State the problem in terms of the probability of being less
than some number.
2. Look up the required value(s) on the standard normal table.
3. “Unstandardize,” i.e. solve z = (x − µ)/σ for x.
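
For the MPG question above, a hypothetical Python sketch of the backward calculation (not part of the original notes):

from statistics import NormalDist

# Top 10% means the 90th percentile of N(25.7, 5.88).
z = NormalDist().inv_cdf(0.90)          # about 1.28
print(25.7 + z * 5.88)                  # "unstandardize": about 33.2 MPG

The same value comes directly from NormalDist(25.7, 5.88).inv_cdf(0.90).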

The Normal Distribution, Jan 9, 2004 -8-


Example

Suppose X ∼ N (0, 1).


◦ P(X ≤ 2) = ?
◦ P(X > 2) = ?
◦ P(−1 ≤ X ≤ 2) = ?
◦ Find the value z such that
⋄ P(X ≤ z) = 0.95
⋄ P(X > z) = 0.99
⋄ P(−z ≤ X < z) = 0.68
⋄ P(−z ≤ X < z) = 0.95
⋄ P(−z ≤ X < z) = 0.997
Suppose X ∼ N (10, 5).
◦ P(X < 5) = ?
◦ P(−3 < X < 5) = ?
◦ P(−x < X < x) = 0.95

The Normal Distribution, Jan 9, 2004 -9-


Assessing Normality

How to make a normal quantile plot


1. Arrange the data in increasing order.
2. Record the percentiles (1/n, 2/n, . . . , n/n).
3. Find the z-scores for these percentiles.
4. Plot x on the vertical axis against z on the horizontal axis.
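
A minimal Python sketch of these steps (illustrative, not part of the original notes); the (i − 0.5)/n offset is a common adjustment that avoids an infinite z-score for the largest observation:

from statistics import NormalDist

data = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]   # Sosa home runs
xs = sorted(data)
n = len(xs)
zs = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
for z, x in zip(zs, xs):
    print(f"{z:6.2f}  {x}")     # plot x (vertical) against z (horizontal)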

Use of normal quantile plots


◦ If the data are (approximately) normal, the plot will be close
to a straight line.
◦ Systematic deviations from a straight line indicate a nonnormal
distribution.
◦ Outliers appear as points that are far away from the overall
pattern of the plot.
[Figure: normal quantile plots (sample quantiles against theoretical quantiles, −3 to 3) for samples from N(0, 1), Exp(1), and U(0, 1).]

The Normal Distribution, Jan 9, 2004 - 10 -


Density Estimation

The normal density is just one possible density curve. There are
many others, some with compact mathematical formulas and many
without.

Density estimation software fits an arbitrary density to data to give


a smooth summary of the overall pattern.

[Figure: estimated density of the velocity of galaxies (in 1000 km/s), 0 to 40.]

The Normal Distribution, Jan 9, 2004 - 11 -


Histogram

How to scale a histogram?

◦ Easiest way to draw a histogram:
⋄ equally spaced bins
⋄ counts on the vertical axis

[Figure: frequency histogram of the Sosa home runs (counts 0 to 5).]

Disadvantage: Scaling depends on the number of observations and
the bin width.

◦ Scale the histogram so that the area of each bar corresponds to the
proportion of data:

height = counts / (width · total number)

[Figure: density histogram of the Sosa home runs (density 0.00 to 0.04).]

Proportion of data in interval (0, 10]:

height · width = 0.02 · 10 = 0.2 = 20%

Since n = 15 this corresponds to 3 observations.
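
For example, the density heights of all bins can be computed directly (a hypothetical Python sketch, not part of the original notes):

hr = [4, 15, 10, 8, 33, 25, 36, 40, 36, 66, 63, 50, 64, 49, 40]
width, total = 10, len(hr)
for left in range(0, 70, width):
    count = sum(left < x <= left + width for x in hr)
    # height = counts / (width * total number), so that area = proportion of data
    print(f"({left},{left + width}]: count={count}, height={count / (width * total):.3f}")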

The Normal Distribution, Jan 9, 2004 - 12 -


Density curves

[Figure: density histograms of samples of size n = 250, n = 2,500, and n = 250,000 from the same distribution (x from −4 to 4), together with the limiting density curve as n → ∞.]

Proportion of data in (1, 2]:

#{xi : 1 < xi ≤ 2} / n  →  ∫_1^2 f(x) dx   as n → ∞

Probability that a new observation X falls into [a, b]:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx = lim_{n→∞} #{xi : a < xi ≤ b} / n

The Normal Distribution, Jan 9, 2004 - 13 -


Relationships between data

Example: Smoking and mortality


◦ Data from 25 occupational groups
(condensed from data on thousands of individual men)
◦ Smoking (100 = average number of cigarettes per day)
◦ Mortality ratio for deaths from lung cancer
(100 = average ratio for all English men)

Scatter plot of the data:

[Figure: scatter plot of the mortality index (60 to 140) against the smoking index (70 to 130) for the 25 occupational groups.]

In STATA:
. insheet using smoking.txt
. graph twoway scatter mortality smoking

Scatterplots and correlation, Jan 12, 2004 -1-


Relationship between data

Assessing a scatter plot:

◦ What is the overall pattern?


⋄ form of the relationship?
⋄ direction of the relationship?
⋄ strength of the relationship?
◦ Are there any deviations (e.g. outliers) from these patterns?

Direction of relationship/association:

◦ positive association: above-average values of both variables


tend to occur together, and the same for below-average values
◦ negative association: above-average values of one variable
tend to occur with below-average values of the other, and vice
versa.

Strength of relationship/association:

◦ determined by how closely the points follow the overall pattern
◦ difficult to assess by eye, so a numerical measure is helpful

Scatterplots and correlation, Jan 12, 2004 -2-


Correlation

Correlation is a numerical measure of the direction and strength


of the linear relationship between two quantitative variables.

The sample correlation r is defined as


rxy = sxy / √(sx sy),

where

sx = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²,
sy = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)²,
sxy = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)(yi − ȳ).

Properties:
◦ dimensionless quantity
◦ not affected by linear transformations:
for x′i = a xi + b and y′i = c yi + d with a, c > 0,

rx′y′ = rxy

◦ −1 ≤ rxy ≤ 1
◦ rxy = 1 if and only if yi = a xi + b for some a > 0 and b
◦ measures linear association between xi and yi
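
The definition translates directly into code; the following Python sketch (illustrative, not part of the original notes) computes r for the food expenditure data used in the regression example later on:

def correlation(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    sy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    return sxy / (sx * sy) ** 0.5

income = [28, 26, 32, 24, 54, 59, 44, 30, 40, 82, 42, 58, 28, 20, 42, 47, 112, 85, 31, 26]
food = [5.2, 5.1, 5.6, 4.6, 11.3, 8.1, 7.8, 5.8, 5.1, 18.0,
        4.9, 11.8, 5.2, 4.8, 7.9, 6.4, 20.0, 13.7, 5.1, 2.9]
print(correlation(income, food))   # about 0.95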

Scatterplots and correlation, Jan 12, 2004 -3-


Correlation

[Figure: eight scatter plots of y against x illustrating correlations ρ = −0.9, −0.6, −0.3, 0, 0.3, 0.6, 0.9, and 0.99.]

Scatterplots and correlation, Jan 12, 2004 -4-


Introduction to regression

Regression describes how one variable (response) depends on


another variable (explanatory variable).
◦ Response variable: variable of interest, measures the out-
come of a study
◦ Explanatory variable: explains (or even causes) changes in
response variable

Examples:
◦ Hearing difficulties:
response - sound level (decibels), explanatory - age (years)
◦ Real estate market:
response - listing price ($), explanatory - house size (sq. ft.)
◦ Salaries:
response - salary ($), explanatory - experience (years), educa-
tion, sex

Least squares regression, Jan 14, 2004 -1-


Introduction to regression

Example: Food expenditures and income


Data: Sample of 20 households

[Figure: scatter plot of food expenditure against income for the 20 households.]

Questions:
◦ How does food expenditure (Y ) depend on income (X)?
◦ Suppose we know that X = x0, what can we tell about Y ?

Linear regression:
If the response Y depends linearly on the explanatory variable
X, we can use a straight line (regression line) to predict Y
from X.

Least squares regression, Jan 14, 2004 -2-


Least squares regression

How to find the regression line

[Figure: scatter plot of food expenditure against income with the fitted regression line; a zoomed-in panel shows an observed y, the predicted ŷ on the line, and the difference y − ŷ.]

Since we intend to predict Y from X, the errors of interest are


mispredictions of Y for fixed X.

The least squares regression line of Y on X is the line that


minimizes the sum of squared errors.

For observations (x1, y1), . . . , (xn, yn ), the regression line is given


by

Ŷ = a + b X

where

b = r sy/sx and a = ȳ − b x̄

(r correlation coefficient, sx, sy standard deviations, x̄, ȳ means)

Least squares regression, Jan 14, 2004 -3-


Least squares regression

Example: Food expenditure and income


X 28 26 32 24 54 59 44 30 40 82
Y 5.2 5.1 5.6 4.6 11.3 8.1 7.8 5.8 5.1 18.0

X 42 58 28 20 42 47 112 85 31 26
Y 4.9 11.8 5.2 4.8 7.9 6.4 20.0 13.7 5.1 2.9

The summary statistics are:


◦ x̄ = 45.50 ◦ sx = 23.96 ◦ r = 0.946
◦ ȳ = 7.97 ◦ sy = 4.66

The regression coefficients are:


b = r sy/sx = 0.946 · 4.66 / 23.96 = 0.184

a = ȳ − b x̄ = 7.97 − 0.184 · 45.5 = −0.402
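
The same numbers can be reproduced from the summary statistics with a couple of lines of Python (a hypothetical sketch, not part of the original notes):

r, sx, sy = 0.946, 23.96, 4.66     # summary statistics from this slide
xbar, ybar = 45.50, 7.97
b = r * sy / sx                    # slope
a = ybar - b * xbar                # intercept
print(round(b, 3), round(a, 3))    # 0.184  -0.402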

[Figure: scatter plot of food expenditure against income with the fitted regression line.]

Least squares regression, Jan 14, 2004 -4-


Interpreting the regression model

◦ The response in the model is denoted Ŷ to indicate that these


are predicted Y values, not the true Y values. The “hat” de-
notes prediction.
◦ The slope of the line indicates how much Ŷ changes for a unit
change in X.
◦ The intercept is the value of Ŷ for X = 0. It may or may not have
a physical interpretation, depending on whether or not X can
take values near 0.
◦ To make a prediction for an unobserved X, just plug it in and
calculate Ŷ .
◦ Note that the line need not pass through the observed data
points. In fact, it often will not pass through any of them.

Least squares regression, Jan 14, 2004 -5-


Regression and correlation

Correlation analysis:
We are interested in the joint distribution of two (or more)
quantitative variables.

Example: Heights of 1,078 fathers and sons

[Figure: scatter plot of son’s height against father’s height (both 58 to 80 inches) for the 1,078 father-son pairs.]

Points are scattered around the SD line:


◦ (y − ȳ) = (sy/sx)(x − x̄)
◦ goes through the center (x̄, ȳ)
◦ has slope sy/sx
The correlation r measures how much the points spread around
the SD line.

Least squares regression, Jan 14, 2004 -6-


Regression and correlation

Regression analysis:
We are interested how the distribution of one response variable
depends on one (or more) explanatory variables.

Example: Heights of 1,078 fathers and sons


[Figure: scatter plot of son’s height against father’s height, together with histograms of son’s height for fathers of height 64, 68, and 72 inches.]

In each vertical strip, the points are distributed around the regression line.

Least squares regression, Jan 14, 2004 -7-


Properties of least squares regression

◦ The distinction between explanatory and response variables is


essential. Looking at vertical deviations means that changing
the axes would change the regression line.
[Figure: heights scatter plot with the two regression lines ŷ = a + bx and x̂ = a′ + b′y, which are different lines.]

◦ A change of 1 sd in X corresponds to a change of r sds in Y .


◦ The least squares regression line always passes through the
point (x̄, ȳ).
◦ r2 (the square of the correlation) is the fraction of the variation
in the values of y that is explained by the least squares regres-
sion on x.
When reporting the results of a linear regression,
you should report r2.
These properties depend on the least-squares fitting criterion and
are one reason why that criterion is used.

Least squares regression, Jan 14, 2004 -8-


The regression effect

Regression effect
In virtually all test-retest situations, the bottom group on the
first test will on average show some improvement on the sec-
ond test - and the top group will on average fall back. This is
the regression effect. The statistician and geneticist Sir Fran-
cis Galton (1822-1911) called this effect “regression to medi-
ocrity”.

[Figure: scatter plot of son’s height against father’s height illustrating the regression effect.]

Regression fallacy
Thinking that the regression effect must be due to something
important, not just the spread around the line, is the regression
fallacy.

Least squares regression, Jan 14, 2004 -9-


Regression in STATA
. infile food income size using food.txt
. graph twoway scatter food income || lfit food income, legend(off)
> ytitle(food)
. regress food income
Source | SS df MS Number of obs = 20
------------+------------------------------ F( 1, 18) = 151.97
Model | 369.572965 1 369.572965 Prob > F = 0.0000
Residual | 43.7725361 18 2.43180756 R-squared = 0.8941
------------+------------------------------ Adj R-squared = 0.8882
Total | 413.345502 19 21.7550264 Root MSE = 1.5594
---------------------------------------------------------------------------
food | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+--------------------------------------------------------------
income | .1841099 .0149345 12.33 0.000 .1527336 .2154862
_cons | -.4119994 .7637666 -0.54 0.596 -2.016613 1.192615
---------------------------------------------------------------------------

[Figure: scatter plot of food expenditure against income with the fitted line.]

This graph has been generated using the graphical user interface of STATA.
The complete command is:
. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
> (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
> ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
> labsize(medlarge)) xtitle(Income, size(large)) xscale(range(0 120))
> xlabel(0(20)120, labsize(medlarge)) legend(off) ysize(2) xsize(3)

Least squares regression, Jan 14, 2004 - 10 -


Residual plots

Residuals: difference of observed and predicted values

ei = observed y − predicted y
= yi − ŷi
= yi − (a + b xi)

For a least squares regression, the residuals always have mean zero.
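
For instance, a hypothetical Python sketch (not part of the original notes) computing the residuals of the food expenditure fit for the first few households:

a, b = -0.402, 0.184                      # coefficients from the earlier slide
income = [28, 26, 32, 24, 54]
food = [5.2, 5.1, 5.6, 4.6, 11.3]
residuals = [y - (a + b * x) for x, y in zip(income, food)]
print([round(e, 2) for e in residuals])   # these would be plotted against income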

Residual plot
A residual plot is a scatterplot of the residuals against the
explanatory variable. It is a diagnostic tool to assess the fit of
the regression line.
Patterns to look for:
◦ Curvature indicates that the relationship is not linear.
◦ Increasing or decreasing spread indicates that the prediction
will be less accurate in the range of explanatory variables where
the spread is larger.
◦ Points with large residuals are outliers in the vertical direc-
tion.
◦ Points that are extreme in the x direction are potential high
influence points.
Influential observations are individuals with extreme x values
that exert a strong influence on the position of the regression line.
Removing them would significantly change the regression line.

Least squares regression, Jan 14, 2004 - 11 -


Regression Diagnostics

Example: First data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

The residuals are evenly scattered, with no systematic pattern.

Least squares regression, Jan 14, 2004 - 12 -


Regression Diagnostics

Example: Second data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

The residual plots show curvature: the functional relationship is something other than linear.

Least squares regression, Jan 14, 2004 - 13 -


Regression Diagnostics

Example: Third data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

An outlier: the regression line misfits the majority of the data.

Least squares regression, Jan 14, 2004 - 14 -


Regression Diagnostics

Example: Fourth data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

Heteroscedasticity: the spread of the residuals is not constant.

Least squares regression, Jan 14, 2004 - 15 -


Regression Diagnostics

Example: Fifth data set

[Figure: scatter plot of Y against X with the fitted line; residual plots against the fitted values and against X.]

One point lies far out in the x direction and is highly influential.

Least squares regression, Jan 14, 2004 - 16 -


The Question of Causation

Example: Are babies brought by the stork?


◦ Data from 54 countries
◦ Variables:
⋄ Birth rate (newborns per 1000 women)
⋄ Number of storks (per 1000 women)

[Figure: scatter plot of the birth rate against the number of storks (per 1000 women) for the 54 countries.]

Model: Birth rate (Y) is proportional to the number of storks (X)

Y = bX + ε

Least squares regression yields for the slope of the regression line

b̂ = 4.3 ± 0.2.

Can we conclude that babies are brought by the stork?

Causation, Jan 16, 2004 -1-


The Question of Causation

A more serious example:


Variables:
◦ Income Y - response
◦ level of education X - explanatory variable
There is a positive association between income and education.
Question: Does better education increase income?

[Diagram: (a) X → Y, a direct causal effect; (b), (c) X and Y both related to a third variable Z (confounding).]

Possible alternative explanation: Confounding


◦ People from prosperous homes are likely to receive many years of edu-
cation and are more likely to have high earnings.
◦ Education and income might both be affected by personal attributes
such as self assurance. On the other hand the level of education could
have an impact on e.g. self assurance. The effects of education and self
assurance can not be separated.

Confounding:
Response and explanatory variable both depend on a third
(hidden) variable.

Causation, Jan 16, 2004 -2-


Establishing Causal Relationships

Controlled experiments:
A cause-effect relationship between two variables X and Y can be
established by conducting an experiment where
◦ the values of X are manipulated and
◦ the effect on Y is observed.

Problem: Often such experiments are not possible.

If we cannot establish a causal relationship by a controlled experi-


ment, we can still collect evidence from observational studies:
◦ The association is strong.
◦ The association is consistent across multiple studies.
◦ Higher doses are associated with stronger responses.
◦ The alleged cause precedes the effect in time.
◦ The alleged cause is plausible.

Example: Smoking and lung cancer

Causation, Jan 16, 2004 -3-


Caution about Causation

Association is not causation


Two variables may be correlated because both are affected
by some other (measured or unmeasured) variable.
Unmeasured confounding variables can influence the in-
terpretation of relationships among the measured vari-
ables. They
◦ may suggest a relationship where there is none or
◦ may mask a real relationship.
No causation in - no causation out
Causation is - unlike association - not a statistical concept.
For inference on cause-effect relationships, we need some
knowledge about the causal relationships between the vari-
ables in the study.
Randomized experiments guarantee the absence of any
confounding variables. Any relationship between the ma-
nipulated variable and the response must be due to a
cause-effect relationship.

Causation, Jan 16, 2004 -4-


Experiments and Observational Studies

Two major types of statistical studies


◦ Observational study - observes individuals/objects and mea-
sures variables of interest but does not attempt to interfere with
the natural process.
◦ Designed experiment - deliberately imposes some treatment
on individuals to observe their responses.
Remarks:
◦ Sample surveys are an example of an observational study.
◦ In economics, most studies are observational.
◦ Clinical studies are often designed experiments.
◦ Designed experiments allow statements about causal relation-
ship between treatment and response.
◦ Observational studies have no control over variables. Thus the
effect of the explanatory variable on the response variable might
be confounded (mixed up) with the effect of some other vari-
ables. Such variables are called confounders and are a major source
of bias.

Experiments and Observational Studies, Jan 16, 2004 -5-


Designed Experiments

• In controlled experiments, the subjects are assigned to one of


two groups,
◦ treatment group and
◦ control group (which does not receive treatment).
• A controlled experiment is randomized if the subjects are ran-
domly assigned to one of the two groups.
• One precaution in designed experiments is the use of a placebo,
which is made of a completely neutral substance. The sub-
jects do not know whether they receive the treatment or a
placebo; any difference in the response thus cannot be attri-
buted to psychological and psychosomatic effects.
• In a double blind experiment, neither the subjects nor the
treatment administrators know who is assigned to the two
groups.

Example: The Salk polio vaccine field trial


◦ Randomized controlled double-blind experiment in 11 states
◦ 200,000 children in treatment group
◦ 200,000 children in control group treated with placebo
The difference between the responses of the two groups shows that
the vaccine reduces the risk of polio infection.

Experiments and Observational Studies, Jan 16, 2004 -6-


Confounding

Confounding means a difference between the treatment and con-


trol groups—other than the treatment—which affects the responses
being studied. A confounder is a third variable. associated with
exposure and with disease.

Example: Lanarkshire Milk Experiment


The purpose of the experiment was to study the effect of pasteur-
ized milk on the health of children.
◦ The subjects of the experiment were school children.
◦ The children in the treatment group got a daily portion of pas-
teurized milk.
◦ The children in the control group did not receive any extra milk.
◦ The teachers assigned poorer children to the treatment group so
that they got extra milk.
The effect of pasteurized milk on the health of children is con-
founded with the effect of wealth: Poorer children are more exposed
to diseases.

Experiments and Observational Studies, Jan 16, 2004 -7-


Observational Studies

Confounding is a major problem in observational studies.

Association is NOT Causation

Example: Does smoking cause cancer?


• Designed experiment not possible (cannot make people
smoke).
• Observation: Smokers have higher cancer rates
• Tobacco industry: There might be a gene which
◦ makes people smoke and
◦ causes cancer
In that case stopping smoking would not prevent cancer since
it is caused by the gene. The observed high association could
be attributed to the confounding effect of such a gene.
• However: Studies with identical twins—one smoker and one
nonsmoker—put serious doubt on the gene theory.

Experiments and Observational Studies, Jan 16, 2004 -8-


Example

Do screening programs speed up detection of breast cancer?


◦ Large-scale trial run by the Health Insurance Plan of Greater
New York, starting in 1963
◦ 62,000 women age 40 to 64 (all members of the plan)
◦ Randomly assigned to two equal groups
◦ Treatment group:
⋄ women were encouraged to come in for annual screening
⋄ 20,200 women did come in for screening
⋄ 10,800 refused.
◦ Control group:
⋄ was offered usual health care
◦ All the women were followed for many years.

Epidemiologists who worked on the study found that


◦ screening had little impact on diseases other than breast cancer;
◦ poorer women were less likely to accept screening than richer
ones; and
◦ most diseases fall more heavily on the poor than the rich.

Experiments and Observational Studies, Jan 16, 2004 -9-


Example

Deaths in the first five years of the screening trial, by cause. Rates per
1,000 women.

Cause of Death
Breast cancer All other
Number of persons Number Rates Number Rates
Treatment group 31,000 39 1.3 837 27
Examined 20,200 23 1.1 428 21
Refused 10,800 16 1.5 409 38
Control group 31,000 63 2.0 879 28

Questions:
◦ Does screening save lives?
◦ Why is the death rate from all other causes in the whole treatment
group (“examined” and “refused” combined) about the same as the
rate in the control group?
◦ Why is the death rate from all other causes higher for the “refused”
group than the “examined” group?
◦ Breast cancer (like polio, but unlike most other diseases) affects the
rich more than the poor. Which numbers in the table confirm this
association between breast cancer and income?
◦ The death rate (from all causes) among women who accepted screening
is about half the death rate among women who refused. Did screening
cut the death rate in half? If not, what explains the difference in death
rates?
◦ To show that screening reduces the risk from breast cancer, someone
wants to compare 1.1 and 1.5. Is this a good comparison? Is it biased
against screening? For screening?

Experiments and Observational Studies, Jan 16, 2004 - 10 -


Survey Sampling

Situation:
Population of N individuals (or items)
e.g. ◦ students at this university
◦ light bulbs produced by a company on one day

Seek information about population


e.g. ◦ amount of money students spent on books this quarter
◦ percentage of students who bought more than 10 books
in this quarter
◦ lifetime of light bulbs

Full data collection is often not possible because it is e.g.


◦ too expensive
◦ too time consuming
◦ not sensible (e.g. testing every produced light bulb for its lifetime)

Statistical approach:
◦ collect information from part of the population (sample)
◦ use information on sample to draw conclusions on whole pop-
ulation
Questions:
◦ How to choose a sample?
◦ What conclusions can be drawn?

Survey Sampling, Jan 19, 2004 -1-


Survey Sampling

Objective of a sample survey:


Gather information on some variable for population of N individ-
uals:

x̃i value of interest for ith individual


x̃1, . . . , x̃N values for population

Sample of size n:

x1 , . . . , xn values obtained from sampling

Parameter - number that describes the population, e.g.


µpop = (1/N) Σ_{j=1}^N x̃j        population mean

σ²pop = (1/N) Σ_{j=1}^N (x̃j − µpop)²        population variance

Estimate the population parameters from the sampled values:

µ̂pop = x̄ = (1/n) Σ_{i=1}^n xi        sample mean

σ̂²pop = s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²        sample variance

A function of the sample x1, . . . , xn is called a statistic.

Survey Sampling, Jan 19, 2004 -2-


Sampling Distribution

Suppose we are interested in the amount of money students at this


university have spent on books this quarter.

Idea: Ask 20 students about the amount they have spent and take
the average.
The value we obtain will vary from sample to sample, that is, if we
asked another 20 students we would get a different answer.

Sampling distribution
The sampling distribution of a statistic is the distribution of
all values taken by the statistic if evaluated for all possible
samples of size n taken from the same population.

In our example, the sampling distribution of the average amount


obtained from the sample depends on the way we choose the sample
from the population:
◦ Ask 20 students in this class.
◦ Ask 20 students in your department.
◦ Ask 20 students in the University bookshop.
◦ Select randomly 20 students from the register of the university.
The design of a sample refers to the method used to choose the
sample from the population.

Survey Sampling, Jan 19, 2004 -3-


Sampling Distribution

Example:
Consider a population of 20 students who spent the following
amounts on books:
x̃1 x̃2 x̃3 x̃4 x̃5 x̃6 x̃7 x̃8 x̃9 x̃10 x̃11 x̃12 x̃13 x̃14 x̃15
100 120 150 180 200 220 220 240 260 280 290 300 310 350 400

[Figure: sampling distributions (frequency in %, against x̄ from 0 to 400) of the sample mean x̄ = (1/n) Σ_{i=1}^n xi for sample sizes (a) n = 2 (σ = 55.42), (b) n = 3 (σ = 43.38), and (c) n = 4 (σ = 35.97).]
Survey Sampling, Jan 19, 2004 -4-


Bias

Example:
Suppose we are interested in the amount of money students at this
university have spent on books last quarter.

Sample: 20 students in the University bookshop


Do we get a good estimate for the average amount spent on books
last quarter by UofC students?
◦ Students who buy more books and spend more money on books
are more likely to be found in bookshops than students who buy
fewer books.
◦ The sample mean might overestimate the true amount spent
on books.
◦ The sample is not representative for the population of all stu-
dents.
Careful: A poor sample design can produce misleading conclu-
sions.

The design of a study is biased if it systematically favors some


parts of the population over others.
A statistic is unbiased if the mean of its sampling distribution
is equal to the parameter being estimated. Otherwise we say the
statistic is biased.

Survey Sampling, Jan 19, 2004 -5-


Bias

Examples: Biased Sampling


◦ Midway Airlines Ads in the New York Times and the Wall Street Jour-
nal stated that “84 percent of frequent business travelers to Chicago
prefer Midway Metrolink to American, United, and TWA.”
The survey was “conducted among Midway Metrolink passengers be-
tween New York and Chicago.”
◦ A 1992 Roper poll asked “Does it seem possible or does it seem im-
possible to you that the Nazi extermination of Jews never happened?”
22% of the American respondents said “seems possible.”
A reworded 1994 poll asked “Does it seem possible to you that the Nazi
extermination of Jews never happened, or do you feel certain that it
happened?” This time only 1% of the respondents said it was “possible
it never happened.”
◦ ABC network program Nightline once asked whether the United Na-
tions should continue to have its headquarters in the United States.
More than 186,000 callers responded, and 67% said “No.”
A properly designed sample survey showed that 72% of adults want the
UN to stay.
◦ A call-in poll conducted by USA Today concluded that Americans love
Donald Trump.
USA Today later reported that 5,640 of the 7,800 calls for the poll came
from the offices owned by one man, Cincinnati financier Carl Lindner.

Survey Sampling, Jan 19, 2004 -6-


Caution about Sample Surveys

• Undercoverage
◦ occurs when some groups in the population are left out of
the process of choosing the sample
◦ can happen when there is no accurate list of the population
◦ results in bias if this group differs from the rest of the
population
• Nonresponse
◦ occurs when a chosen individual cannot be contacted or
does not cooperate
◦ results in bias if this group differs from the rest of the
population
• Response bias
◦ subjects may not want to admit illegal or unpopular be-
haviour
◦ subjects may be affected by the interviewer’s appearance or
tone
◦ subjects may not remember correctly
• Question wording
◦ confusing or leading questions can introduce strong bias
◦ do not trust sample survey results unless you have read the
exact questions posed

Survey Sampling, Jan 19, 2004 -7-


Simple Random Sampling

A simple random sample (SRS) of size n consists of n indi-


viduals chosen from the population in such a way that every set of
n individuals is equally likely to be selected.
◦ Every possible sample has an equal chance of being selected.
◦ Every individual has an equal chance of being selected.
◦ Random selection eliminates bias in sampling.

SRS or Not?
Is each of the following samples an SRS or not?
◦ A deck of cards is shuffled, and the top five dealt.
◦ A sample of Illinois residents is drawn by choosing all the resi-
dents in each of 100 census blocks (in such a way that each set
of 100 blocks is equally likely to be chosen)
◦ A telephone survey is conducted by dialing telephone numbers
at random (i.e. each valid phone number is equally likely).
◦ A sample of 10% of all students at the University of Chicago is
chosen by numbering the students 1, . . . , N , drawing a random
integer i from 1 to 10, and drawing every tenth student begin-
ning with i.
(E.g. if i = 5, students 5, 15, 25, . . . are chosen.)

Survey Sampling, Jan 19, 2004 -8-


Stratified Sampling

Example:
◦ Population: Students at this university
◦ Objective: Amount of money spent on books this quarter
◦ Knowledge: Students in e.g. humanities spend more money on
books

Use knowledge to build sample:


◦ divide the population into groups of similar individuals, called strata
◦ choose a simple random sample within each group
◦ make the sample size in each group, e.g., proportional to the size of the group

Can reduce variability of estimate significantly.

Survey Sampling, Jan 19, 2004 -9-


Summary

◦ A number which describes a population is a parameter.


◦ A number computed from the data is a statistic.
◦ Use statistics to make inferences about unknown population
parameters.
◦ A Simple random sample (SRS) of size n consists of n in-
dividuals from the population sampled without replacement,
that is, every set of n individuals has an equal chance to be the
sample actually selected.
◦ A statistic from a random sample has a sampling distribution
that describes how the statistic varies in repeated data produc-
tion.
◦ A statistic as an estimator of a parameter may suffer from bias
or from high variability. Bias means that the mean of the
sampling distribution is not equal to the true value of the pa-
rameter. The variability of the statistic is described by the
spread of its sampling distribution.

Survey Sampling, Jan 19, 2004 - 10 -


First Step Towards Probability

Experiment:
Toss a die and observe the number on the face up.

What is the chance


◦ of getting a six?
Event of interest: 6
All possible events: 1 2 3 4 5 6
⇒ 1/6 (one out of six)
◦ of getting an even number?
Event of interest: 2 4 6
All possible events: 1 2 3 4 5 6
⇒ 1/2 (three out of six)

The classical probability concept:


If there are N equally likely possibilities, of which one must occur
and s are regarded as favorable, or as a “success”, then the probability
of a “success” is

s/N.

Counting, Jan 21, 2003 -1-


First Step Towards Probability

Example:
Suppose that of 100 applicants for a job 50 were women and 50
were men, all equally qualified. Further suppose that the company
hired 2 women and 8 men.

How likely is this outcome under the assumption that


the company does not discriminate?

How many ways are there to choose


◦ 10 out of 100 applicants? ( ⇒ N )
◦ 2 out of 50 female applicants and 8 out of 50 male applicants?
( ⇒ s)

To compute such probabilities we need a way to count the num-


ber of possibilities (favorable and total).

Counting, Jan 21, 2003 -2-


The Multiplicative Rule

Suppose you have k choices to make, with N1, . . . , Nk possibilities,
respectively. Then the total number of possibilities is the product

N1 · · · Nk .

Sampling in order with replacement

If you sample n times in order with replacement from a set of N


elements, then the total number of possible sequences (x1, . . . , xn)
is Nⁿ.

Example:
If you toss a die 5 times, the number of possible results is 6⁵ = 7776.

Sampling in order without replacement

If you sample n times in order without replacement from a set of N


elements, then the total number of possible sequences (x1, . . . , xn)
is
N(N − 1) · · · (N − n + 1) = N!/(N − n)!.

Example:
If you select 5 cards in order from a card deck of 64, the number
of possible results is 64 · 63 · 62 · 61 · 60 = 914, 941, 440.
Counting, Jan 21, 2003 -3-
Permutations and Combinations

Example:
If you select 5 cards from a card deck of 64, you are typically only
interested in the cards you have, not in the order in which you
received them. How many different combinations of 5 cards out
of 64 are there?

To answer this question we first address the question of how many


different sequences of the same 5 cards exist.

Permutation:
Let (x1, . . . , xn) be a sequence. A permutation of this sequence is
any rearrangement of the elements without losing or adding any
elements, that is, any new sequence

(xi1 , . . . , xin )

with permuted indices {i1, . . . , in } = {1, . . . , n}. The trivial per-


mutation does not change the order, i.e. ij = j.

How many permutations of n distinct elements are there? The


multiplicative rule yields

n · (n − 1) · · · 1 = n!.

Example (contd):
The number of different sequences of 5 fixed cards is 5! = 5 · 4 · 3 ·
2 · 1 = 120.

Counting, Jan 21, 2003 -4-


Permutations and Combinations

How many different combinations of n elements chosen from


N distinct elements are there?

Recall that
◦ The number of different sequences of length n that can be chosen
from N distinct elements is

N!/(N − n)!.

◦ The number of permutations of any sequence of length n is n!.

Thus the number of combinations of n elements chosen from N


distinct elements is
   
N! / (n! (N − n)!) = (N choose n) = (N choose N − n).

(N choose n) is referred to as a binomial coefficient.

Since two permuted (ordered) sequences (x1, . . . , xn) lead to the same (un-
ordered) combination {x1, . . . , xn} we divide the number of ordered se-
quences by the number of permutations.

Counting, Jan 21, 2003 -5-


Examples

Example:
If you select 5 cards from a card deck of 64, you are typically only
interested in the cards you have, not in the order in which you
received them. How many different combinations of 5 cards out
of 64 are there?

The answer is

(64 choose 5) = (64 · 63 · 62 · 61 · 60)/(5 · 4 · 3 · 2 · 1) = 914,941,440/120 = 7,624,512.
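
These counts are easy to verify in Python (a hypothetical sketch, not part of the original notes):

import math

print(math.perm(64, 5))     # 914,941,440 ordered sequences of 5 cards from 64
print(math.factorial(5))    # 120 orderings of any fixed set of 5 cards
print(math.comb(64, 5))     # 7,624,512 distinct 5-card combinations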

Example:
Recall the example with the 100 applicants for a job. The number
of ways to choose

◦ 2 women out of 50 is (50 choose 2).
◦ 8 men out of 50 is (50 choose 8).
◦ 10 applicants out of 100 is (100 choose 10).

Thus the chance of this event is

(50 choose 2)(50 choose 8) / (100 choose 10) ≈ 0.038

Moreover, the chance of this or a more extreme event (only one or


no woman is hired) is 0.046.
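
As a check (a hypothetical Python sketch, not part of the original notes), the same probabilities follow from math.comb:

import math

total = math.comb(100, 10)
p2 = math.comb(50, 2) * math.comb(50, 8) / total           # exactly 2 women hired
p_tail = sum(math.comb(50, k) * math.comb(50, 10 - k) / total for k in range(3))
print(round(p2, 3), round(p_tail, 3))                      # about 0.038 and 0.046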

Counting, Jan 21, 2003 -6-


Summary

The number of ways to sample n elements from a set of N distinct
elements, with or without replacement and in order or without order,
is summarized in the following table:

Sampling                 in order          without order
without replacement      N!/(N − n)!       (N choose n)
with replacement         Nⁿ                (N + n − 1 choose n)

Counting, Jan 21, 2003 -7-


Introduction to Probability

Classical Concept:
◦ requires finitely many and equally likely outcomes
◦ probability of event defined as number of favorable outcomes
(s) divided by number of total outcomes (N):
Probability of event = s/N
◦ can be determined by counting outcomes

In many practical situations the different outcomes are not equally


likely:
◦ Success of treatment
◦ Chance to die of a heart attack
◦ Chance of snowfall tomorrow
It is not immediately clear how to measure chance in each of these
cases.

Three Concepts of Probability


◦ Frequency interpretation
◦ Subjective probabilities
◦ Mathematical probability concept

Elements of Probability, Jan 23, 2003 -1-


The Frequentist Approach

In the long run, we are all dead.


John Maynard Keynes (1883-1946)

The Frequency Interpretation of Probability


The probability of an event is the proportion of time that events
of the same kind (repeated independently and under the same
conditions) will occur in the long run.

Example:
Suppose we collect data on the weather in Chicago on Jan 21 and
we note that in the past 124 years it snowed in 34 years on Jan 21,
that is, 34/124 · 100% ≈ 27.4% of the time.
Thus we would estimate the probability of snowfall on Jan 21 in
Chicago as 0.274.

The frequency interpretation of probability is based on the follow-


ing theorem:

The Law of Large Numbers


If a situation, trial, or experiment is repeated again and again, the
proportion of successes will converge to the probability of any one
outcome being a success.

Elements of Probability, Jan 23, 2003 -2-


The Frequentist Approach
[Figure: relative frequency of heads in repeated coin tosses, shown for
tosses 1-1,000, 1,000-100,000, and 100,000-1,000,000. The relative frequency
settles down near 0.5 as the number of tosses grows.]
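The behaviour in the figure can be reproduced with a short simulation. The sketch below is an added Python illustration assuming a fair coin; it prints the relative frequency of heads at a few checkpoints.

    import random

    random.seed(1)                      # reproducible example
    heads, checkpoints = 0, {100, 1_000, 10_000, 100_000}
    for n in range(1, 100_001):
        heads += random.random() < 0.5  # one toss of a fair coin
        if n in checkpoints:
            print(n, heads / n)         # relative frequency settles near 0.5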

Elements of Probability, Jan 23, 2003 -3-


The Subjectivist (Bayesian) Approach

Not all events are repeatable:


◦ Will it snow tomorrow?
◦ Will Mr Jones, 42, live to 65?
◦ Will the Dow Jones rise tomorrow?
◦ Does Iraq have weapons of mass destruction?

To all these questions the answer is either “yes” or “no”, but we


are uncertain about the right answer.

Need to quantify our uncertainty about an event A:

Game with two players:


◦ 1st player determines p such that he will “win” $c · (1 − p) if
event A occurs and otherwise he will “lose” $c · p.
◦ 2nd player chooses c which can be positive or negative.

The Bayesian interpretation of probability is that probability


measures the personal (subjective) uncertainty of an event.

Example: Weather forecast


Meteorologist says that the probability of snowfall tomorrow is
90%.
He should be willing to bet $90 against $10 that it snows tomorrow
and $10 against $90 that it does not snow.

Elements of Probability, Jan 23, 2003 -4-


The Elements of Probability

A (statistical) experiment is a process of observation or mea-


surement. For a mathematical treatment we need:

Sample Space S - set of possible outcomes


Example: An urn contains five balls, numbered from 1 through
5. We choose two at random and at the same time. What is the
sample space?

S = { {1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5},
      {3, 4}, {3, 5}, {4, 5} }.

Events A ⊆ S - an event is a subset of the sample space S


Example: In the example above the event A that two balls with
odd numbers are chosen is

A = { {1, 3}, {1, 5}, {3, 5} }.

Probability Function P - assigns each A a value in [0, 1]


Example: Assuming that all events are equally likely we obtain

P(A) = 3/10.

Elements of Probability, Jan 23, 2003 -5-


The Elements of Probability

Why not assign probabilities to outcomes?


Example: Spinner labeled from 0 to 1.
◦ Suppose that all outcomes s ∈ S = [0, 1) are equally likely.
◦ Assign probabilities uniformly on S.
◦ P({s}) = c > 0 ⇒ P(S) = ∞
◦ P({s}) = 0 ⇒ P(S) = 0

Solution: Assign to each subset of S a probability equal to the


“length” of that subset:
◦ Probability that the spinner lands in [0, 1/4) is 1/4.
◦ Probability that the spinner lands in [1/2, 3/4) is 1/4.
◦ Probability that the spinner lands on 1/2 is 0.
In integral notation we have
P(spinner lands in [a, b]) = ∫_a^b dx = b − a.

Remark:
Strictly speaking, we can define the above probability only on a collection A of subsets A ⊆ S;
this collection, however, contains all subsets that are important and relevant for this class.
In the case of finite or countably infinite sample spaces S there are no such exceptions
and A covers all subsets of S.

Elements of Probability, Jan 23, 2003 -6-


A Set Theory Primer

A set is “a collection of definite, well distinguished objects of our perception


or of our thought”. (Georg Cantor, 1845-1918)

Some important sets:


◦ N = {1, 2, 3, . . .}, the set of natural numbers
◦ Z = {. . . , −2, −1, 0, 1, 2, . . .}, the set of integers
◦ R = (−∞, ∞), the set of real numbers
Intervals are denoted as follows:
[0, 1] the interval from 0 to 1 including 0 and 1
[0, 1) the interval from 0 to 1 including 0 but not 1
(0, 1) the interval from 0 to 1 not including 0 and 1

If a is an element of the set A then we write a ∈ A.


If a is not an element of the set A then we write a ∉ A.
Suppose that A and B are subsets of S (denoted as A, B ⊆ S).

The empty set is denoted by ∅ (Note: ∅ ⊆ A for all subsets A of S).


Difference of A and B (A\B): Set of all elements in A which are not in B.
Intersection of A and B (A ∩ B): Set of all elements in S which are both
in A and in B.
Union of A and B (A ∪ B): Set of all elements in S that are in A or in B.
Complement of A (A∁ or A′): Set of all elements in S that are not in A.
Note that A ∩ A∁ = ∅ and A ∪ A∁ = S
A and B are disjoint if A and B have no common elements, that is A∩B =
∅. Two events A and B with this property are said to be mutually
exclusive.

Elements of Probability, Jan 23, 2003 -7-


The Postulates of Probability

A probability on a sample space S (and a set A of events) is a


function which assigns each subset A a value in [0, 1] and satisfies
the following rules:

Axiom 1: All probabilities are nonnegative:

P(A) ≥ 0 for all events A.

Axiom 2: The probability of the whole sample space is 1:

P(S) = 1.
Axiom 3 (Addition Rule): If two events A and B are mutu-
ally exclusive then

P(A ∪ B) = P(A) + P(B),


that is the probability that one or the other occurs is the sum
of their probabilities.
More generally, if countably many events Ai, i ∈ N are mutu-
ally exclusive (i.e. Ai ∩ Aj = ∅ whenever i ≠ j) then

P( ⋃_{i=1}^∞ Ai ) = Σ_{i=1}^∞ P(Ai).

Elements of Probability, Jan 23, 2003 -8-


The Postulates of Probability

Classical Concept of Probability

The probability of an event A is defined as

P(A) = #A / #S,

where #A denotes the number of elements (outcomes) in A.


It satisfies
◦ P(A) ≥ 0
◦ P(S) = #S/#S = 1
◦ If A and B mutually exclusive then

P(A ∪ B) = #(A ∪ B) / #S = #A/#S + #B/#S = P(A) + P(B).

Elements of Probability, Jan 23, 2003 -9-


The Postulates of Probability

Frequency Interpretation of Probability

The probability of an event A is defined as


P(A) = lim_{n→∞} n(A) / n,

where n(A) is the number of times event A occurred in n repeti-


tions.
It satisfies
◦ P(A) ≥ 0
◦ P(S) = lim_{n→∞} n/n = 1
◦ If A and B mutually exclusive then n(A ∪ B) = n(A) + n(B).
Hence
P(A ∪ B) = lim_{n→∞} n(A ∪ B)/n
         = lim_{n→∞} ( n(A)/n + n(B)/n )
         = lim_{n→∞} n(A)/n + lim_{n→∞} n(B)/n = P(A) + P(B).

Elements of Probability, Jan 23, 2003 - 10 -


The Postulates of Probability

Example: Toss of one die


The events A = {1} and B = {4, 5} are mutually exclusive.
Since all outcomes are equiprobable we obtain

P(A) = 1/6   and   P(B) = 2/6 = 1/3.
The addition rule yields

P(A ∪ B) = 1/6 + 1/3 = 3/6 = 1/2.
On the other hand we get for C = A ∪ B = {1, 4, 5}

P(C) = 3/6 = 1/2.

The first two axioms can be summarized by the

Cardinal Rule: For any subset A of S

0 ≤ P(A) ≤ 1.

In particular
◦ P(∅) = 0
◦ P(S) = 1

Elements of Probability, Jan 23, 2003 - 11 -


The Calculus of Probability

Let A and B be events in a sample space S.

Partition rule:
P(A) = P(A ∩ B) + P(A ∩ B ∁)
Example: Roll a pair of fair dice

P(Total of 10)
= P(Total of 10 and double) + P(Total of 10 and no double)
= 1/36 + 2/36 = 3/36 = 1/12

Complementation rule:
P(A∁) = 1 − P(A)
Example: Often useful for events of the type “at least one”:

P(At least one even number)


= 1 − P(No even number) = 1 − 9/36 = 3/4

Containment rule
P(A) ≤ P(B) for all A ⊆ B
Example: Compare two aces with doubles,

1/36 = P(Two aces) ≤ P(Doubles) = 6/36 = 1/6

Calculus of Probability, Jan 26, 2003 -1-


The Calculus of Probability

Inclusion and exclusion formula


P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Example: Roll a pair of fair dice

P(Total of 10 or double)
= P(Total of 10) + P(Double) − P(Total of 10 and double)
= 3/36 + 6/36 − 1/36 = 8/36 = 2/9

The two events are

Total of 10 = { 46, 55, 64}


and

Double = { 11,22,33,44,55,66}
The intersection is

Total of 10 and double = { 55}.


Adding the probabilities for the two events, the probability for the
event 55 is added twice.

Calculus of Probability, Jan 26, 2003 -2-


Conditional Probability

Probability gives chances for events in sample space S.


Often: Have partial information about event of interest.

Example: Number of Deaths in the U.S. in 1996


Cause All ages 1-4 5-14 15-24 25-44 45-64 ≥ 65
Heart 733,125 207 341 920 16,261 102,510 612,886
Cancer 544,161 440 1,035 1,642 22,147 132,805 386,092
HIV 32,003 149 174 420 22,795 8,443 22
Accidents1 92,998 2,155 3,521 13,872 26,554 16,332 30,564
Homicide2 24,486 395 513 6,548 9,261 7,717 52
All causes 2,171,935 5,947 8,465 32,699 148,904 380,396 1,717,218
1 Accidents and adverse effects;  2 Homicide and legal intervention

measure probability with respect to a subset of S

Conditional probability of A given B

P(A|B) = P(A ∩ B) / P(B),   if P(B) > 0

If P(B) = 0 then P(A|B) is undefined.


Conditional probabilities for causes of death:
◦ P(accident) = 0.04282
◦ P(age=10) = 0.00390
◦ P(accident|age=10) = 0.42423
◦ P(accident|age=40) = 0.17832

Calculus of Probability, Jan 26, 2003 -3-


Conditional Probability

Example: Select two cards from 32 cards


◦ What is the probability that the second card is an ace?

P(2nd card is an ace) = 1/8


◦ What is the probability that the second card is an ace if the
first was an ace?

P(2nd card is an ace|1st card was an ace) = 3/31

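Both probabilities can be checked by enumerating ordered pairs of distinct cards from the 32-card deck. The Python sketch below is my own illustration; labelling the four aces as cards 0-3 is an arbitrary coding, not from the notes.

    from itertools import permutations

    deck = range(32)                      # cards 0..31; let 0, 1, 2, 3 be the aces
    pairs = list(permutations(deck, 2))   # ordered draws of two distinct cards

    p_second_ace = sum(second < 4 for first, second in pairs) / len(pairs)
    first_ace = [(f, s) for f, s in pairs if f < 4]
    p_second_given_first = sum(s < 4 for f, s in first_ace) / len(first_ace)

    print(p_second_ace, 1/8)              # both equal 0.125
    print(p_second_given_first, 3/31)     # both equal 0.0967...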
Calculus of Probability, Jan 26, 2003 -4-


Multiplication rules

Example: Death Rates (per 100,000 people)

All Ages 1-4 5-14 15-24 25-44 45-64 ≥ 65


872.5 38.3 22.0 90.3 177.8 708.0 5071.4

Can we combine these rates with the table on causes of death?


◦ What is the probability to die from an accident (HIV)?
◦ What is the probability to die from an accident at age 10 (40)?

Know P(accident|die) = P(die from accident)/P(die)


⇒ P(die from accident) = P(accident|die)P(die)

Calculate probabilities:
◦ P(die from accident) = 0.04281 · 0.00873 = 0.00037
◦ P(die from accident|age = 10) = 0.42423 · 0.00090 = 0.00038
◦ P(die from accident|age = 40) = 0.17832 · 0.00178 = 0.00031
◦ P(die from HIV) = 0.01473 · 0.00873 = 0.00013
◦ P(die from HIV|age = 10) = 0.02055 · 0.00090 = 0.00002
◦ P(die from HIV|age = 40) = 0.15308 · 0.00178 = 0.00027
General multiplication rule

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

Calculus of Probability, Jan 26, 2003 -5-


Independence

Example: Roll two dice


◦ What is the probability that the second die shows 1?

  P(2nd die = 1) = 1/6
◦ What is the probability that the second die shows 1 if the first
  die already shows 1?

  P(2nd die = 1|1st die = 1) = 1/6
◦ What is the probability that the second die shows 1 if the first
  does not show 1?

  P(2nd die = 1|1st die ≠ 1) = 1/6


The chances of getting 1 with the second die are the same, no
matter what the first die shows. Such events are called indepen-
dent:

The event A is independent of the event B if its chances are


not affected by the occurrence of B,

P(A|B) = P(A).
Equivalently, A and B are independent if

P(A ∩ B) = P(A)P(B)
Otherwise we say A and B are dependent.

Calculus of Probability, Jan 26, 2003 -6-


Let’s Make a Deal

The Rules:
◦ Three doors - one prize, two blanks
◦ Candidate selects one door
◦ Showmaster reveals one losing door
◦ Candidate may switch doors

1 2 3

Would YOU change?

Can probability theory help you?


◦ What is the probability of winning if candidate switches doors?
◦ What is the probability of winning if candidate does not switch
doors?

Calculus of Probability, Jan 26, 2003 -7-


The Rule of Total Probability

Events of interest:
◦ A - choose winning door at the beginning
◦ W - win the prize

Strategy: Switch doors (S)


Know: ◦ PS(W|A) = 0      ◦ PS(A) = 1/3
      ◦ PS(W|A∁) = 1     ◦ PS(A∁) = 2/3
Probability of interest: PS(W):

PS(W) = PS(W ∩ A) + PS(W ∩ A∁)
      = PS(W|A)PS(A) + PS(W|A∁)PS(A∁)
      = 0 · 1/3 + 1 · 2/3 = 2/3

Strategy: Do not switch doors (N )


Know: ◦ PN(W|A) = 1      ◦ PN(A) = 1/3
      ◦ PN(W|A∁) = 0     ◦ PN(A∁) = 2/3
Probability of interest: PN(W):

PN(W) = PN(W ∩ A) + PN(W ∩ A∁)
      = PN(W|A)PN(A) + PN(W|A∁)PN(A∁)
      = 1 · 1/3 + 0 · 2/3 = 1/3

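A simulation makes the 2/3 versus 1/3 comparison tangible. The following Python sketch is my own illustration (not part of the notes); it plays the game many times under both strategies.

    import random

    def play(switch, trials=100_000):
        wins = 0
        for _ in range(trials):
            prize = random.randrange(3)
            pick = random.randrange(3)
            # showmaster opens a losing door that is neither the pick nor the prize
            opened = next(d for d in range(3) if d != pick and d != prize)
            if switch:
                pick = next(d for d in range(3) if d != pick and d != opened)
            wins += (pick == prize)
        return wins / trials

    random.seed(1)
    print("switch:", play(True))      # close to 2/3
    print("stay:  ", play(False))     # close to 1/3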
Calculus of Probability, Jan 26, 2003 -8-


The Rule of Total Probability

Rule of Total Probability


If B1, . . . , Bk mutually exclusive and B1 ∪ . . . ∪ Bk = S, then

P(A) = P(A|B1)P(B1) + . . . + P(A|Bk )P(Bk )

Example:
Suppose an applicant for a job has been invited for an interview.
The chance that
◦ he is nervous is P(N ) = 0.7,
◦ the interview is successful if he is nervous is P(S|N ) = 0.2,
◦ the interview is successful if he is not nervous is P(S|N ∁) = 0.9.

What is the probability that the interview is successful?

P(S) = P(S|N)P(N) + P(S|N∁)P(N∁)
     = 0.2 · 0.7 + 0.9 · 0.3
     = 0.14 + 0.27 = 0.41

Calculus of Probability, Jan 26, 2003 -9-


The Rule of Total Probability

Example:
Suppose we have two unfair coins:
◦ Coin 1 comes up heads with probability 0.8
◦ Coin 2 comes up heads with probability 0.35
Choose a coin at random and flip it. What is the probability of its
being a head?
Events: H=“heads comes up”, C1=“1st coin”, C2=“2nd coin”

P(H) = P(H|C1)P(C1) + P(H|C2)P(C2)


= (1/2) · (0.8 + 0.35) = 0.575

Calculus of Probability, Jan 26, 2003 - 10 -


Bayes’ Theorem

Example: O.J. Simpson


“Only about 1/10 of one percent of wife-batterers actually murder their wives”
Lawyer of O.J. Simpson on TV

Fact: Simpson pleaded no contest to beating his wife in 1988.


So he murdered his wife with probability 0.001?
◦ Sample space S - married couples in U.S. in which the husband
beat his wife in 1988
◦ Event H - all couples in S in which the husband has since
murdered his wife
◦ Event M - all couples in S in which the wife has been murdered
since 1988
We have ◦ P(H) = 0.001
◦ P(M |H) = 1 since H ⊆ M
◦ P(M |H ∁) = 0.0001 at most in the U.S.
Then

P(H|M) = P(M|H) P(H) / P(M)
       = P(M|H)P(H) / ( P(M|H)P(H) + P(M|H∁)P(H∁) )
       = 0.001 / (0.001 + 0.0001 · 0.999)
       = 0.91

Calculus of Probability, Jan 26, 2003 - 11 -


Bayes’ Theorem

Reversal of conditioning (general multiplication rule)

P(B|A)P(A) = P(A|B)P(B)

Rewriting P(A) using the rule of total probability we obtain

Bayes’ Theorem

P(B|A) = P(A|B)P(B) / ( P(A|B)P(B) + P(A|B∁)P(B∁) )

If B1, . . . , Bk are mutually exclusive and B1 ∪ . . . ∪ Bk = S, then

P(Bi|A) = P(A|Bi)P(Bi) / ( P(A|B1)P(B1) + . . . + P(A|Bk)P(Bk) )

(General form of Bayes’ Theorem)

Calculus of Probability, Jan 26, 2003 - 12 -


Bayes’ Theorem

Example: Testing for AIDS

Enzyme immunoassay test for HIV:


◦ P(T+|I+) = 0.98 (sensitivity - positive for infected)
◦ P(T-|I-) = 0.995 (specificity - negative for noninfected)
◦ P(I+) = 0.0003 (prevalence)
What is the probability that the tested person is infected if the
test was positive?

P(I+|T+) = P(T+|I+)P(I+) / ( P(T+|I+)P(I+) + P(T+|I-)P(I-) )
         = 0.98 · 0.0003 / (0.98 · 0.0003 + 0.005 · 0.9997)
         = 0.05556

Consider different population with P(I+) = 0.1 (greater risk)


P(I+|T+) = 0.98 · 0.1 / (0.98 · 0.1 + 0.005 · 0.9) = 0.956

Testing on a large scale in the low-prevalence population is not sensible (too many false positives).

Repeat test (Bayesian updating):


◦ P(I+|T++) = 0.92 in 1st population
◦ P(I+|T++) = 0.9998 in 2nd population

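The posterior probabilities above, including the effect of repeating the test, follow from applying Bayes' theorem twice. The Python sketch below is a plain restatement of the formula that I added for illustration; it is not an original method of the notes.

    def posterior(prior, sensitivity=0.98, specificity=0.995):
        """P(infected | positive test) by Bayes' theorem."""
        p_pos_given_inf = sensitivity
        p_pos_given_not = 1 - specificity
        return (p_pos_given_inf * prior) / (
            p_pos_given_inf * prior + p_pos_given_not * (1 - prior)
        )

    for prevalence in (0.0003, 0.1):
        first = posterior(prevalence)     # after one positive test
        second = posterior(first)         # Bayesian updating: test a second time
        print(prevalence, round(first, 4), round(second, 4))

For the two prevalences this reproduces roughly 0.056 and 0.956 after one positive test, and roughly 0.92 and 0.9998 after a second positive test, matching the values quoted above.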
Calculus of Probability, Jan 26, 2003 - 13 -


Random Variables

Aim: ◦ Learn about population


◦ Available information: observed data x1, . . . , xn

Problem: ◦ Data affected by chance variation


◦ New set of data would look different

Suppose we observe/measure some characteristic (variable) of n


individuals. The actual observed values x1, . . . , xn are the outcome
of a random phenomenon.

Random variable: a variable whose value is a numerical out-


come of a random phenomenon

Remark: Mathematically, a random variable is a real-valued func-


tion on the sample space S:

X : S → R,   ω ↦ x = X(ω)
◦ SX = X(S) is the sample space of the random variable.
◦ The outcome x = X(ω) is called realisation of X.
◦ X induces a probability P (B) = P(X ∈ B) on SX , the prob-
ability distribution of X
Example: Roll one die

Outcome ω 1 2 3 4 5 6
Realization X(ω) 1 2 3 4 5 6

Random Variables, Jan 28, 2003 -1-


Random Variables

Example: Roll two dice

◦ X1 - number on the first die


◦ X2 - number on the second die
◦ Y = X1 + X2 - total number of points
(a function of random variables is again a random variable)

Table of outcomes:

Outcome (X1, X2 ) Y Outcome (X1, X2 ) Y


11 (1,1) 2 41 (4,1) 5
12 (1,2) 3 42 (4,2) 6
13 (1,3) 4 43 (4,3) 7
14 (1,4) 5 44 (4,4) 8
15 (1,5) 6 45 (4,5) 9
16 (1,6) 7 46 (4,6) 10
21 (2,1) 3 51 (5,1) 6
22 (2,2) 4 52 (5,2) 7
23 (2,3) 5 53 (5,3) 8
24 (2,4) 6 54 (5,4) 9
25 (2,5) 7 55 (5,5) 10
26 (2,6) 8 56 (5,6) 11
31 (3,1) 4 61 (6,1) 7
32 (3,2) 5 62 (6,2) 8
33 (3,3) 6 63 (6,3) 9
34 (3,4) 7 64 (6,4) 10
35 (3,5) 8 65 (6,5) 11
36 (3,6) 9 66 (6,6) 12

Random Variables, Jan 28, 2003 -2-


Random Variables

Two important types of random variables:

• Discrete random variable


◦ takes values in a finite or countable set
• Continuous random variable
◦ takes values in a continuum, or uncountable set
◦ probability of any particular outcome x is zero

P(X = x) = 0 for all x ∈ SX

Example: Ten tosses of a coin


Suppose we toss a coin ten times. Let
◦ X be the number of heads in ten tosses of a coin
◦ Y be the time it takes to toss ten times

Random Variables, Jan 28, 2003 -3-


Discrete Random Variables

Suppose X is a discrete random variable with values x1, x2, . . ..

Example: Roll two dice


Y = X1 + X2 total number of points
y 2 3 4 5 6 7 8 9 10 11 12
P(Y = y) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Frequency function: The function

p(x) = P(X = x) = P({ω ∈ S|X(ω) = x})

is called the frequency function or probability mass function.

Note: p defines a probability on SX = {x1, x2, . . .}:


P(B) = Σ_{x∈B} p(x) = P(X ∈ B).

We call P the (probability) distribution of X.

Properties of a discrete probability distribution


◦ p(x) ≥ 0 for all values of X
◦ Σ_i p(xi) = 1

Random Variables, Jan 28, 2003 -4-


Discrete Random Variables

Example: Roll one die


Let X denote the number of points on the face turned up. Since
all numbers are equally likely we obtain
p(x) = P(X = x) = 1/6  if x ∈ {1, . . . , 6},  and 0 otherwise.

Example: Roll two dice


The probability mass function of the total number of points

Y = X1 + X2

can be written as:


p(y) = P(Y = y) = (6 − |y − 7|)/36  if y ∈ {2, . . . , 12},  and 0 otherwise.

Example: Three tosses of a coin


Let X be the number of heads in three tosses of a coin. There are
C(3, x) outcomes with x heads and 3 − x tails, thus

p(x) = C(3, x) · 1/8.

Random Variables, Jan 28, 2003 -5-


Continuous Random Variables

For a continuous random variable X, the probability that X falls


in the interval (a, b] is given by

P(a < X ≤ b) = ∫_a^b f(x) dx,

where f is the density function of X.

Note: The density defines a probability on R:


P([a, b]) = ∫_a^b f(x) dx = P(X ∈ [a, b])

We call P the (probability) distribution of X.


Remark: The definition of P can be extended to (almost) all B ⊆ R.
Example: Spinner
Consider a spinner that turns freely on its axis and slowly comes to a stop.
◦ X is the stopping point on the circle marked from 0 to 1.
◦ X can take any value in SX = [0, 1).
◦ The outcomes of X are uniformly distributed over the interval [0, 1).
Then the density function of X is

f(x) = 1 if 0 ≤ x < 1,  and 0 otherwise.

Consequently

P(X ∈ [a, b]) = b − a.

Note that for all possible outcomes x ∈ [0, 1) we have



P(X ∈ [x, x]) = x − x = 0.

Random Variables, Jan 28, 2003 -6-


Independence of Random Variables

Recall: Two events A and B are independent if

P(A ∩ B) = P(A)P(B)
Independence of Random Variables
Two discrete random variables X and Y are independent if

P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)

for all A ⊆ SX and B ⊆ SY .

Remark: It is sufficient to show that

P(X = x, Y = y) = pX (x) pY (y) = P(X = x) P(Y = y)

for all x ∈ SX and y ∈ SY .


More generally, X1, X2 , . . . are independent if for all n ∈ N

P(X1 ∈ A1, . . . , Xn ∈ An) = P(X1 ∈ A1) · · · P(Xn ∈ An).


for all Ai ⊆ SXi.

Example: Toss coin three times


Consider

Xi = 1 if head in ith toss of coin,  0 otherwise

X1 , X2 , and X3 are independent:

P(X1 = x1, . . . , X3 = x3) = 1/8 = P(X1 = x1)P(X2 = x2)P(X3 = x3)

Random Variables, Jan 28, 2003 -7-


Multivariate Distributions: Discrete Case
Discrete Case

Let X and Y be discrete random variables.

Joint frequency function of X and Y

pXY (x, y) = P(X = x, Y = y) = P({X = x} ∩ {Y = y})

Marginal frequency function of X


pX(x) = Σ_i pXY(x, yi)

Marginal frequency function of Y


pY(y) = Σ_i pXY(xi, y)

The random variables X and Y are independent if and only if

pXY (x, y) = pX (x) pY (y)

for all possible values x ∈ SX and y ∈ SY .

Conditional probability of X = x given Y = y

P(X = x|Y = y) = pX|Y(x|y) = pXY(x, y) / pY(y) = P(X = x, Y = y) / P(Y = y),
where pX|Y (x|y) is the conditional frequency function.

Random Variables, Jan 28, 2003 -8-


Multivariate Distributions
Discrete Case

Example: Three Tosses of a Coin

◦ X - number of heads on the first toss (values in {0, 1})


◦ Y - total number of heads (values in {0, 1, 2, 3})
The joint frequency function pXY (x, y) is given by the following
table

x\y      0     1     2     3   | pX(x)
0       1/8   2/8   1/8    0   |  1/2
1        0    1/8   2/8   1/8  |  1/2
pY(y)   1/8   3/8   3/8   1/8  |   1

Marginal frequency function of Y

pY(0) = P(Y = 0)
      = P(Y = 0, X = 0) + P(Y = 0, X = 1)
      = 1/8 + 0 = 1/8
pY(1) = P(Y = 1)
      = P(Y = 1, X = 0) + P(Y = 1, X = 1)
      = 2/8 + 1/8 = 3/8
...

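The joint table and the marginals can be generated by enumerating all eight equally likely outcomes. The Python sketch below is an added illustration; it uses a Counter over outcome triples and exact fractions.

    from itertools import product
    from collections import Counter
    from fractions import Fraction

    joint = Counter()
    for tosses in product((0, 1), repeat=3):    # 1 = head; all 8 outcomes equally likely
        x = tosses[0]                           # heads on the first toss
        y = sum(tosses)                         # total number of heads
        joint[(x, y)] += Fraction(1, 8)

    p_Y = Counter()
    for (x, y), p in joint.items():
        p_Y[y] += p

    print(dict(joint))   # matches the joint table, e.g. pXY(0, 1) = 1/4
    print(dict(p_Y))     # marginals 1/8, 3/8, 3/8, 1/8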
Random Variables, Jan 28, 2003 -9-


Multivariate Distributions
Continuous Case

Let X and Y be continuous random variables.

Joint density function of X and Y : fXY such that


∫_A ∫_B fXY(x, y) dy dx = P(X ∈ A, Y ∈ B)

Marginal density function of X:


fX(x) = ∫ fXY(x, y) dy

Marginal density function of Y

fY(y) = ∫ fXY(x, y) dx

The random variables X and Y are independent if and only if

fXY (x, y) = fX (x) fY (y)

for all possible values x ∈ SX and y ∈ SY .

Conditional density function of X given Y = y


fX|Y(x|y) = fXY(x, y) / fY(y)

Conditional probability of X ∈ A given Y = y

P(X ∈ A|Y = y) = ∫_A fX|Y(x|y) dx

Random Variables, Jan 28, 2003 - 10 -


Bernoulli Distribution

Example: Toss of coin


Define X = 1 if head comes up and
X = 0 if tail comes up.
Both realizations are equally likely: P(X = 1) = P(X = 0) = 1/2

Examples:
Often: Two outcomes which are not equally likely:
◦ Success of medical treatment
◦ Interviewed person is female
◦ Student passes exam
◦ Transmittance of a disease

Bernoulli distribution (with parameter θ)

◦ X takes two values, 0 and 1, with probabilities 1 − θ and θ


◦ Frequency function of X
p(x) = θ^x (1 − θ)^(1−x)  for x ∈ {0, 1},  and 0 otherwise
◦ Often:

X = 1 if event A has occurred,  0 otherwise
Example: A = blood pressure above 140/90 mm HG.

Distributions, Jan 30, 2003 -1-


Bernoulli Distribution

Let X1, . . . , Xn be independent Bernoulli random variables with


same parameter θ.

Frequency function of X1, . . . , Xn

p(x1, . . . , xn) = p(x1) · · · p(xn) = θ^(x1+...+xn) (1 − θ)^(n−x1−...−xn)

for xi ∈ {0, 1} and i = 1, . . . , n

Example: Paired-Sample Sign Test


◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11

Define for the ith plant



Xi = 1 if the first value is greater than the second,  0 otherwise

Result: 1 1 1 1 0 1 1 1 1 1

The Xi’s are independently Bernoulli distributed with unknown


parameter θ.

Distributions, Jan 30, 2003 -2-


Binomial Distribution

Let X1, . . . , Xn be independent Bernoulli random variables


◦ Often only interested in number of successes

Y = X1 + . . . + Xn

Example: Paired Sample Sign Test (contd)


Define for the ith plant

Xi = 1 if the first value is greater than the second,  0 otherwise

Y = Σ_{i=1}^n Xi

Y is the number of plants for which the number of lost hours has
decreased after the installation of the safety program

We know:
◦ Xi is Bernoulli distributed with parameter θ
◦ Xi’s are independent

What is the distribution of Y ?

◦ Probability of realization x1, . . . , xn with y successes:

p(x1, . . . , xn) = θy (1 − θ)n−y


◦ Number of different realizations with y successes: C(n, y)

Distributions, Jan 30, 2003 -3-


Binomial Distribution

Binomial distribution (with parameters n and θ)


Let X1, . . . , Xn be independent and Bernoulli distributed with pa-
rameter θ and
Y = Σ_{i=1}^n Xi.

Y has frequency function


 
n
p(y) = θy (1 − θ)n−y for y ∈ {0, . . . , n}
y
Y is binomially distributed with parameters n and θ. We write

Y ∼ Bin(n, θ).

Note that
◦ the number of trials is fixed,
◦ the probability of success is the same for each trial, and
◦ the trials are independent.

Example: Paired Sample Sign Test (contd)


Let Y be the number of plants for which the number of lost hours
has decreased after the installation of the safety program. Then

Y ∼ Bin(n, θ)

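A binomial frequency function is easy to evaluate directly. The Python sketch below is an added illustration; it computes p(y) for Y ∼ Bin(10, θ), e.g. the probability of the observed nine "decreases" if losses were equally likely to go up or down (θ = 1/2 is an illustrative choice, anticipating the test discussed later).

    from math import comb

    def binom_pmf(y, n, theta):
        """Frequency function of Bin(n, theta)."""
        return comb(n, y) * theta**y * (1 - theta)**(n - y)

    n, theta = 10, 0.5
    pmf = [binom_pmf(y, n, theta) for y in range(n + 1)]
    print(sum(pmf))                                        # the probabilities add up to 1
    print(binom_pmf(9, n, theta))                          # P(Y = 9) = 10/1024, about 0.0098
    print(sum(binom_pmf(y, n, theta) for y in (9, 10)))    # P(Y >= 9), about 0.0107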
Distributions, Jan 30, 2003 -4-


Binomial Distribution

Binomial distribution for n = 10

[Figure: frequency functions p(x) of Bin(10, θ) for θ = 0.1, 0.3, 0.5, and 0.8.]

Distributions, Jan 30, 2003 -5-


Geometric Distribution

Consider a sequence of independent Bernoulli trials.


◦ On each trial, a success occurs with probability θ.
◦ Let X be the number of trials up to the first success.

What is the distribution of X?


◦ Probability of no success in x − 1 trials: (1 − θ)x−1
◦ Probability of one success in the xth trial: θ
The frequency function of X is

p(x) = θ(1 − θ)x−1 , x = 1, 2, 3, . . .

X is geometrically distributed with parameter θ.

Example:
Suppose a batter has probability 1/3 to hit the ball. What is the chance that
he misses the ball less than 3 times?

The number X of balls up to the first success is geometrically distributed
with parameter 1/3. Thus

P(X ≤ 3) = 1/3 + (2/3) · 1/3 + (2/3)^2 · 1/3 = 19/27 = 0.7037.

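The same probability can be obtained by summing the geometric frequency function; a short Python check that I added for illustration:

    theta = 1/3
    p_hit_within_3 = sum(theta * (1 - theta) ** (x - 1) for x in range(1, 4))
    print(p_hit_within_3)      # 19/27 = 0.7037...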
Distributions, Jan 30, 2003 -6-


Hypergeometric Distribution

Example: Quality Control


Quality control - sample and examine fraction of produced units
◦ N produced units
◦ M defective units
◦ n sampled units
What is the probability that the sample contains x defective units?

The frequency function of X is



p(x) = C(M, x) C(N − M, n − x) / C(N, n),   x = 0, 1, . . . , n.

X is a hypergeometric random variable with parameters N , M ,


and n.

Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men,
all equally qualified. If we select 10 applicants at random what is the
probability that x of them are female?
The number of chosen female applicants is hypergeometrically distributed
with parameters 100, 50, and 10. The frequency function is
p(x) = C(50, x) C(50, 10 − x) / C(100, 10)   for x = 0, 1, . . . , 10.

Distributions, Jan 30, 2003 -7-


Poisson Distribution

Often we are interested in the number of events which occur in a


specific period of time or in a specific area of volume:
◦ Number of alpha particles emitted from a radioactive source during a
given period of time
◦ Number of telephone calls coming into an exchange during one unit of
time
◦ Number of diseased trees per acre of a certain woodland
◦ Number of death claims received per day by an insurance company

Characteristics
Let X be the number of times a certain event occurs during a given
unit of time (or in a given area, etc).
◦ The probability that the event occurs in a given unit of time is
the same for all the units.
◦ The number of events that occur in one unit of time is inde-
pendent of the number of events in other units.
◦ The mean (or expected) rate is λ.

Then X is a Poisson random variable with parameter λ and


frequency function
p(x) = (λ^x / x!) e^(−λ),   x = 0, 1, 2, . . .

Distributions, Jan 30, 2003 -8-


Poisson Approximation

The Poisson distribution is often used as an approximation for


binomial probabilities when n is large and θ is small:
 
p(x) = C(n, x) θ^x (1 − θ)^(n−x) ≈ (λ^x / x!) e^(−λ)
with λ = n θ.

Example: Fatalities in Prussian cavalry


Classical example from von Bortkiewicz (1898).
◦ Number of fatalities resulting from being kicked by a horse
◦ 200 observations (10 corps over a period of 20 years)

Statistical model:
◦ Each soldier is kicked to death by a horse with probability θ.
◦ Let Y be the number of such fatalities in one corps. Then

Y ∼ Bin(n, θ)

where n is the number of soldiers in one corps.

Observation: The data are well approximated by a Poisson distribution


with λ = 0.61

Deaths per Year Observed Rel. Frequency Poisson Prob.


0 109 0.545 0.543
1 65 0.325 0.331
2 22 0.110 0.101
3 3 0.015 0.021
4 1 0.005 0.003

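The Poisson column of the table comes from evaluating p(x) = λ^x e^(−λ)/x! at λ = 0.61. The Python sketch below is an added illustration that recomputes it next to the observed relative frequencies (copied from the table above).

    from math import exp, factorial

    lam = 0.61
    observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}    # deaths per corps-year, 200 observations
    total = sum(observed.values())

    for x, count in observed.items():
        poisson = lam**x * exp(-lam) / factorial(x)
        print(x, round(count / total, 3), round(poisson, 3))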
Distributions, Jan 30, 2003 -9-


Poisson Approximation

Poisson approximation of Bin(40, θ)

[Figure: frequency functions of Bin(40, θ) for θ = 1/400, 1/40, 1/8, 1/4
(left column) next to Poisson frequency functions with λ = nθ = 1/10, 1, 5, 10
(right column).]

Distributions, Jan 30, 2003 - 10 -


Continuous Distributions

Uniform distribution U(0, θ)
  Range: (0, θ)
  f(x) = (1/θ) · 1_(0,θ)(x)
  E(X) = θ/2
  var(X) = θ^2/12

Exponential distribution Exp(λ)
  Range: [0, ∞)
  f(x) = λ exp(−λx) · 1_[0,∞)(x)
  E(X) = 1/λ
  var(X) = 1/λ^2

Normal distribution N(µ, σ^2)
  Range: R
  f(x) = (1/√(2πσ^2)) exp( −(x − µ)^2 / (2σ^2) )
  E(X) = µ
  var(X) = σ^2

[Figure: histograms and boxplots of samples from U(0, θ), Exp(λ), and N(µ, σ^2).]

Distributions, Jan 30, 2003 - 11 -


Expected Value

Let X be a discrete random variable which takes values in SX =


{x1, x2, . . . , xn}

Expected Value or Mean of X:


E(X) = Σ_{i=1}^n xi p(xi)

Example: Roll one die


Let X be outcome of rolling one die. The frequency function is
p(x) = 1/6,   x = 1, . . . , 6,
and hence

E(X) = Σ_{x=1}^6 x/6 = 7/2 = 3.5

Example: Bernoulli random variable


Let X ∼ Bin(1, θ).

p(x) = θx (1 − θ)1−x

Thus the mean of X is

E(X) = 0 · (1 − θ) + 1 · θ = θ.

Expected Value and Variance, Feb 2, 2003 -1-


Expected Value

Linearity of the expected value


Let X and Y be two discrete random variables. Then

E(a X + b Y ) = aE(X) + bE(Y )


for any constants a, b ∈ R
Note: No independence is required.

Proof:
E(a X + b Y) = Σ_{x,y} (a x + b y) p(x, y)
             = a Σ_{x,y} x p(x, y) + b Σ_{x,y} y p(x, y)
             = a Σ_x x pX(x) + b Σ_y y pY(y)      (since Σ_y p(x, y) = pX(x))
             = a E(X) + b E(Y)

Example: Binomial distribution


Let X ∼ Bin(n, θ). Then X = X1 +. . .+Xn with Xi ∼ Bin(1, θ):
E(X) = Σ_{i=1}^n E(Xi) = Σ_{i=1}^n θ = nθ

Expected Value and Variance, Feb 2, 2003 -2-


Expected Value

Example: Poisson distribution

Let X be a Poisson random variable with parameter λ.

E(X) = Σ_{x=0}^∞ x (λ^x / x!) e^(−λ)
     = λ e^(−λ) Σ_{x=1}^∞ λ^(x−1) / (x − 1)!
     = λ e^(−λ) e^λ = λ

Remarks:
◦ For most distributions some “advanced” knowledge of calculus
is required to find the mean.
◦ Use tables for means of commonly used distribution.

Expected Value and Variance, Feb 2, 2003 -3-


Expected Value

Example: European Call Options


Agreement that gives an investor the right (but not the obliga-
tion) to buy a stock, bond, commodity, or other instruments at
a specific time at a specific price.

What is a fair price P for European call options?

If ST is the price of the stock at time T , the profit will be

Profit = (ST − K)+ − P,   where (x)+ = max(x, 0) and K is the strike price.

Profit is a random variable.


[Figure: profit (ST − K)+ − P of the option plotted against the stock price ST.]

Fair price P for this option is expected value

P = E(ST − K)+.

Expected Value and Variance, Feb 2, 2003 -4-


Expected Value

Example: European Call Options (contd)

Consider the following simple model:


◦ St = St−1 + εt, t = 1, . . . , T
◦ P(εt = 1) = p and P(εt = −1) = 1 − p.
St is also called a random walk.
The distribution of ST is given by (s0 known at time 0)

ST = s0 + 2 Y − T, with Y ∼ Bin(T, p)

Therefore the price P is (assuming s0 = 0 without loss of generality)


P = E(ST − K)+ = Σ_{y=0}^T (2y − T − K) p(y) · 1{y > (K+T)/2},
where p(y) is the frequency function of Bin(T, p).

Let T = 20, K = 10, p = 0.6


P = 2.75

[Figure: frequency function of the profit (ST − K)+ − P; the possible profits
range from −2.75 to 19.25 in steps of 2.]

Expected Value and Variance, Feb 2, 2003 -5-


Expected Value
Example: Group testing
Suppose that a large number of blood samples are to be screened for a rare
disease with prevalence 1 − p.

• If each sample is assayed individually, n tests will be required.


• Alternative scheme:
◦ n samples, m groups with k samples
◦ Split each sample in half and pool all samples in one group
◦ Test pooled sample for each group
◦ If test positive test all samples in group separately
What is the expected number of tests under this alternative scheme?
Let Xi be the number of tests in group i. The frequency function of Xi is
p(x) = p^k        if x = 1
       1 − p^k    if x = k + 1

The expected number of tests in each group is

E(Xi) = p^k + (k + 1)(1 − p^k) = k + 1 − k p^k


Hence

E(N) = Σ_{i=1}^m E(Xi) = n (1 + 1/k − p^k)

[Figure: plot of E(N)/n against the group size k.]

The mean is minimized for groups of size 11.

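The trade-off can be explored numerically. The sketch below (an added Python illustration) evaluates the expected number of tests per sample, 1 + 1/k − p^k, over a range of group sizes; the value p = 0.99 (prevalence 1%) is an assumption chosen for illustration, since the notes do not fix p.

    def expected_tests_per_sample(k, p):
        """E(N)/n = 1 + 1/k - p**k for groups of size k."""
        return 1 + 1 / k - p ** k

    p = 0.99                                   # assumed probability of a negative sample
    values = {k: expected_tests_per_sample(k, p) for k in range(2, 17)}
    best = min(values, key=values.get)
    print(best, round(values[best], 3))        # k = 11 minimizes E(N)/n for this p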
Expected Value and Variance, Feb 2, 2003 -6-


Variance

Let X be a random variable.

Variance of X:
var(X) = E( (X − E(X))^2 ).

The variance of X is the expected squared distance of X from its


mean.

Suppose X is a discrete random variable with SX = {x1, . . . , xn}.


Then the variance of X can be written as
var(X) = Σ_{i=1}^n ( xi − Σ_{j=1}^n xj p(xj) )^2 p(xi)

Example: Roll one die


X takes values in {1, 2, 3, 4, 5, 6} with frequency function p(x) = 1/6.

E(X) = Σ_{x=1}^6 x · 1/6 = 7/2

var(X) = Σ_{x=1}^6 (x − 7/2)^2 · 1/6
       = (1/6)(25/4 + 9/4 + 1/4 + 1/4 + 9/4 + 25/4) = 35/12

We often denote the variance of a random variable X by σX^2,

σX^2 = var(X),

and its standard deviation by σX .

Expected Value and Variance, Feb 2, 2003 -7-


Properties of the Variance

The variance can also be written as


var(X) = E(X^2) − E(X)^2

To see this (using linearity of the mean):

var(X) = E( (X − E(X))^2 )
       = E( X^2 − 2X E(X) + E(X)^2 )
       = E(X^2) − 2E(X)E(X) + E(X)^2 = E(X^2) − E(X)^2

Example: Let X ∼ Bin(1, θ). Then


var(X) = E(X^2) − E(X)^2
       = E(X) − E(X)^2 = θ − θ^2 = θ (1 − θ)

Rules for the variance:


◦ For constants a and b

var(aX + b) = a2var(X).

◦ For independent random variables X and Y

var(X + Y ) = var(X) + var(Y ).

Example: Let X ∼ Bin(n, θ). Then

var(X) = n θ (1 − θ)

Expected Value and Variance, Feb 2, 2003 -8-


Covariance

For independent random variables X and Y we have

var(X + Y ) = var(X) + var(Y ).

Question: What about dependent random variables?


It can be shown that

var(X + Y ) = var(X) + var(Y ) + 2 cov(X, Y )

where
 
cov(X, Y) = E( (X − E(X))(Y − E(Y)) )

is the covariance of X and Y .

Properties of the covariance


◦ cov(X, Y ) = E(XY ) − E(X) E(Y )
◦ cov(X, X) = var(X)
◦ cov(X, 1) = 0
◦ cov(X, Y ) = cov(Y, X)
◦ cov(a X1 + b X2, Y ) = a cov(X1, Y ) + b cov(X2, Y )

Expected Value and Variance, Feb 2, 2003 -9-


Covariance

Important:
cov(X, Y ) = 0 does NOT imply that X and Y are independent.

Example:
Suppose X ∈ {−1, 0, 1} with probabilities P(X = x) = 1/3 for
x = −1, 0, 1. Then E(X) = 0 and

cov(X, X^2) = E(X^3) = E(X) = 0.

On the other hand

P(X = 1, X^2 = 0) = 0 ≠ 1/9 = P(X = 1) P(X^2 = 0),

that is, X and X^2 are not independent!

Note: The covariance of X and Y measures only linear depen-


dence.

Expected Value and Variance, Feb 2, 2003 - 10 -


Correlation

The correlation coefficient ρ is defined as


ρXY = corr(X, Y) = cov(X, Y) / √( var(X) var(Y) ).

Properties:
◦ dimensionless quantity
◦ not affected by linear transformations, i.e.

corr(a X + b, c Y + d) = corr(X, Y )

◦ −1 ≤ ρXY ≤ 1
◦ ρXY = 1 if and only if P(Y = a + b X) = 1 for some a and b
◦ measures linear association between X and Y

Example: Three boxes: pp, pd, and dd (Ex 3.6)


Let Xi = 1{penny on ith draw}. Then Xi ∼ Bin(1, p) with p = 1/2 and
joint frequency function p(x1, x2):

x1\x2    0     1
0       1/3   1/6
1       1/6   1/3

Thus:

cov(X1, X2) = E[(X1 − p)(X2 − p)]
            = (1/4)·(1/3) + (1/4)·(1/3) − 2·(1/4)·(1/6) = 1/12

corr(X1, X2) = (1/12) / (1/4) = 1/3

Expected Value and Variance, Feb 2, 2003 - 11 -


Prediction

An instructor standardizes his midterm and final so the class aver-


age is µ = 75 and the SD is σ = 10 on both tests. The correlation
between the tests is always around ρ = 0.50.
◦ X - score of student on the first examination
◦ Y - score of student on the second examination
Since X and Y are dependent we should be able to predict the
score in the final from the midterm score.
Approach:
◦ Predict Y from linear function a + b X
◦ Minimize mean squared error
MSE = E( (Y − a − b X)^2 )
    = var(Y − b X) + ( E(Y − a − b X) )^2

Solution:
a = µ − b µ   and   b = σXY / σX^2 = ρ
Thus the best linear predictor is

Ŷ = µ + ρ (X − µ)

Note:
We expect the student’s score on the final to differ from the mean
only by half the difference observed in the midterm (regression to
the mean).

Expected Value and Variance, Feb 2, 2003 - 12 -


Summary

Bernoulli distribution - Bin(1, θ)
  p(x) = θ^x (1 − θ)^(1−x)                       E(X) = θ
                                                 var(X) = θ(1 − θ)

Binomial distribution - Bin(n, θ)
  p(x) = C(n, x) θ^x (1 − θ)^(n−x)               E(X) = nθ
                                                 var(X) = nθ(1 − θ)

Poisson distribution - Poiss(λ)
  p(x) = (λ^x / x!) e^(−λ)                       E(X) = λ
                                                 var(X) = λ

Geometric distribution
  p(x) = θ(1 − θ)^(x−1)                          E(X) = 1/θ
                                                 var(X) = (1 − θ)/θ^2

Hypergeometric distribution - H(N, M, n)
  p(x) = C(M, x) C(N − M, n − x) / C(N, n)       E(X) = nM/N

Expected Value and Variance, Feb 2, 2003 - 13 -


Properties of the Sample Mean

Consider X1, . . . , Xn independent and identically distributed (iid)


with mean µ and variance σ 2.
X̄ = (1/n) Σ_{i=1}^n Xi   (sample mean)

Then

E(X̄) = (1/n) Σ_{i=1}^n µ = µ

var(X̄) = (1/n^2) Σ_{i=1}^n σ^2 = σ^2/n
Remarks:
◦ The sample mean is an unbiased estimate of the true mean.
◦ The variance of the sample mean decreases as the sample size
increases.
◦ Law of Large Numbers: It can be shown that for n → ∞
X̄ = (1/n) Σ_{i=1}^n Xi → µ.

Question:
◦ How close to µ is the sample mean for finite n?
◦ Can we answer this without knowing the distribution of X?

Central Limit Theorem, Feb 4, 2004 -1-


Properties of the Sample Mean

Chebyshev’s inequality
Let X be a random variable with mean µ and variance σ 2.
Then for any ε > 0
P( |X − µ| > ε ) ≤ σ^2 / ε^2.

Proof: Let 1{|xi − µ| > ε} = 1 if |xi − µ| > ε, and 0 otherwise. Then

P( |X − µ| > ε ) = Σ_{i=1}^n 1{|xi − µ| > ε} p(xi)
                 = Σ_{i=1}^n 1{ (xi − µ)^2/ε^2 > 1 } p(xi)
                 ≤ Σ_{i=1}^n (xi − µ)^2/ε^2 · p(xi) = σ^2/ε^2

Application to the sample mean:


 
P( µ − 3σ/√n ≤ X̄ ≤ µ + 3σ/√n ) ≥ 1 − 1/9 ≈ 0.889

However: Chebyshev's bound is known to be not very precise.

Example: Xi iid N(0, 1)

X̄ = (1/n) Σ_{i=1}^n Xi ∼ N(0, 1/n)

Therefore

P( −3/√n ≤ X̄ ≤ 3/√n ) = 0.997

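The two numbers can be compared with a few lines of Python; Φ is evaluated via math.erf, so no external libraries are needed (my own sketch, not part of the notes).

    from math import erf, sqrt

    def Phi(z):
        """Standard normal cumulative distribution function."""
        return 0.5 * (1 + erf(z / sqrt(2)))

    chebyshev_bound = 1 - 1 / 9              # lower bound for P(|Xbar - mu| <= 3 sigma/sqrt(n))
    normal_exact = Phi(3) - Phi(-3)          # exact value when the Xi are N(0, 1)
    print(round(chebyshev_bound, 3), round(normal_exact, 3))   # 0.889 vs 0.997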
Central Limit Theorem, Feb 4, 2004 -2-


Central Limit Theorem

Let X1, X2, . . . be a sequence of random variables


◦ independent and identically distributed
◦ with mean µ and variance σ 2.
For n ∈ N define
Zn = √n (X̄ − µ)/σ = (1/√n) Σ_{i=1}^n (Xi − µ)/σ.

Zn has mean 0 and variance 1.

Central Limit Theorem


For large n, the distribution of Zn can be approximated by the
standard normal distribution N (0, 1). More precisely,
lim_{n→∞} P( a ≤ √n (X̄ − µ)/σ ≤ b ) = Φ(b) − Φ(a),

where Φ(z) is the standard normal probability

Φ(z) = ∫_{−∞}^z f(x) dx,

that is, the area under the standard normal curve to left of z.

Example:
◦ U1, . . . , U12 uniformly distributed on [0, 12).
◦ What is the probability that the sample mean exceeds 9?

P(Ū > 9) = P( √12 (Ū − 6)/√12 > 3 ) ≈ 1 − Φ(3) = 0.0013

Central Limit Theorem, Feb 4, 2004 -3-


Central Limit Theorem
[Figure: densities of the standardized sample mean Zn for samples from
U[0,1] (left column) and Exp(1) (right column), for n = 1, 2, 6, 12, 100,
compared with the standard normal density.]
Central Limit Theorem, Feb 4, 2004 -4-


Central Limit Theorem

Example: Shipping packages


Suppose a company ships packages that vary in weight:
◦ Packages have mean 15 lb and standard deviation 10 lb.
◦ They come from a large number of customers, i.e. packages are
independent.
Question: What is the probability that 100 packages will have a
total weight exceeding 1700 lb?
Let Xi be the weight of the ith package and
T = Σ_{i=1}^{100} Xi.

Then
 
P(T > 1700 lb) = P( (T − 1500 lb)/(√100 · 10 lb) > (1700 lb − 1500 lb)/(√100 · 10 lb) )
               = P( (T − 1500 lb)/(√100 · 10 lb) > 2 )
               ≈ 1 − Φ(2) = 0.023

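The same normal approximation in a small Python sketch (my own illustration; Φ is again computed from math.erf):

    from math import erf, sqrt

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    mu, sigma, n = 15, 10, 100
    mean_T, sd_T = n * mu, sqrt(n) * sigma      # T has mean 1500 lb and sd 100 lb
    z = (1700 - mean_T) / sd_T
    print(round(1 - Phi(z), 3))                 # about 0.023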
Central Limit Theorem, Feb 4, 2004 -5-


Central Limit Theorem

Remarks
• How fast approximation becomes good depends on distribution
of Xi’s:
◦ If it is symmetric and has tails that die off rapidly, n can
be relatively small.
iid
Example: If Xi ∼ U [0, 1], the approximation is good for
n = 12.
◦ If it is very skewed or if its tails die down very slowly, a
larger value of n is needed.
Example: Exponential distribution.
• Central limit theorems are very important in statistics.
• There are many central limit theorems covering many situa-
tions, e.g.
◦ for not identically distributed random variables or
◦ for dependent, but not “too” dependent random variables.

Central Limit Theorem, Feb 4, 2004 -6-


The Normal Approximation to the Binomial

Let X be binomially distributed with parameters n and p.

Recall that X is the sum of n iid Bernoulli random variables,


P
n
iid
X= Xi , Xi ∼ Bin(1, p).
i=1

Therefore we can apply the Central Limit Theorem:

Normal Approximation to the Binomial Distribution



For n large enough, X is approximately N(np, np(1 − p)) distributed:

P(a ≤ X ≤ b) ≈ P(a − 1/2 ≤ Z ≤ b + 1/2)

where Z ∼ N(np, np(1 − p)).

Rule of thumb for n: np > 5 and n(1 − p) > 5.

In terms of the standard normal distribution we get

P(a ≤ X ≤ b) ≈ P( (a − 1/2 − np)/√(np(1 − p)) ≤ Z′ ≤ (b + 1/2 − np)/√(np(1 − p)) )
            = Φ( (b + 1/2 − np)/√(np(1 − p)) ) − Φ( (a − 1/2 − np)/√(np(1 − p)) )

where Z′ ∼ N(0, 1).

Central Limit Theorem, Feb 4, 2004 -7-


The Normal Approximation to the Binomial
[Figure: frequency functions of Bin(n, 0.5) for n = 1, 2, 5, 10, 20 (left column)
and Bin(n, 0.1) for n = 1, 5, 10, 20, 50 (right column), illustrating how the
binomial distribution approaches a normal shape as n grows.]

Central Limit Theorem, Feb 4, 2004 -8-


The Normal Approximation to the Binomial

Example: The random walk of a drunkard


Suppose a drunkard executes a “random” walk in the following
way:
◦ Each minute he takes a step north or south, with probability 1/2 each.
◦ His successive step directions are independent.
◦ His step length is 50 cm.

How likely is he to have advanced 10 m north after one hour?


◦ Position after one hour: X · 1 m − 30 m
◦ X binomially distributed with parameters n = 60 and p = 1/2

◦ X is approximately normal with mean 30 and variance 15:

P(X · 1 m − 30 m ≥ 10 m)
= P(X ≥ 40)
≈ P(Z > 39.5),   Z ∼ N(30, 15)
= P( (Z − 30)/√15 > 9.5/√15 )
= 1 − Φ(2.452) = 0.007

How does the probability change if he has some idea of where he
wants to go and steps north with probability p = 2/3 and south with
probability 1/3?

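Exact binomial tails put the approximation in perspective and also answer the follow-up question. The Python sketch below is an added illustration; it computes P(X ≥ 40) for Bin(60, 1/2) and for Bin(60, 2/3), the biased walk asked about above.

    from math import comb, erf, sqrt

    def binom_tail(k, n, p):
        """P(X >= k) for X ~ Bin(n, p)."""
        return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    n = 60
    print(binom_tail(40, n, 1/2))              # exact tail, close to 0.007
    print(1 - Phi((39.5 - 30) / sqrt(15)))     # normal approximation, about 0.007
    print(binom_tail(40, n, 2/3))              # chance for the biased walker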
Central Limit Theorem, Feb 4, 2004 -9-


Estimation

Example: Cholesterol levels of heart-attack patients


Data: Observational study at a Pennsylvania medical center
◦ blood cholesterol levels of patients treated for heart attacks
◦ measurements 2, 4, and 14 days after the attack

Id Y1 Y2 Y3 Id Y1 Y2 Y3
1 270 218 156 15 294 240 264
2 236 234 193 16 282 294 220
3 210 214 242 17 234 220 264
4 142 116 120 18 224 200 213
5 280 200 181 19 276 220 188
6 272 276 256 20 282 186 182
7 160 146 142 21 360 352 294
8 220 182 216 22 310 202 214
9 226 238 248 23 280 218 170
10 242 288 298 24 278 248 198
11 186 190 168 25 288 278 236
12 266 236 236 26 288 248 256
13 206 244 238 27 244 270 280
14 318 258 200 28 236 242 204

Aim: Make inference on distribution of


◦ cholesterol level 14 days after the attack: Y3
◦ decrease in cholesterol level: D = Y1 − Y3
◦ relative decrease in cholesterol level: R = (Y1 − Y3)/Y3

Confidence intervals I, Feb 11, 2004 -1-


Estimation

Data:

d1, . . . , d28 observed decrease in cholesterol level

In this example, parameters of interest might be

µD = E(D) the mean decrease in cholesterol level,


2
σD = var(D) the variation of the cholesterol level,
pD = P(D ≤ 0) probability of no decrease in cholesterol level

These parameters are naturally estimated by the following sample


statistics:
µ̂D = (1/n) Σ_{i=1}^n di                   (sample mean)

σ̂D^2 = (1/n) Σ_{i=1}^n (di − d̄)^2          (sample variance)

p̂D = #{di | di ≤ 0} / n                    (sample proportion)
Such statistics are point estimators since they estimate the corre-
sponding parameter by a single numerical value.
◦ Point estimates provide no information about their chance vari-
ation.
◦ Estimates without an indication of their variability are of lim-
ited value.

Confidence intervals I, Feb 11, 2004 -2-


Confidence Intervals for the Mean

Recall:
◦ CLT for the sample mean: For large n we have

  X̄ ≈ N(µ, σ^2/n)

◦ 68-95-99.7 rule: With 95% probability the sample mean differs from
  its mean µ by less than two standard deviations.

More precisely, we have

P( µ − 1.96 σ/√n ≤ X̄ ≤ µ + 1.96 σ/√n ) = 0.95,

or equivalently, after rearranging the terms,

P( X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n ) = 0.95.

Interpretation: There is 95% probability that the random in-


terval
[ X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n ]
will cover the mean µ.

Example: Cholesterol levels

d¯ = 36.89, σ = 51.00, n = 28.

Therefore, the 95% confidence interval for µ is

[18.00, 55.78].

Confidence intervals I, Feb 11, 2004 -3-


Confidence Intervals for the Mean

Assumption: The population standard deviation σ is known.

◦ In the next lecture, we will drop this unrealistic assumption.


◦ Assumption is approximately satisfied for large sample sizes,
since then σ̂ ≈ σ by the law of large numbers.

Definition: Confidence interval for µ (σ known)


The interval
[ X̄ − zα/2 σ/√n , X̄ + zα/2 σ/√n ]

is called a (1 − α) confidence interval for the population mean
µ. (1 − α) is the confidence level.
For large sample sizes n, an approximate (1 − α) confidence
interval for µ is given by

[ X̄ − zα/2 σ̂/√n , X̄ + zα/2 σ̂/√n ].

Here, zα is the α-critical value of the standard normal distribution:


◦ zα has area α to its right
◦ Φ(zα) = 1 − α

[Figure: standard normal density with the area α to the right of zα shaded.]

Confidence intervals I, Feb 11, 2004 -4-


Confidence Interval for the Mean

Example: Community banks


◦ Community banks are banks with less than a billion dollars of assets.
◦ Approximately 7500 such banks in the United States.

Annual survey of the Community Bankers Council of the American Bankers


Association (ABA)
◦ Population: Community banks in the United States.
◦ Variable of interest: Total assets of community banks.
◦ Sample size: n = 110
◦ Sample mean: X̄ = 220 millions of dollars
◦ Sample standard deviation: SD = 161 millions of dollars
◦ Histogram of sampled values:
[Figure: "Assets of Community Banks in the U.S." - histogram of the assets
(in millions of dollars) for the sample of 110 community banks.]

Suppose we want to give a 95% confidence interval for the mean total assets
of all community banks in the United States.
◦ α = 0.05, zα/2 = 1.96
A 95% confidence interval for the mean assets (in millions of dollars) is
 
161 161 
220 − 1.96 · √ , 220 + 1.96 · √ ≈ 190, 250].
110 110

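The interval can be reproduced in a couple of lines; a Python sketch I added for illustration, with zα/2 = 1.96 taken from the slide:

    from math import sqrt

    xbar, s, n, z = 220, 161, 110, 1.96
    half_width = z * s / sqrt(n)
    print(round(xbar - half_width), round(xbar + half_width))   # about (190, 250)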
Confidence intervals I, Feb 11, 2004 -5-


Sample Size

Example: Cholesterol levels


Suppose we want a 99% confidence interval for the decrease in
cholesterol level:
◦ α = 0.01, z0.005 = 2.58
◦ The 99% confidence interval for µD is
[ 36.89 − 2.58 · 50.93/√28 , 36.89 + 2.58 · 50.93/√28 ] ≈ [12.06, 61.72].

Note: If we raise the confidence level, the confidence interval


becomes wider.
Suppose we want to increase the confidence level without
increasing the error of estimation (indicated by the half-width of
the confidence interval). For this we have to increase the sample
size n.

Question: What sample size n is needed to estimate the mean


decrease in cholesterol with error e = 20 and confidence level 99%?
The error (half-width of the confidence interval) is

e = zα/2 σ/√n

Therefore the sample size ne needed is given by

ne ≥ ( zα/2 σ / e )^2 = ( 2.58 · 50.93 / 20 )^2 = 43.16,
that is, a sample of 44 patients is needed to estimate µD with error
e = 20 and 99% confidence.

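The sample-size formula is just the half-width equation solved for n; in Python (my own sketch, using ceil because n must be an integer):

    from math import ceil

    z, sigma, e = 2.58, 50.93, 20
    n_needed = (z * sigma / e) ** 2
    print(round(n_needed, 2), ceil(n_needed))    # 43.16 -> sample 44 patients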
Confidence intervals I, Feb 11, 2004 -6-


Estimation of the Mean

Example: Banks’ loan-to-deposit ratio


The ABA survey of community banks also asked about the loan-to-deposit
ratio (LTDR), a bank’s total loans as a percent of its total deposits.

Sample statistics:
◦ n = 110
◦ µ̂LTDR = 76.7
◦ σ̂LTDR = 12.3

[Figure: "Loan-To-Deposit Ratio of Community Banks" - histogram of the LTDR
(in %) for the sample of 110 community banks.]

Construction of 95% confidence interval:


◦ α = 0.05, zα/2 = 1.96
◦ Standard error σX̄ = σLTDR/√n = 1.17
◦ 95% confidence interval for µLTDR:

  [ X̄ − zα/2 σLTDR/√n , X̄ + zα/2 σLTDR/√n ] = [74.4, 79.0]

◦ To get an estimate with error e = 3.0 (half-width of the confidence
  interval) it suffices to sample ne banks,

  ne ≥ ( zα/2 σLTDR / e )^2 = ( 1.96 · 12.3 / 3.0 )^2 = 64.6.

  Thus a sample of ne = 65 banks is sufficient.

Confidence intervals I, Feb 11, 2004 -7-


Confidence intervals

Definition: Confidence interval


A (1 − α) confidence interval for a parameter is an interval that
◦ depends only on sample statistics and
◦ covers the parameter with probability (1 − α)

Note:
◦ Confidence intervals are random while the estimated parameter
is fixed.
◦ For repeated samples, only (1 − α) · 100% of the confidence intervals will
cover the true parameter.

Confidence intervals II, Feb 13, 2004 -1-


Confidence Intervals for the Mean
Suppose that X1, . . . , Xn iid N(µ, σ^2). Then

(X̄ − µ)/(σ/√n) ∼ N(0, 1)                     (*)

Assuming that σ is known, we obtain

[ X̄ − zα/2 · σ/√n , X̄ + zα/2 · σ/√n ]

as (1 − α) confidence interval for µ.

More realistic situation: σ is unknown.

Approach: Replace σ by the estimate σ̂ = s.

This approach leads to the t statistic

T = (X̄ − µ)/(s/√n) ∼ tn−1.

It is t distributed with n − 1 degrees of freedom.

[Figure: densities of the t1, t3, and t10 distributions compared with N(0, 1).]

Confidence interval for the mean µ (σ unknown)


The interval
[ X̄ − tn−1,α/2 · s/√n , X̄ + tn−1,α/2 · s/√n ]

is a (1 − α) confidence interval for the mean µ.

Notation: Critical values of distributions

zα standard normal distribution


tn,α t distribution with n degrees of freedom

Confidence intervals II, Feb 13, 2004 -2-


Confidence Intervals for the Mean

Example: Cholesterol levels


In the study on cholesterol levels, the standard deviation of the decrease
of cholesterol level was unknown.
◦ µ̂D = 36.89, σ̂D = 50.94
◦ t27,0.025 = 2.05
◦ Then
[ 36.89 − 2.05 · 50.94/√27 , 36.89 + 2.05 · 50.94/√27 ] = [16.78, 57.01]
is a 95% confidence interval for µD
◦ The large sample confidence interval based on (*) was [18.00,55.78].

Example: Level of vitamin C


The following data are the amounts of vitamin C, measured in milligrams
per 100 grams (mg/100 g) of corn soy blend, for a random sample of size 8
from a production run:

26 31 23 22 11 22 14 31

What is the 95% confidence interval for µ, the mean vitamin C content of
the CSB produced during this run?
◦ µ̂ = 22.5, σ̂ = 7.2, t7,0.025 = 2.36
◦ The 95% confidence interval for µ is
 
[ 22.5 − 2.36 · 7.2/√8 , 22.5 + 2.36 · 7.2/√8 ] = [16.5, 28.5].
◦ The large sample CI would be [17.5, 27.5].

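With the critical value t7,0.025 = 2.36 quoted above, the interval follows directly from the data. A Python sketch I added for illustration, computing both the t interval and the large-sample z interval:

    from math import sqrt

    data = [26, 31, 23, 22, 11, 22, 14, 31]
    n = len(data)
    xbar = sum(data) / n
    s = sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))      # sample standard deviation

    for crit in (2.36, 1.96):                                   # t(7) and normal critical values
        half = crit * s / sqrt(n)
        print(round(xbar - half, 1), round(xbar + half, 1))     # [16.5, 28.5] and [17.5, 27.5]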
Confidence intervals II, Feb 13, 2004 -3-


Confidence Intervals for the Variance
For normally distributed data X1, . . . , Xn iid N(µ, σ^2), the ratio

(n − 1) · s^2 / σ^2

has a χ^2 distribution with n − 1 degrees of freedom.

The (1 − α) confidence interval for σ^2 is

[ (n − 1) · s^2 / χ^2_{n−1,α/2} , (n − 1) · s^2 / χ^2_{n−1,1−α/2} ],

where χ^2_{n−1,α} is the critical value of the χ^2_{n−1} distribution with area α to its right.


Caution: This confidence interval is not robust against depar-
tures from normality regardless of the sample size.
Example: Cholesterol levels
Suppose we are interested in the variance of Y3, the cholesterol level 14
days after the attack.
◦ Normal probability plot:
[Figure: normal probability plot of the cholesterol levels Y3 against normal quantiles.]

Data seem to be normally distributed.


◦ s^2 = 2030.55, χ^2_{27,0.975} = 14.57, χ^2_{27,0.025} = 43.19
◦ The 95% confidence interval for σ^2 is

  [ 27 · 2030.55 / 43.19 , 27 · 2030.55 / 14.57 ] = [1269.26, 3761.99]

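Given the two χ² critical values quoted above, the variance interval is a one-line calculation; a Python sketch I added for illustration (small differences from the slide's endpoints come only from rounding):

    s2, df = 2030.55, 27
    chi2_lower_crit, chi2_upper_crit = 43.19, 14.57     # chi2(27) critical values from the slide

    ci = (df * s2 / chi2_lower_crit, df * s2 / chi2_upper_crit)
    print(tuple(round(v, 1) for v in ci))               # about (1269.4, 3762.9)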
Confidence intervals II, Feb 13, 2004 -4-


Statistical Tests

Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men,
all equally qualified. Further suppose that the company hired 2 women
and 8 men.

Question:
◦ Does the company discriminate against female job applicants?
◦ How likely is this outcome under the assumption that the company
does not discriminate?

Example:

◦ Study success of new elaborate safety program


◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11

Question:
◦ Does the safety program have an effect on the loss of labor due to accidents?
◦ In 9 out of 10 plants the average weekly losses have decreased after
implementation of the safety program. How likely is this (or a more
extreme) outcome under the assumption that there is no difference
before and after implementation of the safety program?

Testing Hypotheses I, Feb 16, 2004 -1-


Statistical Tests

Example: Fair coin


Suppose we have a coin. We suspect it might be unfair. We devise a
statistical experiment:
◦ Toss coin 100 times
◦ Conclude that coin is fair if we see between 40 and 60 heads
◦ Otherwise decide that the coin is not fair

Let θ be the probability that the coin lands heads, that is,

P(Xi = 1) = θ and P(Xi = 0) = 1 − θ.


Our suspicion (“coin not fair”) is a hypothesis about the population pa-
rameter θ (θ ≠ 1/2) and thus about P. We emphasize this dependence of P
on θ by writing Pθ .

Decision problem:
Null hypothesis H0 :         X ∼ Bin(100, 1/2)
Alternative hypothesis Ha :  X ∼ Bin(100, θ), θ ≠ 1/2

The null hypothesis represents the default belief (here: the coin is fair).
The alternative is the hypothesis we accept in view of evidence against the
null hypothesis.
The data-based decision rule

reject H0           if X ∉ [40, 60]
do not reject H0    if X ∈ [40, 60]

is called a statistical test for the test problem H0 vs. Ha .

Testing Hypotheses I, Feb 16, 2004 -2-


Statistical Tests

Example: Fair coin (contd)


Note: It is possible to obtain e.g. X = 55 (or X = 65)
◦ with probability 0.048 (resp. 0.0009) if p = 0.5
◦ with probability 0.048 (resp. 0.049) if p = 0.6
◦ with probability 0.0005 (resp. 0.047) if p = 0.7

[Figure: frequency functions of Bin(100, 0.5), Bin(100, 0.6), and Bin(100, 0.7),
with the acceptance region [40, 60] and the rejection region of the test
of H0 marked.]

Testing Hypotheses I, Feb 16, 2004 -3-


Types of errors

Example: Fair coin (contd)


It is possible that the test (decision rule) gives a wrong answer:
◦ If θ = 0.7 and x = 55, we do not reject the null hypothesis that the
coin is fair although the coin in fact is not fair.
◦ If θ = 0.5 and x = 65, we reject the null hypothesis that the coin is fair
although the coin in fact is fair.

The following table lists the possibilities:

Decision H0 true H0 false


Reject H0 type I error correct decision
Accept H0 correct decision type II error

Definition (Types of error)


◦ If we reject H0 when in fact H0 is true, this is a Type I error.
◦ If we do not reject H0 when in fact H0 is false, this is a Type II error.

Testing Hypotheses I, Feb 16, 2004 -4-


Types of errors

Question: How good is our decision rule?


For a good decision rule, the probability of committing an error of either
type should be small.

Probability of type I error: α


If the null hypothesis is true, i.e. θ = 1/2, then

  Pθ(reject H0) = Pθ(X ∉ [40, 60])
               = 1 − Pθ(X ∈ [40, 60])
               = 1 − Σ_{x=40}^{60} C(100, x) (1/2)^100 = 0.035.

Thus the probability of a type I error, denoted as α, is 3.5%.
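In STATA this probability can be obtained from the cumulative binomial distribution (a sketch; binomial(n, k, p) returns P(X ≤ k)):

. display 1 - binomial(100, 60, 0.5) + binomial(100, 39, 0.5)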


Probability of type II error: β(θ)
If the null hypothesis is false and the true probability of observing “head”
is θ with θ ≠ 1/2, then

  Pθ(accept H0) = Pθ(X ∈ [40, 60]) = Σ_{x=40}^{60} C(100, x) θ^x (1 − θ)^{100−x}

Thus, the probability of an error of type II depends on θ. It will be denoted as β(θ).

Testing Hypotheses I, Feb 16, 2004 -5-


Power of Tests

Question: How good is our test in detecting the alternative?


Consider the probability of rejecting H0

  Pθ(reject H0) = Pθ(X ∉ [40, 60]) = 1 − Pθ(accept H0) = 1 − β(θ).

Note:
◦ If θ = 1/2, this is the probability of committing an error of type I:
  1 − β(1/2) = α
◦ If θ ≠ 1/2, this is the probability of correctly rejecting H0.

Definition (Power of a test)


We call 1 − β(θ) the power of the test as it measures the ability to
detect that the null hypothesis is false.

(Figure: power curve 1 − β(θ) over θ ∈ [0, 1] for the test that rejects if X ∉ [40, 60].)
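Under the same assumption about the binomial() function, the power at individual values of θ can be sketched the same way, e.g.

. display 1 - binomial(100, 60, 0.6) + binomial(100, 39, 0.6)
. display 1 - binomial(100, 60, 0.7) + binomial(100, 39, 0.7)

which give roughly 0.46 and 0.98, consistent with the shape of the power curve.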

Testing Hypotheses I, Feb 16, 2004 -6-


Significance Tests

Idea: minimize probability of committing an error of type I and II


Different probabilities of type I error

(Figure: power curves 1 − β(θ) for the tests that reject if X ∉ [40, 60], X ∉ [38, 62], and X ∉ [42, 58].)

Note: If we decrease the probability of a type I error,


◦ the power of the test, 1 − β(θ), decreases as well and
◦ the probability of a type II error increases.
Problem: cannot minimize both errors simultaneously

Solution:
◦ choose fixed level α for probability of a type I error
◦ under this restriction find test with small probability of a type II error
Remark:
◦ you do not have to do this minimization yourself.
◦ all tests taught in this course are of this kind.
Definition
A test of this kind is called a significance test with significance level α.

Testing Hypotheses I, Feb 16, 2004 -7-


Statistical Hypotheses

A statistical hypothesis is an assertion or conjecture about a population,


which may be expressed in terms of
◦ some parameter: mean is zero;
◦ some parameters: mean and median are identical; or
◦ some sampling distribution: this sample is normally distributed.

Test problem - decide between two hypotheses


◦ the null hypothesis H0 and
◦ the alternative hypothesis Ha .

Popperian approach to scientific theories


◦ Scientific theories are subject to falsification.
◦ It is impossible to verify a scientific theory.

Null hypothesis H0
default (current) theory which we try to falsify

Alternative hypothesis Ha
alternative to adopt if null hypothesis is rejected

Examples:

◦ Clinical study of new drug - H0 : drug has no effect


◦ Criminal case - H0 : suspect is not guilty
◦ Safety test of nuclear power station - H0 : power station is not safe
◦ Chances of new investment - H0 : project not profitable
◦ Testing for independence - H0 : random variables are independent

Testing Hypotheses II, Feb 18, 2004 -1-


Statistical Tests

Example: Testing for pesticide in discharge water


Suppose the Environmental Protection Agency takes 10 readings on the
amount of pesticide in the discharge water of a chemical company.
Question: Does the concentration cP of pesticide in the water exceed the
allowed maximum concentration c0 ?
◦ Before taking action against the company, the agency must have some
evidence that the concentration cP exceeds the allowed level.
◦ Without evidence the agency assumes that the pesticide concentration
cP is within the limits of the law.
Consequently, the null hypothesis of the agency is that the pesticide con-
centration cP does not exceed c0 . Thus the question corresponds to the
test problem

H0 : cP ≤ c0 vs Ha : cP > c0 .

Suppose that the company regularly also runs tests on the amount of pes-
ticide in the discharge water.
Question: Does the concentration cP of pesticide in the water exceed the
allowed maximum concentration c0 ?
◦ The aim of the company is to avoid fines for exceeding the allowed
level. Thus the company wants to make sure that the concentration
stays within the allowed limits.
Thus, the null hypothesis of the company should be that the pesticide
concentration cP exceeds c0 . The question now corresponds to the test
problem

H0 : cP ≥ c0 vs Ha : cP < c0 .

Testing Hypotheses II, Feb 18, 2004 -2-


Six Steps of Conducting a Test

Steps of a significance test

1. Determine null hypothesis H0 and alternative Ha .

2. Decide on probability of type I error, the significance level α.

3. Find an appropriate test statistic T .

4. Based on the sampling distribution of T , formulate a criterion for


testing H0 against Ha .

5. Calculate value of the test statistic T .

6. Decide whether or not to reject the null hypothesis H0 .

Example: Fair coin (contd)


We want to decide from 100 tosses of a coin whether it is fair or not. Let
θ be the probability of heads.

1. Test problem:
   H0 : θ = 1/2 vs Ha : θ ≠ 1/2

2. Significance level:
α = 0.05 (most commonly used significance level)

3. Test statistic:
T =X (number of heads in 100 tosses of the coin)

4. Rejection criterion:
reject H0 if T ∉ [40, 60]

5. Observed value of test statistic: Suppose after 100 tosses we obtain


t = 55

6. Decision: Since 55 does not lie in the rejection region, we


do not reject H0.
Testing Hypotheses II, Feb 18, 2004 -3-
One and Two-sided Hypotheses

Example: Blood cholesterol after a heart attack


Suppose we are interested in whether the blood cholesterol level two days
after a heart attack differs from the average cholesterol level in the (general)
population (µ0 = 193).
Two cases:
◦ We are interested in any difference from the population mean µ0 . Then
we have a two-sided test problem

H0 : µY1 = µ0 vs Ha : µY1 ≠ µ0.

◦ We suspect that the cholesterol level after a heart attack might be


higher than in the general population. In this case, we have a one-sided
test problem

H0 : µY1 = µ0 vs Ha : µY1 > µ0.

Remark:
◦ More generally, we might be interested in one-sided test problems of
the form

H0 : µY1 ≤ µ0 vs Ha : µY1 > µ0,

which accounts for the possibility that µ might be smaller than µ0 .


◦ For all common test situations (in particular those discussed in this
course), the form of the test does not depend on the form of H0 , but
only on the parameter value in H0 that is closest to Ha , that is µ0 .

Testing Hypotheses II, Feb 18, 2004 -4-


Test Statistic

Let θ be the parameter of interest.


Two-sided test problem

H0 : θ = θ0 against Ha : θ ≠ θ0

One-sided test problem

H0 : θ = θ0 against Ha : θ > θ0 (or Ha : θ < θ0)

Suppose that θ̂ is an estimate for θ.


◦ If θ = θ0 (null hypothesis), we expect the estimate θ̂ to take a value
near θ0.
◦ Large deviations from θ0 are evidence against H0 .
This suggests the following decision rules:
◦ Ha : θ > θ0: reject H0 if θ̂ − θ0 is much larger than zero
◦ Ha : θ < θ0: reject H0 if θ̂ − θ0 is much smaller than zero
◦ Ha : θ ≠ θ0: reject H0 if |θ̂ − θ0| is much larger than zero
Problem: Often the sampling distribution of the estimate θ̂ depends on the
unknown parameter θ.
Definition (Test statistic)
A test statistic is a random variable
◦ that measures the compatibility between the null hypothesis and the
data and
◦ has a sampling distribution which we know (under H0 ).

Testing Hypotheses II, Feb 18, 2004 -5-


Test Statistic

Example: Blood cholesterol after a heart attack


Data: X1, . . . , X28
◦ blood cholesterol level of 28 patients two days after a heart attack
◦ assumed to be normally distributed with mean µX and variance σ²_X
The parameter µX can be estimated by the sample mean
  X̄ = (1/28) Σ_{i=1}^{28} Xi ∼ N(µX , σ²_X/28).

This suggests using the standardized sample mean as a test statistic,

  (X̄ − µ0)/(σ/√28) ∼ N(0, 1) (under H0).

Test H0 : µ ≤ 193 vs Ha : µ > 193 at significance level α = 0.05


◦ Test statistic: Assume σ = 47.7 to be known.
  T = (X̄ − µ0)/(σ/√28)

◦ Rejection criterion: Reject H0 if T > z0.05 = 1.645


◦ Outcome of test: Since the observed value of T is
  t = (253.9 − 193)/(47.7/√28) = 6.76,

we reject the null hypothesis that µ = 193.
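The observed value can be reproduced with display (a sketch):

. display (253.9 - 193)/(47.7/sqrt(28))

which gives 6.76, well above the critical value 1.645.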

Testing Hypotheses II, Feb 18, 2004 -6-


Tests for the Mean

Tests for the mean µ (σ 2 known):


◦ Test statistic:
  T = (X̄ − µ0)/(σ/√n)
◦ Two sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > zα/2
◦ One sided tests:
H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0 )
reject H0 if T > zα (T < −zα )

Tests for the mean µ (σ 2 unknown):


◦ Test statistic:
  T = (X̄ − µ0)/(s/√n)
◦ Two sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > tn−1,α/2
◦ One sided tests:
H0 : µ = µ0 against Ha : µ > µ0 (µ < µ0 )
reject H0 if T > tn−1,α (T < −tn−1,α )

Example: Blood cholesterol after a heart attack


Estimating the standard deviation from the data, we obtain the test statis-
tic
  T = (X̄ − µ0)/(s/√28) ∼ t_27.

Noting that t27,0.05 = 1.703 and t = 6.76, we still reject H0 .
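With the summary statistics the same t test can be run via the immediate command ttesti (a sketch; 253.9 and 47.7 are the sample mean and standard deviation used above):

. ttesti 28 253.9 47.7 193

The one-sided P-value appears in the output under Ha: mean > 193.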

Testing Hypotheses II, Feb 18, 2004 -7-


Tests and Confidence Intervals

Consider a level α significance test for the two-sided test problem

  H0 : θ = θ0 vs Ha : θ ≠ θ0.

Let
◦ T = Tθ0 (X) be the test statistic of the test (depends on θ0)
◦ R be the critical region of the test

Then

  C(X) = {θ : Tθ(X) ∉ R}

is a (1 − α) confidence interval for θ: If θ is the true parameter, then

  Pθ(θ ∈ C(X)) = Pθ(Tθ(X) ∉ R) = 1 − Pθ(Tθ(X) ∈ R) = 1 − α.

We have

  θ0 ∈ C(X) ⇔ Tθ0(X) ∉ R ⇔ H0 is not rejected

Result A level α two-sided significance test rejects the null hypothesis


H0 : θ = θ0 if and only if the parameter θ0 falls outside a (1 − α)
confidence interval for θ.

Example: Normal distribution


Let X1, . . . , Xn iid∼ N(µ, σ²). We reject H0 : µ = µ0 if

  |X̄ − µ0| / (s/√n) > t_{n−1,α/2}

or equivalently

  |X̄ − µ0| > t_{n−1,α/2} · s/√n.

Rearranging terms, we find that we reject if

  µ0 ∉ [ X̄ − t_{n−1,α/2} · s/√n , X̄ + t_{n−1,α/2} · s/√n ].
Testing Hypotheses II, Feb 18, 2004 -8-
The P -value

Definition (P -value)
The probability that under the null hypothesis H0 the test statistic
would take a value as extreme as or more extreme than that actually
observed is called the P -value of the test.

The P -value is often interpreted as a measure of the strength of evidence


against the null hypothesis: the smaller the P -value, the stronger the evi-
dence.
However:
◦ The P -value is a random variable (under H0 uniformly distr. on [0, 1]).
◦ Without a measure of its variability it is not safe to interpret the actu-
ally observed P -value.
◦ If the P -value is smaller than the chosen significance level α, we reject
the null hypothesis H0 .

Three approaches to deciding on test problem:


◦ reject if θ0 ∉ C(X)
/ C(X)
◦ reject if T (X) ∈ R
◦ reject if P -value p ≤ α

Example: Blood cholesterol after a heart attack


The observed value of the test statistic

  T = (X̄ − µ0)/(s/√28) ∼ t_27

is t = 6.76. The corresponding P -value is

  P(T > 6.76) = 1.47 · 10^−7.


We thus reject the null hypothesis.
Equivalently, the confidence interval for µ is [235.43, 272.42]. Since it does
not contain µ0 = 193 we reject H0 (for the third and last time!).
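The P -value can be checked with the ttail() function (a sketch):

. display ttail(27, 6.76)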
Testing Hypotheses II, Feb 18, 2004 -9-
Example

Data: Banks’ net income

◦ percent change in net income between first half of last year and first
half of this year
◦ sample mean x̄ = 8.1%
◦ sample standard deviation s = 26.4%

Test problem: H0 : µ = 0 against Ha : µ 6= 0


. ttesti 110 8.1 26.4 0
One-sample t test
------------------------------------------------------------------
| Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
----+-------------------------------------------------------------
x | 110 8.1 2.517141 26.4 3.111108 13.08889
------------------------------------------------------------------
Degrees of freedom: 109
Ho: mean(x) = 0
Ha: mean < 0 Ha: mean != 0 Ha: mean > 0
t = 3.2179 t = 3.2179 t = 3.2179
P < t = 0.9991 P > |t| = 0.0017 P > t = 0.0009

Critical value of t distribution with 109 degrees of freedom:

t109,0.025 = 1.982

Result:
◦ |t| > t109,0.025, therefore the test rejects H0 at significance level α = 0.05.
◦ Equivalently, µ0 = 0 ∉ [3.11, 13.09] and thus the test rejects H0.
◦ Equivalently, P -value is less than α = 0.05 and thus the test rejects H0 .

Testing Hypotheses II, Feb 18, 2004 - 10 -


Exact Binomial Test

Example: Fair coin


Data: 100 tosses of a coin which we suspect might be unfair.
Modelling:
◦ θ is the probability that the coin lands heads up
◦ X is the number of heads in 100 tosses of the coin
◦ X is binomially distributed with parameters n and θ.
Decision problem:
◦ Null hypothesis H0 : coin is fair
◦ Alternative hypothesis Ha : coin is unfair

Test problem:
  H0 : θ = 1/2 vs Ha : θ ≠ 1/2.
Under the null hypothesis H0, the distribution of X is known,
  X ∼ Bin(100, 1/2).
Reject null hypothesis if

  X ∉ [ b_{100,0.5,0.975} , b_{100,0.5,0.025} ] = [40, 60],

where b_{n,θ,α} is the α fractile of Bin(n, θ).

Note:
◦ Exact binomial tests typically have an actual significance level smaller
than the nominal α due to the discreteness of the distribution.
◦ In the above example, the probability of a type I error is

P(reject H0) = α = 0.035.
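In STATA the exact binomial test is available as an immediate command (a sketch):

. bitesti 100 55 0.5

bitesti reports exact one- and two-sided P-values for the observed count (here 55 heads in 100 tosses).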

Testing Hypotheses III, Feb 20, 2004 -1-


Sign Test

Example: Safety program


◦ Study success of new elaborate safety program
◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11

Question:
◦ Does the safety program have an effect on the loss of labour due to accidents?
The Sign Test for matched pairs

◦ Ignore pairs with difference 0


◦ Number of trials n is the count of the remaining pairs
◦ The test statistic is the count X of pairs with positive difference
◦ X is binomially distributed with parameters n and θ.
◦ Null hypothesis H0 : θ = 1/2 (i.e. the median of the differences is zero)

Example:
For the safety program data, we find
◦ n = 10, X = 9
◦ Test H0 : θ = 1/2 against Ha : θ > 1/2
◦ The P -value of the observed count X is

  P(X ≥ 9) = C(10, 9) (1/2)^10 + (1/2)^10 = 0.0107
Since the P -value is smaller than α = 0.05 we reject the null hypothesis H0
that the safety program has no effect on the loss of labour due to accidents.
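The same P -value can be obtained with the immediate binomial test (a sketch):

. bitesti 10 9 0.5

The one-sided probability Pr(k ≥ 9) in the output corresponds to the 0.0107 computed above.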

Testing Hypotheses III, Feb 20, 2004 -2-


Tests for Proportions

Example: Blood cholesterol after a heart attack


Suppose we are interested in the proportion p of patients who show a
decrease of cholesterol level between the second and the 14th day after a
heart attack.
The proportion p can be estimated by the sample proportion
  p̂ = X/n
where X is the number of patients whose cholesterol level decreased.
Question: Does a decrease occur more often than an increase?
Test problem: H0 : p = 1/2 vs Ha : p > 1/2

Exact tests:
Since X is binomially distributed, we can use exact binomial tests.

Large sample approximations:


Facts: ◦ E(p̂) = p
       ◦ var(p̂) = p(1 − p)/n
       ◦ (p̂ − p)/√(p(1 − p)/n) ≈ N(0, 1) (for large n)

Under the null hypothesis H0 , we get


  T = (p̂ − p0)/√(p0(1 − p0)/n) ≈ N(0, 1).

Hence, we reject H0 if T > zα .

Example: Blood cholesterol after a heart attack

◦ n = 28, x = 22, p̂ ≈ 0.79, α = 0.05, z_0.05 = 1.645
◦ t = (0.79 − 0.5)/√(0.79 · 0.21/28) = 3.7675
◦ P-value: P(T > t) = 8.24 · 10^−5.
Testing Hypotheses III, Feb 20, 2004 -3-
Confidence Intervals for Proportions

Exact binomial confidence intervals


◦ difficult to compute
◦ use statistics software

Example: Blood cholesterol after a heart attack


◦ 28 patients in the study
◦ 22 showed a decrease in cholesterol level between second and 14th day
after the attack

Computation of an exact binomial confidence interval in STATA:


. cii 28 22
-- Binomial Exact --
Variable | Obs Mean Std. Err. [95% Conf. Interval]
---------+-----------------------------------------------------------
| 28 .7857143 .0775443 .590469 .9170394

Testing Hypotheses III, Feb 20, 2004 -4-


Confidence Intervals for Proportions

Large sample approximations


The CLT states that for large n, p̂ is approximately normally distributed,

  p̂ ≈ N( p, p(1 − p)/n )
Problems:
◦ variance is unknown
◦ estimate p̂(1 − p̂)/n is zero if p̂ = 0 or p̂ = 1
Example: What is the proportion of HIV+ students at the UofC?
◦ Random sample of 100 students
◦ None test positive for HIV
Are you absolutely sure that there are no HIV+ students at the UofC?
Idea: Estimate p by
  p̃ = (X + 2)/(n + 4)   (Wilson estimate)
and use
  [ p̃ − z_{α/2} √(p̃(1 − p̃)/(n + 4)) , p̃ + z_{α/2} √(p̃(1 − p̃)/(n + 4)) ]

as a (1 − α) confidence interval for p
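For the HIV example above (X = 0 successes in n = 100), the Wilson estimate and interval can be sketched with display (using z_{0.025} ≈ 1.96; the lower endpoint is truncated at 0):

. display "p_tilde = " 2/104
. display "lower   = " 2/104 - 1.96*sqrt((2/104)*(1 - 2/104)/104)
. display "upper   = " 2/104 + 1.96*sqrt((2/104)*(1 - 2/104)/104)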

Example: Blood cholesterol after a heart attack


. cii 28 22, wilson
------ Wilson ------
Variable | Obs Mean Std. Err. [95% Conf. Interval]
---------+-----------------------------------------------------------
| 28 .7857143 .0775443 .6046141 .8978754

Testing Hypotheses III, Feb 20, 2004 -5-


Paired Samples

Example: Safety program

◦ Study success of new elaborate safety program


◦ Record average weekly losses in hours of labor due to accidents before
and after installation of the program in 10 industrial plants

Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11

Question: Does the safety program have a positive effect?


Approach:
◦ Consider differences before and after implementation of the program:
  Di = Xi(before) − Xi(after)   (decrease in losses of work)

◦ The Di ’s are approximately normal,
  Di iid∼ N(µ, σ²)
◦ H0 : µ = 0 against Ha : µ > 0
◦ Significance level α = 0.01
◦ One-sample t test:
  T = D̄/(s/√n)
  Reject if T > t_{n−1,α}

(Figure: normal quantile plot of the decrease in losses of work against normal quantiles.)

Result:
◦ d̄ = 10.27, s = 7.98, n = 10
◦ t = 4.07 and t9,0.01 = 2.82, P -value: 0.0014
◦ Test rejects H0 at significance level α = 0.01
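The same test can be reproduced from the summary statistics with ttesti (a sketch; the one-sided P-value appears under Ha: mean > 0):

. ttesti 10 10.27 7.98 0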

Testing Hypotheses III, Feb 20, 2004 -6-


Paired Sample t Test

Data: (X1 , Y1), . . . , (Xn, Yn)


Assumptions:
◦ Pairs are independent
◦ Di = Xi − Yi iid∼ N(µ, σ²)
◦ Apply one-sample t test

Paired sample t test


◦ Test statistic
  T = (D̄ − µ0)/(s/√n)
◦ Two-sided test:
H0 : µ = µ0 against Ha : µ ≠ µ0
reject H0 if |T | > tn−1,α/2
◦ One-sided test:
H0 : µ = µ0 against Ha : µ > µ0
reject H0 if T > tn−1,α

Power of the paired sample t test and the paired sign test:

(Figure: power curves 1 − β(δ) of the paired t test and the paired sign test as functions of the shift δ.)

Testing Hypotheses III, Feb 20, 2004 -7-


Sign and t Test

t test:
◦ based on Central Limit Theorem
◦ reasonably robust against departures from normality
◦ do not use if n is small and
⋄ data are strongly skewed or
⋄ data have clear outliers
Sign test:
◦ uses much less information than t test
◦ for normal data less powerful than t test
◦ makes no assumption on the distribution, so it keeps its significance level
regardless of the distribution
◦ preferable for very small data sets

Remark:
◦ The two-step procedure
1. assess normality by normal quantile plot
2. conduct either t test or sign test depending on result in step 1
does not attain the chosen significance level α (two tests!).
◦ The sign test is rarely used since there are more powerful distribution-
free tests.

Testing Hypotheses III, Feb 20, 2004 -8-


Two Sample Problems

Two sample problems

◦ The goal of inference is to compare the responses in two groups.


◦ Each group is a sample from a different population.
◦ The responses in each group are independent of those in the other
group.

Example: Effects of ozone


Study the effects of ozone by controlled randomized experiment
◦ 55 70-day-old rats were randomly assigned to treatment or control groups
◦ Treatment group: 22 rats were kept in an environment containing ozone.
◦ Control group: 23 rats were kept in an ozone-free environment
◦ Data: Weight gains after 7 days

We are interested in the difference in weight gain between the treatment and control group.

Question: Do the weight gains differ between groups?

◦ x1, . . . , x22 - weight gains for treatment group
◦ y1, . . . , y23 - weight gains for control group
◦ Test problem:
  H0 : µX = µY vs Ha : µX ≠ µY
◦ Idea: Reject the null hypothesis if |x̄ − ȳ| is large.

(Figure: weight gains in grams for the treatment and control groups.)

Two Sample Tests, Feb 23, 2004 -1-


Comparing Means

Let X1 , . . . , Xm and Y1 , . . . , Yn be two independent normally distributed


samples. Then
  X̄ − Ȳ ∼ N( µX − µY , σ²_X/m + σ²_Y/n )

Two-sample t test
◦ Two-sample t statistic
  T = (X̄ − Ȳ)/√(s²_X/m + s²_Y/n)

Distribution of T can be approximated by t distribution


◦ Two-sided test:
H0 : µX = µY against Ha : µX ≠ µY
reject H0 if |T | > tdf,α/2
◦ One-sided test:
H0 : µX = µY against Ha : µX > µY
reject H0 if T > tdf,α
◦ Degrees of freedom:
  ◦ Approximations for df provided by statistical software
  ◦ Satterthwaite approximation
      df = (s²_X/m + s²_Y/n)² / [ (1/(m−1))(s²_X/m)² + (1/(n−1))(s²_Y/n)² ]
    commonly used, conservative approximation
  ◦ Otherwise: use df = min(m − 1, n − 1)

Two Sample Tests, Feb 23, 2004 -2-


Comparing Means

Example: Effects of ozone


Data:
◦ Treatment group: x̄ = 11.01, sX = 19.02, m = 22
◦ Control group: ȳ = 22.43, sY = 10.78, n = 23
Test problem:
◦ H0 : µX = µY vs Ha : µX ≠ µY
◦ α = 0.05, df = min(m − 1, n − 1) = 21, t21,0.025 = 2.08
The value of the test statistic is
  t = (x̄ − ȳ)/√(s²_X/m + s²_Y/n) = −2.46

The corresponding P-value is

P(|T | ≥ |t|) = P(|T | ≥ 2.46) = 0.023


Thus we reject the hypothesis that ozone has no effect on weight gain.
Two-sample t test with STATA:
. ttest weight, by(group) unequal
Two-sample t test with unequal variances
----------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+------------------------------------------------------------------
0 | 23 22.42609 2.247108 10.77675 17.76587 27.0863
1 | 22 11.00909 4.054461 19.01711 2.577378 19.4408
---------+------------------------------------------------------------------
combined | 45 16.84444 2.422057 16.24765 11.96311 21.72578
---------+------------------------------------------------------------------
diff | 11.417 4.635531 1.985043 20.84895
----------------------------------------------------------------------------
Satterthwaite’s degrees of freedom: 32.9179
Ho: mean(0) - mean(1) = diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
t = 2.4629 t = 2.4629 t = 2.4629
P < t = 0.9904 P > |t| = 0.0192 P > t = 0.0096
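From the group summaries alone, the same test and the Satterthwaite degrees of freedom can be reproduced (a sketch using the immediate command ttesti and display):

. ttesti 22 11.01 19.02 23 22.43 10.78, unequal
. display ((19.02^2/22 + 10.78^2/23)^2)/((19.02^2/22)^2/21 + (10.78^2/23)^2/22)

The last line returns approximately 32.9, matching Satterthwaite’s degrees of freedom in the output.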

Two Sample Tests, Feb 23, 2004 -3-


Comparing Means
Suppose that σ²_X = σ²_Y = σ². Then
  σ²/m + σ²/n = σ²(1/m + 1/n).
Estimate σ² by the pooled sample variance
  s²_p = [ (m − 1)s²_X + (n − 1)s²_Y ] / (m + n − 2).

Pooled two-sample t test


◦ Two-sample t statistic
  T = (X̄ − Ȳ)/( s_p √(1/m + 1/n) )

T is t distributed with m + n − 2 degrees of freedom.


◦ Two-sided test:
H0 : µX = µY against Ha : µX ≠ µY
reject H0 if |T | > tm+n−2,α/2
◦ One-sided test:
H0 : µX = µY against Ha : µX > µY
reject H0 if T > tm+n−2,α
Remarks:
◦ If m ≈ n, the test is reasonably robust against
◦ nonnormality and
◦ unequal variances.
◦ If sample sizes differ a lot, test is very sensitive to unequal variances.
◦ Tests for differences in variances are sensitive to nonnormality.

Two Sample Tests, Feb 23, 2004 -4-


Comparing Means

Example: Parkinson’s disease


Study on Parkinson’s disease

◦ Parkinson’s disease, among other things, affects a person’s ability to speak
◦ Overall condition can be improved by an operation
◦ How does the operation affect the ability to speak?
◦ Treatment group: Eight patients received the operation
◦ Control group: Fourteen patients
◦ Data:
  ⋄ scores on several tests
  ⋄ high scores indicate problems with speaking

(Figure: speaking-ability scores for the treatment and control groups.)

Pooled two-sample t test with STATA:

. infile ability group using parkinson.txt


. ttest ability, by(group)
Two-sample t test with equal variances
---------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+-----------------------------------------------------------------
0 | 14 1.821429 .148686 .5563322 1.500212 2.142645
1 | 8 2.45 .14516 .4105745 2.106751 2.793249
---------+-----------------------------------------------------------------
combined | 22 2.05 .1249675 .5861497 1.790116 2.309884
---------+-----------------------------------------------------------------
diff | -.6285714 .2260675 -1.10014 -.1570029
---------------------------------------------------------------------------
Degrees of freedom: 20
Ho: mean(0) - mean(1) = diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
t = -2.7805 t = -2.7805 t = -2.7805
P < t = 0.0058 P > |t| = 0.0115 P > t = 0.9942

Two Sample Tests, Feb 23, 2004 -5-


Comparing Variances

Example: Parkinson’s disease


In order to apply the pooled two-sample t test, the variances of the two
groups have to be equal. Are the data compatible with this assumption?

F test for equality of variances


The F test statistic
  F = s²_X / s²_Y
is, under H0 : σ²_X = σ²_Y, F distributed with m − 1 and n − 1 degrees of freedom.
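For the Parkinson data the observed ratio and its upper-tail probability can be computed directly (a sketch using the standard deviations reported in the output below and the Ftail() function):

. display (.5563322/.4105745)^2
. display Ftail(13, 7, (.5563322/.4105745)^2)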

. sdtest ability, by(group)


Variance ratio test
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
0 | 14 1.821429 .148686 .5563322 1.500212 2.142645
1 | 8 2.45 .14516 .4105745 2.106751 2.793249
---------+--------------------------------------------------------------------
combined | 22 2.05 .1249675 .5861497 1.790116 2.309884
------------------------------------------------------------------------------
Ho: sd(0) = sd(1)
F(13,7) observed = F_obs = 1.836
F(13,7) lower tail = F_L = 1/F_obs = 0.545
F(13,7) upper tail = F_U = F_obs = 1.836
Ha: sd(0) < sd(1) Ha: sd(0) != sd(1) Ha: sd(0) > sd(1)
P < F_obs = 0.7865 P < F_L + P > F_U = 0.3767 P > F_obs = 0.2135

Result: We cannot reject the null hypothesis that the variances are equal.
Problem: Are the data normally distributed?

(Figure: normal quantile plots of speaking ability for the treatment group and the control group against theoretical quantiles.)

Two Sample Tests, Feb 23, 2004 -6-


Comparing Proportions

Suppose we have two populations with unknown proportions p1 and p2.


◦ Random samples of size n1 and n2 are drawn from the two population
◦ p̂1 is the sample proportion for the first population
◦ p̂2 is the sample proportion for the second population

Question: Are the two proportions p1 and p2 different?

Test problem:

H0 : p1 = p2 vs Ha : p1 ≠ p2

Idea: Reject H0 if |p̂1 − p̂2| is large.


Note that
 
  p̂1 − p̂2 ≈ N( p1 − p2 , p1(1 − p1)/n1 + p2(1 − p2)/n2 )

This suggests the test statistic


  T = (p̂1 − p̂2)/√( p̂(1 − p̂)(1/n1 + 1/n2) )

where p̂ is the combined proportion of successes in both samples


  p̂ = (X1 + X2)/(n1 + n2) = (n1 p̂1 + n2 p̂2)/(n1 + n2)

with X1 and X2 denoting the number of successes in each sample.


Under H0 , the test statistic is approximately standard normally dis-
tributed.

Two Sample Tests, Feb 23, 2004 -7-


Comparing Proportions

Example: Question wording


The ability of question wording to affect the outcome of a survey can be a
serious issue. Consider the following two questions:

1. Would you favor or oppose a law that would require a person to obtain
a police permit before purchasing a gun?

2. Would you favor or oppose a law that would require a person to obtain
a police permit before purchasing a gun, or do you think such a law
would interfere too much with the right of citizens to own guns?

In two surveys, the following results were obtained:


Question Yes No Total
1 463 152 615
2 403 182 585

Question: Is the true proportion of people favoring the permit law the
same in both groups or not?

. prtesti 615 0.753 585 0.689


Two-sample test of proportion x: Number of obs = 615
y: Number of obs = 585
--------------------------------------------------------------------------
Variable | Mean Std. Err. z P>|z| [95% Conf. Interval]
---------+----------------------------------------------------------------
x | .753 .0173904 .7189155 .7870845
y | .689 .0191387 .6514889 .7265111
---------+----------------------------------------------------------------
diff | .064 .0258595 .0133163 .1146837
| under Ho: .0258799 2.47 0.013
--------------------------------------------------------------------------
Ho: proportion(x) - proportion(y) = diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
z = 2.473 z = 2.473 z = 2.473
P < z = 0.9933 P > |z| = 0.0134 P > z = 0.0067
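The z statistic can be reproduced by hand from the counts (a sketch; 463/615 ≈ 0.753 and 403/585 ≈ 0.689):

. display (463 + 403)/(615 + 585)
. display (463/615 - 403/585)/sqrt(((463 + 403)/1200)*(1 - (463 + 403)/1200)*(1/615 + 1/585))

The second line gives approximately 2.47, in agreement with the output above.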

Two Sample Tests, Feb 23, 2004 -8-


Final Remarks

Statistical theory focuses on the significance level, the probability of a type


I error.
In practice, a discussion of the power of the test is also important:
Example: Efficient Market Hypothesis
“Efficient market hypothesis” for stock prices:
◦ future stock prices show only random variation
◦ market incorporates all information available now in present prices
◦ no information available now will help to predict future stock prices

Testing of the efficient market hypothesis:


◦ Many studies tested
H0: Market is efficient
Ha : Prediction is possible
◦ Almost all studies failed to find good evidence against H0.
◦ Consequently the efficient market hypothesis became quite popular.

Problem:
◦ Power was generally low in the significance tests employed in the stud-
ies.
◦ Failure to reject H0 is no evidence that H0 is true.
◦ More careful studies showed that the size of a company and measures
of value such as ratio of stock price to earnings do help predict future
stock prices.

Two Sample Tests, Feb 23, 2004 -9-


Final Remarks

Example

◦ IQ of 1000 women and 1000 men


◦ µ̂w = 100.68, σw = 14.91
◦ µ̂m = 98.90, σm = 14.68
◦ Pooled two-sample t test: T = −2.7009
◦ Reject H0 : µw = µm since |T | > t1998,0.005 = 2.58.
◦ The difference in the IQ is statistically significant at the 0.01 level.
◦ However we might conclude that the difference is scientifically irrele-
vant.

Note: Statistical significance at a small level does not mean there is a large difference,
but only that there is strong evidence that there is some difference.

Two Sample Tests, Feb 23, 2004 - 10 -


Final Remarks

Example: Is radiation from cell phones harmful?

◦ Observational study
◦ Comparison of brain cancer patients and similar group without brain
cancer
◦ No statistically significant association between cell phone use and a
group of brain cancers known as gliomas.
◦ Separate analyses for 20 types of gliomas found an association between
phone use and one rare form.
◦ Risk seemed to decrease with greater mobile phone use.
Think for a moment:
◦ Suppose all 20 null hypotheses are true.
◦ Each test has 5% chance of being significant - the outcome is Bernoulli
distributed with parameter 0.05.
◦ The number of false positive tests is binomially distributed:

N ∼ Bin(20, 0.05)

◦ The probability of getting one or more positive results is

P(N ≥ 1) = 1 − P(N = 0) = 1 − 0.9520 = 0.64.


We therefore might have expected at least one significant association.
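The calculation is a one-liner (a sketch):

. display 1 - 0.95^20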

Beware of searching for significance

Two Sample Tests, Feb 23, 2004 - 11 -


Final Remarks

Problem: If several tests are performed, the probability of a type I error


increases.

Idea: Adjust significance level of each single test.


Bonferroni procedure:
◦ Perform k tests
◦ Use significance level α/k for each of the k tests
◦ If all null hypotheses are true, the probability that any of the tests
rejects its null hypothesis is at most α.

Example
Suppose we perform k = 6 tests and obtain the following P -values:

P -values: 0.476, 0.032, 0.241, 0.008*, 0.010, 0.001*    (α/k = 0.05/6 ≈ 0.0083)

Only the two tests marked (*) are significant at the 0.05 level (after the Bonferroni adjustment).

Two Sample Tests, Feb 23, 2004 - 12 -


Two-Way Tables

Example: Depression and marital status


Question: Does severity of depression depend on marital status?
◦ Study of 159 depression patients
◦ Patients were categorized by
⋄ severity of depression (severe, normal, mild)
⋄ marital status (single, married, widowed/divorced)

The following two-way table summarizes the data:

Depression Marital Status Total


Single Married Wid/Div
Severe 16 22 19 57
Normal 29 33 14 76
Mild 9 14 3 26
Total 54 69 36 159

◦ Each combination of values defines a cell.


◦ The severity of depression is a row variable.
◦ The marital status is a column variable.

Inference for Two-Way Tables, Feb 25, 2004 -1-


Two-Way Tables

From this table of counts, the sample distribution can be obtained


by dividing each cell by the total sample size n = 159:

Depression Marital Status Total


Single Married Wid/Div
Severe 0.101 0.138 0.119 0.358
Normal 0.182 0.208 0.088 0.478
Mild 0.057 0.088 0.019 0.164
Total 0.340 0.434 0.226 1.000

◦ Joint distribution: proportion for each combination of values


◦ Marginal distribution: distribution of the row and column
variables separately.
◦ Conditional distribution: distribution of one variable at a
given level of the other variable

Inference for Two-Way Tables, Feb 25, 2004 -2-


Test for Independence

Example: Depression and marital status


Conditional distributions of severity of depression given marital
status:
(Figure: bar chart of the conditional sample proportions of severe, normal, and mild depression for single, married, and widowed/divorced patients.)

Question: Is there a relationship between the row variable (depression)


and the column variable (marital status)?

◦ The distribution for widowed/divorced patients seems to differ


from the distributions for single or married patients.
◦ Are these differences significant or can they be attributed to
chance variation?
◦ How likely are differences as large or larger than those observed
if the two variables were indeed independent (and thus the con-
ditional distribution were the same)?

A statistical test will be required to answer these questions.

Inference for Two-Way Tables, Feb 25, 2004 -3-


Test for Independence

Test problem:
H0 : the row and the column variables are independent
Ha : the row and the column variables are dependent
How can we measure evidence against the null hypothesis?
◦ What counts would we expect to observe if the null hypothesis
were true?
  Expected Cell Count = (row total × column total) / total count
Recall: For two independent events A and B, P(A ∩ B) = P(A) P(B).
If the null hypothesis H0 is true, then the table of expected
counts should be “close” to the observed table of counts.
◦ We need a statistic that measures the difference between the
tables.
◦ And we need to know what is the distribution of the statistic
to make statistical inference.

Inference for Two-Way Tables, Feb 25, 2004 -4-


Test for Independence

Idea of the test:


◦ construct table of expected counts
◦ compare expected with observed counts
◦ if the null hypothesis is true, the difference between the tables
should be “small”
The χ2 (Chi-Squared) Statistic
To measure how far the expected table is from the observed table,
we use the following test statistic:
  X = Σ_{all cells} (Observed − Expected)² / Expected
◦ Under the null hypothesis, X is approximately χ² distributed
with (r − 1)(c − 1) degrees of freedom.
Why (r − 1)(c − 1)?
Recall that our “expected” table is based on some quantities estimated
from the data: namely the row and column totals.
Once these totals are known, filling in any (r − 1)(c − 1) undetermined
table entries actually gives us the whole table. Thus, there are only
(r − 1)(c − 1) freely varying quantities in the table.

◦ We reject H0 if observed and expected counts are very different


and hence X is large. Consequently we reject H0 at significance
level α if

X ≥ χ²_{(r−1)(c−1),α}.

Inference for Two-Way Tables, Feb 25, 2004 -5-


The χ2 Distribution

What does the χ2 distribution look like?

(Figure: χ² densities for 1, 5, 10, 20, and 30 degrees of freedom.)

◦ Unlike the Normal or t distributions, the χ2 distribution takes


values in (0, ∞).
◦ As with the t distribution, the exact shape of the χ2 distribution
depends on its degrees of freedom.

Recall that X has only an approximate χ2(r−1)(c−1) distribution.


When is the approximation valid?

◦ For any two-way table larger than 2 × 2, we require that the


average expected cell count is at least 5 and each expected count
is at least one.
◦ For 2×2 tables, we require that each expected count be at least
5.

Inference for Two-Way Tables, Feb 25, 2004 -6-


Test for Independence

Example: Depression and marital status


The following table show the observed counts and expected counts
(in brackets):

Depression Marital Status Total


Single Married Wid/Div
Severe 16 22 19 57
(19.36) (24.74) (12.90)
Normal 29 33 14 76
(25.81) (32.98) (17.21)
Mild 9 14 3 26
(8.83) (11.28) (5.89)
Total 54 69 36 159

◦ The table is 3 × 3, so there are (r − 1)(c − 1) = 2 × 2 = 4


degrees of freedom.
◦ The critical value (significance level α = 0.05) is χ24,0.05 = 9.49.
◦ The observed value of the χ2 test statistic is
  x = (16 − 19.36)²/19.36 + (22 − 24.74)²/24.74 + . . . + (3 − 5.89)²/5.89
    = 6.83 ≤ χ²_{4,0.05}

Thus we do not reject the null hypothesis of independence.


◦ The corresponding P-value is

P(X ≥ x) = P(X ≥ 6.83) = 0.145 ≥ α


Again we do not reject H0
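The P-value can be computed directly (a sketch using the chi2tail() function):

. display chi2tail(4, 6.83)

which returns 0.145, as in the STATA output on the next page.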

Inference for Two-Way Tables, Feb 25, 2004 -7-


Test for Independence

The χ2 test in STATA:


. insheet using depression.txt, clear
(3 vars, 159 obs)
. tabulate depression marital, chi2
| Marital
Depression | Married Single Wid/Div | Total
-----------+---------------------------------+----------
Mild | 14 9 3 | 26
Normal | 33 29 14 | 76
Severe | 22 16 19 | 57
-----------+---------------------------------+----------
Total | 69 54 36 | 159
Pearson chi2(4) = 6.8281 Pr = 0.145

The same result can be obtained by the command


. tabi 16 22 19 \ 29 33 14 \ 9 14 3, chi2
| col
row | 1 2 3 | Total
-----------+---------------------------------+----------
1 | 16 22 19 | 57
2 | 29 33 14 | 76
3 | 9 14 3 | 26
-----------+---------------------------------+----------
Total | 54 69 36 | 159
Pearson chi2(4) = 6.8281 Pr = 0.145

Inference for Two-Way Tables, Feb 25, 2004 -8-


Models for Two-Way Tables

The χ2 -test for the presence of a relationship between two distributions


in a two-way table is valid for data produced by several different study
designs, although the exact null hypothesis varies.
◦ Examining independence between variables
⋄ Select random sample of size n from a population.
⋄ Classify each individual according to two categorical variables.
Question: Is there a relationship between the two variables?
Test problem:
H0: The two variables are independent
Ha : The two variables are not independent
Example: Suppose we collect an SRS of 114 college students, and cate-
gorize each by major and GPA (e.g. (0, 0.5], . . . , (3.5, 4]). Then, we can
use the χ2 -test to ascertain whether grades and major are independent.
◦ Comparing several populations
⋄ Select independent random samples from each of c populations, of
sizes n1 , . . . , nc .
⋄ Classify each individual according to a categorical response variable
with r possible values (the same across populations),
⋄ This yields an r × c table.
Question: Does the distribution of the response variable differ be-
tween populations?
Test problem:
H0: The distribution is the same in all populations.
Ha : The distribution is not the same.
Example: Suppose we select independent SRSs of Psychology, Biology
and Math majors, of sizes 40, 39, 35, and classify each individual by
GPA range. Then, we can use a χ2 -test to ascertain whether or not the
distribution of grades is the same in all three populations.
Inference for Two-Way Tables, Feb 25, 2004 -9-
Models for Two-Way Tables

Example: Literary Analysis (Rice, 1995)


When Jane Austen died, she left the novel Sanditon only partially com-
pleted, but she left a summary of the remainder. A highly literate admirer
finished the novel, attempting to emulate Austen’s style, and the hybrid
was published. Someone counted the occurrences of various words in sev-
eral chapters from various works.

Austen Imitator
Sense and Emma Sanditon I Sanditon II
Word Sensibility
a 147 186 101 83
an 25 26 11 29
this 32 39 15 15
that 94 105 37 22
with 59 74 28 43
without 18 10 10 4
TOTAL 375 440 202 196

Questions:

◦ Is there consistency in Austen’s work (do the frequencies with which


Austen used these words change from work to work)?
Answer X = 12.27, df=?, P-value=?
◦ Was the imitator successful (are the frequencies of the words the same
in Austen’s work and the imitator’s work)?

Inference for Two-Way Tables, Feb 25, 2004 - 10 -


Simpson’s Paradox

Example: Medical study


◦ contact randomly chosen people in a district in England
◦ data on 1314 women contacted
◦ each woman was either a current smoker or had never smoked
Question: Survival rate after 20 years?

Smoker Not
Dead 139 230
Alive 438 502

Result: A higher percent of smokers stayed alive!

Here are the same data classified by their age at time of the survey:
Age 18 to 44 Age 45 to 64 Age 65+
Smoker Not Smoker Not Smoker Not
Dead 19 13 Dead 78 52 Dead 42 165
Alive 269 327 Alive 162 147 Alive 7 28

Age at time of the study is a confounding variable: in each age


group a higher percent of nonsmokers survive.

Simpson’s Paradox
An association/comparison that holds for all of several groups can
reverse direction when the data are combined to form a single
group.

Inference for Two-Way Tables, Feb 25, 2004 - 11 -


Simple Linear Regression

Example: Body density


Aim: Measure body density (weight per unit volume of the body)
(Body density indicates the fat content of the human body.)
Problem:
◦ Body density is difficult to measure directly.
◦ Research suggests that skinfold thickness can accurately predict body
density.
◦ Skinfold thickness is measured by pinching a fold of skin between
calipers.

(Figure: scatter plot of body density (10³ kg/m³) against skinfold thickness (mm).)

Questions:
◦ Are body density and skinfold thickness related?
◦ How accurately can we predict body density from skinfold thickness?

Regression: predict response variable for fixed value of explanatory variable


◦ describe linear relationship in data by regression line
◦ fitted regression line is affected by chance variation in observed data

Statistical inference: accounts for chance variation in data

Simple Linear Regression, Feb 27, 2004 -1-


Population Regression Line

Simple linear regression studies the relationship between


◦ a response variable Y and
◦ a single explanatory variable X.
We expect that different values of X will produce different mean responses
of Y .
For given X = x, we consider the subpopulation with X = x:
◦ this subpopulation has mean

µY |X=x = E(Y |X = x) (cond. mean of Y given X = x)

◦ and variance

σY2 |X=x = var(Y |X = x) (cond. variance of Y given X = x)

Linear regression model with constant variance:

E(Y |X = x) = µY |X=x = a + b x (population regression line)


var(Y |X = x) = σY2 |X=x = σ 2

◦ The population regression line connects the conditional means of the


response variable for fixed values of the explanatory variable.
◦ This population regression line tells how the mean response of Y varies
with X.
◦ The variance (and standard deviation) does not depend on x.

Simple Linear Regression, Feb 27, 2004 -2-


Conditional Mean

Sample (x1, y1), . . . , (xn, yn)

(Figure: the sampling probability f(x, y) over the sample, the slice f(x0, y) at a fixed x = x0, and its rescaling by fX(x0).)

Conditional probability:
  f(y|x0) = fXY(x0, y)/fX(x0)

Conditional mean:
  E(Y |X = x0) = ∫ y fY|X(y|x0) dy

Simple Linear Regression, Feb 27, 2004 -3-


The Linear Regression Model

Simple linear regression

Yi = a + b x i + ε i , i = 1, . . . , n

where

Yi response (also dependent variable)


xi predictor (also independent variable)
εi error

Assumptions:
◦ Predictor xi is deterministic (fixed values, not random).
◦ Errors have zero mean, E(εi) = 0.
◦ Variation about mean does not depend on xi, i.e. var(εi) = σ 2 .
◦ Errors εi are independent.
Often we additionally assume:
◦ The errors are normally distributed,
  εi iid∼ N(0, σ²).

For fixed x the response Y is normally distributed with

Y ∼ N (a + b x, σ 2).

Simple Linear Regression, Feb 27, 2004 -4-


Least Squares Estimation

Data: (Y1 , x1), . . . , (Yn, xn)

Aim: Find straight line which fits data best:

Ŷi = a + b xi fitted values for coefficients a and b

a - intercept
b - slope
Least Squares Approach:
Minimize squared distance between observed Yi and fitted Ŷi :
  L(a, b) = Σ_{i=1}^{n} (Yi − Ŷi)² = Σ_{i=1}^{n} (Yi − a − b xi)²

Set partial derivatives to zero (normal equations):


  ∂L/∂a = 0 ⇔ Σ_{i=1}^{n} (Yi − a − b xi) = 0
  ∂L/∂b = 0 ⇔ Σ_{i=1}^{n} (Yi − a − b xi) · xi = 0

Solution: Least squares estimators


  â = Ȳ − (SXY/SXX) · x̄
  b̂ = SXY/SXX
where
  SXY = Σ_{i=1}^{n} (Yi − Ȳ)(xi − x̄)   (sums of squares)
  SXX = Σ_{i=1}^{n} (xi − x̄)²

Simple Linear Regression, Feb 27, 2004 -5-


Least Squares Estimation

Least squares predictor Ŷ

Ŷi = â + b̂ xi

Residuals ε̂i:

ε̂i = Yi − Ŷi
= Yi − â − b̂ xi

Residual sum of squares (SS_Residual)
  SS_Residual = Σ_{i=1}^{n} ε̂_i² = Σ_{i=1}^{n} (Yi − Ŷi)²

Estimation of σ²
  σ̂² = (1/(n − 2)) Σ_{i=1}^{n} (Yi − Ŷi)² = SS_Residual/(n − 2)

Regression standard error
  s_e = σ̂ = √(SS_Residual/(n − 2))

Variation accounting:
  SS_Total = Σ_{i=1}^{n} (Yi − Ȳ)²   total variation
  SS_Model = Σ_{i=1}^{n} (Ŷi − Ȳ)²   variation explained by linear model
  SS_Residual = Σ_{i=1}^{n} (Yi − Ŷi)²   remaining variation

Simple Linear Regression, Feb 27, 2004 -6-


Least Squares Estimation

Example: Body density


Scatter plot with least squares regression line:

(Figure: scatter plot of body density against skinfold thickness with the least squares regression line.)

Calculation of least squares estimates:


  x̄ = 1.064, ȳ = 1.568, SXX = 0.0235, SXY = −0.2679, SYY = 4.244, SS_Residual = 1.187

  b̂ = SXY/SXX = −0.2679/0.0235 = −11.40
  â = ȳ − b̂ x̄ = 1.568 + 11.40 · 1.064 = 13.70
  σ̂² = SS_Residual/(n − 2) = 1.187/90 = 0.0132
  s_e = √(σ̂²) = √0.0132 = 0.1149
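These hand calculations can be verified with display (a sketch; the inputs are the summary values above):

. display "slope     b = " (-0.2679/0.0235)
. display "intercept a = " 1.568 - (-0.2679/0.0235)*1.064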

Simple Linear Regression, Feb 27, 2004 -7-


Least Squares Estimation

Example: Body density


Using STATA:
. infile ID BODYD SKINT using bodydens.txt, clear
(92 observations read)
. regress BODYD SKINT
Source | SS df MS Number of obs = 92
-------------+------------------------------ F( 1, 90) = 231.89
Model | 3.05747739 1 3.05747739 Prob > F = 0.0000
Residual | 1.18663025 90 .013184781 R-squared = 0.7204
-------------+------------------------------ Adj R-squared = 0.7173
Total | 4.24410764 91 .046638546 Root MSE = .11482
------------------------------------------------------------------------------
BODYD | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
SKINT | -11.41345 .7494999 -15.23 0.000 -12.90246 -9.924433
_cons | 13.71221 .7975822 17.19 0.000 12.12768 15.29675
------------------------------------------------------------------------------
. twoway (lfitci BODYD SKINT, range(1 1.1)) (scatter BODYD SKINT), xtitle(Skin thickn
> ess) ytitle(Body density) scheme(s1color) legend(off)
(Figure: body density against skin thickness with the fitted regression line and confidence band, produced by the command above.)

Simple Linear Regression, Feb 27, 2004 -8-


Properties of Estimators

Statistical properties of â and b̂


Mean and variance of b̂
  E(b̂) = b
  var(b̂) = σ²/SXX,   where SXX = Σ_{i=1}^{n} (xi − x̄)²

Distribution of b̂
  b̂ ∼ N( b, σ²/SXX )

Mean and variance of â
  E(â) = a
  var(â) = (1/n + x̄²/SXX) σ²

Distribution of â
  â ∼ N( a, (1/n + x̄²/SXX) σ² )

Inference for Regression, Mar 1, 2004 -1-


Confidence Intervals
 
Note that b̂ ∼ N(b, σ²/SXX). Thus
  (b̂ − b)/(σ/√SXX) ∼ N(0, 1)
Substituting s_e for σ, we obtain
  (b̂ − b)/(s_e/√SXX) ∼ t_{n−2}

(1 − α) confidence interval for b:
  b̂ ± t_{n−2,α/2} · s_e/√SXX

Similarly
  (â − a)/( σ √(1/n + x̄²/SXX) ) ∼ N(0, 1)
Substituting s_e for σ, we obtain
  (â − a)/( s_e √(1/n + x̄²/SXX) ) ∼ t_{n−2}

(1 − α) confidence interval for a:
  â ± t_{n−2,α/2} · s_e · √(1/n + x̄²/SXX)

Inference for Regression, Mar 1, 2004 -2-


Tests on the Coefficients

Question: Is b equal to some value b0 ?


The corresponding test problem is

  H0 : b = b0 versus Ha : b ≠ b0.

The test statistic is given by
  Tb = (b̂ − b0)/(s_e/√SXX) ∼ t_{n−2}
The null hypothesis H0 : b = b0 is rejected if
  |Tb| > t_{n−2,α/2}

Question: Is a equal to some value a0 ?


The corresponding test problem is

  H0 : a = a0 versus Ha : a ≠ a0.

The test statistic is given by
  Ta = (â − a0)/( s_e √(1/n + x̄²/SXX) ) ∼ t_{n−2}
The null hypothesis H0 : a = a0 is rejected if
  |Ta| > t_{n−2,α/2}

Inference for Regression, Mar 1, 2004 -3-


Inference for the Coefficients

Example: Body density


The confidence interval for b is given by
  b̂ ± t_{n−2,α/2} · s_e/√SXX
  = −11.41 ± 1.99 · √(0.0132/0.023) = [−12.92, −9.90]

The confidence interval for a is given by
  â ± t_{n−2,α/2} · s_e · √(1/n + x̄²/SXX)
  = 13.71 ± 1.99 · √0.0132 · √(1/92 + 1.06²/0.023) = [12.11, 15.30]

Furthermore we find
  Tb = b̂/(s_e/√SXX) = −15.22,   |Tb| > t_{90,0.025} = 1.99
Thus we reject H0 : b = 0 at significance level 0.05: The coefficient b is
statistically significantly different from zero.
Similarly
  Ta = â/( s_e √(1/n + x̄²/SXX) ) = 17.26 > t_{90,0.025} = 1.99
Thus we reject H0 : a = 0 at significance level 0.05: The coefficient a is
statistically significantly different from zero.

The corresponding P -values are
◦ P(|Tb| ≥ 15.22) ≈ 0
◦ P(|Ta| ≥ 17.26) ≈ 0
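The interval for b can be reproduced with display (a sketch, using the rounded quantities above):

. display -11.41 - 1.99*sqrt(0.0132/0.023)
. display -11.41 + 1.99*sqrt(0.0132/0.023)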

Inference for Regression, Mar 1, 2004 -4-


Estimating the Mean

In the linear regression model, the mean of Y at x = x0 is given by

E(Y ) = a + b x0
Our estimate for the mean of Y at X = x0 is

Ŷx0 = â + b̂ x0 .

Question: How precise is this estimate?


Note that
  Ŷ_{x0} = â + b̂ x0 = Ȳ + b̂(x0 − x̄).
Hence we obtain
  E(Ŷ_{x0}) = a + b x0
  var(Ŷ_{x0}) = (1/n + (x0 − x̄)²/SXX) σ²

(1 − α) confidence interval for E(Y_{x0}):
  (â + b̂ x0) ± t_{n−2,α/2} · s_e · √(1/n + (x0 − x̄)²/SXX)

Inference for Regression, Mar 1, 2004 -5-


Estimating the Mean

Example: Body density


Suppose the measured skin thickness is x0 = 1.1 mm.
What is the mean body density for this value of skin thickness?
◦ Point estimate:

  Ŷ_{x0} = â + b̂ x0 = 13.71 − 11.41 · 1.1 = 1.159

  The estimated mean body density is 1.159 · 10³ kg/m³.

◦ Confidence interval:
  (â + b̂ x0) ± t_{n−2,α/2} · s_e · √(1/n + (x0 − x̄)²/SXX)
  = (13.71 − 11.41 · 1.1) ± 1.99 · √0.0132 · √(1/92 + (1.1 − 1.06)²/0.023)
  = [1.09, 1.22]

In STATA, the standard error for estimating the mean of Y is calculated


by passing the option stdp to predict:
. predict BDH
. predict SE, stdp
. generate low=BDH-invttail(90,.025)*SE
. generate high=BDH+invttail(90,.025)*SE
. sort SKINT
. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black bla
> ck black) || scatter BODYD SKINT, legend(off) scheme(s1color)
(Figure: pointwise confidence band for the mean response, plotted with the data over SKINT.)

Inference for Regression, Mar 1, 2004 -6-


Prediction

Suppose we want to predict Y at x = x0 .


Aim: (1 − α) confidence interval for Y
Note that
  
  â + b̂ x0 − Y ∼ N( 0, σ²(1 + 1/n + (x0 − x̄)²/SXX) )
Thus the desired (1 − α) confidence interval for Y_{x0} is given by
  â + b̂ x0 ± t_{n−2,α/2} · s_e · √(1 + 1/n + (x0 − x̄)²/SXX)

Inference for Regression, Mar 1, 2004 -7-


Prediction

Example: Body density


Suppose the measured skin thickness is x0 = 1.1 mm.
What is the predicted body density for this value of skin thickness?
◦ Point estimate: Ŷ_{x0} = â + b̂ x0 = 13.71 − 11.41 · 1.1 = 1.159
  The predicted body density is 1.159 · 10³ kg/m³.
◦ Confidence interval:
  (â + b̂ x0) ± t_{n−2,α/2} · s_e · √(1 + 1/n + (x0 − x̄)²/SXX)
  = (13.71 − 11.41 · 1.1) ± 1.99 · √0.0132 · √(1 + 1/92 + (1.1 − 1.06)²/0.023)
  = [0.92, 1.40]
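Again this can be checked with display (a sketch, using the same rounded quantities):

. display (13.71 - 11.41*1.1) - 1.99*sqrt(0.0132)*sqrt(1 + 1/92 + (1.1 - 1.06)^2/0.023)
. display (13.71 - 11.41*1.1) + 1.99*sqrt(0.0132)*sqrt(1 + 1/92 + (1.1 - 1.06)^2/0.023)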

In STATA, the standard error for predicting Y is calculated by passing the


option stdf to predict:
. drop SE low high
. predict SE, stdf
. generate low=BDH-invttail(90,.025)*SE
. generate high=BDH+invttail(90,.025)*SE
. graph twoway line low high BDH SKINT, clpattern(dash dash solid) clcolor(black bla
> ck black) || scatter BODYD SKINT, legend(off) scheme(s1color)

Alternatively, we can use the following command:


. twoway (lfitci BODYD SKINT, range(1 1.1) stdf) (scatter BODYD SKINT),
> xtitle(Skin thickness) ytitle(Body density) scheme(s1color) legend(off)
(Figures: prediction bands for body density over skin thickness, produced by the two commands above.)

Inference for Regression, Mar 1, 2004 -8-


Multiple Regression

Example: Food expenditure and family income


Data: ◦ Sample of 20 households
◦ Food expenditure (response variable)
◦ Family income and family size

. regress food income


-------------------------------------------------------------------------
food | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
income | .1841099 .0149345 12.33 0.000 .1527336 .2154862
_cons | -.4119994 .7637666 -0.54 0.596 -2.016613 1.192615
-------------------------------------------------------------------------
. regress food number
-------------------------------------------------------------------------
food | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
number | 2.287334 .4224493 5.41 0.000 1.399801 3.174867
_cons | 1.217365 1.410627 0.86 0.399 -1.746252 4.180981
-------------------------------------------------------------------------

(Figures: food expenditure against income and against family size.)

Multiple Regression, Mar 3, 2004 -1-


Multiple Regression

Multiple regression model

Yi = b0 + b1 x1,i + b2 x2,i + . . . + bp xp,i + εi i = 1, . . . , n

where
◦ Yi response variable
◦ x1,i, . . . , xp,i predictor variables (fixed, nonrandom)
◦ b0, . . . , bp regression coefficients
◦ εi iid∼ N(0, σ²) error variable

Example: Food expenditure and family income


Fitting multiple regression models in STATA:

. regress food income number


Source | SS df MS Number of obs = 20
--------+------------------------------ F( 2, 17) = 121.47
Model | 386.312865 2 193.156433 Prob > F = 0.0000
Resid. | 27.0326365 17 1.59015509 R-squared = 0.9346
--------+------------------------------ Adj R-squared = 0.9269
Total | 413.345502 19 21.7550264 Root MSE = 1.261
-------------------------------------------------------------------------
food | Coef. Std. Err. t P>|t| [95% Conf. Interval]
--------+----------------------------------------------------------------
income | .1482117 .0163786 9.05 0.000 .1136558 .1827676
number | .7931055 .2444411 3.24 0.005 .2773798 1.308831
_cons | -1.118295 .6548524 -1.71 0.106 -2.499913 .2633232
-------------------------------------------------------------------------

Multiple Regression, Mar 3, 2004 -2-


Multiple Regression

Example: Food expenditure and family income


Data: (Foodi , Incomei , Numberi ), i = 1, . . . , 20
Fitted regression model:
  fitted Food = b̂0 + b̂1 · Income + b̂2 · Number

(Figure: observed responses Yi and fitted values Ŷi plotted over income and family size.)

The fitted model is a two-dimensional plane, which is difficult to visualize.

Multiple Regression, Mar 3, 2004 -3-


Inference for Multiple Regression

Multiple regression model (matrix notation)

Y =Xb+ε

where
Y n dimensional vector
X n × (1 + p) dimensional matrix
b 1 + p dimensional vector
ε n dimensional vector
Thus the model can be written as
      
  (Y1, . . . , Yn)ᵀ = X (b0, . . . , bp)ᵀ + (ε1, . . . , εn)ᵀ,
where row i of X is (1, x1,i, . . . , xp,i).

Least squares approach: Minimize


  ‖Y − Ŷ‖² = Σ_{i=1}^{n} (Yi − Ŷi)²

Results:
  b̂ = (XᵀX)⁻¹XᵀY ∼ N( b, σ²(XᵀX)⁻¹ )
  Ŷ = X(XᵀX)⁻¹XᵀY ∼ N( Xb, σ²X(XᵀX)⁻¹Xᵀ )
  ε̂ = Y − Ŷ = (I − X(XᵀX)⁻¹Xᵀ)Y ∼ N( 0, σ²(I − X(XᵀX)⁻¹Xᵀ) )
  σ̂² = s²_e = ‖Y − Ŷ‖²/(n − p) = (1/(n − p)) Σ_{i=1}^{n} (Yi − Ŷi)²

Details: see a course in regression analysis (STAT 22200) or econometrics

Multiple Regression, Mar 3, 2004 -4-


Inference for Multiple Regression

Example: Food expenditure and family income


Interpretation of regression coefficients

. quietly regress food income


. predict e_food1, residuals
. quietly regress number income
. predict e_num, residuals
. regress e_food1 e_num
------------------------------------------------------------------------
e_food1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------
e_num | .7931055 .2375541 3.34 0.004 .2940229 1.292188
------------------------------------------------------------------------
. quietly regress food number
. predict e_food2, residuals
. quietly regress income number
. predict e_inc, residuals
. regress e_food2 e_inc
------------------------------------------------------------------------
e_food2 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------
e_inc | .1482117 .0159172 9.31 0.000 .114771 .1816525
------------------------------------------------------------------------

Result:
◦ bj measures the dependence of Y on xj after removing the linear effects of all other predictors xk, k ≠ j.
◦ bj = 0 if xj provides no information for predicting Y beyond the information already given by the other predictor variables.
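This residual-on-residual interpretation can be checked numerically. The following Python sketch uses simulated stand-ins for food, income and number (not the course data) and verifies that the coefficient from the residual regression equals the one from the full multiple regression:

import numpy as np

rng = np.random.default_rng(1)
n = 20
income = rng.uniform(10, 100, n)
number = rng.integers(1, 6, n).astype(float)
food = -1.1 + 0.15 * income + 0.8 * number + rng.normal(scale=1.2, size=n)

def ols(y, *cols):
    # least squares coefficients for a regression of y on a constant and the given columns
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(food, income, number)[2]        # coefficient on number in the full model

# remove the linear effect of income from both food and number, then regress residuals
e_food = food - np.column_stack([np.ones(n), income]) @ ols(food, income)
e_num = number - np.column_stack([np.ones(n), income]) @ ols(number, income)
b_resid = ols(e_food, e_num)[1]

print(b_full, b_resid)                       # the two coefficients agree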

Multiple Regression, Mar 3, 2004 -5-


Multiple Regression

Example: Heart catheterization


Description: A Teflon tube (catheter) 3 mm in diameter is passed into a major vein or
artery at the femoral region and pushed up into the heart to obtain information about
the heart’s physiology and functional ability. The length of the catheter is typically
determined by a physician’s educated guess.

Data:
◦ Study with 12 children with congenital heart defects
◦ Exact required catheter length was measured using a fluoroscope
◦ Patients’ height and weight were recorded

Question: How accurately can catheter length be determined by height and weight?
[Scatter plots: distance (cm) vs. height (in), and distance (cm) vs. weight (lb)]

Multiple Regression, Mar 3, 2004 -6-


Multiple Regression

Example: Heart catheterization (contd)


Regression model:

Y = b0 + b1 x 1 + b2 x 2 + ε

where ◦ Y - distance to pulmonary artery


◦ x1 - height
◦ x2 - weight

STATA regression output:


. regress distance height weight
Source | SS df MS Number of obs = 12
-------------+------------------------------ F( 2, 9) = 18.62
Model | 578.81613 2 289.408065 Prob > F = 0.0006
Residual | 139.913037 9 15.545893 R-squared = 0.8053
-------------+------------------------------ Adj R-squared = 0.7621
Total | 718.729167 11 65.3390152 Root MSE = 3.9428
------------------------------------------------------------------------------
distance | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
height | .1963566 .3605845 0.54 0.599 -.6193422 1.012056
weight | .1908278 .165164 1.16 0.278 -.1827991 .5644547
_cons | 21.0084 8.751156 2.40 0.040 1.211907 40.80489
------------------------------------------------------------------------------

Note:
◦ Neither height nor weight seems to be significant for predicting the distance to the pulmonary artery.
◦ The regression on both variables nevertheless explains 80% of the variation of the response (catheter length).

Multiple Regression, Mar 3, 2004 -7-


Multiple Regression

Example: Heart catheterization (contd)


Consider predicting the length by height alone and by weight alone:
. regress distance height
R-squared = 0.7765
------------------------------------------------------------------------------
distance | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
height | .5967612 .1012558 5.89 0.000 .3711492 .8223732
_cons | 12.12405 4.247174 2.85 0.017 2.660752 21.58734
------------------------------------------------------------------------------
. regress distance weight
R-squared = 0.7989
------------------------------------------------------------------------------
distance | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | .2772687 .0439881 6.30 0.000 .1792571 .3752804
_cons | 25.63746 2.004207 12.79 0.000 21.17181 30.10311
------------------------------------------------------------------------------

Note:
◦ In a simple regression of Y on either height or weight, the explanatory
variable is highly significant for predicting Y .
◦ In a multiple regression of Y on height and weight, the coefficients for
both height and weight are not significantly different from zero.

Problem: The explanatory variables are strongly linearly related (collinear).

[Scatter plot: weight (lb) vs. height (in), showing a strong linear relationship]
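One way to quantify this collinearity is the correlation between the two predictors, or equivalently the variance inflation factor (VIF). A small Python sketch with hypothetical height/weight values (the patient data are not reproduced here):

import numpy as np

rng = np.random.default_rng(2)
height = rng.uniform(20, 65, 12)                      # hypothetical heights (in)
weight = 1.5 * height - 15 + rng.normal(0, 6, 12)     # hypothetical weights (lb), tied to height

r = np.corrcoef(height, weight)[0, 1]
vif = 1 / (1 - r**2)       # variance inflation factor for either predictor (two-predictor case)
print(f"correlation = {r:.2f}, VIF = {vif:.1f}")

A VIF well above 1 signals that the standard errors of the individual coefficients are inflated, which is exactly what the wide confidence intervals in the multiple regression above show.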

Multiple Regression, Mar 3, 2004 -8-


Analysis of Variance

Decomposition of variation:
◦ SS Total = Σ_i (Yi − Ȳ)² - total variation
◦ SS Residual = Σ_i (Yi − Ŷi)² - residual variation left unexplained by the regression model
◦ SS Model = SS Total − SS Residual = Σ_i (Ŷi − Ȳ)² - variation explained by the regression

Coefficient of determination: The ratio

R² = SS Model / SS Total

indicates how well the regression model predicts the response. R² is also the squared multiple correlation coefficient - in a simple linear regression we have

R² = ρ²XY (the squared correlation between X and Y).

Example: Heart catheterization

Source | SS df MS Number of obs = 12


-------------+------------------------------ F( 2, 9) = 18.62
Model | 578.81613 2 289.408065 Prob > F = 0.0006
Residual | 139.913037 9 15.545893 R-squared = 0.8053
-------------+------------------------------ Adj R-squared = 0.7621
Total | 718.729167 11 65.3390152 Root MSE = 3.9428

The coefficient of determination for these data is


R² = 578.82/718.73 = 0.81.
Regression on height and weight explains 81% of the variation of distance.
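A quick arithmetic check of this ratio in Python, using the SS values from the table above:

ss_model, ss_total = 578.81613, 718.729167
print(round(ss_model / ss_total, 4))   # 0.8053, i.e. about 81%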

Multiple Regression, Mar 3, 2004 -9-


Analysis of Variance

Question: Is improvement in prediction (decrease in variation) significant?

Our null hypothesis is that none of the explanatory variables helps to predict the response, that is,

H0 : b1 = . . . = bp = 0    versus    Ha : bj ≠ 0 for at least one j ∈ {1, . . . , p}.

Under the null hypothesis H0 the F statistic


F = (n − p − 1)/p · SS Model/SS Residual = (n − p − 1)/p · (SS Total − SS Residual)/SS Residual
is F distributed with p and n − p − 1 degrees of freedom.
The null hypothesis H0 is rejected at level α if F > Fp,n−p−1,α.

Example: Heart catheterization

Source | SS df MS Number of obs = 12


-------------+------------------------------ F( 2, 9) = 18.62
Model | 578.81613 2 289.408065 Prob > F = 0.0006
Residual | 139.913037 9 15.545893 R-squared = 0.8053
-------------+------------------------------ Adj R-squared = 0.7621
Total | 718.729167 11 65.3390152 Root MSE = 3.9428

The value of the F statistic is


F = (9/2) · (578.82/139.91) = 18.62.
The critical value for rejecting H0 : b1 = b2 = 0 is F2,9,0.05 = 4.26. Thus
the null hypothesis H0 that both coefficients b1 and b2 are zero is rejected
at significance level α = 0.05.
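The F statistic, the critical value F2,9,0.05 and the p-value can be reproduced with scipy, taking the SS values from the table above:

from scipy.stats import f

n, p = 12, 2
ss_model, ss_resid = 578.81613, 139.913037
F = (n - p - 1) / p * ss_model / ss_resid       # (9/2) * SS Model / SS Residual
crit = f.ppf(0.95, p, n - p - 1)                # F_{2,9,0.05}
p_value = f.sf(F, p, n - p - 1)
print(round(F, 2), round(crit, 2), round(p_value, 4))   # 18.62, 4.26, 0.0006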

Multiple Regression, Mar 3, 2004 - 10 -


Comparing Models

Example: Cobb-Douglas production function

Y = t · K^a · L^b · M^c

where ◦ Y - output ◦ L - labour


◦ K - capital ◦ M - materials
Regression model:

log Y = log t + a log K + b log L + c log M

[Scatter plots: Y vs. K, Y vs. L, and Y vs. M]

Multiple Regression, Mar 3, 2004 - 11 -


Comparing Models

Example: Cobb-Douglas production function (contd)


Regression model M0 for Cobb-Douglas function:

log Y = log t + a log K + b log L + c log M


. regress LY LK LM LL
Source | SS df MS Number of obs = 25
---------+----------------------------- F( 3, 21) = 138.98
Model | 1.35136742 3 .450455808 Prob > F = 0.0000
Residual | .068065609 21 .003241219 R-squared = 0.9520
---------+----------------------------- Adj R-squared = 0.9452
Total | 1.41943303 24 .059143043 Root MSE = .05693
-------------------------------------------------------------------------
LY | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+---------------------------------------------------------------
LK | .0718626 .1543912 0.47 0.646 -.2492114 .3929366
LM | .7072231 .3004146 2.35 0.028 .0824768 1.331969
LL | .2117778 .4248755 0.50 0.623 -.6717991 1.095355
_cons | .0347117 .0374354 0.93 0.364 -.0431395 .1125629

Two variables, log K and log L, do not improve prediction of log Y .


Alternative model M1:

log Y = log t + c log M


. regress LY LM
Source | SS df MS Number of obs = 25
---------+----------------------------- F( 1, 23) = 445.69
Model | 1.34977753 1 1.34977753 Prob > F = 0.0000
Residual | .069655501 23 .0030285 R-squared = 0.9509
---------+----------------------------- Adj R-squared = 0.9488
Total | 1.41943303 24 .059143043 Root MSE = .05503
-------------------------------------------------------------------------
LY | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+---------------------------------------------------------------
LM | .9086794 .0430421 21.11 0.000 .81964 .9977188
_cons | .0512244 .0189767 2.70 0.013 .011968 .0904808

Question: Is model M0 significantly better than model M1 ?

Multiple Regression, Mar 3, 2004 - 12 -


Comparing Models

Consider the multiple regression model with p explanatory variables

Yi = b0 + b1 x1,i + . . . + bp xp,i + εi .

Problem:
Test the null hypothesis
H0 : q specific explanatory variables all have zero coefficients
versus
Ha : at least one of these q explanatory variables has a nonzero coefficient.

Solution:
◦ Regress Y on all p explanatory variables and read SS Residual^(1) from the output.
◦ Regress Y on just the p − q explanatory variables that remain after you remove the q variables from the model. Read SS Residual^(2) from the output.
◦ The test statistic (see the sketch after this list) is

F = (n − p − 1)/q · (SS Residual^(2) − SS Residual^(1)) / SS Residual^(1) .

Under the null hypothesis, F is F distributed with q and n − p − 1 degrees of freedom.
◦ Reject H0 if F > Fq,n−p−1,α.
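A small Python helper that packages this nested-model test (the function name nested_f_test is ours, not a STATA or scipy name):

from scipy.stats import f

def nested_f_test(ss_resid_full, ss_resid_reduced, n, p, q, alpha=0.05):
    """F test of H0: the q dropped predictors all have zero coefficients.

    ss_resid_full    -- SS Residual of the model with all p predictors
    ss_resid_reduced -- SS Residual of the model with the q predictors removed
    """
    F = (n - p - 1) / q * (ss_resid_reduced - ss_resid_full) / ss_resid_full
    crit = f.ppf(1 - alpha, q, n - p - 1)
    return F, crit, F > crit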

Multiple Regression, Mar 3, 2004 - 13 -


Comparing Models

Example: Cobb-Douglas production function


Comparison of models M0 and M1 :
◦ M0: SS Residual^(0) = .06807 and n − p − 1 = 21.
◦ M1: SS Residual^(1) = .06966 and q = 2.

F = (21/2) · (.06966 − .06807)/.06807 = 0.2453
◦ Since F < F2,21,0.05 = 3.47 we cannot reject H0 : a = b = 0.

Using STATA:
. test LK LL
( 1) LK = 0
( 2) LL = 0
F( 2, 21) = 0.25
Prob > F = 0.7847
. test LK LL _cons
( 1) LK = 0
( 2) LL = 0
( 3) _cons = 0
F( 3, 21) = 2.43
Prob > F = 0.0934
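Feeding the Cobb-Douglas numbers into the nested_f_test sketch from the previous page reproduces both the hand computation and STATA's conclusion:

F, crit, reject = nested_f_test(0.068065609, 0.069655501, n=25, p=3, q=2)
print(round(F, 3), round(crit, 2), reject)   # 0.245, 3.47, False: H0 is not rejected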

Multiple Regression, Mar 3, 2004 - 14 -


Case Study

Example: Headaches and pain reliever

◦ 24 patients with a common type of headache were treated with a new


pain reliever
◦ Medication was given to each patient at one of four dosage levels: 2, 5, 7, or 10 grams
◦ Response variable: time until noticeable relief (in minutes)
◦ Other explanatory variables:
⋄ sex (0=female, 1=male)
⋄ blood pressure (0.25=low, 0.50=medium, 0.75=high)

Box plots

[Box plots: time to relief (in minutes) by sex, within each dosage level: 2, 5, 7, and 10 grams]

Multiple Regression II, Mar 5, 2004 -1-


Case Study
. regress time dose bp if sex==0
R-squared = 0.8861
--------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+----------------------------------------------------------------
dose | -5.519608 .6608907 -8.35 0.000 -7.014646 -4.024569
bp | -5 9.439407 -0.53 0.609 -26.35342 16.35342
_cons | 61.11765 6.458495 9.46 0.000 46.50752 75.72778
--------------------------------------------------------------------------
. predict YHf
(option xb assumed; fitted values)
. twoway line YHf dose if bp==0.25||line YHf dose if bp==0.5||
> line YHf dose if bp==0.75||scatter time dose if(sex==0), saving(a, replace)
(file a.gph saved)
. regress time dose bp if sex==1
R-squared = 0.5765
--------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+----------------------------------------------------------------
dose | -3.343137 .9564492 -3.50 0.007 -5.506776 -1.179499
bp | -2.5 13.66083 -0.18 0.859 -33.40294 28.40294
_cons | 51.39216 9.346814 5.50 0.000 30.2482 72.53612
--------------------------------------------------------------------------
. predict YHm
(option xb assumed; fitted values)
. twoway line YHm dose if bp==0.25||line YHm dose if bp==0.5||
> line YHm dose if bp==0.75||scatter time dose if(sex==1), saving(b, replace)
(file b.gph saved)
. graph combine a.gph b.gph
[Combined graphs: fitted time vs. dose, one line per blood pressure level, with observed times overlaid; left panel females, right panel males]

Multiple Regression II, Mar 5, 2004 -2-


Case Study

Model:

Time = Dose + Sex + Sex · Dose + BP + ε


. infile time dose sex bp using headache.dat
(24 observations read)
. generate sexdose=sex*dose
. regress time dose sex sexdose bp
Source | SS df MS Number of obs = 24
----------+------------------------------ F( 4, 19) = 16.78
Model | 4387.65319 4 1096.9133 Prob > F = 0.0000
Residual | 1242.30515 19 65.3844814 R-squared = 0.7793
----------+------------------------------ Adj R-squared = 0.7329
Total | 5629.95833 23 244.780797 Root MSE = 8.0861
---------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+----------------------------------------------------------------
dose | -5.519608 .8006399 -6.89 0.000 -7.195367 -3.843849
sex | -8.47549 7.553222 -1.12 0.276 -24.28457 7.333585
sexdose | 2.176471 1.132276 1.92 0.070 -.19341 4.546351
bp | -3.75 8.086067 -0.46 0.648 -20.67433 13.17433
_cons | 60.49265 6.698634 9.03 0.000 46.47224 74.51305
---------------------------------------------------------------------------
. predict YH
(option xb assumed; fitted values)
. predict E, residuals
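For comparison, a hedged Python sketch of the same interaction model; the whitespace-separated layout and column order of headache.dat are assumptions based on the infile command above:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("headache.dat", sep=r"\s+", names=["time", "dose", "sex", "bp"])

# sex:dose builds the interaction term, mirroring ". generate sexdose=sex*dose"
fit = smf.ols("time ~ dose + sex + sex:dose + bp", data=df).fit()
print(fit.summary())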

Residual plots: residuals vs. Dose, and normal quantile plot of the residuals


[Left panel: residuals (in minutes) vs. dose (in grams). Right panel: sample quantiles of the residuals vs. theoretical normal quantiles]

Multiple Regression II, Mar 5, 2004 -3-


Case Study

Model:

Time = Dose + Dose² + Sex + Sex · Dose + BP + ε


. drop YH E
. generate dosesq=dose^2
. regress time dose sex sexdose dosesq bp
Source | SS df MS Number of obs = 24
----------+------------------------------ F( 5, 18) = 24.20
Model | 4901.02819 5 980.205637 Prob > F = 0.0000
Residual | 728.930147 18 40.4961193 R-squared = 0.8705
----------+------------------------------ Adj R-squared = 0.8346
Total | 5629.95833 23 244.780797 Root MSE = 6.3637
---------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+----------------------------------------------------------------
dose | -12.91961 2.171775 -5.95 0.000 -17.48234 -8.356878
sex | -8.47549 5.944312 -1.43 0.171 -20.96403 4.013047
sexdose | 2.176471 .8910901 2.44 0.025 .3043598 4.048581
dosesq | .6166667 .1731968 3.56 0.002 .2527937 .9805396
bp | -3.75 6.363656 -0.59 0.563 -17.11955 9.619545
_cons | 77.45098 7.104701 10.90 0.000 62.52456 92.3774
---------------------------------------------------------------------------
. predict E, residuals

[Left panel: residuals (in minutes) vs. dose (in grams). Right panel: normal quantile plot of the residuals]

. test sex bp
( 1) sex = 0
( 2) bp = 0
F( 2, 18) = 1.19
Prob > F = 0.3270

Multiple Regression II, Mar 5, 2004 -4-


Case Study

Model:

Time = Dose + Dose² + Sex · Dose + ε


. regress time dose sexdose dosesq
Source | SS df MS Number of obs = 24
----------+------------------------------ F( 3, 20) = 38.81
Model | 4804.63916 3 1601.54639 Prob > F = 0.0000
Residual | 825.319178 20 41.2659589 R-squared = 0.8534
----------+------------------------------ Adj R-squared = 0.8314
Total | 5629.95833 23 244.780797 Root MSE = 6.4239
---------------------------------------------------------------------------
time | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+----------------------------------------------------------------
dose | -12.34823 2.154675 -5.73 0.000 -16.8428 -7.853653
sexdose | 1.033708 .3931338 2.63 0.016 .2136452 1.853771
dosesq | .6166667 .1748353 3.53 0.002 .2519667 .9813667
_cons | 71.33824 5.667294 12.59 0.000 59.51647 83.16
---------------------------------------------------------------------------
. predict YH
(option xb assumed; fitted values)
. twoway line YH dose if sex==0|| line YH dose if sex==1,
> legend(label(1 "female") label(2 "male"))

[Line plot: fitted time (in minutes) vs. dose (in grams), one curve for females and one for males]

Multiple Regression II, Mar 5, 2004 -5-


Comparing Several Means

Example: Comparison of laboratories

◦ Task: Measure amount of chlorpheniramine maleate in tablets


◦ Seven laboratories were asked to make 10 determinations of one tablet
◦ Study consistency between labs and variability of measurements
Box plot

[Box plots: amount of chlorpheniramine (in mg) by laboratory, Lab 1 to Lab 7]

One-Way Analysis of Variance, Mar 8, 2004 -1-


Comparing Several Means

Example: Comparison of drugs

◦ Experimental study of drugs to relieve itching


◦ Five drugs were compared to a placebo and no drug
◦ Ten volunteer male subjects
◦ Each subject underwent one treatment per day (randomized order)
◦ Drug or placebo were given intravenously
◦ Itching was induced on forearms with cowage
◦ Subjects recorded duration of itching
Box plot

[Box plots: duration of itching (sec) for No drug, Placebo, Papaverine, Morphine, Aminophylline, Pentobarbital, and Tripelennamine]

One-Way Analysis of Variance, Mar 8, 2004 -2-


Comparing Several Means

. infile amount lab using labs.txt


(70 observations read)
. graph box amount, over(lab)
. oneway amount lab, bonferroni tabulate
| Summary of amount
lab | Mean Std. Dev. Freq.
------------+------------------------------------
1 | 4.062 .03259178 10
2 | 3.997 .08969706 10
3 | 4.003 .02311808 10
4 | 3.920 .03333330 10
5 | 3.957 .05716445 10
6 | 3.955 .06704064 10
7 | 3.998 .08482662 10
------------+------------------------------------
Total | 3.9845715 .07184294 70
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups .1247371 6 .020789517 5.66 0.0001
Within groups .231400073 63 .003673017
------------------------------------------------------------------------
Total .356137173 69 .005161408
Bartlett’s test for equal variances: chi2(6) = 24.3697 Prob>chi2 = 0.000
Comparison of amount by lab
(Bonferroni)
Row Mean-|
Col Mean | 1 2 3 4 5 6
---------+------------------------------------------------------------------
2 | -.065
| 0.408
|
3 | -.059 .006
| 0.698 1.000
|
4 | -.142 -.077 -.083
| 0.000 0.127 0.068
|
5 | -.105 -.04 -.046 .037
| 0.005 1.000 1.000 1.000
|
6 | -.107 -.042 -.048 .035 -.002
| 0.004 1.000 1.000 1.000 1.000
|
7 | -.064 .001 -.005 .078 .041 .043
| 0.448 1.000 1.000 0.115 1.000 1.000
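A rough Python analogue of ". oneway amount lab, bonferroni" is sketched below. The file layout is assumed from the infile command above, and scipy's pairwise t tests use per-pair variances rather than the pooled within-group MS used by STATA, so the adjusted p-values will differ slightly:

import pandas as pd
from scipy.stats import f_oneway, ttest_ind

df = pd.read_csv("labs.txt", sep=r"\s+", names=["amount", "lab"])
groups = [g["amount"].to_numpy() for _, g in df.groupby("lab")]

F, p = f_oneway(*groups)                 # between-groups / within-groups F test
print(F, p)

# Bonferroni-adjusted pairwise comparisons (21 pairs for 7 labs)
k = len(groups)
n_pairs = k * (k - 1) // 2
for i in range(k):
    for j in range(i + 1, k):
        t, p_ij = ttest_ind(groups[i], groups[j])
        print(i + 1, j + 1, min(1.0, p_ij * n_pairs))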

One-Way Analysis of Variance, Mar 8, 2004 -3-


Comparing Several Means
. oneway duration drug, bonferroni tabulate
| Summary of duration
drug | Mean Std. Dev. Freq.
------------+------------------------------------
1 | 191.0 54.861442 10
2 | 204.8 105.723750 10
3 | 118.2 52.809511 10
4 | 148.0 44.738748 10
5 | 144.3 42.076782 10
6 | 176.5 68.856130 10
7 | 167.2 67.499465 10
------------+------------------------------------
Total | 164.28571 68.463709 70
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 53012.8857 6 8835.48095 2.06 0.0708
Within groups 270409.4 63 4292.2127
------------------------------------------------------------------------
Total 323422.286 69 4687.2795
Bartlett’s test for equal variances: chi2(6) = 11.3828 Prob>chi2 = 0.077
Comparison of duration by drug
(Bonferroni)
Row Mean-|
Col Mean | 1 2 3 4 5 6
---------+------------------------------------------------------------------
2 | 13.8
| 1.000
|
3 | -72.8 -86.6
| 0.328 0.092
|
4 | -43 -56.8 29.8
| 1.000 1.000 1.000
|
5 | -46.7 -60.5 26.1 -3.7
| 1.000 0.904 1.000 1.000
|
6 | -14.5 -28.3 58.3 28.5 32.2
| 1.000 1.000 1.000 1.000 1.000
|
7 | -23.8 -37.6 49 19.2 22.9 -9.3
| 1.000 1.000 1.000 1.000 1.000 1.000

One-Way Analysis of Variance, Mar 8, 2004 -4-
