Example:
The following data set consists of five variables about 20 individuals.
ID Age Education Sex Total income Job class
1 43 4 1 18526 5
2 35 3 2 5400 7
3 43 2 1 3900 7
4 33 3 1 28003 5
5 38 3 2 43900 7
6 53 4 1 53000 5
7 64 6 1 51100 6
8 27 4 2 44000 5
9 34 4 1 31200 5
10 27 3 2 26030 5
11 47 6 1 6000 6
12 48 3 1 8145 5
13 39 2 1 37032 5
14 30 3 2 30000 5
15 35 3 2 17874 5
16 47 4 2 400 5
17 51 4 2 22216 5
18 56 5 1 26000 6
19 57 6 1 100267 7
20 34 1 1 15000 5
[Figure: bar charts and a pie chart of education level (no HS, some HS, HS diploma, some college, Bachelor's, postgrad) for people aged 25 to 34; vertical axis: percent (0–30); pie shares: 3.6%, 6.7%, 7.5%, 22.7%, 29.1%, 30.4%.]
3. Write each leaf next to its stem, in increasing order out from the stem.
1992  8
1993 33
1994 25
1995 36
1996 40
1997 36
1998 66
1999 63
2000 50
2001 64
2002 49
2003 40
[Figure: density histogram of the home-run counts (horizontal axis: home runs, 0–70; vertical axis: density, 0–.03).]
The area of each bar is proportional to the percentage of data in that range.
We care about the area, not the height, but when the bars have equal width,
the area is determined by the height.
For simplicity, use equally spaced bins.
[Figure: four density histograms of the home-run data with different bin choices; vertical axis: Percentage (0.00–0.07); horizontal axis: home runs, 0–70.]
[Figure: frequency histogram (counts, 0–5) and density histogram (density, 0–.03) of home runs, 0–70.]
◦ Describe the overall pattern and any significant deviations from that
pattern.
◦ Shape: Is the distribution (approximately) symmetric or skewed?
[Figure: histogram of x (frequency up to 2000) illustrating a skewed distribution with a long tail, and a histogram of values from −60 to 60 (horizontal axis labeled Time).]
Example: Average retail price of gasoline from Jan 1988 to Apr 2001
[Figure: time plot of retail gasoline price (roughly 0.9 to 1.8) from Jan 1988 to Apr 2001.]
Note: Whenever data are collected over time, it is a good idea to have
a time plot. Stemplots and histograms ignore time order, which can be
misleading when systematic change over time exists.
The mean
The mean of a distribution is the arithmetic average of the obser-
vations:
        x̄ = (x1 + · · · + xn)/n = (1/n) Σ_{i=1}^{n} xi
The median
The median is the midpoint of a distribution: the number M
such that
◦ half the observations are smaller and
◦ half are larger.
1. Arrange the data in increasing order and let x(i) denote the ith
smallest observation.
2. If the number of observations n is odd, the median is the center
observation in the ordered list:
M = x((n+1)/2)
3. If n is even, the median is the average of the two center observations:
M = (x(n/2) + x(n/2+1))/2
Examples:
Data set 1:
x1 x2 x3 x4 x5 x6 x7 x8 x9
2 4 3 4 6 5 4 -6 5
M = x((n+1)/2) = x(5) = 4.
Data set 2:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
2.3 8.8 3.9 4.1 6.4 5.9 4.2 2.9 1.3 5.1
M = (x(5) + x(6))/2 = (4.1 + 4.2)/2 = 4.15.
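As a quick check, Python's standard-library statistics module implements exactly this rule (a sketch; the data are the two sets above):

```python
import statistics

# Data set 1 (n = 9, odd): the median is the 5th smallest observation.
data1 = [2, 4, 3, 4, 6, 5, 4, -6, 5]
m1 = statistics.median(data1)

# Data set 2 (n = 10, even): the median is the average of the 5th and
# 6th smallest observations, (4.1 + 4.2)/2 = 4.15.
data2 = [2.3, 8.8, 3.9, 4.1, 6.4, 5.9, 4.2, 2.9, 1.3, 5.1]
m2 = statistics.median(data2)
```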
Example:
[Figure: histograms of daily returns (in %, −10 to 20) for two stocks; vertical axis: frequency (0–30).]
Stock A Stock B
Mean 4.95 4.82
Median 4.99 4.68
The distributions of the two stocks have approximately the same
mean and median, but stock B is more volatile and thus more risky.
2. Find the median of the observations to the left of M: this is the
lower quartile QL. (The median of the observations to the right of M
is the upper quartile QU.)
Examples:
Data set:
x1 x2 x3 x4 x5 x6 x7 x8 x9
2 4 3 4 6 5 4 -6 5
Five-number summary
x(1) QL M QU x(n)
How to draw a boxplot (box-and-whisker plot)
1. A box (the box) is drawn from the lower to the upper quartile (QL and QU).
2. The median of the data is shown by a line in the box.
3. Lines (the whiskers) are drawn from the ends of the box to the most
extreme observations within a distance of 1.5 IQR (interquartile range).
[Figure: example boxplot, values from −10 to 20.]
Why n − 1?
Division by n − 1 instead of n in the variance calculation is a
common cause of confusion. Why n − 1? Note that
        Σ_{i=1}^{n} (xi − x̄) = 0
i=1
A new idea:
If the pattern is sufficiently regular, approximate it with a
smooth curve.
Any curve that is always on or above the horizontal axis and has
total area underneath equal to one is a density curve.
◦ Area under the curve in a range of values indicates the propor-
tion of values in that range.
◦ Come in a variety of shapes, but the “normal” family of familiar
bell-shaped densities is commonly used.
◦ Remember the density is only an approximation, but it sim-
plifies analysis and is generally accurate enough for practical
use.
[Figure: density histogram of sulfur oxide (in tons, 0–40) with an approximating density curve; the shaded area of the histogram over a range is 0.29, close to the corresponding area under the curve.]
[Figure: density curve of waiting time between eruptions (min), 40–100.]
Median:
The equal-areas point with 50% of the “mass” on either side.
Mean:
The balancing point of the curve, if it were a solid mass.
Note:
◦ The mean and median of a symmetric density curve are equal.
◦ The mean of a skewed curve is pulled away from the median in
the direction of the long tail.
Note: The point where the curve changes from concave to convex
is σ units from µ in either direction.
X ∼ N (µ, σ) ⇒ a X + b ∼ N (a µ + b, |a| σ)
For example:
◦ What is the proportion of N (0, 1) observations less than 1.2?
◦ What is the proportion of N (3, 1.5) observations greater than 5?
◦ What is the proportion of N (10, 5) observations between 3 and 9?
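These three proportions can be checked numerically; a sketch using the standard-library NormalDist class (Python 3.8+), with the second parameter being the standard deviation σ:

```python
from statistics import NormalDist

# Proportion of N(0, 1) observations less than 1.2:
p1 = NormalDist(0, 1).cdf(1.2)

# Proportion of N(3, 1.5) observations greater than 5:
p2 = 1 - NormalDist(3, 1.5).cdf(5)

# Proportion of N(10, 5) observations between 3 and 9:
d = NormalDist(10, 5)
p3 = d.cdf(9) - d.cdf(3)
```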
[Figure: three normal quantile-quantile plots of sample quantiles against theoretical N(0, 1) quantiles.]
The normal density is just one possible density curve. There are
many others, some with compact mathematical formulas and many
without.
[Figure: a non-normal density curve, 0–40.]
Frequency histogram:
⋄ counts on the vertical axis
[Figure: frequency histogram and density histogram of Sosa home runs, 0–70.]
[Figure: density histograms of N(0, 1) samples of size n = 250, n = 2500, and n = 250000, together with the limiting density curve as n → ∞.]
As n → ∞, the proportion of observations in an interval converges to the corresponding area under the density curve:
        #{xi : 1 < xi ≤ 2}/n  →  ∫₁² f(x) dx
[Figure: scatterplot of mortality (index, 60–140) against smoking.]
In STATA:
. insheet using smoking.txt
. graph twoway scatter mortality smoking
Direction of relationship/association:
Strength of relationship/association:
Properties:
◦ dimensionless quantity
◦ not affected by linear transformations with positive slopes:
for x′i = a xi + b and y′i = c yi + d with a, c > 0,
rx′y′ = rxy
◦ −1 ≤ rxy ≤ 1
◦ rxy = 1 if and only if yi = a xi + b for some a > 0 and b
◦ measures linear association between xi and yi
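A sketch illustrating these properties numerically; the helper corr and the simulated data below are illustrative, not part of the notes:

```python
import random

def corr(xs, ys):
    # Pearson correlation coefficient r_xy.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200)]
ys = [x + random.gauss(0, 0.5) for x in xs]

r = corr(xs, ys)
# Positive linear transformations leave r unchanged:
r2 = corr([2 * x + 3 for x in xs], [5 * y - 1 for y in ys])
# An exact linear relation y = a x + b with a > 0 gives r = 1:
r3 = corr(xs, [4 * x + 2 for x in xs])
```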
[Figure: eight scatterplots of y against x illustrating correlations ρ = −0.9, −0.6, −0.3, 0, 0.3, 0.6, 0.9, and 0.99.]
Examples:
◦ Hearing difficulties:
response - sound level (decibels), explanatory - age (years)
◦ Real estate market:
response - listing price ($), explanatory - house size (sq. ft.)
◦ Salaries:
response - salary ($), explanatory - experience (years), educa-
tion, sex
[Figure: scatterplot of food expenditure (0–20) against income (0–120).]
Questions:
◦ How does food expenditure (Y ) depend on income (X)?
◦ Suppose we know that X = x0, what can we tell about Y ?
Linear regression:
If the response Y depends linearly on the explanatory variable
X, we can use a straight line (regression line) to predict Y
from X.
[Figure: scatterplot of food expenditure against income with the regression line; for one observation the observed y, the predicted ŷ, and the difference y − ŷ are marked.]
Ŷ = a + b X
where
        b = r · sy/sx   and   a = ȳ − b x̄
X 42 58 28 20 42 47 112 85 31 26
Y 4.9 11.8 5.2 4.8 7.9 6.4 20.0 13.7 5.1 2.9
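Applying the formulas for b and a to the income/food-expenditure data above (a sketch in Python; variable names are illustrative):

```python
# Income (X) and food expenditure (Y) from the table above.
X = [42, 58, 28, 20, 42, 47, 112, 85, 31, 26]
Y = [4.9, 11.8, 5.2, 4.8, 7.9, 6.4, 20.0, 13.7, 5.1, 2.9]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
sx = (sum((x - xbar) ** 2 for x in X) / (n - 1)) ** 0.5
sy = (sum((y - ybar) ** 2 for y in Y) / (n - 1)) ** 0.5
r = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / ((n - 1) * sx * sy)

b = r * sy / sx          # slope
a = ybar - b * xbar      # intercept
```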
[Figure: scatterplot of the data with the fitted regression line (income 0–120, food expenditure 0–20).]
Correlation analysis:
We are interested in the joint distribution of two (or more)
quantitative variables.
[Figure: scatterplot of son's height against father's height (inches, 58–80).]
Regression analysis:
We are interested in how the distribution of one response variable
depends on one (or more) explanatory variables.
[Figure: scatterplot of son's height against father's height (inches, 58–80), with the conditional densities of son's height given father's height = 68 inches and given father's height = 72 inches shown alongside.]
[Figure: the same scatterplot with the fitted line ŷ = a + bx; for each father's height, the points are distributed around the line.]
Regression effect
In virtually all test-retest situations, the bottom group on the
first test will on average show some improvement on the second
test, and the top group will on average fall back. This is the
regression effect. The statistician and geneticist Sir Francis
Galton (1822-1911) called this effect “regression to mediocrity”.
[Figure: scatterplot of son's height against father's height with the regression line, illustrating the regression effect.]
Regression fallacy
Thinking that the regression effect must be due to something
important, not just the spread around the line, is the regression
fallacy.
[Figure: scatterplot of food expenditure (0–20) against income (0–120) with the fitted regression line.]
This graph has been generated using the graphical user interface of STATA.
The complete command is:
. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
> (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
> ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
> labsize(medlarge)) xtitle(Income, size(large)) xscale(range(0 120))
> xlabel(0(20)120, labsize(medlarge)) legend(off) ysize(2) xsize(3)
ei = observed y − predicted y
= yi − ŷi
= yi − (a + b xi)
For a least squares regression, the residuals always have mean zero.
Residual plot
A residual plot is a scatterplot of the residuals against the
explanatory variable. It is a diagnostic tool to assess the fit of
the regression line.
Patterns to look for:
◦ Curvature indicates that the relationship is not linear.
◦ Increasing or decreasing spread indicates that the prediction
will be less accurate in the range of explanatory variables where
the spread is larger.
◦ Points with large residuals are outliers in the vertical direc-
tion.
◦ Points that are extreme in the x direction are potential high
influence points.
Influential observations are individuals with extreme x values
that exert a strong influence on the position of the regression line.
Removing them would significantly change the regression line.
[Figure: four examples of scatterplots of Y against X with the corresponding residual plots (residuals against fitted values and against X): a well-fitting linear model, curvature, an outlier, and increasing spread (heteroscedasticity).]
[Figure: scatterplot of birth rate (per 1,000 women, 12–21) against number of storks (0–5).]
Y = bX + ε
Least squares regression yields for the slope of the regression line
b̂ = 4.3 ± 0.2.
[Diagram: possible explanations for an association between X and Y: X causes Y; X and Y are both influenced by a hidden variable Z.]
Confounding:
Response and explanatory variable both depend on a third
(hidden) variable.
Controlled experiments:
A cause-effect relationship between two variables X and Y can be
established by conducting an experiment where
◦ the values of X are manipulated and
◦ the effect on Y is observed.
Deaths in the first five years of the screening trial, by cause. Rates per
1,000 women.
Cause of Death
Breast cancer All other
Number of persons Number Rates Number Rates
Treatment group 31,000 39 1.3 837 27
Examined 20,200 23 1.1 428 21
Refused 10,800 16 1.5 409 38
Control group 31,000 63 2.0 879 28
Questions:
◦ Does screening save lives?
◦ Why is the death rate from all other causes in the whole treatment
group (“examined” and “refused” combined) about the same as the
rate in the control group?
◦ Why is the death rate from all other causes higher for the “refused”
group than the “examined” group?
◦ Breast cancer (like polio, but unlike most other diseases) affects the
rich more than the poor. Which numbers in the table confirm this
association between breast cancer and income?
◦ The death rate (from all causes) among women who accepted screening
is about half the death rate among women who refused. Did screening
cut the death rate in half? If not, what explains the difference in death
rates?
◦ To show that screening reduces the risk from breast cancer, someone
wants to compare 1.1 and 1.5. Is this a good comparison? Is it biased
against screening? For screening?
Situation:
Population of N individuals (or items)
e.g. ◦ students at this university
◦ light bulbs produced by a company on one day
Statistical approach:
◦ collect information from part of the population (sample)
◦ use information on sample to draw conclusions on whole pop-
ulation
Questions:
◦ How to choose a sample?
◦ What conclusions can be drawn?
Sample of length n:
Idea: Ask 20 students about the amount they have spent and take
the average.
The value we obtain will vary from sample to sample, that is, if we
asked another 20 students we would get a different answer.
Sampling distribution
The sampling distribution of a statistic is the distribution of
all values taken by the statistic if evaluated for all possible
samples of size n taken from the same population.
Example:
Consider a population of 20 students who spent the following
amounts on books:
x̃1 x̃2 x̃3 x̃4 x̃5 x̃6 x̃7 x̃8 x̃9 x̃10 x̃11 x̃12 x̃13 x̃14 x̃15
100 120 150 180 200 220 220 240 260 280 290 300 310 350 400
[Figure: sampling distributions of the sample mean x̄ = (1/n) Σ_{i=1}^{n} xi for sample sizes (a) n = 2 (σ = 55.42), (b) n = 3 (σ = 43.38), and (c) n = 4 (σ = 35.97); the spread decreases as n grows. Horizontal axis: x̄, 0–400; vertical axis: frequency (%), 0–12.]
Example:
Suppose we are interested in the amount of money students at this
university have spent on books last quarter.
• Undercoverage
◦ occurs when some groups in the population are left out of
the process of choosing the sample
◦ no accurate list of the population
◦ results in bias if this group differs from the rest of the
population
• Nonresponse
◦ occurs when a chosen individual cannot be contacted or
does not cooperate
◦ results in bias if this group differs from the rest of the
population
• Response bias
◦ subjects may not want to admit illegal or unpopular be-
haviour
◦ subjects may be affected by the interviewer's appearance or
tone
◦ subjects may not remember correctly
• Question wording
◦ confusing or leading questions can introduce strong bias
◦ do not trust sample survey results unless you have read the
exact questions posed
SRS or Not?
Is each of the following samples an SRS or not?
◦ A deck of cards is shuffled, and the top five dealt.
◦ A sample of Illinois residents is drawn by choosing all the resi-
dents in each of 100 census blocks (in such a way that each set
of 100 blocks is equally likely to be chosen)
◦ A telephone survey is conducted by dialing telephone numbers
at random (i.e. each valid phone number is equally likely).
◦ A sample of 10% of all students at the University of Chicago is
chosen by numbering the students 1, . . . , N , drawing a random
integer i from 1 to 10, and taking every tenth student begin-
ning with i.
(E.g. if i = 5, students 5, 15, 25, . . . are chosen.)
Example:
◦ Population: Students at this university
◦ Objective: Amount of money spent on books this quarter
◦ Knowledge: Students in e.g. humanities spend more money on
books
Experiment:
Toss a die and observe the number on the face up.
Example:
Suppose that of 100 applicants for a job 50 were women and 50
were men, all equally qualified. Further suppose that the company
hired 2 women and 8 men.
N1 · · · Nk .
Example:
If you toss a die 5 times, the number of possible results is 65 = 7776.
Example:
If you select 5 cards in order from a card deck of 64, the number
of possible results is 64 · 63 · 62 · 61 · 60 = 914, 941, 440.
Counting, Jan 21, 2003 -3-
Permutations and Combinations
Example:
If you select 5 cards from a card deck of 64, you are typically only
interested in the cards you have, not in the order in which you
received them. How many different combinations of 5 cards out
of 64 are there?
Permutation:
Let (x1, . . . , xn) be a sequence. A permutation of this sequence is
any rearrangement of the elements without losing or adding any
elements, that is, any new sequence
(xi1 , . . . , xin )
n · (n − 1) · · · 1 = n!.
Example (contd):
The number of different sequences of 5 fixed cards is 5! = 5 · 4 · 3 ·
2 · 1 = 120.
Recall that
◦ The number of different sequences of length n that can be cho-
sen from N distinct elements are
N!
.
(N − n)!
◦ The number of permutations of any sequence of length n is n!.
Since all n! permuted (ordered) sequences (x1, . . . , xn) lead to the same
(unordered) combination {x1, . . . , xn}, we divide the number of ordered
sequences by the number of permutations.
Example:
If you select 5 cards from a card deck of 64, you are typically only
interested in the cards you have, not in the order in which you
received them. How many different combinations of 5 cards out
of 64 are there?
The answer is
        (64 choose 5) = (64 · 63 · 62 · 61 · 60)/(5 · 4 · 3 · 2 · 1) = 914,941,440/120 = 7,624,512.
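A quick check of both counts with the standard-library math module (math.perm and math.comb, Python 3.8+):

```python
import math

# Ordered draws of 5 cards from 64: 64!/(64 - 5)!
ordered = math.perm(64, 5)

# Each unordered hand corresponds to 5! = 120 orderings.
hands = math.comb(64, 5)
```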
Example:
Recall the example with the 100 applicants for a job. The number
of ways to choose
◦ 2 women out of 50 is (50 choose 2).
◦ 8 men out of 50 is (50 choose 8).
◦ 10 applicants out of 100 is (100 choose 10).
Thus the chance of this event is
        (50 choose 2)(50 choose 8)/(100 choose 10) ≈ 0.038
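The same probability, computed directly with math.comb (a sketch; the exact value rounds to about 0.038):

```python
import math

# P(2 women and 8 men among 10 hires drawn at random from 50 + 50)
p = math.comb(50, 2) * math.comb(50, 8) / math.comb(100, 10)
```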
Classical Concept:
◦ requires finitely many and equally likely outcomes
◦ probability of event defined as number of favorable outcomes
(s) divided by number of total outcomes (N):
s
Probability of event =
N
◦ can be determined by counting outcomes
Example:
Suppose we collect data on the weather in Chicago on Jan 21 and
we note that in the past 124 years it snowed on Jan 21 in 34 of those
years, that is, 34/124 · 100% ≈ 27.4% of the time.
Thus we would estimate the probability of snowfall on Jan 21 in
Chicago as 0.274.
[Figure: relative frequency of heads against the number of tosses, shown over 1–1,000 tosses, 1,000–100,000 tosses, and 100,000–1,000,000 tosses; the relative frequency settles near 0.5.]
P(A) = 3/10.
Remark:
Strictly speaking, we can define the above probability only on a set A of subsets A ⊆ S;
this set A, however, covers all subsets that are relevant for this class.
In the case of finite or countably infinite sample spaces S there are no such exceptions,
and A covers all subsets of S.
P(S) = 1.
Axiom 3 (Addition Rule): If two events A and B are mutually exclusive, then
        P(A ∪ B) = P(A) + P(B).
For equally likely outcomes, P(A) = #A/#S, and indeed
        P(A ∪ B) = #(A ∪ B)/#S = #A/#S + #B/#S = P(A) + P(B).
P(A) = 1/6
and
P(B) = 1/3.
The addition rule yields
        P(A ∪ B) = 1/6 + 1/3 = 3/6 = 1/2.
On the other hand we get for C = A ∪ B = {1, 4, 5}
        P(C) = 3/6 = 1/2.
0 ≤ P(A) ≤ 1.
In particular
◦ P(∅) = 0
◦ P(S) = 1
Partition rule:
P(A) = P(A ∩ B) + P(A ∩ B ∁)
Example: Roll a pair of fair dice
P(Total of 10)
= P(Total of 10 and double) + P(Total of 10 and no double)
= 1/36 + 2/36 = 3/36 = 1/12
Complementation rule:
P(A∁) = 1 − P(A)
Example: Often useful for events of the type “at least one”:
Containment rule
P(A) ≤ P(B) for all A ⊆ B
Example: Compare two aces with doubles,
P(Total of 10 or double)
= P(Total of 10) + P(Double) − P(Total of 10 and double)
= 3/36 + 6/36 − 1/36 = 8/36 = 2/9
Double = {11, 22, 33, 44, 55, 66}
The intersection is
P(A|B) = P(A ∩ B)/P(B),   if P(B) > 0
Calculate probabilities:
◦ P(die from accident) = 0.04281 · 0.00873 = 0.00037
◦ P(die from accident|age = 10) = 0.42423 · 0.00090 = 0.00038
◦ P(die from accident|age = 40) = 0.17832 · 0.00178 = 0.00031
◦ P(die from HIV) = 0.01473 · 0.00873 = 0.00013
◦ P(die from HIV|age = 10) = 0.02055 · 0.00090 = 0.00002
◦ P(die from HIV|age = 40) = 0.15308 · 0.00178 = 0.00027
General multiplication rule
P(2nd die = 1) = 1/6
◦ What is the probability that the second die shows 1 if the first
die already shows 1?
P(A|B) = P(A).
Equivalently, A and B are independent if
P(A ∩ B) = P(A)P(B)
Otherwise we say A and B are dependent.
The Rules:
◦ Three doors - one prize, two blanks
◦ Candidate selects one door
◦ Showmaster reveals one losing door
◦ Candidate may switch doors
[Diagram: three doors, numbered 1, 2, 3.]
Events of interest:
◦ A - choose winning door at the beginning
◦ W - win the prize
PS(W) = PS(W ∩ A) + PS(W ∩ A∁)
      = PS(W|A)PS(A) + PS(W|A∁)PS(A∁)
      = 0 · 1/3 + 1 · 2/3 = 2/3
PN(W) = PN(W ∩ A) + PN(W ∩ A∁)
      = PN(W|A)PN(A) + PN(W|A∁)PN(A∁)
      = 1 · 1/3 + 0 · 2/3 = 1/3
Example:
Suppose an applicant for a job has been invited for an interview.
The chance that
◦ he is nervous is P(N ) = 0.7,
◦ the interview is successful if he is nervous is P(S|N ) = 0.2,
◦ the interview is successful if he is not nervous is P(S|N ∁) = 0.9.
Example:
Suppose we have two unfair coins:
◦ Coin 1 comes up heads with probability 0.8
◦ Coin 2 comes up heads with probability 0.35
Choose a coin at random and flip it. What is the probability of its
being a head?
Events: H=“heads comes up”, C1=“1st coin”, C2=“2nd coin”
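A sketch of the computation, using the law of total probability for P(H) and Bayes’ theorem for P(C1|H); the variable names are illustrative:

```python
p_c1, p_c2 = 0.5, 0.5          # coin chosen at random
p_h_c1, p_h_c2 = 0.8, 0.35     # P(H | C1), P(H | C2)

# Law of total probability:
p_h = p_h_c1 * p_c1 + p_h_c2 * p_c2

# Bayes' theorem: which coin was it, given that heads came up?
p_c1_h = p_h_c1 * p_c1 / p_h
```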
P(B|A)P(A) = P(A|B)P(B)
Bayes’ Theorem
        P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B∁)P(B∁)]
X: S → ℝ,   ω ↦ x = X(ω)
◦ SX = X(S) is the sample space of the random variable.
◦ The outcome x = X(ω) is called realisation of X.
◦ X induces a probability P (B) = P(X ∈ B) on SX , the prob-
ability distribution of X
Example: Roll one die
Outcome ω 1 2 3 4 5 6
Realization X(ω) 1 2 3 4 5 6
Table of outcomes:
Y = X1 + X2
Consequently,
        P(X ∈ [a, b]) = b − a.
P(A ∩ B) = P(A)P(B)
Independence of Random Variables
Two discrete random variables X and Y are independent if
        P(X = x, Y = y) = P(X = x) P(Y = y) for all x and y.

x\y      0     1     2     3   | pX(x)
0       1/8   2/8   1/8    0   |  1/2
1        0    1/8   2/8   1/8  |  1/2
pY(y)   1/8   3/8   3/8   1/8  |   1

pY(0) = P(Y = 0)
      = P(Y = 0, X = 0) + P(Y = 0, X = 1)
      = 1/8 + 0 = 1/8
pY(1) = P(Y = 1)
      = P(Y = 1, X = 0) + P(Y = 1, X = 1)
      = 2/8 + 1/8 = 3/8
...
Examples:
Often: Two outcomes which are not equally likely:
◦ Success of medical treatment
◦ Interviewed person is female
◦ Student passes exam
◦ Transmittance of a disease
Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
Result: 1 1 1 1 0 1 1 1 1 1
Y = X1 + . . . + Xn
Y is the number of plants for which the number of lost hours has
decreased after the installation of the safety program
We know:
◦ Xi is Bernoulli distributed with parameter θ
◦ Xi’s are independent
Y ∼ Bin(n, θ).
Note that
◦ the number of trials is fixed,
◦ the probability of success is the same for each trial, and
◦ the trials are independent.
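A sketch of the binomial frequency function; the choice θ = 0.5 below is a hypothetical “no effect” value for the safety-program example, not taken from the data:

```python
import math

def binom_pmf(x, n, theta):
    # P(Y = x) for Y ~ Bin(n, theta)
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)

# The frequency function sums to one:
total = sum(binom_pmf(x, 10, 0.5) for x in range(11))

# Chance that 9 or more of 10 plants improve if improvement were
# pure coin-flipping (theta = 0.5, a hypothetical "no effect" value):
p_tail = binom_pmf(9, 10, 0.5) + binom_pmf(10, 10, 0.5)
```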
Y ∼ Bin(n, θ)
[Figure: frequency functions p(x) of Bin(10, θ) for θ = 0.1, 0.3, 0.5, and 0.8 (x = 0, . . . , 10).]
Example:
Suppose a batter has probability 1/3 to hit the ball. What is the chance that
he misses the ball less than 3 times?
Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men,
all equally qualified. If we select 10 applicants at random what is the
probability that x of them are female?
The number of chosen female applicants is hypergeometrically distributed
with parameters 100, 50, and 10. The frequency function is
        p(x) = (50 choose x)(50 choose 10 − x)/(100 choose 10)
for x = 0, 1, . . . , 10.
Characteristics
Let X be the number of times a certain event occurs during a given
unit of time (or in a given area, etc).
◦ The probability that the event occurs in a given unit of time is
the same for all the units.
◦ The number of events that occur in one unit of time is inde-
pendent of the number of events in other units.
◦ The mean (or expected) rate is λ.
Statistical model:
◦ Each soldier is kicked to death by a horse with probability θ.
◦ Let Y be the number of such fatalities in one corps. Then
Y ∼ Bin(n, θ)
[Figure: frequency functions of Bin(n, θ) for θ = 1/400, 1/40, 1/8, and 1/4 (left column) compared with Poisson frequency functions for λ = 1/10, 1, 5, and 10 (right column), for x = 0, . . . , 20.]
Uniform distribution U(0, θ)
Range: (0, θ)
f(x) = (1/θ) 1(0,θ)(x)
E(X) = θ/2
var(X) = θ²/12
[Figure: histogram of a U(0, 1) sample.]

Exponential distribution Exp(λ)
Range: [0, ∞)
f(x) = λ exp(−λx) 1[0,∞)(x)
E(X) = 1/λ
var(X) = 1/λ²
[Figure: histogram of an Exp(λ) sample.]

Normal distribution N(µ, σ²)
Range: (−∞, ∞)
f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
E(X) = µ
var(X) = σ²
[Figure: histogram of a normal sample.]
p(x) = θ^x (1 − θ)^(1−x)
E(X) = 0 · (1 − θ) + 1 · θ = θ.
Proof:
E(a X + b Y) = Σ_{x,y} (a x + b y) p(x, y)
             = a Σ_{x,y} x p(x, y) + b Σ_{x,y} y p(x, y)
             = a Σ_x x pX(x) + b Σ_y y pY(y)
               (since Σ_y p(x, y) = pX(x) and Σ_x p(x, y) = pY(y))
             = a E(X) + b E(Y)
E(X) = Σ_{x=0}^{∞} x (λ^x/x!) e^{−λ}
     = λ e^{−λ} Σ_{x=1}^{∞} λ^{x−1}/(x − 1)!
     = λ e^{−λ} e^{λ}
     = λ
Remarks:
◦ For most distributions some “advanced” knowledge of calculus
is required to find the mean.
◦ Use tables for the means of commonly used distributions.
P = E(ST − K)+.
ST = s0 + 2 Y − T, with Y ∼ Bin(T, p)
P = 2.75
[Figure: frequency function p(x) of the profit, taking the values −2.75, −0.75, 1.25, . . . , 19.25.]
E(N) = Σ_{i=1}^{m} E(Xi) = n(1/k + 1 − (1 − p)^k)
[Figure: plot of E(N) (as a proportion, 0.30–0.45) against k = 2, . . . , 16.]
Variance of X:
        var(X) = E[(X − E(X))²].
We often denote the variance of a random variable X by σ²X:
        σ²X = var(X)
var(aX + b) = a2var(X).
var(X) = n θ (1 − θ)
where
cov(X, Y ) = E[(X − E(X))(Y − E(Y))]
Important:
cov(X, Y ) = 0 does NOT imply that X and Y are independent.
Example:
Suppose X ∈ {−1, 0, 1} with probabilities P(X = x) = 1/3 for
x = −1, 0, 1. Then E(X) = 0 and, for Y = X²,
cov(X, Y ) = E(XY) − E(X)E(Y) = E(X³) = 0,
although X and Y are clearly dependent.
Properties:
◦ dimensionless quantity
◦ not affected by linear transformations with positive slopes, i.e.
corr(a X + b, c Y + d) = corr(X, Y ) for a, c > 0
◦ −1 ≤ ρXY ≤ 1
◦ ρXY = 1 if and only if P(Y = a + b X) = 1 for some b > 0 and a
◦ measures linear association between X and Y
Solution:
        a = µ − b µ   and   b = σXY/σX² = ρ
Thus the best linear predictor is
        Ŷ = µ + ρ (X − µ)
Note:
We expect the student’s score on the final to differ from the mean
only by half the difference observed in the midterm (regression to
the mean).
p(x) = (λ^x/x!) e^{−λ}
E(X) = λ
var(X) = λ
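A numerical sanity check of E(X) = λ and var(X) = λ (a sketch; the series is truncated at x = 100, where the remaining terms are negligible for λ = 5):

```python
import math

def poisson_pmf(x, lam):
    # P(X = x) for X ~ Poisson(lam)
    return lam**x / math.factorial(x) * math.exp(-lam)

lam = 5
mean = sum(x * poisson_pmf(x, lam) for x in range(100))
var = sum((x - lam) ** 2 * poisson_pmf(x, lam) for x in range(100))
```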
Geometric distribution
Question:
◦ How close to µ is the sample mean for finite n?
◦ Can we answer this without knowing the distribution of X?
Chebyshev’s inequality
Let X be a random variable with mean µ and variance σ².
Then for any ε > 0
        P(|X − µ| > ε) ≤ σ²/ε².
Proof: Let
        1{|xi − µ| > ε} = 1 if |xi − µ| > ε, 0 otherwise.
Then
        P(|X − µ| > ε) = Σ_{i=1}^{n} 1{|xi − µ| > ε} p(xi)
                       ≤ Σ_{i=1}^{n} ((xi − µ)²/ε²) p(xi) = σ²/ε²
Therefore
        P(µ − 3σ/√n ≤ X̄ ≤ µ + 3σ/√n) ≈ 0.997
that is, the area under the standard normal curve to left of z.
Example:
◦ U1, . . . , U12 uniformly distributed on [0, 12).
◦ What is the probability that the sample mean exceeds 9?
        P(Ū > 9) = P( (Ū − 6)/(√12/√12) > 3 ) ≈ 1 − Φ(3) = 0.0013
[Figure: densities f(x) of the standardized sample mean for samples from U[0, 1] (left column) and Exp(1) (right column) for increasing sample sizes, ending with n = 100; both columns approach the N(0, 1) density.]
Then
P(T > 1700 lb) = P( (T − 1500 lb)/(√100 · 10 lb) > (1700 lb − 1500 lb)/(√100 · 10 lb) )
              = P( (T − 1500 lb)/(√100 · 10 lb) > 2 )
              ≈ 1 − Φ(2) = 0.023
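The same tail probability via the standard-library NormalDist (a sketch; by the CLT, T is approximately normal with mean 1500 lb and standard deviation √100 · 10 lb = 100 lb):

```python
from statistics import NormalDist

sd_T = (100 ** 0.5) * 10           # 100 lb
p = 1 - NormalDist(1500, sd_T).cdf(1700)   # 1 - Phi(2)
```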
Remarks
• How fast approximation becomes good depends on distribution
of Xi’s:
◦ If it is symmetric and has tails that die off rapidly, n can
be relatively small.
iid
Example: If Xi ∼ U [0, 1], the approximation is good for
n = 12.
◦ If it is very skewed or if its tails die down very slowly, a
larger value of n is needed.
Example: Exponential distribution.
• Central limit theorems are very important in statistics.
• There are many central limit theorems covering many situa-
tions, e.g.
◦ for not identically distributed random variables or
◦ for dependent, but not “too” dependent random variables.
where
        Z ∼ N(np, np(1 − p)).
[Figure: frequency functions of Bin(n, p) with the approximating normal density overlaid, for increasing n (x = 0, . . . , 20); the approximation improves as n grows.]
P(X · 1 m − 30 m > 10 m)
= P(X > 40)
≈ P(Z > 39.5),   Z ∼ N(30, 15)
= P( (Z − 30)/√15 > 9.5/√15 )
= 1 − Φ(2.452) = 0.007
Id Y1 Y2 Y3 Id Y1 Y2 Y3
1 270 218 156 15 294 240 264
2 236 234 193 16 282 294 220
3 210 214 242 17 234 220 264
4 142 116 120 18 224 200 213
5 280 200 181 19 276 220 188
6 272 276 256 20 282 186 182
7 160 146 142 21 360 352 294
8 220 182 216 22 310 202 214
9 226 238 248 23 280 218 170
10 242 288 298 24 278 248 198
11 186 190 168 25 288 278 236
12 266 236 236 26 288 248 256
13 206 244 238 27 244 270 280
14 318 258 200 28 236 242 204
Data:
Recall:
◦ CLT for the sample mean: For large n we have
        X̄ ≈ N(µ, σ²/n)
◦ 68-95-99.7 rule: With 95% probability the sample mean differs from
µ by less than two standard deviations.
[18.00, 55.78].
◦ Φ(zα) = 1 − α
[Figure: standard normal density f(x) with the upper-tail area α to the right of zα shaded.]
[Figure: histogram of total assets (in millions of dollars, 0–1,000) of community banks.]
Suppose we want to give a 95% confidence interval for the mean total assets
of all community banks in the United States.
◦ α = 0.05, zα/2 = 1.96
A 95% confidence interval for the mean assets (in millions of dollars) is
        [220 − 1.96 · 161/√110, 220 + 1.96 · 161/√110] ≈ [190, 250].
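The interval endpoints, computed directly (a sketch in Python; variable names are illustrative):

```python
n, xbar, s = 110, 220, 161
z = 1.96                        # z_{alpha/2} for alpha = 0.05

half_width = z * s / n ** 0.5   # about 30.1
ci = (xbar - half_width, xbar + half_width)
```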
◦ n = 110
◦ µ̂LTDR = 76.7
◦ σ̂LTDR = 12.3
[Figure: histogram of LTDR (in %), 50–120; vertical axis: frequency.]
Note:
◦ Confidence intervals are random while the estimated parameter
is fixed.
◦ For repeated samples, only 95% of the confidence intervals will
cover the true parameter.
        T = (X̄ − µ)/(s/√n) ∼ tn−1
[Figure: density f(x) of the tn−1 distribution, −4 to 4.]
26 31 23 22 11 22 14 31
What is the 95% confidence interval for µ, the mean vitamin C content of
the CSB produced during this run?
◦ µ̂ = 22.5, σ̂ = 7.2, t7,0.025 = 2.36
◦ The 95% confidence interval for µ is
        [22.5 − 2.36 · 7.2/√8, 22.5 + 2.36 · 7.2/√8] = [16.5, 28.5].
◦ The large sample CI would be [17.5, 27.5].
[Figure: normal quantile plot of cholesterol levels (150–250) against normal quantiles.]
Example:
Suppose that of 100 applicants for a job 50 were women and 50 were men,
all equally qualified. Further suppose that the company hired 2 women
and 8 men.
Question:
◦ Does the company discriminate against female job applicants?
◦ How likely is this outcome under the assumption that the company
does not discriminate?
Example:
Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
Question:
◦ Does the safety program have an effect on the loss of labour due to
accidents?
◦ In 9 out of 10 plants the average weekly losses have decreased after
implementation of the safety program. How likely is this (or a more
extreme) outcome under the assumption that there is no difference
before and after implementation of the safety program?
Let θ be the probability that the coin lands heads, that is,
Decision problem:
Null hypothesis H0 : X ∼ Bin(100, 1/2)
Alternative hypothesis Ha : X ∼ Bin(100, θ), θ ≠ 1/2
The null hypothesis represents the default belief (here: the coin is fair).
The alternative is the hypothesis we accept in view of evidence against the
null hypothesis.
The data-based decision rule
reject H0 if X ∉ [40, 60]
do not reject H0 if X ∈ [40, 60]
[Three figures: pmf p(x) of Bin(100, 0.5), Bin(100, 0.6) and Bin(100, 0.7) for x = 20, …, 80, each marking the acceptance region [40, 60] and the rejection region of the test of H0 against Ha : p ≠ 0.5]
Note:
◦ If θ = 1/2 this is the probability of committing an error of type I:
1 − β(1/2) = α
◦ If θ > 1/2 this is the probability of correctly rejecting H0 .
[Figure: power function 1 − β(θ) for θ from 0 to 1]
[Figure: power functions 1 − β(θ) for the rejection rules X ∉ [40, 60], X ∉ [38, 62] and X ∉ [42, 58]]
Solution:
◦ choose fixed level α for probability of a type I error
◦ under this restriction find test with small probability of a type II error
Remark:
◦ you do not have to do this minimization yourself.
◦ all tests taught in this course are of this kind.
Definition
A test of this kind is called a significance test with significance level α.
Null hypothesis H0
default (current) theory which we try to falsify
Alternative hypothesis Ha
alternative to adopt if null hypothesis is rejected
Examples:
H0 : cP ≤ c0 vs Ha : cP > c0 .
Suppose that the company regularly also runs tests on the amount of pes-
ticide in the discharge water.
Question: Does the concentration cP of pesticide in the water exceed the
allowed maximum concentration c0 ?
◦ The aim of the company is to avoid fines for exceeding the allowed
level. Thus the company wants to make sure that the concentration
stays within the allowed limits.
Thus, the null hypothesis of the company should be that the pesticide
concentration cP exceeds c0 . The question now corresponds to the test
problem
H0 : cP ≥ c0 vs Ha : cP < c0 .
1. Test problem:
H0 : θ = 1/2 vs Ha : θ ≠ 1/2
2. Significance level:
α = 0.05 (most commonly used significance level)
3. Test statistic:
T =X (number of heads in 100 tosses of the coin)
4. Rejection criterion:
reject H0 if T ∉ [40, 60]
H0 : µY1 = µ0 vs Ha : µY1 ≠ µ0 .
Remark:
◦ More generally, we might be interested in one-sided test problems of
the form
H0 : θ ≤ θ0 against Ha : θ > θ0
H0 : θ = θ0 vs Ha : θ ≠ θ0 .
Let
◦ T = Tθ0 (X) be the test statistic of the test (depends on θ0)
◦ R be the critical region of the test
Then
C(X) = {θ : Tθ (X) ∉ R}
or equivalently
|X̄ − µ0 | > tn−1,α/2 · s/√n
Definition (P -value)
The probability that under the null hypothesis H0 the test statistic
would take a value as extreme as or more extreme than that actually
observed is called the P -value of the test.
◦ percent change in net income between first half of last year and first
half of this year
◦ sample mean x̄ = 8.1%
◦ sample standard deviation s = 26.4%
t109,0.025 = 1.982
Result:
◦ |t| > t109,0.025, therefore the test rejects H0 at significance level α = 0.05.
◦ Equivalently, µ0 = 0 ∉ [3.11, 13.09] and thus the test rejects H0 .
◦ Equivalently, P -value is less than α = 0.05 and thus the test rejects H0 .
Test problem:
H0 : θ = 1/2 vs Ha : θ ≠ 1/2.
Under the null hypothesis H0 , the distribution of X is known,
X ∼ Bin(100, 1/2).
Reject null hypothesis if
X ∉ [b100,0.5,0.975 , b100,0.5,0.025 ] = [40, 60].
Note:
◦ Exact binomial tests typically have smaller significance level α due to
discreteness of distribution.
◦ In the above example, the probability of a type I error is
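That tail probability can be computed exactly from the Bin(100, 1/2) pmf; a minimal sketch using only the binomial coefficients:

```python
from math import comb

# P(X = k) for X ~ Bin(100, 0.5)
def binom_pmf(k, n=100, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Type I error: probability of landing outside the acceptance region [40, 60]
alpha = sum(binom_pmf(k) for k in range(101) if not 40 <= k <= 60)
print(alpha)  # noticeably below the nominal 0.05, as the note says
```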
Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
Question:
◦ Does the safety program have an effect on the loss of labour due to accidents?
The Sign Test for matched pairs
Example:
For the safety program data, we find
◦ n = 10, X = 9
◦ Test H0 : θ = 1/2 against Ha : θ > 1/2
Exact tests:
Since X is binomially distributed, we can use exact binomial tests.
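For these data the exact one-sided P-value is a short computation; a minimal sketch:

```python
from math import comb

# Sign test for the safety-program data: 9 of 10 plants improved.
# Under H0: theta = 1/2, X ~ Bin(10, 1/2); one-sided P-value = P(X >= 9).
n = 10
p_value = sum(comb(n, k) for k in range(9, n + 1)) / 2**n
print(p_value)  # 11/1024, about 0.0107
```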
Plant 1 2 3 4 5 6 7 8 9 10
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
Di ∼ N(µ, σ²) iid
◦ H0 : µ = 0 against Ha : µ > 0
T = D̄/(s/√n)
[Normal quantile plot: decrease in losses of work (0–20) against normal quantiles (−1.5 to 1.5)]
Reject if T > tn−1,α
Result:
◦ ȳ = 10.27, s = 7.98, n = 10
◦ t = 4.07 and t9,0.01 = 2.82, P -value: 0.0014
◦ Test rejects H0 at significance level α = 0.01
Power of the paired sample t test and the paired sign test:
[Figure: power 1 − β(δ) against δ from 0 to 11 for the t test and the sign test]
t test:
◦ based on Central Limit Theorem
◦ reasonably robust against departures from normality
◦ do not use if n is small and
⋄ data are strongly skewed or
⋄ data have clear outliers
Sign test:
◦ uses much less information than t test
◦ for normal data less powerful than t test
◦ makes no assumption on the distribution, so it keeps its significance
level regardless of the distribution
◦ preferable for very small data sets
Remark:
◦ The two-step procedure
1. assess normality by normal quantile plot
2. conduct either t test or sign test depending on result in step 1
does not attain the chosen significance level α (two tests!).
◦ The sign test is rarely used since there are more powerful distribution-
free tests.
tween the treatment and control group.
H0 : µX = µY vs Ha : µX ≠ µY
[Box plots: treatment and control groups, 0–50]
Two-sample t test
◦ Two-sample t statistic
T = (X̄ − Ȳ) / √(sX²/m + sY²/n)
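The two-sample statistic can be sketched on small samples; the data below are made up purely for illustration:

```python
import math

def two_sample_t(x, y):
    """Two-sample t statistic with unpooled variances."""
    m, n = len(x), len(y)
    xb, yb = sum(x) / m, sum(y) / n
    sx2 = sum((v - xb) ** 2 for v in x) / (m - 1)
    sy2 = sum((v - yb) ** 2 for v in y) / (n - 1)
    return (xb - yb) / math.sqrt(sx2 / m + sy2 / n)

# Hypothetical samples, for illustration only
t = two_sample_t([2.1, 2.4, 2.0, 2.6], [1.7, 1.9, 1.6, 1.8, 2.0])
print(round(t, 2))
```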
◦ Parkinson’s disease, among other things, affects a person’s ability to speak
◦ Overall condition can be improved by an operation
◦ How does the operation affect the ability to speak?
◦ Treatment group: Eight patients received operation
◦ Control group: Fourteen patients
◦ Data:
⋄ scores on several tests
⋄ high scores indicate problems with speaking
[Box plots: speaking ability (1.5–3.0) for treatment (Treat.) and control (Contr.) groups]
Result: We cannot reject the null hypothesis that the variances are equal.
[Normal quantile plots: speaking ability (1.8–3.0) for the control group — are the data approximately normally distributed?]
Test problem:
H0 : p1 = p2 vs Ha : p1 ≠ p2
1. Would you favor or oppose a law that would require a person to obtain
a police permit before purchasing a gun?
2. Would you favor or oppose a law that would require a person to obtain
a police permit before purchasing a gun, or do you think such a law
would interfere too much with the right of citizens to own guns?
Question: Is the true proportion of people favoring the permit law the
same in both groups or not?
Problem:
◦ Power was generally low in the significance tests employed in the stud-
ies.
◦ Failure to reject H0 is no evidence that H0 is true.
◦ More careful studies showed that the size of a company and measures
of value such as ratio of stock price to earnings do help predict future
stock prices.
Example
Note: A low significance level does not mean there is a large difference,
but only that there is strong evidence that there is some difference.
◦ Observational study
◦ Comparison of brain cancer patients and similar group without brain
cancer
◦ No statistically significant association between cell phone use and a
group of brain cancers known as gliomas.
◦ Separate analysis for 20 types of gliomas found association between
phone use and one rare form.
◦ Risk seemed to decrease with greater mobile phone use.
Think for a moment:
◦ Suppose all 20 null hypotheses are true.
◦ Each test has 5% chance of being significant - the outcome is Bernoulli
distributed with parameter 0.05.
◦ The number of false positive tests is binomially distributed:
N ∼ Bin(20, 0.05)
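From this, the chance of at least one false positive among the 20 tests follows directly:

```python
# If all 20 nulls are true and each test has level 0.05,
# P(at least one significant result) = 1 - P(none significant)
p_any_false_positive = 1 - 0.95 ** 20
print(round(p_any_false_positive, 3))  # about 0.642
```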
Example
Suppose we perform k = 6 tests and obtain the following P -values:
P -value: 0.476, 0.032, 0.241, 0.008*, 0.010, 0.001*;   α/k = 0.0083
(starred P -values fall below α/k and remain significant after the Bonferroni correction)
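The correction is a one-line computation; a minimal sketch on the slide's six P-values:

```python
# P-values from the k = 6 tests on the slide
p_values = [0.476, 0.032, 0.241, 0.008, 0.010, 0.001]
alpha = 0.05
k = len(p_values)

threshold = alpha / k  # Bonferroni-adjusted per-test level
significant = [p for p in p_values if p < threshold]
print(threshold, significant)  # 0.00833..., [0.008, 0.001]
```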
[Bar chart: sample proportion (0–0.4) by marital status: single, married, wid/div]
Test problem:
H0 : the row and the column variables are independent
Ha : the row and the column variables are dependent
How can we measure evidence against the null hypothesis?
◦ What counts would we expect to observe if the null hypothesis
were true?
Expected Cell Count = (row total × column total) / total count
Recall: For two independent events A and B, P(A ∩ B) = P(A) P(B).
If the null hypothesis H0 is true, then the table of expected
counts should be “close” to the observed table of counts.
◦ We need a statistic that measures the difference between the
tables.
◦ And we need to know what is the distribution of the statistic
to make statistical inference.
X² ≥ χ²(r−1)(c−1),α .
χ² Densities
[Figure: χ² densities with 1, 5, 10, 20 and 30 degrees of freedom, 0 to 50]
                      Austen                        Imitator
Word      Sense and Sensibility   Emma   Sanditon I   Sanditon II
a                  147             186       101           83
an                  25              26        11           29
this                32              39        15           15
that                94             105        37           22
with                59              74        28           43
without             18              10        10            4
TOTAL              375             440       202          196
Questions:
Smoker Not
Dead 139 230
Alive 438 502
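A sketch of the expected-count and chi-square computations for this combined 2×2 table, using the formula above:

```python
# Combined smoker/survival table from the slide
observed = [[139, 230],   # Dead:  smoker, not
            [438, 502]]   # Alive: smoker, not

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected cell count under independence: row total * column total / total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(round(chi2, 2))
```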
Here are the same data classified by their age at time of the survey:
Age 18 to 44 Age 45 to 64 Age 65+
Smoker Not Smoker Not Smoker Not
Dead 19 13 Dead 78 52 Dead 42 165
Alive 269 327 Alive 162 147 Alive 7 28
Simpson’s Paradox
An association/comparison that holds for all of several groups can
reverse direction when the data are combined to form a single
group.
[Scatter plot: body density (10³ kg/m³, 1.0–2.0) against skinfold thickness (mm, 1.03–1.09)]
Questions:
◦ Are body density and skinfold thickness related?
◦ How accurately can we predict body density from skinfold thickness?
◦ and variance
[Sequence of 3D figures: joint density f(x, y) with sampling probability over a grid of x = 0, …, 12 and y = 0, …, 6; fixing x = x0 gives the slice f(x0, y); rescaling by fX(x0) gives the conditional density]
Conditional probability:
f(y|x0) = fXY(x0, y) / fX(x0)
E(Y |X = x0) = ∫ y fY|X(y|x0) dy    conditional mean
Yi = a + b x i + ε i , i = 1, . . . , n
where
Assumptions:
◦ Predictor xi is deterministic (fixed values, not random).
◦ Errors have zero mean, E(εi) = 0.
◦ Variation about mean does not depend on xi, i.e. var(εi) = σ 2 .
◦ Errors εi are independent.
Often we additionally assume:
◦ The errors are normally distributed,
iid
εi ∼ N (0, σ 2).
Y ∼ N (a + b x, σ 2).
a - intercept
b - slope
Least Squares Approach:
Minimize squared distance between observed Yi and fitted Ŷi :
L(a, b) = Σi (Yi − Ŷi)² = Σi (Yi − a − b xi)²,   sum over i = 1, …, n
Ŷi = â + b̂ xi
Residuals ε̂i:
ε̂i = Yi − Ŷi
= Yi − â − b̂ xi
Estimation of σ²
σ̂² = 1/(n − 2) · Σi (Yi − Ŷi)² = SS Residual/(n − 2)
Regression standard error
se = σ̂ = √(SS Residual/(n − 2))
Variation accounting:
SS Total = Σi (Yi − Ȳ)²        total variation
SS Model = Σi (Ŷi − Ȳ)²        variation explained by linear model
SS Residual = Σi (Yi − Ŷi)²    remaining variation
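The least squares estimators and the sums of squares above can be sketched on a tiny made-up data set (values chosen so the fit is exact):

```python
# Tiny illustrative data set lying exactly on the line y = 2 + 3x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 8.0, 11.0, 14.0, 17.0]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b_hat = sxy / sxx               # slope
a_hat = y_bar - b_hat * x_bar   # intercept

y_fit = [a_hat + b_hat * xi for xi in x]
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))
ss_model = ss_total - ss_resid
print(a_hat, b_hat, ss_resid)  # 2.0, 3.0, 0.0 (exact fit)
```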
[Scatter plot with fitted regression line: body density (10³ kg/m³, 1.0–2.0) against skinfold thickness (mm, 1.03–1.09)]
b̂ = SXY/SXX = −0.267/0.023 = −11.40
â = ȳ − b̂x̄ = 1.568 + 11.40 · 1.064 = 13.70
σ̂² = RSS/(n − 2) = 1.187/90 = 0.0132
se = √σ̂² = √0.0132 = 0.1149
E(b̂) = b
Recall that
var(b̂) = σ²/SXX ,   where SXX = Σi (xi − x̄)²
Distribution of b̂
b̂ ∼ N(b, σ²/SXX)
E(â) = a
var(â) = (1/n + x̄²/SXX) σ²
Distribution of â
â ∼ N(a, (1/n + x̄²/SXX) σ²)
(b̂ − b)/(σ/√SXX) ∼ N(0, 1)
Substituting se for σ, we obtain
(b̂ − b)/(se/√SXX) ∼ tn−2
Similarly
(â − a) / (σ √(1/n + x̄²/SXX)) ∼ N(0, 1)
H0 : b = b0 versus Ha : b ≠ b0 .
Tb = (b̂ − b0)/(se/√SXX) ∼ tn−2
The null hypothesis H0 : b = b0 is rejected if
|Tb | > tn−2,α/2
H0 : a = a0 versus Ha : a ≠ a0 .
|T | > tn−2,α/2
E(Y ) = a + b x0
Our estimate for the mean of Y at X = x0 is
Ŷx0 = â + b̂ x0 .
Hence we obtain
E(Ŷx0) = a + b x0
var(Ŷx0) = (1/n + (x0 − x̄)²/SXX) σ²
[Two scatter plots: body density (1–2.5) against skin thickness (SKINT, 1.02–1.1)]
[Two scatter plots: food expenditure (0–20) against income (0–120) and against family size (0–6)]
where
◦ Yi response variable
◦ x1,i, . . . , xp,i predictor variables (fixed, nonrandom)
◦ b0, . . . , bp regression coefficients
iid
◦ εi ∼ N (0, σ 2) error variable
[3D figure: observations Yi and fitted values Ŷi on the regression plane over the two predictors]
Y =Xb+ε
where
Y n dimensional vector
X n × (1 + p) dimensional matrix
b 1 + p dimensional vector
ε n dimensional vector
Thus the model can be written as
⎛ Y1 ⎞   ⎛ 1  x1,1  ···  xp,1 ⎞ ⎛ b0 ⎞   ⎛ ε1 ⎞
⎜  ⋮ ⎟ = ⎜ ⋮    ⋮          ⋮  ⎟ ⎜  ⋮ ⎟ + ⎜  ⋮ ⎟
⎝ Yn ⎠   ⎝ 1  x1,n  ···  xp,n ⎠ ⎝ bp ⎠   ⎝ εn ⎠
Results:
b̂ = (XᵀX)⁻¹XᵀY ∼ N(b, σ²(XᵀX)⁻¹)
Ŷ = X(XᵀX)⁻¹XᵀY ∼ N(Xb, σ²X(XᵀX)⁻¹Xᵀ)
ε̂ = Y − Ŷ = (1 − X(XᵀX)⁻¹Xᵀ)Y ∼ N(0, σ²(1 − X(XᵀX)⁻¹Xᵀ))
σ̂² = se² = ‖Y − Ŷ‖²/(n − p − 1) = 1/(n − p − 1) · Σi (Yi − Ŷi)²
Result:
◦ bj measures the dependence of Y on xj after removing the linear effects
of all other predictors xk , k 6= j.
◦ bj = 0 if xj does not provide information for the prediction of Y addi-
tionally to the information given by the other predictor variables.
Data:
◦ Study with 12 children with congenital heart defects
◦ Exact required catheter length was measured using a fluoroscope
◦ Patient’s height and weight were recorded
[Two scatter plots: distance (cm, 20–45) against height (in, 30–60) and against weight (lb, 20–80)]
Y = b0 + b1 x 1 + b2 x 2 + ε
Note:
◦ Neither height nor weight seem to be significant for predicting the dis-
tance to the pulmonary artery.
◦ The regression on both variables explains 80% of the variation of the
response (length of catheter).
Note:
◦ In a simple regression of Y on either height or weight, the explanatory
variable is highly significant for predicting Y .
◦ In a multiple regression of Y on height and weight, the coefficients for
both height and weight are not significantly different from zero.
[Scatter plot: weight (lb, 20–80) against height (in, 20–70)]
Decomposition of variation:
◦ SS Total = Σi (Yi − Ȳ)² - total variation
◦ SS Residual = Σi (Yi − Ŷi)² - remaining variation in regression model
◦ SS Model = SS Total − SS Residual
= Σi (Ŷi − Ȳ)² - variation explained by regression
Coefficient of determination: The ratio
SS Model
R2 =
SS Total
indicates how well the regression model predicts the response. R2 is also
the squared multiple correlation coefficient - in a simple linear regression
we have
R2 = ρ2XY .
Y = t · Kᵃ · Lᵇ · Mᶜ
[Three scatter plots: Y against K, L and M]
Yi = b0 + b1 x1,i + . . . + bp xp,i + εi .
Problem:
Test the null hypothesis
H0 : q specific explanatory variables all have zero coefficients
versus
Ha : any of these q explanatory variables has a nonzero coefficient.
Solution:
◦ Regress Y on all p explanatory variables and read SS Residual⁽¹⁾ from the
output.
◦ Regress Y on just the p − q explanatory variables that remain after you
remove the q variables from the model. Read SS Residual⁽²⁾ from the output.
◦ The test statistic is
F = ((n − p − 1)/q) · (SS Residual⁽²⁾ − SS Residual⁽¹⁾) / SS Residual⁽¹⁾ .
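The statistic is plain arithmetic once both regressions are run; a minimal sketch with hypothetical residual sums of squares:

```python
# Hypothetical residual sums of squares, for illustration only
ss_res_full    = 1.20  # model with all p predictors
ss_res_reduced = 1.30  # model with the q predictors removed
n, p, q = 25, 3, 2

# Partial F statistic comparing the reduced model against the full model
F = (n - p - 1) / q * (ss_res_reduced - ss_res_full) / ss_res_full
print(round(F, 3))  # 0.875
```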
Using STATA:
. test LK LL
( 1) LK = 0
( 2) LL = 0
F( 2, 21) = 0.25
Prob > F = 0.7847
. test LK LL _cons
( 1) LK = 0
( 2) LL = 0
( 3) _cons = 0
F( 3, 21) = 2.43
Prob > F = 0.0934
Box plots
[Two box plots: time (in minutes, 0–60) by dose (2–10)]
Model:
[Residual plots: residuals (in minutes, −10 to 10) against dose (in grams, 2–10), and sample quantiles against theoretical quantiles]
Model:
[Residual plots: residuals (in minutes, −10 to 10) against dose (in grams, 2–10), and sample quantiles against theoretical quantiles]
. test sex bp
( 1) sex = 0
( 2) bp = 0
F( 2, 18) = 1.19
Prob > F = 0.3270
Model:
[Plot: fitted time (in minutes, 0–60) against dose (in grams, 2–10)]
[Box plots: amount of chlorpheniramine (in mg, 3.80–4.10) by laboratory, Lab 1 to Lab 7]
[Plot: duration of itching (sec, 0–400)]