(Module 1)
Statistics (MAST20005) & Elements of Statistics (MAST90058)
Semester 2, 2018
Contents
1 Subject information
2 Review of probability
3 Descriptive statistics
4 Basic data visualisations
1 Subject information
What is statistics?
Examples
• Weather forecasts: Bureau of Meteorology
• Poll aggregation: FiveThirtyEight, The Guardian
• Climate change modelling: Australian Academy of Science
• Discovery of the Higgs Boson (the ‘God Particle’): van Dyk (2014)
• Smoking leads to lung cancer: Doll & Hill (1945)
• A/B testing for websites: Google and 41 shades of blue
Laleh’s example
• Dependencies between financial institutions across countries
Damjan’s examples
• Genome-wide association studies
• Web analytics
• Lung testing in infants
• Skin texture image analysis
• Wedding ‘guestimation’
Goals of statistics
• Answer questions using data
• Evaluate evidence
• Optimise study design
• Make decisions
And, importantly:
• Clarify assumptions
• Quantify uncertainty
Subject overview
Joint teaching
MAST20005 and MAST90058 share the same lectures but have separate tutorials and lab classes. The teaching and
assessment material for both subjects will overlap significantly.
Subject structure
• Lectures: Three 1-hour lectures per week. Lecture notes/slides will appear on the LMS.
• Tutorials: One 1-hour tutorial per week (starting in week 2). Tutorial problems and solutions will appear on
the LMS.
• Computer lab classes: One 1-hour lab per week (starting in week 2), immediately following the tutorial. Lab
notes, exercises and solutions will appear on the LMS.
Computing
• This subject introduces basic statistical computing and programming skills.
• We make extensive use of the R statistical software environment.
• Knowledge of R will be essential for some of the tutorial problems and assignment questions, and will also be
examined.
• We will use the RStudio program as a convenient interface with R.
Textbook
R. Hogg, E. Tanis, and D. Zimmerman. Probability and Statistical Inference. 9th Edition, Pearson, 2015.
• This subject is based on Chapters 6–9.
• Some of the teaching material is taken from the textbook.
• This textbook is being phased out for this subject.
• There are important differences between the subject content and the textbook. We will point many of these out,
but please ask if unsure.
Assessment
• 3 assignments (20%)
1. Hand out at the start of week 4, due at the end of week 5
2. Hand out at the start of week 7, due at the end of week 8
3. Hand out at the start of week 10, due at the end of week 11
• 45-minute computer lab test held in week 12 (10%)
• 3-hour written examination in the examination period (70%)
Plagiarism declaration
• Everyone must complete the Plagiarism Declaration Form
• Do this on the LMS
• Do this ASAP!
Staff contacts
Student representatives
Student representatives assist the teaching staff to ensure good communication and feedback from students.
See the LMS to find the contact details of your representatives.
– More than one correct answer
– Often uncertain about the answer
Diversity
Homework
1. Complete plagiarism declaration on the LMS
2. Log in to Piazza
3. Install RStudio on your computer
4. Start reading lab notes for week 2 (long!)
Tips
The best way to learn statistics is by solving problems and ‘getting your hands dirty’ with data.
We encourage you to attend all lectures, tutorials and computer labs to get as much practice and feedback as possible.
Good luck!
2 Review of probability
Why probability?
• It forms the mathematical foundation for statistical models and procedures
• Let’s review what we know already. . .
Distribution functions
• The cumulative distribution function (cdf) of X is
  F(x) = Pr(X ≤ x), −∞ < x < ∞
• If X is a continuous rv then it has a probability density function (pdf), f(x), that satisfies
  f(x) = F′(x) = dF(x)/dx
  F(x) = ∫_{−∞}^{x} f(t) dt
Example: Unemployment duration
A large group of individuals have recently lost their jobs. Let X denote the length of time (in months) that any
particular individual will stay unemployed. It was found that this was well described by the following pdf:
  f(x) = (1/2) e^{−x/2} for x ≥ 0, and f(x) = 0 otherwise.
[Figure: pdf f(x) (left) and cdf F(x) (right) of the unemployment-duration distribution]
Clearly, f(x) ≥ 0 for any x and the total area under the pdf is:
  Pr(−∞ < X < ∞) = ∫_{0}^{∞} (1/2) e^{−x/2} dx = [−e^{−x/2}]_{0}^{∞} = 1.
The probability that a person in the population finds a new job within 3 months is:
  Pr(0 ≤ X ≤ 3) = ∫_{0}^{3} (1/2) e^{−x/2} dx = [−e^{−x/2}]_{0}^{3} = 1 − e^{−3/2} = 0.7769.
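As a quick numerical check of the calculation above (sketched here in stdlib Python, though the subject's computing is done in R, where `pexp(3, rate = 0.5)` gives the same value):

```python
import math

# P(0 <= X <= 3) = [-exp(-x/2)] evaluated from 0 to 3 = 1 - exp(-3/2)
p = 1 - math.exp(-3 / 2)
print(round(p, 4))   # 0.7769
```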
Example: Received calls
The number of calls received by an office in a given day, X, is well represented by a pmf with the following expression:
  p(x) = e^{−5} 5^{x} / x!, x ∈ {0, 1, 2, …},
where x! = 1 · 2 ⋯ (x − 1) · x and 0! = 1. For example,
  Pr(X = 1) = e^{−5} · 5 = 0.0337
  Pr(X = 3) = e^{−5} · 5^{3} / (3 · 2 · 1) = 0.1404
[Figure: pmf P(X = x) (left) and cdf F(x) (right) of the number of received calls]
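A minimal sketch of the pmf evaluations above (in stdlib Python; in R the same values come from `dpois(c(1, 3), lambda = 5)`):

```python
import math

def pois_pmf(x, lam=5):
    # p(x) = exp(-lam) * lam**x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

print(round(pois_pmf(1), 4))   # 0.0337
print(round(pois_pmf(3), 4))   # 0.1404
```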
• More generally, for a function g(x) we can compute
  E(g(X)) = Σ_{x=−∞}^{∞} g(x) p(x)   (discrete rv)
  E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx   (continuous rv)
• More generally,
  var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
where cov(X, Y) is the covariance between X and Y
Covariance
• Definition of covariance:
cov(X, Y ) = E {(X − E(X)) (Y − E(Y ))}
Correlation
• If cov(X, Y ) > 0 then X and Y are positively correlated
• If cov(X, Y ) < 0 then X and Y are negatively correlated
• If cov(X, Y ) = 0 then X and Y are uncorrelated
• The correlation between X and Y is defined as:
  ρ = cor(X, Y) = cov(X, Y) / (sd(X) sd(Y)),  −1 ≤ ρ ≤ 1
Moment generating functions
• The moment generating function (mgf) of a rv X is
  M_X(t) = E(e^{tX}), t ∈ (−∞, ∞)
• The mgf uniquely determines a distribution. Hence, knowing the mgf is the same as knowing the distribution.
• If X and Y are independent rvs,
  M_{X+Y}(t) = M_X(t) M_Y(t)
Bernoulli distribution
• X takes on the values 1 (success) or 0 (failure)
• X ∼ Be(p) with pmf
  p(x) = p^{x} (1 − p)^{1−x}, x ∈ {0, 1}
• Properties:
  E(X) = p
  var(X) = p(1 − p)
  M_X(t) = p e^{t} + 1 − p
Binomial distribution
• X ∼ Bi(n, p) with pmf
  p(x) = (n choose x) p^{x} (1 − p)^{n−x}, x ∈ {0, 1, …, n}
• Properties:
  E(X) = np
  var(X) = np(1 − p)
  M_X(t) = (p e^{t} + 1 − p)^{n}
Poisson distribution
• X ∼ Pn(λ) with pmf
  p(x) = e^{−λ} λ^{x} / x!, x ∈ {0, 1, …}
• Properties:
  E(X) = var(X) = λ
  M_X(t) = e^{λ(e^{t} − 1)}
• It approximates the binomial distribution:
  (n choose x) p^{x} (1 − p)^{n−x} ≈ e^{−λ} λ^{x} / x!
as n → ∞ and p → 0 with np = λ held fixed.
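The binomial-to-Poisson approximation above is easy to see numerically. A quick sketch (stdlib Python; in R, compare `dbinom(3, n, 5/n)` with `dpois(3, 5)`):

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def pois_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 5
for n in (10, 100, 10000):
    p = lam / n                      # keep np = lam fixed as n grows
    print(n, round(binom_pmf(3, n, p), 4), round(pois_pmf(3, lam), 4))
# The binomial pmf approaches the Poisson value 0.1404 as n increases.
```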
Uniform distribution
• X ∼ Unif(a, b) with pdf
  f(x) = 1/(b − a), x ∈ (a, b)
• Properties:
  E(X) = (a + b)/2
  var(X) = (b − a)^{2}/12
  M_X(t) = (e^{tb} − e^{ta}) / (t(b − a))
• If b = 1 and a = 0, this is known as the uniform distribution over the unit interval.
Exponential distribution
• X ∼ Exp(λ) with pdf
  f(x) = λ e^{−λx}, x ∈ [0, ∞)
• It approximates the “time until first success” for independent Be(p) trials every Δt units of time, with p = λΔt and
Δt → 0
• Properties:
  E(X) = 1/λ
  var(X) = 1/λ^{2}
  M_X(t) = λ/(λ − t), for t < λ
• It is famous for being the only continuous distribution with the memoryless property:
  Pr(X > s + t | X > s) = Pr(X > t), for all s, t ≥ 0
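The memoryless property can be checked directly from the survival function Pr(X > x) = e^{−λx}. A minimal sketch (stdlib Python; the rate λ = 0.5 and the values of s and t are arbitrary choices for illustration):

```python
import math

def surv(x, lam=0.5):
    # survival function of Exp(lam): P(X > x) = exp(-lam * x)
    return math.exp(-lam * x)

s, t = 2.0, 3.0
lhs = surv(s + t) / surv(s)   # P(X > s + t | X > s)
rhs = surv(t)                 # P(X > t)
print(lhs, rhs)               # the two agree: the past does not matter
```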
Normal distribution
• X ∼ N(µ, σ²) with pdf
  f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}, x ∈ (−∞, ∞), µ ∈ (−∞, ∞), σ > 0
• Properties:
  E(X) = µ
  var(X) = σ²
  M_X(t) = e^{tµ + t²σ²/2}
Quantiles
Let X be a continuous rv. The pth quantile of its distribution is a number π_p such that p = Pr(X ≤ π_p) = F(π_p).
In other words, the area under f(x) to the left of π_p is p:
  p = ∫_{−∞}^{π_p} f(x) dx = F(π_p)
Example: Suppose X has pdf
  f(x) = (3x²/4³) e^{−(x/4)³}, x ∈ (0, ∞).
The cdf is
  F(x) = 1 − e^{−(x/4)³}, x ∈ (0, ∞)
Then π_{0.3} satisfies 0.3 = F(π_{0.3}). Therefore,
  1 − e^{−(π_{0.3}/4)³} = 0.3
  ⇒ ln(0.7) = −(π_{0.3}/4)³
  ⇒ π_{0.3} = 4(−ln 0.7)^{1/3} = 2.84.
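The inverted-cdf calculation above can be verified numerically (stdlib Python sketch; R would give the same via `uniroot` or by evaluating the closed form directly):

```python
import math

# cdf F(x) = 1 - exp(-(x/4)**3); solve F(pi_p) = p for p = 0.3
p = 0.3
q = 4 * (-math.log(1 - p)) ** (1 / 3)
print(round(q, 2))                       # 2.84
# sanity check: plugging back into the cdf recovers p
print(round(1 - math.exp(-(q / 4) ** 3), 6))   # 0.3
```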
Law of large numbers (LLN)
Consider a collection X₁, …, X_n of independent and identically distributed (iid) random variables with E(X) = µ < ∞.
Then with probability 1 we have:
  (1/n) Σ_{i=1}^{n} X_i → µ, as n → ∞.
The LLN ‘guarantees’ that long-run averages behave as we expect them to:
  E(X) ≈ (1/n) Σ_{i=1}^{n} X_i.
Central limit theorem (CLT)
Consider a collection X₁, …, X_n of iid rvs with E(X) = µ < ∞ and var(X) = σ² < ∞. Let
  X̄ = (1/n) Σ_{i=1}^{n} X_i.
Then the distribution of
  (X̄ − µ)/(σ/√n)
approaches a N(0, 1) distribution as n → ∞.
This is an extremely important theorem! It provides the ‘magic’ that will make statistical analysis work.
Example
For X₁, …, X_n iid Exp(λ) with mean 1/λ = 5 and n = 25, the CLT gives
  X̄ ≈ N(1/λ, 1/(nλ²)) = N(5, 5²/25)
Is n = 25 large enough?
A simulation exercise
Generate B = 1000 samples of size n. For each sample compute x̄. The continuous curve is the normal N(5, 5²/n)
distribution prescribed by the CLT.
  Sample 1: x_1^{(1)}, …, x_n^{(1)} → x̄^{(1)}
  Sample 2: x_1^{(2)}, …, x_n^{(2)} → x̄^{(2)}
  ⋮
  Sample B: x_1^{(B)}, …, x_n^{(B)} → x̄^{(B)}
The distribution of X̄ approaches the theoretical distribution (CLT). Moreover it will be more and more concentrated
around µ (LLN). To see this, note that var(X̄) = σ 2 /n → 0 as n → ∞.
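The simulation scheme above can be sketched in a few lines (stdlib Python here; in R one would use `replicate(B, mean(rexp(n, rate = 1/5)))`):

```python
import random
import statistics

random.seed(1)                # fixed seed so the run is reproducible
lam, n, B = 1 / 5, 25, 1000   # Exp(rate 0.2) has mean 1/lam = 5

# Draw B samples of size n and record each sample mean x-bar
xbars = [statistics.fmean(random.expovariate(lam) for _ in range(n))
         for _ in range(B)]

# CLT: X-bar should be approximately N(5, 5**2 / 25) = N(5, 1)
print(round(statistics.fmean(xbars), 2))     # close to 5
print(round(statistics.variance(xbars), 2))  # close to 1
```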
[Figure: simulated distributions of X̄ for n = 1, 5, 25, 100, each with the N(5, 5²/n) density overlaid]
Challenge problem
Let X₁, X₂, …, X₂₅ be iid rvs with pdf f(x) = a x³ where 0 < x < 2.
1. What is the value of a?
2. Calculate E(X1 ) and var(X1 ).
3. What is an approximate value of Pr(X̄ < 1.5)?
3 Descriptive statistics
Statistics: the big picture
Data & sampling
• The data are numbers:
x1 , . . . , xn
• The model for the data is a random sample, that is, a sequence of iid rvs:
X1 , X2 , . . . , Xn
Statistic
• A statistic T = φ(X1 , . . . , Xn ) is a function of the sample and its realisation is denoted by t = φ(x1 , . . . , xn ).
• Note: the word “statistic” can be used to refer to both the realisation, t, and the random variable, T . Sometimes
we need to be more specific about which one is meant.
• A statistic has two purposes:
– Describe or summarise the sample — descriptive statistics
– Estimate the distribution generating the sample — inferential statistics
• A statistic can be both descriptive and inferential; it depends on how you wish to use/interpret it (see later)
• We now introduce some commonly used descriptive statistics. . .
Moment statistics
  Sample mean = x̄ = (1/n) Σ_{i=1}^{n} x_i = 23.59/10 = 2.359
  Sample variance = s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)² = 3.98761
  Sample standard deviation = s = √3.98761 = 1.9969
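These moment statistics can be reproduced from the data x shown later in this section (stdlib Python sketch; in R, `mean(x)`, `var(x)` and `sd(x)` give the same values):

```python
import math

x = [0.97, 0.52, 0.73, 0.96, 1.26, 4.06, 2.41, 1.53, 6.07, 5.08]
n = len(x)

xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # note the n - 1 divisor
s = math.sqrt(s2)

print(round(xbar, 3))   # 2.359
print(round(s2, 5))     # 3.98761
print(round(s, 4))      # 1.9969
```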
Order statistics
The sorted sample values x_{(1)} ≤ x_{(2)} ≤ ⋯ ≤ x_{(n)} are called the order statistics.
What is x_{(3.25)}?
Let it be 0.25 of the way from x_{(3)} to x_{(4)}:
  x_{(3.25)} = x_{(3)} + 0.25 (x_{(4)} − x_{(3)})
Sample quantiles
Special cases:
  Sample median = π̂_{0.5} = x_{(5.5)} = (1.26 + 1.53)/2 = 1.395
  Sample 1st quartile = π̂_{0.25} = x_{(3.25)} = 0.9625
  Sample 3rd quartile = π̂_{0.75} = x_{(7.75)} = 3.6475
Also:
  Interquartile range = π̂_{0.75} − π̂_{0.25} = 2.685
π̂_{0.25} and π̂_{0.75} contain about 50% of the sample between them
Note: Type 7 quantiles are the default in R, but there are many alternatives! Don’t worry too much about the differences
between them. We will discuss this in a bit more detail later in the semester.
> x
[1] 0.97 0.52 0.73 0.96 1.26 4.06 2.41 1.53 6.07 5.08
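The Type 7 rule places π̂_p at position h = (n − 1)p + 1 in the sorted sample and interpolates linearly between the adjacent order statistics. A sketch using the data above (stdlib Python; in R, `quantile(x, c(0.25, 0.5, 0.75), type = 7)` gives the same values):

```python
def quantile_type7(xs, p):
    # 0-based fractional position (n - 1) * p in the sorted sample,
    # interpolating linearly between neighbouring order statistics
    xs = sorted(xs)
    h = (len(xs) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

x = [0.97, 0.52, 0.73, 0.96, 1.26, 4.06, 2.41, 1.53, 6.07, 5.08]
print(round(quantile_type7(x, 0.50), 4))   # 1.395
print(round(quantile_type7(x, 0.25), 4))   # 0.9625
print(round(quantile_type7(x, 0.75), 4))   # 3.6475
```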
Frequency statistics
4 Basic data visualisations
Box plot
[Figure: box plot of VEGFC]
Scatter plot
[Figure: scatter plot of VEGFC against COX2]
Empirical cdf
The empirical (or sample) cdf is
  F̂(x) = (1/n) Σ_{i=1}^{n} I(x_i ≤ x)
where I(·) is the indicator function (I(x_i ≤ x) has value 1 if x_i ≤ x and value 0 if x_i > x).
For example, for the previous data,
  F̂(2) = (1/10) Σ_{i=1}^{10} I(x_i ≤ 2) = 6/10 = 0.6
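The empirical cdf is just a counting exercise, which makes it easy to sketch (stdlib Python; R's `ecdf(x)` returns the equivalent step function):

```python
x = [0.97, 0.52, 0.73, 0.96, 1.26, 4.06, 2.41, 1.53, 6.07, 5.08]

def ecdf(xs):
    # F-hat(t) = (1/n) * #{x_i <= t}: a step function jumping 1/n at each x_i
    n = len(xs)
    return lambda t: sum(xi <= t for xi in xs) / n

F = ecdf(x)
print(F(2))     # 0.6
print(F(-1))    # 0.0  (below the sample minimum)
print(F(10))    # 1.0  (above the sample maximum)
```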
[Figure: empirical cdf of VEGFC]
It has the form of a discrete cdf. However, it will approximate the cdf of a continuous variable if the sample size is
large. The following diagram shows empirical cdfs based on n = 50 and n = 200 observations sampled from a standard
normal distribution, N(0, 1).
[Figure: empirical cdfs of N(0, 1) samples with n = 50 (left) and n = 200 (right)]
Empirical pmf
If the underlying variable is discrete we use the pmf corresponding to the sample cdf F̂:
  p̂(x) = (1/n) Σ_{i=1}^{n} I(x_i = x)
For example, the following shows p̂(x) for a sample of size n = 15 from Pn(5) (left) and the true pmf p(x) of Pn(5)
(right).
[Figure: empirical pmf p̂(x) from the Pn(5) sample (left) and the true Pn(5) pmf p(x) (right)]
Histograms and smoothed pdfs
If the underlying variable is continuous we would prefer to obtain an approximation of the pdf. There are several
approaches that can be used:
1. Histogram, f̂_h (h is the bin length). First divide the entire range of values into a series of small intervals (bins)
and then count how many values fall into each interval. For interval [a, b), where b − a = h, draw a rectangle
with height:
  f̂_h(x) = (1/(hn)) Σ_{i=1}^{n} I(a ≤ x_i < b)
2. Smoothed (kernel) density estimate:
  f̂_h(x) = (1/(hn)) Σ_{i=1}^{n} K((x − x_i)/h)
where K(·) is the kernel (a non-negative function that integrates to 1 and with mean zero) and h is a parameter
(the bandwidth) that controls the level of smoothing.
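Both estimators are short enough to sketch directly (stdlib Python, using the data from the descriptive-statistics example, a bin/bandwidth of h = 1 and a standard normal kernel; R's `hist` and `density` implement the same ideas with more refined defaults):

```python
import math

x = [0.97, 0.52, 0.73, 0.96, 1.26, 4.06, 2.41, 1.53, 6.07, 5.08]

def hist_density(t, xs, h=1.0, origin=0.0):
    # f-hat_h(t) = (1/(h n)) * #{x_i in the bin [a, a + h) containing t}
    a = origin + math.floor((t - origin) / h) * h   # left edge of t's bin
    return sum(a <= xi < a + h for xi in xs) / (h * len(xs))

def kde(t, xs, h=1.0):
    # kernel estimate with a standard normal kernel K
    K = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(K((t - xi) / h) for xi in xs) / (len(xs) * h)

print(hist_density(0.75, x))     # bin [0, 1) holds 4 of 10 values -> 0.4
print(round(kde(1.0, x), 3))     # smooth estimate near the mode
```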
Example: VEGFC
[Figure: histogram and smoothed pdf estimate of VEGFC]
Simulated data
[Figure: histogram and smoothed pdf estimate for the simulated data]
Quantile-quantile (QQ) plots
• For comparing the similarity of two probability distributions
• We plot their quantiles against each other (as a scatter plot)
• Typically, we compare data against a theoretical distribution
• The points in the plot are
  (x_{(k)}, F^{−1}(k/(n + 1))), k = 1, …, n.
• One axis shows the data, written here as sample quantiles:
  x_{(k)} = π̂_p, where p = k/(n + 1)   (‘Type 6’ quantiles)
for k = 1, …, n
• The other axis shows corresponding quantiles for a theoretical distribution:
  F^{−1}(k/(n + 1))
Example: VEGFC
Theoretical quantiles:
  F^{−1}(p) = −ln(1 − p)/λ   (e.g. set λ = 0.5)
  p: 1/(10 + 1) = 0.09, 2/(10 + 1) = 0.18, …, 10/(10 + 1) = 0.91
  F^{−1}(0.09) = 0.19, F^{−1}(0.18) = 0.40, …, F^{−1}(0.91) = 4.80
[Figure: QQ plot of VEGFC sample quantiles against exponential theoretical quantiles (left), and histogram of VEGFC with the fitted exponential density (right)]
The right tail of the sample does not quite match the theoretical model (tail of the sample distribution is heavier).
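The theoretical quantiles above come straight from inverting the exponential cdf. A small sketch (stdlib Python; in R, `qexp((1:10)/11, rate = 0.5)` gives the same numbers):

```python
import math

lam, n = 0.5, 10
ps = [k / (n + 1) for k in range(1, n + 1)]
# F^{-1}(p) = -ln(1 - p) / lam for Exp(lam)
tq = [-math.log(1 - p) / lam for p in ps]
print(round(tq[0], 2), round(tq[1], 2), round(tq[-1], 2))
# 0.19, 0.4 and 4.8, matching F^{-1}(0.09), F^{-1}(0.18) and F^{-1}(0.91)
```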
Normal QQ plots
If X ∼ N(µ, σ²), then X = µ + σZ, where Z ∼ N(0, 1). Therefore, if the normal model is correct,
  x_{(k)} ≈ µ + σ Φ^{−1}(k/(n + 1))
where Φ(z) = Pr(Z ≤ z) is the standard normal cdf.
So, if we plot the points
  (x_{(k)}, Φ^{−1}(k/(n + 1))), k = 1, …, n
the result should be a straight line with intercept µ and slope σ. The values Φ^{−1}(k/(n + 1)) are called normal
scores.
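Normal scores only require Φ^{−1}, which the Python standard library exposes as `statistics.NormalDist().inv_cdf` (R's equivalent is `qnorm((1:n)/(n + 1))`). A minimal sketch:

```python
from statistics import NormalDist

def normal_scores(n):
    # Phi^{-1}(k / (n + 1)) for k = 1, ..., n
    Z = NormalDist()   # standard normal, mu = 0, sigma = 1
    return [Z.inv_cdf(k / (n + 1)) for k in range(1, n + 1)]

scores = normal_scores(5)
print([round(s, 3) for s in scores])
# The scores are symmetric about 0; the middle one is Phi^{-1}(0.5) = 0.
```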
Example: simulated data
Consider 25 observations from X ∼ N(10, 2). The histogram is not very helpful:
[Figure: histogram of the sample (left) and normal QQ plot against the theoretical quantiles (right)]