
Probability Models & Statistics

Abhijit Sen

Last Updated : April 5, 2007

Expectation (Mean)
µ(X) = E[X] = Σ X P(X)  (1)
σ²(X) = Σ (X − µ)² P(X) = E[X²] − E[X]²  (2)
Standard Deviation
σ = √Variance  (3)
Skewness
γ = E[(X − µ)³] / σ³  (4)
If Y = a + bX, where a and b are constants,

µ(Y) = a + bµ(X) , σ²(Y) = b²σ²(X) , skewness of Y = +γ if b > 0, −γ if b < 0  (5)

Chebyshev’s Theorem

For any random variable X and any positive number k,

P(|X − µ| > kσ) ≤ 1/k²   or   P(|X − µ| ≤ kσ) ≥ 1 − 1/k²  (6)
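As a quick sanity check, Chebyshev's bound can be verified exactly for a small discrete distribution. The fair-die example below is purely illustrative (a Python sketch using only the standard library):

```python
from fractions import Fraction

# Fair six-sided die: exact mean and variance.
faces = range(1, 7)
mu = Fraction(sum(faces), 6)                            # 7/2
var = sum((Fraction(x) - mu) ** 2 for x in faces) / 6   # 35/12
sigma = float(var) ** 0.5

k = 1.2
# Exact tail probability P(|X - mu| > k*sigma): only X = 1 or 6 qualify.
tail = sum(1 for x in faces if abs(x - float(mu)) > k * sigma) / 6
bound = 1 / k ** 2          # Chebyshev's bound 1/k^2
assert tail <= bound        # 1/3 <= ~0.694
```

The bound is loose here (1/3 versus 0.694), which is typical: Chebyshev's theorem trades tightness for complete generality.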

Probability Theory
P (not A) = 1 − P (A) (7)
P (A or B) = P (A) + P (B) − P (A and B) (8)
P (A and B) = P (A|B)P (B) (9)
Bayes’ Theorem
P(A|B) = P(B|A) P(A) / P(B)  (10)
P (A) = P (B)P (A|B) + P (not B)P (A|not B) (11)
P (A) = P (B1 )P (A|B1 ) + P (B2 )P (A|B2 ) + . . . + P (Bn )P (A|Bn ) (12)
P(A or B or C) = P(A) + P(B) + P(C) − P(A and B) − P(A and C) − P(B and C) + P(A and B and C)  (13)

Hypergeometric Model
The hypergeometric random variable (X) is a count telling us how many in the sample have the
characteristic of interest. It is associated with simple random sampling without replacement from a
dichotomous population and is used only for small populations.
P(X = x) = C(R, x) C(N − R, n − x) / C(N, n)  (14)
N = population size ; n = sample size (say, less than 60) ;
R, X = numbers in the population and in the sample, respectively, having the characteristic of interest ;
p = R/N, proportion of the population having the characteristic ; q = 1 − p ;
µ = np , σ² = npq(N − n)/(N − 1)  (15)
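The pmf (14) and the mean in (15) can be checked directly with `math.comb`; the values of N, R and n below are made up for illustration:

```python
from math import comb

def hypergeom_pmf(x, N, R, n):
    """P(X = x): x successes in a sample of n drawn without
    replacement from a population of N containing R successes."""
    return comb(R, x) * comb(N - R, n - x) / comb(N, n)

N, R, n = 50, 20, 10          # illustrative population/sample sizes
p = R / N
pmf = [hypergeom_pmf(x, N, R, n) for x in range(n + 1)]

assert abs(sum(pmf) - 1) < 1e-12          # probabilities sum to 1
mean = sum(x * pmf[x] for x in range(n + 1))
assert abs(mean - n * p) < 1e-9           # mu = np, as in (15)
```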

Sampling with replacement from a dichotomous population

This is used as an approximation to the above model when the population is very large, since in a large population there is a very small probability of selecting the same member twice. The size of the sample should also be small compared to the size of the population. (N ≥ 60 , N ≥ 10n)
P (X = x) = C(n, x)px q n−x (16)
µ = np , σ 2 = npq (17)

Bernoulli random variable

The Bernoulli trial has exactly 2 outcomes and the outcome of interest has a probability p.
µ = p , σ 2 = p(1 − p) (18)

Binomial distribution
This distribution is equivalent to many Bernoulli trials performed successively and independently. It
gives the number of successes among all the trials.
P (X = x) = C(n, x)px q n−x (19)
µ = np , σ² = npq , γ = (1 − 2p)/√(np(1 − p))  (20)
Dividing the mean and the standard deviation by n gives the parameters for the binomial proportion.
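The moment formulas in (20) can be verified numerically from the pmf (19); n and p below are arbitrary illustrative choices:

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    """P(X = x) = C(n, x) p^x q^(n-x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 12, 0.3                # illustrative parameters
q = 1 - p
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

mean = sum(x * pmf[x] for x in range(n + 1))
var = sum((x - mean) ** 2 * pmf[x] for x in range(n + 1))
assert abs(mean - n * p) < 1e-9       # mu = np
assert abs(var - n * p * q) < 1e-9    # sigma^2 = npq
# Skewness gamma = (1 - 2p)/sqrt(npq), computed from the third moment.
skew = sum((x - mean) ** 3 * pmf[x] for x in range(n + 1)) / var ** 1.5
assert abs(skew - (1 - 2 * p) / sqrt(n * p * q)) < 1e-9
```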

Geometric random variable

The geometric random variable is the number of independent Bernoulli trials necessary to observe the
first success. If p is the probability of success in one trial and q = 1 − p,
P (X = x) = q x−1 p (21)
P (X ≤ x) = 1 − q x (22)
µ = 1/p , σ² = q/p²  (23)
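A short Python check of (21)–(23); the value of p is chosen arbitrarily for illustration:

```python
def geom_pmf(x, p):
    """P(X = x) = q^(x-1) p: first success occurs on trial x."""
    return (1 - p) ** (x - 1) * p

p = 0.25                      # illustrative success probability
q = 1 - p
# CDF by direct summation agrees with the closed form 1 - q^x in (22).
for x in range(1, 20):
    cdf = sum(geom_pmf(i, p) for i in range(1, x + 1))
    assert abs(cdf - (1 - q ** x)) < 1e-12
# A long truncated sum approximates the mean 1/p in (23).
mean = sum(x * geom_pmf(x, p) for x in range(1, 500))
assert abs(mean - 1 / p) < 1e-6
```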

Poisson distribution
The Poisson variable is useful for counting accidental occurrences within a fixed interval of time. It assumes that simultaneous occurrences are impossible, any two occurrences are independent, and the expected number of occurrences in any interval is proportional to the size of the interval.

P(X = x) = e^(−λ) λ^x / x!  (24)
P(X = x + 1) = [λ/(x + 1)] P(X = x)  (25)
µ = λ , σ2 = λ (26)
This can be used in place of the binomial distribution when p is small (less than 0.005) and n is large
(greater than 20). Here, λ = np.
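A Python sketch verifying the pmf (24), the recurrence (25), and the moments (26); λ is an arbitrary illustrative value:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lambda) lambda^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3.0                     # illustrative rate
# Recurrence: P(X = x+1) = (lambda/(x+1)) P(X = x).
for x in range(10):
    assert abs(poisson_pmf(x + 1, lam)
               - lam / (x + 1) * poisson_pmf(x, lam)) < 1e-12
# Mean and variance both equal lambda (long truncated sums).
pmf = [poisson_pmf(x, lam) for x in range(100)]
mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
assert abs(mean - lam) < 1e-9 and abs(var - lam) < 1e-9
```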

Negative Binomial distribution

The negative binomial random variable is the number of Bernoulli trials necessary to observe exactly
k successes.
P(X = x) = C(x − 1, k − 1) p^k q^(x−k)  (27)
P(X = x + 1) = [x q/(x − (k − 1))] P(X = x)  (28)
µ = k/p , σ² = kq/p²  (29)
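The pmf (27), the recurrence (28) and the mean in (29) can be checked numerically; k and p below are illustrative:

```python
from math import comb

def negbin_pmf(x, k, p):
    """P(X = x) = C(x-1, k-1) p^k q^(x-k): k-th success on trial x."""
    return comb(x - 1, k - 1) * p ** k * (1 - p) ** (x - k)

k, p = 3, 0.4                 # illustrative parameters
q = 1 - p
# Recurrence: P(X = x+1) = x q/(x - (k-1)) * P(X = x).
for x in range(k, 30):
    assert abs(negbin_pmf(x + 1, k, p)
               - x * q / (x - (k - 1)) * negbin_pmf(x, k, p)) < 1e-12
# A long truncated sum approximates the mean k/p.
mean = sum(x * negbin_pmf(x, k, p) for x in range(k, 400))
assert abs(mean - k / p) < 1e-6
```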


For a continuously distributed probability model, P(X = x) = 0; probabilities are instead defined over a range. The function f(x) relating probability to the continuous values is known as the Probability Density Function (PDF). The total area under this curve is 1.

Continuous uniform distribution

All intervals of a given length have an equal probability in this distribution. X takes on values only
between a and b and is uniformly distributed on [a, b] with probability of an interval being proportional
to the length of the interval.
f(x) = 1/(b − a) , a ≤ x ≤ b  (30)
P(x1 < X < x2) = (x2 − x1)/(b − a)  (31)
where a < x1 < x2 < b.
µ = (a + b)/2 , σ² = (b − a)²/12  (32)

Exponential distribution
The Exponential distribution is generally used to model the time between two "occurrences" in a system where the number of "occurrences" is appropriately modeled by the Poisson distribution.

f (x) = λe−xλ , x ≥ 0 (33)

P (X < x) = 1 − e−xλ , x > 0 (34)

µ = 1/λ , σ² = 1/λ²  (35)
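The CDF (34) and the mean in (35) can be checked by simulation with `random.expovariate`; the rate λ and sample size are arbitrary illustrative choices:

```python
from math import exp
import random

random.seed(1)
lam = 2.0                     # illustrative rate of "occurrences"

# Simulate inter-arrival times; their distribution should match
# F(x) = 1 - e^(-lambda x), with mean 1/lambda.
times = [random.expovariate(lam) for _ in range(100_000)]
mean = sum(times) / len(times)
assert abs(mean - 1 / lam) < 0.01

x = 0.7
empirical = sum(t < x for t in times) / len(times)
assert abs(empirical - (1 - exp(-lam * x))) < 0.01
```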

Normal distribution
The Normal distribution arose as a model for measurement error. The key idea is that for repeated
measurements, the measurement error is a random variable and assuming no systematic source of
error, the mean should be zero and the distribution symmetric about the mean. A member X of
the family of normally distributed random variables can be notated as X ∼ N (µ, σ 2 ) where the 2
parameters to N are the mean and the variance respectively.

f(x) = [1/(σ√(2π))] exp(−(x − µ)²/(2σ²))  (36)

The standard normally distributed random variable Z is defined as N(0, 1) and its density function is called the Gaussian curve. Any situation giving rise to numbers where the difference between two values looks like random error (i.e., is due to many independent factors) will often be modeled appropriately by a normal distribution. Certain quantities will not follow this distribution but may do so after transformations (lognormal, quadratic, etc.).
Since the values in cumulative probability distribution function tables are given for Z, i.e. P(Z < z), we standardize the normally distributed variable X by using
Z = (X − µ)/σ  (37)
On a normal probability plot, an ideal sample drawn from N(µ, σ²) should be evenly spread, i.e. it should cut the distribution into equal probabilities. If X is binomially distributed, then X can be approximated by N(np, npq) provided np, nq ≥ 5.
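Both the standardization (37) and the normal approximation to the binomial can be checked with `statistics.NormalDist`; the parameters below are made up for illustration:

```python
from math import comb, sqrt
from statistics import NormalDist

mu, sigma = 100.0, 15.0       # illustrative parameters
X = NormalDist(mu, sigma)
Z = NormalDist()              # standard normal N(0, 1)

x = 120.0
z = (x - mu) / sigma          # standardization, as in (37)
assert abs(X.cdf(x) - Z.cdf(z)) < 1e-12

# Normal approximation to the binomial (np, nq >= 5): X ~ N(np, npq).
n, p = 50, 0.4
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(26))
# With a continuity correction (25.5), the approximation is quite close.
assert abs(exact - approx.cdf(25.5)) < 0.01
```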

A sample from the entire population is chosen and the sample mean X̄ is used as an estimator for the population mean µ. The Central Limit Theorem states that if Z is the standardized sum of n independent, identically distributed (discrete or continuous) random variables, then the probability distribution of Z tends to a normal distribution as n increases. Here, if σ is the standard deviation of the population, then
X̄ ∼ N(µ, σ²/n)  (38)
The standardized variable Z is calculated as
Z = (X̄ − µ)/(σ/√n)  (39)
An estimator is said to be unbiased if its expected value is equal to the parameter being estimated. The chosen confidence level determines the value of Z read from the table of the standardized normal distribution; this value is then transformed to get an interval with respect to X̄. The spread of the confidence interval is the margin for sampling error. For example,
0.95 = P[X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n]  (40)
This is the same as
0.95 = P[µ − 1.96 σ/√n < X̄ < µ + 1.96 σ/√n]  (41)
These calculations assume that the population is infinite. The exact formula for the standard deviation of X̄ when sampling without replacement from a finite population of size N is
(σ/√n) √((N − n)/(N − 1))  (42)
where the second term is the finite population correction. This is usually ignored unless n > 0.1N.
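The confidence-interval construction in (40) can be sketched in Python; the sample mean, σ and n below are invented for illustration:

```python
from statistics import NormalDist

def mean_ci(xbar, sigma, n, level=0.95):
    """Confidence interval for mu when the population sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # 1.96 for 95%
    margin = z * sigma / n ** 0.5
    return xbar - margin, xbar + margin

# Illustrative numbers: xbar = 52, sigma = 8, n = 64.
lo, hi = mean_ci(xbar=52.0, sigma=8.0, n=64)
# Margin is z * sigma/sqrt(n) = 1.96 * 8/8, about 1.96.
assert abs((hi - lo) / 2 - 1.96) < 0.001
```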
The standard deviation of the population needs to be estimated as well since we seldom know it
beforehand. It can be estimated by the sample standard deviation s as
s² = Σ (Xi − X̄)²/(n − 1)  (43)
The (n − 1) in the denominator makes s² an unbiased estimator of the population variance. However, s is a biased estimator of the population standard deviation.
Student’s t distribution is used to correct for the above bias when the sampling is from a normal population:
t = (X̄ − µ)/(s/√n)  (44)
We have to choose the t curve with n − 1 degrees of freedom, where n is the number of observations.
A special case occurs when the random variable has a binomial distribution, since it can only take the values 1 and 0. If π is the probability of success in the population and the sample size is n, the parameters for the sample success proportion p = x/n are
µ = π , σ = √(π(1 − π)/n)  (45)

The estimate is given as

p ± kσ  (46)
where k is determined by the confidence level.

Hypothesis testing
We denote 2 competing theories as the null hypothesis H0 and the alternative hypothesis H1 . The two
hypotheses have different values for the parameters of distribution (generally the mean).
A Type I error rejects a null hypothesis which is true. A Type II error does not reject a null hypothesis
that is false.
α = P[Type I error] = P[reject H0 |H0 is true]
β = P[Type II error] = P[do not reject H0 |H0 is false]
Raising α would reduce β and vice versa. The only way to reduce both is to increase the sample size.
Mathematically,
X̄c = µ0 + Zα σ/√n  (47)

X̄c = µ1 − Zβ σ/√n  (48)

Generally, an event with a less than or equal to 5 percent chance of happening (P-value ≤ 0.05) leads to rejection when the test is one-sided. For a two-sided test, a 2.5 percent margin is kept in each tail, and thus the test has a higher chance of making a Type II error. These values are assigned to α, also known as the significance level of the test. When the population standard deviation is unknown and the sample size is small, it is better to use the t-distribution (which assumes an approximately normal population) to calculate the range of the variable. The hypothesis which we hope to reject (the straw hypothesis) is usually chosen as the null hypothesis. The hypotheses may take many forms depending upon the application.
H0 : µ = µ1 , H1 : µ = µ2 (general framework)
H0 : µ = µ1 , H1 : µ > µ1 (composite one-sided alternative hypothesis)
H0 : µ ≤ µ1 , H1 : µ > µ1 (composite hypothesis allowing for a negative effect)
H0 : µ = µ1 , H1 : µ ≠ µ1 (two-sided hypothesis)
H0 : µ1 = µ2 , H1 : µ1 ≠ µ2 (comparing two different distributions: difference between two means)
H0 : µ1 − µ2 = 0 , H1 : µ1 − µ2 ≠ 0 (rephrasing the above)
For reasonably sized samples following the Central Limit Theorem, the sample means are normally distributed according to X̄1 ∼ N(µ1, σ1²/n1) , X̄2 ∼ N(µ2, σ2²/n2), and so is their difference
X̄1 − X̄2 ∼ N(µ1 − µ2 , σ1²/n1 + σ2²/n2)  (49)

This can be standardized to the Z or t-statistic (n1 + n2 − 2 degrees of freedom) to facilitate reading
off values from the tables.
Z = [(X̄1 − X̄2) − (µ1 − µ2)] / √(σ1²/n1 + σ2²/n2) , t = [(X̄1 − X̄2) − (µ1 − µ2)] / √(SP²/n1 + SP²/n2)  (50)

For the t-distribution, it is assumed that both the distributions have the same standard deviation SP
SP² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)  (51)
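The pooled t-statistic in (50)–(51) can be sketched in a few lines of Python; the two samples below are invented for illustration:

```python
from math import sqrt

def pooled_t(x1, x2):
    """t-statistic for H0: mu1 = mu2, with pooled variance S_P^2."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    s1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
    s2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
    sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)  # eq. (51)
    t = (m1 - m2) / sqrt(sp2 / n1 + sp2 / n2)              # eq. (50), mu1-mu2=0
    return t, n1 + n2 - 2                                  # t and its DOF

# Invented samples; the first sample is visibly higher.
t, dof = pooled_t([5.1, 4.9, 5.6, 5.2], [4.4, 4.7, 4.1, 4.5, 4.3])
assert dof == 7
assert t > 2        # a large t favors rejecting H0: mu1 = mu2
```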
For binomial counts and proportions, where πi is the probability of success in population i, the formulae are
x1 − x2 ∼ N[n1π1 − n2π2 , n1π1(1 − π1) + n2π2(1 − π2)]  (52)
p1 − p2 ∼ N[π1 − π2 , π1(1 − π1)/n1 + π2(1 − π2)/n2]  (53)
Matched-Pair Samples are often used in such comparison tests.

Chi-Squared distribution
The Chi-square statistic is defined as
χ² = (O1 − E1)²/E1 + (O2 − E2)²/E2 + … + (Om − Em)²/Em  (54)
where Oi and Ei are the observed and expected occurrences for m exhaustive and mutually exclusive outcomes. The degrees of freedom is (m − 1). The Chi-Squared random variable with d = m − 1 degrees of freedom is defined as
χ2d = Z12 + Z22 + . . . + Zd2 (55)
µ = d , σ 2 = 2d (56)
The d Z’s are independent of each other, and the peak of the distribution curve occurs at d − 2 (for d > 2). The values are read off from the table for the respective DOF. Contingency tables are also used in such analyses. For a two-way contingency table with m rows and n columns, the DOF is (m − 1)(n − 1).
Such tests are also known as goodness-of-fit tests.
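The statistic (54) is a one-liner in Python; the die-roll counts below are invented for illustration:

```python
def chi_square(observed, expected):
    """Eq. (54): sum of (O_i - E_i)^2 / E_i over m categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 120 invented die rolls against the fair-die expectation of 20 per face.
observed = [18, 23, 16, 21, 24, 18]
expected = [20] * 6            # m = 6 categories, DOF = m - 1 = 5
stat = chi_square(observed, expected)
print(round(stat, 2))          # -> 2.5
```

A value of 2.5 on 5 degrees of freedom is well below the usual 5 percent critical value, so these counts would be judged consistent with a fair die.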

The sample mean is called the least squares estimator since it minimizes the sum of the squares of the
difference between the sample values and itself. For least squares estimate of a linear equation,

Y = α + βX + ε  (57)

ei = Yi − Ŷi = Yi − (a + bXi)  (58)

where e is the prediction error, ε is the unobserved error term allowing for the imperfection of the model, and the two Y’s are the actual and predicted values. The error term ε reflects inaccurate measurements, omitted influences and sampling errors, and has an expected value of zero and a standard deviation of σ. The estimates a and b for the parameters α and β such that the sum of the squared errors is minimized are given by
b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = (Σ XiYi − nX̄Ȳ) / (Σ Xi² − nX̄²)  (59)
a = Ȳ − bX̄  (60)
The probability distributions for the estimators for n observations are given by
b ∼ N[β , σ/(√n SX)]  (61)
a ∼ N[α , σ √(Σ Xi²/n) / (√n SX)]  (62)
SX = √(Σ(Xi − X̄)²/n)  (63)
Since σ is not known, we use the standard deviation of the residuals e to estimate it. This adjusted value is called the standard error of estimate, SEE:
SEE = √((e1² + e2² + … + en²)/(n − 2))  (65)
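Equations (59), (60) and (65) can be sketched directly; the five data points below are invented (generated near y = 2x) purely for illustration:

```python
from math import sqrt

def least_squares(xs, ys):
    """Eqs. (59)-(60): slope b and intercept a minimizing sum of e_i^2."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) \
        / (sum(x * x for x in xs) - n * xbar ** 2)
    a = ybar - b * xbar
    return a, b

xs = [1, 2, 3, 4, 5]                 # invented data, roughly y = 2x
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
see = sqrt(sum(e * e for e in residuals) / (len(xs) - 2))   # eq. (65)

assert abs(sum(residuals)) < 1e-9    # least-squares residuals sum to zero
assert 1.9 < b < 2.1                 # slope recovers the underlying ~2
```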

The theory of confidence intervals can be applied to the parameters a and b. For hypothesis testing, we can choose β = 0 (which says that the explanatory variable X has no influence on the dependent variable Y) as the straw hypothesis.
The distribution of Ŷ0 = a + bX0, which is an unbiased estimator of α + βX0, is
Ŷ0 ∼ N[α + βX0 , (σ/√n) √(1 + (X0 − X̄)²/SX²)]  (66)

A confidence interval for Ŷ0 can be constructed as previously done. The prediction interval for the value of Y is given by
Y0 = Ŷ0 ± k (std. deviation of Y0 − Ŷ0)  (67)
where k is determined by the confidence level, and the variance of (Y0 − Ŷ0) equals the variance of Ŷ0 plus the variance of ε0. The coefficient of determination, called R-squared, is given by

R² = 1 − Σ(Yi − Ŷi)² / Σ(Yi − Ȳ)²  (68)

The ratio in the formula compares the sum of the squared prediction errors for the model with what
the sum of squared prediction errors would be if b were set equal to zero. Adjusting for the degrees of
freedom gives
Ra² = 1 − [Σ(Yi − Ŷi)²/(n − 2)] / [Σ(Yi − Ȳ)²/(n − 1)]  (69)
R is the correlation coefficient. This value gives the predictive accuracy of the model. R-squared compares the sum of squared prediction errors for the model with that obtained under the assumption that β is zero (Y independent of X). It can also be thought of as the goodness of the model in predicting Y relative to simply using the sample mean as a predictor. It is close to 1.0 if the dependence reduces the prediction errors and close to 0 in the opposite case. In Multiple Regression, an equation with several explanatory variables is estimated, each having a separate effect on the dependent variable.

Y = α + β1X1 + β2X2 + … + βkXk + ε  (70)

The estimate with the parameters a, b, etc. is given by

Ŷ = a + b1 X1 + b2 X2 + . . . + bk Xk (71)

Here, n observations are used to estimate (k + 1) parameters. Other parameters calculated are
SEE = √(Σ(Yi − Ŷi)² / (n − (k + 1)))  (72)
R² = 1 − Σ(Yi − Ŷi)² / Σ(Yi − Ȳ)²  (73)
Ra² = 1 − [Σ(Yi − Ŷi)²/(n − k − 1)] / [Σ(Yi − Ȳ)²/(n − 1)]  (74)
An associated problem in multiple regression is multicollinearity, where high correlations among the explanatory variables prevent accurate estimates of the parameters.

The observed difference between two sample means can be due to different means of different distri-
butions or the large variance of a single distribution. To compare and determine which is the case, we
use the F-statistic. If n is sample size, then

F = (variance of sample means) / [(mean of sample variances)/n]  (75)

A value near 1.0 indicates that the data came from a single population with a large variance. The F-distribution is based on the assumption that the underlying variable is normally distributed. It is characterized by two degrees of freedom.
Numerator: (k − 1) ; Denominator: k(n − 1) ; k = number of treatments, n = sample size per treatment.
Another way of calculating the F-statistic is using the mean sum of squares.
 
Σ_{i=1}^{kn} (Yi − Ȳ)² = n Σ_{i=1}^{k} (Ȳi − Ȳ)² + [ Σ_{i=1}^{n} (Yi − Ȳ1)² + … + Σ_{i=(k−1)n+1}^{kn} (Yi − Ȳk)² ]  (76)

where Ȳ is the mean across all the data points and Ȳi is the mean for the ith treatment. The term on the LHS is the total sum of squares, which measures the variation of Y. The first term on the RHS

is the squared deviations of the treatment means multiplied by n and gauges the variation in Y between treatments. The second collective term on the RHS is the squared deviations about the treatment means and measures the variation in Y within treatments. Thus,
Total variation = variation between treatments (explained) + variation within treatments (unexplained)
F = n(variance of sample means)/(mean of sample variances) = [n(sum of squared deviations of means)/(k − 1)] / [(sum of squared deviations about the means)/k(n − 1)]  (77)
An ANOVA table is then built to report the data.
source of variation              | sum of squares                     | degrees of freedom | mean sum of squares (Col 2/Col 3)
Explained (between treatments)   | n(squared deviations of means)     | k − 1              | n(variance of sample means)
Unexplained (within treatments)  | squared deviations about the means | k(n − 1)           | mean of sample variances
SUM                              | total sum of squares               | nk − 1             |
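The decomposition (76) and the F-ratio (77) can be sketched for equal-sized treatment groups; the three groups of data below are invented for illustration:

```python
def one_way_f(groups):
    """Eq. (77) for k treatments, each with the same sample size n."""
    k, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    ss_between = n * sum((m - grand) ** 2 for m in means)       # explained
    ss_within = sum((y - m) ** 2                                 # unexplained
                    for g, m in zip(groups, means) for y in g)
    return (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))

# Invented data: k = 3 treatments, n = 6 observations each.
groups = [[6, 8, 4, 5, 3, 4],
          [8, 12, 9, 11, 6, 8],
          [13, 9, 11, 8, 7, 12]]
F = one_way_f(groups)
assert F > 1.0    # between-treatment variation exceeds within-treatment
```

Here F is about 9.3, well above 1.0, suggesting the treatment means genuinely differ rather than the data coming from a single wide population.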
F is thus the ratio of explained to unexplained variance. A generalized F-test used to test null hypotheses which restrict the values of parameters in a regression can be done by computing
F = [(restricted unexplained − unrestricted unexplained)/NUM] / [(unrestricted unexplained)/DEN]  (78)
where NUM = number of restrictions imposed by the null hypothesis and DEN = DOF for unrestricted
regression. The other values are estimated from the regression equation with and without imposing
restrictions. The formula for F derived previously is consistent with this definition. Another use of the F-test is its use with a regression model to test composite hypotheses (using 0-1 dummy variables).

References
1. Smith, G. (1999). Statistical Reasoning.
2. Wikipedia: http://en.wikipedia.org


Game Theory, Decision Theory, Time Series Analysis, Capital Asset Pricing Model (CAPM), Wilcoxon-
Mann-Whitney Rank Sum Test, Spearman’s rank correlation

Econometricians, like artists, tend to fall in love with their models. Edward Leamer
A reasonable probability is the only certainty. E.W. Howe
If you torture the data long enough, Nature will confess. Ronald H. Coase
There is no safety in numbers, or in anything else. James Thurber
A pinch of probability is worth a pound of perhaps. James Thurber
Life is a school of probability. Walter Bagehot
The mouse is an animal which, killed in sufficient numbers under carefully controlled
conditions, will produce a Ph.D. thesis. Anonymous