
Probability Theory and Stochastic Processes

İstanbul Teknik Üniversitesi

Mustafa Kamasak, PhD

These slides are licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
License: https://creativecommons.org/licenses/by-nc-nd/4.0/
v2018.10.03

1 / 89
Stochastic vs. Deterministic Systems

I Deterministic system
I no randomness
I same output for the same input and other conditions
I Stochastic system
I randomness due to
I limited capabilities of production and measurement,
I various unknown factors (noise, uncertain parameters, etc.)
I different output even for the same input and other conditions
I Only theoretical systems are deterministic. Their
physical implementations and measurements are stochastic.

2 / 89
Stochastic vs. Deterministic Systems

I This course deals with modelling the output (events /
outcomes) of stochastic systems.
I Outcomes are observations of the results of experiments,
trials, etc.
I Events are observations of things that happen without a
human setup.
I A random output instance (at a certain time) can be modelled
with probability theory.
I The output can be
I nominal
I ordinal
I interval (continuous-valued or discrete-valued)
I A time series of outputs can be modelled as a stochastic process.

3 / 89
Sets

I A set is a collection of objects/elements

A = {ζ1 , ζ2 , · · · , ζN }

There are N elements in set A


I Notation:
ζ2 ∈ A means ζ2 is in set A
ζ2 ∉ A means ζ2 is not in set A
I Empty (or null) set Φ contains no elements by definition
I If a set has n elements, it has 2^n subsets including the empty
set and itself
I A ⊇ B means B is a subset of A
I If A ⊇ B and B ⊇ A then A = B

4 / 89
Set Operators

I Complement:

A^c = {x : x ∈ S but x ∉ A}
I Union:

A ∪ B = {x : x ∈ A or x ∈ B}
I Intersection:

A ∩ B = {x : x ∈ A and x ∈ B}
I Symmetric difference:

A △ B = (A^c ∩ B) ∪ (A ∩ B^c)

5 / 89
Properties of Sets

For any subsets A, B, C of S


I Commutative:
A ∪ B = B ∪ A
A ∩ B = B ∩ A
I Associative law:
(A ∪ B) ∪ C = A ∪ (B ∪ C )
(A ∩ B) ∩ C = A ∩ (B ∩ C )
I Distributive law:
(A ∪ B) ∩ C = (A ∩ C ) ∪ (B ∩ C )
(A ∩ B) ∪ C = (A ∪ C ) ∩ (B ∪ C )

6 / 89
Disjoint Sets

I Sets A and B are disjoint (mutually exclusive) if A ∩ B = Φ


I Several sets A1 , A2 , · · · , AN are mutually exclusive if
Ai ∩ Aj = Φ when i ≠ j.
I Ai are called a partition of S if
I Ai are mutually exclusive
I ∪Ai = S

7 / 89
De Morgan Law

I De Morgan law is used to find the complement of complicated
operations on sets
I De Morgan Law
(∪i Ai)^c = ∩i Ai^c        (∩i Ai)^c = ∪i Ai^c
I General form
Replace each set with its complement
Replace union with intersection
Replace intersection with union
Replace Φ with S
Replace S with Φ
I For example:

[A ∩ (B ∪ Φ)]^c = A^c ∪ (B^c ∩ S)

8 / 89
Duality

I If a complicated equality is proven then its dual is also correct.


I General form
Exchange union with intersection
Exchange intersection with union
Exchange Φ with S
Exchange S with Φ
I For example: if S ∩ A = A is proven then its dual Φ ∪ A = A
is also correct

9 / 89
Sample space and empty set
I S: sample space / certain event
It is the set of all possible outcomes/events
I Φ: empty set/ impossible event
I Field:
Let the Ai be subsets of S
If the Ai are finitely many, F = {Ai : Ai ⊆ S, i ≤ N}, and
I Φ∈F
I If Ai ∈ F then Ai^c ∈ F
I If Ai ∈ F for i = 1, 2...N then ∪_{i=1}^{N} Ai ∈ F
then F is called a field.
I Borel field:
If the Ai are infinitely many then F is called a Borel field.
A Borel field is closed under complement and countable union
operations
I Suppose B = {Ai : Ai ⊆ S and i ∈ N} is a Borel field. Any
subset A of S is called an event iff A ∈ B
10 / 89
Axioms of Probability

I Probability assigns a unique number in [0,1] range to each


event
I Axioms of probability (by Kolmogorov in 1933)
I P(S) = 1
I P(A) ≥ 0 for every A ∈ B
I P(∪i Ai ) = Σi P(Ai ) for all Ai ∈ B if the Ai are mutually disjoint
I Axioms are accepted without a proof.

11 / 89
Theorems

Suppose A and B are two events and the Bi form a partition of S,
then
I P(Φ) = 0 and P(A) ≤ 1
I P(A^c ) = 1 − P(A)
I P(B ∩ A^c ) = P(B) − P(B ∩ A)
I P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
I If A ⊆ B then P(A) ≤ P(B)
I P(A) = Σi P(A ∩ Bi )
These theorems can be proven using axioms and other proven
theorems

12 / 89
Conditional Probability

If S is the sample space, B is its Borel field, and A, B ∈ B with
P(B) > 0, then

P(A|B) = P(A ∩ B) / P(B)

13 / 89
Independence

I If P(A|B) = P(A) then events A and B are independent.


I Observing event B has no effect (gives us no extra
information) on observation of event A.
I Events A and B are independent iff P(A ∩ B) = P(A)P(B)
I The following statements are equivalent
I A and B are independent
I A^c and B^c are independent
I A and B^c are independent
I A^c and B are independent
Proof:

P(A^c ∩ B) = P(B) − P(A ∩ B)
           = P(B) − P(A)P(B)
           = P(B)(1 − P(A))
           = P(B)P(A^c )

14 / 89
Mutual Independence

I A collection of events Ai are called mutually independent iff


every subcollection consists of independent events
I It is possible to have pairwise independent events, but the
whole set may not be mutually independent

Example:
Consider tossing a fair coin twice (S = {H, T } on each toss)
A1 : H on the first toss
A2 : H on the second toss
A3 : same outcome on both tosses
Are events A1 , A2 , A3 mutually independent?
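One way to answer this (a sketch, not part of the slides) is to enumerate the four equally likely outcomes of the two tosses:

```python
# Enumerate the four equally likely outcomes of two fair coin tosses
# and check independence numerically.
from itertools import product

outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT, each prob 1/4
p = lambda event: sum(1 for o in outcomes if event(o)) / len(outcomes)

A1 = lambda o: o[0] == "H"            # H on the first toss
A2 = lambda o: o[1] == "H"            # H on the second toss
A3 = lambda o: o[0] == o[1]           # same outcome on both tosses

both = lambda a, b: (lambda o: a(o) and b(o))

# Pairwise: each pair satisfies P(Ai and Aj) = P(Ai)P(Aj) = 1/4
print(p(both(A1, A2)), p(A1) * p(A2))   # 0.25 0.25
print(p(both(A1, A3)), p(A1) * p(A3))   # 0.25 0.25
print(p(both(A2, A3)), p(A2) * p(A3))   # 0.25 0.25

# But the triple fails: P(A1 and A2 and A3) = 1/4, while the product
# is 1/8, so the collection is not mutually independent.
p123 = p(lambda o: A1(o) and A2(o) and A3(o))
print(p123, p(A1) * p(A2) * p(A3))      # 0.25 0.125
```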

15 / 89
Bayes Theorem

Suppose A1 , ..., AN form a partition of the sample space S. For an
arbitrary event B

P(Aj |B) = P(Aj )P(B|Aj ) / Σ_{i=1}^{N} P(Ai )P(B|Ai )

I P(Ai ) is called prior information


I P(B|Ai ) is called likelihood
I P(Aj |B) is called posterior probability

16 / 89
Bayes Theorem – Example 1

I Consider a rare disease that is seen in 1 of every million people.
I Consider a medical test that is 99% accurate. This means if a
person has this disease (case positive), the test will detect it
correctly with probability 0.99; assume the test is also correct
99% of the time on disease-free people.
I When someone takes this test and the test result turns out to
be positive, what is the probability of this person having this
disease?
I Prior information: P(disease+) = 0.000001
Likelihood: P(test + |disease+) = 0.99
Posterior probability: P(disease + |test+) =?

17 / 89
Bayes Theorem – Example 1

I Prior information: P(disease+) = 0.000001


Likelihood: P(test + |disease+) = 0.99
Posterior probability: P(disease + |test+) =?
I Using Bayes theorem:

P(disease + |test+) = P(test + |disease+)P(disease+) / P(test+)

I What is P(test+)?
I The test can come out positive when there is disease (true
positive), or when there is no disease (false positive). Hence

P(test+) = P(test + |disease+)P(disease+)
         + P(test + |disease−)P(disease−)

18 / 89
Bayes Theorem – Example 1
I Prior information: P(disease+) = 0.000001
Likelihood: P(test + |disease+) = 0.99

P(test+) = P(test + |disease+)P(disease+)
         + P(test + |disease−)P(disease−)
         = 0.99 × 0.000001 + 0.01 × 0.999999
         = 0.01000098 ≈ 0.01
I Hence

P(disease + |test+) = P(test + |disease+)P(disease+) / P(test+)
                    = (0.99 × 0.000001) / 0.01000098
                    ≈ 0.000099 < 0.01%

I Although the test is quite accurate, a positive result alone is
close to meaningless for such a rare disease
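As a quick check, the whole computation fits in a few lines of Python (a sketch; the variable names are mine, the numbers are the slide's):

```python
# Bayes theorem for the rare-disease example.
prior = 1e-6                      # P(disease+)
sens  = 0.99                      # P(test+ | disease+)
fpr   = 0.01                      # P(test+ | disease-), assumed 1 - accuracy

p_test_pos = sens * prior + fpr * (1 - prior)   # total probability
posterior  = sens * prior / p_test_pos          # Bayes theorem

print(p_test_pos)   # ~0.01000098
print(posterior)    # ~9.9e-05, i.e. below 0.01%
```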
19 / 89
Bayes Theorem – Example 2

A cab was involved in a hit and run accident at night. Two cab
companies, the Green and the Blue, operate in the city. You are
given the following data: [1]
I 85% of the cabs in the city are Green and 15% are Blue.
I A witness identified the cab as Blue.
I The court tested the reliability of the witness under the same
circumstances that existed on the night of the accident and
concluded that the witness correctly identified each one of the
two colors 80% of the time and failed 20% of the time.
What is the probability that the cab involved in the accident was
Blue rather than Green?

[1] Example taken from A. Tversky, D. Kahneman, "Evidential impact of base
rates", in Judgement under Uncertainty: Heuristics and Biases, D. Kahneman, P.
Slovic, A. Tversky (editors), Cambridge University Press, 1982
20 / 89
Bayes Theorem – Example 2
I A priori probabilities: P(Green) = 0.85 and P(Blue) = 0.15
I Likelihoods: P(Witness = Blue|Blue) = 0.8 and
P(Witness = Blue|Green) = 0.2
I From Bayes theorem

P(Blue|Witness = Blue) = P(Witness = Blue|Blue) × P(Blue) / P(Witness = Blue)

P(Witness = Blue) = P(Witness = Blue|Blue) × P(Blue)
                  + P(Witness = Blue|Green) × P(Green)
                  = 0.8 × 0.15 + 0.2 × 0.85
                  = 0.29

P(Blue|Witness = Blue) = (0.8 × 0.15) / 0.29 ≈ 41%
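The same few lines of Python, with the cab numbers plugged in (again a sketch):

```python
# Bayes theorem for the cab example; numbers taken from the slide.
p_blue, p_green = 0.15, 0.85
p_say_blue_given_blue  = 0.80     # witness is right 80% of the time
p_say_blue_given_green = 0.20     # and wrong 20% of the time

p_say_blue = (p_say_blue_given_blue * p_blue
              + p_say_blue_given_green * p_green)           # 0.29
print(p_say_blue_given_blue * p_blue / p_say_blue)          # ~0.414
```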
21 / 89
Permutation and Combination
In a repeated trial, we want to enumerate the number of possible
outcomes (without repetition of objects)
I Permutation:
The number of possible arrangements of k objects from a
collection of n objects when the ordering is important.

P_k^n = n(n − 1)(n − 2) · · · (n − k + 1) = n! / (n − k)!

I Combination:
The number of possible arrangements of k objects from a
collection of n objects when the ordering is NOT important.

C_k^n = (n choose k) = n(n − 1) · · · (n − k + 1) / k! = n! / (k!(n − k)!)
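If Python is at hand, both counts are available in the standard library (math.perm and math.comb, Python 3.8+); a small sketch with hypothetical n = 10, k = 4:

```python
# Permutations and combinations from the standard library.
import math

n, k = 10, 4
print(math.perm(n, k))                          # 5040 = 10!/6!
print(math.comb(n, k))                          # 210  = 10!/(4!6!)
print(math.perm(n, k) // math.factorial(k))     # 210: C = P / k!
```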
22 / 89
Properties of Combination

 
I C_0^n = (n choose 0) = 1 if n > 0
I C_k^n = C_{n−k}^n
I Binomial theorem: The combinations C_k^n give the binomial
coefficients.

(a + b)^n = Σ_{k=0}^{n} (n choose k) a^k b^{n−k}

23 / 89
Permutation and Combination

In a repeated trial, we want to enumerate the number of possible
outcomes (with repetition of objects)
I Permutation:
The number of possible arrangements of k objects from a
collection of n objects when the ordering is important.

P̃_k^n = n^k

I Combination:
The number of possible arrangements of k objects from a
collection of n objects when the ordering is NOT important.

C̃_k^n = (n + k − 1 choose k)

24 / 89
Permutation Examples
Example: How many different 2-digit numbers can you obtain
using digits {2, 5, 8} without repeating digits?
I Ordering is important as 25 ≠ 52
I For the first digit there are 3 options from {2, 5, 8}, for the
second digit there are 2 options.
I Hence there are 3 × 2 = 6 possibilities:
25, 28, 52, 58, 82, 85
Example: Assume 20 letters are used to form 3-letter license
plates. How many different plates are possible if the letters can be
repeated? (Both counts are verified in the sketch below.)
I License plate ABC ≠ ACB, so ordering is important
I 20 × 20 × 20 = 8000
I If repeating letters is not permitted, then 20 × 19 × 18 = 6840
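A brute-force sketch confirming both counts (the 20-letter alphabet here is hypothetical):

```python
# Count 3-letter plates with and without repeated letters.
from itertools import permutations, product

letters = [chr(ord("A") + i) for i in range(20)]   # 20 symbols

with_repeats    = sum(1 for _ in product(letters, repeat=3))
without_repeats = sum(1 for _ in permutations(letters, 3))

print(with_repeats)     # 8000 = 20**3
print(without_repeats)  # 6840 = 20*19*18
```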

25 / 89
Combination Examples

Example: A thesis committee will be formed with 4 professors out


of 10 professors in a department. How many different committees
can be formed?
I The order of the committee members is not important.
I Hence there are C_4^10 = 210 different committees that can be formed.

26 / 89
Combination Examples

Example: A thesis committee will be formed with 2 professors


from mechanical engineering and 3 professors from computer
engineering. If mechanical engineering department has 10 and
computer engineering department has 8 professors, how many
different committees can be formed?
I The order of the committee members is not important.
I There can be C_2^10 different selections from mechanical eng.
and C_3^8 different possibilities from computer eng.
I Hence, there are C_2^10 × C_3^8 = 2520 different committees
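The count in one line (a sketch using the standard library):

```python
# Committees: choose 2 of 10 mechanical and 3 of 8 computer professors.
import math
print(math.comb(10, 2) * math.comb(8, 3))   # 45 * 56 = 2520
```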

27 / 89
Random Variables
I A random variable X is a mapping from the sample space S
to a subset X of the real line R
X :S →R
I Using a random variable (rv) a real number can be assigned
to an event/outcome
I For example, the experiment of coin flipping can generate
S = {H, T }

X : H → 1, T → −1
or
Y : H → 100, T → 40
Both X and Y are random variables.
28 / 89
Discrete Random Variables

I A discrete rv takes a finite or countably infinite number of


possible values with specific probabilities assigned to each
value.
I If X is a discrete rv it assigns discrete values such as x1 , x2 , ...
to the events/outcomes
I Then pi means probability of X generating the value of xi :
pi = P(X = xi )
I It is possible for X to assign multiple events/outcomes to a
certain value such as xi . Hence
pi = P(X = xi ) = Σ_{s∈S, X (s)=xi} P(s)
I While assigning probabilities
I pi ≥ 0 for all i
I Σi pi = 1

29 / 89
Probability Mass Function (pmf)

I An assignment xi → pi is called discrete distribution or a


discrete probability distribution
I A function f (x) = P(X = x) for x ∈ X is called a probability
mass function (pmf)
I A pmf is discrete in values
I Why do we use “mass” in pmf?

30 / 89
Cumulative Distribution Function (cdf)

I F (x) = P(X ≤ x) = P({s ∈ S : X (s) ≤ x}) is called the
cumulative distribution function (cdf)
I Properties of cdf
I F (x) is a nondecreasing function:
F (x) ≤ F (y ) for all x ≤ y where x, y ∈ R
I limx→−∞ F (x) = 0 and limx→∞ F (x) = 1
I F (x) is right continuous:
For all x ∈ R, limh→0+ F (x + h) = F (x)
(Examples of valid and invalid cdfs come here.)
I Probability of x is: p(x) = F (x) − F (x⁻)

31 / 89
Probability Density Function (pdf)

I For a continuous rv, P(X = x) = 0
I Define f (x) associated with x ∈ R such that
I f (x) ≥ 0 for all x ∈ R
I ∫_R f (x)dx = 1
I For a given pdf f (x)

P(X ∈ A) = ∫_A f (x)dx

which is the area under the pdf for the given region A

32 / 89
Cumulative Distribution Function (cdf)
I cdf is defined as

F (x) = P(X ≤ x) = ∫_{−∞}^{x} f (t)dt

for all x ∈ R
I F (x) should have a finite or countably infinite number of
discontinuities.
I P(X < a) and P(X ≤ a) are the same, which is

F (a) = ∫_{−∞}^{a} f (t)dt

I P(X > b) and P(X ≥ b) are

1 − F (b) = ∫_{b}^{∞} f (t)dt

I P(a < X < b) or P(a ≤ X < b) or P(a < X ≤ b) or
P(a ≤ X ≤ b) is

F (b) − F (a) = ∫_{a}^{b} f (t)dt

33 / 89
Relation of pdf with cdf
For discrete rv
I pdf → cdf
F (x) = Σ_{t≤x} f (t)
I cdf → pdf
f (x) = F (x) − F (x⁻)
For continuous rv
I pdf → cdf
F (x) = ∫_{−∞}^{x} f (t)dt
I cdf → pdf
f (x) = (d/dt) F (t) |_{t=x}
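A small numeric sketch of both directions, using NumPy (cumulative sums for the discrete case, integration and differentiation for a continuous Exp(1) case):

```python
# pmf <-> cdf for a discrete rv, pdf <-> cdf for a continuous rv.
import numpy as np

# Discrete: cumulative-sum the pmf to get the cdf, difference it back.
pmf = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(pmf)                       # [0.2, 0.7, 1.0]
print(np.diff(cdf, prepend=0.0))           # recovers [0.2, 0.5, 0.3]

# Continuous: for an Exp(1) rv, integrate the pdf / differentiate the cdf.
x = np.linspace(0.0, 10.0, 10001)
f = np.exp(-x)                             # pdf
F = 1.0 - np.exp(-x)                       # cdf
dx = x[1] - x[0]
print(np.sum(f[x < 2.0]) * dx)             # ~F(2) = 1 - e^-2 ~ 0.865
print(np.gradient(F, x)[2000], f[2000])    # both ~ e^-2 at x = 2
```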

34 / 89
Expected Value of a Distribution

I All possible values of a rv X such that f (x) > 0 are called the
support of the distribution of X . The support of X is denoted
by X
I The expected value of a distribution is denoted by E (X )
I For discrete distributions:

E (X ) = Σ_{xi ∈X} xi f (xi )

I For continuous distributions:

E (X ) = ∫_X x f (x)dx

I The expected value is also called the mean of the distribution
and typically denoted by µ

35 / 89
Variance of a Distribution

I The variance of rv X is
I For discrete distributions:

σ_X^2 = Σ_{xi ∈X} (xi − µ)^2 f (xi )

I For continuous distributions:

σ_X^2 = ∫_X (x − µ)^2 f (x)dx

I Variance is a measure of the deviation of a rv from its mean

36 / 89
Standard Deviation of a Distribution

I The square root of the variance is called the standard deviation
of the distribution and it is typically denoted by σ
I For discrete distributions:

σ = √( Σ_{xi ∈X} (xi − µ)^2 f (xi ) )

I For continuous distributions:

σ = √(σ_X^2) = √( ∫_X (x − µ)^2 f (x)dx )

I Both the variance and the standard deviation of a distribution
are nonnegative: σ ≥ 0, σ^2 ≥ 0

37 / 89
What does SD mean?
I For a Gaussian (will cover this later) distributed rv X , the
range
I [µX − σ, µX + σ] contains 68.2%
I [µX − 2σ, µX + 2σ] contains 95.4%
I [µX − 3σ, µX + 3σ] contains 99.7%
of the values of this rv.
I Hence for a continuous rv:

∫_{µ−σ}^{µ+σ} f (x)dx = 0.682
∫_{µ−2σ}^{µ+2σ} f (x)dx = 0.954
∫_{µ−3σ}^{µ+3σ} f (x)dx = 0.997
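These three integrals can be checked with the normal cdf (a sketch; assumes SciPy is available):

```python
# The 68-95-99.7 rule via the normal cdf.
from scipy.stats import norm

mu, sigma = 0.0, 1.0
for k in (1, 2, 3):
    mass = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(k, round(mass, 3))   # 1 0.683, 2 0.954, 3 0.997
```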

38 / 89
Expected Value of a Function of rv

I Consider a function of a rv: X → g (X )
I g (X ) is also a rv as it maps events/outcomes to another
subset of R
I The expected value of g (X ) is:
For discrete rv:

E (g (X )) = Σ_{xi ∈X} g (xi )f (xi )

For continuous rv:

E (g (X )) = ∫_X g (x)f (x)dx

39 / 89
Mean and Variance of Translation and Scaling

Consider a rv X and constant values T and S
I Y =X +T
I Z = SX
What are the mean and variance of the rv's Y and Z ? (A quick
numeric check follows below.)
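The standard answer is that a shift T moves the mean and leaves the variance, while a scale S multiplies the mean by S and the variance by S^2; the Monte Carlo sketch below (arbitrary example numbers) illustrates this:

```python
# Check E(Y) = mu + T, var(Y) = var(X); E(Z) = S*mu, var(Z) = S^2*var(X).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 3.0, size=1_000_000)   # mu = 2, sigma = 3
T, S = 5.0, 4.0

Y, Z = X + T, S * X
print(Y.mean(), Y.var())   # ~7.0 and ~9.0   (mean shifts, variance unchanged)
print(Z.mean(), Z.var())   # ~8.0 and ~144.0 (mean scales by S, variance by S^2)
```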

40 / 89
Expectation is a Linear Operator
I Expectation is a linear operator
I It can exchange order with other linear operators such as
summation, integration, etc.
I For example:
Consider a series of functions gi (X ) and constants ai . What is
the expected value of Y = Σ_{i=1}^{N} ai gi (X )?
I The variance was given as:

σ^2 = E ((X − µX )^2 )
    = E (X^2 − 2X µX + µX^2 )
    = E (X^2 ) − 2µX E (X ) + µX^2
    = E (X^2 ) − µX^2

Hence:

E (X^2 ) = σ_X^2 + µ_X^2
41 / 89
Median of a distribution

I xm is the median of f (x) if F (xm ) = 0.5
I For a continuous distribution:

P(X ≥ xm ) = P(X ≤ xm ) = 0.5

I How is it defined for a discrete rv?

42 / 89
Mode of a distribution

I For discrete rv: The mode is the value x at which its pmf
takes its maximum value.
I For a continuous rv: The mode is the value x at which its pdf
has its maximum value

43 / 89
Comparison of Mean, Median and Mode
I Mode is the most likely value of an rv
that has the highest value of pmf/pdf
I Median is the value of an rv that
divides the pmf/pdf in half
I Mean is the value of an rv that is the
center of mass of pmf/pdf

44 / 89
Comparison of Mean, Median and Mode

I Mean, median and mode have very close values for some
distributions
I For other distributions, their values can be quite different.

45 / 89
Comparison of Mean, Median and Mode

I If a random variable has symmetric (no skew) distribution, its


mean and median are the same.
I However, having the same median and mean does not
necessarily imply a symmetric distribution. For example:
Consider a discrete rv with a support of X = {−2, 0, 4}. The
probabilities for these values are P(X = −2) = 1/3,
P(X = 0) = 1/2, P(X = 4) = 1/6. Then:

µX = (−2)(1/3) + (0)(1/2) + (4)(1/6) = 0

and since FX (−2) = 1/3, FX (0) = 5/6, FX (4) = 1, the median
of X is also 0.

46 / 89
Comparison of Mean, Median and Mode

I If a random variable has symmetric and unimodal (single


peak) distribution, its mode, mean and median are the same.
I If it is positively skewed then mode < median < mean
I If it is negatively skewed then mode > median > mean

47 / 89
Standard Probability Distributions

For discrete rv
I Bernoulli
I Binomial
I Poisson
I Geometric
I Uniform
For continuous rv
I Uniform
I Normal (Gaussian)
I Standard normal
I Gamma
I Exponential
I Chi-square
I Lognormal
I Student’s t
I Cauchy
I F
I Beta
I Negative exponential
I Weibull
I Rayleigh

48 / 89
Bernoulli Distribution

I Discrete distribution
I Single parameter p
I Bernoulli(p)

f (x) = P(X = x) = p^x (1 − p)^{1−x}

for x ∈ {0, 1} and 0 ≤ p ≤ 1


I This means:
P(X = 0) = 1 − p
and
P(X = 1) = p

49 / 89
Bernoulli Distribution – Mean and Variance

f (x) = P(X = x) = p^x (1 − p)^{1−x}

Expected value:

E (X ) = µX = (1)p + (0)(1 − p) = p

Variance:

E (X^2 ) = (1^2 )p + (0^2 )(1 − p) = p

Hence:

σ_X^2 = E (X^2 ) − µ_X^2 = p − p^2 = p(1 − p)
50 / 89
Binomial Distribution
I Discrete distribution
I Two parameters (n, p)
I Binomial(n,p)
 
f (x) = C_x^n p^x (1 − p)^{n−x}

I Repeated Bernoulli trials lead to the Binomial distribution.
Example: If a fair coin is tossed 25 times, what is the
probability of getting 6 tails?
The probability of getting 6 tails and 19 heads in a particular
order is 0.5^6 (1 − 0.5)^19. There are C_6^25 different orderings
(order is not important) with exactly 6 tails. (A one-line check
follows below.)
Example: If a fair coin is tossed 25 times, what is the
probability of getting 6 consecutive tails?
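A one-line check of the first question (a sketch):

```python
# P(exactly 6 tails in 25 fair tosses), straight from the formula above.
import math
print(math.comb(25, 6) * 0.5**6 * 0.5**19)   # ~0.00528
```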

51 / 89
Binomial Distribution – Mean and Variance
Mean:

E (X ) = Σ_{x=0}^{n} x C_x^n p^x (1 − p)^{n−x}
       = Σ_{x=1}^{n} n C_{x−1}^{n−1} p^x (1 − p)^{n−x}
       = np Σ_{x=1}^{n} C_{x−1}^{n−1} p^{x−1} (1 − p)^{n−x}
       = np [p + (1 − p)]^{n−1}
       = np

Remember the binomial theorem:

(a + b)^n = Σ_{x=0}^{n} C_x^n a^x b^{n−x}

52 / 89
Binomial Distribution – Mean and Variance
Variance:

E (X (X − 1)) = n(n − 1)p^2 Σ_{x=2}^{n} C_{x−2}^{n−2} p^{x−2} (1 − p)^{n−x}
             = n(n − 1)p^2

Furthermore

E (X^2 ) = E (X (X − 1)) + E (X ) = n(n − 1)p^2 + np

Then

σ_X^2 = E (X^2 ) − µ_X^2
     = n(n − 1)p^2 + np − (np)^2
     = n^2 p^2 − np^2 + np − n^2 p^2
     = np(1 − p)
53 / 89
Binomial Distribution – Example

For the same p value:
I Expected value increases linearly with n (see the shift in the pmf)
I Variance increases linearly with n (see the expansion in the pmf)
I Hence for a fixed p value, the pmf shifts right and expands as
n increases

54 / 89
Poisson Distribution

I Discrete distribution
I Single parameter λ > 0
I Poisson(λ)
f (x) = e^{−λ} λ^x / x!
I Mean: µX = λ (Derivation is left as an exercise)
I Variance: σX2 = λ
I Mean is equal to variance.

55 / 89
Poisson Distribution – Example

I As λ increases the mean increases linearly. Observe the shift
in the pmf.
I As λ increases the variance increases linearly. Observe the
expansion in the pmf.
56 / 89
Geometric Distribution

I Discrete distribution
I Single parameter p
I Geometric(p)
f (x) = p(1 − p)^{x−1}
I x is the number of trials needed for the Bernoulli trials to
produce “1” for the first time. This can also be regarded as
the number of trials before a success.
I Hence, it should produce “0” x − 1 times and “1” in the x-th
trial.
I Mean: µX = 1/p
I Variance: σ_X^2 = (1 − p)/p^2

57 / 89
Uniform Distribution

I Discrete distribution
I Each of the possible K outcomes is equally likely

f (xi ) = 1/K

for i ∈ {0, 1, ..., K − 1}
I Assuming the xi are the integers in [a, b] with b > a
I Mean:
µX = (a + b)/2
I Variance:
σ_X^2 = ((b − a + 1)^2 − 1)/12

58 / 89
Uniform Distribution

I Continuous distribution x ∈ R
I Two parameters: (a, b) with b > a
I Uniform(a,b)
f (x) = 1/(b − a) for a ≤ x ≤ b

59 / 89
Normal Distribution

I Continuous distribution x ∈ R
I Typically referred as Gaussian distribution
I Widely used
I Two parameters (µ, σ)
I N (µ, σ^2 )

f (x) = (1/(√(2π) σ)) exp( −(x − µ)^2 /(2σ^2 ) )

60 / 89
Standard Normal Distribution

I Gaussian distribution with zero mean and unit variance
I N (0, 1)

φ(z) = (1/√(2π)) exp(−z^2 /2)

I It is possible to normalize X from N (µ, σ^2 ) into N (0, 1) with
the following transformation

Z = (X − µ)/σ

I It is possible to transform Z from N (0, 1) back into N (µ, σ^2 )
with the following transformation

X = σZ + µ
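Both directions in code (a sketch; assumes SciPy, with made-up µ = 10, σ = 2):

```python
# Standardize and de-standardize a normal rv.
from scipy.stats import norm

mu, sigma, x = 10.0, 2.0, 13.0
z = (x - mu) / sigma                        # forward: N(mu, sigma^2) -> N(0, 1)
print(norm.cdf(x, mu, sigma), norm.cdf(z))  # identical probabilities
print(sigma * z + mu)                       # inverse: recovers x = 13.0
```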

61 / 89
Gamma Distribution

I Continuous distribution
I Two parameters (α, β) with 0 < α, β < ∞
I Gamma(α, β)

f (x) = (1/(β^α Γ(α))) x^{α−1} exp(−x/β)

where Γ(α) = ∫_0^∞ e^{−x} x^{α−1} dx
I For the Γ function
I Γ(α) = (α − 1)Γ(α − 1) for any α > 1. For large values of α,
Γ(α) ≈ √(2π) e^{−α} α^{α−0.5} (Stirling’s approximation)
I Γ(n) = (n − 1)! for any positive integer n. For large values of n,
n! ≈ √(2π) e^{−n} n^{n+0.5}
I Γ(1/2) = √π

62 / 89
Exponential Distribution

I Continuous distribution
I Single parameter β
I Exp(β)

f (x) = (1/β) exp(−x/β)

where 0 < x, β < ∞
I This is a specific case of the Gamma distribution with α = 1
I If β = 1 this distribution is called the standard exponential
distribution
I cdf

F (x) = 0 if x ≤ 0,  F (x) = 1 − e^{−x/β} if x > 0

63 / 89
Chi-square Distribution

I Continuous distribution
I Single parameter ν (this is also called the degrees of freedom)
I Chi(ν) is the chi-square distribution with ν degrees of freedom

f (x) = Gamma(ν/2, 2)

64 / 89
Lognormal Distribution
I Continuous distribution
I Two parameters (µ, σ)
I pdf

f (x) = (1/(√(2π) σx)) exp( −(log(x) − µ)^2 /(2σ^2 ) )

with 0 < x < ∞, −∞ < µ < ∞, and 0 < σ < ∞
I cdf

F (x) = Φ( (log(x) − µ)/σ )

where Φ is the cdf of the standard normal distribution
I Note that

P(log(X ) ≤ x) = P(X ≤ e^x ) = Φ( (x − µ)/σ )

I The logarithm of the rv X has a N (µ, σ^2 ) distribution. Hence, it
is called the lognormal distribution.
65 / 89
Student’s t Distribution

I Continuous distribution
I Single parameter ν (degrees of freedom)
I pdf

f (x) = a(ν) (1 + x^2 /ν)^{−(ν+1)/2}

where

a(ν) = Γ(0.5(ν + 1)) / (√(νπ) Γ(0.5ν))

for integer values of ν
I The distribution is symmetric around x = 0
I Discovered by W.S. Gosset under the pseudonym “Student” in
1908

66 / 89
Cauchy Distribution

I Continuous distribution
I Specific case of Student’s t distribution with ν = 1
I pdf

f (x) = 1/(π(1 + x^2 ))

I cdf

F (x) = (1/π) arctan(x) + 1/2

for x ∈ R

67 / 89
F Distribution

I Continuous distribution
I Two parameters (ν1 , ν2 )
I The order of the parameters is important. Hence f_{ν1,ν2}(x) ≠ f_{ν2,ν1}(x)
I pdf

f (x) = k(ν1 , ν2 ) x^{0.5(ν1 −2)} (1 + (ν1 /ν2 )x)^{−0.5(ν1 +ν2 )}

where

k(ν1 , ν2 ) = (ν1 /ν2 )^{0.5ν1} Γ(0.5(ν1 + ν2 )) / (Γ(0.5ν1 )Γ(0.5ν2 ))

68 / 89
Beta Distribution
I Beta function

b(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β)

where α, β are positive real numbers.
I Beta distribution
I Continuous distribution
I Two parameters (α, β)
I Beta(α, β)

f (x) = (1/b(α, β)) x^{α−1} (1 − x)^{β−1}

I Beta(1,1) is equal to the uniform distribution with range [0,1]

69 / 89
Negative Exponential Distribution

I Continuous distribution
I Two parameters (γ, β)
I pdf

f (x) = (1/β) exp(−(x − γ)/β)

where 0 < β < ∞, −∞ < γ < ∞ and γ < x < ∞

70 / 89
Weibull Distribution

I Continuous distribution
I Two parameters (α, β)
I pdf

f (x) = αβ^{−α} x^{α−1} exp(−(x/β)^α )

where 0 < α, β and 0 < x < ∞

71 / 89
Rayleigh Distribution

I Continuous distribution
I Single parameter θ
I pdf

f (x) = (2x/θ) exp(−x^2 /θ)

where 0 < x < ∞ and 0 < θ

72 / 89
Laplace Distribution

I Continuous distribution
I Two parameters (µ, σ)
I pdf

f (x) = (1/(2σ)) exp(−|x − µ|/σ)

73 / 89
Moments

I The r-th moment of a rv X is

nr = E (X^r )

I For r = 1, the first moment of an rv is its mean

74 / 89
Central Moments

I The r-th central moment of a rv X is

cr = E ((X − µ)^r )

I c1 = 0
I c2 = σ^2
I If nr is finite then
I cr is finite
I ns is finite for s ∈ {1, 2, ..., r − 1}
I cs is finite for s ∈ {1, 2, ..., r − 1}

75 / 89
Moment Generating Function

MX (t) = E (e^{tX} )

I Why do we need it?
Remember

e^{tX} = 1 + tX + t^2 X^2 /2! + ...

I Hence the moments of an rv X can be computed from its
moment generating function

nr = (d^r /dt^r ) MX (t) |_{t=0}
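A symbolic sketch with SymPy, using the exponential MGF M(t) = (1 − βt)^{−1} from the next slide:

```python
# Recover moments of the exponential distribution from its MGF.
import sympy as sp

t, beta = sp.symbols("t beta", positive=True)
M = 1 / (1 - beta * t)

n1 = sp.diff(M, t, 1).subs(t, 0)   # first moment: beta (the mean)
n2 = sp.diff(M, t, 2).subs(t, 0)   # second moment: 2*beta**2
print(n1, n2, sp.simplify(n2 - n1**2))   # variance: beta**2
```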

76 / 89
Moment Generating Function of Distributions

The moment generating functions of some of the distributions that
we have covered
I Bernoulli: MX (t) = (1 − p) + pe^t
I Binomial: MX (t) = ((1 − p) + pe^t )^n
I Poisson: MX (t) = e^{−λ+λe^t}
I Normal: MX (t) = e^{tµ+0.5t^2 σ^2}
I Gamma: MX (t) = (1 − βt)^{−α} for all t < 1/β
I Exponential: MX (t) = (1 − βt)^{−1}
I Chi-square: MX (t) = (1 − 2t)^{−0.5ν} for t < 0.5

77 / 89
Functions of Random Variables

I Functions of an rv are also rv


I Consider a function that maps an rv X into another rv Y , i.e.
Y = g (X )
I If the pdf of X is given as fX (x), how do we find the pdf of
Y , fY (y )? How are they related?

78 / 89
Functions of Discrete Random Variables

I The function g (.) may or may not be a one-to-one function.


[Figure: two mapping diagrams. Left, a one-to-one function (injection):
1 → D, 2 → B, 3 → C, with A ∈ Y unused. Right, not one-to-one
(surjection): X = {1, 2, 3, 4} maps onto Y = {D, B, C}, so two values
of X share the same image.]

I If function g (.) is a one-to-one function there will be a single


yi for each xi value. Hence: P(Y = yi ) = P(X = xi ) for
yi = g (xi ). Then
f (yi ) = f (xi )
for yi = g (xi )

79 / 89
Functions of Discrete Random Variables

I If the function g (.) is not a one-to-one function, there may
be many xj that map to the same yi , with yi = g (xj ).
I In this case

P(Y = yi ) = Σ_{j: yi =g (xj )} P(X = xj )

and

f (yi ) = Σ_{j: yi =g (xj )} f (xj )

80 / 89
Functions of Continuous Random Variables
I Continuous functions can also be one-to-one or not.

[Figure: two graphs of y = f (x) with f : X → Y. Left (one-to-one):
each y in the image has a single preimage x. Right (not one-to-one):
a value y can have several preimages x1 , x2 ∈ X.]

I To find fY (y ), first consider

P(y ≤ Y ≤ y + dy ) = fY (y )dy = FY (y + dy ) − FY (y )

81 / 89
Functions of Continuous Random Variables
I If the function is one-to-one then

P(y ≤ Y ≤ y + dy ) = P(x ≤ X ≤ x + dx)

which is
fY (y )dy = fX (x)dx
I If the function is not one-to-one and there are N different
values of x that map to the same y value, y = g (xi ) for
i = 1, 2...N. In this case

P(y ≤ Y ≤ y + dy ) = P(x1 ≤ X ≤ x1 + dx)


+ P(x2 ≤ X ≤ x2 + dx) + · · ·
+ P(xN ≤ X ≤ xN + dx)

which is

fY (y )dy = fX (x1 )dx + fX (x2 )dx + · · · + fX (xN )dx


82 / 89
Functions of Continuous Random Variables

From the graph, dy /dx = g'(x). Hence dx = dy /g'(x) and

fY (y )dy = fX (x1 ) dy /|g'(x1 )| + fX (x2 ) dy /|g'(x2 )| + · · · + fX (xN ) dy /|g'(xN )|

I The absolute value is taken to avoid negativity in the pdf
Finally:

fY (y ) = Σ_{i: y =g (xi )} fX (xi ) / |g'(xi )|

83 / 89
Example for Function of Discrete RV – 1
I One-to-one discrete rv Y = 3X + 2 with X = {1, 2, 3}.

[Figure: top, the line Y = 3X + 2 plotted over X ; bottom left, the
pmf fX (x) on X = {1, 2, 3}; bottom right, the same probabilities
carried over to fY (y ) at y = 5, 8, 11.]

84 / 89
Example for Function of Continuous RV – 1

I Consider a linear transformation Y = g (X ) = aX + b where a
and b are constant real numbers.
I For this function d(aX + b)/dX = a and x = (y − b)/a. Hence

fY (y ) = fX (x)/|a| = fX ((y − b)/a)/|a|

Linear transformation of rv
A linear transformation of a random variable does not change the
type of distribution (i.e. uniform, Gaussian, etc.). It may change the
parameters such as the mean, variance, etc.
I If X has a uniform distribution in the [x1 , x2 ] range, then Y also
has a uniform distribution, in the [ax1 + b, ax2 + b] range, as the
sketch below illustrates.
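A Monte Carlo sketch of this statement (arbitrary a = 3, b = −1, X uniform on [0, 1]):

```python
# A linear map of a Uniform(0, 1) rv is Uniform(b, a + b).
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=1_000_000)
a, b = 3.0, -1.0
Y = a * X + b                       # should be Uniform(-1, 2)

hist, _ = np.histogram(Y, bins=10, range=(-1.0, 2.0), density=True)
print(Y.min(), Y.max())             # ~-1.0 and ~2.0
print(hist)                         # all bins ~1/3, the uniform density
```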

85 / 89
Example for Function of Continuous RV – 2

I Consider Y = g (X ) = 1/X . Find fY (y ) in terms of fX (x).
I This is a one-to-one function with a single value of X such that
X = 1/Y .

g'(X ) = −1/X^2 = −1/(1/Y )^2 = −Y^2

I Hence

fY (y ) = (1/y^2 ) fX (1/y )

86 / 89
Example for Function of Continuous RV – 3
I Consider Y = g (X ) = aX^2 where a > 0 ∈ R is a constant.
Find fY (y ) in terms of fX (x).
I This is not a one-to-one function
I For y < 0 there is no x
I For y > 0 there are two values of X: x1 = √(y /a) and
x2 = −√(y /a)
that satisfy Y = g (X ) = aX^2 .
I g'(X ) = 2aX and
1/|g'(X )| = 1/(2a|X |) = 1/(2a√(y /a)) = 1/(2√(ay )) for both
x1 = √(y /a) and x2 = −√(y /a) when y > 0
I Then

fY (y ) = (1/(2√(ay ))) ( fX (√(y /a)) + fX (−√(y /a)) )  if y > 0
fY (y ) = 0  if y < 0

87 / 89
Example for Function of Continuous RV – 4
I Consider Y = g (X ) = e^X . Find fY (y ) in terms of fX (x).
I This is a one-to-one function with a single solution at
x = ln y
I

1/|g'(x)| = 1/e^x = 1/e^{ln y} = 1/y

I Note that y > 0 for all values of x
I Then

fY (y ) = (1/y ) fX (ln y )  if y > 0
fY (y ) = 0  if y ≤ 0
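For X ~ N(0, 1) this result can be checked against SciPy's lognormal pdf (a sketch; lognorm with s = 1 corresponds to µ = 0, σ = 1):

```python
# f_Y(y) = f_X(ln y)/y matches the lognormal pdf when X ~ N(0, 1).
import numpy as np
from scipy.stats import norm, lognorm

y = np.array([0.5, 1.0, 2.0, 5.0])
formula = norm.pdf(np.log(y)) / y           # derived on this slide
direct  = lognorm.pdf(y, s=1.0)             # scipy's lognormal, sigma = 1
print(np.allclose(formula, direct))         # True
```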

88 / 89
PDF Conversion

I We have seen how to find the pdf of a function of a rv.


I Now a different problem: How can we convert a rv X with
fX (x) to another rv Y with fY (y ) using a function y = g (x)?
I We will use 2 steps (sketched in code below)
1. Convert X into another temporary rv Z that has uniform
distribution in [0, 1]
2. Convert Z into Y
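A sketch of the two steps for continuous rvs (assumes SciPy; here an Exp(1) sample is turned into a standard normal one via Z = F_X(X) and Y = F_Y^{-1}(Z)):

```python
# Inverse-transform conversion of one rv into another.
import numpy as np
from scipy.stats import expon, norm, kstest

rng = np.random.default_rng(2)
X = rng.exponential(1.0, size=100_000)   # rv with known cdf F_X

Z = expon.cdf(X)                         # step 1: uniform in [0, 1]
Y = norm.ppf(Z)                          # step 2: apply inverse target cdf

print(kstest(Y, "norm").pvalue > 0.01)   # typically True: Y looks normal
```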

89 / 89
