
STOCHASTIC MODELLING - STAT3004/STAT7018, 2015, Semester 2

Contents
1  Basics of Set-Theoretical Probability Theory

2  Random Variables
   2.1  Definition and Distribution
   2.2  Common Distributions
   2.3  Moments and Quantiles
   2.4  Moment Generating Functions

3  Several Random Variables
   3.1  Joint distributions
   3.2  Covariance, Correlation, Independence
   3.3  Sums of Random Variables and Convolutions
   3.4  Change of Variables

4  Conditional Probability
   4.1  Conditional Probability of Events
   4.2  Discrete Random Variables
   4.3  Mixed Cases
   4.4  Random Sums
   4.5  Conditioning on Continuous Random Variables
   4.6  Joint Conditional Distributions

5  Elements of Matrix Algebra

6  Stochastic Process and Markov Chains
   6.1  Introduction and Definitions
   6.2  Markov Property
   6.3  Stationarity
   6.4  Transition Matrices and Initial Distributions
   6.5  Examples of Markov Chains
   6.6  Extending the Markov Property
   6.7  Multi-Step Transition Functions
   6.8  Hitting Times and Strong Markov Property
   6.9  First Step Analysis
   6.10 Transience and Recurrence
   6.11 Decomposition of the State Space
   6.12 Computing hitting probabilities
   6.13 Martingales
   6.14 Special chains
   6.15 Summary

7  Stationary Distribution and Equilibrium
   7.1  Introduction and Definitions
   7.2  Basic Properties of Stationary and Steady State Distributions
   7.3  Periodicity and Smoothing
   7.4  Positive and Null Recurrence
   7.5  Existence and Uniqueness of Stationary Distributions
   7.6  Examples of Stationary Distributions
   7.7  Convergence to the Stationary Distribution
   7.8  Summary

8  Pure Jump Processes
   8.1  Definitions
   8.2  Characterizing a Markov Jump Process
   8.3  S = {0, 1}
   8.4  Poisson Processes
   8.5  Inhomogeneous Poisson Processes
   8.6  Special Distributions Associated with the Poisson Process
   8.7  Compound Poisson Processes
   8.8  Birth and Death Processes
   8.9  Infinite Server Queue
   8.10 Long-run Behaviour of Jump Processes

9  Gaussian Processes
   9.1  Univariate Gaussian Distribution
   9.2  Bivariate Gaussian Distribution
   9.3  Multivariate Gaussian Distribution
   9.4  Gaussian Processes and Brownian Motion
   9.5  Brownian Motion via Random Walks
   9.6  Brownian Bridge
   9.7  Geometric Brownian Motion
   9.8  Integrated Brownian Motion
   9.9  White Noise

Part I: Review Probability & Conditional Probability


1  Basics of Set-Theoretical Probability Theory

Sets and Events. We need to recall a little bit of set theory and its terminology insofar as it is relevant to probability. To start, we shall refer to the set of all possible outcomes that a random experiment may take on as the sample space and denote it by Ω. In probability theory Ω is construed as a set. Its elements are called the samples.
An event A is then most simply thought of as a suitable subset of Ω, that is A ⊆ Ω, and we shall generally use the terms event and set interchangeably. (For the technically minded, not all subsets of Ω can be included as legitimate events for measure-theoretic reasons, but for our purposes, we will ignore this subtlety.)
Example 1.1. Consider the random process of flipping a coin twice. For this scenario, the sample space is the set of all possible outcomes, namely Ω = {HH, HT, TH, TT} (discounting, of course, the possibility that the coin lands on its side and assuming that the coin has two distinct sides H and T). One obvious event might be that of getting an H on the first of the two tosses, in other words A = {HH, HT}.

Basic Set Operations. There are four basic set operators: union (∪), intersection (∩), complementation (^c), and cardinality (#).
Let A, B ⊆ Ω. The union of two sets is the set which contains all the elements in either of the original sets, and we write A ∪ B. A ∪ B is the event that either A or B or both happen. The intersection of two sets is the set which contains all the elements which are common to the two original sets, and we write A ∩ B. A ∩ B is the event that both A and B happen simultaneously.
The complement of a set A is the set containing all of the elements in the sample space which are not in the original set A, and we write A^c. So, clearly, Ω^c = ∅, ∅^c = Ω, and (A^c)^c = A. A^c is the event that A does not happen. (Notational note: occasionally, the complement of A is denoted by Ā, but this is rarely done in statistics due to the potential for confusion with sample means.)
Note that if two sets A and B have no elements in common then they are referred to as disjoint and thus, A ∩ B = ∅, where ∅ signifies the empty or null set (the impossible event). Also, if A ⊆ B then clearly A ∩ B = A, so that in particular A ∩ Ω = A for any event A.
Using unions and intersections, we can now define a very useful set theory concept, the partition. A collection of sets A_1, ..., A_k is a partition of Ω if their combined union is equal to the entire sample space and they are all mutually disjoint; that is, A_1 ∪ ... ∪ A_k = Ω and A_i ∩ A_j = ∅ for any i ≠ j. In other words, a partition is a collection of events one and only one of which must occur. In addition, note that the collection of sets {A, A^c} forms a very simple but nonetheless extremely useful partition.
Finally, the cardinality of a set is simply the number of elements it contains. Thus, in Example 1.1 above, #Ω = 4 while #A = 2. A set is called countable if its elements can be enumerated (in a possibly non-unique way) by the natural numbers; in particular, any set A with finitely many elements is countable, i.e. #A is finite. Examples of countable, but infinite, sets are the natural numbers N = {1, 2, 3, ...} and the integers Z = {..., −2, −1, 0, 1, 2, ...}. Also the rational numbers Q are countable. Intervals (a, b), (a, b] and the real line R = (−∞, ∞) are examples of uncountable sets.

Basic Set Theory Rules.
The Distributive laws:

(A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)

De Morgan's rules:

(A ∪ B)^c = A^c ∩ B^c ;    (A ∩ B)^c = A^c ∪ B^c

You should convince yourself of the validity of these rules through the use of Venn diagrams. Formal proofs are elementary.
Basic Probability Rules. We now use the above set theory nomenclature to discuss the basic tenets of probability. Informally, the probability of an event A is simply the chance that it will occur. If the elements of the sample space are finite in number and may be considered equally likely, then we may calculate the probability of an event A as

P(A) = #A / #Ω .

More generally, of course, we will have to rely on our long-run frequency interpretation of the probability of an event; namely, the probability of an event is the proportion of times that it would occur among a (generally hypothetical) infinite number of equivalent repetitions of the random experiment.
Zero & Unity Rules. All probabilities must fall between 0 and 1, i.e. 0 ≤ P(A) ≤ 1. In particular, P(∅) = 0 and P(Ω) = 1.
Subset rule. If A ⊆ B, then P(A) ≤ P(B).

Inclusion-Exclusion Law. The inclusion-exclusion rule states that the probability of the union of two events is equal to the sum of the probabilities of the two events minus the probability of the intersection of the two events, which has been in some sense double counted in the sum of the initial two probabilities, so that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Notice that the final subtracted term disappears if the two events A and B are disjoint; more generally:
Additivity. Assume that A_1, ..., A_n ⊆ Ω with A_i ∩ A_j = ∅ for i ≠ j. Then

P(A_1 ∪ ... ∪ A_n) = P(A_1) + ... + P(A_n).

Countable Additivity. Assume that A_1, A_2, A_3, ... is a sequence of events with A_i ∩ A_j = ∅ for i ≠ j. Then

P(A_1 ∪ A_2 ∪ A_3 ∪ ...) = P(A_1) + P(A_2) + P(A_3) + P(A_4) + ...

Complement Rule. The probability of the complement of an event is equal to one minus the probability of the event itself, so that P(A^c) = 1 − P(A). This rule is easily derived from the Inclusion-Exclusion rule.
Product Rule. Two events A and B are said to be independent if and only if they satisfy the equation P(A ∩ B) = P(A)P(B).
The Law of Total Probability. The law of total probability is a way of calculating a probability by breaking it up into several (hopefully easier to deal with) pieces. If the sets A_1, ..., A_k form a partition, then the probability of an event B may be calculated as:

P(B) = Σ_{i=1}^{k} P(B ∩ A_i).

Again, heuristic verification is straightforward from a Venn diagram.
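These rules can also be checked mechanically by counting on a small finite sample space. The Python sketch below (an illustration only; the events chosen are arbitrary) enumerates the two-toss sample space of Example 1.1 and verifies inclusion-exclusion and the law of total probability:

    from itertools import product
    from fractions import Fraction

    # Sample space for two coin tosses, as in Example 1.1
    omega = set(product("HT", repeat=2))

    def prob(event):
        """Equally likely outcomes: P(A) = #A / #Omega."""
        return Fraction(len(event), len(omega))

    A = {w for w in omega if w[0] == "H"}     # head on the first toss
    B = {w for w in omega if w[1] == "H"}     # head on the second toss

    # Inclusion-exclusion: P(A u B) = P(A) + P(B) - P(A n B)
    assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

    # Law of total probability with the partition {B, B^c}
    assert prob(A) == prob(A & B) + prob(A & (omega - B))
    print(prob(A), prob(A | B))               # 1/2 and 3/4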

2  Random Variables

2.1  Definition and Distribution

Definition. A random variable X is a numerically valued function X : Ω → R (R denoting the real line) whose domain is a sample space Ω. If the range of X is a countable subset of the real line then we call X a discrete random variable. (For the technically minded, not all numerical functions X : Ω → R are random variables for measure-theoretic reasons, but for our purposes, we will ignore this subtlety.)
Below we introduce the notion of a continuous random variable. A continuous random
variable takes values in an uncountable set such as intervals or the real line. Note that
a random variable cannot be continuous if the sample space on which it is defined is
countable; however, a random variable defined on an uncountable sample space may still
be discrete. In the coin tossing scenario of Example 1.1 above, the quantity X which
records the number of heads in the outcome is a discrete random variable.
Distribution of a Random Variable. Since random variables are functions on a sample space, we can determine probabilities regarding random variables by determining the probability of the associated subset of Ω. The probability of a random variable X being in some subset I ⊆ R of the real line is the probability of the event A = {ω ∈ Ω : X(ω) ∈ I}:

P(X ∈ I) = P({ω ∈ Ω : X(ω) ∈ I}) .

Note that we have used the notion of a random variable as a function on the sample space when we use the notation X(ω). The collection of all probabilities P(X ∈ I) is called the distribution of X.
Probability Mass Function (PMF). If X is discrete, then it is clearly desirable to find p_X(x) = P(X = x), the probability mass function (or pmf) of X, because it is possible to characterise the distribution of X in terms of its pmf p_X via

P(X ∈ I) = Σ_{i ∈ I} p_X(i) .

If X is discrete, we have

Σ_{x ∈ Range(X)} p_X(x) = 1.

Cumulative Distribution Function (CDF). For any random variable X : Ω → R the function

F_X(x) = P(X ≤ x) ,    x ∈ R,

is called the cumulative distribution function (CDF) of X. The CDF of X determines the distribution of X (the collection of all probabilities P(X ∈ I) can be computed from the CDF of X).
If X is a discrete random variable then its cumulative distribution function is a step function:

F_X(x) = P(X ≤ x) = Σ_{y ∈ Range(X): y ≤ x} p_X(y).

(Absolutely) Continuous Random Variable. Assume that X is a random variable such that

P(X ∈ I) = ∫_I f_X(x) dx,

where f_X(x) is some nonnegative function with ∫_{−∞}^{∞} f_X(x) dx = 1. Then X is called a continuous random variable admitting a density f_X. In this case, the CDF is still a valid entity, being continuous and given by

F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} f_X(x') dx' .

Observe that the concept of a pmf is completely useless when dealing with continuous r.v.s as we have P(X = x) = 0 for all x. The Fundamental Theorem of Calculus thus shows that f(x) = dF(x)/dx = F'(x), which in turn leads to the informal identity

P(x < X ≤ x + dx) = F(x + dx) − F(x) = dF(x) = f(x) dx,

which is where the density function f gets its name, since in some sense it describes how the probability is spread over the real line.
(Notational note: We will attempt to stick to the convention that capital letters denote
random variables while the corresponding lower case letters indicate possible values or
realisations of the random variable.)

2.2  Common Distributions

The real importance of CDFs, pmfs and densities is that they completely characterize the random variable from which they were derived. In other words, if we know the CDF (or equivalently the pmf or density) then we know everything there is to know about the random variable. For most random variables that we might think of, of course, writing down a pmf, say, would entail the long and tedious process of listing all the possible values and their associated probabilities. However, there are some types of important random variables which arise over and over and for which simple formulae for their CDFs, pmfs or densities have been found. Some common CDFs, pmfs and densities are listed below:

Discrete Distributions

Poisson(λ), λ > 0:
    p(x) = e^{−λ} λ^x / x!,    x ∈ N_0 = {0, 1, 2, ...}
Binomial(n, p), n ∈ N = {1, 2, 3, ...}, 0 < p < 1:
    p(x) = C(n, x) p^x (1−p)^{n−x},    x ∈ {0, 1, ..., n}
Negative Binomial(r, p), r ∈ N, 0 < p < 1:
    p(x) = C(x+r−1, r−1) (1−p)^r p^x,    x ∈ N_0 = {0, 1, 2, ...}
Geometric(p), 0 < p < 1:
    p(x) = p (1−p)^{x−1},    x ∈ N = {1, 2, 3, ...}
Hypergeometric(n, N, M), n ∈ {1, ..., N}, M ∈ {0, ..., N}:
    p(x) = C(M, x) C(N−M, n−x) / C(N, n),    x ∈ {max(0, n+M−N), ..., min(M, n)}

Continuous Distributions

Normal(μ, σ²), μ ∈ R, σ² > 0:
    f(x) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)),    x ∈ R = (−∞, ∞)
Exponential(λ), λ > 0:
    f(x) = λ e^{−λx},    x ∈ (0, ∞)
Uniform(a, b), −∞ < a < b < ∞:
    f(x) = 1/(b−a),    x ∈ (a, b)
Weibull(α, λ), α > 0, λ > 0:
    f(x) = αλ x^{α−1} e^{−λx^α},    x ∈ (0, ∞)
Gamma(α, λ), α > 0, λ > 0:
    f(x) = (λ/Γ(α)) (λx)^{α−1} e^{−λx},    x ∈ (0, ∞)
Chi-Squared(k), k ∈ N = {1, 2, 3, ...}:
    f(x) = (1/(2^{k/2} Γ(k/2))) x^{(k−2)/2} e^{−x/2},    x ∈ (0, ∞)
Beta(α, β), α, β > 0:
    f(x) = (Γ(α+β)/(Γ(α)Γ(β))) x^{α−1} (1−x)^{β−1},    x ∈ (0, 1)
Student's t_k, k ∈ N:
    f(x) = (Γ((k+1)/2)/(√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2},    x ∈ (−∞, ∞)
Fisher-Snedecor F_{m,n}, m, n ∈ N:
    f(x) = (Γ((m+n)/2)/(Γ(m/2)Γ(n/2))) (m/n)^{m/2} x^{(m−2)/2} (1 + mx/n)^{−(m+n)/2},    x ∈ (0, ∞)

The factorials n! and the binomial coefficients C(n, x) are defined as follows: 0! := 1 and, for n ∈ N and x ∈ {0, ..., n},

n! := n · (n−1) · · · 1 ,    C(n, x) := n! / (x!(n−x)!) .

The gamma function, Γ(α), is defined by the integral

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,

from which it follows that if α is a positive integer, then Γ(α) = (α−1)!. Also, note that for α = 1, the Gamma(1, λ) distribution is equivalent to the Exponential(λ) distribution, while for λ = 1/2, the Gamma(α, 1/2) distribution is equivalent to the Chi-squared distribution with 2α degrees of freedom. Similarly, the Geometric(p) distribution is closely related to the Negative Binomial distribution when r = 1.
Above we listed formulas only for those x where p(x) or f(x) > 0. For the remaining x's we have p(x) = 0 or f(x) = 0. We write X ∼ Q to indicate that X has the distribution Q: for instance, X ∼ Normal(0, 1) refers to a continuous random variable X which has the density f_X(x) = (1/√(2π)) e^{−x²/2}. Similarly, Y ∼ Poisson(5) refers to a discrete random variable Y with range N_0 having pmf p_Y(x) = e^{−5} 5^x / x! for x ∈ N_0.
Exercise. (a) Let X ∼ Exponential(λ). Check that the CDF of X satisfies F_X(x) = 1 − e^{−λx} for x ≥ 0. Graph this function for x ∈ [−1, 4] for the parameter λ = 1.
(b) Let X ∼ Geometric(p). Check that the CDF of X satisfies F_X(x) = 1 − (1−p)^x for x ∈ {1, 2, 3, ...}. Graph this function for x ∈ [−1, 4] (hint: step function).



2.3  Moments and Quantiles

Moments. The mth moment of a random variable X is the expected value of the random variable X^m and is defined as

E[X^m] = Σ_{x ∈ Range(X)} x^m p_X(x),

if X is discrete, and as

E[X^m] = ∫_{−∞}^{∞} x^m f_X(x) dx,

if X is continuous (provided, of course, that the quantities on the right hand sides exist). In particular, when m = 1, the first moment of X is generally referred to as its mean and is often denoted as μ_X, or just μ when there is no chance of confusion. The expected value of a random variable is one measure of the centre of its distribution.
General Formulae. A good, though somewhat informal, way of thinking of the expected value is that it is the value we would tend to get if we were to average the outcomes of a very large number of equivalent realisations of the random variable. From this idea, it is easy to generalize the moment definition to encompass the expectations of any function, g, of a random variable as either

E[g(X)] = Σ_{x ∈ Range(X)} g(x) p(x),

or

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx,

depending on whether X is discrete or continuous.


Central Moments and Variance. The idea of moments is often extended by defining the central moments, which are the moments of the centred random variable X − μ_X. The first central moment is, of course, equal to zero. The second central moment is generally referred to as the variance of X, and denoted Var(X) or sometimes σ²_X. The variance is a measure of the amount of dispersion in the distribution of X; that is, random variables with high variances are likely to produce realisations which are far from the mean, while low variance random variables have realisations which will tend to cluster closely about the mean. A simple calculation shows the relationship between the moments and the central moments; for example, we have

Var(X) = E[(X − μ_X)²] = E[X²] − μ²_X.

One drawback to the variance is that, by its definition, its units are not comparable to those of X. To avert this problem, we often use the square root of the variance, σ_X = √Var(X), which is called the standard deviation of the random variable X.
Quantiles and Median. Another way to characterize the location (i.e. centre) and spread of the distribution of a random variable is through its quantiles. The (1−α)-quantile of the distribution of X is any value ξ which satisfies:

P(X ≤ ξ) ≥ 1 − α    and    P(X ≥ ξ) ≥ α.

Note that the definition does not necessarily uniquely define the quantile; in other words, there may be several distinct (1−α)-quantiles of a distribution. However, for most continuous distributions that we shall meet the quantiles will be unique. In particular, the α = 1/2 quantile is called the median of the distribution and is another measure of the centre of the distribution, since there is a 50% chance that a realisation of X will fall below it and also a 50% chance that the realisation will be above the median value. The α = 3/4 and α = 1/4 quantiles are generally referred to as the first and third quartiles, respectively, and their difference, called the interquartile range (or IQR), is another measure of spread in the distribution.
Expectation via Tails. Calculating the mean of a random variable from the definition can often involve painful integration and algebra. Sometimes, there are simpler ways. For example, if X is a non-negative integer-valued random variable (i.e. its range contains only non-negative integers), then we can calculate the mean of X as

μ = Σ_{x=0}^{∞} P(X > x).

The validity of this can be easily seen by a term rearrangement argument:

Σ_{x=0}^{∞} P(X > x) = Σ_{x=0}^{∞} Σ_{y=x+1}^{∞} p(y) = Σ_{y=1}^{∞} Σ_{x=0}^{y−1} p(y) = Σ_{y=1}^{∞} y p(y) = μ.

More generally, if X is an arbitrary, but non-negative random variable with cumulative distribution function F, then

μ = ∫_0^∞ {1 − F(x)} dx .

Example 2.1. Let a > 0 and U be uniformly distributed on (0, a). Using at least two methods find E[U].
Solution: U is a continuous random variable with density f_U(u) = 1/a for 0 < u < a (otherwise, f_U(u) = 0 if u ∉ (0, a)).

Method I:

E[U] = (1/a) ∫_0^a u du = (1/a) [u²/2]_0^a = a/2 .

Method II: Note that U is a nonnegative random variable taking values only in (0, a). Also, F_U(u) = (1/a) ∫_0^u du' = u/a if 0 < u < a. Otherwise, we have either F_U(u) = 0 for u ≤ 0, or F_U(u) = 1 for u ≥ a. Consequently, the tail integral becomes

E[U] = ∫_0^∞ {1 − F_U(u)} du = ∫_0^a {1 − F_U(u)} du = ∫_0^a {1 − (u/a)} du = a − (1/a)[u²/2]_0^a = a/2 .


2.4  Moment Generating Functions

A more general method of calculating moments is through the use of the moment generating function (or mgf), which is defined as

m(t) = E[e^{tX}] = ∫_{−∞}^{∞} e^{tx} dF(x),

provided the expectation exists for all values of t in a neighborhood of the origin. To obtain the moments of X we note that (provided sufficient regularity conditions which justify the interchange of the operations of differentiation and integration are satisfied),

(d^m/dt^m) E[e^{tX}] |_{t=0} = E[X^m e^{tX}] |_{t=0} = E[X^m].
Example 2.2. Suppose that X has a Poisson(λ) distribution. The moment generating function of X is given by:

m(t) = Σ_{x=0}^{∞} e^{tx} p(x) = Σ_{x=0}^{∞} e^{tx} λ^x e^{−λ} / x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x / x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)},

where we have used the series expansion Σ_{n=0}^{∞} x^n/n! = e^x. Taking derivatives of m(t) shows that

m'(t) = e^{λ(e^t − 1)} (λe^t)    ⟹    m'(0) = E[X] = λ,
m''(t) = e^{λ(e^t − 1)} (λe^t)² + e^{λ(e^t − 1)} (λe^t)    ⟹    m''(0) = E[X²] = λ² + λ.

Finally, this shows that Var(X) = E[X²] − {E[X]}² = (λ² + λ) − λ² = λ.
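A computer algebra system can reproduce this differentiation. A minimal sympy sketch, starting from the mgf just derived:

    import sympy as sp

    t = sp.symbols("t")
    lam = sp.symbols("lam", positive=True)
    m = sp.exp(lam * (sp.exp(t) - 1))          # mgf of Poisson(lam), as derived above

    EX = sp.diff(m, t, 1).subs(t, 0)           # first moment
    EX2 = sp.diff(m, t, 2).subs(t, 0)          # second moment

    print(sp.simplify(EX))                     # lam
    print(sp.expand(EX2))                      # lam**2 + lam
    print(sp.simplify(EX2 - EX**2))            # variance: lam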
Example 2.3. Suppose that X has a Gamma(α, λ) distribution. The moment generating function of X is given by:

m(t) = ∫_0^∞ e^{tx} (λ/Γ(α)) (λx)^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^∞ e^{−(λ−t)x} x^{α−1} dx
     = (λ/(λ−t))^α ∫_0^∞ ((λ−t)/Γ(α)) e^{−(λ−t)x} {(λ−t)x}^{α−1} dx = (λ/(λ−t))^α ,    t < λ,

where we have used the fact that ∫_0^∞ ((λ−t)/Γ(α)) e^{−(λ−t)x} {(λ−t)x}^{α−1} dx = 1 since it is the integral of the density of a Gamma(α, λ−t) distribution over the full range of its sample space, provided that t < λ, since the parameters of a Gamma distribution must be positive. So, differentiating this function shows that:

m'(t) = αλ^α / (λ−t)^{α+1}    ⟹    m'(0) = E[X] = α/λ,
m''(t) = α(α+1)λ^α / (λ−t)^{α+2}    ⟹    m''(0) = E[X²] = (α² + α)/λ².

Thus, we can see that Var(X) = E[X²] − {E[X]}² = α/λ².

If X and Y are random variables having a finite mgf in an open interval containing zero and their corresponding mgfs are equal, then X and Y have the same distribution. Generally, such demonstrations rely on algebraic calculation followed by recognition of resultant functions as the moment generating function of some specific distribution. As a useful reference, then, we now give the moment generating functions of some of the distributions noted in the previous section:
Discrete Distributions

Poisson(λ):                m(t) = exp(λ(e^t − 1)),                 t ∈ R
Binomial(n, p):            m(t) = (1 − p + pe^t)^n,                t ∈ R
Geometric(p):              m(t) = pe^t / (1 − (1−p)e^t),           for (1−p)e^t < 1
Negative Binomial(r, p):   m(t) = (1−p)^r / (1 − pe^t)^r,          for pe^t < 1

Continuous Distributions

Normal(μ, σ²):             m(t) = exp(μt + σ²t²/2),                t ∈ (−∞, ∞)
Uniform(a, b):             m(t) = (e^{tb} − e^{ta}) / (t(b−a)),    t ∈ (−∞, ∞)\{0},  m(0) = 1
Weibull(α, λ):             m(t) = 1 + …  (no simple closed form in general)
Gamma(α, λ):               m(t) = (λ/(λ−t))^α,                     t ∈ (−∞, λ)

Generating Functions. More generally, the concept of a generating function of a sequence is quite useful in many areas of probability theory. The generating function of a sequence of numbers {a_0, a_1, a_2, ...} is defined as

A(s) = a_0 + a_1 s + a_2 s² + ... = Σ_{n=0}^{∞} a_n s^n ,

provided the series converges for values of s in a neighborhood of the origin.

Note that from this definition, the moment generating function is just the generating function of the sequence a_n = E[X^n]/n!. As with the mgf, the elements of the sequence can be recovered by successive differentiation of the generating function and subsequent evaluation at s = 0 (as well as a rescaling by n! for the appropriate value of n).
In particular, if X is a discrete random variable taking non-negative integer values, then setting a_n = P(X = n) yields the probability generating function,

P(s) = E[s^X] = E[e^{X log s}].

Note that m(t) = P(e^t), so that there is a clear link between the moment generating function and the probability generating function. In particular, moment-like quantities can be found via derivatives of P(s) evaluated at s = 1 = e^0. For example,

m''(t) |_{t=0} = [P''(e^t) e^{2t} + P'(e^t) e^t] |_{t=0} = P''(1) + P'(1).

Also, if we let q_n = P(X > n), then Q(s) = Σ_n q_n s^n is a tail probability generating function and

Q(s) = (1 − P(s)) / (1 − s).

This can be seen by noting that the coefficient of s^n in the function (1 − s)Q(s) is

q_n − q_{n−1} = P(X > n) − P(X > n−1) = P(X > n) − {P(X = n) + P(X > n)} = −P(X = n),

if n ≥ 1, and q_0 = P(X > 0) = 1 − P(X = 0) if n = 0, so that

(1 − s)Q(s) = 1 − P(X = 0) − Σ_{n=1}^{∞} P(X = n) s^n = 1 − P(s).

We saw earlier that E[X] = Σ_n q_n, so E[X] = Q(1) = lim_{s→1} {1 − P(s)}/(1 − s), and thus the graph of {1 − P(s)}/(1 − s) approaches a finite limit (rather than having a pole) at s = 1, as long as the expectation of X exists and is finite.
Additional Remarks. For the mathematically minded, we note that occasionally the mgf (which, for positive random variables is also sometimes called the Laplace transform of the density or pmf) will not exist (for example, the t- and F-distributions have non-existent moment generating functions, since the necessary integrals are infinite) and this is why it is often more convenient to work with the characteristic function (also known to some as the Fourier transform of the density or pmf), φ(t) = E[e^{itX}], which always
exists, but this requires some knowledge of complex analysis. One of the most useful
features of the characteristic function (and of the moment generating function, in the
cases where it exists) is that it uniquely specifies the distribution from which it arose (i.e.
no two distinct distributions have the same characteristic function), and many difficult
properties of distributions can be derived easily from the corresponding properties of
characteristic functions. For example, the Central Limit Theorem is easily proved using
moment generating functions, as are some important relationships regarding the various
distributions listed above.


3  Several Random Variables

3.1  Joint distributions

The joint distribution of two random variables X and Y describes how the outcomes of the two random variables are probabilistically related. Specifically, the joint distribution function is defined as

F_XY(x, y) = F(x, y) = P(X ≤ x and Y ≤ y).

Usually, the subscripts are omitted when no ambiguity is possible.
If X and Y are both discrete, then they will have a joint probability mass function defined by p(x, y) = P(X = x and Y = y). Otherwise, there may exist a joint density, defined as the function f_XY which satisfies:

F_XY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_XY(ξ, η) dη dξ.

In this case, we call X and Y (jointly) continuous.


The case where one of X and Y is discrete and one continuous is of interest, but is
slightly more complicated and we will deal with it when it comes up.
The function FX (x) = limy F (x, y) is called the marginal distribution function of
X, and similarly the marginal distribution function of Y is FY (y) = limx F (x, y). If X
and Y are discrete, then the marginal probability mass functions are simply
pX (x) =

p(x, y)

and

pY (y) =

yRange(Y )

p(x, y).

xRange(X)

If X and Y are continuous, then the marginal densities of X and Y are given by
Z
Z
fX (x) =
fXY (x, y)dy
and
fY (y) =
fXY (x, y)dx,

respectively. Note that the marginal density at a particular value is derived by simply
integrating the area under the joint density along the appropriate horizontal or vertical
line.
The expectation of a function h of the two random variables X and Y is calculated in a fashion similar to the expectations of functions of single random variables, namely,

E[h(X, Y)] = Σ_{x ∈ Range(X)} Σ_{y ∈ Range(Y)} h(x, y) p(x, y)

if X and Y are discrete, or

E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy

if X and Y are continuous.

Note that the above definitions show that regardless of the type of random variables, E[aX + bY] = aE[X] + bE[Y] for any constants a and b. Also, analogous definitions and results hold for any finite group of random variables. For example the joint distribution of X_1, ..., X_k is

F(x_1, ..., x_k) = P(X_1 ≤ x_1 and ... and X_k ≤ x_k).

3.2  Covariance, Correlation, Independence

Independence. If it happens that F(x, y) = F_X(x) F_Y(y) for all x and y, then the random variables X and Y are said to be independent. If both the random variables are continuous, then the above condition is equivalent to f(x, y) = f_X(x) f_Y(y), while if both are discrete it is the same as p(x, y) = p_X(x) p_Y(y). Note the similarity of these definitions to that for the independence of events.
Given two jointly distributed random variables X and Y, we can calculate their means, μ_X and μ_Y, and their standard deviations, σ_X and σ_Y, using their marginal distributions. Provided these means and standard deviations exist, we can use the joint distribution to calculate the covariance between X and Y, which is defined as

Cov(X, Y) = σ_XY = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y.

Two random variables are said to be uncorrelated if their covariance is zero. Note that if X and Y are independent then they are certainly uncorrelated, since the factorization of the pmf or density implies that E[XY] = E[X]E[Y] = μ_X μ_Y. However, two uncorrelated random variables need not be independent. Note also that it is an easy calculation to show that if X, Y, V and W are jointly distributed random variables and a, b, c and d are constants, then

Cov(aX + bY, cV + dW) = ac σ_XV + ad σ_XW + bc σ_YV + bd σ_YW ;

in other words, the covariance operator is bilinear.
Finally, if we scale σ_XY by the product of the two standard deviations, we get the correlation coefficient, ρ = σ_XY/(σ_X σ_Y), which satisfies −1 ≤ ρ ≤ 1.

3.3  Sums of Random Variables and Convolutions

We saw that the expectation of Z = X + Y was simply the sum of the individual expectations of X and Y for any two random variables. Unfortunately, this is about the extent of what we can say in general. If, however, X and Y are independent, the distribution of Z can be determined by means of a convolution:

F_Z(z) = ∫_{−∞}^{∞} F_X(z − η) dF_Y(η) = ∫_{−∞}^{∞} F_Y(z − ξ) dF_X(ξ).

In the case where both X and Y are discrete, we can write the convolution formula using pmfs:

p_Z(z) = Σ_{x ∈ Range(X)} p_X(x) p_Y(z − x) = Σ_{y ∈ Range(Y)} p_X(z − y) p_Y(y).

If X and Y are both continuous, we can rewrite the convolution formula using densities:

f_Z(z) = ∫_{−∞}^{∞} f_X(z − η) f_Y(η) dη = ∫_{−∞}^{∞} f_Y(z − ξ) f_X(ξ) dξ.

Note that, in the same way that marginal densities are found by integrating along horizontal or vertical lines, the density of Z at the value z is found by integrating along the line x + y = z, and of course using the independence to state that f_XY(ξ, z − ξ) = f_X(ξ) f_Y(z − ξ).
Since convolutions are a bit cumbersome, we now note an advantage of mgfs. If X and Y are independent, then

m_Z(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = m_X(t) m_Y(t).

So, the mgf of a sum of independent random variables is the product of the mgfs of the summands. This fact makes many calculations regarding sums of independent random variables much easier to demonstrate:
Suppose that X ∼ Gamma(α_X, λ) and Y ∼ Gamma(α_Y, λ) are two independent random variables, and we wish to determine the distribution of Z = X + Y. We could use the convolution formula, but this would require some extremely difficult (though not impossible, of course) integration. However, recalling the moment generating function of the Gamma distribution we see that, for any t < λ:

m_{X+Y}(t) = m_X(t) m_Y(t) = (λ/(λ−t))^{α_X} (λ/(λ−t))^{α_Y} = (λ/(λ−t))^{α_X + α_Y},

which easily shows that X + Y ∼ Gamma(α_X + α_Y, λ).
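This closure property is easy to check by simulation. A Python sketch (the parameter values are arbitrary illustrative choices; note that numpy's gamma sampler is parameterised by shape and scale = 1/λ):

    import numpy as np

    rng = np.random.default_rng(0)
    a1, a2, lam, n = 2.0, 3.0, 1.5, 200_000    # illustrative parameters

    x = rng.gamma(shape=a1, scale=1 / lam, size=n)
    y = rng.gamma(shape=a2, scale=1 / lam, size=n)
    z = x + y                                  # should behave like Gamma(a1 + a2, lam)

    alpha = a1 + a2
    print(z.mean(), alpha / lam)               # sample mean vs alpha/lam
    print(z.var(), alpha / lam**2)             # sample variance vs alpha/lam^2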

3.4  Change of Variables

We saw previously that we could find the expectation of g(X) using the distribution of X. Suppose, however, that we want to know more about the new random variable Y = g(X). If g is a strictly monotone function, we can find the distribution of Y by noting that

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)),

if g is increasing, and

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y)) + P(X = g^{−1}(y)),

if g is decreasing (if g is not strictly monotone, we need to be a bit more clever, but we won't deal with that case here). Now, if X is continuous and g is a smooth function (i.e. has a continuous derivative) then the differentiation chain rule yields

f_Y(y) = f_X(g^{−1}(y)) / |g'(g^{−1}(y))| = f_X(x) / |g'(x)|,

where y = g(x) (note that when X is continuous, the CDF of Y in the case when g is decreasing simplifies since P(X = g^{−1}(y)) = 0).
A similar formula holds for joint distributions except that the derivative factor becomes the reciprocal of the modulus of the determinant of the Jacobian matrix for the transformation function g. In other words, if X_1 and X_2 have joint density f_{X1X2} and g(x_1, x_2) = (g_1(x_1, x_2), g_2(x_1, x_2)) = (y_1, y_2) is an invertible transformation, then the joint density of Y_1 = g_1(X_1, X_2) and Y_2 = g_2(X_1, X_2) is

f_{Y1Y2}(y_1, y_2) = |J(x_1, x_2)|^{−1} f_{X1X2}(x_1, x_2),

where y_1 = g_1(x_1, x_2), y_2 = g_2(x_1, x_2) and |J(x_1, x_2)| is the determinant of the Jacobian matrix J(x_1, x_2), which has (i, j)th element J_ij(x_1, x_2) = ∂g_i(x_1, x_2)/∂x_j.
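The univariate formula can be checked by simulation. The Python sketch below uses the increasing transformation Y = e^X with X ∼ Normal(0, 1), for which the formula gives f_Y(y) = f_X(log y)/y (this particular transformation is chosen here only for concreteness and is not one of the examples in the notes):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(500_000)
    y = np.exp(x)                               # strictly increasing g(x) = e^x

    # Change-of-variables formula: f_Y(y) = f_X(log y) / y
    grid = np.linspace(0.1, 5, 50)
    f_theory = np.exp(-np.log(grid) ** 2 / 2) / (np.sqrt(2 * np.pi) * grid)

    hist, edges = np.histogram(y, bins=200, range=(0, 5), density=True)
    centres = (edges[:-1] + edges[1:]) / 2
    # maximum discrepancy between the histogram and the formula on the grid
    print(np.max(np.abs(np.interp(grid, centres, hist) - f_theory)))   # small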


4  Conditional Probability

4.1  Conditional Probability of Events

So far, we have discussed the probabilities of events in a rather static situation. However, typically, we wish to know how the outcomes of certain events will subsequently affect the chances of later events. To describe such situations, we need to use conditional probability for events.
Suppose that we wish to know the chance that an event A will occur. Then we have seen that we want to calculate P(A). However, if we are in possession of the knowledge that the event B has already occurred, then we would likely change our belief about the chance of A occurring. For example, A might be the event "it will rain today" and B the event "the sky is overcast". We use the notation P(A|B) to signify the probability of A given that B has occurred, and we define it as

P(A|B) = P(A ∩ B) / P(B),

provided P(B) ≠ 0.
If we think of probabilities as areas in a Venn diagram, then a conditional probability amounts to restricting the sample space down from Ω to B and then finding the relative area of that part of A which is also in B to the total area of the restricted sample space, namely B itself.
Multiplication Rule. In many of our subsequent applications, conditional probabilities will be dictated as primary data by the circumstances of the process under study. In this case, the above definition will find its most useful function in the form

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).

Independence. Also, we can rephrase independence of events in terms of conditional probabilities; namely, two events A and B are independent if and only if

P(A|B) = P(A)    and    P(B|A) = P(B).

(Note that only one of the above two conditions need be verified, since if one is true the other follows from the definition of conditional probability.) In other words, two events are independent if the chance of one occurring is unaffected by whether or not the other has occurred.
Total Probability Law. Recalling the law of total probability, we can use this new identity to show that if the sets B_1, ..., B_k form a partition then

P(A) = Σ_{i=1}^{k} P(A|B_i) P(B_i).

Bayes' Rule. Finally, a very useful formula exists which relates the conditional probability of A given B to the conditional probability of B given A and goes by the name of Bayes' Rule. Bayes' rule states that

P(B|A) = P(A ∩ B) / P(A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)],

which follows from the definition of conditional probability and the law of total probability, since B and B^c form a partition. In fact, we can generalize Bayes' rule by letting B_1, ..., B_k be a more general partition, so that

P(B_i|A) = P(A|B_i)P(B_i) / Σ_{j=1}^{k} P(A|B_j)P(B_j).
Example 4.1. Suppose there are three urns labelled I, II and III, the first containing
4 red and 8 blue balls, the second containing 3 red and 9 blue, and the third 6 red and 6
blue. (a) If an urn is picked at random and subsequently a ball is picked at random from
the chosen urn, what is the chance that the chosen ball will be red? (b) If a red ball is
drawn, what is the chance that it came from the first urn?
Solution: Let R be the event that the chosen ball is red. Then, from the description of the situation it is clear that:

P(I) = P(II) = P(III) = 1/3,    P(R|I) = 4/12 = 1/3,    P(R|II) = 3/12 = 1/4,    P(R|III) = 6/12 = 1/2.

(a) Since the events I, II and III clearly form a partition (i.e. one and only one of them must occur), we can use the law of total probability to find

P(R) = P(R|I)P(I) + P(R|II)P(II) + P(R|III)P(III) = (1/3)(1/3 + 1/4 + 1/2) = 13/36.

(b) Using Bayes' rule,

P(I|R) = P(R|I)P(I) / [P(R|I)P(I) + P(R|II)P(II) + P(R|III)P(III)] = (1/3)(1/3) / (13/36) = 4/13.



4.2  Discrete Random Variables

Conditional pmf. The conditional probability mass function derives from the definition of conditional probability for events in a straightforward manner:

p_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x and Y = y) / P(Y = y) = p_XY(x, y) / p_Y(y),

as long as p_Y(y) > 0. Note that for each fixed y, p_{X|Y}(·|y) is a pmf, i.e. Σ_x p_{X|Y}(x|y) = 1, but the same is not true for each fixed x. Also, the law of total probability becomes

p_X(x) = Σ_{y ∈ Range(Y)} p_{X|Y}(x|y) p_Y(y).

Example 4.2. Suppose that N has a geometric distribution with parameter 1 − θ, and that conditional on N, X has a negative binomial distribution with parameters p and N. In other words,

p_N(n) = (1 − θ) θ^{n−1}    for n = 1, 2, ...

and

p_{X|N}(x|n) = C(x+n−1, n−1) p^x (1−p)^n    for x = 0, 1, ... .

Find the marginal distribution of X.

Solution: Using the law of total probability: for x = 0, 1, 2, 3, ...

p_X(x) = Σ_{n=1}^{∞} p_{X|N}(x|n) p_N(n)
       = Σ_{n=1}^{∞} C(x+n−1, n−1) p^x (1−p)^n (1−θ) θ^{n−1}
       = (1−θ)(1−p) p^x Σ_{m=0}^{∞} C(x+m, x) [θ(1−p)]^m        (substituting m = n − 1)
       = (1−θ)(1−p) p^x / (1 − θ(1−p))^{x+1}
       = [(1−θ)(1−p) / (1 − θ(1−p))] · [p / (1 − θ(1−p))]^x ,

where the sum was evaluated by recognising that Σ_{m=0}^{∞} C(x+m, x) s^m (1−s)^{x+1} = 1 (a Negative Binomial pmf summed over its range), here with s = θ(1−p).
Consequently, X + 1 ∈ {1, 2, 3, ...} is geometric with parameter (1−θ)(1−p)/(1 − θ(1−p)).
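This identity can be checked by simulation, using numpy's geometric and negative binomial samplers, whose success-probability conventions have to be matched to the parameterisation used above; the parameter values below are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)
    theta, p, n = 0.6, 0.3, 300_000

    # N ~ Geometric(1 - theta) on {1, 2, ...}: numpy's geometric takes the success probability
    N = rng.geometric(1 - theta, size=n)
    # numpy's negative_binomial(n, q) has pmf C(x+n-1, n-1) q^n (1-q)^x,
    # which matches p_{X|N}(x|n) above with q = 1 - p
    X = rng.negative_binomial(N, 1 - p)

    q = (1 - theta) * (1 - p) / (1 - theta * (1 - p))   # claimed geometric parameter of X + 1
    print((X + 1).mean(), 1 / q)                # geometric mean on {1,2,...} is 1/q
    print(np.mean(X == 0), q)                   # P(X = 0) = P(X + 1 = 1) = q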

Conditional Expectation. The conditional expectation of g(X) given Y = y, denoted as E[g(X)|Y = y], is defined as

E[g(X)|Y = y] = Σ_{x ∈ Range(X)} g(x) p_{X|Y}(x|y).

The law of total probability then shows that

E[g(X)] = Σ_x g(x) p_X(x) = Σ_x g(x) Σ_y p_{X|Y}(x|y) p_Y(y) = Σ_y E[g(X)|Y = y] p_Y(y).

Note that the conditional expectation can be regarded as a function of y; that is, it is a numerical function defined on the sample space of Y and is thus a random variable, denoted by E[g(X)|Y], and we therefore have

E[g(X)] = E[ E[g(X)|Y] ].

A similar expression can be obtained for variances:

Var(X) = E[X²] − (E[X])² = E[E[X²|Y]] − (E[E[X|Y]])²
       = E[E[X²|Y]] − E[(E[X|Y])²] + E[(E[X|Y])²] − (E[E[X|Y]])²
       = E[ E[X²|Y] − (E[X|Y])² ] + { E[(E[X|Y])²] − (E[E[X|Y]])² }
       = E[Var(X|Y)] + Var(E[X|Y]).

Note that we have defined Var(X|Y) := E[X²|Y] − (E[X|Y])² = σ²_{X|Y}, which, like the conditional expectation, is now a random variable.

Example 4.3. Let Y have a distribution with mean μ and variance σ². Conditional on Y = y, suppose that X has a distribution with mean y and variance y². Find the variance of X.
Solution: From the information given, E[X|Y] = Y and Var(X|Y) = Y². Thus,

Var(X) = E[Var(X|Y)] + Var(E[X|Y]) = E[Y²] + Var(Y) = (σ² + μ²) + σ² = 2σ² + μ².

Since the conditional expectation is the expectation with respect to the conditional probability mass function p_{X|Y}(x|y), conditional expectations behave in most ways like ordinary expectations. For example,

1. E[a g(X_1) + b h(X_2) | Y] = a E[g(X_1)|Y] + b E[h(X_2)|Y]
2. If g ≥ 0 then E[g(X)|Y] ≥ 0
3. E[g(X, Y)|Y = y] = E[g(X, y)|Y = y]
4. If X and Y are independent, E[g(X)|Y] = E[g(X)]
5. E[g(X)h(Y)|Y] = h(Y) E[g(X)|Y]
6. E[g(X)h(Y)] = E[h(Y) E[g(X)|Y]]

In particular, it follows from properties 1 and 5 that E[a|Y] = a for any constant a, and E[h(Y)|Y] = h(Y) for any function h.
Remark: the formulae 1. to 6. are applicable in more general situations, even if X or Y is not discrete (cf. random sums for some applications).
4.3  Mixed Cases

If X is a continuous random variable and N is a discrete random variable, then the conditional distribution function F_{X|N}(x|n) of X given that N = n can be defined in the obvious way:

F_{X|N}(x|n) = P(X ≤ x and N = n) / P(N = n).

From this definition, we can easily define the conditional probability density function as

f_{X|N}(x|n) = (d/dx) F_{X|N}(x|n).

As in the discrete case, the conditional density behaves much like an ordinary density, so that, for example,

P(a < X ≤ b, N = n) = P(a < X ≤ b | N = n) P(N = n) = p_N(n) ∫_a^b f_{X|N}(x|n) dx.

Note that the key feature to this and the discrete case was that the conditioning random
variable N was discrete, so that we would be able to guarantee that there would be some
possible values of n such that P(N = n) > 0. It is possible to condition on continuous
random variables and the properties are much the same, but we just need to take a bit of
care since technically the probability of any individual outcome of a continuous random
variable is zero.


4.4  Random Sums

Suppose we have an infinite sequence of independent and identically distributed random variables ξ_1, ξ_2, ..., and a discrete non-negative integer valued random variable N which is independent of the ξ's. We can then define the random sum

X = ξ_1 + ... + ξ_N = Σ_{k=1}^{N} ξ_k.

(Note that for convenience, we will define the sum of zero terms to be zero.)
Moments. If we let

E[ξ_k] = μ,    Var(ξ_k) = σ²,    E[N] = ν,    Var(N) = τ²,

then we can derive the mean of X as

E[X] = E[E[X|N]] = Σ_{n=0}^{∞} E[X|N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + ... + ξ_N | N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + ... + ξ_n | N = n] p_N(n)
     = Σ_{n=1}^{∞} E[ξ_1 + ... + ξ_n] p_N(n) = Σ_{n=1}^{∞} nμ p_N(n)
     = μν,

and the variance as

Var(X) = E[(X − μν)²] = E[(X − μN + μN − μν)²]
       = E[(X − μN)²] + E[μ²(N − ν)²] + 2E[(X − μN)μ(N − ν)]
       = E[ E[(X − μN)² | N] ] + μ² E[(N − ν)²] + 2E[ E[(X − μN)μ(N − ν) | N] ]
       = νσ² + μ²τ²,

since

E[X − μN | N = n] = E[ Σ_{i=1}^{n} ξ_i − nμ ] = 0 ;
E[(X − μN)² | N = n] = E[ ( Σ_{i=1}^{n} ξ_i − nμ )² ] = nσ².


Example 4.4. Total Grandchildren - Suppose that individuals in a certain species have a random number of offspring independently of one another with a known distribution having mean μ and variance σ². Let X be the number of grandchildren of a single parent, so that X = ξ_1 + ... + ξ_N, where N is the random number of original offspring and ξ_k is the random number of offspring of the kth child of the original parent. Then E[N] = E[ξ_k] = μ and Var(N) = Var(ξ_k) = σ², so that

E[X] = μ²    and    Var(X) = μσ²(1 + μ).

Distribution of Random Sums. In addition to moments, we need to know the distribution of the random sum X. If the ξ's are continuous and have density function f(z), then the distribution of ξ_1 + ... + ξ_n is the n-fold convolution of f, denoted by f^{(n)}(z) and recursively defined by

f^{(1)}(z) = f(z),
f^{(n)}(z) = ∫_{−∞}^{∞} f^{(n−1)}(z − u) f(u) du    for n > 1.

Since N is independent of the ξ's, f^{(n)} is also the density of X given N = n, for n ≥ 1.
Thus, if we assume that P(N = 0) = 0, the law of total probability says

f_X(x) = Σ_{n=1}^{∞} f^{(n)}(x) p_N(n).

NOTE: If we don't assume that P(N = 0) = 0, then we have a mixed distribution (X has an atom of mass p_N(0) at zero), so that

P(a < X ≤ b) = ∫_a^b Σ_{n=1}^{∞} f^{(n)}(x) p_N(n) dx

if a < b < 0 or 0 ≤ a < b, and

P(a < X ≤ b) = ∫_a^b Σ_{n=1}^{∞} f^{(n)}(x) p_N(n) dx + p_N(0),

if a < 0 ≤ b.
Example 4.5. Geometric sum of Exponential random variables - Suppose that the ξ's have an exponential distribution with parameter λ, i.e. the density looks like f(z) = λe^{−λz} for z ≥ 0, and suppose also that N has a geometric distribution with parameter p, so that p_N(n) = p(1−p)^{n−1} for n = 1, 2, .... In this case,

f^{(2)}(z) = ∫_0^z f(z − u) f(u) du = ∫_0^z λ² e^{−λz} du = λ² e^{−λz} z.

In fact, it is straightforward to use mathematical induction to show that

f^{(n)}(z) = (λ^n / (n−1)!) z^{n−1} e^{−λz},    for z ≥ 0,

which is a Gamma(n, λ) density (a fact which is much more easily demonstrated using moment generating functions!). Thus, the distribution of X is

f_X(x) = Σ_{n=1}^{∞} f^{(n)}(x) p_N(n) = Σ_{n=1}^{∞} (λ^n / (n−1)!) x^{n−1} e^{−λx} p(1−p)^{n−1}
       = λp e^{−λx} Σ_{n=1}^{∞} {λ(1−p)x}^{n−1} / (n−1)!
       = λp e^{−λx} e^{λ(1−p)x} = λp e^{−λpx}.

So, X has an exponential distribution with parameter λp, or a Gamma(1, λp). Note that the distribution of the random sum is not the same as the distribution of the non-random sum.
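The conclusion is easy to check by simulation. A Python sketch (parameter values are arbitrary illustrative choices) uses the fact derived above that, given N = n, the sum is Gamma(n, λ):

    import numpy as np

    rng = np.random.default_rng(3)
    lam, p, n = 2.0, 0.25, 300_000

    N = rng.geometric(p, size=n)                # number of summands, on {1, 2, ...}
    # X = xi_1 + ... + xi_N; given N = k the sum is Gamma(k, lam)
    X = rng.gamma(shape=N, scale=1 / lam)

    # Example 4.5: X should again be exponential, now with parameter lam * p
    print(X.mean(), 1 / (lam * p))              # both about 2.0
    print(np.mean(X > 1.0), np.exp(-lam * p))   # tail check: about 0.607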


4.5  Conditioning on Continuous Random Variables

Conditional Density. Note that in the previous sections we have been able to use our definition of conditional probability for events since the conditioning events {Y = y} have non-zero probability for discrete random variables. If we want to find the conditional distribution of X given Y = y, and Y is continuous, we cannot use, as we might first try,

F_{X|Y}(x|y) = P(X ≤ x | Y = y) = P(X ≤ x and Y = y) / P(Y = y),

since both probabilities in the final fraction are zero. Instead, we shall define the conditional density function as

f_{X|Y}(x|y) = f_XY(x, y) / f_Y(y),

for values of y such that f_Y(y) > 0. The conditional distribution function is then given by

F_{X|Y}(x|y) = ∫_{−∞}^{x} f_{X|Y}(ξ|y) dξ.

Conditional Expectation. Finally, we can define

E[g(X)|Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x|y) dx,

as expected, and this version of the conditional expectation still satisfies all of the nice properties that we derived in the previous sections for discrete conditioning variables. For example,

P(a < X ≤ b | Y = y) = F_{X|Y}(b|y) − F_{X|Y}(a|y) = ∫_a^b f_{X|Y}(x|y) dx
                     = ∫_{−∞}^{∞} 1_{(a,b]}(x) f_{X|Y}(x|y) dx
                     = E[1_{(a,b]}(X) | Y = y],

where the function 1_I(x) is the indicator function of the set I, i.e. 1_I(x) = 1 if x ∈ I and 1_I(x) = 0 otherwise.
Note that, as is the case with ordinary expectations and indicators, the conditional probability of the random variable having an outcome in I is equal to the conditional expectation of the indicator function of that event. (Recall that

P(X ∈ I) = ∫_I f_X(x) dx = ∫_{−∞}^{∞} 1_I(x) f_X(x) dx = E[1_I(X)]

for ordinary expectations and probabilities.)

We can use the above fact to show a new form of the law of total probability, which is often a very useful method of finding probabilities; namely,

P(a < X ≤ b) = ∫_{−∞}^{∞} P(a < X ≤ b | Y = y) f_Y(y) dy.

To see why this is true, note that

∫_{−∞}^{∞} P(a < X ≤ b | Y = y) f_Y(y) dy = ∫_{−∞}^{∞} ∫_a^b f_{X|Y}(x|y) dx f_Y(y) dy = ∫_a^b ∫_{−∞}^{∞} f_XY(x, y) dy dx
    = P(a < X ≤ b and −∞ < Y < ∞)
    = P(a < X ≤ b).

In fact, we can generalize this notion even further to show that

P{a < g(X, Y) ≤ b} = ∫_{−∞}^{∞} P{a < g(X, y) ≤ b | Y = y} f_Y(y) dy.

Example 4.6. Suppose X and Y are continuous random variables having joint density function

f_XY(x, y) = y e^{−xy−y}    for x, y > 0.

(a) Find the conditional distribution of X given Y = y.
(b) Find the distribution function of Z = XY.
Solution: (a) First, we must find the marginal density of Y, which is

f_Y(y) = ∫_{−∞}^{∞} f_XY(x, y) dx = ∫_0^∞ y e^{−xy−y} dx = e^{−y} ∫_0^∞ y e^{−xy} dx = e^{−y},    y > 0.

Therefore,

f_{X|Y}(x|y) = f_XY(x, y) / f_Y(y) = y e^{−xy},    x > 0 (for each y > 0).

In other words, conditional on Y = y, X has an exponential distribution with parameter y, and thus F_{X|Y}(x|y) = 1 − e^{−xy}.
(b) To find the distribution of Z = XY, we write

F_Z(z) = P(Z ≤ z) = P(XY ≤ z) = ∫_{−∞}^{∞} P(XY ≤ z | Y = y) f_Y(y) dy
       = ∫_0^∞ P(X ≤ z/y | Y = y) e^{−y} dy = ∫_0^∞ (1 − e^{−z}) e^{−y} dy
       = 1 − e^{−z},

so that Z has an exponential distribution with parameter 1.
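Because the joint density factors as f_Y(y) f_{X|Y}(x|y), the pair (X, Y) is easy to simulate, and the claim that Z = XY is Exponential(1) can be checked numerically:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 300_000

    # f_XY(x, y) = y e^{-xy - y} factors as f_Y(y) = e^{-y} times f_{X|Y}(x|y) = y e^{-xy},
    # so simulate Y ~ Exponential(1) and then X | Y = y ~ Exponential(y)
    Y = rng.exponential(1.0, size=n)
    X = rng.exponential(1.0 / Y)                # Exponential(y) has mean 1/y

    Z = X * Y
    print(Z.mean(), 1.0)                        # Exp(1) has mean 1
    print(np.mean(Z > 2.0), np.exp(-2.0))       # tail check: about 0.135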

4.6  Joint Conditional Distributions

If X, Y and Z are jointly distributed random variables and Z is discrete, we can define the joint conditional distribution of X and Y given Z in the obvious way,

F_{XY|Z}(x, y|z) = P(X ≤ x and Y ≤ y | Z = z) = P(X ≤ x and Y ≤ y and Z = z) / P(Z = z).

If X, Y and Z are all continuous, then we define the joint conditional density of X and Y given Z as

f_{XY|Z}(x, y|z) = f_XYZ(x, y, z) / f_Z(z),

where f_XYZ(x, y, z) is the joint density function of X, Y and Z and f_Z(z) is the marginal density function of Z.

The random variables X and Y are said to be conditionally independent given Z if F_{XY|Z}(x, y|z) = F_{X|Z}(x|z) F_{Y|Z}(y|z), where F_{X|Z}(x|z) = lim_{y→∞} F_{XY|Z}(x, y|z) and F_{Y|Z}(y|z) = lim_{x→∞} F_{XY|Z}(x, y|z) are the conditional distributions of X given Z and Y given Z, respectively. As with unconditional independence, an equivalent characterization when the random variables involved are continuous is that the densities factor as f_{XY|Z}(x, y|z) = f_{X|Z}(x|z) f_{Y|Z}(y|z). (NOTE: In an obvious extension of the formula for unconditional densities,

f_{X|Z}(x|z) = ∫_{−∞}^{∞} f_{XY|Z}(x, y|z) dy,

with a similar definition for f_{Y|Z}(y|z).)

As with the case for unconditional joint distributions, a useful concept is the conditional covariance, defined as

Cov(X, Y|Z) = E[XY|Z] − E[X|Z]E[Y|Z],

and the conditional correlation coefficient, which is simply the conditional covariance scaled by the product of the conditional standard deviations, σ_{X|Z} = √Var(X|Z) and σ_{Y|Z} = √Var(Y|Z). Note that if two random variables are conditionally independent then they are conditionally uncorrelated (i.e. the conditional covariance is zero), but the converse is not true. Also, just because two random variables are conditionally independent or uncorrelated does not necessarily imply that they are unconditionally independent or uncorrelated.


5  Elements of Matrix Algebra

To prepare our analysis of Markov chains it is convenient to recall some elements of matrix algebra.
A matrix A is a rectangular array with n rows and m columns with real-valued entries A(i, j) (A(i, j) refers to the element in the ith row and the jth column). We write this briefly as A = (A(i, j)) ∈ R^{n×m} (verbally, A is an n × m matrix).
Example 5.1. Note

A = [ 1 2 3 ]
    [ 4 5 6 ]  ∈ R^{2×3},    A(1, 2) = 2.

We have different operations when dealing with matrices:


Scalar Multiplication. Let a ∈ R and A = (A(i, j)) ∈ R^{n×m}. The scalar multiple aA is defined by taking the product of the real number a with each of the components of A, giving rise to a new matrix C = (C(i, j)) := aA ∈ R^{n×m} with C(i, j) := aA(i, j).
Example 5.2. Let

A = [ 1 2 3 ]
    [ 4 5 6 ]  ∈ R^{2×3}.

Then (with a = 2)

C = 2A = [ 2  4  6 ]
         [ 8 10 12 ]  ∈ R^{2×3}.

Transposition. Let A = (A(i, j)) ∈ R^{n×m}. Then the transpose of A is denoted by A′ = (A′(i, j)). A′ is an R^{m×n}-matrix with entries A′(i, j) := A(j, i). (We interchange the roles of columns and rows.)
Example 5.3. Let

A = [ 1 2 3 ]
    [ 4 5 6 ]  ∈ R^{2×3}.

Then

A′ = [ 1 4 ]
     [ 2 5 ]
     [ 3 6 ]  ∈ R^{3×2}.

Sum of Matrices. Let A = (A(i, j)), B = (B(i, j)) ∈ R^{n×m}. By componentwise adding the entries we get a new matrix C = (C(i, j)) := A + B ∈ R^{n×m} where C(i, j) = A(i, j) + B(i, j).
Example 5.4. Let

A = [ 1 2 3 ]               B = [ 1 1 2 ]
    [ 4 5 6 ]  ∈ R^{2×3},       [ 1 3 6 ]  ∈ R^{2×3}.

Then

C = A + B = [ 2 3  5 ]
            [ 5 8 12 ]  ∈ R^{2×3}.


Product of Matrices. Let A = (A(i, j)) ∈ R^{n×m} and B = (B(i, j)) ∈ R^{m×r}. (The number m of A's columns must match the number m of B's rows.) Then the matrix product AB = A · B =: C is the matrix C = (C(i, j)) ∈ R^{n×r} with entries

C(i, j) := Σ_{k=1}^{m} A(i, k) B(k, j),    1 ≤ i ≤ n, 1 ≤ j ≤ r.

By inspection: the entry C(i, j) is the Euclidean inner product of the ith row of A with the jth column of B.
Example 5.5. Let

A = [ 1 2 3 ]               B = [ 1 4 2 ]
    [ 4 5 6 ]  ∈ R^{2×3},       [ 2 5 1 ]
                                [ 3 6 1 ]  ∈ R^{3×3}.

To compute AB it is convenient to adopt the following scheme (B written above, A to the left, and each entry of AB formed in the body of the table):

                      1  4  2
                      2  5  1
                      3  6  1
    1 2 3 | 1·1+2·2+3·3   1·4+2·5+3·6   7
    4 5 6 | 4·1+5·2+6·3   ...           ...

Fill in the dots. The result is

C = AB = [ 14  32   7  ]
         [ 32  ...  ... ] .
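Hand computations like this one can be checked with numpy, which provides the matrix operations above directly (a brief sketch):

    import numpy as np

    A = np.array([[1, 2, 3],
                  [4, 5, 6]])
    B = np.array([[1, 4, 2],
                  [2, 5, 1],
                  [3, 6, 1]])

    print(A @ B)        # matrix product, to compare with the scheme above
    print(A.T)          # the transpose A'
    print(2 * A)        # a scalar multiple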


Product with Vectors and Matrices. This is a special case of the general matrix multiplication: let x ∈ R^n and A = (A(i, j)) ∈ R^{n×m}. If we regard x ∈ R^{1×n} as a matrix with only one row, then xA ∈ R^{1×m} is defined by the corresponding matrix multiplication. The result is a row vector. If we instead insist on x ∈ R^{n×1} being a column vector, then x′A and A′x are still well defined. If n ≠ m then Ax is not defined, even if x is a column vector. The dimensions must always match.
Power of Matrices. Let I ∈ R^{n×n} be the identity matrix: I = (I(i, j)) ∈ R^{n×n} with entries I(i, j) = 1 if i = j, and otherwise, if i ≠ j, then I(i, j) = 0. The identity matrix is a diagonal matrix (only the elements on the diagonal can be nonzero) with unit entries on the diagonal. For any A ∈ R^{n×m} we have IA = A (and for all B ∈ R^{m×n} we have BI = B).
For matrices where the number of columns equals the number of rows, A ∈ R^{n×n}, we can define the pth power A^p, p ∈ N_0 = {0, 1, 2, 3, 4, ...}, by iteration:

A^0 := I,    A^1 := A,    A^p := A^{p−1} A = A A^{p−1}.

Example 5.6. Let

A = [ 1 2 ]
    [ 3 4 ] .

Find A^0, A^1, A^2 and A^3.

Answer:

A^0 = I = [ 1 0 ]     A^1 = A = [ 1 2 ]     A^2 = [  7 10 ]     A^3 = [ 37  54 ]
          [ 0 1 ],              [ 3 4 ],          [ 15 22 ],          [ 81 118 ] .

Example 5.7. (a) Show (A′)′ = A.
(b) Show A + B = B + A.
(c) Show (A + B)′ = A′ + B′.
(d) Show (AB)′ = B′A′.
(e) Give an example of square matrices A, B ∈ R^{2×2} showing that AB ≠ BA (matrix multiplication is not commutative).
(Also see Tutorials.)

Part II: Markov Chains


6  Stochastic Process and Markov Chains

6.1  Introduction and Definitions

General Stochastic Process. A stochastic process is a family of random variables, {X_t}_{t ∈ T}, indexed by a parameter t which belongs to an ordered index set T. For notational convenience, we will sometimes use X(t) instead of X_t. We use t because the indexing is most commonly associated with time.
For example, the price of a particular stock at the close of each day's trading would be a stochastic process indexed by time. Of course, the index does not have to be time; it may be a spatial indicator, for example, the number of defects in specified regions of a computer chip. In fact, the indexing may be almost anything. Indeed, if we consider the index to be individuals, we can consider a random sample X_1, ..., X_n to be a stochastic process. Of course, this would be a rather special stochastic process in that the random variables making up the stochastic process would be independent of each other. In general, we will want to deal with stochastic processes where the random variables may be dependent on one another.
As with individual random variables, we shall be interested in the set S of values which the random variables may take on, but we shall generally refer to this set as the state space in this context. Again, as with single random variables, the state space may be either discrete or continuous. In addition, however, we must now also consider whether the index set T is discrete or continuous. In this section, we shall be considering the case where the index set is the discrete set of natural numbers T = N_0 = {0, 1, 2, ...}; such processes are usually referred to as discrete time stochastic processes. We will start by examining processes with discrete state spaces and later move on to processes with continuous time sets T.
Markov Chain. The simplest sort of stochastic process is of course one for which the random variables X_t are independent. However, the next simplest type of process, and the starting point for our journey through the theory of stochastic processes, is called a Markov chain. A Markov chain is a stochastic process having:
1) a countable state space S,
2) a discrete index set T = {0, 1, 2, ...},
3) the Markov property, and
4) stationary transition probabilities.

The final two properties listed are discussed next:

6.2

Markov Property

In general, we have defined a stochastic process so that the immediate future may depend
on both the present and the entire past. This framework is a bit too general for an initial
investigation of the concepts involved in stochastic processes. A discrete time process with
discrete state space will be said to have the Markov property if
P(Xt+1 = xt+1 |X0 = x0 , . . . , Xt = xt ) = P(Xt+1 = xt+1 |Xt = xt ).
In other words, the future depends only on the present and not on the past.
At first glance, this may seem a silly property, in the sense that it would never really
happen. However, it turns out that Markov chains can give surprisingly good approximations to real situations.
Example. As an example, suppose our stochastic process of interest is the total amount
of something (money, perhaps) that we have accumulated at the end of each day. Often, it
is a very reasonable assumption that tomorrow's amount depends only on what we have
today and not on how we arrived at today's amount. Indeed, this will be the case if, for
instance, each day's incremental amount is independent of those for the previous days.
Thus, a very common and useful stochastic process possessing the Markov property is the
sequence of partial totals in a random sum, i.e. X_t = ξ_1 + . . . + ξ_t where ξ_1, ξ_2, . . . is a
sequence of independent random variables. In this case, it is clear that X_{t+1} = X_t + ξ_{t+1}
depends only on the value of X_t (and, of course, on the value of ξ_{t+1}, but this is independent
of all the previous ξ's and thus of the previous X's as well).


6.3 Stationarity

Suppose we know that at time t, our Markov chain is in state x, and we want to know
about what will happen at time t + 1. The probability of Xt+1 being equal to y in this
instance is referred to as the one-step transition probability of going from state x to state
y at time t, and is denoted by P^{t,t+1}(x, y), or sometimes P^{t,t+1}_{xy}. (Note that for convenience
of terminology, even if x = y we will still refer to this as a transition). If we are dealing
with a Markov chain, then we know that
P^{t,t+1}(x, y) = P(X_{t+1} = y | X_t = x),
since the outcome of X_{t+1} only depends on the value of X_t. If, for any value t in the index
set, we have
P^{t,t+1}(x, y) = P(x, y) = P_{xy}    for all x, y ∈ S,
that is, the one-step transition probabilities are the same at all times t, then the process
is said to have stationary transition probabilities. Here, the word stationary describes the
fact that the probability of going from one specified state to another does not change
with time. Note that for the partial totals in a random sum, the process has stationary
transition probabilities if and only if the ξ's are identically distributed.

6.4 Transition Matrices and Initial Distributions

Let's start by considering the simplest type of Markov chain, namely, a chain with state
space of cardinality 2, say, S = {0, 1}. (Actually, this is the second-simplest type of chain,
the simplest being one with only one possible state, but this case is rather unenlightening).
Suppose that at any time t,
P(X_{t+1} = 1 | X_t = 0) = p,        P(X_{t+1} = 0 | X_t = 0) = 1 − p,
P(X_{t+1} = 0 | X_t = 1) = q,        P(X_{t+1} = 1 | X_t = 1) = 1 − q,

and that at time t = 0,
P(X_0 = 0) = π_0(0),        P(X_0 = 1) = π_0(1).
We will generally use the notation π_t to refer to the pmf of the discrete random variable
X_t when dealing with discrete time Markov chains, so that π_t(x) = p_{X_t}(x) = P(X_t = x).
When the state space is finite, we can arrange the transition probabilities, P_{xy}, into a
matrix called the transition matrix. For the two-state Markov chain described above the
transition matrix is
P = [ P(0,0)  P(0,1) ; P(1,0)  P(1,1) ] = [ P_00  P_01 ; P_10  P_11 ] = [ 1−p  p ; q  1−q ].
Note that for any fixed x, the pmf of X_t given X_{t−1} = x is p_{X_t|X_{t−1}}(y|x) = P(x, y). Thus,
the sum of the values in any row of the matrix P will be 1. If the state space is not finite
then we will often refer to P(x, y) as the transition function of the Markov chain.
Similarly, if S is finite, we can arrange the initial distribution as a row vector, for
example, π_0 = (π_0(0), π_0(1)) in the case of the two-state chain above.
It is an important fact that P and π_0 are enough to completely characterize a Markov
chain, and we shall examine this more thoroughly a little later. As an example, however,
let's compute some quantities associated with the above two-state chain.
Example 6.1. For the two-state Markov chain above, let's examine the chance that X_t
will equal 0. To do so, we note
π_t(0) = P(X_t = 0)
       = P(X_t = 0 | X_{t−1} = 0)P(X_{t−1} = 0) + P(X_t = 0 | X_{t−1} = 1)P(X_{t−1} = 1)
       = (1 − p)π_{t−1}(0) + q π_{t−1}(1)        [where π_{t−1}(1) = 1 − π_{t−1}(0)]
       = q + (1 − p − q)π_{t−1}(0).
By iterating this procedure:
π_1(0) = q + (1 − p − q)π_0(0)
π_2(0) = q + (1 − p − q)π_1(0) = q + (1 − p − q){q + (1 − p − q)π_0(0)}
       = q + (1 − p − q)q + (1 − p − q)^2 π_0(0)
  ...
π_t(0) = q Σ_{i=0}^{t−1} (1 − p − q)^i + (1 − p − q)^t π_0(0)
       = q/(p + q) + (1 − p − q)^t { π_0(0) − q/(p + q) },
where we have used the well-known summation formula for a geometric series,
Σ_{i=0}^{n−1} r^i = (1 − r^n)/(1 − r).
First, note that we can thus calculate the distribution of any of the X_t's using only the
entries of P and π_0. Second, as long as 0 < p + q < 2 (that is, p and q are not both 0 and
not both 1), it must be the case that |1 − p − q| < 1. So, as t → ∞, we see that
lim_{t→∞} π_t(0) = q/(p + q).
Note that this does not depend on the initial distribution π_0. In other words, regardless
of how the chain starts, in the long run it will be in state 0 about a q/(p + q) proportion of
the time. This sort of "long run forecast" is a preview of the steady state distribution of
a Markov chain which we will discuss later.
Also, note that if π_0(0) = q/(p + q) then π_t does not depend on t. Searching for
conditions where the distribution of the state of the chain at time t does not depend on t
previews ideas about the stationary distribution of the Markov chain which we will also
discuss later. We call these distributions stationary since if a member of the chain is in this
distribution then the rest of the chain will have this distribution as well. Finally, notice
that, in this case, if p + q = 1 then π_t(0) = q regardless of the starting distribution. Thus,
regardless of the initial distribution, such a chain will be in its stationary distribution
after a single step.
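The convergence of π_t(0) to q/(p + q) can be checked numerically. The following Python/NumPy sketch iterates π_t = π_{t−1}P for the two-state chain; the particular values of p and q are arbitrary illustrations, not taken from the notes.

```python
import numpy as np

p, q = 0.3, 0.2                       # illustrative values, not from the notes
P = np.array([[1 - p, p],
              [q, 1 - q]])
pi = np.array([1.0, 0.0])             # start the chain in state 0

for t in range(50):                   # iterate pi_t = pi_{t-1} P
    pi = pi @ P

print(pi[0], q / (p + q))             # both approach q/(p+q) = 0.4
```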
As a last demonstration that all we need to know about a chain is P and π_0, let's
calculate P(X_2 = x_2 | X_0 = x_0). To do this, note that
P(X_2 = x_2 | X_0 = x_0) = P(X_2 = x_2, X_0 = x_0) / P(X_0 = x_0)
  = Σ_{i=0}^{1} P(X_2 = x_2, X_1 = i, X_0 = x_0) / π_0(x_0)
  = Σ_{i=0}^{1} P(X_2 = x_2 | X_1 = i, X_0 = x_0)P(X_1 = i, X_0 = x_0) / π_0(x_0)
  = Σ_{i=0}^{1} P(X_2 = x_2 | X_1 = i)P(X_1 = i, X_0 = x_0) / π_0(x_0)
  = Σ_{i=0}^{1} P(i, x_2)P(X_1 = i | X_0 = x_0)P(X_0 = x_0) / π_0(x_0)
  = Σ_{i=0}^{1} P(x_0, i)P(i, x_2).

Note that this is just the sum of the chances of all the ways of starting in state x_0 and
going in two steps to state x_2. In addition, the above value is just the (x_0, x_2)-th entry in
the two-step transition matrix,
P^2 = P · P = [ (1−p)^2 + pq    p(2 − p − q) ; q(2 − p − q)    (1−q)^2 + pq ].

We will discuss general n-step transition matrices shortly.


A formal proof that P and π_0 fully characterize a Markov chain is beyond the scope
of this class. However, we will try to give the basic idea behind the proof now. It should
seem intuitively reasonable that anything we want to know about a Markov chain {X_t}_{t≥0}
can be built up from probabilities of the form
P(X_n = x_n, . . . , X_0 = x_0)
  = P(X_n = x_n | X_{n−1} = x_{n−1}, . . . , X_0 = x_0) P(X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
  = P(X_n = x_n | X_{n−1} = x_{n−1}) P(X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
  ...
  = P(x_{n−1}, x_n)P(x_{n−2}, x_{n−1}) · · · P(x_0, x_1)π_0(x_0)
  = π_0(x_0) Π_{i=1}^{n} P(x_{i−1}, x_i).

Notice that the above simply states that the probability that the chain follows a particular
path for the first n steps can be found by simply multiplying the probabilities of the
necessary transitions. Note also, that we directly required both the Markov property and
stationarity for this demonstration. Indeed, the above identity is an equivalent form of
the stationary Markov property. As a technical detail, we must be careful that none of
the conditioning events in the above derivation have probability zero. However, this will
only occur when the original path is not possible (i.e. the specified x_i's do not form a
legitimate set of outcomes), in which case the original probability will clearly be zero as
will at least one of the factors in the final product, so the result still holds true.
For the sake of completeness, we note that the characterization is a one-to-one correspondence. That is, every Markov chain is completely determined by its initial distribution and transition matrix, and any initial distribution and transition matrix (recall
that a transition matrix must satisfy the property that each of its rows sums to unity)
determine some Markov chain.
As a final comment, we note that it is the transition function P which is the more
fundamental aspect of a Markov chain rather than the initial distribution π_0. We shall
see why this is so specifically in the results to follow, but it should be clear that changing
initial distributions will generally only slightly affect the overall behaviour of the chain,
while a change in P will generally result in dramatic changes.
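Since π_0 and P completely determine the chain, they are also all that is needed to simulate it. The sketch below is a minimal illustration; the function name simulate_chain and the example values are my own additions, not part of the notes.

```python
import numpy as np

def simulate_chain(pi0, P, n_steps, rng=np.random.default_rng(0)):
    """Simulate X_0, ..., X_n from an initial distribution pi0 and transition matrix P."""
    states = np.arange(len(pi0))
    x = rng.choice(states, p=pi0)          # draw X_0 from pi_0
    path = [x]
    for _ in range(n_steps):
        x = rng.choice(states, p=P[x])     # draw X_{t+1} from row x of P
        path.append(x)
    return path

# Two-state example with p = 0.3, q = 0.2 (illustrative values)
P = np.array([[0.7, 0.3], [0.2, 0.8]])
print(simulate_chain(np.array([1.0, 0.0]), P, 20))
```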

6.5 Examples of Markov Chains

We now present some of the most commonly used Markov chains.


Random Walk: Let p(u) be a probability mass function on the integers. A random walk
is a Markov chain with transition function P(x, y) = p(y − x) for integer valued x and y.
Here we have S = Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . . }. For instance, if p(−1) = p(1) = 0.5,
then the chain is the simple symmetric random walk, where at each stage the chain takes
either one step forward or backward. Such models are sometimes used to describe the
motion of a suspended particle. One question of interest we might ask is how far the
particle will travel. Another might be whether the particle ever returns to its starting
position and if so, how often. Often, the simple random walk is extended so that p(1) = p,
p(−1) = q and p(0) = r, where p, q and r are non-negative numbers less than one such
that p + q + r = 1.
Ehrenfest chain: The Ehrenfest chain is often used as a simple model for the diffusion
of molecules across a membrane. Suppose that we have two distinct boxes and d distinct
labelled balls. Initially, the balls are distributed between the two boxes. At each step, a
ball is selected at random and is moved from the box that it is in to the other box. If Xt
denotes the number of balls in the first box after t transitions, then {X_t}_{t≥0} is a Markov
chain with state space S = {0, . . . , d}. The transition function can be easily computed as
follows: If at time t, there are x balls in the first box, then there is probability x/d that
a ball will be removed from this box and put in the other, and a probability of (d − x)/d
that a new ball will be added to this box from the other, thus
P(x, y) = x/d        if y = x − 1
        = 1 − x/d    if y = x + 1
        = 0          otherwise.
For this chain, we might ask if an equilibrium is reached.
Gambler's ruin: Suppose a gambler starts out with x dollars and makes a series of one
dollar bets against the house. Assume that the respective probabilities of winning and
losing the bet are p and q = 1 − p, and that if the capital ever reaches 0, the betting
ends and the gambler's fortune remains 0 forever after. This Markov chain has state space
S = N_0 = {0, 1, 2, 3, . . . } and transition function
P(x, y) = 1    if x = y = 0
        = q    if y = x − 1 and x > 0
        = p    if y = x + 1 and x > 0
        = 0    otherwise,
so that P(0, 0) = 1 and P(0, y) = 0 for y ≠ 0. Note that a state which satisfies
P(a, a) = 1 and P(a, y) = 0 for y ≠ a is called an absorbing state. We might wish to ask
what the chance is that the gambler is ruined (i.e. loses all his/her initial stake) and how
long it might take. Also, we might modify this chain to incorporate a strategy whereby the
gambler quits when his/her fortune reaches d. For this chain, the above transition function
still holds except that the definition given for P(x, y) now holds only for 1 ≤ x ≤ d − 1,
and d becomes an absorbing state. One interpretation of this modification is that two
gamblers are betting against each other and between them they have a total capital of d
dollars. Letting X_t represent the fortune of one of the gamblers yields the gambler's ruin
chain on {0, 1, . . . , d}.
Birth and death chains: The Ehrenfest and Gambler's ruin chains are special cases of
a birth and death chain. A birth and death chain has state space S = N_0 = {0, 1, 2, . . .}
and has transition function
P(x, y) = q_x    if y = x − 1
        = r_x    if y = x
        = p_x    if y = x + 1
        = 0      otherwise,
where p_x is the chance of a birth, q_x the chance of a death and 0 ≤ p_x, q_x, r_x ≤ 1 such
that p_x + q_x + r_x = 1. Note that we allow the chance of births and deaths to depend on
x, the current population. We will study birth and death chains in more detail later.
Queuing chain: Consider a service facility at which people arrive during each discrete
time interval according to a distribution with probability mass function p(u). If anyone is
in the queue at the start of a time period then a single person is served and removed from
the queue. Thus, the transition function for this chain is P(0, y) = p(y) and P(x, y) =
p(y − x + 1) for x ≥ 1. In other words, if there is no one in the queue then the chance of having
y people in the queue by the next time interval is just the chance of y people arriving,
namely p(y), while if x people are currently in the queue, one will definitely be served and
removed and thus to get to y individuals in the queue we require the arrival of y − (x − 1)
additional individuals. Two obvious questions to ask about this chain are when the queue
will be emptied and how often.


Branching chain: Consider objects or entities, such as bacteria, which generate a number
of offspring according to the probability mass function p(u). If at each time increment,
the existing objects produce a random number of offspring and then expire, then X_t, the
total number of objects at generation t, is a Markov chain with
P(x, y) = P(ξ_1 + . . . + ξ_x = y),
where the ξ_i's are independent random variables each having probability mass function
p(u). A natural question to ask for such a chain is if and when extinction will occur.

6.6 Extending the Markov Property

Recall that we have said that the Markov property is equivalent to the identity
P(X_0 = x_0, . . . , X_n = x_n) = π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x_n).
From this it is easy to see that
P(X_{n+m} = y_m, . . . , X_{n+1} = y_1 | X_n = x, X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
  = P(X_{n+m} = y_m, . . . , X_{n+1} = y_1, X_n = x, X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
      / P(X_n = x, X_{n−1} = x_{n−1}, . . . , X_0 = x_0)
  = π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x)P(x, y_1) · · · P(y_{m−1}, y_m)
      / {π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x)}
  = P(x, y_1) · · · P(y_{m−1}, y_m)
  = P(X_{n+m} = y_m, . . . , X_{n+1} = y_1 | X_n = x).
Now, it should seem reasonable that if specific past values are irrelevant in determining
future events then no more vague information could be any help either. Specifically, let
A_0, . . . , A_{n−1} ⊂ S; then
P(X_{n+m} = y_m, . . . , X_{n+1} = y_1 | X_n = x, X_{n−1} ∈ A_{n−1}, . . . , X_0 ∈ A_0)
  = P(x, y_1) · · · P(y_{m−1}, y_m)
  = P(X_{n+m} = y_m, . . . , X_{n+1} = y_1 | X_n = x).
From this it readily follows that for any B_1, . . . , B_m ⊂ S,
P(X_{n+m} ∈ B_m, . . . , X_{n+1} ∈ B_1 | X_n = x, X_{n−1} ∈ A_{n−1}, . . . , X_0 ∈ A_0)
  = Σ_{y_1 ∈ B_1} · · · Σ_{y_m ∈ B_m} P(x, y_1) · · · P(y_{m−1}, y_m)
  = P(X_{n+m} ∈ B_m, . . . , X_{n+1} ∈ B_1 | X_n = x).

In fact, if C is a general "past" event (i.e. an event determined by the outcomes of
X_0, . . . , X_{n−1}) and D is a general "future" event (i.e. one determined by the outcomes of
X_{n+1}, . . . , X_{n+m} for some m ≥ 1) then the extended Markov property states that
P(D | C ∩ {X_n = x}) = P(D | X_n = x).
Once we have this, it is then easily believed (and almost as easily shown) that a Markov
chain known to be in state x at time n evolves from that point on in exactly the same
way as a fresh version of the Markov chain started in state x, or more precisely started
with an initial distribution having π_0(x) = 1. This property is summarized by saying that
a Markov chain restarts from fixed times, and is one major reason why it is P rather than
π_0 which is the more fundamental defining feature of a Markov chain.
Because of this, it will be very helpful to introduce a bit of new notation. In general,
when we write probability statements about a Markov chain, such as P(X_t = x_t), we are
incorporating into that statement the initial distribution π_0, so a more complete notation
would be P_{π_0}(X_t = x_t). When the initial distribution is a point mass at a particular value
x, we will use the notation P_x(A) to indicate the probability of an event A regarding
the chain assuming that it has been started in state x. From this, it can be seen that
P(A | X_0 = x) = P_{π_0}(A | X_0 = x) = P_x(A). In particular, note that P(X_{n+m} = y | X_n = x) =
P_x(X_m = y).

6.7 Multi-Step Transition Functions

The m-step transition function of a Markov chain measures the probability of going from
state x to state y in exactly m time units, and is defined by
P^m(x, y) := Σ_{y_1 ∈ S} · · · Σ_{y_{m−1} ∈ S} P(x, y_1) · · · P(y_{m−1}, y) = P_x(X_m = y).
For consistency of notation, we will set P^1(x, y) = P(x, y) and
P^0(x, y) := 1 if y = x, and 0 otherwise.
A little inspection reveals that if S is finite then the m-step transition function P^m is
just the m-th power of the transition matrix P, as the notation suggests. A key result for
multi-step transition functions is
P^{n+m}(x, y) = P_x(X_{n+m} = y)
  = Σ_{z ∈ S} P_x(X_{n+m} = y | X_n = z)P_x(X_n = z)
  = Σ_{z ∈ S} P_z(X_m = y)P^n(x, z)
  = Σ_{z ∈ S} P^n(x, z)P^m(z, y).

We can use the multi-step transition function to calculate the distributions of each of the
X_t's. Note that
π_t(y) = P(X_t = y)
  = Σ_{x ∈ S} P(X_t = y, X_0 = x)
  = Σ_{x ∈ S} P(X_t = y | X_0 = x)P(X_0 = x)
  = Σ_{x ∈ S} π_0(x)P^t(x, y).
In matrix notation, this translates into π_t = π_0 P^t, which is why we have used a row vector
for the π's instead of the more usual choice of a column vector. We may also arrive at this
result by noting that a similar sort of reasoning shows that π_t = π_{t−1} P, and then iterating.

Example 6.2. Suppose we have a three state Markov chain with transition matrix
P = [ 0  1  0 ; 1−p  0  p ; 0  1  0 ].

(a) Find P^2, (b) Show B := P^2 is idempotent: B^2 = B, (c) Find P^n, n ≥ 1.


Solution: (a) Squaring the given transition matrix yields (algebra)
P^2 = [ 1−p  0  p ; 0  1  0 ; 1−p  0  p ].
Alternatively, this result can also be obtained by probabilistic reasoning from a simple
diagram:
[1-step transition vs. 2-step transition diagram]
(b) Again, examination of a diagram gives the idea of the result, and simple calculation
shows directly that P^4 = P^2.
(c) Since P^2 is idempotent it is clear that P^2 = P^4 = P^6 = . . . and that P^3 = P^5 = P^7 = . . ..
It remains only to note that P = P^1 = P^3, but this again follows from simple matrix
calculations.
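A quick NumPy check of Example 6.2 for one particular value of p (the value 0.4 is an arbitrary illustration, not from the notes):

```python
import numpy as np

p = 0.4                                   # any 0 <= p <= 1 works; 0.4 is just an illustration
P = np.array([[0.0, 1.0, 0.0],
              [1 - p, 0.0, p],
              [0.0, 1.0, 0.0]])

P2 = P @ P
print(np.allclose(P2 @ P2, P2))                        # B = P^2 is idempotent
print(np.allclose(np.linalg.matrix_power(P, 3), P))    # P^3 = P
```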



6.8 Hitting Times and Strong Markov Property

Let A ⊂ S. The hitting time T_A of the set A is defined as
T_A = inf{t > 0 : X_t ∈ A},
where it is assumed by convention that the infimum of the empty set is infinity. So, T_A is
just a random variable which indicates when the chain first enters the set of states A. In
particular, we will use the shortened notation T_a = T_{{a}} for the hitting time of a specific
state.
For a specific state y note that the events {T_y = m, X_n = y}, m = 1, . . . , n, are disjoint
and that
{X_n = y} = ∪_{m=1}^{n} {T_y = m, X_n = y}.

From this, it follows that
P^n(x, y) = P_x(X_n = y)
  = Σ_{m=1}^{n} P_x(T_y = m, X_n = y)
  = Σ_{m=1}^{n} P_x(T_y = m)P(X_n = y | T_y = m, X_0 = x)
  = Σ_{m=1}^{n} P_x(T_y = m)P(X_n = y | X_m = y, X_{m−1} ≠ y, . . . , X_1 ≠ y, X_0 = x)
  = Σ_{m=1}^{n} P_x(T_y = m)P(X_n = y | X_m = y)
  = Σ_{m=1}^{n} P_x(T_y = m)P^{n−m}(y, y).

As an example of the use of this formula note that if a is an absorbing state, then
P^n(x, a) = Σ_{m=1}^{n} P_x(T_a = m)P^{n−m}(a, a) = Σ_{m=1}^{n} P_x(T_a = m) = P_x(T_a ≤ n).

In other words, the chance of going from state x to the absorbing state a in n steps is the
same as the chance of hitting the state a sometime before the (n + 1)st step of a chain
started at x.
Also, we can observe that
P_x(T_y = 1) = P_x(X_1 = y) = P(x, y),
and that
P_x(T_y = 2) = Σ_{z ≠ y} P_x(X_1 = z, X_2 = y) = Σ_{z ≠ y} P(x, z)P(z, y).

For n ≥ 3, we can use the recursion formula
P_x(T_y = n) = Σ_{z ≠ y} P(x, z)P_z(T_y = n − 1),
which should be intuitively obvious, since in order to get to y for the first time at step n
when starting at x, we must first step somewhere other than y and then get to y for the
first time at step n − 1 starting from this new location. This is an example of a first step
analysis, which we will investigate more fully in the next section.
Example 6.3. Let {X_t}_{t≥0} be the usual two-state Markov chain, so that P(0, 0) = 1 − p,
P(0, 1) = p, P(1, 0) = q and P(1, 1) = 1 − q with 0 < p, q < 1. Find P_0(T_0 = n).
Solution: If n = 1 then P_0(T_0 = 1) = P(0, 0) = 1 − p, and if n ≥ 2 then we have
P_0(T_0 = n) = Σ_{z ≠ 0} P(0, z)P_z(T_0 = n − 1) = P(0, 1)P_1(T_0 = n − 1).
Now, if n = 2 then P_1(T_0 = n − 1) = P_1(T_0 = 1) = P(1, 0) = q, and otherwise for n ≥ 3,
P_1(T_0 = n − 1) = Σ_{z ≠ 0} P(1, z)P_z(T_0 = n − 2) = P(1, 1)P_1(T_0 = n − 2),
which implies by iteration that P_1(T_0 = n − 1) = (1 − q)^{n−2} P_1(T_0 = 1) = q(1 − q)^{n−2}.
Thus,
P_0(T_0 = n) = pq(1 − q)^{n−2}    for n ≥ 2.
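The first-return distribution in Example 6.3 can be checked by simulation. The sketch below is an illustrative Monte Carlo estimate (the values of p, q, the seed and the sample size are arbitrary choices, not from the notes), compared against the formula pq(1 − q)^{n−2}.

```python
import numpy as np

p, q, n_sims = 0.3, 0.2, 200_000           # illustrative values
rng = np.random.default_rng(1)
counts = np.zeros(12)

for _ in range(n_sims):
    x, t = 0, 0
    while True:
        t += 1
        # from state 0 move to 1 with prob p; from state 1 stay in 1 with prob 1 - q
        x = rng.random() < (p if x == 0 else 1 - q)
        if x == 0:                          # first return to state 0 at time t
            if t < len(counts):
                counts[t] += 1
            break

est = counts / n_sims
print(est[1], 1 - p)                        # P_0(T_0 = 1)
print(est[4], p * q * (1 - q) ** 2)         # P_0(T_0 = 4) = pq(1-q)^2
```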

Finally, the strong Markov property states that a Markov chain restarts from hitting
times. In other words, we can make statements like P(X_{T_a + 1} = y | T_a = t) = P_a(X_1 = y).
Note that this is not generally true for an arbitrary random time τ. For example, if τ is the
time immediately preceding the first visit to a, i.e. τ = T_a − 1, then P(X_{τ+1} = a) = 1 and the
chain certainly does not evolve as a freshly started version of the original Markov chain.
Actually, a more precise statement of the strong Markov property says that a Markov
chain restarts from stopping times. A stopping time is loosely defined as any positive
integer-valued random variable T such that the occurrence of the event {T = n} can be
determined based solely on the outcomes of the random variables X_0, X_1, . . . , X_n. In other
words, it is a random time for which we do not need to look into the future to determine
whether it has been reached or not. A more precise definition requires a bit of measure
theory.

6.9 First Step Analysis

We now take a brief aside to discuss a very useful method for calculating various quantities
about a Markov chain. This method is usually referred to as first step analysis, since it
proceeds by breaking down the possibilities that can arise at the end of the first transition
of the chain, and then using the law of total probability to arrive at a characterizing
recurrence relationship.
As an initial example, consider the Markov chain with state space S = {0, 1, 2} and
transition matrix
P = [ 1  0  0 ; α  β  γ ; 0  0  1 ],
where α, β, γ ≥ 0 and α + β + γ = 1. This is a rather simple chain, its behaviour characterized by a random
duration in state 1 before an absorption in either state 0 or state 2 (in fact, if β = 0 then
this is just the gambler's ruin chain on {0, 1, 2}). However, it serves as a useful illustration
of the method of first step analysis. Two obvious questions arise regarding this chain: In
which state will the chain finally be absorbed, and how long, on the average, will it take
until absorption occurs? In other words, we would like to find
u = P_1(X_{T_{{0,2}}} = 0)    and    v = E_1[T_{{0,2}}].
[NOTE: We will use the notation E_x[·] to denote the expectations of quantities regarding
a Markov chain started in state x.]
A first step analysis proceeds by conditioning on the first transition, so that
u = P_1(X_{T_{{0,2}}} = 0)
  = Σ_{x=0}^{2} P_1(X_{T_{{0,2}}} = 0 | X_1 = x)P_1(X_1 = x)
  = 1·(α) + u·(β) + 0·(γ).
Note that we have used the fact that Markov chains restart at fixed times to justify the
fact that P_1(X_{T_{{0,2}}} = 0 | X_1 = 1) = u. Now, solving the equation gives
u = α/(1 − β) = α/(α + γ),
provided β ≠ 1; if β = 1, then u = 0, since the chain can never reach either state 0 or
state 2 having been started in the absorbing state 1. Note that the answer is just the
conditional probability that the chain is in state 0 given that a jump from state 1 has
occurred, which only stands to reason.
Similarly, the expectation of T_{{0,2}} can be found from the recursion equation
v = E_1[T_{{0,2}}]
  = Σ_{x=0}^{2} E_1[T_{{0,2}} | X_1 = x]P_1(X_1 = x)
  = 1·(α) + (1 + v)·(β) + 1·(γ),
again employing the restart property of Markov chains. Solving this equation shows that
v = 1/(1 − β). Of course, in this instance, this answer can be arrived at directly by noting
that T_{{0,2}} has a geometric distribution with parameter 1 − β since it is simply the number
of trials until the first occurrence of a transition to an absorbing state.
To study the method more generally, we must expand the complexity of our Markov
chain. Suppose that {X_t}_{t≥0} is an (N + 1)-state Markov chain in which states r, . . . , N
are absorbing. Let us partition the transition matrix as
P = [ Q  R ; O  I ],
where O is an (N − r + 1) × r matrix of zeroes, I is an (N − r + 1) × (N − r + 1) identity
matrix, Q_{ij} = P_{ij} for 0 ≤ i, j < r, and R_{ij} = P_{ij} for 0 ≤ i < r and r ≤ j ≤ N.
Started at one of the states 0 ≤ i < r, the process will remain in these states for some
random period of time and then eventually be trapped in one of the absorbing states
r ≤ i ≤ N. We wish to evaluate the average duration until absorption as well as the
distribution of where the chain is finally absorbed.
Now, for a fixed state k among the absorbing states, the probability of ultimate absorption
in this state will depend on the initial state of the chain, X_0 = i for 0 ≤ i < r.
We will denote the probability that the chain starts in state i and is ultimately absorbed
in state k by U_{ik}. So, using a first step analysis, we find that
U_{ik} = P_i(absorbed in state k)
  = Σ_{j=0}^{N} P_i(absorbed in state k | X_1 = j)P_{ij}
  = P_{ik} + Σ_{j=0}^{r−1} U_{jk} P_{ij},
since the chance of being absorbed in state k given a first step to another absorbing state is
obviously 0, and the chance of being absorbed in state k given a first step to state 0 ≤ j < r
is just U_{jk} since Markov chains restart from fixed times. In other words, to find the U_{ik}'s
for a fixed k and 0 ≤ i < r we need to solve an inhomogeneous system of linear equations.
If we write the column vectors U_k = (U_{0k}, . . . , U_{(r−1)k})′ and R_k = (R_{0k}, . . . , R_{(r−1)k})′, then
we can write the system in matrix notation as
U_k = R_k + Q U_k.

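In matrix form, the system U_k = R_k + Q U_k can be solved numerically as (I − Q)U = R. The sketch below uses an invented 5-state chain purely for illustration (the matrix, the choice r = 3 and the variable names are my own assumptions, not from the notes).

```python
import numpy as np

# Illustrative 5-state chain: states 0-2 transient, states 3-4 absorbing.
P = np.array([[0.2, 0.3, 0.1, 0.4, 0.0],
              [0.1, 0.1, 0.4, 0.2, 0.2],
              [0.3, 0.2, 0.1, 0.0, 0.4],
              [0.0, 0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
r = 3
Q, R = P[:r, :r], P[:r, r:]

# U = R + Q U  is the same as  (I - Q) U = R.
U = np.linalg.solve(np.eye(r) - Q, R)
print(U)                 # row i gives the absorption probabilities from transient state i
print(U.sum(axis=1))     # each row sums to 1: absorption somewhere is certain here
```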

Example 6.4. A white rat is put into the maze depicted below:
[Maze diagram: nine compartments 0-8; the food is in compartment 7 and the cat in compartment 8.]
With the cat resting in the corner room, we might hypothesize that the rat would move
through the maze at random (i.e. in a room with k doors the rat would choose each of
these with probability 1/k), until it reaches either the food or the cat. If X_t represents the
compartment that the rat is in at time t, then {X_t}_{t≥0} is a Markov chain with transition
matrix
P = [   0   1/2  1/2   0    0    0    0    0    0
       1/3   0    0   1/3   0    0    0   1/3   0
       1/3   0    0   1/3   0    0    0    0   1/3
        0   1/4  1/4   0   1/4  1/4   0    0    0
        0    0    0   1/3   0    0   1/3  1/3   0
        0    0    0   1/3   0    0   1/3   0   1/3
        0    0    0    0   1/2  1/2   0    0    0
        0    0    0    0    0    0    0    1    0
        0    0    0    0    0    0    0    0    1  ].
If we wish to find the chance that the rat finds the food first, then we must solve the
system of equations:
U_{0,7} = (1/2)U_{1,7} + (1/2)U_{2,7}
U_{1,7} = 1/3 + (1/3)U_{0,7} + (1/3)U_{3,7}
U_{2,7} = (1/3)U_{0,7} + (1/3)U_{3,7}
U_{3,7} = (1/4)U_{1,7} + (1/4)U_{2,7} + (1/4)U_{4,7} + (1/4)U_{5,7}
U_{4,7} = 1/3 + (1/3)U_{3,7} + (1/3)U_{6,7}
U_{5,7} = (1/3)U_{3,7} + (1/3)U_{6,7}
U_{6,7} = (1/2)U_{4,7} + (1/2)U_{5,7}.
Of course, we could solve this directly, but the innate symmetry in this problem shows
that we may dramatically simplify the problem by noting that U_{0,7} = U_{6,7}, U_{1,7} = U_{4,7},
U_{2,7} = U_{5,7} and U_{3,7} = 1/2. Using these simplifications, the system reduces to
U_{0,7} = (1/2)U_{1,7} + (1/2)U_{2,7}
U_{1,7} = 1/2 + (1/3)U_{0,7}
U_{2,7} = 1/6 + (1/3)U_{0,7},
which is solved when U_{0,7} = 1/2, U_{1,7} = 2/3 and U_{2,7} = 1/3.

Now, let A be the set of absorbing states, so that A = {r, . . . , N}. Suppose we now wish
to find
w_i = E_i[ Σ_{n=0}^{T_A − 1} g(X_n) ],    0 ≤ i < r,
for some function g. For instance, if g(j) = 1 for all j, then Σ_{n=0}^{T_A − 1} g(X_n) = T_A and the
above quantities are the expected time to absorption starting from state i. On the other
hand, if g(j) = 1 for j = k and g(j) = 0 otherwise, then the above quantity is the expected
number of visits to state k prior to absorption for a chain started in state i. So, applying
the first step analysis yields
w_i = E_i[ Σ_{n=0}^{T_A − 1} g(X_n) ]
  = Σ_{j=0}^{N} E_i[ Σ_{n=0}^{T_A − 1} g(X_n) | X_1 = j ] P_{ij}
  = Σ_{j=r}^{N} g(i)P_{ij} + Σ_{j=0}^{r−1} {g(i) + w_j}P_{ij}
  = g(i) + Σ_{j=0}^{r−1} P_{ij} w_j.
In particular, if g(i) = 1 for all i then we see that the values v_i = E_i[T_A] satisfy the system
of equations
v_i = 1 + Σ_{j=0}^{r−1} P_{ij} v_j,    0 ≤ i < r.

Example 6.5. Example 6.4 cont'd: For the above maze, the values v_i satisfy
v_0 = 1 + (1/2)v_1 + (1/2)v_2
v_1 = 1 + (1/3)v_0 + (1/3)v_3
v_2 = 1 + (1/3)v_0 + (1/3)v_3
v_3 = 1 + (1/4)v_1 + (1/4)v_2 + (1/4)v_4 + (1/4)v_5
v_4 = 1 + (1/3)v_3 + (1/3)v_6
v_5 = 1 + (1/3)v_3 + (1/3)v_6
v_6 = 1 + (1/2)v_4 + (1/2)v_5.
Again, a dramatic simplification results since by symmetry it must be the case that v_0 = v_6
and v_1 = v_2 = v_4 = v_5, so that the system becomes
v_0 = 1 + v_1
v_1 = 1 + (1/3)v_0 + (1/3)v_3
v_3 = 1 + v_1.
This system is easily seen to be solved when v_0 = v_3 = 6 and v_1 = 5.
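Both maze calculations can be verified numerically from the partitioned transition matrix by solving (I − Q)U = R and (I − Q)v = 1. The code below is an illustrative check (the variable names are my own), using the maze matrix from Example 6.4.

```python
import numpy as np

P = np.array([
    [0,   1/2, 1/2, 0,   0,   0,   0,   0,   0],
    [1/3, 0,   0,   1/3, 0,   0,   0,   1/3, 0],
    [1/3, 0,   0,   1/3, 0,   0,   0,   0,   1/3],
    [0,   1/4, 1/4, 0,   1/4, 1/4, 0,   0,   0],
    [0,   0,   0,   1/3, 0,   0,   1/3, 1/3, 0],
    [0,   0,   0,   1/3, 0,   0,   1/3, 0,   1/3],
    [0,   0,   0,   0,   1/2, 1/2, 0,   0,   0],
    [0,   0,   0,   0,   0,   0,   0,   1,   0],
    [0,   0,   0,   0,   0,   0,   0,   0,   1],
])
Q, R = P[:7, :7], P[:7, 7:]
A = np.eye(7) - Q

U = np.linalg.solve(A, R)             # absorption probabilities (food = state 7, cat = state 8)
v = np.linalg.solve(A, np.ones(7))    # expected number of steps until absorption
print(U[:, 0])                        # [1/2, 2/3, 1/3, 1/2, 2/3, 1/3, 1/2]
print(v)                              # [6, 5, 5, 6, 5, 5, 6]
```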

Of course, these are just a few of the many uses for the method of first step analysis,
and many other seemingly difficult problems are found to be quite tractable using this
approach.

6.10 Transience and Recurrence

The hitting probability of a state y starting from state x is defined as
ρ_xy = P_x(T_y < ∞).
In other words, ρ_xy is the chance that starting from x we will eventually be in state y at
some future time. In particular, ρ_yy is the chance that a chain started in state y will ever
return. A state y is called recurrent if ρ_yy = 1 and is termed transient otherwise. Note
that an absorbing state is necessarily recurrent.
Example 6.6. Example 6.1 cont'd: For the two-state Markov chain, find ρ_00.
Solution: From the previous part of this example, we have
ρ_00 = P_0(T_0 < ∞) = Σ_{n=1}^{∞} P_0(T_0 = n) = 1 − p + Σ_{n=2}^{∞} pq(1 − q)^{n−2}
     = 1 − p + pq/{1 − (1 − q)} = 1,
provided q ≠ 0. (Note that when q = 1, the summation reduces to just the first term since
we assume by convention that 0^0 = 1). If q = 0, so that state 1 is an absorbing state, then
ρ_00 = 1 − p (i.e. the chance of a return to state 0 is just the chance that the chain hits
state 0 at its first step, since otherwise it will be absorbed in state 1). Therefore, state 0
is recurrent as long as q ≠ 0, and if q = 0 then state 0 is transient unless p = 0 (i.e. unless
state 0 is also an absorbing state).

Usually, it is the case that hitting probabilities are not so straightforward to calculate,
and the general goal of the next sections will be to arrive at a method for finding these
probabilities. In order to compute hitting probabilities, it is convenient to define some
additional useful quantities. The idea behind the computation method relies on the fact
that the event {T_y < ∞} is equivalent to the event {# of visits to y ≥ 1}. We start by
defining
N(A) = Σ_{n=1}^{∞} 1_A(X_n),
the number of visits to a set A ⊂ S. Specifically, we will be interested in N(y) = N({y}),
the number of visits to y. Thus, it is clear that
ρ_xy = P_x(T_y < ∞) = P_x{N(y) ≥ 1}.
Also, since Markov chains restart from hitting times, we can write the recursion equation,
for m ≥ 1,
P_x{N(y) ≥ m} = Σ_{n=1}^{∞} P_x(T_y = n) P_x( Σ_{k=n+1}^{∞} 1_y(X_k) ≥ m − 1 | T_y = n )
             = Σ_{n=1}^{∞} P_x(T_y = n) P_y{N(y) ≥ m − 1} = ρ_xy P_y{N(y) ≥ m − 1}.
So, setting x = y and iterating, we find that P_y{N(y) ≥ m} = ρ_yy^m for m ≥ 0 (again, using
the convention that 0^0 = 1) and thus
P_x{N(y) ≥ m} = ρ_xy ρ_yy^{m−1}    for m ≥ 1,
P_x{N(y) = 0} = 1 − P_x{N(y) ≥ 1} = 1 − ρ_xy.
By taking the limit as m → ∞ we then find that
P_x{N(y) = ∞} = ρ_xy if y is recurrent, and 0 if y is transient,
or, more specifically,
P_y{N(y) = ∞} = 1 if y is recurrent, and 0 if y is transient.

These last expressions provide a better and more useful distinction between recurrent and
transient states; namely, no matter where the chain starts it will only hit a transient state
a finite number of times, while a chain started in a recurrent state will hit that state an
infinite number of times.
We now turn our attention to the expected number of visits to a particular state.
Recall that the notation E_x[·] refers to expectations calculated regarding a version of the
chain started in state x, and note that
E_x[N(y)] = E_x[ Σ_{n=1}^{∞} 1_y(X_n) ] = Σ_{n=1}^{∞} E_x[1_y(X_n)] = Σ_{n=1}^{∞} P^n(x, y).

We may also calculate the expected number of visits as
E_x[N(y)] = Σ_{n=0}^{∞} P_x{N(y) > n} = Σ_{n=1}^{∞} P_x{N(y) ≥ n} = Σ_{n=1}^{∞} ρ_xy ρ_yy^{n−1}.
Thus, if y is recurrent, so that ρ_yy = 1, then the expected number of visits to y for a chain
started in state x is infinite if ρ_xy > 0 (i.e. if the chain has a positive probability of hitting
y from x) and is zero if ρ_xy = 0 (i.e. the chain cannot reach y from x). On the other hand,
if y is transient, so that ρ_yy < 1, then the geometric series formula shows that
E_x[N(y)] = ∞ if y is recurrent (and ρ_xy > 0),  while  E_x[N(y)] < ∞ if y is transient.
This expression provides the best and most useful distinction between the two types of
states; namely, that the expected number of returns to a state y is infinite for recurrent
states and finite for transient states. In addition, this distinction has an equivalent form:
Lemma: A state x is recurrent if and only if Σ_{n=1}^{∞} P^n(x, x) = ∞.
Proof: From above, x is recurrent if and only if E_x[N(x)] = ∞. Thus, the lemma follows
from the fact that E_x[N(x)] = Σ_{n=1}^{∞} P^n(x, x).
Finally, we note that if y is a transient state, then
Σ_{n=1}^{∞} P^n(x, y) = E_x[N(y)] < ∞,   which implies   lim_{n→∞} P^n(x, y) = 0.
This fact implies that a Markov chain with a finite state space must have at least one
recurrent state. The idea here is a simple one: since a transient state is only visited a finite
number of times, it must eventually be left for the last time, and thus if all the states were
transient the chain would run out of places to go. To see a more formal demonstration,
note that if all the states were transient then using the above derivation (and the fact
that the state space is finite), we would have
0 = Σ_{y∈S} lim_{n→∞} P^n(x, y) = lim_{n→∞} Σ_{y∈S} P^n(x, y) = lim_{n→∞} P_x(X_n ∈ S) = lim_{n→∞} 1 = 1,
which is a contradiction.


6.11 Decomposition of the State Space

In our quest to calculate hitting probabilities, it will be helpful to categorize groups of


states in a similar fashion to how we have categorized single states as either transient or
recurrent. To do this, we start by defining a few new concepts.
We will say that a state x leads to a state y if either ρ_xy > 0 or y = x, and we will
denote this by x → y. Note that the last condition is just a minor technicality to make
sure that a state is always allowed to lead to itself, even if ρ_xx = 0.
It is a useful fact that x → y if and only if there is some n ≥ 0 such that P^n(x, y) > 0.
To see why this is true, note that
(1) If x → y then either ρ_xy = P_x(T_y < ∞) > 0 or x = y. Now, if x = y then
P^0(x, x) = 1 > 0. Otherwise, if ρ_xy > 0 then P_x(T_y < ∞) = Σ_{m=1}^{∞} P_x(T_y = m) > 0,
so that there must be a value of n such that P_x(T_y = n) > 0. Clearly, P^n(x, y) =
P_x(X_n = y) ≥ P_x(T_y = n) > 0 and we have shown that if x → y then there is an
n ≥ 0 such that P^n(x, y) > 0.
(2) Suppose now that for some fixed n, P^n(x, y) > 0. Clearly, since {X_n = y} ⊂
{T_y < ∞}, it must be the case that ρ_xy = P_x(T_y < ∞) ≥ P_x(X_n = y) = P^n(x, y) >
0, so that x → y.
The implication of this should be obvious; namely, a state x leads to y if and only if there
is some path of states from x to y which has a non-zero probability of occurring.
The leads-to operator, →, is clearly reflexive (i.e. x → x) by definition, and it is also
transitive (i.e. if x → y and y → z, then x → z), a fact which is left as an exercise.
However, it is NOT symmetric (i.e. x → y does not necessarily imply that y → x). If
x → y and y → x, we shall say that x communicates with y and write x ↔ y. The
communication operator, ↔, is symmetric and thus defines an equivalence relation on the
state space S. It is a fact that an equivalence relation on S defines a partitioning of the
state space. For each x ∈ S, we will define
[x] = {y ∈ S : x ↔ y},
and generally refer to [x] as the communication class of x. Notice that [x] = [y] if and
only if x ↔ y, and thus the communication operator naturally partitions S into groups
of states which communicate with each other.


Example 6.7. Suppose we have a Markov chain with state space S = {0, 1, . . . , 5} and
transition matrix
P = [ 1/2  1/2   0    0    0    0
      1/3  2/3   0    0    0    0
       0    0   1/8   0   7/8   0
      1/4  1/4   0    0   1/4  1/4
       0    0   3/4   0   1/4   0
       0   1/5   0   1/5  1/5  2/5 ].
We can examine the communication classes of this Markov chain via an accessibility
diagram, which looks like:
[Derivation of accessibility diagram]
For this chain it is easy to spot the communicating states. Finally this diagram can be
simplified to the accessibility diagram:
[Accessibility diagram]
We can see that there are three distinct communication classes; namely, [0] =
{0, 1} = [1], [2] = {2, 4} = [4] and [3] = {3, 5} = [5].
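The communication classes can also be found mechanically from P, using the fact that x → y if and only if P^n(x, y) > 0 for some n ≥ 0. The sketch below is an illustrative computation (the reachability trick via powers of I + P and the variable names are my own choices, not from the notes).

```python
import numpy as np

P = np.array([
    [1/2, 1/2, 0,   0,   0,   0  ],
    [1/3, 2/3, 0,   0,   0,   0  ],
    [0,   0,   1/8, 0,   7/8, 0  ],
    [1/4, 1/4, 0,   0,   1/4, 1/4],
    [0,   0,   3/4, 0,   1/4, 0  ],
    [0,   1/5, 0,   1/5, 1/5, 2/5],
])

# (I + P)^(n-1) has a positive (x, y) entry iff y is reachable from x in at most n-1 steps.
n = len(P)
reach = np.linalg.matrix_power(np.eye(n) + P, n - 1) > 0
communicate = reach & reach.T              # x <-> y iff x -> y and y -> x

classes = {frozenset(np.flatnonzero(communicate[x])) for x in range(n)}
print([sorted(c) for c in classes])        # [[0, 1], [2, 4], [3, 5]] in some order
```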

As a final definition, we will term a Markov chain irreducible if its state space consists
of exactly one communication class, i.e. if [x] = [y] for all x, y ∈ S.
The importance of communication classes is demonstrated by the statement: recurrence
is a class property. To see this, we note that if x is a recurrent state and x → y then
y is also recurrent. The idea here is that if a chain is started in a recurrent state x then
it will return an infinite number of times, and during some of these returns it will have
visited y, since x leads there. Thus, we have
Theorem 1: Assume that x is a recurrent state and x → y. Then we have:
(i) ρ_xy = ρ_yx = 1 and P_x{N(y) = ∞} = 1.
(ii) y is a recurrent state [ρ_yy = 1].
A formal proof of this fact is rather tedious and unenlightening (though not difficult)
and we will omit it. However, the concept behind the proof is important and worth
reiterating. The reasoning proceeds as follows:
(1) x is recurrent and thus a chain started in x will return an infinite number of times,
(2) x leads to y and thus on some of the returns to x the chain will definitely have
passed through y [i.e. if an event has a positive probability of occurring, then the
chance that it occurs at least once in an infinite sequence of independent repetitions
is 1, since P(no occurrence in first n trials) = {1 − P(event occurs)}^n which will tend
to 0 as n → ∞], so ρ_xy = 1,
(3) since the chain is eventually in y on one of its infinite returns to x, it must be the
case that ρ_yx = 1, otherwise x could not be recurrent, since there would be a positive
chance that the chain would go from x to y and then never return,
(4) once the chain has gone through y and returned to x, it restarts and thus must
go through y again by the above reasoning, implying that it must go through y an
infinite number of times, i.e. P_x{N(y) = ∞} = 1.
Note that this final statement, P_x{N(y) = ∞} = 1, implies that y is recurrent, so that if
x is recurrent and x → y then y is recurrent:
Corollary 0: Let C be a communication class. Then either all states x ∈ C are recurrent,
or all states x ∈ C are transient.
Proof: Let x, y ∈ C and assume that x is recurrent. As x, y ∈ C we must have x ↔ y (in
particular, x → y). By the theorem, y is thus recurrent. Hence, if there is only one
recurrent state in C then all the other states in C are recurrent as well.

Consequently, being recurrent or transient is in fact a communication class property. We
refer to a communication class C as a recurrent class if all states in C are recurrent.
Similarly, we refer to a communication class C as a transient class if all states in C are


transient. It clearly follows that the states in a communication class must be either all
recurrent or all transient.
A set of states C ⊂ S is called closed (or absorbing) if x ∈ C and x → y implies y ∈ C.
In other words, no state in C leads out of C. So, we can now show the following results:
Corollary 1: A recurrent communication class is closed.
Proof: Let C be a recurrent communication class. Then Theorem 1 shows that if x ∈ C
and x → y then y is recurrent and ρ_yx = 1 > 0 so that y → x. Thus, y ∈ [x] = C.
Corollary 2: If C is a recurrent communication class then ρ_xy = 1, P_x{N(y) = ∞} = 1
and E_x[N(y)] = ∞ for any x, y ∈ C.
Proof: This follows from Theorem 1 and the characterization of recurrent states given in
the previous section.
Recall that a Markov chain is called irreducible if S is its only communication class (all
states communicate with each other).
Corollary 3: An irreducible Markov chain is either a recurrent chain or a transient chain,
that is, either all of the states are recurrent or else they are all transient.
Proof: Since an irreducible chain consists of a single communication class, the result
follows directly from the fact that recurrence and transience are class properties.
Corollary 4: An irreducible recurrent Markov chain visits every state infinitely often
with probability one.
Proof: This follows from Corollaries 2 and 3 and a tiny bit of measure theory.
Corollary 5: An irreducible Markov chain with finite state space is recurrent.
Proof: We have seen that a Markov chain with finite state space cannot be composed
entirely of transient states.
Corollary 6: Every state in a finite closed communication class is recurrent.
Proof: The proof is basically identical to that for Corollary 5, since a closed communication
class does not lead out of itself.


So, we can break down the state space of a Markov chain into the communication
classes that are transient and those that are recurrent. In addition, we will use the notation
S_R = {recurrent states},    S_T = {transient states}.
Note that S = S_R ∪ S_T, and that no state in S_R leads to any state in S_T.

Example 6.8. Example 6.7 cont'd. For the Markov chain described in Example 6.7 above,
we see that the communication class [3] leads to the other two classes and thus it must
be a transient class, while the classes [0] and [2] are finite and closed and thus must be
recurrent classes. So,
S_R = [0] ∪ [2] = {0, 1, 2, 4}    and    S_T = [3] = {3, 5}.
Note that none of the states in S_R lead to either state 3 or state 5, although states 3 and
5 do lead to each of the states in S_R.

In summary, we can characterize what we know about the behaviour of a Markov chain
with respect to recurrence and transience as:
(1) If a Markov chain begins in a recurrent state x, then it will stay in [x] forever,
visiting each of the states in this communication class infinitely often.
(2) If a Markov chain begins in a transient state y and |S_T| < ∞, then the chain will
eventually hit a recurrent state x and then stay in [x] forever as in (1) above.
(3) If a Markov chain begins in a transient state y and |S_T| = ∞, then either the same
behaviour as in (2) will occur, or the chain may travel through S_T forever, hitting
each state there at most a finite number of times.

6.12 Computing hitting probabilities

Recall that the main objective of the introduction of the ideas in the preceding sections
was to calculate the hitting probabilities ρ_xy. So far, we can summarize what we have
been able to say about hitting probabilities in the following table, which is broken down
by the transient or recurrent nature of each of the states x and y:
Hitting probabilities ρ_xy when:
                 y transient    y recurrent
  x transient        ??              ?
  x recurrent         0          1_{[x]}(y)
Now, the cell marked ?? is generally not of much interest, at least in the case when
|S_T| < ∞, since the chain will eventually leave all transient states. However, computing
ρ_xy in the case where x ∈ S_T and y ∈ S_R is of interest. Notice that the above corollaries
imply that for y_1, y_2 ∈ S_R, ρ_{xy_1} = ρ_{xy_2} as long as [y_1] = [y_2], since once a chain reaches
one member of a recurrent communication class it will hit all states there infinitely often.

In order to simplify the problem of calculating ρ_xy then, we can consider each recurrent
communication class as a single conglomerate state.
As a start, let's consider the simple problem where all recurrent states are absorbing,
so that each one forms a communication class all to itself. Then, using a first step analysis,
we see that
ρ_xy = P(x, y) + Σ_{z ≠ y} P(x, z)ρ_zy = P(x, y) + Σ_{z ∈ S_T} P(x, z)ρ_zy,    x ∈ S_T.
If |S_T| < ∞ then the above system of linear equations can be solved for the values of ρ_xy.
Back in the general setting, we will define ρ_xC = P_x(T_C < ∞) and note that if C is
a recurrent communication class then ρ_xC = ρ_xy for all y ∈ C. Also, following the logic
of the simplified case above, we see that if C is a recurrent communication class and
|S_T| < ∞ then we can find the values of the ρ_xC's for x ∈ S_T by solving the system of
equations
ρ_xC = Σ_{z ∈ C} P(x, z) + Σ_{z ∈ S_T} P(x, z)ρ_zC,    x ∈ S_T.
It can be shown that the solution to these sets of equations exists and is unique; however,
we will omit the proof here since it is not very illuminating. In fact, it can be further
shown that a unique closed form solution for ρ_xC exists provided the Markov chain is a
martingale, which we will define in the next section.
Example 6.9. Example 6.7 cont'd. Again for the Markov chain as given there, let's find
ρ_xC for x ∈ S_T = [3] and C ∈ {[0], [2]}.
Solution: From the above equation, we have
ρ_{3[0]} = P(3, 0) + P(3, 1) + P(3, 3)ρ_{3[0]} + P(3, 5)ρ_{5[0]} = 1/2 + (1/4)ρ_{5[0]}
ρ_{5[0]} = P(5, 0) + P(5, 1) + P(5, 3)ρ_{3[0]} + P(5, 5)ρ_{5[0]} = 1/5 + (1/5)ρ_{3[0]} + (2/5)ρ_{5[0]}.
Solving this system shows that ρ_{3[0]} = 7/11 and ρ_{5[0]} = 6/11.
Similarly, we could find ρ_{3[2]} and ρ_{5[2]}, or we could notice that since states 3 and 5 are
transient, the chain must hit either [0] or [2] and it cannot hit both since they are distinct
recurrent communication classes. Thus, ρ_{3[2]} = 1 − ρ_{3[0]} = 4/11 and ρ_{5[2]} = 1 − ρ_{5[0]} = 5/11.
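The two-equation system in Example 6.9 is easy to verify numerically; the short NumPy check below simply rewrites it in matrix form (my own illustrative rearrangement).

```python
import numpy as np

# rho_3 = 1/2 + (1/4) rho_5  and  rho_5 = 1/5 + (1/5) rho_3 + (2/5) rho_5,
# rewritten as A [rho_3, rho_5]^T = b.
A = np.array([[1.0, -1/4],
              [-1/5, 3/5]])
b = np.array([1/2, 1/5])
rho3, rho5 = np.linalg.solve(A, b)
print(rho3, rho5)    # 7/11 ≈ 0.6364 and 6/11 ≈ 0.5455
```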


6.13 Martingales

Martingales are very special and extremely important types of stochastic process, not
least because their properties make them amenable to all sorts of calculations which are
usually difficult or impossible for other types of processes. The idea of a martingale is
that "on the average it stays where it was".
More specifically, a stochastic process is a martingale if
E[X_{t+1} | X_0 = x_0, . . . , X_t = x_t] = x_t    for all x_0, . . . , x_t ∈ S.
If the process {X_t}_{t≥0} is a Markov chain with state space S = {0, 1, . . . , d} then
E[X_{t+1} | X_t = x_t, . . . , X_0 = x_0]
  = Σ_{x_{t+1}=0}^{d} x_{t+1} P(X_{t+1} = x_{t+1} | X_t = x_t, . . . , X_0 = x_0)
  = Σ_{x_{t+1}=0}^{d} x_{t+1} P(X_{t+1} = x_{t+1} | X_t = x_t)
  = Σ_{x_{t+1}=0}^{d} x_{t+1} P(x_t, x_{t+1}),
so that the martingale condition becomes
Σ_{y=0}^{d} y P(x, y) = x    for all x ∈ S.

Let's now examine the nature of the states of a martingale Markov chain. The idea
of the martingale property is that on average the chain stays where it is. For this to be
true the chain must either stay where it is all the time (i.e. be in an absorbing state)
or else it must be able to move in both directions. This heuristic argument shows that
for a martingale Markov chain on state space S = {0, . . . , d}, the states 0 and d must
be absorbing. For a more formal demonstration, note that the martingale property shows
that
Σ_{y=0}^{d} y P(0, y) = 0,
so that P(0, 1) = P(0, 2) = · · · = P(0, d) = 0 and we see that state 0 is absorbing. A
similar argument shows that state d is absorbing. Now, suppose that none of the other
states are absorbing; then, since the chain must be able to move both right and left by the
martingale property, the other states lead to both state 0 and state d. Thus, the remaining
states must be transient.
We are interested in computing ρ_x0 = 1 − ρ_xd for our (d + 1)-state martingale Markov
chain. Now, we know that ρ_xd is the unique solution to the equation
ρ_xd = P(x, d) + Σ_{z=1}^{d−1} P(x, z)ρ_zd.
By inspection, if we let ρ_xd = x/d then we have
P(x, d) + Σ_{z=1}^{d−1} P(x, z)ρ_zd = P(x, d)·(d/d) + Σ_{z=1}^{d−1} (z/d)P(x, z) + P(x, 0)·(0/d)
  = (1/d) Σ_{z=0}^{d} z P(x, z) = x/d.
Thus, ρ_xd = x/d must be the unique solution, implying of course that ρ_x0 = 1 − x/d =
(d − x)/d.
Example 6.10. When is the gambler's ruin chain on {0, 1, . . . , d} a martingale?
Solution: For the gambler's ruin chain we have 0 and d as absorbing states and, for
0 < x < d,
P(x, y) = p        if y = x + 1
        = 1 − p    if y = x − 1
        = 0        otherwise.
So, for the chain to be a martingale, we need
x = Σ_{y=0}^{d} y P(x, y) = p(x + 1) + (1 − p)(x − 1) = x − 1 + 2p,
which leads to p = 1/2; that is, we need the game to be "fair", so to speak. However, if one
gambler starts with x dollars in this game, then, since the chain is a martingale in this
instance, the chance that the other gambler goes broke first is ρ_xd = x/d. So the game is
only really fair if both players start with the same capital stake.

6.14 Special chains

We close this portion of the course by examining some of the important examples of
Markov chains introduced at the start; namely, birth and death chains, branching chains
and queuing chains.

Birth and Death Chains: Consider an irreducible Markov chain. We have seen that
such a chain is either recurrent or transient. In addition, if we have |S| < ∞ then we
know that the chain must be recurrent. In general, unfortunately, we cannot say whether
the chain is recurrent or transient if S is not finite. However, we can answer the question
for birth and death chains. Recall that a birth and death chain is a Markov chain on
S = {0, 1, 2, . . .} with transition function
P(x, y) = q_x    if y = x − 1
        = r_x    if y = x
        = p_x    if y = x + 1
        = 0      otherwise.
It should be relatively clear that the chain is irreducible (i.e. all states communicate with
one another) as long as p_0 > 0 and p_x, q_x > 0 for all x ≥ 1. So, how might we approach
determining whether the chain is recurrent or transient? Recall that all we need to show
is that one state is either recurrent or transient, so perhaps we might try to use the fact
that state 0 is recurrent if and only if Σ_{n=0}^{∞} P^n(0, 0) is infinite. Unfortunately, this is quite
difficult, since tabulating the ways that the chain can return to state 0 in exactly n steps
is quite a painful bit of algebra.
A better approach proceeds from the fact that state 0 is recurrent if and only if ρ_00 = 1,
and, by the nature of a birth and death chain, ρ_00 = 1 if and only if ρ_10 = 1. In other
words, we can only return to state 0 once we've left if we come back through state 1.
Formally, a first step analysis shows that
ρ_00 = P(0, 0) + P(0, 1)ρ_10 = P(0, 0) + {1 − P(0, 0)}ρ_10,
since from state 0 we can only go to states 0 or 1 in a single step. From this equation,
it is clear that ρ_00 can only equal 1 if ρ_10 does as well [since the irreducibility conditions
imply that P(0, 0) = 1 − p_0 < 1].
Now, to compute ρ_10 = P_1(T_0 < ∞), we note that for a birth and death chain started
in state 1 it must be the case that
1 ≤ T_2 < T_3 < · · · ,
since the chain can only move to adjacent states in a single time unit. In addition, it
follows from this fact that T_n ≥ n − 1, that is, starting from state 1, we cannot reach state
n until at least time n − 1. Thus, T_n → ∞ as n → ∞, so
ρ_10 = P_1(T_0 < ∞) = lim_{n→∞} P_1(T_0 < T_n).
It remains now to determine probabilities of the form u(x) = P_x(T_a < T_b) for a < x < b.
To do this, note that a first step analysis shows that
u(y) = q_y u(y − 1) + r_y u(y) + p_y u(y + 1),    a < y < b,
and we set u(a) = 1 and u(b) = 0. (NOTE: We do this so that the recursion equations for
u(a + 1) and u(b − 1) are correct.) Now, since p_x + r_x + q_x = 1, we can rewrite the above
equation as
u(y + 1) − u(y) = (q_y/p_y){u(y) − u(y − 1)} = (γ_y/γ_{y−1}){u(y) − u(y − 1)},
where we define γ_0 = 1 and
γ_y = (Π_{i=1}^{y} q_i)/(Π_{i=1}^{y} p_i)    for y > 0.
Iterating this relationship yields
u(y + 1) − u(y) = (Π_{i=a+1}^{y} γ_i / Π_{i=a}^{y−1} γ_i){u(a + 1) − u(a)} = (γ_y/γ_a){u(a + 1) − 1}.

Now, if we sum this equation over y = a, a + 1, . . . , b − 1, we find that the left-hand side
telescopes, yielding
u(b) − u(a) = ( Σ_{y=a}^{b−1} γ_y / γ_a ){u(a + 1) − 1},
or, using the fact that u(b) = 0,
(1/γ_a){u(a + 1) − 1} = −1 / Σ_{y=a}^{b−1} γ_y.
So, substituting this expression back into the original one (and changing the summation
index from y to z to avoid confusion) shows that
u(y + 1) − u(y) = −γ_y / Σ_{z=a}^{b−1} γ_z.
Again, summing this expression, this time over y = x, x + 1, . . . , b − 1, yields
u(b) − u(x) = −Σ_{y=x}^{b−1} γ_y / Σ_{z=a}^{b−1} γ_z,
so that
u(x) = Σ_{y=x}^{b−1} γ_y / Σ_{z=a}^{b−1} γ_z.
Thus, finally, we see that
ρ_10 = lim_{n→∞} P_1(T_0 < T_n) = lim_{n→∞} ( Σ_{y=1}^{n−1} γ_y / Σ_{z=0}^{n−1} γ_z )
     = lim_{n→∞} ( 1 − γ_0 / Σ_{z=0}^{n−1} γ_z ) = 1 − 1 / Σ_{z=0}^{∞} γ_z.
Therefore, a birth and death chain with p_x, q_x > 0 for all x ∈ S is recurrent if and only if
Σ_{z=0}^{∞} γ_z = ∞. Recall that simple random walks and the gambler's ruin chains are special
cases of a birth and death chain.
Example 7: Let {X_t}_{t≥0} be a simple random walk on S = {0, 1, . . .} with a completely
reflecting barrier at state 0, so that P(0, 0) = 0, P(0, 1) = 1 and, for x > 0,
P(x, y) = q    if y = x − 1
        = r    if y = x
        = p    if y = x + 1
        = 0    otherwise,
where p + q + r = 1 and 0 < p, q < 1 are the same for all x ≥ 1. In this case, we have
γ_0 = 1 and
γ_y = (q/p)^y.
Thus, the chain is recurrent if p ≤ (1 − r)/2. Notice that the random walk being recurrent
implies that the chain will return to state 0 infinitely often, while if the walk is transient
it will eventually drift off to infinity.
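The recurrence criterion Σ_z γ_z = ∞ can be illustrated numerically by looking at partial sums of γ_y = (q/p)^y for the reflected random walk; the function name and the two parameter choices below are my own illustrations, not from the notes.

```python
import numpy as np

def gamma_partial_sums(p, q, n_terms=200):
    """Partial sums of gamma_y = (q/p)^y for the reflected simple random walk."""
    y = np.arange(n_terms)
    return np.cumsum((q / p) ** y)

# q > p (i.e. p < (1 - r)/2): the partial sums grow without bound, so the chain is recurrent.
print(gamma_partial_sums(0.3, 0.5)[-1])
# q < p (i.e. p > (1 - r)/2): the partial sums converge to 1/(1 - q/p), so the chain is transient.
print(gamma_partial_sums(0.5, 0.3)[-1], 1 / (1 - 0.3 / 0.5))
```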
Branching Chains: Recall that a branching chain counts the number of individuals at
generation t, derived from a population of particles which each independently produce a
random number ξ of offspring during their unit lifetime, where ξ is an integer-valued
random variable having probability mass function p(u). Clearly, state 0 is absorbing for
this chain. Now, if p(1) = 1, i.e. each parent produces exactly one child before expiring,
then clearly every state is absorbing. We now give some results for the non-degenerate
case when p(1) < 1.
The chance that a particular particle's descendants become extinct is just ρ_10 =
P_1(T_0 < ∞) and, since the particles produce offspring independently, we have
P_x(extinction) = ρ_x0 = ρ_10^x.
Now, if p(1) < 1 then it turns out that all states other than state 0 are transient. The
idea here is that if p(0) > 0 then all the states lead to state 0, and otherwise, if p(0) = 0,
the chain cannot decrease in size, and the chain must eventually run off to infinity. So, a
non-degenerate branching chain either goes extinct or goes to infinity in the long run. We
wish to determine when the probability of extinction is unity. To do this, note that the
population will go extinct with probability one when ρ_10 = 1. Now, a first step analysis
shows that
ρ_10 = P(1, 0) + Σ_{y=1}^{∞} P(1, y)ρ_y0 = P(1, 0) + Σ_{y=1}^{∞} P(1, y)ρ_10^y
     = p(0) + Σ_{y=1}^{∞} p(y)ρ_10^y = E[ρ_10^ξ].

Recall that E[t^X] is the probability generating function of the random variable X. It turns
out that if E[ξ] ≤ 1 then the above equation ρ_10 = E[ρ_10^ξ] has no solution in the range
[0, 1), and thus ρ_10 = 1, since clearly 1 = E[1^ξ]. On the other hand, if E[ξ] > 1 then
there will be a solution in the range [0, 1) and thus ρ_10 < 1 in this case. So, we have seen
that if the mean number of offspring is at most 1, the chain will definitely eventually go
extinct; however, if the mean number of offspring is greater than 1 then the chain has a
positive probability of a population explosion to infinity. The proof of this fact is a bit
cumbersome and we will omit it [for those who are interested, the proof is based on the
relationship between the pgf and the mgf of the random variable ξ, and the fact that
the mgf is strictly convex since its second derivative has the form E[ξ^2 e^{tξ}]
and is thus always positive]; however, the result should have intuitive appeal and the
relationship between the expectation of a random variable and its probability generating
function provides the mathematical link between the intuition and the formal proof.
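In practice the extinction probability ρ_10 can be found as the smallest fixed point of ρ = E[ρ^ξ] by iterating the probability generating function; the offspring distribution below is an arbitrary illustration of mine (mean 1.3 > 1, so extinction is not certain).

```python
import numpy as np

# Illustrative offspring pmf p(0), p(1), p(2), p(3) with mean E[xi] = 1.3 > 1.
p = np.array([0.3, 0.2, 0.4, 0.1])
print("mean offspring:", np.sum(np.arange(len(p)) * p))

def pgf(rho, p):
    """E[rho^xi] for an offspring pmf p."""
    return np.sum(p * rho ** np.arange(len(p)))

rho = 0.0
for _ in range(1000):          # iterating rho <- E[rho^xi] converges to the smallest fixed point
    rho = pgf(rho, p)
print("extinction probability:", rho)   # strictly less than 1 since E[xi] > 1
```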
Queuing Chain: We now briefly describe the results for the queuing chain. Recall that
a queuing chain counts the number of people in the queue at time t, when the numbers
of arrivals in the time units are independent integer-valued random variables ξ_i having
probability mass function p(u), and the number of services in any time period is one as
long as there is someone in the queue and zero otherwise. First, we want to determine
when the chain is irreducible. It turns out that the chain is irreducible if 0 < p(0) < 1
and p(0) + p(1) < 1, in other words, if there is some chance of no arrivals as well as some
chance of more than 1 arrival in any time period.
In fact, there are five distinct cases for the queuing chain:
(1) If p(1) = 1, then clearly P(0, 1) = 1 and P(x, x) = 1 for all x ≥ 1. In this case, all
the states except state 0 are absorbing, and state 0 is transient, leading directly to
the absorbing state 1.
(2) If p(0) = 1, then clearly P(0, 0) = 1 and P(x, x − 1) = 1 for all x ≥ 1. In this case,
all the states except state 0 are transient and state 0 is absorbing.
(3) If p(0) = 0 and p(1) < 1, then P(x, y) = 0 if y < x and P(x, y) > 0 for some y > x.
Therefore, the chain is non-decreasing, so that all states are transient and the chain
goes off to infinity eventually.
(4) If 0 < p(0) < 1 and p(0) + p(1) = 1, then P(x, y) = 0 if y > x. Thus, all states except
states 0 and 1 are transient, and states 0 and 1 are recurrent but not absorbing.
So, the chain eventually winds up spending random durations in states 0 and 1
alternately.
(5) If the chain is irreducible, then it can be shown that it is recurrent if E[ξ] ≤ 1
and transient otherwise. Again, this result is a bit difficult to prove but should be
intuitively reasonable.

6.15 Summary

In this segment of the course we have encountered four fundamental concepts regarding
Markov chains:
(1) Markov chains are characterized by their initial distribution and transition function.
Indeed, the Markov property combined with the assumption of stationary transition
probabilities implies that
P(X_n = x_n, . . . , X_0 = x_0) = π_0(x_0)P(x_0, x_1) · · · P(x_{n−1}, x_n).
In addition, if the state space is finite, we may write the transition function as a
matrix. In this instance, the m-step transition function P^m can be found as the
m-th power of the transition matrix, and this leads to the identity π_t = π_0 P^t for the
distribution of X_t.
(2) Markov chains restart from fixed times as well as from hitting times. In other words,
once we know the value of the chain at some time or the value of a hitting time, the
Markov chain evolves from then on as if it were a new version of the chain started
from the appropriate state.
(3) Many important properties of a Markov chain can be determined through the application of a first step analysis. This type of analysis proceeds via conditioning on
where the chain goes on its first step, and uses the law of total probability along
with the above properties of Markov chains to set up recursion equations for the
quantities of interest.
(4) A Markov chain is most usefully decomposed into communication classes which can
be characterized as either transient or recurrent. The basic qualitative long run
behaviour of the chain can then be characterized as at the end of Subsection 6.11
above. Namely, if a Markov chain starts in a recurrent class it will stay there forever,
visiting each state infinitely often. On the other hand, if a Markov chain is started
in a transient class, then its behaviour depends on the cardinality of the set of
transient states. If there are only a finite number of transient states, then the chain
will eventually find a recurrent class and stay there forever as if it had started in this
class. If there are an infinite number of transient states, then either the chain will
hit a recurrent class as in the previous cases, or it will wander through the transient
state space indefinitely.
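As a small numerical companion to point (1) (an added sketch, not part of the notes), the following code illustrates the identity π_t = π_0 P^t for an assumed two-state transition matrix; numpy and the specific numbers are illustrative choices.

```python
import numpy as np

# Illustrative two-state transition matrix (rows sum to 1).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi0 = np.array([1.0, 0.0])           # start in state 0 with probability 1

# Distribution of X_t two ways: repeated one-step updates vs the t-step matrix P^t.
pi_t = pi0.copy()
for _ in range(5):
    pi_t = pi_t @ P                   # pi_{n+1} = pi_n P
print(pi_t)
print(pi0 @ np.linalg.matrix_power(P, 5))   # the same distribution via P^5
```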

7 Stationary Distribution and Equilibrium

7.1 Introduction and Definitions
In this portion of the course, we will investigate the conditions under which a Markov chain will settle down in the long run. In other words, we wish to examine when
π_n → π as n → ∞,
for some pmf π. Of course, we must be a bit careful, since we don't yet know what it means to take the limit of a distribution function. However, the idea should be clear; namely, that the proportion of time that the chain spends in each state y eventually settles down to π(y). Such a distribution is called a steady state distribution. Formally, we will define a pmf π to be a steady state distribution if
lim_{n→∞} π_n(y) = π(y) for all y ∈ S,
regardless of the initial distribution π_0. While this is the most obvious interpretation of π_n → π, it is not the only one, though we will not discuss any others in this course. Note that this is quite a complex statement since it requires that a limit exist for each y in the state space which is the same for any initial distribution, and these limits together must form a probability mass function (i.e. sum to unity). The condition that the limit statement hold regardless of the initial distribution is an important one. To examine its consequences, suppose that x is an absorbing state; then setting π_0(y) = 1_x(y), in other words, the distribution which puts probability one on x and zero everywhere else, would clearly lead to
π_n(y) = 1_x(y) = π_0(y).
So, we would have π_n → 1_x for this particular choice of initial distribution, but this would not generally be a steady state distribution, since it is likely that starting at a different point will lead to a different limit. In fact, this argument shows that a chain with more than one absorbing state cannot have a steady state distribution. The intuition here is that a steady state distribution describes the long run proportion of time that the chain spends in each of its states; in other words, it characterizes some future equilibrium state of the chain. If a chain has multiple absorbing states, it cannot reach any kind of equilibrium since sometimes it will be trapped in one absorbing state while other times it will be trapped in a different absorbing state. Moreover, the requirement that the limit exist and be the same for all initial distributions clearly implies that there can be at most one steady state distribution.
Now, suppose that a steady state distribution exists for a particular Markov chain. A natural conjecture is that a version of the Markov chain started with the steady state distribution as its initial distribution is already in exact equilibrium, i.e.
π_0 = π implies π_n = π for all n ≥ 0.
Any distribution for which the above implication holds is called a stationary distribution for the Markov chain. Note that, unlike the situation for a steady state distribution, it is possible to have many stationary distributions. In fact, when there are multiple absorbing states, each distribution which puts probability one on an absorbing state is a stationary distribution. It is no coincidence that this is precisely a situation in which no steady state distribution exists, and clearly, there is a strong connection between stationary and steady state distributions. We will now examine the fundamental properties of each of these types of distributions.

7.2 Basic Properties of Stationary and Steady State Distributions
To characterize a stationary distribution, note that, since Markov chains restart from fixed times, it is sufficient to check that
π_0 = π implies π_1 = π,
instead of having to check the implication for all n ≥ 0. Formally, we have
Lemma: A pmf π is a stationary distribution for a Markov chain having transition function P if and only if
πP = π,
or equivalently
Σ_{x∈S} π(x)P(x,y) = π(y) for all y ∈ S.
Proof: First, suppose that π is a stationary distribution. Then, by its definition, we know that if π_0 = π then π_n = π for all n ≥ 0. In particular, it must be the case that π_1 = π, so that πP = π_0P = π_1 = π, or more specifically, for any y ∈ S,
Σ_{x∈S} π(x)P(x,y) = Σ_{x∈S} π_0(x)P(x,y) = π_1(y) = π(y).
Next, suppose that πP = π, that is, that
Σ_{x∈S} π(x)P(x,y) = π(y) for all y ∈ S.
Then, clearly, if π_0 = π we have π_1 = π_0P = πP = π. Then, since π_n = π_{n−1}P, a straightforward induction argument shows that it must be the case that π_n = π for all n ≥ 0. Specifically, assume that π_k = π; then for any y ∈ S, we have
π_{k+1}(y) = Σ_{x∈S} π_0(x)P^{k+1}(x,y) = Σ_{x∈S} π(x) Σ_{z∈S} P^k(x,z)P(z,y)
           = Σ_{z∈S} ( Σ_{x∈S} π(x)P^k(x,z) ) P(z,y) = Σ_{z∈S} π(z)P(z,y) = π(y),
where the second-to-last equality uses the inductive hypothesis, Σ_{x∈S} π(x)P^k(x,z) = π_k(z) = π(z).
There is also a useful alternate characterization of steady state distributions via
Lemma: A pmf π is a steady state distribution if and only if
lim_{n→∞} P^n(x,y) = π(y) for all x, y ∈ S.
Proof: First, suppose that π is a steady state distribution. Since the definition of a steady state distribution says that π_n → π regardless of the choice of π_0, let's choose π_0(z) = 1_x(z). Then, for any y ∈ S,
π(y) = lim_{n→∞} π_n(y) = lim_{n→∞} Σ_{z∈S} π_0(z)P^n(z,y) = lim_{n→∞} Σ_{z∈S} 1_x(z)P^n(z,y) = lim_{n→∞} P^n(x,y).
Since the steady state distribution is the same regardless of the initial distribution, the above limit holds for any x ∈ S, which can be shown by simply choosing π_0 to be the appropriate indicator function (i.e. by assuming the chain starts from each state in turn).
Now, to show that the limit statement implies that π is a steady state distribution requires an appeal to the Bounded Convergence Theorem, and thus we will just have to take the result on faith in this course.

In the case where the state space S is finite, we can write this new condition for a steady state distribution as
P^n → 1^T π as n → ∞,
where 1^T is a column vector of length equal to |S| and with each entry equal to 1. Now, what this statement implies is that if the consecutive powers of the transition matrix converge to a matrix whose rows are all equal to one another, then the values in one of these identical rows form the entries in the steady state distribution. One important case where we can be certain that a steady state distribution does exist is the case where S is finite and the transition matrix P is regular; that is, for some positive integer k, all of the entries in the matrix P^k are non-zero, so that all states are joined to all others by paths which have exactly k steps (we shall see why this is true at the end of this Section). One obvious example of a Markov chain with a regular transition matrix (also sometimes referred to as a regular Markov chain) is one for which all the entries of P itself are positive (e.g. the two state Markov chain with 0 < p, q < 1). However, there are many other regular Markov chains. Also, notice that a regular Markov chain is certainly irreducible. However, an irreducible chain is not necessarily regular (e.g. imagine a two-state chain with p = q = 1, so that the chain jumps back and forth between the states, and thus the states lead to each other only in paths of odd length while they lead to themselves only in paths of even length).
Example 7.1. Suppose we have a Markov chain with transition matrix

P =
    0.9   0.1   0     0
    0.9   0     0.1   0
    0.9   0     0     0.1
    0.9   0     0     0.1
We can show that this transition matrix is regular by rewriting the transition matrix schematically as

P =
    +   +   0   0
    +   0   +   0
    +   0   0   +
    +   0   0   +
where a + indicates a non-zero entry. Multiplying this matrix by itself then clearly shows that

P^2 =
    +   +   +   0
    +   +   0   +
    +   +   0   +
    +   +   0   +

and

P^4 =
    +   +   +   +
    +   +   +   +
    +   +   +   +
    +   +   +   +
Thus, the chain is regular since P^4 has all non-zero entries. In fact, a little more calculation shows that

P^4 =
    0.9   0.09   0.009   0.001
    0.9   0.09   0.009   0.001
    0.9   0.09   0.009   0.001
    0.9   0.09   0.009   0.001

Now, suppose we know that for some k ≥ 1, P^k has the above form, that is, all its rows are identical, so that P^k(x_1, y) = P^k(x_2, y) for all x_1, x_2 ∈ S. Then, it follows that P^{k+1} = P^k,
since
P^{k+1}(x,y) = Σ_{z∈S} P(x,z)P^k(z,y) = P^k(x,y) Σ_{z∈S} P(x,z) = P^k(x,y).
Therefore, lim_{n→∞} P^n(x,y) = P^k(x,y) = π(y), where we can write the final equality without reference to x since all the rows of P^k are identical. Thus, we have found a steady state distribution. In addition, in this case, if we think of π as a row of the matrix P^k, then πP is just a row of the matrix P^{k+1} and thus is equal to π itself, i.e. we have πP = π. So, π is a stationary distribution as well.
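A quick numpy check of this example (an added sketch, not part of the notes): computing P^4 directly confirms that its rows are identical and equal to π = (0.9, 0.09, 0.009, 0.001), and that this common row satisfies πP = π.

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.9, 0.0, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.9, 0.0, 0.0, 0.1]])

P4 = np.linalg.matrix_power(P, 4)
print(P4)                              # every row is (0.9, 0.09, 0.009, 0.001)
print(np.allclose(P4, P4[0]))          # rows identical, so the limit is already reached at k = 4
print(np.allclose(P4[0] @ P, P4[0]))   # the common row is also stationary: pi P = pi
```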

Now, as a first glimpse at the relationship between steady state and stationary distributions, we have the following:
Lemma: If a steady state distribution π exists and π′ is a stationary distribution, then π′ = π.
Proof: Picking π_0 = π′, we see that the stationarity condition implies that π_n = π′ for all n ≥ 0, which in turn means that π_n → π′ for this choice of initial distribution. However, the assumption that a steady state distribution exists means that, regardless of the starting distribution, it must be the case that π_n → π. Thus, since limits are unique, we must have π′ = π.
Notice that the existence of a steady state distribution π implies that the only possible stationary distribution is π. Unfortunately, the above lemma does not show that π must be a stationary distribution if it is a steady state distribution, just that if a stationary distribution exists, then it must be equal to π. We will, however, soon show that the steady state distribution is indeed a stationary distribution.
Example 7.2. Recall that a transition matrix has rows which sum to unity. If a matrix has rows which sum to unity and also columns which sum to unity, it is called doubly stochastic, partly because its transpose is also a transition matrix. Suppose we have a Markov chain with state space S = {0, 1, ..., N−1} and a transition matrix P which is doubly stochastic, so that for any x, y ∈ S we have
Σ_{z=0}^{N−1} P(x,z) = 1 and Σ_{z=0}^{N−1} P(z,y) = 1.
Now, let π = (1/N, ..., 1/N), and note that
Σ_{x=0}^{N−1} π(x)P(x,y) = (1/N) Σ_{x=0}^{N−1} P(x,y) = 1/N = π(y).
So, π is a stationary distribution of this chain. Now, suppose that a steady state distribution exists (e.g. perhaps the chain is regular, though there are many other ways for a steady state distribution to be shown to exist); then the steady state distribution must be π. Why should this be true? Well, we have seen that if a steady state and a stationary distribution both exist then they must be the same distribution. We know that π is stationary, thus if there is a steady state distribution, it must be π as well.
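A tiny numerical illustration (my addition): for any doubly stochastic transition matrix the uniform pmf satisfies πP = π; the 3×3 matrix below is just one assumed example.

```python
import numpy as np

# An illustrative 3x3 doubly stochastic matrix: rows and columns both sum to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
pi = np.full(3, 1/3)            # uniform distribution on {0, 1, 2}
print(np.allclose(pi @ P, pi))  # True: the uniform pmf is stationary
```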

The preceding two examples show that we can calculate steady state and stationary distributions in many situations, but we want a more comprehensive treatment. If we let SS = {steady state distributions} and Sta = {stationary distributions} for a particular Markov chain, we have the following summary table so far:

Steady State and Stationary Distributions
Set    Definition                        Alternate Characterisation    Size of Set
SS     for all π_0 : π_n → π             P^n → 1^T π                   at most 1
Sta    π_0 = π implies π_n = π           π = πP                        ??
Now, we expect that SS ⊆ Sta, though we have not yet shown this. Of course, if there is no steady state distribution, then this fact follows trivially from the fact that SS = ∅. However, we need to investigate the situation when |SS| = 1. When this is the case, we have seen that Sta ⊆ SS from a previous lemma. So, if our belief that SS ⊆ Sta is correct, then whenever there is a steady state distribution π we can conclude that SS = Sta = {π}.
Actually, there are three basic tasks which we would like to accomplish:
1. Characterize the set of stationary distributions, at least to the extent that we can
determine |Sta|;
2. Show that if a steady state distribution exists, then it is a stationary distribution
(which of course implies that it is the unique stationary distribution, by our previous
lemma);
3. Describe the long run behaviour of a Markov chain when no steady state distribution
exists.

7.3 Periodicity and Smoothing
Let us first spend some time focussing on the case where no steady state distribution exists. In this case, the limiting procedure P^n → 1^T π fails. What can go wrong? Since the entries of the matrix are constrained to be between 0 and 1, they can't run off to infinity, so the only real way that the P^n's won't settle down is if they have some kind of periodic oscillation.
Example 7.3. Recall the Markov chain, introduced in the previous Chapter (cf. Example 6.2), which had transition matrix

P =
    0      1   0
    1−p    0   p
    0      1   0

For this Markov chain, we showed that P^{2k} = P^2 and that P^{2k−1} = P for all k ≥ 1. Now, since P ≠ P^2 for this chain, it is clearly the case that P^n does not have a limit; rather, the sequence of matrices oscillates between the two values P and P^2.

Of course, this is a rather simple example, and in a more general setting the oscillation may take place over more than two values, but the idea is clear. Now, it turns out that even though the sequence P^n may never settle down to a limit, the sequence P̄_n = (1/n) Σ_{m=1}^n P^m always will. For those who know, this is of course the idea of a Cesàro limit, and it turns out that not only does P̄_n always have a limit, but if the limit of P^n happens to exist as well, then it will be the case that
lim_{n→∞} P̄_n = lim_{n→∞} P^n.
Thus, we will begin by investigating the sequence P̄_n. How will this help us? Well, if π is a stationary distribution, then π = πP̄_n for any n. Note that this follows from the fact that π_n = π_0P^n by noting that if the chain is started in a stationary distribution then π = π_0 = π_n. Thus, for a stationary distribution we have
π = (1/n) Σ_{m=1}^n π = (1/n) Σ_{m=1}^n πP^m = πP̄_n.
Thus, if we learn about P̄_n we will also be able to learn about the stationary distributions, through the alternate characterization of a stationary distribution given above; namely, π = πP̄_n.
Now, the analysis of P̄_n is best examined from the following viewpoint. Recall that P^n(x,y) = P_x(X_n = y) = E_x[1_y(X_n)], so that
P̄_n(x,y) = (1/n) Σ_{m=1}^n P^m(x,y) = (1/n) Σ_{m=1}^n E_x[1_y(X_m)] = E_x[N_n(y)/n],
where N_n(y) = Σ_{m=1}^n 1_y(X_m) is the number of visits that the chain makes to state y in the first n time units. So, N_n(y)/n is just the proportion of the first n time units that the chain spends in state y. Also, note that lim_{n→∞} N_n(y) = N(y).
So, we now want to examine the limiting behaviour of the random quantity N_n(y)/n as well as its expectation. In other words, we want to investigate the long run behaviour of the proportion of time the chain spends in any given state. To do this, we need to examine two distinct cases:
Transient Case. Suppose that y is a transient state.
In this case, it should be intuitively clear that the proportion of time the chain spends in state y will tend to 0, since eventually the chain leaves state y never to return. More formally, note that for any number ε > 0, we have
lim_{n→∞} P_x{N_n(y)/n < ε} = lim_{n→∞} P_x{N_n(y) < nε} = P_x{N(y) < ∞} = 1.
In other words, eventually N_n(y)/n will be smaller than any positive number, with certainty. This means that
P_x{ lim_{n→∞} N_n(y)/n = 0 } = 1.
From here it should seem a very reasonable conclusion that
lim_{n→∞} E_x[N_n(y)/n] = 0,
and another appeal to the Bounded Convergence Theorem proves the result formally. Thus, we see that if y is a transient state, then
lim_{n→∞} P̄_n(x,y) = 0.
Recurrent Case. Next, suppose that y is recurrent.
In this case, one of two things can happen. First, if the chain never hits state y, then clearly the long run proportion of time that the chain spends in state y is 0. However, if the chain does hit state y, then it will return infinitely often. Moreover, it will return on the average every m_y = E_y[T_y] time units. So, the long run frequency of time that the chain spends in state y will be 1/m_y. Formally, we have
Theorem 1: If y is a recurrent state, then
P_x{ lim_{n→∞} N_n(y)/n = 1_{T_y<∞}/m_y } = 1,
where 1_{T_y<∞} is an indicator random variable which takes the value 1 if the random variable T_y is finite and the value 0 otherwise. More importantly, we also have
lim_{n→∞} P̄_n(x,y) = lim_{n→∞} E_x[N_n(y)/n] = ρ_xy/m_y.
Proof: A proof of this fact is quite difficult and requires not only the Bounded Convergence Theorem but the Strong Law of Large Numbers as well. We will not go through the proof in this course, though you should remember the final limit statement as it is quite important. The result should seem reasonable, since if ρ_xy = 0 the chain will never hit state y and thus will clearly spend no time there, while if ρ_xy > 0, the expected long-run proportion of time that the chain spends in state y will be 1/m_y times the chance that the chain ever gets to state y in the first place.
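The limit in Theorem 1 can be watched empirically. The following simulation sketch (my addition, with an assumed two-state chain) computes the occupation fraction N_n(y)/n, which should settle near ρ_xy/m_y = π(y) here, since the chain is irreducible.

```python
import random

def occupation_fraction(P, x0, y, n, seed=0):
    """Simulate n steps of a chain with transition matrix P (list of rows, each a pmf)
    started at x0, and return the fraction of time spent in state y."""
    rng = random.Random(seed)
    states = list(range(len(P)))
    x, visits = x0, 0
    for _ in range(n):
        x = rng.choices(states, weights=P[x])[0]
        visits += (x == y)
    return visits / n

# Illustrative two-state chain with P(0,1) = 0.3 and P(1,0) = 0.4.
P = [[0.7, 0.3], [0.4, 0.6]]
print(occupation_fraction(P, x0=0, y=1, n=100_000))  # close to rho_{01}/m_1 = pi(1) = 3/7 ≈ 0.4286
```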


7.4 Positive and Null Recurrence
Theorem 1 shows that a chain started in a recurrent state y will return to y with an average long-run frequency of 1/m_y returns per time unit. Now, if m_y = ∞, then this return frequency is zero, even though the number of returns to state y is infinite.
[NOTE: Just because P_y(T_y < ∞) = 1 does not mean that m_y = E_y[T_y] < ∞. For example, suppose that P_y(T_y = n) = 6/(πn)². Then, since Σ_{n=1}^∞ 1/n² = π²/6, we can see that P_y(T_y < ∞) = 1 while m_y = E_y[T_y] = Σ_{n=1}^∞ nP_y(T_y = n) = Σ_{n=1}^∞ 6/(π²n) = ∞.]
With this phenomenon in mind, we will say that a recurrent state y is null recurrent if m_y = ∞, and is positive recurrent if m_y < ∞. Notice that in terms of long-run frequency of visits a null recurrent state is much like a transient state. Using this definition, we can now subdivide the set of recurrent states as S_R = S_{R0} ∪ S_{R+}, where S_{R0} is the set of null recurrent states and S_{R+} is the set of positive recurrent states.
We can now characterize a stationary distribution using the following three step procedure:
1. Recall that we have shown that a stationary distribution satisfies the equation π = πP̄_n, that is,
π(y) = Σ_{x∈S} π(x)P̄_n(x,y), for any n ≥ 1.
2. From Theorem 1, we know that
lim_{n→∞} P̄_n(x,y) = { 0 if y is transient; ρ_xy/m_y if y is recurrent }
                    = { 0 if y ∈ S_T ∪ S_{R0} or ρ_xy = 0; ρ_xy/m_y > 0 if y ∈ S_{R+} and ρ_xy > 0 }.
Notice that this implies that a state y is positive recurrent if and only if the limit as n → ∞ of P̄_n(y,y) is strictly positive.
3. Thus, taking limits in part (1) above, we find that if π is a stationary distribution, then it must satisfy
π(y) = lim_{n→∞} Σ_{x∈S} π(x)P̄_n(x,y) = Σ_{x∈S} π(x) lim_{n→∞} P̄_n(x,y)
     = { Σ_{x∈S} π(x)ρ_xy/m_y if y ∈ S_{R+}; 0 otherwise },
where we must be a little careful in switching the limit and summation, but it can be rigorously justified in this case since the terms in the sum are all less than or equal to one. Now, from this we can see directly that if y is a transient state then π(y) = 0, as we might have expected. This fact then allows us to write the above expression as:
π(y) = { (1/m_y) Σ_{x∈S_{R+}} π(x)ρ_xy if y ∈ S_{R+}; 0 otherwise }
     = { (1/m_y) Σ_{x∈[y]} π(x) if y ∈ S_{R+}; 0 otherwise },
where the final equality is arrived at by recalling that if x and y are both recurrent states then ρ_xy = 1 if x and y are in the same communication class and ρ_xy = 0 otherwise.
Now, the final characterization given above appears to give a straightforward linear system which π must solve. However, a closer inspection shows that the number of equations is determined by the number of positive recurrent states, |S_{R+}|, while the number of unknowns is potentially equal to the total number of recurrent states, |S_R|, since the summation is over the elements of [y], which at present we can only guarantee are recurrent. However, it turns out that, like the property of recurrence itself, the property of positive recurrence is also a class property. More formally, we have:
Theorem 2: If x is a positive recurrent state and x → y, then y is also positive recurrent.
Proof: A formal proof is tedious and we will omit it. The idea of the proof is similar to that used to show that recurrence is a class property. Namely, if x is a positive recurrent state and the chain is started in state x, then the chain will return to state x with positive frequency, restarting each time it returns. Now, if x → y then some positive proportion of the returns to state x will have passed through state y. Thus, the chain returns to state y with a positive frequency, implying that state y is positive recurrent.
Corollary 1: Positive and null recurrence are class properties.
Proof: Since the states in a recurrent communication class lead to each other and nowhere else, Theorem 2 shows that positive recurrence must be a class property. Thus, since recurrence is a class property, it must be the case that null recurrence is a class property as well.

So, the characterization of a stationary distribution given above is actually a system of |S_{R+}| equations in as many unknowns, since the communication class [y] of a positive recurrent state y contains only positive recurrent states. As with general recurrence, we have a long list of other useful corollaries to Theorem 2:
Corollary 2: An irreducible Markov chain is either a transient chain, a null recurrent
chain or a positive recurrent chain.
Proof: This follows directly from the fact that transience, null recurrence and positive
recurrence are class properties.

Corollary 3: A Markov chain with finite state space must have at least one positive recurrent state.
Proof: The idea here is the same as the proof that a finite state space Markov chain must have at least one recurrent state. We first suppose that all the states are either transient or null recurrent, so that lim_{n→∞} P̄_n(x,y) = 0 for all x, y ∈ S. However, if this were true, then the fact that Σ_{y∈S} P^m(x,y) = 1 for all m ≥ 1 and all x ∈ S would imply that
1 = (1/n) Σ_{m=1}^n Σ_{y∈S} P^m(x,y) = Σ_{y∈S} P̄_n(x,y).
Thus, taking limits on both sides and using the fact that the state space is finite would show that
1 = lim_{n→∞} Σ_{y∈S} P̄_n(x,y) = Σ_{y∈S} lim_{n→∞} P̄_n(x,y) = Σ_{y∈S} 0 = 0,
which is a contradiction. Thus, there must be at least one positive recurrent state.

Corollary 4: An irreducible Markov chain with finite state space is positive recurrent.
Proof: This is a direct consequence of Corollaries 2 and 3.
Corollary 5: A finite closed communication class is positive recurrent.
Proof: This is just a direct extension of Corollary 4.
Corollary 6: Any Markov chain with a finite state space has no null recurrent states.
Proof: Let y be a recurrent state. Then, since the state space is finite, it must be the case
that [y] is a finite closed communication class. (Recall that a recurrent communication
class cannot lead outside of itself). Thus, [y] is a positive recurrent communication class
by Corollary 5. Thus, all recurrent states must be positive recurrent.


7.5 Existence and Uniqueness of Stationary Distributions
Using the characterizations of the previous section, we now investigate when a Markov chain has a stationary distribution as well as how many stationary distributions it has. To do this, we will adopt a strategy suggested by the characterization in step (3) above; namely, we will deal with a chain by examining one positive recurrent communication class at a time. Recall that the previous characterization showed that if π is a stationary distribution, then π(y) = 0 for any state y which is not positive recurrent.
So, at the outset, we will start by examining irreducible chains. If such a chain is transient or null recurrent, then our characterization shows that any stationary distribution must satisfy π(y) = 0 for all y ∈ S. Of course, this is not a pmf, and thus an irreducible transient or null recurrent Markov chain has no stationary distribution. (Recall that the above characterization merely states that if there is a stationary distribution then it must satisfy the given criterion, but nowhere does it claim that there definitely is a stationary distribution.) The fact that irreducible transient and null recurrent Markov chains have no stationary distribution should not be surprising, since we have seen that the general mode of operation for a transient chain is to eventually wander off to infinity, while a null recurrent chain has a very high chance of long excursions toward infinity before returning (recall that such chains must have an infinite state space).
Now, suppose that we have an irreducible positive recurrent Markov chain; then the following theorem holds:
Theorem 3: An irreducible positive recurrent Markov chain has a unique stationary distribution given by
π(y) = 1/m_y.
Proof: Since the chain is irreducible and recurrent, it must be the case that ρ_xy = 1 for all x, y ∈ S. Thus, any stationary distribution must satisfy
π(y) = Σ_{x∈[y]} π(x)ρ_xy/m_y = (1/m_y) Σ_{x∈S} π(x) = 1/m_y.
Thus, there can be at most one stationary distribution. It therefore remains only to show that π(x) = 1/m_x is indeed a stationary distribution. To show this, we need to demonstrate two things; namely, that this is indeed a pmf [i.e. π(y) ≥ 0 for all y ∈ S and Σ_{y∈S} π(y) = 1] and that it is indeed a stationary distribution [i.e. Σ_{x∈S} π(x)P(x,y) = π(y)]. A fully general demonstration of these facts is quite technical, so we will only give the proof for the case when the state space is finite, though it is true even when S is infinite.
So, suppose |S| < ∞. We know that in general Σ_{y∈S} P^m(x,y) = 1 for all m ≥ 1 and all x ∈ S. Thus
1 = (1/n) Σ_{m=1}^n Σ_{y∈S} P^m(x,y) = Σ_{y∈S} P̄_n(x,y).
Taking limits in the above expression and using the fact that the state space is finite (to justify the interchange of the limit and summation operations), as well as the fact that the chain is irreducible positive recurrent (which implies that ρ_xy = 1 for all x, y ∈ S and that lim_{n→∞} P̄_n(x,y) = 1/m_y), we have
1 = lim_{n→∞} Σ_{y∈S} P̄_n(x,y) = Σ_{y∈S} lim_{n→∞} P̄_n(x,y) = Σ_{y∈S} 1/m_y.
Thus, π is indeed a pmf.
Next, recall that P^{m+1}(x,y) = Σ_{z∈S} P^m(x,z)P(z,y), so that
Σ_{z∈S} P̄_n(x,z)P(z,y) = (1/n) Σ_{m=1}^n Σ_{z∈S} P^m(x,z)P(z,y) = (1/n) Σ_{m=1}^n P^{m+1}(x,y)
                       = (1/n) Σ_{m=1}^{n+1} P^m(x,y) − (1/n)P(x,y) = ((n+1)/n)P̄_{n+1}(x,y) − (1/n)P(x,y).
So, taking limits on both sides of the above equality, and again using the assumptions that S is finite and the chain is irreducible positive recurrent, we have
Σ_{z∈S} (1/m_z)P(z,y) = Σ_{z∈S} lim_{n→∞} P̄_n(x,z)P(z,y) = lim_{n→∞} Σ_{z∈S} P̄_n(x,z)P(z,y)
                      = lim_{n→∞} [ ((n+1)/n)P̄_{n+1}(x,y) − (1/n)P(x,y) ] = 1/m_y.
Thus, π is indeed a stationary distribution.
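In practice, for a finite irreducible chain one usually finds π by solving the linear system πP = π together with Σ_y π(y) = 1, rather than by computing the m_y directly. A minimal sketch of that computation (my addition; the matrix is an illustrative example):

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi together with sum(pi) = 1 for a finite irreducible chain."""
    n = P.shape[0]
    # Stack the equations (P^T - I) pi^T = 0 with a final normalisation row of ones.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = stationary_distribution(P)
print(pi)        # approximately (4/7, 3/7)
print(1 / pi)    # the mean return times m_y = 1/pi(y), as in Theorem 3
```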

Now, what if the Markov chain is not irreducible? To analyze this case, we need a little terminology. Suppose that C is a positive recurrent communication class; then a pmf π is said to be concentrated on C if
π(x) = 0 for all x ∉ C.
We then have the following theorem:
Theorem 4: Let C be a positive recurrent communication class of a Markov chain. Then the distribution
π_C(x) = { 1/m_x if x ∈ C; 0 otherwise }
is a stationary distribution. Moreover, it is the unique stationary distribution among distributions concentrated on C.
Proof: The proof follows along the same lines as the proof of Theorem 3, and we will omit the details.
So, in general then, a Markov chain will have one stationary distribution which is concentrated on each of its positive recurrent communication classes. In fact, we have the following breakdown:
Theorem 5: For a general Markov chain, with S_{R+} the set of positive recurrent states,
(i.) if S_{R+} = ∅ then the chain has no stationary distribution;
(ii.) if S_{R+} consists of a single communication class then there is a unique stationary distribution given as in Theorem 3;
(iii.) if S_{R+} is the union of more than one communication class then there are infinitely many stationary distributions.
Proof: The first two statements follow directly from Theorem 3 and the discussion preceding it. As for the third statement, Theorem 4 shows that there are at least as many stationary distributions as positive recurrent communication classes. In fact, it turns out that if π and π′ are both stationary distributions, then so is the mixture distribution π″ = απ + (1−α)π′ for any 0 ≤ α ≤ 1, a fact which is left as an exercise. Thus, if there are at least two positive recurrent communication classes there must be an infinite number of stationary distributions. Moreover, it can be shown that the only stationary distributions are those which are mixtures of the unique stationary distributions concentrated on each positive recurrent communication class.
As a simple example of the final case in Theorem 5, suppose we have the two-state Markov chain with p = q = 0, so that both states are absorbing. This chain has two communication classes, [0] and [1], and they are clearly positive recurrent since they are closed and finite. So, Theorem 4 shows that there is a unique stationary distribution concentrated on each of these classes, namely the distributions π(x) = 1_0(x) and π(x) = 1_1(x). Also, Theorem 5 suggests that there are an infinite number of stationary distributions. A quick look at the transition matrix shows that
P = [ 1 0 ; 0 1 ] = I,
so that it is clear that any distribution π is stationary, since π = πI for any pmf π.
Also, note that the final case of Theorem 5 shows that any Markov chain with more than one positive recurrent communication class cannot have a steady state distribution, since the existence of a steady state distribution implies that there could be at most one stationary distribution.
So, we have now characterized all the stationary distributions of a Markov chain. Unfortunately, the theorems of this section provide formulae for the stationary distributions in terms of the quantities m_y = E_y[T_y], which we don't yet have an easy way to calculate. Thus, we must still generally rely on the characterization π = πP to actually find the stationary distribution. However, there are some important cases where we can find the quantities m_y.

7.6 Examples of Stationary Distributions
Example 7.4. Recall the Markov chain of Example 6.7. The state space was S = {0, 1, ..., 5} and the transition matrix was

P =
    1/2   1/2   0     0     0     0
    1/3   2/3   0     0     0     0
    0     0     1/8   0     7/8   0
    1/4   1/4   0     0     1/4   1/4
    0     0     3/4   0     1/4   0
    0     1/5   0     1/5   1/5   2/5
For this example we found that the communication classes were [0] = {0,1} = [1], [2] = {2,4} = [4] and [3] = {3,5} = [5]. Also, we saw that the classes [0] and [2] were recurrent, and thus are positive recurrent since they are finite, and that [3] was transient. Let's find all the stationary distributions of this Markov chain. To do this, we need to find each of the unique stationary distributions concentrated on the positive recurrent communication classes. So, for the stationary distribution concentrated on [0] we know that π(2) = π(3) = π(4) = π(5) = 0. Thus, the system π = πP becomes
π(0) = (1/2)π(0) + (1/3)π(1),    π(1) = (1/2)π(0) + (2/3)π(1).
Solving this system [using the fact that π(0) + π(1) = 1] yields π(0) = 2/5 and π(1) = 3/5.
Alternatively, we could calculate the stationary distribution concentrated on [0] using π(0) = 1/m_0 and π(1) = 1/m_1. To do this, we note that
P_0(T_0 = 1) = P_0(X_1 = 0) = 1/2,
P_0(T_0 = 2) = P_0(X_1 = 1, X_2 = 0) = (1/2)(1/3) = 1/6,
P_0(T_0 = n) = P_0(X_1 = 1, ..., X_{n−1} = 1, X_n = 0) = (1/2)(2/3)^{n−2}(1/3) = (1/6)(2/3)^{n−2}.
So, we have
m_0 = Σ_{n=1}^∞ nP_0(T_0 = n) = 1/2 + 2(1/6) + (1/6) Σ_{n=3}^∞ n(2/3)^{n−2}
    = 5/6 + (1/4) Σ_{n=3}^∞ n(2/3)^{n−1}
    = 5/6 + (1/4)[ Σ_{n=1}^∞ n(2/3)^{n−1} − 1 − 4/3 ]
    = 5/6 + (1/4)(9 − 1 − 4/3) = 5/2,
where we have used the fact that
Σ_{n=1}^∞ nr^{n−1} = 1/(1−r)²,
which follows from differentiation of the standard infinite geometric series formula.
Thus, π(0) = 2/5, and π(1) can be determined via a similar calculation or from the fact that any distribution concentrated on [0] must have π(0) + π(1) = 1.
In a like manner, the stationary distribution which is concentrated on the class [2] has π(0) = π(1) = π(3) = π(5) = 0 and thus solves the system
π(2) = (1/8)π(2) + (3/4)π(4),    π(4) = (7/8)π(2) + (1/4)π(4),
which yields a solution of π(2) = 6/13 and π(4) = 7/13.
Of course, we might also attack this problem by calculating m_2 and m_4. Note that for this example, we can calculate quantities such as m_0, m_1, m_2 and m_4 since the communication classes [0] and [2] have only two states each and thus the hitting times have a very straightforward distribution. Unfortunately, the calculations become rapidly more difficult as the number of states in the communication class increases.
Finally, since all the stationary distributions can be represented as mixtures of the unique stationary distributions concentrated on the positive recurrent communication classes, we see that the stationary distributions are given by
π = ( 2α/5, 3α/5, 6(1−α)/13, 0, 7(1−α)/13, 0 ),
for 0 ≤ α ≤ 1. [NOTE: If there had been three distinct positive recurrent communication classes with concentrated stationary distributions π, π′ and π″, then the most general mixture stationary distribution would have had the form
π‴ = α_1π + α_2π′ + (1 − α_1 − α_2)π″,
for 0 ≤ α_1, α_2 ≤ 1 with α_1 + α_2 ≤ 1, with the obvious extension holding for more than three positive recurrent communication classes.]
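A numerical cross-check of this example (my addition): solving πQ = π on each recurrent class, where Q is the transition matrix restricted to that class, reproduces (2/5, 3/5) and (6/13, 7/13); the matrix entries are the ones given above.

```python
import numpy as np

# Transition matrix of Example 7.4, states ordered 0..5.
P = np.array([[1/2, 1/2, 0,   0,   0,   0  ],
              [1/3, 2/3, 0,   0,   0,   0  ],
              [0,   0,   1/8, 0,   7/8, 0  ],
              [1/4, 1/4, 0,   0,   1/4, 1/4],
              [0,   0,   3/4, 0,   1/4, 0  ],
              [0,   1/5, 0,   1/5, 1/5, 2/5]])

def concentrated(P, cls):
    """Stationary distribution concentrated on a recurrent class: solve pi Q = pi, sum(pi) = 1,
    where Q is P restricted to the class (a recurrent class never leads outside itself)."""
    Q = P[np.ix_(cls, cls)]
    n = len(cls)
    A = np.vstack([Q.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

print(concentrated(P, [0, 1]))            # [0.4    0.6   ]  = (2/5, 3/5)
print(concentrated(P, [2, 4]))            # [0.4615 0.5385]  = (6/13, 7/13)
print(1 / concentrated(P, [0, 1])[0])     # m_0 = 2.5, matching the series calculation above
```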

Next we find the stationary distributions for some of our well-known Markov chains:
Irreducible Birth and Death Chain: We have seen that a birth and death chain is irreducible if p_x, q_x > 0 for all x ∈ S (except of course for q_0, which must be zero). In addition, we know that if we define γ_0 = 1 and
γ_y = Π_{x=1}^y (q_x/p_x),
then the chain is recurrent if Σ_{y=0}^∞ γ_y = ∞ and transient otherwise. We now want to know when the chain is positive recurrent, and thus has a unique stationary distribution. Actually, we will attack this problem from the other direction and try to find the stationary distribution, reasoning that if we find one then the chain must have been positive recurrent. Now, the system of equations π = πP becomes
π(0) = π(0)r_0 + π(1)q_1,
π(y) = π(y−1)p_{y−1} + π(y)r_y + π(y+1)q_{y+1},  y ≥ 1.
Now, since p_x + r_x + q_x = 1, the above system reduces to
q_1π(1) − p_0π(0) = 0,
q_{y+1}π(y+1) − p_yπ(y) = q_yπ(y) − p_{y−1}π(y−1),  y ≥ 1,
where the first equation uses the fact that q_0 = 0. Now, iterating the above system easily shows that
q_{y+1}π(y+1) − p_yπ(y) = 0,  y ≥ 0,
and hence that
π(y+1) = (p_y/q_{y+1})π(y),  y ≥ 0.
Iterating this equation then yields
π(y) = [(p_0 ··· p_{y−1})/(q_1 ··· q_y)] π(0) = ν_yπ(0),
where ν_0 = 1 and ν_y = (p_0 ··· p_{y−1})/(q_1 ··· q_y). Note that ν_y = p_0/(p_yγ_y).
So, if the chain is to have a stationary distribution, and thus be positive recurrent, it must be the case that the stationary distribution satisfies the above condition, as well as the condition that Σ_{y=0}^∞ π(y) = 1, of course. Now, suppose that Σ_{y=0}^∞ ν_y < ∞. Then, if we set
π(y) = ν_y / Σ_{x=0}^∞ ν_x,
we have a quantity which satisfies both of these criteria. In other words, in this case the chain is positive recurrent and has the stationary distribution π. On the other hand, if Σ_{y=0}^∞ ν_y = ∞, then any quantity which satisfied the stationary distribution criteria would either be identically zero (i.e. have all components equal to zero) or else would sum to infinity. Thus, no stationary distribution can exist in this case, and we must conclude that the chain is null recurrent.
Finally, as a special case of the birth and death chain, we examine the simple reflected random walk. Recall that for this chain p_x = p for all x ≥ 0 and q_x = 1−p for all x ≥ 1. We saw that for this chain γ_y = {(1−p)/p}^y, so that the chain was recurrent if p ≤ 0.5. We now can see that ν_y = {p/(1−p)}^y, so that the chain is positive recurrent if p < 0.5 and null recurrent if p = 0.5. Suppose that p < 0.5; then
Σ_{y=0}^∞ ν_y = Σ_{y=0}^∞ {p/(1−p)}^y = 1/(1 − p/(1−p)) = (1−p)/(1−2p).
Therefore, the stationary distribution is given by
π(y) = (1−2p)ν_y/(1−p) = (1−2p)p^y/(1−p)^{y+1}.
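A short sketch (my addition) computing the ν_y for a truncated reflected random walk and comparing the normalised result with the closed form above; the truncation level and the choice p = 0.3 are illustrative assumptions.

```python
import numpy as np

p, N = 0.3, 50                      # p < 0.5, state space truncated at N for the computation
nu = np.array([(p / (1 - p)) ** y for y in range(N)])
pi = nu / nu.sum()                  # approximate stationary distribution from the nu_y

# Closed form from the notes: pi(y) = (1 - 2p) p^y / (1 - p)^(y+1)
exact = np.array([(1 - 2 * p) * p ** y / (1 - p) ** (y + 1) for y in range(N)])
print(np.allclose(pi, exact, atol=1e-6))   # True, up to the tiny truncation error
```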

The Queuing Chain: Previously, we saw that the queuing chain was irreducible if p(u), the pmf of the random variables ξ_i which represent the number of arrivals in the queue in each time unit, satisfied p(0) > 0 and p(0) + p(1) < 1. In addition, we saw that an irreducible queuing chain was recurrent if E[ξ_i] ≤ 1 and transient otherwise. We now want to further classify the recurrent case as either positive or null recurrent. To do this, we need the following lemma:
Lemma: The mean return time to state 0 for an irreducible recurrent queuing chain is
m_0 = E_0[T_0] = 1/(1 − E[ξ]),
where ξ is a random variable having pmf p(u).
Proof: The proof is a bit technical, but we give a basic outline of it here.
(i) We define two probability generating functions:
A(s) = Σ_{n=0}^∞ s^n p(n) and B(s) = Σ_{n=0}^∞ s^n P_1(T_0 = n).
Note that A(s) is the probability generating function of the random variable ξ, so that lim_{s→1} A(s) = Σ_{n=0}^∞ p(n) = 1 and
lim_{s→1} A′(s) = lim_{s→1} Σ_{n=1}^∞ ns^{n−1}p(n) = Σ_{n=1}^∞ np(n) = E[ξ].
Also, B(s) is the probability generating function of the random variable T_0 for a chain started in state 1, so that lim_{s→1} B(s) = Σ_{n=0}^∞ P_1(T_0 = n) = P_1(T_0 < ∞) = 1, since the chain is irreducible and recurrent.
(ii) Consider a version of the chain started in state x. Then T_0 is the time that the chain first enters state 0 starting from state x, and we can write
T_0 = T_{x−1} + (T_{x−2} − T_{x−1}) + ... + (T_0 − T_1).
Note that the random variable T_{y−1} − T_y here is just the time for a chain started in state x > y to go from state y to state y−1. Now, since the queuing chain can decrease by at most one step per time unit, it should be clear from the fact that Markov chains restart from hitting times that the random variable T_{y−1} − T_y for a chain started in state x > y is the same as the random variable T_{y−1} for a chain started in state y. In particular, T_0 − T_1 for a chain started in state x has the same distribution as T_0 for a chain started in state 1. In fact, all the random variables T_{y−1} − T_y are just the time for the queue to reduce one person in length, and since the number of arrivals does not depend on how many people are currently in the queue, they all have the same distribution. Finally, not only are they identically distributed, they are also independent. This follows from the Markov property and the fact that the chain can only decrease by one step at a time. Thus, the time it takes to go from state y down to state y−1 does not depend on how the chain got to y, since wherever it went before it got to state y it could not have been in any state less than y.
(iii) Using (ii) above, and the fact that the probability generating function of the sum of independent random variables is just the product of the individual probability generating functions, we note that
Σ_{n=0}^∞ s^n P_x(T_0 = n) = B_x(s) = {B(s)}^x.
So, using a first step analysis we find that
B(s) = Σ_{n=1}^∞ s^n Σ_{y=0}^∞ P(1,y)P_y(T_0 = n−1) = Σ_{y=0}^∞ P(1,y) s Σ_{n=1}^∞ s^{n−1}P_y(T_0 = n−1)
     = s Σ_{y=0}^∞ p(y)B_y(s) = s Σ_{y=0}^∞ p(y){B(s)}^y = sA{B(s)},
where we have used the fact that P(1,y) = p(y) for a queuing chain.
(iv) Taking derivatives in the equation of (iii) above yields
B′(s) = A{B(s)} + sB′(s)A′{B(s)},
which then leads to
B′(s) = A{B(s)} / (1 − sA′{B(s)}).
So, taking the limit as s → 1 and using the facts about A(s), B(s) and A′(s) given in (i) above shows that
lim_{s→1} B′(s) = 1/(1 − E[ξ]).
(v) Finally, we note that for the queuing chain, P(1,x) = P(0,x), since if there is one person in the queue to start, that person will definitely be served and removed from the queue, so both transitions require the arrival of x new persons. Thus,
P_1(T_0 = n) = Σ_{x=1}^∞ P(1,x)P_x(T_0 = n−1) = Σ_{x=1}^∞ P(0,x)P_x(T_0 = n−1) = P_0(T_0 = n).
So, we can rewrite B(s) = Σ_{n=0}^∞ s^n P_1(T_0 = n) = Σ_{n=0}^∞ s^n P_0(T_0 = n), so that
lim_{s→1} B′(s) = lim_{s→1} Σ_{n=1}^∞ ns^{n−1}P_0(T_0 = n) = Σ_{n=1}^∞ nP_0(T_0 = n) = E_0[T_0].
(vi) Combining the results of (iv) and (v) then shows that
m_0 = E_0[T_0] = 1/(1 − E[ξ]),
as required.
As a result of this lemma, we see that state 0 (and thus the whole chain, since we are dealing with the irreducible queuing chain here) is null recurrent if E[ξ] = 1 and is positive recurrent if E[ξ] < 1.
So, when E[ξ] < 1 we know that there is a unique stationary distribution. Unfortunately, it is still quite difficult to find it explicitly. However, we do know that for the stationary distribution we must have π(0) = 1/m_0 = 1 − E[ξ] > 0.
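The formula m_0 = 1/(1 − E[ξ]) can be checked by simulation (an added sketch; the arrival pmf below is an assumed example with E[ξ] = 0.6, so m_0 should be about 2.5).

```python
import random

def mean_return_time_to_0(p, n_returns=20_000, seed=0):
    """Average time between successive visits to state 0 for the queuing chain
    with arrival pmf p = [p(0), p(1), ...], started at 0."""
    rng = random.Random(seed)
    support = list(range(len(p)))
    x, steps, returns, total = 0, 0, 0, 0
    while returns < n_returns:
        x = max(x - 1, 0) + rng.choices(support, weights=p)[0]
        steps += 1
        if x == 0:                    # a completed return to state 0
            total += steps
            steps = 0
            returns += 1
    return total / n_returns

p = [0.5, 0.4, 0.1]                  # E[xi] = 0.6, so m_0 = 1/(1 - 0.6) = 2.5
print(mean_return_time_to_0(p))      # roughly 2.5
```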
7.7 Convergence to the Stationary Distribution
So, we have seen that P̄_n always settles down, and we have used this to characterize the stationary distributions of Markov chains. In particular, we have seen that for an irreducible positive recurrent chain,
lim_{n→∞} P̄_n(x,y) = π(y) for all x, y ∈ S,
or in matrix notation P̄_n → 1^Tπ, where π is the unique stationary distribution. What we now want to investigate is when the stronger result P^n → 1^Tπ holds, so that we can conclude that π is a steady state distribution. Note that when this stronger result does hold, we can conclude that the resultant steady state distribution must be a stationary distribution as well.
Theorem 6: If a Markov chain has a steady state distribution π, then this distribution is the unique stationary distribution.
Proof: We already know that if a stationary distribution exists, then it must be unique and equal to the steady state distribution. So, we need only show that a stationary distribution does exist. Now, since we have assumed that there is a steady state distribution, we know that P^n → 1^Tπ, and thus we also know that P̄_n → 1^Tπ, or more precisely
lim_{n→∞} P̄_n(x,y) = π(y) for any x, y ∈ S.
This follows from the fact that if a sequence has a limit, then the associated Cesàro sums must have the same limit.
From Theorem 1, we know that
lim_{n→∞} P̄_n(x,y) = { ρ_xy/m_y for y ∈ S_{R+}; 0 otherwise }.
Therefore, we must have
π(y) = { ρ_xy/m_y for y ∈ S_{R+}; 0 otherwise } for any x, y ∈ S.
Thus, there must be at least one positive recurrent state in S, since otherwise we would have π(y) = 0 for all y ∈ S, which contradicts the fact that π is a steady state distribution (i.e. distributions cannot have all entries equal to zero, since the entries must sum to unity). So, for this positive recurrent state y we have ρ_xy = ρ_yy = 1 for any x ∈ S. This implies that if x is any other recurrent state then x ∈ [y]. In other words, the recurrent states are all positive recurrent and consist of a single communication class. Therefore, by Theorem 5, there is a stationary distribution, and the desired result then follows.
Note that an interesting consequence of the proof is that if a chain has any null recurrent states, it cannot have a steady state distribution. However, we still haven't answered the question of when a steady state distribution exists. We might start by examining the cases where there is a unique stationary distribution, for example, irreducible positive recurrent chains. Unfortunately, we have already seen that even in the case of an irreducible positive recurrent chain, we may have problems of periodicity. What is at the root of this problem? Well, it turns out that the problem of periodicity arises when there are too many 0s in the transition matrix, in some sense. Recall that a transition function is called regular if there is some value k such that for any x, y ∈ S (where S is assumed finite)
P^n(x,y) > 0 for all n ≥ k.
[NOTE: In our original definition of regular we only required all the entries of P^n(x,y) to be strictly positive for n = k, but clearly once this is true a simple schematic diagram shows that P^n(x,y) must be strictly positive for all n > k as well.]
Now, it turns out that the key issue is not whether all states can reach each other in paths of the same length, but whether all states can return to themselves in the same number of steps. In other words, we will want to know when there is a value k such that, for all x ∈ S, we have
P^n(x,x) > 0 for all n ≥ k.
Of course, one sufficient condition for this to be true is that P(x,x) > 0, i.e. the diagonal elements of the transition matrix are non-zero, but this is certainly not a necessary condition.
To examine the idea of periodicity more closely, we must define some new concepts. First, let I_x be the set of possible return times to state x, in other words
I_x = {n ≥ 1 : P^n(x,x) > 0}.
Note that if x is a state such that ρ_xx > 0, then I_x is not an empty set. Next, we define d_x to be the greatest common divisor of the set I_x, and we will call d_x the period of state x. Recall that the greatest common divisor of a set is the largest integer which evenly divides all the elements in the set. Clearly, the greatest common divisor is at least 1, but it may be more. For example, the greatest common divisor of the set of even numbers is 2. On the other hand, the greatest common divisor of the set of odd numbers is 1. In fact, the greatest common divisor of any set which contains two relatively prime numbers clearly must be 1. In particular, any two prime numbers are relatively prime, so 1 is the greatest common divisor of any set which contains more than one prime number.
So, let's investigate the period of a state x. First, it is easily seen that the following two properties hold:
(i.) 1 ≤ d_x ≤ min I_x;
(ii.) if there is some k such that P^n(x,x) > 0 for all n ≥ k, then d_x = 1.
Property (i) is rather obvious, since 1 divides all numbers and a number cannot evenly divide a number which it is larger than. The second property follows from the fact that the greatest common divisor of a set which contains more than one prime number must be 1. We now want to investigate how the periods of different states are related to one another, and we arrive at the following
Theorem 7: If x ↔ y, then d_x = d_y. That is, the periods of all members of a communication class are the same.
Proof: Since x → y and y → x, we know that there must be two numbers n_1 and n_2 such that
P^{n_1}(x,y) > 0 and P^{n_2}(y,x) > 0.
So, it must be the case that
P^{n_1+n_2}(x,x) ≥ P^{n_1}(x,y)P^{n_2}(y,x) > 0,
and thus n_1 + n_2 ∈ I_x. In other words, d_x divides n_1 + n_2 evenly.
Now, for any n such that P^n(y,y) > 0 (i.e. for any n ∈ I_y), we have
P^{n_1+n+n_2}(x,x) ≥ P^{n_1}(x,y)P^n(y,y)P^{n_2}(y,x) > 0,
so that d_x divides n_1 + n + n_2. However, since d_x divides n_1 + n_2, it must also divide n. Therefore, d_x divides all the elements of I_y, implying that d_x ≤ d_y, since d_y is the largest number which divides all the elements of I_y.
Turning the argument around then shows that we also must have d_y ≤ d_x, which means that d_x = d_y.

This immediately leads to
Corollary: All the states in an irreducible Markov chain have the same period, d_x = d for all x ∈ S.
Based on this, if an irreducible Markov chain has d > 1 it is said to be periodic. Otherwise, if d = 1, the chain is called aperiodic.

Example 7.5. Suppose we have an irreducible birth and death chain (i.e. p_x, q_x > 0 for x ≥ 1 and p_0 > 0); let's determine the period of such chains.
Solution: First, suppose that for some x there is an r_x > 0. Then for this x we have P(x,x) > 0, which means that d_x = 1. Therefore, since the chain was assumed to be irreducible, it must be aperiodic.
On the other hand, if r_x = 0 for all x ∈ S, then the chain can only go from an even numbered state to an odd numbered state or vice versa in one step. Thus, it can only return to itself in an even number of steps. Thus, the period of the chain must be either 2 or some multiple of 2. However, since irreducibility implies that p_0, q_1 > 0, we see that
P^2(0,0) = p_0q_1 + r_0² = p_0q_1 > 0.
Thus, since 2 ∈ I_0, it must be that d = d_0 = 2. In particular, the Ehrenfest chain is periodic with period 2.

As we shall shortly see, for the most part it is the aperiodic irreducible chains which have steady state distributions. Before we get to this, however, let's examine the connection between aperiodicity and our previous ideas about regularity. It turns out that
Proposition: If an irreducible Markov chain has some state x for which P(x,x) > 0, then the chain is aperiodic.
Proof: For state x, we know that 1 ∈ I_x, which means that d_x = 1. But the chain is irreducible, so its period must be d = 1, i.e. it is aperiodic.
So, if an irreducible Markov chain has a transition matrix in which at least one diagonal element is strictly positive, it must be aperiodic. The idea here is that we can't have any periodic behaviour in the chain if we are able to stay put at some state. The reason behind this is that any periodic structure could be circumvented by simply going to this special state (which we can always do since the chain is irreducible) and staying there for some length of time. Since a Markov chain restarts from hitting times, this would have the effect of shifting the periodic nature of the chain by some amount while also requiring the periodic structure to remain intact under this shift. Clearly, this cannot happen.
Unfortunately, while the above proposition gives a nice sufficient condition, it is not a necessary one. For example:
Example 7.6. Suppose we have a four-state Markov chain with transition matrix

P =
    0   p   q   0
    1   0   0   0
    0   0   0   1
    1   0   0   0

Some simple schematic diagram work shows that

P^2 =
    +   0   0   +
    0   +   +   0
    +   0   0   0
    0   +   +   0

P^4 =
    +   +   +   +
    +   +   +   0
    +   0   0   +
    +   +   +   0

P^6 =
    +   +   +   +
    +   +   +   +
    +   +   +   +
    +   +   +   +

So, this chain is regular, and so it is clearly irreducible. Also, it has a finite state space, and thus is positive recurrent. In addition, since P^6 having all positive entries implies that P^7 does as well, and 6 and 7 are relatively prime, it must be the case that the period of all the states is d = 1 (since this is the largest number which divides both 6 and 7). Therefore, this chain is aperiodic even though P(x,x) = 0 for all x ∈ S.
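Periods can also be computed mechanically: collect the return times n ≤ n_max with P^n(x,x) > 0 and take their greatest common divisor. The sketch below (my addition, with p = q = 1/2 as an illustrative choice) does this for the matrix of this example and reports period 1 for every state, even though the diagonal of P is zero.

```python
import numpy as np
from math import gcd
from functools import reduce

def periods(P, n_max=20):
    """Period of each state, approximated as gcd{ n <= n_max : P^n(x,x) > 0 }."""
    n_states = P.shape[0]
    return_times = [[] for _ in range(n_states)]
    Pn = np.eye(n_states)
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        for x in range(n_states):
            if Pn[x, x] > 1e-12:
                return_times[x].append(n)
    return [reduce(gcd, ts) if ts else 0 for ts in return_times]

p = q = 0.5
P = np.array([[0, p, q, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])
print(periods(P))   # [1, 1, 1, 1]: aperiodic despite P(x, x) = 0 for every x
```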

However, this example does lead us to the following result:
Result: A regular chain is irreducible, positive recurrent and aperiodic.
Proof: We have already seen that a regular chain is irreducible. Thus, since regularity deals with Markov chains having finite state space, a regular chain must be positive recurrent. Now, suppose there is a value k such that P^n(x,y) > 0 for all n ≥ k. Then, clearly, for any state x, we must have {k, k+1, ...} ⊆ I_x. So, I_x will certainly contain two relatively prime integers, implying that d_x = 1, so that any state x, and therefore the entire chain, is aperiodic.

Now, let's consider what happens to P^n(x,x) for an irreducible periodic chain. If the period is d > 1, then we know that P^n(x,x) = 0 for any n which d does not divide evenly. What if d does divide n? It turns out that for an irreducible periodic chain with period d, there will be some number n_0 such that, for any n ≥ n_0, we have P^{nd}(x,x) > 0. In fact, we have the following important theorem:
Theorem 8: Let {X_t}_{t≥0} be an irreducible positive recurrent Markov chain having unique stationary distribution π.
(a) If the chain is aperiodic, then
lim_{n→∞} P^n(x,y) = π(y) for all x, y ∈ S,
i.e. π is the steady state distribution.
(b) If the chain is periodic with period d, then for any pair of states x, y ∈ S there is some integer r (which will generally depend on which x and y we are dealing with) such that
lim_{m→∞} P^{md+r}(x,y) = dπ(y),
and P^n(x,y) = 0 for any n which cannot be written as n = md + r for some integer m. In particular, if we choose x = y then the appropriate value of r is 0, so that
lim_{m→∞} P^{md}(x,x) = dπ(x).

Proof: The proof of this result is quite involved, particularly for part (b). We will give only a basic outline of the proofs here.
Proof of (a): The proof is based on a method known as coupling. The idea is to examine two independent versions of the chain, one started in the stationary distribution and one started in a specific state x.
So, let's consider the stochastic process {Y_t}_{t≥0} where Y_t = (X_t, X_t′), where X_t and X_t′ are independent versions of the original Markov chain, the first started in the stationary distribution and the second started at some specific state x. The state space of {Y_t}_{t≥0} is just S², the set of ordered pairs (x, x′) where x and x′ are any two states in S, the state space of the original Markov chain. Now, it should be clear that {Y_t}_{t≥0} is itself a Markov chain, since where the chain will go at time t+1 certainly depends only on where it is at time t and not on how it got there. In addition, the assumed independence of the two versions of the chain shows that the transition function of the new chain is
P_Y{(x,x′),(y,y′)} = P(x,y)P(x′,y′),
where P is the transition function of the original chain. Now, it should seem reasonable (and indeed it can be shown) that since the original chain was assumed to be irreducible, positive recurrent and aperiodic, so is the new chain. In fact, the unique stationary distribution can be seen to be π_Y{(x,y)} = π(x)π(y), since
Σ_{(x,x′)∈S²} π_Y{(x,x′)}P_Y{(x,x′),(y,y′)} = Σ_{x∈S} Σ_{x′∈S} π(x)π(x′)P(x,y)P(x′,y′)
   = ( Σ_{x∈S} π(x)P(x,y) )( Σ_{x′∈S} π(x′)P(x′,y′) ) = π(y)π(y′) = π_Y{(y,y′)},
that is, π_Y = π_Y P_Y. Now, pick a state a ∈ S. We want to examine the hitting time T_{(a,a)}, and since the chain is recurrent we know that it will eventually hit every state, so
P{T_{(a,a)} < ∞} = 1.
Next, let T be the hitting time
T = min{t > 0 : X_t = X_t′};
in other words, T is the first time that the chain {Y_t} is in a state of the form (a,a) (i.e. the first time both of the independent versions of the original chain which make up the new chain are in the same state). Clearly, T ≤ T_{(a,a)}, so that
P(T < ∞) = 1,
or equivalently, lim_{t→∞} P(T > t) = 0.
The key fact in the proof derives from the fact that
P(X_t = y, T ≤ t) = P(X_t′ = y, T ≤ t) for any y ∈ S.
This fact should seem intuitively reasonable, since after time T the two chains should proceed with identical probability structure, since at time T they were both in the same state. Thus, we can write
P(X_t = y) = P(X_t = y, T ≤ t) + P(X_t = y, T > t) = P(X_t′ = y, T ≤ t) + P(X_t = y, T > t)
           ≤ P(X_t′ = y) + P(T > t),
and similarly, P(X_t′ = y) ≤ P(X_t = y) + P(T > t). Thus,
|P(X_t = y) − P(X_t′ = y)| ≤ P(T > t).
Now, since we assumed that {X_t}_{t≥0} was started in the stationary distribution, it follows that P(X_t = y) = π_t(y) = π(y). On the other hand, since {X_t′}_{t≥0} was started in state x, P(X_t′ = y) = P^t(x,y). So, taking limits in the above expression shows that
0 = lim_{t→∞} P(T > t) ≥ lim_{t→∞} |P(X_t = y) − P(X_t′ = y)| ≥ lim_{t→∞} {P(X_t = y) − P(X_t′ = y)} = π(y) − lim_{t→∞} P^t(x,y).
A quick rearrangement gives the desired result, P^n → 1^Tπ.


Proof of (b): We give only a heuristic argument for the case where x = y. Imagine a new chain, {Y_t}_{t≥0}, which is just every dth step of the original chain, i.e. Y_t = X_{td}. Then, it should be clear that {Y_t}_{t≥0} is a Markov chain with transition matrix P_Y = P^d. Moreover, this new chain is still positive recurrent, but is now aperiodic, since the length of any possible return path to a state x for the new chain is equal to the length of the path in the old chain divided by d, implying that the new set of possible return times to a state x must have greatest common divisor 1; otherwise it would contradict the fact that d was the greatest common divisor of the original set of possible return times (i.e. if the greatest common divisor for the new set of possible return times were d′ > 1, then dd′ > d would divide all the elements in the original set of possible return times). Therefore, from above, we know that the new chain will converge to its unique stationary distribution if it is still irreducible. If it is not irreducible, we must deal with each communication class of the new chain separately, but we will still get the required result.
It remains, then, only to find the stationary distribution of the new chain. To do this, recall that the stationary distribution for the original chain could be written as
π(y) = 1/m_y,
where m_y was the expected return time to state y. Now, for the new chain, all return times are simply divided by d, so the expected return time to state y for the new chain is just m_y^Y = m_y/d. Thus, the stationary distribution for the new chain is just
π_Y(y) = 1/m_y^Y = d/m_y = dπ(y).
Therefore, P^{md}(x,x) = P_Y^m(x,x) → π_Y(x) = dπ(x).

Example 7.7. As an example of the last part of this theorem, recall the Markov chain
with transition function

        (  0    1   0 )
  P  =  ( 1−p   0   p )
        (  0    1   0 )

It is easily shown that this chain is irreducible and positive recurrent. Also, we saw that
P^{2k} = P^2 and P^{2k−1} = P for all k ≥ 1, where

         ( 1−p   0   p )
  P^2 =  (  0    1   0 )
         ( 1−p   0   p )

so it is easy to see that the chain is periodic with period d = 2. Now, the system π = πP
is
π(0) = (1−p)π(1),   π(1) = π(0) + π(2),   π(2) = pπ(1),
and a quick calculation shows that the stationary distribution is
π = ( (1−p)/2 , 1/2 , p/2 ).
So, we can see that

                                       1−p   x = 0
  lim_{k→∞} P^{2k}(x, x) = P^2(x, x) =  1    x = 1
                                        p    x = 2

= 2π(x).
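As a quick numerical illustration (a sketch, not part of the original notes), the limit P^{2k}(x, x) → 2π(x) can be checked directly by raising the transition matrix to a large even power; the value p = 0.3 is an arbitrary choice.

```python
import numpy as np

p = 0.3  # arbitrary illustrative value
P = np.array([[0.0, 1.0, 0.0],
              [1 - p, 0.0, p],
              [0.0, 1.0, 0.0]])

pi = np.array([(1 - p) / 2, 0.5, p / 2])   # stationary distribution found above

P_even = np.linalg.matrix_power(P, 1000)   # P^{2k} for a large k
print(np.diag(P_even))   # ≈ [1-p, 1, p]
print(2 * pi)            # the claimed limit d*pi(x) with d = 2
```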

7.8   Summary

In this section of the course, we have discussed the long-run behaviour of Markov chains. In
particular, we first discussed stationary distributions, which are defined by the condition
that if a chain is started in such a distribution then it remains in that distribution forever
after (i.e. π is a stationary distribution if π_0 = π implies that π_n = π for all n ≥ 0) and are
most usefully characterized by the system of equations π = πP. Next, we discussed steady
state distributions, which are defined as the limit of π_n regardless of π_0, if such a limit
exists (i.e. π is a steady state distribution if, for all choices of π_0, π_n → π) and are most
usefully characterized by the matrix formula P^n → 1π^T. In the course of this section, we
have uncovered some important concepts and theorems which can be summarized in the
following five points:
(1) The recurrent states of a Markov chain can be further broken down into positive
recurrent and null recurrent states according to whether the mean return time to
the state, m_y = E_y[T_y], is finite or infinite. It also turns out that, like recurrence and
transience, null and positive recurrence are communication class properties. Given
this fact, we then saw that any finite state Markov chain must have at least one
positive recurrent state, and that a finite state irreducible chain must be positive
recurrent.


(2) A full characterization of the stationary distributions came from Theorems 3 and
5. Namely, if a Markov chain had no positive recurrent states (which we saw could
not happen if the state space was finite) then it had no stationary distributions.
If a Markov chain had a set of positive recurrent states which was composed of a
single communication class, then it had a unique stationary distribution, and this
distribution had the form π(y) = 1/m_y for the positive recurrent states y and π(y) =
0 elsewhere. Finally, if a Markov chain had a set of positive recurrent states which
was composed of more than one communication class then it had an infinite number
of stationary distributions. Moreover, all of these infinite stationary distributions
could be characterized as mixtures of the unique stationary distributions which were
concentrated on each of the positive recurrent communication classes individually.
So, if π^(n) is the unique stationary distribution concentrated on the nth positive
recurrent communication class, then it has the form π^(n)(y) = 1/m_y on this class
and π^(n)(y) = 0 elsewhere, and any general stationary distribution of the Markov
chain can be written as
π = Σ_{n=1}^N α_n π^(n),
where N is the number of positive recurrent communication classes in the chain (and
may be infinite), and the α_n's are constants such that 0 ≤ α_n ≤ 1 and Σ_{n=1}^N α_n = 1.
(3) The set of possible return times for a state x was defined as I_x = {n > 0 : P^n(x, x) >
0} and the greatest common divisor of this set, d_x, was defined as the period of the
state. We saw that all states in a communication class had the same period, and
thus an irreducible chain could be referred to as either periodic, if the period of all
the states was d_x = d > 1, or aperiodic, if d_x = d = 1.
(4) A finite state Markov chain was called regular if its multi-step transition matrices
were eventually strictly positive. In other words, a chain was regular if there was a
value k such that P^n(x, y) > 0 for all x, y ∈ S and for all n ≥ k. We saw that a
regular chain was irreducible, positive recurrent and aperiodic.
(5) Finally, we characterized when a chain had a steady state distribution in Theorem
8. In particular, if a chain was irreducible, positive recurrent and aperiodic, then
the unique stationary distribution π [see point (2) above] was also a steady state
distribution, so that
lim_{n→∞} P^n(x, y) = π(y)
for all x, y ∈ S.
Actually, to be precise, the chains which have steady state distributions are those
for which S_R = S_{R+} = [y] for some recurrent state y with d_y = 1 and for which
ρ_{x[y]} = 1 for all x ∈ S_T; in other words, chains for which the set of recurrent states
contains a single communication class of states which are positive recurrent and have
period 1 and for which it is certain that, starting from any of the transient states,
this positive recurrent communication class will eventually be entered [NOTE: a
sufficient condition for this last criterion to hold is that there are only a finite
number of transient states, but this is not a necessary condition]. Such chains are the
aperiodic irreducible positive recurrent ones, possibly augmented by some transient
states (which are of no importance in long-run behaviour). Lastly, if a Markov chain
was irreducible, positive recurrent and periodic with period d, then we saw that
lim_{m→∞} P^{md}(y, y) = dπ(y)
for all y ∈ S.

Part III: Pure Jump Processes

8   Pure Jump Processes

8.1   Definitions

In the previous sections of the course, we have studied the properties of special stochastic
processes called Markov chains in which both the state space, S, and the time index
set, T, were discrete. However, in most situations of interest, it is rare that time can
be reasonably described as discrete. Thus, we wish to now examine some special types
of stochastic processes in which the state space, S, will still be assumed to be finite or
countably infinite, but the time index set, T, will be allowed to be a continuous set. In
particular, we will generally work with the case where T = [0, ∞). Stochastic processes
with discrete state space but continuous time index set are often called jump processes.
Consider a process which starts in some random state x_0 at time 0, and suppose that
at some random time τ_1 > 0 the process jumps to a randomly chosen new state x_1 ≠ x_0.
Similarly, τ_n represents the random time that the process jumps out of state x_{n−1} and
into the new state x_n. Now, we will allow the possibility that τ_n = ∞, in other words
a process may never leave the state it is in. (Note that if τ_n = ∞ then τ_m = ∞ for all
m ≥ n.)
We now make some basic observations. First, notice that, unlike the case for Markov
chains, we do not allow a continuous time process to jump to the state that it is already
in, since this would be the same as just ignoring this particular jump and renumbering
the subsequent τ's. We will use the notation X(t) to indicate the state of the process at
time t, and thus we can write the path of the process as

          x_0   0 ≤ t < τ_1
  X(t) =  x_1   τ_1 ≤ t < τ_2
          x_2   τ_2 ≤ t < τ_3
          ...   ...
Now, it appears that we have defined the process X(t) in this way for all times t ∈ T =
[0, ∞). However, this is not necessarily the case. For example, suppose that we imagine
an idealized bouncing ball and let X(t) denote the number of bounces that the ball has
made up to time t. We will assume that the ball has made some random number of
bounces x_0 at time t = 0 (i.e. the time we first start watching the ball); that the time
until the next bounce is some random quantity τ_1; and, that the time between any pair
of consecutive bounces is half of the time between the preceding pair, in other words,
τ_n − τ_{n−1} = (1/2)(τ_{n−1} − τ_{n−2}). Now, the question is whether the above characterization
determines the process for all time, that is, can we say what state the process will be in
at any time in the future. Well, using the recursion equation for the τ's, we can determine
that
τ_n − τ_{n−1} = (1/2)(τ_{n−1} − τ_{n−2}) = (1/2)^2 (τ_{n−2} − τ_{n−3}) = (1/2)^{n−1} τ_1,
and thus, the nth bounce will occur at time
τ_n = (τ_n − τ_{n−1}) + (τ_{n−1} − τ_{n−2}) + ... + (τ_2 − τ_1) + τ_1 = Σ_{k=1}^n (1/2)^{k−1} τ_1 = (2 − (1/2)^{n−1}) τ_1.
Now, our above definition of a continuous time process as X(t) = x_i for t ∈ [τ_i, τ_{i+1}) clearly
only defines the process from time t = 0 up to time t = lim_{n→∞} τ_n. In the bouncing ball
example, we clearly have lim_{n→∞} τ_n = 2τ_1, and thus we have only defined the process
up to the time t = 2τ_1. Of course, this is a random time, but nonetheless, it will rarely
if ever be infinity. Processes for which the probability of the event {lim_{n→∞} τ_n < ∞} is
non-zero are said to be explosive. We will be interested here only in jump processes
which are non-explosive or pure, i.e. jump processes for which lim_{n→∞} τ_n always tends
towards infinity. For such processes, our initial description of X(t) = x_i for t ∈ [τ_i, τ_{i+1})
is sufficient to fully describe the path of the process for all time.
Now, to fully characterize a pure jump process, we need to describe the distributions
of the random components which constitute the process. As a start, we need to be given
the initial distribution of the process, π_0(x) = P{X(0) = x}, and the transition function
P_{xy}(t) = P_x{X(t) = y}. Notice, however, that the transition function for a jump process
is inherently more complex than that for a Markov chain, since it is a family of functions
rather than just a matrix. However, many of the basic principles which governed the
behaviour of Markov chains carry over in a relatively obvious way. For example,
P{X(t) = y} = Σ_{x∈S} π_0(x) P_{xy}(t).
In other words, the chance of being in state y at time t is just the sum of the chances
of starting in some state x and then going from state x to state y at time t. As a final
observation, note that the event {X(t) = y} does not mean that the process jumped into
state y exactly at time t, it simply means that at time t the process is currently in state
y. In other words, at some time prior to t the process jumped into state y and does not
jump again until after time t.


Again, as in the case of processes with discrete time index set, the class of general pure
jump processes is too difficult to deal with all at once. In particular, the initial distribution
and transition function of a jump process are not enough to calculate a general probability
P{X(t_1) = x_1, ..., X(t_n) = x_n},
unless the jump process satisfies an analog to the Markov property (recall that the initial
distribution and transition matrix were only enough to fully characterize a discrete time
process if it was a Markov chain). The appropriate analog to the Markov property for
pure jump processes states:
For any times 0 ≤ s_1 < ... < s_n < s < t ∈ T and any states x_1, ..., x_n, x, y ∈ S,
P{X(t) = y | X(s_1) = x_1, ..., X(s_n) = x_n, X(s) = x} = P_{xy}(t − s).
In other words, given the present state of the process at time s, the future of the process
progresses like a new version of the process started in the present state and ignores the
past.
We will restrict ourselves to dealing with jump processes which satisfy this property.
Such processes are sometimes called Markov pure jump processes, and they satisfy many
similar properties to those of Markov chains. For example, it can be shown that for a
Markov pure jump process, if 0 = t_0 ≤ t_1 ≤ ... ≤ t_n ∈ T and x_0, x_1, ..., x_n ∈ S then
P_{x_0}{X(t_1) = x_1, ..., X(t_n) = x_n} = Π_{k=0}^{n−1} P_{x_k x_{k+1}}(t_{k+1} − t_k).
This fact can then be used to show that:
(1) Markov pure jump processes restart from the transition times, τ_n;
(2) P_{xy}(t + s) = Σ_{z∈S} P_{xz}(t) P_{zy}(s).
Note that (1) is just the analog of Markov chains restarting from hitting times. The
equation in (2) is called the Chapman-Kolmogorov equation and is the analog of the
Markov chain property P^{m+n} = P^m P^n.

8.2   Characterizing a Markov Pure Jump Process

As with Markov chains, a Markov pure jump process is characterized by its initial distribution, π_0, and its transition function, P_{xy}(t). However, this is not the most convenient
characterization for jump processes, and we now investigate a more useful characterization.

First, let's define q_x = 1/E_x[τ_1]. Then, it can be shown (with a bit of higher level
mathematics) that
P_x(τ_1 ≤ t) = 1 − e^{−q_x t}.
In other words, starting from state x, the time until the first jump has an exponential
distribution with parameter q_x, and thus the CDF and density of τ_1 for a Markov pure
jump process started in state x are
F_{τ_1,x}(t) = 1 − e^{−q_x t}   and   f_{τ_1,x}(t) = q_x e^{−q_x t},
respectively. Also, note that q_x = 0 if and only if the state x is absorbing, since q_x = 0
implies that P_x(τ_1 ≤ t) = 0 for any time t, so that the process can never jump out of state
x. [In addition, q_x = 0 if and only if E_x[τ_1] = ∞.]
Next, we will define Y_n = X(τ_n), so that the process {Y_n}_{n≥1} keeps track of which
states the jump process visits but ignores how long it takes between jumps. If X(t) is a
Markov pure jump process then it should be no surprise (and, indeed, it is not too difficult
to show) that {Y_n}_{n≥1} is a Markov chain with state space S and transition function given
by
Q_{xy} = P_x{X(τ_1) = y},
and is generally referred to as the embedded chain of the Markov pure jump process. Note
that the nature of a jump process implies that Q_{xx} = 0 for all non-absorbing states x.
Finally, it can be shown (with a bit of technical mathematics) that for a Markov pure
jump process, the two random variables τ_1 and Y_1 = X(τ_1) are independent. That is,
where the process jumps is independent of how long it takes to make a jump at all.
Now, it turns out (though again, the proof requires a bit of higher mathematics) that
the values q_x and the matrix Q_{xy} completely characterize a Markov pure jump process.
Moreover, if we define

  q_{xy} =  −q_x         if y = x
            q_x Q_{xy}   if y ≠ x

then these values are called the infinitesimal parameters and they also completely characterize
the Markov pure jump process. Note that, since Q_{xy} is the transition matrix of a
Markov chain, we must have
−q_{xx} = q_x = q_x Σ_{y∈S} Q_{xy} = q_x Q_{xx} + Σ_{y≠x} q_x Q_{xy} = Σ_{y≠x} q_{xy},
and thus Σ_{y∈S} q_{xy} = 0. The reason that the q_{xy}'s are called the infinitesimal parameters
is that, in some sense, q_x = −q_{xx} = 1/E_x[τ_1] measures the instantaneous rate at which the
process leaves state x, while the q_{xy}'s for y ≠ x measure the instantaneous rate at which
the process moves from state x into state y. We will see a clearer depiction of this idea
shortly.
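The characterization by (q_x, Q_{xy}) translates directly into a simulation recipe: hold an Exponential(q_x) time in the current state, then move according to the embedded chain. The following sketch is not part of the original notes; the three-state generator below is an arbitrary illustrative example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative three-state generator (rows sum to zero)
q = np.array([[-2.0, 1.5, 0.5],
              [1.0, -1.0, 0.0],
              [0.5, 0.5, -1.0]])

def simulate_path(q, x0, t_end):
    """Simulate a Markov pure jump process from its infinitesimal parameters."""
    rates = -np.diag(q)                      # q_x = -q_xx
    path, x, t = [(0.0, x0)], x0, 0.0
    while True:
        if rates[x] == 0:                    # absorbing state: never jumps again
            break
        t += rng.exponential(1 / rates[x])   # holding time ~ Exponential(q_x)
        if t >= t_end:
            break
        probs = q[x].copy()
        probs[x] = 0.0
        probs /= rates[x]                    # embedded chain row: Q_xy = q_xy / q_x
        x = rng.choice(len(probs), p=probs)
        path.append((t, x))
    return path

print(simulate_path(q, x0=0, t_end=5.0))
```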
The importance of the infinitesimal parameters and their relationship to the transition
function is captured in the identities
(d/dt) P_{xy}(t) = P′_{xy}(t) = Σ_{z∈S} q_{xz} P_{zy}(t),
which are generally referred to as the backward equations. (The term backward refers
to the fact that all the terms in the sum on the right hand side of the equation involve
values of the transition function which have their ending state in common, and thus form
a system which looks backwards to see how the process got to this state.)
We now give a heuristic argument as to why the backward equations are valid. To do
so, we first note that if z ≠ x and h is very close to 0, then
P_{xz}(h) = P_x{X(h) = z} = ∫_0^h P_x{X(h) = z | τ_1 = t} f_{τ_1,x}(t) dt
         ≈ h P_x{X(h) = z | τ_1 = h} f_{τ_1,x}(h) = h P_x{X(τ_1) = z | τ_1 = h} q_x e^{−q_x h}
         = h P_x(Y_1 = z) q_x e^{−q_x h} ≈ h Q_{xz} q_x (1 − q_x h) = h q_{xz} − h^2 q_{xz} q_x ≈ h q_{xz}.
Next, note that, again for values of h very close to 0, we have
P_x{X(h) = x} ≈ P_x(τ_1 > h) = e^{−q_x h} ≈ 1 − q_x h = 1 + q_{xx} h.
Note that the initial approximation is based on the idea that if h is very near zero,
the only likely way that the process can be in its starting state at time h is not to have
jumped at all (the chance of a jump out of the starting state and then subsequently back
into the starting state in such a short time interval having a negligible probability).
So, combining these two facts with the Chapman-Kolmogorov equations yields
P_{xy}(t + h) = Σ_{z∈S} P_{xz}(h) P_{zy}(t) = P_{xx}(h) P_{xy}(t) + Σ_{z≠x} P_{xz}(h) P_{zy}(t)
            ≈ (1 + q_{xx}h) P_{xy}(t) + Σ_{z≠x} h q_{xz} P_{zy}(t)
            = P_{xy}(t) + h Σ_{z∈S} q_{xz} P_{zy}(t).
Finally, rearranging this equation shows that
[P_{xy}(t + h) − P_{xy}(t)] / h ≈ Σ_{z∈S} q_{xz} P_{zy}(t),
which leads to the backward equations once we appeal to the definition of the derivative.
One very immediate use of the backward equations is to show that
P′_{xy}(0) = Σ_{z∈S} q_{xz} P_{zy}(0) = q_{xy},
since clearly P_{zy}(0) = P_z{X(0) = y} = 0 unless z = y. This shows more precisely why
the infinitesimal parameters represent the instantaneous rate at which a process started
in state x goes to state y.
Lastly, for the sake of completeness, we note the above heuristics can be turned around
to yield the forward equations,
P′_{xy}(t) = Σ_{z∈S} P_{xz}(t) q_{zy}.
(As with the backward equations, the forward equations are so called since the terms in
the sum on the right hand side all involve the initial state x.)
We now examine some specific types of Markov pure jump processes, where we can
apply these theoretical ideas more explicitly.

8.3   S = {0, 1}

Consider the state space S = {0, 1}. Then any matrix of the form

  q =  ( −q_0    q_0 )
       (  q_1   −q_1 )

is a generator of an associated pure jump Markov process, provided the parameters are
taken to be nonnegative: q_0, q_1 ≥ 0. We shall analyse the behaviour for two cases, leaving
the remaining cases as an exercise.
q_0 > 0, q_1 = 0. Plainly, 1 is an absorbing state. The embedded chain is deterministic
with transition matrix

  Q =  ( 0   q_0/q_0 )  =  ( 0   1 )
       ( 0      1    )     ( 0   1 )

The distribution of the holding times is characterised as follows: τ_1 ~ Exponential(q_0)
if currently in 0; otherwise, if in 1, Exponential(0), i.e. the process never leaves. Starting
from 0, that is X(0) = X_0 = 0, we have the following decomposition of the paths:

  X(t) = 1_{[τ_1, ∞)}(t) =  1   t ≥ τ_1
                            0   t < τ_1

with τ_1 ~ Exponential(q_0).
Using the equivalence of events {X(t) = 0} = {τ_1 > t} (starting from X(0) = 0), we
find P_{00}(t) = P(X(t) = 0 | X(0) = 0) = P(τ_1 > t | X(0) = 0) = ∫_t^∞ q_0 e^{−q_0 x} dx = e^{−q_0 t}.
q_0 > 0, q_1 > 0. There is no absorbing state. The embedded chain is equivalent to deterministic switching:

  Q =  (    0      q_0/q_0 )  =  ( 0   1 )
       ( q_1/q_1      0    )     ( 1   0 )

Starting from 0 the paths can be decomposed as

  X(t) = Σ_{k=0}^∞ 1_{[τ_{2k+1}, τ_{2k+2})}(t) =  1   if t ∈ ∪_{k=0}^∞ [τ_{2k+1}, τ_{2k+2})
                                                  0   if t ∈ ∪_{k=0}^∞ [τ_{2k}, τ_{2k+1})

with τ_0 := 0 and τ_1, τ_2 − τ_1, τ_3 − τ_2, ... independent and

  τ_n − τ_{n−1} ~  Exponential(q_1)   n even
                   Exponential(q_0)   n odd        n = 1, 2, ...

Computing, say, P_{00}(10) cannot be achieved by intuitive reasoning. We solve the backward
equation P′(t) = qP(t) subject to the initial condition P(0) = I.
Note P_{01}(t) = 1 − P_{00}(t) and P_{11}(t) = 1 − P_{10}(t). It suffices to solve
P′_{00}(t) = −q_0 P_{00}(t) + q_0 P_{10}(t),
P′_{10}(t) = −q_1 P_{10}(t) + q_1 P_{00}(t).
The matrix q has eigenvalues 0 and −(q_0 + q_1) with eigenvectors (1, 1)^T and (q_0, −q_1)^T,
respectively. The solution is thus of the form

  ( P_{00}(t) )      ( 1 )                         (  q_0 )
  ( P_{10}(t) )  = A ( 1 )  + B e^{−(q_0+q_1)t}  ( −q_1 ),    t ≥ 0,

for some A, B which have to be determined from the initial conditions P_{00}(0) = 1, P_{10}(0) =
0. The final result is
P_{00}(t) = q_1/(q_0 + q_1) + [q_0/(q_0 + q_1)] e^{−(q_0+q_1)t},
P_{10}(t) = [q_1/(q_0 + q_1)] (1 − e^{−(q_0+q_1)t}).
(Differentiate the expressions on the right hand side of the last display to verify that they
solve the backward equation.)

Remark 8.3.1. For a general pure jump process with generator q, the matrix exponential
P(t) = exp(tq) := Σ_{n=0}^∞ (t^n / n!) q^n,
where q^n is the nth matrix power of q, solves the backward and forward equations
(convergence is unproblematic for finite state spaces S). This indicates that, though numerically
possible, finding explicit and analytical formulae is rare and not the rule. However, solving
linear differential equations is a standard problem in numerical analysis.
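As a numerical illustration (a sketch, not part of the original notes), the two-state transition function found above can be checked against the matrix exponential using scipy; the rates q_0 = 2, q_1 = 1 are arbitrary.

```python
import numpy as np
from scipy.linalg import expm

q0, q1 = 2.0, 1.0          # arbitrary illustrative rates
q = np.array([[-q0, q0],
              [q1, -q1]])

t = 10.0
P_t = expm(t * q)          # P(t) = exp(tq) solves the backward/forward equations

# Analytic solution derived above
P00 = q1 / (q0 + q1) + q0 / (q0 + q1) * np.exp(-(q0 + q1) * t)
P10 = q1 / (q0 + q1) * (1 - np.exp(-(q0 + q1) * t))
print(P_t)
print(P00, P10)            # should match P_t[0, 0] and P_t[1, 0]
```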


8.4   Poisson Processes

Poisson behaviour is extremely pervasive in natural phenomena, and thus it has become
one of the central focusses of study in stochastic processes. The reason that the Poisson
distribution is so important in nature is due to its relationship to the distribution of rare
events. Suppose we wish to investigate the number of occurrences of a very rare event. If
the number of trials we investigate is small then we don't expect to see any occurrences of
the event, due to its rarity. However, if the number of trials we investigate is quite large,
then we do expect to see some occurrences of the rare event, and the distribution of the
number of occurrences we see will be approximately Poisson. For instance, suppose we
wish to count the number of raindrops which hit a particular piece of ground during a brief
rain shower and that the chance that a raindrop lands in a particular square millimeter
is some very small value p. If the area of the piece of ground we are interested in is only a
few square millimeters, then we don't expect any raindrops to hit it. On the other hand, if
the area is some large number n square millimeters then we expect some raindrops to hit
this region. To discover what the approximate distribution of the number of rare events
is, we use the following argument:
(1) We know that if a particular trial has chance p of success and we investigate n trials,
then X, the number of successes, has a binomial distribution with pmf given by
p(x) = [n! / (x!(n − x)!)] p^x (1 − p)^{n−x}.
(2) Now, for rare events we want to investigate what happens to this binomial pmf as n
gets very large and p gets very small. Now, we have to be a little careful about just letting
n and p get large and small, respectively. We want them to remain commensurate with
one another. For instance, in the raindrop example, if we choose a piece of ground which
has much too large an area then we will be unable to count all the drops. On the other
hand, if we choose too small an area, then we will get no raindrops at all. Formally, we
will investigate the binomial pmf as n → ∞ and p → 0 in such a way that np is always
equal to some value λ > 0. (Actually, all we really need is for np → λ > 0 in the limit.)
(3) So, using np = λ we can rewrite the binomial pmf as
p(x) = [(np)^x / x!] · [n(n−1)···(n−x+1) / n^x] · (1 − p)^{n−x}
     = [λ^x / x!] · [n(n−1)···(n−x+1) / n^x] · (1 − λ/n)^n (1 − λ/n)^{−x}.
Now, letting n → ∞ we have
n(n−1)···(n−x+1) / n^x = 1 · (1 − 1/n) ··· (1 − (x−1)/n) → 1.
Also, as n → ∞ we have
(1 − λ/n)^{−x} → 1.
Finally, remembering a bit of calculus, we recognize that
lim_{n→∞} (1 − λ/n)^n = e^{−λ}.
Thus, putting this all together we see that
p(x) ≈ λ^x e^{−λ} / x!,
in other words, the distribution is approximately Poisson with parameter λ.
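A quick numerical check of this approximation (a sketch, not part of the original notes) compares the Binomial(n, λ/n) pmf with the Poisson(λ) pmf for a moderately large n; the values λ = 3 and n = 500 are arbitrary choices.

```python
from scipy.stats import binom, poisson

lam, n = 3.0, 500                      # arbitrary illustrative values
for x in range(8):
    b = binom.pmf(x, n, lam / n)       # exact binomial probability
    p = poisson.pmf(x, lam)            # Poisson approximation for rare events
    print(x, round(b, 5), round(p, 5))
```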


We now discuss some important properties of the Poisson distribution itself, and then
move on to investigate the Poisson process itself.
Properties of the Poisson distribution.
We already know that if X has a Poisson distribution with parameter λ then E[X] = λ and
Var(X) = λ as well. Now, suppose that X and Y are two independent random variables
having Poisson distributions with parameters λ_X and λ_Y, respectively. Then, the pmf of
the new random variable X + Y is
P(X + Y = z) = Σ_{x=0}^z P(X = x, Y = z − x) = Σ_{x=0}^z P(X = x) P(Y = z − x)
             = Σ_{x=0}^z [λ_X^x e^{−λ_X} / x!] [λ_Y^{z−x} e^{−λ_Y} / (z − x)!]
             = [e^{−(λ_X+λ_Y)} / z!] Σ_{x=0}^z [z! / (x!(z − x)!)] λ_X^x λ_Y^{z−x}
             = (λ_X + λ_Y)^z e^{−(λ_X+λ_Y)} / z!.
Thus, X + Y has a Poisson distribution with parameter λ_X + λ_Y. The idea is that if one
type of event is occurring with rate λ_1 and another type of event is occurring independently
at rate λ_2, then obviously, the total number of combined events (i.e. both the first and
second type together) are occurring at a rate of λ_1 + λ_2.
Next, let X be a Poisson random variable with rate parameter λ. In other words, X
counts up the number of occurrences of some rare event. However, suppose that instead of
being able to count all events that occur, we only get to see each event with probability p,
and with probability 1 − p we miss the event. If Y is the random variable which counts the
number of events we actually got to see (as opposed to the number which actually occurred,
which is what X was counting), then conditional on X, Y has a binomial distribution
with parameters X and p. In Exercise 9 of Tutorial 1, we saw that the unconditional
distribution of Y was Poisson with parameter λp:
P(Y = y) = Σ_{x=y}^∞ P(Y = y | X = x) P(X = x)
         = Σ_{x=y}^∞ [x! / (y!(x − y)!)] p^y (1 − p)^{x−y} [λ^x e^{−λ} / x!]
         = [(λp)^y e^{−λ} / y!] Σ_{x=y}^∞ {λ(1 − p)}^{x−y} / (x − y)!
         = [(λp)^y e^{−λ} / y!] Σ_{x=0}^∞ {λ(1 − p)}^x / x!
         = [(λp)^y e^{−λ} / y!] e^{λ(1−p)} = (λp)^y e^{−λp} / y!.
The random variable Y is sometimes called a thinned Poisson variable, since the actual
number of events which occur is thinned out before they are counted by Y. This sort of
random variable is very commonly used in situations where rare events are occurring, but
they are very hard to detect. For example, radioactive decay in very weakly radioactive
samples or very weak signals from some energy source. In such situations, we can use
the observed value of Y to estimate the rate at which events are occurring; however, any
estimate of the rate should be appropriately scaled up to account for the fact that the
observed occurrences are only some fraction of the total number of occurrences.
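The thinning property is easy to check by simulation (a sketch, not part of the original notes; λ = 10 and p = 0.3 are arbitrary values): the mean and variance of the thinned counts should both be close to λp.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, p, reps = 10.0, 0.3, 100_000   # arbitrary illustrative values

X = rng.poisson(lam, size=reps)     # total number of events
Y = rng.binomial(X, p)              # events we actually observe

print(Y.mean(), Y.var())            # both ≈ lam * p = 3, as for a Poisson(lam*p) variable
```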
Basic Properties of a Poisson process.
Definition: A Markov pure jump process X(t) is a homogeneous Poisson process with
intensity λ if 1) X(0) = 0 (i.e. its initial distribution π_0 has probability 1 on state 0 and
probability 0 elsewhere); and 2) the transition function is

  P_{xy}(t) =  e^{−λt} (λt)^{y−x} / (y − x)!   if y ≥ x
               0                               otherwise

Notice that the transition function depends on x and y only through the difference y − x.
Thus, a Poisson process started at state z acts exactly like a Poisson process started at
state 0, except that z is added across the board to each X(t). In other words, we have
P_{z,z+y}(t) = P_{0y}(t). This is why it is not really a serious restriction to require a Poisson
process to start at state 0. Also, notice that
P_0{X(t) = y} = P_{0y}(t) = e^{−λt} (λt)^y / y!,
so that X(t) has a Poisson distribution with parameter λt. In particular, E[X(t)] = λt
and Var{X(t)} = λt as well.
In addition, if s < t then we have
P{X(t) − X(s) = y} = Σ_{x=0}^∞ P{X(s) = x, X(t) = x + y}
                   = Σ_{x=0}^∞ P{X(s) = x} P_{x,x+y}(t − s)
                   = Σ_{x=0}^∞ P{X(s) = x} P_{0y}(t − s)
                   = P_{0y}(t − s).
Thus, the random variable X(t) − X(s) has a Poisson distribution with parameter λ(t − s).
Finally, note that if t_0 < t_1 < t_2, then
P{X(t_2) − X(t_1) = x, X(t_1) − X(t_0) = y}
   = Σ_{z=0}^∞ P{X(t_0) = z, X(t_1) = z + y, X(t_2) = z + y + x}
   = Σ_{z=0}^∞ P{X(t_0) = z} P_{z,z+y}(t_1 − t_0) P_{z+y,z+y+x}(t_2 − t_1)
   = P_{0y}(t_1 − t_0) P_{0x}(t_2 − t_1)
   = P{X(t_2) − X(t_1) = x} P{X(t_1) − X(t_0) = y},
so that the random variables X(t_2) − X(t_1) and X(t_1) − X(t_0) are independent.

Poisson processes are used to keep track of rare events occurring through time (although
it also makes sense to think of them as keeping track of rare events along some
spatial dimension as well). The above properties can be summarized by saying that for a
homogeneous Poisson process with intensity λ, the number of events which occur in any
time interval has a Poisson distribution with parameter λ times the length of the time
interval, and the numbers of events that occur in any two disconnected time intervals are
independent.
Example 8.1. Suppose customers arrive at a certain store according to a Poisson process
with intensity λ = 4 customers per hour. If the store opens at 9:00 A.M., what is the
probability that only one customer has arrived by 9:30 A.M. but that 5 customers have
arrived by 11:30 A.M.?
Solution: Since the rate parameter is given in customers per hour, we must measure time
in hours and thus we are asked to find P_0{X(0.5) = 1, X(2.5) = 5}. To calculate this
probability, we note that the random variables X(0.5) [which is the same as the random
variable X(0.5) − X(0)] and X(2.5) − X(0.5) are independent. Thus,
P_0{X(0.5) = 1, X(2.5) = 5} = P_0{X(0.5) = 1, X(2.5) − X(0.5) = 4} = P_{01}(0.5) P_{04}(2)
   = [e^{−4(0.5)} {4(0.5)}^1 / 1!] [e^{−4(2)} {4(2)}^4 / 4!] = 0.0155.
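This calculation is straightforward to reproduce numerically (a sketch, not part of the original notes):

```python
from scipy.stats import poisson

lam = 4.0
prob = poisson.pmf(1, lam * 0.5) * poisson.pmf(4, lam * 2.0)
print(round(prob, 4))   # 0.0155
```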


8.5   Inhomogeneous Poisson Processes

Let's examine the infinitesimal parameters of the homogeneous Poisson process. To do
this, recall that q_{xy} = P′_{xy}(0). Now, obviously, if x > y, so that P_{xy}(t) = 0, then q_{xy} = 0.
Next, if x = y then
P′_{xx}(t) = (d/dt) [e^{−λt} (λt)^0 / 0!] = −λ e^{−λt},
and thus q_{xx} = −q_x = P′_{xx}(0) = −λ. Notice that this means that q_x = 1/E_x[τ_1] = λ,
so the reciprocal of the intensity can be interpreted as the mean time between jumps.
Recall that for a Markov pure jump process the time between jumps has an exponential
distribution with parameter q_x. Thus, for the Poisson process, the time between jumps
has the Exponential(λ) distribution regardless of the state that it is currently in.
Lastly, if y > x,
P′_{xy}(t) = (d/dt) [e^{−λt} (λt)^{y−x} / (y − x)!] = λ e^{−λt} (λt)^{y−x−1} (y − x − λt) / (y − x)!.
Therefore, if y > x + 1 we have q_{xy} = P′_{xy}(0) = 0, and if y = x + 1, then q_{x,x+1} = P′_{x,x+1}(0) =
λ. In particular, the fact that q_{xy} = 0 for y > x + 1 indicates that the jumps in the process
are all of size one. In other words, two events cannot happen simultaneously. In fact,
the infinitesimal parameters show us that an alternative characterization of a Poisson
process is a sequence of consecutive events with the waiting times between each event
being independent exponential random variables all with parameter λ.
The fact that the intensity parameter does not change over time is why the process is
called homogeneous. However, it is often the case that the intensity of events for a Poisson
process changes over time, and is equal to some function, say λ(t). Such processes are called
inhomogeneous or nonstationary Poisson processes. If X(t) is an inhomogeneous Poisson
process with intensity function λ(t), then the random variable X(t) − X(s), the number
of events which occur in the time interval (s, t], has a Poisson distribution with parameter
∫_s^t λ(u) du. Also, as with the homogeneous case, the numbers of events in disconnected
time intervals are independent. [Notice that if the function λ(t) ≡ λ, i.e. is constant, then
X(t) − X(s) for the inhomogeneous process has a Poisson distribution with parameter
∫_s^t λ du = λ(t − s); in other words, it simply reduces to a homogeneous process.]

Example 8.2 (Example 8.1 continued). Suppose that instead of a constant rate of 4 customers
per hour, the rate function was

  λ(t) =  4t        0 ≤ t < 1
          4         1 ≤ t < 6
          16 − 2t   6 ≤ t ≤ 8

In other words, the intensity of customers entering the store increases linearly from 0 to 4
customers per hour in the first hour of business, then remains at 4 customers per hour until
2 hours before closing and then linearly decreases to 0 customers per hour over the last
two hours of business. For this inhomogeneous process, find P{X(0.5) = 1, X(2.5) = 5}.
Solution: As before, we use the independence of disconnected time intervals, so that
P{X(0.5) = 1, X(2.5) = 5} = P{X(0.5) − X(0) = 1, X(2.5) − X(0.5) = 4}
                          = P{X(0.5) − X(0) = 1} P{X(2.5) − X(0.5) = 4}.
Now, ∫_0^{0.5} λ(u) du = ∫_0^{0.5} 4u du = 2u^2 |_0^{0.5} = 0.5. Thus,
P{X(0.5) − X(0) = 1} = e^{−0.5} (0.5)^1 / 1! = 0.3033.
Similarly,
∫_{0.5}^{2.5} λ(u) du = ∫_{0.5}^1 4u du + ∫_1^{2.5} 4 du = 2u^2 |_{0.5}^1 + 4u |_1^{2.5} = 1.5 + 6 = 7.5.
So,
P{X(2.5) − X(0.5) = 4} = e^{−7.5} (7.5)^4 / 4! = 0.0729.
Putting this all together, we have P{X(0.5) = 1, X(2.5) = 5} = 0.0221. This is somewhat
different from the answer of 0.0155 we got in the homogeneous example. The fact that
the chance is higher reflects the fact that the chance of only one customer in the first
half-hour of business is greater under the new intensity scheme.
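The answer can also be checked by simulating the inhomogeneous process via thinning (a sketch, not part of the original notes): generate candidate points from a homogeneous process whose rate dominates λ(t), here λ_max = 4, and keep each candidate at u with probability λ(u)/λ_max.

```python
import numpy as np

rng = np.random.default_rng(2)

def lam(u):
    # rate function from Example 8.2
    return np.where(u < 1, 4 * u, np.where(u < 6, 4.0, 16 - 2 * u))

lam_max, t_end, reps = 4.0, 2.5, 200_000
hits = 0
for _ in range(reps):
    n = rng.poisson(lam_max * t_end)                       # candidate points
    u = np.sort(rng.uniform(0, t_end, size=n))
    keep = u[rng.uniform(size=n) < lam(u) / lam_max]       # thinning step
    if (keep <= 0.5).sum() == 1 and keep.size == 5:
        hits += 1
print(hits / reps)   # ≈ 0.0221
```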

It may seem that we have added a new level of complexity to the analysis by allowing
the process to be inhomogeneous. However, if X(t) is an inhomogeneous Poisson process
with intensity function λ(t) and we define the function Λ(t) = ∫_0^t λ(u) du, then a simple
change of time scale will result in a homogeneous process. That is, if we let s = Λ(t) then
the process
Y(s) = X(Λ^{−1}(s))
is a homogeneous Poisson process with unit intensity.
Example 8.3. If X(t) is an inhomogeneous Poisson process with intensity function λ(t) = t,
then Λ(t) = t^2/2, so that Y(s) = X(√(2s)) is a homogeneous Poisson process with intensity
1. For instance,
P{Y(s) = k} = P{X(√(2s)) = k} = e^{−Λ(√(2s))} {Λ(√(2s))}^k / k! = e^{−s} s^k / k!,
which corresponds to the probability that a Poisson random variable with parameter s is
equal to k, as it should for the homogeneous Poisson process Y(s) with intensity λ = 1.
In this way, all questions about inhomogeneous Poisson processes can be translated
into questions about homogeneous Poisson processes. Thus, we really only need to study
the homogeneous case.



8.6   Special Distributions Associated with the Poisson Process

We now present two important results concerning the distribution of events in a homogeneous
Poisson process with intensity λ.
Theorem 1: The distribution of τ_n, the waiting time until the nth event, is a gamma
distribution with density
f_{τ_n}(t) = λ^n t^{n−1} e^{−λt} / (n − 1)!.
Proof: This is a direct consequence of the fact that the waiting times between each pair
of consecutive events are independent exponentially distributed random variables all with
parameter λ, and the fact that the distribution of the sum of independent exponential
random variables is a gamma distribution, which is the n-fold convolution of the exponential
distribution. Now, to see the result another way, note that the event {τ_n ≤ t} is
equivalent to the event {X(t) ≥ n}, i.e. the nth event occurs before time t if and only if
at time t there have been at least n events. So,
F_{τ_n}(t) = P(τ_n ≤ t) = P_0{X(t) ≥ n} = Σ_{k=n}^∞ (λt)^k e^{−λt} / k!
          = 1 − Σ_{k=0}^{n−1} (λt)^k e^{−λt} / k! = 1 − e^{−λt} Σ_{k=0}^{n−1} (λt)^k / k!.
Differentiating this expression shows that
f_{τ_n}(t) = (d/dt) F_{τ_n}(t) = λ e^{−λt} Σ_{k=0}^{n−1} (λt)^k / k! − e^{−λt} Σ_{k=1}^{n−1} λ (λt)^{k−1} / (k − 1)!
          = λ e^{−λt} [ Σ_{k=0}^{n−1} (λt)^k / k! − Σ_{k=0}^{n−2} (λt)^k / k! ]
          = λ e^{−λt} (λt)^{n−1} / (n − 1)! = λ^n t^{n−1} e^{−λt} / (n − 1)!.


Next, consider the problem of determining exactly when an event has occurred given that
we know it has occurred some time before time t. Specifically, we have
Theorem 2: The conditional distribution of τ_1 given that X(t) = 1 is uniform on the
interval (0, t]. In other words,

  f_{τ_1|X(t)}(u | 1) =  1/t   0 < u ≤ t
                         0     elsewhere

Proof: We first note that for u ∈ (0, t],
P{τ_1 ≤ u, X(t) = 1} = P{X(u) = 1, X(t) = 1}
                     = P{X(u) = 1, X(t) − X(u) = 0}
                     = P{X(u) = 1} P{X(t) − X(u) = 0}
                     = λu e^{−λu} e^{−λ(t−u)}
                     = λu e^{−λt}.
Thus, the conditional CDF is
F_{τ_1|X(t)}(u | 1) = P{τ_1 ≤ u | X(t) = 1} = P{τ_1 ≤ u, X(t) = 1} / P{X(t) = 1} = λu e^{−λt} / (λt e^{−λt}) = u/t,
for 0 < u ≤ t. Taking derivatives then gives the desired result.

In fact, it can be further shown that the conditional density function for τ_k given X(t) = n
is
f_{τ_k|X(t)}(u | n) = n! u^{k−1} (t − u)^{n−k} / [(k − 1)!(n − k)! t^n],
for 0 < u ≤ t. This density function is closely related to the so-called Beta function. To
see why this is true, first note that for small values of h,
f_{τ_k|X(t)}(u | n) ≈ (1/h) [F_{τ_k|X(t)}(u + h | n) − F_{τ_k|X(t)}(u | n)] = (1/h) P{u < τ_k ≤ u + h | X(t) = n}.
Now, for h small,
P{u < τ_k ≤ u + h, X(t) = n} ≈ P{X(u) = k − 1, X(u + h) − X(u) = 1, X(t) − X(u + h) = n − k}
   = P{X(u) = k − 1} P{X(u + h) − X(u) = 1} P{X(t) − X(u + h) = n − k}
   = [(λu)^{k−1} e^{−λu} / (k − 1)!] [λh e^{−λh}] [{λ(t − u − h)}^{n−k} e^{−λ(t−u−h)} / (n − k)!]
   = λ^n e^{−λt} h u^{k−1} {(t − u) − h}^{n−k} / [(k − 1)!(n − k)!].
Thus, since P{X(t) = n} = (λt)^n e^{−λt} / n!, we have
f_{τ_k|X(t)}(u | n) ≈ n! u^{k−1} {(t − u) − h}^{n−k} / [(k − 1)!(n − k)! t^n] → n! u^{k−1} (t − u)^{n−k} / [(k − 1)!(n − k)! t^n].
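A quick simulation check of Theorem 2 (a sketch, not part of the original notes): conditioning runs of a rate-λ process on X(t) = 1, the single arrival time should look uniform on (0, t]; λ = 2 and t = 3 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, t = 2.0, 3.0
arrivals = []
for _ in range(100_000):
    # simulate arrival times as cumulative sums of Exponential(lam) gaps
    pts = np.cumsum(rng.exponential(1 / lam, size=30))
    pts = pts[pts <= t]
    if pts.size == 1:                        # condition on X(t) = 1
        arrivals.append(pts[0])
arrivals = np.array(arrivals)
print(arrivals.mean(), arrivals.var())       # ≈ t/2 = 1.5 and t^2/12 = 0.75 for Uniform(0, t]
```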

Example 8.4. Customers arrive at a facility according to a homogeneous Poisson process
with intensity λ. On arrival, each customer pays $1. Of course, due to the time value of
money, the later the dollar is paid the less it is actually worth. Assume that the discounting
rate of the dollar is β, so that at time s, a dollar is only worth e^{−βs} as much as it was at
time 0. We want to find the expected total sum collected by time t, in terms of time 0
dollars. In other words, we want to find
M = E[ Σ_{k=1}^{X(t)} e^{−βτ_k} ].
Solution: To find M we note that
E[ Σ_{k=1}^{X(t)} e^{−βτ_k} ] = Σ_{n=1}^∞ E[ Σ_{k=1}^{X(t)} e^{−βτ_k} | X(t) = n ] P{X(t) = n}.
Now,
E[ Σ_{k=1}^{X(t)} e^{−βτ_k} | X(t) = n ] = E[ Σ_{k=1}^n e^{−βτ_k} | X(t) = n ]
   = Σ_{k=1}^n E[ e^{−βτ_k} | X(t) = n ]
   = Σ_{k=1}^n ∫_0^t e^{−βu} n! u^{k−1} (t − u)^{n−k} / [(k − 1)!(n − k)! t^n] du
   = n t^{−n} ∫_0^t e^{−βu} Σ_{k=0}^{n−1} [(n − 1)! / (k!(n − k − 1)!)] u^k (t − u)^{n−k−1} du
   = n t^{−n} ∫_0^t e^{−βu} {u + (t − u)}^{n−1} du
   = n t^{−1} [ −β^{−1} e^{−βu} |_0^t ]
   = (n / βt) (1 − e^{−βt}).
Thus,
M = [(1 − e^{−βt}) / (βt)] Σ_{n=1}^∞ n P{X(t) = n} = [(1 − e^{−βt}) / (βt)] E[X(t)]
  = (λ/β)(1 − e^{−βt}) ≈ λ(t − βt^2/2).
Compare this to E[X(t)] = λt, which is the expected number of unadjusted dollars that
will be collected by time t.
[NOTE: The final approximation is, of course, only valid for moderate values of t, since
as t increases the value of t − βt^2/2 will eventually become negative.]
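A Monte Carlo check of the formula M = (λ/β)(1 − e^{−βt}) (a sketch, not part of the original notes; λ = 4, β = 0.1, t = 2 are arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, beta, t, reps = 4.0, 0.1, 2.0, 100_000

total = 0.0
for _ in range(reps):
    arrivals = np.cumsum(rng.exponential(1 / lam, size=40))   # arrival times tau_k
    arrivals = arrivals[arrivals <= t]
    total += np.exp(-beta * arrivals).sum()                   # discounted dollars collected
print(total / reps)                                 # Monte Carlo estimate of M
print(lam / beta * (1 - np.exp(-beta * t)))         # exact formula ≈ 7.25
```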



8.7   Compound Poisson Processes

The final example in the preceding section foreshadows a very useful and important extension
to the idea of a Poisson process. Let X(t) be a homogeneous Poisson process
with intensity λ, that is, it keeps track of the occurrence of certain events through time.
Now, if associated with each event there is a random value Y_n, such that all the Y's are
independent [of each other as well as of X(t)] and identically distributed with CDF
G(y) = P(Y_n ≤ y),
then the cumulative total of the Y's through time is called a compound Poisson process.
In other words, a compound Poisson process is defined as
Z(t) = Σ_{k=1}^{X(t)} Y_k.
Compound Poisson processes are extremely important tools in many fields of study. For
instance, in econometrics, compound Poisson processes are sometimes used to model the
fluctuations in stock prices. If the occurrences of trades of a particular stock are modelled
using a Poisson process, and the changes in trading price at each sale are taken as the Y
random variables, then the current price of the stock is a compound Poisson process.
Notice that for any fixed time t, Z(t) is just a random sum. Thus, if we set E[Y_n] = μ
and Var(Y_n) = σ^2, then we can immediately deduce that
E[Z(t)] = λμt;    Var{Z(t)} = λ(σ^2 + μ^2)t.
Moreover, we can use our notions of convolutions to determine the distributional properties
of a compound Poisson process.

Example 8.5. Let X(t) represent the number of shocks to a particular system up to time
t and let Y_k represent the damage or wear incurred by the kth shock. We assume that
the damage is non-negative, so that P(Y_k ≥ 0) = 1, and that the damage accumulates
additively, so that Z(t) = Σ_{k=1}^{X(t)} Y_k represents the total damage sustained up to time t.
Suppose that the system continues to operate until the cumulative damage exceeds some
value a. If X(t) is a Poisson process with intensity λ and the Y_k's are independent of X(t)
and each other and all have CDF G(y), then Z(t) is a compound Poisson process. Let F
be the time of system failure, so that
{F > t} if and only if {Z(t) < a}.
We wish to determine the distribution and expectation of the random time F.
Solution: Recall that the CDF of the random variable Y_1 + ... + Y_n is G^(n)(y), the n-fold
convolution of G(y), and that the n-fold convolution was defined recursively as
G^(n)(y) = ∫ G^(n−1)(y − z) g(z) dz,
where g(z) = G′(z) is the density function of the Y_k's, such that
P(F > t) = P{Z(t) < a} = P{ Σ_{k=1}^{X(t)} Y_k < a }
         = Σ_{n=0}^∞ P{ Σ_{k=1}^{X(t)} Y_k < a | X(t) = n } (λt)^n e^{−λt} / n!
         = Σ_{n=0}^∞ [(λt)^n e^{−λt} / n!] G^(n)(a).
Thus,
E[F] = ∫_0^∞ P(F > t) dt = Σ_{n=0}^∞ G^(n)(a) ∫_0^∞ [(λt)^n e^{−λt} / n!] dt = (1/λ) Σ_{n=0}^∞ G^(n)(a).
Now, if the distribution of the Y_k's is exponential with parameter μ, then the above
formula for E[F] reduces nicely: for n ≥ 1 recall that Σ_{k=1}^n Y_k is Gamma distributed with
shape parameter n and scaling parameter μ. Further, Γ(n) = (n − 1)! for n = 1, 2, 3, ...,
such that (see Page 8 for the density)
G^(n)(a) = P( Σ_{k=1}^n Y_k ≤ a ) = ∫_0^a [μ^n s^{n−1} / (n − 1)!] e^{−μs} ds,   a > 0,
and interchanging series and integral then shows that, in this case,
E[F] = (1/λ) { 1 + Σ_{n=1}^∞ G^(n)(a) } = (1/λ) { 1 + ∫_0^a Σ_{n=1}^∞ [μ^n s^{n−1} / (n − 1)!] e^{−μs} ds }
     = (1/λ) { 1 + ∫_0^a μ e^{μs} e^{−μs} ds } = (1 + μa) / λ.
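This result is easy to sanity-check by simulation (a sketch, not part of the original notes; λ = 2, μ = 0.5, a = 10 are arbitrary values): the mean failure time should be close to (1 + μa)/λ = 3.

```python
import numpy as np

rng = np.random.default_rng(5)
lam, mu, a, reps = 2.0, 0.5, 10.0, 100_000

failure_times = np.empty(reps)
for i in range(reps):
    total, t = 0.0, 0.0
    while total < a:
        t += rng.exponential(1 / lam)      # time of next shock
        total += rng.exponential(1 / mu)   # Exponential(mu) damage from that shock
    failure_times[i] = t                   # cumulative damage first exceeds a here
print(failure_times.mean(), (1 + mu * a) / lam)   # both ≈ 3.0
```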



8.8   Birth and Death Processes

Suppose we define a Markov pure jump process with state space S = {0, 1, 2, ...} and
having infinitesimal parameters q_{xy} such that
q_{xy} = 0   for any x, y ∈ S such that |y − x| > 1.
In other words, the process only makes jumps of size 1. Such a process is called a birth
and death process, and the parameters λ_x = q_{x,x+1} and μ_x = q_{x,x−1} are called the birth
rates and death rates of the process, respectively. Now, since we know that
q_x = −q_{xx} = Σ_{y≠x} q_{xy},
we clearly have q_x = −q_{xx} = q_{x,x−1} + q_{x,x+1} = μ_x + λ_x. Thus, x is an absorbing state if
and only if λ_x = μ_x = 0. If x is non-absorbing then we have

  Q_{xy} =  μ_x / (λ_x + μ_x)   y = x − 1
            λ_x / (λ_x + μ_x)   y = x + 1
            0                   otherwise
Thus, the embedded chain in a birth and death process is just a birth and death chain
(which is where the process gets its name). Now, just because we write down a set of
infinitesimal parameters doesn't necessarily mean that there is a Markov pure jump process
which has those parameters. In particular, we must verify that the process we have
described with our chosen parameter values is not explosive. It turns out that
Lemma: A birth and death process is non-explosive if and only if
Σ_{x=0}^∞ 1/λ_x + Σ_{y=1}^∞ Σ_{x=y}^∞ (μ_x μ_{x−1} ··· μ_{y+1}) / (λ_x λ_{x−1} ··· λ_y) = ∞.
Proof: The proof is a bit difficult and not very informative, and thus we omit it here.
Notice, however, that if we focus only on the first summation in the condition of the
above lemma, then a sufficient condition for non-explosiveness is that
Σ_{x=0}^∞ 1/λ_x = ∞.
One way in which this can occur is if there exist constants A and B such that 0 < λ_x ≤
A + Bx; that is, the birth rate is no more than linear. This condition is certainly not
necessary for non-explosiveness; however, we will now examine some of the basic features
of some simple types of birth and death processes, and for these examples the infinitesimal
parameters are defined such that the birth rates are no more than linear, so that the processes
are non-explosive.
Pure Birth Processes. A birth and death process for which μ_x = 0 for all x ∈ S is
called a pure birth process. The most obvious example of a pure birth process is a Poisson
process; in this case, we have λ_x = λ for all x ∈ S. Now, we want to find the transition
function for a general pure birth process. To do this, we will use the forward equations,
which state that the transition function satisfies
P′_{xy}(t) = Σ_{z∈S} P_{xz}(t) q_{zy},
which reduces to
P′_{xy}(t) = λ_{y−1} P_{x,y−1}(t) − λ_y P_{xy}(t)
in this case. Of course we could also have used the backward equations if we so desired,
which reduce to an equation similar to the one given above; namely, P′_{xy}(t) = λ_x P_{x+1,y}(t) −
λ_x P_{xy}(t). Now, a pure birth process can only move from its current state to larger states,
and thus we can see that clearly P_{xy}(t) = 0 for y < x. It therefore follows that P_{xx}(t)
satisfies the differential equation
P′_{xx}(t) = −λ_x P_{xx}(t).
This equation has a solution of the form
P_{xx}(t) = k_1 e^{−λ_x t},
for some non-zero constant k_1. Finally, using the condition that P_{xx}(0) = 1 shows that we
must have k_1 = 1.
To find P_{x,x+1}(t), we must solve the equation
P′_{x,x+1}(t) = λ_x P_{xx}(t) − λ_{x+1} P_{x,x+1}(t).
To solve this equation, we need the following lemma:
Lemma: The solution to the differential equation
f′(t) = −c f(t) + g(t),   t ≥ 0,
is given by
f(t) = f(0) e^{−ct} + ∫_0^t e^{−c(t−s)} g(s) ds.
Proof: The initial differential equation can be rewritten as
e^{ct} f′(t) + c e^{ct} f(t) = e^{ct} g(t),
which is equivalent to
(d/dt) [e^{ct} f(t)] = e^{ct} g(t).
Integrating then shows that
e^{ct} f(t) − f(0) = ∫_0^t e^{cs} g(s) ds,
and a rearrangement of terms then yields the desired result.

Using this lemma we can then conclude that
P_{x,x+1}(t) = P_{x,x+1}(0) e^{−λ_{x+1} t} + ∫_0^t e^{−λ_{x+1}(t−s)} λ_x P_{xx}(s) ds
            = λ_x ∫_0^t e^{−λ_{x+1}(t−s)} e^{−λ_x s} ds
            =  [λ_x / (λ_{x+1} − λ_x)] (e^{−λ_x t} − e^{−λ_{x+1} t})   if λ_x ≠ λ_{x+1}
               λ_x t e^{−λ_x t}                                        if λ_x = λ_{x+1}
In fact, for any y > x we have
P_{xy}(t) = λ_{y−1} ∫_0^t e^{−λ_y(t−s)} P_{x,y−1}(s) ds,
and we can use this formula to find the P_{xy}(t)'s recursively. Finally, note that if λ_x = λ
for all x ∈ S, then the above formulae will reduce directly to those for a Poisson process.
Example 8.6. Consider a pure birth process having birth rates λ_x = xλ for some constant
value λ. Such a process is called a linear birth process. For this process,
P_{xx}(t) = e^{−xλt}.
Similarly, we have
P_{x,x+1}(t) = x e^{−xλt} (1 − e^{−λt}),
and
P_{x,x+2}(t) = (x + 1)λ ∫_0^t e^{−(x+2)λ(t−s)} x e^{−xλs} (1 − e^{−λs}) ds
            = x(x + 1)λ e^{−(x+2)λt} ∫_0^t e^{2λs} (1 − e^{−λs}) ds
            = x(x + 1)λ e^{−(x+2)λt} ∫_0^t e^{λs} (e^{λs} − 1) ds
            = x(x + 1)λ e^{−(x+2)λt} (e^{λt} − 1)^2 / (2λ)
            = C(x + 1, 2) e^{−xλt} (1 − e^{−λt})^2,
where C(n, k) denotes the binomial coefficient. In fact, using induction, we can see that for y ≥ x
P_{xy}(t) = C(y − 1, y − x) e^{−xλt} (1 − e^{−λt})^{y−x}.
yx


8.9   Infinite Server Queue

Suppose that customers arrive for service according to a homogeneous Poisson process
with intensity λ, and that each customer begins being serviced immediately upon arrival
(i.e. there are an infinite number of servers). Also, suppose that the service times of
the customers are independent and exponentially distributed with parameter μ. Let X(t)
denote the number of customers in the process of being served at time t. Since the arrivals
are a Poisson process, this process can only increase by at most one at any given jump.
In addition, since the service times are independent and continuous random variables, no
two customers can be finished at the same exact instant. This phenomenon arises from
the fact that for a continuous random variable we have P(X = x) = 0, so that the chance
of two independent and continuous random variables X and Y being equal to one another
is
P(X = Y) = ∫ P(X = Y | Y = y) f_Y(y) dy = ∫ P(X = y | Y = y) f_Y(y) dy = ∫ P(X = y) f_Y(y) dy = 0.
Thus, the chain can decrease by at most one at any single jump. Therefore, we have a
birth and death process. Now, clearly the birth rates are λ_x = λ since the arrivals occur
according to a homogeneous Poisson process. On the other hand, the instantaneous chance
of a departure clearly depends on how many people are currently being served. Since the
departures are independent, it is not hard to see that μ_x = xμ.
For this example, the backward equations become
P′_{xy}(t) = xμ P_{x−1,y}(t) − (λ + xμ) P_{xy}(t) + λ P_{x+1,y}(t),
and the forward equations become
P′_{xy}(t) = λ P_{x,y−1}(t) − (λ + yμ) P_{xy}(t) + (y + 1)μ P_{x,y+1}(t).
Unfortunately, neither of these differential systems of equations is particularly easy to
solve. We will investigate another way to find P_{xy}(t); however, before we do so, let's
examine another use of the forward equations in this case. Let the function M_x(t) be
defined as
M_x(t) = E_x[X(t)] = Σ_{y=1}^∞ y P_{xy}(t),
so that
(d/dt) M_x(t) = M′_x(t) = Σ_{y=1}^∞ y P′_{xy}(t).
Then, multiplying the forward equations by y and summing shows that
M′_x(t) = λ Σ_{y=1}^∞ y P_{x,y−1}(t) − Σ_{y=1}^∞ y(λ + yμ) P_{xy}(t) + μ Σ_{y=1}^∞ y(y + 1) P_{x,y+1}(t)
        = λ Σ_{y=0}^∞ (y + 1) P_{xy}(t) − λ Σ_{y=1}^∞ y P_{xy}(t) − μ Σ_{y=1}^∞ y^2 P_{xy}(t) + μ Σ_{y=2}^∞ y(y − 1) P_{xy}(t)
        = λ Σ_{y=0}^∞ P_{xy}(t) + λ Σ_{y=1}^∞ y P_{xy}(t) − λ Σ_{y=1}^∞ y P_{xy}(t) − μ Σ_{y=1}^∞ y^2 P_{xy}(t) + μ Σ_{y=1}^∞ (y^2 − y) P_{xy}(t)
        = λ − μ Σ_{y=1}^∞ y P_{xy}(t) = λ − μ M_x(t).
So, an application of our differential equation lemma shows that
M_x(t) = M_x(0) e^{−μt} + ∫_0^t λ e^{−μ(t−s)} ds = x e^{−μt} + (λ/μ)(1 − e^{−μt}).
So, using the forward equations, we can at least get an expression for the expected value
of the process at any time t. Notice that
lim_{t→∞} M_x(t) = λ/μ,
so that in the long run, there are λ/μ customers being serviced regardless of how many
people were in the queue initially. Let's now see how to find the entire distribution function
of the process.
Let Y(t) be the Poisson process of arrivals. In other words, Y(t) is the number of
customers who arrive in the interval (0, t]. Thus, Y(t) has a Poisson distribution with
parameter λt. Now, we have seen that for a Poisson process, the distribution of when an
event occurs given that it has occurred by time t is uniform over the interval (0, t]; in
other words, if τ is the arrival time of a particular customer, then the conditional density
function of τ given Y(t) = 1 is f_{τ|Y(t)}(u|1) = 1/t. Also, if a customer arrives at time
s ∈ (0, t], then the chance that they will still be in the queue at time t is just
P(still in queue at t | entered at s) = P(service time > t − s | entered at s)
                                     = P(service time > t − s) = e^{−μ(t−s)},
since the service times are exponentially distributed with parameter μ and are independent
of when the individual arrived (since there are an infinite number of servers). Thus, the
probability of a customer still being in the queue at time t given that they arrived in the
interval (0, t] is
p = P(still in queue at t | entered before t)
  = (1/t) ∫_0^t P(still in queue at t | entered at s) ds
  = (1/t) ∫_0^t e^{−μ(t−s)} ds = (1 − e^{−μt}) / (μt).
Now, let X_1(t) denote the number of customers who arrived in the interval (0, t] who
are still being served at time t. [Note that if there were initially no customers in the
queue then X_1(t) = X(t).] Then given Y(t) = n, X_1(t) has a binomial distribution with
parameters n and p, since the customers are all independent of one another and each has
chance p of still being serviced at time t. Thus, we can conclude that X_1(t) is a thinned
Poisson random variable with parameter
λtp = (λ/μ)(1 − e^{−μt}).
In other words,
P_{0y}(t) = P_0{X(t) = y} = (λtp)^y e^{−λtp} / y!.
Now, suppose that there are x individuals initially in the queue at time 0, and let
X_2(t) be the number of these individuals still being serviced at time t. Then, X_2(t) is

independent of X_1(t) (again, since there are an infinite number of servers) and has a
binomial distribution with parameters x and e^{−μt}, since the chance of being in the queue
at time t given that you were in the queue initially is clearly e^{−μt}.
So, X(t) = X_1(t) + X_2(t) and we can therefore find that
P_{xy}(t) = P_x{X(t) = y} = Σ_{k=0}^{min(x,y)} P_x{X_2(t) = k, X_1(t) = y − k}
          = Σ_{k=0}^{min(x,y)} P_x{X_2(t) = k} P{X_1(t) = y − k}
          = Σ_{k=0}^{min(x,y)} C(x, k) e^{−kμt} (1 − e^{−μt})^{x−k} (λtp)^{y−k} e^{−λtp} / (y − k)!.
Also, note that from the above calculations it is easy to see that E_x[X_1(t)] = λtp =
(λ/μ)(1 − e^{−μt}), while clearly E_x[X_2(t)] = x e^{−μt}, and thus noting that E_x[X(t)] =
E_x[X_1(t)] + E_x[X_2(t)] gives us exactly the expected value that we calculated initially
using the forward equations.
Now, as t → ∞, we know that e^{−μt} → 0, so that the only term in the above sum which
is non-zero in the limit is the one for which k = 0. Thus,
lim_{t→∞} P_{xy}(t) = (λ/μ)^y e^{−λ/μ} / y!.
In other words, in the long run, the number in the queue has approximately a Poisson
distribution with parameter λ/μ, regardless of how many customers were in the queue initially.
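A small simulation of the infinite server queue (a sketch, not part of the original notes; λ = 3, μ = 1, x = 10 are arbitrary values) confirms that the number in service at a large time is approximately Poisson(λ/μ) whatever the initial state.

```python
import numpy as np

rng = np.random.default_rng(7)
lam, mu, x0, t_end, reps = 3.0, 1.0, 10, 20.0, 20_000

in_service_at_end = np.empty(reps, dtype=int)
for i in range(reps):
    # customers initially present: remaining service times ~ Exponential(mu)
    departures = list(rng.exponential(1 / mu, size=x0))
    # arrivals on (0, t_end]: Poisson(lam * t_end) of them, uniformly placed
    arrivals = rng.uniform(0, t_end, size=rng.poisson(lam * t_end))
    departures += list(arrivals + rng.exponential(1 / mu, size=arrivals.size))
    in_service_at_end[i] = sum(d > t_end for d in departures)

print(in_service_at_end.mean(), in_service_at_end.var())   # both ≈ lam/mu = 3
```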

8.10   Long-run Behaviour of Jump Processes

The last example of the preceding section leads us to ask all the same questions about
the long-run behaviour of a Markov pure jump process as we did for a Markov chain. In
this section, we will briefly describe the long-run properties of a Markov jump process. At
the outset, it should seem reasonable that much of the long run behaviour of the jump
process can be gathered from the behaviour of the embedded Markov chain.
Hitting Times and Hitting Probabilities. For jump processes we need a slightly
modified definition of a hitting time from the one we used for a Markov chain. Recall that
the hitting time of a state y for a Markov chain was defined as the first non-zero time
at which the chain was in state y. The idea here was that we did not want to consider a
state to be hit just because the chain started there. Now, for a jump process we want a
similar type of definition; namely, the hitting time of a state y will be the first time that
the process is in state y after it has left its initial state. Thus,
T_y = inf{t > τ_1 : X(t) = y}.
Once we have this definition, however, we can define the hitting probabilities, ρ_{xy}, exactly
as we did for Markov chains, i.e. ρ_{xy} = P_x(T_y < ∞). The properties of recurrence and
transience are also defined as they were for Markov chains, and it turns out that we can
find the values of ρ_{xy}, as well as find the communication classes of the state space S
and designate them as either recurrent or transient, by simply considering the embedded
Markov chain. In other words, if X(t) is a Markov pure jump process on the state space
S and having infinitesimal parameters q_{xy}, then the decomposition of the state space into
recurrent and transient communication classes can be accomplished by simply considering
the decomposition of the state space induced by a Markov chain with transition matrix
given by Q_{xy} = q_{xy}/q_x for y ≠ x and Q_{xx} = 0.
Since the ideas of communication classes carry over from the embedded chain to the
associated jump process, so does the concept of irreducibility. In fact, the only difference
between the concepts associated with the decomposition of the state space between a
Markov pure jump process and its associated embedded Markov chain comes in the consideration
of positive and null recurrence. Recall that we define the mean return time to
a non-absorbing state x as m_x = E_x[T_x], and that a state is positive recurrent if m_x < ∞
and null recurrent otherwise. The reason that we must restrict the above statement to
non-absorbing states in this case is that, technically, for an absorbing state E_x[T_x] = ∞,
since for a jump process started in an absorbing state x we have T_x = ∞. Of course, we
certainly want to consider absorbing states to be positive recurrent, and thus we simply
define this to be the case. It turns out that a state may be positive recurrent for X(t) but
null recurrent with respect to the embedded chain and vice versa. Unfortunately, there
is not a simple and easy method of definitively determining when a positive recurrent
state with respect to the embedded Markov chain will be positive recurrent with respect
to the jump process and vice versa without directly calculating the mean return times
m_x. However, we will see one possible way of determining positive recurrence in the next
section.
Stationary Distributions. A stationary distribution for a Markov pure jump process
is defined in the same way as it was for a Markov chain: π is a stationary distribution if
π_0 = π   ⟹   π_t = π   for all t ≥ 0.
In other words, we need π to satisfy the equations
Σ_{x∈S} π(x) P_{xy}(t) = π(y)   for all y ∈ S and all t ≥ 0.
Differentiating the above equations with respect to t shows that an equivalent characterization
of a stationary distribution is given by
Σ_{x∈S} π(x) P′_{xy}(t) = 0.
Setting t = 0 in this equation gives
Σ_{x∈S} π(x) q_{xy} = 0,
and it turns out that this is also an equivalent (and very useful) characterization of a
stationary distribution.
We can characterize all the stationary distributions of a Markov pure jump process
in much the same way as we did for Markov chains. In particular, a Markov pure jump
process has a unique stationary distribution concentrated on each positive recurrent communication
class C, given by
π^(C)(y) = 1 / (q_y m_y)   for y ∈ C,
and equal to zero elsewhere. Note that this is similar but not identical to the characterization
for Markov chains. The general idea here is that over a long period of time, the
process will make approximately 1/m_y visits to state y per unit of time, since m_y is the
mean return time. Also, the average time spent in state y is just 1/q_y since the amount
of time spent in a state between jumps is exponentially distributed with parameter q_y.
Thus, the proportion of time spent in state y is approximately 1/(q_y m_y). Moreover, a
distribution is stationary if and only if it is a mixture of the unique stationary distributions
concentrated on each of the positive recurrent communication classes. Thus, if a
jump process is irreducible and has a unique stationary distribution [i.e. the equations
Σ_{x∈S} π(x) q_{xy} = 0 have only one solution which is a pmf], then we can conclude that the
chain must have been positive recurrent. This is one way of determining positive recurrence
of a jump process without having to actually calculate the m_x's.
Steady State Distributions. Since Markov pure jump processes occur in continuous
time, there are never any problems with periodicity. Thus, it can be shown that for any
irreducible positive recurrent Markov jump process with unique stationary distribution π,
we have
lim_{t→∞} P_{xy}(t) = π(y)   for all x, y ∈ S.
It therefore follows (using the Bounded Convergence Theorem) that for any initial distribution
π_0, an irreducible positive recurrent Markov pure jump process has
lim_{t→∞} P{X(t) = y} = π(y),
so that the unique stationary distribution is also a steady state distribution.

Examples. We will now examine the long-run behaviour of certain types of birth and
death processes. In general, we know that an irreducible birth and death process will be
transient if the embedded birth and death Markov chain is itself transient. Now, we saw
that an irreducible birth and death Markov chain was transient if and only if
Σ_{y=0}^∞ γ_y < ∞,
where the γ_y's were suitably defined. Using the definition of the γ_y's it is not hard to show
(and is left as an exercise) that a birth and death process is transient if and only if
Σ_{y=1}^∞ (μ_1 μ_2 ··· μ_y) / (λ_1 λ_2 ··· λ_y) < ∞.
Now, the equations Σ_{x∈S} π(x) q_{xy} = 0 which characterize the stationary distributions
become
μ_1 π(1) − λ_0 π(0) = 0,
μ_{y+1} π(y + 1) − (λ_y + μ_y) π(y) + λ_{y−1} π(y − 1) = 0   for y ≥ 1.


y+1 (y + 1) y (y) = y (y) y1 (y 1),
then we can easily iterate to obtain the equations
y+1 (y + 1) y (y) = 0
so that
(y + 1) =

y
(y)
y+1
132

for y 0,

for y 0,

and thus
(y) =
Therefore, if

y=0 y

0 y1
(0) = y (0)
1 y

for y 1.

< , we can define a stationary distribution as


x
(x) = P

y=0 y

which is the only possible solution to the stationary distribution characterizing equations,
so that the process is positive recurrent in this case. Otherwise, the process is null recurrent
and therefore has no stationary distribution.
Notice that if we have a birth and death process on the finite sample space S =
{0, . . . , d}, we can consider it as a birth and death process on the sample space S =
{0, 1, 2, . . .} if we define x = 0 for x d. Thus, the above results are still applicable
and we see that a finite state birth and death process must be positive recurrent (since
x = and x = 0 for x > d), and has a unique stationary and steady state distribution
given by
x
(x) = Pd
,
0 x d.

y
y=0
Example 8.7. We saw that the birth and death rates for the infinite server queue were
given by λ_x = λ and μ_x = xμ. Thus,
θ_x = λ^x / (x! μ^x) = (λ/μ)^x / x!,
and
Σ_{x=0}^∞ θ_x = Σ_{x=0}^∞ (λ/μ)^x / x! = e^{λ/μ}.
Now, clearly, the infinite server queue is irreducible since all states can be reached from
all other states. Thus, since Σ_{x=0}^∞ θ_x < ∞, there is a unique stationary and steady state
distribution given by
π(x) = (λ/μ)^x e^{−λ/μ} / x!,
as we saw before.
Suppose that instead of an infinite number of servers, there are only some number N.
The birth rates for such a process are clearly the same as for the infinite server queue.
However, the death rates are slightly different. Clearly, if there are fewer than N people
currently in the queue then the death rate behaves as it did for the infinite server queue,
so that μ_x = xμ for 0 ≤ x < N. If there are N or more people in the queue, only N
of them can be in the process of being served, and thus the death rate is μ_x = Nμ for
x ≥ N. For the N-server queue, we thus have

  θ_x =  (λ/μ)^x / x!              0 ≤ x < N
         (λ/μ)^x / (N! N^{x−N})    x ≥ N

Now, if we set
K = Σ_{x=0}^∞ θ_x = Σ_{x=0}^{N−1} (λ/μ)^x / x! + Σ_{x=N}^∞ (λ/μ)^x / (N! N^{x−N})
  = Σ_{x=0}^{N−1} (λ/μ)^x / x! + [(λ/μ)^N / N!] Σ_{x=N}^∞ (λ/μ)^{x−N} / N^{x−N}
  = Σ_{x=0}^{N−1} (λ/μ)^x / x! + [(λ/μ)^N / N!] Σ_{y=0}^∞ (λ/(Nμ))^y,
then we can see that the process will be positive recurrent as long as K < ∞, which
happens as long as λ < Nμ. If the process is positive recurrent, then
K = Σ_{x=0}^{N−1} (λ/μ)^x / x! + [(λ/μ)^N / N!] (1 − λ/(Nμ))^{−1},
and the unique stationary and steady state distribution is given by

  π(x) =  (1/K) (λ/μ)^x / x!              0 ≤ x < N
          (1/K) (λ/μ)^x / (N! N^{x−N})    x ≥ N

Of course, if we let N tend towards infinity we will arrive at the results for the infinite
server queue.

Example 8.8. Suppose we run a laundry business and that we own N washing machines. Our facility can only run M machines at a time, and we have a repair shop which
can simultaneously repair up to R machines at a time. Also, suppose that the operating
machines fail independently and have lifetimes which are exponentially distributed with
parameter . Similarly, the repair times of broken machines are independent and exponentially distributed with parameter . Thus, a washing machine may be in any one of
four possible categories:
134

1. Operating;
2. Operable, but not currently operating;
3. Being repaired;
4. Waiting to be repaired.
Let X(t) denote the number of operable machines (i.e. the number in either of the first two
categories above). Notice that once we know X(t), we also know the number of operating
machines, Y (t) = min{X(t), M }, as well as the number of inoperable machines (i.e. those
in the last two categories above), Z(t) = N X(t), and the number of machines currently
being repaired, W (t) = min{Z(t), R}.
Under this scheme, X(t) is a birth and death process on the state space S = {0, 1, . . . , N }
with infinitesimal parameters given by
(
R
0xN R
x =
(N x) N R < x N
(
x
0xM
x =
M
M <xN
Since this a finite state birth and death process, it must be positive recurrent, and thus for
any given values of , , N , M and R, we can calculate its unique stationary and steady
state distribution using the appropriate formulae and assess the operating characteristics
of the system. In particular, we might want to calculate various quantities of interest such
as
Average Machines Operating

= (1) + 2(2) + . . . + M (M )
+M {(M + 1) + . . . + (N )};

Long Run Utilization


Average Machines Operating
Capacity
(1) + 2(2) + . . . + M (M )
=
M
+(M + 1) + . . . + (N ).
=

For instance, these quantities might be calculated for various values of N and R to determine whether it is more beneficial to buy new machines or increase repair capabilities.

135

The general form for the steady state distribution is quite complicated. However, it
does reduce quite nicely in the case where N = M = R. In this case, x = (N x) and
x = x for x = 0, 1, . . . , N . Thus,
  x
N

,
x =

x
so that

N
X
x=0

x =

N   x
X
N

x=0

N


= 1+
.

Thus, the steady state distribution is given by


  x 
N  
x 
N x
N

(x) =
=
1+
x

x
+
+
which we recognize as the binomial distribution with parameters N and /( + ).
Now, suppose that we have 2 machines and we can run both of them together. Also,
suppose we have the capability of repairing only one machine at a time. In other words,
we have M = N = 2 and R = 1. In this case 0 = 1 = , 2 = 0 = 0, 1 = and
2 = 2. Thus,
 2

1
0 = 1;
1 = ;
.
2 =

2
P
from this we see that K = 2x=0 x = 1 + (/) + 0.5(/)2 , and we can calculate the
long-run utilization as a function of the ratio /


2
1X
1

LRU (/) =
x(x) =
1+
.
2 x=0
2K

Suppose we want to expand our store, and we can either buy a spare machine or hire
another repairman. If we hire another repairman, then we would have N = M = R =
2, and we know that the steady state distribution is binomial with parameters 2 and
/( + ). Thus, the long-run utilization can be calculated as
LRU (/) =

/
.
1 + /

On the other hand, if we buy a spare machine, then N = 3, M = 2 and R = 1. In this


case, 0 = 1 = 2 = , 3 = 0 = 0, 1 = and 2 = 3 = 2. From this we can
determine
 2
 3
1
1

0 = 1;
1 = ;
2 =
;
3 =
,

2
4
136

which, after some algebra, leads to a long-run utilization of


LRU (/) =

2(/) + 2(/)2 + (/)3


.
4 + 4(/) + 2(/)2 + (/)3

Using the above calculations we can find the long-run utilization for each of the three
scenarios for various values of the ratio /. Note that / is just the relative rate of
repair to failure, so that a ratio of / = 2 means that on the average we can fix machines
in half the time it takes for them to fail.
Long-Run Utilization
/ N = M = 2, R = 1 N = M = R = 2 N = 3, M = 2, R = 1
0.1
0.5
1
2
10

0.05
0.23
0.40
0.60
0.90

0.09
0.33
0.50
0.67
0.91

0.05
0.25
0.45
0.71
0.98

So, we can see that if the rate of repair is slow compared to the rate of machine failure
(i.e. / is small) then it is more beneficial to add an additional repairman. On the other
hand, if the rate of repair is fast compared to the rate of machine breakdown (i.e. / is
large) then it is more beneficial to buy a spare machine.

This finishes our discussion of Markov pure jump processes. In the final section, we
conclude with a brief look at some very important stochastic processes with continuous
state space [i.e., we allow the X(t)s to be continuously distributed) as well as continuous
time index. The most important of these processes, which occupies the same central role in
stochastic process theory that the normal distribution does in standard statistical theory,
is generally termed Brownian motion and it will be the central process which we shall
investigate.

137

Part IV: Gaussian Distribution and Processes


9
9.1

Gaussian Processes
Univariate Gaussian Distribution

The normal distribution is well-known absolutely continuous distribution, see STAT2001.


We want to extend this class to higher dimensions. We reinforce the univariate case:
Definition 9.1.1. X is called (univariate) Gaussian random variable, whenever the moment generating function (mgf) of X takes on the form (mX (t) =exponential(quadratic in
(t))), that is, there are , 2 R such that for all t R
(9.1)

o
n
1
mX (t) = E[exp(tX)] = exp t + t2 2 .
2

In particular, we call X Gaussian with mean and variance 2 .


and 2 = 1, we call X standard normal rv.

In addition if = 0


The parameters , 2 determine the distribution of X through the mgf of X in (9.1).


Let us verify that for X N (, 2 ) we have = E[X] and 2 =Var(X): taking derivatives
we see that
n
1 2 2 o
0
2
E[X] = mX (0) = ( + t ) exp t + t =
2
t=0
2
00
2
2
2
and, similarly, E[X ] = mX (0) = + such that =Var(X).
Being now identified as a variance, the parameter 2 has to be a nonnegative real
number. If Var(X) = 2 = 0 then X has zero variance, and degenerates to the deterministic random variable X . Otherwise, if 2 > 0 then X is absolutely continuous
(normal distributed in the sense of STAT2001) with mean zero and variance 2 , having
the familiar normal density


fX (x) = (x|, 2 ) = exp (x)2 /2 2 / 2 2 ,

x R.

If X N (0, 1) is standard normal then we have zero mean and unit variance: E[X] = =
0 and variance Var(X) = 2 = 1; the density and cdf are abbreviated to (x) = (x|0, 1)
Rx
and (x) = (y) dy (also see Subsection 2.2 and 2.4).
Additionally assuming a strictly positive 2 , we determine the mgf of the associated
normal density via expanding a square in the exponent under the integral as
Z
Z
1
2 2
tx
(x| + t 2 , 2 ) dx et+( t /2) , t R .
(x|, )e dx =
2
2

138

Plainly, the RHS matches (9.1). In particular, STAT2001-normal distribution equals


Gaussian distribution with existing (!) density.
Affine transformations. Recall (x|, 2 ) = ((x )/)/; also if X N (, 2 ) and
2 > 0 then (X)/ N (0, 1). This very important rule, called standardisation, allows
us to express probabilities concerning the normal distribution with general and 2 > 0
through standard normal rvs, see STAT2001.
The class of normal distributions is closed under linear and affine transformations:
if X N (2, 1) then 3X 1 N (3 21, 32 1) = N (5, 9) ,

if X N (2, 1) then (X 1)/ 2 N (0, 1) .


Its extension, the class of univariate Gaussian distributions in Definition 9.1.1, is much
better behaved:
if X N (, 2 ) and a, b R then aX + b N (a + b, a2 2 ),
and this includes cases where no density exists ( 2 = 0 or/and a = 0).
We have to (!) verify this rule using mgf s: for X N (, 2 ) and a, b R
o
n
n
1 2 2 2o
1
2 2
maX+b (t) = e mX (at) = e exp (at) + (at) = exp t(b + a) + (a )t ,
2
2
bt

bt

valid for all t R, giving the desired result: aX + b N (a + b, a2 2 ).


d
As a by-product we note that X = + Z for any X N (, 2 ), R, 2 0, Z
N (0, 1). This observation is significant in at least two ways: (i) simulation of the general
X based on Z for given mean R and nonnegative variance 2 ; (ii) characterisation
of the univariate Gaussian distribution in Definition 9.1.1: in distribution the univariate
Gaussian class is exhausted by applying affine transformations to a standard normal
random variable.

9.2

Bivariate Gaussian Distribution

We recall that the joint mgf of a bivariate vector X = (X1 , X2 ) is given by


mX (t) = mX1 X2 (t1 , t2 ) = E[exp {t1 X1 + t2 X2 }] = E[exp {t0 X}] ,

t = (t1 , t2 )0 R2 .

If mX is finite in an open neighborhood of zero in the plane R2 then mX determines the


joint distribution of X1 , X2 , that is the distribution of the vector (X1 , X2 ) R2 . We use
this fact to introduce the bivariate Gaussian distribution:
139

Definition 9.2.1. A bivariate (column) vector X = (X1 , X2 )0 R2 is called (bivariate/jointly)


Gaussian vector, or alternatively, its components X1 , X2 are called jointly Gaussian random variables, whenever its joint mgf mX1 X2 takes on the form exponential(quadratic
!

11
12
in (t)), that is, there are = (1 , 2 )0 R2 and a symmetric =
R22
12 22
such that for all t = (t1 , t2 )0 R2
n
1 0 o
0
(9.2)
mX (t) = mX1 ,X2 (t1 , t2 ) = exp t + t t .
2
In particular, X is called Gaussian vector with mean vector and covariance matrix .
Expectation vector & covariance matrix. Plainly, the distribution of X in Definition 9.2.1 is determined by and through the mgf in (9.2). As in the univariate case,
we justify the In particular,-extension of our definition, that is providing probabilistic
interpretation of and .
First we expand the matrix multiplications in (9.2) to see that



1
2
2
mX1 X2 (t1 , t2 ) = exp 1 t1 +2 t2 +
11 t1 + 212 t1 t2 + 22 t2
.
2
We differentiate both sides with respect to t1 at t1 = t2 = 0 to get

i
h d
d
t1 X1 +t2 X2
e
=
E[et1 X1 +t2 X2 ]|t1 =t2 =0
E[X1 ] = E

dt1
dt1
t1 =t2 =0
n
o
1
d
exp 1 t1+2 t2 + 11 t21+12 t1 t2+21 t2 t1+. . . |t1 =t2 =0
=
dt1
2


t2
= 1 + t1 11 + (12 + 21 ) mX1 ,X2 (t1 , t2 )|t1 =t2 =0 = 1 ,
2
giving E[X1 ] = 1 and E[X2 ] = 2 the latter by reversing the roles of X1 and X2 .
Taking partial derivatives of second order we arrive at


d d t1 X1 +t2 X2
d d
E[X1 X2 ] = E
e
|t1 =t2 =0 =
E[et1 X1 +t2 X2 ]|t1 =t2 =0
dt1 dt2
dt1 dt2


d
t2
1 + t1 11 + (12 + 21 ) mX1 ,X2 (t1 , t2 )|t1 =t2 =0
=
dt2
2
= = 12 + 1 2 ,
giving 12 =Cov(X1 , X2 ); similarly, 11 =Var(X1 ) and 22 =Var(X2 )
We call and expectation vector and covariance matrix of X, respectively, having
verified above that
!
!
!
!
1
E[X1 ]
11 12
Var(X1 )
Cov(X1 , X2 )
=
=
,
=
=
.
2
E[X2 ]
12 22
Cov(X1 , X2 ) Var(X2 )
140

There are no such restrictions on the choice of = (1 , 2 ). However, must be a


symmetric matrix: 12 = 21 . Besides this, must be nonnegative definite, because
x0 x = Var(x0 X) 0 ,

for all x R2 .

It is straight forwardly to determine whether a given matrix is symmetric. To verify


that such a matrix is also nonnegative definite one has to determine that the associated
eigenvalues are all nonnegative, a criterion for symmetric matrices, that works also in high
dimensions (alternatively check minors, see WIKIPEDIA).
In the bivariate setting, there exists a simple condition for nonnegative definiteness:
simultaneously, nonnegative diagonal elements 11 , 22 0 and nonnegative determinant
det() := 11 22 12 21 0.
Density. For X = (X1 , X2 )0 N (, ) with strictly positive determinant det( > 0).
Particularly, is invertible (non-singular) with X1 , X2 being jointly absolutely continuous
with density


22 (x1 1 )2 212 (x1 1 )(x2 2 )+11(x2 2)2
1
p
exp
fX1 X2 (x1 , x2 ) =
2det()
2 det()


1
1
p
=
exp (x )0 1 (x) , x = (x1 , x2 ) .
2
2 det()
(We can verify this mimicking the mgf approach in the univariate case.)
Singular case. If det = 0 (equivalently, is not invertible) then X may degenerates to
a constant P(X = ) = 1 or live on a one-dimensional subspace (there is x R2 such that
P(x0 X = x0 ) = 1 for some x R2 . In other words, X lives in a strictly lower dimensional
affine subspace of R2 . The point emerges: introducing the Gaussian distribution via mgf s
allows us to unify all these cases under a common roof.
Independence. Assume X = (X1 , X2 )0 N (, ) (both cases singular/invertible
possible). Suppose that X1 and X2 are uncorrelated: 12 = 11 = 0 with degenerating
to a diagonal matrix. In particular, the joint mgf factorises into marginal mgf s:
(
)

1
2
2
mX1 X2 (t1 , t2 ) = exp 1 t1 +2 t2 +
11 t1 + 22 t2
= mX1 (t1 )mX2 (t2 ) ,
2
for all t1 , t2 R. Consequently, X1 and X2 must be independent. To summarise, in
the setting of jointly Gaussian distributions, being independent is equivalent of being
uncorrelated.
Conditional distributions. Let (X1 , X2 )0 N (, ). We are interested to derive the
conditional distribution of X1 given X2 = x2 .
141

To avoid trivialities, we suppose that Var(X2 ) 6= 0. Next, determine R such that


!

0 = Cov(X1 X2 , X2 ) = Cov(X1 , X2 ) Var(X2 )

= Cov(X1 , X2 )/Var(X2 ) .

As seen in the previous paragraph, by choice of , X1 X2 and X2 are independent.


With = Cov(X1 , X2 )/Var(X2 ), decomposing X1 = (X1 X2 ) + X2 and conditioning
on X2 yields
E[X1 |X2 ] = E[X1 X2 |X2 ] + E[X2 |X2 ] = E[X1 X2 ] + X2
Cov(X1 , X2 )
(X2 E[X2 ]) .
= E[X1 ] +
Var(X2 )
By replacing X2 with x2 we find that
E[X1 |X2 = x2 ] = E[X1 ] +

Cov(X1 , X2 )
(x2 E[X2 ]) .
Var(X2 )

The RHS is an affine, thus easy to compute, function in x2 which gives rise to the slogan
that the best linear predictor equals the conditional expectation, the best predictor in the
mean square sense.
To determine the conditional variance, note X12 = (X1X2 )2 +2X2 (X1X2 )+ 2 X22
E[X12 |X2 ] = E[(X1 X2 )2 |X2 ] + 2E[X2 (X1 X2 )|X2 ] + 2 E[X22 |X2 ]
= E[(X1 X2 )2 ] + 2X2 E[X1 X2 ] + 2 X22 ,
such that
Var(X12 |X2 )

E[X12 |X2 ]E[X12 |X2 ]

Cov2 (X1 , X2 )
= Var(X1 X2 ) = Var(X1 )
.
Var(X2 )

Again peculiar to the jointly Gaussian setting, the conditional variance turns out to be
deterministic, and there is no dependence on X2 :
Var(X1 |X2 = x2 ) = Var(X1 |X2 ) = Var(X1 )

Cov2 (X1 , X2 )
.
Var(X2 )

Finally, with as above, write Y := X1X2 such that X1 = Y + X2 is a decomposition


into two independent Gaussian random variables. As a result, the conditional distributions
must be Gaussian, more specifically,
X1 |(X2 = x2 ) = Y + x2 N (x2 + E[Y ], Var(Y )) .
!

or, slightly rephrased, X1 |X2 N (X2 + Y, Var(Y )) with determined by 0 = Cov(X1


X2 , X2 ), where the prediction error Y = X1 X2 is a Gaussian rv, independent of X2 .
(Resolving the remaining case Var(X2 ) = 0, this is left to the reader.)
142

9.3

Multivariate Gaussian Distribution

We extend the definition from two two higher dimensions d. Again, this definitions make
sense as the multivariate mgf determine the multivariate joint distribution:
Definition 9.3.1. A d-dimensional (column) vector X = (X1 , X2 , . . . Xn )0 Rd is called
(jointly) Gaussian (vector), or alternatively, its components X1 , X2 , . . . , Xn are called
jointly Gaussian random variables, whenever the joint mgf mX1 X2 ...Xn takes on the form
exponential(quadratic in (t)), that is, there are = (1 , . . . , d )0 Rd and symmetric
= (kl )1k,ld Rdd such that for all t = (t1 , t2 , . . . , td )0 Rd


1 0
0
(9.3)
mX1 X2 ...Xd (t1 , t2 , . . . , td ) = exp t + t t .
2
In particular, X is called Gaussian vector with mean vector and covariance matrix .
Using exaclty the same methods as in the uni- and bivariate setting, by taking partial
derivatives of the first and second order, we find
k = E[Xk ], ,

k,l = Cov(Xk , Xl ) ,

1 k, l d .

In complete analogy to the low dimensional setting, and determine the distribution of
X through (9.3). Again, we call and are the expectation/mean vector and covariance
matrix, respectively, of X.
Any column vector R2 can be taken as a mean vector of a Gaussian vector in Rd .
However, as a covariance matrix has to be a symmetric and nonnegative d-dimensional
square matrix.
Affine Transformations. Let b Rm and T Rmd be a given (deterministic) vector
and matrix, respectively. Then x 7 b + T x defines affine transformation from Rd into Rn .
For = (1 , . . . , d )0 Rd , = (k,l ) Rdd and X = (X1 , . . . , Xd )0 N (, ) we
have
(9.4)

b + T X N (b + T , T T 0 ) .

In other words, the class of multivariate distribution is closed under deterministic affine
transformations. The relation in the last display follows from considerations about the
corresponding mgf s, similar as in the univariate case. (It is advised to derive (9.4) for a
bivariate setting, for instance, where d = n = 2).
Marginal distributions. Let = (1 , . . . , d )0 Rd , = (k,l ) Rdd and X =

143

(X1 , . . . , Xd )0 N (, ). Pick k {1, . . . , d} and consider the kth canonical basis vector:
ek = (0, . . . , 0, 1, 0, . . . , 0)0 = (1l=k ) Rd . In view of (9.4) we must have
Xk = e0k X N (e0k , e0k ek ) = N (k , kk ) .
The components of a Gaussian vector are thus univariate Gaussian random variables with
expectations and variances stored in and the diagonal of , respectively.
Simulation. Let = (1 , . . . , d )0 Rd and = (k,l ) Rdd . We aim to construct
a random vector X such that X N (, ) based on a sample of iid standard normal
random variables Z1 , . . . , Zd .
The minimal requirement is to assume that the given matrix is a valid covariance matrix, a symmetric and nonnegative d-dimensional square matrix. Such a matrix admits
a spectral decomposition (see WIKIPEDIA): There are square matrices Q, D Rdd
such that = QDQ0 , where D = (Di,k )1i,kd is a diagonal matrix and Q is an orthogonal
matrix: QQ0 = Q0 Q = I. The diagonal of D contains necessarily nonnegative eigenvalues
p
of . Consequently, we may introduce D1/2 componentwise as D1/2 := ( Dk,l )1k,ld .
Such prepared, let Z = (Z1 , . . . Zd ) where Z1 , . . . , Zd N (0, 1) are independent univariate standard normal rvs. It is straightforwardly verified that Z has mgf mZ1 ...Zd (t) =
exp(t0 t/2) t Rd and thus Z N (0, I) (is d-dimensional standard normal).
Introduce X := + QD1/2 Z. By (9.4), X N (, QD1/2 I(QD1/2 )0 ) = N (, ).
(Rephrase this saying that any d-dimensional Gaussian distribution occurs as deterministic affine transformation of the d-dimensional standard normal distribution.)
Density. Existence of a density is restricted to invertible . Otherwise, if det() = 0
then X lives on strict affine subspaces of Rd .
For = (1 , . . . , d )0 Rd , invertible = (k,l ) Rdd the random vector X =
(X1 , . . . , Xd )0 N (, ) is admits a density
 1

1
p
(9.5) fX (x) = fX1 ,...,Xd (x1 , . . . , . . . , xd ) =
exp (x)0 1 (x) .
2
(2)d/2 det()
Independence. Let X N (X , X ) and Y N (Y , Y ), and assume the column vector
Z := (X 0 , Y 0 )0 is a Gaussian vector (X, Y are jointly Gaussian vectors). In particular,
we
!
0
X CXY
may write Z N (Z , Z ) with Z = (0X , 0Y )0 and Z :=
(here the
CXY Y
matrices CXY contain covariances between components of X and Y , respectively).
As in the bivariate setting, X and Y are independent whenever they are uncorrelated,
that is CXY degenerates to a null matrix.
Conditional distributions. Using analogous considerations as in the bivariate case we
144

give formulae for conditional distributions. First determine b by solving the system of
!
linear equations Y b0 = CXY (any solution will do). Then we have
X|Y =y = E[X] + b(y E[Y ]) ,
X|Y =y = X bCXY ,
X|Y = y N (X|Y =y , X|Y =y ) .

3 2 1

Example 9.1. Let (X, Y 0 )0 N (0, C), X R, Y R2 , C = 2 2 1 .


1 1 1
(a) Determine the distribution of X and Y .
(b) Does the joint vector (X, Y 0 )0 admits a density? If so determine it.
(c) Determine the conditional distribution X|Y = y.
(d) Determine the conditional probability that X 1, provided Y = 0.2.
!!
2 1
Solution: (a) X N (0, 3) and Y N 0,
.
1 1
(b) We determine the determinant as
!
!
!
2 1
2 1
2 2
det(C) = 3det
2det
+ 1det
= 1.
1 1
1 1
1 1

1 1 0

with inverse C 1 = 1 2 1 (verify CC 1 = I).


0 1 2
0 0
In particular, (X, Y ) admits a density. Recalling that the joint vecor has mean zero,
we get from (9.5) that, for x, y1 , y2 R,
1
1
21 (x,y1 ,y2 )C 1 (x,y1 ,y2 )0
e
=
exp(0.5x2 y12 y22 +xy1 +y1 y2 ) .
1/2
3/2
1/2
3/2
8
8
!
1
1
0
(c) Here Y is invertible with inverse 1
. In particular, b = CXY
1
Y =
Y =
1
2
!
1 1
(2, 1)
= (1, 0) and thus X|Y =y = X + b(y Y ) = (1, 0)y = y1 and
1
2
!
2
X|Y =y = 3 (1, 0)
= 1. Conclusion: X|Y = y N (y1 , 1).
1
(d) Let Z N (0, 1). By Part (c), X|Y = 0.2 N (0.2, 1) and thus P(X 1|Y = 0.2) =
P(0.2 + Z 1) = 1 (0.8) = 0.710308447.

fXY1 Y2 (x, y1 , y2 ) =

145

9.4

Gaussian Processes and Brownian Motion

As noted at the end of the last section, we shall now briefly introduce some important
stochastic processes with continuous state space as well as continuous time index. Specifically, we shall be interested in so-called Gaussian processes:

Definition 9.4.1. A stochastic process, {X(t)}t[0,) , is called a Gaussian process if the


joint distribution of any k constituents of the process, {X(t1 ), . . . , X(tk )}, is multivariate
Gaussian for any collection of time values 0 t1 . . . tk .
Since multivariate Gaussian distributions are completely determined by the expectations and covariances their constituent components, we can completely determine the
behaviour of a Gaussian process by specifying the mean function, X (t) = E[X(t)], and
the covariance function, rX (s, t) = Cov{X(s), X(t)}.
Examples (i) Linear function with standard normal slope. Let A N (0, 1). Setting
X(t) := At for t 0 defines a real-valued stochastic process {X(t) : t 0}. Since the
expression defining this process is linear and deterministic in the Gaussian variable A,
that is a 7 ta is a linear and deterministic mapping, it is easy to show that {X(t)}
Gaussian process. Finally, X (t) = tE[A] 0 and rX (s, t) = stE[A2 ] = st.
(ii) Stationary Gaussian process Let A, B N (0, 1) independent, and define a real-valued
process {X(t) : t 0} by setting X(t) := A cos(t) + B sin(t) (trigonometric function with
random coefficients). Again the mapping (a, b) 7 a cos(t) + b sin(t) is deterministic and
linear such that {X(t)} defines Gaussian process. As is straightforwardly verified we have
X (t) 0 and rX (s, t) = cos(s) cos(t) + sin(s) sin(t) = cos(t s). The latter calculation
shows that rX (s, t) = rX (|t s|) indicates stationarity of the process.

The most important Gaussian process is generally known as Brownian motion (also known
as the Wiener process, after the mathematician Norbert Wiener who, along with Paul
Levy, developed much of the fundamental theory for this process). Generally, it is defined
as follows

Definition 9.4.2. A stochastic process {W (t)}t[0,) is called Brownian motion, if it


satisfies the three basic properties:
(i) W (0) = 0;
(ii) For any times 0 s t, W (t)W (s) is Gaussian with mean 0 and variance 2 (ts)
for some given constant 2 ; and
146

(iii) For any times 0 t1 t2 tk , the random variables W (tk ) W (tk1 ), . . .,


W (t2 ) W (t1 ) are all mutually independent. This requirement further implies that events
regarding the behaviour of the process which occur in non-overlapping intervals of the time
index set are independent of one another;
(iv) sample paths are continuous.
[ASIDE: Note the strong similarity between this definition and that of a homogeneous
Poisson process.]
To see that this definition leads to a Gaussian process, we shall employ two methods to
examine the joint distribution of W (s) and W (t) for two times 0 < s < t (0 s t)
(it is possible to extend each of these methods to cover the general case t1 < t2 < . . . tk
(t1 t2 . . . tk )).
Method I. Since W (s) = W (s) W (0) is normally distributed with mean 0 and variance
2 s, we see that the density function of W (s) is given by:


v2
1
exp 2 .
fW (s) (v) =
2 s
2s
Similarly, the random variable Z = W (t) W (s) is normally distributed with mean 0 and
variance 2 (t s), so that its density is given by:


z2
1
exp 2
fZ (z) = p
.
2 (t s)
2(t s)
Thus, we can write the joint density function for W (s) and W (t) [using the change of
variable formula in Subsection 3.4, and noting that the required Jacobian factor in this
dW (t)
case is one since W (t) = Z + W (s) implies that dW
= dWdZ(t) = 1.], as:
(s)



fW (s),W (t) (v, w) = fW (s),Z (v, w v)

dW (s)
dW (s)
dW (t)
dW (s)

dW (s)
dZ
dW (t)
dZ

= fW (s) (v)fZ (w v)




1
v2
1
(w v)2

p
exp 2
exp 2
=
2 s 2(t s)
2 (t s)
2s


1
v 2 (t s) + s(w v)2
p
=
exp
2 2 s(t s)
2 2 s(t s)


1
v 2 (t s) + s(w v)2
p
=
exp
2 2 s(t s)
2 2 s(t s)


1
tv 2 2svw + sw2
p
=
exp
2 2 s(t s)
2 2 s(t s)
147

where the second equality follows from the independence of W (s) and Z = W (t) W (s).
Now, this density is easily recognised as that of a bivariate normal distribution, which has
the general form:

 2
y (x x )2 2xy (x x )(y y ) + x2 (y y )2
1
,
(x, y) = p 2 2
exp
2 2 2)
2
2(xy
2 x y xy
x y
for appropriate mean parameters x , y , variance parameters x2 > 0, y2 > 0, and covariance parameter x y < xy < x y . In the current case, we have x = y = 0, y2 = 2 t,
x2 = xy = 2 s. So, we have seen that the joint distribution of W (t1 ) and W (t2 ) is bivariate normal, and a straightforward extension of the given argument shows that such
a result continues to hold for any collection W (t1 ), . . . , W (tk ); that is, the collection will
have a multivariate normal distribution.
Method II. Let us determine the joint distribution of W (s) and W (t) for two times
0 s t using mgf s. Again we start from the observation that W (s), W (t) W (s)
are independent Gaussian random variables, with mean 0 and variances 2 s, 2 (t s) for
some given constant 2 . As a result, for u, v R
mW (s),W (t) (u, v)
= E[euW (s)+vW (t) ] = E[euW (s)+v(W (t)W (s)+W (s)) ] = E[e(u+v)W (s)+v(W (t)W (s)) ] (algbra)
= E[e(u+v)W (s) ] E[etv(W (t)W (s)) ] (independence)


1
1
2
2
2 2
= e 2 s (u+v) ) e 2 (ts) v = exp 2 (su2 + 2suv + tv 2 )/2 ,
from which we conclude that (W (s), W (t))0 N (0, ) with = 2

s s
s t

!
.

In other words, Brownian motion is indeed a Gaussian process. Moreover, we shall


calculate its mean and covariance functions shortly.
However, to motivate our discussions, we shall first investigate how Brownian motion
can be seen as a natural extension of some simple Markov chains. We shall then discuss
some important properties and extenstions of Brownian motion, including the process
generally known as white noise.

9.5

Brownian Motion via Random Walks

We start by defining {Xn }n0 to be a simple symmetric random walk. In other words,
{Xn }n0 is a Markov chain with state space equal to the set of all integers (both positive

148

and negative) and transition matrix given by

2 y =x1
1
P (x, y) =
y =x+1
2

0 otherwise
so that in each unit of time, the chain moves up or down one integer with equal probability.
Now, if we set X0 = 0, then we can write the Markov chain as
Xn =

n
X

Yk ,

k=1

where the Yk s are independent random variables each having a distribution function
determined by
1
P(Yk = 1) = P(Yk = 1) = .
2
It is easy to calculate
E[Yk ] = 0;
Var(Yk ) = 1.
Thus, we can see that
X

n
Yk = 0
E0 [Xn ] = E
k=1

Var0 (Xn ) = Var

X
n


Yk

k=1

n
X

Var(Yk ) = n.

k=1

[NOTE: This implies that after n steps, an object performing a simple symmetric random

walk should be within about 2 n steps from where it started.]


Now, suppose that instead of steps of size one, the process moved in steps of size x.
In addition, suppose that instead of making a jump in a unit of time it made a jump
every t units of time. If we change our notation slightly so that this new random walk
is represented as X(t), then we can write
t/t

X(t) =

Yk ,

k=1

where now the Yk s are independent random variables with distribution determined by
P(Yk = x) = P(Yk = x) =

1
2

For this new definition of the Yk s it is easy to see that


E[Yk ] = 0;

Var(Yk ) = (x)2 .
149

Thus, we can see that


 t/t
X 
E0 [X(t)] = E
Yk = 0
k=1

 t/t
X  t/t
X
(x)2
.
Var0 {X(t)} = Var
Yk =
Var(Yk ) = t
t
k=1
k=1
Obviously, we would like to explore what happens as both x and t tend towards 0.
However, just as in the argument we used to see how the Poisson distribution arose from
the binomial (see page 110), we need to be careful how we let these two quantities tend
towards zero. In particular, we need to let them tend to zero in such a way that the
variance of X(t) does not become either zero or infinity. We can ensure this if we choose
to let

x = t,
for some constant . Notice that if we make this choice then the variance of X(t) will
always be t 2 as x and t tend to 0.
Having made this choice for how we will let x and t tend to 0, we now want to
investigate the density function of X(t) in the limit. To do this, we start by defining the
function
P(x < X(t) x + x)
f (x, t) =
.
x
Note that in the limit as x tends toward 0, f (x, t) tends toward the probability density
function of X(t) [and we further note that, as t tends to zero, X(t) becomes a sum of an
increasingly large number of independent random quantities, and so we strongly suspect
that the density function f (x, t) will approach that of a normal distribution]. Now, in order
for X(t) (x, x + x] it must have been the case that either X(t t) (x x, x] and
jumped up by x or else X(t t) (x + x, x + 2x] and jumped down by x. Thus,
P(x < X(t) x + x) =

1
P(x x < X(t t) x)
2
1
+ P(x + x < X(t t) x + 2x),
2

which implies that


1
1
f (x, t) = f (x x, t t) + f (x + x, t t).
2
2
Recall that Taylor expansion shows that for small values of h1 and h2 , we have

f (x, t) + h2 f (x, t)
x
t
2
1 2
2
1 2 2
+ h1 2 f (x, t) + h1 h2
f (x, t) + h2 2 f (x, t).
2 x
xt
2 t

f (x + h1 , t + h2 ) f (x, t) + h1

150


Thus, using this fact and the fact that x = t, we have
1

1
f (x, t) x f (x, t) t f (x, t)
2
2
x
2 t
2

2
1
2
1
1
+ (x)2 2 f (x, t) + xt
f (x, t) + (t)2 2 f (x, t)
4
x
2
xt
4
t
1

1
+ f (x, t) + x f (x, t) t f (x, t)
2
2
x
2 t
2

2
2
1
1
1
f (x, t) + (t)2 2 f (x, t)
+ (x)2 2 f (x, t) xt
4
x
2
xt
4
t
2
2

1
1
= f (x, t) t f (x, t) + (x)2 2 f (x, t) + (t)2 2 f (x, t)
t
2
x
2
t
2
2

1 2
1

2
= f (x, t) t f (x, t) + t 2 f (x, t) + (t) 2 f (x, t).
t
2
x
2
t

f (x, t)

A rearrangement of terms then shows that

2 2
1
2
f (x, t) =
f
(x,
t)
+
t
f (x, t),
t
2 x2
2 t2
and when t 0 we see that the probability density function for X(t) must satisfy the
partial differential equation
2 2

f (x, t) =
f (x, t).
t
2 x2
This partial differential equation is a very famous one, known as the one-dimensional heat
equation, since it is used to describe how heat flows over time within a one-dimensional
object, such as a metal wire. It turns out that if we assume that X(0) = 0, then the
solution to the heat equation is given by
f (x, t) =

1
2 2 t

x2

e 22 t .

In other words, {X(t)}t[0,) is a stochastic process for which the distribution of X(t)
is Normal(0, 2 t). Also, we note that since the process was constructed as the limit of a
Markov chain with independent increments, if t > s u > v, then the random variables
X(v) X(u) and X(t) X(s) will be independent. Finally, since Markov chains satisfy
the restart property from fixed times, we can see that, for any s < t, the distribution of
X(t) X(s) will be the same as the distribution of X(t s) X(0) = X(t s). In other
words, X(t) X(s) has a normal distribution with mean zero and variance 2 (t s). Of
course, this is exactly the specification of Brownian motion. As a final note, when = 1,
the process is generally referred to as standard Brownian motion.
[ASIDE: Notice that the distribution of X(t) has expectation zero for all times t. In other
151

words, on the average, the process does not move. If we had constructed a limiting process
from a non-symmetric simple random walk which steps up with probability p and down
with probability 1 p, a little extra care in the calculations shows that we arrive at a
stochastic process for which the distribution of X(t) is Normal(t, 2 t) for some constant
which depends on our initial choice of p. Such a process is called Brownian motion with
drift, since the expectation of X(t) is t so that the process tends to drift of towards
positive or negative infinity on the average. (Note that will be positive if we choose
p > 0.5 and negative if we choose p < 0.5)].
Basic Properties of Brownian Motion. Since we have seen that the Wiener process,
W (t), is a Gaussian process, it remains only to characterise it by calculating its mean and
variance function. Now, clearly, we have W (t) = E[W (t)] = 0 and, for t s 0,
rW (s, t) = Cov{W (s), W (t)} = Cov(W (s), W (t) W (s)) + Cov(W (s), W (s))
= 0 + Cov(Ws ) = 2 s,
where the penultimate equality follows from the independence of non-overlapping time
intervals, and the final equality follows from the definition of the Wiener process, where
the variance of W (t) was determined. In short,
rW (s, t) = 2 min{s, t}
Reflexion Principle. We now examine some basic properties of the Wiener process.
First, note that, for any 0 s < t, we have P{W (t) W (s) 0} = 12 , since W (t)
W (s) is normally distributed with mean zero. Moreover, since W (s) = W (s) W (0) and
W (t) W (s) are independent random quantities (since they deal with non-overlapping
time intervals), we see that:
1
P{W (t) a|W (s) = a} = P{W (t) W (s) 0|W (s) = a} = P{W (t) W (s) 0} = .
2
Thus, for a 6= 0, if we define Ta = min{t 0 : W (t) = a} to be the first time that the
process W (t) is hits the level a, then, by the continuity of the sample paths of BM, we
clearly have P{W (t) a|Ta = s} = 0 for any s > t and
1
P{W (t) a|Ta = s} = P{W (t) W (s) 0|Ta = s} = P{W (t) W (s) 0} = ,
2
for any s t, where we have again used the independence of non-overlapping time intervals
to infer that {Ta = s} [which is an event dealing with the time interval from 0 up to s]
and {W (t) W (s) 0} [which is an event dealing with the time interval s up to t] are
152

independent events provided that s t. Finally, then, denoting the density function of
Ta by fTa (s), we have
Z
Z
1
1 t
fTa (s)ds = P{Ta t}.
P{W (t) a} =
P{W (t) a|Ta = s}fTa (s)ds =
2 0
2
0
In other words, we have:



a
P{Ta t} = 2P{W (t) a} = 2 1
,
t
where (x) is the CDF of the standard normal distribution. Moreover, it is not difficult
to see that the event {Ta t} is equivalent to the event {max0st W (s) a}, since if
the first time that W (s) hits a occurs at a time earlier than t, clearly the maximum value
of the process W (s) in the interval [0, t] must be at least a. Thus, we can determine the
CDF of maximal process, M (t) = max0st W (s), as:
FM (t) (m) = P{M (t) m} = 1 P{M (t) > m} = 1 P{M (t) m} = 1 P{Tm t}


m
= 2 1,
t
where we note that P{M (t) > m} = P{M (t) m}, since M (t) is clearly a continuous
random variable, being based on W (s) which is normally distributed.
Boundary Crossing Probabilities of BM with drift. As noted in the aside at the end
of the previous sub-section, we might wish to consider Brownian motion with a drift. In
2
this instance, we can define W , (t) = W (t)+t, where W (t) is the usual Wiener process
with variance 2 . This new process clearly satisfies all the criteria of Brownian motion,
2
except that E[W , (t)] = t rather than 0. We close this section by stating without
proof (though the proof can readily be derived from the connections between Brownian
motion and random walks) an important property regarding so-called boundary crossing
probabilities. Specifically, suppose that a < 0 < b and that we are interested in the
2
probability that the process W , (t) achieves a value greater than b before it achieves
any value less than a (i.e., it crosses the upper boundary before it crosses the lower
2
2
boundary). In other words, for all x 6= 0, define Tx, = inf{t [0, ) : W , (t) = x} to
2
be the first time that the process W , (t) hits level x, then it can be shown that:
2

2
P(Tb,

<

2
Ta, )

e2a/ 1
= 2a/2
,
e
e2b/2

provided 6= 0. An application of LHospitals Rule ( 0) then shows that when = 0,


2
we have [W 0, = W is BM with variance 2 ]
a
P(Tb < Ta ) =
,
ab
153

a
.
so that the probability that the Wiener process exceeds b before it is less than a is just ab
, 2
increases to
Finally, we note that, clearly as a decreases towards negative infinity, Ta
infinity (since the time necessary for the process to be lower than extremely large negative
as will clearly increase accordingly). Thus, taking the limit of our probability as a tends
2
to negative infinity, yields the probability that the process W , (t) ever exceeds b:
(
2
2
1
if 0
2
P(Tb, < ) = lim P(Tb, < Ta, ) =
2
2b/
a
if < 0
e

where the case of 0 again requires an appeal to LHospitals Rule.

9.6

Brownian Bridge

We now examine some related stochastic processes which derive from specific transformations of the Wiener process. Indeed, we have already seen one such process in the previous
sub-section; namely, M (t), the maximal process of W (s). As a first extension, we wish
to examine the behaviour of standard Brownian motion, W (t) with = 1, over the time
interval [0, 1] conditional on the fact that we know W (1) = 0. The idea here is that we
imagine having observed a Brownian motion process which has returned to zero (recall
that Brownian motion processes are assumed to start at zero), and we want to investigate
the behaviour of the process up to that point. [NOTE: The choices of standard Brownian
motion (i.e., = 1) and the unit time interval are not critical, and extensions to the more
general case are straightforward.]
First, we note that this new process, Z(t) = W (t)|W (1) = 0 for 0 t 1, generally
known as the Brownian bridge process since it is anchored at zero at either end of its time
range, is indeed a Gaussian process. To see this, we examine the distribution of Z(t) to
find that its density function, which is just the conditional density function of W (t) given
that W (1) = 0, is given by:
fW (t),W (1) (z, 0)
fW (1) (0)


1
1
z2
1 0

e
=
exp
2(t2 t)
2 t t2
2


2
1
z
= p
exp
2(t t2 )
2(t t2 )

fZ(t) (z) = fW (t)|W (1) (z|0) =

where we have used the fact that the W (t) is itself a Gaussian process, meaning that W (1)
is normally distributed with mean 0 and variance 1, while W (t) and W (1) are jointly
normally distributed with means E[W (t)] = E[W (1)] = 0, variances V ar{W (t)} = t,
154

V ar{W (1)} = 1 and covariance Cov{W (t), W (1)} = t. So, we see that Z(t) indeed has
a normal distribution with mean E[Z(t)] = 0 and variance V ar{Z(t)} = t(1 t). Again,
straightforward extensions of this calculation show that the collections {Z(t1 ), . . . , Z(tk )}
for 0 t1 < . . . < tk 1 are jointly normally distributed, implying that Z(t) is a
Gaussian process. Moreover, we can determine the mean function directly from our density
calculation as E[Z(t)] = 0. Finally, the covariance function can be calculated as follows:
rZ (s, t) = Cov{Z(s), Z(t)} = Cov{W (s), W (t)|W (1) = 0}
= E[W (s)W (t)|W (1) = 0] E[W (s)|W (1) = 0]E[W (t)|W (1) = 0]
= E[W (s)W (t)|W (1) = 0],
where we have used the fact that E[W (t)|W (1) = 0] = E[Z(t)] = 0. Now, let us assume
for the moment that s t [if not, we can of course simply reorder the inputs since
Cov(X, Y ) = Cov(Y, X)] and apply the conditional version of the law of the iterated
expectation [which states that E[X|Y ] = E[E(X|Z, Y )|Y ], in clear analogy with the usual
law of the iterated expectation which states that E[X] = E[E[X|Y ]]] to yield:
rZ (s, t) = E[W (t)E[W (s)|W (t), W (1) = 0]|W (1) = 0]
= E[W (t)E[W (s)|W (t), W (1) W (t) = W (t)]|W (1) = 0]
= E[W (t)E[W (s)|W (t)]|W (1) = 0],
where we have also used the fact that, whenever s t 1 as we have assumed, the
random quantities W (s) and W (t) are both independent of W (1) W (t), since they refer
to events in non-overlapping time intervals [NOTE: in general, if X and Y are independent
of Z, then it can readily be shown that E[X|Y, Z] = E[X|Y ]]. Lastly, we use the fact that
if X and Y have a bivariate normal distribution with means x = 0 and y = 0, variances
2
x2 and y2 and covariance xy , then E[X|Y ] = xy Y /y2 and V ar(X|Y ) = {x2 (xy
/y2 )}
(the demonstration of which is left as an exercise for the reader), which then shows:
rZ (s, t) = E[W (t)Cov(W (s), W (t))W (t)/V ar(W (t))|W (1) = 0]
= Cov(W (s), W (t))E[{W (t)}2 |W (1) = 0]/V ar(W (t))

= s V ar(W (t)|W (1) = 0) + (E[W (t)|W (1) = 0])2 /t
= s(t t2 )/t = s(1 t).
Now, recall that we assumed that s t to derive this result, so in general the covariance function for a Brownian bridge process can be written as rZ (s, t) = min{s, t}(1
max{s, t}). In closing our discussion of the Brownian bridge, we note that the process
155

B(t) = W (t) tW (1), for 0 t 1, is clearly a Gaussian process (since any linear
combination of normally distributed quantities remains normally distributed) with mean
function E[B(t)] = E[W (t)] tE[W (1)] = 0 and covariance function:
rB (s, t) = Cov{W (s) sW (1), W (t) tW (1)}
= Cov{W (s), W (t)} sCov{W (1), W (t)} tCov{W (s), W (1)}
+stCov{W (1), W (1)}
= min{s, t} st st + st
= min{s, t} st
= min{s, t} min{s, t} max{s, t}
= min{s, t}(1 max{s, t}),
which are precisely the mean and covariance functions of a Brownian bridge process.
Now, since Gaussian processes are completely determined by their mean and covariance
functions, we see that B(t) must indeed be an alternate specification of the Brownian
bridge process.

9.7

Geometric Brownian Motion

Let W (t) be a standard Brownian motion, then the new process defined by:
X(t) = eW (t)
is called geometric Brownian motion. Clearly, X(t) is not a Gaussian process since for
any time t, X(t) is not normally distributed [indeed, the distribution of X(t) is the socalled log-normal distribution]. However, we can still calculate the mean and covariance
2
functions using the fact that if X N (0, x2 ) then E[eX ] = ex /2 (the demonstration of
which is left as an exercise), which yields:
E[X(t)] = eV ar(W (t))/2 = et/2 ,
and
Cov{X(s), X(t)} = E[X(s)X(t)] E[X(s)]E[X(t)]
= E[eW (s)+W (t) ] es/2 et/2
= eV ar(W (s)+W (t))/2 e(s+t)/2
= e[V ar(W (s))+V ar(W (t))+2Cov(W (s),W (t))]/2 e(s+t)/2
= e(s+t+2 min{s,t})/2 e(s+t)/2

= e(s+t)/2 emin{s,t} 1 .
156

Geometric Brownian motion is often employed in modelling phenomena where relative


(i.e., percentage) changes (rather than absolute changes) can be viewed as the limiting
case of a random walk. In other words, if we consider a Markov chain for which X0 = 1
and P{Xn = x(1 + r)|Xn1 = x} = P{Xn = x/(1 + r)|Xn1 = x} = 12 for some constant
r > 0, then the limiting case of this chain, as r and the time increment both tend toward
zero, will be geometric Brownian motion. The general idea here is that Xn can be viewed
as the product of n independent and identically distributed Yi s which take the values
(1 + r) and (1 + r)1 with equal probability; that is,
Xn =

n
Y

Yi

ln(Xn ) =

i=1

n
X

ln(Yi ).

i=1

As such, we can see that geometric Brownian motion can be viewed as the limit of a
symmetric random walk on the logarithmic scale [since ln(Yi ) is a random variable taking
the two values ln(1 + r) each with equal probability].

9.8

Integrated Brownian Motion

Let X(t) = + W (t), where W (t) is a Wiener process with (possibly known) variance
parameter 2 , and is an unknown location parameter for which we desire an estimate
based on an observed outcome of our stochastic process, x(t), over the time interval
t [a, b].
Now, if X(t) was a discrete time process (perhaps even an iid sample), we would
generally estimate by taking the average of the observed values in x(t). When X(t) is
a continuous time process, the logical extension to the average is an integral:
Z b
1

=
x(t)dt,
ba a
provided that this integral exists.
This estimate is indeed a reasonable one, however, we would like to investigate its
properties; that is, what is its potential bias and variability. To do so, we need to consider
it as a random variable. More generally, we can define a new random variable based on
the stochastic process X(t) and any function g(t) by the integral:
Z b
g(t)X(t)dt,
a

where we interpret this as a random variable, say G{X(t)}, whose realisations are the
Rb
values G{x(t)} = a g(t)x(t)dt, again provided that these integrals exist for all possible
157

realisations x(t). [NOTE: It can be shown that these integrals do indeed exist for any of
the general processes based on Brownian motion we have discussed earlier; however, the
level of mathematical detail necessary to perform these demonstrations is prodigious and
thus we omit them here.]
Moreover, it can further be shown (again using some detailed mathematics which we
omit) that the mean and variance of G{X(t)} can be calculated by simply interchanging
the order of integration and expectation, so that if X (t) = E[X(t)] is the mean function
of X(t), then
Z b
 Z b
E[G{X(t)}] = E
g(t)X(t)dt =
g(t)X (t)dt.
a

Indeed, if we further define the random variable H{X(t)} =


the covariance between this random variable and G{X(t)} as:

Rd
c

h(t)X(t)dt, we can find

Cov[G{X(t)}, H{X(t)}] = E[G{X(t)}H{X(t)}] E[G{X(t)}]E[H{X(t)}]


 Z b
Z b
Z d
Z d
g(t)X(t)dt
h(t)X(t)dt
g(t)X (t)dt
h(t)X (t)dt
= E
a
c
a
c
 Z bZ d
Z bZ d
g(t)h(s)X(t)X(s)dsdt
g(t)h(s)X (t)X (s)dsdt
= E
a
c
a
c
Z bZ d
Z bZ d
=
g(t)h(s)E[X(t)X(s)]dsdt
g(t)h(s)X (t)X (s)dsdt
a
c
a
c
Z bZ d
g(t)h(s)[E[X(t)X(s)] X (t)X (s)]dsdt
=
a
c
Z bZ d
=
g(t)h(s)rX (s, t)dsdt
a

where rX (s, t) = Cov{X(s), X(t)} is the covariance function of the X(t) process.
So, using these results, we can see that our original estimator
for the location pa-

158

rameter of the process X(t) = + W (t) has mean:




Z b
1
X(t)dt
E[
] = E
ba a
Z b
1
=
X (t)dt
ba a
Z b
1
( + E[W (t)])dt
=
ba a
Z b
1
=
dt
ba a
1
=
(b a)
ba
= ,
indicating that
is unbiased, and variance:
V ar(
)

=
=
=
=
=
=
=
=
=

Cov(
,
)
Z bZ b
1
rX (s, t)dsdt
(b a)2 a a
Z bZ b
1
2 min{s, t}dsdt
(b a)2 a a

Z bZ t
Z bZ b
2
tdsdt
sdsdt +
(b a)2
a
t
a
a


Z b
Z b
2
1 2 1 2
2
(bt t )dt
t a dt +
(b a)2
2
2
a
a

Z b
2
1 2 1 2
bt t a dt
(b a)2 a
2
2


2

1 3 1 3 1 2 1 2 1 3 1 3
b b ba ba + a + a
(b a)2 2
6
2
2
6
2


2

1
(b a)3 + a(b a)2
2
(b a) 3

2 (b+2a)
3

where we have used the fact that the covariance function of X(t) is the same as that
of W (t) (since they differ only by an additive constant), so that rX (s, t) = rW (s, t) =
min{s, t}. Moreover, it can be shown that
is, in fact, normally distributed (which should
not be hard to believe, given that it is effectively a summation of normally distributed
quantities, albeit an uncountably infinite number of them). As such, we could even con-

159

struct a 95% confidence interval for as:


r

1.96

b + 2a
.
3

Note that as the length of the time interval under observation increases (i.e., as we watch
the process for longer) the variance of our estimator increases as well! In other words, more
data is not better! The idea here is that the X(t) values are correlated, and thus watching
for longer is not necessarily adding any additional independent information; indeed, if the
process starts off above , then it is likely to stay there, since
E[X(t + s)|X(s) = + ] = E[X(t + s) X(s) + X(s)|X(s) = + ]
= E[X(t + s) X(s)|X(s) = + ]
+E[X(s)|X(s) = + ]
= E[X(t + s) X(s)] + +
= + ,
meaning that more and more data will actually tend to mislead us even further about the
true value of . [NOTE: In fact, a slight extension of this calculation shows that X(t), and
therefore also the Wiener process W (t), is a martingale; in other words, given the current
value of the process, we do not expect its value to change in the future. Note that this
is not true for Brownian motion with drift, the Brownian bridge process or for geometric
Brownian motion.]
More generally, we can define a new stochastic process by setting a = 0 and b = s, to
yield
Z
s

Y (s) =

f (u)X(u)du.
0

When f (u) = 1 and X(u) = W (u), a Brownian motion process, then Y (s) is generally
referred to as integrated Brownian motion. As noted previously, such a process is clearly
Gaussian since it can be seen as the limit of a linear combination of normal random
variables, and we can calculate its mean function as:
Z s
 Z s
Y (s) = E[Y (s)] = E
W (u)du =
W (u)du = 0,
0

160

and its covariance function, assuming s t, as:


rY (s, t) = Cov{Y (s), Y (t)}
= E[Y (s)Y (t)] E[Y (s)]E[Y (t)]
Z sZ t

= E
W (u)W (v)dudv 0
0
0
Z sZ t
E[W (u)W (v)]dudv
=
0
0
Z sZ t
rW (u, v)dudv
=
0
0
Z sZ t
=
min{u, v}dudv
0
0
Z sZ t
Z sZ v
vdudv
ududv +
=
Z0 s 0
Z s 0 v
1 2
=
v dv +
v(t v)dv
0 2
0
1 3 1 2
1
=
s + s t s3
6
2
3
s2 (3t s)
=
6
where we have used the fact that s t to ensure that the v in the limits of the inner
integrals in the seventh equality is indeed in the proper range (i.e., v t). In general,
then, we can write the covariance function for Y (s) as
1
rY (s, t) = (min{s, t})2 (3 max{s, t} min{s, t}).
6
In closing, we note that Y (s) is not a Markov process. In particular, we can see that
events in non-overlapping time intervals are not necessarily independent by examining the
covariance between Y (t) Y (s) and Y (s) for any s < t:
Cov{Y (t) Y (s), Y (s)}

=
=
=

Cov{Y (t), Y (s)} Cov{Y (s), Y (s)}


1 2
1
s (3t s) s2 (3s s)
6
6
1 2
s (t s)
2

6= 0.
Finally, suppose that we wish to calculate the covariance between Y (s) and W (t). We
start by noting that the definition of Y (s) clearly indicates that we can write W (t) as:
Z
Y (t + h) Y (t)
1 t+h
W (t) = lim
= lim
W (u)du,
h0 h t
h0
h
161

by applying a stochastic analog to the Fundamental Theorem of Calculus; however, as


we shall discuss in the next section, actual differentiation of stochastic processes is a bit
technical. Using this result, we then have:


Y (t + h) Y (t)
Cov{Y (s), W (t)} = Cov Y (s), lim
h0
h


Y (t + h) Y (t)
= lim Cov Y (s),
h0
h


1
= lim Cov{Y (s), Y (t + h)} Cov{Y (s), Y (t)} .
h0 h
Now, first suppose that s t. In this case, we have:


1 1 2
1 2
Cov{Y (s), W (t)} = lim
s {3(t + h) s} s (3t s)
h0 h 6
6


1 1 2
sh
= lim
h0 h 2
1 2
=
s.
2
Alternatively, suppose that s > t. In this case, as soon as h < (s t), we have:


1 1
1 2
2
Cov{Y (s), W (t)} = lim
(t + h) {3s (t + h)} t (3s t)
h0 h 6
6
1
= lim {t2 (3s t h) + (2th + h2 )(3s t h) t2 (3s t)}
h0 6h
1
= lim {t2 h + 2t(3s t)h + 3(s t)h2 h3 }
h0 6h
1 2 1
1
=
t + t(3s t) = st t2 .
6
3
2
Putting these results together yields (with some basic algebraic manipulation):
1
Cov{Y (s), W (t)} = s min{s, t} (min{s, t})2 .
2

9.9

White Noise

As noted at end of the last sub-section, actual differentiation of a stochastic process, X(t),
is not easy to properly define. In particular, unlike the case for integration, we cannot
simply use a definition based on the derivatives of the realisations, x(t), since it can often
be shown (with Brownian motion being a prime example) that these sample paths of
the stochastic process are not differentiable in the usual sense (i.e., they are too jagged
to admit derivatives). As such, we will have to define derivatives of stochastic processes
162

in a more average sense (the exact specification of which is quite technical and will be
omitted). We take as our starting point the ideas developed at the end of the previous
sub-section. Specifically, we will define the derivative of a stochastic process X(t) to be
that process X 0 (t) which satisfies the relationship
Z t
X(t) =
X 0 (u)du,
0

provided such a process exists. Clearly, if X(t) was itself defined in terms of an integration
(e.g., integrated Brownian motion), then we can see that this definition allows us to
immediately determine the appropriate derivative. It can further be shown (though the
details of the calculations are again beyond the scope of these notes) that the mean and
covariance functions for the process X 0 (t), provided it exists, are given by E[X 0 (t)] =
d
E[X(t)} = dtd X (t) and
dt
2
rX (s, t),
rX 0 (s, t) =
st
where X (t) and rX (s, t) are the mean and covariance functions of the original process
X(t).
Given this initial structure for differentiation, we might hope that we could determine
the derivative of the Wiener process. In other words, we would like to determine the
process W 0 (t) which satisfies the relationship
Z t
W (t) =
W 0 (u)du.
0

Unfortunately, no such process exists. However, it turns out that we can nevertheless
define a kind of derivative for Brownian motion in some sense. In particular, we will
define the derivative of Browian motion [alternately denoted as W 0 (t) or dW (t)] in terms
of the integral
Z
Z
b

g(t)W 0 (t)dt =

g(t)dW (t),
a

which in turn will be defined by the limit:




Z b
W (t + h) W (t)
lim
g(t)
dt,
h0 a
h
provided this limit exists. It can indeed be shown that this limit exists by employing
a simple integration by parts argument [recall that the standard integration by parts

163

Rb
Rb
formula states that a u(x)v 0 (x)dx = u(b)v(b) u(a)v(a) a u0 (x)v(x)dx] to yield:




Z b
Z b
W (t + h) W (t)
1 d
g(t)
=
g(t)
{Y (t + h) Y (t)} dt
h
h dt
a
a




Y (a + h) Y (a)
Y (b + h) Y (b)
g(a)
= g(b)
h
h


Z b
Y (a + h) Y (a)
g 0 (t)

dt,
h
a
Rt
where Y (t) = 0 W (u)du is just the integrated Brownian motion process defined in the
previous sub-section, for which we know differentiation is possible (due to the nature of
its definition as the integral of a stochastic process). Taking the limit of this expression
as h tends towards 0 then yields the desired limit (and thus the desired definition of the
original integral under investigation) as:
Z b
Z b
Z b
0
g 0 (t)W (t)dt,
g(t)dW (t) = g(b)W (b) g(a)W (a)
g(t)W (t)dt =
a

(t)
since the limit of Y (t+h)Y
is just W (t) which, as we saw at the end of the previous
h
sub-section, is simply the stochastic analog to the Fundamental Theorem of Calculus.
The derivative of the Wiener process, W 0 (t) or dW (t), defined in this way is generally
referred to as white noise (due to its original application in the theory of the physics of
sound). We stress that it is not a stochastic process in the usual sense, since it is only
truly defined in terms of its action within the integrals which lead to its definition.
We now close this section with some basic calculations based on white noise. First, we
note that, since the Wiener process is Gaussian, it is not hard to show that the integral
Rb
g(t)dW (t) is also normally distributed [indeed, the limit definition of this integral shows
a
that it is a linear combination of the three normally distributed quantities W (b), W (a)
Rb
and a g 0 (t)W (t)dt]. Moreover, its mean can easily be calculated as:
Z b

Z b

0
E
g(t)dW (t) = g(b)E[W (b)] g(a)E[W (a)] E
g (t)W (t)dt
a
a
Z b
= 00
g 0 (t)E[W (t)]dt
a

= 0
In addition, another application of the standard integration by parts formula (and some
rather tedious algebra which is omitted here), yields:
Z b

Z b
2
V ar
g(t)dW (t) =
{g(t)}2 dt,
a

164

where 2 is the scale parameter of the underlying Wiener process W (t) [i.e., we have
V ar{W (t)} = 2 t]. In fact, more generally, we can calculate
Z b

Z c
Z
2
Cov
g(t)dW (t),
h(t)dW (t) =
a

min{b,c}

g(t)h(t)dt.

As a particular special example, we note that if we set g(t) 1 and a = 0, then the
Rb
dW (t) is a normal random variable with mean zero and variance 2 b, implying that
0
Rs
the stochastic process Q(s) = 0 dW (t) is a Gaussian process with zero mean function
and covariance function:
Cov{Q(s), Q(t)} = E[Q(s)Q(t)] E[Q(s)]E[Q(t)]

Z s
Z t
dW (t)
dW (t) 0
= E
0

= 2 min{s, t}.
In other words, Q(s) is just the Wiener process with scale parameter 2 , Q(s) = W (s),
as it must be. Before finishing with a simple application of white noise, we note that the
Rb
definition of the integral a g(t)dW (t) given here is extendable to the case where a =
R
and b = , provided that the integral {g(t)}2 dt is finite.
Suppose we model a simple physical phenomenon over time (perhaps the behaviour
of a stock price or the position of a small particle suspended in a fluid) as some process
X(t) and that the defining characteristic of this process is that its value at time t + h, for
some small value h, is determined by its value at time t plus some independent normally
distributed random fluctuation. More specifically, suppose we have
$$X(t+h) = (1 + ch)X(t) + \varepsilon(t+h),$$
where $c$ is an appropriate multiplier specific to the phenomenon under study and $\varepsilon(t+h)$ is a normally distributed quantity with mean 0 and variance $\sigma^2 h$ which is independent of $X(s)$ for all $s \le t$ [and is therefore independent of all the previous $\varepsilon(s)$'s as well]. As such, we can see that $\varepsilon(t+h)$ can be thought of as the increment $W(t+h) - W(t)$ for a Wiener process, since such a quantity would be normally distributed with the required parameters and would be independent of all the preceding increments (since the increments deal with non-overlapping time intervals). Now, rearranging the preceding relationship leads to a defining equation of the form
$$\frac{X(t+h) - X(t)}{h} = cX(t) + \frac{W(t+h) - W(t)}{h},$$
and taking limits as h tends to 0 leads us to the formal relationship


$$X'(t) = cX(t) + dW(t).$$
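Although this formal equation is solved exactly below, the discrete-time recursion above already gives a direct way of simulating approximate sample paths of $X(t)$. The following is a minimal sketch (not part of the notes); the values of $c$, $\sigma$, $x_0$, the step size $h$, the horizon $T$ and the number of paths are illustrative choices.

```python
import numpy as np

# A minimal sketch (not from the notes): simulate the discrete-time recursion
#   X(t+h) = (1 + c h) X(t) + eps(t+h),   eps(t+h) ~ N(0, sigma^2 h), independent,
# which gives approximate sample paths of the formal equation X'(t) = c X(t) + dW(t).
# The values of c, sigma, x0, h, T and the number of paths are illustrative.
rng = np.random.default_rng(2)

c, sigma, x0 = -0.8, 0.5, 2.0
h, T, paths = 0.001, 2.0, 5_000
n = int(T / h)

X = np.full(paths, x0)                 # every path starts at x0
for _ in range(n):
    eps = rng.normal(0.0, sigma * np.sqrt(h), size=paths)
    X = (1.0 + c * h) * X + eps        # one step of the recursion, for all paths at once

# The average over paths at time T is close to x0 * exp(c T).
print(X.mean(), x0 * np.exp(c * T))
```

Averaging the simulated paths at time $T$ reproduces the value $x_0e^{cT}$, which matches the mean function derived at the end of this section.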
Of course, since $dW(t)$ is not really defined outside of the integral form $\int_a^b g(t)\,dW(t)$, this relationship is not directly meaningful. Nevertheless, its derivation is intuitively reasonable, and we can use the defining integrals for white noise to help us find a solution to this formal equation, and thus to characterise the process $X(t)$ determined by this formal relationship. Specifically, if we formally integrate both sides of our defining equation, we see that the solution we are searching for must satisfy the relationship:
$$\int_0^v X'(u)\,du = c\int_0^v X(u)\,du + \int_0^v dW(u) \quad\Longrightarrow\quad X(v) - X(0) = c\int_0^v X(u)\,du + W(v).$$
To solve this equation, we start by multiplying both sides by $e^{-cv}$ and then proceeding as:
$$e^{-cv}X(v) - e^{-cv}X(0) = ce^{-cv}\int_0^v X(u)\,du + e^{-cv}W(v)$$
$$\Longrightarrow\quad e^{-cv}X(v) - ce^{-cv}\int_0^v X(u)\,du = e^{-cv}X(0) + e^{-cv}W(v)$$
$$\Longrightarrow\quad \frac{d}{dv}\left[e^{-cv}\int_0^v X(u)\,du\right] = e^{-cv}X(0) + e^{-cv}W(v).$$
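The final step above simply recognises the left-hand side as an exact derivative via the product rule. If desired, this identity can be verified symbolically; the following is a minimal sketch (not part of the notes) using sympy, with $X$ left as an unspecified function.

```python
import sympy as sp

# A minimal symbolic sketch (not from the notes): verify the product-rule identity
#   d/dv [ e^{-c v} int_0^v X(u) du ] = e^{-c v} X(v) - c e^{-c v} int_0^v X(u) du
# used to recognise the left-hand side above as an exact derivative.
v, u, c = sp.symbols('v u c')
X = sp.Function('X')

I = sp.Integral(X(u), (u, 0, v))                     # int_0^v X(u) du, left unevaluated
lhs = sp.diff(sp.exp(-c * v) * I, v)                 # product rule plus d/dv of the integral
rhs = sp.exp(-c * v) * X(v) - c * sp.exp(-c * v) * I

print(sp.simplify(lhs - rhs) == 0)                   # True
```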
Integrating both sides with respect to v yields:

$$\int_0^t \frac{d}{dv}\left[e^{-cv}\int_0^v X(u)\,du\right]dv = \int_0^t e^{-cv}X(0)\,dv + \int_0^t e^{-cv}W(v)\,dv$$
$$\Longrightarrow\quad e^{-ct}\int_0^t X(u)\,du = \frac{X(0)}{c}\left(1 - e^{-ct}\right) + \int_0^t e^{-cv}W(v)\,dv$$
$$\Longrightarrow\quad \int_0^t X(u)\,du = \frac{X(0)}{c}\left(e^{ct} - 1\right) + \int_0^t e^{c(t-v)}W(v)\,dv.$$
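The differentiation carried out in the next step applies Leibniz's rule to $\int_0^t e^{c(t-v)}W(v)\,dv$, where $t$ appears both in the upper limit and in the integrand. That piece of calculus can also be checked symbolically; a minimal sketch (not part of the notes), with $W$ left as an unspecified function:

```python
import sympy as sp

# A minimal symbolic sketch (not from the notes): differentiate
#   int_0^t e^{c(t-v)} W(v) dv
# with respect to t.  Leibniz's rule is needed because t appears both in the
# upper limit and in the integrand; W is left as an unspecified function.
t, v, c = sp.symbols('t v c')
W = sp.Function('W')

I = sp.Integral(sp.exp(c * (t - v)) * W(v), (v, 0, t))
dI = sp.diff(I, t)

# Expected result:  e^{c(t-t)} W(t) + int_0^t c e^{c(t-v)} W(v) dv
expected = W(t) + sp.Integral(c * sp.exp(c * (t - v)) * W(v), (v, 0, t))
print(sp.simplify(dI - expected) == 0)   # True
```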
Finally, differentiating both sides of this equation with respect to $t$ and recalling the definition given for integrals against white noise, $\int_a^b g(v)\,dW(v) = g(b)W(b) - g(a)W(a) - \int_a^b g'(v)W(v)\,dv$, we have:
$$X(t) = X(0)e^{ct} + e^{c(t-t)}W(t) + \int_0^t ce^{c(t-v)}W(v)\,dv$$
$$\Longrightarrow\quad X(t) = X(0)e^{ct} + W(t)e^{c(t-t)} - W(0)e^{c(t-0)} - \int_0^t \frac{d}{dv}\left[e^{c(t-v)}\right]W(v)\,dv$$
$$\Longrightarrow\quad X(t) = X(0)e^{ct} + \int_0^t e^{c(t-v)}\,dW(v),$$

where X(0) is any initial random variable used to start the X(t) process. Typically, we
will simply choose some arbitrary starting constant so that $X(0) = x_0$. In such cases, we
can see that the solution to our stochastic differential equation defining the process X(t)
is just:
$$X(t) = x_0 e^{ct} + \int_0^t e^{c(t-v)}\,dW(v),$$

which is a Gaussian process (since its only random component is the white noise integral,
which we have already noted is normally distributed) with mean function:
$$\mu_X(t) = x_0 e^{ct}$$
and covariance function (calculated using the previously provided formula for covariances of integrals against white noise, given earlier in this sub-section):
$$r_X(s,t) = \sigma^2\int_0^{\min\{s,t\}} e^{c(s-v)}\,e^{c(t-v)}\,dv$$
$$= \sigma^2 e^{c(s+t)}\int_0^{\min\{s,t\}} e^{-2cv}\,dv$$
$$= \frac{\sigma^2 e^{c(s+t)}}{2c}\left(1 - e^{-2c\min\{s,t\}}\right)$$
$$= \frac{\sigma^2}{2c}\left\{e^{c(s+t)} - e^{c|s-t|}\right\},$$
where we have used the fact that $s + t - 2\min\{s,t\} = \max\{s,t\} + \min\{s,t\} - 2\min\{s,t\} = \max\{s,t\} - \min\{s,t\} = |s-t|$. As such, we have determined a method of characterising important stochastic processes from a determining equation based on the characteristics of how the process is likely to change in a short interval of time. The idea here was to note that the process at time $t+h$ will generally be determined by some deterministic relationship with its past, $X(t)$, and by some additional, independent random increment, $\varepsilon(t+h) = W(t+h) - W(t)$, which can be adequately modelled in a wide variety of settings by white noise, $dW(t)$.
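As a closing numerical check, the derived mean and covariance functions can be compared with simulation. The following is a minimal Monte Carlo sketch (not part of the notes); the parameter values and the discretisation are illustrative, and the white noise integral is approximated by the sum $\sum_i e^{c(t - t_i)}\{W(t_{i+1}) - W(t_i)\}$.

```python
import numpy as np

# A minimal Monte Carlo sketch (not from the notes): simulate the solution
#   X(t) = x0 e^{c t} + int_0^t e^{c(t-v)} dW(v)
# by discretising the white-noise integral, and compare the sample mean and
# covariance at two fixed times with the derived mu_X(t) and r_X(s, t).
# The values of c, sigma, x0, s, t, the grid size and the replication count
# are all illustrative choices.
rng = np.random.default_rng(3)

c, sigma, x0 = -0.8, 0.5, 2.0
s, t = 1.0, 2.0
n, reps = 2_000, 20_000
grid = np.linspace(0.0, t, n + 1)
h = grid[1] - grid[0]
i_s = int(round(s / h))                      # s is assumed to lie on the grid

w_t = np.exp(c * (t - grid[:-1]))            # weights e^{c(t - t_i)}
w_s = np.exp(c * (s - grid[:i_s]))           # weights e^{c(s - t_i)} for t_i < s

Xs = np.empty(reps)
Xt = np.empty(reps)
for r in range(reps):
    dW = rng.normal(0.0, sigma * np.sqrt(h), size=n)
    Xs[r] = x0 * np.exp(c * s) + w_s @ dW[:i_s]
    Xt[r] = x0 * np.exp(c * t) + w_t @ dW

def mu(u):                                    # mean function x0 e^{c u}
    return x0 * np.exp(c * u)

def r_X(u, w):                                # covariance function derived above
    return sigma**2 / (2 * c) * (np.exp(c * (u + w)) - np.exp(c * abs(u - w)))

print("means:", Xs.mean(), Xt.mean(), "(theory", mu(s), mu(t), ")")
print("cov  :", np.cov(Xs, Xt)[0, 1], "(theory", r_X(s, t), ")")
print("var  :", Xt.var(), "(theory", r_X(t, t), ")")
```

Up to Monte Carlo and discretisation error, the sample moments reproduce $\mu_X(t) = x_0e^{ct}$ and $r_X(s,t)$.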
