
MATH3353 Topics in Bioinformatics

ANU 1st Semester, 2015

Conrad Burden
Mathematical Sciences Institute, ANU

September 9, 2015

Contents
1 Introduction 2

2 Shotgun Sequencing 6
2.1 Expected coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Expected number of contigs . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Mean contig size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Variable length fragments . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Word occurrences in random sequences 12


3.1 Number of occurrences of a given word in an i.i.d. sequence . . . . . . . . 12
3.2 Distance between words in an i.i.d. sequence . . . . . . . . . . . . . . . . . 15
3.3 Markovian sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Mean and variance of Y (w) for a first order Markov model . . . . . . . . 21

4 Sequence Alignment 24
4.1 Sequence Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Why not align by brute force ? . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Dynamic Programming: Global Alignment . . . . . . . . . . . . . . . . . . 26
4.4 Dynamic Programming: Local Alignment . . . . . . . . . . . . . . . . . . 27
4.5 Substitution matrices for sequence alignments . . . . . . . . . . . . . . . . 29

5 Assessing the significance of alignments using random walks 34


5.1 Difference equation approach to random walks . . . . . . . . . . . . . . . 35
5.2 Moment generating function approach to random walks . . . . . . . . . . 36
5.3 Ladder points and excursions . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.4 An example with overshoot . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.5 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

6 BLAST (Basic Local Alignment Search Tool) 42
6.1 BLAST and the choice of substitution matrix . . . . . . . . . . . . . . . . 42
6.2 P-value estimates for aligned sequences . . . . . . . . . . . . . . . . . . . . 44
6.3 Unaligned sequences and database searches . . . . . . . . . . . . . . . . . 46

7 High throughput sequencing 48


7.1 RNA sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2 Overdispersion of read counts . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.3 Negative binomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.4 Detecting differential expression . . . . . . . . . . . . . . . . . . . . . . . . 55

8 Population Genetics 56
8.1 The Wright-Fisher Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.1.1 Probability of fixation . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.1.2 Expected time to fixation . . . . . . . . . . . . . . . . . . . . . . . 61
8.2 Including mutations in the Wright-Fisher model . . . . . . . . . . . . . . . 65
8.3 Continuum limit and the forward Kolmogorov equation . . . . . . . . . . 67
8.3.1 Derivation of the forward Kolmogorov equation . . . . . . . . . . . 68
8.3.2 The continuum Wright-Fisher model with mutations . . . . . . . . . 70
8.4 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.4.1 Maximum likelihood estimator . . . . . . . . . . . . . . . . . . . . 73
8.4.2 The Watterson estimator . . . . . . . . . . . . . . . . . . . . . . . 74

A Crash Course in Probability Theory 77


A.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A.2 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
A.4 Several Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.5 Sum of a random number of random variables . . . . . . . . . . . . . . . . 87
A.6 Extreme values and order statistics . . . . . . . . . . . . . . . . . . . . . . 87
A.7 Poisson processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

B Crash course in statistical inference 91


B.1 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
B.2 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

C Wald's identity and martingales 94

D The Dirac delta function 97

1 Introduction
The sequencing of the human genome was a milestone in human endeavour. The human
genome consists of approximately 3 billion letters, and is now only one of hundreds of

genomes that have been completely sequenced, with more genomes being added to the
list every month. Translating these vast strings of four letters, the genome, into an
understanding of the phenome, namely each biochemical, physiological, and morpholog-
ical characteristic of an organism, is a current research frontier. It requires searching
for the underlying complex pathways and patterns linking the genome to the phenome.
Mathematics, statistics and computer science underpin this endeavour.
Bioinformatics is concerned with the analysis of genomic and proteomic data, generally in the form of sequences from an alphabet of 4 nucleic acid letters, C, G, A and T, comprising deoxyribonucleic acid (DNA) molecules, or from an alphabet of 20 amino-acid letters making up protein sequences. To get started on the subject, one
needs to understand the Central dogma of biochemistry, illustrated in simplified form
in Figure 1. Strung out along the length of the genome are specialised stretches of typi-
cally hundreds to thousands of letters, or bases, known as protein-coding genes. Within
a cell nucleus of a living organism, individual genes are continually being transcribed
to complementary copies of chemically similar letter sequences to form molecules called
messenger ribonucleic acid (mRNA). The mRNA molecule so formed is chemically less
stable than DNA, with a lifetime ranging from minutes to days, during which time it acts
as a blueprint for making proteins within the cell but outside the nucleus. A protein is
also a linear molecule, consisting of a string of elements called amino acids. Each triplet of nucleic acid letters in the mRNA sequence encodes exactly one of twenty possible amino acids. Thus a protein can be represented as a sequence of typically hundreds of letters drawn from a 20-letter alphabet.

[Figure: three columns labelled DNA, mRNA and protein; DNA bases are paired with mRNA bases (C-G, T-A, G-C, A-U) and successive mRNA triplets are read off as the amino acids glu, ala and arg.]

Figure 1: The Central Dogma of biochemistry: DNA segments called genes are transcribed to complementary copies of messenger ribonucleic acid (mRNA) within the nucleus of a cell. The transcribed sequence, or transcript, then becomes a blueprint for manufacturing proteins outside the cell nucleus.

Some typical problems addressed by bioinformaticians are:

1. Genome assembly: the reconstruction of entire genomes from short reads of typi-
cally 200 bases.

2. Genome annotation: the identification of different functional parts of a sequenced genome. Within the human genome, for instance, only about 2 to 3% of the genome is protein coding. Upstream from the genes are promoter regions, where specialised proteins bind to enhance or inhibit the transcription to
mRNA. Within the genes are regions called introns which are excised from the
transcribed sequence before translation to proteins, and in the large spaces between
genes are mysterious intergenic regions, the function(s) of which, if any, are not
well understood. To minimise the need for costly wet-lab experiments, biologists
use computer algorithms based on hidden Markov models to annotate or parse
sequenced genomes into their functional elements.

3. Sequence comparison: A common problem faced by biologists is finding close
matches to a given short DNA or protein sequence within a large database of
sequences. For instance a biologist may be faced with determining the function
of a newly discovered protein by searching databases of proteins whose function
is known. Alternatively, one may be interested in locating genes in two different
species which have evolved from a gene in the common ancestor of those species.
Alignment algorithms for sequence comparison are studied in these notes.

4. Phylogenetic reconstruction: the reconstruction of family trees of species using measures of the divergence through mutation and natural selection of the species' genomes.

5. Protein folding: The function of a protein is largely determined by the shape that
it folds into, and that shape is, in principle, determined by the protein's amino-acid sequence. A largely unsolved problem in bioinformatics is the prediction of a protein's shape from its amino-acid letter sequence.

6. Expression profiling: At any point in time, only a small fraction of 10,000 or so genes in the nucleus of a cell are expressed, that is, switched on so that they are
amenable to being transcribed to mRNA. Technologies such as microarrays and
high-throughput sequencing can be used to detect which sequences are present in
mRNA extracted from cells, and hence detect which genes are expressed. Analysing
data from these technologies to obtain quantitative measures of expression is known
as expression profiling. The problem of detecting differences in gene expression
under varying conditions from high throughput sequencing data is covered in the
later part of these notes.

7. Gene regulation: This is the study of how the protein products from certain genes
act as switches controlling the expression of other genes via complex networks and
feedback loops.

Much of this course is developed from material in the text book Statistical Methods
in Bioinformatics: An Introduction (2nd ed.) by W.J. Ewens and G.R. Grant [1].
Sections 2 and 3 are based on Chapter 5 of Ewens and Grant on the analysis of a
single DNA sequence, including genome assembly from short-read sequencing and the
statistics of word occurrences in random sequences. Sections 4.5 to 6 cover material from
Chapters 6, 7 and 10 of Ewens and Grant on significance scores of sequence alignments
and the BLAST algorithm. Section 7 of these notes deals with the analysis of data
arising from high throughput sequencing technology, which post-dates the publication
of Ewens and Grant. Section 8 on population genetics is mostly based on material in the
book Mathematical Population Genetics, vol. 1, by W.J. Ewens [2]. Appendices A and
B include a brief review of background material in probability theory and statistics from
Chapters 1 to 4 of Ewens and Grant. Appendix C contains more advanced material on
random walks required for the section on assessing significance of alignment scores.

2 Shotgun Sequencing
The sequence of the entire human genome was published in 2001 [3, 4] by two competing
groups. One of the two approaches used was whole genome shotgun sequencing, in
which the DNA from the entire genome was fragmented randomly into pieces of length
approximately 500 bases, the sequence of each fragment was read using the now obsolete
technology of capillary sequencing, and computers used to assemble the full genome.
The assembly step can be thought of as a giant one-dimensional jigsaw puzzle in which
one relies on overlaps to identify neighbouring fragments.
If the number of bases in an entire genome is G, the length of each fragment is L
and the total number of fragments sequenced is N, the coverage a is defined as
$$a = \frac{NL}{G}. \qquad (1)$$
To ensure that a reasonable fraction of the genome was covered, a coverage fraction of
at least a = 12 was used.
To place the size of the assembly problem in perspective, imagine that the human
genome is written out in its entirety in the typeface used in the micro-writing to the left
of Banjo Paterson's hat on the ten dollar note (see Fig. 2: There was movement at the station . . . ). Given that there are approximately $2.9 \times 10^9$ base-pairs in the genome,
and that each letter occupies 0.19 mm on this scale, the length of the entire sequence
would be approximately 560 km, that is, the distance by road from Canberra to just
south of Taree on the northern New South Wales coast. Each sequenced fragment would
be approximately 10 cm long, and the 12-times coverage used, if printed on strips of
paper a millimetre wide, would fill a tightly packed cube more than 5 metres along each
side. This jigsaw puzzle cannot be done by hand!
In general, because the fragments are randomly scattered along the genome, some
parts will inevitably be missed. In Fig. 3 we define a contig to be a contiguous part of
the genome covered by fragments. Typical questions one might ask are: What coverage
fraction a is necessary to ensure some fraction, say 99% of the genome, is covered? What
is the expected number of contigs? What is the mean size of each contig?

2.1 Expected coverage


Let the random variable K be the number of fragments overlapping any given point P
in the genome. The probability that any given fragment covers the point P is L/G.
Thinking of each of the N fragments as a Bernoulli trial with probability of success L/G
(the fragment does cover P) and of failure $1 - L/G$ (the fragment does not cover P), we see that K is a binomial random variable. Furthermore, since $L/G \ll 1$ and $N \gg 1$, we have in the limit $L/G \to 0$, $N \to \infty$ that K is a Poisson random variable:
$$K \sim \mathrm{Bin}\left(N, \frac{L}{G}\right) \to \mathrm{Pois}\left(\frac{NL}{G}\right) = \mathrm{Pois}(a). \qquad (2)$$

Figure 2: If the human genome were written out in the typeface of the micro-writing
used on the ten dollar note, it would reach from Canberra to just south of Taree.

Figure 3: Contigs. [Schematic of a genome of G base pairs covered by overlapping fragments; a contiguous covered stretch is a contig.]

Table 1: Characteristics of shotgun sequencing for various values of the coverage fraction a. The
expected number of contigs is calculated assuming G = 100, 000 and L = 500. The mean contig
size is calculated assuming L = 500.

coverage a                    2        4        6        8        10         12
Mean proportion covered       .86466   .98168   .99752   .99966   .99995     .99999
Expected number of contigs    54.1     14.7     3.0      0.5
Mean contig size              1,600    6,700    33,500   186,000  1,100,000  6,780,000

That is,
$$\mathrm{Prob}(K = k) = \frac{e^{-a} a^k}{k!}, \qquad k = 0, 1, \ldots. \qquad (3)$$
The mean proportion of the genome covered is then
$$1 - \mathrm{Prob}(K = 0) = 1 - e^{-a}. \qquad (4)$$

Table 1 gives the proportion covered for a range of a.

2.2 Expected number of contigs


To calculate the expected number of contigs, first define the following indicator random
variable corresponding to fragment number f for f = 1, 2, . . . , N :
$$I_f = \begin{cases} 1 & \text{if } f \text{ is the rightmost fragment in a contig} \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

The number of contigs is $\sum_{f=1}^N I_f$, and, neglecting end effects, the mean number of contigs is thus
$$E\left(\sum_{f=1}^N I_f\right) = \sum_{f=1}^N E(I_f) = N\,\mathrm{Prob}(\text{a given fragment } f \text{ is rightmost in a contig}) = N\,\mathrm{Prob}(\text{no other fragment overlaps point at R.H. end of } f) = N e^{-a} = N e^{-NL/G}. \qquad (6)$$

It is easy to check that, for fixed G and L, the mean number of contigs is maximised at a = 1. In the example in Table 1 the formula gives the unphysical result of less than one contig for $a \ge 8$ because end effects have been ignored. Furthermore, in practice the formula gives an overly optimistic prediction of achieving high coverage of the genome because (1) the existence of repeated stretches in the non-protein-coding part of the genome means that detecting true overlaps is difficult, and (2) some stretches of the genome are more difficult to clone than others, so the assumption of uniformly distributed fragments is not appropriate.
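
The following minimal Python sketch reproduces the first two rows of Table 1 from Eqs. (4) and (6), assuming G = 100,000 and L = 500 as in the table caption; the variable names are chosen for illustration only.

    import math

    G, L = 100_000, 500                      # genome and fragment lengths used in Table 1
    for a in [2, 4, 6, 8, 10, 12]:
        N = a * G / L                        # number of fragments implied by a = NL/G
        covered = 1 - math.exp(-a)           # Eq. (4): mean proportion covered
        contigs = N * math.exp(-a)           # Eq. (6): expected number of contigs
        print(f"a = {a:2d}  covered = {covered:.5f}  contigs = {contigs:6.1f}")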

2.3 Mean contig size


To calculate the mean contig size, define a random variable X equal to the distance
from the left hand end of any given fragment to the beginning of the next fragment (see
Fig. 4). For fragment lengths of a few hundred bases, to a reasonable approximation X
can be considered to be a continuous random variable. If the fragments are assumed
to be uniformly distributed along the genome, the positioning of the next fragment is a
Poisson process:
$$\mathrm{Prob}(x < X < x + h \mid x < X) = \frac{Nh}{G}, \qquad 0 < x, \qquad (7)$$
and thus X is an exponential random variable with density function,
$$f_X(x) = \lambda e^{-\lambda x}, \quad \text{where } \lambda = \frac{N}{G} = \frac{a}{L}. \qquad (8)$$
The probability that the second fragment overlaps the first is
$$p = \int_0^L \lambda e^{-\lambda x}\, dx = 1 - e^{-a}. \qquad (9)$$

The number of successful overlaps, Y , is then a geometric random variable with proba-
bility distribution
$$\mathrm{Prob}(Y = y) = P_Y(y) = p^y (1 - p), \qquad (10)$$
and mean
$$E(Y) = \frac{p}{1 - p} = e^a - 1. \qquad (11)$$


Figure 4: The random variable X for calculating mean contig size.

The mean length of each successful overlap is


$$E(X \mid 0 \le X < L) = \int_0^L x\, f_{X \mid 0 \le X < L}(x)\, dx = \frac{\int_0^L x f_X(x)\, dx}{\int_0^L f_X(x)\, dx} = \frac{1}{\lambda} - \frac{L}{e^{\lambda L} - 1} = \frac{L}{a} - \frac{L}{e^a - 1}. \qquad (12)$$
Now the total contig size is the sum of each successful overlap plus the length of the
final fragment:
$$S = \sum_{i=1}^{Y} U_i + L, \qquad (13)$$
where the $U_i$ are identically and independently (i.i.d.) distributed random variables, each with the same distribution as $X \mid (0 \le X < L)$. Using the result in Appendix A for the mean of the sum of a random number of i.i.d. random variables, we have for the mean contig size
$$E(S) = E(Y)\,E(X \mid 0 \le X < L) + L = (e^a - 1)\left(\frac{L}{a} - \frac{L}{e^a - 1}\right) + L = L\,\frac{e^a - 1}{a}. \qquad (14)$$
Table 1 gives some mean contig sizes for fragment length L = 500, a range of coverages
a and an essentially infinite genome length G.
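
The formulas above can also be checked by direct simulation. The sketch below is an illustration only: it places fragments on a circular genome (an assumption made here to avoid end effects), drops N = aG/L fragments at uniformly random positions, and measures the covered proportion, number of contigs and mean contig size for comparison with Eqs. (4), (6) and (14).

    import math
    import random

    def shotgun_simulation(G, L, a, seed=0):
        # Drop N = aG/L fragments of length L with uniformly random left-hand ends
        # on a circular genome of G bases, then measure coverage and contigs.
        rng = random.Random(seed)
        N = int(a * G / L)
        covered = [False] * G
        for _ in range(N):
            start = rng.randrange(G)
            for j in range(L):
                covered[(start + j) % G] = True
        proportion = sum(covered) / G
        # a contig begins wherever a covered base follows an uncovered one
        n_contigs = sum(covered[i] and not covered[i - 1] for i in range(G))
        if n_contigs == 0 and all(covered):
            n_contigs = 1
        mean_size = sum(covered) / n_contigs if n_contigs else float("nan")
        return proportion, n_contigs, mean_size

    G, L, a = 100_000, 500, 4
    print("simulated:", shotgun_simulation(G, L, a))
    print("theory:   ", 1 - math.exp(-a), a * G / L * math.exp(-a), L * (math.exp(a) - 1) / a)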

2.4 Variable length fragments


The above calculations make the assumption that all fragments are of equal length. A
more realistic assumption consistent with typical fragmentation processes used in the

laboratory is to assume the fragment length L to be a random variable (for simplicity
assumed to be continuous) with some specified probability density function $f_L(\ell)$, $0 \le \ell < \infty$. Here we repeat the calculation for the proportion of the genome covered by contigs for the case of variable length fragments.
Consider a fixed point P along the length of the genome and an interval $I = (P - x, P - x + h)$ of width h a distance x to the left of P. The expected number of fragments
covering P which have their left-hand end in I is

$$0 \cdot \mathrm{Prob}(\text{no fragment has its L.H. end in } I) + 1 \cdot \mathrm{Prob}(\text{1 fragment has its L.H. end in } I \text{ and it extends to } P) + o(h) = \frac{N}{G}\, h \int_x^\infty f_L(\ell)\, d\ell + o(h) = \frac{N}{G}\, h\, (1 - F_L(x)) + o(h), \qquad (15)$$
where $F_L(\ell)$ is the cumulative distribution function of L. In the limit $h \to 0$, this is the expected number of fragments covering P which start in the interval $(P - x, P - x + h)$.
Thus the mean number of fragments covering P in total is

N
Z
(1 FL (x)) dx
G 0

N
Z
N
= x (1 FL (x)) +
xfL (x)dx
G 0 G 0
N
= E(L). (16)
G
Assuming then that the number of fragments covering P is a Poisson random variable
with mean (N/G)E(L), the expected proportion of the genome covered by fragments is,
by analogy with Eq. (4),
$$1 - e^{-N E(L)/G}. \qquad (17)$$
For more calculations related to shotgun sequencing, including anchored contigs,
see Section 5.1 of Ewens and Grant [1].

3 Word occurrences in random sequences
Given a random sequence of letters from an alphabet L (e.g. the nucleic acids {c, a, g, t}
or the set of 20 amino acids) we may be interested in knowing how many occurrences
of a given word, such as gaga, one might expect or the expected distance between one
occurrence of the word and the next. The answers to these questions depend on the
word itself and whether overlaps are counted. For example, in the sequence segment
. . . gagaga . . .
the word gaga occurs twice if overlaps are counted, but only once if not. In what follows
we will always allow overlaps.
Typical applications of such questions in bioinformatics are (i) to test an assumption
about a sequence, for example, can we consider it to be a sequence of identically and
independently distributed (i.i.d.) letters; (ii) to discover regulatory binding sites - do
some classes of word appear more often than expected by chance in a particular part of a genome?; (iii) to test whether microarray probes are likely to be unique to the gene that they target; (iv) when assembling a genome from short reads, to assess whether the
overlap between reads is likely to be unique within the genome; or (v) when mapping
sequencing reads onto a reference genome, to assess whether reads are likely to map to
unique locations.

3.1 Number of occurrences of a given word in an i.i.d. sequence


Consider a random sequence of length n of letters from an alphabet L of size d. To begin
with, we will assume that the letters are i.i.d. In later sections we will look at sequences
with a Markovian dependence. To keep it simple in the first instance, we assume further
that any letter has an equal chance of occurring at any site:
$$A = A_1 A_2 \ldots A_n, \quad A_i \in L, \quad |L| = d, \qquad (18)$$
$$\mathrm{Prob}(A_i = a) = \frac{1}{d}, \quad \text{for all } a \in L \text{ and } i = 1, \ldots, n. \qquad (19)$$
For any k-letter word $w = w_1 \ldots w_k \in L^k$, let the random variable Y(w) be the number of occurrences of w in A. We define the following indicator variables:
$$I_i(w) = \begin{cases} 1 & \text{if } (A_i \ldots A_{i+k-1}) = w, \\ 0 & \text{otherwise,} \end{cases} \qquad i = 1, \ldots, n - k + 1, \qquad (20)$$
which detects whether the word at position i in the sequence is w, and
$$\beta_\ell(w) = \begin{cases} 1 & \text{if } (w_1 \ldots w_\ell) = (w_{k-\ell+1} \ldots w_k), \\ 0 & \text{otherwise,} \end{cases} \qquad \ell = 1, \ldots, k, \qquad (21)$$
which detects whether the first $\ell$ letters of w match with the last $\ell$ letters. Note that the number of occurrences of w can be written
$$Y(w) = \sum_{i=1}^{\tilde n} I_i(w), \quad \text{where } \tilde n = n - k + 1. \qquad (22)$$

The expected number of word matches is
$$E[Y(w)] = \sum_{i=1}^{\tilde n} E[I_i(w)] = \sum_{i=1}^{\tilde n} \mathrm{Prob}\big((A_i \ldots A_{i+k-1}) = w\big) = \sum_{i=1}^{\tilde n} \frac{1}{d^k}, \qquad (23)$$
or,
$$E[Y(w)] = \frac{n - k + 1}{d^k}. \qquad (24)$$
The variance of Y (w) is more difficult to calculate. We have
$$\mathrm{Var}(Y(w)) = \mathrm{Var}\left(\sum_{i=1}^{\tilde n} I_i(w)\right) = \sum_{i,j=1}^{\tilde n} \mathrm{Cov}(I_i(w), I_j(w)) = \sum_{i=1}^{\tilde n} \mathrm{Var}(I_i(w)) + 2 \sum_{1 \le i < j \le \tilde n} \mathrm{Cov}(I_i(w), I_j(w)). \qquad (25)$$

The first term is


$$\sum_{i=1}^{\tilde n} \mathrm{Var}(I_i(w)) = \sum_{i=1}^{\tilde n} \left(E[I_i(w)^2] - E[I_i(w)]^2\right) = \sum_{i=1}^{\tilde n} \left(E[I_i(w)] - E[I_i(w)]^2\right) = \sum_{i=1}^{\tilde n} \left(\mathrm{Prob}((A_i \ldots A_{i+k-1}) = w) - \mathrm{Prob}((A_i \ldots A_{i+k-1}) = w)^2\right) = \tilde n \left(\frac{1}{d^k} - \frac{1}{d^{2k}}\right). \qquad (26)$$

To calculate the second term, first note that since the letters of A are independent, $\mathrm{Cov}(I_i(w), I_j(w))$ is zero if $|i - j| \ge k$, that is, if the k-word at position i does not overlap the k-word at position j. The contribution to the second term from overlapping

Figure 5: Contribution to $\mathrm{Cov}(I_i(w), I_j(w))$ from overlapping occurrences of the k-word w at locations i and j. In Eq. (27) we have set $s = j - i$. [Schematic: the two words overlap over $k - s$ letters and together span $k + s$ letters of A.]

words is (see Fig. 5)

$$\begin{aligned}
2 \sum_{1 \le i < j \le \tilde n} \mathrm{Cov}(I_i(w), I_j(w)) &= 2 \sum_{s=1}^{k-1} \sum_{i=1}^{n-(k+s)+1} \mathrm{Cov}(I_i(w), I_{i+s}(w)) \\
&= 2 \sum_{s=1}^{k-1} \sum_{i=1}^{n-(k+s)+1} \left(E[I_i(w) I_{i+s}(w)] - E[I_i(w)]\, E[I_{i+s}(w)]\right) \\
&= 2 \sum_{s=1}^{k-1} \sum_{i=1}^{n-(k+s)+1} \left(\mathrm{Prob}\{(A_i \ldots A_{i+k+s-1}) = (w_1 \ldots w_s w_1 \ldots w_k)\}\, \beta_{k-s}(w) - \frac{1}{d^{2k}}\right) \\
&= 2 \sum_{s=1}^{k-1} \sum_{i=1}^{n-(k+s)+1} \frac{\beta_{k-s}(w)}{d^{k+s}} - \frac{2}{d^{2k}}\left((k-1)(n-k+1) - \frac{k(k-1)}{2}\right) \\
&= 2 \sum_{s=1}^{k-1} \frac{(n-k-s+1)\,\beta_{k-s}(w)}{d^{k+s}} - \frac{1}{d^{2k}}(k-1)(2n+2-3k) \\
&= 2 \sum_{\ell=1}^{k-1} \frac{(n-2k+\ell+1)\,\beta_\ell(w)}{d^{2k-\ell}} - \frac{1}{d^{2k}}(k-1)(2n+2-3k), \qquad (27)
\end{aligned}$$

where the substitution ` = k s has been made in the last line. Adding the contributions
from Eqs. (26) and (27) gives, after a little algebra,
$$\mathrm{Var}(Y(w)) = \frac{n-k+1}{d^k} - \frac{(2k-1)n - 3k^2 + 4k - 1}{d^{2k}} + 2 \sum_{\ell=1}^{k-1} \frac{(n-2k+\ell+1)\,\beta_\ell(w)}{d^{2k-\ell}}. \qquad (28)$$

Note that the variance, but not the mean, depends on the choice of word. As an
example, consider a random i.i.d. sequence of $n = 10^6$ letters from the nucleic acid alphabet {a, c, g, t}, each letter having an equal probability $\frac{1}{4}$ of occurring at any given site. For words of length k = 4 letters, Eq. (24) gives
$$E[Y(w)] = \frac{999{,}997}{256} \approx 3{,}906.2, \qquad (29)$$

independently of the choice of word. The calculated variance and standard deviation of
the word count for a few sample words, calculated from Eq. (28), are given in Table 2.
Naively, one might guess the number of occurrences of a given word to be a binomial
random variable: 999,997 trials with a probability of success 1/256 for each trial. If that were the case, the variance would be $999{,}997 \times 1/256 \times 255/256 \approx 3{,}891$. From the table we see that this assumption is wrong by a considerable margin. In fact it fails because the trials are not independent: the occurrence of a word at a given site
affects the probability of occurrence of words at overlapping sites.

Table 2: Variance and standard deviation of the word count statistic Y (w) for sample
words w in an i.i.d. sequence of nucleic acids.

w                 gaga    gggg    gaag    gagc    binomial

Var(Y(w))         4,288   6,363   3,922   3,799   3,891
√Var(Y(w))        65.4    79.7    62.6    61.6    62.4

The above calculations can be extended to more complicated cases such as (i) se-
quences in which the letter distribution Prob (Ai = a) is not a uniform distribution
across the alphabet, (ii) Markovian sequences in which letters are not independent but
depend in a stochastic way on the previous letter or letters, and (iii) counts which in-
clude the number of matches to a given word up to a specified number of mismatches.
The Markovian case will be treated in detail in a later section.
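
Eqs. (24) and (28) are easy to evaluate numerically. The following minimal Python sketch, with a helper beta() standing in for the overlap indicator of Eq. (21), reproduces the mean of Eq. (29) and the variances and standard deviations listed in Table 2.

    def beta(w, ell):
        # overlap indicator of Eq. (21): 1 if the first ell letters of w equal its last ell letters
        return 1 if w[:ell] == w[len(w) - ell:] else 0

    def word_count_mean_var(w, n, d=4):
        # mean and variance of Y(w), Eqs. (24) and (28), for a uniform i.i.d. sequence
        k = len(w)
        mean = (n - k + 1) / d ** k
        var = (n - k + 1) / d ** k - ((2 * k - 1) * n - 3 * k ** 2 + 4 * k - 1) / d ** (2 * k)
        var += 2 * sum((n - 2 * k + ell + 1) * beta(w, ell) / d ** (2 * k - ell)
                       for ell in range(1, k))
        return mean, var

    for w in ["gaga", "gggg", "gaag", "gagc"]:
        mean, var = word_count_mean_var(w, n=10 ** 6)
        print(f"{w}: mean = {mean:.1f}, variance = {var:.1f}, sd = {var ** 0.5:.1f}")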

3.2 Distance between words in an i.i.d. sequence


Suppose a given word w occurs at the position starting at site i in the sequence A.
Define p(y) to be the probability that the next occurrence of the word w begins at site
i + y, y = 1, 2, . . .. Here we assume A to be an infinitely long random i.i.d. sequence
with the uniform letter distribution Eq. (19).
To calculate p(y) it is convenient to define the following events illustrated in Fig 6:
E: w occurs at position i, and the next occurrence of w is at i + y;

F : w occurs at position i and at i + y;

Aj: w occurs at position i and at i + y, and the next occurrence of w after i is at i + j, where $1 \le j < y$.
These events satisfy the following relationships:

$$F = E \cup A_1 \cup \ldots \cup A_{y-1}; \qquad E \cap A_j = \emptyset; \quad A_j \cap A_k = \emptyset \ (j \ne k), \quad 1 \le j, k < y.$$

Figure 6: The events E, F and Aj. [Schematic: for E there are no occurrences of w between i and i + y; for F anything may occur between i and i + y; for Aj there are no occurrences of w between i and i + j and anything may occur between i + j and i + y.]

It follows that
$$p(y) = \mathrm{Prob}(E) = \mathrm{Prob}(F) - \sum_{j=1}^{y-1} \mathrm{Prob}(A_j). \qquad (30)$$

This enables us to write down an iterative formula for p(y), namely

$$p(y) = \begin{cases}
\dfrac{\beta_{k-1}(w)}{d} & \text{if } y = 1, \\[2ex]
\dfrac{\beta_{k-y}(w)}{d^y} - \displaystyle\sum_{j=1}^{y-1} \dfrac{p(j)\,\beta_{k+j-y}(w)}{d^{y-j}} & \text{if } 1 < y < k, \\[2ex]
\dfrac{1}{d^k} - \displaystyle\sum_{j=1}^{y-k} \dfrac{p(j)}{d^k} - \displaystyle\sum_{j=y-k+1}^{y-1} \dfrac{p(j)\,\beta_{k+j-y}(w)}{d^{y-j}} & \text{if } y \ge k,
\end{cases} \qquad (31)$$

which follows from considering the cases illustrated in Fig. 7. A plot of the distribution
p(y) for the four DNA words listed in Table 2 is shown in Fig. 8.
In Section 5.7 of the textbook by Ewens and Grant [1] it is shown how to calculate
the generating function of the random variable Y (w). This provides a computationally
more efficient way of calculating p(y) for large y than the iterative formula given above.
In Section 6 of the work by Lothaire [5] equivalent results are given for a sequence
with a Markovian dependence between neighbouring letters, and from this properties of
the distribution of the number of occurrences of a given word in a Markovian random
sequence are also derived. In the following section, we extend our earlier results for the
mean and variance of the number of occurrences of a given word in i.i.d. sequences to
Markovian sequences.
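
The iterative formula Eq. (31) translates directly into code. The sketch below is an illustration (function names are our own); it computes p(y) for a given word over a uniform i.i.d. alphabet of size d and can be used to reproduce curves like those in Fig. 8.

    def beta(w, ell):
        # overlap indicator of Eq. (21)
        return 1 if w[:ell] == w[len(w) - ell:] else 0

    def next_occurrence_probs(w, y_max, d=4):
        # p(y), y = 1, ..., y_max, from the iterative formula Eq. (31)
        k = len(w)
        p = {}
        for y in range(1, y_max + 1):
            if y == 1:
                p[y] = beta(w, k - 1) / d
            elif y < k:
                p[y] = beta(w, k - y) / d ** y - sum(
                    p[j] * beta(w, k + j - y) / d ** (y - j) for j in range(1, y))
            else:
                p[y] = (1 / d ** k
                        - sum(p[j] / d ** k for j in range(1, y - k + 1))
                        - sum(p[j] * beta(w, k + j - y) / d ** (y - j)
                              for j in range(y - k + 1, y)))
        return p

    p = next_occurrence_probs("gaga", 100)
    print(p[2])   # 1/16: gaga can recur two letters later via its self-overlap
    print(p[4])   # 0: an occurrence four letters later forces one two letters later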

Figure 7: Various cases leading to Eq. (31). [Schematic of the first case $1 \le y < k$, and the second case $y \ge k$ with the subcases $1 \le j \le y - k$ and $y - k + 1 \le j \le y - 1$.]

3.3 Markovian sequences


A more realistic statistical model of biological sequences is a Markovian sequence. Once
again we consider a random sequence A of letters from a finite alphabet L of size d (see
Eq. 18), but now we assume each letter is not independent, but depends on the previous
m letters. We say that A is an m-th order Markovian sequence if

$$\mathrm{Prob}\big(A_{i+m} = b \mid (A_i, \ldots, A_{i+m-1}) = (a_1, \ldots, a_m)\big) = M(a_1, \ldots, a_m; b), \qquad (32)$$

for a specified $d^m \times d$ transition matrix M satisfying
$$0 \le M(a_1, \ldots, a_m; b); \qquad \sum_{b \in L} M(a_1, \ldots, a_m; b) = 1, \qquad (33)$$
for all $a_1, \ldots, a_m, b \in L$. Note that M is assumed independent of i. In order for the distribution to be well defined we also need to specify some way of starting the sequence.
For instance we may define

$$\mathrm{Prob}\big((A_1, \ldots, A_m) = (a_1 \ldots a_m)\big) = \rho(a_1 \ldots a_m), \qquad (34)$$
for some specified function $\rho$ satisfying $\rho(a_1 \ldots a_m) \ge 0$, $\sum_{a_1 \ldots a_m \in L^m} \rho(a_1 \ldots a_m) = 1$.
Below we will define an alternate way of generating a well defined distribution over
sequences which avoids this complication of introducing a starting distribution.

[Figure 8 consists of four panels plotting p(y) against y (0 to 100) for the words gaga, gggg, gaag and gagc over the alphabet cagt.]
Figure 8: The probability distribution p(y) of the distance between words in a random
i.i.d. DNA sequence for four different words.

Statistical tests for assessing Markovian dependence in observed sequences are be-
yond the scope of this course, but can be found in Sections 5.2 and 11.3 of the textbook
by Ewens and Grant. A formula for maximum likelihood estimators of elements of the transition matrix is derived in Section 3.4 of the textbook by Robin et al. [6]. Here we
will assume the matrix M is given.
We are interested in calculating the mean and variance of the number of occurrences
Y (w) of a given k-letter word w = w1 . . . wk for a sequence of length n generated
by a given m-th order Markov model. Much of the complexity in calculating the mean and variance in the i.i.d. case arose because of boundary effects at either end of the sequence. If we assume the sequence length n to be considerably greater than the word length k, it makes sense to work within an approximation which ignores these boundary effects and avoids having to specify a starting distribution. One way to accomplish this is to
use periodic boundary conditions, that is, we sew the ends of the sequence together

to form a continuous loop. More precisely, we extend the sequence indefinitely in both
directions by defining $A_{i+n} = A_i$ for all $i = \ldots, -2, -1, 0, 1, 2, \ldots$ and define Y(w) to be the number of occurrences of w in one period of length n:
$$Y(w) = \sum_{i=1}^{n} I_i(w), \qquad (35)$$

where the word match indicator random variable is defined by Eq. (20). However, now
we are faced with a new problem: How does one define a Markov chain when the chain
has no beginning?
Before answering this question, let's introduce some notation based on the embedding technique of ref. [6]. We will write any string of length equal to the Markov order m with an arrow above:
$$\vec a = (a_1, \ldots, a_m), \qquad (36)$$
and write any substring of the random sequence A of length m in a similar fashion, labelled by the index of the first element:
$$\vec A_i = (A_i, \ldots, A_{i+m-1}). \qquad (37)$$

Thus Eq. (32) is written more compactly as
$$\mathrm{Prob}(A_{i+m} = b \mid \vec A_i = \vec a) = M(\vec a; b). \qquad (38)$$
Now define a $d^m \times d^m$ square matrix $\mathcal{M}$ as
$$\mathcal{M}(\vec a, \vec b) = \begin{cases} M(\vec a; b_m) & \text{if } (a_2, \ldots, a_m) = (b_1, \ldots, b_{m-1}), \\ 0 & \text{otherwise.} \end{cases} \qquad (39)$$
Then the m-th order Markovian dependency can be written as a first order Markovian dependency as
$$\mathrm{Prob}(\vec A_{i+1} = \vec b \mid \vec A_i = \vec a) = \mathcal{M}(\vec a, \vec b). \qquad (40)$$
Given a transition matrix $\mathcal{M}$, we first attempt to define a periodic random sequence $A = A_1, A_2, \ldots, A_n$ of length n via the following algorithm:

Algorithm 1.

Step 0: Choose an arbitrary starting distribution $\rho$ on the set of strings of length m: $\mathrm{Prob}(\vec A_1 = \vec a) = \rho(\vec a)$, where $0 \le \rho(\vec a) \le 1$ and $\sum_{\vec a \in L^m} \rho(\vec a) = 1$.

Step 1: Generate $\vec A_1 = A_1, \ldots, A_m$ from this distribution.

Step 2: Generate $A_{m+1}, \ldots, A_{m+n}$ using Eq. (40).

Step 3: If $\vec A_{n+1} = \vec A_1$, accept the sequence $A = A_1, A_2, \ldots, A_n$, otherwise repeat from Step 1 until an accepted sequence is obtained.
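
For the simplest case m = 1, Algorithm 1 can be sketched in a few lines of Python; the transition matrix below is an arbitrary toy example chosen for illustration, not one taken from data.

    import random

    def periodic_markov_sequence(M, letters, n, rng=None):
        # Algorithm 1 for a first order chain (m = 1): start from the uniform
        # distribution, extend with the transition matrix M, and accept the sequence
        # only if it closes up periodically (A_{n+1} = A_1).
        rng = rng or random.Random(0)
        while True:
            a = [rng.choice(letters)]                            # Steps 0 and 1
            for _ in range(n):                                   # Step 2
                row = M[a[-1]]
                a.append(rng.choices(letters, weights=[row[b] for b in letters])[0])
            if a[n] == a[0]:                                     # Step 3
                return a[:n]

    # toy transition matrix on {a, c, g, t}; the numbers are illustrative only
    letters = "acgt"
    M = {x: {y: (0.4 if x == y else 0.2) for y in letters} for x in letters}
    print("".join(periodic_markov_sequence(M, letters, 60)))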

This algorithm entails that
$$\mathrm{Prob}(A = a) = \frac{\rho(\vec a_1)\,\mathcal{M}(\vec a_1, \vec a_2)\,\mathcal{M}(\vec a_2, \vec a_3) \cdots \mathcal{M}(\vec a_n, \vec a_1)}{\sum_{b_1, \ldots, b_n \in L} \rho(\vec b_1)\,\mathcal{M}(\vec b_1, \vec b_2)\,\mathcal{M}(\vec b_2, \vec b_3) \cdots \mathcal{M}(\vec b_n, \vec b_1)}, \qquad (41)$$

where the denominator arises from the requirement that the sum of probabilities over
all possible sequences is 1. The idea behind periodic boundary conditions (PBCs) is that there should be no privileged
position along the sequence from which to begin numbering. Thus we further impose a
condition that the sequence should have no privileged starting point, that is, for each
i = 1, . . . , n,
$$\mathrm{Prob}(A = a_{i+1} a_{i+2} \ldots a_n a_1 \ldots a_i) = \mathrm{Prob}(A = a). \qquad (42)$$
Eqs. (41) and (42) imply that $\rho(\vec a_{i+1}) = \rho(\vec a_1)$ for each i and for every sequence $a \in L^n$, which can only happen if
$$\rho(\vec a) = \frac{1}{d^m} \quad \forall\, \vec a \in L^m. \qquad (43)$$
This leads to the following definition [7]: Given a Markovian matrix M of order m, a
random periodic Markovian sequence of length n is one generated by Algorithm 1 with
the initial distribution in Step 0 equal to the uniform distribution Eq. (43).
It follows from Eq. (41) that for a random Markovian sequence A of length n, the
probability of the configuration $a = (a_1, \ldots, a_n)$ occurring is
$$\mathrm{Prob}(A = a) = \frac{\mathcal{M}(\vec a_1, \vec a_2)\,\mathcal{M}(\vec a_2, \vec a_3) \cdots \mathcal{M}(\vec a_n, \vec a_1)}{\mathrm{tr}\,(\mathcal{M}^n)}. \qquad (44)$$
The distribution Eq. (44) has been proposed by Percus and Percus [8], who made an
extensive study of the probability distribution of words on periodic sequences, which
they refer to as rings.
A Markovian sequence A of length n generated by the $d^m \times d$ matrix M, together with a given k-word w, defines the word-count random variable Y(w; M, n). By Eq. (40), an equivalent specification of this situation is a first order Markovian sequence $\vec A$ consisting of letters of an alphabet of size $d^m$ generated by the square matrix $\mathcal{M}$ defined by Eq. (39). The sparse structure of $\mathcal{M}$ ensures that each possible sequence $\vec A$ generated by $\mathcal{M}$ bears a one-to-one correspondence with each possible sequence A generated by M. Furthermore, for $k \ge m$, an occurrence of the word w in A is equivalent to the occurrence of the word
$$\vec w = (\vec w_1 \ldots \vec w_{k-m+1}) \qquad (45)$$
in $\vec A$. It follows that the distributional properties of Y(w; M, n) can be determined as the distributional properties of the analogous word-count statistic for the equivalent first order Markov model:
$$Y(w; M, n) = Y(\vec w; \mathcal{M}, n). \qquad (46)$$
In particular, provided $k \ge m$, the mean and variance of the word-count statistic for an m-th order Markov model can always be found as the mean and variance of a word-count statistic in an equivalent first order Markov model. From now on it is therefore sufficient to consider only first order Markov sequences.

3.4 Mean and variance of Y (w) for a first order Markov model
Consider a random periodic sequence A = A1 . . . An of period n constructed from the
first order $d \times d$ Markov transition matrix
$$\mathrm{Prob}(A_{i+1} = b \mid A_i = a) = M_{ab}, \qquad a, b \in L, \qquad (47)$$
where
$$M_{ab} \ge 0, \qquad \sum_{b \in L} M_{ab} = 1. \qquad (48)$$

Setting m = 1 in Eq. (44) gives the probability of any sequence a = a1 . . . an occurring


as
$$\mathrm{Prob}(A = a) = \frac{M_{a_1 a_2} M_{a_2 a_3} \cdots M_{a_n a_1}}{\mathrm{tr}\,(M^n)}. \qquad (49)$$
The mean of the word count vector Eq. (35) is then
$$\begin{aligned}
E[Y(w)] &= \sum_{i=1}^{n} E[I_i(w)] = \sum_{i=1}^{n} \mathrm{Prob}(I_i(w) = 1) \\
&= \sum_{i=1}^{n} \sum_{a_1, \ldots, a_{i-1}, a_{i+k}, \ldots, a_n} \frac{M_{a_1 a_2} \cdots M_{a_{i-1} w_1} M_{w_1 w_2} \cdots M_{w_{k-1} w_k} M_{w_k a_{i+k}} \cdots M_{a_n a_1}}{\mathrm{tr}\,(M^n)} \\
&= \sum_{i=1}^{n} \frac{M_{w_1 w_2} \cdots M_{w_{k-1} w_k} (M^{n-k+1})_{w_k w_1}}{\mathrm{tr}\,(M^n)}, \qquad (50)
\end{aligned}$$

or, observing that the summand is independent of i,

$$E[Y(w)] = n\,\mu(w), \qquad (51)$$
where, for any $\ell$-word $a = a_1 \ldots a_\ell$ we define
$$\mu(a) = \frac{M_{a_1 a_2} \cdots M_{a_{\ell-1} a_\ell} (M^{n-\ell+1})_{a_\ell a_1}}{\mathrm{tr}\,(M^n)}. \qquad (52)$$
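
Eqs. (51) and (52) can be evaluated directly with matrix powers. The following sketch is illustrative only (the transition matrix is a made-up example); it computes mu(w) and hence E[Y(w)] for a first order chain.

    import numpy as np

    def mu(word, M, letters, n):
        # mu(a) of Eq. (52) for a first order transition matrix M (d x d numpy array)
        idx = {b: i for i, b in enumerate(letters)}
        a = [idx[b] for b in word]
        prod = 1.0
        for u, v in zip(a[:-1], a[1:]):
            prod *= M[u, v]
        return (prod * np.linalg.matrix_power(M, n - len(a) + 1)[a[-1], a[0]]
                / np.trace(np.linalg.matrix_power(M, n)))

    # toy first order transition matrix on {a, c, g, t}; the entries are illustrative only
    letters = "acgt"
    M = np.full((4, 4), 0.2) + 0.2 * np.eye(4)   # each row sums to 1
    n, w = 1000, "gaga"
    print("E[Y(w)] = n mu(w) =", n * mu(w, M, letters, n))   # Eq. (51)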

Calculation of the variance follows a similar path to that for an i.i.d. sequence, except
that Eq. (25) picks up an extra term because non-overlapping occurrences of w are no
longer independent. Assuming $n \ge 2k$, we have
$$\mathrm{Var}(Y(w)) = \sum_{i=1}^{n} \mathrm{Var}(I_i(w)) + \sum_{0 < |i-j| < k} \mathrm{Cov}(I_i(w), I_j(w)) + \sum_{k \le |i-j| \le n-k} \mathrm{Cov}(I_i(w), I_j(w)). \qquad (53)$$

Figure 9: Configurations of k-words at positions i and j corresponding to the three terms contributing to Var(Y(w)). The circles represent the periodic sequence A, and s is the summation variable in Eq. (54).

The three terms correspond to the three cases illustrated in Fig. 9. Following through a
lengthy but straightforward calculation analogous to that leading to Eq. (50) yields

$$\mathrm{Var}(Y(w)) = n \left( \mu(w) + 2 \sum_{s=1}^{k-1} \beta_{k-s}(w)\,\mu(w_1 \ldots w_s w_1 \ldots w_k) + \sum_{s=k}^{n-k} \gamma(w, s) - n\,\mu(w)^2 \right), \qquad (54)$$
where $\beta_{k-s}(w)$ is defined by Eq. (21), and
$$\gamma(w, s) = \frac{M_{w_1 w_2} \cdots M_{w_{k-1} w_k} (M^{s-k+1})_{w_k w_1}\; M_{w_1 w_2} \cdots M_{w_{k-1} w_k} (M^{n-k-s+1})_{w_k w_1}}{\mathrm{tr}\,(M^n)}. \qquad (55)$$
Derivation of this result is left as an exercise for the reader.
Note that as n becomes large, each row of the matrix $M^n$ approaches the stationary distribution of the Markov chain, if the stationary distribution exists. In fact¹
$$\lim_{n \to \infty} (M^n)_{ab} = \pi_b, \qquad \lim_{n \to \infty} \mathrm{tr}\,(M^n) = 1, \qquad (56)$$
where $\pi_a$ is the stationary distribution satisfying
$$\sum_{a=1}^{d} \pi_a M_{ab} = \pi_b. \qquad (57)$$
It follows that
$$\lim_{n \to \infty} \frac{1}{n}\, E[Y(w)] = \pi_{w_1} M_{w_1 w_2} \cdots M_{w_{k-1} w_k}. \qquad (58)$$
¹Another approach to studying word counts in Markovian sequences is to specify the starting distribution Eq. (34) to be the stationary distribution of the Markov matrix, rather than introducing periodic boundaries. Word count distributions from the two approaches agree in the large n limit if the word counts are scaled by the sequence length.

Furthermore, one can show that the final two terms inside the brackets in Eq. (54)
contribute a finite difference as $n \to \infty$, and hence that $\lim_{n \to \infty} (1/n)\,\mathrm{Var}(Y(w))$ is also
finite.
Finally, given an m-th order Markov model, by using the embedding technique the
mean and variance of the word count for a word of length k can be found from Eqs. (51)
and (54) with M replaced by $\mathcal{M}$, w replaced by $\vec w$ and k replaced by $k - m + 1$, provided $m \le k \le \frac{1}{2}n + m - 1$.

4 Sequence Alignment
4.1 Sequence Similarity
New DNA, RNA and protein sequences develop from pre-existing sequences rather than
get invented by nature from scratch. This fact is the cornerstone of computational sequence analysis. If we manage to recognise a significant similarity between a new
sequence and a sequence about which something (e.g., structure or function) is already
known, then chances are that the known information applies, at least to some extent, to
the new sequence as well. We say that the two related sequences are homologous, and
it is sequence homology that will be of interest to us during much of the bioinformatics
section of this course.
The most popular way to infer homology is sequence similarity. If two given se-
quences are very long, it is not easy to decide whether or not they are similar. To see if
they are similar, one has to properly align them. When sequences evolve, their residues
can undergo substitutions (when residues are replaced by some other residues). Apart
from substitutions, during the course of evolution sequences can accumulate a number
of events of two more types: insertions (when new residues appear in the sequence in
addition to the existing ones) and deletions (when some residues disappear). There-
fore, when one is trying to produce the best possible alignment between two sequences,
residues must be allowed to be aligned not only to other residues but also to gaps. The
presence of a gap in an alignment represents either an insertion or deletion event. Con-
sider, for example, the following two very short nucleotide sequences, each consisting of
only seven residues:
sequence 1: T A C C A G T, sequence 2: C C C G T A A.
Since the sequences are of the same length, there is only one way to align them, if one
does not allow gaps in alignments:
sequence 1: T A C C A G T
sequence 2: C C C G T A A
However, if we allow gaps, there are many possible alignments. In particular, the fol-
lowing alignment seems to be much more informative than the preceding one:
sequence 1: T A C C A G T
(59)
sequence 2: C C C G T A A
Alignment (59) indicates that the subsequence CCGT may be a common conserved
region for both sequences. Another possible alignment that also looks reasonable is:
sequence 1: T A C C A G T
(60)
sequence 2: C C C G T A A
How can one choose between alignments (59) and (60)? Are there any better align-
ments? To answer these questions we need to be able to score any possible alignment.
Then the alignment that has the highest score is by definition the best possible one.

The simplest scoring schemes assume independence among the columns in an align-
ment and set the total score of the alignment to be equal to the sum of the scores of each
column. Therefore, for such schemes one only needs to specify the scores s(a, b) = s(b, a)
and the gap penalty $s(-, a) = s(a, -)$, where a and b take the values A, C, G, T in the
case of DNA sequences and all possible 20 amino acid values in the case of protein se-
quences. The resulting best alignment between two sequences depends, of course, on the
scoring scheme. It is possible that for two different scoring schemes the best alignments
will be entirely different. As an example of a scoring scheme one can set s(a, a) = 1, $s(a, b) = -1$ if $a \ne b$, and $s(-, a) = s(a, -) = -2$. However, it is important to keep in
mind that a scoring scheme must be biologically relevant in order to produce a sensible
alignment. More complex scoring schemes introduce some degree of dependence among
the columns in an alignment by making the score of a continuous gap region an affine
function of its length (note that in the example above the score of a gap region is linear
in its length).
The numbers s(a, b) form a substitution matrix. Substitution matrices are always
symmetric and must possess some specific properties in order to be successfully used
for sequence comparison. These properties will be discussed in detail in the next sec-
tion. The most popular substitution matrices are so-called PAM and BLOSUM matrices
used to align amino acid sequences. These $20 \times 20$ matrices are derived by statistically
analysing known amino acid sequences. For the time being we will assume that the
substitution matrix is given. For simplicity, we will initially restrict our considerations
only to DNA sequences. This setup will be sufficient to demonstrate the main principles
of alignment algorithms.

4.2 Why not align by brute force ?


The algorithm described below is an example of a dynamic programming algorithm. To
appreciate why such algorithms have been developed, it is instructive to ask why the
optimum alignment cannot be found simply by programming the computer to generate
all possible alignments and choose the one with the highest score.
Suppose we are given two sequences of length n and m respectively, and let the
number of possible alignments (i.e. the number of possible ways of inserting gaps to
form alignments such as (59) and (60) above) be g(n, m). Without loss of generality assume $n \le m$. For every possible alignment there will be k letters out of the first sequence and k letters out of the second sequence which align with each other, and $n - k$ letters out of the first sequence and $m - k$ letters out of the second sequence which align with gaps, where $0 \le k \le n$. Thus a lower bound on g(n, m) is
$$g(n, m) > \sum_{k=0}^{n} \binom{m}{k}\binom{n}{k} = \binom{m+n}{n}. \qquad (61)$$

For simplicity, assume n = m and use Stirling's approximation
$$n! \approx n^{n+\frac{1}{2}} e^{-n}, \qquad (62)$$
to obtain
$$g(n, n) > \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{(2n)^{2n+\frac{1}{2}} e^{-2n}}{\left(n^{n+\frac{1}{2}} e^{-n}\right)^2} = 2^{2n} \sqrt{\frac{2}{n}}. \qquad (63)$$
Biologists frequently need to align sequences of up to 1,000 letters. For n = 10, 100
and 1,000 letters, the above result gives $g(n, m) > 10^5$, $10^{59}$ and $10^{598}$ respectively. The age of the universe since the big bang is of the order of $10^{18}$ seconds. If a computer were able to check one alignment per nanosecond it would only have tested $10^{27}$ possible alignments since the beginning of time, and would need more than $10^{32}$ times the age of
the universe to align two sequences of length 100 letters!
Next we describe an algorithm which will enable you to align two sequences of length
100 letters on your laptop in a fraction of a second.
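
To get a feel for the size of this lower bound, the following short Python sketch evaluates binom(2n, n) exactly for the three values of n quoted above.

    from math import comb

    # the lower bound binom(2n, n) of Eq. (61) grows astronomically with n
    for n in (10, 100, 1000):
        g = comb(2 * n, n)
        print(f"n = {n}: binom(2n, n) has {len(str(g))} digits")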

4.3 Dynamic Programming: Global Alignment


Let's assume a linear gap model (that is, $s(-, a) = s(a, -) = -d$, with $d > 0$, so that the score of a gap region of length L is equal to $-dL$) and present a dynamic
programming algorithm, the Needleman-Wunsch algorithm, that always finds the best
possible alignment.
The idea is to build up an optimal alignment using previous solutions for opti-
mal alignments of smaller subsequences. Suppose we are given two sequences x =
$x_1, x_2, \ldots, x_i, \ldots, x_n$ and $y = y_1, y_2, \ldots, y_j, \ldots, y_m$. We construct an $(n+1) \times (m+1)$ matrix whose $(i, j)$th entry is equal to $F(i, j)$, which is the score of the best alignment between $x_1, \ldots, x_i$ and $y_1, \ldots, y_j$, with $0 \le i \le n$ and $0 \le j \le m$. We build $F(i, j)$ recursively, initialising $F(0, 0) = 0$ and then proceeding to fill the matrix from top left to bottom right. If $F(i-1, j-1)$, $F(i-1, j)$ and $F(i, j-1)$ are known, we calculate $F(i, j)$ as follows
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d. \end{cases}$$

Indeed, there are three possible ways to obtain the best score F (i, j): xi can be aligned
to yj (see the first option in the formula above), or xi is aligned to a gap (the second
option), or yj is aligned to a gap (the third option). Calculating F (i, j) we keep a pointer
to the option from which F (i, j) was produced. When we reach F (n, m) we trace back
the pointers to recover the optimal alignment. The value of F (n, m) is exactly its score.
Note that more than one pointer may come out of a particular cell of the matrix, which
results in several optimal alignments.
It is usual to describe algorithms requiring computer implementation, such as the
Needleman-Wunsch algorithm, in pseudocode, that is, a descriptive imitation of real code
which could be handed to a competent programmer to implement in a suitable computer
language. In this case we have:

The Needleman-Wunsch Algorithm:

Initialisation: Append a "-" to the beginning of each sequence and for i = 0, . . . , m, j = 0, . . . , n, set

F(i, 0) = -di,   F(0, j) = -dj.

Recursion: for i = 1, . . . , m, for j = 1, . . . , n, set

F(i, j) = max { F(i-1, j-1) + s(x_i, y_j),   F(i-1, j) - d,   F(i, j-1) - d }

ptr(i, j) = arg max { F(i-1, j-1) + s(x_i, y_j),   F(i-1, j) - d,   F(i, j-1) - d }

Traceback: Follow the pointers from (m, n) back to (0, 0) to find the optimum alignments. F(m, n) gives the optimum score.

For example, let x = GAATCT, y = CATT and suppose that we are using the scoring scheme: s(a, a) = 1, $s(a, b) = -1$ if $a \ne b$, and $s(-, a) = s(a, -) = -2$. Then the matrix F(i, j) with pointers is given in Figure 10. Tracing back the pointers then gives the following three best alignments

x: G A A T C T
y: - C A T - T,

x: G A A T C T
y: C - A T - T,

x: G A A T C T
y: C A - T - T

with score -2.
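
The recursion is easy to implement. The following minimal Python sketch fills the F matrix for a linear gap penalty and returns the global alignment score; for brevity it omits the pointers and the traceback described above. Run on the example of Figure 10 it returns -2.

    def needleman_wunsch_score(x, y, match=1, mismatch=-1, d=2):
        # fill the global alignment matrix F of the Needleman-Wunsch recursion;
        # pointers and the traceback are omitted to keep the sketch short
        n, m = len(x), len(y)
        F = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            F[i][0] = -d * i
        for j in range(1, m + 1):
            F[0][j] = -d * j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if x[i - 1] == y[j - 1] else mismatch
                F[i][j] = max(F[i - 1][j - 1] + s,    # x_i aligned to y_j
                              F[i - 1][j] - d,        # x_i aligned to a gap
                              F[i][j - 1] - d)        # y_j aligned to a gap
        return F[n][m]

    print(needleman_wunsch_score("GAATCT", "CATT"))   # -2, the bottom-right entry of Figure 10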

4.4 Dynamic Programming: Local Alignment


Another interesting alignment problem is to find subsequences of two given sequences
that have the highest-scoring alignment. It is called the local alignment problem. Here
we present the Smith-Waterman algorithm that solves the problem for a linear gap
model.

F        -    C    A    T    T
-        0   -2   -4   -6   -8
G       -2   -1   -3   -5   -7
A       -4   -3    0   -2   -4
A       -6   -5   -2   -1   -3
T       -8   -7   -4   -1    0
C      -10   -7   -6   -3   -2
T      -12   -9   -8   -5   -2

Figure 10: Needleman-Wunsch algorithm for a linear gap model

We construct an $(n+1) \times (m+1)$ matrix as in the previous section, but the formula for its entries is slightly different:
$$F(i, j) = \max \begin{cases} 0, \\ F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d. \end{cases}$$

Taking the first option in the above formula corresponds to starting a new alignment.
If the best alignment up to some point has a negative score, it is better to start a new
one, rather than extend the old one. Likewise, the algorithm is initialised by setting the
first row and column of the F -matrix to 0.
Another difference is that now an alignment can end anywhere in the matrix, so
instead of taking the value F (n, m) in the bottom right corner of the matrix for the best
score, we look for the maximum value of F (i, j) in the matrix and start the traceback
from there. The traceback ends when we meet a cell with value 0 which corresponds to
the start of the alignment. The pseudocode is then:

The Smith-Waterman Algorithm:

Initialisation: Append a "-" to the beginning of each sequence and for i = 0, . . . , m, and j = 0, . . . , n, set

F(i, 0) = F(0, j) = 0.

Recursion: for i = 1, . . . , m, for j = 1, . . . , n, set

F(i, j) = max { 0,   F(i-1, j-1) + s(x_i, y_j),   F(i-1, j) - d,   F(i, j-1) - d }

ptr(i, j) = arg max { 0,   F(i-1, j-1) + s(x_i, y_j),   F(i-1, j) - d,   F(i, j-1) - d }

Traceback: Starting from each (i_end, j_end) such that F(i_end, j_end) = max_{i,j} F(i, j), follow the pointers back to each (i_start, j_start) such that F(i_start, j_start) = 0 to find the optimum alignments. F(i_end, j_end) gives the optimum score.

For the example discussed in the preceding section the best local alignment, illus-
trated in Figure 11, is
x: A T
y : A T,
and its score is equal to 2. In general, there may be multiple alignments with multiple
start points and multiple end points.
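
A sketch of the corresponding local alignment recursion is given below; again the traceback is omitted and only the best local score is returned. Run on the example of Figure 11 it returns 2.

    def smith_waterman_score(x, y, match=1, mismatch=-1, d=2):
        # best local alignment score; the traceback that recovers the alignment is omitted
        n, m = len(x), len(y)
        F = [[0] * (m + 1) for _ in range(n + 1)]
        best = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if x[i - 1] == y[j - 1] else mismatch
                F[i][j] = max(0,
                              F[i - 1][j - 1] + s,
                              F[i - 1][j] - d,
                              F[i][j - 1] - d)
                best = max(best, F[i][j])
        return best

    print(smith_waterman_score("GAATCT", "CATT"))   # 2, the largest entry in Figure 11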

4.5 Substitution matrices for sequence alignments


In the previous section we considered the problem of alignment of DNA or protein
sequences. Alignment consisted of maximising a score which typically consisted of a sum of elements of a substitution matrix, adding s(x, y) if letter x was aligned with letter y at each point along the alignment and subtracting a gap penalty if a letter was aligned with a gap.
For DNA sequences it is generally considered adequate to use the simple substitution

F C A T T
0 0 0 0 0

G 0 0 0 0 0

A 0 0 1 0 0

A 0 0 1 0 0

T 0 0 0 2 1

C 0 1 0 0 1

T 0 0 0 1 1

Figure 11: Smith-Waterman algorithm for a linear gap model

matrix
$$s(x, y) = \begin{cases} 1 & \text{if } x = y, \\ -1 & \text{if } x \ne y, \end{cases} \qquad (64)$$
where $x, y \in \{A, C, G, T\}$. Thus, if we further impose a gap penalty of d = 2, the alignment

sequence 1: T A C C A G T - -
sequence 2: - - C C C G T A A          (65)

has a score of
$$-2 - 2 + 1 + 1 - 1 + 1 + 1 - 2 - 2 = -5. \qquad (66)$$
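
Scoring a given gapped alignment column by column is equally simple; the sketch below reproduces the score of Eq. (66) for alignment (65), with "-" denoting a gap.

    def alignment_score(s1, s2, match=1, mismatch=-1, gap=-2):
        # column-by-column score of a gapped alignment, columns treated independently
        total = 0
        for a, b in zip(s1, s2):
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
        return total

    print(alignment_score("TACCAGT--", "--CCCGTAA"))   # -5, as in Eq. (66)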
For protein sequences, which consist of letters from a 20-letter amino acid alphabet, a more complicated substitution matrix is required which takes into account the observed phenomenon that a substitution with a chemically similar amino acid is more likely than
substitution with a chemically dissimilar amino acid. For instance, valine (V ), which is
hydrophobic, is more likely to be substituted with leucine (L), also hydrophobic, than
with arginine (R), which is hydrophilic, so we would expect s(V, L) > s(V, R). The most
commonly used amino acid substitution matrices are the BLOck SUbstitution Matrices,
or BLOSUM matrices for short, developed by Henikoff and Henikoff [9]. The BLOSUM62
matrix, which is used as the default in the BLAST alignment tool on the NCBI website
(www.ncbi.nlm.nih.gov/BLAST/), is shown in Fig. 12.
Here we give a very brief description of how the BLOSUM matrices were constructed.
For a more complete description, see Chapter 6.5 of the text book by Ewens and Grant [1]
or Chapter 9 of the text book by Isaev [10].

Figure 12: The BLOSUM62 matrix. The matrix is symmetric about the diagonal.

Construction of BLOSUM matrices began with an extensive database of trusted blocks of multiply aligned sequences known as the Block Database [11] consisting of
2884 blocks based on 770 protein families. In the first instance the multiple alignments
were performed assuming the substitution matrix s(x, y) = 1 for a match, 0 for a mis-
match, and only ungapped alignments were considered. Fig. 13 gives an indication of
what a small section of such a database might look like. Define $c_x$ to be the number of occurrences of amino acid x in the database and $c_{xy}$ to be the number of occurrences of a pairwise alignment between the amino acid x and amino acid y, where $x, y \in L = \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}$.
We have that the probability that a letter chosen at random from the database is x is
$$p_x = \frac{c_x}{\sum_{u \in L} c_u}, \qquad (67)$$
and that the probability that a pairwise alignment chosen at random from the database is (x, y) is
$$q_{xy} = \frac{c_{xy}}{\sum_{u \le v \in L} c_{uv}}, \qquad (68)$$
where $u < v$ means that u precedes v in the alphabet.
Consider the quantity
$$\lambda_{xy} = \begin{cases} \dfrac{q_{xy}}{2 p_x p_y} & \text{if } x \ne y, \\[2ex] \dfrac{q_{xy}}{p_x p_y} & \text{if } x = y. \end{cases} \qquad (69)$$

Figure 13: A set of four blocks from a database of trusted multiple alignments. (Taken
from ref. [1].)

This is the ratio of the proportion of times that x and y are observed to be aligned to
the likelihood of alignment by chance in a pair of independent i.i.d. sequences with letter
frequencies equal to that observed in the database. The factor of 2 allows for the cases
that x is in the first sequence and y in the second, or vice versa. For substitutions that
occur more often than would be expected by chance, $\lambda_{xy} > 1$, and for substitutions that occur less often than would be expected by chance, $\lambda_{xy} < 1$. The BLOSUM substitution matrix is then defined as
$$s(x, y) = \mathrm{round}(2 \log_2 \lambda_{xy}), \qquad (70)$$
where round($\cdot$) means rounded to the nearest integer. As we shall see in the next
section, restricting the substitution matrix to have only integer entries will allow us to
use random walk theory to assess the significance of alignments under a suitable null
hypothesis. We will also see in Section 6.1 that there are sound mathematical reasons
for this choice of substitution matrix.
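
The following toy Python sketch illustrates Eqs. (67)-(70) with made-up counts for a two-letter alphabet; the numbers have no biological meaning and are chosen only to show the mechanics of converting counts into a log-odds score.

    from math import log2

    def blosum_entry(x, y, pair_counts, letter_counts):
        # Eq. (70): s(x, y) = round(2 log2 lambda_xy), with lambda_xy built from
        # observed letter counts and (unordered) aligned-pair counts as in Eqs. (67)-(69)
        p = {a: c / sum(letter_counts.values()) for a, c in letter_counts.items()}
        total_pairs = sum(pair_counts.values())
        q = pair_counts.get((x, y), pair_counts.get((y, x), 0)) / total_pairs
        lam = q / (p[x] * p[y]) if x == y else q / (2 * p[x] * p[y])
        return round(2 * log2(lam))

    # toy counts over a two-letter "alphabet"; the numbers are purely illustrative
    letter_counts = {"V": 600, "L": 400}
    pair_counts = {("V", "V"): 260, ("L", "L"): 140, ("V", "L"): 100}
    print(blosum_entry("V", "V", pair_counts, letter_counts))
    print(blosum_entry("V", "L", pair_counts, letter_counts))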
Two refinements to the construction procedure described above were used in construction of the BLOSUM matrices (see ref. [1] for details):
1. To allow for the fact that the database will typically include overrepresentation of
groups of highly similar sequences, groups of sequences with higher than n% sim-
ilarity are treated as a single consensus sequence for the purpose of defining the
effective numbers of occurrences $c_x$ of individual letters and $c_{xy}$ of aligned pairs of
letters. Here n is a predefined percentage between 1 and 100 indicating how closely
related the sequences being aligned are thought to be: larger numbered BLOSUM matrices are useful for aligning recently diverged proteins, smaller numbered matrices are useful for aligning more distantly related proteins. A matrix constructed
by combining sequences of n% similarity is referred to as the BLOSUMn matrix.

The NCBI BLAST web page supports BLOSUM45, BLOSUM62 and BLOSUM80
matrices.

2. Once the above process has been followed and a first iteration of the BLOSUM
matrix has been constructed, the entire procedure is repeated with the initial
scoring matrix of ones and zeroes used to construct the database of trusted blocks
replaced by the first-iteration BLOSUM matrix. Then the BLOSUM matrix so
formed is used to carry out a third iteration, and it is this third iteration which
gives the final BLOSUM matrix.

A complete set of BLOSUMn matrices for a range of n and various groupings of amino
acids can be found at the ReBLOSUM web site http://bioinfo.lifl.fr/reblosum/.

5 Assessing the significance of alignments using random
walks
The most widely used software for searching large protein databases for close matches to
a given query sequence is BLAST (Basic Local Alignment Search Tool). BLAST searches
for high scoring alignments and tests the significance of scores via p-values. The null hypoth-
esis tested against is that the two sequences being compared were generated separately
as i.i.d. sequences with the letter distributions appropriate to the BLOSUM matrices
used in scoring the alignment or estimated from letter frequencies in the databases.
Suppose that amino acid j (= 1, . . . 20) occurs in sequence 1 (query sequence) with
probability $p_j$, and that amino acid k occurs in sequence 2 (from database) with probability $p'_k$. Then the null hypothesis is
$$H_0: \ \mathrm{Prob}(\text{occurrence of aligned pair } (j, k)) = p_j\, p'_k. \qquad (71)$$

The cumulative score is a random walk under H0 .


We will illustrate the point with the simpler example of an ungapped DNA alignment.
Suppose that we are aligning two sequences with letter probabilities $\{p_A, p_C, p_G, p_T\}$ and $\{p'_A, p'_C, p'_G, p'_T\}$ respectively, and that we assume the substitution matrix defined by Eq. (64). Defining the probability of a match at any site in an alignment of two independent i.i.d. sequences as p and the probability of a mismatch as q, we have
$$p = p_A p'_A + p_C p'_C + p_G p'_G + p_T p'_T, \qquad q = 1 - p. \qquad (72)$$
Starting from the left hand end of the alignment, the cumulative score is a random walk with possible step sizes $\pm 1$ (see Fig. 14).

[Figure 14 plots the cumulative score of the aligned sequences ATCTGCAGCTTAGCTT and ACCTTGCGCAAGGGAT, with the excursions and ladder points of the walk marked.]

Figure 14: A random walk generated by the alignment of two i.i.d. DNA sequences.

For any random walk we make the following definitions:

Ladder Points: A ladder point is a point in the walk which is lower than any previously
reached point.

Excursions: An excursion is the height reached between two consecutive ladder points above the earlier of the two ladder points.
Note that substitution matrices are always designed so that the expected step size is
negative.² This entails that the random walk has the tendency to drift downwards and that, as we shall see, excursions have a geometric-like distribution. BLAST uses the
maximum excursion in an alignment as a test statistic for determining p-values.
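
The definitions above are easy to compute for any particular walk. The sketch below scores the ungapped alignment of the two sequences shown in Figure 14 with +1 for a match and -1 for a mismatch, and records the ladder points and the maximum excursion; it is an illustration of the definitions rather than of BLAST itself.

    def ladder_points_and_max_excursion(steps):
        # walk through the cumulative score, recording ladder points (new all-time
        # minima) and the largest excursion above the most recent ladder point
        height, ladder, max_excursion = 0, 0, 0
        ladder_positions = [0]
        for i, s in enumerate(steps, start=1):
            height += s
            if height < ladder:
                ladder = height
                ladder_positions.append(i)
            else:
                max_excursion = max(max_excursion, height - ladder)
        return ladder_positions, max_excursion

    # +1 for a match, -1 for a mismatch, for the two sequences shown in Figure 14
    seq1 = "ATCTGCAGCTTAGCTT"
    seq2 = "ACCTTGCGCAAGGGAT"
    steps = [1 if x == y else -1 for x, y in zip(seq1, seq2)]
    print(ladder_points_and_max_excursion(steps))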
In the next two subsections we use two different methods to explore the properties
of random walks. The first of these is simpler to understand, but is only applicable to a simple random walk, that is, one for which the only permissible step sizes are $\pm 1$.
The second is applicable to any random walk with integer step sizes, such as the random
walks generated by protein alignments scored with BLOSUM matrices.

5.1 Difference equation approach to random walks


Consider a simple random walk with allowed step sizes $\pm 1$, $\mathrm{Prob}(\text{step} = +1) = p$ and $\mathrm{Prob}(\text{step} = -1) = q$ where $p < q$ and $p + q = 1$. Suppose that the walk starts at some integer position h on the real line and ends either at a ceiling b or a floor a for some pre-specified a and b where $a \le h \le b$. We are interested in calculating two quantities:
(i) the probability that the walk finishes at b as opposed to a, and (ii) the expected
length of the walk.
Let

wh = Prob (walk finishes at b)


uh = Prob (walk finishes at a) (= 1 wh ). (73)

The first step is either up (with probability p) or down (with probability q), so

w_h = p w_{h+1} + q w_{h−1}.   (74)

This is a second order, linear, homogeneous difference equation with boundary conditions

w_a = 0,   w_b = 1.   (75)

A second order linear difference equation has similar behaviour to a second order
linear differential equation. Namely, we can expect two linearly independent solutions
to Eq. (74), and the two boundary conditions will in general ensure a unique solution.
We try for a solution of the form w_h = e^{λh}, where λ is to be determined. Substituting
into the difference equation gives a quadratic in e^{λ},

p e^{2λ} − e^{λ} + q = 0,   (76)
2 For instance, in the above example, if the random variable S is the step size at any point in the walk,
then E(S) = (+1)p + (−1)q = p − q, so for the case of a uniform distribution p_A = p_C = p_G = p_T = 1/4,
one easily checks that E(S) = −1/2.

which has the two solutions λ = 0 and λ = log(q/p). Thus the general solution to
Eq. (74) is

w_h = C_1 + C_2 e^{λh},   where λ = log(q/p).   (77)

One easily checks that the solution

w_h = (e^{λh} − e^{λa})/(e^{λb} − e^{λa})   (78)

is of the required form Eq. (77) and satisfies the boundary conditions Eq. (75). The
probability that the walk finishes at a is

u_h = 1 − w_h = (e^{λb} − e^{λh})/(e^{λb} − e^{λa}).   (79)
Let the expected number of steps before the walk finishes be m_h. After the first step,
which is either a step up or a step down, the remaining expected number of steps is m_{h+1}
or m_{h−1} respectively. This leads to the second order, linear, inhomogeneous difference equation

m_h − 1 = p m_{h+1} + q m_{h−1},   (80)

with boundary conditions

m_a = m_b = 0.   (81)

This difference equation has the same homogeneous part as Eq. (74), and therefore has
a general solution of the form

m_h = C_1 + C_2 e^{λh} + particular solution.   (82)

By direct substitution one checks that a particular solution is m_h = h/(q − p). Applying
the boundary conditions to determine the constants C_1 and C_2 gives for the complete
solution

m_h = (h − a)/(q − p) − [(b − a)/(q − p)] (e^{λh} − e^{λa})/(e^{λb} − e^{λa}).   (83)

Using Eqs. (78) and (79) and a little algebra, this can be written as

m_h = [w_h(b − h) + u_h(a − h)]/(p − q).   (84)
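
As a quick numerical check of Eqs. (78) and (84), the following short Python sketch estimates w_h and m_h by simulating many walks and compares the estimates with the formulae; the values p = 0.3, h = 2, a = 0, b = 5 are arbitrary illustrative choices.

import numpy as np

def simulate_walk(p, h, a, b, n_walks=100_000, seed=0):
    """Estimate Prob(finish at b) and the mean walk length by simulation."""
    rng = np.random.default_rng(seed)
    hits_b = 0
    total_steps = 0
    for _ in range(n_walks):
        pos, steps = h, 0
        while a < pos < b:
            pos += 1 if rng.random() < p else -1
            steps += 1
        hits_b += (pos == b)
        total_steps += steps
    return hits_b / n_walks, total_steps / n_walks

p, q = 0.3, 0.7
h, a, b = 2, 0, 5
lam = np.log(q / p)

# Eq. (78) for w_h, and Eq. (84) for m_h
w_h = (np.exp(lam*h) - np.exp(lam*a)) / (np.exp(lam*b) - np.exp(lam*a))
u_h = 1.0 - w_h
m_h = (w_h*(b - h) + u_h*(a - h)) / (p - q)

w_sim, m_sim = simulate_walk(p, h, a, b)
print(f"w_h: formula {w_h:.4f}, simulation {w_sim:.4f}")
print(f"m_h: formula {m_h:.4f}, simulation {m_sim:.4f}")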

5.2 Moment generating function approach to random walks


Although it is relatively simple to follow, the difference equation approach will not work
for more complicated substitution matrices which allow a wider range of step sizes than
±1. We next re-derive w_h, u_h and m_h in a way that readily generalises to random walks
with any finite set of allowed integer-valued step sizes.
Define the i.i.d. random variables S_i, i = 1, 2, 3, . . ., to be the steps in a random walk,
and let S be the random variable with their common distribution. For instance, in the
simple random walk of the previous section we had Prob(S = +1) = p; Prob(S = −1) = q,
but in general we consider any distribution over a finite set of integer-valued step sizes
with the property that E(S) < 0.
As before, we assume that the walk starts at position h, and ends when the walk first
crosses either a floor at a or a ceiling at b, where a ≤ h ≤ b. Note that if step sizes of
magnitude greater than 1 are allowed, the walk may overshoot the floor or ceiling in the
final step. The moment generating function approach relies on a result known as Wald's
Identity, which first requires some definitions. We define two new random variables (see
Fig. 15):

N = the number of steps until the walk ends;
T_N = Σ_{j=1}^{N} S_j = the final displacement from the starting point.

We also need to consider the moment generating function of the step S, namely

m_S(θ) = E(e^{θS}).   (85)

With this machinery in place, we quote the following theorem.

Wald's identity: For any θ such that m_S(θ) ≥ 1,

E[ m_S(θ)^{−N} e^{θ T_N} ] = 1,   (86)

where the expectation value is taken with respect to the joint distribution of T_N and N.

The proof of Wald's identity is lengthy and is given in Appendix C.

[Figure: two walks starting at h; one ends at the ceiling b after N steps with displacement T_N = b − h, the other ends at the floor a after N steps with displacement T_N = a − h.]

Figure 15: The random variables N and T_N.

To illustrate the method, we return to the simple random walk with allowed steps
±1, for which the generating function is

m_S(θ) = q e^{−θ} + p e^{θ},   (87)

where p + q = 1 and p < q. It is easy to check that there exists a unique number λ > 0
such that

m_S(λ) = 1,   (88)

given by

λ = log(q/p).   (89)

Evaluating Wald's identity at θ = λ gives

E(e^{λ T_N}) = 1.   (90)

For a simple random walk which must terminate at either a or b without overshoot, the
only possibilities for T_N are

T_N = b − h with probability w_h,  or  a − h with probability u_h = 1 − w_h.   (91)

Thus Eq. (90) gives

(1 − w_h) e^{λ(a−h)} + w_h e^{λ(b−h)} = 1,

which solves to give

w_h = (e^{λh} − e^{λa})/(e^{λb} − e^{λa}),   (92)

and we have recovered Eq. (78).
To derive the expected number of steps to completion of the random walk, m_h =
E(N), we differentiate Wald's identity. From Eq. (86),

0 = d/dθ E[ m_S(θ)^{−N} e^{θT_N} ]
  = E[ −N m_S(θ)^{−N−1} m_S′(θ) e^{θT_N} + m_S(θ)^{−N} T_N e^{θT_N} ].   (93)

Setting θ = 0 and using the properties of generating functions that m_S(0) = 1 and
m_S′(0) = E(S) gives

0 = E(−N E(S) + T_N) = −E(N)E(S) + E(T_N),   (94)

or

m_h = E(T_N)/E(S).   (95)

This result is a consequence of Wald's identity and therefore holds for any random walk.
For the particular case of the simple random walk we have (using Eq. (91))

E(S) = p − q,   E(T_N) = u_h(a − h) + w_h(b − h).   (96)

This gives

m_h = [w_h(b − h) + u_h(a − h)]/(p − q),   (97)

which agrees with Eq. (84).

5.3 Ladder points and excursions
Recall from Fig. 14 that the BLAST algorithm needs to be able to estimate the expected
distance to the next ladder point and the distribution under the null hypothesis of the
maximum upward excursion in an alignment. To calculate these values we set h = 0,
a = −1 (to define a ladder point), b = y, and consider the probability w_0 that the walk
starting at 0 reaches a height y as y gets large.
From Eqs. (79) and (78), the probabilities that the walk ends at −1 or y respectively
are

u_0 = (e^{λy} − 1)/(e^{λy} − e^{−λ}),   w_0 = (1 − e^{−λ})/(e^{λy} − e^{−λ}) ∼ (1 − e^{−λ}) e^{−λy}   as y → ∞.   (98)

If the random variable Y is the maximum height (or excursion) reached, then

Prob(Y ≥ y) ∼ C e^{−λy}   as y → ∞,   (99)

where C = 1 − e^{−λ}. Thus Y is a geometric-like random variable.


Let the expected distance to the next ladder point be A. Then Eq. (84) implies

A = (u_0 − y w_0)/(q − p).   (100)

As y → ∞, we have u_0 → 1 and y w_0 ∼ const. × y e^{−λy} → 0. So the expected distance to
the next ladder point is

A = 1/(q − p).   (101)

5.4 An example with overshoot


Suppose we start with the substitution matrix: +1 for a match, −2 for a mismatch, so
the step size is

S = +1 with probability p, say,  or  −2 with probability q = 1 − p,   (102)

where p is set so that E(S) = p − 2q < 0. As before we study ladder points and
excursions by setting the ceiling at b = y and the floor at a = −1. Because of the size
of the downward step a ladder point can either occur at the floor, −1, or overshoot the
floor and occur at −2. We cannot deal with this situation using the difference equation
method, but we can use the moment generating function method.
It is possible to show that for any random walk with E(S) < 0 there exists a unique
λ > 0 such that m_S(λ) = 1. In the current example, the moment generating function is,
from Eq. (102),

m_S(θ) = E(e^{θS}) = p e^{θ} + q e^{−2θ},   (103)

and setting m_S(λ) = 1 yields the unique positive solution

λ = log( (q + √(4pq + q²)) / (2p) ).   (104)

As for the simple random walk, we consider Wald's identity evaluated at θ = λ, Eq. (90).
This time the allowed values of the displacement T_N when the walk ends are y, −1 and
−2. Let

P_k = Prob(walk ends at k),   k = y, −1, −2.   (105)

Then Eq. (90) gives

P_y e^{λy} + P_{−1} e^{−λ} + P_{−2} e^{−2λ} = 1,   (106)

or

P_y = (1 − P_{−1} e^{−λ} − P_{−2} e^{−2λ}) e^{−λy}.   (107)

We are interested in the asymptotic distribution as y → ∞ and hence P_y → 0. Define

R_1 = lim_{y→∞} P_{−1} = Prob(ladder point is at T_N = −1),
R_2 = lim_{y→∞} P_{−2} = Prob(ladder point is at T_N = −2).

These probabilities can be calculated as follows:

R_2 = Prob(walk ends at −2)
    = Prob(walk goes immediately to −2)
    + Prob(walk goes to +1, then to 0 as next point below +1, then to a ladder point at −2),

or equivalently,

R_2 = q + p(1 − R_2)R_2.   (108)

This is a quadratic in R_2 which solves to give

R_2 = ( −q + √(4pq + q²) ) / (2p).   (109)

Since the probabilities of both possible outcomes add to 1,

R_1 = 1 − R_2.   (110)
If Y is the height of an upward excursion from the last ladder point, Eq. (107) gives

Prob(Y ≥ y) ∼ C e^{−λy}   as y → ∞,   (111)

where C = 1 − R_1 e^{−λ} − R_2 e^{−2λ}, and R_1 and R_2 are given above. Once again we
see that the excursion height is a geometric-like random variable.
To find the mean distance between ladder points, we again use Eq. (95), where

E(T_N) = y P_y − P_{−1} − 2P_{−2} → −R_1 − 2R_2   as y → ∞,   (112)

and

E(S) = p − 2q.   (113)

Thus the expected distance A = lim_{y→∞} m_h between ladder points is

A = (R_1 + 2R_2)/(2q − p).   (114)
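
The quantities appearing in this example are easily evaluated numerically. The short sketch below (with the illustrative value p = 0.3) finds λ by solving m_S(λ) = 1 with a root finder and checks it against the closed form (104), then evaluates R_2, C and A from Eqs. (109)-(114).

import numpy as np
from scipy.optimize import brentq

p = 0.3                       # match probability; must satisfy E(S) = p - 2q < 0
q = 1.0 - p

# m_S(theta) - 1 for the step distribution S = +1 (prob p), -2 (prob q)
f = lambda th: p*np.exp(th) + q*np.exp(-2*th) - 1.0

lam_numeric = brentq(f, 1e-6, 10.0)                           # unique positive root
lam_closed = np.log((q + np.sqrt(4*p*q + q**2)) / (2*p))      # Eq. (104)

R2 = (-q + np.sqrt(4*p*q + q**2)) / (2*p)                     # Eq. (109)
R1 = 1.0 - R2                                                 # Eq. (110)
C = 1.0 - R1*np.exp(-lam_closed) - R2*np.exp(-2*lam_closed)   # Eq. (111)
A = (R1 + 2*R2) / (2*q - p)                                   # Eq. (114)

print(lam_numeric, lam_closed)   # the two values of lambda agree
print(R1, R2, C, A)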

5.5 The general case
Formulae for the asymptotic distribution of the excursion height and expected distance
between ladder points are given in Sections 7.5 and 7.6 of Ewens and Grant [1]. The
calculations are somewhat involved and not reproduced here. In summary, for any
random walk generated by a given substitution matrix with integer-valued elements and
sequences with an i.i.d. letter distribution such that the expected step size is negative,

1. The upward excursions Y between ladder points are geometric-like random variables
   satisfying

   Prob(Y ≥ y) ∼ C e^{−λy}   as y → ∞,   (115)

   where λ is the unique positive solution to

   m_S(λ) = 1,   (116)

   m_S(θ) is the moment generating function for the step size S, and C, which satisfies
   0 < C < 1, can be calculated;

2. The mean distance A between ladder points can be calculated via Eq. (95) in terms
   of the probabilities R_j that the ladder point occurs a distance j below the previous
   ladder point.

6 BLAST (Basic Local Alignment Search Tool)
In the previous section we developed the theory of random walks, with particular atten-
tion to the concepts of ladder points and upward excursions. We derived formulae for the
expected distance between ladder points and the probability that an upward excursion
exceeded a given height for walks with negative expected step size. In this section we
will see how the alignment software BLAST uses these concepts to attach a significance
score to an alignment. For simplicity we consider only ungapped alignments.

6.1 BLAST and the choice of substitution matrix


To assess the significance of an alignment, BLAST uses the highest upward excursion
of the running alignment score, calculated using the BLOSUM matrix, as test statistic.
We may be tempted to ask: How do we know that this particular test statistic is a good
measurement of the relatedness of sequences?
As a starting point, we assume the database of trusted blocks of multiple alignments
used for constructing BLOSUM matrices to be a good sample of truly related sequences.
To couch our question in the language of statistical hypothesis testing, consider the
following null and alternate hypotheses:

H_0: Prob(occurrence of aligned pair (j, k)) = p_j p'_k,
H_1: Prob(occurrence of aligned pair (j, k)) = q_{jk},   (117)

where amino acid j (= 1, . . . , 20) occurs in sequence 1 at a given position, amino acid k
(= 1, . . . , 20) occurs in sequence 2 at the corresponding aligned position, and p_j, p'_k and
q_{jk} are the relative letter frequencies and alignment frequencies estimated from trusted
blocks of multiple alignments, defined in Eqs. (67) and (68). The null hypothesis states
that the two sequences are unrelated strings of i.i.d. letters. This is the assumption
under which the upward excursions and distances between ladder points were calculated
in the previous section. The alternate hypothesis states that the two sequences contain
aligned pairs of letters at the same frequency as occurs in the trusted aligned blocks.
Suppose we begin with a large population of pairs of random i.i.d. sequences gener-
ated under the null hypothesis H0 . Suppose further that we pull out from this population
all sequence pairs with upward excursions higher than a certain (high) cutoff y, and dis-
card the remainder of the population. From this reduced set of sequences we further
extract the subsequences consisting of high scoring upward excursions and accept them
as our population of local alignments. If this procedure is equivalent to generating a
population of sequence pairs under the alternate hypothesis H1 , then we can have confi-
dence that upward excursions are an appropriate test statistic for measuring relatedness
of sequences.
The above is a potted version of the argument used by Karlin and Altschul [12] who
developed the BLAST algorithm. It remains to demonstrate that the procedure of choosing
the highest scoring upward excursions does indeed generate the correct frequency q_{jk}
of aligned pairs of letters.

Recall from Eq. (115) that the asymptotic probability that a random walk has an
excursion ≥ y as y → ∞ under the null hypothesis H0 is of the form

Prob(Y ≥ y) = C e^{−λy},   (118)

where λ is the unique positive solution to m_S(λ) = 1, and m_S is the moment generating
function of the random step size S. We now calculate the probability under the null
hypothesis that the letter j at a given site in sequence 1 is aligned with the letter k
at the corresponding site in sequence 2, conditional on the event that the site is part
of an upward excursion ≥ y. The conditioning ensures that this pair of letters has
survived the selection process described above for local alignments. Let the element of
the substitution matrix between letter j and the letter k be s(j, k). Then

Prob(alignment at a given site is (j, k) | Y ≥ y)
 = Prob(alignment at a given site is (j, k) and Y ≥ y) / Prob(Y ≥ y)
 = Prob(alignment at a given site is (j, k)) Prob(Y ≥ y − s(j, k)) / Prob(Y ≥ y)
 = p_j p'_k C e^{−λ(y − s(j,k))} / ( C e^{−λy} )
 = p_j p'_k e^{λ s(j,k)}.   (119)

In going from the second to the third line above we have made use of the fact that the
steps in a random walk are independent, and that removing a step of height s(j, k) from
an upward excursion of height ≥ y leaves an excursion of height ≥ y − s(j, k).
From Eqs. (69) and (70), and neglecting the rounding to an integer, the BLOSUM
substitution matrix is of the form3

s(j, k) = κ log( q_{jk} / (p_j p'_k) ),   (120)

for some constant κ. Note that κ and λ are related via the definition of λ:

1 = m_S(λ) = E(e^{λS}) = Σ_{(j,k)} p_j p'_k e^{λ s(j,k)} = Σ_{(j,k)} p_j p'_k ( q_{jk} / (p_j p'_k) )^{λκ}.   (121)

The right hand side can only be equal to 1 if λκ = 1, given that λ is defined as the
unique solution > 0 and that the q_{jk} are probabilities summing to 1. Then substituting
Eq. (120) into Eq. (119) gives finally

Prob(alignment at a given site is (j, k) | Y ≥ y) = p_j p'_k e^{log(q_{jk}/(p_j p'_k))} = q_{jk},   (122)

as required.
3
Since we distinguish here between the ordering of Sequence 1 and Sequence 2, the factor of 2 in the
denominator of the first part of Eq. (69) does not apply.
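
The relation λκ = 1 and the identity (122) are easily verified numerically for a toy alphabet. In the sketch below the two-letter frequencies and target pair frequencies are invented for illustration; a score matrix is built from them via Eq. (120), λ is found by solving m_S(λ) = 1 with a root finder, and p_j p'_k e^{λ s(j,k)} is seen to reproduce q_{jk}.

import numpy as np
from scipy.optimize import brentq

# Toy two-letter example (made-up numbers); rows: letter j in sequence 1,
# columns: letter k in sequence 2.
p  = np.array([0.6, 0.4])            # letter frequencies in sequence 1
pp = np.array([0.5, 0.5])            # letter frequencies in sequence 2
q  = np.array([[0.40, 0.15],         # target aligned-pair frequencies, sum to 1
               [0.15, 0.30]])

kappa = 0.5                                        # arbitrary overall scale
s = kappa * np.log(q / np.outer(p, pp))            # Eq. (120)

# m_S(theta) = sum over (j,k) of p_j p'_k exp(theta * s(j,k))
m_S = lambda th: np.sum(np.outer(p, pp) * np.exp(th * s))
lam = brentq(lambda th: m_S(th) - 1.0, 1e-6, 50.0)

print(lam, 1.0/kappa)                              # lambda = 1/kappa
print(np.outer(p, pp) * np.exp(lam * s))           # recovers q, Eq. (122)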

6.2 P-value estimates for aligned sequences
BLAST needs to be able to assign a p-value, calculated under the null hypothesis, to the
maximum upward excursion between two aligned sequences. Since the null hypothesis
assumes independence of sites along the sequences, each upward excursion is an indepen-
dent random variable. Suppose there are n upward excursions in an alignment. Then
the relevant statistic for calculating p-values is

Ymax = max_{i=1,...,n} Y_i,   (123)

where the Yi are i.i.d. geometric-like random variables with the distribution of Eq. (118).
This is an example of an extreme value statistic, the properties of which are summarised
in Appendix A.
A convenient way to obtain bounds on the p-value for this discrete random variable
is to write it as the integer part of a continuous random variable. Let Y_i = ⌊X_i⌋, where
the X_i are continuous i.i.d. exponential-like random variables with a common cumulative
distribution function F_X satisfying

F_X(x) = Prob(X ≤ x) ∼ 1 − C e^{−λx}   as x → ∞.   (124)

One easily checks that this has the desired properties, namely that for any integer y,

Prob(Y ≥ y) = Prob(⌊X⌋ ≥ y) = Prob(X ≥ y) = 1 − F_X(y) = C e^{−λy},   (125)

as required.
Defining Xmax = max_i X_i, so that

Xmax − 1 < Ymax ≤ Xmax,   (126)

then entails that4

Prob(Xmax ≤ y) ≤ Prob(Ymax ≤ y) < Prob(Xmax ≤ y + 1).   (127)

The asymptotic distribution function for Xmax at large x is, from Eq. (124),

F_{Xmax}(x) = Prob(Xmax ≤ x) = Prob(X_i ≤ x, i = 1, . . . , n) = (1 − C e^{−λx})^n.   (128)

It is shown in Appendix A.6 that the maximum of n i.i.d. exponential random variables
has a Gumbel distribution for large n, with the properties that E(Xmax) ∼ (1/λ) log n +
const. and Var(Xmax) ∼ π²/(6λ²). Analogous results hold for the maximum of n i.i.d.
exponential-like random variables. Thus setting u = x − (1/λ) log n in the line above gives

F_{Xmax}(x) = ( 1 − C e^{−λu}/n )^n → e^{−C e^{−λu}},   as n → ∞,   (129)
4
If Ymax ≤ Xmax, then the event Xmax ≤ y implies the event Ymax ≤ y, which in turn implies
Prob(Xmax ≤ y) ≤ Prob(Ymax ≤ y), and similarly for the second inequality.

or, reinstating the original variable,

F_{Xmax}(x) ≈ e^{−C n e^{−λx}}.   (130)

Returning to Eq. (127) we have the following approximate bounds on the distribution of
Ymax:

e^{−C n e^{−λy}} ≤ Prob(Ymax ≤ y) < e^{−C n e^{−λ(y+1)}}.   (131)
In practice it is not the number of excursions but the sequence length that is known.
For a sequence of length N, the number of upward excursions is approximately n = N/A,
where A is the expected distance between ladder points. Thus Eq. (131) can be written
more compactly as

e^{−e^{−s} e^{λ}} ≤ Prob(Ymax ≤ y) < e^{−e^{−s}},   (132)

where

s = λy − log( C e^{−λ} N / A ).   (133)

Accordingly, Karlin and Altschul [12] define a normalised score

S′ = λ Ymax − log( C e^{−λ} N / A ),   (134)

which has the advantages that (1) C, λ and A are all preset parameters determined by
the BLOSUM matrix and individual amino acid frequencies (see Section 5), (2) λ Ymax is
independent of the arbitrary overall scale in the definition of the substitution matrix
(see Eq. (120) and following lines) and (3) it subtracts off the logarithmic growth of Ymax
with N. In terms of this score, bounds on the distribution function are

e^{−e^{−s} e^{λ}} ≤ Prob(S′ ≤ s) < e^{−e^{−s}}.   (135)

Finally, the approximate p-value used by BLAST is taken from the upper end of this
inequality5:

P-value = Prob(S′ > s) ≈ 1 − e^{−e^{−s}} ≈ 1 − (1 − e^{−s}) = e^{−s},   (136)

when s is large. BLAST printouts also show for each alignment a bit score, defined
as S′/ log 2.
In addition to giving p-values for the alignment based on the highest excursion,
BLAST also calculates the number E of excursions one would expect to observe under
the null hypothesis H0 which are higher than an observed height ymax . When performing
database searches, this number is ultimately used by BLAST for calculating the reported
Expect value indicating how many matches of at least the quality of that found one
would expect by pure chance from the database search. The expected number of high
5
From Eqs. (70), (120) and the argument following Eq. (121), for BLAST matrices λ = 1/κ =
(1/2) log 2 ≈ 0.34. Significant bit scores are typically O(10) or much greater (see Fig. 17) so adding λ/ log 2 ≈ 0.5 to a
bit score makes little difference.

[Figure: Sequence 1 slid along Sequence 2 at each possible relative offset.]

Figure 16: Possible ungapped alignments of two sequences of length N1 = 6 and N2 = 8
respectively.

excursions is the expected number of excursions multiplied by the probability that an
excursion is of height > ymax, or equivalently, ≥ ymax + 1. Thus, from Eqs. (118) and
(??),

E = (N/A) C e^{−λ(ymax+1)} = N K e^{−λ ymax} = e^{−S′},   (137)

where K = C e^{−λ}/A and S′ = λ ymax − log(N K) is the normalised score. From Eq. (136), the p-value and
the expected number of high excursions are related by

P-value ≈ 1 − e^{−E},   (138)

or

E ≈ −log(1 − P-value).   (139)
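
To make the formulae concrete, the following sketch converts an observed maximum excursion height into the normalised score, approximate p-value and expected number of high excursions of Eqs. (134), (136) and (137). The parameter values C, λ, A, N and y_max below are invented for illustration; in a real BLAST calculation they are fixed by the substitution matrix, the letter frequencies and the data.

import numpy as np

# Illustrative parameter values (in practice C, lam and A are fixed by the
# substitution matrix and letter frequencies, as in Section 5).
C, lam, A = 0.6, 0.35, 3.0
N = 400            # alignment length
y_max = 40         # observed maximum excursion height

K = C * np.exp(-lam) / A                 # so that (N/A) C e^{-lam(y+1)} = N K e^{-lam y}
S_prime = lam * y_max - np.log(N * K)    # normalised score, Eqs. (134)/(137)
bit_score = S_prime / np.log(2)

E = np.exp(-S_prime)                     # expected number of high excursions, Eq. (137)
p_value = 1.0 - np.exp(-E)               # Eq. (138); ~ e^{-S'} when S' is large

print(S_prime, bit_score, E, p_value)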

6.3 Unaligned sequences and database searches


So far we have assumed we are given an ungapped alignment of two sequences. In prac-
tice, in database searches, the alignment with each sequence is not specified, and BLAST
needs to be able to calculate a P-value assuming all possible (ungapped) alignments with
each database sequence, and a corresponding Expect value giving the number of align-
ments with a lower P-value that one would expect to find in the entire database search.
The precise details of how this is done are beyond the scope of this course, and are
summarised in Sections 10.3 to 10.5 of Ewens and Grant [1]. Here we will just deal with
some of the salient points.
Consider all possible alignments of two sequences of lengths N1 and N2 as illustrated
in the toy example in Fig. 16. If we assume (without loss of generality) that N1 ≤ N2,
the total length of all overlaps is

1 + 2 + · · · + (N1 − 1) + (N2 − N1 + 1)N1 + (N1 − 1) + · · · + 2 + 1
 = (N1 − 1)N1/2 + (N2 − N1 + 1)N1 + (N1 − 1)N1/2 = N1 N2.   (140)
As a first approximation, one might calculate P-values and E for unaligned pairs of
sequences using formulae of the previous section with the common sequence length N
replaced by N1 N2. This may be reasonable in the limit of large sequence lengths, but for
realistic protein sequences of O(10²) to O(10³) amino acids edge effects become important,
and reduced effective sequence lengths N1′ and N2′ are used. Details of the edge effect
correction are given in Section 10.3.3 of ref. [1]. By analogy with Eq. (137), the expected
number of high scoring alignments between the two sequences is then approximated as

E = N1′ N2′ K e^{−λ ymax}.   (141)

Given a query sequence of length N1 , say, and a database of total length D from
which one is looking for a close match, BLAST does not actually attempt all possible
alignments with each sequence. Instead a heuristic algorithm is used which first looks
for an exact match over a very small number of amino acids and then attempts to build
an alignment out in both directions. Once a high scoring alignment is found with a
database sequence of length N2 , say, a P-value is calculated by the following process.
If the expected number of high scoring excursions under the null hypothesis between
sequences of length N1 and N2 is E, then assuming the number of high scoring excursions
to be a Poisson random variable, the probability of a match of the query sequence with
this particular database sequence is 1 − e^{−E}. Also, the length of the entire database is
D/N2 times the length of this particular sequence. Thus the expected number of matches
with the observed high score with any sequence in the database is

Expect = (1 − e^{−E}) D / N2.   (142)

For each high scoring match it finds, BLAST reports the Expect value and a P-value
(see Eq. (138))

P-value ≈ 1 − e^{−Expect}.   (143)
Section 10.8 of ref. [1] gives details of how the calculation is carried out and of corrections
to the last two equations for gapped alignments.
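
Continuing in the same vein, the per-pair expectation E of Eq. (141) converts into a database-wide Expect value and P-value via Eqs. (142) and (143). All numbers below are invented for illustration, and the effective lengths are taken as given rather than computed from the edge-effect correction of ref. [1].

import numpy as np

K, lam = 0.13, 0.35              # illustrative Karlin-Altschul parameters
N1_eff, N2_eff = 250.0, 480.0    # effective lengths of query and database sequence
D = 5.0e6                        # total length of the database
y_max = 45                       # best observed excursion height

E = N1_eff * N2_eff * K * np.exp(-lam * y_max)   # Eq. (141)
expect = (1.0 - np.exp(-E)) * D / N2_eff         # Eq. (142)
p_value = 1.0 - np.exp(-expect)                  # Eq. (143)

print(E, expect, p_value)
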
An example BLAST printout is shown in Fig. 17. Here a human haemoglobin
protein has been BLASTed against the database of known sequences in the lamprey
proteome. The lamprey is a particularly unpleasant primitive eel-like fish which lives by
attaching its sucker-like mouth to the skin of fish and rasping away flesh with its
sharp teeth. It is closely related to the first vertebrates, and its most recent common
ancestor with humans lived 560 million years ago. Nevertheless there is a 32% identity
between the two sequences. The bit score and Expect value in the printout indicate that
the probability of obtaining as close a match under the null hypothesis of i.i.d. sequences
is essentially zero.

Figure 17: An example BLAST printout: using a human haemoglobin A protein sequence
as a query against the lamprey proteome database.

7 High throughput sequencing


Since about 2008 a new generation of high throughput sequencing machines has become available
to biologists, drastically cutting the cost of sequencing [15]. Sequencing
means determining the nucleotide order of a given stretch of DNA or RNA, and is
of course of fundamental importance to modern biology and medical science. There are
currently three competing high throughput sequencing technologies available [16]: the
Roche 454 pyrosequencer, the Illumina genome analyzer, and the Applied Biosystems SOLiD
sequencer. The Illumina technology, for instance, samples from a prepared solution or
library of DNA fragments of length a few hundred nucleotides to produce reads of up
to 100 bases from one or both ends of the fragment. As of 2012, the top-of-the-range
Illumina HiSeq 2500 produces 3 × 10^9 single reads or 6 × 10^9 paired-end reads
of 100 base pairs in an 11 day run (see http://www.illumina.com/systems/hiseq_
systems.ilmn).

7.1 RNA sequencing


Applications of high throughput sequencing reach far beyond the obvious one of sequenc-
ing new genomes. In these notes we will look at an application known as RNA-Seq, or
the sequencing of expressed RNA. By detecting and quantifying which messenger RNA
(mRNA) is present in a cell, biologists can detect which genes are switched on and hence
being transcribed to mRNA which is subsequently translated to the proteins encoded for
by those genes. Ideally one would like to be able to determine an expression profile for
a given cell at a given stage during its lifetime, that is, a list of the quantity of mRNA
present in the cell from each of the tens of thousands of genes in the genome. Biologists
attempt to use such data to infer the complex regulatory pathways by which genes are
switched on and off as the life cycle of the cell progresses or in response to environmental
factors. Another use of expression profiling is to detect differential expression between
the same type of cell under two different conditions, for instance, a healthy liver cell and
a cancerous liver cell.

A typical experimental protocol for RNA-Seq is as follows. Total RNA extracted
from a biological source is reduced to mRNA transcripts using magnetic beads with
poly-T oligonucleotides attached. This is necessary as much of the RNA in a cell is not
mRNA transcribed from protein-coding genes, but may for instance be ribosomal RNA
which is not of interest to the experiment. mRNA transcripts are distinguished by the
fact that they have a poly-A tail which hybridises with the poly-T oligonucleotides. The
beads are separated from the solution magnetically and the required mRNA transcripts
then re-separated from the beads. The mRNA transcripts obtained are fractionated to
lengths of about 200 nucleotides either chemically or by sonication, reverse transcribed
to complementary DNA (cDNA), and short adaptor sequences ligated. The quantity
of cDNA produced at this point is insufficient to sequence and must be amplified using
polymerase chain reaction (PCR). This is a process by which the DNA double helix
is repeatedly cycled through steps of heating to separate the two strands in a solution
of single nucleotides, primers and DNA polymerase and cooling so that complementary
single nucleotides hybridise with the DNA to form two copies of each original DNA
molecule. After n steps the molar concentration of the original cDNA solution has been
amplified by a factor of up to 2^n.
The resulting sample, called a cDNA library, is then passed through the sequencer.
The experimental setup in the Illumina sequencers is such that a number of lanes can be
run in parallel, so that a single run can process typically 8 experiments simultaneously.
The prepared library passed through each lane may contain O(10^{12}) cDNA fragments,
of which O(10^8) are randomly sampled. Each sampled fragment is amplified using in
situ PCR, sequenced, and the data saved to a file. The sequenced reads are mapped
onto a reference genome to identify which genes they are associated with in order to
infer which genes are expressed.
As with any carefully designed biological experiment, multiple replicates of the ex-
periment are carried out in order to control uncertainties due to statistical variation.
Typically the cost of each run of the sequencer and the number of independent exper-
iments being run simultaneously severely limit the number of biological replicates that
can be performed.
The raw output from an RNA-Seq experiment is, in principle, a table listing the
number of reads identified as having originated from each gene in the genome for each
replicate of the experiment. In practice a number of complications arise because of
multiple mappings onto the genome caused by closely related genes or highly repetitive
parts of the genome, reads which fail to map because of read errors or errors in the
annotation of the reference genome, reads which cross exon splice junctions, alternate
splicings of the same gene and so on [17]. Software for dealing with these problems
exists (Bowtie, BWA, BFAST for aligning of reads to a reference genome, TopHat for
alternate splicings [18]) and is regularly used by biologists. From here on we will consider
these problems to be solved. Thus our starting point will be considered to be a table of
observed counts for each gene or transcript (see Fig. 18).

Gene                Sample A           Sample B        . . . etc.
                    Rep 1    Rep 2     Rep 1    Rep 2
ENSG00000209432         4        6        35       45
ENSG00000209432         0        0         2        1
ENSG00000209432       110       96       177      203
ENSG00000209432     12685    10897      9246     9873
ENSG00000212678       148      201       112       93
. . . etc.
(typically 10,000 to 20,000 genes)

Figure 18: The assumed starting point of our statistical model is a table of observed
read counts for each gene (or transcript) and each replicate arising from an RNA-Seq
experiment.

7.2 Overdispersion of read counts


Here we describe a statistical model developed by Robinson and Smyth [19] describing
the number of read counts observed for a given gene. This model is used in a number of
software packages for expression profiling and detecting differential expression including
edgeR [23] and DESeq [21]. It is based on two assumptions.
Firstly, one assumes that for a given library preparation, the count of the number
of reads mapping onto a particular gene from a given run of the sequencer is a Poisson
random variable. This has been verified to be an accurate assumption in experiments
by Marioni et al. [22] who carried out a statistical analysis of tables of counts obtained
sequencing the same cDNA library in a number of different lanes and runs. The result is
not surprising as this is a canonical example of summing a large number of rare events:
for any one gene a given read is unlikely to map onto that gene, but there are a lot of
reads to map (c.f. the analogous problem of modelling the number of fragments mapping
onto a given point in a genome in shotgun sequencing).
Secondly, we must acknowledge that the molar concentration of cDNA fragments in
the prepared library for each gene is not an exact measure of the abundance of mRNA
in the original cell. Extra statistical variation has been introduced by the complex but
unavoidable library preparation process. To put it another way, suppose a biological
sample of mRNA is split into two equal parts, each assumed to have the same profile
of expressed mRNA. If a cDNA library is prepared from each part, because of the
complexity of the process the gene profile of each of these two parts will not be identical,
no matter how carefully the library preparation protocol has been followed. Thus we
assume that the molar concentration of cDNA originating from any particular gene is
itself a random variable with a mean proportional to the abundance of mRNA in the
original sample.
Note also that on top of the technical variation arising from the library preparation,
a typical experiment will also have biological variation due to the fact that experimental

Figure 19: Read counts from two replicates of an RNA sample. Each point refers
to a single gene and the two axes are the observed read counts from each of the two
replicates. (a) Synthetic data assuming read counts are Poisson distributed; (b) a single
library preparation split into two samples and sequenced at two different laboratories;
(c) two cDNA library preparations made from the same biological RNA sample and (d)
cDNA libraries prepared from two biological replicates. The data in plots (b) to (d) are
from human lymphoblastoid cell lines from ref. [24].

designs typically involve biological replicates, that is, RNA samples taken from a number
of different mice, humans, or whatever, subjected to the same condition. This biological
variation is in most cases more extreme than the technical variation, and is also incor-
porated into the statistical model described here. Data with a higher degree of variation
than that expected from a Poisson distribution is referred to as being overdispersed.
To illustrate the overdispersion in a real example, consider the plots of read counts
illustrated in Fig. 19. The first plot (a) is synthetic data consisting of pairs of read
counts generated in the computer. For each point two random numbers were generated
from a Poisson distribution with a fixed mean. This was repeated for 60,000 transcripts
for a range of means up to 10^5. This is how one would expect data from two replicates
of an experiment to look if the data were not overdispersed. The remaining plots are of
real data from human lymphoblastoid cell lines from an experiment reported in ref. [24].

In (b) a single cDNA library preparation has been split into two samples and sequenced
at two different laboratories (one at Argonne and one at Yale). The first assumption
above is equivalent to saying that the only stochastic variation introduced is Poisson
noise from the sequencer, and the scatter plot should look the same as (a). There is a
small amount of overdispersion in (b), but these data are very close to being Poisson, as
predicted by the model. In (c), two cDNA libraries have been independently prepared
from the same biological sample of RNA. The moderate degree of overdispersion observed
is due to technical variation introduced during the complex library preparation. In (d)
cDNA libraries were prepared from two biological replicates, in this case two human
lymphoblastoid cell lines from two healthy male adults. As expected, these data are
influenced by both technical and biological variation, and therefore display a higher
degree of overdispersion.
The two assumptions above allow us to construct the following mathematical model.
Consider one particular lane of the sequencer, and let the abundance of a particular
transcript of interest, as measured by its molar concentration in the prepared cDNA
library, be the random variable R. Set

E[R] = q,   Var(R) = v.   (144)

Here q is a proxy for the required abundance of mRNA from this transcript in the
original biological sample. Let the number of reads mapped onto this transcript's gene
be the random variable K. By assumption, K conditioned on the cDNA library follows
a Poisson distribution. Thus we have

K | (R = r) ∼ Pois(νr).   (145)

It is necessary to introduce a proportionality constant ν at this point to allow for the
fact that the total number of reads will differ from lane to lane and from run to run
of the sequencer, and that the reads are not always distributed amongst transcripts in
the same proportion. The parameter ν is specific to the run and lane and we assume
here that it can be inferred from the data. Methods exist for setting ν by assuming the
overall distribution of read counts for the majority of genes to be preserved from one
lane to another [23, 21].
Before proceeding we introduce the following notational convention for conditional
expectation values [25]: given two random variables X and Y, if E[X|(Y = y)] = g(y),
then E[X|Y] means the random variable g(Y). Similarly, if Var(X|(Y = y)) = h(y),
then Var(X|Y) means the random variable h(Y). Using this notation, the Wikipedia
pages Law of total expectation and Law of total variance give easy to follow proofs
of two general results for the marginal mean and variance of X:

E[X] = E[E[X|Y]],
Var(X) = E[Var(X|Y)] + Var(E[X|Y]).   (146)

From Eq. (145) and the properties of the Poisson distribution we have

E[K|R] = νR,   Var(K|R) = νR.   (147)

Substituting Eqs. (144) and (147) into Eq. (146) gives

E[K] = νq,   Var(K) = νq + ν²v,   (148)

which we rewrite as

E[K] = μ,   Var(K) = μ(1 + φμ),   (149)

where

μ = νq,   φ = v/q².   (150)

Eq. (149) describes the defining property of over-dispersed count data. The quantity φ
is called the dispersion parameter, and is a measure of how much the distribution
deviates from being pure Poisson. Figure 20 shows estimates of the mean and variance
of K estimated from two biological replicates of an RNA-seq experiment on fly-embryo
data [21]. Because the number of replicates is small, the estimates have a considerable
degree of uncertainty. Nevertheless, one sees a definite trend for the data to lie above
the σ² = μ behaviour expected for Poisson data.

Figure 20: Estimates of the mean and variance of K obtained from two biological repli-
cates of an RNA-seq experiment on fly-embryo data, taken from ref [21]. The purple
line is the trend expected for Poisson data. The orange solid and dotted lines are fits to
negative binomial models used by the packages DESeq and edgeR respectively.
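
The mean-variance relation (149) is easily reproduced by simulation. In the sketch below (with arbitrary illustrative values of q, v and ν) the concentration R is drawn from a Gamma distribution with the required mean and variance, K | R is drawn as Poisson with mean νR, and the sample moments of K are compared with Eq. (149).

import numpy as np

rng = np.random.default_rng(1)

q, v, nu = 50.0, 400.0, 2.0          # E[R], Var(R) and the lane-specific constant
mu, phi = nu * q, v / q**2           # Eq. (150)

# Gamma distribution with the required mean q and variance v (shape, scale):
shape, scale = q**2 / v, v / q
R = rng.gamma(shape, scale, size=200_000)
K = rng.poisson(nu * R)

print("sample mean:", K.mean(), "   mu =", mu)
print("sample var :", K.var(), "   mu*(1 + phi*mu) =", mu * (1 + phi * mu))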

7.3 Negative binomial model


A common assumption made for overdispersed count data is that it should have a neg-
ative binomial (NB) distribution. The NB is a two-parameter family of distributions
defined as follows: In a series of independent Bernoulli trials with probability of success

p, let K be the number of failures needed to achieve m successes. Then6

Prob(K = k) = \binom{k + m − 1}{m − 1} p^m (1 − p)^k.   (151)

The mean and variance are

μ_K = (1 − p)m / p,   σ²_K = (1 − p)m / p²,   (152)

and the probability generating function is

G_K(t) = E[t^K] = ( p / (1 − (1 − p)t) )^m.   (153)

This parameterisation is not suitable for our purposes, so we introduce new parameters
μ and φ related to the old ones by

m = 1/φ,   p = 1/(1 + φμ),   (154)

where φ is now allowed to take any positive real value. Then one easily checks that

Prob(K = k) = [ Γ(k + 1/φ) / (Γ(k + 1) Γ(1/φ)) ] (φμ)^k / (1 + φμ)^{k + 1/φ},   (155)

and that the mean and variance are

μ_K = μ,   σ²_K = μ(1 + φμ).   (156)

Also, one can check via the probability generating function that in the limit φ → 0, K
becomes a Poisson random variable with parameter μ. Thus, as a model distribution,
the NB has the desired properties of an overdispersed count random variable.
Furthermore, one can show that the NB model arises naturally if the molar concentration
R, which has the properties in Eq. (144), is assumed to have a Gamma distribution.
More specifically, if

R ∼ Gamma(mean = q, var = v),   (157)

so that

νR ∼ Gamma(mean = μ, var = φμ²),   (158)

then

K ∼ NB(mean = μ, var = μ(1 + φμ)).   (159)

Details of the calculation are left as an exercise for the reader.
6
Note that K is related to the NB random variable Y defined in Table 3 of Appendix A by the relation
K = Y − m. That is, it has been shifted along the axis.
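
The claim of Eqs. (157)-(159) can also be checked numerically (this is not a substitute for the analytic exercise): simulate K as Poisson with a Gamma distributed mean of mean μ and variance φμ², and compare the empirical probabilities with the NB probabilities of Eq. (155), here evaluated through scipy's (m, p) parameterisation. The values μ = 8 and φ = 0.4 are arbitrary.

import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(2)
mu, phi = 8.0, 0.4

# nu*R ~ Gamma(mean = mu, var = phi*mu^2)  =>  shape = 1/phi, scale = phi*mu
lam = rng.gamma(1.0/phi, phi*mu, size=500_000)
K = rng.poisson(lam)

# Eq. (155) is the NB distribution with m = 1/phi successes and success
# probability p = 1/(1 + phi*mu); scipy uses the same (n, p) convention.
n, p = 1.0/phi, 1.0/(1.0 + phi*mu)
for k in range(6):
    print(k, (K == k).mean(), nbinom.pmf(k, n, p))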

7.4 Detecting differential expression
Suppose a biologist wishes to detect which genes are differentially expressed in a given
cell type between two different treatments or conditions which we label A and B. They
design an experiment in which n replicate samples of RNA are taken from cells under
condition A, a further n replicate samples are taken under condition B, and the RNA-seq
protocol is performed on each of the 2n samples to produce tables of read counts K_{ij}^A
and K_{ij}^B resembling those in Fig. 18. Here the index i labels the gene or transcript,
running from 1 up to some tens of thousands, and j = 1, . . . , n labels the replicates.
This is precisely the situation that the software packages DESeq [21] and edgeR [23]
address using hypothesis testing based on the NB model. Essentially the assumption
is that for each gene and each replicate the read count is a NB random variable, and
that under the null hypothesis the underlying mRNA abundances q_i^A and q_i^B from gene
i under conditions A and B (see Eq. (144)) are such that q_i^A = q_i^B.
The remaining parameters of the NB distribution are estimated from the data. The
proportionality constants ν_j^A and ν_j^B defined in Eq. (145) are determined for each replicate
by assuming the overall distribution of read counts is preserved between replicates.
The method of estimating the variances v_i^A and v_i^B due to library preparation and biological
variance, defined in Eq. (144), differs between the two packages. A problem arises
that the number of replicates is generally not sufficient to obtain an accurate maximum
likelihood estimate of the overdispersion on a gene-by-gene basis. This is dealt with by
borrowing information from other genes, either by assuming a functional dependence
between the mean and overdispersion of the NB distribution in the case of DESeq, or
by using Bayesian inference in the case of edgeR.
Both software packages use as test statistics the total number of counts under each
condition:

K_i^A = Σ_{j=1}^{n} K_{ij}^A,   K_i^B = Σ_{j=1}^{n} K_{ij}^B.   (160)

A simplifying approximation is then made that K_i^A and K_i^B are themselves NB random
variables with effective parameters derived from the estimated parameters for the individual
counts K_{ij}^A and K_{ij}^B. This is an approximation because the sum of NB random
variables is in general not NB, except in the Poisson limit of zero overdispersion. Now
define π_i(a, b) to be the probability under the null hypothesis that for the ith gene a
total of a counts are observed in condition A and b counts are observed in condition B.
Then assuming counts from the different conditions to be independent, we have

π_i(a, b) = Prob(K_i^A = a) Prob(K_i^B = b),   (161)

where the probabilities on the right hand side are calculated from the NB distribution
with estimated parameters as described above. For any observed pair of total counts
k_i^A and k_i^B for the ith gene, a p-value p_i is calculated as the sum of all probabilities
less than or equal to π_i(k_i^A, k_i^B), given that the overall sum of counts for this gene is
k_i^S = k_i^A + k_i^B:

p_i = [ Σ_{a+b=k_i^S, π_i(a,b) ≤ π_i(k_i^A, k_i^B)} π_i(a, b) ] / [ Σ_{a+b=k_i^S} π_i(a, b) ].   (162)

Once a p-value is obtained for each gene, the hypothesis testing proceeds in the usual
way, that is, genes with a p-value less than a specified type I error rate are deemed to be
differentially expressed. Further refinements are possible to correct for multiple hypoth-
esis testing using the so-called Benjamini-Hochberg procedure. Details of the parameter
estimation, effective parameters for summed counts and correction for multiple hypoth-
esis testing alluded to above are complicated and beyond the scope of these lectures.
Details can be found in the original papers.
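
The test of Eq. (162) can nevertheless be sketched in a few lines. The following is a simplified illustration of the idea only, not the DESeq or edgeR implementation: the NB means under the null hypothesis and the dispersion are assumed to have been estimated already, and the example counts are invented.

import numpy as np
from scipy.stats import nbinom

def exact_nb_pvalue(kA, kB, mu_A, mu_B, phi):
    """P-value in the spirit of Eq. (162): sum the probabilities of all splits
    (a, b) of the observed total kS = kA + kB that are no more likely than the
    observed split, normalised by the total probability over all splits."""
    kS = kA + kB
    a = np.arange(kS + 1)
    b = kS - a

    def pmf(k, mu):
        # NB probabilities in scipy's (n, p) convention, from (mu, phi)
        n, p = 1.0/phi, 1.0/(1.0 + phi*mu)
        return nbinom.pmf(k, n, p)

    pi = pmf(a, mu_A) * pmf(b, mu_B)          # Eq. (161) for every split of kS
    pi_obs = pmf(kA, mu_A) * pmf(kB, mu_B)
    return pi[pi <= pi_obs].sum() / pi.sum()

# Illustrative call: summed counts 310 vs 480 over the replicates, with an
# assumed common mean under the null hypothesis and an assumed dispersion.
print(exact_nb_pvalue(310, 480, mu_A=400.0, mu_B=400.0, phi=0.05))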

8 Population Genetics
If we were to zoom in on a multiple alignment of one of our chromosomes across the
entire human population we might see something like Fig. 21. Assuming it is not an X or
Y, every individual has two copies of the chromosome. At first sight the sequence appears
to be identical across the population. However we would soon notice certain isolated
points called single nucleotide polymorphisms, or SNPs, which have been highlighted by
vertical bars in the figure. At most SNPs two possible letters are observed, for instance
A and T in the left hand SNP, and C and T in the right hand SNP in Fig. 21. The
two possible letters are an example of what are referred to as alleles7 . Very occasionally
three-allele SNPs occur, while four-allele SNPs are extremely rare indeed. The human
genome contains about 1.4 million SNPs, at an average separation of about 2 kilobases,
or, to put it another way, roughly 1 in 2,000 genomic sites on average is a SNP. SNPs
occur at higher density in non-protein-coding regions than in coding regions. SNPs
are the most common type of sequence variation, estimated to account for 90% of all
sequence variation.
There are other differences between the genomes of individuals, such as the presence
of copy number variations (CNVs). These occur when an extended region which may
be anywhere from a kilobase to several megabases is repeated a number of times, the
number of repeats varying from one individual to another. This variation accounts for
roughly 13% of the human genome. Other forms of genetic variation include insertions
and deletions of a single letter or of extended regions, and reversals of extended regions.
Without genetic variations, people of a given sex would all look very similar to one
another; in other words, we would all be clones.
7
The word allele is generically used in population genetics to mean any one of a number of alternate
forms of a genetic locus. For instance, if a gene contains n SNPs with two alternate bases observed at
each SNP, the 2n possible patterns may be referred to as 2n alleles.

A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A

A C T T C G G C T A G A C T G A A C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A A C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A A C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A

A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A

A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T C G G C T A A G G T G T C T A

A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A

A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A
A C T T C G G C T A G A C T G A T C G C T G T G G A A T A C C C A G T T T G G C T A A G G T G T C T A

Figure 21: A small portion of a multiple genomic alignment across the entire population
showing a sample of ten individuals. Only one strand of the double helix is shown. The
sequences are grouped into pairs to indicate that each individual has two copies of each
chromosome. The single nucleotide polymorphisms, or SNPs, are highlighted.

As a general rule, each instance of genetic variation is caused by a copying error or


mutation occurring in the germline8 in a single individual, which is passed on to future
generations. Population genetics is the study of the dynamics of how errors occurring in
the germlines of single individuals can in time propagate through an entire population
to create the phenomenon we know as evolution.
In these lectures we will concentrate only on SNPs and ignore other forms of genetic
variation. Mathematical modelling of the propagation of SNPs through a population
takes into account two processes: genetic drift and selection. Genetic drift is conceptually
the more difficult of the two. It refers to the diffusion process by which an allele already
present at a given SNP will become more or less prevalent in the population mainly as
a result of variability in the number of offspring of each individual. In the next section
we will look at the simplest model of genetic drift, the Wright-Fisher model, which is
traditionally used as a starting point from which to develop more detailed and realistic
models of population genetics.
Selection refers to the force driving an allele frequency in one direction or the other
8
Germline refers to the lineage of cells which, as cells split, are passed on to the next generation.
In the case of sexual reproduction, the germline comprises those sperm or eggs which contribute to the
next generation and the lineage of cells from which they are descended, all the way back to the initial cell
from which the individual developed, called the zygote. The vast majority of the estimated 10^{13} to
10^{14} cells in a human are of course not part of the germline.

as a result of the fitness conferred by the mutation from which the SNP arose. Many mu-
tations are harmful and may for instance make it unlikely an individual will survive long
enough to produce offspring. For such mutations the frequency of occurrences within
the population of the mutant allele will be forced downwards over generations. Occa-
sionally a mutation will occur which confers an advantage which makes the individual
more fit to produce offspring, thus driving the allele frequency upwards over generations.
Mutations which confer neither an advantage nor a disadvantage, such as mutations in a
redundant part of the genome, are called neutral. Selection is modelled mathematically
by augmenting existing models of genetic drift. Mathematical models which take into
account only drift and mutations and not selection are referred to as models of neutral
evolution.
Naively one may think that for evolution to proceed from point mutations in single
individuals in a large population, propagation of mutations must be helped along by
positive selection: one would expect the probability that a random mutation in a single
individual drifts through the entire population to be very small indeed. Surprisingly,
we will see that neutral evolution can occur without the aid of positive selection, as the
small probability of a mutation in a single individual taking hold of the entire population
over time by random mating turns out to be exactly balanced by the number of mutations
occurring in a large population.

8.1 The Wright-Fisher Model


The simplest and probably the earliest serious mathematical model of population genetics,
known as the Wright-Fisher model, is attributed to Ronald A. Fisher [26] and
Sewall Wright [27], neither of whom published the statement of the model per se, though
both published a number of consequences of the model. To begin with we will look at the
simplest form, which includes genetic drift, but not mutations or selection. We start
with a number of assumptions:

Non-overlapping generations: Assume that evolution occurs over discrete timesteps
τ = 0, 1, 2, . . ., so that the entire population is replaced by a new population each
timestep. An example of this is an annual plant which completes its life cycle from
germination to production of seed in one year and then dies.

Fixed population: Assume the population size N remains fixed, independent of τ.
Diploid population: Assume each individual has two copies of the genome.
Monoecious population: Assume each individual carries both male and female re-
productive organs, so that mating can occur between two individuals or within
one individual to produce offspring. Many trees, such as oaks, cedars and figs, are
monoecious.
Random mating: Assume each individual is the offspring of two randomly and inde-
pendently chosen parents from the previous generation. Since the population is
monoecious, both parents may be the same individual.

In spite of having a highly restrictive list of assumptions9 , its mathematical simplicity
makes the Wright-Fisher model a very popular starting point for much of the published
work in population genetics10 .
Given the assumptions, it is more convenient to think in terms of a population of M =
2N copies of the genome, rather than a population of N diploid individuals. Consider a
given observed SNP within the genome at which there are two distinct alleles, A_1, A_2 ∈
{A, C, G, T}. Define the random variable Y(τ), whose range is Y(τ) ∈ {0, 1, . . . , M}, to
be the number of copies of allele A_1 within the population. Under the assumption of fixed
total population size and an assumption that every individual chromosome at timestep
τ is equally likely to be the progenitor of this SNP in a given individual chromosome
at timestep τ + 1, the number of A_1 alleles in the population at time τ + 1, given the
number of A_1 alleles in the previous generation τ, follows a binomial distribution:

Prob(Y(τ + 1) = j | Y(τ) = i) = p_{ij},   (163)

where

p_{ij} = \binom{M}{j} (i/M)^j (1 − i/M)^{M−j},   i, j = 0, 1, . . . , M,   (164)

since each new offspring inherits either allele A_1 with probability i/M, or allele A_2 with
probability 1 − i/M. Equations (163) and (164) define the Wright-Fisher model without
mutations, that is, the only effect driving the dynamics is genetic drift.
This system is an example of a finite-state Markov chain, with states labelled i =
0, 1, . . . , M and transition matrix p_{ij}. Eq. (164) implies that

p_{0j} = { 1 if j = 0;  0 if j = 1, . . . , M },     p_{Mj} = { 0 if j = 0, . . . , M − 1;  1 if j = M },   (165)

so Y(τ) = 0 and Y(τ) = M are absorbing states of the Markov chain. These two states
correspond to the cases where the entire population has allele A_2 or A_1 at the site in
question, and no further change is allowed by the model. If Y(τ) = M eventuates, we
say that A_1 has become fixed in the population, and when Y(τ) = 0 we say A_2 has
become fixed. If Y(τ) ∈ {1, 2, . . . , M − 1}, neither allele is yet fixed, and the site in
question is said to be a segregating site.

8.1.1 Probability of fixation


So far we have not built into the model any mechanism for a site to become a segregating
site in the first place. We will come to that later when we deal with mutations in
9
The assumptions do not rule out all species of life on earth. For instance, sunflowers are monoecious
annuals, and it would not be totally impossible to contrive a situation in which a sunflower crop's size
is held more or less constant.
10
For instance, often one will encounter the concept of effective population size, for which there
seems to be no universally accepted unambiguous definition. Loosely speaking, it means the value to
which the parameter N in an ideal model with some or all of the above assumptions should be set in
order to give correct results (depending on what characteristics one wants to calculate) when applied to
a population for which the assumptions obviously do not hold.

Section 8.2. For the time being, consider the initial condition Y(0) = i for some i =
0, . . . , M and ask the question: What is the probability that allele A_1 becomes fixed?
We have

Prob(A_1 becomes fixed | Y(0) = i)
 = Prob(Y(∞) = M | Y(0) = i)
 = Σ_{j=1}^{M} Prob(Y(∞) = M | Y(1) = j) Prob(Y(1) = j | Y(0) = i).   (166)

We also have boundary conditions

Prob(Y(∞) = M | Y(0) = 0) = 0,   Prob(Y(∞) = M | Y(0) = M) = 1,   (167)

since 0 and M are absorbing states. Defining

π_i = Prob(Y(∞) = M | Y(τ) = i),   (168)

which we note is independent of (any finite) τ since a Markov chain has no memory of
prior events, Eqs. (166) and (167) become

π_i = Σ_{j=1}^{M} p_{ij} π_j,   π_0 = 0,   π_M = 1.   (169)

One easily checks that the solution is

Prob(A_1 becomes fixed | Y(0) = i) = π_i = i/M.   (170)

Similarly, replacing the boundary conditions with π_0 = 1, π_M = 0 gives

Prob(A_2 becomes fixed | Y(0) = i) = 1 − π_i = 1 − i/M,   (171)

implying that either one allele or the other must become fixed11.
To interpret this result, consider a mutation which occurs in one member of the
population at τ = 0. Intuitively one might expect that, for large populations, it is
highly unlikely a mutation could become fixed by pure chance without the aid of natural
selection. The above calculation shows that the Wright-Fisher model predicts a 1/M
probability that a mutation in a single individual will become fixed in the population due
to random genetic drift only. For instance, Figure 22 shows a numerical simulation of 300
trajectories of the Wright-Fisher model with a population M = 100, and, as predicted,
in roughly 1 in 100 cases the mutation has become fixed (one case has not made up
its mind yet, but looks like dying out after coming very close to fixing). On the other
hand, the rate at which such mutations occur anywhere in the genome is proportional to
the population size M, so we arrive at the surprising result that, even without natural
selection, the model predicts that random mutations can fix in a population at a rate
which is approximately independent of population size.
11
Or, more precisely, one allele or the other becomes fixed almost surely, which is statistical jargon for
there exist trajectories in state space which never become fixed, but the probability of such a trajectory
occurring is zero.
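
A simulation along the lines of Figure 22 takes only a few lines. The minimal sketch below (illustrative only, not the code used to produce the figure) runs many independent trajectories from Y(0) = 1 using the binomial update of Eq. (164) and compares the observed fixation fraction with the prediction 1/M of Eq. (170).

import numpy as np

rng = np.random.default_rng(3)

M = 100            # number of genome copies (2N)
n_traj = 3000      # number of independent trajectories
max_gen = 10_000   # safety cap on the number of generations

fixed = 0
for _ in range(n_traj):
    y = 1                                  # one mutant copy at tau = 0
    for _ in range(max_gen):
        if y == 0 or y == M:
            break
        y = rng.binomial(M, y / M)         # Eq. (164): next generation
    fixed += (y == M)

print("observed fixation fraction:", fixed / n_traj, "  expected 1/M =", 1 / M)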

[Figure: Y(τ) plotted against generation τ (0 to 300), with Y ranging from 0 to 100.]

Figure 22: Simulation of the Wright-Fisher model: 300 independent trajectories for a
population of M = 100 copies of the genome, with a mutant allele introduced into one
member of the population at τ = 0. Trajectories for which the mutant allele has become
fixed are shown in red.

8.1.2 Expected time to fixation


A second question suggested by Figure 22 is: What is the expected time for an allele to
fix in the Wright-Fisher model? Consider first the unconditional problem of fixation in
either of the two absorbing states. Define τ̄(i) to be the expected number of time steps
before fixation at either 0 or M given the initial condition Y(0) = i. For either of the
cases i = 0 or i = M the answer is trivial, while for any other value of i we can write a
difference equation based on a sum of conditional expectation values labelled by the first
step,

τ̄(i) = Σ_{j=0}^{M} p_{ij} τ̄(j) + 1,   i = 1, . . . , M − 1,   (172)

τ̄(0) = τ̄(M) = 0.   (173)
This equation does not admit a simple analytic solution. However, assmuing M is
large, we can can obtain a good approximation to the solution by taking a continuum
limit in time
1
t= , t = , (174)
M M
while at the same time defining a rescaled continuous random variable
1
X(t) = Y ( ), X(t) = [0, 1], (175)
M

equal to the fraction of the population with allele A1. In Eq. (172) we make the substitutions

    x = i/M,    u = j/M,    t(x) = t̄(i)/M,    (176)

and set

    ΔX = X(Δt) − X(0) = (Y(1) − Y(0))/M,    (177)

to obtain

    t(x) = ∫_0^1 Prob(X(Δt) = u | X(0) = x) t(u) du + 1/M
         = ∫_0^1 Prob(X(Δt) = u | X(0) = x) [ t(x) + (u − x) t′(x) + ½ (u − x)² t″(x) + . . . ] du + 1/M
         = t(x) + E(ΔX | X(0) = x) t′(x) + ½ E(ΔX² | X(0) = x) t″(x) + . . . + 1/M.    (178)

Equation (178) holds for any model scaled according to Eqs. (174) and (175) irrespective of the form of the transition matrix p_{ij}. For the Wright-Fisher model in particular, Y(1) | (Y(0) = i) ∼ Bin(M, i/M), so

    E(ΔX | X(0) = x) = (1/M) E(Y(1) − Y(0) | Y(0) = i)
                     = (1/M) {E(Y(1) | Y(0) = i) − i}
                     = (1/M) {M (i/M) − i}
                     = 0,    (179)

and

    E(ΔX² | X(0) = x) = Var(ΔX | X(0) = x) + E(ΔX | X(0) = x)²
                      = (1/M²) Var(Y(1) − Y(0) | Y(0) = i) + 0
                      = (1/M²) M (i/M)(1 − i/M)
                      = x(1 − x)/M.    (180)

Similarly one can show

    E(ΔX^k | X(0) = x) = o(1/M),    k = 3, 4, . . . .    (181)
Substituting back into Eq. (178) and tidying up gives the differential equation

    ½ x(1 − x) t″(x) + 1 = 0,    (182)

and from Eq. (173) we have the boundary conditions

    t(0) = t(1) = 0.    (183)

One easily checks that the solution is

    t(x) = −2 {x log x + (1 − x) log(1 − x)},    (184)

or

    t̄(i) ≈ −2M {(i/M) log(i/M) + (1 − i/M) log(1 − i/M)}.    (185)

For instance, a SNP with both alleles equally populated could be expected to fix in t̄(M/2) ≈ 2M log 2 ≈ 1.39M generations.
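As a check, the linear system (172)-(173) can be solved exactly for moderate M and compared with the continuum approximation (185). The following minimal sketch (not from the notes; M = 100 is an arbitrary illustration) does this with numpy and scipy.

```python
# Sketch: exact mean absorption time of the Wright-Fisher chain versus Eq. (185).
import numpy as np
from scipy.stats import binom

M = 100
i = np.arange(M + 1)
P = binom.pmf(i[None, :], M, i[:, None] / M)            # p_ij of the Wright-Fisher chain

# Solve  tbar(i) = sum_j p_ij tbar(j) + 1  for i = 1..M-1, with tbar(0) = tbar(M) = 0.
tbar = np.zeros(M + 1)
tbar[1:M] = np.linalg.solve(np.eye(M - 1) - P[1:M, 1:M], np.ones(M - 1))

x = i[1:M] / M
approx = -2 * M * (x * np.log(x) + (1 - x) * np.log(1 - x))   # Eq. (185)
print(tbar[M // 2], approx[M // 2 - 1])   # both close to 2 M log 2 ~ 139 generations
```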
A more pertinent question relevant to evolution might be to ask the expected time to fixation for trajectories such as those highlighted in red in Figure 22, where a newly arrived allele A1 has managed to fix. To answer this question, consider a new random variable

    Y*(τ) = Y(τ) | (Y(∞) = M),    Y*(τ) ∈ {1, . . . , M},    (186)

equal to the number of individuals in the population with allele A1, conditioned on the restricted set of trajectories for which A1 fixes. The corresponding Markov transition matrix is

    p*_{ij} = Prob(Y(τ+1) = j | Y(τ) = i, Y(∞) = M)
            = Prob(Y(τ+1) = j, Y(∞) = M | Y(τ) = i) / Prob(Y(∞) = M | Y(τ) = i)
            = Prob(Y(τ+1) = j | Y(τ) = i) Prob(Y(∞) = M | Y(τ+1) = j) / Prob(Y(∞) = M | Y(τ) = i)
            = p_{ij} π_j / π_i,    i, j = 1, . . . , M,    (187)
where, in the third line, we have used the Markovian property that the trajectory after step τ + 1 is independent of the trajectory up to step τ + 1.
For the case of the Wright-Fisher model, it is straightforward to check that Eqs. (164), (170) and (187) imply

    p*_{ij} = C(M − 1, j − 1) (i/M)^{j−1} (1 − i/M)^{M−j},    i, j = 1, . . . , M,    (188)

and that consequently the random variable Y*(τ+1) − 1 conditioned on Y*(τ) = i is binomial:

    Y*(τ+1) | (Y*(τ) = i) ∼ Bin(M − 1, i/M) + 1.    (189)
Following the same line of reasoning as for the unrestricted case, we find that Eq. (178) also applies to an analogous set of starred variables defined by

    X*(t) = Y*(τ)/M,    ΔX* = (Y*(1) − Y*(0))/M,    t*(x) = t̄*(i)/M.    (190)

However, calculations analogous to Eqs. (179) and (180) now give

    E(ΔX* | X*(0) = x) = (1 − x)/M,    (191)

    E((ΔX*)² | X*(0) = x) = x(1 − x)/M + O(1/M²),    (192)

leading to the differential equation

    (1 − x) t*′(x) + ½ x(1 − x) t*″(x) + 1 = 0,    (193)

for the expected time to fixation conditional on A1 fixing, t*(x). The general solution is easily found to be

    t*(x) = −(2/x)(1 − x) log(1 − x) + C1/x + C2.    (194)
To determine the two arbitrary constants, first note that

    t(x) = Prob(X(∞) = 1) t*(x) + Prob(X(∞) = 0) t**(x)
         = x t*(x) + (1 − x) t*(1 − x),    (195)

where t**(x) = t*(1 − x) is the expected time to fixation conditional on A2 fixing (the second line follows from the symmetry of the model under interchange of the two alleles). Together with Eq. (184) this implies 2C1 + C2 = 0. Secondly, note that the boundary condition

    t*(1) = 0    (196)

implies C1 + C2 = 0. The two conditions can only hold simultaneously if C1 = C2 = 0, giving finally

    t*(x) = −(2/x)(1 − x) log(1 − x),    (197)

or

    t̄*(i) = −(2M²/i)(1 − i/M) log(1 − i/M).    (198)

If a mutation is introduced into a single member of a large population at τ = 0 and the mutation fixes, Eq. (198) says that the expected time to fix is approximately t̄*(1) ≈ 2M(1 + O(1/M)). The red trajectories in Figure 22 are broadly in agreement with this result.
8.2 Including mutations in the Wright-Fisher model
So far mutations have only been included in the model as an ad hoc specification of
initial conditions. In this section we introduce mutations dynamically in a way which
will eventually enable instantaneous rate matrices to be incorporated.
Consider again a SNP in which only two alleles are present, or equivalently, assume
a simplified model genome in which the genomic alphabet only consists of two letters,
A1 and A2 . Returning to the assumptions of the Wright-Fisher model, we add one more
assumption:

Random mutation: Following the germ line over one generation, a genomic site occupied by an A1 allele at the beginning of the organism's life-cycle has a probability u of mutating to A2 by the end of the life-cycle, and a genomic site initially occupied by an A2 allele has a probability v of mutating to A1 by the end of the life-cycle.

By "end of the life-cycle" we mean the point at which the genomic information is passed on to the next generation. In principle any values u, v ∈ [0, 1] are allowed, though in practice the mutation rates per generation are typically ≪ 1.
Now interpret the random variable Y(τ) as the number of copies of allele A1 within the population at the end of the life-cycle. Conditioning on the event Y(τ) = i for some i = 0, . . . , M, the following four events are possible in any one member of the population in the period τ to τ + 1:

E1 : The site inherits allele A1 and does not mutate during the life-cycle;

E2 : The site inherits allele A1 and mutates to A2 during the life-cycle;

E3 : The site inherits allele A2 and does not mutate during the life-cycle;

E4 : The site inherits allele A2 and mutates to A1 during the life-cycle.

Then the probability that the site will be occupied by allele A1 at the end of the life-cycle is

    φ(i) = Prob(E1) + Prob(E4) = (i/M)(1 − u) + (1 − i/M) v,    (199)

and the probability it will be occupied by allele A2 is

    1 − φ(i) = Prob(E2) + Prob(E3) = (i/M) u + (1 − i/M)(1 − v).    (200)

Again we see that Y(τ+1) | (Y(τ) = i) is a binomial random variable, but the Markov transition matrix defined by Eqs. (163) is modified from Eq. (164) to

    p_{ij} = C(M, j) φ(i)^j (1 − φ(i))^{M−j},    i, j = 0, 1, . . . , M,    (201)

where φ(i) is defined above.
[Figure 23 about here: plot of Y(τ) (vertical axis, 0 to 100) against generation τ (horizontal axis, 0 to 10000) for a single trajectory.]

Figure 23: Simulation of a single trajectory of the Wright-Fisher model with mutations. The parameters are: population of M = 100 copies of the genome, mutation rate A1 to A2 of u = 0.0005 per generation and A2 to A1 of v = 0.0003 per generation. Y(τ) is the number of A1 alleles in the population.
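A minimal sketch (not the code used for Figure 23, and with an arbitrary seed) of the same simulation, using the transition probability of Eq. (199):

```python
# Sketch: one trajectory of the Wright-Fisher model with two-way mutation.
import numpy as np

rng = np.random.default_rng(1)
M, u, v, n_gen = 100, 0.0005, 0.0003, 10_000

Y, traj = 0, [0]
for tau in range(n_gen):
    phi = (Y / M) * (1 - u) + (1 - Y / M) * v   # Eq. (199): prob. a site ends up as A1
    Y = rng.binomial(M, phi)                    # Eq. (201): binomial resampling
    traj.append(Y)

# The trajectory spends most of its time at the almost-fixed states 0 and M.
print(min(traj), max(traj))
```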

One can check that if u and v are both non-zero, the Markov chain has no absorbing states; if u = 0 and v > 0, Y(τ) = M is the only absorbing state, implying that A1 must fix (almost surely); and if v = 0 and u > 0, Y(τ) = 0 is the only absorbing state, implying that A2 must fix (almost surely).
Figure 23 shows a simulation of a trajectory of this model with parameter values M = 100, u = 0.0005 and v = 0.0003 chosen to demonstrate the behaviour. In most real-world scenarios populations are much larger and mutation rates much lower. Nevertheless, we note the important features that most of the time a genomic site is not segregating (i.e. Y(τ) = 0 or M), and the transition between the two almost-fixed states Y(τ) = 0 and M occurs over a time roughly of the same order of magnitude as the fixation time t̄*(1) ≈ 2M obtained for the simpler model of Section 8.1.2.
Another way to interpret the model is to consider the stationary distribution of the Markov transition matrix Eq. (201), namely Prob(Y(∞) = i), which is plotted in Figure 24 for the same set of parameters as Figure 23. If we consider the genome to be a set of independent genomic sites which have been evolving according to the Wright-Fisher model for a considerable time, this distribution can be thought of as the distribution of allele frequencies throughout the genome. Most genomic sites are not SNPs, but contribute to the spike in the left hand plot at i = 100 if the site is occupied by nucleotide A1 or the spike at i = 0 if the site is occupied by nucleotide A2.^12 The remaining vertical bars in the plot correspond to the SNPs, the height of the bar being the proportion of sites in a large genome at which the allele A1 occurs in a fraction i/M
12 Recall that this is a toy model corresponding to a world in which the genomic alphabet has only two possible nucleotides.
[Figure 24 about here: two panels plotting Prob(Y(∞) = i) against i = 0, . . . , 100, on a linear scale (left) and a log10 scale (right).]

Figure 24: Stationary distribution of the Wright-Fisher model with mutations. Parameters are the same as for Fig. 23. The right-hand panel is the same plot on a logarithmic scale.

of the population. This distribution is called the site frequency spectrum (SFS). The SFS is important because it is directly observable from sequencing data, and can, in principle, enable estimation of mutation rates under the assumption of a given population genetics model. In the next section we show how to find the SFS for Wright-Fisher type models in the continuum limit M → ∞.
Before moving on, we make the observation in passing that the model we have considered gives, of course, only a very crude approximation to the observed SFS of any real genome, even if the assumptions listed in Section 8.1 hold and we restrict ourselves to parts of the genome undergoing neutral evolution, that is, evolution without selection. One important problem is that sites are not independent, to the extent that mutation rates are known to depend on neighbouring bases. Models which include this sort of context dependence are very difficult to make any progress with analytically, because the neighbouring bases themselves mutate on the same timescale. Consequently, analytic results for population genetics models which include context dependence are almost completely absent from the scientific literature.

8.3 Continuum limit and the forward Kolmogorov equation


Although no analytic solution exists for the SFS shown in Figure 24, corresponding to the stationary distribution of the Markov transition matrix Eq. (201), a very close approximation can be found in the form of the exact stationary distribution of a continuum model constructed by taking the limit M → ∞. This involves constructing a partial differential equation known variously as the forward Kolmogorov equation or Fokker-Planck equation, which describes the time evolution of a probability density over allele frequencies.^13

8.3.1 Derivation of the forward Kolmogorov equation


We will derive the forward Kolmogorov equation for a generic population genetics model based on a discrete time finite state Markov chain. As previously, consider a genomic site at which two alleles A1 and A2 are possible in a population of fixed size M, and let the random variable Y(τ) be the number of A1 alleles present at timestep τ = 0, 1, 2, . . . labelling non-overlapping generations. The range of Y(τ) is {0, . . . , M}. Assume also that we are given a time-independent Markov transition matrix

    p_{ij} = Prob(Y(τ+1) = j | Y(τ) = i),    i, j = 0, . . . , M.    (202)

With the intention of taking the limit M → ∞, we make the substitutions which proved useful in Section 8.1.2 for calculating fixation times:

    t = τ/M,    Δt = 1/M,    X(t) = Y(τ)/M,
    ΔX(t) = X(t + Δt) − X(t) = [Y(τ+1) − Y(τ)]/M.    (203)

After taking the limit M → ∞, t becomes a continuous time variable over the interval [0, ∞) and the fraction X(t) of the population with allele A1 becomes a continuous random variable with range [0, 1]. Now define the probability density f(x; p, t) corresponding to X(t), conditional on the initial condition X(0) = p, that is,

    f(x; p, t) dx = Prob(x ≤ X(t) < x + dx | X(0) = p),    0 ≤ x, p ≤ 1,  t > 0.    (204)

Our aim is firstly to find a partial differential equation governing the evolution of f(x; p, t), and secondly to find the stationary distribution f(x; p, ∞), if it exists.
The derivation of the forward Kolmogorov equation begins with decomposing Prob(x ≤ X(t + Δt) < x + dx | X(0) = p) into a sum over intermediate events u ≤ X(t) < u + du, which we write formally as

    Prob(x ≤ X(t + Δt) < x + dx | X(0) = p) =
        Σ_{u ∈ [0,1]} Prob(x ≤ X(t + Δt) < x + dx | u ≤ X(t) < u + du) Prob(u ≤ X(t) < u + du | X(0) = p).    (205)
13 The Fokker-Planck equation was published by Adriaan Fokker in 1914 and Max Planck in 1917 in the context of studying the velocity distribution of particles undergoing Brownian motion. The forward Kolmogorov equation is equivalent and was published by Andrei Kolmogorov in 1931 in the context of a more general study of the continuous-time diffusion limit of Markov processes.
[Figure 25 about here: schematic of a path from X(0) = p passing through the interval [u, u + du) at time t and arriving in [x, x + dx) at time t + Δt.]

Figure 25: Derivation of the forward Kolmogorov equation: decomposing f(x; p, t + Δt) dx into a sum over paths through intermediate events u ≤ X(t) < u + du at time t.

A path through one such intermediate state is shown in Fig. 25. At this point one should think of du and dx as being infinitesimal, and Δt as being finite but small, in which case we can use Eq. (204) to rewrite the formal sum over u as

    f(x; p, t + Δt) = ∫_0^1 du f(x; u, Δt) f(u; p, t).    (206)

In order to make sense of the last equation we introduce one more assumption, namely that the moments of the random variable ΔX(t) defined by Eq. (203) are of the form

    E[ΔX(t) | X(t) = u] = a(u) Δt + o(Δt),    (207)
    E[(ΔX(t))² | X(t) = u] = b(u) Δt + o(Δt),    (208)
    E[(ΔX(t))^k | X(t) = u] = o(Δt),    k = 3, 4, . . . ,    (209)

as Δt → 0, for finite, well behaved functions a(u) and b(u) defined for 0 ≤ u ≤ 1. We have already seen this condition satisfied for the Wright-Fisher model by Eqs. (179) to (181) and the Wright-Fisher model conditioned on fixing allele A1 by Eqs. (191) and (192), and below we will also see it satisfied for the Wright-Fisher model with mutations. It essentially means that the density function f(x; u, Δt) is a narrow distribution centred about a mean approximately equal to u + a(u)Δt with variance approximately equal to b(u)Δt when Δt is small. In other words, the point x in Fig. 25 is unlikely to stray further than a distance of order √Δt from the point u. We claim we can therefore represent f(x; u, Δt) in terms of the Dirac delta function explained in Appendix D as

    f(x; u, Δt) = δ(x − u) − a(u) δ′(x − u) Δt + ½ b(u) δ″(x − u) Δt + o(Δt).    (210)

One can check using the properties of the delta function that this is consistent with Eqs. (207) to (209). For instance,

    E[ΔX(t) | X(t) = u] = E[X(t + Δt) − X(t) | X(t) = u]
        = ∫_0^1 (x − u) f(x; u, Δt) dx
        = ∫_0^1 (x − u) [δ(x − u) − a(u) δ′(x − u) Δt + ½ b(u) δ″(x − u) Δt] dx + o(Δt)
        = 0 + ∫_0^1 (∂/∂x)[(x − u)] a(u) δ(x − u) Δt dx + ½ ∫_0^1 (∂²/∂x²)[(x − u)] b(u) δ(x − u) Δt dx + o(Δt)
        = 0 + ∫_0^1 a(u) δ(x − u) Δt dx + 0 + o(Δt)
        = a(u) Δt + o(Δt).    (211)

Similarly one can confirm Eqs. (208) and (209).


Returning to Eq. (206) and expanding the left hand side to first order in Δt, we have

    f(x; p, t) + [∂f(x; p, t)/∂t] Δt
        = ∫_0^1 f(u; p, t) [δ(x − u) − a(u) δ′(x − u) Δt + ½ b(u) δ″(x − u) Δt] du + o(Δt)
        = f(x; p, t) − ∫_0^1 (∂/∂u)[f(u; p, t) a(u)] δ(x − u) Δt du
              + ½ ∫_0^1 (∂²/∂u²)[f(u; p, t) b(u)] δ(x − u) Δt du + o(Δt)
        = f(x; p, t) + { −(∂/∂x)[f(x; p, t) a(x)] + ½ (∂²/∂x²)[f(x; p, t) b(x)] } Δt + o(Δt).    (212)

Equating the coefficient of Δt on both sides finally gives us the forward Kolmogorov equation

    ∂f(x; p, t)/∂t = −(∂/∂x)[a(x) f(x; p, t)] + ½ (∂²/∂x²)[b(x) f(x; p, t)].    (213)

For any specific population genetics model, the functions a(x) and b(x) are determined from the Markov transition matrix p_{ij}.

8.3.2 The continuum Wright-Fisher model with mutations


We now return to the Wright-Fisher model with 2-way mutations derived in Section 8.2.
To determine the functions a(x) and b(x) in the forward Kolmogorov equation, we need
to apply the continuum limit defined by Eq. (203). However, because the mutation rates
u and v are defined as probabilities of mutation per generation, and the generation time Δt = 1/M becomes infinitesimal in the limit M → ∞, rescaled mutation rates with respect to the continuum time t are needed. Accordingly we set

    θ = 2M(u + v),    α = u/(u + v).    (214)

This parameterisation is chosen to correspond with a parameter frequently encountered in the population genetics literature and generally quoted for diploid populations as

    θ = 4Nμ,    (215)

where N is the effective population size and μ is the per-generation mutation rate. In our case we have M = 2N, and a total mutation rate μ = u + v. The parameter α is a measure of the relative bias between A1 to A2 mutations and A2 to A1 mutations. Eq. (214) inverts to give

    u = θα/(2M),    v = θ(1 − α)/(2M).    (216)
Then, for the transition matrix Eq. (201),

    E[ΔX(t) | X(t) = x] = (1/M) E[Y(τ+1) − Y(τ) | Y(τ) = i]
                        = (1/M) (M φ(i) − i)
                        = −(i/M) u + (1 − i/M) v
                        = [−½ θα x + ½ θ(1 − α)(1 − x)] Δt.    (217)

Comparing with Eq. (207) then gives

    a(x) = −½ θα x + ½ θ(1 − α)(1 − x),    (218)

while a similar calculation using the variance and comparing with Eq. (208) gives

    b(x) = x(1 − x).    (219)

We are now in a position to determine the stationary distribution of the continuum limit of the Wright-Fisher model with two-way mutations. In fact, it is straightforward to check that the function

    f(x) = x^{θ(1−α)−1} (1 − x)^{θα−1} / B(θ(1−α), θα),    0 ≤ x ≤ 1,    (220)

is a solution to the ordinary differential equation

    −(d/dx) {[−½ θα x + ½ θ(1 − α)(1 − x)] f(x)} + ½ (d²/dx²) [x(1 − x) f(x)] = 0,    (221)
[Figure 26 about here: two panels plotting M Prob(Y(∞) = i) against x = i/M, on a linear scale (left) and a log10 scale (right), with the beta density f(x) superimposed.]

Figure 26: The SFS computed in Figure 24 for a Wright-Fisher model with two-way mutation for a finite population M, rescaled to the continuum parameter x = i/M (black), and the continuum stationary solution Eq. (220) to the forward Kolmogorov equation with corresponding parameters θ = 0.16 and α = 0.625 (red). The red plus signs at x = 0 and 1 are computed from Eqs. (223) and (224).

and is thus a stationary distribution of the forward Kolmogorov equation (213). Here the prefactor, which is written in terms of the beta function, whose definition is [31]

    B(z1, z2) = ∫_0^1 x^{z1−1} (1 − x)^{z2−1} dx,    z1, z2 > 0,    (222)

has been chosen to ensure that the probability density function f(x) has the correct normalisation, i.e., ∫_0^1 f(x) dx = 1. This distribution is the well known beta distribution, and was first identified as the stationary distribution in the presence of two-way mutations by Wright in 1931 [27].
Figure 26 shows the SFS for the example in Section 8.2 reproduced with the stationary solution to the continuum theory, Eq. (220), superimposed. The horizontal and vertical axes have been rescaled by factors of 1/M and M respectively in order to make the comparison, and the parameters θ = 0.16 and α = 0.625 are calculated from Eq. (214). To compare with the SFS at i = 0 and M, contributions from the singularities in f(x)
at x = 0 and x = 1 are estimated as

    L.H. spike ≈ M ∫_0^{1/M} f(x) dx
               ≈ [M / B(θ(1−α), θα)] ∫_0^{1/M} x^{θ(1−α)−1} dx
               = M^{1−θ(1−α)} / [θ(1−α) B(θ(1−α), θα)],    (223)

and

    R.H. spike ≈ M ∫_{1−1/M}^{1} f(x) dx
               ≈ [M / B(θ(1−α), θα)] ∫_{1−1/M}^{1} (1 − x)^{θα−1} dx
               = M^{1−θα} / [θα B(θ(1−α), θα)].    (224)

Clearly, the agreement in Figure 26 is excellent even at this moderate value of M, indicating that the continuum limit M → ∞ is reached rapidly. In the next section we explore the estimation of the θ-parameter from an observed SFS.
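The comparison in Figure 26 can be reproduced along the following lines. This is a minimal sketch under the same assumed parameters (M = 100, u = 0.0005, v = 0.0003); the stationary distribution of the finite chain is obtained here from an eigenvector of the transition matrix, which is simply one convenient numerical route and not necessarily the method used for the figure.

```python
# Sketch: stationary SFS of the finite-M chain, Eq. (201), versus the beta density, Eq. (220).
import numpy as np
from scipy.stats import binom, beta

M, u, v = 100, 0.0005, 0.0003
theta, alpha = 2 * M * (u + v), u / (u + v)     # Eq. (214): theta = 0.16, alpha = 0.625

i = np.arange(M + 1)
phi = (i / M) * (1 - u) + (1 - i / M) * v
P = binom.pmf(i[None, :], M, phi[:, None])      # p_ij of Eq. (201)

# Stationary distribution: left eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()

# Interior of the SFS, rescaled as in Figure 26, against the continuum density f(x).
f = beta.pdf(i[1:M] / M, theta * (1 - alpha), theta * alpha)
for k in (1, 10, 50, 90, 99):
    print(k, M * pi[k], f[k - 1])
```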

8.4 Parameter estimation


Generally one expects an effective population to be considerably larger than that of the toy example considered above, and it should be possible in principle to estimate continuum limit parameters such as θ from an observed SFS. However it is important to realise that the distribution f(x) of the SFS is not observed directly, simply because a census of the genomic makeup of an entire population is not practical. Instead, biologists will sequence the genomes of a randomly chosen sample of the population. Because of the complex nature of the library preparation in a high throughput sequencing experiment, biological material from individuals may also be pooled, in which case the sample size is determined by the number of sequencing reads covering each genomic site.

8.4.1 Maximum likelihood estimator


Suppose we take a sample of nsample individual chromosomes across nsite independent genomic sites, each chosen to be a site not susceptible to selective pressures. We also assume the genome to have evolved so as to represent the stationary state of the above toy model in which the genome consists only of letters A1 and A2 from a 2-letter alphabet, and that all the assumptions leading to the Wright-Fisher model with two-way mutations described in Section 8.2 hold.
In this case the data will consist of a set of i.i.d. random numbers K1, K2, . . . , Knsite of type A1 alleles, each taking a value in the set {0, 1, . . . , nsample}, observed at the
nsite genomic sites. At any given site j, conditional on an (unobserved) underlying population fraction x of A1-type alleles at that site, our generic Kj will be a binomial random variable:

    Kj | (X = x) ∼ Bin(nsample, x),    (225)

where X has the beta distribution, Eq. (220). Thus the unconditional distribution of each Kj is

    Prob(Kj = k) = ∫_0^1 Prob(Kj = k | X = x) f(x) dx
                 = ∫_0^1 C(nsample, k) x^k (1 − x)^{nsample−k} x^{θ(1−α)−1} (1 − x)^{θα−1} / B(θ(1−α), θα) dx
                 = [C(nsample, k) / B(θ(1−α), θα)] ∫_0^1 x^{k+θ(1−α)−1} (1 − x)^{nsample−k+θα−1} dx
                 = C(nsample, k) B(k + θ(1−α), nsample − k + θα) / B(θ(1−α), θα),    (226)

for k = 0, . . . , nsample. Eq. (226) is the probability function of one of the lesser known distributions, called the beta-binomial distribution [28]. This is a 3-parameter family of distributions whose generic probability function takes the form

    P_K(k; n, a, b) = C(n, k) B(k + a, n − k + b) / B(a, b),    k = 0, . . . , n,    (227)

where n is a non-negative integer and a and b are positive real numbers.
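For what it is worth, Eq. (227) coincides with the parameterisation used by scipy's beta-binomial distribution, which gives a quick numerical check (the values of n, a and b below are arbitrary illustrations):

```python
# Sketch: Eq. (227) versus scipy.stats.betabinom.
from scipy.special import beta, comb
from scipy.stats import betabinom

n, a, b, k = 20, 0.06, 0.10, 3       # e.g. a = theta*(1 - alpha), b = theta*alpha
direct = comb(n, k) * beta(k + a, n - k + b) / beta(a, b)
print(direct, betabinom.pmf(k, n, a, b))   # the two numbers agree
```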


Given a dataset described above, the parameters θ and α can in principle be estimated numerically via a maximum likelihood calculation. In this case one would maximize the log likelihood

    L(θ, α | k1, . . . , k_nsite) = Σ_{j=1}^{nsite} log P_K(kj; nsample, θ(1−α), θα),    (228)

with respect to θ and α, where k1, . . . , k_nsite are the observed counts of type-A1 alleles at each site. The results of estimating parameters from a small synthetic dataset created with realistic parameter values are given in Table 3.
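A minimal sketch of such a fit is given below. This is not the code behind Table 3; the simulated data, the starting point and the choice of optimiser are assumptions made purely for illustration.

```python
# Sketch: maximum likelihood fit of (theta, alpha) to simulated beta-binomial counts, Eq. (228).
import numpy as np
from scipy.stats import betabinom
from scipy.optimize import minimize

n_sample, n_site = 20, 10_000
theta_true, alpha_true = 0.005, 0.8
k = betabinom.rvs(n_sample, theta_true * (1 - alpha_true), theta_true * alpha_true,
                  size=n_site, random_state=12345)

def neg_log_lik(params):
    theta, alpha = params
    return -np.sum(betabinom.logpmf(k, n_sample, theta * (1 - alpha), theta * alpha))

fit = minimize(neg_log_lik, x0=[0.01, 0.5], method="L-BFGS-B",
               bounds=[(1e-6, 1.0), (1e-6, 1 - 1e-6)])
print(fit.x)    # estimates of (theta, alpha), cf. Table 3
```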

8.4.2 The Watterson estimator


A very commonly encountered estimator of the scaled total mutation rate θ, used when θ ≪ 1, is an estimator first presented in a more general context by Watterson [29]. In the language of our current analysis it states that, if nsample genomes are sampled, and the proportion of sites which are observed to be segregating is S, then

    θ̂ = S / Σ_{i=1}^{nsample−1} (1/i)    (229)

Table 3: Parameters θ and α of the Wright-Fisher model with two-way mutations estimated from simulated beta-binomial count data. The data was simulated assuming true values θ = 0.005, α = 0.8.

    nsample   nsite    θ̂         α̂
    20        10^3     0.00345   0.792
    20        10^4     0.00531   0.804
    20        10^5     0.00488   0.798
    50        10^3     0.00417   0.797
    50        10^4     0.00456   0.800
    50        10^5     0.00499   0.799

is an unbiased estimate of θ. A site is said to be segregating if more than one allele is observed at that site (i.e. it is observed to be a SNP on the basis of a sample of size nsample).
The Watterson estimator is generally derived in textbooks using an alternative approach to population genetics called the coalescent. Here we present instead a derivation given in a recent paper by Vogl [30] which starts from the Wright-Fisher model with two-way mutations derived in the previous sections. Surprisingly, Vogl's treatment gives a slightly different answer to the conventional result quoted above, namely that the right hand side of Eq. (229) is an unbiased estimator of a slightly different quantity, namely

    θ′ = 2θα(1 − α).    (230)

Note that S is an unbiased estimator of the probability that a site chosen at random from a very large number of independent sites is observed to be segregating. Equivalently, E[S] is the probability that the random variable Kj in the distribution Eq. (226) is neither 0 nor nsample. Thus to demonstrate that the right hand side of Eq. (229) is indeed an unbiased estimator of θ′ it suffices to show that

    1 − Prob(Kj = 0) − Prob(Kj = nsample) ≈ 2θα(1 − α) Σ_{i=1}^{nsample−1} (1/i),    as θ → 0,    (231)

where Kj has the beta-binomial distribution Eq. (226).


To obtain the small-θ approximation, we start with some properties of the beta and related functions [31]. The beta function can be written in terms of gamma functions as

    B(z1, z2) = Γ(z1) Γ(z2) / Γ(z1 + z2).    (232)

For small ε and n = 1, 2, . . . ,

    Γ(ε) = 1/ε − γ + O(ε),    Γ(n + ε) = (1 + ε ψ(n)) Γ(n) + O(ε²),    (233)
where γ is Euler's constant and ψ(z) = d log Γ(z)/dz = Γ′(z)/Γ(z) is called the digamma function. It satisfies

    ψ(1) = −γ,    ψ(n) = −γ + Σ_{i=1}^{n−1} (1/i),    n = 1, 2, . . . .    (234)

Armed with these properties, we have from Eq. (226)

    Prob(Kj = 0) = C(nsample, 0) B(θ(1−α), nsample + θα) / B(θ(1−α), θα)
                 = Γ(nsample + θα) Γ(θ) / [Γ(nsample + θ) Γ(θα)]
                 = [1 + θα ψ(nsample)] (1/θ − γ) / { [1 + θ ψ(nsample)] (1/(θα) − γ) } + O(θ²)
                 = α [1 − θ(1−α) ψ(nsample) − γ θ(1−α)] + O(θ²)
                 = α − θα(1−α) Σ_{i=1}^{nsample−1} (1/i) + O(θ²).    (235)

Similarly,

    Prob(Kj = nsample) = 1 − α − θα(1−α) Σ_{i=1}^{nsample−1} (1/i) + O(θ²),    (236)

and Eq. (231) follows.
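The small-θ expansion can also be verified numerically; the sketch below (assumed parameter values only, not part of the original notes) compares the exact segregating probability under the beta-binomial distribution with the right hand side of Eq. (231).

```python
# Sketch: numerical check of Eq. (231).
import numpy as np
from scipy.stats import betabinom

n_sample, theta, alpha = 20, 0.005, 0.8
a, b = theta * (1 - alpha), theta * alpha

p_seg = 1 - betabinom.pmf(0, n_sample, a, b) - betabinom.pmf(n_sample, n_sample, a, b)
approx = 2 * theta * alpha * (1 - alpha) * np.sum(1 / np.arange(1, n_sample))
print(p_seg, approx)    # agree up to O(theta^2) corrections
```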

Appendices
These appendices cover background mathematical and statistical material used in the
course. Appendices A and B are a highly condensed form of a subset of the information
contained in the first four chapters of the book Statistical Methods in Bioinformatics:
An Introduction by W.J. Ewens and G.R. Grant (2nd Ed., Springer, New York, NY).
The material in Appendix C contains a proof of Wald's identity, which is used in Subsection 5.2.

A Crash Course in Probability Theory


A.1 Discrete Random Variables
Associated with a stochastic measurement process resulting in one of a finite or countably
infinite set of values y Y (for example Y = {0, 1, 2, . . .}) we define a discrete random
variable Y and its associated probability function

PY (y) = Prob (Y = y), y Y , (237)

(the probability that Y takes the value y) satisfying


X
PY (y) 0, PY (y) = 1. (238)
yY

In general, the convention is to use capital letters for random variables and lower case letters for realised values upon taking a measurement. Without loss of generality we assume Ω_Y is a subset of the real numbers and define the cumulative distribution function (or simply distribution function)

    F_Y(y) ≡ Prob(Y ≤ y) = Σ_{u ≤ y} P_Y(u).    (239)

It satisfies F_Y(−∞) = 0, F_Y(∞) = 1, is right continuous (i.e. lim_{u→y+} F_Y(u) = F_Y(y)) and is non-decreasing. Table 4 gives a list of commonly occurring distributions and their properties:

Bernoulli trial. A Bernoulli trial is a stochastic process with two possible outcomes, referred to as success (Y = 1) occurring with probability p ∈ [0, 1] and failure (Y = 0) occurring with probability 1 − p. The usual example given is the toss of a biased coin.

Binomial. A binomial random variable is the number of successes in a series of n independent^14 Bernoulli trials each with common parameter p.

14 A definition of the concept of independent events will be given in A.3.
Table 4: Commonly occurring discrete probability distributions.

    Name              | P_Y(y)                        | Ω_Y                                                     | μ_Y = E[Y]  | σ²_Y = Var(Y)                   | G_Y(t) = E[t^Y]
    Bernoulli trial   | p^y (1−p)^{1−y}               | {0, 1}                                                  | p           | p(1−p)                          | 1 − p + pt
    Binomial          | C(n,y) p^y (1−p)^{n−y}        | {0, . . . , n}                                          | np          | np(1−p)                         | (1 − p + pt)^n
    Poisson           | e^{−λ} λ^y / y!               | {0, 1, 2, . . .}                                        | λ           | λ                               | e^{λ(t−1)}
    Negative binomial | C(y−1, m−1) p^m (1−p)^{y−m}   | {m, m+1, . . .}                                         | m/p         | m(1−p)/p²                       | (pt)^m / (1 − t + pt)^m
    Geometric         | (1−p) p^y                     | {0, 1, 2, . . .}                                        | p/(1−p)     | p/(1−p)²                        | (1−p)/(1 − tp)
    Hypergeometric    | C(m,y) C(n, k−y) / C(m+n, k)  | {a, a+1, . . . , b}, a = max(0, k−n), b = min(m, k)     | mk/(m+n)    | mnk(m+n−k) / [(m+n)²(m+n−1)]    | [C(n,k)/C(m+n,k)] ₂F₁(−k, −m; n−k+1; t)
Poisson. A Poisson distribution is the limiting case of a binomial distribution as the number of independent trials n → ∞ while the common probability of success p → 0 in such a way that np = λ for some constant parameter λ ∈ (0, ∞).

Negative binomial. Given a series of independent Bernoulli trials with common parameter p, a negative binomial random variable is the number of trials up to and including the mth success for some pre-specified integer m.

Geometric. Given a series of independent Bernoulli trials with common parameter p, a geometric random variable is the number of trials before but not including the first failure.

Hypergeometric. Consider an urn containing m red balls and n white balls. If k balls are drawn from the urn in sequence at random such that all remaining balls at each draw have an equal probability of being drawn, the number of red balls drawn is a hypergeometric random variable.
The expectation value of a random variable Y is defined as

    E[Y] = Σ_{y ∈ Ω_Y} y P_Y(y).    (240)

For any function g we have

    E[g(Y)] = Σ_{u ∈ Ω_{g(Y)}} u P_{g(Y)}(u) = Σ_{y ∈ Ω_Y} g(y) P_Y(y),    (241)

since P_{g(Y)}(u) = Σ_{{y : g(y)=u}} P_Y(y). Note that expectation value is a linear operator:

    E[aY + b] = a E[Y] + b.    (242)

The mean μ_Y and variance σ²_Y of Y are defined as

    μ_Y = E[Y],    σ²_Y = Var(Y) = E[(Y − μ_Y)²] = E[Y²] − μ²_Y,    (243)

where the final equal sign follows from the linear property of expectation values. The positive square root of the variance is called the standard deviation.
A useful function for calculating properties of discrete distributions is the probability generating function (pgf) defined by

    G_Y(t) = E[t^Y] = Σ_y t^y P_Y(y).    (244)

The mean and variance of a distribution are given in terms of derivatives of its pgf via the formulae

    μ_Y = G′_Y(1),    σ²_Y = G″_Y(1) + μ_Y − μ²_Y.    (245)

As a general rule, if for two random variables Y1 and Y2 we have G_{Y1}(t) = G_{Y2}(t) for t in some open interval containing 1, then P_{Y1}(y) = P_{Y2}(y). This provides a convenient way of demonstrating that the Poisson distribution probability function given in Table 4 is indeed the limit of that for the binomial distribution.
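The limit can also be seen numerically; a minimal sketch (with an arbitrary choice λ = 3, not part of the original notes) is:

```python
# Sketch: Binomial(n, lambda/n) pmf approaching the Poisson(lambda) pmf as n grows.
from scipy.stats import binom, poisson

lam, k = 3.0, 5
for n in (10, 100, 10_000):
    print(n, binom.pmf(k, n, lam / n), poisson.pmf(k, lam))
```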

A.2 Continuous Random Variables
Associated with a stochastic measurement process resulting in a value x ∈ Ω_X from an interval of the real numbers (for example Ω_X = [L, H] or Ω_X = (−∞, ∞)) we define a continuous random variable X and its associated density function f_X such that

    f_X(x) dx = Prob(x < X < x + dx),    x ∈ Ω_X,    (246)

(the probability that X takes a value in the interval (x, x + dx)) satisfying

    f_X(x) ≥ 0,    ∫_{x ∈ Ω_X} f_X(x) dx = 1.    (247)

For convenience, set f_X(x) = 0 for x ∉ Ω_X. Then the cumulative distribution function (or distribution function) is defined as

    F_X(x) ≡ Prob(X ≤ x) = ∫_{−∞}^{x} f_X(u) du.    (248)

It satisfies F_X(−∞) = 0, F_X(∞) = 1, f_X(x) = F′_X(x) and is non-decreasing.


Table 5 gives a list of commonly occurring continuous distributions and their properties:

Uniform distribution. A uniform random variable has equal probability of occurring anywhere within a specified interval [a, b].

Exponential distribution. The exponential distribution describes the time elapsed, from time zero, up to an event whose probability of occurring in the infinitesimal interval [x, x + dx) is λ dx, for some constant λ. The lifetime of a radioactive atom is an example of an exponential random variable.

Gamma distribution. A Gamma random variable is the sum of k independent exponential random variables, each with the same parameter λ.

Normal distribution. The normal distribution arises in a number of contexts as a limiting case of other distributions. For example, it is the limiting case of a binomial distribution as the number of Bernoulli trials becomes large. The notation X ∼ N(μ, σ²) is often used as shorthand to mean X is a normal random variable with mean μ and variance σ². Given a random variable X ∼ N(μ, σ²), then Z = (X − μ)/σ ∼ N(0, 1). A N(0, 1) random variable is called a standard normal variable.

Beta distribution. The beta distribution arises naturally in a number of contexts including the order statistics of uniform random variables and Bayesian statistics. If X1 and X2 are independent gamma random variables with parameters k = α and k = β respectively and common parameter λ = 1, then U = X1/(X1 + X2) is a beta random variable with parameters α and β.
Table 5: Commonly occurring continuous probability distributions.

    Name        | f_X(x)                                    | Ω_X        | μ_X = E[X]  | σ²_X = Var(X)          | m_X(θ) = E[e^{θX}]
    Uniform     | 1/(b − a)                                 | [a, b]     | (a + b)/2   | (b − a)²/12            | (e^{θb} − e^{θa}) / [θ(b − a)]
    Exponential | λ e^{−λx}                                 | [0, ∞)     | 1/λ         | 1/λ²                   | λ/(λ − θ)
    Gamma       | λ^k x^{k−1} e^{−λx} / Γ(k)                | [0, ∞)     | k/λ         | k/λ²                   | λ^k/(λ − θ)^k
    Normal      | (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}             | (−∞, ∞)    | μ           | σ²                     | e^{μθ + σ²θ²/2}
    Beta        | [Γ(α+β)/(Γ(α)Γ(β))] x^{α−1}(1−x)^{β−1}    | [0, 1]     | α/(α + β)   | αβ/[(α+β)²(α+β+1)]     | ₁F₁(α; α + β; θ)
Table 6: Commonly occurring continuous probability distributions related to the normal distribution. The random variables Z, Z1, Z2, . . . are independent N(0, 1). Note that a Chi-square random variable is an example of a Gamma random variable with λ = 1/2, k = ν/2.

    Name           | Relationship with normal                                                       | Ω_X       | μ_X                       | σ²_X
    Chi-square     | X = Z1² + . . . + Zν²                                                          | [0, ∞)    | ν                         | 2ν
    Student's t    | X = Z / √[(Z1² + . . . + Zν²)/ν]                                               | (−∞, ∞)   | 0 for ν > 1               | ν/(ν − 2) for ν > 2
    F-distribution | X = [(Z1² + . . . + Z_{ν1}²)/ν1] / [(Z_{ν1+1}² + . . . + Z_{ν1+ν2}²)/ν2]       | [0, ∞)    | ν2/(ν2 − 2) for ν2 > 2    | 2ν2²(ν1 + ν2 − 2) / [ν1(ν2 − 2)²(ν2 − 4)] for ν2 > 4

Furthermore, there are a number of commonly occurring distributions used in hypothesis testing corresponding to random variables built from combinations of independent standard normal variables. These are listed in Table 6.
Analogous to Eq. (241), for any real valued function g define the expectation value

    E[g(X)] = ∫_{x ∈ Ω_X} g(x) f_X(x) dx.    (249)

The formulae of Eq. (243) can then be used to define the mean and variance of continuous random variables. We further define the moment generating function of a random variable X (either discrete or continuous) as

    m_X(θ) = E[e^{θX}] = ∫_{x ∈ Ω_X} e^{θx} f_X(x) dx.    (250)

The mean and variance are then given by

    μ_X = m′_X(0),    σ²_X = m″_X(0) − μ²_X.    (251)

In some cases, for example the Gamma distribution, it is more convenient to calculate the mean and variance via the moment generating function, and for some cases, for example the uniform distribution, to derive them directly from the definition.

A.3 Events
A more general setting for defining probability distributions is in terms of events. An example of an event is the statement Y = y in Eq. (237). More generally, one considers an event to be something which either will or will not happen when a stochastic experiment is performed. Define the certain event as something which must happen, and the empty event, denoted ∅, as something which cannot happen. If A is an event, the complementary event Ā (called "not A") is the event that A does not happen. Given two events A1 and A2, their union A1 ∪ A2 is the event that either A1, A2 or both occur, and their intersection A1 ∩ A2 is the event that both A1 and A2 occur. For any event A we have that A ∪ Ā is the certain event, and A ∩ Ā = ∅.
Probability is then defined as a real number Prob(A) assigned to each event A satisfying the following axioms:

1. Prob(A) ≥ 0;

2. For the certain event S, Prob(S) = 1;

3. If A1 ∩ A2 = ∅, then Prob(A1 ∪ A2) = Prob(A1) + Prob(A2).


It follows that for any event A, 0 ≤ Prob(A) ≤ 1 and Prob(Ā) = 1 − Prob(A). One can also show that for any set of events A1, A2, . . . ,

    Prob(A1 ∪ A2) = Prob(A1) + Prob(A2) − Prob(A1 ∩ A2);    (252)

    Prob(A1 ∪ A2 ∪ A3) = Prob(A1) + Prob(A2) + Prob(A3)
        − Prob(A1 ∩ A2) − Prob(A1 ∩ A3) − Prob(A2 ∩ A3)
        + Prob(A1 ∩ A2 ∩ A3);    (253)

and similar generalisations for the union of higher numbers of events.


We define conditional probability, Prob (A|B), (the probability that event A occurs
given that event B occurs) as
Prob (A B)
Prob (A|B) = . (254)
Prob (B)
The consequence
Prob (A)Prob (B|A)
Prob (A|B) = . (255)
Prob (B)
is known as Bayes formula. Another useful formula is: given a set of mutually exclusive
and exhaustive events B1 , . . . Bn ( i.e. Bi Bj = for 0 i < j n and B1 . . .Bn = S,
the certain event),
Xn
Prob (A) = Prob (Bi )Prob (A|Bi ). (256)
i=1
Event A1 is said to be independent of A2 if

Prob (A1 |A2 ) = Prob (A1 ). (257)

An equivalent statement of independence is that

Prob (A1 A2 ) = Prob (A1 )Prob (A2 ). (258)

More generally, A1, A2 and A3 are said to be mutually independent if and only if all of the following hold:

    Prob(Ai ∩ Aj) = Prob(Ai) Prob(Aj) for all i ≠ j;
    Prob(A1 ∩ A2 ∩ A3) = Prob(A1) Prob(A2) Prob(A3),    (259)

with obvious generalisations involving two-, three-, four-way intersections etc. for higher numbers of events.
Given a discrete random variable Y with range Ω_Y, we define the conditional probability function of Y restricted to some subset Λ ⊆ Ω_Y as

    P_{Y|Λ}(y) = Prob(Y = y | Y ∈ Λ) = P_Y(y) / Σ_{u ∈ Λ} P_Y(u),    y ∈ Λ,    (260)

and the conditional expectation value of Y restricted to Λ as

    E[Y | Y ∈ Λ] = Σ_{y ∈ Λ} y P_{Y|Λ}(y) = Σ_{y ∈ Λ} y P_Y(y) / Σ_{y ∈ Λ} P_Y(y),    (261)

with similar definitions for continuous random variables with sums replaced by integrals. Given a partitioning Λ = ∪_{i=1}^{n} Λ_i, where Λ_i ∩ Λ_j = ∅ for 1 ≤ i < j ≤ n, it follows that

    E[Y | Y ∈ Λ] = Σ_{i=1}^{n} E[Y | Y ∈ Λ_i] Prob(Y ∈ Λ_i | Y ∈ Λ).    (262)

This result generalises to any random variable X and any partitioning of an event, A = ∪_{i=1}^{n} B_i, where B_i ∩ B_j = ∅ for 1 ≤ i < j ≤ n, as

    E[X|A] = Σ_{i=1}^{n} E[X|B_i] Prob(B_i|A).    (263)

In many situations it is convenient to define indicator random variables. Given any event A, its indicator variable I_A is defined as

    I_A = 1 if the event A occurs,  0 if the event A does not occur.    (264)

Note that for any event A, I_A² = I_A, and E[I_A] = Prob(A).

A.4 Several Random Variables


Consider an n-tuple of discrete random variables Y = (Y1, . . . , Yn) defined over some joint range Ω_Y ⊆ Ω_{Y1} × . . . × Ω_{Yn}. Associated with a measurement of Y we have a joint probability distribution^15

    P_Y(y) = Prob(Y1 = y1, . . . , Yn = yn),    (265)

15 In the notation of Section A.3, the argument Y1 = y1, . . . , Yn = yn is shorthand for the event (Y1 = y1) ∩ . . . ∩ (Yn = yn).
satisfying P_Y(y) ≥ 0 and Σ_{y ∈ Ω_Y} P_Y(y) = 1.
For some subset of indices, which for convenience we choose to be {1, 2, . . . , i}, 1 ≤ i ≤ n, we define the marginal probability

    P_{Y1,...,Yi}(y1, . . . , yi) = Prob(Y1 = y1, . . . , Yi = yi) = Σ_{y_{i+1},...,y_n} P_Y(y),    (266)

where the sum is taken over all (y_{i+1}, . . . , y_n) such that y ∈ Ω_Y. From Eq. (260) we also have the conditional probability function

    P_{Y | Y1=y1,...,Yi=yi}(y_{i+1}, . . . , y_n) = Prob(Y_{i+1} = y_{i+1}, . . . , Yn = yn | Y1 = y1, . . . , Yi = yi)
                                                 = P_Y(y) / P_{Y1,...,Yi}(y1, . . . , yi).    (267)

A set of discrete random variables Yi is said to be independent if Ω_Y = Ω_{Y1} × . . . × Ω_{Yn} and

    P_Y(y) = Π_{i=1}^{n} P_{Yi}(yi).    (268)

The abbreviation i.i.d. is commonly used to describe identically and independently distributed random variables, that is, a set of independent random variables with a common distribution function.
Given a set Y of independent random variables, it follows from the definitions Eqs. (244) and (268) that the probability generating function of the sum Sn = Σ_{i=1}^{n} Yi is

    G_{Sn}(t) = Π_{i=1}^{n} G_{Yi}(t),    (269)

and for i.i.d. random variables this becomes G_{Sn}(t) = G_Y(t)^n where Y is the random variable of the common distribution. These results can be used, for instance, to show that a binomial random variable is the sum of n i.i.d. Bernoulli random variables (see Table 4), or that the sum of n independent Poisson random variables with parameters λ1, . . . , λn is a Poisson random variable with parameter Σ_{i=1}^{n} λi.
Associated with an n-tuple of continuous random variables X = (X1, . . . , Xn) defined over some joint range Ω_X ⊆ Ω_{X1} × . . . × Ω_{Xn} we define the joint density function f_X(x) such that

    Prob(X ∈ Q) = ∫_Q f_X(x) dⁿx,    (270)

for any Q ⊆ Ω_X.
For some subset of indices, which for convenience we again choose to be {1, 2, . . . , i}, 1 ≤ i ≤ n, the marginal density function is defined as

    f_{X1,...,Xi}(x1, . . . , xi) = ∫_{R(x1,...,xi)} f_X(x) dx_{i+1} . . . dx_n,    (271)
where R(x1, . . . , xi) is the range of values x_{i+1}, . . . , x_n such that x ∈ Ω_X for fixed x1, . . . , xi. We also define the conditional density function of (X_{i+1}, . . . , Xn) given X1 = x1, . . . , Xi = xi as

    f_{X | X1=x1,...,Xi=xi}(x_{i+1}, . . . , x_n) = f_X(x) / f_{X1,...,Xi}(x1, . . . , xi).    (272)

A set of continuous random variables Xi is said to be independent if Ω_X = Ω_{X1} × . . . × Ω_{Xn} and

    f_X(x) = Π_{i=1}^{n} f_{Xi}(xi).    (273)

The expectation value of a function g(X) of a set of continuous random variables generalises from Eq. (249) as

    E[g(X)] = ∫_{Ω_X} g(x) f_X(x) dⁿx.    (274)

The definitions of mean and variance given in Eq. (243), with Y replaced by Xi, carry over for an n-tuple of continuous random variables. We also define the covariance of Xi and Xj as

    σ_{Xi,Xj} = Cov(Xi, Xj) = E[(Xi − μ_{Xi})(Xj − μ_{Xj})] = E[Xi Xj] − μ_{Xi} μ_{Xj},    (275)

and the correlation

    ρ_{Xi,Xj} = σ_{Xi,Xj} / (σ_{Xi} σ_{Xj}).    (276)

If Xi, i = 1, . . . , n, are independent random variables, then Cov(Xi, Xj) = 0 for i ≠ j.
By considering E[(Zi ± Zj)²] ≥ 0 where Zi = (Xi − μ_{Xi})/σ_{Xi}, one can show that for any set of random variables X the correlation satisfies −1 ≤ ρ_{Xi,Xj} ≤ 1. From the linear property of expectation values and the bilinear property of the covariance it follows that for any linear combination of random variables,

    E[Σ_i a_i X_i] = Σ_i a_i μ_{X_i},    Var(Σ_i a_i X_i) = Σ_i a_i² σ²_{X_i} + 2 Σ_{i<j} a_i a_j σ_{X_i,X_j}.    (277)
In particular, if the Xi are i.i.d. random variables whose sum is Sn = Σ_{i=1}^{n} Xi, then

    μ_{Sn} = E[Sn] = n μ_X,    σ²_{Sn} = Var(Sn) = n σ²_X,    (278)

where X is the random variable of their common distribution.


If Xi, i = 1, . . . , n, are independent random variables, it follows from Eqs. (250), (273) and (274) that the moment generating function of the sum Sn = Σ_{i=1}^{n} Xi is

    m_{Sn}(θ) = Π_{i=1}^{n} m_{Xi}(θ).    (279)

If, furthermore, the Xi are i.i.d. random variables, then m_{Sn}(θ) = m_X(θ)^n, where X is the random variable of their common distribution. These results can be used, for instance, to show that a Gamma random variable with parameters k and λ is the sum of k i.i.d. exponential random variables, each with parameter λ (see Table 5), and that the sum of n independent Normal random variables with means μi and variances σi² is also a Normal random variable, with mean Σ_i μi and variance Σ_i σi².

A.5 Sum of a random number of random variables


Consider i.i.d. random variables Yi with common probability function P_Y(y) and probability generating function G_Y(t) = Σ_y P_Y(y) t^y. Let S = Σ_{i=1}^{N} Yi, where N is an integer valued random variable which is independent of the Yi and has probability function P_N(n). Then the mean and variance of S are given by

    E[S] = E[N] E[Y],    Var(S) = E[N] Var(Y) + E(Y)² Var(N).    (280)

To derive these results, note that

    Prob(S = s) = Σ_n Prob(N = n) Prob(S = s | N = n)                          by Eq. (256)
                = Σ_n P_N(n) (coefficient of t^s in (G_Y(t))^n)                by Eq. (269)
                = coefficient of t^s in Σ_n P_N(n) (G_Y(t))^n
                = coefficient of t^s in G_N(G_Y(t)),    (281)

where G_N(t) = Σ_n P_N(n) t^n is the probability generating function of N. Thus the probability generating function of S is G_S(t) = G_N(G_Y(t)). Finally, using Eq. (245),

    E[S] = G′_S(1) = d/dt G_N(G_Y(t)) |_{t=1} = G′_N(G_Y(1)) G′_Y(1) = E[N] E[Y],    (282)

and similarly for the variance.
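Eq. (280) is easy to confirm by Monte Carlo; the sketch below (an assumed example with N Poisson and the Yi Poisson as well) is not part of the original notes.

```python
# Sketch: Monte Carlo check of Eq. (280) with N ~ Poisson(mu) and Y_i ~ Poisson(lam).
import numpy as np

rng = np.random.default_rng(3)
n_rep, mu, lam = 50_000, 4.0, 2.5

N = rng.poisson(mu, size=n_rep)
S = np.array([rng.poisson(lam, size=n).sum() for n in N])   # S = Y_1 + ... + Y_N

print(S.mean(), mu * lam)                  # E[S]   = E[N] E[Y]
print(S.var(), mu * lam + lam**2 * mu)     # Var(S) = E[N] Var(Y) + E[Y]^2 Var(N)
```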

A.6 Extreme values and order statistics


Let X1, . . . , Xn be a set of i.i.d. continuous random variables, and define the random variables

    Xmin = min_i Xi,    Xmax = max_i Xi.    (283)

Since the event Xmax ≤ x is equivalent to the intersection of independent events Xi ≤ x, we have for the distribution function and probability density function of Xmax

    Fmax(x) ≡ Prob(Xmax ≤ x) = F_X(x)^n,    fmax(x) = n f_X(x) (F_X(x))^{n−1},    (284)
where X is the random variable of the common distribution of the Xi. By a similar argument we have for the distribution function and probability density function of Xmin

    Fmin(x) ≡ Prob(Xmin ≤ x) = 1 − (1 − F_X(x))^n,    fmin(x) = n f_X(x) (1 − F_X(x))^{n−1}.    (285)

Fmin and Fmax are called extreme value distributions. We also define order statistics X(1), X(2), . . . , X(n) as the permutation of X1, . . . , Xn such that

    Xmin = X(1) < X(2) < . . . < X(n) = Xmax.    (286)

The probability density function for the ith order statistic is

    f_{X(i)}(x) = [n! / ((i − 1)!(n − i)!)] (F_X(x))^{i−1} f_X(x) (1 − F_X(x))^{n−i}.    (287)

When X is a uniform distribution on the interval [0, 1], X(i) is a Beta random variable (see Table 5) with parameters α = i, β = n − i + 1.
For the example of an exponential random variable, F_X(x) = 1 − e^{−λx}, the above formulae give

    Fmin(x) = 1 − e^{−nλx},    fmin(x) = nλ e^{−nλx},    (288)

that is, Xmin is an exponential random variable with parameter nλ, and

    Fmax(x) = (1 − e^{−λx})^n,    fmax(x) = nλ e^{−λx} (1 − e^{−λx})^{n−1}.    (289)

Calculation of the mean and variance of Xmax directly from Eq. (289) or from its moment generating function is not simple. A more convenient derivation begins by writing

    Xmax = X(1) + (X(2) − X(1)) + . . . + (X(n) − X(n−1)).    (290)

If one thinks of the Xi as representing the lifetimes of a set of n radioactive atoms, and if we restart the clock each time a decay occurs, then X(1) is the time of the first decay, i.e. the minimum of n independent exponential random variables, and X(i) − X(i−1) for 1 < i ≤ n is the time between the (i − 1)th and ith decays, i.e. the minimum of n − i + 1 independent exponential random variables. From Eqs. (277), (290) and Table 5 we then have

    μmax ≡ E[Xmax] = (1/λ) Σ_{i=1}^{n} (1/i),    σ²max ≡ Var(Xmax) = (1/λ²) Σ_{i=1}^{n} (1/i²).    (291)
In the limit of large n we have

    μmax ≈ (γ + log n)/λ,    σ²max ≈ π²/(6λ²),    (292)

where γ ≈ 0.5772 is Euler's constant. Now define the centred random variable in the limit of large n as

    U = lim_{n→∞} [Xmax − (log n)/λ] / (1/λ) = lim_{n→∞} (λ Xmax − log n).    (293)
[Figure 27 about here: two panels plotting F_U(u) and f_U(u) against u for −2 ≤ u ≤ 4.]

Figure 27: Distribution and probability density functions of the Type-I Gumbel distribution, Eq. (294).

Its limiting distribution function and density function are, from Eq. (289),

    F_U(u) = lim_{n→∞} (1 − e^{−u}/n)^n = e^{−e^{−u}},    f_U(u) = e^{−u − e^{−u}},    −∞ < u < ∞.    (294)

Eq. (294), plotted in Figure 27, defines the Type-I Gumbel distribution.
The above result for the maximum of exponential random variables is an example of a more general result: one can show that for the maximum Xmax = max_i Xi of any set of n i.i.d. random variables with finite moments to all orders and support Ω_{Xi} = (A, +∞) for some finite A,

    μmax ≈ a log n;    σ²max ≈ b    as n → ∞,    (295)

where a and b are finite constants, and the asymptotic distribution of

    U = π (Xmax − μmax) / (σmax √6) + γ    (296)

is the Gumbel distribution Eq. (294), that is,

    Prob(Xmax ≤ x) ≈ exp(−exp{−[π(x − μmax)/(σmax √6) + γ]}).    (297)
For a set of i.i.d. discrete random variables Y1, . . . , Yn with Ω_{Yi} = {0, 1, 2, . . .}, the distribution function and probability function of Ymax = max_i Yi are

    Fmax(y) = Prob(Ymax ≤ y) = (F_Y(y))^n,    (298)
    Pmax(y) = Prob(Ymax = y) = (F_Y(y))^n − (F_Y(y − 1))^n.    (299)

For a geometric distribution with parameter p = e^{−λ} we have F_Y(y) = 1 − e^{−λ(y+1)}, and thus

    Pmax(y) = (1 − e^{−λ(y+1)})^n − (1 − e^{−λy})^n.    (300)
It is easy to see that if X is an exponential random variable with parameter λ, then its integer part ⌊X⌋ is a geometric random variable with parameter e^{−λ}, since Prob(⌊X⌋ = y) = Prob(y ≤ X < y + 1) = (1 − e^{−λ}) e^{−λy}. Also, it can be shown that (i) the random variable D = Xmax − Ymax, where Ymax = ⌊X⌋max = ⌊Xmax⌋, is approximately uniform on the interval [0, 1] for large n, and hence E[D] ≈ 1/2 and Var(D) ≈ 1/12, and (ii) Cov(Xmax, D) → 0 as n → ∞. Using Eqs. (277) and (292), we then obtain the following approximate formulae for the mean and variance of Ymax as n → ∞,

    E[Ymax] = E[Xmax] − E[D] ≈ (γ + log n)/λ − 1/2,    (301)

    Var(Ymax) ≈ Var(Xmax) + Var(D) ≈ π²/(6λ²) + 1/12.    (302)

A.7 Poisson processes


A Poisson process is a sequence of events occurring in time such that:

1. Occurrence of an event in the time interval (a, b) is independent of occurrence of an event in the interval (c, d) if (a, b) ∩ (c, d) = ∅;

2. There exists λ > 0 such that
    Prob(1 event occurs in (t, t + h)) = λh + o(h) as h → 0+,
    Prob(≥ 2 events occur in (t, t + h)) = o(h) as h → 0+,
independent of t.

Let the random variable N(t) be the number of events that occur in the time interval [0, t). Then it can be shown from the above properties that N(t) is a Poisson random variable with mean λt, that is,

    P_{N(t)}(j) = Prob(N(t) = j) = e^{−λt} (λt)^j / j!,    j = 0, 1, 2, . . . .    (303)
Note that the above conditions imply that Eq. (303) also holds for any interval [a, a + t), a ≥ 0.
For any given number of events k, let the random variable Tk be the time at which the kth event occurs. Then it can be shown that

    f_{Tk}(t) dt = Prob(kth event occurs in the interval [t, t + dt)) = λ^k t^{k−1} e^{−λt} / Γ(k) dt,    (304)

that is, Tk is a Gamma random variable with parameters k and λ. In particular, the time to the first event, or equivalently because of condition 1, the time between any two successive events, is an exponential random variable, and the time to k events is the sum of k independent exponential random variables.
B Crash course in statistical inference
B.1 Parameter estimation
One is often faced with the problem of estimating the properties of an unknown or partially known distribution given a sample of values drawn independently from that distribution. A more formal statement of the problem is: given a family of distribution functions F_X(x; θ) which depend on some unknown parameter(s) θ, how can θ be estimated from a sample of i.i.d. random observations Xi, i = 1, . . . , n? Typically θ may represent the parameter p in a binomial distribution (see Table 4) or μ and σ in a Normal distribution (see Table 5), or perhaps the distribution is totally unknown, but its mean and variance are to be estimated.
An estimator θ̂(X1, . . . , Xn) is some function of the random observations intended to provide an estimate of the value of θ, and is itself a random variable. An unbiased estimator is one whose expected value is the parameter being estimated: E[θ̂] = θ. A consistent estimator is one for which, for any ε > 0, Prob(|θ̂ − θ| > ε) → 0 as the sample size n → ∞ (often stated as θ̂ → θ in probability).
For any distribution, the sample average, defined by

    X̄ = (1/n) Σ_{i=1}^{n} Xi,    (305)

is an unbiased estimator of the mean μ_X, since E[X̄] = (1/n) Σ_{i=1}^{n} E[Xi] = μ_X. It can be shown that it is also a consistent estimator provided Var(X) is finite. An unbiased estimate of the variance is^16

    σ̂² = Σ_{i=1}^{n} (Xi − X̄)² / (n − 1).    (306)
For a sample of i.i.d. observations Xi of a continuous random variable with parametrically defined common density function f_X(x; θ), or observations Yi of a discrete random variable with parametrically defined common probability function P_Y(y; θ), the likelihood function is defined as

    L(θ; X) = Π_{i=1}^{n} f_X(x_i; θ)    or    L(θ; Y) = Π_{i=1}^{n} P_Y(y_i; θ),    (307)

respectively. The maximum likelihood estimator of θ is defined as

    θ̂ = argmax_θ (log L(θ; X)).    (308)

The logarithm is included for algebraic convenience as in most cases the sum is easier to differentiate than a product. It can be shown that in general the maximum likelihood estimator is consistent.
16 Pedantically speaking, this should be written as \widehat{σ²} and not σ̂², since an unbiased estimate of the variance is not in general the square of an unbiased estimate of the standard deviation. However, the notation used here is the standard convention.
As an example, consider i.i.d. random variables Y1, . . . , Yn, each from a Poisson distribution with unknown common parameter λ. The log likelihood function is

    log L(λ; Y) = −nλ + (Σ_{i=1}^{n} Yi) log λ − log(Π_{i=1}^{n} Yi!),    (309)

and setting d log L(λ; Y)/dλ |_{λ=λ̂} = 0 gives the maximum likelihood estimator of λ as the sample average λ̂ = Ȳ.

B.2 Hypothesis testing


Suppose one is required to make a choice between two hypotheses: a null hypothesis H0 and an alternate hypothesis H1, based on a set of observations that have been made. For example, one may wish to decide whether a coin is fair:

    H0: Prob(toss results in a head) = 1/2,

or biased:

    H1: Prob(toss results in a head) ≠ 1/2,

based on the observation of n tosses. In this example, the null hypothesis amounts to an assumption that the number of heads obtained is a binomial random variable with parameter p = 1/2, and the alternate hypothesis an assumption that it is a binomial random variable with p ≠ 1/2. In a more general setting, H0 may take the form of an assumption of a particular family of distributions with a parameter choice θ = θ0, and H1 may take the form, for instance, of an assumption that θ = θ1 for some specified value θ1 (termed a simple hypothesis), that θ > θ0 (a complex one-sided hypothesis) or that θ ≠ θ0 (a complex two-sided hypothesis).
There are two approaches to this problem. The first, called the classical or frequentist approach, asks the question: If a given hypothesis or model is assumed to be true, what is the probability of obtaining the observed data? In general, this turns out to be a well defined, tractable question, but unfortunately it is not really what one wants to know. The other approach, called Bayesian, asks a question which is intuitively more aligned with what one does want to know: Given the observed data, what is the probability a particular hypothesis or model is true? This question is approached via the Bayesian formula Eq. (255) in the form

    Prob(Model | Data) = Prob(Model) Prob(Data | Model) / Prob(Data),    (310)

where Prob(Data | Model) is the answer to the frequentist question, and, as a general rule, Prob(Data) turns out to be a straightforward normalisation. Critics of the Bayesian approach argue that the remaining factor on the right hand side, Prob(Model), called
the Bayesian prior, must be assigned arbitrarily or is ill-defined, or even that the question being asked is ill-defined on ontological grounds.^17
In these notes we content ourselves with addressing the frequentist approach, for which hypothesis testing consists of the following procedure. Once a null hypothesis H0 and an alternate hypothesis H1 have been declared, a test statistic is chosen, that is, a quantity calculated from the data which is used as an indicator of whether H0 should be accepted, or whether H0 should be rejected in favour of H1. In the above coin-tossing example, a suitable test statistic is the number of heads. Loosely speaking, an observed number of heads close to the extremities 0 or n is an indication of bias, and a value close to n/2 an indication of no bias.
Table 7 lists the possible outcomes under the contingencies that H0 is true and that H1 is true. The type I error rate, α, is defined as the probability of rejecting the null hypothesis when it is actually true. An acceptable type I error rate is chosen, and a decision algorithm based on the test statistic is designed to achieve that error rate. Note that the type I error rate cannot be chosen arbitrarily low, as this will lead to unacceptably high values of the type II error rate, β, that is, the probability of accepting H0 when H1 is true. Commonly chosen values are α = 1% or 5%. Note also that in many applications, researchers are interested in singling out interesting cases for further analysis from a large set of individual cases, where the interesting cases, or positives, are characterised as those for which hypothesis testing indicates that H1 be accepted. In such applications, type I errors are termed false positives and type II errors false negatives.
The decision algorithm typically consists of determining a significance point, K, defined as that value of the test statistic beyond which H0 will be rejected. It is chosen so that

    Prob(H0 rejected | H0 true)  = α for a continuous random variable,  ≤ α for a discrete random variable.    (311)

For the coin tossing example, K is set to the largest value such that

    Prob(Y ≤ K or Y ≥ n − K | p = 1/2) = Σ_{y ∈ {0,...,K, n−K,...,n}} C(n, y) (1/2)^y (1/2)^{n−y} ≤ α.    (312)

For example, if n = 100, α = 0.05, then K = 39, so if the number of heads observed is less than or equal to 39 or greater than or equal to 100 − 39 = 61, the null hypothesis is rejected.
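The significance point quoted above can be recomputed directly from the binomial distribution; a minimal sketch (not part of the notes) is:

```python
# Sketch: largest K with Prob(Y <= K or Y >= n-K) <= alpha under H0: Y ~ Binomial(n, 1/2).
from scipy.stats import binom

n, alpha = 100, 0.05
K = max(k for k in range(n // 2)
        if binom.cdf(k, n, 0.5) + binom.sf(n - k - 1, n, 0.5) <= alpha)
print(K)    # 39, as quoted in the text
```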
An alternative to calculating a significance point is to calculate the P-value, which
is defined as the probability of obtaining the observed value of a test statistic, or a
17 The ontological objection goes something like this: if there is only one universe, and therefore only one true hypothesis, how can one talk about the probability of a hypothesis being true? In practice, however, one is often faced with a number of null hypotheses, for example: of 20,000 genes, hypothesis i is that gene number i is not differentially expressed. In this case, the prior probability of the null hypothesis being true might be interpreted as the probability that a gene chosen at random, all genes being equally likely, is not differentially expressed.
Table 7: Possible testing scenarios and the probability of acceptance and rejection under each contingency.

                                       H0 true                                   H1 true
    H0 accepted                        1 − α                                     Type II error rate: β (false negatives)
    H0 rejected (in favour of H1)      Type I error rate: α (false positives)    1 − β

more extreme value in the direction indicated by H1 . If the P-value is less than the
prescribed type-I error , then the null hypothesis H0 is rejected. Before an experiment
is conducted, the eventual P-value is a random variable. If the test statistic is continuous,
and the null hypothesis is true, then it is a uniform random variable on the interval [0, 1].
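To illustrate the P-value calculation described above, and the uniformity of P-values under H0 for a continuous test statistic, here is a short simulation sketch of our own (assuming numpy and scipy; the observed count of 61 heads and the normal test statistic are arbitrary illustrative choices).

```python
# Two-sided P-value for the coin-tossing example: 61 heads in n = 100 tosses.
import numpy as np
from scipy.stats import binom, norm

n, y_obs = 100, 61
p_value = 2 * binom.sf(y_obs - 1, n, 0.5)   # P(Y >= 61) + P(Y <= 39), by symmetry
print(p_value)                              # about 0.035 < 0.05, so H0 is rejected

# Under H0, P-values of a continuous statistic (here Z ~ N(0,1), two-sided test)
# are uniform on [0, 1]; the rejection rate at level alpha is then close to alpha.
rng = np.random.default_rng(0)
z = rng.standard_normal(100000)
p = 2 * norm.sf(np.abs(z))
print(np.mean(p < 0.05))                    # close to 0.05
```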

C  Wald's identity and martingales

Below is a derivation of Wald's identity, which was required in Section 5.2 for deriving
the properties of general random walks. The derivation is adapted from that given in the
texts by Karlin and Taylor [13] (Section 6.4) and Grimmett and Stirzaker [14] (Section
7.9).
First we need to define a martingale.

Definition: Given a series X0, X1, X2, . . . of random variables, Wi is said to
be a martingale with respect to Xi if, for each i, Wi = wi(X0, X1, . . . , Xi) for
some function wi such that
\[
E(|W_i|) < \infty, \qquad i = 0, 1, \dots,
\tag{313}
\]
and
\[
E(W_{i+1} \mid X_0 = x_0, X_1 = x_1, \dots, X_i = x_i) = w_i(x_0, x_1, \dots, x_i).
\tag{314}
\]
One way to think of a martingale is as a gambling game in which at round i the gambler's
bank balance is Wi, determined from a series of random events (such as the throw of a
die) whose outcomes are X0 to Xi. To be a martingale the game must be fair in the sense
that the expected bank balance after the next round, namely E(W_{i+1} | W_i = w_i), is equal
to the current balance wi.
We begin with two lemmas.

Lemma 1: For any martingale, E(Wn) = E(W0), n ≥ 0.

Proof. The proof is by induction. The case n = 0 is trivial. For n > 0,
\[
\begin{aligned}
E(W_n) &= \sum_{x_0, x_1, \dots, x_{n-1}} E(W_n \mid X_0 = x_0, X_1 = x_1, \dots, X_{n-1} = x_{n-1})\,
          \mathrm{Prob}(X_0 = x_0, X_1 = x_1, \dots, X_{n-1} = x_{n-1}) \\
       &= \sum_{x_0, x_1, \dots, x_{n-1}} w_{n-1}(x_0, x_1, \dots, x_{n-1})\,
          \mathrm{Prob}(X_0 = x_0, X_1 = x_1, \dots, X_{n-1} = x_{n-1}) \\
       &= E(W_{n-1}),
\end{aligned}
\tag{315}
\]
and the result follows.

Lemma 2: For any restricted subset
\[
\Omega \subseteq X_0 \times X_1 \times \dots \times X_k
\tag{316}
\]
of the joint range of X0, X1, . . . , Xk, if Wi is a martingale with respect to Xi, then
\[
E(W_n \mid (X_0, \dots, X_k) \in \Omega) = E(W_k \mid (X_0, \dots, X_k) \in \Omega), \qquad n \ge k.
\tag{317}
\]

Proof. Again the proof is by induction. The result is obvious for n = k. For n > k,
assume
\[
E(W_{n-1} \mid (X_0, \dots, X_k) \in \Omega) = E(W_k \mid (X_0, \dots, X_k) \in \Omega).
\tag{318}
\]
Then
\[
\begin{aligned}
E(W_n \mid (X_0, \dots, X_k) \in \Omega)
&= \sum_{\{(x_0,\dots,x_{n-1})\,:\,(x_0,\dots,x_k) \in \Omega\}}
   E(W_n \mid X_0 = x_0, \dots, X_{n-1} = x_{n-1})\,
   \mathrm{Prob}(X_0 = x_0, \dots, X_{n-1} = x_{n-1} \mid (X_0, \dots, X_k) \in \Omega) \\
&= \sum_{\{(x_0,\dots,x_{n-1})\,:\,(x_0,\dots,x_k) \in \Omega\}}
   w_{n-1}(x_0, \dots, x_{n-1})\,
   \mathrm{Prob}(X_0 = x_0, \dots, X_{n-1} = x_{n-1} \mid (X_0, \dots, X_k) \in \Omega) \\
&= E(W_{n-1} \mid (X_0, \dots, X_k) \in \Omega) \\
&= E(W_k \mid (X_0, \dots, X_k) \in \Omega),
\end{aligned}
\tag{319}
\]
as required.

We are now ready to apply these results to random walks. If Si is the ith step in a
random walk, Tn = Σ_{i=1}^n Si is the cumulative displacement from the origin at the nth
step, and mS(θ) is the common moment generating function of the Si, we claim that
\[
W_0 = 1; \qquad W_n = \frac{e^{\theta T_n}}{m_S(\theta)^n} \quad \text{for } n \ge 1,
\tag{320}
\]
is a martingale with respect to Si.
To check the first requirement, note that for any given n,
\[
e^{\theta T_n} \le e^{n\theta \times (\text{largest step size})}
\tag{321}
\]
and mS(θ) is bounded below, so Wn is positive and bounded above, thus E(|Wn|) =
E(Wn) < ∞, as required. To check the second requirement,
\[
\begin{aligned}
E(W_{n+1} \mid S_1 = s_1, S_2 = s_2, \dots, S_n = s_n)
&= E\!\left( \frac{e^{\theta(\sum_{i=1}^{n} S_i + S_{n+1})}}{m_S(\theta)^{n+1}}
   \,\middle|\, S_1 = s_1, S_2 = s_2, \dots, S_n = s_n \right) \\
&= \frac{e^{\theta \sum_{i=1}^{n} s_i}}{m_S(\theta)^{n+1}}\, E(e^{\theta S_{n+1}}) \\
&= \frac{e^{\theta t_n}}{m_S(\theta)^{n+1}}\, m_S(\theta)
   \qquad \left(\text{where } t_n = \sum_{i=1}^{n} s_i\right) \\
&= w_n(s_1, \dots, s_n),
\end{aligned}
\tag{322}
\]
as required. Thus Wn is indeed a martingale with respect to Si.
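The martingale property E(Wn) = E(W0) = 1 can be checked numerically. Below is a small simulation sketch of our own (not from the notes; it assumes numpy, and the step probabilities, θ and n are arbitrary illustrative choices) for the simple walk with steps +1 with probability p and −1 with probability q = 1 − p.

```python
# Check numerically that W_n = exp(theta*T_n)/m_S(theta)^n in Eq. (320) has mean 1.
import numpy as np

p, q = 0.3, 0.7
theta = 0.5                                     # any theta at which the mgf is finite
m_S = p * np.exp(theta) + q * np.exp(-theta)    # moment generating function of one step

rng = np.random.default_rng(1)
n, replicates = 10, 200000
steps = rng.choice([1, -1], size=(replicates, n), p=[p, q])
T_n = steps.sum(axis=1)                         # cumulative displacement after n steps
W_n = np.exp(theta * T_n) / m_S**n

print(W_n.mean())                               # close to 1, as Lemma 1 requires
```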


We further define the stopping time as the random variable N = the smallest n such
that h + Tn ≥ b or h + Tn ≤ a (see Fig. 15). We are now ready to prove

Wald's identity: For any θ such that mS(θ) ≥ 1,
\[
E(W_N) \equiv E\!\left( \frac{e^{\theta T_N}}{m_S(\theta)^N} \right) = 1.
\tag{323}
\]
Proof.
\[
\begin{aligned}
E(W_N) &= \lim_{n\to\infty} \sum_{k=1}^{n} E(W_N \mid N = k)\,\mathrm{Prob}(N = k) \\
       &= \lim_{n\to\infty} \sum_{k=1}^{n} E(W_k \mid N = k)\,\mathrm{Prob}(N = k) \\
       &= \lim_{n\to\infty} \sum_{k=1}^{n} E(W_n \mid N = k)\,\mathrm{Prob}(N = k) \\
       &\qquad \text{(using Lemma 2, and interpreting the condition } N = k
         \text{ as the event } (S_1, \dots, S_k) \in \Omega_k, \\
       &\qquad \text{ where } \Omega_k \text{ is the subset of random walks for which the stopping time is } k\text{)} \\
       &= \lim_{n\to\infty} \left\{ \sum_{k=1}^{\infty} E(W_n \mid N = k)\,\mathrm{Prob}(N = k)
          \;-\; \sum_{k=n+1}^{\infty} E(W_n \mid N = k)\,\mathrm{Prob}(N = k) \right\}.
\end{aligned}
\tag{324}
\]

The first term in Eq. (324) is, using Lemma 1,
\[
\lim_{n\to\infty} E(W_n) = \lim_{n\to\infty} E(W_0) = 1.
\tag{325}
\]
It remains to show that the second term vanishes as n → ∞. Note that if k > n,
\[
0 \le E(W_n \mid N = k) = E\!\left( \frac{e^{\theta T_n}}{m_S(\theta)^n} \,\middle|\, N = k \right) \le e^{\theta(b-h)},
\tag{326}
\]
since the condition N = k ensures Tn ≤ b − h and, by hypothesis, mS(θ) ≥ 1. Thus
\[
0 \le \sum_{k=n+1}^{\infty} E(W_n \mid N = k)\,\mathrm{Prob}(N = k) \le e^{\theta(b-h)}\,\mathrm{Prob}(N > n),
\tag{327}
\]
which is the tail of a convergent sum of positive terms, and so → 0 as n → ∞.
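Wald's identity itself can also be verified by simulation. The sketch below is our own illustration (assuming numpy; the barriers a = −2, b = 3, starting point h = 0, step probabilities and the helper name one_walk are all illustrative choices). It uses the same ±1 walk as above, with θ = log(q/p) chosen so that mS(θ) = 1.

```python
# Simulation check of Wald's identity (323): E(exp(theta*T_N)/m_S(theta)^N) = 1.
import numpy as np

p, q = 0.3, 0.7
theta = np.log(q / p)                            # here m_S(theta) = 1 exactly
m_S = p * np.exp(theta) + q * np.exp(-theta)

a, b, h = -2, 3, 0
rng = np.random.default_rng(2)

def one_walk():
    """Run one walk until h + T_n >= b or h + T_n <= a; return (T_N, N)."""
    T, n = 0, 0
    while a < h + T < b:
        T += rng.choice([1, -1], p=[p, q])
        n += 1
    return T, n

samples = [one_walk() for _ in range(50000)]
W_N = np.array([np.exp(theta * T) / m_S**N for T, N in samples])
print(W_N.mean())                                # close to 1, as Wald's identity asserts
```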

D The Dirac delta function


The Dirac delta function (or just delta function for short) is an example of what is
known as a generalised function. However, to have a practical working understanding of
the delta function it is not necessary to delve into the arcane mathematics of Lebesgue
measure theory and generalised functions. Here we will give a simple heuristic approach
which is adequate for carrying through the calculations occurring in these notes.
The Dirac delta function, δ(x), is equal to zero everywhere on the real number line
except at x = 0, where it is envisaged as an infinite spike, the area beneath which is 1.
Thus
\[
\delta(x) = \begin{cases} \infty & \text{if } x = 0, \\ 0 & \text{if } x \ne 0, \end{cases}
\tag{328}
\]
and
\[
\int_{-\epsilon}^{\epsilon} \delta(x)\, dx = 1,
\tag{329}
\]
for any ε > 0. The delta function can be represented as a limit of the normal distribution
density functions centred about zero as the variance tends to 0 (see Figure 28).
An important property of the delta function is that, for any function f(x) which is
continuous at x = a,
\[
\int_{-\infty}^{\infty} f(x)\,\delta(x - a)\, dx = f(a).
\tag{330}
\]
By formally integrating by parts we also obtain meaning for derivatives of the delta
function, namely
\[
\int_{-\infty}^{\infty} f(x)\,\delta'(x - a)\, dx = -f'(a), \qquad
\int_{-\infty}^{\infty} f(x)\,\delta''(x - a)\, dx = f''(a).
\tag{331}
\]

[Figure 28 shows plots of the normal density y = e^{−x²/(2σ²)}/√(2πσ²) for decreasing σ over −1 ≤ x ≤ 1.]

Figure 28: Representation of the Dirac delta function as the limit of a normal distribution
centred about zero as its variance σ² tends to zero.

References
[1] W.J. Ewens and G.R. Grant, Statistical Methods in Bioinformatics: An Introduction,
2nd ed., Springer, New York, 2005.

[2] W.J. Ewens, Mathematical Population Genetics, 2nd ed., Springer, New York, 2004.

[3] International Human Genome Sequencing Consortium, Nature 409, 860–921 (2001).

[4] J.C. Venter et al., Science 291, 1304–1351 (2001).

[5] M. Lothaire, Applied Combinatorics on Words, Encyclopedia of Mathematics and its
Applications 105, Cambridge University Press, 2005. ISBN 978-0-521-84802-2, MR2165687.

[6] S. Robin, F. Rodolphe and S. Schbath, DNA, Words and Models, Cambridge University
Press, Cambridge, 2005.

[7] C.J. Burden, P. Leopardi and S. Foret, J. Comp. Biol. 21, 41–63 (2014).

[8] J. Percus and O. Percus, Commun. Pure Applied Math. 59, 145–160 (2005).

[9] S. Henikoff and J.G. Henikoff, Proc. Nat. Acad. Sci. 89, 10915–10919 (1992).

[10] A. Isaev, Introduction to Mathematical Methods in Bioinformatics, Springer-Verlag,
Berlin, 2004.

[11] S. Pietrokovski, J.G. Henikoff and S. Henikoff, The Blocks database – a system for
protein classification, Nucleic Acids Research 24, 197–299 (1996).

[12] S. Karlin and S.F. Altschul, Proc. Natl. Acad. Sci. USA 87, 2264 (1990).

[13] S. Karlin and H.M. Taylor, A First Course in Stochastic Processes, 2nd ed., Academic
Press, New York, 1975.

[14] G. Grimmett and D. Stirzaker, Probability and Random Processes, Oxford University
Press, Oxford, 1982.

[15] E. Pettersson, J. Lundeberg and A. Ahmadian, Genomics 93, 105–111 (2009).

[16] E. Mardis, Trends Genet. 24, 133–141 (2008).

[17] A. Mortazavi, B.A. Williams, K. McCue, L. Schaeffer and B. Wold, Nat. Methods 5,
628 (2008).

[18] C. Trapnell et al., Bioinformatics 25, 1105–1111 (2009).

[19] M.D. Robinson and G.K. Smyth, Bioinformatics 23, 2881–2887 (2007).

[20] M.D. Robinson, D.J. McCarthy and G.K. Smyth, Bioinformatics 26, 139–140 (2010).

[21] S. Anders and W. Huber, Genome Biology 11, R106 (2010).

[22] J.C. Marioni et al., Genome Research 18, 1509–1517 (2008).

[23] M.D. Robinson and A. Oshlack, Genome Biology 11, R25 (2010).

[24] J.K. Pickrell et al., Nature 464, 768–772 (2010).

[25] N.A. Weiss, A Course in Probability, Addison-Wesley, 2005.

[26] R.A. Fisher, The Genetical Theory of Natural Selection, Clarendon Press, Oxford, 1930.

[27] S. Wright, Evolution in Mendelian populations, Genetics 16, 97–159 (1931).

[28] N.L. Johnson, S. Kotz and A.W. Kemp, Univariate Discrete Distributions, 2nd ed.,
Wiley, New York, 1992.

[29] G.A. Watterson, On the number of segregating sites in genetical models without
recombination, Theoretical Population Biology 7, 256–276 (1975).

[30] C. Vogl, Estimating the scaled mutation rate and mutation bias with site frequency
data, Theoretical Population Biology 98, 19–27 (2014).

[31] M. Abramowitz and I. Stegun, editors, Handbook of Mathematical Functions, 9th ed.,
Dover, New York, 1970.