BIOINFORMATIK II
PROBABILITY & STATISTICS
Summer semester 2006
University of Zürich and ETH Zürich
Web page
http://www.math.unizh.ch/baps/lectures/bioinf2.html
Course content
Basic probability concepts
probability distribution, independence, conditional probability,
expectation, standard deviation, . . .
Concepts and principles in statistics
estimation, hypothesis testing, maximum likelihood, likelihood
ratio, significance, p-value, . . .
Markov chains
transition matrix, stationary distribution, reversibility, random
walks, hidden Markov models, . . .
Models and algorithms in bioinformatics
sequence alignment, models for evolution,
PAM/BLOSUM-matrices, BLAST, . . .
[Figure: sequences aggtgaccct...gtcattt, tggagccat...gtcgatt and
acgtcaccct...gacattt related to one another by evolutionary changes]
Seq 1: ggagactgtagacagctaatgctata
Seq 2: gaacgccctagccacgagcccttatc
Sequence length: 26 nucleotides.
11 of 26 positions agree.
Conclusion? Generated purely by chance or by some other
mechanism?
To be able to answer this, one needs to understand properties of
random sequences.
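One way to build that understanding is by simulation. The following sketch (the function name `count_matches` and the uniform letter frequencies are our own choices for illustration, not part of the course material) generates two independent random sequences and counts the positions where they agree:

```python
import random

def count_matches(n, probs, seed=0):
    """Generate two independent random sequences of length n and
    count the positions where they agree.  `probs` maps each letter
    to its probability."""
    rng = random.Random(seed)
    letters = list(probs)
    weights = list(probs.values())
    seq1 = rng.choices(letters, weights=weights, k=n)
    seq2 = rng.choices(letters, weights=weights, k=n)
    return sum(a == b for a, b in zip(seq1, seq2))

# With uniform nucleotide probabilities a single position matches with
# probability 1/4, so for n = 26 we expect about 6.5 matches on average.
probs = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
print(count_matches(26, probs))
```

Running this many times shows how often a purely random pair of length-26 sequences reaches 11 or more matches, which is exactly the kind of question the course develops tools for.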
Probability distribution of a RV
The most important feature of a random variable is its probability
distribution. The probability distribution of a random variable X is the
mechanism (mathematically: a function) that specifies with what
probability X takes each of its possible values.
For a discrete random variable the probability distribution can be
expressed either by its probability function or by its distribution function.
Example: Flip a coin and let X = 1 if heads occurs, otherwise let X = 0.
Then the sample space is S = {0, 1} and the probability function
pX(i) := P(X = i) is given by

pX(0) = 1/2,   pX(1) = 1/2.

Note that pX(0) + pX(1) = 1/2 + 1/2 = 1.
P(X = k) = (N choose k) p^k (1-p)^(N-k). Why?

p = P(two a, or two c, or two g, or two t)
  = P(two a) + P(two c) + P(two g) + P(two t)
  = pa·pa + pc·pc + pg·pg + pt·pt
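For example, the match probability p can be computed directly from the letter frequencies by summing the squares (the function name below is hypothetical, chosen for illustration):

```python
def match_probability(freqs):
    """Probability that two independently drawn letters agree:
    p = sum of the squared letter frequencies."""
    return sum(f * f for f in freqs.values())

# Uniform frequencies give p = 4 * (1/4)^2 = 1/4.
print(match_probability({"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}))  # prints 0.25
```

Note that skewed frequencies raise p: concentrating probability on fewer letters makes chance matches more likely.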
[Figure: Seq 1 and Seq 2 aligned position by position, with the k
matching positions (probability p^k) highlighted]
We have to add the probabilities of the different configurations to get the
total probability:

P(X = k) = p^k (1-p)^(N-k) + ... + p^k (1-p)^(N-k)

How many terms? A combinatorial argument shows that there are

(N choose k) = N! / (k!(N-k)!)

possible configurations of k matches and N-k mismatches. Therefore

P(X = k) = (N choose k) p^k (1-p)^(N-k).
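The binomial formula above translates directly into code; a minimal sketch (the helper name `binom_pmf` is our own) using the standard-library binomial coefficient:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p): C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two length-10 sequences with match probability p = 1/4:
# the probabilities over k = 0, ..., 10 must sum to 1.
total = sum(binom_pmf(k, 10, 0.25) for k in range(11))
print(total)
```

Checking that the probabilities sum to 1 is a quick sanity test that the formula (and the code) is consistent.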
...other distributions?
In general, different RVs have different probability distributions (i.e.
different probability functions).
Consider the random sequence example again, and define
Y = the first position where a match occurs
(counted from left to right).
Seq1 : c g t c g t ... g
Seq2 : g a c c c t ... t
Then the probability function of Y would be

pY(k) = (1-p)^(k-1) p

for k = 1, 2, 3, ..., where p is the probability of having a match at a
fixed position i, 1 ≤ i ≤ N.
This is called the geometric distribution with parameter p.
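The geometric distribution can be checked by simulating the first-match position directly (the function names below are illustrative, not from the slides):

```python
import random

def first_match_position(n, p, seed=0):
    """Simulate Y = first position (1-based) where a match occurs,
    scanning left to right; returns None if no match in n positions."""
    rng = random.Random(seed)
    for i in range(1, n + 1):
        if rng.random() < p:
            return i
    return None

def geometric_pmf(k, p):
    """P(Y = k) = (1-p)^(k-1) * p for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p
```

For p = 1/4, about a quarter of simulated runs should report Y = 1, matching pY(1) = p.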
Probabilities of events
Let S be the set of possible outcomes of some experiment
(S is the sample space).
An event is something that either will or will not occur when the
experiment is conducted (mathematically, E ⊆ S).
Ex.1 Experiment: counting matches between two sequences of length
1000. Then S = {0, 1, . . . , 1000}.
The event E = at least 50% identity is E = {500, 501, . . . , 1000}.
Ex.2 Experiment: Rolling a die once. Then S = {1, 2, 3, 4, 5, 6}.
Then E1 = the number turning up is at least 3 = {3, 4, 5, 6},
and E2 = the number turning up is odd = {1, 3, 5}.
Conditional probabilities
A fair die is rolled once. Suppose that it is known that the number
turning up is less than or equal to three.
How likely is it then that it is an odd number?
P(number is odd | number is less than or equal to 3) = 2/3
The information given is: the number is 1, 2 or 3.
Two of these three outcomes are odd: therefore 2/3.
(NOTE: Without the additional information given,
P(number is odd ) = 1/2).
...conditional probabilities...
Suppose that E1 and E2 are two events associated with some random
experiment.
Then the conditional probability P(E1 |E2 ) that E1 occurs, given that
E2 occurs, is defined as
P(E1 | E2) = P(E1 ∩ E2) / P(E2).
Here we assume that P(E2 ) > 0.
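The definition can be verified on the die example with exact arithmetic; a small sketch (the function name is our own) for events on a uniform sample space:

```python
from fractions import Fraction

def conditional_probability(e1, e2, sample_space):
    """P(E1 | E2) = P(E1 ∩ E2) / P(E2) for a uniform sample space."""
    p_e2 = Fraction(len(e2 & sample_space), len(sample_space))
    p_both = Fraction(len(e1 & e2 & sample_space), len(sample_space))
    return p_both / p_e2

# Die example from the slides: odd, given that the number is at most 3.
S = {1, 2, 3, 4, 5, 6}
odd = {1, 3, 5}
at_most_three = {1, 2, 3}
print(conditional_probability(odd, at_most_three, S))  # prints 2/3
```

Using `Fraction` keeps the result exact, so 2/3 comes out as 2/3 rather than 0.6666....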
Independence
Mathematically, two events E1 and E2 are said to be independent if
and only if
P(E1 ∩ E2) = P(E1) · P(E2)
holds. This is equivalent to
P(E1 |E2 ) = P(E1 ) and P(E2 |E1 ) = P(E2 ),
so for independent events E1 and E2 , the information about the
experiment contained in E2 says nothing about the occurrence of E1 (and
vice versa).
Think of two random variables X and Y as being independent if the
value of one does not in any way affect the probabilities associated with
the possible values of the other one.
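The defining product condition is easy to test mechanically for events on a uniform sample space; in this sketch (our own helper, not from the slides) "odd" and "at most 4" on a fair die turn out to be independent, since P({1,3}) = 1/3 = 1/2 · 2/3:

```python
from fractions import Fraction

def is_independent(e1, e2, sample_space):
    """Check P(E1 ∩ E2) == P(E1) * P(E2) on a uniform sample space."""
    n = len(sample_space)
    p1 = Fraction(len(e1 & sample_space), n)
    p2 = Fraction(len(e2 & sample_space), n)
    p12 = Fraction(len(e1 & e2 & sample_space), n)
    return p12 == p1 * p2

S = {1, 2, 3, 4, 5, 6}
print(is_independent({1, 3, 5}, {1, 2, 3, 4}, S))  # prints True
```

By contrast, "odd" and "at most 3" are not independent: P({1,3}) = 1/3, but 1/2 · 1/2 = 1/4.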
...Independence...
Once again: match counts, two random sequences...
Seq1: gtacacgggata...tacgtgact
Seq2: cgaggtagtcga...tttatacga
Each position equals a, c, g, or t with probabilities pa , pc , pg , pt , and we
define X = the number of matches.
If the positions in the sequences are independently generated, then
X ∼ Bin(N, p).
X will not be binomially distributed if successive nucleotides are
dependent (i.e. if neighbors are dependent on each other)!
Why...?
...dependence...
The positions in the sequences can be dependent in different ways... One
extreme case is:
Let the letter in the first position be a, c, g or t with probabilities
pa , pc , pg , pt .
Let the other letters in positions 2, 3, . . . , N be equal to the first
letter!
Then the sequences will be of the following form:
aaaaaaaa...aaaaaaa, cccccccc...ccccccccc,
ggggggg...gggggggg, or ttttttt...ttttttttttttt.
Then, the possible values of X are 0 and N . Hence, X cannot be
binomially distributed.
Expected value
Once again: two random sequences of length N = 1000, where the letters
(nucleotides) in each position are equally likely:

P(a) = P(c) = P(g) = P(t) = 1/4.
In general, with X being a discrete RV, the expected value (also called
the expectation or the mean) μ = E[X] is defined as

E[X] = Σ_{k ∈ S} k · P(X = k)
and

E[g(X)] = Σ_{k ∈ S} g(k) · P(X = k)

for functions g.
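Both definitions can be computed with one small helper (ours, for illustration) that sums g(k)·P(X = k) over the support, with g defaulting to the identity:

```python
def expectation(pmf, support, g=lambda k: k):
    """E[g(X)] = sum over k in S of g(k) * P(X = k)."""
    return sum(g(k) * pmf(k) for k in support)

# Fair-coin example from earlier: X in {0, 1} with probability 1/2 each.
coin_pmf = lambda k: 0.5
print(expectation(coin_pmf, [0, 1]))                     # E[X]   = 0.5
print(expectation(coin_pmf, [0, 1], g=lambda k: k * k))  # E[X^2] = 0.5
```

The same helper works for any discrete distribution once its probability function and support are supplied.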
Random variation...
If X ∼ Bin(1000, 0.25) then E[X] = 1000 · 0.25 = 250,
so we would expect to see approximately 250 matches in the sequence
matching example.
That is, 251 or 249 would not be a surprising result...
But what about 240? 280? 350? ...
X is a random variable, so there will be some variability around its
expected value... How much variation is expected?
Standard deviation
X is a random variable, so there will be some variability around its
expected value... How much variation is expected?
This expected variation is captured by the standard deviation:

σ := SD[X] := √Var[X] = √(E[(X − E[X])²]).
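For a binomial RV the variance has the closed form Var[X] = N·p·(1−p), a standard fact not derived on these slides. Applying it to the matching example (helper name ours):

```python
from math import sqrt

def binom_sd(n, p):
    """Standard deviation of Bin(n, p): sqrt(n * p * (1 - p))."""
    return sqrt(n * p * (1 - p))

# For the matching example X ~ Bin(1000, 0.25):
sd = binom_sd(1000, 0.25)
print(round(sd, 2))  # prints 13.69
```

So 240 or 260 matches is well within one standard deviation of 250, while 350 would lie more than seven standard deviations away, which is extremely surprising under the pure-chance model.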
Variance formulas
Definition:
σ² = Var[X] := E[(X − E[X])²].

Alternative formula:

σ² = Var[X] = E[X²] − (E[X])².