
BIOINFORMATIK II
PROBABILITY & STATISTICS
Summer semester 2006
University of Zürich and ETH Zürich

Lecture 1: Basic probability.

Prof. Andrew Barbour


Dr. Béatrice de Tilière
Adapted from a course by
Dr. D. Schuhmacher & Dr. D. Svensson.


Web page
http://www.math.unizh.ch/baps/lectures/bioinf2.html

You will find there:


Up-to-date information;
Transparencies of the lectures;
Exercise sheets with solutions (...in due time);
Additional background material.


Course content

Basic probability concepts:
probability distribution, independence, conditional probability, expectation, standard deviation, ...

Concepts and principles in statistics:
estimation, hypothesis testing, maximum likelihood, likelihood ratio, significance, p-value, ...

Markov chains:
transition matrix, stationary distribution, reversibility, random walks, hidden Markov models, ...

Models and algorithms in bioinformatics:
sequence alignment, models for evolution, PAM/BLOSUM matrices, BLAST, ...

Principles, rather than how-to!



[Figure: three DNA sequences, aggtgaccct...gtcattt, tggagccat...gtcgatt and acgtcaccct...gacattt, linked by arrows labelled "evolutionary changes", illustrating how present-day sequences diverge from a common ancestor.]


Why probability and statistics in bioinformatics?


Given: Two sequences from two species. Common ancestor?

ggagactgtagacagctaatgctata
gaacgccctagccacgagcccttatc

Sequence length: 26 nucleotides.
11 of 26 positions agree.
Conclusion? Generated purely by chance, or by some other mechanism?
To be able to answer this, one needs to understand the properties of random sequences.

Probability & statistics in bioinformatics?

For ...
modelling sequence evolution (Markov chains);
inferring phylogenetic trees (maximum-likelihood trees);
gene prediction (hidden Markov models);
analysis of microarray data (multiple testing, multivariate statistics);
evaluating sequence similarity in BLAST searches (extreme values, random walks);
much more!


Random variables (RVs)

A random variable = a numerical quantity whose value depends on the outcome of some chance experiment.

Ex 1. Flip a coin and let X = 1 if heads occurs, otherwise let X = 0. Then X is a random variable.

Ex 2. Two DNA sequences are randomly chosen from a database. Then X = the number of matches between the sequences is a RV.

Ex 3. Let X = the waiting time until a certain event occurs; e.g., the time until a nucleotide substitution first occurs at a specified position in a genome. Then X is a RV.

There are two main types of random variables:
either DISCRETE (as in examples 1 and 2)
or CONTINUOUS (example 3).

Probability distribution of a RV

The most important feature of a random variable is its probability distribution: the mechanism (mathematically: the function) that tells us with what probability the random variable takes which values.

For a discrete random variable, the probability distribution can be expressed either by its probability function or by its distribution function.


Probability and distribution functions (DISCRETE RVs)

Let X be any discrete random variable, and denote the set of its possible values by S (= sample space).

Associated with the random variable X are

the probability function $p_X$:
$p_X(i) := P(X = i) \in [0, 1]$, for $i \in S$;

and the (cumulative) distribution function $F_X$:
$F_X(j) := P(X \le j) = \sum_{i \in S,\, i \le j} P(X = i) \in [0, 1]$, for $j \in S$.

A mathematical analysis of a random variable typically requires explicit formulas for these functions!

In principle, these two functions contain all essential information concerning the properties and behavior of the random variable.
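
To make the definitions concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that tabulates $p_X$ and $F_X$ for a fair die:

```python
# Probability function p_X and distribution function F_X for a fair die
# (an illustration; any discrete pmf given as {value: probability} works).
from fractions import Fraction

S = range(1, 7)                              # sample space
p_X = {i: Fraction(1, 6) for i in S}         # p_X(i) = P(X = i)

def F_X(j):
    """F_X(j) = P(X <= j) = sum of p_X(i) over i in S with i <= j."""
    return sum((p for i, p in p_X.items() if i <= j), Fraction(0))

print(p_X[3])   # 1/6
print(F_X(3))   # 1/2
```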

Similar formulas hold for continuous random variables X: the (cumulative) distribution function is then given by

$F_X(t) := P(X \le t) = \int_{x \le t} f_X(x)\, dx$, for $t \in S$,

for some probability density function $f_X$ with $f_X(x) \ge 0$.

Caution with the interpretation in the continuous case:

the density $f_X(x)$ is NOT equal to $P(X = x)$ (which in fact is always zero for continuous RVs!);
$f_X(x)$ is NOT a probability (that is, $f_X(x)$ might be > 1 ...).

Think of it as $P(t \le X \le t + h) \approx f_X(t) \cdot h$, where h is small and h > 0.

Discrete random variables are perhaps more important to bioinformatics... (in some sense)


Example: Flip a coin and let X = 1 if heads occurs, otherwise let X = 0. Then the sample space is S = {0, 1} and the probability function $p_X(i) := P(X = i)$ is given by

$p_X(0) = \tfrac{1}{2}$,  $p_X(1) = \tfrac{1}{2}$.

The distribution function is given by

$F_X(0) := P(X \le 0) = \tfrac{1}{2}$

and

$F_X(1) := P(X \le 1) = P(X = 0) + P(X = 1) = \tfrac{1}{2} + \tfrac{1}{2} = 1$.


Ex: Random DNA sequences with i.i.d. letters.

Two sequences of N letters are randomly generated, i.e.
the letters are independently generated,
each position equals a, c, g, or t with probabilities $p_a, p_c, p_g, p_t$.

Seq1: gtacacgggata...tacgtgact
Seq2: cgaggtagtcga...tttatacga

Let X = the number of matches. Then the probability function of X is

$P(X = k) = \binom{N}{k} p^k (1-p)^{N-k}$, where $\binom{N}{k} = \frac{N!}{k!(N-k)!}$,

for some match probability $p \in [0, 1]$.  [$n! = 1 \cdot 2 \cdots (n-1) \cdot n$.]

This is known as the binomial distribution with parameters N and p.
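
As a hedged illustration (my own sketch, not from the slides), the following Python code simulates match counts between two i.i.d. uniform DNA sequences and compares the empirical frequencies with the binomial formula; N = 26 and the uniform letter probabilities are borrowed from the earlier two-species example, and the tail probability P(X ≥ 11) quantifies how surprising 11 matches would be by chance alone.

```python
# Simulation vs. formula for X = number of matches (illustrative sketch).
import random
from math import comb

N = 26            # sequence length, as in the earlier two-species example
letters = "acgt"
p = 0.25          # match probability p_a^2 + p_c^2 + p_g^2 + p_t^2 for uniform letters

def count_matches(n):
    """Compare two i.i.d. uniform sequences position by position; count matches."""
    return sum(random.choice(letters) == random.choice(letters) for _ in range(n))

def binom_pmf(k, n, q):
    """P(X = k) for X ~ Bin(n, q)."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)

trials = 50_000
counts = [count_matches(N) for _ in range(trials)]
for k in (4, 7, 11):
    print(k, counts.count(k) / trials, round(binom_pmf(k, N, p), 4))

# How surprising are the observed 11 matches under pure chance?
print(sum(binom_pmf(k, N, p) for k in range(11, N + 1)))   # P(X >= 11)
```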


$P(X = k) = \binom{N}{k} p^k (1-p)^{N-k}$. Why?

Step 1: Fix any position (j, say).

Let p = P(match in the position considered) (which is independent of the position chosen). Then

$p = P(\text{two a, or two c, or two g, or two t})$
$= P(\text{two a}) + P(\text{two c}) + P(\text{two g}) + P(\text{two t})$
$= p_a p_a + p_c p_c + p_g p_g + p_t p_t.$

So the match probability is $p = p_a^2 + p_c^2 + p_g^2 + p_t^2$

($p = 1/4$ if $p_a = p_c = p_g = p_t = 1/4$).

The probability of a mismatch is $(1 - p)$.

Step 2: What is P(X = k)?
Exactly k matches and N − k mismatches can occur in different ways: for example,

[Figure: Seq 1 and Seq 2 aligned position by position; k matching positions (probability $p^k$) and N − k mismatching positions (probability $(1-p)^{N-k}$) are marked.]

Each such configuration has probability $p^k (1-p)^{N-k}$.



We have to add the probabilities of the different configurations to get the total probability:

$P(X = k) = p^k (1-p)^{N-k} + \ldots + p^k (1-p)^{N-k}$

How many terms? Combinatorial argument: there are

$\binom{N}{k} = \frac{N!}{k!(N-k)!}$

possible configurations of k matches and N − k mismatches. Therefore

$P(X = k) = \binom{N}{k} p^k (1-p)^{N-k}$.
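
For small N the counting argument can be verified directly; the sketch below (my own, under the uniform-letter assumption) enumerates all placements of k matches among N positions and checks the count against N!/(k!(N−k)!):

```python
# Sanity check: enumerate every placement of k matches among N positions
# and confirm the count equals the binomial coefficient (illustration).
from itertools import combinations
from math import comb, factorial

N, k, p = 8, 3, 0.25

n_configs = sum(1 for _ in combinations(range(N), k))   # choose k match positions
assert n_configs == comb(N, k) == factorial(N) // (factorial(k) * factorial(N - k))

# Each configuration has probability p^k (1-p)^(N-k); summing gives P(X = k).
prob = n_configs * p**k * (1 - p) ** (N - k)
print(prob)   # equals comb(N, k) * p^k * (1-p)^(N-k)
```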


The binomial distribution

Any random variable Y having the probability function

$p(k) = \binom{N}{k} p^k (1-p)^{N-k}$

is said to be binomially distributed (important in general, not just for counting matches between random sequences!).

In general: imagine that
N independent trials are carried out,
for each trial, P(success) = p and P(failure) = 1 − p.

Let X = the number of successes. Then X is binomially distributed with parameters N and p. Notation:

$X \sim \mathrm{Bin}(N, p)$.


...other distributions?

In general, different RVs have different probability distributions (i.e. different probability functions).

Consider the random sequence example again, and define
Y = the first position where a match occurs (counted from left to right).

Seq1: c g t c g t ... g
Seq2: g a c c c t ... t

Then the probability function of Y is

$p_Y(k) = (1-p)^{k-1} p$

for k = 1, 2, 3, ..., where p is the probability of a match at a fixed position i, $1 \le i \le N$.

This is called the geometric distribution with parameter p.
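
A quick simulation check (my own sketch; for simplicity it treats the sequences as unbounded, so a first match always occurs):

```python
# First match position Y vs. the geometric pmf (1-p)^(k-1) p (illustration).
import random

letters = "acgt"
p = 0.25                     # per-position match probability, uniform letters

def first_match():
    """1-based position of the first match, generating positions lazily."""
    k = 1
    while random.choice(letters) != random.choice(letters):
        k += 1
    return k

trials = 100_000
samples = [first_match() for _ in range(trials)]
for k in (1, 2, 3):
    print(k, samples.count(k) / trials, (1 - p) ** (k - 1) * p)
```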

Another important distribution is the uniform distribution.

Suppose that each of the (finitely many) possible values of X is equally likely, that is,

$P(X = k) = \frac{1}{N}$

for each possible value k, where N is the number of possible values. Then X is said to be uniformly distributed.


...some important distributions?


There are infinitely many possible probability distributions but some
appear over and over again in applications.
Some examples are
Binomial
Geometric
Uniform
Poisson
Normal
Exponential
Chi-square
...

Probabilities of events

Let S be the set of possible outcomes of some experiment (S is the sample space).

An event is something that either will or will not occur when the experiment is conducted (mathematically, $E \subseteq S$).

Ex. 1 Experiment: counting matches between two sequences of length 1000. Then S = {0, 1, ..., 1000}.
The event E = "at least 50% identity" is E = {500, 501, ..., 1000}.

Ex. 2 Experiment: rolling a die once. Then S = {1, 2, 3, 4, 5, 6}.
Then E1 = "the number turning up is at least 3" = {3, 4, 5, 6},
and E2 = "the number turning up is odd" = {1, 3, 5}.


Let E, E1 and E2 be some events.

Interpretations:
$E^c$ = the event E does not occur;
$E_1 \cup E_2$ = at least one of the events E1 and E2 occurs;
$E_1 \cap E_2$ = both events E1 and E2 occur.

If the events E1 and E2 cannot occur together, they are said to be mutually exclusive.
(Mathematically, two events are mutually exclusive if $E_1 \cap E_2 = \emptyset$.)


How to compute probabilities of events

$P(S) = 1$.
For any event $E \subseteq S$, $0 \le P(E) \le 1$.
$P(E^c) = 1 - P(E)$.
For mutually exclusive events E1 and E2,
$P(E_1 \cup E_2) = P(E_1) + P(E_2)$.
For any two events E1 and E2,
$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$.
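
These rules can be checked mechanically on the die example from before; a small sketch (my own illustration) with exact arithmetic:

```python
# The event rules above on the fair-die sample space, with P(E) = |E|/|S|.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
def P(E):
    return Fraction(len(E & S), len(S))

E1 = {3, 4, 5, 6}          # "at least 3"
E2 = {1, 3, 5}             # "odd"

assert P(S) == 1
assert P(S - E1) == 1 - P(E1)                       # complement rule
assert P(E1 | E2) == P(E1) + P(E2) - P(E1 & E2)     # inclusion-exclusion
print(P(E1 | E2))   # 5/6
```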


Conditional probabilities

A fair die is rolled once. Suppose it is known that the number turning up is less than or equal to three.
How likely is it then that it is an odd number?

P(number is odd | number less than or equal to 3) = 2/3

The information given is: the number is 1, 2 or 3.
Two of these three outcomes are odd: therefore 2/3.

(NOTE: Without the additional information, P(number is odd) = 1/2.)


...conditional probabilities...

Suppose that E1 and E2 are two events associated with some random experiment.
Then the conditional probability $P(E_1 | E_2)$ that E1 occurs, given that E2 occurs, is defined as

$P(E_1 | E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}$.

Here we assume that $P(E_2) > 0$.


The conditional probability formula:

$P(E_1 | E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}$.

***

Ex: The die example again:

P(number is odd | number less than or equal to 3)
= P(number is odd and less than or equal to 3) / P(number less than or equal to 3)
$= \frac{P(\{1, 3\})}{P(\{1, 2, 3\})} = \frac{2/6}{3/6} = \frac{2}{3}$.
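
The same worked example in code (an illustration of the definition, not lecture material):

```python
# Conditional probability as P(E1 ∩ E2) / P(E2) on a fair die (sketch).
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
def P(E):
    return Fraction(len(E & S), len(S))

odd = {1, 3, 5}
at_most_3 = {1, 2, 3}

cond = P(odd & at_most_3) / P(at_most_3)
print(cond)                        # 2/3, matching the slide
assert cond == Fraction(2, 3)
```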


Independence

Mathematically, two events E1 and E2 are said to be independent if and only if

$P(E_1 \cap E_2) = P(E_1) \cdot P(E_2)$

holds. This is equivalent to

$P(E_1 | E_2) = P(E_1)$ and $P(E_2 | E_1) = P(E_2)$,

so for independent events E1 and E2, the information about the experiment contained in E2 says nothing about the occurrence of E1 (and vice versa).

Think of two random variables X and Y as being independent if the value of one does not in any way affect the probabilities associated with the possible values of the other.


Two sequences linked by evolution are dependent ...


...Independence...

Once again: match counts, two random sequences...

Seq1: gtacacgggata...tacgtgact
Seq2: cgaggtagtcga...tttatacga

Each position equals a, c, g, or t with probabilities $p_a, p_c, p_g, p_t$, and we define X = the number of matches.

If the positions in the sequences are independently generated, then $X \sim \mathrm{Bin}(N, p)$.

X will NOT be binomially distributed if successive nucleotides are dependent (i.e. if neighbors depend on each other)!

Why...?

...dependence...

The positions in the sequences can be dependent in different ways... One extreme case:

Let the letter in the first position be a, c, g or t with probabilities $p_a, p_c, p_g, p_t$.
Let all the other letters, in positions 2, 3, ..., N, be equal to the first letter!

Then each sequence will be of the form
aaaaaaaa...aaaaaaa, cccccccc...cccccccc, gggggggg...gggggggg, or tttttttt...tttttttt.

Then the possible values of X are 0 and N. Hence X cannot be binomially distributed.
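
A short simulation (my own sketch) makes the collapse visible: under this extreme dependence the match count only ever takes the values 0 and N.

```python
# Fully dependent extreme case: X collapses onto {0, N} (illustration).
import random

N = 20
letters = "acgt"

def constant_sequence(n):
    """First letter chosen at random; all remaining letters copy it."""
    first = random.choice(letters)
    return first * n

observed = set()
for _ in range(10_000):
    s1, s2 = constant_sequence(N), constant_sequence(N)
    observed.add(sum(a == b for a, b in zip(s1, s2)))

print(observed)   # {0, 20}: only 0 or N matches ever occur
```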


Expected value, variance, standard error

Associated with each random variable X (and each probability distribution) are three important quantities:

the expected value $\mu = E[X]$,
the variance $\sigma^2 = \mathrm{Var}[X]$,
the standard deviation $\sigma = \mathrm{SD}[X] = \sqrt{\mathrm{Var}[X]}$.

They contain useful information about the random variable X, and they can be computed from the probability function.


Expected value

Once again: two random sequences of length N = 1000, where the letters (nucleotides) in each position are equally likely:

$P(a) = P(c) = P(g) = P(t) = \frac{1}{4}$,

and X = the number of matches.

Since the nucleotides are equally probable, the match probability is p = 0.25, and $X \sim \mathrm{Bin}(1000, 0.25)$.

How many matches would we expect to see?
The intuitive answer is: about $1000 \cdot 0.25 = 250$.

This is in fact the expected value E[X] of this random variable X: if $X \sim \mathrm{Bin}(N, p)$, then one can prove that $E[X] = Np$ (= 250 in our case).


In general, with X being a discrete RV, the expected value (also called the expectation or the mean) $\mu = E[X]$ is defined as

$E[X] = \sum_{k \in S} k \cdot P(X = k)$,

where S is the set of possible values of X.

Ex: If $X \sim \mathrm{Bin}(N, p)$, then S = {0, 1, ..., N−1, N} and
$E[X] = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) + \ldots + N \cdot P(X = N)$,
which can be shown to equal $Np$.

Ex: If a die is rolled, and X = the number turning up,
$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \ldots + 6 \cdot \frac{1}{6} = 3.5$.

NOTE: The value 3.5 is not a possible value of X!
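
The definition translates directly into code; a minimal sketch (my own) computing E[X] from a probability function:

```python
# E[X] = sum_k k * P(X = k) for a die and for Bin(N, p) (illustration).
from math import comb

def expectation(pmf):
    """E[X] for a discrete RV given as {value: probability}."""
    return sum(k * p for k, p in pmf.items())

die = {k: 1/6 for k in range(1, 7)}
print(expectation(die))                       # 3.5

N, p = 1000, 0.25
binom = {k: comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(N + 1)}
print(expectation(binom))                     # 250.0 (= N * p)
```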

If the value E[X] is not necessarily a possible value of X, how can it be an expected value...?

Interpretation: If we repeat the experiment many times and observe independent copies X1, X2, ..., Xn of X, then the average

$\frac{1}{n}(X_1 + \ldots + X_n)$

will be close to E[X]!

Convergence: the average tends closer and closer to E[X] as n increases. (A more precise statement, the law of large numbers, is possible.)

Roll a die 1000 times and compute the average: it will be close to 3.5!
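
A three-line check of this convergence (illustrative sketch, not lecture material):

```python
# Averages of die rolls approach E[X] = 3.5 as n grows (illustration).
import random

for n in (10, 1000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(rolls) / n)   # tends toward 3.5 as n increases
```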


Expectations of linear combinations

Let X1, ..., Xn be (independent or dependent!) random variables, and let c1, ..., cn be real numbers. Then

$E[c_1 X_1 + c_2 X_2 + \ldots + c_n X_n] = c_1 E[X_1] + c_2 E[X_2] + \ldots + c_n E[X_n]$.

Expectations of products

Let X and Y be two random variables. If they are independent, then

$E[X \cdot Y] = E[X] \cdot E[Y]$.

This is generally NOT true if they are dependent.


More expectation formulas:

Let X be a random variable with set S of possible values. Then

$E[X^2] = \sum_{k \in S} k^2 \, P(X = k)$

and

$E[g(X)] = \sum_{k \in S} g(k) \, P(X = k)$

for functions g.


Random variation...

If $X \sim \mathrm{Bin}(1000, 0.25)$, then $E[X] = 1000 \cdot 0.25 = 250$, so we would expect to see approximately 250 matches in the sequence matching example.

That is, 251 or 249 would not be a surprising result...
But what about 240? 280? 350? ...

X is a random variable, so there will be some variability around its expected value... How much variation is expected?


Standard deviation

X is a random variable, so there will be some variability around its expected value... How much variation is expected?

This expected variation is captured by the standard deviation:

$\sigma := \mathrm{SD}[X] := \sqrt{\mathrm{Var}[X]} = \sqrt{E\big[(X - E[X])^2\big]}$.

Note: $(X - E[X])^2$ = the (squared) distance between X and its mean.


The deviation from the mean is (in a sense) $\sigma$ on average.


Variance formulas

Definition:

$\sigma^2 = \mathrm{Var}[X] := E\big[(X - E[X])^2\big]$.

Alternative formula:

$\mathrm{Var}[X] = E[X^2] - \big(E[X]\big)^2$.

Let a and b be constants. Then

$\mathrm{Var}[a + bX] = b^2 \, \mathrm{Var}[X]$.

Let X and Y be independent random variables. Then

$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.
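
Both variance formulas, and the scaling rule, can be verified on the fair die with a few lines (my own sketch):

```python
# Both variance formulas agree for a fair die, and Var[a+bX] = b^2 Var[X].
def E(pmf, g=lambda k: k):
    """E[g(X)] = sum_k g(k) * P(X = k)."""
    return sum(g(k) * p for k, p in pmf.items())

die = {k: 1/6 for k in range(1, 7)}

mean = E(die)
var_def = E(die, lambda k: (k - mean) ** 2)       # E[(X - E[X])^2]
var_alt = E(die, lambda k: k ** 2) - mean ** 2    # E[X^2] - (E[X])^2
print(var_def, var_alt)                           # both 2.9166...

a, b = 5, 3
scaled = {a + b * k: p for k, p in die.items()}   # distribution of a + bX
mean_s = E(scaled)
print(E(scaled, lambda k: (k - mean_s) ** 2))     # b^2 * Var[X] = 26.25
```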


If X and Y are dependent random variables, then

$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y]$,

where the last term is the covariance:

$\mathrm{Cov}[X, Y] := E\big[(X - E[X])(Y - E[Y])\big] = E[X \cdot Y] - E[X] \cdot E[Y]$.

The covariance measures the linear dependence between X and Y (and is 0 in the independent case).
