
BIOINFORMATIK II
PROBABILITY & STATISTICS
Summer semester 2006
University of Zürich and ETH Zürich

Lecture 1: Basic probability.

Prof. Andrew Barbour


Dr. Béatrice de Tilière
Adapted from a course by
Dr. D. Schuhmacher & Dr. D. Svensson.


Web page
http://www.math.unizh.ch/baps/lectures/bioinf2.html

You will find there:


Up-to-date information;
Transparencies of the lectures;
Exercise sheets with solutions (...in due time);
Additional background material.


Course content

Basic probability concepts:
probability distribution, independence, conditional probability, expectation, standard deviation, ...

Concepts and principles in statistics:
estimation, hypothesis testing, maximum likelihood, likelihood ratio, significance, p-value, ...

Markov chains:
transition matrix, stationary distribution, reversibility, random walks, hidden Markov models, ...

Models and algorithms in bioinformatics:
sequence alignment, models for evolution, PAM/BLOSUM matrices, BLAST, ...

Principles, rather than how-to!



[Figure: three DNA sequences, aggtgaccct...gtcattt, tggagccat...gtcgatt and acgtcaccct...gacattt, linked by arrows labelled "evolutionary changes", illustrating how present-day sequences diverge from a common ancestor.]


Why probability and statistics in bioinformatics?


Given: Two sequences from two species. Common ancestor?

ggagactgtagacagctaatgctata
gaacgccctagccacgagcccttatc

Sequence length: 26 nucleotides.
11 of 26 positions agree.
Conclusion? Generated purely by chance, or by some other mechanism?
To be able to answer this, one needs to understand the properties of random sequences.

Probability & statistics in bioinformatics?

For ...
modelling sequence evolution (Markov chains);
inferring phylogenetic trees (maximum-likelihood trees);
gene prediction (hidden Markov models);
analysis of microarray data (multiple testing, multivariate statistics);
evaluating sequence similarity in BLAST searches (extreme values, random walks);
much more!


Random variables (RVs)

A random variable = a numerical quantity whose value depends on the outcome of some chance experiment.

Ex 1. Flip a coin and let X = 1 if heads occurs, otherwise let X = 0. Then X is a random variable.

Ex 2. Two DNA sequences are randomly chosen from a database. Then X = the number of matches between the sequences is a RV.

Ex 3. Let X = the waiting time until a certain event occurs; e.g., the time until a nucleotide substitution first occurs at a specified position in a genome. Then X is a RV.

There are two main types of random variables:
either DISCRETE (as in examples 1 and 2)
or CONTINUOUS (example 3).

Probability distribution of a RV

The most important feature of a random variable is its probability distribution: the mechanism (mathematically: the function) that tells us with what probability the random variable takes which values.

For a discrete random variable, the probability distribution can be expressed either by its probability function or by its distribution function.


Probability and distribution functions (DISCRETE RVs)

Let X be any discrete random variable, and denote the set of its possible values by S (= sample space).

Associated with the random variable X are

the probability function $p_X$:
$p_X(i) := P(X = i) \in [0, 1]$, for $i \in S$;

and the (cumulative) distribution function $F_X$:
$F_X(j) := P(X \le j) = \sum_{i \in S,\, i \le j} P(X = i) \in [0, 1]$, for $j \in S$.

A mathematical analysis of a random variable typically requires explicit formulas for these functions!

In principle, these two functions contain all essential information concerning the properties and behavior of the random variable.
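
To make the definitions concrete, here is a minimal Python sketch (my own illustration, not part of the slides) that tabulates $p_X$ and $F_X$ for a fair die:

```python
# Probability function p_X and distribution function F_X for a fair die
# (an illustration; any discrete pmf given as {value: probability} works).
from fractions import Fraction

S = range(1, 7)                              # sample space
p_X = {i: Fraction(1, 6) for i in S}         # p_X(i) = P(X = i)

def F_X(j):
    """F_X(j) = P(X <= j) = sum of p_X(i) over i in S with i <= j."""
    return sum((p for i, p in p_X.items() if i <= j), Fraction(0))

print(p_X[3])   # 1/6
print(F_X(3))   # 1/2
```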

Similar formulas hold for continuous random variables X: the (cumulative) distribution function is then given by

$F_X(t) := P(X \le t) = \int_{x \le t} f_X(x)\, dx$, for $t \in S$,

for some probability density function $f_X$ with $f_X(x) \ge 0$.

Caution with the interpretation in the continuous case:

the density $f_X(x)$ is NOT equal to $P(X = x)$ (which in fact is always zero for continuous RVs!);
$f_X(x)$ is NOT a probability (that is, $f_X(x)$ might be > 1 ...).

Think of it as $P(t \le X \le t + h) \approx f_X(t) \cdot h$, where h is small and h > 0.

Discrete random variables are perhaps more important to bioinformatics... (in some sense)


Example: Flip a coin and let X = 1 if heads occurs, otherwise let X = 0. Then the sample space is S = {0, 1} and the probability function $p_X(i) := P(X = i)$ is given by

$p_X(0) = \tfrac{1}{2}$,  $p_X(1) = \tfrac{1}{2}$.

The distribution function is given by

$F_X(0) := P(X \le 0) = \tfrac{1}{2}$

and

$F_X(1) := P(X \le 1) = P(X = 0) + P(X = 1) = \tfrac{1}{2} + \tfrac{1}{2} = 1$.


Ex: Random DNA sequences with i.i.d. letters.

Two sequences of N letters are randomly generated, i.e.
the letters are independently generated,
each position equals a, c, g, or t with probabilities $p_a, p_c, p_g, p_t$.

Seq1: gtacacgggata...tacgtgact
Seq2: cgaggtagtcga...tttatacga

Let X = the number of matches. Then the probability function of X is

$P(X = k) = \binom{N}{k} p^k (1-p)^{N-k}$, where $\binom{N}{k} = \frac{N!}{k!(N-k)!}$,

for some match probability $p \in [0, 1]$.  [$n! = 1 \cdot 2 \cdots (n-1) \cdot n$.]

This is known as the binomial distribution with parameters N and p.
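
As a hedged illustration (my own sketch, not from the slides), the following Python code simulates match counts between two i.i.d. uniform DNA sequences and compares the empirical frequencies with the binomial formula; N = 26 and the uniform letter probabilities are borrowed from the earlier two-species example, and the tail probability P(X ≥ 11) quantifies how surprising 11 matches would be by chance alone.

```python
# Simulation vs. formula for X = number of matches (illustrative sketch).
import random
from math import comb

N = 26            # sequence length, as in the earlier two-species example
letters = "acgt"
p = 0.25          # match probability p_a^2 + p_c^2 + p_g^2 + p_t^2 for uniform letters

def count_matches(n):
    """Compare two i.i.d. uniform sequences position by position; count matches."""
    return sum(random.choice(letters) == random.choice(letters) for _ in range(n))

def binom_pmf(k, n, q):
    """P(X = k) for X ~ Bin(n, q)."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)

trials = 50_000
counts = [count_matches(N) for _ in range(trials)]
for k in (4, 7, 11):
    print(k, counts.count(k) / trials, round(binom_pmf(k, N, p), 4))

# How surprising are the observed 11 matches under pure chance?
print(sum(binom_pmf(k, N, p) for k in range(11, N + 1)))   # P(X >= 11)
```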


$P(X = k) = \binom{N}{k} p^k (1-p)^{N-k}$. Why?

Step 1: Fix any position (j, say).

Let p = P(match in the position considered) (which is independent of the position chosen). Then

$p = P(\text{two a, or two c, or two g, or two t})$
$= P(\text{two a}) + P(\text{two c}) + P(\text{two g}) + P(\text{two t})$
$= p_a p_a + p_c p_c + p_g p_g + p_t p_t.$

So the match probability is $p = p_a^2 + p_c^2 + p_g^2 + p_t^2$

($p = 1/4$ if $p_a = p_c = p_g = p_t = 1/4$).

The probability of a mismatch is $(1 - p)$.

Step 2: What is P(X = k)?
Exactly k matches and N − k mismatches can occur in different ways: for example,

[Figure: Seq 1 and Seq 2 aligned position by position; k matching positions (probability $p^k$) and N − k mismatching positions (probability $(1-p)^{N-k}$) are marked.]

Each such configuration has probability $p^k (1-p)^{N-k}$.



We have to add the probabilities of the different configurations to get the total probability:

$P(X = k) = p^k (1-p)^{N-k} + \ldots + p^k (1-p)^{N-k}$

How many terms? Combinatorial argument: there are

$\binom{N}{k} = \frac{N!}{k!(N-k)!}$

possible configurations of k matches and N − k mismatches. Therefore

$P(X = k) = \binom{N}{k} p^k (1-p)^{N-k}$.
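
For small N the counting argument can be verified directly; the sketch below (my own, under the uniform-letter assumption) enumerates all placements of k matches among N positions and checks the count against N!/(k!(N−k)!):

```python
# Sanity check: enumerate every placement of k matches among N positions
# and confirm the count equals the binomial coefficient (illustration).
from itertools import combinations
from math import comb, factorial

N, k, p = 8, 3, 0.25

n_configs = sum(1 for _ in combinations(range(N), k))   # choose k match positions
assert n_configs == comb(N, k) == factorial(N) // (factorial(k) * factorial(N - k))

# Each configuration has probability p^k (1-p)^(N-k); summing gives P(X = k).
prob = n_configs * p**k * (1 - p) ** (N - k)
print(prob)   # equals comb(N, k) * p^k * (1-p)^(N-k)
```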


The binomial distribution

Any random variable Y having the probability function

$p(k) = \binom{N}{k} p^k (1-p)^{N-k}$

is said to be binomially distributed (important in general, not just for counting matches between random sequences!).

In general: imagine that
N independent trials are carried out,
for each trial, P(success) = p and P(failure) = 1 − p.

Let X = the number of successes. Then X is binomially distributed with parameters N and p. Notation:

$X \sim \mathrm{Bin}(N, p)$.


...other distributions?

In general, different RVs have different probability distributions (i.e. different probability functions).

Consider the random sequence example again, and define
Y = the first position where a match occurs (counted from left to right).

Seq1: c g t c g t ... g
Seq2: g a c c c t ... t

Then the probability function of Y is

$p_Y(k) = (1-p)^{k-1} p$

for k = 1, 2, 3, ..., where p is the probability of a match at a fixed position i, $1 \le i \le N$.

This is called the geometric distribution with parameter p.
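
A quick simulation check (my own sketch; for simplicity it treats the sequences as unbounded, so a first match always occurs):

```python
# First match position Y vs. the geometric pmf (1-p)^(k-1) p (illustration).
import random

letters = "acgt"
p = 0.25                     # per-position match probability, uniform letters

def first_match():
    """1-based position of the first match, generating positions lazily."""
    k = 1
    while random.choice(letters) != random.choice(letters):
        k += 1
    return k

trials = 100_000
samples = [first_match() for _ in range(trials)]
for k in (1, 2, 3):
    print(k, samples.count(k) / trials, (1 - p) ** (k - 1) * p)
```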

Another important distribution is the uniform distribution.

Suppose that each of the (finitely many) possible values of X is equally likely, that is,

$P(X = k) = \frac{1}{N}$

for each possible value k, where N is the number of possible values. Then X is said to be uniformly distributed.


...some important distributions?


There are infinitely many possible probability distributions but some
appear over and over again in applications.
Some examples are
Binomial
Geometric
Uniform
Poisson
Normal
Exponential
Chi-square
...

Probabilities of events

Let S be the set of possible outcomes of some experiment (S is the sample space).

An event is something that either will or will not occur when the experiment is conducted (mathematically, $E \subseteq S$).

Ex. 1 Experiment: counting matches between two sequences of length 1000. Then S = {0, 1, ..., 1000}.
The event E = "at least 50% identity" is E = {500, 501, ..., 1000}.

Ex. 2 Experiment: rolling a die once. Then S = {1, 2, 3, 4, 5, 6}.
Then E1 = "the number turning up is at least 3" = {3, 4, 5, 6},
and E2 = "the number turning up is odd" = {1, 3, 5}.


Let E, E1 and E2 be some events.

Interpretations:
$E^c$ = the event E does not occur;
$E_1 \cup E_2$ = at least one of the events E1 and E2 occurs;
$E_1 \cap E_2$ = both events E1 and E2 occur.

If the events E1 and E2 cannot occur together, they are said to be mutually exclusive.
(Mathematically, two events are mutually exclusive if $E_1 \cap E_2 = \emptyset$.)


How to compute probabilities of events

$P(S) = 1$.
For any event $E \subseteq S$, $0 \le P(E) \le 1$.
$P(E^c) = 1 - P(E)$.
For mutually exclusive events E1 and E2,
$P(E_1 \cup E_2) = P(E_1) + P(E_2)$.
For any two events E1 and E2,
$P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$.
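
These rules can be checked mechanically on the die example from before; a small sketch (my own illustration) with exact arithmetic:

```python
# The event rules above on the fair-die sample space, with P(E) = |E|/|S|.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
def P(E):
    return Fraction(len(E & S), len(S))

E1 = {3, 4, 5, 6}          # "at least 3"
E2 = {1, 3, 5}             # "odd"

assert P(S) == 1
assert P(S - E1) == 1 - P(E1)                       # complement rule
assert P(E1 | E2) == P(E1) + P(E2) - P(E1 & E2)     # inclusion-exclusion
print(P(E1 | E2))   # 5/6
```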


Conditional probabilities

A fair die is rolled once. Suppose it is known that the number turning up is less than or equal to three.
How likely is it then that it is an odd number?

P(number is odd | number less than or equal to 3) = 2/3

The information given is: the number is 1, 2 or 3.
Two of these three outcomes are odd: therefore 2/3.

(NOTE: Without the additional information, P(number is odd) = 1/2.)


...conditional probabilities...

Suppose that E1 and E2 are two events associated with some random experiment.
Then the conditional probability $P(E_1 | E_2)$ that E1 occurs, given that E2 occurs, is defined as

$P(E_1 | E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}$.

Here we assume that $P(E_2) > 0$.


The conditional probability formula:

$P(E_1 | E_2) = \frac{P(E_1 \cap E_2)}{P(E_2)}$.

***

Ex: The die example again:

P(number is odd | number less than or equal to 3)
= P(number is odd and less than or equal to 3) / P(number less than or equal to 3)
$= \frac{P(\{1, 3\})}{P(\{1, 2, 3\})} = \frac{2/6}{3/6} = \frac{2}{3}$.
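
The same worked example in code (an illustration of the definition, not lecture material):

```python
# Conditional probability as P(E1 ∩ E2) / P(E2) on a fair die (sketch).
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
def P(E):
    return Fraction(len(E & S), len(S))

odd = {1, 3, 5}
at_most_3 = {1, 2, 3}

cond = P(odd & at_most_3) / P(at_most_3)
print(cond)                        # 2/3, matching the slide
assert cond == Fraction(2, 3)
```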


Independence

Mathematically, two events E1 and E2 are said to be independent if and only if

$P(E_1 \cap E_2) = P(E_1) \cdot P(E_2)$

holds. This is equivalent to

$P(E_1 | E_2) = P(E_1)$ and $P(E_2 | E_1) = P(E_2)$,

so for independent events E1 and E2, the information about the experiment contained in E2 says nothing about the occurrence of E1 (and vice versa).

Think of two random variables X and Y as being independent if the value of one does not in any way affect the probabilities associated with the possible values of the other.


Two sequences linked by evolution are dependent ...


...Independence...

Once again: match counts, two random sequences...

Seq1: gtacacgggata...tacgtgact
Seq2: cgaggtagtcga...tttatacga

Each position equals a, c, g, or t with probabilities $p_a, p_c, p_g, p_t$, and we define X = the number of matches.

If the positions in the sequences are independently generated, then $X \sim \mathrm{Bin}(N, p)$.

X will NOT be binomially distributed if successive nucleotides are dependent (i.e. if neighbors depend on each other)!

Why...?

...dependence...

The positions in the sequences can be dependent in different ways... One extreme case:

Let the letter in the first position be a, c, g or t with probabilities $p_a, p_c, p_g, p_t$.
Let all the other letters, in positions 2, 3, ..., N, be equal to the first letter!

Then each sequence will be of the form
aaaaaaaa...aaaaaaa, cccccccc...cccccccc, gggggggg...gggggggg, or tttttttt...tttttttt.

Then the possible values of X are 0 and N. Hence X cannot be binomially distributed.
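
A short simulation (my own sketch) makes the collapse visible: under this extreme dependence the match count only ever takes the values 0 and N.

```python
# Fully dependent extreme case: X collapses onto {0, N} (illustration).
import random

N = 20
letters = "acgt"

def constant_sequence(n):
    """First letter chosen at random; all remaining letters copy it."""
    first = random.choice(letters)
    return first * n

observed = set()
for _ in range(10_000):
    s1, s2 = constant_sequence(N), constant_sequence(N)
    observed.add(sum(a == b for a, b in zip(s1, s2)))

print(observed)   # {0, 20}: only 0 or N matches ever occur
```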


Expected value, variance, standard error

Associated with each random variable X (and each probability distribution) are three important quantities:

the expected value $\mu = E[X]$,
the variance $\sigma^2 = \mathrm{Var}[X]$,
the standard deviation $\sigma = \mathrm{SD}[X] = \sqrt{\mathrm{Var}[X]}$.

They contain useful information about the random variable X, and they can be computed from the probability function.


Expected value

Once again: two random sequences of length N = 1000, where the letters (nucleotides) in each position are equally likely:

$P(a) = P(c) = P(g) = P(t) = \frac{1}{4}$,

and X = the number of matches.

Since the nucleotides are equally probable, the match probability is p = 0.25, and $X \sim \mathrm{Bin}(1000, 0.25)$.

How many matches would we expect to see?
The intuitive answer is: about $1000 \cdot 0.25 = 250$.

This is in fact the expected value E[X] of this random variable X: if $X \sim \mathrm{Bin}(N, p)$, then one can prove that $E[X] = Np$ (= 250 in our case).


In general, with X being a discrete RV, the expected value (also called the expectation or the mean) $\mu = E[X]$ is defined as

$E[X] = \sum_{k \in S} k \cdot P(X = k)$,

where S is the set of possible values of X.

Ex: If $X \sim \mathrm{Bin}(N, p)$, then S = {0, 1, ..., N−1, N} and
$E[X] = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) + \ldots + N \cdot P(X = N)$,
which can be shown to equal $Np$.

Ex: If a die is rolled, and X = the number turning up,
$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \ldots + 6 \cdot \frac{1}{6} = 3.5$.

NOTE: The value 3.5 is not a possible value of X!
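
The definition translates directly into code; a minimal sketch (my own) computing E[X] from a probability function:

```python
# E[X] = sum_k k * P(X = k) for a die and for Bin(N, p) (illustration).
from math import comb

def expectation(pmf):
    """E[X] for a discrete RV given as {value: probability}."""
    return sum(k * p for k, p in pmf.items())

die = {k: 1/6 for k in range(1, 7)}
print(expectation(die))                       # 3.5

N, p = 1000, 0.25
binom = {k: comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(N + 1)}
print(expectation(binom))                     # 250.0 (= N * p)
```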

If the value E[X] is not necessarily a possible value of X, how can it be an expected value...?

Interpretation: If we repeat the experiment many times and observe independent copies X1, X2, ..., Xn of X, then the average

$\frac{1}{n}(X_1 + \ldots + X_n)$

will be close to E[X]!

Convergence: the average tends closer and closer to E[X] as n increases. (A more precise statement, the law of large numbers, is possible.)

Roll a die 1000 times and compute the average: it will be close to 3.5!
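
A three-line check of this convergence (illustrative sketch, not lecture material):

```python
# Averages of die rolls approach E[X] = 3.5 as n grows (illustration).
import random

for n in (10, 1000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(rolls) / n)   # tends toward 3.5 as n increases
```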


Expectations of linear combinations

Let X1, ..., Xn be (independent or dependent!) random variables, and let c1, ..., cn be real numbers. Then

$E[c_1 X_1 + c_2 X_2 + \ldots + c_n X_n] = c_1 E[X_1] + c_2 E[X_2] + \ldots + c_n E[X_n]$.

Expectations of products

Let X and Y be two random variables. If they are independent, then

$E[X \cdot Y] = E[X] \cdot E[Y]$.

This is generally NOT true if they are dependent.


More expectation formulas:

Let X be a random variable with set S of possible values. Then

$E[X^2] = \sum_{k \in S} k^2 \, P(X = k)$

and

$E[g(X)] = \sum_{k \in S} g(k) \, P(X = k)$

for functions g.


Random variation...

If $X \sim \mathrm{Bin}(1000, 0.25)$, then $E[X] = 1000 \cdot 0.25 = 250$, so we would expect to see approximately 250 matches in the sequence matching example.

That is, 251 or 249 would not be a surprising result...
But what about 240? 280? 350? ...

X is a random variable, so there will be some variability around its expected value... How much variation is expected?


Standard deviation

X is a random variable, so there will be some variability around its expected value... How much variation is expected?

This expected variation is captured by the standard deviation:

$\sigma := \mathrm{SD}[X] := \sqrt{\mathrm{Var}[X]} = \sqrt{E\big[(X - E[X])^2\big]}$.

Note: $(X - E[X])^2$ = the (squared) distance between X and its mean.


The deviation from the mean is (in a sense) $\sigma$ on average.


Variance formulas

Definition:

$\sigma^2 = \mathrm{Var}[X] := E\big[(X - E[X])^2\big]$.

Alternative formula:

$\mathrm{Var}[X] = E[X^2] - \big(E[X]\big)^2$.

Let a and b be constants. Then

$\mathrm{Var}[a + bX] = b^2 \, \mathrm{Var}[X]$.

Let X and Y be independent random variables. Then

$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.
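
Both variance formulas, and the scaling rule, can be verified on the fair die with a few lines (my own sketch):

```python
# Both variance formulas agree for a fair die, and Var[a+bX] = b^2 Var[X].
def E(pmf, g=lambda k: k):
    """E[g(X)] = sum_k g(k) * P(X = k)."""
    return sum(g(k) * p for k, p in pmf.items())

die = {k: 1/6 for k in range(1, 7)}

mean = E(die)
var_def = E(die, lambda k: (k - mean) ** 2)       # E[(X - E[X])^2]
var_alt = E(die, lambda k: k ** 2) - mean ** 2    # E[X^2] - (E[X])^2
print(var_def, var_alt)                           # both 2.9166...

a, b = 5, 3
scaled = {a + b * k: p for k, p in die.items()}   # distribution of a + bX
mean_s = E(scaled)
print(E(scaled, lambda k: (k - mean_s) ** 2))     # b^2 * Var[X] = 26.25
```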


If X and Y are dependent random variables, then

$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y]$,

where the last term is the covariance:

$\mathrm{Cov}[X, Y] := E\big[(X - E[X])(Y - E[Y])\big] = E[X \cdot Y] - E[X] \cdot E[Y]$.

The covariance measures the linear dependence between X and Y (and is 0 in the independent case).
