
If you learn just one thing in Bioinformatics, just learn this:

Basic Local Alignment Search Tool (BLAST)

BLAST programs

blastp: protein
blastn: DNA

E-value

E = p(S >= x) * D

where:
x = a score cutoff
D = database size
p(S >= x) = the P-value
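
For example (illustrative numbers, not from the slides): a P-value of 0.01 for scores above the cutoff, searched against a database of 1,000 sequences, gives E = 0.01 x 1000 = 10 alignments expected by chance alone.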

Example BLAST output

http://www-bimas.cit.nih.gov/blastinfo/blastexample.html

BLOSUM62

Find the score of PQG matching PQG using BLOSUM62.
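
As a sanity check for this exercise, here is a minimal sketch using Biopython's substitution_matrices module (using Biopython here is an assumption; the slides work from the printed matrix). PQG against itself sums the diagonal entries P:7, Q:5 and G:6, giving 18.

    # Minimal sketch: score PQG against PQG with BLOSUM62 (requires Biopython).
    from Bio.Align import substitution_matrices

    blosum62 = substitution_matrices.load("BLOSUM62")

    query, subject = "PQG", "PQG"
    # Sum the BLOSUM62 entry for each aligned residue pair.
    score = sum(blosum62[q, s] for q, s in zip(query, subject))
    print(score)  # 7 + 5 + 6 = 18.0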

Homologs

Genes related by evolution.

Orthologs

[Figure: an ancestral gene splits at a speciation event into Gene 1 and Gene 2, which are orthologs. Later duplications within each genome give Gene 11, Gene 12, Gene 21 and Gene 22, illustrating in-paralogs (duplicates within a genome) and out-paralogs (copies compared across the genomes).]

Fitch W. (1970). "Distinguishing


homologous from analogous proteins".
Syst Zool 19 (2): 99113.

Ortholog determination

Fundamental for comparative genomics

Open problem
No clear winner

Ortholog determination
Sequence similarity based clustering
Tree based
Hybrid approach

Sequence similarity

Pioneered by COG
Reciprocal Best Hit (usually BLAST)
Additional clustering on top of RBH (OrthoMCL)
Numerous databases: COG, eggNOG, OrthoMCL, InParanoid...

All vs All BLAST

[Figure: every protein from Genome 1 is BLASTed against every protein from Genome 2, and vice versa.]

Reciprocal Best BLAST Hit

[Figure: Protein A in one genome and Protein B in the other are each other's best BLAST hit, so they are called orthologs.]

Orthology has nothing to do with function!
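
A minimal sketch of the reciprocal-best-hit logic in Python; the best-hit dictionaries are hypothetical stand-ins for parsed all-vs-all BLAST results:

    # Hypothetical best hits parsed from all-vs-all BLAST (query -> best subject).
    best_hit_1to2 = {"A1": "B7", "A2": "B3", "A3": "B9"}
    best_hit_2to1 = {"B7": "A1", "B3": "A5", "B9": "A3"}

    # Keep only pairs whose best hits point back at each other.
    orthologs = [(a, b) for a, b in best_hit_1to2.items()
                 if best_hit_2to1.get(b) == a]
    print(orthologs)  # [('A1', 'B7'), ('A3', 'B9')]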

Homology vs Homoplasy

Clusters of Orthologous Groups (COG)

http://www.ncbi.nlm.nih.gov/COG/

[Figure: genes a, b and c from Genomes A, B and C, connected by reciprocal best hits into a three-genome cluster.]

InParanoid

http://inparanoid.sbc.su.se/cgi-bin/index.cgi

[Figure: a seed ortholog pair a-b between Genome A and Genome B, with additional within-genome duplicates clustered around each seed as inparalogs.]

Download BLAST

ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.25/

Creating a BLAST DB from a multifasta file

formatdb -i multifasta

BLASTP

blastall -p blastp -i input.fas -d dbname -o outputfile

Position Specific Scoring Matrix (PSSM)

        1  2  3  4
Seq1    A  G  G  A
Seq2    A  G  G  G
Seq3    ...
Seq4    ...

The goal of adding pseudocounts is to obtain an improved estimate of the probability p_ca that amino acid a is in column c in all occurrences of the blocks. The current estimate of p_ca is f_ca, the observed frequency n_ca / N_c. Bayesian prediction improves the estimate of p_ca by adding pseudocounts (Henikoff and Henikoff 1996):

p_ca = (n_ca + b_ca) / (N_c + B_c)

where:
n_ca = real count of amino acid a in column c
b_ca = pseudo count
N_c = total real count in column c
B_c = total pseudo count

As b_ca becomes larger, so does its influence on p_ca. Furthermore, not only the types of pseudocounts matter, but also their number relative to the number of sequences present in the MSA.

Unfortunately, an observed frequency of 0 might imply the exclusion of the corresponding residue at this position (this was the case with patterns). One possible trick is to add a small number to all observed frequencies. These small non-observed frequencies are referred to as pseudo-counts.

From the previous example, with a pseudo-count of 1:
Column 1:  f'A,1  = (0+1)/(5+20) = 0.04,  f'G,1  = (5+1)/(5+20) = 0.24, ...
Column 2:  f'A,2  = (0+1)/(5+20) = 0.04,  f'H,2  = (5+1)/(5+20) = 0.24, ...
...
Column 15: f'A,15 = (2+1)/(5+20) = 0.12,  f'C,15 = (1+1)/(5+20) = 0.08, ...
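
The same correction in a minimal Python sketch (assuming, as in the example, 5 sequences and one pseudo-count for each of the 20 amino acids):

    # p_ca = (n_ca + b_ca) / (N_c + B_c) with b_ca = 1 and B_c = 20.
    def corrected_frequency(n_ca, N_c=5, b_ca=1, B_c=20):
        return (n_ca + b_ca) / (N_c + B_c)

    print(corrected_frequency(0))  # 0.04, e.g. f'A,1
    print(corrected_frequency(5))  # 0.24, e.g. f'G,1
    print(corrected_frequency(2))  # 0.12, e.g. f'A,15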

There exist more sophisticated methods to produce more realistic pseudo-counts, based on substitution matrices or Dirichlet mixtures.


How to build a PSSM


A PSSM is based on the frequencies of each residue in a specific position
of a multiple alignment.

sp|P54202|ADH2_EMENI   G H E G V G K V V K L G A G A
sp|P16580|GLNA_CHICK   G H E K K G Y F E D R G P S A
sp|P40848|DHP1_SCHPO   G H E G Y G G R S R G G G Y S
sp|P57795|ISCS_METTE   G H E F E G P K G C G A L Y I
sp|P15215|LMG1_DROME   G H E L R G T T F M P A L E C
                       1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Column 1:  fA,1  = 0/5 = 0,   fG,1  = 5/5 = 1,   ...
Column 2:  fA,2  = 0/5 = 0,   fH,2  = 5/5 = 1,   ...
...
Column 15: fA,15 = 2/5 = 0.4, fC,15 = 1/5 = 0.2, ...

[Table: the corresponding 20 x 15 residue count matrix (rows A C D E F G H I K L M N P Q R S T V W Y, columns 1-15), tallied from the alignment above; e.g. row A = 0 0 0 0 0 0 0 0 0 0 0 2 1 0 2.]
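
A minimal sketch that rebuilds the per-column counts from this alignment and turns them into pseudo-count-corrected log-odds scores (the scoring formula is introduced on the next slide; the uniform background q = 1/20 and the log base are assumptions):

    import math

    # The five aligned sequences transcribed from the slide, one string per row.
    alignment = ["GHEGVGKVVKLGAGA",
                 "GHEKKGYFEDRGPSA",
                 "GHEGYGGRSRGGGYS",
                 "GHEFEGPKGCGALYI",
                 "GHELRGTTFMPALEC"]
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    n_seq = len(alignment)

    pssm = []
    for c in range(len(alignment[0])):
        column = [seq[c] for seq in alignment]
        # Pseudo-count-corrected frequency, scored against a uniform background.
        pssm.append({aa: math.log2(((column.count(aa) + 1) / (n_seq + 20)) / 0.05)
                     for aa in amino_acids})

    print(round(pssm[0]["G"], 1))  # strongly positive: column 1 is all G
    print(round(pssm[0]["A"], 1))  # slightly negative: A never seen in column 1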

The score is derived from the ratio of the observed to the expected frequencies. More precisely, the logarithm of this ratio is taken and referred to as the log-likelihood ratio:

Score_ij = log( f'_ij / q_i )

where Score_ij is the score for residue i at position j, f'_ij is the relative frequency for residue i at position j (corrected with pseudo-counts) and q_i is the expected relative frequency of residue i in a random sequence.

Example

The complete position specific scoring matrix calculated from the previous example:

       1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
A   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   1.3   0.7  -0.2   1.3
C   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   0.7  -0.2  -0.2  -0.2  -0.2   0.7
D   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2
E   -0.2  -0.2   2.3  -0.2   0.7  -0.2  -0.2  -0.2   0.7  -0.2  -0.2  -0.2  -0.2   0.7  -0.2
F   -0.2  -0.2  -0.2   0.7  -0.2  -0.2  -0.2  -0.2   0.7  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2
G    2.3  -0.2  -0.2   1.3  -0.2   2.3   0.7  -0.2   0.7  -0.2   1.3   1.7   0.7   0.7  -0.2
H   -0.2   2.3  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2
I   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   0.7
K   -0.2  -0.2  -0.2   0.7   0.7  -0.2   0.7   0.7  -0.2   0.7  -0.2  -0.2  -0.2  -0.2  -0.2
L   -0.2  -0.2  -0.2   0.7  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   0.7  -0.2   1.3  -0.2  -0.2
M   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   0.7  -0.2  -0.2  -0.2  -0.2  -0.2
N   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2
P   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   0.7  -0.2   0.7  -0.2  -0.2
Q   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2
R   -0.2  -0.2  -0.2  -0.2   0.7  -0.2  -0.2   0.7  -0.2   0.7   0.7  -0.2  -0.2  -0.2  -0.2
S   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   0.7  -0.2  -0.2  -0.2  -0.2   0.7  -0.2
T   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   0.7   0.7  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2
V   -0.2  -0.2  -0.2  -0.2   0.7  -0.2  -0.2   0.7   0.7  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2
W   -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2
Y   -0.2  -0.2  -0.2  -0.2   0.7  -0.2   0.7  -0.2  -0.2  -0.2  -0.2  -0.2  -0.2   0.7  -0.2

How to use PSSMs


The PSSM is applied as a sliding window along the subject sequence:

At every position, a PSSM score is calculated by summing the scores of all columns;
The highest scoring position is reported.
[Figure series: the 15-column PSSM above is slid along the subject sequence T S G H E L V G G V A F P A R C A S, shifting by one residue between frames ("Position +1"); the frames shown score 0.3, then 16.1 at the motif match, then 0.6. The highest-scoring window is the one reported.]
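
A minimal sketch of this sliding-window scan, reusing the pssm list of per-column score dictionaries from the earlier sketch:

    def scan(pssm, sequence):
        """Score every window of the subject sequence; return the best one."""
        width = len(pssm)
        best = None
        for start in range(len(sequence) - width + 1):
            window = sequence[start:start + width]
            # Sum one score per column, as described above.
            score = sum(column[aa] for column, aa in zip(pssm, window))
            if best is None or score > best[1]:
                best = (start, score)
        return best  # (start position, score) of the highest-scoring window

    print(scan(pssm, "TSGHELVGGVAFPARCAS"))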

These shortcomings of PSSMs set the stage for a new kind of profile based on Markov chains, called Hidden Markov models (HMMs):

modeling positional dependencies
recognizing pattern instances with indels
modeling variable length patterns
detecting boundaries

PSSM search

rpsblast can be used to search a query sequence against PSSMs.

The NCBI Conserved Domain Database (CDD) is a collection of PSSMs.

Markov process

No state information
Memoryless

Markov Chains

Markov chains are stochastic processes that undergo transitions between a finite series of states in a chainlike manner.

x1 -> x2 -> x3 -> x4 -> x5

The system traverses states with probability

p(x1, x2, x3, ...) = p(x1) p(x2|x1) p(x3|x2) p(x4|x3)...

i.e. Markov chains are memoryless: the probability that the chain is in state xi at time t depends only on the state at the previous time step, and not on the past history of the states visited before time t-1. This specific kind of "memorylessness" is called the Markov property.

Markov Chains are memoryless: the probability of a state depends only on the previous state.

The Markov property states that the conditional probability distribution for the system at the next step (and in fact at all future steps) depends only on the current state of the system, and not additionally on the state of the system at previous steps.

E.g., a general Markov chain modeling DNA.

Note that any sequence can be traced through the model by passing from one state to the next via the transitions.

Markov chains are defined as a state diagram

Markov chains, and their extension hidden Markov models (HMMs), are commonly represented by state diagrams, which consist of states and connecting transitions. A transition probability parameter (aij) is associated with each transition (arrow) and determines the probability of a certain state (Sj) following another state (Si).

A Markov chain is defined by:
a finite set of states, S1, S2 ... SN
a set of transition probabilities: aij = P(qt+1 = Sj | qt = Si)
and an initial state probability distribution, πi = P(q0 = Si)
http://thegrantlab.org/

Markov chains example

Simple Markov chain example for x = {a, b}
Observed sequence: x = abaaababbaa

Model transition probabilities (aij):

Prev  Next  Prob
a     a     0.7
a     b     0.3
b     a     0.5
b     b     0.5

Initial state probability distribution (Start probs): a: 0.5, b: 0.5

P(x) = 0.5 x 0.3 x 0.5 x 0.7 x 0.7 x 0.3 x 0.5 x 0.3 x 0.5 x 0.5 x 0.7

Q. Can you sketch the state diagram with labeled transitions for this model?

http://thegrantlab.org/
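
A minimal sketch of this calculation in Python:

    # Transition and start probabilities from the slide's two-state model.
    a = {("a", "a"): 0.7, ("a", "b"): 0.3, ("b", "a"): 0.5, ("b", "b"): 0.5}
    start = {"a": 0.5, "b": 0.5}

    def chain_probability(x):
        p = start[x[0]]
        for prev, nxt in zip(x, x[1:]):  # one factor per transition
            p *= a[(prev, nxt)]
        return p

    print(chain_probability("abaaababbaa"))  # same product as above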

Markov chains: boundary detection

Given a sequence, we wish to label each symbol in the sequence according to its class (e.g. transmembrane regions vs. extracellular/cytosolic).

A simpler question: is a given sequence a transmembrane sequence? Transmembrane regions tend to be hydrophobic in composition.

Given a training set of labeled sequences, we can begin by modeling each amino acid as hydrophobic (H) or hydrophilic (L), i.e. reduce the dimensionality of the 20 amino acids into two classes. E.g., a peptide sequence can then be represented as a sequence of Hs and Ls, e.g. HHHLLHLHHLHL...

A Markov chain for recognizing transmembrane sequences:

[State diagram: states SH and SL with transition probabilities 0.7 and 0.3 on the arrows; π(H) = 0.6, π(L) = 0.4]

Question: is the sequence HHLHH a transmembrane protein?

P(HHLHH) = 0.6 x 0.7 x 0.7 x 0.3 x 0.7 x 0.7 = 0.043

Problem: we need a threshold, and the threshold must be length dependent.
http://thegrantlab.org/

Markov chains: boundary detection...

We can classify an observed sequence (O = O1, O2, ...) by its log-odds ratio between a transmembrane model and a null model:

Transmembrane (TM) model: transitions 0.7 / 0.3; π(H) = 0.6, π(L) = 0.4
Extracellular/cytosolic (E/C) null model: all transitions 0.5; π(H) = 0.5, π(L) = 0.5

P(HHLHH | TM)  = 0.6 x 0.7 x 0.7 x 0.3 x 0.7 x 0.7 = 0.043
P(HHLHH | E/C) = 0.5 x 0.5 x 0.5 x 0.5 x 0.5 x 0.5 ≈ 0.016

0.043 / 0.016 = 2.69

In other words, it is more than twice as likely that HHLHH is a transmembrane sequence. The log-odds score is: log2(2.69) = 1.43

http://thegrantlab.org/
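
A minimal sketch of the decision rule, using the (rounded) probabilities from the slide:

    import math

    p_tm = 0.043  # P(HHLHH | TM), from the slide
    p_ec = 0.016  # P(HHLHH | E/C), from the slide
    ratio = p_tm / p_ec
    # A positive log-odds score favours the transmembrane model.
    print(round(ratio, 2), round(math.log2(ratio), 2))  # 2.69 1.43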

Side note: parameter estimation

Both initial probabilities π(i) and transition probabilities aij are determined from known examples of transmembrane and non-transmembrane sequences.

π(H) = # of training sequences that begin with H, normalized by the total # of training sequences; e.g. π(H) = 0.6, π(L) = 0.4.

Training data:
HHHLLHHHLLLHLHLLHLLLHLHHHL
HHHLHHLHLLLLLHHHHLLLHHHHHL
HH...   (A_HL = 12, A_H* = 40)

a_HL = A_HL / Σi A_Hi = (# HL pairs) / (# H* pairs) = 12/40 = 0.3
http://thegrantlab.org/
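
A minimal counting sketch; the two training strings are just the excerpt shown above, so the resulting estimate differs from the slide's 12/40, which comes from the full (here truncated) training set:

    from collections import Counter

    training = ["HHHLLHHHLLLHLHLLHLLLHLHHHL",
                "HHHLHHLHLLLLLHHHHLLLHHHHHL"]

    pairs = Counter()
    for seq in training:
        for prev, nxt in zip(seq, seq[1:]):
            pairs[prev + nxt] += 1

    # a_HL = (# HL pairs) / (# pairs starting with H)
    a_HL = pairs["HL"] / (pairs["HH"] + pairs["HL"])
    print(a_HL)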

Given a sequence of Hs and Ls, find all transmembrane regions.

Using our Markov models we would still need to score successive windows along the sequence, leading to fuzzy boundaries (just as with a PSSM).

HMM: given a sequence of Hs and Ls, find the transmembrane regions

To approach this question we can construct a new four-state model by connecting the TM and E/C models with transitions.

[State diagram: four states HM and LM (membrane) and HE/C and LE/C (extracellular/cytosolic), with transition probabilities such as 0.6 within the membrane states, 0.4 within the E/C states, and 0.1-0.2 between the two groups. Transitions between the membrane states and the E/C states indicate boundaries between membrane regions and cytosolic or extracellular regions.]

So what's hidden?

We will distinguish between the observed parts of the problem and the hidden parts. In the Markov models we have considered previously, it is clear which states account for each part of the observed sequence, due to the one-to-one correspondence between symbols and states. However, this is no longer a Markov chain! In our new model, there are multiple states that could account for each part of the observed sequence, i.e. we don't know which state emitted a given symbol from knowledge of the sequence and the structure of the model. This is the hidden part of the problem.
http://thegrantlab.org/

For our Markov models: given HLLH..., we know the exact state sequence (q0 = SH, q1 = SL, q2 = SL, ...).

For our HMM: given HLLH..., we must infer the most probable state sequence. This HMM state sequence will yield the boundaries between likely TM and E/C regions.

[State diagram repeated: HM, LM, HE/C, LE/C with the transition probabilities above.]

The 16 candidate state sequences for HLLH:

HM, LM, LM, HM
HM, LM, LM, HE/C
HM, LM, LE/C, HM
HM, LM, LE/C, HE/C
HM, LE/C, LM, HM
HM, LE/C, LM, HE/C
HM, LE/C, LE/C, HM
HM, LE/C, LE/C, HE/C
HE/C, LM, LM, HM
HE/C, LM, LM, HE/C
HE/C, LM, LE/C, HM
HE/C, LM, LE/C, HE/C
HE/C, LE/C, LM, HM
HE/C, LE/C, LM, HE/C
HE/C, LE/C, LE/C, HM
HE/C, LE/C, LE/C, HE/C
http://thegrantlab.org/

Hidden Markov models (HMMs)

Markov Chains:
States: S1, S2 ... SN
Initial probabilities: πi
Transition probabilities: aij
One-to-one correspondence between states and symbols.

Hidden Markov Models:
States: S1, S2 ... SN
Initial probabilities: πi
Transition probabilities: aij
Alphabet of emitted symbols
Emission probabilities: ei(a), the probability that state i emits symbol a
A symbol may be emitted by more than one state; similarly, a state can emit more than one symbol.
http://thegrantlab.org/

In this example we will use only one state for the transmembrane segment (M) and use emission probabilities to distinguish between H and L residues. We will also add separate E & C states with distinct emission probabilities.

Side note: parameter estimation

As in the case of Markov chains, the HMM parameters can be learned from labeled training data. Note that we now have to learn the initial probabilities, the transition probabilities and the emission probabilities:

aij = Aij / Σj' Aij'        ei(x) = Ei(x) / Σx' Ei(x')

            E     M     C
πi          0     0     1
ei(H)      0.2   0.9   0.3
ei(L)      0.8   0.1   0.7

[State diagram: E, M and C states with transition probabilities 0.7, 0.3, 0.5 and 0.25 on the arrows; e.g. the transition row out of M is 0.25, 0.5, 0.25.]
http://thegrantlab.org/

[Figure series: step-by-step Viterbi calculation on the E/M/C model above. A trellis of states (E, M, C) against the query sequence is filled in from START, one symbol at a time; each new cell is (transition probability) x (emission probability) x (previous cell):

Step 1:  E: 0 x 0.2 = 0;  M: 0 x 0.9 = 0;  C: 1 x 0.3 = 0.3
Step 2:  M: 0.3 x 0.9 x 0.3 = 0.081;  C: 0.7 x 0.3 x 0.3 = 0.063
Step 3:  E: 0.25 x 0.8 x 0.081 = 0.016;  M: 0.5 x 0.1 x 0.081 = 0.004;  C: 0.25 x 0.7 x 0.081 = 0.014
Step 4:  E: 0.7 x 0.8 x 0.016 = 0.009;  M: 0.3 x 0.1 x 0.016 = 0.0005
Step 5:  E: 0.7 x 0.2 x 0.009 = 0.001;  M: 0.3 x 0.9 x 0.009 = 0.002

Tracing back from the highest-scoring final cell yields the Most Probable State Sequence.]

Viterbi Algorithm
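
A minimal Viterbi sketch for the three-state E/M/C model above. The transition values are transcribed from the slide diagrams, and transitions not visible there (E to C, C to E) are assumed to be 0; note also that the slide trellis expands only the single best cell at each step, so a full dynamic program can disagree with it on later columns.

    # States, initial, emission and (partly assumed) transition probabilities.
    pi = {"E": 0.0, "M": 0.0, "C": 1.0}
    emit = {"E": {"H": 0.2, "L": 0.8},
            "M": {"H": 0.9, "L": 0.1},
            "C": {"H": 0.3, "L": 0.7}}
    trans = {"C": {"C": 0.7, "M": 0.3, "E": 0.0},    # C -> ... (C->E assumed 0)
             "M": {"C": 0.25, "M": 0.5, "E": 0.25},  # M -> ...
             "E": {"C": 0.0, "M": 0.3, "E": 0.7}}    # E -> ... (E->C assumed 0)

    def viterbi(observed):
        states = list(pi)
        # v[s]: probability of the best path ending in state s; path[s]: that path.
        v = {s: pi[s] * emit[s][observed[0]] for s in states}
        path = {s: [s] for s in states}
        for symbol in observed[1:]:
            v_new, path_new = {}, {}
            for s in states:
                # Pick the best predecessor state for s at this step.
                prev = max(states, key=lambda r: v[r] * trans[r][s])
                v_new[s] = v[prev] * trans[prev][s] * emit[s][symbol]
                path_new[s] = path[prev] + [s]
            v, path = v_new, path_new
        best = max(states, key=lambda s: v[s])
        return path[best], v[best]

    print(viterbi("HHLLH"))  # most probable state sequence and its probability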

HMMs are trained from a multiple alignment

[Figure: a training set of aligned sequences (columns such as A/A/V/A/A, D/E/E/D/E and C/C/C/C/C, with occasional inserted residues such as W and T) is converted into a profile HMM with match states M1-M3, insert states I0-I3 and delete states D1-D3 between BEGIN and END. Each match state carries an emission table, e.g. A 0.74, C 0.01, D 0.03, E 0.03, ...; A 0.01, C 0.01, D 0.41, E 0.44, ...; A 0.01, C 0.92, D 0.01, E 0.01, ...]

Hidden Markov Model

The probability of LVPI is calculated by multiplying the emission and transition probabilities along the path.

[Figure 6: a possible hidden Markov model of the protein LVPI, from Start to End. The numbers in the boxes indicate the emission probabilities (L 0.3, V 0.6, P 0.4, I 0.6) and the numbers next to the arrows indicate the transition probabilities (0.4, 0.46, 0.3, 0.3, 0.5). The probability of the protein LVPI is shown in bold.]

In Figure 6 the probability of L being emitted at position 1 is 0.3, and that of V at position 2 is 0.6. The probability of the full sequence can be calculated as:

0.4 x 0.3 x 0.46 x 0.6 x 0.3 x 0.4 x 0.3 x 0.6 x 0.5 = 0.00035

In a real-life situation the path through a model is not known. In that case, the correct probability of any sequence is the sum of the probabilities over all possible paths.

HMMER3

http://hmmer.janelia.org

cd ~/Desktop/h<tab>
cd binaries
sudo cp * /usr/bin/

Creating a HMM model of p53

Align:
muscle -stable -in infile -out outfile

Create HMM:
hmmbuild --informat afa p53.hmm outfile

Search human genome:
hmmsearch -o hits.txt p53.hmm human.faa

HMMER result

# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.0 (March 2010); http://hmmer.org/
# Copyright (C) 2010 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query HMM file:                  PF00870.hmm
# target sequence database:        PF00870_full_length_sequences-1.fasta
# output directed to file:         result.out
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Query:       PF00870  [M=612]
Scores for complete sequences (score includes all domains):
    --- full sequence ---    --- best 1 domain ---    -#dom-
    E-value   score  bias    E-value   score  bias    exp  N  Sequence      Description
    -------   -----  ----    -------   -----  ----    ---  -  --------      -----------
      6e-226  746.2  22.8   7.3e-226  745.9  15.8    1.0  1  P63_MOUSE     (O88898)
    7.7e-226  745.8  21.8   9.7e-226  745.5  15.1    1.0  1  P63_RAT       (Q9JJP6)
    1.7e-225  744.7   4.7   3.5e-225  743.6   3.2    1.5  1  P73_HUMAN     (O15350)
    1.6e-224  741.5  23.2     2e-224  741.2  16.1    1.0  1  P63_HUMAN     (Q9H3D4)
      2e-223  737.9  20.5   2.2e-223  737.7  14.2    1.0  1  Q3UVI3_MOUSE  (Q3UVI3)
    1.5e-222  735.0   3.4   4.3e-222  733.4   2.3    1.6  1  P73_CERAE     (Q9XSK8)
    2.1e-222  734.5  20.2   2.3e-222  734.3  14.0    1.0  1  Q5CZX0_MOUSE  (Q5CZX0)
    2.1e-221  731.1  34.0   2.4e-221  731.0  23.6    1.0  1  C4Q601_SCHMA  (C4Q601)

PFAM: a ready-made HMM library
