
Getting the most from your hidden Markov models

SAM-T98
Building and using an HMM framework
for finding protein homologs and
predicting protein structure
Melissa Cline, Christian Barrett, and Kevin Karplus
University of California, Santa Cruz

http://www.cse.ucsc.edu/research/compbio/

Acknowledgments

• SAM-T98 developed by Kevin Karplus and Christian Barrett. Additional collaborators: Richard Hughey, David Haussler, Melissa Cline, Mark Diekhans, Leslie Grate, Kimmen Sjölander.
• SAM hidden Markov model software package developed by Richard Hughey and Anders Krogh [KBM+94, HK96].
• The sequence logo program used is by A. M. Amer, UCSC, based on the Sequence Logos program by Tom Schneider [SS90].
• Supported in part by NSF grant DBI-9408579, DOE grant DE-FG0395ER62112, GAANN fellowship program, and a grant from Digital Equipment Corporation.

Outline of tutorial
• External (web-site) view of SAM-T98. (Karplus)
• Overview of tutorial and SAM-T98. (Barrett)
• Entropy and sequence logos. (Cline)
• Introducing the running examples. (Barrett)
• Basics of profile HMMs. (Karplus)
• SHORT BREAK.
• Sequence weighting. (Cline)
• Column regularizers. (Karplus)
• Transition regularizers. (Barrett)
• SHORT BREAK.
• Searching with the model. (Barrett)
• Null models. (Karplus)
• Stepping through the process. (Cline)
• Bidirectional scoring. (Barrett)
• Future directions and conclusions. (Barrett)

SAM-T98 overview

This talk is a "look under the hood" at the SAM-T98 method.
• Inputs: User supplies a target protein sequence or a trusted multiple alignment.
• Outputs:
  - Scores and pairwise alignments from aligning the target to a set of HMMs that we have built for known structures.
  - A multiple alignment of all similar sequences from a large non-redundant protein database.
  - A hidden Markov model (HMM) for searching databases or aligning sequences.
  - Search results and alignments for searching PDB (or Swissprot or NR) with the HMM.

SAM-T98 master web page
http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

UCSC HMM Applications

UCSC’s SAM-T98 method for iterative SAM HMM
construction and remote homology detection includes
extensions to our CASP2 work (html) for our CASP3
predictions. A description of the method will appear in
BioInformatics (html and postscript), and is favorably
evaluated against PSI-BLAST and other methods in J.
Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T.
Hubbard, and C. Chothia, ‘‘Sequence Comparisons Using
Multiple Sequences Detect Twice as many Remote
Homologues as Pairwise Methods’’, JMB
284(4):1201-1210, 1998. Please cite SAM-T98 and this
page as:

Kevin Karplus, Christian Barrett, Richard Hughey, "Hidden Markov Models for
Detecting Remote Protein Homologies", Bioinformatics 14(10):846-856, 1998. WWW
server available from http://www.cse.ucsc.edu/research/compbio.

Basic operations
Compare Sequence Against Protein Model Library
Search UCSC’s protein structure HMM library with a single sequence. May reveal
homologous structures.

Protein Query Against A Database
Protein homology search. This server creates a SAM-T98 multiple alignment from a
single sequence, builds an HMM from it, and then scores the selected database using the
HMM. The SAM-T98 alignment is not returned, but a selected number of hits from the
database are returned as a multiple alignment.

Create or Tune Up a Multiple Alignment
The user submits sequences or a multiple alignment and gets back a multiple alignment.

Compare Two Alignments
Submit two alignments in FASTA format and receive a score measuring the similarity of
the alignments.

SAM-T98 search web page
Because the method can take a while for long proteins, or ones with many homologs, results are e-mailed back. They also don't fit on a slide; see T98-query.mail in handouts.
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T98-query.html
HMM-based Sequence Query

This page offers the ability to perform remote protein homolog searches using HMM technology. It is
part of a larger suite of UCSC’s HMM applications that are available on the web. If you are familiar
with BLAST searching, then you understand the functionality of this page.

Step 1: Enter email address and optional subject line


Please enter your email address (required)
karplus@cse.ucsc.edu
Return email subject line (optional)
UCSC SAM-T98-query results

Step 2: Enter one sequence


You may either cut and paste your sequence in the box below or upload it from
a file on your local system. Sequences must be in a readseq-compatible format.
Cut and paste your sequence here, ...
>2crd Charybdotoxin XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS

or specify the sequence-containing file on your system here.

Step 3: Choose database to search for homologs


PDB (quick)
SWISSPROT (a couple hours)
971 SCOP domains

Step 4: Choose threshold for reporting database hits


If you want to include more remote, or weaker scoring, sequences in the
final report, choose a cutoff value that is less negative. The more negative
the cutoff, the more strict you are being about what you will consider
as a "hit".
NLL-NULL score cutoff:
-9

SAM-T98 multiple alignment web page

A somewhat simpler page just computes the SAM-T98 multiple alignment and returns it.
http://www.cse.ucsc.edu/research/compbio/HMM-apps/build-target98-alignment.html

See build-target98-alignment.mail in the handouts.

Overview

What we are talking about today: practical application of hidden Markov models presented in the framework of SAM-T98.
• Criteria for accepting homologs
• Regularizers and weighting schemes
• Speeding up database searches
• Scoring and null models
• Library building
What we are not talking about today:
• Mathematical foundations of hidden Markov models
• Objectives of fold recognition
• Other applications of HMMs and stochastic modeling

What is SAM-T98?
SAM-T98 is an HMM-based homology detection method, similar to PSI-BLAST in function.
• SAM-T98 was the top fold recognition and homology modeling method at CASP3 of those using no direct structural information [KBC+99].
• The methodology that developed into SAM-T98 was one of the top fold recognition methods at CASP2 [KKB+97].
• In two detailed comparisons using SCOP [HMBC97], SAM-T98 was shown to outperform methods including PSI-BLAST, ISS, and Double BLAST at superfamily recognition [PKB+98, KBH98].

SAM-T98 Overview: Alignment Building

SAM-T98 Alignment Building


Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

Starting point: A seed sequence or reliable seed alignment.

Goal: Build an alignment that encompasses the breadth of the seed's family.

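The loop in the flowchart above can be sketched in a few lines of Python. This is a schematic only: `build_model`, `search_database`, and `reestimate` are hypothetical callables standing in for the corresponding SAM-T98 steps (they are not SAM program names), and the count of four passes is read from the "(Iterations 1 - 3)" and "(Iteration 4)" labels in the diagram.

```python
def build_t98_style_alignment(seed, build_model, search_database, reestimate,
                              n_iterations=4):
    """Iteratively widen an alignment, starting from a seed.

    seed: list of starting sequences (a single sequence or a seed alignment).
    build_model, search_database, reestimate: caller-supplied stand-ins
    (hypothetical) for the three boxes in the flowchart.
    """
    alignment = list(seed)
    for _ in range(n_iterations):
        model = build_model(alignment)                 # build a model from the alignment
        homologs = search_database(model)              # search for additional homologs
        alignment = reestimate(alignment, homologs)    # reestimate with the new homologs
    return alignment
```

Each pass can only find homologs that the current model recognizes, so the alignment typically broadens step by step rather than all at once.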
SAM-T98 Overview: Structure Prediction

SAM-T98 Structure Prediction

Left branch:
  Build a SAM-T98 alignment from the target sequence
  Build a model from the alignment. Score PDB sequences
Right branch:
  Build a library of SAM-T98 alignments from PDB sequences
  Build models from the alignments and score the target sequence

Combine scores

Result: Fold Prediction

Starting point: A target sequence of unknown structure.

Goal: Identify the structural family of the target sequence through identification of a homolog whose structure is known.

Review of Entropy

• In statistical thermodynamics, the entropy of a system describes the following:
  - Disorder
  - Log of the number of microstates in the system
  The second definition corresponds closely to the use of the term in information theory.
• The entropy of a random variable $x$ is defined as
  $H(x) = -\sum_{x_i} P(x_i) \log_2 P(x_i)$ (bits).
• Uses of entropy include
  - Estimating the minimum expected size of a codeword needed for encoding or compressing data (encoding cost).
  - Measuring the information content or complexity of a probability distribution.
• We will use entropy to measure the information in the columns of a multiple alignment, given their amino acid distributions.

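Interpretation of entropy

• Formula: $H(x) = -\sum_{x_i} P(x_i) \log_2 P(x_i)$ (bits).
• Entropy is minimized with a peaked distribution. For a two-valued variable with probabilities 0 and 1 (taking $0 \log_2 0 = 0$):
  $H(x) = -\sum_{x_i} P(x_i) \log_2 P(x_i) = -0 \log_2(0) - 1 \log_2(1) = 0 + 0 = 0$
  [Figure: bar chart of P(x) against x = 1, 2 for the peaked distribution; vertical axis up to 1.0.]
• Entropy is maximized with a uniform distribution, achieving a maximum value of $\log_2(N)$, where $N$ is the number of values that variable $x$ can assume. For a two-valued variable with probabilities 0.5 and 0.5:
  $H(x) = -\sum_{x_i} P(x_i) \log_2 P(x_i) = -(2 \times 0.5 \log_2(0.5)) = 2 \times 0.5 \times 1.0 = 1.0$
  [Figure: bar chart of P(x) against x = 1, 2 for the uniform distribution; vertical axis up to 1.0.]
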
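As a quick numerical check of the two cases above, the following minimal Python sketch evaluates the entropy formula directly. The function name `entropy_bits` is ours, and terms with $P(x_i) = 0$ are skipped, following the usual convention that $0 \log_2 0 = 0$.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

peaked = [0.0, 1.0]       # all probability on one value
uniform = [0.5, 0.5]      # two equally likely values
twenty = [1 / 20] * 20    # uniform over 20 values, e.g. the amino acids

print(entropy_bits(peaked))    # minimum: zero bits
print(entropy_bits(uniform))   # maximum for N = 2: log2(2) = 1 bit
print(entropy_bits(twenty), math.log2(20))  # maximum for N = 20
```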
Entropy of Alignment Columns

• Consider some alignment column $c$ before it contains any amino acids.
  - Each amino acid is expected according to some prior or background probability $P_0$.
  - $P_0$ has some entropy $H_0$, also referred to as the background entropy.
• Consider column $c$ when it contains some amino acids.
  - Column $c$ has some posterior amino acid distribution $P_c$ computed from the amino acid observations plus priors.
  - $H_c$, the entropy of column $c$, is computed from the posterior distribution $P_c$.
  - If column $c$ is very conserved, then $H_c$ approaches zero. If column $c$ is very diverse, then $H_c$ approaches $H_0$.
  - We refer to $H_0 - H_c$ as the bits saved at column $c$. A large savings indicates a strong alignment signal.

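To make "bits saved" concrete, here is a minimal Python sketch. It rests on two assumptions that are ours, not SAM-T98's: the background $P_0$ is taken to be uniform over the 20 amino acids, and the posterior $P_c$ is formed with simple pseudocounts (observed counts plus `prior_weight` counts distributed according to the prior). The example columns use residues as in columns 7 and 31 of the 2crd/1cmr/2bmt alignment used in this tutorial.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def posterior(column, prior, prior_weight=1.0):
    """Toy posterior P_c: observed counts plus pseudocounts spread by the prior."""
    counts = Counter(r for r in column if r in prior)  # ignores gaps and X
    total = sum(counts.values()) + prior_weight
    return {a: (counts[a] + prior_weight * prior[a]) / total for a in prior}

def bits_saved(column, prior, prior_weight=1.0):
    """H_0 - H_c for one alignment column."""
    h0 = entropy_bits(prior.values())
    hc = entropy_bits(posterior(column, prior, prior_weight).values())
    return h0 - hc

# Assumption: uniform background P_0, so H_0 = log2(20).
uniform_prior = {a: 1 / 20 for a in AMINO_ACIDS}

print(bits_saved("CCC", uniform_prior))  # conserved column: larger savings
print(bits_saved("KRS", uniform_prior))  # varied column: smaller savings
print(bits_saved("", uniform_prior))     # no observations: posterior equals prior
```

With no observations the posterior is just the prior, so nothing is saved; as a column fills with a single residue, $H_c$ falls and the savings rise.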
Sequence Logos

• The logos shown were created using a modified version of the program by Tom Schneider [SS90]; we compute the posterior probabilities differently.
• Each column's logo reflects its amino acid posterior distribution, computed with amino acid priors and sequence weighting.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

[Figure: sequence logo for the alignment above; vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 37.]

Sequence Logos and entropy
• The height of each character is proportional to the log of the posterior probability of that amino acid in that column.
• The total height of each column represents the bits saved relative to the background entropy.
• The maximum possible savings at any column is approximately $\log_2(20) \approx 4.3$ bits.
• The logo for an alignment reflects the signal that will be modeled by an HMM or other model trained on the alignment.
• The logo shown reflects the 2crd sequence using the Blosum62 regularizer. Its signal reflects what BLAST would search for when using Blosum62.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS

[Figure: sequence logo for the single 2crd sequence with the Blosum62 regularizer; vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 37.]

Interpreting Sequence Logos
• The logos shown reflect excerpts from alignments of 1yrnB. The upper logo shows the 1yrnB sequence only. The lower logo shows an alignment of approximately 2000 sequences.
• If an alignment has a small number of sequences, the signal will be dominated by the priors, and most columns will have similar savings and a strong background signal.

[Figure: sequence logo for the 1yrnB sequence alone; vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 42.]

• If an alignment contains numerous, diverse sequences, the observed data will dominate the signal. Some columns will show very strong savings, while others will show almost none.

[Figure: sequence logo for the alignment of approximately 2000 sequences; vertical axis: bits saved (0 to 4); horizontal axis: alignment columns 1 to 42.]

Starting out
SAM-T98 Alignment Building
Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

To illustrate the steps taken by SAM-T98, we will take two sequences through the full prediction process:
1. 2crd Charybdotoxin: small, cysteine-rich neurotoxin that blocks K+ channels.
2. 1yrnB MATα2 homeodomain that regulates expression of cell type-specific genes in yeast.
These sequences have known structures, but we will ignore their structural information until the prediction process is complete.

Example 1: 2crd
What information do we know initially?

• Sequence: XFTNVSCTTS KECWSVCQRL HNTSRGKCMN KKCRCYS

Other information, which we do not use:

• Is a member of a structural class of toxins that share the same motif but block different K+/Na+ channels.
• Disulfides C7-C28, C13-C33, C17-C35 are critical for all class members.
• Activity is mediated through (different) positively charged residues that interact with negative charges on the channels.
• Logo below depicts the information in the regularized profile constructed from the single 2crd sequence.
• A BLAST search would be looking for approximately this signal.
• Through the tutorial, this signal will change as we make it more specific and sensitive.

[Figure: sequence logo for the single 2crd sequence; vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 37.]

Example 2: 1yrnB
What do we know initially?

• Sequence: TKPYRGHRFT KENVRILESW FAKNIENPYL DTKGLENLMK NTSLSRIQIK NWVSNRRRKE KTITIAPELA DDLSGEPLAK KKE

Other information, which we do not use:

• Binds sequence-specific DNA through a helix-turn-helix motif and N-terminal tail, primarily through positively-charged residues and the sugar-phosphate backbone.
• Logo below depicts the information in the regularized profile constructed from the single 1yrnB sequence.

[Figure: sequence logo for the single 1yrnB sequence, in two panels (columns 1 to 42 and columns 43 to 83); vertical axis: bits saved (0 to 2).]

Hidden Markov Models

• A hidden Markov model (HMM) is a finite-state machine with a probability for emitting each letter in each state, and with probabilities for making each transition between states.
• Probabilities of letters sum to one for each state.
• Probabilities of transitions out of each state sum to one for that state.
• We also include null states that emit no letters, but have transition probabilities on their out-edges.

Profile Hidden Markov Model

[Figure: profile HMM state diagram from Start to End, labeled with the residues of two example sequences, A and B, placed on the states that generate them.]

The same two sequences written as an alignment to the model:

a1 a2 A3 -  A4 .  A5
.  .  B1 B2 B3 b4 B5

• Circles are null states.
• Squares are match states, each of which is paired with a null delete state. We call the match-delete pair a fat state.
• Each fat state is visited exactly once on every path from Start to End.
• Diamonds are insert states, and are used to represent possible extra amino acids that are not found in most of the sequences in the family being modeled.

Computing probabilities with an HMM

• Each path from Start to End generates as many letters as there are match and insert states on the path.
• Let path $S$ be Start $\to s_1 \to \cdots \to s_n \to$ End, and let $x_1, \ldots, x_m$ be the letter-generating states of $S$ ($m \le n$).
• We can compute $\mathrm{Prob}(A \mid M)$ as a sum over all paths that have $\mathrm{len}(A)$ letter-generating states:
  $$\mathrm{Prob}(A \mid M) = \sum_{\text{paths } S} \mathrm{Prob}(A \mid M, S)\, \mathrm{Prob}(S \mid M).$$
• The probability of the path depends only on the transition parameters:
  $$\mathrm{Prob}(S \mid M) = \prod_{i=0}^{n} \mathrm{Prob}(s_{i+1} \mid M, s_i),$$
  where $s_0$ is Start and $s_{n+1}$ is End.
• The probability of the sequence given the path depends only on the emission parameters of the letter-generating states used:
  $$\mathrm{Prob}(A \mid M, S) = \prod_{i=1}^{\mathrm{len}(A)} \mathrm{Prob}(A_i \mid M, x_i).$$

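The three formulas above can be sketched directly in Python for a toy model. Everything concrete here is a hypothetical illustration: the one-fat-state model, its state names `M1` and `D1`, and its probabilities are invented for the example, and the paths are enumerated by hand. Practical HMM software does not enumerate paths; it computes the same sum with dynamic programming (the forward algorithm).

```python
import math

def path_probability(transitions, path):
    """Prob(S | M): product of transition probabilities along Start -> ... -> End."""
    states = ["Start", *path, "End"]
    return math.prod(transitions[(a, b)] for a, b in zip(states, states[1:]))

def emission_probability(emissions, path, sequence):
    """Prob(A | M, S): product of emissions over the letter-generating states of S."""
    emitting = [s for s in path if s in emissions]  # null states emit nothing
    if len(emitting) != len(sequence):
        return 0.0  # this path cannot generate a sequence of this length
    return math.prod(emissions[s][x] for s, x in zip(emitting, sequence))

def sequence_probability(transitions, emissions, paths, sequence):
    """Prob(A | M): sum over the supplied paths of Prob(A | M, S) * Prob(S | M)."""
    return sum(
        path_probability(transitions, p) * emission_probability(emissions, p, sequence)
        for p in paths
    )

# Hypothetical toy model: one fat state (match M1 or delete D1), no insert states.
transitions = {("Start", "M1"): 0.9, ("Start", "D1"): 0.1,
               ("M1", "End"): 1.0, ("D1", "End"): 1.0}
emissions = {"M1": {"A": 0.7, "C": 0.3}}  # D1 is a null state, so it has no entry
paths = [["M1"], ["D1"]]

for seq in ("A", "C", ""):
    print(repr(seq), sequence_probability(transitions, emissions, paths, seq))
```

Because the toy model can only generate "A", "C", or the empty sequence, the three probabilities printed sum to one (up to floating-point rounding).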
Generating a model from an alignment

Goal: given an alignment, build a model to match the aligned sequences, and as-yet-unseen sequences similar to those aligned.

From this:

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

To this:

[Figure: sequence logo for the alignment above; vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 37.]

Model building

SAM-T98 Alignment Building


Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

A model that is trained on an alignment should match the aligned sequences, plus other sequences similar to those aligned. This requires adding generality to the model.

Risks of not generalizing well
The logo shown is computed from the following alignment, with no generalizing. Remember that a logo reflects the signal that a model trained on the alignment will look for.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

• All of the signal comes from the training alignment, even though this alignment consists of only three close homologs.
• In all columns, any amino acid not yet seen is expected with a probability of zero.
• Such a model would find only very close homologs at best.

[Figure: sequence logo computed with no generalizing; vertical axis: bits saved (0 to 4); horizontal axis: alignment columns 1 to 37.]

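The zero-probability problem is easy to reproduce. The sketch below is only an illustration, not how SAM builds models: it estimates each column's distribution by raw relative frequency (no priors, no weighting) from columns 7 to 10 of the three-sequence alignment, then scores two four-residue fragments. The fragment CTGS is what 1txm, a family member shown in the larger alignment on the "Issues in model building" slide, has at those columns.

```python
from collections import Counter

def ml_distribution(column):
    """Relative frequencies of the residues seen in one column (no prior)."""
    counts = Counter(column)
    return {residue: n / len(column) for residue, n in counts.items()}

def profile_probability(columns, sequence):
    """Product of per-column probabilities; unseen residues get probability 0."""
    probability = 1.0
    for column, residue in zip(columns, sequence):
        probability *= ml_distribution(column).get(residue, 0.0)
    return probability

# Columns 7-10 of the 2crd / 1cmr / 2bmt alignment, read top to bottom.
columns = ["CCC", "TTS", "TTA", "SSS"]

print(profile_probability(columns, "CTTS"))  # fragment seen in training
print(profile_probability(columns, "CTGS"))  # G in column 9 was never observed
```

A single unseen residue drives the whole product to zero, which is why a model built this way can recognize only very close homologs.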
Effects of generalizing
The logo shown below is computed from the same alignment as the previous logo, but using a more sensible generalizing scheme.
• The signal reflects the alignment, but the alignment does not dominate the signal.
• The conserved cysteines and the other structurally significant columns are emphasized.
• The amino acid posteriors reflect the amino acids seen plus their likely substitutions and structural significance.

[Figure: sequence logo computed with generalizing; vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 37.]

SAM-T98 adds generality from three specific sources:
1. Sequence Weighting
2. Column Regularizers
3. Transition Regularizers

Issues in model building

2crd XFTNVSCTTSKECWSVCQRLHNTSRG.KCMNKKCRCYS
1bah XFTNVSCTTSKEXWSVCQRLHNTSRG.KCMNKKXRCYS
1sxm TIINVKCTSPKQCSKPCKELYGSSAGaKCMNGKCKCYNnx
1cmr ------CTTSKECWSVCQRLHNTSKG.WCDHRGCICES
1txm ----VSCTGSKDCYAPCRKQTGCPNA.KCINKSCKCYGcx
1lir XFTQESCTASNQCWSICKRLHNTNRG.KCMNKKCRCYS
2bmt XFTNVSCSASSQCWPVCKKLFGTYRG.KCMNSKCRCYS
1bkt VGINVKCKHSGQCLKPCKDA-GMRFG.KCINGKCDCTPkx
1big XFTDVKCTGSKQCWPVCKQMFGKPNG.KCMNGKCRCYS
Observe that in the alignment above, the first two sequences are nearly identical. Such redundancy is normal in protein families.

Problem: Training data tends to be imperfect. Training alignments can be homogeneous and skewed.

One Solution: Sequence weighting.
• First challenge: Selecting how much weight to assign each sequence relative to the others.
• Second challenge: Selecting how much total weight to assign to the sequences relative to the priors.

Henikoff Sequence Weights [HH94]

1. Each column receives a weight of 1, divided evenly between its distinct residues.
2. Within each column, the weight for each residue is divided evenly by its number of occurrences.
3. The weight of a sequence is the sum of the observation weights of its residues.

Example, using the four columns set apart in the alignment below:

2crd XFT NVSC TTSKECWSVCQRLHNTSRG
1bah XFT NVSC TTSKEXWSVCQRLHNTSRG
1sxm TII NVKC TSPKQCSKPCKELYGSSAG
1cmr --- ---C TTSKECWSVCQRLHNTSKG
1txm --- -VSC TGSKDCYAPCRKQTGCPNA
1lir XFT QESC TASNQCWSICKRLHNTNRG
2bmt XFT NVSC SASSQCWPVCKKLFGTYRG
1bkt VGI NVKC KHSGQCLKPCKDA-GMRFG
1big XFT DVKC TGSKQCWPVCKQMFGKPNG

Column            1     1     1     2     2     3     3     4
Residue(s)        N     Q     D     V     E     S     K     C
Counts            5     1     1     7     1     5     3     9
Residue Wt.       0.33  0.33  0.33  0.50  0.50  0.50  0.50  1.00
Observation Wt.   0.07  0.33  0.33  0.07  0.50  0.10  0.17  0.11

Henikoff Sequence Weights, Continued

The weight of each sequence is the sum of the observation weights of its residues in the four example columns (gap positions contribute nothing):

Sequence  Weight Computation          Absolute Weight  % Alignment Weight
2crd      0.07 + 0.07 + 0.10 + 0.11   0.35              8.7
1bah      0.07 + 0.07 + 0.10 + 0.11   0.35              8.7
1sxm      0.07 + 0.07 + 0.17 + 0.11   0.42             10.4
1cmr      0.11                        0.11              2.8
1txm      0.07 + 0.10 + 0.11          0.28              7.1
1lir      0.33 + 0.50 + 0.10 + 0.11   1.04             26.1
2bmt      0.07 + 0.07 + 0.10 + 0.11   0.35              8.7
1bkt      0.07 + 0.07 + 0.17 + 0.11   0.42             10.4
1big      0.33 + 0.07 + 0.17 + 0.11   0.68             17.1

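The three-step recipe is short enough to write out in full. The following Python sketch recomputes the weights directly from the four example columns rather than copying tabulated values; the function name `henikoff_weights` is ours, and skipping gap positions is an assumption taken from the worked example (1cmr receives weight only from its single C).

```python
from collections import Counter

# The four example columns (NVSC region) for each sequence.
ALIGNMENT = {
    "2crd": "NVSC", "1bah": "NVSC", "1sxm": "NVKC",
    "1cmr": "---C", "1txm": "-VSC", "1lir": "QESC",
    "2bmt": "NVSC", "1bkt": "NVKC", "1big": "DVKC",
}

def henikoff_weights(alignment, gap="-"):
    """Position-based sequence weights: each column contributes a total weight of 1."""
    length = len(next(iter(alignment.values())))
    weights = dict.fromkeys(alignment, 0.0)
    for col in range(length):
        # Count each distinct residue in this column, ignoring gaps.
        counts = Counter(seq[col] for seq in alignment.values() if seq[col] != gap)
        for name, seq in alignment.items():
            residue = seq[col]
            if residue != gap:
                # Steps 1 and 2: 1 / (distinct residues * occurrences of this residue).
                weights[name] += 1.0 / (len(counts) * counts[residue])
    return weights  # step 3: per-sequence sums of observation weights

weights = henikoff_weights(ALIGNMENT)
total = sum(weights.values())
for name, w in weights.items():
    print(f"{name}  {w:.2f}  {100 * w / total:5.1f}%")
```

Since every column hands out a total weight of exactly 1, the weights over four columns sum to 4, which is a convenient sanity check on any hand computation.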
Effects of low total sequence weight

• The signal is dominated by the priors, and the data seen is given little emphasis over the data not seen.
  - Column 26 contains conserved glycines, but almost everything is considered likely.
  - Column 16 contains conserved valines, but leucine and isoleucine are almost as probable.
• At this total weight, most alignments will have little influence on the model. Even if the alignment is diverse, its information will be muted.
• This total sequence weight is 0.71, and the average savings is 0.25 bits/column.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

[Figure: sequence logo for the alignment above at low total sequence weight; vertical axis: bits saved; horizontal axis: alignment columns 1 to 37.]

Effects of high total sequence weight

• The signal is dominated by the training data, even though
  the training alignment is only three sequences.
• Tryptophan 14, Glycine 26, and the six cysteine columns are
  all expected with a very high probability to remain conserved,
  even though we can't yet say if they should be conserved.
• This total sequence weight is 1.85, and its average savings is
  one bit/column.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

[Figure: sequence logo for the alignment above. Vertical axis: bits saved; horizontal axis: alignment columns 1 to 37.]

32
Selecting the total sequence weight

Empirically, we get our best results by setting the total
sequence weight so that the model saves half a bit per column on
average relative to the background entropy H0:
• If the sequences are all very similar, the average savings
  starts out high. The total sequence weight is set low to add
  more diversity to the signal.
• If the sequences are diverse, the average savings starts out
  low. The total sequence weight is set high to focus the signal
  more on the alignment and less on the priors.
• This total sequence weight is 1.11, and it saves an average of
  half a bit per column.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

[Figure: sequence logo for the alignment above. Vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 37.]

33
Estimating weights to achieve the target savings

1. The sequence weights depend on the target savings.
2. The target savings depends on the column entropies.
3. The column entropies depend on the amino acid posteriors.
4. The amino acid posteriors depend on the sequence weights.

As one might suspect, an iterative procedure is required.

The total sequence weight is set by an algorithm that starts with
an initial total weight, computes the average savings, and
modifies the total weight to approach the target savings.
• All sequences are assigned a weight between 0.0 and 1.0.
• The target savings cannot always be achieved precisely,
  especially when the target savings is too high.

34
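The iterative procedure described on the slide "Estimating weights to achieve the target savings" can be sketched in Python. This is a minimal sketch under stated assumptions, not the SAM-T98 implementation: the regularizer is a simple pseudocount stand-in (`prior_strength` times a uniform background), savings are measured as relative entropy against that background, and the search is a plain bisection on a log scale. The names `column_posterior`, `savings_bits`, `average_savings`, and `fit_total_weight` are hypothetical, and the per-sequence limit of 0.0 to 1.0 is not modeled.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a uniform background; SAM-T98 obtains P0 from its regularizer.
BACKGROUND = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def column_posterior(rel_freqs, total_weight, prior_strength=1.0):
    """Posterior for one column whose relative frequencies are scaled by total_weight.

    Stand-in regularizer (assumption): pseudocounts prior_strength * P0(a).
    """
    denom = total_weight + prior_strength
    return {
        a: (total_weight * rel_freqs.get(a, 0.0) + prior_strength * p0) / denom
        for a, p0 in BACKGROUND.items()
    }


def savings_bits(posterior):
    """Relative entropy (bits) of a column posterior versus the background."""
    return sum(p * math.log2(p / BACKGROUND[a]) for a, p in posterior.items() if p > 0.0)


def average_savings(columns, total_weight):
    """Mean savings per column at the given total sequence weight."""
    return sum(savings_bits(column_posterior(c, total_weight)) for c in columns) / len(columns)


def fit_total_weight(columns, target=0.5, lo=1e-3, hi=1e3, tol=1e-6, max_iter=200):
    """Bisect on a log scale for the total weight whose average savings meets target."""
    if average_savings(columns, hi) < target:
        return hi  # target too high to reach: return the largest weight tried
    for _ in range(max_iter):
        mid = math.sqrt(lo * hi)
        saved = average_savings(columns, mid)
        if abs(saved - target) < tol:
            return mid
        if saved < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


# Three toy columns given as relative (weighted) residue frequencies.
columns = [{"C": 1.0}, {"V": 2 / 3, "I": 1 / 3}, {"S": 1 / 3, "T": 1 / 3, "K": 1 / 3}]
weight = fit_total_weight(columns)
print(f"total weight {weight:.3f} gives {average_savings(columns, weight):.3f} bits/column")
```

Bisection works here because, with this stand-in regularizer, the average savings never decreases as the total weight grows. When the target is unreachable, the sketch returns the upper bound, mirroring the slide's remark that the target cannot always be achieved.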
Effects of sequence weighting
Using the alignment below as input, the logo reflects the
alignment with column regularization but no sequence weighting.
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....QFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....QFTQESCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....EFTDVDCSVSKECWSVCKDLFGVDRGKCMGKKCRCYQ
.....QFTDVDCSVSKECWSVCKDLFGVDRGKCMGKKCRCYQ
siegrQFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....-----SCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
.....XFTNVSCTTSKEXWSVCQRLHNTSRGKCMNKKXRCYS
.....XFTQESCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
.....XFTDVKCTGSKQCWPVCKQMFGKPNGKCMNGKCRCYS

[Figure: sequence logo for the alignment above, without sequence weighting. Vertical axis: bits saved (0 to 4); horizontal axis: alignment columns 1 to 37.]

35
Effects of sequence weighting, continued

Using the same alignment and column regularizer as the previous
logo, this logo reflects the alignment with Henikoff sequence
weighting and half a bit saved on average over background
entropy.
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....QFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....QFTQESCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....EFTDVDCSVSKECWSVCKDLFGVDRGKCMGKKCRCYQ
.....QFTDVDCSVSKECWSVCKDLFGVDRGKCMGKKCRCYQ
siegrQFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....-----SCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
.....XFTNVSCTTSKEXWSVCQRLHNTSRGKCMNKKXRCYS
.....XFTQESCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
.....XFTDVKCTGSKQCWPVCKQMFGKPNGKCMNGKCRCYS

[Figure: sequence logo for the alignment above, with Henikoff sequence weighting. Vertical axis: bits saved; horizontal axis: alignment columns 1 to 37.]

36
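The Henikoff sequence weighting mentioned on the previous slide can be illustrated with a short sketch. It assumes the standard position-based scheme: in each column, a sequence receives 1 / (k * n), where k is the number of distinct residue types in the column and n is the number of sequences sharing its residue. The function name `henikoff_weights`, the choice to skip gap characters, and the final normalization to a sum of 1 are assumptions of this sketch, not details taken from SAM-T98.

```python
from collections import Counter


def henikoff_weights(alignment, gap_chars="-."):
    """Relative position-based weights for equal-length aligned sequences."""
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment]
        counts = Counter(r for r in residues if r not in gap_chars)
        n_types = len(counts)
        if n_types == 0:
            continue  # column holds only gaps
        for i, r in enumerate(residues):
            if r not in gap_chars:
                # Rare residues in diverse columns earn the most weight.
                weights[i] += 1.0 / (n_types * counts[r])
    total = sum(weights)
    return [w / total for w in weights]


toy = ["AAA", "AAA", "AAC"]
print(henikoff_weights(toy))
```

In the toy alignment the third sequence carries the only C, so it receives the largest relative weight. These relative weights would then be scaled by the total sequence weight chosen to reach the target savings.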
Model building, Continued

SAM-T98 Alignment Building


Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

1. Sequence Weighting
2. Column Regularizers

3. Transition Regularizers

37
Regularizing Column Probabilities

Motivation: what would you consider the likelihood of anything
other than Cysteine aligning in the indicated column in each of
the two examples below?
Example 1
2crd XFTNVSCTTSKECWSVCQRLHNTSRG.KCMNKKCRCYS

Example 2
2crd XFTNVSCTTSKECWSVCQRLHNTSRG.KCMNKKCRCYS
1sxm TIINVKCTSPKQCSKPCKELYGSSAGaKCMNGKCKCYN
1cmr ------CTTSKECWSVCQRLHNTSKG.WCDHRGCICES
1bah XFTNVSCTTSKEXWSVCQRLHNTSRG.KCMNKKXRCYS
1txm ----VSCTGSKDCYAPCRKQTGCPNA.KCINKSCKCYG
1lir XFTQESCTASNQCWSICKRLHNTNRG.KCMNKKCRCYS
2bmt XFTNVSCSASSQCWPVCKKLFGTYRG.KCMNSKCRCYS
1bkt VGINVKCKHSGQCLKPCKDA-GMRFG.KCINGKCDCTP
1big XFTDVKCTGSKQCWPVCKQMFGKPNG.KCMNGKCRCYS

38
Desirable Properties in a Column Regularizer

A good column regularizer should do the following:

• Compute a set of column posteriors that is biologically
  realistic.
• Assign nonzero probability to the amino acids not yet seen.
• Respond to any pattern of conservation of residues or
  physicochemical property visible in the column.
• Take into account the size and weight of the alignment.
• Be inexpensive to compute and easy to understand.

39
Comparison of Column Regularizers

• The following slides show the logos produced using the same
  3-sequence alignment, same weighting scheme, and different
  regularizers:
  - Maximum-likelihood estimate
  - Add-one pseudocount
  - Gribskov average score [GME87] with BLOSUM62
    matrix [HH92]
  - Recode3.20comp Dirichlet mixture [SKB+96]
  Other regularizers are evaluated in the handouts [Kar95].
• All logos depict the savings at each column over the
  background entropy, where the background amino acid
  distribution P0 is obtained by using the regularizer with no
  amino acid observations.

40
Maximum-Likelihood Estimate

Description: Given a set of observed column counts $\vec{O}$, the posterior
probability for amino acid $a$, $\hat{P}(a \mid \vec{O})$, is computed as follows:

\[ \hat{P}(a \mid \vec{O}) = \frac{O(a)}{\sum_{\text{amino acids } j} O(j)} \]

• Easy to compute.
• The distribution that gives the highest probability to the observed data.
• Gives infinite cost (zero probability) to any amino acid that has not
  already been seen.
• Performs well when $\vec{O}$ is very, very large, but terribly in real cases.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

[Figure: sequence logo for the alignment above using the maximum-likelihood estimate. Vertical axis: bits saved (0 to 4); horizontal axis: alignment columns 1 to 37.]

41
Add-One Pseudocount

Description: Given a set of observed column counts $\vec{O}$, the posterior
probability for amino acid $a$, $\hat{P}(a \mid \vec{O})$, is computed as follows:

\[ \hat{P}(a \mid \vec{O}) = \frac{O(a) + 1}{\sum_{\text{amino acids } j} \left( O(j) + 1 \right)} \]

• Easy to compute.
• No knowledge: all distributions equally likely a priori.
• Performs well when $\vec{O}$ is very large, but poorly in most other cases.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

[Figure: sequence logo for the alignment above using the add-one pseudocount. Vertical axis: bits saved (0 to 1); horizontal axis: alignment columns 1 to 37.]

The logo above is not erroneous. You see no pronounced signal at any column
because there is none. The low weights combined with the large pseudocounts
make the posterior probabilities close to uniform.

42
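The two estimators on the slides "Maximum-Likelihood Estimate" and "Add-One Pseudocount" can be compared on a single column. The sketch below is illustrative only; the function names `ml_estimate` and `add_one_estimate` are hypothetical. It uses column 7 of the three-sequence alignment (cysteine in every sequence) with the total sequence weight of 1.11 quoted on the slide "Selecting the total sequence weight".

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ml_estimate(counts):
    """Maximum-likelihood posterior: observed (weighted) frequencies."""
    total = sum(counts.values())
    return {a: counts.get(a, 0.0) / total for a in AMINO_ACIDS}


def add_one_estimate(counts):
    """Add-one posterior: one pseudocount for each of the 20 amino acids."""
    total = sum(counts.values()) + len(AMINO_ACIDS)
    return {a: (counts.get(a, 0.0) + 1.0) / total for a in AMINO_ACIDS}


# Weighted counts for a fully conserved cysteine column at total weight 1.11.
counts = {"C": 1.11}
ml = ml_estimate(counts)
add_one = add_one_estimate(counts)
print("ML:      P(C) =", ml["C"], " P(A) =", ml["A"])
print("add-one: P(C) =", round(add_one["C"], 3), " P(A) =", round(add_one["A"], 3))
```

Under the maximum-likelihood estimate, cysteine gets probability 1 and every unseen amino acid gets 0. Under add-one, cysteine gets roughly 0.10 against about 0.047 for each unseen amino acid, which is why the add-one logo shows almost no signal at this weight.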
Gribskov Average Score [GME87] with BLOSUM62 Matrix [HH92]

Description: The posterior probability of some residue $a$ given
the set of counts $\vec{O}$ observed is defined according to a set of
background probabilities $P_0$ and a substitution matrix $M$:

\[ \hat{P}(a \mid \vec{O}) \propto P_0(a) \exp\!\left( \frac{\sum_b M_{a,b}\, O(b)}{|\vec{O}|} \right) \]

• Easy to compute.
• Achieves excellent performance with one sequence.
• Does not change as the total sequence weight changes.
• Does not make optimal use of multiple sequences.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

[Figure: sequence logo for the alignment above using the Gribskov average score with BLOSUM62. Vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 37.]

43
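The Gribskov average-score regularizer can be sketched on a toy problem. The three-letter alphabet, the uniform background, and the small substitution matrix below are invented for illustration (they are not BLOSUM62 values), the matrix is assumed to be in natural-log units, and `gribskov_posterior` is a hypothetical name. The example checks the slide's claim that the result does not change with the total sequence weight.

```python
import math


def gribskov_posterior(counts, background, matrix):
    """Posterior proportional to P0(a) * exp(average substitution score against the column)."""
    total = sum(counts.values())
    unnorm = {
        a: p0 * math.exp(sum(matrix[a][b] * o for b, o in counts.items()) / total)
        for a, p0 in background.items()
    }
    norm = sum(unnorm.values())
    return {a: u / norm for a, u in unnorm.items()}


# Toy values (assumptions): uniform background and a made-up log-odds matrix.
BACKGROUND = {"C": 1 / 3, "V": 1 / 3, "I": 1 / 3}
MATRIX = {
    "C": {"C": 2.0, "V": -1.0, "I": -1.0},
    "V": {"C": -1.0, "V": 1.0, "I": 0.5},
    "I": {"C": -1.0, "V": 0.5, "I": 1.0},
}

one = gribskov_posterior({"V": 1.0}, BACKGROUND, MATRIX)
ten = gribskov_posterior({"V": 10.0}, BACKGROUND, MATRIX)
print(one)
print(ten)
```

Because the score is averaged over the counts, observing valine with weight 1 or weight 10 gives the same posterior. The posterior also ranks isoleucine above cysteine, reflecting the substitution matrix.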
Dirichlet Mixtures [SKB+96]

Description: Several different pseudocounts (components) are fit
to patterns of conservation and substitution within amino acid
columns. To compute the posterior distribution of an alignment
column, the probability of each component is computed and used to
mix together the posterior probabilities given the components:

\[ \hat{P}(a \mid \vec{O}) = \sum_{\text{components } c} P(c \mid \vec{O}) \, \frac{O(a) + \alpha_{c,a}}{\sum_{\text{amino acids } b} \left( O(b) + \alpha_{c,b} \right)} \]

where $\alpha_{c,a}$ is the pseudocount of component $c$ for amino acid $a$.

• Can be hard to understand.
• Performs well with a broad range of alignment sizes.

2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS

[Figure: sequence logo for the alignment above using the Dirichlet mixture regularizer. Vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 37.]

44
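The Dirichlet mixture posterior can be sketched with a toy two-component mixture. Everything numeric below is invented for illustration: a three-letter alphabet (C, V, I), one component that favors cysteine, and one that favors the hydrophobic pair V/I. The component posteriors use the standard Dirichlet-multinomial likelihood evaluated with `math.lgamma`; the function names are hypothetical and this is not the Recode3.20comp mixture.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function of a pseudocount vector."""
    return sum(math.lgamma(x) for x in alpha) - math.lgamma(sum(alpha))


def dirichlet_mixture_posterior(counts, mix_coeffs, alphas):
    """Return (posterior over residues, posterior over components) for one column."""
    # log P(c | O) up to a constant: log q_c + log B(alpha_c + O) - log B(alpha_c)
    log_w = [
        math.log(q) + log_beta([o + z for o, z in zip(counts, alpha)]) - log_beta(alpha)
        for q, alpha in zip(mix_coeffs, alphas)
    ]
    peak = max(log_w)
    w = [math.exp(x - peak) for x in log_w]
    norm = sum(w)
    comp_post = [x / norm for x in w]

    n = sum(counts)
    posterior = [0.0] * len(counts)
    for pc, alpha in zip(comp_post, alphas):
        denom = n + sum(alpha)
        for i, (o, z) in enumerate(zip(counts, alpha)):
            posterior[i] += pc * (o + z) / denom
    return posterior, comp_post


# Toy mixture (assumption): residue order is C, V, I.
MIX = [0.5, 0.5]
ALPHAS = [[2.0, 0.05, 0.05], [0.05, 1.0, 1.0]]

posterior, comp_post = dirichlet_mixture_posterior([0.0, 1.0, 0.0], MIX, ALPHAS)
print("component posteriors:", comp_post)
print("P(C), P(V), P(I):", posterior)
```

After a single valine observation, the hydrophobic component dominates, so isoleucine ends up far more probable than cysteine even though neither was observed. This is the behavior that a single pseudocount vector cannot reproduce.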
Regularizer Performance at Various Sequence Weights
To assess the performance of the regularizers at various total
weights, below we compare their expected savings given one
observation of amino acid $b$ at various weights.
Expected column entropy $E[H_c]$ is computed as shown, where
$\hat{P}(a, b)$ is the posterior probability of amino acids $a$ and $b$ in
column $c$, and $P(a)$ and $P(b)$ are their prior probabilities:

\[ E[H_c] = \sum_{\text{amino acids } a,b} \hat{P}(a, b) \log \frac{\hat{P}(a, b)}{P(a) P(b)}
          = \sum_{\text{amino acids } a,b} \hat{P}(a \mid b) P(b) \log \frac{\hat{P}(a \mid b)}{P(a)} \]
[Figure: "Relative entropy of regularizer as function of sequence weight". Curves: Uprior9 Dirichlet Mixture, Recode2.20comp Dirichlet Mixture, Recode3.20comp Dirichlet Mixture, Gribskov Average Score with Blosum62, Add-one Pseudocount. Vertical axis: Bits saved (0.00 to 4.50); horizontal axis: Weight of single amino acid (1.0e-03 to 1.0e+04).]

45
Peak savings versus weight

The previous slide showed the average bit savings as a function
of weight, but an exact match to the single character saves more
than average.
Matching column entropy is computed as

\[ H_{\text{match}} = \sum_{\text{amino acids } a} P(a) \log \frac{\hat{P}(a \mid a)}{P(a)} \]
[Figure: "Relative entropy of regularizer as function of sequence weight". Curves: recode2.match, recode3.match, recode2.average, recode3.average. Vertical axis: Bits saved (0 to 4.5); horizontal axis: Weight of single amino acid (0.01 to 10).]

46
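The average and matching savings defined on the two preceding slides can be evaluated directly for the simplest regularizer. The sketch below uses the add-one pseudocount with a uniform prior, because it needs no trained parameters; the Dirichlet mixture curves in the plots would require the actual mixture files. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PRIOR = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}  # uniform prior for add-one


def add_one_posterior(a, b, weight):
    """Posterior of residue a after one observation of residue b with the given weight."""
    observed = weight if a == b else 0.0
    return (observed + 1.0) / (weight + len(AMINO_ACIDS))


def average_savings(weight):
    """Expected savings in bits over all (a, b) pairs."""
    total = 0.0
    for b in AMINO_ACIDS:
        for a in AMINO_ACIDS:
            p = add_one_posterior(a, b, weight)
            total += PRIOR[b] * p * math.log2(p / PRIOR[a])
    return total


def match_savings(weight):
    """Savings in bits when the scored residue matches the observed one."""
    return sum(
        PRIOR[a] * math.log2(add_one_posterior(a, a, weight) / PRIOR[a])
        for a in AMINO_ACIDS
    )


for w in (0.1, 1.0, 10.0, 100.0):
    print(f"weight {w:6.1f}: average {average_savings(w):.3f} bits, match {match_savings(w):.3f} bits")
```

For every weight, the matching savings exceeds the average savings, which is the relationship the "Peak savings" plots illustrate for the trained regularizers.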
Peak savings versus Average savings

We can plot the savings for the matching character versus the
average savings: all the regularizers trained on protein data
behave similarly.
At 0.5 bits/character on average, we get about 2.2 bits of savings
for a matching character.
[Figure: "Relative entropy of matching amino acid versus average relative entropy". Curves: uprior9, merge-opt13, recode2, recode3, blosum62, add-one. Vertical axis: Match bits saved (0.1 to 10); horizontal axis: Average bits saved (0.01 to 1).]

47
Model building, Continued

SAM-T98 Alignment Building


Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

1. Sequence Weighting

2. Column Regularizers
3. Transition Regularizers

48
Model Transitions
There are nine types of transitions that can be made:
• From the match state: match-match, match-delete, and
  match-insert.
• From the delete state: delete-match, delete-delete, and
  delete-insert.
• From the insert state: insert-match, insert-delete, and
  insert-insert.
The probability of entering a delete or insert state from a match
state is analogous to the gap opening cost in other methods, and
the delete-delete and insert-insert probabilities are analogous to
the gap extension cost.
[Figure: paths through the HMM from Start to End for two sequences, A and B, with the alignment they induce shown below.]

a1 a2 A3 -  A4 .  A5
.  .  B1 B2 B3 b4 B5

49
Transition posteriors in SAM-T98

• SAM-T98 applies a system of transition regularizers to
  estimate transition costs.
• These transition regularizers contain small pseudocounts
  representing prior odds for each transition.
• Because we just use pseudocounts, transition parameter
  estimation is crude compared to the use of Dirichlet mixture
  priors to estimate match state probabilities.
• For any adjacent states in the model (or columns in the
  training alignment), SAM-T98 computes the transition costs
  based on two factors:
  1. observed transitions in the (weighted) training alignment
  2. the pseudocounts in the transition regularizer

50
Transition posteriors in SAM-T98, contd.

• Each alignment column is modeled with distinct transition
  costs.
• All transitions have finite cost (i.e. 0.0 < probability ≤ 1.0).
• When the sequence weights are very small, the transition costs
  reflect the regularizer.
• When the sequence weights are large, the transition costs
  reflect the observations. Gaps are most likely where the
  alignment already contains gaps.
• The transition regularizers we use are stiff, so observations do
  not change transition probabilities much from the
  background.

51
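The way observed transitions and pseudocounts combine can be sketched as follows. The pseudocount values are invented for illustration and are not SAM-T98's regularizer; `transition_costs` is a hypothetical name. Costs are reported in nats as negative log probabilities.

```python
import math


def transition_costs(observed, pseudocounts):
    """Posterior probabilities and costs (nats) for the transitions leaving one state."""
    total = sum(observed.get(t, 0.0) + z for t, z in pseudocounts.items())
    probs = {t: (observed.get(t, 0.0) + z) / total for t, z in pseudocounts.items()}
    costs = {t: -math.log(p) for t, p in probs.items()}
    return probs, costs


# Hypothetical pseudocounts for the three transitions out of a match state.
MATCH_PSEUDOCOUNTS = {"match-match": 9.0, "match-delete": 0.5, "match-insert": 0.5}

# With no observations, the posterior equals the regularizer.
prior_probs, prior_costs = transition_costs({}, MATCH_PSEUDOCOUNTS)

# With heavily weighted observations of a deletion, the data dominate.
data_probs, data_costs = transition_costs({"match-delete": 1000.0}, MATCH_PSEUDOCOUNTS)

print(prior_probs)
print(data_probs)
```

Because every pseudocount is positive, every transition keeps a nonzero probability and hence a finite cost. Larger ("stiffer") pseudocounts would require proportionally more weighted observations before the probabilities move away from the background.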
Selecting a transition regularizer

• We have never found a transition regularizer that works well
  in all contexts.
• SAM-T98 uses three different transition regularizers:
  - In the first iteration, only very close homologs are
    admitted into the training alignment, and the transition
    regularizer used allows numerous gaps.
  - The transition regularizer used in the next two iterations
    encourages long stretches of match states.
  - The final iteration uses a transition regularizer trained on
    FSSP structural alignments [HS97].
    • Based on structural alignments of distantly related
      proteins.
    • Not as stiff, and hopefully a better approximation to
      the "true" transition parameter values.

52
Structural Context and Transition Parameter Estimation

Secondary-structure-dependent gap opening and gap extension
penalties

Advantages: Gap costs are appropriate for the secondary
structure.
Disadvantages: Gaps are not uniformly likely across a
secondary structure element. For example, deletion of two
residues at the start of a helix is less surprising than deletion
of two residues in the middle.

53
Searching with the model
SAM-T98 Alignment Building
Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

Searching for additional homologs means applying the model to
a database containing potential homologs, scoring each sequence
in turn, and selecting those that score well. Issues include:
• Execution time
  - Reducing the number of sequences to score
  - Reducing the time necessary to score each sequence
• Significance: when does a score imply homology?
  - Selecting a reasonable null model
  - Selecting a reasonable score threshold given the null model

54
Reducing the Number of Sequences to Score

Problem:
• As of April 1999, the non-redundant database
  contains close to 400,000 sequences.
• Performing a full search with an HMM takes on the order
  of fifteen hours.
• SAM-T98 requires four rounds of database searching.

One Solution: Use BLAST [AGM+90] to select potential
remote homologs from the non-redundant database. Search
against the BLAST hits rather than the full database.

55
Reducing the Number of Sequences to Score, contd.

SAM-T98 uses BLAST to identify two pools of potential
homologs:
1. A small pool of very strong matches (ln(P-value) ≤ -10.5),
   used for the first iteration only.
2. A larger pool of potential homologs (E ≤ 500), used for
   subsequent iterations.
The "BLAST trick" yields a potential homolog set of 10,000
sequences in the worst case, far fewer in a normal case.

56
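The two BLAST pools can be sketched as a simple filter. The `BlastHit` record and the `split_pools` function are hypothetical (real BLAST output would first need to be parsed into this shape), and the cutoffs are this sketch's reading of the thresholds on the slide above: ln(P-value) at most -10.5 for the strong pool and an E-value of at most 500 for the broad pool.

```python
import math
from typing import NamedTuple


class BlastHit(NamedTuple):
    """Hypothetical parsed BLAST hit."""
    seq_id: str
    p_value: float
    e_value: float


def split_pools(hits, max_ln_p=-10.5, max_e=500.0):
    """Return (strong pool for iteration 1, broad pool for later iterations)."""

    def ln_p(hit):
        # A reported P-value of 0 is treated as an arbitrarily strong match.
        return math.log(hit.p_value) if hit.p_value > 0.0 else float("-inf")

    strong = [h.seq_id for h in hits if ln_p(h) <= max_ln_p]
    broad = [h.seq_id for h in hits if h.e_value <= max_e]
    return strong, broad


hits = [
    BlastHit("close_homolog", 1e-20, 1e-15),
    BlastHit("remote_candidate", 0.01, 3.0),
    BlastHit("noise", 1.0, 900.0),
]
strong_pool, broad_pool = split_pools(hits)
print(strong_pool, broad_pool)
```

The strong pool is a subset of the broad pool in this example, so sequences admitted in the first iteration remain available to later iterations.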
Reducing the Time Necessary to Score Each Sequence
• Kestrel [HHKS96], a specialized parallel processor under
  development at UCSC, reduces the search times on a
  database containing 10^7 amino acids as shown:

                            Kestrel     On a 433 MHz DEC Alpha
                            (20 MHz)
  Query size                ≤ 512       32    64    128   256   512
  Execution time (seconds)  41          80    136   232   442   862

• Runtime comparison for scoring 100 sequences (each
  256 amino acids in length) against a database of 10^7 amino
  acids:
  - Kestrel: 1hr 8min
  - 433 MHz DEC Alpha: 12hr 17min
• At this time, SAM-T98 does not use Kestrel. However, if
  database sizes continue to grow exponentially, specialized
  processors such as Kestrel might become the best solution for
  limiting database search times.
• We expect Kestrel to obviate the need for the "BLAST trick".

57
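The runtime comparison on the previous slide follows from the per-query times in the table: 100 queries of length 256 cost 100 times the single-query time. A small check, with the hypothetical helper `format_runtime`:

```python
def format_runtime(n_queries, seconds_per_query):
    """Total runtime for n_queries, formatted as hours and rounded minutes."""
    total = n_queries * seconds_per_query
    hours, remainder = divmod(total, 3600)
    minutes = round(remainder / 60)
    return f"{hours}hr {minutes}min"


# Per-query times for a 256-residue query, taken from the table.
print("Kestrel:  ", format_runtime(100, 41))
print("DEC Alpha:", format_runtime(100, 442))
```

The totals are 4,100 seconds for Kestrel and 44,200 seconds for the DEC Alpha, which match the 1hr 8min and 12hr 17min quoted on the slide.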
Scoring HMMs and Bayes Rule

• The model $M$ is a computable function that assigns a
  probability $\mathrm{Prob}(A \mid M)$ to each string $A$.
• When given a string $A$, we want to know how likely the model
  is. That is, we want to compute something like $\mathrm{Prob}(M \mid A)$.
• Bayes Rule:

  \[ \mathrm{Prob}(M \mid A) = \mathrm{Prob}(A \mid M) \, \frac{\mathrm{Prob}(M)}{\mathrm{Prob}(A)} \]

• Problem: $\mathrm{Prob}(A)$ and $\mathrm{Prob}(M)$ are inherently unknowable.

58
Null models

• Standard solution: ask how much more likely $M$ is than some
  null hypothesis (represented by a null model $N$):

  \[ \frac{\mathrm{Prob}(M \mid A)}{\mathrm{Prob}(N \mid A)} = \frac{\mathrm{Prob}(A \mid M) \, \mathrm{Prob}(M)}{\mathrm{Prob}(A \mid N) \, \mathrm{Prob}(N)} \]

• $\mathrm{Prob}(M) / \mathrm{Prob}(N)$ is the prior odds ratio, and represents our belief in
  the likelihood of the model before seeing any data.
• $\mathrm{Prob}(M \mid A) / \mathrm{Prob}(N \mid A)$ is the posterior odds ratio, and represents our
  belief in the likelihood of the model after seeing the data.
• We can generalize to a forced choice among many models
  $(M_1, \ldots, M_n)$:

  \[ \frac{\mathrm{Prob}(M_i \mid A)}{\sum_j \mathrm{Prob}(M_j \mid A)} = \frac{\mathrm{Prob}(A \mid M_i) \, \mathrm{Prob}(M_i)}{\sum_j \mathrm{Prob}(A \mid M_j) \, \mathrm{Prob}(M_j)} \]

  The $\mathrm{Prob}(M_j)$ values can be scaled arbitrarily without
  affecting the ratio.

59
Standard Null Model

• The null model is an i.i.d. (independent, identically distributed)
  model; that is, each letter is treated as being independently
  drawn from the background distribution.

  \[ \mathrm{Prob}(A \mid N, \mathrm{len}(A)) = \prod_{i=1}^{\mathrm{len}(A)} \mathrm{Prob}(A_i) \]

  \[ \mathrm{Prob}(A \mid N) = \mathrm{Prob}(\text{string of length } \mathrm{len}(A)) \prod_{i=1}^{\mathrm{len}(A)} \mathrm{Prob}(A_i) \]

• The length modeling is often omitted, but one must then be
  careful to normalize the probabilities correctly.

60
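The standard null model and the resulting log-odds score can be sketched as follows. The three-letter background and the model log-probability are toy values, the function names are hypothetical, and the cost is assumed to be ln Prob(A | N) minus ln Prob(A | M), so that more negative costs indicate better matches, in line with the negative nat thresholds quoted later in the tutorial.

```python
import math


def null_log_prob(seq, background, length_log_prob=0.0):
    """ln Prob(A | N) for an i.i.d. null model; length_log_prob is ln Prob(length), often omitted."""
    return length_log_prob + sum(math.log(background[a]) for a in seq)


def cost_in_nats(model_log_prob, seq, background):
    """Negative log-odds of the model against the null; more negative is a better match."""
    return null_log_prob(seq, background) - model_log_prob


# Toy example (assumption): a model that gives the string "CC" probability 0.81.
background = {"A": 0.5, "C": 0.1, "G": 0.4}
cost = cost_in_nats(math.log(0.81), "CC", background)
print(f"cost = {cost:.3f} nats")
```

Here the model assigns the string a much higher probability than the background does, so the cost is strongly negative. The same effect arises for any sequence rich in a residue the model favors, which is the weakness the reversed-model null addresses.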
Reversed model for null

• When using the standard null model, certain sequences and
  HMMs have anomalous behavior. Many of the problems are
  due to unusual composition: a large number of some usually
  rare amino acid.
• For example, metallothionein, with 24 cysteines in only 61
  total amino acids, scores well on any model with multiple
  highly conserved cysteines.
• We avoid this (and several other problems) by using a
  reversed model $M^r$ as the null model.
• The probability of a sequence in $M^r$ is exactly the same as
  the probability of the reversal of the sequence given $M$.
• If we assume that $M$ and $M^r$ are equally likely, then

  \[ \frac{\mathrm{Prob}(M \mid S)}{\mathrm{Prob}(M^r \mid S)} = \frac{\mathrm{Prob}(S \mid M)}{\mathrm{Prob}(S \mid M^r)} \]

• This method corrects for composition biases, length biases,
  and several subtler biases.

61
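The effect of the reversed-model null can be demonstrated on a toy, gapless profile. All numbers are invented: a four-letter alphabet, a five-column profile with two conserved cysteine columns and one conserved glycine column, and two test sequences. A real HMM would also have insert and delete states; this sketch only shows how scoring against the reversed sequence cancels a composition-driven signal.

```python
import math

BACKGROUND = {"A": 0.4, "C": 0.1, "D": 0.25, "G": 0.25}


def conserved(residue, p=0.91):
    """Column distribution with probability p on one residue, the rest shared equally."""
    rest = (1.0 - p) / (len(BACKGROUND) - 1)
    return {a: (p if a == residue else rest) for a in BACKGROUND}


# Toy profile: conserved C, background, conserved C, conserved G, background.
PROFILE = [conserved("C"), BACKGROUND, conserved("C"), conserved("G"), BACKGROUND]


def model_log_prob(seq):
    """ln Prob(S | M) for a sequence of the same length as the profile."""
    return sum(math.log(column[a]) for column, a in zip(PROFILE, seq))


def standard_null_cost(seq):
    """Cost in nats against the i.i.d. background null."""
    return sum(math.log(BACKGROUND[a]) for a in seq) - model_log_prob(seq)


def reversed_null_cost(seq):
    """Cost in nats against the reversed model: Prob(S | M^r) = Prob(reverse(S) | M)."""
    return model_log_prob(seq[::-1]) - model_log_prob(seq)


for name, seq in (("family member", "CACGA"), ("cysteine-rich", "CCCAC")):
    print(f"{name:14s} standard {standard_null_cost(seq):7.2f}  reversed {reversed_null_cost(seq):7.2f}")
```

The family member scores well (negative cost) under both nulls. The cysteine-rich sequence also looks like a hit under the standard null, but against the reversed model its cost becomes positive, because its reversal fits the profile at least as well as the sequence itself.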
Composition as source of error

A cysteine-rich protein, such as metallothionein, can match any
HMM that has several highly conserved cysteines, even if
they have quite different structures:

                        cost in nats
HMM     sequence    standard null model    reversed-model
1kst    4mt2              -21.15                 0.01
1kst    1tabI             -15.04                -0.93
4mt2    1kst              -15.14                -0.10
4mt2    1tabI             -21.44                -1.44
1tabI   1kst              -17.79                -7.72
1tabI   4mt2              -19.63                -1.79

62
Composition examples

Metallothionein Isoform II (4mt2)

Kistrin (1kst)

63
Composition examples

Kistrin (1kst)

Trypsin-binding domain of Bowman-Birk Inhibitor (1tabI)

64
Long helices as source of error

Long helices can provide strong similarity signals from the periodic
hydrophobicity, even when the overall folds are quite different:

                    cost in nats, normalized using
HMM     sequence    Null model    reversed-model
1av1A   2tmaA         -22.06           2.13
1av1A   1aep          -21.25           1.03
1av1A   1cii          -13.67          -1.75
1av1A   1vsgA          -7.89          -0.51
2tmaA   1cii          -20.62           0.46
2tmaA   1av1A         -17.96           1.01
2tmaA   1aep          -12.01           0.78
2tmaA   1vsgA          -8.25           0.08
1vsgA   2tmaA         -14.82          -1.20
1vsgA   1av1A         -13.04          -2.68
1vsgA   1aep          -13.02          -3.52
1vsgA   1cii          -11.12           0.28
1aep    1av1A         -11.30           1.79
1aep    2tmaA         -10.73           1.06
1aep    1cii           -8.35           1.38
1aep    1vsgA          -6.87           0.53
1cii    2tmaA         -23.24          -1.48
1cii    1av1A         -19.49          -5.62
1cii    1aep          -12.85          -1.77
1cii    1vsgA         -10.20          -1.57

65
Helix examples

Tropomyosin (2tmaA)

Colicin Ia (1cii)

Flavodoxin mutant (1vsgA)

66
Helix examples

Apolipophorin III (1aep)

Apolipoprotein A-I (1av1A)

67
Discrimination Performance as a Function of Null Model

[Figure: false positives versus true positives on SCOP whole chains, with one curve without reversed-model scoring and one with reversed-model scoring. Vertical axis: False Positives (1 to 1000); horizontal axis: True Positives (150 to 450).]

68
Null Model Scores: Significance and Threshold Selection

After selecting a null model, the next step is to test the
performance and select threshold values using a test appropriate
to the final application.

[Figure: "Calibration curve for target models". Curves: superfamily, and 0.1 * N * exp(cost). Vertical axis: False Positives (1 to 100000); horizontal axis: SAM-T98 target model cost in nats (-40 to 0).]

On the first iteration, SAM-T98 uses a strict threshold (-40 nats)
to admit only strong matches. In later iterations, the threshold is
progressively loosened (-30, -24, and -16 nats).

69
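The progressively loosened thresholds fit into the four-iteration loop shown in the alignment-building diagram. The skeleton below is a sketch only: `build_model`, `score`, and `reestimate` are hypothetical callables standing in for the SAM programs, and the pool selection follows the description of the two BLAST pools earlier in the tutorial. The stubs at the bottom exist solely to make the example runnable.

```python
# Thresholds in nats for iterations 1 to 4, as quoted on the slide above.
THRESHOLDS_NATS = [-40.0, -30.0, -24.0, -16.0]


def build_alignment(seed, strong_pool, broad_pool, build_model, score, reestimate):
    """Iteratively grow an alignment, admitting sequences whose cost is at or below the threshold."""
    alignment = [seed]
    for iteration, threshold in enumerate(THRESHOLDS_NATS):
        # The first iteration searches only the strong BLAST hits.
        pool = strong_pool if iteration == 0 else broad_pool
        model = build_model(alignment)
        hits = [s for s in pool if s not in alignment and score(model, s) <= threshold]
        alignment = reestimate(alignment, hits)
    return alignment


# Stub components (assumptions) with fixed costs per sequence.
toy_costs = {"a": -50.0, "b": -20.0, "c": -5.0}
result = build_alignment(
    seed="seed",
    strong_pool=["a"],
    broad_pool=["a", "b", "c"],
    build_model=lambda alignment: None,
    score=lambda model, seq: toy_costs[seq],
    reestimate=lambda alignment, hits: alignment + hits,
)
print(result)
```

In the stub run, sequence "a" is admitted in the first iteration, "b" only once the threshold reaches -16 nats, and "c" never. In a real run, the scores would change between iterations as the model is rebuilt from the growing alignment.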
Reestimating the Alignment with New Family Members

SAM-T98 Alignment Building


Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

Alignment reestimation adds the new homologs to the alignment
and attempts to find new signal made available by their addition.
In SAM-T98, reestimation is performed by buildmodel, SAM's
implementation of the EM algorithm.

Issues in alignment reestimation:
• Keeping the seed sequence from drifting into a gap
• Preserving signal from a trusted seed alignment

70
Effects of Reestimation
The sequence logos shown were generated by the alignment
produced for 1yrnB in the third iteration of SAM-T98. The
upper logo reflects the alignment before reestimation, and the
lower logo after reestimation.
[Figure: sequence logo of the 1yrnB alignment before reestimation. Vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 42.]

[Figure: sequence logo of the 1yrnB alignment after reestimation. Vertical axis: bits saved (0 to 2); horizontal axis: alignment columns 1 to 42.]

• Asparagine 27 increases substantially.
• Leucines 17 and 38 and Tyrosine 29 all increase slightly.
• Proline 3 decreases substantially. However, Proline 3 does not
  figure prominently in either the FSSP [HS97] or HSSP [SS96]
  alignment.

71
Effects of Reestimation, Continued

The excerpts below show sequences from the alignment before
reestimation (left) and after reestimation (right). The increase in
the signal of Asparagine 27 is due to the realignment of the
second Histidine in Sequence 11.
20 30 20 30
| | | |
1 WFA.KNI.ENPYLD 1 WFA.KNI.ENPYLD
2 WFA.KNIaENPYLD 2 WFA.KNIaENPYLD
3 WFA.KNI.ENPYLD 3 WFA.KNI.ENPYLD
4 WFAkKNI.ENPYLD 4 WFAkKNI.ENPYLD
5 WFA.KNI.ENPYLD 5 WFA.KNI.ENPYLD
6 WFA.KNI.ENPYLD 6 WFA.KNI.ENPYLD
7 WLL.NHL.NNPYPT 7 WLL.NHL.NNPYPT
8 WFA.KNI.ENPYLD 8 WFA.KNI.ENPYLD
9 WLH.EHV.NNPYPT 9 WLH.EHV.NNPYPT
10 WFA.KNI.ENPYLD 10 WFA.KNI.ENPYLD
11 VFQ.H--.-HQYLS 11 VFQ.HH-.--QYLS

A small change of alignment can yield a pronounced change in
signal.

72
Pitfalls in Alignment Reestimation

Issue: The alignment of certain sequences should not be
reestimated.
• If a seed alignment is used, alignment of the seed
  sequences should not change.
• If some insert state is used by a large number of homologs,
  its costs will drop and the template sequence might choose
  the insert over some match state(s).
To illustrate, here is a portion of the alignment for 1yrnB from
the fourth iteration of SAM-T98:
10
|
1yrnB t....KPYRGhr---FTKENVRIL
1hom m....-RKRG..RQTYTRYQTLEL
1b72B a....--RRK..RRNFNKQATEIL
1mnmC inkstKPYRGhr---FTKENVRIL
1enh .....---RP..RTAFSSEQLARL
1fjlA ialkrKQRRS..RTTFSASQLDEL
1aplC t....KPYRGhr---FTKENVRIL

Note the loss of potential alignment to 1yrnB by 1aplC and
1mnmC.
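The rows of the excerpt can be read mechanically. The sketch below assumes the usual SAM alignment convention, which this slide does not spell out: uppercase letters are residues in match states, lowercase letters are residues in insert states, `-` is a delete (a skipped match state), and `.` is padding that lines up other sequences' insertions. The helper name `state_usage` is hypothetical.

```python
def state_usage(row):
    """Count match, insert, and delete states used by one aligned row."""
    usage = {"match": 0, "insert": 0, "delete": 0}
    for ch in row:
        if ch.isupper():
            usage["match"] += 1
        elif ch.islower():
            usage["insert"] += 1
        elif ch == "-":
            usage["delete"] += 1
        # '.' is padding only and uses no model state
    return usage

rows = {
    "1yrnB": "t....KPYRGhr---FTKENVRIL",
    "1hom":  "m....-RKRG..RQTYTRYQTLEL",
}
for name, row in rows.items():
    print(name, state_usage(row))
```

Under that convention the 1yrnB row places two residues (`hr`) in an insert state and then skips three match states, which is the pitfall described above: the template itself has taken the insert route.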

73
Constraining Seed Sequences
To keep seed sequences from shifting during alignment
reestimation, constraints fix the residues of the seed sequences to
certain states in the model (or columns in the alignment).

[Flowchart: SAM-T98 Alignment Building with Constraints. Boxes:
Start: seed alignment; Constraints; Initial alignment; Model;
Database; New Homologs; Reestimate alignment (seed sequences fixed,
homologs permitted to move); End: a SAM-T98 alignment. The loop is
labeled Iterations 1 - 3 and the exit is labeled Iteration 4.]

Advantages:
• The seed signal is not lost to unintended reestimation.
• The resulting alignments tend to be more accurate.

74
Reiterate

SAM-T98 Alignment Building


Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

In a progressive method such as SAM-T98, care must be taken
to avoid mistakes early in the process. Later, after the signal is
established, less care is required.

75
Avoiding early mistakes

Database searched:

• During the first iteration, SAM-T98 searches against strong
BLAST hits to admit only strong homologs.
• In subsequent iterations, the database searched is a large pool
of potential but weaker homologs.

Threshold score:

• During the first iteration, the threshold score for admitting
new homologs is very conservative (-40 nats).
• This threshold score is loosened with each iteration (-30, -24,
and -16 nats).
• These thresholds were selected for superfamily recognition.
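The schedule above can be sketched as a loop. This is an illustration only: `score_against_model`, `admit_homologs`, `build_alignment`, and the candidate pools are hypothetical stand-ins for the real SAM-T98 search, and the toy scores are invented. It assumes, consistent with the negative thresholds quoted, that scores are costs in nats where more negative is better. In the real method the model is rebuilt each iteration, so scores change between iterations; the toy scores here stay fixed.

```python
THRESHOLDS_NATS = [-40.0, -30.0, -24.0, -16.0]  # iterations 1 through 4

def admit_homologs(scores, threshold):
    """Return the names whose score is at or below (better than) the threshold."""
    return {name for name, s in scores.items() if s <= threshold}

def build_alignment(seed, pools, score_against_model):
    """Grow the set of aligned sequences one iteration at a time.

    pools[i] is the candidate set for iteration i + 1: strong BLAST hits
    first, then the larger pool of weaker candidates.
    """
    aligned = {seed}
    for threshold, pool in zip(THRESHOLDS_NATS, pools):
        scores = {name: score_against_model(aligned, name) for name in pool}
        aligned |= admit_homologs(scores, threshold)
    return aligned

# Toy demonstration with fixed, invented scores.
toy_scores = {"close": -100.0, "medium": -27.0, "remote": -17.0, "noise": -3.0}
strong_hits = ["close"]
weak_pool = list(toy_scores)
result = build_alignment(
    "seed",
    [strong_hits, weak_pool, weak_pool, weak_pool],
    lambda aligned, name: toy_scores[name],
)
print(sorted(result))
```

With these toy numbers, "medium" is rejected at -30 nats but admitted once the threshold loosens to -24, and "remote" enters only at -16.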

76
2crd, starting out

• 1 sequence aligned, total sequence weight = 0.91.
• To achieve the desired variability, the total sequence weight
was set at less than one sequence.
• Initially, the glycine and the cysteines were more important,
because the regularizer treats these residues as more
commonly conserved.

[Sequence logo for 2crd, starting out: bits saved per column, columns 1-37]

77
2crd, Iteration 1

• 14 sequences aligned, total sequence weight = 1.17.
• The alignment increased by 13 sequences, but the total
sequence weight increased by only one fourth of one sequence.
This indicates that the fourteen sequences were very close
homologs.
• Tryptophan 14, Glycine 26, and all the cysteine signals
increased substantially.
• An additional glycine signal started to appear at column 22.
[Sequence logo for 2crd, Iteration 1: bits saved per column, columns 1-37]

78
2crd, Iteration 2

• 28 sequences aligned, total sequence weight = 1.48.
• The cysteine signals increased.
• Tryptophan 14 disappeared.
• Glycine 22 increased, and Glycine 26 decreased.
• As the variability and the total sequence weight increased,
the signal from the priors began to decrease.

[Sequence logo for 2crd, Iteration 2: bits saved per column, columns 1-37]

79
2crd, Iteration 3

• 41 sequences aligned, total sequence weight = 1.51.
• The variability within the alignment increased very slightly.
• Glycine 26 increased slightly.
[Sequence logo for 2crd, Iteration 3: bits saved per column, columns 1-37]

80
2crd, Iteration 4

• 60 sequences aligned, total sequence weight = 1.99.
• The total sequence weight increased by a third, reflecting the
variability of the new sequences in the alignment.
• Column savings decreased toward the end of the alignment
as the diverse new homologs were aligned to previously
conserved columns rather than opening gaps. The effect was
most vivid at Cysteine 33.
• The other cysteine signals increased substantially.
• Glycine 22 increased substantially.
[Sequence logo for 2crd, Iteration 4: bits saved per column, columns 1-37]

81
1yrnB, Starting out

• 1 sequence aligned, total sequence weight = 0.91.
• To achieve the desired variability, the training alignment was
weighted at slightly less than one sequence.
• The glycines at columns 6, 34, and 75 received special
importance at the start.

[Sequence logo for 1yrnB, starting out: bits saved per column, in two panels (columns 1-42 and columns 43-83)]

82
1yrnB, Iteration 1

• 9 sequences aligned, total sequence weight = 0.94.
• Eight very close homologs to 1yrnB were added to the
alignment. Their scores were -100 and stronger.
• The homogeneity of the alignment was reflected by the very
small increase in total sequence weight from 0.91.
• The glycines were further emphasized.
[Sequence logo for 1yrnB, Iteration 1: bits saved per column, in two panels (columns 1-42 and columns 43-83)]

The alignment was unchanged in Iteration 2: no sequence met
the Iteration 2 threshold without also meeting the Iteration 1 threshold.

83
1yrnB, Iteration 3
• 15 sequences aligned, total sequence weight = 2.05.
• The variability of the alignment increased substantially, and
the total sequence weight more than doubled.
• Tryptophan 52 was emphasized.
• Leucines 17 and 38, Tyrosine 29, Asparagine 55, and
Arginine 57 began to stand out. Less specific hydrophobic
signals emerged at Columns 21, 44, and 49.

[Sequence logo for 1yrnB, Iteration 3: bits saved per column, in two panels (columns 1-42 and columns 43-83)]

84
1yrnB, Iteration 4

• The number of homologs increased dramatically to 1999. Total
sequence weight increased dramatically to 5.34.
• The most prominent signals from Iteration 3 were emphasized
further, especially Leucine 17 and Tryptophan 52.
• Phenylalanine 21, Leucine 44, and Alanine 39 appeared.

[Sequence logo for 1yrnB, Iteration 4: bits saved per column, in two panels (columns 1-42 and columns 43-83)]

85
The final alignment

SAM-T98 Alignment Building


Start: a single sequence

Build a model from the sequence or alignment

Use the model to search for additional homologs

Reestimate the alignment with the new homologs


(Iterations 1 - 3) (Iteration 4)
End: a SAM-T98 alignment

• An alignment can be valuable in its own right.
• If there are homologs with known structure, the final
alignment might be sufficient for structure prediction.

86
Final SAM-T98 Alignment for 2crd
Shown below is an excerpt from the final SAM-T98 alignment for
2crd, showing remote homologs of known structure. The logo
shown reflects the full alignment.
2crd XFTNVSCTTSKECWSVCQRLHNTSRG.KCMNKKCRCYS...
1bah XFTNVSCTTSKEXWSVCQRLHNTSRG.KCMNKKXRCYS...
1cmr ------CTTSKECWSVCQRLHNTSKG.WCDHRGCICES...
1lir XFTQESCTASNQCWSICKRLHNTNRG.KCMNKKCRCYS...
1sxm TIINVKCTSPKQCSKPCKELYGSSAGaKCMNGKCKCYNnx.
1txm ----VSCTGSKDCYAPCRKQTGCPNA.KCINKSCKCYGcx.
1big XFTDVKCTGSKQCWPVCKQMFGKPNG.KCMNGKCRCYS...
1bkt VGINVKCKHSGQCLKPCKDA-GMRFG.KCINGKCDCTPkx.
2bmt XFTNVSCSASSQCWPVCKKLFGTYRG.KCMNSKCRCYS...

[Sequence logo for the final SAM-T98 alignment of 2crd: bits saved per column, columns 1-37]

87
Comparison with the 2crd structural alignment

The logo below represents the FSSP [HS97] structural alignment
for 2crd. The FSSP logo and the SAM-T98 logo will differ due to
homologs of unknown structure in SAM-T98 and additional
distant homologs in FSSP.
• Cysteine 33 and Glycine 26 are emphasized less in the
SAM-T98 alignment while Glycine 22 is emphasized more.
• Both alignments emphasize the cysteines.
[Sequence logo for the FSSP structural alignment of 2crd: bits saved per column, columns 1-37]

88
SAM-T98 alignment for 1yrnB
          10        20        30        40        50        60        70
          |         |         |         |         |         |         |
1yrnB GHRFTKENVRILESWFAKNIENPYLDTKGLENLMKNTSLSRIQIKNWVSNRRRKEKTITIAPELADLL
1hom  RQTYTRYQTLELEKEFHFN---RYLTRRRRIEIAHALCLTERQIKIWFQNRRMKWKKENKTKGEPG--
1hddC RTAFSSEQLARLKREFNEN---RYLTERRRQQLSSELGLNEAQIKIWFQNKRAKIKK-----------
1fft  RVLFSQAQVYELERRFKQQ---KYLSAPEREHLASMIHLTPTQVKIWFQNHRYKMKRQAKDK------
1yrnA --SISPQARAFLEQVFRRK---QSLNSKEKEEVAKKCGITPLQVRVWFINKRMRSK------------
1ocp  RTSIENRVRWSLETMFLKC---PKPSLQQITHIANQLGLEKDVVRVWFCNRRQKGKR-----------

Shown is a portion of the final SAM-T98 alignment, showing
selected sequences of known structure. For brevity, other
sequences of known structure were omitted when their alignment
was similar to some sequence shown. The sequences shown
represent other sequences of known structure, as follows:

Representative   Other Structures Represented
1yrnB            1mnmC and 1aplC
1hom             1san, 1ahdP, 9antA, 1ftz, and 1b72A
1hddC            2hddA, 3hddA, 1enh, and 1fjlA
1fft             1vnd
1yrnA            1akhA and 1b72B
1ocp             1octC, 1pog, 1hdp, and 1au7A

89
Final SAM-T98 alignment for 1yrnB, continued

The logo shown represents the final SAM-T98 alignment for
1yrnB, including homologs of unknown structure.

[Sequence logo for the final SAM-T98 alignment of 1yrnB: bits saved per column, in two panels (columns 1-42 and columns 43-83)]

90
Comparison with the 1yrnB structural alignment

The logo below represents the 1yrnB structural family according
to FSSP. The SAM-T98 logo will differ because of different sets
of homologs, and slight differences in the 1yrnB sequence
resulting from positions not fully resolved.
• FSSP places more emphasis on Glutamic Acid 18. SAM-T98
places more emphasis on Alanine 39 and others.
• The two logos reflect similar alignments.
[Sequence logo for the FSSP structural family of 1yrnB: bits saved per column, in two panels (columns 1-42 and columns 43-78)]

91
SAM-T98 structure prediction

[Flowchart: SAM-T98 Structure Prediction]
  Target branch:  Build a SAM-T98 alignment from the target sequence.
                  Build a model from the alignment. Score PDB sequences.
  Library branch: Build a library of SAM-T98 alignments from PDB sequences.
                  Build models from the alignments and score the target sequence.
  Both branches:  Combine scores.
  Result:         Fold prediction.

• Building the target models or the structure models... or both.
• Effects of bidirectional scoring.
• Issues in library building:
  - When is library building worthwhile?
  - Library size vs. structure space coverage.

92
Bidirectional scoring

[Figure: FSSP test set. False positives (log scale, 1 to 1000) versus true positives (0 to 1000) for db-SAM-T98, targ-SAM-T98, and sum-SAM-T98.]

• Bidirectional scoring yields a modest performance gain.
• The work necessary for bidirectional scoring is substantial
compared to the work necessary to build one SAM-T98
alignment for the target sequence (targ-SAM-T98).
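
The score combination behind the sum-SAM-T98 curve can be sketched as follows. This is only a minimal illustration. It assumes that each direction yields one cost in nats per structure (more negative meaning a stronger match) and that the combined score is the plain sum of the two. The function names, the structure identifiers, and the example costs are hypothetical and are not part of the SAM package.

```python
from typing import Dict, List, Tuple


def combine_scores(targ_costs: Dict[str, float],
                   db_costs: Dict[str, float]) -> Dict[str, float]:
    """Sum the two directional costs for every structure scored both ways.

    targ_costs: cost of each PDB sequence under the target's model.
    db_costs:   cost of the target sequence under each structure's model.
    """
    shared_ids = targ_costs.keys() & db_costs.keys()
    return {pdb_id: targ_costs[pdb_id] + db_costs[pdb_id] for pdb_id in shared_ids}


def rank_hits(summed: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order hits from strongest (most negative cost) to weakest."""
    return sorted(summed.items(), key=lambda item: item[1])


# Hypothetical costs in nats, for illustration only.
targ = {"structA": -22.0, "structB": -4.5, "structC": -1.0}
db = {"structA": -19.5, "structB": -6.0}
ranking = rank_hits(combine_scores(targ, db))
print(ranking)
```

Structures scored in only one direction (structC here) are dropped from the combined ranking; whether to keep them with a single-direction score is a design choice this sketch does not address.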

93
Assessing the Score

• 935 SCOP domains with known structural relatedness.
• All-on-all scoring.
• Rank by score and count the number of false positives as a function of score.

[Figure: Calibration curve for different SCOP levels. False positives (log scale, 1 to 100000) versus SAM-T98 summed cost in nats (-40 to 0), with one curve each for class, fold, superfamily, family, and subfamily.]

• Allows us to assess the significance of a score for different levels of
protein relatedness.
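
The counting step behind such a calibration curve can be sketched as below. This is a hedged illustration, not the procedure actually used for the slide. It assumes each scored pair arrives as a (summed cost, is_related) tuple, where is_related has already been decided at one chosen SCOP level, and that a lower cost is a stronger hit. The function name and the example values are hypothetical.

```python
from typing import List, Tuple


def false_positive_curve(hits: List[Tuple[float, bool]]) -> List[Tuple[float, int]]:
    """Return (cost threshold, false positives with cost <= threshold) pairs.

    hits: one (summed cost, is_related) tuple per scored pair.
    """
    running_false_positives = 0
    curve = {}
    for cost, is_related in sorted(hits, key=lambda hit: hit[0]):
        if not is_related:
            running_false_positives += 1
        # Tied costs overwrite the entry, so each threshold keeps its final count.
        curve[cost] = running_false_positives
    return list(curve.items())


# Hypothetical all-on-all results for one SCOP level.
example_hits = [(-30.0, True), (-12.0, False), (-20.0, True),
                (-12.0, True), (-5.0, False)]
curve = false_positive_curve(example_hits)
print(curve)
```

Running the same function once per SCOP level, with is_related recomputed for that level, would give one curve per level.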

94
Bidirectional Scoring Results for 2crd in rank order

The SCOP relatedness to 2crd is as follows: S = same subfamily,
F = same family, P = same superfamily, U = unclassified, O =
other.

Score range          ID      SCOP
Score ≤ -50          2crd    S
                     1big    U
                     2bmt    U
                     1lir    F
                     1agt    F
                     1mtx    F
                     2ktx    F
                     1ktx    F
                     1sxm    F
                     1bah    S
-50 < Score ≤ -40    1sco    F
                     1bkt    U
                     2pta    F
                     1txm    F
-40 < Score ≤ -30    1tsk    F
                     1cmr    S
-30 < Score ≤ -20    1gps    P
                     1gpt    P
                     1acw    P
                     1cn2    U
-20 < Score ≤ -10    1myn    P
                     1b3cA   U
                     2b3c    U
                     1vnb    P
                     1vna    P
                     2sn3    P

95
Bidirectional Scoring Results for 2crd in rank order, contd.

• No known false positives were found.
• The more distant superfamily relatives appear with weak
scores, but so does one subfamily relative (1cmr).
• Since the method was tuned for distant homology detection,
the signal for 1cmr was muted by the other family and
superfamily inclusions.
• The alignment describes the family better than the subfamily.

96
Bidirectional Scoring Results for 1yrnB in rank order

The SCOP relatedness to 1yrnB is as follows: S = same
subfamily, F = same family, U = unclassified, O = other.

Score range          ID            SCOP
Score ≤ -60          2hoa          F
                     1ahdP         F
                     1hom          F
                     9antA/B       U
                     1b8iA         U
                     1san          U
                     1b72A         U
                     3hddA/B       U
                     1hddC         F
                     1fjlA/B/C     F
                     1ftz          F
                     2hddA/B       F
-60 < Score ≤ -50    1nk2P         U
                     1nk3P         U
                     1vnd          F
                     1ftt          F
                     1enh          F
                     1b72B         U
                     1b8iB         U
-50 < Score ≤ -40    1yrnB         S
                     1aplC         S
                     1akhB         S
                     1aplD         S
                     1mnmC/D       S
-40 < Score ≤ -30    1octC         F
                     1hdp          F
                     1yrnA         S
                     1akhA         S
                     1pog          F
                     1au7A/B       F
                     1ocp          F
-30 < Score ≤ -20    none
-20 < Score ≤ -10    1lfb          F
                     2lfb          F

97
Bidirectional Scoring Results for 1yrnB in rank order, contd.

• No known false positives were found.
• The strongest hit was not the best structural match.
• The alignment provides a better description of the 1yrnB
family than the 1yrnB subfamily because the method is
tuned for superfamily modeling.

98
Should you build an HMM structure library?

• If you want to make structure predictions for only a few
proteins, build HMMs for these proteins and score the protein
structure sequence database.
• If you want to make structure predictions for 10,000
sequences (for example), it would be more efficient to build
HMMs for protein structures.
• Anywhere in between these two extremes, the decision to
build an HMM structure library depends on your time and
computing power resources.
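
The trade-off above can be made concrete with a deliberately crude cost model. This is a sketch under strong assumptions, not a measured result: it assumes that building one SAM-T98 alignment and model costs the same for a target as for a structure, and that the scoring work is the same in either direction, so only the number of models built differs. The function names and the counts in the example are hypothetical.

```python
def model_building_cost(n_targets: int, n_structures: int,
                        build_cost: float, use_library: bool) -> float:
    """Total model-building cost under the equal-cost assumption.

    With a library, one model is built per structure; without it,
    one model is built per target sequence.
    """
    n_models = n_structures if use_library else n_targets
    return n_models * build_cost


def library_is_cheaper(n_targets: int, n_structures: int,
                       build_cost: float = 1.0) -> bool:
    """True when building the structure library needs fewer model builds."""
    with_library = model_building_cost(n_targets, n_structures, build_cost, True)
    without_library = model_building_cost(n_targets, n_structures, build_cost, False)
    return with_library < without_library


# Hypothetical sizes: a few targets versus 10,000 targets, 1,000 structures.
print(library_is_cheaper(n_targets=5, n_structures=1000))
print(library_is_cheaper(n_targets=10000, n_structures=1000))
```

In this model the break-even point is simply the number of structures in the library; real build and scoring costs would shift it, which is why the in-between cases depend on available time and computing power.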

99
Building Alignments for a Structure Library

To achieve the best coverage with a limited number of
alignments, use an established protein family hierarchy to choose
a set of structures that covers "structure space" to a desired
granularity.
FSSP: Superfamily-level structure hierarchy [HS97].
CATH: Family-level domain hierarchy [OMJ+97].
Pfam: Family-level domain hierarchy [SED97].
SCOP: Multi-level hierarchy [HMBC97]. Includes some
associations that are beyond most contemporary modeling
methods.
All structure hierarchies are updated periodically, and the
alignment library should also be updated accordingly.
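
Choosing structures to a desired granularity can be sketched as picking one representative per group at a chosen depth of the hierarchy. The following is a minimal illustration, assuming each structure carries a tuple of labels from the top of the hierarchy downward (for example class, fold, superfamily). The identifiers, the labels, and the rule of taking the alphabetically first member are all hypothetical; a real library would more likely pick representatives by structure quality.

```python
from typing import Dict, List, Tuple


def choose_representatives(classification: Dict[str, Tuple[str, ...]],
                           depth: int) -> List[str]:
    """Pick one structure per group at the given hierarchy depth.

    classification: structure ID -> labels from the top level downward.
    depth:          how many levels define a group (larger = finer granularity).
    """
    chosen: Dict[Tuple[str, ...], str] = {}
    for structure_id in sorted(classification):
        group = classification[structure_id][:depth]
        # Keep the first structure seen for each group.
        chosen.setdefault(group, structure_id)
    return sorted(chosen.values())


# Hypothetical three-level classification (class, fold, superfamily).
labels = {
    "domA": ("c1", "f1", "s1"),
    "domB": ("c1", "f1", "s2"),
    "domC": ("c1", "f2", "s3"),
}
fold_level = choose_representatives(labels, depth=2)
superfamily_level = choose_representatives(labels, depth=3)
print(fold_level, superfamily_level)
```

A coarser depth gives a smaller library with less coverage of "structure space"; a finer depth gives more alignments to build and keep up to date.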

100
Future directions in HMMs for sequence analysis

• HMM-based recognition and assembly of I-sites, motifs, and
other small substructures [BB97, GBEB97].
• Fisher kernel methods [JDH99].
• Using similarity between HMMs to predict homology (Lyngsø,
ISMB99).
• Specialized hardware [HHKS96].
• Neural networks for improved parameter selection.

101
HMMs on the Web

• HMM software packages
    SAM         http://www.cse.ucsc.edu/research/compbio/
    HMMer       http://hmmer.wustl.edu/
• Applications of HMM software
    SAM-T98     www.cse.ucsc.edu/research/compbio/
    Pfam        http://pfam.wustl.edu/
    Prosite     http://www.expasy.ch/prosite/
    Meta-MEME   http://metameme.sdsc.edu/
• Commercial vendors of HMM systems
    Neomorphic  http://www.neomorphic.com/
    TimeLogic   http://www.timelogic.com/
    Net-ID      http://www.netid.com/

102
References
[AGM+90] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and
David J. Lipman. Basic local alignment search tool. JMB, 215:403-410, 1990.
[BB97] C. Bystroff and D. Baker. Blind predictions of local protein structure in CASP2
targets using the I-sites library. Proteins: Structure, Function and Genetics,
Suppl., 1:167-171, 1997.
[BHK97] Christian Barrett, Richard Hughey, and Kevin Karplus. Scoring hidden
Markov models. CABIOS, 13(2):191-199, 1997.
[Edd98] S. R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755-63,
1998.
[GBEB97] W. N. Grundy, T. L. Bailey, C. P. Elkan, and M. E. Baker. Meta-MEME: Motif-based
hidden Markov models of protein families. CABIOS, 13(4):397-406, 1997.
[GME87] Michael Gribskov, Andrew D. McLachlan, and David Eisenberg. Profile
analysis: Detection of distantly related proteins. PNAS, 84:4355-4358, July 1987.
[HH92] Steven Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from
protein blocks. PNAS, 89:10915-10919, November 1992.
[HH94] Steven Henikoff and Jorja G. Henikoff. Position-based sequence weights. JMB,
243(4):574-578, November 1994.
[HHKS96] Jeffrey D. Hirschberg, Richard Hughey, Kevin Karplus, and Don Speck.
Kestrel: A programmable array for sequence analysis. In Application-Specific
Array Processors, pages 25-34, Los Alamitos, CA, July 1996. IEEE Computer
Society.
[HK96] Richard Hughey and Anders Krogh. Hidden Markov models for sequence
analysis: Extension and analysis of the basic method. CABIOS,
12(2):95-107, 1996. Information on obtaining SAM is available at
http://www.cse.ucsc.edu/research/compbio/sam.html.
[HMBC97] T. Hubbard, A. Murzin, S. Brenner, and C. Chothia. SCOP: a structural
classification of proteins database. NAR, 25(1):236-9, January 1997.
[HS97] Liisa Holm and Chris Sander. Dali/FSSP classification of three-dimensional
protein folds. NAR, 25:231-234, 1 Jan 1997.
[JDH99] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to
detect remote protein homologies. In Proceedings of the Seventh International
Conference on Intelligent Systems for Molecular Biology, Aug 1999.
[Kar95] Kevin Karplus. Regularizers for estimating distributions of amino acids from
small samples. In ISMB-95, Menlo Park, CA, July 1995. AAAI/MIT Press.
[KBC+99] Kevin Karplus, Christian Barrett, Melissa Cline, Mark Diekhans, Leslie
Grate, and Richard Hughey. Predicting protein structure using only sequence
information. Proteins: Structure, Function, and Genetics, to appear, 1999.
[KBH98] Kevin Karplus, Christian Barrett, and Richard Hughey. Hidden Markov
models for detecting remote protein homologies. Bioinformatics, 14(10):846-856,
1998.
[KBM+94] A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler. Hidden
Markov models in computational biology: Applications to protein modeling.
JMB, 235:1501-1531, February 1994.
[KKB+97] Kevin Karplus, Kimmen Sjölander, Christian Barrett, Melissa Cline, David
Haussler, Richard Hughey, Liisa Holm, and Chris Sander. Predicting protein
structure using hidden Markov models. Proteins: Structure, Function, and
Genetics, Suppl. 1:134-139, 1997.
[OMJ+97] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and
J. M. Thornton. CATH - a hierarchic classification of protein domain structures.
Structure, 5(8):1093-108, August 1997.
[PKB+98] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and
C. Chothia. Sequence comparisons using multiple sequences detect twice
as many remote homologues as pairwise methods. JMB, 284(4):1201-1210,
1998. Paper available at
http://www.mrc-lmb.cam.ac.uk/genomes/jong/assess_paper/assess_paperNov.html.
[SED97] E. L. L. Sonnhammer, S. R. Eddy, and R. Durbin. Pfam: A comprehensive
database of protein families based on seed alignments. Proteins, 28:405-420,
1997.
[SKB+96] K. Sjölander, K. Karplus, M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, and
D. Haussler. Dirichlet mixtures: A method for improving detection of weak
but significant protein sequence homology. CABIOS, 12(4):327-345, August
1996.
[SS90] T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display
consensus sequences. NAR, 18(20):6097-100, 1990.
[SS96] Reinhard Schneider and Chris Sander. The HSSP database of protein
structure-sequence alignments. NAR, 24(1):201-205, 1 Jan 1996.
[WU-] WU-BLAST WWW archives.
http://blast.wustl.edu/.
