SAM-T98
Building and using an HMM framework
for finding protein homologs and
predicting protein structure
Melissa Cline, Christian Barrett, and Kevin Karplus
University of California, Santa Cruz
Acknowledgments
Outline of tutorial
External (web-site) view of SAM-T98. (Karplus)
Overview of tutorial and SAM-T98. (Barrett)
Entropy and sequence logos. (Cline)
Introducing the running examples. (Barrett)
Basics of profile HMMs. (Karplus)
SHORT BREAK.
Sequence weighting. (Cline)
Column regularizers. (Karplus)
Transition regularizers. (Barrett)
SHORT BREAK.
Searching with the model. (Barrett)
Null models. (Karplus)
Stepping through the process. (Cline)
Bidirectional scoring. (Barrett)
Future directions and conclusions. (Barrett)
SAM-T98 overview
SAM-T98 master web page
http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html
Kevin Karplus, Christian Barrett, Richard Hughey, "Hidden Markov Models for
Detecting Remote Protein Homologies", Bioinformatics 14(10):846-856, 1998. WWW
server available from http://www.cse.ucsc.edu/research/compbio.
Basic operations
Compare Sequence Against Protein Model Library
Search UCSC’s protein structure HMM library with a single sequence. May reveal
homologous structures.
SAM-T98 search web page
Because the method can take a while for long proteins, or ones with many homologs, results are e-mailed back. They also don't fit on a slide; see T98-query.mail in handouts.
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T98-query.html
HMM-based Sequence Query
This page offers the ability to perform remote protein homolog searches using HMM technology. It is part of a larger suite of UCSC’s HMM applications that are available on the web. If you are familiar with BLAST searching, then you understand the functionality of this page.
SAM-T98 multiple alignment web page
Overview
What is SAM-T98?
SAM-T98 is an HMM-based homology detection method, similar to PSI-BLAST in function.
SAM-T98 was the top fold recognition and homology modeling method at CASP3 of those using no direct structural information [KBC+99].
The methodology that developed into SAM-T98 was one of the top fold recognition methods at CASP2 [KKB+97].
SAM-T98 Overview: Alignment Building
SAM-T98 Overview: Structure Prediction
Combine scores
Review of Entropy
Interpretation of entropy
Formula: $H(x) = -\sum_{x_i} P(x_i) \log_2 P(x_i)$ (bits).
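As a minimal sketch of the entropy formula (the function name `entropy_bits` and the two example distributions are illustrative assumptions, not part of SAM-T98), the entropy of a discrete distribution can be computed like this:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A uniform distribution over the 20 amino acids is the most uncertain case.
uniform_20 = [1 / 20] * 20
# A distribution with a single certain outcome carries no uncertainty.
certain = [1.0] + [0.0] * 19

print(round(entropy_bits(uniform_20), 2))  # 4.32
print(entropy_bits(certain) == 0)          # True
```

The uniform case gives log2(20) bits, which is the largest entropy any distribution over 20 amino acids can have.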
Entropy of Alignment Columns
posterior distribution P.
Sequence Logos
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
[Figure: sequence logo of the alignment above; vertical axis "bits saved" (0 to 2), horizontal axis alignment columns 1 to 37.]
Sequence Logos and entropy
The height of each character is proportional to the log of the posterior probability of that amino acid in that column.
The total height of each column represents the bits saved relative to the background entropy.
The maximum possible savings at any column is approximately log2(20) = 4.3.
The logo for an alignment reflects the signal that will be modeled by an HMM or other model trained on the alignment.
The logo shown reflects the 2crd sequence using the Blosum62 regularizer. Its signal reflects what BLAST would search for when using Blosum62.
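The column height described above can be sketched as a difference of two entropies. The helper names below are hypothetical, and the uniform background is an assumption made for illustration; a real background distribution is not uniform, which is why the maximum savings is only approximately log2(20).

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def bits_saved(posterior, background):
    """Total column height: background entropy minus posterior entropy."""
    return entropy_bits(background) - entropy_bits(posterior)

background = [1 / 20] * 20               # assumed uniform background
fully_conserved = [1.0] + [0.0] * 19     # one amino acid is certain
two_way_split = [0.5, 0.5] + [0.0] * 18  # two amino acids equally likely

print(round(bits_saved(fully_conserved, background), 2))  # 4.32
print(round(bits_saved(two_way_split, background), 2))    # 3.32
```

A column whose posterior equals the background saves nothing, which is why weakly conserved columns are nearly flat in the logos.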
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
[Figure: sequence logo of the 2crd sequence with the Blosum62 regularizer; vertical axis "bits saved" (0 to 2), horizontal axis alignment columns 1 to 37.]
Interpreting Sequence Logos
The logos shown reflect excerpts from alignments of 1yrnB.
The upper logo shows the 1yrnB sequence only. The lower logo shows an alignment of approximately 2000 sequences.
[Figure: sequence logo built from the 1yrnB sequence alone; horizontal axis alignment columns 1 to 42.]
If an alignment contains numerous, diverse sequences, the observed data will dominate the signal. Some columns will show very strong savings, while others will show almost none.
[Figure: sequence logo built from the alignment of approximately 2000 sequences; vertical axis "bits saved" (0 to 4), horizontal axis alignment columns 1 to 42.]
Starting out
SAM-T98 Alignment Building
Start: a single sequence
Example 1: 2crd
What information do we know initially?
Is a member of a structural class of toxins that share the same motif but block different K+/Na+ channels.
Disulfides C7-C28, C13-C33, C17-C35 are critical for all class members.
Logo below depicts the information in the regularized profile constructed from the single 2crd sequence.
Through the tutorial, this signal will change as we make it more specific and sensitive.
[Figure: sequence logo of the regularized profile built from the single 2crd sequence; vertical axis "bits saved" (0 to 2), horizontal axis alignment columns 1 to 37.]
Example 2: 1yrnB
What do we know initially?
Logo below depicts the information in the regularized profile constructed from the single 1yrnB sequence.
[Figure: sequence logo of the regularized profile built from the single 1yrnB sequence, in two panels covering alignment columns 1 to 42 and 43 to 83; vertical axis "bits saved" (0 to 2).]
Hidden Markov Models
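Profile Hidden Markov Model
[Figure: profile HMM state diagram running from Start to End, with the residues of two example sequences (a1 a2 A3 A4 A5 and B1 B2 B3 b4 B5) placed on the states that emit them.]
a1 a2 A3 -  A4 .  A5
.  .  B1 B2 B3 b4 B5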
Computing probabilities with an hmm
paths S
The probability of the path depends only on the transition
parameters:
$$\mathrm{Prob}(S \mid M) = \prod_{i=0}^{n} \mathrm{Prob}(s_{i+1} \mid M, s_i).$$
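The product over transitions can be sketched as follows. The state names and transition probabilities here are hypothetical values chosen for illustration, not parameters of any SAM-T98 model; the sum is taken in log space, which is a common way to avoid underflow on long paths.

```python
import math

# Hypothetical transition probabilities for a tiny model (illustrative only).
transitions = {
    ("Start", "M1"): 0.9, ("Start", "D1"): 0.1,
    ("M1", "M2"): 0.8, ("M1", "I1"): 0.1, ("M1", "D2"): 0.1,
    ("I1", "I1"): 0.3, ("I1", "M2"): 0.7,
    ("D1", "M2"): 1.0,
    ("M2", "End"): 1.0,
    ("D2", "End"): 1.0,
}

def path_log_prob(path, transitions):
    """Sum of log transition probabilities over consecutive states in the path."""
    return sum(math.log(transitions[(s, t)]) for s, t in zip(path, path[1:]))

# Start -> match 1 -> insert (twice) -> match 2 -> End
path = ["Start", "M1", "I1", "I1", "M2", "End"]
prob = math.exp(path_log_prob(path, transitions))
print(round(prob, 4))  # 0.0189
```

Here the path probability is 0.9 x 0.1 x 0.3 x 0.7 x 1.0; the emitted residues play no part in it.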
Generating a model from an alignment
From this:
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
To this:
[Figure: sequence logo of the model built from the alignment above; vertical axis "bits saved" (0 to 2), horizontal axis alignment columns 1 to 37.]
Model building
Risks of not generalizing well
The logo shown is computed from the following alignment, with no generalizing. Remember that a logo reflects the signal that a model trained on the alignment will look for.
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
[Figure: sequence logo of the alignment above with no generalizing; vertical axis "bits saved" (0 to 4), horizontal axis alignment columns 1 to 37.]
Effects of generalizing
The logo shown below is computed from the same alignment as the previous logo, but using a more sensible generalizing scheme.
The signal reflects the alignment, but the alignment does not dominate the signal.
The conserved cysteines and the other structurally significant columns are emphasized.
The amino acid posteriors reflect the amino acids seen plus their likely substitutions and structural significance.
[Figure: sequence logo of the same alignment with a generalizing scheme applied; vertical axis "bits saved" (0 to 2), horizontal axis alignment columns 1 to 37.]
2. Column Regularizers
3. Transition Regularizers
Issues in model building
2crd XFTNVSCTTSKECWSVCQRLHNTSRG.KCMNKKCRCYS
1bah XFTNVSCTTSKEXWSVCQRLHNTSRG.KCMNKKXRCYS
1sxm TIINVKCTSPKQCSKPCKELYGSSAGaKCMNGKCKCYNnx
1cmr ------CTTSKECWSVCQRLHNTSKG.WCDHRGCICES
1txm ----VSCTGSKDCYAPCRKQTGCPNA.KCINKSCKCYGcx
1lir XFTQESCTASNQCWSICKRLHNTNRG.KCMNKKCRCYS
2bmt XFTNVSCSASSQCWPVCKKLFGTYRG.KCMNSKCRCYS
1bkt VGINVKCKHSGQCLKPCKDA-GMRFG.KCINGKCDCTPkx
1big XFTDVKCTGSKQCWPVCKQMFGKPNG.KCMNGKCRCYS
Observe that in the alignment above, the first two sequences are nearly identical. Such redundancy is normal in protein families.
Henikoff Sequence Weights [HH94]
Residue(s)        N     Q     D     V     E     S     K     C
Counts            5     1     1     7     1     5     3     9
Residue Wt.       0.33  0.33  0.33  0.50  0.50  0.50  0.50  1.00
Observation Wt.   0.07  0.33  0.33  0.07  0.50  0.10  0.17  0.11
Henikoff Sequence Weights, Continued
1. Each column receives a weight of 1, divided evenly between its distinct residues.
2. Within each column, the weight for each residue is divided evenly by its number of occurrences.
3. The weight of a sequence is the sum of the observation weights of its residues.
Residue(s)        N     Q     D     V     E     S     K     C
Counts            5     1     1     7     1     5     3     9
Residue Wt.       0.33  0.33  0.33  0.50  0.50  0.50  0.50  1.00
Observation Wt.   0.07  0.33  0.33  0.07  0.50  0.10  0.17  0.11

Sequence  Weight Computation           Absolute Weight  % Alignment Weight
2crd      0.07 + 0.07 + 0.10 + 0.11    0.35              8.6
1bah      0.07 + 0.07 + 0.10 + 0.11    0.35              8.6
1sxm      0.07 + 0.07 + 0.17 + 0.11    0.42             10.3
1cmr      0.11                         0.11              2.7
1txm      0.07 + 0.17 + 0.11           0.35              8.6
1lir      0.33 + 0.50 + 0.10 + 0.11    1.04             25.6
2bmt      0.07 + 0.07 + 0.10 + 0.11    0.35              8.6
1bkt      0.07 + 0.07 + 0.17 + 0.11    0.42             10.3
1big      0.33 + 0.07 + 0.17 + 0.11    0.68             16.7
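The three steps can be sketched in a few lines. This is an illustrative reading of the procedure, not SAM code: the function name `henikoff_weights` is hypothetical, gaps are assumed to be skipped (as the counts in the table suggest), and the input is columns 4 to 7 of the nine-sequence alignment from the "Issues in model building" slide.

```python
from collections import Counter

def henikoff_weights(alignment, gap_chars="-."):
    """Position-based sequence weights; gap characters are ignored."""
    names = list(alignment)
    length = len(next(iter(alignment.values())))
    weights = {name: 0.0 for name in names}
    for col in range(length):
        residues = {name: alignment[name][col] for name in names
                    if alignment[name][col] not in gap_chars}
        counts = Counter(residues.values())
        for name, res in residues.items():
            # Column weight 1, split over distinct residues, then over occurrences.
            weights[name] += 1.0 / (len(counts) * counts[res])
    return weights

# Columns 4 to 7 of the nine-sequence alignment.
columns_4_to_7 = {
    "2crd": "NVSC", "1bah": "NVSC", "1sxm": "NVKC", "1cmr": "---C",
    "1txm": "-VSC", "1lir": "QESC", "2bmt": "NVSC", "1bkt": "NVKC",
    "1big": "DVKC",
}
weights = henikoff_weights(columns_4_to_7)
total = sum(weights.values())
for name, w in weights.items():
    print(f"{name}  {w:.2f}  {100 * w / total:.1f}%")
```

Because the sketch sums unrounded observation weights and reads 1txm exactly as printed in the alignment (S in column 6), its output can differ slightly from the rounded table; for 1txm it gives about 0.28 where the table lists 0.35.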
Effects of low total sequence weight
[Figure: sequence logo computed with a low total sequence weight; vertical axis "bits saved", horizontal axis alignment columns 1 to 37.]
Effects of high total sequence weight
[Figure: sequence logo computed with a high total sequence weight; vertical axis "bits saved", horizontal axis alignment columns 1 to 37.]
Selecting the total sequence weight
We have found our best results empirically by setting the total sequence weight to save half a bit on average relative to the background entropy H0:
If the sequences are all very similar, the average savings starts out high. The total sequence weight is set low to add more diversity to the signal.
If the sequences are diverse, the average savings starts out low. The total sequence weight is set high to focus the signal more on the alignment and less on the priors.
This total sequence weight is 1.11, and it saves an average of half a bit per column.
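One way to picture the search for a total weight is a bisection on the weight until the average savings hits the target. The sketch below is only illustrative: it assumes equal relative weights for all sequences and a simple uniform pseudocount regularizer (`pseudo_total` is a made-up parameter), whereas the tutorial's result of 1.11 comes from its own regularizer and weighting, so the number found here is not expected to match.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def entropy_bits(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def average_savings(columns, total_weight, pseudo_total=1.0):
    """Mean bits saved per column for a given total sequence weight."""
    per_seq = total_weight / len(columns[0])   # equal share per sequence (assumption)
    h0 = math.log2(len(AMINO_ACIDS))           # entropy of the uniform background
    alpha = pseudo_total / len(AMINO_ACIDS)    # uniform pseudocount per amino acid
    saved = 0.0
    for col in columns:
        counts = Counter(c for c in col if c in AMINO_ACIDS)  # skips gaps and X
        total = per_seq * sum(counts.values()) + pseudo_total
        post = [(per_seq * counts.get(a, 0) + alpha) / total for a in AMINO_ACIDS]
        saved += h0 - entropy_bits(post)
    return saved / len(columns)

def weight_for_target(columns, target=0.5, lo=1e-6, hi=1e6, iters=100):
    """Geometric bisection; savings grow monotonically with the total weight."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if average_savings(columns, mid) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

seqs = [
    "XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS",
    "------CTTSKECWSVCQRLHNTSKGWCDHRGCICES",
    "XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS",
]
columns = ["".join(col) for col in zip(*seqs)]
w = weight_for_target(columns)
print(round(average_savings(columns, w), 3))  # 0.5
```

The bisection works because pulling the posterior away from the background toward the observed counts can only lower its entropy, so similar sequences reach the target at a low weight and diverse ones need a higher weight.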
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
[Figure: sequence logo of the alignment above at the selected total sequence weight; vertical axis "bits saved" (0 to 2), horizontal axis alignment columns 1 to 37.]
Estimating weights to achieve the target savings
Effects of sequence weighting
Using the alignment below as input, the logo reflects the alignment with column regularizing but no sequence weighting.
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....QFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....QFTQESCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....EFTDVDCSVSKECWSVCKDLFGVDRGKCMGKKCRCYQ
.....QFTDVDCSVSKECWSVCKDLFGVDRGKCMGKKCRCYQ
siegrQFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
.....-----SCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
.....XFTNVSCTTSKEXWSVCQRLHNTSRGKCMNKKXRCYS
.....XFTQESCTASNQCWSICKRLHNTNRGKCMNKKCRCYS
.....XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
.....XFTDVKCTGSKQCWPVCKQMFGKPNGKCMNGKCRCYS
[Figure: sequence logo of the alignment above with column regularizing but no sequence weighting; vertical axis "bits saved" (0 to 4), horizontal axis alignment columns 1 to 37.]
Effects of sequence weighting, continued
[Figure: sequence logo of the same alignment with sequence weighting applied; vertical axis "bits saved", horizontal axis alignment columns 1 to 37.]
Model building, Continued
1. Sequence Weighting
2. Column Regularizers
3. Transition Regularizers
Regularizing Column Probabilities
Example 2
2crd XFTNVSCTTSKECWSVCQRLHNTSRG.KCMNKKCRCYS
1sxm TIINVKCTSPKQCSKPCKELYGSSAGaKCMNGKCKCYN
1cmr ------CTTSKECWSVCQRLHNTSKG.WCDHRGCICES
1bah XFTNVSCTTSKEXWSVCQRLHNTSRG.KCMNKKXRCYS
1txm ----VSCTGSKDCYAPCRKQTGCPNA.KCINKSCKCYG
1lir XFTQESCTASNQCWSICKRLHNTNRG.KCMNKKCRCYS
2bmt XFTNVSCSASSQCWPVCKKLFGTYRG.KCMNSKCRCYS
1bkt VGINVKCKHSGQCLKPCKDA-GMRFG.KCINGKCDCTP
1big XFTDVKCTGSKQCWPVCKQMFGKPNG.KCMNGKCRCYS
Desirable Properties in a Column Regularizer
Comparison of Column Regularizers
The following slides show the logos produced using the same 3-sequence alignment, same weighting scheme, and different regularizers:
- Maximum-likelihood estimate
- Add-one pseudocount
- Gribskov average score [GME87] with BLOSUM62 matrix [HH92]
- Recode3.20comp Dirichlet mixture [SKB+96]
Other regularizers are evaluated in the handouts [Kar95].
All logos depict the savings at each column over the background entropy, where the background amino acid distribution P0 is obtained by using the regularizer with no amino acid observations.
Maximum-Likelihood Estimate
Description: Given a set of observed column counts $\vec{O}$, the posterior probability for amino acid $a$, $\hat{P}(a \mid \vec{O})$, is computed as follows:
$$\hat{P}(a \mid \vec{O}) = \frac{O(a)}{\sum_{\text{amino acids } j} O(j)}$$
Easy to compute.
The distribution that gives the highest probability to the observed data.
Performs well when $\vec{O}$ is very, very large, but terribly in real cases.
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
[Figure: sequence logo of the alignment above using the maximum-likelihood estimate; vertical axis "bits saved" (0 to 4), horizontal axis alignment columns 1 to 37.]
Add-One Pseudocount
$$\hat{P}(a \mid \vec{O}) = \frac{O(a) + 1}{\sum_{\text{amino acids } j} \left(O(j) + 1\right)}$$
Easy to compute.
Performs well when $\vec{O}$ is very large, but poorly in most other cases.
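A small sketch makes the contrast between the maximum-likelihood estimate and the add-one pseudocount concrete. The function names and the example column (three cysteines and nothing else) are illustrative assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def maximum_likelihood(counts):
    """Observed frequency of each amino acid; unseen residues get probability 0."""
    total = sum(counts.values())
    return {a: counts.get(a, 0.0) / total for a in AMINO_ACIDS}

def add_one(counts):
    """Add a pseudocount of 1 to every amino acid before normalizing."""
    total = sum(counts.values()) + len(AMINO_ACIDS)
    return {a: (counts.get(a, 0.0) + 1.0) / total for a in AMINO_ACIDS}

column = {"C": 3.0}  # three sequences, all showing cysteine
ml = maximum_likelihood(column)
ao = add_one(column)
print(ml["C"], ml["S"])                      # 1.0 0.0
print(round(ao["C"], 3), round(ao["S"], 3))  # 0.174 0.043
```

With only three observations, the maximum-likelihood estimate rules out every unseen residue, while the 20 added pseudocounts swamp the data and leave cysteine at about 17%.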
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
[Figure: sequence logo of the alignment above using the add-one pseudocount; vertical axis "bits saved" (0 to 1), horizontal axis alignment columns 1 to 37.]
The logo above is not erroneous. You see no pronounced signal at any column because there is none. The low weights combined with the large pseudocounts make the posterior probabilities close to uniform.
Gribskov Average Score [GME87] with BLOSUM62 Matrix [HH92]
Description: The posterior probability of some residue $a$ given the set of counts $\vec{O}$ observed is defined according to a set of background probabilities $P_0$, a substitution matrix $M$:
$$\hat{P}(a \mid \vec{O}) \propto P_0(a)\, \exp\!\left(\frac{\sum_b M_{a,b}\, O(b)}{|\vec{O}|}\right)$$
Easy to compute.
Achieves excellent performance with one sequence.
Does not change as the total sequence weight changes.
Does not make optimal use of multiple sequences.
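The average-score regularizer can be sketched on a toy alphabet. Everything numeric below is hypothetical: the three-letter alphabet, the background, and the score matrix are made up for illustration (they are not BLOSUM62 values), and the scores are assumed to be in natural-log units so that they can be exponentiated directly.

```python
import math

def gribskov_posterior(counts, background, matrix):
    """P(a|O) proportional to P0(a) * exp(average score of a against the column)."""
    total = sum(counts.values())
    unnormalized = {}
    for a, p0 in background.items():
        average_score = sum(matrix[a][b] * n for b, n in counts.items()) / total
        unnormalized[a] = p0 * math.exp(average_score)
    z = sum(unnormalized.values())
    return {a: v / z for a, v in unnormalized.items()}

# Hypothetical three-letter alphabet with made-up scores.
background = {"I": 0.3, "L": 0.4, "K": 0.3}
matrix = {
    "I": {"I": 1.0, "L": 0.5, "K": -1.0},
    "L": {"I": 0.5, "L": 1.0, "K": -0.8},
    "K": {"I": -1.0, "L": -0.8, "K": 1.2},
}
one_seq = gribskov_posterior({"I": 1.0}, background, matrix)
ten_seqs = gribskov_posterior({"I": 10.0}, background, matrix)
print({a: round(p, 3) for a, p in one_seq.items()})
print({a: round(p, 3) for a, p in ten_seqs.items()})
```

Both calls give the same distribution, because dividing by the total count removes any dependence on how much data was seen. That is the property listed above: the estimate does not change as the total sequence weight changes.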
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
[Figure: sequence logo of the alignment above using the Gribskov average score with BLOSUM62; vertical axis "bits saved" (0 to 2), horizontal axis alignment columns 1 to 37.]
Dirichlet Mixtures [SKB+96]
Description: Several different pseudocounts (components) are fit to patterns of conservation and substitution within amino acid columns. To compute the posterior distribution of an alignment column, the probability of each component is computed and used to mix together the posterior probabilities given the components.
$$\hat{P}(a \mid \vec{O}) = \sum_{c} P(c \mid \vec{O})\, \frac{O(a) + \alpha_{c,a}}{\sum_{j} \left(O(j) + \alpha_{c,j}\right)}$$
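A compact sketch of the mixing step follows. It assumes the usual Dirichlet-mixture treatment, in which each component has a prior mixture coefficient and its probability given the column is proportional to that coefficient times the marginal likelihood of the counts; the slide does not spell this out, so `mix_coeffs`, `log_marginal`, the toy three-letter alphabet, and the two components are all illustrative assumptions rather than the recode3.20comp parameters.

```python
import math

def log_marginal(counts, alpha):
    """log P(O | alpha), dropping a factor that is identical for all components."""
    out = math.lgamma(sum(alpha)) - math.lgamma(sum(counts) + sum(alpha))
    for o, a in zip(counts, alpha):
        out += math.lgamma(o + a) - math.lgamma(a)
    return out

def mixture_posterior(counts, mix_coeffs, alphas):
    """Return (posterior over letters, probability of each component)."""
    logs = [math.log(q) + log_marginal(counts, al)
            for q, al in zip(mix_coeffs, alphas)]
    top = max(logs)
    resp = [math.exp(v - top) for v in logs]   # subtract max for stability
    z = sum(resp)
    resp = [r / z for r in resp]
    posterior = [0.0] * len(counts)
    for r, al in zip(resp, alphas):
        denom = sum(counts) + sum(al)
        for i, (o, a) in enumerate(zip(counts, al)):
            posterior[i] += r * (o + a) / denom
    return posterior, resp

# Toy alphabet of three letters; component 0 favors letter 0, component 1 is broad.
alphas = [[5.0, 0.1, 0.1], [1.0, 1.0, 1.0]]
mix_coeffs = [0.5, 0.5]
posterior, resp = mixture_posterior([3.0, 0.0, 0.0], mix_coeffs, alphas)
background, prior_resp = mixture_posterior([0.0, 0.0, 0.0], mix_coeffs, alphas)
print([round(p, 3) for p in posterior], [round(r, 3) for r in resp])
```

With no observations the component probabilities fall back to the mixture coefficients, and the resulting distribution plays the role of the background P0. Counts may be fractional, so the same code works with weighted sequences.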
2crd XFTNVSCTTSKECWSVCQRLHNTSRGKCMNKKCRCYS
1cmr ------CTTSKECWSVCQRLHNTSKGWCDHRGCICES
2bmt XFTNVSCSASSQCWPVCKKLFGTYRGKCMNSKCRCYS
[Figure: sequence logo of the alignment above using the Dirichlet mixture regularizer; vertical axis "bits saved" (0 to 2), horizontal axis alignment columns 1 to 37.]
Regularizer Performance at Various Sequence Weights
To assess the performance of the regularizers at various total weights, below we compare their expected savings given one observation of amino acid b at various weights.
Expected column entropy $E[H]$ is computed as follows, with both sums running over amino acids a and b:
$$E[H] = \sum_{a,b} \hat{P}(a,b) \log \frac{\hat{P}(a,b)}{P(a)P(b)} = \sum_{a,b} \hat{P}(a \mid b) P(b) \log \frac{\hat{P}(a \mid b)}{P(a)}$$
[Figure: Relative entropy of regularizer as function of sequence weight. x-axis: weight of single amino acid (log scale, 1.0e-03 to 1.0e+04); y-axis: bits saved (0 to 4.5).]
45
Peak savings versus weight
The previous slide showed the average bit savings as a function of weight, but an exact match to the single character saves more than average.
Matching column entropy is computed as
$$H_{\text{match}} = \sum_{a} P(a) \log \frac{\hat{P}(a \mid a)}{P(a)},$$
where the sum runs over amino acids a.
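Both savings measures, the average over all observed and predicted amino acids and the savings for the matching character alone, can be computed in a few lines of Python. This is a rough sketch under stated assumptions: the regularizer here is a simple background-proportional pseudocount (total pseudocount weight `z`), chosen only for illustration and not one of the regularizers plotted on these slides. The function names and the uniform background are likewise hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed uniform background; a real background would use observed frequencies.
UNIFORM = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def posterior(a, b, weight, background, z=1.0):
    """P^(a | b): estimate for a after one observation of b with the given weight,
    using pseudocounts z * background(a) (an illustrative regularizer)."""
    observed = weight if a == b else 0.0
    return (observed + z * background[a]) / (weight + z)

def average_bits_saved(weight, background, z=1.0):
    """E[H] = sum over a, b of P^(a|b) P(b) log2(P^(a|b) / P(a))."""
    total = 0.0
    for b, p_b in background.items():
        for a, p_a in background.items():
            p_hat = posterior(a, b, weight, background, z)
            total += p_hat * p_b * math.log2(p_hat / p_a)
    return total

def match_bits_saved(weight, background, z=1.0):
    """H_match = sum over a of P(a) log2(P^(a|a) / P(a))."""
    return sum(p_a * math.log2(posterior(a, a, weight, background, z) / p_a)
               for a, p_a in background.items())

# Sweep the weight of the single observation, as on the plots' x-axis.
sweep = [(w, average_bits_saved(w, UNIFORM), match_bits_saved(w, UNIFORM))
         for w in (0.01, 0.1, 1.0, 10.0)]
```

At weight zero the estimate equals the background and both measures are zero. As the weight grows, both approach the entropy of the background distribution, and the match savings stays above the average savings in between.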
[Figure: Relative entropy of regularizer as function of sequence weight. Curves: recode2.match, recode3.match, recode2.average, recode3.average. x-axis: weight of single amino acid (log scale, 0.01 to 10); y-axis: bits saved (0 to 4.5).]
46
Peak savings versus Average savings
We can plot the savings for the matching character versus the average savings; all the regularizers trained on protein data behave similarly.
At 0.5 bits/character on average, we get about 2.2 bits of savings for a matching character.
[Figure: Relative entropy of matching amino acid versus average relative entropy. Curves: uprior9, merge-opt13, recode2, recode3, blosum62, add-one. x-axis: average bits saved (log scale, 0.01 to 1); y-axis: match bits saved (log scale, 0.1 to 10).]
47
Model building, Continued
1. Sequence Weighting
2. Column Regularizers
3. Transition Regularizers
48
Model Transitions
There are nine types of transitions that can be made:
From the match state: match-match, match-delete, and match-insert.
From the delete state: delete-match, delete-delete, and delete-insert.
From the insert state: insert-match, insert-delete, and insert-insert.
The probability of entering a delete or insert state from a match state is analogous to the gap opening cost in other methods, and the delete-delete and insert-insert probabilities are analogous to the gap extension cost.
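The analogy with gap costs can be made concrete with a short Python sketch. The probabilities below are made-up values for a single model position, not SAM-T98 parameters, and the names `TRANSITION_TYPES`, `probs`, and `cost` are hypothetical. A transition's cost is taken to be its negative log probability, in nats.

```python
import math

STATES = ("match", "delete", "insert")

# All nine transition types: every ordered pair of state kinds.
TRANSITION_TYPES = [(src, dst) for src in STATES for dst in STATES]

# Hypothetical transition probabilities for one model position.
# The three transitions leaving each state sum to 1.
probs = {
    ("match", "match"): 0.90, ("match", "delete"): 0.05, ("match", "insert"): 0.05,
    ("delete", "match"): 0.60, ("delete", "delete"): 0.35, ("delete", "insert"): 0.05,
    ("insert", "match"): 0.60, ("insert", "delete"): 0.05, ("insert", "insert"): 0.35,
}

def cost(src, dst):
    """Cost in nats of taking one transition (negative log probability)."""
    return -math.log(probs[(src, dst)])

# Entering a delete or insert state from match plays the role of gap opening.
gap_open_like = {"deletion": cost("match", "delete"),
                 "insertion": cost("match", "insert")}
# Staying in delete or insert plays the role of gap extension.
gap_extend_like = {"deletion": cost("delete", "delete"),
                   "insertion": cost("insert", "insert")}
```

Unlike a fixed gap penalty, these costs can differ at every position of the model, which is why the transition probabilities need their own regularizer.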
[Figure: HMM from Start to End showing the paths of two sequences, a1 a2 A3 A4 A5 and B1 B2 B3 b4 B5, and the alignment they induce:]
a1 a2 A3 -  A4 .  A5
.  .  B1 B2 B3 b4 B5
49
Transition posteriors in SAM-T98
50
Transition posteriors in SAM-T98, contd.
51
Selecting a transition regularizer
52
Structural Context and Transition Parameter Estimation
53
Searching with the model
SAM-T98 Alignment Building
Start: a single sequence
54
Reducing the Number of Sequences to Score
55
Reducing the Number of Sequences to Score, contd.
56
Reducing the Time Necessary to Score Each Sequence
Kestrel [HHKS96], a specialized parallel processor under development at UCSC, reduces the search times on a database containing 10^7 amino acids as shown:
57
Scoring HMMs and Bayes Rule
58
Null models
$\frac{\text{Prob}(M)}{\text{Prob}(N)}$ is the prior odds ratio, and represents our belief in the likelihood of the model before seeing any data.
Prob j
N A
59
Standard Null Model
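$$\text{Prob}(A \mid N) = \prod_{i=1}^{\text{len}(A)} \text{Prob}(A_i).$$
Including the length of the string:
$$\text{Prob}(A \mid N) = \text{Prob}(\text{string of length } \text{len}(A)) \prod_{i=1}^{\text{len}(A)} \text{Prob}(A_i).$$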
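A small Python sketch of the standard null model follows. It assumes the null probability is a product of independent per-residue background probabilities, optionally multiplied by a probability for the string length. The function name, the `length_log_prob` hook, and the uniform background are hypothetical illustration choices.

```python
import math

def log_prob_standard_null(seq, background, length_log_prob=None):
    """log Prob(A | N) under a standard null model.

    seq              -- the sequence A as a string of residue letters
    background       -- dict mapping each residue letter to Prob(A_i)
    length_log_prob  -- optional function n -> log Prob(string of length n);
                        when None, the length term is left out
    """
    total = sum(math.log(background[res]) for res in seq)
    if length_log_prob is not None:
        total += length_log_prob(len(seq))
    return total

# Toy usage with an assumed uniform background over the 20 amino acids.
uniform_bg = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
null_score = log_prob_standard_null("CTTSKECWSVC", uniform_bg)
```

Working in log space keeps the product from underflowing on long sequences, and the result can be subtracted directly from a model's log probability to give a log-odds score.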
60
Reversed model for null
$$\frac{\text{Prob}(M \mid S)}{\text{Prob}(M^r \mid S)} = \frac{\text{Prob}(S \mid M)}{\text{Prob}(S \mid M^r)}.$$
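The reversed-model null can be illustrated with a toy Python example. Two assumptions are made here and are not taken from the slides: scoring a sequence against the reversed model is treated as equivalent to scoring the reversed sequence against the original model, and the model is a simple ungapped profile standing in for a full HMM. All names and probabilities are hypothetical.

```python
import math

def log_prob_profile(seq, profile):
    """log Prob(S | M) for a toy ungapped profile: one residue distribution
    per position; the sequence must have the same length as the profile."""
    if len(seq) != len(profile):
        raise ValueError("toy profile only scores sequences of its own length")
    return sum(math.log(column[res]) for res, column in zip(seq, profile))

def reversed_null_log_odds(seq, profile):
    """log [Prob(S | M) / Prob(S | M^r)], scoring M^r by reversing S instead."""
    return log_prob_profile(seq, profile) - log_prob_profile(seq[::-1], profile)

# Toy three-position profile over a three-letter alphabet.
toy_profile = [
    {"A": 0.8, "C": 0.1, "K": 0.1},
    {"A": 0.1, "C": 0.8, "K": 0.1},
    {"A": 0.1, "C": 0.1, "K": 0.8},
]
forward_score = reversed_null_log_odds("ACK", toy_profile)
palindrome_score = reversed_null_log_odds("ACA", toy_profile)
```

Because the reversed sequence has the same length and composition as the original, a sequence that scores well only for those reasons gets no credit: a palindrome scores exactly zero here, while a sequence that matches the profile's order scores positive.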
61
Composition as source of error
62
Composition examples
Kistrin (1kst)
63
Composition examples
Kistrin (1kst)
64
Long helices as source of error
65
Helix examples
Tropomyosin (2tmaA)
Colicin Ia (1cii)
66
Helix examples
67
Discrimination Performance as a Function of Null Model
[Figure: discrimination plot. x-axis: True Positives (150 to 450); y-axis: log scale, 1 to 100.]
68
Null Model Scores: Significance and Threshold Selection
After selecting a null model, the next step is to test the performance and select threshold values using a test appropriate to the final application.
[Figure: Calibration curve for target models. Curves: superfamily, 0.1 *N exp(cost). x-axis: SAM-T98 target model cost (nats), -40 to 0; y-axis: False Positives (log scale, 1 to 100000).]
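The reference curve named in the calibration plot, 0.1 * N * exp(cost), can be turned into a simple threshold calculator. This Python sketch assumes that N is the number of sequences in the database searched and that cost is the model cost in nats (more negative is better); neither is spelled out on the slide, and the function names are hypothetical.

```python
import math

def expected_false_positives(cost_nats, n_sequences, scale=0.1):
    """Reference curve from the calibration plot: scale * N * exp(cost)."""
    return scale * n_sequences * math.exp(cost_nats)

def cost_threshold(max_false_positives, n_sequences, scale=0.1):
    """Cost (in nats) at which the reference curve predicts the given count.
    Sequences scoring at or below this cost would be accepted."""
    return math.log(max_false_positives / (scale * n_sequences))

# Example: threshold allowing about one expected false positive
# in a hypothetical database of 100,000 sequences.
example_threshold = cost_threshold(1.0, 100_000)
```

The reference curve is only a guide; as the slide says, the threshold should be checked with a test that matches the final application.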
69
Reestimating the Alignment with New Family Members
70
Effects of Reestimation
The sequence logos shown were generated by the alignment produced for 1yrnB in the third iteration of SAM-T98. The upper logo reflects the alignment before reestimation, and the lower logo after reestimation.
[Figure: two sequence logos for the 1yrnB alignment in the third iteration, before reestimation (upper) and after reestimation (lower); y-axis: bits saved, x-axis: alignment columns 1-42.]
71
Effects of Reestimation, Continued
72
Pitfalls in Alignment Reestimation
73
Constraining Seed Sequences
To keep seed sequences from shifting during alignment reestimation, constraints fix the residues of the seed sequences to certain states in the model (or columns in the alignment).
(Iterations 1 - 3)
Initial alignment
Database Model
New Homologs
Reestimate alignment:
seed sequences fixed
homologs permitted to move
(Iteration 4)
End: a SAM-T98 alignment
Advantages:
The seed signal is not lost to unintended reestimation.
The resulting alignments tend to be more accurate.
74
Reiterate
75
Avoiding early mistakes
Threshold score:
76
2crd, starting out
[Figure: sequence logo of the initial 2crd alignment; y-axis: bits saved, x-axis: alignment columns 1-37.]
77
2crd, Iteration 1
[Figure: sequence logo of the 2crd alignment after iteration 1; y-axis: bits saved, x-axis: alignment columns 1-37.]
78
2crd, Iteration 2
[Figure: sequence logo of the 2crd alignment after iteration 2; y-axis: bits saved, x-axis: alignment columns 1-37.]
79
2crd, Iteration 3
[Figure: sequence logo of the 2crd alignment after iteration 3; y-axis: bits saved, x-axis: alignment columns 1-37.]
80
2crd, Iteration 4
[Figure: sequence logo of the 2crd alignment after iteration 4; y-axis: bits saved, x-axis: alignment columns 1-37.]
81
1yrnB, Starting out
[Figure: sequence logo of the initial 1yrnB alignment in two panels; y-axis: bits saved, x-axis: alignment columns 1-42 and 43-83.]
82
1yrnB, Iteration 1
[Figure: sequence logo of the 1yrnB alignment after iteration 1 in two panels; y-axis: bits saved, x-axis: alignment columns 1-42 and 43-83.]
83
1yrnB, Iteration 3
15 sequences aligned, total sequence weight = 2.05.
The variability of the alignment increased substantially, and the total sequence weight more than doubled.
Tryptophan 52 was emphasized.
Leucines 17 and 38, Tyrosine 29, Asparagine 55, and Arginine 57 began to stand out. Less specific hydrophobic signals emerged at Columns 21, 44, and 49.
[Figure: sequence logo of the 1yrnB alignment after iteration 3 in two panels; y-axis: bits saved, x-axis: alignment columns 1-42 and 43-83.]
84
1yrnB, Iteration 4
[Figure: sequence logo of the 1yrnB alignment after iteration 4 in two panels; y-axis: bits saved, x-axis: alignment columns 1-42 and 43-83.]
85
The final alignment
86
Final SAM-T98 Alignment for 2crd
Shown below is an excerpt from the final SAM-T98 alignment for 2crd, showing remote homologs of known structure. The logo shown reflects the full alignment.
2crd XFTNVSCTTSKECWSVCQRLHNTSRG.KCMNKKCRCYS...
1bah XFTNVSCTTSKEXWSVCQRLHNTSRG.KCMNKKXRCYS...
1cmr ------CTTSKECWSVCQRLHNTSKG.WCDHRGCICES...
1lir XFTQESCTASNQCWSICKRLHNTNRG.KCMNKKCRCYS...
1sxm TIINVKCTSPKQCSKPCKELYGSSAGaKCMNGKCKCYNnx.
1txm ----VSCTGSKDCYAPCRKQTGCPNA.KCINKSCKCYGcx.
1big XFTDVKCTGSKQCWPVCKQMFGKPNG.KCMNGKCRCYS...
1bkt VGINVKCKHSGQCLKPCKDA-GMRFG.KCINGKCDCTPkx.
2bmt XFTNVSCSASSQCWPVCKKLFGTYRG.KCMNSKCRCYS...
[Figure: sequence logo of the full final SAM-T98 alignment for 2crd; y-axis: bits saved, x-axis: alignment columns 1-37.]
87
Comparison with the 2crd structural alignment
The logo below represents the FSSP [HS97] structural alignment for 2crd. The FSSP logo and the SAM-T98 logo will differ due to homologs of unknown structure in SAM-T98 and additional distant homologs in FSSP.
Cysteine 33 and Glycine 26 are emphasized less in the SAM-T98 alignment while Glycine 22 is emphasized more.
Both alignments emphasize the cysteines.
[Figure: sequence logo of the FSSP structural alignment for 2crd; y-axis: bits saved, x-axis: alignment columns 1-37.]
88
SAM-T98 alignment for 1yrnB
10 20 30 40 50 60 70
| | | | | | |
1yrnB GHRFTKENVRILESWFAKNIENPYLDTKGLENLMKNTSLSRIQIKNWVSNRRRKEKTITIAPELADLL
1hom RQTYTRYQTLELEKEFHFN---RYLTRRRRIEIAHALCLTERQIKIWFQNRRMKWKKENKTKGEPG--
1hddC RTAFSSEQLARLKREFNEN---RYLTERRRQQLSSELGLNEAQIKIWFQNKRAKIKK-----------
1fft RVLFSQAQVYELERRFKQQ---KYLSAPEREHLASMIHLTPTQVKIWFQNHRYKMKRQAKDK------
1yrnA --SISPQARAFLEQVFRRK---QSLNSKEKEEVAKKCGITPLQVRVWFINKRMRSK------------
1ocp RTSIENRVRWSLETMFLKC---PKPSLQQITHIANQLGLEKDVVRVWFCNRRQKGKR-----------
89
Final SAM-T98 alignment for 1yrnB, continued
[Figure: sequence logo of the final SAM-T98 alignment for 1yrnB in two panels; y-axis: bits saved, x-axis: alignment columns 1-42 and 43-83.]
90
Comparison with the 1yrnB structural alignment
The logo below represents the 1yrnB structural family according to FSSP. The SAM-T98 logo will differ because of different sets of homologs, and slight differences in the 1yrnB sequence resulting from positions not fully resolved.
FSSP places more emphasis on Glutamic Acid 18. SAM-T98 places more emphasis on Alanine 39 and others.
The two logos reflect similar alignments.
[Figure: sequence logo of the FSSP structural alignment for 1yrnB, positions 1-78; vertical axis: bits saved.]
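The "bits saved" axis of these logos can be illustrated with a small calculation. The sketch below assumes that "bits saved" means the relative entropy of a column's residue distribution against a background distribution, and that each letter's height is its share of that total. It uses raw counts and a uniform background, both simplifications chosen for illustration; the function names are hypothetical and this is not the SAM-T98 code, which also applies sequence weighting and regularizers.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; a real logo would use measured frequencies.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def bits_saved(column, background=UNIFORM_BACKGROUND):
    """Relative entropy (bits) of one alignment column versus the background."""
    counts = Counter(column)
    total = sum(counts.values())
    bits = 0.0
    for residue, n in counts.items():
        p = n / total
        bits += p * math.log2(p / background[residue])
    return bits


def letter_heights(column, background=UNIFORM_BACKGROUND):
    """Split the column's bits among its residues in proportion to frequency."""
    counts = Counter(column)
    total = sum(counts.values())
    saved = bits_saved(column, background)
    return {residue: saved * n / total for residue, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical columns: fully conserved, split between two residues, mixed.
    for column in ("AAAAAAAA", "AAAALLLL", "AELKRQST"):
        print(column, round(bits_saved(column), 3), letter_heights(column))
```

Under these assumptions a fully conserved column reaches the maximum of log2(20) bits, and a column that matches the background exactly saves no bits, which is why weakly conserved positions nearly vanish from a logo.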
SAM-T98 structure prediction
[Figure: diagram not recovered; surviving label: "Combine scores".]
Bidirectional scoring
[Figure: plot; one axis labeled "True Positives" with ticks from 0 to 1000, the other axis with ticks at 1, 10, and 100.]
Assessing the Score
[Figure: plot; one axis labeled "SAM-T98 summed cost (nats)" with ticks from -40 to 0, the other axis with ticks at 1, 10, 100, and 1000.]
BiDirectional Scoring Results for 2crd in rank order
Score range           ID       SCOP
Score ≤ -50           2crd     S
                      1big     U
                      2bmt     U
                      1lir     F
                      1agt     F
                      1mtx     F
                      2ktx     F
                      1ktx     F
                      1sxm     F
                      1bah     S
-50 < Score ≤ -40     1sco     F
                      1bkt     U
                      2pta     F
                      1txm     F
-40 < Score ≤ -30     1tsk     F
                      1cmr     S
-30 < Score ≤ -20     1gps     P
                      1gpt     P
                      1acw     P
                      1cn2     U
-20 < Score ≤ -10     1myn     P
                      1b3cA    U
                      2b3c     U
                      1vnb     P
                      1vna     P
                      2sn3     P
BiDirectional Scoring Results for 2crd in rank order, contd.
BiDirectional Scoring Results for 1yrnB in rank order
The SCOP relatedness to 1yrnB is as follows: S = same subfamily, F = same family, U = unclassified, O = other.
Score range           ID            SCOP
Score ≤ -60           2hoa          F
                      1ahdP         F
                      1hom          F
                      9antA/B       U
                      1b8iA         U
                      1san          U
                      1b72A         U
                      3hddA/B       U
                      1hddC         F
                      1fjlA/B/C     F
                      1ftz          F
                      2hddA/B       F
-60 < Score ≤ -50     1nk2P         U
                      1nk3P         U
                      1vnd          F
                      1ftt          F
                      1enh          F
                      1b72B         U
                      1b8iB         U
-50 < Score ≤ -40     1yrnB         S
                      1aplC         S
                      1akhB         S
                      1aplD         S
                      1mnmC/D       S
-40 < Score ≤ -30     1octC         F
                      1hdp          F
                      1yrnA         S
                      1akhA         S
                      1pog          F
                      1au7A/B       F
                      1ocp          F
-30 < Score ≤ -20     none
-20 < Score ≤ -10     1lfb          F
                      2lfb          F
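One way to read the 1yrnB table is to tally the SCOP labels within each score range. The sketch below does this with the labels transcribed from the table in rank order. The dictionary layout, the use of each range's upper bound as its key, and the function names are choices made for this illustration only.

```python
from collections import Counter

# SCOP labels per score range for 1yrnB, in rank order, transcribed from the
# table. Each key is the upper bound of a range in nats; -60 is open below.
LABELS_BY_UPPER_BOUND = {
    -60: "FFFUUUUUFFFF",
    -50: "UUFFFUU",
    -40: "SSSSS",
    -30: "FFSSFFF",
    -20: "",          # the table lists "none" for this range
    -10: "FF",
}


def tally(labels_by_bound):
    """Count S/F/U/O labels within each score range."""
    return {bound: Counter(labels) for bound, labels in labels_by_bound.items()}


def overall(labels_by_bound):
    """Count labels across all score ranges."""
    total = Counter()
    for labels in labels_by_bound.values():
        total.update(labels)
    return total


if __name__ == "__main__":
    for bound, counts in tally(LABELS_BY_UPPER_BOUND).items():
        print(f"score <= {bound}: {dict(counts)}")
    print("all ranges:", dict(overall(LABELS_BY_UPPER_BOUND)))
```

No hit in the table carries the label O, so every hit that SCOP classifies at all is in the same family or subfamily as 1yrnB; the remaining hits are unclassified rather than known false positives.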
BiDirectional Scoring Results for 1yrnB in rank order, contd.
Should you build an HMM structure library?
If you want to make structure predictions for only a few proteins, build HMMs for these proteins and score the protein structure sequence database.
If you want to make structure predictions for 10,000 sequences (for example), it would be more efficient to build HMMs for protein structures.
Anywhere in between these two extremes, the decision to build an HMM structure library depends on your time and computing power resources.
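The trade-off above can be made concrete with a rough cost model. The sketch below assumes that building an HMM costs the same for a target as for a structure, and that scoring one sequence against one model has a fixed cost. The function name, the cost units, and the example numbers (including the 2,000 structures) are hypothetical and not taken from SAM-T98.

```python
def compare_strategies(n_targets, n_structures, build_cost, score_cost):
    """Return estimated total costs for both strategies and the cheaper one.

    Assumption: costs are in arbitrary units (e.g. CPU-minutes) and do not
    depend on sequence length or family size, which they do in practice.
    """
    # Strategy A: one HMM per target, scored against every structure sequence.
    target_models = n_targets * (build_cost + n_structures * score_cost)
    # Strategy B: one HMM per structure, scored against every target sequence.
    structure_library = n_structures * (build_cost + n_targets * score_cost)
    cheaper = ("target models" if target_models <= structure_library
               else "structure library")
    return {
        "target models": target_models,
        "structure library": structure_library,
        "cheaper": cheaper,
    }


if __name__ == "__main__":
    # Hypothetical numbers: 2,000 structures, model building 100x a scoring run.
    for n_targets in (5, 2000, 10000):
        print(n_targets, compare_strategies(n_targets, 2000, 100.0, 1.0))
```

In this simplified model the scoring work is identical for both strategies, so the choice comes down to how many models must be built: the library wins once there are more target sequences than structures. A library can also be reused for later queries, which the model does not count and which shifts the balance further toward building one.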
Building Alignments for a Structure Library
Future directions in HMMs for sequence analysis
HMMs on the Web
References
[AGM+90] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. JMB, 215:403-410, 1990.
[BB97] C. Bystroff and D. Baker. Blind predictions of local protein structure in CASP2 targets using the I-sites library. Proteins: Structure, Function and Genetics, Suppl. 1:167-171, 1997.
[BHK97] Christian Barrett, Richard Hughey, and Kevin Karplus. Scoring hidden Markov models. CABIOS, 13(2):191-199, 1997.
[Edd98] S. R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755-63, 1998.
[GBEB97] W. N. Grundy, T. L. Bailey, C. P. Elkan, and M. E. Baker. Meta-MEME: Motif-based hidden Markov models of protein families. CABIOS, 13(4):397-406, 1997.
[GME87] Michael Gribskov, Andrew D. McLachlan, and David Eisenberg. Profile analysis: Detection of distantly related proteins. PNAS, 84:4355-4358, July 1987.
[HH92] Steven Henikoff and Jorja G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915-10919, November 1992.
[HH94] Steven Henikoff and Jorja G. Henikoff. Position-based sequence weights. JMB, 243(4):574-578, November 1994.
[HHKS96] Jeffrey D. Hirschberg, Richard Hughey, Kevin Karplus, and Don Speck. Kestrel: A programmable array for sequence analysis. In Application-Specific Array Processors, pages 25-34, Los Alamitos, CA, July 1996. IEEE Computer Society.
[HK96] Richard Hughey and Anders Krogh. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. CABIOS, 12(2):95-107, 1996. Information on obtaining SAM is available at http://www.cse.ucsc.edu/research/compbio/sam.html.
[HMBC97] T. Hubbard, A. Murzin, S. Brenner, and C. Chothia. SCOP: a structural classification of proteins database. NAR, 25(1):236-9, January 1997.
[HS97] Liisa Holm and Chris Sander. Dali/FSSP classification of three-dimensional protein folds. NAR, 25:231-234, 1 Jan 1997.
[JDH99] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, Aug 1999.
[Kar95] Kevin Karplus. Regularizers for estimating distributions of amino acids from small samples. In ISMB-95, Menlo Park, CA, July 1995. AAAI/MIT Press.
[KBC+99] Kevin Karplus, Christian Barrett, Melissa Cline, Mark Diekhans, Leslie Grate, and Richard Hughey. Predicting protein structure using only sequence information. Proteins: Structure, Function, and Genetics, to appear, 1999.
[KBH98] Kevin Karplus, Christian Barrett, and Richard Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846-856, 1998.
[KBM+94] A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235:1501-1531, February 1994.
[KKB+97] Kevin Karplus, Kimmen Sjölander, Christian Barrett, Melissa Cline, David Haussler, Richard Hughey, Liisa Holm, and Chris Sander. Predicting protein structure using hidden Markov models. Proteins: Structure, Function, and Genetics, Suppl. 1:134-139, 1997.
[OMJ+97] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton. CATH - a hierarchic classification of protein domain structures. Structure, 5(8):1093-108, August 1997.
[PKB+98] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, and C. Chothia. Sequence comparisons using multiple sequences detect twice as many remote homologues as pairwise methods. JMB, 284(4):1201-1210, 1998. Paper available at http://www.mrc-lmb.cam.ac.uk/genomes/jong/assess_paper/assess_paperNov.html.