Академический Документы
Профессиональный Документы
Культура Документы
The currently used major algorithms foraligning biological sequences (protein and
nucleic acid sequences) stem from the pioneering work of Needleman 8: Wunsch
(1970). Needleman-Wunsch’s method has also been applied to statistical tests of
relatedness between a pair of sequences (Barker & Dayhoff, 1972; Doolittle, 1981).
SelIers (1974) proved that evolutionary distances obtained with a similar algorithm
to that of Needleman & Wunsch satisfy metricconditions. Sellers’ metric was later
generalized by Waterman et al. (1976) SO that deletions/insertions (gaps) of any
length are allowed.- Inclusion of multiple-sized gapsis feasible for comparing \
biological sequences since a long gap can be produced by a single mutational event. ”,
This situation is incorporated intothe method of Waterman et al.( 1976) by assigning
a weight w k < kuJ,to agap of length k, whereas the gapweight is confined to wk=kur,
for all k values in the method of Needleman, Wunsch and Sellers. However, the
algorithm of Waterman et al. (1976) has adrawback in that it takes large
a number of
computational steps of the order of M 2 N compared to the M N steps of
Xeedleman-Wunsch-Sellers’ algorithm, where 111 and N ( M 2 N )are thelengths of
the proteins or nucleic acids under comparison. This is a particularly serious
problem when calculations are made on a low-speed small computer.
In this letterI present a new algorithm which allowsmultiple-sized gaps but runs
in essentially HA’steps if the gap weight has a special form of wk=uk+v ( u 20,
6 2 0). This formof gap weight has usually been used in computer systems that adopt
t.he matching algorithm of Waterman el al. (Smith et al., 1981; Kanehisa, 1982).
(a) A lgon’thm
Let the two sequencesbe A =&,a2. . . aMand B = b, b2. . . b,. A weight. d(a,, b,) is
given to an aligned pair of residues a, and b,. Usually, d(a,, b,) SO if a,= b,, and
d(a,, b,) > 0 if a
, # bn.The algorithmof Waterman 41 al. (1976) generates a distance
matrix Dm,,by an induction as follows:
Drn.n=?llin [ D m - 1.n- I +d(am, bn), P m . nr &m.nII (1)
where
701i
and
P,,,=Min [Dm-l.n+~!lr
nilin (Dm-k.n+u:k)]
2sksm
A 0 12 22 32 42 52 A 0 12 22 32 42 52
A
12 0 12 22 42 52 A
12 34 44 54 64 74
3 3 3 1\3 3 3 3
22 /2\: 12 32 42 22 22 34 44 64 74
A 3 2 A
-SF \I 3 3 2
32 '23\lk\i 22 32 A
32 32 '22 34 54
64
A
I 'I 91 3 2 5 2391 3
G
42 42 32 '22 IO 32 G
42 42 3222
54 44
1 1 1 5 1 3 5 3 1 5 f 3
t 52 52 42 3232 20 G 52 52
42 32 32 54
1 4 4 5 1 5 4 4 5 1
T
626252 23'24 32 T 62 62 52 42\42 44
1 4 4 5 4 4
T 72 72 62 52 42 32
I
T 72 72 62 52 52 64
1 4 4 / 1 -5 4 4 1 1
P,!,,=3lin I D , , , - K l - l . n + t r K I + ~P,!,_,.,+n,
tl. [ (7)
a1~d
P,!,,,,].
P,,,,=JIin [PESn, (8)
The relations for QmVn are obtainable analogously. These relat.iorrs are also easily
extended for ,522.
To execute the above procedure 011 a compukr, one needs to prepare a queue
memory with .If x (X,+ I ) cells storing ( O S k S K , ) Yalues,in addition bo
(L+ I ) one-dimensional arrays andL+ 1 variables which store temporary valuesof
,:P and Q;,= (1=0,1,2, . . ., L).The number of computational steps is roughly
proportional to (L+ 2)JZN. if N B K,.
W e cannot a priori determine appropriate values for the weights and parameters
involved in the algorithm, but they may be estimated by a dynamic. optimization
procedure (Ssnkoff et al., 1976). The weights t.hus obt.ained are useful for examining
prerjously unknown relatedness between a pair of sequences. Suchan investigation
on the interrelation of 4-5 S RNA sequences is reported elsewhere (Takeishi it Cotoh,
1982).
I thank Dr XI. I . Kanehisa for sending me the complete so~lrcecodes of the h s Xlamos
708 0. G O T O H
Sequence Analysis Package. I also thank Dr K. Takeishi and Dr Y. Tagashira for discussion
and encouragement.
Department of Biochemistry
Saitama Cancer Center Research Institute
Ins-machi, Saitama 362. Japan
REFERESCES
Barker, W . C. & Dayhoff, 31. 0.(197;2).In B t l a of Protein Sequence und Stnrctwe (Dayhodi
JI. O., ed.), vel. 6, pp. 101-110, Xational Biomedical Research Foundation, Silrer?
Spring.
Doolittle, R. F. (1981). Science. 214, 149-159.
Goad. W . B. & Kanehisa. JI. 1. (1982). Sztcl. Acids Res. 10, 247-263.
Kanehisa. 31. I. (1982). AVucl..-kids Res. 10. 183-196.
Luehrsen, K.R.,Sicholson, D.E., Eubanks, D. C. & Fox. G . E.(1981). .''atwe flo72&n),293,
---
I aa-736.
Seedlernan, S. B. & Wunsch. C. D. (3970). J. Y o l . B W . 48, 443453.
Sankoff, D.,Cedergren, R. J . & Lapalme, G . (1976). J . MoZ. E d . 7 , 133-149.
Sellers, P. H. (19'52).J . .4ppl. Jfath. (Sium), 26, 787-793.
Sellers, P. H.(1980). J . dlgorifhm, 1, 339-353.
Smith, T.F.8. N'aterman, 11. 8.(1981).J. .Mol. Biol. 147, 195-197.
Smith. T.F Waterman 31. S. Br Fit,ch, R . 31. (1981). J . Jfol. E d . 18, 3 8 4 6 .
Takeishi. K.8. Gotoh, 0.(1982).J. Riorhpm. (Tokyo), 92. 1173-11'5'5.
Waterman, AI. 8.. Smit.h. 7'. F. & Beyer, \V. X. (1976). ddrarr. Y a t h . 20, 367-387.
Edited by S . Bretwer