Вы находитесь на странице: 1из 4

J. Mol. B i d .

(1982) 162, 705-708

An Improved Algorithm for Matching Biological Sequences


The aigorithm of Waterman el al. (1976) for matching biological sequences was
modified under some limitationsto be accomplished in essentially J W N steps, instead
of the iM2N steps necessary in the originalalgorithm. The limitations do not
seriously reduce the generality of the original method, and the present method is
available for most practical uses. The algorithmcan be executed on asmall computer
with a limited capacity of core memory.

The currently used major algorithms foraligning biological sequences (protein and
nucleic acid sequences) stem from the pioneering work of Needleman 8: Wunsch
(1970). Needleman-Wunsch’s method has also been applied to statistical tests of
relatedness between a pair of sequences (Barker & Dayhoff, 1972; Doolittle, 1981).
SelIers (1974) proved that evolutionary distances obtained with a similar algorithm
to that of Needleman & Wunsch satisfy metricconditions. Sellers’ metric was later
generalized by Waterman et al. (1976) SO that deletions/insertions (gaps) of any
length are allowed.- Inclusion of multiple-sized gapsis feasible for comparing \
biological sequences since a long gap can be produced by a single mutational event. ”,
This situation is incorporated intothe method of Waterman et al.( 1976) by assigning
a weight w k < kuJ,to agap of length k, whereas the gapweight is confined to wk=kur,
for all k values in the method of Needleman, Wunsch and Sellers. However, the
algorithm of Waterman et al. (1976) has adrawback in that it takes large
a number of
computational steps of the order of M 2 N compared to the M N steps of
Xeedleman-Wunsch-Sellers’ algorithm, where 111 and N ( M 2 N )are thelengths of
the proteins or nucleic acids under comparison. This is a particularly serious
problem when calculations are made on a low-speed small computer.
In this letterI present a new algorithm which allowsmultiple-sized gaps but runs
in essentially HA’steps if the gap weight has a special form of wk=uk+v ( u 20,
6 2 0). This formof gap weight has usually been used in computer systems that adopt
t.he matching algorithm of Waterman el al. (Smith et al., 1981; Kanehisa, 1982).

(a) A lgon’thm
Let the two sequencesbe A =&,a2. . . aMand B = b, b2. . . b,. A weight. d(a,, b,) is
given to an aligned pair of residues a, and b,. Usually, d(a,, b,) SO if a,= b,, and
d(a,, b,) > 0 if a
, # bn.The algorithmof Waterman 41 al. (1976) generates a distance
matrix Dm,,by an induction as follows:
Drn.n=?llin [ D m - 1.n- I +d(am, bn), P m . nr &m.nII (1)
where
701i

and

Although Pm,,(or QmJ appears to be calculated in m- 1 (or n.- 1) steps, it can be


obtained in a single step according to the following recursion relations:

P,,,=Min [Dm-l.n+~!lr
nilin (Dm-k.n+u:k)]
2sksm

=Min [Dm-l.n+z~!l,hlin (Dm-l-k.n+u?k+lj]


lsksrn-1

=Min [Dm-,.n+w,, Min (Dm-l-k.n+wk)+~)


lsksm-1
=Min [Dm-l.n+~!l.Pm-,.,+u] (4)
and
Qmn..=Min[Dm.n-l+U!lr Qm.n-l+z(]- (5)
T h u s , the induction is completed in i M N steps, each of which consistsof choosing the
smallest of three numbers forDm.,,and the smaller of two numbers for or Qqa.
At the beginning of the induction, one mayset Dm,o=Pm,o= torn(1 S m I Y ) ,and
DO.n=QO.n=t(:n (1s-nIN). Alternatively, Dm,o=Pm.o=Oand DO,n=QO,n=~nr 0;.
= Pm. = 0 and DO,.= Q,,. =0 may be chosen in searching for the most locally
similar subsequence (Sellers, 1980; Smith 8: Waterman, 1981: Goad & Kanehisa,
1982).
I n a computer program, not all the elements of Dm*.,P,,,” and Qmva need be
memorized; two one-dimensional arrays and one variable are sufficient to store
temporary values of these quantities. This feature is also useful for executing the
algorithm on a small computer equipped with a small size of core memory.
The optimallymatched alignments are availableby backtracking guided by the
“direction matrix” em,n, whose element is a three-bit binary number indicating the
paths throughwhich the minimum valueof Dm..is chosen (Smith el d.,1981;Goad&
Kanehisa, 1982). The complete set of emSnvalues is obtained by runningthe above
algorithm twice, first calculating (k20)and second exchanging the
column/row assignments of X and B, and finally taking bit-to-bit. logical OR values
of the first ern,. and the second tn,,.
Figure 1 shows an example.in which ern., values obtained after t.hefirst run of t h p
algorithm are shown in (a), and the final em.. values in (b).Figure l(a) and (b) also
demonstrates Dm.,\?lues and Q,.. values. respectively. Sote that the second run
converts the underlined ern,, values from one to five, although theydo not contributr
to the traceback (indicated by arrows) in this example.
The above-mentioned algorithm can be further generalized if wk has the folh*-
ing form: wk=u0k+r ( 1 s k < K I ) . ~ c ! ~ = ~ ~ , ( X . - h ’ ~ () K + u, <~ k~ _, < K 2 ) ....
.
w k = u L( k - K , ) 3 w p , (KL<k),where ‘IL, terms are constants of ~ , , > 2 ~ . ~ > 2 ~. .~ .
> u,20. The simplest caseof interest is L = 1 and 11 = 0, Le. trkis a linear function of’
k in t.he range 1 ,<k< K , , while i t is a const.ant.( = wK,) for all k values greatert h r l
A A A L T T A A A A T T

A 0 12 22 32 42 52 A 0 12 22 32 42 52

A
12 0 12 22 42 52 A
12 34 44 54 64 74
3 3 3 1\3 3 3 3
22 /2\: 12 32 42 22 22 34 44 64 74
A 3 2 A
-SF \I 3 3 2
32 '23\lk\i 22 32 A
32 32 '22 34 54
64
A
I 'I 91 3 2 5 2391 3
G
42 42 32 '22 IO 32 G
42 42 3222
54 44
1 1 1 5 1 3 5 3 1 5 f 3
t 52 52 42 3232 20 G 52 52
42 32 32 54
1 4 4 5 1 5 4 4 5 1
T
626252 23'24 32 T 62 62 52 42\42 44
1 4 4 5 4 4
T 72 72 62 52 42 32
I
T 72 72 62 52 52 64
1 4 4 / 1 -5 4 4 1 1

X I . Such an assignment of rc., seen~sa.deqnate for alignment ofsequences with largr


gaps, e.g. aiignment of'Halococcmnorrhuac) 5 S ItSA (Luehrsw r t nl.. 1981) against
usual prokaryotic or eukaryot.ir 5 S H.S.As. \\'htw I,= I . thr recursion relatiow for
Pmanare derived as:

P,!,,=3lin I D , , , - K l - l . n + t r K I + ~P,!,_,.,+n,
tl. [ (7)
a1~d
P,!,,,,].
P,,,,=JIin [PESn, (8)
The relations for QmVn are obtainable analogously. These relat.iorrs are also easily
extended for ,522.
To execute the above procedure 011 a compukr, one needs to prepare a queue
memory with .If x (X,+ I ) cells storing ( O S k S K , ) Yalues,in addition bo
(L+ I ) one-dimensional arrays andL+ 1 variables which store temporary valuesof
,:P and Q;,= (1=0,1,2, . . ., L).The number of computational steps is roughly
proportional to (L+ 2)JZN. if N B K,.
W e cannot a priori determine appropriate values for the weights and parameters
involved in the algorithm, but they may be estimated by a dynamic. optimization
procedure (Ssnkoff et al., 1976). The weights t.hus obt.ained are useful for examining
prerjously unknown relatedness between a pair of sequences. Suchan investigation
on the interrelation of 4-5 S RNA sequences is reported elsewhere (Takeishi it Cotoh,
1982).
I thank Dr XI. I . Kanehisa for sending me the complete so~lrcecodes of the h s Xlamos
708 0. G O T O H
Sequence Analysis Package. I also thank Dr K. Takeishi and Dr Y. Tagashira for discussion
and encouragement.

Department of Biochemistry
Saitama Cancer Center Research Institute
Ins-machi, Saitama 362. Japan

Received 28 June 1982

REFERESCES
Barker, W . C. & Dayhoff, 31. 0.(197;2).In B t l a of Protein Sequence und Stnrctwe (Dayhodi
JI. O., ed.), vel. 6, pp. 101-110, Xational Biomedical Research Foundation, Silrer?
Spring.
Doolittle, R. F. (1981). Science. 214, 149-159.
Goad. W . B. & Kanehisa. JI. 1. (1982). Sztcl. Acids Res. 10, 247-263.
Kanehisa. 31. I. (1982). AVucl..-kids Res. 10. 183-196.
Luehrsen, K.R.,Sicholson, D.E., Eubanks, D. C. & Fox. G . E.(1981). .''atwe flo72&n),293,
---
I aa-736.
Seedlernan, S. B. & Wunsch. C. D. (3970). J. Y o l . B W . 48, 443453.
Sankoff, D.,Cedergren, R. J . & Lapalme, G . (1976). J . MoZ. E d . 7 , 133-149.
Sellers, P. H. (19'52).J . .4ppl. Jfath. (Sium), 26, 787-793.
Sellers, P. H.(1980). J . dlgorifhm, 1, 339-353.
Smith, T.F.8. N'aterman, 11. 8.(1981).J. .Mol. Biol. 147, 195-197.
Smith. T.F Waterman 31. S. Br Fit,ch, R . 31. (1981). J . Jfol. E d . 18, 3 8 4 6 .
Takeishi. K.8. Gotoh, 0.(1982).J. Riorhpm. (Tokyo), 92. 1173-11'5'5.
Waterman, AI. 8.. Smit.h. 7'. F. & Beyer, \V. X. (1976). ddrarr. Y a t h . 20, 367-387.

Edited by S . Bretwer

Вам также может понравиться