
ARTICLE IN PRESS

Chaos, Solitons and Fractals xxx (2003) xxx–xxx


www.elsevier.com/locate/chaos

Nonlinear deterministic structures and the randomness of protein sequences
Yanzhao Huang, Yi Xiao *

Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, China
Accepted 20 November 2002

Abstract
To clarify the randomness of protein sequences, we make a detailed analysis of a set of typical protein sequences representing each structural class by using the nonlinear prediction method. No deterministic structures are found in these protein sequences, which implies that they behave as random sequences. We also give an explanation of the controversial results obtained in previous investigations.
© 2003 Elsevier Science Ltd. All rights reserved.

One of the unsolved problems in molecular biophysics is how proteins encode their structural information in their amino acid sequences. The amino acid sequences of proteins appear very irregular, but the three-dimensional structures they encode clearly show certain regularity. This riddle has motivated intensive studies of the longitudinal correlation properties of protein sequences [1–18] to see whether they are random or not. However, these studies gave opposing results: some showed that protein sequences were indistinguishable from random ones, while others indicated that protein sequences were nonrandom. For example, White and Jacobs [1] studied the statistical distribution of hydrophobic residues along the length of protein chains by using a binary hydrophobicity scale, which assigns hydrophobic residues a value of one and nonhydrophobic residues a value of zero. Using the standard run test, they found that, for the majority of the 5247 proteins examined, the distribution of hydrophobic residues along a sequence could not be distinguished from that expected for a random distribution. On the other hand, Pande et al. [8] studied the statistics of protein sequences by mapping the sequence onto the trajectory of a random walk. They found pronounced deviations from pure randomness. Both studies use a binary scale of hydrophobicity and hydrophilicity but different mapping schemes. In the work of White and Jacobs, Phe, Met, Leu, Ile, Val, Cys, Ala, Pro, Gly, Trp and Tyr were considered hydrophobic and the other residues hydrophilic, while in the work of Pande et al., Lys, Arg, His, Asp and Glu were considered hydrophilic and the others hydrophobic. Recently, Weiss and Herzel [12,13] analyzed the correlation functions in large sets of nonhomologous protein sequences. They found that the hydrophobicity autocorrelation showed period 3 to 4 oscillations. These oscillations decayed until they vanished at a length of 10–15 amino acids, and they can be related to the 3.6 periodicity of α-helices. Rackovsky [14] demonstrated the existence in protein domain sequences of sets of statistically significant periodic signals, characteristic of the architectures of those domains. Therefore, despite the efforts spent, it is still an open question whether protein sequences are random or not, and further work is warranted to clarify the apparent contradictions in the above results.
The above investigations were based on statistical methods, usually used in physics, namely correlation functions,
random walk, Fourier transform, etc. As mentioned above, protein sequences are very irregular. It is known that

* Corresponding author.
E-mail address: yxiao@mail.hust.edu.cn (Y. Xiao).

0960-0779/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved.
PII: S0960-0779(02)00571-4

nonlinear dynamics theory has developed some very good methods to identify determinism or randomness in irregular systems [19], so it is reasonable to investigate the correlation properties of protein sequences by using the methods of nonlinear dynamics. In fact, the theory of chaos has already been applied to investigate the behavior of biomolecules. For example, El Naschie et al. [20–22] studied the possible connections of spatial chaos in mechanical elastic chains to the conformations of biomolecules, and they showed supercoiling in the elastic band very similar to that of DNA. They also investigated chaos and order in symbolic sequences and polymers [23,24]. In the present paper we study the correlation properties of protein sequences by using the nonlinear prediction method, which has previously been used successfully to distinguish between chaos and noise in time series. This method can give specific information on how different regions are characterized and can detect determinism that is not detected by standard methods such as the Fourier transform and the power spectrum. It can also give reasonable results for short sequences.
The nonlinear prediction technique works as follows [19,25]. For an arbitrary symbolic series $x_1, x_2, x_3, \ldots, x_N$, one constructs a set of $d$-dimensional vectors:

\[
\begin{aligned}
X_1 &\equiv (x_1, x_2, \ldots, x_d), \\
X_2 &\equiv (x_2, x_3, \ldots, x_{d+1}), \\
&\;\;\vdots \\
X_{N-d+1} &\equiv (x_{N-d+1}, x_{N-d+2}, \ldots, x_N)
\end{aligned}
\tag{1}
\]

which correspond to all possible segments of $d$ consecutive symbols. Next, for each vector $X_p \equiv (x_p, x_{p+1}, \ldots, x_{p+d-1})$ ($1 \le p \le N-d$), one searches for its nearest neighbor $X_{H(p)} \equiv (x_{H(p)}, x_{H(p)+1}, \ldots, x_{H(p)+d-1})$ and then compares how close the symbols $x_{p+d}$ and $x_{H(p)+d}$ that follow these two vectors are. The closeness of a pair of symbols $x_i$ and $x_j$ can be measured in a Hamming metric:

\[
h(x_i, x_j) =
\begin{cases}
0, & x_i = x_j \\
1, & x_i \neq x_j
\end{cases}
\tag{2}
\]

while the closeness of a pair of vectors $X_i$ and $X_j$ can be measured by

\[
H(X_i, X_j) = \sum_{k=0}^{d-1} h(x_{i+k}, x_{j+k}). \tag{3}
\]

The nearest neighbors $X_{H(p)}$ of a given vector $X_p$ are those $X_j$ that minimize $H(X_p, X_j)$ for $j \neq p$. Once the nearest neighbors $X_{H(p)}$ have been determined, we compute the mean local error $e_p = \langle h(x_{p+d}, x_{H(p)+d}) \rangle$, where $\langle \cdot \rangle$ denotes the average over all the nearest neighbors of $X_p$, since there is usually more than one nearest neighbor. From this, the overall mean error is

\[
\langle E_d \rangle = \frac{1}{N-d} \sum_{p=1}^{N-d} e_p \tag{4}
\]

For a perfectly deterministic sequence, e.g., a periodic sequence, $\langle E_d \rangle = 0$. For uncorrelated random chains, there is no relation between any symbol $x_{p+d}$ and the vector $X_p$, and in that case $\langle E_d \rangle$ can be approximated by $\sum_{\{a\}} p(a)[1 - p(a)]$, where $\{a\}$ is the alphabet taken by $x_i$ and $p(a)$ is the probability of occurrence of the symbol $a$. Consequently, for such series, the overall mean error $\langle E_d \rangle$ will not depend on the embedding dimension $d$.
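The procedure of Eqs. (1)–(4) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `hamming` and `mean_prediction_error` are hypothetical, indices are 0-based, and candidate neighbors are restricted to segments that have a successor symbol, which is an assumption about how the end of the series is handled.

```python
from typing import Sequence


def hamming(u: Sequence, v: Sequence) -> int:
    """Eq. (3): number of positions at which two equal-length segments differ."""
    return sum(a != b for a, b in zip(u, v))


def mean_prediction_error(x: Sequence, d: int) -> float:
    """Eq. (4): overall mean error <E_d> of the symbolic series x."""
    n = len(x)
    if not 1 <= d < n - 1:
        raise ValueError("need 1 <= d < len(x) - 1")
    # 0-based segment starts p = 0 .. n-d-1, so the successor x[p + d] exists.
    starts = range(n - d)
    total = 0.0
    for p in starts:
        # Distance from segment p to every other candidate segment j != p.
        dists = {j: hamming(x[p:p + d], x[j:j + d]) for j in starts if j != p}
        best = min(dists.values())
        neighbours = [j for j, dist in dists.items() if dist == best]
        # Mean local error e_p: average successor mismatch over all ties.
        total += sum(x[p + d] != x[j + d] for j in neighbours) / len(neighbours)
    return total / (n - d)


print(mean_prediction_error("ABABABABAB", 2))  # periodic series
```

The brute-force neighbor search costs on the order of $N^2 d$ symbol comparisons per sequence, which is acceptable for sequences of a few hundred residues but would need a smarter search for much longer series.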
For protein sequences, there are different ways to define the alphabet taken by $x_i$, based on the selection of physicochemical properties of amino acids. In the present paper, we consider three different schemes of representing amino acids: (i) White's scheme [1]. The 20 amino acids are divided into two types: hydrophobic ($H$) and hydrophilic ($P$). In this case, each $x_i$ can take one of two symbols $\{H, P\}$, with $H$ representing Phe, Met, Leu, Ile, Val, Cys, Ala, Pro, Gly, Trp, Tyr and $P$ representing the other amino acids. For a uniform random process, $p(H) = p(P) = 0.5$ and $\langle E_d \rangle = 0.5$. (ii) Pande's scheme [8]. It is similar to (i), but with $P$ representing Arg, Asp, Glu, His, Lys and $H$ representing the other amino acids. (iii) Each $x_i$ can take one of 20 symbols $\{A, C, D, \ldots, Y\}$, which represent the 20 amino acids. The similarity between $x_i$ and $x_j$ is taken as the value $B(x_i, x_j)$ of the blocks substitution matrix (BLOSUM62), i.e., $h(x_i, x_j) = B(x_i, x_j)$, and the closeness of a pair of vectors $X_i$ and $X_j$ is

\[
H(X_i, X_j) = \sum_{k=0}^{d-1} B(x_{i+k}, x_{j+k}) \tag{5}
\]
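The two binary schemes (i) and (ii) amount to a simple recoding of a one-letter amino acid string. The sketch below illustrates this; the names `to_hp_white` and `to_hp_pande` are hypothetical, and the treatment of non-standard letters (they fall into the "other" group of each scheme) is an assumption, not something specified in the text.

```python
# Scheme (i): Phe, Met, Leu, Ile, Val, Cys, Ala, Pro, Gly, Trp, Tyr are hydrophobic.
WHITE_HYDROPHOBIC = set("FMLIVCAPGWY")
# Scheme (ii): Lys, Arg, His, Asp, Glu are hydrophilic.
PANDE_HYDROPHILIC = set("KRHDE")


def to_hp_white(seq: str) -> str:
    """Recode a one-letter amino acid string into H/P under scheme (i)."""
    return "".join("H" if aa in WHITE_HYDROPHOBIC else "P" for aa in seq.upper())


def to_hp_pande(seq: str) -> str:
    """Recode a one-letter amino acid string into H/P under scheme (ii)."""
    return "".join("P" if aa in PANDE_HYDROPHILIC else "H" for aa in seq.upper())


all_twenty = "ACDEFGHIKLMNPQRSTVWY"
print(to_hp_white(all_twenty).count("H"))  # residues called hydrophobic in (i)
print(to_hp_pande(all_twenty).count("H"))  # residues called hydrophobic in (ii)
```

The same residue can receive different labels in the two schemes (for example Ser is $P$ in scheme (i) and $H$ in scheme (ii)), so the two recodings have different symbol frequencies.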

Fig. 1. The average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated for scheme (i). (Axes: averaged $\langle E_d \rangle$, 0.45 to 0.53, versus $d$, 5 to 50; curves for the α, β and αβ classes.)

Fig. 2. The average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated for scheme (ii). (Axes: averaged $\langle E_d \rangle$, 0.25 to 0.5, versus $d$, 5 to 50; curves for the α, β and αβ classes.)

The overall mean error is defined as

\[
\langle E_d^B \rangle = \frac{1}{N-d} \sum_{p=1}^{N-d} e_p^B = \frac{1}{N-d} \sum_{p=1}^{N-d} \langle B(x_{p+d}, x_{H(p)+d}) \rangle \tag{6}
\]

It must be noted that, in this case, the larger the value of $H(X_i, X_j)$, the closer the vectors $X_i$ and $X_j$. Similarly, the larger the value of $\langle E_d^B \rangle$, the stronger the nonlinear correlation.
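For scheme (iii) the search changes from minimizing a distance to maximizing a similarity. The following sketch of Eqs. (5) and (6) takes the pairwise similarity as a function argument; `mean_similarity_score` and `toy_score` are hypothetical names, and `toy_score` (+1 for identical symbols, -1 otherwise) is only a stand-in for a BLOSUM62 lookup, which is not reproduced here. As before, 0-based indexing and the restriction of neighbors to segments with a successor are assumptions.

```python
from typing import Callable, Sequence


def mean_similarity_score(x: Sequence, d: int,
                          score: Callable[[object, object], float]) -> float:
    """Eq. (6): mean successor similarity <E_d^B> over best-matching segments."""
    n = len(x)
    if not 1 <= d < n - 1:
        raise ValueError("need 1 <= d < len(x) - 1")
    starts = range(n - d)  # segments whose successor x[p + d] exists
    total = 0.0
    for p in starts:
        # Eq. (5): summed pairwise similarity between segment p and segment j.
        sims = {j: sum(score(x[p + k], x[j + k]) for k in range(d))
                for j in starts if j != p}
        best = max(sims.values())  # larger similarity means closer vectors
        neighbours = [j for j, s in sims.items() if s == best]
        total += sum(score(x[p + d], x[j + d]) for j in neighbours) / len(neighbours)
    return total / (n - d)


def toy_score(a, b) -> float:
    """Stand-in similarity (NOT BLOSUM62): +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


print(mean_similarity_score("ACDACDACDACD", 3, toy_score))  # periodic series
```

Ties are detected with an exact equality test, which is safe for integer-valued scores such as those of a substitution matrix but would need a tolerance for arbitrary floating-point similarities.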

Fig. 3. The average values of $\langle E_d^B \rangle$ versus the embedding dimension $d$ calculated for scheme (iii). (Axes: averaged $\langle E_d^B \rangle$, -1 to -0.7, versus $d$, 5 to 50; curves for the α, β and αβ classes.)

Fig. 4. The reduced overall mean error $\langle R_d \rangle$ versus the embedding dimension $d$ calculated for scheme (iii). (Axes: $\langle R_d \rangle$, 0 to 1, versus $d$, 5 to 50; curves for the α, β and αβ classes.)

Protein sequences corresponding to three different structural classes, α, β and αβ [26], are analyzed separately. The representative protein sequences of the three structural classes are taken from the PDB-select domain sequences with less than 25% identity [27]. In the database, there are 108, 136, and 413 sequences in the α, β, and αβ classes, respectively. For these sequences, the average error $\langle E_d \rangle$ over the ensemble of protein sequences in a structural class is computed as a function of the embedding dimension $d$. These results are shown in Figs. 1–4.
Fig. 1 shows the average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated by using scheme (i), i.e., White's scheme. It can be seen that the average values of $\langle E_d \rangle$ for all the structural classes show no significant deviation

from 0.5 for any embedding dimension $d$. This implies that the protein sequences behave as random ones on average. This is just the conclusion given by White and Jacobs [1].
Fig. 2 shows the average values of $\langle E_d \rangle$ versus the embedding dimension $d$ calculated by using scheme (ii), i.e., Pande's scheme. In this case, the average values of $\langle E_d \rangle$ of all three structural classes show a clear deviation from 0.5 and are around 0.375. This means that the protein sequences represented by scheme (ii) deviate significantly from uniform random sequences with $\langle E_d \rangle = 0.5$. However, this does not imply that protein sequences in Pande's scheme show nonlinear deterministic structures. In fact, in Pande's scheme, the probabilities of occurrence of $H$ and $P$ in protein sequences are not equal to those (0.5) in uniform random sequences. If the probability of each of the 20 amino acids in protein sequences is $1/20$, then the probabilities of occurrence of $H$ and $P$ are $3/4$ and $1/4$, respectively, in Pande's scheme, since 15 amino acids are hydrophobic and only 5 amino acids are hydrophilic. Therefore, for an uncorrelated random series with $p(H) = 3/4$ and $p(P) = 1/4$, the mean error is $\langle E_d \rangle = (3/4)[1 - (3/4)] + (1/4)[1 - (1/4)] = 0.375$. Fig. 2 indeed shows that the average values of $\langle E_d \rangle$ for protein sequences of the three structural classes are around 0.375. Furthermore, the averaged values of $\langle E_d \rangle$ for the three structural classes are separated from each other, and those for the αβ class lie between those for the α and β classes. It seems that the protein sequences of the β class are more regular than those of the other two classes.
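The random-sequence baseline $\sum_{\{a\}} p(a)[1 - p(a)]$ used in this argument is easy to check numerically. The helper names `random_baseline` and `composition_baseline` below are hypothetical; the second one estimates the symbol probabilities from the observed composition of a sequence, which is an assumption about how one might apply the formula to real data.

```python
from collections import Counter
from typing import Iterable, Sequence


def random_baseline(probs: Iterable[float]) -> float:
    """Expected <E_d> for an uncorrelated series with the given symbol probabilities."""
    return sum(p * (1.0 - p) for p in probs)


def composition_baseline(seq: Sequence) -> float:
    """Same baseline, with probabilities estimated from the sequence composition."""
    n = len(seq)
    return sum((c / n) * (1.0 - c / n) for c in Counter(seq).values())


print(random_baseline([0.5, 0.5]))    # uniform binary alphabet, scheme (i)
print(random_baseline([0.75, 0.25]))  # p(H) = 3/4, p(P) = 1/4, scheme (ii)
print(composition_baseline("HHHP"))   # toy sequence with the same 3:1 composition
```

Because the baseline depends only on the symbol frequencies, a value away from 0.5 by itself says nothing about correlations along the chain.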
Fig. 3 shows the average values of $\langle E_d^B \rangle$ versus the embedding dimension $d$ calculated by using scheme (iii). Our task was to identify average values of $\langle E_d^B \rangle$ for a given class of sequences with magnitudes larger than one would expect in randomly selected sequences. However, in this case, it is difficult to give the values of $\langle E_d^B \rangle$ for a random sequence analytically. We therefore needed to construct a scale on which to measure the size of $\langle E_d^B \rangle$. We did this as follows. For every sequence in a structural class, we generated an ensemble of 1000 sequences, each with a composition identical to that of the actual protein sequence but with a randomly permuted ordering of amino acids. For each such random sequence, we calculated the values of $\langle E_d^B \rangle$ versus the embedding dimension $d$. From these, we generated an average value $\langle r_d \rangle$ of $\langle E_d^B \rangle$ for each $d$ over the ensemble and a standard deviation $\sigma\{r_d\}$. These values, together with $\langle E_d^B \rangle$ for the actual protein sequence, made it possible to generate reduced values of $\langle E_d^B \rangle$ as a function of $d$:

\[
R_d = \frac{\langle E_d^B \rangle - \langle r_d \rangle}{\sigma\{r_d\}} \tag{7}
\]

The quantity defined in Eq. (7) gives the deviation of $\langle E_d^B \rangle$ for the actual protein sequence from its random ensemble average, measured in units of the standard deviation. Thus, large positive values indicate nonlinear correlation coefficients that are significantly larger than those measured for random sequences. These are the signals we sought. The averaged nonlinear correlation coefficient $\langle R_d \rangle$ is an average of $R_d$ over the ensemble of protein sequences in a structural class. The results are shown in Fig. 4. Furthermore, we defined a significant nonlinear correlation coefficient as one for which $\langle R_d \rangle \ge 1.0$. As shown in Fig. 4, none of the $\langle R_d \rangle$ values for the three structural classes shows a significant deviation from random sequences. This again implies that the protein sequences behave as random sequences on average and show no clear nonlinear deterministic structures. However, for $5 < d < 20$, the values of $\langle R_d \rangle$ for all three structural classes are larger than at other values of $d$ and, in particular, they are close to 1.0 for the α and β classes. Because $\langle R_d \rangle$ is an average over one structural class, this suggests that some individual protein sequences of the α and β classes may have certain nonlinear deterministic structures for this range of $d$.
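The reduced value of Eq. (7) is a shuffle-based standard score and can be sketched independently of the statistic being tested. In the sketch below, `reduced_value` and `adjacent_identity` are hypothetical names; `adjacent_identity` is a deliberately simple stand-in for $\langle E_d^B \rangle$ so that the block stays self-contained, and the use of the population standard deviation and a fixed random seed are assumptions rather than details given in the text.

```python
import random
import statistics
from typing import Callable, Sequence


def reduced_value(seq: Sequence,
                  statistic: Callable[[Sequence], float],
                  n_shuffles: int = 1000,
                  seed: int = 0) -> float:
    """Eq. (7): (observed - shuffled mean) / shuffled standard deviation."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    observed = statistic(seq)
    symbols = list(seq)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(symbols)  # random permutation with identical composition
        shuffled.append(statistic(symbols))
    mean_r = statistics.mean(shuffled)
    sd_r = statistics.pstdev(shuffled)  # assumption: population SD of the ensemble
    if sd_r == 0:
        raise ValueError("shuffled ensemble has zero variance")
    return (observed - mean_r) / sd_r


def adjacent_identity(seq: Sequence) -> float:
    """Stand-in statistic: fraction of neighbouring positions with equal symbols."""
    return sum(a == b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)


z = reduced_value("A" * 10 + "B" * 10, adjacent_identity, n_shuffles=200)
print(z)  # strongly ordered toy sequence, so z should be well above zero
```

To follow the text more closely, one would pass a function computing $\langle E_d^B \rangle$ for a fixed $d$ as `statistic` and then average the resulting $R_d$ over all sequences of a structural class.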
In conclusion, we did not find any significant deterministic structures in the protein sequences on average from the calculations based on all three schemes. Furthermore, the method used here is very simple and enables us to clarify the controversy in the previous investigations. Our results show that the controversy may be due to the use of different schemes for representing amino acids. Although the protein sequences do not behave as uniform random sequences in Pande's scheme, they still behave as random sequences. Using the more sophisticated BLOSUM matrix as the measure of the distance between amino acids, we again did not find significant evidence of nonlinear correlations in the protein sequences. These results raise important questions about how a random sequence can fold into a spatial structure with certain regularity and how a random sequence can encode its structural information. Although the analysis of nonlinear deterministic structures using the schemes above shows that the protein sequences behave as random sequences on average, it does not preclude the possibility that some protein sequences have deterministic structures, or that protein sequences encode their structural information in other ways and under different schemes.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant nos. 10175023 and 90103031.

References

[1] White SH, Jacobs RE. Biophys J 1990;57:911.
[2] Li XQ, Luo LF. Acta Sci Nat Univ Intram 1992;23:534.
[3] Johnson MS, Overington JP. J Mol Biol 1993;233:716.
[4] Cohen C, Parry DAD. Science 1994;263:488.
[5] Shakhnovich EI. Phys Rev Lett 1994;72:3907.
[6] Hobohm U, Sander C. J Mol Biol 1995;251:390.
[7] Rahman RS, Rackovsky S. Biophys J 1995;68:1591.
[8] Pande VS et al. Proc Natl Acad Sci USA 1994;91:12972.
[9] Eisenberg D et al. Proc Natl Acad Sci USA 1984;81:140.
[10] Herzel H, Grosse I. Physica A 1995;216:518.
[11] Herzel H, Grosse I. Phys Rev E 1997;55:800.
[12] Weiss O, Herzel H. J Theor Biol 1998;190:341.
[13] Weiss O, Herzel H. Zeitschr Phys Chem 1998;204:183.
[14] Rackovsky S. Proc Natl Acad Sci USA 1995;81:140.
[15] Mandell AJ et al. J Stat Phys 1998;93:673.
[16] Mandell AJ et al. Physica A 1997;244:254.
[17] Chechetin VR, Lobzin VV. J Theor Biol 1999;198:197.
[18] Korotkova MA et al. J Mol Model 1999;5:103.
[19] Kantz H, Schreiber T. Nonlinear time series analysis. Cambridge: Cambridge University Press; 1997.
[20] El Naschie MS, Kapitaniak T. Phys Lett A 1990;147:275.
[21] El Naschie MS. J Phys Soc Jpn 1989;58:4310.
[22] El Naschie MS, Al Athel S. Z Naturforsch 1989;44a:645.
[23] Ebeling W, El Naschie MS. Chaos, Solitons & Fractals 1994;4 (Special Issue).
[24] El Naschie MS. Chaos, Solitons & Fractals 1998;9:135.
[25] Barral J et al. Phys Rev E 2000;61:1812.
[26] Branden C, Tooze J. Introduction to protein structure. 2nd ed. New York: Garland Publishing; 1999.
[27] The database of domain sequences used here can be obtained by going to: http://www.cmbi.kun.nl/gv/pdbsel/.
