
Physica Scripta

Phys. Scr. 94 (2019) 125006 (13pp) https://doi.org/10.1088/1402-4896/ab3739

White noise functional integral for exponentially decaying memory: nucleotide distribution in bacterial genomes

Renante R Violanda1, Christopher C Bernido1,2,3 and M Victoria Carpio-Bernido1,2

1 Department of Physics, University of San Carlos, Cebu City 6000, Philippines
2 Research Center for Theoretical Physics, Central Visayan Institute Foundation, Jagna, Bohol 6308, Philippines

E-mail: cbernido.cvif@gmail.com

Received 5 March 2019, revised 22 July 2019


Accepted for publication 31 July 2019
Published 11 September 2019

Abstract
We utilize a stochastic functional integral approach that forms a natural framework for analyzing
ubiquitous complex sequences of fluctuations with underlying non-Markovian stochastic process
beyond fractional Brownian motion. We demonstrate how Hida white noise calculus, guided by
mean square deviation (MSD) analysis of empirical data, allows derivation of single nucleotide
occurrence probability distributions for whole genomes of four significant species of bacteria:
(a) freshwater cyanobacteria Synechococcus elongatus PCC7942, 2.7 Mbp, (b) marine
cyanobacteria Prochlorococcus marinus subsp. marinus str. CCMP1375, 1.8 Mbp,
(c) pathogenic bacteria Staphylococcus aureus subsp. aureus NCTC 8325, 2.8 Mbp, and
(d) Staphylococcus aureus ILRI Eymole1/1, 2.9 Mbp. Here, the stochastic variable is chosen
to represent separation distances between succeeding identical single nucleotides where distance
is defined as the number of steps through intervening bases. The stochastic parameter set takes
values of nucleotide occurrence count along the genome length. The probability density function
(PDF) is derived in closed form for the associated stochastic process with exponentially damped
memory kernel, and is shown to satisfy a modified diffusion equation with a parameter-
dependent diffusion coefficient. The PDF yields an analytical result for MSDs that match
empirical plots, showing a rising nonlinear curve that flattens to a plateau starting close to 1 kb,
similar to restricted diffusion. The plots exhibit compliance with Chargaff’s second parity rule
for nucleotides. The same PDF describes occurrences of single nucleotides adenine, guanine,
cytosine, and thymine for all four bacterial genomes considered.
Keywords: non-Markovian stochastic process, white noise calculus, DNA sequence, S.
elongatus PCC7942, P. marinus subsp. marinus str. CCMP1375, S. aureus subsp. aureus NCTC
8325, S. aureus ILRI Eymole1/1

(Some figures may appear in colour only in the online journal)

3 Author to whom any correspondence should be addressed.

0031-8949/19/125006+13$33.00 © 2019 IOP Publishing Ltd Printed in the UK

1. Introduction

The evolution of a species arising from mutations, genetic drift, and adaptation to environmental changes, among others, is recorded in changes that occur in the sequence of nucleotides that make up the DNA [1–3]. To determine the diversity or similarity of species at the DNA level, the genome can be viewed as a collection of molecules, or nucleotides, interacting with each other that form a highly structured DNA sequence that encodes information. In this paper, we extract physical information from this highly organized collection of molecules using stochastic analysis which has been
used in physics to understand many biopolymers. We use the Hida white noise functional integral [4] which is an especially natural framework for treating stochastic processes with various forms of memory kernels and probability measure weight functions [5, 6]. The functional integral approach which is basically a Green function integral approach readily yields the probability density function (PDF) in closed form. With the PDF, significant physical quantities could then be derived such as moments, variances, correlations and first passage times.

Figure 1. Distances between adjacent adenine A nucleotides are marked $x_i$ with $i = 1, 2, \ldots, N - 1$, where $N$ is the total number of adenines in a genome. Distances are determined by the number of steps through intervening non-adenine nucleotides such that $x_1 = 4$, $x_2 = 2$, $x_3 = 5$.

Here we demonstrate that the Hida white noise integral approach technique is handy for investigating genomes or sections thereof. Various stochastic and computational models have been used to reveal patterns in DNA sequences not only to test similarities or differences between species, but more importantly to understand how information is encoded in a genome [7–15]. However, it remains a tedious endeavor to distill information from the complexity of differences in nucleotide distributions along genomes and alleles even for the same species. In this work, we focus on the location and variable separation distances between two nearest-neighbor identical single nucleotides in the whole genome of four bacterial species: (a) freshwater cyanobacteria Synechococcus elongatus PCC7942, (b) marine cyanobacteria Prochlorococcus marinus subsp. marinus str. CCMP1375, 1.8 Mb, (c) pathogenic bacteria Staphylococcus aureus subsp. aureus NCTC 8325, 2.8 Mb, and (d) Staphylococcus aureus ILRI Eymole1/1, 2.9 Mb. The unicellular cyanobacteria S. elongatus has been studied as a cyanobacterial clock [3, 16] for exhibiting circadian rhythms and has recently been shown to protect rats from heart attacks aided by its photosynthetic ability [17]. S. elongatus PCC7942 has a length of approximately 2 700 000 bp with a GC content of 55.5%. The marine cyanobacteria Prochlorococcus plays an important role in the global carbon balance [18]. P. marinus subsp. marinus str. CCMP1375 has 1 751 080 bp with 1882 coding genes and 94 non-coding genes. Common infections are caused by the pathogenic bacteria S. aureus. Moreover, S. aureus subsp. aureus NCTC 8325 is an exemplar for strains used in genetic manipulation. S. aureus ILRI Eymole1/1 has 2 874 302 bp with a GC-content of 32.88% [19].

We investigate, in particular, the distribution of a given nucleotide in a DNA sequence. Mean square displacement (MSD) analysis for the fluctuations in distribution of single nucleotides along the bacterial genomes lends insight for the choice of underlying stochastic process for the system. The genome-wide patterns in MSD plots generated deviate from linearity pointing to non-Markovian stochasticity. Evaluation of the expectation of the constrained stochastic variable over Hida white noise probability measure yields analytical results for the MSD that match empirical plots, showing a rising nonlinear curve that flattens to a plateau starting close to 1 kb, and following Chargaff’s second parity rule for nucleotides. The same PDF describes occurrences of single nucleotides adenine (A), guanine (G), cytosine (C), and thymine (T) for all four bacterial genomes considered.

In section 2, we present relevant essential points of the stochastic integral approach for processes with memory using Hida white noise calculus [4]. Note that, in the literature, noise may refer to random fluctuations of a signal. In this paper, the term white noise does not refer to noise external to the system. We use the term white noise as a mathematical description [4] of the fluctuating separation distances between similar nucleotides in a DNA sequence. For the bacterial species considered, the behavior of the fluctuating nucleotide separation distances turns out to be aptly characterized by a stochastic white noise model with memory which provides new insight into how a specific nucleotide is distributed in a genome. This is followed by section 3 on the bacterial genome data, sources, and computational work done for empirical MSD analysis. Sections 4 and 5 present the main results and discussion, respectively.

2. White noise functional integral approach

In terms of the Hida white noise variable $\omega(s)$, defined as the derivative of ordinary Brownian motion $B(s)$, i.e. $\omega(s) = dB(s)/ds$ [4], we will consider the stochastic process given by

$$x(L) = x_0 + \sqrt{bc} \int_0^L \exp\left[-\frac{b}{2}(L - s)\right] \omega(s)\, ds, \tag{1}$$

where $b, c > 0$ are constants. The stochastic variable $x(L)$ represents fluctuating separation distances between succeeding identical single nucleotides where distance is defined in terms of steps through the intervening bases as shown in figure 1.

The stochastic parameter set takes values of nucleotide occurrences along the whole genome length, typically over a million base pairs, so that we take $0 \le s \le L$ with $L = N - 1$, where $N$ is the total number of the single nucleotide present in the genome strand. In equation (1), $\exp[-(b/2)(L - s)]$ is a memory kernel characteristic of the system. The initial value of the process is fixed at $x_0$. However, from equation (1), we see that the fluctuating variable can take any value in the domain when the parameter takes the value, $s = L$. We thus introduce the constraint fixing the endpoint $x(L) = x_L$, using the Donsker delta functional, $\delta(x(L) - x_L)$, the stochastic distributional analog of the Dirac delta function [20–22]. The delta functional constraint takes as nonvanishing paths only those with terminal values at the specified point $x_L$. The PDF $P(x_L, L; x_0, 0)$ for fluctuations ending at $x_L$ with initial value at $x_0$ is the expectation

$$P(x_L, L; x_0, 0) = \int \delta(x(L) - x_L)\, d\mu(\omega). \tag{2}$$
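As an illustration of how equation (1) generates fluctuations, the following is a minimal sketch (not the authors' code) that discretizes the white noise integral into Brownian increments and samples the endpoint $x(L)$ many times. The step size `ds`, the number of samples, and the choice $L = 500$ are assumptions made for the example; $b$ and $c$ are taken from the base T column of table 1. The reference value used for comparison follows from the Itô isometry applied to the kernel, $bc \int_0^L \exp[-b(L - s)]\, ds = c[1 - \exp(-bL)]$.

```python
import math
import random


def simulate_endpoint(L, b, c, x0=0.0, ds=1.0, rng=random):
    """Draw one sample of x(L) from a discretized version of equation (1).

    The white noise integral is replaced by a sum over Brownian increments
    dB ~ Normal(0, ds); the step size ds is an assumption of this sketch.
    """
    n_steps = int(round(L / ds))
    amplitude = math.sqrt(b * c)
    x = x0
    for j in range(n_steps):
        s = (j + 0.5) * ds  # midpoint of the j-th parameter interval
        dB = rng.gauss(0.0, math.sqrt(ds))
        x += amplitude * math.exp(-0.5 * b * (L - s)) * dB
    return x


def sample_variance(samples):
    """Unbiased sample variance of a list of numbers."""
    mean = sum(samples) / len(samples)
    return sum((v - mean) ** 2 for v in samples) / (len(samples) - 1)


def kernel_variance(L, b, c):
    """Variance implied by the kernel: bc * int_0^L exp[-b(L - s)] ds."""
    return c * (1.0 - math.exp(-b * L))


if __name__ == "__main__":
    rng = random.Random(0)           # fixed seed, for reproducibility only
    L, b, c = 500, 0.00354, 0.62     # b, c: table 1 (base T); L: illustrative
    samples = [simulate_endpoint(L, b, c, rng=rng) for _ in range(2000)]
    print("sampled variance:", sample_variance(samples))
    print("kernel variance :", kernel_variance(L, b, c))
```

The two printed numbers should agree to within Monte Carlo error, which shrinks as the number of sampled endpoints grows.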


Figure 2. Separation distances between a nucleotide base and the identical single nucleotide immediately preceding it for S. elongatus
PCC7942. The occurrence number n is the nth nucleotide base of the same type encountered in the DNA sequence.

In equation (2), $d\mu(\omega) = N_\omega \exp\left[-\frac{1}{2}\int \omega(\tau)^2\, d\tau\right] d^\infty\omega$ is the Gaussian white noise probability measure where the exponential is responsible for the Gaussian fall-off and $N_\omega$ is a normalization factor [4]. The $d\mu(\omega)$ is defined by the characteristic functional

$$\int_{S'(\mathbb{R})} \exp\left[i \int_0^L \omega(s)\, \epsilon(s)\, ds\right] d\mu(\omega) = \exp\left[-\frac{1}{2} \int_0^L \epsilon^2\, ds\right], \tag{3}$$

where $\epsilon(s) \in S(\mathbb{R})$ with the Gel’fand triple, $S(\mathbb{R}) \subset L^2(\mathbb{R}) \subset S'(\mathbb{R})$. Here $S(\mathbb{R})$, $L^2(\mathbb{R})$, and $S'(\mathbb{R})$ are the Schwartz space of test functions, the Hilbert space of square integrable functions, and the space of tempered distributions, respectively [4].

To evaluate equation (2), we write the Donsker delta functional in terms of its Fourier representation to get

$$P(x_L, L; x_0, 0) = \frac{1}{2\pi} \int \int_{-\infty}^{+\infty} \exp\{ik[x(L) - x_L]\}\, dk\, d\mu(\omega). \tag{4}$$

Using equation (1) for $x(L)$ we have

$$P(x_L, L; x_0, 0) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} dk\, \exp\{ik[x_0 - x_L]\} \int \exp\left\{ik\sqrt{bc} \int_0^L \exp\left[-\frac{b}{2}(L - s)\right] \omega(s)\, ds\right\} d\mu(\omega). \tag{5}$$

We now write, $\epsilon(s) = k\sqrt{bc}\, \exp[-(b/2)(L - s)]$, $\epsilon \in S(\mathbb{R})$, and integration over $d\mu(\omega)$ can be done using the characteristic functional equation (3). With equation (5), the PDF can then be written as

$$P(x_L, L; x_0, 0) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \exp\{ik[x_0 - x_L]\} \exp\left\{-\frac{1}{2} k^2 bc \int_0^L \exp[-b(L - s)]\, ds\right\} dk. \tag{6}$$

We also remark that it is the pairing between the dual spaces through the triple $S(\mathbb{R}) \subset L^2(\mathbb{R}) \subset S'(\mathbb{R})$, defined as the bilinear extension of the inner product on $L^2$, that facilitates the treatment of memory functions in the stochastic integral. In equation (6), the integral over $dk$ is a Gaussian integral which yields

$$P(x_L, L; x_0, 0) = \frac{1}{\sqrt{2\pi c\,[1 - \exp(-bL)]}} \exp\left\{\frac{-(x_L - x_0)^2}{2c\,[1 - \exp(-bL)]}\right\}, \tag{7}$$

where integration over $ds$ has been carried out. The PDF of equation (7) can be shown to satisfy a modified diffusion equation with a parameter-dependent diffusion coefficient similar to other stochastic processes with memory [23].

With equation (7), the MSD can be calculated, $\mathrm{MSD} = \langle (x - \langle x \rangle)^2 \rangle$, where brackets $\langle \ldots \rangle$ denote an average.
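The chain from equation (6) to equation (7) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it evaluates the $k$-integral of equation (6) with a trapezoidal rule and compares it with the closed form of equation (7), then tests the diffusion-equation property by finite differences. The cutoff `k_max`, the grid sizes, and the step sizes `hx` and `hL` are arbitrary choices. The coefficient $D(L) = (bc/2)\exp(-bL)$ used in the residual is obtained here as half the $L$-derivative of the variance $c[1 - \exp(-bL)]$.

```python
import math


def variance(L, b, c):
    """sigma^2(L) = bc * int_0^L exp[-b(L - s)] ds = c[1 - exp(-bL)]."""
    return c * (1.0 - math.exp(-b * L))


def pdf(xL, L, b, c, x0=0.0):
    """Closed-form Gaussian PDF of equation (7)."""
    var = variance(L, b, c)
    return math.exp(-(xL - x0) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def pdf_from_fourier(xL, L, b, c, x0=0.0, k_max=40.0, n=20000):
    """Trapezoidal evaluation of the k-integral in equation (6)."""
    var = variance(L, b, c)
    dk = 2.0 * k_max / n
    total = 0.0
    for i in range(n + 1):
        k = -k_max + i * dk
        weight = 0.5 if i in (0, n) else 1.0
        # The imaginary part cancels by symmetry, so only the cosine remains.
        total += weight * math.cos(k * (x0 - xL)) * math.exp(-0.5 * k * k * var)
    return total * dk / (2.0 * math.pi)


def diffusion_residual(x, L, b, c, x0=0.0, hx=1e-3, hL=1e-2):
    """dP/dL - D(L) * d2P/dx2 by central differences; expected to be near 0."""
    dP_dL = (pdf(x, L + hL, b, c, x0) - pdf(x, L - hL, b, c, x0)) / (2.0 * hL)
    d2P_dx2 = (pdf(x + hx, L, b, c, x0) - 2.0 * pdf(x, L, b, c, x0)
               + pdf(x - hx, L, b, c, x0)) / hx ** 2
    D = 0.5 * b * c * math.exp(-b * L)  # half the L-derivative of the variance
    return dP_dL - D * d2P_dx2


if __name__ == "__main__":
    b, c, L, x = 0.00354, 0.62, 500.0, 0.3   # b, c from table 1 (base T)
    print("equation (7):", pdf(x, L, b, c))
    print("equation (6):", pdf_from_fourier(x, L, b, c))
    print("residual    :", diffusion_residual(x, L, b, c))
```

For small $L$ the variance is small and the integrand in equation (6) decays slowly in $k$, so `k_max` would need to be increased accordingly.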


This leads to

$$\mathrm{MSD} = c\,[1 - \exp(-bL)]. \tag{8}$$

For the genomic analysis, it is convenient to uniformly shift the graph of MSD versus $L$ for equation (8) without changing its shape by adding a constant interval such as $(a - c) > 0$. In particular, we have, $\mathrm{MSD}_a = \mathrm{MSD} + (a - c)$, where the shifted $\mathrm{MSD}_a$ can then be written as

$$\mathrm{MSD}_a = a - c\, e^{-bL}. \tag{9}$$

Equations (7) and (9) are the key relations that we apply to the bacterial genomes.

We note that the stochastic process and corresponding MSD, equation (8), discussed in this paper have some similarity with the direct tempering model of Meerschaert and Sabzikar [24, 25] as discussed in reference [24]. The difference lies in that the starting point, (equation (64), in [24]) is Mandelbrot’s definition of fractional Brownian motion where an exponential truncation factor, $\exp[-\lambda(t - t')]$, is added with $\lambda$ as the truncation parameter. For $\lambda \to 0$ the classical fractional Brownian motion is obtained. In the long time limit $t \gg \lambda^{-1}$ and $H = 1/2$, the MSD, equation (69) of [24], reduces to the form, $(\sigma^2/\lambda)(1 - e^{-\lambda t})$ which is then similar to equation (8) above for $c = \sigma^2/\lambda$ and $b = \lambda$. It is observed that the Meerschaert and Sabzikar model does not lead to ordinary Brownian motion for $H = 1/2$ due to the added exponential factor, $\exp[-\lambda(t - t')]$. Note that time $t$ in [24] corresponds to our occurrence number $L$.

Figure 3. A magnified view of figure 2 for the 400th–500th identical single nucleotide occurring in the DNA sequence of S. elongatus PCC7942.

3. Data and methods

DNA sequences were obtained from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) for the whole genomes of: (a) Synechococcus elongatus PCC7942, 2.7 Mb, (b) Prochlorococcus marinus subsp. marinus str. CCMP1375, 1.8 Mb, (c) Staphylococcus aureus subsp. aureus NCTC 8325, 2.8 Mb, and (d) Staphylococcus aureus ILRI Eymole1/1, 2.9 Mb. For a given genome, separation distances between adjacent identical single nucleotides were evaluated with the distances determined by the number of intervening nucleotides as depicted in figure 1. The varying distances generate a graph with high density of fluctuations as shown in figure 2 of section 4 (Results). The numbers in the horizontal axis of figure 2 are the sequential occurrences of the selected single nucleotide such as adenine A along the length of the genome. The same procedure is done for single nucleotides G, C, and T for each bacterial genome.

To compute the empirical MSD, any given nucleotide in the circular DNA sequence of a bacteria, say G, can be taken as starting point corresponding to $s = 0$. One then evaluates, $\langle [x(s + \Delta s) - x(s)]^2 \rangle$, with $s$ taking successively increasing values of occurrence numbers. The first point in the empirical MSD plot is for $\Delta s = 1$, i.e. nearest-neighbor G nucleotides. The second MSD point is for $\Delta s = 2$ which takes the difference of values for two next-to-nearest nucleotide G, and so on for increasing values of $\Delta s$. The MSD of the separation


distances is then generated for each nucleotide in the whole genome as shown in figures 4–7 of the following section.

4. Results

4.1. Synechococcus elongatus PCC7942

Following the procedure for determining distances between similar nucleotides depicted in figure 1, the results for the varying distances between two successive T nucleotides in the genome of S. elongatus PCC7942 are shown in figure 2. The occurrence number is the number of times a T nucleotide is encountered sequentially as one runs along the length of a genome. Hence, the ninth T nucleotide encountered in a sequence is assigned an occurrence number of 9. Similar graphs for distance versus occurrence number for nucleotides G, A, and C are also shown in figure 2. A magnified view of the fluctuating values of separation distances versus occurrence number is shown in figure 3.

The MSD for the fluctuating separation distances (figures 2 and 3) of nucleotides A, G, C, and T for S. elongatus PCC7942 are shown in figure 4. Note how the best fit curve MSD coincides with the analytical MSD (red curve in figure 4) given by equation (9) for parameter values given in table 1.

Figure 4. MSD for T, A, G, and C nucleotide bases of S. elongatus PCC7942. Empirical MSD (blue dots); theoretical fit (red curve) given by equation (9).

Table 1. Values of parameters in equation (9) that give a good match between theoretical MSD (red curve, figure 4) and genome-based MSD (blue dots, figure 4) for S. elongatus PCC7942.

      Base T     Base A     Base G     Base C
a     32.2       31.85      16.63      16.69
b     0.00354    0.00332    0.00312    0.00275
c     0.62       0.59       0.24       0.27

Table 2. Values of parameters in equation (9) that give a good match between theoretical MSD (red curve, figure 5) and genome-based MSD (blue dots, figure 5) for P. marinus subsp. marinus str. CCMP1375.

      Base T     Base A     Base G     Base C
a     15.82      15.59      53.92      53.93
b     0.00296    0.00424    0.00712    0.00833
c     0.18       0.19       1.28       1.15

4.2. Prochlorococcus marinus subsp. marinus str. CCMP1375

Following the same procedure as with S. elongatus, we obtain the MSD of the separation distances for nucleotides A, G, C,


and T for the genome of P. marinus subsp. marinus str. CCMP1375, with parameter values in table 2. The MSD plots are shown in figure 5.

Figure 5. MSD for T, A, G, and C nucleotide bases of P. marinus subsp. marinus str. CCMP1375. Empirical MSD (blue dots); theoretical fit (red curve) given by equation (9).

4.3. Staphylococcus aureus subsp. aureus NCTC 8325

The MSD of the separation distances similar to figures 2 and 3 for nucleotides A, G, C, and T for the bacterial species S. aureus subsp. aureus NCTC 8325 are obtained for parameter values given in table 3. The corresponding MSD plots are shown in figure 6.

Table 3. Values of parameters in equation (9) that give a good match between theoretical MSD (red curve, figure 6) and genome-based MSD (blue dots, figure 6) for S. aureus subsp. aureus NCTC 8325.

      Base T     Base A     Base G     Base C
a     12.63      13.30      65.05      63.82
b     0.0027     0.00159    0.00246    0.00216
c     0.18       0.19       0.89       0.90

4.4. Staphylococcus aureus ILRI Eymole1/1

The MSD of the separation distances (similar to figures 2 and 3) for nucleotides A, G, C, and T for the genome of S. aureus ILRI Eymole1/1 are plotted with parameter values given in table 4. The MSD plots are shown in figure 7.

5. Discussion

As shown in figures 4–7, a good match between theoretical and genome-based MSD is obtained for each of the four bacterial genomes. This indicates that varying separation distances between neighboring nucleotides of the same type are described by a non-Markovian stochastic process characterized by equations (7)–(9). The theoretical MSD calculated as an ensemble average closely matches the empirically derived MSD especially at large occurrence numbers (see figures 4–7). Noting that increasing occurrence numbers correspond to the role of increasing time for fluctuations measured in a time series, one could say that the system is ergodic for large occurrence numbers. For small occurrence numbers, however, a slight deviation is observed between the theoretical and empirical MSD. A non-ergodic behavior is therefore manifested for small occurrence numbers, or when a series of similar nucleotides are relatively close to each other. We now discuss the significance of parameters a, b, and c of equation (9), as well as the PDF obtained for the different genomes.
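The comparison between genome-based and theoretical MSD relies on the procedure of section 3. A minimal sketch of that procedure is given below; it is illustrative only and is not the code used for the figures. It assumes a linear (not circular) sequence held as a Python string, and it counts distance as the difference of positions, which is one reading of figure 1 (a constant offset in the distance convention does not change the MSD, since only differences of distances enter). The toy sequence and the function names are hypothetical.

```python
import math


def separation_distances(seq, base):
    """Distances between successive occurrences of `base`, in steps along seq."""
    positions = [i for i, ch in enumerate(seq) if ch == base]
    return [q - p for p, q in zip(positions, positions[1:])]


def empirical_msd(x, max_lag):
    """Average of [x(s + lag) - x(s)]^2 over s, for lag = 1 .. max_lag."""
    msd = []
    for lag in range(1, max_lag + 1):
        squares = [(x[s + lag] - x[s]) ** 2 for s in range(len(x) - lag)]
        msd.append(sum(squares) / len(squares))
    return msd


def msd_model(L, a, b, c):
    """Shifted theoretical MSD of equation (9): a - c * exp(-b L)."""
    return a - c * math.exp(-b * L)


if __name__ == "__main__":
    toy = "ACGTAGATTCGA"                 # hypothetical 12-base sequence
    x = separation_distances(toy, "A")
    print(x)                             # [4, 2, 5], as in figure 1
    print(empirical_msd(x, 2))           # [6.5, 1.0]
    # Theoretical curve with the table 1 parameters for base T.
    for L in (1, 100, 1000, 5000):
        print(L, msd_model(L, 32.2, 0.00354, 0.62))
```

For a whole genome, `x` has of the order of several hundred thousand entries per base, so a vectorized implementation would be preferable to the plain loops used here for clarity.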


Figure 6. MSD for A, G, C, and T nucleotide bases of S. aureus subsp. aureus NCTC 8325. Empirical MSD (blue dots); theoretical fit (red
curve) given by equation (9).

Table 4. Values of parameters in equation (9) that give a good match between theoretical MSD (red curve, figure 7) and genome-based MSD (blue dots, figure 7) for S. aureus ILRI Eymole1/1.

      Base T     Base A     Base G     Base C
a     12.42      13.54      70.46      58.61
b     0.00289    0.00179    0.00253    0.00192
c     0.20       0.19       0.93       0.92

5.1. Memory parameter

Although occurrence distances of all nucleotides considered are influenced by the same memory function of the form $\exp[-b(L - s)/2]$, as shown in equation (1), the effect of the memory function differs with values of $b > 0$ ranging from 0.00159 to 0.00833 (see tables 1–4). Smaller values of $b$ diminish memory effects. This may be seen by expanding the memory function, i.e.

$$\exp[-b(L - s)/2] = 1 - \frac{b}{2}(L - s) + \frac{b^2}{8}(L - s)^2 - \frac{b^3}{48}(L - s)^3 + O(b^4) + \cdots, \tag{10}$$

where terms with higher powers of $b$ are negligible with $b \ll 1$. The $b$ may then be referred to as a memory parameter. Likewise, by expanding the exponential in the PDF and MSD, equations (7) and (8), respectively, i.e. $\exp(-bL) \approx 1 - bL + (b^2 L^2/2!) + \cdots$, one sees that the stochastic process approaches the mathematical form of ordinary Brownian motion when $b \ll 1$. The only physical difference is that the ‘step length’ or distances between succeeding similar nucleotides can vary (see, also, figure 9). Note that here, $L$ is the occurrence number which corresponds to time $T$ in discussions of Brownian fluctuations.

5.2. Chargaff’s second parity rule

For the four species of bacteria considered, the difference between parameters $a$ and $c$, i.e. $(a - c)$, can be used to test the second parity rule of Chargaff [26]. This parity rule states that for a single DNA strand, the frequency of appearance of nucleotide A is equal to that of T, and the same holds for the frequency of appearance of C which is equal to that of G. Specifically, for a given genome (see tables 1–4), whenever the parameters $(a - c)$ of A and T have approximately the same values, then Chargaff’s second rule applies. The same situation holds for G and C. Explicitly, consider table 1 for S. elongatus PCC7942 for which values of $(a - c)$ are summarized in table 5. The corresponding graph of MSD versus occurrence number for S. elongatus PCC7942 is shown in figure 8 where the frequency of appearance of A is closely matched by T as reflected in their MSD. The MSD of G and C likewise imply similar frequencies of appearance.
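The parity test described above amounts to simple arithmetic on the fitted parameters. The sketch below recomputes $(a - c)$ from the table 1 values for S. elongatus PCC7942 and compares the A/T and G/C pairs. The helper names and the use of a relative gap as the comparison measure are assumptions of this illustration; the text itself only asks that the values be approximately equal.

```python
# Fitted (a, b, c) per base for S. elongatus PCC7942, copied from table 1.
TABLE_1 = {
    "T": (32.2, 0.00354, 0.62),
    "A": (31.85, 0.00332, 0.59),
    "G": (16.63, 0.00312, 0.24),
    "C": (16.69, 0.00275, 0.27),
}


def offsets(table):
    """Return (a - c) for every base, rounded to two decimals."""
    return {base: round(a - c, 2) for base, (a, _b, c) in table.items()}


def relative_gap(u, v):
    """Absolute difference divided by the mean of the two values."""
    return abs(u - v) / (0.5 * (u + v))


if __name__ == "__main__":
    d = offsets(TABLE_1)
    print(d)  # {'T': 31.58, 'A': 31.26, 'G': 16.39, 'C': 16.42}
    print("A vs T gap:", relative_gap(d["A"], d["T"]))
    print("G vs C gap:", relative_gap(d["G"], d["C"]))
```

With these inputs both gaps come out at roughly one percent or less, whereas the gap between the A/T and G/C groups is far larger. The same check can be repeated with the parameters of tables 2–4.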


Figure 7. MSD for A, G, C, and T nucleotide bases of S. aureus ILRI Eymole1/1. Empirical MSD (blue dots); theoretical fit (red curve) given by equation (9).

Table 5. Chargaff’s second parity rule may be verified if (a–c) for A and T have approximately the same values. The same holds for G and C.

S. elongatus PCC7942    T        A        G        C
(a–c)                   31.58    31.26    16.39    16.42

5.3. MSD for large occurrence numbers

As observed in figures 4–7, the MSD plateaus at a roughly horizontal line as occurrence numbers become large. This behavior is similar to confined or restricted diffusion (see, e.g. [27]). The approximate MSD values at large occurrence numbers when the graph becomes approximately flat are given by parameter $a$ as summarized in table 6. We can also compare the genome-based MSD with an MSD arising from purely random arrangement of the four nucleotides. This is shown in figure 9 and reflected in table 6 for comparison. As expected, the genomic nucleotide sequence is far from random.

We note that the randomly distributed A, G, C, and T of figure 9 do not represent an MSD typical of ordinary Brownian motion. In the usual Brownian motion, or random walk, the step lengths are normally equal although directions are unpredictable. In figure 9, the distances between similar nucleotides, or ‘step lengths,’ can be very different from each other.

5.4. Probability density function

We could also extract information from the probability of occurrence of a given nucleotide from the PDF. This may be illustrated in figure 10 where empirical and theoretical graphs (equation (7)) of the PDF are shown as a function of $(x - x_0)$ and the occurrence number $L$.

Both empirical and theoretical graphs in figure 10 show a peak at $x = x_0$. Recall that $x$ represents separation distance of a single nucleotide from an identical nucleotide immediately preceding it in the sequence (see figure 1). Hence, given any nucleotide separated by a distance $x_0$ from the identical neighboring nucleotide, it is most likely that the fluctuation of distance values would end up with a nucleotide characterized by a distance $x = x_0$ from a nucleotide of the same type preceding it.

We note that the PDF, equation (7), is a solution of a modified diffusion equation of the form [6]

$$\frac{\partial P(x, s; x_0, 0)}{\partial s} = \left[\frac{bc}{2 \exp(bs)}\right] \frac{\partial^2 P(x, s; x_0, 0)}{\partial x^2}, \tag{11}$$

with a parameter-dependent diffusion coefficient $D(s)$. Such kinetic equations are of recent interest due to the wide


range of geological, biological and chemical phenomena where time-dependent diffusion coefficients arise (see, e.g. [23, 28–31]).

Matching theory with empirical data at the level of their MSD and PDF does provide a certain level of confidence that a stochastic model indeed properly describes the experimental data. We could not exclude the possibility however that other stochastic models with appropriate approximations and limiting cases such as those considered in [24], could also describe the same experimental data. It would then be important to consider other criteria, such as kurtosis or first passage time properties, to distinguish more precisely and pin down the correct model for given experimental data.

Figure 8. MSD versus occurrence number for nucleotides T (blue), A (red), G (green), and C (black) for S. elongatus PCC7942.

Table 6. MSD values at large occurrence numbers given by parameter a. The last row shows the exact MSD value for randomly placed nucleotides corresponding to figure 9.

MSD at large occurrence numbers
                                            Base T    Base A    Base G    Base C
S. elongatus PCC7942                        32.2      31.85     16.63     16.69
P. marinus subsp. marinus str. CCMP1375     15.82     15.59     53.92     53.93
S. aureus subsp. aureus NCTC 8325           12.63     13.30     65.05     63.82
S. aureus ILRI Eymole1/1                    12.42     13.54     70.46     58.61
Random locations of A, G, C, and T          23        23        23        23

5.5. Autocorrelation function

For a stochastic process described by equation (1), the fluctuations have an autocorrelation function given by, $R_x(L) = (cN_0/2) \exp(-bL/2)$, where white noise $\omega(s)$ has a power spectral density $S_\omega = (N_0/2)$, with $N_0$ a constant. We can then compare this with the autocorrelation based on empirical data.

Figure 11 shows an example of the autocorrelation function of the T-base occurrence distance fluctuation of S. elongatus PCC7942 bacteria, fitted with the theoretical autocorrelation function (solid red curve). We have normalized the theoretical autocorrelation function such that at lag $L = 0$, the $R_x(0) = 1$.
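A comparison of this kind can be sketched as follows. With the normalization $R_x(0) = 1$, the theoretical curve reduces to $\exp(-bL/2)$. The empirical estimator below (mean removed, divided by the variance, averaged over the available pairs at each lag) is a common textbook choice and is an assumption here, since the text does not state which estimator was used for figure 11.

```python
import math


def normalized_theory(lag, b):
    """R_x(lag) / R_x(0) = exp(-b * lag / 2) for the process of equation (1)."""
    return math.exp(-0.5 * b * lag)


def empirical_autocorrelation(x, max_lag):
    """Normalized sample autocorrelation of x for lag = 0 .. max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    var = sum(d * d for d in dev) / n
    result = []
    for lag in range(max_lag + 1):
        pairs = n - lag  # number of products available at this lag
        cov = sum(dev[s] * dev[s + lag] for s in range(pairs)) / pairs
        result.append(cov / var)
    return result


if __name__ == "__main__":
    b = 0.00354  # table 1, base T
    for lag in (0, 100, 500, 1000):
        print(lag, normalized_theory(lag, b))
    # Toy series standing in for a list of separation distances.
    print(empirical_autocorrelation([1.0, 2.0, 3.0, 4.0], 1))
```

Applied to real data, `x` would be the list of separation distances of one base along a genome, and the lag-zero value of the empirical curve equals one by construction.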


Figure 9. MSD versus occurrence number for randomly placed T, A, G, and C nucleotide bases. The graph significantly differs from genome-
based MSD of figures 4–8.

Since $R_x(L > 0) \ll 1$, the graph removes the $R_x(0)$ from the plot to emphasize the exponential decay. Similar graphs are exhibited by the other nucleotides of other bacterial species.

In figure 11, a weak but positive correlation for small lag values is observed with the autocorrelation function dropping close to zero starting around occurrence number 1000. Note that occurrence numbers less than 1000 refer to nucleotides nearer to each other as compared to those with occurrence numbers beyond 1000 where nucleotides have much larger separations. Considering that the initial or starting point of the occurrence number (e.g. the first A in figure 1) could be any nucleotide in the circular genome of the bacteria, a possible source for the positive correlation could be a functionally related group of nucleotides beyond which a drop in correlation may develop. For example, a bacterial gene may consist of around a thousand or more nucleotides depending on its function. Moreover, an operon, which is a cluster of adjacent genes with a common control mechanism [32], may also be responsible for this positive correlation.

6. Conclusion

Predicting patterns in DNA sequences, in general, is made challenging by the high degree of complexity and variability of nucleotide combinations in genomes. With the rapid increase in the number of bacterial genomes being sequenced and the challenge of looking for similarities and differences that cut across diverse bacterial communities [33, 34], a non-Markovian stochastic process as presented in this paper could provide a framework and analytical tool for revealing information encoded in these complex genome sequences. Moreover, knowing the appropriate PDF and the modified diffusion equation it obeys provides a novel perspective and analytical tool. Such a stochastic framework may find future use not only in bacterial taxonomy [33], but also in the design and synthesis of DNA sequences [35, 36].

Here we note that, in evaluating nucleotide separation distances, no distinction has been made yet between coding and non-coding regions of the DNA sequence [9]. Most bacterial genomes have 10%–20% noncoding DNA. It would thus be interesting to apply the stochastic functional integral method to investigate spatial distributions of and correlations between special clusters in genomic strands. It is of interest to determine whether trends and relations between motifs will emerge in MSD plots of spatial distributions of dyads, triads and tetrads of nucleotides. Nonlinear behavior of the MSD could shed light on functional separation distances for related genes clustered together such as an operon. This would be of immediate consequence to studies of genetic maps and patterns in repeated sequences (see, e.g. [37]) and is reserved for future work. Moreover, applications to other biopolymers can be explored. For instance, the sequence of amino acids in a protein and its interaction with a solvent play an important


role in the crystallization of proteins needed to determine protein structures [38–40]. Given a protein, the distribution properties of identified residues significant in a crystallization process can be investigated using the method in this paper.

Figure 10. Probability density function (PDF) as a function of occurrence number $L$ and distance, $x - x_0$, for S. elongatus PCC7942. Top graph (empirical); bottom graph theoretical, equation (7), for $b = 0.00354$; $c = 0.62$.

Recent advances in computational analysis [41] could help deal with a sequence of twenty amino acids comprising a protein instead of sequences of four nucleotides discussed in this paper. Possible reduction could be done if distributions of


‘patches’ or clusters [40] rather than single amino acids are considered.

Figure 11. Autocorrelation function versus occurrence number for the T-base of S. elongatus PCC7942 (Empirical: blue dots; Theoretical: solid red curve). The fluctuation shows a weak but positive correlation at small lag values indicating that distance fluctuation is weakly persistent only for small occurrence numbers.

Acknowledgments

The authors thank Benjamin E Rubin, Rev R Aure, Hyunjin Shim, and Victor Sojo for helpful discussions. R R V wishes to acknowledge support from the Commission on Higher Education.

ORCID iDs

Christopher C Bernido https://orcid.org/0000-0002-9329-214X

References

[1] Shapiro B J, Leducq J-B and Mallet J 2016 What is speciation? PLoS Genet. 12 e1005860
[2] Sojo V, Pomiankowski A and Lane N 2014 A bioenergetic basis for membrane divergence in archaea and bacteria PLoS Biol. 12 e1001926
[3] Rubin B E, Wetmore K M, Price M N, Diamond S, Shultzaberger R K, Lowe L C, Curtin G, Arkin A P, Deutschbauer A and Golden S S 2015 The essential gene set of a photosynthetic organism PNAS 112 E6634–43
[4] Hida T, Kuo H H, Potthoff J and Streit L 1993 White Noise: An Infinite Dimensional Calculus (Dordrecht: Kluwer)
[5] Bernido C C and Carpio-Bernido M V 2012 White noise analysis: some applications in complex systems, biophysics and quantum mechanics Int. J. Mod. Phys. B 26 1230014
[6] Bernido C C and Carpio-Bernido M V 2014 Methods and Applications of White Noise Analysis in Interdisciplinary Sciences (Singapore: World Scientific)
[7] Afreixo V, Rodrigues J M, Bastos C A and Silva R M 2016 The exceptional genomic word symmetry along DNA sequences BMC Bioinform. 17 59
[8] Kuruoglu E E and Arndt P F 2017 The information capacity of the genetic code: is the natural code optimal? J. Theor. Biol. 419 227–37
[9] Hart A and Martinez S 2014 Markovianness and conditional independence in annotated bacterial DNA Stat. Appl. Genet. Mol. Biol. 13 693–716
[10] Vergne N 2008 Drifting Markov models with polynomial drift and applications to DNA sequences Stat. Appl. Genet. Mol. Biol. 7 6
[11] Bai F L, Liu Y Z and Wang T M 2007 A representation of DNA primary sequences by random walk Math. Biosci. 209 282–91
[12] Hérisson J, Payen G and Gherbi R 2007 A 3D pattern matching algorithm for DNA sequences Bioinformatics 23 680–6
[13] Peng C-K, Buldyrev S V, Goldberger A L, Havlin S, Sciortino F, Simons M and Stanley H E 1992 Long-range correlations in nucleotide sequences Nature 356 168–70
[14] Churchill G A 1989 Stochastic models for heterogeneous DNA sequences Bull. Math. Biol. 51 79–94
[15] Burks C and Farmer D 1984 Towards modeling DNA sequences as automata Physica D 10 157–67
[16] Cohen S E and Golden S S 2015 Circadian rhythms in cyanobacteria Microbiol. Mol. Biol. Rev. 79 373–85


[17] Cohen J E et al 2017 An innovative biologic system for photon-powered myocardium in the ischemic heart Sci. Adv. 3 e1603078
[18] Hugler M and Sievert S M 2011 Beyond the Calvin cycle: autotrophic carbon fixation in the ocean Annu. Rev. Mar. Sci. 3 261–89
[19] Zubair S, Fischer A et al 2015 Complete genome sequence of Staphylococcus aureus, strain ILRI Eymole1/1, isolated from a Kenyan dromedary camel Stand. Genomic Sci. 10 109
[20] Kuo H-H 1983 Donsker’s delta function as a generalized Brownian functional and its application Theory and Application of Random Fields. Lect. Notes Control Inf. Sci. vol 49 (Berlin: Springer) pp 167–78
[21] Lascheck A, Leukert P, Streit L and Westerkamp W 1994 More about Donsker’s delta function Soochow J. Math. 20 401–18
[22] Nunno G D, Øksendal B and Proske F (ed) 2009 The Donsker delta function and applications Malliavin Calculus for Lévy Processes with Applications to Finance (Berlin: Springer)
[23] Carpio-Bernido M V, Barredo W I and Bernido C C 2017 On time-dependent diffusion coefficients arising from stochastic processes with memory Structure, Function and Dynamics from nm to Gm ed C D Villagonzalo et al (New York: American Institute of Physics) p 050004
[24] Molina-Garcia D, Sandev T, Safdari H, Pagnini G, Chechkin A and Metzler R 2018 Crossover from anomalous to normal diffusion: truncated power-law noise correlations and applications to dynamics in lipid bilayers New J. Phys. 20 103027
[25] Meerschaert M M and Sabzikar F 2013 Tempered fractional Brownian motion Stat. Probab. Lett. 83 2269–75
[26] Sobottka M and Hart A G 2011 A model capturing novel strand symmetries in bacterial DNA Biochem. Biophys. Res. Commun. 410 823–8
[27] Burov S, Jeon J-H, Metzler R and Barkai E 2011 Single particle tracking in systems showing anomalous diffusion: the role of weak ergodicity breaking Phys. Chem. Chem. Phys. 13 1800–12
[28] Yamanaka K, Narumi T, Hashiguchi M, Okabe H, Hara K and Hidaka Y 2018 Time-dependent diffusion coefficients for chaotic advection due to fluctuations of convective rolls Fluids 3 99
[29] Cheng-Wu L, Hong-Lai X, Cheng G and Wen-biao L 2018 Modeling and experiments for the time-dependent diffusion coefficient during methane desorption from coal J. Geophys. Eng. 15 315–29
[30] Barredo W, Bernido C C, Carpio-Bernido M V and Bornales J B 2018 Modelling non-Markovian fluctuations in intracellular biomolecular transport Math. Biosci. 297 27–31
[31] Schulz J H P, Chechkin A V and Metzler R 2013 Correlated continuous time random walks: combining scale-invariance with long-range memory for spatial and temporal dynamics J. Phys. A: Math. Theor. 46 475001
[32] Ermolaeva M D, White O and Salzberg S 2001 Prediction of operons in microbial genomes Nucleic Acids Res. 21 1216–21
[33] Land M et al 2015 Insights from 20 years of bacterial genome sequencing Funct. Integr. Genomics 15 141–61
[34] Sumner J G, Jarvis P D and Francis A R 2017 A representation-theoretic approach to the calculation of evolutionary distance in bacteria J. Phys. A: Math. Theor. 50 335601
[35] Inouye M, Ishida Y and Inouye K 2017 Designing of a single gene encoding four functional proteins J. Theor. Biol. 419 266–8
[36] Jiménez-Sánchez A 2017 Bacterial cell cycle classification. Application to DNA synthesis and DNA content at any cell age J. Theor. Biol. 419 8–12
[37] Avershina E and Rudi K 2015 Dominant short repeated sequences in bacterial genomes Genomics 105 175–81
[38] Kurgan L et al 2009 CRYSTALP2: sequence-based protein crystallization propensity prediction BMC Struct. Biol. 9 50
[39] Fusco D, Barnum T J, Bruno A E, Luft J R, Snell E H, Mukherjee S and Charbonneau P 2014 Statistical analysis of crystallization database links protein physico-chemical features with crystallization mechanisms PLoS One 9 e101123
[40] Abramo M C, Caccamo C, Calvo M, Conti Nibali V, Costa D, Giordano R, Pellicane G, Ruberto R and Wanderlingh U 2011 Molecular dynamics and small-angle neutron scattering of lysozyme aqueous solutions Philo. Mag. 91 2066
[41] Shim H 2019 Feature learning of virus genome evolution with the nucleotide skip-gram neural network Evol. Bioinform. 15 1–10
