
Academy of Scientific Research and Technology

27th National Radio Science Conference
Faculty of Electronic Engineering, Menoufia Univ., Menouf, Egypt
16-18 March 2010



A New Genomic Signal Processing Method
for Identifying Protein-Coding Regions in a DNA Sequence


Ashraf M. Aziz*, Senior Member IEEE


* Prof., Chief of Electronic Technology Branch, MTC, Cairo, Egypt

Abstract

Signal processing theory, methods, and applications are becoming increasingly important in the study of genes in the
field of bioinformatics. In this paper, a new signal processing method for identifying the protein-coding regions in genes is
proposed. Existing methods depend on the pronounced period-three peaks observed in the Fourier transform of the coding
regions in genes, computed using binary indicator sequences and a rectangular window. In contrast, the proposed method
relies on soft indicator sequences and a gradual window. On experimental data from a number of genes, the proposed
method reveals period-three peaks for coding regions and achieves better discrimination between coding and non-coding
regions than the existing methods based on binary indicator sequences and a rectangular window.

1. Introduction

The discovery of the double helix structure of the DNA (deoxyribonucleic acid) molecule in 1953, by Watson and
Crick [1], is one of the landmarks of molecular biology. Subsequent to this sensational discovery, there has been
phenomenal progress in genomics. With the enormous amount of genomic data available to us in the public domain, it is
becoming increasingly important to be able to process this information in ways that are useful to humankind. Genomic
signal processing methods have played an important role in this field.

Genomic signal processing is the analysis, processing, and use of genomic signals for gaining biological knowledge
and translating this knowledge into systems-based applications. The term bioinformatics refers to the use of computational
techniques in the study of genomes [2]. It is a rapidly developing area of computer science devoted to collecting,
organizing, and analyzing DNA and protein sequences. The field of bioinformatics deals primarily with biological data
encoded in digital symbol sequences, such as DNA and amino acid sequences. The high variability and the high
complexity of genetic signals call for sophisticated mathematical modeling, signal processing, and information extraction
methods [3].

DNA carries the genetic instructions for life, coding for proteins that we depend on to survive as well as passing on
characteristics from one generation to the next. That is why DNA is known as the "secret of life" or the "building blocks of
life". All living things, whether plants, animals, or humans, carry DNA, and it is the small differences in the DNA code that
make us different from one another and make species different from each other. The entire set of DNA is the genome of
the organism.

Coded in the DNA are instructions necessary for a cell's proper functioning. Those instructions are stored in specific
units called genes. When a particular instruction becomes active, the corresponding gene is said to turn on or be expressed.
Following the expression of a particular gene, the corresponding section of the DNA strand is copied into a less stable
molecule called messenger ribonucleic acid (mRNA). The process of producing mRNA is called transcription. The
mRNA is then transferred to the ribosome, where the protein molecule is produced by interpreting the instruction in
mRNA. This process of producing proteins is referred to as translation. Translation takes place according to the genetic
code, which maps successive triplets of RNA bases to amino acids. Thus, a protein is a chain of amino acid units. The
translation from mRNA to protein is aided by molecules called the transfer RNA (tRNA) molecules. Proteins can carry out
a number of tasks, such as catalyzing reactions, transporting oxygen, regulating the production of other proteins, and many
others. In summary, the way proteins are encoded by genes involves the two major steps: transcription and translation.

The major goals of genomic signal processing research are: (1) Sequencing and comparison of genomes of different
species, (2) Identifying genes and determining the functions of proteins they encode, (3) Predicting the structural features
of the protein from the amino acid sequence, (4) Understanding gene expression and gene protein interaction to control
cellular processes, (5) Tracing the evolutionary relationships among existing species and constructing phylogenetic trees
and (6) Discovering associations between gene mutations and disease. This paper focuses on the process of identifying
protein-coding genes in a DNA sequence.

The purpose of this paper is to achieve better discrimination of protein-coding regions than the existing
methods. The remainder of this paper is organized as follows. A brief outline of the basic biology of DNA is given in
Section 2. A quick review of signal processing methods for identifying protein-coding genes is provided in Section 3. A
new genomic signal processing method, based on soft indicator sequences and a gradual window, is proposed in Section 4.
The performance of the proposed method in identifying protein-coding regions, using experimental data, is also
reported in Section 4. Section 5 concludes the paper.

2. Basic Biology of DNA

In this section, we describe the basic biology of DNA. This is helpful in understanding the structure and functions of
DNA. Figure 1 shows a schematic of the DNA molecule [3]. It is a molecule made of two strands forming a double helix
(Fig. 1-a). Between the two strands of the backbone, which is on the outside, there are pairs of nitrogenous bases
(Fig. 1-b). The backbone is a very regular structure made from sugar (deoxyribose) and phosphate (Fig. 1-c). There are
four types of nitrogenous bases (or nucleotides), denoted by the letters A (adenine), C (cytosine), G (guanine), and
T (thymine). The sugar has five carbon atoms that are typically numbered from 1' to 5'. The phosphate is attached to the
5' carbon atom, whereas the base is attached to the 1' carbon. The 3' carbon also has a hydroxyl group (OH) attached to it.

The nucleotides (A, T, G, C) can be represented by a string of characters, for example ...AGGTACCCAAGTATAAGAAGTTA... .
It is seen that life is governed by a quaternary code (the nucleotides). These nucleotides are made from carbon, nitrogen,
hydrogen and oxygen atoms. There are about three billion of these nucleotides in the DNA of a single human cell. The
genome sequence corresponding to the top strand of the DNA molecule shown in Fig. 1-b is AGACTGAA. In double-stranded
DNA, the base A always pairs with T, and C pairs with G. Thus the bottom strand in Fig. 1-b is TCTGACTT, which is the
complement of the top strand. These base pairings occur through hydrogen bonds [2-5]. The RNA (ribonucleic acid)
molecule is closely related to DNA. It is also made of four bases, but instead of thymine, a molecule called uracil (U) is
used. The molecule U pairs with A by hydrogen bonding just as T pairs with A. RNA molecules are short (and short-lived)
single-stranded molecules which are used by the cell as temporary copies of portions of DNA. The RNA that is used to
make proteins is called messenger RNA (mRNA).

As shown in Fig. 2, a DNA sequence can be separated into genes and intergenic spaces. Genes contain the information
for the generation of proteins. Each gene is responsible for the production of a different protein. Even though all the cells
in an organism have identical genes, only a selected subset is active in any particular family of cells. Figure 3 shows the
steps involved in the production of a protein from a gene. It is worth noting that a gene has two types of subregions,
called exons (protein-coding regions) and introns.

Fig. 1 DNA molecule



Fig. 2 Genes and intergenic spaces

Fig. 3 Production of a protein from genes

Fig. 4 Genetic code

Fig. 5 List of the twenty amino acids

Fig. 6 An example of how mRNA is converted to protein using the genetic code


The gene is first copied into a single-stranded chain called the messenger RNA (mRNA) molecule. This process is
called transcription. The introns are then removed from the mRNA by a process called splicing, so that the spliced mRNA
contains only the exons of the gene. The spliced mRNA is then used by a large molecule (the ribosome), in a process called
translation, to produce the appropriate protein. The mRNA is in reality the complement of the gene, that is, C's are
replaced with G's, G's with C's, T's with A's, and A's with U's (U taking the place of T). Thus, if the gene is AATTAGC
then the mRNA is UUAAUCG. The observation that each gene creates a protein, through mRNA, is often expressed as
gene in DNA → RNA → protein.
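
The complement rule described above can be sketched in a few lines of Python. This is only an illustration of the base-pairing rule as stated in the text; the name `transcribe` is a hypothetical helper, and splicing is not modeled.

```python
# Complement rule described in the text, with U taking the place of T in RNA.
DNA_TO_MRNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(gene: str) -> str:
    """Map each DNA base to its complementary RNA base (hypothetical helper)."""
    return "".join(DNA_TO_MRNA[base] for base in gene.upper())


print(transcribe("AATTAGC"))  # UUAAUCG, the example given in the text
```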

The spliced mRNA is divided into groups of three adjacent bases. The ribosomes move along the mRNA in the 5' → 3'
direction and read the mRNA sequence in nonoverlapping chunks of three nucleotides. Thus the mRNA is considered as a
sequence of triplet codons. Each triplet codon instructs the cell machinery to synthesize a particular amino acid. The
codon sequence therefore uniquely identifies an amino acid sequence, which defines a protein (Fig. 4). As there are 4
different possible nucleotides, there are $4^3 = 64$ possible codons. There are 20 types of standard amino acids that are
regularly found in nature, as well as nonstandard types that rarely appear. Since there are only 20 standard amino acids
and 64 possible codons, most amino acids can be specified by more than one triplet, i.e., the mapping from codons to
amino acids is many-to-one (Fig. 5). When a gene is expressed (all the codons in the mRNA are exhausted), each codon in
the mRNA produces an amino acid according to the genetic code, and the amino acids are bonded together into a chain
(typically a few hundred long). This is the protein corresponding to the original gene. Figure 6 shows an example of
how mRNA is converted to protein using the genetic code. Similar to DNA, a protein molecule can be represented by a
string of characters from an alphabet of size 20.
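
As a small illustration of the codon arithmetic above, the following sketch enumerates the 64 triplets and reads an mRNA string in nonoverlapping chunks of three. The lookup table is deliberately a partial excerpt of the standard genetic code (only a few well-known codons), and the helper names `split_codons` and `translate` are hypothetical.

```python
from itertools import product

BASES = "ACGU"
# Every ordered triplet over a 4-letter alphabet: 4**3 = 64 codons.
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Partial excerpt of the standard genetic code, for illustration only.
PARTIAL_CODE = {
    "AUG": "Met", "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def split_codons(mrna):
    """Cut the mRNA into nonoverlapping triplets, dropping any incomplete tail."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


def translate(mrna):
    """Translate codon by codon until a stop codon; unknown codons become '?'."""
    protein = []
    for codon in split_codons(mrna):
        amino = PARTIAL_CODE.get(codon, "?")
        if amino == "Stop":
            break
        protein.append(amino)
    return protein


print(len(ALL_CODONS))            # 64
print(translate("AUGUUUGGCUAA"))  # ['Met', 'Phe', 'Gly']
```

Several codons map to the same amino acid in the table (for example the four GG* codons to Gly), which is the many-to-one property mentioned in the text.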


3. A Quick Review of Genomic Signal Processing Methods for Identifying Protein Coding
Genes


In this section, we discuss the application of signal processing in genomics and review existing techniques used for
gene identification. Various signal processing techniques have been applied in the field of genomics, owing to the
discrete nature of DNA and protein data. As these DNA and protein sequences are symbolic, they cannot be
processed directly with the existing signal processing algorithms, and numerical assignments need to be made to these
sequences. After these symbolic sequences are mapped to suitable numerical sequences, we can apply existing signal
processing algorithms to them in order to explain a few of the important properties of the sequences [2].

There are three categories of numerical mapping techniques: indicator sequence mapping, real number mapping, and
complex number mapping [6]. In indicator sequence mapping, the DNA sequence is mapped to the numerical domain in
terms of indicator vectors. Each nucleotide is represented by an indicator vector $u_k$, in which only the position of
the element $k$ ($k$ = A, T, C, or G) is one and all other positions are zero. Weights are then assigned to the
nucleotides. A binary indicator sequence is obtained by setting the corresponding weight to 1 and the other weights to 0.
Consider the sequence $x(n)$ = {AATTCAGGCTAGTCTAACC}. For this sequence, the binary indicator sequences are

$$
\begin{aligned}
b_A(n) &= \{1100010000100001100\},\\
b_C(n) &= \{0000100010000100011\},\\
b_G(n) &= \{0000001100010000000\},\\
b_T(n) &= \{0011000001001010000\}.
\end{aligned}
\tag{1}
$$

The indicator sequences are composed of ones and zeros. In each indicator sequence, a one marks a position where the
corresponding nucleotide is present. For example, $b_A(n)$ has ones at positions $n$ = 1, 2, 6, 11, 16, 17, since
the sequence $x(n)$ has A at those positions.
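
The binary indicator sequences of the example can be reproduced with a short sketch. The function name `binary_indicators` is a hypothetical helper; positions are reported 1-based to match the text.

```python
def binary_indicators(seq):
    """Return one 0/1 list per nucleotide, marking where that nucleotide occurs."""
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in "ACGT"}


x = "AATTCAGGCTAGTCTAACC"
b = binary_indicators(x)
for base in "ACGT":
    print(base, "".join(str(v) for v in b[base]))

# 1-based positions at which A occurs in x(n).
print([n + 1 for n, v in enumerate(b["A"]) if v == 1])  # [1, 2, 6, 11, 16, 17]
```

At every position exactly one of the four indicator sequences equals one, so the four sequences sum to an all-ones sequence.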
In real number mapping, the nucleotides are mapped to real numbers as follows:

$$A \to -1.5, \quad T \to 1.5, \quad C \to 0.5, \quad G \to -0.5. \tag{2}$$


The notation in Equation (2), e.g. $A \to -1.5$, can be read as: the nucleotide A is mapped to the number -1.5. The
complementary nucleotides are equal in magnitude and opposite in sign. This rule is better suited for computing
correlation values. Another method, discussed in [6], assigns an increasing sequence of integers to the alphabetically
sorted nucleotides after obtaining the indicator sequences. The assignment is

$$A \to 1, \quad C \to 2, \quad G \to 3, \quad T \to 4. \tag{3}$$

In complex number mapping, the four characters (A, T, C, and G) are represented by four complex numbers as follows
(see [7] for details):

$$A \to 1 + j, \quad C \to -1 - j, \quad G \to -1 + j, \quad T \to 1 - j. \tag{4}$$
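
The three numerical mappings can be written as lookup tables, as in the sketch below. The text fixes A at -1.5 and states that complementary nucleotides are equal in magnitude and opposite in sign; the particular signs used here for C and G, and the complex values, follow the commonly quoted forms of these mappings and should be read as assumptions. The names `REAL_MAP`, `INTEGER_MAP`, `COMPLEX_MAP`, and `map_sequence` are hypothetical.

```python
# Real-number mapping; the C/G signs are an assumption (see lead-in).
REAL_MAP = {"A": -1.5, "T": 1.5, "C": 0.5, "G": -0.5}
# Integer mapping over the alphabetically sorted nucleotides.
INTEGER_MAP = {"A": 1, "C": 2, "G": 3, "T": 4}
# Complex mapping; assumed values, chosen so complementary bases are conjugates.
COMPLEX_MAP = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def map_sequence(seq, table):
    """Replace every nucleotide symbol by its numerical value under `table`."""
    return [table[base] for base in seq.upper()]


print(map_sequence("AGACTGAA", REAL_MAP))
print(map_sequence("ACGT", INTEGER_MAP))
print(map_sequence("ACGT", COMPLEX_MAP))
```

With these tables, the real mapping sends each base to the negative of its complement's value, and the complex mapping sends it to the complex conjugate.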

After these symbolic sequences are mapped to suitable numerical sequences, we can apply existing signal processing
algorithms to them in order to explain a few of the important properties of the sequences. Signal processing techniques,
such as the Fourier transform and the wavelet transform [8], have been used to identify periodicities in DNA sequences
and to help in finding the protein-coding regions of the sequence. This is also called gene identification and is
characterized from the frequency spectrum of the DNA sequences. We focus on the Fourier transform. A detailed discussion
on applying Fourier analysis to DNA sequences is presented in [9].

The Fourier spectra of the DNA sequences, computed with a sliding window, identify the coding regions present in the
DNA sequence. Protein-coding regions (exons) in genes have a period-3 component because of coding biases in
the translation of codons into amino acids [10-13]. The period-3 property is not present outside exons, and can be exploited
to locate exons. The major signal that characterizes the coding regions is thus the three-base periodicity. The value of the
spectrum at $f = N/3$ (where $N$ is the length of the window) determines the nature of the DNA region. The three-base
periodicity is also called the $N/3$ periodicity or $2\pi/3$ periodicity. The spectra of the binary indicator sequences
can be computed to obtain the power spectrum as in [9]. A coding measure can be defined from the relative strength of
the periodicity at $f = N/3$, which appears as a peak in the average spectrum. The origin of the periodicity can be
attributed to the codon bias, which refers to the unequal usage of codons in the coding regions, and to the triplet bias,
which is the bias in the usage of nucleotide triplets. The advantages of these techniques are that they are robust to
sequencing errors and that their computational complexity is very low. Datta et al. [14] presented a mathematical
treatment to explain the reason for the peak at $f = N/3$ in a coding region.
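
To make the period-3 peak concrete, the sketch below evaluates single DFT bins of the binary indicator sequences for a toy sequence. The sequence "GCA" repeated 40 times is an artificial, perfectly periodic caricature of codon bias, not real genomic data, and the helper names `dft_bin` and `spectrum_strength` are hypothetical.

```python
import cmath


def dft_bin(values, k):
    """Single DFT coefficient X[k] = sum_n x[n] * exp(-j 2 pi k n / N)."""
    n_total = len(values)
    return sum(v * cmath.exp(-2j * cmath.pi * k * n / n_total)
               for n, v in enumerate(values))


def spectrum_strength(seq, k):
    """Sum over the four bases of |DFT of the binary indicator sequence|^2 at bin k."""
    total = 0.0
    for base in "ACGT":
        indicator = [1.0 if s == base else 0.0 for s in seq]
        total += abs(dft_bin(indicator, k)) ** 2
    return total


toy = "GCA" * 40             # N = 120, perfectly periodic with period 3
n_total = len(toy)
print(spectrum_strength(toy, n_total // 3))  # close to 4800 (3 bases x 40**2)
print(spectrum_strength(toy, n_total // 4))  # close to 0
```

Bin $k = N/3$ corresponds to the angular frequency $2\pi/3$, i.e., a period of three samples, which is why the energy of the toy sequence concentrates there.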

An extension to the Fourier transform approach has been proposed in [15], where DNA spectrograms (squared
magnitude of the short-time Fourier transform) are computed in order to differentiate between the coding and non-coding
regions. In order to compute the spectra, two numerical mapping techniques, binary mapping and complex mapping, are
used. The DNA spectrum is computed using short-time Fourier transforms by translating windows along the binary
indicator sequences. The weighted spectrum is then computed by reducing the dimensionality of the indicator sequences
from four to three. The weights are optimized such that the observed peaks at $f = N/3$ can be distinguished from the
non-coding regions, i.e., the other periodicities.

The use of digital infinite-impulse response (IIR) filtering to detect coding regions in DNA sequences has been
presented in [3]. As the $N/3$ periodicity is exhibited by the coding regions of a DNA sequence, an anti-notch filter to
identify these coding regions is discussed in [3]. The design of the anti-notch filter is based on the fact that there is a
sharp peak at $2\pi/3$ in the spectrum. This is a bandpass filter with passband centered at $\omega_0 = 2\pi/3$. If we
pay careful attention to the design of the digital filter, we can isolate the period-3 behavior. Thus, after passing the
binary indicator sequences of the DNA data through the filter and observing the output signal spectrum, if there is a
pronounced peak at $2\pi/3$, then the sequence is from a coding region of the DNA.
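
The idea of a narrow bandpass filter centered at $\omega_0 = 2\pi/3$ can be illustrated with a generic two-pole resonator. This is a textbook resonator used as a stand-in, not the specific anti-notch design of [3]; the pole radius 0.95 and the names `antinotch_filter` and `energy` are assumptions made for the sketch.

```python
import math


def antinotch_filter(x, radius=0.95, omega0=2.0 * math.pi / 3.0):
    """Two-pole resonator with poles at radius * exp(+/- j omega0).

    Difference equation: y[n] = x[n] + 2 r cos(omega0) y[n-1] - r^2 y[n-2].
    """
    a1 = 2.0 * radius * math.cos(omega0)
    a2 = radius ** 2
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 - a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


def energy(values):
    """Sum of squared samples."""
    return sum(v * v for v in values)


# Toy indicator sequences: a one every third sample versus every fourth sample.
period3 = [1.0 if n % 3 == 0 else 0.0 for n in range(300)]
period4 = [1.0 if n % 4 == 0 else 0.0 for n in range(300)]

print(energy(antinotch_filter(period3)))  # large: input has a 2*pi/3 component
print(energy(antinotch_filter(period4)))  # much smaller: no 2*pi/3 component
```

A pole radius closer to one narrows the passband and sharpens the discrimination, at the cost of a longer transient.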



4. Proposed Method for Identifying Protein-Coding Regions and Results Using Experimental
Data

Most of the methods presented in the literature (see [15-17] for examples) use binary indicator sequences and
rectangular windows to parse the DNA sequence under processing. The binary sequences and rectangular windows cause
abrupt truncations of the DNA sequence and lead to extraneous peaks in the spectrum. The proposed method uses soft
sequences and applies a gradual window to parse the DNA sequence. This avoids abrupt truncations of the DNA
sequence and removes most of the extraneous peaks, which improves the discrimination results.

Consider a DNA sequence $D(i)$, $0 \le i \le N-1$, consisting of the four nucleotides A, T, C, and G. The DNA sequence
is mapped into four soft sequences, $A(i)$, $T(i)$, $C(i)$, and $G(i)$, $0 \le i \le N-1$. These soft sequences are
related to the presence or absence of the four nucleotides at location $i$ in $D(i)$. For a given $i$, the soft sequence
attributed to a certain nucleotide takes the value 1 at location $i$ if the $i$th element of the DNA sequence is that
nucleotide. Otherwise, it takes a value less than one, determined by the number of occurrences of the nucleotide. The
number of occurrences of a nucleotide counts how many times it is present in $D(i)$, i.e., it counts the number of 1's
in the corresponding soft sequence. The numbers of occurrences of the four nucleotides are denoted by $N_A$, $N_T$,
$N_C$, and $N_G$, i.e.,

$$N_B = \sum_{i=0}^{N-1} B(i)\,\delta(B(i) - 1), \quad B \in \{A, T, C, G\}, \tag{5}$$

where

$$\delta(B(i) - 1) = \begin{cases} 1, & \text{if } B(i) = 1,\\ 0, & \text{otherwise,} \end{cases} \tag{6}$$

and

$$N_A + N_T + N_C + N_G = N. \tag{7}$$

The soft sequences are defined as

$$B(i) = \begin{cases} 1, & \text{if } D(i) = B,\\ N_B / N, & \text{otherwise,} \end{cases} \quad B \in \{A, T, C, G\}. \tag{8}$$

For example, for the DNA sequence ($N = 24$)

$$D(i) = \{C\,T\,G\,C\,A\,T\,G\,A\,C\,T\,A\,A\,G\,A\,G\,T\,C\,C\,G\,T\,A\,T\,G\,A\}, \tag{9}$$

the numbers of occurrences $N_A$, $N_T$, $N_C$, and $N_G$ are 7, 6, 5, and 6, respectively. Thus the four soft
sequences are

$$
\begin{aligned}
A(i) &= [\tfrac{7}{24}, \tfrac{7}{24}, \tfrac{7}{24}, \tfrac{7}{24}, 1, \tfrac{7}{24}, \tfrac{7}{24}, 1, \tfrac{7}{24}, \tfrac{7}{24}, 1, 1, \tfrac{7}{24}, 1, \tfrac{7}{24}, \tfrac{7}{24}, \tfrac{7}{24}, \tfrac{7}{24}, \tfrac{7}{24}, \tfrac{7}{24}, 1, \tfrac{7}{24}, \tfrac{7}{24}, 1],\\
T(i) &= [\tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}],\\
C(i) &= [1, \tfrac{5}{24}, \tfrac{5}{24}, 1, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, 1, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, 1, 1, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}, \tfrac{5}{24}],\\
G(i) &= [\tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}, \tfrac{6}{24}, \tfrac{6}{24}, 1, \tfrac{6}{24}].
\end{aligned}
\tag{10}
$$
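
The construction of the soft sequences can be checked numerically with the short sketch below, which applies the definition (1 where the nucleotide is present, $N_B/N$ elsewhere) to a 24-base example with the same nucleotide counts. The function name `soft_sequences` is a hypothetical helper, and exact fractions are used so that the values can be compared without rounding (note that Python prints 6/24 in reduced form as 1/4).

```python
from fractions import Fraction

BASES = "ATCG"


def soft_sequences(dna):
    """Return the four soft sequences and the occurrence counts N_B."""
    dna = dna.upper()
    n_total = len(dna)
    counts = {b: dna.count(b) for b in BASES}
    soft = {
        b: [Fraction(1) if s == b else Fraction(counts[b], n_total) for s in dna]
        for b in BASES
    }
    return soft, counts


dna = "CTGCATGACTAAGAGTCCGTATGA"   # 24-base example sequence
soft, counts = soft_sequences(dna)
print(counts)                        # N_A, N_T, N_C, N_G = 7, 6, 5, 6
print([str(v) for v in soft["A"]])   # 1 at the A positions, 7/24 elsewhere
```

Unlike a binary indicator sequence, a soft sequence never drops to zero: its baseline encodes how frequent the nucleotide is in the whole sequence.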

The four soft sequences are parsed into overlapping segments of length $L$, $L \le N$. This is achieved by sliding a
window $w$ of length $L$ forward one position at a time until all the soft sequences are processed. In most
publications [9-11, 13-15, 17], a rectangular window is used. Instead, we use a more gradual window (see Fig. 7). The
reason is that a rectangular window introduces discontinuities by abruptly truncating the soft sequences, which spreads
the spectrum of the sequences in the frequency domain and causes it to leak into adjacent frequencies. Using
a gradual window reduces the leakage effect. We use the following gradual window to parse the soft sequences:

$$w(n) = \begin{cases} 1 - 0.05\cos\left(\dfrac{2\pi n}{L-1}\right), & 0 \le n \le (L-1),\\ 0, & \text{otherwise.} \end{cases} \tag{11}$$

Figure 7 compares the rectangular window with the gradual window of Equation (11).
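
The window values can be tabulated directly. The sketch below assumes that Equation (11) reads $w(n) = 1 - 0.05\cos(2\pi n/(L-1))$ for $0 \le n \le L-1$; the function names are hypothetical, and the length 351 is the window size quoted later in the text.

```python
import math


def gradual_window(length):
    """Assumed form of Eq. (11): w(n) = 1 - 0.05 cos(2 pi n / (L - 1))."""
    return [1.0 - 0.05 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def rectangular_window(length):
    """Rectangular window: unit weight on every sample."""
    return [1.0] * length


w = gradual_window(351)
r = rectangular_window(351)
print(w[0], w[175], w[350])  # edge, centre, edge of the gradual window
print(r[0], r[175], r[350])  # the rectangular window is flat
```

Under this assumed form the window is symmetric about its centre and its weights rise smoothly from the edges toward the middle of the segment.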

For the truncated soft sequences, the discrete Fourier transform (DFT) is calculated as

$$U_B(k) = \sum_{n=0}^{L-1} B(n)\, w(n)\, e^{-j(2\pi/L)kn}, \quad B \in \{A, T, C, G\}, \quad k = 0, 1, 2, \ldots, L-1. \tag{12}$$

The sum of the power spectra of the four soft sequences is determined as

$$S(k) = |U_A(k)|^2 + |U_T(k)|^2 + |U_C(k)|^2 + |U_G(k)|^2, \quad k = 0, 1, 2, \ldots, L-1. \tag{13}$$

The sum of the power spectra, $S(k)$, is used as an indicator of coding regions. It serves as a coding measure to
detect probable protein-coding genes in the DNA sequence. When $S(k)$ is plotted against $k$, it reveals an appreciable
peak at $k = L/3$ for a coding region (the $N/3$ component of Section 3, where the window length was denoted by $N$) and
shows no such peak for a noncoding region.
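
The steps of this section can be combined into a sliding-window sketch that evaluates the coding measure at the single bin $k = L/3$ for every window position. Everything below is illustrative: the test sequence is synthetic (a perfectly periodic "GCA" block planted between deterministic pseudo-random filler bases), the window form is the assumed $1 - 0.05\cos(2\pi n/(L-1))$, the window length 60 is chosen only to keep the example fast, and all function names are hypothetical.

```python
import cmath
import math

BASES = "ATCG"


def soft_sequences(dna):
    """Soft sequences: 1 where the base matches, N_B / N elsewhere."""
    n_total = len(dna)
    counts = {b: dna.count(b) for b in BASES}
    return {b: [1.0 if s == b else counts[b] / n_total for s in dna]
            for b in BASES}


def gradual_window(length):
    """Assumed gradual window: 1 - 0.05 cos(2 pi n / (L - 1))."""
    return [1.0 - 0.05 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def coding_measure(dna, length):
    """S(k) at k = L/3 for every window start position."""
    soft = soft_sequences(dna)
    k = length // 3
    window = gradual_window(length)
    # Window weight times DFT kernel, shared by all positions and bases.
    kernel = [window[n] * cmath.exp(-2j * math.pi * k * n / length)
              for n in range(length)]
    measure = []
    for start in range(len(dna) - length + 1):
        total = 0.0
        for b in BASES:
            segment = soft[b][start:start + length]
            u = sum(v * c for v, c in zip(segment, kernel))
            total += abs(u) ** 2
        measure.append(total)
    return measure


def pseudo_random_bases(count, state=12345):
    """Deterministic filler bases from a small linear congruential generator."""
    out = []
    for _ in range(count):
        state = (1103515245 * state + 12345) % 2**31
        out.append("ACGT"[(state >> 16) % 4])
    return "".join(out)


background = pseudo_random_bases(240)
dna = background[:120] + "GCA" * 40 + background[120:]   # planted block: 120..239
s = coding_measure(dna, 60)
inside = s[120:181]            # windows lying entirely inside the planted block
outside = s[:61] + s[240:]     # windows lying entirely in the filler
print(sum(inside) / len(inside), sum(outside) / len(outside))
```

Plotting `s` against the window start position would give a curve of the same kind as Figs. 8 and 9, with the measure rising over the planted periodic block.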

The proposed method is tested using experimental data downloaded from the National Center for Biotechnology
Information (NCBI) database [18]. An optimal window size of 351 is used, as in the program Genescan [7, 15].
Figures 8 and 9 show the power spectrum of the gene HMX1 (Acc. No. AF009614, Musculus) using binary indicator
sequences (with a rectangular window) and using the proposed soft indicator sequences (with a gradual window),
respectively. HMX1 has two exons, one from nucleotide positions 1267 to 1639, and the other from 3888 to 4513. As shown
in Fig. 8, the binary indicator sequence method (using a rectangular window) is not accurate, since the peak of the first
exon is shifted from the actual region. It also shows two peaks for the first exon. In contrast, the peak of the first exon
lies inside the correct region when the proposed method is used, as seen in Fig. 9. Figures 8 and 9 also show that the
proposed method achieves a clear discrimination. Table 1 compares the results obtained using the binary indicator
sequences and the proposed method for several genes. In all examples, the proposed method achieves better
performance than the binary indicator sequences and also achieves a clear discrimination. The proposed method achieves
accurate determination of the exon regions. These results, and other results, enable us to conclude that the proposed
method is more accurate and achieves a better discrimination.


Fig. 7 Comparison of rectangular and gradual windows (solid: rectangular window; dashed: gradual window; horizontal axis: nucleotide positions; vertical axis: window magnitude)

Fig. 8 Power spectrum in case of binary indicator sequences and rectangular window (gene: HMX1; exon 1: 1267-1639; exon 2: 3888-4513; horizontal axis: nucleotide positions; vertical axis: strength)


Fig. 9 Power spectrum in case of soft indicator sequences and gradual window (proposed) (gene: HMX1; exon 1: 1267-1639; exon 2: 3888-4513; horizontal axis: nucleotide positions; vertical axis: strength)

Table 1: Comparison of peak locations

Accession | Gene                         | Length of | Exon region (length)  | Location of peaks using    | Location of peaks using
number    |                              | sequence  |                       | binary indicator sequences | soft indicator sequences and
          |                              |           |                       | and rectangular window     | gradual window (proposed)
----------+------------------------------+-----------+-----------------------+----------------------------+-----------------------------
AB016625  | Human OCTN2                  | 25871     | 15591-15792 (172)     | 76                         | 57
AF019074  | EKLF, Musculus               | 6350      | 3761-4574 (814)       | 131                        | 271
AB009589  | Human gene for Osteomodulin  | 12414     | 10624-10949 (326)     | 8                          | 112
L26462    | Human beta-globin A chain    | 2464      | (a) 866-957 (92)      | (a) 38                     | (a) 30
          |                              |           | (b) 1088-1310 (223)   | (b) 88                     | (b) 76
          |                              |           | (c) 2161-2289 (129)   | (c) 23                     | (c) 45




5. Conclusion

In this paper, the biology of DNA has been described from the signal processing point of view. It has been shown that
DNA carries the genetic instructions for life, coding for proteins that we depend on to survive as well as passing on
characteristics from one generation to the next. Proteins can carry out a number of tasks, and the way they are encoded by
genes involves two major steps: transcription and translation. Transcription is the process of producing mRNA. The
mRNA is then transferred to the ribosome, where the protein molecule is produced by interpreting the instruction in
mRNA. This process of producing proteins is referred to as translation. Translation takes place according to the genetic
code, which maps successive triplets of RNA bases to amino acids. Thus, a protein is a chain of amino acid units. A new
signal processing method for identifying the protein-coding regions in DNA sequences has been proposed. Most of
the existing methods depend on the pronounced period-three peaks observed in the Fourier transform of the coding
regions in genes, computed using binary indicator sequences and a rectangular window. The proposed method instead relies
on soft indicator sequences and a gradual window. This avoids abrupt truncations of the DNA sequence, reduces the
leakage of the spectrum into adjacent frequencies, and removes most of the extraneous peaks. It has been found that the
proposed method reveals period-three peaks for coding regions and shows no such peak for noncoding regions. The proposed
method achieved better discrimination between coding and non-coding regions, based on experimental data from a number
of genes, than the existing methods using binary indicator sequences and a rectangular window.

References

[1] J. D. Watson and F. H. C. Crick, "A Structure for DNA," Nature, p. 737, April 1953.
[2] M. Ré and G. Pavesi, "Detecting Conserved Coding Genomic Regions through Signal Processing of Nucleotide
Substitution Patterns," Artificial Intelligence in Medicine, Vol. 45, No. 2-3, pp. 117-123, March 2009.
[3] P. P. Vaidyanathan, "Genomics and Proteomics: A Signal Processor's Tour," IEEE Circuits and Systems Magazine, pp.
6-29, 2004.
[4] D. Schonfeld, J. Goutsias, I. Shmulevich, I. Tabus, and A. H. Tewfik, "Introduction to the Issue on Genomic and
Proteomic Signal Processing," IEEE Journal of Selected Topics in Signal Processing, Vol. 2, No. 3, pp. 257-260, June
2008.
[5] Z. Aydin and Y. Altunbasak, "A Signal Processing Application in Genomic Research: Protein Secondary Structure
Prediction," IEEE Signal Processing Magazine, Vol. 7, pp. 128-131, July 2006.
[6] W. Wang and D. H. Johnson, "Computing Linear Transforms of Symbolic Signals," IEEE Transactions on Signal
Processing, Vol. 50, No. 3, pp. 628-634, March 2002.
[7] D. Anastassiou, "Genomic Signal Processing," IEEE Signal Processing Magazine, pp. 8-20, July 2001.
[8] M. Altaiski, O. Mornev, and R. Polozov, "Wavelet Analysis of DNA Sequences," Genetic Analysis: Biomolecular
Engineering, pp. 165-168, Dec. 1996.
[9] J. Berger, S. Mitra, and J. Astola, "Power Spectrum Analysis for DNA Sequences," IEEE Transactions on Signal
Processing, Vol. 2, pp. 29-32, 2003.
[10] J. Astola, E. Dougherty, I. Shmulevich, and I. Tabus, "Genomic Signal Processing," Signal Processing,
Vol. 83, No. 4, pp. 691-694, April 2003.
[11] M. K. Hota and V. K. Srivastava, "DSP Technique for Gene and Exon Prediction Taking Complex Indicator
Sequence," TENCON 2008, IEEE Region 10 Conference, pp. 1-6, Nov. 2008.
[12] I. Shmulevich, Genomic Signal Processing, Princeton Univ. Press, New York, 2007.
[13] M. Akhtar, J. Epps, and E. Ambikairajah, "Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene
Prediction," IEEE Journal of Selected Topics in Signal Processing, Vol. 2, No. 3, pp. 310-321, June 2008.
[14] S. Datta and A. Asif, "A Fast DFT Based Gene Prediction Algorithm for Identification of Protein Coding Regions,"
Proc. IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, Vol. 5, pp. 653-656, March
2005.
[15] D. Anastassiou, "Frequency Domain Analysis of Biomolecular Sequences," Bioinformatics, Vol. 16, No. 12, pp.
1073-1081, 2000.
[16] E. Dougherty, I. Shmulevich, J. Chen, and Z. Jane Wang, Genomic Signal Processing and Statistics, New York:
Hindawi Publishing Corporation, 2005.
[17] P. D. Cristea, "Large Scale Features in DNA Genomic Signals," Signal Processing, Vol. 83, No. 4, pp. 871-888,
April 2003.
[18] National Center for Biotechnology Information (NCBI) [On-Line]. Available: http://www.ncbi.nlm.nih.gov/.
