Computational DNA Sequence Analysis

International Journal of Computer Trends and Technology (IJCTT) - Volume 4, Issue 5, May 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page 1264

Archana Yashodhar*1, Manjula*2, Praveen N*3, Pavithra K*4
*1234 Lecturer, Dept. of Computer Science, Mangalore University, Mangalore.

Abstract: DNA sequencing is the process of determining the
precise order of nucleotides within a DNA molecule. It is at
the centre of the Human Genome Project, which promises to
revolutionize the biomedical sciences and the treatment of
human diseases. Extensive research and development has taken
place over the last few years in the area of DNA sequence
analysis. In this paper we discuss some of the methods of DNA
sequence analysis and the algorithms used to align DNA
sequences.
I. INTRODUCTION TO DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that contains
the genetic instructions used in the development and
functioning of all known living organisms and some viruses.
The main role of the DNA molecule is the long-term storage of
information. The DNA segments that carry this genetic
information are called genes. Because DNA stores all the
genetic information of an individual, it is what is passed on
when an organism reproduces: an organism resembles its parents
because its genetic information is copied from theirs. For
this reason, studying the DNA of even a tiny bacterium can
give great insight into the working of the human body. In
eukaryotic organisms, most DNA is located in the cell nucleus.

1.1 Chemical composition
DNA is composed of just three kinds of chemical substance:
i) A five-carbon (pentose) sugar called deoxyribose.
ii) A phosphate group, derived from phosphoric acid (H3PO4),
in which an atom of phosphorus is surrounded by four atoms of
oxygen.
iii) Nitrogen-containing nucleobases.
There are two types of bases in DNA: purines and pyrimidines.
Purines are structurally double-ring substances and
pyrimidines are single-ring substances. The purines found in
DNA are Adenine (A) and Guanine (G). The pyrimidines of DNA
are Thymine (T) and Cytosine (C).

1.2 Nucleoside
When a molecule of a purine or pyrimidine base is linked to a
molecule of the pentose sugar deoxyribose, a nucleoside is
formed.

1.3 Nucleotides
When a nucleoside is joined to a phosphoric acid molecule, it
becomes a nucleotide. Hence a nucleotide is a phosphorylated
derivative, or phosphate ester, of a nucleoside; with a single
phosphate group it is named a nucleoside monophosphate. The
nucleotide is the basic unit of the DNA molecule.
II. STRUCTURE OF DNA DOUBLE HELIX: THE WATSON-CRICK MODEL OF DNA
In 1953, James D. Watson and Francis H. C. Crick, working at
the University of Cambridge in England, built a model to
explain how the polynucleotide chains are arranged in a DNA
molecule. A polynucleotide is a chain formed when multiple
nucleotides are linked together. Fig. 1 shows the Watson-Crick
model of DNA, which is a double helix.

Fig. 1 The Watson-Crick model of DNA.
The Watson-Crick model of DNA has the following important
features:
i) The DNA molecule consists of two polynucleotide chains and
hence is double-stranded. It has a constant diameter of 20 Å.
ii) Each strand is coiled in the form of a helix, and the two
strands are coiled about each other to form a double helix.
The DNA double helix is comparable to a twisted ladder with
uprights and steps.
iii) The backbone of each helically coiled strand is composed
of repeating units of sugar and phosphate.
iv) Base pairing is always between a larger purine and a
smaller pyrimidine. This keeps the diameter of the DNA
molecule constant at 20 Å.
v) The base pairing is even more specific: Adenine (A) always
pairs with Thymine (T), while Cytosine (C) always pairs with
Guanine (G). This is called complementary base pairing.
- There are two hydrogen bonds between Adenine and Thymine
(A=T) and three hydrogen bonds between Cytosine and Guanine.
- For example, if one strand has the sequence AGCTAACA, the
other strand will have TCGATTGT.
vi) Successive base pairs are 3.4 Å apart, and there are ten
base pairs for each helical turn of the molecule.
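
As a minimal sketch of the pairing rules in item (v), the following Python snippet derives the partner strand base by base and counts hydrogen bonds using the figures given above (two per A-T pair, three per C-G pair). The names `complement` and `hydrogen_bonds` are illustrative and do not come from the paper, and the strand is read position by position without reversing it.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds formed by the pair that each base takes part in.
BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def complement(strand: str) -> str:
    """Return the complementary strand, position by position."""
    return "".join(PAIRS[base] for base in strand.upper())


def hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds between a strand and its complement."""
    return sum(BONDS[base] for base in strand.upper())


print(complement("AGCTAACA"))      # TCGATTGT
print(hydrogen_bonds("AGCTAACA"))  # 19
```

A lookup table keeps the rule explicit and raises a `KeyError` for any character that is not one of the four bases.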
3. Function of DNA
i) Directing protein synthesis: In the majority of organisms
DNA is the genetic material. A gene is a part of the DNA
molecule and acts mainly by directing protein synthesis. All
proteins are synthesized under instructions from genes.

ii) Replication: Among the various molecules found in a cell,
only DNA is capable of making an identical copy of itself by
replication. This is necessary to pass on exact copies of the
parental DNA to the daughter cells. This is called the
autocatalytic function.

iii) Transcription: DNA serves as the template for the
synthesis of ribonucleic acid (RNA). This is called the
heterocatalytic function.

iv) Mutation: Genes can replicate millions of times without an
error. Occasionally, however, an error occurs during
duplication; such an error is called a mutation. It creates a
new form of the gene, which may function in a different way
and can become important in evolution.

4. DNA Sequence Analysis
A DNA sequence, or genetic sequence, is a succession of
letters representing the primary structure of a real or
hypothetical DNA molecule or strand, with the capacity to
carry information as described by the central dogma of
molecular biology. The possible letters are A, C, G and T,
representing the four nucleotide bases of a DNA strand
(adenine, cytosine, guanine and thymine), which are covalently
linked to a phosphodiester backbone; AAAGTCTGAC is an example
of such a sequence. Shorter sequences of nucleotides are
referred to as oligonucleotides and are used in a range of
laboratory applications in molecular biology. Sequences can be
derived from biological raw material through a process called
DNA sequencing. The term DNA sequencing refers to methods for
determining the order of the nucleotide bases (adenine,
guanine, cytosine and thymine) in a DNA molecule. Knowledge of
DNA sequences has become indispensable for basic biological
research and in numerous applied fields such as diagnostics,
biotechnology and forensic biology. The advent of DNA
sequencing has significantly accelerated biological research
and discovery.
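
Treating a DNA sequence as a string over the four-letter alphabet can be sketched in a few lines of Python. This is an illustration only: the helper name `base_counts` and the choice to reject any other character are assumptions made for the example, not part of the paper.

```python
from collections import Counter

# The four letters that may appear in a DNA sequence.
DNA_ALPHABET = set("ACGT")


def base_counts(sequence: str) -> dict[str, int]:
    """Validate a DNA string and count how often each base occurs."""
    seq = sequence.upper()
    invalid = set(seq) - DNA_ALPHABET
    if invalid:
        raise ValueError(f"not a DNA sequence, found: {sorted(invalid)}")
    counts = Counter(seq)
    return {base: counts[base] for base in "ACGT"}


print(base_counts("AAAGTCTGAC"))  # {'A': 4, 'C': 2, 'G': 2, 'T': 2}
```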

4.1 Sequence Alignment
1. In bioinformatics, a sequence alignment is a way of
arranging sequences of DNA, RNA or protein to identify
regions of similarity that may be a consequence of
structural, functional or evolutionary relationships
between the sequences.
2. A method for DNA sequencing was developed by Maxam and
Gilbert in 1977.
3. Aligned sequences of nucleotide or amino acid residues are
typically represented as rows within a matrix.
4. Gaps are inserted between the residues so that identical
or similar characters are aligned in successive columns.

4.2 Pair-wise sequence alignment
Pair-wise sequence alignment is the process of comparing two
DNA or protein sequences by searching for a series of
individual characters or character patterns that are in the
same order in both sequences.
The sequences are aligned by writing them across a page in
two rows.
Identical or similar characters are placed in the same
column, and non-identical characters can be placed either in
the same column as a mismatch or opposite a gap in the other
sequence.

5. Types of pair-wise sequence alignment
There are two types of pair-wise sequence alignment:
a) Global alignment
b) Local alignment
a) Global alignment
In a global alignment, an attempt is made to align the entire
sequences, using all sequence characters up to both ends of
each sequence. Sequences that are quite similar and
approximately the same length are suitable candidates for
global alignment. The Needleman-Wunsch algorithm is used to
produce global alignments between pairs of DNA or protein
sequences.
A global alignment is made possible by including gaps either
within the middle of the alignment or at either end of one or
both sequences. An example for two protein sequences is shown
below:

L G P S S K Q T G K G S S R I W D N
| | | | | | |
L N I T K S A G K G Q I G R S N D W
Vertical lines indicate the presence of identical amino acids.

b) Local alignment
In a local alignment [1], stretches of sequence with the
highest density of matches are aligned, thus generating one
or more islands of matches, or sub-alignments, in the aligned
sequences. Local alignments are more suitable for aligning
sequences that differ in length or sequences that share a
conserved region or domain. The Smith-Waterman algorithm is
used to produce local alignments between pairs of DNA or
protein sequences.
The local alignment stops at the ends of regions of strong
similarity.
Example:
------------------TGKG------------------
                   |||
------------------AGKG------------------
Dashes in the figure indicate sequence not included in the
alignment.

5.A Three principal methods of pair-wise sequence alignment
Alignment of two sequences is performed using the following
methods:
1. The dot matrix method.
2. Dynamic programming algorithms.
3. Word or k-tuple methods, such as those used by FASTA and
BLAST.

6. Simple alignment
An alignment between two sequences is simply a pair-wise
match between the characters of each sequence. A true
alignment of nucleotide or amino acid sequences is one that
reflects the evolutionary relationship between two or more
homologs (sequences that share a common ancestor). Three
kinds of change can occur at any given position within a
sequence:
i) A mutation that replaces one character with another.
ii) An insertion that adds one or more positions.
iii) A deletion that deletes one or more positions.
Insertions and deletions have been found to occur in nature
at a significantly lower frequency than mutations.
Gaps in alignments are commonly added to reflect the
occurrence of insertions and deletions. In the simplest case,
where no internal gaps are allowed, aligning two sequences is
simply a matter of choosing the starting point of the shorter
sequence.
Consider the following two short sequences of nucleotides:
AATCTATA and AAGATA. These two sequences can be aligned in
only three different ways when no gaps are allowed.
Three possible simple alignments between the two short
sequences:

AATCTATA    AATCTATA    AATCTATA
AAGATA       AAGATA       AAGATA
To determine which of the three alignments shown is optimal,
we must decide how to evaluate, or score, each alignment. The
scoring function is determined by the amount of credit an
alignment receives for each aligned pair of identical
residues (the match score) and the penalty for each aligned
pair of non-identical residues (the mismatch score). The
score of a given alignment is:

\[ \text{score} = \sum_{i=1}^{n} \begin{cases} \text{match score}, & \text{if } seq1_i = seq2_i \\ \text{mismatch score}, & \text{if } seq1_i \neq seq2_i \end{cases} \]

where n is the length of the longer sequence. For example,
assuming a match score of 1 and a mismatch score of 0, the
scores for the three alignments would be 4, 1 and 3 from left
to right.
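
The three ungapped scores can be checked with a short Python sketch. The function name `ungapped_score` and its `offset` parameter are introduced here for illustration; positions of the longer sequence that have no partner are assumed to contribute nothing to the score.

```python
def ungapped_score(longer: str, shorter: str, offset: int,
                   match: int = 1, mismatch: int = 0) -> int:
    """Score the shorter sequence placed `offset` columns into the longer one."""
    # Only the overlapping window is compared, column by column.
    window = longer[offset:offset + len(shorter)]
    return sum(match if a == b else mismatch for a, b in zip(window, shorter))


longer, shorter = "AATCTATA", "AAGATA"
# Without gaps, the only choice is the starting column of the shorter sequence.
offsets = range(len(longer) - len(shorter) + 1)
scores = [ungapped_score(longer, shorter, k) for k in offsets]
print(scores)  # [4, 1, 3]
```

With these scores the first placement (offset 0) is the best of the three.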

7. GAPS
Consideration of the possibility of insertion and deletion
events significantly complicates sequence alignment by vastly
increasing the number of possible alignments between two or
more sequences.
For example, the two sequences above, which can be aligned in
only three different ways without gaps, can be aligned in 28
different ways when two internal gaps are allowed in the
shorter sequence.

Three of the possible alignments:
AATCTATA    AATCTATA    AATCTATA
AAG-AT-A    AAG--ATA    AA--GATA
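
One way to reproduce the figure of 28, under the assumption that the shorter sequence is padded to the full eight columns and that every choice of two gap columns is counted (including gaps at the ends), is to enumerate the placements directly. The helper `gapped_versions` is a hypothetical name introduced for this sketch.

```python
from itertools import combinations


def gapped_versions(shorter: str, width: int):
    """Yield every way of padding `shorter` to `width` columns with '-'."""
    n_gaps = width - len(shorter)
    for gap_cols in combinations(range(width), n_gaps):
        letters = iter(shorter)
        # Gap columns get '-', all other columns take the next letter in order.
        yield "".join("-" if col in gap_cols else next(letters)
                      for col in range(width))


versions = list(gapped_versions("AAGATA", len("AATCTATA")))
print(len(versions))           # 28
print("AA--GATA" in versions)  # True
```

The count equals the binomial coefficient C(8, 2), the number of ways of choosing which two of the eight columns hold a gap.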

7.1 Simple gap penalty
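In scoring an alignment that includes gaps, an additional
term, the gap penalty, must be included in the scoring
function. A simple alignment score for a gapped alignment can
be computed as follows:

\[ \text{score} = \sum_{i=1}^{n} \begin{cases} \text{gap penalty}, & \text{if } seq1_i = \text{'-'} \text{ or } seq2_i = \text{'-'} \\ \text{match score}, & \text{if no gaps and } seq1_i = seq2_i \\ \text{mismatch score}, & \text{if no gaps and } seq1_i \neq seq2_i \end{cases} \]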



For example, assuming a match score of 1, a mismatch score of
0 and a gap penalty of -1, the scores for the following three
gapped alignments of the same two sequences are 1, 3 and 3:

1) A A T C T A T A
   A A G - A T - A
   1+1+0-1+0+0-1+1 = 1
2) A A T C T A T A
   A A - G - A T A
   1+1-1+0-1+1+1+1 = 3
3) A A T C T A T A
   A A - - G A T A
   1+1-1-1+0+1+1+1 = 3
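
The gapped scoring rule can be sketched in Python as follows. The function name `gapped_score` is illustrative, both rows are assumed to have been padded to the same length with '-', and the third alignment is taken to be AA--GATA.

```python
def gapped_score(row1: str, row2: str,
                 match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Sum the per-column scores of two equally long alignment rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # a gap in either row
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # non-identical residues
    return total


for row2 in ("AAG-AT-A", "AA-G-ATA", "AA--GATA"):
    print(row2, gapped_score("AATCTATA", row2))
# AAG-AT-A 1
# AA-G-ATA 3
# AA--GATA 3
```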

8. Dynamic programming
The most obvious method, an exhaustive search of all possible
alignments, is generally not feasible. Consider, for example,
two modest-sized sequences of 100 and 95 nucleotides. If we
were to devise an algorithm that computed and scored all
possible alignments, our program would have to test nearly 55
million possible alignments just to consider the case where
exactly five gaps are inserted into the shorter sequence.
As the lengths of the sequences grow, the number of possible
alignments to search quickly becomes intractable, that is,
impossible to compute in a reasonable amount of time. We can
overcome this problem by using dynamic programming.
Dynamic programming [3] is a computational method that is
used to align two protein or nucleic acid sequences. The
method is very important for sequence analysis because it
provides the highest-scoring alignment between two sequences
under a given scoring scheme.
Dynamic programming can be used for two types of alignment:
a) Global alignment, for example the Needleman-Wunsch
algorithm.
b) Local alignment, for example the Smith-Waterman
algorithm.
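
To make the global case concrete, here is a minimal sketch of the Needleman-Wunsch score computation. It is not taken from the paper: the function name is illustrative, the gap penalty is assumed to be linear, and the default scores (match 1, mismatch 0, gap -1) are borrowed from the simple gap penalty example in Section 7.1.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = 0,
                           gap: int = -1) -> int:
    """Return the optimal global alignment score of a and b."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Global alignment: the edges carry increasing gap penalties.
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # align a[i] with b[j]
                          M[i - 1][j] + gap,     # gap in b
                          M[i][j - 1] + gap)     # gap in a
    # The score of aligning both sequences end to end.
    return M[m][n]


print(needleman_wunsch_score("AATCTATA", "AAGATA"))  # 3
```

The table has (m+1) x (n+1) cells, so the work grows with the product of the sequence lengths rather than with the number of possible alignments.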

8.1 Local alignment: the Smith-Waterman algorithm
A modification of the dynamic programming algorithm [1] for
sequence alignment provides the ability to create a local
alignment between two sequences. Local alignments are often
more meaningful than global matches because they identify
conserved local sequence domains that are present in both
sequences. The Smith-Waterman algorithm is a well-known
algorithm for performing local sequence alignment, that is,
for determining similar regions between two nucleotide or
protein sequences. Instead of looking at the total sequence,
the Smith-Waterman algorithm compares segments of all
possible lengths and optimizes the similarity measure. The
algorithm was first proposed by Temple Smith and Michael
Waterman in 1981.

Algorithm: LOCAL_ALIGNMENT_SCORE(A = a1 a2 ... am, B = b1 b2 ... bn)
begin
    M[0,0] = 0
    Best = 0
    Endi = 0
    Endj = 0
    for j = 1 to n do M[0,j] = 0
    for i = 1 to m do
        M[i,0] = 0
        for j = 1 to n do
            M[i,j] = max { 0,
                           M[i-1,j] + gap,
                           M[i,j-1] + gap,
                           M[i-1,j-1] + S(ai,bj) }
            if M[i,j] > Best then
                Best = M[i,j]
                Endi = i
                Endj = j
    output Best as the score of an optimal local alignment
end.

Here M[i,j] is the score at position i in sequence A and
position j in sequence B, S(ai,bj) is the score for aligning
the characters at positions i and j, and gap is the penalty
for a gap in sequence A or sequence B. Because of the 0 term
in the maximum, there are no values below zero in a local
alignment scoring matrix.
The most important changes relative to global alignment are:
1) The edges of the matrix are initialized to zero instead of
to increasing gap penalties.
2) The maximum score is never less than 0, and no pointer is
recorded unless the score is greater than 0.

3) The traceback starts from the highest score in the matrix
and ends at a score of 0.
The algorithm can be explained as follows. A matrix M is
built as:

\[ M(i,0) = 0, \quad 0 \le i \le m \]
\[ M(0,j) = 0, \quad 0 \le j \le n \]
\[ M(i,j) = \max \begin{cases} 0 \\ M(i-1,j-1) + w(a_i, b_j) \\ M(i-1,j) + w(a_i, -) \\ M(i,j-1) + w(-, b_j) \end{cases} \quad 1 \le i \le m,\ 1 \le j \le n \]

Where:
a, b = strings over the alphabet \Sigma
m = length(a)
n = length(b)
M(i,j) is the maximum similarity score between a suffix of
a[1..i] and a suffix of b[1..j]
w(c,d), with c, d \in \Sigma \cup \{-\}, is the gap-scoring
scheme, where '-' denotes a gap.
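
The recurrence can be transcribed almost literally into Python. In the sketch below the scoring scheme w(c, d) is passed in as a function; the names `make_w` and `local_score`, and the two short test strings, are assumptions made for illustration, while the scores (match 2, mismatch -3, gap -2) are the ones used in the paper's example.

```python
def make_w(match: int = 2, mismatch: int = -3, gap: int = -2):
    """Build a scoring scheme w(c, d) where '-' stands for a gap."""
    def w(c: str, d: str) -> int:
        if c == "-" or d == "-":
            return gap
        return match if c == d else mismatch
    return w


def local_score(a: str, b: str, w) -> int:
    """Fill the matrix M from the recurrence and return its largest entry."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at zero, as in M(i,0) = M(0,j) = 0.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(0,
                          M[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          M[i - 1][j] + w(a[i - 1], "-"),
                          M[i][j - 1] + w("-", b[j - 1]))
            best = max(best, M[i][j])
    return best


print(local_score("GGATCCAA", "TTATCCTT", make_w()))  # 8
```

Passing w as a function keeps the matrix code independent of the particular match, mismatch and gap values.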

EXAMPLE:
Find the best local alignment between these two sequences:
ATGCATCCCATGAC
TCTATATCCGT
using -2 as the gap penalty, -3 as the mismatch penalty and 2
as the score for a match.
Solution:
The traceback begins at the highest value in the scoring
matrix, which is also the alignment score. It starts from
this maximum value and traces back until it reaches 0, which
yields the alignment:
A T C C
| | | |
A T C C
Local alignment can be performed anywhere along the two
sequences; here the best local alignment has a score of 8:
Score = (A,A) + (T,T) + (C,C) + (C,C)
      = 2 + 2 + 2 + 2
      = 8.
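
The worked example can be reproduced with a compact Smith-Waterman sketch that also performs the traceback. This is an illustrative implementation rather than the authors' code: the function name is hypothetical, the gap penalty is linear, and when several moves explain a cell the traceback prefers the diagonal, then the vertical, then the horizontal move.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -3, gap: int = -2) -> tuple[int, str, str]:
    """Return the best local score and the two aligned rows."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best, end_i, end_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + s,
                          M[i - 1][j] + gap, M[i][j - 1] + gap)
            if M[i][j] > best:
                best, end_i, end_j = M[i][j], i, j

    # Trace back from the highest score until a cell with score 0.
    row_a, row_b = [], []
    i, j = end_i, end_j
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    # The rows were collected end to start, so reverse them.
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


score, top, bottom = smith_waterman("ATGCATCCCATGAC", "TCTATATCCGT")
print(score)   # 8
print(top)     # ATCC
print(bottom)  # ATCC
```

The traceback visits the aligned columns from the last to the first, which is why the rows are reversed before being returned.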
III. CONCLUSION
It can be concluded that the dynamic programming algorithm
can be used to provide alignments of DNA or protein sequences
that include either the entire sequences (global alignment)
or just localized regions that represent conserved domains
(local alignment). From repeatedly applying these algorithms
to sequences from different organisms, it is found that the
Needleman-Wunsch and Smith-Waterman algorithms are excellent
methods for finding the similarity and dissimilarity between
different organisms.

IV. REFERENCES
[1] Altschul, S., W. Gish, W. Miller, E. Myers, and D. Lipman. 1990. A basic local alignment search tool. J. Mol. Biol. 215: 403-410.
[2] Mitchison, Graeme (1998), Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1st ed.), Cambridge: Cambridge University Press, doi:10.2277/0521629713, ISBN 0-521-62971-3.
[3] Engle, M.L. and C. Burks. 1993. Artificially generated data sets for testing DNA fragment assembly algorithms. Genomics 16: 286-288.
[4] Altschul. Journal of Molecular Biology, 1990. 215; p. 403-410 (original BLAST paper).
[5] Altschul et al. Nucleic Acids Research, 1997. 25; p. 3389-3402 (PSI-BLAST paper).
[6] Gelfand, M.S., A.A. Mironov, and P.A. Pevzner. 1996. Spliced alignment: A new approach to gene recognition. Proc. Natl. Acad. Sci. 93: 9061-9066.
[7] S. R., What is dynamic programming?, Nature Biotechnology, 22, 909-910 (2004).
[8] Nocedal, J.; Wright, S. J.: Numerical Optimization, page 9, Springer, 2006.

Scoring matrix for the Smith-Waterman example (columns: ATGCATCCCATGAC; rows: TCTATATCCGT):
      A  T  G  C  A  T  C  C  C  A  T  G  A  C
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
T  0  0  2  0  0  0  2  0  0  0  0  2  0  0  0
C  0  0  0  0  2  0  0  4  2  2  0  0  0  0  2
T  0  0  2  0  0  0  0  2  1  0  0  2  0  0  0
A  0  2  0  0  0  2  0  0  0  0  2  0  0  2  0
T  0  0  4  2  0  0  2  0  0  0  0  4  2  0  0
A  0  2  0  0  0  2  0  0  0  0  2  0  0  2  0
T  0  0  4  2  0  0  4  2  0  0  0  4  0  0  0
C  0  0  2  0  4  0  0  6  4  2  0  0  0  0  2
C  0  0  0  0  2  0  0  4  8  6  4  2  0  0  2
G  0  0  0  2  0  0  0  2  6  5  3  1  4  2  0
T  0  0  2  0  0  0  2  0  4  3  2  5  3  1  0
