Computational DNA Sequence Analysis
Archana Yashodhar *1, Manjula *2, Praveen N *3, Pavithra K *4
*1234 Lecturer, Dept. of Computer Science, Mangalore University, Mangalore.
Abstract: DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It is at the centre of the Human Genome Project, which promises to revolutionize the biomedical sciences and the treatment of human diseases. Extensive research and development has taken place over the last few years in the area of DNA sequence analysis. In this paper we discuss some of the methods of DNA sequence analysis and an algorithm for aligning DNA sequences.
I. INTRODUCTION TO DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms and some viruses. The main role of the DNA molecule is the long-term storage of information. The DNA segments that carry this genetic information are called genes. Because DNA stores all the genetic information of an individual, it is what is passed on when an organism reproduces: an organism resembles its parents because its genetic information is a copy of theirs. For instance, the study of the DNA of tiny bacteria gives great insight into the working of the human body. In organisms whose cells have a nucleus, most of the DNA is located in the nucleus of the cell.
1.1 Chemical composition
DNA is composed of just three kinds of chemical substance: a five-carbon (pentose) sugar called deoxyribose; a phosphate group derived from phosphoric acid, which consists of an atom of phosphorus surrounded by four atoms of oxygen and three atoms of hydrogen; and nitrogen-containing nucleobases. There are two types of bases in DNA: purines and pyrimidines. Purines are structurally double-ring substances and pyrimidines are single-ring substances. The purines found in DNA are Adenine (A) and Guanine (G). The pyrimidines of DNA are Thymine (T) and Cytosine (C).
1.2 Nucleoside
When a molecule of a purine or pyrimidine base is linked to a molecule of the pentose sugar deoxyribose, a nucleoside is formed.
1.3 Nucleotides
When a nucleoside is joined to a phosphoric acid molecule, it becomes a nucleotide. Hence a nucleotide is a phosphorylated derivative (a phosphate ester) of a nucleoside, and it is named as a monophosphate. The nucleotide is the basic unit of the DNA molecule.
II. STRUCTURE OF DNA
DOUBLE HELIX: THE WATSON-CRICK MODEL OF DNA
In 1953, James D. Watson and Francis H. C. Crick, working at the University of Cambridge, England, built a model to explain how the polynucleotide chains are arranged in a DNA molecule. A polynucleotide is a chain of multiple nucleotides linked together. Fig. 1 shows the Watson-Crick model of DNA, which is a double helix.
Fig. 1 The Watson-Crick model of DNA.
International Journal of Computer Trends and Technology (IJCTT) - Volume 4, Issue 5, May 2013
The Watson-Crick model of DNA has the following important features:
i) The DNA molecule consists of two polynucleotide chains and hence is double-stranded. It has a constant diameter of 20 Å.
ii) Each strand is coiled in the form of a helix, and the two strands are coiled about each other to form a double helix. The DNA double helix is comparable to a twisted ladder with uprights and steps.
iii) The backbone of each helically coiled strand is composed of repeating units of sugar and phosphate.
iv) Base pairing is always between a larger purine and a smaller pyrimidine. This keeps the diameter of the DNA molecule constant at 20 Å.
v) The base pairing is even more specific: Adenine (A) is always paired with Thymine (T), while Cytosine (C) is always paired with Guanine (G). This is called complementary base pairing.
- There are two hydrogen bonds between Adenine and Thymine (A=T) and three hydrogen bonds between Cytosine and Guanine (C≡G).
- For example, if one strand has AGCTAACA, the other strand will have TCGATTGT.
vi) Successive base pairs are at a distance of 3.4 Å, and there are ten base pairs for each turn of the helix.
3. Function of DNA
i) Directing protein synthesis: In the majority of organisms DNA is the genetic material. A gene is a part of the DNA molecule and acts mainly by directing protein synthesis. All proteins are synthesized under instructions from genes.
ii) Replication: Among the various molecules found in a cell, only DNA is capable of making an identical copy of itself, by replication. This is necessary to pass on exact copies of the parental DNA to the daughter cells. This is called the autocatalytic function.
iii) Transcription: The synthesis of ribonucleic acid (RNA) on a DNA template. This is called the heterocatalytic function.
iv) Mutation: Genes replicate millions of times without an error. Sometimes, however, an error occurs during duplication; this is called a mutation. A mutation creates a new form of the gene. The mutated gene may function in a different way and can become important in evolution.
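The complementary base pairing rule of the Watson-Crick model (A with T, C with G), on which replication relies, can be illustrated with a short Python sketch. The helper name `complement` is hypothetical and chosen here for illustration only; the sketch pairs bases position by position and does not reverse the strand.

```python
# Watson-Crick pairing rule from the text: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complementary strand (hypothetical helper)."""
    return "".join(PAIR[base] for base in strand.upper())


if __name__ == "__main__":
    # The strand used as the example in the base-pairing discussion.
    print(complement("AGCTAACA"))
```

Applying the function twice returns the original strand, which mirrors the fact that each strand fully determines the other.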
4. DNA Sequence Analysis
A DNA sequence or genetic sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information as described by the central dogma of molecular biology. The possible letters are A, C, G and T, representing the four nucleotide bases of a DNA strand (adenine, cytosine, guanine and thymine), which are covalently linked to a phosphodiester backbone; AAAGTCTGAC is an example of such a sequence. Shorter sequences of nucleotides are referred to as oligonucleotides and are used in a range of laboratory applications in molecular biology. A sequence is derived from biological raw material through a process called DNA sequencing. The term DNA sequencing covers the methods for determining the order of the nucleotide bases (adenine, guanine, cytosine and thymine) in a molecule of DNA. Knowledge of DNA sequences has become indispensable for basic biological research and in numerous applied fields such as diagnostics, biotechnology and forensic biology. The advent of DNA sequencing has significantly accelerated biological research and discovery.
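As a small illustration of treating a DNA sequence as a string over the four-letter alphabet, the following Python sketch checks that a string contains only A, C, G and T and counts each base. The names `DNA_ALPHABET` and `base_counts` are assumptions made for this example, not part of any method described in the paper.

```python
from collections import Counter

# The four nucleotide letters named in the text.
DNA_ALPHABET = set("ACGT")


def base_counts(sequence: str) -> Counter:
    """Validate a DNA string and count its bases (illustrative helper)."""
    seq = sequence.upper()
    invalid = set(seq) - DNA_ALPHABET
    if invalid:
        raise ValueError(f"unexpected symbols in DNA sequence: {sorted(invalid)}")
    return Counter(seq)


if __name__ == "__main__":
    # The example sequence quoted in the text.
    print(base_counts("AAAGTCTGAC"))
```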
4.1 Sequence Alignment
1. In bioinformatics, a sequence alignment is a way of arranging sequences of RNA, DNA or protein to identify regions of similarity that may be a consequence of structural, functional or evolutionary relationships between the sequences.
2. A method for DNA sequencing was developed by Maxam and Gilbert in 1977.
3. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
4. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
4.2 Pair-wise sequence alignment
Pair-wise sequence alignment is the process of comparing two DNA or protein sequences by searching for a series of individual characters or character patterns that are in the same order in both sequences. The sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and non-identical characters are placed either in the same column as a mismatch or opposite a gap in the other sequence.
5. Types of pair-wise sequence alignment
There are two types of pair-wise sequence alignment:
a) Global alignment
b) Local alignment
a) Global alignment
An attempt is made to align the entire sequences, using all sequence characters up to both ends of each sequence. Sequences that are quite similar and approximately the same length are suitable candidates for global alignment. The Needleman-Wunsch algorithm is used to produce global alignments between pairs of DNA or protein sequences. A global alignment is made possible by including gaps either within the middle of the alignment or at either end of one or both sequences. An example with two protein sequences is shown below:
L G P S S K Q T G K G S S R I W D N
| | | | | | |
L N I T K S A G K G Q I G R S N D W
Vertical lines indicate the presence of identical amino acids.
b) Local alignment
In a local alignment [1], the stretches of sequence with the highest density of matches are aligned, thus generating one or more islands of matches, or sub-alignments, in the aligned sequences. Local alignments are more suitable for aligning sequences that differ in length or sequences that share a conserved region or domain. The Smith-Waterman algorithm is used to produce local alignments between pairs of DNA or protein sequences. The local alignment stops at the ends of regions of strong similarity. Example:
------------------TGKG-----------------------
                  ||||
------------------AGKG-----------------------
Dashes in the figure indicate sequence not included in the alignment.
5.A Three principal methods of pair-wise sequence alignment
Alignment of two sequences is performed using the following methods:
1. The dot matrix method.
2. The dynamic programming algorithms.
3. Word or k-tuple methods, such as those used by FASTA and BLAST.
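The paper names the dot matrix method without describing it further. As a hedged illustration of the usual idea (one sequence along the rows, the other along the columns, and a dot wherever the two characters are identical), a minimal Python sketch follows. The function names and the single-character comparison without any window or threshold are assumptions of this sketch.

```python
def dot_matrix(seq1: str, seq2: str) -> list[list[bool]]:
    """Cell [i][j] is True when seq1[i] and seq2[j] are the same character."""
    return [[a == b for b in seq2] for a in seq1]


def print_dot_plot(seq1: str, seq2: str) -> None:
    """Print the matrix with '*' for a match and '.' otherwise."""
    print("  " + " ".join(seq2))
    for a, row in zip(seq1, dot_matrix(seq1, seq2)):
        print(a + " " + " ".join("*" if hit else "." for hit in row))


if __name__ == "__main__":
    print_dot_plot("AATCTATA", "AAGATA")
```

Runs of dots along a diagonal indicate stretches that are identical in both sequences, which is what the method is used to find by eye.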
6. Simple alignment
An alignment between two sequences is simply a pair-wise match between the characters of each sequence. A true alignment of nucleotide or amino acid sequences is one that reflects the evolutionary relationship between two or more homologs (sequences that share a common ancestor). Three kinds of change can occur at any given position within a sequence:
i) A mutation that replaces one character with another.
ii) An insertion that adds one or more positions.
iii) A deletion that deletes one or more positions.
Insertions and deletions have been found to occur in nature at a significantly lower frequency than mutations. Gaps in alignments are commonly added to reflect the occurrence of this type of change. In the simplest case, where no internal gaps are allowed, aligning two sequences is simply a matter of choosing the starting point of the shorter sequence. Consider the following two short sequences of nucleotides: AATCTATA and AAGATA. These two sequences can be aligned in only three different ways when no gaps are allowed.
Three possible simple alignments between the two short sequences:
AATCTATA      AATCTATA      AATCTATA
AAGATA         AAGATA         AAGATA
To determine which of the three alignments shown is optimal, we must decide how to evaluate, or score, each alignment. The scoring function is determined by the amount of credit an alignment receives for each aligned pair of identical residues (the match score) and the penalty for each aligned pair of non-identical residues (the mismatch score). The score of a given alignment is:
$$\text{score} = \sum_{i=1}^{n} \begin{cases} \text{match score} & \text{if } \mathrm{seq1}_i = \mathrm{seq2}_i \\ \text{mismatch score} & \text{if } \mathrm{seq1}_i \neq \mathrm{seq2}_i \end{cases}$$
where n is the length of the longer sequence; positions at which the shorter sequence has no character contribute nothing. For example, assuming a match score of 1 and a mismatch score of 0, the scores for the three alignments are 4, 1 and 3 from left to right.
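The ungapped scoring described above can be sketched in Python as follows. The function name `ungapped_scores` and its default parameters (match score 1, mismatch score 0, as in the example) are assumptions of this sketch; it slides the shorter sequence along the longer one and scores each starting point.

```python
def ungapped_scores(longer: str, shorter: str, match: int = 1, mismatch: int = 0) -> list[int]:
    """Score every gap-free placement of `shorter` along `longer`."""
    scores = []
    for offset in range(len(longer) - len(shorter) + 1):
        window = longer[offset:offset + len(shorter)]
        # Sum the match or mismatch score over each aligned pair of residues.
        scores.append(sum(match if a == b else mismatch for a, b in zip(window, shorter)))
    return scores


if __name__ == "__main__":
    print(ungapped_scores("AATCTATA", "AAGATA"))  # [4, 1, 3]
```

The best of the three placements is therefore the first one, in which both sequences start at the same position.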
7. GAPS
Consideration of the possibility of insertion and deletion events significantly complicates sequence alignment by vastly increasing the number of possible alignments between two or more sequences. For example, the two sequences above, which can be aligned in only three different ways without gaps, can be aligned in 28 different ways when two internal gaps are allowed in the shorter sequence.
Three possible alignments:
AATCTATA      AATCTATA      AATCTATA
AAGATA        AAG--ATA      AA--GATA
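7.1 Simple gap penalty
In scoring an alignment that includes gaps, an additional term, the gap penalty, must be included in the scoring function. A simple alignment score for a gapped alignment can be computed as follows:
$$\text{score} = \sum_{i=1}^{n} \begin{cases} \text{gap penalty} & \text{if } \mathrm{seq1}_i = \text{'-'} \text{ or } \mathrm{seq2}_i = \text{'-'} \\ \text{match score} & \text{if no gaps and } \mathrm{seq1}_i = \mathrm{seq2}_i \\ \text{mismatch score} & \text{if no gaps and } \mathrm{seq1}_i \neq \mathrm{seq2}_i \end{cases}$$
where n is the number of aligned columns.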
For example, assuming a match score of 1, a mismatch score of 0 and a gap penalty of -1, the scores for the three gapped alignments shown below are 1, 3 and 3:
1) A A T C T A T A
   A A G - A T - A
   1 + 1 + 0 - 1 + 0 + 0 - 1 + 1 = 1
2) A A T C T A T A
   A A - G - A T A
   1 + 1 - 1 + 0 - 1 + 1 + 1 + 1 = 3
3) A A T C T A T A
   A A - - G A T A
   1 + 1 - 1 - 1 + 0 + 1 + 1 + 1 = 3
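The gapped scoring function can be sketched in Python in the same way. The name `gapped_score` and its defaults (match 1, mismatch 0, gap penalty -1) are assumptions of this sketch; the alignments used below are AAG-AT-A, AA-G-ATA and AA--GATA, each written against AATCTATA.

```python
def gapped_score(row1: str, row2: str, match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Score two equal-length alignment rows in which '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # a residue aligned against a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


if __name__ == "__main__":
    for row in ("AAG-AT-A", "AA-G-ATA", "AA--GATA"):
        print(row, gapped_score("AATCTATA", row))
```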
8. Dynamic programming
The most obvious method, an exhaustive search of all possible alignments, is generally not feasible. Consider, for example, two modest-sized sequences of 100 and 95 nucleotides. If we were to devise an algorithm that computed and scored all possible alignments, our program would have to test nearly 55 million possible alignments just to cover the case where exactly five gaps are inserted into the shorter sequence. As the lengths of the sequences grow, the number of possible alignments to search quickly becomes intractable, that is, impossible to compute in a reasonable amount of time. We can overcome this problem using dynamic programming. Dynamic programming [3] is a computational method that is used to align two protein or nucleic acid sequences. The method is very important for sequence analysis because it provides the highest-scoring alignment between two sequences under the chosen scoring scheme. Dynamic programming alignment has two types:
a) global alignment
b) local alignment
An example of a global alignment method is the Needleman-Wunsch algorithm. An example of a local alignment method is the Smith-Waterman algorithm.
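The paper does not list the Needleman-Wunsch algorithm itself, so the following Python sketch uses the standard textbook formulation as an assumption: a score matrix whose first row and column hold accumulated gap penalties, filled with the maximum of the diagonal, upper and left neighbours. The function name, the linear gap penalty and the default scores (1, 0, -1, matching the simple scoring example) are choices made for this illustration. It returns only the optimal global score, not the alignment.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1) -> int:
    """Optimal global alignment score by dynamic programming (sketch)."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    # Global alignment: leading gaps are penalised, so the edges grow by `gap`.
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # align a[i] with b[j]
                          M[i - 1][j] + gap,     # a[i] against a gap
                          M[i][j - 1] + gap)     # b[j] against a gap
    return M[m][n]


if __name__ == "__main__":
    print(needleman_wunsch_score("AATCTATA", "AAGATA"))  # 3
```

The table has (m+1)(n+1) cells and each is filled in constant time, which is why this approach avoids the exhaustive enumeration described above.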
8.1 Local alignment: the Smith-Waterman algorithm
A modification of the dynamic programming algorithm [1] for sequence alignment provides the ability to create a local alignment between two sequences. Local alignments are often more meaningful than global matches because they identify conserved local sequence domains that are present in both sequences. The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment, that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The algorithm was first proposed by Temple Smith and Michael Waterman in 1981.
Algorithm: LOCAL_ALIGNMENT_SCORE(A = a1 a2 ... am, B = b1 b2 ... bn)
Begin
  M[0,0] = 0
  Best = 0
  Endi = 0
  Endj = 0
  for j = 1 to n do
    M[0,j] = 0
  for i = 1 to m do
    M[i,0] = 0
    for j = 1 to n do
      M[i,j] = max { 0,
                     M[i-1,j] + gap,
                     M[i,j-1] + gap,
                     M[i-1,j-1] + S(Ai,Bj) }
      if M[i,j] > Best then
        Best = M[i,j]
        Endi = i
        Endj = j
  Output Best as the score of an optimal local alignment
End.
The implication of this is that there are no values below zero in a local alignment scoring matrix. Here M[i,j] is the score at position i in sequence A and position j in sequence B, S(Ai,Bj) is the score for aligning the characters at positions i and j, and gap is the penalty for a gap in sequence A or sequence B. The most important changes relative to global alignment are:
1) The edges of the matrix are initialized to zero instead of to increasing gap penalties.
2) The maximum score is never less than 0, and no pointer is recorded unless the score is greater than 0.
3) The traceback starts from the highest score in the matrix and ends at a score of 0.
The algorithm can be explained as follows. A matrix M is built as:
$$M(i,0) = 0, \quad 0 \le i \le m$$
$$M(0,j) = 0, \quad 0 \le j \le n$$
$$M(i,j) = \max \begin{cases} 0 \\ M(i-1,j-1) + w(a_i, b_j) \\ M(i-1,j) + w(a_i, -) \\ M(i,j-1) + w(-, b_j) \end{cases}, \quad 1 \le i \le m,\ 1 \le j \le n$$
where:
a, b = strings over the alphabet Σ
m = length(a)
n = length(b)
M(i,j) is the maximum similarity score between a suffix of a[1..i] and a suffix of b[1..j]
w(c,d), with c, d ∈ Σ ∪ {-}, is the scoring scheme, in which '-' denotes a gap.
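The recurrence and the traceback rule can be sketched in Python as below. This is a minimal illustration rather than a reference implementation: the function name `smith_waterman`, the linear gap penalty and the tie-breaking order in the traceback (diagonal first, then up, then left) are assumptions of the sketch. The default scores are those used in the paper's worked example (match 2, mismatch -3, gap -2).

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -3, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    m, n = len(a), len(b)
    # Edges stay at zero: a local alignment may start anywhere.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best, end_i, end_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0,
                          M[i - 1][j - 1] + s,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)
            if M[i][j] > best:
                best, end_i, end_j = M[i][j], i, j

    # Traceback from the highest cell until a cell with score 0 is reached.
    row_a, row_b = [], []
    i, j = end_i, end_j
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    # Characters were collected back to front, so reverse them.
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(smith_waterman("ATGCATCCCATGAC", "TCTATATCCGT"))
```

For the sequences ATGCATCCCATGAC and TCTATATCCGT the sketch returns a score of 8 with the segment ATCC aligned in both sequences.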
EXAMPLE: Find the best local alignment between these two sequences:
ATGCATCCCATGAC
TCTATATCCGT
using -2 as the gap penalty, -3 as the mismatch penalty and 2 as the score for a match.
Solution: The traceback begins at the highest value in the scoring matrix, which is also the alignment score. It starts from that maximum value, 8, and moves back until it reaches a 0. Read in sequence order, this yields the alignment:
A T C C
| | | |
A T C C
with an alignment score of 8: Score = (AA) + (TT) + (CC) + (CC) = 2 + 2 + 2 + 2 = 8. A local alignment can occur anywhere along the two sequences.
III. CONCLUSION
It can be concluded that the dynamic programming algorithm can be used to provide alignments of DNA or protein sequences that include either the whole of the sequences (a global alignment) or just the localized regions that represent conserved domains (a local alignment). From applying these algorithms repeatedly to sequences from different organisms, it is found that the Needleman-Wunsch and Smith-Waterman algorithms are excellent methods for finding the similarities and dissimilarities between different organisms.
IV. REFERENCES
[1] Altschul, S., W. Gish, W. Miller, E. Myers, and D. Lipman. 1990. A basic local alignment search tool. J. Mol. Biol. 215: 403-410.
[2] Mitchison, Graeme (1998), Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1st ed.), Cambridge: Cambridge University Press, doi:10.2277/0521629713, ISBN 0-521-62971-3.
[3] Engle, M. L. and C. Burks. 1993. Artificially generated data sets for testing DNA fragment assembly algorithms. Genomics 16: 286-288.
[4] Altschul. Journal of Molecular Biology, 1990. 215; p. 403-410 (original BLAST paper).
[5] Altschul et al. Nucleic Acids Research, 1997. 25; p. 3389-3402 (PSI-BLAST paper).
[6] Gelfand, M. S., A. A. Mironov, and P. A. Pevzner. 1996. Spliced alignment: A new approach to gene recognition. Proc. Natl. Acad. Sci. 93: 9061-9066.
[7] Eddy, S. R., What is dynamic programming?, Nature Biotechnology, 22, 909-910 (2004).
[8] Nocedal, J.; Wright, S. J.: Numerical Optimization, page 9, Springer, 2006.
Scoring matrix for the example in Section 8.1 (columns: ATGCATCCCATGAC; rows: TCTATATCCGT; match 2, mismatch -3, gap -2):
      A  T  G  C  A  T  C  C  C  A  T  G  A  C
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
T  0  0  2  0  0  0  2  0  0  0  0  2  0  0  0
C  0  0  0  0  2  0  0  4  2  2  0  0  0  0  2
T  0  0  2  0  0  0  2  2  1  0  0  2  0  0  0
A  0  2  0  0  0  2  0  0  0  0  2  0  0  2  0
T  0  0  4  2  0  0  4  2  0  0  0  4  2  0  0
A  0  2  2  1  0  2  2  1  0  0  2  2  1  4  2
T  0  0  4  2  0  0  4  2  0  0  0  4  2  2  1
C  0  0  2  1  4  2  2  6  4  2  0  2  1  0  4
C  0  0  0  0  3  1  0  4  8  6  4  2  0  0  2
G  0  0  0  2  1  0  0  2  6  5  3  1  4  2  0
T  0  0  2  0  0  0  2  0  4  3  2  5  3  1  0