
DNA Sequence Analysis

Takashi Gojobori, National Institute of Genetics, Mishima, Japan
Allison Wyndham, National Institute of Genetics, Mishima, Japan
Analysis of DNA sequence data can reveal biologically useful information and with the immense quantity of data now available, sequence analysis provides a valuable complement to empirical studies and is an essential part of the characterization of new proteins. Useful methodologies include searching databases, conducting sequence alignments, estimating the evolutionary or genetic distance and constructing molecular phylogenetic trees.

Secondary article
Article Contents
Introduction; Database Searching; Multiple Sequence Alignment; Genetic or Evolutionary Distance; Molecular Phylogenetics; Conclusions

Introduction
DNA sequence data contain a wealth of biologically useful information. The question of how to extract this information has given rise to the field of DNA sequence analysis. Sequence analysis may include the comparison of an uncharacterized query sequence with a gene of known function. Sequence comparisons may reveal conserved characteristics shared by a group of related sequences, such as functional domains or active site residues. Related DNA sequences may also contain phylogenetic information that can be used to infer evolutionary relationships among a taxonomic group. Analysis can focus on parts of genes and genomes or on their entire length. The availability of sequence databases means that pair-wise comparisons can now be performed with the same query sequence against many thousands of identified sequences. Sequencing DNA in the laboratory is rapid and simple to implement; hence interest is currently focused on data analysis that utilizes statistical methods and computer analysis. The focus of this article will therefore be on bioinformatics approaches, including database searching, sequence alignment, estimation of genetic or evolutionary distances and construction of molecular phylogenetic trees.

ENCYCLOPEDIA OF LIFE SCIENCES / © 2001 Macmillan Publishers Ltd, Nature Publishing Group / www.els.net

Database Searching
Genome projects and the accumulation of DNA sequences
Advances in DNA sequencing technology enable us to determine the nucleotide sequence of an entire genome of a species as well as individual gene sequences. The genomes of numerous prokaryotic and viral species have already been completely sequenced. Large-scale genome projects have also completed the sequence of five eukaryotic genomes (Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster and Homo sapiens, as of 2001), and the genomes of a large number of other eukaryotes are expected to be complete within the next few years. Sequence data may take several forms. In complementary DNA (cDNA) sequencing projects, the nucleotide sequences of messenger RNAs (mRNAs) that are expressed in a tissue or cells are determined in the form of full-length cDNAs or partial but distinguishable fragments of cDNAs called ESTs (expressed sequence tags). As a cDNA contains a protein-coding region, cDNA-sequencing projects also contribute to the accumulation of amino acid sequence data. Genomic DNAs are also available as large sections of genome sequence (contained on cloning vectors such as cosmids or bacterial artificial chromosomes) and as small fragments that are the preliminary results of genome projects (high-throughput genomic sequences and genome survey sequences).

DNA databanks
As DNA sequencing technology has advanced, DNA and amino acid sequences have accumulated with enormous speed. This vast volume of data is stored and maintained by a tripartite network of international databanks called the International Nucleotide Sequence Database Collaboration. This comprises the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics (NIG) in Japan, the European Bioinformatics Institute (EBI) at the European Molecular Biology Laboratory (EMBL) and the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) in the United States. These databanks exchange and update information daily and are searchable through the Internet. A variety of services are also provided by these organizations, including data retrieval and searching tools, curated material such as patent information, relevant literature and cross-references where possible. Any user can conduct a database search using online tools or other methods including e-mail. In addition to nucleotide sequence information, several databases hold only amino acid sequences. Primarily, these are SwissProt, which collects experimentally confirmed protein sequences, and its supplement TrEMBL, which contains the hypothetical translation products of non-EST nucleotide sequences held by EMBL. Additional protein databases are available through the Protein Information Resource (PIR) and the Martinsried Institute for Protein Sequences (MIPS). As well as primary amino acid sequences, these resources provide access to databases of protein structures and conserved motifs (e.g. the PDB protein structure database, ProSite and Pfam). Additional databases and specialist Internet-based resources useful for sequence analysis are presented in detail in regular database issues of Nucleic Acids Research.

Similarity searches
One of the most informative methods used in sequence data analysis is similarity searching. For coding DNAs, similarity at the sequence level implies some structural or functional similarity between the protein products. Searching a database with an uncharacterized gene sequence may reveal such things as homologues in other species or a sequence element that forms a structural domain within the protein. Searches can be conducted with either nucleotide or peptide sequences. However, detection of similarity at the nucleotide level is difficult unless the sequences are closely related. For analysis of coding DNAs, similarity searching with the translated protein sequence is therefore more informative.

Commonly used tools for similarity searching are the FASTA and BLAST computer programs (Pearson and Lipman, 1988; Altschul et al., 1990). When comparing sequences, both these tools search for discrete regions of similarity (local alignment), rather than attempting to align both sequences end-to-end (global alignment). Regions of similarity are scored according to a substitution matrix, rather than a simple match/mismatch scheme. A substitution matrix provides sensitivity for the type of residue under consideration and, in the case of mismatches, for the identity of the comparator. For example, high scores are awarded to conservative substitutions (e.g. Leu→Ile) and for conservation of structurally significant residues such as Trp or Pro.

The type of substitution matrix can be chosen to suit the user, allowing for comparison of closely related or divergent sequences. Frequently used matrices are the PAM and BLOSUM families (Dayhoff et al., 1978; Henikoff and Henikoff, 1992). The PAM matrices measure units of evolutionary divergence as point-accepted mutations (PAM). One PAM is a unit of divergence where 1% of the amino acids have undergone substitution (this may include multiple substitutions at the same site). Therefore, the PAM250 matrix allows for 250 substitutions per 100 residues, while PAM70 allows for only 70 and is appropriate to more closely related sequences. The BLOSUM nomenclature refers to the maximum level of amino acid identity between comparable sequences above which the matrix treats them as the same sequence. For example, the BLOSUM62 matrix is useful for comparing sequences with less than 62% identity, while the BLOSUM30 matrix is appropriate for detecting weak similarity. Compared with the PAM matrices, the BLOSUM matrices are generally more effective in similarity searches because the data used to construct them are more comprehensive and relate to more divergent proteins. However, the PAM matrices are still recommended over the BLOSUM family for searching with short query sequences (PAM30 and PAM70 in this case).

Neither FASTA nor BLAST conducts an exhaustive search, which would be time consuming; rather, heuristics are used to guide the algorithm, allowing the user to set parameters to provide a balance between speed and sensitivity. As a result, however, FASTA in particular may not identify direct repeats or multiple domains that are shared by two proteins. The user-variable parameters include gap penalties and word size. Sequences rarely align evenly; therefore, penalties are applied for gap opening and gap extension. Gap extension penalties are usually lower than those for gap opening, given that once a gap appears in a sequence, its size often varies. Introduction of gaps was eschewed in earlier versions of BLAST; however, version 2.0 includes the Gapped-BLAST function that deals explicitly with gaps. Word size, or k-tuple, sets a threshold for the size of the match. At a word size of three, an alignment will be initiated whenever three consecutive residue matches are detected. Small word sizes are prone to stochastic errors but are better for detecting weak similarity. Rather than requiring an exact word match as FASTA does, BLAST regards a word that is only similar as a hit (above a set threshold), thus maintaining both high word size and sensitivity without sacrificing speed.

The BLAST program was developed at NCBI specifically for database searching and is maintained with strong technical support and continuing refinement (NCBI BLAST).
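The role of word size and gap penalties can be illustrated with a short sketch. It shows only the exact word (k-tuple) matching step that the text attributes to FASTA, followed by a count of hits per diagonal; it is not the FASTA or BLAST implementation. The function names, the example sequences and the penalty values (gap_open=10, gap_extend=1, with the opening cost covering the first gapped residue) are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def index_words(seq, k):
    """Map every k-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact word shared by both."""
    index = index_words(subject, k)
    hits = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits


def best_diagonal(hits):
    """Return (diagonal, hit_count) for the diagonal with the most word hits.

    A diagonal is subject_pos - query_pos; hits on one diagonal can be
    joined into an ungapped local alignment.
    """
    if not hits:
        return None
    return Counter(s - q for q, s in hits).most_common(1)[0]


def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of one gap: an opening charge plus a smaller charge per extension."""
    if length < 1:
        raise ValueError("a gap must span at least one residue")
    return gap_open + gap_extend * (length - 1)


# Hypothetical peptide sequences: the subject contains the query at offset 2.
query = "MKTAYIAKQR"
subject = "GGMKTAYIAKQRLL"
hits = word_hits(query, subject, k=3)
diagonal, support = best_diagonal(hits)
print(f"{len(hits)} word hits; diagonal {diagonal} is supported by {support}")
print("cost of a 3-residue gap:", affine_gap_cost(3))
```

A larger k gives fewer chance hits and a faster search, at the price of missing weak similarities that never share k consecutive identical residues.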

Functional predictions
Similarity searching is particularly useful for predicting the functions of newly identified sequences. This is based upon the premise that regions that are conserved must be functionally or structurally important. If we search a sequence and find a region where conservation is strong, it is reasonable to speculate that the region is functionally or structurally important and that these characteristics are shared by the similar sequences. Sequences may be similar because they share the same ancestral sequence, such as homologues of the same gene in different species. However, differentiation may occur following gene duplication, giving rise to a family of genes with related but distinct functions. In some cases, sequence similarity may persist after functional similarity has been eroded. Conversely, structural and even functional similarity between proteins can exist despite the complete absence of sequence similarity, and such relationships will not be detected using a similarity search.

Similarity may also be restricted to one domain or a sequence motif shared by two sequences. This may apply to both coding and noncoding DNAs. Noncoding DNA may contain conserved regulatory elements that control gene expression, heterochromatin structure or DNA silencing. When considering a coding sequence, the protein it encodes may possess more than one functional domain. As each of the different domains is assumed to have a single and particular biological function, a protein having multiple domains is considered to exhibit multiple functions. A protein having multiple and different domains is called a mosaic protein. Accordingly, when conducting a similarity search, a query sequence may match many different proteins so that no consistent function can be inferred. Therefore, the function of proteins or genes should be considered region by region using similarity searches.

Having detected a region of similarity between two sequences, the task remains to determine what the similarity means. The identity of the match sequence(s) may provide a clue to this. Ideally, a specific region of the query sequence will produce matches to other sequences whose functions are both known and interrelated. Moreover, global similarity may be detected between two sequences, which may indicate the existence of another member of the same gene family. A discrete region of similarity may also fall into one of the known classes of domains or motifs, in which case a search of databases such as ProSite or Pfam should reveal this. At this point any available empirical data relating to the sequence will be useful for functional predictions. Further, if database searches have revealed a group of related sequences that all match the query, then constructing a multiple sequence alignment is the most useful means to identify regions of functional importance.
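Checking a sequence region against a known motif can be sketched as a pattern scan. The example below assumes the dash-separated pattern notation used by ProSite and handles only a small subset of it (single residues, [..] alternatives, {..} exclusions, x wildcards and repeat counts); it does not query any database. The function names and the test sequence are hypothetical, and the pattern shown is the one commonly quoted for N-glycosylation sites.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple ProSite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat, e.g. x(2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # [ST]-style alternatives and single letters are already valid regex.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (start, matched_text) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# Hypothetical protein fragment scanned with the N-glycosylation pattern.
matches = find_motif("MNGSAANPTKNVTQ", "N-{P}-[ST]-{P}")
print(matches)
```

A short pattern of this kind matches by chance quite often, so a hit is a prompt for further evidence rather than a functional assignment.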

Multiple Sequence Alignment
Alignment of a group of related sequences can achieve two purposes: the assembly of a group of evolutionarily related sequences for the identification of conserved regions, or the alignment of a group of proteins to illustrate structural characteristics (e.g. for secondary structure prediction). The focus here will be on the former purpose. Alignment of more than two sequences, either DNA or protein, is more complex than the pair-wise comparisons described above, particularly for divergent sequences. Since the aim of multiple alignment is usually to compare a group of sequences along their whole length, most multiple alignment tools seek a global rather than a local alignment. Alignment of multiple sequences involves maximizing similarity among the DNA or amino acid sequences and allowing insertion of gaps, where a gap is a site or a series of sites that has no corresponding sites in a given sequence. When DNA sequences are aligned with each other, identification of appropriate corresponding nucleotides may be quite difficult because there are only four kinds of nucleotides. Alignment of amino acid sequences is therefore easier and can be more meaningful if the intent is to compare a group of related sequences for potential functional characteristics.

The CLUSTAL tool
The CLUSTAL tool is commonly used for multiple alignment. It uses a progressive algorithm to align sequences in successively larger groups, beginning with the most closely related sequences. Using the CLUSTAL W package (Thompson et al., 1994), all pairs of sequences are compared and a tentative measure of similarity is derived, represented by a distance matrix. This is used to produce a phylogenetic guide tree, using the neighbour-joining method. The branching pattern of the tree is used to determine the most closely related pair of sequences, which is aligned first. A final alignment is obtained by repeating this procedure, following the tree, until its root is reached. As with similarity searching, the user can select the appropriate gap penalties, word size and substitution matrix. In the case of DNA sequences, transversion-type changes can be weighted more heavily than transition-type changes, because the latter changes occur much more frequently than the former (Kimura, 1983). Regardless of the software used to create an alignment, visual inspection is usually required to refine the final alignment, particularly to ensure the correct placement of gaps.

Other alignment strategies
Multiple sequence alignment can become difficult among divergent sequences, for example, at less than 25% amino acid identity. A relatively new approach has been developed that can produce high-quality alignments using only weakly similar sequences. Instead of using the conventional dynamic programming algorithms and substitution matrices common to regular alignment tools, a statistical concept known as a hidden Markov model is used to model the residue identities at each site in the alignment (Baldi et al., 1994). This method has been used to align very large datasets (hundreds of sequences), producing alignments that require little or no manual modification. The hidden Markov model method has been used to create the Pfam database of families of conserved protein motifs. This method has enabled both the efficient identification and alignment of protein sequences sharing only remote similarity. However, this modelling method requires training of the model on a preliminary dataset before it can be used in practice and is not straightforward for use by the casual investigator.

Where available, solved three-dimensional protein structures can assist in constructing a multiple alignment. We can examine the multiple alignment of amino acid sequences using families of proteins where tertiary structures of at least two members have already been solved. Then the aligned tertiary structures can be compared with our sequence alignment. However, the number of available structure solutions is dwarfed by the amount of sequence data now available.
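The first stage of a progressive alignment, comparing all pairs of sequences to obtain a distance matrix, can be sketched with a conventional dynamic programming (Needleman-Wunsch) global alignment. This is a simplified illustration and not the CLUSTAL W procedure: it assumes a plain match/mismatch score in place of a substitution matrix and a single linear gap penalty in place of separate opening and extension penalties. All function names, scores and sequences are hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def p_distance(aligned_a, aligned_b):
    """Proportion of differing residues, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Pair-wise distances from which a guide tree could be built."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            _, ai, aj = global_align(seqs[i], seqs[j])
            dist[i][j] = dist[j][i] = p_distance(ai, aj)
    return dist


for row in distance_matrix(["GATTACA", "GATTACA", "GCTTACA", "GATACA"]):
    print(["%.3f" % d for d in row])
```

The pair with the smallest distance would be aligned first, and larger groups would then be merged in the order given by the guide tree.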

Genetic or Evolutionary Distance
Genetic distance is a quantitative measure of evolutionary similarity or dissimilarity between taxa. This is based on the assumption that a pair of genes descended from the same ancestral sequence will independently accumulate nucleotide changes (substitutions) over time. Using sequence data, the numbers of nucleotide or amino acid differences among sequences (substitutions) are used to calculate genetic distance. The number of observed character differences often does not represent the real number of substitutions that have occurred, particularly between distantly related sequences where multiple mutations of the same character may have occurred over time. Further, one may wish to differentiate between nucleotide substitutions that cause amino acid mutations (nonsynonymous substitutions) and those that are silent (synonymous substitutions). The methods introduced below attempt to compensate for these and other factors.

Number of nucleotide substitutions
The number of nucleotide substitutions is estimated by making pair-wise comparisons of nucleotide sequences and by correcting for multiple substitutions at the same site. For this, one needs a model of nucleotide substitution. Since random patterns of nucleotide substitution are rare, parameters are used in substitution models to take patterns that are generally observed into account. At its simplest, the one-parameter method of Jukes and Cantor (1969) assumes that the rates of nucleotide substitutions between all possible pairs of different nucleotides are equal. The number of nucleotide substitutions ($K_n$) is then estimated by the equation

$K_n = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$

where $p$ is the observed proportion of nucleotide differences. There are many modifications of this formula. For example, Kimura's (1980) two-parameter method was developed under the assumption that the rates of transitions and transversions are different (which is often the case). Since then, many different methods have been developed to estimate the correct number of substitutions, each varying in parameters and assumptions. The methods of Li et al. (1985) and Nei and Gojobori (1986) have also been developed to estimate the numbers of synonymous and nonsynonymous substitutions per site.

Number of amino acid substitutions
Numbers of amino acid substitutions are estimated in a manner similar to nucleotide substitutions. Under the assumption that the substitution rates between any pair of amino acids are equal, the number of amino acid substitutions can be given by the formula

$K_a = -\ln(1 - p)$

where $p$ represents the proportion of amino acid differences. However, the substitution rate between similar amino acids is generally much higher than that between dissimilar ones. To account for this bias, Kimura (1983) proposed the modified formula

$K_a = -\ln\left(1 - p - \frac{1}{5}p^2\right)$

The substitution numbers produced by this formula are very close to those produced using Dayhoff's PAM algorithm (Dayhoff et al., 1978; used for regular amino acid substitution matrices, discussed above).

Genetic distance and phylogenetics
According to the molecular clock theory, first proposed by Zuckerkandl and Pauling (1965), sequences diverge at a constant rate. Therefore, the genetic distance between two related sequences can also be used as a measure of the time elapsed since divergence from the common ancestral sequence (Kimura, 1983). Since the proposal of the molecular clock, researchers found that in fact the estimated substitution rates of many sequences vary considerably. Further, highly divergent sequences will have accumulated multiple substitutions at the same site (referred to as saturation) such that the genuine genetic distances will become difficult to estimate. Nevertheless, genetic distance between sequences can be used as the basis to reconstruct a phylogenetic tree that describes the evolutionary relationships among taxa. Alternatively, nucleotide or amino acid character information can be used to construct a tree. Both these approaches are discussed in the following section.
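The distance corrections in this section can be evaluated directly, as in the minimal sketch below. The Jukes and Cantor formula and the two amino acid formulas follow the equations in the text. The two-parameter formula is the standard published form of Kimura's (1980) method and is not given in the text, so it is included here as an assumption. The function names and the example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) for two aligned DNA sequences."""
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return sites, transitions, transversions


def jukes_cantor(p):
    """K_n = -(3/4) ln(1 - (4/3) p); undefined once p reaches 3/4 (saturation)."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_two_parameter(ts, tv):
    """Two-parameter distance from the proportions of transitions and transversions."""
    return -0.5 * math.log(1 - 2 * ts - tv) - 0.25 * math.log(1 - 2 * tv)


def amino_acid_distance(p, corrected=True):
    """K_a = -ln(1 - p), or Kimura's (1983) form -ln(1 - p - p^2 / 5)."""
    if corrected:
        return -math.log(1 - p - p * p / 5)
    return -math.log(1 - p)


# Hypothetical aligned sequences with one transition and one transversion.
sites, n_ts, n_tv = count_differences("AAAAAAAAAA", "AAAAAAAAGC")
p = (n_ts + n_tv) / sites
print("observed p:", p)
print("Jukes-Cantor:", round(jukes_cantor(p), 4))
print("two-parameter:", round(kimura_two_parameter(n_ts / sites, n_tv / sites), 4))
```

In each case the corrected distance exceeds the observed proportion of differences, and the gap between the two widens as p grows, which is the effect of multiple substitutions at the same site.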

Molecular Phylogenetics
Molecular phylogenetics is the study of molecular evolution by constructing phylogenetic trees based on DNA and amino acid sequences. As the name suggests, a molecular phylogenetic tree is a branching representation of phylogenetic relationships among sequences. The terminal nodes (leaves) of the tree represent existing sequences or species, referred to as operational taxonomic units (OTUs), while internal nodes represent hypothetical ancestral sequences. Ideally, the tree is bifurcating, that is, each internal node gives rise to two branches representing the products of a speciation event. Necessarily, phylogenetic reconstruction involves using incomplete data, since the true ancestral sequences will almost always remain unknowable. Therefore, assumptions must be made during tree reconstruction in the place of ancestral data. These assumptions and the quality of the data will determine whether the final tree is reliable. Methods for constructing a phylogenetic tree can be separated into two major categories, depending on the traits used. These are character-based methods and distance-based methods (Table 1).

Table 1 Methods for constructing a phylogenetic tree

Classification of trees    Method                Rooted or unrooted tree
Distance-based methods     UPGMA                 Rooted trees
                           Neighbour-joining     Unrooted trees
                           Minimum evolution     Rooted or unrooted
Character-based methods    Parsimony             Rooted
                           Compatibility         Rooted
                           Maximum-likelihood    Rooted

Gene tree and species tree
When a phylogenetic tree is constructed by using gene sequences, the tree obtained (the gene tree) may be different from the historical evolutionary tree of the species (the species tree) that can be estimated utilizing information other than sequences, such as morphological differences from fossil records. This is a particular problem when paralogous genes are included in the analysis. Ideally, orthologous genes (those descended from a common ancestor through speciation) should be used to construct a phylogeny. However, if a duplication event has occurred during evolution that results in genes a and b, then a in one descendant species is the paralogue of b in another (while the a genes of all descendant species are orthologues). Tree reconstruction should involve either a genes or b genes, but not a mixture; otherwise the tree will assume shared ancestral nodes that are incorrect. Thus, special attention to detecting paralogous genes is required when inferring a species tree from a gene tree.

Rooted and unrooted trees
Phylogenetic trees may be rooted or unrooted. A rooted tree has an ancestral node, but the unrooted tree is a network without any ancestral node. Tree building methods (discussed below) vary in whether they produce a rooted or unrooted tree. For example, the maximum likelihood method gives a rooted tree whereas the neighbour-joining method produces an unrooted tree. It is possible to identify the ancestral node in a phylogenetic tree that has been constructed by the neighbour-joining method. When considering a group of related sequences, a more distantly related sequence can be included as an outgroup when a phylogenetic tree is constructed. Then the branching point between the outgroup and the remaining sequences should be the ancestral node. However, if the outgroup is too distant from the other genes, the substitution numbers may not be estimated correctly due to saturation.

Methods for constructing phylogenetic trees
Distance-based methods
In distance-based methods, the numbers of nucleotide or amino acid substitutions are used as evolutionary distances; if the true numbers were known, they should produce the correct phylogenetic tree. In reality, however, the number of substitutions is usually unknown; therefore many methods for estimating this number have been developed. Most distance methods perform well if the sequence data are additive, or at least approximately so. That is, the distance between two OTUs should be equal to the sum of the connecting branch lengths. However, data that include divergent sequences are often not additive, in which case these methods will perform poorly. The unweighted pair-group method with arithmetic mean (UPGMA) and the neighbour-joining (NJ) methods are typical distance-based methods.

The UPGMA method was originally developed by Sokal and Michener (1958) for constructing a tree based upon the phenotypic similarities between OTUs. This method employs a sequential clustering approach. Topological relationships are inferred from a distance matrix in order of decreasing similarity, and a phylogenetic tree is built in a stepwise manner. The distance between two composite OTUs is calculated as the arithmetic mean of the pair-wise distances between the constituent OTUs of the two composite OTUs. In practice, the two OTUs most similar to one another are first identified in the distance matrix; they are then treated as a new single, composite OTU. Subsequently, among the new set of OTUs, the pair having the highest similarity is identified. The procedure is repeated until only two OTUs are left. UPGMA is the simplest method for tree construction, though there are now superior distance methods available. Nevertheless, the UPGMA method gives the correct trees with a high probability if the rates of nucleotide or amino acid substitutions are approximately constant over time among the genes compared.

The NJ method of Saitou and Nei (1987) is commonly used and has largely superseded UPGMA as the distance method of choice. An advantage of the algorithm used in this method is its ability to examine large data sets and many different branch patterns without taking long computational times. Further, the NJ method produces only one final tree. The steps in constructing a tree are as follows. First, a star phylogeny, in which all sequences originate equidistant from a single common root, is considered for computing the total number of substitutions (S). Note that the observed number S will be less than the true number of substitutions, particularly for distantly related sequences that have been subject to saturation. The total number of substitutions (Sij) is then calculated for a tree in which sequences i and j are paired separately from the remaining sequences. The first pair of neighbours is chosen by calculating Sij values for all pairs of sequences and choosing the smallest Sij. The identified pair is treated as a composite OTU in the next step. This procedure is repeated until the star phylogeny is entirely converted into a bifurcating tree.

The NJ method is similar to, but simpler and less computationally intensive than, another distance method, the minimum evolution (ME) method (Rzhetsky and Nei, 1992). Rather than beginning with the closest pair of sequences (as with the NJ method), the total sum (S) of branch lengths is calculated for all possible branching patterns using pair-wise distance data. The branching pattern that yields the lowest S value is taken to be the most likely tree.

Two relatively new methods that use distance data have been developed that allow for non-tree-like characteristics amongst sequence data. This is particularly useful for sequences that have been subject to convergent or parallel evolution, or for examining kinship among organisms where horizontal gene transfer is common. These methods are spectral analysis, derived from a Fourier transformation method (Hendy et al., 1994), and split decomposition (Bandelt and Dress, 1992). These methods produce a group of weakly compatible splits or a spectrum of splits as a graph, instead of a tree. In this context, a split is an edge within a phylogenetic tree that defines a boundary between two character states (e.g. ears or no ears; Ser or Thr). Compared with a regular phylogenetic tree, a splits graph may not be planar and it may include reticulate features. However, if the distances used to produce a splits graph are additive, split decomposition and spectral analysis will produce a graph that, to all appearances, is an unrooted, bifurcating tree. Therefore, these methods can also be used to produce a regular tree if the data support a tree-like relationship.
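The sequential clustering used by UPGMA can be sketched in a few lines. The version below assumes a symmetric distance matrix given as a list of lists and returns the tree as nested tuples of the form (left, right, node_height), where the height is half the distance between the two clusters being joined. The representation, the function name and the example distances are assumptions made for illustration; unlike the description above, the loop also joins the final two clusters so that a single rooted tree is returned.

```python
def upgma(names, dist):
    """Cluster OTUs by UPGMA and return the rooted tree as nested tuples."""
    # Each cluster id maps to (subtree, number_of_OTUs_in_cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between clusters, keyed by the unordered pair of cluster ids.
    d = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d[frozenset((i, j))] = dist[i][j]
    next_id = len(names)
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest distance.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        height = d.pop(pair) / 2
        # Distance from the new composite OTU to every remaining cluster is
        # the arithmetic mean over all pairs of constituent OTUs, which is
        # the size-weighted mean of the two old distances.
        for c in clusters:
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((next_id, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            )
        clusters[next_id] = ((tree_a, tree_b, height), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


# Hypothetical distances that fit a constant substitution rate exactly.
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, dist)
print(tree)
```

Because every join is placed at half the cluster distance, all leaves end up equidistant from the root. That is why the method recovers the correct tree only when substitution rates are approximately constant.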

Character-based methods
The character-based methods utilize characters such as nucleotides and amino acids themselves in the multiple alignment of sequences, instead of evolutionary distances. In a multiple alignment, the same or different nucleotides (or amino acids) may occupy each site. Of course, no information is obtained if the same characters occupy a site. However, when different characters occupy a site, this site is informative and useful for constructing a phylogenetic tree. Typical examples of character-based methods are the parsimony method and the maximum likelihood (ML) method.

Parsimony is a phylogenetic reconstruction principle that seeks a branching pattern among taxa which requires the smallest number of evolutionary changes (reviewed in Swofford et al., 1996). However, evolution may not occur in a way that minimizes evolutionary changes, particularly in the cases of convergent or parallel evolution, or reversals to an ancestral state. As a result, parsimony is reasonably good for comparisons of closely related species or genes, but less so for divergent taxa. For distant comparisons, hidden changes may have accumulated to such an extent (saturation) that the parsimony principle cannot detect all changes, leading to foreshortened branch lengths and possibly incorrect relationships. A potential disadvantage of the parsimony method is that it will produce more than one, often numerous, trees, all of which are equally parsimonious. However, a consensus tree can be used to illustrate the results by showing multifurcations where various trees differ. One of the merits of the parsimony method is that we can infer all the intermediate stages of amino acid or nucleotide sequences in the continuous lineage between the ancestor and the existing species, though this is best used only amongst closely related sequences.

Compatibility methods presuppose that the most likely phylogeny is the one that is compatible with the greatest number of individual characters. Accordingly, the phylogenetic tree constructed by this method will rarely be the same as that obtained by using the parsimony method. There is dispute amongst the scientific community as to which method is preferable. Both methods have been criticized because neither includes statistical methods for evaluating the phylogenies (Felsenstein, 1982). Indeed, if convergence and reversals are scattered at random over characters, then parsimony is the better method. However, if saturation is restricted to certain characters then compatibility analysis is the better method.

Rather than producing a model from the sequence data, the maximum likelihood (ML) method (Felsenstein, 1981) considers the problem in reverse, by seeking the tree that is most likely to have produced the data. For each possible tree, probabilities among the sequences are calculated for each site and then combined to produce an overall likelihood. A disadvantage of the ML method is that it is quite sensitive to the model of nucleotide substitution that has been chosen. The ML method is the most computationally intensive of the commonly used tree-building methods but has become more popular as computing capabilities have increased. Nevertheless, an exhaustive search for ML trees remains cumbersome for more than 10 or so sequences. We can avoid an exhaustive search by instead starting from a star phylogeny and using it to search for trees with higher likelihood values through local branch rearrangement.
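The parsimony criterion, the smallest number of changes required by a given branching pattern, can be made concrete with a short sketch. The text does not name a counting procedure; the example assumes the Fitch algorithm, a standard way to count the minimum number of changes at each site on a fixed binary tree. Trees are written as nested pairs of leaf names, and the alignment and function names are hypothetical.

```python
def fitch_score(tree, states):
    """Minimum number of changes at one site on a rooted binary tree.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping each leaf name to its character at this site
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # The children can share a state, so no change is needed here.
            return common, left_cost + right_cost
        # No shared state: one change is required on a branch below this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_length(tree, alignment):
    """Total number of changes over all columns (alignment: name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical four-taxon alignment and two candidate branching patterns.
alignment = {"A": "AAAG", "B": "AAAT", "C": "ACCG", "D": "ACCT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print("tree 1 requires", parsimony_length(tree_1, alignment), "changes")
print("tree 2 requires", parsimony_length(tree_2, alignment), "changes")
```

In this example the first column is identical in all sequences and adds nothing to either total, two columns favour the first tree and one favours the second, so the first tree is the more parsimonious of the two.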

References
Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403-410.
Baldi P, Chauvin Y, Hunkapiller T and McClure MA (1994) Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences of the USA 91: 1059-1063.
Bandelt HJ and Dress AW (1992) Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Molecular Phylogenetics and Evolution 1: 242-252.
Dayhoff MO, Schwartz RM and Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed.) Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345-352. Washington DC: National Biomedical Research Foundation.
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum-likelihood approach. Journal of Molecular Evolution 17: 368-376.
Felsenstein J (1982) Numerical methods for inferring evolutionary trees. Quarterly Review of Biology 57: 379-404.
Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791.
Hendy MD, Penny D and Steel MA (1994) A discrete Fourier analysis for evolutionary trees. Proceedings of the National Academy of Sciences of the USA 91: 3339-3343.
Henikoff S and Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the USA 89: 10915-10919.
Jukes TH and Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed.) Mammalian Protein Metabolism, pp. 21-132. New York: Academic Press.
Kimura M (1980) A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequence. Journal of Molecular Evolution 16: 111-120.
Kimura M (1983) The Neutral Theory of Molecular Evolution. Cambridge: Cambridge University Press.
Li W-H, Wu C-I and Luo C-C (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Molecular Biology and Evolution 2: 150-174.
NCBI BLAST [http://www.ncbi.nlm.nih.gov/BLAST/]
Nei M and Gojobori T (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution 3: 418-426.
Pearson WR and Lipman DJ (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the USA 85: 2444-2448.
Rzhetsky A and Nei M (1992) A simple method for estimating and testing minimum-evolution trees. Molecular Biology and Evolution 9: 945-967.
Saitou N and Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425.
Sokal RR and Michener CD (1958) A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 28: 1409-1438.
Swofford DL, Olsen GJ, Waddell PJ and Hillis DM (1996) Phylogenetic inference. In: Hillis DM, Moritz C and Mable BK (eds) Molecular Systematics, 2nd edn, pp. 411-501. Sunderland, Massachusetts: Sinauer Associates.
Thompson JD, Higgins DG and Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673-4680.

Tree evaluation
There are several ways to evaluate the quality of a phylogenetic tree. Consideration should also be given to the estimation of branch lengths and the comparison of tree topologies (e.g. for trees produced from the same data using different methods), though these will not be discussed here. The bootstrap test (Felsenstein, 1985) is one of the most routinely used tests of tree confidence and can be applied to most tree-building methods. Bootstrapping resamples the sites of the alignment at random with replacement, so that some sites are drawn more than once and others are omitted, and a tree is then rebuilt from the resampled data to test whether the data reproduce the original tree. This process is repeated many times and the proportion of replications in which a given cluster of sequences appears is calculated. If this proportion is high for a cluster, say more than 80%, the cluster is considered to be significant. The result is a numerical value on each branch that represents the confidence with which the data reproduce it. However, bootstrap values may over- or underestimate the true confidence, depending on factors such as variation in substitution rates.
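
The resampling step of the bootstrap can be sketched as follows. This is a simplified illustration, not the procedure of any particular program: instead of rebuilding a full tree for each replicate, it uses a stand-in (`closest_pair`, a hypothetical helper) that reports only the pair of sequences with the smallest proportion of differing sites, and the alignment and the fixed random seed are invented for the example.

```python
import random


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment):
    """Stand-in for a tree-building method: return the two taxa with the
    smallest p-distance. Ties are broken alphabetically, which is a
    simplification of this sketch."""
    names = sorted(alignment)
    pairs = [(p_distance(alignment[a], alignment[b]), a, b)
             for i, a in enumerate(names) for b in names[i + 1:]]
    _, a, b = min(pairs)
    return frozenset((a, b))


def bootstrap_replicate(alignment, rng):
    """Draw alignment columns at random with replacement, keeping the
    original length and keeping each column intact across all taxa."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, cluster, replicates=1000, seed=0):
    """Fraction of replicates in which `cluster` is recovered."""
    rng = random.Random(seed)
    hits = sum(closest_pair(bootstrap_replicate(alignment, rng)) == cluster
               for _ in range(replicates))
    return hits / replicates


alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACTTACCTGA",
    "D": "TCTTAGCTGA",
}
support_ab = bootstrap_support(alignment, frozenset(("A", "B")))
print(f"support for cluster A+B: {support_ab:.2f}")
```

In a real analysis the stand-in would be replaced by the tree-building method being evaluated, and the support would be tallied for every cluster in the original tree rather than for a single pair.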

Conclusions
DNA sequence analysis is a powerful tool for predicting gene functions and inferring evolutionary relationships among genes and species through the comparison of nucleotide and amino acid sequences. It has become increasingly important in molecular biology and related fields as DNA and protein sequences have accumulated rapidly with the advancement of sequencing techniques and the remarkable progress of genome projects. The tools necessary for sequence analysis are now widely available and can be readily implemented by investigators from all fields of the life sciences.

Zuckerkandl E and Pauling L (1965) Evolutionary divergence and convergence in proteins. In: Bryson V and Vogel VH (eds) Evolving Genes and Proteins, pp. 97-166. New York: Academic Press.

Further Reading
Altschul SF, Boguski MS, Gish W and Wootton JC (1994) Issues in searching molecular sequence databases. Nature Genetics 6: 119-129.
Baxevanis AD and Ouellette BFF (eds) (1998) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Methods of Biochemical Analysis 39. New York: Wiley-Interscience.
Database issue (2000) Nucleic Acids Research 28 (1).
DNA Database of Japan (DDBJ) [http://www.ddbj.nig.ac.jp]
Doolittle RF (1995) The multiplicity of domains in proteins. Annual Review of Biochemistry 64: 287-314.
European Bioinformatics Institute (EBI) [http://www.ebi.ac.uk]
Fitch WM (2000) Homology. Trends in Genetics 16: 227-231.
Gojobori T, Moriyama EN and Kimura M (1990) Statistical methods for estimating sequence divergence. Methods in Enzymology 183: 531-550.
Graur D and Li W-H (2000) Fundamentals of Molecular Evolution, 2nd edn. Sunderland, Massachusetts: Sinauer Associates.
Henikoff S and Henikoff JG (1993) Performance evaluation of amino acid substitution matrices. Proteins 17: 49-61.
Martinsried Institute for Protein Sequences (MIPS) [http://www.mips.biochem.mpg.de]
National Center for Biotechnology Information (NCBI) [http://www.ncbi.nlm.nih.gov]
Nei M (1996) Phylogenetic analysis in molecular evolutionary genetics. Annual Review of Genetics 30: 371-403.
Nei M and Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford, UK: Oxford University Press.
Ouellette BFM and Boguski MS (1997) Database divisions and homology search files: a guide for the perplexed. Genome Research 7: 952-955.
Protein Information Resource (PIR) [http://www.nbrf.georgetown.edu/pirwww/pirhome.shtml]
Swissprot [expasy.nhri.org.tw/sprot/]
Thornton JM, Orengo CA, Todd AE and Pearl FM (1999) Protein folds, functions and evolution. Journal of Molecular Biology 293: 333-342.

ENCYCLOPEDIA OF LIFE SCIENCES / © 2001 Macmillan Publishers Ltd, Nature Publishing Group / www.els.net
