Dr. Kerstin Hoef-Emden, Universität zu Köln, Botanisches Institut, Gyrhofstr. 15, 50931 Köln
molecular phylogeny = studying relationships among organisms using molecular markers (e.g. DNA or protein sequences)
dissimilarities among sequences = genetic divergence caused by mutations during the course of time
- their accuracy can be tested in in silico simulations
- are based on assumptions about the processes of molecular evolution
- may be computationally intense (this refers more to CPU time than to memory)
- may be sensitive to artefacts
- usually results are displayed as trees
consistency = Does a method reconstruct the correct tree given an infinite amount of data? (All methods do, if assumptions are not violated.)
efficiency = How quickly does a method converge to the correct tree with a finite amount of data? (The less data are needed to infer the correct tree, the more efficient the method.)
robustness = How well does a method perform if the assumptions about the evolutionary process are violated?
a) Simulate the evolution of a randomly chosen DNA or protein sequence under a given evolutionary model and tree topology into several lineages.
b) Use the phylogenetic method under test to infer a phylogenetic tree.
c) Does the resulting phylogenetic tree correspond to the true tree?
d) Modify the tree topology to different extremes in branch lengths and repeat the test.
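Step a) can be sketched in a few lines of Python. This is only a minimal sketch, assuming the Jukes-Cantor model and a hypothetical three-taxon tree written as nested lists; the names `evolve` and `simulate` are illustrative and do not belong to any phylogenetics package.

```python
import math
import random

BASES = "ACGT"

def evolve(seq, branch_length, rng):
    """Evolve seq along one branch under Jukes-Cantor.

    branch_length is the expected number of substitutions per site.
    """
    # Probability that a site ends the branch with a different base.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(tree, seq, rng, leaves=None):
    """tree: leaf name (str) or list of (subtree, branch_length) pairs."""
    if leaves is None:
        leaves = {}
    if isinstance(tree, str):
        leaves[tree] = seq
        return leaves
    for subtree, length in tree:
        simulate(subtree, evolve(seq, length, rng), rng, leaves)
    return leaves

rng = random.Random(1)
root = "".join(rng.choice(BASES) for _ in range(200))
# Hypothetical true tree: (A:0.1,(B:0.05,C:0.05):0.1)
true_tree = [("A", 0.1), ([("B", 0.05), ("C", 0.05)], 0.1)]
alignment = simulate(true_tree, root, rng)
```

The resulting `alignment` would then be passed to the method under test (step b), and the inferred topology compared with `true_tree` (step c).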
- The true tree is unknown; each inferred tree represents a hypothesis.
- No infinite amounts of data are available (no nuclei with infinite space, which contain infinite amounts of DNA).
- By using robust and efficient methods and appropriate evolutionary models, the inferred trees hopefully converge as closely as possible to the real phylogeny.
- The simulation studies give some hints about potential vulnerabilities of the phylogenetic methods.
Trees: Nomenclature
terminal node = operational taxonomic unit (OTU) = contemporary taxon
mathematics: branch = edge; node = vertex (plural: vertices)
Treefile Formats
Nexus
Translate
1 MPorph,
2 S11679,
3 CCMP736,
4 MCont316,
5 S1382,
[...]
21 UTEX637
;
tree PAUP_1 = [&U] (1:0.096043,(((((2:0.041968,12:0.011298):0.014339,(13:0,20:0):0.012408):0.012250,3:0.188535):0.027987,(4:0.107423,5:0.115320):0.013260,((((14:0.001531,15:0):0.038667,19:0.005060):0.000989,18:0.018300):0.015597,(17:0.006567,21:0.029673):0.017735):0.009011):0.011165,(((6:0.021826,9:0.009905):0.002602,(7:0.014488,11:0.066979):0.005662):0.014995,(8:0.046098,10:0.079671):0.010953):0.009027):0.020750,16:0.048813);
End;
Phylip (Newick)
(MPorph:0.094736,(((((S11679:0.041475,U1424:0.011216):0.014077,(M1712:0,S10379:0):0.012338):0.011839,CCMP736:0.183396):0.027333,(MCont316:0.106050,S1382:0.117050):0.013282,((((S3794:0.001532,S3694:0):0.038496,S899:0.004933):0.001127,S3194:0.018102):0.015038,(S4094:0.006579,UTEX637:0.029760):0.018250):0.009256):0.011383,(((Chondrus:0.021641,S13531a:0.009722):0.002405,(S4194a:0.014451,S1896:0.065892):0.005690):0.014595,(S4194b:0.046357,S13531b:0.078882):0.010916):0.008914):0.020599,S5981:0.048038);
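The nested-parenthesis structure of the Newick format can be read with a small recursive parser. The following is a minimal sketch for illustration only: it assumes unquoted names and no comments, and the function names and the short example tree are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # Optional node name (empty for unlabelled internal nodes).
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":".
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()

def leaf_names(node):
    """Collect the names of all terminal nodes (OTUs) from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]

tree = parse_newick("(A:0.1,(B:0.2,C:0.3):0.05,D:0.4);")
print(leaf_names(tree))
```

The same function should also read the longer Newick tree shown above, since that tree uses only plain names and branch lengths.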
Displaying Trees
Paup for MacOS 9: graphical output to screen, file or printer; nexus or Newick format (unscaled, scaled, rooted and unrooted trees)
Phylip: treefile to graphics converter; Newick format (unscaled, scaled, rooted and unrooted trees)
Paup for Windows or portable format (Unixoids): auxiliary program necessary
- Phylip converter programs
- TreeView
  for MacOS 9 (and Windows?): all tree types and tree editing; nexus and Newick
  for Unixoids: no unrooted trees, no tree editing; nexus and Newick
- TreeEdit (MacOS), Treetool etc.
General purpose
Paup 4b: Windows, MacOS 9, Unixoids (Linux, Solaris, MacOS X etc.)
Phylip 3.62: Windows, MacOS 8, 9, X, Linux, C sources
Paup:
MacOS 9 PPC: graphical user interface (i.e. mouse driven) and graphical output of trees
All others (Windows, Unixoids): command line (can be submitted to batch queues, unfortunately no checkpointing)
Distributor in Europe: Palgrave-MacMillan, UK (Windows and MacOS 9 PPC; GBP 62/72)
Distributor in USA (and for portable versions): Sinauer Associates (USD 85-150)
Phylip:
Precompiled for: MacOS 8/9 PPC, MacOS X, Windows, Red Hat Linux (i386)
C sources for all other Unixoids
User interface: text-based menu system
MrBayes:
Runs under MacOS X, Windows, Unixoids
C sources
User interface: command-line driven; syntax similar to Paup
Input Formats
Paup: nexus format (interleaved or sequential)
Phylip: phylip format
MrBayes: nexus format (interleaved or sequential)
Nexus Format
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=6;
TAXLABELS
'S3694'
'S5981'
'S4094'
'S3194'
'S899'
'S10379'
;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=14;
FORMAT DATATYPE=NUCLEOTIDE GAP=-;
MATRIX
[1] 'S3694'  cCCAAGCGTTTCCG
[2] 'S5981'  CCCAATCGTTTCCC
[3] 'S4094'  CCCAATCGTTTCCG
[4] 'S3194'  GCCAATCGTTTCCG
[5] 'S899'   CCCAAGCGTTTCCG
[6] 'S10379' CCCAATCGTTTCCG
;
END;
Phylip Format
6 14 S3694 S5981 S4094 S3194 S899 S10379
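To make the relation between the two input formats concrete, the six sequences of the Nexus example can be written out in sequential Phylip layout. This is a minimal sketch, assuming the strict Phylip convention of names padded to 10 characters; the helper `to_phylip` is hypothetical and not part of Paup or Phylip.

```python
def to_phylip(records):
    """Format (name, sequence) pairs as a sequential Phylip alignment."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # Header line: number of taxa, number of characters.
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        if len(name) > 10:
            raise ValueError(f"name too long for strict Phylip: {name}")
        lines.append(name.ljust(10) + seq)
    return "\n".join(lines)

# Sequences copied from the Nexus example.
records = [
    ("S3694", "cCCAAGCGTTTCCG"),
    ("S5981", "CCCAATCGTTTCCC"),
    ("S4094", "CCCAATCGTTTCCG"),
    ("S3194", "GCCAATCGTTTCCG"),
    ("S899", "CCCAAGCGTTTCCG"),
    ("S10379", "CCCAATCGTTTCCG"),
]
phylip_text = to_phylip(records)
print(phylip_text)
```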
Work-Flow
aim: group of organisms or gene family
choice of molecular marker(s) and taxon sampling
amplification/sequencing
alignment
choice of evolutionary model
phylogenetic analyses
tree(s)
user-defined trees and topology testing
results
Taxon Sampling
- The diversity of the group should be represented (guessing by looking at the phenotype or using the systematics of the group), e.g. combinations of morphological characters, or representatives of all species/genera of a group, or serotypes, or ...
- at least two representatives of each presumed clade (guessing)
- not too few taxa (> 15)
- outgroup taxa (closest related sister group only!)
All orthologues and paralogues (or alleles) of a gene in an organism have to be sequenced!
Why?
[Figure: correct tree; taxa shown: Drosophila, Arabidopsis, Chlamydomonas, Homo]
- choose single-copy genes (protein-coding) or highly synchronised genes (ribosomal DNA)
- choose more variable genes for closely related organisms and conserved genes for more distantly related organisms
- in sexually reproducing organisms, two alleles may occur
conserved = potentially suited for phylogenies of genera or higher-level taxa
highly variable = potentially suited for phylogenies of species or lower-level taxa
some examples of more conserved genes: actin, elongation factor 1 (EF-1), rbcL, tubulins and lots more ...
genomic DNA: advantage: introns (add information); disadvantage: introns (splice sites?)
mRNA
cloning of PCR products -> problem!
solutions:
a) proofreading polymerase instead of Taq
b) sequence more than 2 clones (better: more than three; only an option if no allelic variation is to be expected!)
Sequencing
- sequencing of forward and reverse strands
- thorough proofreading
  ribosomal RNA sequences: secondary structure
  protein sequences: translation (stop codons?)
- BLAST search (PCR contamination, chimaeric sequences?)
Alignment (1)
automatic alignment:
good as a starting point for an alignment
not good if sequences contain a lot of indels and highly variable regions (i.e. non-coding regions such as ITS or intron sequences, or variable regions in ribosomal RNA sequences)
proteins: check alignment afterwards by eye
ribosomal RNA, intron and ITS sequences: manual editing always needed
Alignment (2)
Unusual amino acids in the translated sequence? Deviations in a highly conserved region of ribosomal RNA? One G whereas all others have two in a highly conserved region?
-> back to the assembly data and cross-checking (it may be true, though!)
Alignment (3)
The alignment is the very basis of the phylogenetic analyses. Software cannot distinguish between a real mutation and a sequencing or alignment error.
Alignment (4)
Exclusion of nonalignable regions and saving of the data set in nexus (PAUP) or phylip (Phylip, PAML, Molphy) format, depending on the software used for phylogenetic analyses. In protein-coding sequences: perhaps exclusion of the third codon position.
Alignment (5)
e.g. ribosomal DNA
CCMP152 TAGGAAATCTAGAGCTAATACATGCACCATCGCTCTAATTTGATATTTT-------
M1303   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT------
M2180   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT------
S9772e  TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTTACAATATCTAA----
S9772b  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT----
C9772a  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT----
M1703   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
M1481   TAGGAATTCTAGAGCTAATACATGCACCATCGCTTTTTTTTCTTTTTTCTTTTTTCTT
SB9801  TAGGAATTCTAGAGCTAATACATGCACCATCGTTTTTCTTGACAGGAAGGAAGAAAAA
M1318   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
M1312   TAGgAATTCTAGAGCTAATACATGCACCATAGCCTTTTGTAATTTTTTTTAAAGTTTT
Hruf    TAGGAATTCTAGAGCTAATACATGCCCCATCGCTTTCGAAGTTTTTTAATTTTTTTTC
mask    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxx
CTC = discussion zone between alignable and nonalignable regions
TTT = highly variable region -> exclude from analyses
Alignment (6)
- multiple sequence alignment
- limits of sequence number and length sufficiently high?
- manual editing
- protected mode (no deletion of nucleotides)
- several import/export formats (e.g. clustal for automated pre-alignment)
- for phylogenetic analyses: nexus/phylip format export
- definition of a mask to exclude non-alignable regions
Alignment (7)
Automatic Alignment Tools: Align, Clustal W and relatives, T-Coffee, Malign, TreeAlign, PileUp and others ...
Manual Alignment: BioEdit, SeaView, SeAl, ARB (RNA-coding DNA), DCSE (RNA-coding DNA), SeqLab (GCG), MacVector and others ...
Questions?
- base frequencies
- substitution rate matrix
- proportion of invariable sites
- gamma-distributed among-site rate variation
- (covarion/covariotide)
homogeneity of base frequencies among taxa: The program Paup performs a chi-square test for biased base frequencies (command: basefreqs).
Jukes-Cantor model: All point mutations occur at the same rate (number of substitution types [nst] = 1).
Hasegawa-Kishino-Yano and Kimura-2-parameter: Different rates for transitions and transversions (nst=2).
General time reversible model (GTR): Each type of substitution has its own substitution rate; a substitution and its reversal are considered equally likely (nst=6).
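transitions and transversions:
    A  C  G  T
A   -  v  i  v
C   v  -  v  i
G   i  v  -  v
T   v  i  v  -

six substitution types (a to f):
    A  C  G  T
A   -  a  b  c
C   a  -  d  e
G   b  d  -  f
T   c  e  f  -

three substitution types (a, b, e; the pattern of the Tamura-Nei model):
    A  C  G  T
A   -  a  b  a
C   a  -  a  e
G   b  a  -  a
T   a  e  a  -

i = transition, v = transversion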
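The distinction between transitions and transversions can be made explicit in code. The sketch below classifies each difference between two aligned sequences and, as one possible use, computes the Kimura 2-parameter distance d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. Function names and the example sequences are hypothetical; gap-free sequences of equal length are assumed.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(x, y):
    """Return 'i' for a transition, 'v' for a transversion, None if identical."""
    if x == y:
        return None
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "i" if same_class else "v"

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned, gap-free sequences."""
    kinds = [classify(a, b) for a, b in zip(seq1.upper(), seq2.upper())]
    p = kinds.count("i") / len(kinds)  # proportion of transitions
    q = kinds.count("v") / len(kinds)  # proportion of transversions
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

print(classify("A", "G"), classify("A", "C"))
print(k2p_distance("AAAAAAAAAA", "AAAAAAAAAG"))
```

For very divergent sequences the argument of the logarithm can become zero or negative, in which case the distance is undefined and `math.log` raises an error.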
protein-coding sequences: faster rates at the third codon position
ribosomal DNA, internal transcribed spacers: alternating pattern of more conserved and highly variable regions correlating with secondary structure (helices and unpaired regions).
[Figure: among-site rate variation in ribosomal DNA (secondary structure of the sequence TACCATGAAAAAGTGGAC) and in protein-coding DNA (DNA and protein)]
gamma-distributed among-site rate variation = nucleotides evolve at differing rates in differing positions; modelled by the shape parameter α
[Figure: gamma distributions for α = 0.25, α = 1 and α ≈ 10; x-axis: substitution rate]
Covarion/Covariotide
Individual sequences or lineages evolve faster than others. = not evolving according to a molecular clock.
Example for a command line in Paup 4b to calculate the parameters of a Tamura- Nei model with unequal base frequencies, proportion of invariable sites and gamma distribution (a tree has to be available in the memory)
different combinations of base frequencies + substitution rate matrix + proportion of invariable sites/gamma distribution = 56 evolutionary models in the program Paup 4b
The program Modeltest 3.5 (by Posada and Crandall) performs hierarchical likelihood ratio tests (hLRT) and also computes the Akaike information criterion (AIC).
Modeltest consists of a command file for Paup (modelblock) and an executable (Posada Lab at http://darwin.uvigo.es/)
1.) Start Paup and load the data set.
2.) Load the modelblock of Modeltest into Paup with the command execute.
3.) Paup will follow the commands given in the modelblock: First, a tree is constructed using the simplest and fastest method. Then Paup computes the likelihood values for all 56 evolutionary models for the data set given the tree. The likelihood scores of the 56 models are saved in a file called model.scores.
4.) The Modeltest executable is started and fed with the model.scores file. It performs the hLRT and AIC tests and saves the results to a file.
Testing models of evolution - Modeltest Version 3.06
(c) Copyright, 1998-2000 David Posada (dp47@email.byu.edu)
Department of Zoology, Brigham Young University
WIDB 574, Provo, UT 84602, USA
_______________________________________________________________
Wed Sep 15 21:33:13 2004
Input format: Paup matrix file

** Log Likelihood scores **
JC    = 3853.2573
F81   = 3843.7336
K80   = 3852.4849
HKY   = 3842.8232
TrNef = 3852.4060
TrN   = 3842.6401
K81   = 3851.4536
K81uf = 3842.2886
TIMef = 3851.3740
TIM   = 3842.1130
TVMef = 3846.7188
TVM   = 3840.2319
SYM   = 3846.6523
GTR   = 3839.9839

** Hierarchical Likelihood Ratio Tests (hLRTs) **
Equal base frequencies
  Null model = JC
  Alternative model = F81
  2(lnL1-lnL0) = 13.2373
  P-value = 0.004151
Ti=Tv
  Null model = F81
  Alternative model = HKY
  2(lnL1-lnL0) = 0.9131
  P-value = 0.339297
Equal rates among sites
  Null model = F81
  Alternative model = F81+G
  2(lnL1-lnL0) = 622.9395
  Using mixed chi-square distribution
  P-value = <0.000001
No Invariable sites
  Null model = F81+G
  Alternative model = F81+I+G
  2(lnL1-lnL0) = 17.3423
  Using mixed chi-square distribution
  P-value = 0.000016
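The P-values in the hLRT output follow from comparing the test statistic 2(lnL1-lnL0) with a chi-square distribution whose degrees of freedom equal the number of additional free parameters. The sketch below recomputes three of the P-values from the statistics printed above. It is not Modeltest code: the closed-form tail probabilities cover only 1 and 3 degrees of freedom, and the treatment of the "mixed chi-square distribution" as a 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom is an assumption.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution (df = 1 or 3 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("only df = 1 and df = 3 are implemented in this sketch")

def lrt_p_value(statistic, df, mixed=False):
    """P-value of a likelihood ratio test with statistic 2(lnL1 - lnL0).

    mixed=True halves the P-value (assumed 50:50 mixture of chi-square
    distributions with 0 and df degrees of freedom).
    """
    p = chi2_sf(statistic, df)
    return 0.5 * p if mixed else p

# Statistics taken from the output above.
print(lrt_p_value(13.2373, 3))              # JC vs F81: 3 free base frequencies
print(lrt_p_value(0.9131, 1))               # F81 vs HKY: 1 extra rate parameter
print(lrt_p_value(17.3423, 1, mixed=True))  # F81+G vs F81+I+G
```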
Nst=1 Rates=gamma Shape=0.6163 Pinvar=0.4495;
Questions?
1) Optimality Criterion
Scoring method to decide which tree is the best, e.g. maximum parsimony, distance analysis, maximum likelihood
2) Tree Search Algorithm
Method to construct a tree, e.g. exhaustive search, branch-and-bound, heuristic search, quartet puzzling, neighbor-joining
The tree which requires the fewest mutation steps to explain the nucleotide pattern of an alignment is the best. Each point mutation equals one point in the scoring system, thus in unweighted parsimony only integer values are possible as scores.
best tree = maximum parsimony tree (MPT)
several trees with equal scores = equally parsimonious trees (EPT)
Problem: Evolutionary model is implicit and cannot be adapted to the data set. All mutations at all positions are considered equal, even in more variable regions.
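Counting the minimum number of mutation steps for a given tree can be illustrated with Fitch's algorithm, a standard way to score unweighted parsimony (the slides do not name a specific algorithm, so this choice is an assumption). The sketch handles rooted binary trees given as nested tuples; the four-taxon alignment is hypothetical.

```python
def fitch(tree, states):
    """Return (state_set, steps) for one alignment column.

    tree: leaf name (str) or a (left, right) tuple; states: dict name -> base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no additional step needed.
        return common, left_steps + right_steps
    # Children disagree: one mutation step is required at this node.
    return left_set | right_set, left_steps + right_steps + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch step counts over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
print(parsimony_score((("A", "B"), ("C", "D")), alignment))
print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

The tree with the lower score is preferred; since every step counts as one point, the scores are integers.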
All sequences of an alignment are compared pairwise with each other. Each pair is assigned an evolutionary distance value expressing the degree of divergence. The results are listed in a distance matrix, which is used to construct a tree.
To calculate the distances, one out of the 56 different evolutionary models can be chosen or the estimators of the maximum likelihood method can be used.
The tree with the shortest sum of distances is the best. Since the distances are no integers, there is usually only one best tree.
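paup> showdist
HKY85 distance matrix (7 taxa; only part of the lower triangle is preserved)
taxon 1 to taxa 2-7: 0.15819 0.16875 0.15953 0.17529 0.10471 0.10967
taxon 2 to taxa 3-7: 0.15158 0.13655 0.14077 0.10205 0.10370
taxon 5 to taxa 6-7: 0.13192 0.12807
taxon 6 to taxon 7:  0.03539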
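How a distance matrix arises from pairwise comparisons can be shown with the simplest correction, the Jukes-Cantor distance d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites. This sketch uses Jukes-Cantor instead of the HKY85 model of the Paup output above, and three sequences from the Nexus example; function names are hypothetical.

```python
import math
from itertools import combinations

def jc_distance(seq1, seq2):
    """Jukes-Cantor distance; positions with a gap in either sequence are skipped."""
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)
    # Undefined (math domain error) if p >= 0.75, i.e. saturated sequences.
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(alignment):
    """Return a dict mapping each pair of taxon names to its distance."""
    return {(a, b): jc_distance(alignment[a], alignment[b])
            for a, b in combinations(alignment, 2)}

alignment = {
    "S3694": "cCCAAGCGTTTCCG",
    "S5981": "CCCAATCGTTTCCC",
    "S899": "CCCAAGCGTTTCCG",
}
dist = distance_matrix(alignment)
for pair, d in dist.items():
    print(pair, round(d, 5))
```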
Computation steps:
- A tree is given.
- For each position in the alignment, the site-wise log likelihood is calculated. This includes all possible combinations of ancestral character states in the tree.
- The log likelihoods of all positions of the alignment are summed (which corresponds to multiplying the site likelihoods) and result in the total log likelihood value.
[Figure: example tree with the character states A, A, C and G and unknown ancestral states (?)]
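The computation steps can be sketched with Felsenstein's pruning algorithm, which sums over all ancestral states without enumerating them one by one. This is a minimal sketch under the Jukes-Cantor model with equal base frequencies; the tree representation (nested lists of (subtree, branch length) pairs) and all names are assumptions made for illustration.

```python
import math

BASES = "ACGT"

def p_jc(i, j, t):
    """Jukes-Cantor probability of observing base j after time t, starting from i."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, states):
    """Likelihood of the data below node, for each possible state at node."""
    if isinstance(node, str):
        return [1.0 if base == states[node] else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_partial = partial(child, states)
        for i, base_i in enumerate(BASES):
            result[i] *= sum(p_jc(base_i, base_j, t) * child_partial[j]
                             for j, base_j in enumerate(BASES))
    return result

def site_log_likelihood(tree, states):
    """Log likelihood of one alignment column (root base frequencies 0.25)."""
    return math.log(sum(0.25 * x for x in partial(tree, states)))

def total_log_likelihood(tree, alignment):
    """Sum of the site-wise log likelihoods over all columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(site_log_likelihood(tree, {n: s[i] for n, s in alignment.items()})
               for i in range(n_sites))

# Hypothetical tree and data.
tree = [("A", 0.1), ([("B", 0.05), ("C", 0.3)], 0.2)]
alignment = {"A": "ACGT", "B": "ACGA", "C": "ACTT"}
print(total_log_likelihood(tree, alignment))
```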
Theoretically good: Safest method to find the best tree! Problem: Computationally intense! With a lot of taxa impossible to do.
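Why an exhaustive search becomes impossible can be shown by counting trees: for n taxa there are (2n-5)!! = 1 x 3 x 5 x ... x (2n-5) different unrooted bifurcating trees. A short sketch (the function name is illustrative):

```python
def n_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3: (2n-5)!!"""
    if n_taxa < 3:
        raise ValueError("at least three taxa are required")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, n_unrooted_trees(n))
```

Already for 10 taxa more than two million trees would have to be scored.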
Branch-and-bound is a speed-up procedure for exhaustive search. It also considers all possible trees.
The score/distance/likelihood of a randomly generated starting tree is calculated and used as a threshold. All trees that are already worse than this threshold during construction procedure are not finished, but skipped. If a tree turns out to be better, it is used as a new threshold.
The trees are considered to form a landscape called the treespace. The best trees are on top of the hills, the worst trees are in the valleys. Heuristic searches start with a random tree, which may be located in a valley and try to find the best tree located in the global maximum of the tree space by rearranging the branches of the starting tree.
1 Starts with a randomly generated tree, which may be in a valley.
2 Local rearrangements by exchanging neighbouring branches optimise the tree. The tree may end up in a local optimum only.
3 Global rearrangements help to cross the valley and find another hill.
4 The new tree is again rearranged by small exchanges to climb up the hill to the top. If this is the global optimum, further global rearrangements will not improve the tree.
Subtree Pruning and Regrafting (SPR) = A branch with a subtree is removed from a tree and added between two nodes somewhere else in the tree (= one new tree).
Tree Bisection and Reconnection (TBR) = A tree is split into two subtrees and both parts are connected between all possible nodes of the other (= several new trees are considered).
Computation steps:
1) Calculate the net divergence of each taxon from all others, and compute corrected distances for further use.
2) Start with a star-like tree (neighbor-joining belongs to the star-decomposition methods).
3) Join the two taxa with the lowest corrected distance.
4) Recalculate the distance matrix, treating the joined taxa as one.
5) Repeat steps 3 and 4 until all taxa are joined and the tree is resolved.
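The steps above can be sketched as follows. This is a minimal neighbor-joining sketch, not the Paup implementation: it uses the common corrected distance Q(a,b) = (n-2) d(a,b) - r(a) - r(b), with r being the net divergence, and returns the order of joins with the two branch lengths of each join. The five-taxon distance matrix is a hypothetical example.

```python
from itertools import combinations

def neighbor_joining(names, matrix):
    """Return (joins, remaining_nodes); each join is (a, b, length_a, length_b)."""
    d = {}
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i < j:
                d[frozenset((a, b))] = matrix[i][j]
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pair with the lowest corrected distance.
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ab = d[frozenset((a, b))]
        len_a = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        new = f"({a},{b})"
        # Distances from the new node to all remaining nodes.
        rest = [k for k in nodes if k not in (a, b)]
        for k in rest:
            d[frozenset((new, k))] = (d[frozenset((a, k))]
                                      + d[frozenset((b, k))] - d_ab) / 2
        nodes = rest + [new]
        joins.append((a, b, len_a, len_b))
    return joins, nodes

names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
joins, remaining = neighbor_joining(names, matrix)
for join in joins:
    print(join)
```

In this sketch the net divergences are recomputed in every round, because they change whenever two taxa are merged.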
Begin paup;
set autoclose increase=auto outroot=monophy;
outgroup 1-4;
[Maximum Parsimony]
set crit=p;
hsear addseq=rand nreps=10;
savetrees file=pars.tre brlens;
Lset Base=(0.3315 0.2201 0.2334) Nst=1 Rates=gamma Shape=0.9144 Pinvar=0.3911;
[Distance Matrix]
set crit=d;
dset dist=ml;
nj;
savetrees file=nj.tre brlens;
[Maximum Likelihood]
set crit=l;
hsear addseq=rand nreps=1;
savetrees file=ml.tre brlens;
quit;
End;
Bayesian analysis also uses likelihoods in its calculations, but is based on a formula introduced by the Reverend Bayes and uses posterior probabilities.
Bayesian analysis starts with a set of a priori expectations about evolutionary model, tree topology and branch lengths. By examining the data (the alignment), the posterior probabilities of the hypotheses given the data are calculated using the Bayes formula. Since it is impossible to compute the complete joint posterior probability distribution of trees and evolutionary model parameters (a landscape with hills and valleys), samples are drawn using a Metropolis-coupled Markov chain Monte Carlo method.
1) Initialization of the Markov chain with a random tree and random evolutionary parameters -> calculation of probability
2) Proposal of a new state of the chain with one changed parameter (topology, branch length or evolutionary model) -> calculation of probability
3) If P(Tnew)/P(Told) ≥ 1 -> the new state is accepted; if the ratio is < 1, a random number decides.
This corresponds to one generation of the Markov chain. A chain is run over several thousands to millions of generations. Every 100th generation, a tree and its parameters are sampled and saved to files. After a while, the chain starts circling around a probability optimum comprising the best trees.
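The acceptance rule in step 3 and the sampling of every 100th generation can be sketched as follows. This is not MrBayes code: the state is a single number instead of a tree with parameters, the target distribution is a toy example, and the proposal is assumed to be symmetric so that only the probability ratio matters. All names are hypothetical.

```python
import math
import random

def accept(log_p_new, log_p_old, rng):
    """Accept if the ratio is >= 1; otherwise accept with probability = ratio."""
    log_ratio = log_p_new - log_p_old
    if log_ratio >= 0:
        return True
    return rng.random() < math.exp(log_ratio)

def run_chain(log_prob, propose, start, generations, sample_every, rng):
    """Run one chain and return the states sampled every sample_every generations."""
    state, samples = start, []
    for generation in range(1, generations + 1):
        candidate = propose(state, rng)
        if accept(log_prob(candidate), log_prob(state), rng):
            state = candidate
        if generation % sample_every == 0:
            samples.append(state)
    return samples

# Toy target: standard normal density (up to a constant), random-walk proposal.
rng = random.Random(42)
samples = run_chain(
    log_prob=lambda x: -0.5 * x * x,
    propose=lambda x, r: x + r.uniform(-1.0, 1.0),
    start=5.0,
    generations=10000,
    sample_every=100,
    rng=rng,
)
print(len(samples))
```

Working with log probabilities avoids numerical underflow, which matters for real alignments where likelihoods are extremely small.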
[Figure: cold chain (C) and heated chains (H1, H2, H3)]
Since the Markov chain may end up stuck in a local maximum, three so-called heated chains (H1 to H3) are initialized in addition to this cold chain. These chains have lower thresholds to be able to jump over valleys more easily. From time to time, a heated chain exchanges parameters (= Metropolis-coupled) with the cold chain, helping it to find the global optimum.
Maximum Likelihood usually results in one optimal tree (sometimes also two), whereas Bayesian analysis results in a set of optimal trees.
Results of a Bayesian analysis after summarizing of the sampled trees and evolutionary model parameters:
- A tree file listing the trees according to their posterior and cumulative posterior probabilities.
- A list of credibility intervals and mean values for all parameters of the evolutionary model.
- A consensus tree with branch lengths and posterior probabilities indicating support for branches.
All trees and parameters that were sampled prior to reaching the likelihood plateau (i.e. during the burn-in phase before arriving at the global optimum) are excluded from the summaries. The sump command results in a plot showing the likelihood values. If the burn-in was not properly removed, the command has to be repeated with a higher burnin value.
Questions?
In a 50 percent majority rule consensus tree, all groups that occur in more than 50% of the trees inferred from the bootstrap subsamples are retained, and their frequencies are displayed as support values on the branches.
The consensus tree resulting from a Bayesian analysis is a 50% majority rule consensus tree, inferred from the sampled trees of the Markov chain. Similar to the support values of a bootstrap consensus, the posterior probabilities express in how many of the sampled trees a group is found.
Posterior probabilities are usually higher than bootstrap support values and have been the subject of debate.
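The counting behind a majority rule consensus can be sketched by representing each tree as a set of groups (clades), each group being a set of taxon names. The sketch only computes the support values; building the consensus topology from the retained groups is left out. The function names and the four example trees are hypothetical.

```python
from collections import Counter

def clade_support(trees):
    """trees: list of collections of clades; each clade is a frozenset of taxa."""
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade: count / len(trees) for clade, count in counts.items()}

def majority_rule(trees, threshold=0.5):
    """Keep the clades found in more than threshold of all trees."""
    return {clade: support
            for clade, support in clade_support(trees).items()
            if support > threshold}

ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})
abc = frozenset({"A", "B", "C"})
# Four hypothetical bootstrap trees of the taxa A, B, C, D, E.
trees = [{ab, abc}, {ab, abc}, {ab, abc}, {ac, abc}]
consensus = majority_rule(trees)
for clade, support in consensus.items():
    print(sorted(clade), support)
```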
[Figure: correct tree compared with the maximum parsimony tree]
a) A high rate of mutations results in multiple reversals to the original character state (= homoplasies).
b) In addition, a high mutation rate causes signal noise blurring the information in the sequences.
Consequence: Homoplasies are erroneously interpreted as indicators of relatedness. The choice of an inappropriate evolutionary model increases the vulnerability to long-branch attraction (LBA).
The tree is ladderised at the root, i.e. most or all long branches emerge successively close to the root of the tree.
Preventing/Reducing LBA
1.) Improved Taxon Sampling
Use several genes with differing evolutionary rates and concatenate them. The influence of the long branches may be broken (also good: use of genes from different genomes, e.g. nuclear, mitochondrial, plastid).
Different genes will most likely need different evolutionary models. Neither Paup nor Phylip allows for a partitioning of the data. The more genes are included and the more divergent the data are in terms of evolutionary rates, the more likely Modeltest will propose the most complex evolutionary model, GTR+I+Γ.
- Run a phylogenetic analysis with each single-gene data set and with the concatenated data set.
- Let Paup save the 1000 best trees resulting from each analysis.
- Concatenate the tree files.
- Calculate the likelihood scores for each of the trees with the lscores command in Paup, but use only the single-gene alignments as data matrices.
- Load the scorefiles for each data set into a spreadsheet program and calculate the sum of log likelihoods for each tree.
- Sort the trees according to their log likelihood.
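The last two steps (summing the log likelihoods per tree and sorting) do not necessarily require a spreadsheet. A minimal sketch, assuming the scores have already been read from the scorefiles into one list per gene, with the trees in the same order in every list; gene names and values are hypothetical.

```python
def rank_trees(scores_by_gene):
    """scores_by_gene: dict gene -> list of log likelihoods, one per tree.

    Returns (tree_number, summed_log_likelihood) pairs, best tree first.
    """
    n_trees = len(next(iter(scores_by_gene.values())))
    totals = [sum(scores[i] for scores in scores_by_gene.values())
              for i in range(n_trees)]
    return sorted(enumerate(totals, start=1),
                  key=lambda item: item[1], reverse=True)

scores_by_gene = {
    "gene1": [-1000.0, -1005.0, -998.0],
    "gene2": [-2000.0, -1990.0, -2010.0],
}
ranking = rank_trees(scores_by_gene)
print(ranking)
```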
Kishino-Hasegawa and Shimodaira-Hasegawa tests are used to test hypothetical user-defined trees by comparison with the optimal tree.
Problem: The tests were designed to compare random trees, but are used by biologists to compare trees with the optimal tree.
1) Construct constraints to test a hypothesis.
2) Infer the optimal constraint trees with Paup (ML).
3) Concatenate all treefiles that are supposed to be subjected to the test.
4) Let Paup calculate the total log likelihood and the site-wise log likelihoods and save the data to a scorefile.
5) Use a text editor to delete superfluous data (all parameters of the evolutionary model).
6) Feed Consel with the scorefile.
consel scorefile
catpv scorefile
makermt: Generates multiscale bootstrap subsamples from the site-wise log likelihoods in the scorefile.
Bootstrap samples from site-wise log likelihoods = the RELL method
Multiscale bootstrap = sizes of subsamples differ from the original data set
By default, makermt generates 10 sets of replicates, each with 10,000 subsamples (0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.2, 1.3, 1.4 and 1.5-fold the size of the original data set), and an additional set with 10,000 subsamples of 1.0-fold size.
consel: Calculates the probabilities for KHT, SHT and normal bootstrap using the 10,000 subsamples of 1.0-fold size, and the probabilities for the AUT and the multiscale bootstrap using the multiscale bootstrap samples.
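The idea of the RELL method can be sketched in a few lines: instead of re-estimating a tree for every bootstrap sample, the site-wise log likelihoods are resampled and summed. The sketch covers only the normal bootstrap with 1.0-fold sample size, not the multiscale bootstrap or the tests computed by Consel; the function name and the numbers are hypothetical.

```python
import random

def rell_bootstrap(site_lnl, n_replicates, rng):
    """site_lnl: one list of site-wise log likelihoods per tree (equal lengths).

    Returns, per tree, the fraction of replicates in which this tree has
    the highest summed log likelihood.
    """
    n_sites = len(site_lnl[0])
    wins = [0] * len(site_lnl)
    for _ in range(n_replicates):
        # Resample alignment positions with replacement.
        sites = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = [sum(tree[i] for i in sites) for tree in site_lnl]
        wins[totals.index(max(totals))] += 1
    return [w / n_replicates for w in wins]

# Hypothetical site-wise log likelihoods of two trees for six positions.
site_lnl = [
    [-2.1, -3.0, -1.5, -2.2, -4.0, -1.9],
    [-2.0, -3.4, -1.6, -2.5, -3.8, -2.0],
]
proportions = rell_bootstrap(site_lnl, 1000, random.Random(7))
print(proportions)
```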
rank = tree ranking
item = no. of tree in treefile
obs = log likelihood difference
au = p-values of the approximately unbiased test
np = p-values of the multiscale bootstrap
bp = p-values of the normal bootstrap
pp = posterior probabilities
kh = Kishino-Hasegawa test
sh = Shimodaira-Hasegawa test
wkh, wsh = weighted Kishino-Hasegawa and Shimodaira-Hasegawa tests
Due to the degeneracy of the genetic code, codon usage may be biased! This bias may apply not only to the third position, but also to the first and second.
Protein alignments have 20 character states instead of 4 -> analyses, especially ML analyses, take much longer!
Using nucleotide data, but considering nonsynonymous/synonymous substitutions and/or translating during analysis -> this is also quite time-consuming!
Maximum likelihood analyses of protein sequences are usually based on substitution matrices derived from empirical data instead of estimating the substitution rate matrix from the data set.
e.g. Dayhoff (Dayhoff et al. 1978), JTT (Jones, Taylor, Thornton 1992), WAG (Whelan, Goldman 2001)
Phylip (text-based menu): phylip format; maximum likelihood; PAM, JTT, PMB
Tree-Puzzle (text-based menu): phylip format; maximum likelihood with quartet puzzling; Dayhoff, JTT, WAG, VT etc.
PAML (Unixoids, DOS console): phylip format; maximum likelihood; Dayhoff, JTT, WAG etc.
(Molphy [Unixoids]: phylip format; maximum likelihood; Dayhoff, JTT)
MrBayes: Bayesian analysis
One possibility:
Calculate a tree and gamma categories using Tree- Puzzle 5.2 (Schmidt, Strimmer and von Haeseler 2004).
Use the gamma category estimates as settings to perform a maximum likelihood analysis with proml from the Phylip 3.62 package (Joe Felsenstein 2004).
e.g. ML bootstrapping:
a) start seqboot = generate bootstrap samples from the data set
b) run dnaml or proml (depending on the data set)
c) run consense to create a consensus tree
d) the consensus treefile may be loaded into a tree displaying program
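What step a) does can be illustrated in a few lines: a bootstrap sample is drawn by sampling alignment columns with replacement until the original alignment length is reached. This is a minimal sketch of the principle, not the seqboot implementation; the function name and the small alignment are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for all taxa, so columns stay intact.
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

alignment = {"S3694": "CCCAAGCG", "S5981": "CCCAATCG", "S3194": "GCCAATCG"}
rng = random.Random(3)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate would then be analysed in step b), and the resulting trees summarised in step c).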
Molecular Clock
Diverse
Assumes that sequences evolve at equal evolutionary rates. This is usually not the case and puts the analysis under a constraint. Molecular clock hypothesis may be tested with a likelihood ratio test similar to the evolutionary models in Modeltest. If fossils are available one may try dating the divergences of lineages.
may add information to the results (predominantly RNA- coding, ITS or intron regions)
Synapomorphy Analyses
Searching for synapomorphic characters or strings of characters may be useful for systematic purposes
Questions?
That's it!