
An Introduction to Molecular Phylogeny

Dr. Kerstin Hoef-Emden, Universität zu Köln, Botanisches Institut, Gyrhofstr. 15, 50931 Köln

What is molecular phylogeny?

phylon = Greek for stem
genesis = Greek for origin

molecular phylogeny = studying relationships among organisms using molecular markers (e.g. DNA or protein sequences)

dissimilarities among sequences = genetic divergence caused by mutations during the course of time

Molecular Phylogenetic Methods

- their accuracy can be tested in in silico simulations
- are based on assumptions about the processes of molecular evolution
- may be computationally intense (this refers more to CPU time than to memory)
- may be sensitive to artefacts
- usually results are displayed as trees

Accuracy of Molecular Phylogenetic Methods

consistency = Does a method reconstruct the correct tree given an infinite amount of data? (All methods do, if assumptions are not violated.)

efficiency = How quickly does a method converge to the correct tree with a finite amount of data? (The less data is needed to infer the correct tree, the more efficient the method.)

robustness = How well does a method perform if the assumptions about the evolutionary process are violated?

How to test phylogenetic methods?

e.g. by simulation in silico (= in the computer)

a) Simulate the evolution of a randomly chosen DNA or protein sequence under a given evolutionary model and tree topology into several lineages.
b) Use the phylogenetic method under test to infer a phylogenetic tree.
c) Does the resulting phylogenetic tree correspond to the true tree?
d) Modify the tree topology to different extremes in branch lengths and repeat the test.
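Step a) can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the Jukes-Cantor model, a fixed four-taxon topology ((A,B),(C,D)), and the hypothetical function names `evolve_jc` and `simulate_four_taxa`. Branch lengths are expected substitutions per site.

```python
import math
import random

BASES = "ACGT"

def evolve_jc(seq, branch_length, rng):
    """Evolve a sequence along one branch under the Jukes-Cantor model."""
    # Probability that a site ends the branch in a different base.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # Under Jukes-Cantor the three other bases are equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_four_taxa(length, short, long, seed=1):
    """Simulate ((A,B),(C,D)) with long terminal branches leading to A and C."""
    rng = random.Random(seed)
    root = "".join(rng.choice(BASES) for _ in range(length))
    left = evolve_jc(root, short, rng)    # ancestor of A and B
    right = evolve_jc(root, short, rng)   # ancestor of C and D
    return {
        "A": evolve_jc(left, long, rng),
        "B": evolve_jc(left, short, rng),
        "C": evolve_jc(right, long, rng),
        "D": evolve_jc(right, short, rng),
    }

if __name__ == "__main__":
    for name, seq in simulate_four_taxa(60, 0.05, 0.5).items():
        print(name, seq)
```

The simulated sequences would then be passed to the method under test (step b), and the inferred topology compared with ((A,B),(C,D)) (step c). Varying `short` and `long` corresponds to step d.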

Phylogenetic Methods and Real Life Sequences

- The true tree is unknown; each inferred tree represents a hypothesis.
- No infinite amounts of data are available (no nuclei with infinite space, which contain infinite amounts of DNA).
- By using robust and efficient methods and appropriate evolutionary models, the inferred trees hopefully converge to the real phylogeny as closely as possible.
- The simulation studies give some hints about potential vulnerabilities of the phylogenetic methods.

Trees: Nomenclature

terminal branch internal branch

internal node = unknown ancestor = extinct taxon

terminal node = operational taxonomic unit (OTU) = contemporary taxon

mathematics: branch = edge; node = vertex (plural: vertices)

Tree Types: Unscaled Trees


slanted cladogram rectangular cladogram

disadvantage: no information about evolutionary rates in a tree

Tree Types: Scaled Trees


unrooted phenogram rooted phenogram

disadvantage: direction of evolution is unknown
advantage: higher resolution

Treefile Formats

Nexus
#NEXUS
Begin trees; [Treefile saved Thu Sep 16 03:12:19 2004]
Translate
1 MPorph,
2 S11679,
3 CCMP736,
4 MCont316,
5 S1382,
[...]
21 UTEX637
;
tree PAUP_1 = [&U] (1:0.096043,(((((2:0.041968,12:0.011298):0.014339,(13:0,20:0):0.012408):
0.012250,3:0.188535):0.027987,(4:0.107423,5:0.115320):0.013260,((((14:0.001531,15:0):
0.038667,19:0.005060):0.000989,18:0.018300):0.015597,(17:0.006567,21:0.029673):0.017735):
0.009011):0.011165,(((6:0.021826,9:0.009905):0.002602,(7:0.014488,11:0.066979):0.005662):
0.014995,(8:0.046098,10:0.079671):0.010953):0.009027):0.020750,16:0.048813);
End;

Phylip (Newick)
(MPorph:0.094736,(((((S11679:0.041475,U1424:0.011216):0.014077,(M1712:0,S10379:0):0.012338): 0.011839,CCMP736:0.183396):0.027333,(MCont316:0.106050,S1382:0.117050):0.013282, ((((S3794:0.001532,S3694:0):0.038496,S899:0.004933):0.001127,S3194:0.018102):0.015038, (S4094:0.006579,UTEX637:0.029760):0.018250):0.009256):0.011383, (((Chondrus:0.021641,S13531a:0.009722):0.002405,(S4194a:0.014451,S1896:0.065892):0.005690): 0.014595,(S4194b:0.046357,S13531b:0.078882):0.010916):0.008914):0.020599,S5981:0.048038);
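To make the Newick notation more concrete, the following sketch extracts the taxon labels from a Newick string. It is a minimal illustration, not a full parser: the function name `leaf_labels` is hypothetical, the example tree is a shortened made-up string in the same style as the tree above, and quoted labels containing spaces or punctuation are not handled.

```python
import re

def leaf_labels(newick):
    """Return the terminal node labels of a Newick string in order of appearance."""
    # A leaf label directly follows "(" or ","; internal branch lengths
    # follow ")" and are therefore not matched.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

if __name__ == "__main__":
    # Shortened, hypothetical example tree (label:branch length).
    example = "(MPorph:0.09,((S11679:0.04,U1424:0.01):0.01,CCMP736:0.18):0.02,S5981:0.05);"
    print(leaf_labels(example))
```

Each pair of parentheses encloses one subtree, and the number after a colon is the length of the branch leading to that node.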

Displaying Trees
Paup for MacOS 9: graphical output to screen, file or printer; nexus or Newick format (unscaled, scaled, rooted and unrooted trees)
Phylip: treefile to graphics converter; Newick format (unscaled, scaled, rooted and unrooted trees)
Paup for Windows or portable format (Unixoids): auxiliary program necessary

- Phylip converter programs
- TreeView
  for MacOS 9 (and Windows?): all tree types and tree editing; nexus and Newick
  for Unixoids: no unrooted trees, no tree editing; nexus and Newick
- TreeEdit (MacOS), Treetool etc.

Phylogeny Programs (1)

General purpose
Paup 4b: Windows, MacOS 9, Unixoids (Linux, Solaris, MacOS X etc.)
Phylip 3.62: Windows, MacOS 8, 9, X, Linux, C sources

Bayesian Analyses
MrBayes 3: Windows, MacOS, C sources (Unixoids)

Links to phylogeny- related software collected by Joe Felsenstein: http://evolution.genetics.washington.edu/phylip/software.html

Phylogeny Programs (2)


Paup* 4b10 = Phylogenetic Analysis Using Parsimony (* and other methods)
Written by David Swofford. First versions up to Paup 3 were available for MacOS < 9 only and were focused on the parsimony method. Paup 4 is available for different OS and one of the most powerful tools for phylogenetic analyses concerning nucleotide sequences. Sold as a beta version, but more stable than some sold final versions of other software.

MacOS 9 PPC: graphical user interface (i.e. mouse driven) and graphical output of trees
All others (Windows, Unixoids): command line (can be submitted to batch queues, unfortunately no checkpointing)
Distributor in Europe: Palgrave-MacMillan, UK (Windows and MacOS 9 PPC; GBP 62/72)
Distributor in USA (and for portable versions): Sinauer Associates (USD 85-150)

Phylogeny Programs (3)

Phylip 3.62 = Phylogenetic Inference Package


Written by Joe Felsenstein. Multiple purpose package for nucleotide as well as protein sequences. Approx. 30 different programs to fulfil different tasks. Freely available over the internet (http://evolution.gs.washington.edu/phylip.html).

Precompiled for: MacOS 8/9 PPC, MacOS X, Windows, Red Hat Linux (i386)
C sources for all other Unixoids
User interface: text-based menu system

Phylogeny Programs (4)


MrBayes 3
Written by John Huelsenbeck and Fredrik Ronquist. Specialised in Bayesian analyses. Handles nucleotide as well as protein sequences. Partitioned computation of concatenated data sets. Freely available over the internet (http://morphbank.ebc.uu.se/mrbayes/).

Runs under MacOS X, Windows, Unixoids
C sources
User interface: command-line driven; syntax similar to Paup

Parallelised version available; no checkpointing.

Phylogeny Programs (5)

Input Formats

Paup: nexus format (interleaved or sequential)
Phylip: phylip format
MrBayes: nexus format (interleaved or sequential)

Nexus Format

#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=6;
TAXLABELS
'S3694'
'S5981'
'S4094'
'S3194'
'S899'
'S10379'
;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=14;
FORMAT DATATYPE=NUCLEOTIDE GAP=;
MATRIX
[1] 'S3694' cCCAAGCGTTTCCG
[2] 'S5981' CCCAATCGTTTCCC
[3] 'S4094' CCCAATCGTTTCCG
[4] 'S3194' GCCAATCGTTTCCG
[5] 'S899' CCCAAGCGTTTCCG
[6] 'S10379' CCCAATCGTTTCCG
;
END;

Phylogeny Programs (6)

Phylip Format
6 14
S3694     cCCAAGCGTTTCCG
S5981     CCCAATCGTTTCCC
S4094     CCCAATCGTTTCCG
S3194     GCCAATCGTTTCCG
S899      CCCAAGCGTTTCCG
S10379    CCCAATCGTTTCCG

Work-Flow

aim: group of organisms or gene family
-> choice of molecular marker(s) and taxon sampling
-> amplification/sequencing
-> alignment
-> choice of evolutionary model
-> phylogenetic analyses
-> tree(s)
-> user-defined trees and topology testing
-> results

(feedback arrows in the chart: improvement of earlier steps)

Taxon Sampling

Strategy for an initial Taxon Sampling

- the diversity of the group should be represented (guessing by looking at phenotype or using systematics of group), e.g. combinations of morphological characters or representatives of all species/genera of a group or serotypes or ...
- at least two representatives of each presumed clade (guessing)
- not too few taxa (> 15)
- outgroup taxa (closest related sister group only!)

Choice of Molecular Marker(s): Phylogenies of Gene Families (1)

All orthologues and paralogues (or alleles) of a gene in an organism have to be sequenced!

Why?

Choice of Molecular Marker(s): Phylogenies of Gene Families (2)


A very fictitious example for a weird tree caused by an incomplete sampling of a gene family (taxon sampling also not recommended).

[Figure: a correct tree and a very bad tree, both with the taxa Homo, Drosophila, Arabidopsis and Chlamydomonas]

Choice of Molecular Marker(s): Phylogenies of Organisms (1)

- choose single copy genes (protein-coding) or highly synchronised genes (ribosomal DNA)
- choose more variable genes for closely related organisms and conserved genes for more distantly related organisms
- in sexually reproducing organisms, two alleles may occur

Choice of Molecular Marker(s): Phylogenies of Organisms (2)

e.g. the eukaryotic ribosomal operon


SSU rDNA - ITS1 - 5.8S rDNA - ITS2 - LSU rDNA

conserved = potentially suited for phylogenies of genera or higher level taxa
highly variable = potentially suited for phylogenies of species or lower level taxa

Choice of Molecular Marker(s): Phylogenies of Organisms (3)

some examples for more conserved genes:
- actin
- elongation factor 1 (EF-1)
- rbcL
- tubulins
and lots more ...


DNA Amplification (1)

[Figure: flow chart with the labels genomic DNA, mRNA, RT-PCR, PCR, cDNA, cloning and template for sequencing]

DNA Amplification (2)

                genomic DNA                   cDNA
advantage       introns (add information)     no introns (just ORF)
disadvantage    introns (splice sites?)       no introns

DNA Amplification (3)


Taq polymerase: no proofreading = introduces reading errors (predominantly transitions)

direct sequencing (large template pool) -> usually no problem

cloning of PCR products -> problem!
solutions:
a) proofreading polymerase instead of Taq
b) sequence more than 2 clones (better more than three; only an option if no allelic variation can be expected!)

Sequencing

Reduce/avoid sequencing errors

- sequencing of forward and reverse strands
- thorough proofreading
  ribosomal RNA sequences: secondary structure
  protein sequences: translation (stop codons?)
- BLAST search (PCR contamination, chimaeric sequences?)


Alignment (1)

automatic alignment
- good as a starting point for an alignment
- not good if sequences contain a lot of indels and highly variable regions (e.g. non-coding regions such as ITS or intron sequences, or variable regions in ribosomal RNA sequences)

proteins: check alignment afterwards by eye
ribosomal RNA, intron and ITS sequences: manual editing always needed

Alignment (2)

Second round of proofreading

Unusual amino acids in the translated sequence? Deviations in a highly conserved region of ribosomal RNA? One G whereas all others have two in a highly conserved region?

- > back to the assembly data and cross checking (it may be true, though!)

Alignment (3)

The alignment is the very basis of the phylogenetic analyses. Software cannot differentiate between a real mutation and a sequencing or alignment error.

Effects of an Erroneous Alignment

Decreased resolution. Worst case: artefactual tree topology.

Alignment (4)

Preparation of the Alignment for the Phylogenetic Analyses

Exclusion of nonalignable regions and saving of the data set in nexus (PAUP) or phylip (Phylip, PAML, Molphy) format depending on the software used for phylogenetic analyses. In protein- coding sequences: perhaps excluding third codon position

Alignment (5)
e.g. ribosomal DNA

CCMP152 TAGGAAATCTAGAGCTAATACATGCACCATCGCTCTAATTTGATATTTT-------
M1303   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT------
M2180   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT------
S9772e  TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTTACAATATCTAA----
S9772b  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT----
C9772a  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT----
M1703   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
M1481   TAGGAATTCTAGAGCTAATACATGCACCATCGCTTTTTTTTCTTTTTTCTTTTTTCTT
SB9801  TAGGAATTCTAGAGCTAATACATGCACCATCGTTTTTCTTGACAGGAAGGAAGAAAAA
M1318   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
M1312   TAGgAATTCTAGAGCTAATACATGCACCATAGCCTTTTGTAATTTTTTTTAAAGTTTT
Hruf    TAGGAATTCTAGAGCTAATACATGCCCCATCGCTTTCGAAGTTTTTTAATTTTTTTTC
mask    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxx

CTC = discussion zone between alignable and nonalignable regions
TTT = highly variable region -> exclude from analyses

Alignment (6)

Properties of an alignment editor suited for phylogenetic purposes

- multiple sequence alignment
- limits of sequence number and length sufficiently high?
- manual editing
- protected mode (no deletion of nucleotides)
- several import/export formats (e.g. clustal for automated pre-alignment)
- for phylogenetic analyses: nexus/phylip format export
- definition of a mask to exclude non-alignable regions

Alignment (7)
Automatic Alignment Tools
Align, Clustal W and relatives, T-Coffee, Malign, TreeAlign, PileUp and others ...

Manual Alignment
BioEdit, SeaView, SeAl, ARB (RNA-coding DNA), DCSE (RNA-coding DNA), SeqLab (GCG), MacVector and others ...

Questions?


Evolutionary Models (1)

consist of a combination of the following parameters:

- base frequencies
- substitution rate matrix
- proportion of invariable sites
- gamma-distributed among-site rate variation
- (covarion/covariotide)

Evolutionary Models (2)


base frequencies
percentages of A, C, G or T in the alignment; can be set to:
- equal (0.25 for each nt)
- empirical (= computed from the alignment)
- estimate (= optimised as a likelihood parameter)
- manually set

homogeneity of base frequencies among taxa: the program Paup performs a chi-square test for biased base frequencies (command: basefreqs).

Evolutionary Models (3)

substitution rate matrix assumptions about substitution rates of point mutations

Jukes-Cantor model: All point mutations occur at the same rate (number of substitution types [nst] = 1).
Hasegawa-Kishino-Yano and Kimura 2-parameter: Differing rates for transitions and transversions (nst=2).
General time reversible model (GTR): Each type of substitution has a different substitution rate; a substitution and its reversal are considered equally likely (nst=6).

Evolutionary Models (4)


substitution rate matrix

nst=2 (i = transition, v = transversion)
     A  C  G  T
A    -  v  i  v
C    v  -  v  i
G    i  v  -  v
T    v  i  v  -

nst=6
     A  C  G  T
A    -  a  b  c
C    a  -  d  e
G    b  d  -  f
T    c  e  f  -

nst=6, but 3 rate classes (the Tamura-Nei model)
     A  C  G  T
A    -  a  b  a
C    a  -  a  e
G    b  a  -  a
T    a  e  a  -

equal rates for both directions

Evolutionary Models (5)


proportion of invariable sites and gamma- distributed among- site rate variation
DNA sequences do not evolve at the same rate in all positions.

protein-coding sequences: faster rates at the third position
ribosomal DNA, internal transcribed spacers: alternating pattern of more conserved and highly variable regions correlating with secondary structure (helices and unpaired regions).

Evolutionary Models (6)

protein-coding genes: degenerate code

codon position   123   123   123
DNA              TCA   CGA   GTA
                 TCC   CGC   GTC
                 TCG   CGG   GTG
                 TCT   CGT   GTT
protein          Ser   Arg   Val

ribosomal DNA

DNA: TACCATGAAAAAGTGGAC
[Figure: RNA secondary structure with paired (U-G, A-U, C-G) and unpaired bases]

red = highly variable positions

Evolutionary Models (7)

proportion of invariable sites = proportion of positions, which do not evolve

gamma-distributed among-site rate variation = nucleotides evolve at differing rates in differing positions; modelled by the shape parameter α of the gamma distribution

both parameters may be used separately or can be combined

Evolutionary Models (8)


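gamma-distributed among-site rate variation

[Figure: proportion of sites plotted against substitution rate for gamma distributions with different shape parameter values (curve labels: 0.25, 1 and ~ 10); shown as continuous gamma distribution and as discrete gamma distribution]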

Evolutionary Models (9)

Covarion/Covariotide

Individual sequences or lineages evolve faster than others. = not evolving according to a molecular clock.

Not implemented in most phylogeny programs (MrBayes is an exception).

Evolutionary Models (10)

Example for a command line in Paup 4b to calculate the parameters of a Tamura- Nei model with unequal base frequencies, proportion of invariable sites and gamma distribution (a tree has to be available in the memory)

lscores 1/ nst=6 basefre=est rmat=est rclass=(a b a a e a) pinv=est rate=gam shape=est;

Evolutionary Models (11)

different combinations of base frequencies + substitution rate matrix + proportion of invariable sites/gamma distribution = 56 evolutionary models in the program Paup 4b

How to decide which model fits a data set best?

Choice of Evolutionary Model (1)

The program Modeltest 3.5 (by Posada and Crandall) performs hierarchical likelihood ratio tests (hLRT) and also computes the Akaike information criterion (AIC).

Modeltest consists of a command file for Paup (modelblock) and an executable (Posada Lab at http://darwin.uvigo.es/)

Choice of Evolutionary Model (2)


Running Modeltest

1.) Start Paup and load data set.
2.) Load the modelblock of Modeltest into Paup with the command execute.
3.) Paup will follow the commands given in the modelblock: First a tree is constructed using the simplest and fastest method. Then Paup computes the likelihood values for all 56 evolutionary models for the data set given the tree. The likelihood scores of the 56 models are saved in a file called model.scores.
4.) The Modeltest executable is started and fed with the model.scores file. It performs the hLRT and AIC tests and saves the results to a file.

Choice of Evolutionary Model (3)

Testing models of evolution - Modeltest Version 3.06
(c) Copyright, 1998-2000 David Posada (dp47@email.byu.edu)
Department of Zoology, Brigham Young University
WIDB 574, Provo, UT 84602, USA
_______________________________________________________________
Wed Sep 15 21:33:13 2004
Input format: Paup matrix file

** Log Likelihood scores **
                   +I          +G        +I+G
JC          3853.2573   3814.9705   3806.2795
F81         3843.7336   3806.0015   3797.3303
K80         3852.4849   3814.0562   3805.3757
HKY         3842.8232   3804.8003   3796.1357
TrNef       3852.4060   3813.3804   3804.7378
TrN         3842.6401   3804.7976   3796.1355
K81         3851.4536   3813.0771   3804.3674
K81uf       3842.2886   3804.3494   3795.6624
TIMef       3851.3740   3812.3914   3803.7239
TIM         3842.1130   3804.3457   3795.6621
TVMef       3846.7188   3807.1191   3798.2488
TVM         3840.2319   3802.2058   3793.3215
SYM         3846.6523   3806.4158   3797.5940
GTR         3839.9839   3802.2041   3793.3086

Choice of Evolutionary Model (4)

** Hierarchical Likelihood Ratio Tests (hLRTs) **

Equal base frequencies
  Null model = JC               -lnL0 = 4124.0898
  Alternative model = F81       -lnL1 = 4117.4712
  2(lnL1-lnL0) = 13.2373        df = 3
  P-value = 0.004151
Ti=Tv
  Null model = F81              -lnL0 = 4117.4712
  Alternative model = HKY       -lnL1 = 4117.0146
  2(lnL1-lnL0) = 0.9131         df = 1
  P-value = 0.339297
Equal rates among sites
  Null model = F81              -lnL0 = 4117.4712
  Alternative model = F81+G     -lnL1 = 3806.0015
  2(lnL1-lnL0) = 622.9395       df = 1
  Using mixed chi-square distribution
  P-value = <0.000001
No Invariable sites
  Null model = F81+G            -lnL0 = 3806.0015
  Alternative model = F81+I+G   -lnL1 = 3797.3303
  2(lnL1-lnL0) = 17.3423        df = 1
  Using mixed chi-square distribution
  P-value = 0.000016
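The arithmetic behind a single likelihood ratio test can be retraced with a short sketch. It is not Modeltest's code: the function names `chi2_sf` and `lrt` are hypothetical, and the chi-square tail probability is computed from a closed form for integer degrees of freedom so that only the Python standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        total = sum(h ** j / math.factorial(j) for j in range(df // 2))
        return math.exp(-h) * total
    # Odd df: complementary error function plus a finite sum.
    total = sum(h ** (j - 0.5) / math.gamma(j + 0.5)
                for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * total

def lrt(neg_lnl_null, neg_lnl_alt, df):
    """Return the test statistic 2(lnL1 - lnL0) and its chi-square P-value."""
    delta = 2.0 * (neg_lnl_null - neg_lnl_alt)
    return delta, chi2_sf(delta, df)

if __name__ == "__main__":
    # JC versus F81 with the -lnL values from the output above.
    delta, p = lrt(4124.0898, 4117.4712, 3)
    print(f"2(lnL1-lnL0) = {delta:.4f}, P = {p:.6f}")
```

For the tests marked "Using mixed chi-square distribution", the P-value of the df = 1 test is halved (a 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom), i.e. `0.5 * chi2_sf(delta, 1)`.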

Choice of Evolutionary Model (5)


Model selected: F81+I+G
-lnL = 3797.3303

Base frequencies:
  freqA = 0.3045
  freqC = 0.2348
  freqG = 0.2328
  freqT = 0.2279
Substitution model: All rates equal
Among-site rate variation
  Proportion of invariable sites (I) = 0.4495
  Variable sites (G)
  Gamma distribution shape parameter = 0.6163
[...]

BEGIN PAUP;
Lset Base=(0.3045 0.2348 0.2328) Nst=1 Rates=gamma Shape=0.6163 Pinvar=0.4495;
END;

Questions?


Phylogenetic Analysis Methods

Combination of an Optimality Criterion with a Tree Search Algorithm

1) Optimality Criterion
Scoring method to decide which tree is the best
e.g. maximum parsimony, distance analysis, maximum likelihood

2) Tree Search Algorithm
Method to construct a tree
e.g. exhaustive search, branch-and-bound, heuristic search, quartet puzzling, neighbor-joining

Phylogenetic Analysis: Maximum Parsimony

The tree which requires the fewest mutation steps to explain the nucleotide pattern of an alignment is the best. Each point mutation equals one point in the scoring system, thus in unweighted parsimony only integer values are possible as scores.

best tree = maximum parsimony tree (MPT) if several trees are equally scored = equally parsimonious trees (EPT)

Problem: Evolutionary model is implicit and cannot be adapted to the data set. All mutations at all positions are considered equal, even in more variable regions.
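How mutation steps are counted on a given tree can be illustrated with Fitch's algorithm for unweighted parsimony. The sketch below is an assumption-laden illustration, not the implementation used in Paup: trees are written as nested 2-tuples of taxon names, and the function names `fitch_site` and `parsimony_score` are hypothetical. The four short sequences are the ones used in the maximum likelihood example of these slides.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):              # terminal node
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                             # no change needed at this node
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment positions."""
    n_sites = len(next(iter(alignment.values())))
    score = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        score += fitch_site(tree, column)[1]
    return score

if __name__ == "__main__":
    alignment = {"Seq1": "ATTA", "Seq2": "ACTA", "Seq3": "CCTA", "Seq4": "GGTG"}
    tree = (("Seq1", "Seq2"), ("Seq3", "Seq4"))
    print(parsimony_score(tree, alignment))
```

For these four positions, the alternative grouping (("Seq1", "Seq3"), ("Seq2", "Seq4")) receives the same score, so the two trees would be equally parsimonious.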

Phylogenetic Analysis: Distance Matrix Methods (1)

All sequences of an alignment are compared pairwise with each other. Each pair is assigned an evolutionary distance value expressing the degree of divergence. The results are listed in a distance matrix, which is used to construct a tree.

To calculate the distances, one out of the 56 different evolutionary models can be chosen or the estimators of the maximum likelihood method can be used.

The tree with the shortest sum of distances is the best. Since the distances are no integers, there is usually only one best tree.
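As a worked example of a pairwise distance, the following sketch computes the uncorrected proportion of differing sites (p-distance) and the Jukes-Cantor corrected distance d = -3/4 ln(1 - 4p/3) for two sequences of the Nexus example shown earlier. The function names are hypothetical, and the Jukes-Cantor model is chosen here only because it has the simplest formula.

```python
import math

def p_distance(a, b):
    """Proportion of differing sites, ignoring positions with gaps or ambiguities."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    differences = sum(1 for x, y in pairs if x != y)
    return differences / len(pairs)

def jc_distance(p):
    """Jukes-Cantor correction for multiple substitutions at the same site."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

if __name__ == "__main__":
    s3694 = "cCCAAGCGTTTCCG"
    s5981 = "CCCAATCGTTTCCC"
    p = p_distance(s3694, s5981)
    print(f"p = {p:.4f}, d(JC) = {jc_distance(p):.4f}")
```

The corrected distance is always at least as large as the p-distance, because it accounts for substitutions that are hidden by later changes at the same position.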

Phylogenetic Analysis: Distance Matrix Methods (2)

paup> showdist

HKY85 distance matrix
                    1        2        3        4        5        6        7
1 MPorph            -
2 S11679      0.15819        -
3 CCMP736     0.16875  0.15158        -
4 MCont316    0.15953  0.13655  0.18535        -
5 S1382       0.17529  0.14077  0.18336  0.15227        -
6 Chondrus    0.10471  0.10205  0.15863  0.12481  0.13192        -
7 S4194a      0.10967  0.10370  0.15845  0.12654  0.12807  0.03539        -

Phylogenetic Analysis: Maximum Likelihood (1)


Probabilistic method: Tries to find the tree that maximises the probability of observing the data in the alignment. Likelihood is expressed as negative natural logarithm (-lnL; the lowest -lnL is the best).

Computation Steps:
- A tree is given.
- For each position in the alignment, the site-wise likelihood is calculated. This includes all possible combinations of ancestral character states in the tree.
- The site-wise likelihoods of all positions of the alignment are multiplied (equivalently, the site-wise log likelihoods are summed), which results in the total log likelihood value.

Phylogenetic Analysis: Maximum Likelihood (2)


Example Alignment
        1234...
Seq 1   ATTA...
Seq 2   ACTA...
Seq 3   CCTA...
Seq 4   GGTG...

position 1 of 1500 positions
probabilities depend on evolutionary model settings
site-wise likelihood: sum of the probabilities of all 16 character combinations
total likelihood: product of the 1500 site-wise likelihoods (= sum of the 1500 site-wise log likelihoods)

[Figure: four-taxon tree with the tip states A, A, C, G at position 1 and two unknown ancestral states (?)]

four-taxon tree = 16 possible combinations of ancestral states

A-A A-C A-G A-T
C-A C-C C-G C-T
G-A G-C G-G G-T
T-A T-C T-G T-T
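The sum over the 16 combinations of ancestral states can be written out directly for a four-taxon tree. The sketch below rests on simplifying assumptions that are not in the slides: the Jukes-Cantor model, the same length `t` for all five branches, and the topology ((Seq 1, Seq 2),(Seq 3, Seq 4)). The function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of observing b after branch length t, starting from a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(tips, t):
    """Likelihood of one alignment column on the tree ((1,2),(3,4))."""
    s1, s2, s3, s4 = tips
    total = 0.0
    # x and y are the two unknown ancestral states: 4 x 4 = 16 combinations.
    for x, y in product(BASES, repeat=2):
        total += (0.25 * jc_prob(x, s1, t) * jc_prob(x, s2, t)
                  * jc_prob(x, y, t)
                  * jc_prob(y, s3, t) * jc_prob(y, s4, t))
    return total

def log_likelihood(seqs, t):
    """Total log likelihood: sum of the site-wise log likelihoods."""
    return sum(math.log(site_likelihood(column, t)) for column in zip(*seqs))

if __name__ == "__main__":
    seqs = ["ATTA", "ACTA", "CCTA", "GGTG"]
    print(f"-lnL = {-log_likelihood(seqs, 0.1):.4f}")
```

A real program additionally optimises each branch length and the model parameters for every tree it evaluates, which is what makes maximum likelihood computationally intense.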

Phylogenetic Analysis: Exhaustive Tree Search


All possible trees are calculated according to the chosen optimality criterion.

Theoretically good: Safest method to find the best tree! Problem: Computationally intense! With a lot of taxa impossible to do.

e.g. rooted bifurcating trees:

6 taxa = 945 trees 10 taxa = 34,459,425 trees 15 taxa = 213,458,046,676,875 trees
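The tree numbers above follow from the formula for rooted bifurcating trees, (2n-3)!! = 1 x 3 x 5 x ... x (2n-3) for n taxa. A short sketch (with the hypothetical function name `rooted_trees`) reproduces them:

```python
def rooted_trees(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2: (2n-3)!!"""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):   # 3, 5, ..., 2n-3
        count *= k
    return count

if __name__ == "__main__":
    for n in (6, 10, 15):
        print(f"{n} taxa = {rooted_trees(n):,} trees")
```

Each additional taxon multiplies the number of trees by the next odd number, which is why exhaustive searches become impossible so quickly.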

Phylogenetic Analysis: Branch- and- Bound Tree Search

Branch and bound is a speed- up procedure for exhaustive search. It also considers all possible trees.

The score/distance/likelihood of a randomly generated starting tree is calculated and used as a threshold. All trees that are already worse than this threshold during construction procedure are not finished, but skipped. If a tree turns out to be better, it is used as a new threshold.

Disadvantage: Still too time consuming for larger data sets.

Phylogenetic Analysis: Heuristic Tree Search (1)

The trees are considered to form a landscape called the treespace. The best trees are on top of the hills, the worst trees are in the valleys. Heuristic searches start with a random tree, which may be located in a valley and try to find the best tree located in the global maximum of the tree space by rearranging the branches of the starting tree.

Phylogenetic Analysis: Heuristic Tree Search (2)


1 Starts with a randomly generated tree, which may be in a valley.
2 Local rearrangements by exchanging neighbouring branches optimise the tree. The tree may end up in a local optimum only.
3 Global rearrangements help to cross the valley and find another hill.
4 The new tree is again rearranged by small exchanges to climb up the hill to the top. If this is the global optimum, further global rearrangements will not improve the tree.

Phylogenetic Analysis: Heuristic Tree Search (3)


Tree Rearrangement Methods

Nearest- Neighbour Interchange (NNI) = Adjacent branches are rearranged.

Subtree Pruning and Regrafting (SPR) = A branch with a subtree is removed from a tree and added between two nodes somewhere else in the tree (= one new tree).

Tree Bisection and Reconnection (TBR) = A tree is split into two subtrees and both parts are connected between all possible nodes of the other (= several new trees are considered).

Phylogenetic Analysis: Neighbor- Joining (1)


Preferred method to infer trees from distance matrices. Belongs to the clustering methods.

Computation steps:

1) Calculate net divergence of each taxon from the others, and compute a corrected distance for further use.
2) Start with a star-like tree (belongs to star-decomposition methods).
3) Join the two taxa with the lowest divergence.
4) Recalculate the distance matrix by treating the joined taxa as one.
5) Repeat steps 3 to 4 until all taxa are joined and the tree is resolved.
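Steps 1 and 3 can be sketched as follows. The net divergence r of a taxon is taken as the sum of its distances to all other taxa, and the corrected distance as d(i,j) - (r(i) + r(j)) / (N - 2), which is one common way to write the neighbor-joining criterion. The function name `nj_pair` and the five-taxon distance matrix are hypothetical illustration data.

```python
def nj_pair(dist):
    """Return the pair of taxa to join first and its corrected distance."""
    n = len(dist)
    net_divergence = [sum(row) for row in dist]
    best_value, best_pair = None, None
    for i in range(n):
        for j in range(i + 1, n):
            corrected = dist[i][j] - (net_divergence[i] + net_divergence[j]) / (n - 2)
            if best_value is None or corrected < best_value:
                best_value, best_pair = corrected, (i, j)
    return best_pair, best_value

if __name__ == "__main__":
    # Hypothetical symmetric distance matrix for the taxa a, b, c, d, e.
    dist = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    pair, value = nj_pair(dist)
    print(pair, round(value, 3))
```

In this matrix the smallest raw distance is between d and e, but the corrected distances favour joining a and b first. The correction keeps a taxon on a long branch from being grouped merely because one of its distances happens to be small.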

Phylogenetic Analysis: Neighbor- Joining (2)

first distance matrix

corrected distance matrix

recalculation of distance matrix

Phylogenetic Analysis: A Comparison of Methods

                      Maximum Parsimony       Distance Matrix           Maximum Likelihood
characters            discrete                continuous                discrete
evolutionary model    implicit                explicit                  explicit
tree search           heuristic tree search   neighbor-joining trees    heuristic tree search
remarks                                       fastest method,           robust method,
                                              large data sets           computationally intense

Phylogenetic Analysis: Paup Commands

Begin paup;
set autoclose increase=auto outroot=monophy;
outgroup 1-4;
set crit=p;
hsear addseq=rand nreps=10;
savetrees file=pars.tre brlens;
Lset Base=(0.3315 0.2201 0.2334) Nst=1 Rates=gamma Shape=0.9144 Pinvar=0.3911;
set crit=d;
dset dist=ml;
nj;
savetrees file=nj.tre brlens;
set crit=l;
hsear addseq=rand nreps=1;
savetrees file=ml.tre brlens;
quit;
End;

Phylogenetic Analysis: Bayesian Analysis (1)

Also uses likelihoods in its calculations, but is based on a formula introduced by Reverend Bayes and works with posterior probabilities.

Bayesian analysis starts with a set of a priori expectations about evolutionary model, tree topology and branch lengths. By examining the data (the alignment), the posterior probabilities of the hypotheses given the data are calculated using the Bayes formula. Since it is impossible to compute the complete joint posterior probability distribution of trees and evolutionary model parameters (a landscape with hills and valleys), samples are drawn using a Metropolis- coupled Markov chain Monte Carlo method.

Phylogenetic Analysis: Bayesian Analysis (2)

1) initialization of Markov chain with random tree and random evolutionary parameters -> calculation of probability
2) proposal of new state of chain with one changed parameter (topology, branch length or evolutionary model) -> calculation of probability
3) If P(Tnew)/P(Told) ≥ 1 -> accepting new state; if the ratio is < 1, a random number decides

This corresponds to one generation of the Markov chain. A chain is run over several thousands to millions of generations. Every 100th generation a tree and its parameters are sampled and saved to files. After a while, the chain starts circling around a probability optimum comprising the best trees.
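The acceptance rule in step 3 can be sketched as a small function. This is a generic Metropolis rule under assumptions, not MrBayes code: the function name `accept` is hypothetical, probabilities are handled as logarithms to avoid numerical underflow, and "a random number decides" is interpreted as accepting the new state with a probability equal to the ratio.

```python
import math
import random

def accept(log_p_new, log_p_old, rng):
    """Decide whether the chain moves to the proposed state."""
    log_ratio = log_p_new - log_p_old
    if log_ratio >= 0:                      # ratio >= 1: always accept
        return True
    # ratio < 1: accept with probability equal to the ratio
    return rng.random() < math.exp(log_ratio)

if __name__ == "__main__":
    rng = random.Random(42)
    # A proposal that is four times less probable than the current state.
    trials = 10000
    accepted = sum(accept(math.log(0.25), 0.0, rng) for _ in range(trials))
    print(f"accepted {accepted} of {trials} proposals")
```

Because worse states are accepted occasionally, the chain can leave a local optimum, and over many generations the states are visited in proportion to their posterior probability.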

Phylogenetic Analysis: Bayesian Analysis (3)

[Figure: cold chain (C) and heated chains (H1, H2, H3)]

Since the Markov chain may end up stuck in a local maximum, in addition to this cold chain also three so- called heated chains (H1 to H3) are initialized. These chains have lower thresholds to be able to jump over valleys more easily. From time to time a heated chain exchanges parameters (= Metropolis- coupled) with the cold chain, helping it to find the global optimum.

Phylogenetic Analysis: Bayesian Analysis (4)

Maximum Likelihood usually results in one optimal tree (sometimes also two), whereas Bayesian analysis results in a set of optimal trees.

Results of a Bayesian analysis after summarizing of the sampled trees and evolutionary model parameters:

- A tree file listing the trees according to their posterior and cumulative posterior probabilities.
- A list of credibility intervals and mean values for all parameters of the evolutionary model.
- A consensus tree with branch lengths and posterior probabilities indicating support for branches.

Phylogenetic Analysis: Bayesian Analysis (5)

Example command block for MrBayes to be attached to the nexus file.


Begin mrbayes;
set autoclose=yes;
lset nst=6 rates=invgamma ngammacat=4 covarion=yes;
mcmcp ngen=3500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=SSU;
mcmc;
quit;
End;

Phylogenetic Analysis: Bayesian Analysis (6)

Summary of analysis data

sump filename=SSU.p burnin=8000;
sumt filename=SSU.t burnin=8000;

All trees and parameters which were sampled prior to reaching the likelihood plateau (i.e. during the burn-in phase before arriving at the global optimum) are excluded from the summaries. The sump command results in a plot showing the likelihood values. If the burn-in was not properly removed, the command has to be repeated with a higher burnin value.

Questions?


Trees - Support for Branches: Bootstrap Analysis


In bootstrap analysis, single positions are randomly drawn from the alignment with replacement (imagine a lottery) and assembled into a new data set of the same size as the original alignment. As a result, some positions may occur several times, whereas others are excluded. This lottery is repeated at least 100 times, resulting in at least 100 subsamples of the original alignment. A phylogenetic analysis (MP, distance, ML) is done for each subsample. The results are summarised in a consensus tree.

In a 50 percent majority rule consensus tree, all branches that occur in at least 50% of the bootstrap trees are displayed; the bootstrap values are shown on the branches.

bootstrap values > 95% = significantly supported branches
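The lottery that produces one bootstrap data set can be sketched as follows. The function name `bootstrap_alignment` is hypothetical, and the short sequences are taken from the Nexus example earlier in these slides; the phylogenetic analysis of each data set and the consensus step are not shown.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Draw alignment columns with replacement; same size as the original."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for every taxon, so that
    # homologous positions stay together.
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

if __name__ == "__main__":
    alignment = {
        "S3694": "cCCAAGCGTTTCCG",
        "S5981": "CCCAATCGTTTCCC",
        "S3194": "GCCAATCGTTTCCG",
    }
    rng = random.Random(2004)
    for replicate in range(3):
        print(bootstrap_alignment(alignment, rng))
```

A full analysis would generate at least 100 such data sets, infer a tree from each, and count how often each branch occurs.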

Trees: Support for Branches Posterior Probabilities

The consensus tree resulting from a Bayesian analysis is a 50% majority rule consensus tree, inferred from the sampled trees of the Markov chain. Similar to the support values of a bootstrap consensus, the posterior probabilities express the proportion of sampled trees in which the respective branch is found.

Posterior probabilities are usually higher than bootstrap support values and have been the subject of debate.

Trees: The Long Branch Attraction Artefact (LBA) (1)


Artefacts

correct tree

result of phylogenetic analysis

Trees: The Long Branch Attraction Artefact (LBA) (2)


Explanation: Long branches indicate a higher rate of mutations.

a) A high rate of mutations results in multiple reversals to the original character state (= homoplasies).
b) In addition, a high mutation rate causes noise that blurs the signal in the sequences.

Consequence: Homoplasies are erroneously interpreted as indicators of relatedness. Choosing an inappropriate evolutionary model increases the vulnerability to LBA.

Trees: Potential Indicators for LBA (1)

Tree is ladderised at the root, i.e. most or all long branches emerge successively close to the root of the tree.

Trees: Potential Indicators for LBA (2)


Tree topologies differ depending on whether simple or complex evolutionary models are used.

[Figure: maximum parsimony tree compared with maximum likelihood tree (F81+I+Γ)]


Preventing/Reducing LBA
1.) Improved Taxon Sampling

Breaking up the long branches by adding related taxa.

more taxa = higher resolution by adding information to variable positions

2.) Improved Choice of Markers

Use several genes with differing evolutionary rates and concatenate. The influence of the long branches may be broken (also good: use of genes from different genomes, e.g. nuclear, mitochondrial, plastid).

more positions = higher resolution by extending the data matrix
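Concatenation itself is simple bookkeeping. The sketch below joins several single-gene alignments and records the position range of each gene, which is the information needed later to treat the genes as separate partitions. The function name `concatenate`, the gene names and the sequences are hypothetical; a taxon missing from a gene is padded with "?" here, which is an assumption of this example.

```python
def concatenate(genes):
    """Concatenate single-gene alignments.

    genes: list of (name, alignment) pairs; alignment maps taxon -> sequence.
    Returns the combined alignment and a list of (name, start, end) ranges,
    1-based and inclusive.
    """
    taxa = sorted(set().union(*(aln.keys() for _, aln in genes)))
    combined = {taxon: "" for taxon in taxa}
    ranges = []
    start = 1
    for name, aln in genes:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            # A taxon without this gene is filled with missing data.
            combined[taxon] += aln.get(taxon, "?" * length)
        ranges.append((name, start, start + length - 1))
        start += length
    return combined, ranges

# Two hypothetical genes with partly overlapping taxon sampling.
genes = [
    ("geneA", {"A": "ACGT", "B": "ACGA"}),
    ("geneB", {"A": "GGC", "C": "GGT"}),
]
combined, ranges = concatenate(genes)
```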

How to Handle Concatenated Data? (1)

Choice of Evolutionary Model

Different genes will most likely need different evolutionary models. Neither Paup nor Phylip allows for a partitioning of data. The more genes are included and the more divergent the data are in terms of evolutionary rates, the more likely Modeltest will propose the most complex evolutionary model, GTR+I+Γ.

How to Handle Concatenated Data? (2)


Choice of Evolutionary Model: MrBayes allows for a partitioning of data
Begin mrbayes;
    set autoclose=yes;
    log start filename=sum.log;
    charset NM=1-1564;
    charset ITS2=1565-1850;
    charset LSU=1851-2733;
    partition concP2=3:NM,ITS2,LSU;
    set partition=concP2;
    lset applyto=(1) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
    lset applyto=(2) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
    lset applyto=(3) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
    unlink statefreq=(all);
    unlink shape=(all);
    unlink revmat=(all);
    unlink switchrates=(all);
    prset ratepr=variable;
    mcmcp ngen=3500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=conc;
    mcmc;
    quit;
End;

How to Handle Concatenated Data? (3)


The Likelihood Summation Method

- Run a phylogenetic analysis with each single-gene data set and with the concatenated data set.
- Let Paup save the 1000 best trees resulting from each analysis.
- Concatenate the tree files.
- Calculate the likelihood scores for each of the trees with the lscores command in Paup, using only the single-gene alignments as data matrices.
- Load the score files for each data set into a spreadsheet program and calculate the sum of log likelihoods for each tree.
- Sort the trees according to their summed log likelihood.
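The last two steps (summing and sorting) can also be done outside a spreadsheet. The sketch below assumes that the scores have already been read from the score files into one list per gene, with the trees in the same order in every list, and that the values are log likelihoods (negative numbers). If a score file lists -ln L instead, the sign has to be flipped first. The function name and the numbers are hypothetical.

```python
def rank_by_summed_lnl(scores):
    """Sum the per-gene log likelihoods of each tree and rank the trees.

    scores: dict mapping gene name -> list of log likelihoods, one per tree,
    with the trees in the same order for every gene.
    Returns (tree number, summed lnL) pairs, best tree first.
    """
    n_trees = len(next(iter(scores.values())))
    totals = [sum(per_gene[i] for per_gene in scores.values())
              for i in range(n_trees)]
    # Tree numbers start at 1, as in a tree file.
    return sorted(enumerate(totals, start=1),
                  key=lambda item: item[1], reverse=True)

# Hypothetical scores of three trees under two single-gene alignments.
scores = {
    "gene1": [-1000.0, -1005.0, -1002.0],
    "gene2": [-2010.0, -2000.0, -2004.0],
}
ranking = rank_by_summed_lnl(scores)
```

In this toy case tree 1 is best for gene1 and tree 2 is best for gene2, but tree 2 has the highest summed log likelihood.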


Testing Tree Topology (1)

Kishino-Hasegawa and Shimodaira-Hasegawa tests are used to test hypothetical user-defined trees by comparison with the optimal tree.

Problem: The tests (the Kishino-Hasegawa test in particular) were designed to compare trees chosen independently of the data, but biologists use them to compare candidate trees with the optimal tree inferred from the same data.

Testing Tree Topology (2): Consel


Consel
- written by H. Shimodaira
- needs input files with site-wise log likelihoods
- accepts Paup, Molphy and PAML scorefiles
- consists of a suite of programs for different tasks
- C source code; Unix command line or DOS console
- performs:
  - Approximately unbiased test
  - Kishino-Hasegawa test (unweighted/weighted)
  - Shimodaira-Hasegawa test (unweighted/weighted)
  - bootstrap probabilities
  - posterior probabilities

Testing Tree Topology (3): Consel and Paup

Using Consel in Combination with Paup

1) Construct constraints to test a hypothesis.
2) Infer the optimal constraint trees with Paup (ML).
3) Concatenate all treefiles that are supposed to be subjected to the test.
4) Let Paup calculate the total log likelihood and the site-wise log likelihoods and save the data to a scorefile.
5) Use a text editor to delete superfluous data (all parameters of the evolutionary model).
6) Feed Consel with the scorefile.

Testing Tree Topology (4): Consel Commands

1) Generate bootstrap subsamples

makermt --paup scorefile.txt

2) Perform the test

consel scorefile

3) Generate an output file with the test results

catpv scorefile

Testing Tree Topology (5): How Does Consel Work?

makermt
Generates multiscale bootstrap subsamples from the site-wise log likelihoods in the scorefile.
- Bootstrap samples from site-wise log likelihoods = the RELL method
- Multiscale bootstrap = sizes of subsamples differ from the original data set
By default, makermt generates 10 sets of replicates, each with 10,000 subsamples (0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.2, 1.3, 1.4 and 1.5 times the size of the original data set), and an additional set with 10,000 subsamples of 1.0-fold size.

consel
Calculates the probabilities for KHT, SHT and normal bootstrap using the 10,000 subsamples of 1.0-fold size, and the probabilities for the AUT and the multiscale bootstrap using the multiscale bootstrap samples.
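The RELL idea, i.e. resampling site-wise log likelihoods instead of re-estimating each tree, can be sketched as follows. This is a simplified illustration, not Consel's implementation: the function name `rell_bootstrap`, the `scale` parameter (standing in for the multiscale subsample sizes) and the toy values are assumptions of this example, and only plain bootstrap proportions are computed, not the AU test.

```python
import random

def rell_bootstrap(site_lnl, n_reps, rng, scale=1.0):
    """Bootstrap proportion of each tree, estimated by RELL resampling.

    site_lnl: list with one entry per tree; each entry is the list of
    site-wise log likelihoods of that tree (same sites for all trees).
    scale: subsample size relative to the original number of sites.
    Returns the fraction of replicates in which each tree scores best.
    """
    n_sites = len(site_lnl[0])
    n_draw = max(1, round(scale * n_sites))
    wins = [0] * len(site_lnl)
    for _ in range(n_reps):
        # Draw sites with replacement and sum their log likelihoods per tree.
        sites = [rng.randrange(n_sites) for _ in range(n_draw)]
        totals = [sum(tree[i] for i in sites) for tree in site_lnl]
        wins[totals.index(max(totals))] += 1
    return [w / n_reps for w in wins]

# Hypothetical site-wise log likelihoods of two trees at six sites.
site_lnl = [
    [-1.0, -2.0, -1.5, -3.0, -2.5, -1.0],   # tree 1, better at every site
    [-1.2, -2.1, -1.9, -3.4, -2.6, -1.3],   # tree 2
]
rng = random.Random(42)
bp_normal = rell_bootstrap(site_lnl, 1000, rng)            # 1.0-fold size
bp_half = rell_bootstrap(site_lnl, 1000, rng, scale=0.5)   # 0.5-fold size
```

Because tree 1 is better at every site in this toy data set, it wins every replicate regardless of the subsample size. With real data the proportions change with the scale, and it is this change that the multiscale bootstrap exploits.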

Testing Tree Topology (6): Consel Output


catpv summarises the results of the test (the probability values)
# reading nm.pv
# rank item    obs     au     np |     bp     pp     kh     sh    wkh    wsh |
#    1    1  -15.8  0.957  0.942 |  0.944  1.000  0.936  0.994  0.936  0.999 |
#    2    6   15.8  0.064  0.055 |  0.054  1e-07  0.064  0.415  0.064  0.218 |
#    3    5   35.2  0.005  0.002 |  0.002  5e-16  0.006  0.079  0.006  0.027 |
#    4    4   36.3  0.002  0.001 |  4e-04  2e-16  0.005  0.063  0.005  0.019 |
#    5    7   56.8  7e-05  1e-04 |  2e-04  2e-25  3e-04  0.012  3e-04  0.001 |
#    6    2  129.4  2e-64  5e-21 |      0  6e-57      0      0      0      0 |
#    7    3  142.2  4e-06  5e-06 |      0  2e-62      0      0      0      0 |

rank = tree ranking
item = no. of tree in treefile
obs = log likelihood difference
au = p-values of the approximately unbiased test
np = p-values of the multiscale bootstrap
bp = p-values of the normal bootstrap
pp = posterior probabilities
kh = Kishino-Hasegawa test
sh = Shimodaira-Hasegawa test
wkh, wsh = weighted Kishino-Hasegawa and Shimodaira-Hasegawa tests

Phylogeny With Protein Sequences (1)

Due to the degeneracy of the genetic code, codon usage may be biased! This bias may affect not only the third codon position, but also the first and second.

Presumably it would be better to use protein sequences instead, but:

Protein alignments have 20 character states instead of 4 -> analyses, especially ML analyses, take much longer!

Using nucleotide data while considering nonsynonymous/synonymous substitutions and/or translating during the analysis -> this is also quite time-consuming!

Phylogeny With Protein Sequences (2)

Maximum likelihood analyses of protein sequences are usually based on substitution matrices derived from empirical data instead of estimating the substitution rate matrix from the data set.

e.g.
- Dayhoff (Dayhoff et al. 1978)
- JTT (Jones, Taylor and Thornton 1992)
- WAG (Whelan and Goldman 2001)

Phylogeny With Protein Sequences (3)

Paup: only limited possibilities, no substitution matrices included; no maximum likelihood

Programs for phylogenetic analyses of protein sequences

- Phylip (text-based menu): phylip format; maximum likelihood; PAM, JTT, PMB
- Tree-Puzzle (text-based menu): phylip format; maximum likelihood with quartet puzzling; Dayhoff, JTT, WAG, VT etc.
- PAML (Unixoids, DOS console): phylip format; maximum likelihood; Dayhoff, JTT, WAG etc.
- (Molphy [Unixoids]: phylip format; maximum likelihood; Dayhoff, JTT)
- MrBayes: Bayesian analysis

Phylogeny With Protein Sequences (4)

One possibility:

Calculate a tree and gamma categories using Tree-Puzzle 5.2 (Schmidt, Strimmer and von Haeseler 2004).

Use the gamma category estimates as settings to perform a maximum likelihood analysis with proml from the Phylip 3.62 package (Felsenstein 2004).

Phylogeny With Protein Sequences (5)


Text-based menu of Tree-Puzzle 5.2

GENERAL OPTIONS
 b                     Type of analysis?  Tree reconstruction
 k                Tree search procedure?  Quartet puzzling
 v       Approximate quartet likelihood?  Yes
 u             List unresolved quartets?  No
 n             Number of puzzling steps?  1000
 j             List puzzling step trees?  No
 o                  Display as outgroup?  Chondrus (1)
 z     Compute clocklike branch lengths?  No
 e                  Parameter estimates?  Approximate (faster)
 x            Parameter estimation uses?  Neighbor-joining tree
SUBSTITUTION PROCESS
 d          Type of sequence input data?  Auto: Amino acids
 m                Model of substitution?  Auto: JTT (Jones et al. 1992)
 f               Amino acid frequencies?  Estimate from data set
RATE HETEROGENEITY
 w          Model of rate heterogeneity?  Uniform rate

Quit [q], confirm [y], or change [menu] settings:

Phylogeny With Protein Sequences (6)

Phylip 3.62 suite of 30 programs

e.g. ML bootstrapping:
a) start seqboot = generate bootstrap samples from the data set
b) run dnaml or proml (depending on the data set)
c) run consense to create a consensus tree
d) the consensus treefile may be loaded into a tree-displaying program

Molecular Clock

Diverse

A molecular clock assumes that sequences evolve at equal rates in all lineages. This is usually not the case, and enforcing a clock puts the analysis under a constraint. The molecular clock hypothesis may be tested with a likelihood ratio test, similar to the tests Modeltest uses to compare evolutionary models. If fossils are available, one may try dating the divergences of lineages.
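A likelihood ratio test of the clock hypothesis can be sketched as below. The test statistic is twice the difference between the log likelihoods without and with the clock constraint. It is compared with a chi-square distribution, here with (number of taxa - 2) degrees of freedom, the usual choice for this test. The function names and the log likelihood values are hypothetical; the chi-square tail probability is computed with a closed-form series that is valid for integer degrees of freedom, so that no external library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(chi / math.sqrt(2))
            + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total)

def clock_lrt(lnl_clock, lnl_no_clock, n_taxa):
    """Return (test statistic, degrees of freedom, p-value) of the clock test."""
    stat = 2 * (lnl_no_clock - lnl_clock)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)

# Hypothetical log likelihoods of a 10-taxon data set.
stat, df, p = clock_lrt(lnl_clock=-5123.4, lnl_no_clock=-5110.2, n_taxa=10)
clock_rejected = p < 0.05
```

A small p-value means that the unconstrained tree fits significantly better, i.e. the molecular clock is rejected for this data set.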

Secondary Structure Analyses

may add information to the results (predominantly RNA-coding, ITS or intron regions)

Synapomorphy Analyses

Searching for synapomorphic characters or strings of characters may be useful for systematic purposes

Questions?

That's it!
