
Multiple sequence

alignment
Extension of pairwise sequence
alignment to a larger set
Allows identification of patterns across
larger numbers of sequences
Allows building of sequence profiles and
hidden Markov models (last week)
Basis for more sophisticated analysis
methods including phylogenetics
Multiple sequence
alignment example

Ovar STCVLSAYWKD-LNNYH
Bota STCVLSAYWKD-LNNYH
Susc STCVLSAYWRNELNNFH
Hosa STCMLGTY-QD-FNKFH
Rano STCMLGTY-QD-LNKFH
Sasa STCVLGKLSQE-LHKLQ
Multiple alignment uses
An important factor in judging the quality
of a multiple sequence alignment is
considering its use
This drives the interpretation of each
position in the alignment
The multiple alignment procedure may
not include this knowledge and
therefore be incorrect
Interpretation of
positions
Generally there are two interpretations
of a position in a multiple sequence
alignment:
Evolutionary/historical
Functional/structural
In many cases these are the same, but
they may not be.
Evolutionary/historical
Each position in the multiple sequence
alignment represents a conserved
historical position in the divergence of
the sequences from a common
ancestor.
Each nucleotide or amino acid
difference in a column is the result of a
mutation of that position.
Functional/structural
Each position in the multiple sequence
alignment represents a conserved
functional position: varying amino acids
in this position nevertheless are located
in the same place in the protein, and
share the same function
Generally applicable to amino acids
(and the corresponding DNA sequence)
Structural alignment
In the ideal case, we have the three-
dimensional structure of the proteins
and can create the multiple sequence
alignment based on the three
dimensional structure
This is the “Gold Standard” of multiple
sequence alignments, but is rarely
available.
What is a good multiple
sequence alignment?
As for the case of two sequences, the
best alignment can be defined as the
one in which the number of changes
between two sequences is minimised
Compute over all pairs of sequences:
for k sequences there are k(k-1)/2
pairwise comparisons
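As a rough illustration of scoring over all pairs, the sketch below counts the differing columns for every pair of rows in the example alignment from the earlier slide. Treating a gap against a residue as one change is an assumption made here for simplicity, and the function names are hypothetical.

```python
from itertools import combinations

# Example alignment from the earlier slide.
ALIGNMENT = {
    "Ovar": "STCVLSAYWKD-LNNYH",
    "Bota": "STCVLSAYWKD-LNNYH",
    "Susc": "STCVLSAYWRNELNNFH",
    "Hosa": "STCMLGTY-QD-FNKFH",
    "Rano": "STCMLGTY-QD-LNKFH",
    "Sasa": "STCVLGKLSQE-LHKLQ",
}

def pairwise_changes(a, b):
    """Count columns where two aligned rows differ (gap vs residue counts as 1)."""
    return sum(1 for x, y in zip(a, b) if x != y)

def sum_of_pairs_changes(alignment):
    """Total number of changes summed over all k(k-1)/2 pairs of rows."""
    return sum(pairwise_changes(alignment[p], alignment[q])
               for p, q in combinations(alignment, 2))

k = len(ALIGNMENT)
n_pairs = len(list(combinations(ALIGNMENT, 2)))
print(n_pairs, k * (k - 1) // 2)          # both count the pairs of rows
print(sum_of_pairs_changes(ALIGNMENT))    # lower is better under this criterion
```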
Multiple sequence
alignment algorithm
Ideal approach to multiple sequence
alignment is to extend dynamic
programming.
Instead of aligning two sequences (two
dimensional grid) we align k sequences
(k dimensional grid)
Extension is relatively straightforward
Dynamic programming
for sequence alignment
Recurrence relation
Tabular computation
Traceback
Pairwise recurrence relation (revision):
S(i,j) = max[S(i-1, j-1) + m(i,j), S(i-1, j) + g, S(i, j-1) + g]
m(i,j) = similarity matrix eg BLOSUM
g = gap penalty
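A minimal sketch of the tabular computation for this recurrence, returning only the final score (no traceback). As assumptions, a simple match/mismatch score stands in for the similarity matrix m(i,j), and the gap penalty g is linear with a value of -2.

```python
def nw_score(a, b, match=1, mismatch=-1, g=-2):
    """Global alignment score from S(i,j) = max(diagonal + m, up + g, left + g)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        S[i][0] = i * g
    for j in range(1, m + 1):
        S[0][j] = j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sim,   # align a[i] with b[j]
                          S[i - 1][j] + g,         # gap in b
                          S[i][j - 1] + g)         # gap in a
    return S[n][m]

print(nw_score("KYG", "KYG"))
print(nw_score("KYG", "KG"))
```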
There are seven cases
when aligning three
sequences
1 2 3 4 5 6 7
I KYG KYG KYG K-G KYG K-G K-G
J KYG KYG K-G KYG K-G KYG K-G
K KYG K-G KYG KYG K-G K-G KYG
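The seven cases are every way of putting a residue or a gap in the last column for each sequence, excluding the all-gap column. A small sketch (hypothetical function name) that enumerates them, and shows that k sequences give 2^k - 1 cases:

```python
from itertools import product

def column_cases(k):
    """All residue (1) / gap (0) patterns for one column of k sequences,
    leaving out the pattern in which every sequence has a gap."""
    return [c for c in product((1, 0), repeat=k) if any(c)]

for case in column_cases(3):
    print(case)
print(len(column_cases(3)), len(column_cases(4)))
```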
Three sequence
recurrence relation
S(i,j,k) = max[
S(i-1, j-1, k-1) + m(i,j) + m(i,k) + m(j,k),
S(i-1, j-1, k) + m(i,j) + g,
S(i-1, j, k-1) + m(i,k) + g,
S(i, j-1, k-1) + m(j,k) + g,
S(i-1, j, k) + g + g,
S(i, j-1, k) + g + g,
S(i, j, k-1) + g + g]
m(i,j) = similarity matrix eg BLOSUM
g = gap penalty
Two to three
dimensional grid
Extending dynamic
programming
Based on the extrapolation from two to
three sequences, we can define the
recurrence relation for any number of
sequences in the same way
The other steps - tabular computation
and traceback - are done in the same
way as for pairwise alignment
Dynamic programming
time increases
exponentially
Time taken for alignment by dynamic
programming is O(n * m) for two
sequences n, m characters long.
Time taken for alignment by dynamic
programming is O(n * m * p) for three
sequences n, m, p characters long.
Dynamic programming
time increases
exponentially
Clearly, for k sequences, each
sequence $n_i$ characters long, the time
required will be
$O\left(\prod_{i=1}^{k} n_i\right)$
This is exponential in k: $O(n^k)$ for sequences of similar length n
We need to fill out each ‘box’ in the grid
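To get a feel for the growth, the sketch below counts the boxes in the grid, assuming one axis of n_i + 1 positions per sequence (the extra position is the empty prefix). The sequence length of 300 is only an example value.

```python
from math import prod

def grid_cells(lengths):
    """Number of cells in the dynamic programming grid for the given lengths."""
    return prod(n + 1 for n in lengths)

for k in range(2, 7):
    print(k, grid_cells([300] * k))
```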
Heuristic multiple
sequence alignment
What shortcuts can we make?
How can we prevent calculation time
from growing exponentially with number
of sequences?
Heuristic multiple
sequence alignment
Currently, most practical methods are
hierarchical methods
For example, pairwise alignments,
defining hierarchy followed by
progressive addition of sequences to
alignment – Clustal W
Steps in multiple
alignment
Collect set of sequences that can be
aligned, eg based on BLAST matches
Edit sequences to contain alignable
regions. Eg chromosomal sequences
must be cut down to only contain gene
region of interest
Run multiple alignment program
Steps in multiple
sequence alignment
Assess alignment output, possibly re-
align with different parameters or
different program
Adjust alignment manually in some
cases, particularly regions with many
gaps
May need to remove unrelated
sequences included by accident, or edit
region of sequence included
Clustal W
Possibly most commonly used multiple
sequence alignment program
Based on Feng & Doolittle’s 1987 idea
of progressive alignment
ClustalW contains a number of
improvements over the earlier
Clustal V (1992); the original Clustal dates from 1988
ClustalW initial pairwise
alignment
All sequences aligned pairwise to all others:
k sequences give k(k-1)/2 alignments
ClustalW includes a choice of dynamic
programming (“slow/accurate”) or the
Wilbur/Lipman (“fast/approximate”)
adaptation of dynamic programming
Affine gap penalties – can adjust gap opening
and gap extension penalty
Wilbur-Lipman fast
pairwise alignment
Related to BLAST searching
Based on exactly matching k-tuples (1,2
for protein; 2-4 for nucleotide)
Extend alignment along diagonals with
many exact k-tuples
Adjacent diagonals considered when
extending alignment
All these parameters can be specified
Establishing hierarchical
order
Scores from pairwise alignments allow
us to create a hierarchy of the
sequences
Most similar sequences should be
aligned first
May need several intermediate
alignments that will later be joined
Multiple sequence
alignment example

Ovar STCVLSAYWKD-LNNYH
Bota STCVLSAYWKD-LNNYH
Susc STCVLSAYWRNELNNFH
Hosa STCMLGTY-QD-FNKFH
Rano STCMLGTY-QD-LNKFH
Sasa STCVLGKLSQE-LHKLQ
Hierarchy of addition
Align Ovar and Bota first
Align Hosa and Rano
Align Susc to Ovar and Bota
Align these two clusters to each other
Align Sasa to large alignment
Clustal W progressive
multiple alignment
Example illustrates all possible cases:
Align two sequences to each other
Align a sequence to an existing
alignment
Align two alignments to each other
Aligning alignments
Pairwise alignment of alignments is also
called profile alignment
Use dynamic programming.
S(i,j) = max[S(i-1, j-1) + m(i,j), S(i-1, j) + g,
S(i, j-1) + g]
m(i,j) = similarity score averaged over
characters at that position
g = gap penalty
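One possible reading of "similarity score averaged over characters at that position" is sketched below: every character in a column of one alignment is compared with every character in a column of the other, and the scores are averaged. The match/mismatch values and the choice to score any comparison involving a gap as 0 are assumptions for illustration, not ClustalW's actual scheme.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=0):
    """Average pairwise score between the characters of two alignment columns."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total / (len(col_a) * len(col_b))

# A column from an (Ovar, Bota) alignment against one from (Hosa, Rano).
print(column_score("LL", "FL"))
print(column_score("KK", "QQ"))
```

This value would play the role of m(i,j) in the recurrence when the rows and columns of the grid are alignment columns instead of single characters.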
Aligning alignments
Once sequences are aligned and gaps
introduced, these are not altered
(hierarchy)
Alignment finds a local optimum as
early alignment decisions are “locked in”
by the “greedy” algorithm
Early errors will be propagated and may
cause final alignment to be worse
ClustalW refinements
Alignment parameters vary throughout
alignment process
m(i,j): more distant sequences given
more weight when calculating the score
for positions i and j
m(i,j): different BLOSUM matrix used
based on similarity of sequences
ClustalW refinements
G – gap penalty:
Lower gap penalty at pre-existing gap
Higher gap penalty near pre-existing gap
Lower gap penalty in hydrophilic regions
as these are likely to be external loops
(protein only)
Gap penalty based on known amino acid
characteristics
ClustalW refinements
Note that the refinements are most
useful when aligning proteins
Protein structural information is valuable
in interpreting sequence
Attempt to codify biological, chemical
and physical knowledge in alignment
algorithm
ClustalW misapplied
Clustal W and other algorithms that
include a pairwise comparison step
must not be used to align sequences
that do not all share a common block
Sequences that do not share a common
block are generally from sequence
assembly projects
Sequence assembly
Multiple sequence
alignment
The principle of dynamic programming
can be extended to multiple sequences
Unfortunately, the time required grows
exponentially with the number of
sequences and sequence lengths
Algorithms in use are heuristic and most
are progressive/hierarchical
Clustal W
Three-step algorithm
Pairwise alignment of all pairs to
determine sequence similarity
Define an order of addition of
sequences to alignments based on
similarity
Construct multiple alignment
progressively based on defined order
Clustal W progressive
multiple alignment
Three possible cases:
Align two sequences to each other
Align a sequence to an existing
alignment
Align two alignments to each other

Always effectively pairwise alignments


Adjusting alignment
Because alignments are done with
heuristics and we may have biological
knowledge or use knowledge not
included in alignment, they are often not
perfect
It is frequently useful to align using
several methods or repeat methods, or
adjust alignment manually/by eye.
Active research
There are many other multiple
sequence alignment programs
Many use structural information to
construct an initial profile
Potential for development of new
algorithms
Ensure tradeoff between algorithm and
biological knowledge
Phylogenetic trees
[Figure: tree relating Human, Mouse, Drosophila, Honey bee, Fern, Wheat and Pine, drawn against a time axis]
Evolutionary history
Species are related through history
Likely one origin of life on earth
All current species are ultimately
descendants of the original life
Species specialise and diverge
gradually over time
Darwin: Origin of Species (1859)
Sequence evolution
Genome of species “carried along”
through history
Sequences can change: mutation etc
Gene duplication and specialisation
Many gene functions are centrally
important and are conserved – selection
to maintain protein function
Genome passed on
[Figure: tree relating Human, Mouse, Drosophila, Honey bee, Fern, Wheat and Pine, drawn against a time axis]
Sequence evolution
Most biological sequence analysis only
works because we assume evolution
Sequences are similar because they are
descendant copies (with variation) of
the same original sequence
Proteins with similar function and no
sequence similarity may suggest
independent origin of function
Evolutionary
relationships
We wish to find the evolutionary history
(phylogenetic tree) of a set of
sequences/multiple alignment.
Often the sequences are from different
species and the tree from the
sequences should be the same as the
tree of the species
The use of sequence
phylogeny
What if the sequences suggest a
different history than the known species
relationships?
The use of sequence
phylogeny
Phylogenetic trees from sequences can
increase understanding of species
relationships
Can improve estimates of times of
speciation and interpretation of fossil
record
Can increase understanding of
biological function at many levels
Implicit hierarchies
Tree/cluster relationships as represented by
phylogenetic trees are implicitly assuming that the
data is related in a hierarchy
Cannot represent more complex types of
relationships among data
Note that the ClustalW algorithm assumes a
hierarchy of relationships – heuristic depends on it
Examples of non-hierarchical data
Family relationships: I am related to all
my cousins but they are not related to
each other
Examples of non-hierarchical data
Map distances
New Hampshire (Newick)
Format
[Figure: tree of Human, Mouse, Drosophila, Honey bee, Fern, Wheat and Pine]
(((Human, Mouse), (Dros, Bee)), (Fern, (Wheat, Pine)));
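The nesting of parentheses mirrors the nesting of clusters in the tree. A minimal sketch that writes a tree held as nested Python tuples in this format; the tuple representation and the function name are assumptions for illustration, and branch lengths are left out.

```python
def to_newick(node):
    """Serialise a nested tuple of leaf names as a Newick string."""
    def walk(n):
        if isinstance(n, str):          # a leaf
            return n
        return "(" + ", ".join(walk(child) for child in n) + ")"
    return walk(node) + ";"

tree = ((("Human", "Mouse"), ("Dros", "Bee")), ("Fern", ("Wheat", "Pine")))
print(to_newick(tree))
```

Every opening parenthesis needs a matching close, so a quick balance check is a useful sanity test on hand-written Newick strings.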
Reconstructing
evolutionary trees from
multiple sequence
alignments
Three classes of methods:
Parsimony
Distance methods
Maximum likelihood
Area of active research
Reconstructing
evolutionary trees from
other kinds of data
Morphological characteristics, i.e.
physical appearance and anatomy
This is the traditional method for
determining how species are related
Generally uses parsimony method
Reconstructing
evolutionary trees from
other kinds of data
Genomic scale information is new and
potentially very valuable: patterns of
shared genes between species, order of
genes along chromosomes etc
Metabolic and biochemical data:
metabolic networks likely to be more
similar in closely related species
Reconstructing
evolutionary trees from
other kinds of data
Microarray data: which genes are being
expressed and how much?
Microarray data measures mRNA levels
Can compare mRNA levels across
species and in different circumstances
eg stress response
Cluster analysis
Non-evolutionary data may also be best
described hierarchically, generally
referred to as ‘cluster analysis’
Exactly the same principles as when
constructing evolutionary trees
Microarray expression patterns often
analysed with cluster analysis to identify
genes with related expression patterns
Methods for making
evolutionary trees
Methods need to be adapted to take
account of each data type
Nevertheless, most methods can be
classified as parsimony, distance or
maximum likelihood
These methods will be described next
week for multiple sequence alignments
Constructing
phylogenetic trees
Parsimony
Distance methods
Maximum likelihood
Bootstrapping
Parsimony
Oldest, most intuitive method.
Formally described by Hennig 1966
Find the tree in which the smallest
number of changes are needed to
explain the observed multiple sequence
alignment
Parsimony example

Human GGTTATCCTACATGTATA
Mouse ACTTGTCCAACGCGGACA
Rat   ACTCGTCCAACGTGCACA
Dog   AGCTGCCTTACGTACATA
Cat   AGCTGTCTTACGTACGTA

[Figure: four candidate trees for the five species; only the leaf order of each drawing is recoverable]

Cost (number of changes) at the first four sites and over all sites:

Tree (leaf order as drawn)      Site 1  Site 2  Site 3  Site 4  Total
Human, Dog, Cat, Mouse, Rat       1       1       2       1      16
Mouse, Dog, Cat, Human, Rat       1       2       2       1      21
Human, Mouse, Cat, Dog, Rat       1       2       2       1      21
Cat, Dog, Human, Mouse, Rat       1       1       1       1      15
Most parsimonious tree
The cost over all positions for a given
tree is the ‘tree length’
There may be several trees with the
same minimum tree length
The most parsimonious tree is then a
consensus of these trees
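The tree length can be computed site by site with Fitch's counting method, sketched below for the five sequences of the parsimony example. The topology is an assumption: the slide drawing is lost, and (((Cat, Dog), Human), (Mouse, Rat)) is one grouping consistent with the leaf order Cat, Dog, Human, Mouse, Rat.

```python
SEQS = {
    "Human": "GGTTATCCTACATGTATA",
    "Mouse": "ACTTGTCCAACGCGGACA",
    "Rat":   "ACTCGTCCAACGTGCACA",
    "Dog":   "AGCTGCCTTACGTACATA",
    "Cat":   "AGCTGTCTTACGTACGTA",
}

def fitch_site(tree, site):
    """Return (possible states, changes needed) for one column on a binary tree."""
    if isinstance(tree, str):                      # leaf: observed base, no cost
        return {SEQS[tree][site]}, 0
    left, right = tree
    lset, lcost = fitch_site(left, site)
    rset, rcost = fitch_site(right, site)
    common = lset & rset
    if common:                                     # children agree: no new change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1          # disagreement: one change

def tree_length(tree):
    """Sum of the per-site costs over the whole alignment."""
    n_sites = len(next(iter(SEQS.values())))
    return sum(fitch_site(tree, s)[1] for s in range(n_sites))

TREE = ((("Cat", "Dog"), "Human"), ("Mouse", "Rat"))   # assumed topology
print([fitch_site(TREE, s)[1] for s in range(4)])      # costs at sites 1 to 4
print(tree_length(TREE))                               # 15
```

With this topology the sketch gives costs of 1 at each of the first four sites and a tree length of 15, in agreement with the lowest-cost tree in the example.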
Parsimony
implementation
Number of unrooted bifurcating trees for n species:
$\frac{(2n-5)!}{(n-3)!\,2^{n-3}}$
Concentrate search “near” previously-
found good trees – branch swapping,
prune and re-graft
Branch-and-bound – eliminate clearly
inferior trees
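A short sketch evaluating the tree-count formula for unrooted bifurcating trees shows why exhaustive search stops being practical quickly. The function name is hypothetical.

```python
from math import factorial

def n_unrooted_trees(n):
    """(2n-5)! / ((n-3)! * 2^(n-3)) for n >= 3 labelled species."""
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))

for n in (3, 4, 5, 10, 20):
    print(n, n_unrooted_trees(n))
```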
Parsimony variants
We can assign different costs to certain
changes: eg V -> L cheaper than G -> Y
We can assign different costs to certain
sites: eg 3rd codon position cheaper
Other variants for non-sequence data
Most of these slow down computation
Parsimony
considerations
Some researchers believe it has most
philosophical support (Occam’s razor)
Does not deal with repeated changes at
the same site well – simulations have
revealed “long branch attraction”
Computing time may grow exponentially
with more data, also “bad” data slows it
Long branch attraction
[Figure: two four-taxon trees with taxa A, B, C and D, illustrating long branch attraction]
Distance methods
For sequence data, a two-step process
Compute pairwise distances between
sequences
Find tree based on distance matrix:
• Fitch-Margoliash (1967) least squares
• Neighbor-joining (Saitou & Nei 1987)
• Minimum Evolution (Rzhetsky & Nei
1993)
Computing sequence
distances
We can simply count differences
between sequences (p-distance)
We can correct for multiple changes at
the same position: Jukes-Cantor 1969
We can assign different costs to
different changes
We can correct for variation among
sites: gamma parameter methods
DNA distances
Kimura 2-parameter: correct for multiple
changes and weight transitions less
than transversions (1980)
Variety of more complex distances that
include different base compositions,
different rates of base change and
gamma parameter.
Distances that take protein translation
into account
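A sketch of the three simplest DNA distances for a pair of aligned sequences, using the standard closed forms: Jukes-Cantor d = -(3/4) ln(1 - 4p/3) and Kimura 2-parameter d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. Skipping columns that contain a gap is an assumption made here.

```python
from math import log, sqrt

PURINES = {"A", "G"}

def proportions(a, b):
    """Return (transition, transversion) proportions over gap-free columns."""
    ts = tv = n = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        n += 1
        if x != y:
            if (x in PURINES) == (y in PURINES):   # purine<->purine or pyr<->pyr
                ts += 1
            else:
                tv += 1
    return ts / n, tv / n

def p_distance(a, b):
    ts, tv = proportions(a, b)
    return ts + tv

def jukes_cantor(a, b):
    return -0.75 * log(1 - 4 * p_distance(a, b) / 3)

def kimura_2p(a, b):
    P, Q = proportions(a, b)
    return -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))

dog = "AGCTGCCTTACGTACATA"
cat = "AGCTGTCTTACGTACGTA"
print(p_distance(dog, cat), jukes_cantor(dog, cat), kimura_2p(dog, cat))
```

Both corrections give a distance larger than the raw count of differences, because they allow for changes hidden by repeated substitution at the same site.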
Protein distances
Based on PAM or BLOSUM matrices
Category distances (similar amino
acids)
Heuristic distances based on large
collections of sequence alignments and
PAM/BLOSUM matrices
Not as well-developed as for DNA
Fitch Margoliash Least
Squares
Compare the observed distance matrix
to the distances along the branches of a
particular tree
For a given topology, we can find the
branch lengths that minimise the
squared difference
Calculate least square for each tree
topology, find best topology
Fitch-Margoliash
Difficult to calculate all branch lengths at once,
except for three species:
Branch i = (d(i,j) +d(i,k) – d(j,k)) / 2

[Figure: star tree joining three species i, j and k]
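The three-species formula can be checked directly: applying it to each species in turn gives branch lengths that reproduce the three distances exactly. The distances below are made-up example values.

```python
def three_point(d_ij, d_ik, d_jk):
    """Branch lengths (i, j, k) of the three-species tree fitting the distances."""
    b_i = (d_ij + d_ik - d_jk) / 2
    b_j = (d_ij + d_jk - d_ik) / 2
    b_k = (d_ik + d_jk - d_ij) / 2
    return b_i, b_j, b_k

b_i, b_j, b_k = three_point(0.3, 0.5, 0.6)   # hypothetical distances
print(b_i, b_j, b_k)
print(b_i + b_j, b_i + b_k, b_j + b_k)       # recovers the three input distances
```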
Fitch-Margoliash
Reduce a tree with more than three
species to d(i,j), d(i,k), d(j,k) where i and
j are closest species, k is a composite
average over all other species
Repeat for next-closest species
Estimate of branch lengths for a given
tree topology, then find best estimate
over all trees
Neighbor-joining
“Star decomposition” method
Determine which pair of sequences
reduces length of total tree most
Repeat until all sequences included
Very fast for large numbers of
sequences
Heuristic ‘greedy’ algorithm
Star decomposition
Neighbor-joining
algorithm
Star tree length: $S_0 = \frac{1}{n-1} \sum_{i<j} d(i,j)$
for n species, summed over all pairs i, j
Joining i and j gives tree length
$S_{ij} = \frac{1}{2(n-2)} \sum_{k} \bigl(d(i,k) + d(j,k)\bigr) + \frac{1}{2}\, d(i,j) + \frac{1}{n-2} \sum_{k<l} d(k,l)$
for n species, where k and l are all
species other than i and j
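A sketch of one neighbor-joining step using these tree-length expressions. The four-species distances are taken from the UPGMA counterexample slide further on; the dictionary layout and function names are assumptions for illustration.

```python
from itertools import combinations

# Distances from the UPGMA counterexample later in these slides.
D = {
    frozenset("AB"): 0.55, frozenset("AC"): 0.40, frozenset("AD"): 0.65,
    frozenset("BC"): 0.65, frozenset("BD"): 0.90, frozenset("CD"): 0.55,
}
TAXA = "ABCD"

def d(x, y):
    return D[frozenset((x, y))]

def star_length(taxa):
    """Length of the star tree: sum of all pairwise distances / (n - 1)."""
    n = len(taxa)
    return sum(d(x, y) for x, y in combinations(taxa, 2)) / (n - 1)

def joined_length(taxa, i, j):
    """Tree length after joining i and j, with everything else left as a star."""
    n = len(taxa)
    rest = [t for t in taxa if t not in (i, j)]
    to_pair = sum(d(i, k) + d(j, k) for k in rest) / (2 * (n - 2))
    among_rest = sum(d(k, l) for k, l in combinations(rest, 2)) / (n - 2)
    return to_pair + d(i, j) / 2 + among_rest

def best_pair(taxa):
    """Pair whose joining gives the shortest total tree."""
    return min(combinations(taxa, 2), key=lambda p: joined_length(taxa, *p))

print(star_length(TAXA))
for pair in combinations(TAXA, 2):
    print(pair, round(joined_length(TAXA, *pair), 3))
print(best_pair(TAXA))
```

For these distances the shortest tree comes from joining A with B (or, equally, C with D), even though A and C are the closest pair.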
Neighbor-joining
algorithm
Once we have found the i and j species
that reduce the total tree length the
most, replace all d(i,k) and d(j,k) with
the average
Now have a distance matrix with n-1
species, repeat process
Stop once there are three “species” left
Star decomposition
Minimum Evolution
Similar to Fitch-Margoliash – minimise
square of difference between real
distance matrix and distances along
tree.
Tree with the shortest total branch
lengths
Neighbor-joining is the greedy heuristic
for finding the minimum evolution tree
Increasingly popular
UPGMA
Unweighted Pair Group Method with
Arithmetic Mean
Rarely produces the correct tree
Nevertheless still commonly used both
in phylogenetics and other clustering
Recognise and avoid it!
UPGMA
Find smallest value in distance matrix,
say d(a,b)
Join these two species; replace d(a,c)
and d(b,c) with average distance, for all
other species c
Repeat until tree formed
Only produces the correct tree if all
tips are the same distance from the
root, i.e. the rate of change is the same on every lineage
UPGMA counterexample
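[Figure: true tree ((A, B), (C, D)). Branch lengths: A .15, C .15, B .40, D .40, internal branch .10]
Distance matrix:
.    A    B    C
B   .55
C   .40  .65
D   .65  .90  .55
A and C have the smallest distance (.40) and are joined first by UPGMA, although they are not neighbours in the true tree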
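The counterexample can be replayed with a small UPGMA sketch. The nested-tuple tree representation and the function name are assumptions for illustration; cluster distances are the mean over all leaf pairs, which is the same as repeatedly averaging weighted by cluster size.

```python
from itertools import combinations

D = {
    frozenset("AB"): 0.55, frozenset("AC"): 0.40, frozenset("AD"): 0.65,
    frozenset("BC"): 0.65, frozenset("BD"): 0.90, frozenset("CD"): 0.55,
}

def upgma(names, dist):
    """Return a nested-tuple tree built by UPGMA.

    dist maps frozenset({x, y}) of leaf names to the distance between them.
    """
    # Each cluster is (subtree, list of its leaves).
    clusters = [(n, [n]) for n in names]

    def cluster_distance(a, b):
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: cluster_distance(*ab))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]

print(upgma(list("ABCD"), D))
```

The result groups A with C, which is not a grouping in the tree that generated these distances.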
Maximum likelihood
Statistical method - powerful and
flexible, also computationally complex
Given a particular tree and a model of
the evolutionary change, calculate the
likelihood of the tree based on data, i.e.
the given multiple sequence alignment
Felsenstein 1981
Maximum likelihood
Likelihood(tree | data) proportional to Probability(
data | tree)
As in parsimony and least squares distance method,
search to find best tree, i.e. tree with maximum
likelihood given data
Model of evolutionary change can be simple or
complex, mirroring distance methods
Maximum likelihood
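Tree with branches; branch lengths $v_k$
Probability of character change $P_{AC}(t)$
for A -> C in unit of time t
Don’t know character states inside tree
(in the past) so calculate for all
possibilities, e.g. A, C, G, T
[Figure: rooted tree with root s0. Branch v1 leads to internal node s1 and branch v3 to internal node s3. From s1, branch v2 leads to internal node s2 and v6 to Cat. From s2, v4 leads to Human and v5 to Dog. From s3, v7 leads to Mouse and v8 to Rat. Observed bases at this site: Human G, Dog A, Cat A, Mouse A, Rat A]
With every internal node set to A:
$L = p(A)\, P_{AA}(v_1)\, P_{AA}(v_2)\, P_{AG}(v_4)\, P_{AA}(v_5)\, P_{AA}(v_6)\, P_{AA}(v_3)\, P_{AA}(v_7)\, P_{AA}(v_8)$
With s2 set to G and the other internal nodes set to A:
$L = p(A)\, P_{AA}(v_1)\, P_{AG}(v_2)\, P_{GG}(v_4)\, P_{GA}(v_5)\, P_{AA}(v_6)\, P_{AA}(v_3)\, P_{AA}(v_7)\, P_{AA}(v_8)$
In general, with leaf states s4 (Human), s5 (Dog), s6 (Cat), s7 (Mouse) and s8 (Rat):
$L = p(s_0)\, P_{s_0 s_1}(v_1)\, P_{s_1 s_2}(v_2)\, P_{s_2 s_4}(v_4)\, P_{s_2 s_5}(v_5)\, P_{s_1 s_6}(v_6)\, P_{s_0 s_3}(v_3)\, P_{s_3 s_7}(v_7)\, P_{s_3 s_8}(v_8)$
The likelihood for the site is the sum of this product over all possible states of s0, s1, s2 and s3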
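A sketch of this calculation for one site, summing the product of branch probabilities over every assignment of states to the internal nodes s0 to s3. The Jukes-Cantor form of P(v), the equal base frequencies p(s0) = 1/4 and the branch lengths of 0.1 are all assumptions for illustration. Felsenstein's pruning algorithm gives the same value with far less work; plain enumeration is used here because it follows the slide term by term.

```python
from itertools import product
from math import exp

BASES = "ACGT"

def p_jc(x, y, v):
    """Jukes-Cantor probability that base x is base y after a branch of length v."""
    e = exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(leaves, v):
    """Sum of p(s0) * product of branch terms over all internal states.

    leaves: observed base for Human, Dog, Cat, Mouse and Rat.
    v: branch lengths keyed 1..8, labelled as in the tree on the slide.
    """
    total = 0.0
    for s0, s1, s2, s3 in product(BASES, repeat=4):
        total += (0.25
                  * p_jc(s0, s1, v[1]) * p_jc(s1, s2, v[2])
                  * p_jc(s2, leaves["Human"], v[4])
                  * p_jc(s2, leaves["Dog"], v[5])
                  * p_jc(s1, leaves["Cat"], v[6])
                  * p_jc(s0, s3, v[3])
                  * p_jc(s3, leaves["Mouse"], v[7])
                  * p_jc(s3, leaves["Rat"], v[8]))
    return total

V = {k: 0.1 for k in range(1, 9)}     # hypothetical branch lengths
site1 = {"Human": "G", "Dog": "A", "Cat": "A", "Mouse": "A", "Rat": "A"}
print(site_likelihood(site1, V))
```

The likelihood of the whole alignment is the product of the site likelihoods, usually handled as a sum of logarithms.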
Calculating $P_{ij}(v)$
Rate matrix of instantaneous change
Multiple substitutions per site are
achieved by multiplying the instantaneous
matrix by itself
$P(v) = P(\text{instantaneous})^{v}$, where $P_{ik}(v)$ is the entry in row i, column k
Instantaneous DNA rate
matrix
.   A              C              G              T
A   1-(a1+a2+a3)   a1             a2             a3
C   a4             1-(a4+a5+a6)   a5             a6
G   a7             a8             1-(a7+a8+a9)   a9
T   a10            a11            a12            1-(a10+a11+a12)
DNA rate matrix
Jukes-Cantor distance model equivalent to all
a entries equal
Kimura two-parameter equivalent to two
values, for transitions and transversion
Site to site variation (like gamma distance
models) usually achieved by bins of sites with
different rates of change
Reversible model: with equal base frequencies
this means the rate matrix is symmetric
Example rate matrices

Jukes-Cantor
.   A      C      G      T
A   1-3a   a      a      a
C   a      1-3a   a      a
G   a      a      1-3a   a
T   a      a      a      1-3a

Kimura 2-parameter
.   A        C        G        T
A   1-a-2b   b        a        b
C   b        1-a-2b   b        a
G   a        b        1-a-2b   b
T   b        a        b        1-a-2b
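The idea of multiplying the matrix by itself, from the slide on calculating Pij(v), can be sketched with the Jukes-Cantor matrix. The value a = 0.01 and the 50 steps are arbitrary example values, and plain lists are used instead of a numerical library to keep the sketch self-contained.

```python
def matmul(A, B):
    """Product of two 4x4 matrices held as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def jc_step(a):
    """One-step Jukes-Cantor matrix: 1-3a on the diagonal, a elsewhere."""
    return [[1 - 3 * a if i == j else a for j in range(4)] for i in range(4)]

def matrix_power(M, v):
    """Multiply M by itself so the result covers v steps (v a positive integer)."""
    result = M
    for _ in range(v - 1):
        result = matmul(result, M)
    return result

P = matrix_power(jc_step(0.01), 50)
for row in P:
    print([round(x, 4) for x in row])
```

Each row still sums to 1, and as v grows every entry drifts towards 1/4, which is why very distant sequences carry little information about their common ancestor.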
Speeding up maximum
likelihood
Maximum likelihood does best in
simulation but is also slowest method
Variety of new heuristics to find ML tree
faster
Can bypass searching and calculate
likelihood of selected/interesting trees
and compare – of course the true ML
tree may not be included in comparison
Choosing a model
More parameters desirable to describe
sequence change accurately
But more parameters are slower and
harder to estimate
Eg 1000 sites into four groups with four
different rates of change -> less data for
each category
Overparameterisation
Which method to build
my tree with?
Considerable debate as to which is the
best method
Different methods imply different
models of sequence change, different
philosophies
In fact, the quality of the data is most
important factor influencing tree
construction
Quality of the tree
Phylogenetic trees can vary
dramatically with slight changes in data
We want to know which branches are
reliable, and which branches do not
have strong support from the data
Bootstrapping is the most common
method used (Felsenstein 1985)
Bootstrapping
Bootstrapping is a general statistical
technique for determining how much
error is in a set of results
Bootstrapping is applicable if either:
We are not sure of the sample space
the data is drawn from
The method applied to the data cannot
be easily analysed
Bootstrapping
Pseudoreplicates of data set
Randomly draw items from data set with
replacement, to size of original data set
Calculate desired result for each
pseudoreplicate
Examine distribution of result across
large set of pseudoreplicates
Bootstrapping sequence
alignments
For each pseudoreplicate, draw
columns of data randomly with
replacement to create a multiple
sequence alignment of same length
Construct phylogenetic tree for each,
using same method and parameters
Derive consensus tree, compare to
original tree
Example alignment
              11111111
     12345678901234567
Ovar STCVLSAYWKDELNNYH
Bota STCVLSAYWKD-LNNYH
Susc STCVLSAYWRN-LNNFH
Hosa STCMLGTY-QD-FNKFH
Rano STCMLGTY-QD-LNKFH
Sasa STCVLGKLSQE-LHKLQ
Random numbers with
replacement
In the example, there are 17 positions in
the alignment
Choose a random number between 1
and 17, 17 times
Some will be chosen several times,
some not at all
5 10 17 10 7 13 15 5 2 7 7 8 1 10 8 7 11
              11111111
     12345678901234567
Ovar STCVLSAYWKDELNNYH
Bota STCVLSAYWKD-LNNYH
Susc STCVLSAYWRN-LNNFH
Hosa STCMLGTY-QD-FNKFH
Rano STCMLGTY-QD-LNKFH
Sasa STCVLGKLSQE-LHKLQ
     11  2 42 31 1 1 1   (number of times each column was drawn)
Columns selected
               1111111
     12557777880001357
Ovar STLLAAAAYYKKKDLNH
Bota STLLAAAAYYKKKDLNH
Susc STLLAAAAYYRRRNLNH
Hosa STLLTTTTYYQQQDFKH
Rano STLLTTTTYYQQQDLKH
Sasa STLLKKKKLLQQQELKQ
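The column resampling can be sketched as below. The draws are the seventeen numbers listed on the earlier slide; they are sorted here only so that the output lines up with the pseudoreplicate as displayed. The function names are hypothetical.

```python
import random

ALIGNMENT = {
    "Ovar": "STCVLSAYWKDELNNYH",
    "Bota": "STCVLSAYWKD-LNNYH",
    "Susc": "STCVLSAYWRN-LNNFH",
    "Hosa": "STCMLGTY-QD-FNKFH",
    "Rano": "STCMLGTY-QD-LNKFH",
    "Sasa": "STCVLGKLSQE-LHKLQ",
}

def draw_columns(length, rng):
    """Draw `length` column numbers (1-based) with replacement."""
    return [rng.randint(1, length) for _ in range(length)]

def pseudoreplicate(alignment, columns):
    """Build a resampled alignment from the chosen 1-based column numbers."""
    return {name: "".join(seq[c - 1] for c in columns)
            for name, seq in alignment.items()}

# The draws shown on the slide.
SLIDE_DRAWS = [5, 10, 17, 10, 7, 13, 15, 5, 2, 7, 7, 8, 1, 10, 8, 7, 11]
replicate = pseudoreplicate(ALIGNMENT, sorted(SLIDE_DRAWS))
for name, seq in replicate.items():
    print(name, seq)

# A fresh pseudoreplicate with new random draws.
fresh = pseudoreplicate(ALIGNMENT, draw_columns(17, random.Random()))
```

Whole columns are drawn, so each pseudoreplicate keeps the relationship between sequences within a site intact.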
Consensus tree
1000 pseudoreplicates -> 1000 trees
Consensus tree: Identify branches
present in all trees first
Then find branch seen in most trees
Find next most common branch,
continue
Do not include branches that conflict
with previously-added branches
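The consensus procedure can be sketched as a greedy pass over branches sorted by how often they occur. As assumptions for illustration, each tree is represented by its set of clades (groups of leaf names below a branch) and two clades conflict when they overlap without one containing the other. The four small trees are made-up input.

```python
from collections import Counter

def compatible(a, b):
    """Two clades can sit in the same tree if nested or disjoint."""
    return a <= b or b <= a or not (a & b)

def greedy_consensus(trees):
    """trees: list of sets of clades (frozensets of leaf names).

    Returns [(clade, support)] taking the most frequent clades first and
    skipping any clade that conflicts with one already accepted.
    """
    counts = Counter(clade for tree in trees for clade in tree)
    accepted = []
    for clade, n in counts.most_common():
        if all(compatible(clade, other) for other, _ in accepted):
            accepted.append((clade, n / len(trees)))
    return accepted

# Hypothetical bootstrap trees on taxa A to E, each given as its clades.
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AC"), frozenset("ABC")},
]
for clade, support in greedy_consensus(trees):
    print(sorted(clade), f"{support:.0%}")
```

The support fraction attached to each accepted clade is the bootstrap value reported on the corresponding branch.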
Consensus tree
[Figure: bootstrap trees for taxa A, B, C, D and E, and the consensus tree with branch support values of 77% and 55%]
What do the bootstrap
values mean?
Bootstrap values for phylogenetic trees do not follow
proper statistical behaviour
Bootstrap value 95% actually close to 100%
confidence in that branch
Bootstrap value 75% often close to 95% confidence
Bootstrap value 60% is much lower confidence
Less than 50% bootstrap: no confidence in that
branch over an alternative
Bootstrap limitation
Because we are resampling from the
existing data set, we cannot ever have
pseudoreplicates with columns not
observed in the original data set
May sometimes lead to overestimation
of bootstrap value for particular branch
Other tree reliability
methods
Jackknifing – removing individual
sequences or parts of alignment
Parametric bootstrap – data sets
constructed based on underlying
evolutionary model
Likelihood ratio tests – compare the
likelihood of two (or more) rival trees
Practical considerations
Because NJ is very fast and gives pretty
good trees, use NJ to test different
parameters and bootstrapping
Also test different alignments to see
how much the alignment affects the tree
Use Maximum Likelihood for most
reliable tree; or to compare selected
trees
Phylogenetics software
PHYLIP – Joe Felsenstein – most used
software. Free, modular, large variety of
methods
PAUP* - David Swofford – Friendlier
user interface, originally just parsimony,
not free
MEGA2 – Sudhir Kumar – Free,
strength in neighbor-joining, PCs only
Phylogenetics software
Many specialist packages – new
methods implemented by authors
May have different input and output
formats
Extensive index at
http://evolution.genetics.washington.edu/phylip/software.html
