General questions
1. What problem does CLUSTALW address, and how does it work?
Extending dynamic programming alignment algorithms to solve the
multiple alignment problem (i.e. finding the best alignment between
more than two sequences) can be very expensive - for two sequences we
compute a 2D matrix, while for n sequences we must fill an n-dimensional
hypercube and evaluate 2^n - 1 predecessors per cell.
CLUSTALW is perhaps the most widely used progressive alignment
algorithm. Progressive alignment is a family of algorithms for solving
the multiple alignment problem by building a multiple alignment out of a
number of pairwise alignments. As this approach doesn't necessarily give
the best alignment of the n sequences (as the dynamic programming
approach would), it is heuristic in nature, but much more efficient.
Outline of the CLUSTALW algorithm:
pairwise alignment: align all possible pairs of sequences against
each other, obtaining a similarity matrix, such that

    similarity_{i,j} = exactMatches / alignmentLength
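As a concrete illustration of the similarity entry above, a minimal sketch in Python (the function name and the gap convention are illustrative, not taken from CLUSTALW):

```python
def pairwise_similarity(a, b):
    """Fraction of exactly matching columns in a pairwise alignment.

    `a` and `b` are two already-aligned sequences of equal length,
    with gaps written as '-'; gap columns never count as matches.
    """
    assert len(a) == len(b), "aligned sequences must have equal length"
    exact_matches = sum(x == y and x != '-' for x, y in zip(a, b))
    return exact_matches / len(a)

# 4 of the 5 aligned columns match exactly:
print(pairwise_similarity("AC-GT", "ACTGT"))  # 0.8
```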
Algorithms
We would like to compare DNA sequences from different species, so we can
understand evolution. We do that by building trees that represent the hierarchy of organisms. These are called phylogeny trees; they are labelled
(hypothetical ancestors are shown) and consist of nodes (called taxonomic
units) that split into:
leaves - each leaf is a different existing species.
internal nodes - called hypothetical taxonomic units; these are species
that do not exist now, but we think existed before and evolved into
other species.
Building phylogeny trees can be split into two phases - building an unlabelled tree and labelling it:
parsimony based methods - we choose the simplest scientific explanation that fits the evidence, i.e. we allow as few mutations as possible
(as we assume minimal mutations are most likely). Parsimony based
methods work on already constructed, unlabelled trees to produce a final,
labelled phylogeny tree.
distance based methods - cluster nodes in such a way that there is minimal distance between nodes within a cluster and maximal distance between
clusters. These methods produce an unlabelled tree.
*Comment:* I wrote all of this down because I feel slightly confused about
all these terms...
1. Fitch parsimony
abstract problem this is a parsimony based method for labelling
an already constructed tree to produce a phylogeny tree.
practical use understanding evolution.
outline of the algorithm
space and time complexity
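The outline is not filled in here, so as a minimal sketch, here is the bottom-up pass of Fitch's classic two-pass idea: it computes, for each node, a set of candidate states and the minimal number of mutations; the (omitted) top-down pass then picks one label per node. The tree encoding and names are illustrative.

```python
def fitch(tree):
    """Bottom-up Fitch pass on a rooted binary tree given as nested
    tuples, where a leaf is a single character. Returns the pair
    (candidate state set at the root, minimal mutation count)."""
    if isinstance(tree, str):            # leaf: its state is fixed
        return {tree}, 0
    (ls, lc), (rs, rc) = fitch(tree[0]), fitch(tree[1])
    common = ls & rs
    if common:                           # intersection non-empty: no new mutation
        return common, lc + rc
    return ls | rs, lc + rc + 1          # else take the union, pay one mutation

# ((A, C), (C, G)): each cherry costs 1 mutation, and the root's
# candidate set is {A, C} & {C, G} = {C}, so 2 mutations in total.
states, cost = fitch((("A", "C"), ("C", "G")))
print(states, cost)
```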
2. Sankoff parsimony
abstract problem this is a parsimony based method for labelling
an already constructed tree to produce a phylogeny tree.
practical use understanding evolution.
outline of the algorithm
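Since the outline is blank here too: Sankoff's method generalises Fitch by scoring states with an arbitrary substitution cost matrix. A minimal sketch of the bottom-up dynamic programming (tree encoding, names, and the unit-cost example are illustrative):

```python
import math

def sankoff(tree, states, cost):
    """Bottom-up Sankoff pass on a rooted binary tree given as nested
    tuples (a leaf is its observed character). Returns a dict mapping
    each state a to the minimal cost of the subtree if its root has
    state a; `cost[a][b]` is the substitution cost from a to b."""
    if isinstance(tree, str):            # leaf: only the observed state is free
        return {a: (0 if a == tree else math.inf) for a in states}
    left, right = (sankoff(child, states, cost) for child in tree)
    return {a: min(cost[a][b] + left[b] for b in states)
             + min(cost[a][b] + right[b] for b in states)
            for a in states}

states = "AC"
# with unit costs, Sankoff reduces to counting mutations like Fitch
unit = {a: {b: (0 if a == b else 1) for b in states} for a in states}
scores = sankoff(("A", "C"), states, unit)
print(min(scores.values()))  # 1
```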
3. UPGMA
abstract problem this is a distance based method for constructing an unlabeled tree. This tree is rooted and its ultrametric
the distance from the root to any leaf is the same. It is also a
hierarchical clustering algorithm.
practical use understanding evolution.
outline of the algorithm
(a) initialisation:
for each species i we create a new cluster Ci = i.
construct a new tree, where each cluster is a leaf.
(b) iteration:
find i and j, such that the distance d(C_i, C_j) between
clusters C_i and C_j is minimal.
create a new cluster C_k = C_i ∪ C_j.
create a corresponding node in the tree, at height d(C_i, C_j) / 2.
remove C_i and C_j.
if there is only one cluster left - terminate.
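The UPGMA iteration above can be sketched as follows. The distance from a merged cluster to the others is assumed to be the standard size-weighted average-linkage update (not spelled out in the notes); the encoding and names are illustrative, and the input distances must be symmetric.

```python
from itertools import combinations

def upgma(dist):
    """UPGMA sketch. `dist` is a symmetric dict-of-dicts {i: {j: d_ij}}
    over the leaves (mutated in place). Returns the final nested-tuple
    tree and the height of its root."""
    size = {i: 1 for i in dist}          # number of leaves per cluster
    node = {i: i for i in dist}          # partial tree per cluster
    while len(dist) > 1:
        # find the closest pair of clusters
        i, j = min(combinations(list(dist), 2),
                   key=lambda p: dist[p[0]][p[1]])
        h = dist[i][j] / 2               # new node sits at half the distance
        merged = (node[i], node[j])
        # size-weighted average-linkage distance to every other cluster
        new = {k: (dist[i][k] * size[i] + dist[j][k] * size[j])
                  / (size[i] + size[j])
               for k in dist if k not in (i, j)}
        for k in list(dist):
            dist[k].pop(i, None)
            dist[k].pop(j, None)
        del dist[i], dist[j]
        key = (i, j)
        dist[key] = new
        for k in new:
            dist[k][key] = new[k]
        size[key] = size.pop(i) + size.pop(j)
        node[key] = merged
        del node[i], node[j]
    return merged, h

tree, height = upgma({"a": {"b": 2, "c": 6},
                      "b": {"a": 2, "c": 6},
                      "c": {"a": 6, "b": 6}})
print(tree, height)  # the closest pair ("a", "b") joins first; root height 3.0
```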
(b) iteration:
for the current number of taxa r, we compute a new matrix Q:

    Q_{ij} = (r - 2) D_{ij} - Σ_{k=1}^{r} D_{ik} - Σ_{k=1}^{r} D_{jk}
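The Q-matrix step above can be computed directly from the formula; a minimal sketch (names illustrative; D is assumed symmetric with zero diagonal):

```python
def q_matrix(D):
    """Neighbour-joining Q matrix: Q_ij = (r-2)*D_ij - row_i - row_j,
    where row_i is the sum of distances from taxon i to all taxa."""
    r = len(D)
    row = [sum(Di) for Di in D]
    return [[(r - 2) * D[i][j] - row[i] - row[j] if i != j else 0.0
             for j in range(r)]
            for i in range(r)]

D = [[0, 5, 9, 9],
     [5, 0, 10, 10],
     [9, 10, 0, 8],
     [9, 10, 8, 0]]
Q = q_matrix(D)
print(Q[0][1], Q[2][3])  # -38 -38: these two pairs tie for the minimum
```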
after all data points are assigned, calculate the new mean
value (centre) of each cluster. That is
    μ_i = (1 / |C_i|) Σ_{d ∈ C_i} d
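The centre-update step above, as a minimal sketch (names illustrative; points are plain coordinate tuples):

```python
def update_centres(clusters):
    """clusters: a list of clusters, each a non-empty list of points.
    Returns the mean (centre) of each cluster, coordinate by coordinate."""
    return [tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            for cluster in clusters]

# the first cluster's centre is the midpoint of (0,0) and (2,2)
print(update_centres([[(0, 0), (2, 2)], [(4, 0)]]))  # [(1.0, 1.0), (4.0, 0.0)]
```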
7. Markov Clustering
abstract problem a partitioning clustering algorithm. Unlike
the two k-means clustering algorithms above, Markov clustering
does not require specifying the number of clusters in advance.
practical use analysing changes of activity in genes and functional
similarity among genes.
outline of the algorithm
(a) initialisation:
create the adjacency matrix M associated with the input graph.
create a transition matrix M′ out of the adjacency matrix
M - i.e. we want matrix entry M′_{ij} to show the probability
of moving from node j to node i. Thus M′ is just a column-normalised
version of M:

    M′_{ij} = M_{ij} / Σ_k M_{kj}
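The normalisation step can be sketched as follows (names illustrative; M is a square adjacency matrix as nested lists, with no all-zero columns):

```python
def column_normalise(M):
    """Return M′ with M′[i][j] = M[i][j] / sum_k M[k][j], so that
    every column of M′ sums to 1 (a column-stochastic matrix)."""
    n = len(M)
    col = [sum(row[j] for row in M) for j in range(n)]
    return [[M[i][j] / col[j] for j in range(n)] for i in range(n)]

M = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
Mp = column_normalise(M)
print([sum(Mp[i][j] for i in range(3)) for j in range(3)])  # each column sums to 1
```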