Multiple Sequence Alignment

What is a Multiple Sequence Alignment?
Unlike pairwise alignments, which involve 2 sequences (nucleotide or protein), multiple sequence alignments involve more than 2 sequences (often hundreds, either nucleotide or protein).
As was the case for pair-wise
Terminology (for Proteins)
Family
Group of proteins of similar biochemical function with (roughly) > 50% sequence identity when aligned
Family is transitive, even if sequence identity < 50%
A ~ B and B ~ C implies A ~ C
1940 protein families in Protein Data Bank (v1.61, Nov 2002)
Superfamily
Group of protein families related by
distant yet detectable sequence
similarity
1100 protein superfamilies in Protein
DataBank (v1.61)
Block
Ungapped conserved sequence pattern
(in protein family)
Motif
Conserved sequence pattern found in
multiple proteins with similar biochemical
activity, usually near active site
Module
Conserved sequence (contiguous) of one or more motifs, considered fundamental unit of structure or function
Homologous
Extended sequence pattern suggesting
common evolutionary origin (contains
one or more motifs, may contain gaps)
Multi-domain (chimeric) protein

Encoded by (artificial) gene containing
multiple domains
Super-secondary structure
Combination of several secondary structural
elements, folding adjacent polypeptide
chains into specific 3D configurations
Fold
Similar to motif, but usually larger
combination of secondary structural units
701 folds in Protein DataBank (v1.61, Nov
2002)
Domain
Segment of polypeptide chain that can
fold into 3D structure
irrespective of other segments (multiple
domains in protein)
Class
Classify domains according to secondary
structure
Examples: mainly-α, mainly-β, α/β, α+β, membrane
Applications of Multiple
Alignment
Homology Modeling
Phylogenetic Analysis
Advanced Database Searches, Patterns, Motifs, Promoters
Why do we need
multiple alignments?
In order to reveal the relationship between a group of sequences (homology).
In order to characterize protein families, to identify conserved regions of a specific family, and to locate its variable regions.
In order to retrieve information about domains or active sites. Similar regions may indicate similar functions (e.g. promoter regions in DNA).
To plan point mutations based upon highlighted regions of multiple alignments, either very similar or very different.
To build a family profile for use in a more
sensitive database scan. Such a search
can find new (more distant) members of
the family.
Determination of the consensus sequence
of several aligned sequences, for further
analysis.
Multiple alignments are used for protein modeling programs.
To help prediction of secondary and tertiary structures of new sequences.
Multiple alignments are input for constructing phylogenetic trees.
Dynamic Programming
Pairwise sequence alignment: a scoring matrix where each position provides the best alignment up to that point
Extension to 3 sequences: the lattice
of a cube that is to be filled with
calculated dynamic programming scores.
Scoring positions on 3 surfaces of the
cube represent the alignment of a pair
Dynamic Programming
Possible moves at each position:
2 sequences: match/mismatch, gap in sequence 1, gap in sequence 2
3 sequences: 7 possibilities
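Seven Possibilities
All three match/mismatch (AAA)
Sequence 1 & 2 match/mismatch with gap in 3 (AA-)
Sequence 1 & 3 match/mismatch with gap in 2 (A-A)
Sequence 2 & 3 match/mismatch with gap in 1 (-AA)
Sequence 1 with gaps in 2 & 3 (A--)
Sequence 2 with gaps in 1 & 3 (-A-)
Sequence 3 with gaps in 1 & 2 (--A)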
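The seven possibilities named above can be enumerated mechanically. The sketch below is only an illustration: the function name `column_moves` is made up here, 'A' stands for "this sequence contributes a residue" and '-' for a gap. It also shows that the count grows as 2^N - 1 with the number of sequences.

```python
from itertools import product

def column_moves(n_sequences):
    """Return every way to fill one alignment column for n sequences.

    'A' means the sequence contributes a residue, '-' means a gap.
    The all-gap column is excluded because it aligns nothing.
    """
    moves = []
    for pattern in product("A-", repeat=n_sequences):
        if "A" in pattern:
            moves.append("".join(pattern))
    return moves

moves_for_three = column_moves(3)
print(moves_for_three)
print(len(column_moves(2)), len(moves_for_three))
```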
Computational Complexity
For protein sequences each 300 amino acids in length and excluding gaps, with DP algorithm
Two sequences, 300^2 comparisons
Three sequences, 300^3 comparisons
N sequences, 300^N comparisons
With gaps allowed?
The number of comparisons & memory required are too large for n > 3 and not practical
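To get a feel for how quickly the lattice grows, the short sketch below evaluates the length-to-the-power-N count for a few values of N. The helper name `dp_cells` is hypothetical, and the count ignores the extra work per cell.

```python
def dp_cells(length, n_sequences):
    """Number of lattice cells for n sequences of equal length."""
    return length ** n_sequences

# Sequence length 300, as in the example above.
for n in range(2, 6):
    print(n, "sequences:", dp_cells(300, n), "cells")
```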
MSA
Carrillo and Lipman (1988) : multiple sequence
alignment space bounded by pairwise alignments
MSA can be projected on to the sides of the cube
Optimal alignments are likely to be found in the pink volume
Scoring of MSA: Sum of Pairs
SP scoring is the standard method for scoring multiple sequence alignments.
Score = summation over all possible combinations of amino acid pairs
These scores may or may not be weighted so as to reduce the influence of more closely related sequences in the MSA
Sum of Pairs
Given 5 sequences:
NCCE
NNCE
N-CN
SCSN
SCSE
How many possible combinations of pairwise alignments for each position?
Answer: C(5, 2) = 10
Sum of Pairs
Assume: match/mismatch/gap = 1/0/-1
NCCE
NNCE
N-CN
SCSN
SCSE
The 1st position: # of N-N (3), # of S-S (1), # of
N-S (6)
SP(1) = 4*1 + 0*6 + (-1)*0 = 4
The 2nd position: # of C-C (3), # of N-C (3), # of
gaps (4),
SP(2) = 3*1 + 0*3 + (-1)*4 = -1
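The column scores above can be reproduced with a few lines of code. This is a minimal sketch under the slide's match/mismatch/gap = 1/0/-1 scheme. The function names are invented for illustration, and scoring a gap-gap pair as 0 is an assumption (the example contains no such pair).

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, 0, -1  # scoring scheme assumed in the example

def pair_score(a, b):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap pairs are ignored
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def sp_column_scores(alignment):
    """Sum-of-pairs score of every column of an alignment."""
    scores = []
    for column in zip(*alignment):
        scores.append(sum(pair_score(a, b) for a, b in combinations(column, 2)))
    return scores

alignment = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
column_scores = sp_column_scores(alignment)
print(column_scores, sum(column_scores))
```

The total SP score of the alignment is the sum of the column scores.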
The Computational
Challenge of MSA
Finding the optimal alignment between a group of sequences, including matches, mismatches and gaps, is very difficult.
For pairwise alignments, dynamic programming methods are used, but they are impractical with multiple alignments (too many calculations, too much CPU time).
The Computational
Challenge of MSA
The difficulty of aligning a group of sequences varies with the degree of similarity between the sequences.
High degree of variation of the compared sequences: many alignments possible.
Many possibilities: very hard to find optimal alignment.
The Computational
Challenge of MSA
Approximate methods are used instead of dynamic programming methods.
Another computational challenge is placement and scoring of gaps in the aligned sequences.
Approximate Methods
Progressive methods
Start with the most similar sequences, and build the alignment by adding the rest of the sequences.
Iterative methods
Start by making initial alignments of small groups of sequences, and then revise the alignment for better results.
Approximate Methods
Alignment based on small conserved domains (or patterns), found in the same order within the aligned sequences.
Alignment based on statistical or probabilistic
models of the sequences
Progressive Methods
The most practical and widely used method for multiple alignment is the progressive global alignment.
How does it work?
Steps to create
multiple alignment
Pairwise comparisons of all sequences
Start with the most related (similar)
sequences, then the next most similar pair
and so on. Once an alignment of two
sequences has been made, then this is fixed.
Perform cluster analysis on the pairwise data
to generate a hierarchy for alignment. This
may be in the form of a binary tree or simple
ordering tree.
Steps to create
multiple alignment
After the clustering is done, relationships
between the sequences are modeled by a
tree.
If the program (e.g. ClustalW) builds an evolutionary tree, then the sequences represent the outer branches of the evolutionary tree.
Inner branches represent dissimilarities of
the sequences at the outer branches.
Clustal W
Clustal W is a global multiple

alignment program for DNA or
protein.
Clustal W was produced by Julie D.
Thompson, Toby Gibson of EMBL,
Germany and Desmond Higgins of
EBI, Cambridge, UK.
Clustal W
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Clustal W
Clustal W can create multiple alignments, manipulate existing alignments and create phylogenetic trees.
Alignment can be done by 2 methods:
slow/accurate
fast/approximate
ClustalW alignment
Method
ClustalW alignment algorithm consists of 3 steps:
STEP 1
Pairwise alignments are performed between all sequences in the compared group.
Alignment scores are used to build a distance matrix.
When calculating the distance matrix, the program takes into account the divergence of the sequences.
ClustalW alignment
Method
STEP 2
A guide (phylogenetic) tree is created from the distance matrix.
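To make the step from distance matrix to guide tree concrete, here is a rough sketch that repeatedly merges the two closest clusters. It uses plain average-linkage clustering (UPGMA-style) as a stand-in and is not ClustalW's own tree-building code. The function names and the distance values are hypothetical.

```python
from itertools import combinations

def average_distance(cluster_a, cluster_b, dist):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def guide_order(names, dist):
    """Merge the two closest clusters repeatedly; return the merge order."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical distances for four sequences (not taken from the slides).
dist = {
    frozenset(("s1", "s2")): 0.10,
    frozenset(("s1", "s3")): 0.40,
    frozenset(("s1", "s4")): 0.50,
    frozenset(("s2", "s3")): 0.35,
    frozenset(("s2", "s4")): 0.55,
    frozenset(("s3", "s4")): 0.20,
}
merges = guide_order(["s1", "s2", "s3", "s4"], dist)
for left, right in merges:
    print(left, "+", right)
```

The merge order read from first to last corresponds to walking the tree from the tips towards the root.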
ClustalW alignment
Method
STEP 3
Progressive alignment of the sequences is done, following the branch order of the guide tree.
The sequences are aligned from the tips to the root.
ClustalW alignment
Method
At each stage of the progressive alignment, full dynamic programming is applied, using a scoring matrix.
The program calculates sequence weights from the guide tree, and chooses the scoring matrix accordingly (according to the divergence of the compared sequences).
ClustalW alignment
Method
ClustalW weights the sequences according to the distance of each sequence from the root.
ClustalW calculates gaps in a novel way, designed to place them between conserved domains.
ClustalW penalizes for gap opening and extension.
Genetic Distance
Calculation
ClustalW calculates the distance in the following way:
Distance = (No. of mismatches in the alignment) / (No. of matches in the alignment)
Positions opposite to a gap are not considered
ClustalX - Multiple
Sequence Alignment
Program
ClustalX provides a window-based user interface to the ClustalW program.
It uses the Vibrant multi-platform user interface development library, developed by the National Center for Biotechnology Information as part of their NCBI Software Development Toolkit.
Multiple alignment in
GCG
The program available in GCG for
multiple alignment is Pileup.
Pileup does global alignment very
similar to Clustal W.
The input file for Pileup is a list of
sequence file_names or sequence
codes in the database. The list is
created by a text editor.
Multiple alignment in
GCG
Pileup creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments.
It can also plot a tree showing the
clustering relationships used to
create the alignment.
PileUp for Multiple Alignment
Pileup performs pairwise alignments of all the sequences, using the method of Needleman & Wunsch.
The alignment scores are used to produce a tree.
The tree is used to guide the multiple alignment from a group of related sequences.
Please note that there is no one absolute
alignment, even for a limited number of
sequences.
PileUp considerations
PileUp does global multiple alignment, and
therefore is good for a group of similar
sequences.
PileUp will fail to find the best local region of
similarity (such as a shared motif) among
distantly related sequences.
PileUp always aligns all of the sequences you
specified in the input file, even if they are not
related. The alignment can be degraded if
some of the sequences are only distantly
related.
Problems with
Progressive alignments
In progressive alignment the ultimate
multiple alignment is dependent on the
initial pairwise alignments.
The first sequences to be aligned are the
most similar (closely related on the tree).
If the initial alignments are good, with very few errors, the ultimate multiple alignment will be good.
Problems with
Progressive alignments
However, if the sequences aligned are distantly related, many more errors can be made, affecting the final alignment.
Another problem with progressive
alignment is that the ultimate multiple
alignment is dependent on choosing the
correct scoring matrices, and the
correct gap penalty.
Iterative Methods for Multiple Sequence Alignment
Do NOT depend on the initial pairwise alignment (recall progressive methods)
Start with an initial alignment and repeatedly realign groups of the sequences
Repeat until one MSA doesn't change significantly from the next.
After iterations, alignments are better and better.
Iterative Methods
MultAlign
PRRP
DiAlign
Genetic Algorithm
MultAlign
Pairwise scores recalculated
during progressive alignment
Tree is recalculated
Alignment is refined
PRRP
Initial pairwise alignment predicts
tree
Tree produces weights
Locally aligned regions considered
to produce new alignment and tree
Continue until alignments converge
DIALIGN
Pairs of sequences aligned to
locate ungapped aligned regions
Diagonals of various lengths
identified
Collection of weighted diagonals
provide alignment
Genetic Algorithm
Goal: use GAs to identify the best alignments
A class of probabilistic optimization algorithms.
Inspired by the biological evolution process
Uses concepts of Natural Selection and Genetic Inheritance (Darwin 1859)
The search from a large space is parallel and avoids the calculations of derivatives
GA for MSA
Representation:
For sequences 200 residues long,
extend each to 250.
Gaps are randomly inserted in each sequence.
XXXXXXXXXX----
---------XXXXXXXX
--XXXXXXXXX----XXX---X-XXXX-X--
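The representation step can be sketched as follows: every sequence is padded to a common length by inserting gaps at random positions, giving one candidate alignment (one individual of the GA population). This is only an illustration of the idea, not SAGA's code. The function names, the toy sequences and the target length of 14 are assumptions.

```python
import random

def pad_with_random_gaps(sequence, target_length, rng):
    """Insert gaps at random positions until the row has target_length columns."""
    row = list(sequence)
    while len(row) < target_length:
        row.insert(rng.randint(0, len(row)), "-")
    return "".join(row)

def random_alignment(sequences, target_length, rng):
    """Build one random individual: every row padded to the same length."""
    return [pad_with_random_gaps(s, target_length, rng) for s in sequences]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_sequences = ["ACDEFGHIKL", "ACDFGHKL", "CDEFGIKL"]  # hypothetical input
individual = random_alignment(toy_sequences, 14, rng)
for row in individual:
    print(row)
```

Removing the gaps from any row gives back the original sequence, so residues are never changed, only the gap placement.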
GA for MSA
Fitness function: Sum of pairs with
Standard scoring matrices
Gap penalties
Selection
Half are chosen probabilistically to proceed unchanged (Natural selection)
Half proceed with mutations and crossovers
The probability of selection depends on the value of fitness
GA for MSA
Mutations: sequences are not changed, but gaps are inserted and rearranged
XXXXXXXX
XXXXXXXX
XXXXXXXX
XXXXXXXX
XXXXXXXX
XXX--XXXXX
XXX--XXXXX
XXXXXXX--X
XXXXXXX--X
XXXXXXX--X
GA for MSA
Crossover:
[Figure: crossover example in which the left and right column blocks of two parent alignments are recombined into two child alignments]
GA for MSA
Notredame, C. and Higgins D.
SAGA: Sequence Alignment by
Genetic Algorithm,
Slow if number of sequences > 20
GA has been applied in many
different bioinformatics problems.
Other MSA Algorithms

Statistical and Probabilistic Methods
Hidden Markov Model
Neural Networks
Based on Profiles
Profile
A portion of the MSA which is highly
conserved.
A scoring matrix for the mini-msa is
called the profile.
Profiles include matches, mismatches and gaps.
Scanned regions without any gaps are called Blocks.
Applications
Modeling protein families
Modeling protein domains --Pfam
Gene finding
Protein structure predictions
Genome annotations
Modeling protein
families
Functional biological sequences come
in families. Sequences in a family have
diverged from each other in their
primary sequence during evolution.
Knowing that a sequence belongs to
a family often allows inferences about
its function.
Modeling protein
families
To find more members of a known family, pairwise search with any existing members may not find distantly related sequences
More powerful approaches will use the statistical features of the whole set of sequences
Statistical Profiles
Proteins which share a common ancestor are not exactly alike; however, they inherit many similarities in primary structure from their ancestor.
This is known as conservation of
primary structure in a protein family.
Statistical model
These structural similarities make it possible to create a statistical model of a protein family.
Probabilities are calculated from the observed frequencies of amino acids in the family.
Probability of a
sequence
Product of the amino acid
probabilities given by the profile.
E.g. probability of CGGSV:
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031
Transformation to
logarithmic
Multiplication of fractions is computationally expensive and prone to floating point errors such as underflow, so a convenient transformation into the logarithmic world is used.
The score of CGGSV is
log_e(0.8) + log_e(0.4) + log_e(0.8) + log_e(0.6) + log_e(0.2) = -3.48
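The CGGSV example can be checked numerically. The sketch below uses only the five per-position probabilities quoted in the example. A full profile would hold a probability for every amino acid at every position, which the slides do not give, so the list here is a deliberate simplification.

```python
import math

# Per-position probabilities of the residues in CGGSV, taken from the example.
# A full profile would hold a probability for every amino acid at each position.
position_probabilities = [0.8, 0.4, 0.8, 0.6, 0.2]

# Direct product of the probabilities.
probability = math.prod(position_probabilities)

# Same quantity in log space: a sum of natural logarithms.
log_score = sum(math.log(p) for p in position_probabilities)

print(round(probability, 3))
print(round(log_score, 2))
```

Exponentiating the log score recovers the product, which is why the two forms rank sequences identically.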
Score Calculation
Members of a protein family have
varying lengths, so a score penalty is
charged for insertions and deletions.
The scores of individual amino acids in
a profile are also position specific.
More weight must be given to an
unlikely amino acid which appears in a
structurally important position in the
protein than to one which appears in a
structurally unimportant position.
Refinements to create
good profile models
Introduce many additional free parameters which must be calculated when building a profile.
The calculations must be done by trial and error.
Hidden Markov Models

Hidden Markov models (HMMs) offer a
more systematic approach to estimating
model parameters.
The HMM is a dynamic kind of statistical
profile. Like an ordinary profile, it is built
by analyzing the distribution of amino
acids in a training set of related proteins.
However, an HMM has a more complex
topology than a profile. It can be visualized
as a finite state machine, familiar to
students of computer science.
HMM
Finite state machines typically move through a series of states and produce some kind of output either when
the machine has reached a particular state
or
when it is moving from state to state.
HMM
The HMM generates a protein sequence by
emitting amino acids as it progresses
through a series of states.
Each state has a table of amino acid emission probabilities similar to those in a profile model.
Transition probabilities for moving from state
to state.
States in HMM
There are three kinds of states represented by three different shapes.
The squares are called match states, and
the amino acids emitted from them form
the conserved primary structure of a
protein.
The diamond shapes are insert states and
emit amino acids which result from
insertions.
The circles are special, silent states known
as delete states and model deletions.
Hidden Markov Model for the protein ACCY
Probability Calculation
Any sequence can be represented by a path through the model.
The probability of any sequence, given the model, is computed by multiplying the emission and transition probabilities along the path.
For example, the probability of A being emitted in position 1 is 0.3, and the probability of C being emitted in position 2 is 0.6. The probability of ACCY along this path is
0.4 * 0.3 * 0.46 * 0.6 * 0.97 * 0.5 * 0.015 * 0.73 * 0.01 * 1 = 1.76 x 10^-6
The calculation is simplified by transforming probabilities to logs.
The resulting number is the raw score of a sequence, given the HMM.
For example, the score of ACCY along the path is
log_e(0.4) + log_e(0.3) + log_e(0.46) + log_e(0.6) + log_e(0.97) + log_e(0.5) + log_e(0.015) + log_e(0.73) + log_e(0.01) + log_e(1) = -13.25
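The ACCY numbers can be verified the same way. The sketch simply lists the ten factors in the order they appear in the example. Which of them are transition and which are emission probabilities depends on the model figure, so they are not labelled here.

```python
import math

# Transition and emission probabilities along one path for ACCY,
# in the order they are multiplied in the example.
path_factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(path_factors)
raw_score = sum(math.log(p) for p in path_factors)

print(f"{path_probability:.3g}")
print(round(raw_score, 2))
```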
Limitations of HMM
The calculation is easy if the exact
state path is known. In a real model,
many different state paths through a
model can generate the same
sequence.
Therefore, the correct probability of a
sequence is the sum of probabilities
over all of the possible state paths.
Alternative Methods
Forward Algorithm
- Calculate the sum over all paths
Viterbi Algorithm
- Calculate the most probable path
Viterbi algorithm
The algorithm employs a matrix
Columns are indexed by the states
Rows are indexed by the sequence.
Deletion states are not shown, since, by
definition, they have a zero probability of
emitting an amino acid.
Steps
1. Initialize the matrix with zeros
2. Fill the matrix with the probability of each amino acid to occur in the three different states
3. The maximum probability, max(I1, M1), is calculated.
4. A pointer is set from the winner back to state I0.
5. Steps 2-4 are repeated until the matrix is filled.
Probability calculation
Prob(A in state I0) = 0.4*0.3=0.12
Prob(C in state I1) = 0.05*0.06*0.5 = .015
Prob(C in state M1) = 0.46*0.01 = 0.005
Prob(C in state M2) = 0.46*0.5 = 0.23
Prob(Y in state I3) = 0.015*0.73*0.01 = 0.0001
Prob(Y in state M3) = 0.97*0.23 = 0.22
The most likely path through the model can now be found by following the back-pointers.
Probability of a
sequence
Once the most probable path
through the model is known, the
probability of a sequence given the
model can be computed by
multiplying all probabilities along
the path.
Forward algorithm
The forward algorithm is similar to
Viterbi.
However in step 3, a sum rather than a
maximum is computed, and no back
pointers are necessary.
The probability of the sequence is found
by summing the probabilities in the last
column.
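The difference between the two algorithms (maximum versus sum) is easiest to see side by side. The sketch below runs both on a tiny two-state HMM with made-up probabilities. It is a plain HMM without silent delete states, not the ACCY profile model, and all names and numbers are assumptions chosen for illustration.

```python
def viterbi(obs, states, start, trans, emit):
    """Most probable state path and its probability."""
    best = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Keep only the best predecessor (the maximum) and remember it.
            prev = max(states, key=lambda p: best[-1][p] * trans[p][s])
            scores[s] = best[-1][prev] * trans[prev][s] * emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):  # follow the back-pointers
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

def forward(obs, states, start, trans, emit):
    """Probability of obs summed over all state paths."""
    column = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        # Sum over all predecessors instead of taking the maximum.
        column = {
            s: emit[s][symbol] * sum(column[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(column.values())

# Hypothetical toy model: a match-like state M and an insert-like state I.
states = ["M", "I"]
start = {"M": 0.8, "I": 0.2}
trans = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.5, "I": 0.5}}
emit = {"M": {"A": 0.7, "C": 0.3}, "I": {"A": 0.5, "C": 0.5}}

best_path, best_probability = viterbi("AC", states, start, trans, emit)
total_probability = forward("AC", states, start, trans, emit)
print(best_path, best_probability, total_probability)
```

The forward probability is never smaller than the Viterbi probability, because the best path is one of the paths included in the sum.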
Forward algorithm
Limitations of HMM
The HMM is a linear model and is unable to capture higher order correlations among amino acids in a protein molecule.
Hydrogen bonds between non-adjacent amino acids in a polypeptide chain
Hydrogen bonds created between amino acids in multiple chains
Disulfide bridges, chemical bonds between C (cysteine) amino acids which are distant from each other within the molecule.
Limitations of HMM
In reality, amino acids which are far
apart in the linear chain may be
physically close to each other when
a protein folds.
Chemical and electrical interactions
between them cannot be predicted
with a linear model.
Limitations of HMM
Another flaw of HMMs lies at the very heart of the mathematical theory behind these models.
The probability of a protein sequence can
be found by multiplying the probabilities
of the amino acids in the sequence.
This is only valid if the probability of any
amino acid in the sequence is independent
of the probabilities of its neighbors.
Limitations of HMM
In biology, this is not the case. There are, in
fact, strong dependencies between these
probabilities.
For example, hydrophobic amino acids are
highly likely to appear in proximity to each
other.
Because such molecules fear water, they cluster
at the inside of a protein, rather than at the
surface where they would be forced to
encounter water molecules.
Insight into New Methods
These biological realities have motivated research into new kinds of statistical models.
Hybrids of HMMs and neural nets,
dynamic Bayesian nets, factorial
HMMs, Boltzmann trees and hidden
Markov random fields are among the
areas being explored.
