
Multiple Sequence Alignment

What is a Multiple Sequence Alignment?
Unlike pair-wise alignments, which involve 2 sequences (nucleotide or protein), multiple sequence alignments involve more than 2 sequences (often 100s, either nucleotide or protein).
As was the case for pair-wise
Terminology (for Proteins)

Family
Group of proteins of similar biochemical function with (roughly) > 50% sequence identity when aligned
Family is transitive, even if sequence identity < 50%
A ~ B and B ~ C implies A ~ C
1940 protein families in Protein Data Bank (v1.61, Nov 2002)

Superfamily
Group of protein families related by distant yet detectable sequence similarity
1100 protein superfamilies in Protein Data Bank (v1.61)

Block
Ungapped conserved sequence pattern (in protein family)

Motif
Conserved sequence pattern found in multiple proteins with similar biochemical activity, usually near active site

Module
Conserved sequence (contiguous) of one or more motifs, considered fundamental unit of structure or function

Homologous
Extended sequence pattern suggesting common evolutionary origin (contains one or more motifs, may contain gaps)

Multi-domain (chimeric) protein
Encoded by (artificial) gene containing multiple domains

Super-secondary structure
Combination of several secondary structural elements, folding adjacent polypeptide chains into specific 3D configurations

Fold
Similar to motif, but usually larger combination of secondary structural units
701 folds in Protein Data Bank (v1.61, Nov 2002)

Domain
Segment of polypeptide chain that can fold into 3D structure irrespective of other segments (multiple domains in protein)

Class
Classify domains according to secondary structure
Examples: mainly-α, mainly-β, α/β, α+β, membrane

Applications of Multiple Alignment
Homology Modeling
Phylogenetic Analysis
Advanced Database Searches, Patterns, Motifs, Promoters

Why do we need multiple alignments?

In order to reveal the relationship between a group of sequences (homology).
In order to characterize protein families: to identify conserved regions of a specific family and locate its variable regions.
In order to retrieve information about domains or active sites. Similar regions may indicate similar functions (e.g. promoter regions in DNA).

Why do we need multiple alignments?

To plan point mutations based upon highlighted regions of multiple alignments, either very similar or very different.
To build a family profile for use in a more sensitive database scan. Such a search can find new (more distant) members of the family.
To determine the consensus sequence of several aligned sequences, for further analysis.

Why do we need multiple alignments?

Multiple alignments are used by protein modeling programs.
To help prediction of secondary and tertiary structures of new sequences.
Multiple alignments are input for constructing phylogenetic trees.

Dynamic Programming
Pairwise sequence alignment: a scoring matrix where each position provides the best alignment up to that point
Extension to 3 sequences: the lattice of a cube that is to be filled with calculated dynamic programming scores.
Scoring positions on 3 surfaces of the cube represent the alignment of a pair

Dynamic Programming
[Figure: dynamic programming moves. 2 sequences: match/mismatch, gap in sequence 1, gap in sequence 2. 3 sequences: 7 possibilities.]

Seven Possibilities

All three match/mismatch (AAA)
Sequence 1 & 2 match/mismatch with gap in 3 (AA-)
Sequence 1 & 3 match/mismatch with gap in 2 (A-A)
Sequence 2 & 3 match/mismatch with gap in 1 (-AA)
Sequence 1 with gaps in 2 & 3 (A--)
Sequence 2 with gaps in 1 & 3 (-A-)
Sequence 3 with gaps in 1 & 2 (--A)
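
The seven cases are simply every way of choosing "residue" or "gap" for each of the three sequences, minus the all-gap column. A minimal sketch that enumerates them (the function name `alignment_moves` is made up for this illustration):

```python
from itertools import product

def alignment_moves(k):
    """Return every column type for k sequences: 1 = residue, 0 = gap."""
    # All residue/gap combinations except the all-gap column,
    # which would not advance in any sequence.
    return [move for move in product((1, 0), repeat=k) if any(move)]

moves = alignment_moves(3)
for move in moves:
    # Print each move in the AAA / A-- notation used on the slide.
    print("".join("A" if used else "-" for used in move))
print(len(moves), "possibilities")
```

For k sequences the count is 2^k - 1, which gives 3 moves for a pairwise alignment and 7 for three sequences.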

Computational Complexity

For protein sequences each 300 amino acids in length, and excluding gaps, with the DP algorithm:
Two sequences, 300^2 comparisons
Three sequences, 300^3 comparisons
N sequences, 300^N comparisons
With gaps allowed?

The number of comparisons and the memory required are too large for N > 3 and not practical
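
A quick way to see the growth is to compute the lattice size directly. This is a rough sketch only: the 4 bytes per cell used for the memory estimate is an assumption chosen for illustration, and `lattice_cells` is a hypothetical helper name.

```python
def lattice_cells(length, n_sequences):
    """Number of cells in the DP lattice for n sequences of equal length."""
    return length ** n_sequences

for n in range(2, 6):
    cells = lattice_cells(300, n)
    # Assumption: one 4-byte score stored per lattice cell.
    gigabytes = cells * 4 / 1e9
    print(f"{n} sequences: {cells:,} cells, about {gigabytes:,.2f} GB")
```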

MSA
Carrillo and Lipman (1988): multiple sequence alignment space bounded by pairwise alignments

MSA can be projected on to the sides of the cube
Optimal alignments are likely to be found in the pink volume

Scoring of MSA: Sum of Pairs
SP scoring is the standard method for scoring multiple sequence alignments.

Score = summation over all possible combinations of amino acid pairs

These scores may or may not be weighted so as to reduce the influence of more closely related sequences in the MSA

Sum of Pairs

Given 5 sequences:
NCCE
NNCE
N-CN
SCSN
SCSE
How many possible combinations of pairwise alignments for each position?
Answer: C(5, 2) = 10

Sum of Pairs
Assume: match/mismatch/gap = 1/0/-1
NCCE
NNCE
N-CN
SCSN
SCSE
The 1st position: # of N-N (3), # of S-S (1), # of N-S (6)
SP(1) = 4*1 + 0*6 + (-1)*0 = 4
The 2nd position: # of C-C (3), # of N-C (3), # of gaps (4)
SP(2) = 3*1 + 0*3 + (-1)*4 = -1
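
The same column-by-column arithmetic can be sketched in a few lines, using the match/mismatch/gap = 1/0/-1 scheme from the slide. Scoring a gap-gap pair as 0 is an assumption (the example never contains one), and the function names are made up for this illustration.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one pair of symbols from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # assumption: a gap facing a gap is ignored
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum of pairs for one column: add the score of every pair of rows."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_score(alignment):
    """Sum of pairs for the whole alignment (sum over columns)."""
    return sum(sp_column(column) for column in zip(*alignment))

alignment = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
column_scores = [sp_column(column) for column in zip(*alignment)]
print(column_scores)        # per-column SP scores
print(sp_score(alignment))  # total SP score
```

The first two column scores reproduce SP(1) = 4 and SP(2) = -1 from the worked example.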

The Computational Challenge of MSA

Finding the optimal alignment of a group of sequences, including matches, mismatches and gaps, is very difficult.
For pairwise alignments, dynamic programming methods are used, but they are impractical for multiple alignments (too many calculations, too much CPU time).

The Computational Challenge of MSA

The difficulty of aligning a group of sequences varies with the degree of similarity between the sequences.
A high degree of variation among the compared sequences makes many alignments possible.
With many possibilities, it is very hard to find the optimal alignment.

The Computational Challenge of MSA
Approximate methods are used instead of dynamic programming methods.
Another computational challenge is placement and scoring of gaps in the aligned sequences.

Approximate Methods
Progressive methods
Start with the most similar sequences, and build the alignment by adding the rest of the sequences.

Iterative methods
Start by making initial alignments of small groups of sequences, and then revise the alignment for better results.

Approximate Methods
Alignment based on small conserved domains (or patterns), found in the same order within the aligned sequences.
Alignment based on statistical or probabilistic models of the sequences

Progressive Methods
The most practical and widely used method for multiple alignment is the progressive global alignment.
How does it work?

Steps to create
multiple alignment
Pairwise comparisons of all sequences
Start with the most related (similar)
sequences, then the next most similar pair
and so on. Once an alignment of two
sequences has been made, then this is fixed.
Perform cluster analysis on the pairwise data
to generate a hierarchy for alignment. This
may be in the form of a binary tree or simple
ordering tree.

Steps to create
multiple alignment
After the clustering is done, relationships
between the sequences are modeled by a
tree.
If the program (e.g. ClustalW) builds an evolutionary tree, then the sequences represent the outer branches of the evolutionary tree.
Inner branches represent dissimilarities of
the sequences at the outer branches.

Clustal W

Clustal W is a global multiple alignment program for DNA or protein.
Clustal W was produced by Julie D. Thompson and Toby Gibson of EMBL, Germany, and Desmond Higgins of EBI, Cambridge, UK.

Clustal W
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Clustal W
Clustal W can create multiple alignments, manipulate existing alignments and create phylogenetic trees.
Alignment can be done by 2 methods:
slow/accurate
fast/approximate

ClustalW alignment
Method
The ClustalW alignment algorithm consists of 3 steps:
STEP 1
Pairwise Alignments are performed between
all sequences in the compared group.
Alignment scores are used to build a distance
matrix.
When calculating the distance matrix, the program takes into account the divergence of the sequences.

ClustalW alignment
Method
STEP 2
A guide (phylogenetic) tree is created from the distance matrix.
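
To make the step from distance matrix to guide tree concrete, here is a minimal average-linkage (UPGMA-style) clustering sketch. It is not ClustalW's own tree-building code; the clustering rule, the function name `guide_tree`, and the distance values are all assumptions chosen for illustration.

```python
def guide_tree(names, dist):
    """Cluster sequences by average linkage; return (tree, merge order)."""
    # Each cluster key is a leaf name or a nested tuple of two sub-trees;
    # the value lists the indices of the sequences inside that cluster.
    clusters = {name: [i] for i, name in enumerate(names)}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                a, b = clusters[keys[x]], clusters[keys[y]]
                # Average distance between all members of the two clusters.
                d = sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, keys[x], keys[y])
        d, left, right = best
        members = clusters.pop(left) + clusters.pop(right)
        clusters[(left, right)] = members
        merges.append((left, right, d))
    (tree,) = clusters  # the single remaining key is the whole tree
    return tree, merges

names = ["A", "B", "C", "D"]
dist = [  # hypothetical pairwise distances
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.8, 0.2, 0.0],
]
tree, merges = guide_tree(names, dist)
print(tree)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 2))
```

The merge order is the order in which a progressive aligner would combine sequences and sub-alignments, starting with the closest pair.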

ClustalW alignment
Method

STEP 3

Progressive alignment of the sequences is done, following the branch order of the guide tree.
The sequences are aligned from the tips to the root.

ClustalW alignment
Method

At each stage of the progressive alignment, full dynamic programming is applied, using a scoring matrix.
The program calculates sequence weights from the guide tree, and chooses the scoring matrix accordingly (according to the divergence of the compared sequences).

ClustalW alignment
Method

ClustalW weights the sequences according to the distance of each sequence from the root.
ClustalW calculates gaps in a novel way, designed to place them between conserved domains.
ClustalW penalizes for gap opening and extension.

Genetic Distance Calculation

ClustalW calculates the distance in the following way:
Distance = (No. of mismatches in the alignment) / (No. of matches in the alignment)
Positions opposite to a gap are not considered
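
A small sketch of this calculation follows. It assumes that the two "No. of ..." lines on the slide form a ratio (mismatches divided by matches); that reading, and the function name `mismatch_match_distance`, are assumptions rather than a statement of what ClustalW actually implements.

```python
def mismatch_match_distance(seq_a, seq_b):
    """Distance between two aligned rows, skipping positions opposite a gap."""
    matches = mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # positions opposite to a gap are not considered
        if a == b:
            matches += 1
        else:
            mismatches += 1
    if matches == 0:
        raise ValueError("no matching positions, ratio is undefined")
    # Assumption: distance is the ratio of mismatches to matches.
    return mismatches / matches

print(mismatch_match_distance("NCCE", "N-CN"))
```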

ClustalX - Multiple Sequence Alignment Program

ClustalX provides a window-based user interface to the ClustalW program.
It uses the Vibrant multi-platform user interface development library, developed by the National Center for Biotechnology Information as part of their NCBI Software Development Toolkit.

Multiple alignment in
GCG
The program available in GCG for
multiple alignment is Pileup.
Pileup does global alignment very
similar to Clustal W.
The input file for Pileup is a list of
sequence file_names or sequence
codes in the database. The list is
created by a text editor.

Multiple alignment in
GCG
Pileup creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments.
It can also plot a tree showing the
clustering relationships used to
create the alignment.

PileUp for Multiple Alignment

Pileup performs pairwise alignments of all the sequences, using the method of Needleman & Wunsch.
The alignment scores are used to produce a tree.
The tree is used to guide the multiple alignment of the group of related sequences.
Please note that there is no one absolute alignment, even for a limited number of sequences.

PileUp considerations
PileUp does global multiple alignment, and
therefore is good for a group of similar
sequences.
PileUp will fail to find the best local region of
similarity (such as a shared motif) among
distantly related sequences.
PileUp always aligns all of the sequences you
specified in the input file, even if they are not
related. The alignment can be degraded if
some of the sequences are only distantly
related.

Problems with
Progressive alignments
In progressive alignment the ultimate
multiple alignment is dependent on the
initial pairwise alignments.
The first sequences to be aligned are the
most similar (closely related on the tree).
If the initial alignments are good, with very few errors, the ultimate multiple alignment will be good.

Problems with
Progressive alignments
However, if the sequences aligned are distantly related, many more errors can be made, affecting the final alignment.
Another problem with progressive
alignment is that the ultimate multiple
alignment is dependent on choosing the
correct scoring matrices, and the
correct gap penalty.

Iterative Methods for Multiple Sequence Alignment

Do NOT depend on the initial pairwise alignment (recall progressive methods)
Start with an initial alignment and repeatedly realign groups of the sequences
Repeat until one MSA does not change significantly from the next.
After iterations, alignments are better and better.

Iterative Methods
MultAlign
PRRP
DiAlign
Genetic Algorithm

MultAlign
Pairwise scores recalculated
during progressive alignment
Tree is recalculated
Alignment is refined

PRRP
Initial pairwise alignment predicts
tree
Tree produces weights
Locally aligned regions considered
to produce new alignment and tree
Continue until alignments converge

DIALIGN
Pairs of sequences aligned to
locate ungapped aligned regions
Diagonals of various lengths
identified
Collection of weighted diagonals
provide alignment

Genetic Algorithm
Goal: use GAs to identify the best alignments

A class of probabilistic optimization algorithms.
Inspired by the biological evolution process
Uses concepts of Natural Selection and Genetic Inheritance (Darwin 1859)
The search of a large space is parallel and avoids the calculation of derivatives

GA for MSA
Representation:
For sequences 200 residues long, extend each to 250.
Gaps are randomly inserted in each sequence.
XXXXXXXXXX----
---------XXXXXXXX
--XXXXXXXXX----XXX---X-XXXX-X--
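
One way to build such a starting individual is to pick random gap positions until every sequence reaches the common length. This is a minimal sketch, not the representation code of any particular GA program; the helper names, the example sequences and the target length of 14 are invented for illustration.

```python
import random

def pad_with_random_gaps(sequence, target_length, rng):
    """Insert gaps at random positions so the row has target_length columns."""
    n_gaps = target_length - len(sequence)
    if n_gaps < 0:
        raise ValueError("target_length is shorter than the sequence")
    gap_positions = set(rng.sample(range(target_length), n_gaps))
    residues = iter(sequence)
    # Residue order is preserved; only the gap placement is random.
    return "".join(
        "-" if i in gap_positions else next(residues)
        for i in range(target_length)
    )

def random_individual(sequences, target_length, rng):
    """One candidate alignment: every sequence padded to the same length."""
    return [pad_with_random_gaps(s, target_length, rng) for s in sequences]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sequences = ["ACDEFGHIKL", "ACDFGHKL", "ADEFGIKL"]  # hypothetical input
individual = random_individual(sequences, 14, rng)
for row in individual:
    print(row)
```

A population is then just a list of such individuals, each generated with different random gap positions.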

GA for MSA
Fitness function: Sum of pairs with standard scoring matrices and gap penalties
Selection
Half are chosen probabilistically to proceed unchanged (Natural selection)
Half proceed with mutations and crossovers
The probability of selection depends on the value of fitness

GA for MSA
Mutations: sequences are not changed, but gaps are inserted and rearranged
XXXXXXXX
XXXXXXXX
XXXXXXXX
XXXXXXXX
XXXXXXXX

XXX--XXXXX
XXX--XXXXX
XXXXXXX--X
XXXXXXX--X
XXXXXXX--X

GA for MSA
Crossover:
Parents:
AGWS N---VDPA
AEWS TEEE-AL
ATWS -E-EGAAL

--WDK VEVC-AL
WD-H VEEE-WL
WD-Y VWELL-L

Offspring:
--WDKN---VDPA
WD-HTEEE-AL
WD-Y-E-EGAAL

AGWSVEVC-AL
AEWSVEEE-WL
ATWSVWELL-L

GA for MSA
Notredame, C. and Higgins, D. SAGA: Sequence Alignment by Genetic Algorithm.
Slow if number of sequences > 20
GA has been applied in many
different bioinformatics problems.

Other MSA Algorithms

Statistical and Probabilistic Methods
Hidden Markov Model
Neural Networks

Based on Profiles

Profile
A portion of the MSA which is highly
conserved.
A scoring matrix for the mini-msa is
called the profile.
Profiles include matches, mismatches and gaps.
Scanned regions without any gaps are called Blocks.

Applications
Modeling protein families
Modeling protein domains --Pfam
Gene finding
Protein structure predictions
Genome annotations

Modeling protein
families
Functional biological sequences come
in families. Sequences in a family have
diverged from each other in their
primary sequence during evolution.
Knowing that a sequence belongs to
a family often allows inferences about
its function.

Modeling protein
families

To find more members of a known family, pairwise search with any existing members may not find distantly related sequences
More powerful approaches will use the statistical features of the whole set of sequences

Statistical Profiles
Proteins which share a common ancestor are not exactly alike; however, they inherit many similarities in primary structure from their ancestor.
This is known as conservation of
primary structure in a protein family.

Statistical model
These structural similarities make it possible to create a statistical model of a protein family.
Probabilities are calculated from the observed frequencies of amino acids in the family.

Probability of a
sequence
Product of the amino acid
probabilities given by the profile.
E.g. probability of CGGSV,
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = .031

Transformation to logarithmic
Because multiplication of fractions is computationally expensive and prone to floating point errors such as underflow, a convenient transformation into the logarithmic world is used.
The score of CGGSV is
log_e(0.8) + log_e(0.4) + log_e(0.8) + log_e(0.6) + log_e(0.2) = -3.48
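
The CGGSV numbers can be checked directly, and the same few lines show why the logarithmic form is safer. The 400-position case at the end is an invented illustration of underflow, not something taken from the slides.

```python
import math

# Per-position probabilities of C, G, G, S, V as given for the profile.
probs = [0.8, 0.4, 0.8, 0.6, 0.2]

probability = math.prod(probs)               # product of probabilities
log_score = sum(math.log(p) for p in probs)  # sum of natural logs

print(round(probability, 3))  # the value quoted as .031
print(round(log_score, 2))    # the value quoted as -3.48

# Hypothetical long sequence: 400 positions, each with probability 0.1.
long_probs = [0.1] * 400
print(math.prod(long_probs))                 # the product underflows to 0.0
print(sum(math.log(p) for p in long_probs))  # the log score stays finite
```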

Score Calculation
Members of a protein family have
varying lengths, so a score penalty is
charged for insertions and deletions.
The scores of individual amino acids in
a profile are also position specific.
More weight must be given to an
unlikely amino acid which appears in a
structurally important position in the
protein than to one which appears in a
structurally unimportant position.

Refinements to create good profile models
Introduce many additional free parameters which must be calculated when building a profile.
The calculations must be done by trial and error.

Hidden Markov Models


Hidden Markov models (HMMs) offer a
more systematic approach to estimating
model parameters.
The HMM is a dynamic kind of statistical
profile. Like an ordinary profile, it is built
by analyzing the distribution of amino
acids in a training set of related proteins.
However, an HMM has a more complex
topology than a profile. It can be visualized
as a finite state machine, familiar to
students of computer science.

HMM
Finite state machines typically move through a series of states and produce some kind of output either when
the machine has reached a particular state
or
when it is moving from state to state.

HMM
The HMM generates a protein sequence by
emitting amino acids as it progresses
through a series of states.
Each state has a table of amino acid emission probabilities similar to those in a profile model.
There are also transition probabilities for moving from state to state.

States in HMM
There are three kinds of states represented by three different shapes.
The squares are called match states, and
the amino acids emitted from them form
the conserved primary structure of a
protein.
The diamond shapes are insert states and
emit amino acids which result from
insertions.
The circles are special, silent states known
as delete states and model deletions.

Hidden Markov Model for the protein ACCY

Probability Calculation
Any sequence can be represented by a path through the model.
The probability of any sequence,
given the model, is computed by
multiplying the emission and
transition probabilities along the
path.

Probability Calculation
For example, the probability of A
being emitted in position 1 is 0.3,
and the probability of C being
emitted in position 2 is 0.6. The
probability of ACCY along this path
is
.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 * .01 * 1 = 1.76 x 10^-6

Probability Calculation
The calculation is simplified by
transforming probabilities to logs
The resulting number is the raw score
of a sequence, given the HMM.
For example, the score of ACCY along
the path is
log_e(.4) + log_e(.3) + log_e(.46) + log_e(.6) + log_e(.97) + log_e(.5) + log_e(.015) + log_e(.73) + log_e(.01) + log_e(1) = -13.25
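
The two ACCY figures are the same quantity in different units, which a short check makes visible. The list below simply copies the ten factors from the example in the order given; no split into emission and transition terms is assumed.

```python
import math

# The ten emission and transition probabilities along the ACCY path,
# copied in order from the example.
path_factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(path_factors)
raw_score = sum(math.log(p) for p in path_factors)

print(f"{path_probability:.3g}")  # probability along the path
print(round(raw_score, 2))        # raw log score along the path
# Exponentiating the raw score recovers the path probability.
print(math.isclose(math.exp(raw_score), path_probability))
```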

Limitations of HMM
The calculation is easy if the exact
state path is known. In a real model,
many different state paths through a
model can generate the same
sequence.
Therefore, the correct probability of a
sequence is the sum of probabilities
over all of the possible state paths.

Alternative Methods
Forward Algorithm
- Calculate the sum over all paths
Viterbi Algorithm
- Calculate the most probable path

Viterbi algorithm
The algorithm employs a matrix
Columns are indexed by the states
Rows are indexed by the sequence.
Deletion states are not shown, since, by
definition, they have a zero probability of
emitting an amino acid.

Steps
1. Initialize the matrix with zeros
2. Fill the matrix with the probability of each amino acid to occur in the three different states
3. The maximum probability, max(I1, M1), is calculated.
4. A pointer is set from the winner back to state I0.
5. Steps 2-4 are repeated until the matrix is filled.

Probability calculation
Prob(A in state I0) = 0.4*0.3=0.12
Prob(C in state I1) = 0.05*0.06*0.5 = .015
Prob(C in state M1) = 0.46*0.01 = 0.005
Prob(C in state M2) = 0.46*0.5 = 0.23
Prob(Y in state I3) = 0.015*0.73*0.01 = 0.0001
Prob(Y in state M3) = 0.97*0.23 = 0.22

The most likely path through the model can now be found by following the back-pointers.

Probability of a
sequence
Once the most probable path
through the model is known, the
probability of a sequence given the
model can be computed by
multiplying all probabilities along
the path.

Forward algorithm
The forward algorithm is similar to
Viterbi.
However in step 3, a sum rather than a
maximum is computed, and no back
pointers are necessary.
The probability of the sequence is found
by summing the probabilities in the last
column.
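
The difference between the two algorithms (maximum with back-pointers versus sum) can be sketched on a toy model. This is a simplified illustration, not the profile HMM for ACCY: it has only two emitting states and no silent delete states, and every probability in it is a made-up assumption.

```python
from itertools import product

# Hypothetical toy HMM: every probability here is an illustrative assumption.
STATES = ("M", "I")
START = {"M": 0.9, "I": 0.1}
TRANS = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
EMIT = {
    "M": {"A": 0.5, "C": 0.4, "Y": 0.1},
    "I": {"A": 0.2, "C": 0.2, "Y": 0.6},
}

def viterbi(seq):
    """Return (probability, state path) of the single most probable path."""
    best = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    pointers = []
    for symbol in seq[1:]:
        step, back = {}, {}
        for s in STATES:
            # Keep only the best predecessor and remember it as a back-pointer.
            prev = max(STATES, key=lambda p: best[p] * TRANS[p][s])
            back[s] = prev
            step[s] = best[prev] * TRANS[prev][s] * EMIT[s][symbol]
        best = step
        pointers.append(back)
    last = max(STATES, key=best.get)
    path = [last]
    for back in reversed(pointers):  # follow the back-pointers
        path.append(back[path[-1]])
    return best[last], "".join(reversed(path))

def forward(seq):
    """Return the probability of seq summed over all state paths."""
    total = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        # Same recursion as Viterbi, with a sum in place of the maximum.
        total = {
            s: sum(total[p] * TRANS[p][s] for p in STATES) * EMIT[s][symbol]
            for s in STATES
        }
    return sum(total.values())  # sum over the last column

def all_paths(seq):
    """Brute-force check: probability of seq along every possible path."""
    result = {}
    for path in product(STATES, repeat=len(seq)):
        p = START[path[0]] * EMIT[path[0]][seq[0]]
        for prev, cur, symbol in zip(path, path[1:], seq[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][symbol]
        result["".join(path)] = p
    return result

seq = "ACCY"
best_prob, best_path = viterbi(seq)
total_prob = forward(seq)
paths = all_paths(seq)
print(best_path, best_prob, total_prob)
```

The brute-force enumeration is only feasible for very short sequences; it is included so that the forward sum and the Viterbi maximum can be checked against every path explicitly.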

Limitations of HMM
The HMM is a linear model and is unable to capture higher order correlations among amino acids in a protein molecule, such as:
Hydrogen bonds between non-adjacent amino acids in a polypeptide chain
Hydrogen bonds created between amino acids in multiple chains
Disulfide bridges, chemical bonds between C (cysteine) amino acids which are distant from each other within the molecule

Limitations of HMM
In reality, amino acids which are far
apart in the linear chain may be
physically close to each other when
a protein folds.
Chemical and electrical interactions
between them cannot be predicted
with a linear model.

Limitations of HMM
Another flaw of HMMs lies at the very heart of the mathematical theory behind these models: the probabilities are assumed to be independent.
The probability of a protein sequence can
be found by multiplying the probabilities
of the amino acids in the sequence.
This is only valid if the probability of any
amino acid in the sequence is independent
of the probabilities of its neighbors.

Limitations of HMM
In biology, this is not the case. There are, in
fact, strong dependencies between these
probabilities.
For example, hydrophobic amino acids are
highly likely to appear in proximity to each
other.
Because such molecules fear water, they cluster
at the inside of a protein, rather than at the
surface where they would be forced to
encounter water molecules.

Insight into New Methods
These biological realities have motivated research into new kinds of statistical models.
Hybrids of HMMs and neural nets, dynamic Bayesian nets, factorial HMMs, Boltzmann trees and hidden Markov random fields are among the areas being explored.
