Alignment
What is a Multiple Sequence Alignment?
Unlike pair-wise alignments, which involve 2 sequences (nucleotide or protein), multiple sequence alignments involve more than 2 sequences (often 100s, either nucleotide or protein).
As was the case for pair-wise ...
Terminology (for Proteins)
Family
Superfamily
Group of protein families related by distant yet detectable sequence similarity.
1100 protein superfamilies in Protein DataBank (v1.61).
Block
Ungapped conserved sequence pattern (in a protein family).
Motif
Conserved sequence pattern found in multiple proteins with similar biochemical activity, usually near the active site.
Module
Conserved sequence (contiguous) of one or more motifs, considered a fundamental unit of structure or function.
Homologous
Extended sequence pattern suggesting common evolutionary origin (contains one or more motifs, may contain gaps).
Domain
Segment of polypeptide chain that can fold into a 3D structure irrespective of other segments (multiple domains in a protein).
Class
Classification of domains according to secondary structure.
Examples: mainly-α, mainly-β, α/β, α+β, membrane
Applications of Multiple Alignment
Homology Modeling
Phylogenetic Analysis
Advanced Database Searches, Patterns, Motifs, Promoters
Why do we need multiple alignments?
Multiple alignments are used for protein modeling programs.
Dynamic Programming
Pairwise sequence alignment: a scoring matrix where each position provides the best alignment up to that point.
Extension to 3 sequences: the lattice of a cube that is to be filled with calculated dynamic programming scores.
Scoring positions on the 3 surfaces of the cube represent the alignment of a pair of sequences.
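The pairwise matrix described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular program: the function name `global_alignment_score` and the default scores (match 1, mismatch 0, gap -1) are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Fill the pairwise dynamic programming matrix and return the best global score.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: one sequence aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in sequence b
            left = score[i][j - 1] + gap    # gap in sequence a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]
```

For three sequences the same recurrence would run over a cube instead of a matrix, so both the memory and the number of cells grow with the product of the three lengths.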
Dynamic Programming
2 sequences: 3 possibilities at each position (match/mismatch, gap in sequence 1, gap in sequence 2).
3 sequences: 7 possibilities at each position.
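The counts 3 and 7 follow from a simple rule: each new alignment column either takes a residue from a sequence or places a gap there, and the all-gap column is not allowed, which gives 2^k - 1 choices for k sequences. A small sketch (the helper name `dp_moves` is hypothetical):

```python
from itertools import product

def dp_moves(k):
    """All ways to extend an alignment of k sequences by one column.

    In each tuple, 1 means "consume a residue from that sequence" and
    0 means "place a gap". The all-gap column is excluded.
    """
    return [move for move in product((0, 1), repeat=k) if any(move)]

print(len(dp_moves(2)))  # 3
print(len(dp_moves(3)))  # 7
```

The count roughly doubles with every added sequence, which is one reason exact dynamic programming quickly becomes impractical.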
Computational Complexity
MSA
Carrillo and Lipman (1988): the multiple sequence alignment search space can be bounded by the pairwise alignments.
Sum of Pairs
Given 5 sequences:
NCCE
NNCE
N-CN
SCSN
SCSE
How many possible combinations of pairwise alignments are there for each position?
Answer: C(5, 2) = 10
Sum of Pairs
Assume: match/mismatch/gap = 1/0/-1
NCCE
NNCE
N-CN
SCSN
SCSE
The 1st position: # of N-N (3), # of S-S (1), # of N-S (6)
SP(1) = 4*1 + 0*6 + (-1)*0 = 4
The 2nd position: # of C-C (3), # of N-C (3), # of gaps (4)
SP(2) = 3*1 + 0*3 + (-1)*4 = -1
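The column scores above can be reproduced with a short sketch. It uses the stated match/mismatch/gap values of 1/0/-1. The function names are hypothetical, and scoring a gap paired with another gap as 0 is an assumption, since the example has no such pair.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=0, gap=-1):
    """Score one pair of symbols taken from the same alignment column."""
    if x == "-" and y == "-":
        return 0  # assumption: gap-gap pairs are ignored
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sum_of_pairs(alignment):
    """Return the sum-of-pairs score of every column of the alignment."""
    return [
        sum(pair_score(x, y) for x, y in combinations(column, 2))
        for column in zip(*alignment)
    ]

alignment = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
column_scores = sum_of_pairs(alignment)
print(column_scores)       # [4, -1, 4, 4]
print(sum(column_scores))  # 11
```

Each column contributes C(5, 2) = 10 pairs, so the first two entries match SP(1) and SP(2) as computed by hand.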
The Computational Challenge of MSA
... of the many alignments possible.
Approximate methods are used instead of dynamic programming methods.
Another computational challenge is placement and ...
Approximate Methods
Progressive methods
Iterative methods
Start by making initial alignments of small groups of sequences, and then revise the alignment for better results.
Approximate Methods
Alignment based on small conserved domains (or
Progressive Methods
The most practical and widely used method for
Steps to create
multiple alignment
Pairwise comparisons of all sequences
Start with the most related (similar)
sequences, then the next most similar pair
and so on. Once an alignment of two
sequences has been made, then this is fixed.
Perform cluster analysis on the pairwise data
to generate a hierarchy for alignment. This
may be in the form of a binary tree or simple
ordering tree.
Steps to create
multiple alignment
After the clustering is done, relationships
between the sequences are modeled by a
tree.
If the program (e.g. ClustalW) builds an
evolutionary tree, then the sequences
represent the outer branches of the
evolutionary tree.
Inner branches represent dissimilarities of
the sequences at the outer branches.
Clustal W
Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Clustal W
Clustal W can create multiple alignments, manipulate existing alignments and create phylogenetic trees.
Alignment can be done by 2 methods:
slow/accurate
fast/approximate
ClustalW alignment Method
The ClustalW alignment algorithm consists of 3 steps:
STEP 1
Pairwise Alignments are performed between
all sequences in the compared group.
Alignment scores are used to build a distance
matrix.
When calculating the distance matrix, the program takes into account the divergence of the sequences.
ClustalW alignment
Method
STEP 2
A guide (phylogenetic) tree is created from the distance matrix.
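Steps 1 and 2 can be illustrated with a toy sketch. It is not ClustalW's own procedure: distances are taken as 1 minus the fraction of identical positions between equal-length sequences, and a simple average-linkage (UPGMA-style) clustering is swapped in to build the tree. The function names and the example sequences are assumptions made for illustration.

```python
from itertools import combinations

def identity_distance(a, b):
    """Distance = 1 - fraction of identical positions (equal-length toy sequences)."""
    same = sum(x == y for x, y in zip(a, b))
    return 1.0 - same / len(a)

def guide_tree(seqs):
    """Average-linkage clustering; returns the guide tree as nested tuples of names."""
    names = list(seqs)
    # Step 1: distance matrix from all pairwise comparisons.
    dist = {frozenset(pair): identity_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(names, 2)}
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def average(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    # Step 2: repeatedly join the two closest clusters.
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

seqs = {"a": "NCCE", "b": "NNCE", "c": "SCSN", "d": "SCSE"}
tree = guide_tree(seqs)
print(tree)  # (('a', 'b'), ('c', 'd'))
```

The nested tuples give the order in which a progressive method would align the sequences: the closest pairs first, then the resulting groups.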
ClustalW alignment
Method
STEP 3
Genetic Distance
Calculation
ClustalX - Multiple
Sequence Alignment
Program
Multiple alignment in
GCG
The program available in GCG for
multiple alignment is Pileup.
Pileup does global alignment very
similar to Clustal W.
The input file for Pileup is a list of sequence file names or sequence codes in the database. The list is created with a text editor.
Multiple alignment in
GCG
Pileup creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments.
It can also plot a tree showing the
clustering relationships used to
create the alignment.
PileUp considerations
PileUp does global multiple alignment, and
therefore is good for a group of similar
sequences.
PileUp will fail to find the best local region of
similarity (such as a shared motif) among
distantly related sequences.
PileUp always aligns all of the sequences you
specified in the input file, even if they are not
related. The alignment can be degraded if
some of the sequences are only distantly
related.
Problems with
Progressive alignments
In progressive alignment the ultimate
multiple alignment is dependent on the
initial pairwise alignments.
The first sequences to be aligned are the
most similar (closely related on the tree).
If the initial alignments are good, with very few errors, the ultimate multiple alignment will be good.
Problems with
Progressive alignments
However, if the sequences aligned are distantly related, many more errors can be made, affecting the final alignment.
Another problem with progressive
alignment is that the ultimate multiple
alignment is dependent on choosing the
correct scoring matrices, and the
correct gap penalty.
Iterative Methods
MultAlign
PRRP
DiAlign
Genetic Algorithm
MultAlign
Pairwise scores recalculated
during progressive alignment
Tree is recalculated
Alignment is refined
PRRP
Initial pairwise alignment predicts
tree
Tree produces weights
Locally aligned regions considered
to produce new alignment and tree
Continue until alignments converge
DIALIGN
Pairs of sequences aligned to
locate ungapped aligned regions
Diagonals of various lengths
identified
Collection of weighted diagonals
provide alignment
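The idea of locating ungapped regions along diagonals can be sketched as follows. This is a strong simplification and not DIALIGN itself: only runs of identical residues are reported, and no weights are computed. The function name `exact_diagonals` and the `min_len` parameter are hypothetical.

```python
def exact_diagonals(a, b, min_len=3):
    """Find runs of identical residues along each diagonal of the a-by-b comparison.

    Returns tuples (start_in_a, start_in_b, length) for runs of at least min_len.
    """
    found = []
    # offset = i - j identifies one diagonal of the comparison matrix.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    found.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # run that reaches the end of the diagonal
            found.append((i - run, j - run, run))
    return found

print(exact_diagonals("XXACGTYY", "ZACGTZZZ"))  # [(2, 1, 4)]
```

A full method would score each diagonal and then pick a consistent set of them to form the alignment.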
Genetic Algorithm
Goal: use GAs to identify the best alignments.
A class of probabilistic optimization algorithms.
GA for MSA
Representation:
For sequences 200 residues long,
extend each to 250.
Gaps are randomly inserted in each sequence.
XXXXXXXXXX----
---------XXXXXXXX
--XXXXXXXXX----XXX---X-XXXX-X--
GA for MSA
Fitness function: Sum of pairs with
GA for MSA
Mutations: sequences are not changed, but gaps are inserted and rearranged.
Before:
XXXXXXXX
XXXXXXXX
XXXXXXXX
XXXXXXXX
XXXXXXXX
After:
XXX--XXXXX
XXX--XXXXX
XXXXXXX--X
XXXXXXX--X
XXXXXXX--X
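The gap-insertion mutation shown above can be sketched as a small function: the rows are split into two groups, and a gap block of the same length is inserted at one column in the first group and at another column in the second, so all rows keep the same width. The names `insert_gaps` and `random_gap_mutation`, and the choice of gap lengths 1 to 3, are assumptions for illustration.

```python
import random

def insert_gaps(alignment, group, pos_a, pos_b, length):
    """Insert `length` gaps at column pos_a in the rows of `group`
    and at column pos_b in all other rows."""
    mutated = []
    for index, row in enumerate(alignment):
        pos = pos_a if index in group else pos_b
        mutated.append(row[:pos] + "-" * length + row[pos:])
    return mutated

def random_gap_mutation(alignment, rng=random):
    """Apply insert_gaps with a randomly chosen group, positions and gap length."""
    width = len(alignment[0])
    group_size = rng.randint(1, len(alignment) - 1)
    group = set(rng.sample(range(len(alignment)), group_size))
    return insert_gaps(alignment, group,
                       rng.randint(0, width), rng.randint(0, width),
                       rng.randint(1, 3))

before = ["XXXXXXXX"] * 5
after = insert_gaps(before, {0, 1}, 3, 7, 2)
print(after)
```

The residues themselves are never altered, only the gap placement, so removing the gaps from any mutated row gives back the original sequence.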
GA for MSA
Crossover:
AGWS N---VDPA
AEWS TEEE-AL
ATWS -E-EGAAL

--WDK VEVC-AL
WD-H VEEE-WL
WD-Y VWELL-L

--WDKN---VDPA
WD-HTEEE-AL
WD-Y-E-EGAAL

AGWSVEVC-AL
AEWSVEEE-WL
ATWSVWELL-L
GA for MSA
Notredame, C. and Higgins, D. SAGA: Sequence Alignment by Genetic Algorithm.
Slow if number of sequences > 20
GA has been applied in many
different bioinformatics problems.
Based on Profiles
Profile
A portion of the MSA which is highly
conserved.
A scoring matrix for the mini-MSA is called the profile.
Profiles include matches, mismatches and gaps.
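As a rough illustration of what such a scoring matrix contains, the sketch below turns a small conserved block into per-column symbol frequencies. Real profiles typically add pseudocounts, sequence weights and gap penalties. The function name `build_profile` and the five example sequences are hypothetical.

```python
from collections import Counter

def build_profile(block):
    """Per-column symbol frequencies for an equal-length alignment block.

    Gap characters, if present, are counted like any other symbol.
    """
    n = len(block)
    return [
        {symbol: count / n for symbol, count in Counter(column).items()}
        for column in zip(*block)
    ]

# Hypothetical mini-MSA used only for illustration.
block = ["CGGSV", "CGGSL", "CAGSV", "CGGTV", "AGGSV"]
profile = build_profile(block)
print(profile[0])  # {'C': 0.8, 'A': 0.2}
```

Each dictionary is one column of the profile, so a candidate sequence can be scored position by position against it.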
Applications
Modeling protein families
Modeling protein domains --Pfam
Gene finding
Protein structure predictions
Genome annotations
Modeling protein
families
Functional biological sequences come
in families. Sequences in a family have
diverged from each other in their
primary sequence during evolution.
Knowing that a sequence belongs to
a family often allows inferences about
its function.
Modeling protein
families
Statistical Profiles
Proteins which share a common ancestor are not exactly alike; however, they inherit many similarities in primary structure from their ancestor.
This is known as conservation of
primary structure in a protein family.
Statistical model
These structural similarities make
it possible to
create a statistical model of a
protein family.
Probability of a
sequence
Product of the amino acid
probabilities given by the profile.
E.g. probability of CGGSV,
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = .031
Transformation to
logarithmic
Because multiplication of fractions is computationally expensive and prone to floating point errors such as underflow, a convenient transformation into the logarithmic domain is used.
The score of CGGSV is
ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
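The two calculations for CGGSV can be checked directly, and the same sketch shows why logarithms are preferred for long sequences. The list `probs` holds the five per-position probabilities quoted in the example. The 2000-position case with probability 0.5 everywhere is a made-up illustration of underflow.

```python
import math

# Per-position probabilities of C, G, G, S, V taken from the example.
probs = [0.8, 0.4, 0.8, 0.6, 0.2]

probability = math.prod(probs)                 # plain product
log_score = sum(math.log(p) for p in probs)    # sum of natural logs

print(round(probability, 3))  # 0.031
print(round(log_score, 2))    # -3.48

# Hypothetical long sequence: the product underflows to zero,
# while the log score stays an ordinary finite number.
long_probs = [0.5] * 2000
print(math.prod(long_probs))                          # 0.0
print(round(sum(math.log(p) for p in long_probs), 2)) # -1386.29
```

Exponentiating the log score recovers the probability whenever the product itself is still representable.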
Score Calculation
Members of a protein family have
varying lengths, so a score penalty is
charged for insertions and deletions.
The scores of individual amino acids in
a profile are also position specific.
More weight must be given to an
unlikely amino acid which appears in a
structurally important position in the
protein than to one which appears in a
structurally unimportant position.
Refinements to create
good profile models
Introduce many additional free parameters, which must be calculated when building a profile.
HMM
Finite state machines typically move through a series of states and produce some kind of output either when the machine has reached a particular state or when it is moving from state to state.
HMM
The HMM generates a protein sequence by
emitting amino acids as it progresses
through a series of states.
Each state has a table of amino acid emission probabilities similar to those in a profile model, and transition probabilities for moving from state to state.
States in HMM
There are three kinds of states represented by three different shapes.
The squares are called match states, and
the amino acids emitted from them form
the conserved primary structure of a
protein.
The diamond shapes are insert states and
emit amino acids which result from
insertions.
The circles are special, silent states known
as delete states and model deletions.
Probability Calculation
Any sequence can be represented by a path through the model.
The probability of any sequence,
given the model, is computed by
multiplying the emission and
transition probabilities along the
path.
Probability Calculation
For example, the probability of A
being emitted in position 1 is 0.3,
and the probability of C being
emitted in position 2 is 0.6. The
probability of ACCY along this path is
.4 * .3 * .46 * .6 * .97 * .5 * .015 * .73 * .01 * 1 = 1.76 x 10^-6
Probability Calculation
The calculation is simplified by
transforming probabilities to logs
The resulting number is the raw score
of a sequence, given the HMM.
For example, the score of ACCY along
the path is
ln(.4) + ln(.3) + ln(.46) + ln(.6) + ln(.97) + ln(.5) + ln(.015) + ln(.73) + ln(.01) + ln(1) = -13.25
Limitations of HMM
The calculation is easy if the exact
state path is known. In a real model,
many different state paths through a
model can generate the same
sequence.
Therefore, the correct probability of a
sequence is the sum of probabilities
over all of the possible state paths.
Alternative Methods
Forward Algorithm
- Calculate the sum over all paths
Viterbi Algorithm
- Calculate the most probable path
Viterbi algorithm
The algorithm employs a matrix
Columns are indexed by the states
Rows are indexed by the sequence.
Deletion states are not shown, since, by
definition, they have a zero probability of
emitting an amino acid.
Steps
1. Initialize the matrix with zeros
2. Fill the matrix with the probability of each amino
acid to occur in the three different states
3. The maximum probability, max(I1, M1), is calculated.
Probability calculation
Prob(A in state I0) = 0.4*0.3=0.12
Prob(C in state I1) = 0.05*0.06*0.5 = .0015
Prob(C in state M1) = 0.46*0.01 = 0.005
Prob(C in state M2) = 0.46*0.5 = 0.23
Prob(Y in state I3) = 0.015*0.73*0.01 = .0001
Prob(Y in state M3) = 0.97*0.23 = 0.22
Probability of a
sequence
Once the most probable path
through the model is known, the
probability of a sequence given the
model can be computed by
multiplying all probabilities along
the path.
Forward algorithm
The forward algorithm is similar to
Viterbi.
However in step 3, a sum rather than a
maximum is computed, and no back
pointers are necessary.
The probability of the sequence is found
by summing the probabilities in the last
column.
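The relationship between the two algorithms can be seen in a compact sketch that runs both over the same matrix, taking a maximum for Viterbi and a sum for forward. It uses a toy two-state HMM without silent delete states, not the profile HMM from the figures. The state names, all probabilities, and the function name `viterbi_and_forward` are hypothetical.

```python
def viterbi_and_forward(obs, states, start, trans, emit):
    """Return (best_path, best_path_probability, total_probability)
    for a small HMM without silent states."""
    # Entries for the first symbol of the sequence.
    vit = {s: start[s] * emit[s][obs[0]] for s in states}
    fwd = dict(vit)
    back_pointers = []
    for symbol in obs[1:]:
        new_vit, new_fwd, pointers = {}, {}, {}
        for s in states:
            # Viterbi: keep only the best predecessor and remember it.
            best_prev = max(states, key=lambda p: vit[p] * trans[p][s])
            pointers[s] = best_prev
            new_vit[s] = vit[best_prev] * trans[best_prev][s] * emit[s][symbol]
            # Forward: sum over all predecessors, no back pointer needed.
            new_fwd[s] = sum(fwd[p] * trans[p][s] for p in states) * emit[s][symbol]
        vit, fwd = new_vit, new_fwd
        back_pointers.append(pointers)
    # Trace the most probable path backwards from the best final state.
    last = max(states, key=lambda s: vit[s])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, vit[last], sum(fwd.values())

# Toy model: M is a match-like state, I an insert-like state (assumed values).
states = ("M", "I")
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.5, "I": 0.5}}
emit = {"M": {"A": 0.7, "C": 0.3}, "I": {"A": 0.5, "C": 0.5}}

path, best, total = viterbi_and_forward("AC", states, start, trans, emit)
print(path)             # ['M', 'M']
print(round(best, 4))   # 0.1512
print(round(total, 4))  # 0.2342
```

The total is larger than the best-path probability because it also includes the three less likely state paths that emit the same sequence.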
Limitations of HMM
The HMM is a linear model and is unable to capture higher order correlations among amino acids in a protein molecule.
Hydrogen bonds between non-adjacent amino acids in a polypeptide chain
Hydrogen bonds created between amino acids in multiple chains
Disulfide bridges, chemical bonds between C (cysteine) amino acids which are distant from each other within the molecule
Limitations of HMM
In reality, amino acids which are far
apart in the linear chain may be
physically close to each other when
a protein folds.
Chemical and electrical interactions
between them cannot be predicted
with a linear model.
Limitations of HMM
Another flaw of HMMs lies at the very heart of the mathematical theory behind these models: they assume that the probabilities of the amino acids in a sequence are independent.
The probability of a protein sequence can
be found by multiplying the probabilities
of the amino acids in the sequence.
This is only valid if the probability of any
amino acid in the sequence is independent
of the probabilities of its neighbors.
Limitations of HMM
In biology, this is not the case. There are, in
fact, strong dependencies between these
probabilities.
For example, hydrophobic amino acids are
highly likely to appear in proximity to each
other.
Because such amino acids repel water, they cluster at the inside of a protein, rather than at the surface where they would be forced to encounter water molecules.