Dynamic Programming Algorithm, continued
End gaps
Affine versus linear gap penalties
Global, local-global and local alignment
SeqLab: using GenBank Features
SeqLab: optimizing parameters for BestFit
Number of gaps

AGGCTACT~T~TCA
 GGCTACTATATCA
Two separate gaps: -10

AGGCTACTTT~~CA
 GGCTACTATATCA
One gap of length two: -5 -1 = -6
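The two scores above are consistent with a gap-opening penalty of -5 and a gap-extension penalty of -1. That reading is an assumption, as is the helper name `gap_cost`. Under it, a minimal sketch of the affine gap cost of one alignment row could look like this:

```python
import re

def gap_cost(row, gap_open=-5, gap_extend=-1, gap_char="~"):
    """Sum the affine penalties over every run of gap characters in one row.

    A run of length L is assumed to cost gap_open + (L - 1) * gap_extend.
    """
    total = 0
    for run in re.findall(re.escape(gap_char) + "+", row):
        total += gap_open + (len(run) - 1) * gap_extend
    return total

# Two separate gaps of length 1 versus one gap of length 2.
two_gaps = gap_cost("AGGCTACT~T~TCA")
one_gap = gap_cost("AGGCTACTTT~~CA")

# A linear penalty is the special case gap_extend == gap_open.
linear_two_gaps = gap_cost("AGGCTACT~T~TCA", gap_extend=-5)
linear_one_gap = gap_cost("AGGCTACTTT~~CA", gap_extend=-5)
```

With these assumed values the affine penalty separates the two alignments (-10 versus -6), while the linear penalty scores both the same.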
Affine Gap DP
The optimal alignment is the highest-scoring one. Alignments entering the last box on the bottom row can arrive by five types of arrows instead of just three:
(1) Match.
(2) Open a gap in the first sequence.
(3) Open a gap in the second sequence.
(4) Extend a gap in the first sequence.
(5) Extend a gap in the second sequence.
I = insertion matrix: scores for alignments with a gap in the first sequence.
D = deletion matrix: scores for alignments with a gap in the second sequence.
M(i,j) is the max over three diagonal arrows.
D(i,j) is the max over three right arrows.
[Figure: arrows entering a cell of the M matrix and a cell of the D matrix.]
AGGCTACT~TATCA
 GGCTACTA~ATCA
If you think this alignment does not make sense, then D to I and I to D can simply be disallowed in the DP algorithm. Most programs do this.
[Exception: For a global alignment, D-to-I or I-to-D arrows are allowed at the ends of alignments because there is no other way to complete the matrix.]
So we can save the traceback arrows as just the letters M, I or D (except at the ends).
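The three-matrix recursion described above can be sketched as follows. This is a minimal illustration rather than any particular program's implementation: the function name `affine_align`, the scoring values, and the convention that a gap of length L costs `gap_open + (L - 1) * gap_extend` are all assumptions. D-to-I and I-to-D moves are disallowed everywhere, so the end-of-alignment exception mentioned above is not implemented.

```python
NEG = float("-inf")

def affine_align(a, b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Global alignment with affine gaps; returns (score, state string).

    M = match matrix, I = gap in first sequence a, D = gap in second sequence b.
    """
    n, m = len(a), len(b)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    I = [[NEG] * (m + 1) for _ in range(n + 1)]
    D = [[NEG] * (m + 1) for _ in range(n + 1)]
    # back[X][i][j] is the letter of the matrix the best arrow came from.
    back = {k: [[None] * (m + 1) for _ in range(n + 1)] for k in "MID"}
    M[0][0] = 0
    for j in range(1, m + 1):          # leading gap in the first sequence
        I[0][j] = gap_open + (j - 1) * gap_extend
        back["I"][0][j] = "I" if j > 1 else "M"
    for i in range(1, n + 1):          # leading gap in the second sequence
        D[i][0] = gap_open + (i - 1) * gap_extend
        back["D"][i][0] = "D" if i > 1 else "M"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # M: diagonal arrow from any of the three matrices.
            M[i][j], back["M"][i][j] = max(
                (M[i - 1][j - 1] + s, "M"),
                (I[i - 1][j - 1] + s, "I"),
                (D[i - 1][j - 1] + s, "D"),
            )
            # I: open from M or extend within I (no D-to-I arrow).
            I[i][j], back["I"][i][j] = max(
                (M[i][j - 1] + gap_open, "M"),
                (I[i][j - 1] + gap_extend, "I"),
            )
            # D: open from M or extend within D (no I-to-D arrow).
            D[i][j], back["D"][i][j] = max(
                (M[i - 1][j] + gap_open, "M"),
                (D[i - 1][j] + gap_extend, "D"),
            )
    # Traceback: record which matrix each column of the alignment came from.
    score, state = max((M[n][m], "M"), (I[n][m], "I"), (D[n][m], "D"))
    i, j, states = n, m, []
    while i > 0 or j > 0:
        states.append(state)
        prev = back[state][i][j]
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "I":
            j -= 1
        else:
            i -= 1
        state = prev
    return score, "".join(reversed(states))

score1, states1 = affine_align("ACGT", "AGT")
score2, states2 = affine_align("AACC", "AATTCC")
```

In the first (hypothetical) example the C of the first sequence is aligned to a gap in the second, so the state string contains one D. In the second, two letters of the second sequence face a single gap of length two in the first, giving two consecutive I states.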
M = match matrix.
I = insertion matrix: scores for alignments with a gap in the first sequence.
D = deletion matrix: scores for alignments with a gap in the second sequence.
M M I I I I D M M M M
Traceback
The traceback is a sequence of the letters M, I or D, but this is NOT a traceback of the sequence letters. Instead it is a traceback of the location of the letters (which matrix they are in), since the location (the matrix) defines the direction of the next arrow.
MIIIIMDMMMI
A~~~~DPQFG~
AKLKLD~QFGP
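Turning a state string back into the two alignment rows is mechanical once the meaning of each letter is fixed. The sketch below assumes the definitions given above (I = gap in the first sequence, D = gap in the second sequence); the function name `states_to_alignment` and the short example sequences are hypothetical.

```python
def states_to_alignment(a, b, states, gap="~"):
    """Expand a string of M/I/D states into the two rows of an alignment."""
    top, bottom = [], []
    i = j = 0
    for s in states:
        if s == "M":            # letter of a against letter of b
            top.append(a[i])
            bottom.append(b[j])
            i, j = i + 1, j + 1
        elif s == "I":          # gap in the first sequence
            top.append(gap)
            bottom.append(b[j])
            j += 1
        elif s == "D":          # gap in the second sequence
            top.append(a[i])
            bottom.append(gap)
            i += 1
        else:
            raise ValueError("unknown state: " + s)
    if i != len(a) or j != len(b):
        raise ValueError("state string does not consume both sequences")
    return "".join(top), "".join(bottom)

rows = states_to_alignment("ACGT", "AGT", "MDMM")
```

The final length check is a cheap consistency test: the number of M and D states must equal the length of the first sequence, and the number of M and I states must equal the length of the second.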
Example: here is an alignment of mouse nitric oxide synthase (thick black line). It has multiple domains that are homologous to several shorter proteins. If we penalize end gaps, what happens to the score of the true alignment? Did "end gaps" evolve the same way as internal gaps? (No!) Unless the two proteins are known to be single domains,
[Figure: alignment matrix for GTTCAGCTT versus TCACT, shown in several steps, with zeros filled in along the first row and first column.]
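One common way to leave end gaps unpenalized is to start the first row and first column at zero and to read the final score from the best cell in the last row or last column. The sketch below illustrates that idea with a linear gap penalty for brevity; the function name `align_score`, the scoring values, and the example sequences are assumptions, not taken from the slides.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, free_end_gaps=True):
    """Alignment score with a linear gap penalty.

    With free_end_gaps=True, leading and trailing gaps cost nothing.
    With free_end_gaps=False, this is an ordinary global alignment.
    """
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not free_end_gaps:
        for i in range(1, n + 1):
            F[i][0] = i * gap
        for j in range(1, m + 1):
            F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    if free_end_gaps:
        # Trailing gaps are free: take the best cell on the bottom or right edge.
        return max(max(F[n]), max(row[m] for row in F))
    return F[n][m]

free = align_score("TSF", "PGTSFEP")
penalized = align_score("TSF", "PGTSFEP", free_end_gaps=False)
```

Here the short sequence matches perfectly inside the longer one. With free end gaps it scores 3; with penalized end gaps the same alignment pays for four gap positions and scores -5.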
Local Alignment
A local alignment can start and end anywhere in the sequences (i.e. in the alignment matrix).
[Figure: alignment matrix with PGTSFEP along one axis. The local alignment TSF / TSF starts in a cell whose score is 0 + match score.]
Local Alignment
...is the most generally applicable alignment method, since it has the fewest assumptions.
ATSFM~~~~~~~
~~~~~PGTSFEP
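A local alignment can be sketched by clamping every cell at zero, so that an alignment may start anywhere, and by tracing back from the highest-scoring cell until a zero is reached. The code below is a minimal illustration with a linear gap penalty; the function name `local_align` and the scoring values are assumptions.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment; returns (score, top row, bottom row)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets a new alignment start in this cell.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("~")
            i -= 1
        else:
            top.append("~")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

result = local_align("ATSFM", "PGTSFEP")
```

For the two sequences shown above, the best local alignment under these assumed scores is the shared segment TSF, and the unmatched ends are simply left out.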
2DRC:A   52/53    --RPLPGRKNIILSSQP--GTDDRVTWVKSVDEAIAACG------DVPEIMVIGGGRVYE
1DRF:_   63/64    KNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYK

2DRC:A   102/103  QFLPK--AQKLYLTHIDAEVEGDTHFPDYEPDDWESVF------SEFHDADAQNSHSYCF
1DRF:_   123/124  EAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEE---KGIKYKF

2DRC:A   154/155  EILERR
1DRF:_   180/181  EVYEKN
Note: many mismatches (identity = 38%), few gaps (8), and the gaps can be long (lengths 1-7).
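Summary numbers of this kind can be computed directly from the two rows of an alignment. The sketch below assumes that identity is counted over columns where neither row has a gap; that choice of denominator, and the function name `alignment_stats`, are assumptions, so the result need not reproduce the figures quoted above.

```python
import re

def alignment_stats(top, bottom, gap_chars="-~"):
    """Return (fraction of identical aligned columns, list of gap-run lengths)."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    aligned = identical = 0
    for x, y in zip(top, bottom):
        if x in gap_chars or y in gap_chars:
            continue                      # skip gapped columns
        aligned += 1
        identical += (x == y)
    pattern = "[" + re.escape(gap_chars) + "]+"
    gap_lengths = [len(run)
                   for row in (top, bottom)
                   for run in re.findall(pattern, row)]
    identity = identical / aligned if aligned else 0.0
    return identity, gap_lengths

# Last block of the alignment above: two identical columns out of six, no gaps.
tail_identity, tail_gaps = alignment_stats("EILERR", "EVYEKN")

# Hypothetical rows with one gap of length two.
toy_identity, toy_gaps = alignment_stats("ACG--T", "AGGTTT")
```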
Structure-based alignment
Two similar structures may be superimposed. The parts that overlay well are the matches (purple and green), and the parts that do not overlay well are the insertions (yellow and red). Aligned positions have a similar 3D chemical environment.
Parametric search
How do we know what should be used as the gap penalty and extension penalty?
Gap penalty:        0   1   5  10  10  50
Extension penalty:  0   0   1   1  10  50
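A parametric search simply repeats the alignment for each parameter setting and compares the outcomes. The sketch below reads the two rows of numbers above as (gap penalty, extension penalty) pairs, which is an assumption. The function name `affine_score`, the match and mismatch values, the example sequences, and the convention that a gap of length L costs `gap_open + (L - 1) * gap_extend` are likewise illustrative and are not BestFit's actual scoring.

```python
NEG = float("-inf")

def affine_score(a, b, gap_open, gap_extend, match=2, mismatch=-1):
    """Global affine-gap alignment score; penalties are given as positive numbers."""
    n, m = len(a), len(b)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends in a match or mismatch
    I = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends in a gap in a
    D = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends in a gap in b
    M[0][0] = 0
    for j in range(1, m + 1):
        I[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        D[i][0] = -gap_open - (i - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1])
            I[i][j] = max(M[i][j - 1] - gap_open, I[i][j - 1] - gap_extend)
            D[i][j] = max(M[i - 1][j] - gap_open, D[i - 1][j] - gap_extend)
    return max(M[n][m], I[n][m], D[n][m])

# (gap penalty, extension penalty) pairs, as read from the rows above.
PENALTY_PAIRS = [(0, 0), (1, 0), (5, 1), (10, 1), (10, 10), (50, 50)]

scores = {pair: affine_score("AACC", "AATTCC", *pair) for pair in PENALTY_PAIRS}
for pair, score in scores.items():
    print(pair, score)
```

The raw score alone does not say which setting is best, since it always drops as penalties rise. In practice one would compare the resulting alignments against some external standard, such as a structure-based alignment.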