
Bioinformatics 1: lecture 5

Dynamic Programming Algorithm, continued
End gaps
Affine versus linear gap penalties
Global, local-global and local alignment
SeqLab: using GenBank Features
SeqLab: optimizing parameters for BestFit

Thinking about gaps


Each gap represents an evolutionary event (duplication, polymerase stutter, deletion/ligation, etc.).

If the alignment has "evolutionary distance" meaning, then the gap penalty score should be proportional to the number of gaps.

Two problems:
What about long gaps versus short gaps? Are they equally probable?
What about gaps at the ends of the sequence? How many evolutionary events took place there?

Which alignment is intuitively better?

AGGCTACT~T~TCA
 GGCTACTATATCA

AGGCTACTTT~~CA
 GGCTACTATATCA

Linear versus Affine gap penalty


linear gap penalty
Gap penalty for the whole sequence is just the number of gaps times a constant.

affine gap penalty
Gap penalty for the whole sequence is an affine function:
N*(gap initiation penalty) + E*(gap extension penalty)
where N is the number of gaps opened and E is the number of positions by which those gaps are extended.

Affine gap penalty scoring


gap initiation = -5
gap extension = -1

AGGCTACT~T~TCA
 GGCTACTATATCA
-5 + -5 = -10

AGGCTACTTT~~CA
 GGCTACTATATCA
-5 + -1 = -6
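The scoring above can be reproduced with a few lines of Python. This is a minimal sketch, assuming that the first position of each gap costs the initiation penalty and each further position costs the extension penalty. The function name `affine_gap_penalty` is made up for illustration.

```python
import re

def affine_gap_penalty(aligned_row, gap_open=-5, gap_extend=-1, gap_char="~"):
    """Sum the affine penalty over every run of gap characters in one aligned row."""
    total = 0
    for run in re.finditer(re.escape(gap_char) + "+", aligned_row):
        length = len(run.group())
        # The first gap position pays gap_open, each further position pays gap_extend.
        total += gap_open + (length - 1) * gap_extend
    return total

print(affine_gap_penalty("AGGCTACT~T~TCA"))  # -10 (two gaps of length 1)
print(affine_gap_penalty("AGGCTACTTT~~CA"))  # -6  (one gap of length 2)
```

Setting `gap_extend` equal to `gap_open` turns this into the linear penalty, where every gap position costs the same.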

Affine Gap DP
Optimal alignment is the highest scoring. Alignments entering the last box on the bottom row can have 5 types of arrows, instead of just three.
(1) Match
(2) Open a gap in first sequence.
(3) Open a gap in second sequence.
(4) Extend a gap in first sequence.
(5) Extend a gap in second sequence.

Note: I wrote the sequences in the gap rows!!

Affine gap penalty worksheet


M = match matrix
scores for alignments currently in a match state.
M(i,j) is the max over three diagonal arrows

I = insertion matrix
scores for alignments with gap in first sequence.
I(i,j) is the max over three down arrows

D = deletion matrix
scores for alignments with gap in second sequence.
D(i,j) is the max over three right arrows

[Figure: worksheet with the M, I and D matrices for the sequences ADPQFG and AKLKLDQFGP]
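The three-matrix idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the match and mismatch scores are invented, the index convention (i runs over the first sequence, j over the second) is a choice made here, and gap-to-gap moves between I and D are left out, so each gap matrix looks back only to M or to itself. It computes the global score only, without traceback.

```python
NEG = float("-inf")  # marks boxes that cannot be reached

def affine_global_score(s1, s2, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Global alignment score with affine gaps, using M, I and D matrices."""
    n, m = len(s1), len(s2)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignments ending in a match state
    I = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignments ending with a gap in s1
    D = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignments ending with a gap in s2
    M[0][0] = 0
    # Gap row and gap column: one gap that is opened once and then extended.
    for j in range(1, m + 1):
        I[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        D[i][0] = gap_open + (i - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            # Diagonal arrows may come from any of the three matrices.
            M[i][j] = max(M[i - 1][j - 1], I[i - 1][j - 1], D[i - 1][j - 1]) + s
            # Gap in s1: open from M or extend within I.
            I[i][j] = max(M[i][j - 1] + gap_open, I[i][j - 1] + gap_extend)
            # Gap in s2: open from M or extend within D.
            D[i][j] = max(M[i - 1][j] + gap_open, D[i - 1][j] + gap_extend)
    return max(M[n][m], I[n][m], D[n][m])

print(affine_global_score("AAAA", "AA"))  # -4
```

With these defaults, `"AAAA"` against `"AA"` scores -4: two matches (+2) and a single gap of length two (-5 + -1), which beats two separate gaps of length one.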

Does gap to gap make sense???


Special rules may apply for going from I to D and D to I.

AGGCTACT~TATCA
 GGCTACTA~ATCA
If you think this alignment does not make sense, then D to I and I to D can simply be disallowed in the DP algorithm. Most programs do this.
[Exception: For a global alignment, D-to-I or I-to-D arrows are allowed at the ends of alignments because there is no other way to complete the matrix.]

Traceback for affine gap DP


Each box in each matrix has a traceback "arrow". In the...
M matrix, it is always diagonal (goes back to i-1, j-1) and goes back to either the M, D or I matrices.
I matrix, it is always down (i, j-1), and goes back to either I or M [not D!]*
D matrix, it is always right (i-1, j), and goes back to either D or M.*

So, we can save the traceback arrows as just letters M, I or D.
*except at the ends.

Affine gap penalty worksheet


[Figure: the M, I and D worksheet matrices with traceback letters (M, I or D) filled in]

Traceback
The traceback is a sequence of M, I or D, but this is NOT the traceback of the letters. Instead it is a traceback of the location of the letters (which matrix they are in), since the location (the matrix) defines the direction of the next arrow.

MIIIIMDMMMI
A~~~~DPQFG~
AKLKLD~QFGP
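To make the link between the matrix letters and the aligned rows concrete, here is a small Python sketch that replays a string of states against the two sequences. The function name `render_alignment` is hypothetical. The state string used here ends in I, since the final column puts a gap in the first sequence (I = gap in first sequence, D = gap in second sequence).

```python
def render_alignment(states, seq1, seq2, gap="~"):
    """Turn a string of M/I/D states into two aligned rows."""
    row1, row2 = [], []
    i = j = 0  # next unused letter in seq1 and seq2
    for state in states:
        if state == "M":      # match state: one letter from each sequence
            row1.append(seq1[i]); row2.append(seq2[j])
            i += 1; j += 1
        elif state == "I":    # gap in the first sequence
            row1.append(gap); row2.append(seq2[j])
            j += 1
        elif state == "D":    # gap in the second sequence
            row1.append(seq1[i]); row2.append(gap)
            i += 1
        else:
            raise ValueError(f"unknown state {state!r}")
    return "".join(row1), "".join(row2)

top, bottom = render_alignment("MIIIIMDMMMI", "ADPQFG", "AKLKLDQFGP")
print(top)     # A~~~~DPQFG~
print(bottom)  # AKLKLD~QFGP
```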

Should we penalize gaps at the ends ?

Example: here is an alignment of mouse nitric oxide synthase (thick black line). It has multiple domains which are homologous to several shorter proteins.

If we penalize end gaps, what happens to the score of the true alignment?
Did "end gaps" evolve the same way as internal gaps? (no!)

Unless the two proteins are known to be single domains, it makes more sense NOT to penalize end gaps.

How to NOT penalize end gaps


First: Ignore starting gap penalties, set gap rows to zero (keep the traceback arrows, though).

[Figure: alignment matrix for the sequences GTTCAGCTT and TCACT, with the gap row and the gap column set to 0]

How to NOT penalize end gaps


Second: Start the traceback with the MAX score at the end of either sequence. (i.e. last row, or last column)

[Figure: alignment matrix for the sequences GTTCAGCTT and TCACT, with the gap row and the gap column set to 0]
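Both steps (zeros in the gap row and gap column, then picking the maximum in the last row or last column) fit in a short Python sketch. It assumes a linear gap penalty and made-up match/mismatch scores, and it returns only the score, not the traceback. The function name is hypothetical.

```python
def end_gap_free_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Alignment score when gaps at the ends of either sequence cost nothing."""
    n, m = len(s1), len(s2)
    # First: the gap row and gap column stay at 0, so leading end gaps are free.
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + s,  # diagonal arrow
                          A[i - 1][j] + gap,    # internal gap in s2
                          A[i][j - 1] + gap)    # internal gap in s1
    # Second: trailing end gaps are free, so take the best score found at the
    # end of either sequence (last row or last column).
    return max(max(A[n]), max(row[m] for row in A))

print(end_gap_free_score("TCACT", "GTTCAGCTT"))  # 3
```

For the two sequences from the matrix, the best result aligns TCA~CT against TCAGCT: five matches and one internal gap give 5 - 2 = 3, while the unaligned ends of GTTCAGCTT cost nothing.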

Global in one sequence, local in the other


If we penalize end gaps in sequence 2 but not in sequence 1, we are asking for an alignment that contains all of sequence 2 within sequence 1.

[Figure: alignment matrix for the sequences GTTCAGCTT and TCACT, with zeros along the GTTCAGCTT edge only]

Global in one sequence, local in the other


If we penalize end gaps in sequence 1 but not in sequence 2, we are asking for an alignment that contains all of sequence 1 within sequence 2.

[Figure: alignment matrix for the sequences GTTCAGCTT and TCACT, with zeros along the TCACT edge only]
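A hedged Python sketch of this mixed case: one sequence must be used completely (global) while the other may contribute any stretch (local). Only the edge that skips letters of the longer sequence is set to 0, and the final score is read from the last row only. The names `fit_score`, `inner` and `outer` are invented for illustration, and the scores and linear gap penalty are assumptions.

```python
def fit_score(inner, outer, match=1, mismatch=-1, gap=-2):
    """Best score for aligning ALL of `inner` against some stretch of `outer`."""
    n, m = len(inner), len(outer)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    # Leaving letters of `inner` unaligned at the start is penalized...
    for i in range(1, n + 1):
        A[i][0] = i * gap
    # ...while row 0 stays 0: skipping the start of `outer` is free.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if inner[i - 1] == outer[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + s, A[i - 1][j] + gap, A[i][j - 1] + gap)
    # All of `inner` must be used, so only the last row can end the alignment;
    # any column is allowed, which makes the end of `outer` free to skip.
    return max(A[n])

print(fit_score("TCACT", "GTTCAGCTT"))   # 3
print(fit_score("GTTCAGCTT", "TCACT"))   # -3
```

Swapping the arguments changes the question being asked, which is why the two scores differ: forcing all nine letters of GTTCAGCTT into TCACT needs at least four penalized gap positions.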

Global, global-local, and local alignment


The choice of alignment method makes a statement about how the sequences are related. Was one sequence inserted into the other?

Global alignment (with end gaps) requires that all 4 termini are counted. In general, the two sequences should be about the same length.

Global-local alignment (no end gaps in 1 or both seqs) requires that one of the two sequences be completely contained in the other or that 2 of the 4 termini be included.

Local alignment finds subsequences in both. Does not require that the termini be included in the alignment.

Local Alignment
A local alignment can start and end anywhere in the sequences (i.e. in the alignment matrix).
A(i,j) = MAX of:
    A(i-1,j-1) + match score
    A(i,j-1) + gap*
    A(i-1,j) + gap
    0 + match score   (start)

*linear gap penalty

[Figure: alignment matrix for ATSFM and PGTSFEP; the local alignment TSF/TSF is marked from "start" to "end"]

end is defined as the maximum score over the whole matrix.
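The local recurrence can be written out directly in Python. This sketch follows the four options listed above, including the "0 + match score" start option. The boundary values (0 in the gap row and gap column) and the numeric scores are assumptions, and only the score of the best local alignment is returned.

```python
def local_alignment_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Score of the best local alignment under a linear gap penalty."""
    n, m = len(s1), len(s2)
    A = [[0] * (m + 1) for _ in range(n + 1)]  # assumed boundary: all zeros
    best = 0  # an empty alignment (no alignment at all) scores 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + s,  # continue along the diagonal
                          A[i][j - 1] + gap,    # gap in one sequence
                          A[i - 1][j] + gap,    # gap in the other sequence
                          0 + s)                # start a new alignment here
            # The end of the alignment is the maximum over the whole matrix.
            best = max(best, A[i][j])
    return best

print(local_alignment_score("ATSFM", "PGTSFEP"))  # 3
```

With these scores the result for ATSFM and PGTSFEP is 3, the three matches of TSF against TSF. Many textbook versions instead clamp every box at 0, which gives the same maximum whenever that maximum is positive.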

Local Alignment
...is the most generally applicable alignment method, since it has the fewest assumptions.

The optimal alignment may be no alignment


If the maximum score in the alignment matrix is < 0, then the optimal local alignment has score = 0 and looks like this:

ATSFM~~~~~~~
~~~~~PGTSFEP

Structure-based alignments are "correct"


The closest thing to a "Gold Standard" for protein alignments is the sequence alignment that comes from a structure superposition.
2DRC:A   1/2      MISLIAALAVDRVIGMENAM-PFNLPADLAWFKRNTL-------DKPVIMGRHTWESIG
1DRF:_   3/4      SLNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQNLVIMGKKTWFSIPE

2DRC:A   52/53    --RPLPGRKNIILSSQP--GTDDRVTWVKSVDEAIAACG------DVPEIMVIGGGRVYE
1DRF:_   63/64    KNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYK

2DRC:A   102/103  QFLPK--AQKLYLTHIDAEVEGDTHFPDYEPDDWESVF------SEFHDADAQNSHSYCF
1DRF:_   123/124  EAMNHPGHLKLFVTRIMQDFESDTFFPEIDLEKYKLLPEYPGVLSDVQEE---KGIKYKF

2DRC:A   154/155  EILERR
1DRF:_   180/181  EVYEKN

Note: Lots of mismatches (id=38%), few gaps (8), gaps are long (1-7).

Structure-based alignment
Two similar structures may be superimposed. The parts that overlay well are the matches (purple and green), and the parts that do not overlay well are the insertions (yellow and red). Aligned positions have similar chemical 3D environment.

In class exercise: Features

Start SeqLab

(1) Get the following two gene sequences from the databases:
gb_ba1:Lbadhfr
gb_ba1:Ecdhfolg
To retrieve these sequences, go to File-->Add Sequences from-->Databases. For the 1st one, type gb_ba1:Lba* (the * is a wildcard) into the "Database specification" window and hit "Show matching entries". Select the one you want and "Add to main window." Do the 2nd one similarly.

In class exercise: Features


(2) Get info for each sequence. Look for the GenBank keyword "FEATURES".
In LBADHFR, find these features (what are they?): source, mRNA, CDS
In ECDHFOLG, find: RBS, repeat_unit, promoter

In class exercise: Features


(3) Set the Display to "Features coloring". Double click on a blue shaded region of ECDHFOLG. A features window appears. Select "Features at Cursor." Note that the region is now selected. You can copy it.

(4) Create a new "feature": Find the sequence "CGATCG" in ECDHFOLG. Select it. Open the Features window and Add a feature for this region. Call it "restriction_site" and put "PvuI" in the comments area. Give it a Diamond Shape. Back in the Editor, set Display to Graphical Features, change the scale to 16:1 and find the Diamond. Is it in the CDS?

In class exercise: translation


(5) Double-click on the CDS and select the CDS feature in the window that pops up. Close the window. The coding region is still selected. Copy the selected region. Create a new DNA sequence. Paste the selected region (text) into that new line. Remove any gaps if necessary. Translate that gene to amino acids in frame 1 only (one letter code). The amino acid sequence should start with "MISLIAA...". Rename the sequence "ecdhfr" using the INFO window. Remove all gaps. Do the same for the CDS region of LBADHFR. Label the new protein sequence "lcdhfr". It should start "MTAFL..."
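Outside SeqLab, the same frame-1 translation can be sketched in Python with the standard genetic code. Everything here is illustrative: `translate_frame1` is a made-up name, stopping at the first stop codon is an assumption, and the short DNA string in the example was simply chosen so that its codons spell the first six residues quoted above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_frame1(dna):
    """Translate DNA in frame 1 to one-letter amino acids, up to the first stop."""
    # Remove gap characters first, as the exercise asks.
    dna = dna.upper().replace("~", "").replace("-", "").replace(".", "")
    protein = []
    for k in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[k:k + 3], "X")  # X marks an unrecognized codon
        if aa == "*":                            # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)

print(translate_frame1("ATGATCAGTCTGATTGCG"))  # MISLIA
```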

In class exercise: pairwise alignments


(6) Remove any gaps you may have created in both sequences. Check that the sequences agree with the sequences in the corresponding "features" (in the DNA sequences). If so, go ahead and delete the DNA sequences.

(7) Select the two protein sequences, now named ecdhfr and lcdhfr. Align them using Functions-->Pairwise-->BestFit. Open Options... Set different gap penalty schemes. At the bottom, select both "New sequence file..." and give them unique names ending in .gap. For example, for gap penalty 10 use "ec_10.gap". Run. When finished, select the two .gap files in the Output Manager window, and "Add to Editor". Compare results as on the following page.

Parametric search

How do we know what should be used as the gap penalty and extension penalty?

SeqLab: Using BestFit


Worksheet for BestFit exercise using ECDHFR and LCDHFR:

Gap opening penalty | Extension penalty | Number of gaps | Avg. gap length | Alignment length | %identity
0                   | 0                 |                |                 |                  |
1                   | 0                 |                |                 |                  |
5                   | 1                 |                |                 |                  |
10                  | 1                 |                |                 |                  |
10                  | 10                |                |                 |                  |
50                  | 50                |                |                 |                  |
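The four result columns of the worksheet can be computed from any pair of aligned rows. The following Python sketch is one way to do it, not what BestFit itself reports: the function name is hypothetical, every run of gap characters (including runs at the ends) counts as one gap, and percent identity is taken here as identical positions divided by alignment length, which is only one of several definitions in use.

```python
import re

def alignment_stats(row1, row2, gap_chars="~-."):
    """Number of gaps, average gap length, alignment length and %identity."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    gap_run = "[" + re.escape(gap_chars) + "]+"
    # Length of every run of gap characters, in either row.
    runs = [len(hit.group()) for row in (row1, row2)
            for hit in re.finditer(gap_run, row)]
    identical = sum(1 for a, b in zip(row1, row2)
                    if a == b and a not in gap_chars)
    return {
        "number_of_gaps": len(runs),
        "avg_gap_length": sum(runs) / len(runs) if runs else 0.0,
        "alignment_length": len(row1),
        "percent_identity": 100.0 * identical / len(row1) if row1 else 0.0,
    }

print(alignment_stats("A~~~~DPQFG~", "AKLKLD~QFGP"))
```

For the small example alignment this reports 3 gaps with an average length of 2.0, an alignment length of 11 and 5 identical positions.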
