
Sequence Alignment

Dr. Zoya Khalid


Zoya.khalid@nu.edu.pk
Sequence Alignment
• Sequence alignment is a way of arranging DNA, RNA, or protein sequences
to identify regions of similarity. The similarity may indicate functional,
structural, or evolutionary relationships between the sequences.

• An alignment can be made between a known and an unknown sequence, or
between two unknown sequences.

• The known sequence is called the reference sequence, while the unknown
sequence is called the query sequence.
Sequence Alignment
• Sequences that are much alike may share similar secondary and 3D
structure, similar function and likely a common ancestral sequence.
Why alignments?

• Detect homology (similarity)


• Study evolution
• Predict functions
• Model 3D-structure
What is alignment?
• Alignment is the task of locating equivalent regions of two or more
sequences to maximize their similarity
Principles of Sequence Alignment
• Similarity is a descriptive term for the degree of match between two
or more sequences.
Scoring Alignments
• Alignments of related sequences are expected to score higher than
alignments of randomly chosen sequences.
• The correct alignment of two related sequences should ideally be the
one that gives the best score.
Types of Alignment

• Based on completeness: Global, Local
• Based on number of sequences: Pairwise, Multiple
Local Vs. Global
Methods and Algorithms
• Dot matrix
• Dynamic programming
• Progressive methods: Clustal, T-Coffee
• Iterative methods
• FASTA, BLAST
Dot Plot Matrix
• A dot matrix analysis is a method for comparing two sequences to look for possible alignments
(Gibbs and McIntyre 1970)

• One sequence (A) is listed across the top of the matrix and the other (B) is listed down the left
side
• Starting from the first character in B, one moves across the page, keeping to the first row and
placing a dot in every column where the character in A is the same

• The process is repeated for each character in B until all possible comparisons between A and B have been made

• Any region of similarity is revealed by a diagonal row of dots

• Isolated dots not on a diagonal represent random matches


Sequence comparison with dot matrices
• Basic Method: For two sequences of lengths M and N, lay out an M by
N grid (matrix) with one sequence across the top and one sequence
down the left side.

• For each position in the grid, compare the sequence elements at the
top (column) and to the left (row). If and only if they are the same,
place a dot at that position.
Dot Plot
• Dot plots are two-dimensional graphs showing a comparison of two
sequences.
• Principle: the top (x) and left (y) axes of a rectangular array represent
the two sequences to be compared.
• Calculation:
• Matrix
• Columns = residues of sequence 1
• Rows = residues of sequence 2
• A dot is plotted at every coordinate where the two residues are the
same.
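The basic dot matrix method described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular tool; the function names `dot_plot` and `print_dot_plot` are hypothetical, and the sketch assumes exact character identity with no window or threshold filtering.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return the rows of a dot matrix: seq_a across the top, seq_b down the side."""
    rows = []
    for b in seq_b:
        # Place a dot ("*") in every column where the character of seq_a equals b.
        rows.append("".join("*" if a == b else "." for a in seq_a))
    return rows


def print_dot_plot(seq_a: str, seq_b: str) -> None:
    """Print the matrix with seq_a as the column header and seq_b as row labels."""
    print("  " + " ".join(seq_a))
    for b, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        print(b + " " + " ".join(row))


print_dot_plot("MALWGRL", "MALWGRL")
```

Comparing a sequence with itself always fills the main diagonal; any residue that occurs more than once also produces isolated off-diagonal dots.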
Identical Sequences
Seq1: MALWGRL
Seq2: MALWGRL

    M A L W G R L
M   *
A     *
L       *       *
W         *
G           *
R             *
L       *       *

The two off-diagonal dots appear because L occurs twice in the sequence.
Dotplots
Multiple diagonals indicate repeats
Analysis of Dot Plot Matrix
• A region of similarity appears as a diagonal run of dots.
• The principal diagonal shows identical sequences.
• Both global and local alignments can be seen.
• Multiple diagonals indicate repeats.
• A reverse diagonal (perpendicular to the main diagonal) indicates an inversion.
• A reverse diagonal crossing the main diagonal (an X shape) indicates a palindrome.
• A box of dots indicates a low-complexity region.
Dot Plot Software

• GCG is commercial software, so it may not always be available.
• Instead, we can use the dot plot programs in the EMBOSS package:
• Dotmatcher
• Dotpath
• Polydot
• Dottup
• (http://emboss.bioinformatics.nl/cgi-bin/emboss/dottup)
Applications
• Shows all possible alignments between two DNA or protein sequences

• Both local and global alignments can be detected

• Helps to recognize large regions of similarity


Dynamic Programming

• Needleman-Wunsch
• Pairwise, global alignment only.
• Smith-Waterman
• Pairwise, local alignment.
• Differences:
• Different scoring matrices
• Gap penalty functions
• Sequence coverage


Methods
Different scoring
T-T = 5
H-H = 8
S-S = 4
E-E = 5
Q-Q = 6
U-U = 0
N-N = 6
C-C = 9
Gap = 0
Mismatch = -1
With Gap cost
T-T = 5
H-H = 8
S-S = 4
E-E = 5
Q-Q = 6
U-U = 0
N-N = 6
C-C = 9
Gap = -1
Dynamic Programming
• Algorithmic technique for optimization problems that have
two properties:
• Optimal substructure: the optimal solution can be computed from
optimal solutions to sub-problems

• Overlapping sub-problems: sub-problems overlap such that the
total number of distinct sub-problems to be solved is relatively
small
Dynamic Programming
• Break problem into overlapping subproblems
• use memoization: remember solutions to
subproblems that we have already seen

Fibonacci example
• 1,1,2,3,5,8,13,21,...
• fib(n) = fib(n - 2) + fib(n - 1)
• Could implement as a simple recursive function
• However, complexity of simple recursive function is
exponential in n

Fibonacci dynamic programming
• Two approaches
1. Memoization: store results from previous calls of the function in a table (top-down
approach)
2. Solve subproblems from smallest to largest, storing results in a table (bottom-up
approach)
• Both require evaluating each of the (n-1) subproblems only once: O(n)
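The two approaches can be compared side by side in a short Python sketch. The function names `fib_memo` and `fib_bottom_up` are illustrative choices, and the sketch assumes the convention fib(1) = fib(2) = 1 used in the sequence above.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_memo(n: int) -> int:
    """Top-down: recurse, but cache each result so it is computed only once."""
    if n <= 2:
        return 1
    return fib_memo(n - 2) + fib_memo(n - 1)


def fib_bottom_up(n: int) -> int:
    """Bottom-up: fill a table from the smallest subproblem to the largest."""
    table = [0, 1, 1] + [0] * max(0, n - 2)
    for i in range(3, n + 1):
        table[i] = table[i - 2] + table[i - 1]
    return table[n]


print([fib_bottom_up(i) for i in range(1, 9)])  # [1, 1, 2, 3, 5, 8, 13, 21]
```

Both versions do a constant amount of work per subproblem, which is where the O(n) bound comes from; the plain recursive version recomputes the same values many times.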

Dynamic Programming Graphs
• Dynamic programming algorithms can be
represented by a directed acyclic graph
• Each subproblem is a vertex
• Direct dependencies between subproblems are edges

[Figure: dependency graph for fib(6), with one vertex for each of the subproblems 1 to 6]


Memoization

• In a top-down recursive approach we can use memoization to create a potentially large dictionary indexed by
each of the subproblems that we are solving (aligned sequences).

• This needs $O(n^2 m^2)$ space if we index each subproblem by the start and end points of the subsequences
for which an optimal alignment needs to be computed.

• The advantage is that we solve each subproblem at most once: if it is not in the dictionary, the problem gets
computed and then inserted into dictionary for further reference.
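A minimal top-down sketch of this idea in Python is shown below. Note two assumptions that are not taken from the text above: the cache is indexed by prefix lengths (i, j) rather than by start and end points, which needs only O(nm) entries, and the scoring values (+1 match, -1 mismatch, -2 gap) are illustrative defaults. The name `alignment_score_memo` is hypothetical.

```python
from functools import lru_cache


def alignment_score_memo(x: str, y: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of x and y, computed top-down with a cache."""

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> int:
        # best(i, j): best score for aligning the prefixes x[:i] and y[:j].
        if i == 0:
            return j * gap          # only gaps can pair with the rest of y
        if j == 0:
            return i * gap          # only gaps can pair with the rest of x
        pair = match if x[i - 1] == y[j - 1] else mismatch
        return max(best(i - 1, j - 1) + pair,   # align x[i-1] with y[j-1]
                   best(i - 1, j) + gap,        # align x[i-1] with a gap
                   best(i, j - 1) + gap)        # align y[j-1] with a gap

    return best(len(x), len(y))


print(alignment_score_memo("AAAC", "AGC"))  # -1
```

The cache plays the role of the dictionary: a subproblem is computed the first time it is requested and looked up afterwards. Because the sketch is recursive, it is only suitable for short sequences.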

Dynamic Programming
In a bottom-up iterative approach we can use dynamic programming. We define the order of computing sub-
problems in such a way that a solution to a problem is computed once the relevant sub-problems have been
solved.

In particular, simpler sub-problems will come before more complex ones. This removes the need for keeping
track of which sub-problems have been solved (the dictionary in memoization turns into a matrix) and ensures
that there is no duplicated work (each sub-alignment is computed only once).
Pairwise Alignment Via
Dynamic Programming

• first algorithm by Needleman & Wunsch,
Journal of Molecular Biology, 1970
• dynamic programming algorithm:
determine the best alignment of two sequences
by determining the best alignment of all
prefixes of the sequences
Global Alignment
Needleman-Wunsch Algorithm
• The Needleman–Wunsch algorithm is an algorithm used in bioinformatics
to align protein or nucleotide sequences.

• It was one of the first applications of dynamic programming to compare
biological sequences.

• The algorithm was developed by Saul B. Needleman and Christian D.
Wunsch and published in 1970.

• The Needleman–Wunsch algorithm is still widely used for optimal global
alignment, particularly when the quality of the global alignment is of the
utmost importance.
Dynamic Programming Idea
• consider the last step in computing the alignment of AAAC
with AGC
• three possible options; in each we'll choose a different
pairing for the end of the alignment, and add this to the best
alignment of the previous characters

1. x: AAA  C      2. x: AAAC  -      3. x: AAA  C
   y: AG   C         y: AG    C         y: AGC  -

• in each option: score = best score of the alignment of these
prefixes + score of aligning this final pair
DP Algorithm for Global Alignment
with Linear Gap Penalty
• Subproblem: F(i,j) = score of best alignment of the length i
prefix of x and the length j prefix of y.

$$
F(i, j) = \max \begin{cases}
F(i-1, j-1) + S(x_i, y_j) \\
F(i-1, j) + s \\
F(i, j-1) + s
\end{cases}
$$
Dynamic Programming
Implementation
• given an n-character sequence x, and an m-character
sequence y
• construct an (n+1) × (m+1) matrix F
• F(i, j) = score of the best alignment of
x[1…i] with y[1…j]

        A   G   C
    .   .   .   .
A   .   .   .   .
A   .   .   .   .
A   .   .   *   .    * = F(3, 2): score of best alignment of AAA to AG
C   .   .   .   .
Initializing Matrix: Global Alignment with
Linear Gap Penalty
         A    G    C
    0    s    2s   3s
A   s
A   2s
A   3s
C   4s
DP Algorithm Sketch:
Global Alignment
• initialize first row and column of matrix
• fill in rest of matrix from top to bottom, left to right
• for each F(i, j), save pointer(s) to cell(s) that
resulted in best score
• F(n, m) holds the optimal alignment score; trace
pointers back from F(n, m) to F(0, 0) to recover
alignment
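The sketch above can be turned into a short Python function. This is one possible reading of the steps, not a reference implementation: the name `needleman_wunsch` is hypothetical, the default scores (+1 match, -1 mismatch, -2 gap) are assumptions chosen for illustration, and instead of storing pointers the traceback re-checks which neighbouring cell produced each score, preferring diagonal, then up, then left.

```python
def needleman_wunsch(x: str, y: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment with a linear gap penalty. Returns (F, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def pair(i: int, j: int) -> int:
        # Substitution score for x[i-1] against y[j-1] (1-based matrix indices).
        return match if x[i - 1] == y[j - 1] else mismatch

    # Initialize the first row and column with multiples of the gap penalty.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    # Fill the rest of the matrix from top to bottom, left to right.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + pair(i, j),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Trace back from F[n][m] to F[0][0], one alignment column per step.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + pair(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, ax, ay = needleman_wunsch("AAAC", "AGC")
print(F[-1][-1], ax, ay)
```

With this tie-breaking order the call returns the score -1 and the alignment AAAC over -AGC. A different preference order in the traceback can yield a different alignment with the same score.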

Global Alignment Example
• suppose we choose the following scoring scheme:
$$
S(x_i, y_j) = \begin{cases}
+1 & \text{when } x_i = y_j \\
-1 & \text{when } x_i \neq y_j
\end{cases}
$$

s (penalty for aligning with a space) = -2
Global Alignment Example
          A    G    C
     0   -2   -4   -6
A   -2    1   -1   -3
A   -4   -1    0   -2
A   -6   -3   -2   -1
C   -8   -5   -4   -1

one optimal alignment
x: A A A C
y: A G - C

but there are three optimal alignments here (can you find them?)
Equally Optimal Alignments
• many optimal alignments may exist for a given pair of
sequences
• can use a preference ordering over paths when doing
traceback

[Figure: preference order of the three traceback directions. High road: 1, 2, 3; low road: 3, 2, 1]

• High road and low road alignments show the two most
different optimal alignments
High road & Low road Alignments
          A    G    C
     0   -2   -4   -6
A   -2    1   -1   -3
A   -4   -1    0   -2
A   -6   -3   -2   -1
C   -8   -5   -4   -1

High road alignment
x: A A A C
y: A G - C

Low road alignment
x: A A A C
y: - A G C
Semi-Global Alignment
• Global alignment seeks the best full-length alignment; that is, the
best way to match up two sequences along their entire length.

• For some applications, it is desirable to relax this requirement and not
penalize gaps at the ends of the sequences.

• For example, in sequence assembly we seek sequence fragments
that overlap; that is, we expect to be able to align the end of one
fragment with the beginning of another.
Semi-Global Alignment
• In semi-global alignment, we do not allow gaps at the beginning of
both s and t in the same alignment, nor do we allow gaps at the end
of both s and t.

• Like global alignment, the optimal semi-global alignment can be
found using dynamic programming, with either distance or similarity
scoring.
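One common way to realize this with similarity scoring is sketched below in Python. It is an assumption-laden illustration rather than the exact variant intended by the slides: leading and trailing gaps in either sequence are treated as free, the function name `semi_global_score` is hypothetical, and the scores (+1 match, -1 mismatch, -2 internal gap) are illustrative defaults. Only the score is returned, not the alignment.

```python
def semi_global_score(s: str, t: str, match: int = 1,
                      mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of s and t when end gaps are not penalized."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay at 0: leading gaps in either sequence are free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,   # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,        # internal gap in t
                          F[i][j - 1] + gap)        # internal gap in s
    # Trailing gaps are free too: take the best cell in the last row or last column.
    return max(max(F[n]), max(row[m] for row in F))


# The end of the first fragment overlaps the beginning of the second ("CGT").
print(semi_global_score("AAACGT", "CGTTTT"))  # 3
```

The only changes relative to global alignment are the zero initialization and where the final score is read from; the recurrence inside the matrix is the same.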
Local Alignment
Smith-Waterman Algorithm
• Initialize the first row and first column with zeros, so that an alignment
can start anywhere in either sequence without penalty.

• Choose the maximum value anywhere in the matrix and trace back from it
until a cell with value 0 is reached.

• Negative values are replaced with 0 because we are not interested in
dissimilar regions, only in the region of similarity.
Local alignments: why?
• Two genes in different species may be similar over short conserved
regions and dissimilar over remaining regions
• Example
• Homeobox genes have a short region called the homeodomain that is highly
conserved between species
• A global alignment would not find the homeodomain because it would try to
align the ENTIRE sequence
The local alignment problem
• Goal: Find the best local alignment between two strings

• Input: Strings v, w and scoring matrix δ

• Output: Alignment of substrings of v and w whose alignment score is
maximum among all possible alignments of all possible substrings
The local alignment recurrence
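• The largest value of $s_{i,j}$ over the whole edit graph is the score of the best
local alignment.
• The recurrence:

$$
s_{i,j} = \max \begin{cases}
0 \\
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$

• Complexity: $O(N^2)$, or $O(MN)$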
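A compact Python sketch of local alignment is given below for illustration. It assumes a simple scoring scheme in place of a full scoring matrix δ (+1 match, -1 mismatch, -2 for an indel), and the function name `smith_waterman` is a hypothetical choice. Each cell is the maximum of 0 and the three usual predecessors, and traceback starts at the highest-scoring cell and stops at the first 0.

```python
def smith_waterman(v: str, w: str, match: int = 1,
                   mismatch: int = -1, gap: int = -2):
    """Local alignment. Returns (best_score, aligned_v, aligned_w)."""
    n, m = len(v), len(w)

    def pair(i: int, j: int) -> int:
        return match if v[i - 1] == w[j - 1] else mismatch

    # First row and column are 0, and no cell is allowed to drop below 0.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + pair(i, j),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Trace back from the maximum cell until a cell with score 0 is reached.
    av, aw = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + pair(i, j):
            av.append(v[i - 1]); aw.append(w[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            av.append(v[i - 1]); aw.append("-"); i -= 1
        else:
            av.append("-"); aw.append(w[j - 1]); j -= 1
    return best, "".join(reversed(av)), "".join(reversed(aw))


print(smith_waterman("TTTACGTGGG", "CCCACGTAAA"))
```

In this example the two strings share only the conserved block ACGT, so the local alignment reports that block with score 4 and ignores the dissimilar flanks.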


Computational Complexity
• initialization: $O(m)$, $O(n)$ where sequence lengths are m, n
• filling in rest of matrix: $O(mn)$
• traceback: $O(m + n)$
• hence, if sequences have nearly the same length, the computational
complexity is $O(n^2)$
Scoring indels: naive approach
• A fixed penalty σ is given to every indel:
• -σ for 1 indel,
• -2σ for 2 consecutive indels
• -3σ for 3 consecutive indels, etc.
• This can be too severe a penalty for a long run of consecutive indels, such as 100 in a row
Gap Penalties
• Minimizing gaps is important for creating a useful alignment.

• Too many gaps can cause an alignment to become meaningless.

• Gap penalties are used to adjust alignment scores based on the
number and length of gaps
Affine Gap Penalties
• The most widely used gap penalty function is the affine gap penalty.

• The affine gap penalty combines the components of the constant
and linear gap penalties: A + B(L - 1)
A = gap open penalty
B = gap extension penalty
L = length of the gap
• Gap opening is usually penalized more heavily than gap extension
Affine Gap Penalties
• Affine gap penalty: a + b(L - 1)
• a = gap opening penalty (say -11)
• b = gap extension penalty (say -1)
• L = length of the gap

PRT---EINS
PRTWPSEIN-
Total gap penalty = (-11 + 2 × (-1)) + (-11) = -24
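The arithmetic of this example can be checked with a small Python sketch. The helper names `affine_gap_penalty`, `linear_gap_penalty`, and `total_gap_penalty` are hypothetical, and the sketch assumes each maximal run of "-" characters in a row counts as one gap.

```python
import re


def affine_gap_penalty(length: int, gap_open: int = -11, gap_extend: int = -1) -> int:
    """a + b(L - 1): one opening penalty plus an extension penalty per extra position."""
    return gap_open + gap_extend * (length - 1)


def linear_gap_penalty(length: int, per_indel: int = -11) -> int:
    """Naive scheme: every position in the gap costs the same fixed penalty."""
    return per_indel * length


def total_gap_penalty(row_a: str, row_b: str) -> int:
    """Sum the affine penalties of every run of '-' in the two aligned rows."""
    total = 0
    for row in (row_a, row_b):
        for run in re.findall(r"-+", row):
            total += affine_gap_penalty(len(run))
    return total


print(total_gap_penalty("PRT---EINS", "PRTWPSEIN-"))      # -24
print(affine_gap_penalty(100), linear_gap_penalty(100))   # -110 -1100
```

The second line illustrates why the naive scheme is considered too severe for long gaps: with the same cost for the first position, a run of 100 indels costs ten times more under the linear penalty than under the affine one.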
A model for sequence evolution
