Pairwise sequence alignment
Dr Avril Coghlan
alc@sanger.ac.uk
Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in Powerpoint
Sequence comparison
• How can we compare the human & Drosophila
melanogaster Eyeless protein sequences?
One method is a dotplot
• A dotplot is a graphical (visual) approach
Regions of local similarity between the 2 sequences appear as diagonal
lines of coloured cells (‘dots’)
[Figure: dotplot of human Eyeless against fruitfly Eyeless; window size = 10, threshold = 5]
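The window-size and threshold idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool used to draw the slide: the function name `dotplot` is hypothetical, and it assumes that a cell is coloured when at least `threshold` of the `window` letters starting at that position are identical in the two sequences.

```python
def dotplot(seq1, seq2, window=10, threshold=5):
    """Return the set of (i, j) cells that would be drawn as dots.

    Assumption: cell (i, j) is a dot when the window of `window` letters
    starting at seq1[i] and the window starting at seq2[j] have at least
    `threshold` identical letters in the same window positions.
    """
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            identical = sum(
                a == b
                for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if identical >= threshold:
                dots.add((i, j))
    return dots


# Small example with the two short sequences used later in the talk.
dots = dotplot("QKGSYPVRSTC", "QKGSGPVRSTC", window=3, threshold=3)
on_diagonal = sorted(cell for cell in dots if cell[0] == cell[1])
print(on_diagonal)
```

Cells on the main diagonal (i equal to j) correspond to the two sequences being similar at the same positions; a run of such cells is what shows up as a diagonal line in the plot.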
Sequence alignment
• A second method for comparing sequences is a
sequence alignment
• An alignment is an arrangement in columns of 2
sequences, highlighting their similarity
The sequences are padded with gaps (dashes) so that wherever
possible, alignment columns contain identical letters from the two
sequences involved
An insertion or deletion is represented by ‘–’ (a gap)
The symbol “|” is used to represent matches
eg. here is an alignment for amino acid sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”:
Q K G S Y P V R S T C
| | | |   | | | | | |
Q K G S G P V R S T C
1 2 3 4 5 6 7 8 9 10 11
This alignment has 11 columns
There are 10 matches
There is 1 mismatch
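The “|” line can be derived mechanically from the two aligned rows. The sketch below assumes each row is given as a string with “-” for gaps; the helper name `match_line` is made up for this example.

```python
def match_line(row1, row2):
    """Build the line of '|' symbols marking identical alignment columns."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same number of columns")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(row1, row2)
    )


row1 = "QKGSYPVRSTC"
row2 = "QKGSGPVRSTC"
line = match_line(row1, row2)

n_columns = len(row1)
n_matches = line.count("|")
# A mismatch is a column with two different letters and no gap.
n_mismatches = sum(
    1 for a, b in zip(row1, row2) if a != b and "-" not in (a, b)
)

print(" ".join(row1))
print(" ".join(line))
print(" ".join(row2))
print(n_columns, "columns,", n_matches, "matches,", n_mismatches, "mismatch")
```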
Sequence alignment
• An alignment of the human and fruitfly
(Drosophila melanogaster) Eyeless proteins:
What does an alignment mean?
• An alignment tells you what mutations
occurred in the sequences since the sequences
shared a common ancestor
eg. an alignment of the human & fruitfly Eyeless suggests:
(i) there were probably deletion(s) at the start of the human
Eyeless, or insertion(s) at the start of fruitfly Eyeless
(ii) there was probably a G→N substitution in human Eyeless, or an N→G substitution in fruitfly Eyeless (see arrow)
How do we make an alignment?
• Given two or more sequences, what is the best way
to align them to each other?
We want the alignment columns to contain identical letters
• Comparison of similar sequences of similar length is
straightforward
eg. for amino acid sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”, we
line up the identical letters in columns:
Q K G S Y P V R S T C   sequence 1
| | | |   | | | | | |
Q K G S G P V R S T C   sequence 2
The alignment implies that one mutation occurred since the two
sequences shared a common ancestor
That is, the alignment implies there was a G→Y substitution in
sequence 1 or a Y→G substitution in sequence 2
Problem
• Are there other possible plausible alignments for
sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”?
Answer
There are many other possible alignments, eg. :
Q K G S Y - P V R S T C
| | |       | | | | | |
Q K G - S G P V R S T C

Q K G S - Y P V R S T C
| | | |       | | | | |
Q K G S G P - V R S T C

Q K G - - - - - S Y P V R S T C
| | |           |           | |
Q K G S G P V R S - - - - - T C

Q K - G S Y P V R S T C
| |                   |
Q K G S G P V R S T - C

etc.
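Each of the alternatives above is a legitimate alignment because removing the gaps from each row gives back the original sequence. A small check along those lines, as a sketch only (the function name `is_valid_alignment` is hypothetical, and the rule that no column may contain two gaps is an assumption of this example):

```python
def is_valid_alignment(row1, row2, seq1, seq2):
    """Check that two gapped rows form an alignment of seq1 and seq2."""
    if len(row1) != len(row2):
        return False
    # Assumption: a column holding a gap in both rows is not allowed.
    if any(a == "-" and b == "-" for a, b in zip(row1, row2)):
        return False
    # Removing the gaps must give back the unaligned sequences.
    return row1.replace("-", "") == seq1 and row2.replace("-", "") == seq2


seq1 = "QKGSYPVRSTC"
seq2 = "QKGSGPVRSTC"
alternatives = [
    ("QKGSY-PVRSTC", "QKG-SGPVRSTC"),
    ("QKGS-YPVRSTC", "QKGSGP-VRSTC"),
    ("QKG-----SYPVRSTC", "QKGSGPVRS-----TC"),
    ("QK-GSYPVRSTC", "QKGSGPVRST-C"),
]
results = [is_valid_alignment(r1, r2, seq1, seq2) for r1, r2 in alternatives]
print(results)
```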
Number of possible pairwise alignments
• There are lots of different possible alignments for
two sequences that are both of length n
The number of possible alignments of 2 seqs of length n letters (amino
acids/nucleotides) is $\binom{2n}{n}$ (“2n choose n”)
$\binom{2n}{n}$ can be calculated as
$$\binom{2n}{n} = \frac{(2n)!}{n! \, n!}$$
where n! (‘n factorial’) = n * (n - 1) * (n - 2) * (n - 3) * ... * 3 * 2 * 1
• For example, for “QKGSYPVRSTC” &
“QKGSGPVRSTC”, n (length) = 11 letters
The number of possible alignments of these two sequences is
$$\binom{2 \times 11}{11} = \binom{22}{11} = \frac{22!}{11! \times 11!} = \frac{22!}{39916800 \times 39916800} = \frac{1.124001 \times 10^{21}}{1.593351 \times 10^{15}} = 705{,}432$$
possible alignments
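The arithmetic above can be reproduced exactly with integer factorials. A minimal sketch using Python's standard `math` module (the function name `n_alignments` is just for this example):

```python
import math


def n_alignments(n):
    """Number of possible alignments of two sequences of length n,
    computed as (2n)! / (n! * n!), the formula given above."""
    return math.factorial(2 * n) // (math.factorial(n) * math.factorial(n))


n = len("QKGSYPVRSTC")            # 11 letters
print(math.factorial(11))         # 11!
print(n_alignments(n))            # 22! / (11! * 11!)
print(math.comb(2 * n, n))        # same value via the built-in "2n choose n"
```

Integer division is used so that the result stays exact rather than being rounded as a floating-point number.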

Number of possible pairwise alignments
• Even for relatively short sequences, $\binom{2n}{n}$ is large, so
there are lots of possible alignments
eg. for two sequences that are both 11 letters long, there are
705,432 possible alignments
• In fact, the number of possible alignments, $\binom{2n}{n}$,
increases exponentially with the sequence length (n)
ie. $\binom{2n}{n}$ is of the order of $2^{2n}$ (more precisely, about $2^{2n}/\sqrt{\pi n}$)
[Figure: number of possible alignments plotted against length of sequences (n)]
For two sequences 17 letters long (n=17), there are 2.3 billion possible alignments
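To see how quickly the number grows, one can tabulate $\binom{2n}{n}$ for a few lengths. The sketch below also prints $2^{2n}/\sqrt{\pi n}$, the standard large-n approximation, for comparison; the chosen lengths are arbitrary.

```python
import math

# Exact number of alignments versus the large-n approximation.
for n in (5, 11, 17, 25):
    exact = math.comb(2 * n, n)
    approx = 2 ** (2 * n) / math.sqrt(math.pi * n)
    print(f"n={n:2d}  exact={exact:,}  approx={approx:,.0f}")

n17 = math.comb(34, 17)  # the n = 17 case quoted on the slide
```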

• Many of the possible alignments for 2 seqs are
implausible as they imply many mutations occurred
(but it’s known mutations are rare)
eg. for amino acid sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”, the
alignment made by lining the identical letters into columns only
implies one mutation:
Q K G S Y P V R S T C
| | | |   | | | | | |
Q K G S G P V R S T C
This alignment implies that 1 G→Y or Y→G substitution occurred
Many of the alternative alignments for these two sequences imply that many more mutations occurred, eg. :
Q K G S Y - P V R S T C
| | |       | | | | | |
Q K G - S G P V R S T C
This alignment implies that 1 S→Y or Y→S substitution occurred;
that 1 insertion of S or deletion of S occurred;
and that 1 deletion of G or insertion of G occurred
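Counting the mutations implied by an alignment can be sketched as follows. The function name `implied_mutations` is hypothetical, and the sketch assumes every gap column counts as one insertion or deletion, as in the example above; a run of adjacent gaps could in reality come from a single event.

```python
def implied_mutations(row1, row2):
    """Return (substitutions, indels) implied by two aligned rows.

    Assumption: each gap column is counted as one insertion/deletion.
    """
    substitutions = 0
    indels = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            indels += 1
        elif a != b:
            substitutions += 1
    return substitutions, indels


simple = implied_mutations("QKGSYPVRSTC", "QKGSGPVRSTC")
gapped = implied_mutations("QKGSY-PVRSTC", "QKG-SGPVRSTC")
print(simple)  # alignment without gaps
print(gapped)  # alternative alignment with two gaps
```

Under this count, the alignment with fewer implied mutations would be preferred as the more plausible one.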
Further Reading
• Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
• Practical on pairwise alignment in R in the Little Book of R for Bioinformatics:
https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html
