Minimum Edit Distance

Definition of Minimum Edit Distance

Dan Jurafsky

How similar are two strings?
Spell correction
  The user typed "graffe"
  Which is closest?
    graf
    graft
    grail
    giraffe
Computational Biology
  Align two sequences of nucleotides
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  Resulting alignment:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Also for Machine Translation, Information Extraction, Speech Recognition

Edit Distance
The minimum edit distance between two strings
is the minimum number of editing operations
  Insertion
  Deletion
  Substitution
needed to transform one into the other

Minimum Edit Distance
Two strings and their alignment:
  I N T E * N T I O N
  * E X E C U T I O N
  d s s   i s
If each operation has cost of 1
  Distance between these is 5
If substitutions cost 2 (Levenshtein)
  Distance between them is 8

Alignment in Computational Biology
Given a sequence of bases
  AGGCTATCACCTGACCTCCAGGCCGATGCCC
  TAGCTATCACGACCGCGGTCGATTTGCCCGAC
An alignment:
  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Given two sequences, align each letter to a letter or gap

Other uses of Edit Distance in NLP
Evaluating Machine Translation and speech recognition
  R  Spokesman confirms     senior government adviser was shot
  H  Spokesman said     the senior            adviser was shot dead
               S        I          D                           I
Named Entity Extraction and Entity Coreference
  IBM Inc. announced today
  IBM profits
  Stanford President John Hennessy announced yesterday
  for Stanford University President John Hennessy
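
The R/H example above can be scored with a word-level edit distance. The following is a minimal sketch, assuming unit cost for every operation and assuming the common convention that word error rate is the number of edits divided by the reference length; the function names are illustrative and do not come from the slides.

```python
def word_edit_distance(reference, hypothesis):
    """Unit-cost edit distance computed over words instead of characters."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i          # delete every reference word
    for j in range(1, m + 1):
        dist[0][j] = j          # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[n][m]


def word_error_rate(reference, hypothesis):
    """Edits per reference word (assumed definition, see lead-in)."""
    return word_edit_distance(reference, hypothesis) / len(reference.split())


R = "Spokesman confirms senior government adviser was shot"
H = "Spokesman said the senior adviser was shot dead"
print(word_edit_distance(R, H), round(word_error_rate(R, H), 3))
```

The four edits found here correspond to the S, I, D, I marks under the example.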

How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the start string to the final string:
  Initial state: the word we're transforming
  Operators: insert, delete, substitute
  Goal state: the word we're trying to get to
  Path cost: what we want to minimize: the number of edits

Minimum Edit as Search
But the space of all edit sequences is huge!
  We can't afford to navigate naively
  Lots of distinct paths wind up at the same state.
    We don't have to keep track of all of them
    Just the shortest path to each of those revisited states.

Defining Min Edit Distance
For two strings
  X of length n
  Y of length m
We define D(i,j)
  the edit distance between X[1..i] and Y[1..j]
  i.e., the first i characters of X and the first j characters of Y
  The edit distance between X and Y is thus D(n,m)
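
One way to read this definition is as a recursion over prefix lengths in which each state (i, j) is solved once and then reused, which is the "keep only the shortest path to each revisited state" idea. A minimal sketch, assuming unit costs by default; `sub_cost` is an illustrative parameter, and the cache comes from Python's standard `functools.lru_cache`.

```python
from functools import lru_cache


def min_edit_distance_recursive(x, y, sub_cost=1):
    """D(n, m) computed top-down; each (i, j) state is evaluated only once."""

    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j            # insert the first j characters of y
        if j == 0:
            return i            # delete the first i characters of x
        same = x[i - 1] == y[j - 1]
        return min(d(i - 1, j) + 1,                          # deletion
                   d(i, j - 1) + 1,                          # insertion
                   d(i - 1, j - 1) + (0 if same else sub_cost))

    return d(len(x), len(y))


print(min_edit_distance_recursive("intention", "execution"))
print(min_edit_distance_recursive("intention", "execution", sub_cost=2))
```

Without the cache the same prefixes would be revisited exponentially often; with it there are only (n+1)(m+1) distinct states.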

Minimum Edit Distance

Computing Minimum Edit Distance

Dynamic Programming for Minimum Edit Distance
Dynamic programming: A tabular computation of D(n,m)
Solving problems by combining solutions to subproblems.
Bottom-up
  We compute D(i,j) for small i,j
  And compute larger D(i,j) based on previously computed smaller values
  i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
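
Defining Min Edit Distance (Levenshtein)
Initialization
  D(i,0) = i
  D(0,j) = j
Recurrence Relation:
  For each i = 1…N
    For each j = 1…M
\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 \\
D(i,j-1) + 1 \\
D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases}
\end{cases}
\]
Termination:
  D(N,M) is distance
(N and M are the lengths of X and Y, written n and m earlier.)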
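
The initialization, recurrence and termination steps translate almost line for line into a table-filling loop. A minimal sketch, assuming 0-based Python strings (so X(i) is `x[i - 1]`); the function name and the `sub_cost` parameter are illustrative.

```python
def levenshtein(x, y, sub_cost=2):
    """Bottom-up minimum edit distance; substitutions cost 2 by default."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: D(i,0) = i and D(0,j) = j
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    # Recurrence: every cell depends on its lower, left and diagonal neighbours
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, diag)
    # Termination: D(N,M) is the distance
    return D[n][m]


print(levenshtein("intention", "execution"))
```

Passing `sub_cost=1` gives the variant in which every operation costs 1.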

The Edit Distance Table
N   9
O   8
I   7
T   6
N   5
E   4
T   3
N   2
I   1
#   0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N

The Edit Distance Table
N   9  8  9 10 11 12 11 10  9  8
O   8  7  8  9 10 11 10  9  8  9
I   7  6  7  8  9 10  9  8  9 10
T   6  5  6  7  8  9  8  9 10 11
N   5  4  5  6  7  8  9 10 11 10
E   4  3  4  5  6  7  8  9 10  9
T   3  4  5  6  7  8  7  8  9  8
N   2  3  4  5  6  7  8  7  8  7
I   1  2  3  4  5  6  7  6  7  8
#   0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N
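
The filled table can be regenerated programmatically. The sketch below assumes the Levenshtein costs used in this deck (insert 1, delete 1, substitute 2) and prints the rows in the same bottom-up layout, with the empty prefix `#` on the bottom row; the helper names are illustrative.

```python
def edit_distance_table(x, y, sub_cost=2):
    """Return the full (len(x)+1) x (len(y)+1) table of prefix distances."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, diag)
    return D


def format_table(x, y, D):
    """Render the table with the last character of x on the top row."""
    lines = []
    for i in range(len(x), -1, -1):
        label = x[i - 1].upper() if i > 0 else "#"
        lines.append(" ".join([label] + [f"{v:2d}" for v in D[i]]))
    lines.append(" ".join([" ", " #"] + [f" {c.upper()}" for c in y]))
    return "\n".join(lines)


table = edit_distance_table("intention", "execution")
print(format_table("intention", "execution", table))
```

The top-right entry, `table[9][9]`, is the distance between the two full strings.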

Minimum Edit Distance

Backtrace for Computing Alignments

Computing alignments
Edit distance isn't sufficient
  We often need to align each character of the two strings to each other
We do this by keeping a "backtrace"
Every time we enter a cell, remember where we came from
When we reach the end,
  Trace back the path from the upper right corner to read off the alignment

MinEdit with Backtrace
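
Adding Backtrace to Minimum Edit Distance
Base conditions:
  D(i,0) = i
  D(0,j) = j
Termination:
  D(N,M) is distance
Recurrence Relation:
  For each i = 1…N
    For each j = 1…M
\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{deletion} \\
D(i,j-1) + 1 & \text{insertion} \\
D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{substitution}
\end{cases}
\]
\[
\mathrm{ptr}(i,j) =
\begin{cases}
\text{LEFT} & \text{insertion} \\
\text{DOWN} & \text{deletion} \\
\text{DIAG} & \text{substitution}
\end{cases}
\]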
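
A backtrace can be kept as a second table of pointers filled alongside D. This is a minimal sketch under two assumptions of mine: ties are broken in favour of DIAG, and `*` marks a gap. When several alignments share the minimum cost, other tie-breaking rules return a different but equally cheap alignment.

```python
def align_with_backtrace(x, y, sub_cost=2):
    """Return (distance, aligned_x, aligned_y) using LEFT/DOWN/DIAG pointers."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"      # only deletions reach column 0
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"      # only insertions reach row 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost), "DIAG"),
                (D[i - 1][j] + 1, "DOWN"),   # deletion
                (D[i][j - 1] + 1, "LEFT"),   # insertion
            ]
            D[i][j], ptr[i][j] = min(candidates, key=lambda c: c[0])
    # Follow the pointers from (n, m) back to (0, 0)
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "DOWN":
            top.append(x[i - 1]); bottom.append("*"); i -= 1
        else:
            top.append("*"); bottom.append(y[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, top, bottom = align_with_backtrace("intention", "execution")
print(distance)
print(top)
print(bottom)
```

Reading the two returned strings column by column gives the operations: a `*` on the bottom is a deletion, a `*` on top is an insertion, and differing letters are a substitution.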

The Distance Matrix
Slide adapted from Serafim Batzoglou
[Figure: matrix with y_0 … y_M along one axis and x_0 … x_N along the other]
Every non-decreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences
An optimal alignment is composed of optimal subalignments

Result of Backtrace
Two strings and their alignment:
  I N T E * N T I O N
  * E X E C U T I O N

Performance
Time: O(nm)
Space: O(nm)
Backtrace: O(n+m)
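
The O(nm) space is what makes the backtrace possible. As a side note not taken from the slides: when only the distance is needed, each row depends only on the row before it, so two rows are enough. A minimal sketch of that variant, assuming symmetric insertion and deletion costs so that the two strings can be swapped:

```python
def levenshtein_two_rows(x, y, sub_cost=2):
    """Distance only, keeping two rows; no alignment can be recovered."""
    if len(y) > len(x):
        x, y = y, x                      # keep the shorter string along the row
    prev = list(range(len(y) + 1))       # row for the empty prefix of x
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, diag)
        prev = curr
    return prev[len(y)]


print(levenshtein_two_rows("intention", "execution"))
```

Time stays O(nm); the memory held at any moment is proportional to the shorter string.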

Minimum Edit Distance

Weighted Minimum Edit Distance

Weighted Edit Distance
Why would we add weights to the computation?
  Spell Correction: some letters are more likely to be mistyped than others
  Biology: certain kinds of deletions or insertions are more likely than others

Confusion matrix for spelling errors

Weighted Min Edit Distance
Initialization:
  D(0,0) = 0
  D(i,0) = D(i-1,0) + del[x(i)];  1 ≤ i ≤ N
  D(0,j) = D(0,j-1) + ins[y(j)];  1 ≤ j ≤ M
Recurrence Relation:
\[
D(i,j) = \min
\begin{cases}
D(i-1,j) + \mathrm{del}[x(i)] \\
D(i,j-1) + \mathrm{ins}[y(j)] \\
D(i-1,j-1) + \mathrm{sub}[x(i), y(j)]
\end{cases}
\]
Termination:
  D(N,M) is distance
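
The weighted recurrence only changes where the costs come from. In this sketch the three cost tables are passed in as functions; the `NEIGHBOURS` set and the 0.5 penalty are invented toy values for illustration, not figures from a real confusion matrix.

```python
def weighted_edit_distance(x, y, del_cost, ins_cost, sub_cost):
    """Minimum edit distance with per-character operation costs."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(x[i - 1]),
                          D[i][j - 1] + ins_cost(y[j - 1]),
                          D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return D[n][m]


# Toy assumption: a few adjacent-key confusions are cheaper than other substitutions.
NEIGHBOURS = {("e", "r"), ("r", "e"), ("a", "s"), ("s", "a")}


def toy_sub(a, b):
    if a == b:
        return 0.0
    return 0.5 if (a, b) in NEIGHBOURS else 1.0


def unit(_char):
    return 1.0


print(weighted_edit_distance("the", "thr", unit, unit, toy_sub))
print(weighted_edit_distance("the", "thx", unit, unit, toy_sub))
```

With these toy weights "thr" is closer to "the" than "thx" is, which is the behaviour a spelling corrector wants from a confusion matrix.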

Where did the name, dynamic programming, come from?

...The 1950s were not good years for mathematical research. [the] Secretary of Defense ...had a pathological fear and hatred of the word, research...

I decided therefore to use the word, programming.

I wanted to get across the idea that this was dynamic, this was multistage... I thought, let's ... take a word that has an absolutely precise meaning, namely dynamic... it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible.

Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to.

Richard Bellman, "Eye of the Hurricane: an autobiography", 1984.

Minimum Edit Distance

Minimum Edit Distance in Computational Biology

Sequence Alignment
  AGGCTATCACCTGACCTCCAGGCCGATGCCC
  TAGCTATCACGACCGCGGTCGATTTGCCCGAC

  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Why sequence alignment?
Comparing genes or regions from different species
  to find important regions
  determine function
  uncover evolutionary forces
Assembling fragments to sequence DNA
Comparing individuals to look for mutations

Alignments in two fields
In Natural Language Processing
  We generally talk about distance (minimized)
  And weights
In Computational Biology
  We generally talk about similarity (maximized)
  And scores

The Needleman-Wunsch Algorithm
Initialization:
  D(i,0) = -i * d
  D(0,j) = -j * d
Recurrence Relation:
\[
D(i,j) = \max
\begin{cases}
D(i-1,j) - d \\
D(i,j-1) - d \\
D(i-1,j-1) + s[x(i), y(j)]
\end{cases}
\]
Termination:
  D(N,M) is the optimal alignment score (a similarity, so larger is better)
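
A minimal sketch of the global alignment score under this recurrence. The scoring function (+1 for a match, -1 for a mismatch) and the gap penalty d = 1 are illustrative assumptions; real applications typically take s from a substitution matrix.

```python
def needleman_wunsch_score(x, y, s, d):
    """Best global alignment score of x and y with linear gap penalty d."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = -i * d                 # x[:i] aligned entirely to gaps
    for j in range(1, m + 1):
        D[0][j] = -j * d                 # y[:j] aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j] - d,
                          D[i][j - 1] - d,
                          D[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
    return D[n][m]


def simple_score(a, b):
    """Assumed toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(needleman_wunsch_score("AGC", "AC", simple_score, d=1))
```

For "AGC" against "AC" the best alignment matches A and C and pays one gap for G, so the score is 1 + 1 - 1 = 1.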

The Needleman-Wunsch Matrix
Slide adapted from Serafim Batzoglou
[Figure: matrix with x_1 … x_M along one axis and y_1 … y_N along the other]
(Note that the origin is at the upper left.)

A variant of the basic algorithm:
Maybe it is OK to have an unlimited # of gaps in the beginning and end:
Slide from Serafim Batzoglou
  ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
  GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
If so, we don't want to penalize gaps at the ends

Different types of overlaps
Slide from Serafim Batzoglou
Example:
  2 overlapping "reads" from a sequencing project
Example:
  Search for a mouse gene within a human chromosome

The Overlap Detection variant
Slide from Serafim Batzoglou
Changes:

1. Initialization
   For all i, j,
     F(i, 0) = 0
     F(0, j) = 0

2. Termination
\[
F_{OPT} = \max
\begin{cases}
\max_i F(i, N) \\
\max_j F(M, j)
\end{cases}
\]
[Figure: matrix with x_1 … x_M along one axis and y_1 … y_N along the other]
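
The two changes can be sketched by reusing the global-alignment iteration with zero borders and a maximum taken over the final row and column. The scoring function and gap penalty are illustrative assumptions, and the indexing follows the slide: i runs over x (length M), j over y (length N).

```python
def overlap_score(x, y, s, d):
    """Best alignment score when gaps at either end are free."""
    M, N = len(x), len(y)
    # Initialization: F(i,0) = 0 and F(0,j) = 0, so leading gaps cost nothing
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = max(F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
    # Termination: the alignment may stop once either string is used up
    best_when_y_ends = max(F[i][N] for i in range(M + 1))
    best_when_x_ends = max(F[M][j] for j in range(N + 1))
    return max(best_when_y_ends, best_when_x_ends)


def simple_score(a, b):
    """Assumed toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(overlap_score("TTTTACGT", "ACGTGGGG", simple_score, d=1))
```

Here the suffix ACGT of the first string overlaps the prefix ACGT of the second, and the unmatched ends are not penalized.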

The Local Alignment Problem
Given two strings
  x = x_1 … x_M,
  y = y_1 … y_N

Find substrings x', y' whose similarity
(optimal global alignment value)
is maximum

  x = aaaacccccggggtta
  y = ttcccgggaaccaacc
Slide from Serafim Batzoglou

The Smith-Waterman algorithm
Idea: Ignore badly aligning regions

Modifications to Needleman-Wunsch:

Initialization:
  F(0, j) = 0
  F(i, 0) = 0

Iteration:
\[
F(i,j) = \max
\begin{cases}
0 \\
F(i-1, j) - d \\
F(i, j-1) - d \\
F(i-1, j-1) + s(x_i, y_j)
\end{cases}
\]
Slide from Serafim Batzoglou
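
A minimal sketch of the modified iteration together with a traceback from the best-scoring cell. The match, mismatch and gap values are illustrative assumptions; ties for the best cell are resolved by taking the first one found in row order, and `-` marks a gap.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Return (best score, aligned piece of x, aligned piece of y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # borders stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 lets an alignment restart after a bad region
            F[i][j] = max(0, F[i - 1][j] - d, F[i][j - 1] - d, F[i - 1][j - 1] + s)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("ATCAT", "ATTATC"))
```

Because no cell is ever negative, a poorly matching stretch resets the score to 0 instead of dragging down everything that follows it.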

The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment...
\[
F_{OPT} = \max_{i,j} F(i, j)
\]
   Find F_OPT and trace back
2. If we want all local alignments scoring > t
   For all i, j find F(i, j) > t, and trace back?
   Complicated by overlapping local alignments
Slide from Serafim Batzoglou

Local alignment example
X = ATCAT
Y = ATTATC

Let:
  m = 1 (1 point for match)
  d = 1 (-1 point for del/ins/sub)

    A T T A T C
  0 0 0 0 0 0 0
A 0
T 0
C 0
A 0
T 0

Local alignment example
X = ATCAT
Y = ATTATC

    A T T A T C
  0 0 0 0 0 0 0
A 0 1 0 0 1 0 0
T 0 0 2 1 0 2 1
C 0 0 1 1 0 1 3
A 0 1 0 0 2 1 2
T 0 0 2 1 1 3 2
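
The example matrix can be filled mechanically from the Smith-Waterman iteration. This sketch assumes the scores stated for the example (1 point for a match, -1 for a deletion, insertion or substitution) and reports every cell that reaches the maximum, since each such cell is the end of a best local alignment; the function name is illustrative.

```python
def local_alignment_matrix(x, y, m=1, d=1):
    """F[i][j] for the local alignment of x (rows) against y (columns)."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = m if x[i - 1] == y[j - 1] else -d   # a substitution also costs d
            F[i][j] = max(0, F[i - 1][j] - d, F[i][j - 1] - d, F[i - 1][j - 1] + s)
    return F


X, Y = "ATCAT", "ATTATC"
F = local_alignment_matrix(X, Y)
best = max(max(row) for row in F)
best_cells = [(i, j) for i, row in enumerate(F) for j, v in enumerate(row) if v == best]

print("    " + " ".join(Y))
for i, row in enumerate(F):
    label = X[i - 1] if i > 0 else " "
    print(label + " " + " ".join(str(v) for v in row))
print(best, best_cells)
```

Two cells reach the maximum score of 3: (3, 6), the end of ATC aligned with ATC, and (5, 5), the end of ATCAT aligned with ATTAT.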
