
# Dynamic Programming: The Edit Distance Problem

## CS 2: Introduction to Programming Methods 9 February 2004

Today
A review of dynamic programming. The Edit Distance problem.

## Review: Dynamic Programming

Dynamic programming
A method for developing algorithms. Given a problem to solve, dynamic programming says:
Solve subproblems of the problem first. Remember these solutions for later. Then put together a solution to the problem.

Think of this as a bottom-up approach: solve problems in order of increasing size.

## Example: The Fibonacci sequence

fib(0) = 0, fib(1) = 1, fib(2) = 1, fib(3) = 2, fib(4) = 3, fib(5) = 5, and in general fib(n) = fib(n-1) + fib(n-2).

## Naively computing fib(n)

    int fib(int n) {
        if (n == 0 || n == 1) {
            return n;
        } else {
            return (fib(n-1) + fib(n-2));
        }
    }

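## Naively computing fib(n) (cont.)

The tree of recursive calls repeats the same subproblems many times:

- fib(n)
  - fib(n-2)
    - fib(n-4)
      - fib(n-6)
      - fib(n-5)
    - fib(n-3)
      - fib(n-5)
      - fib(n-4)
  - fib(n-1)
    - fib(n-3)
      - fib(n-5)
      - fib(n-4)
    - fib(n-2)
      - fib(n-4)
      - fib(n-3)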
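To make the repeated work in the call tree concrete, here is a minimal sketch that counts how many times the naive recursion is invoked. The class name `NaiveFibCalls`, the `calls` counter, and the `countCalls` helper are illustrative assumptions, not part of the lecture code.

```java
class NaiveFibCalls {
    // Hypothetical counter: number of fib invocations in one top-level call.
    static long calls = 0;

    // Same naive recursion as on the slide, with a counter added.
    static int fib(int n) {
        calls++;
        if (n == 0 || n == 1) {
            return n;
        }
        return fib(n - 1) + fib(n - 2);
    }

    // Resets the counter, runs fib(n), and reports how many calls it took.
    static long countCalls(int n) {
        calls = 0;
        fib(n);
        return calls;
    }

    public static void main(String[] args) {
        for (int n = 5; n <= 25; n += 5) {
            System.out.println("fib(" + n + ") needs " + countCalls(n) + " calls");
        }
    }
}
```

The call count grows roughly as fast as fib(n) itself, which is why the naive version becomes slow so quickly.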


## How to fix things...

The solution to fib(n) depends on:
The solution to fib(n - 1). The solution to fib(n - 2).

## Use dynamic programming to organize everything:

Each subproblem is characterized by the value of its input. To compute fib(n), we need fib(n - 2) and fib(n - 1). So compute fib(0) first, and work upwards. Use an array to keep track of everything.


## Computing fib(n): Take 2

    int fib(int n) {
        int[] seq = new int[n+1];
        seq[0] = 0;
        seq[1] = 1;
        for (int i = 2; i <= n; i++) {
            // looks like fib(i) = fib(i - 1) + fib(i - 2)
            seq[i] = seq[i - 1] + seq[i - 2];
        }
        return seq[n];
    }

CS 2: Introduction to Programming Methods, http://www.cs.caltech.edu/courses/cs2/, The Edit Distance Problem, 9 February 2004

## Computing fib(n): Take 2

    int fib(int n) {
        // This array will keep track of the solutions to
        // all n+1 problems.
        int[] seq = new int[n + 1];
        ...

## Computing fib(n): Take 2

    int fib(int n) {
        ...
        // Assume that (n >= 1). Otherwise, seq is too small.
        // The smallest subproblems have trivial solutions.
        seq[0] = 0;
        seq[1] = 1;
        ...

## Computing fib(n): Take 2

    int fib(int n) {
        ...
        // Solve other subproblems in order of increasing size.
        // Use the fact that we've already solved even smaller
        // subproblems.
        for (int i = 2; i <= n; i++) {
            seq[i] = seq[i - 1] + seq[i - 2];
        }
        ...

## Computing fib(n): Take 2

    int fib(int n) {
        ...
        // Finally, return solution to the actual problem.
        return seq[n];
    }
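The pieces above can be assembled into a runnable class. This is a sketch under two assumptions that are not in the slides: it handles n = 0 explicitly (the slide version assumes n >= 1), and it uses `long` instead of `int` so that larger inputs do not overflow as early. The class name `BottomUpFib` is hypothetical.

```java
class BottomUpFib {
    static long fib(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        if (n == 0) {
            return 0; // an array of length 1 would have no slot for seq[1]
        }
        // seq[i] holds the solution to subproblem i.
        long[] seq = new long[n + 1];
        seq[0] = 0;
        seq[1] = 1;
        // Solve the remaining subproblems in order of increasing size.
        for (int i = 2; i <= n; i++) {
            seq[i] = seq[i - 1] + seq[i - 2];
        }
        return seq[n];
    }

    public static void main(String[] args) {
        System.out.println(fib(10)); // prints 55
        System.out.println(fib(50)); // prints 12586269025
    }
}
```

Each subproblem is solved exactly once, so the loop body runs n - 1 times.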

## The Edit Distance problem

Problem: what is the cheapest way to transform one word (the source) into another word (the output)?
Example: transform "algorithm" into "alligator".

Initially, you start at the first character of the source and have an empty output. At any point, you can:

- Delete the current character of the source.
- Insert a new character into the output word.
- Copy the current character of the source into the output.

Copying and deleting move you to the next character of the source.
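The three operations can be simulated directly. The sketch below assumes operations are encoded as the strings "Copy", "Delete", and "Insert(c)" with a single character c. That encoding and the names `ApplyOperations` and `apply` are illustrative choices, not something the lecture prescribes.

```java
class ApplyOperations {
    // Applies the operations to the source, left to right, and returns the output.
    static String apply(String source, String[] ops) {
        StringBuilder output = new StringBuilder();
        int pos = 0; // index of the current character of the source
        for (String op : ops) {
            if (op.equals("Copy")) {
                if (pos >= source.length()) {
                    throw new IllegalArgumentException("nothing left to copy");
                }
                output.append(source.charAt(pos));
                pos++; // copying moves to the next source character
            } else if (op.equals("Delete")) {
                if (pos >= source.length()) {
                    throw new IllegalArgumentException("nothing left to delete");
                }
                pos++; // deleting moves to the next source character
            } else if (op.startsWith("Insert(") && op.endsWith(")") && op.length() == 9) {
                output.append(op.charAt(7)); // inserting does not consume the source
            } else {
                throw new IllegalArgumentException("unknown operation: " + op);
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        String[] ops = {
            "Copy", "Copy", "Insert(l)", "Insert(i)", "Copy",
            "Delete", "Delete", "Delete", "Insert(a)", "Copy",
            "Delete", "Delete", "Insert(o)", "Insert(r)"
        };
        System.out.println(apply("algorithm", ops)); // prints alligator
    }
}
```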


Example: transform "algorithm" into "alligator", one operation at a time.

| Step | Operation | Source character | Output so far |
|---|---|---|---|
| 0 | (none) | | |
| 1 | Copy | a | a |
| 2 | Copy | l | al |
| 3 | Insert(l) | | all |
| 4 | Insert(i) | | alli |
| 5 | Copy | g | allig |
| 6 | Delete | o | allig |
| 7 | Delete | r | allig |
| 8 | Delete | i | allig |
| 9 | Insert(a) | | alliga |
| 10 | Copy | t | alligat |
| 11 | Delete | h | alligat |
| 12 | Delete | m | alligat |
| 13 | Insert(o) | | alligato |
| 14 | Insert(r) | | alligator |

Operations: Copy, Copy, Insert(l), Insert(i), Copy, Delete, Delete, Delete, Insert(a), Copy, Delete, Delete, Insert(o), Insert(r)

## The Edit Distance problem

Problem: what is the cheapest way to transform one word (the source) into another word (the output)? At any point, you can:

- Delete the current character of the source.
- Insert a new character into the output word.
- Copy the current character of the source into the output.

Each operation has a cost associated with it. The cost of a transformation is the sum of the costs of each operation in the sequence.

Example: transform "algorithm" into "alligator".

Operations: Copy, Copy, Insert(l), Insert(i), Copy, Delete, Delete, Delete, Insert(a), Copy, Delete, Delete, Insert(o), Insert(r)

Assume that:

- Copying costs 3
- Inserting costs 5
- Deleting costs 2

## The cost of the above transformation is: 47

This is just 3 + 3 + 5 + 5 + 3 + 2 + 2 + ...
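The sum can be checked mechanically. This sketch reuses the string encoding "Copy", "Delete", "Insert(c)" for operations, which is an assumption made here for illustration. `TransformationCost` and `cost` are hypothetical names.

```java
class TransformationCost {
    // Sums the cost of each operation in the sequence.
    static int cost(String[] ops, int copyCost, int insertCost, int deleteCost) {
        int total = 0;
        for (String op : ops) {
            if (op.equals("Copy")) {
                total += copyCost;
            } else if (op.equals("Delete")) {
                total += deleteCost;
            } else if (op.startsWith("Insert(")) {
                total += insertCost;
            } else {
                throw new IllegalArgumentException("unknown operation: " + op);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] ops = {
            "Copy", "Copy", "Insert(l)", "Insert(i)", "Copy",
            "Delete", "Delete", "Delete", "Insert(a)", "Copy",
            "Delete", "Delete", "Insert(o)", "Insert(r)"
        };
        // 4 copies, 5 inserts, 5 deletes: 4*3 + 5*5 + 5*2
        System.out.println(cost(ops, 3, 5, 2)); // prints 47
    }
}
```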



## How to solve this problem?

Will trying out all possible transformations work? Answer: It can, but it would take way too long.
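For illustration, exhaustive search might look like the sketch below: at every point it tries each legal next operation and keeps the cheapest result. The class and method names are hypothetical, and the costs are passed in as parameters. Each call can branch up to three ways, so the number of explored sequences grows exponentially with the word lengths.

```java
class BruteForceEditDistance {
    // Cheapest cost to transform the rest of x (from index i) into the rest of y (from index j).
    static int cheapest(String x, String y, int i, int j, int copy, int insert, int delete) {
        if (i == x.length() && j == y.length()) {
            return 0; // nothing left to read and nothing left to produce
        }
        int best = Integer.MAX_VALUE;
        if (i < x.length()) {
            // Delete the current source character.
            best = Math.min(best, delete + cheapest(x, y, i + 1, j, copy, insert, delete));
        }
        if (j < y.length()) {
            // Insert the next output character.
            best = Math.min(best, insert + cheapest(x, y, i, j + 1, copy, insert, delete));
        }
        if (i < x.length() && j < y.length() && x.charAt(i) == y.charAt(j)) {
            // Copy is only possible when the characters match.
            best = Math.min(best, copy + cheapest(x, y, i + 1, j + 1, copy, insert, delete));
        }
        return best;
    }

    public static void main(String[] args) {
        // Costs from the earlier example: Copy 3, Insert 5, Delete 2.
        System.out.println(cheapest("algorithm", "alligator", 0, 0, 3, 5, 2));
    }
}
```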


## Will dynamic programming work?

Depends on our problem. Can our problem be solved based on the solution of some subproblems? What are the subproblems, if there are any? As is typical when trying out dynamic programming, it helps to find some key observation or insight.
We will need three observations for this problem.

## Notation

(s1, s2, s3, ..., sn) stands for a sequence of n elements. Used for a list of operations.
Example: (Copy, Copy, Insert(c), Delete).

## Also used for strings.

Example: "help" would correspond to (h, e, l, p).

## Key Observation #1

Given: Strings X = (x1, ..., xn) and Y = (y1, ..., yt), and a sequence S of operations (s1, s2, ..., sm). Let S' = (s1, ..., s(m-1)), X' = (x1, ..., x(n-1)), and Y' = (y1, ..., y(t-1)). Suppose that S is the cheapest sequence of operations to transform X into Y, and that sm is a Copy operation.

## Key Observation #1 (cont.)

Claim: then, S' is the cheapest sequence of operations to transform X' to Y'. Proof: by contradiction. Suppose that T = (t1, ..., tk) is cheaper than S' and transforms X' to Y'. Then (t1, ..., tk, sm) is cheaper than S and transforms X to Y. This is a contradiction, as S was supposed to be the cheapest way to transform X to Y.

## Key Observation #2

Given: Strings X = (x1, ..., xn) and Y = (y1, ..., yt), and a sequence S of operations (s1, s2, ..., sm). Let S' = (s1, ..., s(m-1)) and X' = (x1, ..., x(n-1)). Suppose that S is the cheapest sequence of operations to transform X into Y, and that sm is a Delete operation.

## Key Observation #2 (cont.)

Claim: then, S' is the cheapest sequence of operations to transform X' to Y. Proof: by contradiction. Suppose that T = (t1, ..., tk) is cheaper than S' and transforms X' to Y. Then (t1, ..., tk, sm) is cheaper than S and transforms X to Y. This is a contradiction, as S was supposed to be the cheapest way to transform X to Y.

## Key Observation #3

Given: Strings X = (x1, ..., xn) and Y = (y1, ..., yt), and a sequence S of operations (s1, s2, ..., sm). Let S' = (s1, ..., s(m-1)) and Y' = (y1, ..., y(t-1)). Suppose that S is the cheapest sequence of operations to transform X into Y, and that sm is an Insert(yt) operation.

## Key Observation #3 (cont.)

Claim: then, S' is the cheapest sequence of operations to transform X to Y'. Proof: by contradiction. Suppose that T = (t1, ..., tk) is cheaper than S' and transforms X to Y'. Then (t1, ..., tk, sm) is cheaper than S and transforms X to Y. This is a contradiction, as S was supposed to be the cheapest way to transform X to Y.

## An observation about the observations...

Given X, Y, and S, we have covered all possible cases for sm (the last operation).


## Putting the observations to use

The best sequence of operations to transform X to Y depends on one of the following, where X' and Y' are X and Y without their last characters:
(1) The best way to transform X' to Y'. (2) The best way to transform X' to Y. (3) The best way to transform X to Y'.

## The three cases become the subproblems to consider.

Given the solution to all three, we can find the solution to our actual problem (transforming X to Y). Why? The best solution to transforming X to Y must contain a solution to one of the three cases by the three observations.
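Written as a recurrence, the three cases might be summarized as follows. The symbols are introduced here for illustration and do not appear in the slides: C(i, j) is the cost of the cheapest transformation of (x1, ..., xi) into (y1, ..., yj), and c_copy, c_del, c_ins are the operation costs.

```latex
C(0, 0) = 0, \qquad
C(i, j) = \min
\begin{cases}
C(i-1,\, j-1) + c_{\mathrm{copy}} & \text{if } i \ge 1,\ j \ge 1,\ x_i = y_j \\
C(i-1,\, j) + c_{\mathrm{del}}    & \text{if } i \ge 1 \\
C(i,\, j-1) + c_{\mathrm{ins}}    & \text{if } j \ge 1
\end{cases}
```

The three lines correspond to Observations #1 (Copy), #2 (Delete), and #3 (Insert), in that order.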


## How to organize the subproblems

Each subproblem is characterized by some initial part of the original strings X and Y. So use a matrix, with the characters of X (a, l, g, o, r, i) across the top and the characters of Y down the side.

## How to organize the subproblems (cont.)

The marked location contains the score for the cheapest way to transform "algo" to "a".

## How to organize the subproblems (cont.)

The marked location contains the score for the cheapest way to transform "alg" to "" (the empty string).

## How to organize the subproblems (cont.)

The marked location contains the score for the cheapest way to transform "" to "a".

## How to organize the subproblems (cont.)

The marked location contains the score for the cheapest way to transform "" to "". This is the smallest problem possible.

## Additional information to store

Each location should store the last operation in the cheapest sequence of operations used to get there.

## Additional information to store (cont.)

Example: the marked location would contain a score, and possibly a Copy operation.

## How to fill out the matrix

For concreteness, assume that Copy costs 5, Insert(c) costs 10, and Delete costs 10.

## How to fill out the matrix (cont.)

Start at the upper left. This is trivial: score is zero, operation is null.

| Y \ X | (empty) | a | l | g | o | r | i |
|---|---|---|---|---|---|---|---|
| (empty) | 0 null | | | | | | |
| a | | | | | | | |
| l | | | | | | | |

## How to fill out the matrix (cont.)

All other locations depend on the values in up to three other places.

## How to fill out the matrix (cont.)

Diagonal movement corresponds to a Copy operation. Vertical movement corresponds to Insert(c). Horizontal movement corresponds to Delete.

## How to fill out the matrix (cont.)

Fill out each row in turn. Only one option for the first row: Delete.

| Y \ X | (empty) | a | l | g | o | r | i |
|---|---|---|---|---|---|---|---|
| (empty) | 0 null | 10 Del | 20 Del | 30 Del | 40 Del | 50 Del | 60 Del |
| a | | | | | | | |
| l | | | | | | | |

## How to fill out the matrix (cont.)

Fill out each row in turn.

| Y \ X | (empty) | a | l | g | o | r | i |
|---|---|---|---|---|---|---|---|
| (empty) | 0 null | 10 Del | 20 Del | 30 Del | 40 Del | 50 Del | 60 Del |
| a | 10 Ins(a) | | | | | | |
| l | | | | | | | |

## How to fill out the matrix (cont.)

Three possibilities to consider for the marked square (column a, row a). Clearly Copy is cheapest: 5, versus 20 for Insert(a) or Delete.

| Y \ X | (empty) | a | l | g | o | r | i |
|---|---|---|---|---|---|---|---|
| (empty) | 0 null | 10 Del | 20 Del | 30 Del | 40 Del | 50 Del | 60 Del |
| a | 10 Ins(a) | 5 Cop | | | | | |
| l | | | | | | | |
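The comparison for a single square can be sketched as a small helper that returns the score of each possible last operation. The names `CellCandidates` and `candidates` are hypothetical, and the costs are the ones assumed above (Copy 5, Insert 10, Delete 10).

```java
class CellCandidates {
    static final int COPY = 5;
    static final int INSERT = 10;
    static final int DELETE = 10;

    // Returns {score via Copy, score via Insert, score via Delete} for one square.
    // diagonal, above, and left are the scores already stored in the neighbouring squares.
    // Copy is marked as unavailable (Integer.MAX_VALUE) when the characters differ.
    static int[] candidates(int diagonal, int above, int left, char xChar, char yChar) {
        int viaCopy = (xChar == yChar) ? diagonal + COPY : Integer.MAX_VALUE;
        int viaInsert = above + INSERT;
        int viaDelete = left + DELETE;
        return new int[] { viaCopy, viaInsert, viaDelete };
    }

    public static void main(String[] args) {
        // Square in column a, row a: neighbours are 0 (diagonal), 10 (above), 10 (left).
        int[] c = candidates(0, 10, 10, 'a', 'a');
        System.out.println(c[0] + " " + c[1] + " " + c[2]); // prints 5 20 20
    }
}
```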

## How to fill out the matrix (cont.)

Two possibilities to consider for the marked square (column l, row a). Copy is not possible since a != l.

## How to fill out the matrix (cont.)

If we chose Insert(a), the score would be 30.

## How to fill out the matrix (cont.)

But Delete is cheaper! The score is 15.

| Y \ X | (empty) | a | l | g | o | r | i |
|---|---|---|---|---|---|---|---|
| (empty) | 0 null | 10 Del | 20 Del | 30 Del | 40 Del | 50 Del | 60 Del |
| a | 10 Ins(a) | 5 Cop | 15 Del | | | | |
| l | | | | | | | |

## How to fill out the matrix (cont.)

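Fill the rest of the matrix out in a similar fashion.

| Y \ X | (empty) | a | l | g | o | r | i |
|---|---|---|---|---|---|---|---|
| (empty) | 0 null | 10 Del | 20 Del | 30 Del | 40 Del | 50 Del | 60 Del |
| a | 10 Ins(a) | 5 Cop | 15 Del | 25 Del | 35 Del | 45 Del | 55 Del |
| l | 20 Ins(l) | 15 Ins(l) | 10 Cop | 20 Del | 30 Del | 40 Del | 50 Del |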
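Filling the whole matrix row by row can be sketched as follows. This version stores only the scores. The names `EditDistanceMatrix` and `fill` are hypothetical, rows are indexed by prefixes of Y and columns by prefixes of X to match the layout used here, and the costs are parameters.

```java
class EditDistanceMatrix {
    // score[row][col] = cheapest cost to transform the first col characters of x
    // into the first row characters of y.
    static int[][] fill(String x, String y, int copy, int insert, int delete) {
        int[][] score = new int[y.length() + 1][x.length() + 1];
        // First row: the only option is to delete every character read so far.
        for (int col = 1; col <= x.length(); col++) {
            score[0][col] = score[0][col - 1] + delete;
        }
        for (int row = 1; row <= y.length(); row++) {
            // First column: the only option is to insert every output character.
            score[row][0] = score[row - 1][0] + insert;
            for (int col = 1; col <= x.length(); col++) {
                int best = Math.min(score[row - 1][col] + insert,   // vertical: Insert
                                    score[row][col - 1] + delete);  // horizontal: Delete
                if (x.charAt(col - 1) == y.charAt(row - 1)) {
                    best = Math.min(best, score[row - 1][col - 1] + copy); // diagonal: Copy
                }
                score[row][col] = best;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        int[][] score = fill("algori", "al", 5, 10, 10);
        System.out.println(score[2][6]); // prints 50
    }
}
```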

## Transforming algori to al

The cheapest sequence of operations has cost: 50. This is the score in the bottom-right location of the matrix.

## Transforming algori to al (cont.)

Recover the sequence by working backwards from the bottom-right location. We get: Copy, Copy, Delete, Delete, Delete, Delete.
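Working backwards can be sketched by storing an operation next to each score and then following the operations from the bottom-right location to the upper left. The class and method names and the string encoding of operations are assumptions made for illustration. Ties are broken in favour of Delete, then Insert, then Copy, which is one choice among several.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class EditDistanceTraceback {
    static List<String> cheapestOperations(String x, String y, int copy, int insert, int delete) {
        int rows = y.length() + 1;
        int cols = x.length() + 1;
        int[][] score = new int[rows][cols];
        String[][] op = new String[rows][cols]; // op[0][0] stays null
        for (int col = 1; col < cols; col++) {
            score[0][col] = score[0][col - 1] + delete;
            op[0][col] = "Delete";
        }
        for (int row = 1; row < rows; row++) {
            String ins = "Insert(" + y.charAt(row - 1) + ")";
            score[row][0] = score[row - 1][0] + insert;
            op[row][0] = ins;
            for (int col = 1; col < cols; col++) {
                // Start with Delete (horizontal), then see if something is strictly cheaper.
                score[row][col] = score[row][col - 1] + delete;
                op[row][col] = "Delete";
                if (score[row - 1][col] + insert < score[row][col]) {
                    score[row][col] = score[row - 1][col] + insert;
                    op[row][col] = ins;
                }
                if (x.charAt(col - 1) == y.charAt(row - 1)
                        && score[row - 1][col - 1] + copy < score[row][col]) {
                    score[row][col] = score[row - 1][col - 1] + copy;
                    op[row][col] = "Copy";
                }
            }
        }
        // Each location's operation says where the previous location is.
        List<String> ops = new ArrayList<>();
        int row = rows - 1;
        int col = cols - 1;
        while (op[row][col] != null) {
            String last = op[row][col];
            ops.add(last);
            if (last.equals("Copy")) {
                row--;
                col--;
            } else if (last.equals("Delete")) {
                col--;
            } else {
                row--;
            }
        }
        Collections.reverse(ops);
        return ops;
    }

    public static void main(String[] args) {
        // prints [Copy, Copy, Delete, Delete, Delete, Delete]
        System.out.println(cheapestOperations("algori", "al", 5, 10, 10));
    }
}
```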

## Why this worked

This worked because of the three observations we made earlier.

## Why this worked (cont.)

The best answer to put in a location must use the best solution to one of three possible subproblems.

## Why this worked (cont.)

So we solved those subproblems first. Then we considered cases and figured out how best to solve our current problem.

## Why this worked (cont.)

We could recover the cheapest sequence of operations since we stored operations at each step.

## Why this worked (cont.)

Each location's operation tells us where to look for the previous one in the sequence.