Edit Distance

undefined
E
d
i
t
 
D
i
s
t
a
n
c
e
Michael T. Goodrich
University of California, Irvine
Most slides adapted from slides by Dan Jurafsky, Stanford University
How similar are two strings?
 
Spell correction
The user typed
“graffe”
Which is closest?
graf
graft
grail
giraffe
 
 
 
Computational Biology
Align two sequences of nucleotides
 
 
Resulting alignment:
 
Also for Machine Translation, Information Extraction, Speech
Recognition
 
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
 
-
AG
G
CTATCAC
CT
GACC
T
C
CA
GG
C
CGA
--
TGCCC
---
T
AG
-
CTATCAC
--
GACC
G
C
--
GG
T
CGA
TT
TGCCC
GAC
Edit Distance
The minimum edit distance between two strings
Is the minimum number of editing operations
Insertion
Deletion
Substitution
Needed to transform one into the other
Minimum Edit Distance
T
w
o
 
s
t
r
i
n
g
s
 
a
n
d
 
t
h
e
i
r
 
a
l
i
g
n
m
e
n
t
:
Minimum Edit Distance
 
If each operation has cost of 1
Distance between these is 5
If substitutions cost 2 (Levenshtein)
Distance between them is 8
Alignment in Computational Biology
 
Given a sequence of bases
 
 
An alignment:
 
 
Given two sequences, align each letter to a letter
or gap
 
-
AG
G
CTATCAC
CT
GACC
T
C
CA
GG
C
CGA
--
TGCCC
---
T
AG
-
CTATCAC
--
GACC
G
C
--
GG
T
CGA
TT
TGCCC
GAC
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Uses of Edit Distance in NLP
Evaluating Machine Translation and speech
recognition
R 
Spokesman confirms    senior government adviser was appointed
H 
Spokesman said    the senior            adviser was appointed
              S      I              D                        I
How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the
start string to the final string:
I
n
i
t
i
a
l
 
s
t
a
t
e
:
 
t
h
e
 
w
o
r
d
 
w
e
r
e
 
t
r
a
n
s
f
o
r
m
i
n
g
O
p
e
r
a
t
o
r
s
:
 
i
n
s
e
r
t
,
 
d
e
l
e
t
e
,
 
s
u
b
s
t
i
t
u
t
e
G
o
a
l
 
s
t
a
t
e
:
 
 
t
h
e
 
w
o
r
d
 
w
e
r
e
 
t
r
y
i
n
g
 
t
o
 
g
e
t
 
t
o
P
a
t
h
 
c
o
s
t
:
 
w
h
a
t
 
w
e
 
w
a
n
t
 
t
o
 
m
i
n
i
m
i
z
e
:
 
t
h
e
 
n
u
m
b
e
r
 
o
f
e
d
i
t
s
Minimum Edit Distance as Search
But the space of all edit sequences is huge!
We can’t afford to navigate naïvely
Lots of distinct paths wind up at the same state.
We don’t have to keep track of all of them
I
n
s
t
e
a
d
,
 
w
e
 
u
s
e
 
d
y
n
a
m
i
c
 
p
r
o
g
r
a
m
m
i
n
g
,
 
i
n
 
a
w
a
y
 
v
e
r
y
 
s
i
m
i
l
a
r
 
t
o
 
t
h
e
 
L
C
S
 
p
r
o
b
l
e
m
Defining Edit Distance Parameters
For two strings
X of length 
n
Y of length 
m
We define D(
i,j
)
the edit distance between X[1..
i
] and Y[1..
j
]
i.e., the first 
i
 characters of X and the first 
j
 characters of Y
The edit distance between X and Y is thus D(
n,m
)
Dynamic Programming for Edit Distance
D
y
n
a
m
i
c
 
p
r
o
g
r
a
m
m
i
n
g
:
 
A
 
t
a
b
u
l
a
r
 
c
o
m
p
u
t
a
t
i
o
n
 
o
f
D
(
n
,
m
)
Solving problems by combining solutions to
subproblems.
Bottom-up
We compute D(i,j) for small 
i,j
And compute larger D(i,j) based on previously
computed smaller values
i.e., compute D(
i,j
) for all 
i
 (0 < 
i
 < n)  and
 j 
(0 < j < m)
Defining Min Edit Distance
(Levenshtein)
Initialization
D(i,0) = i
D(0,j) = j
Recurrence Relation
:
For each  i = 1…M
 
  For each  j = 1…N
                          
D(i-1,j) + 1
          D(i,j)= min   D(i,j-1) + 1
                          D(i-1,j-1) +   2;  if X(i) ≠ Y(j)
                                         0;  if X(i) = Y(j)
Termination
:
D(N,M) is distance
The Edit Distance Table
T
h
e
 
E
d
i
t
 
D
i
s
t
a
n
c
e
 
T
a
b
l
e
Edit Distance
T
h
e
 
E
d
i
t
 
D
i
s
t
a
n
c
e
 
T
a
b
l
e
Computing alignments
Edit distance isn’t sufficient
W
e
 
o
f
t
e
n
 
n
e
e
d
 
t
o
 
a
l
i
g
n
 
e
a
c
h
 
c
h
a
r
a
c
t
e
r
 
o
f
 
t
h
e
 
t
w
o
s
t
r
i
n
g
s
 
t
o
 
e
a
c
h
 
o
t
h
e
r
We do this by keeping a “backtrace”
Every time we enter a cell, remember where we
came from
When we reach the end,
Trace back the path from the upper right corner to
read off the alignment
Edit Distance
MinEdit with Backtrace
Adding Backtrace to Minimum Edit
Distance
Base conditions:                                                        Termination:
D(i,0) = i         D(0,j) = j         D(N,M) is distance 
Recurrence Relation
:
For each  i = 1…M
 
 For each  j = 1…N
                           
D(i-1,j) + 1
      D(i,j)=     min   D(i,j-1) + 1
                           D(i-1,j-1) +  2; if X(i) ≠ Y(j)
                                         0; if X(i) = Y(j)
                        LEFT
            ptr(i,j)=   DOWN
                        DIAG
insertion
deletion
substitution
insertion
deletion
substitution
Performance
Time:
    
O(nm)
Space:
    
O(nm)
Can be improved to O(n+m) using Hirschberg’s
algorithm
Backtrace:
    
O(n+m)
Types of Edit Distance Metrics
T
h
e
 
L
e
v
e
n
s
h
t
e
i
n
 
d
i
s
t
a
n
c
e
 
a
l
l
o
w
s
 
d
e
l
e
t
i
o
n
,
i
n
s
e
r
t
i
o
n
 
a
n
d
 
s
u
b
s
t
i
t
u
t
i
o
n
.
T
h
e
 
L
o
n
g
e
s
t
 
c
o
m
m
o
n
 
s
u
b
s
e
q
u
e
n
c
e
 
(
L
C
S
)
d
i
s
t
a
n
c
e
 
a
l
l
o
w
s
 
o
n
l
y
 
i
n
s
e
r
t
i
o
n
 
a
n
d
 
d
e
l
e
t
i
o
n
,
 
n
o
t
s
u
b
s
t
i
t
u
t
i
o
n
.
T
h
e
 
H
a
m
m
i
n
g
 
d
i
s
t
a
n
c
e
 
a
l
l
o
w
s
 
o
n
l
y
 
s
u
b
s
t
i
t
u
t
i
o
n
;
h
e
n
c
e
,
 
i
t
 
o
n
l
y
 
a
p
p
l
i
e
s
 
t
o
 
s
t
r
i
n
g
s
 
o
f
 
t
h
e
 
s
a
m
e
 
l
e
n
g
t
h
.
T
h
e
 
D
a
m
e
r
a
u
L
e
v
e
n
s
h
t
e
i
n
 
d
i
s
t
a
n
c
e
 
a
l
l
o
w
s
i
n
s
e
r
t
i
o
n
,
 
d
e
l
e
t
i
o
n
,
 
s
u
b
s
t
i
t
u
t
i
o
n
,
 
a
n
d
 
t
h
e
t
r
a
n
s
p
o
s
i
t
i
o
n
 
o
f
 
t
w
o
 
a
d
j
a
c
e
n
t
 
c
h
a
r
a
c
t
e
r
s
.
T
h
e
 
J
a
r
o
 
d
i
s
t
a
n
c
e
 
a
l
l
o
w
s
 
o
n
l
y
 
t
r
a
n
s
p
o
s
i
t
i
o
n
.
Common Algorithm
From https://en.wikipedia.org/wiki/Edit_distance
 
This doesn’t include transposition – how could we add this operation?
Slide Note
Embed
Share

Edit distance, a crucial concept in Computational Biology and NLP, measures the minimum number of operations needed to transform one string into another. It is widely used for tasks such as spell correction, aligning nucleotide sequences, evaluating machine translation, and speech recognition. By considering insertions, deletions, and substitutions, edit distance helps find the closest match between strings and identify optimal alignments in various applications.

  • Computational Biology
  • NLP
  • Edit Distance
  • Alignment
  • Spell Correction

Uploaded on Apr 20, 2024 | 3 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Edit Distance Michael T. Goodrich University of California, Irvine Most slides adapted from slides by Dan Jurafsky, Stanford University

  2. How similar are two strings? Spell correction The user typed graffe Which is closest? graf graft grail giraffe Computational Biology Align two sequences of nucleotides AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC Resulting alignment: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Also for Machine Translation, Information Extraction, Speech Recognition

  3. Edit Distance The minimum edit distance between two strings Is the minimum number of editing operations Insertion Deletion Substitution Needed to transform one into the other

  4. Minimum Edit Distance Two strings and their alignment:

  5. Minimum Edit Distance If each operation has cost of 1 Distance between these is 5 If substitutions cost 2 (Levenshtein) Distance between them is 8

  6. Alignment in Computational Biology Given a sequence of bases AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC An alignment: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two sequences, align each letter to a letter or gap

  7. Uses of Edit Distance in NLP Evaluating Machine Translation and speech recognition R Spokesman confirms senior government adviser was appointed H Spokesman said the senior adviser was appointed S I D I

  8. How to find the Min Edit Distance? Searching for a path (sequence of edits) from the start string to the final string: Initial state: the word we re transforming Operators: insert, delete, substitute Goal state: the word we re trying to get to Path cost: what we want to minimize: the number of edits

  9. Minimum Edit Distance as Search But the space of all edit sequences is huge! We can t afford to navigate na vely Lots of distinct paths wind up at the same state. We don t have to keep track of all of them Instead, we use dynamic programming, in a way very similar to the LCS problem

  10. Defining Edit Distance Parameters For two strings X of length n Y of length m We define D(i,j) the edit distance between X[1..i] and Y[1..j] i.e., the first i characters of X and the first j characters of Y The edit distance between X and Y is thus D(n,m)

  11. Dynamic Programming for Edit Distance Dynamic programming: A tabular computation of D(n,m) Solving problems by combining solutions to subproblems. Bottom-up We compute D(i,j) for small i,j And compute larger D(i,j) based on previously computed smaller values i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)

  12. Defining Min Edit Distance (Levenshtein) Initialization D(i,0) = i D(0,j) = j Recurrence Relation: For each i = 1 M For each j = 1 N D(i-1,j) + 1 D(i,j)= min D(i,j-1) + 1 D(i-1,j-1) + 2; if X(i) Y(j) 0; if X(i) = Y(j) Termination: D(N,M) is distance

  13. The Edit Distance Table N O I 9 8 7 T N E T N I # 6 5 4 3 2 1 0 # 1 E 2 X 3 E 4 C 5 U 6 T 7 I 8 O 9 N

  14. The Edit Distance Table N O I 9 8 7 T N E T N I # 6 5 4 3 2 1 0 # 1 E 2 X 3 E 4 C 5 U 6 T 7 I 8 O 9 N

  15. Edit Distance N O I 9 8 7 T N E T N I # 6 5 4 3 2 1 0 # 1 E 2 X 3 E 4 C 5 U 6 T 7 I 8 O 9 N

  16. The Edit Distance Table N O I T N E T N I # 9 8 7 6 5 4 3 2 1 0 # 8 7 6 5 4 3 4 3 2 1 E 9 8 7 6 5 4 5 4 3 2 X 10 9 8 7 6 5 6 5 4 3 E 11 10 9 8 7 6 7 6 5 4 C 12 11 10 9 8 7 8 7 6 5 U 11 10 9 8 9 8 7 8 7 6 T 10 9 8 9 10 9 8 7 6 7 I 9 8 9 10 11 10 9 8 7 8 O 8 9 10 11 10 9 8 7 8 9 N

  17. Computing alignments Edit distance isn t sufficient We often need to align each character of the two strings to each other We do this by keeping a backtrace Every time we enter a cell, remember where we came from When we reach the end, Trace back the path from the upper right corner to read off the alignment

  18. Edit Distance N O I 9 8 7 T N E T N I # 6 5 4 3 2 1 0 # 1 E 2 X 3 E 4 C 5 U 6 T 7 I 8 O 9 N

  19. MinEdit with Backtrace

  20. Adding Backtrace to Minimum Edit Distance Base conditions: Termination: D(i,0) = i D(0,j) = j D(N,M) is distance Recurrence Relation: For each i = 1 M For each j = 1 N deletion D(i-1,j) + 1 D(i,j)= min D(i,j-1) + 1 D(i-1,j-1) + 2; if X(i) Y(j) 0; if X(i) = Y(j) LEFT ptr(i,j)= DOWN DIAG insertion substitution insertion deletion substitution

  21. Performance Time: Space: Can be improved to O(n+m) using Hirschberg s algorithm O(nm) O(nm) Backtrace: O(n+m)

  22. Types of Edit Distance Metrics The Levenshtein distance allows deletion, insertion and substitution. The Longest common subsequence (LCS) distance allows only insertion and deletion, not substitution. The Hamming distance allows only substitution; hence, it only applies to strings of the same length. The Damerau Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters. The Jaro distance allows only transposition.

  23. Common Algorithm This doesn t include transposition how could we add this operation? From https://en.wikipedia.org/wiki/Edit_distance

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#