
Advanced Techniques for Computing Edit Distance and Embedding Methods
Explore streaming algorithms and randomized embedding techniques for edit distance computation, including methods for handling bit flips/symbol changes, insertions, deletions, and more. Discover how to compute edit distance efficiently with full or streaming access and in different locations. Dive into the concept of embedding edit distance into 1 distance and utilizing Hamming distance in algorithmic computations.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Streaming algorithms for edit distance Michal Kouck Charles U., Prague Joint work with: Diptarka Chakraborty Charles U., Prague Elazar Goldenberg The Academic College of Tel Aviv-Yaffo
Edit distance ? a b c z d e f w h i k l m ? a b c d e f g h i j k l m Edit distance ????(?,?): the number of that take ? into ?. 1) bit flips/symbol changes 2) insertions, and 3) deletions
Main question How do you compute edit distance when: We have full access to ? and ?. We have streaming access to ? and ?. ? and ? are in different locations.
Our results Randomized embedding of edit metric into Hamming metric. Streaming algorithm computing exact edit distance: one pass over ? and ? (synchronous model) uses space O(?), where ? = ????(?,?) uses time O(? + ? ?3/2) Few other edit distance algorithms in various settings.
Embedding edit into 1 distance Embedding Bar-Yossef-Jayram-Krauthgamer-Kumar 04 Ostrovsky-Rabani 07 Cormode-Muthukrishnan 02 distortion ?(?2/3) 2?( log ? log log ? ) ?(log ? log ?) (with moves) Lower bounds Andoni-Deza-Gupta-Indyk-Raskhodnikova 03 Knot-Naor 05 Krauthgamer-Rabani 09 3/2 ( log ?1/2 ?(1)) (log ?) Randomized embedding Jowhari 12 ?(log ? 2log ?) *
Hamming distance ? a b c z d e f i j k l m n ? a b c d e f z i j k l m n Hamming distance ???????(?,?): the number of bit flips that take ? into ?.
Randomized embedding of edit distance Hamming distance ?: 0,1? 0,1? 0,13? such that for any ? and ? 0,1? 1 2 ?????,? ???????(?(?,?),? ?,? ) O( ????(?,?)2) with probability 2/3 over a random choice of ?. distortion is independent of ?.
Algorithm for embedding ? Input:? 0,1? and random bits ? 0,1?. Interpret ? as hash functions 1, 2, 3?:{0,1} {0,1}. ? 1 For ? 1 to 3? do 1. If ? ? then Output ?? ? ? + ?(??) 2. Else Output 0 ? 0 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1
Why it works ? a b c z d e f g h i k l m a a b c c z z d d e f l m a a b c c d e e f f f l m ? a b c d e f g h i j k l m
Synchronization The two pointers into ? and ? behave like a random walk on a line. With probability 2/3 they synchronize in O(?2) steps. Prob[synchronize in ? steps] > 1 12? ? But, the expected number of steps to synchronize is O(?).
Computing edit distance Computing edit distance Wagner-Fischer 74 Masek-Paterson 80 Landau-Myers-Schmidt 98 O(?2) O O(? + ?2) (?2/log? ) Streaming (synchronous) ours Belazzougui-Zhang ours O(? + ?6) O(? + ?2)O(??2) (sequential stream) O(? + ?2), O(? + ??3/2) Approximating edit distance O(?) O O(?) ?-approx log1+?? -approx ?-approx Landau-Myers-Schmidt 98 Andoni-Krauthgamer-Onak 10 Saha 14 (streaming) (?1+?) SETH (?2 ?) Backus-Indyk 14
Streaming algorithm for edit distance Two steps: 1. Using our embedding procedure extract a kernel (? ,? )from ? and ?such that ????? ,? = ???? (?,?) and |? |,|? | ?6 2. Apply off-line O(? + ?2) algorithm for edit distance on the kernel.
Computing a kernel for ? and ? Two steps: 1. Remove too many repetitions from matched parts of ? and ?. 2. Shrink long rigid matched parts.
Dynamic pgming for edit distance ?1..? ? ?,? + [??+1 ??+1] ?1..? ? ?,? + 1 + 1 ?(? + 1,? + 1) ? ? + 1,? + 1 Edit matrix: ? ?,? = ???? (?1 ?,?1 ?)
Diagonals [?? ??] [??+ = ??+ ] [??+? ??+?] ??????,??,? = max ?>1 ?, ??+ = ??+ longest common extension
Diagonals ? ? [?? ??] [??+ = ??+ ] [??+? ??+?] ?(? ?) time algorithm Slide in ?(1) time ?(? + ?2) time algorithm [Ukkonen] [Landau et al.]
Streaming algorithm (idea) ?(?) Break the matrix in slices of ? ? rows & slide in ?(1) time ?(? + ?2) time algorithm [Belazzougui, Zhang]
Periodicity ?1 ? ? ? ?1..? ? ? Two diagonals sliding in parallel periodicity
Periodicity ?1 ? ? ? ?1..? ? = ? + ? ? ? Slide check: ??+1= ??+1+ ?????? ??+1= ??+?+1 (periodicity) (equality)
Multiple diagonals ?(? + ? ?3/2)-time algorithm Conjecture: ?(? + ?2)-time
We would like to know Limits on randomized embedding of edit distance into Hamming distance? What is the complexity of our edit distance algorithm?