Understanding the Burrows-Wheeler Transform and Suffix Array

Slide Note
Embed
Share

Explore the concept of the Burrows-Wheeler Transform and Suffix Array, key algorithms in data compression. Learn about the transformation process, suffix arrays, compression techniques like Bzip, and how to compute the BWT. Discover the significant role these methods play in efficient data compression.


Uploaded on Sep 24, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Data Compression The Burrows-Wheeler Transform

  2. The Suffix Array Prop 1. All suffixes in SUF(T) having prefix P are contiguous. Prop 2. Starting position is the lexicographic one of P. 5 (N2) space T = mississippi# T = mississippi# SUF(T) SA 12 11 8 5 2 1 10 9 7 4 6 3 suffix pointer # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# P=si Suffix Array SA: (N log2N) bits Text T: N chars In practice, a total of 5N bytes

  3. The big (unconscious) step...

  4. The Burrows-Wheeler Transform (1994) Let us given a text T = mississippi# F L mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Sort the rows ssippi#missi m ississippi # T sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i

  5. A famous example Much longer...

  6. Compressing L seems promising... Key observation: L is locally homogeneous l L is highly compressible Algorithm Bzip : 1. Move-to-Front coding of L 2. Run-Length coding 3. Statistical coder Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

  7. How to compute the BWT ? SA L BWT matrix We said that: L[i] precedes F[i] in T #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m ssissippi#m #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss i p s s m # p i s s i i 12 11 8 5 2 1 10 9 7 4 6 3 L[3] = T[ 8 - 1 ] Given SA and T, we have L[i] = T[SA[i]-1] This is one of the main reasons for the number of pubblications spurred in 94- 10 on Suffix Array construction

  8. A useful tool: L F mapping F L unknown # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Can we map L s chars onto F s chars ? ... Need to distinguish equal chars... m ississippi # p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Take two equal L s chars Rotate rightward their rows Same relative order !! Rankchar(pos) and Selectchar(pos) are key operations nowadays

  9. The BWT is invertible F L unknown Two key properties: # mississipp i i #mississip i ppi#missis p s 1. LF-array maps L s to F s chars 2. L[ i ] precedes F[ i ] in T i ssippi#mis i ssissippi# s m Reconstruct T backward: p i m ississippi # T = .... # i p p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Several issues about efficiency in time and space

  10. You find this in your Linux distribution

Related