Understanding the Burrows-Wheeler Transform and Suffix Array
Explore the concept of the Burrows-Wheeler Transform and Suffix Array, key algorithms in data compression. Learn about the transformation process, suffix arrays, compression techniques like Bzip, and how to compute the BWT. Discover the significant role these methods play in efficient data compression.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Data Compression The Burrows-Wheeler Transform
The Suffix Array Prop 1. All suffixes in SUF(T) having prefix P are contiguous. Prop 2. Starting position is the lexicographic one of P. 5 (N2) space T = mississippi# T = mississippi# SUF(T) SA 12 11 8 5 2 1 10 9 7 4 6 3 suffix pointer # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# P=si Suffix Array SA: (N log2N) bits Text T: N chars In practice, a total of 5N bytes
The Burrows-Wheeler Transform (1994) Let us given a text T = mississippi# F L mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Sort the rows ssippi#missi m ississippi # T sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i
A famous example Much longer...
Compressing L seems promising... Key observation: L is locally homogeneous l L is highly compressible Algorithm Bzip : 1. Move-to-Front coding of L 2. Run-Length coding 3. Statistical coder Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
How to compute the BWT ? SA L BWT matrix We said that: L[i] precedes F[i] in T #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m ssissippi#m #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss i p s s m # p i s s i i 12 11 8 5 2 1 10 9 7 4 6 3 L[3] = T[ 8 - 1 ] Given SA and T, we have L[i] = T[SA[i]-1] This is one of the main reasons for the number of pubblications spurred in 94- 10 on Suffix Array construction
A useful tool: L F mapping F L unknown # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Can we map L s chars onto F s chars ? ... Need to distinguish equal chars... m ississippi # p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Take two equal L s chars Rotate rightward their rows Same relative order !! Rankchar(pos) and Selectchar(pos) are key operations nowadays
The BWT is invertible F L unknown Two key properties: # mississipp i i #mississip i ppi#missis p s 1. LF-array maps L s to F s chars 2. L[ i ] precedes F[ i ] in T i ssippi#mis i ssissippi# s m Reconstruct T backward: p i m ississippi # T = .... # i p p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Several issues about efficiency in time and space