Understanding the Burrows-Wheeler Transform and Suffix Array

Slide Note

Explore the concept of the Burrows-Wheeler Transform and Suffix Array, key algorithms in data compression. Learn about the transformation process, suffix arrays, compression techniques like Bzip, and how to compute the BWT. Discover the significant role these methods play in efficient data compression.

aalbe Follow

Uploaded on Sep 24, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Data Compression The Burrows-Wheeler Transform

The Suffix Array Prop 1. All suffixes in SUF(T) having prefix P are contiguous. Prop 2. Starting position is the lexicographic one of P. 5 (N2) space T = mississippi# T = mississippi# SUF(T) SA 12 11 8 5 2 1 10 9 7 4 6 3 suffix pointer # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# P=si Suffix Array SA: (N log2N) bits Text T: N chars In practice, a total of 5N bytes

The big (unconscious) step...

The Burrows-Wheeler Transform (1994) Let us given a text T = mississippi# F L mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Sort the rows ssippi#missi m ississippi # T sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i

A famous example Much longer...

Compressing L seems promising... Key observation: L is locally homogeneous l L is highly compressible Algorithm Bzip : 1. Move-to-Front coding of L 2. Run-Length coding 3. Statistical coder Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

How to compute the BWT ? SA L BWT matrix We said that: L[i] precedes F[i] in T #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m ssissippi#m #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss i p s s m # p i s s i i 12 11 8 5 2 1 10 9 7 4 6 3 L[3] = T[ 8 - 1 ] Given SA and T, we have L[i] = T[SA[i]-1] This is one of the main reasons for the number of pubblications spurred in 94- 10 on Suffix Array construction

A useful tool: L F mapping F L unknown # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Can we map L s chars onto F s chars ? ... Need to distinguish equal chars... m ississippi # p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Take two equal L s chars Rotate rightward their rows Same relative order !! Rankchar(pos) and Selectchar(pos) are key operations nowadays

The BWT is invertible F L unknown Two key properties: # mississipp i i #mississip i ppi#missis p s 1. LF-array maps L s to F s chars 2. L[ i ] precedes F[ i ] in T i ssippi#mis i ssissippi# s m Reconstruct T backward: p i m ississippi # T = .... # i p p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Several issues about efficiency in time and space

You find this in your Linux distribution

Understanding the Burrows-Wheeler Transform and Suffix Array

Download Presentation

Presentation Transcript

Related

More Related Content