The Burrows-Wheeler Transform and Suffix Array

undefined

Data Compression

The Burrows-Wheeler Transform

The Suffix Array

Prop 1.

 All suffixes in SUF(T) having prefix P are contiguous.

P=si

T = mississippi

Prop 2.

 Starting position is the lexicographic one of P.

The big (unconscious) step...

The Burrows-Wheeler Transform

(1994)

Let us given a text

 T = mississippi

mississippi#

ississippi#

ssissippi#

mi

sissippi#

is

sippi#

issis

ippi#mississ

ppi#mississi

pi#mississip

i#mississipp

#mississippi

ssippi#

issi

issippi#

iss

A famous example

Much

longer...

Compressing L seems promising...

Key observation:

L is locally homogeneous



Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

mississipp

mississip

ippi

missis

issippi

mis

ississippi

ississippi

pi

mississi

ppi

mississ

sippi

missi

sissippi

mi

ssippi

miss

ssissippi

How to compute the BWT ?

L[3] = T[ 8 - 1 ]

We said that: L[i] precedes F[i] in T

mississipp

ississip

ppi#

mi

ssis

unknown

A useful tool:  L



 F mapping

ississip

The BWT is invertible

mississipp

ppi#

mi

ssis

unknown

Reconstruct T backward:

Several issues about efficiency in time and space

You find this in your Linux distribution

ississip

Decompress any substring

mississipp

ppi#

mi

ssis

unknown

You can reconstruct any substring

backward

IF

you know the row of

its last character

How

do we

 know where to start

? Keep sampled

 positions

Trade-off between space (n log n/S

bits) and decompression time of an L-

long substring (S+ L time) due to the

sampling step S

Generalised Rank-Select over the column L

Search is possible,

not in these lectures...

Slide Note

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Embed Share

Download

Explore the concept of the Burrows-Wheeler Transform and Suffix Array, key algorithms in data compression. Learn about the transformation process, suffix arrays, compression techniques like Bzip, and how to compute the BWT. Discover the significant role these methods play in efficient data compression.

aalbe Follow

Uploaded on Sep 24, 2024 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Data Compression The Burrows-Wheeler Transform

The Suffix Array Prop 1. All suffixes in SUF(T) having prefix P are contiguous. Prop 2. Starting position is the lexicographic one of P. 5 (N2) space T = mississippi# T = mississippi# SUF(T) SA 12 11 8 5 2 1 10 9 7 4 6 3 suffix pointer # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# P=si Suffix Array SA: (N log2N) bits Text T: N chars In practice, a total of 5N bytes

The big (unconscious) step...

The Burrows-Wheeler Transform (1994) Let us given a text T = mississippi# F L mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Sort the rows ssippi#missi m ississippi # T sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i

A famous example Much longer...

Compressing L seems promising... Key observation: L is locally homogeneous l L is highly compressible Algorithm Bzip : 1. Move-to-Front coding of L 2. Run-Length coding 3. Statistical coder Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

How to compute the BWT ? SA L BWT matrix We said that: L[i] precedes F[i] in T #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m ssissippi#m #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss i p s s m # p i s s i i 12 11 8 5 2 1 10 9 7 4 6 3 L[3] = T[ 8 - 1 ] Given SA and T, we have L[i] = T[SA[i]-1] This is one of the main reasons for the number of pubblications spurred in 94- 10 on Suffix Array construction

A useful tool: L F mapping F L unknown # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Can we map L s chars onto F s chars ? ... Need to distinguish equal chars... m ississippi # p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Take two equal L s chars Rotate rightward their rows Same relative order !! Rankchar(pos) and Selectchar(pos) are key operations nowadays

The BWT is invertible F L unknown Two key properties: # mississipp i i #mississip i ppi#missis p s 1. LF-array maps L s to F s chars 2. L[ i ] precedes F[ i ] in T i ssippi#mis i ssissippi# s m Reconstruct T backward: p i m ississippi # T = .... # i p p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Several issues about efficiency in time and space

You find this in your Linux distribution

The Burrows-Wheeler Transform and Suffix Array

Download Presentation

Presentation Transcript

Related

More Related Content