The Burrows-Wheeler Transform and Suffix Array

undefined
Data Compression
The Burrows-Wheeler Transform
The Suffix Array
Prop 1.
 All suffixes in SUF(T) having prefix P are contiguous.
 
P=si
T = mississippi
#
 
Prop 2.
 Starting position is the lexicographic one of P.
The big (unconscious) step...
The Burrows-Wheeler Transform   
(1994)
Let us given a text
 T = mississippi
#
 
mississippi#
 
ississippi#
m
 
ssissippi#
mi
 
sissippi#
m
is
 
sippi#
m
issis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
 
ssippi#
m
issi
 
issippi#
m
iss
 
F
 
L
A famous example
Much
longer...
Compressing L seems promising...
 
Key observation:
l
L is locally homogeneous
 
Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
 
#
mississipp
i
#
mississip
ippi
#
missis
issippi
#
mis
ississippi
#
m
ississippi
pi
#
mississi
ppi
#
mississ
sippi
#
missi
sissippi
#
mi
ssippi
#
miss
ssissippi
#
m
How to compute the BWT ?
 
L[3] = T[ 8 - 1 ]
 
We said that: L[i] precedes F[i] in T
#  
mississipp
  i
i  
#
m
ississip
  
p
i  
ppi#
mi
ssis
  
s
F
L
unknown
A useful tool:  L 
 F mapping
i  
#
m
ississip
  
p
The BWT is invertible
#  
mississipp
  i
i  
ppi#
mi
ssis
  
s
F
L
unknown
 
Reconstruct T backward:
Several issues about efficiency in time and space
You find this in your Linux distribution
 
i  
#
m
ississip
  
p
Decompress any substring
#  
mississipp
  i
i  
ppi#
mi
ssis
  
s
F
L
unknown
 
You can reconstruct any substring
backward 
IF 
you know the row of
its last character
How 
do we
 know where to start
 
? Keep sampled
 positions
 
Trade-off between space (n log n/S
bits) and decompression time of an L-
long substring (S+ L time) due to the
sampling step S
Generalised Rank-Select over the column L
Search is possible, 
not in these lectures...
Slide Note

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Embed
Share

Explore the concept of the Burrows-Wheeler Transform and Suffix Array, key algorithms in data compression. Learn about the transformation process, suffix arrays, compression techniques like Bzip, and how to compute the BWT. Discover the significant role these methods play in efficient data compression.

  • Data Compression
  • Burrows-Wheeler Transform
  • Suffix Array
  • Bzip Algorithm
  • Compression Techniques

Uploaded on Sep 24, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Data Compression The Burrows-Wheeler Transform

  2. The Suffix Array Prop 1. All suffixes in SUF(T) having prefix P are contiguous. Prop 2. Starting position is the lexicographic one of P. 5 (N2) space T = mississippi# T = mississippi# SUF(T) SA 12 11 8 5 2 1 10 9 7 4 6 3 suffix pointer # i# ippi# issippi# ississippi# mississippi# pi# ppi# sippi# sissippi# ssippi# ssissippi# P=si Suffix Array SA: (N log2N) bits Text T: N chars In practice, a total of 5N bytes

  3. The big (unconscious) step...

  4. The Burrows-Wheeler Transform (1994) Let us given a text T = mississippi# F L mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Sort the rows ssippi#missi m ississippi # T sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi p i#mississi p p pi#mississ i s ippi#missi s s issippi#mi s s sippi#miss i s sissippi#m i

  5. A famous example Much longer...

  6. Compressing L seems promising... Key observation: L is locally homogeneous l L is highly compressible Algorithm Bzip : 1. Move-to-Front coding of L 2. Run-Length coding 3. Statistical coder Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !

  7. How to compute the BWT ? SA L BWT matrix We said that: L[i] precedes F[i] in T #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m ssissippi#m #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss i p s s m # p i s s i i 12 11 8 5 2 1 10 9 7 4 6 3 L[3] = T[ 8 - 1 ] Given SA and T, we have L[i] = T[SA[i]-1] This is one of the main reasons for the number of pubblications spurred in 94- 10 on Suffix Array construction

  8. A useful tool: L F mapping F L unknown # mississipp i #mississip i ppi#missis i p s i ssippi#mis i ssissippi# s m Can we map L s chars onto F s chars ? ... Need to distinguish equal chars... m ississippi # p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Take two equal L s chars Rotate rightward their rows Same relative order !! Rankchar(pos) and Selectchar(pos) are key operations nowadays

  9. The BWT is invertible F L unknown Two key properties: # mississipp i i #mississip i ppi#missis p s 1. LF-array maps L s to F s chars 2. L[ i ] precedes F[ i ] in T i ssippi#mis i ssissippi# s m Reconstruct T backward: p i m ississippi # T = .... # i p p i#mississi p pi#mississ s ippi#missi s issippi#mi s sippi#miss s sissippi#m p i s s i i Several issues about efficiency in time and space

  10. You find this in your Linux distribution

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#