Key Features of Data Compression Codes

Reverse Multi-Delimiter Compression Codes
Igor Zavadskyi, Anatoly Anisimov
Taras Shevchenko National University of Kyiv, Ukraine
Computer science department
Data compression codes key features
: 
compression rate
1.
 Arithmetic
2.
 Asymmetric numeration systems
3.
 Huffman
4.
 Multi-delimiter
5.
 Fibonacci
6.
 Byte-aligned
 
RPBC
7.
 Byte-aligned
 
SCDC
Codes with delimiters
Key features
: 
synchronization
Arithmetic code, ANS and Huffman codes are not synchronizable
Codes with delimiters are synchronizable
Key features
: 
decoding speed
RPBC
Byte-aligned algorithms for multi-delimiter codes
Byte-aligned algorithms for Fibonacci codes
Huffman codes
, 
ANS
Arithmetic encoding
SCDC
Key features
: 
search in compressed file
 
efficient
We have developed a new class of variable length prefix codes, so called (; k)-codes, which can be used for efficient data compression.
Huffman, arithmetic, ANS, RPBC
SCDC, Fibonacci, multi-delimiter
+
Natural language text compression: Fibonacci codes
Fib3
Fib2
Fib4
Fib
m
 is the pattern code
 
with delimiter
 
1…1
m
Delimiters
Delimiter
    
Strings
Fibonacci codes
Fibonacci codes
Multi-delimiter codes
Multi-delimiter codes
Delimiter
    
Strings
111
   
1111
  
      
11111
 
        
11111
 
0110
 
+
+
+
+
+
+
 
,
 
D
2,
3
 
s
 
01110
011
1
10
011
11
10
Multi-delimiter code: definition
Let
 
m
1
 ,… 
,
 
m
t
 
be a sequence of integers
, 0 < 
m
1
 
<…< 
m
t
The code                       contains:
    - codewords of a form          1…10, … , 1…10
    - codewords of a form 
     
α
01…10, … , 
α
01…10
(type 1)
(type 2)
m
1
m
t
m
1
m
t
A type 2 codeword contains the delimiter only as a suffix and does
not contain any of type 1 codewords as a prefix.
Completeness
Universality
Synchronizability with the delay 1
Direct search in compressed file
     etc.
Code
 
D
2,3
 (delimiters 0110, 01110) – example
Codewords
of length
 ≤7
Multi-delimiter codes
 
vs Fibonacci codes
The codes with the shortest codeword of length 3
The codes with the shortest codeword of length 2
The codes with the shortest codeword of length 4
Code
Asymptotic
density
Number of codewords of length ≤ 
n
A.V. Anisimov, I.O. Zavadskyi
,
 
Variable-Length Prefix Codes With Multiple Delimiters
,
IEEE Transactions on Information Theory
, 
vol.
 63, 
no.
 5, pp. 2885–2895, 2017
.
Main problem: non-monotonousness
1
2
4
3
8
5
Dict[1] = the
Dict[2] = and
Dict[4] = of
Dict[3] = to
Dict[8] = that
Dict[5] = in
110
0110
00110
10110
000110
010110
00001110110
92784
Dict[
92784
] = Abhor
NULL
NULL
The reverse codes – key idea
 
011
 
0111010
 
0101110
 
0110001010
 
110
 
0101000110
Write all codewords from right to left.
This way we obtain a non-prefix but uniquely decodable code
with a 
simple principle of monotonous codeword set construction
.
011
0101110
0101000110
Direct code:
Reverse code:
Codeword set construction
 
— R
2,4 
, K={ 0, 1, 3, 5 }
L=3
 
011
L=4
 
011
 
0
L=5
 
011
 
0110
 
0
 
01
 
01100
01101
01111
 
01111
L=6
 
0
0
0
 
0110
 
01
L=
7
 
011000
011010
011110
011001
 
0
0
0
0
 
01100
01101
01111
 
01
01
01
 
011
 
0111
1. Replicate the sets of codewords of lengths L−k−1 
    and append the sequences of the form 01
k
, k 
 K.
 
2. Replicate the set of words of the length L−1 with 
    a suffix 01
r
, r > m
t 
, and append a single “1” bit 
    to all elements of this set. 
3. If L = m + 1, m 
 {m
1
,...,m
t
}, append the word 01
m
 
    
to the codeword set.
 
011011111
 
1
R
2–∞
 decoding
Automaton
Algorithm
input
: 
encoded
 
Text
output
:
 
dictionary indices
Decoding time comparison, milliseconds
R
2–∞
 improved decoding algorithm (idea)
Pack TAB
j
[Text[i]] into
32-bit word
Get p1,…,p4 via bit-level
operations
Improved decoding timing, milliseconds
Empirical comparison of codes compression rate
Conclusions
The reverse multi-delimiter codes are the first compression codes
that have all the following properties:
Operate quite close to entropy (2–3% away)
Allow us to search the data in a compressed file without its decompression
Provide fast decoding on a level with byte-aligned codes
Are synchronizable with the delay 1
These codes point the way towards information retrieval systems of
a new type, which can operate with compressed data directly,
performing the decompression only on the fly.
Thank You
!
Igor Zavadskyi
i
h
o
r
z
a
@
g
m
a
i
l
.
c
o
m
Anatoly Anisimov
a
v
a
.
l
e
c
t
u
r
e
s
@
g
m
a
i
l
.
c
o
m
Slide Note
Embed
Share

In-depth exploration of data compression codes including multi-delimiter, Fibonacci, and Huffman codes. Discusses synchronization, decoding speed, and search efficiency in compressed files. Also covers the use of variable-length prefix codes for efficient data compression techniques.

  • Data Compression
  • Coding Techniques
  • Information Theory
  • Data Science
  • Computer Science

Uploaded on Feb 17, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Reverse Multi Reverse Multi- -Delimiter Compression Codes Delimiter Compression Codes Igor Igor Zavadskyi Zavadskyi, Anatoly Anisimov , Anatoly Anisimov Taras Shevchenko National University of Kyiv, Ukraine Computer science department

  2. Data compression codes key features: compression rate 1. Arithmetic 2. Asymmetric numeration systems 3. Huffman 4. Multi-delimiter 5. Fibonacci 6. Byte-aligned RPBC 7. Byte-aligned SCDC Codes with delimiters

  3. Key features: synchronization Arithmetic code, ANS and Huffman codes are not synchronizable error Codes with delimiters are synchronizable error

  4. Key features: decoding speed 1 RPBC Byte-aligned algorithms for multi-delimiter codes 2 SCDC 3 Byte-aligned algorithms for Fibonacci codes 4 Huffman codes, ANS 5 Arithmetic encoding 6

  5. Key features: search in compressed file W e h a v e d e v e l o p e d a n e w c l a s s o f v a r i a b l e l e n g t h p r e f i x c o d e s , s o c a l l e d ( ; k ) - c o d e s , w h i c h c a n b e u s e d f o r e f f i c i e n t d a t a c o m p r e s s i o n . efficient Huffman, arithmetic, ANS, RPBC SCDC, Fibonacci, multi-delimiter +

  6. Natural language text compression: Fibonacci codes Fib3 ? Fib3 Fib2 Fib4 Fib4 Fib2 Fibm is the pattern code with delimiter 1 1 m

  7. Delimiters Fibonacci codes Delimiter 111 Strings 11111 1111 11111 Multi-delimiter codes Delimiter 0110 , D2,3 Strings s + + + 01110 011110 0111110

  8. Multi-delimiter code: definition Let m1, , mtbe a sequence of integers, 0 < m1< < mt 1, , m D The code contains: m t (type 1) - codewords of a form 1 10, , 1 10 m1 mt (type 2) - codewords of a form 01 10, , 01 10 m1 mt A type 2 codeword contains the delimiter only as a suffix and does not contain any of type 1 codewords as a prefix.

  9. Multi-delimiter codes properties Completeness Universality Synchronizability with the delay 1 Direct search in compressed file etc.

  10. Code D2,3(delimiters 0110, 01110) example codewords 1101110010110110 delimiters

  11. Codewords of length 7

  12. Multi-delimiter codes vs Fibonacci codes Asymptotic density Number of codewords of length n Code The codes with the shortest codeword of length 2 The codes with the shortest codeword of length 3 The codes with the shortest codeword of length 4

  13. The flexible family of codes Number of short codewords Asymptotic density Longer delimiters More delimiters

  14. Publication A.V. Anisimov, I.O. Zavadskyi, Variable-Length Prefix Codes With Multiple Delimiters, IEEE Transactions on Information Theory, vol. 63, no. 5, pp. 2885 2895, 2017.

  15. Main problem: non-monotonousness 110 0110 00110 10110 000110 010110 NULL 1 2 4 3 8 5 Dict[1] = the Dict[2] = and Dict[4] = of Dict[3] = to Dict[8] = that Dict[5] = in 92784 Dict[92784] = Abhor 00001110110 NULL

  16. The reverse codes key idea Write all codewords from right to left. 0101110 011 0101000110 Direct code: 0110001010 0101000110 011 110 0111010 0101110 Reverse code: This way we obtain a non-prefix but uniquely decodable code with a simple principle of monotonous codeword set construction.

  17. Codeword set construction R2,4 , K={ 0, 1, 3, 5 } L=3 L=4 L=5 L=6 1. Replicate the sets of codewords of lengths L k 1 and append the sequences of the form 01k, k K. L=7 011 011 011 011 0110 0110 0 0 011000 011010 011110 011001 01100 01101 01111 01111 01111 0 0 0 0 0 0 0 01100 01101 2. Replicate the set of words of the length L 1 with a suffix 01r, r > mt , and append a single 1 bit to all elements of this set. 01 01 3. If L = m + 1, m {m1,...,mt}, append the word 01m to the codeword set. 01 01 01 0111 L=9 L=10 011011111 011011111 1

  18. R2decoding Automaton Algorithm input: encoded Text output: dictionary indices

  19. Decoding time comparison, milliseconds

  20. R2improved decoding algorithm (idea) Pack TABj[Text[i]] into 32-bit word Get p1, ,p4 via bit-level operations

  21. Improved decoding timing, milliseconds

  22. Empirical comparison of codes compression rate

  23. Conclusions The reverse multi-delimiter codes are the first compression codes that have all the following properties: Operate quite close to entropy (2 3% away) Allow us to search the data in a compressed file without its decompression Provide fast decoding on a level with byte-aligned codes Are synchronizable with the delay 1 These codes point the way towards information retrieval systems of a new type, which can operate with compressed data directly, performing the decompression only on the fly.

  24. Thank You Thank You! ! Igor Igor Zavadskyi Zavadskyi ihorza@gmail.com ihorza@gmail.com Anatoly Anisimov Anatoly Anisimov ava.lectures@gmail.com ava.lectures@gmail.com

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#