Advanced Compression Techniques in Unordered Integer Sequences

Slide Note

Presenting innovative methods for compressing and accessing unordered integer sequences efficiently. Explore fast element extraction and direct addressable variable-length codes to optimize memory usage and enhance data handling. Cutting-edge research from top universities and workshops is showcased, offering insights into storing, accessing, and manipulating integer sequences for various applications.

arl_mcc Follow

Uploaded on Sep 26, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Compressed unordered integer sequences Compressed unordered integer sequences with fast direct access with fast direct access Igor Igor Zavadskyi Zavadskyi Taras Shevchenko National University of Kyiv, Ukraine Computer science department

Fast element extraction from unordered integer sequence If integers are arranged in ascending order and deltas between them are small enough, the problem is reduced to performing the select operation on a bitmap, where ?-th bit is set if the number ? belongs to the sequence. 1 5 6 9 17 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 But what if integers are not sorted?

K. Fredriksson and F. Nikitin, Simple compression code supporting random access and fast string matching, in IEEA Workshop, 2007, pp. 203 216. Simple Dense Coding vs RMD-codes SDC Auxiliary bitstream 1 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 0 Main bitstream 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1 0 I. Zavadskyi and A. Anisimov, Reverse multi- delimiter compression codes, in DCC, 2020, pp. 173 182. RMD 0 1 1 0 1 1 0 0 1 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0

Main algorithms idea 1. Store the absolute positions of Level 1 blocks of codewords, L1[i]. 2. Get the relative Level 2 block position using a linear approximation, L1[i]+j*k[i]+ ??. 3. Get to the codeword searching the Level 2 block sequentially. 0 1 1 0 1 1 0 0 1 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 ?2 = 26~28codewords ?1 = 214~217codewords The most memory is occupied by ??. We allocate different number of bits for -values in different Level 1 blocks.

N. R. Brisaboa, S. Ladra, and G. Navarro, Directly addressable variable-length codes, in SPIRE, 2009, pp. 122 130. Directly Addressable variable-length Codes Chunk 3 Chunk 2 Chunk 1 ?1= 4 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 ?2= 258 0 0 1 0 0 1 1 1 ?3= 39 Getting x2 0 1 0 0 0 0 1 0 0 1 1 1 ?1= ?12 = 0010; ? = ???? ?1,2 = 1; ?2= ?2? = 0000; ? = ???? ?2,? = 1; ?3= ?3? = 0001; ?2= ?3&?2&?1= 0001 0000 0010. ?1= ?1= ?2= ?2= ?3= 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1

Experiments Have been provided on integer sequence obtained by applying either word-based or character-based frequency compression to 200MB English text from Pizza&Chili corpus. 350 500 450 300 400 250 350 Access time, ns Access time, ns 300 200 250 150 200 100 150 100 50 50 0 0 0 5 10 15 20 25 30 0 10 20 30 40 50 60 Excess over entropy, % Excess over entropy, % Word-based Character-based