SIMD-Based Decoding of Posting Lists

SIMD-Based Decoding of Posting Lists
Slide Note
Embed
Share

SIMD-Based Decoding of Posting Lists by Alexander A. Stepanov, Anil R. Gangolli, Daniel E. Rose, Ryan J. Ernst, Paramjit Oberoi discusses efficient methods for decoding posting lists by replacing IDs with deltas, utilizing vbyte and Byte Aligned formats, and exploring byte-oriented encodings. The paper presents a taxonomy of byte-oriented encodings for length encoding and descriptor organization, providing insights into optimizing storage for integer data representation.

  • SIMD-Based Decoding
  • Posting Lists
  • Byte-Oriented Encoding
  • VByte Format
  • Descriptor
  • Encoding Dimensions

Uploaded on Feb 17, 2025 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CIKM 2011 Glasgow, UK SIMD-Based Decoding of Posting Lists Alexander A. Stepanov, Anil R. Gangolli, Daniel E. Rose, Ryan J. Ernst, Paramjit Oberoi A9.com 130 Lytton Ave. Palo Alto, CA 94301 USA

  2. Posting Lists 127 127 17351 17224 17352 1 17355 3 21121 3766 ant bee 32 32 17352 17320 100908 83556 cow 203404 203404 203410 6 203412 2 203478 66 203500 22 Replacing IDs with deltas gives smaller numbers, which can be stored in less space given appropriate encoding. 26 October 2011 Stepanov et al., CIKM 2011 2

  3. vbyte Format: data (7 bits) continuation bit (additional bytes as needed) increasing byte addresses Grossman s Byte Aligned (BA) Format: length (2 bits) data (6-30 bits) increasing byte addresses 26 October 2011 Stepanov et al., CIKM 2011 3

  4. Definition: Byte-Oriented Encoding 1. All significant bits of the natural binary representation are preserved. 2. Each byte contains bits from only one integer. 3. Data bits within a single byte of the encoding preserve the ordering they had in the original integer. 4. All data bits from a single integer precede all bits from the next integer. 26 October 2011 Stepanov et al., CIKM 2011 4

  5. Descriptors When does an integer end? Equivalent to knowing its length Encodings use auxiliary descriptor bits to represent the length 26 October 2011 Stepanov et al., CIKM 2011 5

  6. Dimensions of Encodings Descriptor can express length in binary or unary. Descriptor bits can be stored adjacent to each single integer, or descriptors of several integers can be grouped so that each byte contains either descriptor or data. If length of a single integer is expressed in unary, the bits of the unary representation may be packed contiguously or split across several bytes (as in vbyte). 26 October 2011 Stepanov et al., CIKM 2011 6

  7. A Taxonomy of Byte-Oriented Encodings Length Encoding Our Name Arrangement Names in the Literature varint-SU Split Unary v-byte, vbyte, VB, varint, VInt Packed Unary Group Unary Split Binary varint-PB Packed Binary BA, varint30 varint-GB Group Binary group varint, k-wise (k=4) null suppression 26 October 2011 Stepanov et al., CIKM 2011 7

  8. A Taxonomy of Byte-Oriented Encodings Length Encoding Our Name Arrangement Names in the Literature varint-SU Split Unary v-byte, vbyte, VB, varint, VInt varint-PU Packed Unary none (introduced here) varint-GU Group Unary none (introduced here) varint-SB Split Binary none (not useful) varint-PB Packed Binary BA, varint30 varint-GB Group Binary group varint, k-wise (k=4) null suppression 26 October 2011 Stepanov et al., CIKM 2011 8

  9. Definition: Byte-Preserving Encoding We call a format byte-preserving if each byte containing significant bits in the original integer appears without modification in the encoded form. Observe: Encoding omits leading 0-bytes Decoding reinserts them 26 October 2011 Stepanov et al., CIKM 2011 9

  10. Re-Inserting 0-bytes in Parallel 00 00 CC BB BB BB AA AA 00 00 00 00 00 00 00 CC 00 BB BB BB 00 00 AA AA 26 October 2011 Stepanov et al., CIKM 2011 10

  11. Format for SIMD Decoding Group descriptor bits from several encoded integers into a separate descriptorbyte Group data bytes into k-byte blocks Decode however many integers fit in this block 26 October 2011 Stepanov et al., CIKM 2011 11

  12. varint-GU 1 descriptor byte 8 data bytes increasing byte addresses Represent up to 8 variable-sized integers (as many as fit in 8 bytes) For each integer i, descriptor contains length(i)-1 in unary, separated by 0s Number of integers in block is number of zero bits in descriptor Example: Encode 4 integers 0xAAAA, 0xBBBBBB, 0xCC, 0xDDDDDDDD. Byte counts are 2, 3, 1, 4. Last integer doesn t fit in this block; pad with 0s. 11001101 0x00 0x00 0xCC 0xBB 0xBB 0xBB 0xAA 0xAA increasing byte addresses 26 October 2011 Stepanov et al., CIKM 2011 12

  13. Intel SIMD PSHUFB Instruction Permutes data bytes in parallel, with optional insertion of 0-bytes. Operation specified by a shuffle sequence Both data and shuffle sequences are stored in special registers (currently 16 bytes) 26 October 2011 Stepanov et al., CIKM 2011 13

  14. Decoding Using PSHUFB We pre-compute a table of 256 possible shuffle sequences Each descriptor uniquely identifies the arrangement and lengths of the integers So, we use descriptor to index into table 26 October 2011 Stepanov et al., CIKM 2011 14

  15. Generic Decoding Algorithm 1. Read a chunk of data and its corresponding descriptor. 2. Look up the appropriate shuffle sequence and offset from the table. 3. Perform the shuffle. 4. Write the result. 5. Advance the input and output pointers. 26 October 2011 Stepanov et al., CIKM 2011 15

  16. Results: Decoding Speed 1600 1400 millions of integers per second 1200 1000 varint-SU 800 varint-GB varint-GB SIMD 600 varint-GU SIMD 400 200 0 Wikipedia Reuters GOV2 26 October 2011 Stepanov et al., CIKM 2011 16

  17. Conclusions Taxonomy of byte-oriented formats clarifies relationships of existing formats and reveals new ones. SIMD provides significant performance gains for integer decoding. New format (varint-GU) outperforms others. 26 October 2011 Stepanov et al., CIKM 2011 17

More Related Content