Efficient Data Structures for Top-k Completion Queries

Slide Note
Embed
Share

Explore space-efficient data structures for top-k completion queries, particularly focusing on scored string sets and trie-based solutions. The research compares three solutions: RMQ Trie, Completion Trie, and Score-Decomposed Trie, emphasizing space efficiency and fast access for large sets of scored strings.


Uploaded on Sep 23, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Space-Efficient Data Structures for Top-k Completion Bo-June (Paul) Hsu Microsoft Research Giuseppe Ottaviano Universit di Pisa WWW 2013

  2. String auto-completion

  3. Scored string sets three trial triangle trie triple triply 2 1 9 5 4 3 Top-k Completion query: Given prefix p, return k strings prefixed by p with highest scores Example: p= tr , k=2 (triangle, 9), (trie, 5)

  4. Space-Efficiency Scored string sets can be very large Hundreds of millions of queries for web search auto- suggest Must fit in RAM for fast access Need space-efficient solutions! We compare three solutions RMQ Trie, based on Range Minimum Queries Completion Trie, based on a modified trie with variable-sized pointers Score-Decomposed Trie, based on succinct data structures

  5. RMQ TRIE (RT)

  6. RMQ Trie three trial triangle trie triple triply 2 1 9 5 4 3 Lexicographic order strings starting with given prefix in a contiguous range If we can find the max in a range, it is top-1 Range is split in two subranges, can proceed recursively using a heap to retrieve top-k

  7. RMQ Trie To store the strings we can use any data structure that keeps the strings sorted We use a compressed trie To find max score in a range we use a succinct Range Minimum Query (RMQ) data structure Needs only 2.6 additional bits per score, answers queries in O(log n) time This is a standard strategy, but not very fast. We use it as a baseline.

  8. COMPLETION TRIE (CT)

  9. (Scored) compacted tries Node label Branching character t h r three trial triangle 9 trie triple triply 2 1 ree i 2 e a p 5 4 3 l y 5 e n l gle 4 1 3 9

  10. Completion Trie t 9 h r three trial triangle 9 trie triple triply 2 1 ree i 9 2 e a p 5 4 3 l y 4 5 9 e n l gle 4 1 3 9

  11. Completion Trie t 9 t 9 h r hree 2 ri 9 ree i 9 2 e a p a 9 e 5 pl 4 l y 4 5 9 e n l ngle 9 l 1 e 4 y 3 gle 4 1 3 9

  12. Completion Trie t 9 ri 9 e 5 pl 4 hree 2 a 9 ngle 9 l 1 e 4 y 3 Scores encoded differentially (either from parent or previous sibling) Pointers and score deltas encoded with variable bytes All node information in the same stream, favoring cache-efficiency

  13. SCORE-DECOMPOSED TRIE (SDT)

  14. Trees as balanced parentheses (()(()()())) () (()()()) () () () 2n bits are sufficient (and necessary) to represent a tree Can support O(1) operations with 2n + o(n) bits

  15. Score-decomposed trie Builds on compressed path-decomposed tries [Grossi-Ottaviano ALENEX 2012] Parentheses-based representation of trees Dictionary-compression of node labels

  16. Score-decomposed trie 9 t triangle h r h,2 e,5 p,4 l,1 ree i 2 e a p L : t1ri2a1ngle BP: ( ((( ) B : h epl R : 2 541 l y 5 e n l gle 4 1 3 9

  17. three trial triangle 9 trie triple triply 2 1 5 4 3

  18. Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Data structure to store scores in RT and SDT Packed-blocks array Folklore data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta, Divide the array into fixed-size blocks Encode the values of each block with the same number of bits Store separately the block offsets

  19. Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Can be unlucky Each block may contain a large value But scores are power-law distributed Also, tree-wise monotone sorting On average, 4 bits per score

  20. Space Dataset gzip CT SDT RT QueriesA 57% 30% 31% 27% QueriesB 25% 48% 26% 27% URLs 24% 57% 26% 27% Unigrams 39% 43% 35% 37%

  21. Space 30 Raw 25 Bytes / String CT 20 RT 15 SDT 10 GZip 5 1 1K 1M # Strings

  22. Time 8 6 Time ( s) RT SDT CT 4 2 0 1 1K 1M # Strings Time per returned completion on a top-10 query

  23. Thanks for your attention! Questions?

Related