Efficient Data Structures for Top-k Completion Queries

 
Space-Efficient Data Structures
for Top-k Completion
 
Giuseppe Ottaviano
Università di Pisa
 
Bo-June (Paul) Hsu
Microsoft Research
 
WWW 2013
String auto-completion
Scored string sets
 
Top-k Completion query:
Given prefix p, return k strings prefixed by p with highest
scores
Example: p=“tr”, k=2
(triangle, 9), (trie, 5)
Space-Efficiency
 
Scored string sets can be very large
Hundreds of millions of queries for web search auto-
suggest
Must fit in RAM for fast access
Need space-efficient solutions!
We compare three solutions
RMQ Trie, based on Range Minimum Queries
Completion Trie, based on a modified trie with
variable-sized pointers
Score-Decomposed Trie, based on succinct data
structures
 
RMQ TRIE (RT)
 
RMQ Trie
 
Lexicographic order → strings starting with given
prefix in a contiguous range
If we can find the max in a range, it is top-1
Range is split in two subranges, can proceed
recursively using a heap to retrieve top-k
RMQ Trie
 
To store the strings we can use any data
structure that keeps the strings sorted
We use a compressed trie
To find max score in a range we use a succinct
Range Minimum Query (RMQ) data structure
Needs only 2.6 additional bits per score, answers
queries in O(log n) time
This is a standard strategy, but not very fast.
We use it as a baseline.
 
COMPLETION TRIE (CT)
 
t
i
ree
ε
ε
l
ε
ε
ε
gle
h
r
e
p
a
l
n
(Scored) compacted tries
y
e
t
i
ree
ε
ε
l
ε
ε
ε
gle
h
r
e
p
a
l
n
Completion Trie
y
e
 
Completion Trie
t
i
ree
ε
ε
l
ε
ε
ε
gle
 
h
 
r
 
e
 
p
 
a
 
l
 
n
 
y
 
e
 
Completion Trie
 
Scores encoded differentially (either from
parent or previous sibling)
Pointers and score deltas encoded with
variable bytes
All node information in the same stream,
favoring cache-efficiency
 
SCORE-DECOMPOSED TRIE (SDT)
 
Trees as balanced parentheses
 
(()()())
 
(()(()()()))
 
2n bits are sufficient (and necessary) to represent a tree
Can support O(1) operations with 2n + o(n) bits
 
Score-decomposed trie
 
Builds on compressed path-decomposed tries
[Grossi-Ottaviano ALENEX 2012]
Parentheses-based representation of trees
Dictionary-compression of node labels
Score-decomposed trie
t
i
ree
ε
ε
l
ε
ε
ε
gle
h
r
e
p
a
l
n
y
e
2
1
4
3
5
9
 
L : 
t
1
ri
2
a
1
ngle
BP:  (  (((    )
B :  h  epl
R :  2  541
Score compression
... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 
 
Data structure to store scores in RT and SDT
Packed-blocks array
“Folklore” data structure, similar to many existing packed
arrays, Frame-Of-Reference, PFORDelta,…
Divide the array into fixed-size blocks
Encode the values of each block with the same number
of bits
Store separately the block offsets
 
Score compression
 
... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ...
 
Can be unlucky
Each block may contain a large value
But scores are power-law distributed
Also, tree-wise monotone sorting
On average, 4 bits per score
 
Space
 
Space
 
Time
 
Time per returned completion
on a top-10 query
 
Thanks for your attention!
 
Questions?
Slide Note
Embed
Share

Explore space-efficient data structures for top-k completion queries, particularly focusing on scored string sets and trie-based solutions. The research compares three solutions: RMQ Trie, Completion Trie, and Score-Decomposed Trie, emphasizing space efficiency and fast access for large sets of scored strings.

  • Data Structures
  • Trie
  • Completion
  • Space Efficiency
  • Query

Uploaded on Sep 23, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Space-Efficient Data Structures for Top-k Completion Bo-June (Paul) Hsu Microsoft Research Giuseppe Ottaviano Universit di Pisa WWW 2013

  2. String auto-completion

  3. Scored string sets three trial triangle trie triple triply 2 1 9 5 4 3 Top-k Completion query: Given prefix p, return k strings prefixed by p with highest scores Example: p= tr , k=2 (triangle, 9), (trie, 5)

  4. Space-Efficiency Scored string sets can be very large Hundreds of millions of queries for web search auto- suggest Must fit in RAM for fast access Need space-efficient solutions! We compare three solutions RMQ Trie, based on Range Minimum Queries Completion Trie, based on a modified trie with variable-sized pointers Score-Decomposed Trie, based on succinct data structures

  5. RMQ TRIE (RT)

  6. RMQ Trie three trial triangle trie triple triply 2 1 9 5 4 3 Lexicographic order strings starting with given prefix in a contiguous range If we can find the max in a range, it is top-1 Range is split in two subranges, can proceed recursively using a heap to retrieve top-k

  7. RMQ Trie To store the strings we can use any data structure that keeps the strings sorted We use a compressed trie To find max score in a range we use a succinct Range Minimum Query (RMQ) data structure Needs only 2.6 additional bits per score, answers queries in O(log n) time This is a standard strategy, but not very fast. We use it as a baseline.

  8. COMPLETION TRIE (CT)

  9. (Scored) compacted tries Node label Branching character t h r three trial triangle 9 trie triple triply 2 1 ree i 2 e a p 5 4 3 l y 5 e n l gle 4 1 3 9

  10. Completion Trie t 9 h r three trial triangle 9 trie triple triply 2 1 ree i 9 2 e a p 5 4 3 l y 4 5 9 e n l gle 4 1 3 9

  11. Completion Trie t 9 t 9 h r hree 2 ri 9 ree i 9 2 e a p a 9 e 5 pl 4 l y 4 5 9 e n l ngle 9 l 1 e 4 y 3 gle 4 1 3 9

  12. Completion Trie t 9 ri 9 e 5 pl 4 hree 2 a 9 ngle 9 l 1 e 4 y 3 Scores encoded differentially (either from parent or previous sibling) Pointers and score deltas encoded with variable bytes All node information in the same stream, favoring cache-efficiency

  13. SCORE-DECOMPOSED TRIE (SDT)

  14. Trees as balanced parentheses (()(()()())) () (()()()) () () () 2n bits are sufficient (and necessary) to represent a tree Can support O(1) operations with 2n + o(n) bits

  15. Score-decomposed trie Builds on compressed path-decomposed tries [Grossi-Ottaviano ALENEX 2012] Parentheses-based representation of trees Dictionary-compression of node labels

  16. Score-decomposed trie 9 t triangle h r h,2 e,5 p,4 l,1 ree i 2 e a p L : t1ri2a1ngle BP: ( ((( ) B : h epl R : 2 541 l y 5 e n l gle 4 1 3 9

  17. three trial triangle 9 trie triple triply 2 1 5 4 3

  18. Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Data structure to store scores in RT and SDT Packed-blocks array Folklore data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta, Divide the array into fixed-size blocks Encode the values of each block with the same number of bits Store separately the block offsets

  19. Score compression ... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ... 3 bits/value 11 bits/value 16 bits/value Can be unlucky Each block may contain a large value But scores are power-law distributed Also, tree-wise monotone sorting On average, 4 bits per score

  20. Space Dataset gzip CT SDT RT QueriesA 57% 30% 31% 27% QueriesB 25% 48% 26% 27% URLs 24% 57% 26% 27% Unigrams 39% 43% 35% 37%

  21. Space 30 Raw 25 Bytes / String CT 20 RT 15 SDT 10 GZip 5 1 1K 1M # Strings

  22. Time 8 6 Time ( s) RT SDT CT 4 2 0 1 1K 1M # Strings Time per returned completion on a top-10 query

  23. Thanks for your attention! Questions?

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#