Efficient Maintenance of Indexed String Sequences in Compressed Space
The Wavelet Trie, Prefix Operations, and Dynamic Sequences are explored in maintaining an indexed sequence of strings efficiently. Various operations like access, rank, select, and dynamic updates are discussed with the aim of storing strings in minimal space while supporting polylog operations effectively.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space Giuseppe Ottaviano* Roberto Grossi Universit di Pisa * Part of the work done while at Microsoft Research Cambridge
Disclaimer THIS HAS NOTHING TO DO WITH WAVELETS!
Indexed String Sequences (foo, bar, foobar, foo, bar, bar, foo) 0 1 2 3 4 5 6 Queries Access(i): access the i-th element Access(2) = foobar Rank(s, pos): count occurrences of s before pos Rank(bar, 5) = 2 Select(s, i): find the i-th occurrence of a s Select(foo, 2) = 6
Prefix operations (foo, bar, foobar, foo, bar, bar, foo) 0 1 2 3 4 5 6 Queries RankPrefix(p, pos): count strings prefixed by p before pos RankPrefix(foo, 5) = 3 SelectPrefix(p, i): find the i-th string prefixed by p SelectPrefix(foo, 2) = 3
Example: storing relations Write the columns as string sequences Store them separately Reduce relational operations to sequence queries User Likes URL 0 Leonard battle.net/wow/ 0 What does Sheldon like? 1 Penny tmz.com 1 2 Sheldon battle.net/wow/ 2 Who likes pages from domain wikipedia.org? 3 Penny thecheesecakefactory.com 3 4 Leonard wikipedia.org/Star_Trek 4 Other operations: range counting, 5 Sheldon wikipedia.org/String_theory 5 6 Sheldon marvel.com 6
Dynamic sequences We want to support the following operations: Insert(s, pos): insert the string s immediately before position pos Append(s): append the string s at end of the sequence (special case of Insert) Delete(pos): delete the string at position pos If data structure only supports Append, we call it append-only, otherwise dynamic (or fully dynamic)
Requirements Store the sequence in as little space as possible Close to the information-theoretic lower bound But still be able to support all the described operations (query and update) efficiently Aim for worst-case polylog operations
Some notation (foo, bar, foobar, foo, bar, bar, foo) 0 1 2 3 4 5 6 Sequence S, |S| = n In the example n = 7 String set Sset is unordered set of distinct strings appearing in S In the example, {foo, bar, foobar}, |Sset| = 3 Also called alphabet Sequence symbols can also be integers, characters, As long as they are binarized to strings
Wavelet Trees Introduced in 2003 to represent Compressed Suffix Arrays Support Access/Rank/Select on sequences on a finite alphabet (of integers) Reduces to operations on bitvectors by recursively partitioning the alphabet String sequences can be reduced to integer sequences
Wavelet Trees S = (a, b, r, a, c, a, d, a, b, r, a), Sset={a, b, c, d, r} abracadabra 00101010010 {a, b} {c, d, r} abaaaba 0100010 rcdr 1011 {d, r} rdr 101 a b c d r
Wavelet Trees Space equal to entropy of the sequence Plus negligible terms Supports Access/Rank/Select in O(log |Sset|) Later extended to support Insert/Delete but tree structure is fixed a priori String set Sset is cannot be changed! Unrealistic restriction in many database applications
The Wavelet Trie The Wavelet Trie is a Wavelet Tree on sequences of binary strings (Sset {0, 1}*) Supports Access/Rank(Prefix)/Select(Prefix) Fully dynamic or append only (with better bounds) The string set need not be known in advance
Wavelet Trie: Construction Sequence of binary strings Branching bit: : 010 : 1001011 010111 0100110 0100001 0101010 0100110 010111 010110 010110 010111 0100110 0100001 0101010 0100110 010111 0 1 110 001 110 11 010 11 10 Common prefix:
Wavelet Trie: Construction : 010 : 1001011 010111 0100110 0100001 0101010 0100110 010111 010110 : : 101 : : 1011 : : 110 : 01 : 10 : 10 : :
Wavelet Trie: Access : 010 : 010 : 1001011 : 1001011 0 1 2 3 4 5 6 010111 0100110 0100001 0101010 0100110 010111 010110 : : : 101 : 101 : : : 1011 : 1011 : : 110 : 01 : 10 : 10 Access(5) = 0101 1 1 : : Rank is similar
Wavelet Trie: Select : 010 : 1001011 : 1001011 : 010 0 1 2 3 4 5 6 010111 0100110 0100001 0101010 0100110 010111 010110 : : 101 : 101 : : : : 1011 : 1011 : : 110 : 110 : : 01 : 01 : 10 : 10 : 10 : 10 Select(0100110, 1) = 4 : : : :
Wavelet Trie: Append : 010 : 1001011 0 010111 0100110 0100001 0101010 0100110 010111 010110 010010 : : 101 : 1 : 1011 : 10 : : 110 : : 110 : 01 : 10 Insert/Delete are similar : : : 0 :
Space analysis Information-theoretic lower bound LB(S) = LT(Sset) + nH0(S) LT is the information-theoretic lower bound for storing a set of strings Static WT: LB(S) + o( n) Append-only WT: LB(S) + PT(Sset)+ o( n) PT(Sset): space taken by the Patricia Trie Fully dynamic WT: LB(S) + PT(Sset)+ O(nH0(S))
Operations time complexity Need new dynamic bitvectors to support initialization (create a bitvector 0n or 1n) Static and Append-only Wavelet Trie All supported operations in O(|s| + hs) hs is number of nodes traversed by string s Fully dynamic Wavelet Trie All supported operations in O(|s| + hs log n) Deletion may take O(| | + hs log n) where is longest string in the trie
Thanks for your attention! Questions?