Challenges and Innovations in Relational Engine Algorithms
Exploring the complexity of processing graph data in relational query engines, this content delves into the challenges faced, practices adopted in academia, and innovative solutions like LMS-NPRR, trie join, and specialized data structures. It discusses the difficulties in handling acyclic vs. cyclic data, recursion, join-heavy scenarios, and the quest for optimal query processing in relational databases.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Graph data in relational engine: algorithm and architecture Xilun Wu Purdue University
The quest Support graph data processing in relational query engine/database. (Acyclic vs Cyclic, Recursion, Join-Heavy ) Why difficult? What practice made in academia? Requires different Index, Join Algorithm, and Query Compiler.
Learn by example We will use Triangle Counting as an example in the following explanation. Worst-case: ?(?3) Intermediate Result: ?(?? + |???|) 3 2) Output: (?
Our practice: LMS-NPRR 1. Architect [SQL] query engine using LMS (VLDB 14, SIGMOD 18) 2. Novel worst-case optimal [multi-way] join (PODS 12, ICDT 14, SIGMOD 15) 3. Specialized Data Structure (Limitation: processing graph data requires nonuniform data structure or indexing). Is it common in relational data processing? Is it avoidable?
Architect a query engine 1. Architect [SQL] query engine using LMS (VLDB 14, SIGMOD 18). 2. Core idea is: Write a query compiler as an interpreter but the interpreting result is a program for evaluating the input query. (This codegen is via LMS) 3. This allows recompiling of DBMS during runtime.
Leapfrog trie join (by LogicBlox) 1. Two orthogonal directions: Perform Leapfrog Join on each level and Levels are organized as a Trie. 2. Leapfrog Join:
Leapfrog trie join (by LogicBlox) 3. Relation as a Trie
Leapfrog trie join (by LogicBlox) 4. Trie Join
Hurdles along the path 1. We implemented LFTJ in LMS for relational data. (slow) 2. Then we saw TrieArray and added. (Hard to write DS in LMS) 3. Tried TPC-H. (Not fast enough. 2x slower than Hyper for Q5) 4. We realized our query plan is not optimized but sooner we saw EmptyHeaded. (Far ahead of us) 5. Implemented their DataLayout. (Even harder than TrieArray) 6. It turned out that LMS is not the best for code which twists data.
Look back to failure: TPC-H Q5 Yes. Single instance of LFTJ applies to a subgraph. But how to partition this???
Look back to failure: TPC-H Q5 1. Didn t execute query plan efficiently due to lack of early aggregation, select push down. (VLDB 14 did well on this) 2. Optimization on query plan involving multi-way join could be different. (EmptyHeaded has a GHD optimizer) 3. Not sure how to implement Data Structure in LMS.
EmptyHeaded(SIGMOD16) Main proposal: Existing solutions need modify the DB architecture (add layer, etc.), what if we keep the architecture untouched but deploy novel Query Compiler, Join Algorithm and Specific [Indexing] Data Structure?
Organization 1. Architecture: Query Compiler + Backend (TODO: make this formal) 2. Query Compiler: GHD ICDT 97 3. Novel worst-case optimal pairwise join (PODS 12) 4. Bitset/Intset/Hybridset (data skew is analyzed in EH paper)
Architecture Datalog-Like Query Input: TriangleAgg(;z) :- Edge(a,b),Edge(b,c),Edge(a,c),z:long<-[COUNT(*)].
Architecture 1. Transfer Query into Hypergraph 2. Enumerate all GHDs of this Hypergraph 3. Select the optimal GHD (of smallest cost estimation)
Architecture 1. Translate GHD into Cpp code 2. This Cpp code calls library API
Architecture Library selects appropriate data layout and perform NPRR join on it
Query Compiler Input: It takes a datalog-like query. Supports Aggregation and Recursion. Representation: Hypergraph and GHD (Generalized Hypertree Decomposition, is similar to Relational Algebra but allows estimation on cardinality). Target: Find the GHD of smallest estimate cost.
NPRR join Pseudocode for NPRR Join algorithm >
NPRR join Looks terrific? Don t panic. It can be abstracted as: Step1: Find the intersection set on level1. Step2: For each element in the set, find the intersection set on level2, but in its subtrees. Step3: Go one level deeper and repeat step2 until we reach the last level.
GHD Translation 1. Within a Node: An instance of NPRR join. 2. Across Nodes: Yannakakis seminal algorithm. 3. Recursion: One GHD instance for each recursion.
Execution Backend (Data Layout) Set intersection can be slow due to data skew. Two types of set: Bitset and Uintset. Heuristic of choosing: Range/Cardinality.