Challenges and Innovations in Relational Engine Algorithms

 
Graph data in relational engine:
algorithm and architecture
 
Xilun Wu
Purdue University
undefined
 
The quest
 
Support graph data processing in relational query
engine/database. (Acyclic vs Cyclic, Recursion, “Join-Heavy”)
Why difficult? What practice made in academia?
Requires different Index, Join Algorithm, and Query Compiler.
undefined
 
Learn by example
 
We will use Triangle Counting as an example in the following
explanation.
Worst-case:
Intermediate Result:
Output:
 
undefined
 
Our practice: LMS-NPRR
 
1.  Architect [SQL] query engine using LMS (VLDB’14,
SIGMOD’18)
2.  Novel worst-case optimal [multi-way] join (PODS’12, ICDT’14,
SIGMOD’15)
3.  Specialized Data Structure (Limitation: processing graph data
requires nonuniform data structure or indexing). Is it common in
relational data processing? Is it avoidable?
undefined
 
Architect a query engine
 
1.  Architect [SQL] query engine using LMS (VLDB’14,
SIGMOD’18).
2. Core idea is: Write a query compiler as an interpreter but the
interpreting result is a program for evaluating the input query.
(This codegen is via LMS)
3. This allows recompiling of DBMS during runtime.
undefined
 
Leapfrog trie join (by LogicBlox)
 
1. Two orthogonal directions: Perform Leapfrog Join on each level
and Levels are organized as a Trie.
2. Leapfrog Join:
undefined
 
Leapfrog trie join (by LogicBlox)
 
3. Relation as a Trie
undefined
 
Leapfrog trie join (by LogicBlox)
 
4. Trie Join
undefined
 
TrieArray
undefined
 
Hurdles along the path
 
1. We implemented LFTJ in LMS for relational data. (slow)
2. Then we saw TrieArray and added. (Hard to write DS in LMS)
3. Tried TPC-H. (Not fast enough. 2x slower than Hyper for Q5)
4. We realized our query plan is not optimized but sooner we saw
EmptyHeaded. (Far ahead of us)
5. Implemented their DataLayout. (Even harder than TrieArray)
6. It turned out that LMS is not the best for code which twists data.
undefined
 
Look back to “failure”: TPC-H Q5
undefined
 
Look back to “failure”: TPC-H Q5
 
Yes. Single instance of LFTJ applies to a subgraph.
But how to partition this???
undefined
 
Look back to “failure”: TPC-H Q5
 
1. Didn’t execute query plan efficiently due to lack of early
aggregation, select push down. (VLDB’14 did well on this)
2. Optimization on query plan involving multi-way join could be
different. (EmptyHeaded has a GHD optimizer)
3. Not sure how to implement Data Structure in LMS.
undefined
 
EmptyHeaded(SIGMOD’16)
 
Main proposal
: Existing solutions need modify the DB
architecture (add layer, etc.), what if we keep the architecture
untouched but deploy novel 
Query Compiler, Join Algorithm 
and
Specific [Indexing] Data Structure
?
undefined
 
Organization
 
1.  Architecture: Query Compiler + Backend (TODO: make this
formal)
2.  Query Compiler: GHD ICDT’97
3.  Novel worst-case optimal pairwise join (PODS’12)
4.  Bitset/Intset/Hybridset (data skew is analyzed in EH paper)
undefined
 
Architecture
 
undefined
 
Architecture
 
Datalog-Like Query Input:
Datalog-Like Query Input:
TriangleAgg(;z) :- Edge(a,b),Edge(b,c),Edge(a,c),z:long<-[COUNT(*)].
TriangleAgg(;z) :- Edge(a,b),Edge(b,c),Edge(a,c),z:long<-[COUNT(*)].
undefined
 
Architecture
 
1. Transfer Query into Hypergraph
1. Transfer Query into Hypergraph
2. Enumerate all GHDs of this Hypergraph
2. Enumerate all GHDs of this Hypergraph
3. Select the optimal GHD (of smallest cost estimation)
3. Select the optimal GHD (of smallest cost estimation)
undefined
 
Architecture
 
1. Translate GHD into Cpp code
1. Translate GHD into Cpp code
2. This Cpp code calls library API
2. This Cpp code calls library API
undefined
 
Architecture
 
Library selects appropriate data layout and perform NPRR join on it
Library selects appropriate data layout and perform NPRR join on it
undefined
 
Query Compiler
 
Input: It takes a datalog-like query. Supports Aggregation and
Recursion.
Representation: Hypergraph and GHD (Generalized Hypertree
Decomposition, is similar to Relational Algebra but allows
estimation on cardinality).
Target: Find the GHD of smallest estimate cost.
undefined
 
Example: Barbell query
undefined
 
NPRR join
 
Pseudocode for NPRR Join
algorithm —>
undefined
 
NPRR join
 
Looks terrific? Don’t panic. It can be
abstracted as:
Step1: Find the intersection set on
level1.
Step2: For each element in the set,
find the intersection set on level2, but
in its subtrees.
Step3: Go one level deeper and
repeat step2 until we reach the last
level.
undefined
 
GHD Translation
 
1. Within a Node:  An instance of NPRR join.
2. Across Nodes:  Yannakakis’ seminal algorithm.
3. Recursion:  One GHD instance for each recursion.
undefined
 
Execution Backend (Data Layout)
 
Set intersection can be slow due to data skew.
Two types of set: Bitset and Uintset.
Heuristic of choosing: Range/Cardinality.
 
Question?
undefined
 
* LevelHeaded(VLDB’17)
 
Slide Note
Embed
Share

Exploring the complexity of processing graph data in relational query engines, this content delves into the challenges faced, practices adopted in academia, and innovative solutions like LMS-NPRR, trie join, and specialized data structures. It discusses the difficulties in handling acyclic vs. cyclic data, recursion, join-heavy scenarios, and the quest for optimal query processing in relational databases.

  • Relational Engine Algorithms
  • Graph Data Processing
  • Database Query
  • LMS-NPRR
  • Trie Join

Uploaded on Sep 10, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Graph data in relational engine: algorithm and architecture Xilun Wu Purdue University

  2. The quest Support graph data processing in relational query engine/database. (Acyclic vs Cyclic, Recursion, Join-Heavy ) Why difficult? What practice made in academia? Requires different Index, Join Algorithm, and Query Compiler.

  3. Learn by example We will use Triangle Counting as an example in the following explanation. Worst-case: ?(?3) Intermediate Result: ?(?? + |???|) 3 2) Output: (?

  4. Our practice: LMS-NPRR 1. Architect [SQL] query engine using LMS (VLDB 14, SIGMOD 18) 2. Novel worst-case optimal [multi-way] join (PODS 12, ICDT 14, SIGMOD 15) 3. Specialized Data Structure (Limitation: processing graph data requires nonuniform data structure or indexing). Is it common in relational data processing? Is it avoidable?

  5. Architect a query engine 1. Architect [SQL] query engine using LMS (VLDB 14, SIGMOD 18). 2. Core idea is: Write a query compiler as an interpreter but the interpreting result is a program for evaluating the input query. (This codegen is via LMS) 3. This allows recompiling of DBMS during runtime.

  6. Leapfrog trie join (by LogicBlox) 1. Two orthogonal directions: Perform Leapfrog Join on each level and Levels are organized as a Trie. 2. Leapfrog Join:

  7. Leapfrog trie join (by LogicBlox) 3. Relation as a Trie

  8. Leapfrog trie join (by LogicBlox) 4. Trie Join

  9. TrieArray

  10. Hurdles along the path 1. We implemented LFTJ in LMS for relational data. (slow) 2. Then we saw TrieArray and added. (Hard to write DS in LMS) 3. Tried TPC-H. (Not fast enough. 2x slower than Hyper for Q5) 4. We realized our query plan is not optimized but sooner we saw EmptyHeaded. (Far ahead of us) 5. Implemented their DataLayout. (Even harder than TrieArray) 6. It turned out that LMS is not the best for code which twists data.

  11. Look back to failure: TPC-H Q5

  12. Look back to failure: TPC-H Q5 Yes. Single instance of LFTJ applies to a subgraph. But how to partition this???

  13. Look back to failure: TPC-H Q5 1. Didn t execute query plan efficiently due to lack of early aggregation, select push down. (VLDB 14 did well on this) 2. Optimization on query plan involving multi-way join could be different. (EmptyHeaded has a GHD optimizer) 3. Not sure how to implement Data Structure in LMS.

  14. EmptyHeaded(SIGMOD16) Main proposal: Existing solutions need modify the DB architecture (add layer, etc.), what if we keep the architecture untouched but deploy novel Query Compiler, Join Algorithm and Specific [Indexing] Data Structure?

  15. Organization 1. Architecture: Query Compiler + Backend (TODO: make this formal) 2. Query Compiler: GHD ICDT 97 3. Novel worst-case optimal pairwise join (PODS 12) 4. Bitset/Intset/Hybridset (data skew is analyzed in EH paper)

  16. Architecture

  17. Architecture Datalog-Like Query Input: TriangleAgg(;z) :- Edge(a,b),Edge(b,c),Edge(a,c),z:long<-[COUNT(*)].

  18. Architecture 1. Transfer Query into Hypergraph 2. Enumerate all GHDs of this Hypergraph 3. Select the optimal GHD (of smallest cost estimation)

  19. Architecture 1. Translate GHD into Cpp code 2. This Cpp code calls library API

  20. Architecture Library selects appropriate data layout and perform NPRR join on it

  21. Query Compiler Input: It takes a datalog-like query. Supports Aggregation and Recursion. Representation: Hypergraph and GHD (Generalized Hypertree Decomposition, is similar to Relational Algebra but allows estimation on cardinality). Target: Find the GHD of smallest estimate cost.

  22. Example: Barbell query

  23. NPRR join Pseudocode for NPRR Join algorithm >

  24. NPRR join Looks terrific? Don t panic. It can be abstracted as: Step1: Find the intersection set on level1. Step2: For each element in the set, find the intersection set on level2, but in its subtrees. Step3: Go one level deeper and repeat step2 until we reach the last level.

  25. GHD Translation 1. Within a Node: An instance of NPRR join. 2. Across Nodes: Yannakakis seminal algorithm. 3. Recursion: One GHD instance for each recursion.

  26. Execution Backend (Data Layout) Set intersection can be slow due to data skew. Two types of set: Bitset and Uintset. Heuristic of choosing: Range/Cardinality.

  27. Question?

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#