Challenges and Innovations in Relational Engine Algorithms

Graph data in relational engine:

algorithm and architecture

Xilun Wu

Purdue University

undefined

The quest

•

Support graph data processing in relational query

engine/database. (Acyclic vs Cyclic, Recursion, “Join-Heavy”)

•

Why difficult? What practice made in academia?

•

Requires different Index, Join Algorithm, and Query Compiler.

undefined

Learn by example

•

We will use Triangle Counting as an example in the following

explanation.

•

Worst-case:

•

Intermediate Result:

•

Output:

undefined

Our practice: LMS-NPRR

•

1.  Architect [SQL] query engine using LMS (VLDB’14,

SIGMOD’18)

•

2.  Novel worst-case optimal [multi-way] join (PODS’12, ICDT’14,

SIGMOD’15)

•

3.  Specialized Data Structure (Limitation: processing graph data

requires nonuniform data structure or indexing). Is it common in

relational data processing? Is it avoidable?

undefined

Architect a query engine

•

1.  Architect [SQL] query engine using LMS (VLDB’14,

SIGMOD’18).

•

2. Core idea is: Write a query compiler as an interpreter but the

interpreting result is a program for evaluating the input query.

(This codegen is via LMS)

•

3. This allows recompiling of DBMS during runtime.

undefined

Leapfrog trie join (by LogicBlox)

•

1. Two orthogonal directions: Perform Leapfrog Join on each level

and Levels are organized as a Trie.

•

2. Leapfrog Join:

undefined

Leapfrog trie join (by LogicBlox)

•

3. Relation as a Trie

undefined

Leapfrog trie join (by LogicBlox)

•

4. Trie Join

undefined

TrieArray

undefined

Hurdles along the path

•

1. We implemented LFTJ in LMS for relational data. (slow)

•

2. Then we saw TrieArray and added. (Hard to write DS in LMS)

•

3. Tried TPC-H. (Not fast enough. 2x slower than Hyper for Q5)

•

4. We realized our query plan is not optimized but sooner we saw

EmptyHeaded. (Far ahead of us)

•

5. Implemented their DataLayout. (Even harder than TrieArray)

•

6. It turned out that LMS is not the best for code which twists data.

undefined

Look back to “failure”: TPC-H Q5

undefined

Look back to “failure”: TPC-H Q5

Yes. Single instance of LFTJ applies to a subgraph.

But how to partition this???

undefined

Look back to “failure”: TPC-H Q5

•

1. Didn’t execute query plan efficiently due to lack of early

aggregation, select push down. (VLDB’14 did well on this)

•

2. Optimization on query plan involving multi-way join could be

different. (EmptyHeaded has a GHD optimizer)

•

3. Not sure how to implement Data Structure in LMS.

undefined

EmptyHeaded(SIGMOD’16)

•

Main proposal

: Existing solutions need modify the DB

architecture (add layer, etc.), what if we keep the architecture

untouched but deploy novel

Query Compiler, Join Algorithm

and

Specific [Indexing] Data Structure

undefined

Organization

•

1.  Architecture: Query Compiler + Backend (TODO: make this

formal)

•

2.  Query Compiler: GHD ICDT’97

•

3.  Novel worst-case optimal pairwise join (PODS’12)

•

4.  Bitset/Intset/Hybridset (data skew is analyzed in EH paper)

undefined

Architecture

undefined

Architecture

Datalog-Like Query Input:

Datalog-Like Query Input:

TriangleAgg(;z) :- Edge(a,b),Edge(b,c),Edge(a,c),z:long<-[COUNT(*)].

TriangleAgg(;z) :- Edge(a,b),Edge(b,c),Edge(a,c),z:long<-[COUNT(*)].

undefined

Architecture

1. Transfer Query into Hypergraph

1. Transfer Query into Hypergraph

2. Enumerate all GHDs of this Hypergraph

2. Enumerate all GHDs of this Hypergraph

3. Select the optimal GHD (of smallest cost estimation)

3. Select the optimal GHD (of smallest cost estimation)

undefined

Architecture

1. Translate GHD into Cpp code

1. Translate GHD into Cpp code

2. This Cpp code calls library API

2. This Cpp code calls library API

undefined

Architecture

Library selects appropriate data layout and perform NPRR join on it

Library selects appropriate data layout and perform NPRR join on it

undefined

Query Compiler

•

Input: It takes a datalog-like query. Supports Aggregation and

Recursion.

•

Representation: Hypergraph and GHD (Generalized Hypertree

Decomposition, is similar to Relational Algebra but allows

estimation on cardinality).

•

Target: Find the GHD of smallest estimate cost.

undefined

Example: Barbell query

undefined

NPRR join

•

Pseudocode for NPRR Join

algorithm —>

undefined

NPRR join

•

Looks terrific? Don’t panic. It can be

abstracted as:

•

Step1: Find the intersection set on

level1.

•

Step2: For each element in the set,

find the intersection set on level2, but

in its subtrees.

•

Step3: Go one level deeper and

repeat step2 until we reach the last

level.

undefined

GHD Translation

•

1. Within a Node:  An instance of NPRR join.

•

2. Across Nodes:  Yannakakis’ seminal algorithm.

•

3. Recursion:  One GHD instance for each recursion.

undefined

Execution Backend (Data Layout)

•

Set intersection can be slow due to data skew.

•

Two types of set: Bitset and Uintset.

•

Heuristic of choosing: Range/Cardinality.

Question?

undefined

* LevelHeaded(VLDB’17)

Slide Note

Embed Share

Download

Exploring the complexity of processing graph data in relational query engines, this content delves into the challenges faced, practices adopted in academia, and innovative solutions like LMS-NPRR, trie join, and specialized data structures. It discusses the difficulties in handling acyclic vs. cyclic data, recursion, join-heavy scenarios, and the quest for optimal query processing in relational databases.

klinto Follow

Uploaded on Sep 10, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Graph data in relational engine: algorithm and architecture Xilun Wu Purdue University

The quest Support graph data processing in relational query engine/database. (Acyclic vs Cyclic, Recursion, Join-Heavy ) Why difficult? What practice made in academia? Requires different Index, Join Algorithm, and Query Compiler.

Learn by example We will use Triangle Counting as an example in the following explanation. Worst-case: ?(?3) Intermediate Result: ?(?? + |???|) 3 2) Output: (?

Our practice: LMS-NPRR 1. Architect [SQL] query engine using LMS (VLDB 14, SIGMOD 18) 2. Novel worst-case optimal [multi-way] join (PODS 12, ICDT 14, SIGMOD 15) 3. Specialized Data Structure (Limitation: processing graph data requires nonuniform data structure or indexing). Is it common in relational data processing? Is it avoidable?

Architect a query engine 1. Architect [SQL] query engine using LMS (VLDB 14, SIGMOD 18). 2. Core idea is: Write a query compiler as an interpreter but the interpreting result is a program for evaluating the input query. (This codegen is via LMS) 3. This allows recompiling of DBMS during runtime.

Leapfrog trie join (by LogicBlox) 1. Two orthogonal directions: Perform Leapfrog Join on each level and Levels are organized as a Trie. 2. Leapfrog Join:

Leapfrog trie join (by LogicBlox) 3. Relation as a Trie

Leapfrog trie join (by LogicBlox) 4. Trie Join

TrieArray

Hurdles along the path 1. We implemented LFTJ in LMS for relational data. (slow) 2. Then we saw TrieArray and added. (Hard to write DS in LMS) 3. Tried TPC-H. (Not fast enough. 2x slower than Hyper for Q5) 4. We realized our query plan is not optimized but sooner we saw EmptyHeaded. (Far ahead of us) 5. Implemented their DataLayout. (Even harder than TrieArray) 6. It turned out that LMS is not the best for code which twists data.

Look back to failure: TPC-H Q5

Look back to failure: TPC-H Q5 Yes. Single instance of LFTJ applies to a subgraph. But how to partition this???

Look back to failure: TPC-H Q5 1. Didn t execute query plan efficiently due to lack of early aggregation, select push down. (VLDB 14 did well on this) 2. Optimization on query plan involving multi-way join could be different. (EmptyHeaded has a GHD optimizer) 3. Not sure how to implement Data Structure in LMS.

EmptyHeaded(SIGMOD16) Main proposal: Existing solutions need modify the DB architecture (add layer, etc.), what if we keep the architecture untouched but deploy novel Query Compiler, Join Algorithm and Specific [Indexing] Data Structure?

Organization 1. Architecture: Query Compiler + Backend (TODO: make this formal) 2. Query Compiler: GHD ICDT 97 3. Novel worst-case optimal pairwise join (PODS 12) 4. Bitset/Intset/Hybridset (data skew is analyzed in EH paper)

Architecture

Architecture Datalog-Like Query Input: TriangleAgg(;z) :- Edge(a,b),Edge(b,c),Edge(a,c),z:long<-[COUNT(*)].

Architecture 1. Transfer Query into Hypergraph 2. Enumerate all GHDs of this Hypergraph 3. Select the optimal GHD (of smallest cost estimation)

Architecture 1. Translate GHD into Cpp code 2. This Cpp code calls library API

Architecture Library selects appropriate data layout and perform NPRR join on it

Query Compiler Input: It takes a datalog-like query. Supports Aggregation and Recursion. Representation: Hypergraph and GHD (Generalized Hypertree Decomposition, is similar to Relational Algebra but allows estimation on cardinality). Target: Find the GHD of smallest estimate cost.

Example: Barbell query

NPRR join Pseudocode for NPRR Join algorithm >

NPRR join Looks terrific? Don t panic. It can be abstracted as: Step1: Find the intersection set on level1. Step2: For each element in the set, find the intersection set on level2, but in its subtrees. Step3: Go one level deeper and repeat step2 until we reach the last level.

GHD Translation 1. Within a Node: An instance of NPRR join. 2. Across Nodes: Yannakakis seminal algorithm. 3. Recursion: One GHD instance for each recursion.

Execution Backend (Data Layout) Set intersection can be slow due to data skew. Two types of set: Bitset and Uintset. Heuristic of choosing: Range/Cardinality.

Question?

Challenges and Innovations in Relational Engine Algorithms

Download Presentation

Presentation Transcript

Related

More Related Content