Links between Convex Geometry and Join Processing
Christopher Ré
Stanford University
"Query processing is not rocket science. When you flunk out of query processing, we make you go build rockets." (Anonymous: J. Hamilton or D. DeWitt)

Warning: This is (mostly) a theory talk…
… but we (and others) are building database engines with these ideas.
Motivation: Joins!
Databases are about three things: Efficiency, Efficiency, and Efficiency.
(Diagram labels: Worst-case, Parallel, Beyond Worst-case, Geometry.)
Joins Since System R
Data: R(A,B), S(A,C), T(B,C).
Join(R,S,T) = { (a,b,c) : (a,b) in R, (a,c) in S, (b,c) in T }
Join(R,S,T) = Join(Join(R,S), T)
System R searches through pairwise joins.
For 40+ years, major commercial databases have used System R-style optimizers.
Nugget: "DBs have been asymptotically suboptimal for the last 4 decades…"
Example Queries
Data: R(A,B), S(A,C), and T(B,C).
Today: a graph whose edges are colored R, S, T; the nodes are data values.
Q1 = Join(R,S): "triples of nodes on a path of length 2 that goes via R then S."
Q2 = Join(R,S,T): "triples of nodes that form R-S-T triangles."
Background: Triangles [Alon 80, Loomis-Whitney 49]
Data: R(A,B), S(A,C), and T(B,C).
Let Q be Join(R,S,T) = "R-S-T triangles."
If R, S, T contain ≤ N tuples, how big can |Q| be?
From R(A,B) and S(A,C): |Join(R,S)| ≤ N^2 (R covers B, S covers C).
Adding T(B,C): |Join(R,S,T)| ≤ N^2.
Correct asymptotic: |Q| is in Θ(N^{3/2}).
Can we compute Q in time O(N^{3/2})?
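For intuition, here is the standard extremal instance showing the N^{3/2} bound is attained; it is not spelled out on the slide, so treat it as an added illustration.

```latex
% With K = \sqrt{N}, take R = S = T = [K] \times [K], so |R| = |S| = |T| = K^2 = N.
% Every triple (a,b,c) \in [K]^3 is an R\text{-}S\text{-}T triangle, hence
|\mathrm{Join}(R,S,T)| \;=\; K^3 \;=\; N^{3/2},
% matching the upper bound and giving |Q| \in \Theta(N^{3/2}).
```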
Pairwise Joins are Suboptimal
Data: R(A,B), S(B,C), T(A,C) with |R| = |S| = |T| = N, where [N] = {1,…,N}:
R = [N] x {1}, S = {1} x [N], T = {1} x [N].
Join(R,S) = [N] x {1} x [N], so |Join(R,S)| = N^2.
Panic! A simple modification: any pairwise join plan takes Ω(N^2). The DB is toast!
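A tiny runnable check of this instance (my own encoding; N is kept small so the intermediate result fits in memory):

```python
# The bad instance for pairwise plans: R(A,B), S(B,C), T(A,C).
N = 200
R = {(a, 1) for a in range(1, N + 1)}       # R = [N] x {1}
S = {(1, c) for c in range(1, N + 1)}       # S = {1} x [N]
T = {(1, c) for c in range(1, N + 1)}       # T = {1} x [N]

# Pairwise plan: materialize Join(R, S) first, then filter with T.
join_RS = {(a, b, c) for (a, b) in R for (b2, c) in S if b == b2}
output = {(a, b, c) for (a, b, c) in join_RS if (a, c) in T}

print(len(join_RS))   # N*N = 40000 intermediate tuples
print(len(output))    # only N = 200 output triangles
```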
Heavy versus Light
Relax! [Itai & Rodeh 78, Alon et al. 97]
It's cute, let's see it: "the heavy-light technique."
Sketch: Heavy v. Light Nodes
Goal: time O(N^{3/2}), ignoring log factors.
Call a node heavy if it has more than N^{1/2} neighbors, and let H be the set of heavy nodes.
(Case 1) If v is in H, check whether each of the N edges e in E forms a triangle with v: 2 probes per edge, each O(1) time.
Case I: in total at most 2N|H| probes. Since |H| ≤ 2N^{1/2}, the total time is O(N^{3/2}).
Case 2.
(Case 2) If v is not in H, check each pair of edges incident to v.
Case II: each light node explores d(v)^2 pairs, where d(v) is the degree of node v, and
Σ_{v in V−H} d(v)^2 ≤ N^{1/2} Σ_{v in V−H} d(v) ≤ 2N^{3/2}.
Union is linear, so we're done.
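As a concrete (and deliberately naive) sketch of the two cases above, here is a small Python version for a single undirected edge set; the talk's three-relation setting colors the edges R, S, T, which the code below ignores, so it is an illustration rather than the algorithm from any particular paper. It assumes simple edges (no self-loops).

```python
from collections import defaultdict
from math import isqrt

def triangles(E):
    """List every triangle {u, v, w} of the graph with edge set E."""
    E = {frozenset(e) for e in E}                 # undirected, deduplicated
    adj = defaultdict(set)
    for u, v in (tuple(e) for e in E):
        adj[u].add(v)
        adj[v].add(u)
    threshold = isqrt(len(E))                     # "heavy" = degree > roughly sqrt(N)
    heavy = {v for v in adj if len(adj[v]) > threshold}
    found = set()
    # Case 1: for each heavy node v, probe every edge (x, y): 2 probes each.
    for v in heavy:
        for e in E:
            x, y = tuple(e)
            if v not in e and x in adj[v] and y in adj[v]:
                found.add(frozenset((v, x, y)))
    # Case 2: for each light node v, try every pair of its neighbors: d(v)^2 work.
    for v in adj:
        if v in heavy:
            continue
        nbrs = list(adj[v])
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                if frozenset((nbrs[i], nbrs[j])) in E:
                    found.add(frozenset((v, nbrs[i], nbrs[j])))
    return found

print(triangles({(1, 2), (2, 3), (1, 3), (3, 4)}))   # {frozenset({1, 2, 3})}
```

The bookkeeping differs from the slide (sets instead of sorted indexes), but the case split is the same one whose analysis gives O(N^{3/2}) up to constants.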
How do we generalize to joins?
Fractional Hypergraph Covers
Given a hypergraph H = (V,E), a fractional edge cover is x : E → R such that x ≥ 0 and, for each v in V, Σ_{e : v in e} x(e) ≥ 1.
Example: R(A,B), S(B,C), T(A,C).
x(R) + x(T) ≥ 1 // cover for A
x(R) + x(S) ≥ 1 // cover for B
x(S) + x(T) ≥ 1 // cover for C
Feasible covers: x(R,S,T) = (1,0,1) … or … x(R,S,T) = (0.5, 0.5, 0.5).
We think of a query as a hypergraph to cover.
Size bounds [GM05, AGM08]
Fix a query Q = (V,E). Let N be a tuple of |E| positive integers. Define S(Q, N) to be the maximum output size of Q subject to |R_e| ≤ N_e.
Thm [Atserias, Grohe, Marx FOCS08]: given any fractional edge cover x of (V,E), S(Q, N) ≤ Π_{e in E} N_e^{x(e)}.
Triangle: |R| = |S| = |T| ≤ N and x(R) = x(S) = x(T) = 0.5 give the bound N^{1.5}.
One more example.
R(A,B,C), S(A,B,D), T(A,C,D), U(B,C,D)
x(R) = x(S) = x(T) = x(U) = 1/3
Output size is O(N^{4/3}).
Known since Loomis-Whitney (1940s geometers!).
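A quick check of that cover, added here for completeness: each attribute occurs in exactly three of the four relations.

```latex
% e.g. attribute A occurs in R, S, and T:
x(R) + x(S) + x(T) \;=\; \tfrac{1}{3} + \tfrac{1}{3} + \tfrac{1}{3} \;=\; 1 \;\ge\; 1,
% and similarly for B, C, D, so x is a fractional edge cover.  The AGM bound then gives
|Q| \;\le\; |R|^{1/3}\,|S|^{1/3}\,|T|^{1/3}\,|U|^{1/3} \;\le\; N^{4/3}.
```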
AGM's result.
Atserias, Grohe, and Marx (AGM) allow one to write a linear program that tightly bounds the output size of any join query.
Proof using Han/Shearer's lemma (non-constructive).
Open: compute the output in time matching the upper bound?
We would call this worst-case optimal.
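One standard way to write that linear program (my phrasing, not copied from the slide): minimize the logarithm of the bound over all fractional edge covers.

```latex
% For Q = (V, E) with size constraints |R_e| \le N_e:
\min_{x \ge 0}\ \sum_{e \in E} x(e)\,\log N_e
\quad \text{s.t.} \quad \sum_{e \,:\, v \in e} x(e) \;\ge\; 1 \quad \forall v \in V .
% If x^* is an optimal solution, the bound (tight in the worst case) is
S(Q, \mathbf{N}) \;\le\; \prod_{e \in E} N_e^{\,x^*(e)} .
```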
Ngo, Porat, Ré, and Rudra (PODS 2012)
We show AGM's fractional cover inequality is equivalent to the Bollobás-Thomason inequality from geometry.
1st algorithm for joins with optimal worst-case runtime (experts: optimal data complexity).
Algorithmic Idea: the LP is a guide to decide "heavy" v. "light" as in the previous example.
Implemented & described at ICDT14!
Todd Veldhuizen. "Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm" [ICDT14, Best Newcomer Award!]
Tidbit: faster on cyclic queries than other commercial DB optimizers… without resorting to specialized graph processing!
Much simpler proofs!
With Hung Ngo and Atri Rudra: "Skew Strikes Back: New Developments in the Theory of Join Algorithms" (SIGMOD Record 2013).
Even simpler than heavy vs. light!
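To make the attribute-at-a-time idea behind these algorithms concrete, here is a deliberately simple sketch in that spirit (my own code, not the algorithm from the survey: relations are plain lists of tuples, there are no tries and no skew handling, so as written it does not attain the worst-case-optimal runtime).

```python
def generic_join(attrs, relations):
    """attrs: global attribute order (list of names).
    relations: list of (schema, rows); schema is a tuple of attribute names,
    rows is a list of value-tuples aligned with the schema.
    Returns the natural join as a list of {attribute: value} dicts."""
    if not attrs:
        return [{}]
    a, rest = attrs[0], attrs[1:]
    # Candidate bindings for a = intersection over every relation mentioning a.
    candidates = None
    for schema, rows in relations:
        if a in schema:
            i = schema.index(a)
            vals = {row[i] for row in rows}
            candidates = vals if candidates is None else candidates & vals
    if candidates is None:
        raise ValueError(f"attribute {a} appears in no relation")
    out = []
    for v in sorted(candidates):
        # Restrict every relation to the tuples consistent with a = v.
        restricted, empty = [], False
        for schema, rows in relations:
            if a in schema:
                i = schema.index(a)
                rows = [row for row in rows if row[i] == v]
                if not rows:
                    empty = True
            restricted.append((schema, rows))
        if empty:
            continue
        for partial in generic_join(rest, restricted):
            out.append({a: v, **partial})
    return out

# The triangle query from the talk on a toy instance:
R = [(1, 2), (2, 3), (1, 3)]        # R(A,B)
S = [(1, 3), (2, 1), (1, 2)]        # S(A,C)
T = [(2, 3), (3, 1), (2, 1)]        # T(B,C)
print(generic_join(["A", "B", "C"],
                   [(("A", "B"), R), (("A", "C"), S), (("B", "C"), T)]))
# [{'A': 1, 'B': 2, 'C': 3}, {'A': 2, 'B': 3, 'C': 1}]
```

The repeated filtering is exactly where a real implementation would use sorted indexes or tries so that each intersection costs time proportional to the smaller side.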
3 Vignettes of Related Ideas

(1) Heavy versus Light
Faster detection: Alon-Yuster-Zwick; one can check whether a graph contains a 2k-length cycle in O(N^{2-1/k}) time using a heavy vs. light argument.
Systems: heavy vs. light is used in parallel database systems (e.g., Teradata).
(2) MapReduce Joins: Afrati & Ullman, EDBT 2010
Q1 = R(A,B), S(B,C), T(A,C)
Q2 = R(A,B), S(B,C)
Mappers send data to reducers via hash functions.
Goal: given p reducers, minimize communication by picking "how large" each attribute's share is.
Afrati & Ullman: solve a constrained mathematical program.
Optimal: recast as a (fractional) cover problem! The lower-bound portion uses covers! [Koutris, Suciu, and Beame PODS14]
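For the triangle query Q1 this works out to the familiar one-third shares (my reconstruction of the standard example; the numbers are not on the slide): with p reducers arranged in a p^{1/3} x p^{1/3} x p^{1/3} grid and each attribute hashed independently,

```latex
% Shares for Q_1 = R(A,B), S(B,C), T(A,C): a share of p^{1/3} per attribute.
% A tuple R(a,b) is replicated to the p^{1/3} reducers (h_A(a),\, h_B(b),\, *),
% and symmetrically for S and T, so the expected load per reducer is
O\!\left(\frac{3\,N\,p^{1/3}}{p}\right) \;=\; O\!\left(\frac{N}{p^{2/3}}\right),
% which the cover-based lower bound of [Koutris, Suciu, Beame] shows is optimal
% (up to lower-order factors).
```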
(3) Tighter Runtime Guarantees
(Diagram: databases stratified by size, N tuples, N+1 tuples, and so on, measuring the longest runtime in each stratum; example output sizes 0.5N and 0.2N.)
Yannakakis's seminal algorithm has a stronger guarantee for α-acyclic queries: O(N + OUT).
General: O(N + N^{w*} + OUT), where w* is the fractional hypertreewidth (NPRR + Yannakakis + treewidth).
Begs a question: what is the tightest guarantee that one can hope for?
Pathology of Worst-case Analysis for Joins
Worst case: the answer is huge. Usually, the answer is smaller than the database, not larger!
Worst case: one reads the entire input. Rarely, if ever, does this happen on large databases.
Worst case: insensitive to how data are stored. Data are often stored in an index or sorted, and this changes the landscape.
My not-so-secret goal: join theory even closer to practice.
We all want to go "beyond worst-case"
One of the 1st beyond worst-case analyses was by a DB theoretician: Ron Fagin. Instance optimality.
Tim Roughgarden (Stanford) has great notes on this.
Measuring complexity
Worst-case analysis (traditional CS) stratifies databases by size: databases with N tuples, databases with N+1 tuples, and so on.
Let T(A,D) be the number of steps that algorithm A takes on database D.
Let W(A,N) = sup_D T(A,D) s.t. D has N tuples.
Measure the growth of W(A,N) with N, asymptotically.
Notions of complexity: instance optimality
Let T(A,D) be the number of steps that algorithm A takes on database D.
Algorithm Opt is instance optimal if there exists a constant c such that T(Opt,D) ≤ c T(A,D) for every A in a class of algorithms and any D.
Essentially singleton boxes; much stronger…
So how do we pick a class of algorithms?
What do join algorithms do?
Famous algorithms: hash, sort-merge, index-nested loop, block-nested loop, Grace, PRISM, double-pipelined, …
Observation: algorithms are generic. They do not depend on data values, but may use data order & equality; e.g., use an index to skip many consecutive values.
Call these comparison-based algorithms.
A Nugget: "A little sorting changes the complexity landscape a lot."
Warm up: Intersection [Huang & Lin 1972]
Given two sorted lists R and S of length N, with R[1] < … < R[N] and S[1] < … < S[N].
Suppose R[j] = 2j and S[j] = 2j+1 for j = 1…N.
At each position, we have no idea what comes next; we ping-pong back and forth, taking Θ(N) time.
Warm up: Intersection [Huang & Lin 1972]
Given two sorted lists R and S of length N, with R[1] < … < R[N] and S[1] < … < S[N].
Suppose R[i] = i and S[i] = N/2 + i for i = 1…N/2.
Skip to R[N] in O(log N) time!
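To make the contrast concrete, here is a small comparison-based intersection that exploits sortedness (my own sketch: it skips ahead with plain binary search, whereas the classic algorithms use doubling, i.e. "galloping", search):

```python
from bisect import bisect_left

def intersect_sorted(R, S):
    """Intersect two strictly increasing lists, jumping ahead by binary search."""
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] == S[j]:
            out.append(R[i])
            i, j = i + 1, j + 1
        elif R[i] < S[j]:
            # Jump i forward to the first position with R[i] >= S[j].
            i = bisect_left(R, S[j], i + 1)
        else:
            j = bisect_left(S, R[i], j + 1)
    return out

# Second instance above: R = [1..N/2], S = [N/2+1..N]; a single probe of S[0]
# jumps i past the end of R, so the work to certify emptiness is tiny.
N = 16
R = list(range(1, N // 2 + 1))
S = list(range(N // 2 + 1, N + 1))
print(intersect_sorted(R, S))   # []
```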
Message: Difference in the certification
Same: the input and output sizes are the same in both instances!
Different: the work to certify that the output is empty; the running time is log(N) v. N.
In the first instance we must certify each alternation; in the second, "R[N] < S[1]" is enough.
Goal: algorithms that run in time proportional to the size of a smallest certificate.
Generalizing Certificates to Joins
To define a certificate, we need to describe:
1. How are data stored? (search trees), and
2. How are certificates encoded? (arguments).
Assume a global attribute order A1 … An; all relations are stored consistently with this order (… we can remove this …).
Think of the data in a trie
A relation R(A2,A4,A5), e.g. with tuples (1,2,4), (1,2,7), (1,3,5), (7,4,2), (10,4,1), is stored as a trie with one level per attribute (A2, then A4, then A5).
Index tuples such as R[3], R[1,2], R[2,1], R[1,1,2], R[1,2,1], and R[3,1,1] name positions in the trie; the index indicates the tuples' order.
NB: search trees capture hash tables, B+trees, and tries, up to a log N factor.
Compare elements in this dictionary order…
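A toy version of this storage (my own sketch; per the NB above, real engines would use B+trees or sorted arrays, which are equivalent up to a log factor):

```python
def build_trie(rows):
    """Store a relation as nested dicts: one level per attribute in the
    global attribute order; leaves are empty dicts."""
    root = {}
    for row in rows:
        node = root
        for value in row:
            node = node.setdefault(value, {})
    return root

# The relation R(A2, A4, A5) from the slide above:
R = [(1, 2, 4), (1, 2, 7), (1, 3, 5), (7, 4, 2), (10, 4, 1)]
print(build_trie(R))
# {1: {2: {4: {}, 7: {}}, 3: {5: {}}}, 7: {4: {2: {}}}, 10: {4: {1: {}}}}
# If the bracket indexes pick children in sorted order, R[1,2] names the
# prefix (1, 3) and R[1,2,1] the tuple (1, 3, 5).
```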
Any algorithm must certify its output
An argument is a set of propositional statements of the following forms:
1. R[i] < S[j]
2. R[i] = S[j]
3. R[i] = R[j]
Here i, j are tuples of indexes, as illustrated on the previous slide.
A certificate is an argument, cert, such that any instance that satisfies cert has the same output (up to isomorphism).
Certificate Complexity
Goal: run in time O((|cert| + Z) log N), where cert denotes a smallest certificate, Z is the size of the output, and N is the size of the data.
NB: the input size appears only under the log; O hides constants depending on Q.
Runtimes of this form are essentially instance optimal for comparison-based algorithms; Ron Fagin says "log-instance optimal."
Comments about Certificates
1. A comparison-based algorithm (all available join algorithms) takes at least |cert| steps.
2. N ≥ |cert|, where N is the input size (a strictly finer notion of complexity).
3. Certificates provide an instance-dependent measure of complexity (conditioning).
Minesweeper Algorithm (MS)
"Removing the haystack to find the needles"
Minesweeper: Key operation
Deduce: no output tuple can be in an interval.
Consider Q = R(A,B), S(A), with an index for R in (A,B) order (tuples (2,4), (3,2), (3,5)) and an index for S (value 4).
Probing the indexes yields "gaps" such as ([-∞,1], *), (=2, [-∞,3]), and (=3, [3,4]): for example, no output tuple has A = 3 and B in [3,4].
Big conceptual change: find the best way to rule out all tuples, not to find tuples.
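A minimal sketch of this key operation for Q = R(A,B), S(A), under my own simplified encoding (sorted Python containers stand in for the indexes, open intervals instead of the slide's closed ones, and the constraint data structure that the real algorithm uses to combine gaps is omitted):

```python
import bisect
from collections import defaultdict

NEG_INF, POS_INF = float("-inf"), float("inf")

def neighbors(sorted_vals, x):
    """Largest value < x and smallest value > x in sorted_vals (or ±inf)."""
    lo_i = bisect.bisect_left(sorted_vals, x)
    hi_i = bisect.bisect_right(sorted_vals, x)
    lo = sorted_vals[lo_i - 1] if lo_i > 0 else NEG_INF
    hi = sorted_vals[hi_i] if hi_i < len(sorted_vals) else POS_INF
    return lo, hi

def gaps_from_probe(R, S, a, b):
    """Probe the indexes at the candidate point (a, b) for Q = R(A,B), S(A).
    Each returned gap is a box (A-interval, B-interval), with open endpoints,
    that contains (a, b) and provably contains no output tuple.
    An empty list means (a, b) is an output tuple."""
    gaps = []
    if a not in S:                       # gap deduced from S's index
        lo, hi = neighbors(sorted(S), a)
        gaps.append(((lo, hi), (NEG_INF, POS_INF)))
    by_a = defaultdict(list)             # R's index in (A, B) order
    for x, y in R:
        by_a[x].append(y)
    if a not in by_a:                    # a is missing from R entirely
        lo, hi = neighbors(sorted(by_a), a)
        gaps.append(((lo, hi), (NEG_INF, POS_INF)))
    elif b not in by_a[a]:               # a is present, but (a, b) is not
        lo, hi = neighbors(sorted(by_a[a]), b)
        gaps.append(((a, a), (lo, hi)))
    return gaps

R = {(2, 4), (3, 2), (3, 5)}
S = {4}
print(gaps_from_probe(R, S, 3, 4))
# [((-inf, 4), (-inf, inf)), ((3, 3), (2, 5))]
# The second gap is the slide's "(=3, [3,4])": A = 3 and B strictly between 2 and 5.
```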
Picture the Output Space as a Grid
The axes are the A values and the B values. Gaps from R(A,B), e.g. ([-∞,1], *), (=2, [-∞,4]), (=3, [3,4]), and gaps from S(A), e.g. ([-∞,4], *), are boxes in this grid.
Goal: run in time proportional to the smallest cover.
The algorithm
Idea: reuse information as much as possible.
1. Pick an uncovered point, t.
2. Find all possible ways to cover t with "gaps."
3. Insert the gaps into a data structure.
Repeat until all points are covered.
Hard part: a data structure to find t efficiently.
Nugget: "The boundary for efficiency has changed from the worst case."
The Boundary
View a query as a hypergraph, e.g., R(A,B), S(A,B,C), T(B,C,D).
Want: acyclic-like properties that are closed under edge removal; α-acyclicity does not have this property.
It turns out β-acyclicity [Fagin83] is the right notion.
Certificate Dichotomy [PODS14]
Theorem [NNRR14, Certificate Dichotomy]. Given a query Q:
(1) If Q is β-acyclic, then there is some order of attributes such that MS takes O(|cert| + Z) on all instances.
(2) Assuming the 3SUM conjecture, for any β-cyclic query there is some family of instances on which any algorithm runs in time Ω(|cert|^{4/3} + Z).
O hides a log N factor.
Further Results
1. No polynomial time bound in |cert| for α-acyclic queries (under the exponential time hypothesis). α-acyclicity is the worst-case boundary; this changes the complexity landscape.
2. For Q of treewidth w, MS runs in O(|cert|^{w+1} + Z) time.
3. Fractional results for the triangle: O(|cert|^{3/2} + Z).
Some PRELIMINARY Empirical Results
7 graph datasets (100M+ edges).
(1) Real certificates can be 1000x smaller than N.
(2) On a small number of β-acyclic queries in LB, up to 230x faster and at most 5% slower than LFTJ.
Takeaway: Minesweeper might be useful. A real evaluation is underway with LB's help!
Future Directions in Joins
Compressed representations.
Column-type storage [Abadi VLDB05]
Morphing and bit weaving [Patel et al. VLDB06, VLDB2012]
Factorised DBs using covers [Olteanu et al. VLDB2012-2013]
Compression & bit-level operations for modern hardware?
Deeper understanding of data properties. (Personally obsessed!)
Bounded-degree data [A. Durand & E. Grandjean ToCL97]
Bounded expansion [Segoufin ICDT13, Kazana & Segoufin PODS13]
Bounded-treewidth data [Gottlob et al. AAAI06, Arnborg 85, Grohe ICDT99]
One join algorithm to rule them all? Stay tuned…
Convex Geometry and Joins
1. A worst-case optimal join algorithm via fractional covers.
2. Other uses of fractional covers.
3. Beyond worst-case for joins using covers and geometry.
Tool: fractional covers to bound volumes.
Conclusion
Joins are awesome: theory and practice can argue!
Geometry is key to these new algorithms.
Beyond worst-case analysis may be an opportunity for the DB community.