Parallel Skyline Queries in Distributed Systems

P
ARALLEL
 S
KYLINE
 Q
UERIES
Foto Afrati
Paraschos Koutris
Dan Suciu
Jeffrey Ullman
University of Washington
W
HAT
 
IS
 T
HE
 S
KYLINE
?
A d-dimensional set R
A point 
x 
dominates
 
x
’ if forall k: 
x
(k) ≤ 
x
’(k)
The 
skyline
 of R are all 
non-dominated 
points of R
domination
2
skyline
C
ONTRIBUTIONS
We design algorithms for 
Skyline Queries 
based
on two parallel models:
MP
:
 perfect
 load balancing 
[Koutris, Suciu ‘11]
GMP
:
 weaker
 load balancing 
[Afrati, Ullman 
10]
We present 3 
algorithms
 with theoretical
guarantees for:
#
 
synchronization steps
load balance
3
P
REVIOUS
 A
PPROACHES
Several efficient algorithms for skyline queries
exist in the literature
Parallel algorithms use various 
partitionings
:
Grid-based 
partitioning 
[WZFZAA ’06]
Random
 partitioning 
[CRZ ’07]
Angle-based 
space partitioning 
[VDK ’08]
Hyperplane projections 
[KYZ ’11]
Previous approaches typically require a
logarithmic
 number of communication steps: our
algorithms achieve 
1 or 2
 steps
4
M
ASSIVELY
 P
ARALLEL
 M
ODELS
P servers: R 
partitioned 
in
to R
1
,R
2
,…, R
P
n = |R|
The algorithm alternates between 
communication
and 
computation
 steps
MP
 model: each node holds O(n/P) data
GMP
 model: each node holds O(P
ε
  
* n/P) where
0 
 
ε
 < 1
ε =
 
0 
: GMP = MP
ε =
 
1 
: GMP = sequential computation in one node
5
A
N
 E
XAMPLE
How do we compute 
set intersection 
in one step
in the MP model?
Hash each value x (from R or S) to a server
Communication Phase
  send tuple R(x) to server @h(x)
  send tuple S(x) to server @h(x)
Computation Phase
  output a tuple only if it occurs twice
Intersection 
Q(x):-R(x),S(x)
6
T
HE
 B
ROADCAST
 S
TEP
In addition to regular communication steps, we
allow 
broadcast
 steps:
the data exchanged is 
independent
 of n
Known
 results:
Q(x,y)=R(x),S(x,y) can be computed in 1 MP step
iff
 a broadcast step is allowed 
[Koutris, Suciu ‘11]
Q(x,y)=R(x),S(x,y),T(y) 
can
 
not be computed in 1
MP step 
[Koutris, Suciu ‘11] 
, but can be in 1 GMP step with
ε=1/2 
[Afrati, Ullman ‘10]
7
Broadcast
Grid-based 
partitioning into 
cells
Pre-processing the cells to compute the 
relaxed
skyline
Communication
Careful 
distribution
 of the cells (with their data) to
the servers 
Computation:
Local
 computation of the skyline at each server
O
UTLINE
 
OF
 
OUR
 A
PPROACH
8
B
UCKETIZING
Partition
 R
 
into M buckets across some dimension,
such that each partition contains 
O(n/M) 
points
Equivalently, compute (M+1) 
partition points
:
  
-∞ = b
0 
, b
1
 , … , b
M
 = +∞
9
Algorithm
:
Local
: each server evenly partitions its data to M buckets
Broadcast
: servers exchange MxP partition points
Local
: each server picks every P-th value as partition point
 
bucketize across
dimension 1
 
bucketize across
dimension 2
M=P or P
1/(d-1)
C
ELLS
A 
cell
 is an intersection of buckets from all
dimensions
Every point belongs in 
exactly one 
cell
Every cell holds 
O(n/P) 
data (and 
not
 O(n/P
d
) !!)
10
In each cell, we can
keep only candidates
for skyline points
candidate
rejected
Broadcast
Grid-based 
partitioning into 
cells
Pre-processing the cells to compute the 
relaxed
skyline
Communication
Careful 
distribution
 of the cells (with their data) to
the servers 
Computation:
Local
 computation of the skyline at each server
O
UTLINE
 
OF
 
OUR
 A
PPROACH
11
C
ELLS
We are interested in the 
non-empty 
cells
Any cell that is 
strictly dominated 
by another does not
contribute to the skyline
12
no points belong in the
final skyline
strict
domination
 
domination
R
ELAXED
 S
KYLINE
 
OF
 C
ELLS
The 
relaxed skyline 
consists of the non-empty cells
that are not strictly dominated by non-empty cells
We focus on the relaxed skyline of non-empty cells
13
 
skyline
relaxed skyline
O
N
 R
ELAXED
 S
KYLINES
To compute the skyline points of a cell 
B
, we need to
compare
 with cells that:
belong in the 
relaxed skyline
weakly
 dominate 
B
 (have one common coordinate)
14
cell 
B
Broadcast
Grid-based 
partitioning into 
cells
Pre-processing the cells to compute the 
relaxed
skyline
Communication
Careful 
distribution
 of the cells (with their data) to
the servers 
Computation:
Local
 computation of the skyline at each server
O
UTLINE
 
OF
 
OUR
 A
PPROACH
15
A N
A
Ï
VE
 A
PPROACH
Try the following:
Partition
 into P buckets (M=P)
Allocate
 cells in the relaxed skyline to servers 
+
 cells
that weakly dominate them: O(n/P) data per cell
Locally
 compute the skyline points
This works if the relaxed skyline is 
small
But the relaxed skyline can have as many as
  
 
Ω
(P
d-1
)
 cells for dimension 
d
16
A 1-
STEP
 A
LGORITHM
Choose a 
coarser
 bucketization (<P buckets)
This gives a 
weak
 load-balanced
 
algorithm
 
with
maximum load of
   
O( (n/P) P
(d-2)/(d-1) 
)
ε = (
d-2
)
/(d-1) (
ε=0 
implies GMP=MP)
17
Corollary.
For 
d=2 dimensions
, we obtain a 
perfectly load balanced
algorithm for MP
A 2-
STEP
 A
LGORITHM
Step 1:
 group
 the cells in the relaxed skyline by
bucket for every dimension
18
Server 1
Server 2
Server 1
Server 2
A 2-
STEP
 A
LGORITHM
For each 
bucket
, compute the local skyline
A point is a skyline point 
iff
 it is a local skyline point in
every one of the 
d
 buckets
Step 2
: 
intersect
 the local skylines
19
 
This point is in the skyline
of the y-bucket, but not the
x-bucket
x-bucket
y-bucket
A 1-S
TEP
 A
LGORITHM
 
FOR
 3D
20
 
Key idea
: to reject this point, we only need
the minimum x-coordinate from cell 
B
 
cell 
B
A 1-S
TEP
 A
LGORITHM
 
FOR
 3D
The observation 
reduces
 the number of points that
need to be communicated
With smart partitioning, we can achieve 
perfect load-
balance
 in 
1 step
However, the property holds 
only
 for 2 and 3
dimensions
21
C
ONLUSION
3 algorithms for 
Skyline Queries
:
2 step 
+
 perfect load balance
1 step 
+
 some replication
1 step 
+
 perfect load balance for d < 4
Open Questions
Can we compute the skyline in 1 step with 
perfect
 load
balance for >3 dimensions?
A more 
general
 question: what classes of queries can
we compute in the MP model with perfect or weaker
load balance guarantees?
22
Thank you!
23
I
NTERIOR
 C
ELLS
Two cells are 
co-linear 
if they share exactly two
coordinates
A cell 
i
 is 
interior
 if every colinear cell in 
S
r
(J) 
belongs
in the same hyperplane as 
i
. Else, it is a 
corner
 cell.
Interior cells are easy to handle: we can send the
whole plane to a single processor
24
C
ORNER
 C
ELLS
We group the corner cells into 
lines
Border
 cells are the 
minimal/maximal 
cells of each line
Fact
: lines meet only on border cells
Grouping
: each line is a group, a cell is assigned to the
lexicographically first line it belongs to
25
Assigning the groups
We have two ways to assign groups to servers
The first is
 deterministic 
and greedily assigns a
group to any server that is not overloaded (M=P)
The second is 
randomized 
and sends each group
randomly to some server (M = P log P)
26
About the MP model
[KS11] 
A dichotomy result on Conjunctive
Queries that can be computed in 1 step
with perfect load balancing
Easy Queries:
Q(x,y,z) :- R(x,y) , S(y,z)
Q(x,y,z,) :- R(x), S(x,y), T(x,y,z)
Hard Queries:
Q(x,y) :- R(x), S(x,y), T(y)
Q(x,y) :- R(x), S(x), T(y)
27
Slide Note

----- Meeting Notes (3/19/12 15:48) -----

remind when switching between algorithms

why is it called a broadcast step?

Embed
Share

Explore the concept of skyline queries in parallel computing, focusing on non-dominated points in a d-dimensional set. Learn about efficient algorithms, massively parallel models, communication strategies, and the application of broadcast steps. Enhance your knowledge of skyline computation processes through detailed explanations and visual representations.

  • Parallel Computing
  • Skyline Queries
  • Distributed Systems
  • Algorithms
  • Non-Dominated Points

Uploaded on Sep 24, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. PARALLEL SKYLINE QUERIES Foto Afrati Paraschos Koutris Dan Suciu Jeffrey Ullman University of Washington

  2. WHATIS THE SKYLINE? A d-dimensional set R A point x dominates x if forall k: x(k) x (k) The skyline of R are all non-dominated points of R skyline domination 2

  3. CONTRIBUTIONS We design algorithms for Skyline Queries based on two parallel models: MP: perfect load balancing [Koutris, Suciu 11] GMP: weaker load balancing [Afrati, Ullman 10] We present 3 algorithms with theoretical guarantees for: # synchronization steps load balance 3

  4. PREVIOUS APPROACHES Several efficient algorithms for skyline queries exist in the literature Parallel algorithms use various partitionings: Grid-based partitioning [WZFZAA 06] Random partitioning [CRZ 07] Angle-based space partitioning [VDK 08] Hyperplane projections [KYZ 11] Previous approaches typically require a logarithmic number of communication steps: our algorithms achieve 1 or 2 steps 4

  5. MASSIVELY PARALLEL MODELS P servers: R partitioned into R1,R2, , RP n = |R| The algorithm alternates between communication and computation steps MP model: each node holds O(n/P) data GMP model: each node holds O(P * n/P) where 0 < 1 = 0 : GMP = MP = 1 : GMP = sequential computation in one node 5

  6. AN EXAMPLE How do we compute set intersection in one step in the MP model? Hash each value x (from R or S) to a server Intersection Q(x):-R(x),S(x) Communication Phase send tuple R(x) to server @h(x) send tuple S(x) to server @h(x) Computation Phase output a tuple only if it occurs twice 6

  7. THE BROADCAST STEP In addition to regular communication steps, we allow broadcast steps: the data exchanged is independent of n Known results: Q(x,y)=R(x),S(x,y) can be computed in 1 MP step iff a broadcast step is allowed [Koutris, Suciu 11] Q(x,y)=R(x),S(x,y),T(y) can not be computed in 1 MP step [Koutris, Suciu 11] , but can be in 1 GMP step with =1/2 [Afrati, Ullman 10] 7

  8. OUTLINEOFOUR APPROACH Broadcast Grid-based partitioning into cells Pre-processing the cells to compute the relaxed skyline Communication Careful distribution of the cells (with their data) to the servers Computation: Local computation of the skyline at each server 8

  9. Algorithm: Local: each server evenly partitions its data to M buckets Broadcast: servers exchange MxP partition points Local: each server picks every P-th value as partition point BUCKETIZING Partition Rinto M buckets across some dimension, such that each partition contains O(n/M) points Equivalently, compute (M+1) partition points: - = b0 , b1, , bM= + M=P or P1/(d-1) bucketize across dimension 1 bucketize across dimension 2 9

  10. CELLS A cell is an intersection of buckets from all dimensions Every point belongs in exactly one cell Every cell holds O(n/P) data (and not O(n/Pd) !!) In each cell, we can keep only candidates for skyline points candidate rejected 10

  11. OUTLINEOFOUR APPROACH Broadcast Grid-based partitioning into cells Pre-processing the cells to compute the relaxed skyline Communication Careful distribution of the cells (with their data) to the servers Computation: Local computation of the skyline at each server 11

  12. CELLS We are interested in the non-empty cells Any cell that is strictly dominated by another does not contribute to the skyline no points belong in the final skyline strict domination domination 12

  13. RELAXED SKYLINEOF CELLS The relaxed skyline consists of the non-empty cells that are not strictly dominated by non-empty cells We focus on the relaxed skyline of non-empty cells relaxed skyline skyline 13

  14. ON RELAXED SKYLINES To compute the skyline points of a cell B, we need to compare with cells that: belong in the relaxed skyline weakly dominate B (have one common coordinate) cell B 14

  15. OUTLINEOFOUR APPROACH Broadcast Grid-based partitioning into cells Pre-processing the cells to compute the relaxed skyline Communication Careful distribution of the cells (with their data) to the servers Computation: Local computation of the skyline at each server 15

  16. A NAVE APPROACH Try the following: Partition into P buckets (M=P) Allocate cells in the relaxed skyline to servers + cells that weakly dominate them: O(n/P) data per cell Locally compute the skyline points This works if the relaxed skyline is small But the relaxed skyline can have as many as (Pd-1) cells for dimension d 16

  17. A 1-STEP ALGORITHM Choose a coarser bucketization (<P buckets) This gives a weak load-balanced algorithm with maximum load of O( (n/P) P(d-2)/(d-1) ) = (d-2)/(d-1) ( =0 implies GMP=MP) Corollary. For d=2 dimensions, we obtain a perfectly load balanced algorithm for MP 17

  18. A 2-STEP ALGORITHM Step 1: group the cells in the relaxed skyline by bucket for every dimension Server 1 Server 2 Server 2 Server 1 18

  19. A 2-STEP ALGORITHM For each bucket, compute the local skyline A point is a skyline point iff it is a local skyline point in every one of the d buckets Step 2: intersect the local skylines This point is in the skyline of the y-bucket, but not the x-bucket x-bucket 19 y-bucket

  20. A 1-STEP ALGORITHMFOR 3D Key idea: to reject this point, we only need the minimum x-coordinate from cell B cell B 20

  21. A 1-STEP ALGORITHMFOR 3D The observation reduces the number of points that need to be communicated With smart partitioning, we can achieve perfect load- balance in 1 step However, the property holds only for 2 and 3 dimensions 21

  22. CONLUSION 3 algorithms for Skyline Queries: 2 step + perfect load balance 1 step + some replication 1 step + perfect load balance for d < 4 Open Questions Can we compute the skyline in 1 step with perfect load balance for >3 dimensions? A more general question: what classes of queries can we compute in the MP model with perfect or weaker load balance guarantees? 22

  23. Thank you! 23

  24. INTERIOR CELLS Two cells are co-linear if they share exactly two coordinates A cell i is interior if every colinear cell in Sr(J) belongs in the same hyperplane as i. Else, it is a corner cell. Interior cells are easy to handle: we can send the whole plane to a single processor 24

  25. CORNER CELLS We group the corner cells into lines Border cells are the minimal/maximal cells of each line Fact: lines meet only on border cells Grouping: each line is a group, a cell is assigned to the lexicographically first line it belongs to 25

  26. Assigning the groups We have two ways to assign groups to servers The first is deterministic and greedily assigns a group to any server that is not overloaded (M=P) The second is randomized and sends each group randomly to some server (M = P log P) 26

  27. About the MP model [KS11] A dichotomy result on Conjunctive Queries that can be computed in 1 step with perfect load balancing Easy Queries: Q(x,y,z) :- R(x,y) , S(y,z) Q(x,y,z,) :- R(x), S(x,y), T(x,y,z) Hard Queries: Q(x,y) :- R(x), S(x,y), T(y) Q(x,y) :- R(x), S(x), T(y) 27

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#