Geo-distributed Data Analytics
This publication features research by a team including Pu, Ananthanarayanan, Bodik, Kandula, Akella, Bahl, and Stoica on geo-distributed data analytics, offering valuable insights and contributions to the field. The work delves into the challenges and solutions surrounding distributed data analysis across geographic locations, providing a comprehensive understanding of the topic through innovative perspectives and methodologies.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Geo-distributed Data Analytics Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Paramvir Bahl, Ion Stoica
Global DC Footprints Apps are deployed in many DCs and edges To provide low latency access Streaming videos, search, communication (OTT) Analyze data logs in geo-distributed DCs 99thpercentile of search query latency Top-k congested ISPs (for re-routing calls)
Centralized Data Analytics Paradigm Geo-distributed Data Analytics Perf. counters User activities WAN Seattle London Beijing Berkeley Wasteful & Slow
A single logical analytics cluster across all sites WAN Seattle London Beijing Berkeley Unchanged intra-DC queries run geo-distrib.
A single logical analytics system across all sites WAN Seattle London Beijing Berkeley Incorporating WAN bandwidths is key to geo-distributed analytics performance. 5
Inter-DC WAN Bandwidths Heterogeneous and (relatively) limited Higher utilization than intra-DC bandwidths Growth is slowing down (expensive to lay) Costly to use inter-DC links Used by intermediate tasks for data transfer E.g., reduce shuffles or join operators
Geo-distrib. Query Execution Graph of dependent tasks Seattle 40GB 20GB WAN London 20GB 40GB
Incorporating WAN bandwidths Task placement Decides the destinations of network transfers Data placement Decides the sources of network transfers
Example Query [1] SELECT time_window, percentile(latency, 99) GROUP BY time_window Seattle 40GB 20GB London 20GB 40GB WAN 200 MB/s 800 MB/s
Query [1]: Impact of Task Placement Seattle40 20 London Seattle London40 20 180 180 1 1 60 60 50s 0.8 150 150 50 50 0.8 0.8 120 120 40 40 2.5x 0.6 0.6 0.5 0.5 90 90 30 30 0.4 0.4 20s 20s 20s 60 60 20 20 40GB40GB 12.5s 12.5s 12.5s 0.2 0.2 0.2 30 30 10 10 2.5s How to generally solve for any #sites, BW heterogeneity and data skew? 0 0 0 0 0 0 Upload Time (s) Download Time (s) Task Fractions Input Data (GB)
Task Placement (TP Solver) Sites M Tasks N Task 1 -> London Task 2 -> Beijing Task 5 -> London Data Matrix (MxN) Upload BWs Download BWs TP Solver Optimization Goal: Minimize the longest transfer of all links 11
Query [2]: Impact of Data Placement Seattle100 50 London Seattle London100 50 180 180 1 1 60 60 0.93 160GB 50s 50s 50s 0.8 Query Lag 150 150 50 50 0.8 0.8 2x 120 120 40 40 100GB100GB 0.6 0.6 90 90 30 30 24s 24s 24s 0.4 0.4 60 60 20 20 40GB 0.2 0.07 0.2 0.2 6.25s 6s 30 30 10 10 0 0 0 0 0 0 Upload Time (s) Download Time (s) Task Fractions Input Data (GB) How to jointly optimize data and task placement?
Iridium Jointly optimize data and task placement with greedy heuristic Approach Goal Constraints improve query response time bandwidth, query lags, etc
Iridium with Single Dataset Iterative heuristics for joint task-data placement 1, Identify bottlenecks by solving task placement TP Solver 2, assess:find amount of data to move to alleviate current bottleneck TP Solver Until query arrivals, repeat.
Iridium with Multiple Datasets Prioritize high-value datasets: score = value x urgency / cost - value = sum(timeReduction) for all queries - urgency = 1/avg(query_lag) - cost = amount of data moved
Iridium: putting together Placement of data Before query arrival prioritize the move of high-value datasets Placement of tasks During query execution: TP constrained solver See paper for: estimation of query arrivals, contention of move&query, etc Solver
Evaluation Spark 1.1.0 and HDFS 2.4.1 Override Spark s task scheduler with ours Data placement creates copies in cross-site HDFS Geo-distributed deployment across 8 regions Tokyo, Singapore, Sydney, Frankfurt, Ireland, Sao Paulo, Virginia (US) and California (US). https://github.com/Microsoft-MNR/GDA
How well does Iridium perform? Spark jobs, SQL queries and streaming queries Conviva: video sessions paramters Bing Edge: running dashboard, streaming TPC-DS: decision support queries for retail AMP BDB: mix of Hive and Spark queries Baseline: In-place : Leave data unmoved + Spark s scheduling Centralized : aggregate all data onto one site
Iridium outperforms 4x-19x vs. Centralized 3x-4x vs. In-place 19x 19x 100 Query Response Time 10x 10x 7x 7x Reduction (%) in 4x 4x 4x 4x 80 4x 4x 3x 3x 3 3x x 60 40 20 0 Big-Data Conviva Bing-Edge TPC-DS
Median Reduction (%) Task placement Data placement Iridium (both) Vs. Centralized Vs. In-place 18% 38% 75% 24% 30% 63% vs. Centralized: Data placement has higher contribution vs. In-place: Equal contributions from two techniques Iridium subsumes both baselines!
Bandwidth Cost Using inter-DC WAN has a cost $/byte #bytes sent should be low Budgeted Scheme 1. Bandwidth-optimal scheme (MinBW) 2. Budget of B * MinBW; B 1 3. Never violate the budget at any point in time
Bandwidth Cost 100 better Query Response Time 1.5*MinBW 1.3*Bmin Reduction (%) in 80 60 40 Iridium MinBW 1*MinBW 20 (64%, 19%) 0 0 20 40 60 80 Reduction (%) in WAN Usage Iridium can speed up queries while using MinBW: a scheme that minimizes bandwidth Iridium: budget the bandwidth usage to be B * MinBW near-optimal WAN bandwidth cost
Current limitations Modeling: Compute constraints, complex DAGs WAN topologies beyond congestion-free core What is the optimal placement? Bandwidth cost vs. latency tradeoff with look- ahead capability Hierarchical GDA (semi-autonomy per site) Network-aware Query Optimizer
Related work JetStream (NSDI 14) Data aggregation and adaptive filtering Does not support arbitrary queries, nor optimizes task and data placement WANalytics (CIDR 15), Geode (NSDI 15) Optimize BW usage for SQL & general DAG jobs Can lead to poor query performance time
Low Latency Geo-distributed Data Analytics Data is geographically distributed Services with global footprints Analyze logs across DCs 99 percentile movie rating Median Skype call setup latency Seattle London WAN Beijing Berkeley Abstraction: Single logical analytics cluster across all sites Incorporating WAN bandwidths Reduce response time over baselines by 3x 19x https://github.com/Microsoft-MNR/GDA