Aggressive Cloning of Jobs for Effective Straggler Mitigation


Small jobs are increasingly important in big data processing systems, but are particularly sensitive to stragglers. Various straggler mitigation techniques like blacklisting and speculation have limitations when it comes to small jobs. A proposed solution is the proactive cloning of jobs to probabilistically mitigate stragglers without waiting or speculation, which could offer a feasible alternative to existing methods.


Uploaded on Oct 08, 2024



Presentation Transcript


  1. Aggressive Cloning of Jobs for Effective Straggler Mitigation. Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, Ion Stoica

  2. Small jobs increasingly important. Most jobs are small: 82% of jobs contain fewer than 10 tasks (Facebook's Hadoop cluster). Most small jobs are interactive and latency-constrained, e.g., a data analyst testing a query on a small sample. Small jobs are particularly sensitive to stragglers.

  3. Straggler Mitigation. Blacklisting: clusters periodically diagnose and eliminate machines with faulty hardware. Speculation: targets non-deterministic stragglers, since complete systemic modeling is intrinsically complex [e.g., Dean '12 at Google]; examples are LATE [OSDI '08] and Mantri [OSDI '10].

  4. Despite the mitigation techniques. LATE: the slowest task runs 8 times slower than the median task. Mantri: the slowest task runs 6 times slower than the median task (but they work well for large jobs).

  5. State-of-the-Art Straggler Mitigation. Speculative execution (in LATE, Mantri, MapReduce): 1. Wait: observe the relative progress rates of tasks. 2. Speculate: launch copies of tasks that are predicted to be stragglers.
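
The wait-then-speculate loop on this slide can be sketched as follows. This is a minimal illustration, not the actual LATE or Mantri algorithm: the function name `pick_speculation_candidates` and the `slow_fraction` threshold are assumptions introduced here.

```python
from statistics import median


def pick_speculation_candidates(progress_rates, slow_fraction=0.5):
    """Return the tasks predicted to straggle (hypothetical heuristic).

    progress_rates maps task id -> observed progress rate, gathered
    during the "wait" phase. A task is flagged when its rate falls
    below slow_fraction times the median rate (an assumed threshold).
    """
    if not progress_rates:
        return []
    threshold = slow_fraction * median(progress_rates.values())
    return sorted(t for t, rate in progress_rates.items() if rate < threshold)


# "Speculate" phase: these are the tasks that would get a copy launched.
candidates = pick_speculation_candidates({"t1": 1.0, "t2": 0.9, "t3": 0.2})
```

With only a handful of tasks the median is itself a noisy estimate, which is one way to see why prediction is statistically hard for small jobs.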

  6. Why doesn't this work for small jobs? 1. They consist of just a few tasks: it is statistically hard to predict stragglers, and one needs to wait longer to predict them accurately. 2. They run all their tasks simultaneously: waiting can constitute a considerable fraction of a small job's duration. Wait & Speculate is ill-suited to address stragglers in small jobs.

  7. Cloning Jobs. Proactively launch clones of a job just as it is submitted. Pick the result from the earliest clone. Probabilistically mitigates stragglers. Eschews waiting, speculation, and causal analysis. Is this really feasible?
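
The idea of launching clones at submission time and keeping the earliest result can be sketched with Python's standard thread pool. This is only an illustration: `run_job`, its timing parameters, and `run_with_clones` are hypothetical stand-ins, not part of any real cluster framework.

```python
import concurrent.futures as cf
import random
import time


def run_job(clone_id, base_s=0.01, straggle_prob=0.2, slowdown=8):
    """Hypothetical stand-in for one clone of a job; it sometimes straggles."""
    duration = base_s * (slowdown if random.random() < straggle_prob else 1)
    time.sleep(duration)
    return clone_id, duration


def run_with_clones(num_clones=3):
    """Launch every clone immediately and return the earliest result."""
    pool = cf.ThreadPoolExecutor(max_workers=num_clones)
    futures = [pool.submit(run_job, i) for i in range(num_clones)]
    done, not_done = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    for f in not_done:
        f.cancel()  # best effort: clones that are already running keep going
    pool.shutdown(wait=False)  # do not block on the slower clones
    return next(iter(done)).result()


winner_id, winner_duration = run_with_clones(3)
```

There is no waiting or prediction step: the job finishes as soon as any one clone does.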

  8. Low Cluster Utilization. Clusters have a median utilization of under 20%; they are provisioned for (short bursts of) peak utilization. Cluster energy-efficiency proposals are not adopted in today's clusters: peak utilization decides half the energy bill, and there are hardware and software reliability issues.

  9. Tragedy of the commons? If every job uses the lightly utilized cluster: instability and negative performance effects. Power law: 90% of jobs use 6% of resources (FB, Bing, Yahoo!); power-law exponent = 1.9. Small jobs can be cloned with few extra resources.

  10. Strawman. [Diagram: a job with map tasks M1, M2 and reduce task R1 is cloned as a whole; the earliest clone supplies the job's result.] Easy to implement. Directly extends to any framework.

  11. Number of map clones: >> 3 clones. Contention for input data by map task clones. Storage crunch: cannot increase replication.

  12. Task-level Cloning. [Diagram: each task of the job (M1, M2, R1) is cloned individually, and the earliest clone of each task is used.]

  13. 3 clones suffice. [Plot comparing Task-level Cloning with the Strawman.]
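
Why task-level cloning needs far fewer clones than the strawman can be illustrated with a small probability sketch. It assumes that every task copy straggles independently with the same probability `p`, which is a simplification introduced here; the function names are hypothetical.

```python
def job_level_straggle_prob(p, n, c):
    """Strawman: the whole job (n tasks) is cloned c times.

    A clone straggles if any of its n tasks straggles; the job
    straggles only if every one of the c clones does.
    """
    return (1 - (1 - p) ** n) ** c


def task_level_straggle_prob(p, n, c):
    """Task-level cloning: each of the n tasks is cloned c times.

    A task straggles only if all c of its copies straggle; the job
    straggles if any task does.
    """
    return 1 - (1 - p ** c) ** n


# Example: 10 tasks, 3 clones, 5% straggle probability per task copy.
strawman = job_level_straggle_prob(0.05, 10, 3)
task_level = task_level_straggle_prob(0.05, 10, 3)
```

Under these assumptions the strawman leaves the job with roughly a 6% chance of straggling, while task-level cloning brings it down to roughly 0.1%.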

  14. Dolly: Cloning Jobs. Task-level cloning of jobs. Works within a budget: a cap on the extra cluster resources for cloning.
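
A budget check of this kind might look like the sketch below. The policy shown (grant as many extra copies per task as fit in the remaining budget) is an assumption made for illustration and is not taken from the slides; all names and parameters are hypothetical.

```python
def clones_within_budget(num_tasks, desired_clones, total_slots,
                         budget_percent, clone_slots_in_use):
    """Return how many copies of each task to run (at least 1, the original).

    budget_percent caps the share of cluster slots that may be spent on
    extra copies; clone_slots_in_use is what other jobs' clones already
    occupy. Both are assumed bookkeeping values.
    """
    budget_slots = total_slots * budget_percent // 100
    remaining = budget_slots - clone_slots_in_use
    if num_tasks <= 0 or remaining <= 0:
        return 1  # no room for cloning: run the job uncloned
    # Each extra copy of the job's tasks costs num_tasks slots.
    extra_copies = min(desired_clones - 1, remaining // num_tasks)
    return 1 + max(extra_copies, 0)


# A 10-task job asking for 3 clones on a 1000-slot cluster with a 5% budget.
granted = clones_within_budget(10, 3, 1000, 5, 0)
```

Because the cost of cloning scales with the number of tasks, a fixed budget naturally favors small jobs over large ones.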

  15. Evaluation. Workload derived from Facebook traces (FB: 3,500-node Hadoop cluster, 375K jobs, 1 month). Trace-driven simulator. Baselines: LATE and Mantri, plus blacklisting. Cloning budget of 5%.

  16. Baseline: LATE. Small jobs benefit significantly! Average completion time improves by 44%.

  17. Baseline: Mantri. Small jobs benefit significantly! Average completion time improves by 42%.

  18. Intermediate Data Contention. We would like every reduce clone to get its own copy of the intermediate data (map output). Intermediate data is not replicated, to avoid overheads. What if a map clone straggles?

  19. Intermediate Data Contention. [Diagram: clones of map tasks M1 and M2 feeding clones of reduce task R1.] Wait for an exclusive copy or contend for the available copy?

  20. Conclusion. Stragglers in small jobs are not well handled by traditional mitigation strategies: guessing which task to speculate is very hard, and waiting wastes significant computation time. Dolly: proactive cloning of jobs. Power law: a small cloning budget (5%) suffices. Jobs improve by at least 42% w.r.t. state-of-the-art straggler mitigation strategies. Low utilization + power law + cloning?
