Fault-Tolerant MapReduce-MPI for HPC Clusters: Enhancing Fault Tolerance in High-Performance Computing

This research discusses the design and implementation of FT-MRMPI for HPC clusters, focusing on fault tolerance and reliability in MapReduce applications. It addresses the challenges, presents the fault tolerance model, and highlights the differences in fault tolerance between MapReduce and MPI. The study also examines scheduling restrictions and whether the detect-restart fault tolerance model is compatible with HPC schedulers.



Presentation Transcript


  1. Fault Tolerant MapReduce-MPI for HPC Clusters Yanfei Guo*, Wesley Bland+, Pavan Balaji+, Xiaobo Zhou* *Dept. of Computer Science, University of Colorado, Colorado Springs +Mathematics and Computer Science Division, Argonne National Laboratory

  2. Outline
  - Overview
  - Background
  - Challenges
  - FT-MRMPI Design
  - Checkpoint-Restart
  - Detect-Resume
  - Evaluation
  - Conclusion

  3. MapReduce on HPC Clusters
  - What MapReduce provides: write serial code and run it in parallel; reliable execution with the detect-restart fault tolerance model.
  - HPC clusters: high-performance CPU, storage, and network.
  - MapReduce on HPC clusters: high-performance big data analytics, reduced data movement between systems, MapReduce on the HPC software stack, with fault tolerance.
  - Mira vs. a typical Hadoop cluster:
      CPU:           16 1.6 GHz PPC A2 cores       | 16 2.4 GHz Intel Xeon cores
      Memory:        16 GB (1 GB/core)             | 64-128 GB (4-8 GB/core)
      Storage:       local: N/A, shared: 24 PB SAN | local: 500 GB x 8, shared: N/A
      Network:       5D Torus                      | 10/40 Gbps Ethernet
      Software env:  MPI                           | Java
      File system:   GPFS                          | HDFS
      Scheduler:     Cobalt                        | Hadoop
      MapReduce lib: MapReduce-MPI                 | Hadoop

  4. Fault Tolerance Model of MapReduce
  - Master/worker model.
  - Detect: the master monitors all workers.
  - Restart: affected tasks are rescheduled to another worker.
  [Diagram: master with scheduler, map slots, and reduce slots dispatching MapTasks and ReduceTasks to workers.]

  5. No Fault Tolerance in MPI
  - MPI (Message Passing Interface): inter-process communication through a communicator (COMM).
  - Frequent failures at large scale: MTTF = 4.2 hr (NCSA Blue Waters); MTTF < 1 hr expected in the future.
  - MPI Standard 3.1 allows a custom error handler, but there is no guarantee that all processes enter the error handler, and no way to fix a broken COMM.
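  To make the last point concrete, here is a minimal sketch (not taken from the slides) of installing a custom error handler on MPI_COMM_WORLD with standard MPI calls; even with such a handler installed, plain MPI gives no guarantee that every process enters it and no way to repair the broken communicator.

      #include <mpi.h>
      #include <cstdio>

      // Called by the MPI library when an error is raised on the communicator.
      static void comm_error_handler(MPI_Comm *comm, int *errcode, ...) {
          char msg[MPI_MAX_ERROR_STRING];
          int len = 0;
          MPI_Error_string(*errcode, msg, &len);
          std::fprintf(stderr, "MPI error caught: %s\n", msg);
          // Only local cleanup is possible here; plain MPI offers no way to
          // restore communication that involves the failed process.
      }

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);

          MPI_Errhandler handler;
          MPI_Comm_create_errhandler(comm_error_handler, &handler);
          // Replace the default MPI_ERRORS_ARE_FATAL so errors reach the handler.
          MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
          MPI_Errhandler_free(&handler);

          // ... application communication ...

          MPI_Finalize();
          return 0;
      }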

  6. Scheduling Restrictions
  - Gang scheduling: schedule all processes at the same time; preferred by HPC applications with extensive synchronization.
  - MapReduce scheduling: per-task scheduling; schedule each task as early as possible; compatible with the detect-restart fault tolerance model.
  - Resizing a running job: many platforms do not support it, and it carries a large overhead (re-queueing).
  - The detect-restart fault tolerance model is not compatible with HPC schedulers.

  7. Overall Design
  - Fault-tolerant MapReduce using MPI: reliable failure detection and propagation; a fault tolerance model compatible with HPC schedulers.
  - FT-MRMPI components in each MapReduce process: task runner, distributed master, load balancer, and failure handler, built on MPI features.
  - Traceable job interfaces.
  - Two HPC-scheduler-compatible fault tolerance models: Checkpoint-Restart and Detect-Resume.
  [Diagram: a MapReduce job running on MapReduce processes; each process hosts a task runner, distributed master, load balancer, and failure handler on top of MPI.]

  8. Task Runner
  - Delegates operations to the library through new interfaces; highly extensible.
  - Embedded tracing establishes consistent states at record-level granularity.
  [Diagram: the user program calls map(); the library reads a record, calls (*func)(), and writes the resulting KV pair.]

      int main(int narg, char** args) {
          MPI_Init(&narg, &args);
          <***snip***>
          mr->map(new WCMapper(), new WCReader(), NULL, NULL);
          mr->collate(NULL);
          mr->reduce(new WCReducer(), NULL, new WCWriter(), NULL);
          <***snip***>
      }

      template <typename K, typename V> class RecordReader;
      template <typename K, typename V> class RecordWriter;
      template <typename K, typename V> class Mapper;
      template <typename K, typename V> class Reducer;

      class WordRecordReader : public RecordReader<int, string>;

      void Mapper::map(int& key, string& value, BaseRecordWriter* out, void* param) {
          out->add(value, 1);
      }
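  As an illustration of the extensibility point above, the following hypothetical record reader follows the style of the WordRecordReader declaration on this slide; the next() method and the line-oriented parsing are assumptions made for this sketch, not the actual FT-MRMPI interface.

      #include <fstream>
      #include <string>

      // Hypothetical interface: the real FT-MRMPI RecordReader may differ.
      template <typename K, typename V>
      class RecordReader {
      public:
          virtual ~RecordReader() {}
          // Reads the next record; returns false when no records remain.
          virtual bool next(K& key, V& value) = 0;
      };

      // Reads one text line per record; the key is the line number.
      class LineRecordReader : public RecordReader<int, std::string> {
      public:
          explicit LineRecordReader(const std::string& path) : in_(path), lineno_(0) {}
          bool next(int& key, std::string& value) override {
              if (!std::getline(in_, value)) return false;
              key = lineno_++;
              return true;
          }
      private:
          std::ifstream in_;
          int lineno_;
      };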

  9. Distributed Master & Load Balancer
  - Task dispatching: a global task pool, populated at job init and on recovery.
  - Global consistent state via shuffle buffer tracing.
  - Load balancing: monitor the processing speed of tasks; linear job performance model.
  [Diagram: distributed masters in each MapReduce process share a global task pool; task runners, load balancers, and failure handlers cooperate within each process.]

  10. Fault Tolerance Model: Checkpoint-Restart
  - Custom error handler: save state and exit gracefully; propagate the failure event with MPI_Abort().
  - Checkpoint: asynchronous within a phase; saved locally; multiple granularities.
  - Restart to recover: resubmit with -recover; the job picks up from where it left off.
  [Diagram: the failed process enters the error handler and saves its state, while the other processes save and exit; map, shuffle, and reduce phases.]
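  A minimal sketch of the save-and-abort pattern described above; save_local_checkpoint() is a hypothetical placeholder for the library's checkpoint routine, and the real FT-MRMPI handler does considerably more bookkeeping.

      #include <mpi.h>

      void save_local_checkpoint();  // assumed: persists processed-record offsets and shuffle state

      // Save local state, then propagate the failure with MPI_Abort() so the
      // whole job terminates and can be resubmitted with -recover.
      static void ft_error_handler(MPI_Comm *comm, int *errcode, ...) {
          // Save whatever this process has completed so far.
          save_local_checkpoint();
          // Propagate the failure to the rest of the job; the user then
          // resubmits the job with -recover.
          MPI_Abort(*comm, *errcode);
      }

      void install_ft_handler(MPI_Comm comm) {
          MPI_Errhandler h;
          MPI_Comm_create_errhandler(ft_error_handler, &h);
          MPI_Comm_set_errhandler(comm, h);
          MPI_Errhandler_free(&h);  // the handler stays attached to the communicator
      }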

  11. Where to Write Checkpoints
  - Write to GPFS: performance issues due to small I/O; interference on shared hardware.
  - Write to node-local disk: fast, no interference, but is the checkpoint globally available during recovery?
  - Background data copier: write locally, then sync to GPFS in the background, overlapping I/O with computation.
  [Chart: job completion time for wordcount (100 GB, 256 processes, ppn=8) when writing checkpoints to local disk, to GPFS, and to local disk with the background copier.]
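  A minimal sketch of the background data copier idea, assuming checkpoints are first written to node-local disk and the GPFS target directory is known; the copy runs in a separate thread so the shared-file-system I/O overlaps with computation.

      #include <filesystem>
      #include <string>
      #include <thread>

      // Starts a background copy of a freshly written local checkpoint to GPFS.
      // Paths are illustrative assumptions, not the FT-MRMPI configuration.
      std::thread copy_checkpoint_async(const std::string& local_path,
                                        const std::string& gpfs_dir) {
          return std::thread([local_path, gpfs_dir] {
              std::filesystem::copy_file(
                  local_path,
                  gpfs_dir + "/" + std::filesystem::path(local_path).filename().string(),
                  std::filesystem::copy_options::overwrite_existing);
          });
      }

  The returned thread can be joined at a phase boundary (or at job exit) so that every local checkpoint eventually reaches GPFS without blocking the checkpoint write itself.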

  12. Recover Point
  - Recover to the last file (ft-file): less frequent checkpoints, but some work is lost and must be reprocessed on recovery.
  - Recover to the last record (ft-rec): requires fine-grained checkpoints; skips already-processed records instead of reprocessing them.
  [Charts: job recovery time for wordcount and pagerank with ft-rec vs. ft-file, broken into init, recover, runtime, and skip/reprocess.]
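  A sketch of the ft-rec recovery path, assuming the checkpoint stores how many input records the task had already consumed; on restart the reader fast-forwards past those records instead of reprocessing the whole input. The reader and handler types are illustrative, not the FT-MRMPI API.

      #include <cstdint>

      // Record-level recovery (ft-rec) sketch: skip records already covered by
      // the checkpoint, then resume normal processing from the next record.
      // Reader must provide bool next(Key&, Value&); Fn is the per-record handler.
      template <typename Reader, typename Key, typename Value, typename Fn>
      void recover_task(Reader& reader, std::uint64_t processed_in_checkpoint, Fn process) {
          Key k;
          Value v;
          std::uint64_t seen = 0;
          // Fast-forward: these records are already reflected in the checkpoint.
          while (seen < processed_in_checkpoint && reader.next(k, v)) {
              ++seen;
          }
          // From here on, the run looks like a normal (non-recovery) execution.
          while (reader.next(k, v)) {
              process(k, v);
          }
      }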

  13. Drawbacks of Checkpoint-Restart
  - Checkpoint-Restart works, but it is not perfect:
  - Large overhead from reading and writing checkpoints.
  - Requires human intervention.
  - A failure can also occur during recovery.

  14. Fault Tolerance Model: Detect-Resume
  - Detect: global knowledge of the failure; identify failed processes by comparing process groups.
  - Resume: fix the COMM by excluding failed processes; redistribute the affected tasks evenly; work-conserving vs. non-work-conserving variants.
  - User Level Failure Mitigation (ULFM): MPIX_Comm_revoke(), MPIX_Comm_shrink().
  [Diagram: the failed process triggers the error handler; surviving processes call Revoke() and Shrink() and continue through map, shuffle, and reduce.]
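  A sketch of the Detect-Resume steps using the ULFM prototype calls named on this slide; task redistribution is omitted, and the group bookkeeping shown here is an assumption about how the failed ranks could be identified, not the exact FT-MRMPI code.

      #include <mpi.h>
      #include <mpi-ext.h>   // ULFM extensions (header location may vary by MPI implementation)
      #include <vector>

      // Returns the ranks (in old_comm) that are no longer present in new_comm.
      std::vector<int> find_failed_ranks(MPI_Comm old_comm, MPI_Comm new_comm) {
          MPI_Group old_grp, new_grp, failed_grp;
          MPI_Comm_group(old_comm, &old_grp);
          MPI_Comm_group(new_comm, &new_grp);
          MPI_Group_difference(old_grp, new_grp, &failed_grp);

          int nfailed = 0;
          MPI_Group_size(failed_grp, &nfailed);
          std::vector<int> in_failed(nfailed), in_old(nfailed);
          for (int i = 0; i < nfailed; ++i) in_failed[i] = i;
          // Translate ranks of the failed group back into ranks of old_comm.
          MPI_Group_translate_ranks(failed_grp, nfailed, in_failed.data(), old_grp, in_old.data());

          MPI_Group_free(&old_grp);
          MPI_Group_free(&new_grp);
          MPI_Group_free(&failed_grp);
          return in_old;
      }

      MPI_Comm resume_after_failure(MPI_Comm comm, std::vector<int>& failed_ranks) {
          // Make sure every surviving process learns about the failure.
          MPIX_Comm_revoke(comm);
          // Build a working communicator that contains only the survivors.
          MPI_Comm new_comm;
          MPIX_Comm_shrink(comm, &new_comm);
          failed_ranks = find_failed_ranks(comm, new_comm);
          // The caller then rebalances the failed processes' tasks over new_comm.
          return new_comm;
      }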

  15. Evaluation Setup
  - LCRC Fusion cluster [1]: 256 nodes; CPU: 2-way 8-core Intel Xeon X5550; memory: 36 GB; local disk: 250 GB; network: Mellanox InfiniBand QDR.
  - Benchmarks: wordcount, BFS, pagerank, and mrmpiBLAST.
  [1] http://www.lcrc.anl.gov

  16. Job Performance
  - 10%-13% overhead from checkpointing.
  - Up to 39% shorter completion time when failures occur.
  [Charts: job completion time vs. number of processes (32-2048) for MR-MPI, checkpoint-restart, detect-resume (work-conserving), and detect-resume (non-work-conserving).]

  17. Checkpoint Overhead
  - Factors: granularity (number of records per checkpoint) and the size of the records.
  [Chart: checkpoint overhead (%) vs. number of records per checkpoint, from 1 to 10,000,000.]

  18. Time Decomposition
  - Performance with failure and recovery (wordcount, all processes together).
  - Detect-Resume has less data that needs to be recovered.
  [Charts: execution time breakdown (map, recover, aggregate, convert, reduce) for Checkpoint-Restart and Detect-Resume at 64-2048 processes.]

  19. Continuous Failures
  - Pagerank, 256 processes, randomly killing one process every 5 seconds.
  [Chart: job completion time vs. number of failures (2-128) for work-conserving, non-work-conserving, and a failure-free reference run.]

  20. Conclusion
  - First fault-tolerant MapReduce implementation in MPI: MR-MPI redesigned to provide fault tolerance, highly extensible while providing the essential features for fault tolerance.
  - Two fault tolerance models: Checkpoint-Restart and Detect-Resume.

  21. Thank you! Q & A

  22. Backup Slides

  23. Prefetching Data Copier
  - Recovering from GPFS: reading everything from GPFS makes processes wait for I/O.
  - Prefetching during recovery: move checkpoints from GPFS to local disk, overlapping I/O with computation.

  24. 2-Pass KV-KMV Conversion
  - The 4-pass conversion in MR-MPI causes excessive disk I/O during shuffle, and its hash table of keys makes it hard to take checkpoints.
  - The 2-pass conversion uses a log-structured, KMV-sketch intermediate: KV -> sketch, then sketch -> KMV.

  25. Recover Time
  - Recovery from local disk, from GPFS, and from GPFS with prefetching.
  [Chart: recover time (s) at 64-2048 processes for local, GPFS, and GPFS with prefetching.]
